# Chapter 1: Exploratory Data Analysis

## Elements of Structured Data
- This section simply lists the types of data we can come across: continuous, discrete, categorical, binary and ordinal.
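
As a rough illustration of how these five types can map onto base R types, here is a minimal sketch. The variable names and values are made up for this example and do not come from the book.

```r
# Hypothetical example values, for illustration only
continuous  <- c(1.75, 2.5, 3.25)                 # numeric: any value in an interval
discrete    <- c(1L, 4L, 7L)                      # integer: counts
categorical <- factor(c("red", "green", "red"))   # factor: fixed set of labels
binary      <- c(TRUE, FALSE, TRUE)               # logical: only two possible values
ordinal     <- factor(c("low", "high", "medium"),
                      levels = c("low", "medium", "high"),
                      ordered = TRUE)             # ordered factor: labels with a ranking

print(class(continuous))
print(class(discrete))
print(class(categorical))
print(class(binary))
print(class(ordinal))

# Only the ordered factor supports comparisons between categories
print(ordinal[1] < ordinal[2])
```

The ordered factor is the only one of the two factor types for which `<` is meaningful, which is what separates ordinal from plain categorical data.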

## Rectangular Data
- This section presents the data format most commonly used in data science: rectangular data, a spreadsheet-like table of rows (records) and columns (features).
- R's native data.frame doesn't support hierarchical (multilevel) indexing, so external packages such as data.table and dplyr are commonly used to work around this limitation.
- They also present data structures other than the rectangular one: time series, spatial and network-based (graph) data.
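
To make the rectangular format concrete, here is a minimal base R sketch. The records are made up, and it only shows a plain data.frame with a single-level key set through `row.names`. It does not use data.table or dplyr.

```r
# Hypothetical records, for illustration only
df <- data.frame(
    name   = c("a", "b", "c"),
    score  = c(10.5, 8.0, 9.25),
    passed = c(TRUE, FALSE, TRUE),
    stringsAsFactors = FALSE
)

# Each row is a record, each column is a feature with its own type
str(df)
print(dim(df))

# A data.frame carries an implicit integer index; a single-level custom key
# can be set through row.names, but there is no native multilevel index
row.names(df) <- df$name
print(df["b", "score"])
```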

## Estimates of Location
- This section presents different ways of computing a typical value for a dataset, with the purpose of summarizing it.
- The measures covered are: mean, weighted mean, median, weighted median and trimmed mean.
- Here I use the formulas to write functions that compute these measures of central tendency.
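
For reference, these are the standard formulas for three of these measures. The notation is my own assumption rather than copied from the book: $x_1, \dots, x_n$ are the values, $w_i$ the weights, $x_{(i)}$ the $i$-th smallest value, and $p$ the number of values dropped from each end.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
\qquad
\bar{x}_{\text{trim}} = \frac{1}{n - 2p}\sum_{i=p+1}^{n-p} x_{(i)}
$$

The median has no sum formula: it is $x_{((n+1)/2)}$ when $n$ is odd, and the average of $x_{(n/2)}$ and $x_{(n/2+1)}$ when $n$ is even.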

In [67]:
# Test data
values <- c(1,2,3,4,5,6,7,8,9)
weights <- c(1,2,1,1,1,1,1,1,10)
# Mean
# Definition: sum of all values divided by the number of values
my_mean <- function(values) {
    # Accumulate in `total` so the built-in sum() is not shadowed
    total <- 0
    for (value in values){
        total <- total + value
    }
    
    return(total/length(values))
}

# Weighted Mean
# Definition: multiply each value x_i by its weight w_i, sum the products
#             and divide by the sum of the weights
my_weighted_mean <- function(values, weights){
    # Error checking: the values and weights vectors
    # must be of the same size
    stopifnot(length(values) == length(weights))
    
    sum_value <- 0
    # seq_along() is safe for empty vectors, unlike 1:length(values)
    for (i in seq_along(values)){
        sum_value <- sum_value + values[i]*weights[i]
    }
    
    sum_weight <- 0
    for (weight in weights){
        sum_weight <- sum_weight + weight
    }
    
    return (sum_value/sum_weight)
}

# Median
# Definition: the middle number of the sorted data when the vector length is odd;
#             when the length is even, the average of the two middle numbers
my_median <- function(values){
    #Sorting
    values <- sort(values)
    if(length(values) %% 2 == 0){
        mid_i_1 <- (length(values)/2)
        mid_i_2 <- (length(values)/2) + 1
        return ((values[mid_i_1] + values[mid_i_2])/2)
    }
    return (values[ceiling(length(values)/2)])
}

# Weighted Median

# Trimmed Mean
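# Definition: mean of the values that remain after sorting the data and
#             dropping a fraction trim_amount of the values from each end.
my_trimmed_mean <- function(values, trim_amount=0.3){
    # Trimming half or more from each end leaves nothing to average,
    # so fall back to the median (base R's mean(x, trim) does the same)
    if (trim_amount >= 0.5) return (my_median(values))
    
    # Sorting so that the extreme values sit at both ends
    values <- sort(values)
    
    # Calculating the number of values to reject at each end
    num_rejected <- floor(trim_amount*length(values))
    
    # Trimming the values and calculating the mean of the remainder
    start <- 1+num_rejected
    end <- length(values) - num_rejected
    return (my_mean(values[start:end])) 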
}

# Testing
print(paste("Mean: " ,toString(my_mean(values)), " Should give 5"))
print(paste("Weighted Mean: " ,toString(my_weighted_mean(values, weights)), " Should give 6.73"))
print(paste("Trimmed Mean: " ,toString(my_trimmed_mean(values, 0.5)), " Should give 5"))
print(paste("Median: " ,toString(my_median(values)), " Should give 5"))

[1] "Mean:  5  Should give 5"
[1] "Weighted Mean:  6.73684210526316  Should give 6.73"
[1] "Trimmed Mean:  5  Should give 5"
[1] "Median:  5  Should give 5"
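
The `# Weighted Median` placeholder in the cell above has no implementation. A minimal sketch follows, assuming the common "lower weighted median" convention: sort the values, accumulate their weights, and return the first value at which the cumulative weight reaches half of the total weight. The function name `my_weighted_median` is my own. Other conventions average two neighbouring values when the cumulative weight lands exactly on half the total, so results can differ in that tie case.

```r
# Same test data as in the cell above
values <- c(1,2,3,4,5,6,7,8,9)
weights <- c(1,2,1,1,1,1,1,1,10)

# Weighted Median (lower weighted median convention, an assumption of this sketch)
my_weighted_median <- function(values, weights){
    stopifnot(length(values) == length(weights), length(values) > 0)
    stopifnot(all(weights >= 0), sum(weights) > 0)

    # Sort the values and carry the weights along in the same order
    ord <- order(values)
    values <- values[ord]
    weights <- weights[ord]

    # First position where the cumulative weight reaches half of the total
    cum_weights <- cumsum(weights)
    half <- sum(weights)/2
    return (values[which(cum_weights >= half)[1]])
}

print(my_weighted_median(values, weights))
```

With these test weights the total is 19, and the cumulative weight only reaches 9.5 at the last value, so the sketch returns 9: the heavy weight on 9 pulls the weighted median all the way to the end of the data.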
