# Building Functions and Visualizing Distribution

## Functions

In this section, we learn how to create functions and why they are so important in our work. Throughout, we will work out how to program these functions and provide visualizations that help us understand what they are actually doing. The functions covered are statistical functions (mean, median, standard deviation, kurtosis, etc.). We write code to compute these statistics because computing them by hand is tedious and invites simple mistakes that lead to inaccurate findings.

## Building Functions

In Python, a function definition takes the form:

In [None]:
def function_name(object1, object2, objectn):
    <operations>
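
As a concrete instance of this template, here is a minimal sketch. The name `rectangle_area` and its parameters are made up for illustration and are not used elsewhere in this section:

```python
def rectangle_area(width, height):
    # The indented body holds the operations; return hands the result back
    area = width * height
    return area

print(rectangle_area(3, 4))  # prints 12
```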

We can start by constructing a simple function, the sum() function. We start at i = 0 (since list indices start at 0) and process the elements one at a time until we reach index n - 1, where n is the length of the list. Our for-loop will take the form:

In [None]:
list_obj = [1, 2, 3]
n = len(list_obj)
total = 0
for i in range(n):
    total += list_obj[i]

This for-loop will be used to return our sum. We will construct our function below:

In [4]:
def sum(list_obj):
    # Note: this definition replaces Python's built-in sum() for the rest of the session
    total = 0
    n = len(list_obj)
    for i in range(n):
        total += list_obj[i]
    return total

list1 = [1, 5, 12, 41, 2, 12]
list2 = [i ** 3 for i in range(4,10)]
sum_list1 = sum(list1)
sum_list2 = sum(list2)
print(sum_list1)
print(sum_list2)

73
1989


The sum function is very simple, but it is an integral part of other functions we will write.
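
The same total can be accumulated by looping over the elements directly instead of over their indices. A small sketch, using the hypothetical name `total_of` so that it does not collide with the sum() function above:

```python
def total_of(list_obj):
    total = 0
    # Iterate over the values themselves; no index bookkeeping is needed
    for value in list_obj:
        total += value
    return total

list1 = [1, 5, 12, 41, 2, 12]
list2 = [i ** 3 for i in range(4, 10)]
print(total_of(list1))  # prints 73
print(total_of(list2))  # prints 1989
```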

## Statistical Functions

Now that we have built a simple summation function, we can take things a step further and construct functions that help us describe data. We can derive a mean() function using the sum function we created above.

In [3]:
def mean(list_obj):
    n = len(list_obj)
    mean_ = sum(list_obj) / n
    return mean_

mean_list1 = mean(list1)
mean_list2 = mean(list2)
print("Mean of list one: ", mean_list1)
print("Mean of list two: ", mean_list2)

Mean of list one:  12.166666666666666
Mean of list two:  331.5


Since these two functions are now set up, we can use them as the framework for other core statistical measures:

1. median
2. mode
3. variance
4. standard deviation
5. covariance
6. correlation

These values tell us more about a dataset's shape and structure and help us draw conclusions about the dataset as a whole. We will start by constructing a mode() function.

The mode of a dataset is the observation that appears most frequently. Instead of the index-based loop used above, we will use a dictionary: each distinct value becomes a key, and the number stored under that key counts how many times the value appears. If several values tie for the highest count, the function returns all of them:

In [5]:
def mode(list_obj):
    # Count how many times each value appears
    counter_dict = {}
    for value in list_obj:
        counter_dict[value] = 0
    for value in list_obj:
        counter_dict[value] += 1

    # The highest count identifies the mode(s)
    max_count = max(counter_dict.values())

    # Keep every value that reaches the highest count, so ties are all reported
    mode_ = [key for key in counter_dict if counter_dict[key] == max_count]

    return mode_

mode_list1 = mode(list1)
mode_list2 = mode(list2)

print("The mode of list one is: ", mode_list1)
print("The mode of list two is: ", mode_list2)

The mode of list one is:  [12]
The mode of list two is:  [64, 125, 216, 343, 512, 729]
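
The counting step can also be delegated to `collections.Counter` from the standard library. This is a sketch of an alternative, not a replacement for the function above; `mode_with_counter` is a name chosen here for illustration:

```python
from collections import Counter

def mode_with_counter(values):
    # Counter maps each distinct value to the number of times it occurs
    counts = Counter(values)
    max_count = max(counts.values())
    return [value for value, count in counts.items() if count == max_count]

list1 = [1, 5, 12, 41, 2, 12]
list2 = [i ** 3 for i in range(4, 10)]
print(mode_with_counter(list1))  # prints [12]
print(mode_with_counter(list2))  # prints [64, 125, 216, 343, 512, 729]
```

Like the dictionary version, it returns every value tied for the highest count.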


The median of a dataset is the middlemost value once the data are sorted. We first check whether the length of the list is odd or even. If it is odd, the median is simply the middle element. If it is even, there is no single middle element, so we take the average of the two middle terms:

In [7]:
def median(list_obj):
    n = len(list_obj)
    list_obj = sorted(list_obj)
    
    if n % 2 != 0:
        middle_index = int((n-1)/2)
        median_ = list_obj[middle_index]
    else:
        upper_middle_index = int(n/2)
        lower_middle_index = upper_middle_index - 1
        
        median_ = mean(list_obj[lower_middle_index : upper_middle_index + 1])
        
    return median_
    
median_list1 = median(list1)
median_list2 = median(list2)
print("The median of list 1 is: ", median_list1)
print("The median of list 2 is: ", median_list2)

The median of list 1 is:  8.5
The median of list 2 is:  279.5
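
Python's standard library ships a `statistics` module whose `mean()` and `median()` can serve as a cross-check on the hand-written functions. A minimal sketch, assuming the same two lists as above (the module is imported under the alias `st` so that its functions do not overwrite ours):

```python
import statistics as st

list1 = [1, 5, 12, 41, 2, 12]
list2 = [i ** 3 for i in range(4, 10)]

# Each call should agree with the corresponding hand-written function
print(st.mean(list1))
print(st.mean(list2))    # prints 331.5
print(st.median(list1))  # prints 8.5
print(st.median(list2))  # prints 279.5
```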


## Statistics Describing Distributions

The averages we've computed so far are helpful, but they say nothing about how spread out the data are. Now we will build functions that describe the shape of a distribution and the relationship between two variables. We start with functions that calculate variance and standard deviation, treating the data as a population by default.

In [8]:
def variance(list_obj, sample = False):
    list_mean = mean(list_obj)
    n = len(list_obj)
    
    sum_sq_diff = 0
    for val in list_obj:
        sum_sq_diff += (val - list_mean) ** 2
        
    # Population variance divides by n; sample variance divides by n - 1
    if not sample:
        variance_ = sum_sq_diff / n
    else:
        variance_ = sum_sq_diff / (n-1)
        
    return variance_

variance_list1 = variance(list1)
variance_list2 = variance(list2)

print("The variance of list 1 is: ", variance_list1)
print("The variance of list 2 is: ", variance_list2)

The variance of list 1 is:  185.1388888888889
The variance of list 2 is:  53042.916666666664


The function includes an option to calculate the sample variance, which divides by n - 1 instead of n. To use it, we simply set sample=True:

In [9]:
s_variance_list1 = variance(list1, sample = True)
s_variance_list2 = variance(list2, sample = True)
print("The sample variance of list 1 is: ", s_variance_list1)
print("The sample variance of list 2 is: ", s_variance_list2)

The sample variance of list 1 is:  222.16666666666669
The sample variance of list 2 is:  63651.5
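
The two variances differ only by the factor n / (n - 1). The sketch below checks this with the standard library's `statistics.pvariance()` (population) and `statistics.variance()` (sample) rather than our own function, so it also doubles as a cross-check:

```python
import math
import statistics as st

list1 = [1, 5, 12, 41, 2, 12]
n = len(list1)

pop_var = st.pvariance(list1)    # divides by n
sample_var = st.variance(list1)  # divides by n - 1

# Rescaling the population variance recovers the sample variance
print(math.isclose(sample_var, pop_var * n / (n - 1)))  # prints True
```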


The standard deviation is calculated by taking the square root of the variance. We will do the same as before, including an option for the sample standard deviation.

In [11]:
def SD(list_obj, sample = False):
    SD_ = variance(list_obj, sample) ** (1/2)
    
    return SD_

SD_list1 = SD(list1)
SD_list2 = SD(list2)
print("The standard deviation of list 1 is: ", SD_list1)
print("The standard deviation of list 2 is: ", SD_list2)
sample_SD_list1 = SD(list1, sample = True)
sample_SD_list2 = SD(list2, sample = True)
print("The sample standard deviation of list 1 is: ", sample_SD_list1)
print("The sample standard deviation of list 2 is: ", sample_SD_list2)

The standard deviation of list 1 is:  13.606575207923886
The standard deviation of list 2 is:  230.31047884685287
The sample standard deviation of list 1 is:  14.905256343540914
The sample standard deviation of list 2 is:  252.29248898847544
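
Written out, the quantities computed by variance() and SD() are as follows, where $\bar{x}$ is the mean of the $n$ observations, $\sigma^2$ denotes the population variance and $s^2$ the sample variance (the symbols are introduced here for reference and do not appear in the code):

$$
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad
\sigma = \sqrt{\sigma^2}, \qquad
s = \sqrt{s^2}
$$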


The functions we have remaining are covariance, correlation, skewness and kurtosis. 

Covariance measures how two variables move together: it is the average product of their deviations from their respective means. To compute it, the two lists passed to the function must be of equal length:

In [16]:
def covariance(list_obj1, list_obj2, sample = False):
    mean1 = mean(list_obj1)
    mean2 = mean(list_obj2)
    
    cov = 0
    n1 = len(list_obj1)
    n2 = len(list_obj2)
    
    if n1 == n2:
        n = n1
        for i in range(n1):
            cov += (list_obj1[i] - mean1) * (list_obj2[i] - mean2)
            
        if sample == False:
            cov = cov / n
            
        else:
            cov = cov / (n - 1)
            
        return cov
    else:
        print("List lengths are not equal")
        print("Length of list 1: ", n1)
        print("Length of list 2: ", n2)
        
population_cov = covariance(list1, list2)
print("Population covariance is: ", population_cov)
sample_cov = covariance(list1, list2, sample = True)
print("Sample covariance is: ", sample_cov)

Population covariance is:  486.0833333333334
Sample covariance is:  583.3000000000001
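
When the lengths differ, covariance() above prints a message and implicitly returns None, which can surface later as a confusing error. One possible alternative, sketched here under the hypothetical name `covariance_strict`, raises a `ValueError` instead and pairs the observations with `zip`:

```python
def covariance_strict(list_obj1, list_obj2, sample=False):
    n = len(list_obj1)
    if n != len(list_obj2):
        raise ValueError("List lengths are not equal")
    mean1 = sum(list_obj1) / n
    mean2 = sum(list_obj2) / n
    # zip pairs the i-th observation of one list with the i-th of the other
    total = 0
    for x, y in zip(list_obj1, list_obj2):
        total += (x - mean1) * (y - mean2)
    return total / (n - 1) if sample else total / n

list1 = [1, 5, 12, 41, 2, 12]
list2 = [i ** 3 for i in range(4, 10)]
print(covariance_strict(list1, list2))
print(covariance_strict(list1, list2, sample=True))
```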


We transform covariance into correlation by dividing it by the product of the two standard deviations. Our correlation() function makes use of our covariance() and SD() functions:

In [17]:
def correlation(list_obj1, list_obj2):

    cov = covariance(list_obj1, list_obj2)
    SD1 = SD(list_obj1)
    SD2 = SD(list_obj2)
    corr = cov / (SD1 * SD2)
    return corr

 
corr_1_2 = correlation(list1, list2)
print("Correlation of lists 1 and 2 is: ", corr_1_2)

Correlation of lists 1 and 2 is:  0.15511300289452798
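
Because correlation divides out the scale of each variable, it always lies between -1 and 1 and is unchanged when a list is multiplied by a positive constant. The sketch below illustrates this with a compact, self-contained helper (`pearson` is a name chosen here; it uses population formulas throughout, like correlation() above):

```python
def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum([(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]) / n
    sd_x = (sum([(x - mean_x) ** 2 for x in xs]) / n) ** 0.5
    sd_y = (sum([(y - mean_y) ** 2 for y in ys]) / n) ** 0.5
    return cov / (sd_x * sd_y)

list1 = [1, 5, 12, 41, 2, 12]
scaled = [10 * x for x in list1]   # same shape, larger scale
flipped = [-x for x in list1]      # mirror image

print(pearson(list1, list1))    # approximately 1.0
print(pearson(list1, scaled))   # approximately 1.0
print(pearson(list1, flipped))  # approximately -1.0
```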


Our next function will be skewness, which measures how asymmetric a distribution is around its mean. This statistic is useful because not all distributions are normal.

Distributions can be skewed due to having either long or fat tails. If a tail is long, it contains values that are relatively far from the mean of the data. If a tail is fat, it contains more observations far from the mean than a normal distribution would predict. For example, if a distribution has a long tail on the right side but is otherwise normal, it is said to have a positive skew. The same can be said of a distribution with a fat right tail.

Skewness can be ambiguous about the shape of a distribution. If a distribution has a fat right tail and a long left tail that is not fat, the two effects can cancel so that its skewness is zero even though the distribution is asymmetric:

In [18]:
def skewness(list_obj, sample = False):
    mean_ = mean(list_obj)
    SD_ = SD(list_obj, sample)
    skew = 0
    n = len(list_obj)
    # Accumulate the cubed deviations; the scaling is applied once, after the loop
    for val in list_obj:
        skew += (val - mean_) ** 3
    # Both versions divide by the cubed standard deviation
    skew = skew / (n * SD_ ** 3) if not sample else n * skew / ((n - 1) * (n - 2) * SD_ ** 3)

    return skew


population_skew_list1 = skewness(list1, sample = False)
population_skew_list2 = skewness(list2, sample = False)
print("Population skewness of list 1 is: ", round(population_skew_list1, 4))
print("Population skewness of list 2 is: ", round(population_skew_list2, 4))
sample_skew_list1 = skewness(list1, sample = True)
sample_skew_list2 = skewness(list2, sample = True)
print("Sample skewness of list 1 is: ", round(sample_skew_list1, 4))
print("Sample skewness of list 2 is: ", round(sample_skew_list2, 4))

Population skewness of list 1 is:  1.3999
Population skewness of list 2 is:  0.5348
Sample skewness of list 1 is:  1.9169
Sample skewness of list 2 is:  0.7323

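Lastly, we will develop a function for kurtosis, which measures how much of a distribution's weight sits in its tails, that is, how heavily outliers weigh on the data. Note that the population version below returns raw kurtosis, for which a normal distribution scores 3, while the sample version subtracts a correction term and therefore reports excess kurtosis, for which a normal distribution scores about 0: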

In [19]:
def kurtosis(list_obj, sample = False):
    mean_ = mean(list_obj)
    kurt = 0
    SD_ = SD(list_obj, sample)
    n = len(list_obj)
    for x in list_obj:
        kurt += (x - mean_) ** 4
    # Population: fourth moment over SD ** 4; sample: bias-corrected excess kurtosis
    kurt = kurt / (n * SD_ ** 4) if not sample else  n * (n + 1) * kurt / \
    ((n - 1) * (n - 2) * (n - 3) * (SD_ ** 4)) - (3 *(n - 1) ** 2) / ((n - 2) * (n - 3))
    
    return kurt

population_kurt_list1 = kurtosis(list1)
population_kurt_list2 = kurtosis(list2)
print("Population kurtosis of list 1: ", round(population_kurt_list1, 4))
print("Population kurtosis of list 2: ", round(population_kurt_list2, 4))
sample_kurt_list1 = kurtosis(list1, sample = True)
sample_kurt_list2 = kurtosis(list2, sample = True)
print("Sample kurtosis of list 1: ", round(sample_kurt_list1, 4))
print("Sample kurtosis of list 2: ", round(sample_kurt_list2, 4))

Population kurtosis of list 1:  3.5011
Population kurtosis of list 2:  1.9634
Sample kurtosis of list 1:  3.9616
Sample kurtosis of list 2:  -0.5235
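
To see how a single extreme value drives kurtosis, the sketch below compares an evenly spread list with one containing an outlier. It assumes the population definition (mean fourth-power deviation divided by the squared population variance); `population_kurtosis` and both lists are illustrative:

```python
def population_kurtosis(values):
    n = len(values)
    mean_ = sum(values) / n
    variance_ = sum([(v - mean_) ** 2 for v in values]) / n
    fourth_moment = sum([(v - mean_) ** 4 for v in values]) / n
    return fourth_moment / variance_ ** 2

evenly_spread = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]   # the last value sits far out in the tail

print(population_kurtosis(evenly_spread))  # prints 1.7
print(population_kurtosis(with_outlier))   # larger, about 3.24
```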


## Using a Nested Dictionary to Organize Statistics

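We will now organize our statistics using a nested dictionary. This allows us to have all of our statistical values stored in one place. The outer keys identify each list, and each inner dictionary maps the name of a statistic to its value. We will call this stats_dict: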

In [20]:
stats_dict = {}

stats_dict[1] = {}
stats_dict[1]["list"] = list1

stats_dict[2] = {}
stats_dict[2]["list"] = list2

for key in stats_dict:  

    lst = stats_dict[key]["list"]  

    stats_dict[key]["sum"] = sum(lst)  
    stats_dict[key]["mean"] = mean(lst)  
    stats_dict[key]["median"] = median(lst)  
    stats_dict[key]["mode"] = mode(lst)  
    stats_dict[key]["variance"] = variance(lst)  
    stats_dict[key]["standard deviation"] = SD(lst)    
    stats_dict[key]["skewness"] = skewness(lst)  
    stats_dict[key]["kurtosis"] = kurtosis(lst)  
  
for key, stats in stats_dict.items():
    print("List", key)
    for name, value in stats.items():
        # Round floats to four decimals to keep the summary readable
        value = round(value, 4) if isinstance(value, float) else value
        print("  " + name + ":", value)

List 1
  list: [1, 5, 12, 41, 2, 12]
  sum: 73
  mean: 12.1667
  median: 8.5
  mode: [12]
  variance: 185.1389
  standard deviation: 13.6066
  skewness: 1.3999
  kurtosis: 3.5011
List 2
  list: [64, 125, 216, 343, 512, 729]
  sum: 1989
  mean: 331.5
  median: 279.5
  mode: [64, 125, 216, 343, 512, 729]
  variance: 53042.9167
  standard deviation: 230.3105
  skewness: 0.5348
  kurtosis: 1.9634
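
Once the statistics are stored this way, any single value can be retrieved with two keys, and one statistic can be collected across all lists with a dictionary comprehension. A minimal, self-contained sketch (it rebuilds a small version of the dictionary holding only the mean; the `means` variable is introduced here for illustration):

```python
stats_dict = {
    1: {"list": [1, 5, 12, 41, 2, 12]},
    2: {"list": [64, 125, 216, 343, 512, 729]},
}
for key in stats_dict:
    lst = stats_dict[key]["list"]
    stats_dict[key]["mean"] = sum(lst) / len(lst)

# The outer key picks the list, the inner key picks the statistic
print(stats_dict[2]["mean"])  # prints 331.5

# Gather the same statistic for every list
means = {key: stats["mean"] for key, stats in stats_dict.items()}
print(means)
```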
