# Introduction to Data Science - Homework 1
*CS 5963 / MATH 3900, University of Utah, http://datasciencecourse.net/*

Due: Friday, September 2, 11:59pm.

This homework is designed to practice the skills we learned in Lab 1: working with loops, conditions, functions, and the built-in Python data structures. Make sure to go through the lab again if you run into any trouble.

In this homework we'll do some calculations that are also available in various libraries. For the purpose of this homework, however, **stick to standard python functionality and the math library** and re-implement, e.g., the functionality for calculating the mean of a vector instead of just calling a mean function. 

However, we encourage you to check your results using, e.g., the [NumPy library](http://docs.scipy.org/doc/numpy-1.11.0/reference/routines.statistics.html) and include the checks as a separate code cell. 

## Your Data
Fill out the following information: 

*First Name:*   
*Last Name:*   
*E-mail:*   
*UID:*  


## Part 1: Vector data

We will first work with a vector of yearly average temperatures from New Haven published [here](https://vincentarelbundock.github.io/Rdatasets/datasets.html). The data is included in this repository in the file `nhtemp.csv`.

The data is stored in the CSV format, which is a simple text file with 'Comma Separated Values'.
To load the data into a (nested) python array, we use the [csv](https://docs.python.org/3/library/csv.html) library. The following code reads the file and stores it in a vector:

In [2]:
# import the csv library
import csv
# import the math library we'll use later
import math

# initialize the array
temperature_vector = []

# open the file and append the values of the last column to the array
with open('nhtemp.csv') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # remove the first item as it is the title.
    next(filereader)
    for row in filereader:
        # here we append to the array and also cast from string to float
        temperature_vector.append(float(row[2]))
        
# print the vector to see if it worked

a = 0
print (temperature_vector[a])
print(sorted(temperature_vector))

49.9
[47.9, 48.4, 48.8, 49.3, 49.3, 49.4, 49.4, 49.6, 49.8, 49.8, 49.9, 50.2, 50.2, 50.4, 50.5, 50.6, 50.6, 50.6, 50.7, 50.8, 50.8, 50.9, 50.9, 50.9, 50.9, 50.9, 51.0, 51.0, 51.1, 51.1, 51.3, 51.4, 51.4, 51.5, 51.5, 51.6, 51.6, 51.7, 51.7, 51.7, 51.7, 51.8, 51.8, 51.8, 51.9, 51.9, 51.9, 51.9, 52.0, 52.0, 52.1, 52.3, 52.6, 52.6, 52.7, 52.8, 53.0, 53.1, 54.0, 54.6]


We'll use the `temperature_vector` to calculate a couple of standard statistical measures next.
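To see what the reader hands back without needing the file, here is a minimal sketch that parses a small in-memory sample. The three data rows and the header names are made up for illustration and only assume the same layout as the file (temperature in the third column); `io.StringIO` stands in for the opened file.

```python
import csv
import io

# Made-up sample with the assumed layout: row index, year, temperature.
sample = io.StringIO(
    '"","time","value"\n'
    '"1",1912,50.0\n'
    '"2",1913,51.5\n'
    '"3",1914,49.0\n'
)

reader = csv.reader(sample, delimiter=',')
header = next(reader)   # skip the title row
rows = list(reader)     # every field arrives as a string

# cast the third column from string to float, as in the cell above
sample_vector = [float(row[2]) for row in rows]

print(header)           # ['', 'time', 'value']
print(rows[0])          # ['1', '1912', '50.0']
print(sample_vector)    # [50.0, 51.5, 49.0]
```

Every field comes back as a string, which is why the cast to `float` is needed before doing any arithmetic.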

### Task 1.1: Calculate the Mean of a Vector

Write a function that calculates and returns the [arithmetic mean](https://en.wikipedia.org/wiki/Arithmetic_mean) of a vector that you pass into it. 

Pass the temperature vector into this function and print the result. Provide a written interpretation of your results (e.g., "The mean temperature for New Haven for the years 1912 to 1971 is XXX degrees Fahrenheit.")

In [3]:
#import numpy

def mean(temp_list):
    temp_total = 0.0
    
    for temp in temp_list:
        temp_total += temp
        
    return temp_total/len(temp_list)

mean(temperature_vector)
    

51.16

**Your Interpretation:**

The mean function defined here takes each value in the list and adds it to the variable temp_total. The sum of all of these values is then divided by the number of values, determined by the length of the list. Applied to the temperature vector, it returns 51.16.
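As a quick sanity check, the hand-written loop can be compared against a library routine. This sketch uses the standard library's `statistics.mean` rather than NumPy so that it runs without extra packages; the sample values are made up.

```python
import math
import statistics

def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)

sample = [48.0, 50.0, 52.0, 54.0]   # made-up values for illustration

own_result = mean(sample)
library_result = statistics.mean(sample)

print(own_result)       # 51.0
# compare with a tolerance, since floating-point sums can differ slightly
print(math.isclose(own_result, library_result))   # True
```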

### Task 1.2: Calculate the Median of a Vector
Write a function that calculates and returns the [median](https://en.wikipedia.org/wiki/Median) of a vector. Pass the temperature vector into this function and print the result. Make sure that your function works for both vectors with an even and vectors with an odd number of elements. In case of an even number of elements, use the mean of the two middle values. Provide a written interpretation of your results.

Hint: the [`sorted()`](https://docs.python.org/3/library/functions.html#sorted) function might be helpful for this.

In [4]:
## your code goes here

def median(temp_list):
    # sorted() returns a new list, so the caller's list is left untouched
    sorted_temp = sorted(temp_list)
    middle = len(sorted_temp) // 2
    
    if len(sorted_temp) % 2 == 0:
        # even length: average the two middle values
        return (sorted_temp[middle - 1] + sorted_temp[middle]) / 2
    else: 
        # odd length: return the single middle value
        return sorted_temp[middle]


#import numpy

#numpy.median(temperature_vector)


# the call to your function
median(temperature_vector)

51.2

**Your Interpretation:**

The median function first sorts the values from least to greatest using the sorted() function. If the length of the list is even (the remainder of length divided by two is 0), the two middle values, at positions (length/2)-1 and length/2, are averaged by adding them and dividing by two. The lower position is length/2 minus one because positions are numbered starting at 0.

An example makes this clearer:

Imagine an even-length list containing the numbers 1,2,3,4,5,6:

list length = 6

The formula above picks the number 3 (at position 2) and the number 4 (at position 3). Adding them and dividing by two returns the median 3.5.

Otherwise, if the length of the list is odd, we divide the length by two and round down using the floor division operator '//'. This returns a rounded-down integer instead of a '.5' (for example, 7 // 2 = 3, not 3.5). If the list contains 1,2,3,4,5,6,7, the integer returned is 3. The value at position 3 is 4, the fourth number in the list, which is the median.
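The index arithmetic from the two examples can be checked directly. The helper names `middle_positions` and `median_from_positions` below are hypothetical and only serve this illustration.

```python
def middle_positions(n):
    """0-based position(s) of the middle element(s) in a sorted list of length n."""
    if n % 2 == 0:
        return [n // 2 - 1, n // 2]
    return [n // 2]

def median_from_positions(values):
    ordered = sorted(values)
    picks = [ordered[i] for i in middle_positions(len(ordered))]
    # one pick for odd lengths, two picks averaged for even lengths
    return sum(picks) / len(picks)

even_list = [1, 2, 3, 4, 5, 6]
odd_list = [1, 2, 3, 4, 5, 6, 7]

print(middle_positions(len(even_list)))   # [2, 3]
print(middle_positions(len(odd_list)))    # [3]
print(median_from_positions(even_list))   # 3.5
print(median_from_positions(odd_list))    # 4.0
```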
 

### Task 1.3: Calculate the Standard Deviation of a Vector

Write a function that calculates and returns the [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) of a vector. Pass the temperature vector into this function and print the result. Provide a written interpretation of your results.

The standard deviation is the square root of the average of the squared deviations from the mean, i.e.,

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} {{(x_i - \mu)}^2} }$$

where $\mu$ is the mean of the vector. Hint: use your mean function to calculate it.

Hint: the `sqrt()` function from the [`math` library](https://docs.python.org/3/library/math.html) might be helpful for this. If you use a separate file you need to load the library as we did in Part 1 to read in the data. The import looks like this:

In [5]:
import math

def standard_deviation(temp_list):
    temp_vector_mean = mean(temp_list)
    total = 0.0
    
    
    for temp in temp_list:
        total += (temp-temp_vector_mean) ** 2
        
    return math.sqrt(total/len(temp_list))

        
# the call to your function
standard_deviation(temperature_vector)

1.2550166001558176

**Your Interpretation:**

In the standard_deviation function, temp_vector_mean holds the result of the mean() function from Task 1.1. The for loop subtracts the mean from each number in the list to find the deviation. Each squared deviation is added to a variable called total.

The standard deviation is the square root of total divided by N (the length of the list).
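A small worked example helps confirm the formula. The eight values below are made up so that the arithmetic is easy to follow: the mean is 5, the squared deviations sum to 32, and 32 / 8 = 4 has the square root 2. The check uses `statistics.pstdev`, which divides by N just like the formula above; `statistics.stdev` divides by N - 1 and would give a different number.

```python
import math
import statistics

def standard_deviation(values):
    """Population standard deviation: square root of the mean squared deviation."""
    mu = sum(values) / len(values)
    total = 0.0
    for value in values:
        total += (value - mu) ** 2
    return math.sqrt(total / len(values))

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values for illustration

result = standard_deviation(sample)
print(result)   # 2.0

# pstdev divides by N, matching the formula; stdev divides by N - 1
print(math.isclose(result, statistics.pstdev(sample)))   # True
print(statistics.stdev(sample) > result)                 # True
```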


### Task 1.4: Histogram

Write a function that takes a vector and an integer `b` and calculates a [histogram](https://en.wikipedia.org/wiki/Histogram) with `b` bins. The function should return an array containing two arrays. The first should be the counts for each bin, the second should contain the borders of the bins.

For `b=5` your output should look like this: 

`[[3, 12, 33, 10, 2], [47.9, 49.24, 50.58, 51.92, 53.26, 54.6]]`

Here, the first array gives the size of these bins, the second defines the bands. I.e., the first band from 47.9-49.24 has 3 entries, the second, from 49.24-50.58 has 12 entries, etc. 

Provide a written interpretation of your results. Comment on whether the histogram is skewed, and if so, in which direction.

In [63]:
def histogram(temp_list, number_of_bins):
    
    min_temp = min(temp_list)
    max_temp = max(temp_list)
    temp_range = max_temp - min_temp
    bin_width = temp_range / number_of_bins
    
    
    borders = []
    
    for counter in range(number_of_bins + 1):
        borders.append(min_temp + counter*bin_width)
    
    occurrences = [0] * number_of_bins
    
    # a and b select the lower and upper border of the bin currently being tested
    a = 0
    b = 1
    
    # year refers to the position in the list
    year = 0
    list_length = len(temp_list)


    while year < list_length:

        if temp_list[year] >= (min_temp + (bin_width * a)) and temp_list[year] < (min_temp + bin_width * b):
            # the value falls into bin a: count it and restart at the first bin
            occurrences[a] += 1
            a = 0
            b = 1
            year += 1
        
        elif temp_list[year] == max_temp:
            # the maximum sits on the last border, so it is counted in the last bin
            occurrences[number_of_bins-1] += 1
            a = 0
            b = 1
            year += 1
            
        else:
            # not in this bin: try the next one
            a += 1
            b += 1  
                    
    return [occurrences, borders]

    
        #range(number_of_bins)



# the call to your function
histogram(temperature_vector, 5)
#print(range(5))

[[3, 12, 33, 10, 2], [47.9, 49.24, 50.58, 51.92, 53.26, 54.6]]

**Your interpretation:** 

This function takes as parameters a list of data and the number of equal-width intervals (or "bins") you want to be displayed in your histogram. 

I defined the important variables first and then wrote the code:

- min_temp: uses the min() function to return the minimum temperature value in the list
- max_temp: uses the max() function ...
- temp_range: max_temp minus min_temp, the spread of the data
- bin_width: the entire spread of the data divided by the number of bins (for example, if the spread were 10 and the number of bins were 5, then the width of each bin would be 2)


I wrote the code for the borders first, even though they are the second list returned, because once the borders are defined you can count how many data points fall within each bin.

borders:

The for loop runs over range(number_of_bins + 1) because n+1 borders make n bins. The loop variable counter is the number of bin_widths added to min_temp. The counter starts at 0, so zero bin_widths are added to min_temp for the first border. Each later border adds one more bin_width.


That creates (n+1) border values, starting at min_temp + (bin_width * 0) and ending at min_temp + (bin_width * 5) when there are five bins.


occurrences:

The occurrences list counts how many temperatures fall in each bin. To do so, I created a list called occurrences that is as long as the number of bins. Then I created a while loop that steps through the positions of temp_list (the variable called year). If the value at position year is not within the range of the current bin, the loop moves on to the borders of the next bin. Once the value finds the right bin, that bin's count is incremented by one, the "year" position is incremented by one, and the search restarts at the first bin for the next value.
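The bin search described above can also be replaced by computing the bin index directly from the value. The following is an alternative sketch, not the approach used in the cell above; the name `histogram_by_index` is hypothetical. It assumes the values are not all identical (otherwise the bin width is zero), and floating-point rounding could place a value that lies almost exactly on a border in the neighbouring bin.

```python
def histogram_by_index(values, number_of_bins):
    low = min(values)
    high = max(values)
    bin_width = (high - low) / number_of_bins   # assumes high > low

    # n + 1 borders enclose n bins
    borders = [low + i * bin_width for i in range(number_of_bins + 1)]

    counts = [0] * number_of_bins
    for value in values:
        # number of whole bin widths between the minimum and the value
        index = int((value - low) / bin_width)
        # the maximum would land one past the last bin, so clamp it
        index = min(index, number_of_bins - 1)
        counts[index] += 1

    return [counts, borders]

sample = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10]   # made-up values for illustration
print(histogram_by_index(sample, 5))
# [[2, 2, 2, 2, 2], [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]]
```

This visits each value once instead of testing it against one bin after another.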

### Task 1.5: Filtering
Write a function that takes a vector and returns a vector that contains every other element of the original vector. Print the result of the function as applied to the temperature vector.

Hint: slicing might be helpful here.

In [37]:
def skip(a):
    
    every_other = []
    for counter in range(len(a)):
        if counter %2 == 0:
            every_other.append(a[counter])
            
    return every_other

skip(temperature_vector)

[49.9,
 49.4,
 49.4,
 49.8,
 49.3,
 50.8,
 49.3,
 48.4,
 50.9,
 51.5,
 51.8,
 49.8,
 50.4,
 51.8,
 48.8,
 51.0,
 51.7,
 52.1,
 51.0,
 51.4,
 53.1,
 52.0,
 50.9,
 50.2,
 51.6,
 50.5,
 51.7,
 51.7,
 51.9,
 51.9]
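The slicing hint from the task gives a shorter route to the same result. A minimal sketch, with the hypothetical name `skip_with_slice` and made-up sample values:

```python
def skip_with_slice(values):
    # [start:stop:step] with step 2 keeps positions 0, 2, 4, ...
    return values[::2]

sample = [10, 11, 12, 13, 14]
print(skip_with_slice(sample))   # [10, 12, 14]
print(sample[1::2])              # [11, 13], the elements that were skipped
```

The slice returns a new list, so the original vector is left unchanged.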

## Part 2: Working with Matrices

For the second part of the homework, we are going to work with matrices. The [dataset we will use](https://www.wunderground.com/history/airport/KSLC/2015/1/1/CustomHistory.html?dayend=31&monthend=12&yearend=2015&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=) contains different properties of the weather in Salt Lake City for 2015 (temperature, humidity, sea level, ...). It is stored in the file [`SLC_2015.csv`](SLC_2015.csv) in this repository.

We first read the data from the file and store it in a nested python array (`temperature_matrix`). A nested python array is an array where each element is an array itself. Here is a simple example: 

In [None]:
arr1 = [1,2,3]
arr2 = ['a', 'b', 'c']

nestedArr = [arr1, arr2]
nestedArr
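
Reading from a nested array takes one index per level. A short sketch using the same values as the example above (the variable names are only for illustration):

```python
nested = [[1, 2, 3], ['a', 'b', 'c']]

first_row = nested[0]       # the whole inner list
letter = nested[1][2]       # second inner list, third element
row_lengths = [len(row) for row in nested]

print(first_row)     # [1, 2, 3]
print(letter)        # c
print(row_lengths)   # [3, 3]
```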

We provide you with the data import code, which will write the data into the nested list `temperature_matrix`. The list contains one list for each month, which, in turn, contains the mean temperature of every day of that month. 

In [38]:
# initialize the 12 arrays for the months
temperature_matrix = [[] for i in range(12)]

# open the file and append each day's mean temperature to the array of its month
with open('SLC_2015.csv') as csvfile:
    filereader = csv.reader(csvfile, delimiter=',', quotechar='|')
    # get rid of the header
    next(filereader)
    for row in filereader:
        month = int(row[0].split('/')[0])
        mean_temp = int(row[2])
        temperature_matrix[month-1].append(mean_temp)

print(temperature_matrix)

# the mean temperature on August 23. Note the index offset:
print("Mean temp on August 23: " + str(temperature_matrix[7][22]))


[[15, 19, 26, 28, 37, 38, 38, 36, 35, 31, 39, 36, 35, 30, 31, 31, 37, 44, 40, 35, 31, 31, 31, 33, 42, 41, 44, 42, 36, 40, 39], [39, 49, 50, 50, 53, 57, 60, 53, 55, 45, 43, 47, 46, 48, 43, 40, 38, 44, 47, 44, 39, 33, 31, 35, 44, 35, 37, 36], [40, 37, 34, 33, 39, 43, 45, 45, 46, 50, 54, 50, 51, 56, 62, 63, 61, 53, 47, 53, 57, 54, 52, 47, 42, 48, 56, 62, 53, 57, 63], [46, 44, 44, 54, 60, 50, 52, 46, 49, 53, 58, 50, 57, 56, 33, 44, 50, 54, 56, 56, 60, 61, 61, 59, 51, 46, 50, 57, 65, 63], [63, 71, 68, 67, 62, 59, 58, 57, 49, 53, 59, 68, 65, 65, 53, 48, 56, 58, 55, 59, 58, 58, 55, 57, 62, 59, 61, 61, 64, 71, 76], [80, 68, 69, 68, 69, 70, 66, 73, 77, 78, 72, 74, 75, 76, 81, 77, 78, 83, 83, 78, 81, 78, 78, 83, 82, 84, 87, 88, 91, 89], [87, 87, 87, 89, 79, 79, 76, 75, 73, 72, 77, 79, 81, 77, 80, 80, 79, 74, 74, 73, 76, 77, 75, 78, 78, 84, 77, 66, 70, 76, 79], [80, 79, 69, 76, 82, 74, 76, 69, 72, 79, 83, 81, 83, 88, 83, 79, 77, 72, 74, 76, 81, 74, 76, 84, 85, 78, 77, 80, 85, 82, 75], [82, 83, 82

We will now use the nested array `temperature_matrix` to compute the same metrics as in Part 1.

**Note:** Since the lists in the matrix are of varying lengths (28 to 31 days) many of the standard NumPy functions won't work.

### Task 2.1: Calculate the mean of a whole matrix

Write a function that calculates the mean of a matrix. For this version calculate the mean over all elements in the matrix as if it were one large vector. 
Pass in the matrix with the weather data and return the result. Provide a written interpretation of your results.
Can you use your function from Part 1 and get a valid result?

In [48]:
def mean_matrix(temp_matrix):
    
    flat_list = []
       
    # flatten the matrix into one long list
    for temp_list in temp_matrix:
        for temp in temp_list:
            flat_list.append(temp)        
        
    # reuse the mean function from Part 1
    return mean(flat_list)
    

# the call to your function
mean_matrix(temperature_matrix)

56.76712328767123

**Your Interpretation:** 

This function iterates through the rows of a matrix and appends all of the values into one long list. Then it adds up all the values, divides them by how many values there are and returns the mean.
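Flattening first matters because the rows have different lengths: averaging the row means would weight a short row as heavily as a long one. The sketch below uses a made-up ragged matrix and a hypothetical `flatten` helper to show the difference.

```python
def flatten(matrix):
    # nested comprehension: outer loop over rows, inner loop over values
    return [value for row in matrix for value in row]

def mean(values):
    return sum(values) / len(values)

ragged = [[1, 2, 3], [10]]   # made-up rows of unequal length

overall = mean(flatten(ragged))                     # (1 + 2 + 3 + 10) / 4
mean_of_means = mean([mean(row) for row in ragged]) # (2.0 + 10.0) / 2

print(flatten(ragged))   # [1, 2, 3, 10]
print(overall)           # 4.0
print(mean_of_means)     # 6.0
```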

### Task 2.2:  Calculate the mean of each vector of a matrix

Write a function that calculates the mean temperature of each month and returns an array with the means for each column. Provide a written interpretation of your results. Can you use the function you implemented in Part 1 here efficiently? If so, use it.

In [53]:
def mean_matrix_columns(temp_matrix):
    
    mean_values = []   
        
    # one mean per month, reusing the mean function from Part 1
    for temp_list in temp_matrix:
        mean_values.append(mean(temp_list))
    
    return mean_values
    

# the call to your function
mean_matrix_columns(temperature_matrix)

[34.54838709677419,
 44.32142857142857,
 50.096774193548384,
 52.833333333333336,
 60.483870967741936,
 77.86666666666666,
 77.87096774193549,
 78.35483870967742,
 71.43333333333334,
 61.16129032258065,
 39.96666666666667,
 31.548387096774192]

**Your Interpretation:**

This function is similar to the one above it but instead of returning the mean of all values as a single number, it returns the mean of all rows in a list. A matrix is a list of lists, but the return value here is just a list of means. 
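The same row-by-row pattern works for any statistic that takes a list, so it can be written once and reused. A minimal sketch, where `apply_per_row` is a hypothetical helper and the matrix is made up:

```python
def apply_per_row(matrix, statistic):
    """Call statistic on every inner list and collect the results."""
    return [statistic(row) for row in matrix]

def mean(values):
    return sum(values) / len(values)

small_matrix = [[1, 2, 3], [10, 20]]   # made-up rows of unequal length

print(apply_per_row(small_matrix, mean))   # [2.0, 15.0]
print(apply_per_row(small_matrix, max))    # [3, 20]
print(apply_per_row(small_matrix, len))    # [3, 2]
```

Passing the function as an argument means the loop does not have to be copied for the median and the standard deviation.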

### Task 2.3:  Calculate the median of a whole matrix

Write a function that calculates and returns the median of a matrix over all values (regardless of which row they come from). Provide a written interpretation of your results. Can you use your function from Part 1 and get a valid result?

In [58]:
def median_matrix(temp_matrix):
    
    flat_list = []
       
    # flatten the matrix into one long list
    for temp_list in temp_matrix:
        for temp in temp_list:
            flat_list.append(temp)        
        
    # reuse the median function from Part 1
    return median(flat_list)
    
# the call to your function
median_matrix(temperature_matrix)

57

**Your Interpretation:**

This uses the same structure as mean_matrix but calls the median() function instead.

### Task 2.4: Calculate the median of each vector of a matrix

Write a function that calculates the median of each sub array (i.e. each column in the csv file) in the matrix and returns an array of medians (one entry per column in the csv file). To do so, use the function you implemented in Part 1. Provide a written interpretation of your results. 

In [57]:
def median_matrix_columns(temp_matrix):
    
    median_values = []   
        
    # one median per month, reusing the median function from Part 1
    for temp_list in temp_matrix:
        median_values.append(median(temp_list))
    
    return median_values
    
median_matrix_columns(temperature_matrix)

[36, 44.0, 51, 53.5, 59, 78.0, 77, 79, 73.0, 62, 40.0, 32]

**Your Interpretation:**

This uses the same structure as mean_matrix_columns but calls the median() function instead.

### Task 2.5: Calculate the standard deviation of a whole matrix

Write a function that calculates the standard deviation of a matrix over all values in the matrix (ignoring from which column they were coming) and returns it. Can you use your function from Part 1 and get a valid result? Provide a written interpretation of your results. 

In [61]:
def standard_deviation_matrix(temp_matrix):
    
    flat_list = []
       
    # flatten the matrix into one long list
    for temp_list in temp_matrix:
        for temp in temp_list:
            flat_list.append(temp)        
        
    # reuse the standard_deviation function from Part 1
    return standard_deviation(flat_list)
    
# the call to your function
standard_deviation_matrix(temperature_matrix)

17.908994103709954

**Your Interpretation:**

This uses the same structure as mean_matrix but calls the standard_deviation() function instead.

### Task 2.6: Calculate the standard deviation of each vector of a matrix

Write a function that calculates the standard deviation of each array in the matrix and returns an array of standard deviations (one standard deviation for each column). To do so, use the function you implemented in Part 1.
Pass in the matrix with the temperature data and return the result. Provide a written interpretation of your results - is the standard deviation consistent across the seasons? 

In [60]:
def standard_deviation_matrix_columns(temp_matrix):
    
    std_dev_values = []   
        
    # one standard deviation per month, reusing the function from Part 1
    for temp_list in temp_matrix:
        std_dev_values.append(standard_deviation(temp_list))
    
    return std_dev_values
    
# the call to your function
standard_deviation_matrix_columns(temperature_matrix)

[6.5047809200539595,
 7.343868051591318,
 8.263229231729458,
 6.923791511078947,
 6.272679973334109,
 6.535713852025312,
 5.020872148142359,
 4.666617114845193,
 7.552850823070421,
 6.937959048194395,
 8.715822138820615,
 8.96890222245524]

**Your Interpretation:** 

This uses the same structure as mean_matrix_columns but calls the standard_deviation() function instead.
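
To approach the question of whether the spread is consistent across the seasons, one option is to look up which months have the smallest and largest standard deviation. The sketch below copies the twelve values from the output above; the `month_names` list and the variable names are added for illustration.

```python
month_names = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# monthly standard deviations, copied from the output above
std_devs = [6.5047809200539595, 7.343868051591318, 8.263229231729458,
            6.923791511078947, 6.272679973334109, 6.535713852025312,
            5.020872148142359, 4.666617114845193, 7.552850823070421,
            6.937959048194395, 8.715822138820615, 8.96890222245524]

# positions of the smallest and largest value
calmest = min(range(len(std_devs)), key=lambda i: std_devs[i])
most_variable = max(range(len(std_devs)), key=lambda i: std_devs[i])

print(month_names[calmest], round(std_devs[calmest], 2))               # August 4.67
print(month_names[most_variable], round(std_devs[most_variable], 2))   # December 8.97
```

For this data the daily mean temperatures vary least in July and August and most in November and December, so the standard deviation is not constant over the year.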