## DS200: Introduction to Data Sciences
# Lab Assignment 6: Variance, Standard Deviation, Chebychev’s Bounds, and Standard Units (2 points)

In [None]:
from datascience import *
import numpy as np

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

In this lab assignment, we will use the United Airlines flight delay dataset as an example to learn about variability.

Let's first load the dataset.

In [None]:
flights = Table.read_table('united_summer2015.csv')
flights

# Part 1: Mean, Variance, Standard Deviation

The table `flights` contains a column named `Delay`, which consists of the departure delay times, in minutes, of thousands of United Airlines flights in the summer of 2015.

### Problem 1A: Mean (0.1 points)

Calculate the mean value of all departure delay times, and store the value in a variable `mean_delay`.

*Hint*: The mean value is around 16.66.


In [None]:
# write your solution to Problem 1A in this code cell

### Problem 1B: Variance and Standard Deviation (0.2 points)

First, calculate the variance of all departure delay times.

*Hint:* Recall that variance is the average squared deviation from the mean (see the Lecture 6 notes for the definition of variance). To calculate the variance, you can follow the steps below:
* extract the column `Delay` from the `flights` table,
* calculate every flight's deviation from `mean_delay` and store the resulting array in variable `deviations`,
* use `np.square` to calculate the squared deviations,
* calculate the mean of all squared deviations and store the result in variable `variance`.

Second, calculate the standard deviation of all delay times. 

*Hint:* Recall that the standard deviation is the positive square root of the variance. You can calculate the standard deviation by taking the square root of the variable `variance`. Another way is to call the function `np.std` directly, with all the delay times as the input argument. To verify your code, calculate the standard deviation both ways and check that you get the same answer (which should be around 39.48).
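The steps in the hints above can be tried out first on a tiny array. The sketch below uses made-up numbers (not the flight delay data), and the variable names are illustrative only. It also shows that `np.std`, with its default settings, divides by the number of entries, which matches the definition of variance used in this lab.

```python
import numpy as np

# Made-up values for illustration only (not the flight delay data).
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mean_value = np.mean(values)                  # 5.0
deviations = values - mean_value              # distance of each value from the mean
squared_deviations = np.square(deviations)    # squared deviations
variance = np.mean(squared_deviations)        # average squared deviation: 4.0
sd_from_variance = np.sqrt(variance)          # 2.0
sd_from_numpy = np.std(values)                # default ddof=0, so it divides by n

print(variance, sd_from_variance, sd_from_numpy)
```

Both ways of computing the standard deviation agree on this toy array, which is the same check the hint suggests for the delay times.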







In [None]:
# write your solution to Problem 1B in this code cell

# Part 2: Chebychev’s Bounds, Standard Units

The Russian mathematician Pafnuty Chebychev (1821-1894) proved a result, called **Chebychev’s inequality**, which states the following:

*For all lists, and all positive numbers $z$, the proportion of entries in the list that are in the range of $[\mu - z \cdot \sigma, \mu + z \cdot \sigma]$ is at least $1-\frac{1}{z^2}$.*

Note that Chebychev’s result gives a lower bound, not an exact answer or an approximation. $\mu$ is the mean, and $\sigma$ is the standard deviation of the list.
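To see that the inequality above is a lower bound rather than an exact answer, it can help to compare the bound with the actual proportion on a small list. The sketch below is only an illustration: the values are made up, and `proportion_within` and `chebychev_bound` are hypothetical helper names, not functions from the `datascience` library.

```python
import numpy as np

def proportion_within(values, z):
    """Proportion of entries within z standard deviations of the mean."""
    mu = np.mean(values)
    sigma = np.std(values)
    within = np.abs(values - mu) <= z * sigma   # True where the entry is in range
    return np.mean(within)                      # mean of booleans = proportion

def chebychev_bound(z):
    """Lower bound 1 - 1/z^2 from Chebychev's inequality."""
    return 1 - 1 / z**2

# Made-up, strongly skewed values for illustration only.
values = np.array([0, 0, 1, 1, 2, 3, 5, 8, 13, 60])

for z in (2, 3):
    print(z, proportion_within(values, z), chebychev_bound(z))
```

For each `z`, the actual proportion is at least as large as the bound, but it does not have to be equal to it.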

### Problem 2A: Calculation of Chebychev’s Bounds (0.3 points)

Calculate Chebychev’s bounds for the proportion of flight delay times that are in the range of $[\mu - z \cdot \sigma, \mu + z \cdot \sigma]$ for the following $z$ values:

*  $z=1$
*  $z=4$
*  $z=5$

*Hint*: For instance, as shown in the lecture notes, when $z=2$, Chebychev's result states that the proportion of values in the range of  $[\mu - 2 \sigma, \mu + 2 \sigma]$ is at least $1 - \frac{1}{2^2} = 0.75$.

**Answer to Problem 2A**: 
(Please write your answer to Problem 2A here. You can either write the answer as text in this text cell, or write Python code in a new code cell to print out your answer.)


In the calculation above, the quantity $z$ measures *standard units*, the number of standard deviations from the mean. 

To convert a value to standard units, first calculate how far it is from the mean, and then divide this deviation by the standard deviation:

$z = \frac{\text{value} - \mu}{\sigma}$

where $\mu$ is the mean and $\sigma$ is the standard deviation.
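As a quick check of the formula, here is a minimal sketch on a made-up array whose mean is 5 and whose standard deviation is 2. The function name `standard_units_sketch` is hypothetical and only meant to illustrate the arithmetic.

```python
import numpy as np

def standard_units_sketch(values, mu, sigma):
    """Convert a value (or an array of values) to standard units."""
    return (values - mu) / sigma

# Made-up values for illustration only: mean 5, standard deviation 2.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])
z_values = standard_units_sketch(values, np.mean(values), np.std(values))

print(z_values)
print(np.mean(z_values), np.std(z_values))
```

After conversion, the values have mean 0 and standard deviation 1, which is a useful sanity check for any array in standard units.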

### Problem 2B: Function for Calculating Standard Units (0.3 points)

* Write a function that converts a value (or an array of values) to standard units, given the value, the mean, and the standard deviation.

* Use this function to convert all flight delay times to standard units and then add a column `Delay (Standard Units)` to the table `flights`.

*Hint:* Your table should look like [this](https://drive.google.com/file/d/1lwcXY4KDEaazbd8VNakqEdPZL9hyNzGZ/view?usp=sharing).

In [None]:
# write your solution to Problem 2B in this code cell

### Problem 2C: Standard Units in the Population (0.2 points)

Write code that calculates the fraction of flights whose delay times are within the range $[\mu - 2 \sigma, \mu + 2 \sigma]$ **using the column `Delay (Standard Units)`.**

*Hint:* The fraction is around 95.6%.



In [None]:
# write your solution to Problem 2C in this code cell

# Part 3: Empirical Distribution and Variability of Sample Means

The Central Limit Theorem states the following:

*The distribution of the mean of large random samples drawn with replacement will be roughly normal, regardless of the distribution of the population.*
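Before turning to real data, the statement can be illustrated with a small simulation. The sketch below makes several assumptions that are not part of this lab: the population is synthetic (drawn from an exponential distribution, which is strongly right-skewed), the sample size and number of repetitions are arbitrary, and `skewness` is a hypothetical helper that measures asymmetry (roughly 0 for a symmetric, bell-shaped distribution).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed population (an assumption for illustration).
population = rng.exponential(scale=20, size=10_000)

sample_size = 500
repetitions = 1000

# Mean of each random sample drawn with replacement from the population.
sample_means = np.array([
    np.mean(rng.choice(population, size=sample_size, replace=True))
    for _ in range(repetitions)
])

def skewness(x):
    """Average cubed value in standard units; near 0 for symmetric data."""
    return np.mean(((x - np.mean(x)) / np.std(x)) ** 3)

print("population skewness:  ", skewness(population))
print("sample-mean skewness: ", skewness(sample_means))
```

The population is far from symmetric, while the sample means should be much closer to symmetric and centered near the population mean.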

Next, let's use the flight delay example to examine the theorem in action. 


We'll treat all flight delay times as the population. Let's draw a histogram of the population to look at its distribution.

In [None]:
flights.hist('Delay', bins=50)

We can draw a large sample of size 2000 from the population, and look at the mean delay in the sample:



In [None]:
sample_2000 = flights.sample(2000)
sample_mean = np.mean(sample_2000.column('Delay'))
sample_mean

Run the code cell above several times, and compare the sample mean with the population mean (~16.658) that you calculated in Problem 1A. 

### Problem 3A: Empirical Distribution of the Sample Means (0.3 points)

* Write a `for` loop that draws 2000 samples from the population, with each sample having a size of 1000. That is, each sample should contain 1000 flight delay times randomly drawn with replacement from all flight delay times.

* Calculate and record the mean delay for each one of the 2000 samples.

* Draw a histogram to show the distribution of the sample mean delay times. 

*Hint:* Your histogram should look similar to [this](https://drive.google.com/file/d/1s2j9hGC6beDgcb_h4Za2KDdMJDufdf7Y/view?usp=sharing).


In [None]:
# write your solution to Problem 3A in this code cell

### Problem 3B: Sample Size vs. Variability of the Sample Mean (0.6 points)

As we learned in class, the larger the sample size, the smaller the standard deviation of sample means. 

* Write code to calculate the standard deviation of sample means, with sample sizes of 400, 900, and 1600.

* Verify whether the calculated standard deviations **roughly** follow the sample mean variability rule below:

$\text{std. dev. of sample means} = \frac{\text{std. dev. of population}}{\sqrt{\text{sample size}}}$
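One possible way to check a rule like this is to compare the observed standard deviation of many sample means with the value the rule predicts. The sketch below is an illustration under stated assumptions: it uses a synthetic population rather than the flight data, sample sizes that differ from the ones assigned above, and a hypothetical helper named `sd_of_sample_means`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic population (an assumption for illustration, not the flight data).
population = rng.exponential(scale=20, size=10_000)
population_sd = np.std(population)

def sd_of_sample_means(population, sample_size, repetitions=2000):
    """SD of the means of `repetitions` samples drawn with replacement."""
    means = np.empty(repetitions)
    for i in range(repetitions):
        sample = rng.choice(population, size=sample_size, replace=True)
        means[i] = np.mean(sample)
    return np.std(means)

results = {}
for n in (100, 2500):
    observed = sd_of_sample_means(population, n)
    predicted = population_sd / np.sqrt(n)
    results[n] = (observed, predicted)
    print(n, observed, predicted)
```

Because the samples are random, the observed and predicted values should agree only roughly, and the observed values change slightly if the seed is changed.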


In [None]:
# Write your solution to Problem 3B in this cell. 
# Note that you can also add a new text cell to explain your answer about verifying the sample mean variability rule.
