# Class 17: Introduction to Statistical Inference

Plan for today:
- Review for loops and writing functions
- Introduction to statistical inference
- Introduction to hypothesis tests


In [None]:
import YData

# YData.download.download_class_code(17)   # get class code    
# YData.download.download_class_code(17, True) # get the code with the answers 

# YData.download.download_class_file('project_template.ipynb', 'homework')  # downloads the class project template 
# YData.download_homework(7)  # downloads the 7th homework 

YData.download_data("movies.csv")
YData.download_data("daily_bike_totals.csv")

If you are using Google Colab, please run the code below.

In [None]:
# !pip install https://github.com/lederman/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import statistics
import pandas as pd
import numpy as np
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

### 0.1 Warm-up exercises using for loops

As we discussed last class, loops allow us to repeat a process many times. They are particularly useful in conjunction with lists to process and store multiple values. 
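
As a minimal sketch of this pattern, the loop below walks over two parallel lists and keeps only the values whose flag is `True`. The names `is_sunny`, `temperatures`, and `sunny_temperatures` are made-up illustrative data and are not part of the bike dataset.

```python
# Hypothetical toy data: one Boolean flag and one value per day
is_sunny = [True, False, True, True, False]
temperatures = [71, 58, 75, 69, 60]

# Start with an empty list and add to it inside the loop
sunny_temperatures = []

# Loop over positions so the two lists stay aligned
for i in range(len(temperatures)):
    if is_sunny[i]:
        sunny_temperatures.append(temperatures[i])

print(sunny_temperatures)  # [71, 75, 69]
```

The same loop could also be written with `for flag, temp in zip(is_sunny, temperatures):`, which pairs up the two lists without using an index.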

Let's start with a quick warm-up exercise on for loops. The code below loads our bike data from 2014 and creates two lists:

- `weekday`: A list of Booleans specifying whether a day is a weekday
- `num_trips`: A list containing how many trips were taken on each day


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

bikes = pd.read_csv("daily_bike_totals.csv", parse_dates = [0])
bikes_2014 = bikes.query("date > '2013-12-31'").query("date < '2015-01-01'")

weekday = bikes_2014["weekday"].to_list()
num_trips = bikes_2014["trips"].to_list()

print(weekday[0:5])
print(num_trips[0:5])

### Warm up exercise 0.1.1

Please use a for loop to create a list called `weekday_trips` which contains the number of trips that occurred only on weekdays. 


In [None]:
# Alternative solution using the zip() function




### Warm up exercise 0.1.2

Now add to your code so that you also create a list called `weekend_trips` that contains the number of trips that occurred on each weekend day. 

Once you have created these lists, create side-by-side boxplots to compare the number of trips taken on weekdays and weekends. 


### Warmup 0.2 - Functions

### 0.2.1: A function

Write a function named `string_and_length` which takes a string and returns a string containing the original string and its length.
For example, `string_and_length("ABC")` would return the string `"ABC3"`.



In [None]:
s = "This is my string"




### 0.2.2: Loop over the function

Write code that takes the string `s` defined above, strips the spaces, and produces a string that contains each word of the original string, followed by the length of that word.

For example, if `s = "This is my string"` then, `newstring = "This4is2my2string6"`
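
One possible way to approach warm-ups 0.2.1 and 0.2.2 is sketched below. It assumes that "words" are separated by whitespace, so `split()` is used both to find the words and to drop the spaces. The parameter name `text` is an arbitrary choice.

```python
def string_and_length(text):
    """Return the string followed by its length, e.g. "ABC" -> "ABC3"."""
    return text + str(len(text))


s = "This is my string"

newstring = ""

# split() breaks the string on whitespace, which also removes the spaces
for word in s.split():
    newstring = newstring + string_and_length(word)

print(string_and_length("ABC"))  # ABC3
print(newstring)                 # This4is2my2string6
```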

## 1. Statistical inference

In statistical inference we use a smaller sample of data to make claims about a larger population of data. 

As an example, let's look at the [2020 election](https://www.cookpolitical.com/2020-national-popular-vote-tracker) between Donald Trump and Joe Biden, and let's focus on the results from the state of Georgia. After all the votes had been counted, the results showed that:

- Biden received 2,473,633 votes
- Trump received 2,461,854 votes

Since we have all the votes that were cast in the election, we can precisely calculate the population parameter of the proportion of votes that Biden received, which we will denote with the symbol $\pi_{Biden}$. 

Let's create names `num_trump_votes` and `num_biden_votes`, and calculate `true_prop_Biden` which is the value $\pi_{Biden}$. 

In [None]:
num_trump_votes = 2461854  # 2,461,854
num_biden_votes = 2473633  # 2,473,633


# calculate the proportion of people who voted for Biden
true_prop_Biden = num_biden_votes/(num_biden_votes + num_trump_votes)

true_prop_Biden

The code below creates a DataFrame called `georgia_df` that captures these election results. Each row in the DataFrame represents a vote. The column `Voted Biden` is `True` if a voter voted for Biden and `False` if the voter voted for Trump. 

In [None]:
biden_votes = np.repeat(True, num_biden_votes)     # create 2,473,633 Trues for the Biden votes
trump_votes = np.repeat(False, num_trump_votes)    # create 2,461,854 Falses for the Trump votes
election_outcome = np.concatenate((biden_votes, trump_votes))  # put the votes together

georgia_df = pd.DataFrame({"Voted Biden": election_outcome})  # create a DataFrame with the data
georgia_df = georgia_df.sample(frac = 1)   # shuffle the order to make it more realistic

georgia_df.head()

Now suppose we didn't know the actual value of $\pi_{Biden}$ and we wanted to estimate it based on a poll of 1,000 voters. We can simulate this by using the pandas `.sample(n = )` method.

Let's simulate sampling random voters

In [None]:
# sample 10 random points
georgia_df.sample(10)  

In [None]:
# simulate proportions of voters that voted for Biden - i.e., p-hats

one_sample = georgia_df.sample(1000)

np.mean(one_sample['Voted Biden'])

### 1b. Creating a sampling distribution via taking random samples

Suppose 100 polls were conducted. How many of them would show that Biden would get the majority of the vote? 

Let's simulate this "sampling distribution" of statistics now... 
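
A rough sketch of such a simulation is shown below. To keep it fast and self-contained it does not use `georgia_df`. Instead it assumes a small synthetic electorate of 100,000 voters (`toy_df`) in which 50.1% voted for Biden. The names `toy_df`, `population_size`, `num_biden`, and `num_majority` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seed NumPy's global generator so the run is repeatable

# Assumption: a synthetic electorate of 100,000 voters, 50.1% of whom voted for Biden
population_size = 100_000
num_biden = 50_100
votes = np.concatenate((np.repeat(True, num_biden),
                        np.repeat(False, population_size - num_biden)))
toy_df = pd.DataFrame({"Voted Biden": votes})

sample_size = 1000
num_simulations = 100

sampling_dist = []

for i in range(num_simulations):
    one_poll = toy_df.sample(sample_size)       # one simulated poll of 1,000 voters
    p_hat = np.mean(one_poll["Voted Biden"])    # proportion for Biden in this poll
    sampling_dist.append(p_hat)

# number of simulated polls in which Biden received a majority
num_majority = np.sum(np.array(sampling_dist) > 0.5)
print(num_majority)
```

Each pass through the loop plays the role of one poll, so `sampling_dist` ends up holding 100 p-hats.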


In [None]:
%%time

sample_size = 1000
num_simulations = 100

sampling_dist = []

...

In [None]:
# plot a histogram of the sampling distribution

plt.hist(sampling_dist, edgecolor = "black");

### 1c. Creating a sampling distribution via coin flips (a faster way to simulate data)

Rather than simulating polling outcomes by pulling random samples from a DataFrame, let's simulate each vote by simulating randomly flipping a coin, where the probability of getting a "Head" (True value) is the probability of Biden getting a vote.

To do this we can use the `flip_coins(n, prob, return_prop)` function we wrote in the previous lecture. 
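
As a sketch of how this could look, the block below repeats the notebook's `flip_coins` definition so that it runs on its own, then calls it once per simulated poll. The vote counts are the Georgia totals used earlier in this notebook.

```python
import numpy as np


def flip_coins(n, prob, return_prop=False):
    # Same logic as the notebook's flip_coins function
    rand_nums = np.random.rand(n)
    num_heads = np.sum(rand_nums <= prob)
    if return_prop:
        return num_heads / n
    return num_heads


num_biden_votes = 2473633
num_trump_votes = 2461854
true_prop_Biden = num_biden_votes / (num_biden_votes + num_trump_votes)

sample_size = 1000
num_simulations = 100

sampling_dist = []

for i in range(num_simulations):
    # each call simulates one poll of 1,000 voters and returns a p-hat
    p_hat = flip_coins(sample_size, true_prop_Biden, return_prop=True)
    sampling_dist.append(p_hat)

print(np.mean(sampling_dist))
```

Because no DataFrame with millions of rows has to be sampled, this version should run noticeably faster than the approach in 1b.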
 


In [None]:
# Function from the previous lecture:
 
def flip_coins(n, prob, return_prop = False):
    
    rand_nums = np.random.rand(n)
    num_heads = np.sum(rand_nums <= prob)
    
    if return_prop:
        return num_heads/n
    else:
        return num_heads

In [None]:
%%time

# sampling distribution of many polls conducted

sample_size = 1000
num_simulations = 100

sampling_dist = []

...

In [None]:
plt.hist(sampling_dist, edgecolor = "black", bins = 10);

## 2. Hypothesis tests

In hypothesis testing, we start with a claim about a population parameter (e.g., µ = 4.2, or π = 0.25).

This claim implies we should get a certain distribution of statistics, called the "null distribution". 

If our observed statistic is highly unlikely to come from the null distribution, we reject the claim. 

We can break down the process of running a hypothesis test into 5 steps. 

1. State the null and alternative hypotheses
2. Calculate the observed statistic of interest
3. Create the null distribution 
4. Calculate the p-value 
5. Make a decision

Let's run through these steps now!


#### Step 1: State the null and alternative hypotheses

Here $\pi$ is the proportion of all movies that pass the Bechdel test.

$H_0: \pi = 0.5$

$H_A: \pi < 0.5$


#### Step 2: Calculate the observed statistic of interest


In [None]:
# load the data

movies = pd.read_csv("movies.csv")

movies.head(3)

In [None]:
# reduce data to a smaller number of columns: "title" and "binary"

movies_smaller = movies[["title", "binary"]]

In [None]:
# calculate the proportion of movies that pass the Bechdel test

booleans_passed = ...

...


#### Step 3: Create the null distribution 

We need to create a null distribution, which is the distribution of statistics we would expect to get if the null hypothesis is true. 

**Question**: About what percent of the movies would we expect to pass the Bechdel test if the null hypothesis were true? 

**Answer**: 50%

Let's create simulated data that is consistent with this!
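
One way this could be sketched is shown below. It is an illustration only: `n = 1000` is a hypothetical number of movies (the notebook takes `n` from `movies.shape[0]`), `num_simulations = 10_000` is an arbitrary choice, and a binomial draw from NumPy's `default_rng` stands in for `flip_coins`, since one binomial draw counts the "heads" in `n` flips.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is repeatable

n = 1000            # hypothetical number of movies (assumption, not the real dataset size)
null_prop = 0.5     # proportion stated by the null hypothesis
num_simulations = 10_000

null_dist = []

for i in range(num_simulations):
    num_passed = rng.binomial(n, null_prop)   # simulated number of passes out of n under H0
    null_dist.append(num_passed / n)          # convert the count to a proportion (p-hat)

print(np.mean(null_dist))
```

The simulated proportions should pile up around 0.5, which is what the null hypothesis predicts.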


In [None]:
# Let's generate one proportion consistent with the null hypothesis

# get the total number of movies in our dataset n
n = movies.shape[0]
print(n)

# proportion consistent with the null hypothesis
null_prop = .5


# one statistic consistent with null hypothesis
...


In [None]:
# Let's generate a null distribution 

null_dist = []

...


In [None]:
# visualize the null distribution 

plt.hist(null_dist, edgecolor = "black", bins = 20) #, range = (.4, .6));
plt.plot(prop_passed, 30, '.', markersize = 30, color = "red");

#### Step 4: Calculate the p-value 

Calculate the proportion of points in the null distribution that are as extreme as, or more extreme than, the observed statistic. 
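
The calculation can be illustrated on a tiny made-up example. Because the alternative hypothesis is $\pi < 0.5$, "extreme" here means small values. The ten numbers in `null_dist` and the value of `observed_stat` below are hypothetical. A real null distribution would contain thousands of simulated statistics.

```python
import numpy as np

# Hypothetical null distribution with only 10 simulated p-hats
null_dist = [0.48, 0.51, 0.50, 0.47, 0.53, 0.49, 0.52, 0.50, 0.46, 0.54]

# Hypothetical observed statistic
observed_stat = 0.47

# Boolean array: True where a simulated p-hat is at or below the observed statistic
stats_as_extreme = np.array(null_dist) <= observed_stat

# The mean of a Boolean array is the proportion of True values
p_value = np.mean(stats_as_extreme)

print(p_value)  # 0.2, since 2 of the 10 values are at or below 0.47
```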


In [None]:
# Calculate the p-value

# create a Boolean vector indicating whether each p-hat in the null distribution 
# was less than or equal to the observed proportion that passed the Bechdel test
stats_more_extreme = ...

print(stats_more_extreme[0:5])

# calculate the p-value 
p_value = ...


#### Step 5: Make a decision

Since the p-value is very small (essentially zero), it is very unlikely that our statistic came from the null distribution. Thus we reject the null hypothesis and conclude that less than 50% of movies pass the Bechdel test. 
