<center><h1>Hypothesis Testing</h1></center>

When conducting data analysis, we want to say something meaningful about our data. Often, we want to know if a change or difference we see in a dataset is “real” or if it’s just a normal fluctuation or a result of the specific sample of people we have chosen to measure. A difference we observe in our data is only important if we can be reasonably sure that it is representative of the population as a whole, and reasonably sure that our result is repeatable.

This question of whether a difference is significant or not is essential to making decisions based on that difference. Some instances where this might come up include:

* Performing an A/B test — are the different observations really the results of different conditions (i.e., Condition A vs. Condition B)? Or just the result of random chance?

* Conducting a survey — is the fact that men gave slightly different responses than women a real difference between men and women? Or just the result of chance?

## Central Limit Theorem (sample mean vs population mean)

The sample mean of a larger sample tends to approximate the population mean more closely (this is the Law of Large Numbers). The Central Limit Theorem adds that, if the sample size is large enough, the sample means are approximately normally distributed around the population mean, so nearly all of them will be *sufficiently* close to it.

In [5]:
import numpy as np
np.random.seed(42)

# Create population and find population mean
population = np.random.normal(loc=65, scale=100, size=3000) #loc: mean, scale: standard deviation, size: size
population_mean = np.mean(population)

# Select increasingly larger samples
extra_small_sample = population[:10]
small_sample = population[:50]
medium_sample = population[:100]
large_sample = population[:500]
extra_large_sample = population[:1000]

# Calculate the mean of those samples
extra_small_sample_mean = np.mean(extra_small_sample)
small_sample_mean = np.mean(small_sample)
medium_sample_mean = np.mean(medium_sample)
large_sample_mean = np.mean(large_sample)
extra_large_sample_mean = np.mean(extra_large_sample)

# Print them all out!
print( "Extra Small Sample Mean: {} (extremely unlikely to be close to the population mean)".format(extra_small_sample_mean))
print( "Small Sample Mean: {} (very unlikely to be close to the population mean)".format(small_sample_mean))
print( "Medium Sample Mean: {} (quite unlikely to be close to the population mean)".format(medium_sample_mean))
print( "Large Sample Mean: {} (quite likely to be close to the population mean)".format(large_sample_mean))
print("Extra Large Sample Mean: {} (very likely to be close to the population mean)".format(extra_large_sample_mean))

print( "\nPopulation Mean: {}".format(population_mean))

Extra Small Sample Mean: 109.8061111698756 (extremely unlikely to be close to the population mean)
Small Sample Mean: 42.452609474386 (very unlikely to be close to the population mean)
Medium Sample Mean: 54.615348260590615 (quite unlikely to be close to the population mean)
Large Sample Mean: 65.68379945886475 (quite likely to be close to the population mean)
Extra Large Sample Mean: 66.93320558223255 (very likely to be close to the population mean)

Population Mean: 68.2000835875835
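
The single samples above show one draw per sample size. A minimal sketch of the same idea repeated many times is given below: it draws many random samples of each size and measures how widely their means are spread. The seed, the number of repeats, and the helper name `sample_mean_spread` are assumptions made for illustration; the spread should shrink roughly like the population standard deviation divided by the square root of the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
population = rng.normal(loc=65, scale=100, size=3000)

def sample_mean_spread(sample_size, n_repeats=2000):
    """Standard deviation of the means of many random samples of one size."""
    means = [rng.choice(population, size=sample_size).mean()  # with replacement
             for _ in range(n_repeats)]
    return np.std(means)

for n in (10, 50, 100, 500, 1000):
    spread = sample_mean_spread(n)
    theory = population.std() / np.sqrt(n)
    print("n = {:4d}: spread of sample means = {:6.2f} (sigma / sqrt(n) = {:6.2f})"
          .format(n, spread, theory))
```

Larger samples give sample means that cluster more tightly around the population mean, which is why the large samples above landed closer to it.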


## Hypothesis Tests

Suppose we want to know if men are more likely to sign up for a given programming class than women. We invite 100 men and 100 women to this class. After one week, 34 women and 39 men have signed up. More men than women signed up, but is this a "real" difference?

“What is the probability that men and women have the same level of interest in this class and that the difference we observed is just chance?”

In other words, “If we gave the same invitation to every person in the world, would more men still sign up?”

A more formal version is: “What is the probability that the two population means are the same and that the difference we observed in the sample means is just chance?”

These statements are all ways of getting at a **null hypothesis**. A null hypothesis is the statement that the observed difference is the result of chance alone (that is, there is no real difference between the populations).

## Type I or Type II

1. The first kind of error, known as a **Type I error**, is finding a correlation between things that are not related (false positives). This occurs when  we reject the null hypothesis even though it's true.
    
>For example, let’s say you conduct an A/B test for an online store and conclude that interface B is significantly better than interface A at directing traffic to a checkout page. You have rejected the null hypothesis that there is no difference between the two interfaces. If, in reality, your results were due to the groups you happened to pick, and there is actually no significant difference between interface A and interface B in the greater population, you have been the victim of a false positive.


2. The second kind of error, a **Type II error**, is failing to find a correlation between things that are actually related (false negative). This happens when we fail to reject the null hypothesis even though it's false.

>For example, with the A/B test situation, let’s say that after the test, you concluded that there was no significant difference between interface A and interface B. If there actually is a difference in the population as a whole, your test has resulted in a false negative.
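
To make the Type I error concrete, here is a small simulation sketch. Both groups are drawn from the same synthetic distribution, so the null hypothesis is true by construction and every rejection is a false positive. The group sizes, means, and seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)  # arbitrary seed
alpha = 0.05                    # significance level
n_experiments = 2000

false_positives = 0
for _ in range(n_experiments):
    # Same distribution for both groups: any "significant" result is a Type I error.
    group_a = rng.normal(loc=50, scale=10, size=100)
    group_b = rng.normal(loc=50, scale=10, size=100)
    _, p_val = ttest_ind(group_a, group_b)
    if p_val < alpha:
        false_positives += 1

type_i_rate = false_positives / n_experiments
print("Share of experiments with a false positive:", type_i_rate)
```

The share of false positives should land near the chosen significance level, which is what the significance level controls.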


## p-Values

A hypothesis test provides a numerical answer, called a **p-value**, that helps us decide how confident we can be in the result. In this context, a *p-value* is the probability of obtaining statistics at least as extreme as the ones we observed, under the assumption that the null hypothesis is true.

Note that the p-value is not the probability that the null hypothesis is true. It is compared against a threshold chosen in advance, the *significance level*: if the p-value is less than or equal to the significance level, the null hypothesis is rejected in favor of the alternative hypothesis; if the p-value is greater than the significance level, the null hypothesis is not rejected.

A p-value of 0.05 would mean that if we assumed the null hypothesis is true (i.e. no correlation), then there's a 5% chance that random sampling alone would produce a result at least as extreme as what was observed. It does not mean that there is a 5% chance that there is no difference between the two population means.

Generally, we want a p-value of less than 0.05, meaning that results like ours would occur less than 5% of the time if only random chance were at work.
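
One way to illustrate what a p-value measures is a permutation sketch on the sign-up example from the Hypothesis Tests section (39 of 100 men and 34 of 100 women signed up). If gender made no difference, the "man" and "woman" labels would be interchangeable, so we shuffle them many times and count how often chance alone produces a gap at least as large as the observed one. This is an illustrative approach, not one of the tests covered in this notebook; the seed and number of shuffles are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

# 1 = signed up, 0 = did not sign up
men = np.array([1] * 39 + [0] * 61)
women = np.array([1] * 34 + [0] * 66)
observed_gap = abs(int(men.sum()) - int(women.sum()))  # 5 sign-ups

pooled = np.concatenate([men, women])
n_shuffles = 10000
as_extreme = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    # Treat the first 100 shuffled entries as "men" and the rest as "women".
    gap = abs(int(shuffled[:100].sum()) - int(shuffled[100:].sum()))
    if gap >= observed_gap:
        as_extreme += 1

p_value = as_extreme / n_shuffles
print("Estimated p-value:", p_value)
```

A large estimated p-value here says that a gap of five sign-ups is something chance produces often, so the data give little reason to reject the null hypothesis.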

## Univariate T-test (1 Sample T-test)

Suppose that a product manager wants the average age of visitors to BuyPie.com to be 30. In the past hour, the website had 100 visitors (the code example below uses a smaller sample of 14) and the average age was 31. Are the visitors too old? Or is this just the result of chance and a small sample size?

A **univariate T-test** (or 1 Sample T-test) is a type of hypothesis test that compares a sample mean to a hypothetical (or previously established) population mean and determines how likely a sample mean at least this far from the desired (or established) mean would be if the sample really came from a distribution with that mean.

In this case, the null hypothesis can be phrased as such: "The set of samples belongs to a population with the target (or established) mean".

SciPy has a function called <code>ttest_1samp</code>, which performs a 1 Sample T-Test. 

<code>ttest_1samp</code> requires two inputs, a distribution of values and an expected (or established) mean:

```python
from scipy.stats import ttest_1samp

t_stat, p_val = ttest_1samp(example_distribution, expected_or_previously_established_mean)
```

#### Univariate T-test Example

In [10]:
from scipy.stats import ttest_1samp
import numpy as np
"""
We have a small dataset representing the ages of visitors to BuyPie.com in the past hour.
Their average age is 31. Our target average age is 30. Should we be satisfied with
this result of 31, or was our sample mean the result of a small sample size and random chance?
A 1 Sample T-test can answer this question.
"""

ages = np.array([32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22])
print("The sample mean of ages is ", np.mean(ages))
desired_mean = 30
t_stat, p_val = ttest_1samp(ages, desired_mean) #30 is the expected mean
print("p-value for the null hypothesis that the sample comes from a distribution with mean",
      desired_mean, "is", p_val)

The sample mean of ages is  31.0
p-value for the null hypothesis that the sample comes from a distribution with mean 30 is 0.5605155888171379


This p-value of about 0.56 means that, if the true mean age were 30, a sample mean at least as far from 30 as ours would occur about 56% of the time. It is not a 56% *confidence* that there's no *significant difference* between the sample and the population.
We won't reject the null hypothesis, but then again, this result does not mean that if we wait for more visitors to BuyPie, the average age would definitely be 30 and not 31.

p-values tell us how surprising a result would be under the null hypothesis. Just because we don't have enough data to detect a difference doesn't mean that there isn't one. Generally, the more samples we have, the smaller the difference we'll be able to detect.
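
For readers who want to see where the numbers come from, the sketch below recomputes the 1 Sample T-test on the same `ages` array by hand: the t-statistic is the distance between the sample mean and the target mean, measured in standard errors, and the two-sided p-value is the tail area of a t-distribution with `n - 1` degrees of freedom. The variable names `t_manual` and `p_manual` are chosen here for illustration.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

ages = np.array([32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22])
desired_mean = 30

n = len(ages)
standard_error = ages.std(ddof=1) / np.sqrt(n)      # sample std / sqrt(n)
t_manual = (ages.mean() - desired_mean) / standard_error
p_manual = 2 * t.sf(abs(t_manual), df=n - 1)        # two-sided tail area

t_scipy, p_scipy = ttest_1samp(ages, desired_mean)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

Both lines should agree, which shows that a small gap between the sample mean and the target, combined with a small sample, yields a small t-statistic and therefore a large p-value.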

In [14]:
"""
We have loaded a dataset daily_visitors that represents the ages of visitors to BuyPie.com in the last 1000 days. 
Each entry daily_visitors[i] is an array of entries representing the age per visitor to the website on day i.
We predicted that the average age would be 30, and we want to know if the actual data differs from that. 
If we get a pval < 0.05 we can conclude that it is unlikely that the visitors on that day have a true mean age of 30. 
Assuming the true mean really does differ from 30, the hypothesis test has then correctly rejected the null hypothesis, and we call that a correct result.
"""

correct_results = 0 #Start the counter at 0

daily_visitors = np.genfromtxt("daily_visitors.csv", delimiter=",")

for i in range(1000): #1000 experiments (days)
    t_stat, pval = ttest_1samp(daily_visitors[i], 30)
    #print(pval)
    if pval < 0.05:
        #print(np.mean(daily_visitors[i]))
        correct_results += 1
    else:
        #print(np.mean(daily_visitors[i]))
        pass
print("We correctly recognized that the distribution was different in "\
      + str(correct_results) + " out of 1000 experiments.")

We correctly recognized that the distribution was different in 499 out of 1000 experiments.
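
The `daily_visitors.csv` file is not reproduced here, so the following sketch uses synthetic data instead to show how the share of "correct results" depends on the sample size. The true mean of 31, the standard deviation of 5, and the helper name `detection_rate` are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(3)  # arbitrary seed

def detection_rate(sample_size, true_mean=31, hypothesized_mean=30,
                   std=5, n_experiments=1000, alpha=0.05):
    """Fraction of simulated experiments in which the test rejects the null."""
    rejections = 0
    for _ in range(n_experiments):
        sample = rng.normal(loc=true_mean, scale=std, size=sample_size)
        _, p_val = ttest_1samp(sample, hypothesized_mean)
        if p_val < alpha:
            rejections += 1
    return rejections / n_experiments

rates = {n: detection_rate(n) for n in (20, 100, 500)}
for n, rate in rates.items():
    print("sample size {:3d}: difference detected in {:.0%} of experiments"
          .format(n, rate))
```

With the same one-year difference in mean age, small samples miss it most of the time while large samples detect it almost always.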


## 2 Sample T-test

Suppose that last week, the average amount of time spent per visitor on a website was 25 minutes. This week, the average amount of time spent per visitor to a website was 28 minutes. Did the average time spent per visitor change? Or is this part of natural fluctuations?

One way of testing whether this difference is significant is by using a 2 Sample T-Test. A **2 Sample T-Test** compares two sets of data, which are both approximately normally distributed.

The null hypothesis, in this case, is that the two distributions have the same mean.

We can use SciPy's <code>ttest_ind</code> function to perform a 2 Sample T-Test. It takes two distributions as inputs and returns the t-statistic and a p-value.

```python
from scipy.stats import ttest_ind

t_stat, p_val = ttest_ind(example_distribution_1, example_distribution_2)
```

**2 Sample T-Test Example**

In [8]:
from scipy.stats import ttest_ind
"""
We've created two distributions representing the time spent per visitor to BuyPie.com last week, week1,
and the time spent per visitor to BuyPie.com this week, week2.
"""
week1 = np.genfromtxt("week1.csv")
week2 = np.genfromtxt("week2.csv")

week1_mean = np.mean(week1)
week2_mean = np.mean(week2)

print("First week's average time spent " + str(week1_mean))
print("Second week's average time spent "+ str(week2_mean))

week1_std = np.std(week1)
week2_std = np.std(week2)

print("First week's standard deviation of time spent " + str(week1_std))
print("Second week's standard deviation of time spent "+ str(week2_std))

t_stat, pval = ttest_ind(week1,week2)
print("\n" + "p-value for the null hypothesis that the two weeks have the same mean: " + str(pval))

First week's average time spent 25.448059395140003
Second week's average time spent 29.021568107748
First week's standard deviation of time spent 4.531693387077697
Second week's standard deviation of time spent 5.497966708651848

p-value for the null hypothesis that the two weeks have the same mean: 0.00067676768998613


With a p-value this far below 0.05, we should reject the null hypothesis and conclude that these distributions do not have the same mean. This is consistent with the sample means, 25.4480593952 and 29.0215681076, being far apart relative to the spread of the data.

## ANOVA (Analysis of Variance)

When comparing more than two numerical datasets, the best way to preserve a Type I error probability of <code>0.05</code> is to use ANOVA. **ANOVA** tests the null hypothesis that all of the datasets have the same mean. If we reject the null hypothesis with ANOVA, we're saying that at least one of the sets has a different mean; however, it does not tell us which datasets are different.

We can use the SciPy function <code>f_oneway</code> to perform ANOVA on multiple datasets. It takes in each dataset as a different input and returns the F-statistic and the p-value. For example, if we were comparing scores on a videogame between math majors, writing majors and psychology majors, we could run an ANOVA test with this line:

```python
from scipy.stats import f_oneway

fstat, pval = f_oneway(scores_mathematicians,
                       scores_writers,
                       scores_psychologists)
```

The null hypothesis, in this case, is that all three populations have the same mean score on this videogame. If we reject this null hypothesis (by getting a p-value less than 0.05), we can say that we are reasonably confident that at least one pair of datasets is significantly different. After using only ANOVA, we can't draw any conclusions about which two populations have a significant difference.
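
A short arithmetic sketch shows why running a separate 2 Sample T-Test for every pair of groups, instead of one ANOVA, makes it harder to keep the Type I error probability at 0.05. It assumes, as a simplification, that the pairwise tests are independent; the group names are just labels.

```python
from itertools import combinations

alpha = 0.05
groups = ["math", "writing", "psychology"]
pairs = list(combinations(groups, 2))  # every pair needs its own t-test

# Probability of at least one false positive across all pairwise tests,
# under the simplifying assumption that the tests are independent.
familywise_error = 1 - (1 - alpha) ** len(pairs)

print("Number of pairwise tests:", len(pairs))
print("Chance of at least one false positive:", round(familywise_error, 4))
```

With three groups there are three pairs, and the chance of at least one false positive rises from 5% to roughly 14%.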

#### ANOVA. Example:
Suppose that we own a chain of stores that sell ants, called VeryAnts. There are three different locations: A, B, and C. We want to know if the average ant sales over the past year are significantly different between the three locations. (The null hypothesis in this case is “There is no significant difference in sales between the stores.”)

In [5]:
from scipy.stats import f_oneway

a = np.genfromtxt("store_a.csv")
b = np.genfromtxt("store_b.csv")
c = np.genfromtxt("store_c.csv")

fstat, pval = f_oneway(a,b,c)
print("p-value for the null hypothesis that all stores have the same mean sales: " + str(pval))

p-value for the null hypothesis that all stores have the same mean sales: 0.00015341166007838315


We should reject the null hypothesis (pval<0.05). This means that data like ours would be very unlikely if all three stores had the same average sales, so we conclude that at least one store's average sales differ from the others. Note that 1-pval is *not* the probability that a difference exists.

## Assumptions of Numerical Hypothesis Tests

### 1. The samples should each be normally distributed...ish

<img src="attachment:image.png" width="250px">

In this scenario, using a numerical hypothesis test would be inappropriate.

### 2. The population standard deviations of the groups should be equal.

For ANOVA and 2-Sample T-Tests, using datasets with standard deviations that are significantly different from each other will often obscure the differences in group means. 

To check for similarity between the standard deviations, it is normally sufficient to divide the two standard deviations and see if the ratio is "close enough" (staying within 10% should suffice).
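
The ratio check can be sketched as follows on synthetic data (the group parameters and seed are assumptions). When the ratio falls outside the 10% band, one common option is to pass `equal_var=False` to `ttest_ind`, which runs Welch's t-test and does not assume equal standard deviations; this is a suggestion beyond what the text above prescribes.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)  # arbitrary seed
group_1 = rng.normal(loc=25, scale=4, size=50)    # synthetic, narrow spread
group_2 = rng.normal(loc=28, scale=12, size=50)   # synthetic, wide spread

std_ratio = np.std(group_1) / np.std(group_2)
stds_similar = bool(0.9 <= std_ratio <= 1.1)      # "within 10%" rule of thumb
print("Ratio of standard deviations:", std_ratio)

# equal_var=True is the standard 2 Sample T-Test; equal_var=False is Welch's t-test.
t_stat, p_val = ttest_ind(group_1, group_2, equal_var=stds_similar)
print("Assumed equal standard deviations:", stds_similar)
print("p-value:", p_val)
```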

### 3. The samples must be independent.
When comparing two or more datasets, the values in one distribution should not affect the values in another distribution. In other words, knowing more about one distribution should not give you any information about any other distribution.

Example of samples that are not independent:

* the number of goals scored per soccer player before, during and after undergoing a rigorous training regime.
* a group of patients' blood pressure levels before, during, and after the administration of a drug.



## Tukey's Range Test

Let's say that we have performed ANOVA to compare three sets of data from three VeryAnts stores. We received the result that there is some significant difference between datasets.

Now, we have to find out *which* datasets are different. We can perform **Tukey's Range Test** to determine which pairs of datasets differ.

If we feed in three datasets, such as the sales at the VeryAnts store locations A,B, and C, Tukey's Test can tell us which pairs of locations are distinguishable from each other. 

The function to perform Tukey's Range Test is <code>pairwise_tukeyhsd</code>, found in <code>statsmodels</code>, not <code>scipy</code>. It accepts a list of all the data and a list of labels that tell the function which elements of the list are from which set. We also provide the significance level we want (usually 0.05).

For example, if we were looking to compare mean scores of movies that are dramas, comedies, or documentaries, we would make a call to <code>pairwise_tukeyhsd</code> like this:



```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd
    
movie_scores = np.concatenate([drama_scores, comedy_scores, documentary_scores])
labels = ['drama'] * len(drama_scores) + ['comedy'] * len(comedy_scores) + ['documentary'] * len(documentary_scores)
    
tukey_results = pairwise_tukeyhsd(movie_scores, labels, 0.05)
```

It will return a table of information, telling you whether or not to reject the null hypothesis for each pair of datasets.

#### Tukey's Range Test. Example:

In [18]:
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
"""
We have concatenated the sales data in lists a, b, and c and put it into list v.
We have also provided a labels list that keeps track of which elements of v come from a, b, or c
"""
a = np.genfromtxt("store_a.csv")
b = np.genfromtxt("store_b.csv")
c = np.genfromtxt("store_c.csv")

fstat, pval = f_oneway(a,b,c)

#ANOVA's p-value of 0.00015341166007838315 (pointing out that we should go for Tukey's Range Test)
print("ANOVA's p-value: " + str(pval)) 

#Using our data from ANOVA, we create v and labels:
v = np.concatenate([a,b,c])
labels = ['a'] * len(a) + ['b'] * len(b) + ['c'] * len(c)

tukey_results = pairwise_tukeyhsd(v, labels, 0.05)
print(tukey_results)

ANOVA's p-value: 0.00015341166007838315
Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     a      b   7.2767  0.001  3.2266 11.3267   True
     a      c   4.0115 0.0529 -0.0385  8.0616  False
     b      c  -3.2651 0.1411 -7.3152  0.7849  False
----------------------------------------------------


From this result, note that the *significant difference* between datasets comes from stores <code>a</code> and <code>b</code> (reject null hypothesis = True)

## Binomial Test

Let's imagine that we're analyzing the percentage of customers who make a purchase after visiting a website. We have a set of 1000 customers from this month, 58 of whom made a purchase. Over the past year, the number of visitors per every 1000 who make a purchase has been around 72. Thus, our marketing department has set our target number of purchases per 1000 visits to be 72. We would like to know if this month's number, 58, is a *significant difference* from that target or a result of natural fluctuations.

How do we begin comparing this, if there's no mean or standard deviation that we can use? The data is divided into two discrete categories, "made a purchase" and "did not make a purchase".

If we have a dataset where the entries are not numbers, but categories instead, we have to use different methods. To analyze a dataset like this, with two different possibilities for entries, we can use a **Binomial Test**. A Binomial Test compares a categorical dataset to some expectation.

Examples:

* Comparing the actual percent of emails that were opened to quarterly goals.
* Comparing the actual percentage of respondents who gave a certain survey response to the expected survey response.
* Comparing the actual number of heads from 1000 coin flips of a weighted coin to the expected number of heads.

The null hypothesis, in this case, would be that there is no difference between the observed behavior and the expected behavior. If we get a p-value of less than 0.05, we can reject that hypothesis and determine that there is a difference between the observation and the expectation.

SciPy has a function called <code>binom_test</code> which performs a Binomial Test for you (newer SciPy releases provide it as <code>binomtest</code>, which returns a result object with a <code>pvalue</code> attribute). <code>binom_test</code> requires three inputs: the number of observed successes, the number of total trials, and the expected probability of success. For example, with 1000 coin flips of a fair coin, we would expect a "success rate" (the rate of getting heads) to be 0.5. Let's imagine we get 525 heads. Is the coin weighted? This function call would look like:
```python    
from scipy.stats import binom_test
    
pval = binom_test(525, n=1000, p=0.5)
```    
It returns a p-value, telling us how likely a result at least as extreme as the observed one would be if the specified probability were correct. If we get a p-value less than 0.05, we can reject the null hypothesis and say that it is likely the coin is actually weighted, and that the probability of getting heads is statistically different from 0.5.
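
To show what such a p-value represents, the sketch below computes a two-sided exact binomial p-value for the coin example directly from the binomial distribution: it adds up the probabilities of every outcome that is no more likely than the observed 525 heads. This is intended to mirror the usual exact two-sided test rather than to reproduce SciPy's internal implementation; the small tolerance factor is an assumption to guard against floating-point ties.

```python
import numpy as np
from scipy.stats import binom

n_flips, n_heads, p_heads = 1000, 525, 0.5

# Probability of every possible number of heads under the null hypothesis.
outcomes = np.arange(n_flips + 1)
probabilities = binom.pmf(outcomes, n_flips, p_heads)

# Two-sided p-value: total probability of outcomes at most as likely as the observed one.
observed_probability = binom.pmf(n_heads, n_flips, p_heads)
p_value = probabilities[probabilities <= observed_probability * (1 + 1e-7)].sum()

print("Exact two-sided p-value for 525 heads in 1000 flips:", p_value)
```

Because a fair coin is symmetric, this is the same as the chance of getting 525 or more heads plus the chance of getting 475 or fewer.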

#### Binomial Test. Example:

In [7]:
from scipy.stats import binom_test
"""
Suppose the goal of VeryAnts' marketing team this quarter was to have 6% of customers click a link that was emailed to them. 
They sent out a link to 10000 customers and 510 clicked the link, which comes out to 5.1% instead of 6%.
Did they do significantly worse than the target? Let's use a binomial test to answer this question.
"""
pval = binom_test(510, n=10000, p=0.06)
print("p-value for 510 clicks under a true click rate of 6%: " + str(pval))

"""
For the next quarter, marketing has tried out a new email tactic, including puns in every line of every email. 
As a result, 590 people out of 10000 opened the link in the newest email.
If we still wanted the mean to be 6% of emails opened, but now have 5.9% of emails opened, what is the new p-value?
"""
pval2 = binom_test(590, n=10000, p=0.06)
#This result is more compatible with natural fluctuation because we're closer to the actual target.
print("p-value for 590 clicks under a true click rate of 6%: " + str(pval2))

p-value for 510 clicks under a true click rate of 6%: 0.00011592032724546606
p-value for 590 clicks under a true click rate of 6%: 0.6891529835730346


Note that the first scenario showed us that we had to reject the null hypothesis: the low number of clicks is unlikely to be explained by natural fluctuation alone, so we actually were not meeting the target.

After making changes in the marketing department, we didn't exactly meet our target of 600 clicks, but the p-value associated with the 590 clicks (~0.69) shows that a shortfall of this size is entirely consistent with natural fluctuation, so we cannot conclude that we missed the target.

## Chi Squared Test

Now, what if we also wanted to track if visitors added any items to their shopping cart? With three discrete categories of data we can no longer use a Binomial Test. If we have two or more categorical datasets that we want to compare, we should use a **Chi Squared** test. It is useful in situations like:

* An A/B test where half of users were shown a green submit button and the other half were shown a purple submit button. Was one group more likely to click the submit button?

* Men and women were both given a survey asking "Which of the following three products is your favorite?" Did the men and women have significantly different preferences?

In SciPy, you can use the function <code>chi2_contingency</code> to perform a Chi Squared Test. The input to <code>chi2_contingency</code> is a contingency table where:

* The columns are each a different condition, such as men vs women or Interface A vs Interface B.
* The rows represent different outcomes, like "Survey Response A" vs "Survey Response B", or "Clicked a Link" vs "Didn't Click"

The table can have as many rows and columns as you need.

In this case the null hypothesis is that there's no significant difference between the datasets. We reject that hypothesis, and state that there is a significant difference between two of the datasets if we get a p-value less than `0.05`.

#### Chi Squared Test. Example:

In [8]:
"""
The management at the VeryAnts ant store wants to know if their two most popular species of ants, 
the Leaf Cutter and the Harvester, vary in popularity among US, CANADA, and FRANCE kids.
"""

from scipy.stats import chi2_contingency 
"""
Before making statistical calculations, we can make a guess by looking at this first table: 
For each country individually, the majority of students show an obvious preference to the Harvester ant, 
regardless of the country of origin. 
This is exactly the null hypothesis (no significant difference between ant preference and country of origin).
"""
# Contingency table
#         harvester |  leaf cutter
# ----+------------------+------------
# US     | 30       |  10
# CANADA | 35       |  5
# FRANCE | 28       |  12

X = [[30, 10],
     [35, 5],
     [28, 12]]  # no trailing comma, which would wrap the table in a tuple


chi2, pval, dof, expected = chi2_contingency(X)
print("p-value for the null hypothesis that ant preference is independent of country of origin:", pval)
"""
40 new students come from Spain. You notice a different attitude, the majority of them don't show a preference 
towards the Harvester ant, as the previous 3 groups did; their preferences are equally shared (20-20).
You start to be suspicious that maybe the country does matter to the preferences of ants. 
The calculation of chi-square test and p-value supports that. 
Low p-value -> rejection of Ho -> there seems to be actually an association.
"""
# Contingency table
#           harvester |  leaf cutter
# ----+------------------+------------
# US      | 30       |  10
# CANADA  | 35       |  5
# FRANCE  | 28       |  12
# SPAIN   | 20       |  20
X = [[30, 10],
     [35, 5],
     [28, 12],
     [20, 20]]

chi2, pval, dof, expected = chi2_contingency(X)
print("p-value for the null hypothesis that ant preference is independent of country of origin:", pval)

p-value for the null hypothesis that ant preference is independent of country of origin: 0.15508230807673704
p-value for the null hypothesis that ant preference is independent of country of origin: 0.002812834559546625


In the first scenario, we fail to reject the null hypothesis.

In the second scenario, we reject the null hypothesis because data like ours would be very unlikely if it were true. (After adding Spain, it looks like there could actually be a relationship between countries and ants.)

Rather than comparing the popularity of the two types of ants, we are interested in whether there is a difference in the relative popularity of the two types of ants among different countries.

For US, CANADA and FRANCE, Harvester Ants are far more popular than Leaf Cutters. For SPAIN, however, the two types of ants are equally popular. That’s what is different about the SPAIN data from the other 3 groups, which is why including it lowered the p-value.
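
The Chi Squared statistic compares the observed table with the counts we would expect if preference were independent of country. A minimal sketch of that calculation for the first (three-country) table follows; the names `expected_manual`, `chi2_manual`, and `p_manual` are introduced here for illustration, and the result is checked against `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[30, 10],
                     [35, 5],
                     [28, 12]])

# Expected count in each cell if rows and columns were independent:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_manual, dof_manual)  # upper-tail area

chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)
print("manual:", chi2_manual, p_manual, dof_manual)
print("scipy: ", chi2_scipy, p_scipy, dof_scipy)
```

Each country has 40 respondents, so under independence every row is expected to split 31 to 9, and the statistic grows as the observed rows move away from that split.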

## Review:

![image.png](attachment:image.png)

### Scenarios
>* **One categorical:** <br/> (Gender only) Is there a difference in the number of men and women in the population? $H_1$: There is a difference. $H_0$: There is no difference.

>* **Two categorical:** <br/>(Gender and age group) Does the proportion of males and females differ across age groups? $H_1$: Number of men & women is dependent on age category. $H_0$: Number of men & women is independent of age category.

>* **One numeric:** <br/>(Height) Is the average height different from a previously established height? $H_1$: There is a difference. $H_0$: There is no difference.


>* **One numeric and one categorical:** <br/>
(Gender and Height) Is there a difference between the average height of men and women?  $H_1$: There is a difference $H_0$: There is no difference. T-test<br/>
(Age group and Height) Is there a difference between the average height of children, adults and the elderly?  $H_1$: There is a difference $H_0$: There is no difference. ANOVA.

>* **Two numeric:** <br/>
(Height and Weight) Is there a relationship between height and weight? $H_1$:There is a correlation $H_0$: There is no correlation
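
The "two numeric" scenario is the only one above without a code example. One possible sketch uses SciPy's `pearsonr`, which tests the null hypothesis of no linear correlation. The height and weight values are synthetic, and the formula linking them is an assumption made only to produce example data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)  # arbitrary seed

# Synthetic data: weight loosely increases with height, plus random noise.
height_cm = rng.normal(loc=170, scale=10, size=200)
weight_kg = 0.9 * (height_cm - 100) + rng.normal(loc=0, scale=8, size=200)

r, p_val = pearsonr(height_cm, weight_kg)
print("Correlation coefficient:", r)
print("p-value for the null hypothesis of no correlation:", p_val)
```

A p-value below 0.05 would lead us to reject $H_0$ and conclude that height and weight are correlated in this synthetic sample.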