# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or whether it appeared in the sample just by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is p-value? How to avoid p-hacking? 
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to use the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I am interested in: does the number of searches per user differ between the two interfaces?

To answer this question, we first need to pick a **test statistic** that quantifies how good an interface is. Here, we choose "the search_count mean".

Please write the code to compute **the difference of the search_count means between Interface A and Interface B.**

In [40]:
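import pandas as pd

# Each line of searchlog.json is one JSON record
df = pd.read_json('searchlog.json', lines=True)

# Mean search_count per interface; groupby sorts the keys, so row 0 is A and row 1 is B
mean = df[['search_ui','search_count']].groupby('search_ui', as_index=False).mean()

mean_A = mean['search_count'].iloc[0]
mean_B = mean['search_count'].iloc[1]

# Test statistic: difference of the means (A minus B)
diff = mean_A - mean_B

print('the difference of the search_count means between interface A and B is :', diff)
mean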


the difference of the search_count means between interface A and B is : -0.13500569535052287


| | search_ui | search_count |
|---|---|---|
| 0 | A | 0.663793 |
| 1 | B | 0.798799 |


Suppose we find that the mean search_count is about 0.135 higher for Interface B than for Interface A. We then wonder whether this difference is just caused by random variation.

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean  between Interface A and Interface B is caused by the design differences between the two.

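We compute the p-value of the observed result, that is, the probability of seeing a difference at least as extreme as the observed one if the null hypothesis were true. If the p-value is low (e.g., <0.01), we can reject the null hypothesis and adopt the alternative explanation.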

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation from an existing library. You have to implement it yourself.

In [48]:
import numpy as np

ITER = 10000
# Observed difference (mean_B - mean_A) from the cell above, rounded
THRESHOLD = 0.135
cnt = 0

# Work on a copy of the column so the shuffle is explicit
ori_count = df['search_count'].to_numpy().copy()

for i in range(ITER):
    # Permutation: randomly reassign the search counts to users
    np.random.shuffle(ori_count)
    df['search_count'] = ori_count.tolist()
    mean = df[['search_ui','search_count']].groupby('search_ui', as_index=False).mean()
    mean_A = mean['search_count'].iloc[0]
    mean_B = mean['search_count'].iloc[1]

    # Same direction as the observed result: B minus A
    diff = mean_B - mean_A
    if diff >= THRESHOLD:
        cnt += 1

# Fraction of permutations at least as extreme as the observed difference
p_value = cnt / ITER
print('p value = ', p_value)
if p_value < 0.01:
    print('Null hypothesis rejected.')
else:
    print('Cannot reject the null hypothesis.')


p value =  0.1263
Cannot reject the null hypothesis.
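
Since the question asks whether the number of searches per user is *different*, a two-sided version of the permutation test may also be of interest. The following is a minimal sketch using only the standard library. The function name `permutation_test`, its parameters, and the toy data are assumptions for illustration and are not taken from `searchlog.json`.

```python
import random

def permutation_test(group_a, group_b, num_samples=10000, seed=0):
    """Two-sided permutation test on the difference of means (hypothetical helper)."""
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = sum(group_b) / n_b - sum(group_a) / n_a

    # Under the null hypothesis the group labels are exchangeable,
    # so we pool the values and split them at random.
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(num_samples):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / n_b
        # Count permutations at least as extreme as the observed difference
        # (small tolerance guards against floating-point rounding).
        if abs(mean_b - mean_a) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / num_samples

# Toy data, made up for this sketch
toy_a = [0, 1, 0, 2, 1]
toy_b = [1, 0, 2, 1, 3]
observed, p_value = permutation_test(toy_a, toy_b)
print('observed difference =', observed, 'p value =', p_value)
```

Fixing the seed makes the result reproducible. Without it, the p-value varies slightly from run to run.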


Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If we use the same dataset to do this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

**A.** Yes, this can be considered p-hacking. We would be performing multiple analyses on the same dataset and steering the analysis toward a statistically significant result. One remedy is the Bonferroni correction: record the number of significance tests conducted and divide the significance threshold by that number.
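
As a small illustration of the Bonferroni correction described in the answer, the sketch below divides the significance threshold by the number of tests. The helper names are hypothetical, and the second p-value is made up for the example.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if num_tests < 1:
        raise ValueError('num_tests must be at least 1')
    return alpha / num_tests

def significant_after_bonferroni(p_values, alpha=0.01):
    """Return, for each p-value, whether it stays significant after correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]

# Two tests on the same dataset: all users, then instructors only.
# 0.1263 is the p-value reported above; 0.004 is a made-up value.
p_values = [0.1263, 0.004]
print(bonferroni_threshold(0.01, len(p_values)))
print(significant_after_bonferroni(p_values))
```

With two tests and an original threshold of 0.01, each test has to reach p < 0.005 to count as significant.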

## Task 2. Chi-squared Test 

There are tens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a Chi-squared test works by yourself and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared statistic. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency).

In [61]:

df = pd.read_json('searchlog.json', lines=True)

# Observed counts for each cell of the 2x2 contingency table
is_A = df[(df['is_instructor'] == True) & (df['search_ui'] == 'A')].count().iloc[0]
is_B = df[(df['is_instructor'] == True) & (df['search_ui'] == 'B')].count().iloc[0]
not_A = df[(df['is_instructor'] == False) & (df['search_ui'] == 'A')].count().iloc[0]
not_B = df[(df['is_instructor'] == False) & (df['search_ui'] == 'B')].count().iloc[0]

# Row totals (instructors vs. non-instructors)
is_instructor = is_A + is_B
not_instructor = not_A + not_B

# Column totals (Interface A vs. Interface B)
A = is_A + not_A
B = is_B + not_B

# Grand total
Sum = A + B

# Expected counts under independence: row total * column total / grand total
exp_is_A = (is_instructor * A) / Sum
exp_is_B = (is_instructor * B) / Sum
exp_not_A = (not_instructor * A) / Sum
exp_not_B = (not_instructor * B) / Sum

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
x = (is_A - exp_is_A)**2 / exp_is_A + (is_B - exp_is_B)**2 / exp_is_B \
    + (not_A - exp_not_A)**2 / exp_not_A + (not_B - exp_not_B)**2 / exp_not_B

print('chisquare value is: ', x)
# Degrees of freedom: (rows - 1) * (columns - 1) = 1 for a 2x2 table
print('degree of freedom is: ', 1)

chisquare value is:  0.6731740891275046
degree of freedom is:  1
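
The same computation generalizes to a contingency table of any size. The sketch below is one possible way to write it in plain Python. The function name `chi_squared_stat` and the example counts are assumptions for illustration and do not come from `searchlog.json`.

```python
def chi_squared_stat(observed):
    """Chi-squared statistic and degrees of freedom for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Made-up counts: rows are instructor / non-instructor, columns are A / B
table = [[10, 20],
         [30, 40]]
stat, dof = chi_squared_stat(table)
print('chisquare value is: ', stat)
print('degree of freedom is: ', dof)
```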


Please explain how to use a Chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

**A.** The hypotheses are $H_0$: `is_instructor` and `search_ui` are not correlated, and $H_1$: `is_instructor` and `search_ui` are correlated. With 1 degree of freedom and a significance level of 0.05, the critical value from a chi-square table is 3.84. Our statistic, 0.673, is smaller than 3.84, so we cannot reject $H_0$; that is, the data give no evidence that `is_instructor` and `search_ui` are correlated.
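
Instead of a table lookup, the p-value for 1 degree of freedom can be computed directly. A chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is `erfc(sqrt(x / 2))`. The sketch below relies on that identity, which holds only for 1 degree of freedom. The function name is hypothetical.

```python
import math

def chi2_p_value_df1(stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    if stat < 0:
        raise ValueError('a chi-squared statistic cannot be negative')
    # P(Z^2 > stat) = P(|Z| > sqrt(stat)) for a standard normal Z
    return math.erfc(math.sqrt(stat / 2))

# The critical value 3.84 corresponds to a p-value of about 0.05
print(chi2_p_value_df1(3.84))
# The statistic computed above
print(chi2_p_value_df1(0.6731740891275046))
```

A p-value above the chosen significance level leads to the same conclusion as comparing the statistic with the critical value.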

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 9.