# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or just happened in the sample by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box attracts more users to the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** to quantify how good an interface is. Here, we choose the mean of `search_count`.

Please write the code to compute **the difference of the `search_count` means between Interface A and Interface B.**

In [1]:
#<-- Write Your Code -->
import pandas as pd
import numpy as np

# Each line of searchlog.json is one JSON record (one user).
data = pd.read_json('searchlog.json', lines=True)
print(data)

# Per-interface summary of search_count: total, number of users, mean, variance, std.
ab_summary2 = data.pivot_table(values='search_count', index='search_ui',
                               aggfunc=[np.sum, len, np.mean, np.var, np.std])
ab_summary2

          uid  is_instructor search_ui  search_count
0     6061521           True         A             2
1    11986457          False         A             0
2    15995765          False         A             0
3     9106912           True         B             0
4     9882383          False         A             0
..        ...            ...       ...           ...
676  16768212          False         B             0
677   7643715           True         A             0
678  14838641          False         A             0
679   6454817          False         A             0
680   9276990          False         B             3

[681 rows x 4 columns]


Summary of `search_count` by `search_ui`:

| search_ui | sum | len | mean | var | std |
|---|---|---|---|---|---|
| A | 231 | 348 | 0.663793 | 2.125832 | 1.458023 |
| B | 266 | 333 | 0.798799 | 2.516625 | 1.586387 |


Suppose we find that the mean value increased by 0.135 from Interface A to Interface B. We then want to know whether this difference is just caused by random variation.
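
As a quick check on the 0.135 figure, the difference can be recomputed from the per-interface totals and user counts in the summary table above. This is only a minimal sketch with those numbers typed in by hand; the variable names are illustrative and not part of the assignment code.

```python
# Totals and user counts copied by hand from the summary table above.
sum_a, n_a = 231, 348   # Interface A: total searches, number of users
sum_b, n_b = 266, 333   # Interface B: total searches, number of users

mean_a = sum_a / n_a
mean_b = sum_b / n_b

# Test statistic: difference of the search_count means (B minus A).
observed_diff = mean_b - mean_a

print(f"mean A = {mean_a:.6f}, mean B = {mean_b:.6f}")
print(f"difference (B - A) = {observed_diff:.3f}")
```

On the full DataFrame, the same quantity would come from grouping `search_count` by `search_ui` and subtracting the two group means.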

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., < 0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation from an existing library. You have to implement it yourself.

In [2]:
#<-- Write Your Code -->
from scipy import stats

data_A = data[data['search_ui'] == 'A']
data_B = data[data['search_ui'] == 'B']

# Sanity check only: Welch's t-test from SciPy, to compare against the permutation test below.
t, p = stats.ttest_ind(data_A.search_count, data_B.search_count, equal_var=False)
print("library t = " + str(t))
print("library p = " + str(p))

def permutation_test(A, B, num_sample=10000):
    """Two-sided permutation test on the absolute difference of means."""
    num_A = len(A)
    diff_A_B = np.abs(np.mean(A) - np.mean(B))   # observed test statistic
    A_B = np.concatenate([A, B])                 # pool both groups
    counter = 0
    for _ in range(num_sample):
        np.random.shuffle(A_B)                   # randomly reassign group labels
        # Count shuffles whose difference is more extreme than the observed one.
        counter += diff_A_B < np.abs(np.mean(A_B[:num_A]) - np.mean(A_B[num_A:]))
    return counter / num_sample

print("p value: ", permutation_test(data_A.search_count, data_B.search_count, 10000))

library t = -1.1548592629736951
library p = 0.2485609905408175
p value:  0.2459
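
To see the pool, shuffle, split, and count steps of a permutation test in isolation, here is a minimal standard-library sketch on made-up data. The function name `permutation_p_value`, the fixed seed, and both sample lists are assumptions for illustration and are not taken from `searchlog.json`. It also counts ties (`>=`) as extreme, which is a slightly more conservative choice than a strict comparison.

```python
import random

def permutation_p_value(a, b, num_samples=10000, seed=0):
    """Two-sided permutation test on the difference of means (illustrative sketch)."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    pooled = list(a) + list(b)             # under the null, group labels are exchangeable
    n_a = len(a)
    observed = abs(sum(a) / n_a - sum(b) / len(b))
    extreme = 0
    for _ in range(num_samples):
        rng.shuffle(pooled)                # randomly reassign group labels
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:               # at least as extreme as what we observed
            extreme += 1
    return extreme / num_samples

# Made-up search counts for two hypothetical groups.
group_a = [0, 0, 1, 0, 2, 0, 0, 1]
group_b = [3, 4, 2, 5, 3, 4, 6, 3]

p_clear = permutation_p_value(group_a, group_b, num_samples=2000)
p_same = permutation_p_value(group_a, group_a, num_samples=2000)
print("clearly different groups:", p_clear)
print("identical groups:", p_same)
```

When the two groups are identical, the observed difference is zero, so every shuffle is at least as extreme and the p-value is 1. When the groups barely overlap, almost no shuffle reproduces the observed gap and the p-value is close to 0.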


Suppose we want to use the same dataset to do another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If using the same dataset to do this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

**A.** I think this is p-hacking: the original experiment collected data from both instructors and non-instructors, and we would only be slicing it down to instructors after seeing the first result. Therefore, we should not reuse the same dataset. If we want to do the A/B analysis again, it is better to collect new data (in this case, from an experiment that shows the UI only to instructors). In general, avoiding p-hacking comes down to awareness, planning ahead, and being open when post-hoc manipulation is legitimately needed.
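
To make the risk concrete, the arithmetic below shows how the chance of at least one false positive grows when a second test is run at the same threshold. It is a sketch under an idealizing assumption (independent tests, all null hypotheses true), which does not hold exactly here because instructors are a subset of all users. The Bonferroni correction shown is one common, conservative adjustment and is not something the assignment prescribes.

```python
alpha = 0.01      # per-test significance level used in this assignment
num_tests = 2     # test 1: all users; test 2: instructors only

# Probability of at least one false positive across the tests,
# assuming independent tests and that every null hypothesis is true.
family_wise_error = 1 - (1 - alpha) ** num_tests

# Bonferroni correction: divide the threshold by the number of tests.
bonferroni_alpha = alpha / num_tests

print(f"family-wise error rate: {family_wise_error:.4f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha}")
```

With two tests the inflation is small, but it grows with every additional slice of the data that gets tested.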

## Task 2. Chi-squared Test 

There are tens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a chi-squared test works by yourself and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared stat. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency). 

In [3]:
#<-- Write Your Code -->
from scipy import stats

data = pd.read_json('searchlog.json', lines=True)
contingency_table = pd.crosstab(
    data['is_instructor'],
    data['search_ui'],
    margins=True
)
print(contingency_table)

# Marginal totals and observed counts of the 2x2 table (the 'All' row/column holds the margins).
row_sums = contingency_table.iloc[0:2, 2].values
col_sums = contingency_table.iloc[2, 0:2].values
total = contingency_table.loc['All', 'All']
f_obs = contingency_table.iloc[0:2, 0:2].values.flatten()

# Expected count under independence: row total * column total / grand total.
f_expected = np.array([row_sum * col_sum / total
                       for row_sum in row_sums
                       for col_sum in col_sums])
chi_squared_statistic = ((f_obs - f_expected) ** 2 / f_expected).sum()
print('Chi-squared Statistic: {}'.format(chi_squared_statistic))

# p-value with (2 - 1) * (2 - 1) = 1 degree of freedom.
p_value = 1 - stats.chi2.cdf(x=chi_squared_statistic, df=1)
print("P value: ", p_value)

### Check with library
f_obs = contingency_table.iloc[0:2, 0:2].values
print("\n\nlibrary value without Yates' correction for continuity: ", stats.chi2_contingency(f_obs, correction=False))
print("library value: ", stats.chi2_contingency(f_obs))

''' To check expected values
from scipy.stats.contingency import expected_freq
print(expected_freq(f_obs))
'''

search_ui        A    B  All
is_instructor               
False          233  213  446
True           115  120  235
All            348  333  681
Chi-squared Statistic: 0.6731740891275046
P value:  0.41194715912043356


library value without Yates' correction for continuity:  (0.6731740891275046, 0.41194715912043356, 1, array([[227.91189427, 218.08810573],
       [120.08810573, 114.91189427]]))
library value:  (0.5473712356215867, 0.459393799574249, 1, array([[227.91189427, 218.08810573],
       [120.08810573, 114.91189427]]))



Please explain how to use the chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

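**A.** The null hypothesis is that `is_instructor` and `search_ui` are independent. We compute the expected count of each cell of the contingency table under independence (row total × column total / grand total), sum (observed − expected)² / expected over all cells to get the chi-squared statistic, and convert it to a p-value using the chi-squared distribution with 1 degree of freedom. Since the p-value (about 0.41) is greater than 1%, we cannot reject the null hypothesis, so there is no evidence that the two variables are associated. Chi-squared is a non-parametric test used to show association between two qualitative variables (in this case, `is_instructor` and `search_ui`), while correlation measures the relationship between two quantitative variables.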
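
The calculation can be sketched with the standard library alone, using the four observed counts from the contingency table printed above. The variable names are illustrative. The `erfc` shortcut for the p-value is valid only for 1 degree of freedom (a 2×2 table), so a larger table would need a general chi-squared distribution.

```python
import math

# Observed counts from the contingency table above
# (rows: is_instructor False/True, columns: search_ui A/B).
observed = [[233, 213],
            [115, 120]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_stat / 2))

# Decision at the 1% level used in this assignment.
reject_null = p_value < 0.01

print("chi-squared statistic:", chi2_stat)
print("p-value:", p_value)
print("reject independence at 1%:", reject_null)
```

The statistic and p-value agree with the uncorrected library result shown earlier (no Yates' continuity correction is applied here).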

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 7.