# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or whether it appeared in the sample just by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box attracts more users to the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** to quantify how good an interface is. Here, we choose the mean of `search_count`.

Please write the code to compute **the difference of the search_count means between Interface A and Interface B.**

In [1]:
import pandas as pd

# read input
df_log = pd.read_json('searchlog.json', lines=True)

# function to compute the difference of means (B - A)
def compute_diff(df, verbose):
    # mean search_count per interface, indexed by search_ui
    # (uses the df argument rather than the global df_log)
    means = df.groupby('search_ui')['search_count'].mean()

    mean_A = float(means['A'])
    mean_B = float(means['B'])

    if verbose:
        print("mean of search_count for interface 'A': %2.3f" % (mean_A))
        print("mean of search_count for interface 'B': %2.3f" % (mean_B))
        print("difference of means: %2.3f" % (mean_B - mean_A))

    return mean_B - mean_A

base_diff = compute_diff(df_log, True)

mean of search_count for interface 'A': 0.664
mean of search_count for interface 'B': 0.799
difference of means: 0.135


The mean increased by 0.135 from Interface A to Interface B. We now ask whether this result could be caused by random variation alone.

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., <0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation in an existing library. You have to implement it by yourself.

In [12]:
import numpy as np
from IPython.display import clear_output

# variables
iterations_count = 10000
verbose_count = 250

# keep a copy of the original column so it can be restored after the test
orig_searchcount = df_log['search_count'].copy()

# get a separate array from the search_count column to shuffle
arr_searchcount = orig_searchcount.to_numpy().copy()

# preallocate an array to keep the difference from each permutation
arr_diff = np.empty(iterations_count)

for i in range(iterations_count):

    # shuffle search counts in place, breaking any link to search_ui
    np.random.shuffle(arr_searchcount)

    # update search_count column after shuffle
    df_log['search_count'] = arr_searchcount

    # print every verbose_count [=250] iterations
    verbose = (i % verbose_count == 0)
    if verbose:
        clear_output(wait=True)
        print('iteration#: %d' % (i))

    # compute diff value after shuffling and store it
    arr_diff[i] = compute_diff(df_log, verbose)

# restore the original, unshuffled search_count column
df_log['search_count'] = orig_searchcount

# number of permuted differences greater than or equal to base diff value [=0.135]
count = np.sum(arr_diff >= base_diff)

# compute p-value (one-sided: B greater than A)
p_value = count / iterations_count

# print output
print('\n')
print('p-value: %2.3f' % (p_value))

if (p_value <= 0.01):
    print('(%2.3f <= 0.01) -> (significant) null hypothesis rejected.' % (p_value))
else:
    print('(%2.3f > 0.01) -> (not significant) failed to reject the null hypothesis.' % (p_value))

iteration#: 9750
mean of search_count for interface 'A': 0.739
mean of search_count for interface 'B': 0.721
difference of means: -0.018


p-value: 0.124
(0.124 > 0.01) -> (not significant) failed to reject the null hypothesis.
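
For comparison, the same one-sided permutation test can be sketched without a DataFrame. This is only an illustration on synthetic data: the function name `permutation_p_value`, the Poisson rates, and the group sizes are assumptions made for the example and are not taken from `searchlog.json`.

```python
import numpy as np

def permutation_p_value(values_a, values_b, num_samples=10000, seed=0):
    """One-sided permutation test for mean(B) - mean(A) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    values_a = np.asarray(values_a, dtype=float)
    values_b = np.asarray(values_b, dtype=float)

    # observed test statistic
    observed = values_b.mean() - values_a.mean()

    # under the null hypothesis the group labels are exchangeable,
    # so pool all values and re-split them at random
    pooled = np.concatenate([values_a, values_b])
    n_a = len(values_a)

    count = 0
    for _ in range(num_samples):
        shuffled = rng.permutation(pooled)
        diff = shuffled[n_a:].mean() - shuffled[:n_a].mean()
        if diff >= observed:
            count += 1

    return observed, count / num_samples

# synthetic data standing in for the two interfaces (assumed values)
data_rng = np.random.default_rng(42)
group_a = data_rng.poisson(0.7, size=300)
group_b = data_rng.poisson(0.8, size=300)

observed, p_value = permutation_p_value(group_a, group_b, num_samples=2000)
print('observed difference: %2.3f' % observed)
print('p-value: %2.3f' % p_value)
```

Pooling and re-splitting the values is equivalent to shuffling `search_count` against `search_ui`, but it leaves the original data untouched.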


Suppose we want to use the same dataset to run another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If we use the same dataset for this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

**A.** <i>Yes. If we run additional analyses on the same dataset until we find something interesting, that is p-hacking. One remedy is to divide the significance level by the number of tests performed (for example, alpha/2 for two tests), which is known as the Bonferroni correction.</i>
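
The adjustment described in the answer can be sketched in a few lines. The helper name `bonferroni_threshold` is hypothetical, and the example assumes exactly two tests on the same dataset (all users, then instructors only) with the 0.01 significance level used in Task 1.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level after a Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests

alpha = 0.01       # significance level used for the first test
num_tests = 2      # assumed: all users, then instructors only
threshold = bonferroni_threshold(alpha, num_tests)

# p-value from the first permutation test above
p_value_all_users = 0.124
print('per-test threshold: %.4f' % threshold)
print('all users significant: %s' % (p_value_all_users <= threshold))
```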

## Task 2. Chi-squared Test 

There are tens of different hypothesis testing methods. It's impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

On the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a Chi-squared test works by yourself and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared stat. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency). 

In [8]:
# get cross tab for is_instructor and search_ui
df_ct = pd.crosstab(df_log['is_instructor'], df_log['search_ui'], margins=True) \
            .rename_axis(None, axis=0) \
            .rename_axis(None, axis=1)

# get a matrix of numbers from cross tab for further usage
matrix = np.array(df_ct, dtype=float)

# get grand total value (bottom-right cell of the cross tab)
grand_total = matrix[-1:,-1:]

# get the 'All' column: total of each is_instructor row
col_all = matrix[:-1,-1:]

# get the 'All' row: total of each search_ui column
row_all = matrix[-1:,:-1]

# get observed matrix values [without total]
o_matrix = matrix[:-1,:-1]

# generate expected matrix
e_matrix = (row_all * col_all) / grand_total

# compute final matrix
f_matrix = e_matrix - o_matrix
f_matrix = np.power(f_matrix, 2)
f_matrix = f_matrix / e_matrix

# compute chi-square and degree of freedom
chi_square = np.sum(f_matrix)
degree_of_freedom = (f_matrix.shape[0] - 1) * (f_matrix.shape[1] - 1)

# print output
print("chi-squared value: %f" % (chi_square))
print("degree of freedom: %d" % (degree_of_freedom))

chi-squared value: 0.673174
degree of freedom: 1
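
As a sanity check, the same statistic can be computed on a small table whose result is easy to verify by hand. This is a sketch only: the function name `chi_squared_stat` and the toy counts are made up for illustration and are unrelated to the searchlog data.

```python
import numpy as np

def chi_squared_stat(observed):
    """Chi-squared statistic and degrees of freedom for a contingency table."""
    observed = np.asarray(observed, dtype=float)

    # marginal totals
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    grand_total = observed.sum()

    # expected count for each cell if rows and columns were independent
    expected = np.outer(row_totals, col_totals) / grand_total

    chi2 = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return chi2, dof

# toy 2x2 table (assumed counts, not from searchlog.json)
toy_table = [[10, 20],
             [30, 40]]

chi2, dof = chi_squared_stat(toy_table)
print("chi-squared value: %f" % chi2)
print("degree of freedom: %d" % dof)
```

For this table the expected counts are 12, 18, 28, and 42, so each cell contributes 4 divided by its expected count, and the statistic is 50/63.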


Please explain how to use the Chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

**A.** <br>
<i>
Ho = is_instructor and search_ui are not correlated.<br>
Ha = is_instructor and search_ui are correlated.<br>

Given a significance level [=0.05], we look up the critical value in a chi-squared distribution table. We find the row that corresponds to the degrees of freedom of the problem [=1] and the column for the significance level [=0.05]. The critical value is 3.841 in this case. <br><br>

Since the calculated chi-squared value 0.673174 < 3.841, the p-value is greater than 0.05, so the result is not significant and we fail to reject the null hypothesis.

We therefore have no evidence that is_instructor and search_ui are correlated.
</i>
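
Instead of a table lookup, the p-value can be computed directly when there is exactly one degree of freedom. A chi-squared variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability equals `erfc(sqrt(x / 2))`. The sketch below relies on that identity and therefore does not apply to other degrees of freedom; the function name `chi2_p_value_1dof` is hypothetical.

```python
import math

def chi2_p_value_1dof(chi2_stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    if chi2_stat < 0:
        raise ValueError("chi2_stat must be non-negative")
    return math.erfc(math.sqrt(chi2_stat / 2.0))

# statistic computed above for is_instructor vs. search_ui
p_value = chi2_p_value_1dof(0.673174)
print("p-value: %f" % p_value)
print("significant at 0.05: %s" % (p_value <= 0.05))
```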

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 7.