# Assignment 9: Hypothesis Testing (Part 1)

## Objective

In many situations, we cannot observe the full population, only a sample. If we derive an interesting result from a sample, how likely is it that the same result holds for the entire population? In other words, we want to know whether the result is a true finding or just happened in the sample by chance. Hypothesis testing aims to answer this fundamental question.


**Hypothesis Testing**
1. Why A/B testing?  
2. What is a permutation test? How to implement it?
3. What is a p-value? How to avoid p-hacking?
4. What is a chi-squared test? How to implement it?


## Task 1. A/B Testing
> Acknowledgment: Thanks to [Greg Baker](http://www.cs.sfu.ca/~ggbaker/) for helping me prepare this task.

A very common technique to evaluate changes in a user interface is A/B testing: show some users interface A, some interface B, and then look to see if one performs better than the other.

Suppose I started an A/B test on CourSys. Here are the two interfaces that I want to compare. I want to know whether a good placeholder in the search box can attract more users to use the `search` feature.


![](img/ab-testing.png)

The provided [searchlog.json](searchlog.json) has information about users' usage. The question I was interested in: is the number of searches per user different?

To answer this question, we first need to pick a **test statistic** to quantify how good an interface is. Here, we choose the mean of `search_count`.

Please write the code to compute **the difference of the search_count means between Interface A and Interface B.**

In [1]:
#<-- Write Your Code -->
import pandas as pd

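# lines=True tells pandas to parse one JSON record per line
data = pd.read_json("searchlog.json", lines=True)
df = data  # read_json already returns a DataFrame; `data` is reused in Task 2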

search_counts_A = df.loc[df['search_ui'] == 'A', 'search_count']
search_counts_B = df.loc[df['search_ui'] == 'B', 'search_count']

# Calculate the mean search_count for interface A
mean_search_count_A = search_counts_A.mean()

# Calculate the mean search_count for interface B
mean_search_count_B = search_counts_B.mean()

# Calculate the difference in search_count means
diff_mean = mean_search_count_B - mean_search_count_A

print(f"The difference in search_count means between interface A and B is: {diff_mean}")

The difference in search_count means between interface A and B is: 0.13500569535052287


Suppose we find that the mean value increased by 0.135. Then, we wonder whether this result is just caused by random variation. 

We define the Null Hypothesis as
 * The difference in search_count mean between Interface A and Interface B is caused by random variation. 
 
The next job is to check whether we can reject the null hypothesis. If we can, we adopt the alternative explanation:
 * The difference in search_count mean between Interface A and Interface B is caused by the design differences between the two.

We compute the p-value of the observed result. If the p-value is low (e.g., < 0.01), we can reject the null hypothesis and adopt the alternative explanation.

Please implement a permutation test (numSamples = 10000) to compute the p-value. Note that you are NOT allowed to use an implementation from an existing library. You have to implement it yourself.

In [2]:
#<-- Write Your Code -->
import numpy as np


numSamples = 10000
diff_means = np.zeros(numSamples)


# Pool both groups once; each iteration reshuffles the pooled values in place
concat_search_counts = np.concatenate([search_counts_A, search_counts_B])

for i in range(numSamples):
    np.random.shuffle(concat_search_counts)

    # Split the shuffled values into two groups of the original sizes
    perm_search_counts_A = concat_search_counts[:len(search_counts_A)]
    perm_search_counts_B = concat_search_counts[len(search_counts_A):]

    # Use the same sign convention as the observed statistic (B minus A)
    diff_means[i] = np.mean(perm_search_counts_B) - np.mean(perm_search_counts_A)

# One-sided p-value: fraction of permuted differences at least as large as the observed one
p_value = np.sum(diff_means >= diff_mean) / numSamples
print(f"The p-value is {p_value}")

The p-value is 0.1268
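
The test above counts only the permuted differences that are at least as large as the observed one, which gives a one-sided p-value. Since the original question asks whether the number of searches is *different*, a two-sided variant may also be of interest. The following is a minimal sketch under stated assumptions, not the assignment's reference solution: the function name `permutation_test`, its `seed` parameter, and the small synthetic arrays are all hypothetical and are not taken from the searchlog data.

```python
import numpy as np


def permutation_test(a, b, num_samples=10000, seed=0):
    """Two-sided permutation test for a difference in means (b minus a).

    Hypothetical helper for illustration; returns (observed, p_value).
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(num_samples):
        # Randomly relabel the pooled observations
        shuffled = rng.permutation(pooled)
        perm_diff = shuffled[len(a):].mean() - shuffled[:len(a)].mean()
        # Two-sided: compare absolute differences
        if abs(perm_diff) >= abs(observed):
            count += 1
    return observed, count / num_samples


# Synthetic counts, for illustration only (NOT the searchlog data)
group_a = [0, 1, 0, 2, 1, 0, 3, 1]
group_b = [1, 2, 0, 3, 2, 1, 4, 1]
observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference = {observed:.3f}, two-sided p-value = {p_value:.4f}")
```

Fixing the seed makes the estimate reproducible; without it, the p-value varies slightly from run to run because only a random subset of all permutations is sampled.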


Suppose we want to use the same dataset to do another A/B test. We suspect that instructors are the ones who can get more useful information from the search feature, so perhaps non-instructors didn't touch the search feature because it was genuinely not relevant to them.

So we decide to repeat the above analysis looking only at instructors.

**Q. If we use the same dataset for this analysis, do you feel like we're p-hacking? If so, what can we do about it?**

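**A.** Yes, this is a form of p-hacking: we keep running new analyses on the same dataset, and every additional test raises the chance of a false positive. To guard against this, we can lower the significance level with a multiple-comparison correction such as Bonferroni (with two tests, compare each p-value against alpha/2). More importantly, we should state the hypotheses and plan the analysis in advance, before looking at the data.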
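
The alpha/2 adjustment mentioned in the answer is an instance of the Bonferroni correction, which divides the significance level by the number of tests run on the same data. A minimal sketch follows; the helper name `bonferroni_alpha` is hypothetical, the base level of 0.01 is the example threshold from Task 1, and the count of two tests assumes only the all-users and instructors-only analyses are performed.

```python
def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level under a Bonferroni correction."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


alpha = 0.01      # example threshold used in Task 1
num_tests = 2     # assumption: all users, then instructors only
adjusted_alpha = bonferroni_alpha(alpha, num_tests)

# p-value reported by the permutation test on all users in Task 1
p_all_users = 0.1268
reject = p_all_users < adjusted_alpha

print(f"adjusted alpha = {adjusted_alpha}")
print("reject H0" if reject else "fail to reject H0")
```

Bonferroni is conservative, so it trades some statistical power for a simple guarantee on the overall false-positive rate.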

## Task 2. Chi-squared Test 

There are dozens of different hypothesis testing methods, and it is impossible to cover all of them in one week. Given that this is an important topic in statistics, I highly recommend using your free time to learn some other popular ones such as the <a href="https://en.wikipedia.org/wiki/Chi-squared_test">Chi-squared test</a>, <a href="https://en.wikipedia.org/wiki/G-test">G-test</a>, <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">T-test</a>, and <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a>.

In the searchlog dataset, there are two categorical columns: `is_instructor` and `search_ui`. In this task, your job is to first learn how a Chi-squared test works on your own and then use it to test whether `is_instructor` and `search_ui` are correlated.

Please write code to compute the Chi-squared statistic. Note that you are **not** allowed to call an existing function (e.g., stats.chi2, chi2_contingency).
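
For reference, the statistic for a contingency table can be written as follows. The notation is an assumption introduced here for illustration: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, $N$ is the grand total, and $r$ and $c$ are the numbers of rows and columns.

$$
E_{ij} = \frac{R_i \, C_j}{N}, \qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\mathrm{df} = (r - 1)(c - 1)
$$

Here $E_{ij}$ is the count expected in each cell if the two variables were independent.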

In [3]:
#<-- Write Your Code -->
# Create a contingency table
contingency_table = pd.crosstab(data['is_instructor'], data['search_ui'])

# Compute the expected frequencies for each cell in the contingency table
row_totals = contingency_table.sum(axis=1)
col_totals = contingency_table.sum(axis=0)
total = sum(row_totals)
expected = np.outer(row_totals, col_totals) / total

# Compute the chi-squared statistic
observed = contingency_table.to_numpy()
chi_squared = np.sum((observed - expected)**2 / expected)

# Compute the degrees of freedom
degrees_of_freedom = (len(row_totals)-1) * (len(col_totals)-1)

print(f"The chi-squared value is {chi_squared}")
print(f"The number of degrees of freedom is {degrees_of_freedom}")


The chi-squared value is 0.6731740891275046
The number of degrees of freedom is 1


Please explain how to use the Chi-squared test to determine whether `is_instructor` and `search_ui` are correlated.

**A.** The null hypothesis is that `is_instructor` and `search_ui` are independent. Once we have calculated the Chi-squared statistic, we compare it against the Chi-squared distribution with the appropriate degrees of freedom. The p-value is the probability of observing a test statistic at least as extreme as the one calculated from our data, assuming that the null hypothesis is true. Equivalently, we can look up a critical value in a Chi-squared distribution table: we find the row for the degrees of freedom (here 1) and the column for the significance level (usually 0.05). This critical value is 3.841.

If the statistic exceeds the critical value (equivalently, if the p-value is less than the chosen significance level), we reject the null hypothesis and conclude that there is a significant association between the two variables. Otherwise, we fail to reject the null hypothesis and conclude that there is insufficient evidence of an association. Here, 0.67317 < 3.841, so we fail to reject the null hypothesis: the data give no evidence that `is_instructor` and `search_ui` are correlated.
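
Instead of a printed table, the p-value for one degree of freedom can be computed directly. This is a minimal sketch that relies on the fact that a Chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability equals `erfc(sqrt(x / 2))`. The helper name `chi2_sf_df1` is hypothetical, and the shortcut is valid only for 1 degree of freedom.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability P(X >= x) for a chi-squared variable with df = 1.

    Hypothetical helper: uses X = Z**2 with Z standard normal, so
    P(X >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if x < 0:
        raise ValueError("the chi-squared statistic cannot be negative")
    return math.erfc(math.sqrt(x / 2.0))


chi_squared = 0.6731740891275046  # statistic reported in this task
alpha = 0.05                      # conventional significance level

p_value = chi2_sf_df1(chi_squared)
print(f"p-value = {p_value:.3f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

As a sanity check, plugging in the critical value 3.841 should give a tail probability very close to 0.05.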

## Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 9.