# 1. Objective

To introduce the notions of random samples and hypothesis testing.

## 1.1 Preliminaries

In [1]:
import pandas as pd
import numpy as np

import scipy.stats as st

import altair as alt

# 2. Random samples

The focus of descriptive statistics is on describing a sample of data. However, when we want to ascertain whether the characteristics of a sample hold in the larger population from which it was drawn, randomness becomes important.


Imagine that you are part of Google Pay's development team, and you have recently rolled out a new feature. You want to understand whether users perceive this feature as delivering a better user experience than your competitor PhonePe. To find out, you plan to take a sample of existing customers and push them a survey comparing your app with PhonePe.

In [None]:
#@title Population of customers using both Google Pay & PhonePe
# Simulated population of 100 customers; values are drawn at random, so they change on every run
payment_preferences = pd.DataFrame({'age': np.random.choice(range(25, 75), 100),
                                    'avg_income_lakh': np.random.choice(range(20, 40), 100),
                                    'preferred_gateway': np.random.choice(['Google Pay', 'PhonePe'], 100)})

(alt.Chart(payment_preferences)
    .encode(alt.X('age:Q',
                  axis=alt.Axis(grid=False)), 
            alt.Y('avg_income_lakh:Q',
                  axis=alt.Axis(grid=False)))
    .mark_circle(size=100)
    .properties(width=600,
                height=400)).interactive()

In [None]:
#@title An extreme scenario
payment_preferences['preferred_gateway_extreme'] = np.where(payment_preferences.age < 45,
                                                             'Google Pay', 'PhonePe')

(alt.Chart(payment_preferences)
    .encode(alt.X('age:Q',
                  axis=alt.Axis(grid=False)), 
            alt.Y('avg_income_lakh:Q',
                  axis=alt.Axis(grid=False)),
            alt.Color('preferred_gateway_extreme',
                      scale=alt.Scale(domain=['PhonePe', 'Google Pay'],
                                      range=['#6739B7', '#2DA94F'])))
    .mark_circle(size=100)
    .properties(width=600,
                height=400)).interactive()


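The biggest contribution of randomness is to remove the impact of any bias that might interfere with the observed variables of interest in the sample. In the extreme scenario above, for instance, a survey sent only to users under 45 would wrongly suggest that everyone prefers Google Pay. A random sample guards against this because **every point in the population has an equal chance of being included in the sample**.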

**Challenge:**
If you were Google Pay, how would you roll out an in-app popup survey to a random sample of your users?

In [None]:
#@title A general scenario
(alt.Chart(payment_preferences)
    .encode(alt.X('age:Q',
                  axis=alt.Axis(grid=False)), 
            alt.Y('avg_income_lakh:Q',
                  axis=alt.Axis(grid=False)),
            alt.Color('preferred_gateway',
                      scale=alt.Scale(domain=['PhonePe', 'Google Pay'],
                                      range=['#6739B7', '#2DA94F'])))
    .mark_circle(size=100)
    .properties(width=600,
                height=400)).interactive()

# 3. Sample-based inference


## 3.1 Formulating hypotheses

From a statistical perspective, we want to ensure that the inference we make from the sample at hand generalizes to the population, assuming that the data are a random sample.

The intuition we have about a variable of interest (e.g., model accuracy in production, user rating of app features) is encoded as the alternative hypothesis.

For example, in this case, we want to establish that the user experience of Google Pay is better than that of PhonePe. This becomes the alternative hypothesis.


Following the tenets of the scientific method, the null hypothesis always encodes a `no difference` scenario. In this example, it means that there is no difference in user experience between Google Pay and PhonePe. Assuming the null hypothesis is true, we draw a random sample and compute the probability of observing data at least as favorable to the alternative hypothesis as the data we actually observed. This probability is called the `p-value`.

It follows from this argument that if the p-value is very small, the observed sample would be a rare occurrence **if the null hypothesis were true**. Hence, the evidence points us to reject the null hypothesis (*'data like this would rarely arise if it were true'*).

How small is small enough? The significance level of the test, $\alpha$, encodes this threshold. It is common practice to choose $\alpha = 0.05$.
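As a minimal sketch of this decision rule, the snippet below turns a standardized test statistic into a one-sided p-value and compares it with $\alpha$. The value `z_observed = 2.1` is a made-up number for illustration, and the sketch assumes the statistic follows a standard normal distribution when the null hypothesis is true.

```python
import scipy.stats as st

alpha = 0.05        # significance level chosen before looking at the data
z_observed = 2.1    # hypothetical standardized test statistic (illustrative only)

# One-sided p-value: probability, under the null hypothesis, of a statistic
# at least as large as the one observed (assumes a standard normal null)
p_value = st.norm.sf(z_observed)

print(f'p-value = {p_value:.4f}')
print('reject H0' if p_value < alpha else 'fail to reject H0')
```

The survival function `sf` is the upper-tail probability `1 - cdf`, which is why it fits a "greater than" alternative; a "less than" alternative would use `st.norm.cdf` instead.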

## 3.2 Sampling

As section 3.1 indicates, sampling (that is, extracting a subset of a population of interest and observing its characteristics) is a crucial step in undertaking *inference*. Here, we infer the parameters of the population from the statistics of the sample.

### 3.2.1 Random sampling

We saw that random sampling is necessary for statistical guarantees to kick in. In simple random sampling, every unit in the population has an equal chance of being included in the sample.
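A simple random sample can be sketched with `DataFrame.sample` from pandas. The `users` table below is a hypothetical population invented for this illustration (it is not the `payment_preferences` data), and the seed is fixed only to make the draw repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 1,000 users (illustrative only)
rng = np.random.default_rng(42)
users = pd.DataFrame({'user_id': np.arange(1000),
                      'age': rng.integers(25, 75, size=1000)})

# Simple random sample without replacement: every user has the same
# chance (50 in 1,000) of being selected
survey_sample = users.sample(n=50, replace=False, random_state=42)

print(survey_sample.head())
print('population mean age:', users.age.mean())
print('sample mean age:', survey_sample.age.mean())
```

Because selection ignores every user attribute, no age or income group is systematically over-represented in the sample.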


### 3.2.2 Sampling variation

When we make inferences about the population using the sample, we need guarantees that the sample characteristics will hold even with repeated sampling. The variation we expect to observe in repeated sampling from the population is called sampling variation.
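Sampling variation can be made visible with a small simulation. The sketch below assumes a made-up population of latency values and repeatedly draws random samples of size 200 from it; the spread of the resulting sample means is the sampling variation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 100,000 latency values in seconds (illustrative only)
population = rng.normal(loc=1.7, scale=0.3, size=100_000)

# Draw 1,000 random samples of size 200 and record the mean of each
sample_means = np.array([
    rng.choice(population, size=200, replace=False).mean()
    for _ in range(1000)
])

print('population mean:', population.mean())
print('mean of the sample means:', sample_means.mean())
print('std of the sample means:', sample_means.std(ddof=1))
```

Each sample mean differs slightly from the population mean, but the means cluster tightly around it.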

## 3.3 The Central Limit Theorem

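The Central Limit Theorem (CLT) links the sample mean to the sampling variation expected around it: for a sufficiently large random sample, the sample mean is approximately normally distributed around the population mean, whatever the shape of the population distribution (provided its variance is finite).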

So, when estimating a population mean, the sample mean becomes the point estimate, and the standard deviation of the sample mean (referred to as the standard error, estimated as $s/\sqrt{n}$ for sample standard deviation $s$ and sample size $n$) allows us to specify the *expected* variation around the sample mean.

We can also speak of the probability that the sample mean falls within a given range, *given a population mean*.
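The CLT can be checked empirically. The sketch below assumes a deliberately skewed (exponential) population with mean 2 and standard deviation 2, chosen only for illustration, and compares the observed spread of the sample means with the predicted standard error $\sigma/\sqrt{n}$.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
n = 100            # size of each sample
n_samples = 5000   # number of repeated samples

# Skewed population: exponential with mean 2 and standard deviation 2
draws = rng.exponential(scale=2.0, size=(n_samples, n))
sample_means = draws.mean(axis=1)

# CLT prediction for the standard deviation of the sample mean
predicted_se = 2.0 / np.sqrt(n)
observed_se = sample_means.std(ddof=1)

print('predicted standard error:', predicted_se)
print('observed standard error:', observed_se)

# The raw draws are strongly skewed; the sample means are far more symmetric
print('skewness of raw draws:', st.skew(draws.ravel()))
print('skewness of sample means:', st.skew(sample_means))
```

Even though individual observations are far from normal, the distribution of the sample means is close to symmetric and its spread matches the predicted standard error.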

## 3.4 A workflow for hypothesis testing

**Step 1:** Encode your intuition as the alternative hypothesis

**Step 2:** Write down the corresponding null hypothesis (this will always be an equality)

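**Step 3:** Compute the p-value from the sample data, assuming the null hypothesis is true. If the p-value is less than the chosen significance level (commonly $\alpha = 0.05$), reject the null hypothesis; otherwise, fail to reject it.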

### 3.4.1 Example 1

Google Pay rolled out an important backend update aimed at reducing the latency of payment confirmation. However, they want to understand whether this is a significant difference from previous versions. The null and alternative hypotheses for this scenario are:


$$
H_0: \text{latency after update} = \text{latency before update} 
$$
$$
H_1: \text{latency after update} < \text{latency before update} 
$$


Testing hypotheses requires appropriate test statistics. When the null hypothesis is true, the test statistic is expected to be close to 0 (signifying no difference).

A natural test statistic in this case would be the standardized difference in sample means.
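One common way to write this statistic, assuming two independent samples and not assuming equal variances (Welch's form; the notation is introduced here for illustration), is:

$$
t = \frac{\bar{x}_{\text{after}} - \bar{x}_{\text{before}}}{\sqrt{\dfrac{s_{\text{after}}^2}{n_{\text{after}}} + \dfrac{s_{\text{before}}^2}{n_{\text{before}}}}}
$$

Here $\bar{x}$, $s$, and $n$ denote the sample mean, sample standard deviation, and sample size of each group. Large negative values of $t$ favor $H_1$.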

Now we collect data to test our hypotheses. In this case, we collect the latency observed on a random sample before the update was rolled out and the latency observed on another random sample after the update was rolled out.

In [None]:
#@title Latency data pre and post update 
latency_before_update = np.random.normal(loc=1.7, 
                                         scale=0.3, 
                                         size=200)

latency_after_update = np.random.normal(loc=1.5, 
                                        scale=0.3, 
                                        size=200)

In [None]:
(latency_before_update.mean(), latency_before_update.std())

(1.7015616647127434, 0.3042769553997559)

In [None]:
(latency_after_update.mean(), latency_after_update.std())

(1.4682866549947369, 0.29202154893595955)
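One possible way to test these hypotheses is Welch's two-sample t-test, sketched below. The data are regenerated with a fixed seed so the block runs on its own, which means the numbers will differ from the output printed above. The `alternative` argument of `st.ttest_ind` requires SciPy 1.6 or later.

```python
import numpy as np
import scipy.stats as st

# Regenerate illustrative data with a fixed seed (assumption: same
# distributions as in the cell above, but a different random draw)
rng = np.random.default_rng(7)
latency_before_update = rng.normal(loc=1.7, scale=0.3, size=200)
latency_after_update = rng.normal(loc=1.5, scale=0.3, size=200)

# Standardized difference in sample means, computed by hand
diff = latency_after_update.mean() - latency_before_update.mean()
se = np.sqrt(latency_after_update.var(ddof=1) / latency_after_update.size
             + latency_before_update.var(ddof=1) / latency_before_update.size)
t_manual = diff / se

# Same test via SciPy; alternative='less' matches H1: after < before
result = st.ttest_ind(latency_after_update, latency_before_update,
                      equal_var=False, alternative='less')

print('t (manual):', t_manual)
print('t (scipy):', result.statistic, 'p-value:', result.pvalue)
print('reject H0' if result.pvalue < 0.05 else 'fail to reject H0')
```

The order of the arguments matters: passing the post-update sample first makes a negative statistic correspond to reduced latency.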

### 3.4.2 Example 2

A model demonstrated an accuracy of 0.7 during the training phase. Before rolling this model out to production, you want to test if this stated accuracy holds even on live data. So, the first step is to set up the null and alternative hypotheses. 


$$
H_0: \text{model accuracy} = 0.7 
$$
$$
H_1: \text{model accuracy} \neq 0.7 
$$


Testing hypotheses requires appropriate test statistics. When the null hypothesis is true, the test statistic is expected to be close to 0 (signifying no difference).

A natural test statistic in this case would be the standardized difference of the sample mean from 0.7.

Now, we collect the data and observe the sample statistics. **Ensure that the hypotheses are stated before the sample data is observed**.

In [4]:
#@title Data from a model tested for 8 weeks = 56 days on random samples of live data
model_accuracy_data = np.random.normal(loc=0.65, 
                                       scale=0.1, 
                                       size=8*7)
model_accuracy_production = pd.Series(model_accuracy_data, 
                                      name='accuracy_in_production',
                                      index=['day_'+ str(i+1) for i in range(8*7)])

In [5]:
model_accuracy_production.describe()

count    56.000000
mean      0.634169
std       0.102853
min       0.392450
25%       0.574229
50%       0.632236
75%       0.695349
max       0.922091
Name: accuracy_in_production, dtype: float64
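As a sketch of how the test could be completed, the block below computes a one-sample t statistic directly from the summary statistics printed above (count, mean, and std) and derives a two-sided p-value. It assumes the daily accuracies are independent and that the t distribution is an adequate approximation.

```python
import numpy as np
import scipy.stats as st

# Summary statistics copied from the describe() output above
n = 56
sample_mean = 0.634169
sample_std = 0.102853      # pandas reports the sample standard deviation (ddof=1)
claimed_accuracy = 0.7     # value under the null hypothesis

# Standardized difference of the sample mean from 0.7
standard_error = sample_std / np.sqrt(n)
t_stat = (sample_mean - claimed_accuracy) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * st.t.sf(abs(t_stat), df=n - 1)

print(f't = {t_stat:.2f}, p-value = {p_value:.2g}')
print('reject H0' if p_value < 0.05 else 'fail to reject H0')
```

With the raw series available, `st.ttest_1samp(model_accuracy_production, popmean=0.7)` should give the same statistic without the manual arithmetic.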

### 3.4.3 Example 3

Google Pay wants to understand whether consumers perceive their app to be better than PhonePe in terms of ease of use. In this scenario, the null and alternative hypotheses are:


$$
H_0: \text{ease of use for Google Pay} = \text{ease of use for PhonePe} 
$$
$$
H_1: \text{ease of use for Google Pay} > \text{ease of use for PhonePe} 
$$

To test the null hypothesis, we will need to collect data on ease of use for Google Pay and PhonePe. One way would be to [survey](https://forms.gle/vmdekzGKejvWFZ7D9) a random sample of users of both apps, asking them to rate each app on a scale measuring ease of use.
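Once responses are in, one possible analysis is a paired t-test, since each respondent rates both apps. The sketch below uses SIMULATED ratings on a 1 to 5 scale purely to show the mechanics; they are not survey results, and the variable names are hypothetical. It also treats the ratings as numeric, which is an assumption (a rank-based test such as `st.wilcoxon` is an alternative for ordinal data).

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
n_respondents = 100

# SIMULATED ease-of-use ratings (1 = hard, 5 = easy); NOT real survey data
google_pay_rating = rng.integers(1, 6, size=n_respondents)
phonepe_rating = rng.integers(1, 6, size=n_respondents)

# Paired statistic computed by hand from the per-respondent differences
differences = google_pay_rating - phonepe_rating
t_manual = differences.mean() / (differences.std(ddof=1) / np.sqrt(n_respondents))

# Same test via SciPy; alternative='greater' matches H1: Google Pay > PhonePe
# (the alternative argument requires SciPy 1.6 or later)
result = st.ttest_rel(google_pay_rating, phonepe_rating, alternative='greater')

print('t (manual):', t_manual)
print('t (scipy):', result.statistic, 'p-value:', result.pvalue)
print('reject H0' if result.pvalue < 0.05 else 'fail to reject H0')
```

Pairing the ratings removes respondent-to-respondent differences in how generously people score, which usually makes the comparison more sensitive than treating the two sets of ratings as independent samples.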

# 4. Summary

- Hypothesis testing involves using a randomly drawn sample to infer the characteristics of the population from which this sample was drawn.
- The null hypothesis is an equality that represents the status quo.
- The alternative hypothesis represents our intuition about the difference.
- Once the null and alternative hypotheses are set up, data are collected to test whether the null hypothesis can be rejected.