# 1. Objective

To introduce the t-test as a tool for testing null hypotheses about one sample, two independent samples, and paired samples.

## 1.1 Preliminaries

In [6]:
import pandas as pd
import numpy as np

import scipy.stats as st

# 2. Testing hypotheses

## 2.1 Example 1

Google Pay rolled out an important backend update aimed at reducing the latency on payment confirmation. However, they want to understand whether the latency after the update is significantly lower than in the previous version. The null and alternative hypotheses for this scenario are:


$$
H_0: \text{latency after update} = \text{latency before update} 
$$
$$
H_1: \text{latency after update} < \text{latency before update} 
$$


Testing hypotheses requires an appropriate test statistic. When the null hypothesis is true, the test statistic is expected to be close to 0 (signifying no difference); values far from 0 are evidence against the null hypothesis.

A natural test statistic in this case would be the standardized difference in sample means (so we could compare it to a standard t-distribution).
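For concreteness, one common form of this statistic is Welch's version, which does not assume that the two groups share a variance. The symbols below ($\bar{x}$ for sample means, $s^2$ for sample variances, $n$ for sample sizes) are notation introduced here for illustration:

$$
t = \frac{\bar{x}_{\text{before}} - \bar{x}_{\text{after}}}{\sqrt{\dfrac{s_{\text{before}}^2}{n_{\text{before}}} + \dfrac{s_{\text{after}}^2}{n_{\text{after}}}}}
$$

Under $H_0$ the numerator is expected to be close to 0; large positive values of $t$ favor $H_1$ (lower latency after the update).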

Now we collect data to test our hypotheses. In this case, we collect the latency observed on a random sample before the update was rolled out and the latency observed on another random sample after the update was rolled out.

In [11]:
#@title Latency data pre and post update 
latency_before_update = np.random.normal(loc=1.7, 
                                         scale=0.3, 
                                         size=200)

latency_after_update = np.random.normal(loc=1.5, 
                                        scale=0.3, 
                                        size=200)

In [12]:
(latency_before_update.mean(), latency_before_update.std())

(1.7091304934719025, 0.2962031812057711)

In [13]:
(latency_after_update.mean(), latency_after_update.std())

(1.4991684270216807, 0.271025206108724)

In [14]:
two_sample_ttest = st.ttest_ind(latency_before_update, latency_after_update,
                                equal_var=False)

In [15]:
two_sample_ttest.pvalue

9.607286083723884e-13
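Note that `st.ttest_ind` reports a two-sided p-value by default, whereas $H_1$ above is one-sided (latency after < latency before). The sketch below shows how a one-sided p-value could be requested with the `alternative` argument (available in SciPy 1.6 and later). The seed and the variable names `before` and `after` are assumptions made here so the example is reproducible; the numbers it prints will not match the outputs recorded above.

```python
import numpy as np
import scipy.stats as st

# Assumption: a fixed seed so the sketch is reproducible.
rng = np.random.default_rng(0)
before = rng.normal(loc=1.7, scale=0.3, size=200)
after = rng.normal(loc=1.5, scale=0.3, size=200)

# Default: two-sided alternative (mean before != mean after).
two_sided = st.ttest_ind(before, after, equal_var=False)

# One-sided alternative matching H1: mean before > mean after.
one_sided = st.ttest_ind(before, after, equal_var=False,
                         alternative='greater')

print("t statistic:       ", two_sided.statistic)
print("two-sided p-value: ", two_sided.pvalue)
print("one-sided p-value: ", one_sided.pvalue)
```

When the statistic is positive, the one-sided p-value is half of the two-sided one. Either way, a p-value well below .05 leads us to reject $H_0$ and conclude that latency is lower after the update.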

## 2.2 Example 2

A model demonstrated an accuracy of 0.7 during the training phase. Before rolling this model out to production, you want to test if this stated accuracy holds even on live data. So, the first step is to set up the null and alternative hypotheses. 


$$
H_0: \text{model accuracy} = 0.7 
$$
$$
H_1: \text{model accuracy} \neq 0.7 
$$


Testing hypotheses requires an appropriate test statistic. When the null hypothesis is true, the test statistic is expected to be close to 0 (signifying no difference); values far from 0 are evidence against the null hypothesis.

A natural test statistic in this case would be the standardized difference of the sample mean from 0.7 (so we could compare it to a standard t-distribution).

Now, we collect the data and observe the sample statistics. **Ensure that the hypotheses are stated before the sample data is observed**.

In [7]:
#@title Data from a model tested for 8 weeks = 56 days on random samples of live data
model_accuracy_data = np.random.normal(loc=0.65, 
                                       scale=0.1, 
                                       size=8*7)
model_accuracy_production = pd.Series(model_accuracy_data, 
                                      name='accuracy_in_production',
                                      index=['day_'+ str(i+1) for i in range(8*7)])

We have 56 daily observations of accuracy here.

In [8]:
model_accuracy_production.describe()

count    56.000000
mean      0.642556
std       0.115290
min       0.429578
25%       0.552529
50%       0.645886
75%       0.713281
max       0.960680
Name: accuracy_in_production, dtype: float64

In [9]:
one_sample_ttest = st.ttest_1samp(model_accuracy_production, 0.7)

In [10]:
one_sample_ttest.pvalue

0.0004566864427527832

The p-value (about 0.0005) is < .05, hence if the null hypothesis were true, there would be only a very small probability of observing a sample mean as far from 0.7 as the 0.643 seen here. Hence, given the evidence, we reject the null hypothesis and conclude that the accuracy in production differs from 0.7; the gap is too large to attribute to sampling variation alone.
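To make the link between the test statistic and the p-value concrete, the sketch below recomputes a one-sample t-test by hand and compares it with `st.ttest_1samp`. The seed and the names `accuracy`, `t_manual` and `p_manual` are assumptions introduced for illustration, so the printed values will differ from the outputs recorded above.

```python
import numpy as np
import scipy.stats as st

# Assumption: a fixed seed so the sketch is reproducible.
rng = np.random.default_rng(42)
accuracy = rng.normal(loc=0.65, scale=0.1, size=8 * 7)

n = accuracy.size
hypothesized_mean = 0.7

# Standardized difference of the sample mean from the hypothesized value.
standard_error = accuracy.std(ddof=1) / np.sqrt(n)
t_manual = (accuracy.mean() - hypothesized_mean) / standard_error

# Two-sided p-value from a t-distribution with n - 1 degrees of freedom.
p_manual = 2 * st.t.sf(abs(t_manual), df=n - 1)

result = st.ttest_1samp(accuracy, hypothesized_mean)

print("manual:", t_manual, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

The two lines should agree up to floating-point error, which shows that the p-value is the probability, under $H_0$, of a t statistic at least as extreme as the one observed.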

## 2.3 Example 3

Google Pay wants to understand if consumers perceive their app to be better than PhonePe in terms of ease of use. In this scenario, the null and alternative hypotheses are:


$$
H_0: \text{ease of use for Google Pay} = \text{ease of use for PhonePe} 
$$
$$
H_1: \text{ease of use for Google Pay} > \text{ease of use for PhonePe} 
$$

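To test the null hypothesis, we will need to collect data on ease of use for Google Pay and PhonePe. One way would be to survey a random sample of people who use both apps and ask each of them to rate both on a scale measuring ease of use. Because every respondent rates both apps, the two sets of ratings are paired rather than independent.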

In [None]:
data_df = pd.read_csv('')

In [None]:
data_df.head()

In [None]:
# One-sided test to match H1: Google Pay > PhonePe (needs SciPy >= 1.6)
paired_ttest = st.ttest_rel(data_df.gpay_eou, data_df.phonepe_eou, alternative='greater')

In [None]:
paired_ttest.pvalue
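Since the survey file is not specified above, here is a minimal sketch on synthetic ratings showing what the paired test does. Everything about the data is hypothetical: the seed, the 100 respondents, the rating distributions and the `ratings` frame are assumptions made for illustration; only the column names `gpay_eou` and `phonepe_eou` come from the cells above. The `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical data: 100 respondents, each rating both apps.
rng = np.random.default_rng(7)
n_users = 100
user_baseline = rng.normal(loc=5.0, scale=1.0, size=n_users)  # per-user leniency

ratings = pd.DataFrame({
    'gpay_eou': user_baseline + rng.normal(loc=0.3, scale=0.5, size=n_users),
    'phonepe_eou': user_baseline + rng.normal(loc=0.0, scale=0.5, size=n_users),
})

# Paired t-test with a one-sided alternative (Google Pay > PhonePe).
paired = st.ttest_rel(ratings.gpay_eou, ratings.phonepe_eou,
                      alternative='greater')

# Equivalent view: a one-sample t-test on the per-user differences against 0.
differences = ratings.gpay_eou - ratings.phonepe_eou
on_differences = st.ttest_1samp(differences, 0, alternative='greater')

print("paired test:        ", paired.statistic, paired.pvalue)
print("test on differences:", on_differences.statistic, on_differences.pvalue)
```

Both calls should give the same statistic and p-value: pairing removes the user-to-user variation in how generously people rate, so only the within-user difference is tested.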

# 3. Summary

t-tests can be used to test hypotheses that compare a parameter of interest with a fixed value or with another parameter. They come in three flavors (one-sample, two-sample and paired-samples), depending on how the hypothesis and data collection are set up.