<a href="https://colab.research.google.com/github/rosslogan702/hypothesis_testing_notes/blob/master/two_sample_t_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hypothesis Testing - 2 Sample T-Test

# Contents

The focus of this notebook is 2 Sample T-Tests.

The notebook will cover the following:

1. Description
2. Manual Calculation
3. Practical Examples using Scipy Library
4. Assumptions & Notes

# 1. Description

A 2 Sample T-Test compares the means of two independent sets of data, each assumed to be approximately normally distributed, to determine whether the difference between those means is statistically significant.
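
As a sketch of what the test computes, the equal-variance (Student) form of the statistic is shown below. The symbols are introduced here for illustration and are not taken from the notebook: $\bar{x}_1, \bar{x}_2$ are the sample means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the sample sizes.

$$
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
$$

The statistic is compared against a t distribution with $n_1 + n_2 - 2$ degrees of freedom. The larger $|t|$ is, the smaller the p-value.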

### Example

Suppose that an online store sold an average of 210 items per day last week.

This week the average is 285 items per day. Did average daily sales really change, or is the difference just part of natural day-to-day fluctuation?

A 2 Sample T-Test on the daily sales figures from the two weeks can help answer this.
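
A minimal sketch of how this example might look in code. The daily figures are hypothetical, invented here so that the weekly means match the 210 and 285 quoted above; only those two means come from the example.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical daily sales (items per day); only the weekly means come from the text
last_week = [195, 205, 220, 210, 198, 215, 227]  # mean = 210
this_week = [270, 290, 301, 280, 275, 295, 284]  # mean = 285

# Two-sided 2 Sample T-Test on the daily figures
t_stat, p_value = ttest_ind(last_week, this_week)

print('Last week mean: {}'.format(np.mean(last_week)))
print('This week mean: {}'.format(np.mean(this_week)))
print('t statistic: {:.3f}, p-value: {:.6f}'.format(t_stat, p_value))
```

Note that the test needs the individual daily values, not just the two averages, because it weighs the difference in means against the spread within each week.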



# 3. Practical Examples using Scipy Library

## Example 1 - Audio Book App Usage Time

A new audio book app collects usage data from two different weeks. The data is the listening time, in minutes, of its users. It finds that the average listening time per user was x minutes in the first week and y minutes in the second week.

Did the average usage time per user change or is this part of natural fluctuations?

In [0]:
# Times in minutes of audio book app usage listening time from week 1
week_1_times = [23.90506824, 26.67631982, 27.27433886, 24.25757125, 
                32.40423483, 39.56919357, 23.07010059, 29.82068109,
                27.59433809, 28.05639569, 27.06757262, 30.41192979,
                25.71358554, 24.94294823, 28.23123807, 24.95337555,  
                18.51231639, 27.46234762, 28.38016611, 13.91205901,
                29.02615866, 26.90746774, 22.8677726,  24.8938289,  
                25.96947935, 26.86869621, 20.72676456, 27.35988314,
                20.68408581, 21.19846143, 16.25800931, 23.92517681,  
                24.47923229, 29.47050863, 27.28425372, 26.93339272,  
                28.61026924, 18.88377042, 33.65468651, 25.69470077,
                20.98291356, 22.69700387, 28.60278855, 21.36000443,  
                30.77685156, 20.83415999, 23.79367158, 19.7556718,
                29.54421084, 20.1433138]

In [0]:
# Times in minutes of audio book app usage listening time from week 2
week_2_times = [18.63431907, 31.28788036, 34.96797943, 21.81678117, 
                28.21619974, 39.39313736, 35.52223207, 27.54222109, 
                33.64395433, 25.31673581, 28.81392191, 30.7358016, 
                26.37241881, 26.0945555, 26.34073477, 19.42196017, 
                32.58797652, 24.84001926, 28.93348335, 20.43667584, 
                22.72495967, 32.31728012, 35.384306, 29.66709637, 
                24.53512973, 30.91406007, 19.56117513, 24.90816833, 
                30.13163726, 31.47466199, 27.77683598, 16.51307462, 
                35.0770162, 31.74818107, 36.36053496, 27.70500593, 
                29.49869936, 27.65575346, 37.18504075, 25.16055104, 
                29.26553553, 38.22163057, 28.92102091, 24.8215439, 
                38.30155495, 34.76020645, 22.26869162, 28.82593733, 
                32.00975127, 36.46437665]

## Step 1 - Define Null Hypothesis and Alternative Hypothesis

**Null Hypothesis**: The mean listening time for the audio book app is the same in week 1 and week 2.

**Alternative Hypothesis**: The mean listening time for the audio book app differs between week 1 and week 2.
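
In symbols, assuming $\mu_1$ and $\mu_2$ (notation introduced here) denote the true mean listening times in week 1 and week 2, the two-sided hypotheses can be written as:

$$
H_0: \mu_1 = \mu_2
\qquad
H_1: \mu_1 \neq \mu_2
$$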

## Step 2 - Prepare Data & Run Test

In [0]:
from scipy.stats import ttest_ind
import numpy as np

In [0]:
# Collect statistics about each week
week_1_mean = np.mean(week_1_times)
week_2_mean = np.mean(week_2_times)

# Note: np.std defaults to ddof=0 (population standard deviation)
week_1_std = np.std(week_1_times)
week_2_std = np.std(week_2_times)

In [0]:
# Print statistics
print('Week 1 mean: {}'.format(week_1_mean))
print('Week 1 std:  {}'.format(week_1_std))
print('Week 2 mean: {}'.format(week_2_mean))
print('Week 2 std:  {}'.format(week_2_std))

Week 1 mean: 25.4480593952
Week 1 std:  4.531693386680561
Week 2 mean: 29.0215681076
Week 2 std:  5.497966708987187
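
The standard deviations printed above come from `np.std` with its default `ddof=0`, which divides by n. The t-test itself works with the sample variance, which divides by n - 1. A small sketch of the difference, using hypothetical values rather than the listening times:

```python
import numpy as np

# Hypothetical values chosen so the arithmetic is easy to follow
values = [1.0, 2.0, 3.0, 4.0]

population_std = np.std(values)      # ddof=0: divide by n
sample_std = np.std(values, ddof=1)  # ddof=1: divide by n - 1

print('Population std (ddof=0): {:.4f}'.format(population_std))
print('Sample std (ddof=1):     {:.4f}'.format(sample_std))
```

With 50 observations per week the two versions differ only slightly, but `ddof=1` is the one that matches what the test uses internally.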


In [0]:
# Run the test using scipy (two-sided by default; equal_var=True assumes equal variances)
_, pval = ttest_ind(week_1_times, week_2_times)
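
To make the library call less of a black box, here is a minimal sketch of the same calculation done by hand, assuming the equal-variance (pooled) form that `ttest_ind` uses by default. The helper `pooled_t_test` and the two short samples are hypothetical and only serve the illustration.

```python
import math
from scipy.stats import t as t_dist, ttest_ind

def pooled_t_test(a, b):
    """Two-sided, equal-variance 2 Sample T-Test computed by hand (illustrative helper)."""
    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2
    # Sample variances (divide by n - 1)
    var1 = sum((x - mean1) ** 2 for x in a) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in b) / (n2 - 1)
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / dof
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    t_stat = (mean1 - mean2) / std_err
    # Two-sided p-value from the t distribution's survival function
    p_value = 2 * t_dist.sf(abs(t_stat), dof)
    return t_stat, p_value

# Hypothetical short samples (minutes)
sample_a = [23.9, 26.7, 27.3, 24.3, 32.4, 39.6]
sample_b = [18.6, 31.3, 35.0, 21.8, 28.2, 39.4]

manual_t, manual_p = pooled_t_test(sample_a, sample_b)
scipy_t, scipy_p = ttest_ind(sample_a, sample_b)

print('manual: t={:.6f}, p={:.6f}'.format(manual_t, manual_p))
print('scipy:  t={:.6f}, p={:.6f}'.format(scipy_t, scipy_p))
```

The two lines should agree up to floating-point rounding.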

## Step 3 - Collect & Analyse Results

In [0]:
print('pval: {:.6f}'.format(pval))
# Compare against a 5% significance level
if pval < 0.05:
    print("Result is statistically significant! The listening time between week 1 and week 2 has changed!")
else:
    print("Result is not statistically significant! There is not enough evidence that the listening time changed between week 1 and week 2.")

pval: 0.000677
Result is statistically significant! The listening time between week 1 and week 2 has changed!


# 4. Assumptions & Notes

The order in which the two datasets are passed to `ttest_ind` does not affect the conclusion. The test is two-sided by default, so swapping the arguments only flips the sign of the t statistic and leaves the p-value unchanged.
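
A quick way to see this is to run the test in both orders. The two short samples below are hypothetical:

```python
from scipy.stats import ttest_ind

# Hypothetical short samples (minutes)
group_a = [23.9, 26.7, 27.3, 24.3, 32.4]
group_b = [18.6, 31.3, 35.0, 21.8, 28.2]

t_ab, p_ab = ttest_ind(group_a, group_b)
t_ba, p_ba = ttest_ind(group_b, group_a)

# The t statistic changes sign; the two-sided p-value stays the same
print('a vs b: t={:.6f}, p={:.6f}'.format(t_ab, p_ab))
print('b vs a: t={:.6f}, p={:.6f}'.format(t_ba, p_ba))
```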

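Both sets of data are assumed to be approximately normally distributed and independent of each other. With its default setting (`equal_var=True`), `ttest_ind` also assumes that both populations have the same variance; passing `equal_var=False` runs Welch's t-test, which does not make that assumption.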
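
One possible way to sanity-check these assumptions is sketched below. It uses synthetic data generated here for illustration, not the listening times from this notebook. The choice of the Shapiro-Wilk test for normality and Levene's test for equal variances is an assumption of this sketch; other checks, such as histograms or Q-Q plots, may be just as appropriate.

```python
import numpy as np
from scipy.stats import levene, shapiro, ttest_ind

# Synthetic samples for illustration only
rng = np.random.default_rng(0)
sample_1 = rng.normal(loc=25, scale=5, size=50)
sample_2 = rng.normal(loc=29, scale=5, size=50)

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed
_, normal_p_1 = shapiro(sample_1)
_, normal_p_2 = shapiro(sample_2)

# Levene: a small p-value suggests the two variances differ
_, equal_var_p = levene(sample_1, sample_2)

# Welch's t-test does not assume equal variances
_, welch_p = ttest_ind(sample_1, sample_2, equal_var=False)

print('Shapiro p-values: {:.4f}, {:.4f}'.format(normal_p_1, normal_p_2))
print('Levene p-value:   {:.4f}'.format(equal_var_p))
print('Welch p-value:    {:.6f}'.format(welch_p))
```

If Levene's test points to unequal variances, Welch's version (`equal_var=False`) is generally the safer choice.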