# Week 5: Statistics


## PART B - Lab Session 5

## 1. Background

### 1.1. Simple guide to statistical testing

In the exercises, we will work through the following questions.

Given a random sample $\lbrace x_i \rbrace$ of size $n$:

1) What kind of variable is $x$: _discrete_ or _continuous_?

2) Is it a question of the form _"Is the value of $x$ ..."_?

3) Are the data normally distributed?


### 1.2. Hypothesis tests

This is a summary of hypothesis tests we have seen in the hands-on practice Jupyter notebook:

1) Test for population proportions:

   * `statsmodels.stats.proportion.proportions_ztest()`
   

2) Test that the mean of __one sample__ equals a given population mean, when the population variance is unknown:

   * `scipy.stats.ttest_1samp()`
   

3) Test that the means of __two independent samples__ are equal, assuming that the populations have identical variances:

   * `scipy.stats.ttest_ind()`
   

4) Test that the means of __two dependent (paired) samples__ are equal, assuming that the paired differences are approximately normally distributed:

   * `scipy.stats.ttest_rel()`
   

5) Test that samples are drawn from a normal distribution

   * `scipy.stats.shapiro()`
   

6) Test that samples are drawn from populations with equal variances

   * `scipy.stats.levene()`
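
As a rough illustration of how the SciPy tests listed above are called, here is a minimal sketch on made-up data. The arrays `a` and `b`, the seed, the hypothesised mean of 50 and the 0.05 threshold are all assumptions chosen for the example; they do not come from the exercises.

```python
import numpy as np
from scipy import stats

# Hypothetical data, generated only to illustrate the calls
rng = np.random.default_rng(0)
a = rng.normal(loc=50, scale=5, size=30)
b = rng.normal(loc=52, scale=5, size=30)

# Check the assumptions first
_, p_normal_a = stats.shapiro(a)       # H0: a is drawn from a normal distribution
_, p_normal_b = stats.shapiro(b)       # H0: b is drawn from a normal distribution
_, p_equal_var = stats.levene(a, b)    # H0: the populations have equal variances

# One sample against a hypothesised population mean of 50
_, p_one = stats.ttest_1samp(a, popmean=50)

# Two independent samples; use Welch's variant if the variances look unequal
_, p_ind = stats.ttest_ind(a, b, equal_var=bool(p_equal_var > 0.05))

# Two dependent (paired) samples: a[i] and b[i] are treated as the same subject
_, p_rel = stats.ttest_rel(a, b)

p_values = {
    'shapiro(a)': p_normal_a,
    'shapiro(b)': p_normal_b,
    'levene(a, b)': p_equal_var,
    'ttest_1samp(a, 50)': p_one,
    'ttest_ind(a, b)': p_ind,
    'ttest_rel(a, b)': p_rel,
}
for name, p_value in p_values.items():
    print(f'{name:>20}: p-value = {p_value:.3f}')
```

In each case, a small p-value is evidence against the null hypothesis stated in the corresponding list item.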
   

## Exercises

In [4]:
import pandas as pd
import numpy as np

from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Configure fonts for all subsequent plots
plt.rc('font', **{'family':'sans-serif','sans-serif':['Arial'], 'size': 16})

## Exercise 1
### Rolling dice


A die is rolled 1000 times and the following results are observed. 

For a fair die, we would expect each value to appear in a proportion of $\frac{1}{6}$ of the rolls, with a standard error of $0.01179$ over $1000$ rolls.
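
The value $0.01179$ quoted above can be reproduced from the usual formula for the spread of a sample proportion, $\sqrt{p_0 (1 - p_0) / n}$. The short check below is only a sketch; the variable names `n`, `p0` and `se` are chosen for illustration.

```python
import math

n = 1000      # number of rolls
p0 = 1 / 6    # proportion expected for each face of a fair die

# Spread of the sample proportion under the assumption of a fair die
se = math.sqrt(p0 * (1 - p0) / n)

print(f'standard error = {se:.5f}')
```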


__Is there evidence to suggest that the die is biased?__

In [5]:
np.random.seed(123456789)

results = np.random.choice([1, 2, 3, 4, 5, 6], 1000)

values, counts = np.unique(results, return_counts=True)

np.asarray((values, counts)).T

array([[  1, 179],
       [  2, 172],
       [  3, 158],
       [  4, 156],
       [  5, 166],
       [  6, 169]], dtype=int64)

In [6]:
# Suggested solution
import statsmodels.stats.proportion as pr

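# Test whether the proportion of ones (179 out of 1000 rolls) differs from 1/6
z, p = pr.proportions_ztest(count=179,
                            nobs=1000,
                            value=1/6,
                            alternative='two-sided')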

print(f'p-value is {p:.3f}')

p-value is 0.309
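
To see where this p-value comes from, the z-statistic can be computed by hand. The sketch below uses only the standard library and compares two choices for the variance: one estimated from the sample proportion, and one taken from the null hypothesis. The reported p-value of 0.309 is consistent with the first choice, which, as far as we can tell from its documentation, is what `proportions_ztest()` does unless `prop_var` is set. The helper `two_sided_p` is a name introduced here for illustration.

```python
import math

count, n, p0 = 179, 1000, 1 / 6    # number of ones, rolls, hypothesised proportion
p_hat = count / n                  # observed proportion of ones


def two_sided_p(z):
    """Two-sided p-value of a z-statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Variance estimated from the sample proportion
se_sample = math.sqrt(p_hat * (1 - p_hat) / n)
z_sample = (p_hat - p0) / se_sample
p_sample = two_sided_p(z_sample)

# Variance taken from the null hypothesis (fair die)
se_null = math.sqrt(p0 * (1 - p0) / n)
z_null = (p_hat - p0) / se_null
p_null = two_sided_p(z_null)

print(f'sample-based variance: z = {z_sample:.3f}, p-value = {p_sample:.3f}')
print(f'null-based variance:   z = {z_null:.3f}, p-value = {p_null:.3f}')
```

Both versions lead to the same conclusion here: there is no evidence at the usual 5% level that the proportion of ones differs from $\frac{1}{6}$. Note that this checks a single face only, not the die as a whole.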


## Exercise 2
### One-sample ___t-test___

The average systolic blood pressure is $120$ mmHg. 
After sampling $100$ data scientists, we found an average of $132.3$ mmHg with a standard deviation of $19$ mmHg:

In [7]:
np.random.seed(123456789)

samples = np.random.normal(loc=130, scale=20, size=100)
print("{} samples with mean {:.1f} and standard deviation {:.1f}".format(samples.size,
                                                                         samples.mean(),
                                                                         samples.std()))

100 samples with mean 132.3 and standard deviation 19.0


In [8]:
# Your solution


## Exercise 3 
### Weight watchers

We measure the weight of 9 individuals before and after they have been following a new diet for a week. 

In [9]:
data = {
    'before': [118, 126, 134, 162, 145, 188, 173, 125, 137],
    'after':  [123, 131, 132, 159, 138, 182, 170, 128, 137]
}

df = pd.DataFrame(data)
df

|     | before | after |
| --: | -----: | ----: |
|   0 |    118 |   123 |
|   1 |    126 |   131 |
|   2 |    134 |   132 |
|   3 |    162 |   159 |
|   4 |    145 |   138 |
|   5 |    188 |   182 |
|   6 |    173 |   170 |
|   7 |    125 |   128 |
|   8 |    137 |   137 |


__What can we conclude about the diet?__

In [10]:
# Your solution


## Exercise 4. Mrs Brown's bakery

#### Based on _cimt.org.uk_

Mrs Brown owns a small bakery on Baker Street. She believes that, by keeping her windows open, the smell of freshly-baked goods encourages passers-by to buy her products. She recorded the following sales (in £):

| Windows closed | Windows opened |
| -------------: | -------------: |
|          193.5 |          202.0 |
|          192.2 |          204.5 |
|          199.4 |          207.0 |
|          177.6 |          215.5 |
|          205.4 |          190.8 |
|          200.6 |          215.6 |
|          181.8 |          208.8 |
|          169.2 |          187.8 |
|          172.2 |          204.1 |
|          192.8 |          185.7 |



In [11]:
data = {
    'closed': [193.5, 192.2, 199.4, 177.6, 205.4, 200.6, 181.8, 169.2, 172.2, 192.8],
    'opened': [202.0, 204.5, 207.0, 215.5, 190.8, 215.6, 208.8, 187.8, 204.1, 185.7]
}

df = pd.DataFrame(data)
df

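|     | closed | opened |
| --: | -----: | -----: |
|   0 |  193.5 |  202.0 |
|   1 |  192.2 |  204.5 |
|   2 |  199.4 |  207.0 |
|   3 |  177.6 |  215.5 |
|   4 |  205.4 |  190.8 |
|   5 |  200.6 |  215.6 |
|   6 |  181.8 |  208.8 |
|   7 |  169.2 |  187.8 |
|   8 |  172.2 |  204.1 |
|   9 |  192.8 |  185.7 |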


__Should Mrs Brown keep her store's windows open?__

In [12]:
# Your solution


## Exercise 5. Teaching programming

#### Based on _cimt.org.uk_

Mrs Kooner decided to test whether her students have a better understanding of the theory or the practice of Data Science. Her exam paper therefore consisted of one theoretical question and one practical question.

The results for each student are as follows:

In [13]:
data = {
    'Theory':   [72, 82, 93, 65, 76, 89, 81, 58, 95, 91],
    'Practice': [75, 79, 84, 71, 82, 91, 85, 68, 90, 92]
}

df = pd.DataFrame(data)
df

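|     | Theory | Practice |
| --: | -----: | -------: |
|   0 |     72 |       75 |
|   1 |     82 |       79 |
|   2 |     93 |       84 |
|   3 |     65 |       71 |
|   4 |     76 |       82 |
|   5 |     89 |       91 |
|   6 |     81 |       85 |
|   7 |     58 |       68 |
|   8 |     95 |       90 |
|   9 |     91 |       92 |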


__Is the difference significant?__

In [14]:
# Your solution


## Exercise 6. Netflix movies

Let's assume that the average Netflix movie is $99.1$ minutes long.

Looking into recently added movies this week, we observe movies with the following duration (in minutes):

\begin{equation}
106, 75, 136,  95, 112, 101, 104, \\
 94, 97, 123, 115, 115, 132, 125
\end{equation}


__Is our assumption valid?__

In [15]:
# Your solution