
# Introduction to Statistical Inference

---

## Learning Objectives / Agenda

#### How much does my data actually tell me about the world?

- Define a confidence interval and a p-value
- Understand the theory of hypothesis testing
- Know how to perform hypothesis tests and how to calculate confidence intervals and p-values using Python
- Articulate the main considerations of study design and the problem with p-values

#### Why do things happen and how do we know?

- Define correlation and calculate it using Python
- Create appropriate plots to visualise correlations with Python
- Describe the difference between correlation and causation
- Articulate some ways to test for causation

# Part 1

## How much does my data actually tell me about the world?

Imagine you want to know the average height of people in your school.

Since it's not practical to measure all the people in your school, you **sample** 100 people at **random** and measure their heights. The mean and standard deviation of these heights are 1.5m and 0.1m.

- The size of the sample is very important because it determines the standard error of the mean: the larger the sample, the more precisely its mean estimates the true mean
- The randomness of the sample is very important to avoid bias in the estimate

Depending on the sample size and how random the sample is, you'll be more or less confident in what your data tells you.
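To see why sample size matters, the sketch below repeatedly draws samples of different sizes and measures how much the sample means vary. The population values (mean 1.5m, standard deviation 0.1m) and the helper name `spread_of_sample_means` are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population, assumed for illustration only
population_mean, population_sd = 1.5, 0.1


def spread_of_sample_means(n, repeats=2000):
    """Draw `repeats` samples of size n and return the std of their means."""
    samples = rng.normal(population_mean, population_sd, size=(repeats, n))
    return samples.mean(axis=1).std(ddof=1)


for n in (10, 100, 1000):
    observed = spread_of_sample_means(n)
    theoretical = population_sd / np.sqrt(n)  # standard error of the mean
    print(f"n={n:>4}: observed spread {observed:.4f}, sd/sqrt(n) {theoretical:.4f}")
```

The observed spread should track `sd / sqrt(n)` closely, so a sample ten times larger gives a mean that is roughly three times more precise.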

**How confident are you that people are 1.5m tall on average?**

Confidence intervals give you a tool to measure that.

They're "intervals" because your confidence is tied to a **range** of values.

You'd report something like *"based on my sample, people are on average between 1.3m and 1.7m tall, with 95% confidence."*

Where do those numbers come from?

In [3]:
from scipy import stats

stats.norm.interval(0.95, loc=1.5, scale=0.1)

(1.3040036015459946, 1.6959963984540054)
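Note that `scale=0.1` in the call above is the sample standard deviation, so that interval describes where roughly 95% of individual heights would fall under a normal model. For an interval on the *mean* itself, a common choice is to use the standard error (standard deviation divided by the square root of the sample size) as the scale. The sketch below assumes the sample of 100 people from the example; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

n = 100            # sample size from the example
sample_mean = 1.5  # metres
sample_sd = 0.1    # metres

# Standard error of the mean: how much the sample mean itself is expected to vary
standard_error = sample_sd / np.sqrt(n)

# Normal approximation
ci_normal = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)

# Student's t with n - 1 degrees of freedom, slightly wider for small samples
ci_t = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=standard_error)

print("95% CI for the mean (normal):", ci_normal)
print("95% CI for the mean (t):     ", ci_t)
```

With 100 people the interval for the mean is about ten times narrower than the one computed from the raw standard deviation.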

What does changing the 95% to 90% or 99% do?

In [2]:
from scipy import stats

print("90% CI:", stats.norm.interval(0.9, loc=1.5, scale=0.1))
print("99% CI:", stats.norm.interval(0.99, loc=1.5, scale=0.1))

90% CI: (1.3355146373048528, 1.6644853626951472)
99% CI: (1.24241706964511, 1.75758293035489)


What if we have a confidence interval of 10% or 99.999%?

In [3]:
print("10% CI:", stats.norm.interval(0.10, loc=1.5, scale=0.1))
print("99.999% CI:", stats.norm.interval(0.99999, loc=1.5, scale=0.1))

10% CI: (1.4874338653144925, 1.5125661346855075)
99.999% CI: (1.0582826586529994, 1.9417173413467606)


What does this interval mean exactly?

The intervals above come from a normal distribution with mean 1.5m and standard deviation 0.1m, so they describe where heights are expected to fall under that model:

- At 10%, about 10% of heights are expected to lie between 1.49m and 1.51m
- At 90%, about 90% of heights are expected to lie between 1.34m and 1.66m

...etc

...but

a 10% interval is very narrow and leaves the other 90% of values unaccounted for: they could fall anywhere outside it.

So, the higher the confidence level, the wider the interval has to be, and the more likely it is to capture the value you are trying to pin down.

http://rpsychologist.com/d3/CI/
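One way to read "95% confidence" is as a statement about the procedure: if we repeated the sampling many times and built an interval each time, about 95% of those intervals would contain the true mean. The simulation below is a rough sketch of that idea, with a made-up "true" population (mean 1.5m, standard deviation 0.1m) that we would not know in practice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical "true" population values, assumed for illustration only
true_mean, true_sd = 1.5, 0.1
n, repeats, confidence = 100, 1000, 0.95

hits = 0
for _ in range(repeats):
    sample = rng.normal(true_mean, true_sd, n)
    # Interval for the mean, built from this sample alone
    low, high = stats.t.interval(
        confidence, df=n - 1, loc=sample.mean(), scale=stats.sem(sample)
    )
    hits += int(low <= true_mean <= high)

coverage = hits / repeats
print(f"{coverage:.1%} of the {repeats} intervals contain the true mean")
```

The printed coverage should land close to the chosen confidence level, though it will wobble a little from run to run.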

## How sure can I be of my findings?

In doing science, we always want to err on the side of being sceptical.

If you measure a difference between things, or an effect of X on Y, you start by assuming it's due to chance.

Then you use statistical tools to check whether the data suggests otherwise.

### Example

I want to find out if there's a significant height difference between horse jockeys and players from the NBA.

The way we frame this in hypothesis testing is we have **two** hypotheses.

The **null** hypothesis $H_0$, which assumes there is **no** difference (or no effect of X on Y)

The **alternate** hypothesis $H_1$, which assumes there **is** a difference (or an effect)

What are my hypotheses in this case?

$H_0$: there is no difference between jockeys and basketball players.

$H_1$: there **is** such a difference.

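Then we need to decide on a **significance level** (often written $\alpha$): the threshold below which we treat a result as too unlikely to be explained by chance. A common choice is 0.05.

The test then gives us a **p-value**, which is the probability of seeing a difference at least as large as the one we measured, **assuming the null hypothesis is true**:
- if the p-value is small (< 0.05), data like ours would be unlikely under the null hypothesis, so we reject it
- if the p-value is not small enough (≥ 0.05), the null hypothesis is not necessarily right, but we don't have enough evidence to reject it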
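To make the definition concrete, here is a sketch of a permutation test. This is a different technique from the t-test used in this notebook, chosen only because it spells out "how often would chance alone produce a gap this large?". The two groups are made-up data, and the names `group_a`, `group_b` and `n_shuffles` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical heights in cm, made up for illustration
group_a = rng.normal(150, 10, 50)
group_b = rng.normal(155, 10, 50)

observed_gap = abs(group_a.mean() - group_b.mean())
pooled = np.concatenate([group_a, group_b])

# Under the null hypothesis the group labels are interchangeable,
# so we shuffle them and see how often a gap this large appears.
n_shuffles = 5000
count_extreme = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    gap = abs(shuffled[:50].mean() - shuffled[50:].mean())
    count_extreme += int(gap >= observed_gap)

p_value = count_extreme / n_shuffles
alpha = 0.05  # significance level chosen before looking at the data

print(f"observed gap: {observed_gap:.2f} cm, permutation p-value: {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The p-value here is simply the fraction of label shuffles that produce a gap at least as large as the observed one.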

For comparing the means of two groups we can use a t-test.

https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind
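As a rough sketch of what the two-sample t-test computes, the block below builds the t statistic by hand and compares it with `scipy.stats.ttest_ind`. It assumes equal variances in the two groups (the default for `ttest_ind`), and the data is made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical heights in cm, made up for illustration
a = rng.normal(150, 10, 100)
b = rng.normal(154, 10, 100)

n_a, n_b = len(a), len(b)

# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)

# t statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution with n_a + n_b - 2 degrees of freedom
dof = n_a + n_b - 2
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The two lines should agree up to floating-point error. If the groups have clearly different variances, passing `equal_var=False` to `ttest_ind` runs Welch's t-test instead.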

In [13]:
import pandas as pd
import numpy as np

df = pd.DataFrame()

np.random.seed(42)

df["jockeys"] = np.random.normal(150, 10, 100)
df["jockeys_2"] = np.random.normal(150, 10, 100)
df["basketball_players"] = np.random.normal(190, 10, 100)

df.describe()

| | jockeys | jockeys_2 | basketball_players |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 148.961535 | 150.223046 | 190.648963 |
| std | 9.081684 | 9.53669 | 10.842829 |
| min | 123.802549 | 130.812288 | 157.587327 |
| 25% | 143.990943 | 141.943395 | 183.445565 |
| 50% | 148.730437 | 150.841072 | 190.976957 |
| 75% | 154.059521 | 155.381704 | 197.044374 |
| max | 168.522782 | 177.201692 | 228.527315 |


In [14]:
from scipy import stats

t_statistic, p_value = stats.ttest_ind(df["jockeys"],df["basketball_players"])

print(p_value)

2.452490316264101e-74


That's a very small number. It means that, if there were really no height difference between the two groups, a gap this large would be extremely unlikely to arise by chance.

Let's try with another random set of jockeys.

In [15]:
t_statistic_2, p_value_2 = stats.ttest_ind(df["jockeys"], df["jockeys_2"])

print(p_value_2)

0.3392652865361483


In the first case, the p-value was tiny.

That means that **assuming the null hypothesis** (which we always do), i.e. "there is no difference between the groups"...

it would be **extremely unlikely** to get two samples with such different means **purely by chance**.

Therefore the difference is **statistically significant**, and we **reject the null hypothesis**.

In the second case, the p-value was 0.34, meaning that if there were no real difference we would still see a gap at least this large about 34% of the time.

That's not enough evidence to conclude a difference, so we **fail to reject the null hypothesis**.

Important wording!

Case 1: *"we reject the null hypothesis"* and **not** *"we proved a difference"* or *"we proved the alternate hypothesis"*

Case 2: *"we fail to reject the null hypothesis"* and **not** *"we proved there is no difference"*
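A small helper can keep the wording consistent. This is only a sketch: the function name `report` and the default threshold of 0.05 are assumptions, and the p-values passed in are the two computed earlier in this notebook.

```python
def report(p_value, alpha=0.05):
    """Return cautious wording for a test result at significance level alpha."""
    if p_value < alpha:
        return f"p = {p_value:.3g} < {alpha}: we reject the null hypothesis"
    return f"p = {p_value:.3g} >= {alpha}: we fail to reject the null hypothesis"


print(report(2.452490316264101e-74))  # jockeys vs basketball players
print(report(0.3392652865361483))     # jockeys vs a second sample of jockeys
```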

Remember, we're always cautious about our findings.

The word **prove** is banned from a data scientist's vocabulary.