# MODULES

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

import pandas as pd
from statsmodels.formula.api import ols
import statsmodels.api as sm

from scipy.stats import chi2
from scipy.stats import chisquare


___

# DIFFERENCE BETWEEN SAMPLES
## Overview

Comparing samples aims to determine whether some characteristic of the population affects the variable of interest. More specifically, we check whether different values of some **categorical variable(s)** lead to **different probability distributions** for the variable of interest.

<br></br>
![png](../../img/stat_tests/stat_tests_diff_between_samples.png)


___

# CORRELATION BETWEEN VARIABLES
## Overview

Correlation is the measure of dependence between **two continuous or ordinal variables**. It typically indicates their linear relationship, but more broadly measures how closely they vary together. This joint variation is captured by their **covariance**.

A more common measure is the **[Pearson product-moment correlation coefficient](https://en.wikipedia.org/wiki/Correlation_and_dependence#Pearson's_product-moment_coefficient)**, built on top of the covariance. It is to the covariance what the standard deviation is to the variance: a rescaled version that is easier to interpret. For bivariate data, it indicates how closely the points follow the line of best fit.

The correlation coefficient divides the covariance by the product of the standard deviations. This normalizes the covariance into a unit-less variable whose values are between -1 and +1.

The line of best fit has a slope equal to the Pearson coefficient multiplied by $SD_y / SD_x$.
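The relationships above can be checked numerically. The sketch below uses made-up data (the values of `x` and `y` are hypothetical) and compares the hand-computed coefficient and slope with NumPy's own results.

```python
import numpy as np

# Hypothetical data: y depends linearly on x, plus some noise
rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=200)
y = 3 * x + rng.normal(loc=0, scale=4, size=200)

# Sample covariance and standard deviations (ddof=1 everywhere for consistency)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
sd_x, sd_y = x.std(ddof=1), y.std(ddof=1)

# Pearson coefficient: covariance divided by the product of the SDs
r = cov_xy / (sd_x * sd_y)

# Slope of the line of best fit: r * SDy / SDx
slope = r * sd_y / sd_x

# Cross-check against NumPy's built-in correlation and least-squares fit
r_numpy = np.corrcoef(x, y)[0, 1]
slope_numpy = np.polyfit(x, y, deg=1)[0]
print(f"r     = {r:.3f} (np.corrcoef: {r_numpy:.3f})")
print(f"slope = {slope:.3f} (np.polyfit:  {slope_numpy:.3f})")
```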

<br></br>
![png](../../img/stat_tests/stat_tests_correlation.png)


___

# MODELING
## Overview

Linear Regression:
+ only include variables that are correlated with the outcome.
+ check for collinearity between the predictors.
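A minimal sketch of both checks, using a plain correlation matrix. The variables `x1`, `x2`, `x3` and the 0.8 cutoff are illustrative assumptions, not values from this notebook.

```python
import numpy as np

# Hypothetical predictors and outcome
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1: collinear
x3 = rng.normal(size=n)                  # unrelated to the outcome
y = 2 * x1 + rng.normal(size=n)

# Correlation matrix; rows/columns are x1, x2, x3, y
corr = np.corrcoef([x1, x2, x3, y])
names = ["x1", "x2", "x3"]

# 1. Correlation of each predictor with the outcome (last column)
for i, name in enumerate(names):
    print(f"corr({name}, y) = {corr[i, -1]:+.2f}")

# 2. Pairs of predictors that are strongly correlated with each other
threshold = 0.8  # assumed cutoff; pick one that suits the problem
collinear_pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print("collinear pairs:", collinear_pairs)
```

Here `x3` would be dropped for lack of correlation with `y`, and only one of `x1` / `x2` would be kept.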

<br></br>
![png](../../img/stat_tests/stat_tests_modeling.png)


___

# CENTRAL LIMIT THEOREM (CLT)

## Definition

Sample means computed from many samples of the same size $n$ are approximately **normally distributed** around the population mean $\mu$, regardless of the original distribution, provided $n$ is large enough. This normal distribution has:
+ the **same mean** $\mu$ as the population.
+ a standard deviation called the **standard error**, equal to $\sigma / \sqrt{n}$, where $\sigma$ is the SD of the population.
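A quick simulation illustrates the theorem. The exponential population and the sample sizes below are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed population: exponential with mean 2 (its SD is also 2)
mu, sigma = 2.0, 2.0
n = 50              # size of each sample
n_samples = 10_000  # number of samples drawn

# Each row is one sample; take the mean of every sample
samples = rng.exponential(scale=mu, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# The sample means should center on mu with a spread close to sigma / sqrt(n)
print(f"mean of sample means: {sample_means.mean():.3f} (population mean: {mu})")
print(f"SD of sample means:   {sample_means.std(ddof=1):.3f} "
      f"(sigma / sqrt(n): {sigma / np.sqrt(n):.3f})")
```

A histogram of `sample_means` (for example with `plt.hist`) should look bell-shaped even though the population itself is heavily skewed.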

## Confidence Intervals

Because the sampling distribution of the sample mean is approximately **normally distributed**, about 95% of all sample means fall within two standard errors (more precisely, 1.96) of the actual population mean. In other words: we can say with a 95% confidence level that the **population parameter** lies within a confidence interval of plus-or-minus two standard errors of the **sample statistic**.
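As a rough sketch, the interval can be computed as follows. The sample is simulated and the population SD is assumed to be known; both are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; the population SD is assumed to be known
rng = np.random.default_rng(7)
sigma = 0.7
sample = rng.normal(loc=3.0, scale=sigma, size=40)

sample_mean = sample.mean()
standard_error = sigma / np.sqrt(len(sample))

# Two-sided 95% interval: the exact multiplier is about 1.96, often rounded to 2
z_crit = stats.norm.ppf(0.975)
ci_low = sample_mean - z_crit * standard_error
ci_high = sample_mean + z_crit * standard_error

print(f"sample mean: {sample_mean:.3f}")
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```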

With $\mu$ the mean of the population the sample comes from (estimated by the sample statistic) and $\mu_0$ the reference population parameter, there are three possible **alternative hypotheses**:

| Left-tailed  | Two-sided     | Right-tailed    |
|-----------------:|:-----------------:|:-------------------:|
| $\mu \lt \mu_0$ | $\mu \neq \mu_0$   | $\mu \gt \mu_0$     |

A p-value smaller than $\alpha$ means that, under $H_0$, the sample statistic falls in the blue area(s) of the **sampling distribution of the sample statistic**. Which area applies depends on the alternative hypothesis.

<br></br>
<img class="center-block" src="https://sebastienplat.s3.amazonaws.com/21a0a7a855f51f6426dfbf6115b872161490032937519"/>

_Note: for two-tailed tests, we use $\alpha/2$ for each tail. This ensures the total probability of extreme values is $\alpha$._
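The cutoffs that delimit these rejection areas can be obtained from the standard normal distribution. The sketch below assumes $\alpha = 0.05$ purely as an example.

```python
from scipy import stats

alpha = 0.05  # example significance level

# Left-tailed test: reject H0 if z < left_cutoff
left_cutoff = stats.norm.ppf(alpha)

# Right-tailed test: reject H0 if z > right_cutoff
right_cutoff = stats.norm.ppf(1 - alpha)

# Two-sided test: alpha is split between both tails, reject H0 if |z| > two_sided_cutoff
two_sided_cutoff = stats.norm.ppf(1 - alpha / 2)

print(f"left-tailed cutoff:  {left_cutoff:.3f}")
print(f"right-tailed cutoff: {right_cutoff:.3f}")
print(f"two-sided cutoffs:   +/- {two_sided_cutoff:.3f}")
```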

## Z-Scores

Two facts let us assess the probability of observing the experimental results under the null hypothesis:
+ The [**Z-score**](https://en.wikipedia.org/wiki/Standard_score) represents the number of standard deviations an observation is from the mean.
+ The sampling distribution of the sample statistic is centered around the population parameter and has a standard error linked to the population variance.

It means that we can compute the z-score of our sample statistic and derive its p-value from it.
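One possible way to turn a z-score into a p-value for each type of alternative hypothesis is sketched below. The helper `p_value_from_z` and its `alternative` labels are hypothetical names introduced for this example.

```python
from scipy import stats


def p_value_from_z(z, alternative="two-sided"):
    """Return the p-value of a z-score under the standard normal distribution."""
    if alternative == "less":        # left-tailed: area to the left of z
        return stats.norm.cdf(z)
    if alternative == "greater":     # right-tailed: area to the right of z
        return stats.norm.sf(z)
    if alternative == "two-sided":   # both tails beyond |z|
        return 2 * stats.norm.sf(abs(z))
    raise ValueError(f"unknown alternative: {alternative!r}")


z = -2.0  # example z-score
for alt in ("less", "two-sided", "greater"):
    print(f"{alt:>9}: p-value = {p_value_from_z(z, alt):.4f}")
```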


___

# Z-Tests

+ Z-tests for means serve a similar purpose to ANOVA: comparing means.
+ Z-tests for proportions serve a similar purpose to chi-square tests: comparing proportions.

Both tests compare a sample to a given population. Formally, the population SD needs to be known; if it is not, we can use a t-test with the sample standard deviation instead.
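For proportions, a one-sample z-test could look like the sketch below. The conversion numbers are invented for illustration, and the standard error uses the proportion stated under $H_0$.

```python
import numpy as np
from scipy import stats

# Hypothetical example: baseline conversion rate of 10%;
# 130 conversions observed among 1000 visitors after a change.
# H0: p <= p0, H1: p > p0 (right-tailed)
p0 = 0.10
n = 1000
conversions = 130

p_hat = conversions / n

# Under H0, the standard error of the sample proportion is sqrt(p0 * (1 - p0) / n)
standard_error = np.sqrt(p0 * (1 - p0) / n)
z_score = (p_hat - p0) / standard_error

# Right-tailed p-value: area to the right of the z-score
p_value = stats.norm.sf(z_score)
print(f"z-score: {z_score:.3f}, p-value: {p_value:.4%}")
```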

## Mean Tests

A **mean test** checks whether the population mean equals a specific value $\mu_0$, typically the known mean of a reference population. Its null hypothesis states that $\mu = \mu_0$.

+ According to the CLT, the mean of our sample is part of a normal distribution centered around its population mean
+ The null hypothesis states that the sample belongs to the initial population

It means that we can calculate the position of the sample mean in the sampling distribution that follows $H_0$, provided we know the population standard deviation (see formula in code). 

We calculate its p-value based on the alternate hypothesis to draw our conclusions.
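Written out, the position of the sample mean in the sampling distribution under $H_0$ is its z-score. The symbols $\bar{x}$ (sample mean), $n$ (sample size) and $\Phi$ (standard normal CDF) are notation introduced here for convenience:

$$
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
$$

The p-value then depends on the alternative hypothesis:

$$
p_{\text{left}} = \Phi(z), \qquad
p_{\text{right}} = 1 - \Phi(z), \qquad
p_{\text{two-sided}} = 2 \, \bigl(1 - \Phi(|z|)\bigr)
$$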


In [47]:
# example: test if a website redesign improved load time
# H0: new_load >= old_load, H1: new_load < old_load (left-tailed), alpha = 0.01
old_load_mean, old_load_sd = 3.125, 0.700  # population mean and SD before the redesign
new_load_mean, new_load_sample_size = 2.875, 40

# z-score: distance between the sample mean and the H0 mean, in standard errors
z_score = (new_load_mean - old_load_mean) / (old_load_sd / np.sqrt(new_load_sample_size))

# left-tailed p-value: about 1.2%, above the 1% cutoff, so we fail to reject H0
print('p-value: {:.1%} > 1% cutoff'.format(stats.norm.cdf(x=z_score)))


p-value: 1.2% > 1% cutoff
