# Inferential Statistics

## Library loading

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns 
from scipy import stats

## Hypothesis testing

We want to test whether the **sample mean** (80.94) differs from the **population mean** of 85. We also know that our **sample** has a size of 25 individuals.

$t = \frac{(\bar{X}-\mu_{0})}{\hat{\sigma}/\sqrt{n}}$

where:

* $\bar{X}$ is the **sample mean**
* $\mu_{0}$ is the hypothesized **population mean**
* $\hat{\sigma}$ is the **sample standard deviation**
* $n$ is the number of observations in our sample

In [2]:
#Example 1 

# Calculating T statistic for one sample t test

import math

sample_mean = 80.94
pop_mean = 85
sample_std = 11.6
n = 25
statistic = (sample_mean - pop_mean)/(sample_std/math.sqrt(n))
print("Statistic is: ", statistic)

Statistic is:  -1.750000000000001


For a two-tailed test, the critical region is bounded by two critical values placed symmetrically about zero, one in each tail of the distribution.

We use the significance level, alpha (here 0.05), to look up the critical value in a z-table.

A z-table is available at https://www.math.arizona.edu/~rsims/ma464/standardnormaltable.pdf.

Since this is a two-tailed test, we use alpha/2, i.e. 0.025, which gives a critical value of -1.96 in the left tail (and +1.96 in the right tail).

Now we can compare the calculated statistic and the critical value to see if the null hypothesis is rejected or not. 

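In this case the statistic is -1.75 and the critical value is -1.96, so the statistic does not fall in the rejection region and we fail to reject the null hypothesis. Strictly speaking, a t statistic with n = 25 should be compared against the t distribution with 24 degrees of freedom, whose critical value is about -2.064; the conclusion is the same.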
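The table lookup above can also be reproduced in code. The following is a minimal sketch, assuming `scipy` is installed; it reuses the statistic (-1.75) and sample size (25) from Example 1, and the variable names are illustrative. It compares the normal (z) critical value with the t critical value for 24 degrees of freedom and computes a two-sided p value.

```python
from scipy.stats import norm, t

alpha = 0.05          # significance level used in the text
n = 25                # sample size from Example 1
statistic = -1.75     # t statistic computed in Example 1
dof = n - 1           # degrees of freedom for a one-sample t test

z_crit = norm.ppf(alpha / 2)      # left-tail critical value from the z distribution
t_crit = t.ppf(alpha / 2, dof)    # left-tail critical value from the t distribution
p_two_sided = 2 * t.cdf(-abs(statistic), dof)  # probability mass in both tails

print("z critical value:", z_crit)
print("t critical value:", t_crit)
print("Two-sided p value:", p_two_sided)
print("Reject H0:", abs(statistic) > abs(t_crit))
```

With either critical value the statistic stays outside the rejection region, so the decision not to reject the null hypothesis does not change.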

In [4]:
# 2 sample t test 

sample_mean1 = 105.5
sample_std1 = 20.1
n1 = 34
sample_mean2 = 90.9
sample_std2 = 12.2
n2 = 29

pooled_sample_std = math.sqrt(((n1-1)*sample_std1**2 + (n2-1)*sample_std2**2)/(n1+n2-2))
statistic = (sample_mean1-sample_mean2)/(pooled_sample_std*math.sqrt((1/n1)+(1/n2)))
print("T Statistic is: ", statistic)

T Statistic is:  3.4101131776909535


In [5]:
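# Using python to find the p value and critical value

from scipy.stats import t

dof = n1 + n2 - 2  # degrees of freedom for the pooled two-sample t test

# 1 - cdf is the upper-tail (one-sided) p value; double it for a two-sided test
print("P value is: ", 1 - t.cdf(statistic, dof))
print("Critical Value of t is: ", t.ppf(0.025, dof))  # alpha is 0.05, two-tailed

P value is:  0.0005783712704484634
Critical Value of t is:  -1.9996235841149783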


In this case, the test statistic (3.41) is greater than the absolute value of the critical value (2.00), so it lies in the rejection region. Hence we reject the null hypothesis.

Looking at the p value directly leads to the same decision: the one-sided p value printed above is about 0.00058, and doubling it for a two-tailed test gives about 0.0012, both well below 0.05.
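As a cross-check on the manual calculation, the same pooled two-sample t test can be run directly from the summary statistics. This is a minimal sketch, assuming `scipy` is installed; it uses `scipy.stats.ttest_ind_from_stats` with `equal_var=True` to match the pooled standard deviation formula, and it reports a two-sided p value.

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the two-sample example above
result = ttest_ind_from_stats(
    mean1=105.5, std1=20.1, nobs1=34,
    mean2=90.9, std2=12.2, nobs2=29,
    equal_var=True,  # pooled-variance t test, matching the manual calculation
)

print("T statistic:", result.statistic)
print("Two-sided p value:", result.pvalue)
```

The pooled test assumes both groups share the same variance. Since the two sample standard deviations differ noticeably (20.1 versus 12.2), passing `equal_var=False` would run Welch's t test instead, which drops that assumption.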

In [6]:
#using python to conduct ANOVA 

import statsmodels.api as sm
from statsmodels.formula.api import ols

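# Load the collected data from the Excel sheet
data = pd.read_excel('anova_class_example_data.xlsx', sheet_name='data_collected')
data.head()

# One-way ANOVA: C(...) treats Display_design as a categorical factor
model = ols('Percent_increase_in_sales ~ C(Display_design)',data=data).fit()
table = sm.stats.anova_lm(model)
print(table)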

                     df    sum_sq       mean_sq          F        PR(>F)
C(Display_design)   3.0  66870.55  22290.183333  66.797073  2.882866e-09
Residual           16.0   5339.20    333.700000        NaN           NaN


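The p value, shown in the "PR(>F)" column, is very low. It is the probability of observing an F statistic at least as large as 66.80 if the null hypothesis were true.

P value = 0.000000002883 (2.882866e-09)

Thus we can reject the null hypothesis that all display designs produce the same mean percent increase in sales.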
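To see where the F statistic and its p value come from, the table entries can be recomputed by hand. This is a minimal sketch, assuming `scipy` is installed; the numbers are copied from the ANOVA table above, the variable names are illustrative, and the critical value assumes a significance level of 0.05.

```python
from scipy.stats import f

# Values copied from the ANOVA table above
ms_between = 22290.183333   # mean_sq for C(Display_design)
ms_within = 333.7           # mean_sq for Residual
df_between = 3              # df for C(Display_design)
df_within = 16              # df for Residual

f_stat = ms_between / ms_within                 # F is the ratio of the two mean squares
p_value = f.sf(f_stat, df_between, df_within)   # upper-tail probability, i.e. PR(>F)
f_crit = f.ppf(0.95, df_between, df_within)     # critical value, assuming alpha = 0.05

print("F statistic:", f_stat)
print("P value:", p_value)
print("Critical value of F:", f_crit)
print("Reject H0:", f_stat > f_crit)
```

The F test is one-tailed, so the whole of alpha sits in the upper tail and the statistic is compared against a single critical value.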