In [2]:
# Import required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats
import statistics
import scipy.stats.distributions as dist

## Confidence Interval

**CONFIDENCE INTERVAL** = (**POINT ESTIMATE** - **RELIABILITY FACTOR** x **STANDARD ERROR**, **POINT ESTIMATE** + **RELIABILITY FACTOR** x **STANDARD ERROR**)

**RELIABILITY FACTOR** = Z-static or t-static based on the distribution of the data and knowledge of the sample data and population mean.

**STANDARD ERROR** = $\frac{\text{Standard Deviation}}{\sqrt{\text{Sample Size}}}$

**Margin of error** = **RELIABILITY FACTOR**  X  **STANDARD ERROR**

When to use:
1. **Z-score**:
    - Large sample size (n>30)
    - Population variance is known

2. **t-score**:
    - Small sample size (n<30)
    - Population variance is unknown

When population size is large and we don't know the population variance, then we can use **Central Limit Theorem (CLT)** to get multiple sample means from the given data and then use the t-score.

1. Confidence interval looking into **single population** with **known population variance**:

\begin{equation}
\text{CI} = \bar{x} \pm Z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}
\end{equation}

2. Confidence interval looking into **single population** with **unknown population variance**:

\begin{equation}
\text{CI} = \bar{x} \pm  t_{n-1, \frac{\alpha}{2}} \frac{s}{\sqrt{n}}
\end{equation}

3. Confidence interval looking into **2 dependent populations** with **unknown population variance**:

\begin{equation}
\text{CI} = \bar{d} \pm  t_{n-1, \frac{\alpha}{2}} \frac{s_{d}}{\sqrt{n}}
\end{equation}

\begin{equation}
d = \left(x_{1} - y_{1}, x_{2} - y_{2}, x_{3} - y_{3}, ...., x_{n} - y_{n}\right)
\end{equation}

4. Confidence interval looking into **2 independent populations** with **known population variance**:

\begin{equation}
\text{CI} = \left(\bar{x} - \bar{y}\right) \pm Z_{\frac{\alpha}{2}} \sqrt{\frac{\sigma_{x}^{2}}{n_{x}} + \frac{\sigma_{y}^{2}}{n_{y}}}
\end{equation}

5. Confidence interval looking into **2 independent populations** with **unknown population variance but assumed to be equal**:

\begin{equation}
\text{CI} = \left(\bar{x} - \bar{y}\right) \pm t_{n_{x}+n_{y}-2, \frac{\alpha}{2}} \sqrt{\frac{s_{p}^{2}}{n_{x}} + \frac{s_{p}^{2}}{n_{y}}}
\end{equation}

\begin{equation}
s_{p}^{2} = \frac{\left(n_{x} - 1\right)s_{x}^{2} + \left(n_{y} - 1\right)s_{y}^{2}}{n_{x} + n_{y} - 2}
\end{equation}

### Data

Sample Sales Data available on Kaggle contains Order Info, Sales, Customer, Shipping, etc., Used for Segmentation, Customer Analytics, Clustering and More. Inspired for retail analytics. (https://www.kaggle.com/kyanyoga/sample-sales-data)

In [42]:
df = pd.read_csv('sales_data_sample.csv', sep=",", encoding='Latin-1')
df

Unnamed: 0,ORDERNUMBER,QUANTITYORDERED,PRICEEACH,ORDERLINENUMBER,SALES,ORDERDATE,STATUS,QTR_ID,MONTH_ID,YEAR_ID,...,ADDRESSLINE1,ADDRESSLINE2,CITY,STATE,POSTALCODE,COUNTRY,TERRITORY,CONTACTLASTNAME,CONTACTFIRSTNAME,DEALSIZE
0,10107,30,95.70,2,2871.00,2/24/2003 0:00,Shipped,1,2,2003,...,897 Long Airport Avenue,,NYC,NY,10022,USA,,Yu,Kwai,Small
1,10121,34,81.35,5,2765.90,5/7/2003 0:00,Shipped,2,5,2003,...,59 rue de l'Abbaye,,Reims,,51100,France,EMEA,Henriot,Paul,Small
2,10134,41,94.74,2,3884.34,7/1/2003 0:00,Shipped,3,7,2003,...,27 rue du Colonel Pierre Avia,,Paris,,75508,France,EMEA,Da Cunha,Daniel,Medium
3,10145,45,83.26,6,3746.70,8/25/2003 0:00,Shipped,3,8,2003,...,78934 Hillside Dr.,,Pasadena,CA,90003,USA,,Young,Julie,Medium
4,10159,49,100.00,14,5205.27,10/10/2003 0:00,Shipped,4,10,2003,...,7734 Strong St.,,San Francisco,CA,,USA,,Brown,Julie,Medium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2818,10350,20,100.00,15,2244.40,12/2/2004 0:00,Shipped,4,12,2004,...,"C/ Moralzarzal, 86",,Madrid,,28034,Spain,EMEA,Freyre,Diego,Small
2819,10373,29,100.00,1,3978.51,1/31/2005 0:00,Shipped,1,1,2005,...,Torikatu 38,,Oulu,,90110,Finland,EMEA,Koskitalo,Pirkko,Medium
2820,10386,43,100.00,4,5417.57,3/1/2005 0:00,Resolved,1,3,2005,...,"C/ Moralzarzal, 86",,Madrid,,28034,Spain,EMEA,Freyre,Diego,Medium
2821,10397,34,62.24,1,2116.16,3/28/2005 0:00,Shipped,1,3,2005,...,1 rue Alsace-Lorraine,,Toulouse,,31000,France,EMEA,Roulet,Annette,Small


###  Lets estimate the 95% confidence interval for sales for each country from 2003 to 2004.

In [4]:
df1 = df[df['YEAR_ID'].isin([2003,2004])]

gpd_data = df1.groupby('COUNTRY')['SALES'].agg(['count','mean','std']).reset_index()
gpd_data['std_error'] = gpd_data.apply(lambda row: row['std']/row['count']**0.5, axis = 1)
gpd_data['CI_95'] = gpd_data.apply(lambda row: scipy.stats.t.interval(0.95, row['count']-1, row['mean'], row['std_error']), axis = 1)
print(gpd_data[['COUNTRY','CI_95']])

        COUNTRY                                     CI_95
0     Australia   (3131.8935623683346, 3706.573057349975)
1       Austria    (2948.921260784013, 4093.829791847565)
2       Belgium    (2726.481122465292, 3943.319677534708)
3        Canada   (2752.527528867365, 3489.6229629359136)
4       Denmark   (3182.4002800297058, 4138.409053303628)
5       Finland   (3199.6979794785548, 4308.827946447372)
6        France   (3241.1779414632206, 3702.503018536779)
7       Germany   (3096.4756649974884, 4015.527238228318)
8       Ireland    (2503.834247856112, 4715.719502143887)
9         Italy  (3007.8059631171686, 3722.7873702161646)
10        Japan     (2935.48255414086, 4179.873160144854)
11       Norway   (3217.450179301049, 4016.9898206989515)
12  Philippines    (2964.122816013692, 4267.856414755539)
13    Singapore    (3245.414641142219, 4168.407200963044)
14        Spain    (3307.378976668714, 3719.409560880693)
15       Sweden     (3154.340556270969, 4127.59781107597)
16  Switzerlan

###  Lets estimate the 95% confidence interval for sales in cities Barcelona and Sevilla

In [39]:
df1 = df1[df1['COUNTRY'] == 'Spain']
df1 = df1[df1['CITY'].isin(['Barcelona','Sevilla'])]

gpd_data = df1.groupby('CITY')['SALES'].agg(['count','mean','var']).to_dict()

pooled_var = ( (gpd_data['count']['Barcelona']-1)*gpd_data['var']['Barcelona'] + 
              gpd_data['count']['Sevilla']*gpd_data['var']['Sevilla'] ) / (gpd_data['count']['Barcelona']+gpd_data['count']['Sevilla']-2)

CI_95 = np.array(scipy.stats.t.interval(0.95, gpd_data['count']['Barcelona']+gpd_data['count']['Sevilla']-2, 
                      abs(gpd_data['mean']['Barcelona']-gpd_data['mean']['Sevilla']),
                      (pooler_var/gpd_data['count']['Barcelona'] + pooler_var/gpd_data['count']['Sevilla'])**0.5))

CI_95[CI_95<0] = 0

print('CI:', CI_95)

CI: [   0.         1336.91252641]


## **ANOVA Formulation:**

Using ANOVA, identify if there is any significant difference between the mean of the three groups.

$H_{0}:$ mean of all groups is equal

$H_{A}:$ at least one of the means is different

$SS_{total} = \sum_{i=1}^{n} \left(y_{i} - \bar{y}\right)^2 \tag{0}$

$SS_{residual} = \sum_{j \in G} \left( \sum_{i=1}^{n_{j}} \left( y_{i,j} - \bar{y_{j}} \right)^2 \right) \forall G \in \{Group1, Group2, ..\} \tag{1}$

$SS_{total} = SS_{explained} + SS_{residual} \tag{2}$

Hence,

$SS_{explained} = SS_{total} - SS_{residual} \tag{3}$

$df_{explained} = $ number of groups $- 1$

$df_{residual} = $ number of observations - number of groups

$MS_{explained} = \frac{SS_{explained}}{df_{explained}} \tag{4}$

$MS_{residual} = \frac{SS_{residual}}{df_{residual}} \tag{5}$

$F_{static} = \frac{MS_{explained}}{MS_{residual}} \tag{6}$

**P-value** = $F\left(df_{explained}, df_{residual}\right)$

$\alpha$ = level of significance

If **P-value** $\lt \alpha$ : We reject the $H_{0}$

If **P-value** $\gt \alpha$ : We don't have enough evidence to reject the $H_{0}$

In [40]:
df = pd.DataFrame()
df['A'] = [19.7, 20.1, 21.3, 23.5,  9.3, 27.1, 11.6, 12.2, 15.9, 17. , 17.2, 18.4, 19.4, 23.4,  2. ]
df['B'] = [23. , 24.5, 24.6, 27.1, 12. , 27.8, 12.8, 16.2, 19.8, 22.4, 23.6, 25.3, 27.9,  4.6, 35.2]
df['C'] = [21.6, 25.5, 25.9, 30.7,  3. , 16.5, 22.7, 24.2, 26.2, 28.4, 28.5, 30.7, 32.2, 33.8, 34.5]

df

Unnamed: 0,A,B,C
0,19.7,23.0,21.6
1,20.1,24.5,25.5
2,21.3,24.6,25.9
3,23.5,27.1,30.7
4,9.3,12.0,3.0
5,27.1,27.8,16.5
6,11.6,12.8,22.7
7,12.2,16.2,24.2
8,15.9,19.8,26.2
9,17.0,22.4,28.4


In [41]:
total_mean = np.mean(df.iloc[:,:].values)
A_mean = df['A'].mean()
B_mean = df['B'].mean()
C_mean = df['C'].mean()

ss_total = np.sum((df.iloc[:,:].values - total_mean)**2)
ss_residual = np.sum(np.sum((df['A'].values - A_mean)**2) + 
                     np.sum((df['B'].values - B_mean)**2) + 
                     np.sum((df['C'].values - C_mean)**2)
                    )
ss_explained = ss_total - ss_residual

df1 = df.shape[1] - 1
df2 = df.size - df.shape[1]

ms_explained = ss_explained/df1
ms_residual = ss_residual/df2

f_static = ms_explained/ms_residual
print('F-stat:', f_static)

p_value = 1 - scipy.stats.f.cdf(f_static, df1, df2)
print('P-value:', p_value)

print('We can reject the Null Hypothesis at 0.05 significance level')
print('But we do have enough evidence to reject the Null Hypothesis at 0.01 significance level')

F-stat: 4.9412309941313985
P-value: 0.011825050132581727
We can reject the Null Hypothesis at 0.05 significance level
But we do have enough evidence to reject the Null Hypothesis at 0.01 significance level


## Equality of Proportions

To test if population proportions of two groups are significantly different using statistics

### Data

We will be using popular Titanic data from Kaggle and see if the sex has any effect on the survival of an individual.
(https://www.kaggle.com/c/titanic/data)

In [50]:
data = pd.read_csv('titanic_data.csv')

pvt_data = data.pivot_table(index = 'Sex', columns = 'Survived', values = 'PassengerId', aggfunc = 'count')

pvt_data['total'] = pvt_data.sum(axis = 1)

pvt_data

Survived,0,1,total
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,81,233,314
male,468,109,577


$p_{1}$ = proportion of females who survived

$p_{2}$ = proportion of males who survived

$p$ = proportion of survived from both sex combined

$H_{0} : p_{1} - p_{2} = 0 \tag{0}$

$H_{A} : p_{1} - p_{2} \ne 0 \tag{1}$

Test-statistic:

\begin{equation}
Z = \frac{\left(p_{1} - p_{2}\right) - 0}{\sqrt{p(1-p)\left(\frac{1}{n_{1}}+\frac{1}{n_{2}}\right)}}
\tag{2}
\end{equation}

In [52]:
p1 = 233/314
p2 = 109/577
p = (233+109)/(314+577)

z_statistic = abs((p1-p2)/( p*(1-p) * (1/314 + 1/577) )**0.5)

pvalue = 2*dist.norm.cdf(-np.abs(z_statistic)) # Multiplied by two indicates a two tailed testing.
print('P-value:',pvalue)

print('Since the P-value is very low, we can reject the null hypothesis and accept the alternate hypothesis.')
print('Hence the sex has a huge effect on the survival of an individual in the tragedy')

P-value: 3.7117477701134797e-59
Since the P-value is very low, we can reject the null hypothesis and accept the alternate hypothesis.
Hence the sex has a huge effect on the survival of an individual in the tragedy
