# Today's Coding Topics
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/data-programming-with-python/blob/main/2023-summmer/2023-06-28/notebook/concept_and_code_demo.ipynb)

* Recap of previous lecture
* Statistics with Python


# Recap of previous lecture

In [None]:
import pandas as pd
import numpy as np

In [None]:
pd.set_option('display.max_columns',None) #unlimited
pd.set_option('display.max_rows',None)

## Pandas practice

In [None]:
%%time

Employees = pd.read_excel('../data/Employees.xls')
Territory = pd.read_excel('../data/SalesTerritory.xls')
Customers = pd.read_excel('../data/Customers.xls')
Orders = pd.read_excel('../data/ItemsOrdered.xls')

### Grouping

Reading Materials: 
* (official doc): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
* (summary) https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

#### What is the earliest birthdate for all employees?

SQL logic
```sql
SELECT MIN(e.BirthDate) FROM dbo.Employees AS e;
```

In [None]:
Employees.head(3)

In [None]:
Employees.columns

In [None]:
type(Employees.dtypes)

In [None]:
Employees.loc[:,['BirthDate']].head(3)

In [None]:
Employees.dtypes

In [None]:
# Employees.dtypes.reset_index()
# Employees.dtypes['BirthDate']
str(Employees.dtypes['BirthDate'])

In [None]:
Employees.BirthDate.dtypes

In [None]:
'1970-01-01' < '2023-06-26'

In [None]:
Employees.BirthDate.min()

In [None]:
Employees.BirthDate.max()

In [None]:
Employees.BirthDate.nunique()

#### Add to the above, the most recent birthdate for all employees

SQL logic
```sql
SELECT 
  MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e;
```

In [None]:
x = [4,5,1,2,3]
min(x), max(x)

* Lexicographic order [[wikipedia](https://en.wikipedia.org/wiki/Lexicographic_order)]

In [None]:
'2ab' < '1ab'

In [None]:
# 'abcdefg'

'a' > 'b'

In [None]:
Employees.agg({'BirthDate':['min','max']}).T

# Employees.agg({'BirthDate':['min','max']})

In [None]:
Employees.agg({'BirthDate':[min,max]}).T

In [None]:
Employees.agg({'BirthDate':[min,max]}).T.reset_index(drop=True)
# Employees.agg({'BirthDate':['min','max']}).T.reset_index(drop=False)

#### Show the above results broken down by gender

SQL logic
```sql
SELECT 
  e.Gender
  , MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e
GROUP BY e.Gender
;
```

In [None]:
Employees.groupby('Gender')['BirthDate'].min().reset_index()

In [None]:
Employees.groupby('Gender').agg({'BirthDate':[min,max]})

In [None]:
Employees.groupby('Gender').agg(
    min_bday=('BirthDate',min),
    max_bday=('BirthDate',max)
).reset_index()

#### Show the above results broken down by gender, and salaried/hourly

SQL logic
```sql
SELECT 
  e.Gender
  , e.SalariedFlag
  , MIN(e.BirthDate) AS 'Earliest Birthday'
  , MAX(e.BirthDate) AS 'Most Recent Birthday'
FROM dbo.Employees AS e
GROUP BY e.Gender, e.SalariedFlag
;
```

In [None]:
Employees.groupby(['Gender','SalariedFlag']).agg(
    min_bday=('BirthDate',min),
    max_bday=('BirthDate',max)
).reset_index()

#### What are the average vacation hours for all employees?

SQL logic
```sql
SELECT AVG(e.VacationHours)
FROM dbo.Employees AS e	
;
```

In [None]:
Employees.VacationHours.mean()

#### Show the above results broken down and ordered by job title

SQL logic
```sql
SELECT 
  e.JobTitle
  , AVG(e.VacationHours) AS 'Average Vacation'
  , MIN(e.VacationHours) AS 'Minimum Vacation'
FROM dbo.Employees AS e
GROUP BY e.JobTitle
;
```

In [None]:
Employees.groupby('JobTitle')['VacationHours'].min().reset_index().head(3)

In [None]:
Employees.groupby('JobTitle')['VacationHours'].mean().reset_index().head(3)

In [None]:
Employees.groupby('JobTitle')['VacationHours'].apply(lambda x: sum(x)/len(x)).reset_index().head(3)

In [None]:
Employees.groupby('JobTitle').agg(
    avg_pto_left=('VacationHours',lambda x: sum(x)/len(x)),
    min_pto_left=('VacationHours',min)
).reset_index()
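The same named aggregation can also be written with string function names (`'mean'`, `'min'`) instead of a lambda or the Python built-ins, which lets pandas use its own optimized implementations. The sketch below uses a small hypothetical DataFrame (`toy`) standing in for `Employees`, so the values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the Employees table
toy = pd.DataFrame({
    'JobTitle': ['Engineer', 'Engineer', 'Buyer', 'Buyer', 'Buyer'],
    'VacationHours': [10, 30, 5, 15, 25],
})

# Named aggregation: output_column=(input_column, aggregation_name)
summary = toy.groupby('JobTitle').agg(
    avg_pto_left=('VacationHours', 'mean'),
    min_pto_left=('VacationHours', 'min'),
).reset_index()

print(summary)
```

`groupby` sorts the group keys by default, so the result is already ordered by `JobTitle`.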

# The Python Statistics Landscape

There are many Python statistics libraries for you to work with.

* **Foundation Libraries**
    * `statistics`: built-in Python library for descriptive statistics (link: https://docs.python.org/3/library/statistics.html)
    * `numpy`: numerical computing, numpy arrays, covered in lecture 03
    * `scipy`: scientific computing based on numpy, the `scipy.stats` module (link: https://docs.scipy.org/doc/scipy/reference/stats.html) covers a large number of probability distributions and statistical functions (link: https://www.scipy.org/)
    
* **Data Science Libraries**
    * `pandas`: 1D and 2D labeled data manipulations and computation, covered in lecture 04
    * `statsmodels`: a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration (link: https://www.statsmodels.org/stable/index.html)
    * `matplotlib`: graphs and visualization (link: https://matplotlib.org/)

# Descriptive Statistical Analysis

Descriptive statistics is about describing and summarizing data. It uses two main approaches:

* The quantitative approach describes and summarizes data numerically.
* The visual approach illustrates data with charts, plots, histograms, and other graphs.

You can apply descriptive statistics to one or many datasets or variables. When you describe and summarize a single variable, you’re performing univariate analysis. When you search for statistical relationships among a pair of variables, you’re doing a bivariate analysis. Similarly, a multivariate analysis is concerned with multiple variables at once.


**[Case Study]**

**Atlanta Police Department Crime Data** ![APD Logo](https://atlantapd.galls.com/photos/partners/atlantapd/logo.jpg)


The Atlanta Police Department provides raw crime data at http://www.atlantapd.org/i-want-to/crime-data-downloads


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Load the 2009-2019 crime data

In [None]:
df = pd.read_csv('../data/COBRA-2009-2019.csv',sep=',',header=0)
df.head(3)

In [None]:
df.shape

In [None]:
df.info()

## Quantitative Analysis

In [None]:
df['rpt_yr'] = df['Report Date'].map(lambda x: x[:4])

df.head(3)

In [None]:
## number of reports every year
# df['rpt_yr'] = df['Report Date'].map(lambda x: x[:4])

num_rpt_by_yr = df.groupby('rpt_yr').agg(
    num_row=('Report Number',len),
    num_rpt=('Report Number',lambda x: len(set(x)))
).reset_index()
num_rpt_by_yr

In [None]:
df.groupby('rpt_yr')['Report Number'].nunique().reset_index()

In [None]:
## number of cases per shift in 2019
num_rpt_by_shift = df[df.rpt_yr=='2019'].groupby('Shift Occurence').agg(
    num_rpt=('Report Number',lambda x: len(set(x)))
).reset_index()
num_rpt_by_shift

In [None]:
## number of cases per shift in the past 3 years
num_rpt_by_yr_shift = df[df.rpt_yr>='2017'].groupby(['rpt_yr','Shift Occurence']).agg(
    num_rpt=('Report Number',lambda x: len(set(x)))
).reset_index()
# num_rpt_by_yr_shift
num_rpt_by_yr_shift.sort_values(by=['Shift Occurence','rpt_yr'])

In [None]:
## % of cases per shift in the past 3 years
num_rpt_by_yr_shift2 = pd.merge(
    num_rpt_by_yr_shift,
    num_rpt_by_yr.loc[:,['rpt_yr','num_rpt']].copy().rename(columns={'num_rpt':'annual_total'}),
    on='rpt_yr'
)
num_rpt_by_yr_shift2.sort_values(by=['Shift Occurence','rpt_yr'])

In [None]:
num_rpt_by_yr_shift2['percent'] = [
    round(subtotal/total,2)
    for subtotal,total in zip(num_rpt_by_yr_shift2.num_rpt,num_rpt_by_yr_shift2.annual_total)
]
num_rpt_by_yr_shift2.sort_values(by=['Shift Occurence','rpt_yr'])

**Can you do better??**
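One possible answer is to skip the merge and the list comprehension and let pandas do the arithmetic column-wise. The sketch below uses a small hypothetical table (`counts`) with the same column names as `num_rpt_by_yr_shift`. It also assumes that the annual total is the sum of the per-shift counts, which is only equivalent to the merged `annual_total` when no report number appears under more than one shift.

```python
import pandas as pd

# Hypothetical per-year, per-shift counts (made-up numbers)
counts = pd.DataFrame({
    'rpt_yr': ['2018', '2018', '2019', '2019'],
    'Shift Occurence': ['Day Watch', 'Evening Watch', 'Day Watch', 'Evening Watch'],
    'num_rpt': [30, 70, 45, 55],
})

# transform('sum') broadcasts each year's total back onto every row of that year
counts['annual_total'] = counts.groupby('rpt_yr')['num_rpt'].transform('sum')

# Vectorized division replaces the zip-based list comprehension
counts['percent'] = (counts['num_rpt'] / counts['annual_total']).round(2)

print(counts)
```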

## Visual Analysis

### Visualize the YOY change of % by shift with a line chart

In [None]:
num_rpt_by_yr_shift2.rpt_yr = num_rpt_by_yr_shift2.rpt_yr.astype(int)
dw = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Day Watch']
ew = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Evening Watch']
mw = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Morning Watch']
unk = num_rpt_by_yr_shift2[num_rpt_by_yr_shift2['Shift Occurence']=='Unknown']

plt.plot(dw.rpt_yr, dw.percent, '-o', label='day watch')
plt.plot(ew.rpt_yr, ew.percent, '-o', label='evening watch')
plt.plot(mw.rpt_yr, mw.percent, '-o', label='morning watch')
plt.plot(unk.rpt_yr, unk.percent, '-o', label='unknown')

plt.xticks(ticks=[2017,2018,2019])
plt.legend()
plt.show()

# Hypothesis Test and Confidence Interval

## Statistical Distributions

### Normal distribution

The normal distribution, or Gaussian distribution, is by far the most important of all the distribution functions. This is because, by the central limit theorem, the sample mean of almost any distribution with finite variance is approximately normally distributed when the sample size is large enough. Mathematically, the normal distribution is characterized by a mean value $\mu$ and a standard deviation $\sigma$:
$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$

When $\mu=0$ and $\sigma=1$, the distribution is called the `standard normal distribution`:
$$
f(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}
$$

The **68-95-99.7** rule:
![](https://miro.medium.com/max/24000/1*IZ2II2HYKeoMrdLU5jW6Dw.png)

| Range      | Probability within range | Probability outside range |
|------------|--------------------------|---------------------------|
| Mean ± 1SD | 68.3%                    | 31.7%                     |
| Mean ± 2SD | 95.4%                    | 4.6%                      |
| Mean ± 3SD | 99.7%                    | 0.3%                      |
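The probabilities in the table can be checked numerically from the standard normal CDF: the probability of falling within $k$ standard deviations of the mean is $\Phi(k) - \Phi(-k)$. A minimal sketch (the dictionary name `within` is just for illustration):

```python
from scipy import stats

# Probability that a normal variable falls within k standard deviations of its mean
within = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"Mean ± {k}SD: inside = {prob:.4f}, outside = {1 - prob:.4f}")
```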

**Generate random samples from the normal distribution**

In [None]:
from scipy import stats

rvs = stats.norm.rvs(loc=0, scale=1, size=5, random_state=123)
rvs

In [None]:
type(rvs)

In [None]:
import numpy as np

In [None]:
np.random.seed(123)
np.random.randn(5) # Return a sample (or samples) from the "standard normal" distribution.

For random samples from $N(\mu, \sigma^2)$, use:

`sigma * np.random.randn(...) + mu`
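As a quick sanity check of this scale-and-shift recipe, the sketch below draws a large sample and compares its mean and standard deviation with the targets. The values `mu = 5` and `sigma = 2` are arbitrary choices for illustration.

```python
import numpy as np

mu, sigma = 5.0, 2.0  # hypothetical target mean and standard deviation

np.random.seed(123)
# Scale standard normal draws by sigma, then shift by mu
samples = sigma * np.random.randn(100_000) + mu

print(samples.mean(), samples.std())
```

With this many draws, both numbers should land close to 5 and 2 respectively.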

In [None]:
import matplotlib.pyplot as plt

In [None]:
# plt.hist(np.random.randn(10000))
plt.hist(np.random.randn(10000),density=True)
plt.title('Histogram of random samples from standard normal distribution')
plt.show()

**Calculate PDF**

In [None]:
stats.norm.pdf(x=[-3,-2,-1,1,2,3],loc=0,scale=1)

In [None]:
## you could also do
distNorm = stats.norm(loc=0,scale=1)
distNorm.pdf(x=[-3,-2,-1,1,2,3])

In [None]:
# the histogram from random sample
plt.hist(np.random.randn(10000),density=True,label='random sample') 
# construct the pdf curve
xs = np.linspace(start=-4,stop=4,num=100)
ys = stats.norm.pdf(x=xs)
plt.plot(xs,ys,label='norm pdf')

plt.title('hist vs. pdf (Normal Distribution)')
plt.legend()
plt.show()

**Calculate CDF** (Cumulative Distribution Function)

In [None]:
stats.norm.cdf(x=[-3,-2,-1,1,2,3],loc=0,scale=1)

In [None]:
# the histogram from random sample
plt.hist(np.random.randn(10000),density=True,label='random sample') 
# construct the pdf curve
xs = np.linspace(start=-4,stop=4,num=100)
ys_pdf = stats.norm.pdf(x=xs)
plt.plot(xs,ys_pdf,label='norm pdf')

# construct the cdf curve
ys_cdf = stats.norm.cdf(x=xs)
plt.plot(xs,ys_cdf,label='norm cdf')

plt.title('hist vs. pdf vs. cdf (Normal Distribution)')
plt.legend()
plt.grid()
plt.show()

**`ppf`: Percent Point Function (Inverse of CDF)**

In [None]:
stats.norm.ppf([0.05,0.95])

**Central limit theorem**

https://en.wikipedia.org/wiki/Central_limit_theorem

In probability theory, the **central limit theorem (CLT)** establishes that in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a bell curve), even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory, because it implies that probabilistic and statistical methods that work for normal distributions can be applied to many problems involving other types of distributions.

More specifically, the central limit theorem states that if $X_{1},X_{2},...,X_{n}$ is a random sample of size $n$ taken from a population with mean $\mu$ and finite variance $\sigma^2$, and $\bar{X}_n$ is the sample mean, then the limiting distribution of $Z=\frac{\bar{X}_n-\mu}{\sigma/\sqrt{n}}$ as $n\to \infty$ is the standard normal distribution.

![](https://i.ytimg.com/vi/4YLtvNeRIrg/maxresdefault.jpg)
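The theorem can be illustrated with a small simulation. The sketch below assumes an exponential population (strongly skewed, with mean 1 and standard deviation 1), draws many samples of size `n`, and standardizes each sample mean as in the formula for $Z$. The sample size and number of repeats are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(123)

n = 50               # size of each sample
n_repeats = 20_000   # number of independent samples
mu, sigma = 1.0, 1.0 # mean and sd of the exponential(scale=1) population

# Each row is one sample of size n from the skewed population
samples = rng.exponential(scale=1.0, size=(n_repeats, n))

# Standardized sample means: Z = (xbar - mu) / (sigma / sqrt(n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# If Z is roughly N(0, 1): mean near 0, sd near 1, about 95% within ±1.96
coverage = np.mean(np.abs(z) <= 1.96)
print(z.mean(), z.std(), coverage)
```

Plotting `plt.hist(z, density=True)` against `stats.norm.pdf` would show the same agreement visually.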

### $t$ distribution

The $t$ distribution is the sampling distribution of the standardized sample mean for samples from a normally distributed population, when the standard deviation is estimated from the sample. It is typically used for small samples, when the true SD is not known.

If $\bar{x}$ is the sample mean, and $s$ is the sample standard deviation, then
$$
\frac{\bar{x}-\mu}{s/\sqrt{n}} \sim t_{\nu}
$$
where $\nu=n-1$ is the number of degrees of freedom and $n$ is the sample size.

As $n$ grows, the $t$ distribution approaches the standard normal distribution.
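One way to see this convergence numerically is to compare the 97.5% quantile of the $t$ distribution for increasing degrees of freedom with the corresponding normal quantile. The degrees of freedom below are arbitrary illustrative values.

```python
from scipy import stats

# 97.5% quantile of the standard normal distribution
z_crit = stats.norm.ppf(0.975)

# Same quantile for t distributions with increasing degrees of freedom
t_crits = {dof: stats.t.ppf(0.975, df=dof) for dof in (5, 20, 100, 1000)}

print(f"normal: {z_crit:.4f}")
for dof, t_crit in t_crits.items():
    print(f"t (df={dof}): {t_crit:.4f}")
```

The $t$ quantiles are always larger than the normal one (heavier tails) and shrink toward it as the degrees of freedom increase.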

In [None]:
# generate random numbers from the t distribution

n = 20
df = n - 1  # degrees of freedom; note that this rebinds the name `df`, which held the crime DataFrame earlier
rvs = stats.t.rvs(df=df,size=5,random_state=123)
rvs

In [None]:
rvs = stats.t.rvs(df,size=5,random_state=123)
rvs

In [None]:
# plot the pdf of t distribution
xs = np.linspace(start=-4,stop=4,num=100)
ys_1 = stats.t.pdf(x=xs,df=1)
ys_5 = stats.t.pdf(x=xs,df=5)
ys_10 = stats.t.pdf(x=xs,df=10)
ys_20 = stats.t.pdf(x=xs,df=20)
ys_100 = stats.t.pdf(x=xs,df=100)
plt.plot(xs,ys_1,label='df=1')
plt.plot(xs,ys_5,label='df=5')
plt.plot(xs,ys_10,label='df=10')
plt.plot(xs,ys_20,label='df=20')
plt.plot(xs,ys_100,label='df=100')
# plot the pdf of standard normal distribution
ys = stats.norm.pdf(x=xs)
plt.plot(xs,ys,'o',label='norm pdf')
plt.legend()
plt.show()

In [None]:
# plot the cdf of t distribution
xs = np.linspace(start=-4,stop=4,num=100)
ys_1 = stats.t.cdf(x=xs,df=1)
ys_5 = stats.t.cdf(x=xs,df=5)
ys_10 = stats.t.cdf(x=xs,df=10)
ys_20 = stats.t.cdf(x=xs,df=20)
ys_100 = stats.t.cdf(x=xs,df=100)
plt.plot(xs,ys_1,label='df=1')
plt.plot(xs,ys_5,label='df=5')
plt.plot(xs,ys_10,label='df=10')
plt.plot(xs,ys_20,label='df=20')
plt.plot(xs,ys_100,label='df=100')
# plot the cdf of standard normal distribution
ys = stats.norm.cdf(x=xs)
plt.plot(xs,ys,'o',label='norm cdf')
plt.legend()
plt.show()

**`ppf`: Percent Point Function (Inverse of CDF)**

In [None]:
# when n=20, df = n-1 =19
stats.t.ppf([0.05,0.95],df=df)

### $\chi^2$ distribution
The $\chi^2$ (chi-square) distribution describes the distribution of the sum of squares of random variates from a standard normal distribution. The sum of squares of $n$ independent random samples from the standard normal distribution follows a chi-square distribution with $n$ degrees of freedom:
$$
\sum_{i=1}^{n}X_i^2 \sim \chi_n^2
$$
For $n$ independent random samples from a normal distribution with standard deviation $\sigma$, the following test statistic follows the chi-square distribution with $n-1$ degrees of freedom:
$$
\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{\sigma}\right)^2 = \frac{(n-1)s^2}{\sigma^2} \sim \chi_{n-1}^2
$$
where $s$ stands for the sample standard deviation. This can be used in a hypothesis test that compares the sample standard deviation with a hypothesized population standard deviation.

It is also commonly used to test statistical independence or association between two or more categorical variables, using the following test statistic:
$$
\sum_{i=1}^{k}\frac{(\text{frequency}_{\text{observed},i} - \text{frequency}_{\text{expected},i})^2}{\text{frequency}_{\text{expected},i}} \sim \chi_{df}^2
$$
where
$$
df = k - 1 - (\text{number of parameters estimated})
$$
See an example here: http://sites.utexas.edu/sos/guided/inferential/categorical/chi2/
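As a minimal sketch of the observed-versus-expected statistic, consider a goodness-of-fit test for a die. The observed counts below are hypothetical, and no parameters are estimated from the data, so $df = k - 1 = 5$. The manual calculation is compared with `scipy.stats.chisquare`.

```python
import numpy as np
from scipy import stats

# Hypothetical counts from 120 rolls of a six-sided die
observed = np.array([18, 22, 16, 25, 19, 20])
# Expected counts under the null hypothesis that the die is fair
expected = np.full(6, observed.sum() / 6)

# Manual test statistic: sum of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = len(observed) - 1  # k - 1, no estimated parameters
p_value = stats.chi2.sf(chi2_stat, df=dof)  # upper-tail probability

# Same test via scipy
chi2_check, p_check = stats.chisquare(observed, f_exp=expected)

print(chi2_stat, p_value)
print(chi2_check, p_check)
```

Here the statistic is 2.5 on 5 degrees of freedom, so there is no evidence against a fair die at the 5% level.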


In [None]:
n = 20
df = n-1
rvs = stats.chi2.rvs(df=df,size=5,random_state=123)
rvs

In [None]:
# plot the pdf of chi2 distribution
plt.figure(figsize=(20,10))

xs = np.linspace(start=0,stop=100,num=1000)
ys_1 = stats.chi2.pdf(x=xs,df=1)
ys_2 = stats.chi2.pdf(x=xs,df=2)
ys_3 = stats.chi2.pdf(x=xs,df=3)
ys_4 = stats.chi2.pdf(x=xs,df=4)
ys_6 = stats.chi2.pdf(x=xs,df=6)
ys_9 = stats.chi2.pdf(x=xs,df=9)
plt.plot(xs,ys_1,label='df=1')
plt.plot(xs,ys_2,label='df=2')
plt.plot(xs,ys_3,label='df=3')
plt.plot(xs,ys_4,label='df=4')
plt.plot(xs,ys_6,label='df=6')
plt.plot(xs,ys_9,label='df=9')
plt.xlim(0,20)
plt.ylim(0,0.5)
plt.legend()
plt.show()


In [None]:
# plot the cdf of chi2 distribution
plt.figure(figsize=(20,10))

xs = np.linspace(start=0,stop=100,num=1000)
ys_1 = stats.chi2.cdf(x=xs,df=1)
ys_2 = stats.chi2.cdf(x=xs,df=2)
ys_3 = stats.chi2.cdf(x=xs,df=3)
ys_4 = stats.chi2.cdf(x=xs,df=4)
ys_6 = stats.chi2.cdf(x=xs,df=6)
ys_9 = stats.chi2.cdf(x=xs,df=9)
plt.plot(xs,ys_1,label='df=1')
plt.plot(xs,ys_2,label='df=2')
plt.plot(xs,ys_3,label='df=3')
plt.plot(xs,ys_4,label='df=4')
plt.plot(xs,ys_6,label='df=6')
plt.plot(xs,ys_9,label='df=9')
plt.xlim(0,20)
plt.ylim(0,1)
plt.legend()
plt.show()

**`ppf`: Percent Point Function (Inverse of CDF)**

In [None]:
# when df=9
stats.chi2.ppf([0.05,0.95],df=9)

### $F$ distribution

This distribution is named after Sir Ronald Fisher, who developed the F distribution for use in determining critical values in **ANOVAs** (`Analysis Of Variance`).

If we want to investigate whether two groups have the same variance, we calculate the ratio of the two sample variances (label the groups so that $S_1^2 > S_2^2$). If the population variances are equal, this ratio follows an $F$ distribution:
$$
F = \frac{S_1^2}{S_2^2} \sim F_{df_1,df_2}
$$
In general, $F_{df_1,df_2}$ is the distribution of the ratio
$$
\frac{\chi_{df_1}^2/df_1}{\chi_{df_2}^2/df_2}
$$
where $\chi_{df_1}^2$ and $\chi_{df_2}^2$ are independent chi-squared statistics of sample one and two respectively, and $df_1$ and $df_2$ are their degrees of freedom, in which case
$$
df_1 = N_1-1
$$
and
$$
df_2 = N_2-1
$$
($N_1$ and $N_2$ are the sample sizes of the two samples).
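A minimal sketch of such a variance-ratio test from summary statistics follows. The standard deviations and sample sizes are hypothetical, and the test assumes both samples come from normal populations. The two-sided p-value is taken as twice the smaller tail probability, which is one common convention.

```python
from scipy import stats

# Hypothetical summary statistics; group 1 has the larger sample variance
s1, n1 = 5.7, 16
s2, n2 = 4.3, 9

F = s1**2 / s2**2          # ratio of sample variances
dfn, dfd = n1 - 1, n2 - 1  # numerator and denominator degrees of freedom

p_upper = stats.f.sf(F, dfn=dfn, dfd=dfd)    # P(F_{dfn,dfd} >= F)
p_two_sided = 2 * min(p_upper, 1 - p_upper)  # two-sided p-value

print(F, p_upper, p_two_sided)
```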

In [None]:
# plot the pdf of F distribution
plt.figure(figsize=(20,10))

xs = np.linspace(start=0,stop=5,num=1000)
ys_11 = stats.f.pdf(x=xs,dfn=1,dfd=1)
ys_21 = stats.f.pdf(x=xs,dfn=2,dfd=1)
ys_52 = stats.f.pdf(x=xs,dfn=5,dfd=2)
ys_100100 = stats.f.pdf(x=xs,dfn=100,dfd=100)
plt.plot(xs,ys_11,label='F(1/1)')
plt.plot(xs,ys_21,label='F(2/1)')
plt.plot(xs,ys_52,label='F(5/2)')
plt.plot(xs,ys_100100,label='F(100/100)')

plt.xlim(0,3)
plt.ylim(0,3)
plt.legend()
plt.show()

In [None]:
# plot the cdf of F distribution
plt.figure(figsize=(20,10))

xs = np.linspace(start=0,stop=5,num=1000)
ys_11 = stats.f.cdf(x=xs,dfn=1,dfd=1)
ys_21 = stats.f.cdf(x=xs,dfn=2,dfd=1)
ys_52 = stats.f.cdf(x=xs,dfn=5,dfd=2)
ys_100100 = stats.f.cdf(x=xs,dfn=100,dfd=100)
plt.plot(xs,ys_11,label='F(1/1)')
plt.plot(xs,ys_21,label='F(2/1)')
plt.plot(xs,ys_52,label='F(5/2)')
plt.plot(xs,ys_100100,label='F(100/100)')

plt.xlim(0,3)
plt.ylim(0,1)
plt.legend()
plt.show()

**`ppf`: Percent Point Function (Inverse of CDF)**

In [None]:
# when dfn=100, dfd=100
stats.f.ppf([0.05,0.95],dfn=100,dfd=100)

## Hypothesis Test

wikipedia: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing

A `hypothesis` is a statement about a parameter. A `hypothesis test` is a standard procedure for testing such a statement, and typically we need to choose between two complementary hypotheses:
* `Null hypothesis` ($H_0$): A statement about an established fact of a parameter. The null hypothesis is generally assumed to be true until evidence indicates otherwise (similar to how a defendant in a jury trial is presumed innocent until proven guilty). It is normally expressed as a math equation, and **it must contain a condition of equality, such as $=,\geq, \leq $**.
* `Alternative hypothesis` ($H_1$): A statement that the parameter has a value that differs from the null hypothesis. It needs strong support from the data to change our thinking, and it contradicts $H_0$. Expressed as a math statement, it contains $\neq, <, >$.

We also need a `test statistic` (a quantity derived from the sample). Typically it is selected or defined in such a way as to quantify, within observed data, behaviours that would distinguish the `null` from the `alternative hypothesis`, where such an alternative is prescribed, or that would characterize the null hypothesis if there is no explicitly stated alternative hypothesis. Normally, we should have a good idea on the sampling distribution of the test statistic.
* List of commonly used `test statistic`: https://en.wikipedia.org/wiki/Test_statistic

| Null Hypothesis        | Alternative Hypothesis      | Type of Alternative |
|------------------------|-----------------------------|---------------------|
|                        | $H_1$: $\theta < \theta_0$    | lower one-sided     |
| $H_0$: $\theta=\theta_0$ | $H_1$: $\theta > \theta_0$    | upper one-sided     |
|                        | $H_1$: $\theta \neq \theta_0$ | two-sided           |

|                                   | $H_0$ is true (Truly not guilty) |    $H_1$ is true (Truly guilty)   |
|-----------------------------------|---------------------------|----------------------------|
| Fail to reject null hypothesis (acquittal) |        Right decision       | Wrong decision (**Type II Error**) |
| Reject null hypothesis (conviction) | Wrong decision (**Type I Error**) |        Right decision        |
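The Type I error rate in the table is what the significance level $\alpha$ controls. As a rough illustration, the sketch below simulates many experiments in which $H_0$ is actually true (normal data with mean 0), runs a two-sided one-sample t-test on each, and counts how often $H_0$ is wrongly rejected. The sample size and number of experiments are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

alpha = 0.05
n = 30                  # observations per experiment
n_experiments = 10_000  # number of simulated experiments

# Data generated under H0: the true mean really is 0
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n))

# One-sample t statistic for H0: mu = 0, computed row by row
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - 1)

# Fraction of experiments where H0 is rejected although it is true
type1_rate = np.mean(p_values < alpha)
print(type1_rate)
```

The observed rejection rate should be close to `alpha`.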

### One Population Proportion

**Example: How to tell if a coin is fair?**

**Problem**: Suppose we tossed a coin 100 times and obtained 38 Heads and 62 Tails. Is the coin biased?

$H_0$: $p_{head} = 0.5$

$H_1$: $p_{head} \neq 0.5$ (two-sided)

Significance level $\alpha=0.05$

`Test statistic`: $z = \frac{\hat{p}-p_0}{SD(p_0)} \sim N(0,1)$ according to `central limit theorem`, where $SD(p_0) =\sqrt{\frac{p_0q_0}{n}} = \sqrt{\frac{p_0(1-p_0)}{n}} $.
![](https://www.investopedia.com/thmb/pF9cbALKXUA617NzyoKozi1B0rQ=/954x380/filters:no_upscale():max_bytes(150000):strip_icc()/Clipboard01-5c94e6b446e0fb00010ae8ed.jpg)

In [None]:
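n = 100
p = 38/n  # sample proportion of heads (p-hat)
# Note: this standard error uses the sample proportion p-hat, not p0 = 0.5 as in the formula above;
# it matches the default behavior of statsmodels' proportions_ztest used later
sd = (p*(1-p)/n)**0.5
z = (p-0.5)/sd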

In [None]:
z

In [None]:
p = 2 * stats.norm.cdf(-abs(z)) # two-sided p-value; -abs(z) keeps it valid for either sign of z (this rebinds p)

In [None]:
p

In [None]:
stats.norm.ppf([0.025,0.975]) # critical values at the 5% significance level for the two-sided alternative hypothesis

We can also do a t-test: $t = \frac{\hat{p}-p_0}{SD(p_0)} \sim t_{n-1}$

In [None]:
p = 2 * stats.t.cdf(-abs(z), df = n-1) # two-sided
p

In [None]:
stats.t.ppf([0.025,0.975], df=n-1) # critical values at the 5% significance level for the two-sided alternative hypothesis

We can also use the `statsmodels` library to do the z-test

In [None]:
import statsmodels.api as sm

In [None]:
sm.stats.proportions_ztest(count=38,nobs=100,value=0.5,alternative='two-sided')
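The cells above compute the standard error from the sample proportion $\hat{p}$, whereas the test statistic formula uses the null value $p_0$. For comparison, here is a sketch of the same test with $SD(p_0)$; the variable names (`se0`, `z0`) are introduced only for this illustration. As far as I know, `proportions_ztest` also accepts a `prop_var` argument for this purpose, but the sketch sticks to `scipy`.

```python
import math
from scipy import stats

n, heads, p0 = 100, 38, 0.5

p_hat = heads / n
se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0, using p0
z0 = (p_hat - p0) / se0

# Two-sided p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z0))
print(z0, p_value)
```

With $p_0$ in the standard error, $z = -2.4$, slightly smaller in magnitude than the value obtained with $\hat{p}$; the conclusion at $\alpha = 0.05$ is the same.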

### Two Population Proportion Difference

**Problem**: A car manufacturer aims to improve product quality by reducing defects and to increase customer satisfaction. To that end, the manufacturer monitors the efficiency of two assembly lines on the shop floor. In line A, 18 defects are reported out of 200 samples, while line B shows 25 defects out of 600 cars. At α = 5%, is the difference between the two assembly procedures significant?

$H_0$: $p_1 - p_2 = 0$

$H_1$: $p_1 - p_2 \neq 0$ (two-sided)

Significance level $\alpha=0.05$

`Test statistic`: $z = \frac{\hat{p_1}-\hat{p_2} - 0}{SD} \sim N(0,1)$ according to `central limit theorem`, where $SD = \sqrt{p_0(1-p_0)(\frac{1}{n_1}+\frac{1}{n_2})}$, and $p_0 = \frac{x_1+x_2}{n_1+n_2}$.

In [None]:
import math

x1 = 18
n1 = 200
x2 = 25
n2 = 600

p1 = x1/n1
p2 = x2/n2
p0 = (x1+x2)/(n1+n2)
# sd = math.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
sd = math.sqrt(p0*(1-p0)*(1/n1 + 1/n2))
z = (p1-p2)/sd
z

In [None]:
p = 2 * (1-stats.norm.cdf(abs(z))) # two-sided; abs(z) keeps it valid for either sign of z
p

In [None]:
p1, p2

We can also use the `statsmodels` library to do the z-test

In [None]:
sm.stats.proportions_ztest(
    count=np.array([x1,x2]),
    nobs=np.array([n1,n2]),
    value=0,
    alternative='two-sided'
)

In [None]:
sm.stats.proportions_ztest(
    count=np.array([x1,x2]),
    nobs=np.array([n1,n2]),
    value=0,
    alternative='larger'
)

### One Population Mean

**Problem**: Your company wants to improve sales. Past sales data indicate that the average sale was \\$100 per transaction. After training your sales force, recent sales data (taken from a sample of 25 salesmen) indicates an average sale of \\$130, with a standard deviation of \\$15. Did the training work? Test your hypothesis at a 5\% alpha level.

$H_0$: $\mu = \mu_0$

$H_1$: $\mu > \mu_0$ (upper one-sided)

Significance level $\alpha=0.05$

`z-test`: $z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim N(0,1)$, where $s$ is the sample standard deviation

`t-test`: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}$, where $s$ is the sample standard deviation, and $n$ is the sample size

In [None]:
## z-test
mu0 = 100
xbar = 130
n = 25
s = 15

z = (xbar - mu0)/(s/math.sqrt(n))
z

In [None]:
p = 1-stats.norm.cdf(z)
p

In [None]:
## t-test
t = z
p = 1-stats.t.cdf(t, df=n-1)
p
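The same summary statistics also give a confidence interval for the mean sale after training. A minimal sketch of a two-sided 95% interval based on the $t$ distribution, $\bar{x} \pm t_{0.975,\,n-1}\, s/\sqrt{n}$, follows; the two-sided 95% level is an assumption made for this illustration.

```python
import math
from scipy import stats

xbar, s, n = 130, 15, 25
alpha = 0.05

# Critical value of the t distribution with n-1 degrees of freedom
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
margin = t_crit * s / math.sqrt(n)  # half-width of the interval

ci = (xbar - margin, xbar + margin)
print(ci)
```

The interval is roughly (123.8, 136.2), which lies entirely above the historical average of 100 and agrees with the test result.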

### Two Population Mean Difference

**Problem #1**: Does right‐ or left‐handedness affect how fast people type? Random samples of students from a typing class are given a typing speed test (words per minute), and the results are compared. Significance level for the test: 0.10. Because you are looking for a difference between the groups in either direction (right‐handed faster than left, or vice versa), this is a two‐tailed test.

| Group | Handedness | n  | $\bar{x}$ | s   |
|-------|------------|----|-----------|-----|
| 1     | Left       | 9  | 59.3      | 4.3 |
| 2     | Right      | 16 | 55.8      | 5.7 |

$H_0$: $\mu_1 - \mu_2 = 0$

$H_1$: $\mu_1 - \mu_2 \neq 0$ (two-sided)

Significance level $\alpha=0.10$

Assuming the two groups have the same variance, we could do either a `z-test` or a `t-test`.
$$
\frac{\bar{x_1} - \bar{x_2} - 0}{\sqrt{s_p^2(\frac{1}{n_1} + \frac{1}{n_2})}} \sim N(0,1)
$$
or
$$
\frac{\bar{x_1} - \bar{x_2} - 0}{\sqrt{s_p^2(\frac{1}{n_1} + \frac{1}{n_2})}} \sim t_{n_1+n_2-2}
$$
Here, $s_p^2$ is the pooled variance: $s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$


In [None]:
n1 = 9
n2 = 16

xbar1 = 59.3
xbar2 = 55.8

s1 = 4.3
s2 = 5.7

sp = math.sqrt(((n1-1)*(s1**2) + (n2-1)*(s2**2))/(n1+n2-2))
sp

In [None]:
# z test
z = (xbar1 - xbar2)/(sp*math.sqrt(1/n1+1/n2))
z

In [None]:
p = (1-stats.norm.cdf(z))*2
p

In [None]:
## t-test
t = z
p = (1-stats.t.cdf(t, df=n1+n2-2))*2
p
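The pooled t-test above can be cross-checked with `scipy.stats.ttest_ind_from_stats`, which takes the summary statistics directly. With `equal_var=True` it uses the pooled variance and $n_1+n_2-2$ degrees of freedom. The manual recomputation in the sketch is there only for comparison.

```python
import math
from scipy import stats

n1, xbar1, s1 = 9, 59.3, 4.3    # left-handed group
n2, xbar2, s2 = 16, 55.8, 5.7   # right-handed group

# Pooled two-sample t-test from summary statistics
res = stats.ttest_ind_from_stats(
    mean1=xbar1, std1=s1, nobs1=n1,
    mean2=xbar2, std2=s2, nobs2=n2,
    equal_var=True,
)

# Manual version of the same statistic
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_manual = (xbar1 - xbar2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

print(res.statistic, res.pvalue)
print(t_manual, p_manual)
```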

**Problem #2**: An experiment is conducted to determine whether intensive tutoring (covering a great deal of material in a fixed amount of time) is more effective than paced tutoring (covering less material in the same amount of time). Two randomly chosen groups are tutored separately and then administered proficiency tests. Use a significance level of α = 0.05.

| Group | Method | n  | $\bar{x}$ | s   |
|-------|------------|----|-----------|-----|
| 1     | Intensive       | 12  | 46.31      | 6.44 |
| 2     | Paced      | 10 | 42.79      | 7.52 |


$H_0$: $\mu_1 - \mu_2 = 0$

$H_1$: $\mu_1 - \mu_2 \neq 0$ (two-sided)

Significance level $\alpha=0.05$

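This time the variances are not pooled: the standard error is built from each group's own variance. We could again do either a `z-test` or a `t-test`. (The $n_1+n_2-2$ degrees of freedom used for the $t$ version below is a simplification; Welch's approximation gives a somewhat smaller value.)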
$$
\frac{\bar{x_1} - \bar{x_2} - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \sim N(0,1)
$$
or
$$
\frac{\bar{x_1} - \bar{x_2} - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \sim t_{n_1+n_2-2}
$$



In [None]:
n1 = 12
n2 = 10

xbar1 = 46.31
xbar2 = 42.79

s1 = 6.44
s2 = 7.52

# z-test
z = (xbar1-xbar2)/math.sqrt(s1**2/n1 + s2**2/n2)
z

In [None]:
p = (1-stats.norm.cdf(z))*2
p

In [None]:
# t-test

t = z
p = (1-stats.t.cdf(t, df=n1+n2-2))*2
p
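When the variances are not pooled, the usual choice is Welch's t-test, which keeps the same test statistic but replaces $n_1+n_2-2$ with the Welch-Satterthwaite degrees of freedom. The sketch below computes these by hand and compares with `scipy.stats.ttest_ind_from_stats(..., equal_var=False)`; names such as `welch_df` are introduced only for this illustration.

```python
import math
from scipy import stats

n1, xbar1, s1 = 12, 46.31, 6.44  # intensive tutoring
n2, xbar2, s2 = 10, 42.79, 7.52  # paced tutoring

# Per-group variance of the sample mean
v1, v2 = s1**2 / n1, s2**2 / n2

t_welch = (xbar1 - xbar2) / math.sqrt(v1 + v2)
# Welch-Satterthwaite approximation to the degrees of freedom
welch_df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
p_manual = 2 * stats.t.sf(abs(t_welch), df=welch_df)

# Same test via scipy (equal_var=False selects Welch's test)
res = stats.ttest_ind_from_stats(
    mean1=xbar1, std1=s1, nobs1=n1,
    mean2=xbar2, std2=s2, nobs2=n2,
    equal_var=False,
)

print(t_welch, welch_df, p_manual)
print(res.statistic, res.pvalue)
```

The Welch degrees of freedom come out below $n_1+n_2-2 = 20$, so the p-value is slightly larger than the one computed with 20 degrees of freedom.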