In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.weightstats import ztest

In [2]:
df = pd.read_csv('./data/train.csv')

In [3]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [6]:
mean_count = df.groupby('Neighborhood')['SalePrice'].agg(['mean','count'])

In [7]:
mean_count['diff'] = mean_count['mean'] - df['SalePrice'].mean()

In [8]:
mean_count.sort_values(by='diff')

Neighborhood,mean,count,diff
MeadowV,98576.470588,17,-82344.725302
IDOTRR,100123.783784,37,-80797.412107
BrDale,104493.75,16,-76427.44589
BrkSide,124834.051724,58,-56087.144166
Edwards,128219.7,100,-52701.49589
OldTown,128225.300885,113,-52695.895005
Sawyer,136793.135135,74,-44128.060755
Blueste,137500.0,2,-43421.19589
SWISU,142591.36,25,-38329.83589
NPkVill,142694.444444,9,-38226.751446


In [9]:
nr_df = df[ df['Neighborhood'] == 'NridgHt'].copy()
ot_df = df[ df['Neighborhood'] == 'OldTown'].copy()
sw_df = df[ df['Neighborhood'] == 'SawyerW'].copy()

## Hypothesis:

$H_0$ is that there is no statistically significant difference between the sample mean and the population mean.

$H_a$ is that there is a difference **(two-sided)**



For future reference, because scipy uses different terms (`greater` and `less`) for the one-sided alternatives:

the mean price is greater than the population mean **(one-sided, right-tailed)**

the mean price is less than the population mean **(one-sided, left-tailed)**
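Each alternative corresponds to a different tail area of the standard normal distribution. A minimal sketch, assuming a z statistic has already been computed; the helper name `p_value_from_z` is hypothetical, and its labels simply mirror scipy's `two-sided`, `greater`, and `less`:

```python
import scipy.stats as stats


def p_value_from_z(z, alternative="two-sided"):
    """Convert a z statistic into a p-value (hypothetical helper)."""
    if alternative == "two-sided":
        # both tails: twice the area beyond |z|
        return 2 * stats.norm.sf(abs(z))
    if alternative == "greater":
        # right tail: area above z
        return stats.norm.sf(z)
    if alternative == "less":
        # left tail: area below z
        return stats.norm.cdf(z)
    raise ValueError(f"unknown alternative: {alternative!r}")


for alt in ("two-sided", "greater", "less"):
    print(alt, p_value_from_z(1.5, alt))
```

For any given z, the `greater` and `less` p-values sum to 1, and the two-sided p-value is twice the smaller of the two.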

T-test uses the [t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution)

Z-test uses the [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution)

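As a rule of thumb, use the t-test when the sample is small (n < 30) or the population standard deviation is unknown.

As the sample size grows (and with it the degrees of freedom), the t-distribution approaches the normal distribution; by around n = 30 the two are already close.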
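To see how quickly the t-distribution closes in on the normal, we can compare two-sided 95% critical values at a few degrees of freedom. This is only an illustrative sketch; the list of degrees of freedom is an arbitrary choice:

```python
import scipy.stats as stats

# two-sided 95% critical value under the normal distribution
z_crit = stats.norm.ppf(0.975)

# the same cutoff under the t-distribution for growing degrees of freedom
dfs = [5, 10, 30, 100, 1000]
t_crits = [stats.t.ppf(0.975, dof) for dof in dfs]

for dof, t_crit in zip(dfs, t_crits):
    print(f"df={dof:>4}: t critical = {t_crit:.4f} (normal: {z_crit:.4f})")
```

The t critical values shrink toward the normal value as the degrees of freedom increase, which is why the choice between the two tests matters less for large samples.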

## Z-Test

When we are working with a sampling distribution, the z score is equal to <br><br>  $\Large z = \dfrac{{\bar{x}} - \mu_{0}}{\dfrac{\sigma}{\sqrt{n}}}$

$\bar{x}$ is the sample mean.
<br>$\mu_{0}$ is the mean associated with the null hypothesis.
<br>$\sigma$ is the population standard deviation.
<br>$n$ is the sample size; dividing by $\sqrt{n}$ reflects that we are dealing with a sample of the population, not the entire population.

The denominator, $\frac{\sigma}{\sqrt{n}}$, is the standard error.
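The formula can be sketched as a small function. The name `z_score` and the numbers in the example call are made up for illustration and do not come from the housing data:

```python
import numpy as np


def z_score(x_bar, mu_0, sigma, n):
    """z statistic for a sample mean (hypothetical helper)."""
    standard_error = sigma / np.sqrt(n)  # sigma / sqrt(n)
    return (x_bar - mu_0) / standard_error


# illustrative values: sample mean 105, null mean 100,
# population standard deviation 15, sample size 36
z_example = z_score(x_bar=105, mu_0=100, sigma=15, n=36)
print(z_example)
```

With these values the standard error is 15 / 6 = 2.5, so the sample mean sits 5 / 2.5 = 2 standard errors above the null mean.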

In [10]:
pop_mu = df['SalePrice'].mean()
pop_mu

180921.19589041095

In [11]:
pop_std = df['SalePrice'].std()
pop_std

79442.50288288662

In [12]:
x_bar = nr_df['SalePrice'].mean()
x_bar

316270.6233766234

In [13]:
n = nr_df.shape[0]
n

77

In [14]:
# z score
z = (x_bar - pop_mu)/(pop_std/np.sqrt(n))

z

14.950264190395693

In [28]:
# cdf gives the percentile: the area under the normal curve to the left of |z|
print(stats.norm.cdf(abs(z)))

# the survival function (1 - cdf) gives the upper-tail area, i.e. the one-tailed p-value
print(stats.norm.sf(abs(z)))



Now, a neighborhood whose mean is close to the population mean

In [16]:
x_bar = sw_df['SalePrice'].mean()
x_bar

186555.7966101695

In [17]:
n = sw_df.shape[0]
n

59

In [18]:
# z score
z = (x_bar - pop_mu)/(pop_std/np.sqrt(n))
z

0.5447989147989706

In [27]:
# cdf gives the percentile: the area under the normal curve to the left of |z|
print(stats.norm.cdf(abs(z)))

# the survival function (1 - cdf) gives the upper-tail area, i.e. the one-tailed p-value
print(stats.norm.sf(abs(z)))



**LEFT TAIL**

In [29]:
x_bar = ot_df['SalePrice'].mean()
n = ot_df.shape[0]
z = (x_bar - pop_mu)/(pop_std/np.sqrt(n))
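# OldTown's mean is below the population mean, so z is negative;
# taking abs(z) lets us reuse the same calls as above

print("Left Tail Test")
# cdf(abs(z)) is the percentile of |z|; OldTown's own percentile is cdf(z)
print('Percentile for OldTown ', stats.norm.cdf(abs(z)))

# By symmetry, sf(abs(z)) equals cdf(z), the left-tail p-value
print('P-Value (survival function) for OldTown ',stats.norm.sf(abs(z)))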

0.9999999999991132
8.869008726566645e-13


statsmodels

In [30]:
# statsmodels ztest, one-sample
ztest_score, p_value = ztest(nr_df['SalePrice'], value=pop_mu, alternative='two-sided')
ztest_score, p_value

(12.321351268471998, 6.951954865727274e-35)
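This statistic (12.32) differs from the hand-computed z for NridgHt (14.95). The likely reason is that `ztest` estimates the standard deviation from the sample it is given, while the manual calculation used the standard deviation of the full data set. A sketch that redoes the arithmetic with the sample variance, using figures printed elsewhere in this notebook (the variable names are illustrative):

```python
import numpy as np

# figures copied from the notebook output for NridgHt
x_bar = 316270.6233766234          # sample mean
pop_mu = 180921.19589041095        # mean under the null hypothesis
sample_var = 9291522722.790499     # sample variance (ddof=1)
n = 77                             # sample size

# standard error built from the sample standard deviation
sample_se = np.sqrt(sample_var) / np.sqrt(n)
stat_from_sample_std = (x_bar - pop_mu) / sample_se
print(stat_from_sample_std)
```

The result agrees with the `ztest` statistic, and it is the same number as the t statistic for this sample: the two tests share the statistic and differ only in the distribution used for the p-value.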

In [31]:
# statsmodels ztest, two-sample
ztest_score, p_value = ztest(x1=nr_df['SalePrice'], x2=ot_df['SalePrice'], alternative='two-sided')
ztest_score, p_value

(17.304855209252345, 4.323667172960236e-67)

## T-Test

> **$t$-test**:
> 
> - Calculate the **$t$-statistic** using the sample's standard deviation $s$:
> $$\large t = \frac{\bar{x}-\mu}{\frac{s}{\sqrt{n}}}$$
> - We calculate the p-value from the **$t$-distribution**

$\bar{x}$ is the sample mean.
<br>$\mu$ is the mean associated with the null hypothesis (written $\mu_{0}$ in the z-test formula).
<br>$s$ is the sample standard deviation.
<br>$n$ is the sample size; dividing by $\sqrt{n}$ reflects that we are dealing with a sample of the population, not the entire population.

One sample (compare to population)

In [32]:
# Let's continue to assume our alpha is 0.05
x_bar = nr_df['SalePrice'].mean()
s = nr_df['SalePrice'].std()
n = nr_df.shape[0]

t_stat = (x_bar - pop_mu)/(s/np.sqrt(n))
t_stat

12.321351268471998

In [33]:
# Calculate our t-critical value t* (left tail at alpha = 0.05; a two-tailed test would put alpha/2 in each tail)
crit_t = stats.t.ppf(0.05, n-1)
crit_t

-1.6651513533271274
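The critical value above comes from the lower 5% tail, which suits a left-tailed test. A short sketch of how the cutoff changes with the alternative, assuming alpha = 0.05 and the 76 degrees of freedom (n - 1 with n = 77) used above:

```python
import scipy.stats as stats

alpha = 0.05
dof = 76  # n - 1 for the NridgHt sample

# one-sided cutoffs: all of alpha sits in a single tail
left_crit = stats.t.ppf(alpha, dof)
right_crit = stats.t.ppf(1 - alpha, dof)

# two-sided cutoff: alpha is split evenly between the tails
two_sided_crit = stats.t.ppf(1 - alpha / 2, dof)

print(left_crit, right_crit, two_sided_crit)
```

For the two-sided test, we reject $H_0$ when the absolute value of the t statistic exceeds `two_sided_crit`, which is larger than the one-sided cutoff.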

In [34]:
# Calculate the p-value (two-tailed, so multiply by 2)
stats.t.sf(abs(t_stat),df=n-1)*2

8.476781919681297e-20
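The manual steps above (t statistic, then a two-tailed p-value from the t-distribution) can be wrapped into one helper and cross-checked against scipy. This is a sketch on synthetic data; the function name `one_sample_t` and the random sample are assumptions for illustration only:

```python
import numpy as np
import scipy.stats as stats


def one_sample_t(sample, mu_0):
    """Two-sided one-sample t-test computed by hand (hypothetical helper)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    s = sample.std(ddof=1)  # sample standard deviation
    t_stat = (sample.mean() - mu_0) / (s / np.sqrt(n))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value


# synthetic sample, not the housing data
rng = np.random.default_rng(0)
sample = rng.normal(loc=52.0, scale=10.0, size=25)

t_manual, p_manual = one_sample_t(sample, mu_0=50.0)
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=50.0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

Note `ddof=1`: pandas' `.std()` uses it by default, while NumPy's `.std()` defaults to `ddof=0`, so it has to be passed explicitly here.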

#### Check it!

In [35]:
# `alternative` was added to ttest_1samp in SciPy 1.6.0; on older versions, omit it (the test is two-sided by default)
t_statistic, p_value = stats.ttest_1samp(nr_df['SalePrice'], popmean=pop_mu)

t_statistic, p_value
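On SciPy releases older than 1.6.0, `ttest_1samp` does not accept `alternative`. One possible workaround, sketched here under the assumption that only the standard `two-sided`, `greater`, and `less` options are needed, is to fall back to the two-sided result and convert it by hand. The wrapper name `ttest_1samp_compat` is hypothetical:

```python
import numpy as np
import scipy.stats as stats


def ttest_1samp_compat(sample, popmean, alternative="two-sided"):
    """One-sample t-test that also works when `alternative` is unsupported."""
    try:
        t_stat, p_value = stats.ttest_1samp(sample, popmean=popmean,
                                            alternative=alternative)
        return t_stat, p_value
    except TypeError:
        # older SciPy: only the two-sided test is available
        t_stat, p_two_sided = stats.ttest_1samp(sample, popmean=popmean)

    if alternative == "two-sided":
        return t_stat, p_two_sided
    if alternative == "greater":
        # halve the p-value when t points in the hypothesized direction
        p_value = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    elif alternative == "less":
        p_value = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return t_stat, p_value


# synthetic sample, not the housing data
rng = np.random.default_rng(0)
sample = rng.normal(loc=52.0, scale=10.0, size=30)
print(ttest_1samp_compat(sample, 50.0, alternative="greater"))
```

The conversion relies on the symmetry of the t-distribution: the two-sided p-value is twice the smaller tail area.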

### Two Sample T-Test

Check the variance of the two samples to see whether assuming equal variances is reasonable.

In [36]:
nr_df['SalePrice'].var()

9291522722.790499

In [37]:
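# mv_df was never defined above; assuming it was meant to hold the MeadowV neighborhood
mv_df = df[df['Neighborhood'] == 'MeadowV'].copy()
mv_df['SalePrice'].var()

In [None]:
# equal_var=False runs Welch's t-test, which does not assume equal variances.
# `alternative` requires SciPy 1.6.0 or later; the default is two-sided.
t_statistic, p_value = stats.ttest_ind(nr_df['SalePrice'],
                                       mv_df['SalePrice'],
                                       equal_var=False)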

t_statistic, p_value
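With `equal_var=False`, `ttest_ind` performs Welch's t-test. A sketch of the underlying calculation, checked against scipy on synthetic data; the helper name `welch_t` and both random groups are assumptions for illustration, not the neighborhood samples:

```python
import numpy as np
import scipy.stats as stats


def welch_t(a, b):
    """Welch's two-sample t-test computed by hand (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # squared standard error of each sample mean
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1)
    )
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value


# synthetic groups with clearly different spreads
rng = np.random.default_rng(1)
group_a = rng.normal(loc=105.0, scale=30.0, size=40)
group_b = rng.normal(loc=100.0, scale=10.0, size=25)

t_manual, dof, p_manual = welch_t(group_a, group_b)
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_manual, dof, p_manual)
print(t_scipy, p_scipy)
```

The degrees of freedom are generally not an integer, and they fall between the smaller sample's n - 1 and the pooled n1 + n2 - 2.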

In [38]:
import scipy
scipy.__version__

'1.5.0'