This notebook, by [felipe.alonso@urjc.es](mailto:felipe.alonso@urjc.es)

In this notebook we will:

1. Solve hypothesis testing exercises for **comparing two proportions**

2. Solve hypothesis testing for **contingency tables**


## Preliminaries

#### How to build a contingency table

- There are different options here, but a quick and easy way is to use the [pd.crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html) function
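
As a minimal sketch of how `pd.crosstab` builds a contingency table, the example below uses a small made-up DataFrame (the column names `group` and `answer` and their values are hypothetical, chosen only for illustration).

```python
import pandas as pd

# Hypothetical toy data: two categorical columns, one row per observation
df = pd.DataFrame({
    "group":  ["smoker", "smoker", "non-smoker", "non-smoker", "non-smoker", "smoker"],
    "answer": ["yes", "no", "yes", "yes", "no", "no"],
})

# Counts of each (group, answer) combination; margins=True adds row/column totals
table = pd.crosstab(df["group"], df["answer"], margins=True)
print(table)

# normalize="index" turns each row into proportions instead of counts
row_props = pd.crosstab(df["group"], df["answer"], normalize="index")
print(row_props)
```

The table with margins is convenient for reading totals, while the version without margins is the one to pass to a chi-square test.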

#### Other uses of chi-square statistic

- [Feature selection](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) in machine learning: if a feature is independent of the target, then it is uninformative.
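
The idea can be sketched with SciPy alone: test each categorical feature against the target and rank the features by p-value. This is not the scikit-learn function linked above, only an illustration of the same reasoning; the toy data and the column names `informative` and `noise` are assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data set: a binary target and two categorical features
toy = pd.DataFrame({
    "target":      [0] * 20 + [1] * 20,
    "informative": ["a"] * 18 + ["b"] * 2 + ["a"] * 3 + ["b"] * 17,  # tied to target
    "noise":       ["x", "y"] * 20,                                  # same split in both classes
})

# Chi-square test of independence between each feature and the target
p_values = {}
for feature in ["informative", "noise"]:
    table = pd.crosstab(toy[feature], toy["target"])
    _, p, _, _ = chi2_contingency(table)
    p_values[feature] = p

# A small p-value is evidence of association with the target (candidate to keep)
for feature, p in sorted(p_values.items(), key=lambda kv: kv[1]):
    print(f"{feature}: p = {p:.4g}")
```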


In [1]:
import pandas as pd
import numpy as np

# Raw string so that backslashes in the Windows path are not read as escape sequences
housing_data = pd.read_csv(r'C:\Users\riul0\Desktop\Inference\inference_prof\inference\data\AmesHousing.csv', sep=',', decimal='.')
housing_data.dtypes

Order               int64
PID                 int64
MS SubClass         int64
MS Zoning          object
Lot Frontage      float64
Lot Area            int64
Street             object
Alley              object
Lot Shape          object
Land Contour       object
Utilities          object
Lot Config         object
Land Slope         object
Neighborhood       object
Condition 1        object
Condition 2        object
Bldg Type          object
House Style        object
Overall Qual        int64
Overall Cond        int64
Year Built          int64
Year Remod/Add      int64
Roof Style         object
Roof Matl          object
Exterior 1st       object
Exterior 2nd       object
Mas Vnr Type       object
Mas Vnr Area      float64
Exter Qual         object
Exter Cond         object
                   ...   
Bedroom AbvGr       int64
Kitchen AbvGr       int64
Kitchen Qual       object
TotRms AbvGrd       int64
Functional         object
Fireplaces          int64
Fireplace Qu       object
Garage Type 

In [2]:
print(housing_data.columns[housing_data.dtypes == 'object'])
print(housing_data['Street'].unique())

pd.crosstab(housing_data['Street'],housing_data['House Style'],margins=True)

Index(['MS Zoning', 'Street', 'Alley', 'Lot Shape', 'Land Contour',
       'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl',
       'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Exter Qual',
       'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating', 'Heating QC',
       'Central Air', 'Electrical', 'Kitchen Qual', 'Functional',
       'Fireplace Qu', 'Garage Type', 'Garage Finish', 'Garage Qual',
       'Garage Cond', 'Paved Drive', 'Pool QC', 'Fence', 'Misc Feature',
       'Sale Type', 'Sale Condition'],
      dtype='object')
['Pave' 'Grvl']


|Street / House Style|1.5Fin|1.5Unf|1Story|2.5Fin|2.5Unf|2Story|SFoyer|SLvl|All|
|---|---|---|---|---|---|---|---|---|---|
|Grvl|2|0|8|0|0|1|1|0|12|
|Pave|312|19|1473|8|24|872|82|128|2918|
|All|314|19|1481|8|24|873|83|128|2930|


In [3]:
housing_data.head(10)

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900
5,6,527105030,60,RL,78.0,9978,Pave,,IR1,Lvl,...,0,,,,0,6,2010,WD,Normal,195500
6,7,527127150,120,RL,41.0,4920,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,213500
7,8,527145080,120,RL,43.0,5005,Pave,,IR1,HLS,...,0,,,,0,1,2010,WD,Normal,191500
8,9,527146030,120,RL,39.0,5389,Pave,,IR1,Lvl,...,0,,,,0,3,2010,WD,Normal,236500
9,10,527162130,60,RL,60.0,7500,Pave,,Reg,Lvl,...,0,,,,0,6,2010,WD,Normal,189000


# 1. Comparing two proportions

### Exercise 1

Time magazine reported the results of a telephone poll of 800 adult Americans (smokers vs non-smokers). The question posed to those surveyed was: "Should the federal tax on cigarettes be raised to pay for health care reform?" The results of the survey were the following:

- 351 out of 605 non-smokers said 'yes'
- 41 out of 195 smokers said 'yes'

<div class="alert alert-block alert-info">
Is there sufficient evidence at the 5% significance level to conclude that the two populations differ significantly with respect to their opinions?
</div>

In [4]:
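from scipy.stats import norm

# Sample sizes and sample proportions of 'yes' answers
n1 = 605   # non-smokers
n2 = 195   # smokers
p1 = 351/n1
p2 = 41/n2

# Pooled proportion under H0: p1 == p2
p_pooled = (351+41)/(n1+n2)

# Standard error of p1 - p2 under H0, and the z statistic
SE = np.sqrt(p_pooled*(1-p_pooled)*(1/n1+1/n2))
z_stat = (p1-p2)/SE

# Two-sided p-value from the standard normal distribution
p_val = 2*norm().cdf(-1*np.abs(z_stat))
print(p1,p2,z_stat, p_val)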


0.5801652892561984 0.21025641025641026 8.985900954503084 2.566230446480293e-19
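
Since the same pooled z-test is reused in the next exercises, the steps can be wrapped in a small function. The sketch below is one possible way to do it; the name `two_proportion_ztest` is hypothetical and not part of SciPy.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided pooled z-test for H0: p1 == p2 (hypothetical helper)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))   # sf = 1 - cdf, more accurate in the tails
    return z, p_value

# Counts from Exercise 1: 351 of 605 non-smokers and 41 of 195 smokers said 'yes'
z, p_value = two_proportion_ztest(351, 605, 41, 195)
print(z, p_value)
```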


### Exercise 2

A 30-year study was conducted with nearly 90,000 female participants. During a 5-year screening period, each woman was randomized to one of two groups: in the first group, women received regular mammograms to screen for breast cancer, and in the second group, women received regular non-mammogram breast cancer exams. No intervention was made during the following 25 years of the study, and we’ll consider death resulting from breast cancer over the full 30-year period. Results from the study are summarized in the following table:

|Treatment|Death from breast cancer|No death from breast cancer|
|---|---:|---:|
|Mammogram|500|44425|
|Control|505|44405|

<div class="alert alert-block alert-info">
Can we conclude that mammograms have no benefits or harm?
</div>

In [3]:
n1 = 500+44425
n2 = 505+ 44405
p_m = 500/n1
p_non_m = 505/n2

p_pooled = (500+505)/(n1+n2)

SE = np.sqrt(p_pooled*(1-p_pooled)*(1/n1+1/n2))         # n1 + n2 = 89,835, i.e. the "nearly 90,000" participants
z_stat = (p_m-p_non_m)/SE

from scipy.stats import norm
p_val =2*(norm().cdf(-1*np.abs(z_stat))) 
print(p_m,p_non_m,z_stat, p_val)

0.011129660545353366 0.011244711645513248 -0.1639329561578003 0.8697839227860011
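
As a cross-check, the pooled two-sided z-test is equivalent to Pearson's chi-square test on the 2x2 table without continuity correction: the chi-square statistic equals the squared z statistic and the p-values coincide. A minimal sketch, assuming the counts from the table above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: mammogram, control; columns: death, no death (counts from the table above)
table = np.array([[500, 44425],
                  [505, 44405]])

# correction=False disables Yates' continuity correction so that the
# statistic equals the square of the pooled z statistic
chi2_stat, p_chi2, dof, expected = chi2_contingency(table, correction=False)

z_stat = -0.1639329561578003   # value printed by the cell above
print(chi2_stat, z_stat**2)    # the two should agree
print(p_chi2)                  # should agree with the two-sided z-test p-value
```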


### Exercise 3

[Meuer and Woessner](https://journals.sagepub.com/doi/abs/10.1177/1477370818809663) describe an experiment to test the effect of electronic monitoring (tagging) on “low-risk” prisoners. Forty-eight (male) prisoners were randomly allocated to two groups:

* In the experimental group, the prisoner served the last part of his sentence under “supervised early work release”, involving the use of an open prison and electronic tagging.
* In the control group, the prisoner served the last part of his sentence in prison, as normal.

Following the end of the sentence, the prisoners were followed up for two years. It was recorded whether each prisoner reoffended. The results were as follows:

|group|sample size|number reoffending|% reoffending|
|---|---|---|---|
|experimental|24|7|29.2%|
|control|30|15|50.0%|

<div class="alert alert-block alert-info">
Can we conclude that early release and tagging of prisoners affect the likelihood of reoffending?
</div>

In [7]:
n1 = 24
n2 = 30
p_exp = 7/24
p_c = 15/30


p_pooled = (7+15)/(n1+n2)

SE = np.sqrt(p_pooled*(1-p_pooled)*(1/n1+1/n2))           # n1 + n2 = 54 in the table, although the text says 48 prisoners; the table values are used
z_stat = (p_exp-p_c)/SE

from scipy.stats import norm

p_val =2*(norm().cdf(-1*np.abs(z_stat))) 
print(p_exp,p_c,z_stat, p_val)

0.2916666666666667 0.5 -1.5482302947089446 0.12156686001105872
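
With samples this small, the normal approximation behind the z-test is only rough, so an exact test is a reasonable cross-check. The sketch below applies `fisher_exact` to the same counts; using Fisher's test here is a suggestion, not part of the original solution.

```python
from scipy.stats import fisher_exact

# Rows: experimental, control; columns: reoffended, did not reoffend
table = [[7, 24 - 7],
         [15, 30 - 15]]

# Two-sided exact test of H0: the odds of reoffending are equal in both groups
odds_ratio, p_fisher = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p_fisher)
```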


# 2. Hypothesis testing for contingency tables

`scipy.stats` provides a number of functions for inference on contingency tables:

- [`chi2_contingency`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html#scipy.stats.chi2_contingency)

- [`fisher_exact`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html#scipy.stats.fisher_exact)

- [`expected_freq`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.expected_freq.html#scipy.stats.contingency.expected_freq)
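
As a quick illustration of `expected_freq`, the sketch below computes the expected counts under independence for a made-up 2x2 table (the numbers are hypothetical) and compares them with the hand formula, row total times column total divided by the grand total.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical 2x2 table of observed counts
observed = np.array([[10, 20],
                     [30, 40]])

# Expected counts under independence of rows and columns
expected = expected_freq(observed)
print(expected)

# The same quantity computed by hand from the margins
by_hand = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
print(by_hand)
```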


### Exercise 4

We consider data from a random sample of 275 jurors in a small county. Jurors identified their racial group, as shown in the following table

|Race|White| Black| Hispanic| Other|
|---|---|---|---|---|
|Representation in juries (counts) |205| 26| 25| 19|    
|Registered voters (proportion)|0.72|0.07|0.12|0.09|
 

<div class="alert alert-block alert-info">
Are these jurors racially representative of the population?
</div>

In [13]:
from scipy.stats import chi2


observed = np.array([205 , 26, 25, 19])
n = observed.sum()
exp_val = np.array([0.72, 0.07, 0.12, 0.09])*n
print(observed,exp_val)

chi2_stat = ((observed-exp_val)**2/exp_val).sum()
print(chi2_stat)

p_value = 1-chi2(df=4-1).cdf(chi2_stat)
print(p_value)

[205  26  25  19] [198.    19.25  33.    24.75]
5.889610389610387
0.1171061913085063
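
The same goodness-of-fit test is available as `scipy.stats.chisquare`, which should reproduce the statistic and p-value computed by hand above. A minimal sketch with the same observed counts and voter proportions:

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([205, 26, 25, 19])
expected = np.array([0.72, 0.07, 0.12, 0.09]) * observed.sum()

# Goodness-of-fit test; degrees of freedom default to (number of categories - 1)
result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)
```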


### Exercise 5

In a survey of 237 students, smoking habits and exercise levels were recorded:

|Smoking status| exercise: regular|exercise: some/none|
|---|---|---|        
|Never|87|102|
|Occasional|12|7|
|Regular|9|8|
|Heavy|7|4|


<div class="alert alert-block alert-info">
Is smoking status independent of exercise level?
</div>

In [8]:
from scipy.stats import  chi2_contingency, chi2   

chi2_val, pvalue ,df, expected_freqs = chi2_contingency( [ [87, 12,9,7] , [102, 7,8,4] ] )
print(chi2_val)
print(pvalue)
df = (2-1)*(4-1)

p_val = 1-chi2(df=df).cdf(chi2_val)
print(p_val)

3.2328182226131847
0.35710308004083213
0.35710308004083213
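
The chi-square approximation relies on the expected counts not being too small. The sketch below inspects the expected counts returned by `chi2_contingency` for this table; the threshold of 5 is a common rule of thumb and an assumption here, not something stated in the exercise.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: regular exercise, some/none; columns: Never, Occasional, Regular, Heavy
observed = np.array([[87, 12, 9, 7],
                     [102, 7, 8, 4]])

_, _, _, expected = chi2_contingency(observed)
print(expected.round(2))

# Rule of thumb (assumption): every expected count should be at least 5
print((expected >= 5).all())
```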


### Exercise 6

The table below shows the observed frequencies of different kinds of crime in three neighborhoods.

|Neighborhood|Violence|Theft|Vandalism|**Total**|
|---|---|---|---|---|
|Neighborhood1|16|25|42|**83**|
|Neighborhood2|15|18|16|**49**|
|Neighborhood3|39|36|30|**105**|
|**Total**|70|79|88|237|


<div class="alert alert-block alert-info">
What are the expected counts of this table? Is there an association between different neighbourhoods and types of crime?
</div>


In [2]:
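from scipy.stats import  chi2_contingency, chi2   

# NOTE: only the Violence and Theft columns are passed here, so this is a
# test on the 3x2 sub-table; the Vandalism column is not included
chi2_val, pvalue ,df, expected_freqs = chi2_contingency( [ [16, 25] , [15, 18],[39, 36] ] )
print(chi2_val)
print('')
df = (2-1)*(3-1)   # degrees of freedom of the 3x2 sub-table
print('')
print(expected_freqs)
print('')
print(pvalue)
p_val = 1-chi2(df=df).cdf(chi2_val)   # same p-value computed by hand
print(p_val)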

1.8313946825018164


[[19.26174497 21.73825503]
 [15.5033557  17.4966443 ]
 [35.23489933 39.76510067]]

0.4002374266857166
0.4002374266857167
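
The cell above passes only the Violence and Theft columns to `chi2_contingency`, so its expected counts and p-value refer to a 3x2 sub-table. A sketch of the same test on the full 3x3 table from the exercise, which has (3-1)*(3-1) = 4 degrees of freedom, could look like this:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Full table: rows = neighborhoods 1-3, columns = Violence, Theft, Vandalism
observed = np.array([[16, 25, 42],
                     [15, 18, 16],
                     [39, 36, 30]])

chi2_full, p_full, dof_full, expected_full = chi2_contingency(observed)
print(expected_full.round(2))        # expected counts under independence
print(chi2_full, dof_full, p_full)   # dof = (3-1)*(3-1) = 4
```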


### Exercise 7

You have quite a lot of plants in and outside your house, some of which have flowers, and some of which don't. Your flower data is presented below: 

|Flowering |Indoors|	Outdoors|
|---|---|---|
|Flower	|7	|3|
|No flower|	1|	12|


<div class="alert alert-block alert-info">
Is flowering independent from the plant being indoors or outdoors?
</div>

In [28]:
from scipy.special import comb              # Probability of this exact table, computed in two ways
from scipy.stats import hypergeom

a = 7
b = 3
c = 1
d = 12
n = a+b+c+d

p = comb(a+b,a) * comb(c+d,c) / comb(n,a+c)
print(p)           

rv = hypergeom(M=a+b+c+d, n=a+c, N=a+b)
p = rv.pmf(a)
print(p)

print('')
print('')
print('')


from scipy.stats import fisher_exact , chi2_contingency    


oddsratio, pvalue = fisher_exact( [ [7, 3] , [1,12] ] )
print(pvalue)

chi2_val, pvalue ,df, expected_freqs = chi2_contingency( [ [7, 3] , [1,12] ] )   # chi2_val avoids shadowing scipy.stats.chi2
print(pvalue)


0.0031816346259743757
0.003181634625974381



0.005898261114306346
0.007616400216780019
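
The two-sided Fisher p-value can be related to the single-table probability computed first: it is the total probability of all tables with the same margins that are no more likely than the observed one. The sketch below reproduces it from the hypergeometric distribution; the small relative tolerance is an assumption used to guard against floating-point ties. The chi-square p-value differs because it is a continuity-corrected approximation rather than an exact calculation.

```python
import numpy as np
from scipy.stats import hypergeom, fisher_exact

a, b, c, d = 7, 3, 1, 12
n_total = a + b + c + d
rv = hypergeom(M=n_total, n=a + c, N=a + b)

# All values the top-left cell can take with the margins held fixed
k = np.arange(max(0, (a + b) + (a + c) - n_total), min(a + b, a + c) + 1)
probs = rv.pmf(k)

# Two-sided p-value: total probability of tables no more likely than the observed one
p_obs = rv.pmf(a)
p_manual = probs[probs <= p_obs * (1 + 1e-7)].sum()

_, p_scipy = fisher_exact([[a, b], [c, d]])
print(p_manual, p_scipy)
```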


### Exercise 8

The table below describes residents of a Madrid neighborhood based on their car ownership and public transportation usage.

| Public vs Cars  | Owns car | Does not own car| Total|
|---|---|---|---|
|Uses public transport|34|94|128|
|Does not use public transport|126|17|143|
|Total|160|111|271|  



<div class="alert alert-block alert-info">
Is there an association between car ownership and public transportation usage? If there were no association, how many individuals would we expect to not own a car and not use public transport?
</div>


In [10]:
from scipy.special import comb              # Probability of this exact table, computed in two ways
from scipy.stats import hypergeom

a = 34
b = 94
c = 126
d = 17
n = a+b+c+d

p = comb(a+b,a) * comb(c+d,c) / comb(n,a+c)
print(p)           

rv = hypergeom(M=a+b+c+d, n=a+c, N=a+b)
p = rv.pmf(a)
print(p)

print('')
print('')
print('')


from scipy.stats import fisher_exact , chi2_contingency    


oddsratio, pvalue = fisher_exact( [ [34, 94] , [126, 17] ] )
print(pvalue)

chi2_val, pvalue ,df, expected_freqs = chi2_contingency( [ [34, 94] , [126, 17] ] )   # chi2_val avoids shadowing scipy.stats.chi2
print(pvalue)


2.5262350099043426e-26
2.5262350099046578e-26



3.449987535043802e-26
2.9120514083355244e-24


#### Part b): expected count under independence

In [11]:
# Sol: if events were independent, then
p_not_own_car = 111/271
p_not_use_public = 143/271

n_individuals = 271 * p_not_own_car * p_not_use_public
print('\n# of individuals = ', n_individuals)


# of individuals =  58.57195571955719
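
The same expected count can also be read from the table of expected frequencies that `chi2_contingency` returns. A minimal sketch, assuming the row and column order of the table in the exercise:

```python
from scipy.stats import chi2_contingency

# Rows: uses / does not use public transport; columns: owns / does not own a car
_, _, _, expected = chi2_contingency([[34, 94], [126, 17]])

# Row 1, column 1: does not use public transport and does not own a car
print(expected[1, 1])
```

This should agree with the value obtained above from the product of the two marginal proportions.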
