# Chapter 26 - Exercises

In [1]:
import math

def chi_square_manual(obs, exp):
    # Sum of (observed - expected)^2 / expected over all categories
    return sum(math.pow(o - e, 2) / e for o, e in zip(obs, exp))

In [2]:
import scipy.stats as stats

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## 26.1

### Answers

* a) chi-square test of independence
* b) some other test (the data are dollar amounts, not counts)
* c) chi-square test of homogeneity

## 26.3

### Answers

* a) 10 expected in each category (60 / 6)
* b) goodness-of-fit
* c) H0: counts are uniform (10 ea), HA: counts aren't uniform
* d) 
  * + counts
  * + random
  * + independent
* e) 5 (6-1)
* f) $\chi^2$ = 5.6; P-value: 0.3471
* g) fail to reject H0

In [13]:
obs = [11,7,9,15,12,6]
exp = [10] * 6

chisq = chi_square_manual(obs, exp)
df = 6 - 1

print("chi-square: {}".format(chisq))
print("df: {}".format(df))

chi-square: 5.6
df: 5


In [14]:
stats.chisquare(obs, f_exp=exp)

Power_divergenceResult(statistic=5.6, pvalue=0.34710506828171556)

## 26.5

### Answers

* a) the values aren't counts, but rather, weights
* b) count the nuts rather than weighing them

## 26.7

### Answers

* H0: % of officers in each racial group matches % of overall population
* HA: % of officers in each racial group is different from % of overall population
* $\chi^2$: 16616.285
* P-value: 0.0
* reject H0: evidence supports claim of a difference between %'s of overall population and police force

In [23]:
exp_pcts = [0.292, 0.282, 0.315, 0.091, 0.02]
obs_pcts = [0.648, 0.145, 0.191, 0.014, 0.00]
n = 26181

exp = [math.floor(n * p) for p in exp_pcts]
obs = [math.floor(n * p) for p in obs_pcts]

print(exp)
print(obs)

stats.chisquare(obs, f_exp=exp)

[7644, 7383, 8247, 2382, 523]
[16965, 3796, 5000, 366, 0]


Power_divergenceResult(statistic=16616.28450805835, pvalue=0.0)
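Flooring the percentages leaves the two lists with different totals (26179 expected vs. 26127 observed), while a goodness-of-fit test assumes they agree. Below is a minimal sketch of one way around that, assuming the observed counts above are taken as given and the expected counts are scaled to the same total. Newer SciPy releases may also reject mismatched totals in `stats.chisquare`, so this sketch computes the statistic by hand.

```python
# Population shares and observed counts, copied from the cell above
exp_pcts = [0.292, 0.282, 0.315, 0.091, 0.02]
obs = [16965, 3796, 5000, 366, 0]

# Scale the expected shares to the observed total so both lists sum to the same n
n_obs = sum(obs)
share_total = sum(exp_pcts)
exp = [n_obs * p / share_total for p in exp_pcts]

# Same formula as chi_square_manual
chisq = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
df = len(obs) - 1

print("expected: {}".format([round(e, 1) for e in exp]))
print("chi-square: {:.1f}, df: {}".format(chisq, df))
```

The statistic should remain very large, so the conclusion (reject H0) does not change.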

## 26.9

### Answers

* a) yes - fail to reject H0, results are consistent; $\chi^2$ = 5.67, P-value = 0.129
* b) reject H0; $\chi^2$ = 11.342, P-value = 0.010
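* c) with a larger sample, the same proportional deviations between observed and expected give a larger $\chi^2$ (doubling every count doubled the statistic from 5.67 to 11.34), so even small deviations can become significant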

In [8]:
ratios = [9, 3, 3, 1]
total = sum(ratios)
exp = [100 * x / total for x in ratios]
obs = [59, 20, 11, 10]

print(chi_square_manual(obs, exp))

stats.chisquare(obs, f_exp=exp)

5.671111111111111


Power_divergenceResult(statistic=5.671111111111111, pvalue=0.12875503784496686)

In [10]:
exp2 = [200 * x / total for x in ratios]
obs2 = [2 * x for x in obs]

stats.chisquare(obs2, f_exp=exp2)

Power_divergenceResult(statistic=11.342222222222222, pvalue=0.010012229535652856)
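A small sketch of why part c) holds: multiplying every observed and expected count by k multiplies each $(o - e)^2 / e$ term by k, so the statistic grows linearly with sample size. The multipliers 1, 2, and 10 are arbitrary choices for illustration; only k = 1 and k = 2 appear in the exercise.

```python
def chi_square(obs, exp):
    # Sum of (observed - expected)^2 / expected
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

ratios = [9, 3, 3, 1]
obs = [59, 20, 11, 10]

for k in (1, 2, 10):
    # Same proportions, k times as many observations
    scaled_obs = [k * o for o in obs]
    n = sum(scaled_obs)
    scaled_exp = [n * r / sum(ratios) for r in ratios]
    print("n = {:4d}  chi-square = {:.3f}".format(n, chi_square(scaled_obs, scaled_exp)))
```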

## 26.11

### Answers

* a) 6
* b) goodness of fit
* c) 
  * H0: frequency of hurricanes is uniform across decades
  * HA: frequency isn't uniform
* d) df = 15
* e) P-value: 0.628
* f) Fail to reject H0: the evidence doesn't support the claim that the distribution of hurricanes has changed across decades
* g) Potentially underestimates the frequency in the last decade.  It could potentially make the difference between obs and exp larger in last decade, possibly changing the final result.

In [24]:
data = pd.read_table("../data/Hurricane_frequencies.txt", index_col="Decade")
data

Decade,Count
1851-1860,6
1861-1870,1
1871-1880,7
1881-1890,5
1891-1900,8
1901-1910,4
1911-1920,7
1921-1930,5
1931-1940,8
1941-1950,10


In [20]:
# a)
data.Count.sum() / data.Count.size

6.0

In [21]:
# d) df = number of decades - 1
data.Count.size - 1

15

In [23]:
# e)
exp = [6] * data.Count.size
obs = data.Count.values

stats.chisquare(obs, f_exp=exp)

Power_divergenceResult(statistic=12.666666666666668, pvalue=0.6280270845138782)

## 26.13

### Answers

* a) chi-square test of independence
* b)
  * H0: Epidural during childbirth and later breastfeeding are independent. 
  * HA: Epidural during childbirth and later breastfeeding are associated. 

## 26.15

### Answers

* a) df = (r-1)(c-1) = 1
* b) 159.34
* c) expected counts are at least 5 in all cells; assume the data are random and independent

In [26]:
# b) expected count for one cell from the marginal totals
(474 / 1178) * 396

159.34125636672326
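The two calculations in parts a) and b) can be wrapped in small helpers. This is a sketch only: `expected_count` and `degrees_of_freedom` are hypothetical names, and 474 and 396 are treated as the two marginal totals for the cell (it does not matter which is the row and which is the column).

```python
def expected_count(row_total, col_total, grand_total):
    # Expected cell count if the two variables are independent
    return row_total * col_total / grand_total

def degrees_of_freedom(n_rows, n_cols):
    # df for a chi-square test on an r x c table
    return (n_rows - 1) * (n_cols - 1)

print("expected: {:.2f}".format(expected_count(474, 396, 1178)))
print("df: {}".format(degrees_of_freedom(2, 2)))
```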

## 26.17

### Answers

* a) 5.899
* b) $\chi^2$ = 14.87, df = 1; P-value = 0.0001
* c) reject H0: P-value < 0.001; evidence supports claim that they are associated

In [6]:
# a)
exp = (474 / 1178) * 396
obs = 190
comp = math.pow(obs - exp, 2) / exp
print("component: {}".format(comp))

component: 5.899028177659629


In [11]:
# b)
pvalue = 1 - stats.chi2.cdf(14.87, 1)
print("P-value: {}".format(pvalue))

P-value: 0.00011518027809664932
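As a cross-check that needs only the standard library: with df = 1 the chi-square statistic is a squared standard normal, so the upper-tail probability reduces to a complementary error function. The helper name `chi2_pvalue_df1` is made up for this sketch. For other df, `stats.chi2.sf(x, df)` gives the upper tail directly.

```python
import math

def chi2_pvalue_df1(x):
    # For df = 1: P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(x / 2))

print("P-value: {}".format(chi2_pvalue_df1(14.87)))
```

The result should agree with the value printed above to many decimal places.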


## 26.19

### Answers

* a) see below
* b) The residuals appear to further support the claim that epidural during childbirth is associated with less breastfeeding at 6 months.

In [13]:
# a)
obs = 190
exp = (474 / 1178) * 396

resid = (obs - exp) / math.sqrt(exp)
print("resid: {}".format(resid))

resid: 2.4287915055968945


## 26.21

### Answers

* Those represent more than two variables.  We'd need two variables with two or more categories within each to use the chi-square methods we've used for the prior problems.

## 26.23

### Answers

* a) 40.2%
* b) 8.1%
* c) 62.2%
* d) 285
* e) 
  - H0 : survival is independent of passenger "class"
  - HA : survival is associated with passenger classes
* f) 3
* g) reject H0: evidence supports claim that survival was associated with the class of passenger

In [14]:
# a)
885 / 2201

0.4020899591094957

In [15]:
# b)
178 / 2201

0.08087233075874603

In [16]:
# c)
202 / 325

0.6215384615384615

In [17]:
# d)
pct = 710 / 2201
885 * pct

285.48387096774195

## 26.25

### Answers

* Shows a fairly clear pattern of high survival for First class, moderate for Second, and lower for Crew and Third.

## 26.27

### Answers

* a) experiment: subjects were assigned (randomly?) to one of 3 treatments: cranberry (tx), an alternate treatment, and control
* b) homogeneity
* c) 
  - H0: rate of infection is same across treatments
  - HA: rate of infection varies by treatment
* d) independent, random, at least 5 expected in each cell
* e) 2
* f) $\chi^2$ = 7.78, P-value = 0.020
* g) reject H0: evidence shows an association between infection rate and treatment
* h) control shows smallest residuals; cranberry shows lower than expected infection; alternate tx shows higher than expected infection

In [32]:
# f)
inf = [8, 20, 18]
non = [42, 30, 32]

result = stats.chi2_contingency([inf, non])
print(result)

(7.775919732441471, 0.020487099856223837, 2, array([[15.33333333, 15.33333333, 15.33333333],
       [34.66666667, 34.66666667, 34.66666667]]))


In [33]:
# h)
z1 = zip(inf, result[3][0])
z2 = zip(non, result[3][1])
print([(o - e) / math.sqrt(e) for (o, e) in z1])
print([(o - e) / math.sqrt(e) for (o, e) in z2])

[-1.8727643676692471, 1.1917591430622478, 0.6810052246069987]
[1.2455047375590558, -0.7925939239012166, -0.4529108136578379]
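The expected counts and residuals can also be rebuilt from the table margins without SciPy, which shows where the numbers come from: each expected count is row total times column total over the grand total, and $\chi^2$ is the sum of the squared standardized residuals. A sketch using the same `inf` and `non` counts as above (the other variable names are illustrative):

```python
import math

inf = [8, 20, 18]
non = [42, 30, 32]
table = [inf, non]

# Marginal totals
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Expected count per cell under homogeneity
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Standardized residuals: (obs - exp) / sqrt(exp)
residuals = [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
             for obs_row, exp_row in zip(table, expected)]

# Chi-square is the sum of the squared residuals
chisq = sum(r ** 2 for row in residuals for r in row)

print(residuals)
print("chi-square: {:.3f}".format(chisq))
```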

