## IS6 in Python: Comparing Groups (Chapter 17)

### Introduction and background

This document is intended to help students work through the examples in the Sixth Edition of Intro Stats (2022) by De Veaux, Velleman, and Bock. This PDF file, along with the associated ipynb reproducible-analysis source file used to create it, can be found at (INSERT WEBSITE LINK HERE).

#### Chapter 17: Comparing Groups

In [76]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, t, ttest_ind
from statsmodels.stats.proportion import proportions_ztest

#### Section 17.1: A Confidence Interval for the Difference Between Two Proportions

Question:
- What is the Python equivalent of diffmean()?
- What is the Python equivalent of resample()? Note that pandas DataFrame.resample() only works with time-series data.
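
One possible answer, as a minimal sketch: `diffmean()` and `resample()` come from R's mosaic package, and pandas has no functions with those names. The helper `diffmean` below is a hypothetical stand-in written for this document, its sign convention (second group minus first, in sorted label order) is an assumption, and the toy data frame is made up for illustration. Sampling rows with replacement via `sample(frac=1, replace=True)` gives a bootstrap resample; here it is done within each group so that both groups always appear.

```python
import numpy as np
import pandas as pd

def diffmean(data, value, group):
    """Hypothetical helper: difference in group means (second group minus first).

    Groups are ordered by sorted label; this sign convention is an assumption.
    """
    means = data.groupby(group)[value].mean().sort_index()
    if len(means) != 2:
        raise ValueError("diffmean expects exactly two groups")
    return means.iloc[1] - means.iloc[0]

# Toy data (made up, not from the book)
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 5.0, 7.0],
})

# Observed difference: mean of group b minus mean of group a
observed = diffmean(toy, "value", "group")

# Bootstrap: resample rows with replacement within each group, then recompute
boot_diffs = np.array([
    diffmean(
        toy.groupby("group").sample(frac=1, replace=True, random_state=seed),
        "value",
        "group",
    )
    for seed in range(200)
])

print(f"observed difference in means: {observed}")
print(f"bootstrap standard error: {boot_diffs.std(ddof=1):.3f}")
```

Resampling the whole data frame with `toy.sample(frac=1, replace=True)` is also possible, but with small groups a resample can end up containing only one group.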

#### Example 17.1:  Finding the Standard Error of a Difference in Proportions

In [56]:
# Set up
p_c = 0.23
p_nc = 0.19
se_c = ((p_c * (1-p_c)) / 2698) ** 0.5
print(f"standard error, college: {se_c}")
se_nc = ((p_nc * (1 - p_nc)) / 696) ** 0.5
print(f"standard error, no college: {se_nc}")
se_diff = ((se_c ** 2) + (se_nc ** 2)) ** 0.5
print(f"standard error, difference in proportions: {se_diff}")

standard error, college: 0.008101926666870337
standard error, no college: 0.0148701274256535
standard error, difference in proportions: 0.016934045747266573


#### Example 17.2: Finding a Two-Proportion z-Interval

In [57]:
# A 95% confidence interval for p_c - p_nc is:
left = (p_c - p_nc) - 1.96 * se_diff
right = (p_c - p_nc) + 1.96 * se_diff
conf_int = pd.Interval(left, right)
print(conf_int)

(0.006809270335357526, 0.07319072966464249]


Question: What is the prop.test() equivalent in Python?
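
A possible answer, offered as a sketch rather than a definitive mapping: for two groups, R's `prop.test()` reports a chi-square test with Yates' continuity correction by default, and `scipy.stats.chi2_contingency` with `correction=True` runs the same kind of test on a 2×2 table of counts. The counts below are the seat belt counts from the step-by-step example in Section 17.2; the variable names are illustrative. Without the correction, the chi-square statistic equals the square of the pooled two-proportion z-statistic, which the last lines check by hand.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Seat belt counts from Section 17.2
belted = np.array([2777, 1363])   # female, male passengers who were belted
totals = np.array([4208, 2763])   # female, male passengers observed

# 2x2 table: rows are F and M, columns are belted and not belted
table = np.column_stack([belted, totals - belted])

# Yates-corrected chi-square test (the default behaviour of R's prop.test)
chi2_corr, p_corr, dof, expected = chi2_contingency(table, correction=True)

# Uncorrected version, for comparison with a pooled two-proportion z-test
chi2_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

# Pooled z-statistic computed by hand
p_hat = belted / totals
p_pool = belted.sum() / totals.sum()
z = (p_hat[0] - p_hat[1]) / np.sqrt(p_pool * (1 - p_pool) * (1 / totals).sum())

print(f"chi-square (corrected): {chi2_corr:.3f}, p-value: {p_corr:.3g}")
print(f"chi-square (uncorrected): {chi2_raw:.3f}, z squared: {z ** 2:.3f}")
```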

#### Section 17.2: Assumptions and Conditions for Comparing Proportions
#### Step-by-step Example: A Two-Proportion z-Interval 

In [58]:
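# Create dataframe for seatbelts
# Note: np.array() coerces the mixed str/bool list to strings, so "belted" holds the text "True"/"False"
seatbelts = np.array(["F", True] * 2777 + ["F", False] * (4208 - 2777) + ["M", True] * 1363 + ["M", False] * (2763 - 1363)).reshape(-1,2)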
seatbelts = pd.DataFrame(seatbelts, columns = ["passenger", "belted"])
seatbelts.head()

| | passenger | belted |
|---|---|---|
| 0 | F | True |
| 1 | F | True |
| 2 | F | True |
| 3 | F | True |
| 4 | F | True |


In [59]:
#Mechanics
n_f = 4208
n_m = 2763
p_f = 2777/4208
p_m = 1363/2763
se_diff = np.sqrt((p_f * (1 - p_f)) / n_f + (p_m * (1 - p_m)) / n_m)
print(f"standard error of difference: {se_diff}")
me = 1.96 * se_diff
conf_int = pd.Interval(p_f - p_m - me, p_f - p_m + me)
print(f"95% confidence interval: {conf_int}")

standard error of difference: 0.01199154662477283
95% confidence interval: (0.1431256493936262, 0.1901325121627357]


#### Section 17.3: The Two-Sample z-Test: Testing for the Difference Between Proportions
#### Step-By-Step Example: A Two-Proportion z-Test

In [60]:
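# Create dataframe for sleep habits
# Note: np.array() coerces the mixed str/bool list to strings, so "internet" holds the text "True"/"False"
sleep = np.array(["GenY", True] * 205 + ["GenY", False] * (293 - 205) + ["GenX", True] * 235 + ["GenX", False] * (469 - 235)).reshape(-1,2)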
sleep = pd.DataFrame(sleep, columns = ["gen", "internet"])
sleep.head()

| | gen | internet |
|---|---|---|
| 0 | GenY | True |
| 1 | GenY | True |
| 2 | GenY | True |
| 3 | GenY | True |
| 4 | GenY | True |


In [61]:
#Mechanics
# n for GenY: number of GenY respondents
n_y = (sleep["gen"] == "GenY").sum()
print(n_y)

293


In [62]:
# y for GenY: number of GenY respondents with internet == "True" (stored as a string)
y_y = ((sleep["gen"] == "GenY") & (sleep["internet"] == "True")).sum()
print(y_y)

205


In [63]:
#proportion for GenY
p_y = y_y / n_y
print(p_y)

0.6996587030716723


In [64]:
# n for GenX: number of GenX respondents
n_x = (sleep["gen"] == "GenX").sum()
print(n_x)

469


In [65]:
# y for GenX: number of GenX respondents with internet == "True" (stored as a string)
y_x = ((sleep["gen"] == "GenX") & (sleep["internet"] == "True")).sum()
print(y_x)

235


In [66]:
#proportion for GenX
p_x = y_x / n_x
print(p_x)

0.5010660980810234


In [67]:
# Standard error of the difference in proportions (unpooled)
sepgen = ((p_y * (1 - p_y)) / n_y + (p_x * (1 - p_x)) / n_x) ** 0.5
print(sepgen)

0.03535867225219601


In [68]:
#Difference between proportions
pdiff = p_y - p_x
print(pdiff)

0.1985926049906489


In [69]:
# z-statistic for the null hypothesis of no difference between the proportions
z = (pdiff - 0) / sepgen
print(z)

5.616517599252188


In [70]:
# Two-sided P-value from the standard normal model
print(2 * norm.sf(abs(z)))

1.9484441249264696e-08


In [71]:
#Using function to calculate z-test
count = np.array([205, 235])
nobs = np.array([293, 469])
stat, pval = proportions_ztest(count, nobs)
print(f"p-value: {pval}")
print(f"z score: {stat}")

p-value: 6.704500816465756e-08
z score: 5.398915319236189


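Note: The numbers are slightly different because the manual calculation uses the unpooled standard error, while proportions_ztest pools the two samples when estimating the standard error under the null hypothesis.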
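
The gap between the two z-scores (about 5.617 by hand and 5.399 from `proportions_ztest`) can be reproduced with a short sketch. It computes the z-score once with the unpooled standard error, as in the mechanics above, and once with a pooled standard error, which is what `proportions_ztest` appears to use by default. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

count = np.array([205, 235])   # GenY, GenX respondents with internet == True
nobs = np.array([293, 469])    # GenY, GenX sample sizes

p_hat = count / nobs
diff = p_hat[0] - p_hat[1]

# Unpooled SE: each group keeps its own sample proportion
se_unpooled = np.sqrt(np.sum(p_hat * (1 - p_hat) / nobs))

# Pooled SE: one common proportion estimated from both groups combined
p_pool = count.sum() / nobs.sum()
se_pooled = np.sqrt(p_pool * (1 - p_pool) * np.sum(1 / nobs))

z_unpooled = diff / se_unpooled
z_pooled = diff / se_pooled

print(f"unpooled: z = {z_unpooled:.4f}, p-value = {2 * norm.sf(abs(z_unpooled)):.3g}")
print(f"pooled:   z = {z_pooled:.4f}, p-value = {2 * norm.sf(abs(z_pooled)):.3g}")
```

Both versions lead to the same conclusion here; the choice only matters for matching a particular textbook or software output.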

#### Section 17.4: A Confidence Interval for the Difference Between Two Means
#### Example 17.7: Finding a Confidence Interval for the Difference in Sample Means
We can calculate the confidence interval using summary statistics.

In [85]:
# page 585
n_f = 465
n_m = 1310

mean_f = 39667.2
mean_m = 46484

std_f = 37125.9
std_m = 38699.8

mean_diff = mean_m - mean_f

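df = 846  # approximate degrees of freedom for the two-sample t (Welch-Satterthwaite)

t_stat = 3.364  # observed t-statistic, mean_diff / se_diff (not the critical value t*); not used below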

se_diff = np.sqrt((std_m ** 2 / n_m) + (std_f ** 2 / n_f))

print(t.interval(0.95, df = df, loc = mean_diff, scale = se_diff))

(2838.8954767433183, 10794.704523256687)
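
The value `df = 846` above is entered by hand. As a sketch, the Welch-Satterthwaite approximation reproduces it from the same summary statistics, and `t.ppf` gives the matching critical value. The function name `welch_df` is hypothetical.

```python
import numpy as np
from scipy.stats import t

def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom for a two-sample t."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Summary statistics from the example above
n_f, mean_f, std_f = 465, 39667.2, 37125.9
n_m, mean_m, std_m = 1310, 46484, 38699.8

df = welch_df(std_m, n_m, std_f, n_f)
se_diff = np.sqrt(std_m ** 2 / n_m + std_f ** 2 / n_f)
mean_diff = mean_m - mean_f

# Critical value for a 95% interval, then the interval itself
t_star = t.ppf(0.975, df)
low = mean_diff - t_star * se_diff
high = mean_diff + t_star * se_diff

print(f"df: {df:.1f}, t*: {t_star:.3f}")
print(f"95% confidence interval: ({low:.1f}, {high:.1f})")
```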


#### Section 17.5: The Two-Sample t-Test: Testing for the Difference Between Two Means
#### Step-By-Step Example: A Two-Sample t-Test for the Difference Between Two Means

In [None]:
# Page 587
buy = pd.read_csv("datasets/buy-from-a-friend.txt", sep = "\t")
buy.head()

In [None]:
sns.boxplot(data = buy)
plt.xlabel("Buying from")
plt.ylabel("Amount Offered ($)")
plt.show()

In [None]:
sns.histplot(data = buy["Friend"], bins = 3)
plt.xlabel("Buy from friend")
plt.show()

In [None]:
sns.histplot(data = buy["Stranger"], bins = 4)
plt.xlabel("Buy from stranger")
plt.show()

We can replicate the analyses on pages 588-589.

In [None]:
print(buy.describe())

In [None]:
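# Drop missing values per column: a row-wise buy.dropna() would also discard the
# valid value in any row where only one of the two columns is missing
friend = buy["Friend"].dropna()
stranger = buy["Stranger"].dropna()
# To perform Welch’s t-test, which does not assume equal population variances, use equal_var = False
stat, p = ttest_ind(friend, stranger, equal_var = False)

# Display the results
print("p-value:", p)
print("t statistic:", stat)

Question: I need to double-check the code above. I used the teaching code but did not get the same result as the book. One possible explanation is that my dataset is missing one observation: if the two groups have different sizes, a row-wise dropna() removes a valid value from the larger group.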
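
One thing worth checking when results differ from the book is how missing values are dropped. In a wide-format table with one column per group, unequal group sizes leave missing cells in the shorter column, and a row-wise `dropna()` then removes valid values from the longer column too. The sketch below shows the effect on a toy table; the values are made up and are not the book's data.

```python
import numpy as np
import pandas as pd

# Toy wide-format data (made up): group A has 4 values, group B has 3
toy = pd.DataFrame({
    "A": [10.0, 12.0, 11.0, 13.0],
    "B": [8.0, 9.0, 7.0, np.nan],
})

rowwise = toy.dropna()       # drops the whole last row, including A's valid 13.0
a_only = toy["A"].dropna()   # keeps all 4 values of A
b_only = toy["B"].dropna()   # keeps the 3 non-missing values of B

print("row-wise dropna:", len(rowwise["A"]), "values of A,", len(rowwise["B"]), "values of B")
print("per-column dropna:", len(a_only), "values of A,", len(b_only), "values of B")
print("missing values per column:")
print(toy.isna().sum())
```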

#### Random Matters: Randomization Tests for The Difference Between Two Means

In [None]:
#Page 591 - 592
car = pd.read_csv("datasets/car-speeds.txt", sep = "\t")
car.head()

In [None]:
print(car.groupby("direction").describe())

Question: (same as above)
- What is the Python equivalent of diffmean()?
- What is the Python equivalent of resample()?
- What is the Python equivalent of qdata()?

Note: I think I know how to shuffle() in Python; the only thing I still need is diffmean().
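
A minimal sketch of a randomization test in NumPy, offered as one possible approach rather than a translation of the book's R code. The function `randomization_test` is hypothetical, the toy values are made up (they are not the car-speeds data), `rng.permutation` plays the role of shuffling, and `np.quantile` is used here in place of `qdata()`.

```python
import numpy as np

def randomization_test(x, y, n_perm=10_000, seed=0):
    """Two-sided randomization test for a difference in means (x minus y)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    combined = np.concatenate([x, y])
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle group membership, then recompute the difference in means
        shuffled = rng.permutation(combined)
        diffs[i] = shuffled[:len(x)].mean() - shuffled[len(x):].mean()
    # Share of shuffled differences at least as extreme as the observed one
    p_value = np.mean(np.abs(diffs) >= abs(observed))
    return observed, diffs, p_value

# Toy values (made up for illustration)
group_a = [23.0, 25.0, 27.0, 24.0, 26.0]
group_b = [21.0, 22.0, 24.0, 20.0, 23.0]
observed, diffs, p_value = randomization_test(group_a, group_b, n_perm=2000)

# Quantiles of the shuffled differences (middle 95% of the null distribution)
lower, upper = np.quantile(diffs, [0.025, 0.975])

print(f"observed difference: {observed}")
print(f"randomization p-value: {p_value:.4f}")
print(f"middle 95% of shuffled differences: ({lower:.2f}, {upper:.2f})")
```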

#### Section 17.6: Pooling

In [None]:
buy = pd.read_csv("datasets/buy-from-a-friend.txt", sep = "\t")
# Drop missing values per column rather than row-wise, so that no valid value is discarded
friend = buy["Friend"].dropna()
stranger = buy["Stranger"].dropna()
# The pooled-variance t-test is obtained with equal_var = True (the default)
stat, p = ttest_ind(friend, stranger, equal_var = True)

# Display the results
print("p-value:", p)
print("t statistic:", stat)

Question: Same as above; I need to double-check this code. I used the teaching code but did not get the same result as the book. One possible explanation is that my dataset is missing one observation. Also, I don't see this example in the new version.
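
To see what pooling does, the pooled t-statistic can be computed by hand and compared with `ttest_ind(..., equal_var=True)`. This is a sketch: `pooled_t` is a hypothetical helper and the two samples are made-up values, not the book's data.

```python
import numpy as np
from scipy.stats import t, ttest_ind

def pooled_t(x, y):
    """Pooled-variance two-sample t-test computed from first principles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    t_stat = (x.mean() - y.mean()) / se
    df = n1 + n2 - 2
    p_value = 2 * t.sf(abs(t_stat), df)
    return t_stat, df, p_value

# Toy samples (made up for illustration)
sample_1 = [10.0, 12.0, 11.0, 13.0, 12.5]
sample_2 = [8.0, 9.0, 7.0, 9.5]

t_stat, df, p_value = pooled_t(sample_1, sample_2)
scipy_stat, scipy_p = ttest_ind(sample_1, sample_2, equal_var=True)

print(f"by hand: t = {t_stat:.4f}, df = {df}, p-value = {p_value:.4g}")
print(f"scipy:   t = {scipy_stat:.4f}, p-value = {scipy_p:.4g}")
```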

#### Section 17.7: The Standard Deviation of a Difference