# t-Test, Part 3

## Dependent samples
* two conditions: the same person is measured under two different conditions (e.g. two types of treatment);
* longitudinal: measure a variable at one point in time, measure it again later, and see whether it has changed;
* pre-test and post-test: what was the value of the variable before and after the treatment?

Advantages:
* controls for individual differences
  * uses fewer subjects
  * cost-effective
  * less time-consuming
  * less expensive

Disadvantages:
* carry-over effect: the second measurement can be affected by the first treatment (e.g. once you have applied a new method of teaching math to your students, their skills have changed, so you cannot run the test on the same sample again as if nothing had happened)
* order may influence the result (e.g. when testing two types of pills, the first pill may interact with the second)
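
To make "controls for individual differences" concrete, here is a minimal sketch of how a dependent-samples t-statistic could be computed from per-subject differences. The helper name `paired_t` and the pre-test/post-test scores are hypothetical, made up purely for illustration.

```python
import math
import statistics

def paired_t(before, after):
    """t-statistic for dependent samples, based on per-subject differences."""
    if len(before) != len(after):
        raise ValueError("dependent samples need one pair of values per subject")
    # each subject acts as their own control: only the change is analysed
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample SD (n - 1 in the denominator)
    se = sd_diff / math.sqrt(n)         # standard error of the mean difference
    return mean_diff / se, n - 1        # t-statistic, degrees of freedom

# hypothetical pre-test / post-test scores for the same five subjects
pre = [10, 12, 9, 14, 11]
post = [12, 15, 10, 17, 13]
t_paired, dof_paired = paired_t(pre, post)
print(t_paired, dof_paired)
```

Because only the differences enter the calculation, stable differences between subjects cancel out, which is why fewer subjects are needed.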

## Independent samples

The advantages of dependent samples are the **disadvantages** of independent samples.

The disadvantages of dependent samples are the **advantages** of independent samples.

* Experimental: we give a treatment to the subjects
* Observational: we only observe the subjects, without giving a treatment


In [1]:
%%latex
Standard deviation of the difference between two independent samples
$$SD = \sqrt{s_1^2 + s_2^2}$$

Standard error of the difference
$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Degrees of freedom
$$df = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$$

<IPython.core.display.Latex object>
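
The formulas above can be sketched as a small helper that works from summary statistics alone. The function name `independent_t` and the numbers passed to it are hypothetical. They were chosen so that each variance term equals 1 and the arithmetic is easy to follow by hand.

```python
import math

def independent_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, se, df) for two independent samples from summary statistics."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)  # standard error of the difference
    df = (n1 - 1) + (n2 - 1)                     # degrees of freedom
    t = (mean1 - mean2) / se
    return t, se, df

# hypothetical summary statistics:
# s1^2 / n1 = 9 / 9 = 1 and s2^2 / n2 = 16 / 16 = 1, so SE = sqrt(2)
t_demo, se_demo, df_demo = independent_t(mean1=10, s1=3, n1=9,
                                         mean2=12, s2=4, n2=16)
print(t_demo, se_demo, df_demo)
```

The sign of `t_demo` depends only on which mean is subtracted from which. For a two-tailed test only its absolute value matters.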

In [2]:
import pandas as pd
import numpy as np

prices_df = pd.read_csv('spreadsheets/food_prices.csv')
prices_df.columns = ['gettysburg', 'wilma']
prices_df

Unnamed: 0,gettysburg,wilma
0,9,11.0
1,5,10.0
2,6,12.0
3,11,9.0
4,8,8.0
5,5,13.0
6,7,14.0
7,13,15.0
8,12,12.0
9,13,11.0


In [3]:
getys_avg = prices_df['gettysburg'].mean()
getys_avg

8.944444444444445

In [4]:
wilma_avg = prices_df['wilma'].mean()
wilma_avg

11.142857142857142

In [5]:
getys_std = prices_df['gettysburg'].std(ddof=1)
getys_std

2.6451336499586917

In [6]:
wilma_std = prices_df['wilma'].std(ddof=1)
wilma_std

2.1788191176076888

In [7]:
# calculate standard error
sem = np.sqrt(getys_std ** 2 / prices_df['gettysburg'].count() + 
             wilma_std ** 2 / prices_df['wilma'].count())
sem

0.8531100847677227

In [8]:
# calculate t-statistic both ways

t1 = (getys_avg - wilma_avg) / sem
t2 = (wilma_avg - getys_avg) / sem

t1, t2

(-2.5769390582356815, 2.5769390582356815)

In [9]:
# degrees of freedom
df = prices_df.count().sum() - 2
t_crit = 2.042  # two-tailed critical value for alpha = 0.05, df = 30 (read from t-table)

df

30

In [10]:
reject_null = abs(t1) > t_crit  # two-tailed test: compare |t| with the critical value
reject_null

True
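
The critical value `2.042` above was read from a t-table. As a cross-check, the sketch below finds a two-tailed critical value numerically using only the standard library: it integrates the t-distribution density with Simpson's rule and then bisects for the cut-off. The function names are hypothetical, and this is an illustration rather than a replacement for a statistics library (in practice, `scipy.stats.t.ppf` does this job).

```python
import math

def t_pdf(x, df):
    """Probability density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0: Simpson's rule on [0, t] plus the symmetric left half."""
    if t == 0:
        return 0.5
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, alpha=0.05):
    """Two-tailed critical value: the t with P(|T| > t) = alpha, found by bisection."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(30), 3))
```

For `df = 30` and `alpha = 0.05` the result should agree with the table value to about three decimal places.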

# Acne medication

Two dermatologists, **A** and **B**, each had their own acne medication and tested it on a different group of patients.

In [11]:
import pandas as pd
import numpy as np

# numbers below represent the fraction of acne left on each patient
# after the treatment (multiplied by 100 below to get percentages)
dA = pd.Series([0.4, 0.36, 0.2, 0.32, 0.45, 0.28], name='A')
dB = pd.Series([0.41, 0.39, 0.18, 0.23, 0.35], name='B')
d = pd.DataFrame([dA, dB]).T * 100
d

Unnamed: 0,A,B
0,40.0,41.0
1,36.0,39.0
2,20.0,18.0
3,32.0,23.0
4,45.0,35.0
5,28.0,


In [12]:
d['A'].mean(), d['B'].mean()

(33.5, 31.2)

In [13]:
d['A'].std(ddof=1), d['B'].std(ddof=1)

(8.893818077743664, 10.158740079360236)

In [14]:
# null hypothesis: the two medications leave the same mean percentage of acne
se = np.sqrt(d['A'].std() ** 2 / d['A'].count() + d['B'].std() ** 2 / d['B'].count())
t = (d['A'].mean() - d['B'].mean()) / se
t

0.3954755449732927

In [15]:
dof = d.count().sum() - 2
dof

9

In [16]:
t_crit = 2.262  # two-tailed critical value for alpha = 0.05, df = 9
reject_null = abs(t) > t_crit
reject_null

False
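
The steps used for the acne data (means, sample variances, standard error, t-statistic, degrees of freedom, decision) can be bundled into one function. This is a minimal sketch using only the standard library. The name `two_sample_t_test` is hypothetical, and the critical value still has to be supplied by the caller.

```python
import math
import statistics

def two_sample_t_test(sample_a, sample_b, t_crit):
    """Two-tailed independent-samples t-test; returns (t, dof, reject_null)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    se = math.sqrt(var_a / n_a + var_b / n_b)
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / se
    dof = n_a + n_b - 2
    return t, dof, abs(t) > t_crit

# same acne data as above, in percent; the samples may have different sizes
acne_a = [40, 36, 20, 32, 45, 28]
acne_b = [41, 39, 18, 23, 35]
t_acne, dof_acne, reject_acne = two_sample_t_test(acne_a, acne_b, t_crit=2.262)
print(t_acne, dof_acne, reject_acne)
```

Comparing `abs(t)` with the critical value keeps the decision independent of the order in which the two samples are passed.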

# Who has more shoes?


In [17]:
shoes = pd.read_csv('spreadsheets/shoes.csv')
shoes.columns = ['f', 'm']  # f = female, m = male
shoes

Unnamed: 0,f,m
0,90.0,4
1,28.0,120
2,30.0,5
3,10.0,3
4,5.0,10
5,9.0,3
6,60.0,5
7,,13
8,,4
9,,10


In [18]:
shoes['f'].mean(), shoes['m'].mean()

(33.142857142857146, 18.0)

In [19]:
shoes['f'].std(), shoes['m'].std()

(31.360423952430722, 34.27243790569909)

In [20]:
sem = np.sqrt(
    shoes['f'].std() ** 2 / shoes['f'].count() + 
    shoes['m'].std() ** 2 / shoes['m'].count()
)
sem

15.725088769901236

In [21]:
t = (shoes['f'].mean() - shoes['m'].mean()) / sem
t

0.9629743503795974

In [22]:
dof = shoes.count().sum() - 2
t_crit = 2.12  # two-tailed critical value for alpha = 0.05, df = 16
dof

16

In [23]:
mean_diff = shoes['f'].mean() - shoes['m'].mean()
print(mean_diff)
# 95% confidence interval for the difference between the two means
ci = mean_diff - t_crit * sem, mean_diff + t_crit * sem
ci

15.142857142857146


(-18.194331049333478, 48.48004533504777)
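
The interval above contains 0, which agrees with the t-statistic (about 0.96) being smaller than the critical value 2.12: a difference of zero pairs of shoes is still plausible at this confidence level. A minimal sketch of that check follows. The helper names are hypothetical and the inputs are copied from the outputs above.

```python
def confidence_interval(mean_diff, se, t_crit):
    """Confidence interval for a difference of means: mean_diff +/- t_crit * se."""
    margin = t_crit * se
    return mean_diff - margin, mean_diff + margin

def contains_zero(interval):
    """True when 0 lies inside the interval, i.e. 'no difference' is plausible."""
    low, high = interval
    return low <= 0 <= high

# values copied from the shoes outputs above
ci_shoes = confidence_interval(15.142857142857146, 15.725088769901236, 2.12)
print(ci_shoes, contains_zero(ci_shoes))
```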

**What proportion of the difference in pairs of shoes owned can be attributed to gender?**

In [24]:
%%latex
$$r^2 = \frac{t^2}{t^2 + df}$$

<IPython.core.display.Latex object>

In [25]:
r2 = t ** 2 / (t ** 2 + dof)
r2

0.05478242400037163

This means that only about 5% of the variation in the number of pairs of shoes owned can be attributed to **gender**. The remaining 95% is explained by **something else** (e.g. personality type).