# T Distribution

## Assumptions (sensitive to outliers)

- simple random sample

- independent observations

- large sample size (n >= 20; the more symmetric the population, the smaller the sample size needed; the more skewed, the larger the sample size needed) OR individuals ~ normal distribution

In [25]:
from scipy import stats

# construct 95% confidence interval (2 sided)
n = 16
mean = 25.2
s = 5
ci = .95

# =================== TWO SIDED ===================
# one method: let scipy build the interval directly
print ('one method')
print (stats.t.interval(ci, df = n - 1, loc = mean, scale = s / (n ** .5)))
# second method
t_stat = stats.t.ppf(1 - (1-ci)/2, df = n - 1)

lower, upper = mean - t_stat * s/ (n ** .5), mean + t_stat * s/ (n ** .5)
print ('TWO SIDE: have {}% confidence that the population MEAN is between {:.2f} and {:.2f}, and the t-value is {:.2f}'.format(ci*100, lower, upper, t_stat))
print ()
# =================== ONE SIDED ===================
# second method
t_stat_1 = stats.t.ppf(ci, df = n - 1)

upper = mean + t_stat_1 * s/ (n ** .5)
print ('ONE SIDE: have {}% confidence that the population MEAN is {:.2f} or less, and the t-value is {:.2f}'.format(ci*100, upper, t_stat_1))

one method
(22.535688068050845, 27.864311931949153)
TWO SIDE: have 95.0% confidence that the population MEAN is between 22.54 and 27.86, and the t-value is 2.13

ONE SIDE: have 95.0% confidence that the population MEAN is 27.39 or less, and the t-value is 1.75


## Understanding of the Confidence Interval

- If we took samples of size 16 from the population in Vancouver over and over again and built a confidence interval from each sample, we would expect about 95 of every 100 intervals to contain the true mean

- REMEMBER: the confidence interval is random, but the population mean is not random!
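The repeated-sampling idea above can be checked with a small simulation. This is only a sketch: the population (normal, mean 25, standard deviation 5), the number of repeats, and the name `coverage_rate` are assumptions made for illustration and are not taken from the notebook's data.

```python
import numpy as np
from scipy import stats

def coverage_rate(true_mean=25.0, true_sd=5.0, n=16, ci=0.95,
                  n_repeats=10_000, seed=0):
    """Fraction of simulated t intervals that contain the true mean."""
    rng = np.random.default_rng(seed)
    # each row is one simulated sample of size n (assumed normal population)
    samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))
    means = samples.mean(axis=1)
    sds = samples.std(axis=1, ddof=1)  # sample standard deviation
    t_value = stats.t.ppf(1 - (1 - ci) / 2, df=n - 1)
    me = t_value * sds / np.sqrt(n)    # margin of error of each interval
    covered = (means - me <= true_mean) & (true_mean <= means + me)
    return covered.mean()

rate = coverage_rate()
print('share of intervals containing the true mean: {:.3f}'.format(rate))
```

The true mean is fixed in every repeat; only the intervals move from sample to sample, and the share that covers the mean should land close to the chosen confidence level.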

## Margin of Error 

t_value * sample_standard_deviation / (n ** .5)

### How to decrease Margin of Error 

- decrease t_value -> decrease confidence

- decrease standard deviation <- usually not possible (e.g. if you measure people's heights)

- increase n 
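As a rough illustration of the three levers above, the sketch below computes the margin of error for a few confidence levels and sample sizes. The helper name `margin_of_error` is hypothetical; `s = 5` and `n = 16` are reused from the first example, and the other values are arbitrary.

```python
from scipy import stats

def margin_of_error(s, n, ci=0.95):
    """t_value * sample_standard_deviation / sqrt(n) for a two-sided interval."""
    t_value = stats.t.ppf(1 - (1 - ci) / 2, df=n - 1)
    return t_value * s / (n ** .5)

s = 5

# lower confidence -> smaller t_value -> smaller margin of error
for ci in (.99, .95, .90):
    print('ci = {:.2f}, n = 16 -> ME = {:.2f}'.format(ci, margin_of_error(s, 16, ci)))

# larger n -> smaller margin of error (n also lowers the t_value slightly)
for n in (16, 64, 256):
    print('ci = 0.95, n = {} -> ME = {:.2f}'.format(n, margin_of_error(s, n)))
```

Because n sits under a square root, quadrupling the sample size only roughly halves the margin of error.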


### Margin of Error and Sample Size

- suppose we want ME = .5 => what sample size do we need?

- ![sample_size_cal](sample_size_cal.png)

    - use an approximate t_value here (the exact t_value depends on n, which is not known yet), or use a normal distribution z_value

- comment: 

    - plan ahead of time

        - no standard deviation 

            - look at literature, leverage the similar data

            - conduct small pilot study

            - use expert knowledge for a range and properties of normal distribution

                - the range is usually about 6 std deviations because +/- 3 std includes 99.7% of the data, so std is roughly range / 6
    
    - balance between these two 
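A minimal sketch of the sample size calculation, assuming the formula in the image is the usual rearrangement of the margin of error, n = (z_value * s / ME) ** 2, with a z_value standing in for the t_value. The helper names `sample_size_for_me` and `sd_from_range`, and the range 10 to 40, are hypothetical.

```python
import math
from scipy import stats

def sample_size_for_me(s, me, ci=0.95):
    """Smallest n with z_value * s / sqrt(n) <= me (normal approximation)."""
    z_value = stats.norm.ppf(1 - (1 - ci) / 2)
    return math.ceil((z_value * s / me) ** 2)

def sd_from_range(low, high):
    """Rough standard deviation guess: the range covers about 6 std deviations."""
    return (high - low) / 6.0

# planning value for s when no data exist yet (assumed expert range of 10 to 40)
s_guess = sd_from_range(10, 40)

n_needed = sample_size_for_me(s=s_guess, me=.5)
print('planning std = {:.1f}, sample size needed = {}'.format(s_guess, n_needed))
```

Rounding up keeps the margin of error at or below the target. Using t_value ~ 2 instead of the z_value gives a slightly larger and more conservative n.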

In [31]:
from scipy import stats

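# sample size needed for a margin of error of .5 at 95% confidence
s = 5
ci = .95

me = .5

# rough sample size using t_value ~ 2 as the approximate critical value
(2 * s / me) ** 2
# z_value for a two-sided 95% interval
stats.norm.ppf(.975)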

1.959963984540054

In [30]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

fl_name = 'results.xlsx'
df = pd.read_excel(fl_name, sheet_name = 'data')
# xl = pd.ExcelFile(fl_name)
# print (xl.sheet_names)
test = 'test_name'
unit_id = 'unit_id'
unit = 'unit'
power = 'power_change'
test_data = 'data'
dead = 'dead'
try_out = 'try out?'
pu = 'PU'
part = 'change_part'
tm = 'tm'
minutes = 'min.1'
second = 'second'
surface = 'surface'

runt_cols = [unit, part, tm, pu, test_data, minutes, second, try_out, surface]
# .copy() so the assignments below modify runt_df itself, not a view of df
runt_df = df.loc[(df[test] == 'running_time') & (df[pu] == 180) & (df[surface] == 'hardfloor'), runt_cols].copy()
runt_df.loc[~pd.isna(runt_df[part]), try_out] = True
runt_df[test_data] = runt_df[minutes] + runt_df[second] / 60.0

analys_data = runt_df.loc[pd.isna(runt_df[try_out]), test_data]
runtime_s = analys_data.std()
runtime_mean = analys_data.mean()
runtime_n = len(analys_data)

ci = .95


# degrees of freedom for a one-sample t interval are n - 1, not n
t_stat = stats.t.ppf(1 - (1 - ci) / 2, df = runtime_n - 1)
lower, upper = stats.t.interval(ci, df = runtime_n - 1, loc = runtime_mean, scale = runtime_s / (runtime_n ** .5))
print ('TWO SIDE: have {}% confidence that the population MEAN is between {:.2f} and {:.2f}, and the t-value is {:.2f}'.format(ci*100, lower, upper, t_stat))

TWO SIDE: have 95.0% confidence that the population MEAN is between 13.57 and 15.19, and the t-value is 2.78

