In [None]:
!pip install scipy==1.7.3 

In [76]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

## Lab | Inferential statistics - T-test & P-value

Instructions
One tailed t-test - In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on the average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assume that there is sufficient evidence to conduct the t test, does the data provide sufficient evidence to show if one machine is better than the other?

In [77]:
data = pd.read_csv('files_for_lab/machine.txt', encoding='UTF-16', sep='\t')
data
# A faster machine has a lower mean packing time, so the test is left-tailed:
# H0: mean(new_machine_packing_times) >= mean(old_machine_packing_times)
# H1: mean(new_machine_packing_times) <  mean(old_machine_packing_times)
# n < 30 and the population variance is unknown, so we use a t statistic


Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [37]:
cols=list(data.columns.values)
cols

['New machine', '    Old machine']

In [39]:
data=data.rename({'New machine': 'New machine','    Old machine': 'Old machine'},axis='columns')

In [40]:
data['New machine'].mean()

42.14

In [41]:
data['Old machine'].mean()

43.230000000000004

Before conducting the two-sample t-test we need to check whether the two groups have similar variances. As a rule of thumb, if the ratio of the larger variance to the smaller variance is less than 4:1, we can treat the variances as equal.
The variances are computed below. Their ratio is about 1.2, so we treat the group variances as equal.

In [42]:
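# Print the variance of both data groups.
# np.var defaults to the population variance (ddof=0); with equal group sizes the ratio is the same as with ddof=1.
print(np.var(data['New machine']), np.var(data['Old machine']))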


0.4204000000000012 0.5060999999999988
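
The 4:1 rule of thumb can also be checked explicitly. The following is a minimal sketch using only the standard library, with the packing times copied by hand from the table printed earlier; the variable names are illustrative and not part of the original notebook. It uses the sample variance (dividing by n - 1), which gives the same ratio here because both groups have ten observations.

```python
from statistics import variance

# Packing times in seconds, copied from the table displayed above.
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# statistics.variance is the sample variance (it divides by n - 1).
var_new = variance(new_machine)
var_old = variance(old_machine)

# Ratio of the larger variance to the smaller one.
variance_ratio = max(var_new, var_old) / min(var_new, var_old)

# Rule-of-thumb threshold from the text: below 4 we treat the variances as equal.
equal_variances = variance_ratio < 4

print(round(variance_ratio, 2), equal_variances)
```

If the ratio were 4 or more, Welch's test (`equal_var=False` in `st.ttest_ind`) would be the safer choice.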


In [45]:
#‘less’: the mean of the distribution underlying the first sample is less than the mean of the distribution 
#underlying the second sample:
st.ttest_ind(a=data['New machine'], b=data['Old machine'], equal_var=True, alternative='less')

Ttest_indResult(statistic=-3.3972307061176026, pvalue=0.0016055712503872579)

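This is a one-tailed test with alpha = 0.05. The p-value is 0.0016 < 0.05.
The t statistic is negative, which is the direction tested by alternative='less' (the new machine has the lower mean packing time).
So we reject H0: the data support the claim that the new machine packs faster.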
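
To see where the t statistic comes from, here is a sketch of the pooled-variance (Student) two-sample formula written out by hand. It assumes equal variances, as discussed earlier, and uses the values copied from the table; `pooled_t_statistic` is a hypothetical helper name, not a SciPy function. The result should agree with the statistic reported by `st.ttest_ind`.

```python
from math import sqrt
from statistics import mean, variance

# Packing times in seconds, copied from the table displayed earlier.
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic, assuming equal population variances."""
    n_a, n_b = len(a), len(b)
    # Pooled sample variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    # Standard error of the difference between the two sample means.
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error


t_stat = pooled_t_statistic(new_machine, old_machine)
degrees_of_freedom = len(new_machine) + len(old_machine) - 2

print(round(t_stat, 4), degrees_of_freedom)
```

The sign depends on the argument order: passing the new machine first gives a negative statistic when its mean time is lower.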

## Matched Pairs Test

In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv). Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

Because the alternative hypothesis is an inequality in either direction, we need to use a two-sided test.
H0 - Pokemon's defense has the same average value as Pokemon's attack
H1 - Pokemon's defense has a different average value from Pokemon's attack



In [78]:
pokemon=pd.read_csv('pokemon.csv')
pokemon

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [48]:
from scipy.stats import ttest_rel

In [65]:
pokemon['difference'] = pokemon['Attack']-pokemon['Defense']
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,difference
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False,0
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False,-1
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False,-1
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False,-23
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False,9


In [56]:
sample_diff_mean, sample_diff_std = pokemon['difference'].mean(), pokemon['difference'].std(ddof=1)
sample_diff_mean, sample_diff_std

(5.15875, 33.7323418553516)

In [58]:
t = sample_diff_mean / ( sample_diff_std / np.sqrt(pokemon.shape[0]) )
print("The mean of our samples differences is: {:.2f}".format(sample_diff_mean))
print("The standard deviation of our samples differences is: {:.2f}".format(sample_diff_std))
print("Our t statistics is: {:.2f}".format(t))

The mean of our samples differences is: 5.16
The standard deviation of our samples differences is: 33.73
Our t statistics is: 4.33


In [67]:
tc = st.t.ppf(1-(0.05/2),df= pokemon.shape[0] - 1)
tc


1.9629374611056056

Our statistic is 4.33 and the two-sided critical value is 1.96. Since |4.33| > 1.96, we reject H0.

In [69]:
# Another way: ttest_rel runs the paired test directly and returns a two-sided p-value.
st.ttest_rel(pokemon['Attack'],pokemon['Defense'])
# The p-value, 1.7140303479358558e-05, is far below alpha = 0.05, so we reject H0.

Ttest_relResult(statistic=4.325566393330478, pvalue=1.7140303479358558e-05)
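
A paired t-test is a one-sample t-test on the per-row differences. The sketch below illustrates the mechanics on the first five rows shown in the table above only. It is not the full data set, so its result says nothing about the real conclusion; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# First five rows of the Pokemon table (illustration only, NOT the full 800 rows).
attack = [49, 62, 82, 100, 52]
defense = [49, 63, 83, 123, 43]

# Paired test: work with the difference within each row.
differences = [a - d for a, d in zip(attack, defense)]
n = len(differences)

# t = mean difference divided by its standard error (sample std, ddof=1).
t_paired = mean(differences) / (stdev(differences) / sqrt(n))

print(differences, round(t_paired, 3))
```

With only five rows the statistic is small in magnitude. The full data set gives the much larger value computed in the notebook because the standard error shrinks with the square root of n.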

## OPTIONAL PART | Inferential statistics - ANOVA

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam. Data was collected and provided to you to conduct statistical analysis and check if changing the power of the plasma beam has any effect on the etching rate by the machine. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

State the null hypothesis:
H0 for ANOVA is always that the means of the various groups are the same.

State the alternate hypothesis:
H1: at least one group mean differs from the others.

What is the significance level? 0.05
What are the degrees of freedom of the model, the error terms, and the total DoF?
The degrees of freedom (DF) are the number of independent pieces of information. Here the model has 2, the error term has 12, and the total is 14.


In [118]:
etching_rate=pd.read_excel('anova_lab_data.xlsx')
etching_rate

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


In [119]:
cols=list(etching_rate.columns.values)
cols

['Power ', 'Etching Rate']

In [121]:
etching_rate=etching_rate.rename({'Power ': 'Power','Etching Rate': 'Etching Rate'},axis='columns')

In [122]:
etching_rate['power_count'] = etching_rate.groupby('Power').cumcount()
etching_rate

Unnamed: 0,Power,Etching Rate,power_count
0,160 W,5.43,0
1,180 W,6.24,0
2,200 W,8.79,0
3,160 W,5.71,1
4,180 W,6.71,1
5,200 W,9.2,1
6,160 W,6.22,2
7,180 W,5.98,2
8,200 W,7.9,2
9,160 W,6.01,3


In [123]:
etching_rate_pivot = etching_rate.pivot(index='power_count', columns='Power', values='Etching Rate')

In [124]:
etching_rate_pivot

Power,160 W,180 W,200 W
power_count,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,5.43,6.24,8.79
1,5.71,6.71,9.2
2,6.22,5.98,7.9
3,6.01,5.66,8.15
4,5.59,6.6,7.55


## PART 1. Calculate ANOVA by hand

In [None]:
# H0 for ANOVA is always that the means of the various groups are the same
# H1 is that they are not the same

In [126]:
#First I create columns holding the squares of the values. Afterwards we will sum them.
etching_rate_pivot['160_w_square']=etching_rate_pivot['160 W']*etching_rate_pivot['160 W']
etching_rate_pivot['180_w_square']=etching_rate_pivot['180 W']*etching_rate_pivot['180 W']
etching_rate_pivot['200_w_square']=etching_rate_pivot['200 W']*etching_rate_pivot['200 W']

In [127]:
etching_rate_pivot

Power,160 W,180 W,200 W,160_w_square,180_w_square,200_w_square
power_count,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,5.43,6.24,8.79,29.4849,38.9376,77.2641
1,5.71,6.71,9.2,32.6041,45.0241,84.64
2,6.22,5.98,7.9,38.6884,35.7604,62.41
3,6.01,5.66,8.15,36.1201,32.0356,66.4225
4,5.59,6.6,7.55,31.2481,43.56,57.0025


In [134]:
sum_x1=sum(etching_rate_pivot['160 W'])
sum_x1

28.959999999999997

In [135]:
sum_x2=sum(etching_rate_pivot['180 W'])
sum_x2

31.189999999999998

In [136]:
sum_x3=sum(etching_rate_pivot['200 W'])
sum_x3

41.589999999999996

In [138]:
#We calculate the square of each group's sum and divide by the number of values in the group.
temp1=sum_x1*sum_x1/5
temp1

167.73631999999998

In [139]:
temp2=sum_x2*sum_x2/5
temp2

194.56321999999997

In [140]:
temp3=sum_x3*sum_x3/5
temp3

345.9456199999999

In [142]:
#Sum of the squared values for each group
sq1=sum(etching_rate_pivot['160_w_square'])
sq1

168.1456

In [143]:
sq2=sum(etching_rate_pivot['180_w_square'])
sq2

195.3177

In [144]:
sq3=sum(etching_rate_pivot['200_w_square'])
sq3

347.73909999999995

In [145]:
#Sum of the group sums (the grand total)
sum_of_sum=sum_x1+sum_x2+sum_x3
sum_of_sum

101.73999999999998

In [146]:
#Square of the grand total, divided by the total number of values
sq_of_sum_of_sum=sum_of_sum*sum_of_sum/15
sq_of_sum_of_sum

690.0685066666664

In [147]:
total_temp=temp1+temp2+temp3
total_temp

708.2451599999999

In [148]:
sum_of_sq=sq1+sq2+sq3
sum_of_sq

711.2023999999999

In [150]:
df_between=2   # k - 1, with k = 3 groups
df_within=12   # N - k, with N = 15 values

In [152]:
SS_between=total_temp-sq_of_sum_of_sum
SS_between

18.176653333333547

In [153]:
SS_within=sum_of_sq-total_temp
SS_within

2.957239999999956

In [154]:
MS_between=SS_between/df_between
MS_between

9.088326666666774

In [155]:
MS_within=SS_within/df_within
MS_within

0.246436666666663

In [156]:
F=MS_between/MS_within
F
#scipy.stats.f_oneway gives F_onewayResult(statistic=36.87895470100505)
#Same result up to floating-point rounding. Hooray!

36.87895470100597
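
The step-by-step cells above can be condensed into one function. This is a sketch of the same computational shortcut (group sums, squared sums, and a correction term), using only the standard library and the values copied from the pivot table; `one_way_anova_f` is a hypothetical helper name.

```python
# Etching rates per power level, copied from the pivot table above.
groups = {
    "160 W": [5.43, 5.71, 6.22, 6.01, 5.59],
    "180 W": [6.24, 6.71, 5.98, 5.66, 6.60],
    "200 W": [8.79, 9.20, 7.90, 8.15, 7.55],
}


def one_way_anova_f(samples):
    """Return (SS_between, SS_within, F) using the computational formulas."""
    all_values = [x for sample in samples for x in sample]
    n_total = len(all_values)
    k = len(samples)

    # (grand total)^2 / N
    correction = sum(all_values) ** 2 / n_total
    # sum over groups of (group total)^2 / group size
    group_term = sum(sum(sample) ** 2 / len(sample) for sample in samples)
    # sum of every squared observation
    raw_sum_of_squares = sum(x * x for x in all_values)

    ss_between = group_term - correction
    ss_within = raw_sum_of_squares - group_term

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ss_between, ss_within, ms_between / ms_within


ss_between, ss_within, f_stat = one_way_anova_f(list(groups.values()))
print(round(ss_between, 4), round(ss_within, 4), round(f_stat, 4))
```

Unlike the cells above, the function derives the degrees of freedom from the data, so it also works for a different number of groups or unequal group sizes.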

## Part 2. Use Python to conduct ANOVA

In [107]:
cols=list(etching_rate_pivot.columns.values)
cols

['160 W', '180 W', '200 W']

In [108]:
etching_rate_pivot=etching_rate_pivot.rename({'160 W': 'Power_160_W','180 W': 'Power_180_W','200 W': 'Power_200_W'},axis='columns')
etching_rate_pivot.head(10)

Power,Power_160_W,Power_180_W,Power_200_W
power_count,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,5.43,6.24,8.79
1,5.71,6.71,9.2
2,6.22,5.98,7.9
3,6.01,5.66,8.15
4,5.59,6.6,7.55


In [109]:
etching_rate_pivot.mean()

Power
Power_160_W    5.792
Power_180_W    6.238
Power_200_W    8.318
dtype: float64

In [111]:
# H0 for ANOVA is always that the means of the various groups are the same
# H1 is that they are not the same
st.f_oneway(etching_rate_pivot.Power_160_W,etching_rate_pivot.Power_180_W,etching_rate_pivot.Power_200_W)


F_onewayResult(statistic=36.87895470100505, pvalue=7.506584272358903e-06)

The p-value (about 7.5e-06) is much smaller than alpha = 0.05, so we reject H0: at least one power level has a different mean etching rate.
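
As a cross-check on the p-value, note that when the numerator has exactly 2 degrees of freedom the upper tail of the F distribution has a simple closed form, P(F > f) = (1 + 2f/d2)^(-d2/2). The sketch below applies it to this case (2 and 12 degrees of freedom). It does not generalise to other numerator degrees of freedom, and the function name is hypothetical.

```python
# F statistic and degrees of freedom from the ANOVA above.
f_statistic = 36.87895470100505
df_between, df_within = 2, 12


def f_survival_two_numerator_df(f_value, denominator_df):
    """Upper-tail probability of F(2, denominator_df); valid ONLY for 2 numerator df."""
    return (1 + 2 * f_value / denominator_df) ** (-denominator_df / 2)


p_value = f_survival_two_numerator_df(f_statistic, df_within)

alpha = 0.05
reject_h0 = p_value < alpha

print(p_value, reject_h0)
```

For the general case, `st.f.sf(f_statistic, df_between, df_within)` from SciPy computes the same tail probability.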

Degrees of freedom between groups: k - 1, where k is the number of groups. In our case k = 3, so k - 1 = 2.
Degrees of freedom within groups: N - k = 15 - 3 = 12.
Total degrees of freedom: N - 1 = 14,
where N is the total number of observations.
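
The three degrees-of-freedom formulas can be wrapped in a small helper. This is a minimal sketch; `anova_degrees_of_freedom` is a hypothetical name, and the group sizes (five observations per power level) are taken from the pivot table.

```python
def anova_degrees_of_freedom(group_sizes):
    """Degrees of freedom for a one-way ANOVA, given the size of each group."""
    k = len(group_sizes)        # number of groups
    n_total = sum(group_sizes)  # total number of observations
    return {
        "between": k - 1,
        "within": n_total - k,
        "total": n_total - 1,
    }


# Three power levels with five etching-rate measurements each.
dof = anova_degrees_of_freedom([5, 5, 5])
print(dof)
```

The between and within values always add up to the total, which is a quick sanity check on any ANOVA table.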

Error for ANOVA - the amount by which an observed value differs from the value predicted by the model (here, the mean of its group).

In [157]:
etching_rate

Unnamed: 0,Power,Etching Rate,power_count
0,160 W,5.43,0
1,180 W,6.24,0
2,200 W,8.79,0
3,160 W,5.71,1
4,180 W,6.71,1
5,200 W,9.2,1
6,160 W,6.22,2
7,180 W,5.98,2
8,200 W,7.9,2
9,160 W,6.01,3


SST (sum of squares for treatments, i.e. between groups):
This term measures how much each group mean deviates from the global mean: it adds up the squared deviations, each multiplied by the number of members in the group. Dividing that sum by the number of groups minus one gives the mean square S2t computed below.

In [159]:
S2t = 0
for rate in etching_rate['Power'].unique():
    ng = len(etching_rate[etching_rate['Power'] == rate])  
    S2t  += ( (etching_rate[etching_rate['Power'] == rate]['Etching Rate'].mean() - etching_rate['Etching Rate'].mean() ) ** 2) * ng
S2t /= ( etching_rate['Power'].nunique() - 1 )
print("The value of S2t is {:.2f}".format(S2t)) 

The value of S2t is 9.09


In [160]:
SST=S2t*(etching_rate['Power'].nunique() - 1)  # undo the division by k - 1 = 2
SST

18.176653333333334

SSE (sum of squares for error, i.e. within groups):
This other term measures how much every single value of every group deviates from its own group mean.
In summary, SST measures the variation of the group means around the global mean, while SSE measures the variation of the values around their group means.

In [161]:
S2E = 0
for p in etching_rate['Power'].unique():
    for rate in etching_rate[etching_rate['Power'] == p]['Etching Rate']:
        S2E += ( rate - etching_rate[etching_rate['Power'] == p]['Etching Rate'].mean() ) ** 2
S2E /= ( len(etching_rate) - etching_rate['Power'].nunique() ) 

print()
print("The value of S2E is {:.2f}".format(S2E))


The value of S2E is 0.25


In [162]:
SSE=S2E*(len(etching_rate) - etching_rate['Power'].nunique())  # undo the division by N - k = 12
SSE

2.9572399999999974
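
The two sums of squares can also be computed straight from their definitions, which makes the decomposition total = SST + SSE easy to verify. The following sketch uses only the standard library and the values copied from the pivot table; the variable names are illustrative.

```python
from statistics import mean

# Etching rates per power level, copied from the pivot table.
groups = {
    "160 W": [5.43, 5.71, 6.22, 6.01, 5.59],
    "180 W": [6.24, 6.71, 5.98, 5.66, 6.60],
    "200 W": [8.79, 9.20, 7.90, 8.15, 7.55],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# SST: squared distance of each group mean from the grand mean, weighted by group size.
ss_treatment = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# SSE: squared distance of each value from its own group mean.
ss_error = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

# Total sum of squares: squared distance of each value from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

print(round(ss_treatment, 4), round(ss_error, 4), round(ss_total, 4))
```

The total sum of squares should equal SST + SSE up to floating-point rounding, which confirms that the two terms split the overall variation between and within groups.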