### Problem Statement 1: Is gender independent of education level? A random sample of 395 people was surveyed, and each person was asked to report the highest education level they had obtained. The survey data are summarized in the following table:

        High School  Bachelors  Masters  Ph.D.  Total
Female       60          54       46      41     201
Male         40          44       53      57     194
Total       100          98       99      98     395

### Question: Are gender and education level dependent at the 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education they have obtained?

We need to perform a chi-square test of independence.

- Null hypothesis, H0: There is no relationship between gender and level of education
- Alternative hypothesis, H1: There is a relationship between gender and level of education

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats

female = [60,54,46,41]
male = [40,44,53,57]
high_school = [60,40]
bachelors = [54,44]
masters = [46,53]
Phd = [41,57]

marks = female + male
print(marks)

[60, 54, 46, 41, 40, 44, 53, 57]


In [8]:
sex = ['F','F','F','F','M','M','M','M']
education = ['high_school','bachelors','masters','Phd','high_school','bachelors','masters','Phd']

data = {'education':education,'Marks':marks,'sex':sex}
df = pd.DataFrame(data)
df

     education  Marks sex
0  high_school     60   F
1    bachelors     54   F
2      masters     46   F
3          Phd     41   F
4  high_school     40   M
5    bachelors     44   M
6      masters     53   M
7          Phd     57   M


In [10]:
crosstab= pd.crosstab([df.sex,df.Marks],df.education,margins=True)
crosstab

education  Phd  bachelors  high_school  masters  All
sex Marks
F   41.0     1          0            0        0    1
    46.0     0          0            0        1    1
    54.0     0          1            0        0    1
    60.0     0          0            1        0    1
M   40.0     0          0            1        0    1
    44.0     0          1            0        0    1
    53.0     0          0            0        1    1
    57.0     1          0            0        0    1
All          2          2            2        2    8


In [11]:
df1 = pd.crosstab(df.sex,df.education,df.Marks,aggfunc='sum',margins=True)
df1

education  Phd  bachelors  high_school  masters  All
sex
F           41         54           60       46  201
M           57         44           40       53  194
All         98         98          100       99  395


In [16]:
df1.columns = ['Phd','Bachelors','High_school','Masters',"Genderwise_total"]
df1.index = ['Female','Male','Combined']
df1

          Phd  Bachelors  High_school  Masters  Genderwise_total
Female     41         54           60       46               201
Male       57         44           40       53               194
Combined   98         98          100       99               395


In [17]:
df2 = df1.drop(["Genderwise_total"],axis=1)
df2

          Phd  Bachelors  High_school  Masters
Female     41         54           60       46
Male       57         44           40       53
Combined   98         98          100       99


We need to calculate the expected count for each cell of a 2D table rather than a 1D table. The expected count for a cell is its row total multiplied by its column total, divided by the total number of observations. We can get the expected counts for all cells at once by taking the row totals and column totals of the table, computing their outer product, and dividing by the number of observations (395).
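The outer-product recipe can be sketched directly on the raw counts, without going through the crosstab. This is a minimal illustration; the names `observed`, `row_totals`, `col_totals`, and `expected` are chosen here and are not part of the notebook's own cells.

```python
import numpy as np

# Observed counts from the survey table
# (rows: Female, Male; columns: High School, Bachelors, Masters, Ph.D.).
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

row_totals = observed.sum(axis=1)   # one total per gender
col_totals = observed.sum(axis=0)   # one total per education level
n = observed.sum()                  # total number of observations

# Expected count for cell (i, j) = row_totals[i] * col_totals[j] / n
expected = np.outer(row_totals, col_totals) / n
print(expected.round(2))
```

For example, the expected count for Female with High School is 201 × 100 / 395, which is about 50.89.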

In [19]:
df3 = np.outer(df1['Genderwise_total'][0:2],df1.loc['Combined'][0:4])/395.0
df3 = pd.DataFrame(df3)

In [20]:
df3.columns = ['Phd','Bachelors','High_school','Masters']
df3.index = ['Female','Male']
df3

              Phd  Bachelors  High_school    Masters
Female  49.868354  49.868354    50.886076  50.377215
Male    48.131646  48.131646    49.113924  48.622785


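Next we calculate the chi-square statistic, the critical value, and the p-value. The statistic is the sum over all cells of (observed - expected)² / expected. sum() is called twice: the first call returns the column sums and the second adds those together, giving the total over the whole 2D table. Because df2 still contains the Combined row, which has no counterpart in df3, that row becomes NaN in the subtraction and sum() skips it.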

In [21]:
chi_sq_test = (((df3-df2)**2)/df3).sum().sum()
chi_sq_test

8.00606624626254

In [22]:
#Critical value at the 95% confidence level (alpha = 0.05) with 3 degrees of freedom
critical_value = stats.chi2.ppf(q=0.95,df=3)
print("Critical value: ", critical_value)

Critical value:  7.814727903251179


In [24]:
#P-value

p_value = 1 -stats.chi2.cdf(x=chi_sq_test,df=3)
print("P_value: ", p_value)

P_value:  0.04588650089174717


In [25]:
#Perform the test of independence automatically with stats.chi2_contingency(), which takes a frequency table of observed counts.
#Pass only the Female and Male rows: the Combined row is a margin total, not a category, and including it inflates the degrees of freedom from 3 to 6.

chi2, p, dof, expected = stats.chi2_contingency(observed=df2.loc[['Female','Male']])
print(round(chi2, 3), round(p, 4), dof)


8.006 0.0459 3


Chi-square statistic = 8.006, p-value = 0.046, and degrees of freedom = (2 - 1) × (4 - 1) = 3. The critical value for 3 degrees of freedom at the 5% level is 7.815. Since 8.006 > 7.815 (equivalently, 0.046 < 0.05), we reject the null hypothesis and conclude, at the 5% level of significance, that there is a relationship between the gender of an individual and the level of education they have obtained.
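The two decision rules used here (statistic versus critical value, and p-value versus alpha) can be checked side by side. The following is a small sketch under the assumption that the counts are entered as a plain 2 × 4 array; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Observed counts (rows: Female, Male; columns: four education levels).
observed = np.array([[60, 54, 46, 41],
                     [40, 44, 53, 57]])

# Expected counts under independence.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson chi-square statistic and its degrees of freedom.
chi_sq = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

alpha = 0.05
critical = stats.chi2.ppf(1 - alpha, df=dof)   # rejection threshold
p_value = stats.chi2.sf(chi_sq, df=dof)        # upper-tail probability

reject_by_critical = chi_sq > critical
reject_by_p = p_value < alpha
print(dof, reject_by_critical, reject_by_p)
```

Both rules always agree, because the p-value falls below alpha exactly when the statistic exceeds the critical value.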

### Problem Statement 2: Using the following data, perform a one-way analysis of variance using α = .05. Write up the results in APA format.

In [27]:
import scipy.stats as stats

In [28]:
Group1 = [51,45,33,45,67]
Group2 = [23,43,23,43,45]
Group3 = [56,76,74,87,56]

#Perform the Anova
statistic, pvalue = stats.f_oneway(Group1,Group2,Group3)
print("F Statistic Value {}, p_value {}". format(statistic,pvalue))
if pvalue < 0.05:
    print("True")
else:
    print("False")
    

F Statistic Value 9.747205503009463, p_value 0.0030597541434430556
True
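An APA-style write-up also needs the degrees of freedom, which `stats.f_oneway` does not return. The sketch below recomputes the F statistic from the between-group and within-group sums of squares; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import stats

groups = [np.array([51, 45, 33, 45, 67]),
          np.array([23, 43, 23, 43, 45]),
          np.array([56, 76, 74, 87, 56])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1               # k - 1
df_within = len(all_values) - len(groups)  # N - k

# F is the ratio of the two mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)
print(df_between, df_within, round(f_stat, 2), round(p_value, 3))
```

With 2 and 12 degrees of freedom, the result can be reported as: there was a significant difference between the group means at the α = .05 level, F(2, 12) = 9.75, p = .003.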


### Problem Statement 3: Calculate the F test for the two samples 10, 20, 30, 40, 50 and 5, 10, 15, 20, 25.

In [30]:
#F test for equality of variances: the ratio of the two sample variances.
#(stats.f_oneway compares group means instead, so it is not used here.)

Group1 = [10,20,30,40,50]
Group2 = [5,10,15,20,25]
mean_1 = np.mean(Group1)
mean_2 = np.mean(Group2)

#Sum of squared deviations from each group's mean
add1 = sum((item - mean_1)**2 for item in Group1)
add2 = sum((item - mean_2)**2 for item in Group2)

#Sample variances (n - 1 in the denominator)
var1 = add1/(len(Group1)-1)
var2 = add2/(len(Group2)-1)

F_test = var1/var2
F_test

4.0
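The ratio alone does not say whether the two variances differ significantly. One way to attach a p-value, assuming both samples come from normal populations, is to compare the statistic with an F distribution with (n1 - 1, n2 - 1) degrees of freedom. This is a sketch with illustrative names, not part of the original solution.

```python
import numpy as np
from scipy import stats

group1 = [10, 20, 30, 40, 50]
group2 = [5, 10, 15, 20, 25]

# Sample variances (ddof=1 gives the n - 1 denominator).
var1 = np.var(group1, ddof=1)
var2 = np.var(group2, ddof=1)
f_stat = var1 / var2

# Degrees of freedom for numerator and denominator.
dfn, dfd = len(group1) - 1, len(group2) - 1

# One-sided p-value: probability of a ratio at least this large.
p_one_sided = stats.f.sf(f_stat, dfn, dfd)
# Two-sided p-value: double the smaller tail.
p_two_sided = 2 * min(p_one_sided, stats.f.cdf(f_stat, dfn, dfd))
print(f_stat, dfn, dfd, p_one_sided, p_two_sided)
```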