# Problem Statement 1:
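#### Is gender independent of education level? A random sample of 395 people was surveyed, and each person was asked to report the highest education level they had obtained. The survey data are summarized in the following table:

|  | High School | Bachelors | Masters | Ph.D. | Total |
|---|---|---|---|---|---|
| Female | 40 | 44 | 53 | 57 | 194 |
| Male | 60 | 54 | 46 | 41 | 201 |
| Total | 100 | 98 | 99 | 98 | 395 |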

#### Question: Are gender and education level dependent at 5% level of significance? In other words, given the data collected above, is there a relationship between the gender of an individual and the level of education that they have obtained?

##### State the hypothesis
H0: Gender and education level are independent.
H1: Gender and education level are not independent.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [2]:
#Initialise data
edu_lvls=['High School','Bachelors','Masters','Ph-d','High School','Bachelors','Masters','Ph-d']
male_female=['male','male','male','male','female','female','female','female']
observations=[60,54,46,41,40,44,53,57]

raw_data=pd.DataFrame({"sex":male_female,
                        "education":edu_lvls,
                        "value":observations})
raw_data

|  | education | sex | value |
|---|---|---|---|
| 0 | High School | male | 60 |
| 1 | Bachelors | male | 54 |
| 2 | Masters | male | 46 |
| 3 | Ph-d | male | 41 |
| 4 | High School | female | 40 |
| 5 | Bachelors | female | 44 |
| 6 | Masters | female | 53 |
| 7 | Ph-d | female | 57 |


In [3]:
#Creating contingency table from raw data.
df=pd.crosstab(raw_data.sex,raw_data.education,values=raw_data.value, aggfunc="sum",margins=True)
#crosstab sorts the labels alphabetically, so the new names below are listed in that order
df.columns = ["Bachelors","High School","Masters","Ph.d.","row_totals"]
df.index = ["Female","Male","col_totals"]
df

|  | Bachelors | High School | Masters | Ph.d. | row_totals |
|---|---|---|---|---|---|
| Female | 44 | 40 | 53 | 57 | 194 |
| Male | 54 | 60 | 46 | 41 | 201 |
| col_totals | 98 | 100 | 99 | 98 | 395 |


In [4]:
df_observed = df.iloc[0:2,0:4]   # Get observed data from table
df_observed

|  | Bachelors | High School | Masters | Ph.d. |
|---|---|---|---|---|
| Female | 44 | 40 | 53 | 57 |
| Male | 54 | 60 | 46 | 41 |


In [5]:
#We can quickly get the expected counts for all cells in the table by 
#taking the row totals and column totals of the table, performing an 
#outer product on them with the np.outer() function and dividing by the number of observations:

df_expected = pd.DataFrame(np.outer(df["row_totals"][0:2],
                     df.loc["col_totals"][0:4]) / df.iloc[2,4])
df_expected.columns = ["Bachelors","High School","Masters","Ph.d."]
df_expected.index = ["Female","Male"]
df_expected

|  | Bachelors | High School | Masters | Ph.d. |
|---|---|---|---|---|
| Female | 48.131646 | 49.113924 | 48.622785 | 48.131646 |
| Male | 49.868354 | 50.886076 | 50.377215 | 49.868354 |
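
The outer-product step above is the usual expected-count formula for a test of independence, written out here with one cell worked by hand. The symbols $R_i$, $C_j$, and $N$ (row total, column total, and grand total) are notation introduced for this illustration only.

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
E_{\text{Female},\,\text{Bachelors}} = \frac{194 \times 98}{395} \approx 48.13
```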


In [13]:
#Apply Χ^2 = Σ [ (Observed - Expected)^2 / Expected ] to calculate the chi-squared statistic of the data

chi_squared= (((df_observed-df_expected)**2)/df_expected).sum().sum()
print("Chi-squared statistic: %0.2f"%chi_squared)

Chi-squared statistic: 8.01


In [14]:
#Calculate the critical value for the 5% significance level

dof = (df_observed.shape[0] - 1) * (df_observed.shape[1] - 1) # degrees of freedom = (rows - 1) * (columns - 1)
crit = stats.chi2.ppf(q = 0.95, df = dof) # 0.95 is the confidence level = 1 - significance level = 1 - 0.05

print("Critical value for the data is: %0.2f" % crit)


Critical value for the data is: 7.81


#### Conclusion
With 3 degrees of freedom, the critical value is 7.81. Since 8.01 > 7.81, we reject the null hypothesis and conclude that gender and education level are not independent at the 5% level of significance.
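
As a cross-check, the same test can be run in a single call with `scipy.stats.chi2_contingency`, which returns the statistic, the p-value, the degrees of freedom, and the expected counts. The sketch below re-enters the observed counts by hand (the `observed` array is a name introduced here for illustration) and should reproduce the step-by-step values. SciPy applies Yates' continuity correction only when there is one degree of freedom, so it does not affect this 2×4 table.

```python
import numpy as np
import scipy.stats as stats

# Observed counts: rows = Female, Male;
# columns = Bachelors, High School, Masters, Ph.D.
observed = np.array([[44, 40, 53, 57],
                     [54, 60, 46, 41]])

# Returns the chi-squared statistic, p-value, degrees of freedom,
# and the table of expected counts under independence
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print("Chi-squared statistic: %0.2f" % chi2_stat)
print("Degrees of freedom: %d" % dof)
print("p-value: %0.4f" % p_value)
```

A p-value below 0.05 leads to the same decision as comparing the statistic with the critical value.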

# Problem Statement 2:
#### Using the following data, perform a one-way analysis of variance using α=.05. Write up the results in APA format.

#### [Group1: 51, 45, 33, 45, 67]
#### [Group2: 23, 43, 23, 43, 45]
#### [Group3: 56, 76, 74, 87, 56]

In [8]:
#The scipy library has a function for carrying out one-way ANOVA tests called scipy.stats.f_oneway().
#We will use it to run the test on this data.
#No need to re-import the libraries; they were already imported in Problem 1.

#step 1. Initialise data
group1=[51, 45, 33, 45, 67]
group2=[23, 43, 23, 43, 45]
group3=[56, 76, 74, 87, 56]

In [9]:
#step 2. Perform one-way ANOVA test
f_stat_value,p_value=stats.f_oneway(group1,group2,group3)

In [10]:
#step 3. Print the values and make the decision
print("F-stat value:  %0.3f"%f_stat_value)
print("p-value:  %0.3f"%p_value)

F-stat value:  9.747
p-value:  0.003


#### Conclusion

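The p-value returned is 0.003, which is less than α = .05, so we reject the null hypothesis that the three groups have the same population mean. In APA format: a one-way analysis of variance showed a statistically significant difference among the three groups, F(2, 12) = 9.75, p = .003.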
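
To show where the F statistic and its degrees of freedom come from, here is a minimal sketch that computes the between-group and within-group sums of squares by hand. The variable names (`groups`, `ss_between`, `ss_within`, and so on) are introduced for illustration; the result should agree with `stats.f_oneway()`.

```python
import numpy as np
import scipy.stats as stats

groups = [np.array([51, 45, 33, 45, 67]),
          np.array([23, 43, 23, 43, 45]),
          np.array([56, 76, 74, 87, 56])]

k = len(groups)                            # number of groups
n_total = sum(len(g) for g in groups)      # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1          # numerator degrees of freedom
df_within = n_total - k     # denominator degrees of freedom

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

print("F(%d, %d) = %0.3f, p = %0.3f" % (df_between, df_within, f_manual, p_manual))
```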

# Problem Statement 3:
#### Calculate the F-test statistic for the two samples 10, 20, 30, 40, 50 and 5, 10, 15, 20, 25.

In [19]:
#initialise data
group1=[10, 20, 30, 40, 50]
group2=[5,10,15, 20, 25]

In [20]:
#Step 1. Calculate the sample variance of the first group
mg1 = np.mean(group1)
addg1 = 0
for items in group1:
    addg1 += (items - mg1)**2
varg1 = addg1/(len(group1)-1)
varg1

250.0

In [23]:
#Step 2. Calculate the sample variance of the second group
mg2 = np.mean(group2)
addg2 = 0
for items in group2:
    addg2 += (items - mg2)**2
varg2 = addg2/(len(group2)-1)
varg2

62.5

In [25]:
#Step 3. Calculate the F-test value
F_test=varg1/varg2
print("F-test value for [10, 20, 30, 40, 50] and [5, 10, 15, 20, 25] is %0.2f"%F_test)

F-test value for [10, 20, 30, 40, 50] and [5, 10, 15, 20, 25] is 4.00
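
The variance loops can be written more compactly with `np.var(..., ddof=1)`, and the F value can be turned into a p-value with the F distribution in `scipy.stats`. The problem only asks for the statistic, so the p-value is an optional extension, and the one-tailed test (larger variance in the numerator) is an assumption made here for illustration.

```python
import numpy as np
import scipy.stats as stats

group1 = [10, 20, 30, 40, 50]
group2 = [5, 10, 15, 20, 25]

# ddof=1 gives the sample variance (divides by n - 1), matching the loops above
var1 = np.var(group1, ddof=1)
var2 = np.var(group2, ddof=1)

f_value = var1 / var2
dfn = len(group1) - 1   # numerator degrees of freedom
dfd = len(group2) - 1   # denominator degrees of freedom

# One-tailed p-value: probability of an F at least this large
# if the two population variances are equal
p_one_tailed = stats.f.sf(f_value, dfn, dfd)

print("F = %0.2f, one-tailed p = %0.3f" % (f_value, p_one_tailed))
```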
