# EdX Dataset - Inferential Statistics  

The goal of this project is to build a model that predicts the course completion rate from factors such as the registrant's age, educational background, course, and region.

In this notebook, we use hypothesis testing to check whether a user's educational background is related to the course grade.


In [1]:
import numpy as np 
import pandas as pd 

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn import metrics 
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

import matplotlib.pyplot as plt
import seaborn as sb

# EdX Dataset 


The dataset contains data about participants who enrolled in MITx and HarvardX courses on the EdX platform (Academic Year 2013: Fall 2012, Spring 2013, and Summer 2013). The data includes aggregate records of participants' activities on EdX, with some information, such as the user name, de-identified. The dataset was downloaded from:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/26147&version=10.0



In [2]:
# PATH to data file
file='HMXPC13_DI_v2_5-14-14.csv'
path='/Users/suka/Downloads/dataverse_files'
filename=path+'/'+file

In [3]:
#  READ THE CSV to data frame 
full_df = pd.read_csv(filename,parse_dates=True)


In [4]:
# Extract relevant fields 
data = full_df[['course_id','userid_DI','final_cc_cname_DI','LoE_DI','YoB','gender','start_time_DI','grade','viewed','explored','nevents','ndays_act','nplay_video','nchapters','certified','registered','incomplete_flag']].copy()

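# split the course id (institution/course/year_term) into multiple fields
# expand=True returns one column per part; unpacking the .str accessor fails on recent pandas
parts = data['course_id'].str.split('/', expand=True)
a, b, c = parts[0], parts[1], parts[2]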
data.insert(1,'institution',a)
data.insert(2,'course',b)

# clean up the year which has the _spring,_fall suffixes 
a = c.str.split('_').str[0]
b = c.str.split('_').str[1]

data.insert(3,'year',a)
data.insert(4,'term',b)

data.drop('course_id',axis=1,inplace=True)

# Create the mapping of course-id to title 
#courselist = {'CB22x':'Greek Heros', 'CS50x':'Computer Science','ER22x':'Justice','PH207x':'Health Stat','PH278x':'Health Env','14.73x':'Poverty','2.01x':'Structures','3.091x':'SS Chemistry','6.002x':'Circuits','6.00x':'Computer Pgming','7.00x':'Biology','8.02x':'Electricity & Magnetism','8.MReV':'Mechanics'}
courselist = {'CB22x':'Greek Heros', 'CS50x':'Comp Sci','ER22x':'Justice','PH207x':'Health Stat','PH278x':'Health Env','14.73x':'Poverty','2.01x':'Structures','3.091x':'SS Chemistry','6.002x':'Circuits','6.00x':'Comp Pgming','7.00x':'Biology','8.02x':'Elec & Magnetism','8.MReV':'Mechanics'}
data['course'] = data['course'].replace(courselist)

#rename columns 
data.rename(columns={'nchapters':'chapters viewed','ndays_act':'days active','nplay_video':'videos played','course_id': 'course', 'final_cc_cname_DI': 'country','LoE_DI':'education','userid_DI':'user',"start_time_DI":"start-time"}, inplace=True)

data['YoB'] = data.groupby('course')['YoB'].transform(lambda x: x.fillna(x.median()))
data['gender'] = data.groupby('course')['gender'].transform(lambda x: x.fillna(x.value_counts().index[0]))
data['education'] = data.groupby('course')['education'].transform(lambda x: x.fillna(x.value_counts().index[0]))
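
The group-wise imputation above can be illustrated on a small toy frame. This is only a sketch: the values below are made up and are not taken from the EdX data, and `toy` is a hypothetical name. It uses `mode()` for the most frequent value, which should behave like `value_counts().index[0]` when a group has a single most common value.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (NOT the EdX data)
toy = pd.DataFrame({
    "course": ["A", "A", "A", "B", "B", "B"],
    "YoB": [1990.0, np.nan, 1980.0, 1970.0, 1975.0, np.nan],
    "gender": ["m", "m", None, "f", None, "f"],
})

# Numeric column: fill gaps with the median of the same course
toy["YoB"] = toy.groupby("course")["YoB"].transform(lambda s: s.fillna(s.median()))

# Categorical column: fill gaps with the most frequent value of the same course
toy["gender"] = toy.groupby("course")["gender"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)

print(toy)
```

Note that both variants raise an error if every value in a group is missing, since there is then no most frequent value to pick.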


In [5]:

# change the datatypes of some of the columns
data['institution']= data.institution.astype('category')
data['course']= data.course.astype('category')

e_order = ["Less than Secondary","Secondary","Bachelor's","Master's","Doctorate"]
#e_type = pd.api.types.CategoricalDtype(categories=["Less than Secondary","Secondary","Bachelor's","Master's","Doctorate"], ordered=True)

data['grade'] = pd.to_numeric(data['grade'],errors='coerce')
data['grade'] = data['grade'].fillna(0).multiply(100)

data['education']= data.education.astype('category').cat.set_categories(e_order, ordered=True)

# Fill NaN data with zeros (assign back instead of calling fillna(inplace=True) on a column selection)
for col in ['chapters viewed', 'nevents', 'videos played', 'days active']:
    data[col] = data[col].fillna(0)  # replace NaN with 0
data['term'] = data['term'].fillna('Fall')  # course ids without a term suffix default to 'Fall'

data['gender']= data.gender.astype('category')
data['year']= data.year.astype('int')
data['YoB']= data.YoB.astype('int')
data['nevents']= data.nevents.astype('int')

data['start-time'] = pd.to_datetime(data['start-time'])

# Add a column "age" using YoB (age in the year the course ran, i.e. the 'year' column)
data.insert(3, "age", data['year'] - data['YoB'])
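
As a side note on the ordered `education` category set up in this cell, the following sketch shows what the ordering buys us: comparisons, sorting, and `max()` follow `e_order` rather than alphabetical order. The `levels` series is a made-up example, not data from the notebook.

```python
import pandas as pd

e_order = ["Less than Secondary", "Secondary", "Bachelor's", "Master's", "Doctorate"]

# Made-up sample of education levels (NOT the EdX data)
levels = pd.Series(["Master's", "Secondary", "Doctorate", "Bachelor's"])
ordered = levels.astype("category").cat.set_categories(e_order, ordered=True)

# Comparisons against a category label respect the declared order
at_least_bachelors = (ordered >= "Bachelor's").tolist()
print(at_least_bachelors)

# Sorting and max() also follow e_order, not alphabetical order
sorted_levels = ordered.sort_values().tolist()
print(sorted_levels)
print(ordered.max())
```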

# Let's get the grade averages for various education levels

In [14]:
cdata = data[data.certified==1].copy()  # explicit copy, so the assignment below does not hit a slice of `data`
cdata['grade'] = cdata['grade'].astype(np.int64)  # truncate grades to whole percentages

aggregations = { 'certified':'sum','grade':'mean'}

# one subset per education level (certified participants only)
lsd = cdata[cdata.education == "Less than Secondary"]
sd = cdata[cdata.education == "Secondary"]
bd = cdata[cdata.education == "Bachelor's"]
md = cdata[cdata.education == "Master's"]
dd = cdata[cdata.education == "Doctorate"]  # not named `pd`, which would shadow the pandas alias

edata = cdata.groupby(['education'],as_index=False).agg(aggregations)


In [13]:
print(edata)

             education  certified      grade
0  Less than Secondary        402  81.407960
1            Secondary       4565  82.987076
2           Bachelor's       7985  83.149781
3             Master's       4313  84.758405
4            Doctorate        422  86.914692


## Is there a relationship between grade and education level?

### Hypothesis Testing
Null Hypothesis: The mean 'grade' is the same for all education levels.

Alternative Hypothesis: The mean 'grade' differs for at least one education level.

In [12]:
import scipy.stats as stats
# one-way ANOVA (F-test) across the five education groups
f_stats, p_value = stats.f_oneway(lsd['grade'], sd['grade'], bd['grade'], md['grade'], dd['grade'])

print('F-statistic        : {:.4f}'.format(f_stats))
print('p value of F-test  : {:.8f}'.format(p_value))

F-statistic        : 22.9447
p value of F-test  : 0.00000000
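
To make the F-statistic less of a black box, here is a minimal sketch that recomputes it from the between-group and within-group sums of squares and compares it with `stats.f_oneway`. It also derives eta squared, a common effect-size measure for ANOVA. The groups are synthetic draws with assumed means and sizes, not the EdX data, so the printed numbers will not match the result above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic groups with assumed means and sizes (NOT the EdX data)
groups = [
    rng.normal(loc=mean, scale=10.0, size=size)
    for mean, size in [(81, 400), (83, 4500), (85, 420)]
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F = (mean square between) / (mean square within)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Eta squared: share of the total variance explained by group membership
eta_squared = ss_between / (ss_between + ss_within)

f_scipy, p_value = stats.f_oneway(*groups)

print('F (manual)  : {:.4f}'.format(f_manual))
print('F (scipy)   : {:.4f}'.format(f_scipy))
print('p-value     : {:.3e}'.format(p_value))
print('eta squared : {:.4f}'.format(eta_squared))
```

Printing the p-value in scientific notation (`{:.3e}`) keeps its magnitude visible when it is too small to show at eight decimal places. With large samples, a very small p-value can go together with a small eta squared, so it is worth reporting both.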


## Summary 
* Since the p-value from the one-way ANOVA is extremely small (it prints as 0.00000000 at eight decimal places), we reject the null hypothesis that the mean grade is the same across education levels. The difference is statistically significant and unlikely to be due to chance alone.

* Among certified participants, the mean grade differs across education levels.

* The sample means are ordered as Mean Grade (Doctorate) > Mean Grade (Master's) > Mean Grade (Bachelor's) > Mean Grade (Secondary) > Mean Grade (Less than Secondary). The ANOVA only shows that at least one group mean differs; it does not test this ordering pair by pair.
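
One possible follow-up, sketched here under assumptions and not part of the original analysis, is a set of pairwise comparisons to see which education levels differ from each other. The example uses Welch's t-test (`stats.ttest_ind` with `equal_var=False`) with a Bonferroni correction. The `samples` dictionary holds synthetic data with assumed means, not the EdX grades.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic grade samples with assumed means (NOT the EdX data)
samples = {
    "Secondary": rng.normal(83, 10, 500),
    "Bachelor's": rng.normal(83, 10, 500),
    "Doctorate": rng.normal(87, 10, 500),
}

pairs = list(combinations(samples, 2))
alpha = 0.05
results = {}

for first, second in pairs:
    # Welch's t-test: does not assume equal variances in the two groups
    t_stat, p = stats.ttest_ind(samples[first], samples[second], equal_var=False)
    # Bonferroni correction: multiply by the number of comparisons, cap at 1
    p_adjusted = min(p * len(pairs), 1.0)
    results[(first, second)] = p_adjusted
    verdict = 'differs' if p_adjusted < alpha else 'no evidence of a difference'
    print('{} vs {}: adjusted p = {:.3e} ({})'.format(first, second, p_adjusted, verdict))
```

In the notebook, the same loop could in principle run over the `grade` column of each education-level subset of `cdata`. Bonferroni is conservative; a dedicated post-hoc procedure such as Tukey's HSD is a common alternative.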