## Insight Data Science Project 2017: EduCare - Keeping Students Engaged!

In this project I worked with a company that advises math tutoring centers in California (CA). The data consisted of records for students enrolled at those tutoring centers.

First, I load the libraries and the data.

Example features available: Start and End Date, Gender, Skill Level (Grade), City, State, Zip, Open Days

In [19]:
# Load all the Libraries needed

import pandas as pd
import glob
import time
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from datetime import datetime, timedelta
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode(connected=True)
from plotly.graph_objs import *
from geopy.geocoders import Nominatim
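# Recent geopy versions require a user_agent; the value used here is a placeholder
geolocator = Nominatim(user_agent='educare-notebook')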

In [20]:
# Read in the student and math tutoring center information
# Then merge both files to one file that has all the information

stud = pd.read_csv('/home/harisk87/Dropbox/PythonProjects/EduCare/StudentCA.csv', parse_dates = ['enroll_date', 'end_date', 'date_of_birth'])
camps = pd.read_csv('/home/harisk87/Dropbox/PythonProjects/EduCare/MathCampAll.csv', parse_dates = ['open_date'])

students = pd.merge(stud, camps, on='camp_id')
students = students.rename(columns={'city_x': 'city_student', 'country': 'country_student', 'zip_code_x': 'zip_student', 'state_code_x': 'state_student', 'city_y':'city_camp', 'zip_code_y':'zip_camp', 'state_code_y':'state_camp', 'country_code':'country_camp'})

students.head()

Unnamed: 0,camp_id,student_id,gender,date_of_birth,enroll_date,end_date,grade_level,city_student,country_student,zip_student,state_student,city_camp,zip_camp,state_camp,country_camp,open_date,center_days
0,2824,61913,M,2002-01-22,2015-09-01,2015-11-30,12,Santa Rosa,USA,95409,CA,Windsor,95492,CA,US,2015-01-01,"[2, 5]"
1,2824,61922,F,2002-04-28,2015-01-01,2016-09-18,11,Santa Rosa,USA,95403,CA,Windsor,95492,CA,US,2015-01-01,"[2, 5]"
2,2824,61938,M,2000-12-13,2015-01-01,2015-08-30,12,Santa Rosa,USA,95404,CA,Windsor,95492,CA,US,2015-01-01,"[2, 5]"
3,2824,61958,M,2008-07-21,2016-01-05,2016-09-18,6,Windsor,USA,95492,CA,Windsor,95492,CA,US,2015-01-01,"[2, 5]"
4,2824,61981,F,2008-01-21,2016-03-04,2016-05-28,7,Santa Rosa,USA,95409,CA,Windsor,95492,CA,US,2015-01-01,"[2, 5]"


## Data Cleaning

- removing nulls and duplicates
- correcting records where the start and end dates are flipped
- excluding students who are still enrolled, keeping only those whose total time of enrollment is known

In [21]:
# Remove any nulls and duplicate data 
# includes 215 students who have not submitted worksheets
# no duplicates in data


print(str(len(students)) + ' Total Students Enrolled')
print(str(len(students.dropna(how='any'))) + ' remaining students after nulls dropped ~5%')
students = students.dropna(how='any')
print(len(students))
print(str(len(students[students.duplicated()]))+' absolute duplicates in the data')


4428 Total Students Enrolled
4204 remaining students after nulls dropped ~5%
4204
0 absolute duplicates in the data
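The null and duplicate checks above can be sketched on a toy table. The column values below are made up for illustration and are not taken from the student data; only `dropna`, `duplicated`, and `drop_duplicates` are the real pandas calls.

```python
import pandas as pd

# Toy data (hypothetical values): one row has a missing end date,
# and one row is an exact duplicate of another.
toy = pd.DataFrame({
    "student_id": [1, 2, 3, 3],
    "enroll_date": ["2015-01-01", "2015-02-01", "2015-03-01", "2015-03-01"],
    "end_date": ["2015-06-01", None, "2015-09-01", "2015-09-01"],
})

n_total = len(toy)
no_nulls = toy.dropna(how="any")             # drop rows with any missing value
n_dupes = int(no_nulls.duplicated().sum())   # exact duplicates among the remaining rows
clean = no_nulls.drop_duplicates()

print(n_total, len(no_nulls), n_dupes, len(clean))
```

Counting rows before and after each step, as the notebook does, makes it easy to see how much data each cleaning rule removes.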


In [22]:
# Compute time enrolled for each student

td = (students['end_date'] - students['enroll_date'])
students["time_enrolled"] = (td / np.timedelta64(1, 'D')).astype(int)

In [23]:
# Correct data where end and start date are flipped
# Boolean-mask .loc assignment avoids chained assignment (SettingWithCopyWarning)

flipped = students['time_enrolled'] < 0
enr_d = students.loc[flipped, 'enroll_date'].copy()
end_d = students.loc[flipped, 'end_date'].copy()

students.loc[flipped, 'enroll_date'] = end_d
students.loc[flipped, 'end_date'] = enr_d
students.loc[flipped, 'time_enrolled'] = 102  # duration is hard-coded for the flipped record(s)
students.loc[students['time_enrolled'] < 0, 'time_enrolled']



Series([], Name: time_enrolled, dtype: int64)
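As a minimal sketch of the date-swap step, the toy example below flips start and end dates with boolean-mask `.loc` assignment and then recomputes the duration from the corrected dates instead of hard-coding it. The dates are hypothetical, and recomputing the duration is an assumption about what is wanted, not what the notebook does.

```python
import pandas as pd

# Hypothetical records; the second row has its start and end dates flipped.
toy = pd.DataFrame({
    "enroll_date": pd.to_datetime(["2015-01-01", "2016-03-01"]),
    "end_date": pd.to_datetime(["2015-03-01", "2016-01-01"]),
})
toy["time_enrolled"] = (toy["end_date"] - toy["enroll_date"]).dt.days

flipped = toy["time_enrolled"] < 0

# Save both columns for the flipped rows first, then write them back swapped.
new_start = toy.loc[flipped, "end_date"].copy()
new_end = toy.loc[flipped, "enroll_date"].copy()
toy.loc[flipped, "enroll_date"] = new_start
toy.loc[flipped, "end_date"] = new_end

# Recompute the duration from the corrected dates.
toy["time_enrolled"] = (toy["end_date"] - toy["enroll_date"]).dt.days
print(toy)
```

Assigning through `toy.loc[mask, column]` writes to the original frame directly, whereas `toy.column[mask] = ...` may write to a temporary copy.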

In [24]:
# Eliminate students who are still enrolled 
# include only data within 1 week of last data pull so end dates are accurate

dy = 7
students = students[students['end_date'] <= students.end_date.max()-timedelta(days=dy)]
len(students)

trace1 = Histogram(x=students['time_enrolled'], marker=dict(color='rgb(0, 128, 128)'))
data = [trace1]
layout = Layout(
    xaxis=dict(autotick=False, tick0=0, dtick=1000, tickfont=dict(size=18),
               title='Time of Enrollment (Days)', titlefont=dict(size=24)),
    yaxis=dict(autotick=False, tick0=0, dtick=100, tickfont=dict(size=18),
               title='Number of students', titlefont=dict(size=24), showgrid=False),
    title='Distribution of Time of Enrollment for students (Days)',
    titlefont=dict(size=24))


fig1 = Figure(data=data, layout=layout)
iplot(fig1)
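The filter in the cell above treats the latest end date in the data as a proxy for the date of the last data pull, and keeps only students whose end date is at least 7 days earlier. A minimal sketch with made-up dates, using only the standard library:

```python
from datetime import date, timedelta

# Hypothetical end dates for four students.
end_dates = [date(2016, 9, 18), date(2016, 9, 30), date(2016, 10, 1), date(2016, 5, 28)]

last_pull = max(end_dates)                 # proxy for the date of the last data pull
cutoff = last_pull - timedelta(days=7)     # one-week safety margin

# Students with an end date after the cutoff may still be enrolled, so they are excluded.
churned = [d for d in end_dates if d <= cutoff]
print(cutoff, churned)
```

The one-week margin is the notebook's choice (`dy = 7`); a longer margin would be more conservative about who counts as having left.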



The histogram above shows the distribution of enrollment time across students.

I segment students into 2 classes, short-term and long-term, based on the median enrollment time of 230 days.

In [25]:
# Segmenting students into 2 classes based on the median:

ndays = students['time_enrolled'].median()
students['short_term'] = 0
students.loc[students['time_enrolled'] <= ndays, 'short_term'] = 1

# Another way of segmenting students into short and long term based on winter / summer / fall terms

ndays = 120
students['shorter_term'] = 0
students.loc[students['time_enrolled'] <= ndays, 'shorter_term'] = 1
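A minimal sketch of the two labeling rules, using made-up enrollment durations. Students at or below the threshold are labeled 1 (short-term), and everyone else is labeled 0 (long-term).

```python
from statistics import median

# Hypothetical enrollment durations in days.
durations = [24, 85, 150, 234, 241, 625]

# Rule 1: split at the median, which yields roughly balanced classes.
threshold = median(durations)
short_term = [1 if d <= threshold else 0 for d in durations]

# Rule 2: split at a fixed 120 days, as in the second labeling above.
shorter_term = [1 if d <= 120 else 0 for d in durations]

print(threshold, short_term, shorter_term)
```

Splitting at the median keeps the two classes close to equal in size, which is why a majority-class baseline for `short_term` lands near 50% accuracy.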


## Looking at the data and the simplest features available

In [26]:
# First Feature: GENDER - classify by gender (male or female) and deal with unclean data
# More data cleaning: there are entries in the gender column with the value {u'@xml:space': u'preserve'} - remove them


grouped = students.groupby('gender')
students = students.drop(grouped.get_group("{u'@xml:space': u'preserve'}").index) 
len(students)  

# Second Feature: AGE
# Compute age in 2016, age of enrollment and age when churned

ag_2016 = students.end_date.max() - students['date_of_birth']
ag_enroll = students['enroll_date'] - students['date_of_birth']
ag_end = students['end_date'] - students['date_of_birth']
students['age_2016'] = (((ag_2016 / np.timedelta64(1, 'D')).astype(int))/365).astype(int)
students['age_enroll'] = (((ag_enroll / np.timedelta64(1, 'D')).astype(int))/365).astype(int)
students['age_end'] = (((ag_end / np.timedelta64(1, 'D')).astype(int))/365).astype(int)

print(students.groupby('age_2016').size())
print(students.groupby('age_enroll').size())
print(students.groupby('age_end').size())


# Third Feature: GRADE LEVEL (Skill Level)

grlv = students['grade_level']
print(students.groupby('grade_level').size())


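# Fourth Feature: NUMBER OF DAYS CAMP IS OPEN
# center_days is a string such as "[2, 5]"; with single-digit day codes each day takes up 3 characters,
# so the string length divided by 3 gives the number of open days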

students['days_open'] = (students['center_days'].str.len()/3).astype(int)
print(students.groupby('days_open').size())


# Fifth Feature: CITY CAMP IS IN

print(students.groupby('city_camp').size())



# CITY STUDENT IS IN - leave this out for now because it has 189 distinct values (one dummy feature each)

print(len(students.groupby('city_student').size()))



age_2016
2       2
3       7
4      17
5      42
6      74
7     127
8     165
9     225
10    250
11    286
12    301
13    293
14    206
15    203
16    150
17    102
18     88
19     54
20     31
21     24
22     10
23      4
24      5
29      1
48      1
50      2
51      1
54      1
dtype: int64
age_enroll
2       5
3      48
4     114
5     211
6     273
7     341
8     333
9     328
10    310
11    253
12    199
13    113
14     69
15     41
16     17
17      8
18      2
19      1
26      1
42      2
46      3
dtype: int64
age_end
2       3
3      20
4      45
5     120
6     191
7     276
8     281
9     345
10    333
11    309
12    271
13    207
14    143
15     60
16     37
17     19
18      5
20      1
27      1
42      2
46      2
47      1
dtype: int64
grade_level
1      10
2      23
3      25
4      40
5     117
6     197
7     271
8     333
9     398
10    356
11    276
12    150
13    164
14    119
15     86
16     62
17     22
18     11
19      4
20      3
21      5
dtype: int64
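For the `days_open` feature computed in the cell above, an alternative to the string-length rule is to parse each `center_days` string into a list and count its entries. This is a sketch under the assumption that every entry looks like `"[2, 5]"`; only that first sample string comes from the data shown earlier, and the others are hypothetical.

```python
import ast

# "[2, 5]" appears in the data; the other strings are made up for illustration.
center_days = ["[2, 5]", "[1, 3, 5]", "[6]"]

# Parse each string into a Python list and count the days.
days_open = [len(ast.literal_eval(s)) for s in center_days]

# The notebook's rule: string length divided by 3 (valid for single-digit day codes).
length_rule = [len(s) // 3 for s in center_days]

print(days_open, length_rule)
```

Both approaches agree on these inputs; parsing is a little more robust if the string formatting ever changes.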

## Implementing Logistic Regression with 5 simple features

In [29]:
# Implement Logistic Regression with simplest features

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn import linear_model, datasets, metrics
from sklearn.model_selection import train_test_split  # replaces the deprecated sklearn.cross_validation

logreg = LogisticRegression(penalty = "l2", C=0.1, class_weight = 'balanced', tol = 0.0001)
feature_cols = ['age_enroll','days_open', 'grade_level']

# Numeric features plus one-hot encoded gender and camp city
xprev = students[feature_cols]
y = students['short_term']
x1 = pd.concat([xprev, pd.get_dummies(students['gender'])],axis=1)
x = pd.concat([x1, pd.get_dummies(students['city_camp'])],axis=1)


# Train-Test Split 80-20
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
L = logreg.fit(X_train,y_train)
y_pred = logreg.predict(X_test)

print('Accuracy of model on test set is ' + str(metrics.accuracy_score(y_test,y_pred)*100) +' %')
print('F1 scores are long-term/short-term' + str(f1_score(y_test, y_pred, average=None)))
print('Precision scores are lt-st' + str(precision_score(y_test, y_pred, average=None)))
print('Recall scores are lt-st' + str(recall_score(y_test, y_pred, average=None)))
print(metrics.confusion_matrix(y_test, y_pred))
print("Predicting everything as falling as long term will give accuracy of "+ str((len([i for i in y_test if i == 0])/float(len(y_test)))*100))

fimplr = np.abs(logreg.coef_)
vv = list(x)


data = [Bar(x=vv,y=fimplr.ravel())]

layout = Layout(xaxis=dict(title = 'Features'),yaxis=dict(title='Coefficients'),title='Feature Coefficients')

fig1 = Figure(data=data, layout=layout)
iplot(fig1)


Accuracy of model on test set is 77.9439252336 %
F1 scores are long-term/short-term[ 0.78066914  0.77819549]
Precision scores are lt-st[ 0.77777778  0.78113208]
Recall scores are lt-st[ 0.78358209  0.7752809 ]
[[210  58]
 [ 60 207]]
Predicting everything as falling as long term will give accuracy of 50.0934579439
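The reported metrics can be re-derived by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with class 0 = long-term and class 1 = short-term.

```python
# Confusion matrix copied from the output above.
cm = [[210, 58],
      [60, 207]]

total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision for class k: correct predictions of k over everything predicted as k (column sum).
precision = [cm[k][k] / (cm[0][k] + cm[1][k]) for k in range(2)]

# Recall for class k: correct predictions of k over everything truly k (row sum).
recall = [cm[k][k] / sum(cm[k]) for k in range(2)]

# F1: harmonic mean of precision and recall.
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# Baseline: always predict long-term (class 0).
baseline = sum(cm[0]) / total

print(accuracy, precision, recall, f1, baseline)
```

The results match the printed values: accuracy of about 0.779 against a majority-class baseline of about 0.501.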


## Feature Engineering: Distance Data

Instead of using cities as categorical variables, I decided to use a single continuous variable: distance.

I computed the distance from each student's home to their center, because the length of the commute might affect churn.

The code is below, but it is much faster to skip this cell and load the saved pickle file in the next cell.

In [None]:
# Feature Engineering: Distance


# Cleaning the zip code data: keep only zip codes that start with "9", as California zip codes do

students.groupby('zip_student').size()
students = students[students['zip_student'].str.startswith("9")].copy()
students['dist'] = ""


# Computing distance
# 1. Convert from zip code to latitude and longitude
# 2. Can only do around 100 - 200 of these at a time because the geocoding service (Nominatim) limits the number of calls
# 3. Compute great circle distance
# 4. Save as a pickle file because geocoding is slow and the result can be reloaded in the next cell

from geopy.distance import great_circle
geolocator = Nominatim(user_agent='educare-notebook')  # placeholder user_agent; recent geopy versions require one
dist_col = students.columns.get_loc('dist')
for a in range(len(students)):  # start at 0 so the first row is not skipped
    if students['zip_student'].iloc[a] == students['zip_camp'].iloc[a]:
        students.iloc[a, dist_col] = 0
    else:
        locationa = geolocator.geocode(students['zip_camp'].iloc[a], timeout=None)
        locationb = geolocator.geocode(students['zip_student'].iloc[a], timeout=None)
        if locationa and locationb:
            zipa = (locationa.latitude, locationa.longitude)
            zipb = (locationb.latitude, locationb.longitude)
            students.iloc[a, dist_col] = great_circle(zipa, zipb).miles

students.to_pickle('students')
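For reference, the great-circle distance in step 3 can be sketched with the haversine formula and no external library. This is an illustration of the idea rather than geopy's implementation. The Earth radius constant and the city coordinates below are approximate values assumed for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Rough coordinates for Santa Rosa and Windsor, CA (approximate, for illustration only).
santa_rosa = (38.44, -122.71)
windsor = (38.55, -122.82)
d = haversine_miles(santa_rosa[0], santa_rosa[1], windsor[0], windsor[1])
print(round(d, 1), "miles")
```

Because many students share the same pair of zip codes, geocoding each unique zip code once and reusing the coordinates would also cut down the number of geocoding calls.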


In [38]:
# Cleaning the distance data
# 1. remove any null values
# 2. sometimes distance computation based on latitude and longitude is incorrect: 
# so remove anomalous values for now and correct later 

# 1. remove any null values

students = pd.read_pickle('students')
 
studentsshort = students[students['dist'] != ""]
print(len(studentsshort))


# look for rows with distance = 0  -to check - all look fine
#students[students['dist'] == 0][['zip_student','zip_camp', 'dist']]


# 2. sometimes distance computation is incorrect - look for rows with weird distances 


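# keep only distances under 100 miles; .copy() avoids a SettingWithCopyWarning when columns are added below
studentsshortnew = studentsshort[studentsshort['dist']<100].copy()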
print(len(studentsshortnew))
print(studentsshortnew['dist'].max())


# Normalize distance by the maximum

studentsshortnew['dist_norm'] = studentsshortnew['dist'] / studentsshortnew['dist'].max()

trace1 = Histogram(x=studentsshortnew['dist'])
data = [trace1]
layout = Layout(xaxis=dict(title = 'Distance (miles)'),yaxis=dict(title='Number of students'),title='Distribution of Distance between camp and student (miles)')
fig1 = Figure(data=data, layout=layout)
iplot(fig1)


2611
2242
50.3791061577







## Feature Engineering: Income Data

I also included census data on household income for each student, matched on the zip code where they live.



In [39]:
#load census data

income1 = pd.read_csv('/home/harisk87/Dropbox/PythonProjects/Insight-Project/IncomeData.csv')
income = income1[['GEO.id2', 'HC02_EST_VC02', 'HC02_EST_VC04']]
income = income.rename(columns={'GEO.id2':'zip_student'})


# merge median income data
studentsshortnew = pd.merge(studentsshortnew, income, on='zip_student')
# rename columns
studentsshortnew = studentsshortnew.rename(columns={'HC02_EST_VC02': 'household_income', 'HC02_EST_VC04': 'median_income'})

# convert income data from string to numeric values

studentsshortnew['median_income'] = pd.to_numeric(studentsshortnew['median_income'], errors='coerce')
studentsshortnew['household_income'] = pd.to_numeric(studentsshortnew['household_income'], errors='coerce')


# cleaning - drop null values

studentsshortnew = studentsshortnew.dropna(how='any')
print(len(studentsshortnew))


# normalize income data

studentsshortnew['median_income_norm'] = studentsshortnew['median_income']/studentsshortnew['median_income'].max()
studentsshortnew['household_income_norm'] = studentsshortnew['household_income']/studentsshortnew['household_income'].max()


2233


## Feature Engineering: Skill Level Increase Data

In [40]:
# Include and Merge file with Start and End Skill Level of each student

scores = pd.read_csv('/home/harisk87/Dropbox/PythonProjects/EduCare/ScoresCA.csv')
scores1 = scores.groupby(['student_id'], sort=False)['grade_level'].max()
scores2 = scores.groupby(['student_id'], sort=False)['grade_level'].min()
scores3 = pd.DataFrame({'student_id':scores1.keys(), 'grade_level_end':scores1.values})
scores4 = pd.DataFrame({'student_id':scores2.keys(), 'grade_level_begin':scores2.values})
studentsn = pd.merge(studentsshortnew, scores4, on='student_id')
studentsn = pd.merge(studentsn, scores3, on='student_id')

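# Compute the increase in skill level per day enrolled
# skill_increase uses grade_level from the student file as the final level;
# skill_increase_correct uses the highest level recorded in the scores file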

studentsn['skill_increase'] = ((studentsn['grade_level']-studentsn['grade_level_begin'])/studentsn['time_enrolled'])
studentsn['skill_increase_correct'] = ((studentsn['grade_level_end']-studentsn['grade_level_begin'])/studentsn['time_enrolled'])
len(studentsn)

# Data Cleaning 

studentsn = studentsn.dropna(how='any')
studentsn = studentsn.replace([np.inf, -np.inf], np.nan).dropna(subset=["skill_increase"], how="all")
studentsn = studentsn.replace([np.inf, -np.inf], np.nan).dropna(subset=["skill_increase_correct"], how="all")


#Normalizing the feature
studentsn['skill_increase_norm'] = studentsn['skill_increase']/studentsn['skill_increase'].max()
studentsn['skill_increase_norm_correct'] = studentsn['skill_increase_correct']/studentsn['skill_increase_correct'].max()

print(len(studentsn))
studentsn.head()

2229


Unnamed: 0,camp_id,student_id,gender,date_of_birth,enroll_date,end_date,grade_level,city_student,country_student,zip_student,...,household_income,median_income,median_income_norm,household_income_norm,grade_level_begin,grade_level_end,skill_increase,skill_increase_correct,skill_increase_norm,skill_increase_norm_correct
0,2824,61913,M,2002-01-22,2015-09-01,2015-11-30,12,Santa Rosa,USA,95409,...,65453.0,65505.0,0.30956,0.298022,11,12,0.011111,0.011111,0.008333,0.008333
1,2824,61981,F,2008-01-21,2016-03-04,2016-05-28,7,Santa Rosa,USA,95409,...,65453.0,65505.0,0.30956,0.298022,5,7,0.023529,0.023529,0.017647,0.017647
2,2824,61991,F,2004-09-25,2016-03-04,2016-05-26,7,Santa Rosa,USA,95409,...,65453.0,65505.0,0.30956,0.298022,6,7,0.012048,0.012048,0.009036,0.009036
3,2824,62145,F,2005-06-11,2015-01-01,2015-08-23,10,Santa Rosa,USA,95409,...,65453.0,65505.0,0.30956,0.298022,2,9,0.034188,0.029915,0.025641,0.022436
4,2824,62186,F,2007-09-28,2015-05-01,2015-05-24,6,Santa Rosa,USA,95409,...,65453.0,65505.0,0.30956,0.298022,5,6,0.043478,0.043478,0.032609,0.032609
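As a worked example of the rate computation, the sketch below reproduces `skill_increase_correct` for two rows of the table above (students 61913 and 62145) from their begin and end levels and enrollment dates. The helper name `skill_rate` and its zero-day guard are assumptions made for this illustration; the notebook instead divides directly and drops infinite values afterwards.

```python
from datetime import date


def skill_rate(level_begin, level_end, enroll, end):
    """Skill levels gained per day enrolled; None when the enrollment lasted 0 days."""
    days = (end - enroll).days
    if days == 0:
        return None  # guard against division by zero
    return (level_end - level_begin) / days


# Student 61913: levels 11 -> 12 between 2015-09-01 and 2015-11-30 (90 days).
r0 = skill_rate(11, 12, date(2015, 9, 1), date(2015, 11, 30))

# Student 62145: levels 2 -> 9 between 2015-01-01 and 2015-08-23 (234 days).
r3 = skill_rate(2, 9, date(2015, 1, 1), date(2015, 8, 23))

print(round(r0, 6), round(r3, 6))
```

Because the rate divides by `time_enrolled`, it is mechanically tied to the enrollment duration that defines the short-term and long-term classes, which is worth keeping in mind when interpreting it as a predictor.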


### Pre-processing and normalizing data

In [41]:
# Normalizing other features: Age, End and Beginning Grade Level, Gender

studentsn['age_enroll_norm'] = studentsn['age_enroll']/studentsn['age_enroll'].max()
studentsn['grade_level_begin_norm'] = studentsn['grade_level_begin']/studentsn['grade_level_begin'].max()
studentsn['grade_level_end_norm'] = studentsn['grade_level_end']/studentsn['grade_level_end'].max()
studentsn['gender_norm'] = 0
studentsn.loc[studentsn['gender'] == 'F', 'gender_norm'] = 1  # binary encoding: F = 1, otherwise 0
studentsn.head()

Unnamed: 0,camp_id,student_id,gender,date_of_birth,enroll_date,end_date,grade_level,city_student,country_student,zip_student,...,grade_level_begin,grade_level_end,skill_increase,skill_increase_correct,skill_increase_norm,skill_increase_norm_correct,age_enroll_norm,grade_level_begin_norm,grade_level_end_norm,gender_norm
0,2824,61913,M,2002-01-22,2015-09-01,2015-11-30,12,Santa Rosa,USA,95409,...,11,12,0.011111,0.011111,0.008333,0.008333,0.282609,0.6875,0.571429,0
1,2824,61981,F,2008-01-21,2016-03-04,2016-05-28,7,Santa Rosa,USA,95409,...,5,7,0.023529,0.023529,0.017647,0.017647,0.173913,0.3125,0.333333,1
2,2824,61991,F,2004-09-25,2016-03-04,2016-05-26,7,Santa Rosa,USA,95409,...,6,7,0.012048,0.012048,0.009036,0.009036,0.23913,0.375,0.333333,1
3,2824,62145,F,2005-06-11,2015-01-01,2015-08-23,10,Santa Rosa,USA,95409,...,2,9,0.034188,0.029915,0.025641,0.022436,0.195652,0.125,0.428571,1
4,2824,62186,F,2007-09-28,2015-05-01,2015-05-24,6,Santa Rosa,USA,95409,...,5,6,0.043478,0.043478,0.032609,0.032609,0.152174,0.3125,0.285714,1


## Machine Learning: Implementing Logistic Regression to classify students as short or long-term
 - Tried various combinations of features
 - Ran 5-fold cross-validation on 80% of the data
 - Also performed a grid search over the regularization parameter C
 - Kept a hold-out test set of the remaining 20%
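The workflow in the list above can be sketched end to end with scikit-learn's `model_selection` module. The data here is synthetic (generated with `make_classification`) and stands in for the student features; the grid of `C` values and the fixed `random_state` are choices made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the student feature matrix (not the real data).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)

# Hold out 20% as a test set that the grid search never sees.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# 5-fold cross-validated grid search over the regularization strength C.
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_dev, y_dev)

best_C = grid.best_params_["C"]
holdout_accuracy = grid.score(X_test, y_test)  # accuracy of the refit best model
print(best_C, round(holdout_accuracy, 3))
```

Keeping the hold-out set outside the grid search means the final accuracy is not influenced by the choice of `C`.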


In [42]:
# Logistic Regression

# Trying various feature combinations to see which predicts churn well


#feature_cols = ['age_enroll','days_open', 'dist_norm', 'median_income_norm',  'grade_level_end','grade_level_begin', 'skill_increase_norm_correct']

#feature_cols = ['age_enroll_norm', 'grade_level_begin_norm', 'days_open', 'gender_norm', 'skill_increase_norm_correct']
#vv = ['Age Enrolled', 'Rate of Increase of Skill Level', 'Begin Skill Level']


#feature_cols = ['dist_norm','median_income_norm', 'age_enroll','days_open',  'grade_level_end','grade_level_begin', 'gender_norm']
#vv = ['Distance', 'Median Income' , 'Age Enrolled', 'Days Open', 'Ending Skill Level','Begin Skill Level', 'Gender']


#feature_cols = ['age_enroll','days_open',  'grade_level_end','grade_level_begin', 'gender_norm']
#vv = ['Age Enrolled', 'Days Open', 'End Skill Level','Begin Skill Level', 'Gender']


feature_cols = ['dist_norm','median_income_norm' ,'age_enroll_norm', 'grade_level_begin_norm',  'grade_level_end_norm']
vv = ['Distance', 'Median Income', 'Age Enrolled', 'Begin Skill Level', 'End Skill Level']

#feature_cols = ['age_enroll_norm', 'skill_increase_norm_correct', 'grade_level_begin_norm',  'grade_level_end_norm']
#vv = ['Age Enrolled', 'Growth', 'Begin Skill Level', 'End Skill Level']

xprev = studentsn[feature_cols]#.reset_index()[feature_cols]
y = studentsn['short_term']#.reset_index()['short_term']

#xprev = studentsn[studentsn['time_enrolled'] < 1000][feature_cols] # without outliers - same importance and accuracy
#y = studentsn[studentsn['time_enrolled'] < 1000]['short_term']

x = xprev


# Train-Test Split 80-20 

X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
L = logreg.fit(X_train,y_train)
y_pred = logreg.predict(X_test)

print('Accuracy of model on test set is ' + str(metrics.accuracy_score(y_test,y_pred)*100) +' %')
print('F1 scores are long-term/short-term' + str(f1_score(y_test, y_pred, average=None)))
print('Precision scores are lt-st' + str(precision_score(y_test, y_pred, average=None)))
print('Recall scores are lt-st' + str(recall_score(y_test, y_pred, average=None)))
print(metrics.confusion_matrix(y_test, y_pred))
print("Predicting everything as falling as long term will give accuracy of "+ str((len([i for i in y_test if i == 0])/float(len(y_test)))*100))

fimplr = logreg.coef_
#vv = list(x)

# Implementing grid search and 5-fold cross-validation

from sklearn.model_selection import GridSearchCV  # replaces the deprecated sklearn.grid_search
from sklearn.metrics import classification_report

def crossValidate(model,param_grid,X,y,scores):

    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

    for score in scores:
        print("# Tuning hyper-parameters for %s" % score)
        print()

        grid = GridSearchCV(model, param_grid, cv=5, scoring=score)
        grid.fit(X_train,y_train)

        print("Best parameters set found on development set:")
        print()
        print(grid.best_params_)
        print()
        print("Grid scores on development set:")
        print()
        # cv_results_ replaces grid_scores_; separate loop names avoid shadowing `scores`
        results = grid.cv_results_
        for mean_score, std_score, params in zip(results['mean_test_score'], results['std_test_score'], results['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean_score, std_score * 2, params))
        print()

        print("Detailed classification report:")
        print()
        print("The model is trained on the training set.")
        print("The scores are computed on the test set.")
        print()
        y_true, y_pred = y_test, grid.predict(X_test)
        print(classification_report(y_true, y_pred))
        print()

logreg = LogisticRegression(penalty = "l2", class_weight = 'balanced', tol = 0.0001)
Cvals = [0.0000000000001,0.1,1,10,100,1000,10000,1000000,1000000,10000000,100000000,100000000000000000000000000000]
param_grid = dict(C=Cvals)
scores = ['accuracy']
crossValidate(logreg,param_grid,x,y,scores)


# Plotting feature coefficients - sorted and normalized

yv = abs(fimplr.ravel())
yvv = yv/yv.max()

inds = yvv.argsort()[::-1]
print(inds)
print(vv)
vvv = np.array(vv)
print(vvv)
sortvv = vvv[inds]
print(sortvv)
data = [Bar(x=sortvv,y=np.sort(yvv)[::-1], marker=dict(color='rgb(0, 128, 128)'))]

layout = Layout(xaxis=dict(tickfont=dict(size=20)),
                 yaxis=dict(title='Coefficients (Normalized)', showgrid=False, tickfont=dict(size=20), titlefont=dict(size=24)),
                title='Feature Coefficients', titlefont=dict(size=24))


fig1 = Figure(data=data, layout=layout)
iplot(fig1)


# Plotting feature coefficients 


data = [Bar(x=vv,y=abs(fimplr.ravel()))]

layout = Layout(xaxis=dict(title = 'Features'),
                 yaxis=dict(title='Coefficients'),
                title=' Feature Coefficients')


fig1 = Figure(data=data, layout=layout)
iplot(fig1)


# Plotting Log feature coefficients 


yv = abs(fimplr.ravel())

data = [Bar(x=vv,y=np.log(yv))]
layout = Layout(xaxis=dict(title = 'Features'),
                 yaxis=dict(title='Coefficients'),
                title='Log Feature Coefficients')

fig1 = Figure(data=data, layout=layout)
iplot(fig1)

Accuracy of model on test set is 76.4573991031 %
F1 scores are long-term/short-term[ 0.73684211  0.78701826]
Precision scores are lt-st[ 0.80327869  0.73764259]
Recall scores are lt-st[ 0.68055556  0.84347826]
[[147  69]
 [ 36 194]]
Predicting everything as falling as long term will give accuracy of 48.4304932735
# Tuning hyper-parameters for accuracy
()






Best parameters set found on development set:
()
{'C': 100}
()
Grid scores on development set:
()
0.492 (+/-0.001) for {'C': 1e-13}
0.758 (+/-0.049) for {'C': 0.1}
0.789 (+/-0.036) for {'C': 1}
0.798 (+/-0.012) for {'C': 10}
0.800 (+/-0.020) for {'C': 100}
0.798 (+/-0.021) for {'C': 1000}
0.798 (+/-0.021) for {'C': 10000}
0.798 (+/-0.021) for {'C': 1000000}
0.798 (+/-0.021) for {'C': 1000000}
0.798 (+/-0.021) for {'C': 10000000}
0.798 (+/-0.021) for {'C': 100000000}
0.798 (+/-0.021) for {'C': 100000000000000000000000000000L}
()
Detailed classification report:
()
The model is trained on the training set.
The scores are computed on the test set.
()
             precision    recall  f1-score   support

          0       0.88      0.80      0.84       226
          1       0.81      0.89      0.85       220

avg / total       0.85      0.85      0.85       446

()
[4 3 2 0 1]
['Distance', 'Median Income', 'Age Enrolled', 'Begin Skill Level', 'End Skill Level']
['Distance' 'Median Income' '

## Statistical tests
- Testing if short-term and long-term students are significantly different in terms of their
    - rate of skill increase
    - beginning skill level
    - distance commuted
    - median income
    - age enrolled
    - ending skill level
    - number of days the center they attend is open
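- Using the one-sided p value (the two-sided p value divided by 2), the t value, and the effect size (Cohen's d). Note that the code below also divides the t statistic by 2, so the printed t values are half of the Welch t statistic.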
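A minimal standard-library sketch of the two statistics used in this section: Welch's t statistic (unequal variances, matching `equal_var=False`) and Cohen's d with the same pooling as the `cohensd` function defined in the next cell. The sample values are made up so that the results are easy to check by hand. For a one-sided test only the p value is halved; the t statistic itself stays as computed.

```python
import math
from statistics import mean, pvariance, variance


def cohens_d(a, b):
    """Effect size: mean difference over the root of the averaged population variances."""
    pooled = math.sqrt((pvariance(a) + pvariance(b)) / 2)  # ddof=0, like np.std's default
    return (mean(a) - mean(b)) / pooled


def welch_t(a, b):
    """Welch's t statistic, using sample variances (ddof=1) and no equal-variance assumption."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical samples for two groups.
group_a = [2, 4, 6, 8]
group_b = [1, 2, 3, 4]

d = cohens_d(group_a, group_b)
t = welch_t(group_a, group_b)
print(round(d, 4), round(t, 4))
```

With large samples even a small effect produces a tiny p value, so the effect size is the more informative number for judging how different the two groups really are.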

In [43]:
import scipy as sp
import scipy.stats  # make sure the stats submodule is loaded so that sp.stats is available

# define function for cohen's d:

def cohensd(val1,val2):
    return ( np.mean(val1) - np.mean(val2) ) / np.sqrt( (np.std(val1) ** 2 + np.std(val2) ** 2) / 2 )


# Test if long-term and short term students are significantly different:

stsk = studentsn[studentsn['short_term']==1]['skill_increase_correct']
ltsk = studentsn[studentsn['short_term']==0]['skill_increase_correct']
print("The p value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).pvalue/float(2)) + " and the t value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).statistic/float(2)))
print ("The effect size is " + str(cohensd(stsk, ltsk)))
print("SHort terms increase skills on average at rate " + str(np.median(stsk)))
print("LOng terms increase skills on avergae at rate " + str(np.median(ltsk)))


stsk = studentsn[studentsn['short_term']==1]['grade_level_begin']
ltsk = studentsn[studentsn['short_term']==0]['grade_level_begin']
print("\n The p value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).pvalue/float(2)) + " and the t value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).statistic/float(2)))
print ("The effect size is " + str(cohensd(stsk, ltsk)))
print("SHort terms begin at grade level " + str(np.mean(stsk)))
print("LOng terms begin at grade level " + str(np.mean(ltsk)))


stsk = studentsn[studentsn['short_term']==1]['dist']
ltsk = studentsn[studentsn['short_term']==0]['dist']
print("\n The p value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).pvalue/float(2)) + " and the t value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).statistic/float(2)))
print ("The effect size is " + str(cohensd(stsk, ltsk)))
print("SHort terms commute " + str(np.mean(stsk)))
print("LOng terms commute " + str(np.mean(ltsk)))


stsk = studentsn[studentsn['short_term']==1]['median_income']
ltsk = studentsn[studentsn['short_term']==0]['median_income']
print("\nThe p value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).pvalue/float(2)) + " and the t value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).statistic/float(2)))
print ("The effect size is " + str(cohensd(stsk, ltsk)))
print("SHort terms income " + str(np.mean(stsk)))
print("LOng terms income " + str(np.mean(ltsk)))


stsk = studentsn[studentsn['short_term']==1]['age_enroll']
ltsk = studentsn[studentsn['short_term']==0]['age_enroll']
print("\nThe p value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).pvalue/float(2)) + " and the t value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).statistic/float(2)))
print ("The effect size is " + str(cohensd(stsk, ltsk)))
print("SHort terms enroll age " + str(np.median(stsk)))
print("LOng terms enroll age " + str(np.median(ltsk)))


stsk = studentsn[studentsn['short_term']==1]['grade_level_begin']
ltsk = studentsn[studentsn['short_term']==0]['grade_level_begin']
print("\nThe p value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).pvalue/float(2)) + " and the t value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).statistic/float(2)))
print ("The effect size is " + str(cohensd(stsk, ltsk)))
print("SHort terms entry level " + str(np.median(stsk)))
print("LOng terms entry level " + str(np.median(ltsk)))


stsk = studentsn[studentsn['short_term']==1]['grade_level_end']
ltsk = studentsn[studentsn['short_term']==0]['grade_level_end']
print("\nThe p value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).pvalue/float(2)) + " and the t value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).statistic/float(2)))
print ("The effect size is " + str(cohensd(stsk, ltsk)))
print("SHort terms exit level " + str(np.median(stsk)))
print("LOng terms exit level " + str(np.median(ltsk)))


stsk = studentsn[studentsn['short_term']==1]['days_open']
ltsk = studentsn[studentsn['short_term']==0]['days_open']
print("\nThe p value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).pvalue/float(2)) + " and the t value is " + str(sp.stats.ttest_ind(stsk, ltsk,axis=0, equal_var=False).statistic/float(2)))
print ("The effect size is " + str(cohensd(stsk, ltsk)))
print("SHort terms days open " + str(np.mean(stsk)))
print("LOng terms days open " + str(np.mean(ltsk)))

The p value is 4.05276648762e-28 and the t value is 5.61313304513
The effect size is 0.473591177663
SHort terms increase skills on average at rate 0.0186335403727
LOng terms increase skills on avergae at rate 0.00796020688617

 The p value is 0.0127760178808 and the t value is 1.11723781798
The effect size is 0.0947476616482
SHort terms begin at grade level 6.14133333333
LOng terms begin at grade level 5.90760869565

 The p value is 0.310625146601 and the t value is 0.247074074105
The effect size is 0.0209420240341
SHort terms commute 3.10984037795
LOng terms commute 3.03465083303

The p value is 0.422897904187 and the t value is 0.0972540495816
The effect size is 0.00824202570695
SHort terms income 88695.1208889
LOng terms income 88467.8097826

The p value is 7.01374482192e-11 and the t value is 3.22366082497
The effect size is 0.272972052696
SHort terms enroll age 9.0
LOng terms enroll age 8.0

The p value is 0.0127760178808 and the t value is 1.11723781798
The effect size is 0.09474

## Insights from Data Visualization and Analysis 

I visualized the data to gain insight into the features that predict churn.

In [45]:
# Visualizing growth of students and time enrolled

stsk = studentsn[studentsn['short_term']==1]['skill_increase_correct']
stsk1 = studentsn[studentsn['short_term']==1]['time_enrolled']

ltsk = studentsn[studentsn['short_term']==0]['skill_increase_correct']
ltsk1 = studentsn[studentsn['short_term']==0]['time_enrolled']

trace1 = Scatter(x=stsk,y=stsk1, mode='markers', name = 'Short-Term Enrollees', marker=dict(color='rgb(100, 100, 200)', size=5))
trace2 = Scatter(x=ltsk,y=ltsk1, mode='markers', name = 'Long-Term Enrollees', marker=dict(color='rgb(250, 130, 170)',size=5))


data = [trace1,trace2]
layout = Layout(xaxis=dict(title = 'Growth (per Day)',type='log', autorange=True,tickfont=dict(size=20), showgrid=False,  titlefont=dict(size=24)),
                 yaxis=dict(title='Time Enrolled (Days)', showgrid=False, tickfont=dict(size=20), titlefont=dict(size=24),dtick=1000),
                title='Growth of Student vs Time Enrolled',  titlefont=dict(size=24), showlegend=True)
fig1 = Figure(data=data, layout=layout)
iplot(fig1)

The above plot shows that students who progress more slowly typically stay enrolled for longer periods, while students who progress faster typically stay for shorter periods.
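One way to put a number on the trend in this scatter plot is a rank correlation between growth rate and time enrolled. The sketch below runs on made-up data: the `growth` and `time_enrolled` arrays are illustrative stand-ins for `studentsn['skill_increase_correct']` and `studentsn['time_enrolled']`, not the project's data, so only the method carries over.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data: NOT the EduCare data set.
rng = np.random.RandomState(0)
time_enrolled = rng.randint(30, 2000, size=200).astype(float)  # days

# Assumption for illustration: growth per day falls as enrollment lengthens,
# with multiplicative (log-normal) noise on top.
growth = (5.0 / time_enrolled) * rng.lognormal(mean=0.0, sigma=0.3, size=200)

# Spearman's rho works on ranks, so it suits a monotonic but non-linear
# relationship such as the one seen on a log-scaled axis.
rho, p_value = stats.spearmanr(growth, time_enrolled)
print("Spearman rho = %.3f, p = %.3g" % (rho, p_value))
```

A rho close to -1 would indicate that faster growth goes with shorter enrollment almost monotonically; a value near 0 would indicate no monotonic trend.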

In [47]:
# Another visualization of the distribution of growth for short and long term students

stsk = studentsn[studentsn['short_term']==1]['skill_increase_correct']
ltsk = studentsn[studentsn['short_term']==0]['skill_increase_correct']

trace1 = Histogram(x=stsk,opacity=0.75, name='Short Term Enrollees')
trace2 = Histogram(x=ltsk,opacity=0.75, name = 'Long Term Enrollees')
data = [trace1, trace2]
layout = Layout(barmode='overlay',xaxis=dict(title = 'Rate of Skill Increase (per day)'),yaxis=dict(title='Number of students'),
                title='Comparing Rate of Skill Increase for Long and Short Term Enrollees')
fig = Figure(data=data, layout=layout)
iplot(fig)

Long-term students typically progress much more slowly than short-term students.

In [48]:
# Visualizing the differences in ending skill level for short and long-term students

stsk = studentsn[studentsn['short_term']==1]['grade_level_end']
ltsk = studentsn[studentsn['short_term']==0]['grade_level_end']

trace1 = Histogram(x=stsk,opacity=0.75, name='Short-Term Enrollees', marker=dict(color='rgb(125, 70, 200)'))
trace2 = Histogram(x=ltsk,opacity=0.75, name = 'Long Term Enrollees',  marker=dict(color='rgb(250, 130, 170)'))
data = [trace1, trace2]

layout = Layout(barmode='overlay',xaxis=dict(title = 'End Skill Level', tickfont=dict(size=20), titlefont=dict(size=24)),yaxis=dict(title='Number of Students', showgrid=False,tickfont=dict(size=20),dtick=40, titlefont=dict(size=24)),
                title='Distribution of End Skill Level for Short and Long Term Students', titlefont=dict(size=24), showlegend=False)
fig = Figure(data=data, layout=layout)
iplot(fig)

Short-term students typically churn around an average skill level of 7, and long-term students around an average skill level of 11.

In [49]:
stsk = studentsn[studentsn['short_term']==1]['age_enroll']
ltsk = studentsn[studentsn['short_term']==0]['age_enroll']

trace1 = Histogram(x=stsk,opacity=0.75, name='Short Term Enrollees')
trace2 = Histogram(x=ltsk,opacity=0.75, name = 'Long Term Enrollees')
data = [trace1, trace2]
layout = Layout(barmode='overlay',xaxis=dict(title = 'Age of Student (years)'),yaxis=dict(title='Number of students'),
                title='Comparing Age of Long and Short Term Enrollees')
fig = Figure(data=data, layout=layout)
iplot(fig)

Short and long-term students differ only a little in age (averages computed in statistical tests above)

In [50]:
stsk = studentsn[studentsn['short_term']==1]['grade_level_end']
stsk1 = studentsn[studentsn['short_term']==1]['time_enrolled']

ltsk = studentsn[studentsn['short_term']==0]['grade_level_end']
ltsk1 = studentsn[studentsn['short_term']==0]['time_enrolled']

trace1 = Scatter(x=stsk,y=stsk1, mode='markers', name = 'short-term')
trace2 = Scatter(x=ltsk,y=ltsk1, mode='markers', name = 'long-term')

data = [trace1,trace2]
layout = Layout(xaxis=dict(title = 'Ending grade level'),
                 yaxis=dict(title='Time Enrolled (days)'),
                title='Ending grade level vs Time Enrolled')

fig1 = Figure(data=data, layout=layout)
iplot(fig1)

The plot shows the distribution of time enrolled at each end skill level. A higher end skill level does not necessarily mean a longer enrollment: the time enrolled varies widely at every level.

In [52]:
# Correlation coefficient between time enrolled and (normalized) age at enrollment
np.corrcoef(studentsn['time_enrolled'],studentsn['age_enroll_norm'])

array([[ 1.        , -0.16524028],
       [-0.16524028,  1.        ]])
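The off-diagonal entry of this matrix is the Pearson correlation coefficient, r = cov(x, y) / (σx σy); a value of about -0.17 is a weak negative correlation. The sketch below uses toy numbers (not project data) to show that the formula and `np.corrcoef` agree.

```python
import numpy as np

# Toy values for illustration only.
x = np.array([100.0, 250.0, 400.0, 800.0, 1200.0])  # e.g. days enrolled
y = np.array([0.9, 0.4, 0.6, -0.2, -0.5])           # e.g. normalized enrollment age

# Pearson r from its definition: covariance divided by the product
# of the two standard deviations (population versions on both sides).
r_manual = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

# np.corrcoef returns the full 2x2 matrix; entry [0, 1] is the x-y correlation.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```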

### I also tried a Random Forest to see how its performance compares to Logistic Regression

In [54]:
# Random Forests

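from sklearn.ensemble import RandomForestClassifier
# Also needed below (repeated here in case they were not imported in an earlier cell)
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import f1_score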

rf = RandomForestClassifier(n_estimators=10)

# Features

feature_cols = ['median_income_norm', 'age_enroll_norm', 'skill_increase_norm_correct', 'grade_level_begin_norm',  'grade_level_end_norm', 'dist_norm']

xprev = studentsn[feature_cols]
y = studentsn['short_term']
x = xprev
vv = list(x)  # feature names, used to label the importance plot


# Train / Test Split

X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
L = rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
print('Accuracy of model on test set is ' + str(metrics.accuracy_score(y_test,y_pred)*100) +' %')
print('Weighted F1 score is ' + str(f1_score(y_test, y_pred, average='weighted')))

print(metrics.confusion_matrix(y_test, y_pred))
# Baseline: accuracy obtained by labelling every student as long term (class 0)
print("Predicting everything as falling as long term will give accuracy of "+ str((len([label for label in y_test if label == 0])/float(len(y_test)))*100))

# Plotting feature importances

fimplr = rf.feature_importances_
print(fimplr.ravel())

data = [Bar(x=vv,y=fimplr.ravel())]

layout = Layout(xaxis=dict(title = 'Features'),
                 yaxis=dict(title='Importance'),
                title='Random Forest Feature Importances')


fig1 = Figure(data=data, layout=layout)
iplot(fig1)

Accuracy of model on test set is 91.4798206278 %
Weighted F1 score is 0.914762220002
[[207  12]
 [ 26 201]]
Predicting everything as falling as long term will give accuracy of 49.1031390135
[ 0.06638936  0.06115591  0.44345287  0.12401699  0.25794272  0.04704215]
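The accuracy and baseline figures can be checked by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and the notebook's coding of class 0 as long term and class 1 as short term. Precision and recall for the short-term class are extra quantities derived from the same four counts; the notebook does not print them.

```python
# Confusion matrix copied from the output above.
# Rows: true class (0 = long term, 1 = short term); columns: predicted class.
cm = [[207, 12],
      [26, 201]]

tn, fp = cm[0]   # true long-term students: predicted long / predicted short
fn, tp = cm[1]   # true short-term students: predicted long / predicted short
total = float(tn + fp + fn + tp)

accuracy = (tn + tp) / total      # share of test students classified correctly
baseline = (tn + fp) / total      # accuracy of always predicting "long term"
precision = tp / float(tp + fp)   # of predicted short-term students, share that really are
recall = tp / float(tp + fn)      # of true short-term students, share that were flagged

print("accuracy  = %.4f" % accuracy)
print("baseline  = %.4f" % baseline)
print("precision = %.4f" % precision)
print("recall    = %.4f" % recall)
```

Because the split is random and unseeded, these counts belong to this particular run; a different split would give somewhat different numbers.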


### Some extra plots and analysis of features I used previously - median income, distance, city, gender, days open, age

In [55]:
# Median Income and Time Enrolled

trace1 = Scatter(x=studentsshortnew['median_income'],y=studentsshortnew['time_enrolled'], mode='markers')
data = [trace1]
layout = Layout(xaxis=dict(title = 'Median Income'),
                 yaxis=dict(title='Time Enrolled'),
                title='Median Income vs Time Enrolled')


fig1 = Figure(data=data, layout=layout)
iplot(fig1)

In [56]:
# Distance and Time Enrolled

trace1 = Scatter(x=studentsshortnew['dist'],y=studentsshortnew['time_enrolled'], mode='markers')
data = [trace1]
layout = Layout(xaxis=dict(title = 'Distance'),
                 yaxis=dict(title='Time Enrolled'),
                title='Distance vs Time Enrolled')


fig1 = Figure(data=data, layout=layout)
iplot(fig1)

In [58]:
# City of Camp and Days Enrolled

trace1 = Scatter(x=students['city_camp'],y=students['time_enrolled'], mode='markers')
data = [trace1]
layout = Layout(xaxis=dict(title = 'City of Camp'),
                 yaxis=dict(title='Time Enrolled'),
                title='City of Camp vs Time Enrolled')


fig1 = Figure(data=data, layout=layout)
iplot(fig1)

students['time_enrolled'].groupby(students['city_camp']).mean()


city_camp
Concord          260.652893
Fairfield        155.950980
Fresno           169.434783
Los Alamitos     459.307200
Mission Viejo    286.478632
San Clemente      91.037037
San Francisco    149.582524
San Jose         402.703196
San Rafael       352.745763
Santa Clara      148.207547
Stockton         761.105911
Torrance         233.338028
Victorville      168.747748
Whittier         156.200000
Windsor          177.779070
Name: time_enrolled, dtype: float64

In [60]:
# Gender and Time Enrolled

male_time = students[students['gender'] == 'M']['time_enrolled']
female_time = students[students['gender'] == 'F']['time_enrolled']

import scipy as sp
import scipy.stats  # make sure the stats submodule is loaded so sp.stats resolves
def cohensd(val1,val2):
    return ( np.mean(val1) - np.mean(val2) ) / np.sqrt( (np.std(val1) ** 2 + np.std(val2) ** 2) / 2 )

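# Welch's t-test, computed once; the two-sided p-value is halved for a one-tailed test.
# NOTE: the statistic is also divided by 2 here, so the printed "t value" is half the actual t statistic.
res = sp.stats.ttest_ind(male_time, female_time, axis=0, equal_var=False)
print("The p value is " + str(res.pvalue/float(2)) + " and the t value is " + str(res.statistic/float(2)))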
print ("The effect size is " + str(cohensd(male_time, female_time)))
print("Males are enrolled on average for " + str(int(np.mean(male_time))) + " days")
print("Females are enrolled on average for " + str(int(np.mean(female_time)))+ " days")
print("Females are enrolled on average for a longer number of days, so might want to target males, but this is probably not a high return on investment because the distributions are not significantly different, and the effect on predicting churn is also small.")

#iplot([Scatter(x=students['gender'],y=students['time_enrolled'], mode='markers')])
trace1 = Scatter(x=students['gender'],y=students['time_enrolled'], mode='markers')
data = [trace1]
layout = Layout(xaxis=dict(title = 'Gender'),
                 yaxis=dict(title='Time Enrolled'),
                title='Gender vs Time Enrolled')


fig1 = Figure(data=data, layout=layout)
iplot(fig1)

The p value is 0.0989691751991 and the t value is -0.643884990064
The effect size is -0.0499569449623
Males are enrolled on average for 373 days
Females are enrolled on average for 395 days
Females are enrolled on average for a longer number of days, so might want to target males, but this is probably not a high return on investment because the distributions are not significantly different, and the effect on predicting churn is also small.
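For reference, here is a minimal sketch of the same kind of comparison on synthetic data (the two arrays are invented, not the enrollment data). It runs Welch's t-test once, reports the t statistic as returned, and converts the two-sided p-value to a one-tailed value only in the hypothesized direction. The `cohens_d` helper is a hypothetical variant that uses sample variances (`ddof=1`), which differs slightly from the `np.std` default used in `cohensd` above.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d using the root mean of the two sample variances (ddof=1)."""
    pooled_sd = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2.0)
    return (np.mean(a) - np.mean(b)) / pooled_sd

# Synthetic enrollment lengths in days: illustrative only.
rng = np.random.RandomState(42)
group_a = rng.normal(loc=370, scale=300, size=400)
group_b = rng.normal(loc=395, scale=320, size=400)

# Welch's t-test (equal_var=False) does not assume equal variances.
t_stat, p_two_sided = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-tailed p-value for the hypothesis "group_a has the smaller mean".
# Halving the two-sided value is only valid when the observed difference
# points in the hypothesized direction (t < 0 here).
if t_stat < 0:
    p_one_sided = p_two_sided / 2.0
else:
    p_one_sided = 1.0 - p_two_sided / 2.0

d = cohens_d(group_a, group_b)
print("t = %.3f, one-sided p = %.3g, d = %.3f" % (t_stat, p_one_sided, d))
```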


In [63]:
# Days Open and Time Enrolled

twodays_time = students[students['days_open'] == 2]['time_enrolled']
threedays_time = students[students['days_open'] == 3]['time_enrolled']


# Welch's t-test, computed once; the two-sided p-value is halved for a one-tailed test.
# NOTE: the statistic is also divided by 2 here, so the printed "t value" is half the actual t statistic.
res = sp.stats.ttest_ind(twodays_time, threedays_time, axis=0, equal_var=False)
print("The p value is " + str(res.pvalue/float(2)) + " and the t value is " + str(res.statistic/float(2)))
print ("The effect size is " + str(cohensd(twodays_time, threedays_time)))
print("Centers open for 2 days have an average enrollment of " + str(int(np.mean(twodays_time))) + " days")
print("Centers open for 3 days have an average enrollment of " + str(int(np.mean(threedays_time)))+ " days")
#print(" The number of days the center is open does not seem to have a significant impact on the logistic regression - but the 2 distributions are significantly different: The centers open for 2 days have a higher mean enrollment time than those open for 3 days, albeit with a small effect size. This might be a feature we need to boost up in the LR model")

trace1 = Scatter(x=students['days_open'],y=students['time_enrolled'], mode='markers')
data = [trace1]
layout = Layout(xaxis=dict(title = 'Days Open'),
                 yaxis=dict(title='Time Enrolled'),
                title='Days Open vs Time Enrolled')


fig1 = Figure(data=data, layout=layout)
iplot(fig1)

The p value is 2.8655314731e-05 and the t value is 2.01629683429
The effect size is 0.160512995133
Centers open for 2 days have an average enrollment of 403 days
Centers open for 3 days have an average enrollment of 337 days


In [64]:
# Age enrolled and Time Enrolled

ae = (((ag_enroll / np.timedelta64(1, 'D')).astype(int))/365).astype(int)
trace1 = Scatter(x=ae,y=students['time_enrolled'], mode='markers')
data = [trace1]
layout = Layout(xaxis=dict(title = 'Age of Student'),
                 yaxis=dict(title='Time Enrolled'),
                title='Age of student vs Time Enrolled')


fig1 = Figure(data=data, layout=layout)
iplot(fig1)

#print(np.corrcoef(ae,students['time_enrolled'])[0,1])
print('The age of enrollment definitely seems to have an impact on enrollment times - seems like higher age of enrollment means slightly lower enrollment times - seems like you might want to target people between 12 and 19 and 2 and 3 to enroll for longer periods; also beyond 1 and below 4 you have lower enrollment')
print('Logistic regression also seems to be picking this up - negative values for ages 2-12 and beyond 19, positive between 12 and 19')
students['time_enrolled'].groupby(ae).median()

The age of enrollment definitely seems to have an impact on enrollment times - seems like higher age of enrollment means slightly lower enrollment times - seems like you might want to target people between 12 and 19 and 2 and 3 to enroll for longer periods; also beyond 1 and below 4 you have lower enrollment
Logistic regression also seems to be picking this up - negative values for ages 2-12 and beyond 19, positive between 12 and 19


2     123.0
3     323.5
4     297.0
5     290.0
6     266.0
7     249.5
8     272.5
9     222.0
10    215.0
11    201.5
12    193.0
13    157.5
14    137.5
15    173.0
16     64.0
17     49.5
18    100.5
19     91.0
26    115.0
42     44.0
46     90.0
Name: time_enrolled, dtype: float64
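Some age groups in this table (for example the isolated ages 26, 42 and 46) may contain only a handful of students, in which case their medians say little. A possible refinement is to report the group size next to each median and filter on it. The sketch below uses a toy frame; the column name `age`, the values and the `min_count` threshold are all made up for illustration.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'age':           [5, 5, 5, 9, 9, 14, 14, 14, 14, 42],
    'time_enrolled': [300, 280, 310, 220, 230, 140, 150, 120, 135, 44],
})

# Median time enrolled and number of students for each age.
summary = toy.groupby('age')['time_enrolled'].agg(['median', 'count'])

# Keep only ages with at least min_count students (arbitrary threshold).
min_count = 2
reliable = summary[summary['count'] >= min_count]

print(summary)
print(reliable)
```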

In [61]:
# Enrollment over the years

students['enroll_date'].groupby(students['enroll_date'].dt.year).size().values

trace0 = Scatter(x = [2007,2008,2009,2010,2011,2012,2013,2014,2015,2016],y = students['enroll_date'].groupby(students['enroll_date'].dt.year).size().values,
                    mode = 'lines+markers',name = 'Number of students enrolled')
trace1 = Scatter(x = [2007,2008,2009,2010,2011,2012,2013,2014,2015], y = camps.groupby(camps[camps['state_code']== 'CA']['open_date'].dt.year).size().values[9:18],
                 mode = 'lines+markers',name = 'Number of camps opened')
trace2 = Scatter(x = [2007,2008,2009,2010,2011,2012,2013,2014,2015, 2016], y = students['end_date'].groupby(students['end_date'].dt.year).size().values,
                 mode = 'lines+markers',name = 'Number of students who quit')
data = [trace0, trace1, trace2]
layout = dict(title = 'Trends over the years')
fig = dict(data=data, layout=layout)
iplot(fig, filename='scatter-mode')
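The year lists in this cell are typed by hand, which only lines up with the counts if every listed year actually occurs in the data. A hedged alternative is to take the x values from the grouped result itself. The dates below are invented for illustration.

```python
import pandas as pd

# Toy enrollment dates: illustrative only.
enroll = pd.Series(pd.to_datetime([
    '2007-03-01', '2007-09-15', '2009-01-10',
    '2010-06-30', '2010-07-04', '2010-11-20',
]))

# Count rows per calendar year; the index holds only the years that occur.
per_year = enroll.groupby(enroll.dt.year).size()

x_years = per_year.index.tolist()    # years present in the data (2008 is absent here)
y_counts = per_year.values.tolist()  # matching counts, in the same order

print(x_years, y_counts)
```

Passing `x=x_years, y=y_counts` to `Scatter` keeps the two axes aligned even when a year has no rows.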

