The goal of this project is to identify students who might need intervention before they fail their exams. It is a binary classification problem: the output is whether or not a student needs intervention.
The student data is stored in the file student-data.csv.


In the first step, the necessary Python packages are imported and the data is loaded.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score

# Read student data
student_data = pd.read_csv("student-data.csv")
print("Student data read successfully!")

Student data read successfully!


DATA EXPLORATION
=========

Now we will look into the student data and try to gain some insight into it. We want to find out the following:

1. The total number of students in the class
2. The total number of features for each student
3. The number of students who passed
4. The number of students who failed
5. The graduation rate of the class


In [2]:
#Calculating number of students
n_students = len(student_data)

#Calculate number of features
n_features = len(student_data.iloc[0]) - 1

#Calculate passing students
n_passed = len(student_data[student_data['passed'] == 'yes'])

#Calculate failing students
n_failed = len(student_data[student_data['passed'] == 'no'])

#Calculate graduation rate
grad_rate = float(n_passed)/n_students * 100

# Print the results
print("Total number of students (number of datapoints): {}".format(n_students))
print("Number of features: {}".format(n_features))
print("Number of students who passed (graduates): {}".format(n_passed))
print("Number of students who failed (non-graduates): {}".format(n_failed))
print("Graduation rate of the class: {:.2f}%".format(grad_rate))

Total number of students (number of datapoints): 395
Number of features: 30
Number of students who passed (graduates): 265
Number of students who failed (non-graduates): 130
Graduation rate of the class: 67.09%
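
The graduation rate above is simply 265 / 395, which is about 67.09%. As a cross-check, the same counts can be read off in one step with pandas `value_counts`. The sketch below is only an illustration: the `toy` DataFrame is made up and stands in for the real student data, and the variable names ending in `_toy` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for student-data.csv
toy = pd.DataFrame({"passed": ["yes", "no", "yes", "yes"]})

# Absolute number of rows per class
counts = toy["passed"].value_counts()

# Share of each class, expressed as a percentage
rates = toy["passed"].value_counts(normalize=True) * 100

n_passed_toy = int(counts.get("yes", 0))
n_failed_toy = int(counts.get("no", 0))
grad_rate_toy = float(rates.get("yes", 0.0))

print("Passed: {}, failed: {}".format(n_passed_toy, n_failed_toy))
print("Graduation rate: {:.2f}%".format(grad_rate_toy))
```

Because roughly two thirds of the students pass, the classes are imbalanced, which is one reason the F1 score is used later instead of plain accuracy.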


DATA MODELLING, TRAINING AND TESTING
====================

The data contains non-numeric features, which need to be converted to numeric ones because most machine learning algorithms expect numeric input.

In [3]:

# Extracting feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print("Feature columns:\n{}".format(feature_cols))
print("\nTarget column: {}".format(target_col))

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print("\nFeature values:")
print(X_all.head())



Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       

As we can see, there are several non-numeric columns that need to be converted. A column that takes only two values, such as yes or no, can be converted directly to 1 for yes and 0 for no.

For columns that take more than two values, one approach is to create one new column per distinct value and set it to 1 where the original column has that value and 0 elsewhere. The pandas 'get_dummies' function performs this transformation.
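
To make the dummy-variable idea concrete, here is a minimal sketch on a made-up column. The `toy` DataFrame and its values are assumptions for illustration; only `pd.get_dummies` and its `prefix` argument come from the approach described above. The `.astype(int)` call is added so the result prints as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy column; the real data has more rows and categories
toy = pd.DataFrame({"Mjob": ["at_home", "teacher", "other", "teacher"]})

# One new column per distinct value, prefixed with the original column name
dummies = pd.get_dummies(toy["Mjob"], prefix="Mjob").astype(int)

print(list(dummies.columns))
print(dummies)
```

Each row has exactly one 1 across the new columns, marking which value the original column held.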


In [4]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigating each feature column for the data
    # (DataFrame.items() replaces iteritems(), which was removed in pandas 2.0)
    for col, col_data in X.items():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print("Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns)))

Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


TRAINING AND TESTING DATA SPLITTING
===================

We will randomly shuffle the data and split it in a roughly 75:25 ratio, storing the results in X_train, X_test, y_train and y_test. This gives 300 points for training and 95 points for testing.


In [5]:
# Import the splitting utilities
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

#Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train

#Shuffle and split the dataset into the number of training and testing points above
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size=num_train, test_size=num_test)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))


Training set has 300 samples.
Testing set has 95 samples.




We will compare seven different models:

1. Gaussian Naive Bayes (GaussianNB)
2. Decision Trees (DecisionTreeClassifier)
3. Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting); Random Forest is the one used here
4. K-Nearest Neighbors (KNeighborsClassifier)
5. Stochastic Gradient Descent (SGDClassifier)
6. Support Vector Machines (SVC)
7. Logistic Regression

We will then fit each model to training sets of varying size (100, 200 and 300 data points) and measure the F1 score. The training and prediction times are also recorded, so that the best model can be chosen by weighing all of these measures together.
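
The F1 score is the harmonic mean of precision and recall for the positive class, here 'yes'. The sketch below computes it by hand on a handful of made-up labels and compares the result with `f1_score` from scikit-learn. The label lists are assumptions for illustration and are not taken from the student data.

```python
from sklearn.metrics import f1_score

# Hypothetical labels, for illustration only
y_true = ["yes", "yes", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "yes", "yes", "no", "yes"]

# Count outcomes, treating 'yes' as the positive class
tp = sum(t == "yes" and p == "yes" for t, p in zip(y_true, y_pred))
fp = sum(t == "no" and p == "yes" for t, p in zip(y_true, y_pred))
fn = sum(t == "yes" and p == "no" for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # share of predicted 'yes' that are correct
recall = tp / (tp + fn)     # share of true 'yes' that were found
f1_manual = 2 * precision * recall / (precision + recall)

# Same quantity from scikit-learn
f1_sklearn = f1_score(y_true, y_pred, pos_label="yes")

print("Manual F1: {:.4f}, sklearn F1: {:.4f}".format(f1_manual, f1_sklearn))
```

Passing `pos_label='yes'` matters because the target column holds the strings 'yes' and 'no' rather than 1 and 0.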



We will define three helper functions for training and testing the chosen supervised learning models.

The helper functions are:

1. trainClassifier - takes a classifier and training data as input and fits the classifier to the data.
2. predictLabels - takes a fitted classifier, features, and the target labels as input, makes predictions, and returns their F1 score.
3. trainPredict - takes a classifier and the training and testing data as input, and calls trainClassifier and predictLabels.



In [9]:
def trainClassifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print("Trained model in {:.4f} seconds".format(end - start))

    
def predictLabels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')


def trainPredict(clf, X_train, y_train, X_test, y_test):
    ''' Trains a classifier and reports its F1 score on the training and test sets. '''
    
    # Indicate the classifier and the training set size
    print("Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train)))
    
    # Train the classifier
    trainClassifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print("F1 score for training set: {:.4f}.".format(predictLabels(clf, X_train, y_train)))
    print("F1 score for test set: {:.4f}.".format(predictLabels(clf, X_test, y_test)))
    print("\n")

We will import the supervised learning models listed above and run trainPredict on the data. We need to do this for training set sizes of 100, 200 and 300, so in all we will have 21 different results.

In [11]:
#Import the seven supervised learning models from sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier

#Initializing the  models
clf_A = GaussianNB()
clf_B = DecisionTreeClassifier(random_state=0)
clf_C = RandomForestClassifier(random_state=0)
clf_D = KNeighborsClassifier()
clf_E = SGDClassifier(random_state=0)
clf_F = SVC(random_state=0)
clf_G = LogisticRegression(random_state=0)

#Set up the training set sizes

X_train_100 = X_train[:100]
y_train_100 = y_train[:100]

X_train_200 = X_train[:200]
y_train_200 = y_train[:200]

X_train_300 = X_train
y_train_300 = y_train

#Executing the 'trainPredict' function for each classifier and each training set size

for clf in [clf_A, clf_B, clf_C , clf_D , clf_E , clf_F , clf_G]:
    for j in [(X_train_100, y_train_100), (X_train_200, y_train_200), (X_train_300, y_train_300)]:
        trainPredict(clf, j[0], j[1], X_test, y_test)

Training a GaussianNB using a training set size of 100. . .
Trained model in 0.0014 seconds
Made predictions in 0.0005 seconds.
F1 score for training set: 0.4000.
Made predictions in 0.0005 seconds.
F1 score for test set: 0.3146.


Training a GaussianNB using a training set size of 200. . .
Trained model in 0.0016 seconds
Made predictions in 0.0006 seconds.
F1 score for training set: 0.8015.
Made predictions in 0.0009 seconds.
F1 score for test set: 0.6767.


Training a GaussianNB using a training set size of 300. . .
Trained model in 0.0014 seconds
Made predictions in 0.0006 seconds.
F1 score for training set: 0.8180.
Made predictions in 0.0005 seconds.
F1 score for test set: 0.7176.


Training a DecisionTreeClassifier using a training set size of 100. . .
Trained model in 0.0011 seconds
Made predictions in 0.0003 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0003 seconds.
F1 score for test set: 0.6777.


Training a DecisionTreeClassifier using a training set size 

  'precision', 'predicted', average, warn_for)


Out of all the models, we choose Logistic Regression as our model of choice. The reasons are given below:

1. Logistic Regression has a better F1 score, 0.8598 with 300 data points. It also does not suffer from overfitting.
2. The training and prediction times of the Logistic Regression model are low compared with those of the other algorithms.


MODEL TUNING
========

We will fine-tune the chosen model using grid search (GridSearchCV), tuning at least one important parameter over at least 3 different values. The entire training set is used for this.

In [12]:
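# Import 'GridSearchCV' and 'make_scorer'
# (GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV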
from sklearn.metrics import make_scorer

def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    score = clf.score(features, target.values)
    end = time()
    print("Score: ", score)
    
    # Printing and returning results
    print("Made predictions in {:.4f} seconds.".format(end - start))
    return f1_score(target.values, y_pred, pos_label='yes')


#Create the parameters list you wish to tune
parameters = { "penalty":["l2","l1"], 
               "C":[1,10,100,1000],
              }

#Initialize the classifier
# (liblinear supports both the 'l1' and 'l2' penalties in the grid above)
clf = LogisticRegression(solver='liblinear')

#Make an f1 scoring function using 'make_scorer' 
f1_scorer = make_scorer(f1_score, pos_label='yes')

#Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(clf, parameters, scoring=f1_scorer)

#Fit the grid search object to the training data and find the optimal parameters
grid_obj = grid_obj.fit(X_train, y_train)
print(grid_obj)

# Get the estimator
clf = grid_obj.best_estimator_
print(clf)

# Report the final F1 score for training and testing after parameter tuning
print("Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train)))
print("Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test)))



GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l2', 'l1'], 'C': [1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True,
       scoring=make_scorer(f1_score, pos_label=yes), verbose=0)
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Score:  0.796666666667
Made predictions in 0.0012 seconds.
Tuned model has a training F1 score of 0.8598.
Score:  0.642105263158
Made predictions in 0.0008 seconds.
Tuned model has a testing 

Thus grid search selects the parameter combination with the best cross-validated F1 score on the training data (here C=1 with the l2 penalty), and the resulting model is then evaluated on both the training and testing sets.
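
To see which combination the search picked and how it scored during cross-validation, the fitted object exposes `best_params_`, `best_score_` and `cv_results_`. The sketch below shows this on synthetic data, assuming a recent scikit-learn where `GridSearchCV` is imported from `sklearn.model_selection`. The names `X_demo`, `y_demo` and `search` are assumptions for illustration, the data comes from `make_classification` rather than the student file, and the labels are 0/1, so no `pos_label` is needed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed student features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Same grid as in the tuning cell above
param_grid = {"penalty": ["l2", "l1"], "C": [1, 10, 100, 1000]}

# liblinear supports both the l1 and l2 penalties
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring=make_scorer(f1_score),
    cv=3,
)
search.fit(X_demo, y_demo)

# Parameters with the best mean cross-validated F1 score
print("Best parameters:", search.best_params_)
print("Best cross-validated F1: {:.4f}".format(search.best_score_))

# Mean score of every combination that was tried
for params, mean in zip(search.cv_results_["params"],
                        search.cv_results_["mean_test_score"]):
    print(params, "{:.4f}".format(mean))
```

Note that `best_score_` is an average over validation folds of the training data, so it can differ from the F1 score later measured on the held-out test set.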