**Algorithms and Techniques**

I use random forests and XGBoost as classifiers for this problem and compare their results to the benchmark model described below. Both are tree-ensemble methods (bagging and gradient boosting, respectively) that tend to perform well on tabular classification problems like this one.

As discussed in the benchmark model section below, this is a binary classification problem with imbalanced classes (77% of the target values are 1 and 23% are 0). I therefore want to maximize the model's ability to correctly identify observations in the minority class, and I will not use accuracy as the main metric for evaluating model performance. I will report the following metrics:

- F1 Score 
- AUC/ROC

Both metrics reflect not only how often a model's predictions are correct, but also how well it identifies the minority class. The dataset and its class imbalance are discussed further in the benchmark model section.
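To make the motivation concrete, here is a minimal sketch on made-up labels (not the project data) of how accuracy can look good while the minority-class F1 collapses. The helper `per_class_f1` is a hypothetical name introduced for illustration; it is meant to mirror what `f1_score(..., average=None)` reports for each class of a binary target.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        # No true positives: F1 is reported as 0 (precision or recall is 0 or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels with roughly the same imbalance as the dataset: 10 ones, 3 zeros
y_true = [1] * 10 + [0] * 3
y_pred = [1] * 13  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_minority = per_class_f1(y_true, y_pred, label=0)
f1_majority = per_class_f1(y_true, y_pred, label=1)

print("accuracy: {:.3f}".format(accuracy))
print("F1 per class [0, 1]: [{:.3f}, {:.3f}]".format(f1_minority, f1_majority))
```

Accuracy and the majority-class F1 both look respectable here, while the minority-class F1 is zero, which is the failure mode the chosen metrics are meant to expose.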

In [1]:
import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
%matplotlib inline 

**Benchmark Model**

I am using a dummy classifier as the benchmark model. The first dataset includes only the knowledge components (KCs) as features. It excludes the problem name, the section of the curriculum, and anything that identifies the student, such as the student ID. It also excludes information that was not available in the final test set of the KDD Challenge, such as problem times, hints, and incorrects.

In [3]:
df = pd.read_pickle('mtprocessed.p')
#df = df.apply(pd.to_numeric, errors='coerce', axis=1)
#df.columns = df.columns.str.replace("],[<", "")
df.head(5)

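| Unnamed: 0 | Changing axis bounds | Changing axis intervals | Choose Graphical Refl-v | Choose Graphical a | Choose Graphical h | Choose Graphical k | Convert unit, mixed | Convert unit, multiplier | Convert unit, standard | Correctly placing points | ... | PROP | L1F | NOV | PERCENT | IPC | CTA | ES | LINEAR | QUAD | Problem View |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 |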


In [6]:
# Move the target column to the end so features and target can be sliced by position
df = df.drop(columns='Correct First Attempt').join(df['Correct First Attempt']) #make CFA the last column

# Random 75/25 split; the mask is kept outside df so it is not picked up as a feature.
# train and test are used by the benchmark and model cells below.
is_train = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[is_train], df[~is_train]

# Show the number of observations for the train and test dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:', len(test))

In [6]:
features = df.columns[:-1]
target = df.columns[-1]

In [7]:
target

'Correct First Attempt'

In [15]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

clf = DummyClassifier(strategy='most_frequent')
clf2 = DummyClassifier(strategy='stratified')  # set explicitly: newer scikit-learn versions default to 'prior'
clf.fit(train[features], train[target])
clf2.fit(train[features], train[target])
y_pred1 = clf.predict(test[features])
y_pred2 = clf2.predict(test[features])

print ("Accuracy score (most frequent): {0}".format(accuracy_score(y_pred1, test[target])))
print ("Accuracy score (stratified): {0}".format(accuracy_score(y_pred2, test[target])))

Accuracy score (most frequent): 0.7678144688246795
Accuracy score (stratified): 0.6423380009084985


Because the data is imbalanced (about 77% of the observations have a “Correct First Attempt” value of 1 and 23% have a value of 0), a model can reach 77% accuracy by simply predicting 1 for every observation (the "most_frequent" strategy). This is not an acceptable solution because we also want to identify negative cases accurately, i.e. students who do not answer questions correctly on the first try. The 77% “success” rate comes from the class imbalance, not from the classifier being useful.

Due to the class imbalance, accuracy is not a useful metric for this model, so I use the F1 score and AUC/ROC instead. For a meaningful baseline, I use a dummy model with the "stratified" strategy, which generates random predictions that follow the class frequencies of the training target. The F1 score for this model is approximately 0.23 for the minority class and 0.77 for the majority class, which mirrors the class distribution and makes it a more informative baseline than the naive "most frequent" strategy. Its accuracy is lower (about 64%), but its minority-class F1 score is higher (0.23 versus 0.0).
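As a rough sanity check on these baseline numbers, assume the stratified dummy predicts 1 with probability p independently of the true label, where p is the majority-class frequency. Expected precision and recall are then both about p for class 1 and 1 - p for class 0, so the per-class F1 scores land near those values, and expected accuracy is p² + (1 - p)². The sketch below takes p from the reported "most frequent" accuracy on the test set; this is an assumption, since the dummy actually samples from the training-set frequency, which may differ slightly.

```python
# Majority-class frequency, taken from the reported "most frequent" test accuracy
# (assumption: the training-set frequency used by the dummy is about the same)
p = 0.7678144688246795

# Under independent random guessing, precision and recall both equal the class frequency,
# so the F1 score for each class is approximately that frequency as well
expected_f1_majority = p
expected_f1_minority = 1 - p

# A prediction is correct when true and predicted labels are both 1 or both 0
expected_accuracy = p ** 2 + (1 - p) ** 2

print("expected F1 [0, 1]: [{:.3f}, {:.3f}]".format(expected_f1_minority, expected_f1_majority))
print("expected accuracy: {:.3f}".format(expected_accuracy))
```

These back-of-the-envelope values sit close to the measured stratified scores, which suggests the baseline behaves like pure frequency-matched guessing.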

In [16]:
from sklearn.metrics import f1_score

print ("F1 score (most frequent): {0}".format(f1_score(test[target], y_pred1, average=None)))
print ("F1 score (stratified): {0}".format(f1_score(test[target], y_pred2, average=None)))

F1 score (most frequent): [ 0.          0.86865956]
F1 score (stratified): [ 0.23237183  0.76685399]




With the "most frequent" strategy, the F-score for the 0 class is 0.0 and the F-score for the 1 class is about 0.869. scikit-learn also warns that the "F-score is ill-defined" because the 0 class is never predicted, so its precision is a 0/0 division that is reported as 0. As expected, recall for the 1 class is perfect, while its precision is only about 0.77, the share of 1s in the test set.
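The majority-class F-score can be checked by hand. When every prediction is 1, recall for class 1 is 1.0 and precision equals the share of 1s in the test set, which is the reported "most frequent" accuracy. A minimal sketch of that arithmetic:

```python
precision = 0.7678144688246795  # share of 1s in the test set (the reported accuracy)
recall = 1.0                    # every true 1 is predicted as 1

# F1 is the harmonic mean of precision and recall
f1_majority = 2 * precision * recall / (precision + recall)
print("F1 for class 1: {:.8f}".format(f1_majority))
```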

In [17]:
from sklearn.metrics import roc_auc_score

print ("AUC/ROC score (most frequent): {0}".format(roc_auc_score(test[target], y_pred1)))
print ("AUC/ROC score (stratified): {0}".format(roc_auc_score(test[target], y_pred2)))

AUC/ROC score (most frequent): 0.5
AUC/ROC score (stratified): 0.49961370874525235
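Note that `roc_auc_score` is given hard 0/1 predictions here rather than scores. AUC can be read as the probability that a randomly chosen positive is ranked above a randomly chosen negative, with ties counting one half. A constant predictor ties on every pair, which is why it lands at exactly 0.5. The sketch below uses a hypothetical helper, `pairwise_auc`, on made-up labels to illustrate this; it is not how scikit-learn computes the score internally.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0]

constant = [1, 1, 1, 1, 1]            # "most frequent" style predictions
hard = [1, 1, 0, 0, 1]                # hard labels with two mistakes
scores = [0.9, 0.8, 0.45, 0.2, 0.6]   # made-up probabilities that give those labels at a 0.5 threshold

auc_constant = pairwise_auc(y_true, constant)
auc_hard = pairwise_auc(y_true, hard)
auc_scores = pairwise_auc(y_true, scores)
print(auc_constant, auc_hard, auc_scores)
```

With hard labels the result reduces to the average of the true positive rate and the true negative rate, so it describes a single threshold rather than the full ranking.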


**Model Implementation**

Below I fit the XGBoost and random forest classifiers with their default parameters on the same train/test split and compare their performance.

In [None]:
from xgboost import XGBClassifier

clf = XGBClassifier()
clf.fit(train[features], train[target])
y_pred = clf.predict(test[features])

print ("XGBoost accuracy score: {0}".format(accuracy_score(y_pred, test[target])))
print ("F1 score: {0}".format(f1_score(test[target], y_pred, average=None)))
print ("AUC/ROC score: {0}".format(roc_auc_score(test[target], y_pred)))

In [18]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(train[features], train[target])
y_pred = clf.predict(test[features])

print ("Random forests accuracy score: {0}".format(accuracy_score(y_pred, test[target])))
print ("F1 score: {0}".format(f1_score(test[target], y_pred, average=None)))
print ("AUC/ROC score: {0}".format(roc_auc_score(test[target], y_pred)))

Random forests accuracy score: 0.7747121442537476
F1 score: [ 0.37308675  0.8626829 ]
AUC/ROC score: 0.6051971590981953
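Because the score above passes `clf.predict` output to `roc_auc_score`, it reflects a single decision threshold. A minimal sketch of the alternative, scoring with the predicted probability of class 1 from `predict_proba`, is shown here on synthetic data from `make_classification`. The synthetic data is an assumption for illustration only; it is not the project dataset, and its numbers are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: roughly 23% class 0 and 77% class 1
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.23, 0.77],
                           random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y,
                                          random_state=10)

rf = RandomForestClassifier(n_estimators=50, random_state=10)
rf.fit(X_tr, y_tr)

# AUC from hard labels (one threshold) versus AUC from class-1 probabilities (all thresholds)
auc_hard = roc_auc_score(y_te, rf.predict(X_te))
auc_proba = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

print("AUC from hard labels:   {:.3f}".format(auc_hard))
print("AUC from probabilities: {:.3f}".format(auc_proba))
```

The column index `[:, 1]` relies on `rf.classes_` being ordered as `[0, 1]`, which is worth confirming before applying the same pattern to the real target.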


**Model Evaluation and Validation**

In [7]:
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation module

#df = df.drop('is_train', 1)
num_all = len(df)
num_train = int(len(df)*.75)
num_test = num_all - num_train

# The target is the last column; every column before it is a feature
X_train, X_test, y_train, y_test = train_test_split(df[df.columns[:-1]], df[df.columns[-1]],
                                                    train_size=num_train,
                                                    test_size=num_test,
                                                    random_state=10)


print ("Training set: {} samples".format(X_train.shape[0]))
print ("Test set: {} samples".format(X_test.shape[0]))

Training set: 607270 samples
Test set: 202424 samples


In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print ("Random forests accuracy score: {0}".format(accuracy_score(y_pred, y_test)))
print ("F1 score: {0}".format(f1_score(y_test, y_pred, average=None)))
print ("AUC/ROC score: {0}".format(roc_auc_score(y_test, y_pred)))

In [None]:
#train on smaller training set sizes

X_train_10k = X_train[:10000]
y_train_10k = y_train[:10000]
X_train_100k = X_train[:100000]
y_train_100k = y_train[:100000]

In [None]:
clf.fit(X_train_10k, y_train_10k)
y_pred = clf.predict(X_test)

print ("Random forests accuracy score: {0}".format(accuracy_score(y_pred, y_test)))
print ("F1 score: {0}".format(f1_score(y_test, y_pred, average=None)))
print ("AUC/ROC score: {0}".format(roc_auc_score(y_test, y_pred)))

In [None]:
clf.fit(X_train_100k, y_train_100k)
y_pred = clf.predict(X_test)

print ("Random forests accuracy score: {0}".format(accuracy_score(y_pred, y_test)))
print ("F1 score: {0}".format(f1_score(y_test, y_pred, average=None)))
print ("AUC/ROC score: {0}".format(roc_auc_score(y_test, y_pred)))

In [12]:
#gridsearchcv on random forests
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV  # replaces the removed sklearn.grid_search module

classifier = RandomForestClassifier()
# 'sqrt' is what 'auto' meant for classifiers; recent scikit-learn no longer accepts 'auto'
parameters = {'max_features':['sqrt', 'log2'], 'n_estimators':[10, 20, 30]}
f1_scorer = make_scorer(f1_score, pos_label=1)  # scores the majority class (1)

clf = GridSearchCV(classifier, parameters, scoring=f1_scorer)
clf.fit(X_train, y_train)

# predict_labels is defined in the helper cell In [9] below, which must be run first
train_f1_score = predict_labels(clf, X_train, y_train)
test_f1_score = predict_labels(clf, X_test, y_test)

print ("Optimal parameter values: {}".format(clf.best_params_))
print ("F1 score for training set: {}".format(train_f1_score))
print ("F1 score for test set: {}".format(test_f1_score))

In [9]:
import time 

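def predict_labels(clf, features, target):
    """Time clf.predict on features, print the per-class F1, and return the F1 for class 1."""
    print ("Predicting labels using {}...".format(clf.__class__.__name__))
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print ("Done!\nPrediction time (secs): {:.3f}".format(end - start))
    print (f1_score(target, y_pred, average=None))  # F1 for each class: [class 0, class 1]
    return f1_score(target, y_pred)  # default pos_label=1, i.e. the majority class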

In [47]:
#gridsearchcv on xgboost
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV  # replaces the removed sklearn.grid_search module

classifier = XGBClassifier()
parameters = {'learning_rate':[0.05, 0.1, 0.2], 'subsample':[0.5, 0.8, 1]}
f1_scorer = make_scorer(f1_score, pos_label=1)

clf = GridSearchCV(classifier, parameters, scoring=f1_scorer)
clf.fit(X_train, y_train)

train_f1_score = predict_labels(clf, X_train, y_train)
test_f1_score = predict_labels(clf, X_test, y_test)

print ("Optimal parameter values: {}".format(clf.best_params_))
print ("F1 score for training set: {}".format(train_f1_score))
print ("F1 score for test set: {}".format(test_f1_score))

Predicting labels using GridSearchCV...
Done!
Prediction time (secs): 18.519
[ 0.24985872  0.87654126]
Predicting labels using GridSearchCV...
Done!
Prediction time (secs): 4.158
[ 0.25264589  0.87665146]
Optimal parameter values: {'learning_rate': 0.2, 'subsample': 0.5}
F1 score for training set: 0.8765412566317545
F1 score for test set: 0.8766514626109729
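
Both grid searches score candidates with `pos_label=1`, which is the majority-class F1 (the 0.877 figure above), while the per-class output shows the minority-class F1 near 0.25. If the minority class is the priority, as stated at the start, one option is a scorer with `pos_label=0`. The sketch below only demonstrates the two scorers on made-up labels with a dummy estimator; it does not re-run the searches above, and `zero_division=0` assumes scikit-learn 0.22 or later.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, make_scorer

# Scorers that treat class 1 (majority) or class 0 (minority) as the positive class
f1_majority_scorer = make_scorer(f1_score, pos_label=1)
f1_minority_scorer = make_scorer(f1_score, pos_label=0, zero_division=0)

# Made-up data: 8 ones and 2 zeros; the single feature carries no information
X_toy = [[0]] * 10
y_toy = [1] * 8 + [0] * 2

dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)

majority_f1 = f1_majority_scorer(dummy, X_toy, y_toy)
minority_f1 = f1_minority_scorer(dummy, X_toy, y_toy)
print("F1 with pos_label=1: {:.3f}".format(majority_f1))
print("F1 with pos_label=0: {:.3f}".format(minority_f1))
```

Passing `f1_minority_scorer` as `scoring=` to `GridSearchCV` would select parameters by minority-class F1 instead; whether that changes the chosen parameters on this dataset is untested here.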
