**Algorithms and Techniques**

I will use Random Forests and XGBoost as classifiers for this problem and compare their results against the benchmark model described below. Both are tree-ensemble methods (bagging and gradient boosting, respectively) that are widely used for classification on tabular data like this.
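As a rough illustration of the planned comparison, the sketch below scores a dummy classifier and a Random Forest on synthetic data. Everything about the data is an assumption made for the example: the six binary "skill" columns, the rule that steps involving the first skill are usually answered incorrectly, and the resulting class balance of roughly 76/24. XGBoost is left out to keep the sketch dependency-light; its `XGBClassifier` follows the same `fit`/`predict` pattern and could be added to the list of models.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000

# Hypothetical multi-hot features: column 0 marks a "hard" skill present in ~25% of steps.
hard = (rng.uniform(size=n) < 0.25).astype(int)
other = rng.integers(0, 2, size=(n, 5))
X = np.column_stack([hard, other])

# Assumed rule: steps with the hard skill are correct 20% of the time, others 95%.
p_correct = np.where(hard == 1, 0.20, 0.95)
y = (rng.uniform(size=n) < p_correct).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = [
    ("dummy", DummyClassifier(strategy="most_frequent")),
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Balanced accuracy averages the recall of both classes, so it is not inflated by imbalance.
scores = {}
for name, model in models:
    model.fit(X_train, y_train)
    scores[name] = balanced_accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Balanced accuracy is used here instead of plain accuracy so that a model predicting only the majority class does not look better than it is.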

In [1]:
import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
%matplotlib inline

**Benchmark Model**

I am using a dummy classifier as the benchmark model. The first dataset includes only the knowledge components (KCs) as features. It excludes the problem name, the section of the curriculum, and anything that identifies the student, such as the student ID. It also excludes information that would not have been available in the final test set of the KDD Challenge, such as the step times and durations and the counts of hints and incorrect attempts.

In [2]:
df = pd.read_pickle('mtpickle.p')

In [4]:
# Drop columns that would not be available in the KDD Challenge test set
df = df.drop(columns=[
    'Step Start Time',
    'First Transaction Time',
    'Correct Transaction Time',
    'Step End Time',
    'Step Duration (sec)',
    'Correct Step Duration (sec)',
    'Error Step Duration (sec)',
    'Incorrects',
    'Hints',
    'Corrects',
])

In [5]:
# Drop the raw KC/OC strings and the problem and student identifiers
df = df.drop(columns=[
    'Opportunity(Default)',
    'KC(Default)',
    'Problem Hierarchy',
    'Problem View',
    'Step Name',
    'Anon Student Id',
    'Row',
    'Problem Name',
])

In [6]:
df = df.drop(columns='Correct First Attempt').join(df['Correct First Attempt']) #make CFA the last column
is_train = np.random.uniform(0, 1, len(df)) <= .75 #random mask: roughly 75% train, 25% test
train, test = df[is_train], df[~is_train] #mask is kept out of df so it cannot become a feature

# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:', len(test))

Number of observations in the training data: 606967
Number of observations in the test data: 202727
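
The split above is unseeded, so the counts change from run to run. One possible way to make it reproducible and to keep the class balance equal in both parts is sketched below on a toy frame. The feature columns `kc_a` and `kc_b` and the seed value are placeholders, and `groupby(...).sample` assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (kc_a and kc_b are hypothetical features).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "kc_a": rng.integers(0, 2, size=1000),
    "kc_b": rng.integers(0, 2, size=1000),
    "Correct First Attempt": (rng.uniform(size=1000) < 0.77).astype(int),
})

# Sample 75% of the rows within each target class, with a fixed seed.
train = toy.groupby("Correct First Attempt").sample(frac=0.75, random_state=42)
# Everything not sampled into the training set becomes the test set.
test = toy.drop(train.index)

print(len(train), len(test))
print(train["Correct First Attempt"].mean(), test["Correct First Attempt"].mean())
```

Sampling within each class keeps the share of correct first attempts nearly identical in the two parts, which makes accuracy figures on the test part easier to compare with the training data.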


In [7]:
features = df.columns[:-1]
target = df.columns[-1]

In [8]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

clf = DummyClassifier(strategy='most_frequent')
clf.fit(train[features], train[target])
y_pred = clf.predict(test[features])

accuracy_score(test[target], y_pred)

0.76763825242814232

Because the data is unbalanced (about 77% of the observations have a “Correct First Attempt” value of 1 and 23% have a value of 0), this model reaches 77% accuracy simply by predicting 1 for every observation (the "most_frequent" strategy). This is not an adequate solution, because we also want to identify the negative cases accurately, i.e. students who do not answer correctly on the first try. The 77% “success” rate reflects the class imbalance, not a useful classifier.
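
The point can be checked numerically with a small sketch. The labels below are synthetic, drawn to match the roughly 77/23 balance described above rather than taken from the real dataset, and the per-class recall is computed by hand so the arithmetic is visible.

```python
import numpy as np

# Synthetic labels with roughly the 77/23 class balance (assumed, not the real data).
rng = np.random.default_rng(0)
y_true = (rng.uniform(size=10_000) < 0.77).astype(int)

# A "most frequent" predictor: always predict the majority class.
majority = int(y_true.mean() >= 0.5)
y_pred = np.full_like(y_true, majority)

accuracy = (y_pred == y_true).mean()

# Recall per class: the fraction of each true class that is predicted correctly.
recall_pos = (y_pred[y_true == 1] == 1).mean()
recall_neg = (y_pred[y_true == 0] == 0).mean()
balanced_accuracy = (recall_pos + recall_neg) / 2

print(f"accuracy={accuracy:.3f}")
print(f"recall_pos={recall_pos:.2f} recall_neg={recall_neg:.2f}")
print(f"balanced_accuracy={balanced_accuracy:.2f}")
```

The accuracy lands near the majority-class share, while the recall on the negative class is zero: the predictor never identifies a student who gets the step wrong.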

This is not necessarily the only feature set or the final benchmark model, since I plan to add features beyond the KCs in later models. It is a starting point: it shows that the data is unbalanced and establishes the minimum overall accuracy to expect from an overly simple method.

**Data Preprocessing**

As shown below, the KC and Opportunity Count (OC) features used in my initial analysis are stored as single strings, with multiple values separated by double tildes (`~~`):

In [23]:
pd.set_option('display.max_colwidth', None) #show full column contents without truncation
df = pd.read_csv('algebra_2005_2006_train.txt', sep='\t')
df.head(5)[['KC(Default)', 'Opportunity(Default)']]

Unnamed: 0,KC(Default),Opportunity(Default)
0,"[SkillRule: Eliminate Parens; {CLT nested; CLT nested, parens; Distribute Mult right; Distribute Mult left; (+/-x +/-a)/b=c, mult; (+/-x +/-a)*b=c, div; [var expr]/[const expr] = [const expr], multiply; Distribute Division left; Distribute Division right; Distribute both mult left; Distribute both mult right; Distribute both divide left; Distribute both divide right; Distribute subex}]",1
1,"[SkillRule: Remove constant; {ax+b=c, positive; ax+b=c, negative; x+a=b, positive; x+a=b, negative; [var expr]+[const expr]=[const expr], positive; [var expr]+[const expr]=[const expr], negative; [var expr]+[const expr]=[const expr], all; Combine constants to right; Combine constants to left; a-x=b, positive; a/x+b=c, positive; a/x+b=c, negative}]~~[SkillRule: Isolate positive; x+a=b, positive]",1~~1
2,"[SkillRule: Remove constant; {ax+b=c, positive; ax+b=c, negative; x+a=b, positive; x+a=b, negative; [var expr]+[const expr]=[const expr], positive; [var expr]+[const expr]=[const expr], negative; [var expr]+[const expr]=[const expr], all; Combine constants to right; Combine constants to left; a-x=b, positive; a/x+b=c, positive; a/x+b=c, negative}]",2
3,"[SkillRule: Remove coefficient; {ax+b=c, divide; ax=b; [const expr]*[var fact] + [const expr] = [const expr], divide; [var expr]*[const expr] = [const expr], divide; a/b*x=c; a/b*x=c, reciprocal; ax/b=c, reciprocal; ax/b=c; x/a=b; ax=b; (+/-x +/-a)/b=c, mult; a=x*(b+c); a=x*(b-c); a=x*(b*c+d); x/a+b=c, multiply; [var expr]/[const expr] = [const expr], multiply}]~~[SkillRule: Remove negative coefficient; {ax/b=c, reciprocal; ax/b=c; ax=b; x/a=b}]",1~~1
4,"[SkillRule: Remove constant; {ax+b=c, positive; ax+b=c, negative; x+a=b, positive; x+a=b, negative; [var expr]+[const expr]=[const expr], positive; [var expr]+[const expr]=[const expr], negative; [var expr]+[const expr]=[const expr], all; Combine constants to right; Combine constants to left; a-x=b, positive; a/x+b=c, positive; a/x+b=c, negative}]~~[SkillRule: ax+b=c, negative; ax+b=c, negative]",3~~1


I have split the strings into lists as follows:

In [30]:
df['KC(Default)'] = df['KC(Default)'].str.split('~~') #split string into list
df['Opportunity(Default)'] = df['Opportunity(Default)'].str.split('~~') #split string into list
df.head(5)[['KC(Default)', 'Opportunity(Default)']]

Unnamed: 0,KC(Default),Opportunity(Default)
0,"[[SkillRule: Eliminate Parens; {CLT nested; CLT nested, parens; Distribute Mult right; Distribute Mult left; (+/-x +/-a)/b=c, mult; (+/-x +/-a)*b=c, div; [var expr]/[const expr] = [const expr], multiply; Distribute Division left; Distribute Division right; Distribute both mult left; Distribute both mult right; Distribute both divide left; Distribute both divide right; Distribute subex}]]",[1]
1,"[[SkillRule: Remove constant; {ax+b=c, positive; ax+b=c, negative; x+a=b, positive; x+a=b, negative; [var expr]+[const expr]=[const expr], positive; [var expr]+[const expr]=[const expr], negative; [var expr]+[const expr]=[const expr], all; Combine constants to right; Combine constants to left; a-x=b, positive; a/x+b=c, positive; a/x+b=c, negative}], [SkillRule: Isolate positive; x+a=b, positive]]","[1, 1]"
2,"[[SkillRule: Remove constant; {ax+b=c, positive; ax+b=c, negative; x+a=b, positive; x+a=b, negative; [var expr]+[const expr]=[const expr], positive; [var expr]+[const expr]=[const expr], negative; [var expr]+[const expr]=[const expr], all; Combine constants to right; Combine constants to left; a-x=b, positive; a/x+b=c, positive; a/x+b=c, negative}]]",[2]
3,"[[SkillRule: Remove coefficient; {ax+b=c, divide; ax=b; [const expr]*[var fact] + [const expr] = [const expr], divide; [var expr]*[const expr] = [const expr], divide; a/b*x=c; a/b*x=c, reciprocal; ax/b=c, reciprocal; ax/b=c; x/a=b; ax=b; (+/-x +/-a)/b=c, mult; a=x*(b+c); a=x*(b-c); a=x*(b*c+d); x/a+b=c, multiply; [var expr]/[const expr] = [const expr], multiply}], [SkillRule: Remove negative coefficient; {ax/b=c, reciprocal; ax/b=c; ax=b; x/a=b}]]","[1, 1]"
4,"[[SkillRule: Remove constant; {ax+b=c, positive; ax+b=c, negative; x+a=b, positive; x+a=b, negative; [var expr]+[const expr]=[const expr], positive; [var expr]+[const expr]=[const expr], negative; [var expr]+[const expr]=[const expr], all; Combine constants to right; Combine constants to left; a-x=b, positive; a/x+b=c, positive; a/x+b=c, negative}], [SkillRule: ax+b=c, negative; ax+b=c, negative]]","[3, 1]"


For parts of the analysis, I have used a multi-hot encoding of the KC variable (a one-hot encoding extended to rows that list several KCs), which the following code creates:

In [12]:
df = df.drop(columns='KC(Default)').join(df['KC(Default)'].str.join('|').str.get_dummies()) #combined dummies for KCs
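
To make the encoding step concrete, here is a minimal sketch on a toy column. The skill names are made up for illustration; only the `~~` separator and the `str.join('|').str.get_dummies()` chain come from the code above.

```python
import pandas as pd

# Toy KC strings in the same "~~"-separated format (skill names are invented).
toy = pd.DataFrame({"KC(Default)": ["add~~subtract", "add", None, "multiply~~add"]})

# Split each string into a list of KCs; missing values stay missing.
kc_lists = toy["KC(Default)"].str.split("~~")

# Re-join each list with "|" so get_dummies (default separator "|") can build
# one indicator column per distinct KC.
multi_hot = kc_lists.str.join("|").str.get_dummies()

print(multi_hot)
```

Rows with no KC end up as all zeros. One caveat of this approach: a KC name that itself contains a `|` character would be split into two columns, so it is worth checking the names before relying on it.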

I have also created shorter, more descriptive labels for the KCs, stored in the following lookup table:

In [19]:
df_dict = pd.read_pickle('kclistpickle.p')
df_dict.sort_values(by=['Num_KCs'], ascending=False).head(5)

Unnamed: 0,KC,Num_KCs,Label
17,Entering a given,76894,Entering a given
85,"[SkillRule: Remove coefficient; {ax+b=c, divid...",62421,SkillRule: Remove coefficient
86,"[SkillRule: Remove constant; {ax+b=c, positive...",53947,SkillRule: Remove constant
60,Using small numbers,51002,Using small numbers
59,Using simple numbers,50830,Using simple numbers


In [22]:
d = dict(zip(df_dict['KC'],df_dict['Label'])) #make dictionary
df = df.rename(columns = d) #rename columns
df = df.drop(columns='Correct First Attempt').join(df['Correct First Attempt']) #make CFA the last column
df.columns = df.columns.str.lower() #make column names lowercase
df.head(5)

Unnamed: 0,changing axis bounds,changing axis intervals,choose graphical refl-v,choose graphical a,choose graphical h,choose graphical k,"convert unit, mixed","convert unit, multiplier","convert unit, standard",correctly placing points,define variable,edit algebraic a,edit algebraic h,edit algebraic k,edit algebraic refl-v,entering a computed linear value,entering a computed quadratic value,entering a given,entering a given linear value,entering a given quadratic value,entering a point,"entering slope, glf","entering slope, sif",entering the slope,entering the y-intercept,"entering x-intercept, glf","entering x-intercept, sif","entering y-intercept, glf","entering y-intercept, sif",excluding the line when shading,"find x, simple","find x, negative slope","find x, positive slope","find y, simple","find y, any form","find y, negative slope","find y, positive slope",identify parent curve,identify parent description,identify parent equation,identifying units,including the line when shading,labelling point of intersection,labelling the axes,"negative constant, glf","negative constant, sif",placing coordinate point,"positive constants, glf","positive constants, sif",setting the slope,setting the y-intercept,shading glf equation with negative slope,shading glf equation with positive slope,shading sif equation with negative slope,shading sif equation with positive slope,shading greater than,shading less than,using difficult numbers,using large numbers,using simple numbers,using small numbers,"write expression, initial and change","write expression, initial and point","write expression, negative slope","write expression, positive slope","write expression, quadratic","write expression, ratio","write expression, two points",skillrule: add/subtract,skillrule: apply exponent,skillrule: calculate eliminate parens,skillrule: calculate negative coefficient,skillrule: consolidate vars with coeff,"skillrule: consolidate vars, any","skillrule: consolidate vars, no coeff",skillrule: do multiply - whole nested,skillrule: done?,skillrule: eliminate parens 1,skillrule: eliminate parens 2,skillrule: extract to consolidate vars,skillrule: isolate negative,skillrule: isolate positive,skillrule: make variable positive 1,skillrule: make variable positive 2,skillrule: multiply/divide,skillrule: remove coefficient,skillrule: remove constant,skillrule: remove negative coefficient,skillrule: remove positive coefficient,skillrule: select combine terms,skillrule: select eliminate parens,"skillrule: select multiply/divide, nested",skillrule: select multiply,skillrule: variable in denominator,"skillrule: ax+b=c, negative",skillrule: done infinite solutions,skillrule: done no solutions,skillrule: invert-mult,combine-like-terms-r-sp,combine-like-terms-sp,combine-like-terms-whole-sp,distribute-sp,factor-quadratic-sp,factor-sp,perform-mult-r-sp,perform-mult-row2-sp,perform-mult-sp,perform-mult-whole-sp,qft-den-sp,qft-num1-sp,qft-num2-sp,simplify-fractions-sp,correct first attempt
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


For other parts of the analysis, I have replaced the one-hot indicator in each KC column with the corresponding Opportunity Count (OC) for that KC. The following code builds, for each row, a dictionary that maps KC names to their OC values:

In [31]:
head = df[['Row', 'KC(Default)', 'Opportunity(Default)']]
d = []  # one {KC name: opportunity count} dictionary per row
for kcs, ocs in zip(head['KC(Default)'], head['Opportunity(Default)']):
    if np.all(pd.isnull(kcs)):
        d.append({})  # rows without any KC get an empty dictionary
    else:
        d.append(dict(zip(kcs, ocs)))  # pair each KC with its opportunity count
s = pd.Series(d)
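
Note that the values produced this way are still strings (for example `'1'`), since they come straight from the split text. A possible variant that converts them to integers and guards against mismatched list lengths is sketched below. The helper name `to_oc_dict` and the toy skill names are hypothetical.

```python
import pandas as pd

# Toy rows in the already-split format; None stands for a step with no KC.
toy = pd.DataFrame({
    "KC(Default)": [["add", "subtract"], ["add"], None],
    "Opportunity(Default)": [["3", "1"], ["4"], None],
})

def to_oc_dict(kcs, ocs):
    """Map each KC name to its integer opportunity count; missing KCs give {}."""
    if not isinstance(kcs, list):
        return {}
    if len(kcs) != len(ocs):
        # zip() would silently drop the extra items, so fail loudly instead.
        raise ValueError("KC and opportunity lists differ in length")
    return {kc: int(oc) for kc, oc in zip(kcs, ocs)}

nested = pd.Series(
    [to_oc_dict(k, o) for k, o in zip(toy["KC(Default)"], toy["Opportunity(Default)"])],
    index=toy.index,
)

print(nested.tolist())
```

Converting to integers at this stage avoids a later `astype(int)` and makes sums over the OC values behave as expected.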

In [36]:
s = s.to_frame(name='nested')  # the dictionaries become a column named 'nested'
head = head.join(s)
head = head.drop(columns=['KC(Default)', 'Opportunity(Default)'])

In [38]:
head.head(5)

Unnamed: 0,Row,nested
0,1,"{'[SkillRule: Eliminate Parens; {CLT nested; CLT nested, parens; Distribute Mult right; Distribute Mult left; (+/-x +/-a)/b=c, mult; (+/-x +/-a)*b=c, div; [var expr]/[const expr] = [const expr], multiply; Distribute Division left; Distribute Division right; Distribute both mult left; Distribute both mult right; Distribute both divide left; Distribute both divide right; Distribute subex}]': '1'}"
1,2,"{'[SkillRule: Remove constant; {ax+b=c, positive; ax+b=c, negative; x+a=b, positive; x+a=b, negative; [var expr]+[const expr]=[const expr], positive; [var expr]+[const expr]=[const expr], negative; [var expr]+[const expr]=[const expr], all; Combine constants to right; Combine constants to left; a-x=b, positive; a/x+b=c, positive; a/x+b=c, negative}]': '1', '[SkillRule: Isolate positive; x+a=b, positive]': '1'}"
2,3,"{'[SkillRule: Remove constant; {ax+b=c, positive; ax+b=c, negative; x+a=b, positive; x+a=b, negative; [var expr]+[const expr]=[const expr], positive; [var expr]+[const expr]=[const expr], negative; [var expr]+[const expr]=[const expr], all; Combine constants to right; Combine constants to left; a-x=b, positive; a/x+b=c, positive; a/x+b=c, negative}]': '2'}"
3,4,"{'[SkillRule: Remove coefficient; {ax+b=c, divide; ax=b; [const expr]*[var fact] + [const expr] = [const expr], divide; [var expr]*[const expr] = [const expr], divide; a/b*x=c; a/b*x=c, reciprocal; ax/b=c, reciprocal; ax/b=c; x/a=b; ax=b; (+/-x +/-a)/b=c, mult; a=x*(b+c); a=x*(b-c); a=x*(b*c+d); x/a+b=c, multiply; [var expr]/[const expr] = [const expr], multiply}]': '1', '[SkillRule: Remove negative coefficient; {ax/b=c, reciprocal; ax/b=c; ax=b; x/a=b}]': '1'}"
4,5,"{'[SkillRule: Remove constant; {ax+b=c, positive; ax+b=c, negative; x+a=b, positive; x+a=b, negative; [var expr]+[const expr]=[const expr], positive; [var expr]+[const expr]=[const expr], negative; [var expr]+[const expr]=[const expr], all; Combine constants to right; Combine constants to left; a-x=b, positive; a/x+b=c, positive; a/x+b=c, negative}]': '3', '[SkillRule: ax+b=c, negative; ax+b=c, negative]': '1'}"


The following "unpacks" the KC and OC columns by exploding the KC lists into separate columns and applying the corresponding OC values for those columns in place of one-hot indicators.

In [39]:
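def unpack(df, column):
    # Expand a column of dictionaries into one column per key (0 where a key is absent)
    expanded = pd.DataFrame(df[column].tolist(), index=df.index).fillna(0)
    ret = pd.concat([df, expanded], axis=1)
    ret = ret.drop(columns=column)
    return ret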

head = head.applymap(lambda x: {} if pd.isnull(x) else x)
df_unpacked = unpack(head, 'nested')
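
A small worked example may help show what the unpacking produces. The three dictionaries and the `Row` values below are toy data, and the OC values are assumed to be integers already.

```python
import pandas as pd

# Toy column of {KC name: opportunity count} dictionaries, one per row.
nested = pd.Series([{"add": 3, "subtract": 1}, {"add": 4}, {}], name="nested")
rows = pd.DataFrame({"Row": [1, 2, 3]})

# One column per key that appears in any dictionary; absent keys become NaN, then 0.
wide = pd.DataFrame(nested.tolist(), index=nested.index).fillna(0).astype(int)

# Place the expanded KC columns next to the row identifiers.
unpacked = pd.concat([rows, wide], axis=1)

print(unpacked)
```

Each KC column now holds the opportunity count where the KC applies and 0 where it does not, so a nonzero entry plays the role that a 1 plays in the one-hot version.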

In [44]:
df_unpacked.tail(5)

Unnamed: 0,Row,Changing axis bounds,Changing axis intervals,Choose Graphical Refl-v,Choose Graphical a,Choose Graphical h,Choose Graphical k,"Convert unit, mixed","Convert unit, multiplier","Convert unit, standard",...,factor-quadratic-sp,factor-sp,perform-mult-r-sp,perform-mult-row2-sp,perform-mult-sp,perform-mult-whole-sp,qft-den-sp,qft-num1-sp,qft-num2-sp,simplify-fractions-sp
809689,1080612,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
809690,1080613,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
809691,1080614,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
809692,1080615,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
809693,1080616,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


I have used these exploded dataframes mostly for data exploration and feature engineering, and they will not necessarily be useful as direct input to a learning model. So far, the two derived features I have used most to illustrate and analyze the dataset are the number of KCs in each row and the sum of the OC values in each row.

In [45]:
df_kc = pd.read_pickle('mtunpacked.p')
df_sum = pd.read_pickle('mtpickle.p')

In [46]:
df_kc['Correct First Attempt'] = df_sum['Correct First Attempt']
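df_kc['Sum_OCs'] = df_kc.iloc[:, 1:113].astype(int).sum(axis=1) #sum of OC values over the KC columns (positions 1-112)
df_kc['Num_KCs'] = (df_sum.iloc[:, 18:130] == 1).sum(axis=1) #count of one-hot KC indicators (positions 18-129 of df_sum)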

In [49]:
df_kc.tail(5)

Unnamed: 0,Row,Changing axis bounds,Changing axis intervals,Choose Graphical Refl-v,Choose Graphical a,Choose Graphical h,Choose Graphical k,"Convert unit, mixed","Convert unit, multiplier","Convert unit, standard",...,perform-mult-row2-sp,perform-mult-sp,perform-mult-whole-sp,qft-den-sp,qft-num1-sp,qft-num2-sp,simplify-fractions-sp,Correct First Attempt,Sum_OCs,Num_KCs
809689,1080612,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,5,1
809690,1080613,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,6,1
809691,1080614,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,5,2
809692,1080615,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
809693,1080616,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,8,2
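
The two derived features can also be computed from the unpacked frame alone, selecting the KC columns by name instead of by position. The sketch below uses a toy frame and assumes that opportunity counts start at 1, so that a value above 0 means the KC is present in that row.

```python
import pandas as pd

# Toy "unpacked" frame: each KC column holds the opportunity count, 0 when the KC is absent.
unpacked = pd.DataFrame({
    "Row": [1, 2, 3],
    "add": [3, 4, 0],
    "subtract": [1, 0, 0],
})

# Select the KC columns by name so the code does not depend on column positions.
kc_cols = [c for c in unpacked.columns if c != "Row"]

# Sum of the opportunity counts in each row.
unpacked["Sum_OCs"] = unpacked[kc_cols].sum(axis=1)
# Number of KCs in each row, counted as the number of nonzero opportunity counts.
unpacked["Num_KCs"] = (unpacked[kc_cols] > 0).sum(axis=1)

print(unpacked)
```

Selecting by name is a little more verbose than slicing by position, but it keeps working if columns are added or reordered upstream.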
