
## <font color = 'cornflowerblue' size=4>Introduction</font>

In the previous notebook, student-based features were explored, and baselines were obtained by modeling with student-based features only (about 30% precision at 30% recall).

In this notebook, word-based features from Q1 (derived word features, one-hot encodings, language dummies, and word vectors) are added to the dataset.

Along with the Q2 training set, the Q2 test set for unseen students and the Q3 test sets for new languages are also preprocessed, so that the same preprocessing pipeline is applied to all of them.

Classical models, ensembles, and neural networks will be used to optimize precision at a recall of 30%.

Lastly, the best model obtained is saved and tested on the Q2 test set (unseen students) and the Q3 test sets (new languages).
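The target metric, precision at a recall of 30%, can be made concrete with a small sketch. The function name `precision_at_recall` and the toy labels and scores below are assumptions for illustration, not the evaluation code used in this project. The sketch scans every score as a candidate threshold and keeps the best precision among thresholds whose recall is at least the requested value.

```python
import numpy as np

def precision_at_recall(y_true, y_score, min_recall=0.30):
    """Best precision over all thresholds whose recall is >= min_recall.

    Hypothetical helper; assumes y_true contains at least one positive.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = y_true.sum()
    best = 0.0
    for thr in np.unique(y_score):          # each distinct score is a candidate threshold
        pred = y_score >= thr               # predicted positives at this threshold
        tp = np.sum(pred & y_true)          # true positives
        if tp / n_pos >= min_recall:
            best = max(best, tp / pred.sum())
    return best

# Toy example: labels and model scores are made up.
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.1]
p_at_30 = precision_at_recall(toy_labels, toy_scores, min_recall=0.30)
p_at_50 = precision_at_recall(toy_labels, toy_scores, min_recall=0.50)
print(p_at_30, p_at_50)
```

Raising the recall requirement can only lower (or keep) the achievable precision, which is why the recall level is fixed before models are compared.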





## <font color = 'cornflowerblue' size=4>Loading data and computing aggregates</font>

In [1]:
import warnings
warnings.filterwarnings('ignore')


import bz2
import pickle
import _pickle as cPickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from google.colab import drive
drive.mount('/content/drive')

def decompress_pickle(file):
    # read a bz2-compressed pickle, closing the file handle when done
    with bz2.BZ2File(file, 'rb') as f:
        return cPickle.load(f)

def compressed_pickle(title, data):  # do not add extension in filename
    with bz2.BZ2File(title + '.pbz2', 'w') as f:
        cPickle.dump(data, f)

path_name = '/content/drive/MyDrive/'

Mounted at /content/drive
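As a quick sanity check of the two helpers, the following sketch round-trips a small object through a temporary directory. It uses the standard `pickle` module instead of `_pickle`, and the dictionary contents are made up; it is meant only to show that what is written with `compressed_pickle` comes back unchanged from `decompress_pickle`.

```python
import bz2
import os
import pickle
import tempfile

def compressed_pickle(title, data):
    """Write `data` to '<title>.pbz2' as a bz2-compressed pickle."""
    with bz2.BZ2File(title + ".pbz2", "w") as f:
        pickle.dump(data, f)

def decompress_pickle(file):
    """Read a bz2-compressed pickle and close the file afterwards."""
    with bz2.BZ2File(file, "rb") as f:
        return pickle.load(f)

# Round trip with made-up data in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    stem = os.path.join(tmp, "example")      # no extension, as in the notebook
    original = {"lexeme_id": ["a", "b"], "EnglishIDF": [1.5, 2.5]}
    compressed_pickle(stem, original)
    restored = decompress_pickle(stem + ".pbz2")

print(restored == original)
```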


Loading q2train dataset:

In [42]:
# Loading q2 train set with student features based on 11 days history
q2train11d_sfonly = decompress_pickle(path_name+"Q2trainX11dayhist.pbz2")

# Loading q2 test set with student features based on 11 days history
q2test11d_sfonly = decompress_pickle(path_name+"Q2test11dayhist.pbz2")

# Loading q3 test english to german with 11 day history based student features
q3test11d_en_to_de_sfonly = decompress_pickle(path_name+"Q3test_en_to_de_11dayhist.pbz2")

# Loading q3 test italian to english with 11 day history based student features
q3test11d_it_to_en_sfonly = decompress_pickle(path_name+"Q3test_it_to_en_11dayhist.pbz2")

In [43]:
q2train11d_sfonly.shape, q2test11d_sfonly.shape, q3test11d_en_to_de_sfonly.shape, q3test11d_it_to_en_sfonly.shape

((25753, 25), (3913, 25), (3212, 25), (1417, 25))

In [35]:
q2train11d_sfonly.columns

Index(['timestamp', 'delta', 'user_id', 'learning_language', 'ui_language',
       'lexeme_id', 'lexeme_string', 'history_seen', 'history_correct',
       'session_seen', 'session_correct', 'p_forgot_bin', 'simoverdiff',
       'lang_frozenset', 'Datetime', 'delta_days', 'history_frac', 'Date_x',
       'user_date_tup_x', 'Date_y', 'user_date_tup_y', 'avgp_forgot_day',
       'avg_history_frac', 'numwordspracticed_day', 'avgdelta_day'],
      dtype='object')

The dataframes above already contain the student features.

Word-based features will be added to them by joining on the lexeme id.

The dates range from March 10 to March 12, 2013, since 10 days of history from the dataset were used.

In [36]:
q2train11d_sfonly['Date_x'].min(), q2train11d_sfonly['Date_x'].max()

(datetime.date(2013, 3, 10), datetime.date(2013, 3, 12))

## <font color='cornflowerblue' size=4>Preprocessing all sets</font>

1. Splitting the Q2 training set 95-5 by user into training and validation sets, so that the validation set contains only students unseen during training.
2. Adding the word based features from Q1 to the following sets:
- q2 train
- q2 validation
- q2 test set
- q3 test English to German
- q3 test Italian to English

X and y are then saved separately for modeling.

In [168]:
users_list = q2train11d_sfonly['user_id'].unique()

usersq2test_list = q2test11d_sfonly['user_id'].unique()
usersq3test_en_to_de_list = q3test11d_en_to_de_sfonly['user_id'].unique()
usersq3test_it_to_en_list = q3test11d_it_to_en_sfonly['user_id'].unique()


Since the validation set should consist of unseen students, the list of users (rather than the rows) is split 95-5 first.

In [169]:
np.random.seed(123)
userstrain_list = np.random.choice(users_list,int(0.95*len(users_list)),replace=False) # sample from list without replacement
usersvalid_list = list(set(users_list).difference(userstrain_list))

In [170]:
# splitting the dataframe based on the users:
q2train = q2train11d_sfonly.loc[q2train11d_sfonly['user_id'].isin(userstrain_list),:]


q2valid = q2train11d_sfonly.loc[q2train11d_sfonly['user_id'].isin(usersvalid_list),:]


# print shapes of the dataframes after splitting by users.
q2train.shape, q2valid.shape

((24409, 25), (1344, 25))
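The user-level split above can be sketched as a small reusable function with an explicit check that no user appears on both sides. The function name `split_by_user`, the use of `np.random.default_rng`, and the toy dataframe are assumptions for illustration; the notebook itself uses `np.random.seed` and `np.random.choice`.

```python
import numpy as np
import pandas as pd

def split_by_user(df, user_col="user_id", train_frac=0.95, seed=123):
    """Split rows so that every user lands in exactly one of the two parts."""
    users = np.asarray(df[user_col].unique())
    rng = np.random.default_rng(seed)
    n_train = int(train_frac * len(users))
    train_users = rng.choice(users, size=n_train, replace=False)
    in_train = df[user_col].isin(train_users)
    return df.loc[in_train].copy(), df.loc[~in_train].copy()

# Toy data: 20 made-up users with 3 rows each.
toy = pd.DataFrame({
    "user_id": np.repeat([f"u{i}" for i in range(20)], 3),
    "p_forgot_bin": np.tile([0, 1, 0], 20),
})
train_part, valid_part = split_by_user(toy)

# No user should appear in both parts.
overlap = set(train_part["user_id"]) & set(valid_part["user_id"])
print(len(overlap), train_part.shape, valid_part.shape)
```

Splitting on users rather than rows is what makes the validation score an estimate of performance on unseen students.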

Getting fraction of positive class samples in training and validation sets:

In [171]:
q2train['p_forgot_bin'].sum()/q2train.shape[0], q2valid['p_forgot_bin'].sum()/q2valid.shape[0]

(0.15318120365438978, 0.11383928571428571)

In [172]:
q2test11d_sfonly['p_forgot_bin'].sum()/q2test11d_sfonly.shape[0]

0.16994633273703041

In [173]:
q3test11d_en_to_de_sfonly['p_forgot_bin'].sum()/q3test11d_en_to_de_sfonly.shape[0]

0.16220423412204235

In [174]:
q3test11d_it_to_en_sfonly['p_forgot_bin'].sum()/q3test11d_it_to_en_sfonly.shape[0]

0.29992942836979536

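The validation set (about 11% positive) and the Q3 Italian-to-English test set (about 30% positive) differ noticeably from the training set (about 15% positive); the Q2 test set and the Q3 English-to-German test set (about 16-17%) are close to it.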
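The per-dataset fractions computed above can be collected in one place with a short helper. This is a minimal sketch: `positive_rates` and the toy dataframes are hypothetical, and it relies on the fact that the mean of a 0/1 column equals the sum divided by the number of rows.

```python
import pandas as pd

def positive_rates(datasets, target="p_forgot_bin"):
    """Return the positive-class fraction of each named DataFrame."""
    return pd.Series({name: df[target].mean() for name, df in datasets.items()})

# Toy stand-ins for the real train/validation dataframes.
toy_sets = {
    "train": pd.DataFrame({"p_forgot_bin": [0, 0, 0, 1]}),
    "valid": pd.DataFrame({"p_forgot_bin": [0, 1]}),
}
rates = positive_rates(toy_sets)
print(rates)
```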

Adding word based features:

In [29]:
all_lexemes = decompress_pickle(path_name+"Duolingo_all_lexemes.pbz2")

In [175]:
# getting lexemes list for each dataset
lexemestrain_list = q2train['lexeme_id'].unique()
lexemesvalid_list = q2valid['lexeme_id'].unique()

lexemesq2test_list = q2test11d_sfonly['lexeme_id'].unique()
lexemesq3test_en_to_de_list = q3test11d_en_to_de_sfonly['lexeme_id'].unique()
lexemesq3test_it_to_en_list = q3test11d_it_to_en_sfonly['lexeme_id'].unique()

In [176]:
q2trainlexemes = all_lexemes.loc[all_lexemes['lexeme_id'].isin(lexemestrain_list),:]
q2validlexemes = all_lexemes.loc[all_lexemes['lexeme_id'].isin(lexemesvalid_list),:]

q2testlexemes = all_lexemes.loc[all_lexemes['lexeme_id'].isin(lexemesq2test_list),:]

q3test_en_to_de_lexemes = all_lexemes.loc[all_lexemes['lexeme_id'].isin(lexemesq3test_en_to_de_list),:]

q3test_it_to_en_lexemes = all_lexemes.loc[all_lexemes['lexeme_id'].isin(lexemesq3test_it_to_en_list),:]

Imputing null values of English IDF:

In [177]:
q2trainlexemes['EnglishIDF'].isna().sum()

29

As was done earlier, the median IDF value computed from the training lexemes is used to impute missing values in the validation and test sets as well.

In [178]:
# calculating median value from train set lexemes data
medianIDF = q2trainlexemes['EnglishIDF'].median()

# imputing all datasets with median IDF from training set. 
q2trainlexemes.loc[q2trainlexemes['EnglishIDF'].isna(),'EnglishIDF'] = medianIDF
q2validlexemes.loc[q2validlexemes['EnglishIDF'].isna(),'EnglishIDF'] = medianIDF

q2testlexemes.loc[q2testlexemes['EnglishIDF'].isna(),'EnglishIDF'] = medianIDF

q3test_en_to_de_lexemes.loc[q3test_en_to_de_lexemes['EnglishIDF'].isna(),'EnglishIDF'] = medianIDF
q3test_it_to_en_lexemes.loc[q3test_it_to_en_lexemes['EnglishIDF'].isna(),'EnglishIDF'] = medianIDF
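The imputation pattern used here (compute the statistic on the training set only, then apply it everywhere) can be sketched as follows. The helper name `impute_with_train_median` and the toy values are assumptions for illustration; `assign` returns new dataframes, so the inputs are left untouched.

```python
import numpy as np
import pandas as pd

def impute_with_train_median(train, others, col="EnglishIDF"):
    """Fill missing `col` values in all frames with the training median."""
    median = train[col].median()  # NaN values are skipped by default
    filled = [df.assign(**{col: df[col].fillna(median)}) for df in [train, *others]]
    return median, filled

# Toy data with made-up IDF values.
train = pd.DataFrame({"EnglishIDF": [1.0, 3.0, np.nan, 5.0]})
test = pd.DataFrame({"EnglishIDF": [np.nan, 10.0]})

median, (train_f, test_f) = impute_with_train_median(train, [test])
print(median, test_f["EnglishIDF"].tolist())
```

Using the training median for every set keeps information from the validation and test sets out of the preprocessing step.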

One-hot encoding of categoricals:


In [179]:
from sklearn.preprocessing import OneHotEncoder

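# Note: scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`; use that name on newer versions.
enc_lang = OneHotEncoder(sparse=False,handle_unknown='ignore')
enc_pos = OneHotEncoder(sparse = False,handle_unknown = 'ignore')
enc_mods = OneHotEncoder(sparse=False,handle_unknown='ignore')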

# get one hot encoded part of speech:
enc_pos.fit(np.array(q2trainlexemes['pos']).reshape(-1, 1))
q2train_pos = pd.DataFrame(enc_pos.transform(np.array(q2trainlexemes['pos']).reshape(-1, 1)),index=q2trainlexemes.index)

q2valid_pos = pd.DataFrame(enc_pos.transform(np.array(q2validlexemes['pos']).reshape(-1, 1)),index=q2validlexemes.index)

q2test_pos = pd.DataFrame(enc_pos.transform(np.array(q2testlexemes['pos']).reshape(-1, 1)),index=q2testlexemes.index)

q3test_en_to_de_pos = pd.DataFrame(enc_pos.transform(np.array(q3test_en_to_de_lexemes['pos']).reshape(-1, 1)),index=q3test_en_to_de_lexemes.index)
q3test_it_to_en_pos = pd.DataFrame(enc_pos.transform(np.array(q3test_it_to_en_lexemes['pos']).reshape(-1, 1)),index=q3test_it_to_en_lexemes.index)




In [180]:
# get one hot encoded modstrings
q2trainlexemes['modstrings'].head()

7                []
20          [m, pl]
21    [pri, p1, sg]
22          [m, sg]
23               []
Name: modstrings, dtype: object

Since modstrings are saved as a list of strings, some more processing needs to be done before passing to the one hot encoder. 

In [182]:
# explode the lists so that each modifier gets its own record; empty lists become NaN,
# which the encoder treats as its own category (the mod_x0_nan column).
enc_mods.fit(np.array(q2trainlexemes['modstrings'].explode()).reshape(-1,1))

# q2 train set mod one-hots
q2train_mods_exploded =enc_mods.transform(np.array(q2trainlexemes['modstrings'].explode()).reshape(-1,1))
q2train_modsdf_exploded = pd.DataFrame(q2train_mods_exploded, index = q2trainlexemes['modstrings'].explode().index)
# group the exploded one-hot dataframe by the q2trainlexemes index and sum up the exploded records (since one word can have many modifiers in its list)
q2train_modsdf = q2train_modsdf_exploded.groupby(q2train_modsdf_exploded.index).sum()

# for q2 valid set:
q2valid_mods_exploded =enc_mods.transform(np.array(q2validlexemes['modstrings'].explode()).reshape(-1,1))
q2valid_modsdf_exploded = pd.DataFrame(q2valid_mods_exploded, index = q2validlexemes['modstrings'].explode().index)
q2valid_modsdf = q2valid_modsdf_exploded.groupby(q2valid_modsdf_exploded.index).sum()

# for q2 test
q2test_mods_exploded =enc_mods.transform(np.array(q2testlexemes['modstrings'].explode()).reshape(-1,1))
q2test_modsdf_exploded = pd.DataFrame(q2test_mods_exploded, index = q2testlexemes['modstrings'].explode().index)
q2test_modsdf = q2test_modsdf_exploded.groupby(q2test_modsdf_exploded.index).sum()

# for q3 test english to german:
q3test_entode_mods_exploded =enc_mods.transform(np.array(q3test_en_to_de_lexemes['modstrings'].explode()).reshape(-1,1))
q3test_entode_modsdf_exploded = pd.DataFrame(q3test_entode_mods_exploded, index = q3test_en_to_de_lexemes['modstrings'].explode().index)
q3test_entode_modsdf = q3test_entode_modsdf_exploded.groupby(q3test_entode_modsdf_exploded.index).sum()

# for q3 test italian to english:
q3test_ittoen_mods_exploded =enc_mods.transform(np.array(q3test_it_to_en_lexemes['modstrings'].explode()).reshape(-1,1))
q3test_ittoen_modsdf_exploded = pd.DataFrame(q3test_ittoen_mods_exploded, index = q3test_it_to_en_lexemes['modstrings'].explode().index)
q3test_ittoen_modsdf = q3test_ittoen_modsdf_exploded.groupby(q3test_ittoen_modsdf_exploded.index).sum()


In [183]:
q2train_modsdf.shape, q2trainlexemes.shape

((4696, 60), (4696, 17))

In [184]:
q2valid_modsdf.shape, q2validlexemes.shape

((716, 60), (716, 17))

In [185]:
q3test_ittoen_modsdf.shape, q3test_it_to_en_lexemes.shape

((640, 60), (640, 17))

After summing, each mods dataframe has the same number of rows as its corresponding lexemes dataframe (for example, 4696 rows for q2trainlexemes).
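The explode, encode, and sum-by-index steps can be illustrated on a toy series. Note that this sketch swaps in `pd.get_dummies` for the `OneHotEncoder` used in the notebook, and the helper name `multi_hot` is hypothetical. Unknown modifiers in later sets are dropped by reindexing to the training columns, which mimics `handle_unknown='ignore'`.

```python
import pandas as pd

def multi_hot(list_series, columns=None, prefix="mod"):
    """Turn a Series of string lists into one multi-hot row per original index."""
    exploded = list_series.explode()  # one row per modifier; empty lists become NaN
    dummies = pd.get_dummies(exploded, prefix=prefix, dummy_na=True, dtype=float)
    summed = dummies.groupby(level=0).sum()  # collapse back to one row per word
    if columns is not None:
        # align to the training vocabulary; unseen modifiers are discarded
        summed = summed.reindex(columns=columns, fill_value=0.0)
    return summed

# Toy modifier lists (made up, in the style of the modstrings column).
train_mods = pd.Series([[], ["m", "pl"], ["pri", "p1", "sg"]], index=[7, 20, 21])
valid_mods = pd.Series([["sg", "f"], []], index=[3, 4])

train_hot = multi_hot(train_mods)
valid_hot = multi_hot(valid_mods, columns=train_hot.columns)
print(train_hot.shape, valid_hot.shape)
```

Because the rows are summed per original index, the result always has one row per word, which is the property checked by the shape comparisons above.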

Now the one-hot part-of-speech and modifier columns are combined with the numerical word features to make the first feature set.

In [186]:
q2trainfeatureset1 = pd.concat([q2trainlexemes[['sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF','lexeme_id']], 
                                q2train_pos, 
                                q2train_modsdf, 
                                ],axis=1)


q2validfeatureset1 = pd.concat([q2validlexemes[['sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF','lexeme_id']], 
                                q2valid_pos, 
                                q2valid_modsdf, 
                                ],axis=1)

q2testfeatureset1 = pd.concat([q2testlexemes[['sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF','lexeme_id']], 
                                q2test_pos, 
                                q2test_modsdf, 
                                ],axis=1)

q3test_en_to_de_featureset1 = pd.concat([q3test_en_to_de_lexemes[['sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF','lexeme_id']], 
                                q3test_en_to_de_pos, 
                                q3test_entode_modsdf, 
                                ],axis=1)

q3test_it_to_en_featureset1 = pd.concat([q3test_it_to_en_lexemes[['sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF','lexeme_id']], 
                                q3test_it_to_en_pos, 
                                q3test_ittoen_modsdf, 
                                ],axis=1)


In [187]:
q2trainfeatureset1.shape, q3test_it_to_en_featureset1.shape

((4696, 117), (640, 117))

In [188]:
q2trainfeatureset1.columns = ['sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF','lexeme_id']  + ['pos_' + c for c in list(enc_pos.get_feature_names_out())] + ['mod_' + c for c in list(enc_mods.get_feature_names_out())]
q2validfeatureset1.columns = q2trainfeatureset1.columns
q2testfeatureset1.columns = q2trainfeatureset1.columns
q3test_en_to_de_featureset1.columns = q2trainfeatureset1.columns
q3test_it_to_en_featureset1.columns = q2trainfeatureset1.columns

In [189]:
len(list(q2trainfeatureset1.columns)) - len(set(q2trainfeatureset1.columns))

0

In [190]:
q2trainfeatureset1.head()

| | sf_length | L_dist_sf_noaccents_norm | EnglishIDF | lexeme_id | pos_x0_@adv:a_part | pos_x0_@adv:a_peu_pres | pos_x0_@adv:au_moins | pos_x0_@adv:en_general | pos_x0_@adv:por_favor | pos_x0_@adv:por_supuesto | ... | mod_x0_prs | mod_x0_qnt | mod_x0_ref | mod_x0_sg | mod_x0_sint | mod_x0_sp | mod_x0_subj | mod_x0_sup | mod_x0_tn | mod_x0_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 4 | 0.5 | 3.733996 | 73eecb492ca758ddab5371cf7b5cca32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 20 | 5 | 0.6 | 10.981924 | c84476c460737d9fb905dca3d35ec995 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 21 | 3 | 0.0 | 4.051187 | 1a913f2ded424985b9c02d0436008511 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 22 | 5 | 0.8 | 0.438409 | 38b770e66595fea718366523b4f7db3f | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 23 | 1 | 0.0 | 0.002856 | 4bdb859f599fa07dd5eecdab0acc2d34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


Adding word vectors:

In [111]:
wordvecs = decompress_pickle(path_name+"Duolingo_wordvectors.pbz2")

In [191]:
q2trainvecs = wordvecs.loc[wordvecs['lexeme_id'].isin(lexemestrain_list),:]

# for non-na values
notna = q2trainvecs.loc[~q2trainvecs['vectorlist'].isna(),:]
q2trainvecs_notna_expanded = pd.DataFrame(notna['vectorlist'].tolist(),index = notna['lexeme_id'])


# computing centroid for imputation of nulls 
q2trainvecs_centroid = q2trainvecs_notna_expanded.mean(axis=0)

# separating out null valued rows
is_na = q2trainvecs.loc[q2trainvecs['vectorlist'].isna(),:]
is_na.index = is_na['lexeme_id']

numnull = is_na.shape[0]

# imputing with centroid
q2trainvecs_na_expanded = pd.concat([pd.DataFrame(q2trainvecs_centroid).transpose()]*numnull,axis=0)
q2trainvecs_na_expanded.index = is_na.index

# joining imputed null with the rest, keeping index consistent
q2trainvecs_expanded = pd.concat([q2trainvecs_na_expanded,q2trainvecs_notna_expanded],axis=0)

# viewing first few rows:
q2trainvecs_expanded.head()

| lexeme_id | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5377c84560aaf45988067be11302d1d8 | 0.107671 | 0.154306 | -0.167931 | -0.034556 | 0.116771 | 0.022966 | -0.026985 | -0.094826 | 0.075554 | 1.226188 | ... | -0.13612 | 0.056093 | -0.112848 | -0.060538 | -0.014504 | -0.002696 | -0.022266 | -0.013571 | -0.032485 | 0.046411 |
| 26133676f75f829992c823dd7402f9a3 | 0.107671 | 0.154306 | -0.167931 | -0.034556 | 0.116771 | 0.022966 | -0.026985 | -0.094826 | 0.075554 | 1.226188 | ... | -0.13612 | 0.056093 | -0.112848 | -0.060538 | -0.014504 | -0.002696 | -0.022266 | -0.013571 | -0.032485 | 0.046411 |
| 3baca3559289e14345edc0c53f516033 | 0.107671 | 0.154306 | -0.167931 | -0.034556 | 0.116771 | 0.022966 | -0.026985 | -0.094826 | 0.075554 | 1.226188 | ... | -0.13612 | 0.056093 | -0.112848 | -0.060538 | -0.014504 | -0.002696 | -0.022266 | -0.013571 | -0.032485 | 0.046411 |
| 883996f36cc186da5983b13384230a9f | 0.107671 | 0.154306 | -0.167931 | -0.034556 | 0.116771 | 0.022966 | -0.026985 | -0.094826 | 0.075554 | 1.226188 | ... | -0.13612 | 0.056093 | -0.112848 | -0.060538 | -0.014504 | -0.002696 | -0.022266 | -0.013571 | -0.032485 | 0.046411 |
| 0f2bda0e0fb7b053496a4c9402800bf3 | 0.107671 | 0.154306 | -0.167931 | -0.034556 | 0.116771 | 0.022966 | -0.026985 | -0.094826 | 0.075554 | 1.226188 | ... | -0.13612 | 0.056093 | -0.112848 | -0.060538 | -0.014504 | -0.002696 | -0.022266 | -0.013571 | -0.032485 | 0.046411 |


In [192]:
q2trainlexemesvecs = pd.merge(left = q2trainfeatureset1,\
                              right = q2trainvecs_expanded,left_on = 'lexeme_id',right_index=True)

In [193]:
q2trainlexemesvecs.shape

(4696, 417)
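The centroid imputation for missing word vectors can be sketched compactly. This is an illustrative variant, not the notebook's exact code: the helper name `expand_vectors` and the toy two-dimensional vectors are assumptions, and it presumes unique lexeme ids and at least one non-missing vector per call. Unlike the concatenation approach above, reindexing keeps the rows in their original order.

```python
import pandas as pd

def expand_vectors(vec_df, centroid=None):
    """Expand 'vectorlist' into one column per dimension, imputing missing rows.

    If `centroid` is None it is computed from the available vectors
    (training case); otherwise the supplied centroid is reused.
    """
    indexed = vec_df.set_index("lexeme_id")["vectorlist"]
    present = indexed[indexed.notna()]
    expanded = pd.DataFrame(present.tolist(), index=present.index)
    if centroid is None:
        centroid = expanded.mean(axis=0)          # column-wise mean vector
    expanded = expanded.reindex(indexed.index)    # missing rows become all-NaN
    return expanded.fillna(centroid), centroid    # fill each column with its centroid value

# Toy vectors (2 dimensions instead of 300).
train_vecs = pd.DataFrame({"lexeme_id": ["a", "b", "c"],
                           "vectorlist": [[1.0, 2.0], None, [3.0, 4.0]]})
test_vecs = pd.DataFrame({"lexeme_id": ["d", "e"],
                          "vectorlist": [[10.0, 10.0], None]})

train_exp, centroid = expand_vectors(train_vecs)
test_exp, _ = expand_vectors(test_vecs, centroid=centroid)
print(train_exp.loc["b"].tolist(), test_exp.loc["e"].tolist())
```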

Repeating this process with q2validation set:

In [194]:
def joinvectorstolexemes(lexemesvalid_list,q2validfeatureset1):
  """Attach word vectors to a lexeme feature set (validation or test).

  Relies on the globals wordvecs and q2trainvecs_centroid; missing vectors
  are imputed with the centroid computed from the training set.
  """

  q2validvecs = wordvecs.loc[wordvecs['lexeme_id'].isin(lexemesvalid_list),:]
  print("length of vectors dataframe: ",q2validvecs.shape[0])
  # for non-na values
  notna = q2validvecs.loc[~q2validvecs['vectorlist'].isna(),:]
  q2validvecs_notna_expanded = pd.DataFrame(notna['vectorlist'].tolist(),index = notna['lexeme_id'])
  print("length of not null vectors dataframe: ",q2validvecs_notna_expanded.shape[0])
  # separating out null valued rows
  is_na = q2validvecs.loc[q2validvecs['vectorlist'].isna(),:]
  is_na.index = is_na['lexeme_id']

  numnull = is_na.shape[0]
  print("length of null vectors dataframe: ",numnull)

  # imputing with centroid previously computed from training set. 
  if numnull!=0:
    q2validvecs_na_expanded = pd.concat([pd.DataFrame(q2trainvecs_centroid).transpose()]*numnull,axis=0)
    q2validvecs_na_expanded.index = is_na.index
  else:
    q2validvecs_na_expanded = pd.DataFrame()

  # joining imputed null with the rest, keeping index consistent
  q2validvecs_expanded = pd.concat([q2validvecs_na_expanded,q2validvecs_notna_expanded],axis=0)
  print("all vectors shape: ",q2validvecs_expanded.shape )

  # merging the vectors onto the feature set by lexeme_id
  q2validlexemesvecs = pd.merge(left = q2validfeatureset1,\
                                right = q2validvecs_expanded,left_on = 'lexeme_id',right_index=True)
  
  return q2validlexemesvecs

In [195]:
q2validlexemesvecs = joinvectorstolexemes(lexemesvalid_list,q2validfeatureset1)

length of vectors dataframe:  716
length of not null vectors dataframe:  716
length of null vectors dataframe:  0
all vectors shape:  (716, 300)


In [196]:
q2validlexemesvecs.shape

(716, 417)

Repeating process for q2 test set:

In [197]:
q2testlexemesvecs = joinvectorstolexemes(lexemesq2test_list,q2testfeatureset1)

length of vectors dataframe:  1858
length of not null vectors dataframe:  1856
length of null vectors dataframe:  2
all vectors shape:  (1858, 300)


In [198]:
q2testlexemesvecs.shape, q2testfeatureset1.shape

((1858, 417), (1858, 117))

Repeating process for q3 test (en to de)

In [199]:
q3test_en_to_de_lexemevecs = joinvectorstolexemes(lexemesq3test_en_to_de_list,q3test_en_to_de_featureset1)

length of vectors dataframe:  1030
length of not null vectors dataframe:  1016
length of null vectors dataframe:  14
all vectors shape:  (1030, 300)


In [200]:
q3test_en_to_de_lexemevecs.shape, q3test_en_to_de_featureset1.shape

((1030, 417), (1030, 117))

Repeating process for Italian to English

In [201]:
q3test_it_to_en_lexemevecs = joinvectorstolexemes(lexemesq3test_it_to_en_list,q3test_it_to_en_featureset1)

length of vectors dataframe:  640
length of not null vectors dataframe:  640
length of null vectors dataframe:  0
all vectors shape:  (640, 300)


In [202]:
q3test_it_to_en_lexemevecs.shape, q3test_it_to_en_featureset1.shape,len(lexemesq3test_it_to_en_list)

((640, 417), (640, 117), 640)

All sets have been preprocessed. The word features are now joined to the student features datasets on 'lexeme_id'.

In [203]:
q2train_allf = pd.merge(left = q2train,right = q2trainlexemesvecs, left_on = 'lexeme_id',right_on = 'lexeme_id',how="left")

In [205]:
q2train_allf.shape, q2train.shape

((24409, 441), (24409, 25))
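The shape check above confirms that the left join kept one output row per practice record. A minimal sketch of the same join with two extra safeguards is shown below; the toy dataframes are made up, and the `validate` and `indicator` arguments of `DataFrame.merge` are additions not used in the notebook. `validate="many_to_one"` raises an error if a lexeme id is duplicated on the word side, and `indicator=True` makes unmatched rows easy to count.

```python
import pandas as pd

# Toy student-level rows (many rows per lexeme) and word-level rows (one per lexeme).
students = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "lexeme_id": ["a", "b", "a"],
    "history_frac": [0.5, 1.0, 0.8],
})
words = pd.DataFrame({"lexeme_id": ["a", "b"], "sf_length": [4, 5]})

merged = students.merge(
    words,
    on="lexeme_id",
    how="left",              # keep every student row
    validate="many_to_one",  # fail loudly if word rows are not unique per lexeme
    indicator=True,          # adds a _merge column describing each row's origin
)

# Rows with no matching word features would show up as 'left_only'.
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged.shape, unmatched)
```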

In [214]:
q2train_allf.columns[135:147]

Index(['mod_x0_sint',   'mod_x0_sp', 'mod_x0_subj',  'mod_x0_sup',
         'mod_x0_tn',  'mod_x0_nan',             0,             1,
                   2,             3,             4,             5],
      dtype='object')

After joining with the student features dataset, some columns are no longer needed for modeling.

The following columns are dropped:
1. timestamp
2. user_id
3. learning_language
4. ui_language
5. lexeme_id
6. lexeme_string
7. history_correct (since history_frac is there)
8. session_seen
9. session_correct (target variable)
10. lang_frozenset
11. Datetime
12. delta_days
13. Date_x
14. user_date_tup_x
15. Date_y
16. user_date_tup_y

In [215]:
q2train_allf = q2train_allf.drop(['timestamp','user_id','learning_language','ui_language','lexeme_id','lexeme_string','history_correct','session_seen','session_correct',\
                                  'lang_frozenset','Datetime','delta_days','Date_x','user_date_tup_x','Date_y','user_date_tup_y'],axis=1)

q2train_allf.shape

(24409, 425)

In [217]:
q2train_allf.select_dtypes('object').shape

(24409, 0)

There are no non-numerical variables left. 

Separating out y and X, and saving to compressed pickle files. 

In [218]:
q2trainy_allf = q2train_allf['p_forgot_bin']
q2trainX_allf = q2train_allf.drop('p_forgot_bin',axis=1)
compressed_pickle(path_name+"Q2TRAIN_ALLFX",q2trainX_allf)
compressed_pickle(path_name+"Q2TRAIN_ALLFy",q2trainy_allf)

In [239]:
def joinstudentfeatures_save(q2train,q2trainlexemesvecs,prefixfilename):
  """Join word features onto a student features dataframe, drop unused columns,
  split into X and y, and save both as compressed pickles under path_name."""
  q2train_allf = pd.merge(left = q2train,right = q2trainlexemesvecs, left_on = 'lexeme_id',right_on = 'lexeme_id',how="left")
  print("Shape of all features dataset")
  print("Deleting extraneous columns")
  q2train_allf = q2train_allf.drop(['timestamp','user_id','learning_language','ui_language','lexeme_id','lexeme_string','history_correct','session_seen','session_correct',\
                                  'lang_frozenset','Datetime','delta_days','Date_x','user_date_tup_x','Date_y','user_date_tup_y'],axis=1)

  print("New shape: ",q2train_allf.shape)

  q2trainy_allf = q2train_allf['p_forgot_bin']
  q2trainX_allf = q2train_allf.drop('p_forgot_bin',axis=1)

  print("Checking for any nulls before saving")
  # note: this counts the columns that contain at least one null, not individual null cells
  print("Number of null values",q2trainX_allf.isna().any().sum())

  compressed_pickle(path_name+prefixfilename+"X",q2trainX_allf)
  compressed_pickle(path_name+prefixfilename+"y",q2trainy_allf)

  print('saved: ',prefixfilename,"X and y")


In [240]:
joinstudentfeatures_save(q2train,q2trainlexemesvecs,prefixfilename = 'Q2TRAIN_ALLF')

Shape of all features dataset
Deleting extraneous columns
New shape:  (24409, 425)
Checking for any nulls before saving
Number of null values 0
saved:  Q2TRAIN_ALLF X and y


In [241]:
joinstudentfeatures_save(q2valid,q2validlexemesvecs,prefixfilename = 'Q2VALID_ALLF')

Shape of all features dataset
Deleting extraneous columns
New shape:  (1344, 425)
Checking for any nulls before saving
Number of null values 0
saved:  Q2VALID_ALLF X and y


In [235]:
joinstudentfeatures_save(q2test11d_sfonly,q2testlexemesvecs,prefixfilename = 'Q2TEST_ALLF')

Shape of all features dataset
Deleting extraneous columns
New shape:  (3913, 425)
Checking for any nulls before saving
Number of null values 0
saved:  Q2TEST_ALLF X and y


In [236]:
joinstudentfeatures_save(q3test11d_en_to_de_sfonly,q3test_en_to_de_lexemevecs,prefixfilename = 'Q3TESTENTODE_ALLF')

Shape of all features dataset
Deleting extraneous columns
New shape:  (3212, 425)
Checking for any nulls before saving
Number of null values 0
saved:  Q3TESTENTODE_ALLF X and y


In [237]:
joinstudentfeatures_save(q3test11d_it_to_en_sfonly,q3test_it_to_en_lexemevecs,prefixfilename = 'Q3TESTITTOEN_ALLF')

Shape of all features dataset
Deleting extraneous columns
New shape:  (1417, 425)
Checking for any nulls before saving
Number of null values 0
saved:  Q3TESTITTOEN_ALLF X and y
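Since all five feature matrices are meant to go through the same pipeline, it can be worth confirming that they share identical columns in identical order before modeling. The helper below is a hypothetical sketch (the name `mismatched_columns` and the toy frames are assumptions), not part of the saved pipeline.

```python
import pandas as pd

def mismatched_columns(reference, others):
    """Return the names of datasets whose columns differ from the reference."""
    ref_cols = list(reference.columns)
    return [name for name, df in others.items() if list(df.columns) != ref_cols]

# Toy feature matrices; column labels mix strings and integers as in the real data.
train_X = pd.DataFrame({"delta": [1.0], "EnglishIDF": [3.7], 0: [0.1]})
valid_X = pd.DataFrame({"delta": [2.0], "EnglishIDF": [0.4], 0: [0.2]})
bad_X = pd.DataFrame({"EnglishIDF": [0.4], "delta": [2.0], 0: [0.2]})  # wrong column order

problems = mismatched_columns(train_X, {"valid": valid_X, "bad": bad_X})
print(problems)
```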


## <font color = 'cornflowerblue' size = 4>Conclusions and Next Steps</font>

The Q2 training, validation, and test (unseen students) datasets, along with the two Q3 test sets, now contain both word-based and student-based features, so modeling can begin.

The best model will then be evaluated on the two unseen languages.