<a href="https://colab.research.google.com/github/rts1988/Duolingo_spaced_repetition/blob/main/5_Q1_Set1_preprocessing_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <font color = 'cornflowerblue' size=4>Introduction</font>

In this notebook, the lexeme strings and word features derived in the previous notebook are used as a feature set.

Using only the lexemes in the Q1 training set, PCA is used to reduce the dimensions of the feature set.

There is a choice between keeping the features as sparse matrices and applying dimension reduction. PCA makes the reduced feature set dense, but it helps mitigate the curse of dimensionality.

After scaling and dimension reduction of the word-based features, the reduced features will be joined to the main Q1 training set as a compressed sparse matrix.

The same preprocessing pipeline will be used to transform the Q1 test set.

This will make both datasets ready for modeling.

Since larger dimensions are being dealt with, the high-RAM runtime option is selected:

In [31]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

Your runtime has 54.8 gigabytes of available RAM



Importing libraries and mounting Google Drive:

In [32]:
import bz2
import pickle
import _pickle as cPickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from google.colab import drive
drive.mount('/content/drive')

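def decompress_pickle(file):
 # open the bz2-compressed pickle and close the file handle once it is loaded
 with bz2.BZ2File(file, 'rb') as f:
  return cPickle.load(f)

def compressed_pickle(title, data):  # do not add extension in filename
 # write data as a bz2-compressed pickle named <title>.pbz2
 with bz2.BZ2File(title + '.pbz2', 'w') as f: 
  cPickle.dump(data, f)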

path_name = '/content/drive/MyDrive/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading Q1 train set:

In [33]:
q1train = decompress_pickle(path_name+"Q1TRAIN.pbz2") 

Loading Q1 test set:

In [34]:
q1test = decompress_pickle(path_name+"Q1TEST.pbz2")

Confirming the shapes:

In [35]:
q1train.shape, q1test.shape

((8070561, 17), (1795528, 17))

In [36]:
q1train.size/10**6, q1test.size/10**6

(137.199537, 30.523976)

List of lexeme ids in q1train dataframe:

In [37]:
q1train_lexemelist = q1train['lexeme_id'].unique()

In [38]:
all_lexemes = decompress_pickle(path_name+"Duolingo_all_lexemes.pbz2")

In [39]:
all_lexemes.shape

(19279, 17)

Filtering to only the words in the training set:

In [40]:
q1trainlexemes = all_lexemes.loc[all_lexemes['lexeme_id'].isin(q1train_lexemelist),:]

In [41]:
q1trainlexemes.columns, q1trainlexemes.shape

(Index(['lexeme_id', 'learning_language', 'lexeme_string', 'surface_form',
        'lemma_form', 'pos', 'modstrings', 'sf_length', 'sf_translation',
        'lf_translation', 'surface_form_no_accents', 'lemma_form_no_accents',
        'L_dist_word_tup_sf_noaccents', 'L_dist_sf_noaccents',
        'L_dist_sf_noaccents_norm', 'IDFword', 'EnglishIDF'],
       dtype='object'), (12446, 17))

Filtering the lexemes dataframe down to the test set words, to be transformed later:

In [42]:
q1test_lexemelist = q1test['lexeme_id'].unique()
q1testlexemes = all_lexemes.loc[all_lexemes['lexeme_id'].isin(q1test_lexemelist),:]

In [43]:
q1testlexemes.shape

(2776, 17)

## <font color = 'cornflowerblue' size=4>Preprocessing and PCA</font>

### <font color = 'cornflowerblue' size=3>One hot encoding of categoricals</font>
The learning language, pos and modstrings columns are converted to binary dummy variables:

In [44]:
from sklearn.preprocessing import OneHotEncoder

# dense output is requested; note that scikit-learn >= 1.2 renames this argument to sparse_output
enc_lang = OneHotEncoder(sparse=False,handle_unknown='ignore')
enc_pos = OneHotEncoder(sparse = False,handle_unknown = 'ignore')
enc_mods = OneHotEncoder(sparse=False,handle_unknown='ignore')

# get one hot encoded learning language
enc_lang.fit(np.array(q1trainlexemes['learning_language']).reshape(-1, 1))
q1train_lang = pd.DataFrame(enc_lang.transform(np.array(q1trainlexemes['learning_language']).reshape(-1, 1)),index=q1trainlexemes.index)

# get one hot encoded part of speech:
enc_pos.fit(np.array(q1trainlexemes['pos']).reshape(-1, 1))
q1train_pos = pd.DataFrame(enc_pos.transform(np.array(q1trainlexemes['pos']).reshape(-1, 1)),index=q1trainlexemes.index)


In [45]:
# get one hot encoded modstrings
q1trainlexemes['modstrings'].head()

7                []
20          [m, pl]
21    [pri, p1, sg]
22          [m, sg]
24    [pri, p3, sg]
Name: modstrings, dtype: object

Since modstrings are saved as a list of strings, some more processing needs to be done before passing to the one hot encoder. 

In [46]:
# explode values of list so a separate record is created for each element of the list. 
q1trainlexemes['modstrings'].explode()
enc_mods.fit(np.array(q1trainlexemes['modstrings'].explode()).reshape(-1,1))
q1train_mods_exploded =enc_mods.transform(np.array(q1trainlexemes['modstrings'].explode()).reshape(-1,1))

q1train_modsdf_exploded = pd.DataFrame(q1train_mods_exploded, index = q1trainlexemes['modstrings'].explode().index)

# group the exploded one-hot dataframe by the q1lexemes index, sum up the exploded records (since one word can have many modifiers in its list)
q1train_modsdf = q1train_modsdf_exploded.groupby(q1train_modsdf_exploded.index).sum()
#np.concatenate([np.array(q1trainlexemes['modstrings'].explode().index).reshape(-1,1),q1train_mods],axis=1)

In [47]:
q1train_modsdf.shape, q1trainlexemes.shape

((12446, 86), (12446, 17))

After summing, the mods dataframe has the same number of rows as the q1trainlexemes dataframe.
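The explode, encode and sum steps can be illustrated on a toy example. This is a minimal sketch, not the notebook's own code: the `toy` series is made up, and `pd.get_dummies` is swapped in for `OneHotEncoder` to keep the example short.

```python
import pandas as pd

# Hypothetical toy data: each entry is a list of modifier tags, as in 'modstrings'
toy = pd.Series([[], ["m", "pl"], ["pri", "p1", "sg"], ["m", "sg"]],
                index=[7, 20, 21, 22])

# One record per list element; an empty list becomes a single NaN record
exploded = toy.explode()

# One-hot encode each element (NaN gets no column), then sum back per original row
dummies = pd.get_dummies(exploded, dtype=int)
multi_hot = dummies.groupby(level=0).sum()

print(multi_hot)
```

Each original row ends up with one row of counts, so a word with three modifiers has three ones and a word with no modifiers has a row of zeros.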

Now the one-hot language, pos and mod columns are combined with the numerical word features to make the first feature set.

In [67]:
q1trainfeatureset1 = pd.concat([q1trainlexemes[['sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF','lexeme_id']],
                                q1train_lang, 
                                q1train_pos, 
                                q1train_modsdf, 
                                ],axis=1)

In [68]:
q1trainfeatureset1.shape

(12446, 167)

The feature set has 167 columns. 

Applying the same transformations to the test set lexemes:

In [50]:
# transforming the language column based on already fit one hot encoder:
q1test_lang = pd.DataFrame(enc_lang.transform(np.array(q1testlexemes['learning_language']).reshape(-1, 1)),index=q1testlexemes.index)

# get one hot encoded part of speech:
q1test_pos = pd.DataFrame(enc_pos.transform(np.array(q1testlexemes['pos']).reshape(-1, 1)),index=q1testlexemes.index)


In [51]:
# explode values of list so a separate record is created for each element of the list. 
q1testlexemes['modstrings'].explode()
#enc_mods.fit(np.array(q1trainlexemes['modstrings'].explode()).reshape(-1,1))
q1test_mods_exploded =enc_mods.transform(np.array(q1testlexemes['modstrings'].explode()).reshape(-1,1))

q1test_modsdf_exploded = pd.DataFrame(q1test_mods_exploded, index = q1testlexemes['modstrings'].explode().index)

# group the exploded one-hot dataframe by the q1lexemes index, sum up the exploded records (since one word can have many modifiers in its list)
q1test_modsdf = q1test_modsdf_exploded.groupby(q1test_modsdf_exploded.index).sum()
#np.concatenate([np.array(q1trainlexemes['modstrings'].explode().index).reshape(-1,1),q1train_mods],axis=1)

In [52]:
q1test_modsdf.shape, q1testlexemes.shape

((2776, 86), (2776, 17))

After summing, the mods dataframe has the same number of rows as the q1testlexemes dataframe.

Now the one-hot language, pos and mod columns are combined with the numerical word features to make the first feature set for the test lexemes.

In [69]:
q1testfeatureset1 = pd.concat([q1testlexemes[['lexeme_id','sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF']],
                                q1test_lang, 
                                q1test_pos, 
                                q1test_modsdf, 
                                ],axis=1)

In [70]:
q1testfeatureset1.shape

(2776, 167)

After binarizing the categorical columns, the train and test feature sets have the same number of columns. Both are saved as compressed pickle files.

In [71]:
compressed_pickle(path_name+"Q1TRAIN_lexemesFS1",q1trainfeatureset1)
compressed_pickle(path_name+"Q1TEST_lexemesFS1",q1testfeatureset1)
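The test lexemes above are transformed with encoders fit on the training lexemes only, which is why `handle_unknown='ignore'` matters. A minimal sketch of that behaviour on made-up language codes (the values are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on hypothetical "training" categories only
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array(["es", "fr", "de"]).reshape(-1, 1))

# "it" was never seen during fit; with handle_unknown="ignore" it maps to all zeros
encoded = enc.transform(np.array(["fr", "it"]).reshape(-1, 1)).toarray()

print(enc.categories_[0])
print(encoded)
```

Because the column layout is fixed at fit time, the train and test matrices keep the same width even when the test set contains categories that the training set lacks.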

Plan

Pipe 1 (no PCA, just combined):
Featureset1 -> join with main dataframe -> convert to sparse matrix -> MinMaxScaler

Pipe 2 (PCA of lexeme features):
Featureset1 -> StandardScaler -> PCA (explained variance = 0.9) -> join with main dataframe (after scaling with StandardScaler) -> convert to sparse matrix

In [27]:
from sklearn.decomposition import PCA, SparsePCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder

Estimate of the total size of the dataset once joined, in millions of values:

In [28]:
(q1trainfeatureset1.size/10**6)/q1trainfeatureset1.shape[0]*q1train.shape[0] + q1train.size/10**6

1417.60566

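Without any dimension reduction, the full combined feature set is estimated at about 1.4 billion values (`.size` counts elements, not bytes), which would be roughly 11 GB if every value were stored as a 64-bit float.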
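Since `DataFrame.size` counts elements rather than bytes, a memory estimate needs the per-value width as well. A minimal sketch on a toy frame; the shapes and the projected row count are illustrative assumptions:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.zeros((1000, 5)))  # 5 float64 columns

n_values = toy.size                             # number of elements: rows * columns
n_bytes = toy.memory_usage(index=False).sum()   # bytes used by the column data

# Project the per-row footprint onto a larger (hypothetical) row count
projected_rows = 8_000_000
projected_gb = n_bytes / len(toy) * projected_rows / 1e9

print(n_values, n_bytes, projected_gb)
```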

Attempts to combine it all at once failed even with high RAM.

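```
# getting lexeme id for the full feature set 1
q1trainfeatureset1['lexeme_id'] = q1trainlexemes.loc[q1trainfeatureset1.index,'lexeme_id']

# joining with main dataframe on lexeme id
# (right= was q1train in the original cell, which joins the main dataframe to itself;
#  the lexeme feature set is assumed to be the intended right-hand table)
q1train_withfs1 = pd.merge(left= q1train,right=q1trainfeatureset1,left_on='lexeme_id',right_on = 'lexeme_id',how="left")
```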

As a workaround, the following process is used:

1. 100,000 rows of the main dataframe are taken at a time and joined with q1trainfeatureset1.
2. Each joined chunk is converted to a sparse matrix.
3. The sparse matrices are stacked vertically to build a compressed sparse version of the full dataset.
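The three steps above can be sketched on toy data as follows. The frames, column names and chunk size are made up for illustration, and the chunks are collected in a list and stacked once at the end, which is a variation on the loop used in this notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix, vstack

# Hypothetical stand-ins for the main dataframe and the lexeme feature set
main = pd.DataFrame({"lexeme_id": ["a", "b", "a", "c", "b"],
                     "delta": [1.0, 2.0, 3.0, 4.0, 5.0]})
features = pd.DataFrame({"lexeme_id": ["a", "b", "c"],
                         "f0": [1.0, 0.0, 0.0],
                         "f1": [0.0, 1.0, 0.0]})

chunk_size = 2  # the notebook uses 100,000
blocks = []
for start in range(0, len(main), chunk_size):
    chunk = main.iloc[start:start + chunk_size]
    # left join keeps every row of the chunk, in its original order
    joined = chunk.merge(features, on="lexeme_id", how="left").drop(columns="lexeme_id")
    # only the non-zero entries of the joined chunk are stored
    blocks.append(csr_matrix(joined.to_numpy(dtype=float)))

X = vstack(blocks, format="csr")
print(X.shape, X.nnz)
```

Only one dense chunk exists in memory at a time, so the peak footprint is set by the chunk size rather than by the full joined table.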

Dropping extraneous columns from q1train and q1trainfeatureset1:

In [57]:
q1train.columns

Index(['timestamp', 'delta', 'user_id', 'learning_language', 'ui_language',
       'lexeme_id', 'lexeme_string', 'history_seen', 'history_correct',
       'session_seen', 'session_correct', 'p_forgot_bin', 'simoverdiff',
       'lang_frozenset', 'Datetime', 'delta_days', 'history_frac'],
      dtype='object')

The columns needed from q1train for q1train_X (to merge with the lexeme features): delta, lexeme_id, history_seen, history_frac, simoverdiff.

The columns needed for q1train_y: p_forgot_bin


In [58]:
q1train_X = q1train[['lexeme_id','delta','history_seen','history_frac','simoverdiff']]

q1train_y = q1train['p_forgot_bin']

In [64]:
q1train_X.columns

Index(['lexeme_id', 'delta', 'history_seen', 'history_frac', 'simoverdiff'], dtype='object')

With the exception of lexeme id (needed for joining) all other non-numerical columns have been dropped. 

In [63]:
q1trainfeatureset1.columns[0:5]

Index(['sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF', 0, 1], dtype='object')

The loop below was modified so that the intermediate chunks are not saved to disk; only the final stacked matrix is saved.

In [77]:
try: 
  del Xq1  # clear any matrix left over from a previous run
except NameError:
  pass

In [79]:
from scipy.sparse import coo_matrix, vstack
count = 0
for i in range(0,q1train_X.shape[0],100000):
  if count%2:
    print(count)
  
  subdata = q1train_X.iloc[i:min(i+100000,q1train_X.shape[0]),:]
  subdata = pd.merge(left = subdata, right = q1trainfeatureset1,left_on = 'lexeme_id',right_on = 'lexeme_id',how="left")
  subdata = subdata.drop('lexeme_id',axis=1)
  #print('size: ',subdata.size/10**6)

  mat = coo_matrix(subdata)
  #print('size after compression: ',mat.size/10**6)
  if i >1:
    print('stacking: ')
    Xq1 = vstack([Xq1,mat])
    #print('saving: ',"Xq1_"+str(count))
    #compressed_pickle(path_name+"Xq1_"+str(count), mat)
  else:
    Xq1 = mat
    #compressed_pickle(path_name+"Xq1_"+str(count), mat)
  del subdata, mat
  count +=1
#print('Saved all, size Xq1: ',Xq1.size/10**6)
compressed_pickle(path_name+"q1train_pipe1", Xq1)

1
stacking: 
stacking: 
3
stacking: 
stacking: 
[... the same pattern repeats for counts 5 through 77 ...]
79
stacking: 
stacking: 


In [81]:
Xq1.size/10**6

86.851987

The compressed matrix X stores about 86.85 million values (`.size` of a sparse matrix counts stored entries, not bytes).
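For a rough byte count of a COO matrix, the sizes of its three underlying arrays can be added up. A minimal sketch on a tiny made-up matrix:

```python
import numpy as np
from scipy.sparse import coo_matrix

dense = np.array([[0.0, 2.0, 0.0],
                  [1.0, 0.0, 0.0]])
mat = coo_matrix(dense)

stored = mat.nnz  # number of stored entries
# COO keeps one value, one row index and one column index per stored entry
approx_bytes = mat.data.nbytes + mat.row.nbytes + mat.col.nbytes

print(stored, approx_bytes)
```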

Saving q1train_y:

In [83]:
compressed_pickle(path_name+"q1train_y",q1train_y)

In [84]:
q1test_X = q1test[['lexeme_id','delta','history_seen','history_frac','simoverdiff']]

q1test_y = q1test['p_forgot_bin']

In [85]:
q1test_X.columns

Index(['lexeme_id', 'delta', 'history_seen', 'history_frac', 'simoverdiff'], dtype='object')

With the exception of lexeme id (needed for joining) all other non-numerical columns have been dropped. 

In [86]:
q1testfeatureset1.columns[0:5]

Index(['lexeme_id', 'sf_length', 'L_dist_sf_noaccents_norm', 'EnglishIDF', 0], dtype='object')

As with the training set, the intermediate chunks are not saved to disk; only the final stacked matrix is saved.

In [87]:
try: 
  del Xq1  # drop the training matrix before building the test matrix
except NameError:
  pass

In [88]:
count = 0
for i in range(0,q1test_X.shape[0],100000):
  if count%2:
    print(count)
  
  subdata = q1test_X.iloc[i:min(i+100000,q1test_X.shape[0]),:]
  subdata = pd.merge(left = subdata, right = q1testfeatureset1,left_on = 'lexeme_id',right_on = 'lexeme_id',how="left")
  subdata = subdata.drop('lexeme_id',axis=1)
  #print('size: ',subdata.size/10**6)

  mat = coo_matrix(subdata)
  #print('size after compression: ',mat.size/10**6)
  if i >1:
    print('stacking: ')
    Xq1 = vstack([Xq1,mat])
    #print('saving: ',"Xq1_"+str(count))
    #compressed_pickle(path_name+"Xq1_"+str(count), mat)
  else:
    Xq1 = mat
    #compressed_pickle(path_name+"Xq1_"+str(count), mat)
  del subdata, mat
  count +=1
#print('Saved all, size Xq1: ',Xq1.size/10**6)
compressed_pickle(path_name+"q1test_pipe1", Xq1)

1
stacking: 
stacking: 
3
stacking: 
stacking: 
5
stacking: 
stacking: 
7
stacking: 
stacking: 
9
stacking: 
stacking: 
11
stacking: 
stacking: 
13
stacking: 
stacking: 
15
stacking: 
stacking: 
17
stacking: 


In [89]:
compressed_pickle(path_name+"q1test_y",q1test_y)

## <font color = 'cornflowerblue' size=4>Conclusions and Next Steps</font>

Word-based features have been joined to the Q1 train and test dataframes and stored as compressed sparse matrices.

Scaling will be done before modeling.
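One consideration for the scaling step: as far as the scikit-learn documentation indicates, `MinMaxScaler` does not accept sparse input, while `MaxAbsScaler` does and preserves sparsity. The sketch below therefore swaps in `MaxAbsScaler` as an assumption; it is not the scaler named in the plan, and the matrices are toy data.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import MaxAbsScaler

X_train = csr_matrix(np.array([[0.0, 10.0, 0.0],
                               [2.0, 0.0, 0.0],
                               [4.0, 5.0, 1.0]]))
X_val = csr_matrix(np.array([[1.0, 20.0, 0.0]]))

scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)  # each column divided by its max absolute value
X_val_scaled = scaler.transform(X_val)          # same divisors, taken from the training data

print(issparse(X_train_scaled), X_train_scaled.toarray())
```

Zeros stay zero under this scaling, so the matrix stays sparse; validation values can exceed 1 when they are larger than anything seen in training.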

Next steps:

For pipe1:

Further split the Q1 training data (q1train_pipe1 and q1train_y) into training and validation sets (90-10 split).

1. Model with classical machine learning techniques
- downsample or upsample or adjust class weight hyperparameter
- logistic regression
- decision tree
- Naive Bayes classification
- downsampled kNN
- downsampled SVM

2. Ensemble techniques
- Random Forest
- AdaBoost
- XGBoost

3. Neural net
- Dense neural net. 

Pipe 1 models will be evaluated on the validation set using average precision and ROC AUC, and compared against a baseline model that uses no word-based features.
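A minimal sketch of that comparison on synthetic data. Everything here is an assumption for illustration: the data comes from `make_classification`, the first four columns play the role of the baseline features, and logistic regression with balanced class weights stands in for whichever model is chosen.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data standing in for the joined feature matrix and p_forgot_bin
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)

# 90-10 train/validation split, stratified to keep the class ratio
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1,
                                            stratify=y, random_state=0)

def validation_scores(columns):
    """Fit on the chosen columns and return (average precision, ROC AUC) on validation."""
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X_tr[:, columns], y_tr)
    proba = model.predict_proba(X_val[:, columns])[:, 1]
    return average_precision_score(y_val, proba), roc_auc_score(y_val, proba)

baseline_ap, baseline_auc = validation_scores(list(range(4)))  # "no word features" stand-in
full_ap, full_auc = validation_scores(list(range(8)))          # all features

print(baseline_ap, baseline_auc, full_ap, full_auc)
```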

