<a href="https://colab.research.google.com/github/rts1988/Duolingo_spaced_repetition/blob/main/Duolingo_q1_joininglexemes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In the [previous notebook](https://colab.research.google.com/drive/1ph0aYyuKy1rVYkhmVDGkpNZeibpNJNDR#scrollTo=o83v40YAdJP9) we built word features, plus binary dummies for learning language and UI language. We decomposed the lexeme string and derived additional features such as whether the word contains special characters, the number of tokens, and the surface length. We also corrected for cases where the surface form was <*sf>.

The problem was that joining datadfq1 with the word features kept running out of memory.

We will try the following approaches. 

1. Convert the numeric columns of datadfq1 to a sparse representation and concatenate them with the lexeme features. This is fairly involved, because the sparse matrix for the full dataset has to be constructed from the lexemes sparse matrix.
2. Tried Colab Pro+ ($67 per month): did not work.
3. Try doing some modeling with the word features alone: get average delta, p_forgot_bin counts, and number of records, and do some feature selection first.
4. Do a PCA on the lexeme features and bring them down to around 200.
5. Try doing some feature selection using statistical testing and EDA with different word-based features, and plot mean p_forgot_bin vs delta, with the feature as hue.
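
As a rough illustration of option 4, the sketch below reduces a sparse feature matrix with a truncated SVD from scipy. This is a swapped-in technique, not exact PCA: it skips mean-centering so the matrix can stay sparse. The toy matrix and the choice of 5 components are assumptions for illustration only; the real lexeme features would target around 200 components.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Toy sparse matrix standing in for the lexeme features (hypothetical sizes).
lexeme_features = sparse_random(200, 50, density=0.05, format="csr", random_state=0)

n_components = 5  # illustrative only; the plan above targets around 200
u, s, vt = svds(lexeme_features, k=n_components)

# Project every row onto the top right-singular vectors (no centering, unlike PCA).
reduced = lexeme_features @ vt.T
print(reduced.shape)
```

The projection keeps one row per lexeme, so the 'index' mapping used for the join would still apply to the reduced matrix.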


### Convert numerics in datadfq1 to sparse matrix and then concatenate with all lexeme features.

1. Convert datadfq1 to a sparse array.
2. Check memory before and after.

In [1]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

Your runtime has 54.8 gigabytes of available RAM



In [2]:
import bz2
import pickle
import _pickle as cPickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from google.colab import drive
drive.mount('/content/drive')

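def decompress_pickle(file):
    # Load a bz2-compressed pickle and close the file handle afterwards.
    with bz2.BZ2File(file, 'rb') as f:
        return cPickle.load(f)

def compressed_pickle(title, data):  # do not add extension in filename
    # Write data as a bz2-compressed pickle to title + '.pbz2'.
    with bz2.BZ2File(title + '.pbz2', 'w') as f:
        cPickle.dump(data, f)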

path_name = '/content/drive/MyDrive/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
q1df1 = decompress_pickle(path_name+"q1_fulldataset_wordsonly.pbz2")

In [4]:
d1np1 = np.array(q1df1)

In [5]:
d1np1.size/1000/1000

87.760248

In [6]:
from scipy.sparse import coo_matrix
d1np1coo = coo_matrix(d1np1)

In [7]:
d1np1coo.size/1000/1000

37.745286

In [8]:
d1np1coo.shape

(7313354, 12)

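The number of stored values for the main dataframe drops from about 87.8 million to about 37.7 million. Note that .size counts elements, not bytes: a coo matrix also keeps a row index and a column index for every stored value, so the actual memory saving is smaller than this ratio suggests.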
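
To check memory rather than element counts, one option is to compare the byte sizes of the underlying arrays. The sketch below does this on a small random matrix with roughly the same share of nonzeros as the main dataframe; the helper name coo_nbytes and the toy data are assumptions, not part of the original notebook.

```python
import numpy as np
from scipy.sparse import coo_matrix

def coo_nbytes(m):
    # A COO matrix holds three arrays: the values plus a row and a column index per value.
    return m.data.nbytes + m.row.nbytes + m.col.nbytes

rng = np.random.default_rng(0)
dense = rng.random((1000, 12))
dense[dense < 0.57] = 0.0  # leave roughly 43% nonzero, similar to d1np1

sparse = coo_matrix(dense)

print("elements:", dense.size, "stored values:", sparse.nnz)
print("dense bytes:", dense.nbytes, "coo bytes:", coo_nbytes(sparse))
```

On the real data the same comparison would be d1np1.nbytes against coo_nbytes(d1np1coo).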

Now we will do the same with the lexeme features. The open question is how to join them to the main dataframe.

We need to do the following.
1. Load the full datadfq1 (Unseen words training set) and keep only the lexeme ids.
2. Convert it to a dataframe with one column.
3. Get the index of the matching lexeme row for each row of the main dataframe.
4. Convert the lexemes to a sparse array, then build the array to be concatenated from those indices (check the shape).
5. Concatenate column-wise with the main dataframe.

In [9]:
datadfq1 = decompress_pickle(path_name+"Unseen_words_training_set.pbz2") 

In [10]:
lexemeids = datadfq1['lexeme_id']
lexemeids = pd.DataFrame(lexemeids)

Next we load q1lexemes and get the row index of each lexeme id, so that we can join on just those indices.

In [11]:
q1lexemes = decompress_pickle(path_name+"q1_lexeme_features.pbz2")

In [12]:
q1lexemes.shape

(12160, 2231)

In [13]:
q1lexemeids=pd.DataFrame(q1lexemes['lexeme_id'])

In [14]:
q1lexemeids = q1lexemeids.reset_index()
q1lexemeids.head()

Unnamed: 0,index,lexeme_id
0,0,73eecb492ca758ddab5371cf7b5cca32
1,1,c84476c460737d9fb905dca3d35ec995
2,2,1a913f2ded424985b9c02d0436008511
3,3,38b770e66595fea718366523b4f7db3f
4,4,4bdb859f599fa07dd5eecdab0acc2d34


In [15]:
lexemeids.shape, q1lexemeids.shape

((7313354, 1), (12160, 2))

Now we will map the main dataframe indices to the lexeme features indices.

In [18]:
lexeme_map = pd.merge(left=lexemeids,right=q1lexemeids, how='left',left_on = 'lexeme_id',right_on = 'lexeme_id')

In [19]:
lexeme_map.head()

Unnamed: 0,lexeme_id,index
0,73eecb492ca758ddab5371cf7b5cca32,0
1,73eecb492ca758ddab5371cf7b5cca32,0
2,c84476c460737d9fb905dca3d35ec995,1
3,1a913f2ded424985b9c02d0436008511,2
4,38b770e66595fea718366523b4f7db3f,3


The 'index' column here contains the row of the q1lexemes dataframe that needs to be concatenated to each row of the main dataframe.
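
Before relying on this mapping, it may be worth checking that the merge kept the row order and that every lexeme id found a match. The sketch below shows such checks on toy data; the frames with the _toy suffix are hypothetical stand-ins for lexemeids and q1lexemeids.

```python
import pandas as pd

# Toy stand-ins: one row per session, and one row per distinct lexeme.
lexemeids_toy = pd.DataFrame({"lexeme_id": ["a", "a", "b", "c", "b"]})
q1lexemeids_toy = pd.DataFrame({"lexeme_id": ["a", "b", "c"]}).reset_index()

lexeme_map_toy = pd.merge(
    left=lexemeids_toy,
    right=q1lexemeids_toy,
    how="left",
    on="lexeme_id",
    validate="many_to_one",  # raises MergeError if a lexeme_id repeats on the right
)

# A left merge keeps the left row order, so row i of the map describes row i of the main dataframe.
assert len(lexeme_map_toy) == len(lexemeids_toy)
assert (lexeme_map_toy["lexeme_id"].values == lexemeids_toy["lexeme_id"].values).all()
# An unmatched lexeme id would appear as NaN in 'index'.
assert lexeme_map_toy["index"].notna().all()

row_positions = lexeme_map_toy["index"].to_numpy()
print(row_positions)  # [0 0 1 2 1]
```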


Row-wise concatenation using getrow() was taking too long (see the commented-out cell below), so we will instead convert q1lexemes to a coo matrix, then to a dictionary of keys, and use that to construct the bigger matrix.
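
A possible alternative to looping over rows, sketched here on toy data, is to convert the lexeme features to CSR format and gather all the needed rows in a single integer-array indexing call. This is not what the notebook does below; the variable names and the small matrices are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy lexeme features: 3 lexemes x 4 features (stand-in for q1lexemes).
lexeme_features = csr_matrix(np.array([
    [7.0, 0.0, 0.0, 1.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
]))
# Toy session numerics: 5 sessions x 2 columns (stand-in for d1np1).
main_numeric = csr_matrix(np.array([
    [1.0, 0.0],
    [0.0, 3.0],
    [2.0, 0.0],
    [0.0, 0.0],
    [4.0, 1.0],
]))
# Lexeme row for each session (stand-in for lexeme_map['index']).
row_positions = np.array([0, 0, 1, 2, 1])

# CSR supports integer-array row indexing, so all rows are gathered in one call.
expanded = lexeme_features[row_positions]

# Column-wise concatenation of session numerics and per-session lexeme features.
combined = hstack([main_numeric, expanded], format="csr")
print(combined.shape)  # (5, 6)
```

The result stays sparse throughout, so the dense 7.3 million row lexeme block is never materialized.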

There are still some non-numeric columns that need to be converted to binaries or deleted. We will do this first.

In [20]:
q1lexemes.select_dtypes('object').head()

Unnamed: 0,lexeme_id,learning_language,lexeme_string,surface_form,lemma_form,pos,modstrings
0,73eecb492ca758ddab5371cf7b5cca32,es,bajo/bajo<pr>,bajo,bajo,pr,[]
1,c84476c460737d9fb905dca3d35ec995,es,niños/niño<n><m><pl>,niños,niño,n,"[m, pl]"
2,1a913f2ded424985b9c02d0436008511,es,leo/leer<vblex><pri><p1><sg>,leo,leer,vblex,"[pri, p1, sg]"
3,38b770e66595fea718366523b4f7db3f,es,libro/libro<n><m><sg>,libro,libro,n,"[m, sg]"
4,4bdb859f599fa07dd5eecdab0acc2d34,es,a/a<pr>,a,a,pr,[]


Converting pos to binaries. 

In [21]:
pos_dummies = pd.get_dummies(q1lexemes['pos'],prefix='pos')
q1lexemes = pd.concat([q1lexemes,pos_dummies],axis=1)

In [22]:
q1lexemes = q1lexemes.drop(['lexeme_id','lexeme_string','learning_language','surface_form','lemma_form','pos','modstrings'],axis=1)

In [23]:
q1lexemes.select_dtypes('object')

0
1
2
3
4
...
12155
12156
12157
12158
12159


No strings left in the q1lexemes dataframe. We can now convert to coo matrix. 

In [24]:
q1lexemes.size/1000000

27.85856

In [25]:
q1lexemecoo = coo_matrix(np.array(q1lexemes))

In [26]:
q1lexemecoo.size/1000000

0.151574

Only about 0.5% of the 27.9 million entries in the lexeme feature matrix are nonzero (about 0.15 million stored values), so the sparse form is far more compact here than it was for the main dataframe.

Failed attempt at row-wise concatenation:

In [56]:
#Very sparse matrix. now we will index a row.
#q1lexemecoo.getrow(0)

# Will go row-wise through lexeme_map, fetch the lexeme row for each index, and stack it.
# Note: stacking rows needs vstack (hstack would append columns instead).
# from scipy.sparse import vstack

# lexemes_coo  = q1lexemecoo.getrow(lexeme_map.iloc[0,1])

# for i,row in lexeme_map.tail(-1).iterrows():
#   if i%1000 == 0:
#     print(i,' rows done',end='\r')
#   lexemes_coo = vstack((lexemes_coo,q1lexemecoo.getrow(row['index'])))
# lexemes_coo.shape

<1x2291 sparse matrix of type '<class 'numpy.float64'>'
	with 9 stored elements in Compressed Sparse Row format>

Converting to dictionary of keys

In [30]:
q1lexemecoo_dok = q1lexemecoo.todok(copy=True)

Let's look at the first 10 values of the dictionary.

In [36]:
list(q1lexemecoo_dok.items())[0:10]

[((0, 0), 7.0),
 ((0, 1), 4.0),
 ((0, 87), 1.0),
 ((0, 141), 0.35263237459267294),
 ((0, 144), 0.4122115155582215),
 ((0, 275), 0.28469397919105377),
 ((0, 276), 0.3929244102341387),
 ((0, 1027), 0.32488787493717813),
 ((0, 2280), 1.0),
 ((1, 0), 20.0)]

We can return the first row as: 

In [41]:
list(q1lexemecoo_dok[0].items())

[((0, 0), 7.0),
 ((0, 1), 4.0),
 ((0, 87), 1.0),
 ((0, 141), 0.35263237459267294),
 ((0, 144), 0.4122115155582215),
 ((0, 275), 0.28469397919105377),
 ((0, 276), 0.3929244102341387),
 ((0, 1027), 0.32488787493717813),
 ((0, 2280), 1.0)]

We can create a new coo matrix from a dictionary of keys we defined.

In [46]:
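dokdict = dict()
dokdict[(0,0)] = 1
dokdict[(1,1)] = 1

# Split the {(row, col): value} dictionary into the three sequences that coo_matrix expects.
rows = [key[0] for key in dokdict]
cols = [key[1] for key in dokdict]
vals = list(dokdict.values())

pass_to_coo = [(val, key) for (key, val) in dokdict.items()]
print(pass_to_coo)

spmatrix = coo_matrix((vals, (rows, cols)))

[(1, (0, 0)), (1, (1, 1))]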

We can construct the dictionary by using the 'index' column of lexeme_map to change each lexeme row number to the matching row number of the main dataframe.

The pseudocode is below:

Initialize a dictionary that will become the sparse matrix:
word_features = dict()

For each row in the lexemes dataframe:<br>
  >indexnumbers = Get the rows of the main dataframe with that word from the 'index' column<br>
  >for each number in indexnumbers:<br>
  >>copy the row of q1lexemes, and change the row number to be same as main dataframe row. 
    
Convert the dictionary to a sparse matrix. 
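
The pseudocode above could look roughly like the sketch below, shown on toy data. It loops over main dataframe rows rather than lexeme rows, which fills the same dictionary. All names other than word_features are hypothetical, and pure Python loops like this would likely be slow on 7.3 million rows.

```python
from collections import defaultdict
from scipy.sparse import coo_matrix

# Toy lexeme features as {(lexeme_row, col): value}, the same layout as q1lexemecoo_dok.
lexeme_dok = {(0, 0): 7.0, (0, 3): 1.0, (1, 1): 2.0, (2, 2): 0.5}
n_features = 4
# Lexeme row for each main dataframe row (stand-in for lexeme_map['index']).
row_positions = [0, 0, 1, 2, 1]

# Group the stored entries by lexeme row so each row can be copied quickly.
entries_by_lexeme = defaultdict(list)
for (lex_row, col), val in lexeme_dok.items():
    entries_by_lexeme[lex_row].append((col, val))

# Copy each lexeme row into the dictionary under the main dataframe row number.
word_features = dict()
for main_row, lex_row in enumerate(row_positions):
    for col, val in entries_by_lexeme[lex_row]:
        word_features[(main_row, col)] = val

# Convert the dictionary to a sparse matrix with an explicit shape.
rows = [key[0] for key in word_features]
cols = [key[1] for key in word_features]
vals = list(word_features.values())
word_matrix = coo_matrix((vals, (rows, cols)), shape=(len(row_positions), n_features))
print(word_matrix.toarray())
```

Passing shape explicitly matters here: without it, trailing rows or columns that contain only zeros would be dropped from the inferred shape.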

Another alternative is to get counts of p_forgot_bin and the number of sessions for each word, and model just the word features. This also opens up another way to reduce size: feature selection.

Pseudocode:
Transform delta for finer granularity and bin it. Group by lexeme_id and delta bin, and get the p_forgot_bin counts for 1 and 0. Open questions: should we also get the number of students for each delta bin, and in total? Should we also bin by history correct / history and group by both delta and the fraction of history correct?

Resulting table:
lexeme id, lexeme id features, p_forgot_1_counts list, p_forgot_0_counts list, delta_bins_list, history_frac_list
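
A minimal sketch of the grouping step on toy data follows. It assumes the main dataframe has columns named lexeme_id, delta, and p_forgot_bin, that delta is in seconds, and that a log10 transform with the bin edges shown is acceptable; all of these are assumptions for illustration, not choices made in the notebook.

```python
import numpy as np
import pandas as pd

# Toy sessions standing in for datadfq1 (column names are assumed).
sessions = pd.DataFrame({
    "lexeme_id": ["a", "a", "a", "b", "b", "b"],
    "delta": [60, 3600, 86400, 120, 7200, 604800],  # assumed to be seconds
    "p_forgot_bin": [0, 0, 1, 0, 1, 1],
})

# Log-transform delta for finer granularity at short intervals, then bin it.
sessions["log_delta"] = np.log10(sessions["delta"])
bin_edges = [0, 2, 4, 6]  # hypothetical edges
sessions["delta_bin"] = pd.cut(sessions["log_delta"], bins=bin_edges)

# Count records and forgotten records per lexeme and delta bin.
summary = (
    sessions.groupby(["lexeme_id", "delta_bin"], observed=True)
    .agg(
        num_records=("p_forgot_bin", "size"),
        p_forgot_1_count=("p_forgot_bin", "sum"),
    )
    .reset_index()
)
summary["p_forgot_0_count"] = summary["num_records"] - summary["p_forgot_1_count"]
print(summary)
```

The summary has one row per lexeme and delta bin, so it could be merged with the 12,160-row lexeme features instead of the 7.3 million row dataframe.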