<a href="https://colab.research.google.com/github/rts1988/Duolingo_spaced_repetition/blob/main/2A_WordVectors_Duolingo_spaced_repetition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <font color = 'cornflowerblue' size=4>Getting multilingual word vectors and partially pre-processing them</font>

In [None]:
import bz2
import pickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from google.colab import drive
drive.mount('/content/drive')

def decompress_pickle(file):
  # Load an object from a bz2-compressed pickle; the file is closed on exit.
  with bz2.BZ2File(file, 'rb') as f:
    return pickle.load(f)

def compressed_pickle(title, data):  # do not add extension in filename
  # Write data to title + '.pbz2' as a bz2-compressed pickle.
  with bz2.BZ2File(title + '.pbz2', 'w') as f:
    pickle.dump(data, f)

path_name = '/content/drive/MyDrive/'

Mounted at /content/drive
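
A minimal round trip through the two helpers, assuming a temporary directory in place of Google Drive. The helpers are restated here with the standard `pickle` module so the snippet runs on its own; the `toy_embeddings` file name and its contents are made up for illustration.

```python
import bz2
import os
import pickle
import tempfile

def compressed_pickle(title, data):
    # Writes to title + '.pbz2', so the caller passes the name without extension.
    with bz2.BZ2File(title + '.pbz2', 'w') as f:
        pickle.dump(data, f)

def decompress_pickle(file):
    # Expects the full file name, including the '.pbz2' extension.
    with bz2.BZ2File(file, 'rb') as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp_dir:
    title = os.path.join(tmp_dir, 'toy_embeddings')  # hypothetical file name
    compressed_pickle(title, {'frau': [0.1, 0.2]})
    restored = decompress_pickle(title + '.pbz2')
```

Note the asymmetry: the writer appends `.pbz2` itself, while the reader needs the extension spelled out.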


## <font color='cornflowerblue' size= 3>Getting multilingual word vectors</font>

Multilingual word vectors:
https://www.cs.cmu.edu/~afm/projects/multilingual_embeddings.html

fastText was not used because it took too long to load (~15 min per language) and the files were too large (4 GB per language):
https://github.com/babylonhealth/fastText_multilingual

In [None]:
!tar -xvf /content/drive/MyDrive/multilingual_embeddings.tar.gz

multilingual_embeddings.ar
multilingual_embeddings.de
multilingual_embeddings.en
multilingual_embeddings.es
multilingual_embeddings.fr
multilingual_embeddings.it
multilingual_embeddings.nl
multilingual_embeddings.pb
multilingual_embeddings.pl
multilingual_embeddings.ro
multilingual_embeddings.ru
multilingual_embeddings.tr


Arabic - ar
Brazilian Portuguese - pb
Dutch - nl
English - en
French - fr
German - de
Italian - it
Polish - pl
Romanian - ro
Russian - ru
Spanish - es
Turkish - tr


In [None]:
!ls

drive			    multilingual_embeddings.nl
multilingual_embeddings.ar  multilingual_embeddings.pb
multilingual_embeddings.de  multilingual_embeddings.pl
multilingual_embeddings.en  multilingual_embeddings.ro
multilingual_embeddings.es  multilingual_embeddings.ru
multilingual_embeddings.fr  multilingual_embeddings.tr
multilingual_embeddings.it  sample_data


In [None]:

import re

# First attempt: keep the raw string tokens for each German word.
german_embeddings = dict()
with open('multilingual_embeddings.de','r') as f1:
  lines = f1.readlines()
for line in lines:
  splitline = re.split(r'\s+|\n',line)
  word = splitline[0]
  emb  = splitline[1:]  # list of strings; the last element is an empty string
  german_embeddings[word] = emb


In [None]:
import re
def get_embeddings_dict(filename):
  embeddings = dict()
  with open(filename,'r') as f1:
    lines = f1.readlines()
  for line in lines:
    # Splitting gives a list of strings: the word, its vector values,
    # and a trailing empty string left by the newline.
    splitline = re.split(r'\s+|\n',line)
    word = splitline[0]
    emb = np.array([float(n) for n in splitline[1:-1]]) # drop the trailing empty string and convert the rest to float
    embeddings[word] = emb

  return embeddings


In [None]:
german_embeddings = get_embeddings_dict('multilingual_embeddings.de')
compressed_pickle(path_name+"german_embeddings",german_embeddings)
french_embeddings = get_embeddings_dict('multilingual_embeddings.fr')
compressed_pickle(path_name+"french_embeddings",french_embeddings)
portuguese_embeddings = get_embeddings_dict('multilingual_embeddings.pb')
compressed_pickle(path_name+"portuguese_embeddings",portuguese_embeddings)
italian_embeddings = get_embeddings_dict('multilingual_embeddings.it')
compressed_pickle(path_name+"italian_embeddings",italian_embeddings)
english_embeddings = get_embeddings_dict('multilingual_embeddings.en')
compressed_pickle(path_name+"english_embeddings",english_embeddings)
spanish_embeddings = get_embeddings_dict('multilingual_embeddings.es')
compressed_pickle(path_name+"spanish_embeddings",spanish_embeddings)

In [None]:
list(german_embeddings.keys())[0:20]

['kürzerer',
 'abzuschalten',
 'beteiligen',
 'fundraising',
 'verwirrter',
 'markenrechtsverletzungen',
 'fachblatt',
 'versorgen',
 'familienzeit',
 'sapiens',
 'colaflasche',
 'bulgaren',
 'schneit',
 'vorbeigekommen',
 'spricht',
 'schlagzeuger',
 'männerbereiche',
 'ballonfahrer',
 '146',
 'kalash']

In [None]:
german_embeddings['frau'].shape

(300,)
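
Checking one word only shows that `frau` has 300 dimensions. A quick sanity check over the whole dictionary could look like the sketch below. `vector_length_counts` and the toy dictionary are hypothetical and not part of the notebook; in practice the function would be called on `german_embeddings`.

```python
from collections import Counter

import numpy as np

def vector_length_counts(embeddings):
    """Count how many vectors there are of each length."""
    return Counter(len(vec) for vec in embeddings.values())

# Toy stand-in for an embeddings dictionary; 'kaputt' is deliberately one value short.
toy_embeddings = {
    'frau': np.zeros(300),
    'mann': np.zeros(300),
    'kaputt': np.zeros(299),
}
length_counts = vector_length_counts(toy_embeddings)
print(length_counts)
```

More than one key in the result would point to lines that were parsed with a missing or extra value.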

## <font color = 'cornflowerblue' size=3>Combining with lexeme_id</font>

The word vectors for all languages have been saved to files above. 

They are combined with the lexeme_id below by the following steps:

1. Create a separate dataframe lexeme_vec indexed by surface_form, with columns for lexeme_id, lemma form and learning language.
2. For each language, map the surface form column of the lexeme_vec dataframe to the respective word embeddings dictionary, in a separate dataframe.
3. Check how many words have vectors.
4. Deal with missing values, either by mapping the lemma form instead or by other forms of imputation.
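
The steps above can be sketched on toy data as follows. This is only an illustration under assumptions: `lookup_vectors`, the three-row dataframe and the two-dimensional vectors are made up for the example, while the notebook itself works on `all_lexemes` with 300-dimensional vectors.

```python
import numpy as np
import pandas as pd

def lookup_vectors(df, embeddings_by_lang):
    """Return one vector per row: surface form first, lemma form as fallback, None if both miss."""
    vectors = []
    for lang, surface, lemma in zip(df['learning_language'], df['surface_form'], df['lemma_form']):
        emb = embeddings_by_lang.get(lang, {})
        vec = emb.get(surface)
        if vec is None:
            vec = emb.get(lemma)  # fall back to the lemma form
        vectors.append(vec)
    return pd.Series(vectors, index=df.index, dtype=object)

# Toy data: 'lernt' has a vector, 'gehst' only through its lemma, 'xyzzy' not at all.
toy_lexemes = pd.DataFrame({
    'learning_language': ['de', 'de', 'de'],
    'surface_form': ['lernt', 'gehst', 'xyzzy'],
    'lemma_form': ['lernen', 'gehen', 'xyzzy'],
})
toy_embeddings = {'de': {'lernt': np.array([0.1, 0.2]), 'gehen': np.array([0.3, 0.4])}}

toy_lexemes['vectorlist'] = lookup_vectors(toy_lexemes, toy_embeddings)
n_missing = int(toy_lexemes['vectorlist'].isna().sum())
print(n_missing)
```

Rows that find no vector under either form stay missing and are left for a later imputation step.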


Loading the saved embeddings and all_lexemes from Google Drive:

In [None]:
german_embeddings = decompress_pickle(path_name+"german_embeddings.pbz2")
french_embeddings = decompress_pickle(path_name+"french_embeddings.pbz2")
italian_embeddings = decompress_pickle(path_name+"italian_embeddings.pbz2")
english_embeddings = decompress_pickle(path_name+"english_embeddings.pbz2")
spanish_embeddings = decompress_pickle(path_name+"spanish_embeddings.pbz2")

In [None]:
portuguese_embeddings = decompress_pickle(path_name+"portuguese_embeddings.pbz2")

In [None]:
all_lexemes = decompress_pickle(path_name+"Duolingo_all_lexemes.pbz2")

In [None]:
all_lexemes.head()

Unnamed: 0,lexeme_id,learning_language,lexeme_string,surface_form,lemma_form,pos,modstrings,sf_length,sf_translation,lf_translation,surface_form_no_accents,lemma_form_no_accents,L_dist_word_tup_sf_noaccents,L_dist_sf_noaccents,L_dist_sf_noaccents_norm,IDFword,EnglishIDF
0,76390c1350a8dac31186187e2fe1e178,de,lernt/lernen<vblex><pri><p3><sg>,lernt,lernen,vblex,"[pri, p3, sg]",5,learns,to learn,lernt,lernen,"(2, learns)",2,0.4,learns,6.981924
1,7dfd7086f3671685e2cf1c1da72796d7,de,die/die<det><def><f><sg><nom>,die,die,det,"[def, f, sg, nom]",3,the,the,die,die,"(2, the)",2,0.666667,the,0.00107
2,35a54c25a2cda8127343f6a82e6f6b7d,de,mann/mann<n><m><sg><nom>,mann,mann,n,"[m, sg, nom]",4,husband,husband,mann,mann,"(5, husband)",5,1.25,husband,1.258263
3,0cf63ffe3dda158bc3dbd55682b355ae,de,frau/frau<n><f><sg><nom>,frau,frau,n,"[f, sg, nom]",4,Mrs,Mrs,frau,frau,"(3, mrs)",3,0.75,Mrs,6.309499
4,84920990d78044db53c1b012f5bf9ab5,de,das/das<det><def><nt><sg><nom>,das,das,det,"[def, nt, sg, nom]",3,the,the,das,das,"(3, the)",3,1.0,the,0.00107


In [None]:
all_lexemes['vectorlist'] = None  # object column; None counts as missing in isna(), unlike ''
all_lexemes.loc[all_lexemes['learning_language']=='de','vectorlist'] = all_lexemes.loc[all_lexemes['learning_language']=='de','surface_form'].map(german_embeddings)

In [None]:
all_lexemes.loc[all_lexemes['learning_language']=='de','vectorlist'].head()

0    [-0.12485239869, 0.403838020553, -0.3502961037...
1    [0.0432688036856, -0.0914570974226, -0.1884847...
2    [-0.0895643525954, 0.208577167333, -0.06067782...
3    [0.0156379608103, 0.37910027723, -0.2013124868...
4    [0.0388741864242, 0.160898776301, -0.105828861...
Name: vectorlist, dtype: object

In [None]:
all_lexemes.loc[(all_lexemes['learning_language']=='de') & (all_lexemes['vectorlist'].isna()),:].shape

(355, 18)

In [None]:
all_lexemes.loc[(all_lexemes['learning_language']=='de') & (all_lexemes['vectorlist'].isna()),'vectorlist'] = all_lexemes.loc[(all_lexemes['learning_language']=='de') & (all_lexemes['vectorlist'].isna()),'lemma_form'].map(german_embeddings)

In [None]:
all_lexemes.loc[(all_lexemes['learning_language']=='de') & (all_lexemes['vectorlist'].isna()),:].shape

(62, 18)

After falling back to the lemma form, 62 German lexemes still have no embedding.

In [None]:
all_lexemes.loc[all_lexemes['learning_language']=='en','vectorlist'] = all_lexemes.loc[all_lexemes['learning_language']=='en','surface_form'].map(english_embeddings)
print("english embeddings surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='en') & (all_lexemes['vectorlist'].isna()),:].shape)
all_lexemes.loc[(all_lexemes['learning_language']=='en') & (all_lexemes['vectorlist'].isna()),'vectorlist'] = all_lexemes.loc[(all_lexemes['learning_language']=='en') & (all_lexemes['vectorlist'].isna()),'lemma_form'].map(english_embeddings)
print("english embeddings lemma+surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='en') & (all_lexemes['vectorlist'].isna()),:].shape)
# No imputation happens in this cell; rows still missing a vector are handled later.

all_lexemes.loc[all_lexemes['learning_language']=='es','vectorlist'] = all_lexemes.loc[all_lexemes['learning_language']=='es','surface_form'].map(spanish_embeddings)
print("spanish embeddings surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='es') & (all_lexemes['vectorlist'].isna()),:].shape)
all_lexemes.loc[(all_lexemes['learning_language']=='es') & (all_lexemes['vectorlist'].isna()),'vectorlist'] = all_lexemes.loc[(all_lexemes['learning_language']=='es') & (all_lexemes['vectorlist'].isna()),'lemma_form'].map(spanish_embeddings)
print("spanish embeddings lemma+surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='es') & (all_lexemes['vectorlist'].isna()),:].shape)

english embeddings surface form missing values (734, 18)
english embeddings lemma+surface form missing values (0, 18)
spanish embeddings surface form missing values (335, 18)
spanish embeddings lemma+surface form missing values (6, 18)


In [None]:
all_lexemes.loc[all_lexemes['learning_language']=='fr','vectorlist'] = all_lexemes.loc[all_lexemes['learning_language']=='fr','surface_form'].map(french_embeddings)
print("french embeddings surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='fr') & (all_lexemes['vectorlist'].isna()),:].shape)
all_lexemes.loc[(all_lexemes['learning_language']=='fr') & (all_lexemes['vectorlist'].isna()),'vectorlist'] = all_lexemes.loc[(all_lexemes['learning_language']=='fr') & (all_lexemes['vectorlist'].isna()),'lemma_form'].map(french_embeddings)
print("french embeddings lemma+surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='fr') & (all_lexemes['vectorlist'].isna()),:].shape)

french embeddings surface form missing values (1551, 18)
french embeddings lemma+surface form missing values (38, 18)


In [None]:
all_lexemes.loc[all_lexemes['learning_language']=='it','vectorlist'] = all_lexemes.loc[all_lexemes['learning_language']=='it','surface_form'].map(italian_embeddings)
print("italian embeddings surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='it') & (all_lexemes['vectorlist'].isna()),:].shape)
all_lexemes.loc[(all_lexemes['learning_language']=='it') & (all_lexemes['vectorlist'].isna()),'vectorlist'] = all_lexemes.loc[(all_lexemes['learning_language']=='it') & (all_lexemes['vectorlist'].isna()),'lemma_form'].map(italian_embeddings)
print("italian embeddings lemma+surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='it') & (all_lexemes['vectorlist'].isna()),:].shape)

italian embeddings surface form missing values (1138, 18)
italian embeddings lemma+surface form missing values (23, 18)


In [None]:
all_lexemes.loc[all_lexemes['learning_language']=='pt','vectorlist'] = all_lexemes.loc[all_lexemes['learning_language']=='pt','surface_form'].map(portuguese_embeddings)
print("portuguese embeddings surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='pt') & (all_lexemes['vectorlist'].isna()),:].shape)
all_lexemes.loc[(all_lexemes['learning_language']=='pt') & (all_lexemes['vectorlist'].isna()),'vectorlist'] = all_lexemes.loc[(all_lexemes['learning_language']=='pt') & (all_lexemes['vectorlist'].isna()),'lemma_form'].map(portuguese_embeddings)
print("portuguese embeddings lemma+surface form missing values",all_lexemes.loc[(all_lexemes['learning_language']=='pt') & (all_lexemes['vectorlist'].isna()),:].shape)

portuguese embeddings surface form missing values (1759, 18)
portuguese embeddings lemma+surface form missing values (47, 18)


In [None]:
all_lexemes.shape

(19279, 18)

In [None]:
all_lexemes.columns

Index(['lexeme_id', 'learning_language', 'lexeme_string', 'surface_form',
       'lemma_form', 'pos', 'modstrings', 'sf_length', 'sf_translation',
       'lf_translation', 'surface_form_no_accents', 'lemma_form_no_accents',
       'L_dist_word_tup_sf_noaccents', 'L_dist_sf_noaccents',
       'L_dist_sf_noaccents_norm', 'IDFword', 'EnglishIDF', 'vectorlist'],
      dtype='object')

In [None]:
all_lexemes = all_lexemes.drop(['learning_language', 'lexeme_string', 'surface_form',
       'lemma_form', 'pos', 'modstrings', 'sf_length', 'sf_translation',
       'lf_translation', 'surface_form_no_accents', 'lemma_form_no_accents',
       'L_dist_word_tup_sf_noaccents', 'L_dist_sf_noaccents',
       'L_dist_sf_noaccents_norm', 'IDFword', 'EnglishIDF'],axis=1)

In [None]:
all_lexemes.columns

Index(['lexeme_id', 'vectorlist'], dtype='object')

In [None]:
compressed_pickle(path_name+"Duolingo_wordvectors",all_lexemes)

In [None]:
all_lexemes[all_lexemes['vectorlist'].isna()].shape[0]

176

Out of 19,279 lexemes, 176 have no word vector.
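
As a quick consistency check, the per-language counts printed after the lemma-form fallback add up to this total, which is just under 1% of all lexemes. The dictionary below only restates those printed counts; the variable names are made up for the example.

```python
# Remaining missing counts per language after the lemma-form fallback,
# copied from the cell outputs above.
missing_after_fallback = {'de': 62, 'en': 0, 'es': 6, 'fr': 38, 'it': 23, 'pt': 47}
total_lexemes = 19279

total_missing = sum(missing_after_fallback.values())
missing_share = total_missing / total_lexemes
print(total_missing, f"{missing_share:.2%}")
```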

The lexemes in the train sets will be used to impute the missing vectors in the test sets.
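
The imputation itself is not shown in this notebook. One possible approach, sketched here under the assumption that missing vectors are replaced by the centroid (mean vector) of the training lexemes, is the following. `fit_centroid`, `impute_with_centroid` and the two-dimensional toy vectors are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_centroid(train_vectors):
    """Mean of all vectors that are present in the training set."""
    present = [v for v in train_vectors if isinstance(v, np.ndarray)]
    return np.mean(np.stack(present), axis=0)

def impute_with_centroid(vectors, centroid):
    """Replace every missing entry (anything that is not an array) with the centroid."""
    filled = [v if isinstance(v, np.ndarray) else centroid for v in vectors]
    return pd.Series(filled, index=vectors.index, dtype=object)

# Toy stand-ins for the 'vectorlist' column of a train and a test split.
train_vectors = pd.Series([np.array([1.0, 3.0]), np.array([3.0, 5.0]), None], dtype=object)
test_vectors = pd.Series([None, np.array([0.0, 0.0])], dtype=object)

centroid = fit_centroid(train_vectors)  # computed from the training split only
test_filled = impute_with_centroid(test_vectors, centroid)
```

Computing the centroid from the training split alone keeps information from the test lexemes out of the imputed values.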