# Simple Substitution
`w266 Final Project: Crosslingual Word Embeddings`

The code in this notebook was used to develop an algorithm to generate crosslingual word embeddings by training on a monolingual corpus and substituting translations at runtime.

# Notebook Setup

In [1]:
# general imports
from __future__ import print_function
import time
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# tell matplotlib not to open a new window
%matplotlib inline

# autoreload modules
%load_ext autoreload
%autoreload 2

In [2]:

## Maya's paths
BASE = '/Users/mmillervedam/Documents/MIDS/w266' #'/home/mmillervedam/' 
PROJ = '/Users/mmillervedam/Documents/MIDS/w266/FinalProject'#'/home/mmillervedam/ProjectRepo'

## Roseanna's paths


## Mona's local paths
#BASE = '/Users/mona/OneDrive/repos/Data' #'/home/mmillervedam/Data'
#PROJ = '/Users/mona/OneDrive/repos/final_proj/W266-Fall-2017-Final-Project'#'/home/mmillervedam/ProjectRepo'


## Repo paths
FPATH_EN = BASE + '/Data/test/wiki_en_10K.txt' # first 10000 lines from wiki dump
FPATH_ES = BASE + '/Data/test/wiki_es_10K.txt' # first 10000 lines from wiki dump
EN_ES_DICT = PROJ +'/XlingualEmb/data/dicts/en.es.panlex.all.processed'
EN_IT_DICT  = PROJ +'/XlingualEmb/data/dicts/en.it.panlex.all.processed'
EN_IT_RAW = PROJ + '/XlingualEmb/data/mono/en_it.shuf.10k'
FULL_EN = BASE + '/Data/en/full.txt'
FULL_ES = BASE + '/Data/es/full.txt'
FULL_IT = BASE + '/Data/it/full.txt'

## Large datasets
FULL_EN_ES = "./shuffled_files/en_es_shuf.txt"
FULL_EN_IT = "./shuffled_files/en_it_shuf.txt"

In [3]:
# directory to save pickled embeddings
SAVE_TO = PROJ + '/Notebooks/embeddings'

# Load & Preprocess Data
__`ORIGINAL AUTHORS SAY:`__ "Normally, the monolingual word embeddings are trained on billions of words. However, getting that much of monolingual data for a low-resource language is also challenging. That is why we only select the top 5 million sentences (around 100 million words) for each language." - _Section 5.1, Duong et al._

In [4]:
from parsing import Corpus, Vocabulary, batch_generator

### Corpus

In [5]:
# load corpus
en_it_data = Corpus(EN_IT_RAW)

In [6]:
# Corpus Stats
!wc {EN_IT_RAW}

   20000  430928 3746786 /Users/mmillervedam/Documents/MIDS/w266/FinalProject/XlingualEmb/data/mono/en_it.shuf.10k


__`i.e.:`__ 20K sentences (10K in each language) with ~430K tokens
> So this must not be their full data. For now, I'm just going to look at the top 20K words and see what happens. In reality we should probably modify the Vocab class so that it explicitly collects the top words for each language separately and then concatenates the index.
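
The per-language idea in the note above could be sketched roughly as follows. This is not the `Vocabulary` class from `parsing`: `build_per_language_index`, `size_per_lang` and the `specials` tuple are hypothetical names, and the sketch assumes every token carries a language prefix such as `en_` or `it_`, as the tokens in this corpus appear to.

```python
from collections import Counter, defaultdict


def build_per_language_index(tokens, size_per_lang,
                             specials=("<s>", "</s>", "<unk>")):
    """Keep the top `size_per_lang` words of each language, then concatenate.

    Hypothetical helper: assumes tokens look like 'en_word' / 'it_parola'.
    """
    counts = defaultdict(Counter)
    for tok in tokens:
        lang, sep, _ = tok.partition("_")
        if sep:  # ignore tokens that have no language prefix
            counts[lang][tok] += 1

    index = list(specials)  # special tokens get the lowest ids
    for lang in sorted(counts):
        index.extend(w for w, _ in counts[lang].most_common(size_per_lang))
    return {word: i for i, word in enumerate(index)}


toy = "en_the en_cat en_the it_il it_gatto it_il it_il en_dog".split()
word_to_id = build_per_language_index(toy, size_per_lang=2)
```

Counting each language separately keeps a large English corpus from crowding Italian words out of a shared top-N cutoff.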

### Dictionary

In [7]:
# loading english-italian dictionary
pld = pd.read_csv(EN_IT_DICT, sep='\t', names = ['en', 'it'], dtype=str)
en_set = set(pld.en.unique())
it_set = set(pld.it.unique())

In [8]:
# dictionary vocab lengths:
print('EN:', len(en_set))
print('IT:', len(it_set))

EN: 266450
IT: 258641


In [9]:
# Create dictionary for ease of runtime translation
# WARNING this takes a sec to run
bi_dict = pld.groupby(['en'])['it'].unique().to_dict()

In [10]:
# add other direction
# WARNING this takes another sec to run
bi_dict.update(pld.groupby(['it'])['en'].unique().to_dict())

In [11]:
# demo en to it
bi_dict['en_go'][:5]

array(['it_aggirare', 'it_andai', 'it_andara', 'it_andare',
       'it_andare_avanti'], dtype=object)

In [12]:
# demo it to en
bi_dict['it_ciao'][:5]

array(['en_adieu', 'en_bye-bye', 'en_bye', 'en_cheerio', 'en_ciao'], dtype=object)
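
With both directions stored in `bi_dict`, substituting a translation at runtime could look something like the sketch below. It is only an illustration of the idea: `substitute`, `in_vocab` and the toy data are hypothetical, and the `BiW2V_random` model used later may implement the swap differently.

```python
import random


def substitute(word, bi_dict, in_vocab, rng=random):
    """Return a random in-vocabulary translation of `word`, or `word` itself."""
    candidates = [t for t in bi_dict.get(word, ()) if in_vocab(t)]
    return rng.choice(candidates) if candidates else word


# toy stand-ins for bi_dict and the corpus vocabulary
toy_dict = {"en_go": ["it_andare", "it_aggirare"],
            "it_ciao": ["en_bye", "en_ciao"]}
toy_vocab = {"en_go", "it_andare", "it_ciao", "en_bye"}

rng = random.Random(0)  # seeded so repeated runs pick the same translations
sentence = ["en_go", "en_home", "it_ciao"]
swapped = [substitute(w, toy_dict, toy_vocab.__contains__, rng)
           for w in sentence]
```

Filtering candidates against the vocabulary matters here because most dictionary entries never occur in the corpus, so an unfiltered choice would often map to an unknown word.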

### Vocabulary

In [13]:
# train multilingual Vocabulary
en_it_vocab = Vocabulary(en_it_data.gen_tokens(), size = 50000)

In [14]:
# length of corpus vocabulary
en_it_vocab.size

50000

In [15]:
# overlap with dictionary vocabulary
len([w for w in en_it_vocab.types if w in bi_dict])

10607

__`Question for the group:`__ *Seems low?* Will we limit our vocab to the words in the dict? (MI)
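
One way to make the question more concrete is to split the overlap by language, since the two halves of the vocabulary may be covered very differently. A minimal sketch, assuming the `en_` / `it_` prefix convention; `coverage_by_language` is a hypothetical helper and the data below is a toy stand-in for `en_it_vocab.types` and `bi_dict`.

```python
from collections import Counter


def coverage_by_language(types, bi_dict):
    """Return {lang: (types found in the dictionary, total types)}."""
    covered, total = Counter(), Counter()
    for w in types:
        lang, sep, _ = w.partition("_")
        if not sep:  # skip special tokens such as <s>
            continue
        total[lang] += 1
        if w in bi_dict:
            covered[lang] += 1
    return {lang: (covered[lang], total[lang]) for lang in total}


toy_types = ["<s>", "en_the", "en_zzz", "it_il"]
toy_bi_dict = {"en_the": ["it_il"], "it_il": ["en_the"]}
stats = coverage_by_language(toy_types, toy_bi_dict)
```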

#### Sample of orphaned words

In [16]:
def print_orphans(vocab, bi_dict):
    x = 1
    for w in vocab:
        if w not in bi_dict:
            print(w)
            x += 1
        if x > 20:
            break
            
print_orphans(en_it_vocab.types, bi_dict )
            

it_spunti
it_[[879051]]
it_giostrando
it_subsessile
it_janue
it_[[878172]]
it_raffiguranti
it_promettenti
it_raffigurante
it_margaria
it_gallizio
it_".
it_macchiano
it_ripubblicate
it_peppers
it_mattonelle
it_incendiata
it_contemporanea
it_senyor
it_contemporanei


### Vocabulary with full corpus

In [29]:
# train multilingual Vocabulary
# NOTE: use FULL_EN_IT if on the instance
en_it_data = Corpus(EN_IT_RAW)

In [19]:
en_it_vocab = Vocabulary(en_it_data.gen_tokens())

In [20]:
# length of corpus vocabulary
en_it_vocab.size

50709

In [21]:
# overlap with dictionary vocabulary
len([w for w in en_it_vocab.types if w in bi_dict])

10729

In [22]:
# sample of orphaned words
print_orphans(en_it_vocab.types, bi_dict )

it_spunti
it_[[879051]]
it_giostrando
it_subsessile
it_janue
it_[[878172]]
it_raffiguranti
it_promettenti
it_raffigurante
it_margaria
it_gallizio
it_".
it_macchiano
it_ripubblicate
it_peppers
it_mattonelle
it_incendiata
it_contemporanea
it_senyor
it_contemporanei


### CBOW Data Generator
__`CHECK PAPER for HYPERPARAMS!`__: I can't seem to find where they talk about the context window size, embedding size and batch size they use -- it may actually be in the Vulic and Moens paper instead of the Duong one.

__`RLH Update`__: Duong et al. section 6, footnote 4: "Default learning rate of 0.025, negative sampling with 25 samples, subsampling rate of value 1e−4, embedding dimension d = 200, window size cs = 48 and run for 15 epochs"


In [24]:
BATCH_SIZE = 5
WINDOW_SIZE = 2
MAX_EPOCHS = 1 # fail safe

In [25]:
batched_data = batch_generator(en_it_data, 
                               en_it_vocab, 
                               BATCH_SIZE, 
                               WINDOW_SIZE, 
                               MAX_EPOCHS)

In [26]:
# sanity check
for context, label in batched_data:
    print("CONTEXT IDS:", context[:5])
    print("CONTEXT:", [en_it_vocab.to_words(c) for c in context[:5]])
    print("LABEL IDS:", label[:5])
    print("LABELS:", en_it_vocab.to_words(label[:5]))
    break

CONTEXT IDS: [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 34, 15624], [0, 20, 15624, 1584], [20, 34, 1584, 309]]
CONTEXT: [['<s>', '<s>', '</s>', '</s>'], ['<s>', '<s>', '</s>', '</s>'], ['<s>', '<s>', 'it_un', 'it_remoto'], ['<s>', 'it_in', 'it_remoto', 'it_passato'], ['it_in', 'it_un', 'it_passato', 'it_aveva']]
LABEL IDS: [43790, 24849, 20, 34, 15624]
LABELS: ['it_[[877881]]', 'it_[[879362]]', 'it_in', 'it_un', 'it_remoto']
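
The sanity-check output above suggests that each sentence is padded with `WINDOW_SIZE` start tokens (id 0) and end tokens (id 1), so every context has exactly `2 * WINDOW_SIZE` ids. The sketch below reproduces that behavior for a single sentence. It is inferred from the output only, not copied from `batch_generator`, and `cbow_pairs` is a hypothetical name.

```python
def cbow_pairs(sentence_ids, window, start_id=0, end_id=1):
    """Yield (context, label) pairs with a fixed context length of 2 * window."""
    padded = [start_id] * window + list(sentence_ids) + [end_id] * window
    for i in range(window, len(padded) - window):
        left = padded[i - window:i]            # `window` ids before the label
        right = padded[i + 1:i + 1 + window]   # `window` ids after the label
        yield left + right, padded[i]


# ids of "it_in it_un it_remoto it_passato it_aveva" from the output above
pairs = list(cbow_pairs([20, 34, 15624, 1584, 309], window=2))
single = list(cbow_pairs([43790], window=2))  # a one-token line
```

Because the context length is fixed by the padding, a batch is always a rectangular array of shape `(batch_size, 2 * window)`.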


# Fun Validation Words

In [27]:
en_it_vocab.to_ids(['en_the','en_first', 'it_nuovo', 'it_parola'])

[3, 84, 669, 6646]

In [28]:
bi_dict['en_the']

array(['it_della', 'it_gli', 'it_i', 'it_il', 'it_la', 'it_l\xc3\xa0',
       'it_le', 'it_lo', 'it_ma'], dtype=object)

In [29]:
bi_dict['en_first']

array(['it_anteriore', 'it_anteriormente', 'it_antico', 'it_anzitutto',
       'it_anzi_tutto', 'it_avvio', 'it_dapprima', 'it_davanti',
       'it_di_fronte', 'it_il_primo', 'it_in', 'it_in_cima', 'it_inizio',
       'it_innanzitutto', 'it_innanzi_tutto', 'it_in_primis',
       'it_in_primo_luogo', 'it_la_prima', 'it_per_la_prima_volta',
       'it_per_primo', 'it_precedente', 'it_prima', 'it_prima_di_tutto',
       'it_primariamente', 'it_primario', 'it_prime', 'it_primieramente',
       'it_primiero', 'it_primo', 'it_principale', 'it_principio'], dtype=object)

In [30]:
bi_dict['it_nuovo']

array(['en_fresh', 'en_green', 'en_latter-day', 'en_new', 'en_novel',
       'en_raw', 'en_recent', 'en_renewed', 'en_unexampled', 'en_unused',
       'en_young'], dtype=object)

In [31]:
bi_dict['it_parola']

array(['en_drake', 'en_language', 'en_mot', 'en_parole', 'en_promise',
       'en_-shaped', 'en_speech', 'en_term', 'en_tongue', 'en_verb',
       'en_vocable', 'en_word_of_honor', 'en_word'], dtype=object)

# Base Model - no word sub yet!
__`CODE NOTES:`__ To get this running I had to hard-code the context length (set to 2) inside `BuildCoreGraph()`, where we generate `self.input_` (line 102). That should really be inferred from `self.context_` itself, but it doesn't seem to like the placeholder dimension (we don't have a span length until runtime). Does TensorFlow not have a vectorized average? Something to fix (later). I also had to hard-code the number of samples for softmax. I had originally made this a `tf.placeholder_with_default`, thinking we could pass it in to the training function (since it's a training parameter), but TF kicked out an error message asking for an integer, so for now I'll just give it what it wants. I need to think more about why TF doesn't want this changing from batch to batch (or whether there is another reason it wants an int).

### Fresh Data Generator

In [23]:
BATCH_SIZE = 20
WINDOW_SIZE = 1
MAX_EPOCHS = 15 # fail safe

batched_data = batch_generator(en_it_data, en_it_vocab, BATCH_SIZE, 
                               WINDOW_SIZE, MAX_EPOCHS)

### Initialize the model

In [26]:
from models import BiW2V

EMBEDDING_SIZE = 200

# create model
model = BiW2V(index = en_it_vocab.index, H = EMBEDDING_SIZE)

# initialize TF graphs
model.BuildCoreGraph()
model.BuildTrainingGraph()
model.BuildValidationGraph()

... TF graph created for BiW2V model.
... TF graph created for BiW2V training.
... TF graph created for BiW2V validation.


### Training

__`IMPORTANT!`__ Right now the model only works with a window of 1 because the feed dict can't handle context windows of different lengths. We'll either need to figure out how to have a variable-length dimension or else add extra padding to the sentences to account for the window size.

In [27]:
# time
start = time.time()

# training parameters
TEST_WORDS = [3, 84, 669, 6646] # en_the, en_first, it_nuovo, it_parola
nBATCHES = 300000 # ~ 14 epochs
DATA_GENERATOR = batched_data

# training call
model.train(nBATCHES, DATA_GENERATOR, TEST_WORDS, learning_rate = 0.15)
tot = (time.time() - start)
print('... {} batches trained in {} seconds'.format(nBATCHES, tot))

... Model Initialized
	 <tf.Variable 'Embedding_Layer/ContextEmbeddings:0' shape=(50709, 200) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/WordEmbeddings:0' shape=(50709, 200) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/b:0' shape=(50709,) dtype=float32_ref>
... Starting Training
... STEP 0 : Average Loss : 9.23656622569e-05
   [it_,] sim words:  it_piazzati, it_un'iscrizione, it_sessanta, it_verità, it_luo, it_chieri, it_sancita, it_disastrosamente,
   [it_nelle] sim words:  it_solidale, it_percento, it_weiße, it_venerato, it_autori, it_moltissimi, it_organizzando, it_produrli,
   [it_fare] sim words:  it_cambio, it_chomsky, it_catalana, it_meccanismi, it_aiutare, it_agricola, it_incaricato, it_drawing,
   [it_manoscritto] sim words:  it_proietta, it_anticrisi, it_autolinee, it_sv, it_arroscia, it_forze, it_popolato, it_detenuti,
... STEP 30000 : Average Loss : 0.560785405807
... STEP 60000 : Average Loss : 0.353204154999
   [it_,] sim words:  it_piazzati, it_un'iscrizione, it_

__NOTES:__ This is just a context of 1 (i.e. window = 3) and there's no bilingual signal. When I ran it with the default learning rate there was mad overfitting for `the`'s neighbors, but `first` had some much better results (e.g. `third` and `only`). It would be interesting to really tune the hyperparameters to see how well we could do (this is essentially monolingual word2vec with two languages), as a point of comparison for the bilingual versions below.

In [28]:
# take a look at the embeddings
model.context_embeddings

array([[ -4.60977550e-04,   6.87121879e-04,  -6.07755675e-04, ...,
          1.04074273e-03,  -7.94071006e-04,  -4.09811037e-04],
       [  6.01218257e-04,  -1.73640612e-04,   1.50801498e-04, ...,
          9.76910815e-04,   4.34648187e-04,   1.50857595e-04],
       [  4.21131008e-05,   1.43417434e-04,  -4.42015793e-04, ...,
         -2.61730253e-04,   2.63188384e-04,  -3.19738436e-04],
       ..., 
       [ -5.24837349e-04,   7.47645172e-05,  -3.87505628e-04, ...,
          3.38180165e-04,   1.87452009e-04,  -4.23476391e-04],
       [ -2.05034637e-04,  -1.74783956e-04,   2.54430954e-04, ...,
         -1.96897527e-04,  -4.46256425e-04,  -4.10091743e-04],
       [  4.34216403e-04,   9.33892716e-06,  -4.76779445e-04, ...,
          1.30503067e-05,  -3.12790798e-05,  -4.18215437e-04]], dtype=float32)

__`Hmmmm...`__ These don't look normalized to me. Something to return to?
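
A quick way to check, and to fix it before comparing neighbors, is to scale every row to unit length so that a dot product equals cosine similarity. The sketch below uses plain Python lists so that it stands alone. On the real matrix the NumPy equivalent would be `E / np.linalg.norm(E, axis=1, keepdims=True)`. `l2_normalize` and `cosine` are hypothetical helpers, not part of `models`.

```python
import math


def l2_normalize(rows):
    """Scale each row to unit Euclidean length (all-zero rows are left as-is)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm > 0 else list(row))
    return out


def cosine(u, v):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(a * b for a, b in zip(u, v))


toy = [[3.0, 4.0], [6.0, 8.0], [0.0, 0.0]]
unit = l2_normalize(toy)
row_norms = [math.sqrt(sum(x * x for x in row)) for row in unit]
```

If the row norms of the trained matrix are far from 1, as the printed values suggest, nearest-neighbor lists based on raw dot products will favor long vectors over similar directions.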

# Model with Random Translation

### Fresh Data

In [57]:
BATCH_SIZE = 20
WINDOW_SIZE = 1
MAX_EPOCHS = 30 # fail safe

batched_data = batch_generator(en_it_data, en_it_vocab, BATCH_SIZE, 
                               WINDOW_SIZE, MAX_EPOCHS)

### Initialize

In [58]:
from models import BiW2V_random

EMBEDDING_SIZE = 128

# create model
model2 = BiW2V_random(('en', 'it'), bi_dict, en_it_vocab.to_ids,
                      index = en_it_vocab.index, 
                      H = EMBEDDING_SIZE)

# initialize TF graphs
model2.BuildCoreGraph()
model2.BuildTrainingGraph()
model2.BuildValidationGraph()

... TF graph created for BiW2V model.
... TF graph created for BiW2V training.
... TF graph created for BiW2V validation.


### Train

In [59]:
# training parameters
TEST_WORDS = [3, 84, 669, 6646] # en_the, en_first, it_nuovo, it_parola
nBATCHES = 600000 # ~ 25 epochs
DATA_GENERATOR = batched_data

In [60]:
# training call
start = time.time()
model2.train(nBATCHES, DATA_GENERATOR, TEST_WORDS, learning_rate = 0.15)
tot = (time.time() - start)
print('... {} batches trained in {} seconds'.format(nBATCHES, tot))

... Model Initialized
	 <tf.Variable 'Embedding_Layer/ContextEmbeddings:0' shape=(48579, 128) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/WordEmbeddings:0' shape=(48579, 128) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/b:0' shape=(48579,) dtype=float32_ref>
... Starting Training
... STEP  0 : Average Loss : 0.000111448200544
   [en_the] sim words:  en_psychical, en_slogan, it_balbo, it_race, en_1/4, it_regina, it_causando, it_[[879322]],
   [en_first] sim words:  it_interviú, it_fermarsi, it_tanfo, it_navate, en_delivers, en_sworn, it_avvocatura, it_fortificare,
   [it_nuovo] sim words:  en_dismissal, it_bevve, en_septa, en_implements, it_telegiornale, it_marcello, it_acciaio, it_roxx,
   [it_parola] sim words:  en_censorship, en_minima, it_profesional, it_ritengono, it_michail, en_الأول, it_mckenna, en_prevention,
... STEP  60000 : Average Loss : 4.30245931981
... STEP  120000 : Average Loss : 3.33011244143
   [en_the] sim words:  en_a, en_,, en_., en_in, en_and, en_'s, en_of,

__`NOTES:`__ Some words look reasonable in the English examples. I'd be interested in training this longer to see if that helps, but it's probably worth fixing the context window issue first.

In [70]:
# saving final embeddings in case we want to do more stuff later
filename = SAVE_TO + '/en_it_rand_600K_cw1_V_dec15.pkl'
with open(filename, 'wb') as f:
    # Pickle the context embeddings using the highest protocol available.
    pickle.dump(model2.context_embeddings, f, pickle.HIGHEST_PROTOCOL)

filename = SAVE_TO + '/en_it_rand_600K_cw1_U_dec15.pkl'
with open(filename, 'wb') as f:
    # Pickle the word embeddings using the highest protocol available.
    pickle.dump(model2.word_embeddings, f, pickle.HIGHEST_PROTOCOL)

In [73]:
# confirm:
filename = SAVE_TO + '/en_it_rand_600K_cw1_U_dec15.pkl'
with open(filename, 'rb') as f:
    C_embedding = pickle.load(f)
    
C_embedding

array([[ -1.42820040e-02,  -1.08593737e-03,  -9.77827888e-03, ...,
         -7.43078999e-03,  -9.79084056e-04,  -7.75979646e-03],
       [  6.15398725e-03,   3.99457384e-03,  -7.21636403e-04, ...,
         -4.45156882e-04,   4.76947054e-03,   4.01509507e-03],
       [  1.42526999e-03,  -2.62570567e-03,   6.83171733e-04, ...,
         -2.14850740e-03,  -6.21526386e-04,   8.00127018e-05],
       ..., 
       [ -4.83615768e-05,  -7.35577720e-04,   2.69850646e-03, ...,
         -1.72964076e-03,   2.55509047e-03,   6.92204339e-04],
       [  3.46941641e-03,   5.76911087e-04,   7.60798051e-04, ...,
          3.49387084e-03,   3.47503019e-03,  -1.87479728e-03],
       [ -7.38707022e-04,  -1.68911624e-03,  -2.75655207e-03, ...,
         -2.23345775e-03,  -2.73358473e-03,   1.52106478e-03]], dtype=float32)

# Random Translation & Larger Context Window

### Fresh Data

In [140]:
BATCH_SIZE = 20
WINDOW_SIZE = 8
MAX_EPOCHS = 30 # fail safe

batched_data = batch_generator(en_it_data, en_it_vocab, BATCH_SIZE, 
                               WINDOW_SIZE, MAX_EPOCHS)

### Initialize Model

In [141]:
from models import BiW2V_random

EMBEDDING_SIZE = 200

# create model
model3 = BiW2V_random(('en', 'it'), bi_dict, en_it_vocab.to_ids,
                      index = en_it_vocab.index, 
                      H = EMBEDDING_SIZE)

# initialize TF graphs
model3.BuildCoreGraph()
model3.BuildTrainingGraph()
model3.BuildValidationGraph()

... TF graph created for BiW2V model.
... TF graph created for BiW2V training.
... TF graph created for BiW2V validation.


### Train

In [142]:
# parameters
nBATCHES = 600000 # ~ 25 epochs
DATA_GENERATOR = batched_data
TEST_WORDS = [3, 84, 669, 6646]

# training call
model3.train(nBATCHES, DATA_GENERATOR, TEST_WORDS, learning_rate = 0.15)

... Model Initialized
	 <tf.Variable 'Embedding_Layer/ContextEmbeddings:0' shape=(48579, 200) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/WordEmbeddings:0' shape=(48579, 200) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/b:0' shape=(48579,) dtype=float32_ref>
... Starting Training
... STEP 0 : Average Loss : 0.000122350947062
   [en_the] sim words:  en_diakonoff, it_abraham, it_iuta, en_coups, en_griffith, en_alter, en_upward, en_1770s,
   [en_first] sim words:  it_firmato, en_boil, it_fiocina, it_vga, it_[[876681]], en_featural, it_risparmio, en_run,
   [it_nuovo] sim words:  it_sabana, it_dell’antimateria, en_stabbing, en_probes, it_hetman, it_rof, it_roseto, en_restarted,
   [it_parola] sim words:  it_d'artificio, it_rimanesse, en_infantile, it_branca, it_permesso, en_materialism, en_reunited, en_non-syndromal,
... STEP 60000 : Average Loss : 3.43592488782
... STEP 120000 : Average Loss : 2.32015497759
   [en_the] sim words:  en_counteracts, en_antiquated, en_tropics, it_1829,

__NOTES:__ Interesting, the larger context seems to hurt the training performance. Is this because of the extra padding? Or will a smart adjustment of the learning rate redeem this?
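
To get a feel for the padding question, the sketch below estimates what share of context slots is padding for one sentence of a given length. It assumes the generator pads each sentence with `WINDOW_SIZE` start and end tokens, which is inferred from the earlier sanity check rather than confirmed, and `padding_fraction` is a hypothetical helper.

```python
def padding_fraction(sentence_len, window):
    """Fraction of context slots filled by <s> or </s> for one sentence."""
    pads = 0
    for i in range(sentence_len):
        pads += max(0, window - i)                       # missing left neighbors
        pads += max(0, window - (sentence_len - 1 - i))  # missing right neighbors
    return pads / float(sentence_len * 2 * window)


AVG_LEN = 430928 // 20000  # about 21 tokens per line, from the `wc` output
for w in (1, 8):
    print("window", w, "padding fraction", round(padding_fraction(AVG_LEN, w), 3))
```

Under that assumption a 21-token sentence spends about 5% of its context slots on padding at `WINDOW_SIZE = 1` and about 21% at `WINDOW_SIZE = 8`, and shorter lines fare worse, so padding is at least a plausible contributor.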

# Visualization

In [106]:
wrds = "en_the en_a en_this en_'s en_an en_their en_its en_these en_his \
       en_first en_on en_in en_for en_to en_with en_are en_. en_all \
       it_nuovo it_di it_un it_, <s> it_i it_con it_è it_più it_parola \
       en_censorship en_minima it_profesional it_michail en_الأول \
       it_ritengono en_prevention it_mckenna en_third".split()

In [107]:
for w in ["en_the", "en_first", "it_nuovo", "it_parola"]:
    wrds += list(bi_dict[w])

In [108]:
wordset = set(en_it_vocab.to_ids(wrds))

In [110]:
model2.plot_embeddings_in_2D(wordset)

NameError: global name 'TSNE' is not defined
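
The `NameError` suggests that the module defining `plot_embeddings_in_2D` never imports `TSNE`. Assuming the method relies on scikit-learn's t-SNE (the source of `models.py` is not shown here, so this is a guess), the missing import and a standalone projection helper might look like the sketch below. `project_2d` is a hypothetical name, not part of the project code.

```python
try:
    from sklearn.manifold import TSNE  # the name the traceback reports missing
except ImportError:  # scikit-learn is not installed
    TSNE = None


def project_2d(vectors, seed=0):
    """Project embedding rows to 2-D with t-SNE (requires scikit-learn)."""
    if TSNE is None:
        raise RuntimeError("scikit-learn is required for t-SNE")
    n = len(vectors)
    # perplexity has to stay below the number of points being projected
    tsne = TSNE(n_components=2, perplexity=min(30, n - 1), random_state=seed)
    return tsne.fit_transform(vectors)
```

If that guess is right, adding the same `from sklearn.manifold import TSNE` line at the top of the models module should be enough for the plotting call to run.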