# Simple Substitution
`w266 Final Project: Crosslingual Word Embeddings`

The code in this notebook was used to develop an algorithm to generate crosslingual word embeddings by training on a monolingual corpus and substituting translations at runtime.

# Notebook Setup

In [1]:
# general imports
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# tell matplotlib not to open a new window
%matplotlib inline

# autoreload modules
%load_ext autoreload
%autoreload 2

In [9]:
# filepaths
BASE = '/Users/mmillervedam/Documents/MIDS/w266' #'/home/mmillervedam/' 
PROJ = '/Users/mmillervedam/Documents/MIDS/w266/FinalProject'#'/home/mmillervedam/ProjectRepo'
FPATH_EN = BASE + '/Data/test/wiki_en_10K.txt' # first 10000 lines from wiki dump
FPATH_ES = BASE + '/Data/test/wiki_es_10K.txt' # first 10000 lines from wiki dump
#FULL_EN = BASE + '/Data/en/full.txt'
#FULL_ES = BASE + '/Data/es/full.txt'
EN_ES_DICT = PROJ +'/XlingualEmb/data/dicts/en.es.panlex.all.processed'
EN_IT_DICT  = PROJ +'/XlingualEmb/data/dicts/en.it.panlex.all.processed'
EN_IT_RAW = PROJ + '/XlingualEmb/data/mono/en_it.shuf.10k'

# Load & Preprocess Data
__`ORIGINAL AUTHORS SAY:`__ "Normally, the monolingual word embeddings are trained on billions of words. However, getting that much of monolingual data for a low-resource language is also challenging. That is why we only select the top 5 million sentences (around 100 million words) for each language." - _Section 5.1, Duong et al._

In [3]:
from parsing import Corpus, Vocabulary, batch_generator

### Corpus

In [11]:
# load corpus
en_it_data = Corpus(EN_IT_RAW)

In [12]:
# Corpus Stats
!wc {EN_IT_RAW}

   20000  430928 3746786 /Users/mmillervedam/Documents/MIDS/w266/FinalProject/XlingualEmb/data/mono/en_it.shuf.10k


__`i.e.:`__ 20K sentences (10K in each language) with ~430K tokens
> So this must not be their full data. For now, I'm just going to look at the top 20K words and see what happens. In reality we should probably modify the Vocab class so that it explicitly collects the top words for each language separately and then concatenates the indices.
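
The per-language idea in the note above could look roughly like the sketch below. It is not the project's `Vocabulary` class: the function name `build_bilingual_index` is hypothetical, and it assumes every token carries a language prefix such as `_en_` or `_it_` (as the tokens in the training output do). Special tokens (start, end, unknown) are left out for brevity.

```python
from collections import Counter

def build_bilingual_index(tokens, size_per_lang, prefixes=("_en_", "_it_")):
    """Keep the top `size_per_lang` tokens of each language, then concatenate.

    Hypothetical helper: assumes each token starts with its language prefix.
    """
    counts = {p: Counter() for p in prefixes}
    for tok in tokens:
        for p in prefixes:
            if tok.startswith(p):
                counts[p][tok] += 1
                break  # a token belongs to at most one language

    index = {}
    for p in prefixes:  # fixed language order gives contiguous id blocks
        for tok, _ in counts[p].most_common(size_per_lang):
            index[tok] = len(index)
    return index

# toy token stream standing in for en_it_data.gen_tokens()
toy_tokens = ["_en_the", "_en_the", "_en_the", "_en_cat", "_en_cat", "_en_dog",
              "_it_il", "_it_il", "_it_il", "_it_gatto"]
toy_index = build_bilingual_index(toy_tokens, size_per_lang=2)
print(toy_index)
```

Capping each language separately keeps one language's frequent words from crowding the other out of a shared top-N list.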

### Dictionary

In [13]:
# loading english-italian dictionary
pld = pd.read_csv(EN_IT_DICT, sep='\t', names = ['en', 'it'], dtype=str)
en_set = set(pld.en.unique())
it_set = set(pld.it.unique())

In [14]:
# dictionary vocab lengths:
print('EN:', len(en_set))
print('IT:', len(it_set))

EN: 266450
IT: 258641
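
For substituting translations at runtime, the dictionary has to be turned into a word-to-translations lookup at some point. The sketch below is one possible shape for that, not the project's implementation: `build_translation_lookup` and `substitute` are hypothetical names, and picking a translation uniformly at random is a placeholder assumption. With the DataFrame above, the pairs could come from `zip(pld.en, pld.it)`.

```python
import random
from collections import defaultdict

def build_translation_lookup(pairs):
    """Map each source word to its list of distinct translations."""
    lookup = defaultdict(list)
    for src, tgt in pairs:
        if tgt not in lookup[src]:
            lookup[src].append(tgt)
    return dict(lookup)

def substitute(word, lookup, rng):
    """Return a translation of `word`, or `word` itself if none is known.

    Uniform random choice is an assumption made for this sketch.
    """
    candidates = lookup.get(word)
    return rng.choice(candidates) if candidates else word

# toy pairs standing in for zip(pld.en, pld.it)
toy_pairs = [("cat", "gatto"), ("cat", "micio"), ("dog", "cane"), ("cat", "gatto")]
toy_lookup = build_translation_lookup(toy_pairs)

rng = random.Random(0)
swapped = [substitute(w, toy_lookup, rng) for w in ["cat", "dog", "bird"]]
print(swapped)
```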


### Vocabulary

In [16]:
# train multilingual Vocabulary
en_it_vocab = Vocabulary(en_it_data.gen_tokens(), size = 100000)

### CBOW Data Generator
__`CHECK PAPER for HYPERPARAMS!`__: I can't seem to find where they talk about the context window size, embedding size, and batch size they use; it may actually be in the Vulic and Moens paper instead of the Duong one.

__`RLH Update`__: Duong et al. section 6, footnote 4: "Default learning rate of 0.025, negative sampling with 25 samples, subsampling rate of value 1e−4, embedding dimension d = 200, window size cs = 48 and run for 15 epochs"


In [24]:
BATCH_SIZE = 48
WINDOW_SIZE = 1
MAX_EPOCHS = 1 # fail safe

In [25]:
batched_data = batch_generator(en_it_data, 
                               en_it_vocab, 
                               BATCH_SIZE, 
                               WINDOW_SIZE, 
                               MAX_EPOCHS)

In [26]:
# sanity check
for context, label in batched_data:
    print("CONTEXT IDS:", context[:5])
    print("LABEL IDS:", label[:5])
    break

CONTEXT IDS: [[0, 1], [0, 1], [0, 34], [20, 17318], [34, 1638]]
LABEL IDS: [25668, 37957, 20, 34, 17318]
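
The output above is consistent with each sentence being padded with one start id (0) and one end id (1), and each word's label being paired with its left and right neighbours. The sketch below reproduces that pattern for a single sentence. It is a guess at the behaviour, not the code of `batch_generator`: the function name `cbow_pairs` and the padding ids are assumptions inferred from the printed ids.

```python
def cbow_pairs(sentence_ids, window=1, start_id=0, end_id=1):
    """Yield CBOW (context, label) pairs for one sentence of token ids.

    Assumes padding with `window` start ids and `window` end ids,
    so every context has exactly 2 * window entries.
    """
    padded = [start_id] * window + list(sentence_ids) + [end_id] * window
    contexts, labels = [], []
    for i in range(window, len(padded) - window):
        left = padded[i - window:i]
        right = padded[i + 1:i + 1 + window]
        contexts.append(left + right)
        labels.append(padded[i])
    return contexts, labels

contexts, labels = cbow_pairs([20, 34, 17318, 1638])
print("CONTEXT IDS:", contexts)
print("LABEL IDS:", labels)
```

A one-word sentence such as `[25668]` gives the context `[0, 1]`, which matches the first two rows of the sanity check.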


# Initialize Model
__`CODE NOTES:`__ To get this running I had to hard-code the context length (set to 2) inside `BuildCoreGraph()`, where we generate `self.input_` in line 102. That should really be inferred from `self.context_` itself, but it doesn't seem to like the placeholder dimension (we don't have a span length until runtime). Does TensorFlow not have a vectorized average? Something to fix later.

I also had to hard-code the number of samples for the softmax. I had originally made this a `tf.placeholder_with_default`, thinking we could pass it in to the training function (since it's a training parameter), but TF raised an error asking for an integer, so for now I'll just give it what it wants. I need to think more about why TF doesn't want this changing from batch to batch, or whether there is another reason it wants an int.
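
On the vectorized average: averaging over the context axis does not require knowing the context length ahead of time. The NumPy sketch below shows the operation on a toy embedding table; the array names and sizes are made up for illustration. In TensorFlow, `tf.reduce_mean` with `axis=1` applied to the result of `tf.nn.embedding_lookup` is the analogous call, though how it would slot into `BuildCoreGraph()` is not shown here.

```python
import numpy as np

rng = np.random.RandomState(0)
embeddings = rng.rand(10, 4)                      # toy table: 10 words, H = 4
context_ids = np.array([[0, 1], [0, 3], [2, 7]])  # shape (batch, context_len)

looked_up = embeddings[context_ids]               # shape (batch, context_len, H)
averaged = looked_up.mean(axis=1)                 # shape (batch, H)

print(averaged.shape)
```

The same two lines work unchanged for any `context_len`, because the reduction runs over whatever size axis 1 happens to have.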

In [76]:
from models import BiW2V

In [77]:
EMBEDDING_SIZE = 128

In [78]:
model = BiW2V(index = en_it_vocab.index, H = EMBEDDING_SIZE)

In [79]:
model.BuildCoreGraph()

In [80]:
model.BuildTrainingGraph()

In [81]:
model.BuildValidationGraph()

# Training

__`IMPORTANT!`__ Right now the model only works with a window of 1 because the feed dict can't handle context windows of different lengths. We'll either need to figure out how to have a variable-length dimension or else add extra padding to the sentences to account for the window size.
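
One way to feed contexts of different lengths is to pad them to a rectangular array and carry a mask that marks the real entries. The sketch below uses NumPy only; `pad_contexts` is a hypothetical helper, and using id 0 as the pad value is an assumption. The mask then lets the average ignore the padded slots.

```python
import numpy as np

def pad_contexts(contexts, pad_id=0):
    """Pad ragged context lists to one width; return (ids, mask)."""
    max_len = max(len(c) for c in contexts)
    ids = np.full((len(contexts), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(contexts), max_len), dtype=np.float32)
    for row, ctx in enumerate(contexts):
        ids[row, :len(ctx)] = ctx
        mask[row, :len(ctx)] = 1.0  # 1.0 marks a real token, 0.0 marks padding
    return ids, mask

ragged = [[5, 6], [7, 8, 9, 10], [11]]
ids, mask = pad_contexts(ragged)

# toy embedding table: 12 words, H = 2
emb = np.arange(24, dtype=np.float32).reshape(12, 2)

# masked mean: zero out padded slots, divide by the true context length
masked_mean = (emb[ids] * mask[:, :, None]).sum(axis=1) / mask.sum(axis=1, keepdims=True)
print(masked_mean)
```

The alternative named above, padding each sentence with extra start and end tokens, avoids the mask entirely because every context then has the same length.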

In [82]:
model.train(200, batched_data)

... Model Initialized
	 <tf.Variable 'Embedding_Layer/ContextEmbeddings:0' shape=(48579, 128) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/WordEmbeddings:0' shape=(48579, 128) dtype=float32_ref>
	 <tf.Variable 'Hidden_Layer/b:0' shape=(48579,) dtype=float32_ref>
... Starting Training
Average loss at step  0 :  0.39325568676
   Nearest to _en_the: _it_spitz, _en_diagnoses, _it_l'escrezione, _en_جابر, _it_latero, _it_doppiatore, _it_[[877696]], _en_mercedes-benz,
   Nearest to _en_,: _it_fondatori, _it_qum, _it_invadendo, _en_tm-3, _en_fence, _en_outdated, _it_dall'altra, _it_dragan,
   Nearest to _en_.: _it_rurale, _it_dovesse, _en_discuss, _it_oggetto, _it_progressiva, _it_ag4, _en_budgets, _it_strada,
   Nearest to _en_of: _it_pva, _it_angioletto, _it_controproducente, _en_unexpected, _en_brooke, _it_porro, _it_rientrato, _it_espresso,
   Nearest to _it_,: _en_founding, _en_horses, _it_dell'ungheria, _it_transcaspiana, _it_esistenza, _en_flow-on, _it_giorgio, _it_animati,
Average l