# Data Parsing Functions 
`MMV | 12/4 | w266 Final Project: Crosslingual Word Embeddings`   


The code in this notebook builds on the helper functions provided in the TensorFlow Word2Vec tutorial to develop a set of data handling functions for the data used in Duong et al.'s paper. Ideally I'll develop a scalable solution for tokenizing, prepending language indicators (e.g. `en_`), and extracting sentences to create training data that mixes sentences from two languages. I also hope to develop a batch iterator modeled after the one in A4 and do some EDA on the Panlex dictionaries in preparation for using them to modify Word2Vec for two languages. Depending on the available tools, I may need to look at a distributed system (Spark?) for preprocessing the English corpus, which is ~9 GB.
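
As a rough illustration of the tokenize-and-prefix step described above, here is a minimal sketch. The helper names (`prefix_tokens`, `tagged_sentences`) and the regex tokenizer are assumptions made for illustration, not the functions this notebook ends up with, and the input is assumed to be already decoded to unicode text.

```python
import re

# Hypothetical tokenizer: runs of word characters, unicode-aware.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)

def prefix_tokens(sentence, lang):
    """Lowercase a sentence, split it into word tokens, and prepend `lang_`."""
    return ['{}_{}'.format(lang, tok)
            for tok in TOKEN_RE.findall(sentence.lower())]

def tagged_sentences(lines, lang):
    """Lazily yield one prefixed token list per non-empty input line."""
    for line in lines:
        tokens = prefix_tokens(line, lang)
        if tokens:  # skip blank lines
            yield tokens
```

Because `tagged_sentences` is a generator, it can be fed an open file handle and never holds more than one line in memory, which matters for the larger corpus.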

# Notebook Set-up

In [1]:
# general imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import sys  
import math
import random
import sklearn
import numpy as np
import collections
import pandas as pd
import datetime as dt
import matplotlib
import matplotlib.pyplot as plt
import tensorflow as tf

# tell matplotlib not to open a new window
%matplotlib inline

In [2]:
# filepaths
BASE = '/home/mmillervedam/Data'
FPATH_EN = BASE + '/test/wiki_en_10K.txt' # first 10000 lines from wiki dump
FPATH_ES = BASE + '/test/wiki_es_10K.txt' # first 10000 lines from wiki dump
FULL_EN = BASE + '/en/full.txt'
FULL_ES = BASE + '/es/full.txt'
DPATH = '/home/mmillervedam/ProjectRepo/XlingualEmb/data/dicts/en.es.panlex.all.processed'
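
The 10K test files are described as the first 10000 lines of each wiki dump. One way such a sample might be produced without reading a multi-gigabyte file into memory is to stream it line by line. `write_head` is a hypothetical helper, and UTF-8 encoding of the dumps is an assumption.

```python
import io
from itertools import islice

def write_head(src_path, dst_path, n_lines=10000):
    """Copy the first `n_lines` lines of src_path to dst_path.

    Returns the number of lines actually written (fewer if the
    source file is shorter than `n_lines`).
    """
    written = 0
    # Assumes both files are UTF-8 text.
    with io.open(src_path, encoding='utf-8') as src, \
         io.open(dst_path, 'w', encoding='utf-8') as dst:
        for line in islice(src, n_lines):  # stops reading after n_lines
            dst.write(line)
            written += 1
    return written
```

Under these assumptions, a call such as `write_head(FULL_EN, FPATH_EN)` would regenerate the English sample.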

# Dictionary EDA

In [68]:
# load from file
panlex = pd.read_csv(DPATH, sep='\t', names=['en', 'es'], dtype=str)

In [69]:
# initial stats
print('This dictionary contains {} entries.'.format(len(panlex)))
print('... including {} unique English words.'.format(len(panlex.en.unique())))
print('... and {} unique Spanish words.'.format(len(panlex.es.unique())))

This dictionary contains 711978 entries.
... including 356410 unique English words.
... and 346572 unique Spanish words.


In [74]:
# multiword mappings
en_dict = panlex.groupby(panlex.en).groups

In [75]:
en_dict  # why does this look like nonsense? .groups maps each key to its row labels, not to the Spanish translations

{'en_guadeloupe': Int64Index([287658], dtype='int64'),
 'en_roll_up_one\xe2\x80\x99s_sleeves': Int64Index([544095], dtype='int64'),
 'en_reach_the_point_of': Int64Index([522322], dtype='int64'),
 'en_ciudad_perdida': Int64Index([123885], dtype='int64'),
 'en_heather_graham_pozzessere': Int64Index([298060], dtype='int64'),
 'en_ambient_air,_open_air,_outdoor_air': Int64Index([25848], dtype='int64'),
 'en_profundity': Int64Index([504466, 504467, 504468, 504469, 504470, 504471], dtype='int64'),
 'en_lou_reed': Int64Index([384601], dtype='int64'),
 'en_dicentrarchus_labrax': Int64Index([179848], dtype='int64'),
 'en_h\xc3\xbasdr\xc3\xa1pa': Int64Index([313326], dtype='int64'),
 'en_hortensia_bussi': Int64Index([309352], dtype='int64'),
 'en_motorize': Int64Index([421839], dtype='int64'),
 'en_state-of-the-art_search_programme': Int64Index([606287], dtype='int64'),
 'en_telesur': Int64Index([633513], dtype='int64'),
 'en_infringement_procedure_(eu)': Int64Index([325976], dtype='int64'),
 'e
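
The output above is not corrupted. `GroupBy.groups` maps each English key to the row labels where it occurs, and sequences such as `\xe2\x80\x99` are UTF-8 bytes (here a right single quotation mark) as shown by Python 2's `repr`. To get a mapping from each English entry to its Spanish translations, one option is to select the `es` column before aggregating. The sketch below uses a toy frame with made-up placeholder translations (`es_x1`, ...) rather than real Panlex rows, and it assumes the English side always carries the three-character `en_` prefix.

```python
import pandas as pd

# Toy stand-in for the Panlex table; the 'es' values are placeholders,
# not real dictionary entries.
toy = pd.DataFrame({
    'en': ['en_profundity', 'en_profundity', 'en_lou_reed'],
    'es': ['es_x1', 'es_x2', 'es_x3'],
})

# What the cell above displays: key -> row labels.
row_labels = toy.groupby('en').groups

# Key -> list of translations: pick the 'es' column, then collect per group.
en_to_es = toy.groupby('en')['es'].apply(list).to_dict()

# Multiword English entries: an underscore remains after stripping 'en_'.
multiword = toy[toy['en'].str[3:].str.contains('_', regex=False)]
```

On the full dictionary, the same `groupby('en')['es']` call would also give the number of translations per word via `.size()`.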

In [None]:
# Words we should totally try to translate

# en_pseudo-sophisticated
# en_peppermint_patty
# en_freedom_of_mobile_multimedia_access
# en_love_wave