# Parsing Wiki Data
` w266 Final Project: Crosslingual Word Embeddings`   

The code in this notebook and the supporting file __`parsing.py`__ build on the helper functions provided in the TensorFlow Word2Vec tutorial to develop a set of data handling functions for the data relevant to Duong et al.'s paper. Ideally I'll develop a scalable solution for tokenizing, prepending language indicators (e.g. `en_`), and extracting sentences to create training data that includes sentences from two languages. I also hope to develop a batch iterator modeled after the one in A4. Depending on the available tools, I may need to look at using a distributed system (Spark?) to preprocess the English corpus, which is ~9GB.

# Notebook Set-up

In [1]:
##    Note: Start Jupyter with 
##      jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000

# general imports
from __future__ import print_function  # must precede the other imports
import os
import re
import sys
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# tell matplotlib not to open a new window
%matplotlib inline
# autoreload modules
%load_ext autoreload
%autoreload 2

In [2]:
# filepaths

## Maya's paths
#BASE = '/Users/mmillervedam/Documents/MIDS/w266' #'/home/mmillervedam/' 
#PROJ = '/Users/mmillervedam/Documents/MIDS/w266/FinalProject'#'/home/mmillervedam/ProjectRepo'

## Roseanna's paths


## Mona's local paths
BASE = '/Users/mona/OneDrive/repos/Data' #'/home/mmillervedam/Data'
PROJ = '/Users/mona/OneDrive/repos/final_proj/W266-Fall-2017-Final-Project'#'/home/mmillervedam/ProjectRepo'
## Mona's gc paths (these override the local paths above)
BASE = '/home/miwamoto' #'/home/mmillervedam/Data'
PROJ = '/home/miwamoto/W266-Fall-2017-Final-Project'#'/home/mmillervedam/ProjectRepo'


## Repo paths

FPATH_EN = BASE + '/Data/test/wiki_en_10K.txt' # first 10000 lines from wiki dump
FPATH_ES = BASE + '/Data/test/wiki_es_10K.txt' # first 10000 lines from wiki dump
EN_ES_DICT = PROJ +'/XlingualEmb/data/dicts/en.es.panlex.all.processed'
EN_IT_DICT  = PROJ +'/XlingualEmb/data/dicts/en.it.panlex.all.processed'
EN_IT_RAW = PROJ + '/XlingualEmb/data/mono/en_it.shuf.10k'
FULL_EN = BASE + '/Data/en/full.txt'
FULL_ES = BASE + '/Data/es/full.txt'
FULL_IT = BASE + '/Data/it/full.txt'
FULL_FR = BASE + '/Data/fr/full.txt'
FULL_JA = BASE + '/Data/ja/full.txt'
FULL_NL = BASE + '/Data/nl/full.txt'

DPATH = EN_ES_DICT  # same file as EN_ES_DICT
EN_IT = EN_IT_RAW   # same file as EN_IT_RAW

## Large datasets
FULL_EN_ES = "./shuffled_files/en_es_shuf.txt"
FULL_EN_IT = "./shuffled_files/en_it_shuf.txt"


# Desired Data Format

In [5]:
# take a look at what Duong et al trained on for reference
!head -n 10 {EN_IT}

it_[[877881]]
it_[[879362]]
it_in it_un it_remoto it_passato it_aveva it_progettato it_, it_per it_conto it_dei it_demoniazzi it_silastici it_di it_striterax it_, it_una it_bomba it_in it_grado it_di it_collegare it_simultaneamente it_tutti it_i it_nuclei it_di it_tutte it_le it_stelle it_, it_creando it_così it_un'immensa it_supernova it_che it_avrebbe it_distrutto it_l'universo it_, it_secondo it_i it_desideri it_dei it_demoniazzi it_silastici it_.
it_krikkitesi it_i it_krikkitesi it_sono it_una it_razza it_aliena it_che it_per it_miliardi it_di it_anni it_aveva it_vissuto it_senza it_la it_minima it_consapevolezza it_dell'esistenza it_di it_altri it_mondi it_o it_altre it_specie it_.
en_as en_the en_patron en_of en_delphi en_( en_pythian en_apollo en_) en_, en_apollo en_was en_an en_oracular en_god en_— en_the en_prophetic en_deity en_of en_the en_delphic en_oracle en_.
it_all'inizio it_del it_2006 it_ha it_pubblicato it_il it_suo it_primo it_singolo it_solista it_, it_nell'ang

__`NOTE:`__ There are no UNK tokens here and punctuation is included as its own token. However words are lowercased and the language marker is prepended. Also note that sentences from the two languages have been shuffled together.
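
As a rough sketch of the target format described in the note, the snippet below lowercases a whitespace-tokenized sentence and prepends a language marker to each token. The function name `add_language_prefix` is hypothetical and is not part of `parsing.py`; it assumes the input is already tokenized, with punctuation separated by spaces.

```python
def add_language_prefix(sentence, lang):
    """Lowercase a pre-tokenized sentence and prepend `lang_` to every token.

    Hypothetical helper for illustration only; assumes tokens are separated
    by whitespace and punctuation is already its own token.
    """
    tokens = sentence.lower().split()
    return " ".join("{}_{}".format(lang, tok) for tok in tokens)


example = add_language_prefix("Apollo was an oracular god .", "en")
print(example)
```

Shuffling sentences from two languages together is a separate step and is not covered by this helper.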

# Parsing Code
I've put the parsing functions in their own Python script for ease of access and shared editing. The script can be found in the shared repo at: __`/Notebooks/parsing.py`__. Here's a quick overview of the methods it contains:

In [5]:
from parsing import Corpus, Vocabulary, batch_generator, make_bilingual

In [7]:
print(Corpus.__doc__)


    Class with helper methods to read from a Corpus.
    Intended to facilitate working with multiple corpora at once.
    Init Args:
        path - (str) filepath of the raw data
        prefix - (str) optional language prefix to prepend when reading
    Methods:
        gen_tokens - generator factory for tokens in order
        gen_sentences - generator factory for sentences in order
    


In [8]:
print(Vocabulary.__doc__)


    This class is based heavily on code provided in a4 of MIDS w266, Fall 2017.
    Init Args:
        tokens    - iterable of tokens to count
        wordset   - (optional) limit vocabulary to these words
        size      - (optional) integer, number of vocabulary words
    Attributes:
        self.index   - dictionary of {id : type}
        self.size    - integer, number of words in total
        self.types   - dictionary of {type : id}
        self.wordset - set of types
        self.language- order of languages in the index
    Methods:
        self.to_ids(words) - returns list of ids for the word list
        self.to_words(ids) - returns list of words for the id list
        self.sentence_to_ids(sentence) - returns list of ids with start & end
    


In [9]:
print(batch_generator.__doc__)


    Function to iterate repeatedly over a corpus delivering
    batch_size arrays of ids and context_labels for CBOW.

    Args:
        corpus - an instance of Corpus()
        vocabulary - an instance of Vocabulary()
        batch_size - int, number of words to serve at once
        bag_window - context distance for CBOW training
        max_epochs - int(default = None) stop generating

    Yields:
        batch: np.array of dim: (batch_size, 2*bag_window)
               Represents set of context words.
        labels: np.array of dim: (batch_size, 1)
               Represents center words to predict/translate.

    Runs indefinitely unless you specify max_epochs or explicitly break.
    


# Data Parsing Demos

In [18]:
print(FPATH_EN)

/home/miwamoto/Data/test/wiki_en_10K.txt


In [19]:
# english test corpus
en_test = Corpus(FPATH_EN, 'en')

In [13]:
# demo generator
idx = 1
for tok in en_test.gen_tokens():
    print(tok)
    idx += 1
    if idx > 10:
        break

en_[[12]]
en_anarchism
en_is
en_often
en_defined
en_as
en_a
en_political
en_philosophy
en_which
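
For readers without `parsing.py` at hand, here is a minimal sketch of what a "generator factory" could look like: each call to `gen_tokens()` returns a fresh generator, so the corpus can be re-read from the top by several consumers. The class `MiniCorpus` and its internals are hypothetical stand-ins, not the actual `Corpus` implementation.

```python
import os
import tempfile


class MiniCorpus(object):
    """Illustrative stand-in for Corpus (hypothetical, not the parsing.py code)."""

    def __init__(self, path, prefix=None):
        self.path = path
        self.prefix = prefix

    def gen_sentences(self):
        """Return a fresh generator over lowercased, prefixed sentences."""
        with open(self.path) as f:
            for line in f:
                tokens = line.lower().split()
                if self.prefix:
                    tokens = [self.prefix + "_" + t for t in tokens]
                yield " ".join(tokens)

    def gen_tokens(self):
        """Return a fresh generator over tokens, in corpus order."""
        for sentence in self.gen_sentences():
            for token in sentence.split():
                yield token


# Tiny demo file so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as f:
    f.write("[[12]]\nAnarchism is often defined\n")

corpus = MiniCorpus(demo_path, "en")
first_pass = list(corpus.gen_tokens())
second_pass = list(corpus.gen_tokens())  # a new generator restarts from the top
print(first_pass)
```

Because nothing is cached, memory use stays flat regardless of file size, at the cost of re-reading the file on every pass.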


In [12]:
# english vocabulary
en_vocab = Vocabulary(en_test.gen_tokens(), size = 1000)

In [13]:
# take a look
#en_vocab.types
#en_vocab.index
#en_vocab.wordset
en_vocab.size

1000

In [14]:
# translate the first real sentence into ids (idx 0 is the [[12]] header line)
idx = 0
for sent in en_test.gen_sentences():
    if idx == 1:
        print(sent)
        print(en_vocab.sentence_to_ids(sent))
        break
    idx += 1

 en_anarchism en_is en_often en_defined en_as en_a en_political en_philosophy en_which en_holds en_the en_state en_to en_be en_undesirable en_, en_unnecessary en_, en_or en_harmful en_.
[0, 209, 11, 93, 598, 13, 10, 186, 267, 28, 2, 3, 58, 9, 30, 2, 4, 2, 4, 25, 2, 5, 1]
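
The id list above starts with 0 and ends with 1, which lines up with the `<s>` and `</s>` markers shown in the batch demos, with 2 standing in for `<unk>`. Here is a minimal sketch of such a mapping. It assumes the three special tokens take the first ids, that the remaining ids are assigned by frequency, and that `size` counts the special tokens too. The class `MiniVocabulary` is hypothetical and is not the code in `parsing.py`.

```python
from collections import Counter


class MiniVocabulary(object):
    """Illustrative token <-> id mapping (hypothetical sketch)."""

    START, END, UNK = "<s>", "</s>", "<unk>"

    def __init__(self, tokens, size=None):
        counts = Counter(tokens)
        specials = [self.START, self.END, self.UNK]
        # Keep the most frequent types, leaving room for the special tokens
        # (assumption: `size` includes them).
        n_words = None if size is None else max(size - len(specials), 0)
        words = [w for w, _ in counts.most_common(n_words)]
        self.index = dict(enumerate(specials + words))       # {id: type}
        self.types = {w: i for i, w in self.index.items()}   # {type: id}
        self.size = len(self.index)

    def to_ids(self, words):
        unk = self.types[self.UNK]
        return [self.types.get(w, unk) for w in words]

    def to_words(self, ids):
        return [self.index[i] for i in ids]

    def sentence_to_ids(self, sentence):
        """Split on whitespace and wrap the ids in start/end markers."""
        words = [self.START] + sentence.split() + [self.END]
        return self.to_ids(words)


vocab = MiniVocabulary("en_a en_b en_a en_c en_a en_b".split(), size=5)
print(vocab.sentence_to_ids("en_a en_b en_z"))
```

Any word outside the retained types, such as `en_z` here, falls back to the `<unk>` id.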


In [15]:
# demo batch iterator
batch_size = 4
bag_window = 2
idx = 0
for batch, labels in batch_generator(en_test, en_vocab, batch_size, bag_window):
    print("CONTEXT WINDOWS:", batch)
    print("CENTER WORDS:", labels)
    idx += 1
    if idx > 2:
        break

CONTEXT WINDOWS: [[0, 0, 1, 1], [0, 0, 11, 93], [0, 209, 93, 598], [209, 11, 598, 13]]
CENTER WORDS: [2, 209, 11, 93]
CONTEXT WINDOWS: [[11, 93, 13, 10], [93, 598, 10, 186], [598, 13, 186, 267], [13, 10, 267, 28]]
CENTER WORDS: [598, 13, 10, 186]
CONTEXT WINDOWS: [[10, 186, 28, 2], [186, 267, 2, 3], [267, 28, 3, 58], [28, 2, 58, 9]]
CENTER WORDS: [267, 28, 2, 3]


In [16]:
# demo batch iterator w/ readable format
batch_size = 4
bag_window = 2
idx = 0
for batch, labels in batch_generator(en_test, en_vocab, batch_size, bag_window):
    print("Batch %s:"%(idx+1))
    for context, wrd in zip(batch,labels):
        print(en_vocab.to_words(context), "-->", en_vocab.to_words([wrd]))
    idx += 1
    if idx > 2:
        break

Batch 1:
['<s>', '<s>', '</s>', '</s>'] --> ['<unk>']
['<s>', '<s>', 'en_is', 'en_often'] --> ['en_anarchism']
['<s>', 'en_anarchism', 'en_often', 'en_defined'] --> ['en_is']
['en_anarchism', 'en_is', 'en_defined', 'en_as'] --> ['en_often']
Batch 2:
['en_is', 'en_often', 'en_as', 'en_a'] --> ['en_defined']
['en_often', 'en_defined', 'en_a', 'en_political'] --> ['en_as']
['en_defined', 'en_as', 'en_political', 'en_philosophy'] --> ['en_a']
['en_as', 'en_a', 'en_philosophy', 'en_which'] --> ['en_political']
Batch 3:
['en_a', 'en_political', 'en_which', '<unk>'] --> ['en_philosophy']
['en_political', 'en_philosophy', '<unk>', 'en_the'] --> ['en_which']
['en_philosophy', 'en_which', 'en_the', 'en_state'] --> ['<unk>']
['en_which', '<unk>', 'en_state', 'en_to'] --> ['en_the']


__`QUESTION:`__ What do we do for context w/ the start and end words? I need to go back and check Mona's code.
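
Judging from the demo output, the generator appears to pad each sentence with `bag_window` start markers and `bag_window` end markers, so that edge words still get a full context. A minimal sketch of that padding scheme follows, assuming ids 0 and 1 are `<s>` and `</s>`. The function `cbow_windows` is hypothetical and is not taken from `parsing.py`.

```python
def cbow_windows(ids, bag_window, start_id=0, end_id=1):
    """Yield (context, center) pairs for one sentence of token ids.

    Assumes ids 0 and 1 are the start and end markers, and pads with
    `bag_window` copies of each so edge words still get full contexts.
    """
    padded = [start_id] * bag_window + list(ids) + [end_id] * bag_window
    for pos in range(bag_window, len(padded) - bag_window):
        left = padded[pos - bag_window:pos]
        right = padded[pos + 1:pos + 1 + bag_window]
        yield left + right, padded[pos]


# ids for "en_anarchism en_is en_often en_defined en_as" from the demo above
pairs = list(cbow_windows([209, 11, 93, 598, 13], bag_window=2))
for context, center in pairs:
    print(context, "-->", center)
```

The first three pairs reproduce the contexts printed in the demo for `en_anarchism`, `en_is`, and `en_often`; a one-token line such as the `[[12]]` header yields the all-padding context `[0, 0, 1, 1]`.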

In [17]:
# confirm that batch generator will reload
!wc {FPATH_EN}

   10000  259843 1461734 /Users/mona/OneDrive/repos/Data/test/wiki_en_10K.txt


In [18]:
# last sentence
!tail -n 1 {FPATH_EN}

Filmography Tarkovsky is mainly known as a director of films .


In [19]:
# first
!head -n 2 {FPATH_EN}

[[12]]
Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .


__`NOTE:`__ `~65001 batches per epoch` (in this test set)

In [20]:
# print batches 65000-65003 (after the epoch wraps, they should match the first batches above)
idx = 0
for batch, labels in batch_generator(en_test, en_vocab, 4, 2):
    idx += 1
    if idx < 65000:
        continue
    elif idx > 65003:
        break
    else:
        print("Batch %s:"%(idx))
        for context, wrd in zip(batch,labels):
            print(en_vocab.to_words(context), "-->", en_vocab.to_words([wrd]))

Batch 65000:
['<unk>', 'en_tarkovsky', '<unk>', 'en_known'] --> ['en_is']
['en_tarkovsky', 'en_is', 'en_known', 'en_as'] --> ['<unk>']
['en_is', '<unk>', 'en_as', 'en_a'] --> ['en_known']
['<unk>', 'en_known', 'en_a', '<unk>'] --> ['en_as']
Batch 65001:
['en_known', 'en_as', '<unk>', 'en_of'] --> ['en_a']
['en_as', 'en_a', 'en_of', 'en_films'] --> ['<unk>']
['en_a', '<unk>', 'en_films', 'en_.'] --> ['en_of']
['<unk>', 'en_of', 'en_.', '</s>'] --> ['en_films']
Batch 65002:
['en_of', 'en_films', '</s>', '</s>'] --> ['en_.']
['<s>', '<s>', '</s>', '</s>'] --> ['<unk>']
['<s>', '<s>', 'en_is', 'en_often'] --> ['en_anarchism']
['<s>', 'en_anarchism', 'en_often', 'en_defined'] --> ['en_is']
Batch 65003:
['en_anarchism', 'en_is', 'en_defined', 'en_as'] --> ['en_often']
['en_is', 'en_often', 'en_as', 'en_a'] --> ['en_defined']
['en_often', 'en_defined', 'en_a', 'en_political'] --> ['en_as']
['en_defined', 'en_as', 'en_political', 'en_philosophy'] --> ['en_a']


# Prepare Full Datasets
## Prepare English Corpus

In [36]:
print(FULL_EN)

/home/miwamoto/Data/en/full.txt


In [25]:
# english full corpus
en_text = Corpus(FULL_EN, 'en')

In [39]:
# minimum sentence length = 3
min_sentence_length = 3
en_text.split_file(min_sentence_length)

Time to split -  578.537941 seconds
6946 files written


In [40]:
# Each split has 10K sentences
print(en_text.splits)
#en_text.splits = 6946

6946
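
To make the splitting step concrete, here is a rough sketch of chunking a large file into fixed-size pieces while dropping short lines. It is not the `Corpus.split_file` implementation; the function name `split_into_chunks`, the output file naming, and the whitespace-based length check are all assumptions.

```python
import os
import tempfile


def split_into_chunks(path, out_dir, lines_per_file=10000, min_len=3):
    """Write sentences with at least `min_len` tokens into numbered chunk files.

    Returns the number of chunk files written.
    """
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    n_files, n_in_chunk, out = 0, 0, None
    with open(path) as src:
        for line in src:
            if len(line.split()) < min_len:
                continue  # drop short lines, e.g. one-token lines
            if out is None:
                # Open the next chunk lazily so no empty file is created.
                out = open(os.path.join(out_dir, "split_%05d.txt" % n_files), "w")
                n_files += 1
            out.write(line)
            n_in_chunk += 1
            if n_in_chunk == lines_per_file:
                out.close()
                out, n_in_chunk = None, 0
    if out is not None:
        out.close()
    return n_files


# Tiny demo: 25 usable sentences, 10 per chunk.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "corpus.txt")
with open(src_path, "w") as f:
    f.write("[[12]]\n")  # too short, skipped
    for i in range(25):
        f.write("sentence number %d .\n" % i)
n_chunks = split_into_chunks(src_path, os.path.join(work_dir, "splits"), lines_per_file=10)
print(n_chunks, "files written")
```

Reading line by line keeps memory use independent of the corpus size, which matters for the ~9GB English file.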


We may need to raise the upper limit on open files. There are two limits: the soft limit, imposed by the current configuration, and the hard limit, imposed by the operating system.

In [41]:
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('Soft limit is ', soft)
print('Hard limit is ', hard)

resource.setrlimit(resource.RLIMIT_NOFILE, (10000, hard))


Soft limit is  10000
Hard limit is  1048576
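
Requesting a soft limit above the hard limit makes `resource.setrlimit` raise `ValueError`, so a slightly more defensive variant of the cell above caps the request at the hard limit. This is a sketch only; the helper name `raise_open_file_limit` is hypothetical, and the `resource` module is available on Unix-like systems only.

```python
import resource


def raise_open_file_limit(target=10000):
    """Raise the soft RLIMIT_NOFILE toward `target`, capped at the hard limit.

    Returns the (soft, hard) pair in effect afterwards. Unix only.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    # Never lower an existing limit that is already high enough.
    if soft == unlimited or soft >= target:
        return soft, hard
    new_soft = target if hard == unlimited else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # Some systems impose extra caps; keep the current limits.
        pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_open_file_limit(10000)
print("Soft limit is", soft)
print("Hard limit is", hard)
```

The change applies only to the current process (here, the notebook kernel) and its children.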


In [42]:
# Randomly draw from split files
# This takes a LONG time for a 9GB file
en_text.draw_random()

(10000, 1048576)
Time to shuffle -  10.776336 seconds
0 sentences skipped
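
One plausible way to shuffle a corpus that does not fit in memory is to keep every split file open and repeatedly copy the next line from a randomly chosen file, which would also explain why the open-file limit matters here. The sketch below illustrates that idea under those assumptions; `draw_random_lines` is a hypothetical function, not the `Corpus.draw_random` implementation, and it uses `contextlib.ExitStack`, which needs Python 3.

```python
import contextlib
import os
import random
import tempfile


def draw_random_lines(split_paths, out_path, seed=None):
    """Repeatedly pick a random open split file and copy its next line
    to `out_path` until every file is exhausted.

    Returns the number of lines written.
    """
    rng = random.Random(seed)
    written = 0
    with contextlib.ExitStack() as stack:
        # Every split file is held open at once.
        handles = [stack.enter_context(open(p)) for p in split_paths]
        out = stack.enter_context(open(out_path, "w"))
        while handles:
            i = rng.randrange(len(handles))
            line = handles[i].readline()
            if line:
                out.write(line)
                written += 1
            else:
                handles.pop(i)  # exhausted; ExitStack closes it on exit
    return written


# Tiny demo: three split files with four lines each.
work_dir = tempfile.mkdtemp()
paths = []
for k in range(3):
    p = os.path.join(work_dir, "split_%d.txt" % k)
    with open(p, "w") as f:
        f.writelines("file%d line%d\n" % (k, j) for j in range(4))
    paths.append(p)

shuffled_path = os.path.join(work_dir, "shuffled.txt")
n_written = draw_random_lines(paths, shuffled_path, seed=0)
print(n_written, "lines written")
```

Note that lines from the same split keep their relative order under this scheme, so it mixes files rather than producing a uniformly random permutation.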


In [47]:
# Quick check - shuffled file size
!ls -lh ./shuffled_files/en_shuffled.txt 

-rw-rw-r-- 1 miwamoto miwamoto 991M Dec 18 21:48 ./shuffled_files/en_shuffled.txt


In [50]:
!head -n5 ./shuffled_files/en_shuffled.txt

 en_reception en_the en_allmusic en_review en_by en_steve en_huey en_awarded en_the en_album en_5 en_stars en_and en_calling en_it en_" en_a en_superbly en_sensuous en_blend en_of en_lusty en_blues en_swagger en_and en_achingly en_romantic en_ballads en_... en_a en_quiet en_, en_sorely en_underrated en_masterpiece en_" en_.
 en_astrid en_is en_a en_member en_of en_the en_crimean en_royal en_knights en_.
 en_although en_it en_ran en_reasonably en_well en_, en_the en_engine en_was en_fuel en_inefficient en_, en_extremely en_noisy en_, en_tended en_to en_overheat en_and en_, en_if en_sufficient en_cooling en_water en_was en_not en_applied en_, en_seize en_up en_.
 en_the en_hebrew en_bible en_refers en_uncritically en_to en_slavery en_as en_an en_established en_institution en_.
 en_sandoval en_sued en_under en_title en_vi en_of en_the en_civil en_rights en_act en_of en_1964 en_.


In [51]:
# remove files if not needed
!rm ./split_files/*

rm: cannot remove './split_files/*': No such file or directory


## Prepare Spanish Corpus

In [52]:
print(FULL_ES)

/home/miwamoto/Data/es/full.txt


In [54]:
# spanish full corpus
es_text = Corpus(FULL_ES, 'es')

In [55]:
# Split full file using min_sentence_length

min_sentence_length = 3
es_text.split_file(min_sentence_length)

Time to split -  126.983595 seconds
1441 files written


In [56]:
print(es_text.splits)

1441


In [58]:
es_text.draw_random()

(10000, 1048576)
Time to shuffle -  9.490429 seconds
0 sentences skipped


In [59]:
# Quick check - shuffled file size
!ls -lh ./shuffled_files/es_shuffled.txt 

-rw-rw-r-- 1 miwamoto miwamoto 1.1G Dec 18 22:06 ./shuffled_files/es_shuffled.txt


In [60]:
!head -n5 ./shuffled_files/es_shuffled.txt

 es_el es_hecho es_de es_que es_acacio es_se es_hubiera es_presentado es_hasta es_entonces es_como es_un es_defensor es_de es_la es_verdadera es_ortodoxia es_es es_curioso es_.
 es_en es_1162 es_sancho es_vi es_de es_navarra es_se es_lanza es_a es_la es_conquista es_de es_la es_rioja es_, es_conquistando es_en es_1163 es_logroño es_, es_navarrete es_, es_entrena es_, es_ausejo es_, es_resa es_, es_ocón es_, es_autol es_, es_quel es_, es_grañón es_, es_pazuengos es_y es_treviana es_.
 es_en es_él es_se es_ubican es_la es_facultad es_de es_ciencias es_económicas es_y es_empresariales es_, es_la es_escuela es_universitaria es_de es_estudios es_empresariales es_, es_la es_facultad es_de es_derecho es_, es_la es_facultad es_de es_ciencias es_políticas es_y es_sociales es_, es_la es_facultad es_de es_filosofía es_, es_la es_facultad es_de es_psicología es_, es_la es_facultad es_de es_filología es_, es_facultad es_de es_geografía es_e es_historia es_, es_la es_facultad es_de es_ciencias es_

In [61]:
# remove files if not needed
!rm ./split_files/*

## Prepare Italian Corpus

In [65]:
print(FULL_IT)

/home/miwamoto/Data/it/full.txt


In [66]:
# Italian full corpus
it_text = Corpus(FULL_IT, 'it')

In [67]:
# Split full file using min_sentence_length

min_sentence_length = 3
it_text.split_file(min_sentence_length)

Time to split -  102.788192 seconds
1152 files written


In [70]:
it_text.draw_random()

(10000, 1048576)
Time to shuffle -  9.534405 seconds
0 sentences skipped


In [71]:
# Quick check - shuffled file size
!ls -lh ./shuffled_files/it_shuffled.txt 

-rw-rw-r-- 1 miwamoto miwamoto 1.2G Dec 18 22:18 ./shuffled_files/it_shuffled.txt


In [72]:
!head -n5 ./shuffled_files/it_shuffled.txt

 it_lo it_stereoscopio it_ad it_ingrandimento it_variabile it_è it_un it_particolare it_stereoscopio it_idoneo it_a it_variare it_l'ingrandimento it_del it_modello it_ottico it_tridimensionale it_osservato it_.
 it_il it_suo it_simbolo it_è it_ni it_.
 it_in it_quella it_squadra it_già it_giocava it_il it_triestino it_nereo it_rocco it_: it_è it_stato it_prima it_un it_importante it_giocatore it_, it_poi it_allenatore it_.
 it_atena it_trasformò it_anche it_la it_parte it_inferiore it_del it_loro it_corpo it_in it_modo it_tale it_da it_renderle it_impossibilitate it_ad it_avere it_rapporti it_sessuali it_con it_un it_uomo it_.
 it_file:copia it_della it_madonna it_di it_albinea it_di it_correggio it_2 it_. it_jpg it_| it_dettaglio it_bibliografia it_collegamenti it_esterni


In [73]:
# remove files if not needed
!rm ./split_files/*

## Prepare French Corpus

In [74]:
print(FULL_FR)

/home/miwamoto/Data/fr/full.txt


In [75]:
# French full corpus
fr_text = Corpus(FULL_FR, 'fr')

In [76]:
# Split full file using min_sentence_length

min_sentence_length = 3
fr_text.split_file(min_sentence_length)

Time to split -  152.248704 seconds
1797 files written


In [77]:
print(fr_text.splits)

1797


In [78]:
fr_text.draw_random()

(10000, 1048576)
Time to shuffle -  9.720246 seconds
0 sentences skipped


In [79]:
# Quick check - shuffled file size
!ls -lh ./shuffled_files/fr_shuffled.txt 

-rw-rw-r-- 1 miwamoto miwamoto 1.1G Dec 18 22:22 ./shuffled_files/fr_shuffled.txt


In [80]:
!head -n5 ./shuffled_files/fr_shuffled.txt

 fr_seuls fr_babylone fr_, fr_l fr_' fr_urartu fr_, fr_l fr_' fr_Élam fr_et fr_l fr_' fr_Égypte fr_peuvent fr_un fr_temps fr_caresser fr_l'idée fr_de fr_rivaliser fr_avec fr_lui fr_, fr_mais fr_ils fr_sont fr_finalement fr_tous fr_vaincus fr_.
 fr_comme fr_dans fr_le fr_reste fr_de fr_la fr_région fr_, fr_à fr_la fr_fin fr_du fr_se fr_déroule fr_la fr_guerre fr_de fr_vendée fr_, fr_qui fr_marque fr_de fr_son fr_empreinte fr_le fr_pays fr_tout fr_entier fr_.
 fr_le fr_propulseur fr_est fr_un fr_moteur fr_fusée fr_à fr_carburant fr_solide fr_.
 fr_le fr_dernier fr_kilomètre fr_très fr_raide fr_permet fr_à fr_david fr_de fr_la fr_fuente fr_d'attaquer fr_et fr_de fr_remporter fr_sa fr_première fr_victoire fr_de fr_la fr_saison fr_.
 fr_en fr_france fr_, fr_de fr_nombreuses fr_chaussures fr_de fr_femmes fr_furent fr_par fr_conséquent fr_pourvues fr_de fr_semelles fr_compensées fr_en fr_bois fr_.


In [16]:
# remove files if not needed
!rm ./split_files/*

## Prepare Dutch Corpus

In [3]:
print(FULL_NL)

/home/miwamoto/Data/nl/full.txt


In [6]:
from parsing import Corpus, Vocabulary, batch_generator, make_bilingual

In [15]:
# Dutch full corpus
nl_text = Corpus(FULL_NL, 'nl')

In [17]:
# Split full file using min_sentence_length

min_sentence_length = 3
nl_text.split_file(min_sentence_length)

Time to split -  77.246617 seconds
1067 files written


In [18]:
print(nl_text.splits)

1067


In [19]:
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('Soft limit is ', soft)
print('Hard limit is ', hard)

resource.setrlimit(resource.RLIMIT_NOFILE, (10000, hard))

('Soft limit is ', 10000)
('Hard limit is ', 1048576)


In [20]:
nl_text.draw_random()

(10000, 1048576)
Time to shuffle -  7.817511 seconds
0 sentences skipped


In [21]:
# Quick check - shuffled file size
!ls -lh ./shuffled_files/nl_shuffled.txt 

-rw-rw-r-- 1 miwamoto miwamoto 809M Dec 19 03:29 ./shuffled_files/nl_shuffled.txt


In [22]:
!head -n5 ./shuffled_files/nl_shuffled.txt

 nl_tesseract nl_wordt nl_tegenwoordig nl_ontwikkeld nl_door nl_google nl_en nl_uitgegeven nl_onder nl_de nl_apache nl_- nl_licentie nl_2.0 nl_.
 nl_in nl_ronde nl_2 nl_maakte nl_felipe nl_aguilar nl_64 nl_waarna nl_hij nl_samen nl_met nl_george nl_murray nl_aan nl_de nl_leiding nl_stond nl_.
 nl_de nl_eigenaar nl_van nl_het nl_150 nl_miljoen nl_dollar nl_kostende nl_gebouw nl_is nl_bentleyforbes nl_.
 nl_taxonomie nl_tipularia nl_wordt nl_tegenwoordig nl_samen nl_met nl_de nl_geslachten nl_calypso nl_en nl_corallorhiza nl_en nl_nog nl_enkele nl_ander nl_tot nl_de nl_tribus nl_calypsoeae nl_gerekend nl_.
 nl_pierre nl_hazette nl_( nl_marneffe nl_, nl_15 nl_maart nl_1939 nl_) nl_is nl_een nl_belgisch nl_politicus nl_voor nl_de nl_prl nl_.


In [23]:
# remove files if not needed
!rm ./split_files/*

## Prepare Japanese Corpus

In [82]:
print(FULL_JA)

/home/miwamoto/Data/ja/full.txt


In [83]:
# Japanese full corpus
ja_text = Corpus(FULL_JA, 'ja')

In [84]:
# Split full file using min_sentence_length

min_sentence_length = 3
ja_text.split_file(min_sentence_length)

Time to split -  118.577906 seconds
2096 files written


In [85]:
print(ja_text.splits)

2096


In [87]:
ja_text.draw_random()

(10000, 1048576)
Time to shuffle -  5.354531 seconds
0 sentences skipped


In [88]:
# Quick check - shuffled file size
!ls -lh ./shuffled_files/ja_shuffled.txt 

-rw-rw-r-- 1 miwamoto miwamoto 390M Dec 18 22:26 ./shuffled_files/ja_shuffled.txt


In [89]:
!head -n5 ./shuffled_files/ja_shuffled.txt

 ja_ノー ja_ブル ja_節 ja_北米 ja_西部 ja_高地
 ja_江川 ja_大会 ja_通算 ja_60 ja_奪 ja_三振 ja_記録
 ja_1970 ja_年 ja_空気 ja_ばね ja_式 ja_車体 ja_傾斜 ja_制御 ja_装置 ja_試験 ja_行う ja_時 ja_クハ ja_1658 ja_使用 ja_台車 ja_空気 ja_ばね ja_式 ja_振り子 ja_台車 ja_fs ja_080 ja_形 ja_採用 ja_三菱 ja_三菱電機 ja_電機 ja_自動 ja_振子 ja_制御 ja_装置 ja_組み合わせる
 ja_2001 ja_年 ja_10 ja_月 ja_2002 ja_年 ja_9 ja_月 ja_2 ja_名 ja_後継 ja_番組
 ja_江東 ja_区立 ja_東川 ja_小学校


In [90]:
# remove files if not needed
!rm ./split_files/*

## Combine corpora for bilingual text

In [35]:
make_bilingual(en_text, it_text)

In [None]:
make_bilingual(en_text, es_text)

In [92]:
make_bilingual(en_text, fr_text)

In [93]:
make_bilingual(en_text, ja_text)

In [26]:
make_bilingual(en_text, nl_text)

In [27]:
!head -5 ./shuffled_files/en_nl_shuf.txt

 en_reception en_the en_allmusic en_review en_by en_steve en_huey en_awarded en_the en_album en_5 en_stars en_and en_calling en_it en_" en_a en_superbly en_sensuous en_blend en_of en_lusty en_blues en_swagger en_and en_achingly en_romantic en_ballads en_... en_a en_quiet en_, en_sorely en_underrated en_masterpiece en_" en_.
 nl_tesseract nl_wordt nl_tegenwoordig nl_ontwikkeld nl_door nl_google nl_en nl_uitgegeven nl_onder nl_de nl_apache nl_- nl_licentie nl_2.0 nl_.
 en_astrid en_is en_a en_member en_of en_the en_crimean en_royal en_knights en_.
 nl_in nl_ronde nl_2 nl_maakte nl_felipe nl_aguilar nl_64 nl_waarna nl_hij nl_samen nl_met nl_george nl_murray nl_aan nl_de nl_leiding nl_stond nl_.
 en_although en_it en_ran en_reasonably en_well en_, en_the en_engine en_was en_fuel en_inefficient en_, en_extremely en_noisy en_, en_tended en_to en_overheat en_and en_, en_if en_sufficient en_cooling en_water en_was en_not en_applied en_, en_seize en_up en_.
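
The head of the bilingual file alternates English and Dutch lines in the same order as the two monolingual shuffled files, which suggests a simple line-by-line interleave. A minimal sketch of that idea is below. The function `interleave_files` is hypothetical and is not the `make_bilingual` code; in particular, stopping at the end of the shorter file is an assumption.

```python
import os
import tempfile


def interleave_files(path_a, path_b, out_path):
    """Alternate lines from two files, stopping at the end of the shorter one.

    Returns the number of lines written.
    """
    written = 0
    with open(path_a) as fa, open(path_b) as fb, open(out_path, "w") as out:
        for line_a, line_b in zip(fa, fb):
            out.write(line_a)
            out.write(line_b)
            written += 2
    return written


# Tiny demo with a longer English file and a shorter Dutch file.
work_dir = tempfile.mkdtemp()
en_path = os.path.join(work_dir, "en.txt")
nl_path = os.path.join(work_dir, "nl.txt")
out_path = os.path.join(work_dir, "en_nl.txt")
with open(en_path, "w") as f:
    f.write("en_one\nen_two\nen_three\n")
with open(nl_path, "w") as f:
    f.write("nl_een\nnl_twee\n")

n_lines = interleave_files(en_path, nl_path, out_path)
print(n_lines, "lines written")
```

In Python 3 `zip` is lazy, so both files are streamed rather than loaded into memory.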


# Testing with dictionary wordset.

In [19]:
# load wordset from dict
pld = pd.read_csv(DPATH, sep='\t', names = ['en', 'es'], dtype=str)
en_set = set(pld.en.unique())

In [20]:
# take a look
len(en_set)

356410

In [21]:
# create vocab
en_vocab = Vocabulary(en_test.gen_tokens(), wordset = en_set, size = 100000)

In [22]:
# take a look - NOTE: the test set has a small vocabulary!
print(len(en_vocab.wordset))
print(en_vocab.size)

13561
13561
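
The vocabulary comes out at 13,561 types rather than the requested 100,000, presumably because the test corpus only contains that many distinct tokens from the dictionary wordset. The sketch below shows the filtering-then-capping logic this implies. The function `restricted_types` is hypothetical, and how the real `Vocabulary` counts special tokens toward `size` is not shown here.

```python
from collections import Counter


def restricted_types(tokens, wordset=None, size=None):
    """Return up to `size` most frequent types, optionally limited to `wordset`."""
    counts = Counter(t for t in tokens if wordset is None or t in wordset)
    return [w for w, _ in counts.most_common(size)]


tokens = "en_the en_cat en_the en_dog en_the en_cat en_zzz".split()
allowed = {"en_the", "en_cat", "en_dog", "en_bird"}
kept = restricted_types(tokens, wordset=allowed, size=100)
print(len(kept), kept)
```

Only three types survive here even though `size=100`: `en_zzz` is not in the wordset and `en_bird` never occurs in the tokens.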


# Testing with full Spanish data

In [24]:
# real corpus
es_data = Corpus(FULL_ES, 'es')
es_set = set(pld.es.unique())

In [None]:
%%time
# vocabulary trained on full corpus (%%time, unlike %%timeit, keeps es_vocab defined for later cells)
es_vocab = Vocabulary(es_data.gen_tokens(), wordset = es_set, size = 10)

In [None]:
print(len(es_vocab.wordset))
print(es_vocab.size)

# Testing with full English data
I am still having memory problems w/ the full file. I think the next steps are 1) try a larger instance and 2) go back to the paper to see if we really need all of it.

# Polyglot nonsense

__`NOTE:`__ The first time you run this on a new machine, make sure you've installed [polyglot](http://polyglot.readthedocs.io/en/latest/Installation.html):
```
sudo apt-get install libicu-dev
pip install polyglot
```

In [None]:
import polyglot

ACK! (See the readme for more info on what I've tried to fix this.)

In [None]:
from polyglot.detect import Detector

In [None]:
from polyglot.text import *

In [None]:
blob = "[[12]] Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful ."