# Parsing Wiki Data
`w266 Final Project: Crosslingual Word Embeddings`

The code in this notebook and the supporting file __`parsing.py`__ builds on the helper functions provided in the TensorFlow Word2Vec tutorial to develop a set of data handling functions for the data used in Duong et al.'s paper. The goal is a scalable solution for tokenizing, prepending language indicators (e.g. `en_`), and extracting sentences to create training data that mixes sentences from two languages. I also hope to develop a batch iterator modeled after the one in A4. Depending on the available tools, I may need a distributed system (Spark?) to preprocess the English corpus, which is ~9GB.

# Notebook Set-up

In [2]:
# __future__ imports belong before all other statements
from __future__ import print_function
# general imports
import os
import re
import sys
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# tell matplotlib not to open a new window
%matplotlib inline
# autoreload modules
%load_ext autoreload
%autoreload 2



In [3]:
# filepaths
BASE = '~/Documents/MIDS/w266/Data' #'/home/mmillervedam/Data'
PROJ = '~/Documents/MIDS/w266/FinalProject'#'/home/mmillervedam/ProjectRepo'
FPATH_EN = BASE + '/test/wiki_en_10K.txt' # first 10000 lines from wiki dump
FPATH_ES = BASE + '/test/wiki_es_10K.txt' # first 10000 lines from wiki dump
FULL_EN = BASE + '/en/full.txt' # full English dump (~9GB)
FULL_ES = BASE + '/es/full.txt' # full Spanish dump, used in the full-data test below
DPATH = PROJ +'/XlingualEmb/data/dicts/en.es.panlex.all.processed'
EN_IT = PROJ + '/XlingualEmb/data/mono/en_it.shuf.10k'

# Desired Data Format

In [4]:
# take a look at what Duong et al trained on for reference
!head -n 10 {EN_IT}

it_[[877881]]
it_[[879362]]
it_in it_un it_remoto it_passato it_aveva it_progettato it_, it_per it_conto it_dei it_demoniazzi it_silastici it_di it_striterax it_, it_una it_bomba it_in it_grado it_di it_collegare it_simultaneamente it_tutti it_i it_nuclei it_di it_tutte it_le it_stelle it_, it_creando it_così it_un'immensa it_supernova it_che it_avrebbe it_distrutto it_l'universo it_, it_secondo it_i it_desideri it_dei it_demoniazzi it_silastici it_.
it_krikkitesi it_i it_krikkitesi it_sono it_una it_razza it_aliena it_che it_per it_miliardi it_di it_anni it_aveva it_vissuto it_senza it_la it_minima it_consapevolezza it_dell'esistenza it_di it_altri it_mondi it_o it_altre it_specie it_.
en_as en_the en_patron en_of en_delphi en_( en_pythian en_apollo en_) en_, en_apollo en_was en_an en_oracular en_god en_— en_the en_prophetic en_deity en_of en_the en_delphic en_oracle en_.
it_all'inizio it_del it_2006 it_ha it_pubblicato it_il it_suo it_primo it_singolo it_solista it_, it_nell'ang

__`NOTE:`__ There are no UNK tokens here, and punctuation is included as its own token. Words are lowercased and the language marker is prepended to each one. Sentences from the two languages have also been shuffled together.
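As a rough illustration of that target format, here is a minimal sketch that lowercases a whitespace-tokenized line, prepends a language marker to each token, and shuffles sentences from two languages together. The helper names `prefix_line` and `mix_sentences` are hypothetical and are not part of `parsing.py`. The sketch also assumes the input is already tokenized with punctuation split off, as in the wiki dumps used here.

```python
import random


def prefix_line(line, lang):
    """Lowercase a pre-tokenized line and prepend `lang_` to every token."""
    return " ".join("%s_%s" % (lang, tok) for tok in line.lower().split())


def mix_sentences(sents_a, sents_b, seed=0):
    """Return the sentences of both languages shuffled into one list.

    NOTE: this holds everything in memory, so it only suits small samples.
    """
    mixed = list(sents_a) + list(sents_b)
    random.Random(seed).shuffle(mixed)
    return mixed


en = [prefix_line("Apollo was an oracular god .", "en")]
it = [prefix_line("I krikkitesi sono una razza aliena .", "it")]
for sent in mix_sentences(en, it):
    print(sent)
```

For the full corpora an in-memory shuffle like this would not scale, so it should be read as a description of the output format rather than a preprocessing plan.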

# Parsing Code
I've put the parsing functions in their own Python script for ease of access and shared editing. The script can be found in the shared repo at: __`/Notebooks/parsing.py`__. Here's a quick overview of the methods it contains:

In [12]:
from parsing import Corpus, Vocabulary, batch_generator

In [13]:
print(Corpus.__doc__)


    Class with helper methods to read from a Corpus.
    Intended to facilitate working with multiple corpora at once.
    Init Args:
        path - (str) filepath of the raw data
        lang - (str) optional language prefix to prepend when reading
    Methods:
        gen_tokens - generator factory for tokens in order
        gen_sentences - generator factory for sentences in order
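
The docstring above describes `gen_tokens` and `gen_sentences` as generator factories. A minimal sketch of that idea follows, assuming one sentence per line. `MiniCorpus` is a hypothetical stand-in, not the actual `Corpus` class in `parsing.py`, and it yields sentences as token lists, which may differ from what the real class returns.

```python
import io
import os
import tempfile


class MiniCorpus(object):
    """Hypothetical stand-in for Corpus: lazily reads one sentence per line."""

    def __init__(self, path, lang=None):
        self.path = os.path.expanduser(path)  # allow '~' in paths
        self.prefix = lang + "_" if lang else ""

    def gen_sentences(self):
        """Yield each non-empty line as a list of lowercased, prefixed tokens."""
        with io.open(self.path, encoding="utf-8") as f:
            for line in f:
                toks = line.lower().split()
                if toks:  # skip empty lines
                    yield [self.prefix + t for t in toks]

    def gen_tokens(self):
        """Yield tokens in corpus order, one at a time."""
        for sent in self.gen_sentences():
            for tok in sent:
                yield tok


# tiny demo file so the sketch runs on its own
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write("[[12]]\nAnarchism is often defined as a political philosophy .\n")
tmp.close()

demo = MiniCorpus(tmp.name, "en")
first_tokens = list(demo.gen_tokens())[:3]
print(first_tokens)
os.remove(tmp.name)
```

Because each call to `gen_tokens()` opens the file again, the same object can be iterated repeatedly without holding the corpus in memory.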
    


In [8]:
print(Vocabulary.__doc__)


    This class is based heavily on code provided in a4 of MIDS w266, Fall 2017.
    Init Args:
        tokens    - iterable of tokens to count
        wordset   - (optional) limit vocabulary to these words
        size      - (optional) integer, number of vocabulary words
    Attributes:
        self.index   - dictionary of {id : type} 
        self.size    - integer, number of words in total
        self.types   - dictionary of {type : id}
        self.wordset - set of types
    Methods:
        self.to_ids(words) - returns list of ids for the word list
        self.to_words(ids) - returns list of words for the id list
        self.sentence_to_ids(sentence) - returns list of ids with start & end
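
To make the attributes listed above concrete, here is a minimal sketch of a vocabulary with the same interface, built on `collections.Counter`. `MiniVocabulary` is a hypothetical stand-in and not the code in `parsing.py`. The placement of `<s>`, `</s>`, and `<unk>` at ids 0, 1, and 2, and the choice to count them against `size`, are assumptions.

```python
from collections import Counter


class MiniVocabulary(object):
    """Hypothetical stand-in for Vocabulary, built on collections.Counter."""

    START, END, UNK = "<s>", "</s>", "<unk>"

    def __init__(self, tokens, wordset=None, size=None):
        counts = Counter(tokens)
        if wordset is not None:
            # keep only types that also appear in the allowed wordset
            counts = Counter({w: c for w, c in counts.items() if w in wordset})
        specials = [self.START, self.END, self.UNK]
        # assumption: special tokens count against the size budget
        top_n = None if size is None else max(size - len(specials), 0)
        words = specials + [w for w, _ in counts.most_common(top_n)]
        self.index = dict(enumerate(words))                  # {id: type}
        self.types = {w: i for i, w in self.index.items()}   # {type: id}
        self.wordset = set(words)
        self.size = len(words)

    def to_ids(self, words):
        unk = self.types[self.UNK]
        return [self.types.get(w, unk) for w in words]

    def to_words(self, ids):
        return [self.index[i] for i in ids]

    def sentence_to_ids(self, sentence):
        return self.to_ids([self.START] + list(sentence) + [self.END])


vocab = MiniVocabulary("en_a en_b en_a en_c en_a en_b".split(), size=5)
ids = vocab.sentence_to_ids(["en_a", "en_zzz", "en_b"])
print(ids)
print(vocab.to_words(ids))
```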
    


In [14]:
print(batch_generator.__doc__)


    Function to iterate repeatedly over a corpus, delivering
    batch_size arrays of ids and context_labels for CBOW.
    
    Args:
        corpus - an instance of Corpus()
        vocabulary - an instance of Vocabulary()
        batch_size - int, number of words to serve at once
        bag_window - context distance for CBOW training
        max_epochs - int(default = None) stop generating
        
    Yields:
        batch: np.array of dim: (batch_size, 2*bag_window)
               Represents set of context words.
        labels: np.array of dim: (batch_size, 1)
               Represents center words to predict/translate.
        
    NOTE: iterates over the corpus indefinitely unless
    you specify max_epochs or explicitly break.
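
For reference, the CBOW batching described above can be sketched over a flat list of ids. The function `cbow_batches` is hypothetical and is not the `batch_generator` in `parsing.py`. In particular, truncating the context at either end of the sequence is only one possible choice, and the real function may handle edges differently.

```python
import itertools


def cbow_batches(ids, batch_size, bag_window, max_epochs=None):
    """Yield (contexts, centers) lists for CBOW from a flat list of ids.

    Contexts near either end of the sequence are truncated rather than
    padded, so they can be shorter than 2 * bag_window.
    """
    contexts, centers = [], []
    epochs = itertools.count() if max_epochs is None else range(max_epochs)
    for _ in epochs:
        for i, center in enumerate(ids):
            left = ids[max(i - bag_window, 0):i]
            right = ids[i + 1:i + 1 + bag_window]
            contexts.append(left + right)
            centers.append(center)
            if len(centers) == batch_size:
                yield contexts, centers
                contexts, centers = [], []


sentence = [0, 209, 11, 93, 598, 13, 1]
for batch, labels in cbow_batches(sentence, 4, 2, max_epochs=1):
    print("CONTEXT WINDOWS:", batch)
    print("CENTER WORDS:", labels)
```

A partially filled batch is carried over into the next epoch, and whatever is left when `max_epochs` runs out is dropped.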
    


# Data Parsing Demos

In [8]:
# english test corpus
en_test = Corpus(FPATH_EN, 'en')

In [9]:
# demo generator
idx = 1
for tok in en_test.gen_tokens():
    print(tok)
    idx += 1
    if idx > 10:
        break

en_[[12]]
en_anarchism
en_is
en_often
en_defined
en_as
en_a
en_political
en_philosophy
en_which


In [10]:
# english vocabulary
en_vocab = Vocabulary(en_test.gen_tokens(), size = 1000)

In [11]:
# take a look
#en_vocab.types
#en_vocab.index
#en_vocab.wordset
en_vocab.size

1000

In [12]:
# translate the first sentence (idx 1; idx 0 is the article ID line) into ids
idx = 0
for sent in en_test.gen_sentences():
    if idx == 1:
        print(sent)
        print(en_vocab.sentence_to_ids(sent))
        break
    idx += 1

 en_anarchism en_is en_often en_defined en_as en_a en_political en_philosophy en_which en_holds en_the en_state en_to en_be en_undesirable en_, en_unnecessary en_, en_or en_harmful en_.
[0, 209, 11, 93, 598, 13, 10, 186, 267, 28, 2, 3, 58, 9, 30, 2, 4, 2, 4, 25, 2, 5, 1]


In [13]:
# demo batch iterator
batch_size = 4
bag_window = 2
idx = 0
for batch, labels in batch_generator(en_test, en_vocab, batch_size, bag_window):
    print("CONTEXT WINDOWS:", batch)
    print("CENTER WORDS:", labels)
    idx += 1
    if idx > 2:
        break

CONTEXT WINDOWS: [[1], [11, 93], [0, 209, 93, 598], [209, 11, 598, 13]]
CENTER WORDS: [2, 209, 11, 93]
CONTEXT WINDOWS: [[11, 93, 13, 10], [93, 598, 10, 186], [598, 13, 186, 267], [13, 10, 267, 28]]
CENTER WORDS: [598, 13, 10, 186]
CONTEXT WINDOWS: [[10, 186, 28, 2], [186, 267, 2, 3], [267, 28, 3, 58], [28, 2, 58, 9]]
CENTER WORDS: [267, 28, 2, 3]


In [14]:
# demo batch iterator w/ readable format
batch_size = 4
bag_window = 2
idx = 0
for batch, labels in batch_generator(en_test, en_vocab, batch_size, bag_window):
    print("Batch %s:"%(idx+1))
    for context, wrd in zip(batch,labels):
        print(en_vocab.to_words(context), "-->", en_vocab.to_words([wrd]))
    idx += 1
    if idx > 2:
        break

Batch 1:
['</s>'] --> ['<unk>']
['en_is', 'en_often'] --> ['en_anarchism']
['<s>', 'en_anarchism', 'en_often', 'en_defined'] --> ['en_is']
['en_anarchism', 'en_is', 'en_defined', 'en_as'] --> ['en_often']
Batch 2:
['en_is', 'en_often', 'en_as', 'en_a'] --> ['en_defined']
['en_often', 'en_defined', 'en_a', 'en_political'] --> ['en_as']
['en_defined', 'en_as', 'en_political', 'en_philosophy'] --> ['en_a']
['en_as', 'en_a', 'en_philosophy', 'en_which'] --> ['en_political']
Batch 3:
['en_a', 'en_political', 'en_which', '<unk>'] --> ['en_philosophy']
['en_political', 'en_philosophy', '<unk>', 'en_the'] --> ['en_which']
['en_philosophy', 'en_which', 'en_the', 'en_state'] --> ['<unk>']
['en_which', '<unk>', 'en_state', 'en_to'] --> ['en_the']


__`QUESTION:`__ What should the context be for words at the start and end of a sentence? I need to go back and check Mona's code.

In [15]:
# confirm that batch generator will reload
!wc {FPATH_EN}

  10000  259807 1461734 /home/mmillervedam/Data/test/wiki_en_10K.txt


In [16]:
# last sentence
!tail -n 1 {FPATH_EN}

Filmography Tarkovsky is mainly known as a director of films .


In [17]:
# first
!head -n 2 {FPATH_EN}

[[12]]
Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .


__`NOTE:`__ `~65001 batches per epoch` (in this test set)

In [18]:
# print batches 65000-65003, which span the end of the epoch
# (after the wrap-around they should be the same as the first batches above)
idx = 0
for batch, labels in batch_generator(en_test, en_vocab, 4, 2):
    idx += 1
    if idx < 65000:
        continue
    elif idx > 65003:
        break
    else:
        print("Batch %s:"%(idx))
        for context, wrd in zip(batch,labels):
            print(en_vocab.to_words(context), "-->", en_vocab.to_words([wrd]))

Batch 65000:
['<unk>', 'en_tarkovsky', '<unk>', 'en_known'] --> ['en_is']
['en_tarkovsky', 'en_is', 'en_known', 'en_as'] --> ['<unk>']
['en_is', '<unk>', 'en_as', 'en_a'] --> ['en_known']
['<unk>', 'en_known', 'en_a', '<unk>'] --> ['en_as']
Batch 65001:
['en_known', 'en_as', '<unk>', 'en_of'] --> ['en_a']
['en_as', 'en_a', 'en_of', 'en_films'] --> ['<unk>']
['en_a', '<unk>', 'en_films', 'en_.'] --> ['en_of']
['<unk>', 'en_of', 'en_.', '</s>'] --> ['en_films']
Batch 65002:
['en_of', 'en_films', '</s>'] --> ['en_.']
['</s>'] --> ['<unk>']
['en_is', 'en_often'] --> ['en_anarchism']
['<s>', 'en_anarchism', 'en_often', 'en_defined'] --> ['en_is']
Batch 65003:
['en_anarchism', 'en_is', 'en_defined', 'en_as'] --> ['en_often']
['en_is', 'en_often', 'en_as', 'en_a'] --> ['en_defined']
['en_often', 'en_defined', 'en_a', 'en_political'] --> ['en_as']
['en_defined', 'en_as', 'en_political', 'en_philosophy'] --> ['en_a']


# Testing with dictionary wordset

In [19]:
# load wordset from dict
pld = pd.read_csv(DPATH, sep='\t', names = ['en', 'es'], dtype=str)
en_set = set(pld.en.unique())

In [20]:
# take a look
len(en_set)

356410

In [21]:
# create vocab
en_vocab = Vocabulary(en_test.gen_tokens(), wordset = en_set, size = 100000)

In [22]:
# take a look - NOTE: the test set has a small vocabulary!
print(len(en_vocab.wordset))
print(en_vocab.size)

13561
13561


# Testing with full Spanish data

In [24]:
# real corpus
es_data = Corpus(FULL_ES, 'es')
es_set = set(pld.es.unique())

In [None]:
%%time
# vocabulary trained on full corpus
# (%%time rather than %%timeit: %%timeit does not keep es_vocab in the namespace)
es_vocab = Vocabulary(es_data.gen_tokens(), wordset = es_set, size = 10)

In [None]:
print(len(es_vocab.wordset))
print(es_vocab.size)

# Testing with full English data
I am still having memory problems with the full file. I think the next steps are 1) try a larger instance and 2) go back to the paper to see if we really need all of it.
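
One more thing that might be worth trying before moving to a larger instance is counting tokens one line at a time and discarding types that are not in the dictionary wordset. The sketch below shows that idea. `count_tokens` is a hypothetical helper, not part of `parsing.py`. It assumes the wordset entries carry the same language prefix as the tokens, and whether it actually resolves the memory problem on the ~9GB file is untested.

```python
import io
from collections import Counter


def count_tokens(lines, lang, wordset=None):
    """Count prefixed tokens from an iterable of lines, one line at a time."""
    counts = Counter()
    prefix = lang + "_"
    for line in lines:
        for tok in line.lower().split():
            word = prefix + tok
            # skipping out-of-dictionary types keeps the counter small
            if wordset is None or word in wordset:
                counts[word] += 1
    return counts


# in-memory stand-in for a large file; an open file object could be passed
# instead, since file objects also iterate line by line
sample = io.StringIO(u"El gato come .\nEl perro duerme .\n")
counts = count_tokens(sample, "es", wordset={"es_el", "es_gato", "es_perro"})
print(counts.most_common(2))
```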

# Polyglot nonsense

__`NOTE:`__ The first time you run this on a new machine, make sure you've installed [polyglot](http://polyglot.readthedocs.io/en/latest/Installation.html):
```
sudo apt-get install libicu-dev
pip install polyglot
```

In [None]:
import polyglot

ACK! (see readme for more info on what I've tried to fix this)

In [None]:
from polyglot.detect import Detector

In [None]:
# import the Text class explicitly rather than using a wildcard import
from polyglot.text import Text

In [None]:
blob = "[[12]] Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful ."