# Parsing Wiki Data
`MMV | 12/4 | w266 Final Project: Crosslingual Word Embeddings`   


The code in this notebook builds on the helper functions provided in the TensorFlow Word2Vec tutorial to develop a set of data handling functions for the data used in Duong et al.'s paper. Ideally I'll develop a scalable solution for tokenizing, prepending language indicators (e.g. `en_`), and extracting sentences in two languages to create training data that mixes sentences from both. I also hope to develop a batch iterator modeled after the one in A4. Depending on the available tools, I may need to use a distributed system (Spark?) to preprocess the English corpus, which is ~9GB.

# Notebook Set-up

In [19]:
# general imports
from __future__ import print_function  # __future__ imports must precede all other statements
import os
import re
import sys
import itertools
import numpy as np
import matplotlib.pyplot as plt
# tell matplotlib not to open a new window
%matplotlib inline

In [29]:
# filepaths
BASE = '/home/mmillervedam/Data'
FPATH_EN = BASE + '/test/wiki_en_10K.txt' # first 10000 lines from wiki dump
FPATH_ES = BASE + '/test/wiki_es_10K.txt' # first 10000 lines from wiki dump
FULL_EN = BASE + '/en/full.txt'
FULL_ES = BASE + '/es/full.txt'
DPATH = '/home/mmillervedam/ProjectRepo/XlingualEmb/data/dicts/en.es.panlex.all.processed'
EN_IT = '/home/mmillervedam/ProjectRepo/XlingualEmb/data/mono/en_it.shuf.10k'

# Desired Data Format

In [32]:
# take a look at what Duong et al trained on for reference
!head -n 10 {EN_IT}

it_[[877881]]
it_[[879362]]
it_in it_un it_remoto it_passato it_aveva it_progettato it_, it_per it_conto it_dei it_demoniazzi it_silastici it_di it_striterax it_, it_una it_bomba it_in it_grado it_di it_collegare it_simultaneamente it_tutti it_i it_nuclei it_di it_tutte it_le it_stelle it_, it_creando it_così it_un'immensa it_supernova it_che it_avrebbe it_distrutto it_l'universo it_, it_secondo it_i it_desideri it_dei it_demoniazzi it_silastici it_.
it_krikkitesi it_i it_krikkitesi it_sono it_una it_razza it_aliena it_che it_per it_miliardi it_di it_anni it_aveva it_vissuto it_senza it_la it_minima it_consapevolezza it_dell'esistenza it_di it_altri it_mondi it_o it_altre it_specie it_.
en_as en_the en_patron en_of en_delphi en_( en_pythian en_apollo en_) en_, en_apollo en_was en_an en_oracular en_god en_— en_the en_prophetic en_deity en_of en_the en_delphic en_oracle en_.
it_all'inizio it_del it_2006 it_ha it_pubblicato it_il it_suo it_primo it_singolo it_solista it_, it_nell'ang

__`NOTE:`__ There are no UNK tokens here, and punctuation is included as its own token. However, words are lowercased and the language marker is prepended to every token. Also note that sentences from the two languages have been shuffled together.
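To make the target format concrete, here is a minimal sketch of how already-tokenized sentences could be prefixed and shuffled together. The helper names `prefix_tokens` and `mix_corpora` are hypothetical (they are not from Duong et al.'s code), and the in-memory shuffle only makes sense for small samples like the 10K test files, not the full dumps.

```python
import random

def prefix_tokens(tokens, language):
    """Lowercase each token and prepend the language marker, e.g. 'en_the'."""
    return [language + '_' + tok.lower() for tok in tokens]

def mix_corpora(sents_a, lang_a, sents_b, lang_b, seed=0):
    """Return prefixed sentences from both languages in one shuffled list."""
    mixed = [' '.join(prefix_tokens(s, lang_a)) for s in sents_a]
    mixed += [' '.join(prefix_tokens(s, lang_b)) for s in sents_b]
    random.Random(seed).shuffle(mixed)  # fixed seed keeps the order reproducible
    return mixed

# toy input: one tokenized sentence per language
en = [['Apollo', 'was', 'an', 'oracular', 'god', '.']]
it = [['I', 'krikkitesi', 'sono', 'una', 'razza', 'aliena', '.']]
mixed = mix_corpora(en, 'en', it, 'it')
```

Each element of `mixed` is one line in the format shown above: space-separated, lowercased, and prefixed, with punctuation kept as its own token.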

# Data Reader

In [36]:
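# demo parsing
def parse(line, language):
    """Lowercase `line` and prefix each space-separated token with `<language>_`."""
    # prepend a space so that the first token also gets a prefix
    return re.sub(' ', " "+language+'_', " "+line.lower())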

In [39]:
# test it
a = "Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful ."
parse(a, 'en')

' en_anarchism en_is en_often en_defined en_as en_a en_political en_philosophy en_which en_holds en_the en_state en_to en_be en_undesirable en_, en_unnecessary en_, en_or en_harmful en_.'
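The substitution in `parse` keeps a leading space, and it would turn consecutive spaces into bare `en_` tokens and leave any trailing newline in place. A possible alternative, sketched here under the assumption that the input is already tokenized with whitespace between tokens, is to split and rejoin. The name `parse_tokens` is hypothetical.

```python
def parse_tokens(line, language):
    """Prefix every whitespace-separated token with '<language>_'."""
    # str.split() with no argument drops leading/trailing whitespace
    # (including the newline) and collapses runs of spaces.
    return ' '.join(language + '_' + tok for tok in line.lower().split())

result = parse_tokens("  Anarchism   is often defined .\n", 'en')
# result == 'en_anarchism en_is en_often en_defined en_.'
```

The regex version and this one agree on cleanly single-spaced input apart from the leading space.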

In [40]:
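# the file object is a lazy reader, so lines are yielded one at a time
def read_line(filename, language):
    with open(filename, 'rb') as f:  # the file is closed once the generator finishes
        for line in f:
            yield re.sub(" ", " "+language+'_', " "+line.lower())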

In [48]:
# test it: print only the first line
for l in read_line(FPATH_EN, 'en'):
    print(l)
    break

 en_[[12]]
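The blank lines in the output above appear because `read_line` yields each line with its trailing newline still attached. A minimal sketch of a reader that strips whitespace, skips empty lines, and decodes the file as UTF-8 follows; I am assuming the wiki dumps are UTF-8 encoded. The names `prefix_lines` and `read_sentences` are hypothetical. Decoding matters on Python 2 because `.lower()` on a byte string may leave accented characters untouched.

```python
import io

def prefix_lines(lines, language):
    """Lazily yield one prefixed sentence per non-empty line."""
    for line in lines:
        tokens = line.lower().split()
        if tokens:  # skip blank lines
            yield ' '.join(language + '_' + tok for tok in tokens)

def read_sentences(filename, language):
    """Stream prefixed sentences from a UTF-8 text file."""
    # io.open decodes to unicode on both Python 2 and 3
    with io.open(filename, encoding='utf-8') as f:
        for sentence in prefix_lines(f, language):
            yield sentence

# in-memory stand-in for a file: an article marker, a blank line, a sentence
demo = io.StringIO(u"[[12]]\n\n\u00c1rbol es una planta .\n")
sentences = list(prefix_lines(demo, 'es'))
```

Here `sentences` holds two entries, the article marker and the sentence, and the accented capital is lowercased along with the rest.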



# Build Vocabulary - work in progress

In [None]:
%%writefile vocabulary.py
#!/usr/bin/env python
"""
Base this on the Vocabulary class in a4's vocabulary.py.

Instead of creating the wordset from all the tokens in the Wiki file,
start from the dictionary: count only those tokens (since we know they
fit in memory) and keep the top N from that list.

The rest of the class should work fine as is. Make sure we cite it properly.
"""

# ADD CODE HERE
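
One way the plan in the docstring might look is sketched below. It is only an illustration: `build_vocab_from_dictionary` is a hypothetical name, `dictionary_words` is assumed to be an iterable of already-prefixed tokens (the PanLex file would need its own loader, which is not shown), and the special tokens handled by the a4 class are left out.

```python
import collections

def build_vocab_from_dictionary(sentences, dictionary_words, size=None):
    """Count only tokens that occur in dictionary_words; keep the top `size`."""
    allowed = set(dictionary_words)
    counts = collections.Counter(
        tok for sent in sentences for tok in sent.split() if tok in allowed)
    # most_common(None) returns every counted token, most frequent first
    top = [word for word, _ in counts.most_common(size)]
    word_to_id = {word: i for i, word in enumerate(top)}
    return counts, word_to_id

# toy corpus: 'en_zzz' is not in the dictionary, so it is never counted
sentences = ['en_the en_state en_the', 'en_the en_zzz en_state']
dictionary = ['en_the', 'en_state', 'en_god']
counts, word_to_id = build_vocab_from_dictionary(sentences, dictionary, size=2)
```

Because only dictionary tokens are counted, the counter never grows beyond the dictionary size, regardless of how large the corpus is.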

### utils from a4 copied here for reference:

In [None]:
# Word processing functions
def canonicalize_digits(word):
    # leave any token that contains a letter untouched
    if any([c.isalpha() for c in word]): return word
    word = re.sub(r"\d", "DG", word) # raw string so \d is not treated as a string escape
    if word.startswith("DG"):
        word = word.replace(",", "") # remove thousands separator
    return word

def canonicalize_word(word, wordset=None, digits=True):
    word = word.lower()
    if digits:
        if (wordset is not None) and (word in wordset): return word
        word = canonicalize_digits(word) # try to canonicalize numbers
    if (wordset is None) or (word in wordset): return word
    else: return "<unk>" # unknown token

def canonicalize_words(words, **kw):
    return [canonicalize_word(word, **kw) for word in words]

__`NOTE:`__ You need to copy the [vocabulary module from a4](https://github.com/datasci-w266/2017-fall-assignment-mmillervedam/blob/master/assignment/a4/shared_lib/vocabulary.py) before running the following (or write our own; see the Build Vocabulary section above).

In [None]:
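# Data loading functions
import nltk
import vocabulary

def flatten(list_of_lists):
    """Flatten a list of lists into one list (assumed equivalent to the a4 helper, which is not copied here)."""
    return list(itertools.chain.from_iterable(list_of_lists))

def get_corpus(name="brown"):
    return getattr(nltk.corpus, name)

def sents_to_tokens(sents, vocab):
    """Returns a flattened list of the words in the sentences, with normal padding."""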
    padded_sentences = (["<s>"] + s + ["</s>"] for s in sents)
    # This will canonicalize words, and replace anything not in vocab with <unk>
    return np.array([canonicalize_word(w, wordset=vocab.wordset)
                     for w in flatten(padded_sentences)], dtype=object)

def build_vocab(corpus, V=10000):
    token_feed = (canonicalize_word(w) for w in corpus.words())
    vocab = vocabulary.Vocabulary(token_feed, size=V)
    return vocab


def preprocess_sentences(sentences, vocab):
    """Preprocess sentences by canonicalizing and mapping to ids.
    Args:
      sentences ( list(list(string)) ): input sentences
      vocab: Vocabulary object, already initialized
    Returns:
      ids ( array(int) ): flattened array of sentences, including boundary <s>
      tokens.
    """
    # Add sentence boundaries, canonicalize, and handle unknowns
    words = flatten(["<s>"] + s + ["</s>"] for s in sentences)
    words = [canonicalize_word(w, wordset=vocab.word_to_id)
             for w in words]
    return np.array(vocab.words_to_ids(words))


In [None]:
# Use this function
def load_corpus(name, split=0.8, V=10000, shuffle=0):
    """Load a named corpus and split train/test along sentences."""
    corpus = get_corpus(name)
    vocab = build_vocab(corpus, V)
    # NOTE: get_train_test_sents is another a4 util that has not been copied into this notebook
    train_sentences, test_sentences = get_train_test_sents(corpus, split, shuffle)
    train_ids = preprocess_sentences(train_sentences, vocab)
    test_ids = preprocess_sentences(test_sentences, vocab)
    return vocab, train_ids, test_ids

# Use this function
def batch_generator(ids, batch_size, max_time):
    """Convert ids to data-matrix form."""
    # Clip to a multiple of batch_size so the ids reshape evenly into batch_size rows
    clip_len = ((len(ids)-1) // batch_size) * batch_size  # // is integer division on Python 2 and 3
    input_w = ids[:clip_len]     # current word
    target_y = ids[1:clip_len+1]  # next word
    # Reshape so we can select columns
    input_w = input_w.reshape([batch_size,-1])
    target_y = target_y.reshape([batch_size,-1])

    # Yield batches of at most max_time columns
    for i in range(0, input_w.shape[1], max_time):
        yield input_w[:,i:i+max_time], target_y[:,i:i+max_time]
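
To see what the clipping and reshaping in `batch_generator` do, here is a small walk-through of the same arithmetic on plain Python lists. I use lists instead of NumPy arrays only so that the slicing is explicit. `batch_generator_plain` is a hypothetical name, and the row slicing is my restatement of `reshape([batch_size, -1])`.

```python
def batch_generator_plain(ids, batch_size, max_time):
    """List-based walk-through of the clip / reshape / slice steps."""
    clip_len = ((len(ids) - 1) // batch_size) * batch_size
    inputs, targets = ids[:clip_len], ids[1:clip_len + 1]
    cols = clip_len // batch_size  # row length after reshape([batch_size, -1])
    rows_w = [inputs[r * cols:(r + 1) * cols] for r in range(batch_size)]
    rows_y = [targets[r * cols:(r + 1) * cols] for r in range(batch_size)]
    # step across the columns, max_time at a time
    for i in range(0, cols, max_time):
        yield ([row[i:i + max_time] for row in rows_w],
               [row[i:i + max_time] for row in rows_y])

# 11 ids, 2 rows: clip_len is 10, so each row holds 5 ids
batches = list(batch_generator_plain(list(range(11)), batch_size=2, max_time=3))
# batches[0] == ([[0, 1, 2], [5, 6, 7]], [[1, 2, 3], [6, 7, 8]])
# batches[1] == ([[3, 4], [8, 9]], [[4, 5], [9, 10]])
```

Each target row is the input row shifted by one position, and the last batch is narrower whenever the row length is not a multiple of `max_time`.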

# Polyglot nonsense

__`NOTE:`__ The first time you run this on a new machine, make sure you've installed [polyglot](http://polyglot.readthedocs.io/en/latest/Installation.html):
```
sudo apt-get install libicu-dev
pip install polyglot
```

In [8]:
import polyglot

ACK! (See the readme for more info on what I've tried to fix this.)

In [12]:
from polyglot.detect import Detector

ImportError: /home/mmillervedam/anaconda2/lib/python2.7/site-packages/_icu.so: undefined symbol: _ZTIN6icu_5514LEFontInstanceE

In [11]:
from polyglot.text import *

ImportError: /home/mmillervedam/anaconda2/lib/python2.7/site-packages/_icu.so: undefined symbol: _ZTIN6icu_5514LEFontInstanceE

In [7]:
blob = "[[12]] Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful ."