# Initial Data Exploration
__`w266 Final Project | MIDS Fall 2017`__

This notebook contains the code to investigate the data available from Duong et al.'s repo, plus an initial foray into the [PanLex API](https://dev.panlex.org/api/) and the [Polyglot Python package](http://polyglot.readthedocs.io/en/latest/Embeddings.html).

## Contents
* [Author's Data](#Author's-Data)
* [Polyglot Wikipedia Dumps](#Polyglot-Wikipedia-Dumps)
* [Polyglot Embeddings](#Polyglot-Embeddings)
* [PanLex API](#PanLex-API)

## Notebook Setup

In [1]:
# imports
import pandas as pd
import requests

In [5]:
# globals
HOME = '/home/mmillervedam/'

# Author's Data

In [2]:
# What files are available?
!ls -l ../XlingualEmb/data/dicts
!ls -l ../XlingualEmb/data/mono

total 138168
-rw-rw-r-- 1 mmillervedam mmillervedam 30951284 Nov 30 19:25 en.de.panlex.all.processed
-rw-rw-r-- 1 mmillervedam mmillervedam  8626690 Nov 30 19:25 en.el.panlex.all.processed
-rw-rw-r-- 1 mmillervedam mmillervedam 26496021 Nov 30 19:25 en.es.panlex.all.processed
-rw-rw-r-- 1 mmillervedam mmillervedam 19429394 Nov 30 19:25 en.fi.panlex.all.processed
-rw-rw-r-- 1 mmillervedam mmillervedam 17820708 Nov 30 19:25 en.it.panlex.all.processed
-rw-rw-r-- 1 mmillervedam mmillervedam 24258057 Nov 30 19:25 en.ja.panlex.all.processed
-rw-rw-r-- 1 mmillervedam mmillervedam 12340490 Nov 30 19:25 en.nl.panlex.all.processed
-rw-rw-r-- 1 mmillervedam mmillervedam  1541535 Nov 30 19:25 en.sr.panlex.all.processed
-rw-rw-r-- 1 mmillervedam mmillervedam        1 Nov 30 19:25 README.md
total 3664
-rw-rw-r-- 1 mmillervedam mmillervedam 3746786 Nov 30 19:25 en_it.shuf.10k
-rw-rw-r-- 1 mmillervedam mmillervedam       1 Nov 30 19:25 README.md


In [14]:
# What do the PanLex processed files look like?
!head -n 5 ../XlingualEmb/data/dicts/en.es.panlex.all.processed

en_0	es_cero
en_0_or_1_matches	es_0_ó_1_coincidencias
en_0_or_more_matches	es_0_o_más_coincidencias
en_1000000000000	es_billón
en_1000000000	es_billón
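
The sample above suggests one translation pair per line, with each side written as `<lang>_<term>` and the two sides separated by a tab. A minimal sketch for reading that layout follows; the function name `read_panlex_pairs` is hypothetical, and the format assumptions (a single tab per line, language code before the first underscore) come only from the five lines shown, not from any documentation of the files.

```python
import io


def read_panlex_pairs(handle):
    """Parse '<lang>_<term><TAB><lang>_<term>' lines into 4-tuples.

    Assumes exactly one tab per line; lines that do not fit are skipped.
    """
    pairs = []
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # blank or malformed line
        # Split on the first underscore only: terms such as
        # '0_or_1_matches' contain underscores of their own.
        src_lang, src_term = parts[0].split("_", 1)
        tgt_lang, tgt_term = parts[1].split("_", 1)
        pairs.append((src_lang, src_term, tgt_lang, tgt_term))
    return pairs


# Lines copied from the head of en.es.panlex.all.processed
sample = io.StringIO(
    "en_0\tes_cero\n"
    "en_0_or_1_matches\tes_0_ó_1_coincidencias\n"
    "en_1000000000\tes_billón\n"
)
pairs = read_panlex_pairs(sample)
```

The resulting list of tuples can be passed to `pd.DataFrame` with explicit column names if a tabular view is more convenient.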


In [17]:
# What does the monolingual file look like?
!head -n 8 ../XlingualEmb/data/mono/en_it.shuf.10k

it_[[877881]]
it_[[879362]]
it_in it_un it_remoto it_passato it_aveva it_progettato it_, it_per it_conto it_dei it_demoniazzi it_silastici it_di it_striterax it_, it_una it_bomba it_in it_grado it_di it_collegare it_simultaneamente it_tutti it_i it_nuclei it_di it_tutte it_le it_stelle it_, it_creando it_così it_un'immensa it_supernova it_che it_avrebbe it_distrutto it_l'universo it_, it_secondo it_i it_desideri it_dei it_demoniazzi it_silastici it_.
it_krikkitesi it_i it_krikkitesi it_sono it_una it_razza it_aliena it_che it_per it_miliardi it_di it_anni it_aveva it_vissuto it_senza it_la it_minima it_consapevolezza it_dell'esistenza it_di it_altri it_mondi it_o it_altre it_specie it_.
en_as en_the en_patron en_of en_delphi en_( en_pythian en_apollo en_) en_, en_apollo en_was en_an en_oracular en_god en_— en_the en_prophetic en_deity en_of en_the en_delphic en_oracle en_.
it_all'inizio it_del it_2006 it_ha it_pubblicato it_il it_suo it_primo it_singolo it_solista it_, it_nell'ang
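
In the monolingual file every whitespace-separated token appears to carry a language prefix (`it_` or `en_`), and Italian and English lines are interleaved. The sketch below strips the prefixes and reports the language of a line. The helper name `split_prefixed_tokens` is hypothetical, and the reading of the format is an assumption based on the sample output above.

```python
def split_prefixed_tokens(line):
    """Return (lang, tokens) for a line of '<lang>_<token>' items.

    lang is None when the line is empty or mixes several prefixes.
    """
    langs = set()
    tokens = []
    for item in line.split():
        # partition splits on the first underscore only, so tokens
        # that contain underscores themselves stay intact.
        lang, _, token = item.partition("_")
        langs.add(lang)
        tokens.append(token)
    lang = langs.pop() if len(langs) == 1 else None
    return lang, tokens


lang, tokens = split_prefixed_tokens("en_as en_the en_patron en_of en_delphi")
```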

# Polyglot Wikipedia Dumps

These files are available from [Rami Al-Rfou's website](https://sites.google.com/site/rmyeid/projects/polyglot#TOC-Download-Wikipedia-Text-Dumps) and are released under a Creative Commons license in association with the following publication:  

__Citation:__ [Polyglot: Distributed Word Representations for Multilingual NLP](http://www.aclweb.org/anthology/W13-3520), 
Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 
In Proceedings of the Seventeenth Conference on Computational Natural Language Learning (CoNLL 2013).
> `@InProceedings{polyglot:2013:ACL-CoNLL,
  author    = {Al-Rfou, Rami  and  Perozzi, Bryan  and  Skiena, Steven},
  title     = {Polyglot: Distributed Word Representations for Multilingual NLP},
  booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
  month     = {August},
  year      = {2013},
  address   = {Sofia, Bulgaria},
  publisher = {Association for Computational Linguistics},
  pages     = {183--192}, 
  url       = {http://www.aclweb.org/anthology/W13-3520}
}`

### G-Doc Downloading Script
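Because the compressed Wikipedia text files are so large, Google Drive asks for a download confirmation that a simple `wget` or `curl` command does not handle. The code below, taken from [this SO post](https://stackoverflow.com/questions/25010369/wget-curl-large-file-from-google-drive), uses the `requests` package to read the confirmation token from the response cookies and repeat the request with it.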

In [3]:
def download_file_from_google_drive(file_id, destination):
    """Download a (possibly large) Google Drive file to `destination`."""
    URL = "https://docs.google.com/uc?export=download"
    CHUNK_SIZE = 32768  # bytes requested per iteration

    def get_confirm_token(response):
        # Large files set a 'download_warning...' cookie that holds the token
        for key, value in response.cookies.items():
            if key.startswith('download_warning'):
                return value
        return None

    def save_response_content(response, destination):
        # Stream the body to disk in chunks so the whole file is never in memory
        with open(destination, "wb") as f:
            for chunk in response.iter_content(CHUNK_SIZE):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)

    session = requests.Session()
    response = session.get(URL, params={'id': file_id}, stream=True)
    token = get_confirm_token(response)

    # Repeat the request with the token when Drive asks for confirmation
    if token:
        params = {'id': file_id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)

    save_response_content(response, destination)

### Download the English Corpus
__File:__ `en_wiki_text.tar.lzma` __Google Doc ID:__ `0B5lWReQPSvmGOTNxdHo3b0lMc3c`

In [9]:
download_file_from_google_drive('0B5lWReQPSvmGOTNxdHo3b0lMc3c', HOME +'Data/en_wiki_text.tar.lzma')

### Download the Spanish Corpus
__File:__ `es_wiki_text.tar.lzma` __Google Doc ID:__ `0B5lWReQPSvmGOXdCZEZPSnZoYXc`

In [10]:
%%timeit
download_file_from_google_drive('0B5lWReQPSvmGOXdCZEZPSnZoYXc', HOME +'Data/es_wiki_text.tar.lzma')

1 loop, best of 3: 8.72 s per loop


### Download the French Corpus
__File:__ `fr_wiki_text.tar.lzma` __Google Doc ID:__ `0B5lWReQPSvmGdkIxeVJESWcyVU0`

In [22]:
%%timeit
download_file_from_google_drive('0B5lWReQPSvmGdkIxeVJESWcyVU0', HOME +'Data/fr_wiki_text.tar.lzma')

1 loop, best of 3: 9.43 s per loop


### Download the Japanese Corpus
__File:__ `ja_wiki_text.tar.lzma` __Google Doc ID:__ `0B5lWReQPSvmGYzlWMC1KcV9kVzQ`

In [23]:
download_file_from_google_drive('0B5lWReQPSvmGYzlWMC1KcV9kVzQ', HOME +'Data/ja_wiki_text.tar.lzma')

### Decompress the files
Next, decompress these files. Do this from your terminal after navigating to the `Data` folder (see the Linux commands below). Read more about the [tar command](https://www.howtogeek.com/248780/how-to-compress-and-extract-files-using-the-tar-command-on-linux/) and about lzma-compressed files [here](https://www.lifewire.com/lzma-file-2621951) and [here](https://fileinfo.com/extension/lzma).
> `cd /home/mmillervedam/Data`  
> `tar --lzma -xvpf en_wiki_text.tar.lzma`    
> `tar --lzma -xvpf es_wiki_text.tar.lzma`  
> `tar --lzma -xvpf fr_wiki_text.tar.lzma`    
> `tar --lzma -xvpf ja_wiki_text.tar.lzma`  
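
The same extraction can also be done without leaving the notebook, using Python's standard `tarfile` module. This is only a sketch of an alternative to the `tar` commands above: the function name `extract_lzma_tar` is hypothetical, and it relies on `tarfile`'s `'r:xz'` mode reading through the `lzma` module, which auto-detects the legacy `.lzma` container as well as `.xz` when decompressing.

```python
import tarfile


def extract_lzma_tar(archive_path, dest_dir):
    """Extract a .tar.lzma (or .tar.xz) archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    # 'r:xz' decompresses via the lzma module, which recognises both
    # the .xz and the older .lzma stream formats.
    with tarfile.open(archive_path, mode="r:xz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

For example, `extract_lzma_tar(HOME + 'Data/en_wiki_text.tar.lzma', HOME + 'Data')` would mirror the first `tar` command. As with any archive extraction, it should only be used on archives from a trusted source.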

### Take a look

In [24]:
!wc -l {HOME}Data/*/*

   88083626 /home/mmillervedam/Data/en/full.txt
   18833490 /home/mmillervedam/Data/es/full.txt
   23856824 /home/mmillervedam/Data/fr/full.txt
   52875002 /home/mmillervedam/Data/ja/full.txt
  183648942 total


In [25]:
!head -n 2 {HOME}Data/en/full.txt

[[12]]
Anarchism is often defined as a political philosophy which holds the state to be undesirable , unnecessary , or harmful .


In [26]:
!head -n 2 {HOME}Data/es/full.txt

[[7]]
El Principado de Andorra ( en catalán : Principat d'Andorra ) es un pequeño principado soberano del suroeste de Europa con una extensión de 468 km2 , situado en los Pirineos entre España y Francia , con una altitud media de 1.996 metros sobre el nivel del mar .


In [27]:
!head -n 2 {HOME}Data/fr/full.txt

[[3]]
Paul Jules Antoine Meillet , né le à Moulins ( Allier ) et mort le à Châteaumeillant ( Cher ), est le principal linguiste français des premières décennies du .


In [28]:
!head -n 2 {HOME}Data/ja/full.txt

[[5]]
アンパサンド
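
Each `full.txt` seems to consist of `[[N]]` marker lines followed by the text of one article, one sentence or heading per line. A minimal sketch for grouping lines by article is shown below. The name `iter_articles` is hypothetical, and treating `[[N]]` as an article ID that opens a new article is an assumption drawn from the `head` output above.

```python
import re

# A marker line is assumed to hold nothing but '[[<digits>]]'
ARTICLE_MARKER = re.compile(r"^\[\[(\d+)\]\]$")


def iter_articles(lines):
    """Yield (article_id, [text lines]) from an iterable of lines."""
    current_id, body = None, []
    for line in lines:
        line = line.strip()
        match = ARTICLE_MARKER.match(line)
        if match:
            # A new marker closes the previous article, if any
            if current_id is not None:
                yield current_id, body
            current_id, body = int(match.group(1)), []
        elif line and current_id is not None:
            body.append(line)
    # Emit the last article once the input is exhausted
    if current_id is not None:
        yield current_id, body


sample = [
    "[[12]]",
    "Anarchism is often defined as a political philosophy",
    "",
    "[[5]]",
    "アンパサンド",
]
articles = list(iter_articles(sample))
```

Because it is a generator, the same function can be fed an open file handle, so the multi-million-line corpora never need to be loaded into memory at once.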


# Polyglot Embeddings

# PanLex API