# Milestone Report

Disclosure: I have very little affinity with R and will probably not use it in the future. Since the course asks us to take on something we have not learned before (NLP) and I prefer Python, I will use the spaCy library and other technologies for this capstone.


## Imported libraries

In [7]:
import os
import pandas as pd
import spacy
from collections import Counter
from nltk import bigrams, trigrams
spacy.prefer_gpu()
nlp = spacy.load('en')  # load the English model (spaCy v2 shortcut; spaCy v3 needs a full name such as 'en_core_web_sm')

## Pre-processing
We are going to focus on the English datasets. As those files all share the same structure (plain lines of text), let's concatenate them into a single file named `corpus.txt` using the following commands:

```shell
> cat en_US.*.txt > corpus.txt
> wc -l corpus.txt 
 4269678 corpus.txt
```

The file can be found in the project at `datasets/corpus.txt`.

We probably only need words, so let's keep only alphabetic tokens, in lowercase.

We will use [spaCy](https://spacy.io/) for this purpose. As this step is resource-intensive, I will store the result in a new file named `preprocessed_corpus.txt`, with one comma-separated line of tokens per input line.

In [8]:
if not os.path.isfile('datasets/preprocessed_corpus.txt'):
    with open('datasets/corpus.txt') as corpus:
        with open('datasets/preprocessed_corpus.txt', 'w') as preprocessed_corpus:
            # iterate lazily instead of loading every line in memory with readlines()
            for line in corpus:
                # keep only alphabetic tokens, lowercased, joined with commas
                tokens = [token.lower_ for token in nlp(line.strip()) if token.is_alpha]
                preprocessed_corpus.write('{}\n'.format(','.join(tokens)))

### Monogram frequencies
As our `preprocessed_corpus.txt` contains more than 4 million lines, we do not want to recompute the monogram frequencies on every run, so we are going to store the result in a new dataset named `monogram_frequencies.txt`.

In [12]:
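if not os.path.isfile('datasets/monogram_frequencies.txt'):
    words_count = Counter()
    with open('datasets/preprocessed_corpus.txt') as corpus:
        for line in corpus:
            # update() counts in place; `+=` would build a temporary Counter for every line
            # empty strings (lines without any alphabetic token) are skipped
            words_count.update(word for word in line.strip().split(',') if word)

    with open('datasets/monogram_frequencies.txt', 'w') as f:
        for word, count in words_count.most_common():
            f.write('{},{}\n'.format(word, count))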

The code above is a naive approach that processes the whole file in one go. A more practical one is:
 - split the data from the command line: `split -l 711613 preprocessed_corpus.txt`
 - run the Python script provided in the repository on each chunk `datasets/xa?`, for example `python monogram_counter.py datasets/xaa`
 - aggregate all the results: `python monogram_counter_aggregator.py datasets/xa*.csv`
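The repository scripts are not reproduced here, so the following is only a minimal sketch of what the count-then-aggregate steps could look like. The helper names `count_chunk`, `write_counts` and `aggregate` are hypothetical, and the `word,count` line format of the intermediate files is assumed from the `monogram_frequencies.txt` cell above.

```python
from collections import Counter


def count_chunk(path):
    """Count the comma-separated tokens of one chunk file (hypothetical helper)."""
    counts = Counter()
    with open(path, encoding='utf-8') as chunk:
        for line in chunk:
            # skip the empty strings produced by lines without tokens
            counts.update(word for word in line.strip().split(',') if word)
    return counts


def write_counts(counts, path):
    """Write one `word,count` line per word, most frequent first."""
    with open(path, 'w', encoding='utf-8') as out:
        for word, count in counts.most_common():
            out.write('{},{}\n'.format(word, count))


def aggregate(paths):
    """Sum the partial `word,count` files into a single Counter."""
    total = Counter()
    for path in paths:
        with open(path, encoding='utf-8') as partial:
            for line in partial:
                word, count = line.rstrip('\n').rsplit(',', 1)
                total[word] += int(count)
    return total
```

Because each chunk is counted independently, the chunks could also be processed in parallel, and only the much smaller per-chunk frequency files need to be read during aggregation.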

## Exploratory analysis
> TODO try to calculate coverage for monogram, bigram and trigram for each quantile

In [10]:
words_count = Counter()
# for line in open('datasets/en_US/preprocessed_corpus.txt').readlines():
#      words_count += Counter(line.strip().split(','))

# list(bigrams( ['i', 'am', 'going', 'to', 'the', 'cinema']))
# list(trigrams( ['i', 'am', 'going', 'to', 'the', 'cinema']))

print('Single unique words count {}'.format(len(words_count) ))
def calculate_coverage(wc):
    """Return how many of the most frequent words cover 50% and 90% of all occurrences."""
    total_nb_words = sum(wc.values())
    fifty_percent_words = total_nb_words * 0.5
    ninety_percent_words = total_nb_words * 0.9
    cumulative_count = 0
    fifty_percent_unique_words = 0
    ninety_percent_unique_words = 0

    for i, (word, count) in enumerate(wc.most_common()):
        cumulative_count += count
        # record the rank at which each threshold is reached for the first time
        if not fifty_percent_unique_words and cumulative_count >= fifty_percent_words:
            fifty_percent_unique_words = i + 1
        if cumulative_count >= ninety_percent_words:
            ninety_percent_unique_words = i + 1
            break

    return (fifty_percent_unique_words, ninety_percent_unique_words)

Single unique words count 0
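For the TODO above, here is a rough sketch of how coverage could be computed for monograms, bigrams and trigrams alike. It is an illustration, not a result: the helpers `ngrams`, `ngram_counts` and `types_needed` are hypothetical names, and a plain `zip` is used in place of the `nltk` helpers so the block has no external dependency.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield consecutive n-token tuples from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def ngram_counts(token_lines, n):
    """Count n-grams line by line, so that no n-gram spans two lines."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(ngrams(tokens, n))
    return counts


def types_needed(counts, fraction):
    """Number of most frequent n-grams needed to cover `fraction` of all occurrences."""
    target = sum(counts.values()) * fraction
    cumulative = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        cumulative += count
        if cumulative >= target:
            return rank
    return 0  # empty counter
```

Calling `types_needed(ngram_counts(lines, n), q)` for `n` in 1, 2, 3 and for several values of `q` would give one coverage figure per n-gram size and quantile, where `lines` is assumed to be an iterable of token lists read from the preprocessed corpus.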


### How do you evaluate how many of the words come from foreign languages?
- Use a static dictionary
- Test if the characters belong to the alphabet
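The two ideas above can be combined into a rough estimate. The sketch below is only an illustration under stated assumptions: `vocabulary` stands for any static set of known English words (no particular word list is implied), the "alphabet" is taken to be the 26 unaccented lowercase letters, and the function names are hypothetical.

```python
import string
from collections import Counter

ENGLISH_LETTERS = set(string.ascii_lowercase)


def uses_english_alphabet(word):
    """True if the word is non-empty and made only of unaccented a-z letters."""
    return bool(word) and set(word) <= ENGLISH_LETTERS


def foreign_share(counts, vocabulary):
    """Share of word occurrences that fail the alphabet test or the dictionary lookup."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    foreign = sum(
        count for word, count in counts.items()
        if not uses_english_alphabet(word) or word not in vocabulary
    )
    return foreign / total
```

This will likely overestimate the foreign share: names, slang, typos and accented loanwords are also flagged, so the figure is better read as an upper bound.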

### Can you think of a way to increase the coverage -- identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

## Modeling