# Modern NLP Tutorial
This notebook follows the PyData 2016 tutorial "Modern NLP in Python" by Patrick Harrison. It processes Yelp restaurant reviews, models and visualizes their topics, and creates and visualizes word vectors. The original notebook can be found [here](https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb). The accompanying video from PyData can be found [here](https://youtu.be/6zm9NC9uRkk). The academic dataset used in the notebook can be downloaded from [here](https://app.dominodatalab.com/mtldata/yackathon/browse/yelp_dataset_challenge_academic_dataset). The entire Yelp dataset can be found [here](https://www.yelp.com/dataset).

## Imports and Data Preparation

### Import Packages and set data directory paths

In [17]:
import os
import codecs
import pandas as pd
import itertools as it

from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [16]:
# Toggle this variable to choose between the full and academic yelp datasets
academic = True
prefix = 'yelp_academic_dataset_' if academic else ''
folder = 'yelp-academic' if academic else 'yelp-full'

data_directory = os.path.join('/mnt/Data/ml/datasets/yelp-dataset', folder)
businesses_filepath = os.path.join(data_directory, prefix + 'business.json')
review_json_filepath = os.path.join(data_directory, prefix + 'review.json')
intermediate_directory = os.path.join(data_directory, 'intermediate')

review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')
unigram_sentences_filepath = os.path.join(intermediate_directory, 'unigram_sentences_all.txt')
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')
bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')
trigram_model_filepath = os.path.join(intermediate_directory, 'trigram_model_all')
trigram_sentences_filepath = os.path.join(intermediate_directory, 'trigram_sentences_all.txt')
trigram_reviews_filepath = os.path.join(intermediate_directory, 'trigram_transformed_reviews_all.txt')

trigram_dictionary_filepath = os.path.join(intermediate_directory, 'trigram_dict_all.dict')
trigram_bow_filepath = os.path.join(intermediate_directory, 'trigram_bow_corpus_all.mm')
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')
topic_names_filepath = os.path.join(intermediate_directory, 'topic_names.pkl')
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')

word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')
tsne_filepath = os.path.join(intermediate_directory, 'tsne_model')
tsne_vectors_filepath = os.path.join(intermediate_directory, 'tsne_vectors.npy')

### Write out the review file (once)

Read in the business json file and go through each business. Count how many restaurants are present and get their ids.

Total number of restaurants in yelp-academic: 21,892

Total number of restaurants in yelp-full: 54,618

In [None]:
from helper_fns import get_restaurant_ids
    
restaurant_ids = get_restaurant_ids(businesses_filepath)
print(f'{len(restaurant_ids)} restaurants in the dataset')
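
The `helper_fns` module is not shown in this notebook, so the following is only a rough sketch of what `get_restaurant_ids` might do. It assumes one JSON object per line with `business_id` and `categories` fields, and it treats a business as a restaurant when `Restaurants` appears among its categories. The name `sketch_get_restaurant_ids` and the demo records are hypothetical.

```python
import json
import os
import tempfile


def sketch_get_restaurant_ids(businesses_filepath):
    """Collect the business_id of every business tagged as a restaurant.

    Hypothetical stand-in for helper_fns.get_restaurant_ids; the field
    names and the 'Restaurants' category label are assumptions.
    """
    restaurant_ids = set()
    with open(businesses_filepath, encoding='utf_8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            business = json.loads(line)
            categories = business.get('categories') or []
            # Some releases may store categories as a comma-separated
            # string instead of a list, so normalise to a list first.
            if isinstance(categories, str):
                categories = [c.strip() for c in categories.split(',')]
            if 'Restaurants' in categories:
                restaurant_ids.add(business['business_id'])
    return frozenset(restaurant_ids)


# Tiny demo on made-up records (not real Yelp data).
demo_path = os.path.join(tempfile.mkdtemp(), 'business.json')
demo_records = [
    {'business_id': 'b1', 'categories': ['Restaurants', 'Mexican']},
    {'business_id': 'b2', 'categories': ['Dentists']},
    {'business_id': 'b3', 'categories': 'Pizza, Restaurants'},
    {'business_id': 'b4', 'categories': None},
]
with open(demo_path, 'w', encoding='utf_8') as f:
    for record in demo_records:
        f.write(json.dumps(record) + '\n')

demo_ids = sketch_get_restaurant_ids(demo_path)
print(f'{len(demo_ids)} restaurants in the demo file')
```

Returning a `frozenset` keeps the later membership test (is this review's business a restaurant?) cheap even with tens of thousands of ids.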

Write the reviews of each restaurant into the reviews file, **ONE LINE PER REVIEW**. Each newline inside a review is escaped to the literal two-character sequence `\n`, and a real newline is appended to mark the end of the review.

Number of reviews in yelp-academic: 990,627

Number of reviews in yelp-full: 3,221,419

In [None]:
from helper_fns import write_review_file

review_count = write_review_file(review_txt_filepath, review_json_filepath, restaurant_ids)
print(f'Text from {review_count} reviews written to new txt file')
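
As a hedged illustration of the escaping scheme described above, the sketch below writes one review per line and then reads the reviews back with their newlines restored. `sketch_write_review_file` and `sketch_line_review` are hypothetical stand-ins for the `helper_fns` functions of similar name, and the `business_id` and `text` field names are assumptions about the review JSON.

```python
import json
import os
import tempfile


def sketch_write_review_file(review_txt_filepath, review_json_filepath, restaurant_ids):
    """Write one restaurant review per line, escaping embedded newlines."""
    review_count = 0
    with open(review_txt_filepath, 'w', encoding='utf_8', newline='\n') as txt_file, \
            open(review_json_filepath, encoding='utf_8') as json_file:
        for line in json_file:
            review = json.loads(line)
            if review['business_id'] not in restaurant_ids:
                continue
            # Replace each real newline with the two characters '\' and 'n'
            # so the whole review stays on a single physical line.
            escaped = review['text'].replace('\n', '\\n')
            txt_file.write(escaped + '\n')
            review_count += 1
    return review_count


def sketch_line_review(review_txt_filepath):
    """Yield reviews with the escaped newlines turned back into real ones."""
    with open(review_txt_filepath, encoding='utf_8', newline='\n') as f:
        for line in f:
            yield line.rstrip('\n').replace('\\n', '\n')


# Tiny demo on made-up reviews (not real Yelp data).
demo_dir = tempfile.mkdtemp()
demo_json = os.path.join(demo_dir, 'review.json')
demo_txt = os.path.join(demo_dir, 'review_text_all.txt')
demo_reviews = [
    {'business_id': 'b1', 'text': 'Great food.\nWill return.'},
    {'business_id': 'b2', 'text': 'Painless filling.'},
]
with open(demo_json, 'w', encoding='utf_8') as f:
    for review in demo_reviews:
        f.write(json.dumps(review) + '\n')

demo_count = sketch_write_review_file(demo_txt, demo_json, {'b1'})
with open(demo_txt, encoding='utf_8') as f:
    demo_lines = f.read().splitlines()
demo_restored = list(sketch_line_review(demo_txt))
```

Note that this simple scheme cannot tell an escaped newline apart from a review that already contained a literal backslash followed by `n`; for review text that is probably an acceptable trade-off.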

## SpaCy Text Processing

In [None]:
nlp = spacy.load('en_default')

### Sample Review
Grab a sample review and analyze various aspects of SpaCy using it.

In [None]:
import itertools as it

with codecs.open(review_txt_filepath, encoding='utf_8') as f:
    sample_review = list(it.islice(f, 8, 9))[0]
    sample_review = sample_review.replace('\\n', '\n')
    
parsed_review = nlp(sample_review)

In [None]:
for num, sentence in enumerate(parsed_review.sents):
    print(f'Sentence {num+1}:')
    print(sentence)

In [None]:
for num, entity in enumerate(parsed_review.ents):
    print(f'Entity {num+1}: {entity} - {entity.label_}')

In [None]:
token_attrs = [(token.text,
                token.pos_,
                token.lemma_,
                token.shape_,
                token.prob,
                token.text in STOP_WORDS,
                token.is_punct,
                token.is_space,
                token.like_num,
                token.is_oov)
                for token in parsed_review]

df = pd.DataFrame(token_attrs, columns=['text', 'pos', 'lemma', 'shape', 'log_prob',
                                       'stop?', 'punct?', 'whitespace?', 'number?',
                                        'out of vocab?'])
df.loc[:, 'stop?':'out of vocab?'] = (df.loc[:, 'stop?':'out of vocab?']
                                     .applymap(lambda x: u'Yes' if x else u''))
df

## Phrase Modeling

### Unigram

#### Unigram write file (once)
Get sentences from each review and write out the unigram file. This should be done only once.

Number of sentences in yelp-academic: 10,146,794  
Time taken to process yelp-academic: 5h 34m 46s

Number of sentences in yelp-full: 30,392,900  
Time taken to process yelp-full: 16h 1m 53s

In [None]:
from helper_fns import write_unigram_sents

sentence_count = write_unigram_sents(unigram_sentences_filepath, review_txt_filepath, nlp)
print(f'{sentence_count} sentences written to {unigram_sentences_filepath} file')        

#### Unigram Sentences Example

In [None]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [None]:
for unigram_sentence in it.islice(unigram_sentences, 230, 240):
    print(u' '.join(unigram_sentence))    

### Bigram

#### Bigram Phrase model create and save (once)
We learn a phrase model that will link individual words into two-word phrases. The model is saved after generation.

Time taken to generate yelp-academic bigram model: 4m 3s  

Time taken to generate yelp-full bigram model: 13m 17s

In [None]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [None]:
bigram_model = Phrases(unigram_sentences)
bigram_model.save(bigram_model_filepath)

#### Bigram sentences write file (once)
After learning the bigram phrase model, we feed in the individual sentences from unigram_sentences to find possible bigram phrases. If found, gensim will automatically join them with an underscore.
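
To make the idea concrete, here is a small pure-Python sketch of phrase detection. It is a simplified re-implementation for illustration, not gensim's code: it scores each adjacent pair with `(pair_count - min_count) * vocab_size / (count_a * count_b)`, which is my understanding of the default scorer in `Phrases`, and joins pairs whose score exceeds a threshold. The function names, the toy corpus, and the choice to count only distinct unigrams in `vocab_size` are assumptions.

```python
from collections import Counter


def learn_bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs; a higher score means a stronger collocation."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        unigram_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))

    vocab_size = len(unigram_counts)
    scores = {}
    for (first, second), pair_count in bigram_counts.items():
        scores[(first, second)] = ((pair_count - min_count) * vocab_size
                                   / (unigram_counts[first] * unigram_counts[second]))
    return scores


def join_phrases(sentence, scores, threshold, delimiter='_'):
    """Greedily merge adjacent pairs whose score exceeds the threshold."""
    merged = []
    i = 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append(delimiter.join(pair))
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Toy corpus; the real notebook streams millions of sentences from disk.
demo_sentences = [
    ['the', 'ice', 'cream', 'was', 'great'],
    ['great', 'ice', 'cream', 'here'],
    ['the', 'service', 'was', 'slow'],
    ['ice', 'cream', 'was', 'great'],
]
demo_scores = learn_bigram_scores(demo_sentences, min_count=1)
demo_joined = join_phrases(demo_sentences[0], demo_scores, threshold=1.0)
print(' '.join(demo_joined))
```

The demo uses a much smaller `min_count` and `threshold` than gensim's defaults (5 and 10.0) because the toy corpus has only four sentences.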

In [None]:
unigram_sentences = LineSentence(unigram_sentences_filepath)
bigram_model = Phrases.load(bigram_model_filepath)

Write out the bigram sentences to disk.

Number of sentences in yelp-academic: 10,109,973  
Time taken to process yelp-academic: 12m 45s

Number of sentences in yelp-full: 30,301,195  
Time taken to process yelp-full: 36m 7s

In [None]:
from helper_fns import write_sents

sentence_count = write_sents(bigram_sentences_filepath, unigram_sentences, bigram_model)
print(f'{sentence_count} sentences written to {bigram_sentences_filepath}')
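
`write_sents` also lives in `helper_fns` and is not shown here. A plausible sketch, assuming it simply indexes the phrase model with each tokenised sentence and writes the result one sentence per line, could look like this. `sketch_write_sents` and `StubPhraseModel` are hypothetical; the stub only exists so the demo runs without a trained gensim model.

```python
import os
import tempfile


def sketch_write_sents(sentences_filepath, input_sentences, phrase_model):
    """Apply a phrase model to each sentence and write one sentence per line."""
    sentence_count = 0
    with open(sentences_filepath, 'w', encoding='utf_8') as f:
        for sentence in input_sentences:
            # Indexing a trained Phrases model with a token list returns the
            # tokens with detected phrases joined by the delimiter.
            phrased = phrase_model[sentence]
            f.write(' '.join(phrased) + '\n')
            sentence_count += 1
    return sentence_count


class StubPhraseModel:
    """Stand-in for a trained phrase model (assumption, demo only)."""

    def __getitem__(self, sentence):
        joined = ' '.join(sentence).replace('ice cream', 'ice_cream')
        return joined.split()


demo_out = os.path.join(tempfile.mkdtemp(), 'bigram_sentences.txt')
demo_input = [['the', 'ice', 'cream', 'melted'], ['friendly', 'staff']]
demo_written = sketch_write_sents(demo_out, demo_input, StubPhraseModel())
with open(demo_out, encoding='utf_8') as f:
    demo_output_lines = f.read().splitlines()
```

Because the sentences are consumed one at a time, the same function works with a streaming `LineSentence` iterator and never holds the full corpus in memory.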

#### Bigram Sentences Example

In [None]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [None]:
for bigram_sentence in it.islice(bigram_sentences, 230, 240):
    print(u' '.join(bigram_sentence))

### Trigram

#### Trigram Phrase model create and save (once)
We learn a phrase model that will link individual words into three-word phrases based on the input from bigram sentences. The model is saved after generation.

Time taken to generate yelp-academic trigram model: 4m 35s 

Time taken to generate yelp-full trigram model: 12m 14s

In [None]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [None]:
trigram_model = Phrases(bigram_sentences)
trigram_model.save(trigram_model_filepath)

#### Trigram sentences write file (once)
After learning the trigram phrase model, we feed in the individual sentences from bigram_sentences to find possible trigram phrases. If found, gensim will automatically join them with an underscore.

In [None]:
bigram_sentences = LineSentence(bigram_sentences_filepath)
trigram_model = Phrases.load(trigram_model_filepath)

Write out the trigram sentences to disk.

Number of sentences in yelp-academic: 10,109,973  
Time taken to process yelp-academic: 11m 54s

Number of sentences in yelp-full: 30,301,195  
Time taken to process yelp-full: 35m 22s

In [None]:
from helper_fns import write_sents

sentence_count = write_sents(trigram_sentences_filepath, bigram_sentences, trigram_model)
print(f'{sentence_count} sentences written to {trigram_sentences_filepath}')

#### Trigram Sentences Example

In [None]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [None]:
for trigram_sentence in it.islice(trigram_sentences, 300, 350):
    print(u' '.join(trigram_sentence))

## Generating full reviews file

Now we generate the full text of the reviews with normalized text, stopwords removed, and second-order phrases (trigrams) joined.


In [None]:
bigram_model = Phrases.load(bigram_model_filepath)
trigram_model = Phrases.load(trigram_model_filepath)

Number of reviews in yelp-academic: 991,714  
Time taken to write reviews in yelp-academic: 5h 48m 52s  
The original reviews file has 990,627 reviews, so the trigram-transformed file has 1,087 more. I am not sure where the extra reviews came from.

Number of reviews in yelp-full: 3,223,214  
Time taken to write reviews in yelp-full: 16h 55m 33s  
The original reviews file has 3,221,419 reviews, so the trigram-transformed file has 1,795 more. I am not sure where the extra reviews came from.
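
One possible explanation for the mismatch, offered as an unverified hypothesis rather than a diagnosis: files opened with `codecs.open` are split into lines with `str.splitlines` semantics, which treat characters such as `\u2028`, `\x85`, or `\x0c` as line breaks, whereas `io.open` (the built-in `open`) only breaks on `\n`, `\r`, and `\r\n`. If a review contains one of those characters and the helper reads the file with `codecs.open`, as other cells in this notebook do, that review would be counted as two. The sketch below reproduces the effect on a made-up two-line file.

```python
import codecs
import io
import os
import tempfile


def count_lines(path, opener):
    """Count the lines an opener yields when iterating over the file."""
    with opener(path, encoding='utf_8') as f:
        return sum(1 for _ in f)


# Two "reviews", the first containing a Unicode LINE SEPARATOR (U+2028).
demo_path = os.path.join(tempfile.mkdtemp(), 'reviews.txt')
with io.open(demo_path, 'w', encoding='utf_8', newline='\n') as f:
    f.write('great tacos\u2028friendly staff\n')
    f.write('slow service\n')

codecs_count = count_lines(demo_path, codecs.open)
io_count = count_lines(demo_path, io.open)
print(f'codecs.open sees {codecs_count} lines, io.open sees {io_count}')
```

If this is the cause, reading the reviews file with the built-in `open` (or stripping such characters when the file is first written) should bring the two counts back in line; I have not checked this against the Yelp data.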

In [None]:
from helper_fns import write_trigram_review

review_count = write_trigram_review(trigram_reviews_filepath, review_txt_filepath, bigram_model, trigram_model,
                                   nlp)
print(f'{review_count} reviews written to {trigram_reviews_filepath}')

### Review File example

In [9]:
from helper_fns import line_review

print("Original:")
print()

for review in it.islice(line_review(review_txt_filepath), 4352, 4353):
    print(review)

print("----")
print()
print("Transformed:")
print()

with codecs.open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 4352, 4353):
        print(review)

Original:

Visited here on Thursday evening (3/20) around 6:30 pm for dinner.  It was not packed although there were several patrons and more walking in as we were leaving.  No wait time.  I noticed upstairs they had a conference room enclosed with glass walls - looks like it would be a nice place for a birthday party or small private engagement.

The food:

Bread - The complimentary bread basket had a mixture of parmesan bread sticks and a couple other bread variations.  It came with two bean type dips and olive oil.  The parmesan bread stick is so good!!! I would've hoarded it all if I weren't on a business trip.  Alas, I had to pretend to be civilized and eat only two.  The bread was good, nothing that stood out, but I love bread I'll eat it anyway.

Fish of the day - I ordered the fish of the day which was rainbow trout over farro.  It came with some sort of creamy sauce that brought everything together perfectly.  A lot of times, the grains that come with the dish are pretty simpl

## Topic Modeling with Latent Dirichlet Allocation (_LDA_)

We want to sort the reviews into different groups, each representing a different theme. These groups are essentially the topics.

### Generate Dictionary file (once)

First we create a full vocabulary of the corpus to be modeled using gensim's [**Dictionary**](https://radimrehurek.com/gensim/corpora/dictionary.html) class.

Time taken to create yelp-academic dictionary: 1m 12.5s 

Time taken to create yelp-full dictionary: 3m 5s 

In [14]:
trigram_reviews = LineSentence(trigram_reviews_filepath)

# learn the dictionary by iterating over all of the reviews
trigram_dictionary = Dictionary(trigram_reviews)

# filter tokens that are very rare or too common from
# the dictionary (filter_extremes) and reassign integer ids (compactify)
trigram_dictionary.filter_extremes(no_below=10, no_above=0.4)
trigram_dictionary.compactify()

trigram_dictionary.save(trigram_dictionary_filepath)
    
# load the finished dictionary from disk
trigram_dictionary = Dictionary.load(trigram_dictionary_filepath)
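
For intuition about the filtering step, here is a simplified pure-Python sketch of the document-frequency rule that `filter_extremes` applies: keep a token only if it occurs in at least `no_below` documents and in no more than a `no_above` fraction of all documents. This is an illustration, not gensim's implementation; it omits the `keep_n` cap, and the alphabetical id assignment and the function name `sketch_filter_extremes` are assumptions.

```python
from collections import Counter


def sketch_filter_extremes(documents, no_below, no_above):
    """Return a token -> id mapping after dropping rare and ubiquitous tokens."""
    doc_freq = Counter()
    for document in documents:
        # Count each token once per document, however often it repeats.
        doc_freq.update(set(document))

    max_docs = no_above * len(documents)
    kept = sorted(token for token, df in doc_freq.items()
                  if no_below <= df <= max_docs)
    # Reassign contiguous integer ids, similar in spirit to compactify().
    return {token: token_id for token_id, token in enumerate(kept)}


# Toy corpus of tokenised reviews (made up).
demo_documents = [
    ['good', 'pizza'],
    ['good', 'pasta'],
    ['good', 'pizza', 'wine'],
    ['good', 'salad'],
]
demo_vocab = sketch_filter_extremes(demo_documents, no_below=2, no_above=0.75)
print(demo_vocab)
```

In the toy corpus `good` appears in every document and is dropped as too common, while `pasta`, `wine`, and `salad` appear only once and are dropped as too rare.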

### Generate bag-of-words model (once)

Using the dictionary created above (which is just a mapping of words to integer IDs), we create a bag-of-words model where each review is represented by the counts of the distinct terms in it.

Time taken to create and save yelp-academic BOW model:  

Time taken to create and save yelp-full BOW model: 

In [None]:
from helper_fns import trigram_bow_generator

MmCorpus.serialize(trigram_bow_filepath,
                   trigram_bow_generator(trigram_reviews_filepath, trigram_dictionary))
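
The bag-of-words representation can be sketched without gensim as follows. `sketch_doc2bow` mimics what I understand `Dictionary.doc2bow` to return, a sorted list of `(token_id, count)` pairs that ignores tokens missing from the dictionary, and `sketch_bow_generator` is a hypothetical stand-in for `trigram_bow_generator` that works on in-memory token lists rather than a file.

```python
from collections import Counter


def sketch_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


def sketch_bow_generator(reviews, token2id):
    """Lazily yield one bag-of-words vector per tokenised review."""
    for review in reviews:
        yield sketch_doc2bow(review, token2id)


# Made-up vocabulary and reviews for illustration.
demo_token2id = {'burger': 0, 'fries': 1, 'milkshake': 2}
demo_reviews = [
    ['burger', 'fries', 'burger', 'the'],
    ['milkshake'],
]
demo_bow = list(sketch_bow_generator(demo_reviews, demo_token2id))
print(demo_bow)
```

Yielding one vector at a time is what lets `MmCorpus.serialize` stream the corpus to disk without loading every review into memory.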

In [None]:
# load the finished bag-of-words corpus from disk
trigram_bow_corpus = MmCorpus(trigram_bow_filepath)