# Modern NLP Tutorial
This notebook follows Patrick Harrison's PyData 2016 tutorial, "Modern NLP in Python". It processes Yelp restaurant reviews, models and visualizes their topics, and trains and visualizes word vectors. The original notebook can be found [here](https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb). The accompanying video from PyData can be found [here](https://youtu.be/6zm9NC9uRkk). The academic dataset used in the notebook can be downloaded from [here](https://app.dominodatalab.com/mtldata/yackathon/browse/yelp_dataset_challenge_academic_dataset). The entire Yelp dataset can be found [here](https://www.yelp.com/dataset).

This notebook was written to work with the entire dataset rather than the academic dataset, because I couldn't find the academic dataset when I started working on it. The `academic` flag in the path setup below selects which of the two is used.

## Imports and Data Preparation

### Import Packages and set data directory paths

In [13]:
import os
import codecs
import pandas as pd
import itertools as it

from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

In [14]:
academic = True
prefix = 'yelp_academic_dataset_' if academic else ''
folder = 'yelp-academic' if academic else 'yelp-full'

# Pass the folder as its own path component instead of concatenating strings
data_directory = os.path.join('/mnt/Data/ml/datasets/yelp-dataset', folder)
businesses_filepath = os.path.join(data_directory, prefix + 'business.json')
review_json_filepath = os.path.join(data_directory, prefix + 'review.json')
intermediate_directory = os.path.join(data_directory, 'intermediate')

review_txt_filepath = os.path.join(intermediate_directory, 'review_text_all.txt')
unigram_sentences_filepath = os.path.join(intermediate_directory, 'unigram_sentences_all.txt')
bigram_model_filepath = os.path.join(intermediate_directory, 'bigram_model_all')
bigram_sentences_filepath = os.path.join(intermediate_directory, 'bigram_sentences_all.txt')
trigram_model_filepath = os.path.join(intermediate_directory, 'trigram_model_all')
trigram_sentences_filepath = os.path.join(intermediate_directory, 'trigram_sentences_all.txt')
trigram_reviews_filepath = os.path.join(intermediate_directory, 'trigram_transformed_reviews_all.txt')

trigram_dictionary_filepath = os.path.join(intermediate_directory, 'trigram_dict_all.dict')
trigram_bow_filepath = os.path.join(intermediate_directory, 'trigram_bow_corpus_all.mm')
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')
topic_names_filepath = os.path.join(intermediate_directory, 'topic_names.pkl')
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')

word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')
tsne_filepath = os.path.join(intermediate_directory, 'tsne_model')
tsne_vectors_filepath = os.path.join(intermediate_directory, 'tsne_vectors.npy')

### Write out the review file (once)

Read in the business json file and go through each business. Count how many restaurants are present and get their ids.

Total number of restaurants in yelp-academic: 21,892

Total number of restaurants in yelp-full: 54,618

In [34]:
from file_manip import get_restaurant_ids
    
restaurant_ids = get_restaurant_ids(businesses_filepath)
print(f'{len(restaurant_ids)} restaurants in the dataset')

54618 restaurants in the dataset
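
The `file_manip` module is not shown in this notebook, so here is a minimal sketch of what a helper like `get_restaurant_ids` might look like. The name `sketch_get_restaurant_ids` is hypothetical, and the sketch assumes the business file holds one JSON object per line with `business_id` and `categories` fields, where `categories` may be a list, a comma-separated string, or missing, depending on the dataset version.

```python
import json


def sketch_get_restaurant_ids(businesses_filepath):
    """Return the set of business ids whose categories include 'Restaurants'.

    Hypothetical stand-in for file_manip.get_restaurant_ids.
    """
    restaurant_ids = set()
    with open(businesses_filepath, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            business = json.loads(line)
            categories = business.get('categories') or []
            # Assumption: some dataset versions store categories as one
            # comma-separated string rather than a list of strings.
            if isinstance(categories, str):
                categories = [c.strip() for c in categories.split(',')]
            if 'Restaurants' in categories:
                restaurant_ids.add(business['business_id'])
    return restaurant_ids
```

Returning a set keeps the later membership test (is this review about a restaurant?) fast even with tens of thousands of ids.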


Write the text of each restaurant review to the reviews file, **ONE LINE PER REVIEW**. Newline characters inside a review are escaped, i.e. replaced with the literal two-character sequence `\n`, and a real newline is appended to mark the end of the review.

Number of reviews in yelp-academic: 990,627

Number of reviews in yelp-full: 3,221,419

In [30]:
from file_manip import write_review_file
review_count = write_review_file(review_txt_filepath, review_json_filepath, restaurant_ids)
print(f'Text from {review_count} reviews written to new txt file')

Text from 990627 reviews written to new txt file
CPU times: user 54.9 s, sys: 2.58 s, total: 57.4 s
Wall time: 1min 6s
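
Since `write_review_file` also comes from the unseen `file_manip` module, the following is only a sketch of the escaping scheme described above. The name `sketch_write_review_file` is hypothetical, and it assumes each line of the review file is a JSON object with `business_id` and `text` fields.

```python
import json


def sketch_write_review_file(review_txt_filepath, review_json_filepath, restaurant_ids):
    """Write one restaurant review per line and return how many were written.

    Hypothetical stand-in for file_manip.write_review_file.
    """
    review_count = 0
    with open(review_json_filepath, encoding='utf-8') as reviews_in, \
            open(review_txt_filepath, 'w', encoding='utf-8') as reviews_out:
        for line in reviews_in:
            if not line.strip():
                continue  # skip blank lines
            review = json.loads(line)
            if review['business_id'] not in restaurant_ids:
                continue  # not a restaurant review
            # Escape real newlines as the two characters backslash + n,
            # then end the record with one real newline.
            reviews_out.write(review['text'].replace('\n', '\\n') + '\n')
            review_count += 1
    return review_count
```

Reading a review back is the reverse step: strip the trailing newline and replace the literal `\n` sequences with real newlines.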


## SpaCy Text Processing

In [15]:
nlp = spacy.load('en_default')

### Sample Review
Grab a sample review and use it to explore what spaCy's parse provides: sentences, named entities, and token-level attributes.

In [16]:
import itertools as it

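with codecs.open(review_txt_filepath, encoding='utf_8') as f:
    # Take the ninth review (zero-based index 8) without reading the whole file
    sample_review = list(it.islice(f, 8, 9))[0]
    # Restore the newlines that were escaped when the review file was written
    sample_review = sample_review.replace('\\n', '\n')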
    
parsed_review = nlp(sample_review)

In [None]:
for num, sentence in enumerate(parsed_review.sents):
    print(f'Sentence {num+1}:')
    print(sentence)

In [None]:
for num, entity in enumerate(parsed_review.ents):
    print(f'Entity {num+1}: {entity} - {entity.label_}')

In [None]:
token_attrs = [(token.text,
                token.pos_,
                token.lemma_,
                token.shape_,
                token.prob,
                token.text in STOP_WORDS,
                token.is_punct,
                token.is_space,
                token.like_num,
                token.is_oov)
                for token in parsed_review]

df = pd.DataFrame(token_attrs, columns=['text', 'pos', 'lemma', 'shape', 'log_prob',
                                       'stop?', 'punct?', 'whitespace?', 'number?',
                                        'out of vocab?'])
df.loc[:, 'stop?':'out of vocab?'] = (df.loc[:, 'stop?':'out of vocab?']
                                     .applymap(lambda x: u'Yes' if x else u''))
df

### Phrase Modeling

#### Unigram write file (once)
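Split each review into sentences, lemmatize the tokens, and write one sentence per line to the unigram file. This only needs to be done once, which matters because it takes hours (see the timings below).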

Number of sentences in yelp-academic: 10,146,794
Time taken to process yelp-academic: 5h 34m 46s

Number of sentences in yelp-full: 30,392,900
Time taken to process yelp-full: 16h 1m 53s

In [4]:
from file_manip import write_unigram_sents

sentence_count = write_unigram_sents(unigram_sentences_filepath, review_txt_filepath, nlp)
print(f'{sentence_count} sentences written to {unigram_sentences_filepath} file')        

10146794 sentences written to /mnt/Data/ml/datasets/yelp-dataset/yelp-academic/intermediate/unigram_sentences_all.txt file
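
`write_unigram_sents` is another `file_manip` helper that is not shown here. A minimal sketch of how it might work follows, under these assumptions: `nlp` is a loaded spaCy pipeline that can split sentences, each line of the review file is one review with escaped newlines, and punctuation and whitespace tokens are dropped. The name `sketch_write_unigram_sents` and the `batch_size` parameter are hypothetical.

```python
def sketch_write_unigram_sents(unigram_sentences_filepath, review_txt_filepath, nlp,
                               batch_size=1000):
    """Write one lemmatized sentence per line and return the sentence count.

    Hypothetical stand-in for file_manip.write_unigram_sents.
    """
    def reviews():
        # Stream reviews lazily so the whole file never sits in memory.
        with open(review_txt_filepath, encoding='utf-8') as f:
            for line in f:
                # Undo the newline escaping used in the review file.
                yield line.rstrip('\n').replace('\\n', '\n')

    sentence_count = 0
    with open(unigram_sentences_filepath, 'w', encoding='utf-8') as out:
        # nlp.pipe processes the texts in batches, which is faster than
        # calling nlp() on one review at a time.
        for doc in nlp.pipe(reviews(), batch_size=batch_size):
            for sent in doc.sents:
                lemmas = [token.lemma_ for token in sent
                          if not (token.is_punct or token.is_space)]
                if lemmas:
                    out.write(' '.join(lemmas) + '\n')
                    sentence_count += 1
    return sentence_count
```

Writing space-separated tokens, one sentence per line, is what lets gensim's `LineSentence` stream the file back later.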


In [22]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [23]:
for unigram_sentence in it.islice(unigram_sentences, 230, 240):
    print(u' '.join(unigram_sentence))

-PRON- have never see a restaurant that have a frowning brownie a.k.a frownie as -PRON- icon mascot or spokesperson
king 's family restaurant have surprise -PRON- with this
-PRON- think -PRON- may be in direct dialogue with eat n park smile cookie funny funny odd not funny ha ha
-PRON- be seat rather quickly by the manager
very nice people work here -PRON- be happy to find that even though -PRON- have close a section the server be willing to stay longer to serve -PRON- -PRON- dinner
-PRON- chat with the server a little bit
-PRON- be quite engaging and then move on to order -PRON- meal
tea pepsi
open face hot turkey sandwich with mash potato and gravy buffalo chicken strip with mash potato and macaroni n cheese
-PRON- drink arrive and -PRON- server converse with -PRON- some more about the area


#### Bigram Phrase model create and save (once)
We learn a phrase model that will link individual words into two-word phrases. The model is saved after generation.

Time taken to generate yelp-academic bigram model: 3m 55s  
Time taken to save yelp-academic bigram model: 8s

Time taken to generate yelp-full bigram model: 3m 55s  
Time taken to save yelp-full bigram model: 8s

In [None]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [25]:
bigram_model = Phrases(unigram_sentences)
bigram_model.save(bigram_model_filepath)
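
To make "learning a phrase model" more concrete, here is a pure-Python sketch of the scoring rule from Mikolov et al. that, as I understand it, gensim's `Phrases` uses by default: a word pair scores high when it occurs together more often than the individual word counts would suggest. This is a simplified illustration and not gensim's implementation. The function names are hypothetical, the vocabulary size is taken to be the number of distinct unigrams (an assumption), and the toy `min_count` and `threshold` values are far smaller than what a real corpus would use.

```python
from collections import Counter


def sketch_bigram_scores(sentences, min_count=1):
    """Score each adjacent word pair: (count(a, b) - min_count) * V / (count(a) * count(b))."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    vocab_size = len(unigrams)  # assumption: distinct unigrams only
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }


def sketch_join_phrases(sentence, scores, threshold):
    """Greedily join adjacent pairs whose score exceeds the threshold."""
    joined, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            joined.append('_'.join(pair))
            i += 2  # both words were consumed by the phrase
        else:
            joined.append(sentence[i])
            i += 1
    return joined


toy_sentences = [
    ['i', 'love', 'ice', 'cream'],
    ['ice', 'cream', 'be', 'great'],
    ['i', 'love', 'great', 'ice', 'cream'],
]
toy_scores = sketch_bigram_scores(toy_sentences, min_count=1)
print(sketch_join_phrases(['ice', 'cream', 'be', 'great'], toy_scores, threshold=1.0))
```

Subtracting `min_count` is what keeps pairs that were seen only once or twice from being promoted to phrases.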

#### Bigram sentences write file (once)
After learning the bigram phrase model, we feed each sentence from `unigram_sentences` through it. Wherever the model recognizes a two-word phrase, gensim joins the two tokens with an underscore.

In [28]:
unigram_sentences = LineSentence(unigram_sentences_filepath)
bigram_model = Phrases.load(bigram_model_filepath)

In [None]:
from file_manip import write_sents

sentence_count = write_sents(bigram_sentences_filepath, unigram_sentences, bigram_model)
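
`write_sents` is not defined in this notebook either. A minimal sketch, assuming it applies the phrase model to each tokenized sentence by indexing (`phrase_model[sentence]`, which a gensim `Phrases` model supports) and writes the result one sentence per line. The name `sketch_write_sents` is hypothetical.

```python
def sketch_write_sents(sentences_filepath, sentences, phrase_model):
    """Apply a phrase model to each sentence, write one per line, return the count.

    Hypothetical stand-in for file_manip.write_sents.
    """
    sentence_count = 0
    with open(sentences_filepath, 'w', encoding='utf-8') as out:
        for sentence in sentences:
            # Indexing the model with a token list returns the tokens with
            # detected phrases merged, e.g. ['ice', 'cream'] -> ['ice_cream'].
            out.write(' '.join(phrase_model[sentence]) + '\n')
            sentence_count += 1
    return sentence_count
```

Because the function only relies on indexing, the same helper could be reused for the trigram pass by handing it the bigram sentences and a trigram model.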