<a href="https://colab.research.google.com/github/salmanarif86/MLAI/blob/master/ModernNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# spaCy: Industrial-Strength NLP in Python
![alt text](https://camo.githubusercontent.com/5544cd4d424dafdd00f9c3064157cc86b4a892cc/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f736b69706772616d2d696d616765732f73706143792e706e67)
spaCy is an industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:

1. Tokenization
2. Text normalization, such as lowercasing and lemmatization
3. Part-of-speech tagging
4. Syntactic dependency parsing
5. Sentence boundary detection
6. Named entity recognition and annotation

In the "batteries included" Python tradition, spaCy contains built-in data and models which you can use out-of-the-box for processing general-purpose English language text:

1. Large English vocabulary, including stopword lists
2. Token "probabilities"
3. Word vectors

spaCy is written in optimized Cython, which means it's fast. According to a few independent sources, it's the fastest syntactic parser available in any language. Key pieces of the spaCy parsing pipeline are written in pure C, enabling efficient multithreading (i.e., spaCy can release the GIL).

In [0]:
import spacy
import pandas as pd
import numpy as np
import ast
from spacy import displacy

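# The 'en' shortcut only resolves in spaCy 2.x; the full package name works in 2.x and 3.x.
# Assumption: 'en' pointed at the small English model (python -m spacy download en_core_web_sm).
nlp = spacy.load('en_core_web_sm')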

## Context

I have always been fascinated by the food culture of Bengaluru. Restaurants from all over the world can be found here: from the United States to Japan, Russia to Antarctica, you get all types of cuisine. Delivery, dine-out, pubs, bars, drinks, buffets, desserts: you name it and Bengaluru has it. Bengaluru is the best place for foodies. The number of restaurants is increasing day by day and currently stands at approximately 12,000. Even with such a high number of restaurants, the industry hasn't been saturated yet, and new restaurants are opening every day. However, it has become difficult for them to compete with already established restaurants. The key issues that continue to pose a challenge to them include high real estate costs, rising food costs, a shortage of quality manpower, a fragmented supply chain, and over-licensing. This Zomato data aims at analysing the demography of each location. Most importantly, it will help new restaurants decide their theme, menus, cuisine, cost, etc. for a particular location. It also aims at finding similarities between neighborhoods of Bengaluru on the basis of food. The dataset also contains reviews for each restaurant, which will help in finding an overall rating for the place.

We will demonstrate the power of modern NLP on this dataset.

## Content

The data is accurate to what was available on the Zomato website up to 15 March 2019. This data is also available on Kaggle. We will create a dataframe, check a few attributes, and do some EDA. The column we are most interested in is reviews_list. It holds a list of tuples containing the reviews for the restaurant; each tuple consists of two values: the rating and the review written by the customer.

In [0]:
df = pd.read_csv('zomato.csv')

In [162]:
df.head(3)

Unnamed: 0,url,address,name,online_order,book_table,rate,votes,phone,location,rest_type,dish_liked,cuisines,approx_cost(for two people),reviews_list,menu_item,listed_in(type),listed_in(city)
0,https://www.zomato.com/bangalore/jalsa-banasha...,"942, 21st Main Road, 2nd Stage, Banashankari, ...",Jalsa,Yes,Yes,4.1/5,775,080 42297555\r\n+91 9743772233,Banashankari,Casual Dining,"Pasta, Lunch Buffet, Masala Papad, Paneer Laja...","North Indian, Mughlai, Chinese",800,"[('Rated 4.0', 'RATED\n A beautiful place to ...",[],Buffet,Banashankari
1,https://www.zomato.com/bangalore/spice-elephan...,"2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ...",Spice Elephant,Yes,No,4.1/5,787,080 41714161,Banashankari,Casual Dining,"Momos, Lunch Buffet, Chocolate Nirvana, Thai G...","Chinese, North Indian, Thai",800,"[('Rated 4.0', 'RATED\n Had been here for din...",[],Buffet,Banashankari
2,https://www.zomato.com/SanchurroBangalore?cont...,"1112, Next to KIMS Medical College, 17th Cross...",San Churro Cafe,Yes,No,3.8/5,918,+91 9663487993,Banashankari,"Cafe, Casual Dining","Churros, Cannelloni, Minestrone Soup, Hot Choc...","Cafe, Mexican, Italian",800,"[('Rated 3.0', ""RATED\n Ambience is not that ...",[],Buffet,Banashankari


Converting the string representation of each list into an actual list using literal_eval

In [0]:
# literal_eval can be passed directly; no lambda wrapper is needed
df.reviews_list = df.reviews_list.apply(ast.literal_eval)
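
As a minimal illustration of what `ast.literal_eval` does in this step, the sketch below parses a hand-written string shaped like one `reviews_list` cell. The sample text is made up and not taken from the dataset. Unlike `eval`, `literal_eval` only accepts Python literals, which makes it a safer choice for scraped text.

```python
import ast

# Hypothetical cell value: a list of (rating, review) tuples stored as a string
raw_cell = "[('Rated 4.0', 'RATED\\n  Nice place'), ('Rated 3.0', 'RATED\\n  Average food')]"

# Turn the string into a real Python list of tuples
parsed = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__, '->', type(parsed).__name__)
for rating, review in parsed:
    print(rating, '|', review.replace('\n', ' '))
```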



Converting the lists of tuples into a dataframe of individual reviews:

1. Set the column 'name' as the index and select the reviews_list column.
2. Apply pd.Series to spread the elements of each list across columns, then stack them vertically so that each review gets its own row.
3. Apply pd.Series again to separate the two elements of each tuple, and add 'val_' as a prefix to the new columns.
4. Reset the index to move 'name' back into a regular column, and drop the leftover 'level_1' column created by stacking.

In [0]:
df_reviews=df.set_index('name').reviews_list.apply(pd.Series).stack().apply(pd.Series).add_prefix('val_').reset_index().drop('level_1', axis=1)
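
For comparison, here is a sketch of an alternative way to reach the same shape on a small made-up dataframe, assuming a pandas version that provides `explode` (0.25 or later). The toy restaurant names and reviews are hypothetical; the `val_0` and `val_1` column names are chosen only to match the output of the chain above.

```python
import pandas as pd

# Toy stand-in for the Zomato dataframe (made-up data)
toy = pd.DataFrame({
    'name': ['Cafe A', 'Cafe B', 'Cafe C'],
    'reviews_list': [
        [('Rated 4.0', 'good'), ('Rated 3.0', 'ok')],
        [('Rated 5.0', 'great')],
        [],  # a restaurant with no reviews
    ],
})

# One row per (restaurant, review) pair; empty lists become NaN and are dropped
exploded = (toy[['name', 'reviews_list']]
            .explode('reviews_list')
            .dropna(subset=['reviews_list']))

# Split each (rating, review) tuple into two columns
unpacked = pd.DataFrame(exploded['reviews_list'].tolist(),
                        columns=['val_0', 'val_1'])
unpacked.insert(0, 'name', exploded['name'].to_numpy())

print(unpacked)
```

Exploding avoids building a wide intermediate dataframe with one column per review, which may matter when some restaurants have many more reviews than others.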
  
   


Saving a copy of the file with all reviews unpacked

In [0]:
df_reviews.to_csv('banglore_reviews.csv')


There is a lot of non-ASCII gibberish in the data that needs to be cleaned. The following function removes all non-ASCII characters by replacing them with an empty string.

In [0]:
import re

def replace_foreign_characters(s):
    # Drop every character outside the 7-bit ASCII range (0x00-0x7f)
    return re.sub(r'[^\x00-\x7f]', '', s)

df_reviews.val_1 = df_reviews.val_1.apply(replace_foreign_characters)
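
To see the effect of this pattern on a single string, here is a small sketch using a made-up sample. The helper name `strip_non_ascii` is hypothetical and simply reuses the same regular expression; the encode/decode variant is shown as an equivalent alternative.

```python
import re

def strip_non_ascii(s):
    # Same pattern as the cleaning function: remove anything outside 0x00-0x7f
    return re.sub(r'[^\x00-\x7f]', '', s)

# Made-up sample containing an accented letter and an en dash
sample = 'Caf\u00e9 \u2013 great food'
cleaned = strip_non_ascii(sample)

# Encoding to ASCII with errors ignored drops the same characters
alternative = sample.encode('ascii', 'ignore').decode('ascii')

print(repr(cleaned))
print(cleaned == alternative)
```

Note that removing characters rather than replacing them with a space can leave doubled spaces or glue neighbouring words together, so a whitespace clean-up pass may be worth adding afterwards.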

Let's extract the longest review in our data and perform some modern NLP on it.

In [0]:
# Integer position of the review with the most characters
np.argmax(df_reviews.val_1.str.len().to_numpy())
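
The lookup can be sketched on a tiny made-up Series to show how the position returned by `np.argmax` pairs with `.iloc`. The sample reviews below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up reviews, for illustration only
reviews = pd.Series(['ok', 'really great food', 'nice'])

lengths = reviews.str.len()                      # character count per review
position = int(np.argmax(lengths.to_numpy()))    # integer position, usable with .iloc

longest = reviews.iloc[position]
print(position, longest)
```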

Here is a snapshot of what the review looks like. As you can see, the customer really loves the ambiance and the variety of food.

In [0]:
sample_review = df_reviews.iloc[68248]['val_1']

Let's hand the review to spaCy and print the parsed document. It looks the same as the raw text! What happened under the hood?

What about sentence detection and segmentation?

In [178]:
parsed_review = nlp(sample_review)
print(parsed_review)

RATED
  I visited the place recently on its opening night. Truly amazed by the decor of the place. They have indoor as well as outdoor sitting area. Decorated with sparkling lights, the place was definitely giving a Christmas vibes.

Coming to the food and drinks, I tried food from the ala carte menu. Chef Swatantra and his team has done a fabulous job with the food.

The chakhna or better known as scotch nuts were my favourite. These babies were completely addictive. The best dish to much along with the cocktails.

We also tried the Mojito chicken pizza, the marinated chicken with olives is quite a filler.

In prawns we tried the Teppanyaki wasabi prawns, recommended by the chef himself and it was definitely worth mentioning here. You cannot miss this one. The light zing of the wasabi completely brightened the dish.

Patrani Machhi is another dish which liked a lot. Traditionally made in Parsi households, was presented in different way, they served it with Hyderabadi thiccha.



priya

In [185]:
for num, sentence in enumerate(parsed_review.sents, start=1):
    print('Sentence {}:'.format(num))
    print(sentence)
    print('')
  

  


Sentence 1:
RATED
  

Sentence 2:
I visited the place recently on its opening night.

Sentence 3:
Truly amazed by the decor of the place.

Sentence 4:
They have indoor as well as outdoor sitting area.

Sentence 5:
Decorated with sparkling lights, the place was definitely giving a Christmas vibes.



Sentence 6:
Coming to the food and drinks, I tried food from the ala carte menu.

Sentence 7:
Chef Swatantra and his team has done a fabulous job with the food.



Sentence 8:
The chakhna or better known as scotch nuts were my favourite.

Sentence 9:
These babies were completely addictive.

Sentence 10:
The best dish to much along with the cocktails.



Sentence 11:
We also tried the Mojito chicken pizza, the marinated chicken with olives is quite a filler.



Sentence 12:
In prawns we tried the Teppanyaki wasabi prawns, recommended by the chef himself and it was definitely worth mentioning here.

Sentence 13:
You cannot miss this one.

Sentence 14:
The light zing of the wasabi completely b

What about named entity detection?

In [190]:
for num, entity in enumerate(parsed_review.ents, start=1):
    print('Entity {}:'.format(num), entity, '-', entity.label_)
    print('')

# Highlight the detected entities inline in the notebook
displacy.render(parsed_review, style='ent', jupyter=True)

Entity 1: Christmas - DATE

Entity 2: Chef Swatantra - PRODUCT

Entity 3: Mojito - GPE

Entity 4: Teppanyaki - NORP

Entity 5: Patrani Machhi - PERSON

Entity 6: Parsi - GPE

Entity 7: Hyderabadi - GPE



What about part of speech tagging?

In [191]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(zip(token_text, token_pos),
             columns=['token_text', 'part_of_speech'])

Unnamed: 0,token_text,part_of_speech
0,RATED,PROPN
1,\n,SPACE
2,I,PRON
3,visited,VERB
4,the,DET
5,place,NOUN
6,recently,ADV
7,on,ADP
8,its,DET
9,opening,NOUN


What about text normalization, like lemmatization, and shape analysis?

In [192]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(zip(token_text, token_lemma, token_shape),
             columns=['token_text', 'token_lemma', 'token_shape'])

Unnamed: 0,token_text,token_lemma,token_shape
0,RATED,RATED,XXXX
1,\n,\n,\n
2,I,-PRON-,X
3,visited,visit,xxxx
4,the,the,xxx
5,place,place,xxxx
6,recently,recently,xxxx
7,on,on,xx
8,its,-PRON-,xxx
9,opening,opening,xxxx
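
The shape column can be read as a compressed picture of each token's spelling. Below is a rough pure-Python approximation that reproduces the shapes in the table above. `word_shape` is a hypothetical helper written for illustration, not spaCy's own implementation; it assumes letters map to `x` or `X`, digits map to `d`, other characters are kept as-is, and runs of more than four identical shape characters are truncated.

```python
def word_shape(text):
    """Approximate a token's shape: x/X for letters, d for digits, runs capped at 4."""
    shape = []
    last = ''
    run = 0
    for ch in text:
        if ch.isalpha():
            symbol = 'X' if ch.isupper() else 'x'
        elif ch.isdigit():
            symbol = 'd'
        else:
            symbol = ch  # punctuation and whitespace are kept unchanged
        # Count how many times the same shape symbol has repeated in a row
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= 4:
            shape.append(symbol)
    return ''.join(shape)

for word in ['RATED', 'I', 'visited', 'the', 'Bengaluru2019']:
    print(word, '->', word_shape(word))
```

Shapes like these are mostly useful as features: two unseen tokens with the same shape (for example, a capitalised word followed by digits) often play similar roles in a sentence.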


What about token-level entity analysis?

In [193]:

token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame(zip(token_text, token_entity_type, token_entity_iob),
             columns=['token_text', 'entity_type', 'inside_outside_begin'])

Unnamed: 0,token_text,entity_type,inside_outside_begin
0,RATED,,O
1,\n,,O
2,I,,O
3,visited,,O
4,the,,O
5,place,,O
6,recently,,O
7,on,,O
8,its,,O
9,opening,,O
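
The first ten tokens above are all outside any entity, so the B and I tags never appear. As a sketch of how the inside/outside/begin scheme encodes multi-token entities, the snippet below regroups hand-written tags into entity spans. The tags are typed in by hand to mirror the "Chef Swatantra - PRODUCT" entity reported earlier, and `spans_from_iob` is a hypothetical helper, not part of spaCy.

```python
def spans_from_iob(tokens, iob_tags, entity_types):
    """Group tokens into (entity_text, entity_type) spans using B/I/O tags."""
    spans = []
    current_tokens, current_type = [], None
    for token, iob, ent_type in zip(tokens, iob_tags, entity_types):
        if iob == 'B':
            # A new entity begins; close any entity that is still open
            if current_tokens:
                spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], ent_type
        elif iob == 'I' and current_tokens:
            # This token continues the currently open entity
            current_tokens.append(token)
        else:
            # 'O': outside any entity, so close the open one if there is one
            if current_tokens:
                spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((' '.join(current_tokens), current_type))
    return spans

# Hand-written tags for illustration
tokens = ['Chef', 'Swatantra', 'and', 'his', 'team']
iob_tags = ['B', 'I', 'O', 'O', 'O']
entity_types = ['PRODUCT', 'PRODUCT', '', '', '']

print(spans_from_iob(tokens, iob_tags, entity_types))
```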


What about a variety of other token-level attributes, such as the relative frequency of tokens, and whether or not a token matches any of these categories?

1. stopword
2. punctuation
3. whitespace
4. represents a number
5. is (or is not) included in spaCy's default vocabulary

In [194]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

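# Use a separate name so the restaurant dataframe `df` loaded earlier is not overwritten
token_df = pd.DataFrame(token_attributes,
                        columns=['text',
                                 'log_probability',
                                 'stop?',
                                 'punctuation?',
                                 'whitespace?',
                                 'number?',
                                 'out of vocab.?'])

# Show boolean flags as 'Yes' or blank for readability
flag_columns = ['stop?', 'punctuation?', 'whitespace?', 'number?', 'out of vocab.?']
token_df[flag_columns] = token_df[flag_columns].apply(
    lambda column: column.map(lambda x: 'Yes' if x else ''))

token_df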

Unnamed: 0,text,log_probability,stop?,punctuation?,whitespace?,number?,out of vocab.?
0,RATED,-20.0,,,,,Yes
1,\n,-20.0,,,Yes,,Yes
2,I,-20.0,Yes,,,,Yes
3,visited,-20.0,,,,,Yes
4,the,-20.0,Yes,,,,Yes
5,place,-20.0,,,,,Yes
6,recently,-20.0,,,,,Yes
7,on,-20.0,Yes,,,,Yes
8,its,-20.0,Yes,,,,Yes
9,opening,-20.0,,,,,Yes


If the text you'd like to process is general-purpose English language text (i.e., not domain-specific, like medical literature), spaCy is ready to use out-of-the-box.

I think it will eventually become a core part of the Python data science ecosystem — it will do for natural language computing what other great libraries have done for numerical computing.

## Phrase Modeling
Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect by random chance. The formula our phrase models will use to determine whether two tokens $A$ and $B$ constitute a phrase is:






$$\frac{count(A\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$


...where:

1. $count(A)$ is the number of times token $A$ appears in the corpus
2. $count(B)$ is the number of times token $B$ appears in the corpus
3. $count(A\ B)$ is the number of times the tokens $A\ B$ appear in the corpus in order
4. $N$ is the total size of the corpus vocabulary
5. $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
6. $threshold$ is a user-defined parameter to control how strong of a relationship between two tokens the model requires before accepting them as a phrase
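
To make the arithmetic concrete, here is a small worked sketch of the scoring formula in plain Python. The function name `phrase_score` and all of the counts are made up for illustration; this is not gensim's code, only a direct transcription of the formula above.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, count_min):
    """Score a candidate phrase "A B" using the formula above."""
    # Multiplying by N before dividing is algebraically the same formula
    return (count_ab - count_min) * vocab_size / (count_a * count_b)

# Made-up counts and parameters, for illustration only
vocab_size, count_min, threshold = 1000, 5, 10.0

# A pair that co-occurs often relative to how common each word is
strong = phrase_score(count_a=50, count_b=40, count_ab=30,
                      vocab_size=vocab_size, count_min=count_min)

# The same two words, but they rarely appear next to each other
weak = phrase_score(count_a=50, count_b=40, count_ab=6,
                    vocab_size=vocab_size, count_min=count_min)

print(strong, strong > threshold)   # 12.5 True
print(weak, weak > threshold)       # 0.5 False
```

Subtracting $count_{min}$ in the numerator means a pair seen fewer than $count_{min}$ times gets a negative score and can never clear a positive threshold.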

Once our phrase model has been trained on our corpus, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.

Phrase modeling is superficially similar to named entity detection in that you would expect named entities to become phrases in the model (so new york would become new_york). But you would also expect multi-word expressions that represent common concepts, but aren't specifically named entities (such as happy hour), to become phrases in the model as well.
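
The merging step can be sketched in a few lines of plain Python. Here the set of accepted phrases is written by hand rather than learned, and `merge_phrases` is a hypothetical helper for illustration; a trained phrase model would supply the accepted pairs itself.

```python
def merge_phrases(tokens, phrases, delimiter='_'):
    """Join adjacent token pairs found in `phrases` into single tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            # Accepted phrase: emit one joined token and skip both words
            merged.append(delimiter.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hand-picked phrases standing in for what a trained model would accept
accepted = {('new', 'york'), ('happy', 'hour')}

sentence = ['we', 'love', 'happy', 'hour', 'in', 'new', 'york']
print(merge_phrases(sentence, accepted))
```

Because merged tokens such as new_york are ordinary tokens afterwards, running a second phrase model over the merged text can pick up longer expressions built from them.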

We turn to the indispensable gensim library to help us with phrase modeling.

In [0]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

As we're performing phrase modeling, we'll be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

1. Segment text of complete reviews into sentences & normalize text
2. First-order phrase modeling $\rightarrow$ apply first-order phrase model to transform sentences
3. Second-order phrase modeling $\rightarrow$ apply second-order phrase model to transform sentences
4. Apply text normalization and second-order phrase model to text of complete reviews

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the lemmatized_sentence_corpus generator function will use spaCy to:

1. Iterate over the 1M reviews 
2. Segment the reviews into individual sentences
3. Remove punctuation and excess whitespace
4. Lemmatize the text

... and do so efficiently in parallel, thanks to spaCy's nlp.pipe() function.