# NLP in Python


1. The dataset 
2. Text processing with spaCy
3. Automatic phrase modeling
4. Topic modeling with LDA
5. Visualizing topic models with pyLDAvis

# The Dataset

The data is a set of TripAdvisor UK restaurant reviews, available on Kaggle: https://www.kaggle.com/residentmario/exploring-tripadvisor-uk-restaurant-reviews/data



# spaCy


spaCy is an industrial-strength natural language processing (NLP) library for Python. spaCy's goal is to take recent advancements in natural language processing out of research papers and put them in the hands of users to build production software.

spaCy handles many tasks commonly associated with building an end-to-end natural language processing pipeline:

Tokenization <br/>
Text normalization, such as lowercasing and lemmatization<br/>
Part-of-speech tagging<br/>
Syntactic dependency parsing<br/>
Sentence boundary detection<br/>
Named entity recognition and annotation<br/>

In [1]:
import pandas as pd
import numpy as np
import spacy
import itertools as it

import os
import codecs

In [2]:
#default english model

nlp = spacy.load('en_core_web_sm')

In [8]:
#read in data

data = pd.read_csv('/Users/victoriacabales/Documents/data_science/restaurant_reviews', encoding='utf-8')

In [9]:
data

Unnamed: 0,uniq_id,url,restaurant_id,restaurant_location,name,category,title,review_date,review_text,author,author_url,location,rating,food,value,service,visited_on
0,76b689f7956f0dc70bda40450e0bae1c,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d1988022,London,Balans Soho Society,Review of restaurants,“Breakfast at balans a must ! ”,2015-07-28,Fantastic as usual with friendly service ... W...,richard5003,https://www.tripadvisor.co.uk/members/richard5003,,5 of 5 bubbles,,,,June 2015
1,0712b79accc38942797b4ca27eabf98d,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d5600132,London,Duke of York,Review of restaurants,“Ok for a quick pint”,2014-06-17,"Average pub, where you can be serve quickly. T...",pierre l,https://www.tripadvisor.co.uk/members/714pierrel,"London, United Kingdom",3 of 5 bubbles,3 of 5 bubbles,4 of 5 bubbles,3 of 5 bubbles,May 2014
2,eb50da7d57e9cd6cdb84220cf09dc70b,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d2257005,London,Dawat Restaurant,Review of restaurants,“Good Pakistani food”,2016-12-19,Personally my favourite Pakistani restaurant i...,Ehsiii,https://www.tripadvisor.co.uk/members/Ehsiii,,4 of 5 bubbles,,,,December 2016
3,b29e2213a0f62d29ef9bfd3aeb2b18ef,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d1123250,London,Gourmet Burger Kitchen Ealing,Review of restaurants,“Great Burgers”,2016-02-18,For a while now I’ve been dreaming of that Bur...,Gary B,https://www.tripadvisor.co.uk/members/252garyb,"London, England, United Kingdom",4 of 5 bubbles,4 of 5 bubbles,4 of 5 bubbles,4 of 5 bubbles,February 2016
4,994256feb6c70231ebf5ae0ec919473c,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d2689314,London,Lingo,Review of restaurants,“Good Japanese Food in Soho”,2014-12-22,It's close to Regent Street and its tucked awa...,benbecks23,https://www.tripadvisor.co.uk/members/benbecks23,"Singapore, Singapore",4 of 5 bubbles,4 of 5 bubbles,4 of 5 bubbles,5 of 5 bubbles,December 2014
5,0629fbfa019ea01d76335c7dd49e63ba,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d3421476,London,Tonkotsu,Review of restaurants,“Good Ramen”,2013-11-22,"Yes, you have to wait a while to get in, yes t...",AdamLaverty,https://www.tripadvisor.co.uk/members/AdamLaverty,Ireland,4 of 5 bubbles,4 of 5 bubbles,4 of 5 bubbles,2 of 5 bubbles,October 2013
6,c58a11c8173ab6d66cb677e9cbeab724,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d7289150,London,Dirty Martini Monument,Review of restaurants,“Leaving party”,2016-12-17,Leaving party for a friend. We were shoved int...,banner1234,https://www.tripadvisor.co.uk/members/banner1234,"London, United Kingdom",2 of 5 bubbles,,,,December 2016
7,daba2742898770cd2c3cb5c15c117a1d,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d3421476,London,Tonkotsu,Review of restaurants,“Food drained of all life”,2015-08-05,Cramped and somehow joyless interior provides ...,m m,https://www.tripadvisor.co.uk/members/913mm,"London, United Kingdom",1 of 5 bubbles,1 of 5 bubbles,1 of 5 bubbles,2 of 5 bubbles,February 2015
8,aaca771d2dada866d5c743d825686b7b,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d1382024,London,The Boundary Restaurant,Review of restaurants,“Slick and tasty”,2014-05-26,"Loved the look, very slick and clean and brigh...",Dufftez,https://www.tripadvisor.co.uk/members/Dufftez,"London, United Kingdom",5 of 5 bubbles,,,,
9,7b34e4e400460a948fc802a30adb71cd,https://www.tripadvisor.co.uk/ShowUserReviews-...,g186338-d1123250,London,Gourmet Burger Kitchen Ealing,Review of restaurants,“Reliable tasty food”,2013-05-03,Always enjoy a visit to GBK. Tasty good qualit...,Leigh B,https://www.tripadvisor.co.uk/members/leighb668,"London, United Kingdom",4 of 5 bubbles,,,,April 2013


In [10]:
#take review_text field

fields = ["review_text"]
data = pd.read_csv('/Users/victoriacabales/Documents/data_science/restaurant_reviews', encoding='utf-8', na_values=['NA'], usecols = fields)

In [11]:
# concatenate the first nine reviews (rows 0-8) into one string

print(data['review_text'][0:9].str.cat(sep=' '))

Fantastic as usual with friendly service ... We always head here after nursing the usual high cost city hangover - but don't worry at Ballans you will never feel ripped off ! 3 types of eggs to choose from all cooked to perfection with a lovely sausage , toast , beans and small pieces of hash browns take a beer on the side and enjoy your trip to balans don't even look another way in the Westfield this is THE place to go !!!! Come on Average pub, where you can be serve quickly. Th only down side will be the crowd around the some of the staff. You do't fell very welcome there Personally my favourite Pakistani restaurant in tooting. Good food, good meat quality and a reasonable price. For a while now I’ve been dreaming of that Burger to die for, few places do the Burger that makes you think, should I or shouldn’t I, as it looks so bad but tastes so damn good. Well GBK has this tied down to a Tee, they make Burgers that not only tastes great but look amazing. To have your Burger cooked to 

In [12]:
# replace this with your own file path

reviews_path = '/Users/victoriacabales/Documents/data_science/restaurant_reviews/sample_reviews.txt'

In [13]:
# join the first 500 reviews (rows 0-499) into one string

sample_reviews = data['review_text'][0:500].str.cat(sep=' ')

In [14]:
#print string

sample_reviews

'Fantastic as usual with friendly service ... We always head here after nursing the usual high cost city hangover - but don\'t worry at Ballans you will never feel ripped off ! 3 types of eggs to choose from all cooked to perfection with a lovely sausage , toast , beans and small pieces of hash browns take a beer on the side and enjoy your trip to balans don\'t even look another way in the Westfield this is THE place to go !!!! Come on Average pub, where you can be serve quickly. Th only down side will be the crowd around the some of the staff. You do\'t fell very welcome there Personally my favourite Pakistani restaurant in tooting. Good food, good meat quality and a reasonable price. For a while now I’ve been dreaming of that Burger to die for, few places do the Burger that makes you think, should I or shouldn’t I, as it looks so bad but tastes so damn good. Well GBK has this tied down to a Tee, they make Burgers that not only tastes great but look amazing. To have your Burger cooked

In [15]:
# write the sample reviews to reviews_path; the file is closed automatically
with open(reviews_path, "w", encoding="utf-8") as text_file:
    text_file.write(sample_reviews)

Hand these reviews to spaCy, and be prepared to wait...

In [16]:
#parse and tag

parsed_reviews = nlp(sample_reviews)

In [17]:
print(parsed_reviews)

Fantastic as usual with friendly service ... We always head here after nursing the usual high cost city hangover - but don't worry at Ballans you will never feel ripped off ! 3 types of eggs to choose from all cooked to perfection with a lovely sausage , toast , beans and small pieces of hash browns take a beer on the side and enjoy your trip to balans don't even look another way in the Westfield this is THE place to go !!!! Come on Average pub, where you can be serve quickly. Th only down side will be the crowd around the some of the staff. You do't fell very welcome there Personally my favourite Pakistani restaurant in tooting. Good food, good meat quality and a reasonable price. For a while now I’ve been dreaming of that Burger to die for, few places do the Burger that makes you think, should I or shouldn’t I, as it looks so bad but tastes so damn good. Well GBK has this tied down to a Tee, they make Burgers that not only tastes great but look amazing. To have your Burger cooked to 

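Looks the same, because printing a parsed document just shows its original text. So what did spaCy add? The cells below inspect the annotations.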

In [18]:
# sentence detection

for num, sentence in enumerate(parsed_reviews.sents):
    print('Sentence {}:'.format(num + 1))
    print(sentence)
    print('')

Sentence 1:
Fantastic as usual with friendly service ...

Sentence 2:
We always head here after nursing the usual high cost city hangover - but don't worry at Ballans you will never feel ripped off !

Sentence 3:
3 types of eggs to choose from all cooked to perfection with a lovely sausage , toast , beans and small pieces of hash browns take a beer on the side and enjoy your trip to balans don't even look another way in the Westfield

Sentence 4:
this is THE place to go !!!!

Sentence 5:
Come on Average pub, where you can be serve quickly.

Sentence 6:
Th only down side will be the crowd around the some of the staff.

Sentence 7:
You do't fell very welcome there

Sentence 8:
Personally my favourite Pakistani restaurant in tooting.

Sentence 9:
Good food, good meat quality and a reasonable price.

Sentence 10:
For a while now I’ve been dreaming of that Burger to die for, few places do the Burger that makes you think, should I or shouldn’t I, as it looks so bad but tastes so damn good.




Sentence 127:
What a delight the Boundary is!

Sentence 128:
Friendly, warm, human service from the earring team, made better by the beautiful food and excellent drinks.

Sentence 129:
I enjoyed a crab dish to start, followed by Autumn Truffle Panisse with Kale & Chestnut Mushroom Veloute and then the perfectly sweet dessert.

Sentence 130:
To drink was the Moscow Mule inspired Little Moscow - just perfect.

Sentence 131:
The dining area, two floors down, is well maintained not too sumptuous and created the expected ambiance.

Sentence 132:
Toilets, one floor up, clean with luxury hand products.

Sentence 133:
Perfect for vegetarians, carnivores and those partial to a vodka or gin inspired cocktail.

Sentence 134:
Awesome venue with outstanding decor.

Sentence 135:
The attention to detail is outstanding with Victorian-esque pen & ink drawings of facebook cat pictures, hot air balloon delivery services and other silly things which gives the place a truely quirky vibe without taking it

Sentence 462:
Good cause maybe but when you already pay a good monthly donation to a charity I feel this was a bit cheeky.

Sentence 463:
- And finally, a 12.5% service charge.

Sentence 464:
It seems that we're following in the footsteps of America by paying a service charge everywhere now.

Sentence 465:
But if I remember correctly, waiters and chefs in America are on a much lower salary as part of their salary comes from tips.

Sentence 466:
But in England I can't imagine they would even be near min wage which denotes that service charge is a great bonus to their pay.

Sentence 467:
Adding this to the bill makes me feel awkward to have it removed if I feel the service was standard or just good which I think is the min standard a customer should expect.

Sentence 468:
I guess I ended up paying for those free oysters ;) I have visited this place due to perfect experience of other.

Sentence 469:
Unfortunately I was far from to be impressed really.

Sentence 470:
I only went for sushi,


Sentence 578:
Decent service, decent food, nothing special - what you'd expect from a chain I guess.

Sentence 579:
Bottle of Peroni at £6.50 was a bit over the top though, that's taking advantage....

Sentence 580:
Pleasant atmosphere, but there are better places to eat locally...

Sentence 581:
Awesome restaurant, great food

Sentence 582:
, the ambience is wonderful, the staff are friendly and very accommodating (they made my husband's green curry with extra spice)

Sentence 583:
the menu is expansive, my favorite is the grilled chicken with Thai green curry fried rice, which tastes spectacular.

Sentence 584:
The seating can be long tables where you'll get to meet other diners or you can request a table just for your party.

Sentence 585:
Overall great food, good customer service, nice atmosphere, highly recommended!

Sentence 586:
A friend of mine with coeliac's was visiting over a weekend

Sentence 587:
so we decided to head out in search of gluten free fish & chips, and I found

Sentence 918:
Easy access to the Marriott next door.

Sentence 919:
Read the menu and you'll learn that this is part of a chain and if you log onto their website you can claim 5 pounds off your next visit - which might come later in the day as they have several pubs.

Sentence 920:
You will have to climb lots of stairs to use the Lou so plan your day.

Sentence 921:
I was here with colleagues from work and guests from overseas, for a quick dinner following a business meeting nearby.

Sentence 922:
We made a reservation for 6 people and when arriving in the early evening, we were shown to the table.

Sentence 923:
This pizza express branch is conveniently located on Southampton Row and the inside is similar to other branches, with a cheerful brightly lit dining room, nicely decorated bar and overall a pleasant, friendly ambience.

Sentence 924:
We ordered pizza, salad, and drinks.

Sentence 925:
The wine (a bottle of Gavi for £22) was of very good quality and worth the expenses.

Senten


Sentence 1297:
Alain the GM greeted us the moment we arrived and was kind enough to recommend a fantastic tasting menu with wine.

Sentence 1298:
Some of the best food and service we have ever received in London and a place we will definitely return to Good service Amazing food.

Sentence 1299:
Great service.

Sentence 1300:
More than you would expect in a pizza place.

Sentence 1301:
A tiny bit pricy.

Sentence 1302:
JUST TRY PUTTING YOUR KNIFE AND FORK DOWN!

Sentence 1303:
from the fantastic smiling and warm waitresses and the delicious beautifully prepared food-this is a winner in every way.the somellier was helpful and knowledgeable and the chef was charm personified.our friends had been two or three times before and were spot on in their choice.not one bad dish between 5 of us.the amuse guile was as good as I have eaten anywhere.we will be back

Sentence 1304:
A party of 8 (adults, teenagers and kids)

Sentence 1305:
, we were accommodated in this small bistro quite late and at 


Sentence 1467:
We had a nice late lunch here.

Sentence 1468:
The pizza was very good.

Sentence 1469:
Two of us had the Calzones.

Sentence 1470:
They were delicious.

Sentence 1471:
Very accompanying staff as we went off menu a little.

Sentence 1472:
Normal pizza express standard food which I've always found to be very good

Sentence 1473:
Went there for dinner with my boyfriend and we couldn't believe how bad the restaurant.

Sentence 1474:
was.

Sentence 1475:
First the interiors are quite cold and IKEA looking, but we where looking forward to trying the food.

Sentence 1476:
We order cocktails K.O.sake from the very promising menu they came in a quite small glass with ice so not much of the drink there

Sentence 1477:
and we couldn't taste any sake or gin

Sentence 1478:
all we could taste was alcohol and lemon for £12.00 each?We order the aubergine dish and the risotto as starters, aubergine was well presented and very nice,

Sentence 1479:
the risotto..

Sentence 1480:
Risotto

Sentence 1800:
Went there for a bite to eat with a friend, nice decor and cool vibes.

Sentence 1801:
The food I must say is good but way overprice for what it is and for that reason would not return there.

Sentence 1802:
You would have more food and pay less in a middle eastern restaurant in Mayfair (shepherds market) or even in Kensington).

Sentence 1803:
If you want quick tasty imaginative vegetarian food with no fuss, then mildreds is the place to go the food is filling and tasty.

Sentence 1804:
And the layout is cute a bit tight.

Sentence 1805:
The juices are small for the price.

Sentence 1806:
The staff are friendly.0 Me and my fiancé stayed over at an hotel nearby and decided to have our first pub experience.

Sentence 1807:
We went to the second floor but not without waiting for 10 minutes for someone to tell us that that's where we were supposed to have dinner

Sentence 1808:
(I thought we were waiting for a table to vacate in the first floor).

Sentence 1809:
We waited a

The quality and range of dishes are better there and there is also more privacy and a better atmosphere in the restaurant.

Sentence 1962:
Rude staff, rubbish food!

Sentence 1963:
They can't be bothered.

Sentence 1964:
I'm sticking to the railway tavern or the earl haig!

Sentence 1965:
Having been to a number of Michelin star restaurants across London and paid £100 plus it was refreshing to go to a restaurant where the was an 8 course tasting menu for £50.

Sentence 1966:
There was no catch, the food is better than nearby Michelin starred Lyles and Clove Club.

Sentence 1967:
We went on a Tuesday evening and the restaurant was moderately busy.

Sentence 1968:
The waitstaff were excellent and there was a relaxed vibe (potentially some of the previous issues mentioned below have been ironed

Sentence 1969:
out).Each of the 8 courses were excellent.

Sentence 1970:
Normally I would expect a couple of the courses to disappoint but not here!

Sentence 1971:
The squid, lamb and game tea c


Sentence 2209:
The starters are tasty and I'd recommend the mixed shish Kabab.

Sentence 2210:
But book a table, it can get busy

Sentence 2211:
We had summer rolls n mango salad as starter, refreshingly light n tasty; the beef n chicken pho, the soup is rich n not too salty; the pot tea is natural with lemongrass; the home made ice cream is heavenly, the lychee, vietnamese coffee are really special.

Sentence 2212:
The service is friendly n efficient, worth coming back anytime.

Sentence 2213:
We have been here 3 times now, and we feel that the food is getting better, which is always something we really like (some restaurants we feel precisely the other way).

Sentence 2214:
We have tried some Vietnamese in London and this seems to be one of the good ones.

Sentence 2215:
Walked down Exmouth Market looking for somewhere to grab lunch.

Sentence 2216:
Just passing Berber Q and a man came out and said 'go in there it is amazing'.

Sentence 2217:
So I did.

Sentence 2218:
And he was abs


Sentence 2299:
We waited another 20 minutes before receiving the drinks and when we did the boy was confused as to what he had brought and why - the entire order had been slightly mixed up! eventually probably within the next ten minutes

Sentence 2300:
we were abel to give our food order, anxious as to choosing starters as we didn't want to be all night (UGH! not another DH1 disaster!)Some of the food that we chose was a dish of cutlets, another had fillet of lamb and some fries as opposed to rice ( rice had been a little boring last meal).By a little after 9.30pm

Sentence 2301:
, we received the order of food.two of my cutlets were rather large and two small.

Sentence 2302:
A little dry and slightly over cooked as compared to that of last visit when my husband had chosen them.

Sentence 2303:
However my friend, who loves her food well cooked thought they were great!The salad was little tired and had a splash on one side of the plate of something that was white but tasted of very l

I have eaten at this restaurant previously and enjoyed both food and service, but last weekend whilst staying with friends we opted to call for a takeaway.

Sentence 2648:
Being a creature of boring habit, I tend to order the same dish whilst eating within the restaurant.

Sentence 2649:
When we unpacked the food back at their home, I discovered that they had decided to exercise a little portion control.

Sentence 2650:
So what do you think - is £18 decent value for a portion of rice, a small salad and 4 small lamb cutlets?

Sentence 2651:
Certainly for them, but in the longer term, I'd be very hesitant to hand over money to them again.

Sentence 2652:
I visited this restaurant on reccommadation and we were not disappointed with our meal.

Sentence 2653:
We had booked a table but we were late arriving and this was not a problem despite being busy

Sentence 2654:
we were allocated a table.

Sentence 2655:
Service was good and the food was excellent.

Sentence 2656:
We had kalamari for s


Sentence 2785:
Worth the money for the experience.

Sentence 2786:
Visited on the basis of wide tapas menu and huge G&Ts - and was not disappointed at all!

Sentence 2787:
Excellent tapas selection with staff happy to advise and happy for diners to order one or two at a time; double gins (including my favorite, Martin Miller) with matched tonics and aromatics served in huge, stemmed balloon glasses.

Sentence 2788:
Cosy and atmospheric.

Sentence 2789:
No reservations which suited us fine as we did not want to be constrained by a reservation but were able to get a table immediately on a Friday evening.

Sentence 2790:
Visited for some food and drinks.

Sentence 2791:
Staff all helpful and very friendly.

Sentence 2792:
Managed to sit out in the sun.

Sentence 2793:
Food lovely, decent portions and lots of choice.

Sentence 2794:
Wide selection of different beverages.

Sentence 2795:
Nice garden to sit outside.

Sentence 2796:
Will def return next time we're in the area.

Sentence 2797

In [19]:
# entity detection
# https://spacy.io/usage/linguistic-features


for num, entity in enumerate(parsed_reviews.ents):
    print('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print('')

Entity 1: Ballans - NORP

Entity 2: Westfield - GPE

Entity 3: Pakistani - NORP

Entity 4: Burger - PERSON

Entity 5: Burger - ORG

Entity 6: GBK - PERSON

Entity 7: Burger - ORG

Entity 8: the Avocado Bacon Burger - ORG

Entity 9: Cheese - NORP

Entity 10: Ealing - ORG

Entity 11: Regent Street - FAC

Entity 12: japanese - NORP

Entity 13: happy hour - TIME

Entity 14: Tonkotsu Ramen - ORG

Entity 15: Tasty - ORG

Entity 16: Don - PERSON

Entity 17: summer - DATE

Entity 18: Vietnamese - NORP

Entity 19: BSS - ORG

Entity 20: one - CARDINAL

Entity 21: point!We - ORG

Entity 22: 10 - CARDINAL

Entity 23: Jacobs - ORG

Entity 24: Acton - GPE

Entity 25: Amsterdam - GPE

Entity 26: Paris - GPE

Entity 27: Sicily - GPE

Entity 28: English - LANGUAGE

Entity 29: Italian - NORP

Entity 30: Finchley - PERSON

Entity 31: Ballards Lane - PERSON

Entity 32: five - CARDINAL

Entity 33: six - CARDINAL

Entity 34: 5 minutes - TIME

Entity 35: 5 - CARDINAL

Entity 36: one - CARDINAL

Entity 37: Su

Entity 352: 10 - MONEY

Entity 353: 2 and a half - CARDINAL

Entity 354: Busaba - GPE

Entity 355: Busaba - GPE

Entity 356: a bad day - DATE

Entity 357: Friday - DATE

Entity 358: Loved - NORP

Entity 359: Westfields - PERSON

Entity 360: one night - TIME

Entity 361: Rice - PERSON

Entity 362: about 5 minutes - TIME

Entity 363: around 5 - CARDINAL

Entity 364: Visited - PERSON

Entity 365: Pad Thai - NORP

Entity 366: Westfield - GPE

Entity 367: Thai - NORP

Entity 368: Triscott Arms - ORG

Entity 369: 5 - MONEY

Entity 370: Wormwood - ORG

Entity 371: three - CARDINAL

Entity 372: Negroni - PERSON

Entity 373: one - CARDINAL

Entity 374: Chris - PERSON

Entity 375: three - CARDINAL

Entity 376: the Foie Gras - PERSON

Entity 377: Croquetas - ORG

Entity 378: Tomate - ORG

Entity 379: Croquetas - ORG

Entity 380: Espresso Martini - PERSON

Entity 381: Chris - PERSON

Entity 382: 100 year old - DATE

Entity 383: Sherry - PERSON

Entity 384: first - ORDINAL

Entity 385: Busaba - GPE


Entity 656: 4/5 - CARDINAL

Entity 657: PE - ORG

Entity 658: Pizza Express - ORG

Entity 659: Maida Vale's - PERSON

Entity 660: today - DATE

Entity 661: 4 - CARDINAL

Entity 662: 1 - CARDINAL

Entity 663: 6 - CARDINAL

Entity 664: Thai - NORP

Entity 665: Thai - NORP

Entity 666: Friday evening - TIME

Entity 667: about 8.45 pm - TIME

Entity 668: seconds - TIME

Entity 669: wrong??I - ORG

Entity 670: Asian - NORP

Entity 671: 14.20.I - MONEY

Entity 672: Three - CARDINAL

Entity 673: Roast Lamb - PERSON

Entity 674: 3/4 - CARDINAL

Entity 675: Lamd - PERSON

Entity 676: Veg - PERSON

Entity 677: Great Beer - FAC

Entity 678: 7.2% - PERCENT

Entity 679: Nicola - ORG

Entity 680: Nice - GPE

Entity 681: Roman - ORG

Entity 682: Muxima - PERSON

Entity 683: Roman Road - FAC

Entity 684: Bow - PERSON

Entity 685: Last minute - TIME

Entity 686: Saturday night - TIME

Entity 687: two - CARDINAL

Entity 688: 8:30pm - PERSON

Entity 689: two - CARDINAL

Entity 690: USP - ORG

Entity 691


Entity 949: 2 - CARDINAL

Entity 950: 17 - MONEY

Entity 951: Kervan - PERSON

Entity 952: the end of the night - TIME

Entity 953: EARL GREY TEA' - ORG

Entity 954: Earl Grey - PERSON

Entity 955: a few minutes - TIME

Entity 956: 2 - CARDINAL

Entity 957: the last evening - TIME

Entity 958: 4 day - DATE

Entity 959: Kensington - GPE

Entity 960: the previous 3 nights - DATE

Entity 961: 18-02-17 - CARDINAL

Entity 962: London - GPE

Entity 963: Sauvignon Blanc - PERSON

Entity 964: Winchmore Hill!!Grilled - ORG

Entity 965: Vietnamese - NORP

Entity 966: the summer - DATE

Entity 967: a few years ago - DATE

Entity 968: 4 - CARDINAL

Entity 969: Dessert - PERSON

Entity 970: Monday - DATE

Entity 971: One - CARDINAL

Entity 972: Monday 20th February - DATE

Entity 973: Deep - PERSON

Entity 974: & Spring - ORG

Entity 975: Mango - GPE

Entity 976: Street Food - WORK_OF_ART

Entity 977: Bangkok - GPE

Entity 978: Mangeress - LOC

Entity 979: Thailand - GPE

Entity 980: first - ORDIN

Entity 1227: first - ORDINAL

Entity 1228: Sunday - DATE

Entity 1229: years ago - DATE

Entity 1230: Friday - DATE

Entity 1231: a few weeks - DATE

Entity 1232: one - CARDINAL

Entity 1233: Samdan - GPE

Entity 1234: Kabab - PERSON

Entity 1235: summer - DATE

Entity 1236: vietnamese - NORP

Entity 1237: 3 - CARDINAL

Entity 1238: Vietnamese - NORP

Entity 1239: London - GPE

Entity 1240: one - CARDINAL

Entity 1241: Berber Q - ORG

Entity 1242: six - CARDINAL

Entity 1243: Pimlico - PERSON

Entity 1244: My Dad - PERSON

Entity 1245: North - LOC

Entity 1246: a million - CARDINAL

Entity 1247: London - GPE

Entity 1248: Vietnamese - NORP

Entity 1249: Chinese - NORP

Entity 1250: 2.10 - MONEY

Entity 1251: 9pm evening - TIME

Entity 1252: two - CARDINAL

Entity 1253: Bill - PERSON

Entity 1254: 15 - MONEY

Entity 1255: Noooorth - LOC

Entity 1256: last week - DATE

Entity 1257: 3.5hour - CARDINAL

Entity 1258: three - CARDINAL

Entity 1259: at least three - CARDINAL

Entity 1260: Ita

Entity 1508: Rib - LOC

Entity 1509: One - CARDINAL

Entity 1510: the Main Maitre - ORG

Entity 1511: every penny - MONEY

Entity 1512: The Sunday Roast at Hawksmoor - WORK_OF_ART

Entity 1513: one - CARDINAL

Entity 1514: Sunday - DATE

Entity 1515: Beef - NORP

Entity 1516: Yorkshire Pudding - PERSON

Entity 1517: Sunday - DATE

Entity 1518: Lunch - NORP

Entity 1519: Borys and Lukasz - WORK_OF_ART

Entity 1520: one - CARDINAL

Entity 1521: one - CARDINAL

Entity 1522: one - CARDINAL

Entity 1523: event!!Will - PERSON

Entity 1524: 3 - CARDINAL

Entity 1525: last Friday - DATE

Entity 1526: the night - TIME

Entity 1527: two - CARDINAL

Entity 1528: 2 - CARDINAL

Entity 1529: 5 - CARDINAL

Entity 1530: Ivett - PERSON

Entity 1531: Monday evening - TIME

Entity 1532: the night - TIME

Entity 1533: one - CARDINAL

Entity 1534: that evening - TIME

Entity 1535: one - CARDINAL

Entity 1536: two - CARDINAL

Entity 1537: Martin Miller - PERSON

Entity 1538: Friday - DATE

Entity 1539: Sund

In [20]:
# part of speech tagging

token_text = [token.orth_ for token in parsed_reviews]
token_pos = [token.pos_ for token in parsed_reviews]

pd.DataFrame(list(zip(token_text, token_pos)),
             columns=['token_text', 'part_of_speech'])

|   | token_text | part_of_speech |
|---|---|---|
| 0 | Fantastic | ADJ |
| 1 | as | ADP |
| 2 | usual | ADJ |
| 3 | with | ADP |
| 4 | friendly | ADJ |
| 5 | service | NOUN |
| 6 | ... | PUNCT |
| 7 | We | PRON |
| 8 | always | ADV |
| 9 | head | VERB |


In [21]:
# normalization
# lemmatization, shape analysis


token_lemma = [token.lemma_ for token in parsed_reviews]
token_shape = [token.shape_ for token in parsed_reviews]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
             columns=['token_text', 'token_lemma', 'token_shape'])

|   | token_text | token_lemma | token_shape |
|---|---|---|---|
| 0 | Fantastic | fantastic | Xxxxx |
| 1 | as | as | xx |
| 2 | usual | usual | xxxx |
| 3 | with | with | xxxx |
| 4 | friendly | friendly | xxxx |
| 5 | service | service | xxxx |
| 6 | ... | ... | ... |
| 7 | We | -PRON- | Xx |
| 8 | always | always | xxxx |
| 9 | head | head | xxxx |
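The shape column can look cryptic at first. The following is a pure-Python sketch of the idea, not spaCy's own implementation: uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, other characters are kept, and runs of the same shape character are cut off after four. The function name `word_shape` and the `max_run` parameter are assumptions made for this illustration.

```python
def word_shape(text, max_run=4):
    """Approximate a token's orthographic shape (illustrative sketch only)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # punctuation and symbols are kept as they are
            shape_char = char
        # count how long the current run of identical shape characters is
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Fantastic", "friendly", "We", "..."]:
    print(word, word_shape(word))
```

For the four words in the loop, the sketch gives the same shapes as the table above. Shapes are handy as features because they group tokens such as prices, dates, or capitalized names without looking at the exact characters.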


#### What about other token-level attributes?
* relative frequency of tokens
* whether a token is a stopword, punctuation, whitespace, or number-like, and whether it is missing from spaCy's default vocabulary

In [22]:
# token attributes

token_attributes = [(token.orth_,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_reviews]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

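|   | text | stop? | punctuation? | whitespace? | number? | out of vocab.? |
|---|---|---|---|---|---|---|
| 0 | Fantastic |  |  |  |  | Yes |
| 1 | as | Yes |  |  |  | Yes |
| 2 | usual |  |  |  |  | Yes |
| 3 | with | Yes |  |  |  | Yes |
| 4 | friendly |  |  |  |  | Yes |
| 5 | service |  |  |  |  | Yes |
| 6 | ... |  | Yes |  |  | Yes |
| 7 | We |  |  |  |  | Yes |
| 8 | always | Yes |  |  |  | Yes |
| 9 | head |  |  |  |  | Yes |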
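The table covers the boolean flags, but not the relative frequency of tokens mentioned in the list above. spaCy exposes a corpus-based log-probability estimate as `token.prob` when the loaded model provides that data. As a rough sketch of the underlying idea, assuming we simply count within our own text instead, relative frequency can be computed with the standard library. The token list below is one lowercased sentence from the reviews, and the variable names are hypothetical.

```python
import math
from collections import Counter

# one review sentence, lowercased and stripped of punctuation by hand
tokens = "good food good meat quality and a reasonable price".split()

counts = Counter(tokens)
total = sum(counts.values())

# relative frequency: share of all tokens taken up by each word type
rel_freq = {tok: n / total for tok, n in counts.items()}

# log-probabilities are easier to compare when frequencies get very small
log_prob = {tok: math.log(p) for tok, p in rel_freq.items()}

for tok, p in sorted(rel_freq.items(), key=lambda item: -item[1])[:3]:
    print(tok, round(p, 3), round(log_prob[tok], 3))
```

Counting inside a 500-review sample says little about how common a word is in English generally, which is why library estimates are derived from much larger corpora.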


# Phrase Modeling


Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect by random chance. Our phrase models use a scoring formula to decide whether two tokens $A$ and $B$ constitute a phrase: it compares the number of times $A$ and $B$ appear in order against the number of times each token appears on its own, scaled by the size of the corpus vocabulary.
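For reference, and assuming gensim's default scorer for the Phrases class is used, the score for a candidate pair can be written as follows. The symbols $count_{min}$, $N$ and $threshold$ are named here for illustration: $count_{min}$ corresponds to the `min_count` parameter, $N$ is the number of distinct tokens in the corpus vocabulary, and $threshold$ is the `threshold` parameter.

$$\frac{count(A\ B) - count_{min}}{count(A) \times count(B)} \times N > threshold$$

Here $count(A)$ and $count(B)$ are the number of times each token appears in the corpus, and $count(A\ B)$ is the number of times $A$ is immediately followed by $B$. Raising $threshold$ yields fewer, more reliable phrases; subtracting $count_{min}$ keeps very rare pairs from scoring highly just because both tokens are rare.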


Once our phrase model has been trained, we can apply it to new text. When our model encounters two tokens in new text that it identifies as a phrase, it will merge the two into a single new token.
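To make the train-then-apply idea concrete, here is a minimal pure-Python sketch. It is a simplification for illustration, not gensim's implementation: `learn_phrases` and `apply_phrases` are hypothetical helper names, the score is the count ratio described above, and the tiny corpus is made up.

```python
from collections import Counter


def learn_phrases(sentences, min_count=1, threshold=1.0):
    """Return the set of adjacent token pairs that score above threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            phrases.add((a, b))
    return phrases


def apply_phrases(tokens, phrases, delimiter="_"):
    """Merge adjacent tokens that form a learned phrase (greedy, left to right)."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# made-up, already lemmatized training sentences
corpus = [
    "we love happy hour".split(),
    "happy hour start at five".split(),
    "the happy hour menu be good".split(),
    "they like the menu".split(),
]

phrases = learn_phrases(corpus, min_count=1, threshold=1.0)
print(apply_phrases("meet me at happy hour".split(), phrases))
```

In this toy corpus only "happy hour" repeats often enough to clear the threshold, so it is the only pair merged in the new sentence. With real data the `min_count` and `threshold` values need tuning against the size of the corpus.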

Phrase modeling is superficially similar to named entity detection, in that you would expect named entities to become phrases in the model. But you would also expect multi-word expressions that represent common concepts without being named entities (such as happy hour) to become phrases in the model.

We turn to the indispensable gensim library to help us with phrase modeling, and its Phrases class in particular.

In [23]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

Simultaneously perform phrase modeling with iterative data transformation:

Segment text of complete reviews into sentences & normalize text <br/>
First-order phrase modeling $\rightarrow$ apply first-order phrase model to transform sentences<br/>
Second-order phrase modeling $\rightarrow$ apply second-order phrase model to transform sentences<br/>
Apply text normalization and second-order phrase model to text of complete reviews<br/>

We'll use this transformed data as the input for some higher-level modeling approaches in the following sections.

First, let's define a few helper functions that we'll use for text normalization. In particular, the lemmatized_sentence_corpus generator function will use spaCy to:

Iterate over the reviews <br/>
Segment the reviews into individual sentences<br/>
Remove punctuation and excess whitespace<br/>
Lemmatize the text<br/>
(and do so efficiently in parallel when data is huge, thanks to spaCy's nlp.pipe() function)

In [24]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file,
    replacing each line's trailing newline with a space
    """

    with codecs.open(filename, encoding='utf-8') as f:
        for review in f:
            yield review.replace('\n', ' ')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    # n_threads is deprecated since spaCy 2.1; newer releases use n_process for parallelism
    for parsed_reviews in nlp.pipe(line_review(filename), batch_size = 1000, n_threads=4):
        for sent in parsed_reviews.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

Write this data back out to a new file (unigram_sentences_all), with one normalized sentence per line. We'll use this data for learning our phrase models.

In [25]:
unigram_sentences_filepath = '/Users/victoriacabales/Documents/data_science/restaurant_reviews/unigram_sentences_all.txt'

In [26]:
#reviews_path is path to sample_reviews.txt
with codecs.open(unigram_sentences_filepath, 'w', encoding='utf-8') as f:
    for sentence in lemmatized_sentence_corpus(reviews_path): 
        f.write(sentence + '\n')

The `unigram_sentences_all` file is now a large text file with one sentence per line. Gensim's *LineSentence* class provides an iterator over such a file that works with other gensim components. It streams the sentences from disk, so that you never have to hold the entire corpus in RAM at once. This allows you to scale your modeling pipeline up to potentially very large corpora.
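To make the streaming idea concrete, here is a minimal sketch of a re-iterable, line-at-a-time corpus reader. `StreamedSentences` is a hypothetical stand-in written for illustration, not gensim's LineSentence, and the demonstration file is made up.

```python
import os
import tempfile


class StreamedSentences:
    """Re-iterable stream of tokenized sentences, one per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated many times
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()


# Tiny demonstration file (hypothetical content)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("good food good price\nthis be the place to go\n")

sentences = StreamedSentences(tmp.name)
first_pass = list(sentences)
second_pass = list(sentences)  # works again because __iter__ re-opens the file
os.remove(tmp.name)
```

Being re-iterable matters because training code typically makes more than one pass over the corpus; a plain generator would be exhausted after the first pass.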

In [27]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [28]:
for unigram_sentence in it.islice(unigram_sentences, 0, 20):
    print(u' '.join(unigram_sentence))
    print(u'')

fantastic as usual with friendly service

-PRON- always head here after nurse the usual high cost city hangover but do not worry at ballans -PRON- will never feel rip off

3 type of egg to choose from all cook to perfection with a lovely sausage toast bean and small piece of hash brown take a beer on the side and enjoy -PRON- trip to balan do not even look another way in the westfield

this be the place to go

come on average pub where -PRON- can be serve quickly

th only down side will be the crowd around the some of the staff

-PRON- do't fall very welcome there

personally -PRON- favourite pakistani restaurant in toot

good food good meat quality and a reasonable price

for a while now -PRON- have be dream of that burger to die for few place do the burger that make -PRON- think should -PRON- or should not -PRON- as -PRON- look so bad but taste so damn good

well gbk have this tie down to a tee -PRON- make burgers that not only taste great but look amazing

to have -PRON- burger cook

Next, we'll learn a phrase model that will link individual words into two-word phrases.

In [29]:
bigram_model_filepath = '/Users/victoriacabales/Documents/nlp_hacknight/bigram_model_all.txt'

bigram_model = Phrases(unigram_sentences)

bigram_model.save(bigram_model_filepath)
    
# load the finished model
bigram_model = Phrases.load(bigram_model_filepath)

Now that we have a trained phrase model for word pairs, let's apply it to the review sentences data and explore the results.

In [30]:
bigram_sentences_filepath = '/Users/victoriacabales/Documents/data_science/restaurant_reviews/bigrammed_sentences_all.txt'


with codecs.open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for unigram_sentence in unigram_sentences:
        bigram_sentence = u' '.join(bigram_model[unigram_sentence])
        f.write(bigram_sentence + '\n')



In [31]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [32]:
#look at a subset

for bigram_sentence in it.islice(bigram_sentences, 20, 50):
    print(u' '.join(bigram_sentence))
    print(u'')

-PRON- will continue to return as long as that continue

leave party for a friend

-PRON- be shove into a corner and could hardly move let alone talk as -PRON- be so loud

limited cocktail available in happy hour

bar staff so busy -PRON- have not get time to answer question about the drink list

in the end -PRON- settle for a drink -PRON- be familiar with as the bar man do_not respond well to be ask to explain

will not be go_back

cramp and somehow joyless interior provide a setting for tonkotsu ramen which be inexplicably empty flavourless and ultimately expensive

couldn't wait to leave and never return

love the look very slick and clean and bright

the staff be uber friendly and warm

-PRON- only stop for a coffee and cake but the salt caramel cake be out of this world

always enjoy a visit to gbk

tasty good quality burger cook to order and with delicious millkshake to drink

recommend the new don burger on brioche bun very yummy

-PRON- be a safe bet if -PRON- be meet friend fo

In [33]:
bigram_reviews_filepath = '/Users/victoriacabales/Documents/data_science/restaurant_reviews/bigram_transformed_reviews_all.txt'

In [34]:
#list of stop words
spacy.lang.en.English.Defaults.stop_words

{'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'front',
 'full',
 'further',
 'get',
 'give',
 'g

At this point, you would usually run your entire file through the full pipeline: lemmatize each review, apply the bigram phrase model, and remove stop words.

In [35]:
with codecs.open(bigram_reviews_filepath, 'w', encoding='utf_8') as f:

    for parsed_review in nlp.pipe(line_review(reviews_path)):

        # lemmatize the text, removing punctuation and whitespace
        unigram_review = [token.lemma_ for token in parsed_review
                          if not punct_space(token)]

        # apply the first-order phrase model
        bigram_review = bigram_model[unigram_review]

        # remove any remaining stopwords
        bigram_review = [term for term in bigram_review
                         if term not in spacy.lang.en.English.Defaults.stop_words]

        # write the transformed review as a line in the new file
        bigram_review = u' '.join(bigram_review)
        f.write(bigram_review + '\n')



In [36]:
print(u'Original:' + u'\n')

for review in it.islice(line_review(reviews_path), 0,1):
    print(review)

print(u'----' + u'\n')
print(u'Transformed:' + u'\n')

with codecs.open(bigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 0,1):
        print(review)

Original:

Fantastic as usual with friendly service ... We always head here after nursing the usual high cost city hangover - but don't worry at Ballans you will never feel ripped off ! 3 types of eggs to choose from all cooked to perfection with a lovely sausage , toast , beans and small pieces of hash browns take a beer on the side and enjoy your trip to balans don't even look another way in the Westfield this is THE place to go !!!! Come on Average pub, where you can be serve quickly. Th only down side will be the crowd around the some of the staff. You do't fell very welcome there Personally my favourite Pakistani restaurant in tooting. Good food, good meat quality and a reasonable price. For a while now I’ve been dreaming of that Burger to die for, few places do the Burger that makes you think, should I or shouldn’t I, as it looks so bad but tastes so damn good. Well GBK has this tied down to a Tee, they make Burgers that not only tastes great but look amazing. To have your Burger

# Topic Modeling with Latent Dirichlet Allocation (LDA)

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

In [37]:
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore

import pyLDAvis
import pyLDAvis.gensim

### 1st step: learn the full vocabulary

In [38]:
bigram_dictionary_filepath= '/Users/victoriacabales/Documents/data_science/restaurant_reviews/bigram_dict_all.dict'

In [39]:
# note: this streams the sentence-level file, which still contains stop words;
# bigram_reviews_filepath holds the stop-word-filtered, review-level text
bigram_reviews = LineSentence(bigram_sentences_filepath)

# learn the dictionary by iterating over all of the reviews
bigram_dictionary = Dictionary(bigram_reviews)

# filter tokens that are very rare or too common from
# the dictionary (filter_extremes) and reassign integer ids (compactify)
bigram_dictionary.filter_extremes(no_below=5, no_above=0.2)
bigram_dictionary.compactify()


bigram_dictionary.save(bigram_dictionary_filepath)
    
# load the finished dictionary from disk
bigram_dictionary = Dictionary.load(bigram_dictionary_filepath)
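As a rough illustration of the filtering step above, the sketch below keeps only tokens whose document frequency falls between the two bounds. It is a simplification for intuition, not gensim's code: `filter_by_document_frequency` and the toy documents are hypothetical, and gensim's `filter_extremes` additionally caps the vocabulary size with `keep_n`.

```python
from collections import Counter


def filter_by_document_frequency(documents, no_below=5, no_above=0.2):
    """Return the tokens whose document frequency is neither too low nor too high."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(documents)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


docs = [
    ["good", "food", "burger"],
    ["good", "service", "burger"],
    ["good", "food", "ramen"],
    ["good", "pub"],
]
# Keep tokens seen in at least 2 documents and in at most 75% of them
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here `good` is dropped for appearing in every document, while `service`, `ramen`, and `pub` are dropped for appearing only once.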


Like many NLP techniques, LDA uses a simplifying assumption known as the bag-of-words model. In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded.
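A small sketch of the bag-of-words idea in plain Python. The vocabulary `term_to_id` and the helper `to_bow` are hypothetical; the output is a list of (term id, count) pairs, the same shape that gensim's `doc2bow` returns.

```python
from collections import Counter

# Hypothetical vocabulary mapping each known term to an integer id
term_to_id = {"burger": 0, "good": 1, "happy_hour": 2}


def to_bow(tokens, term_to_id):
    """Return sorted (term_id, count) pairs, skipping out-of-vocabulary terms."""
    counts = Counter(term_to_id[t] for t in tokens if t in term_to_id)
    return sorted(counts.items())


bow = to_bow(["good", "burger", "good", "milkshake"], term_to_id)
print(bow)
```

The order of the tokens is lost, and `milkshake` disappears because it is not in the vocabulary.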

We use the gensim Dictionary we learned to generate a bag-of-words representation for each review. The bigram_bow_generator function implements this. We'll save the resulting bag-of-words reviews as a matrix.

In the code, "bag-of-words" is abbreviated to bow.


In [40]:
bigram_bow_filepath = '/Users/victoriacabales/Documents/data_science/restaurant_reviews/bigram_bow_corpus_all.mm'

In [41]:
def bigram_bow_generator(filepath):
    """
    function to read reviews from a file
    output: bag-of-words representation
    """
    
    for review in LineSentence(filepath):
        yield bigram_dictionary.doc2bow(review)

In [42]:
# generate bag-of-words representations for all reviews and save them as a matrix
MmCorpus.serialize(bigram_bow_filepath,
                       bigram_bow_generator(bigram_sentences_filepath))
    
# load the finished bag-of-words corpus from disk
bigram_bow_corpus = MmCorpus(bigram_bow_filepath)

With the bag-of-words corpus, we're finally ready to learn our topic model from the reviews. We simply need to pass the bag-of-words matrix and Dictionary from our previous steps to LdaMulticore as inputs, along with the number of topics the model should learn.

In [43]:
lda_model_filepath = '/Users/victoriacabales/Documents/data_science/restaurant_reviews/lda_model_all'

In [44]:
lda = LdaMulticore(bigram_bow_corpus,num_topics=10,
                   id2word=bigram_dictionary, 
                   workers=2)
    
lda.save(lda_model_filepath)
    
# load the finished LDA model from disk
lda = LdaMulticore.load(lda_model_filepath)

Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

In [45]:
def topics(topic_number, topn=25):
    """print the topn most probable terms for the given topic"""
    print(u'{:10} {}'.format(u'term', u'frequency') + u'\n')

    for term, frequency in lda.show_topic(topic_number, topn=topn):
        print(u'{:10} {:.3f}'.format(term, round(frequency, 3)))

In [46]:
topics(topic_number = 2)

term       frequency

in         0.041
but        0.023
this       0.019
with       0.014
food       0.013
not        0.013
for        0.010
great      0.009
service    0.009
if         0.009
on         0.008
or         0.008
good       0.008
do_not     0.008
there      0.008
all        0.007
very       0.007
after      0.007
as         0.007
of         0.006
love       0.006
which      0.006
that       0.006
will       0.006
attentive  0.006


pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.

In [47]:
#LDAvis_data_filepath = '/Users/victoriacabales/Documents/data_science/restaurant_reviews/ldavis_prepared.txt'

In [48]:
#take topic models prepared by gensim and prepare data for visualization

LDAvis_prepared = pyLDAvis.gensim.prepare(lda, bigram_bow_corpus,
                                              bigram_dictionary)




In [49]:
pyLDAvis.display(LDAvis_prepared)

What an LDA visualization shows:
1. Better interpretation of individual topics
2. Relationships between different topics

Distance: topics that are similar appear closer together, dissimilar topics appear farther apart <br/>
Size: relative frequency of topic in dataset<br/>
Bar chart: 30 most relevant terms<br/>


# Word2vec

The goal of word vector embedding models, or word vector models for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vectors it learns about each term should encode some information about the meaning or concept the term represents, and the relationship between it and other terms in the vocabulary. 

# I like ___ food.

a) italian
b) mexican
c) pen
d) chair
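One common way to compare word vectors is cosine similarity, which is also the measure behind gensim's `most_similar`. The sketch below uses made-up 3-dimensional vectors purely for illustration; the values are hypothetical and were not learned from the reviews.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors; real models use many more dimensions
vectors = {
    "italian": [0.9, 0.1, 0.2],
    "mexican": [0.8, 0.2, 0.1],
    "chair": [0.1, 0.9, 0.7],
}
sim_cuisines = cosine_similarity(vectors["italian"], vectors["mexican"])
sim_unrelated = cosine_similarity(vectors["italian"], vectors["chair"])
print(sim_cuisines > sim_unrelated)
```

Terms that fit the same contexts, such as the two cuisines in the blank above, should end up with vectors that point in similar directions.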

In [50]:
from gensim.models import Word2Vec

bigram_sentences = LineSentence(bigram_sentences_filepath)
word2vec_filepath = '/Users/victoriacabales/Documents/data_science/restaurant_reviews/word2vec_model_all'

In [51]:
food2vec = Word2Vec(bigram_sentences, size=100, window=5,
                        min_count=20, sg=1, workers=4)

food2vec.save(word2vec_filepath)


        
# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print(u'{} training epochs so far'.format(food2vec.train_count))

1 training epochs so far


In [52]:
# look up the topn most similar terms to token

def get_related_terms(token, topn=10):
    # query the trained vectors through the model's .wv attribute
    for word, similarity in food2vec.wv.most_similar(positive=[token], topn=topn):
        print(u'{:10} {}'.format(word, round(similarity, 5)))

In [53]:
get_related_terms('restaurant')

open       0.99901
italian    0.99895
option     0.9989
decor      0.99889
breakfast  0.99888
place      0.99887
pub        0.99886
who        0.99885
quite      0.99885
dining     0.99883
