# Introduction to Yelp Review Sentiment Classification

In this project, we will build a classifier that predicts a user's rating of a restaurant from the text of their review. Companies widely use sentiment analysis to better understand their users' preferences and tastes.




In [None]:
#@title Import our libraries (this may take a minute or two)
import pandas as pd   # Great for tables (google spreadsheets, microsoft excel, csv). 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
import spacy
import wordcloud
import os # Good for navigating your computer's files 
import sys

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from spacy.lang.en.stop_words import STOP_WORDS
nltk.download('wordnet')
nltk.download('punkt')

from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
!python -m spacy download en_core_web_md
import en_core_web_md

In [None]:
data_file  = '../input/yelp-reviews-dataset/yelp.csv'


## Data Exploration

First we read in the file containing the reviews and take a look at the data available to us.

In [None]:
# read our data in using 'pd.read_csv('file')'
yelp = pd.read_csv(data_file)

In [None]:
yelp.head()

In [None]:
# Remove the columns we do not need for sentiment classification
yelp.drop(columns=['business_id', 'date', 'review_id', 'type', 'user_id'], inplace=True)

The text column is the one we are primarily focused on. Let's take a look at a few of these reviews to better understand our problem.

In [None]:
num_stars = 1 #@param {type:"integer"}

for t in yelp[yelp['stars'] == num_stars]['text'].head(20).values:
    print(t)

We can start to see that highly rated and poorly rated reviews differ in their vocabulary. Certain words, such as 'delightful', 'impressive', and 'amazing', are likely to be more common in 4 or 5 star reviews. However, the same words can also appear in a 2 star review, for example: "The seating and ambience were impressive, but the food served to us was not". Simple keyword matching is therefore not enough on its own.
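
To make the limitation of keyword matching concrete, here is a minimal sketch of a naive rule. The keyword set and the function name `naive_is_positive` are assumptions made up for illustration; they are not part of the dataset or of any library.

```python
# Hypothetical keyword set, chosen only for illustration.
POSITIVE_WORDS = {"delightful", "impressive", "amazing"}

def naive_is_positive(review):
    """Guess that a review is positive if it contains any positive keyword."""
    # Lowercase each word and strip surrounding punctuation before matching.
    words = {w.strip(".,!?\"'").lower() for w in review.split()}
    return len(words & POSITIVE_WORDS) > 0

mixed_review = "The seating and ambience were impressive, but the food served to us was not"
# The rule fires on 'impressive' even though the review is mixed at best.
print(naive_is_positive(mixed_review))
```

A rule like this cannot tell that 'impressive' describes the seating and not the food, which is one reason to learn from whole reviews instead of hand-written rules.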


In [None]:
rule_1 = "Bad" 
rule_2 = "Good" 
rule_3 = "Fine" 

#### A word cloud visualization gives a good indication of how frequently words are used in reviews with a given rating.

In [None]:
# Word cloud for reviews with a given star rating
num_stars = 2
# Join all reviews with this rating into one long string
this_star_text = ' '.join(yelp[yelp['stars'] == num_stars]['text'].values)

# Use a name other than `wordcloud` so the imported module is not shadowed
cloud = WordCloud()
cloud.generate_from_text(this_star_text)
plt.figure(figsize=(14,7))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')  # hide the axis ticks, which carry no meaning for an image

## Text Preprocessing

#### Tokenization

First of all, we would like to convert each review from a single string into a list of words (a process known as tokenization). Most NLP algorithms operate on a list of tokens rather than on raw sentences. Enter some example text into the cell below to see the tokenized version.

In [None]:
example_text = "All the people I spoke to were super nice and very welcoming." #@param {type:"string"}
tokens = word_tokenize(example_text)
tokens

#### Stopwords

Some words tend to be associated with 4 or 5 star reviews, and others with 1 or 2 star reviews. At the same time, many words carry no relevant information for our problem. In NLP these are called "stopwords": words that provide grammatical structure but convey little about the subject itself, such as 'the', 'and', or 'of'. Edit the cell below to check whether a given word is a stop word.

In [None]:
example_word = "ok"
if example_word.lower() in STOP_WORDS:
    print(example_word + " is a stop word.")
else:
    print(example_word + " is NOT a stop word.")
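
Once we can test a single word, removing stop words from a tokenized review is a short filtering step. The sketch below assumes a tiny hand-picked stop list (`EXAMPLE_STOP_WORDS`) so that it runs without any downloads; the helper name `remove_stop_words` is also made up for illustration.

```python
# A tiny hand-picked stop list for illustration; spaCy's STOP_WORDS is much larger.
EXAMPLE_STOP_WORDS = {"all", "the", "i", "to", "were", "and", "very"}

def remove_stop_words(tokens, stop_words=EXAMPLE_STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

example_tokens = ["All", "the", "people", "I", "spoke", "to", "were",
                  "super", "nice", "and", "very", "welcoming", "."]
print(remove_stop_words(example_tokens))
# ['people', 'spoke', 'super', 'nice', 'welcoming', '.']
```

In this notebook, passing `stop_words=STOP_WORDS` would use spaCy's list instead of the toy one.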

In [None]:
nlp = en_core_web_md.load()
doc = nlp(u"We are running out of time! Are we though?")
doc

In [None]:
doc = nlp(u"We are running out of time! Are we though?")
token = doc[0] # Get the first word in the text.
assert token.text == u"We" # Check that the token text is 'We'.
assert len(token) == 2 # Check that the length of the token is 2.

In [None]:
doc = nlp(u"I like apples")
apples = doc[2]

print(apples.vector.shape[0]) # Each word is represented by a 300-dimensional vector embedding in en_core_web_md

The word 'apples' is represented by a 300-dimensional vector embedding.


In [None]:
doc = nlp(u'dog and cat')
word1 = doc[0]
word2 = doc[2]
word1.similarity(word2)
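
By default, spaCy's `similarity` is based on the cosine similarity of the two word vectors. The sketch below shows that computation on made-up 3-dimensional vectors; the numbers are assumptions for illustration only and are not real spaCy embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors; real embeddings from en_core_web_md have 300 dimensions.
dog = [1.0, 2.0, 0.5]
cat = [0.9, 2.1, 0.4]
car = [-1.0, 0.2, 3.0]

print(cosine_similarity(dog, cat))  # close to 1: the vectors point the same way
print(cosine_similarity(dog, car))  # much lower: the vectors point apart
```

Values near 1 mean the vectors point in nearly the same direction, which is how words with similar usage end up with high similarity scores.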

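Run the cell below to remove the 4 star reviews. In the cells that follow, 5 star reviews are treated as good and all remaining reviews (1 to 3 stars) as not good.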

In [None]:
yelp = yelp[yelp.stars != 4].copy()  # copy avoids a possible SettingWithCopyWarning when a column is added later

In [None]:
def is_good_review(stars):
    # A review counts as good only if it has 5 stars
    return stars == 5

# Add a boolean column that is True for good reviews and False otherwise.
yelp['is_good_review'] = yelp['stars'].apply(is_good_review)

## One-Hot Vectors


In [None]:
#@title Run this to see the one-hot encoding of 'great tacos at this restaurant'
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('great', 'tacos', 'at','this','restaurant'))
print('-' * 32)  # same width as the table rows
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('1', '0', '0','0','0'))
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('0', '1', '0','0','0'))
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('0', '0', '1','0','0'))
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('0', '0', '0','1','0'))
print('{:^5}|{:^5}|{:^4}|{:^4}|{:^10}'.format('0', '0', '0','0','1'))
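
The table above can also be built programmatically. Here is a minimal sketch in plain Python; the helper name `one_hot_encode` is made up for illustration, and in practice a library encoder would usually be used instead.

```python
def one_hot_encode(words):
    """Return (vocabulary, vectors): one vector per word, with a single 1
    at the position of that word in the vocabulary."""
    vocabulary = list(dict.fromkeys(words))  # unique words, in first-seen order
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for word in words:
        vec = [0] * len(vocabulary)
        vec[index[word]] = 1
        vectors.append(vec)
    return vocabulary, vectors

words = "great tacos at this restaurant".split()
vocabulary, vectors = one_hot_encode(words)
for word, vec in zip(words, vectors):
    print(word, vec)
```

Each vector is as long as the vocabulary, so one-hot vectors grow quickly with the number of distinct words in the reviews.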

#### Next steps to try:
- Build bag-of-words features and train a logistic regression classifier
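
As a starting point for this step, here is a minimal sketch using scikit-learn's `CountVectorizer` and `LogisticRegression`, both imported at the top of this notebook. The four toy reviews and their labels are made up for illustration and do not come from the Yelp data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy reviews and labels, for illustration only.
texts = [
    "amazing food and delightful service",
    "impressive menu and amazing tacos",
    "terrible food and rude service",
    "awful tacos and terrible service",
]
labels = [True, True, False, False]  # True means a good review

# Bag of words: one row per review, one column per distinct word,
# each entry counting how often that word occurs in the review.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
print(X.shape)

# Fit a logistic regression classifier on the word counts.
model = LogisticRegression()
model.fit(X, labels)

# New text must be transformed with the same fitted vectorizer.
new_review = vectorizer.transform(["delightful tacos and amazing service"])
print(model.predict(new_review))
```

With the real data, `yelp['text']` and `yelp['is_good_review']` would presumably take the place of the toy lists, with `train_test_split` used to hold out reviews for evaluation.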