# Natural Language Processing: Yelp Reviews

In this NLP project, we will classify Yelp reviews as either 1 star or 5 stars based on the text of each review. We will also use scikit-learn's Pipeline to chain the text-processing and classification steps.

Each observation in the data set is a review of a particular business by a particular user.

The "stars" column is the number of stars (1 through 5) assigned by the reviewer to the business. (A higher number of stars is better.) In other words, it is the rating of the business by the person who wrote the review.

The "cool" column is the number of "cool" votes this review received from other Yelp users. 

All reviews start with 0 "cool" votes, and there is no limit to how many "cool" votes a review can receive. In other words, it is a rating of the review itself, not a rating of the business.

The "useful" and "funny" columns are similar to the "cool" column.

## Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

## Data set importing

In [None]:
yelp = pd.read_csv('../input/yelp-reviews/yelp.csv')

## Checking the head of the DataFrame:

In [None]:
yelp.head()

## Checking the column info of the Data Frame:

In [None]:
yelp.info()

## Checking the statistical summary of the Data Frame:

In [None]:
yelp.describe()

### Let's create a new column called "text length":

In [None]:
yelp['text length'] = yelp['text'].apply(len)

### We can use FacetGrid from the seaborn library to create a grid of 5 histograms of text length, one per star rating:

In [None]:
g = sns.FacetGrid(yelp,col='stars');
g.map(plt.hist,'text length');

### Also, we can create a boxplot of text length for each star category:

In [None]:
sns.boxplot(x='stars',y='text length',data=yelp,palette='rainbow');

### Now, we can create a countplot of the number of occurrences for each type of star rating:

In [None]:
sns.countplot(x='stars',data=yelp,palette='rainbow');

### We can use groupby to get the mean values of the numerical columns:

In [None]:
stars = yelp.groupby('stars').mean(numeric_only=True)  # skip text columns; needed in pandas >= 2.0
stars
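As a small illustration of what this groupby produces, here is a sketch on a made-up three-row frame (the values and the `toy` names are invented for the example, not taken from the Yelp data). Passing `numeric_only=True` leaves out text columns, which recent pandas versions would otherwise refuse to average.

```python
import pandas as pd

# Invented toy data: two 1-star reviews and one 5-star review
toy = pd.DataFrame({
    'stars': [1, 1, 5],
    'cool': [0, 2, 4],
    'text': ['bad', 'awful', 'great'],
})

# Average only the numeric columns within each star rating
toy_means = toy.groupby('stars').mean(numeric_only=True)
print(toy_means)
```

The result has one row per star rating and one column per numeric feature; the `text` column is dropped.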

### Using the corr() method on that groupby dataframe to determine the correlation between the variables:

In [None]:
stars.corr()

### Then we can use seaborn to create a heatmap from that .corr() DataFrame:

In [None]:
sns.heatmap(stars.corr(),cmap='coolwarm',annot=True);

### Let's move on to the actual classification task. To make things a little easier, let's only grab reviews that were either 1 star or 5 stars:

In [None]:
yelp_class = yelp[(yelp.stars==1) | (yelp.stars==5)]

### Let's create two objects X and y. X will be the 'text' column of yelp_class and y will be the 'stars' column of yelp_class:

In [None]:
X = yelp_class['text']
y = yelp_class['stars']

### We can import CountVectorizer and create a CountVectorizer object:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

### Now, let's use the fit_transform method on the CountVectorizer object and pass in X (the 'text' column) and also save this result by overwriting X:

In [None]:
X = cv.fit_transform(X)
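To make the bag-of-words step more concrete, here is a minimal sketch on two invented sentences (the `toy_` names and the sentences are assumptions for illustration only). It shows that `fit_transform` learns a vocabulary and returns a sparse matrix with one row per document and one column per word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented mini-reviews
toy_docs = ["great food great service", "terrible service"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)  # sparse document-term matrix

# vocabulary_ maps each word to its column index; sort words by that index
toy_terms = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_terms)
print(toy_counts.toarray())
```

Each entry is the number of times a word occurs in a document, so "great" gets a count of 2 in the first row.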

## Train Test Split

### Let's split our data into training and testing data.

### Using train_test_split to split up the data, using test_size=0.3 and random_state=101:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=101)
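The effect of `test_size` and `random_state` can be checked on a tiny made-up dataset. The sketch below (all `toy_`, `a_` and `b_` names are hypothetical) splits ten samples twice with the same seed and confirms that the two splits are identical and that 30% of the rows land in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten invented samples with alternating 1-star / 5-star labels
toy_X = np.arange(10).reshape(-1, 1)
toy_y = np.array([1, 5] * 5)

# Same random_state twice, so both calls shuffle the rows identically
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.3, random_state=101)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.3, random_state=101)

print(len(a_train), len(a_test))
print(np.array_equal(a_test, b_test))
```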

## Training a Model

### Let's import MultinomialNB (Naive Bayes), create an instance of the estimator and call it nb:

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

### Fitting nb using the training data:

In [None]:
nb.fit(X_train,y_train)

## Predictions and Evaluations

### Using the predict method of nb to predict labels from X_test:

In [None]:
predictions = nb.predict(X_test)

### Let's now create a confusion matrix and classification report using these predictions and y_test:

In [None]:
print(confusion_matrix(y_test,predictions))
print('\n')
print(classification_report(y_test,predictions))
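When reading the confusion matrix, it helps to know that scikit-learn puts the true labels on the rows and the predicted labels on the columns, with labels in sorted order. The following sketch uses invented labels (not results from the Yelp model) to show the layout and how the recall for the 1-star class follows from it.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: two true 1-star reviews, three true 5-star reviews
toy_true = [1, 1, 5, 5, 5]
toy_pred = [1, 5, 5, 5, 5]

# Row 0 = true 1-star, row 1 = true 5-star; columns follow the same order
toy_cm = confusion_matrix(toy_true, toy_pred)
print(toy_cm)

# Recall for 1-star = correctly predicted 1-star / all true 1-star
recall_1 = toy_cm[0, 0] / toy_cm[0].sum()
print(recall_1)
```

Here one of the two true 1-star reviews is misclassified as 5 stars, so the 1-star recall is 0.5.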

### The model performs well on the held-out test set. Let's see what happens if we add TF-IDF to this process using a pipeline.

# Using Text Processing

### Let's create a pipeline with the following steps: CountVectorizer(), TfidfTransformer(),MultinomialNB():

In [None]:
pipeline = Pipeline([
    ('bow', CountVectorizer()),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
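To see what the `tfidf` step does to the counts, here is a small sketch on two invented sentences (the `toy_` names are assumptions for illustration). With scikit-learn's default settings, a word that appears in every document gets a lower weight than a word that appears in only one, and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Two invented mini-reviews; "service" appears in both, "terrible" in one
toy_docs = ["great food great service", "terrible service"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_docs)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()

service = toy_cv.vocabulary_['service']
terrible = toy_cv.vocabulary_['terrible']

# In the second document both words occur once, but their weights differ
print(toy_tfidf[1, service], toy_tfidf[1, terrible])

# Each row is L2-normalized by default
print(np.linalg.norm(toy_tfidf, axis=1))
```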

## Using the Pipeline

### Time to use the pipeline. Remember that this pipeline already contains all our preprocessing steps, so we need to re-split the original data. (Recall that we overwrote X with the CountVectorized version; what we need now is just the raw text.)

In [None]:
X = yelp_class['text']
y = yelp_class['stars']
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=101)

### Let's now fit the pipeline to the training data. We can't use the same training data as last time because that data has already been vectorized. We need to pass in just the text and labels:

In [None]:
pipeline.fit(X_train,y_train)
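As a minimal sketch of why the pipeline takes raw text, the example below fits the same three steps on four invented mini-reviews and then predicts directly from strings. The `toy_` names and the sentences are made up for illustration and are far too small to say anything about real accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_pipeline = Pipeline([
    ('bow', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])

# Invented training reviews with their star labels
toy_text = [
    "great food great service",
    "loved it amazing",
    "terrible service rude staff",
    "awful food never again",
]
toy_stars = [5, 5, 1, 1]

# fit() runs vectorizing, TF-IDF weighting and training in one call
toy_pipeline.fit(toy_text, toy_stars)

# predict() accepts raw strings and applies the same fitted steps
toy_predictions = toy_pipeline.predict(["great amazing food", "rude awful staff"])
print(toy_predictions)
```

Because the vectorizer lives inside the pipeline, its vocabulary is learned only from the data passed to `fit`.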

### Predictions and Evaluation

### Now let's use the pipeline to predict from the X_test and create a classification report and confusion matrix:

In [None]:
predictions = pipeline.predict(X_test)

In [None]:
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

### It looks like TF-IDF actually made things worse here: the accuracy dropped significantly compared with the plain count-based model.
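One way to put an accuracy figure in context is to compare it with the accuracy of always predicting the most common class. The sketch below is not part of the original analysis; `majority_baseline_accuracy` is a hypothetical helper and the labels are invented.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Invented labels: three 5-star reviews and one 1-star review
toy_labels = [5, 5, 5, 1]
print(majority_baseline_accuracy(toy_labels))
```

Calling the helper on `y_test` would give the baseline for this dataset; a model whose accuracy sits close to that number is probably predicting the majority class most of the time.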