### Using NLTK and scikit-learn's MultinomialNB to analyze Yelp reviews and predict whether a review has a 1-star or a 5-star rating

In [None]:
import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

The input file is yelp.csv (source: https://raw.githubusercontent.com/justmarkham/DAT8/master/data/yelp.csv).
To begin, read the data from yelp.csv, then print the shape of the DataFrame and its first few rows.

In [None]:
yelp = pd.read_csv('../input/yelp-reviews/yelp.csv')
print(yelp.shape)
yelp.head()

The model needs to "read" the 'text' column and predict the number in the 'stars' column. Star values range from 1 to 5. For this model, we consider only the 1-star (worst) and the 5-star (best) reviews.

Create a DataFrame with only the required columns and print the data.

In [None]:
df = yelp[['stars','text']]
df

Create a DataFrame with only the required rows and print the data. (Use concatenation.)

In [None]:
y1 = df[df.stars==1]
y2 = df[df.stars==5]
yelp = pd.concat([y1,y2],ignore_index=True,axis=0)
yelp

Looking at the percentages of 1-star and 5-star reviews helps us determine whether both classes are adequately represented in the dataset.


In [None]:
new_df = yelp.groupby('stars').text.count()
print('Percentage of 1 star :', round(new_df.loc[1] / len(yelp) * 100, 2), '%')
print('Percentage of 5 stars :', round(new_df.loc[5] / len(yelp) * 100, 2), '%')
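
As a cross-check on the class balance, the same percentages can be read off in a single call with `value_counts(normalize=True)`. The sketch below uses a small made-up Series (`stars`) in place of `yelp['stars']`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for yelp['stars']; the real column comes from yelp.csv
stars = pd.Series([5, 5, 5, 1, 5, 1, 5, 5, 5, 5])

# normalize=True returns fractions of the total instead of raw counts
share = stars.value_counts(normalize=True).mul(100).round(2)
print(share)
```

On the notebook's data, the equivalent call would be `yelp['stars'].value_counts(normalize=True)`.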

From the yelp DataFrame, create the X DataFrame containing only the 'text' column by dropping the 'stars' column.

In [None]:
X = yelp.drop(columns=['stars'])
print(X.shape)
X

From the yelp DataFrame, create the y Series from the 'stars' column. This is the target the model will predict.

In [None]:
y = yelp['stars']
y.shape

#### Pre-processing the X DataFrame

 - removing non-words
 - removing stop words, and 
 - stemming the words.
 
Print the resulting X variable.

In [None]:
from nltk.tokenize import RegexpTokenizer
# Keep only alphabetic tokens; lowercase each review first
tokenizer = RegexpTokenizer('[a-zA-Z]+')
w = [tokenizer.tokenize(text.lower()) for text in X.text]
#w

In [None]:
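from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)  # fetches the stop-word list if it is not already installed
stop_words = set(stopwords.words('english'))
# Drop stop words from each tokenized review
w = [[token for token in tokens if token not in stop_words] for tokens in w]
#w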

In [None]:
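from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Store the stemmed tokens under a new name so the stemmer object is not overwritten
stemmed = [[stemmer.stem(token) for token in tokens] for tokens in w]
#stemmed

In [None]:
# Join each review's stems back into a single string
row = [' '.join(tokens) for tokens in stemmed]
X_processed = pd.DataFrame({0: row})
X_processed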
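
The three pre-processing steps can also be folded into one helper, which makes it easier to apply the same treatment to new text later. This is a minimal sketch, not the notebook's own code: the `preprocess` function and the small `STOP_WORDS` set are assumptions introduced here so the example runs without downloading the NLTK stop-word corpus.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Tiny hand-picked stop-word set (an assumption) so the sketch runs without
# the NLTK corpus; the notebook itself uses stopwords.words('english')
STOP_WORDS = {"the", "was", "and", "i", "a", "is", "it"}

tokenizer = RegexpTokenizer('[a-zA-Z]+')
stemmer = PorterStemmer()

def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize, drop stop words, stem, and join back into one string."""
    tokens = tokenizer.tokenize(text.lower())
    kept = [t for t in tokens if t not in stop_words]
    return ' '.join(stemmer.stem(t) for t in kept)

print(preprocess("The food was great, and I loved it!!"))
```

With the full stop-word list passed as `stop_words`, something like `X.text.apply(preprocess)` should produce the same kind of column as `X_processed`.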


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(X_processed[0])
X_vectors = vect.transform(X_processed[0])
print(X_vectors)
print(X_vectors.shape)
y.shape
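
Printing a sparse matrix directly is hard to read, so it can help to see what `CountVectorizer` produces on a toy input. The two documents below are made up for illustration; each row of the result is one document and each column is the count of one vocabulary term.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good food", "bad food bad servic"]  # toy, already-stemmed reviews
vect = CountVectorizer()
counts = vect.fit_transform(docs)

# vocabulary_ maps each term to its column index; sort terms by that index
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(terms)
print(counts.toarray())
```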

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X_vectors,y,test_size=0.2)
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)
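
Because the two classes are unevenly represented, one option worth considering is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below uses made-up labels (`y_toy`) and placeholder features (`X_toy`); `stratify` and `random_state` are standard `train_test_split` parameters, but the notebook's own split does not use them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a similar kind of skew: 80 five-star, 20 one-star
y_toy = np.array([5] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

# stratify keeps the 80/20 class ratio in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)
print((y_te == 1).sum(), (y_te == 5).sum())
```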

In [None]:
classifier = MultinomialNB()
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
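
In this notebook the vectorizer is fitted on all reviews before the split, so the vocabulary also includes words that occur only in test reviews. One way to keep the vocabulary restricted to the training text is to chain the vectorizer and the classifier in a scikit-learn pipeline. The sketch below uses made-up training texts and labels purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (made up for illustration)
train_text = ["great food", "love place", "awful servic", "terribl food"]
train_stars = [5, 5, 1, 1]

# fit() learns the vocabulary and the class statistics from training text only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_text, train_stars)

pred = model.predict(["awful terribl", "great love"])
print(pred)
```

With this layout, the raw (pre-processed) text would be split first, and the pipeline fitted on the training portion.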

In [None]:
from sklearn.metrics import f1_score

print(metrics.accuracy_score(y_test,y_pred))
print(metrics.confusion_matrix(y_test,y_pred))
f1_score(y_test,y_pred,average='weighted')

Although the accuracy (92.17%) and the F1 score (92.00%) are both high, they give false comfort: the data is heavily skewed towards 5-star reviews, and the confusion matrix shows where the model struggles. For 5-star reviews, the model predicts correctly 651 times and goes wrong 23 times (about 97% correct). For 1-star reviews, it predicts correctly only 108 times and goes wrong 36 times (75% correct). Thus, the model works well for the class that hugely outnumbers the other (5-star), but noticeably worse for the under-represented class (1-star).
It is therefore advisable to report the F1 score alongside accuracy for this model. F1 is the harmonic mean of precision and recall, and with average='weighted' the per-class F1 scores are averaged in proportion to each class's size. Because that weighting still favours the majority class, the per-class F1 values show the effect of the imbalance more directly.
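
To make the per-class picture concrete, the counts quoted above can be turned into recall and F1 by hand. This is a sketch under two assumptions: the matrix follows scikit-learn's layout (rows are true classes, columns are predicted classes), and the counts come from one particular random split, so a re-run of the notebook will give somewhat different numbers.

```python
import numpy as np

# Rows = true class, columns = predicted class, ordered [1-star, 5-star].
# Counts are the ones quoted in the text; the row/column layout is assumed
# to follow scikit-learn's confusion_matrix convention.
cm = np.array([[108, 36],
               [23, 651]])

true_pos = np.diag(cm)
recall = true_pos / cm.sum(axis=1)     # correct / all reviews of that class
precision = true_pos / cm.sum(axis=0)  # correct / all predictions of that class
f1 = 2 * precision * recall / (precision + recall)

support = cm.sum(axis=1)
macro_f1 = f1.mean()                           # both classes count equally
weighted_f1 = np.average(f1, weights=support)  # classes count by their size

for label, r, f in zip((1, 5), recall, f1):
    print(f"{label}-star: recall={r:.3f}, f1={f:.3f}")
print(f"macro F1={macro_f1:.3f}, weighted F1={weighted_f1:.3f}")
```

The macro average comes out lower than the weighted one here, because it gives the weaker 1-star class the same weight as the 5-star class.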