# Reviews assignment


## Introduction


- **`data.csv`** contains the dataset. 
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) that the reviewer assigned to the business; more stars means a better rating.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.


## Task 1

Load **`data.csv`** into a pandas DataFrame and examine it.
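A minimal sketch of loading and inspecting the data. To keep it runnable without the real file, it reads a hypothetical two-row stand-in from memory; with the actual dataset you would pass the path `'data.csv'` to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv (assumption: only the two columns
# described in the introduction are shown here).
sample_csv = io.StringIO(
    "stars,text\n"
    "5,Great food and friendly staff\n"
    "1,Terrible service and cold food\n"
)
reviews = pd.read_csv(sample_csv)  # real data: pd.read_csv('data.csv')

print(reviews.shape)                                 # (rows, columns)
print(reviews.head())                                # first few rows
print(reviews["stars"].value_counts().sort_index())  # reviews per star rating
```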

## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) 
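One way to apply the filter, sketched on a small made-up DataFrame (the rows are illustrative, not taken from `data.csv`). Both forms below select the same rows; `isin` is often easier to read when there are several accepted values.

```python
import pandas as pd

# Toy data for illustration only.
reviews = pd.DataFrame({
    "stars": [5, 3, 1, 4, 5],
    "text": ["loved it", "it was ok", "awful", "pretty good", "fantastic"],
})

# Combine two boolean conditions with | (each condition needs parentheses).
best_worst = reviews[(reviews["stars"] == 5) | (reviews["stars"] == 1)]

# Equivalent filter using isin.
best_worst_isin = reviews[reviews["stars"].isin([1, 5])]

print(best_worst)
```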

## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
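A short sketch of this step on toy data. The DataFrame name `best_worst` and the `random_state` value are assumptions chosen for illustration. Selecting a single column with `df["text"]` yields a Series, which is the one-dimensional input CountVectorizer expects.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the filtered DataFrame (illustrative rows only).
best_worst = pd.DataFrame({
    "stars": [5, 1, 5, 1, 5, 1, 5, 1],
    "text": ["great", "awful", "loved it", "hated it",
             "fantastic", "terrible", "amazing", "rude staff"],
})

X = best_worst["text"]   # single brackets -> pandas Series
y = best_worst["stars"]

# random_state makes the shuffle reproducible; the value itself is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape, X_test.shape)
```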

## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.
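A minimal sketch of this step with made-up review text. The vocabulary is learned from the training text only (`fit_transform`), and the test text is mapped onto that same vocabulary (`transform`), so both matrices have identical columns and words that appear only in the test set are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy text for illustration only.
X_train = ["great food great service", "terrible food", "great place"]
X_test = ["terrible place awful"]

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # learn vocabulary, then count
X_test_dtm = vect.transform(X_test)        # reuse the training vocabulary

print(sorted(vect.vocabulary_))            # learned terms
print(X_train_dtm.shape, X_test_dtm.shape)
```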

## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** `from sklearn.naive_bayes import MultinomialNB`, then `nb = MultinomialNB()`.
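The whole task can be sketched end to end on a handful of invented reviews. The text and variable names are assumptions for illustration; results on `data.csv` will differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Toy training and testing data (illustrative only).
train_text = ["great food friendly staff", "great place love it",
              "awful food rude staff", "terrible service awful place"]
train_stars = [5, 5, 1, 1]
test_text = ["great friendly place", "awful rude service"]
test_stars = [5, 1]

vect = CountVectorizer()
nb = MultinomialNB()
nb.fit(vect.fit_transform(train_text), train_stars)

pred = nb.predict(vect.transform(test_text))
acc = metrics.accuracy_score(test_stars, pred)
cm = metrics.confusion_matrix(test_stars, pred)  # rows: true, columns: predicted

print(acc)
print(cm)
```

In the confusion matrix, rows and columns follow the sorted class labels, so here the first row and column correspond to 1-star reviews and the second to 5-star reviews.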

In [1]:
import pandas as pd
import numpy as np
df=pd.read_csv('data.csv')

In [2]:
df.describe()

| | stars | cool | useful | funny |
|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 3.7775 | 0.8768 | 1.4093 | 0.7013 |
| std | 1.214636 | 2.067861 | 2.336647 | 1.907942 |
| min | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 3.0 | 0.0 | 0.0 | 0.0 |
| 50% | 4.0 | 0.0 | 1.0 | 0.0 |
| 75% | 5.0 | 1.0 | 2.0 | 1.0 |
| max | 5.0 | 77.0 | 76.0 | 57.0 |


In [3]:
new_df=pd.concat([df[df.stars==1],df[df.stars==5]],sort=False)

In [4]:
X=pd.Series(new_df.text)
y=pd.Series(new_df.stars)

In [5]:
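import nltk
import re
from nltk.corpus import stopwords
#requires the NLTK stopwords corpus and tokenizer data (fetch once with nltk.download)
#a set makes the "not in stop" lookups in preprocess fast
stop = set(stopwords.words('english'))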

In [6]:
def preprocess(text_data):
    tokenized_text=[]
    for text in text_data:
        #replace punctuation, digits and underscores with spaces, then lowercase
        text=re.sub(r'[^\w\s]|[0-9_]'," ",text).lower()
        #tokenize the text
        tokenized=nltk.word_tokenize(text)
        #remove stop words and rejoin the remaining tokens into one string
        tokenized_text.append(" ".join(x for x in tokenized if x not in stop))
    return tokenized_text
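To see what the regex step in `preprocess` does on its own, here is a small sketch on an invented sentence. It uses plain `str.split` in place of `nltk.word_tokenize` so that it runs without any NLTK data; that substitution is an assumption made for illustration only.

```python
import re

sample = "Great food!! 10/10, would_visit again."

# Same pattern as in preprocess: anything that is not a word character or
# whitespace, plus digits and underscores, becomes a space.
cleaned = re.sub(r'[^\w\s]|[0-9_]', " ", sample).lower()

tokens = cleaned.split()  # whitespace split stands in for the tokenizer
print(tokens)
```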

In [7]:
x=preprocess(X)

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
vectorizer =CountVectorizer(stop_words='english')
text=vectorizer.fit_transform(x)
len(vectorizer.vocabulary_)

18487

In [10]:
text.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [11]:
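#reuse the matrix fitted above instead of refitting the vectorizer;
#get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
df1=pd.DataFrame(text.toarray(),columns=vectorizer.get_feature_names_out())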

In [12]:
count=df1.sum()
count=count.to_frame()
count.columns=['frequency']

In [13]:
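#encode the response as 0 (1-star) and 1 (5-star); this relies on new_df
#listing every 1-star review before the 5-star reviews (see the concat above)
list_class=[]
class_1Star=np.zeros(len(y[y==1]),dtype=int)
class_5Star=np.ones(len(y[y==5]),dtype=int)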

In [14]:
#append the 1-star labels first, then the 5-star labels, matching the row order of df1
list_class.extend(class_1Star)
list_class.extend(class_5Star)

In [15]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(df1,list_class,test_size=0.5,random_state=30)

In [16]:
#for analyzing the result of your model
from sklearn import metrics
#multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
NB=MultinomialNB()

In [17]:
NB.fit(x_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [18]:
y_pred=NB.predict(x_test)
y_pred

array([1, 1, 0, ..., 1, 1, 1])

In [19]:
metrics.accuracy_score(y_test,y_pred)

0.9162995594713657

In [20]:
metrics.confusion_matrix(y_test,y_pred)

array([[ 214,  134],
       [  37, 1658]])
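The confusion matrix above can be broken down further. The sketch below recomputes the accuracy from it and derives per-class recall, assuming scikit-learn's convention (rows are true classes, columns are predicted classes) and the notebook's encoding of 0 for 1-star and 1 for 5-star. The "null accuracy" line is an added comparison point: the accuracy obtained by always predicting the majority class.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[214, 134],
               [37, 1658]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall_1star = cm[0, 0] / cm[0].sum()      # share of true 1-star reviews found
recall_5star = cm[1, 1] / cm[1].sum()      # share of true 5-star reviews found
null_accuracy = cm.sum(axis=1).max() / cm.sum()  # always predict majority class

print(round(accuracy, 4), round(recall_1star, 4),
      round(recall_5star, 4), round(null_accuracy, 4))
```

On these numbers the model recovers 5-star reviews far more reliably than 1-star reviews, which is worth keeping in mind when reading the overall accuracy, since 5-star reviews make up most of the test set.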