<center> <img src="res/ds3000.png"> </center>

<center> <h2>Sentiment Analysis Part 2</h2></center>

## Outline
1. <a href='#1'>TfidfVectorizer</a>
2. <a href='#2'>Stop Words</a>
3. <a href='#3'>min_df</a>
4. <a href='#4'>ngrams</a>



<a id="1"></a>

## 1. TfidfVectorizer

In [8]:
import pandas as pd
data = pd.read_csv("game_review.csv")

In [9]:
data.head()

| Unnamed: 0 | gameID | comment | sentiment |
|---|---|---|---|
| 0 | 345650 | Is Without Withinnbspworth your time Nonbs... | 0 |
| 1 | 289090 | My playtime h based on steam Grindy Achieve... | 0 |
| 2 | 350090 | No Pineapple Left Behind | 0 |
| 3 | 409720 | PRESS SPACE TO CRASH | 0 |
| 4 | 364360 | Reason Why Chinese Gamer Give the ShXt to W... | 0 |


In [10]:
features = data["comment"]
target = data["sentiment"]

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)


#create the vocabulary based on the training data
vect = TfidfVectorizer().fit(X_train)

#encode the words in X_train and X_test based on the vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

#train the classifier
model = MultinomialNB(alpha = 0.5).fit(X=X_train_vectorized, y=y_train)


print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))
print("Number of features used: ", len(vect.get_feature_names_out()))

Classification accuracy on training set:  0.942535958664991
Classification accuracy on testing set:  0.7886910994764398
Number of features used:  87785
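
To make the weighting concrete, here is a minimal pure-Python sketch of a tf-idf computation. It assumes scikit-learn's documented defaults (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The toy corpus and the helper names `tokenize` and `tfidf_rows` are hypothetical; this illustrates the idea and is not the library's implementation.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, not taken from game_review.csv
docs = ["good game", "bad game", "good good fun"]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters,
    # mirroring TfidfVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times idf
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize so every document vector has length 1
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the second document, "bad" (which appears in one document) receives a larger weight than "game" (which appears in two), which is the effect tf-idf is designed to produce.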


### 1.1. Making Predictions

In [12]:
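def predict_sentiment(comment):
    # comment must be a list containing one string, e.g. ["great game"]
    comment_features = vect.transform(comment)
    # predict returns an array; take the label of the single comment
    sentiment = model.predict(comment_features)[0]

    if sentiment == 1:
        return "Positive"
    else:
        return "Negative"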

In [13]:
predict_sentiment(["it is a good game, not bad at all"])

'Negative'
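
One likely reason for this misclassification: with single-word features, the vectorizer records which words occur but not their order. The sketch below uses plain Python with a regex that mirrors `TfidfVectorizer`'s default token pattern (`unigram_counts` is a hypothetical helper, not part of scikit-learn). It shows that this review and its opposite reduce to the same bag of words.

```python
import re
from collections import Counter

def unigram_counts(text):
    # Lowercase, then keep tokens of 2+ word characters (so "a" is dropped)
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

positive = "it is a good game, not bad at all"
negative = "it is a bad game, not good at all"

# Both sentences contain exactly the same words the same number of times
same_features = unigram_counts(positive) == unigram_counts(negative)
print(same_features)  # True
```

Because the two sentences produce identical unigram features, a classifier trained on those features has to assign them the same label.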

<a id="2"></a>

## 2. TfidfVectorizer with Stop Words

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)


#create the vocabulary based on the training data
vect = TfidfVectorizer(stop_words = "english").fit(X_train)

#encode the words in X_train and X_test based on the vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

#train the classifier
model = MultinomialNB(alpha = 0.5).fit(X=X_train_vectorized, y=y_train)


print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))
print("Number of features used: ", len(vect.get_feature_names_out()))

Classification accuracy on training set:  0.9511939673229995
Classification accuracy on testing set:  0.7861780104712042
Number of features used:  87484
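
Stop-word removal only drops a few hundred features here (87785 down to 87484), because a stop list is a fixed, short set of very common words. The sketch below illustrates the filtering step with a deliberately tiny, hypothetical stop-word set; scikit-learn's built-in `"english"` list is much longer, and `tokens_without_stop_words` is an illustrative helper rather than a library function.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only
STOP_WORDS = {"it", "is", "at", "all", "not"}

def tokens_without_stop_words(text, stop_words=STOP_WORDS):
    # Tokenize like the default vectorizer, then drop stop words
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]

kept = tokens_without_stop_words("it is a good game, not bad at all")
print(kept)  # ['good', 'game', 'bad']
```

Negation words such as "not" often appear on stop-word lists, so filtering can discard information that matters for sentiment.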


<a id="3"></a>

## 3. TfidfVectorizer with min_df

In [16]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)


#create the vocabulary based on the training data
vect = TfidfVectorizer(stop_words="english", min_df = 5).fit(X_train)

#encode the words in X_train and X_test based on the vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

#train the classifier
model = MultinomialNB(alpha = 0.5).fit(X=X_train_vectorized, y=y_train)


print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))
print("Number of features used: ", len(vect.get_feature_names_out()))

Classification accuracy on training set:  0.8687334171205139
Classification accuracy on testing set:  0.7886910994764398
Number of features used:  8091
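
An integer `min_df` keeps only the terms that occur in at least that many documents, which is why the vocabulary shrinks so sharply. A minimal sketch of that filter, on a hypothetical toy corpus and with an illustrative helper named `vocabulary_with_min_df`:

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from game_review.csv
docs = [
    "great game",
    "great graphics",
    "great soundtrack, buggy game",
]

def vocabulary_with_min_df(docs, min_df):
    # Count, for each term, the number of documents it appears in
    df = Counter()
    for doc in docs:
        df.update(set(re.findall(r"\b\w\w+\b", doc.lower())))
    # Keep only terms whose document frequency reaches the threshold
    return sorted(t for t, n in df.items() if n >= min_df)

print(vocabulary_with_min_df(docs, min_df=1))
# ['buggy', 'game', 'graphics', 'great', 'soundtrack']
print(vocabulary_with_min_df(docs, min_df=2))
# ['game', 'great']
```

Terms that appear in a single document (often typos or fused words) are removed, which reduces the gap between training and testing accuracy seen above.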


<a id="4"></a>

## 4. TfidfVectorizer with ngrams

In [19]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)


#create the vocabulary based on the training data
vect = TfidfVectorizer(min_df = 5, ngram_range = (1,2)).fit(X_train)

#encode the words in X_train and X_test based on the vocabulary
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

#train the classifier
model = MultinomialNB(alpha = 0.5).fit(X=X_train_vectorized, y = y_train)


print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))
print("Number of features used: ", len(vect.get_feature_names_out()))

Classification accuracy on training set:  0.9162826420890937
Classification accuracy on testing set:  0.8272251308900523
Number of features used:  28068


In [20]:
vect.get_feature_names_out()[::1000].tolist()

['aa',
 'and cant',
 'arkham games',
 'being on',
 'cant click',
 'could',
 'does not',
 'evolve',
 'for no',
 'gameplayafter',
 'had really',
 'if youve',
 'isbut',
 'km',
 'makes',
 'movie',
 'of cod',
 'other hand',
 'pop culture',
 'released and',
 'settings',
 'sprites',
 'than to',
 'the payday',
 'thisbut',
 'tools to',
 'wars and',
 'with horrible',
 'youre supposed']
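
The two-word entries in this list come from `ngram_range = (1,2)`, which adds every pair of adjacent tokens to the vocabulary alongside the single words. The sketch below shows how such features could be generated in plain Python; `word_ngrams` is a hypothetical helper that approximates the behavior, not scikit-learn's code.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    # Tokenize like the default vectorizer (lowercase, 2+ word characters)
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

positive = set(word_ngrams("it is a good game, not bad at all"))
negative = set(word_ngrams("it is a bad game, not good at all"))

print(sorted(positive - negative))
# ['bad at', 'good game', 'is good', 'not bad']
print(sorted(negative - positive))
# ['bad game', 'good at', 'is bad', 'not good']
```

The single words are shared by both sentences, but bigrams such as "not bad" and "not good" differ, giving the classifier features that capture negation.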

### 4.1. Making Predictions

In [21]:
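def predict_sentiment(comment):
    # comment must be a list containing one string, e.g. ["great game"]
    comment_features = vect.transform(comment)
    # predict returns an array; take the label of the single comment
    sentiment = model.predict(comment_features)[0]

    if sentiment == 1:
        return "Positive"
    else:
        return "Negative"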

In [22]:
predict_sentiment(["it is a bad game, not good at all"])

'Negative'

In [23]:
predict_sentiment(["it is a good game, not bad at all"])

'Positive'