<center> <img src="res/ds3000.png"> </center>

<center> <h2>Sentiment Analysis</h2></center>

## Outline
1. <a href='#1'>Sentiment Analysis</a>
2. <a href='#2'>Data Preparation</a>
3. <a href='#3'>Model Training</a>
4. <a href='#4'>Making Predictions</a>



## 1. Sentiment Analysis
* Text must be represented numerically before a model can learn from it
* Each review needs a target variable storing its sentiment class (positive or negative)

### 1.1. Case Study: Predicting Video Game Recommendations from Steam Reviews
* Original dataset: https://zenodo.org/record/1000885#.XdXaH1dKhPY

In [2]:
import pandas as pd
data = pd.read_csv("game_review.csv")

In [3]:
data.head()

| Unnamed: 0 | gameID | comment | sentiment |
|---|---|---|---|
| 0 | 345650 | Is Without Withinnbspworth your time Nonbs... | 0 |
| 1 | 289090 | My playtime h based on steam Grindy Achieve... | 0 |
| 2 | 350090 | No Pineapple Left Behind | 0 |
| 3 | 409720 | PRESS SPACE TO CRASH | 0 |
| 4 | 364360 | Reason Why Chinese Gamer Give the ShXt to W... | 0 |


## 2. Data Preparation

In [4]:
features = data["comment"]
target = data["sentiment"]

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

X_train, X_test, y_train, y_test = train_test_split(features, target, random_state = 3000)


# Build the vocabulary from the training data only, so no information leaks from the test set
vect = CountVectorizer().fit(X_train)

# Encode X_train and X_test as word counts over that vocabulary
# (words in X_test that are not in the vocabulary are ignored)
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

In [6]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
len(vect.get_feature_names_out())

87785

In [7]:
# Every 2000th vocabulary entry, converted from a NumPy array to a plain list
vect.get_feature_names_out()[::2000].tolist()

['aa',
 'alivethe',
 'aquiredhas',
 'badavgnwhat',
 'blankbut',
 'bullsht',
 'charactersthis',
 'commenting',
 'cosistently',
 'dealim',
 'directable',
 'duckthen',
 'ennemies',
 'extentbut',
 'fiver',
 'fulfill',
 'gametheres',
 'grates',
 'headtohead',
 'httpwwwgamesindustrybizarticlesjagexswildchild',
 'insteadtheres',
 'jerkyunpredictable',
 'legionto',
 'machanicfps',
 'messagesnorufusthe',
 'morein',
 'nomoreprogress',
 'onle',
 'partsyou',
 'playerbought',
 'priceone',
 'rami',
 'replayvaluethe',
 'russiandota',
 'seriesone',
 'skins',
 'splitscreen',
 'stuffwell',
 'tediousadditionallythe',
 'throttleor',
 'trilogyit',
 'updateanother',
 'warperhapsit',
 'workbuy']

## 3. Model Training
* Multinomial Naive Bayes is a widely used algorithm for text classification, including sentiment analysis
* https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
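
To illustrate what the `alpha` parameter of `MultinomialNB` does, the sketch below recomputes the smoothed per-class word probabilities by hand and compares them with the fitted model's `feature_log_prob_` attribute. The count matrix and labels are made-up toy data; only `alpha = 0.5` matches the value used in this notebook.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (hypothetical): rows are documents, columns are word counts
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([1, 1, 0, 0])

alpha = 0.5
toy_model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Additive smoothing, per class:
# (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
manual_log_prob = []
for c in toy_model.classes_:
    word_counts = X_toy[y_toy == c].sum(axis=0)
    smoothed = (word_counts + alpha) / (word_counts.sum() + alpha * X_toy.shape[1])
    manual_log_prob.append(np.log(smoothed))
manual_log_prob = np.array(manual_log_prob)

# The hand-computed values should agree with what the model learned
matches = np.allclose(manual_log_prob, toy_model.feature_log_prob_)
print(matches)
```

Because `alpha` is added to every count, a word never seen in a class still gets a small nonzero probability instead of zeroing out the whole prediction.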

In [8]:
# train the classifier
model = MultinomialNB(alpha = 0.5).fit(X=X_train_vectorized, y=y_train)


print("Classification accuracy on training set: ", model.score(X_train_vectorized, y_train))
print("Classification accuracy on testing set: ", model.score(X_test_vectorized, y_test))

Classification accuracy on training set:  0.945468509984639
Classification accuracy on testing set:  0.7878534031413612


## 4. Making Predictions

In [9]:
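def predict_sentiment(comment):
    """Classify a one-element list holding a single review as 'Positive' or 'Negative'."""
    # vect.transform expects an iterable of strings, so the review is passed inside a list
    comment_features = vect.transform(comment)
    sentiment = model.predict(comment_features)

    # model.predict returns an array; compare its single element explicitly
    if sentiment[0] == 1:
        return 'Positive'
    else:
        return 'Negative'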

In [10]:
predict_sentiment(["What a great game!"])

'Positive'
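
`predict_sentiment` handles one review at a time. A possible extension for labeling several reviews in one call is sketched below. To keep it self-contained it trains a tiny model on made-up comments; `toy_vect`, `toy_model`, and `predict_sentiments` are hypothetical names standing in for the notebook's `vect`, `model`, and `predict_sentiment`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical): 1 = positive, 0 = negative
toy_comments = ["great game", "love this game", "terrible game", "boring and terrible"]
toy_labels = [1, 1, 0, 0]

toy_vect = CountVectorizer().fit(toy_comments)
toy_model = MultinomialNB(alpha=0.5).fit(toy_vect.transform(toy_comments), toy_labels)

def predict_sentiments(comments):
    """Return one 'Positive'/'Negative' label per comment in the list."""
    predictions = toy_model.predict(toy_vect.transform(comments))
    return ['Positive' if p == 1 else 'Negative' for p in predictions]

labels = predict_sentiments(["great game", "terrible and boring"])
print(labels)
```

Vectorizing and predicting the whole list at once avoids calling the model separately for each review.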