## Reviews sentiment analysis
This dataset contains reviews of various businesses and services, ranging from car tyre repairs and restaurant customer service to hotel stays. Since the instructions on Kaggle are rather unclear, I decided to first categorize the data into 1 (positive sentiment) and 0 (negative sentiment).

Note that rating grades 1, 2 and 3 are considered negative sentiment, while rating grades 4 and 5 are considered positive sentiment.
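As a minimal sketch of this rule, assuming each raw line starts with its rating digit (as the sample review printed further down suggests). The helper name `rating_to_sentiment` and the second example line are hypothetical:

```python
def rating_to_sentiment(raw_line):
    """Map a raw review line to 1 (positive) or 0 (negative)."""
    # Assumption: the first character of the line is the rating grade (1-5)
    rating = int(raw_line[0])
    return 1 if rating >= 4 else 0

print(rating_to_sentiment("4\tThank you thank you thank you !!"))
print(rating_to_sentiment("2\tA made-up negative example"))
```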

We will be using NLTK's Naive Bayes Classifier to classify the sentiments.

In [1]:
# Importing required libraries
import pandas as pd
import numpy as np
import re #regex methods
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from os.path import join
from bs4 import BeautifulSoup

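# Retrieving our training dataset
# sep='delimiter' is a separator string that is not expected to occur in the reviews,
# so each line is read whole into a single column
train_x = pd.read_csv("train.csv", sep='delimiter', header=None, engine='python', names=['Initial Review'])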
#print(train_x)

# Retrieving our test dataset
test_x = pd.read_csv("test.csv", sep='delimiter', header=None, engine='python', names=['Initial Review'])
#print(test_x)

In [2]:
# Extracting rating grades (the first character of each line) into a Rating column
# A vectorized assignment avoids chained indexing such as train_x['Rating'][i] = ...
train_x['Rating'] = train_x['Initial Review'].str[0].astype(int)

# Creating a Sentiment column: 1 for rating grades 4-5, 0 for rating grades 1-3
train_x['Sentiment'] = (train_x['Rating'] >= 4).astype(int)

Now that the reviews are categorized as either positive or negative, let's proceed with "cleaning" the data: tokenizing each review to extract its words.
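The cleaning step can be sketched on a single string as follows. This is an illustration only: `tokenize` and `SAMPLE_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's English list used in the notebook.

```python
import re

# Hypothetical, tiny stopword set for illustration; the notebook uses NLTK's English list
SAMPLE_STOPWORDS = {"i", "the", "to", "you", "that", "this", "a", "with"}

def tokenize(raw_line, stopwords):
    """Lowercase, split on non-word characters, keep unique non-stopword tokens."""
    tokens = re.split(r"\W+", raw_line.lower())
    words = []
    for token in tokens[1:]:  # token 0 is the rating digit
        if token and token not in stopwords and token not in words:
            words.append(token)
    return words

print(tokenize("4\tThank you thank you !! I want to thank the people", SAMPLE_STOPWORDS))
```

Splitting on `\W+` treats any run of punctuation or whitespace as one separator, so `" !! "` does not produce empty tokens in the middle of a review.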

In [3]:
print(train_x['Initial Review'][0])

4	Thank you thank you thank you !! I  want to thank the people that made this place happen ....you have made all my dreams come true. Imagine a delicious yogurt shop with super fun flavors like peanut butter, chocolate mint, cake batter and so many more.  I used to have to travel to  Yogurtland or Jujuberry  but not anymore , now we have one right in the hood!! Guess what?   instead of eating a normal lunch I can pig out with a healthy peanut butter yogurt smothered in chocolate chips.   Could be the perfect lunch!!  See you there!


In [4]:
# Importing nltk library
import nltk

# Download only 'stopwords' resource and not nltk.download() as it could take very long
nltk.download('stopwords')

# Set of stopwords (membership checks on a set are much faster than on a list)
stopwords_dict = set(stopwords.words('english'))

# Creating a list of empty lists
list_of_emptylists = []
for i in range(len(train_x['Initial Review'])):
    list_of_emptylists.append([])
    
# Creating a new column to store individual words in each review (training data)
train_x['Individual Words'] = list_of_emptylists

for sentence in range(len(train_x['Initial Review'])):
    # Lowercase, then split on runs of non-word characters (this drops punctuation)
    splitted_sentence = re.split(r'\W+', train_x['Initial Review'][sentence].lower())

    # Skip token 0 (the rating digit); keep each non-empty, non-stopword token once
    for i in range(1,len(splitted_sentence)):
        if splitted_sentence[i] not in train_x['Individual Words'][sentence] and splitted_sentence[i] != '' and splitted_sentence[i] not in stopwords_dict:
            train_x['Individual Words'][sentence].append(splitted_sentence[i])

[nltk_data] Downloading package stopwords to C:\Users\Terence
[nltk_data]     Lim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



In [18]:
# Creating a list of empty lists
list_of_emptylists_test = []
for i in range(len(test_x['Initial Review'])):
    list_of_emptylists_test.append([])

# Creating a new column to store individual words in each review (test data)
test_x['Individual Words'] = list_of_emptylists_test

for sentence in range(len(test_x['Initial Review'])):
    # Lowercase, then split on runs of non-word characters (this drops punctuation)
    splitted_sentence = re.split(r'\W+', test_x['Initial Review'][sentence].lower())
    
    # As with the training data, skip token 0 and keep each non-empty, non-stopword token once
    for i in range(1,len(splitted_sentence)):
        if splitted_sentence[i] not in test_x['Individual Words'][sentence] and splitted_sentence[i] != '' and splitted_sentence[i] not in stopwords_dict:
            test_x['Individual Words'][sentence].append(splitted_sentence[i])

Since stemming can often create non-existent words, I decided to go with lemmatization (where the outputs are actual words).
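To see the difference in practice, here is a quick sketch that runs only NLTK's `PorterStemmer`, which needs no corpus download. The word list is an arbitrary pick from the sample review printed earlier, and the exact stems may vary with the NLTK version:

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Words taken from the sample review; chosen only for illustration
for word in ["smothered", "delicious", "flavors"]:
    print(word, "->", porter.stem(word))
```

Stems are produced by stripping suffixes with fixed rules, so a word like "delicious" typically comes out truncated rather than as a dictionary word.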

In [19]:
# Importing required library
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

# Download only 'wordnet' resource
nltk.download('wordnet')

# Syntax for using Stemmer if you want to check out how it works
#porter = PorterStemmer()
#print(porter.stem('smothered'))

lemmatizer = WordNetLemmatizer()

# Lemmatize words in training data (pos 'v' lemmatizes every word as a verb)
for sentence in range(len(train_x['Individual Words'])):
    for word in range(len(train_x['Individual Words'][sentence])):
        train_x['Individual Words'][sentence][word] = lemmatizer.lemmatize(train_x['Individual Words'][sentence][word], 'v')

# Lemmatize sentences in test data
for sentence in range(len(test_x['Individual Words'])):
    for word in range(len(test_x['Individual Words'][sentence])):
        test_x['Individual Words'][sentence][word] = lemmatizer.lemmatize(test_x['Individual Words'][sentence][word], 'v')

[nltk_data] Downloading package wordnet to C:\Users\Terence
[nltk_data]     Lim\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Since we are building a Naive Bayes classifier with the NLTK library, each review must be converted into the input format it expects: a dictionary that maps every word in the review to True.
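Concretely, one training example is a `(features, label)` pair. The sketch below shows that shape on a made-up word list; the words and the label are assumptions chosen for illustration:

```python
def create_word_features(words):
    # Each word becomes a key mapped to True; repeated words collapse into one key
    return {word: True for word in words}

features = create_word_features(["thank", "want", "people", "thank"])
print(features)

# A labeled example in the (features, label) form used to train the classifier
labeled_example = (features, "positive")
print(labeled_example)
```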

In [24]:
# Importing required libraries
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.tokenize import word_tokenize

# Converting to a form where it fits how the Naive Bayes classifier expects the input
# A dictionary is used because if a word already exists, it will not be added to the dictionary
def create_word_features(words):
    my_dict = dict([(word, True) for word in words])
    return my_dict

# Creating a list to store negative and positive reviews
neg_reviews = []
pos_reviews = []
for sentence in range(len(train_x['Initial Review'])):
    if train_x['Sentiment'][sentence] == 0:
        neg_reviews.append((create_word_features(train_x['Individual Words'][sentence]), "negative"))
    elif train_x['Sentiment'][sentence] == 1:
        pos_reviews.append((create_word_features(train_x['Individual Words'][sentence]), "positive"))

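# Splitting into training and test sets (roughly 70% / 30% of each class)
train_set = neg_reviews[:20546] + pos_reviews[:36900]
test_set = neg_reviews[20546:] + pos_reviews[36900:]

# Number of negative reviews, positive reviews, training examples and test examples
print(len(neg_reviews))
print(len(pos_reviews))
print(len(train_set))
print(len(test_set))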

29351
52714
57446
24619
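Assuming the first two numbers printed above are the sizes of the negative and positive classes, the slice indices 20546 and 36900 correspond to 70% of each class. A sketch that computes the cut point instead of hard-coding it (the helper `split_by_fraction` is hypothetical):

```python
def split_by_fraction(items, train_fraction=0.7):
    """Return (train, test) slices of items, cut at round(train_fraction * len(items))."""
    cut = round(train_fraction * len(items))
    return items[:cut], items[cut:]

# With the class sizes printed above, the cut points match the hard-coded indices
print(round(0.7 * 29351), round(0.7 * 52714))

# Small usage example on a made-up list
train_part, test_part = split_by_fraction(list(range(10)))
print(len(train_part), len(test_part))
```

Splitting each class separately, as the notebook does, keeps the share of positive and negative reviews the same in both sets.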


In [27]:
# Creating the Naive Bayes Classifier
classifier = NaiveBayesClassifier.train(train_set)

accuracy = nltk.classify.util.accuracy(classifier, test_set)
print("Accuracy:" + str(accuracy*100) + "%")

Accuracy:71.31890003655712%
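One way to put this number in context is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most common label. A minimal sketch, with a hypothetical helper `majority_baseline` and made-up toy data in the same `(features, label)` format as `test_set`:

```python
from collections import Counter

def majority_baseline(labeled_set):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(label for _, label in labeled_set)
    return counts.most_common(1)[0][1] / len(labeled_set)

# Hypothetical toy data, for illustration only
toy_set = [
    ({"good": True}, "positive"),
    ({"great": True}, "positive"),
    ({"bad": True}, "negative"),
]
print(majority_baseline(toy_set))
```

In the notebook, `majority_baseline(test_set)` would give the figure to beat. Assuming the four numbers printed earlier are the class and split sizes, the held-out set has 15,814 positive reviews out of 24,619, which puts that baseline at roughly 64%.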


An accuracy of 71.3% is pretty decent, but there are still various areas that can be improved. The word-splitting step definitely needs work, and more time and effort can be put into understanding the dataset and making it more suitable for sentiment analysis.

Some possible areas of exploration for sentiment analysis would be Beautiful Soup as well as TfidfVectorizer. Until then, this remains a simple Naive Bayes classification model for classifying sentiment.
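As a starting point for that exploration, here is a minimal sketch that pairs `TfidfVectorizer` with the `LogisticRegression` class imported at the top of the notebook. It is not the model evaluated above, and the toy reviews, labels and pipeline choices are assumptions made for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews and labels (1 = positive, 0 = negative)
texts = [
    "delicious yogurt and super fun flavors",
    "perfect lunch, see you there",
    "terrible service and a long wait",
    "rude staff, never coming back",
]
labels = [1, 1, 0, 0]

# TF-IDF turns each review into a weighted bag-of-words vector;
# logistic regression then learns one weight per vocabulary term
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(texts, labels)

predictions = model.predict(["fun flavors and a delicious lunch"])
print(predictions)
```

On the real data, the same pipeline could be scored with `cross_val_score`, which the notebook already imports.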