# SENTIMENT ANALYSIS WITH NAIVE BAYES CLASSIFIER

Sentiment analysis, also known as opinion mining, is a technique in natural language processing and machine learning for determining the sentiment or emotional tone expressed in text. In this project each review is labelled either positive (1) or negative (0). The goal is to classify a combined dataset of Amazon, IMDb, and Yelp text reviews using a Multinomial Naive Bayes classifier.

## Importing Libraries

In [29]:
import pandas as pd
import numpy as np
import os
import nltk
import matplotlib.pyplot as plt

## Importing Datasets

In [3]:
#importing the datasets
df_amazon = pd.read_csv(r"C:\Users\VAGDEVI\Desktop\sentiment labelled sentences\amazon_cells_labelled.txt", delimiter = '\t', quoting = 3, 
                header = None, names = ['Review', 'Score'])

df_imdb = pd.read_csv(r"C:\Users\VAGDEVI\Desktop\sentiment labelled sentences\imdb_labelled.txt", delimiter = '\t', quoting = 3, 
                header = None, names = ['Review', 'Score'])

df_yelp = pd.read_csv(r"C:\Users\VAGDEVI\Desktop\sentiment labelled sentences\yelp_labelled.txt", delimiter = '\t', 
                        quoting = 3, header = None, 
                        names = ['Review', 'Score'])
print(df_amazon.shape)
print(df_imdb.shape)
print(df_yelp.shape)

(1000, 2)
(1000, 2)
(1000, 2)


## Combining Dataframes

In [4]:
# Combining dataframes into one

df = pd.concat([df_yelp, df_imdb, df_amazon], ignore_index = True)

In [5]:
df.shape

(3000, 2)

In [6]:
df

| | Review | Score |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 2995 | The screen does get smudged easily because it ... | 0 |
| 2996 | What a piece of junk.. I lose more calls on th... | 0 |
| 2997 | Item Does Not Match Picture. | 0 |
| 2998 | The only thing that disappoint me is the infra... | 0 |


## Text Transformation

In [7]:
import re  # regular expressions

# Text transformation
# Cast every review to a string, then lowercase it
df["lower"] = df["Review"].astype(str).str.lower()
# Replace each run of characters other than letters, digits, and spaces with one space
df["lower"] = df["lower"].apply(lambda x: re.sub(r"[^A-Za-z0-9 ]+", " ", x))


In [8]:
df

| | Review | Score | lower |
|---|---|---|---|
| 0 | Wow... Loved this place. | 1 | wow loved this place |
| 1 | Crust is not good. | 0 | crust is not good |
| 2 | Not tasty and the texture was just nasty. | 0 | not tasty and the texture was just nasty |
| 3 | Stopped by during the late May bank holiday of... | 1 | stopped by during the late may bank holiday of... |
| 4 | The selection on the menu was great and so wer... | 1 | the selection on the menu was great and so wer... |
| ... | ... | ... | ... |
| 2995 | The screen does get smudged easily because it ... | 0 | the screen does get smudged easily because it ... |
| 2996 | What a piece of junk.. I lose more calls on th... | 0 | what a piece of junk i lose more calls on thi... |
| 2997 | Item Does Not Match Picture. | 0 | item does not match picture |
| 2998 | The only thing that disappoint me is the infra... | 0 | the only thing that disappoint me is the infra... |
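
The cleaning step can be tried on single strings outside the dataframe. The sketch below is a minimal stand-alone version, assuming the same lowercase-then-regex logic; the helper name `clean_text` and the extra whitespace-collapsing step are illustrative additions, not part of the notebook.

```python
import re

def clean_text(text):
    """Lowercase, replace non-alphanumeric runs with a space, collapse whitespace."""
    text = str(text).lower()
    # Same idea as the notebook's regex; input is already lowercase here
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    # Extra step (not in the notebook): squeeze repeated spaces and trim the ends
    return " ".join(text.split())

print(clean_text("Wow... Loved this place."))  # wow loved this place
print(clean_text("Crust is not good."))        # crust is not good
```

Without the final `split`/`join`, the regex alone leaves double spaces wherever punctuation sat next to a space.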


## Tokenization
Tokenize the preprocessed text data into individual words using the `word_tokenize` function from the `nltk` library.

In [10]:
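import nltk
from nltk import word_tokenize
nltk.download('stopwords')
# Note: word_tokenize also needs NLTK's Punkt tokenizer data to be installed
# Split each cleaned review into a list of word tokens
tokens_text = [word_tokenize(str(word)) for word in df["lower"]]
# Flatten the per-review lists; the set below counts unique tokens
tokens_counter = [item for sublist in tokens_text for item in sublist]
print("Number of tokens: ", len(set(tokens_counter)))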

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\VAGDEVI\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of tokens:  5185


## Stopword Removal
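Load the English stopword list from the `nltk` corpus. Note that the list is only loaded in this step: the `CountVectorizer` used later is created with default settings, so stopwords are not actually removed from the features.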

In [11]:
#Choosing english stopwords
stopwords_nltk = nltk.corpus.stopwords
stop_words = stopwords_nltk.words('english')
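
Applying a stopword list to tokenized text could look like the sketch below. It is only an illustration: `SAMPLE_STOP_WORDS` is a hypothetical, abbreviated list used so the example runs without NLTK data, and `remove_stopwords` is an assumed helper name rather than something defined in the notebook.

```python
# Hypothetical, abbreviated stopword list for illustration only;
# the notebook's full list comes from stopwords.words('english').
SAMPLE_STOP_WORDS = {"the", "is", "and", "was", "this", "a", "of"}

def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in the stopword collection."""
    stop_set = set(stop_words)  # set membership tests are O(1)
    return [tok for tok in tokens if tok not in stop_set]

tokens = "not tasty and the texture was just nasty".split()
print(remove_stopwords(tokens, SAMPLE_STOP_WORDS))
# ['not', 'tasty', 'texture', 'just', 'nasty']
```

The full NLTK list also contains negation words such as "not", so applying it blindly may remove cues that matter for sentiment.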

## Creating Predictor and Response Vectors

In [12]:
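# Create predictor and response vectors
# X is the raw Review column (not the cleaned "lower" column);
# CountVectorizer lowercases and tokenizes the text itself by default
X = df.iloc[:, 0]
y = df.iloc[:, 1]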

## Splitting Data into Training and Testing Sets

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                   random_state = 1)

## Creating Document Term Matrix (DTM) with CountVectorizer

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate vectorizer
vect = CountVectorizer()

In [15]:
# Learn vocabulary of training data and create document term matrix

X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<2400x4493 sparse matrix of type '<class 'numpy.int64'>'
	with 25076 stored elements in Compressed Sparse Row format>

In [16]:
# Transform test set to create document term matrix from already learned
# vocabulary

X_test_dtm = vect.transform(X_test) 
X_test_dtm

<600x4493 sparse matrix of type '<class 'numpy.int64'>'
	with 5791 stored elements in Compressed Sparse Row format>
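
To make the fit/transform distinction concrete, here is a simplified pure-Python sketch of a document-term matrix. It assumes plain whitespace tokenization, which is not the tokenizer `CountVectorizer` uses, and the function names `fit_vocabulary` and `transform` are illustrative. The point it shows is that the vocabulary comes only from the training documents, so words seen only at test time get no column.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct training token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocab):
    """Build dense count rows; tokens missing from the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

train_docs = ["good food good service", "bad food"]
vocab = fit_vocabulary(train_docs)      # columns: bad, food, good, service
print(transform(train_docs, vocab))     # [[0, 1, 2, 1], [1, 1, 0, 0]]
print(transform(["good pizza"], vocab)) # [[0, 0, 1, 0]]  ("pizza" is unseen)
```

This is why the training and test matrices above share the same number of columns (4493) even though the test reviews contain words the training set never saw.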

## Training a Multinomial Naive Bayes Classifier

In [17]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

In [30]:
classifier.fit(X_train_dtm, y_train)
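
For intuition about what fitting a multinomial Naive Bayes model involves, the following is a from-scratch sketch, not scikit-learn's implementation. It assumes whitespace tokenization and add-one (Laplace) smoothing with `alpha=1.0`, which matches the default `alpha` of `MultinomialNB`; the toy documents and the names `train_nb` and `predict_nb` are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_docs = Counter(labels)            # number of documents per class
    word_counts = defaultdict(Counter)      # word frequencies per class
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        denom = total + alpha * len(vocab)  # smoothing adds alpha per vocabulary word
        model[c] = {
            "log_prior": math.log(class_docs[c] / len(docs)),
            "log_lik": {w: math.log((word_counts[c][w] + alpha) / denom)
                        for w in vocab},
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, params in model.items():
        score = params["log_prior"]
        for tok in doc.split():
            if tok in params["log_lik"]:    # words outside the vocabulary are skipped
                score += params["log_lik"][tok]
        scores[c] = score
    return max(scores, key=scores.get)

# Toy training data (hypothetical): 1 = positive, 0 = negative
docs = ["loved this place", "great food", "not good", "bad service"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "great place"))  # 1
print(predict_nb(model, "bad service"))  # 0
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.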

## Making Predictions and Evaluating Model

In [19]:
# Make class predictions for X_test_dtm

y_pred_class = classifier.predict(X_test_dtm)

In [20]:
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
accuracy = accuracy_score(y_test, y_pred_class)
print(f'ACCURACY = {accuracy * 100:.3f}%')

ACCURACY = 82.667%


In [21]:
cm = confusion_matrix(y_test, y_pred_class)
cm

array([[243,  47],
       [ 57, 253]], dtype=int64)
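
The confusion matrix can be unpacked into the usual summary metrics by hand. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, both ordered 0 then 1; the variable names are illustrative.

```python
# Confusion matrix from the notebook output: rows = true label, columns = predicted
cm = [[243, 47],
      [57, 253]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all reviews classified correctly
precision = tp / (tp + fp)     # of predicted positives, how many are truly positive
recall = tp / (tp + fn)        # of true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")   # 0.827
print(f"precision = {precision:.3f}")  # 0.843
print(f"recall    = {recall:.3f}")     # 0.816
print(f"f1        = {f1:.3f}")         # 0.830
```

The accuracy computed this way, 496 / 600, agrees with the 82.667% reported by `accuracy_score`.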

In [22]:
# Calculate predicted probabilities for X_test_dtm

y_pred_proba = classifier.predict_proba(X_test_dtm)[:, 1] 

In [23]:
# Calculate Area Under Curve

area_under_curve = roc_auc_score(y_test, y_pred_proba)
print(f'Area Under Curve = {area_under_curve * 100:.3f}%')

Area Under Curve = 90.467%
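
One way to read the AUC value is as the probability that a randomly chosen positive review receives a higher predicted score than a randomly chosen negative one. The sketch below computes it that way on made-up toy scores; it is a quadratic-time illustration with an assumed function name `pairwise_auc`, not how `roc_auc_score` is implemented.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (hypothetical values)
y_toy = [1, 0, 1, 0]
scores_toy = [0.9, 0.4, 0.35, 0.2]
print(pairwise_auc(y_toy, scores_toy))  # 0.75
```

Because AUC depends only on how the scores rank the reviews, it is computed from `predict_proba` rather than from the hard class predictions.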


## Sentiment Prediction Functions

In [31]:
import nltk
import re
from nltk.tokenize import word_tokenize

def preprocess_input(input_text):
    # Convert to lowercase
    input_text = input_text.lower()
    # Remove punctuation
    input_text = re.sub('[^A-Za-z0-9 ]+', ' ', input_text)
    # Tokenize the text
    tokens = word_tokenize(input_text)
    return ' '.join(tokens)

def predict_sentiment(user_input, classifier, vect):
    preprocessed_input = preprocess_input(user_input)
    user_input_dtm = vect.transform([preprocessed_input])
    sentiment = classifier.predict(user_input_dtm)
    # predict returns an array with one label; compare its single element
    return "Positive" if sentiment[0] == 1 else "Negative"

# Example usage:
user_input = " "  # placeholder; replace with the review text to classify
result = predict_sentiment(user_input, classifier, vect)
print(f"Predicted sentiment: {result}")


Predicted sentiment: Negative
