In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv('train.csv')
train_data

Unnamed: 0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^
0,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
1,2,Amazing!,This soundtrack is my favorite music of all ti...
2,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
3,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."
4,2,an absolute masterpiece,I am quite sure any of you actually taking the...
...,...,...,...
3599994,1,Don't do it!!,The high chair looks great when it first comes...
3599995,1,"Looks nice, low functionality",I have used this highchair for 2 kids now and ...
3599996,1,"compact, but hard to clean","We have a small house, and really wanted two o..."
3599997,1,what is it saying?,not sure what this book is supposed to be. It ...


In [3]:
# read_csv treated the first review as the header row. Recover it as a data row and prepend it to the DataFrame
newRow = pd.DataFrame([train_data.columns.tolist()], columns = train_data.columns)
train_data = pd.concat([newRow, train_data], ignore_index = True)

# Renaming the data columns
train_data.columns = ['Polarity', 'Title', 'Text']
train_data

Unnamed: 0,Polarity,Title,Text
0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...
1,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,2,Amazing!,This soundtrack is my favorite music of all ti...
3,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
4,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."
...,...,...,...
3599995,1,Don't do it!!,The high chair looks great when it first comes...
3599996,1,"Looks nice, low functionality",I have used this highchair for 2 kids now and ...
3599997,1,"compact, but hard to clean","We have a small house, and really wanted two o..."
3599998,1,what is it saying?,not sure what this book is supposed to be. It ...
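
The first review ended up in the header because `pd.read_csv` treats the first line of a file as column names by default. A minimal sketch of an alternative, assuming the files really have no header line: pass `header=None` together with `names`, so the first review stays a data row and `Polarity` is parsed as an integer directly. The two-row CSV below is made-up sample data, not taken from `train.csv`.

```python
import io
import pandas as pd

# Made-up two-row sample standing in for a headerless review file
sample_csv = io.StringIO(
    '2,Great CD,"Lovely voice, great lyrics"\n'
    '1,Broke fast,Stopped working after a week\n'
)

# header=None keeps the first line as data; names supplies the column labels
sample = pd.read_csv(sample_csv, header=None, names=['Polarity', 'Title', 'Text'])

print(sample.shape)
print(sample['Polarity'].dtype)
```

With this approach the manual `concat` step and the later `astype(int)` cast would likely become unnecessary.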


### The Polarity column marks each review as negative or positive (1 == Negative, 2 == Positive)
### The Title column holds the title of the review and the Text column holds the review itself

In [4]:
# The recovered header row stores Polarity as a string ('2'), so cast the whole column to int
train_data['Polarity'] = train_data['Polarity'].astype(int)

In [5]:
# Map the integer Polarity codes to readable string labels
train_data['Polarity'] = train_data['Polarity'].replace({1: 'Negative', 2: 'Positive'})

In [7]:
# Check for null values
print('Null data:\n',train_data.isnull().sum())
print('The null data in dataframe: \n')
train_data[train_data.isnull().any(axis=1)]

Null data:
 Polarity      0
Title       207
Text          0
dtype: int64
The null data in dataframe: 



Unnamed: 0,Polarity,Title,Text
13265,Negative,,Couldn't get the device to work with my networ...
26554,Negative,,What separates this band from Evanescence (bes...
26827,Positive,,Falkenbach returns with more of the Viking/Fol...
36598,Positive,,I returned this because I received the same on...
37347,Positive,,This book is a great fantasy. I love this amaz...
...,...,...,...
3403351,Negative,,It is not a game. It is only a memory cardIt w...
3455848,Negative,,"The sleeve is not bad, but the vacuum is worth..."
3493132,Positive,,Al Spath's diary is a must for all poker playe...
3565886,Negative,,IBD should sell single issues for the iPad to ...


In [8]:
# Drop every row that contains a null value (here, the 207 rows with a missing Title)
train_data.dropna(inplace = True)
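
Dropping these rows also discards reviews whose `Text` is intact. One possible alternative, shown here as a sketch on made-up data rather than as what this notebook does, is to fill the missing titles with an empty string so the rows stay usable for the TF-IDF step.

```python
import numpy as np
import pandas as pd

# Made-up reviews; the first one has no title
reviews = pd.DataFrame({
    'Polarity': ['Negative', 'Positive'],
    'Title': [np.nan, 'Amazing!'],
    'Text': ['Could not get it to work', 'My favorite soundtrack'],
})

# Replace missing titles with '' instead of dropping the whole row
filled = reviews.fillna({'Title': ''})

print(filled.isnull().sum().sum())
print(len(filled))
```

Whether this helps accuracy is an open question; with only 207 affected rows out of 3.6 million, either choice should have little effect.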

In [9]:
# Extracting features and Target
train_features = train_data[['Text', 'Title']]
train_target = train_data[['Polarity']]

In [10]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
preprocessor = ColumnTransformer(
    transformers=[
        ('Text_Col_tfidf', TfidfVectorizer(), 'Text'),  # Apply TF-IDF to the 'Text' column
        ('Title_Col_tfidf', TfidfVectorizer(), 'Title')  # Apply TF-IDF to the 'Title' column
    ])
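
To make the behaviour of this preprocessor concrete, here is a small sketch on a made-up three-row DataFrame (the names `toy` and `toy_preprocessor` are illustrative only). Each `TfidfVectorizer` learns its own vocabulary from its column, and the `ColumnTransformer` stacks the two TF-IDF matrices side by side. Note that the column is given as a plain string, not a list: that hands the vectorizer a 1-D column of strings, which is what it expects.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up data with the same two text columns as the notebook
toy = pd.DataFrame({
    'Title': ['Great CD', 'Broke fast', 'Great value'],
    'Text': ['lovely voice and lyrics', 'stopped working quickly', 'works and sounds lovely'],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # A plain string (not ['Text']) selects a 1-D column of strings
        ('text_tfidf', TfidfVectorizer(), 'Text'),
        ('title_tfidf', TfidfVectorizer(), 'Title'),
    ])

X = toy_preprocessor.fit_transform(toy)

# Each vectorizer keeps its own vocabulary; the output has one column per term
text_vocab = len(toy_preprocessor.named_transformers_['text_tfidf'].vocabulary_)
title_vocab = len(toy_preprocessor.named_transformers_['title_tfidf'].vocabulary_)
print(X.shape, text_vocab, title_vocab)
```

The number of output columns equals the sum of the two vocabulary sizes, so a word that appears in both a title and a text is represented by two separate features.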

In [11]:
from sklearn.naive_bayes import ComplementNB 
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


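# Note: from here on, the names ComplementNB and MultinomialNB refer to these
# pipelines rather than to the imported scikit-learn classes
ComplementNB = Pipeline(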
    [
        ('tfidf', preprocessor),
        ("clf", ComplementNB()),
    ]
)

MultinomialNB = Pipeline(
    [
        ('tfidf', preprocessor),
        ("clf", MultinomialNB()),
    ]
)

# Fit both models so their performance can be compared later.
# The target is flattened to 1-D, which is the shape scikit-learn expects
MultinomialNB.fit(train_features, train_target.values.ravel())
ComplementNB.fit(train_features, train_target.values.ravel())
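
scikit-learn classifiers expect a 1-D target. Selecting with `[['Polarity']]` returns a one-column DataFrame, which scikit-learn converts internally while emitting a `DataConversionWarning`; selecting with `['Polarity']` returns a Series and avoids the conversion. The sketch below illustrates the difference on made-up data (`X_toy`, `labels`, and `fit_warnings` are hypothetical names used only here).

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.naive_bayes import MultinomialNB

X_toy = np.array([[2, 0], [0, 3], [1, 0], [0, 1]])  # made-up count features
labels = pd.DataFrame({'Polarity': ['Positive', 'Negative', 'Positive', 'Negative']})

def fit_warnings(y):
    """Fit a small model and return the DataConversionWarnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        MultinomialNB().fit(X_toy, y)
    return [w for w in caught if issubclass(w.category, DataConversionWarning)]

n_with_frame = len(fit_warnings(labels[['Polarity']]))   # shape (4, 1)
n_with_series = len(fit_warnings(labels['Polarity']))    # shape (4,)
print(n_with_frame, n_with_series)
```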


In [12]:
test_data = pd.read_csv('test.csv')
test_data

Unnamed: 0,2,Great CD,"My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing ""Who was that singing ?"""
0,2,One of the best game music soundtracks - for a...,Despite the fact that I have only played a sma...
1,1,Batteries died within a year ...,I bought this charger in Jul 2003 and it worke...
2,2,"works fine, but Maha Energy is better",Check out Maha Energy's website. Their Powerex...
3,2,Great for the non-audiophile,Reviewed quite a bit of the combo players and ...
4,1,DVD Player crapped out after one year,I also began having the incorrect disc problem...
...,...,...,...
399994,1,Unbelievable- In a Bad Way,We bought this Thomas for our son who is a hug...
399995,1,"Almost Great, Until it Broke...",My son recieved this as a birthday gift 2 mont...
399996,1,Disappointed !!!,"I bought this toy for my son who loves the ""Th..."
399997,2,Classic Jessica Mitford,This is a compilation of a wide range of Mitfo...


In [13]:
# Apply the same cleanup as for train_data: recover the header row, rename the columns, and map Polarity to labels
newRow = pd.DataFrame([test_data.columns.tolist()], columns = test_data.columns)
test_data = pd.concat([newRow, test_data], ignore_index = True)

test_data.columns = ['Polarity', 'Title', 'Text']

test_data['Polarity'] = test_data['Polarity'].astype(int)
test_data['Polarity'] = test_data['Polarity'].replace({1: 'Negative', 2: 'Positive'})
test_data

Unnamed: 0,Polarity,Title,Text
0,Positive,Great CD,My lovely Pat has one of the GREAT voices of h...
1,Positive,One of the best game music soundtracks - for a...,Despite the fact that I have only played a sma...
2,Negative,Batteries died within a year ...,I bought this charger in Jul 2003 and it worke...
3,Positive,"works fine, but Maha Energy is better",Check out Maha Energy's website. Their Powerex...
4,Positive,Great for the non-audiophile,Reviewed quite a bit of the combo players and ...
...,...,...,...
399995,Negative,Unbelievable- In a Bad Way,We bought this Thomas for our son who is a hug...
399996,Negative,"Almost Great, Until it Broke...",My son recieved this as a birthday gift 2 mont...
399997,Negative,Disappointed !!!,"I bought this toy for my son who loves the ""Th..."
399998,Positive,Classic Jessica Mitford,This is a compilation of a wide range of Mitfo...


In [14]:
print('Null data:\n',test_data.isnull().sum())
print('The null data in dataframe: \n')
test_data[test_data.isnull().any(axis=1)]

Null data:
 Polarity     0
Title       24
Text         0
dtype: int64
The null data in dataframe: 



Unnamed: 0,Polarity,Title,Text
205,Positive,,Awesome.... simply awesome. I couldn't put thi...
2703,Negative,,Who is Joe Nickell? What are his qualification...
10875,Negative,,None the palace of pleasure volume l is not wo...
47630,Negative,,Crazy¡! I am 10 and this book was not a gud in...
66727,Negative,,this is a tereble book. dont read this book. i...
83136,Negative,,The book does have some good info but is dated...
86252,Positive,,i have every book written by nora roberts this...
101746,Negative,,OMG! WHAT FREAK! THIS WAS THE ANSWER TO TO DEM...
112957,Positive,,Random House failed to edit this book. There a...
120213,Positive,,This CD is good. A lot of the songs on here wa...


In [15]:
# Dropping null values from test_data
test_data.dropna(inplace = True)

In [17]:
# Extracting features and target from the test_data
test_features = test_data[['Text', 'Title']]
test_target = test_data[['Polarity']]

In [27]:
ComplementNB_score = ComplementNB.score(test_features, test_target)
MultinomialNB_score = MultinomialNB.score(test_features, test_target)
print(f'Score of ComplementNB: {round(ComplementNB_score*100, 2)}')
print(f'Score of MultinomialNB: {round(MultinomialNB_score*100, 2)}')

Score of ComplementNB: 87.7
Score of MultinomialNB: 87.7


In [25]:
# A hand-written negative review to sanity-check the models' predictions
test_sentence = {
        'Title': ['Torn'],
        'Text': ['This product was broken. I did not expect this quality. Please send a better one'],
}
test_sentence = pd.DataFrame(test_sentence)
test_sentence

Unnamed: 0,Title,Text
0,Torn,This product was broken. I did not expect this...


In [32]:
test_pred1 = ComplementNB.predict(test_sentence)
test_pred2 = MultinomialNB.predict(test_sentence)
print('Pred1: ', test_pred1)
print('Pred2: ', test_pred2)

Pred1:  ['Negative']
Pred2:  ['Negative']
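
A predicted label alone does not show how confident a model is. As a sketch, assuming a tiny made-up training set in place of the real data, `predict_proba` on a fitted pipeline returns one probability per class, with columns ordered as in `classes_`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set, only to make the sketch runnable
toy_text = pd.Series([
    'broken on arrival, poor quality',
    'stopped working, very disappointed',
    'excellent product, works great',
    'great quality, love it',
])
toy_labels = ['Negative', 'Negative', 'Positive', 'Positive']

toy_model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
toy_model.fit(toy_text, toy_labels)

proba = toy_model.predict_proba(['This product was broken'])

# Columns follow toy_model.classes_, i.e. the sorted label order
for label, p in zip(toy_model.classes_, proba[0]):
    print(label, round(p, 3))
```

The same call should work on the pipelines fitted above, since both Naive Bayes classifiers implement `predict_proba`.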


In [43]:
# Helper function that computes accuracy, precision, recall, and F1-score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
def evaluationMetrics(y_true, y_pred):
    # Accuracy is reported as a percentage (0-100)
    model_accuracy = accuracy_score(y_true, y_pred) * 100
    # Precision, recall, and F1 are support-weighted averages over both classes, on a 0-1 scale
    model_precision, model_recall, model_f1, _ = precision_recall_fscore_support(y_true, y_pred, average = 'weighted')
    
    model_result = {"Accuracy": model_accuracy,
                    "Precision": model_precision,
                    "Recall": model_recall,
                    "F1-Score": model_f1}
    
    return model_result

In [41]:
# Predicting on test data
ComplementNB_pred = ComplementNB.predict(test_features)
MultinomialNB_pred = MultinomialNB.predict(test_features)

In [44]:
# Evaluating
ComplementNB_eval = evaluationMetrics(test_target, ComplementNB_pred)
MultinomialNB_eval = evaluationMetrics(test_target, MultinomialNB_pred)

print(f'ComplementNB evaluation results:\n{ComplementNB_eval}')
print(f'MultinomialNB evaluation results:\n{MultinomialNB_eval}')

ComplementNB evaluation results:
{'Accuracy': 87.69876192571554, 'Precision': 0.87700451099024, 'Recall': 0.8769876192571554, 'F1-Score': 0.8769862509187568}
MultinomialNB evaluation results:
{'Accuracy': 87.69876192571554, 'Precision': 0.87700451099024, 'Recall': 0.8769876192571554, 'F1-Score': 0.8769862509187568}
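
The weighted averages hide how each class performs on its own. In addition, support-weighted recall is mathematically equal to accuracy, which is why the `Recall` values above match `Accuracy` divided by 100. The sketch below uses made-up labels, chosen so the numbers are easy to check by hand, to show a confusion matrix and per-class scores via `average=None`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

# Made-up labels: 2 negative and 3 positive reviews, 3 of 5 predicted correctly
y_true = ['Negative', 'Negative', 'Positive', 'Positive', 'Positive']
y_pred = ['Negative', 'Positive', 'Positive', 'Positive', 'Negative']
labels = ['Negative', 'Positive']

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

# average=None returns one value per class instead of a single weighted number
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average=None)

# The support-weighted recall coincides with plain accuracy
_, weighted_recall, _, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted')

print(cm)
print(dict(zip(labels, recall)))
print(weighted_recall, accuracy_score(y_true, y_pred))
```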


## Both models reach about 87.7% accuracy on the test set. Let's check whether deep learning models can push the accuracy further.

In [45]:
%pip install joblib

Note: you may need to restart the kernel to use updated packages.





In [46]:
import joblib
joblib.dump(ComplementNB, 'ComplementNB_model.joblib')
joblib.dump(MultinomialNB, 'MultinomialNB_model.joblib')

['MultinomialNB_model.joblib']
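
A saved pipeline can be restored with `joblib.load` and used for prediction without refitting. The sketch below round-trips a tiny made-up pipeline through a temporary file; `toy_model` and the file name `toy_model.joblib` are hypothetical stand-ins for the models saved above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Tiny made-up pipeline standing in for the trained models
toy_model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', ComplementNB())])
toy_model.fit(
    ['broken and useless', 'terrible quality', 'works great', 'love this product'],
    ['Negative', 'Negative', 'Positive', 'Positive'],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.joblib')  # hypothetical file name
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored pipeline includes the fitted vectorizer, so raw text can be passed in
samples = ['broken product', 'works great']
print(restored.predict(samples))
```

joblib files are pickle-based, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that was used to save them.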