<a href="https://colab.research.google.com/github/rohitpaul23/naive_bayes/blob/main/naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NAIVE BAYES ALGORITHM

Use the Naive Bayes algorithm for text classification. The dataset for this problem
is the Amazon Digital Music review dataset and has been obtained from this website. Given a user's review,
the task is to predict the ‘overall’ rating given by the reviewer. Separate training and test files are provided, containing 50,000 and 14,000 reviews (samples) respectively. Each review belongs to one of five categories
(class labels). Here, the class label is the ‘overall’ rating given by the user, and the input text is the ‘reviewText’.

In [1]:
import json
import re
import pandas as pd
import numpy as np
import random
from nltk.stem import PorterStemmer

ps = PorterStemmer()

Get Training Data

In [2]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [3]:
dataloc = '/content/gdrive/My Drive/projectData/ass2/data/q1/reviews_Digital_Music/Music_Review_train.json'

# Each line of the file is one JSON record; the with-block closes the file afterwards
with open(dataloc, 'r') as f:
    jsonData = [json.loads(line) for line in f]


List of fields in each record of the dataset

In [4]:
jsonDataKeys = list(jsonData[0].keys())
jsonDataKeys

['reviewerID',
 'asin',
 'reviewerName',
 'helpful',
 'reviewText',
 'overall',
 'summary',
 'unixReviewTime',
 'reviewTime']

First training record

In [24]:
jsonData[0]

{'asin': 'B000AA302A',
 'helpful': [1, 1],
 'overall': 5.0,
 'reviewText': 'This is one of the best albums I\'ve bought in a while.  It\'s like, Rush for the new millenium.  Great musicianship, and really catchy songs.  I especially like "The Willing Well" (all 4 parts), "Welcome Home", "Crossing the Frame", and "Apollo I", but they\'re all great.  Buy this now.',
 'reviewTime': '05 4, 2006',
 'reviewerID': 'A2DWB9Y1SM0FVU',
 'reviewerName': 'Ross Reynolds',
 'summary': 'HOLY MOSES GRANDMA!!!',
 'unixReviewTime': 1146700800}

The 'reviewText' field will be used as the training text, from which the 'overall' rating will be predicted.

Get Test Data

In [5]:
testdataloc = '/content/gdrive/My Drive/projectData/ass2/data/q1/reviews_Digital_Music/Music_Review_test.json'

with open(testdataloc, 'r') as f:
    jsonTestData = [json.loads(line) for line in f]
jsonTestDataKeys = list(jsonTestData[0].keys())

In [25]:
len(jsonData), len(jsonTestData)

(50000, 14000)

There are 50000 training samples and 14000 test samples in total.

The goal is to predict 'overall' from the dataset.

Determine y (the labels) and the number of labels

In [8]:
random.shuffle(jsonData)

In [26]:
m = 5000

Use 5000 of the shuffled training samples to build the features (words) used in classification. The first 5000 test samples are used for evaluation.

In [9]:
y = []
y_test = []
labelsDict = {}
for docIndex in range(m):
    label = int(jsonData[docIndex][jsonDataKeys[5]])
    y.append(label)
    y_test.append(int(jsonTestData[docIndex][jsonTestDataKeys[5]]))
    if label in labelsDict.keys():
        labelsDict[label] += 1
    else:
        labelsDict[label] = 1
y = np.array(y)
y_test = np.array(y_test)

labels = np.unique(y)
y = y.reshape(m,1)
y_test = y_test.reshape(m,1)
L = len(labels)
labels, labelsDict

(array([1, 2, 3, 4, 5]), {1: 252, 2: 283, 3: 523, 4: 1327, 5: 2615})

There are 5 classes in total.

In [10]:
uniqueWords = []

for docIndex in range(m):
    document = jsonData[docIndex][jsonDataKeys[4]]
    wordsInDoc = re.split(r'[\s.,()\']', document)
    for word in wordsInDoc:
        if len(word) > 2 and  word.isalpha():
            stemWord = ps.stem(word)
            if stemWord not in uniqueWords:
                uniqueWords.append(stemWord)

In [11]:
featureSize = len(uniqueWords)
featureSize

20325

The 5000 training samples yield 20325 unique stemmed words, which serve as the features for classification.
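The `stemWord not in uniqueWords` check in the cell above scans a list on every token, which gets slow as the vocabulary grows. A minimal sketch of the same filtering with a dict-based vocabulary is shown below. The names `tokenize` and `build_vocab` are illustrative and not part of the notebook, and `stem` defaults to `str.lower` only so that the sketch runs without NLTK; passing `ps.stem` would be the assumption that matches the notebook.

```python
import re

# Same separators as the notebook: whitespace, '.', ',', '(', ')' and the apostrophe
TOKEN_SPLIT = re.compile(r"[\s.,()']")

def tokenize(text, stem=str.lower):
    """Split a review and keep alphabetic words longer than 2 characters."""
    tokens = []
    for word in TOKEN_SPLIT.split(text):
        if len(word) > 2 and word.isalpha():
            tokens.append(stem(word))
    return tokens

def build_vocab(documents, stem=str.lower):
    """Map each distinct token to a column index, in first-seen order."""
    vocab = {}
    for text in documents:
        for token in tokenize(text, stem):
            if token not in vocab:  # dict membership is a hash lookup, not a list scan
                vocab[token] = len(vocab)
    return vocab

vocab = build_vocab(["Great album, really catchy songs.", "great songs"])
```

Because the dict preserves insertion order, `list(vocab)` should give the same word order that the list-based loop produces for the same input.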

In [12]:
X = []
X_test = []
# Map each vocabulary word to its column once, instead of calling uniqueWords.index() per token
wordIndex = {word: index for index, word in enumerate(uniqueWords)}
for docIndex in range(m):
    docFeatures = [0 for _ in range(featureSize)]
    docTestFeatures = [0 for _ in range(featureSize)]

    # Word counts for the training review
    document = jsonData[docIndex][jsonDataKeys[4]]
    wordsInDoc = re.split(r'[\s.,()\']', document)
    for word in wordsInDoc:
        if len(word) > 2 and word.isalpha():
            stemWord = ps.stem(word)
            if stemWord in wordIndex:
                docFeatures[wordIndex[stemWord]] += 1
    X.append(docFeatures)

    # Word counts for the test review; words outside the training vocabulary are ignored
    testdocument = jsonTestData[docIndex][jsonTestDataKeys[4]]
    wordsInTestDoc = re.split(r'[\s.,()\']', testdocument)
    for word in wordsInTestDoc:
        if len(word) > 2 and word.isalpha():
            stemWord = ps.stem(word)
            if stemWord in wordIndex:
                docTestFeatures[wordIndex[stemWord]] += 1
    X_test.append(docTestFeatures)
    
X = np.array(X)
X_test = np.array(X_test)

In [13]:
n = X.shape[1]
X.shape, y.shape

((5000, 20325), (5000, 1))
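Each row of `X` is a bag-of-words count vector: column `j` holds how often vocabulary word `j` occurs in that review. A small sketch of this step, assuming the reviews are already tokenized and the vocabulary is a word-to-column dict (`count_matrix` and the toy vocabulary are illustrative, not names from the notebook):

```python
import numpy as np

def count_matrix(token_lists, vocab):
    """Build a (documents x vocabulary) matrix of word counts."""
    X = np.zeros((len(token_lists), len(vocab)), dtype=np.int64)
    for row, tokens in enumerate(token_lists):
        for token in tokens:
            col = vocab.get(token)
            if col is not None:  # skip words that are not in the vocabulary
                X[row, col] += 1
    return X

toy_vocab = {"great": 0, "album": 1, "song": 2}
toy_docs = [["great", "great", "song"], ["album", "unknown"]]
X_toy = count_matrix(toy_docs, toy_vocab)
```

Here `X_toy` has shape `(2, 3)`, and the out-of-vocabulary token `"unknown"` contributes nothing, which mirrors how test words missing from the training vocabulary are dropped.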

In [14]:
X, y

(array([[ 2,  1,  3, ...,  0,  0,  0],
        [ 2,  0,  7, ...,  0,  0,  0],
        [ 6,  1,  6, ...,  0,  0,  0],
        ...,
        [ 0,  1, 15, ...,  1,  1,  1],
        [ 5,  8, 29, ...,  0,  0,  0],
        [ 1,  1,  6, ...,  0,  0,  0]]), array([[5],
        [2],
        [4],
        ...,
        [5],
        [4],
        [2]]))

### Calculating the Parameters

Determining phi, i.e., [phi1, phi2, phi3, phi4, phi5], the prior probability of each label y

In [15]:
phi = [0, 0, 0, 0, 0]
for index in range(m):
    label = y[index][0]
    if label == 1:
        phi[0] += 1
    elif label == 2:
        phi[1] += 1
    elif label == 3:
        phi[2] += 1
    elif label == 4:
        phi[3] += 1
    elif label == 5:
        phi[4] += 1
phi

[252, 283, 523, 1327, 2615]

In [16]:
phi = np.array(phi)
phi = phi/m
phi

array([0.0504, 0.0566, 0.1046, 0.2654, 0.523 ])
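The priors are just the label frequencies divided by the number of samples. As a compact alternative to the if/elif chain, the sketch below uses `np.bincount`; the function name `class_priors` and the `num_labels` argument are illustrative, and the demo labels are rebuilt from the per-class counts printed earlier.

```python
import numpy as np

def class_priors(y, num_labels=5):
    """Return the fraction of samples carrying each label 1..num_labels."""
    y = np.asarray(y).ravel()
    # bincount index 0 is unused because the labels start at 1
    counts = np.bincount(y, minlength=num_labels + 1)[1:]
    return counts / y.size

# Rebuild a label column with the class counts reported above
label_counts = [252, 283, 523, 1327, 2615]
y_demo = np.repeat(np.arange(1, 6), label_counts).reshape(-1, 1)
priors = class_priors(y_demo)
```

With these counts, `priors` should match the `phi` array printed above.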

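Determining theta, the Laplace-smoothed (alpha = 1) word parameters used as the conditional probability of x given y. Note that the cell below normalizes each word's counts by that word's smoothed total across all labels plus alpha*n, not by the total word count of each label.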

In [17]:
thetaParaArray = []
alpha = 1
for j in range(n):
    para = [alpha for _ in range(len(labels))]
    for i in range(m):
        para[y[i][0]-1] += X[i][j]
    total = np.sum(para) + alpha*n
    para /= total
    thetaParaArray.append(para)

thetaParaArray = np.array(thetaParaArray) 

In [18]:
theta = np.array(thetaParaArray)
theta.shape

(20325, 5)
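For comparison, the textbook multinomial Naive Bayes estimate normalizes within each label: the count of word j in label k plus alpha, divided by the total word count of label k plus alpha times the vocabulary size. The sketch below shows that variant under those assumptions. It is not the computation that produced the results in this notebook, and `multinomial_theta` and its arguments are illustrative names.

```python
import numpy as np

def multinomial_theta(X, y, num_labels=5, alpha=1.0):
    """Laplace-smoothed P(word | label); returns an array of shape (n_words, num_labels)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).ravel()
    n_words = X.shape[1]
    theta = np.empty((n_words, num_labels))
    for k in range(num_labels):
        # Total count of every word over the documents with label k + 1
        word_counts = X[y == k + 1].sum(axis=0)
        theta[:, k] = (word_counts + alpha) / (word_counts.sum() + alpha * n_words)
    return theta

X_toy = [[2, 0], [0, 1], [1, 1]]
y_toy = [1, 2, 2]
theta_toy = multinomial_theta(X_toy, y_toy, num_labels=2)
```

With this normalization every column of `theta_toy` sums to 1, so each label gets a proper distribution over the vocabulary. Swapping it into the notebook would change the predictions, so the reported accuracy would need to be recomputed.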

In [19]:
phi

array([0.0504, 0.0566, 0.1046, 0.2654, 0.523 ])

In [20]:
logphi = np.log(phi)
y_test_pred = []
accuracy = 0  # number of correct predictions (converted to a percentage below)
for j in range(m):
    X_val = X_test[j]
    probLabels = []
    for label in labels:
        theta0 = theta.T[label-1]
        # Add log(theta) once for every vocabulary word present in the document
        prob = []
        for i in range(n):
            if X_val[i]:
                prob.append(np.log(theta0[i]))
        # Log score for this label: log prior plus the word terms
        probLabel = sum(prob) + logphi[label-1]
        probLabels.append(probLabel)
    y_real = y_test[j]
    # Predict the label with the highest score (labels start at 1)
    y_pred = probLabels.index(max(probLabels)) + 1
    if y_real==y_pred:
        accuracy += 1
    y_test_pred.append(y_pred)

accuracy

3384
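The prediction loop scores each label as its log prior plus log(theta) for every vocabulary word that appears in the review, counting each word once regardless of how often it occurs. The same scores can be computed with a single matrix product. The sketch below is an assumed vectorized equivalent, with `predict_presence` and the toy parameters as illustrative names and values.

```python
import numpy as np

def predict_presence(X_counts, theta, phi):
    """Pick the label with the highest log prior + sum of log(theta) over present words."""
    present = (np.asarray(X_counts) > 0).astype(float)  # (docs, n_words), 1 where the word occurs
    scores = present @ np.log(theta) + np.log(phi)      # (docs, num_labels)
    return scores.argmax(axis=1) + 1                    # labels start at 1

theta_toy = np.array([[0.6, 0.1],
                      [0.1, 0.6],
                      [0.3, 0.3]])
phi_toy = np.array([0.5, 0.5])
X_toy = np.array([[3, 0, 1],
                  [0, 2, 0]])
pred_toy = predict_presence(X_toy, theta_toy, phi_toy)
```

When comparing against labels stored as a column, such as `y_test` with shape `(m, 1)`, flatten them first (for example `np.mean(pred == y_test.ravel())`), since comparing a `(m,)` array with a `(m, 1)` array broadcasts to an `(m, m)` matrix.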

Comparing the real label of the first test sample with the predicted label

In [21]:
y_test[0], y_test_pred[0]

(array([5]), 5)

Accuracy (%)

In [22]:
accuracy*100/y_test.shape[0]

67.68

Possible reasons for the low accuracy:
 1. Imbalanced data distribution
    About 52% of the 5000 training samples have label 5. An oversampling technique such as SMOTE could be used to rebalance the classes.

 2. Simple features
    Each feature is a single stemmed word. Feature engineering, such as combining two words into one feature, could add more informative features.
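Both ideas can be sketched in a few lines. The block below is a rough illustration only: it uses plain random oversampling as a simpler stand-in for SMOTE (it duplicates existing minority reviews rather than synthesizing new ones), and it builds two-word features by joining adjacent tokens. The names `add_bigrams` and `random_oversample` are hypothetical, and whether either step improves accuracy on this dataset has not been measured here.

```python
import random
from collections import defaultdict

def add_bigrams(tokens):
    """Return the original tokens plus adjacent-word pairs joined by '_'."""
    return list(tokens) + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def random_oversample(docs, labels, seed=0):
    """Duplicate randomly chosen documents until every label matches the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_label[label].append(doc)
    target = max(len(group) for group in by_label.values())
    out_docs, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_docs.extend(group + extra)
        out_labels.extend([label] * target)
    return out_docs, out_labels

bigram_tokens = add_bigrams(["really", "catchy", "songs"])
balanced_docs, balanced_labels = random_oversample(["a", "b", "c", "d"], [5, 5, 5, 1])
```

Oversampling should be applied to the training split only, so that the test set keeps its original label distribution.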