NAIVE BAYES ALGORITHM

Use the Naïve Bayes algorithm for text classification. The dataset for this problem
is the Amazon Digital Music review dataset and has been obtained from this website. Given a user's review,
the task is to predict the ‘overall’ rating given by the reviewer. Separate training and test files are provided, containing 50,000 reviews (samples) and 14,000 reviews respectively. Each review belongs to one of five categories
(class labels). Here, the class label is the ‘overall’ rating that the user gave along with the ‘reviewText’.

In [1]:
import json
import re
import pandas as pd
import numpy as np
import random
from nltk.stem import PorterStemmer

ps = PorterStemmer()

Get Training Data

In [2]:
dataloc = './data/q1/reviews_Digital_Music/Music_Review_train.json'

# one JSON object per line; the context manager closes the file after reading
with open(dataloc, 'r') as f:
    jsonData = [json.loads(line) for line in f]


In [3]:
jsonDataKeys = list(jsonData[0].keys())
jsonDataKeys

['reviewerID',
 'asin',
 'reviewerName',
 'helpful',
 'reviewText',
 'overall',
 'summary',
 'unixReviewTime',
 'reviewTime']

Get Test Data

In [4]:
testdataloc = './data/q1/reviews_Digital_Music/Music_Review_test.json'

with open(testdataloc, 'r') as f:
    jsonTestData = [json.loads(line) for line in f]
jsonTestDataKeys = list(jsonTestData[0].keys())

In [5]:
len(jsonData)

50000

In [31]:

m = 5000  # number of training and test reviews used below

The goal is to predict 'overall' from the dataset

Determine y (labels) and the number of labels

In [7]:
random.shuffle(jsonData)

In [8]:
y = []
y_test = []
labelsDict = {}
for docIndex in range(m):
    label = int(jsonData[docIndex][jsonDataKeys[5]])
    y.append(label)
    y_test.append(int(jsonTestData[docIndex][jsonTestDataKeys[5]]))
    labelsDict[label] = labelsDict.get(label, 0) + 1
y = np.array(y)
y_test = np.array(y_test)

labels = np.unique(y)
y = y.reshape(m,1)
y_test = y_test.reshape(m,1)
L = len(labels)
labels, labelsDict

(array([1, 2, 3, 4, 5]), {5: 2608, 4: 1328, 3: 571, 2: 237, 1: 256})

In [9]:
uniqueWords = []
seenWords = set()  # set gives O(1) membership checks; the list keeps first-seen order

for docIndex in range(m):
    document = jsonData[docIndex][jsonDataKeys[4]]
    wordsInDoc = re.split(r'[\s.,()\']', document)
    for word in wordsInDoc:
        if len(word) > 2 and word.isalpha():
            stemWord = ps.stem(word)
            if stemWord not in seenWords:
                seenWords.add(stemWord)
                uniqueWords.append(stemWord)

In [10]:
featureSize = len(uniqueWords)
featureSize

20149
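
The vocabulary step above can also be sketched with a dictionary that maps each token to its column index, so that both the membership test and the index lookup are constant-time. This is a minimal sketch, not the notebook's code: `tokenize`, `build_vocab`, and the toy documents are hypothetical names, and the `stem` argument defaults to the identity function (passing `ps.stem` would apply the notebook's stemmer).

```python
import re

def tokenize(text):
    """Split on the same delimiters as the cell above; keep alphabetic tokens longer than 2 characters."""
    return [w for w in re.split(r"[\s.,()']", text) if len(w) > 2 and w.isalpha()]

def build_vocab(documents, stem=lambda w: w):
    """Map each distinct (optionally stemmed) token to a column index, in first-seen order."""
    vocab = {}
    for doc in documents:
        for word in tokenize(doc):
            token = stem(word)
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

# Toy documents (assumed for illustration only)
docs = ["Great album, great sound.", "The sound was (really) poor"]
vocab = build_vocab(docs)
print(vocab)
```

Note that this tokenizer is case-sensitive, so "Great" and "great" get separate columns unless the stemmer or an explicit `lower()` call normalizes them.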

In [11]:
X = []
X_test = []
wordIndex = {word: i for i, word in enumerate(uniqueWords)}  # word -> column index, O(1) lookup
for docIndex in range(m):
    docFeatures = [0 for _ in range(featureSize)]
    docTestFeatures = [0 for _ in range(featureSize)]
    document = jsonData[docIndex][jsonDataKeys[4]]
    wordsInDoc = re.split(r'[\s.,()\']', document)
    for word in wordsInDoc:
        if len(word) > 2 and word.isalpha():
            stemWord = ps.stem(word)
            if stemWord in wordIndex:
                docFeatures[wordIndex[stemWord]] += 1
    X.append(docFeatures)

    # test documents are encoded with the training vocabulary; unseen words are ignored
    testdocument = jsonTestData[docIndex][jsonTestDataKeys[4]]
    wordsInTestDoc = re.split(r'[\s.,()\']', testdocument)
    for word in wordsInTestDoc:
        if len(word) > 2 and word.isalpha():
            stemWord = ps.stem(word)
            if stemWord in wordIndex:
                docTestFeatures[wordIndex[stemWord]] += 1
    X_test.append(docTestFeatures)

X = np.array(X)
X_test = np.array(X_test)

In [12]:
n = X.shape[1]
X.shape, y.shape

((5000, 20149), (5000, 1))
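
Each row of X is a bag-of-words count vector: column j holds how many times vocabulary word j occurs in the review. The sketch below shows the encoding for a single document under simplifying assumptions: `count_vector` and the three-word `vocab` are hypothetical, and stemming is omitted to keep the example self-contained.

```python
import re
import numpy as np

def count_vector(text, vocab):
    """Return per-word counts for one document; words outside vocab are ignored."""
    vec = np.zeros(len(vocab), dtype=int)
    for word in re.split(r"[\s.,()']", text):
        if len(word) > 2 and word.isalpha() and word in vocab:
            vec[vocab[word]] += 1
    return vec

# Toy vocabulary (assumed for illustration only)
vocab = {"great": 0, "sound": 1, "poor": 2}
row = count_vector("great sound, really great", vocab)
print(row)
```

Words that are missing from the vocabulary, such as "really" here, contribute nothing to the vector, which is also how the test reviews are encoded above.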

In [13]:
X, y

(array([[1, 1, 4, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 4, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 1, 1, 0],
        [0, 0, 5, ..., 0, 0, 1]]),
 array([[5],
        [4],
        [5],
        ...,
        [5],
        [5],
        [5]]))

Calculating the Parameters

Determining phi, i.e. [phi1, phi2, phi3, phi4, phi5], the prior probability of each label y

In [64]:
phi = [0, 0, 0, 0, 0]
for index in range(m):
    label = y[index][0]
    phi[label - 1] += 1  # labels 1..5 map to indices 0..4
phi

[256, 237, 571, 1328, 2608]

In [65]:
phi = np.array(phi)
phi = phi/m
phi

array([0.0512, 0.0474, 0.1142, 0.2656, 0.5216])
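
The prior for label k is the fraction of training samples that carry that label: phi_k = (number of samples with y = k) / m. A minimal sketch of the same computation with `np.bincount`, assuming integer labels 1..num_classes; `class_priors` and the toy labels are illustrative and not part of the notebook:

```python
import numpy as np

def class_priors(y, num_classes):
    """Return the fraction of samples in each class for labels 1..num_classes."""
    y = np.asarray(y).ravel()
    # bincount also counts label 0, so drop that first slot
    counts = np.bincount(y, minlength=num_classes + 1)[1:]
    return counts / counts.sum()

# Toy labels (assumed for illustration only)
priors = class_priors([5, 4, 5, 1, 5, 3, 4, 5], num_classes=5)
print(priors)
```

A label that never occurs gets a prior of zero, and its logarithm is negative infinity, so that label can never be predicted.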

Determining theta, the conditional probability of x given y

In [16]:
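thetaParaArray = []
alpha = 1  # Laplace smoothing constant
for j in range(n):
    # smoothed count of word j under each label
    para = [alpha for _ in range(len(labels))]
    for i in range(m):
        para[y[i][0]-1] += X[i][j]
    # note: counts are normalized per word across labels, not per label across words
    total = np.sum(para) + alpha*n
    para = np.array(para) / total
    thetaParaArray.append(para)

thetaParaArray = np.array(thetaParaArray)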

In [17]:
theta = np.array(thetaParaArray)
theta.shape

(20149, 5)
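
For comparison, the textbook multinomial Naive Bayes estimate with Laplace smoothing normalizes within each label: theta[j, k] = (count of word j in label-k documents + alpha) / (total word count in label-k documents + alpha * n). The cell above instead divides by a per-word total taken across labels, so the two give different values. The sketch below implements the textbook form only; `multinomial_theta` and the toy arrays are hypothetical, and no claim is made about how it would change the accuracy reported in this notebook.

```python
import numpy as np

def multinomial_theta(X, y, num_classes, alpha=1.0):
    """Laplace-smoothed word probabilities, shape (n_features, num_classes).

    Column k holds P(word | label k+1) and sums to 1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).ravel()
    n_features = X.shape[1]
    theta = np.empty((n_features, num_classes))
    for k in range(num_classes):
        word_counts = X[y == k + 1].sum(axis=0)  # per-word counts within label k+1
        theta[:, k] = (word_counts + alpha) / (word_counts.sum() + alpha * n_features)
    return theta

# Toy data (assumed for illustration only): 3 documents, 3 words, 2 labels
X_demo = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0]])
y_demo = np.array([1, 2, 1])
theta_demo = multinomial_theta(X_demo, y_demo, num_classes=2)
print(theta_demo)
```

Because every column sums to 1, each column is a proper probability distribution over the vocabulary for its label.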

In [73]:
logphi = np.log(phi)
y_test_pred = []
accuracy = 0  # number of correct predictions (converted to a percentage below)
for j in range(m):
    probLabels = []
    X_val = X_test[j]
    for label in labels:
        theta0 = theta.T[label-1]
        # sum the log-probabilities of the words present in the document (each word counted once)
        sum_ = sum(np.log(theta0[i]) for i in range(n) if X_val[i])
        probLabels.append(sum_ + logphi[label-1])
    y_real = y_test[j][0]
    y_pred = probLabels.index(max(probLabels)) + 1  # labels are 1-based
    if y_real == y_pred:
        accuracy += 1
    y_test_pred.append(y_pred)

accuracy

3384
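
The prediction loop scores each label as log(phi) plus a sum of log(theta) terms and picks the largest score. Working in log space avoids the underflow that multiplying thousands of small probabilities would cause. The same scoring can be sketched as one matrix product, assuming `theta` has shape (n_features, n_classes) with strictly positive entries; `predict_log_space` and the toy parameters are hypothetical. With `binary=True` each present word is counted once, as in the loop above; with `binary=False` words are weighted by their counts, which is the usual multinomial form.

```python
import numpy as np

def predict_log_space(X, theta, phi, binary=False):
    """Return 1-based labels maximizing log prior plus summed log likelihoods."""
    X = np.asarray(X, dtype=float)
    if binary:
        X = (X > 0).astype(float)  # count each present word once
    scores = X @ np.log(theta) + np.log(phi)  # shape (n_docs, n_classes)
    return scores.argmax(axis=1) + 1

# Toy parameters (assumed for illustration only): 2 words, 2 labels
theta_demo = np.array([[0.7, 0.2],
                       [0.3, 0.8]])
phi_demo = np.array([0.5, 0.5])
X_demo = np.array([[3, 0], [0, 2]])
preds = predict_log_space(X_demo, theta_demo, phi_demo)
print(preds)
```

Ties go to the lowest label index, matching the behavior of `probLabels.index(max(probLabels))`.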

Comparing the real label of the 1st test sample with the predicted label

In [74]:
y_test[0], y_test_pred[0]

(5, 5)

Accuracy (%)

In [75]:
accuracy*100/y_test.shape[0]

67.68

Reasons for the low accuracy:
 1. Imbalanced data distribution
    About 52% of the training samples have label '5'. An oversampling technique such as SMOTE can be used to rebalance the classes.

 2. Simple features
    Feature engineering can create new features, such as features that combine two words.
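
Both remedies can be sketched in a few lines. The example below is an illustration under stated assumptions, not a tested improvement for this dataset: `add_bigrams` joins adjacent tokens into two-word features, and `random_oversample` duplicates minority-class rows at random, which is a simpler stand-in for SMOTE (SMOTE synthesizes new points by interpolation instead of copying). Both function names and the toy data are hypothetical.

```python
import numpy as np

def add_bigrams(tokens):
    """Return the unigrams followed by adjacent-pair features such as 'not_good'."""
    tokens = list(tokens)
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def random_oversample(X, y, seed=0):
    """Keep every row and add random duplicates until all classes match the largest one."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y).ravel()
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]
    for c, count in zip(classes, counts):
        members = np.flatnonzero(y == c)
        keep.append(rng.choice(members, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Toy data (assumed for illustration only)
features = add_bigrams(["not", "good", "at", "all"])
X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([5, 5, 5, 5, 4, 1])
X_bal, y_bal = random_oversample(X_demo, y_demo)
print(features)
print(np.unique(y_bal, return_counts=True))
```

Oversampling should be applied to the training split only, so that the test set keeps its original label distribution.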