# Multi-class text classification using models

This notebook implements multi-class text classification on the Amazon automotive reviews dataset, using one of several combinations of text processing techniques and learning algorithms.

A rating (1-5) is predicted for each review in the dataset.

Exactly one of the following combinations is chosen for processing and training:

* Vectorisation using gensim's word2vec, followed by training with the Random Forest algorithm.
* Vectorisation using word2vec with Smooth Inverse Frequency (SIF) weighting, followed by training with the Random Forest algorithm.
* Vectorisation using term frequency-inverse document frequency (TF-IDF), followed by training with the Random Forest algorithm.
* Vectorisation using TF-IDF, followed by training with the Linear Support Vector Classification (LinearSVC) algorithm.

### Install required packages

In [38]:
!pip install pandas nltk "gensim<4" scikit-learn==0.20.3 pyyaml joblib requests

Defaulting to user installation because normal site-packages is not writeable


### Import required packages

In [23]:
#General
import pandas as pd
import numpy as np
import pickle
import yaml
from joblib import dump
import re
import nltk as nl
import gensim
import os
import requests

#sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer

#nltk
from nltk.corpus import stopwords
nl.download('punkt')
nl.download('stopwords')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Convert dataset in JSON format to CSV format

In [24]:
json_data = pd.read_json('data/amazon_automotive_reviews.json', lines=True)
json_data.to_csv('amazon_automotive_reviews.csv', index=False)

In [25]:
raw_data = pd.read_csv('amazon_automotive_reviews.csv')

In [26]:
raw_data.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A3F73SC1LY51OO | B00002243X | Alan Montgomery | [4, 4] | I needed a set of jumper cables for my new car... | 5 | Work Well - Should Have Bought Longer Ones | 1313539200 | 08 17, 2011 |
| 1 | A20S66SKYXULG2 | B00002243X | alphonse | [1, 1] | These long cables work fine for my truck, but ... | 4 | Okay long cables | 1315094400 | 09 4, 2011 |
| 2 | A2I8LFSN2IS5EO | B00002243X | Chris | [0, 0] | Can't comment much on these since they have no... | 5 | Looks and feels heavy Duty | 1374710400 | 07 25, 2013 |
| 3 | A3GT2EWQSO45ZG | B00002243X | DeusEx | [19, 19] | I absolutley love Amazon!!! For the price of ... | 5 | Excellent choice for Jumper Cables!!! | 1292889600 | 12 21, 2010 |
| 4 | A3ESWJPAVRPWB4 | B00002243X | E. Hernandez | [0, 0] | I purchased the 12' feet long cable set and th... | 5 | Excellent, High Quality Starter Cables | 1341360000 | 07 4, 2012 |


### Clean up initial dataset

In [27]:
raw_data['overallRating'] = raw_data['overall']
raw_data = raw_data.drop(['reviewerID','asin','reviewerName','helpful','overall','summary','unixReviewTime','reviewTime'], axis=1)

In [28]:
raw_data.head()

| | reviewText | overallRating |
|---|---|---|
| 0 | I needed a set of jumper cables for my new car... | 5 |
| 1 | These long cables work fine for my truck, but ... | 4 |
| 2 | Can't comment much on these since they have no... | 5 |
| 3 | I absolutley love Amazon!!! For the price of ... | 5 |
| 4 | I purchased the 12' feet long cable set and th... | 5 |


### Depict class imbalance issue in dataset using value count for each rating

In [29]:
unique_count = raw_data.overallRating.value_counts()
unique_count

5    13928
4     3967
3     1430
2      606
1      542
Name: overallRating, dtype: int64
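To make the imbalance concrete, the counts above can be turned into class shares and a ratio between the largest and smallest class. This is a minimal sketch in plain Python; the counts are copied by hand from the output above and the variable names are only illustrative.

```python
# Rating counts copied from the value_counts() output above
rating_counts = {5: 13928, 4: 3967, 3: 1430, 2: 606, 1: 542}

total = sum(rating_counts.values())

# Share of each rating in the whole dataset
shares = {rating: count / total for rating, count in rating_counts.items()}

# Ratio between the most and the least frequent class
imbalance_ratio = max(rating_counts.values()) / min(rating_counts.values())

for rating in sorted(shares, reverse=True):
    print(f"rating {rating}: {shares[rating]:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

Rating 5 accounts for roughly two thirds of all reviews, so a model that always predicts 5 would already look reasonably accurate without learning anything useful.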

### Preprocess dataset to reduce class imbalance

The dataset is rebalanced by undersampling rating 5 (keeping 8,000 reviews) and oversampling the other ratings by duplicating their reviews. The result is less skewed rather than perfectly balanced, which reduces the model's bias towards predicting rating 5.

In [30]:
# Keep 8000 of the rating-5 reviews and drop the rest
excess_recs = unique_count[5] - 8000
print("About %d records of rating 5 to be removed" %excess_recs)

rating_5_excess = raw_data[(raw_data['overallRating'] == 5)]
rating_others = raw_data[(raw_data['overallRating'] != 5)]

# Drop the first `excess_recs` rating-5 rows, then restore the original index
rating_5 = (rating_5_excess.reset_index()).truncate(before=excess_recs)
rating_5.set_index('index',inplace=True)

raw_data = pd.concat([rating_5,rating_others])

# Per-rating subsets used for oversampling in the next cell
rating_1 = raw_data[(raw_data['overallRating'] == 1)]
rating_2 = raw_data[(raw_data['overallRating'] == 2)]
rating_3 = raw_data[(raw_data['overallRating'] == 3)]
rating_4 = raw_data[(raw_data['overallRating'] == 4)]

About 5928 records of rating 5 to be removed


In [31]:
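# Append 4 extra copies of ratings 1 and 2, 1 extra copy of rating 3 and the last 2000 rating-4 rows
fin_data = pd.concat([raw_data, rating_1,rating_1,rating_1,rating_1,rating_2,rating_2,rating_2,rating_2,rating_3,rating_4[-2000:]])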
fin_data.overallRating.value_counts()

5    8000
4    5967
2    3030
3    2860
1    2710
Name: overallRating, dtype: int64
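Because the oversampled rows are exact copies and the train/test split happens later in the notebook, copies of the same review can end up in both sets, which tends to inflate test scores. One possible alternative, sketched here with plain Python lists rather than the notebook's DataFrame, is to split first and duplicate minority-class records only in the training portion. The function name, the `copies` argument and the toy records are hypothetical.

```python
import random
from collections import Counter

def split_then_oversample(records, copies, test_size=0.3, seed=7):
    """Split (text, rating) records, then repeat training records per rating.

    `copies` maps a rating to how many times each of its training records
    should appear; ratings that are not listed appear once.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    test, train = shuffled[:n_test], shuffled[n_test:]
    oversampled = []
    for text, rating in train:
        oversampled.extend([(text, rating)] * copies.get(rating, 1))
    return oversampled, test

# Toy data: 20 five-star and 10 one-star reviews, all with unique text
records = [(f"good {i}", 5) for i in range(20)] + [(f"bad {i}", 1) for i in range(10)]
train, test = split_then_oversample(records, copies={1: 2})

train_texts = {text for text, _ in train}
test_texts = {text for text, _ in test}
print(Counter(rating for _, rating in train))
print("reviews shared between train and test:", len(train_texts & test_texts))
```

With this ordering the test set keeps the original class distribution, so its scores describe how the model behaves on unmodified data.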

In [32]:
fin_data['p_review'] = fin_data['reviewText'].apply(lambda x: re.sub(r'[^a-zA-Z\s]','', str(x)))

In [33]:
fin_data.head()

| | reviewText | overallRating | p_review |
|---|---|---|---|
| 8170 | KN 33-2370, I have KN's in all my vehicles. T... | 5 | KN I have KNs in all my vehicles This one is... |
| 8171 | This probably the sixth or seventh unique K&N;... | 5 | This probably the sixth or seventh unique KN f... |
| 8172 | I've always been intrigued by K&N filters bein... | 5 | Ive always been intrigued by KN filters being ... |
| 8174 | This item is great it fits the vehicle perfect... | 5 | This item is great it fits the vehicle perfect... |
| 8175 | K&N has always provided consumers with superio... | 5 | KN has always provided consumers with superior... |
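The `p_review` column is produced by a single regular expression that keeps only ASCII letters and whitespace. The sketch below applies the same pattern to a made-up sample string to show its side effects; `clean_review` is an illustrative helper name, not part of the notebook.

```python
import re

def clean_review(text):
    """Keep only ASCII letters and whitespace, as done for the p_review column."""
    return re.sub(r'[^a-zA-Z\s]', '', str(text))

sample = "K&N 33-2370, I've used them in 2 cars!"
cleaned = clean_review(sample)
print(repr(cleaned))

# Digits and punctuation disappear, which can leave runs of spaces behind;
# splitting and re-joining collapses them if that matters downstream.
print(repr(" ".join(cleaned.split())))

# Missing values become the literal string 'nan' because of str(x)
print(repr(clean_review(float("nan"))))
```

Note that numbers such as part codes are removed entirely and contractions lose their apostrophe (`I've` becomes `Ive`), so the cleaned tokens will not always match dictionary words.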


### Choose model type for training

Set `model_choice` to the number of the processing and algorithm combination you want to use for training the model.

In [34]:
model_type = {  
                1 : 'word2vec_rf',
                2 : 'sif_rf',
                3 : 'tfidf_rf',
                4 : 'tfidf_lsvc'
             }

model_choice = 1

In [35]:
if model_choice not in model_type:
    raise ValueError("Set a model_choice from 1 to 4 based on model_type")

### Tokenize processed review texts while removing stopwords & vectorize the tokens using word2vec

In [37]:
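if model_choice in [1, 2]:

    p_review = fin_data['p_review'].to_list()

    # Tokenize each cleaned review into a list of words
    tokens = [nl.word_tokenize(review) for review in p_review]

    # Build the stop-word set once instead of reloading the list for every word.
    # The NLTK list is lowercase and this comparison is case-sensitive,
    # so capitalised words such as "This" are kept.
    stop_words = set(stopwords.words('english'))
    tokens = [[word for word in review_tokens if word not in stop_words] for review_tokens in tokens]

    # Passing the corpus to the constructor builds the vocabulary and runs an initial training pass.
    # `size` is the gensim 3.x parameter name; gensim 4 renamed it to `vector_size`.
    wv_model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)

    # Continue training for 50 more epochs
    wv_model.train(tokens, total_examples=len(tokens), epochs=50)

    print("Done")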

Done
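The cleaned reviews keep their original capitalisation while the NLTK stop-word list is lowercase, so a case-sensitive membership test lets words like "This" through. The sketch below illustrates the difference with a tiny hand-written stop-word set (a hypothetical stand-in for the NLTK list) and an optional case-folding flag.

```python
# A tiny stand-in for the NLTK English stop-word list (hypothetical subset)
STOP_WORDS = {"this", "is", "the", "a", "i", "for", "my"}

def remove_stop_words(tokens, fold_case=False):
    """Drop stop words; with fold_case=True the comparison ignores case."""
    if fold_case:
        return [t for t in tokens if t.lower() not in STOP_WORDS]
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["This", "item", "is", "great", "for", "my", "truck"]
print(remove_stop_words(tokens))                  # case-sensitive comparison
print(remove_stop_words(tokens, fold_case=True))  # case-insensitive comparison
```

Whether to fold case is a modelling decision: lowercasing also merges tokens such as "Great" and "great" into one vocabulary entry, which changes the word2vec vocabulary.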


### Preprocess & prepare training data

In [38]:
if model_choice == 1:

    print("Preparing training data using word2vec processing..")
    # Represent each review by the mean of its word vectors
    wv_train = []
    for review_tokens in tokens:
        wv_train.append(np.mean(np.asarray([wv_model.wv[token] for token in review_tokens]), axis=0))
    print("Completed")

elif model_choice == 2:

    print("Preparing training data using Smooth inverse frequency(SIF) type processing..")
    vlookup = wv_model.wv.vocab
    # Normalization constant Z: total number of word occurrences in the corpus
    Z = 0
    for k in vlookup:
        Z += vlookup[k].count

    a = 0.001             # SIF smoothing parameter
    embedding_size = 300  # must match the word2vec vector size
    wv_sif_train = []
    for review_tokens in tokens:
        vs = np.zeros(embedding_size)
        for word in review_tokens:
            # The weight a / (a + p(w)) down-weights frequent words
            a_value = a / (a + (vlookup[word].count / Z))
            vs = np.add(vs, np.multiply(a_value, wv_model.wv[word]))
        wv_sif_train.append(np.divide(vs, len(review_tokens)))
    print("Completed")

else:
    print("Preparing training data using TfIdf vectorization..")
    tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', stop_words='english')
    # .toarray() converts the sparse TF-IDF matrix into a dense array
    features = tfidf.fit_transform(fin_data.p_review).toarray()
    print(features.shape)
    print("Completed")

Preparing training data using word2vec processing..
Completed
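The SIF branch above weights each word vector by a / (a + p(w)), where p(w) is the word's share of all word occurrences, and then averages over the review. The following sketch reproduces that weighting with plain Python lists and hypothetical two-dimensional vectors in place of word2vec output; the function names are illustrative. SIF as usually described also subtracts a common component (the first principal component) from the sentence vectors, a step that neither the notebook cell nor this sketch performs.

```python
from collections import Counter

def sif_weights(tokenized_docs, a=0.001):
    """Return {word: a / (a + p(word))}, where p is the corpus frequency."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}

def sif_embedding(doc, vectors, weights):
    """Weighted average of the word vectors of one tokenized document."""
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    for word in doc:
        w = weights[word]
        acc = [x + w * v for x, v in zip(acc, vectors[word])]
    return [x / len(doc) for x in acc]

docs = [["cables", "work", "fine"], ["cables", "work"], ["cables", "broke"]]
# Hypothetical 2-dimensional word vectors standing in for word2vec output
vectors = {"cables": [1.0, 0.0], "work": [0.0, 1.0], "fine": [1.0, 1.0], "broke": [-1.0, 0.0]}

weights = sif_weights(docs)
for word in sorted(weights, key=weights.get):
    print(f"{word}: {weights[word]:.4f}")
print(sif_embedding(docs[0], vectors, weights))
```

The most frequent word ("cables") receives the smallest weight, so it contributes least to each review's vector, which is the intended effect of the weighting.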


### Split train & test data

In [39]:
if model_choice == 1:
    x_train, x_test, y_train, y_test = train_test_split(np.asarray(wv_train),fin_data['overallRating'],test_size=0.3,shuffle=True,random_state=7)

elif model_choice == 2:
    x_train, x_test, y_train, y_test = train_test_split(np.asarray(wv_sif_train),fin_data['overallRating'],test_size=0.2,shuffle=True,random_state=7)

else:
    x_train, x_test, y_train, y_test = train_test_split(features,fin_data['overallRating'],test_size=0.3,shuffle=True,random_state=7)
    

### Train model

In [40]:
if model_choice in [1,2,3]:
    model = RandomForestClassifier(n_estimators=40, random_state=0)
    model.fit(x_train,y_train)
    
else:
    model = LinearSVC()
    model.fit(x_train, y_train)  
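The notebook trains the model but does not report how well it predicts the held-out ratings. A minimal sketch of two simple checks, overall accuracy and per-class recall, is shown below in plain Python. The label lists are hypothetical; in the notebook they would correspond to `y_test` and `model.predict(x_test)`.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true label, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels for illustration only
y_true = [5, 5, 5, 4, 3, 1]
y_pred = [5, 5, 4, 4, 5, 1]

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(per_class_recall(y_true, y_pred))
```

Per-class recall is worth inspecting alongside accuracy here, since a model biased towards rating 5 can score well overall while missing most low ratings.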

### Save model 

In [41]:
file_rel_path = 'model/'
file_name = 'model.joblib'

# Create the model directory if it does not exist yet
os.makedirs(file_rel_path, exist_ok=True)
dump(model, file_rel_path + file_name)

['model/model.joblib']

### Define inference service name & model storage URI

In [42]:
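svc_name = 'text-classify'

# Dump this notebook pod's spec and look for the workspace volume
!kubectl get pods $HOSTNAME -o yaml -n anonymous > podspec
pvc = None
with open("podspec") as f:
    content = yaml.safe_load(f)
    for elm in content['spec']['volumes']:
        if 'workspace-' in elm['name']:
            pvc = elm['name']
os.remove('podspec')

if pvc is None:
    raise RuntimeError("No volume with 'workspace-' in its name was found in the pod spec")

storageURI = "pvc://" + pvc + '/' + file_rel_path
print(storageURI)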

pvc://workspace-poornima/model/
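The lookup above can also be written as a small function that works on the parsed pod spec and fails with a clear message when no workspace volume exists. This is a sketch under stated assumptions: the helper names and the pod spec fragment are hypothetical, and preferring `persistentVolumeClaim.claimName` over the volume name is a choice made here, whereas the notebook cell uses the volume name directly.

```python
def find_workspace_claim(pod_spec, prefix="workspace-"):
    """Return the PVC behind the first volume whose name contains `prefix`."""
    for volume in pod_spec.get("spec", {}).get("volumes", []):
        if prefix in volume.get("name", ""):
            claim = volume.get("persistentVolumeClaim", {})
            # Assumption: prefer the claim name when the volume declares one
            return claim.get("claimName", volume["name"])
    raise LookupError(f"no volume containing {prefix!r} in pod spec")

def storage_uri(claim_name, rel_path):
    """Build the pvc:// URI that the inference service reads the model from."""
    return f"pvc://{claim_name}/{rel_path}"

# Hypothetical pod spec fragment, shaped like `kubectl get pod -o yaml` output
pod_spec = {"spec": {"volumes": [
    {"name": "dshm", "emptyDir": {"medium": "Memory"}},
    {"name": "workspace-demo", "persistentVolumeClaim": {"claimName": "workspace-demo"}},
]}}

print(storage_uri(find_workspace_claim(pod_spec), "model/"))
```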


### Define configuration for inference service creation

In [43]:
wsvol_blerssi_kf = f"""apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: {svc_name}
  namespace: anonymous
spec:
  default:
    predictor:
      sklearn:
        storageUri: {storageURI}
"""
    
kfserving = yaml.safe_load(wsvol_blerssi_kf)
with open('blerssi-kfserving.yaml', 'w') as file:
    yaml.dump(kfserving, file)

! cat blerssi-kfserving.yaml

apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: text-classify
  namespace: anonymous
spec:
  default:
    predictor:
      sklearn:
        storageUri: pvc://workspace-poornima/model/


### Apply the configuration .yaml file

In [44]:
!kubectl apply -f blerssi-kfserving.yaml

inferenceservice.serving.kubeflow.org/text-classify unchanged


### Check whether inferenceservice is created

In [47]:
!kubectl get inferenceservice -n anonymous

NAME            URL                                                                  READY   DEFAULT TRAFFIC   CANARY TRAFFIC   AGE
text-classify   http://text-classify.anonymous.example.com/v1/models/text-classify   True    100                                12m


### Note:

Wait until the inference service reports READY as `True` before sending prediction requests.
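Instead of re-running the `kubectl get` cell by hand, the wait can be scripted. The sketch below polls a check function until it succeeds or a timeout expires. Both helper names are hypothetical, and the jsonpath expression assumes that the InferenceService exposes a `Ready` entry under `status.conditions`, which should be verified against the cluster in use.

```python
import subprocess
import time

def wait_until(check, timeout=300.0, interval=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

def inference_service_ready(name, namespace="anonymous"):
    """Ask kubectl for the Ready condition of an InferenceService."""
    # Assumption: the resource reports a condition of type "Ready"
    jsonpath = '{.status.conditions[?(@.type=="Ready")].status}'
    result = subprocess.run(
        ["kubectl", "get", "inferenceservice", name, "-n", namespace,
         "-o", f"jsonpath={jsonpath}"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "True"

# Usage in the notebook (not executed here):
# wait_until(lambda: inference_service_ready(svc_name))
```

Passing `sleep` and `clock` as parameters keeps the polling loop testable without a cluster or real delays.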

### Predict data from serving after setting INGRESS_IP

In [49]:
host_name = svc_name + '.anonymous.example.com'

headers = { 
    'host': host_name
}

formData = {
    'instances': x_test[:1].tolist()
}

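# Replace <<INGRESS_IP>> with the address of the cluster's ingress gateway
url = 'http://<<INGRESS_IP>>:31380/v1/models/' + svc_name + ':predict'
res = requests.post(url, json=formData, headers=headers)
res.raise_for_status()  # fail early on a non-2xx response
results = res.json()
prediction = results['predictions']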

prediction

[3]
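The request above follows a simple pattern: a JSON body with an `instances` list is posted to `/v1/models/<name>:predict`, the `Host` header selects the service behind the ingress, and the response carries a `predictions` list. The sketch below assembles and parses these pieces without making a network call. The helper names are hypothetical, and `192.0.2.10` is a documentation-only placeholder address rather than a real ingress IP.

```python
import json

def build_predict_request(svc_name, ingress_ip, instances,
                          port=31380, domain="anonymous.example.com"):
    """Return (url, headers, body) for a `:predict` call routed through the ingress."""
    url = f"http://{ingress_ip}:{port}/v1/models/{svc_name}:predict"
    headers = {"Host": f"{svc_name}.{domain}", "Content-Type": "application/json"}
    body = json.dumps({"instances": instances})
    return url, headers, body

def parse_predictions(response_text):
    """Extract the `predictions` list, failing loudly on an unexpected payload."""
    payload = json.loads(response_text)
    if "predictions" not in payload:
        raise ValueError(f"unexpected response: {payload}")
    return payload["predictions"]

url, headers, body = build_predict_request("text-classify", "192.0.2.10", [[0.1, 0.2, 0.3]])
print(url)
print(headers["Host"])
print(parse_predictions('{"predictions": [3]}'))
```

Each entry in `instances` must have the same length and preprocessing as the rows the model was trained on, for example a 300-dimensional averaged word2vec vector when `model_choice` is 1.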

## Clean up after predicting

### Delete inference service

In [50]:
!kubectl delete -f blerssi-kfserving.yaml

inferenceservice.serving.kubeflow.org "text-classify" deleted


### Delete model folder

In [51]:
!rm -rf $file_rel_path