## Text Classification on Amazon Review Data
### This notebook follows a tutorial on Medium
[Tutorial source](https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34)

In [80]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [81]:
np.random.seed(500)

### Load dataset

In [82]:
df = pd.read_csv("corpus.csv",encoding="latin-1")

In [83]:
df.head(10)

Unnamed: 0,text,label
0,Stuning even for the non-gamer: This sound tr...,__label__2
1,The best soundtrack ever to anything.: I'm re...,__label__2
2,Amazing!: This soundtrack is my favorite musi...,__label__2
3,Excellent Soundtrack: I truly like this sound...,__label__2
4,"Remember, Pull Your Jaw Off The Floor After H...",__label__2
5,an absolute masterpiece: I am quite sure any ...,__label__2
6,"Buyer beware: This is a self-published book, ...",__label__1
7,Glorious story: I loved Whisper of the wicked...,__label__2
8,A FIVE STAR BOOK: I just finished reading Whi...,__label__2
9,Whispers of the Wicked Saints: This was a eas...,__label__2


### Data preprocessing

In [84]:
df.describe()

Unnamed: 0,text,label
count,10000,10000
unique,10000,2
top,Rochelle explains It All for You: Wondering w...,__label__1
freq,1,5097


In [85]:
df['label'].value_counts()

__label__1     5097
__label__2     4903
Name: label, dtype: int64

### Step - a : Remove blank rows if any.

In [86]:
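# Drop rows with missing text from df itself (dropna on the column alone leaves df unchanged)
df = df.dropna(subset=['text']).reset_index(drop=True)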
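
As a small illustration of this step, the sketch below drops rows with missing text from a toy frame and renumbers the index. The data and the names `toy` and `cleaned` are made up for the example and are not part of the review corpus.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing review text
toy = pd.DataFrame({
    "text": ["great album", np.nan, "poor binding"],
    "label": ["__label__2", "__label__1", "__label__1"],
})

# Drop rows where 'text' is missing, then renumber the index from 0
cleaned = toy.dropna(subset=["text"]).reset_index(drop=True)

print(len(toy), len(cleaned))
print(list(cleaned.index))
```

Resetting the index matters later in the notebook, where rows are addressed by their position in the frame.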

### Step - b : Change all the text to lower case. This is required because Python treats 'dog' and 'DOG' as different strings

In [87]:
df['text'] = [entry.lower() for entry in df['text']]

### Step - c : Tokenization : each entry in the corpus is broken into a list of words (tokens)

In [88]:
df['text']= [word_tokenize(entry) for entry in df['text']]
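
Steps b and c can be pictured with a rough stand-in that needs no NLTK data. This is only a sketch: `simple_tokenize` is a hypothetical helper built on a regular expression, and it is not equivalent to `word_tokenize`, which treats contractions and some punctuation differently.

```python
import re

def simple_tokenize(text):
    """Lower-case the text, then split it into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Amazing!: This soundtrack")
print(tokens)
```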

### Step - d : Remove stop words and non-alphabetic tokens, and perform word lemmatization.
#### WordNetLemmatizer needs POS tags to know whether a word is a noun, verb, adjective, etc. By default it treats every word as a noun

In [89]:
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
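
To see what this mapping does, here is a minimal sketch that looks up a few Penn Treebank tags by their first letter. The WordNet constants are written out as plain strings, on the assumption that they equal `wn.NOUN`, `wn.ADJ`, `wn.VERB` and `wn.ADV`, so the sketch runs without the NLTK corpora.

```python
from collections import defaultdict

# Assumed values of wn.NOUN, wn.ADJ, wn.VERB, wn.ADV, as plain strings
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

# Any first letter that is not J, V or R falls back to NOUN
tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

# Example Penn Treebank tags; only the first letter is used for the lookup
for tag in ["JJ", "VBD", "RB", "NNS", "DT"]:
    print(tag, "->", tag_map[tag[0]])
```

Tags such as `DT` (determiner) have no WordNet counterpart, which is why the default is the noun category.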

In [90]:
# Build the stop-word set and the lemmatizer once instead of on every iteration
stop_words = set(stopwords.words('english'))
word_Lemmatized = WordNetLemmatizer()
for index,entry in enumerate(df['text']):
    # List of the words in this entry that pass the filters below
    Final_words = []
    # pos_tag returns (word, tag) pairs; the tag's first letter marks noun (N), verb (V), adjective (J), adverb (R), etc.
    for word, tag in pos_tag(entry):
        # Keep only alphabetic tokens that are not stop words
        if word not in stop_words and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # Store the processed words for this entry, as a string, in the 'text_final' column
    df.loc[index,'text_final'] = str(Final_words)
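
Storing `str(Final_words)` keeps the brackets, quotes and commas of the list in the text. The sketch below, using made-up lemmas, suggests why this is harmless for the vectorization step: the regular expression shown is the default `token_pattern` of `TfidfVectorizer`, which only picks up runs of two or more word characters.

```python
import re

final_words = ["stun", "even", "gamer", "sound", "track"]  # hypothetical lemmas

# What ends up in the 'text_final' column for one entry
as_text = str(final_words)
print(as_text)

# Default token pattern of TfidfVectorizer: two or more word characters
tokens = re.findall(r"(?u)\b\w\w+\b", as_text)
print(tokens)
```

One consequence of that pattern is that single-character words never become features.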

## Prepare Train and Test Data sets

In [91]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(df['text_final'],df['label'],test_size=0.3)
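
The split can also be made reproducible per call and balanced per class. The sketch below uses a toy dataset; `random_state=500` and `stratify` are optional choices added for illustration and are not used in the cell above.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 documents with balanced labels
texts = [f"doc {i}" for i in range(20)]
labels = ["__label__1", "__label__2"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.3,      # same 70/30 ratio as above
    random_state=500,   # fixes the shuffle for this call only
    stratify=labels,    # keep the class ratio equal in both parts
)
print(len(X_train), len(X_test))
```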

## Encoding

In [92]:
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the mapping learned from the training labels instead of refitting on the test labels
Test_Y = Encoder.transform(Test_Y)
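
A minimal sketch of the encoding on made-up labels: the encoder learns the label-to-integer mapping from the training labels, and the same mapping is then applied to the test labels.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["__label__2", "__label__1", "__label__1"]  # hypothetical
test_labels = ["__label__1", "__label__2"]                 # hypothetical

encoder = LabelEncoder()
train_y = encoder.fit_transform(train_labels)  # learns the mapping
test_y = encoder.transform(test_labels)        # reuses the learned mapping

# classes_ lists the labels in sorted order; the position is the integer code
print(list(encoder.classes_))
print(train_y.tolist(), test_y.tolist())
```

Applying one fitted encoder to both sets guarantees that an integer code means the same label in the training and test data.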

In [93]:
Train_Y

array([1, 0, 0, ..., 0, 1, 1])

## Word Vectorization

In [94]:
Tfidf_vect = TfidfVectorizer(max_features=5000)
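# Note: fitting on the full corpus lets test-set vocabulary influence the features; fitting on Train_X alone avoids that
Tfidf_vect.fit(df['text_final'])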

Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
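
For intuition about what the vectorizer computes, here is a hand-rolled sketch of TF-IDF on a made-up corpus. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF, L2 normalization) and leaves out `max_features`, which restricts the vocabulary to the most frequent terms.

```python
import math
from collections import Counter

# Hypothetical tokenized documents
docs = [["good", "sound", "track"], ["bad", "book"], ["good", "book", "good"]]

n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})
# Number of documents that contain each word
doc_freq = Counter(w for d in docs for w in set(d))
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF vector of one document over vocab."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(v, 3) for v in row])
```

Words that occur in fewer documents, such as "sound" here, get a larger IDF weight than words that occur in many.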

## Use the ML Algorithms to Predict the outcome

In [95]:
# Fit the Naive Bayes classifier on the training set
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
# Predict the labels of the test set
predictions_NB = Naive.predict(Test_X_Tfidf)
# sklearn metrics expect (y_true, y_pred); with the arguments reversed, precision and recall swap
print("Naive Bayes Accuracy Score -> ",accuracy_score(Test_Y, predictions_NB)*100)
print("Naive Bayes Recall Score -> ",recall_score(Test_Y, predictions_NB)*100)
print("Naive Bayes Precision Score -> ",precision_score(Test_Y, predictions_NB)*100)
print("Naive Bayes f1 Score -> ",f1_score(Test_Y, predictions_NB)*100)

Naive Bayes Accuracy Score ->  83.23333333333333
Naive Bayes Recall Score ->  81.05335157318741
Naive Bayes Precision Score ->  83.98299078667611
Naive Bayes f1 Score ->  82.49216846501915
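
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy and F1 do not depend on that order, but precision and recall trade places when the two arguments are swapped. The sketch below computes both by hand on made-up labels to show the effect; `precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp / (tp + fp), tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # hypothetical ground truth
y_pred = [1, 1, 0, 0, 1, 1, 1, 0]  # hypothetical predictions

correct_order = precision_recall(y_true, y_pred)
swapped_order = precision_recall(y_pred, y_true)
print("true first:", correct_order)
print("pred first:", swapped_order)
```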


In [96]:
# Classifier - Algorithm - SVM
# Fit the SVM classifier on the training set (degree and gamma are ignored by the linear kernel)
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# Predict the labels of the test set
predictions_SVM = SVM.predict(Test_X_Tfidf)
# sklearn metrics expect (y_true, y_pred); with the arguments reversed, precision and recall swap
print("SVM Accuracy Score -> ",accuracy_score(Test_Y, predictions_SVM)*100)
print("SVM Recall Score -> ",recall_score(Test_Y, predictions_SVM)*100)
print("SVM Precision Score -> ",precision_score(Test_Y, predictions_SVM)*100)
print("SVM f1 Score -> ",f1_score(Test_Y, predictions_SVM)*100)

SVM Accuracy Score ->  84.66666666666667
SVM Recall Score ->  84.67852257181943
SVM Precision Score ->  83.98914518317503
SVM f1 Score ->  84.3324250681199


## To do
- Vary the data pre-processing steps and see how that affects the accuracy
- Try other word vectorization techniques such as CountVectorizer and Word2Vec
- Tune the hyperparameters of these algorithms with GridSearchCV
- Try other classification algorithms such as linear classifiers, boosting models, and neural networks
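
As a possible starting point for the tuning item, here is a minimal sketch of `GridSearchCV` over a pipeline. The toy corpus, the parameter grid and the choice of `cv=2` are assumptions made for the example; real runs would use the review data and a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = positive review, 0 = negative review
texts = ["great sound", "love this album", "excellent story", "wonderful book",
         "bad quality", "poor binding", "terrible plot", "awful waste"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline, so each fold refits both
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "nb__alpha": [0.1, 1.0],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

Keeping the vectorizer inside the pipeline means it is fitted only on the training part of each fold, so the cross-validation scores are not influenced by held-out text.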