<a href="https://colab.research.google.com/github/sdaliparthi/NLP_TextProcessing/blob/main/Text_Classification_BOW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Statement

Given an aptitude question, the task is to classify it into the sub-category it belongs to.



## Learning Objectives

At the end of the experiment, you will be able to:

*   Use Beautiful Soup to strip HTML from text
*   Use the NLTK package for text pre-processing
*   Build a text representation (Bag of Words)
*   Train and evaluate classifiers

## Dataset
Classifying questions is a challenging natural language processing task. The dataset is taken from the TalentSprint aptitude questions, which contain more than 20K questions.

## Description
This dataset has the following columns:
1. **Category:** Gives the high-level categorization of the question
2. **Sub-Category:** Determines the type of questions
3. **Article:** Gives the article name of the question
4. **Questions:** Questions are listed
5. **Answers:** Contains answers


In [1]:
#@title Download the datasets
from IPython import get_ipython

ipython = get_ipython()

def setup():
    # Download both CSV files with the %sx line magic
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Aptitude_Classification_data.csv")
    ipython.run_line_magic("sx", "wget https://cdn.talentsprint.com/aiml/Experiment_related_data/Mentors_Test_Data.csv")
    print("Setup completed successfully")

setup()

Setup completed successfully


In [2]:
# Import Python Libraries
from bs4 import BeautifulSoup
import nltk
import re
import string
import warnings
import numpy as np
import pandas as pd
from collections import Counter
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
warnings.filterwarnings('ignore')
nltk.download('punkt')
nltk.download("stopwords")
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [3]:
# YOUR CODE HERE TO LOAD THE APTITUDE CLASSIFICATION DATASET & EXTRACT THE DATA BASED ON YOUR SUB-CATEGORIES
## Sub-Category = Finding Errors, Ratio and Proportion, Logarithms, Time and Distance, Simple and Compound Interest
### Read Data
data = pd.read_csv('Aptitude_Classification_data.csv')

### Select the rows based on required Sub-Category
reqSubCat = ['Finding Errors', 'Ratio and Proportion', 'Logarithms', 'Time and Distance', 'Simple and Compound Interest']
reqData = data[data['Sub-Category'].isin(reqSubCat)].copy()  # copy so later column assignments do not act on a view of `data`
assert sorted(reqData['Sub-Category'].unique()) == sorted(reqSubCat)
reqData.head()


|    | Category | Sub-Category | Article | Questions | Answers |
|---|---|---|---|---|---|
| 1 | Quantitative | Time and Distance | Time and Distance - Model 05 | Rohan leaves point A and reaches point B in 6 ... | 2 |
| 2 | Verbal | Finding Errors | 44054 | Read the sentence to find out whether there is... | 2 |
| 5 | Verbal | Finding Errors | 44054 | Read the sentence to find out whether there is... | 2 |
| 9 | Verbal | Finding Errors | 44054 | Read the sentence to find out whether there is... | 2 |
| 12 | Quantitative | Time and Distance | Time and Distance - Model 03 | Two cars start from the same point at the same... | 3 |


## **Stage 2:** Data Pre-Processing




###  Clean and Transform the data into a specified format

*   Remove the rows in which the Questions column is blank or NaN.


*   A few questions contain HTML tags.
  - You can use the Beautiful Soup library to convert HTML into text (refer to the **"Dealing with HTML"** section of this [link](https://www.nltk.org/book/ch03.html)).


*  Use the Questions column as the feature and Sub-Category as the target variable. Convert Sub-Category into numeric labels.

*  Drop the unwanted columns.


  **Hint:** Use Label Encoder to obtain a numeric representation; refer to the [link](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [4]:
# YOUR CODE HERE for BeautifulSoup

## Remove the rows with blank/NaN Questions
print(reqData.isna().sum())
print(f"\n ##> Data shape before dropna on Questions : {reqData.shape}")
reqData = reqData.dropna(subset=['Questions'])
print(f"\n ##> Data shape after dropna on Questions : {reqData.shape}")

## Convert HTML into text
reqData['Questions'] = reqData['Questions'].apply(lambda text: BeautifulSoup(text, 'html.parser').get_text())
print(reqData['Questions'])

## Drop unwanted columns
reqData = reqData.drop(labels=['Category', 'Article', 'Answers'], axis=1)


Category        0
Sub-Category    0
Article         0
Questions       0
Answers         0
dtype: int64

 ##> Data shape before dropna on Questions : (1607, 5)

 ##> Data shape after dropna on Questions : (1607, 5)
1       Rohan leaves point A and reaches point B in 6 ...
2       Read the sentence to find out whether there is...
5       Read the sentence to find out whether there is...
9       Read the sentence to find out whether there is...
12      Two cars start from the same point at the same...
                              ...                        
4620    Until 1850, the speed of signals along nerves ...
4622    Read the sentence to find out whether there is...
4623    If a man cycles at 10 km/hr, then he arrives a...
4625    Raj and Sai have money in the ratio 3 : 4. Twi...
4630    A man sets out on cycle from Delhi to Faridaba...
Name: Questions, Length: 1607, dtype: object


In [5]:
reqData.head()

|    | Sub-Category | Questions |
|---|---|---|
| 1 | Time and Distance | Rohan leaves point A and reaches point B in 6 ... |
| 2 | Finding Errors | Read the sentence to find out whether there is... |
| 5 | Finding Errors | Read the sentence to find out whether there is... |
| 9 | Finding Errors | Read the sentence to find out whether there is... |
| 12 | Time and Distance | Two cars start from the same point at the same... |


In [6]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder() # DO NOT CHANGE THIS LINE as we will be using for the Test evaluation.

# YOUR CODE HERE for Fit label encoder and return encoded labels
reqData['Sub-Category'] = le.fit_transform(reqData['Sub-Category'])
reqData.head()

|    | Sub-Category | Questions |
|---|---|---|
| 1 | 4 | Rohan leaves point A and reaches point B in 6 ... |
| 2 | 0 | Read the sentence to find out whether there is... |
| 5 | 0 | Read the sentence to find out whether there is... |
| 9 | 0 | Read the sentence to find out whether there is... |
| 12 | 4 | Two cars start from the same point at the same... |
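
`LabelEncoder` assigns integer codes in sorted order of the class names, which is consistent with `Finding Errors` becoming 0 and `Time and Distance` becoming 4 above. The following is a minimal sketch of how the mapping can be inspected and reversed. It uses a fresh encoder named `demo_le` (a name introduced here for illustration) so that the notebook's `le` stays untouched.

```python
from sklearn.preprocessing import LabelEncoder

sub_categories = [
    'Finding Errors', 'Ratio and Proportion', 'Logarithms',
    'Time and Distance', 'Simple and Compound Interest',
]

demo_le = LabelEncoder()  # hypothetical name; separate from the notebook's `le`
codes = demo_le.fit_transform(sub_categories)

# classes_ holds the class names in sorted order; the position is the code
mapping = {name: code for code, name in enumerate(demo_le.classes_)}
print(mapping)

# inverse_transform turns integer codes back into readable names
decoded = list(demo_le.inverse_transform([4, 0]))
print(decoded)
```

The same `inverse_transform` call can be applied to a classifier's predictions to report sub-category names instead of integers.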


# Bag of Words (BOW)

## **Stage 3:** Text representation using Bag of Words (BOW)

###  a) Get valid words from all questions & add them to a list.


Treat each question as a separate document and get the list of words using the following:
1.   Split the sentence into words

2.   Remove Stop words. Use NLTK packages for getting the Stop words.

3.   Replace proper names with "name" 
  - Example: "Rahul" -> "name"
       
4.   Remove whitespace characters (\n, \t, \f, \r); refer to this [link](https://developers.google.com/edu/python/regular-expressions)

5.   Ignore words whose length is less than 3 (e.g. 'is', 'a').

6.   Remove punctuation and non-alphabetic words

7.   Convert the text to lowercase

8.   Use the Porter Stemmer to normalize the words


Refer to this [link](https://www.nltk.org/book/ch03.html) for extracting the words.

Refer to this [link](https://medium.com/free-code-camp/an-introduction-to-bag-of-words-and-how-to-code-it-in-python-for-nlp-282e87a9da04) for more information on Bag of Words.

In [7]:
def extract_words(question):
    # YOUR CODE HERE
    # Hint: Extract words for each question using the above 8 instructions.
    porter = nltk.PorterStemmer()
    stopWords = set(stopwords.words('english'))  # build the stop-word set once per call, not once per token
    ## Split the words
    wordTokens = word_tokenize(question)
    ## Remove stop words (case-sensitive, so capitalized stop words such as 'The' are kept)
    wordListNoStop = [w for w in wordTokens if w not in stopWords]
    ## Replace proper names
    ## NOTE: nltk.pos_tag returns a list of (word, tag) tuples, so comparing it with a
    ## tag string is always False; as written, no token is replaced with 'name'.
    wordListNoName1 = ['name' if nltk.pos_tag([w]) == 'NNP' else w for w in wordListNoStop]
    wordListNoName = ['name' if nltk.pos_tag([w]) == 'NNPs' else w for w in wordListNoName1]
    ## Remove white spaces
    wordListNoWS = [re.sub(r'\s+', '', w) for w in wordListNoName]
    ## Ignore words with length less than 3
    wordListNoLen = [w for w in wordListNoWS if len(w) >= 3]
    ## Remove punctuation and non-alphabetic words
    wordListNoPunc = [w for w in wordListNoLen if w not in string.punctuation]
    wordListNoNonAlpha = [w for w in wordListNoPunc if w.isalpha()]
    ## Convert to lower case
    wordListLC = [w.lower() for w in wordListNoNonAlpha]
    ## Apply Porter Stemmer
    wordList = [porter.stem(w) for w in wordListLC]
    return wordList
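
`nltk.pos_tag` returns a list of `(token, tag)` pairs, and the Penn Treebank tags for proper nouns are `NNP` (singular) and `NNPS` (plural). The sketch below shows one way the "replace proper names" step could be written against that shape. The helper `replace_proper_names` and the hand-tagged sample are assumptions made for illustration; in the notebook, the pairs would presumably come from `nltk.pos_tag(word_tokenize(question))`, which also gives the tagger the whole sentence as context.

```python
def replace_proper_names(tagged_tokens, placeholder='name'):
    """Replace tokens tagged as proper nouns (NNP, NNPS) with a placeholder.

    `tagged_tokens` is a list of (token, tag) pairs, the shape that
    nltk.pos_tag returns for a tokenized sentence.
    """
    proper_tags = {'NNP', 'NNPS'}  # Penn Treebank proper-noun tags
    return [placeholder if tag in proper_tags else token
            for token, tag in tagged_tokens]


# Hand-tagged sample (an assumption for illustration, not real tagger output)
sample = [('Rahul', 'NNP'), ('walks', 'VBZ'), ('to', 'TO'),
          ('Delhi', 'NNP'), ('daily', 'RB')]

result = replace_proper_names(sample)
print(result)  # ['name', 'walks', 'to', 'name', 'daily']
```

Working on the tag of each pair, rather than on the list that `pos_tag` returns, is what makes the comparison meaningful.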

In [8]:
def tokenize(allquestions):
  valid_words = []
  for question in allquestions:
    words = extract_words(question)
    valid_words.extend(words)
  # Sort the unique words so each word keeps the same vector index across runs
  return sorted(set(valid_words))

In [9]:
# Use the function to extract words for all questions
# YOUR CODE HERE
vocab = tokenize(reqData['Questions'])
len(vocab)#, vocab

2429

###  b) Generate vectors that can be used by the machine learning algorithm

1.   The length of the vector for each question equals the number of valid words (the vocabulary size). Initialize each vector with all zeros.

2.   For each valid word, store the number of times it occurs in the question at that word's position in the vector.
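
As a small worked example of these two steps, the sketch below builds a count vector for one tokenized question. The vocabulary, the tokens, and the helper name `count_vector` are made up for illustration and are not taken from the dataset.

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for missing keys, so absent words stay zero
    return [counts[word] for word in vocabulary]


toy_vocab = ['distanc', 'name', 'ratio', 'speed']  # hypothetical stemmed vocabulary
toy_tokens = ['name', 'speed', 'distanc', 'speed', 'train']  # 'train' is not in the vocabulary

vector = count_vector(toy_tokens, toy_vocab)
print(vector)  # [1, 1, 0, 2]
```

Words outside the vocabulary, such as `'train'` here, contribute nothing to the vector, which is also what happens to unseen words when new questions are vectorized later.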



In [10]:
def generate_vectors(question):
    # YOUR CODE HERE
    # Hint: Initialize each vector with all zeros. 
    #reqVec = np.zeros((1, len(vocab)))
    reqVec = np.zeros((len(vocab)))

    # Extracting words for each question and count the words
    words = extract_words(question)
    word_dict = Counter(words)

    # YOUR CODE HERE 
    # Hint: If the word is in valid words then generate the vectors based on the counter frequency of the word in that question.
    for i,w in enumerate(vocab):
        if w in word_dict.keys():
            #reqVec[0,i] = word_dict[w]
            reqVec[i] = word_dict[w]
    return reqVec

In [11]:
# Use the above function for collecting the vectors of all questions into a list.
# YOUR CODE HERE
features = np.array([generate_vectors(sentence) for sentence in reqData['Questions']])
labels = reqData['Sub-Category'].to_numpy()
features.shape, labels.shape

((1607, 2429), (1607,))

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reqData['Questions'])
print(X.toarray())
X.shape

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


(1607, 3947)
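
By default, `CountVectorizer` lowercases the text and keeps every token of two or more word characters, with no stop-word removal, stemming, or alphabetic filter, which probably accounts for the larger vocabulary here (3947 columns against 2429). A custom tokenizer can be passed in to align the two representations. The sketch below uses a stand-in tokenizer, `simple_tokens`, and two toy documents, all of which are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer


def simple_tokens(text):
    """Stand-in tokenizer (an assumption): lowercase alphabetic words of length >= 3."""
    return [w.lower() for w in text.split() if w.isalpha() and len(w) >= 3]


docs = ['Find the ratio of speed', 'The speed of the train']

# Default settings: lowercase, tokens of 2+ word characters
default_vec = CountVectorizer().fit(docs)

# Custom tokenizer; token_pattern=None because the pattern is unused when a tokenizer is given
custom_vec = CountVectorizer(tokenizer=simple_tokens, lowercase=False,
                             token_pattern=None).fit(docs)

default_vocab = sorted(default_vec.vocabulary_)
custom_vocab = sorted(custom_vec.vocabulary_)
print(default_vocab)
print(custom_vocab)
```

In the notebook, passing `tokenizer=extract_words` in the same way would presumably make `CountVectorizer` reuse the hand-written pre-processing.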

## **Stage 4:** Classification

### Perform a Classification 

1.   Identify the features and labels

2.   Use train_test_split for splitting the train and test data

3.   Fit your model on the train set using fit() and perform prediction on the test set using predict()

4. Get the accuracy of the model
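
The four steps can be sketched end to end on synthetic data. Everything below is illustrative: the count features are generated rather than taken from the questions, the names `X_demo`, `y_demo`, and `clf` are introduced here, and `stratify` is an optional extra that the notebook's own split does not use.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic count features (assumption): class 0 uses columns 0-1, class 1 uses columns 2-3
n_per_class = 50
zeros = np.zeros((n_per_class, 2), dtype=int)
class0 = np.hstack([rng.integers(1, 4, (n_per_class, 2)), zeros])
class1 = np.hstack([zeros, rng.integers(1, 4, (n_per_class, 2))])

# Step 1: features and labels
X_demo = np.vstack([class0, class1])
y_demo = np.array([0] * n_per_class + [1] * n_per_class)

# Step 2: 80/20 split; stratify keeps the class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=9, stratify=y_demo)

# Step 3: fit on the train set, predict on the test set
clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Step 4: accuracy on the held-out part
demo_accuracy = accuracy_score(y_te, y_hat)
print(X_tr.shape, X_te.shape, demo_accuracy)
```

Because the synthetic classes are perfectly separable, this toy problem is much easier than the real one; it only illustrates the call sequence.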

## Expected Accuracy above 90%


In [13]:
from sklearn.model_selection import train_test_split
# YOUR CODE HERE
xTrain, xTest, yTrain, yTest = train_test_split(features, labels, test_size=0.20, random_state = 9)

In [14]:
xTrain.shape, xTest.shape, yTrain.shape, yTest.shape

((1285, 2429), (322, 2429), (1285,), (322,))

In [15]:
## Decision Tree
dtClf = DecisionTreeClassifier(criterion='entropy', max_depth=31, random_state=1)
dtClf.fit(xTrain, yTrain)
yPred = dtClf.predict(xTest)
print(f"##> Training Accuracy with DECISION TREE : {dtClf.score(xTrain,yTrain)}")
print(f"##> Testing Accuracy with DECISION TREE : {accuracy_score(yTest, yPred)}\n")

##> Training Accuracy with DECISION TREE : 0.9758754863813229
##> Testing Accuracy with DECISION TREE : 0.9192546583850931



In [16]:
## KNN
knnClf = KNeighborsClassifier(n_neighbors=2, weights='distance', p=1)
#knnClf = KNeighborsClassifier()
knnClf.fit(xTrain,yTrain)
yPred = knnClf.predict(xTest)
print(f"##> Training Accuracy with KNN : {knnClf.score(xTrain,yTrain)}")
print(f"##> Testing Accuracy with KNN : {accuracy_score(yTest, yPred)}\n")

##> Training Accuracy with KNN : 0.9906614785992218
##> Testing Accuracy with KNN : 0.8229813664596274



In [17]:
## Linear Classifier
from sklearn.linear_model import SGDClassifier
linClf = SGDClassifier(loss='perceptron', alpha=0.001)
linClf.fit(xTrain,yTrain)
yPred = linClf.predict(xTest)
print(f"##> Training Accuracy with LINEAR CLASSIFIER : {linClf.score(xTrain,yTrain)}")
print(f"##> Testing Accuracy with LINEAR CLASSIFIER : {accuracy_score(yTest, yPred)}\n")

##> Training Accuracy with LINEAR CLASSIFIER : 0.9898832684824903
##> Testing Accuracy with LINEAR CLASSIFIER : 0.9316770186335404



In [18]:
## Logistic Regression
from sklearn.linear_model import LogisticRegression
logClf = LogisticRegression(random_state=2)
logClf.fit(xTrain,yTrain)
yPred = logClf.predict(xTest)
print(f"##> Training Accuracy with LOGISTIC REGRESSION : {logClf.score(xTrain,yTrain)}")
print(f"##> Testing Accuracy with LOGISTIC REGRESSION : {accuracy_score(yTest, yPred)}\n")

##> Training Accuracy with LOGISTIC REGRESSION : 0.9891050583657588
##> Testing Accuracy with LOGISTIC REGRESSION : 0.9409937888198758



In [19]:
## SVM
from sklearn.svm import SVC 
svmClf = SVC(C=2.0)
svmClf.fit(xTrain, yTrain)
yPred = svmClf.predict(xTest)
print(f"##> Training Accuracy with SVM CLASSIFIER : {svmClf.score(xTrain,yTrain)}")
print(f"##> Testing Accuracy with SVM CLASSIFIER : {accuracy_score(yTest, yPred)}\n")

##> Training Accuracy with SVM CLASSIFIER : 0.977431906614786
##> Testing Accuracy with SVM CLASSIFIER : 0.9409937888198758



In [20]:
## Ensemble : Voting
from sklearn.ensemble import VotingClassifier
voteClf = VotingClassifier(estimators=[('SVC',svmClf),('Dtree',dtClf),('LogReg',logClf)], voting = 'hard')
voteClf.fit(xTrain, yTrain)
yPred = voteClf.predict(xTest)
print(f"##> Training Accuracy with VOTING ENSEMBLE : {voteClf.score(xTrain,yTrain)}")
print(f"##> Testing Accuracy with VOTING ENSEMBLE : {accuracy_score(yTest, yPred)}\n")

##> Training Accuracy with VOTING ENSEMBLE : 0.9859922178988327
##> Testing Accuracy with VOTING ENSEMBLE : 0.9409937888198758



In [21]:
voteClf.estimators

[('SVC',
  SVC(C=2.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False)),
 ('Dtree',
  DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                         max_depth=31, max_features=None, max_leaf_nodes=None,
                         min_impurity_decrease=0.0, min_impurity_split=None,
                         min_samples_leaf=1, min_samples_split=2,
                         min_weight_fraction_leaf=0.0, presort='deprecated',
                         random_state=1, splitter='best')),
 ('LogReg',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='auto', n_jobs=None, penalty='l2',
                     random_state=2, solver='lbfgs'

In [22]:
## Ensemble : Bagging + Voting
from sklearn.ensemble import BaggingClassifier
bgVotClf = BaggingClassifier(base_estimator=voteClf, n_estimators=10, bootstrap=True)
bgVotClf.fit(xTrain, yTrain)
yPred = bgVotClf.predict(xTest)
print(f"##> Training Accuracy with BAGGING + VOTING : {bgVotClf.score(xTrain,yTrain)}")
print(f"##> Testing Accuracy with BAGGING + VOTING : {accuracy_score(yTest, yPred)}\n")

##> Training Accuracy with BAGGING + VOTING : 0.9875486381322958
##> Testing Accuracy with BAGGING + VOTING : 0.937888198757764



In [23]:
## Ensemble : Bagging + DT
from sklearn.ensemble import BaggingClassifier
bgDTClf = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='gini', max_depth=29), n_estimators=100, bootstrap=True)
bgDTClf.fit(xTrain, yTrain)
yPred = bgDTClf.predict(xTest)
print(f"##> Training Accuracy with BAGGING + DECISION TREE : {bgDTClf.score(xTrain,yTrain)}")
print(f"##> Testing Accuracy with BAGGING + DECISION TREE : {accuracy_score(yTest, yPred)}\n")

##> Training Accuracy with BAGGING + DECISION TREE : 0.9867704280155642
##> Testing Accuracy with BAGGING + DECISION TREE : 0.922360248447205



In [24]:
## Ensemble : Bagging + LR
from sklearn.ensemble import BaggingClassifier
bgLRClf = BaggingClassifier(base_estimator=LogisticRegression(random_state=2), n_estimators=30, bootstrap=True)
bgLRClf.fit(xTrain, yTrain)
yPred = bgLRClf.predict(xTest)
print(f"##> Training Accuracy with BAGGING + LOGISTIC REGRESSION : {bgLRClf.score(xTrain,yTrain)}")
print(f"##> Testing Accuracy with BAGGING + LOGISTIC REGRESSION : {accuracy_score(yTest, yPred)}\n")

##> Training Accuracy with BAGGING + LOGISTIC REGRESSION : 0.9867704280155642
##> Testing Accuracy with BAGGING + LOGISTIC REGRESSION : 0.9347826086956522



## **Stage 5:** Evaluation with entirely new data

### Evaluate with the given test data 

1.  Loading the Test data

2.  Converting the Test data into vectors

3.  Pass through the model and verify the accuracy
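
One detail when loading new data: `LabelEncoder.transform` raises a `ValueError` for any label it did not see during fitting, so rows with other sub-categories have to be filtered out first. The sketch below illustrates this on a toy frame; `toy_le`, `test_df`, and the sub-category `Percentages` are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stands in for the notebook's fitted `le` (hypothetical two-class encoder)
toy_le = LabelEncoder().fit(['Finding Errors', 'Logarithms'])

# Toy test frame (assumption): one row has a sub-category the encoder never saw
test_df = pd.DataFrame({
    'Sub-Category': ['Logarithms', 'Percentages', 'Finding Errors'],
    'Questions': ['q1', 'q2', 'q3'],
})

try:
    toy_le.transform(test_df['Sub-Category'])
    raised = False
except ValueError:
    raised = True  # transform rejects labels outside classes_

# Keep only rows whose label is known, then encode
known = test_df[test_df['Sub-Category'].isin(toy_le.classes_)]
encoded = list(toy_le.transform(known['Sub-Category']))
print(raised, list(known['Questions']), encoded)
```

Filtering with `isin(le.classes_)` before calling `transform` keeps the test labels aligned with the codes the classifier was trained on.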

## Expected Accuracy above 90%


In [25]:
# YOUR CODE HERE for selecting the trained classifier model, eg: MODEL = decision_tree
MODEL = voteClf  # alternatives: bgDTClf, bgVotClf, dtClf

Test_data = pd.read_csv("Mentors_Test_Data.csv")
# Keep only rows whose sub-category the label encoder was fitted on
Test_data = Test_data[Test_data['Sub-Category'].isin(le.classes_)]
test_labels = le.transform(Test_data['Sub-Category'])
Test_questions = Test_data['Questions']

# Vectorize each test question with the training vocabulary
Test_BOW = []
for TQ in Test_questions:
  Test_vectors = generate_vectors(TQ)
  Test_BOW.append(Test_vectors)

predict = MODEL.predict(Test_BOW)
accuracy_score(test_labels, predict)

0.9438202247191011