Sub Task - 2: Learning a supervised multi-Topic Classifier

“Topic classification is a ‘supervised’ machine learning technique, one that needs training before being able to automatically analyze texts.”

Based on the Relevant Topic clusters identified in sub task 1, you can automatically annotate the documents with the topic names provided to you.

You need to learn a Supervised Classifier(or Ensemble of Classifiers) that can be used to label any document with the set of topics that have been identified

PLEASE NOTE that the number of topics that this classifier can predict on directly depends on the number of Relevant Topic Clusters that have been discovered in Subtask-1. So there is no point of manually annotating samples for a topic that hasn’t been identified in subtask-1.

By Relevant Topic Clusters, we are referring to those topic clusters that are relevant to any of the Provided Topics


## Prerequisites

In [1]:
# Import the Libraries
import pandas as pd
import numpy as np
from numpy import random
import pprint


import gensim
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)



## Understanding Data

In [2]:
#Read the data 
df = pd.read_csv(r'C:\Users\Ruchit Singh\Desktop\SentiSum NLP exercise\Labelled Dataset.csv')
df.head(10)

Unnamed: 0.1,Unnamed: 0,Dominant_Topic,Topic_Keywords,Num_Documents,Perc_Documents,Text
0,0.0,9.0,"garage, local, tyre, fit, deliver, select, fit...",695.0,0.0686,"""Tires where delivered to the garage of my cho..."
1,1.0,16.0,"excellent, service, recommend, money, highly, ...",614.0,0.0606,Very easy to use and good value for money.
2,2.0,1.0,"easy, find, cheap, convenient, quick, local, t...",459.0,0.0453,Really easy and convenient to arrange
3,3.0,0.0,"fitting, centre, delivery, excellent, price, a...",915.0,0.0903,It was so easy to select tyre sizes and arrang...
4,4.0,9.0,"garage, local, tyre, fit, deliver, select, fit...",321.0,0.0317,service was excellent. Only slight downside wa...
5,5.0,12.0,"service, efficient, quick, friendly, helpful, ...",363.0,0.0358,User friendly Website. Competitive Prices. Goo...
6,6.0,13.0,"price, good, competitive, easy, service, reaso...",441.0,0.0435,Excellent prices and service
7,7.0,3.0,"great, service, price, fantastic, brilliant, f...",508.0,0.0501,It was very straightforward and the garage was...
8,8.0,9.0,"garage, local, tyre, fit, deliver, select, fit...",380.0,0.0375,Use of local garage.
9,9.0,11.0,"good, service, price, great, communication, pr...",384.0,0.0379,"""Easy to use, also good price."""


In [3]:
# List of columns 
df.columns

Index(['Unnamed: 0', 'Dominant_Topic', 'Topic_Keywords', 'Num_Documents',
       'Perc_Documents', 'Text'],
      dtype='object')

In [4]:
#Drop the unnecessary Columns
df = df.drop(['Unnamed: 0','Dominant_Topic','Num_Documents','Perc_Documents'],axis=1)
df.head()

Unnamed: 0,Topic_Keywords,Text
0,"garage, local, tyre, fit, deliver, select, fit...","""Tires where delivered to the garage of my cho..."
1,"excellent, service, recommend, money, highly, ...",Very easy to use and good value for money.
2,"easy, find, cheap, convenient, quick, local, t...",Really easy and convenient to arrange
3,"fitting, centre, delivery, excellent, price, a...",It was so easy to select tyre sizes and arrang...
4,"garage, local, tyre, fit, deliver, select, fit...",service was excellent. Only slight downside wa...


## Data Cleaning and Preprocessing 

In [5]:
print(df['Text'].apply(lambda x: len(x.split(' '))).sum())
print(df['Topic_Keywords'].apply(lambda x: len(x.split(' '))).sum())

227520
101310


In [6]:
#Print the tags of the text
def print_plot(index):
    example = df[df.index == index][['Text','Topic_Keywords',]].values[0]
    if len(example) > 0:
        print(example[0])
        print('Tag:', example[1])
print_plot(10)

Outstanding values for money and a friendly professional service
Tag: excellent, service, recommend, money, highly, save, friend, family, brilliant, satisfied


In [7]:
REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwors from text
    return text
    
df['Text'] = df['Text'].apply(clean_text)
print_plot(10)

  ' Beautiful Soup.' % markup)
  ' Beautiful Soup.' % markup)
  ' Beautiful Soup.' % markup)


outstanding values money friendly professional service
Tag: excellent, service, recommend, money, highly, save, friend, family, brilliant, satisfied


In [8]:
df['Text'].apply(lambda x: len(x.split(' '))).sum()

129192

## Splitting the Dataset into Train and Test Set

In [9]:
X = df.Text
y = df.Topic_Keywords
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)

In [10]:
#My_tags is the list of labelled task provided in the md file
my_tags=['value for money','garage service','ease of booking','tyre quality','mobile fitter','location',
         'length of fitting','delivery punctuality','booking confusion','wait time','discounts','change of date']

## Classifying Text Data

### Naive Bayes Algorithm

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

nb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
              ])
nb.fit(X_train, y_train)

from sklearn.metrics import *
y_pred = nb.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))


accuracy 0.44203256043413913


### Stochastic Gradient Descent Classifier  

In [12]:
from sklearn.linear_model import SGDClassifier

sgd = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42, max_iter=5, tol=None)),
               ])
sgd.fit(X_train, y_train)



y_pred = sgd.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))

accuracy 0.5880611741489886


### Logistic Regression 

In [13]:
from sklearn.linear_model import LogisticRegression

logreg = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', LogisticRegression(n_jobs=1, C=1e5)),
               ])
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))



accuracy 0.49235323137641834


### XGB Classifier

In [14]:
from xgboost import XGBClassifier
xgb = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer()),
                ('clf', XGBClassifier()),
               ])

xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)

print('accuracy %s' % accuracy_score(y_pred, y_test))



accuracy 0.5648741983226443


SGD Model performs the best on our data compared to other models henceforth, we'll use sgd to predict on our sample input

## Testing on our sample input

In [15]:
test_data = 'A perfectly easy way to order tyres online. Just enter your car registration number, check the recommended tyres are correct and select the tyres you want. Select the best time and venue for the fitting and pay online. The whole process is easy and the best value for money!'
predict = sgd.predict([test_data])

In [16]:
#Function to extract the Topic tage=s
def final(data):
    data = data.split(',')
    data = [x.strip(' ') for x in data]
    for n,i in enumerate(data):
        if i == 'garage' or i=='quick' or i=='efficient':
            data[n]='garage service'
            
        if i == 'service' or i == 'money' or i == 'expensive' or i == 'rate' or i == 'cost' or i == 'value':
            data[n] = 'value for money'
            
        if i == 'service' or i == 'appointment' or i=='scheduling' or i=='booking':
            data[n] = 'ease of booking' 
            
        if i == 'tyre' or i == 'tyres':
            data[n] = 'tyre quality'
            
        if i == 'mechanics' or i =='fitting' or i == 'mobile' or i == 'fitted':
            data[n] = 'mobile fitter'
            
        if i == 'close' or i=='nearby' or i=='far':
            data[n] = 'location'
            
        if i == 'duration' or i=='time':
            data[n] = 'length of fitting'
            
        if i == 'time' or i == 'punctuality' or i=='quick':
            data[n] = 'delivery punctuality'
            
        if i == 'booking' or i == 'confusion' or 'problem scheduling':
            data[n] == 'booking confusion'
            
        if i == 'long' or i == 'wait':
            data[n] = 'wait time'
            
        if i == 'discount' or i == 'reduce' or i=='reduced':
            data[n] = 'discounts'
            
        if i == 'date' or i=='different date':
            data[n] = 'change of date'
     
    for i in data:
        if i not in my_tags:
            data.remove(i)
    
    return data
    

In [17]:
#Output of the sample input
final(str(predict))

['process', 'choose', 'happy', 'mobile fitter', 'purchase']

### Saving the model

In [18]:
import pickle
# save the model to disk
filename = 'finalized_model.sav'
pickle.dump(sgd, open(filename, 'wb'))
