# Building an Automatic Product Title Tagging Engine

Import all the packages and classes required for pre-processing, modelling, and evaluation.

In [5]:
import pandas as pd
from time import time
import os
import numpy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Reading the train, test and evaluation data from the files into dataframes

In [13]:
os.getcwd()

'/home/sugnakar/Rocky/BigData/Downloads/m2 zip/CPEE_Batch26_Scholarship_Exam/data'

In [14]:
train_raw = pd.read_csv("Train.csv")
test_raw = pd.read_csv("Test.csv")
eval_raw = pd.read_csv("EvalData.csv")

In [15]:
#Copying the test and train data to other variables so that original data is not disturbed
train_data = train_raw.copy()
test_data = test_raw.copy()
eval_data = eval_raw.copy()

# Data Preprocessing

Before we do anything, let's take a look at the training data.

In [16]:
train_data.describe()

| | ptitle | brand_title | target |
|---|---|---|---|
| count | 56493 | 56493 | 56493 |
| unique | 54799 | 475 | 102 |
| top | ????????? | OTHER | Mobile & Tablets_Tablet and Smartphone Accesso... |
| freq | 802 | 13778 | 14233 |


Based on the above output, 802 records have the title "?????????". We have two choices:

1. Ignore these records by deleting the rows.
2. Replace the ptitle of all these rows with "DummyTitle".

As we do not have a customer to ask, because this is an exam :-), let's just replace the title values with "DummyTitle".

In [17]:
#Replace the placeholder title "?????????" (the top value in the data description) with " DummyTitle "
train_data = train_raw.copy()
train_data['ptitle'] = train_data['ptitle'].replace("?????????"," DummyTitle ")

train_data.dropna(inplace=True)
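To make the two choices concrete, here is a minimal sketch on a small made-up frame. The sample rows and the variable names are hypothetical; only the placeholder string and the `ptitle` column name come from the data above.

```python
import pandas as pd

# Hypothetical sample rows; the real titles come from Train.csv.
sample = pd.DataFrame({
    "ptitle": ["?????????", "Leather phone case", "?????????", "USB cable 2m"],
    "target": ["A", "B", "A", "C"],
})

placeholder = "?????????"

# Count how many titles are exactly the placeholder string.
is_placeholder = sample["ptitle"] == placeholder
placeholder_count = int(is_placeholder.sum())
print(placeholder_count)

# Choice 1: drop the placeholder rows.
dropped = sample[~is_placeholder]

# Choice 2: replace the placeholder with a dummy title.
# Series.replace matches whole values here, so no regex is involved.
replaced = sample["ptitle"].replace(placeholder, " DummyTitle ")
print(replaced.tolist())
```

Choice 2 keeps the row count unchanged, which is what the notebook does.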


Numbers in the titles are not of much help during classification. We can do one of two things: replace all numbers with a dummy string, or remove them.

Note: Whether the numbers are of any relevance to the classification would be a question for the customer.

Let's replace every run of two or more digits with the string " suspectnumber ".


In [18]:
number_replacer = " suspectnumber "

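#regex=True is required in recent pandas versions, where str.replace treats the pattern as a literal string by default
train_data["ptitle"] = train_data["ptitle"].str.replace(r"\d{2,}", number_replacer, regex=True)
test_data["ptitle"] = test_data["ptitle"].str.replace(r"\d{2,}", number_replacer, regex=True)
eval_data["ptitle"] = eval_data["ptitle"].str.replace(r"\d{2,}", number_replacer, regex=True)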
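A quick check of what the pattern does and does not match can be sketched like this. The sample titles are made up for illustration.

```python
import pandas as pd

# Made-up titles: one with a single digit, two with longer digit runs.
titles = pd.Series(["iPhone 6 case 16GB", "Model 2015 cover", "Pack of 3"])

# \d{2,} matches runs of two or more digits, so single digits are kept.
replaced = titles.str.replace(r"\d{2,}", " suspectnumber ", regex=True)
print(replaced.tolist())
```

Single digits such as the "6" in "iPhone 6" survive, while "16" and "2015" are replaced by the placeholder token.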

The train, test, and evaluation data contain many special characters that do not help during classification. Let's remove every character other than English letters, digits, spaces, newlines, and periods.

In [19]:
#Let's remove all the special characters and keep only digits, English letters, spaces, newlines and periods
train_data["ptitle"] = train_data["ptitle"].str.replace(r'[^a-zA-Z0-9 \n.]', "", regex=True)
test_data["ptitle"] = test_data["ptitle"].str.replace(r'[^a-zA-Z0-9 \n.]', "", regex=True)
eval_data["ptitle"] = eval_data["ptitle"].str.replace(r'[^a-zA-Z0-9 \n.]', "", regex=True)
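Since the same two steps are applied to three dataframes, they could be wrapped in one helper. The sketch below is only an illustration: `clean_titles` is a hypothetical name and the sample titles are made up.

```python
import pandas as pd

NUMBER_REPLACER = " suspectnumber "

def clean_titles(titles):
    """Apply the two clean-up steps from this section to a Series of titles."""
    # Step 1: replace runs of two or more digits with the placeholder token.
    titles = titles.str.replace(r"\d{2,}", NUMBER_REPLACER, regex=True)
    # Step 2: drop everything except letters, digits, spaces, newlines and periods.
    return titles.str.replace(r"[^a-zA-Z0-9 \n.]", "", regex=True)

# Made-up titles to see the effect of the clean-up.
sample = pd.Series(["T-shirt (blue) 100% cotton", "Wi-Fi router v2.0"])
cleaned = clean_titles(sample)
print(cleaned.tolist())
```

Note that removing hyphens joins the two halves of a word ("T-shirt" becomes "Tshirt"), which is the kind of side effect worth checking for when inspecting the cleaned data.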

After the data clean-up, let's check the description of the data again. Optionally, dump the modified data into a file and check whether the clean-up was done properly or removed any key information.

In [20]:
#train_data.to_csv("modified.csv")
train_data.describe()

| | ptitle | brand_title | target |
|---|---|---|---|
| count | 56493 | 56493 | 56493 |
| unique | 52348 | 475 | 102 |
| top | DummyTitle | OTHER | Mobile & Tablets_Tablet and Smartphone Accesso... |
| freq | 802 | 13778 | 14233 |


Based on the problem statement, only the title and target columns are relevant. Let's copy them to separate variables.

In [9]:
train_title_data = train_data["ptitle"]
train_target_data = train_data["target"]
test_title_data = test_data["ptitle"]
test_target_data = test_data["target"]
eval_title_data = eval_data["ptitle"]

# Text Transformation

Let's transform the original title data into numeric features. This can be done using CountVectorizer or TfidfVectorizer. I tried both, but CountVectorizer gave better accuracy and recall values than TfidfVectorizer.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
#vector = CountVectorizer(ngram_range=(1,2))
vector = CountVectorizer()

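#Note: stop_words takes the string 'english' or a list of actual stop words; the list below only removes the two literal tokens 'english' and 'german'
#vector = TfidfVectorizer(ngram_range=(1,2), max_df=0.95, min_df=2, stop_words=['english','german']) 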
#vector = TfidfVectorizer()
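To see how the two vectorizers differ, here is a small sketch on a made-up three-title corpus. The titles are hypothetical and default settings are assumed for both vectorizers.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up titles for illustration only.
docs = ["red phone case", "blue phone case", "red usb cable"]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()

# CountVectorizer stores raw term counts per title.
counts = count_vec.fit_transform(docs)
# TfidfVectorizer down-weights terms that appear in many titles.
weights = tfidf_vec.fit_transform(docs)

print(sorted(count_vec.vocabulary_))
print(counts.toarray())
print(weights.toarray().round(2))

# In "blue phone case", the rarer word "blue" gets more weight than "phone".
blue_weight = weights[1, tfidf_vec.vocabulary_["blue"]]
phone_weight = weights[1, tfidf_vec.vocabulary_["phone"]]
print(blue_weight > phone_weight)
```

Both produce a matrix with one column per vocabulary word; only the cell values differ.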


Time to transform the training title data using the fit_transform method.

In [11]:
train_data_trans=vector.fit_transform(train_title_data)

Check how many features the vectorizer derived from the training titles. To see what the feature names are, just print them. This also helps with further pre-processing of the data.

In [12]:
#In scikit-learn 1.2 and later, use get_feature_names_out() instead of get_feature_names()
print(len(vector.get_feature_names()))
#print(vector.get_feature_names())

35276


Transform the test data using the same CountVectorizer object. Remember to use the "transform" method and not "fit_transform", as we want to reuse the features identified from the training data.

In [13]:
test_data_tran = vector.transform(test_title_data)
test_data_tran
#test_title_data

<56493x35276 sparse matrix of type '<type 'numpy.int64'>'
	with 563817 stored elements in Compressed Sparse Row format>
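The difference between the two methods can be sketched on a toy example. The titles below are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles for illustration only.
train_docs = ["red phone case", "blue usb cable"]
test_docs = ["green phone charger"]

vec = CountVectorizer()
# fit_transform learns the vocabulary from the training titles.
train_matrix = vec.fit_transform(train_docs)
# transform reuses that vocabulary; words unseen in training are ignored.
test_matrix = vec.transform(test_docs)

print(train_matrix.shape, test_matrix.shape)
print(test_matrix.sum())
```

Both matrices have the same number of columns, and only "phone" from the test title is counted because "green" and "charger" are not in the training vocabulary.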

Transform the evaluation data using the same CountVectorizer object. Again, use the "transform" method and not "fit_transform", as we want to reuse the features identified from the training data.

In [14]:
eval_data_tran = vector.transform(eval_title_data)
#vector.get_feature_names()
eval_data_tran

<10000x35276 sparse matrix of type '<type 'numpy.int64'>'
	with 101709 stored elements in Compressed Sparse Row format>

# Building a model using Logistic Regression

It's time to build a model and test how it predicts on the test data and on the evaluation data. In this section we use LogisticRegression to predict the class. We can use the default construction of LogisticRegression or tune some of the input parameters. I tried tuning various parameters, but the defaults gave better results, so the initialization is left at the default values.

In [15]:
#logmodel = LogisticRegression(multi_class='multinomial', solver='newton-cg')
logmodel = LogisticRegression()
logmodel.fit(X=train_data_trans, y=train_target_data)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Now that the model is ready, let's predict the classes on the test data.

In [47]:
#Predict the target classification using the logistic model 
log_predictions_on_test = logmodel.predict(X=test_data_tran)


Plot the confusion matrix. 

In [48]:
#Plotting confusion matrix. 

#print(classification_report(y_true=test_target_data, y_pred=log_predictions_on_test))
#conf = confusion_matrix(y_pred=log_predictions_on_test, y_true=test_target_data)
#import seaborn as sns
#import matplotlib.pyplot as plt
#%matplotlib inline
#sns.heatmap(conf)

#As the number of classes is high, the confusion matrix display is not readable. Hence the code is commented out
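When a heatmap is too crowded to read, one alternative is to list only the most frequently confused class pairs. This is a sketch on made-up labels, not something the original notebook does.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = ["cases", "cases", "cables", "cables", "chargers", "chargers", "chargers"]
y_pred = ["cases", "cables", "cables", "cables", "cables", "chargers", "cables"]

labels = sorted(set(y_true) | set(y_pred))
conf = confusion_matrix(y_true, y_pred, labels=labels)

# Zero the diagonal so only misclassifications remain.
off_diag = conf.copy()
np.fill_diagonal(off_diag, 0)

# Collect (true label, predicted label, count) for every non-zero cell.
pairs = [(labels[i], labels[j], int(off_diag[i, j]))
         for i, j in zip(*np.nonzero(off_diag))]
pairs.sort(key=lambda p: p[2], reverse=True)

for true_label, pred_label, n in pairs[:5]:
    print(true_label, "->", pred_label, n)
```

With 102 classes, printing the top few pairs is usually easier to read than a 102x102 plot.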

Let's calculate the accuracy of the predictions.

In [49]:
#Calculate the accuracy of the model
accuracy_score(y_pred=log_predictions_on_test, y_true=test_target_data)

0.89310180022303653

Awesome. The accuracy of the model on the test data is 89%. But accuracy alone can be misleading, so it's time to check the precision and recall metrics.

As this is a multiclass classification, let's use average='weighted' when calculating precision and recall. It calculates the metric for each label and then averages the results, weighted by the number of true instances of each label.

Accuracy does not need such an averaging choice, as it is just the sum of all diagonal elements of the confusion matrix divided by the total number of samples.
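The weighted average can be reproduced by hand on a toy example. The labels below are made up, and `zero_division=0` assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up labels; class "c" is never predicted.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "a"]
labels = ["a", "b", "c"]

# Precision for each label separately (0 for labels that are never predicted).
per_class = precision_score(y_true, y_pred, labels=labels,
                            average=None, zero_division=0)

# Support = number of true instances of each label.
support = np.array([y_true.count(label) for label in labels])

# Weighted average computed by hand.
manual = float(np.sum(per_class * support) / support.sum())

# The same value computed by scikit-learn.
weighted = precision_score(y_true, y_pred, labels=labels,
                           average="weighted", zero_division=0)

print(manual, weighted)
```

Frequent classes dominate the weighted average, so a model can score well overall while doing poorly on rare classes.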

In [50]:
import numpy
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
precision = precision_score(y_pred=log_predictions_on_test, y_true=test_target_data, average='weighted')
print(precision)
recall = recall_score(y_pred=log_predictions_on_test, y_true=test_target_data, average='weighted')
print(recall)

0.899311134834
0.893101800223


Precision is about 90% and recall is about 89%. Weighted recall always equals accuracy, which is why those two values match.

It's time to apply the model to the evaluation data.

In [51]:
log_predictions_on_eval = logmodel.predict(X=eval_data_tran)

Now that we have the predictions for the evaluation data we need to merge these predictions with the original eval data

In [52]:
#convert the predictions into dataframe before merging them 
log_eval_target = pd.DataFrame(data=log_predictions_on_eval, columns=["Predictions"])
log_final_eval_data = pd.concat([eval_raw, log_eval_target], axis = 1)
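One thing to keep in mind is that pd.concat with axis=1 aligns rows on the index, not on position. With a freshly read CSV the index is 0..n-1 and everything lines up, but if rows had been dropped or filtered the result would be misaligned. A small sketch with made-up frames:

```python
import pandas as pd

# Made-up frames: "raw" has a non-default index, as it would after filtering rows.
raw = pd.DataFrame({"ptitle": ["a", "b", "c"]}, index=[10, 11, 12])
preds = pd.DataFrame({"Predictions": ["x", "y", "z"]})

# Indexes differ, so concat produces the union of both indexes, padded with NaN.
misaligned = pd.concat([raw, preds], axis=1)

# Resetting the index first makes the rows line up by position.
aligned = pd.concat([raw.reset_index(drop=True), preds], axis=1)

print(misaligned.shape)
print(aligned.shape)
```

A quick shape check after the merge is a cheap guard against this.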

After merging the predictions with the original eval data, dump all the data into "Predictions.csv" file under "data" folder

In [53]:
#os.path.join builds the path with the right separator on every platform
log_final_eval_data.to_csv(os.path.join("data", "Predictions.csv"), index=False)

Check for the accuracy and recall by uploading the "Predictions.csv" at http://172.16.0.12:3838/

The upload reported an accuracy of 55.32 and a recall of 69%.

# Building a model using MultinomialNB

Now let's build a model using MultinomialNB, as it is generally good for multiclass text classification.

In [54]:
mdbmodel = MultinomialNB()
mdbmodel.fit(train_data_trans, train_target_data)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Model creation is faster than with Logistic Regression. It's time to predict the target values using this model.

In [55]:
mnb_predictions = mdbmodel.predict(test_data_tran)

Let's check the accuracy of the predictions.

In [56]:
print(accuracy_score(y_pred=mnb_predictions, y_true=test_target_data))

0.767086187669


Got an accuracy of 77%. Now check the precision and recall.

In [58]:
import numpy
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
precision = precision_score(y_pred=mnb_predictions, y_true=test_target_data, average='weighted')
print(precision)
recall = recall_score(y_pred=mnb_predictions, y_true=test_target_data, average='weighted')
print(recall)

0.805110714839
0.767086187669


Got a precision of 81% and a recall of 77% on the test data.

Let's predict the target for the evaluation data.

In [59]:
mnb_predictions_on_eval = mdbmodel.predict(X=eval_data_tran)

Now that we have the predictions for the evaluation data we need to merge these predictions with the original eval data

In [60]:
mnb_eval_target = pd.DataFrame(data=mnb_predictions_on_eval, columns=["Predictions"])
mnb_final_eval_data = pd.concat([eval_raw, mnb_eval_target], axis = 1)

After merging the predictions with the original eval data, dump all the data into "Predictions.csv" file under "data" folder

In [61]:
#os.path.join builds the path with the right separator on every platform
mnb_final_eval_data.to_csv(os.path.join("data", "Predictions.csv"), index=False)

Check for the accuracy and recall by uploading the "Predictions.csv" at http://172.16.0.12:3838/

The upload reported an accuracy of 1 and a recall of 9%. The model does not predict properly on the evaluation data, even though the predictions on the test data seemed fine.
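One way to get a second opinion on a model before uploading predictions is cross-validation on the training data, using cross_val_score (imported at the top but not used so far). The sketch below runs on a made-up corpus and puts the vectorizer inside a pipeline so that each fold learns its own vocabulary. It is not a diagnosis of the gap seen above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up titles and targets for illustration only.
titles = [
    "red phone case", "blue phone case", "clear phone case", "leather phone case",
    "usb cable 1m", "usb cable 2m", "fast usb cable", "braided usb cable",
]
targets = ["cases"] * 4 + ["cables"] * 4

# The pipeline refits the vectorizer on the training part of every fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# cv=2 gives two stratified folds; the default score for classifiers is accuracy.
scores = cross_val_score(pipeline, titles, targets, cv=2)
print(scores.mean())
```

On the real data the same call would take the cleaned title Series and the target Series, with a larger cv value.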

# Modelling using SVM

Now let's build a model using SVM, as it is generally good when there are many features. SVM training with a non-linear kernel becomes slow as the number of training samples grows, and in our case we have a good amount of training data. Let's try it and see how the model works on this data.

In [16]:
svmmodel = SVC(C=20, gamma=0.2)
svmmodel.fit(X=train_data_trans, y=train_target_data)

SVC(C=20, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.2, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

As expected, the model has taken a lot of time to train.
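If training time is a concern, a linear SVM is a commonly used alternative for sparse text features. The sketch below swaps in scikit-learn's LinearSVC, which is not the RBF-kernel SVC used in this notebook, and runs on made-up titles. The parameter C=1.0 is only the library default, not a tuned value.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Made-up titles and targets for illustration only.
titles = [
    "red phone case", "blue phone case", "leather phone case",
    "usb cable 1m", "fast usb cable", "braided usb cable",
]
targets = ["cases"] * 3 + ["cables"] * 3

# Same bag-of-words features as in the notebook.
features = CountVectorizer().fit_transform(titles)

# LinearSVC fits a linear-kernel SVM and accepts sparse matrices directly.
linear_model = LinearSVC(C=1.0)
linear_model.fit(features, targets)

predictions = linear_model.predict(features)
print(list(predictions))
```

Whether this does better or worse than SVC(C=20, gamma=0.2) on the real data would have to be measured.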

Time to do the prediction on test data using the SVM model

In [17]:
svm_predictions = svmmodel.predict(test_data_tran)

Calculate the accuracy, precision and recall metrics for the predictions

In [18]:
print('Accuracy-',accuracy_score(y_pred=svm_predictions, y_true=test_target_data))

import numpy
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

precision = precision_score(y_pred=svm_predictions, y_true=test_target_data, average='weighted')
print('Precision - ',precision)
recall = recall_score(y_pred=svm_predictions, y_true=test_target_data, average='weighted')
print('Recall - ', recall)

('Accuracy-', 0.81510983661692604)
('Precision - ', 0.8301432203386484)
('Recall - ', 0.81510983661692604)



Accuracy, precision and recall are lower than those of the Logistic Regression model (about 81.5% accuracy versus 89%).

If these values were good, we could try the steps below. But it does not make sense to do so, as the key metric values are low. Hence stopping the processing of this model here.

In [19]:
svm_predictions_on_eval = svmmodel.predict(X=eval_data_tran)

Merge the predictions with original data and after merging dump all the data into "Predictions.csv" file under "data" folder

In [153]:
svm_eval_target = pd.DataFrame(data=svm_predictions_on_eval, columns=["Predictions"])
svm_final_eval_data = pd.concat([eval_raw, svm_eval_target], axis = 1)
svm_final_eval_data.to_csv(os.path.join("data", "Predictions.csv"), index=False)