# Project Description:

Classification is probably the most common task you will deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict information about a writer without knowing anything about them. We are going to create a classifier that predicts multiple features of the author of a given text. We have framed it as a multi-label classification problem.

# Data set info:
Blog Authorship Corpus: over 600,000 posts from more than 19 thousand bloggers.

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age, but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

- 8240 "10s" blogs (ages 13-17)
- 8086 "20s" blogs (ages 23-27)
- 2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions: individual posts within a single blogger are separated by the date of the following post, and links within a post are denoted by the label urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads


In [1]:
# imports
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report,f1_score, accuracy_score, recall_score, precision_score
from nltk.stem import WordNetLemmatizer 

# Step 1 : 
1. Load the dataset (5 points) 
a. Tip: As the dataset is large, use fewer rows. Check what works well on your machine and decide accordingly.
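
One way to follow this tip is to limit the number of rows at read time instead of loading the whole CSV and slicing it afterwards. The sketch below uses the `nrows` argument of `pd.read_csv`; the in-memory CSV and the value of `N_ROWS` are stand-ins for illustration, not the real `blogtext.csv`.

```python
import io
import pandas as pd

# Stand-in for blogtext.csv so the sketch runs anywhere (hypothetical rows).
csv_text = (
    "id,gender,age,topic,sign,date,text\n"
    '1,male,15,Student,Leo,"14,May,2004",first post\n'
    '1,male,15,Student,Leo,"13,May,2004",second post\n'
    '2,female,25,Arts,Virgo,"12,May,2004",third post\n'
    '2,female,25,Arts,Virgo,"11,May,2004",fourth post\n'
)

N_ROWS = 3  # illustrative value; pick whatever your machine handles well

# nrows stops parsing after N_ROWS data rows, so the rest is never loaded.
sample_df = pd.read_csv(io.StringIO(csv_text), nrows=N_ROWS)
print(sample_df.shape)
```

With a real file, the first argument would simply be the path to the CSV.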

In [2]:
# Read data 
corpus_df = pd.read_csv("blog-authorship-corpus\\blogtext.csv")
corpus_df.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


In [3]:
# Taking only the first 3,000 rows for initial preprocessing and training
corpus_df_sample = corpus_df[:3000]
print(corpus_df_sample.shape)
corpus_df_sample["text"].loc[0]

(3000, 7)


'           Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html.         '

# Step 2 :
2. Preprocess rows of the “text” column.

    a. Remove unwanted characters  
    b. Convert text to lowercase  
    c. Remove unwanted spaces  
    d. Remove stopwords 
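
The four sub-steps above can also be sketched as a single plain-Python function. This is only an illustration: the `clean_text` name and the tiny `STOPWORDS` set are assumptions made for the example, whereas the notebook itself works on a pandas column and uses NLTK's English stopword list.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
STOPWORDS = {"has", "been", "and", "of", "now", "i", "have", "to", "our", "it"}

def clean_text(text, stopwords=STOPWORDS):
    text = re.sub(r"[^A-Za-z]", " ", text)               # a. replace non-letters with spaces
    text = text.lower()                                  # b. convert to lowercase
    tokens = text.split()                                # c. split() discards all extra whitespace
    tokens = [t for t in tokens if t not in stopwords]   # d. remove stopwords
    return " ".join(tokens)

raw = "   Info has been found (+/- 100 pages, and 4.5 MB of .pdf files)   "
print(clean_text(raw))
```

Splitting on whitespace and re-joining handles leading, trailing, and repeated inner spaces in one step.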

In [4]:
# Removing unwanted / special characters (regex=True is needed in pandas 2.0 and later, where the default is regex=False)
corpus_df_sample['text'] = corpus_df_sample['text'].str.replace('[^A-Za-z]',' ', regex=True)
corpus_df_sample["text"].loc[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


'           Info has been found          pages  and     MB of  pdf files  Now i have to wait untill our team leader has processed it and learns html          '
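
The warning printed above appears because `corpus_df_sample` was created by slicing `corpus_df`, so pandas cannot tell whether the assignment is meant to reach the original frame. A minimal sketch of one common remedy, taking an explicit `.copy()` of the slice, is shown below; the toy frame and the names `full_df` and `sample_df` are hypothetical.

```python
import pandas as pd

full_df = pd.DataFrame({"text": ["Hello, World!", "4.5 MB of .pdf", "Testing!!!"]})

# An explicit copy makes the sample independent of full_df,
# so assigning to its columns is unambiguous.
sample_df = full_df[:2].copy()
sample_df["text"] = sample_df["text"].str.replace("[^A-Za-z]", " ", regex=True)

print(sample_df["text"].tolist())
print(full_df["text"].tolist())  # the original frame is left untouched
```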

In [5]:
# Converting to lower case
corpus_df_sample['text'] = corpus_df_sample['text'].str.lower()
corpus_df_sample["text"].loc[0]


'           info has been found          pages  and     mb of  pdf files  now i have to wait untill our team leader has processed it and learns html          '

In [6]:
# Removing leading and trailing spaces
corpus_df_sample["text"] = corpus_df_sample["text"].str.strip()
corpus_df_sample["text"].loc[0]


'info has been found          pages  and     mb of  pdf files  now i have to wait untill our team leader has processed it and learns html'

In [7]:
corpus_df_sample["text"] = corpus_df_sample["text"].str.split()  # splitting each row of text data into individual words.
# So it can be iterated through to remove only stopwords in next steps.


# Removing stop words 

In [8]:
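stop = set(stopwords.words('english'))  # a set makes the membership test below fast

def removestopwords(y):
    """Drop stopwords from a list of tokens and join the rest into one string."""
    stopwordremoved = [w for w in y if w not in stop]
    return " ".join(stopwordremoved)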

In [9]:
text_column_size = corpus_df_sample["text"].size
print("text column size :", text_column_size)

# Initialize an empty list to hold the text after stop word removal
cleaner_corpus_df_sample_text = []

# Loop over each text
for i in range( 0, text_column_size):
    cleaner_corpus_df_sample_text.append(removestopwords(corpus_df_sample["text"][i]))

text column size : 3000


In [10]:
cleaner_corpus_df_sample_text[10]

'ah korean language looks difficult first figure read hanguel korea surprisingly easy learn alphabet characters seems easy vocabulary starts oh backwards us sentence structure yikes luckily many options us slow witted foreigners take language course could list urllink joongang article says lot resources urllink well guy motivation jeon ji hyun latest something actually star movies cfs hear means commercial feature positive saw latest movie sunday night hard describe name english version windstruck korean version yeochinso short ne yeojachingu rul sogayhamnida like introduce girlfriend surprisingly titles make sense like website korean english looks quite good actually urllink movie shown theatres subtitles special times info urllink list many theatres seoul click urllink urllink great reason learn korean already married went foreigners well local korean national course korean take picture put urllink movie hof bar update bud mine passed urllink link giordano ad apparently aired korea n

In [11]:
#Replace text column with cleaner_corpus_df_sample_text 
corpus_df_sample["text"] = cleaner_corpus_df_sample_text
corpus_df_sample["text"][10]


'ah korean language looks difficult first figure read hanguel korea surprisingly easy learn alphabet characters seems easy vocabulary starts oh backwards us sentence structure yikes luckily many options us slow witted foreigners take language course could list urllink joongang article says lot resources urllink well guy motivation jeon ji hyun latest something actually star movies cfs hear means commercial feature positive saw latest movie sunday night hard describe name english version windstruck korean version yeochinso short ne yeojachingu rul sogayhamnida like introduce girlfriend surprisingly titles make sense like website korean english looks quite good actually urllink movie shown theatres subtitles special times info urllink list many theatres seoul click urllink urllink great reason learn korean already married went foreigners well local korean national course korean take picture put urllink movie hof bar update bud mine passed urllink link giordano ad apparently aired korea n

# Lemmatization

In [12]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    lemm = [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
    return(" ".join(lemm)) 

corpus_df_sample["text"] = corpus_df_sample.text.apply(lemmatize_text)



In [13]:
corpus_df_sample["text"][10] # Lemmatized output

'ah korean language look difficult first figure read hanguel korea surprisingly easy learn alphabet character seems easy vocabulary start oh backwards u sentence structure yikes luckily many option u slow witted foreigner take language course could list urllink joongang article say lot resource urllink well guy motivation jeon ji hyun latest something actually star movie cf hear mean commercial feature positive saw latest movie sunday night hard describe name english version windstruck korean version yeochinso short ne yeojachingu rul sogayhamnida like introduce girlfriend surprisingly title make sense like website korean english look quite good actually urllink movie shown theatre subtitle special time info urllink list many theatre seoul click urllink urllink great reason learn korean already married went foreigner well local korean national course korean take picture put urllink movie hof bar update bud mine passed urllink link giordano ad apparently aired korea nothing xxx sensibil

# Step 3: 
As this is a multi-label classification problem, merge all the label columns together, so that we have all the labels together for a particular sentence.
a. Label columns to merge: “gender”, “age”, “topic”, “sign”
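
The notebook merges the four columns into one comma-separated string. As an alternative sketch, the labels can be kept as a Python list per row, which is the shape `MultiLabelBinarizer` consumes and which keeps a topic such as `Non-Profit` as a single label. The toy frame below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 25],
    "topic": ["Student", "Non-Profit"],
    "sign": ["Leo", "Cancer"],
    "text": ["first post", "second post"],
})

label_cols = ["gender", "age", "topic", "sign"]

# Cast everything to str (age is numeric) and lowercase it column by column.
as_text = df[label_cols].astype(str).apply(lambda col: col.str.lower())

# Collect each row's four values into one list of labels.
df["labels"] = pd.Series(
    [list(row) for row in as_text.itertuples(index=False, name=None)],
    index=df.index,
)

print(df[["text", "labels"]])
```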

In [14]:
#name of available columns 
corpus_df_sample.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

In [15]:
corpus_df_sample.shape

(3000, 7)

In [16]:
corpus_df_sample.head(2)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info found page mb pdf file wait untill team l...
1,2059027,male,15,Student,Leo,"13,May,2004",team member drewes van der laag urllink mail r...


In [17]:
# merge 'gender', 'age', 'topic', 'sign' into a single 'labels' column
corpus_df_sample['age'] = corpus_df_sample['age'].astype(str)
corpus_df_sample['labels'] = corpus_df_sample[['gender','age','topic','sign']].apply(lambda x: ','.join(x), axis = 1) 
corpus_df_sample_merged = corpus_df_sample.drop(labels = ['date','gender', 'age','topic','sign','id'], axis = 1)
corpus_df_sample_merged.head()


Unnamed: 0,text,labels
0,info found page mb pdf file wait untill team l...,"male,15,Student,Leo"
1,team member drewes van der laag urllink mail r...,"male,15,Student,Leo"
2,het kader van kernfusie op aarde maak je eigen...,"male,15,Student,Leo"
3,testing testing,"male,15,Student,Leo"
4,thanks yahoo toolbar capture url popups mean s...,"male,33,InvestmentBanking,Aquarius"


In [18]:
corpus_df_sample_merged.shape

(3000, 2)

# Step 4:
Separate features and labels, and split the data into training and testing
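
As a quick illustration of what the split does, the sketch below applies `train_test_split` with the same `test_size` and `random_state` to 3,000 stand-in items. The `texts` and `labels` lists are made up; only the resulting sizes and the pairing of features with labels are of interest.

```python
from sklearn.model_selection import train_test_split

texts = [f"post {i}" for i in range(3000)]         # stand-in features
labels = [f"label {i % 4}" for i in range(3000)]   # stand-in labels

# Features and labels are shuffled together, so each text keeps its own label.
X_train, X_test, Y_train, Y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=143
)

print(len(X_train), len(X_test))
```

With 3,000 rows and `test_size=0.33`, 990 rows go to the test set and 2,010 to the training set, which matches the `(2010,)` shape reported by the notebook.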

In [19]:
feature = corpus_df_sample_merged['text']
corpus_df_sample_merged['labels'] = corpus_df_sample_merged['labels'].str.lower()
labels = corpus_df_sample_merged['labels']
X_train, X_test, Y_train, Y_test = train_test_split(feature,labels, test_size = 0.33, random_state = 143)
Y_train.shape

(2010,)

# Step 5:
Vectorizing the features.

a. Create a Bag of Words using count vectorizer  
 i. Use ngram_range=(1, 2)  
 ii. Vectorize training and testing features  
 
b. Print the term-document matrix 
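
To make the effect of `ngram_range=(1, 2)` concrete, here is a small sketch on a made-up corpus. The documents are invented for the example; a dense `toarray()` view is only reasonable at this toy size, since the real matrix has thousands of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["korean language look difficult", "korean movie look good"]
test_docs = ["korean language movie"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_docs)  # learn the vocabulary on train only
X_test = vectorizer.transform(test_docs)        # reuse that vocabulary for test

# vocabulary_ maps each unigram/bigram to its column index.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(X_train.toarray())  # rows are documents, columns are terms
print(X_test.toarray())   # the unseen bigram "language movie" is simply dropped
```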

In [20]:
# Creating Bag of words
vectorizer = CountVectorizer(min_df = 2,ngram_range = (1,2),stop_words = "english")
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
print("X_train shape & sample",X_train.shape)
X_train[0]

X_train shape & sample (2010, 16018)


<1x16018 sparse matrix of type '<class 'numpy.int64'>'
	with 27 stored elements in Compressed Sparse Row format>

# Step 6:
Create a dictionary to get the count of every label, i.e. the key will be the label name and the value will be the total count of that label.
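
A minimal sketch of such a dictionary, using `collections.Counter` on comma-separated label strings. The three `merged_labels` rows are hypothetical; in the notebook the input would be the merged `labels` column.

```python
from collections import Counter

# Hypothetical merged labels, one comma-separated string per post.
merged_labels = [
    "male,15,student,leo",
    "male,15,student,leo",
    "female,25,non-profit,cancer",
]

# Split each row on commas and count every label occurrence.
label_counts = Counter(
    label for row in merged_labels for label in row.split(",")
)

print(dict(label_counts))
```

Splitting only on the comma keeps a hyphenated topic as one key, and the keys of this dictionary can serve directly as the list of classes.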

In [21]:
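vectorizer_labels = CountVectorizer(min_df = 1,ngram_range = (1,1),stop_words = "english")
labels_vector = vectorizer_labels.fit_transform(labels)
# Note: vocabulary_ maps each label token to its column index in labels_vector, not to its count.
# The default tokenizer also splits topics that contain punctuation into separate tokens (see 'non' and 'profit' below).
vectorizer_labels.vocabulary_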

{'male': 36,
 '15': 1,
 'student': 47,
 'leo': 33,
 '33': 9,
 'investmentbanking': 32,
 'aquarius': 18,
 'female': 28,
 '14': 0,
 'indunk': 30,
 'aries': 19,
 '25': 6,
 'capricorn': 24,
 '17': 3,
 'gemini': 29,
 '23': 4,
 'non': 39,
 'profit': 41,
 'cancer': 23,
 'banking': 21,
 '37': 12,
 'sagittarius': 43,
 '26': 7,
 '24': 5,
 'scorpio': 45,
 '27': 8,
 'education': 26,
 '45': 16,
 'engineering': 27,
 'libra': 34,
 'science': 44,
 '34': 10,
 '41': 14,
 'communications': 25,
 'media': 37,
 'businessservices': 22,
 'sports': 46,
 'recreation': 42,
 'virgo': 50,
 'taurus': 48,
 'arts': 20,
 'pisces': 40,
 '44': 15,
 '16': 2,
 'internet': 31,
 'museums': 38,
 'libraries': 35,
 'accounting': 17,
 '39': 13,
 '35': 11,
 'technology': 49}

In [22]:
# Extracting only the keys from the above dictionary, which are the unique label tokens. This set of labels will be used as the classes in
# MultiLabelBinarizer further below.
label_classes = list(vectorizer_labels.vocabulary_.keys())
    
print(sorted(label_classes))

['14', '15', '16', '17', '23', '24', '25', '26', '27', '33', '34', '35', '37', '39', '41', '44', '45', 'accounting', 'aquarius', 'aries', 'arts', 'banking', 'businessservices', 'cancer', 'capricorn', 'communications', 'education', 'engineering', 'female', 'gemini', 'indunk', 'internet', 'investmentbanking', 'leo', 'libra', 'libraries', 'male', 'media', 'museums', 'non', 'pisces', 'profit', 'recreation', 'sagittarius', 'science', 'scorpio', 'sports', 'student', 'taurus', 'technology', 'virgo']


# Step 7:
Transform the labels - As we have noticed before, in this task each example can have multiple tags. To deal with this kind of prediction, we need to transform the labels into a binary form, and the prediction will be a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

a. Convert your train and test labels using MultiLabelBinarizer  
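
The sketch below shows the transformation on made-up labels. Here the binarizer learns its classes from the training labels, which is an assumption of this example; the notebook instead passes an explicit `classes` list.

```python
from sklearn.preprocessing import MultiLabelBinarizer

Y_train = [["male", "15", "student", "leo"],
           ["female", "25", "non-profit", "cancer"]]
Y_test = [["male", "25", "student", "cancer"]]

mlb = MultiLabelBinarizer()
Y_train_bin = mlb.fit_transform(Y_train)  # learns the classes from the training labels
Y_test_bin = mlb.transform(Y_test)        # reuses the same column order

print(list(mlb.classes_))                  # one column per class, sorted
print(Y_train_bin)
print(Y_test_bin)
print(mlb.inverse_transform(Y_test_bin))  # back from 0/1 masks to label tuples
```

Each row of the result is a 0/1 mask with one position per class, and `inverse_transform` recovers the labels in class order rather than in the original order.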

In [23]:
mlb = MultiLabelBinarizer(classes = label_classes)  # initialising multilabelbinariser with all unique possible classes

In [24]:
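# Converting the entire set of labels into the list-of-lists format required by mlb
# Note: keeping only \w characters turns multi-word topics into single tokens such as 'nonprofit',
# which are not in label_classes (it holds 'non' and 'profit'), so mlb.transform ignores them.
labels = [["".join(re.findall(r"\w",f)) for f in lst] for lst in [s.split(",") for s in labels]]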
labels[30]

['male', '33', 'investmentbanking', 'aquarius']

In [25]:
labels_trans = mlb.fit(labels) # fitting the binarizer on the entire set of labels
labels_trans

MultiLabelBinarizer(classes=['male', '15', 'student', 'leo', '33', 'investmentbanking', 'aquarius', 'female', '14', 'indunk', 'aries', '25', 'capricorn', '17', 'gemini', '23', 'non', 'profit', 'cancer', 'banking', '37', 'sagittarius', '26', '24', 'scorpio', '27', 'education', '45', 'engineering', 'libra', 'science', '3...', 'pisces', '44', '16', 'internet', 'museums', 'libraries', 'accounting', '39', '35', 'technology'],
          sparse_output=False)

In [26]:
# Convert Y_train into the format required by mlb
Y_train = [["".join(re.findall(r"\w",f)) for f in lst] for lst in [s.split(",") for s in Y_train]]
Y_train[30]

['male', '24', 'engineering', 'libra']

In [27]:
Y_train_trans = mlb.transform(Y_train) # transforming train labels using mlb, which was fitted on all unique labels in the entire data set
Y_train_trans[30]

  .format(sorted(unknown, key=str)))


array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0])

In [28]:
Y_train_trans.shape

(2010, 51)

In [29]:
# Convert Y_test into the format required by mlb
Y_test = [["".join(re.findall(r"\w",f)) for f in lst] for lst in [s.split(",") for s in Y_test]]
Y_test_trans = mlb.transform(Y_test) # transforming test labels
print(Y_test[30])

['male', '35', 'technology', 'aries']


In [30]:
Y_test_trans[30]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1])

In [31]:
len(mlb.classes_)

51

In [32]:
mlb.classes_

array(['male', '15', 'student', 'leo', '33', 'investmentbanking',
       'aquarius', 'female', '14', 'indunk', 'aries', '25', 'capricorn',
       '17', 'gemini', '23', 'non', 'profit', 'cancer', 'banking', '37',
       'sagittarius', '26', '24', 'scorpio', '27', 'education', '45',
       'engineering', 'libra', 'science', '34', '41', 'communications',
       'media', 'businessservices', 'sports', 'recreation', 'virgo',
       'taurus', 'arts', 'pisces', '44', '16', 'internet', 'museums',
       'libraries', 'accounting', '39', '35', 'technology'], dtype=object)

In [33]:
Y_train_trans[10]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0])

In [34]:
Y_train[10]

['male', '25', 'nonprofit', 'cancer']

# Step 8:

In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach k classifiers (= number of tags) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on  every label  
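
To illustrate that One-vs-Rest really trains one classifier per label column, here is a toy sketch. The arrays `X` and `Y` are invented for the example and have nothing to do with the blog data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 6 samples, 2 features, 3 binary labels (hypothetical).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [0, 2]], dtype=float)
Y = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]])

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, Y)

print(len(ovr.estimators_))   # one fitted LogisticRegression per label column
print(ovr.predict(X).shape)   # predictions use the same multi-hot layout as Y
```

Because every label gets its own binary model, training time grows with the number of label columns.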

In [35]:
clf = LogisticRegression(solver = 'lbfgs',max_iter = 1000)  # initiating the classifier
#from sklearn.svm import SVC
#clf = SVC(kernel = "linear")
clf = OneVsRestClassifier(clf)

# Step 9:

Fit the classifier, make predictions and get the accuracy

a. Print the following  
        i. Accuracy score  
        ii. F1 score  
        iii. Average precision score  
        iv. Average recall score 
 
Tip: Make sure you are familiar with all of them. How would you expect the  things to work for the multi-label scenario? Read about micro/macro/weighted  averaging 
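
The difference between the averaging modes is easiest to see on a small hand-computed example. The sketch below uses made-up 0/1 matrices and plain Python rather than sklearn, so every step is visible; the helper name `f1` is an assumption of the example.

```python
# Made-up ground truth and predictions: 4 samples, 3 labels.
Y_true = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
Y_pred = [[1, 0, 1], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

def f1(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); taken as 0 when there is nothing to count.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

n_labels = len(Y_true[0])
counts = []  # (tp, fp, fn) per label column
for j in range(n_labels):
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(Y_true, Y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(Y_true, Y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(Y_true, Y_pred))
    counts.append((tp, fp, fn))

# Micro: pool the counts over all labels, then compute one score.
micro_f1 = f1(*(sum(c[k] for c in counts) for k in range(3)))
# Macro: compute one score per label, then take the unweighted mean.
macro_f1 = sum(f1(*c) for c in counts) / n_labels
# Subset accuracy: a sample counts only if every one of its labels matches.
subset_acc = sum(t == p for t, p in zip(Y_true, Y_pred)) / len(Y_true)

print(round(micro_f1, 3), round(macro_f1, 3), subset_acc)
```

The frequent first label dominates the micro score, while the rare second label, which is never predicted, pulls the macro score down. The notebook's results show the same pattern: the micro scores are much higher than the macro scores, and the accuracy is lower still because it requires an exact match on all labels.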

In [36]:
clf.fit(X_train,Y_train_trans) # Fitting on  train data



OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False),
          n_jobs=None)

In [37]:
print("Train Accuracy:",clf.score(X_train,Y_train_trans))

Train Accuracy: 0.972139303482587


In [38]:
Y_pred = clf.predict(X_test) 

In [39]:
print("Test Accuracy:" + str(accuracy_score(Y_test_trans, Y_pred)))
print("F1: " + str(f1_score(Y_test_trans, Y_pred, average='micro')))
print("F1_macro: " + str(f1_score(Y_test_trans, Y_pred, average='macro')))
print("Precision: " + str(precision_score(Y_test_trans, Y_pred, average='micro')))
print("Precision_macro: " + str(precision_score(Y_test_trans, Y_pred, average='macro')))
print("Recall: " + str(recall_score(Y_test_trans, Y_pred, average='micro')))
print("Recall_macro: " + str(recall_score(Y_test_trans, Y_pred, average='macro')))

Test Accuracy:0.5686868686868687
F1: 0.7584394023242943
F1_macro: 0.2777544085541704
Precision: 0.8270971635485818
Precision_macro: 0.42504160995159523
Recall: 0.7003065917220235
Recall_macro: 0.2290073113591835




# Step 10:

Print true label and predicted label for any five examples
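
A compact way to do this is a single loop over the chosen indices. The sketch below uses made-up label tuples; in the notebook, `predicted` and `actual` would be the results of `mlb.inverse_transform` on the predictions and on the test labels.

```python
# Hypothetical decoded labels, one tuple per test example.
predicted = [("male", "35"), ("female", "15"), ("male", "35"),
             ("male", "25"), ("female", "15"), ("male", "35")]
actual = [("female", "27"), ("female", "15"), ("male", "35"),
          ("male", "25"), ("male", "14"), ("male", "35")]

rows = []
for i in range(5):  # first five examples; any five indices would do
    # Compare as sets because the label order inside a tuple does not matter.
    match = set(predicted[i]) == set(actual[i])
    rows.append((i, predicted[i], actual[i], match))
    print(f"Example {i + 1} - predicted: {predicted[i]} | actual: {actual[i]} | exact match: {match}")
```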

In [40]:
Y_pred_inv = mlb.inverse_transform(Y_pred)   # inverse transforming the predicted label data
Y_test_trans_inv =  mlb.inverse_transform(Y_test_trans) # inverse transforming the original test label data

In [41]:
print("Example 1 - predicted :",Y_pred_inv[0])
print("Example 1 - Actual :",Y_test_trans_inv[0])
print("Example 1 - Actual_before mlb transformation :",Y_test[0])

Example 1 - predicted : ('male', 'aries', '35', 'technology')
Example 1 - Actual : ('aquarius', 'female', '27', 'education')
Example 1 - Actual_before mlb transformation : ['female', '27', 'education', 'aquarius']


In [42]:
print("Example 2 - predicted :",Y_pred_inv[30])
print("Example 2 - Actual :",Y_test_trans_inv[30])
print("Example 2 - Actual_before mlb transformation :",Y_test[30])

Example 2 - predicted : ('male', 'aries', '35', 'technology')
Example 2 - Actual : ('male', 'aries', '35', 'technology')
Example 2 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']


In [43]:
print("Example 3 - predicted :",Y_pred_inv[39])
print("Example 3 - Actual :",Y_test_trans_inv[39])
print("Example 3 - Actual_before mlb transformation :",Y_test[39])

Example 3 - predicted : ('male', 'aries', '35', 'technology')
Example 3 - Actual : ('female', '14', 'indunk', 'aries')
Example 3 - Actual_before mlb transformation : ['female', '14', 'indunk', 'aries']


In [44]:
print("Example 4 - predicted :",Y_pred_inv[300])
print("Example 4 - Actual :",Y_test_trans_inv[300])
print("Example 4 - Actual_before mlb transformation :",Y_test[300])

Example 4 - predicted : ('male', 'aries', '35', 'technology')
Example 4 - Actual : ('male', 'aries', '35', 'technology')
Example 4 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']


In [46]:
print("Example 5 - predicted :",Y_pred_inv[89])
print("Example 5 - Actual :",Y_test_trans_inv[89])
print("Example 5 - Actual_before mlb transformation :",Y_test[89])

Example 5 - predicted : ('15', 'student', 'female', 'libra')
Example 5 - Actual : ('15', 'student', 'female', 'libra')
Example 5 - Actual_before mlb transformation : ['female', '15', 'student', 'libra']


# Learnings / Conclusions:
1. This model was also run with 30k/60k/25k samples, and every time it overfits in the same way as in the result above (train accuracy of about 0.97 versus test accuracy of about 0.57).
2. Lemmatization was added as an extra preprocessing step, but it did not improve the model's generalisation.