# Statistical NLP

Classification is probably the most common task you will deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict information about a writer without knowing anything about them. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a multi-label classification problem.

Blog Authorship Corpus 


Over 600,000 posts from more than 19 thousand bloggers   

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from  blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million  words - or approximately 35 posts and 7250 words per person.  


Each blog is presented as a separate file, the name of which indicates a blogger id# and the  blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and  age but for many, industry and/or sign is marked as unknown.)  
All bloggers included in the corpus fall into one of three age groups: 8240 "10s" blogs (ages 13-17), 8086 "20s" blogs (ages 23-27), and 2994 "30s" blogs (ages 33-47).
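As an illustration of the three age groups described above, the raw age could be mapped to its group with a small helper. This is only a sketch: the function name `age_group` is hypothetical and the notebook itself keeps the raw age as a label. Ages outside the documented ranges return `None`.

```python
def age_group(age):
    """Map a raw age to the corpus age group, or None if it falls outside the documented ranges."""
    if 13 <= age <= 17:
        return "10s"
    if 23 <= age <= 27:
        return "20s"
    if 33 <= age <= 47:
        return "30s"
    return None  # age not covered by the corpus description

print(age_group(15), age_group(24), age_group(33))
```

Grouping ages this way would reduce the number of distinct age labels from one per year to three, at the cost of coarser predictions.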
 

In [1]:
import numpy as np
import pandas as pd

# 1. Load the dataset

In [2]:
blogtext_df=pd.read_csv("blogtext.csv")
blogtext_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [3]:
blogtext_df.shape

(681284, 7)

In [4]:
blogtext_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   id      681284 non-null  int64 
 1   gender  681284 non-null  object
 2   age     681284 non-null  int64 
 3   topic   681284 non-null  object
 4   sign    681284 non-null  object
 5   date    681284 non-null  object
 6   text    681284 non-null  object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB


In [5]:
blogtext_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,681284.0,2397802.0,1247723.0,5114.0,1239610.0,2607577.0,3525660.0,4337650.0
age,681284.0,23.93233,7.786009,13.0,17.0,24.0,26.0,48.0


# 2. Preprocess rows of the “text” column 

In [6]:
blogtext_df["age"]=blogtext_df["age"].astype(str)

In [7]:
#blogtext_df.info()
blogtext_df = blogtext_df[0:50000].copy()   ## NOTE: considering only the first 50k of 681284 records to reduce compute time; .copy() gives an independent frame for the columns added later

In [8]:
import nltk
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [9]:
blogtext_df["text"]

0                   Info has been found (+/- 100 pages,...
1                   These are the team members:   Drewe...
2                   In het kader van kernfusie op aarde...
3                         testing!!!  testing!!!          
4                     Thanks to Yahoo!'s Toolbar I can ...
                               ...                        
49995           Aug 7th Thur... Bought Her Mua Chee & S...
49996           Aug 6th Wed.. Her 1st Day @ Work Back @...
49997           Aug 4th Mon Zing's BD !! Went To Her Pl...
49998           Aug 3rd Sun.. Went To Her Place B4 Goin...
49999           Aug 1st Fri.. Met Her To Go Shoppin' @ ...
Name: text, Length: 50000, dtype: object

In [23]:
import re
import string

# May require a one-time nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set gives fast membership tests

def clean_text(text):
    # a. Remove unwanted characters: punctuation, newlines, tabs
    text = "".join([ch for ch in text if ch not in string.punctuation])
    new_text = re.sub(r"[\n\t\'+]", "", text)
    new_text = re.sub(r"~[0-9]", "", new_text)
    # c. Remove unwanted spaces by splitting on runs of non-word characters
    tokens = re.split(r'\W+', new_text)
    # d. Remove stopwords
    return " ".join([word for word in tokens if word not in stopwords])

# b. Convert text to lowercase before cleaning
blogtext_df['text_nostop'] = blogtext_df['text'].apply(lambda x: clean_text(x.lower()))

blogtext_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_nostop
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...",info found 100 pages 45 mb pdf files wait unt...
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,team members drewes van der laag urllink mail...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eige...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,thanks yahoos toolbar capture urls popupswhic...
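To make the cleaning steps concrete, here is a minimal, self-contained sketch of the same idea on a single string. It is not the notebook's exact function: `clean_text_demo` is a hypothetical name, the tiny `DEMO_STOPWORDS` set is a stand-in for NLTK's English stopword list, and empty tokens are dropped explicitly.

```python
import re
import string

# Stand-in stopword set for illustration only; the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"i", "the", "to", "a", "is", "has", "been", "can"}

def clean_text_demo(text):
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = re.split(r"\W+", text)                                   # split on non-word runs
    # drop empty tokens and stopwords, then rejoin with single spaces
    return " ".join(t for t in tokens if t and t not in DEMO_STOPWORDS)

print(clean_text_demo("Thanks to Yahoo!'s Toolbar I can capture URLs"))
print(clean_text_demo("testing!!!  testing!!!"))
```

Because punctuation is deleted rather than replaced by a space, "Yahoo!'s" collapses to "yahoos", which matches what the cleaned column shows.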


In [24]:
blogtext_df1=blogtext_df  ##[0:681284]

# 3. Merge all labels together


In [25]:
 #a. Label columns to merge: “gender”, “age”, “topic”, “sign"

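def clean_label(row):
    # Join the four label values, then split on non-word characters.
    # Note: this also splits hyphenated values, which appears to be why 'Communications' and
    # 'Media' show up as separate labels with identical counts in the label dictionary below.
    text = ','.join(row.values.astype(str))
    return re.split(r'\W+', text)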

cols = ['gender', 'age', 'topic', 'sign']
blogtext_df1['labels'] = blogtext_df1[cols].apply(lambda row: clean_label(row), axis=1) 

blogtext_df1.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_nostop,labels
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...",info found 100 pages 45 mb pdf files wait unt...,"[male, 15, Student, Leo]"
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,team members drewes van der laag urllink mail...,"[male, 15, Student, Leo]"
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eige...,"[male, 15, Student, Leo]"
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing testing,"[male, 15, Student, Leo]"
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,thanks yahoos toolbar capture urls popupswhic...,"[male, 33, InvestmentBanking, Aquarius]"
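An alternative way to merge the label columns is to collect the values directly, without joining and re-splitting them. The sketch below uses a small hypothetical frame (`toy_df`) rather than the real data, and is only one possible approach; it keeps a hyphenated value such as "Communications-Media" as a single label.

```python
import pandas as pd

# Toy frame standing in for the blog data (values are illustrative).
toy_df = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 24],
    "topic": ["Student", "Communications-Media"],
    "sign": ["Leo", "Aries"],
})

cols = ["gender", "age", "topic", "sign"]
# One list of string labels per row; no regex splitting involved.
toy_df["labels"] = pd.Series(
    [list(row) for row in toy_df[cols].astype(str).itertuples(index=False)],
    index=toy_df.index,
)

print(toy_df["labels"].tolist())
```

With this variant the label set, and therefore all downstream counts and scores, would differ from the outputs shown in this notebook.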


In [26]:
blogtext_df1.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text', 'text_nostop',
       'labels'],
      dtype='object')

In [27]:
blogtext_df1 = blogtext_df1.drop(columns=["text", "id", "gender", "age", "topic", "sign", "date"])
blogtext_df1.head()

Unnamed: 0,text_nostop,labels
0,info found 100 pages 45 mb pdf files wait unt...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eige...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoos toolbar capture urls popupswhic...,"[male, 33, InvestmentBanking, Aquarius]"


# 4. Separate features and labels, and split the data into training and testing

In [28]:
X = blogtext_df1.text_nostop
print(X.shape)

(50000,)


In [29]:
y = blogtext_df1.labels
#print(y.shape)

In [30]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=4)

print(X_train.shape, X_test.shape, y_train.shape,y_test.shape)

(35000,) (15000,) (35000,) (15000,)


# 5. Vectorize the features

In [31]:
#a. Create a Bag of Words using count vectorizer and vectorize training and test 

In [32]:
# instantiate the vectorizer
vect = CountVectorizer(ngram_range=(1, 2))

In [33]:
## Fit and transform on training data
X_train_dtm = vect.fit_transform(X_train)

# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)

In [34]:
len(vect.vocabulary_)  ## number of unigram and bigram features in the fitted vocabulary

2380863
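The vocabulary is this large because `ngram_range=(1, 2)` adds every distinct bigram on top of every distinct word. The toy sketch below (hypothetical documents, not the blog data) shows how the count builds up, and how `min_df` could prune terms that occur in only one document. Using `min_df` is a possible tuning step, not something this notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the cat ran", "a dog ran"]

# Unigrams + bigrams; the default token pattern ignores one-letter tokens such as "a".
full = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)

# Same settings, but keep only terms that appear in at least 2 documents.
pruned = CountVectorizer(ngram_range=(1, 2), min_df=2).fit(toy_docs)

print(len(full.vocabulary_))      # 5 unigrams + 4 bigrams
print(sorted(pruned.vocabulary_))
```

On real text most bigrams are rare, so a document-frequency threshold usually removes a large share of the features.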

In [35]:
# b. Print the document-term matrices (sparse format: (row, column) followed by the count)
print(X_train_dtm)
print(X_test_dtm)

  (1, 1005427)	33
  (1, 2249396)	2
  (1, 955523)	4
  (1, 223584)	1
  (1, 156008)	1
  (1, 2276207)	3
  (1, 1318141)	2
  (1, 625530)	2
  (1, 264787)	1
  (1, 822062)	1
  (1, 736258)	1
  (1, 1933282)	3
  (1, 820995)	1
  (1, 928268)	4
  (1, 835366)	1
  (1, 2306513)	1
  (1, 905025)	2
  (1, 2187362)	1
  (1, 826769)	1
  (1, 658199)	1
  (1, 2286846)	1
  (1, 552201)	12
  (1, 1133587)	1
  (1, 2170961)	1
  (1, 553846)	1
  :	:
  (34999, 870019)	1
  (34999, 181285)	1
  (34999, 1968792)	1
  (34999, 2013198)	1
  (34999, 678985)	1
  (34999, 364019)	1
  (34999, 422762)	1
  (34999, 1969010)	1
  (34999, 293625)	1
  (34999, 64479)	1
  (34999, 2267236)	1
  (34999, 2266950)	1
  (34999, 785620)	1
  (34999, 359044)	1
  (34999, 679456)	1
  (34999, 246297)	1
  (34999, 2093515)	1
  (34999, 359034)	1
  (34999, 1473141)	1
  (34999, 857243)	1
  (34999, 2267278)	1
  (34999, 953929)	1
  (34999, 2267250)	1
  (34999, 2330399)	1
  (34999, 2108757)	1
  (0, 197208)	1
  (0, 349152)	1
  (0, 393856)	1
  (0, 393926)	1
  (0, 40

In [36]:
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

# 6. Create a dictionary to get the count of every label

In [37]:
##   i.e. the key will be label name and value will  be the total count of the label.
from sklearn.preprocessing import MultiLabelBinarizer

In [41]:
mlb = MultiLabelBinarizer()

In [42]:
labels_tr = mlb.fit_transform(blogtext_df1['labels'])

labels_df = pd.DataFrame(labels_tr, columns=mlb.classes_)

In [43]:
labels_dict = dict(labels_df.sum())

In [44]:
labels_dict

{'13': 745,
 '14': 2043,
 '15': 3508,
 '16': 4156,
 '17': 6859,
 '23': 5518,
 '24': 5746,
 '25': 2837,
 '26': 2869,
 '27': 4094,
 '33': 1654,
 '34': 1886,
 '35': 3365,
 '36': 1985,
 '37': 310,
 '38': 196,
 '39': 412,
 '40': 192,
 '41': 394,
 '42': 96,
 '43': 150,
 '44': 38,
 '45': 93,
 '46': 330,
 '47': 206,
 '48': 318,
 'Accounting': 364,
 'Advertising': 273,
 'Agriculture': 78,
 'Aquarius': 4784,
 'Architecture': 70,
 'Aries': 7795,
 'Arts': 1817,
 'Automotive': 116,
 'Banking': 283,
 'Biotech': 101,
 'BusinessServices': 416,
 'Cancer': 4589,
 'Capricorn': 3819,
 'Chemicals': 75,
 'Communications': 1603,
 'Construction': 28,
 'Consulting': 243,
 'Education': 2646,
 'Engineering': 1402,
 'Environment': 6,
 'Fashion': 1805,
 'Gemini': 2558,
 'Government': 599,
 'HumanResources': 79,
 'Internet': 1420,
 'InvestmentBanking': 85,
 'Law': 308,
 'LawEnforcement': 125,
 'Leo': 3904,
 'Libra': 4378,
 'Libraries': 285,
 'Manufacturing': 441,
 'Maritime': 54,
 'Marketing': 414,
 'Media': 1603,
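The counting step can be checked on a toy example. The sketch below uses two hypothetical label lists; summing each column of the binarized matrix gives the number of rows that carry that label, which is what the dictionary above contains.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

toy_labels = [["male", "15", "Student"], ["female", "15", "indUnk"]]

toy_mlb = MultiLabelBinarizer()
toy_matrix = toy_mlb.fit_transform(toy_labels)   # one 0/1 column per distinct label

# Column sums = how many rows carry each label
toy_counts = pd.DataFrame(toy_matrix, columns=toy_mlb.classes_).sum().to_dict()

print(list(toy_mlb.classes_))
print(toy_counts)
```

`classes_` is sorted, so numeric strings come first, then capitalized names, then lowercase ones, which is the same ordering seen in the dictionary above.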


# 7. Transform the label

In [45]:
#a. Convert your train and test labels using MultiLabelBinarizer

In [46]:
y_train_tr = mlb.transform(y_train)

In [47]:
y_test_tr = mlb.transform(y_test)

# 8. Choose a classifier 

In [48]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

In [49]:
#a. Use a linear classifier, wrap it up in OneVsRestClassifier to train it on  every label  

# Note: OneVsRestClassifier fits one binary LogisticRegression per label, so multi_class='ovr' has no additional effect here
lgr = LogisticRegression(solver='lbfgs', multi_class="ovr",n_jobs=-1)
ovrc = OneVsRestClassifier(lgr)
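To show what the wrapper does, here is a small sketch on synthetic data (the arrays `X_toy` and `Y_toy` are made up for illustration). Given a binary indicator matrix with one column per label, `OneVsRestClassifier` fits one independent binary classifier per column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Six samples, two features; label 0 is "on" when feature 0 is non-zero, label 1 likewise.
X_toy = np.array([[2, 0], [3, 0], [0, 2], [0, 3], [2, 2], [0, 0]], dtype=float)
Y_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

toy_clf = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
toy_clf.fit(X_toy, Y_toy)

toy_pred = toy_clf.predict(X_toy)
print(len(toy_clf.estimators_))   # one fitted LogisticRegression per label column
print(toy_pred.shape)             # same shape as Y_toy
```

`OneVsRestClassifier` also accepts its own `n_jobs` argument, which is the setting that spreads the per-label fits across cores.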

# 9.  Fit the classifier, make predictions and get the accuracy 

In [50]:
ovrc.fit(X_train_dtm, y_train_tr)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='ovr', n_jobs=-1,
                                                 penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [51]:
print("Training Accuracy: ", metrics.accuracy_score(y_train_tr, ovrc.predict(X_train_dtm)))

Training Accuracy:  0.9270285714285714


In [53]:
y_pred = ovrc.predict(X_test_dtm)
print("Test Accuracy: ", metrics.accuracy_score(y_test_tr, y_pred))

Test Accuracy:  0.12433333333333334


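Accuracy is measured differently in multi-label classification than in single-label classification.
In multi-label classification, a prediction is not simply right or wrong: predicting a subset of the true labels is better than predicting none of them. The accuracy reported above is subset accuracy, meaning `accuracy_score` counts a sample as correct only when every one of its labels matches, which is why it is so much lower than the per-label scores below.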
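The difference between all-or-nothing accuracy and per-label correctness can be seen on a toy example (the arrays are made up). For multi-label indicator input, `accuracy_score` requires an exact match of the whole label set, while `hamming_loss` reports the fraction of individual label slots that are wrong.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

y_true_toy = np.array([[1, 0, 1],
                       [0, 1, 0]])
y_pred_toy = np.array([[1, 0, 0],    # one of three labels wrong -> whole row counts as wrong
                       [0, 1, 0]])   # exact match

subset_acc = accuracy_score(y_true_toy, y_pred_toy)   # 1 of 2 rows matches exactly
ham = hamming_loss(y_true_toy, y_pred_toy)            # 1 of 6 label slots is wrong

print(subset_acc, ham)
```

A single wrong label out of six halves the subset accuracy here, which illustrates why it is a harsh measure when every sample carries several labels.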

In micro-averaging, the TPs, TNs, FPs and FNs of all classes are summed first, and the metric is then computed from those totals. Micro-averaging therefore aggregates the contributions of all classes, so frequent classes dominate the result. It can be a useful measure when the class imbalance is already known.

Macro-averaging computes the metric independently for each class and then takes the unweighted mean, i.e. it treats all classes equally.
It is useful when we want to know how the system performs across all classes, including the rare ones.

In weighted averaging, each class's contribution to the average is weighted by the number of true examples available for it (its support).
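The three averaging schemes can be worked through by hand on a small imbalanced example and compared with scikit-learn. The arrays below are made up for illustration: class 0 is frequent and always predicted correctly, class 1 is rare and always missed.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true_toy = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])
y_pred_toy = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

# Per-class confusion counts (column-wise)
tp = (y_true_toy & y_pred_toy).sum(axis=0)          # [4, 0]
fp = ((1 - y_true_toy) & y_pred_toy).sum(axis=0)    # [0, 0]
fn = (y_true_toy & (1 - y_pred_toy)).sum(axis=0)    # [0, 1]
support = y_true_toy.sum(axis=0)                    # [4, 1]

per_class_f1 = 2 * tp / (2 * tp + fp + fn)          # [1.0, 0.0]

micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())   # pool counts first: 8/9
macro_f1 = per_class_f1.mean()                                    # unweighted mean: 0.5
weighted_f1 = (per_class_f1 * support).sum() / support.sum()      # support-weighted: 0.8

print(micro_f1, macro_f1, weighted_f1)
print(f1_score(y_true_toy, y_pred_toy, average="micro", zero_division=0))
```

The rare class drags the macro score down to 0.5, while the micro and weighted scores stay high because the frequent class dominates them. The same pattern appears in the results below.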

In [54]:
print("Micro-averaging :")
print("F1 Score: ", metrics.f1_score(y_test_tr, y_pred, average='micro'))
print("Precision Score: ", metrics.precision_score(y_test_tr, y_pred, average='micro'))
print("Recall Score: ", metrics.recall_score(y_test_tr, y_pred, average='micro'))

Micro-averaging :
F1 Score:  0.49144107191687175
Precision Score:  0.7326180537438038
Recall Score:  0.3697272921775481


In [55]:
print("Macro-averaging :")
print("F1 Score: ", metrics.f1_score(y_test_tr, y_pred, average='macro'))
print("Precision Score: ", metrics.precision_score(y_test_tr, y_pred, average='macro'))
print("Recall Score: ", metrics.recall_score(y_test_tr, y_pred, average='macro'))

Macro-averaging :
F1 Score:  0.23526936306351495
Precision Score:  0.5823387810960329
Recall Score:  0.1615928424949917


In [56]:
print("Weighted-averaging  :")
print("F1 Score: ", metrics.f1_score(y_test_tr, y_pred, average='weighted'))
print("Precision Score: ", metrics.precision_score(y_test_tr, y_pred, average='weighted'))
print("Recall Score: ", metrics.recall_score(y_test_tr, y_pred, average='weighted'))

Weighted-averaging  :
F1 Score:  0.4476699209916955
Precision Score:  0.7117039411788949
Recall Score:  0.3697272921775481


# 10.  Print true label and predicted label for any five examples 

In [58]:
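# Convert the binarized labels back to label tuples once, outside the loop
true_labels = mlb.inverse_transform(y_test_tr)
pred_labels = mlb.inverse_transform(y_pred)
for ii in np.random.randint(0, len(y_test_tr), 5):
    print("Test:", true_labels[ii])
    print("predicted :", pred_labels[ii])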

Test: ('34', 'Aquarius', 'Education', 'male')
predicted : ('35', 'Technology', 'male')
Test: ('14', 'Aquarius', 'Student', 'female')
predicted : ('14', 'Student', 'male')
Test: ('17', 'Aquarius', 'female', 'indUnk')
predicted : ('17', 'Aquarius', 'female', 'indUnk')
Test: ('15', 'Scorpio', 'female', 'indUnk')
predicted : ('17', 'male')
Test: ('34', 'Sagittarius', 'female', 'indUnk')
predicted : ('34', 'Sagittarius', 'female', 'indUnk')


# Conclusion

1. I used 50,000 data points (cell [7] above), and training took almost 2 hours to reach a training accuracy of approximately 93%.

2. I set n_jobs=-1 to make the best use of the CPUs, which helped reduce the compute time.

3. The test accuracy of the model is very poor, at roughly 12-13%. More iterations and model tuning should be carried out to improve it.

   Note: With 5k records I could get 95% on train and 42% on test data. Increasing the number of records also increases the number of features and labels, so I expected a lower accuracy on the test set.

4. This is also reflected in the true and predicted labels above. The model did not predict every label for every example, and some of the labels it did predict are incorrect.

5. Also note that I haven't removed numbers from the text column, so numeric tokens add to the feature count as well (for example, "100" and "45" survive in the cleaned text shown above).

6. We also see a significant difference between the precision/recall scores under micro- and macro-averaging. The label classes seem to be highly imbalanced, so it would be better to go with the micro-averaging method.
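As a possible follow-up to item 5, purely numeric tokens could be dropped from the cleaned text before vectorizing. The helper below is a hypothetical sketch (`drop_numeric_tokens` is not part of the notebook) and has not been evaluated on this dataset.

```python
def drop_numeric_tokens(text):
    """Remove whitespace-separated tokens that consist only of digits."""
    return " ".join(tok for tok in text.split() if not tok.isdigit())

print(drop_numeric_tokens("info found 100 pages 45 mb pdf files"))
print(drop_numeric_tokens("went to her place b4 going out"))
```

Mixed tokens such as "b4" are kept, since they may carry stylistic information about the author.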