# Project | Statistical NLP
## Project Description
Classification is probably the most common task you will deal with in practice.
Text in the form of blogs, posts, articles, etc. is written every second. Predicting information about a writer from the text alone, without knowing anything else about them, is a challenge.

We are going to create a classifier that predicts multiple features of the author of a given text.
We have designed it as a Multilabel classification problem.

## Dataset
### Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

8240 "10s" blogs (ages 13-17)

8086 "20s" blogs (ages 23-27)

2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting
has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post and links within a post are denoted by the label urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/download

## Approach & Steps
### 1. Load the dataset
#### a. Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.

In [1]:
#Importing pandas
import pandas as pd

In [2]:
#Loading the dataset with pandas read_csv, restricted to the first 100,000 records
dataset = pd.read_csv(r'C:\Users\Sanket\Desktop\PGP AIML\Statistical NLP\Statistical NLP Project\blog-authorship-corpus\blogtext.csv', nrows=100000)

In [3]:
#Checking the shape of the Dataset
dataset.shape

(100000, 7)
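
If memory is tight, the load can be narrowed further by reading only the columns that are used later. The sketch below assumes the CSV holds the seven columns reported in the shape above; the in-memory `sample_csv` string is a made-up stand-in for `blogtext.csv`, not real corpus data.

```python
import io
import pandas as pd

# Made-up three-row sample standing in for blogtext.csv
sample_csv = io.StringIO(
    "id,gender,age,topic,sign,date,text\n"
    "1,male,15,Student,Leo,\"14,May,2004\",first post\n"
    "2,female,25,Arts,Virgo,\"15,May,2004\",second post\n"
    "3,male,35,Law,Aries,\"16,May,2004\",third post\n"
)

# usecols skips columns that are dropped later anyway; nrows caps the row count
sample = pd.read_csv(
    sample_csv,
    usecols=["gender", "age", "topic", "sign", "text"],
    nrows=2,
)
print(sample.shape)
```

Whether this helps depends on how wide the skipped columns are; here `id` and `date` are small, so the main saving still comes from `nrows`.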

In [4]:
#Printing top 5 records to check if the Dataset is loaded properly
dataset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


### 2. Preprocess rows of the “text” column
#### a. Remove unwanted characters
#### b. Convert text to lowercase
#### c. Remove unwanted spaces
#### d. Remove stopwords

In [5]:
#Importing re library for Regular Expression Engine
import re

#Importing stopwords sub-module from module corpus in nltk library
from nltk.corpus import stopwords

#Lower casing & stripping
dataset['text'] = dataset['text'].apply(lambda s: s.lower().strip())

#Printing top 5 records to check that the text was lower-cased and stripped
dataset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","info has been found (+/- 100 pages, and 4.5 mb..."
1,2059027,male,15,Student,Leo,"13,May,2004",these are the team members: drewes van der l...
2,2059027,male,15,Student,Leo,"12,May,2004",in het kader van kernfusie op aarde: maak je ...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks to yahoo!'s toolbar i can now 'capture'...


In [6]:
#Keeping only lowercase letters & spaces (digits and punctuation are replaced by a space)
dataset['text'] = dataset['text'].apply(lambda s: re.sub('[^a-z ]+',' ',s))

#Printing top 5 records to check that the unwanted characters were removed
dataset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info has been found pages and mb of pd...
1,2059027,male,15,Student,Leo,"13,May,2004",these are the team members drewes van der l...
2,2059027,male,15,Student,Leo,"12,May,2004",in het kader van kernfusie op aarde maak je ...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks to yahoo s toolbar i can now capture ...


In [7]:
#set of stop words
stop_words = set(stopwords.words('english'))

#For each stop word, replace its occurrences in the text column of the Dataset with a space
for w in stop_words:
    dataset['text'] = dataset['text'].apply(lambda s: re.sub('(^|[ ]+)' + w + '([ ]+|$)',' ',s))

#Printing top 5 records to check that the stop words were removed
dataset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages mb pdf files wait untill ...
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag ...
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je ei...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls popups means...


In [8]:
#Removing unwanted spaces
dataset['text'] = dataset['text'].apply(lambda s: re.sub('[ ]{2,}',' ',s))
dataset['text'] = dataset['text'].apply(lambda s: re.sub('(^[ ]+|[ ]+$)','',s))

#Printing top 5 records to check that the extra spaces were removed
dataset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages mb pdf files wait untill team...
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ...
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je eigen...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls popups means...
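
The four preprocessing steps can also be folded into a single pass per row. The sketch below is one possible way to do that, not the notebook's method: `clean_text` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny stop-word set is an assumption that stands in for NLTK's English list. Filtering tokens also handles a directly repeated stop word such as "the the", which the per-word regex pass appears to leave behind because the first match consumes the separating space.

```python
import re

# Hypothetical miniature stop-word set; the notebook uses NLTK's English list
DEMO_STOP_WORDS = frozenset({"to", "i", "can", "now", "s"})

def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, keep only a-z and spaces, drop stop words, normalise spaces."""
    text = text.lower()
    # Anything that is not a lowercase letter or a space becomes a space
    text = re.sub(r"[^a-z ]+", " ", text)
    # split() discards leading, trailing and repeated spaces in one go
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)

print(clean_text("Thanks to Yahoo!'s Toolbar I can now 'capture' URLs"))
```

With the real stop-word list this would be applied as `dataset['text'].apply(lambda s: clean_text(s, stop_words))`, which loops over the rows once instead of once per stop word.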


### 3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence
#### a. Label columns to merge: “gender”, “age”, “topic”, “sign”
#### b. After completing the previous step, there should be only two columns in your data frame

In [9]:
#Merging all the Label columns together
dataset['labels'] = dataset[['gender','age','topic','sign']].values.tolist()

#Printing top 5 records to check the new labels column
dataset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,labels
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing,"[male, 15, Student, Leo]"
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


In [10]:
#Dropping unwanted columns from Dataset
dataset.drop(columns = ['id','gender','age','topic','sign','date'], inplace=True)

#Printing top 5 records to check that only the text and labels columns remain
dataset.head()

Unnamed: 0,text,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"
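
The merged lists mix integer ages with string labels. A possible variant, sketched here on a made-up frame named `demo`, casts everything to strings at merge time so that every label has one type from the start.

```python
import pandas as pd

# Made-up two-row frame with the same label columns as the notebook
demo = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 33],
    "topic": ["Student", "Arts"],
    "sign": ["Leo", "Virgo"],
})

label_cols = ["gender", "age", "topic", "sign"]
# astype(str) turns the integer ages into strings before the rows become lists
demo["labels"] = demo[label_cols].astype(str).values.tolist()
print(demo["labels"].tolist())
```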


### 4. Separate features and labels, and split the data into training and testing

In [11]:
#Importing train_test_split function from module model_selection in library sklearn
from sklearn.model_selection import train_test_split

#Doing a train test split with test_size of 0.20
X_train, X_test, y_train, y_test = train_test_split(dataset['text'], dataset['labels'], test_size=0.20)
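
Without a seed, every run produces a different split, so the reported scores change from run to run. The sketch below shows the effect of `random_state` on made-up data; the value 42 is an arbitrary choice for illustration.

```python
from sklearn.model_selection import train_test_split

texts = [f"post {i}" for i in range(10)]      # made-up features
labels = [[str(i % 2)] for i in range(10)]    # made-up labels

# The same seed gives the same shuffle, hence the same split
split_a = train_test_split(texts, labels, test_size=0.20, random_state=42)
split_b = train_test_split(texts, labels, test_size=0.20, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))
print(split_a == split_b)
```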

### 5. Vectorize the features
#### a. Create a Bag of Words using count vectorizer
##### i. Use ngram_range=(1, 2)
##### ii. Vectorize training and testing features
#### b. Print the term-document matrix

In [12]:
#Importing CountVectorizer function from feature_extraction.text submodule in sklearn library
from sklearn.feature_extraction.text import CountVectorizer

#define vectorizer parameters
vectorizer = CountVectorizer(ngram_range=(1,2))

#Creating document-term matrix
X_train_vect = vectorizer.fit_transform(X_train)

In [13]:
#Doing transform on X_test
X_test_vect = vectorizer.transform(X_test)

In [14]:
#Printing the document-term matrix (sparse format: each line is (document index, feature index) followed by the count)
print(X_train_vect)

  (0, 1986509)	1
  (0, 539258)	1
  (0, 3363838)	1
  (0, 3941454)	1
  (0, 310951)	1
  (0, 1464535)	1
  (0, 1481474)	1
  (0, 2621926)	1
  (0, 670616)	1
  (0, 3852999)	1
  (0, 3289756)	1
  (0, 3444010)	1
  (0, 3853177)	1
  (0, 156329)	1
  (0, 2241438)	1
  (0, 177092)	1
  (0, 2166692)	1
  (0, 1139361)	1
  (0, 4161073)	1
  (0, 3367532)	1
  (0, 1888803)	1
  (0, 622387)	1
  (0, 1269359)	1
  (0, 3713140)	1
  (0, 3752217)	1
  :	:
  (79999, 2235143)	1
  (79999, 3349652)	1
  (79999, 3035372)	2
  (79999, 186153)	1
  (79999, 253191)	3
  (79999, 1244778)	1
  (79999, 1243419)	1
  (79999, 1086867)	1
  (79999, 4149738)	1
  (79999, 2029099)	1
  (79999, 360014)	1
  (79999, 2922033)	1
  (79999, 3808626)	5
  (79999, 2430638)	1
  (79999, 1940891)	1
  (79999, 2374515)	1
  (79999, 1148825)	1
  (79999, 2052736)	1
  (79999, 1490134)	1
  (79999, 851955)	1
  (79999, 2545271)	1
  (79999, 1479412)	1
  (79999, 3288577)	1
  (79999, 1269323)	1
  (79999, 2043221)	1
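
The sparse printout is hard to read at this size. A toy example with two made-up documents shows what `ngram_range=(1, 2)` produces: one column per unigram and per bigram, with each cell holding the count for that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["new blog post", "blog post"]  # made-up documents

demo_vectorizer = CountVectorizer(ngram_range=(1, 2))
demo_matrix = demo_vectorizer.fit_transform(docs)

# vocabulary_ maps each unigram/bigram to its column index
vocab = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
print(vocab)
print(demo_matrix.toarray())
```

Calling `toarray()` is only practical on small inputs; the matrix above has millions of columns and should stay sparse.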


### 6. Create a dictionary to get the count of every label, i.e. the key will be the label name and the value will be the total count of that label

In [15]:
#Creating blank dictionary for Training Labels
label_count_train = dict()

#For each record in training labels
for record in y_train:
    #For each listitem in the record
    for label in record:
        #Checking if label is new one
        if label not in label_count_train.keys():
            #Storing new labels in dictionary and registering a count of 1
            label_count_train[str(label)] = 1
        #Else
        else:
            #Increment the count for label in dictionary
            label_count_train[str(label)] += 1

In [16]:
#Printing the dictionary for training set labels
print(label_count_train)

{'male': 42665, '26': 1, 'Technology': 6818, 'Scorpio': 5651, '25': 1, 'Leo': 6635, 'Arts': 4008, 'Virgo': 5646, 'indUnk': 26462, 'Aquarius': 7166, '15': 1, 'Student': 17698, 'female': 37335, '17': 1, '27': 1, 'Government': 1654, '36': 1, 'Pisces': 6043, '23': 1, 'Internet': 1811, 'Aries': 8552, 'Gemini': 7360, '37': 1, '24': 1, 'Science': 892, '16': 1, 'Capricorn': 7022, 'Sagittarius': 5938, '14': 1, '45': 1, 'Education': 4421, 'Cancer': 7428, '38': 1, 'Taurus': 6784, 'Communications-Media': 2272, 'Libra': 5775, 'Fashion': 1522, 'Agriculture': 138, 'Religion': 852, 'Advertising': 610, '35': 1, '39': 1, '40': 1, '43': 1, '42': 1, '41': 1, '33': 1, 'Construction': 198, 'InvestmentBanking': 203, 'BusinessServices': 505, 'Military': 619, 'Sports-Recreation': 332, 'Manufacturing': 435, 'Consulting': 732, '34': 1, 'Non-Profit': 1057, 'Engineering': 1861, 'Chemicals': 239, 'Accounting': 418, 'Publishing': 864, '13': 1, '46': 1, 'Marketing': 570, '47': 1, 'Law': 295, 'Tourism': 191, 'LawEnfor
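
In the output above, every age label has a count of 1. This appears to come from the membership test, which checks `label` (an integer for ages) against keys that are stored as `str(label)`: an age is never found, so its count is reset to 1 each time it occurs. A sketch that converts to strings first, using `collections.Counter` on made-up label lists (`count_labels` is a hypothetical helper):

```python
from collections import Counter

# Made-up label lists mixing integer ages with string labels
demo_labels = [
    ["male", 15, "Student", "Leo"],
    ["male", 15, "Student", "Leo"],
    ["female", 33, "Arts", "Leo"],
]

def count_labels(records):
    """Count labels across records, keyed by their string form."""
    return Counter(str(label) for record in records for label in record)

demo_counts = count_labels(demo_labels)
print(dict(demo_counts))
```

The same helper would apply to both `y_train` and `y_test`.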

In [17]:
#Creating blank dictionary for Testing Labels
label_count_test = dict()

#For each record in testing labels
for record in y_test:
    #For each listitem in the record
    for label in record:
        #Checking if label is new one
        if label not in label_count_test.keys():
            #Storing new labels in dictionary and registering a count of 1
            label_count_test[str(label)] = 1
        #Else
        else:
            #Increment the count for label in dictionary
            label_count_test[str(label)] += 1

In [18]:
#Printing the dictionary for testing set labels
print(label_count_test)

{'male': 10693, '16': 1, 'Technology': 1666, 'Aquarius': 1884, 'indUnk': 6635, 'Gemini': 1865, 'female': 9307, '34': 1, 'Sagittarius': 1428, '23': 1, 'Arts': 1023, 'Cancer': 1825, '24': 1, 'Education': 1132, 'Capricorn': 1701, '25': 1, '33': 1, '27': 1, 'Virgo': 1488, '15': 1, 'Libra': 1475, 'BusinessServices': 121, '38': 1, '17': 1, 'Student': 4424, 'Aries': 2085, 'Religion': 229, 'Taurus': 1746, 'Marketing': 156, 'Pisces': 1510, '35': 1, 'Non-Profit': 269, 'Scorpio': 1398, '26': 1, 'Leo': 1595, '43': 1, 'Communications-Media': 558, '14': 1, 'Engineering': 471, '41': 1, '36': 1, 'Government': 401, 'Internet': 440, '40': 1, 'Fashion': 376, '46': 1, 'Consulting': 173, 'RealEstate': 36, 'Science': 198, 'Museums-Libraries': 68, 'Law': 65, '39': 1, 'Manufacturing': 107, '47': 1, '13': 1, 'Accounting': 110, 'LawEnforcement-Security': 75, '42': 1, '45': 1, 'Biotech': 61, '48': 1, 'Construction': 52, 'Military': 179, 'Publishing': 215, '37': 1, 'Transportation': 143, 'Sports-Recreation': 74, 

### 7. Transform the labels
As noted earlier, each example in this task can carry multiple tags. To handle this kind of
prediction, we need to transform the labels into binary form, so that each prediction is
a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

#### a. Convert your train and test labels using MultiLabelBinarizer

In [91]:
#Importing the MultiLabelBinarizer class from the preprocessing module in the sklearn library
from sklearn.preprocessing import MultiLabelBinarizer

#Casting every label to a string so that ages (integers) and the other labels share one type
y_train = [[str(label) for label in labels] for labels in y_train]
y_test = [[str(label) for label in labels] for labels in y_test]

#Initialize MultiLabelBinarizer
mlb = MultiLabelBinarizer()

#Fitting MultiLabelBinarizer with training Labels & transforming it to get them in One-hot encoded form
label_train = mlb.fit_transform(y_train)

#Transforming testing Labels to get them in One-hot encoded form
label_test = mlb.transform(y_test)
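
A toy run on made-up labels shows what the binarizer returns: one column per distinct label, in sorted order, and one 0/1 row per example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

demo_train = [["male", "15", "Student"], ["female", "33", "Student"]]  # made up
demo_mlb = MultiLabelBinarizer()
demo_encoded = demo_mlb.fit_transform(demo_train)

print(list(demo_mlb.classes_))
print(demo_encoded)
# inverse_transform maps a 0/1 row back to its labels
print(demo_mlb.inverse_transform(demo_encoded[:1]))
```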

### 8. Choose a classifier
In this task, we suggest using the One-vs-Rest approach, which is implemented in the
OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As the
base classifier, use LogisticRegression. It is one of the simplest methods, but it often
performs well enough in text classification tasks. Training might take some time because the
number of classifiers to train is large.

#### a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label

In [92]:
#Importing the OneVsRestClassifier class from the multiclass module in the sklearn library
from sklearn.multiclass import OneVsRestClassifier

#Importing the LogisticRegression class from the linear_model module in the sklearn library
from sklearn.linear_model import LogisticRegression

#Initializing LogisticRegression model with lbfgs solver
clf = LogisticRegression(solver='lbfgs')

#Wrapping it up in OneVsRestClassifier
clf = OneVsRestClassifier(clf)
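
To make the "one classifier per label" idea concrete, the sketch below fits the same wrapper on a tiny made-up problem with two labels. The data and the `max_iter=1000` setting are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up features and a 2-label indicator matrix
X_demo = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
Y_demo = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))
ovr.fit(X_demo, Y_demo)

print(len(ovr.estimators_))       # one fitted LogisticRegression per label
print(ovr.predict(X_demo).shape)  # one 0/1 column per label
```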

### 9. Fit the classifier, make predictions and get the accuracy
#### a. Print the following
##### i. Accuracy score
##### ii. F1 score
##### iii. Average precision score
##### iv. Average recall score
##### v. Tip: Make sure you are familiar with all of them. How would you expect the things to work for the multi-label scenario? Read about micro/macro/weighted averaging

In [93]:
#Fitting the OVR model on training dataset
clf.fit(X_train_vect, label_train)





OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [97]:
#Predicting labels for testing features
label_pred = clf.predict(X_test_vect)

In [98]:
#Importing the accuracy_score, f1_score, precision_score and recall_score functions from the metrics module in the sklearn library
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

#Printing subset accuracy, plus F1, precision & recall using micro averaging
print('Accuracy:', accuracy_score(label_test, label_pred))
print('F1:', f1_score(label_test, label_pred, average='micro'))
print('Average Precision:', precision_score(label_test, label_pred, average='micro'))
print('Average Recall:', recall_score(label_test, label_pred, average='micro'))

Accuracy: 0.1153
F1: 0.4867503486750349
Average Precision: 0.7246391140992683
Average Recall: 0.36645
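
For multi-label indicator matrices, `accuracy_score` counts a row as correct only when every label in it matches (subset accuracy), which helps explain why accuracy sits far below the micro-averaged scores. Micro averaging pools true positives, false positives and false negatives over all labels before taking the ratios. The hand-rolled sketch below illustrates both definitions on made-up data; `micro_scores` is a hypothetical helper, not part of sklearn.

```python
def micro_scores(y_true, y_pred):
    """Subset accuracy plus micro-averaged precision, recall and F1."""
    tp = fp = fn = exact = 0
    for true_row, pred_row in zip(y_true, y_pred):
        # Subset accuracy: the whole row has to match
        exact += int(list(true_row) == list(pred_row))
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return exact / len(y_true), precision, recall, f1

demo_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
demo_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
acc, prec, rec, f1 = micro_scores(demo_true, demo_pred)
print(acc, prec, rec, f1)
```

The toy data has the same pattern as the results above: few labels are predicted, most of them are right (high precision), many true labels are missed (low recall), and only one row in three matches exactly.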


### 10. Print true label and predicted label for any five examples

In [100]:
#Importing the randrange function from the random module
from random import randrange

#Getting real labels from transformed predicted labels
y_pred = mlb.inverse_transform(label_pred)

#Picking 5 random records from y_test and comparing actual labels vs predicted labels for those 5 records
for _ in range(5):
    i = randrange(len(y_test))
    print("True labels for", i, "th record are:", y_test[i])
    print("Predicted labels for the same record are:", y_pred[i])

True labels for 18640 th record are: ['male', '23', 'Engineering', 'Aquarius']
Predicted labels for the same record are: ('Aquarius', 'male')
True labels for 3798 th record are: ['male', '35', 'Government', 'Gemini']
Predicted labels for the same record are: ('male',)
True labels for 7655 th record are: ['female', '13', 'indUnk', 'Gemini']
Predicted labels for the same record are: ('indUnk', 'male')
True labels for 9110 th record are: ['female', '15', 'indUnk', 'Gemini']
Predicted labels for the same record are: ('male',)
True labels for 18114 th record are: ['male', '15', 'Science', 'Pisces']
Predicted labels for the same record are: ('male',)
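
Reading the pairs by eye gets tedious. A small set-based comparison, sketched below with a hypothetical helper `compare_labels`, splits each record into correctly predicted, missed and spurious labels. The example reuses the first record printed above.

```python
def compare_labels(true_labels, predicted_labels):
    """Return (correct, missed, spurious) label sets for one record."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    return true_set & pred_set, true_set - pred_set, pred_set - true_set

# First record from the printed sample above
correct, missed, spurious = compare_labels(
    ["male", "23", "Engineering", "Aquarius"], ("Aquarius", "male")
)
print(sorted(correct), sorted(missed), sorted(spurious))
```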
