## <center> **Statistical NLP - Blog Authorship Corpus** </center>

Over 600,000 posts from more than 19 thousand bloggers

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown).

All bloggers included in the corpus fall into one of three age groups:
- 8240 "10s" blogs (ages 13-17)
- 8086 "20s" blogs (ages 23-27)
- 2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

Link to dataset:
https://www.kaggle.com/rtatman/blog-authorship-corpus

**Import Necessary packages:**

In [0]:
%tensorflow_version 2.x

TensorFlow 2.x selected.


In [0]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
import tensorflow as tf

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics

from nltk.corpus import stopwords 

import warnings
warnings.filterwarnings('ignore')

**Mount Drive:**

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Extract data from zip file:**

In [0]:
project_path = "/content/drive/My Drive/Statistical NLP/"

In [0]:
# from zipfile import ZipFile
# with ZipFile(project_path + 'blog-authorship-corpus.zip', 'r') as z:
#   z.extractall(project_path)

### **1. Load the dataset (5 points)**

In [0]:
blog_df = pd.read_csv(project_path + 'blogtext.csv')

In [0]:
blog_df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


**View the shape of data:**

In [0]:
blog_df.shape

(681284, 7)

**View dataset info:**

In [0]:
blog_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
id        681284 non-null int64
gender    681284 non-null object
age       681284 non-null int64
topic     681284 non-null object
sign      681284 non-null object
date      681284 non-null object
text      681284 non-null object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB


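**We will use only the first 25,000 records of the original data for ease of processing**

In [0]:
# .copy() gives an independent dataframe, so later column assignments do not act on a view of blog_df
blog_df_ss = blog_df.head(25000).copy()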
blog_df_ss.shape

(25000, 7)

In [0]:
blog_df_ss.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


### **2. Preprocess rows of the “text” column (7.5 points)**

In [0]:
# a. Convert text to lowercase
blog_df_ss['text'] = blog_df_ss['text'].apply(lambda x: x.lower())

# b. Remove unwanted characters (anything other than digits and lowercase letters becomes a space)
blog_df_ss['text'] = blog_df_ss['text'].apply(lambda x: re.sub('[^0-9a-z]', " ", x))

# c. Remove leading and trailing spaces
blog_df_ss['text'] = blog_df_ss['text'].apply(lambda x: x.strip())

In [0]:
blog_df_ss.head(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info has been found 100 pages and 4 5 mb...
1,2059027,male,15,Student,Leo,"13,May,2004",these are the team members drewes van der l...
2,2059027,male,15,Student,Leo,"12,May,2004",in het kader van kernfusie op aarde maak je ...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks to yahoo s toolbar i can now capture ...
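
The three cleaning steps can also be folded into one helper. The following is a minimal sketch, not the notebook's own code: the function name `clean_text` is hypothetical, and the whitespace-collapsing step is an addition that the cell above does not perform (it only strips leading and trailing spaces).

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, replace non-alphanumerics with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^0-9a-z]", " ", text)  # same character class as the cell above
    text = re.sub(r"\s+", " ", text)        # extra step: collapse runs of spaces
    return text.strip()

sample = "Thanks to Yahoo!'s Toolbar I can now 'capture' URLs..."
print(clean_text(sample))  # thanks to yahoo s toolbar i can now capture urls
```

Applied to a dataframe column, this would be used as `blog_df_ss['text'].apply(clean_text)`.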


**Remove stopwords:**

In [0]:
stop_words = set(stopwords.words('english'))

# Adding additional stop words 
stop_words.update(['zero','one','two','three','four','five','six','seven','eight','nine','ten'])

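# One compiled pattern for all stop words. The trailing \W means a stop word is only
# matched when a non-word character follows it, so a stop word that is the very last
# token of a post is left in place.
stop_words_updt = re.compile(r"\b(" + "|".join(stop_words) + r")\W", re.I)

# Function to remove stopwords:
def removeStopWords(sentence):
    return stop_words_updt.sub(" ", sentence)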

# Call remove stopwords function:
blog_df_ss['text'] = blog_df_ss['text'].apply(removeStopWords)
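
Because the pattern above ends in `\W`, a stop word at the very end of a post is not matched. One possible variant, sketched below under stated assumptions, closes the pattern with `\b` instead and collapses the leftover spaces. The tiny `STOP_WORDS` set and the names `remove_stop_words` and `_pattern` are hypothetical stand-ins; the notebook itself uses nltk's English list plus the number words.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "is", "a", "of", "to", "and", "one", "two"}

# Escape each word so it is matched literally; \b on both sides keeps
# whole-word matching and also catches a stop word at the end of the string.
_pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, sorted(STOP_WORDS))) + r")\b",
    re.IGNORECASE,
)

def remove_stop_words(sentence: str) -> str:
    """Drop stop words, then collapse the whitespace they leave behind."""
    without = _pattern.sub(" ", sentence)
    return " ".join(without.split())

print(remove_stop_words("the end of the road is the"))  # end road
```

Swapping this in would change the exact spacing of the cleaned text, so the outputs shown in this notebook would differ slightly.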

**View the cleaned dataset:**

In [0]:
blog_df_ss.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info found 100 pages 4 5 mb pdf fil...
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag ...
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je ei...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls po...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",interesting conversation dad morning ...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",somehow coca cola way summing things well...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",anything korea country extremes everyth...
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",take read news article urllink joongang il...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",surf english news sites lot looking tidbit...


### **3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)**

**Dropping date column as it is not needed for our analysis:**

In [0]:
blog_df_ss = blog_df_ss.drop('date', axis=1)

In [0]:
blog_df_ss.head(2)

Unnamed: 0,id,gender,age,topic,sign,text
0,2059027,male,15,Student,Leo,info found 100 pages 4 5 mb pdf fil...
1,2059027,male,15,Student,Leo,team members drewes van der laag ...


**Merging gender, age, topic and sign columns to create a single label column:**

In [0]:
blog_df_ss['labels'] = blog_df_ss['gender'].astype(str) + ',' + blog_df_ss['age'].astype(str) + ',' + \
                       blog_df_ss['topic'].astype(str)  + ',' + blog_df_ss['sign'].astype(str)

In [0]:
blog_df_ss.head(10)

Unnamed: 0,id,gender,age,topic,sign,text,labels
0,2059027,male,15,Student,Leo,info found 100 pages 4 5 mb pdf fil...,"male,15,Student,Leo"
1,2059027,male,15,Student,Leo,team members drewes van der laag ...,"male,15,Student,Leo"
2,2059027,male,15,Student,Leo,het kader van kernfusie op aarde maak je ei...,"male,15,Student,Leo"
3,2059027,male,15,Student,Leo,testing testing,"male,15,Student,Leo"
4,3581210,male,33,InvestmentBanking,Aquarius,thanks yahoo toolbar capture urls po...,"male,33,InvestmentBanking,Aquarius"
5,3581210,male,33,InvestmentBanking,Aquarius,interesting conversation dad morning ...,"male,33,InvestmentBanking,Aquarius"
6,3581210,male,33,InvestmentBanking,Aquarius,somehow coca cola way summing things well...,"male,33,InvestmentBanking,Aquarius"
7,3581210,male,33,InvestmentBanking,Aquarius,anything korea country extremes everyth...,"male,33,InvestmentBanking,Aquarius"
8,3581210,male,33,InvestmentBanking,Aquarius,take read news article urllink joongang il...,"male,33,InvestmentBanking,Aquarius"
9,3581210,male,33,InvestmentBanking,Aquarius,surf english news sites lot looking tidbit...,"male,33,InvestmentBanking,Aquarius"


**Dropping the original gender, age, topic and sign columns:**

In [0]:
blog_df_ss = blog_df_ss.drop(labels = ['gender', 'age', 'topic', 'sign'], axis=1)

In [0]:
blog_df_ss.head()

Unnamed: 0,id,text,labels
0,2059027,info found 100 pages 4 5 mb pdf fil...,"male,15,Student,Leo"
1,2059027,team members drewes van der laag ...,"male,15,Student,Leo"
2,2059027,het kader van kernfusie op aarde maak je ei...,"male,15,Student,Leo"
3,2059027,testing testing,"male,15,Student,Leo"
4,3581210,thanks yahoo toolbar capture urls po...,"male,33,InvestmentBanking,Aquarius"


### **4. Separate features and labels, and split the data into training and testing (5 points)**

**Split features and labels:**

In [0]:
X = blog_df_ss.drop(labels = ['id','labels'], axis=1)
y = blog_df_ss.drop(labels = ['id','text'], axis=1)

**Split data into train and test:**

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X['text'], y['labels'], test_size=0.2, random_state=0, shuffle=True)

**View sample train feature:**

In [0]:
X_train.head()

10263    someone must  interested    feel        real  ...
18409    friday      well  least    made  blog   weekly...
13047    christmas     yup     still awake    got home ...
21371                                              stewart
16392    leaving   grabaawr   well    leaving   40 minu...
Name: text, dtype: object

**View sample train labels:**

In [0]:
y_train.head()

10263          male,17,Student,Cancer
18409       female,27,Religion,Taurus
13047    female,27,indUnk,Sagittarius
21371            female,25,indUnk,Leo
16392      male,27,indUnk,Sagittarius
Name: labels, dtype: object

**View shapes of train and test data sets:**

In [0]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(20000,)
(20000,)
(5000,)
(5000,)


### **5. Vectorize the features (5 points)**

**a. Create a Bag of Words using count vectorizer:**
*   Use ngram_range=(1, 2)
*   Vectorize training and testing features



In [0]:
# Count Vectorizer:
vectorizer = CountVectorizer(ngram_range=(1, 2))

In [0]:
# Vectorize training and testing features:
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)

**b. Print the term-document matrix**

In [0]:
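# Let's take 2 sample posts (index labels 100 and 1) to display a small document-term matrix:
train_ss = list(X_train[[100, 1]])

In [0]:
# Fit a separate vectorizer for this demo so that the fitted `vectorizer` behind X_train_dtm is not overwritten.
# On scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names().
demo_vectorizer = CountVectorizer(ngram_range=(1, 2))
pd.DataFrame(demo_vectorizer.fit_transform(train_ss).toarray(), columns=demo_vectorizer.get_feature_names())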

Unnamed: 0,10mile,10mile fun,aaldering,aaldering urllink,able,able 10mile,abnormal,abnormal screwed,acceptable,acceptable whats,across,across abnormal,alongside,alongside psychology,also,also hav,although,although enough,although means,america,america cool,another,another year,another years,arrivedechi,atmosphere,atmosphere mayb,attracks,attracks bein,bad,bad bein,barmaid,barmaid every1,behaviour,behaviour acceptable,bein,bein barmaid,bein surrounded,better,better biggest,...,wait things,wana,wana able,wana die,wana im,wana live,wana speak,wana work,want,want im,wanting,wanting die,well,well arrivedechi,whats,whats good,whole,whole year,wish,wish summer,work,work bout,work fox,work italy,work pub,workin,workin whole,worst,worst possible,wot,wot mean,xie,xie urllink,year,year go,year guess,year raise,years,years england,years resolution
0,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,...,1,7,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,4,1,1,1,1,1,1,1,1,1,1,0,0,3,1,1,1,2,1,1
1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0
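
To make the column names above less opaque, here is a simplified pure-Python sketch of what counting unigrams and bigrams amounts to. It is not scikit-learn's implementation: the helper name `ngram_counts` and the two sample documents are hypothetical, and the only tokenization rule mimicked is that CountVectorizer's default token pattern keeps tokens of two or more characters.

```python
from collections import Counter

def ngram_counts(text: str, ngram_range=(1, 2)) -> Counter:
    """Count word n-grams for every n in the inclusive range."""
    # Keep tokens of length >= 2, as the default token pattern does.
    tokens = [t for t in text.split() if len(t) >= 2]
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

docs = ["testing testing", "news article news"]

# Vocabulary = sorted union of all n-grams; one matrix row per document.
vocab = sorted(set().union(*(ngram_counts(d) for d in docs)))
matrix = [[ngram_counts(d)[term] for term in vocab] for d in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row is a document and each column an n-gram, which is why the matrix grows very wide once bigrams are included.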


### **6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)**

**Check how many unique bloggers are present in our data subset:**

In [0]:
blog_df_ss['id'].nunique()

659

**Create a unique data set of bloggers:**

In [0]:
blog_df_ss_unq = blog_df_ss.groupby('id').first().reset_index()

**View the unique data:**

In [0]:
blog_df_ss_unq.head()

Unnamed: 0,id,text,labels
0,23191,twenty something call quarter life cris...,"female,23,Advertising,Taurus"
1,72355,saw show discovery komodo dragon menti...,"male,27,indUnk,Leo"
2,108780,several years ago kyle came idea art exh...,"female,25,indUnk,Leo"
3,183164,mommy ang sakit ng paa ko bakit masakit yu...,"female,27,indUnk,Taurus"
4,299143,really write night instead work prob...,"female,23,Engineering,Cancer"


**Create a dictionary to get the count of every label:**

In [0]:
new_dict = {}

for label in blog_df_ss_unq['labels'].str.split(','):
  for val in label:
    if val in new_dict:
      new_dict[val] += 1
    else:
      new_dict[val] = 1

print(new_dict)    

{'female': 330, '23': 68, 'Advertising': 5, 'Taurus': 52, 'male': 329, '27': 41, 'indUnk': 237, 'Leo': 57, '25': 57, 'Engineering': 11, 'Cancer': 67, 'Capricorn': 47, 'Libra': 48, '45': 5, 'Religion': 7, 'Aries': 67, 'Pisces': 43, '24': 57, 'Scorpio': 53, 'Banking': 6, 'Aquarius': 58, '17': 80, 'Student': 164, 'Gemini': 48, '38': 5, 'Internet': 20, 'Sagittarius': 70, '35': 16, 'Technology': 29, '34': 11, 'Arts': 25, '36': 10, 'Fashion': 2, '16': 71, '26': 50, '14': 47, 'Education': 40, 'Government': 6, 'Virgo': 49, 'BusinessServices': 12, '33': 25, '41': 5, 'Communications-Media': 9, 'Marketing': 5, '13': 16, '15': 58, '47': 2, 'Consulting': 9, 'Non-Profit': 13, '42': 5, '40': 2, '48': 2, 'Automotive': 4, '39': 7, '37': 7, 'Manufacturing': 1, 'LawEnforcement-Security': 3, 'Accounting': 5, 'Telecommunications': 3, '46': 6, '44': 3, 'Science': 7, 'Law': 5, 'Military': 3, 'Biotech': 1, 'RealEstate': 3, '43': 3, 'Sports-Recreation': 5, 'InvestmentBanking': 2, 'HumanResources': 3, 'Museums-

**View the length of new dictionary:**

In [0]:
len(new_dict)

77

**Display the contents of the dictionary in an easily readable format:**

In [0]:
for k, v in new_dict.items():
  print( '%s: %i' % (k, v) )

female: 330
23: 68
Advertising: 5
Taurus: 52
male: 329
27: 41
indUnk: 237
Leo: 57
25: 57
Engineering: 11
Cancer: 67
Capricorn: 47
Libra: 48
45: 5
Religion: 7
Aries: 67
Pisces: 43
24: 57
Scorpio: 53
Banking: 6
Aquarius: 58
17: 80
Student: 164
Gemini: 48
38: 5
Internet: 20
Sagittarius: 70
35: 16
Technology: 29
34: 11
Arts: 25
36: 10
Fashion: 2
16: 71
26: 50
14: 47
Education: 40
Government: 6
Virgo: 49
BusinessServices: 12
33: 25
41: 5
Communications-Media: 9
Marketing: 5
13: 16
15: 58
47: 2
Consulting: 9
Non-Profit: 13
42: 5
40: 2
48: 2
Automotive: 4
39: 7
37: 7
Manufacturing: 1
LawEnforcement-Security: 3
Accounting: 5
Telecommunications: 3
46: 6
44: 3
Science: 7
Law: 5
Military: 3
Biotech: 1
RealEstate: 3
43: 3
Sports-Recreation: 5
InvestmentBanking: 2
HumanResources: 3
Museums-Libraries: 5
Architecture: 2
Chemicals: 1
Construction: 1
Transportation: 1
Publishing: 3
Agriculture: 1
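
The same tally can be written more compactly with `collections.Counter` from the standard library. This is a sketch on hypothetical label strings in the same `gender,age,topic,sign` format, not a re-run on the notebook's data.

```python
from collections import Counter

# Hypothetical label strings, one per blogger.
labels = [
    "male,15,Student,Leo",
    "male,33,InvestmentBanking,Aquarius",
    "female,15,Student,Leo",
]

# Split every label string and count each tag across all bloggers.
label_counts = Counter(tag for row in labels for tag in row.split(","))

# most_common() sorts by count, which makes rare labels easy to spot.
for tag, count in label_counts.most_common():
    print(f"{tag}: {count}")
```

On the notebook's data, the equivalent call would be `Counter(tag for row in blog_df_ss_unq['labels'] for tag in row.split(','))`.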


### **7. Transform the labels - (7.5 points)**

**As we have noticed before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that each prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn**

*   Convert your train and test labels using MultiLabelBinarizer

In [0]:
y_train = [x.split(',') for x in y_train]
y_test  = [x.split(',') for x in y_test]

**MultiLabelBinarizer:**

In [0]:
mb = MultiLabelBinarizer()

y_train = mb.fit_transform(y_train)
y_test  = mb.transform(y_test)

**View y_train after applying MultiLabelBinarizer:**

In [0]:
y_train

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])

**View a single row of y_train after applying MultiLabelBinarizer, to see the encoding clearly:**

In [0]:
y_train[0]

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

**View a single row of y_test after applying MultiLabelBinarizer, to see the encoding clearly:**

In [0]:
y_test[5]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

**View the classes learned by the binarizer:**

In [0]:
mb.classes_

array(['13', '14', '15', '16', '17', '23', '24', '25', '26', '27', '33',
       '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44',
       '45', '46', '47', '48', 'Accounting', 'Advertising', 'Agriculture',
       'Aquarius', 'Architecture', 'Aries', 'Arts', 'Automotive',
       'Banking', 'Biotech', 'BusinessServices', 'Cancer', 'Capricorn',
       'Chemicals', 'Communications-Media', 'Construction', 'Consulting',
       'Education', 'Engineering', 'Fashion', 'Gemini', 'Government',
       'HumanResources', 'Internet', 'InvestmentBanking', 'Law',
       'LawEnforcement-Security', 'Leo', 'Libra', 'Manufacturing',
       'Marketing', 'Military', 'Museums-Libraries', 'Non-Profit',
       'Pisces', 'Publishing', 'RealEstate', 'Religion', 'Sagittarius',
       'Science', 'Scorpio', 'Sports-Recreation', 'Student', 'Taurus',
       'Technology', 'Telecommunications', 'Transportation', 'Virgo',
       'female', 'indUnk', 'male'], dtype=object)
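
Conceptually, the binarizer assigns one column to each distinct tag (in sorted order) and sets a 1 wherever a sample carries that tag. The pure-Python sketch below illustrates the idea only. It is not scikit-learn's implementation, and the helper names `fit_classes`, `to_multi_hot` and `from_multi_hot` are hypothetical.

```python
def fit_classes(label_lists):
    """Collect the sorted set of distinct tags, one column per tag."""
    return sorted({tag for tags in label_lists for tag in tags})

def to_multi_hot(label_lists, classes):
    """Encode each list of tags as a row of 0s and 1s."""
    index = {tag: i for i, tag in enumerate(classes)}
    rows = []
    for tags in label_lists:
        row = [0] * len(classes)
        for tag in tags:
            if tag in index:  # tags not seen when fitting are skipped in this sketch
                row[index[tag]] = 1
        rows.append(row)
    return rows

def from_multi_hot(rows, classes):
    """Decode rows of 0s and 1s back into tuples of tags."""
    return [tuple(c for c, bit in zip(classes, row) if bit) for row in rows]

train = [["male", "17", "Student", "Cancer"],
         ["female", "27", "Religion", "Taurus"]]
classes = fit_classes(train)
encoded = to_multi_hot(train, classes)

print(classes)
print(encoded)
print(from_multi_hot(encoded, classes))
```

Sorting places digits before uppercase and uppercase before lowercase letters, which matches the ordering in the class array above: ages first, then capitalized topics and signs, then `female`, `indUnk` and `male`.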

### **8. Choose a classifier - (5 points)**

**In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (one per tag) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training may take some time because the number of classifiers is large.**

*   Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label

In [0]:
# Logistic regression & OneVsRest Classifier:

logreg = LogisticRegression()
ovr_clf = OneVsRestClassifier(logreg)

**Fit the train data:**

In [0]:
ovr_clf.fit(X_train_dtm, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

**Predict on test data:**

In [0]:
y_pred = ovr_clf.predict(X_test_dtm)

### **9. Fit the classifier, make predictions and get the accuracy (5 points)**

**a. Print the following**

*   Accuracy score
*   F1 score
*   Average precision score
*   Average recall score


**Let's take a quick look on the different types of averaging:**

**'micro':**
Calculate metrics globally by counting the total true positives, false negatives and false positives.

**'macro':**
Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

**'weighted':**
Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
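
A small worked example may help separate the three averages. The sketch below computes recall by hand for two hypothetical labels with made-up counts (one frequent, one rare). The numbers are illustrative assumptions, not results from this notebook.

```python
# Hypothetical per-label counts: (true positives, false positives, false negatives)
counts = {
    "male":      (80, 20, 20),  # frequent label, support = 100
    "Chemicals": (0, 0, 10),    # rare label that is never predicted, support = 10
}

def recall(tp, fn):
    """Recall = TP / (TP + FN), defined as 0 when there are no true instances."""
    return tp / (tp + fn) if (tp + fn) else 0.0

per_label = {name: recall(tp, fn) for name, (tp, fp, fn) in counts.items()}
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}

# micro: pool the counts over all labels, then compute a single score
micro = recall(sum(tp for tp, _, _ in counts.values()),
               sum(fn for _, _, fn in counts.values()))

# macro: unweighted mean of the per-label scores
macro = sum(per_label.values()) / len(per_label)

# weighted: mean of the per-label scores, weighted by support
weighted = sum(per_label[n] * support[n] for n in counts) / sum(support.values())

print(f"micro={micro:.3f} macro={macro:.3f} weighted={weighted:.3f}")
```

The rare label pulls the macro average down to 0.4, while micro stays near 0.73. For recall specifically, the support-weighted average works out to the same value as the micro average, which is consistent with the identical micro and weighted recall reported in this notebook.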

In [0]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.16      0.27        69
           1       0.76      0.23      0.35       208
           2       0.69      0.15      0.25       241
           3       0.77      0.24      0.37       362
           4       0.76      0.28      0.41       533
           5       0.72      0.22      0.33       570
           6       0.86      0.34      0.49       523
           7       0.71      0.11      0.19       247
           8       0.85      0.15      0.26       292
           9       0.75      0.30      0.43       560
          10       0.71      0.14      0.23       184
          11       0.96      0.58      0.72       151
          12       0.79      0.34      0.48       499
          13       0.92      0.50      0.65       332
          14       0.50      0.03      0.06        29
          15       0.00      0.00      0.00        19
          16       1.00      0.29      0.44        21
          17       1.00    

**Let's also print the scores separately for a better understanding:**

In [0]:
# Accuracy:
print("\033[1mAccuracy score :\033[0m ", metrics.accuracy_score(y_test, y_pred))

# F1 score:
print("\n\033[1mF1 score:\033[0m ")
print("F1 score micro        : ", metrics.f1_score(y_test, y_pred,average='micro').round(2))
print("F1 score macro        : ", metrics.f1_score(y_test, y_pred,average='macro').round(2))
print("F1 score weighted     : ", metrics.f1_score(y_test, y_pred,average='weighted').round(2))

# Precision:
print("\n\033[1mPrecision:\033[0m ")
print("Precision score micro : ", metrics.precision_score(y_test, y_pred,average='micro').round(2))
print("Precision score macro : ", metrics.precision_score(y_test, y_pred,average='macro').round(2))
print("Precision weighted    : ", metrics.precision_score(y_test, y_pred,average='weighted').round(2))

# Recall:
print("\n\033[1mRecall:\033[0m ")
print("Recall score micro    : ", metrics.recall_score(y_test, y_pred,average='micro').round(2))
print("Recall score macro    : ", metrics.recall_score(y_test, y_pred,average='macro').round(2))
print("Recall score weighted : ", metrics.recall_score(y_test, y_pred,average='weighted').round(2))

Accuracy score :  0.158

F1 score: 
F1 score micro        :  0.54
F1 score macro        :  0.24
F1 score weighted     :  0.49

Precision: 
Precision score micro :  0.76
Precision score macro :  0.57
Precision weighted    :  0.75

Recall: 
Recall score micro    :  0.41
Recall score macro    :  0.17
Recall score weighted :  0.41


- As seen, the labels in our subset are imbalanced. The most frequent class (76, 'male') has a count of 2721 and the least frequent class (39, 'Chemicals') has a count of 0.

- For the most frequent class the scores are good, and for the least frequent class the scores are 0, as expected.

**How to choose between micro and macro averaging:**

- The choice between macro and micro averaging depends on the problem statement.

- If the majority classes matter most, micro-averaging is the usual choice. It suits a classifier whose goal is simply to maximize its hits and minimize its misses across all predictions.

- If the minority classes matter most, a macro-averaged score is the better choice. It gives every class equal weight, regardless of how often the class occurs.

- In many applications the latter is preferable, for example when diagnosing a disease that appears in 1% of the population.

- Micro-averaging is prevalent because most tasks aim simply to maximize the number of correct predictions, with no class more important than the others.

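**Scores are low partly because we used only a small portion of the original data (25,000 of 681,284 posts) to build the model. In addition, for multi-label input accuracy_score reports subset accuracy, which counts a sample as correct only when all of its labels match exactly.**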
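
To see why the accuracy figure is so much lower than the other scores, it helps to compare exact-match (subset) accuracy with per-label accuracy. For multi-label indicator input, scikit-learn's `accuracy_score` computes the former. The sketch below uses hypothetical toy arrays and hand-written helpers (`subset_accuracy`, `per_label_accuracy`), not the notebook's predictions.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows where every label matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_label_accuracy(y_true, y_pred):
    """Fraction of individual 0/1 cells that match (1 - Hamming loss)."""
    cells = [a == b for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    return sum(cells) / len(cells)

# Hypothetical toy data: 3 samples, 4 labels each.
y_true = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 0]]

print("subset accuracy   :", round(subset_accuracy(y_true, y_pred), 3))
print("per-label accuracy:", round(per_label_accuracy(y_true, y_pred), 3))
```

Here only one of three rows matches exactly even though 10 of 12 individual labels are right. With four labels per post (gender, age, topic and sign), a single missed tag is enough to count the whole post as wrong.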

### **10. Print true label and predicted label for any five examples (7.5 points)**

In [0]:
y_pred_inv = mb.inverse_transform(y_pred)
y_test_inv = mb.inverse_transform(y_test)

for x in range(20,25):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test.iloc[x],
        ','.join(y_test_inv[x]),
        ','.join(y_pred_inv[x])
    ))

Title:	 response  fifer  earlier comment    wish   happy birthday     merely   wee hours  wednesday morning     blog     refer  earlier post      heeeeeeeee     oh  yeah   happy belated birthday  sarah      however     forget   lol  kidding
True labels:	17,Aquarius,female,indUnk
Predicted labels:	female,indUnk


Title:	urllink stereolab  singer dies  bike accident    urllink click    article    wow      big fan  stereolab  music  high school    also makes   lot  cautious  riding  bicycle  particularly   get  campus     cross  busy road  people like  speed
True labels:	24,Aries,indUnk,male
Predicted labels:	Aries,indUnk,male


Title:	   things     read      scan  headline  move     urllink     them
True labels:	36,Aries,Fashion,male
Predicted labels:	male


Title:	distractions    need  love     strong  powerful force       loved   love    experience  fulfillment   relationships falter  around    hard   hope   yet  need remains    nature yearns   things  yet   search   object   love    l

## **Overall Summary:**

- Imported the necessary packages
- Extracted data from the zip file
- Loaded the dataset into a dataframe and used only the first 25,000 records for further processing
- Preprocessed the data: removed unwanted characters and spaces, converted text to lower case and removed stop words
- Merged multiple columns (gender, age, topic & sign) into a single column and used that as the label. Dropped the original 4 columns from the dataframe
- Separated features and labels and split the data into train & test sets in an 80:20 ratio
- Created a Bag of Words using count vectorizer and created the document-term matrix
- Created a dictionary to hold the labels and how often they occur across the unique bloggers, just to get an idea of the count of each label
- Transformed the labels into 0s & 1s using MultiLabelBinarizer
- Used a logistic regression classifier and wrapped it in OneVsRestClassifier to train it on every label
- Fit the model on the train data and predicted on the test data
- Printed accuracy, precision, recall and f1-score (micro, macro & weighted) to output
- Also printed the true and predicted labels for 5 test samples (positions 20 to 24).

### <center> **End of Statistical NLP_R8_Project1_Blog authorship** </center>