# About the Dataset
Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from
blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person.
Each blog is presented as a separate file, the name of which indicates a blogger id# and the
blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and
age but for many, industry and/or sign is marked as unknown.)


All bloggers included in the corpus fall into one of three age groups:
8240 "10s" blogs (ages 13-17),
8086 "20s" blogs (ages 23-27),
2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting
has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post and links within a post are denoted by the label urllink.
Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus
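
The figures quoted in the description can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers stated above; the word total is taken as exactly 140 million, which is an assumption, since the description says "over 140 million".

```python
# Blogger counts per age group, as quoted in the corpus description
bloggers_by_group = {"10s": 8240, "20s": 8086, "30s": 2994}
total_bloggers = sum(bloggers_by_group.values())

total_posts = 681_288
total_words = 140_000_000  # assumption: "over 140 million" treated as a lower bound

posts_per_blogger = total_posts / total_bloggers
words_per_blogger = total_words / total_bloggers

print(total_bloggers)
print(round(posts_per_blogger, 1))
print(round(words_per_blogger))
```

The three age groups add up to the 19,320 bloggers mentioned in the description, and the per-person averages land close to the quoted 35 posts and 7250 words.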

In [78]:
#Importing all libraries that are being used in the solution
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
from collections import Counter
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score,classification_report
from sklearn.multiclass import OneVsRestClassifier

# Reading DataSet

In [2]:
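# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on newer versions pass on_bad_lines='skip' instead
df_temp=pd.read_csv('C:/Users/AbhishekDhingra/Downloads/blog-authorship-corpus/blogtext.csv',error_bad_lines=False)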

# The dataset is too large to train on in full with the available processing power, so only the first 10,000 rows are used for training and testing

In [3]:
df=df_temp.iloc[0:10000].copy()  # copy so later column assignments do not write into a slice of df_temp
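
If memory is the concern, one possible alternative is to stop parsing early instead of loading the whole file and slicing afterwards. The sketch below uses the `nrows` parameter of `pd.read_csv` on a small in-memory CSV; `csv_text` and `df_small` are hypothetical stand-ins, not the real `blogtext.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for blogtext.csv with the same column names
csv_text = "id,gender,age,topic,sign,date,text\n" + "".join(
    f'{i},male,15,Student,Leo,"14,May,2004",post number {i}\n' for i in range(50)
)

# nrows stops parsing after the requested number of data rows
df_small = pd.read_csv(io.StringIO(csv_text), nrows=10)
print(df_small.shape)
```

With the real file, the same idea would be `nrows=10000`, which yields the same first 10,000 rows as the slice above when no malformed lines are skipped.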

# Displaying the first 7 rows of the data

In [4]:
df.head(7)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...


# Performing EDA, covering the following
1) Get the shape
2) Check whether the dataset contains any NULL or NA values

In [5]:
df.shape

(10000, 7)

In [6]:
df['id'].nunique()

214

In [7]:
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [8]:
df.isna().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
id        10000 non-null int64
gender    10000 non-null object
age       10000 non-null int64
topic     10000 non-null object
sign      10000 non-null object
date      10000 non-null object
text      10000 non-null object
dtypes: int64(2), object(5)
memory usage: 547.0+ KB


# Removing special characters from the 'text' feature using the 're' library

In [10]:
df['text']=df['text'].apply(lambda x : re.sub(r"[@,.^$*?/\n\t<>&:()+\-!']",'',x))
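
To see what this substitution does to a single string, here is a small standalone sketch. `SPECIAL_CHARS`, `strip_special` and the sample sentence are illustrative names and data, not part of the notebook.

```python
import re

# Same set of characters as the cell above, written as a raw string
SPECIAL_CHARS = re.compile(r"[@,.^$*?/\n\t<>&:()+\-!']")

def strip_special(text):
    """Delete every character matched by SPECIAL_CHARS."""
    return SPECIAL_CHARS.sub("", text)

sample = "Thanks to Yahoo!'s Toolbar (v2.0), I can't complain :-)"
cleaned = strip_special(sample)
print(repr(cleaned))
```

Because matched characters are deleted rather than replaced by a space, words joined only by punctuation get fused, which is why tokens such as `popupswhich` show up in the cleaned text later on.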

In [11]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",Info has been found 100 pages and ...
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members Drewes...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoos Toolbar I can no...


In [12]:
for col in df.columns:
    temp=df[col]
    if temp.dtype == object:
        df[col]=df[col].apply(lambda x : x.lower())
        df[col]=df[col].apply(lambda x : x.strip())
        

In [13]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,student,leo,"14,may,2004",info has been found 100 pages and 45 mb of pd...
1,2059027,male,15,student,leo,"13,may,2004",these are the team members drewes van der la...
2,2059027,male,15,student,leo,"12,may,2004",in het kader van kernfusie op aarde maak je e...
3,2059027,male,15,student,leo,"12,may,2004",testing testing
4,3581210,male,33,investmentbanking,aquarius,"11,june,2004",thanks to yahoos toolbar i can now capture the...


# Creating a new dataframe named "df_text" that holds only the 'text' feature from df. It is assigned back to the 'text' feature of the original dataset later.
The purpose of this step is to remove stopwords from the 'text' feature

In [14]:
frame={'text' : df['text']}
df_text=pd.DataFrame(frame)

In [15]:
df_text.head()

Unnamed: 0,text
0,info has been found 100 pages and 45 mb of pd...
1,these are the team members drewes van der la...
2,in het kader van kernfusie op aarde maak je e...
3,testing testing
4,thanks to yahoos toolbar i can now capture the...


In [16]:
df_text['text'][0]

'info has been found  100 pages and 45 mb of pdf files now i have to wait untill our team leader has processed it and learns html'

# The logic below removes the stopwords from the 'text' feature in the new dataframe. This turns each full sentence into a list of words; a later step joins the list of words back into a sentence

In [17]:
word_tokens=[]
stop_words=set(stopwords.words('english'))
for i in range(0,df_text.shape[0]):
    fil=[]
    word_tokens=word_tokenize(df_text['text'][i])
    for w in word_tokens:
        if w not in stop_words:
            fil.append(w)
            
    df_text['text'][i]=fil
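
The same idea can be sketched without an explicit row loop by applying a helper function to the column. To keep the example self-contained it makes two simplifying assumptions: `STOP_WORDS` is a tiny hand-written set rather than nltk's English list, and text is split on whitespace instead of going through `word_tokenize`.

```python
import pandas as pd

# Tiny illustrative stop word set (assumption); the notebook uses nltk's English list
STOP_WORDS = {"has", "been", "and", "of", "now", "i", "have", "to", "it", "our"}

def remove_stop_words(text):
    """Split on whitespace, drop stop words, and join back into one string."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

posts = pd.Series([
    "info has been found 100 pages and 45 mb of pdf files",
    "testing testing",
])
cleaned_posts = posts.apply(remove_stop_words)
print(cleaned_posts.tolist())
```

This variant filters and re-joins in one pass, so no intermediate column of word lists is needed.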

# Assigning the list of words to the original dataset

In [18]:
df['text']=df_text['text']

In [19]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,student,leo,"14,may,2004","[info, found, 100, pages, 45, mb, pdf, files, ..."
1,2059027,male,15,student,leo,"13,may,2004","[team, members, drewes, van, der, laag, urllin..."
2,2059027,male,15,student,leo,"12,may,2004","[het, kader, van, kernfusie, op, aarde, maak, ..."
3,2059027,male,15,student,leo,"12,may,2004","[testing, testing]"
4,3581210,male,33,investmentbanking,aquarius,"11,june,2004","[thanks, yahoos, toolbar, capture, urls, popup..."


# Converting the list of words back to a sentence in the original dataset, bringing the data back to its original form

In [21]:
for i in range(0,df.shape[0]):
    s=' '
    df['text'][i]=s.join(df['text'][i])
         

In [22]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,student,leo,"14,may,2004",info found 100 pages 45 mb pdf files wait unti...
1,2059027,male,15,student,leo,"13,may,2004",team members drewes van der laag urllink mail ...
2,2059027,male,15,student,leo,"12,may,2004",het kader van kernfusie op aarde maak je eigen...
3,2059027,male,15,student,leo,"12,may,2004",testing testing
4,3581210,male,33,investmentbanking,aquarius,"11,june,2004",thanks yahoos toolbar capture urls popupswhich...


# Dropping the 'date' column from the original dataset

In [23]:
df.drop('date',inplace=True,axis=1)

# Creating a new feature named labels in the original dataset. It will contain the list built from the following four features
1) Gender
2) Age
3) Topic
4) Sign

In [24]:
df['labels']=' '

In [25]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,text,labels
0,2059027,male,15,student,leo,info found 100 pages 45 mb pdf files wait unti...,
1,2059027,male,15,student,leo,team members drewes van der laag urllink mail ...,
2,2059027,male,15,student,leo,het kader van kernfusie op aarde maak je eigen...,
3,2059027,male,15,student,leo,testing testing,
4,3581210,male,33,investmentbanking,aquarius,thanks yahoos toolbar capture urls popupswhich...,


# Age is numeric, so it is converted to str. This is required when using MultiLabelBinarizer

In [26]:
df['age']=df['age'].astype('str')

# The loop below builds, for each row, a list of the "Gender", "Age", "Topic" and "Sign" values and assigns it to the new labels feature

In [28]:
for i in range(0,df.shape[0]):
    new_label=[]
    new_label.append(df['gender'][i])
    new_label.append(df['age'][i])
    new_label.append(df['topic'][i])
    new_label.append(df['sign'][i])
    df['labels'][i]=new_label
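
A loop-free way to build the same kind of label list is sketched below on a toy frame. The `toy` dataframe and `label_cols` are hypothetical; only the four column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's dataframe
toy = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 24],
    "topic": ["student", "indunk"],
    "sign": ["leo", "sagittarius"],
})
toy["age"] = toy["age"].astype(str)  # labels must all be strings

label_cols = ["gender", "age", "topic", "sign"]
# One list of four values per row, kept as a single object column
toy["labels"] = pd.Series(toy[label_cols].values.tolist(), index=toy.index)
print(toy["labels"].tolist())
```

Wrapping the nested list in a `pd.Series` with the frame's index keeps each row's list in a single cell.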

In [29]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,text,labels
0,2059027,male,15,student,leo,info found 100 pages 45 mb pdf files wait unti...,"[male, 15, student, leo]"
1,2059027,male,15,student,leo,team members drewes van der laag urllink mail ...,"[male, 15, student, leo]"
2,2059027,male,15,student,leo,het kader van kernfusie op aarde maak je eigen...,"[male, 15, student, leo]"
3,2059027,male,15,student,leo,testing testing,"[male, 15, student, leo]"
4,3581210,male,33,investmentbanking,aquarius,thanks yahoos toolbar capture urls popupswhich...,"[male, 33, investmentbanking, aquarius]"


In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
id        10000 non-null int64
gender    10000 non-null object
age       10000 non-null object
topic     10000 non-null object
sign      10000 non-null object
text      10000 non-null object
labels    10000 non-null object
dtypes: int64(1), object(6)
memory usage: 547.0+ KB


# Dropping the following features from the dataset
1) ID

2) Gender

3) Age

4) Topic

5) Sign

In [31]:
df.drop(['id','gender','age','topic','sign'],inplace=True,axis=1)

In [32]:
#for i in range(0,df.shape[0]):
#    new_label=[]
#    new_label.append([df['gender'][i],df['age'][i],df['topic'][i],df['sign'][i]])
#    df['labels'][i]=list(new_label)

# The original dataset is now left with only 2 features.

1) text

2) labels

In [33]:
df.head()

Unnamed: 0,text,labels
0,info found 100 pages 45 mb pdf files wait unti...,"[male, 15, student, leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, student, leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, student, leo]"
3,testing testing,"[male, 15, student, leo]"
4,thanks yahoos toolbar capture urls popupswhich...,"[male, 33, investmentbanking, aquarius]"


# Separating the dataset into independent (X) and dependent (y) features

In [34]:
X=df['text']
y=df['labels']


# Splitting dataset with default values

In [35]:
X_train,X_test,y_train,y_test=train_test_split(X,y)
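
With default values, `train_test_split` holds out 25% of the rows for testing, which matches the 7500/2500 split seen in the matrices further down. The split is random, so results differ between runs unless a seed is fixed. The sketch below adds `random_state=42`, which is an assumption and not something the notebook sets.

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y
texts = [f"post {i}" for i in range(100)]
labels = [["male", "15"] if i % 2 else ["female", "24"] for i in range(100)]

# Default test_size is 0.25; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(texts, labels, random_state=42)

print(len(X_tr), len(X_te))
```

Fixing the seed makes the reported scores comparable from one run to the next.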

# Using CountVectorizer to create tokens from the data. The requirement is to use ngram range (1,2)

In [36]:
vect=CountVectorizer(ngram_range=(1,2))

# Creating the document-term matrix for the train and test data

In [37]:
X_train_dtm=vect.fit_transform(X_train)

In [38]:
X_test_dtm=vect.transform(X_test)
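
A tiny example makes the effect of `ngram_range=(1,2)` and of the fit/transform split easier to see. The documents and the names `vect_demo`, `train_dtm` and `test_dtm` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["testing testing", "team members testing"]
test_docs = ["new team testing"]

vect_demo = CountVectorizer(ngram_range=(1, 2))
train_dtm = vect_demo.fit_transform(train_docs)  # learns the vocabulary
test_dtm = vect_demo.transform(test_docs)        # reuses it, ignoring unseen n-grams

vocab = sorted(vect_demo.vocabulary_)
print(vocab)
print(train_dtm.shape, test_dtm.shape)
```

Both matrices share the same columns because the vocabulary is learned from the training documents only; words and bigrams that appear only in the test document (`new`, `new team`, `team testing`) are dropped.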

In [39]:
type(df["labels"][0])

list

# The logic below counts how often each unique label value occurs, using the collections library

In [40]:
new_dict=dict()
gender=[]
age=[]
occ=[]
sign=[]

In [41]:
for item in df['labels']:
    i=0
    for value in item:
        if i==0:
            gender.append(value)
        if i==1:
            age.append(value)
        if i==2:
            occ.append(value)
        if i==3:
            sign.append(value)
        i+=1
            
        

In [42]:
dict_age=Counter(age)
dict_gender=Counter(gender)
dict_occ=Counter(occ)
dict_sign=Counter(sign)

In [43]:
def merge_two_dicts(a, b, c, d):
    # Despite the name, this merges four Counters; their keys do not overlap here
    z = a.copy()   # start with a's keys and counts
    z.update(b)    # Counter.update adds b's counts to z in place and returns None
    z.update(c)
    z.update(d)
    return z
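
As a more compact sketch of the same frequency count, all label lists can be flattened and passed to a single `Counter`. The `label_lists` data is a toy stand-in for `df['labels']`.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for df['labels']
label_lists = [
    ["male", "15", "student", "leo"],
    ["male", "15", "student", "leo"],
    ["female", "24", "indunk", "sagittarius"],
]

# Flatten the per-row lists and count every label value in one pass
label_counts = Counter(chain.from_iterable(label_lists))
print(label_counts)
```

This gives the same totals as counting each field separately, as long as no value appears in more than one field.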

# The frequencies of the unique labels are printed below. They show that the data is highly imbalanced, which is why the micro average is more suitable when calculating recall, precision and F1 score

In [64]:
merge_two_dicts(dict_age, dict_gender, dict_occ, dict_sign)

Counter({'15': 602,
         '33': 136,
         '14': 212,
         '25': 386,
         '17': 1185,
         '23': 253,
         '37': 33,
         '26': 234,
         '24': 655,
         '27': 1054,
         '45': 16,
         '34': 553,
         '41': 20,
         '44': 3,
         '16': 440,
         '39': 79,
         '35': 2315,
         '36': 1708,
         '46': 7,
         '42': 14,
         '13': 42,
         '38': 46,
         '43': 6,
         '40': 1,
         'male': 5916,
         'female': 4084,
         'student': 1137,
         'investmentbanking': 70,
         'indunk': 3287,
         'non-profit': 71,
         'banking': 16,
         'education': 270,
         'engineering': 127,
         'science': 63,
         'communications-media': 99,
         'businessservices': 91,
         'sports-recreation': 80,
         'arts': 45,
         'internet': 118,
         'museums-libraries': 17,
         'accounting': 4,
         'technology': 2654,
         'law': 11,
       

In [45]:
X_train.sample(5)

8282    today sucked completely1st theo broke ya sucke...
3048    check guy draws pictures represent sentences p...
575     move live must endure know will… believe major...
5682    ill admit im statistical genius someone wan na...
128     went jurong east lunch today limroytheowilsonj...
Name: text, dtype: object

In [46]:
X_test.sample(5)

5395    oh yesnbsp ; forgot human beings human doings ...
3258                                              yay eva
7325    never going work duf fer liz phair nope exile ...
2414    dune raiders references sandstorm divine inter...
1450    hmm refresh site new messages come okay time s...
Name: text, dtype: object

# The document-term matrices built from X_train and X_test are sparse. The vocabulary holds 528134 features (unigrams and bigrams)

In [47]:
X_train_dtm

<7500x528134 sparse matrix of type '<class 'numpy.int64'>'
	with 1106572 stored elements in Compressed Sparse Row format>

In [48]:
X_test_dtm

<2500x528134 sparse matrix of type '<class 'numpy.int64'>'
	with 222814 stored elements in Compressed Sparse Row format>

In [49]:
y_train

9205            [female, 24, indunk, sagittarius]
1527                [male, 35, technology, aries]
996                  [female, 25, indunk, taurus]
3292                [male, 35, technology, aries]
9609            [female, 27, marketing, aquarius]
3546                [male, 35, technology, aries]
9401                   [male, 16, indunk, cancer]
4954                [female, 17, indunk, scorpio]
7094                   [male, 36, fashion, aries]
4766                 [female, 16, student, libra]
2728                [male, 35, technology, aries]
3073                [male, 35, technology, aries]
9392                   [male, 16, indunk, cancer]
9737                  [male, 17, student, gemini]
1441                [male, 35, technology, aries]
9623            [female, 27, marketing, aquarius]
4076                  [female, 25, indunk, libra]
653                [male, 24, engineering, libra]
5004                [female, 17, indunk, scorpio]
1296                 [male, 39, education, virgo]


# Implementing MultiLabelBinarizer to create binary labels from the list in the labels dataset

In [50]:
mlb=MultiLabelBinarizer()

In [51]:
y_train_mlb=mlb.fit_transform(y_train)

In [52]:
y_test_mlb=mlb.transform(y_test)
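
What `MultiLabelBinarizer` produces is easiest to see on two or three rows. The label lists and the names `mlb_demo`, `Y_tr` and `Y_te` below are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y_tr_demo = [["male", "15", "student", "leo"], ["female", "24", "indunk", "libra"]]
y_te_demo = [["male", "24", "student", "libra"]]

mlb_demo = MultiLabelBinarizer()
Y_tr = mlb_demo.fit_transform(y_tr_demo)  # learns the sorted set of classes
Y_te = mlb_demo.transform(y_te_demo)      # one 0/1 column per learned class

print(list(mlb_demo.classes_))
print(Y_te.tolist())
print(mlb_demo.inverse_transform(Y_te))
```

Each row becomes a 0/1 vector with one position per class, and `inverse_transform` maps such a vector back to a tuple of label names in class order.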

# Below are the unique labels in the labels feature. There are 64 unique labels in this 10000-row subset; the count may change with a smaller or larger subset

In [53]:
mlb.classes_

array(['13', '14', '15', '16', '17', '23', '24', '25', '26', '27', '33',
       '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44',
       '45', '46', 'accounting', 'aquarius', 'aries', 'arts',
       'automotive', 'banking', 'businessservices', 'cancer', 'capricorn',
       'communications-media', 'consulting', 'education', 'engineering',
       'fashion', 'female', 'gemini', 'humanresources', 'indunk',
       'internet', 'investmentbanking', 'law', 'lawenforcement-security',
       'leo', 'libra', 'male', 'marketing', 'museums-libraries',
       'non-profit', 'pisces', 'publishing', 'religion', 'sagittarius',
       'science', 'scorpio', 'sports-recreation', 'student', 'taurus',
       'technology', 'telecommunications', 'virgo'], dtype=object)

In [54]:
y_test.head()

1153      [female, 27, education, gemini]
4390    [female, 34, indunk, sagittarius]
4657           [female, 25, student, leo]
8616    [male, 17, technology, capricorn]
8439         [female, 23, marketing, leo]
Name: labels, dtype: object

In [55]:
y_test_mlb.shape

(2500, 64)

# Creating the model: a logistic regression wrapped in OneVsRestClassifier, which fits one binary classifier per label

In [57]:
LogReg_pipeline=Pipeline([('clf',OneVsRestClassifier(LogisticRegression(solver='lbfgs')))])

In [58]:
LogReg_pipeline.fit(X_train_dtm,y_train_mlb)

Pipeline(memory=None,
         steps=[('clf',
                 OneVsRestClassifier(estimator=LogisticRegression(C=1.0,
                                                                  class_weight=None,
                                                                  dual=False,
                                                                  fit_intercept=True,
                                                                  intercept_scaling=1,
                                                                  l1_ratio=None,
                                                                  max_iter=100,
                                                                  multi_class='auto',
                                                                  n_jobs=None,
                                                                  penalty='l2',
                                                                  random_state=None,
                                                      

In [59]:
prediction = LogReg_pipeline.predict(X_test_dtm)
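
For comparison, the vectorizer can also be placed inside the pipeline so that raw text goes in directly. This is a sketch on four made-up documents, not the notebook's setup; `docs`, `tags`, `mlb_toy`, `toy_pipeline` and `max_iter=1000` are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Four made-up documents with two labels each
docs = ["homework exam school", "school exam teacher",
        "server code deploy", "code bug server"]
tags = [["student", "15"], ["student", "15"],
        ["technology", "35"], ["technology", "35"]]

mlb_toy = MultiLabelBinarizer()
Y = mlb_toy.fit_transform(tags)

# The vectorizer and the one-vs-rest classifier are fitted together
toy_pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),
    ("clf", OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))),
])
toy_pipeline.fit(docs, Y)

pred = toy_pipeline.predict(["exam school homework"])
print(mlb_toy.inverse_transform(pred))
```

Keeping the vectorizer in the pipeline means `fit` and `predict` accept plain strings, and the vocabulary is always learned from the training documents only.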

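# Accuracy on the test data. For multilabel targets, accuracy_score is the subset accuracy: a row counts as correct only if every one of its labels is predicted correctly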

In [60]:
accuracy_score(y_test_mlb,prediction)

0.3056
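
The score above looks low partly because of how `accuracy_score` treats multilabel input. The toy arrays below are made up to show the difference between exact-match accuracy and the share of individual label cells that are right.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([[1, 0, 1, 1],
                   [0, 1, 1, 0]])
y_pred = np.array([[1, 0, 1, 0],   # three of four labels right, still counted as a miss
                   [0, 1, 1, 0]])  # exact match

subset_acc = accuracy_score(y_true, y_pred)  # fraction of rows matched exactly
per_label_acc = (y_true == y_pred).mean()    # fraction of individual cells matched

print(subset_acc, per_label_acc)
```

One wrong label out of four is enough to mark the whole row as incorrect, so subset accuracy is a strict measure for a four-part label like this one.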

# Converting the predicted binary rows back to label names for the first 5 test examples

In [62]:
mlb.inverse_transform(prediction[0:5,])

[('17', 'female', 'indunk', 'scorpio'),
 ('34', 'female', 'indunk', 'sagittarius'),
 ('male',),
 ('male',),
 ('female', 'student')]

In [63]:
y_test.head(5)

1153      [female, 27, education, gemini]
4390    [female, 34, indunk, sagittarius]
4657           [female, 25, student, leo]
8616    [male, 17, technology, capricorn]
8439         [female, 23, marketing, leo]
Name: labels, dtype: object

# Printing micro- and macro-averaged Recall, Precision and F1 Score on the test data

In [67]:
recall_score(y_test_mlb,prediction,average='micro')

0.5256

In [68]:
recall_score(y_test_mlb,prediction,average='macro')

0.17962550018084084

In [69]:
precision_score(y_test_mlb,prediction,average='micro')

0.7849462365591398

In [70]:
precision_score(y_test_mlb,prediction,average='macro')

0.5048610292791127

In [71]:
f1_score(y_test_mlb,prediction,average='micro')

0.629611883085769

In [72]:
f1_score(y_test_mlb,prediction,average='macro')

0.23867343582537381
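
The gap between the micro and macro numbers above comes from how each average weights rare labels. A minimal made-up example with one frequent and one rare label shows the effect.

```python
import numpy as np
from sklearn.metrics import recall_score

# Column 0 is a frequent label, column 1 a rare one
y_true = np.array([[1, 0], [1, 0], [1, 0], [1, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])  # the rare label is never predicted

micro = recall_score(y_true, y_pred, average="micro")  # pools all label decisions
macro = recall_score(y_true, y_pred, average="macro")  # averages per-label recall

print(micro, macro)
```

Micro averaging pools every label decision, so the frequent label dominates (4 of 5 positives found), while macro averaging gives the missed rare label the same weight as the frequent one and drops to 0.5.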

In [81]:
print(classification_report(y_test_mlb,prediction))

              precision    recall  f1-score   support

           0       1.00      0.10      0.18        10
           1       0.50      0.04      0.07        57
           2       0.80      0.22      0.35       161
           3       0.84      0.15      0.25       108
           4       0.73      0.29      0.42       286
           5       0.25      0.01      0.03        71
           6       0.77      0.13      0.22       155
           7       0.62      0.09      0.15       115
           8       0.36      0.07      0.12        55
           9       0.74      0.34      0.46       256
          10       1.00      0.28      0.44        32
          11       0.99      0.66      0.79       131
          12       0.74      0.63      0.68       565
          13       0.96      0.55      0.70       436
          14       0.00      0.00      0.00         7
          15       1.00      0.12      0.22        16
          16       0.00      0.00      0.00        24
          17       0.00    

# Printing the predicted labels and the true labels for test examples 10 to 14

In [76]:
mlb.inverse_transform(prediction[10:15])

[('male',),
 ('35', 'aries', 'male', 'technology'),
 ('35', 'aries', 'male', 'technology'),
 ('36', 'aries', 'fashion', 'male'),
 ('17', 'female', 'indunk', 'scorpio')]

In [77]:
y_test[10:15]

9018       [male, 15, student, virgo]
6790       [male, 36, fashion, aries]
5253    [female, 17, indunk, scorpio]
6801       [male, 36, fashion, aries]
5326    [female, 17, indunk, scorpio]
Name: labels, dtype: object