DOMAIN: Digital content management

CONTEXT: Classification is probably the most common task you will deal with in practice. Text in the form of blogs, posts, articles, etc.
is written every second. It is a challenge to predict information about a writer without knowing anything about them. We are going to create a
classifier that predicts multiple features of the author of a given text. We have designed it as a multi-label classification problem.

DATA DESCRIPTION: Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of
19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or
approximately 35 posts and 7,250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger ID and
the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age, but for many, industry and/or sign is
marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:
• 8240 "10s" blogs (ages 13-17),
• 8086 "20s" blogs (ages 23-27), and
• 2994 "30s" blogs (ages 33-47).

For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of
common English words. All formatting has been stripped, with two exceptions: individual posts within a single blog are separated by the
date of the following post, and links within a post are denoted by the label urlLink.
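
The age groups listed above leave gaps (18-22 and 28-32), which matters when turning the raw `age` column into a group label. A minimal sketch of that mapping, assuming the boundaries quoted in this description; the function name `age_group` is ours and is not part of the corpus tooling:

```python
def age_group(age: int) -> str:
    """Map a blogger's age to the corpus age-group label."""
    if 13 <= age <= 17:
        return "10s"
    if 23 <= age <= 27:
        return "20s"
    if 33 <= age <= 47:
        return "30s"
    # Ages in the gaps (18-22, 28-32) are not part of the corpus.
    raise ValueError(f"age {age} is outside the corpus age groups")


# Blog counts per group, as listed above.
blogs_per_group = {"10s": 8240, "20s": 8086, "30s": 2994}
total_blogs = sum(blogs_per_group.values())
print(total_blogs)  # 19320, matching the number of bloggers in the corpus
```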

PROJECT OBJECTIVE: To build an NLP classifier that uses the input text to determine the label(s) of a blog. Specific to this case
study, you can consider the text of the blog (the ‘text’ feature) as the independent variable and ‘topic’ as the dependent variable.
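
The context describes a multi-label problem, while the objective narrows the target to ‘topic’ alone. For reference, here is a hedged sketch of how several author attributes could be encoded as one multi-label target with scikit-learn's `MultiLabelBinarizer`. The two toy rows reuse values visible in the data preview below; this encoding is an illustration and is not used in the rest of the notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows shaped like the corpus columns (illustrative values only).
rows = [
    {"gender": "male", "age": 15, "topic": "Student", "sign": "Leo"},
    {"gender": "female", "age": 23, "topic": "Education", "sign": "Virgo"},
]

# One label set per post; age is cast to str so all labels share a type.
label_sets = [[r["gender"], str(r["age"]), r["topic"], r["sign"]] for r in rows]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)  # shape: (n_posts, n_distinct_labels)

print(mlb.classes_)
print(Y)
```

Each row of `Y` has one column per distinct label, so a one-vs-rest classifier could then be trained on it.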

In [2]:
import pandas as pd

df1 = pd.read_csv('blogtext.csv')

In [3]:
print('The data frame is:')
df1.head()

The data frame is:


Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [4]:
print('The number of items in data frame is:')
len(df1)

The number of items in data frame is:


681284

In [5]:
print('Feature-wise percentage of Null values:')
print(df1.isnull().mean() * 100)
print("")

Feature-wise percentage of Null values:
id        0.0
gender    0.0
age       0.0
topic     0.0
sign      0.0
date      0.0
text      0.0
dtype: float64



In [6]:
df2=df1.sample(n = 20000)

In [7]:
print('The new data frame is:')
df2.head()

The new data frame is:


Unnamed: 0,id,gender,age,topic,sign,date,text
386553,1308868,male,26,Communications-Media,Cancer,"27,August,2003",
511624,3603933,female,23,Education,Virgo,"28,July,2004",Spent the day driving up North to Kalev...
288313,2389810,male,23,Student,Aries,"05,June,2004",urlLink I recently had to open ...
341473,3566371,male,17,Student,Aquarius,"13,July,2004","After thinking long and hard, i dec..."
387107,1774842,female,15,indUnk,Cancer,"22,November,2003",my least favorite holiday...the day i a...


In [8]:
print('The number of items in new data frame is:')
len(df2)

The number of items in new data frame is:


20000
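
Note that `sample` is called without a seed above, so every run draws a different 20,000 rows and the figures that follow will not reproduce exactly. A small sketch of a seeded draw on a toy frame; the frame and the seed value 42 are illustrative assumptions:

```python
import pandas as pd

# Toy frame standing in for df1.
toy = pd.DataFrame({"id": range(100), "text": [f"post {i}" for i in range(100)]})

# The same random_state yields the same rows on every run.
first = toy.sample(n=10, random_state=42)
second = toy.sample(n=10, random_state=42)

print(first.equals(second))
```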

Eliminate Non-English textual data
(Hint: Refer to the ‘langdetect’ library to detect the language of the input text.)

In [9]:
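from langdetect import detect

def detect_english(text):
    """Return True if langdetect identifies the text as English."""
    try:
        return detect(text) == 'en'
    except Exception:
        # detect() raises on empty or non-linguistic text (e.g. the blank
        # post in the sample above); treat such rows as non-English.
        return False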

In [10]:
df2= df2[df2['text'].apply(detect_english)]

In [12]:
print('The new data frame is:')
df2.head()

The new data frame is:


Unnamed: 0,id,gender,age,topic,sign,date,text
511624,3603933,female,23,Education,Virgo,"28,July,2004",Spent the day driving up North to Kalev...
288313,2389810,male,23,Student,Aries,"05,June,2004",urlLink I recently had to open ...
341473,3566371,male,17,Student,Aquarius,"13,July,2004","After thinking long and hard, i dec..."
387107,1774842,female,15,indUnk,Cancer,"22,November,2003",my least favorite holiday...the day i a...
591067,3375531,male,26,Communications-Media,Aquarius,"18,May,2004",urlLink Who Is Abu Zarqawi? : 'Za...


Eliminate All special Characters and Numbers 

In [13]:
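def remove_special_characters(dataframe, columnname):
    # Keep only letters and spaces; digits are dropped too, as this step requires.
    # Matches are deleted outright, so words separated only by punctuation
    # are merged (e.g. "holiday...the" becomes "holidaythe").
    dataframe_no_special_characters = dataframe[columnname].replace(r'[^A-Za-z ]+', '', regex=True)
    return dataframe_no_special_characters

df2['text'] = remove_special_characters(df2, 'text')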

In [14]:
print('The new data frame is:')
df2.head()

The new data frame is:


Unnamed: 0,id,gender,age,topic,sign,date,text
511624,3603933,female,23,Education,Virgo,"28,July,2004",Spent the day driving up North to Kalev...
288313,2389810,male,23,Student,Aries,"05,June,2004",urlLink I recently had to open ...
341473,3566371,male,17,Student,Aquarius,"13,July,2004",After thinking long and hard i deci...
387107,1774842,female,15,indUnk,Cancer,"22,November,2003",my least favorite holidaythe day i am f...
591067,3375531,male,26,Communications-Media,Aquarius,"18,May,2004",urlLink Who Is Abu Zarqawi Zarqa...
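
The fourth row above shows a side effect of deleting matches: "holiday...the" has become "holidaythe". A hedged sketch of the alternative, replacing each match with a space and then collapsing repeated spaces; the input string is a shortened version of that row, and the pattern is in the style of the cell above:

```python
import re

raw = "my least favorite holiday...the day"

# Deleting every non-letter glues neighbouring words together.
deleted = re.sub(r"[^A-Za-z ]+", "", raw)

# Replacing with a space keeps word boundaries; split/join then collapses
# any repeated spaces this leaves behind.
spaced = " ".join(re.sub(r"[^A-Za-z ]+", " ", raw).split())

print(deleted)
print(spaced)
```

Which behaviour is preferable depends on the data: for apostrophes ("don't") deletion may be the better choice, while for ellipses and dashes a space avoids merged tokens.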


Lowercase all textual data 

In [15]:
def lowercase(dataframe, columnname):
    # The vectorised string accessor lowercases every entry of the column
    lowercase_dataframe = dataframe[columnname].str.lower()
    return lowercase_dataframe

df2['text'] = lowercase(df2, 'text')

In [16]:
print('The new data frame is:')
df2.head()

The new data frame is:


Unnamed: 0,id,gender,age,topic,sign,date,text
511624,3603933,female,23,Education,Virgo,"28,July,2004",spent the day driving up north to kalev...
288313,2389810,male,23,Student,Aries,"05,June,2004",urllink i recently had to open ...
341473,3566371,male,17,Student,Aquarius,"13,July,2004",after thinking long and hard i deci...
387107,1774842,female,15,indUnk,Cancer,"22,November,2003",my least favorite holidaythe day i am f...
591067,3375531,male,26,Communications-Media,Aquarius,"18,May,2004",urllink who is abu zarqawi zarqa...


Remove all Stopwords  

In [17]:
from nltk.tokenize import word_tokenize

def tokenize_words(dataframe,columnname):
    dataframe_tokenized_texts= dataframe[columnname].apply(lambda x: word_tokenize(x) )
    
    return dataframe_tokenized_texts

df2['text'] = tokenize_words(df2,'text')

In [18]:
print('The new data frame is:')
df2.head()

The new data frame is:


Unnamed: 0,id,gender,age,topic,sign,date,text
511624,3603933,female,23,Education,Virgo,"28,July,2004","[spent, the, day, driving, up, north, to, kale..."
288313,2389810,male,23,Student,Aries,"05,June,2004","[urllink, i, recently, had, to, open, up, the,..."
341473,3566371,male,17,Student,Aquarius,"13,July,2004","[after, thinking, long, and, hard, i, decided,..."
387107,1774842,female,15,indUnk,Cancer,"22,November,2003","[my, least, favorite, holidaythe, day, i, am, ..."
591067,3375531,male,26,Communications-Media,Aquarius,"18,May,2004","[urllink, who, is, abu, zarqawi, zarqawi, has,..."


In [19]:
from nltk.corpus import stopwords

def remove_stop_words(dataframe, columnname):
    # A set gives constant-time membership tests (needs nltk.download('stopwords') once)
    stop = set(stopwords.words('english'))
    dataframe_no_stop_words = dataframe[columnname].apply(lambda x: [item for item in x if item not in stop])

    return dataframe_no_stop_words

df2['text'] = remove_stop_words(df2,'text')

In [21]:
print('The new data frame is:')
df2.head()

The new data frame is:


Unnamed: 0,id,gender,age,topic,sign,date,text
511624,3603933,female,23,Education,Virgo,"28,July,2004","[spent, day, driving, north, kaleva, kara, bac..."
288313,2389810,male,23,Student,Aries,"05,June,2004","[urllink, recently, open, ol, beast, sgi, work..."
341473,3566371,male,17,Student,Aquarius,"13,July,2004","[thinking, long, hard, decided, go, school, pa..."
387107,1774842,female,15,indUnk,Cancer,"22,November,2003","[least, favorite, holidaythe, day, forced, eat..."
591067,3375531,male,26,Communications-Media,Aquarius,"18,May,2004","[urllink, abu, zarqawi, zarqawi, associated, g..."


In [22]:
def join_sentence(dataframe,columnname):

    dataframe_joined_sentence= dataframe[columnname].apply(lambda x: (" ").join(x))    
    return dataframe_joined_sentence

df2['text'] = join_sentence(df2,'text')

In [23]:
print('The new data frame is:')
df2.head()

The new data frame is:


Unnamed: 0,id,gender,age,topic,sign,date,text
511624,3603933,female,23,Education,Virgo,"28,July,2004",spent day driving north kaleva kara back lansi...
288313,2389810,male,23,Student,Aries,"05,June,2004",urllink recently open ol beast sgi workstation...
341473,3566371,male,17,Student,Aquarius,"13,July,2004",thinking long hard decided go school party hap...
387107,1774842,female,15,indUnk,Cancer,"22,November,2003",least favorite holidaythe day forced eat bird ...
591067,3375531,male,26,Communications-Media,Aquarius,"18,May,2004",urllink abu zarqawi zarqawi associated groups ...


Remove all extra white spaces

In [1]:
print("Already performed")

Already performed
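
Why "already performed"? Tokenizing the text and re-joining the tokens with a single space removes leading, trailing, and repeated whitespace as a by-product. A minimal sketch of the effect, with `str.split()` standing in for `word_tokenize` (an assumption made so the example needs no NLTK data):

```python
text = "  spent   the day\tdriving\n\nup north  "

# str.split() with no arguments splits on any run of whitespace and drops
# leading/trailing whitespace; joining with one space normalises the rest.
tokens = text.split()
normalised = " ".join(tokens)

print(normalised)  # spent the day driving up north
```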


Build a base Classification model  

Create dependent and independent variables  

In [25]:
X = df2["text"]

print('The independent data frame is:')
X.head()

The independent data frame is:


511624    spent day driving north kaleva kara back lansi...
288313    urllink recently open ol beast sgi workstation...
341473    thinking long hard decided go school party hap...
387107    least favorite holidaythe day forced eat bird ...
591067    urllink abu zarqawi zarqawi associated groups ...
Name: text, dtype: object

In [26]:
y= df2["topic"]

print('The dependent/target data frame is:')
y.head()

The dependent/target data frame is:


511624               Education
288313                 Student
341473                 Student
387107                  indUnk
591067    Communications-Media
Name: topic, dtype: object

In [27]:
print("Count of topics:")
y.value_counts()

Count of topics:


indUnk                     7163
Student                    4227
Technology                 1191
Arts                        903
Education                   847
Communications-Media        544
Non-Profit                  408
Internet                    400
Engineering                 311
Law                         241
Publishing                  220
Science                     211
Government                  194
Consulting                  169
Advertising                 151
Religion                    151
Marketing                   138
BusinessServices            130
Fashion                     129
Accounting                  112
Banking                     102
Telecommunications          101
Military                    101
Museums-Libraries            97
Chemicals                    95
HumanResources               92
RealEstate                   83
Sports-Recreation            79
Transportation               76
Tourism                      71
Biotech                      55
Manufact
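
The counts above are heavily imbalanced, which gives a useful reference point for the accuracies reported later. A quick calculation of the share of the largest class, using only figures from this notebook (the total is the sum of the train and test sizes printed after the split); the test-set share may differ slightly:

```python
# Figures reported in this notebook.
indunk_posts = 7163          # largest class in the value counts above
total_posts = 15277 + 3820   # train + test sizes reported after the split

majority_share = indunk_posts / total_posts
print(f"{majority_share:.3f}")  # 0.375
```

A model that always predicted indUnk would therefore score roughly 0.375 on this sample, so accuracy alone says little here and the per-class metrics matter more.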

Split data into train and test 

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

print("{0:0.2f}% data is in training set".format((len(X_train)/len(df2.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(df2.index)) * 100))

80.00% data is in training set
20.00% data is in test set


In [29]:
print('Number of items in X train:')
print(len(X_train))
print("Shape of X train:")
print(X_train.shape)
print('Number of items in X test:')
print(len(X_test))
print("Shape of X test:")
print(X_test.shape)

Number of items in X train:
15277
Shape of X train:
(15277,)
Number of items in X test:
3820
Shape of X test:
(3820,)
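
With classes this imbalanced, a plain random split can leave rare topics under-represented in the test set. A hedged sketch of a stratified split using the `stratify` argument of `train_test_split`; the toy data is illustrative and this variant is not what the notebook ran:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 majority and 20 minority labels (illustrative only).
texts = [f"post {i}" for i in range(100)]
labels = ["indUnk"] * 80 + ["Student"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=1, stratify=labels
)

# Class shares in the test set mirror those of the full data.
print(Counter(y_te))
```

Stratification requires at least two samples per class, so very rare topics would need to be merged or dropped first.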


Vectorize data using any one vectorizer 

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

In [31]:
from sklearn.pipeline import Pipeline
nb_text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
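
The first two pipeline steps together compute TF-IDF features. As a side note, scikit-learn's `TfidfVectorizer` combines both steps; the sketch below checks on three toy documents (illustrative, loosely based on the cleaned posts above) that the two routes give the same matrix with default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

docs = [
    "spent day driving north",
    "thinking long hard decided go school",
    "least favorite holiday day forced eat bird",
]

two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
one_step = TfidfVectorizer()

A = two_step.fit_transform(docs)
B = one_step.fit_transform(docs)

print(np.allclose(A.toarray(), B.toarray()))
```

Keeping the steps separate, as the notebook does, has one advantage: `tfidf__use_idf` can be tuned independently of the vectorizer settings.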

Build a base model for Supervised Learning - Classification 

In [32]:
model = nb_text_clf.fit(X_train, y_train)

In [33]:
predicted_labels = model.predict(X_test)

In [34]:
print('Accuracy of the Naive Bayes classifier:', model.score(X_test, y_test))

Accuracy of the Naive Bayes classifier: 0.381151832460733


Clearly print Performance Metrics 

In [35]:
from sklearn.metrics import classification_report
print('\nClassification Report\n')
print(classification_report(y_test, predicted_labels))


Classification Report

                         precision    recall  f1-score   support

             Accounting       0.00      0.00      0.00        19
            Advertising       0.00      0.00      0.00        30
            Agriculture       0.00      0.00      0.00         9
           Architecture       0.00      0.00      0.00        11
                   Arts       0.00      0.00      0.00       176
             Automotive       0.00      0.00      0.00         6
                Banking       0.00      0.00      0.00        21
                Biotech       0.00      0.00      0.00        10
       BusinessServices       0.00      0.00      0.00        17
              Chemicals       0.00      0.00      0.00        15
   Communications-Media       0.00      0.00      0.00       112
           Construction       0.00      0.00      0.00         4
             Consulting       0.00      0.00      0.00        34
              Education       0.00      0.00      0.00       158
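
The rows of zeros in the report come from classes the model never predicts. Precision is then 0/0, which scikit-learn reports as 0.0 together with an undefined-metric warning (the `zero_division` argument of `classification_report` controls this). A small pure-Python sketch of the per-class calculation on toy labels; the helper name `precision_recall` is ours:

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall; None marks an undefined ratio."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else None
    recall = tp / actual if actual else None
    return precision, recall


y_true = ["indUnk", "indUnk", "Student", "Arts"]
y_pred = ["indUnk", "indUnk", "indUnk", "indUnk"]  # model ignores rare classes

print(precision_recall(y_true, y_pred, "indUnk"))  # (0.5, 1.0)
print(precision_recall(y_true, y_pred, "Arts"))    # (None, 0.0)
```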




In [36]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3),}

nb_gs_clf = GridSearchCV(nb_text_clf, parameters, n_jobs=-1)
model = nb_gs_clf.fit(X_train, y_train)

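print("Best parameters for model:")
# Only the last expression in a cell is displayed; best_score_ holds the
# mean cross-validated accuracy for this parameter set.
model.best_params_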

Best parameters for model:


{'clf__alpha': 0.01, 'tfidf__use_idf': False, 'vect__ngram_range': (1, 1)}
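
To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below rebuilds the same grid with `itertools.product`; the fold count of 5 is an assumption based on the default `cv` in recent scikit-learn releases, since the notebook does not set it explicitly:

```python
from itertools import product

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1e-2, 1e-3],
}

# Every combination of one value per parameter.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

cv_folds = 5  # assumption: GridSearchCV's default number of folds
print(len(combos), "candidates,", len(combos) * cv_folds, "cross-validation fits")
```

The search finishes with one extra fit, retraining the best candidate on the whole training set.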

In [37]:
print("Now we will use best parameters for model")
# With refit=True (the default), GridSearchCV has already retrained the best
# pipeline on the full training set, so no second search is needed.
model = nb_gs_clf.best_estimator_
predicted_labels = model.predict(X_test)
print('Accuracy of the Naive Bayes classifier with best parameters:', model.score(X_test, y_test))

Now we will use best parameters for model
Accuracy of the Naive Bayes classifier with best parameters: 0.40078534031413615


In [38]:
print('\nClassification Report\n')
print(classification_report(y_test, predicted_labels))


Classification Report

                         precision    recall  f1-score   support

             Accounting       0.00      0.00      0.00        19
            Advertising       0.00      0.00      0.00        30
            Agriculture       0.00      0.00      0.00         9
           Architecture       0.00      0.00      0.00        11
                   Arts       0.33      0.01      0.01       176
             Automotive       0.00      0.00      0.00         6
                Banking       0.00      0.00      0.00        21
                Biotech       0.00      0.00      0.00        10
       BusinessServices       0.00      0.00      0.00        17
              Chemicals       0.00      0.00      0.00        15
   Communications-Media       1.00      0.01      0.02       112
           Construction       0.00      0.00      0.00         4
             Consulting       0.00      0.00      0.00        34
              Education       0.67      0.03      0.05       158




Improve Performance of model 

Experiment with other vectorisers 

Build classifier Models using other algorithms than base model 

In [40]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42)),
])
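
With `loss='hinge'`, `SGDClassifier` fits a linear support vector machine by stochastic gradient descent. For intuition, here is a minimal sketch of the hinge loss for a single binary example (labels in {-1, +1}); it illustrates the loss only, not scikit-learn's multiclass implementation:

```python
def hinge_loss(y: int, score: float) -> float:
    """Hinge loss for a label y in {-1, +1} and a raw decision score."""
    return max(0.0, 1.0 - y * score)


print(hinge_loss(+1, 2.5))  # confidently correct: no loss
print(hinge_loss(+1, 0.3))  # correct but inside the margin: small loss
print(hinge_loss(-1, 0.3))  # wrong side of the boundary: larger loss
```

The `alpha` parameter scales the L2 penalty added to this loss, which is why it is the value tuned in the grid search later on.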

In [41]:
model = text_clf_svm.fit(X_train, y_train)

In [42]:
predicted_labels = model.predict(X_test)

In [43]:
print('Accuracy of the Stochastic Gradient Descent classifier:', model.score(X_test, y_test))

Accuracy of the Stochastic Gradient Descent classifier: 0.3575916230366492


In [44]:
from sklearn.metrics import classification_report
print('\nClassification Report\n')
print(classification_report(y_test, predicted_labels))


Classification Report

                         precision    recall  f1-score   support

             Accounting       0.50      0.11      0.17        19
            Advertising       0.00      0.00      0.00        30
            Agriculture       0.00      0.00      0.00         9
           Architecture       0.00      0.00      0.00        11
                   Arts       0.10      0.03      0.05       176
             Automotive       0.00      0.00      0.00         6
                Banking       0.00      0.00      0.00        21
                Biotech       0.00      0.00      0.00        10
       BusinessServices       0.14      0.06      0.08        17
              Chemicals       0.00      0.00      0.00        15
   Communications-Media       0.19      0.05      0.08       112
           Construction       0.00      0.00      0.00         4
             Consulting       0.00      0.00      0.00        34
              Education       0.18      0.08      0.11       158




Tune Parameters/Hyperparameters of the model/s 

In [45]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf-svm__alpha': (1e-2, 1e-3),}

gs_clf_svm = GridSearchCV(text_clf_svm, parameters, n_jobs=-1)
model = gs_clf_svm.fit(X_train, y_train)

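print("Best parameters for model")
# Only the last expression in a cell is displayed; best_score_ holds the
# mean cross-validated accuracy for this parameter set.
model.best_params_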

Best parameters for model


{'clf-svm__alpha': 0.001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}

In [46]:
print("Now we will use best parameters for model")
# With refit=True (the default), GridSearchCV has already retrained the best
# pipeline on the full training set, so no second search is needed.
model = gs_clf_svm.best_estimator_
predicted_labels = model.predict(X_test)
print('Accuracy of the Stochastic Gradient Descent classifier with best parameters:', model.score(X_test, y_test))

Now we will use best parameters for model
Accuracy of the Stochastic Gradient Descent classifier with best parameters: 0.38821989528795814


In [47]:
from sklearn.metrics import classification_report
print('\nClassification Report\n')
print(classification_report(y_test, predicted_labels))


Classification Report

                         precision    recall  f1-score   support

             Accounting       0.00      0.00      0.00        19
            Advertising       0.00      0.00      0.00        30
            Agriculture       0.00      0.00      0.00         9
           Architecture       0.00      0.00      0.00        11
                   Arts       0.12      0.01      0.02       176
             Automotive       0.00      0.00      0.00         6
                Banking       0.00      0.00      0.00        21
                Biotech       0.00      0.00      0.00        10
       BusinessServices       0.25      0.06      0.10        17
              Chemicals       0.00      0.00      0.00        15
   Communications-Media       0.50      0.05      0.10       112
           Construction       0.00      0.00      0.00         4
             Consulting       0.00      0.00      0.00        34
              Education       0.59      0.06      0.11       158




In [48]:
print("Now we will try with more advanced vectorizers")

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Tokenize each blog post and tag it with its position in df2
card_docs = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df2.text)]

Now we will try with more advanced vectorizers


In [49]:
#instantiate model
model = Doc2Vec(vector_size=64, window=2, min_count=1, workers=8, epochs = 40)
#build vocab
model.build_vocab(card_docs)
#train model
model.train(card_docs, total_examples=model.corpus_count, epochs=model.epochs)

In [50]:
# Convert df2['text'] to an array
df2_text_array = df2["text"].to_numpy()
# Infer one vector per blog post
card2vec = [model.infer_vector(doc.split(' ')) for doc in df2_text_array]

In [64]:
# One column name per embedding dimension
columns = ['Column' + str(i) for i in range(len(card2vec[0]))]

In [65]:
X= pd.DataFrame(card2vec, columns = columns)
X.head()

Unnamed: 0,Column0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,...,Column54,Column55,Column56,Column57,Column58,Column59,Column60,Column61,Column62,Column63
0,1.671392,-1.068388,-0.191523,0.711884,-0.301856,-2.081101,-0.322949,1.298869,-1.333727,-0.45085,...,1.157164,1.354626,0.733016,1.02038,1.679475,-0.413759,-0.498818,0.051744,0.125808,1.678232
1,1.038524,0.484189,0.815457,0.23899,-0.645635,-0.220878,0.266301,-0.756432,-0.693493,-0.647644,...,0.495461,-1.112361,-0.418613,-0.586357,-0.057856,0.252528,-1.285814,-0.843275,0.38182,0.349914
2,-0.469482,-0.777407,0.256851,1.019218,-0.188985,-0.489693,-0.231226,-0.173393,0.292314,-0.070949,...,0.147816,0.039171,0.126307,-0.606989,-0.002127,-0.359383,-0.877908,-0.302257,-0.037587,0.445923
3,-0.034772,0.350986,-0.738007,0.135469,0.333906,-0.337914,-0.027656,0.420688,-0.468603,0.69054,...,0.098034,-0.171656,0.097934,-0.314903,-0.365639,0.169617,-0.876888,-0.302395,0.043974,0.553041
4,-0.125319,-0.110321,-0.208806,0.94586,-0.254838,1.469597,-1.759404,-1.94737,0.582735,-0.306625,...,-1.379077,-1.712678,-0.430529,-1.015618,-1.860881,-0.588598,-1.217528,0.056236,0.501394,-0.425858


In [66]:
y= df2["topic"]
y.head()

511624               Education
288313                 Student
341473                 Student
387107                  indUnk
591067    Communications-Media
Name: topic, dtype: object
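
Note that `X` now has a fresh 0..n-1 index while `y` keeps the row ids from the sample. `train_test_split` pairs rows by position, so the split below is unaffected, but any label-based operation (a join, `pd.concat(axis=1)`, `X.loc[...]`) would not line up. A hedged sketch on toy data showing the difference and one way to keep the indexes aligned:

```python
import pandas as pd

# Toy stand-ins: labels keep the sampled row ids, vectors get a default index.
y_toy = pd.Series(["Education", "Student", "indUnk"], index=[511624, 288313, 341473])
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

X_default = pd.DataFrame(vectors)                     # index 0, 1, 2
X_aligned = pd.DataFrame(vectors, index=y_toy.index)  # same index as the labels

# Label-based joins only line up when the indexes match.
print(X_default.join(y_toy.rename("topic"))["topic"].isna().all())
print(X_aligned.join(y_toy.rename("topic"))["topic"].tolist())
```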

In [67]:
from sklearn.model_selection import train_test_split
X_train, X_test,y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

print("{0:0.2f}% data is in training set".format((len(X_train)/len(df2.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(df2.index)) * 100))

80.00% data is in training set
20.00% data is in test set


In [68]:
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('clf-svm', SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42)),])

model = text_clf_svm.fit(X_train, y_train)

In [69]:
print('Accuracy of the Stochastic Gradient Descent classifier on Doc2Vec vectors:', model.score(X_test, y_test))

Accuracy of the Stochastic Gradient Descent classifier on Doc2Vec vectors: 0.26675392670157066


In [70]:
predicted_labels = model.predict(X_test)

In [71]:
from sklearn.metrics import classification_report
print('\nClassification Report\n')
print(classification_report(y_test, predicted_labels))


Classification Report

                         precision    recall  f1-score   support

             Accounting       0.00      0.00      0.00        19
            Advertising       0.00      0.00      0.00        30
            Agriculture       0.00      0.00      0.00         9
           Architecture       0.00      0.00      0.00        11
                   Arts       0.06      0.01      0.02       176
             Automotive       0.00      0.00      0.00         6
                Banking       0.00      0.00      0.00        21
                Biotech       0.00      0.00      0.00        10
       BusinessServices       0.00      0.00      0.00        17
              Chemicals       0.00      0.00      0.00        15
   Communications-Media       0.02      0.01      0.01       112
           Construction       0.00      0.00      0.00         4
             Consulting       0.00      0.00      0.00        34
              Education       0.04      0.11      0.06       158


