# Saumya Kothari - Natural Language Processing Project 1 (Part 1)

## Part 1

#### DOMAIN: 
Digital content management

#### CONTEXT: 
Classification is probably the most common task you will deal with in real life. Text in the form of blogs, posts, articles,
etc. is written every second. It is a challenge to predict information about a writer without knowing them. We are going to
create a classifier that predicts multiple features of the author of a given text. We have designed it as a multi-label classification problem.

#### DATA DESCRIPTION: 
Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected
posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a
blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age but for many,
industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:

- 8240 "10s" blogs (ages 13-17),
- 8086 "20s" blogs (ages 23-27) and
- 2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions:
individual posts within a single blogger are separated by the date of the following post, and links within a post are denoted by the label
urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus

#### PROJECT OBJECTIVE: 
The need is to build an NLP classifier which can use input text parameters to determine the label/s of the blog.
Steps and tasks:
1. Import and analyse the data set.
2. Perform data pre-processing on the data:
   - Data cleansing by removing unwanted characters, spaces, stop words, etc. Convert text to lowercase.
   - Target/label merger and transformation
   - Train and test split
   - Vectorisation, etc.
3. Design, train, tune and test the best text classifier.
4. Display and explain in detail the classification report.
5. Print the true vs predicted labels for any 5 entries from the dataset.

In [1]:
# IMPORT NECESSARY PACKAGES

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re # regular expression

import nltk
import os

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

## Import and analyse the data set

In [2]:
data=pd.read_csv("blogtext.csv")

In [3]:
data.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


## Perform data pre-processing on the data

In [4]:
data.isna().any()

id        False
gender    False
age       False
topic     False
sign      False
date      False
text      False
dtype: bool

In [5]:
data.shape

(681284, 7)

The data set has 681,284 records, which is too large for quick analysis and computation. For convenience we therefore work on a subset of the first 10,000 rows, and rerun on the entire data set once all errors are fixed and optimization is done.
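One caveat about taking the first rows: consecutive rows of the file belong to the same blogger, so the head of the data covers only a small number of authors. The sketch below uses a hypothetical toy corpus (100 bloggers with 50 posts each, not the real data) to compare a head-style subset with a seeded random subset. With pandas, a comparable random subset could be drawn with `data.sample(n=10000, random_state=42)`; whether that is preferable here is a judgment call.

```python
import random

# Hypothetical toy corpus: 100 bloggers with 50 posts each, stored blogger by blogger,
# mimicking how consecutive rows of blogtext.csv belong to the same author.
rows = [(blogger_id, post_no) for blogger_id in range(100) for post_no in range(50)]

subset_size = 500

# Equivalent of data.head(subset_size): only the first few bloggers are covered.
head_subset = rows[:subset_size]

# Random subset with a fixed seed so that the experiment is repeatable.
rng = random.Random(42)
random_subset = rng.sample(rows, subset_size)

bloggers_in_head = {blogger_id for blogger_id, _ in head_subset}
bloggers_in_random = {blogger_id for blogger_id, _ in random_subset}

print(len(bloggers_in_head))    # 10 bloggers: 500 rows / 50 posts each
print(len(bloggers_in_random))  # far more distinct bloggers than the head subset
```

The fewer distinct authors a subset contains, the more skewed its label distribution is likely to be.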

In [6]:
data=data.head(10000)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
id        10000 non-null int64
gender    10000 non-null object
age       10000 non-null int64
topic     10000 non-null object
sign      10000 non-null object
date      10000 non-null object
text      10000 non-null object
dtypes: int64(2), object(5)
memory usage: 547.0+ KB


### Data cleansing

In [8]:
data.drop(['id','date'], axis=1, inplace=True)

The id and date columns are removed from the dataset because they add little value for predicting the labels.

In [9]:
data['age']=data['age'].astype('object')

In [10]:
data.info()

# from here we can see that all columns have been converted to object data type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
gender    10000 non-null object
age       10000 non-null object
topic     10000 non-null object
sign      10000 non-null object
text      10000 non-null object
dtypes: object(5)
memory usage: 390.8+ KB


### Removing unwanted Characters, Spaces etc. Converting text to lowercase.

In [12]:
data['clean_data']=data['text'].apply(lambda x: re.sub(r'[^A-Za-z]+',' ',x)) # remove unwanted characters
data['clean_data']=data['clean_data'].apply(lambda x: x.lower()) # convert to lowercase
data['clean_data']=data['clean_data'].apply(lambda x: x.strip()) # remove leading and trailing spaces
print("Actual data=======> {}".format(data['text'][1]))
print("Cleaned data=======> {}".format(data['clean_data'][1]))
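The three cleaning steps can be tried on a single string without loading the data. The following is a minimal sketch; the helper name `clean_text` and the sample string are made up for illustration.

```python
import re

def clean_text(text):
    """Keep letters only, lowercase, and trim surrounding whitespace."""
    letters_only = re.sub(r'[^A-Za-z]+', ' ', text)  # every run of non-letters becomes one space
    return letters_only.lower().strip()

sample = "  Testing!!! 123 Testing... urlLink "
print(clean_text(sample))   # testing testing urllink
print(clean_text("don't"))  # don t
```

Note that the pattern also removes digits and apostrophes, so contractions such as "don't" are split into two tokens.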



### Remove all stop words

In [13]:
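# requires the NLTK stop word corpus, e.g. via nltk.download('stopwords')
stop_words=set(stopwords.words('english')) # a separate name keeps the imported stopwords module usable when the cell is re-run

In [15]:
data['clean_data']=data['clean_data'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))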
data['clean_data'][15]

'one thing love seoul mean korea general happen little seoul centric street sellers really trust food sell side road except ice cream virtually everything else fair game example get ready trip canada generally stock last two weeks bought plants nieces lightweight sports shirts inlining pair shorts inlining bags dried goguma sweet potatoes yams selling got tie amazing price usd really tell worse ones bought usd back home disposible razors usd ten noise making toy hammer boy disney photo albums sure ol walt make penny clothes seller guy spoke pretty good english know held hostage minutes talked korean men getting fatter hence stock larger sizes husky guys like learned english working us army years ago goguma guy know lot english speak spanish owing fact lived argentina years unfortunately spanish one languages know fair bit french school days smattering japanese course korean anyways passed goguma guy later week gave big hola spanish hello extent proficiency returned one well wow bridgin
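The stop word filter can be illustrated in isolation. This sketch uses a tiny hand-written set as a stand-in for NLTK's English list (an assumption made so the example runs without downloading the corpus); the helper name `remove_stop_words` is hypothetical.

```python
# Hypothetical, tiny stand-in for NLTK's English stop word list.
STOP_WORDS = {"i", "the", "a", "of", "to", "is", "and", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

print(remove_stop_words("i love the street sellers in seoul"))
# love street sellers seoul
```

Keeping the stop words in a `set` makes each membership test a constant-time lookup, which matters when the filter runs over hundreds of thousands of posts.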

### Target/label merger and transformation

In [16]:
data['labels']=data.apply(lambda col: [col['gender'],str(col['age']),col['topic'],col['sign']], axis=1)

In [17]:
data.head()

Unnamed: 0,gender,age,topic,sign,text,clean_data,labels
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,male,15,Student,Leo,testing!!! testing!!!,testing testing,"[male, 15, Student, Leo]"
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"
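The label merger is easiest to follow on a single row. A minimal sketch, with a plain dictionary standing in for one DataFrame row and a hypothetical helper name `build_labels`:

```python
def build_labels(row):
    """Merge the four author attributes of one row into a list of string labels."""
    return [row['gender'], str(row['age']), row['topic'], row['sign']]

row = {'gender': 'male', 'age': 15, 'topic': 'Student', 'sign': 'Leo'}
print(build_labels(row))  # ['male', '15', 'Student', 'Leo']
```

Casting the age to a string gives all labels the same type, which is what later allows the label names to be sorted into one common class list.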


In [18]:
data=data[['clean_data','labels']] # keep only the cleaned text and the merged labels

In [19]:
data.head()

Unnamed: 0,clean_data,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


In [20]:
X=data['clean_data']
Y=data['labels']

### Count Vectorization with uni-grams and bi-grams

In [22]:
vectorizer=CountVectorizer(binary=True, ngram_range=(1,2))
X=vectorizer.fit_transform(X)
X[1]

In [25]:
vectorizer.get_feature_names()[:5]

['aa', 'aa amazing', 'aa anger', 'aa compared', 'aa keeps']
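The feature names above mix single words and word pairs because `ngram_range=(1,2)` asks for uni-grams and bi-grams. The plain-Python sketch below mimics that idea on one short text. It is a simplification and not scikit-learn's implementation; for instance, it does not apply `CountVectorizer`'s default token pattern, which drops one-character tokens.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

tokens = "aa amazing day".split()
features = sorted(set(ngrams(tokens)))  # set(): presence only, similar in spirit to binary=True
print(features)
# ['aa', 'aa amazing', 'amazing', 'amazing day', 'day']
```

Adding bi-grams multiplies the vocabulary size, which is why the sparse feature indices in this notebook run into the hundreds of thousands.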

In [26]:
label_counts=dict()

for labels in data.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label]+=1
        else:
            label_counts[label]=1
            
label_counts

{'male': 5916,
 '15': 602,
 'Student': 1137,
 'Leo': 301,
 '33': 136,
 'InvestmentBanking': 70,
 'Aquarius': 571,
 'female': 4084,
 '14': 212,
 'indUnk': 3287,
 'Aries': 4198,
 '25': 386,
 'Capricorn': 215,
 '17': 1185,
 'Gemini': 150,
 '23': 253,
 'Non-Profit': 71,
 'Cancer': 504,
 'Banking': 16,
 '37': 33,
 'Sagittarius': 1097,
 '26': 234,
 '24': 655,
 'Scorpio': 971,
 '27': 1054,
 'Education': 270,
 '45': 16,
 'Engineering': 127,
 'Libra': 491,
 'Science': 63,
 '34': 553,
 '41': 20,
 'Communications-Media': 99,
 'BusinessServices': 91,
 'Sports-Recreation': 80,
 'Virgo': 236,
 'Taurus': 812,
 'Arts': 45,
 'Pisces': 454,
 '44': 3,
 '16': 440,
 'Internet': 118,
 'Museums-Libraries': 17,
 'Accounting': 4,
 '39': 79,
 '35': 2315,
 'Technology': 2654,
 '36': 1708,
 'Law': 11,
 '46': 7,
 'Consulting': 21,
 'Automotive': 14,
 '42': 14,
 'Religion': 9,
 '13': 42,
 'Fashion': 1622,
 '38': 46,
 '43': 6,
 'Publishing': 4,
 '40': 1,
 'Marketing': 156,
 'LawEnforcement-Security': 10,
 'HumanReso

### Pre-processing the labels

In [27]:
binarizer=MultiLabelBinarizer(classes=sorted(label_counts.keys()))
Y=binarizer.fit_transform(data.labels)
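What the binarizer produces can be shown with a small stand-in written in plain Python. This is a sketch of the idea only (one 0/1 column per class), using two made-up label lists; it is not the `MultiLabelBinarizer` implementation.

```python
def binarize(label_lists, classes):
    """Turn each list of labels into a 0/1 row with one column per class."""
    column = {label: i for i, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        matrix.append(row)
    return matrix

label_lists = [['male', '15', 'Student', 'Leo'],
               ['female', '33', 'Student', 'Aries']]
classes = sorted({label for labels in label_lists for label in labels})
print(classes)
# ['15', '33', 'Aries', 'Leo', 'Student', 'female', 'male']
print(binarize(label_lists, classes))
# [[1, 0, 0, 1, 1, 0, 1], [0, 1, 1, 0, 1, 1, 0]]
```

Each row has exactly four ones, one per attribute, so the classifier has to get four columns right at the same time for a prediction to match completely.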

### Splitting the data into Test and Train set

In [28]:
Xtrain,Xtest,Ytrain,Ytest=train_test_split(X,Y,test_size=0.2)

## Design, train, tune and test the best text classifier.

In [29]:
model=LogisticRegression(solver='lbfgs')
model=OneVsRestClassifier(model)
model.fit(Xtrain,Ytrain)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [30]:
Ypred=model.predict(Xtest)

In [31]:
Ypred_inversed = binarizer.inverse_transform(Ypred)
y_test_inversed = binarizer.inverse_transform(Ytest)

## Display and explain in detail the classification report

In [33]:
def print_evaluation_scores(Ytest, Ypred):
    print('Accuracy score: ', accuracy_score(Ytest, Ypred))
    print('F1 score: ', f1_score(Ytest, Ypred, average='micro'))
    print('Average precision score: ', average_precision_score(Ytest, Ypred, average='micro'))
    print('Average recall score: ', recall_score(Ytest, Ypred, average='micro'))

In [34]:
print_evaluation_scores(Ytest, Ypred)

Accuracy score:  0.3065
F1 score:  0.6307253341342545
Average precision score:  0.44431705998495674
Average recall score:  0.525
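To help read these numbers: on multi-label input, `accuracy_score` counts a sample as correct only if every label matches (subset accuracy), while the micro-averaged scores count individual label decisions. That is why accuracy can sit well below F1. The sketch below recomputes these quantities by hand on a made-up two-sample example; the helper `micro_scores` is hypothetical and assumes at least one predicted and one true positive label, so no division by zero occurs.

```python
def micro_scores(y_true, y_pred):
    """Subset accuracy plus micro-averaged precision, recall and F1 for 0/1 matrices."""
    tp = fp = fn = exact = 0
    for true_row, pred_row in zip(y_true, y_pred):
        exact += true_row == pred_row      # whole row must match
        for t, p in zip(true_row, pred_row):
            tp += t == 1 and p == 1        # label correctly predicted
            fp += t == 0 and p == 1        # label predicted but not present
            fn += t == 1 and p == 0        # label present but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return exact / len(y_true), precision, recall, f1

y_true = [[1, 0, 1, 0], [0, 1, 1, 0]]
y_pred = [[1, 0, 1, 0], [0, 1, 0, 0]]
accuracy, precision, recall, f1 = micro_scores(y_true, y_pred)
print(accuracy, precision, recall, round(f1, 3))
# 0.5 1.0 0.75 0.857
```

Note also that `average_precision_score` summarizes a precision-recall curve and is designed for continuous scores rather than hard 0/1 predictions, so the "Average precision score" above is not the micro-averaged precision. Working backwards from the reported F1 (about 0.63) and recall (0.525), the micro-averaged precision would be roughly 0.79.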


## Print the true vs predicted labels for any 5 entries from the dataset

In [32]:
for i in range(5):
    print('Text:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        Xtest[i],
        ','.join(y_test_inversed[i]),
        ','.join(Ypred_inversed[i])
    ))

Text:	  (0, 593438)	1
  (0, 312628)	1
  (0, 421129)	1
  (0, 319058)	1
  (0, 449060)	1
  (0, 113776)	1
  (0, 529247)	1
  (0, 334172)	1
  (0, 567161)	1
  (0, 229515)	1
  (0, 586020)	1
  (0, 155847)	1
  (0, 375481)	1
  (0, 609968)	1
  (0, 170072)	1
  (0, 487526)	1
  (0, 442891)	1
  (0, 389210)	1
  (0, 440127)	1
  (0, 374716)	1
  (0, 154884)	1
  (0, 625081)	1
  (0, 150832)	1
  (0, 628568)	1
  (0, 258656)	1
  :	:
  (0, 587208)	1
  (0, 237653)	1
  (0, 100946)	1
  (0, 309458)	1
  (0, 161001)	1
  (0, 186471)	1
  (0, 445105)	1
  (0, 564389)	1
  (0, 146304)	1
  (0, 244114)	1
  (0, 637871)	1
  (0, 468706)	1
  (0, 148383)	1
  (0, 201242)	1
  (0, 155988)	1
  (0, 448370)	1
  (0, 100130)	1
  (0, 553035)	1
  (0, 336332)	1
  (0, 365564)	1
  (0, 100117)	1
  (0, 214330)	1
  (0, 562102)	1
  (0, 414403)	1
  (0, 81836)	1
True labels:	27,Taurus,female,indUnk
Predicted labels:	24,Sagittarius,male


Text:	  (0, 374440)	1
  (0, 227615)	1
  (0, 552721)	1
  (0, 160777)	1
  (0, 248484)	1
  (0, 634422)	1
  (0, 5657
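The `Text:` field above shows the sparse feature vector rather than the blog post, because only the vectorized matrix was split. One way to print readable text instead is to split row positions and look the text up by position. The sketch below shows the idea on made-up data with the standard library only. With scikit-learn, a similar effect could probably be had by passing an index array as an additional argument to `train_test_split`, which accepts any number of equally long arrays.

```python
import random

# Made-up stand-ins for the cleaned texts and their label lists.
texts = ["testing testing", "thanks yahoo toolbar", "team members", "info found pages"]
labels = [["male", "15"], ["male", "33"], ["male", "15"], ["male", "15"]]

# Shuffle row positions instead of the rows themselves, then cut off a test part.
indices = list(range(len(texts)))
random.Random(0).shuffle(indices)
test_indices = indices[:2]

# The positions still point at the original text and labels.
for i in test_indices:
    print('Text:\t{}\nTrue labels:\t{}\n'.format(texts[i], ','.join(labels[i])))
```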

##### Important Notes:
- This notebook solves a multi-label classification problem that predicts multiple features of the author of a given text.
- The data was loaded, and basic EDA and data inspection were done.
- The text was pre-processed: cleansing (removing unnecessary characters and spaces, converting to lowercase), removing stop words, and vectorizing the features.
- The data was prepared and split into train and test sets.
- The labels were encoded with a multi-label binarizer, a one-vs-rest logistic regression classifier was trained, predictions were made, and the accuracy, F1, average precision and recall scores were calculated.