# NLP Project

# Author Features Prediction

## Description
Classification is one of the most common machine learning tasks in practice.

Text in the form of blogs, posts, articles, etc. is written every second. Predicting information about a writer 
from the text alone, without knowing anything else about them, is a challenge.

We are going to create a classifier that predicts multiple features of the author of a given text.

We have framed this as a multi-label classification problem: each post carries several labels at once (gender, age, topic, and sign).

## Dataset

Blog Authorship Corpus: over 600,000 posts from more than 19,000 bloggers.

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in 
August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 
35 posts and 7,250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age, but for many, industry 
and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

- 8,240 "10s" blogs (ages 13-17)
- 8,086 "20s" blogs (ages 23-27)
- 2,994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
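
The three age groups above can be written as a small lookup. This is only an illustrative sketch: the name `age_group` is hypothetical, and ages that fall between the listed ranges are mapped to `None` because the corpus description gives no group for them.

```python
from typing import Optional

# Age ranges as listed in the corpus description above (inclusive bounds).
AGE_GROUPS = {
    "10s": (13, 17),
    "20s": (23, 27),
    "30s": (33, 47),
}

def age_group(age: int) -> Optional[str]:
    """Return the corpus age group for an age, or None if it falls in a gap."""
    for name, (low, high) in AGE_GROUPS.items():
        if low <= age <= high:
            return name
    return None

print(age_group(15), age_group(25), age_group(35), age_group(20))
```

Grouping raw ages this way would reduce the age labels to three values; the notebook itself keeps the raw ages as labels.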

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been 
stripped, with two exceptions: individual posts within a single blog are separated by the date of the following 
post, and links within a post are denoted by the label urllink.


## 1. Load the dataset 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from warnings import filterwarnings
filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

In [2]:
df = pd.read_csv('blogtext.csv')
df.head()

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |


In [3]:
# Checking the columns
df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

#### Checking Dimension

In [4]:
df.shape

(681284, 7)

#### Checking for null values

In [5]:
Total = df.isnull().sum().sort_values(ascending=False)  
Percent = (df.isnull().sum()*100/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([Total, Percent], axis = 1, keys = ['Total', 'Percentage of Missing Values'])    
missing_data

| | Total | Percentage of Missing Values |
|---|---|---|
| id | 0 | 0.0 |
| gender | 0 | 0.0 |
| age | 0 | 0.0 |
| topic | 0 | 0.0 |
| sign | 0 | 0.0 |
| date | 0 | 0.0 |
| text | 0 | 0.0 |


There are no missing values in the data.

The problem statement says: "As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly". I have therefore decided to use only the first 6000 rows for further analysis.

In [6]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned later
df = df.head(6000).copy()
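
Taking the first rows keeps runs of posts from the same bloggers, since the file appears to be grouped by blogger (the first four rows shown earlier share one id). A random sample is one alternative. The sketch below uses a tiny made-up frame; the values and the `random_state` are assumptions chosen for illustration only.

```python
import pandas as pd

# Tiny stand-in for the blog data (values are made up for illustration).
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 3],
    "text": [f"post {i}" for i in range(8)],
})

first_rows = toy.head(4)                       # same idea as df.head(6000), at small scale
random_rows = toy.sample(n=4, random_state=0)  # reproducible random subset of the same size

print(first_rows["id"].nunique(), "blogger(s) in the first rows")
print(random_rows["id"].nunique(), "blogger(s) in the random sample")
```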

## 2. Preprocess rows of the “text” column

- Remove unwanted characters
- Convert text to lowercase
- Remove unwanted spaces
- Remove stopwords

In [7]:
# Keep only alphabetic characters (every other run of characters becomes a space)
df.text = df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))
# Convert text to lowercase
df.text = df.text.apply(lambda x: x.lower())
# Strip unwanted spaces
df.text = df.text.apply(lambda x: x.strip())

In [8]:
# Remove stopwords
nltk.download('stopwords')
# Use a separate name so the imported `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\azhar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
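
The four preprocessing steps can also be bundled into one function, which makes them easy to check on a single string. This is a minimal sketch: `clean_text` and the short `STOP_WORDS` set are hypothetical stand-ins, whereas the notebook uses NLTK's English stopword list.

```python
import re

# Small stand-in stopword list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"this", "is", "a", "the", "and"}

def clean_text(text: str, stop_words=STOP_WORDS) -> str:
    """Apply the four preprocessing steps to one string."""
    text = re.sub(r"[^A-Za-z]+", " ", text)  # keep letters only
    text = text.lower().strip()              # lowercase, trim outer spaces
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_text("Hello, World!! This is a TEST."))  # hello world test
```

On a data frame, the same function could be applied with `df['text'].apply(clean_text)`.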


Let's verify the preprocessing steps by looking at one of the values.

In [9]:
df.text[5]

'interesting conversation dad morning talking koreans put money invariably lot real estate cash cash would include short term investments one year well savings accounts reason real estate makes money lot money seen surveys seoul real estate rising per year long stretches even taking account crisis referred imf crisis although imf bailed korea compare korean corporate bonds fell modestly recovered local stock market represented kospi version dow jones index gone appreciably high points points see urllink link see real estate makes sense back conversation noted real big elite real estate investor billion usd see urllink converter properties dad seemed little flabbergasted heck need million dollars need much retire maybe lot risk take real estate south korean asset example north toots horn louder make move country usd worth cents also denominated imf crisis dropped vis vis usd also make bad investment fall victim scam latest urllink good morning city project toast saw lady tv lost everyth

## 3. Merge the label columns

#### a. Label columns to merge: “gender”, “age”, “topic”, “sign”

In [10]:
df['labels'] = df.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)

#### b. After completing the previous step, there should be only two columns in the data frame, i.e. “text” and “labels”

In [11]:
df.head()

| | id | gender | age | topic | sign | date | text | labels |
|---|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | info found pages mb pdf files wait untill team... | [male, 15, Student, Leo] |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | team members drewes van der laag urllink mail ... | [male, 15, Student, Leo] |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | het kader van kernfusie op aarde maak je eigen... | [male, 15, Student, Leo] |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing testing | [male, 15, Student, Leo] |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | thanks yahoo toolbar capture urls popups means... | [male, 33, InvestmentBanking, Aquarius] |


## 4. Separate features and labels, and split the data into training and testing sets

In [12]:
df = df[['text','labels']]

In [13]:
df.head()

| | text | labels |
|---|---|---|
| 0 | info found pages mb pdf files wait untill team... | [male, 15, Student, Leo] |
| 1 | team members drewes van der laag urllink mail ... | [male, 15, Student, Leo] |
| 2 | het kader van kernfusie op aarde maak je eigen... | [male, 15, Student, Leo] |
| 3 | testing testing | [male, 15, Student, Leo] |
| 4 | thanks yahoo toolbar capture urls popups means... | [male, 33, InvestmentBanking, Aquarius] |


### Train-test split

In [14]:
X = df.text.values
y = df.labels.values

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

In [16]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

X_train shape: (4800,)
X_test shape: (1200,)
y_train shape: (4800,)
y_test shape: (1200,)


## 5. Vectorize the features

#### a. Create Bag of Words
- Use CountVectorizer
- Transform the training and testing data

In [17]:
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_cv = vectorizer.fit_transform(X_train)
X_test_cv = vectorizer.transform(X_test)

In [18]:
X_train_cv

<4800x303052 sparse matrix of type '<class 'numpy.int64'>'
	with 600421 stored elements in Compressed Sparse Row format>

In [19]:
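# Some feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions)
vectorizer.get_feature_names()[:5]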

['aa', 'aa amazing', 'aa anger', 'aa compared', 'aaa']
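
To see what `binary=True` and `ngram_range=(1, 2)` do, here is a small sketch on two made-up documents. Each column is a unigram or a bigram, and each cell records presence (1) or absence (0) rather than a count. Adding bigrams is what makes the vocabulary grow so quickly, which helps explain the 303,052 columns seen above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents for illustration.
docs = ["good morning city", "good morning"]

vec = CountVectorizer(binary=True, ngram_range=(1, 2))
matrix = vec.fit_transform(docs)

# vocabulary_ maps each unigram/bigram to its column index.
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(features)
print(matrix.toarray())
```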

#### b. Print the term-document matrix

In [20]:
X_train_cv.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
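
Calling `toarray()` materialises every zero, so it is worth estimating the size before doing this on a smaller machine. The arithmetic below is a rough sketch using the figures from the sparse matrix summary above; it assumes 8 bytes per `int64` entry and ignores array overhead.

```python
# Figures taken from the sparse matrix summary printed earlier.
n_rows, n_cols = 4800, 303052
stored = 600421
bytes_per_value = 8  # int64

dense_bytes = n_rows * n_cols * bytes_per_value
density = stored / (n_rows * n_cols)

print(f"dense size: {dense_bytes / 1e9:.1f} GB")
print(f"non-zero share: {density:.4%}")
```

Printing the sparse matrix itself, or a small slice such as `X_train_cv[:5].toarray()`, is a lighter way to inspect it.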

## 6. Create a dictionary to get the count of every label 

In [21]:
label_counts = dict()

for labels in df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1

In [22]:
# Print the dictionary
label_counts

{'male': 3301,
 '15': 341,
 'Student': 571,
 'Leo': 190,
 '33': 101,
 'InvestmentBanking': 70,
 'Aquarius': 329,
 'female': 2699,
 '14': 170,
 'indUnk': 2379,
 'Aries': 2483,
 '25': 268,
 'Capricorn': 84,
 '17': 773,
 'Gemini': 86,
 '23': 137,
 'Non-Profit': 47,
 'Cancer': 94,
 'Banking': 16,
 '37': 19,
 'Sagittarius': 709,
 '26': 101,
 '24': 353,
 'Scorpio': 850,
 '27': 637,
 'Education': 118,
 '45': 14,
 'Engineering': 119,
 'Libra': 416,
 'Science': 33,
 '34': 540,
 '41': 14,
 'Communications-Media': 61,
 'BusinessServices': 87,
 'Sports-Recreation': 75,
 'Virgo': 41,
 'Taurus': 651,
 'Arts': 31,
 'Pisces': 67,
 '44': 3,
 '16': 67,
 'Internet': 20,
 'Museums-Libraries': 2,
 'Accounting': 2,
 '39': 79,
 '35': 2307,
 'Technology': 2332,
 '36': 60,
 'Law': 3,
 '46': 7,
 'Consulting': 16,
 'Automotive': 14,
 '42': 9,
 'Religion': 4}
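
The same counts can be obtained with `collections.Counter` from the standard library. This is an alternative sketch on made-up label lists, not the code that produced the dictionary above.

```python
from collections import Counter
from itertools import chain

# Toy label lists in the same shape as the `labels` column.
toy_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "15", "Student", "Leo"],
    ["female", "33", "indUnk", "Aries"],
]

# Flatten the lists and count each label in one pass.
toy_counts = Counter(chain.from_iterable(toy_labels))
print(dict(toy_counts))
print(toy_counts.most_common(2))
```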

## 7. Convert your train and test labels using MultiLabelBinarizer

In [23]:
mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)

In [24]:
y_test

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0]])
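
A toy example makes the indicator matrix easier to read. In this sketch (made-up labels, default `MultiLabelBinarizer` settings) each column corresponds to one class in sorted order, and `inverse_transform` recovers the labels as tuples in that same order.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Two made-up label lists in the same shape as the `labels` column.
toy_labels = [
    ["male", "15", "Student", "Leo"],
    ["female", "33", "indUnk", "Aries"],
]

toy_mlb = MultiLabelBinarizer()
toy_y = toy_mlb.fit_transform(toy_labels)

print(list(toy_mlb.classes_))  # column order (sorted)
print(toy_y)
print(toy_mlb.inverse_transform(toy_y))
```

Because digits and uppercase letters sort before lowercase ones, the lowercase labels end up in the last columns.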

## 8. Choose a classifier 

Use a linear classifier and wrap it in OneVsRestClassifier, which trains one binary classifier per label.

In [25]:
clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)
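
The "one classifier per label" idea can be checked on a toy problem. The data below is made up; the sketch only shows that, after fitting, the wrapper holds one fitted estimator for each label column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Four toy samples with two features and two independent binary labels.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
Y_toy = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
ovr.fit(X_toy, Y_toy)

print(len(ovr.estimators_))      # one fitted LogisticRegression per label column
print(ovr.predict(X_toy).shape)  # same shape as Y_toy
```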

## 9. Fit the classifier, make predictions and get the accuracy

In [26]:
clf.fit(X_train_cv, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

### Predictions
- Get predicted labels and scores

In [27]:
predicted_labels = clf.predict(X_test_cv)
predicted_scores = clf.decision_function(X_test_cv)

In [28]:
predicted_labels

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0]])

In [29]:
predicted_scores

array([[-5.02082026, -5.00440503, -7.32154798, ..., -3.26660956,
        -2.73712702,  3.26660956],
       [-5.47768472, -5.67461942, -7.41961968, ..., -1.55636093,
        -1.01359938,  1.55636093],
       [-5.35760639, -4.38445409, -6.84714014, ..., -0.81900566,
        -2.00643058,  0.81900566],
       ...,
       [-5.53542828, -5.78260246, -6.0730621 , ...,  1.33523316,
        -0.16745297, -1.33523316],
       [-5.3806462 , -4.6569646 , -6.97511038, ..., -2.33434395,
        -2.17689375,  2.33434395],
       [-5.55893256, -3.83007099, -6.25210595, ...,  0.73506029,
        -1.439535  , -0.73506029]])
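
For logistic regression, the decision score is the log-odds of the label, so a score of 0 corresponds to a probability of 0.5. To my understanding, `predict` in this setup marks a label as present when its score is positive, which is consistent with the two outputs above. A minimal sketch of that relationship, using rounded example scores:

```python
import math

def sigmoid(score: float) -> float:
    """Logistic function: maps a decision score to a probability."""
    return 1.0 / (1.0 + math.exp(-score))

# A few scores of the kind shown above (rounded for illustration).
toy_scores = [-5.02, -2.74, 3.27, -0.17, 1.34]

for s in toy_scores:
    label = int(s > 0)  # zero threshold, i.e. probability above 0.5
    print(f"score={s:+.2f}  p={sigmoid(s):.3f}  label={label}")
```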

#### Get inverse transform for predicted labels and test labels

In [30]:
pred_inversed = mlb.inverse_transform(predicted_labels)
y_test_inversed = mlb.inverse_transform(y_test)

In [31]:
pred_inversed[:5]

[('35', 'Aries', 'Technology', 'male'),
 ('35', 'Aries', 'Technology', 'male'),
 ('35', 'Aries', 'Technology', 'male'),
 ('female', 'indUnk'),
 ('27', 'Taurus', 'female', 'indUnk')]

In [32]:
y_test_inversed[:5]

[('35', 'Aries', 'Technology', 'male'),
 ('27', 'Taurus', 'female', 'indUnk'),
 ('35', 'Aries', 'Technology', 'male'),
 ('25', 'Aries', 'Arts', 'male'),
 ('27', 'Taurus', 'female', 'indUnk')]

#### a. Print the following
- i. Accuracy score
- ii. F1 score
- iii. Average precision score
- iv. Average recall score

In [33]:
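def print_evaluation_scores(y_val, predicted):
    # For multi-label targets, accuracy_score is subset accuracy (all labels of a sample must match)
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    # average='micro' pools the decisions for all labels before computing each metric
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    print('Average precision score: ', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: ', recall_score(y_val, predicted, average='micro'))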
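
`average_precision_score` summarises a precision-recall curve, so it is normally given continuous scores (such as the output of `decision_function`) rather than hard 0/1 predictions. Plain micro-averaged precision is available separately as `precision_score`. The sketch below uses made-up toy values to show that the three quantities can differ; it does not recompute the notebook's results.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score

# Made-up indicator targets and decision scores: 4 samples x 2 labels.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
scores = np.array([[2.0, -1.0], [-0.5, 0.3], [0.1, -0.2], [-1.5, -2.0]])
hard = (scores > 0).astype(int)  # thresholded 0/1 predictions

ap_scores = average_precision_score(y_true, scores, average="micro")
ap_hard = average_precision_score(y_true, hard, average="micro")
micro_precision = precision_score(y_true, hard, average="micro")

print("AP from scores:          ", ap_scores)
print("AP from hard predictions:", ap_hard)
print("Micro precision:         ", micro_precision)
```

In the notebook, the analogous call would pass `predicted_scores` instead of `predicted_labels` to `average_precision_score`.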

In [34]:
print('Bag-of-words')
print_evaluation_scores(y_test, predicted_labels)

Bag-of-words
Accuracy score:  0.45416666666666666
F1 score:  0.6881618084473528
Average precision score:  0.5127814763445832
Average recall score:  0.6025
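
The accuracy is much lower than the F1 score because, for multi-label targets, accuracy only counts a post as correct when every one of its labels matches, while the micro-averaged metrics give credit for each individual label. A worked sketch on made-up toy values, computed by hand:

```python
# Toy indicator matrices: 3 samples x 3 labels (made-up values).
y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]

# Micro averaging: pool every (true, predicted) cell across samples and labels.
pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Subset accuracy: a sample counts only if its whole row matches.
subset_accuracy = sum(tr == pr for tr, pr in zip(y_true, y_pred)) / len(y_true)

print(precision, recall, f1, subset_accuracy)
```

Here only one of the three rows matches exactly, so subset accuracy is 1/3 even though four of the six true labels were found.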


## 10. Print the true and predicted labels for any five examples

In [35]:
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Title:	bling bling money green g cash dough bread ching ching dollahs benjamins get idea
True labels:	35,Aries,Technology,male
Predicted labels:	35,Aries,Technology,male


Title:	uh says economy urllink tanking financial services firms typically bell weathers money invested goes otoh hold true turn guys make margins activity kind movement
True labels:	27,Taurus,female,indUnk
Predicted labels:	35,Aries,Technology,male


Title:	feel like poop kidney infection icky mcick go class hour blaaaaaaaaaaaaaaaaah survive oh long know love know alive
True labels:	35,Aries,Technology,male
Predicted labels:	35,Aries,Technology,male


Title:	urllink whatchu mean scared nbsp picture taken carlene decided lil picnic yep picnic basket blanket everything went park mom used take sister young nice kick back relax watch penelope dog frolic around american flag dress carlene mom got philippines know spoiled really say face mean really
True labels:	25,Aries,Arts,male
Predicted labels:	female,indUnk


Title:	v