![alt text](https://drive.google.com/uc?export=view&id=1UXScsVx_Wni_JuDdB8LeTnM6jsPfIwkW)

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

### Package Version
- scikit-learn==0.22.2
- pandas==1.0.5
- nltk==3.2.5
- google==2.0.3

# Predict Author Features

### Download and load Blog Authorship Corpus
- Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at

### Dataset description
- You can find the dataset description here:
https://www.kaggle.com/rtatman/blog-authorship-corpus

#### Load the contents of zip file

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Read the csv using pandas

In [2]:
import pandas as pd

df = pd.read_csv('/content/drive/My Drive/blogtext.csv')

#### Get the names of the columns

In [3]:
df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

#### Have a look at some column values

In [4]:
df.head(5)

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |


#### Check if there is any null value, and get the total count.

In [5]:
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

### Cut the data (skip this step in final run)
Work on a small subset of the data during development, so that the whole pipeline runs quickly and you can find and fix errors fast.
Once everything works, skip this step, load the entire dataset, and run your code on it.

In [6]:
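# .copy() makes the subset an independent DataFrame, so later column
# assignments do not raise SettingWithCopyWarning
df = df.head(3000).copy()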
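As an alternative during development, the subset can be taken at load time rather than after the full file is in memory. The sketch below is only an illustration: it reads from an in-memory CSV string with made-up rows, and `DEV_ROWS` is a hypothetical name, not part of the original notebook. Setting it to `None` would read the whole file.

```python
import io
import pandas as pd

# Made-up stand-in for blogtext.csv so the sketch runs anywhere.
csv_text = (
    "id,gender,text\n"
    "1,male,first post\n"
    "2,female,second post\n"
    "3,male,third post\n"
)

DEV_ROWS = 2  # hypothetical development limit; use None for the final run

# nrows limits how many data rows pandas parses from the file.
dev_df = pd.read_csv(io.StringIO(csv_text), nrows=DEV_ROWS)
print(dev_df.shape)
```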

## Preprocess text
Preprocess the values of the `text` column:

- Remove unwanted characters
- Convert text to lowercase
- Remove unwanted spaces
- Remove stopwords

In [7]:
# Keep only letters: replace every run of other characters with a single space
import re
df.text = df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))

# Convert text to lowercase
df.text = df.text.apply(lambda x: x.lower())

# Strip unwanted spaces
df.text = df.text.apply(lambda x: x.strip())

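# Remove stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# Use a separate name so the imported `stopwords` module is not overwritten
# (overwriting it breaks the cell when it is run a second time)
stop_words = set(stopwords.words('english'))
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))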

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
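The four preprocessing steps can also be sketched as a single function applied to one string, which makes them easy to check on a small input. This is a minimal sketch, not the notebook's own code: `preprocess` and `SAMPLE_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for the NLTK English list so that nothing has to be downloaded.

```python
import re

# Hypothetical stand-in for the NLTK English stopword list.
SAMPLE_STOPWORDS = {"the", "is", "a", "to", "and", "of", "in"}

def preprocess(text, stop_words=SAMPLE_STOPWORDS):
    """Apply the four preprocessing steps to a single string."""
    text = re.sub(r"[^A-Za-z]+", " ", text)  # replace non-letter runs with a space
    text = text.lower().strip()              # lowercase, then trim the edges
    # Drop stopwords; split() also collapses any repeated spaces.
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = preprocess("  The price of Coke in 2004 is $1.50, and rising!  ")
print(cleaned)  # price coke rising
```

Lowercasing before the stopword filter matters because the stopword list is lowercase; "The" would otherwise survive the filter.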


Verify the preprocessing steps by looking over some values

In [8]:
df.text[6]

'somehow coca cola way summing things well early flagship jingle like buy world coke tune like teach world sing pretty much summed post woodstock era well add much sales catchy tune korea coke theme urllink stop thinking feel pretty much sums lot korea koreans look relaxed couple stopped thinking started feeling course high regard education math logic deep think many koreans really like work emotion anything else westerners seem sublimate moreso least display different way maybe scratch westerners koreans probably pretty similar context different anyways think losing korea repeat stop thinking feel stop thinking feel stop thinking feel everything alright'

### Merge the label columns

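Merge the four label columns (`gender`, `age`, `topic`, `sign`) into a single list per row, so that every blog post carries all of its tags in one place. The age is converted to a string so that all labels share one type.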

In [9]:
df['labels'] = df.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)
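The merge can be illustrated on a toy frame. The sketch below uses made-up rows and an alternative to the row-wise `apply`: it casts the label columns to strings and converts them to a list of lists in one step. Casting `age` to a string matters later, because a label set that mixes integers and strings cannot be sorted in Python 3.

```python
import pandas as pd

# Made-up rows with the same label columns as the blog corpus.
toy = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 33],
    "topic": ["Student", "Science"],
    "sign": ["Leo", "Aquarius"],
})

label_cols = ["gender", "age", "topic", "sign"]

# Cast every label to str, then turn each row into a Python list.
toy["labels"] = toy[label_cols].astype(str).values.tolist()
print(toy["labels"].tolist())
```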

### Select only required columns from your dataframe

In [10]:
df = df[['text','labels']]

### Print final dataframe

In [11]:
df.head()

| | text | labels |
|---|---|---|
| 0 | info found pages mb pdf files wait untill team... | [male, 15, Student, Leo] |
| 1 | team members drewes van der laag urllink mail ... | [male, 15, Student, Leo] |
| 2 | het kader van kernfusie op aarde maak je eigen... | [male, 15, Student, Leo] |
| 3 | testing testing | [male, 15, Student, Leo] |
| 4 | thanks yahoo toolbar capture urls popups means... | [male, 33, InvestmentBanking, Aquarius] |


## Create training and testing data

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.text.values, df.labels.values, test_size=0.20, random_state=42)

## Vectorize the data

### Create Bag of Words
- Use CountVectorizer
- Transform the training and testing data

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
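To see what `binary=True` and `ngram_range=(1, 2)` do, it helps to run the same settings on a tiny corpus. The two documents below are made up for illustration. With these settings each column is a unigram or a bigram, and each cell records presence (1) or absence (0) rather than a count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up two-document corpus.
docs = ["stop thinking feel", "stop thinking stop thinking"]

toy_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
toy_bow = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each n-gram to its column; columns follow sorted order.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
# ['feel', 'stop', 'stop thinking', 'thinking', 'thinking feel', 'thinking stop']
print(toy_bow.toarray())
# [[1 1 1 1 1 0]
#  [0 1 1 1 0 1]]
```

Although "stop" appears twice in the second document, its cell is 1 because of `binary=True`.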

#### Have a look at some feature names

In [14]:
vectorizer.get_feature_names()[:5]

['aa', 'aa compared', 'aa nice', 'aaa', 'aaa take']

#### View term-document matrix

In [15]:
X_train_bow.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
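The output is almost entirely zeros because a bag-of-words matrix is sparse: each document uses only a small part of the vocabulary. Calling `toarray()` on the full corpus may need a lot of memory, so it can be safer to inspect the sparse matrix directly and densify only a slice. A minimal sketch on a made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up three-document corpus.
docs = ["stop thinking feel", "korea coke theme", "stop thinking"]
sparse_bow = CountVectorizer(binary=True).fit_transform(docs)

n_rows, n_cols = sparse_bow.shape
# nnz is the number of stored non-zero entries.
density = sparse_bow.nnz / (n_rows * n_cols)
print(sparse_bow.shape, sparse_bow.nnz, round(density, 3))

# Densify only the first row instead of the whole matrix.
first_row_dense = sparse_bow[:1].toarray()
print(first_row_dense)
```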

### Create a dictionary to get label counts

In [16]:
label_counts = dict()

for labels in df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1
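The same counting can be sketched with `collections.Counter` from the standard library, which handles the "first time seen" case itself. The label lists below are made up, and `toy_counts` is a hypothetical name.

```python
from collections import Counter
from itertools import chain

# Made-up label lists in the same shape as df.labels.values.
toy_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
    ["female", "15", "Student", "Leo"],
]

# Flatten the nested lists and count every label in one pass.
toy_counts = Counter(chain.from_iterable(toy_labels))
print(toy_counts["male"], toy_counts["female"])  # 2 1
```

A `Counter` is a `dict` subclass, so `sorted(toy_counts.keys())` works the same way as with a plain dictionary.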

#### Print the dictionary

In [17]:
label_counts

{'14': 74,
 '15': 299,
 '16': 25,
 '17': 147,
 '23': 93,
 '24': 334,
 '25': 110,
 '26': 43,
 '27': 86,
 '33': 94,
 '34': 6,
 '35': 1607,
 '37': 19,
 '39': 32,
 '41': 14,
 '44': 3,
 '45': 14,
 'Accounting': 2,
 'Aquarius': 286,
 'Aries': 1699,
 'Arts': 2,
 'Banking': 16,
 'BusinessServices': 21,
 'Cancer': 76,
 'Capricorn': 77,
 'Communications-Media': 14,
 'Education': 118,
 'Engineering': 119,
 'Gemini': 21,
 'Internet': 20,
 'InvestmentBanking': 70,
 'Leo': 55,
 'Libra': 313,
 'Museums-Libraries': 2,
 'Non-Profit': 46,
 'Pisces': 2,
 'Sagittarius': 113,
 'Science': 33,
 'Scorpio': 243,
 'Sports-Recreation': 75,
 'Student': 403,
 'Taurus': 76,
 'Technology': 1607,
 'Virgo': 39,
 'female': 728,
 'indUnk': 452,
 'male': 2272}

## Multi label binarizer

Load a multilabel binarizer and fit it on the labels.

In [18]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)
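A small made-up example shows what the binarizer produces: one column per class, in the order given by `classes`, with a 1 wherever the sample carries that label. Passing the full class list up front, as above, keeps a column for every label even if it never occurs in the training split.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up label lists.
toy_y = [["male", "15", "Leo"], ["female", "33", "Leo"]]

# Sorted strings: digits come before uppercase, uppercase before lowercase.
toy_classes = sorted({label for labels in toy_y for label in labels})
toy_mlb = MultiLabelBinarizer(classes=toy_classes)
toy_Y = toy_mlb.fit_transform(toy_y)

print(list(toy_mlb.classes_))  # ['15', '33', 'Leo', 'female', 'male']
print(toy_Y)
# [[1 0 1 0 1]
#  [0 1 1 1 0]]

# inverse_transform maps indicator rows back to label tuples.
print(toy_mlb.inverse_transform(toy_Y))
```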

## Classifier

Use a linear classifier of your choice and wrap it in `OneVsRestClassifier`, which trains one binary classifier per label.

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)

### Fit the classifier

In [None]:
clf.fit(X_train_bow, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

## Make predictions
- Get predicted labels and scores

In [None]:
predicted_labels = clf.predict(X_test_bow)
predicted_scores = clf.decision_function(X_test_bow)
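The relationship between the two outputs can be checked on toy data. In this sketch the features and labels are made up, and the names `X_toy`, `Y_toy`, and `toy_clf` are hypothetical. For a base estimator that exposes `decision_function`, such as logistic regression, the predicted labels should match the scores thresholded at zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up data: two binary features, two labels; label j follows feature j.
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])
Y_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

# One logistic regression is fitted per label column.
toy_clf = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
toy_clf.fit(X_toy, Y_toy)

toy_pred = toy_clf.predict(X_toy)              # 0/1 indicator matrix
toy_scores = toy_clf.decision_function(X_toy)  # signed score per label

# A positive score corresponds to a predicted label.
print(np.array_equal(toy_pred, (toy_scores > 0).astype(int)))
```

Keeping the scores is useful when a threshold other than zero is wanted, or for ranking-based metrics.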

### Get inverse transform for predicted labels and test labels

In [None]:
pred_inversed = mlb.inverse_transform(predicted_labels)
y_test_inversed = mlb.inverse_transform(y_test)

### Print some samples

In [None]:
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Title:	pink already done sure phoenix tho
True labels:	35,Aries,Technology,male
Predicted labels:	35,Aries,Technology,male


Title:	woohoo tomorrow probably means need clean place bit uh oh since jen going get car able least grab furniture somewhere sit timed perfectly pool opening weekend slaving away work lie pool fight cicadas got hdtv cable hookup unfortunately demand function seems broken meantime get early today someone come look brighter side things techie managed blag two u dual piii machines free swap mobo reasonable case hook home idea use yet betting munch seti units meantime figure one last thing hell alt gr button us keyboards annoys type alt euro symbol remember code accented e etc know could change keymap uk whatever alt gr option something end rant
True labels:	25,Aries,Internet,male
Predicted labels:	male


Title:	actually johnathan called late last night sounding groggy thanking something could barely make innane babble wondering saying hoped guess name
True labels:	3

## Calculate accuracy
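- Accuracy (for multilabel targets, `accuracy_score` is the exact-match ratio: a sample counts as correct only if every one of its labels is predicted correctly)
- F1-score (micro-averaged)
- Average precision (`average_precision_score` summarizes the precision-recall curve and is not the same as plain precision; here it is computed from the binary predictions rather than from scores)
- Recall (micro-averaged)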

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

def print_evaluation_scores(y_val, predicted):
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    print('Average precision score: ', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: ', recall_score(y_val, predicted, average='micro'))
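The function above reports `average_precision_score`, which is usually computed from continuous scores such as `predicted_scores`. If plain micro-averaged precision is wanted as the counterpart to `recall_score`, scikit-learn's `precision_score` can be used. The sketch below uses made-up indicator matrices small enough to check by hand; it is an illustration, not a replacement for the results reported in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up indicator matrices: 2 samples, 3 labels.
y_true_toy = np.array([[1, 0, 1], [0, 1, 1]])
y_pred_toy = np.array([[1, 0, 0], [0, 1, 1]])

# Exact-match ratio: only the second row matches in full.
subset_acc = accuracy_score(y_true_toy, y_pred_toy)                  # 0.5
# Micro-averaging pools all label decisions: TP=3, FP=0, FN=1.
micro_p = precision_score(y_true_toy, y_pred_toy, average="micro")   # 3/3 = 1.0
micro_r = recall_score(y_true_toy, y_pred_toy, average="micro")      # 3/4 = 0.75
micro_f1 = f1_score(y_true_toy, y_pred_toy, average="micro")         # 6/7

print(subset_acc, micro_p, micro_r, round(micro_f1, 4))
```

The gap between the exact-match accuracy and the micro-averaged scores comes from the same effect seen in the notebook's results: one wrong label out of several makes the whole sample count as a miss for accuracy.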

In [None]:
print('Bag-of-words')
print_evaluation_scores(y_test, predicted_labels)

Bag-of-words
Accuracy score:  0.5233333333333333
F1 score:  0.7215575885526625
Average precision score:  0.5596074546125939
Average recall score:  0.6408333333333334
