In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords

### Loading DataSet

In [2]:
data= pd.read_csv('blogtext.csv')
data.head(2)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...


In [3]:
# Null Values
data.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [4]:
# Shape
data.shape

(681284, 7)

In [5]:
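# Temporary cell
# The full dataset is large, so keep only the first 5,000 rows to speed up initial preprocessing and modelling.
# .copy() makes the subset an independent DataFrame rather than a slice of the original.

data = data.head(5000).copy()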

## Preprocessing the data

### Preprocess rows of the “text” column
- a. Remove unwanted characters
- b. Convert text to lowercase
- c. Remove unwanted spaces
- d. Remove stopwords

In [6]:
## Before Preprocessing
data['text'][6]

"             Somehow Coca-Cola has a way of summing up things so well.  In the early 1970s they had as their flagship jingle 'I'd Like to Buy the World a Coke' (to the tune of 'I'd Like to Teach the World to Sing') that pretty much summed up the post-Woodstock era so well.  It didn't add much to sales, but it was a catchy tune.  In Korea Coke's theme is  urlLink Stop Thinking. Feel it.  which pretty much sums up a lot about Korea and Koreans.  (Look at how relaxed that couple is, now that they stopped thinking and started feeling.) Of course they have a high regard for education and math and logic and such, but deep down I think many Koreans really like to work on emotion more than anything else.  Westerners seem to sublimate this moreso, or at least display it in a different way.  Maybe scratch all that...Westerners and Koreans are probably pretty similar, but the context in which we do it is different.  Anyways, if you think you're losing it in Korea just repeat to yourself 'Stop th

In [7]:
## Preprocessing

# Build the stopword set once instead of calling stopwords.words('english') for every word
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = re.sub('[^a-zA-Z]+', ' ', text)  # replace runs of non-letters with a space
    text = text.lower().strip()             # lowercase and trim surrounding spaces
    return ' '.join(word for word in text.split() if word not in stop_words)

# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# that element-wise chained assignment (data['text'][i] = ...) triggers
data = data.assign(text=data['text'].apply(preprocess))

In [8]:
## After Preprocessing
data['text'][6]

'somehow coca cola way summing things well early flagship jingle like buy world coke tune like teach world sing pretty much summed post woodstock era well add much sales catchy tune korea coke theme urllink stop thinking feel pretty much sums lot korea koreans look relaxed couple stopped thinking started feeling course high regard education math logic deep think many koreans really like work emotion anything else westerners seem sublimate moreso least display different way maybe scratch westerners koreans probably pretty similar context different anyways think losing korea repeat stop thinking feel stop thinking feel stop thinking feel everything alright'
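
The same four cleaning steps can also be expressed with pandas' vectorized string methods. The sketch below is only illustrative: the `toy` Series and the three-word `STOP_WORDS` set are assumptions standing in for the blog text and NLTK's English stopword list.

```python
import pandas as pd

# Hypothetical stand-ins: a tiny stopword set and two toy documents.
STOP_WORDS = {"the", "a", "of"}
toy = pd.Series(["  The price of a Coke: $1.50!  ", "Stop Thinking. Feel it."])

cleaned = (
    toy.str.replace(r"[^a-zA-Z]+", " ", regex=True)  # runs of non-letters -> one space
       .str.lower()                                   # lowercase
       .str.strip()                                   # trim surrounding spaces
       .apply(lambda s: " ".join(w for w in s.split() if w not in STOP_WORDS))
)
print(cleaned.tolist())
```

Digits and punctuation disappear in the first step, which is why years such as "1970s" leave only a stray "s" or nothing at all in the cleaned text.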

### Merging multiple columns into a single variable (dependent variable)

In [9]:
data['Labels'] = data.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)

In [10]:
## Selecting only the required columns 

df= data[['text','Labels']]
df.head(2)

Unnamed: 0,text,Labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"


### Train Test Split

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'].values, df['Labels'].values, test_size=0.20, random_state=100)

### Data Vectorizing 

In [12]:
## Creating a Bag of Words using count vectorizer

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

In [13]:
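# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (1.0+) replaces it
vectorizer.get_feature_names_out()[:5].tolist()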

['aa', 'aa amazing', 'aa compared', 'aa nice', 'aaa']

In [14]:
## Document-term matrix: one row per document, one column per unigram/bigram feature
X_train_bow.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
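
To make the effect of `binary=True` and `ngram_range=(1, 2)` concrete, here is a minimal sketch on a two-document toy corpus (the documents are made up, not taken from the dataset). It reads the feature names from `vocabulary_` so that it does not depend on which feature-name method the installed scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stop thinking feel", "stop stop thinking"]  # toy corpus, not from the dataset

vec = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vec.fit_transform(docs)          # sparse matrix: one row per document

features = sorted(vec.vocabulary_)   # vocabulary_ maps each n-gram to its column index
print(features)
print(X.toarray())
```

The second document contains "stop" twice, but with `binary=True` its entry is still 1: the matrix records presence, not frequency.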

### Dictionary to get label counts

In [15]:
label_counts = dict()

for Labels in df.Labels.values:
    for label in Labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1
            
label_counts

{'male': 3294,
 '15': 339,
 'Student': 569,
 'Leo': 190,
 '33': 101,
 'InvestmentBanking': 70,
 'Aquarius': 329,
 'female': 1706,
 '14': 170,
 'indUnk': 1381,
 'Aries': 2483,
 '25': 268,
 'Capricorn': 84,
 '17': 331,
 'Gemini': 86,
 '23': 137,
 'Non-Profit': 47,
 'Cancer': 94,
 'Banking': 16,
 '37': 19,
 'Sagittarius': 704,
 '26': 96,
 '24': 353,
 'Scorpio': 408,
 '27': 86,
 'Education': 118,
 '45': 14,
 'Engineering': 119,
 'Libra': 414,
 'Science': 33,
 '34': 540,
 '41': 14,
 'Communications-Media': 61,
 'BusinessServices': 87,
 'Sports-Recreation': 75,
 'Virgo': 41,
 'Taurus': 100,
 'Arts': 31,
 'Pisces': 67,
 '44': 3,
 '16': 67,
 'Internet': 20,
 'Museums-Libraries': 2,
 'Accounting': 2,
 '39': 79,
 '35': 2307,
 'Technology': 2332,
 '36': 60,
 'Law': 3,
 '46': 7,
 'Consulting': 16,
 'Automotive': 14,
 '42': 9,
 'Religion': 4}
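
The counting loop above can be written more compactly with `collections.Counter`. This is a small sketch on made-up label lists shaped like the `Labels` column; `toy_labels` and `toy_counts` are illustrative names only.

```python
from collections import Counter
from itertools import chain

# Toy label lists in the same shape as the Labels column (values are made up).
toy_labels = [["male", "15", "Student", "Leo"],
              ["male", "33", "Student", "Aries"],
              ["female", "15", "Arts", "Leo"]]

# Flatten the per-row lists and count every tag in one pass.
toy_counts = Counter(chain.from_iterable(toy_labels))
print(toy_counts.most_common(3))
```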

### Multi Label Binarizer 

Each example in this task carries several tags (gender, age, topic and sign), so the Y variable is a multi-label target. To train on it, the label lists are converted to binary form, and each prediction becomes a mask of 0s and 1s with one position per tag. scikit-learn's MultiLabelBinarizer is a convenient way to do this conversion.

In [16]:
## Before Binarization
y_train[:2]

array([list(['female', '42', 'Consulting', 'Leo']),
       list(['female', '26', 'Accounting', 'Aquarius'])], dtype=object)

In [17]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)

In [18]:
## After Binarization
y_train[:2]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0]])
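
The binary mask is easier to read on a small example. The sketch below uses two made-up tag lists and lets `MultiLabelBinarizer` infer the classes, which it sorts; it also shows `inverse_transform`, which maps a mask back to tuples of tags.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_tags = [["female", "26", "Leo"], ["male", "35", "Aries"]]  # made-up tags

toy_mlb = MultiLabelBinarizer()          # classes are inferred and sorted when not given
mask = toy_mlb.fit_transform(toy_tags)

print(toy_mlb.classes_)                  # column order of the mask
print(mask)
print(toy_mlb.inverse_transform(mask))   # back to tuples of tags
```

Because the sort is by character code, digits come first, then capitalized names, then lowercase ones, which is why `female`, `indUnk` and `male` occupy the last columns of the masks in this notebook.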

### Classifier

In this task I use the one-vs-rest approach, which scikit-learn implements in the OneVsRestClassifier class. It trains k independent binary classifiers, one per tag, where k is the number of tags (54 here).

In [19]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

In [20]:
clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)

In [21]:
clf.fit(X_train_bow, y_train)
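
As a quick check of the "one classifier per tag" idea, this sketch fits the same kind of model on a tiny made-up multi-label problem and counts the fitted estimators. `X_toy` and `Y_toy` are illustrative values, not data from the blog corpus.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 6 samples, 2 features, 3 tags (all values are made up).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [0, 2]], dtype=float)
Y_toy = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1],
                  [0, 0, 1],
                  [1, 1, 0]])

ovr = OneVsRestClassifier(LogisticRegression(solver='lbfgs'))
ovr.fit(X_toy, Y_toy)

print(len(ovr.estimators_))   # one fitted LogisticRegression per tag column
```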

### Getting predicted labels and scores

In [22]:
predicted_labels = clf.predict(X_test_bow)
predicted_scores = clf.decision_function(X_test_bow)
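
The two outputs are closely related: with a base estimator that exposes `decision_function`, as `LogisticRegression` does, the predicted mask appears to be the decision scores thresholded at 0. The sketch below checks this on made-up toy data; it illustrates the relationship and is not a statement about every estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy multi-label data (made up): 6 samples, 2 features, 3 tags.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [0, 2]], dtype=float)
Y_demo = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1],
                   [0, 1, 1], [0, 0, 1], [1, 1, 0]])

model = OneVsRestClassifier(LogisticRegression(solver='lbfgs')).fit(X_demo, Y_demo)

scores = model.decision_function(X_demo)   # signed score, one column per tag
labels = model.predict(X_demo)             # 0/1 mask, same shape as scores

# A tag is predicted exactly where its score is positive.
same = np.array_equal(labels, (scores > 0).astype(int))
print(same)
```

This also explains rows with very few predicted tags: a tag is emitted only when its own classifier scores above 0, and nothing forces one age, one topic and one sign to be chosen per example.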

In [23]:
predicted_labels[:2]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])

In [24]:
predicted_scores[:2]

array([[ -4.35763291,  -4.16537719,  -6.47815738,  -4.56217586,
         -2.05272526,  -4.69083938,  -2.26838023,  -4.65678327,
         -5.4042077 ,  -6.96797421,  -5.32000014,  -8.56615387,
         -8.54807924,  -7.49693435,  -5.08428146,  -8.2857153 ,
         -8.48265852, -10.04151287,  -6.93716006,  -8.29744248,
        -10.86793853,  -2.46483144,  -7.15782527,  -6.28503204,
         -8.35432488,  -7.24938042,  -6.58330536,  -4.7580706 ,
         -6.90778002,  -5.01287559,  -7.04468294,  -5.10962176,
         -5.32402546,  -5.51613628,  -7.5710669 ,  -8.08980942,
        -10.72837661,  -4.54511632,  -1.66671687,  -9.71072468,
         -5.60414139,  -8.53584986, -10.32915179,  -1.73194196,
         -7.48690553,  -4.28866398,  -7.23585881,  -3.32036021,
         -5.21967627,  -8.918888  ,  -7.45723978,   1.55967097,
          0.82457309,  -1.55967097],
       [ -6.32021202,  -6.08873491,  -6.72941572,  -6.33720074,
         -6.23390015,  -5.76439985,  -4.96958885,  -5.69148203,
   

### Get inverse transform for predicted labels and test labels

In [25]:
pred_inversed = mlb.inverse_transform(predicted_labels)
y_test_inversed = mlb.inverse_transform(y_test)

In [26]:
pred_inversed[:2]

[('female', 'indUnk'), ('male',)]

In [27]:
y_test_inversed[:2]

[('23', 'Sagittarius', 'indUnk', 'male'),
 ('35', 'Aries', 'Technology', 'male')]

### Sample Predictions

In [28]:
for i in range(3):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Title:	people live one place anymore nbsp got morning erik called moving austin helped moving frenzy drove austin u haul one tires erik car exploded bastrop area got back college station went revolution inok farewell night muddy hot music enjoyable went ihop found really crowded good yupe time semester people say goodbyes old friends hellos new friends wish lucks friends leaving places new beginning careers lifes always college station nbsp nbsp nbsp nbsp know
True labels:	23,Sagittarius,indUnk,male
Predicted labels:	female,indUnk


Title:	thought would share something online journal bouncy bouncy thinking johnathan friends life feel ways accomplished little remembered johnathan probably smartest person know started master program spending next five years getting phd inspiring english professor see imagine anything else yet years old think less think sometimes takes longer us reach destination perhaps find right fork road
True labels:	35,Aries,Technology,male
Predicted labels:	male


T

## Calculate accuracy
- Accuracy
- F1-score
- Precision
- Recall

In [29]:
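from sklearn.metrics import accuracy_score, f1_score, average_precision_score, recall_score

def print_evaluation_scores(y_val, predicted):
    # For multi-label targets, accuracy_score is subset accuracy: a sample counts only if every tag matches
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    # Note: average_precision_score summarizes a precision-recall curve and normally takes continuous
    # scores (e.g. decision_function output); it is not the plain precision that precision_score returns
    print('Average precision score: ', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: ', recall_score(y_val, predicted, average='micro'))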


In [30]:
print('Bag-of-words')
print_evaluation_scores(y_test, predicted_labels)

Bag-of-words
Accuracy score:  0.534
F1 score:  0.7354712225178547
Average precision score:  0.5743043616682586
Average recall score:  0.6565
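
The "Average precision score" above comes from `average_precision_score`, which is a different quantity from plain precision. The sketch below, on small made-up arrays, computes micro-averaged precision with `precision_score` next to the other metrics, and passes continuous scores (here a hypothetical `y_score`) to `average_precision_score`, which is the input that metric is designed for.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, precision_score, recall_score)

# Made-up multi-label targets and predictions: 3 samples, 3 tags.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1]])
# Made-up decision scores; positive entries correspond to the 1s in y_pred.
y_score = np.array([[2.0, -1.0, -0.5], [-1.5, 1.0, -2.0], [0.5, -0.2, 0.3]])

subset_acc = accuracy_score(y_true, y_pred)                  # exact-match rows only
micro_p = precision_score(y_true, y_pred, average='micro')   # TP / (TP + FP)
micro_r = recall_score(y_true, y_pred, average='micro')      # TP / (TP + FN)
micro_f1 = f1_score(y_true, y_pred, average='micro')
ap = average_precision_score(y_true, y_score, average='micro')  # uses scores, not masks

print(subset_acc, micro_p, micro_r, micro_f1, ap)
```

In this toy case there are 3 true positives, 1 false positive and 2 false negatives, so micro precision is 3/4, micro recall is 3/5 and micro F1 is 2/3, while only one of the three rows matches exactly, giving a subset accuracy of 1/3.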
