## Project B

Project description: Classification is probably the most common task you will deal with in practice. Text in the form of blogs, posts, articles, etc. is written every second, and it is a challenge to predict information about a writer without knowing anything about them. We are going to build a classifier that predicts multiple attributes of the author of a given text. We have framed it as a multi-label classification problem.


## Dataset

Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from
blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person.


Each blog is presented as a separate file, the name of which indicates a blogger id# and the
blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and
age but for many, industry and/or sign is marked as unknown.)
All bloggers included in the corpus fall into one of three age groups:
- 8240 "10s" blogs (ages 13-17)
- 8086 "20s" blogs (ages 23-27)
- 2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
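
The three age groups above can be turned into a coarser label than the raw age. The following is a minimal sketch, assuming the documented ranges are exact; the helper name `age_group` is hypothetical and is not part of the dataset or the notebook.

```python
def age_group(age: int) -> str:
    """Map a raw age to the corpus age group described above (hypothetical helper)."""
    if 13 <= age <= 17:
        return "10s"
    if 23 <= age <= 27:
        return "20s"
    if 33 <= age <= 47:
        return "30s"
    # Ages outside the documented ranges are flagged rather than guessed.
    return "unknown"


print([age_group(a) for a in (15, 24, 33, 47, 20)])
```

Using three groups instead of individual ages would reduce the number of age labels the classifier has to separate, at the cost of a less precise prediction.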


Each blog in the corpus includes at least 200 occurrences of common English words. All formatting
has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post and links within a post are denoted by the label urllink.
Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus

## Table of Contents

1. Load the dataset (5 points)
a. Tip: As the dataset is large, use fewer rows. Check what is working well on your
machine and decide accordingly.

2. Preprocess rows of the “text” column (7.5 points)
a. Remove unwanted characters
b. Convert text to lowercase
c. Remove unwanted spaces
d. Remove stopwords

3. As we want to make this into a multi-label classification problem, you are required to merge
all the label columns together, so that we have all the labels together for a particular sentence
(7.5 points)
a. Label columns to merge: “gender”, “age”, “topic”, “sign”
b. After completing the previous step, there should be only two columns in your data
frame i.e. “text” and “labels” as shown in the below image

4. Separate features and labels, and split the data into training and testing (5 points)

5. Vectorize the features (5 points)
a. Create a Bag of Words using count vectorizer
i. Use ngram_range=(1, 2)
ii. Vectorize training and testing features
b. Print the term-document matrix

6. Create a dictionary to get the count of every label i.e. the key will be label name and value will
be the total count of the label. Check below image for reference (5 points)

7. Transform the labels - (7.5 points)
As noted before, each example in this task can have multiple tags. To handle this kind of
prediction, we need to transform the labels into binary form, so that the prediction is
a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.
a. Convert your train and test labels using MultiLabelBinarizer

8. Choose a classifier - (5 points)
In this task, we suggest using the One-vs-Rest approach, which is implemented in the
OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As a
basic classifier, use LogisticRegression. It is one of the simplest methods, but it often
performs well enough in text classification tasks. Training might take some time because the
number of classifiers to train is large.
a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on
every label
b. As One-vs-Rest approach might not have been discussed in the sessions, we are
providing you the code for that

9. Fit the classifier, make predictions and get the accuracy (5 points)
a. Print the following
i. Accuracy score
ii. F1 score
iii. Average precision score
iv. Average recall score
v. Tip: Make sure you are familiar with all of them. How would you expect the
things to work for the multi-label scenario? Read about micro/macro/weighted
averaging

10. Print true label and predicted label for any five examples (7.5 points)
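
The One-vs-Rest approach described in item 8 can be illustrated on a toy problem. This is only a sketch: the feature matrix and the three tags are made up, and the default `LogisticRegression` settings are an assumption rather than a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 6 samples, 2 features, 3 binary tags per sample (made up for illustration).
X = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0], [0.5, 0.5], [0.1, 0.8]])
Y = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
])

ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, Y)

# One binary classifier is fitted per tag, i.e. per column of Y.
print(len(ovr.estimators_))

# The prediction is a 0/1 mask with one column per tag.
print(ovr.predict(X).shape)
```

Because each tag gets its own independent classifier, nothing forces a prediction to contain exactly one gender, one age, one topic, and one sign.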

## 1. Import Libraries

Let us start by mounting the drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Let us import the libraries and check the installed TensorFlow version.

In [None]:
# used to suppress display of warnings
import warnings

# os provides operating-system-dependent functionality
# We use it for setting the working folder
import os

# re and string are used for text cleaning
import re
import string

# gc lets us trigger garbage collection between memory-heavy steps
import gc
gc.enable()

# Pandas is used for data manipulation and analysis
import pandas as pd

# Numpy is used for large, multi-dimensional arrays and matrices, along with mathematical operators on these arrays
import numpy as np

# Matplotlib is a data visualization library for 2D plots of arrays, built on NumPy arrays
# and designed to work with the broader SciPy stack
import matplotlib.pyplot as plt
%matplotlib inline

# Seaborn is based on matplotlib, which aids in drawing attractive and informative statistical graphics.
import seaborn as sns

import nltk
import spacy
import tensorflow

# silence pandas' chained-assignment warning
pd.options.mode.chained_assignment = None

print(tensorflow.__version__)

2.4.1


## 2. Setting Options

In [None]:
# suppress display of warnings
warnings.filterwarnings('ignore')

# display all dataframe columns
pd.options.display.max_columns = None

# display floats with 7 decimal places
pd.options.display.float_format = '{:.7f}'.format

# display all dataframe rows
pd.options.display.max_rows = None

## 3. Read Data

### 3.1 Read the provided CSVs and check 5 random samples and shape to understand the datasets

In [None]:
# For Colabs Setup
from google.colab import files
files.upload() #upload kaggle.json

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [None]:
# Download the dataset directly from Kaggle using the kaggle package
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
# restrict permissions on the API token, as the kaggle CLI expects
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d rtatman/blog-authorship-corpus
!unzip -q blog-authorship-corpus.zip

kaggle.json
Downloading blog-authorship-corpus.zip to /content
 99% 286M/290M [00:03<00:00, 106MB/s]
100% 290M/290M [00:03<00:00, 98.8MB/s]


In [None]:
!ls ~/.kaggle

kaggle.json


In [None]:
input_path = "/content/blogtext.csv"

# Load only the first 50,000 of the ~681k rows to keep memory use manageable
pd_blog = pd.read_csv(input_path, nrows=50000, index_col=False)

pd_blog.shape

(50000, 7)
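
Reading the first `nrows` rows takes whatever bloggers happen to come first in the file. One alternative is to sample a fraction of every chunk while streaming the file. The sketch below uses a tiny in-memory CSV as a stand-in for `blogtext.csv`; the chunk size and sampling fraction are assumptions to tune for your machine.

```python
import io

import pandas as pd

# Stand-in for blogtext.csv: a tiny in-memory CSV (made-up rows for illustration).
csv_text = "id,text\n" + "\n".join(f"{i},post {i}" for i in range(1000))

# Stream the file in chunks of 200 rows and keep a random 10% of each chunk.
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=200)
sample = pd.concat(chunk.sample(frac=0.1, random_state=42) for chunk in chunks)

print(sample.shape)  # (100, 2): 5 chunks, 20 sampled rows each
```

For the real file, `io.StringIO(csv_text)` would be replaced by `input_path`; only one chunk plus the accumulated sample is held in memory at a time.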

In [None]:
pd_blog.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [None]:
# Remove fields id and date 
pd_blog.drop(labels=['id','date'], axis=1,inplace=True)


In [None]:
pd_blog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   gender  50000 non-null  object
 1   age     50000 non-null  int64 
 2   topic   50000 non-null  object
 3   sign    50000 non-null  object
 4   text    50000 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.9+ MB


## 4.  Data Preprocessing for text column

In [None]:
# Work on a copy so the step-by-step columns below do not clutter pd_blog
pd_blog1 = pd_blog.copy()

In [None]:
pd_blog1["text_lower"] = pd_blog1["text"].str.lower()
pd_blog1.head()

Unnamed: 0,gender,age,topic,sign,text,text_lower
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...","info has been found (+/- 100 pages,..."
1,male,15,Student,Leo,These are the team members: Drewe...,these are the team members: drewe...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,in het kader van kernfusie op aarde...
3,male,15,Student,Leo,testing!!! testing!!!,testing!!! testing!!!
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks to yahoo!'s toolbar i can ...


In [None]:
PUNCT_TO_REMOVE = string.punctuation

def remove_punctuation(text):
    # Delete every ASCII punctuation character; nothing is inserted in its place,
    # so words joined only by punctuation are merged
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

pd_blog1["text_special"] = pd_blog1["text_lower"].apply(remove_punctuation)
pd_blog1.head()

Unnamed: 0,gender,age,topic,sign,text,text_lower,text_special
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...","info has been found (+/- 100 pages,...",info has been found 100 pages and ...
1,male,15,Student,Leo,These are the team members: Drewe...,these are the team members: drewe...,these are the team members drewes...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,in het kader van kernfusie op aarde...,in het kader van kernfusie op aarde...
3,male,15,Student,Leo,testing!!! testing!!!,testing!!! testing!!!,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks to yahoo!'s toolbar i can ...,thanks to yahoos toolbar i can no...


In [None]:
# Set up the NLTK stopword list on Colab
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [None]:
STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

pd_blog1["text_stop"] = pd_blog1["text_special"].apply(lambda text : remove_stopwords(text))
pd_blog1.head()

Unnamed: 0,gender,age,topic,sign,text,text_lower,text_special,text_stop
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...","info has been found (+/- 100 pages,...",info has been found 100 pages and ...,info found 100 pages 45 mb pdf files wait unti...
1,male,15,Student,Leo,These are the team members: Drewe...,these are the team members: drewe...,these are the team members drewes...,team members drewes van der laag urllink mail ...
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,in het kader van kernfusie op aarde...,in het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...
3,male,15,Student,Leo,testing!!! testing!!!,testing!!! testing!!!,testing testing,testing testing
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks to yahoo!'s toolbar i can ...,thanks to yahoos toolbar i can no...,thanks yahoos toolbar capture urls popupswhich...


In [None]:
def text_format(text):
    text = remove_punctuation(text) # remove unwanted characters
    text = text.lower() # convert to lowercase
    text = text.strip() # remove leading and trailing spaces
    text = remove_stopwords(text) # remove stopwords; the split/join also collapses repeated spaces
    return text

pd_blog["text"] = pd_blog["text"].map(text_format)

In [None]:
pd_blog.head()

Unnamed: 0,gender,age,topic,sign,text
0,male,15,Student,Leo,info found 100 pages 45 mb pdf files wait unti...
1,male,15,Student,Leo,team members drewes van der laag urllink mail ...
2,male,15,Student,Leo,het kader van kernfusie op aarde maak je eigen...
3,male,15,Student,Leo,testing testing
4,male,33,InvestmentBanking,Aquarius,thanks yahoos toolbar capture urls popupswhich...
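
The four preprocessing steps can also be written as one small function. This is a sketch under stated assumptions: `clean_text` and `STOPWORDS_DEMO` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and unwanted characters are replaced by a space instead of being deleted.

```python
import re

# Small illustrative stopword set; the notebook uses NLTK's English list instead.
STOPWORDS_DEMO = {"i", "the", "a", "to", "is", "and", "can", "has", "been"}


def clean_text(text: str) -> str:
    text = text.lower()                       # convert to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace unwanted characters with a space
    tokens = text.split()                     # split() also collapses repeated whitespace
    return " ".join(t for t in tokens if t not in STOPWORDS_DEMO)


print(clean_text("  Thanks to Yahoo!'s Toolbar I can   capture URLs... "))
# thanks yahoo s toolbar capture urls
```

Replacing punctuation with a space keeps words apart that were separated only by punctuation, which avoids merged tokens such as `popupswhich` in the output above.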


### 4.1 Merge label columns


In [None]:
pd_blog.gender.value_counts()

male      25815
female    24185
Name: gender, dtype: int64

In [None]:
pd_blog.age.value_counts()

17    6859
24    5746
23    5518
16    4156
27    4094
15    3508
35    3365
26    2869
25    2837
14    2043
36    1985
34    1886
33    1654
13     745
39     412
41     394
46     330
48     318
37     310
47     206
38     196
40     192
43     150
42      96
45      93
44      38
Name: age, dtype: int64

In [None]:
pd_blog.topic.value_counts()

indUnk                     17560
Student                    10660
Technology                  4379
Education                   2646
Arts                        1817
Fashion                     1805
Communications-Media        1603
Internet                    1420
Engineering                 1402
Science                      705
Government                   599
Non-Profit                   491
Manufacturing                441
BusinessServices             416
Marketing                    414
Accounting                   364
Law                          308
Museums-Libraries            285
Banking                      283
Advertising                  273
Religion                     258
Consulting                   243
Publishing                   207
Transportation               196
Military                     194
LawEnforcement-Security      125
Sports-Recreation            120
Automotive                   116
Biotech                      101
InvestmentBanking             85
HumanResou

In [None]:
pd_blog.sign.value_counts()

Aries          7795
Aquarius       4784
Cancer         4589
Sagittarius    4571
Libra          4378
Pisces         4142
Leo            3904
Capricorn      3819
Taurus         3390
Scorpio        3243
Virgo          2827
Gemini         2558
Name: sign, dtype: int64

In [None]:
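# Ages become strings so that every label in the list has the same type
pd_blog["age"] = pd_blog["age"].astype(str)

# Build one list of labels per row
pd_blog["labels"] = pd_blog.apply(lambda row:
                            [row["gender"], row["age"], row["topic"], row["sign"]], axis=1)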

In [None]:
pd_blog.drop(columns=["gender", "age", "sign", "topic"], inplace=True)

In [None]:
pd_blog.head()

Unnamed: 0,text,labels
0,info found 100 pages 45 mb pdf files wait unti...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoos toolbar capture urls popupswhich...,"[male, 33, InvestmentBanking, Aquarius]"
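
A row-wise `apply` is easy to read but slow on large frames. The sketch below shows one possible vectorized way to build the same `labels` column; the two-row frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 24],
    "topic": ["Student", "indUnk"],
    "sign": ["Leo", "Aries"],
    "text": ["testing testing", "hello world"],
})

label_cols = ["gender", "age", "topic", "sign"]

# Cast every label column to str, then turn each row into a Python list.
df["labels"] = pd.Series(df[label_cols].astype(str).values.tolist(), index=df.index)

# Keep only the two columns the task asks for.
df = df[["text", "labels"]]
print(df)
```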


In [None]:
gc.collect()

150

### 4.2 Create Train and Test data sets

In [None]:
from sklearn.model_selection import train_test_split

X = pd_blog.text
y = pd_blog.labels

X_train, X_test, y_train, y_test =train_test_split(X,y, random_state=60,
                                                   test_size = 0.2,
                                                  shuffle = True)

In [None]:
print("Training shape :", X_train.shape)
print("Testing shape :", X_test.shape)

Training shape : (40000,)
Testing shape : (10000,)


### 4.3 Vectorizing the features

In [None]:
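from sklearn.feature_extraction.text import CountVectorizer

# Bag of words over unigrams and bigrams
ctv = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 2), stop_words='english')

# Note: the vocabulary is built from training and test text together,
# so n-grams that occur only in the test set also become features
corpus = list(X_train) + list(X_test)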


In [None]:
ctv.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='\\w{1,}', tokenizer=None,
                vocabulary=None)

In [None]:
gc.collect()
xtrain_ctv = ctv.transform(X_train)

In [None]:
gc.collect()
xtest_ctv = ctv.transform(X_test)

In [None]:
# Vocabulary Size
print(len(ctv.vocabulary_))

3071715


In [None]:
#Size of Document Term Matrix
df_ct = ctv.transform(corpus)
df_ct.shape

(50000, 3071715)
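
A vocabulary of about three million n-grams is large, and fitting on train and test text together lets test data shape the feature space. A common alternative is to fit on the training text only and drop rare n-grams. The sketch below uses made-up documents, and `min_df=2` is an assumed threshold, not a tuned value.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good blog post", "good morning post", "bad blog"]
test_docs = ["good unseen words", "bad post"]

# min_df=2 keeps only n-grams that occur in at least two training documents.
vec = CountVectorizer(ngram_range=(1, 2), min_df=2)
X_tr = vec.fit_transform(train_docs)  # vocabulary learned from training text only
X_te = vec.transform(test_docs)       # n-grams outside the vocabulary are ignored

print(sorted(vec.vocabulary_))  # ['blog', 'good', 'post']
print(X_tr.shape, X_te.shape)   # (3, 3) (2, 3)
```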

In [None]:
# Sample record
print(df_ct[0])

  (0, 39773)	1
  (0, 40034)	1
  (0, 76624)	1
  (0, 76762)	1
  (0, 77511)	1
  (0, 77520)	1
  (0, 78252)	1
  (0, 78605)	1
  (0, 94916)	1
  (0, 94921)	1
  (0, 148522)	1
  (0, 148523)	1
  (0, 230931)	1
  (0, 231214)	1
  (0, 285463)	1
  (0, 285481)	1
  (0, 297447)	1
  (0, 299580)	1
  (0, 312096)	1
  (0, 313788)	1
  (0, 389978)	1
  (0, 390432)	1
  (0, 460008)	1
  (0, 460719)	1
  (0, 529239)	1
  :	:
  (0, 2978470)	1
  (0, 2978593)	1
  (0, 2983209)	1
  (0, 2984294)	1
  (0, 2985719)	1
  (0, 2985857)	1
  (0, 2987899)	1
  (0, 2987908)	1
  (0, 2993552)	1
  (0, 2994579)	1
  (0, 2995569)	1
  (0, 2998929)	1
  (0, 3001283)	2
  (0, 3002179)	1
  (0, 3002970)	1
  (0, 3009204)	1
  (0, 3009652)	1
  (0, 3017499)	2
  (0, 3017561)	1
  (0, 3018752)	1
  (0, 3036922)	1
  (0, 3037903)	1
  (0, 3039920)	2
  (0, 3041083)	1
  (0, 3042237)	1


In [None]:
# First 20 vocabulary entries (on scikit-learn >= 1.2, use ctv.get_feature_names_out() instead)
ctv.get_feature_names()[:20]

['0',
 '0 0',
 '0 05',
 '0 1',
 '0 10',
 '0 15',
 '0 2',
 '0 23003',
 '0 24',
 '0 3',
 '0 4',
 '0 45',
 '0 5',
 '0 6',
 '0 ahhhhhhhhhhhh',
 '0 alcohol',
 '0 allen',
 '0 answering',
 '0 article',
 '0 assists']

### Dictionary of Labels

In [None]:
pd_blog_test=pd_blog.labels[:10].values

In [None]:
label_counts = dict()

# Key: label name, value: number of posts carrying that label
for labels in pd_blog.labels.values:
    for label in labels:
        key = str(label)
        label_counts[key] = label_counts.get(key, 0) + 1

label_counts

{'13': 745,
 '14': 2043,
 '15': 3508,
 '16': 4156,
 '17': 6859,
 '23': 5518,
 '24': 5746,
 '25': 2837,
 '26': 2869,
 '27': 4094,
 '33': 1654,
 '34': 1886,
 '35': 3365,
 '36': 1985,
 '37': 310,
 '38': 196,
 '39': 412,
 '40': 192,
 '41': 394,
 '42': 96,
 '43': 150,
 '44': 38,
 '45': 93,
 '46': 330,
 '47': 206,
 '48': 318,
 'Accounting': 364,
 'Advertising': 273,
 'Agriculture': 78,
 'Aquarius': 4784,
 'Architecture': 70,
 'Aries': 7795,
 'Arts': 1817,
 'Automotive': 116,
 'Banking': 283,
 'Biotech': 101,
 'BusinessServices': 416,
 'Cancer': 4589,
 'Capricorn': 3819,
 'Chemicals': 75,
 'Communications-Media': 1603,
 'Construction': 28,
 'Consulting': 243,
 'Education': 2646,
 'Engineering': 1402,
 'Environment': 6,
 'Fashion': 1805,
 'Gemini': 2558,
 'Government': 599,
 'HumanResources': 79,
 'Internet': 1420,
 'InvestmentBanking': 85,
 'Law': 308,
 'LawEnforcement-Security': 125,
 'Leo': 3904,
 'Libra': 4378,
 'Manufacturing': 441,
 'Maritime': 54,
 'Marketing': 414,
 'Military': 194,
 '
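
The same label dictionary can be built with `collections.Counter`. A minimal sketch on three made-up label lists; the name `label_counts_demo` is hypothetical.

```python
from collections import Counter
from itertools import chain

labels_column = [
    ["male", "15", "Student", "Leo"],
    ["male", "15", "Student", "Leo"],
    ["female", "24", "indUnk", "Aries"],
]

# Flatten the per-row lists into one stream of labels and count them.
label_counts_demo = Counter(chain.from_iterable(labels_column))
print(dict(label_counts_demo))
```

On the notebook's data, `labels_column` would be `pd_blog.labels`, and `Counter.most_common()` would list the most frequent labels first.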

### Transform the labels


In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

binarizer=MultiLabelBinarizer(classes=sorted(label_counts.keys()))

y_train = binarizer.fit_transform(y_train)
y_test = binarizer.transform(y_test)

In [None]:
y_train[10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
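
To see what the binary mask encodes, here is a small sketch of `MultiLabelBinarizer` on two made-up label lists. Each column corresponds to one entry of `classes_`, and `inverse_transform` maps a mask back to label tuples.

```python
from sklearn.preprocessing import MultiLabelBinarizer

y_demo = [["male", "15", "Leo"], ["female", "24", "Aries"]]

mlb = MultiLabelBinarizer()
Y_demo = mlb.fit_transform(y_demo)

print(list(mlb.classes_))              # column order (sorted labels)
print(Y_demo)                          # one 0/1 row per sample
print(mlb.inverse_transform(Y_demo))   # back to label tuples
```

In `y_train[10]` above, the four 1s mark that sample's gender, age, topic, and sign.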

### Defining the classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier

# Using pipeline for applying logistic regression and one vs rest classifier
blog_pipeline = Pipeline([
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'),
                    n_jobs=-1)),])
blog_pipeline.fit(xtrain_ctv, y_train)

Pipeline(memory=None,
         steps=[('clf',
                 OneVsRestClassifier(estimator=LogisticRegression(C=1.0,
                                                                  class_weight=None,
                                                                  dual=False,
                                                                  fit_intercept=True,
                                                                  intercept_scaling=1,
                                                                  l1_ratio=None,
                                                                  max_iter=100,
                                                                  multi_class='auto',
                                                                  n_jobs=None,
                                                                  penalty='l2',
                                                                  random_state=None,
                                                      

In [None]:
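from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Alternative without the pipeline, using the lbfgs solver.
# The lbfgs solver takes longer to run here; the predictions below
# come from blog_pipeline, not from this classifier.
clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)

clf.fit(xtrain_ctv, y_train)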

In [None]:
y_pred = blog_pipeline.predict(xtest_ctv)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

# Prediction metrics
# For multi-label targets, accuracy_score is subset accuracy:
# a sample counts as correct only if all of its labels match.
# Note: average_precision_score summarizes the precision-recall curve and
# normally takes scores or probabilities; here it receives the hard 0/1 predictions.

print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('F1 score: Micro', f1_score(y_test, y_pred, average='micro'))
print('Average precision score: Micro', average_precision_score(y_test, y_pred, average='micro'))
print('Average recall score: Micro', recall_score(y_test, y_pred, average='micro'))

# Macro metrics

print('F1 score: Macro', f1_score(y_test, y_pred, average='macro'))
print('Average precision score: Macro', average_precision_score(y_test, y_pred, average='macro'))
print('Average recall score: Macro', recall_score(y_test, y_pred, average='macro'))

# Weighted metrics

print('F1 score: weighted', f1_score(y_test, y_pred, average='weighted'))
print('Average precision score: weighted', average_precision_score(y_test, y_pred, average='weighted'))
print('Average recall score: weighted', recall_score(y_test, y_pred, average='weighted'))
    


Accuracy score:  0.0162
F1 score: Micro 0.3174037495497886
Average precision score: Micro 0.17693479612640162
Average recall score: Micro 0.2093
F1 score: Macro 0.04694142553235136
Average precision score: Macro 0.0632366403718985
Average recall score: Macro 0.03455784637542495
F1 score: weighted 0.23565023574835395
Average precision score: weighted 0.262073152045813
Average recall score: weighted 0.2093
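
The gap between the accuracy and the micro-averaged scores follows from how the metrics are defined. The sketch below works through subset accuracy and micro versus macro averaging on a made-up example with four samples and three labels. It uses `precision_score`, which is precision averaged across labels and is a different metric from `average_precision_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_hat = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1]])

# Subset accuracy: only the last row matches on every label.
subset_acc = accuracy_score(y_true, y_hat)                   # 1/4

# Micro: pool all label decisions (TP=4, FP=1, FN=2), then compute the ratio.
p_micro = precision_score(y_true, y_hat, average="micro")    # 4/5
r_micro = recall_score(y_true, y_hat, average="micro")       # 4/6

# Macro: compute the metric per label, then take the unweighted mean.
p_macro = precision_score(y_true, y_hat, average="macro")    # (2/3 + 1 + 1) / 3
r_macro = recall_score(y_true, y_hat, average="macro")       # (1 + 1/2 + 1/2) / 3

print(subset_acc, p_micro, r_micro, p_macro, r_macro)
```

Micro averaging is dominated by frequent labels such as gender, while macro averaging gives rare labels the same weight as common ones. That is consistent with the macro scores above being much lower than the micro scores.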


### True and Predicted Labels

In [None]:
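import random
n = 5

# random.randrange excludes the upper bound, so every index is valid for y_test
j = []
for i in range(n):
    j.append(random.randrange(len(y_test)))
    print(j)

# Decode the binary masks back to label tuples once, outside the loop
pred_labels = binarizer.inverse_transform(y_pred)
true_labels = binarizer.inverse_transform(y_test)

for k in j:
    print("Predicted Value", pred_labels[k])
    print("Actual Value", true_labels[k])
    print("  ")
    print("  ")
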

[7261]
[7261, 3933]
[7261, 3933, 1461]
[7261, 3933, 1461, 5184]
[7261, 3933, 1461, 5184, 8229]
Predicted Value ('male',)
Actual Value ('14', 'Aries', 'indUnk', 'male')
  
  
Predicted Value ('female',)
Actual Value ('24', 'Sagittarius', 'female', 'indUnk')
  
  
Predicted Value ('male',)
Actual Value ('35', 'Aries', 'Technology', 'male')
  
  
Predicted Value ('female',)
Actual Value ('24', 'Aries', 'indUnk', 'male')
  
  
Predicted Value ('male',)
Actual Value ('27', 'Pisces', 'indUnk', 'male')
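
In the five examples above, only the gender label is predicted. Since every post has exactly one gender, age, topic, and sign, one possible refinement is to take the most probable label within each group instead of thresholding each label independently. The sketch below uses made-up probabilities and a hypothetical column grouping; with the notebook's objects, the probabilities could come from `blog_pipeline.predict_proba(xtest_ctv)`.

```python
import numpy as np

classes = np.array(["15", "24", "Aries", "Leo", "female", "male"])

# Hypothetical grouping of the columns by label type.
groups = {"age": [0, 1], "sign": [2, 3], "gender": [4, 5]}

# Made-up probabilities, shaped (n_samples, n_classes).
proba = np.array([
    [0.30, 0.10, 0.05, 0.20, 0.40, 0.60],
    [0.05, 0.45, 0.35, 0.10, 0.70, 0.20],
])


def pick_one_per_group(proba_row):
    """Return the most probable label from each group (hypothetical helper)."""
    picked = []
    for cols in groups.values():
        best = cols[int(np.argmax(proba_row[cols]))]
        picked.append(str(classes[best]))
    return tuple(picked)


for row in proba:
    print(pick_one_per_group(row))
```

This always returns one label per group, even when no probability exceeds 0.5, so it trades the empty predictions seen above for possibly low-confidence guesses.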
  
  
