# Project MBTI

Subject: Creating a model to predict MBTI personality types from social media posts

Data format (English dataset):
    
    =>type (string): MBTI types
    =>posts (string): text posts 

## I. Data exploration

In [1]:
import numpy as np
import csv
import pandas as pd

Load with pandas read_csv to avoid delimiter problems between the header and the data.

In [2]:
# Raw string so that the backslashes of the Windows path are not read as escape sequences
data2 = pd.read_csv(r"C:\Users\Gwen\MBTI\mbti_1.csv", header=0)

In [3]:
print(data2)

      type                                              posts
0     INFJ  'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1     ENTP  'I'm finding the lack of me in these posts ver...
2     INTP  'Good one  _____   https://www.youtube.com/wat...
3     INTJ  'Dear INTP,   I enjoyed our conversation the o...
4     ENTJ  'You're fired.|||That's another silly misconce...
5     INTJ  '18/37 @.@|||Science  is not perfect. No scien...
6     INFJ  'No, I can't draw on my own nails (haha). Thos...
7     INTJ  'I tend to build up a collection of things on ...
8     INFJ  I'm not sure, that's a good question. The dist...
9     INTP  'https://www.youtube.com/watch?v=w8-egj0y8Qs||...
10    INFJ  'One time my parents were fighting over my dad...
11    ENFJ  'https://www.youtube.com/watch?v=PLAaiKvHvZs||...
12    INFJ  'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
13    INTJ  'Fair enough, if that's how you want to look a...
14    INTP  'Basically this...  https://youtu.be/1pH5c1Jkh...
15    IN

Check which column headers we have.

In [4]:
data2.columns

Index([u'type', u'posts'], dtype='object')

In [5]:
data2[:0]

Unnamed: 0,type,posts


Overview of the dataset

In [6]:
print(data2.shape)

(8675, 2)


We have 8675 rows and 2 columns, which matches the input dataset.

### 1. Frequency of each MBTI type in the dataset

Here we can see the distribution of the types in the dataset. Most posts come from INFP, INFJ, INTJ and INTP: the "IN" types (Introvert and iNtuitive) are much more represented than the others.

In [7]:
data2.groupby('type').count()

type,posts
ENFJ,190
ENFP,675
ENTJ,231
ENTP,685
ESFJ,42
ESFP,48
ESTJ,39
ESTP,89
INFJ,1470
INFP,1832
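
To make the imbalance concrete, the counts can be turned into shares of the dataset. The sketch below (Python 3) hard-codes only the rows visible in the output above, together with the 8675 rows reported by data2.shape; the names visible_counts and shares are illustrative and not part of the notebook.

```python
# Counts copied from the groupby output above (only the rows shown there).
visible_counts = {
    "ENFJ": 190, "ENFP": 675, "ENTJ": 231, "ENTP": 685, "ESFJ": 42,
    "ESFP": 48, "ESTJ": 39, "ESTP": 89, "INFJ": 1470, "INFP": 1832,
}
total_posts = 8675  # number of rows reported by data2.shape

# Share of the whole dataset held by each listed type.
shares = {mbti_type: n / total_posts for mbti_type, n in visible_counts.items()}

# Print the listed types from most to least frequent.
for mbti_type, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{mbti_type}: {share:.1%}")
```

On the full DataFrame, data2['type'].value_counts(normalize=True) should give the same kind of shares for all 16 types.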


### 2. Create category MBTI list

We need to create a category list of the 16 MBTI types.

In [8]:
list(set(data2.type))

['ENFJ',
 'ESFP',
 'INFJ',
 'ESTJ',
 'ISTJ',
 'ENTJ',
 'ISFP',
 'INTJ',
 'ISTP',
 'ENTP',
 'ISFJ',
 'INTP',
 'ESFJ',
 'ESTP',
 'ENFP',
 'INFP']

In [9]:
category = list(data2['type'].unique())

In [10]:
print(category)

['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP', 'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ']


Create a dictionary that maps each MBTI type to an index, so that the "type" column of the train and test sets can later be replaced with numerical values.

In [11]:
dict_category = {}
for idx, x in enumerate(category):
    dict_category[x]= idx+1
print(dict_category)

{'ENFJ': 6, 'ESFP': 14, 'INFJ': 1, 'ESTJ': 15, 'ISTJ': 12, 'ENTJ': 5, 'ISFP': 9, 'INTJ': 4, 'ISTP': 10, 'ENTP': 2, 'ISFJ': 11, 'INTP': 3, 'ESFJ': 16, 'ESTP': 13, 'ENFP': 8, 'INFP': 7}
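
The same mapping can be written as a dictionary comprehension, and a reverse mapping is useful later to turn numerical predictions back into type names. This is a small sketch in Python 3 on a hypothetical four-type subset; inverse_category is a name introduced here for illustration.

```python
# Hypothetical short subset of the category list, kept small for readability.
category = ['INFJ', 'ENTP', 'INTP', 'INTJ']

# Same mapping as the loop above: each type gets its position + 1.
dict_category = {mbti_type: idx for idx, mbti_type in enumerate(category, start=1)}

# Reverse mapping, handy for turning numeric predictions back into type names.
inverse_category = {idx: mbti_type for mbti_type, idx in dict_category.items()}

print(dict_category)
print(inverse_category[3])
```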


## II. Data cleaning

Objective: remove URLs

### 1. Remove URLs using regex

In [12]:
import re

We apply a regex substitution to remove all URL strings from the "posts" column:

In [13]:
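# Character class equivalent to the original alternation; assigning the whole
# column at once avoids chained assignment on data2['posts'][idx]
url_pattern = re.compile(r'http[a-zA-Z0-9./?=:_-]*')
data2['posts'] = data2['posts'].apply(lambda post: url_pattern.sub('', post))
print(data2['posts'])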

0       '||||||enfp and intj moments    sportscenter n...
1       'I'm finding the lack of me in these posts ver...
2       'Good one  _____   |||Of course, to which I sa...
3       'Dear INTP,   I enjoyed our conversation the o...
4       'You're fired.|||That's another silly misconce...
5       '18/37 @.@|||Science  is not perfect. No scien...
6       'No, I can't draw on my own nails (haha). Thos...
7       'I tend to build up a collection of things on ...
8       I'm not sure, that's a good question. The dist...
9       '|||I'm in this position where I have to actua...
10      'One time my parents were fighting over my dad...
11      '|||51 :o|||I went through a break up some mon...
12      'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
13      'Fair enough, if that's how you want to look a...
14      'Basically this...  |||Can I has Cheezburgr?||...
15      'Your comment screams INTJ, bro. Especially th...
16      'some of these both excite and calm me:  BUTTS...
17      'I thi
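
To see what the pattern does and where it stops, it can be tried on single strings. The sketch below uses a shortened version of the first post and a made-up URL containing "&", a character the pattern does not cover, so a fragment of that URL is left behind. URL_PATTERN is a name chosen for this illustration.

```python
import re

# Character class equivalent to the alternation used in the cell above.
URL_PATTERN = re.compile(r'http[a-zA-Z0-9./?=:_-]*')

# Start of the first post of the dataset, shortened for the demo.
sample = "'http://www.youtube.com/watch?v=qsXHcwe3krw|||enfp and intj moments"
cleaned = URL_PATTERN.sub('', sample)
print(cleaned)

# Hypothetical URL with "&" in the query string: the match stops at "&".
partial = URL_PATTERN.sub('', "see http://example.com/x?y=1&z=2 now")
print(partial)
```

If such leftovers matter, one option is to widen the character class (for instance with "&", "%" and "#"), at the cost of a slightly greedier match.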

In [14]:
print(data2)

      type                                              posts
0     INFJ  '||||||enfp and intj moments    sportscenter n...
1     ENTP  'I'm finding the lack of me in these posts ver...
2     INTP  'Good one  _____   |||Of course, to which I sa...
3     INTJ  'Dear INTP,   I enjoyed our conversation the o...
4     ENTJ  'You're fired.|||That's another silly misconce...
5     INTJ  '18/37 @.@|||Science  is not perfect. No scien...
6     INFJ  'No, I can't draw on my own nails (haha). Thos...
7     INTJ  'I tend to build up a collection of things on ...
8     INFJ  I'm not sure, that's a good question. The dist...
9     INTP  '|||I'm in this position where I have to actua...
10    INFJ  'One time my parents were fighting over my dad...
11    ENFJ  '|||51 :o|||I went through a break up some mon...
12    INFJ  'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
13    INTJ  'Fair enough, if that's how you want to look a...
14    INTP  'Basically this...  |||Can I has Cheezburgr?||...
15    IN

## III. Data preparation: Bag of words

The objective is to extract numerical feature vectors from the text in three steps: tokenize each string (split the corpus into words, using whitespace and punctuation as delimiters, and give each token a numerical id), count the occurrences of each token in each post, and normalize the counts (weighting each token by its importance within the post). Since most features will be zero after tokenization, storage space also needs attention: scikit-learn keeps the counts in a sparse matrix, and TF-IDF is then applied to reweight them.

First I will build the train and test sets. This must be done before vectorization, because the post/term matrix must contain only the words of the train set, not the words of all posts.
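
The idea can be sketched by hand on a few toy posts. This is a hand-rolled illustration, not scikit-learn's implementation: the tokenize helper, the toy posts and the variable names are all assumptions, and the regex only loosely mirrors the default behaviour of CountVectorizer (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Hypothetical toy posts standing in for the train and test sets.
train_posts = ["I love quiet evenings", "quiet people love books"]
test_posts = ["loud people love parties"]

# The vocabulary (token -> column id) is learned from the train posts only.
train_tokens = sorted({tok for post in train_posts for tok in tokenize(post)})
vocabulary = {tok: idx for idx, tok in enumerate(train_tokens)}

def to_counts(post):
    # One count per vocabulary column; tokens unseen at train time are dropped.
    counts = Counter(tokenize(post))
    return [counts[tok] for tok in train_tokens]

train_matrix = [to_counts(post) for post in train_posts]
test_matrix = [to_counts(post) for post in test_posts]

print(vocabulary)
print(train_matrix)
print(test_matrix)
```

The test post contains "loud" and "parties", which the train vocabulary does not know, so they simply do not appear in its row. Both matrices keep the same number of columns.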

### 1. Train and Test set

I will now split the dataset into a train set (75%) and a test set (25%) for the models, using the train/test split function provided by scikit-learn.

In [15]:
#Import & train and test set creation
from sklearn.model_selection import train_test_split
trainset, testset = train_test_split(data2, test_size=0.25)
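
The split above is drawn at random, so the rows (and the scores further down) change from one run to the next. As a possible variant, train_test_split accepts random_state to fix the shuffle and stratify to keep the proportion of each type in both parts, which may help with the rare types. The sketch below runs on hypothetical toy lists rather than on data2.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 posts of one type and 4 of another.
posts = ["post %d" % i for i in range(12)]
types = ["INFP"] * 8 + ["ESTJ"] * 4

# random_state fixes the shuffle; stratify keeps the type proportions in both parts.
train_posts, test_posts, train_types, test_types = train_test_split(
    posts, types, test_size=0.25, random_state=42, stratify=types
)

print(len(train_posts), len(test_posts))
print(sorted(test_types))
```

On the notebook's data this would presumably read train_test_split(data2, test_size=0.25, random_state=42, stratify=data2['type']). The outputs shown in the rest of this notebook come from the unseeded split, so they would not be reproduced exactly.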

In [16]:
print(trainset)

      type                                              posts
781   INTJ  'That's the thing. I can't type her at all. Sh...
16    INFJ  'some of these both excite and calm me:  BUTTS...
3062  INFP  'If I could just cease to exist, it'd be great...
4229  ESTP  'Thanks for the responses, guys! I'd usually l...
8458  ISFJ  'You Are by Charlie Wilson.||||||The Portable ...
3112  INFP  'it's in the socionics description of an INFJ/...
1179  INFP  'Oh you know, I'd be a total badass with nerve...
4596  INTP  'I think INTPs are less biased, yes. However, ...
1369  INFJ  'My daughter says much the same thing, albeit ...
1081  INFP  ':laughing: :laughing:Oh yes you know me so we...
7078  INTP  'one's life only makes sense within the contex...
4183  ENFP  Hehe, well. I really like to ''pick up'' my st...
1699  INTP  'Of the recent and good friends that I am fair...
460   INTJ  'INTJ and a proud follower of the Baha'i Faith...
1205  INFJ  'Sexual: 45 Self-preservation: 30 Social: 29||...
7123  IN

In [17]:
trainset.count()

type     6506
posts    6506
dtype: int64

A train set with 75% of the DataFrame (8675 rows)

In [18]:
testset.count()

type     2169
posts    2169
dtype: int64

A test set with 25% of the DataFrame (8675 rows)

#### 1.1. Convert the types to numerical values with the category dictionary

##### a) For train set

In [19]:
# Explicit copy, so that replacing the "type" column does not write into a slice of trainset
trainset_type_num = trainset.copy()
print(trainset_type_num)

      type                                              posts
781   INTJ  'That's the thing. I can't type her at all. Sh...
16    INFJ  'some of these both excite and calm me:  BUTTS...
3062  INFP  'If I could just cease to exist, it'd be great...
4229  ESTP  'Thanks for the responses, guys! I'd usually l...
8458  ISFJ  'You Are by Charlie Wilson.||||||The Portable ...
3112  INFP  'it's in the socionics description of an INFJ/...
1179  INFP  'Oh you know, I'd be a total badass with nerve...
4596  INTP  'I think INTPs are less biased, yes. However, ...
1369  INFJ  'My daughter says much the same thing, albeit ...
1081  INFP  ':laughing: :laughing:Oh yes you know me so we...
7078  INTP  'one's life only makes sense within the contex...
4183  ENFP  Hehe, well. I really like to ''pick up'' my st...
1699  INTP  'Of the recent and good friends that I am fair...
460   INTJ  'INTJ and a proud follower of the Baha'i Faith...
1205  INFJ  'Sexual: 45 Self-preservation: 30 Social: 29||...
7123  IN

I map the values of the "type" column to their category numbers.

In [20]:
trainset_type_num['type'] = trainset_type_num['type'].replace(dict_category)
print(trainset_type_num)

      type                                              posts
781      4  'That's the thing. I can't type her at all. Sh...
16       1  'some of these both excite and calm me:  BUTTS...
3062     7  'If I could just cease to exist, it'd be great...
4229    13  'Thanks for the responses, guys! I'd usually l...
8458    11  'You Are by Charlie Wilson.||||||The Portable ...
3112     7  'it's in the socionics description of an INFJ/...
1179     7  'Oh you know, I'd be a total badass with nerve...
4596     3  'I think INTPs are less biased, yes. However, ...
1369     1  'My daughter says much the same thing, albeit ...
1081     7  ':laughing: :laughing:Oh yes you know me so we...
7078     3  'one's life only makes sense within the contex...
4183     8  Hehe, well. I really like to ''pick up'' my st...
1699     3  'Of the recent and good friends that I am fair...
460      4  'INTJ and a proud follower of the Baha'i Faith...
1205     1  'Sexual: 45 Self-preservation: 30 Social: 29||...
7123    

POST TRAIN

Here we keep the "posts" column of the train set in its own DataFrame.

In [21]:
trainset_posts_only = pd.DataFrame(data=trainset_type_num['posts'])
print(trainset_posts_only)

                                                  posts
781   'That's the thing. I can't type her at all. Sh...
16    'some of these both excite and calm me:  BUTTS...
3062  'If I could just cease to exist, it'd be great...
4229  'Thanks for the responses, guys! I'd usually l...
8458  'You Are by Charlie Wilson.||||||The Portable ...
3112  'it's in the socionics description of an INFJ/...
1179  'Oh you know, I'd be a total badass with nerve...
4596  'I think INTPs are less biased, yes. However, ...
1369  'My daughter says much the same thing, albeit ...
1081  ':laughing: :laughing:Oh yes you know me so we...
7078  'one's life only makes sense within the contex...
4183  Hehe, well. I really like to ''pick up'' my st...
1699  'Of the recent and good friends that I am fair...
460   'INTJ and a proud follower of the Baha'i Faith...
1205  'Sexual: 45 Self-preservation: 30 Social: 29||...
7123  'yep, yep, yep, especially the last one.    ye...
5236  'I should clarify that this guy hasn't mad

LABEL NUMBER TRAIN: DataFrame containing only the numerical type values of the train set

In [22]:
trainset_type_num_only = pd.DataFrame(data=trainset_type_num['type'])

The models need the "type" column on its own, as numerical values (for both the train set and the test set). In our case it must also be converted to an array:

In [23]:
#train set type column only
df = trainset_type_num_only.values
print(df)
df.shape

[[4]
 [1]
 [7]
 ..., 
 [4]
 [1]
 [2]]


(6506L, 1L)
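
The shape printed above, (6506, 1), is a column array. scikit-learn estimators generally expect labels as a 1-D array of shape (n_samples,); a column array is usually accepted but tends to raise a conversion warning. A minimal sketch of the flattening step, on a hypothetical six-row array (labels_column and labels_flat are illustrative names):

```python
import numpy as np

# Hypothetical column array shaped like the output above: one label per row.
labels_column = np.array([[4], [1], [7], [4], [1], [2]])

# Flatten to the 1-D shape (n_samples,) without changing the values or their order.
labels_flat = labels_column.ravel()

print(labels_column.shape)
print(labels_flat.shape)
print(labels_flat)
```

In the notebook, the equivalent would be to pass df.ravel() instead of df when fitting a model.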

##### b) For test set

In [24]:
# Explicit copy, so that replacing the "type" column does not write into a slice of testset
testset_type_num = testset.copy()
print(testset_type_num)

      type                                              posts
6667  INTP  You have no idea how loudly I just laughed whi...
4264  INFJ  'As the title states... What are some tips/ide...
3176  INFJ  'Most definitely!...:)|||Yesterday I went swim...
4521  INFP  'Yes, but INTJs have a strong sense of individ...
7248  ENFP  'Communication with my INTJ can be tricky I re...
6586  ENFP  'sarahscriptor thanks so much Sarah. Your word...
5382  INFJ  'I'm pretty sure I fight other kinds of windmi...
29    INFJ  'I think that that can absolutely be true of i...
4340  INFP  'Hon this is quite normal for an INFP to feel ...
6465  ENFP  'Istp|||Heh...also very true :D  Do you ever g...
3538  ESTP  'Hey that sounds like a good trick then :p  Ac...
5176  ESFJ  'I think someone can have ADHD and be introver...
670   ISTP  'I like to sing. But I don't think it's becaus...
2540  INFJ  'Intps are fun an open intellectually, almost ...
3495  ESFJ  'Yep, I'm ESFJ.  Hooray!  |||Big 5 is worthles...
3311  IN

I map the values of the "type" column to their category numbers.

In [25]:
testset_type_num['type'] = testset_type_num['type'].replace(dict_category)
print(testset_type_num)

      type                                              posts
6667     3  You have no idea how loudly I just laughed whi...
4264     1  'As the title states... What are some tips/ide...
3176     1  'Most definitely!...:)|||Yesterday I went swim...
4521     7  'Yes, but INTJs have a strong sense of individ...
7248     8  'Communication with my INTJ can be tricky I re...
6586     8  'sarahscriptor thanks so much Sarah. Your word...
5382     1  'I'm pretty sure I fight other kinds of windmi...
29       1  'I think that that can absolutely be true of i...
4340     7  'Hon this is quite normal for an INFP to feel ...
6465     8  'Istp|||Heh...also very true :D  Do you ever g...
3538    13  'Hey that sounds like a good trick then :p  Ac...
5176    16  'I think someone can have ADHD and be introver...
670     10  'I like to sing. But I don't think it's becaus...
2540     1  'Intps are fun an open intellectually, almost ...
3495    16  'Yep, I'm ESFJ.  Hooray!  |||Big 5 is worthles...
3311    

Here we keep the "posts" column of the test set in its own DataFrame.

In [26]:
testset_posts_only = pd.DataFrame(data=testset_type_num['posts'])
print(testset_posts_only)

                                                  posts
6667  You have no idea how loudly I just laughed whi...
4264  'As the title states... What are some tips/ide...
3176  'Most definitely!...:)|||Yesterday I went swim...
4521  'Yes, but INTJs have a strong sense of individ...
7248  'Communication with my INTJ can be tricky I re...
6586  'sarahscriptor thanks so much Sarah. Your word...
5382  'I'm pretty sure I fight other kinds of windmi...
29    'I think that that can absolutely be true of i...
4340  'Hon this is quite normal for an INFP to feel ...
6465  'Istp|||Heh...also very true :D  Do you ever g...
3538  'Hey that sounds like a good trick then :p  Ac...
5176  'I think someone can have ADHD and be introver...
670   'I like to sing. But I don't think it's becaus...
2540  'Intps are fun an open intellectually, almost ...
3495  'Yep, I'm ESFJ.  Hooray!  |||Big 5 is worthles...
3311  Absolutely terrible with names. People's faces...
6088  'yeah, SW is great. I saw him on the last 

LABEL NUMBER TEST: DataFrame containing only the numerical type values of the test set

In [27]:
testset_type_num_only = pd.DataFrame(data=testset_type_num['type'])
print(testset_type_num_only)

      type
6667     3
4264     1
3176     1
4521     7
7248     8
6586     8
5382     1
29       1
4340     7
6465     8
3538    13
5176    16
670     10
2540     1
3495    16
3311     4
6088     4
1975     7
294      4
1324     3
1891     3
358      3
5656     5
1511     7
4534     3
5991    16
5212     2
5090     7
7716     7
4751     5
...    ...
5165     6
2909     1
7139     2
1333     3
8667     2
6157     7
1494     1
4058     3
6937     4
2402     7
417      7
4616     1
1412    12
7310    16
2590     4
196      4
1314     7
449      4
3260     1
5970     3
2078     1
1294    12
3615     3
7008     7
2045     4
8156     1
8170     5
3930     1
3266     7
6788     7

[2169 rows x 1 columns]


The models need the "type" column on its own, as numerical values (for both the train set and the test set). In our case it must also be converted to an array:

In [28]:
#test set type column only
df_test = testset_type_num_only.values
print(df_test)
df_test.shape

[[3]
 [1]
 [1]
 ..., 
 [1]
 [7]
 [7]]


(2169L, 1L)

### 2. Vectorization using CountVectorizer of scikit-learn

CountVectorizer performs both (1) the tokenization and (2) the occurrence count in a single class.

For train

In [29]:
from sklearn.feature_extraction.text import CountVectorizer
train_for_vec = trainset
count_vect = CountVectorizer()
# Learn the vocabulary on the train posts only, then count the tokens of each post
train_counts = count_vect.fit_transform(train_for_vec['posts'])

For test

In [30]:
test_counts = count_vect.transform(testset['posts'])

In [31]:
train_counts.shape

(6506, 94105)

In [32]:
test_counts.shape

(2169, 94105)

This gives us two sparse matrices: one of 6506 x 94105 (train) with 612,247,130 elements and one of 2169 x 94105 (test) with 204,113,745 elements, zeros included in both cases as we can see below. The models need both matrices to have the same number of columns (94105), which is why I call only transform() on the test set and not fit().
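
The element counts follow directly from the shapes printed above, (6506, 94105) and (2169, 94105). The short calculation below also gives a rough idea of what a dense int64 layout would cost; the 8 bytes per cell is an assumption based on the int64 dtype of the arrays shown further down.

```python
# Shapes printed by train_counts.shape and test_counts.shape above.
train_rows, test_rows, vocabulary_size = 6506, 2169, 94105

# A dense layout would store every cell, zeros included.
train_cells = train_rows * vocabulary_size
test_cells = test_rows * vocabulary_size

# Rough dense footprint at 8 bytes per int64 cell, in gigabytes.
train_dense_gb = train_cells * 8 / 1e9

print(train_cells, test_cells)
print(round(train_dense_gb, 2))
```

The number of values actually stored in the sparse matrix can be read from its nnz attribute (train_counts.nnz), which is typically far smaller than the dense cell count.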

In [33]:
train_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [34]:
test_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [35]:
print(train_counts)

  (0, 12531)	1
  (0, 70374)	1
  (0, 57296)	1
  (0, 66266)	1
  (0, 76838)	1
  (0, 12537)	1
  (0, 18832)	1
  (0, 12574)	1
  (0, 52739)	1
  (0, 54536)	1
  (0, 44762)	1
  (0, 79117)	1
  (0, 9550)	1
  (0, 82756)	1
  (0, 48799)	1
  (0, 38356)	1
  (0, 73474)	2
  (0, 53520)	2
  (0, 80544)	1
  (0, 37208)	1
  (0, 55108)	2
  (0, 53500)	2
  (0, 43501)	1
  (0, 36406)	1
  (0, 50075)	1
  :	:
  (6505, 38697)	1
  (6505, 60544)	6
  (6505, 50748)	11
  (6505, 88519)	8
  (6505, 14083)	1
  (6505, 46298)	18
  (6505, 17269)	12
  (6505, 86821)	1
  (6505, 88806)	3
  (6505, 40245)	7
  (6505, 10017)	55
  (6505, 83912)	3
  (6505, 60151)	38
  (6505, 69072)	1
  (6505, 7963)	1
  (6505, 84151)	49
  (6505, 74978)	3
  (6505, 9303)	11
  (6505, 11970)	3
  (6505, 41027)	2
  (6505, 85995)	1
  (6505, 17793)	10
  (6505, 83404)	2
  (6505, 83026)	40
  (6505, 82987)	21


In [36]:
print(test_counts)

  (0, 137)	1
  (0, 2048)	2
  (0, 3989)	1
  (0, 4008)	1
  (0, 4833)	2
  (0, 4836)	2
  (0, 6758)	1
  (0, 7440)	1
  (0, 7442)	1
  (0, 7456)	1
  (0, 7504)	1
  (0, 7557)	1
  (0, 7676)	1
  (0, 7974)	1
  (0, 7984)	1
  (0, 8225)	1
  (0, 8315)	1
  (0, 8365)	2
  (0, 8483)	1
  (0, 8568)	3
  (0, 8717)	1
  (0, 8735)	1
  (0, 8772)	1
  (0, 9268)	1
  (0, 9303)	5
  :	:
  (2168, 90966)	1
  (2168, 91130)	1
  (2168, 91264)	1
  (2168, 91392)	14
  (2168, 91417)	2
  (2168, 91751)	1
  (2168, 91778)	1
  (2168, 91781)	1
  (2168, 91855)	1
  (2168, 91878)	2
  (2168, 91902)	4
  (2168, 92012)	1
  (2168, 92023)	1
  (2168, 92428)	1
  (2168, 92731)	1
  (2168, 92814)	2
  (2168, 92854)	1
  (2168, 92866)	1
  (2168, 92880)	1
  (2168, 92996)	1
  (2168, 93030)	1
  (2168, 93195)	18
  (2168, 93240)	6
  (2168, 93252)	1
  (2168, 93271)	1


In [37]:
count_vect.vocabulary_.get('crazy')

23282

For example, the word "crazy" is mapped to column index 23282 of the matrix. Note that vocabulary_ stores the feature index of each token, not its number of occurrences.

We have now (1) tokenized the posts and (2) counted the word occurrences, but it is better to work with frequencies.

For now the vocabulary contains only 1-grams (individual words), because the default value in CountVectorizer() is ngram_range=(1, 1). We could add 2-grams (groups of two words) with ngram_range=(1, 2) to enrich the vocabulary, but that takes much more space. So for the moment I will use only 1-grams, and maybe enrich the vocabulary later if necessary.
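
The value returned by vocabulary_.get() is the column index assigned to a token. To count how often the token occurs, the corresponding column of the count matrix has to be summed. A small sketch on three hypothetical posts (vect, counts, col and total are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy posts; "crazy" occurs three times in total.
posts = ["crazy day", "what a crazy crazy idea", "calm evening"]

vect = CountVectorizer()
counts = vect.fit_transform(posts)

# Column index of the token in the count matrix (tokens are sorted alphabetically).
col = vect.vocabulary_["crazy"]

# Total number of occurrences: sum of that column over all posts.
total = int(counts[:, col].sum())

print(col, total)
```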

### 3. TF-IDF

Raw counts depend on post length and give a lot of weight to very common words, so I will use TF-IDF (term frequency-inverse document frequency). First (1: tf), the number of occurrences of each word in a post is divided by the total number of words in that post; the resulting features are called tf. Second (2: tf-idf), words that appear in many posts of the corpus get less weight, because they carry less information than words that appear in only a small share of the posts.

$ \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t) $
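
The formula can be worked through by hand on a tiny corpus. The sketch below uses a textbook variant, with tf as the count divided by the post length and idf(t) = log(N / df(t)), where N is the number of posts and df(t) the number of posts containing t. This is an assumption made for illustration: TfidfTransformer uses a smoothed idf and L2-normalizes each row by default, so its values will differ.

```python
import math

# Hypothetical tokenized posts.
docs = [
    ["introvert", "quiet", "quiet"],
    ["extrovert", "loud"],
    ["quiet", "loud", "loud", "party"],
]

def tf(term, doc):
    # Share of the post's tokens that are this term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Rarer terms (present in fewer posts) get a larger weight.
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / doc_freq)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

quiet_score = tfidf("quiet", docs[0], docs)
introvert_score = tfidf("introvert", docs[0], docs)
print(round(quiet_score, 4), round(introvert_score, 4))
```

In the first post, "quiet" occurs twice and "introvert" once, yet "introvert" gets the higher score because it appears in only one post of the corpus.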

In [38]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
train_tfidf.shape

(6506, 94105)

fit_transform() does two things in one call: fit() fits the estimator to the data (here it learns the idf weights from the train counts), and transform() turns the matrix of occurrences (train_counts) into a tf-idf matrix of weighted frequencies.

In [39]:
test_tfidf = tfidf_transformer.transform(test_counts)
test_tfidf.shape

(2169, 94105)

To be used by the model, the train and test tf-idf matrices need the same number of columns (94105).

## IV. Models 

### 1. Training a classifier: MultinomialNB

Now we train a classifier on these features to classify the posts. We start with a basic Naive Bayes classifier, MultinomialNB, also provided by scikit-learn.

In [41]:
from sklearn.naive_bayes import MultinomialNB
cl = MultinomialNB().fit(train_tfidf, df)

I could also use a Pipeline (class sklearn.pipeline.Pipeline in scikit-learn) to chain the vectorization, tf-idf and classifier steps more simply. It could be an improvement for later.

#### Test of the model performance on testset

Predict the class of each test post from the test tf-idf matrix

In [42]:
predicted = cl.predict(test_tfidf)

Now we calculate the prediction accuracy using the metrics module from scikit-learn

In [43]:
from sklearn import metrics
metrics.accuracy_score(df_test, predicted)

0.20931304748732135

The accuracy of this first model is poor: only 21% of the posts are correctly classified (assigned the right MBTI type). Several reasons could explain this result: the earlier text processing steps (data cleaning, tokenization, occurrence count, normalization) may not be good enough, or the classifier may not suit the task, so I will try other classifiers.
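
One way to put this figure in context is to compare it with a baseline that always predicts the most frequent train label. In the full dataset, INFP alone accounts for 1832 of the 8675 posts, about 21%, so such a baseline would score roughly the same. The sketch below computes a majority baseline on hypothetical toy labels; accuracy is a hand-written helper, not the scikit-learn function.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Share of positions where the prediction equals the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical numerical labels (7 stands for INFP in dict_category).
y_train = [7, 7, 7, 1, 4, 3, 7, 1]
y_test = [7, 1, 7, 4]

# Most frequent label in the train set.
majority = Counter(y_train).most_common(1)[0][0]

# Baseline: predict that label for every test post.
baseline_pred = [majority] * len(y_test)
baseline_acc = accuracy(y_test, baseline_pred)

print(majority, baseline_acc)
```

scikit-learn provides DummyClassifier(strategy='most_frequent') in sklearn.dummy for the same purpose.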

### 2. Using another classifier: SVM

I will try another classifier to improve the prediction: an SVM (support vector machine) trained with stochastic gradient descent (SGD), which is reputed to work well for text classification. I keep the hinge loss and L2 penalty, which are the SGDClassifier defaults, and set alpha=1e-3 and max_iter=5. I am also trying the pipeline mentioned earlier, which chains the vectorization and occurrence count (CountVectorizer), the normalization (tf-idf) and the classifier (here SVM) in a single object.

In [44]:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect_count_svm', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svm_cl', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                             random_state=42, max_iter=5, tol=None)),
])
pipeline.fit(trainset['posts'], df)

Pipeline(memory=None,
     steps=[('vect_count_svm', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
...ty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False))])

In [45]:
svm_predicted = pipeline.predict(testset['posts'])

Evaluation of the SVM prediction accuracy.

In [46]:
metrics.accuracy_score(df_test, svm_predicted)

0.66113416320885199

The prediction is much better than with the Naive Bayes classifier (MultinomialNB), even though it is still not very good: around 66% of posts are correctly classified, compared with 21% earlier.

### 3. Using another classifier: Logistic regression

I will try another classifier provided by scikit-learn: logistic regression, which is a linear model.

In [47]:
#import 
from sklearn.linear_model import LogisticRegression
#initialization of the model
loreg = LogisticRegression()
#train the model
loreg.fit(train_tfidf, df)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [48]:
#prediction class
loreg_predicted = loreg.predict(test_tfidf)

In [49]:
#prediction accuracy
metrics.accuracy_score(df_test, loreg_predicted)

0.58690640848317199

With the logistic regression model, 59% of posts are correctly classified, which is somewhat lower than with SGDClassifier.

## V. Improve Bag of words

As the prediction accuracy is not very good, I will try to improve the vectorization.

In [50]:
# 1: Improve by removing the English stop words (the posts in the original dataset are in English)
# 2: Improve by adding groups of two words (2-grams) to the vocabulary
# 3: Improve by removing words that appear in more than 50% of the posts
# 4: Improve by removing words that appear in fewer than two posts
improve_count_vect = CountVectorizer(stop_words='english', ngram_range=(1, 2), max_df=0.5, min_df=2)
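
What ngram_range, min_df and max_df do can be sketched by hand on a few tokenized posts. This is a hand-rolled illustration, not the CountVectorizer implementation; the ngrams helper and the toy posts are assumptions, and the upper threshold is set to 75% instead of the 50% used above so that something survives the filter on such a tiny corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    # Contiguous groups of n tokens joined by a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical tokenized posts (stop words already removed).
posts = [
    ["love", "quiet", "books"],
    ["love", "quiet", "evenings"],
    ["love", "loud", "parties"],
    ["love", "quiet", "walks"],
]

# Terms of each post: 1-grams plus 2-grams, as with ngram_range=(1, 2).
terms_per_post = [set(ngrams(p, 1) + ngrams(p, 2)) for p in posts]

# Document frequency: number of posts containing each term.
doc_freq = Counter(term for terms in terms_per_post for term in terms)

# Keep terms found in at least 2 posts (min_df=2) and in at most 75% of posts.
min_df, max_df = 2, 0.75
kept = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df / len(posts) <= max_df
)
print(kept)
```

Here "love" is dropped because it appears in every post, the terms seen in a single post are dropped by min_df, and only "quiet" and the 2-gram "love quiet" remain.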

### 1. Vectorization on train and test set

In [51]:
#train vectorization
improve_train_counts = improve_count_vect.fit_transform(train_for_vec['posts'])

In [52]:
#test vectorization
improve_test_counts = improve_count_vect.transform(testset['posts'])

### 2. TF-IDF on train and test set

In [53]:
#train tf-idf
improve_train_tfidf = tfidf_transformer.fit_transform(improve_train_counts)
improve_train_tfidf.shape

(6506, 413901)

In [54]:
#test tf-idf
improve_test_tfidf = tfidf_transformer.transform(improve_test_counts)
improve_test_tfidf.shape

(2169, 413901)

### 3. Redo models with this improvement

#### a) MultinomialNB

In [55]:
# Initialization & Train of the model
improve_cl = MultinomialNB().fit(improve_train_tfidf, df)

In [56]:
#Prediction class
BN_improve_predicted = improve_cl.predict(improve_test_tfidf)

In [57]:
#Accuracy of the model
metrics.accuracy_score(df_test, BN_improve_predicted)

0.21023513139695713

The changes to the vectorization make no real difference for the MultinomialNB classifier: the accuracy stays almost the same (around 21% of posts are correctly classified).

#### b) SVM 

In [58]:
# Initialization & Train of the model
improve_svm_cl = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(improve_train_tfidf, df)

In [59]:
#Prediction class
svm_improve_predicted = improve_svm_cl.predict(improve_test_tfidf)

In [60]:
#Accuracy of the model
metrics.accuracy_score(df_test, svm_improve_predicted)

0.66943291839557395

Here again the changes to CountVectorizer have little effect: the classification accuracy is almost the same as before (67% versus 66%).

#### c) Logistic Regression

In [61]:
#Initialization & Train of the model
improve_loreg = LogisticRegression().fit(improve_train_tfidf, df)

In [62]:
#Prediction class
improve_loreg_predicted = improve_loreg.predict(improve_test_tfidf)

In [63]:
#Accuracy of the model
metrics.accuracy_score(df_test, improve_loreg_predicted)

0.58091286307053946

Same observation as before: the result is not better after the modifications (58% versus 59%).

### 4. Conclusion on improvement

To conclude, these changes were not effective: the accuracy of the models did not improve in any meaningful way.

## VI. Global conclusion

We have tested several models to assign posts to the right MBTI type: MultinomialNB, an SVM trained with stochastic gradient descent (SGDClassifier) and Logistic Regression. We have also tried to improve the vectorization (the tokenization and occurrence count steps) by adding parameters to CountVectorizer(), but this had no real effect, because the prediction accuracy did not improve.

Currently the best classifier for assigning posts to an MBTI type is the SVM (SGDClassifier) with around 66% accuracy (67% with the modified vectorization), followed by Logistic Regression with around 58-59% and MultinomialNB with around 21%.