# Project MBTI

Create a model to predict MBTI personality types from posts

Data format:
    =>type (string): MBTI types
    =>posts (string): text posts 

## Data exploration

In [1]:
import numpy as np
import csv
import pandas as pd

Load with pandas read_csv to avoid delimiter problems between the header and the data

In [2]:
# Raw string so that the backslashes of the Windows path are not read as escape sequences
data2 = pd.read_csv(r"C:\Users\Gwen\MBTI\mbti_1.csv", header=0)

In [3]:
print (data2)

      type                                              posts
0     INFJ  'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1     ENTP  'I'm finding the lack of me in these posts ver...
2     INTP  'Good one  _____   https://www.youtube.com/wat...
3     INTJ  'Dear INTP,   I enjoyed our conversation the o...
4     ENTJ  'You're fired.|||That's another silly misconce...
5     INTJ  '18/37 @.@|||Science  is not perfect. No scien...
6     INFJ  'No, I can't draw on my own nails (haha). Thos...
7     INTJ  'I tend to build up a collection of things on ...
8     INFJ  I'm not sure, that's a good question. The dist...
9     INTP  'https://www.youtube.com/watch?v=w8-egj0y8Qs||...
10    INFJ  'One time my parents were fighting over my dad...
11    ENFJ  'https://www.youtube.com/watch?v=PLAaiKvHvZs||...
12    INFJ  'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
13    INTJ  'Fair enough, if that's how you want to look a...
14    INTP  'Basically this...  https://youtu.be/1pH5c1Jkh...
15    IN

Test columns

In [4]:
data2.columns

Index([u'type', u'posts'], dtype='object')

In [5]:
data2[:0]

Unnamed: 0,type,posts


Overview of the dataset

In [6]:
print(data2.shape)

(8675, 2)


We have 8675 rows and 2 columns, which matches the input dataset.

### Frequency of each MBTI type in the dataset

In [7]:
data2.groupby('type').count()

      posts
type
ENFJ    190
ENFP    675
ENTJ    231
ENTP    685
ESFJ     42
ESFP     48
ESTJ     39
ESTP     89
INFJ   1470
INFP   1832


### Create category MBTI list

In [8]:
list(set(data2.type))

['ENFJ',
 'ESFP',
 'INFJ',
 'ESTJ',
 'ISTJ',
 'ENTJ',
 'ISFP',
 'INTJ',
 'ISTP',
 'ENTP',
 'ISFJ',
 'INTP',
 'ESFJ',
 'ESTP',
 'ENFP',
 'INFP']

In [9]:
category = list(data2['type'].unique())

In [10]:
print (category)

['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP', 'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ']


In [11]:
dict_category = {}
for idx, x in enumerate(category):
    dict_category[x]= idx+1
print(dict_category)

{'ENFJ': 6, 'ESFP': 14, 'INFJ': 1, 'ESTJ': 15, 'ISTJ': 12, 'ENTJ': 5, 'ISFP': 9, 'INTJ': 4, 'ISTP': 10, 'ENTP': 2, 'ISFJ': 11, 'INTP': 3, 'ESFJ': 16, 'ESTP': 13, 'ENFP': 8, 'INFP': 7}
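
Since the classifier will later work with these numbers, it can help to be able to go back from a number to its MBTI type. The sketch below rebuilds the same dictionary from the category order printed above and adds a reverse lookup. The names `inverse_category` and `sample_labels` are my own illustrative choices, not part of the notebook.

```python
# Same construction as in the notebook, using the category order printed above.
category = ['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP',
            'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ']
dict_category = {mbti: idx + 1 for idx, mbti in enumerate(category)}

# Hypothetical helper: reverse lookup from numeric label to MBTI type.
inverse_category = {num: mbti for mbti, num in dict_category.items()}

sample_labels = [2, 14, 7]  # illustrative numeric labels
print([inverse_category[n] for n in sample_labels])  # ['ENTP', 'ESFP', 'INFP']
```

The same reverse lookup could be applied to the output of a classifier to read the predictions as MBTI types.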


## Data cleaning

Plan:
1) Remove URLs
2) Vectorization

### Remove URL using regex

In [12]:
import re

We apply the regex substitution to the posts column

In [13]:
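# Remove every token that starts with "http" and continues with URL characters.
# Assigning the whole column at once avoids chained assignment (data2['posts'][idx] = ...).
url_pattern = re.compile(r'http[a-zA-Z0-9./?=:_-]*')
data2['posts'] = data2['posts'].apply(lambda post: url_pattern.sub('', post))
print(data2['posts'])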

0       '||||||enfp and intj moments    sportscenter n...
1       'I'm finding the lack of me in these posts ver...
2       'Good one  _____   |||Of course, to which I sa...
3       'Dear INTP,   I enjoyed our conversation the o...
4       'You're fired.|||That's another silly misconce...
5       '18/37 @.@|||Science  is not perfect. No scien...
6       'No, I can't draw on my own nails (haha). Thos...
7       'I tend to build up a collection of things on ...
8       I'm not sure, that's a good question. The dist...
9       '|||I'm in this position where I have to actua...
10      'One time my parents were fighting over my dad...
11      '|||51 :o|||I went through a break up some mon...
12      'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
13      'Fair enough, if that's how you want to look a...
14      'Basically this...  |||Can I has Cheezburgr?||...
15      'Your comment screams INTJ, bro. Especially th...
16      'some of these both excite and calm me:  BUTTS...
17      'I thi
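
To see what the URL pattern does and does not remove, here is a small check on two made-up strings. `url_pattern` is a compact but equivalent form of the regex used in the cell above. `broader_pattern` is only a suggestion of mine: it also covers characters such as `&` or `%` and stops at whitespace or at the `|||` separator.

```python
import re

# Compact, equivalent form of the pattern used in the notebook.
url_pattern = re.compile(r'http[a-zA-Z0-9./?=:_-]*')

sample = "'http://www.youtube.com/watch?v=qsXHcwe3krw|||enfp and intj moments"
print(url_pattern.sub('', sample))  # '|||enfp and intj moments

# Limitation: characters outside the class (here '&') end the match early.
sample_with_query = "see https://example.com/a?x=1&y=2|||next post"
print(url_pattern.sub('', sample_with_query))  # see &y=2|||next post

# Assumed alternative: match up to the next whitespace or '|' character.
broader_pattern = re.compile(r'https?://[^\s|]+')
print(broader_pattern.sub('', sample_with_query))  # see |||next post
```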

In [14]:
print(data2)

      type                                              posts
0     INFJ  '||||||enfp and intj moments    sportscenter n...
1     ENTP  'I'm finding the lack of me in these posts ver...
2     INTP  'Good one  _____   |||Of course, to which I sa...
3     INTJ  'Dear INTP,   I enjoyed our conversation the o...
4     ENTJ  'You're fired.|||That's another silly misconce...
5     INTJ  '18/37 @.@|||Science  is not perfect. No scien...
6     INFJ  'No, I can't draw on my own nails (haha). Thos...
7     INTJ  'I tend to build up a collection of things on ...
8     INFJ  I'm not sure, that's a good question. The dist...
9     INTP  '|||I'm in this position where I have to actua...
10    INFJ  'One time my parents were fighting over my dad...
11    ENFJ  '|||51 :o|||I went through a break up some mon...
12    INFJ  'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
13    INTJ  'Fair enough, if that's how you want to look a...
14    INTP  'Basically this...  |||Can I has Cheezburgr?||...
15    IN

## Data preparation: Bag of words

The objective is to extract numerical feature vectors from text in three steps: tokenizing the strings (splitting the corpus into words, using whitespace and punctuation as delimiters, and giving each token a numerical id), counting the occurrences of each token in each post, and normalizing the counts (weighting each token by its importance within a post and across posts). Since most features are zero after tokenization (a given post uses only a small part of the vocabulary), storage needs care: the counts are kept in a sparse matrix, and TF-IDF is then used to reweight them.
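
To make the tokenize-and-count idea concrete, here is a minimal sketch on a two-post toy corpus (the corpus is invented for illustration). It uses scikit-learn's `CountVectorizer` with its default settings, which lowercase the text and keep tokens of at least two characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_posts = ["the cat sat", "the cat sat on the mat"]  # illustrative corpus

vect = CountVectorizer()
counts = vect.fit_transform(toy_posts)  # sparse matrix, one row per post

# Tokens ordered by the column index they were assigned
tokens = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(tokens)            # ['cat', 'mat', 'on', 'sat', 'the']
print(counts.toarray())  # [[1 0 0 1 1]
                         #  [1 1 1 1 2]]
```

Each column is one token of the vocabulary and each cell is the number of times this token appears in the post.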

First I will construct the train and test sets. This needs to be done before the vectorization, because the post/term matrix must not contain the words of all the posts, only those of the trainset.

### Train and Test set

I will now divide the dataset into a trainset (75%) and a testset (25%) to run the models. I will use the train/test split function provided by scikit-learn.

In [15]:
from sklearn.model_selection import train_test_split
trainset, testset = train_test_split(data2, test_size=0.25)

In [16]:
print(trainset)

      type                                              posts
3735  ENTP  'Why isn't anyone answering the P.S. question?...
634   ESFP  'Leslie: ENFJ Ann: ESFJ Apirl: INFP Tom: ESTP ...
3302  INFP  'There used to be a similarminds forums, but i...
1957  INTP  'Or maybe it intentionally left you out just t...
6568  INTP  'My parents always told me I would need to put...
2063  INFJ  Aw this sucks. I am in Arizona. Have fun guys ...
3226  ESTP  'The way you describe sensory details and life...
8017  INFJ  'Istp|||Another ENTP!  Holy shit.  Are we bein...
4799  ENTP  'death|||rip|||Nah, in the Chance Me part, the...
3556  INTP  'well i havent read your previous post, but br...
5563  INTJ  'At my work, I worked the register, and dealt ...
6861  INFP  'ISTJ dad: What are you thinking about?  INFP ...
5731  ENFP  '7w6 ENFP here|||Im in ISTJ mode   I think i n...
1272  INTP  'My father is an ISTP and I can attest to this...
7024  INFP  'Anyone can be anything and we all have our sh...
8377  IS

In [17]:
trainset.count()

type     6506
posts    6506
dtype: int64

A trainset with 75% of the dataframe (8675 rows)

In [18]:
testset.count()

type     2169
posts    2169
dtype: int64

A testset with 25% of the dataframe (8675 rows)
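
The split above is random and not seeded, so the rows in each set change on every run. As a possible refinement (not what the notebook does), `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the proportion of each type in both sets, which may matter for rare types such as ESTJ. The toy frame below is invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative frame: 32 posts of a frequent type, 8 of a rare one
toy = pd.DataFrame({
    "type": ["INFP"] * 32 + ["ESTJ"] * 8,
    "posts": ["post %d" % i for i in range(40)],
})

train, test = train_test_split(
    toy,
    test_size=0.25,
    random_state=0,        # fixed seed: same split on every run
    stratify=toy["type"],  # keep the 80/20 type ratio in both sets
)
print(len(train), len(test))
print(test["type"].value_counts().to_dict())
```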

#### Transform the types into numbers with the category dictionary

##### For train set

In [19]:
trainset_type_num = trainset.copy()  # explicit copy, so the later column assignment does not touch trainset
print(trainset_type_num)

      type                                              posts
3735  ENTP  'Why isn't anyone answering the P.S. question?...
634   ESFP  'Leslie: ENFJ Ann: ESFJ Apirl: INFP Tom: ESTP ...
3302  INFP  'There used to be a similarminds forums, but i...
1957  INTP  'Or maybe it intentionally left you out just t...
6568  INTP  'My parents always told me I would need to put...
2063  INFJ  Aw this sucks. I am in Arizona. Have fun guys ...
3226  ESTP  'The way you describe sensory details and life...
8017  INFJ  'Istp|||Another ENTP!  Holy shit.  Are we bein...
4799  ENTP  'death|||rip|||Nah, in the Chance Me part, the...
3556  INTP  'well i havent read your previous post, but br...
5563  INTJ  'At my work, I worked the register, and dealt ...
6861  INFP  'ISTJ dad: What are you thinking about?  INFP ...
5731  ENFP  '7w6 ENFP here|||Im in ISTJ mode   I think i n...
1272  INTP  'My father is an ISTP and I can attest to this...
7024  INFP  'Anyone can be anything and we all have our sh...
8377  IS

In [20]:
trainset_type_num['type'] = trainset_type_num['type'].replace(dict_category)
print(trainset_type_num)

      type                                              posts
3735     2  'Why isn't anyone answering the P.S. question?...
634     14  'Leslie: ENFJ Ann: ESFJ Apirl: INFP Tom: ESTP ...
3302     7  'There used to be a similarminds forums, but i...
1957     3  'Or maybe it intentionally left you out just t...
6568     3  'My parents always told me I would need to put...
2063     1  Aw this sucks. I am in Arizona. Have fun guys ...
3226    13  'The way you describe sensory details and life...
8017     1  'Istp|||Another ENTP!  Holy shit.  Are we bein...
4799     2  'death|||rip|||Nah, in the Chance Me part, the...
3556     3  'well i havent read your previous post, but br...
5563     4  'At my work, I worked the register, and dealt ...
6861     7  'ISTJ dad: What are you thinking about?  INFP ...
5731     8  '7w6 ENFP here|||Im in ISTJ mode   I think i n...
1272     3  'My father is an ISTP and I can attest to this...
7024     7  'Anyone can be anything and we all have our sh...
8377    

POSTS TRAIN: DataFrame of only the posts of the trainset

In [21]:
trainset_posts_only = pd.DataFrame(data=trainset_type_num['posts'])
print(trainset_posts_only)

                                                  posts
3735  'Why isn't anyone answering the P.S. question?...
634   'Leslie: ENFJ Ann: ESFJ Apirl: INFP Tom: ESTP ...
3302  'There used to be a similarminds forums, but i...
1957  'Or maybe it intentionally left you out just t...
6568  'My parents always told me I would need to put...
2063  Aw this sucks. I am in Arizona. Have fun guys ...
3226  'The way you describe sensory details and life...
8017  'Istp|||Another ENTP!  Holy shit.  Are we bein...
4799  'death|||rip|||Nah, in the Chance Me part, the...
3556  'well i havent read your previous post, but br...
5563  'At my work, I worked the register, and dealt ...
6861  'ISTJ dad: What are you thinking about?  INFP ...
5731  '7w6 ENFP here|||Im in ISTJ mode   I think i n...
1272  'My father is an ISTP and I can attest to this...
7024  'Anyone can be anything and we all have our sh...
8377  'I like Sandbox games.   Stardew Valley Don't ...
7160  'My roommate for the last three years and 

LABEL NUMBER TRAIN: DataFrame of only the numerical type values of the trainset

In [22]:
trainset_type_num_only = pd.DataFrame(data=trainset_type_num['type'])

In [46]:
#train set type column only
df = trainset_type_num_only.values
print(df)
df.shape

[[ 2]
 [14]
 [ 7]
 ..., 
 [ 8]
 [ 1]
 [ 1]]


(6506L, 1L)

##### For test set

In [24]:
testset_type_num = testset.copy()  # explicit copy, so the later column assignment does not touch testset
print(testset_type_num)

      type                                              posts
8548  INTP  'Change my username name to XZ9|||There are mu...
4553  INTP  'I thought I would be alone forever but now I'...
8672  INTP  'So many questions when i do these things.  I ...
4488  INFP  'Accurate. Though certain of my dominant Fi, I...
7025  ISFP  'black is my favorite|||ergo proxy and hell gi...
7128  INFJ  'I get it. The fairy dust kind of disappears. ...
5249  INTJ  'the veiliathan|||frosty u 've all our respect...
2972  INFJ  'Ha! Jung is my hero and I have had a somewhat...
119   ENFP  'Just got married to my ESFP today! :proud: 68...
5149  INFP  'For a long while I've been identifying as a t...
965   INFP  I second this, especially if we're not emotion...
5388  INFJ  '|||You know you're really lonely when you com...
5479  INTP  'Whatcha talkin bout Willis?  INTP's are proba...
5250  INFJ  'I have the SAME argument with my INTP boyfrie...
1900  ENTP  'You're not manipulating the rule, you're brea...
6147  IN

In [25]:
testset_type_num['type'] = testset_type_num['type'].replace(dict_category)
print(testset_type_num)

      type                                              posts
8548     3  'Change my username name to XZ9|||There are mu...
4553     3  'I thought I would be alone forever but now I'...
8672     3  'So many questions when i do these things.  I ...
4488     7  'Accurate. Though certain of my dominant Fi, I...
7025     9  'black is my favorite|||ergo proxy and hell gi...
7128     1  'I get it. The fairy dust kind of disappears. ...
5249     4  'the veiliathan|||frosty u 've all our respect...
2972     1  'Ha! Jung is my hero and I have had a somewhat...
119      8  'Just got married to my ESFP today! :proud: 68...
5149     7  'For a long while I've been identifying as a t...
965      7  I second this, especially if we're not emotion...
5388     1  '|||You know you're really lonely when you com...
5479     3  'Whatcha talkin bout Willis?  INTP's are proba...
5250     1  'I have the SAME argument with my INTP boyfrie...
1900     2  'You're not manipulating the rule, you're brea...
6147    

In [26]:
testset_posts_only = pd.DataFrame(data=testset_type_num['posts'])
print(testset_posts_only)

                                                  posts
8548  'Change my username name to XZ9|||There are mu...
4553  'I thought I would be alone forever but now I'...
8672  'So many questions when i do these things.  I ...
4488  'Accurate. Though certain of my dominant Fi, I...
7025  'black is my favorite|||ergo proxy and hell gi...
7128  'I get it. The fairy dust kind of disappears. ...
5249  'the veiliathan|||frosty u 've all our respect...
2972  'Ha! Jung is my hero and I have had a somewhat...
119   'Just got married to my ESFP today! :proud: 68...
5149  'For a long while I've been identifying as a t...
965   I second this, especially if we're not emotion...
5388  '|||You know you're really lonely when you com...
5479  'Whatcha talkin bout Willis?  INTP's are proba...
5250  'I have the SAME argument with my INTP boyfrie...
1900  'You're not manipulating the rule, you're brea...
6147  'So over three months ago, I was feeling prett...
5660  In matters of romance it comes down to the

In [27]:
test_post_array = testset_posts_only.values
print(test_post_array)

[[ "'Change my username name to XZ9|||There are multiple tri-type combinations. Each gives you a personality.  8-6-3 The justice fighter  This is really typical of ESTJ description. Others may know this type as a old school military...|||I wanted to add that there are certain types of ESTJ I find more attractive. The most I find attractive is the old school military general type.|||This pairing is recommended because INTP and ESTJ have dominate Ti and Te who share the same view of the world.  |||I find myself attractive to the idea of an ESTP. However, I think this pairing is horrible in the long run. I can't image this working out all. The reason why I'm drawn to ESTP is because of their...|||I have heard your idea that ESTJs are less likely to come out of the closet before due to their masculinity.|||The question asks how common are lgbt ESTJs. Given the fact that ESTJ are a numerous personality type by 10%, 3% of the population are homosexual. Doing simple probability statistics, th

LABEL NUMBER TEST: DataFrame of only the numerical type values of the testset

In [28]:
testset_type_num_only = pd.DataFrame(data=testset_type_num['type'])
print(testset_type_num_only)

      type
8548     3
4553     3
8672     3
4488     7
7025     9
7128     1
5249     4
2972     1
119      8
5149     7
965      7
5388     1
5479     3
5250     1
1900     2
6147     7
5660    15
5092     7
1237     1
4104     7
8087     7
3284     3
1732     7
403      9
7611     1
908      2
1243     3
5052     1
6181     7
5818     2
...    ...
5125     7
4147     3
4214     9
4072     8
7858     1
4958     6
4135     2
7329    14
6098     8
738      8
2852     7
6529     3
2412     7
7935     7
6759    13
3439     1
6516     7
7788     4
3816    10
2665     9
1372     1
6654     4
186      4
2194     2
5158    11
934      7
1093     6
3712     6
4233    10
3267     3

[2169 rows x 1 columns]


In [49]:
#test set type column only
df_test = testset_type_num_only.values
print(df_test)
df_test.shape

[[ 3]
 [ 3]
 [ 3]
 ..., 
 [ 6]
 [10]
 [ 3]]


(2169L, 1L)

### Vectorization using CountVectorizer of scikit-learn

CountVectorizer does both (1) the tokenization and (2) the counting of occurrences in a single class

For train

In [29]:
train_for_vec = trainset
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_for_vec['posts'])

For test

In [30]:
test_counts = count_vect.transform(testset['posts'])

In [31]:
train_counts.shape

(6506, 94411)

In [32]:
test_counts.shape

(2169, 94411)

This gives us two sparse matrices: one of 6506 x 94411 (for train) with 614 237 966 elements (zeros included, as we can see below) and one of 2169 x 94411 (for test) with 204 777 459 elements (zeros included, as we can see below).
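
The element counts above follow directly from the shapes, and they also show why the sparse format matters. The short calculation below assumes 8 bytes per cell, which is what a dense int64 array (the dtype shown by `toarray()`) would need.

```python
n_train, n_test, n_terms = 6506, 2169, 94411  # shapes reported above

train_cells = n_train * n_terms
test_cells = n_test * n_terms
print(train_cells, test_cells)  # 614237966 204777459

# A dense int64 array stores every cell, zeros included, at 8 bytes each
dense_train_gib = train_cells * 8 / 2**30
print("dense train matrix: %.1f GiB" % dense_train_gib)
```

A sparse matrix only stores the non-zero cells (their number is available as `train_counts.nnz`), so calling `toarray()` on the full matrices is best kept for inspection.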

In [33]:
train_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [34]:
test_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [35]:
print(train_counts)

  (0, 19017)	1
  (0, 66360)	1
  (0, 69425)	1
  (0, 56876)	1
  (0, 53275)	1
  (0, 85356)	1
  (0, 78071)	1
  (0, 10967)	1
  (0, 72281)	1
  (0, 56702)	1
  (0, 31096)	1
  (0, 91886)	1
  (0, 76206)	1
  (0, 147)	1
  (0, 88807)	1
  (0, 33116)	1
  (0, 2405)	1
  (0, 2365)	1
  (0, 21019)	1
  (0, 2383)	1
  (0, 2350)	1
  (0, 3202)	1
  (0, 88272)	1
  (0, 114)	1
  (0, 88859)	1
  :	:
  (6505, 41113)	1
  (6505, 52870)	2
  (6505, 13766)	7
  (6505, 47724)	4
  (6505, 11180)	3
  (6505, 71610)	1
  (6505, 11551)	1
  (6505, 10037)	17
  (6505, 91001)	1
  (6505, 12907)	1
  (6505, 91695)	5
  (6505, 17234)	5
  (6505, 32077)	1
  (6505, 46205)	10
  (6505, 46378)	4
  (6505, 33154)	1
  (6505, 83288)	12
  (6505, 43368)	3
  (6505, 92119)	1
  (6505, 44008)	8
  (6505, 84459)	12
  (6505, 90174)	3
  (6505, 93474)	5
  (6505, 83325)	17
  (6505, 91253)	1


In [36]:
print(test_counts)

  (0, 147)	1
  (0, 1626)	1
  (0, 3461)	1
  (0, 4110)	1
  (0, 6369)	1
  (0, 6778)	1
  (0, 7527)	5
  (0, 7616)	1
  (0, 7625)	1
  (0, 7775)	2
  (0, 7985)	1
  (0, 7988)	1
  (0, 8108)	2
  (0, 8595)	1
  (0, 8703)	1
  (0, 9323)	2
  (0, 9916)	8
  (0, 9972)	1
  (0, 10037)	27
  (0, 10588)	4
  (0, 10820)	1
  (0, 10832)	1
  (0, 10890)	1
  (0, 10938)	1
  (0, 11180)	26
  :	:
  (2168, 91152)	1
  (2168, 91337)	1
  (2168, 91428)	2
  (2168, 91695)	23
  (2168, 91721)	1
  (2168, 91879)	4
  (2168, 91942)	1
  (2168, 91953)	1
  (2168, 92037)	1
  (2168, 92059)	1
  (2168, 92065)	5
  (2168, 92080)	1
  (2168, 92092)	1
  (2168, 92106)	1
  (2168, 92119)	2
  (2168, 92177)	1
  (2168, 92199)	2
  (2168, 93145)	2
  (2168, 93160)	1
  (2168, 93279)	2
  (2168, 93302)	1
  (2168, 93313)	1
  (2168, 93474)	7
  (2168, 93506)	1
  (2168, 93523)	3


In [37]:
count_vect.vocabulary_.get('crazy')

23293

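For example, the word "crazy" is mapped to column index 23293 of the count matrix. Note that vocabulary_ stores the feature index of each token, not its number of occurrences in the posts.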
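
The value returned by `vocabulary_.get(...)` is a column index. To get actual occurrence counts, one option is to sum the corresponding column of the count matrix. Here is a sketch on an invented three-post corpus; on the real data the same lines would be applied to `count_vect` and `train_counts`.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_posts = ["crazy idea", "not crazy at all", "crazy crazy day"]  # illustrative

vect = CountVectorizer()
counts = vect.fit_transform(toy_posts)

col = vect.vocabulary_.get("crazy")          # column index of the token
column = counts[:, col].toarray().ravel()    # counts of "crazy" in each post
total = int(column.sum())                    # occurrences over all posts
n_posts = int((column > 0).sum())            # number of posts containing it
print(col, total, n_posts)  # 2 4 3
```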

We have now (1) tokenized the post contents and (2) counted the word occurrences, but it is better to work with frequencies.

Currently we are using only a dictionary of 1-grams (individual words), because the default value for CountVectorizer() is ngram_range=(1, 1). We could add 2-grams (using the parameter ngram_range=(1, 2)) to enrich the dictionary with groups of two words, but this takes much more space. So for the moment I will use only 1-grams (and maybe enrich later if necessary).
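
To illustrate how the dictionary grows with 2-grams, here is a small comparison on an invented two-post corpus. The variable names are mine.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_posts = ["good morning", "good night"]  # illustrative corpus

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(toy_posts)
uni_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(toy_posts)

print(sorted(unigrams.vocabulary_))
# ['good', 'morning', 'night']
print(sorted(uni_bigrams.vocabulary_))
# ['good', 'good morning', 'good night', 'morning', 'night']
```

Even on two short posts the vocabulary goes from 3 to 5 entries, so on the full corpus the number of columns can be expected to grow well beyond 94411.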

### TF-IDF

Raw counts favour long posts and very common words, so I will use the TF-IDF (term frequency-inverse document frequency) method. First (1: tf), each occurrence count of a word in a post is divided by the total number of words in this post, which gives new features called tf. Second (2: tf-idf), less weight (importance) is given to words that appear in many posts of the corpus, since they carry less information than words that appear in only a small share of the posts.

$ \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t) $
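
As a worked example of the formula, here is a minimal pure-Python sketch of the textbook variant, assuming tf is the count divided by the post length and idf is log(N / df). The helper names and the toy posts are mine. scikit-learn's `TfidfTransformer` uses a slightly different default (raw counts as tf, a smoothed idf, then L2 normalization of each row), so its numbers will not match these exactly.

```python
import math
from collections import Counter

toy_posts = [["cat", "sat"], ["cat", "cat", "mat"]]  # already tokenized, illustrative

def tf(term, post):
    """Share of the post's tokens that are `term`."""
    return Counter(post)[term] / len(post)

def idf(term, posts):
    """log(N / df); assumes `term` occurs in at least one post."""
    df = sum(1 for post in posts if term in post)
    return math.log(len(posts) / df)

def tfidf(term, post, posts):
    return tf(term, post) * idf(term, posts)

# "cat" is in every post, so its idf (and tf-idf) is zero
print(tfidf("cat", toy_posts[0], toy_posts))            # 0.0
# "sat" is in one post out of two: 0.5 * log(2)
print(round(tfidf("sat", toy_posts[0], toy_posts), 3))  # 0.347
```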

In [41]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
train_tfidf.shape

(6506, 94411)

With fit_transform() we do two things in one call: fit(), which fits the estimator to the data (here it learns the idf weight of each term from the train counts), and transform(), which turns the matrix of occurrences (train_counts) into a tf-idf matrix.

In [42]:
test_tfidf = tfidf_transformer.transform(test_counts)
test_tfidf.shape

(2169, 94411)

To be usable by the model, the train and test tf-idf matrices need to have the same number of columns (94411)

## Models 

### Training a classifier

Now we train a classifier using all those features to classify posts. We will use a basic Naive Bayes classifier named MultinomialNB, also provided by scikit-learn.

In [47]:
from sklearn.naive_bayes import MultinomialNB
# ravel() flattens the (n, 1) label column into the 1-D array scikit-learn expects
cl = MultinomialNB().fit(train_tfidf, df.ravel())

I could also use a Pipeline (scikit-learn class sklearn.pipeline.Pipeline) to chain the vectorization, the tf-idf and the classifier more quickly and easily. It could be an improvement for later.
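
A minimal sketch of what such a Pipeline could look like, chaining the same three steps as above. The step names (`vect`, `tfidf`, `clf`) and the toy posts and labels are my own placeholders. On the real data, the posts column of the trainset and its numeric labels would be passed to `fit` instead.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Illustrative stand-ins for the train posts and their numeric labels
toy_posts = ["I love quiet evenings with a book",
             "big parties give me energy",
             "reading alone is my favorite thing",
             "let's go out and meet new people"]
toy_labels = [1, 2, 1, 2]

text_clf = Pipeline([
    ("vect", CountVectorizer()),    # tokenize and count
    ("tfidf", TfidfTransformer()),  # reweight the counts
    ("clf", MultinomialNB()),       # classify
])
text_clf.fit(toy_posts, toy_labels)

# predict() applies the fitted vectorizer and tf-idf before the classifier
predicted = text_clf.predict(["a quiet book", "meet people at parties"])
print(predicted)
```

With a Pipeline, the vocabulary and idf weights are learned from the training posts only, and the same transformations are reused on new posts without extra code.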

### Test of the model performance on testset

Predict a class for each row of the test tf-idf matrix

In [48]:
predicted = cl.predict(test_tfidf)

Now we calculate the prediction accuracy using the metrics module from scikit-learn

In [50]:
from sklearn import metrics
metrics.accuracy_score(df_test, predicted)

0.20470262793914246

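I now have the accuracy of the first results, which is pretty bad: only 20% of the posts are well classified (associated with the right MBTI type). For comparison, always predicting INFP (1832 of the 8675 rows, about 21%) would be expected to score about the same.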
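
To put an accuracy figure in context, it can be compared with a trivial baseline that always predicts the most frequent training label. The helper below is a sketch of mine on invented toy labels. scikit-learn's `DummyClassifier(strategy="most_frequent")` offers the same idea as an estimator.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent train label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)

toy_train = [7, 7, 7, 1, 3]  # illustrative numeric labels
toy_test = [7, 1, 7, 3]
print(majority_baseline_accuracy(toy_train, toy_test))  # 0.5

# Share of INFP rows in the whole dataset, from the counts shown earlier
infp_share = 1832 / 8675
print(round(infp_share, 3))  # 0.211
```

A model is only informative if it clearly beats this kind of baseline.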