# Project MBTI

Create a model to predict MBTI personality types from posts

Data format:
    =>type (string): MBTI types
    =>posts (string): text posts 

## Data exploration

In [1]:
import numpy as np
import csv
import pandas as pd

Load with pandas read_csv to avoid delimiter problems between the header and the data.

In [2]:
data2 = pd.read_csv(r"C:\Users\Gwen\MBTI\mbti_1.csv", header=0)  # raw string keeps the backslashes literal

In [3]:
print (data2)

      type                                              posts
0     INFJ  'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1     ENTP  'I'm finding the lack of me in these posts ver...
2     INTP  'Good one  _____   https://www.youtube.com/wat...
3     INTJ  'Dear INTP,   I enjoyed our conversation the o...
4     ENTJ  'You're fired.|||That's another silly misconce...
5     INTJ  '18/37 @.@|||Science  is not perfect. No scien...
6     INFJ  'No, I can't draw on my own nails (haha). Thos...
7     INTJ  'I tend to build up a collection of things on ...
8     INFJ  I'm not sure, that's a good question. The dist...
9     INTP  'https://www.youtube.com/watch?v=w8-egj0y8Qs||...
10    INFJ  'One time my parents were fighting over my dad...
11    ENFJ  'https://www.youtube.com/watch?v=PLAaiKvHvZs||...
12    INFJ  'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
13    INTJ  'Fair enough, if that's how you want to look a...
14    INTP  'Basically this...  https://youtu.be/1pH5c1Jkh...
15    IN

Test columns

In [4]:
data2.columns

Index([u'type', u'posts'], dtype='object')

In [5]:
data2[:0]

Unnamed: 0,type,posts


Overview of the dataset

In [6]:
print(data2.shape)

(8675, 2)


We have 8675 rows and 2 columns, which matches the input dataset.

### Frequency of each MBTI type in the dataset

In [7]:
data2.groupby('type').count()

type,posts
ENFJ,190
ENFP,675
ENTJ,231
ENTP,685
ESFJ,42
ESFP,48
ESTJ,39
ESTP,89
INFJ,1470
INFP,1832


### Create category MBTI list

In [8]:
list(set(data2.type))

['ENFJ',
 'ESFP',
 'INFJ',
 'ESTJ',
 'ISTJ',
 'ENTJ',
 'ISFP',
 'INTJ',
 'ISTP',
 'ENTP',
 'ISFJ',
 'INTP',
 'ESFJ',
 'ESTP',
 'ENFP',
 'INFP']

In [9]:
category = list(data2['type'].unique())

In [10]:
print (category)

['INFJ', 'ENTP', 'INTP', 'INTJ', 'ENTJ', 'ENFJ', 'INFP', 'ENFP', 'ISFP', 'ISTP', 'ISFJ', 'ISTJ', 'ESTP', 'ESFP', 'ESTJ', 'ESFJ']


In [11]:
dict_category = {}
for idx, x in enumerate(category):
    dict_category[x]= idx+1
print(dict_category)

{'ENFJ': 6, 'ESFP': 14, 'INFJ': 1, 'ESTJ': 15, 'ISTJ': 12, 'ENTJ': 5, 'ISFP': 9, 'INTJ': 4, 'ISTP': 10, 'ENTP': 2, 'ISFJ': 11, 'INTP': 3, 'ESFJ': 16, 'ESTP': 13, 'ENFP': 8, 'INFP': 7}
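
The same mapping can be written as a dict comprehension, and its inverse is useful later for turning predicted numbers back into type names. A minimal sketch, assuming a shortened `category` list; the name `inverse_category` is illustrative and does not appear in the notebook:

```python
# Illustrative subset of the category list printed above
category = ['INFJ', 'ENTP', 'INTP']

# Type -> numeric label, starting at 1 as in the loop above
dict_category = {mbti: idx for idx, mbti in enumerate(category, start=1)}

# Numeric label -> type, to decode model predictions later
inverse_category = {idx: mbti for mbti, idx in dict_category.items()}

print(dict_category)
print(inverse_category[2])
```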


## Data cleaning

plan:
1) Remove URL
2) Vectorization

### Remove URL using regex

In [12]:
import re

We apply the regex substitution to the posts column.

In [13]:
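# Matches "http" followed by any run of URL characters (same character set as before)
url_pattern = r'http[A-Za-z0-9./?=:_-]*'
# Assign the whole column at once to avoid chained-assignment warnings
data2['posts'] = data2['posts'].apply(lambda x: re.sub(url_pattern, '', x))
print(data2['posts'])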

0       '||||||enfp and intj moments    sportscenter n...
1       'I'm finding the lack of me in these posts ver...
2       'Good one  _____   |||Of course, to which I sa...
3       'Dear INTP,   I enjoyed our conversation the o...
4       'You're fired.|||That's another silly misconce...
5       '18/37 @.@|||Science  is not perfect. No scien...
6       'No, I can't draw on my own nails (haha). Thos...
7       'I tend to build up a collection of things on ...
8       I'm not sure, that's a good question. The dist...
9       '|||I'm in this position where I have to actua...
10      'One time my parents were fighting over my dad...
11      '|||51 :o|||I went through a break up some mon...
12      'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
13      'Fair enough, if that's how you want to look a...
14      'Basically this...  |||Can I has Cheezburgr?||...
15      'Your comment screams INTJ, bro. Especially th...
16      'some of these both excite and calm me:  BUTTS...
17      'I thi
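
To see what the URL pattern does on a single string, here is a minimal sketch. The sample strings and the names `URL_PATTERN`, `sample` and `with_query` are made up for illustration; the character set is the one used in the cell above, written as one character class:

```python
import re

# Same characters as the notebook pattern: letters, digits and . / ? = : _ -
URL_PATTERN = re.compile(r'http[A-Za-z0-9./?=:_-]*')

sample = "'Good one  _____   https://www.youtube.com/watch?v=abc123|||Of course"
cleaned = URL_PATTERN.sub('', sample)
print(cleaned)

# Characters outside the set (for example "&") stop the match early
with_query = "see http://example.com/page?a=1&b=2 here"
leftover = URL_PATTERN.sub('', with_query)
print(leftover)
```

The second case shows a limit of this pattern: a URL with several query parameters leaves a fragment such as `&b=2` in the text, so the character set may need extending if such residue matters.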

In [14]:
print(data2)

      type                                              posts
0     INFJ  '||||||enfp and intj moments    sportscenter n...
1     ENTP  'I'm finding the lack of me in these posts ver...
2     INTP  'Good one  _____   |||Of course, to which I sa...
3     INTJ  'Dear INTP,   I enjoyed our conversation the o...
4     ENTJ  'You're fired.|||That's another silly misconce...
5     INTJ  '18/37 @.@|||Science  is not perfect. No scien...
6     INFJ  'No, I can't draw on my own nails (haha). Thos...
7     INTJ  'I tend to build up a collection of things on ...
8     INFJ  I'm not sure, that's a good question. The dist...
9     INTP  '|||I'm in this position where I have to actua...
10    INFJ  'One time my parents were fighting over my dad...
11    ENFJ  '|||51 :o|||I went through a break up some mon...
12    INFJ  'Joe santagato - ENTP|||ENFJ or  ENTP?   I'm n...
13    INTJ  'Fair enough, if that's how you want to look a...
14    INTP  'Basically this...  |||Can I has Cheezburgr?||...
15    IN

## Data preparation: Bag of words

The objective is to extract numerical feature vectors from the text in three steps: tokenizing each string (giving each token a numerical id, using white-space as the delimiter), counting the occurrences of each token in each post, and normalizing the tokens (weighting each one by its importance across posts). Since most features will be zero after tokenization, storage space must be managed: the counts are kept in a sparse matrix, and TF-IDF (term frequency-inverse document frequency) is then used for the weighting.
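
The first two steps can be sketched in plain Python. This is only an illustration on made-up posts, not how scikit-learn implements it; the `tokenize` helper and the toy `posts` list are assumptions:

```python
import re
from collections import Counter

# Toy posts standing in for the posts column (hypothetical data)
posts = ["I love science", "science is not perfect", "I love love music"]

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit
    return [tok for tok in re.split(r'[^a-z0-9]+', text.lower()) if tok]

# Step 1: give every distinct token an integer id (sorted for stable ids)
all_tokens = sorted({tok for post in posts for tok in tokenize(post)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

# Step 2: count the occurrences of each token in each post
counts = []
for post in posts:
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(post)).items():
        row[vocabulary[tok]] = n
    counts.append(row)

print(vocabulary)
print(counts)
```

Even on three short posts most entries are zero, which is why a sparse representation pays off on the full dataset.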

First, I will build the train and test sets.

### Train and Test set

I will now divide the dataset into a train set (75%) and a test set (25%) to run the models, using the train/test split function provided by scikit-learn.

In [45]:
from sklearn.model_selection import train_test_split
trainset, testset = train_test_split(data2, test_size=0.25)

In [46]:
print(trainset)

      type                                              posts
3084  INFP  You're crazy|||Aunts|||Curious|||THISSS. Felt ...
363   ENFP  'This type of metal is actually much closer to...
8220  INFJ  'Is it easier for you to maintain, or harder? ...
6659  INTJ  'I have a genuine love for science, but an ove...
4320  INFP  'Well, I'm more of a scientist, so I tend to b...
5314  INTJ  'That's a good question.|||Stereotypes are for...
4973  INFJ  'I think this is a good start.  Both INFJ and ...
4553  INTP  'I thought I would be alone forever but now I'...
2435  INTP  'careful with your broad strokes there, boss.....
3080  ENTJ  'Dear ENTP,  I used to resent you so much, but...
5896  ENTJ  'Whoa alot of hate on Trump here! I thought yo...
2998  ISTP  'Can I please have my account retired? It's ti...
534   ENTJ  'I'd very much like to believe this were true,...
5125  INFP  'People you don´t trust and like - big wall.  ...
7020  INTP  'I'm currently an engineering student, but my ...
4775  IN

In [23]:
trainset.count()

type     6506
posts    6506
dtype: int64

A train set with 75% of the dataframe (8675 rows)

In [24]:
testset.count()

type     2169
posts    2169
dtype: int64

A test set with 25% of the dataframe (8675 rows)

#### Convert the types to numbers with the category dictionary

For train set

In [47]:
trainset_type_num = trainset.copy()  # explicit copy rather than a slice of trainset
print(trainset_type_num)

      type                                              posts
3084  INFP  You're crazy|||Aunts|||Curious|||THISSS. Felt ...
363   ENFP  'This type of metal is actually much closer to...
8220  INFJ  'Is it easier for you to maintain, or harder? ...
6659  INTJ  'I have a genuine love for science, but an ove...
4320  INFP  'Well, I'm more of a scientist, so I tend to b...
5314  INTJ  'That's a good question.|||Stereotypes are for...
4973  INFJ  'I think this is a good start.  Both INFJ and ...
4553  INTP  'I thought I would be alone forever but now I'...
2435  INTP  'careful with your broad strokes there, boss.....
3080  ENTJ  'Dear ENTP,  I used to resent you so much, but...
5896  ENTJ  'Whoa alot of hate on Trump here! I thought yo...
2998  ISTP  'Can I please have my account retired? It's ti...
534   ENTJ  'I'd very much like to believe this were true,...
5125  INFP  'People you don´t trust and like - big wall.  ...
7020  INTP  'I'm currently an engineering student, but my ...
4775  IN

In [48]:
trainset_type_num['type'] = trainset_type_num['type'].replace(dict_category)
print(trainset_type_num)

      type                                              posts
3084     7  You're crazy|||Aunts|||Curious|||THISSS. Felt ...
363      8  'This type of metal is actually much closer to...
8220     1  'Is it easier for you to maintain, or harder? ...
6659     4  'I have a genuine love for science, but an ove...
4320     7  'Well, I'm more of a scientist, so I tend to b...
5314     4  'That's a good question.|||Stereotypes are for...
4973     1  'I think this is a good start.  Both INFJ and ...
4553     3  'I thought I would be alone forever but now I'...
2435     3  'careful with your broad strokes there, boss.....
3080     5  'Dear ENTP,  I used to resent you so much, but...
5896     5  'Whoa alot of hate on Trump here! I thought yo...
2998    10  'Can I please have my account retired? It's ti...
534      5  'I'd very much like to believe this were true,...
5125     7  'People you don´t trust and like - big wall.  ...
7020     3  'I'm currently an engineering student, but my ...
4775    

In [50]:
trainset_type_num_only = pd.DataFrame(data=trainset_type_num['type'])

In [57]:
#print(trainset_type_num_only)
df = trainset_type_num_only.values
print(df)

[[7]
 [8]
 [1]
 ..., 
 [4]
 [4]
 [1]]


For test set
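
The test set needs the same conversion. A minimal sketch mirroring the train set cells, assuming a toy frame in place of testset and a shortened dict_category (both hypothetical here). It uses `Series.map` instead of `replace`, and takes `.values` on the column directly so the labels come out as a 1-D array:

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (hypothetical data)
dict_category = {'INFJ': 1, 'ENTP': 2, 'INTP': 3}
testset = pd.DataFrame({'type': ['ENTP', 'INFJ', 'INTP'],
                        'posts': ['post a', 'post b', 'post c']})

# Work on an explicit copy so the original test set keeps its string labels
testset_type_num = testset.copy()
testset_type_num['type'] = testset_type_num['type'].map(dict_category)

# 1-D array of numeric labels, aligned with the rows of the test set
test_labels = testset_type_num['type'].values
print(test_labels)
```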

### Vectorization using CountVectorizer of scikit-learn

CountVectorizer performs both (1) the tokenization and (2) the occurrence counting in a single class.

In [15]:
data_for_vec = data2
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
data_counts = count_vect.fit_transform(data_for_vec['posts'])

In [59]:
train_for_vec = trainset
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_for_vec['posts'])

In [60]:
train_counts.shape

(6506, 94451)

In [16]:
data_counts.shape

(8675, 110188)

This gives us an 8675 x 110188 sparse matrix with 955,880,900 elements (zeros included, as the dense arrays below show).
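
The element count, and what it would cost to store densely, can be checked with a few lines of arithmetic. A minimal sketch, assuming 8 bytes per element because the dense output is reported as int64:

```python
rows, cols = 8675, 110188          # shape of data_counts
n_elements = rows * cols           # entries in the dense equivalent

bytes_per_int64 = 8                # assumed from dtype=int64
dense_gib = n_elements * bytes_per_int64 / 2**30

print(n_elements)
print(round(dense_gib, 2))
```

The dense form comes to roughly 7 GiB, so calling toarray() on the full matrix is only reasonable for inspection on a machine with enough memory.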

In [61]:
train_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [17]:
data_counts.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [62]:
print(train_counts)

  (0, 47394)	1
  (0, 73043)	1
  (0, 17611)	1
  (0, 63818)	1
  (0, 18903)	1
  (0, 31281)	1
  (0, 56447)	1
  (0, 90655)	1
  (0, 42404)	1
  (0, 78494)	1
  (0, 7750)	1
  (0, 91276)	1
  (0, 72524)	1
  (0, 82832)	1
  (0, 83118)	1
  (0, 72289)	1
  (0, 40473)	2
  (0, 29674)	1
  (0, 39271)	1
  (0, 51999)	1
  (0, 64968)	1
  (0, 36973)	3
  (0, 36972)	1
  (0, 11194)	1
  (0, 32270)	1
  :	:
  (6505, 9918)	11
  (6505, 50921)	5
  (6505, 85721)	1
  (6505, 83258)	3
  (6505, 84164)	1
  (6505, 46443)	9
  (6505, 27875)	3
  (6505, 17334)	4
  (6505, 51986)	1
  (6505, 60390)	12
  (6505, 44071)	5
  (6505, 91713)	9
  (6505, 9331)	4
  (6505, 8617)	1
  (6505, 60618)	1
  (6505, 12634)	1
  (6505, 59537)	1
  (6505, 84421)	61
  (6505, 83299)	10
  (6505, 10036)	17
  (6505, 83387)	1
  (6505, 35466)	1
  (6505, 77465)	4
  (6505, 69202)	7
  (6505, 93521)	1


In [18]:
print(data_counts)

  (0, 68708)	1
  (0, 92603)	1
  (0, 13880)	1
  (0, 30678)	1
  (0, 66610)	1
  (0, 105427)	1
  (0, 26167)	1
  (0, 47683)	1
  (0, 63943)	1
  (0, 67381)	2
  (0, 72581)	1
  (0, 39250)	1
  (0, 17917)	1
  (0, 10930)	1
  (0, 77066)	1
  (0, 17730)	1
  (0, 58888)	1
  (0, 91080)	1
  (0, 11066)	1
  (0, 108657)	1
  (0, 40249)	1
  (0, 91456)	1
  (0, 79159)	1
  (0, 39237)	1
  (0, 10982)	1
  :	:
  (8674, 71316)	6
  (8674, 54418)	30
  (8674, 98444)	42
  (8674, 73981)	1
  (8674, 29576)	1
  (8674, 68791)	1
  (8674, 39098)	1
  (8674, 67305)	22
  (8674, 97564)	4
  (8674, 58542)	1
  (8674, 109036)	15
  (8674, 63127)	3
  (8674, 70765)	11
  (8674, 41655)	8
  (8674, 71233)	6
  (8674, 109090)	8
  (8674, 51675)	17
  (8674, 59557)	2
  (8674, 97122)	39
  (8674, 17153)	2
  (8674, 47412)	2
  (8674, 105974)	5
  (8674, 98783)	1
  (8674, 69749)	12
  (8674, 12542)	34


In [65]:
count_vect.vocabulary_.get('crazy')

23411

In [19]:
count_vect.vocabulary_.get('moments')

65914

For example, the word "moments" has feature index 65914 in the vocabulary. This is the id of its column in the count matrix, not its number of occurrences in the posts.
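
The difference between a token's index and its number of occurrences is easy to check on a toy corpus. A minimal sketch with made-up sentences; the names `vec`, `idx` and `total` are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical data)
corpus = ["crazy moments", "moments moments here"]

vec = CountVectorizer()
counts = vec.fit_transform(corpus)

# vocabulary_ maps a token to its column index in the count matrix
idx = vec.vocabulary_.get('moments')

# Summing that column gives the total number of occurrences in the corpus
total = int(counts[:, idx].sum())

print(idx, total)
```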

We have now (1) tokenized the content of the posts and (2) counted the word occurrences, but it is better to work with frequencies.

Currently we use only a dictionary of 1-grams (individual words), because the default for CountVectorizer() is ngram_range=(1, 1). We could add 2-grams (with the parameter ngram_range=(1, 2)) to enrich the dictionary with groups of two words, but that takes much more space. So for the moment I will use only 1-grams (and maybe enrich later if necessary).
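
To make the space trade-off concrete, here is a small plain-Python sketch of what 1-grams and 2-grams look like for one sentence. The `ngrams` helper is hypothetical and only imitates the idea, not scikit-learn's tokenizer:

```python
def ngrams(tokens, n):
    # All runs of n consecutive tokens, joined with a space
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "science is not perfect".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# With both 1-grams and 2-grams, the dictionary holds the union of the two sets
dictionary_size = len(set(unigrams + bigrams))

print(unigrams)
print(bigrams)
print(dictionary_size)
```

Four words already give seven dictionary entries, and across thousands of posts the number of distinct 2-grams grows much faster than the number of distinct words.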

### TF-IDF

Raw counts favor long posts, so I will use the TF-IDF (term frequency-inverse document frequency) method. First (1: tf), the number of occurrences of each word in a post is divided by the total number of words in that post, which gives new features called tf. Second (2: tf-idf), words that appear in many posts of the corpus get less weight, since they carry less information than words that appear in only a few posts.

$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)$
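
The formula can be worked through by hand on a tiny count matrix. This is a sketch under stated assumptions: tf is taken as count divided by post length, as described above, and idf is assumed to be the smoothed form $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, where $n$ is the number of posts and $\mathrm{df}(t)$ the number of posts containing $t$. The data is made up:

```python
import math

# Toy count matrix: 3 posts x 3 terms (hypothetical data)
terms = ['love', 'music', 'science']
counts = [[1, 0, 1],
          [0, 0, 2],
          [2, 1, 1]]

n_posts = len(counts)

# df(t): number of posts that contain term t
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed idf, assumed here: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_posts) / (1 + d)) + 1 for d in df]

# tf(t, d): occurrences of t in post d divided by the post's total word count
tfidf = []
for row in counts:
    total = sum(row)
    tfidf.append([(c / total) * w for c, w in zip(row, idf)])

for term, d, w in zip(terms, df, idf):
    print(term, d, round(w, 3))
```

The word that appears in every post ("science") gets the lowest idf and the rarest one ("music") the highest. TfidfTransformer with default settings works from raw counts and then L2-normalizes each row, so its values will differ from this sketch even though the weighting idea is the same.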

In [35]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
data_tfidf = tfidf_transformer.fit_transform(data_counts)
data_tfidf.shape

(8675, 110188)

In [66]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer2 = TfidfTransformer()
train_tfidf = tfidf_transformer2.fit_transform(train_counts)
train_tfidf.shape

(6506, 94451)

The function fit_transform() does two things in one call: fit() fits the estimator to the data (here, learning the idf weights), and transform() turns the matrix of occurrences (data_counts) into a tf-idf matrix.

## Models 

### Training a classifier

We now train a classifier on these features to classify the posts. We will use a basic Naive Bayes classifier named MultinomialNB, also provided by scikit-learn.

In [67]:
from sklearn.naive_bayes import MultinomialNB
cl = MultinomialNB().fit(train_tfidf, df.ravel())  # labels flattened to a 1-D array

I could also use a Pipeline to make the vectorization, tf-idf and classifier steps faster and easier to run, using the scikit-learn class sklearn.pipeline.Pipeline. It could be an improvement for later.
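
A minimal sketch of what such a pipeline could look like, on made-up posts and labels rather than the MBTI data. The step names 'vect', 'tfidf' and 'clf' and the variable `text_clf` are arbitrary choices:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy posts and numeric labels (hypothetical data)
train_posts = ["I love quiet evenings with a book",
               "parties and crowds give me energy",
               "a book and a quiet room",
               "crowds and parties all weekend"]
train_labels = [1, 2, 1, 2]

text_clf = Pipeline([
    ('vect', CountVectorizer()),     # tokenize and count
    ('tfidf', TfidfTransformer()),   # reweight the counts
    ('clf', MultinomialNB()),        # classify
])

# fit() runs fit_transform on each step, then fits the classifier
text_clf.fit(train_posts, train_labels)

# predict() reuses the fitted vocabulary and idf weights on new text
predicted = text_clf.predict(["quiet book", "parties crowds"])
print(predicted)
```

Because the pipeline takes raw text in both fit() and predict(), the test posts are guaranteed to go through the vocabulary and idf weights learned on the train set.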

### Test of the model performance on testset

In [68]:
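# Transform the test posts with the vectorizer and tf-idf transformer fitted on the train set
predicted = cl.predict(tfidf_transformer2.transform(count_vect.transform(testset['posts'])))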
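
Once predictions are available, a simple first measure is accuracy, compared against always predicting the most frequent type (the dataset is strongly unbalanced, with INFP alone at 1832 of 8675 posts). A minimal sketch on made-up labels, which are not results from this notebook:

```python
from collections import Counter

# Toy true and predicted labels (hypothetical data, not notebook results)
true_labels = [7, 7, 1, 4, 7, 3]
predicted_labels = [7, 7, 7, 4, 7, 7]

# Accuracy: share of posts whose predicted label equals the true one
n_correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = n_correct / len(true_labels)

# Baseline: always predict the most frequent true label
majority_label, majority_count = Counter(true_labels).most_common(1)[0]
baseline = majority_count / len(true_labels)

print(accuracy, baseline)
```

A model is only informative here if its accuracy is clearly above the baseline.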