# Final project: NLP to predict Myers-Briggs Personality Type

<img src='https://bit.ly/2VnXWr2' width='100' align='left'>

## Introduction

To learn more about NLP while applying its methods to psychological variables, I have been working on a Kaggle dataset, [(MBTI) Myers-Briggs Personality Type Dataset](https://www.kaggle.com/datasnaek/mbti-type). It holds data collected through the [PersonalityCafe forum](http://personalitycafe.com/forum/), which provides a large selection of people with their MBTI personality type and samples of what they have written.

### Objectives

My main goal was to create a **classification model that uses text features and meta-features from each user's comments, messages and posts to predict their personality type**.

## Imports

In [None]:
# Data Analysis
import pandas as pd
import numpy as np
import math

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Data Visualization for text
from PIL import Image
from os import path
import os
import random
from wordcloud import WordCloud, STOPWORDS

# Text Processing
import re
import itertools
import spacy
import string
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
import en_core_web_sm
from collections import Counter

# Ignore noise warning
import warnings
warnings.filterwarnings('ignore')

# Work with pickles
import pickle

# Show every column when displaying DataFrames
pd.set_option('display.max_columns', None)

## 1. Exploratory Data Analysis

### Context


The Myers-Briggs Type Indicator (or MBTI for short) is a personality type system that divides everyone into 16 distinct personality types across 4 axes:

- Introversion (I) – Extroversion (E)
- Intuition (N) – Sensing (S)
- Thinking (T) – Feeling (F)
- Judging (J) – Perceiving (P)

[(More can be learned about what these mean here)](http://www.myersbriggs.org/my-mbti-personality-type/mbti-basics/home.htm)

So for example, someone who prefers introversion, intuition, thinking and judging would be labelled an INTJ in the MBTI system, and there are lots of personality based components that would model or describe this person’s preferences or behaviour based on the label.
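
As a minimal sketch of the labelling scheme described above, the four axes can be combined to enumerate all 16 types, and a four-letter code can be decoded back into its preferences. The `AXES` table and the `decode_type` helper are illustrative names of my own, not part of the dataset or of any MBTI library.

```python
from itertools import product

# Each axis as a pair of (letter, preference name); order matches the four-letter code.
AXES = [
    (("I", "Introversion"), ("E", "Extroversion")),
    (("N", "Intuition"), ("S", "Sensing")),
    (("T", "Thinking"), ("F", "Feeling")),
    (("J", "Judging"), ("P", "Perceiving")),
]

# All 16 types: one letter chosen from each axis.
ALL_TYPES = ["".join(letter for letter, _ in combo) for combo in product(*AXES)]


def decode_type(code):
    """Hypothetical helper: map a code such as 'INTJ' to its four preference names."""
    code = code.upper()
    if len(code) != len(AXES):
        raise ValueError(f"expected a 4-letter code, got {code!r}")
    names = []
    for letter, axis in zip(code, AXES):
        lookup = dict(axis)  # e.g. {"I": "Introversion", "E": "Extroversion"}
        if letter not in lookup:
            raise ValueError(f"unexpected letter {letter!r} in {code!r}")
        names.append(lookup[letter])
    return names


print(len(ALL_TYPES))
print(decode_type("INTJ"))
```

Treating each letter as its own binary variable is one possible way to frame the prediction task, as an alternative to a single 16-class target.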

It is one of the most popular personality tests in the world, if not the most popular. It is used in businesses, online, for fun, for research and much more. A simple Google search reveals the many ways the test has been used over time. It is safe to say that the test is still widely used today.

From a scientific or psychological perspective, it is based on the work on [cognitive functions](http://www.cognitiveprocesses.com/Cognitive-Functions/) by Carl Jung, i.e. Jungian Typology. This was a model of 8 distinct functions, thought processes or ways of thinking that were suggested to be present in the mind. This work was later transformed into several different personality systems to make it more accessible, the most popular of which is the MBTI.

Note that for the dataset I generated I did not use the original MBTI test but a test based on it, [16Personalities](https://www.16personalities.com/), which adds an additional axis that I omitted for validity reasons.

**Content**

This dataset contains over 8600 rows of data. Each row holds a person's:

- Type (this person's four-letter MBTI code/type)
- A section of each of the last 50 things they have posted (each entry separated by '|||', i.e. 3 pipe characters)
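
To make the row layout concrete, here is a small sketch of how one row's text could be split back into individual posts. The `split_posts` helper and the example row are made up for illustration; only the `'|||'` separator comes from the dataset description.

```python
SEPARATOR = "|||"


def split_posts(raw):
    """Hypothetical helper: split one row's text into a list of non-empty posts."""
    return [post.strip() for post in raw.split(SEPARATOR) if post.strip()]


# Made-up example row, not taken from the dataset.
example_row = "I love long walks|||Anyone else here an INTJ?|||  |||Good night"
posts = split_posts(example_row)
print(len(posts))
print(posts)
```

Dropping empty entries is a choice made in this sketch; whether blank posts should count towards a user's total depends on the analysis.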

#### To be noted

I want to add that, despite being one of the most used personality tests for organizational purposes, the Myers-Briggs Type Indicator hasn't been validated.

I used this data because I had access to it, whereas I could not find or access any equivalent data based on the most respected and scientifically validated personality model available at the moment, the Big Five personality traits, also known as the Five Factor Model (FFM).

On the other hand, the FFM authors themselves (McCrae & Costa, 1989) published a paper that found important correlations between the MBTI and 4 of the 5 FFM personality traits. They acknowledged that, by following the proper steps to validate a psychological test and improving some of the MBTI's psychometric properties, it could eventually become a recognised personality framework.

Reference:\
[McCrae, R. R., & Costa Jr, P. T. (1989). Reinterpreting the Myers‐Briggs type indicator from the perspective of the five‐factor model of personality. *Journal of personality*, 57(1), 17-40](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-6494.1989.tb00759.x).

### EDA

#### Read dataset and check head

In [None]:
mbti_df = pd.read_csv("../input/mbti-type/mbti_1.csv")
mbti_df.head()

#### Check shape

In [None]:
mbti_df.shape

#### Check dtypes and columns

In [None]:
mbti_df.info()

#### Check nulls and duplicates

In [None]:
mbti_df.isna().sum()

In [None]:
mbti_df.duplicated().sum()

#### Check unique values

In [None]:
mbti_df.nunique()

#### Check target variable distribution

In [None]:
mbti_df.type.value_counts()
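
Raw counts are easier to compare across classes when expressed as shares of the whole sample. The sketch below shows the idea with the standard library only; `class_shares` and the toy labels are illustrative assumptions, and in the notebook the input would be the `type` column.

```python
from collections import Counter


def class_shares(labels):
    """Hypothetical helper: return (label, share) pairs, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, count / total) for label, count in counts.most_common()]


# Made-up labels for illustration; in the notebook this would be mbti_df['type'].
toy_labels = ["INFP", "INFP", "INFJ", "INTP", "INFP", "ESTJ", "INFJ", "INFP"]
for label, share in class_shares(toy_labels):
    print(f"{label}: {share:.1%}")
```

With pandas, `mbti_df.type.value_counts(normalize=True)` should give the equivalent result directly.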

#### Target variable distribution visualization

I will use countplots to visualize the most frequent types. I will also try to visualize the same based on the lengths of the associated pieces of text.

In [None]:
plt.figure(figsize=(18,10))
sns.countplot(y='type',data=mbti_df, order=mbti_df.type.value_counts().index)
sns.set_context('talk')
plt.title('Personality types distribution', fontsize=25)
plt.savefig('mbti_count.png')
plt.show()

I also want to see how long posts are for each personality type.

In [None]:
def var_row(row):
    lst = []
    for word in row.split('|||'):
        lst.append(len(word.split()))
    return np.var(lst)

mbti_df['words_per_comment'] = mbti_df['posts'].apply(lambda x: len(x.split())/50)
mbti_df['variance_of_word_counts'] = mbti_df['posts'].apply(lambda x: var_row(x))
mbti_df.head()
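
The cell above divides the word count by a fixed 50, which assumes every row holds exactly 50 posts. As a hedged alternative, the sketch below computes the mean and variance from the posts actually present in a row. `post_length_stats` is a hypothetical helper and the row is made up; `pvariance` is used because `np.var` computes the population variance by default.

```python
from statistics import mean, pvariance


def post_length_stats(raw, separator="|||"):
    """Hypothetical helper: mean and population variance of words per post."""
    lengths = [len(post.split()) for post in raw.split(separator)]
    return mean(lengths), pvariance(lengths)


# Made-up row with three posts of 2, 4 and 6 words.
toy_row = "hello there|||this is slightly longer|||and this one is longer still"
avg, var = post_length_stats(toy_row)
print(avg, var)
```

If some users have fewer than 50 posts, the two approaches will give different averages for those rows.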

In [None]:
plt.figure(figsize=(18,10))
# Pass x and y as keywords: recent seaborn versions no longer accept them positionally
sns.swarmplot(x='type', y='words_per_comment', data=mbti_df)
sns.set_context('talk')
plt.title('Posts length per type', fontsize=25)
plt.savefig('mbti_posts_length.png')
plt.show()

In [None]:
mbti_df.describe().T

In [None]:
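# Restrict the correlation to numeric columns (needed in recent pandas versions,
# where text columns such as 'type' and 'posts' would otherwise raise an error)
mbti_df.corr(numeric_only=True)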

I will use word clouds to visualize the most common words in the posts column.

In [None]:
# Read the whole text.
text = ' '.join(mbti_df['posts'])

# Generate a word cloud image
stopwords = STOPWORDS
wordcloud = WordCloud(background_color='white', width=800, height=400, stopwords=stopwords, max_words=100, repeat=False, min_word_length=4).generate(text)

# Display the generated image:
plt.figure(figsize=(18,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
sns.set_context('talk')
plt.title('Most common words', fontsize=25)
plt.savefig('mbti_cloud.png')
plt.show()

#### Comments
Some terms seem to be more frequent than others, but we still cannot establish any relationship between those terms' frequencies and our target variable.

### Text columns
To better assess whether there may be a relationship between text and personality types, we will tokenize the text to form a Bag of Words.
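
Before bringing in spaCy, the idea of a Bag of Words can be sketched with the standard library alone: each text is reduced to a table of token counts, ignoring word order. The stop-word list, the `bag_of_words` helper and the toy posts below are all illustrative assumptions, not the notebook's actual pipeline.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the notebook relies on spaCy's instead.
TOY_STOP_WORDS = {"i", "the", "a", "and", "to", "is", "my"}


def bag_of_words(text):
    """Hypothetical helper: count lowercase word tokens, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in TOY_STOP_WORDS)


# Made-up posts for two types, not taken from the dataset.
toy_posts = {
    "INTJ": "I plan the plan and then I revise the plan",
    "ENFP": "I love people and I love to talk to people",
}
bows = {mbti_type: bag_of_words(text) for mbti_type, text in toy_posts.items()}
for mbti_type, bow in bows.items():
    print(mbti_type, bow.most_common(2))
```

Comparing such counts across types is what would let us look for words that are over-represented in one personality type.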

#### BoW

In [None]:
mbti_text = mbti_df[['type','posts']].copy()

In [None]:
mbti_text = mbti_text.fillna('')
# Copy the slice so that adding a column below does not act on a view of mbti_text
text_columns = mbti_text[['type']].copy()
text_columns['text'] = mbti_text.iloc[:,1:].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)

In [None]:
def clean_urls(column):
    '''
    This function takes a pandas Series of strings and returns a Series
    with all the text in lowercase and its urls removed.
    '''
    # Raw string so that \S is passed to the regex engine untouched
    return column.apply(lambda x: x.lower()).apply(lambda x: re.sub(r'http[s]?://\S+', '', x))

text_columns['text'] = clean_urls(text_columns['text'])
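
One detail worth checking is how the URL pattern interacts with the `'|||'` separator. Because `\S+` runs until the next whitespace, a URL directly followed by `|||` also takes the separator and the first word of the next post with it. The sketch below demonstrates this on a made-up string. `strip_urls` and `URL_PATTERN_KEEP_SEP` are hypothetical names, and the second pattern is only one assumed way to stop at the separator.

```python
import re

# Same pattern as in clean_urls, written as a raw string.
URL_PATTERN = re.compile(r"http[s]?://\S+")
# Assumed variant: also stop at '|' so the '|||' separator survives.
URL_PATTERN_KEEP_SEP = re.compile(r"http[s]?://[^\s|]+")


def strip_urls(text, pattern=URL_PATTERN):
    """Hypothetical helper: lowercase a string and drop http(s) URLs."""
    return pattern.sub("", text.lower())


# Made-up example, not taken from the dataset.
sample = "Watch THIS http://example.com/a|||Next post starts here"
print(strip_urls(sample))
print(strip_urls(sample, URL_PATTERN_KEEP_SEP))
```

Whether this matters depends on the later steps: for a Bag of Words over the whole row the lost word is a small amount of noise, but any per-post statistic computed after cleaning would see fewer posts.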

<img src='https://www.nicepng.com/png/detail/148-1486992_discover-the-most-powerful-ways-to-automate-your.png' width='1000'> 

In [None]:
#raise SystemExit('Stop right there! Run cells one by one till the next heading.')

In [None]:
nlp = spacy.load('en_core_web_sm', disable = ['ner', 'parser']) 
nlp.max_length = 33000000

In [None]:
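def tokenize(text):
    '''
    This function takes a string and returns a single string with its
    tokens joined by spaces, after dropping punctuation, whitespace,
    stop words and out-of-vocabulary digits.
    '''
    doc = nlp(text)
    # Explicit parentheses keep the original precedence: digits are dropped only when also OOV
    l_token = [token.text for token in doc
               if not (token.is_punct or token.is_space or token.is_stop
                       or (token.is_digit and token.is_oov))]
    return ' '.join(l_token)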


text_columns['text'] = text_columns['text'].apply(lambda row: tokenize(row))

In [None]:
pd_token = pd.DataFrame(text_columns, columns=['type', 'text'])
pd_token.head()

In [None]:
pd_token.to_pickle('token.pkl')

#### Visualization

In [None]:
# Read the whole text.
text = ' '.join(pd_token['text'])

# Generate a word cloud image
stopwords = STOPWORDS
wordcloud = WordCloud(background_color='white', width=800, height=400, stopwords=stopwords, max_words=100, repeat=False, min_word_length=4).generate(text)

# Display the generated image:
plt.figure(figsize=(18,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
sns.set_context('talk')
plt.title('Most common tokenized words', fontsize=25)
plt.savefig('mbti_token_cloud.png')
plt.show()

#### Comments

After tokenizing, we can see that some words or short expressions are more common than others in the "text" column, but there is still too much information, and it is too raw to yield interesting insights. What we can already note is that the distribution of types in our sample differs from the one found in the MBTI authors' research.

Below is the distribution from the Spanish sample in 2018. The documents containing this information, along with those for other countries' samples, can be found [here](https://www.themyersbriggs.com/en-US/Products-and-Services/MBTI-Manual-Supplements).

![mbti_distr_spain](https://github.com/mikongame/NLP-to-predict-Myers-Briggs-Personality-Type/blob/master/images/mbti_distr_spain.png?raw=true)