# Bidirectional LSTM on MBTI Dataset


## Introduction

The Myers-Briggs Type Indicator (MBTI) is a widely used framework that classifies an individual's personality along four binary axes: introversion/extroversion, intuition/sensing, thinking/feeling, and perceiving/judging. The four results are combined into a four-letter code, such as INTP or ESFJ, giving sixteen possible types overall.

Existing methods of determining a person's MBTI type rely on a questionnaire whose responses are scored to assign a type. These tests can be time-consuming. The questions also tend to be repetitive, so respondents who are familiar with them, and with what each one is trying to measure, may bias their answers. By predicting from free-form text instead, we attempt to reduce this bias while exploring an alternative method of prediction.

## Objective

In this project, we attempt to develop a machine learning model that predicts a person's MBTI type from text they enter into the program. Because our training data is sourced from social media sites, the program is intended for conversational text, such as messages or emails. By trying different models, vectorization methods, and methods of prediction, we aim to obtain the highest possible accuracy. Contrasting the results of the different prediction methods may also give insight into the workings of the framework itself.

## Set-Up and Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix



## Load Data

In [6]:
df = pd.read_csv("https://raw.githubusercontent.com/terminalai/webdev-ai/main/data/mbti.csv")
df

Unnamed: 0,text,I/E,N/S,T/F,P/J
0,yknow point id call impossible really mature w...,1,0,1,1
1,interesting person gaze penetrating speech mea...,1,0,1,1
2,pagsubok lang yan kahit gaano kahirap ang isan...,1,0,1,1
3,doesnt matter im hiding body actually help poi...,1,0,1,1
4,tell calm fuck assure meant sign damn papers w...,1,0,1,1
...,...,...,...,...,...
31995,dont worst enemy per conflicts,0,1,0,0
31996,yep im good english including reading writinge...,0,1,0,0
31997,dont know appealing building part looks fun th...,0,1,0,0
31998,struggle often notice kill conversations sayin...,0,1,0,0


## Training

### Prepare Variables and Split into Train, Validation, and Test Sets

In [29]:
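x = df["text"].values.astype(str)
y = df[["I/E", "N/S", "T/F", "P/J"]]  # four binary labels per text

# Hold out 40% of the data, then split it into validation (30% overall) and test (10% overall)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.4)
x_val, x_test, y_val, y_test = train_test_split(x_val, y_val, test_size=0.25)

# Restrict the vocabulary to the most frequent words (num_words=2000); all other words map to "<OOV>".
# Note: the tokenizer is fit on all of x, so its vocabulary also reflects validation and test text.
tokenizer = Tokenizer(num_words=2000, oov_token="<OOV>")
tokenizer.fit_on_texts(x)
word_index = tokenizer.word_index

# Convert texts to integer sequences, padded or truncated at the end to exactly 100 tokens
x_train = pad_sequences(tokenizer.texts_to_sequences(x_train), maxlen=100, padding='post', truncating='post')
x_test = pad_sequences(tokenizer.texts_to_sequences(x_test), maxlen=100, padding='post', truncating='post')
x_val = pad_sequences(tokenizer.texts_to_sequences(x_val), maxlen=100, padding='post', truncating='post')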
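
To make the padding step concrete, here is a minimal pure-Python sketch of what `padding='post'` and `truncating='post'` do to a single tokenized sequence. The helper `pad_post` is a hypothetical stand-in written for illustration, not the Keras implementation, and the token ids are made up.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Illustrative stand-in for pad_sequences(..., padding='post', truncating='post')."""
    trimmed = seq[:maxlen]  # 'post' truncation drops tokens from the end
    return trimmed + [pad_value] * (maxlen - len(trimmed))  # 'post' padding appends zeros

short_seq = [5, 12, 1, 7]       # made-up ids; a Keras Tokenizer with oov_token set assigns id 1 to it
long_seq = list(range(2, 12))   # ten made-up ids

print(pad_post(short_seq, maxlen=6))  # [5, 12, 1, 7, 0, 0]
print(pad_post(long_seq, maxlen=6))   # [2, 3, 4, 5, 6, 7]
```

With `maxlen=100` as in the cell above, any text longer than 100 tokens loses everything after the 100th token, so only the opening of each text reaches the model.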

### Model

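We use a Bidirectional LSTM followed by a fully connected (Dense) layer, with a dropout rate of `0.2` applied inside the LSTM and again before the Dense layer. The final layer has four sigmoid units, one per MBTI axis, so each axis is predicted as an independent binary label.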

In [31]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(2000, 64), # embedding layer
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2)), # LSTM layer
    tf.keras.layers.Dropout(rate=0.2), # dropout layer
    tf.keras.layers.Dense(64, activation='relu'), # fully connected layer
    tf.keras.layers.Dense(4, activation='sigmoid') # final layer
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'AUC'])

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 64)          128000    
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              66048     
 nal)                                                            
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense_2 (Dense)             (None, 64)                8256      
                                                                 
 dense_3 (Dense)             (None, 4)                 260       
                                                                 
Total params: 202,564
Trainable params: 202,564
Non-trainable params: 0
________________________________________________
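
The `Param #` column can be reproduced by hand from the layer sizes, which is a useful sanity check on the architecture. The sketch below uses the standard parameter-count formulas for embedding, LSTM, and dense layers; the variable names are only for illustration.

```python
vocab_size, embed_dim, lstm_units, dense_units, n_labels = 2000, 64, 64, 64, 4

embedding = vocab_size * embed_dim
# One LSTM direction: 4 gates, each with input weights, recurrent weights, and a bias
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bidirectional = 2 * lstm_one_direction  # forward and backward LSTMs
dense_hidden = (2 * lstm_units) * dense_units + dense_units  # input is the 128-dim concatenation
dense_output = dense_units * n_labels + n_labels

total = embedding + bidirectional + dense_hidden + dense_output
print(embedding, bidirectional, dense_hidden, dense_output, total)
# 128000 66048 8256 260 202564
```

Most of the parameters sit in the embedding table, so changing `num_words` or the embedding dimension has the largest effect on model size.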

### Train Model

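We use early stopping to limit overfitting: training halts once the validation loss (the quantity `EarlyStopping` monitors by default) has failed to improve for 2 consecutive epochs.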

In [32]:
early_stopping_monitor = EarlyStopping(patience=2)
history = model.fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val), callbacks=[early_stopping_monitor])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
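
The patience rule behind this early exit can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras source, and the validation losses are hypothetical values chosen so that training stops after four epochs, as it did here.

```python
def epochs_run(val_losses, patience=2):
    """Return the number of epochs completed before a patience-based stop."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

hypothetical_val_losses = [0.690, 0.680, 0.685, 0.690, 0.600]
print(epochs_run(hypothetical_val_losses))  # 4
```

Note that the weights kept after stopping are those from the last epoch run, not the best one, unless `restore_best_weights=True` is passed to `EarlyStopping`.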


## Evaluation

### Predict on Test Set

In [34]:
# Threshold the sigmoid outputs at 0.5 by rounding, then cast to integer labels
y_pred = pd.DataFrame(model.predict(x_test).round(), columns=["I/E", "N/S", "T/F", "P/J"]).astype(int)
y_pred

Unnamed: 0,I/E,N/S,T/F,P/J
0,0,1,0,0
1,0,0,0,0
2,1,1,1,1
3,0,0,0,0
4,1,1,0,0
...,...,...,...,...
3195,0,1,1,0
3196,1,0,0,1
3197,1,1,0,0
3198,1,1,1,1
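
Each predicted row can be turned back into a four-letter type code. The sketch below assumes that a `1` in a column stands for the first letter of that column's name (for example, `1` in `I/E` means I); the dataset does not document its encoding here, so that is an assumption, and the helper `to_type_code` is hypothetical. Pass `one_means_first=False` if the encoding turns out to be the other way round.

```python
AXES = ["I/E", "N/S", "T/F", "P/J"]

def to_type_code(row, one_means_first=True):
    """Map four binary labels to a type code (label encoding is an assumption)."""
    letters = []
    for axis in AXES:
        first, second = axis.split("/")
        is_one = int(row[axis]) == 1
        letters.append(first if is_one == one_means_first else second)
    return "".join(letters)

example = {"I/E": 1, "N/S": 0, "T/F": 1, "P/J": 1}
print(to_type_code(example))                         # ISTP under the assumed encoding
print(to_type_code(example, one_means_first=False))  # ENFJ under the opposite encoding
```

Because `row[axis]` also works on a pandas Series, the same helper could be applied with `y_pred.apply(to_type_code, axis=1)`.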


### Align Expected Test Labels

In [42]:
# Drop the shuffled index left over from the split so rows line up with y_pred
y_test = y_test.reset_index(drop=True)
y_test

Unnamed: 0,I/E,N/S,T/F,P/J
0,0,1,0,0
1,0,0,0,0
2,0,1,0,1
3,0,1,1,1
4,0,1,1,0
...,...,...,...,...
3195,0,1,1,1
3196,0,0,1,1
3197,1,1,1,0
3198,0,0,1,0


### Evaluate Accuracies of Each Variable

In [49]:
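# Per-axis accuracy: fraction of test rows where the predicted label matches
for i in y_test:
  print(i, (y_pred[i] == y_test[i]).mean())

# Overall accuracy: a row counts only if all four axes are correct
print("Overall", ((y_pred == y_test).sum(axis=1) == 4).mean())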

I/E 0.540625
N/S 0.554375
T/F 0.5878125
P/J 0.5521875
Overall 0.1115625
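
One way to read the overall figure is to compare it with what the per-axis accuracies would give if errors on the four axes were independent, in which case the exact-match rate is simply their product. The sketch below does that arithmetic with the numbers printed above; the independence assumption is ours, for comparison only.

```python
from math import prod

per_axis = {"I/E": 0.540625, "N/S": 0.554375, "T/F": 0.5878125, "P/J": 0.5521875}
observed_overall = 0.1115625

# Exact-match rate expected if the four axes were right or wrong independently
expected_if_independent = prod(per_axis.values())
print(f"expected {expected_if_independent:.4f}, observed {observed_overall:.4f}")
# expected 0.0973, observed 0.1116
```

The observed rate is somewhat higher than the product, which may indicate that correct predictions tend to cluster on the same texts rather than being spread evenly across axes.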


### Test Against Random

In [56]:
# Baseline: guess every label with an independent fair coin flip
y_rand = pd.DataFrame(np.random.randint(0, 2, size=y_test.shape), columns=y_test.columns, index=y_test.index)

for i in y_test:
  print(i, (y_rand[i] == y_test[i]).mean())

print("Overall", ((y_rand == y_test).sum(axis=1) == 4).mean())

I/E 0.516875
N/S 0.5034375
T/F 0.4996875
P/J 0.5040625
Overall 0.07
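
For context, the theoretical chance level for matching all four coin-flipped labels is 0.5^4 = 0.0625. The sketch below estimates how far each overall figure sits from that level in standard errors, assuming a test set of 3,200 rows (10% of roughly 32,000, which is our inference from the split sizes rather than a number stated in the notebook).

```python
from math import sqrt

n_test = 3200          # assumed test-set size (about 10% of roughly 32,000 rows)
p_chance = 0.5 ** 4    # all four independent fair coin flips must match
std_err = sqrt(p_chance * (1 - p_chance) / n_test)  # binomial standard error

for name, acc in [("random", 0.07), ("model", 0.1115625)]:
    z = (acc - p_chance) / std_err
    print(f"{name}: {acc:.4f} is {z:.1f} standard errors above chance ({p_chance})")
```

Under these assumptions the random baseline lands within about two standard errors of 0.0625, as expected, while the model's exact-match rate is more than ten standard errors above it. A coin-flip baseline ignores class imbalance, so a majority-class baseline would be a stricter comparison.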


## Save Model

In [59]:
!mkdir -p models
# Saves the weights only; the architecture above must be rebuilt before loading them
model.save_weights('models/mbti-bdlstm.h5')