# Bidirectional LSTM on MBTI Dataset


## Introduction

The Myers-Briggs Type Indicator (MBTI) is a widely used framework that attempts to classify an individual's personality along four binary dimensions: introversion/extroversion, intuition/sensing, thinking/feeling, and perceiving/judging. The four results are combined into a four-letter code, such as INTP or ESFJ, giving sixteen possible types overall.

Existing methods of determining a person's MBTI type rely on a questionnaire, whose responses are used to assign the type. These tests can be time-consuming. The questions also tend to be repetitive, so a respondent's familiarity with the questions, and with what they are trying to measure, can bias the answers. By predicting from free-form text instead, we attempt to reduce this bias while exploring an alternative method of prediction.

## Objective

In this project, we attempt to develop a machine learning model that predicts a person's MBTI type from text they enter into the program. Because our training data is sourced from social media sites, the program is intended for conversational text, such as messages or emails. By trying different models, vectorization methods, and methods of prediction, we aim to obtain the highest possible accuracy. Additionally, by contrasting the results given by the different methods of prediction, we may also gain insight into the workings of the framework.

## Set-Up and Imports

In [1]:
import random
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf

## Load Data

In [3]:
df = pd.read_csv("https://__________________/terminalai/webdev-ai/main/data/mbti.csv")
df = df.sort_values(list(df.columns[1:])+["text"], ignore_index=True)
df

| | text | I/E | N/S | T/F | P/J |
|---|---|---|---|---|---|
| 0 | | 0 | 0 | 0 | 0 |
| 1 | | 0 | 0 | 0 | 0 |
| 2 | 9 | 0 | 0 | 0 | 0 |
| 3 | almost certain close friend mine read austral... | 0 | 0 | 0 | 0 |
| 4 | always like keep answers tentative time ive f... | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 31995 | youre younger developed | 1 | 1 | 1 | 1 |
| 31996 | yup friend drove hours see exo chicago last mo... | 1 | 1 | 1 | 1 |
| 31997 | yup like clean tasteful outfit fits well color... | 1 | 1 | 1 | 1 |
| 31998 | yupp id agree theory feusers influenced brough... | 1 | 1 | 1 | 1 |
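
The four binary label columns can be turned back into a four-letter type code. The sketch below is only an illustration: it assumes that `0` selects the letter before the slash in each column name and `1` the letter after it (so `I/E = 0` would mean I), which the data shown here does not confirm, and `decode_type` is a hypothetical helper rather than part of the notebook.

```python
# Column order follows the label columns of the dataframe.
AXES = ["I/E", "N/S", "T/F", "P/J"]


def decode_type(bits):
    """Map four 0/1 labels to a four-letter MBTI code.

    ASSUMPTION: 0 selects the letter before the slash, 1 the letter after it.
    """
    if len(bits) != len(AXES):
        raise ValueError("expected one label per axis")
    return "".join(axis.split("/")[bit] for axis, bit in zip(AXES, bits))


print(decode_type([0, 0, 0, 0]))  # INTP under the assumed orientation
print(decode_type([1, 1, 1, 1]))  # ESFJ under the assumed orientation
print(2 ** len(AXES))             # 16 possible types
```

If the dataset uses the opposite orientation for an axis, only the order of the letters in the corresponding `AXES` entry needs to change.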


## Training

### Prepare variables and split in Train-CV-Test

In [5]:
x = df["text"].values.astype(str)
y = df[["I/E", "N/S", "T/F", "P/J"]]

# hold out 40% of the rows, then split that part 75/25 into validation and test
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.4)
x_val, x_test, y_val, y_test = train_test_split(x_val, y_val, test_size=0.25)

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=2000, oov_token="<OOV>")
tokenizer.fit_on_texts(x)

x_train = tf.keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(x_train), maxlen=100, padding='post', truncating='post')
x_test = tf.keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(x_test), maxlen=100, padding='post', truncating='post')
x_val = tf.keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(x_val), maxlen=100, padding='post', truncating='post')
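
The two chained splits (`test_size=0.4`, then `test_size=0.25` on the held-out part) give roughly a 60/30/10 train/validation/test division. The sketch below works out the row counts in plain Python. It assumes scikit-learn rounds a fractional `test_size` up and gives the remainder to the first part, and it takes the row count of 31999 from the dataframe index (0 to 31998); `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    ASSUMPTION: the held-out part is rounded up and the rest goes to the
    first part, which is how we understand train_test_split to behave.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_rows = 31999  # index runs from 0 to 31998 in the dataframe
n_train, n_holdout = split_sizes(n_rows, 0.4)
n_val, n_test = split_sizes(n_holdout, 0.25)

print("train:", n_train)
print("validation:", n_val)
print("test:", n_test)
```

A test set of 3200 rows is consistent with the accuracies reported in the evaluation, which are all multiples of 1/3200 (for example, 0.559375 × 3200 = 1790).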

### Model

We use a bidirectional LSTM followed by a dropout layer and a fully connected layer. All dropout rates, including the LSTM's input and recurrent dropout, are `0.2`.

In [6]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(2000, 64), # embedding layer (vocabulary size matches num_words)
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2)), # LSTM layer
    tf.keras.layers.Dropout(rate=0.2), # dropout layer
    tf.keras.layers.Dense(64, activation='relu'), # fully connected layer
    tf.keras.layers.Dense(4, activation='sigmoid') # final layer: one independent probability per axis
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy', 'AUC'])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          128000    
                                                                 
 bidirectional (Bidirectiona  (None, 128)              66048     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 64)                8256      
                                                                 
 dense_1 (Dense)             (None, 4)                 260       
                                                                 
Total params: 202,564
Trainable params: 202,564
Non-trainable params: 0
__________________________________________________
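
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer sizes are what we expect. The sketch below uses the standard parameter formulas for embedding, LSTM, and dense layers; the variable names are ours.

```python
vocab_size = 2000   # num_words passed to the tokenizer
embed_dim = 64      # embedding output size
lstm_units = 64     # units per LSTM direction
dense_units = 64    # width of the hidden dense layer
n_outputs = 4       # one output per MBTI axis

embedding = vocab_size * embed_dim

# Each LSTM direction has 4 gates; every gate has input weights,
# recurrent weights, and a bias vector.
lstm_one_direction = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)
bidirectional = 2 * lstm_one_direction

# The dense layer sees the concatenated forward and backward states (128 values).
dense = (2 * lstm_units) * dense_units + dense_units
output = dense_units * n_outputs + n_outputs

total = embedding + bidirectional + dense + output
print(embedding, bidirectional, dense, output, total)
```

The five printed values match the `Param #` column and the total of 202,564 in the summary.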

### Train Model

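We use early stopping to limit overfitting: training stops once the validation loss (the callback's default monitored quantity) has gone two consecutive epochs without improving (`patience=2`).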

In [7]:
early_stopping_monitor = tf.keras.callbacks.EarlyStopping(patience=2)
history = model.fit(x_train, y_train, epochs=10, validation_data=(x_val, y_val), callbacks=[early_stopping_monitor])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
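
Training ended after 4 of the 10 requested epochs, which is what a patience of 2 produces when the validation loss last improves at epoch 2. The sketch below is a simplified, pure-Python imitation of the patience rule, not the Keras implementation, and the loss values are hypothetical since the log above does not include them.

```python
def epochs_run(val_losses, patience=2):
    """Return the number of epochs completed before early stopping triggers.

    Simplified rule: stop once `patience` consecutive epochs fail to
    improve on the best validation loss seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


# HYPOTHETICAL validation losses for 10 epochs; the real values are not shown above.
hypothetical_val_losses = [0.70, 0.68, 0.69, 0.70, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76]
stopped_at = epochs_run(hypothetical_val_losses, patience=2)
print(stopped_at)  # 4: best loss at epoch 2, then two epochs without improvement
```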


## Evaluation

### Evaluate Accuracies of Each Variable

In [8]:
y_pred = pd.DataFrame(model.predict(x_test).round(), columns=["I/E", "N/S", "T/F", "P/J"]).applymap(int)
y_test = y_test.reset_index().drop(columns=["index"])
for i in y_test: print(i, (y_pred[i] == y_test[i]).mean())
print("Overall", ((y_pred == y_test).sum(axis=1) == 4).mean())

I/E 0.559375
N/S 0.554375
T/F 0.5734375
P/J 0.5453125
Overall 0.1165625
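
One way to read the overall figure is to compare it with what the per-axis accuracies would give if the four predictions were right or wrong independently of each other, and with the 1-in-16 chance level of guessing every axis by coin flip. The sketch below only reuses the numbers printed above; independence is an assumption made for comparison, not a property of the model.

```python
from math import prod

# Per-axis accuracies and exact-match accuracy copied from the output above.
axis_accuracy = {"I/E": 0.559375, "N/S": 0.554375, "T/F": 0.5734375, "P/J": 0.5453125}
observed_overall = 0.1165625

# ASSUMPTION for comparison only: the four axes are predicted independently.
expected_if_independent = prod(axis_accuracy.values())

# Chance level for getting all four binary axes right by coin flip.
chance_overall = 0.5 ** 4

print("expected if independent:", round(expected_if_independent, 4))
print("chance level:", chance_overall)
print("observed:", observed_overall)
```

The independence estimate comes out at about 0.097, below the observed 0.117, which would suggest that correct predictions on different axes tend to occur together on the same posts.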


### Test Against Random

In [9]:
y_rand = y_test.applymap(lambda x: random.random()).round().astype(int)
for i in y_test: print(i, (y_rand[i] == y_test[i]).mean())
print("Overall", ((y_rand == y_test).sum(axis=1) == 4).mean())

I/E 0.50375
N/S 0.4834375
T/F 0.495
P/J 0.47125
Overall 0.0571875
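
To judge whether the model's per-axis accuracies are meaningfully above the random baseline, we can compare them with the sampling noise of a coin-flip guesser. A fair-coin guess matches the true label with probability 0.5 whatever the class balance, so its accuracy over `n` test rows has a standard error of `sqrt(0.25 / n)`. The sketch below assumes a test set of 3200 rows, inferred from the 60/30/10 split of 31999 rows rather than printed by the notebook.

```python
import math

n_test = 3200  # ASSUMPTION: inferred from the split sizes, not printed by the notebook

# Standard error of the accuracy of a fair-coin guesser on n_test rows.
standard_error = math.sqrt(0.5 * 0.5 / n_test)

# Model accuracies copied from the evaluation output.
model_accuracy = {"I/E": 0.559375, "N/S": 0.554375, "T/F": 0.5734375, "P/J": 0.5453125}

# Distance of each accuracy from 0.5, measured in standard errors.
z_scores = {axis: (acc - 0.5) / standard_error for axis, acc in model_accuracy.items()}

print("standard error:", round(standard_error, 4))
for axis, z in z_scores.items():
    print(axis, round(z, 1))
```

Under this assumption the standard error is a little under 0.009, so each model accuracy sits more than five standard errors above 0.5. The improvement over random guessing is small in absolute terms but unlikely to be noise.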


## Save Model

In [10]:
!mkdir models
model.save_weights('models/mbti-bdlstm.h5')