# MBTI Personality Classification
Authors: Xavier Barinaga & Sam Allen  

The Myers-Briggs Type Indicator (MBTI) is a personality type system that sorts people into 16 distinct personality types along 4 axes:
 * Introversion (I) $\rightarrow$ Extroversion (E)
 * Intuition (N) $\rightarrow$ Sensing (S)
 * Thinking (T) $\rightarrow$ Feeling (F)
 * Judging (J) $\rightarrow$ Perceiving (P)

More information about MBTI can be found [here](https://myersbriggs.org/my-mbti-personality-type/myers-briggs-overview/).

## Dataset Description

The MBTI dataset we are using can be found [here](https://www.kaggle.com/datasets/datasnaek/mbti-type?resource=download). Some notes on the dataset:
* Inherently only 2 attributes:
  * Personality Type
  * That person's 50 most recent posts on the PersonalityCafe forum

Notably, since this dataset consists of people who post on an online personality forum, we can find a class imbalance between introverts and extroverts, among other categories. It is important to note that this dataset is not representative of the general population.

Below, we will briefly explore features of the dataset.

In [1]:
import pandas as pd

mbti_df = pd.read_csv("mbti_1.csv")

mbti_df.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [2]:
mbti_df.shape

(8675, 2)
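
The class imbalance mentioned earlier could be checked with a small helper like the one below. This is a minimal sketch rather than part of our pipeline: `axis_counts` and `AXES` are hypothetical names, and the sample list is simply the five types visible in the `head()` output above.

```python
from collections import Counter

# Letter pairs for the four MBTI axes, in the order they appear in a type code.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

def axis_counts(types):
    """Count how often each pole of each axis occurs in an iterable of type codes."""
    counts = Counter()
    for t in types:
        # Each character of a four-letter code belongs to exactly one axis.
        counts.update(t.strip().upper())
    return {f"{a}/{b}": (counts[a], counts[b]) for a, b in AXES}

# The five types from the preview above; this is far too few to draw conclusions.
sample_types = ["INFJ", "ENTP", "INTP", "INTJ", "ENTJ"]
print(Counter(sample_types).most_common())
print(axis_counts(sample_types))
```

With the full dataset loaded, passing `mbti_df["type"]` instead of `sample_types` should give the per-axis counts for all 8675 rows.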

### Dataset Features
As seen above, our dataset comprises 8,675 instances.
 * For each instance, we have the person's 50 latest posts all in one column, separated by `|||`
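
As a rough sketch of how one row's posts might be separated, the helper below splits on the `|||` delimiter. The name `split_posts` and the example string are hypothetical, and stripping the surrounding quote characters is an assumption based on the leading `'` visible in the preview.

```python
def split_posts(raw, sep="|||"):
    """Split one row's `posts` string into a list of non-empty individual posts."""
    # Assumption: each row is wrapped in quote characters, as the preview suggests.
    cleaned = raw.strip().strip("'\"")
    return [p.strip() for p in cleaned.split(sep) if p.strip()]

# Hypothetical row, not taken from the dataset.
example = "'First post.|||Second post with a link http://example.com|||Third post.'"
posts = split_posts(example)
print(len(posts))
print(posts[0])
```

On the whole column, something like `mbti_df["posts"].str.split("|||", regex=False)` may do the same job, assuming a pandas version that supports the `regex` argument; without it, `|||` could be interpreted as a regular expression.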

Since our dataset contains only one non-class attribute, we will have to derive our own features from the "posts" attribute. Below are some meaningful features we believe we can derive from the posts:
 * Word Count: Higher word count may indicate Extroversion and Judging. In contrast, lower word count could indicate Introversion and Perceiving.
 * Words Per Sentence (WPS): A higher WPS might indicate Thinking.
 * Number of Social Words: Words like "friend", "talk", "they", "us", etc. might indicate Extroversion.
 * Number of Personal Pronouns: Words like "I", "my", "me", etc. might indicate Introversion.
 * Polarity: Strong positive or negative sentiment might indicate Feeling. In contrast, minimal polarity could indicate Thinking.
 * Subjectivity: High subjectivity might correlate with Feeling, while low subjectivity could indicate Thinking.
 * Type-Token Ratio (TTR): The number of unique words divided by the total number of words. A higher TTR could correlate with Intuition and Thinking.
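
The lexical features in this list could be computed along the following lines. This is a minimal sketch under stated assumptions: `text_features` is a hypothetical helper, the word lists contain only the examples named above, and the sentence splitting is a rough heuristic. Polarity and subjectivity are left out because they would need a sentiment library that we have not chosen yet.

```python
import re

# Assumption: tiny word lists built only from the examples in the list above.
SOCIAL_WORDS = {"friend", "talk", "they", "us"}
PERSONAL_PRONOUNS = {"i", "my", "me"}

def text_features(raw_posts, sep="|||"):
    """Compute simple lexical features from one row's `posts` string."""
    # Split into posts first, then drop URLs so links are not counted as words.
    posts = [re.sub(r"https?://\S+", " ", p) for p in raw_posts.split(sep)]
    posts = [p for p in posts if p.strip()]
    words = [w for p in posts for w in re.findall(r"[a-z]+(?:'[a-z]+)?", p.lower())]
    # Treat runs of '.', '!' or '?' as sentence boundaries (a rough heuristic).
    sentences = [s for p in posts for s in re.split(r"[.!?]+", p) if s.strip()]
    n_words = len(words)
    return {
        "word_count": n_words,
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        "social_words": sum(w in SOCIAL_WORDS for w in words),
        "personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }

# Hypothetical two-post row, not taken from the dataset.
print(text_features("I talk to my friend.|||They like us!"))
```

Note that contractions such as "I'm" are kept as single tokens here, so they are not counted as the pronoun "I"; whether that is desirable is an open design choice.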

Below, we will dive a little bit into how we plan to implement these features.

## Implementation & Technical Merit

Predicting all 16 types directly would be a difficult multiclass problem. Instead, we may train 4 separate binary classifiers, one for each axis (for example, one for Introversion vs. Extroversion), and combine their outputs into a four-letter type.
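
One way to set up the per-axis idea is to turn each four-letter type into four 0/1 labels, and to map four binary predictions back into a type. The sketch below only illustrates that label encoding; the function names and the choice of which pole maps to 0 are assumptions, and no classifier is trained here.

```python
# Pole letters per axis, in type-code order; the first letter of each pair maps to 0.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

def type_to_labels(mbti_type):
    """Convert a four-letter type such as 'INFJ' into four 0/1 labels."""
    code = mbti_type.strip().upper()
    if len(code) != 4:
        raise ValueError(f"expected a four-letter type, got {mbti_type!r}")
    labels = []
    for letter, (zero, one) in zip(code, AXES):
        if letter not in (zero, one):
            raise ValueError(f"unexpected letter {letter!r} in {mbti_type!r}")
        labels.append(0 if letter == zero else 1)
    return labels

def labels_to_type(labels):
    """Combine four binary predictions back into a four-letter type."""
    return "".join(pair[label] for pair, label in zip(AXES, labels))

print(type_to_labels("INFJ"))        # [0, 0, 1, 0]
print(labels_to_type([1, 0, 0, 1]))  # ENTP
```

Each of the four label positions could then serve as the target for its own binary classifier.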

## Potential Impact
Advertisers on social media websites would likely be very interested in a classifier like this, so they may be potential stakeholders.

## Citations
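 * Myers-Briggs overview: https://myersbriggs.org/my-mbti-personality-type/myers-briggs-overview/
 * MBTI dataset (Kaggle): https://www.kaggle.com/datasets/datasnaek/mbti-type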