# MBTI Personality Classification
Authors: Xavier Barinaga & Sam Allen  

The Myers-Briggs Type Indicator (MBTI) is a personality typing system that assigns each person one of 16 distinct personality types based on 4 axes:
 * Introversion (I) $\rightarrow$ Extroversion (E)
 * Intuition (N) $\rightarrow$ Sensing (S)
 * Thinking (T) $\rightarrow$ Feeling (F)
 * Judging (J) $\rightarrow$ Perceiving (P)

More information about MBTI can be found [here](https://myersbriggs.org/my-mbti-personality-type/myers-briggs-overview/).

## Dataset Description

The MBTI dataset we are using can be found [here](https://www.kaggle.com/datasets/datasnaek/mbti-type?resource=download). Some notes on the dataset:
* Inherently only 2 attributes:
  * Personality type
  * That person's 50 most recent Twitter posts

Notably, since this dataset consists of people who post on Twitter, it shows class imbalances, for example between introverts and extroverts. It is important to note that this dataset is not representative of the general population.

Below, we will briefly explore features of the dataset.

In [1]:
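import pandas as pd

# Load the Kaggle MBTI dataset: one row per person (type label + concatenated posts)
mbti_df = pd.read_csv("mbti_1.csv")

# Preview the first five rows
mbti_df.head()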

| | type | posts |
|---|---|---|
| 0 | INFJ | 'http://www.youtube.com/watch?v=qsXHcwe3krw\|\|\|... |
| 1 | ENTP | 'I'm finding the lack of me in these posts ver... |
| 2 | INTP | 'Good one _____ https://www.youtube.com/wat... |
| 3 | INTJ | 'Dear INTP, I enjoyed our conversation the o... |
| 4 | ENTJ | 'You're fired.\|\|\|That's another silly misconce... |


In [2]:
mbti_df.shape

(8675, 2)
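
To quantify the class imbalance noted above, one could count how many labels fall on each pole of each axis. The following is a minimal sketch using only the standard library; the helper name `axis_counts` and the sample labels are illustrative assumptions, not dataset statistics.

```python
from collections import Counter

# The four MBTI axes, in the order their letters appear in a type string.
AXES = ("IE", "NS", "TF", "JP")


def axis_counts(types):
    """Count, for each axis, how many labels fall on each pole.

    `types` is any iterable of four-letter MBTI strings, e.g. mbti_df["type"].
    """
    counts = {axis: Counter() for axis in AXES}
    for label in types:
        for position, axis in enumerate(AXES):
            counts[axis][label[position]] += 1
    return counts


# Hypothetical sample labels (not real dataset statistics).
sample = ["INFJ", "ENTP", "INTP", "INTJ", "ENTJ"]
for axis, counter in axis_counts(sample).items():
    print(axis, dict(counter))
```

Passing `mbti_df["type"]` instead of `sample` would report the per-axis counts for the full dataset.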

### Dataset Features
As seen above, our dataset comprises 8675 instances.
 * For each instance, we have the person's 50 latest posts all in one column, separated by `|||`
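
As a rough illustration of the `|||` separator, the sketch below splits one row's `posts` string into individual posts. The helper name `split_posts` is hypothetical, and stripping the wrapping quote characters is an assumption based on the preview above; a post that legitimately starts or ends with a quote would lose it.

```python
POST_SEPARATOR = "|||"


def split_posts(raw_posts):
    """Split one row's `posts` string into a list of individual posts."""
    # Assumption: leading/trailing quote characters are wrapping artifacts.
    cleaned = raw_posts.strip("'\"")
    # Drop fragments that are empty or whitespace-only after splitting.
    return [post.strip() for post in cleaned.split(POST_SEPARATOR) if post.strip()]


# Hypothetical row content, shortened for illustration.
example = "'You're fired.|||That's another silly misconception.|||  '"
print(split_posts(example))
```

On the full dataset this could be applied with `mbti_df["posts"].apply(split_posts)`.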

Since our dataset contains only one non-class attribute, we will have to derive our own features from the "posts" attribute. Below are some ideas for meaningful features we believe we can derive from the posts:
 * Word Count: Higher word count may indicate Extroversion and Judging. In contrast, lower word count could indicate Introversion and Perceiving.
 * Words Per Sentence: Higher WPS might indicate Thinking.
 * Number of Social Words: Words like "friend", "talk", "they", "us", etc. might indicate Extroversion.
 * Number of Personal Pronouns: Words like "I", "my", "me", etc. might indicate Introversion.
 * Polarity: Strong positive or negative sentiment might indicate Feeling. In contrast, minimal polarity could indicate Thinking.
 * Subjectivity: High subjectivity might correlate with Feeling, while low subjectivity could indicate Thinking.
 * Type-Token Ratio (TTR): The number of unique words divided by the total number of words. A higher TTR could correlate with Intuition and Thinking.
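
The count-based features in the list above can be sketched with the standard library alone. This is only an illustration: the word lists, the regular expressions, and the function name `lexical_features` are assumptions, and URLs are not handled specially (their dots count as sentence boundaries and their parts count as words).

```python
import re

# Illustrative word lists; the lists used in the final project may differ.
SOCIAL_WORDS = {"friend", "friends", "talk", "they", "us", "we", "together"}
PERSONAL_PRONOUNS = {"i", "my", "me", "mine", "myself"}

# A word is a run of letters, optionally with internal apostrophes ("don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")
SENTENCE_END_RE = re.compile(r"[.!?]+")


def lexical_features(text):
    """Compute simple count-based features for one person's combined posts."""
    words = WORD_RE.findall(text.lower())
    # Treat runs of ., ! or ? as sentence boundaries; ignore empty fragments.
    sentences = [s for s in SENTENCE_END_RE.split(text) if s.strip()]
    word_count = len(words)
    return {
        "word_count": word_count,
        "words_per_sentence": word_count / len(sentences) if sentences else 0.0,
        "social_words": sum(w in SOCIAL_WORDS for w in words),
        "personal_pronouns": sum(w in PERSONAL_PRONOUNS for w in words),
        "type_token_ratio": len(set(words)) / word_count if word_count else 0.0,
    }


print(lexical_features("I talk to my friend. My friend and I talk a lot!"))
```

Returning a dictionary keeps each feature named, so the results can later be expanded into DataFrame columns.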

Below, we outline how we plan to implement these features.

## Implementation & Technical Merit 

This dataset is general and, due to the nature of tweets, contains both natural language and non-natural-language content (for example, links). The dataset is not straightforward to work with and is minimal in terms of features. For this reason, we will create many of our own features, as previously mentioned. We will achieve this using an open-source library called [`TextBlob`](https://textblob.readthedocs.io/en/dev/), which can help us derive the attributes listed above.
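
For the two sentiment features, TextBlob exposes a `sentiment` property with a `polarity` in [-1.0, 1.0] and a `subjectivity` in [0.0, 1.0]. The sketch below wraps that call; the function name `sentiment_features` and the optional `analyze` hook are assumptions added here so the function can be exercised without TextBlob installed.

```python
def sentiment_features(text, analyze=None):
    """Return polarity and subjectivity for a piece of text.

    `analyze` is a hypothetical hook: any callable mapping text to a
    (polarity, subjectivity) pair. When omitted, TextBlob is used.
    """
    if analyze is None:
        # Imported lazily so TextBlob is only required for the default path.
        from textblob import TextBlob

        def analyze(value):
            sentiment = TextBlob(value).sentiment
            return sentiment.polarity, sentiment.subjectivity

    polarity, subjectivity = analyze(text)
    return {"polarity": polarity, "subjectivity": subjectivity}


# With TextBlob installed, a call might look like:
# sentiment_features("I love this, it is wonderful!")
```

Scoring each post separately and then averaging, rather than scoring the combined string, is one design option worth comparing.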

When it comes to predicting the class label, we have 16 possible outcomes. Each label is made up of 4 axes with 2 possibilities each. So instead of trying to pick one class out of 16 possibilities, we will create 4 binary classifiers, one per axis. We expect this to help accuracy, since each classifier faces a simpler two-class problem, and to give us more task-specific classification.
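
Training four binary classifiers requires four binary targets per person. A minimal sketch of that label decomposition follows; the function name `type_to_binary_labels` and the 0/1 encoding (0 for the first pole, 1 for the second) are arbitrary choices made for illustration.

```python
# Each axis as (first pole, second pole), in type-string order.
AXES = (("I", "E"), ("N", "S"), ("T", "F"), ("J", "P"))


def type_to_binary_labels(mbti_type):
    """Map a four-letter type to four 0/1 targets, one per axis.

    Encoding (an arbitrary choice): 0 = first pole, 1 = second pole.
    """
    mbti_type = mbti_type.upper()
    if len(mbti_type) != len(AXES):
        raise ValueError(f"expected a four-letter type, got {mbti_type!r}")
    labels = {}
    for letter, (first, second) in zip(mbti_type, AXES):
        if letter not in (first, second):
            raise ValueError(f"unexpected letter {letter!r} in {mbti_type!r}")
        labels[first + second] = int(letter == second)
    return labels


print(type_to_binary_labels("ENTJ"))  # {'IE': 1, 'NS': 0, 'TF': 0, 'JP': 0}
```

Each key can then serve as the target column for one of the four classifiers.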

Example:

 * Introversion (I) $\rightarrow$ **Extroversion (E)**
 * **Intuition (N)** $\leftarrow$ Sensing (S)
 * **Thinking (T)** $\leftarrow$ Feeling (F)
 * **Judging (J)** $\leftarrow$ Perceiving (P)  
*Each arrow points to the predicted pole of that axis.*
  
Class Predictions: E | N | T | J
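
The four per-axis predictions can be joined back into a single type string, as in the example above. This sketch assumes each classifier outputs 0 for the first pole of its axis and 1 for the second; that encoding and the name `combine_predictions` are illustrative.

```python
# Each axis as (first pole, second pole), in type-string order.
AXES = (("I", "E"), ("N", "S"), ("T", "F"), ("J", "P"))


def combine_predictions(binary_predictions):
    """Join four per-axis 0/1 predictions into one type string."""
    if len(binary_predictions) != len(AXES):
        raise ValueError("expected one prediction per axis")
    return "".join(poles[int(p)] for poles, p in zip(AXES, binary_predictions))


print(combine_predictions([1, 0, 0, 0]))  # ENTJ
```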

## Potential Impact
The results of this classification could offer insights into the psychology of people on social media. If we can find trends through the classification, we can learn how social media is used and learn about the people who use it.

Potential stakeholders include:  
* `Social media companies` - These companies can use the data to better understand their users and act on that understanding as they see fit.
* `Social media advertisers` - Advertisers can learn user behaviors and target ads to specific types of users.
* `Users` - Users can be affected by changes made in response to these insights. They can also gain insight about themselves if presented with the data.
* `Psychologists/Psychiatrists` - Social media is practically inescapable in the modern day. People spend significant portions of their lives on social media, so understanding how people behave there can better inform mental health experts on how to address problems that may arise from patients' social media use.

Using social media data can be very sensitive. Many people treat social media as an outlet for sharing what is going on in their lives. Given the complexity of people's psychology on social media, much of what people say can easily be overgeneralized or misinterpreted. This is true of natural language in general. For example, as is common on social media, if someone mocks another person by repeating their words, it might seem as if the poster actually holds that view.

People often post things on social media that they regret, don't think others will see, or don't actually want shared publicly. Users also share very sensitive and personal information, including topics such as violence, suicide, and substance use. For these reasons, we must be mindful as we use this data.