# Myers-Briggs Type Indicator Text Classification Project

By: Tony Bennett 

## Overview

This notebook walks through classifying text to predict the Myers-Briggs type of the person who wrote it. The goal is to build an accurate model that can be used to group people based on these types. Doing so requires a large amount of data cleaning, along with the standard steps of any text classification project. For each model I tried both a count vectorizer and TF-IDF scores to measure how important each word was in the corpus. I tested several models to see which scored best before settling on a final one. Below I summarize each section and link to its notebook.

## Business Problem 

The head of human resources (HRD) and the CEO at Cartex are starting a new initiative: small group meetings before the start of each work day. The meetings will be casual, with a few questions each day on work-related and non-work-related matters. The HRD wants to make sure that workers are grouped with people who are different from them, i.e. maybe not the first person they would sit down with at lunch. The question is how to get information on the employees in order to group them. The HRD decides against a questionnaire, since employees tend to be untruthful on them and Cartex had already tried a questionnaire-based program that did not meet the company's standards. The HRD decides to use the Myers-Briggs test to group the employees but does not want them to waste time taking the test, so he asks the CEO what to do. The CEO recommends talking to the recently formed data science team. The data science team comes up with a plan to build machine learning models that classify employees' internet posts and predict their Myers-Briggs type. The employees are aware of the plan and submit their social media links to the data science team. The team decides to test a few different models and pick a final one that gives a solid prediction of each employee's Myers-Briggs type.

## Data Understanding 

The data set is from Kaggle and is contained in one file. It has over 8,600 rows and two columns: a person's Myers-Briggs type and the last 50 things they posted on an online forum. The posts are separated by `|||`. The 16 personality types are built from 4 axes:
- Introversion (I) – Extroversion (E)
- Intuition (N) – Sensing (S)
- Thinking (T) – Feeling (F)
- Judging (J) – Perceiving (P)

Taking one letter from each of the 4 axes gives a 4-letter type, for example ESTJ. Since each axis has two possible letters, there are 2 × 2 × 2 × 2 = 16 potential MBTI (Myers-Briggs Type Indicator) types in the data set. The MBTI has been overshadowed by other methods of measuring personality, but it is still regarded as a useful tool in the psychological community. The data is also very imbalanced, as most of the people collected are introverted and intuitive, so that has to be addressed both when preparing to model and while modelling.
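As a quick check of the count above, the 16 types can be enumerated by taking one letter from each axis. This is only an illustrative sketch; the variable names are my own and do not come from the notebooks.

```python
from itertools import product

# One tuple per axis, holding the two possible letters for that position.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

# Every combination of one letter per axis is a valid MBTI type.
all_types = ["".join(letters) for letters in product(*AXES)]

print(len(all_types))  # number of distinct types
print(all_types)
```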

The potential personalities are:
1.	ISTJ: Quiet, serious, earn success by thoroughness and dependability
2.	ISFJ: Quiet, friendly, responsible, and conscientious
3.	INFJ: Seek meaning and connection in ideas, relationships, and material possessions
4.	INTJ: Have original minds and great drive for implementing their ideas and achieving their goals
5.	ISTP: Tolerant and flexible, quiet observers until a problem appears, then act quickly to find workable solutions
6.	ISFP: Quiet, friendly, sensitive, and kind.
7.	INFP: Idealistic, loyal to their values and to people who are important to them
8.	INTP: Seek to develop logical explanations for everything that interests them
9.	ESTP: Flexible and tolerant, they take a pragmatic approach focused on immediate results
10.	ESFP: Outgoing, friendly, and accepting
11.	ENFP: Warmly enthusiastic and imaginative
12.	ENTP: Quick, ingenious, stimulating, alert, and outspoken
13.	ESTJ: Practical, realistic, matter-of-fact
14.	ESFJ: Warmhearted, conscientious, and cooperative
15.	ENFJ: Warm, empathetic, responsive, and responsible
16.	ENTJ: Frank, decisive, assume leadership readily

Data Source:

https://www.kaggle.com/datasnaek/mbti-type

MBTI information:

https://www.myersbriggs.org/my-mbti-personality-type/mbti-basics/home.htm

https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator

## Stakeholders

- Head of Human Resources 
- CEO 
- Cartex Employees

## Evaluation Metrics 

The metrics I used to measure my models are: 
- Geometric mean score
- ROC-AUC
- Average Precision-Recall score
- Imbalanced classification report (a text summary of precision, recall, specificity, geometric mean, and index balanced accuracy)
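The sketch below shows how the first three metrics could be computed for a single binary axis using scikit-learn. It is not taken from the notebooks: the `evaluate_axis` helper, the 0.5 threshold, and the toy labels are assumptions for illustration. The geometric mean is computed by hand as the square root of sensitivity times specificity; the imbalanced-learn package also ships `geometric_mean_score` and `classification_report_imbalanced` in `imblearn.metrics`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, recall_score, roc_auc_score


def evaluate_axis(y_true, y_score, threshold=0.5):
    """Score one binary axis (hypothetical helper, threshold is an assumption)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)

    # Recall on the positive class is sensitivity; on the negative class, specificity.
    sensitivity = recall_score(y_true, y_pred, pos_label=1)
    specificity = recall_score(y_true, y_pred, pos_label=0)

    return {
        "geometric_mean": float(np.sqrt(sensitivity * specificity)),
        "roc_auc": float(roc_auc_score(y_true, y_score)),
        "average_precision": float(average_precision_score(y_true, y_score)),
    }


# Toy, imbalanced example: four negatives and two positives.
y_true = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4]
scores = evaluate_axis(y_true, y_score)
print(scores)
```

ROC-AUC and average precision are computed from the predicted scores rather than the thresholded labels, which is why the helper takes probabilities as input.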

## EDA, Cleaning of Data, and Lemmatization

In this notebook I took a look at the raw post data. It is pretty messy and contains things like links and emojis. The 4-letter Myers-Briggs codes are also referenced pretty frequently. Each row holds the 50 posts collected for one person. There are no nulls in the set, and all 16 potential MBTI types are included. The set appears to be heavily imbalanced, though: INFP, INFJ, and INTP are the most common types, which means the people in the collected sample tend to be more introverted and intuitive. The imbalanced nature of the data set calls for some decisions to be made while preprocessing the data and when sampling.

I modified the data set so that each of the 4 axes becomes a binary class: a 1 or a 0, depending on the personality type of the person.
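A minimal sketch of that encoding with pandas is shown here. The column name `type`, the new column names, and the choice of which letter maps to 1 are assumptions for illustration; the notebook may use different names or the opposite polarity.

```python
import pandas as pd

# Tiny stand-in for the Kaggle data; "type" is the assumed column name.
df = pd.DataFrame({"type": ["INFP", "ESTJ", "INTJ"]})

# Map each axis to the letter that is encoded as 1 and its position in the code.
# Which letter counts as 1 is an arbitrary choice made for this sketch.
AXIS_LETTERS = {"is_E": (0, "E"), "is_S": (1, "S"), "is_F": (2, "F"), "is_P": (3, "P")}

for column, (position, letter) in AXIS_LETTERS.items():
    df[column] = (df["type"].str[position] == letter).astype(int)

print(df)
```

Each axis can then be modelled as its own binary classification problem.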

Cleaning the data involved some processes that are done in most text classification problems. This included: 

- making the post data lower case
- getting rid of the '|||' 
- dropping punctuation
- dropping email addresses 

I used WordNetLemmatizer to normalize the text. I also dropped the 16 MBTI type codes from the post text, since we don't want them to affect the predictions of our models.
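The cleaning steps above could look roughly like the following. This is a sketch under assumptions, not the notebook's code: the `clean_posts` function and the regular expressions are hypothetical, link removal is added because the raw posts contain links, and lemmatization is left out because NLTK's `WordNetLemmatizer` needs the WordNet corpus to be downloaded first.

```python
import re
import string
from itertools import product

# All 16 type codes in lower case, e.g. "intj", "enfp".
MBTI_TYPES = ["".join(letters) for letters in product("ie", "ns", "tf", "jp")]

# Match a type code as a whole word, optionally pluralised ("infps").
TYPE_PATTERN = re.compile(r"\b(" + "|".join(MBTI_TYPES) + r")s?\b")
URL_PATTERN = re.compile(r"https?://\S+")    # assumption: links are dropped too
EMAIL_PATTERN = re.compile(r"\S+@\S+")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_posts(raw):
    """Apply the cleaning steps to one row of raw posts (hypothetical helper)."""
    text = raw.lower()
    text = text.replace("|||", " ")          # post separator
    text = URL_PATTERN.sub(" ", text)
    text = EMAIL_PATTERN.sub(" ", text)
    text = TYPE_PATTERN.sub(" ", text)       # remove label leakage
    text = text.translate(PUNCT_TABLE)       # punctuation becomes whitespace
    return " ".join(text.split())            # collapse repeated whitespace


raw = "I'm an INFP!|||Mail me: bob@example.com|||Check https://example.com/x, fellow ENTJs."
cleaned = clean_posts(raw)
print(cleaned)
```

Links and email addresses are removed before punctuation so that their fragments do not survive as stray words.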

Link to Notebook:

https://github.com/tonymbennett5/MBTI-ML-Social-Media-/blob/main/notebooks/EDA%20and%20cleaning%20of%20MBTI%20data%20.ipynb

## Count Vectorizer 

I decided to run a count vectorizer on the post data for my own analysis; a separate vectorizer is added to the modelling pipeline later. I used `CountVectorizer()`, which tokenizes the text and also does basic preprocessing: it removes punctuation marks and converts all words to lowercase. Using the vectorizer I was able to get a list of the most used words in the data set. They are mostly 3- or 4-letter words that don't carry much significance.

Link to Notebook:

https://github.com/tonymbennett5/MBTI-ML-Social-Media-/blob/main/notebooks/Counting.ipynb

## Sentiment Analysis and POS Tagging 

Sentiment analysis examines text to determine the sentiment behind it. Using basic sentiment analysis we can see whether the post data has a positive, negative, or neutral sentiment. I recorded those three scores and also added the compound score, which summarizes them in a single normalized value. I used `SentimentIntensityAnalyzer()` from VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, so it is a good fit for this data set.

A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other tokens). Computational applications generally use more fine-grained POS tags like 'noun-plural'. We can compute the average frequency of each part of speech per person and add these values to the clean data set.

Link to Notebook:

https://github.com/tonymbennett5/MBTI-ML-Social-Media-/blob/main/notebooks/Sentiment%20Analysis%20and%20POS%20tagging.ipynb

## Counting 

In this notebook I added columns for counts and averages of specific text features such as question marks, exclamation points, colons, and emojis. I also counted unique words, upper-case words, links, ellipses, and images. Each of these has its own column and can be added to our model to strengthen it.
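A few of these counts can be sketched with the standard library alone. The `count_features` helper, the regular expressions, and the feature names are assumptions for illustration; emoji and image counts are left out because they depend on how the notebook defines them.

```python
import re

LINK_PATTERN = re.compile(r"https?://\S+")


def count_features(raw_posts):
    """Count simple surface features in one row of raw posts (hypothetical helper)."""
    n_posts = len(raw_posts.split("|||"))

    # Remove links before extracting words so URL fragments are not counted.
    text_without_links = LINK_PATTERN.sub(" ", raw_posts)
    words = re.findall(r"[A-Za-z']+", text_without_links)

    counts = {
        "question_marks": raw_posts.count("?"),
        "exclamation_points": raw_posts.count("!"),
        "ellipses": len(re.findall(r"\.\.\.", raw_posts)),
        "links": len(LINK_PATTERN.findall(raw_posts)),
        "upper_case_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "unique_words": len({w.lower() for w in words}),
    }

    # Per-post averages make rows with different numbers of posts comparable.
    averages = {f"{name}_avg": value / n_posts for name, value in counts.items()}
    return {**counts, **averages}


features = count_features("Why not? WOW!!|||See https://example.com ... ok|||Why")
print(features)
```

Unlike the cleaning step, these counts are taken from the raw text, since casing and punctuation are exactly what is being measured.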

Link to Notebook:

https://github.com/tonymbennett5/MBTI-ML-Social-Media-/blob/main/notebooks/Counting.ipynb

## Modelling 



I built a few different types of models before selecting a final one. For each model I ran a count-vectorized version and a TF-IDF version so I could see which scored higher. I focused mostly on the Average Precision-Recall score for each classifier. The models I tested were:
- Logistic Regression (the best)
- Logistic Ridge Regression
- Decision Tree Classifier (the worst)
- Support Vector Classifier

Since Logistic Regression performed the best, I moved forward with it to find the important features of each axis.
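The comparison described above can be sketched with scikit-learn pipelines. The toy posts, the labels, the split, and the hyperparameters below are made up for illustration and are not the notebook's settings; the real models are trained once per axis on the cleaned data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up posts for one axis: 0 = introvert, 1 = extrovert.
introvert_posts = [
    "quiet evening reading a book alone",
    "stayed home and wrote in my journal",
    "long walk alone to think things over",
    "reading alone is how i recharge",
]
extrovert_posts = [
    "big party with friends tonight",
    "met so many new people at the party",
    "talking with friends all night long",
    "friends and people give me energy",
]
texts = (introvert_posts + extrovert_posts) * 3
labels = ([0] * 4 + [1] * 4) * 3

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the same classifier on top of each vectorizer and compare scores.
results = {}
for name, vectorizer in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    probabilities = model.predict_proba(X_test)[:, 1]
    results[name] = average_precision_score(y_test, probabilities)

print(results)
```

Putting the vectorizer inside the pipeline means it is fitted on the training split only, which avoids leaking test vocabulary into the model.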

Finally, I checked the feature importance of the final model.

Link to Notebook:

https://github.com/tonymbennett5/MBTI-ML-Social-Media-/blob/main/notebooks/Modeling.ipynb

## Conclusion 

- The data set was heavily imbalanced, which caused problems while classifying. More people were Introverted and Intuitive than Extroverted and Sensing.
- The models had a tough time discerning between Extroversion vs. Introversion and Sensing vs. Intuition.
- I used random undersampling to improve scores, but not as significantly as I had hoped.
- The Myers-Briggs test is fairly basic. People often score near the middle of an axis, which can cause problems when trying to classify them. Cartex's groups might not be as balanced as they would like.
- I added additional words to the stop word list.
- I would consider the model a success, since even humans have a tough time discerning someone's MBTI. However, personality is far more complex than words and text expression alone.


## For the Future

- Add more data for the types that were underrepresented
- Continue to try different model types, possibly a neural-network-based model