A complete explanation of the Myers-Briggs Type Indicator (MBTI):
https://www.youtube.com/watch?v=QPtrDt_VybY

Dataset links:
* https://www.kaggle.com/datasets/datasnaek/mbti-type
* https://www.kaggle.com/datasets/kaggle/meta-kaggle?select=ForumMessages.csv
Download only the ForumMessages.csv file from the second link (approx. 720 MB). The notebook below reads it as forum_topic_messages.csv, so rename the download or adjust the path.

'I': "Introversion",
'E': "Extraversion",
'N':'Intuition',
"S":"Sensing",
"T":"Thinking",
"F":"Feeling",
"J": "Judging",
"P": "Perceiving"

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
train_data=pd.read_csv('mbti_1.csv')

In [3]:
train_data.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [5]:
forum_data=pd.read_csv('forum_topic_messages.csv')

* <b> Data Preparation and Cleaning </b>

In [4]:
#TRAITS OF A PERSON ACCORDING TO OUR DATASET
# FOR EXAMPLE:
# INFJ - Person has 4 traits-> Introversion, Intuition, Feeling, Judging

mbti={
    'I': "Introversion",
    'E': "Extraversion",
    'N':'Intuition',
    "S":"Sensing",
    "T":"Thinking",
    "F":"Feeling",
    "J": "Judging",
    "P": "Perceiving"
}
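
* A minimal sketch of how the letter mapping can be used to expand a four-letter type into its traits. The helper name `describe_type` is hypothetical and not part of the original notebook; the dictionary is repeated so the snippet runs on its own.

```python
# Letter-to-trait mapping, repeated here so the snippet is self-contained
mbti = {
    'I': 'Introversion', 'E': 'Extraversion',
    'N': 'Intuition',    'S': 'Sensing',
    'T': 'Thinking',     'F': 'Feeling',
    'J': 'Judging',      'P': 'Perceiving',
}

def describe_type(type_code):
    """Return the four trait names for a code such as 'INFJ' (hypothetical helper)."""
    return [mbti[letter] for letter in type_code.upper()]

infj_traits = describe_type('INFJ')
print(infj_traits)
```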

In [6]:
forum_data.head()

Unnamed: 0,Id,ForumTopicId,PostUserId,PostDate,ReplyToForumMessageId,Message,Medal,MedalAwardDate
0,1,1,478,04/28/2010 23:13:08,,<div>In response to a comment on the No Free H...,,
1,2,2,606,04/29/2010 15:48:46,,"Hi, I'm interested in participating in the con...",,
2,3,2,478,04/29/2010 15:48:46,,"Tanya,<div><br></div><div>Good to hear from yo...",,
3,4,2,368,04/29/2010 15:48:46,,"Hi Tanya, <br><br>Kaggle will maintain a ratin...",,
4,14,7,478,05/02/2010 14:37:35,,Now that we have a handful of algorithms that ...,,


In [7]:
train_data.shape

(8675, 2)

In [8]:
forum_data.shape

(2077184, 8)

In [9]:
## Looking at the personality distribution in the training data
type_count=train_data['type'].value_counts()
print(type_count)

INFP    1832
INFJ    1470
INTP    1304
INTJ    1091
ENTP     685
ENFP     675
ISTP     337
ISFP     271
ENTJ     231
ISTJ     205
ENFJ     190
ISFJ     166
ESTP      89
ESFP      48
ESFJ      42
ESTJ      39
Name: type, dtype: int64
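
* The counts above are heavily skewed. As a rough illustration, the sketch below copies the printed counts into a plain dictionary and computes how often each letter occurs, which makes the imbalance on each of the four axes easier to see. The names `type_counts` and `letter_share` are made up for this example.

```python
from collections import Counter

# Counts copied from the value_counts() output above
type_counts = {
    'INFP': 1832, 'INFJ': 1470, 'INTP': 1304, 'INTJ': 1091,
    'ENTP': 685,  'ENFP': 675,  'ISTP': 337,  'ISFP': 271,
    'ENTJ': 231,  'ISTJ': 205,  'ENFJ': 190,  'ISFJ': 166,
    'ESTP': 89,   'ESFP': 48,   'ESFJ': 42,   'ESTJ': 39,
}
total = sum(type_counts.values())

# Add each type's count to all four of its letters
letter_counts = Counter()
for type_code, n in type_counts.items():
    for letter in type_code:
        letter_counts[letter] += n

letter_share = {letter: n / total for letter, n in letter_counts.items()}
for pair in ('IE', 'NS', 'TF', 'JP'):
    print(pair, {letter: round(letter_share[letter], 3) for letter in pair})
```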


* <b> Looking for missing values in the data </b>

In [10]:
print("Missing values in the train data:")
print(train_data.isnull().sum())

Missing values in the train data:
type     0
posts    0
dtype: int64


In [11]:
print("Missing values in the forum data:")
print(forum_data.isnull().sum())

Missing values in the forum data:
Id                             0
ForumTopicId                   0
PostUserId                     0
PostDate                       0
ReplyToForumMessageId    1057676
Message                     7150
Medal                    1178831
MedalAwardDate           1169140
dtype: int64


In [13]:
# Fraction of forum messages with a missing Medal value
1178831/2077184

0.5675139997227016

* About 57% of the values in the Medal column are missing. This is expected: most posts never win a medal, so a missing value most likely means "no medal" and can be replaced with 0.
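
* The same ratio can be computed for every column at once. The sketch below puts the counts printed above into a plain dictionary so that it runs without the dataset; on the real DataFrame, `forum_data.isnull().mean()` should give the same fractions.

```python
# Row count and missing-value counts copied from the outputs above
n_rows = 2077184
missing_counts = {
    'ReplyToForumMessageId': 1057676,
    'Message': 7150,
    'Medal': 1178831,
    'MedalAwardDate': 1169140,
}

# Share of rows with a missing value, per column
missing_fraction = {col: n / n_rows for col, n in missing_counts.items()}
for col, frac in missing_fraction.items():
    print(f"{col:<22} {frac:.2%}")
```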

In [14]:
forum_data['Medal'].value_counts()

3.0    831979
2.0     38883
1.0     27491
Name: Medal, dtype: int64

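* Note the distribution: 3 is by far the most common value. If Medal were a count of medals, this would be surprising, since winning three should be harder than winning one. A more plausible reading is that the value encodes the medal tier (Meta Kaggle appears to use 1 = gold, 2 = silver, 3 = bronze), in which case the most common value is simply the easiest medal to earn. The notebook does not verify this, so treat it as an assumption.
* We proceed with the dataset as it is. In a real-world project in your organisation, you would go to the client or data owner and check whether there are any issues with the dataset.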

In [15]:
# Replacing nulls in Message with an empty string, since only a small share of messages is missing

forum_data['Message']=forum_data['Message'].fillna('')
print("Missing values in forum data: ")
print(forum_data.isnull().sum())

Missing values in forum data: 
Id                             0
ForumTopicId                   0
PostUserId                     0
PostDate                       0
ReplyToForumMessageId    1057676
Message                        0
Medal                    1178831
MedalAwardDate           1169140
dtype: int64


In [16]:
# Replacing nulls in Medal with 0 (no medal)
forum_data['Medal']=forum_data['Medal'].fillna(0)


* ReplyToForumMessageId is only an ID column, and about half of its values (roughly 51%) are missing, so we drop it.

In [18]:
forum_data.drop(['ReplyToForumMessageId'],axis=1, inplace=True)

* Next, we group the messages by PostUserId, since a user may have posted more than once, and join each user's messages into a single string.

In [19]:
forum_data_g=forum_data.groupby('PostUserId')['Message'].agg(lambda col: ' '.join(col)).reset_index()
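
* To see what this aggregation does, here is a small sketch on a toy DataFrame. The three rows are invented for illustration; only the column names match the forum data.

```python
import pandas as pd

# Toy stand-in for forum_data: user 478 has two messages, user 368 has one
toy = pd.DataFrame({
    'PostUserId': [478, 368, 478],
    'Message': ['first post', 'hello', 'second post'],
})

# Same aggregation as in the notebook: one row per user, messages joined by a space
toy_g = (
    toy.groupby('PostUserId')['Message']
       .agg(lambda col: ' '.join(col))
       .reset_index()
)
print(toy_g)
```

Each user ends up with exactly one row, which is what the `value_counts()` check in the next cell confirms on the real data.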

In [20]:
print(forum_data_g['PostUserId'].value_counts())

62          1
6608670     1
6608665     1
6608589     1
6608524     1
           ..
2481048     1
2481047     1
2480956     1
2480902     1
17044377    1
Name: PostUserId, Length: 336000, dtype: int64


* <b> Data Cleaning </b>

In [21]:
# importing libraries
import re
from bs4 import BeautifulSoup
import string
from nltk.stem.snowball import SnowballStemmer

In [22]:
def text_cleaning(text):
    # stripping HTML tags (requires the lxml parser to be installed)
    text=BeautifulSoup(text, 'lxml').text
    # removing the '|||' post separators and URLs
    text=re.sub(r'\|\|\|',r' ',text)
    text=re.sub(r'http\S+', r' ', text)
    # removing punctuation (periods become spaces so sentences do not run together)
    text=text.replace('.', ' ')
    translator=str.maketrans('','',string.punctuation)
    text=text.translate(translator)
    # removing digits
    text=''.join(i for i in text if not i.isdigit())
    return text
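
* A dependency-free sketch of the same cleaning steps, applied to a small made-up string so the effect of each step is easy to follow. It is not the notebook's function: a regular expression stands in for BeautifulSoup (cruder than a real HTML parser), whitespace is collapsed at the end, and the name `text_cleaning_sketch` is hypothetical.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def text_cleaning_sketch(text):
    text = re.sub(r'<[^>]+>', ' ', text)   # crude stand-in for BeautifulSoup tag stripping
    text = text.replace('|||', ' ')        # post separators
    text = re.sub(r'http\S+', ' ', text)   # URLs
    text = text.replace('.', ' ')          # periods become spaces
    text = text.translate(_PUNCT_TABLE)    # remaining punctuation is deleted
    text = ''.join(ch for ch in text if not ch.isdigit())  # digits
    return ' '.join(text.split())          # collapse runs of whitespace (extra step)

sample = "I'm here.|||See https://example.com/x?y=1 now!|||<b>Top 10</b> tips"
cleaned = text_cleaning_sketch(sample)
print(cleaned)
```

Note that deleting apostrophes turns "I'm" into "Im", which is also visible in the cleaned post shown below.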

In [23]:
train_data['clean_posts']=train_data['posts'].apply(text_cleaning)



In [24]:
train_data['clean_posts'][1]

'Im finding the lack of me in these posts very alarming  Sex can be boring if its in the same position often  For example me and my girlfriend are currently in an environment where we have to creatively use cowgirl and missionary  There isnt enough    Giving new meaning to Game theory  Hello ENTP Grin  Thats all it takes  Than we converse and they do most of the flirting while I acknowledge their presence and return their words with smooth wordplay and more cheeky grins  This  Lack of Balance and Hand Eye Coordination  Real IQ test I score   Internet IQ tests are funny  I score s or higher   Now like the former responses of this thread I will mention that I dont believe in the IQ test  Before you banish    You know youre an ENTP when you vanish from a site for a year and a half return and find people are still commenting on your posts and liking your ideasthoughts  You know youre an ENTP when you        I over think things sometimes  I go by the old Sherlock Holmes quote   Perhaps when

In [25]:
forum_data_g['clean_messages']=forum_data_g['Message'].apply(text_cleaning)



In [26]:
forum_data_g['clean_messages'][1]



<b> STEMMING </b>

In [37]:
# Create the stemmer once and reuse it for every row
stemmer = SnowballStemmer('english')

def stem_text(text):
    # Stem each whitespace-separated word and rejoin with single spaces
    return ' '.join(stemmer.stem(word) for word in text.split())

In [38]:
train_data['clean_posts'] = train_data['clean_posts'].apply(stem_text)

In [39]:
train_data['clean_posts'][1]

'im find the lack of me in these post veri alarm sex can be bore if it in the same posit often for exampl me and my girlfriend are current in an environ where we have to creativ use cowgirl and missionari there isnt enough give new mean to game theori hello entp grin that all it take than we convers and they do most of the flirt while i acknowledg their presenc and return their word with smooth wordplay and more cheeki grin this lack of balanc and hand eye coordin real iq test i score internet iq test are funni i score s or higher now like the former respons of this thread i will mention that i dont believ in the iq test befor you banish you know your an entp when you vanish from a site for a year and a half return and find peopl are still comment on your post and like your ideasthought you know your an entp when you i over think thing sometim i go by the old sherlock holm quot perhap when a man has special knowledg and special power like my own it rather encourag him to seek a complex

In [40]:
forum_data_g['clean_messages'] = forum_data_g['clean_messages'].apply(stem_text)

In [41]:
forum_data_g['clean_messages'][1]



<b> Model Building </b>

In [42]:
#Importing the necessary libraries
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

In [43]:
kfolds= StratifiedKFold(n_splits=5, shuffle= True, random_state=1)

scoring= {
    'acc': 'accuracy',
    'neg_log_loss': 'neg_log_loss',
    'f1_micro':'f1_micro'
}

In [44]:
countVect= CountVectorizer(ngram_range=(1,1), stop_words='english', lowercase=True, max_features=5000)
model=Pipeline([('countVect', countVect),('lr', LogisticRegression(class_weight='balanced', C=0.005))])
results=cross_validate(model, train_data['clean_posts'], train_data['type'], cv=kfolds, scoring=scoring, n_jobs=-1)
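
* `class_weight='balanced'` matters here because the classes are so uneven. According to the scikit-learn documentation, the balanced weight of a class is `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to a few of the type counts printed earlier. The variable names are illustrative, and within cross-validation the weights are computed per training fold, so the actual values differ slightly.

```python
n_samples = 8675   # rows in the training data
n_classes = 16     # distinct MBTI types

# A few class counts copied from the value_counts() output
class_counts = {'INFP': 1832, 'INTJ': 1091, 'ESFP': 48, 'ESTJ': 39}

# Balanced weight: n_samples / (n_classes * class_count)
balanced_weight = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}
for label, weight in balanced_weight.items():
    print(f"{label}: {weight:.3f}")
```

Rare types such as ESTJ receive a much larger weight than common ones such as INFP, so errors on them count more during training.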

In [45]:
print("Accuracy: {:0.5f}".format(np.mean(results['test_acc'])))
print("Logloss: {:0.5f}".format(np.mean(-1*results['test_neg_log_loss'])))

Accuracy: 0.65718
Logloss: 1.28835
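
* To put these numbers in context, here is a rough sketch of two naive baselines: always predicting the most common type (INFP), and assigning equal probability to all 16 types. The counts come from the outputs above; the comparison itself is an addition and not part of the original analysis.

```python
import math

n_samples = 8675        # rows in the training data
majority_count = 1832   # INFP, the most common type
n_classes = 16

# Accuracy of always predicting the majority class
majority_accuracy = majority_count / n_samples

# Log loss of a uniform prediction over all classes: -ln(1/16) = ln(16)
uniform_log_loss = math.log(n_classes)

# A log loss L roughly corresponds to a geometric-mean probability exp(-L) on the true class
model_log_loss = 1.28835
mean_true_class_prob = math.exp(-model_log_loss)

print(f"Majority-class accuracy: {majority_accuracy:.4f}")
print(f"Uniform log loss:        {uniform_log_loss:.4f}")
print(f"exp(-model log loss):    {mean_true_class_prob:.4f}")
```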


<b> MODEL PREDICTIONS </b>

In [46]:
model.fit(train_data['clean_posts'], train_data['type'])
pred=model.predict(forum_data_g['clean_messages'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
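
* The warning above means the solver stopped at its iteration limit before converging. One common remedy is to raise `max_iter`. The sketch below shows the idea on four invented sentences with invented labels; the value 1000 is an assumption, and refitting the real model this way would likely change the scores reported earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy data, only to show where max_iter goes
texts = [
    "i love quiet evenings",
    "parties are great fun",
    "quiet books and tea",
    "fun loud parties",
]
labels = ["I", "E", "I", "E"]

model_sketch = Pipeline([
    ("countVect", CountVectorizer()),
    # max_iter raised from the default of 100 (the value 1000 is an assumption)
    ("lr", LogisticRegression(class_weight="balanced", C=0.005, max_iter=1000)),
])
model_sketch.fit(texts, labels)

# n_iter_ reports how many iterations the solver actually used
n_iter_used = model_sketch.named_steps["lr"].n_iter_
print(n_iter_used)
```

If `n_iter_` stays below `max_iter`, the solver converged and the warning should no longer appear.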


In [47]:
count=np.unique(pred, return_counts=True)
count

(array(['ENFJ', 'ENFP', 'ENTJ', 'ENTP', 'ESFJ', 'ESFP', 'ESTJ', 'ESTP',
        'INFJ', 'INFP', 'INTJ', 'INTP', 'ISFJ', 'ISFP', 'ISTJ', 'ISTP'],
       dtype=object),
 array([  3386,   1307,  25504,   6128,   3319,  86791,   1618,   3401,
           339,   1017, 104765,   9946,  30518,  22577,  11581,  23803],
       dtype=int64))

In [48]:
list_of_preds=list(zip(count[0], count[1]))
pred_df=pd.DataFrame(list_of_preds, columns=['Personality','Count'])
pred_df

Unnamed: 0,Personality,Count
0,ENFJ,3386
1,ENFP,1307
2,ENTJ,25504
3,ENTP,6128
4,ESFJ,3319
5,ESFP,86791
6,ESTJ,1618
7,ESTP,3401
8,INFJ,339
9,INFP,1017
10,INTJ,104765
11,INTP,9946
12,ISFJ,30518
13,ISFP,22577
14,ISTJ,11581
15,ISTP,23803
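
* A quick sanity check is to compare the predicted shares with the training shares. The sketch below uses counts copied from the outputs above for a few types; the names are illustrative. Large gaps, such as for ESFP, may hint that the model does not transfer well to forum text, although the notebook itself does not investigate this.

```python
n_train = 8675     # training rows
n_forum = 336000   # forum users with a prediction

# (training count, predicted count) copied from the outputs above
counts = {
    'INTJ': (1091, 104765),
    'ESFP': (48, 86791),
    'INFP': (1832, 1017),
    'INFJ': (1470, 339),
}

# Share of each type in the training data versus in the predictions
share_comparison = {
    label: (train / n_train, predicted / n_forum)
    for label, (train, predicted) in counts.items()
}
for label, (train_share, pred_share) in share_comparison.items():
    print(f"{label}: train {train_share:.3f}  predicted {pred_share:.3f}")
```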


<b> SCOPE FOR IMPROVEMENT </b>

* Text cleaning can be improved by removing stop words and preferring lemmatization over stemming.
* Only logistic regression was used here. Other models may improve performance.