**A study of personality using NLP techniques**

I am trying to find out whether personality is a factor in how a person writes. I use the Word2Vec model by [Mikolov et al.](https://github.com/svn2github/word2vec) to calculate word similarities, taking the cosine similarity (the dot product of the length-normalized vectors) of the word vectors trained by the model. The [four-temperament model by David Keirsey](https://keirsey.com/temperament-overview/) treats temperament as the factor and divides people into four groups (Artisan, Guardian, Idealist, Rational). In the linked article he draws a parallel between the Jungian types and his temperaments as follows: Artisan = SP, Guardian = SJ, Idealist = NF, Rational = NT. I performed a one-way ANOVA with these four groups; the result is reported at the end of this notebook.
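The type-to-temperament mapping described above can be written as a small helper. This is a minimal sketch; the function name `temperament` is an assumption made for illustration and does not appear in the original notebook.

```python
def temperament(mbti_type):
    """Map a four-letter MBTI type to its Keirsey temperament code."""
    t = mbti_type.upper()
    if len(t) != 4:
        raise ValueError("expected a four-letter type such as 'INTJ'")
    if t[1] == "S":
        # Sensing types split on the last letter: SJ (Guardian) or SP (Artisan)
        return "S" + t[3]
    # Intuitive types split on the third letter: NF (Idealist) or NT (Rational)
    return "N" + t[2]


for example in ("ISTJ", "ESFP", "INFP", "ENTJ"):
    print(example, "->", temperament(example))
```

Note that SJ and SP are not contiguous substrings of the type code (for example `ISTJ`), while NF and NT are, which is why the two Sensing temperaments need separate handling.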

In [4]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [8]:
# Load the MBTI dataset: column 0 holds the type, column 1 the person's posts
text=pd.read_csv("../input/mbti_1.csv")

In [10]:
posts=text.values.tolist()

In [11]:
# All 16 MBTI types (the original list repeated 'ISFP' and omitted 'ISTJ')
mbti_list=['ENFJ','ENFP','ENTJ','ENTP','ESFJ','ESFP','ESTJ','ESTP','INFJ','INFP','INTJ','INTP','ISFJ','ISFP','ISTJ','ISTP']
values = [0]*len(mbti_list)
index = np.arange(len(mbti_list))

In [12]:
for post in posts:
    for i in range(0,len(mbti_list)):
        if post[0] == mbti_list[i]:
            values[i]=values[i]+1

In [16]:
plt.bar(index,values)
plt.xlabel('Personality Type',fontsize=5)
plt.ylabel('Number of persons',fontsize=5)
plt.xticks(index,mbti_list,fontsize=8,rotation=35)
plt.title('Distribution of types in the dataset (1 person = 50 tweets)')
plt.show()

This plot shows the number of people of each type in the dataset.
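The same per-type tally can also be obtained with `collections.Counter` instead of a nested loop. The sketch below is only illustrative: `sample_posts` is a hypothetical stand-in with the same `[type, posts]` row shape as `posts`, not data from the dataset.

```python
from collections import Counter

# Hypothetical rows in the same [type, posts] shape as `posts`
sample_posts = [
    ["INFP", "first person's posts"],
    ["INTJ", "second person's posts"],
    ["INFP", "third person's posts"],
]

# Count how many rows carry each type label
counts = Counter(row[0] for row in sample_posts)

# A Counter returns 0 for types that never occur
print(counts["INFP"], counts["INTJ"], counts["ENTP"])
```

A `Counter` does not depend on a hand-written list of type names, so a typo in that list cannot silently drop a type from the tally.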

This function collects the posts belonging to one temperament, pre-processes them (lowercasing, removing links and punctuation), trains a Word2Vec model on the result, and saves the model to a `.bin` file.

In [44]:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
import re

def train_w2v_using_key(temp):
    # Collect the posts of every person whose type belongs to temperament `temp`
    perlist=list()
    if temp=="SJ":
        for i in posts:
            if i[0] in ('ISFJ','ISTJ','ESFJ','ESTJ'):
                perlist.append(i[1])
    elif temp == 'SP':
        for i in posts:
            if i[0] in ('ISFP','ISTP','ESFP','ESTP'):
                perlist.append(i[1])
    else:
        # "NF" and "NT" occur as contiguous substrings of the type code
        for i in posts:
            if temp in i[0]:
                perlist.append(i[1])
    for i in range(0,len(perlist)): # using some code https://www.kaggle.com/prnvk05/rnn-mbti-predictor for filtering out links and numbers from the text 
        tempstr = ''.join(str(e) for e in perlist[i])
        post=tempstr.lower()
        post=post.replace('|||',"")
        post = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', '', post, flags=re.MULTILINE) 
        puncs1=['@','#','$','%','^','&','*','(',')','-','_','+','=','{','}','[',']','|','\\','"',"'",';',':','<','>','/']
        for punc in puncs1:
            post=post.replace(punc,'') 

        puncs2=[',','.','?','!','\n']
        for punc in puncs2:
            post=post.replace(punc,' ') 
        post=re.sub( r'\s+', ' ', post ).strip()
        # Store the cleaned text inside the loop so every post is updated, not only the last one
        perlist[i]=post
    
    word_tokens=[]
    for i in range(0,len(perlist)): 
        word_tokens.append(word_tokenize(perlist[i]))

    model = Word2Vec(word_tokens, min_count=1)
    model.save(temp+".bin")

In [46]:
train_w2v_using_key("NT")

In [39]:
train_w2v_using_key("NF")

In [40]:
train_w2v_using_key("SP")

In [45]:
train_w2v_using_key("SJ")

**Similarities**

`model.wv.similarity('word1','word2')` returns the cosine similarity of the vectors of the words word1 and word2.
The higher the cosine similarity, the more similar the two words are.

In [47]:
model=Word2Vec.load("NT.bin")
model.wv.similarity("defend","justify")
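For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch in plain Python; the helper name `cosine_similarity` and the three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u
w = [3.0, 0.0, -1.0]  # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0, regardless of their lengths.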

The following shows that **temperament is a factor** in how a person writes: the F-value obtained is larger than the critical value, and the p-value is much smaller than the significance level of the hypothesis test.
![](https://i.imgur.com/iJtapj3.jpg)
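The notebook does not show how the ANOVA was computed or which observations went into it, so the following is only a sketch of how a one-way F-statistic is obtained for four groups. The function name `one_way_anova_f` and the numbers are placeholders for illustration, not the values behind the result above.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F-statistic for a list of samples."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of observations
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Placeholder observations for the four temperaments (not real results)
placeholder_groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
    [4.0, 5.0, 6.0],
]
f_value = one_way_anova_f(placeholder_groups)
print(f_value)
```

The F-value is then compared with the critical value of the F distribution with `k - 1` and `n - k` degrees of freedom. In practice `scipy.stats.f_oneway` returns both the F-statistic and the p-value.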
