The Myers Briggs Type Indicator (or MBTI) is a personality type system that divides everyone into 16 distinct personality types across 4 axis:

Introversion (I) – Extroversion (E)

Intuition (N) – Sensing (S)

Thinking (T) – Feeling (F)

Judging (J) – Perceiving (P)


So for example, someone who prefers introversion, intuition, thinking and perceiving would be labelled an INTP in the MBTI system, and there are lots of personality based components that would model or describe this person’s preferences or behaviour based on the label.

This dataset contains 8675 rows of data, where each row contains a person’s:

- Type (This persons 4 letter MBTI code/type)
- A section of each of the last 50 things they have posted (Each entry separated by "|||" (3 pipe characters))

## Importing all the necessary libraries

In [169]:
import sys
import re
import string
import pickle
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from tqdm import tqdm
import plotly.express as px
from sklearn.feature_extraction.text import TfidfVectorizer

from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from sklearn.utils import class_weight
import gensim
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score
import joblib

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import preprocessing
import tensorflow as tensor
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers import Embedding, LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

import tweepy
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sb
import matplotlib.pyplot as plt
import wordcloud
from wordcloud import WordCloud, STOPWORDS
from collections import Counter
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from sklearn.ensemble import RandomForestClassifier

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
mbti = pd.read_csv('C:\\FinalYearProject\\FirstReview - TweetAnalysis\\Dataset\\MBTI_data.csv')
mbti.head()

Unnamed: 0,type,posts
0,INFJ,'http://www.youtube.com/watch?v=qsXHcwe3krw|||...
1,ENTP,'I'm finding the lack of me in these posts ver...
2,INTP,'Good one _____ https://www.youtube.com/wat...
3,INTJ,"'Dear INTP, I enjoyed our conversation the o..."
4,ENTJ,'You're fired.|||That's another silly misconce...


In [3]:
train_data, test_data = train_test_split(mbti, test_size=0.2)

In [4]:
train_data.head()

Unnamed: 0,type,posts
7301,ISFJ,"'Dear PerC, ...Hi, I'm back after over 2 year..."
176,INTP,'My mind makes connections that seem so obviou...
1404,INFP,'You can play chaotic neutral but still be pro...
1292,INFJ,'They definitely wouldn't inject you with anyt...
481,INFP,'I have not actually ghosted anyone. I'm jus...


### Checking for the null values

In [5]:
train_data.isnull().any()

type     False
posts    False
dtype: bool

In [6]:
train_data.shape

(6940, 2)

In [7]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6940 entries, 7301 to 4890
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   type    6940 non-null   object
 1   posts   6940 non-null   object
dtypes: object(2)
memory usage: 162.7+ KB


In [8]:
types = np.unique(np.array(train_data['type']))
types

array(['ENFJ', 'ENFP', 'ENTJ', 'ENTP', 'ESFJ', 'ESFP', 'ESTJ', 'ESTP',
       'INFJ', 'INFP', 'INTJ', 'INTP', 'ISFJ', 'ISFP', 'ISTJ', 'ISTP'],
      dtype=object)

In [9]:
print(train_data.type.value_counts())

INFP    1462
INFJ    1189
INTP    1065
INTJ     855
ENFP     555
ENTP     535
ISTP     274
ISFP     212
ENTJ     188
ISTJ     161
ENFJ     151
ISFJ     131
ESTP      67
ESFP      33
ESTJ      32
ESFJ      30
Name: type, dtype: int64


### -> Adding one column for each MBTI characteristic pair, since we will be training independent classifier model for each pair independently. The reason for this is because of imbalance present in our dataset as seen in the EDA section.

In [10]:
train_data['ie'] = train_data['type']
for i in train_data.index:
    if 'I' in train_data['type'][i]:
        train_data['ie'][i] = 'I'
    elif 'E' in train_data['type'][i]:
        train_data['ie'][i] = 'E'

posts = train_data.posts.values
yIE = train_data.ie.values

In [11]:
posts.shape

(6940,)

In [12]:
train_data.head()

Unnamed: 0,type,posts,ie
7301,ISFJ,"'Dear PerC, ...Hi, I'm back after over 2 year...",I
176,INTP,'My mind makes connections that seem so obviou...,I
1404,INFP,'You can play chaotic neutral but still be pro...,I
1292,INFJ,'They definitely wouldn't inject you with anyt...,I
481,INFP,'I have not actually ghosted anyone. I'm jus...,I


In [13]:
print ("Introversion (I) /  Extroversion (E):\t", train_data['ie'].value_counts()['I'], " / ", train_data['ie'].value_counts()['E'])

Introversion (I) /  Extroversion (E):	 5349  /  1591


### -> Now data preprocessing is performed using regular expressions

In [14]:
#regular expressions for tokenization
regexes = [
    #urls
    #r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&amp;+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',
    
    #html
    #r'<[^>]+>',
    
    #punctuation
    r'(?:(\w+)\'s)',
    
    r'(?:\s(\w+)\.+\s)',
    r'(?:\s(\w+),+\s)',
    r'(?:\s(\w+)\?+\s)',
    r'(?:\s(\w+)!+\s)',
    
    r'(?:\'+(\w+)\'+)',
    r'(?:"+(\w+)"+)',
    r'(?:\[+(\w+)\]+)',
    r'(?:{+(\w+)}+)',
    r'(?:\(+(\w+))',
    r'(?:(\w+)\)+)',

    #words containing numbers & special characters & punctuation
    r'(?:(?:(?:[a-zA-Z])*(?:[0-9!"#$%&\'()*+,\-./:;<=>?@\[\\\]^_`{|}~])+(?:[a-zA-Z])*)+)',
    
    #pure words
    r'([a-zA-Z]+)',
    
    #numbers
    #r'(?:(?:\d+,?)+(?:\.?\d+)?)',

    #emoticons
    #r"""(?:[:=;][oO\-]?[D\)\]\(\]/\\OpP])""",

    #other words
    #r'(?:[\w_]+)',

    #anything else
    #r'(?:\S)'
]

#compiling regular expression
regex = re.compile(r'(?:'+'|'.join(regexes)+')', re.VERBOSE | re.IGNORECASE)

-> Functions like tokenization, lemmatization and stopwords removal is performed

In [15]:

def preprocess(documents):
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    
    #fetching list of stopwords
    punctuation = list(string.punctuation)
    swords = stopwords.words('english') + ['amp'] + ['infp', 'infj', 'intp', 'intj', 'isfp', 'isfj', 'enfp', 'enfj', 'entp', 'entj', 'esfp', 'esfj', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'january', 'feburary', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december',  'mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun',  'jan', 'feb', 'mar', 'apr', 'may', 'jun' 'jul', 'aug', 'sep', 'oct', 'nov', 'dec', 'tommorow', 'today', 'yesterday'] + ['mr', 'mrs']


    processed_documents = []
    for i,document in enumerate(documents):
        print('{0}/{1}'.format(i+1, len(documents)))
        
        #tokenization
        tokens = regex.findall(document)

        #skipping useless tokens
        t_regex = re.compile(r"[^a-zA-Z]")
        document = []
        
        for token in tokens:
            token = np.array(token)
            token = np.unique(token[token != ''])
            
            if len(token) > 0:
                token = token[0].lower()
            else:
                continue
                
            if re.search(t_regex, token) == None and token not in swords:
                token = lemmatizer.lemmatize(token)
                document.append(token)
                
        document = ' '.join(document)

        #skipping
        if len(document) >= 0:
            processed_documents.append(document)

    print()
    return np.array(processed_documents)

In [16]:
%%time
posts = preprocess(posts)

1/6940
2/6940
3/6940
4/6940
5/6940
6/6940
7/6940
8/6940
9/6940
10/6940
11/6940
12/6940
13/6940
14/6940
15/6940
16/6940
17/6940
18/6940
19/6940
20/6940
21/6940
22/6940
23/6940
24/6940
25/6940
26/6940
27/6940
28/6940
29/6940
30/6940
31/6940
32/6940
33/6940
34/6940
35/6940
36/6940
37/6940
38/6940
39/6940
40/6940
41/6940
42/6940
43/6940
44/6940
45/6940
46/6940
47/6940
48/6940
49/6940
50/6940
51/6940
52/6940
53/6940
54/6940
55/6940
56/6940
57/6940
58/6940
59/6940
60/6940
61/6940
62/6940
63/6940
64/6940
65/6940
66/6940
67/6940
68/6940
69/6940
70/6940
71/6940
72/6940
73/6940
74/6940
75/6940
76/6940
77/6940
78/6940
79/6940
80/6940
81/6940
82/6940
83/6940
84/6940
85/6940
86/6940
87/6940
88/6940
89/6940
90/6940
91/6940
92/6940
93/6940
94/6940
95/6940
96/6940
97/6940
98/6940
99/6940
100/6940
101/6940
102/6940
103/6940
104/6940
105/6940
106/6940
107/6940
108/6940
109/6940
110/6940
111/6940
112/6940
113/6940
114/6940
115/6940
116/6940
117/6940
118/6940
119/6940
120/6940
121/6940
122/6940
123/6940
1

In [17]:
posts[0]

'perc back year away really missed realize much came back tonight want stay support bubble living meeting people found type nt grateful patience respect thought reminds try boyfriend call humor quirky tell much love quirky moment first time pointed laughed much strangest thing escape mouth without tuckered dealing multitude people work bring go event socially drained isfjs feeling like great time take istj definitely romantic share similar love language feel like appreciate sure istj act really cutesy one get cute quite bit typically related petite size boyfriend call quirky lot tell love quirky moment mostly feel relevant life encountered situation way often job cannot deal might start taking improv class way better think hello realize much missed perc back feel like missing back weird know stopped visiting perc regularly lovely back much productive browsing forum maturity level side relationship roomed immature yet immature taxing relationship u imploded istj partner silly issue work

-> Now adding the preprocessed posts back in the dataframe

In [18]:
train_data['posts'] = posts
train_data.head(3)

Unnamed: 0,type,posts,ie
7301,ISFJ,perc back year away really missed realize much...,I
176,INTP,mind make connection seem obvious vastly diffe...,I
1404,INFP,play chaotic neutral still right useful party ...,I


## Cleaning and preprocessing of test data

In [19]:
test_data.head()

Unnamed: 0,type,posts
595,ESTJ,intj|||isfp|||They have been taught to live a ...
5752,INFP,'Hi everyone! Took a little time off and feeli...
930,ENFJ,I'd say so/sx 9w1*|||I hope not :) are you? :/...
7002,INFP,"I like this. Okay, I am going to get in touch ..."
7747,ENFP,That's cause watching a screen just before you...


In [20]:
test_data.shape

(1735, 2)

In [21]:
print(test_data.type.value_counts())

INFP    370
INFJ    281
INTP    239
INTJ    236
ENTP    150
ENFP    120
ISTP     63
ISFP     59
ISTJ     44
ENTJ     43
ENFJ     39
ISFJ     35
ESTP     22
ESFP     15
ESFJ     12
ESTJ      7
Name: type, dtype: int64


### Preprocessing on Test Data

In [22]:
test_data['ie'] = test_data['type']
for i in test_data.index:
    if 'I' in test_data['type'][i]:
        test_data['ie'][i] = 'I'
    elif 'E' in test_data['type'][i]:
        test_data['ie'][i] = 'E'

test_posts = test_data.posts.values
test_yIE = test_data.ie.values

In [23]:
test_posts.shape

(1735,)

In [24]:
%%time
test_posts = preprocess(test_posts)

1/1735
2/1735
3/1735
4/1735
5/1735
6/1735
7/1735
8/1735
9/1735
10/1735
11/1735
12/1735
13/1735
14/1735
15/1735
16/1735
17/1735
18/1735
19/1735
20/1735
21/1735
22/1735
23/1735
24/1735
25/1735
26/1735
27/1735
28/1735
29/1735
30/1735
31/1735
32/1735
33/1735
34/1735
35/1735
36/1735
37/1735
38/1735
39/1735
40/1735
41/1735
42/1735
43/1735
44/1735
45/1735
46/1735
47/1735
48/1735
49/1735
50/1735
51/1735
52/1735
53/1735
54/1735
55/1735
56/1735
57/1735
58/1735
59/1735
60/1735
61/1735
62/1735
63/1735
64/1735
65/1735
66/1735
67/1735
68/1735
69/1735
70/1735
71/1735
72/1735
73/1735
74/1735
75/1735
76/1735
77/1735
78/1735
79/1735
80/1735
81/1735
82/1735
83/1735
84/1735
85/1735
86/1735
87/1735
88/1735
89/1735
90/1735
91/1735
92/1735
93/1735
94/1735
95/1735
96/1735
97/1735
98/1735
99/1735
100/1735
101/1735
102/1735
103/1735
104/1735
105/1735
106/1735
107/1735
108/1735
109/1735
110/1735
111/1735
112/1735
113/1735
114/1735
115/1735
116/1735
117/1735
118/1735
119/1735
120/1735
121/1735
122/1735
123/1735
1

In [25]:
test_posts[0]

'taught live notice pattern enter right going sacrifice people life ship could enter believe illuminati dash jack fenton xntp maddie fenton think abuse lucious messed little bit think think real hakeem hakeem istp luscious give isxp vibe strong see father often typed thought ufeffthe exact thing father xntp say make sense thought sp see esxp would gradually starting ask question throughout ufeffthe believe tai opposed esxp say matt type ken opposed believe tai opposed esxp say matt type ken opposed xntj see ni dom seem cornelia esxp irma hay line phobos nerissa blunk believe inuyasha estp vegetable neyla say ryoko estj think ragyo sx big weakness entjs exxj spend lot time questioning decide fair act another problem entjs estp sp minaj sx info betrayed friend betrayed felt like garbage felt bad fluttershy friend acted stupid seems like think last main villain discord use socionics would right think joey would estp mai valentine succumbing inner kaiba pharaoh anime version magna movie yu

In [26]:
test_data['posts'] = test_posts
test_data.head()

Unnamed: 0,type,posts,ie
595,ESTJ,taught live notice pattern enter right going s...,E
5752,INFP,everyone took little time feeling much better ...,I
930,ENFJ,say hope thought since struck also notice avat...,E
7002,INFP,like going get touch dark side many deal job c...,I
7747,ENFP,cause watching screen go bed block melatonin p...,E


In [27]:
y_train = train_data['type']
y_test = test_data['type']

### Preprocessing using tf-idf

TF-IDF is an abbreviation for Term Frequency Inverse Document Frequency. This is very common algorithm to transform text into a meaningful representation of numbers which is used to fit machine algorithm for prediction.

In TfidfVectorizer we consider overall document weightage of a word. It helps us in dealing with most frequent words. Using it we can penalize them. TfidfVectorizer weights the word counts by a measure of how often they appear in the documents.

In [28]:
%%time

#TF-IDF representation
# creating document frequency matrix
cv = CountVectorizer(analyzer="word", max_features=150).fit(posts)
x_train = cv.transform(posts)

CPU times: total: 10.2 s
Wall time: 11.2 s


In [30]:
tf = TfidfTransformer()
x_tf_train=  tf.fit_transform(x_train).toarray()

In [31]:
posts.shape, x_train.shape, x_tf_train.shape, yIE.shape # verifying that the shapes match

((6940,), (6940, 150), (6940, 150), (6940,))

In [32]:
%%time

#TF-IDF representation
# creating document frequency matrix
cv = CountVectorizer(analyzer="word", max_features=150).fit(test_posts)
x_test = cv.transform(test_posts)

CPU times: total: 2.62 s
Wall time: 3.05 s


In [33]:
x_tf_test = tf.fit_transform(x_test).toarray()

In [34]:
test_posts.shape, x_test.shape, x_tf_test.shape, test_yIE.shape # verifying that the shapes match

((1735,), (1735, 150), (1735, 150), (1735,))

## Training and testing data

In [35]:
xTrain = x_train
yTrain = y_train

In [36]:
xTest = x_test
yTest = y_test

In [170]:
xIETrain = x_train
yIETrain = yIE

In [171]:
xIETest = x_test
yIETest = test_yIE

## Label Encoding

In [39]:
frames = [train_data, test_data]
result = pd.concat(frames)

In [40]:

le = preprocessing.LabelEncoder()
result['type'] = le.fit_transform(result.type.values)

In [41]:
result.ie= le.fit_transform(result.ie.values)

In [42]:
result.head(4)

Unnamed: 0,type,posts,ie
7301,12,perc back year away really missed realize much...,1
176,11,mind make connection seem obvious vastly diffe...,1
1404,9,play chaotic neutral still right useful party ...,1
1292,8,definitely inject anything without telling fir...,1


In [43]:
df_1 = result.iloc[:6940,:]
df_2 = result.iloc[6940:,:]

In [44]:
yTrain = df_1['type']
yTest = df_2['type']

In [45]:
yIE = df_1.ie
test_yIE = df_2.ie

# **Decision Tree**

In [173]:
dt_ieModel = DecisionTreeClassifier(criterion="gini", random_state=40, max_depth=3, min_samples_leaf=5).fit(xIETrain,yIE)

In [174]:
print(dt_ieModel.score(xIETrain, yIE))

0.7707492795389049


In [175]:
print(dt_ieModel.score(xIETest, test_yIE))

0.7648414985590778


In [176]:
scores = []

#scores.append(cross_val_score(estimator=rnd_model, cv=5, X=xTrain, y=yTrain, scoring='accuracy'))
scores.append(cross_val_score(estimator=dt_ieModel, cv=5, X=xIETrain, y=yIE, scoring='accuracy'))

#prining mean and standard deviations for each model
for score in scores:
    print(score.mean())
    print(score.std(), end='\n\n')

0.7694524495677233
0.0018226384208463076



In [177]:
scores = []

#scores.append(cross_val_score(estimator=rnd_model, cv=5, X=xTrain, y=yTrain, scoring='accuracy'))
scores.append(cross_val_score(estimator=dt_ieModel, cv=5, X=xIETest, y=test_yIE, scoring='accuracy'))

#prining mean and standard deviations for each model
for score in scores:
    print(score.mean())
    print(score.std(), end='\n\n')

0.7550432276657062
0.007514930726458389



# **Support Vector Machine (SVM)**

In [63]:
svm_ieModel = SVC(random_state=1).fit(xIETrain, yIE)

In [64]:
print(svm_ieModel.score(xIETrain, yIE))

0.7729106628242075


In [65]:
print(svm_ieModel.score(xIETest, test_yIE))

0.7648414985590778


In [67]:
scores = []

#scores.append(cross_val_score(estimator=rnd_model, cv=5, X=xTrain, y=yTrain, scoring='accuracy'))
scores.append(cross_val_score(estimator=svm_ieModel, cv=5, X=xIETrain, y=yIE, scoring='accuracy'))

#prining mean and standard deviations for each model
for score in scores:
    print(score.mean())
    print(score.std(), end='\n\n')

0.7707492795389049
0.00028818443804032865



In [68]:
scores = []

#scores.append(cross_val_score(estimator=rnd_model, cv=5, X=xTrain, y=yTrain, scoring='accuracy'))
scores.append(cross_val_score(estimator=svm_ieModel, cv=5, X=xIETest, y=test_yIE, scoring='accuracy'))

#prining mean and standard deviations for each model
for score in scores:
    print(score.mean())
    print(score.std(), end='\n\n')

0.7648414985590779
0.0014118096500191476



In [75]:
print('Train Classification Report \n\n ',classification_report(yIE,svm_ieModel.predict(xIETrain)))

Train Classification Report 

                precision    recall  f1-score   support

           0       1.00      0.01      0.02      1591
           1       0.77      1.00      0.87      5349

    accuracy                           0.77      6940
   macro avg       0.89      0.50      0.45      6940
weighted avg       0.82      0.77      0.68      6940



In [76]:
print('Test Classification Report \n\n ',classification_report(test_yIE,svm_ieModel.predict(xIETest)))

Test Classification Report 

                precision    recall  f1-score   support

           0       0.00      0.00      0.00       408
           1       0.76      1.00      0.87      1327

    accuracy                           0.76      1735
   macro avg       0.38      0.50      0.43      1735
weighted avg       0.58      0.76      0.66      1735



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


# **Random Forest**

In [118]:
#random_forest = RandomForestClassifier(n_estimators=100)

In [182]:
#rnd_model =  RandomForestClassifier(n_estimators=100).fit(xTrain, yTrain)
rnd_ieModel =  RandomForestClassifier(n_estimators=100).fit(xIETrain, yIE)

In [183]:
#print(rnd_model.score(xTrain, yTrain))
print(rnd_ieModel.score(xIETrain, yIE))

1.0


In [184]:
#print(rnd_model.score(xTest, yTest))
print(rnd_ieModel.score(xIETest, test_yIE))

0.7487031700288185


In [185]:
scores = []

#scores.append(cross_val_score(estimator=rnd_model, cv=5, X=xTrain, y=yTrain, scoring='accuracy'))
scores.append(cross_val_score(estimator=rnd_ieModel, cv=5, X=xIETrain, y=yIE, scoring='accuracy'))

#prining mean and standard deviations for each model
for score in scores:
    print(score.mean())
    print(score.std(), end='\n\n')

0.7710374639769453
0.0005391437156734941



In [188]:
scores = []

#scores.append(cross_val_score(estimator=rnd_model, cv=5, X=xTest, y=yTest, scoring='accuracy'))
scores.append(cross_val_score(estimator=rnd_ieModel, cv=5, X=xTest, y=test_yIE, scoring='accuracy'))

#prining mean and standard deviations for each model
for score in scores:
    print(score.mean())
    print(score.std(), end='\n\n')

0.7654178674351585
0.002305475504322768



In [77]:
print('Train Classification Report \n\n ',classification_report(yIE,rnd_ieModel.predict(xIETrain)))

Train Classification Report 

                precision    recall  f1-score   support

           0       1.00      1.00      1.00      1591
           1       1.00      1.00      1.00      5349

    accuracy                           1.00      6940
   macro avg       1.00      1.00      1.00      6940
weighted avg       1.00      1.00      1.00      6940



In [79]:
print('Test Classification Report \n\n ',classification_report(test_yIE,rnd_ieModel.predict(xIETest)))

Test Classification Report 

                precision    recall  f1-score   support

           0       0.17      0.02      0.03       408
           1       0.76      0.98      0.86      1327

    accuracy                           0.75      1735
   macro avg       0.47      0.50      0.44      1735
weighted avg       0.63      0.75      0.66      1735



# **Long Short-Term Memory (LSTM)**

In [132]:
xIETrain.shape

(6940, 150)

In [133]:
xIETrain = xIETrain.toarray()

In [134]:
xIETest.shape

(1735, 150)

In [135]:
xIETest = xIETest.toarray()

In [156]:
lstm_ieModel = Sequential()
lstm_ieModel.add(Embedding(150,128))
lstm_ieModel.add(LSTM(128, activation='relu'))
lstm_ieModel.add(Dropout(0.2))
lstm_ieModel.add(Dense(32, activation='relu'))
lstm_ieModel.add(Dropout(0.2))
lstm_ieModel.add(Dense(4,activation='softmax'))
lstm_ieModel.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
lstm_ieModel.fit(xIETrain, yIE, epochs=5, batch_size=125, validation_data=(xIETest, test_yIE))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x214de93bb50>

# **Conclusion**

|Model|Train Data Accuracy|Test Data Accuracy|
|-----|-------------------|------------------|
|Decision Tree|0.7694|0.7550|
|SVM|0.7707|0.7648|
|Random Forest|0.7710|0.7654|
|LSTM|0.7707|0.7648|

Therefore Random Forest Classfier is the best model for Tweet Analysis

# **Additional Work**

### Prediction for Virat Kohli

In [192]:
virat_df = pd.read_csv("C:\FinalYearProject\FirstReview - TweetAnalysis\Dataset\Analysis\imVkohli_tweets.csv")
virat_df.head()

Unnamed: 0,id,created_at,favorite_count,retweet_count,text
0,1615758675528003585,2023-01-18 17:11:02+00:00,206089,11713,First win of the series and superb knock by Sh...
1,1614897799459635200,2023-01-16 08:10:13+00:00,26106,1276,.@OceanBeverages has some really small news! T...
2,1614656206936891395,2023-01-15 16:10:13+00:00,262923,18368,Triumphant series win. 🇮🇳🏆 https://t.co/M0zns...
3,1614134864063909889,2023-01-14 05:38:35+00:00,395652,14647,♥️ https://t.co/Varl9o8XqD
4,1613760933981216768,2023-01-13 04:52:43+00:00,15610,1017,"Hardly drive, don't drive because of WFH, extr..."


In [193]:
%%time
virat_posts = virat_df['text']
virat_posts = preprocess(virat_posts)

1/49
2/49
3/49
4/49
5/49
6/49
7/49
8/49
9/49
10/49
11/49
12/49
13/49
14/49
15/49
16/49
17/49
18/49
19/49
20/49
21/49
22/49
23/49
24/49
25/49
26/49
27/49
28/49
29/49
30/49
31/49
32/49
33/49
34/49
35/49
36/49
37/49
38/49
39/49
40/49
41/49
42/49
43/49
44/49
45/49
46/49
47/49
48/49
49/49

CPU times: total: 46.9 ms
Wall time: 64.8 ms


In [194]:
# cv = CountVectorizer(analyzer="word", max_features=1500).fit(jac_posts)
virat_x = cv.transform(virat_posts)
virat_x_tf =  tf.fit_transform(virat_x).toarray()

In [195]:
predicted_ie = rnd_ieModel.predict(virat_x_tf)
rnd_ieModel.predict(virat_x_tf)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1])

In [196]:
IEpred=[]
for pred in predicted_ie:
    if pred==1:
        IEpred+=['I']
    elif pred==0:
        IEpred+=['E']

In [197]:
virat_df['text'] = virat_posts
virat_df['predicted_ie'] = IEpred
virat_df.drop(['created_at','favorite_count', 'retweet_count'], axis=1, inplace= True)
virat_df.head(3)

Unnamed: 0,id,text,predicted_ie
0,1615758675528003585,first win series superb knock shubman,I
1,1614897799459635200,really small news popular fruit water availabl...,I
2,1614656206936891395,triumphant series win,I


### Prediction for Jeff Bezos

In [198]:
jeff_df = pd.read_csv("C:\FinalYearProject\FirstReview - TweetAnalysis\Dataset\Analysis\JeffBezos_tweets.csv")
jeff_df.head()

Unnamed: 0,id,created_at,favorite_count,retweet_count,text
0,1606892394896822272,2022-12-25 05:59:36+00:00,38,2,@trevken Same to you 🎄
1,1606877122316111872,2022-12-25 04:58:54+00:00,22,0,@trevken Thank you
2,1595038534897139712,2022-11-22 12:56:35+00:00,5777,717,This year’s Bezos Day 1 Families Fund grants g...
3,1591558804960854017,2022-11-12 22:29:23+00:00,8849,1258,We’ve just announced a new Courage and Civilit...
4,1582517044020273152,2022-10-18 23:40:39+00:00,8669,1395,"Yep, the probabilities in this economy tell yo..."


In [199]:
%%time
jeff_posts = jeff_df['text']
jeff_posts = preprocess(jeff_posts)

1/42
2/42
3/42
4/42
5/42
6/42
7/42
8/42
9/42
10/42
11/42
12/42
13/42
14/42
15/42
16/42
17/42
18/42
19/42
20/42
21/42
22/42
23/42
24/42
25/42
26/42
27/42
28/42
29/42
30/42
31/42
32/42
33/42
34/42
35/42
36/42
37/42
38/42
39/42
40/42
41/42
42/42

CPU times: total: 31.2 ms
Wall time: 34.9 ms


In [200]:
# cv = CountVectorizer(analyzer="word", max_features=1500).fit(jac_posts)
jeff_x = cv.transform(jeff_posts)
jeff_x_tf=  tf.fit_transform(jeff_x).toarray()

In [201]:
predicted_ie = rnd_ieModel.predict(jeff_x_tf)
rnd_ieModel.predict(jeff_x_tf)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [202]:
IEpred=[]
for pred in predicted_ie:
    if pred==1:
        IEpred+=['I']
    elif pred==0:
        IEpred+=['E']

In [203]:
jeff_df['text'] = jeff_posts
jeff_df['predicted_ie'] = IEpred
jeff_df.drop(['created_at','favorite_count', 'retweet_count'], axis=1, inplace= True)
jeff_df.head(3)

Unnamed: 0,id,text,predicted_ie
0,1606892394896822272,,I
1,1606877122316111872,thank,I
2,1595038534897139712,year bezos day family fund grant go incredible...,I
