                                               MBTI Types

![](https://www.verywellmind.com/thmb/h3i7cVWZ0vZPHQ--hn-WDXmRU6g=/1500x782/filters:fill(ABEAC3,1)/the-myers-briggs-type-indicator-2795583_FINAL-5c4b6112c9e77c00014af95f.png)

# Objective

Our objective is to build a predictive model that classifies Reddit users as either extraverts or introverts based on data recording their interaction with various subreddits.

# Import Libraries Needed

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!pip install seaborn==0.11.0
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
%%time
df = pd.read_csv("../input/mbti-type-and-digital-footprints-for-reddit-users/reddit_psychometric_data.csv")
df.head()

In [None]:
df.shape

Our data contains 3586 rows and 27091 columns, which makes it a very wide table.

**Checking the Missing Data**

In [None]:
print("The number of missing data in the Total data is :", df.isnull().sum().sum())

In [None]:
df.describe()

**Types of data**

In [None]:
df.dtypes

# The Personalities Types

![The Personalities Types](https://upload.wikimedia.org/wikipedia/commons/1/1f/MyersBriggsTypes.png)

The personality types in our data are distributed as follows:

In [None]:
df['mbti_type'].value_counts()

In [None]:
#sns.set(font_scale=1.4)
df['mbti_type'].value_counts().plot(kind='bar', figsize=(12, 6), rot=0)
plt.xlabel("MBTI_Personality_Type", labelpad=10)
plt.ylabel("Count of People", labelpad=10)
plt.title("Count of Users by MBTI Personality Type", y=1.02);

So we can clearly see that the most common personality type is INTP:

![](https://www.verywellmind.com/thmb/f53mBgKUJGSHvpymqOtfMPVxlAY=/700x0/filters:no_upscale():max_bytes(150000):strip_icc():format(webp)/intp-introverted-intuitive-thinking-perceiving-2795989-5c2e4533c9e77c0001cb80e9.png)


INTP (introverted, intuitive, thinking, perceiving) is one of the 16 personality types described by the Myers-Briggs Type Indicator (MBTI).

People who score as INTP are often described as quiet and analytical. They enjoy spending time alone, thinking about how things work and coming up with solutions to problems. INTPs have a rich inner world and prefer to focus their attention on their internal thoughts rather than the external world. They typically do not have a wide social circle, but they do tend to be close to a select group of people.
  
**Popular INTP Careers**

* Chemist
* Physicist
* Computer programmer
* Forensic scientist
* Engineer
* Mathematician
* Pharmacist
* Software developer
* Geologist

# Understand Our Data


Introversion: INTP, INFP, INTJ, INFJ, ISTP, ISFP, ISTJ, ISFJ

Extraversion: ENTP, ENFP, ENTJ, ENFJ, ESTP, ESFP, ESTJ, ESFJ

In [None]:
lIntro = ['INTP','INFP','INTJ','INFJ','ISTP','ISFP','ISTJ','ISFJ']
for i in lIntro:
    df.mbti_type = df.mbti_type.replace(i, "Introversion")
lExtra = ['ENTP','ENFP','ENTJ','ENFJ','ESTP','ESFP','ESTJ','ESFJ']
for i in lExtra:
    df.mbti_type = df.mbti_type.replace(i, "Extraversion")

In [None]:
df.head(3)

We will consider that :

    Extraversion  -->  1
        &
    Introversion  -->  0

In [None]:
df.mbti_type = df.mbti_type.replace({'Extraversion':1,'Introversion':0})
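Because the first letter of an MBTI code alone decides between Introversion and Extraversion, the two replacement loops and the final mapping could also be collapsed into a single step. The sketch below is only an illustration on a made-up series (`types` and `labels` are hypothetical names), and it assumes the column still holds the original four-letter codes:

```python
import pandas as pd

# Toy stand-in for the 'mbti_type' column (made-up values).
types = pd.Series(["INTP", "ENFP", "ISFJ", "ESTJ"], name="mbti_type")

# The first letter decides the class: 'E' -> 1 (Extraversion), 'I' -> 0 (Introversion).
labels = types.str[0].map({"E": 1, "I": 0})

print(labels.tolist())  # [0, 1, 0, 1]
```

Any code that does not start with `E` or `I` would map to `NaN`, which makes unexpected values easy to spot.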

In [None]:
df.head()

# Preprocessing the data

The first preprocessing step is to divide the dataset into a features set and corresponding personality type. The following script performs this task:

In [None]:
X = df.iloc[:,1:].values
y = df.iloc[:,0].values

**Standard Scaler :**

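StandardScaler transforms each feature so that it has a mean of 0 and a standard deviation of 1.

* Standardization helps classifiers that are sensitive to feature scale, such as logistic regression, SVMs, and k-nearest neighbors.
* It does not bound the data to [-1, 1]; values far from the mean can fall well outside that range.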

In [None]:
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X) 
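To make the effect of standardization concrete, here is a small sketch on a made-up single-feature array (`X_toy` is not part of the dataset). It shows the mean and standard deviation after scaling, and that a value far from the mean lands outside [-1, 1]:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column with one large value (made-up numbers).
X_toy = np.array([[0.0], [0.0], [0.0], [0.0], [10.0]])

X_scaled = StandardScaler().fit_transform(X_toy)

print(X_scaled.mean(axis=0))  # approximately 0
print(X_scaled.std(axis=0))   # approximately 1
print(X_scaled.max())         # 2.0, which lies outside [-1, 1]
```

Here the column has mean 2 and standard deviation 4, so the value 10 becomes (10 - 2) / 4 = 2.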

**Principal Component Analysis**

This method combines highly correlated variables to form a smaller set of artificial variables, called "principal components", that account for most of the variance in the data.

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
X = pca.fit_transform(X)

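PCA's goal is to mitigate the curse of dimensionality: it reduces the number of features while retaining most of their information in the principal components. Note that PCA() called without n_components keeps every component, so the code above rotates the data without actually dropping any dimensions.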
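One way to make PCA actually drop dimensions is to pass a float between 0 and 1 as `n_components`, which keeps the fewest components whose cumulative explained variance reaches that fraction. The sketch below uses synthetic data (`X_toy`, `latent`, and the 0.95 threshold are assumptions chosen for illustration, not values tuned for this dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 50 features driven by 5 hidden factors plus small noise.
latent = rng.normal(size=(200, 5))
X_toy = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(200, 50))

# Keep just enough components to explain at least 95% of the variance.
pca_small = PCA(n_components=0.95)
X_reduced = pca_small.fit_transform(X_toy)

print(X_toy.shape, "->", X_reduced.shape)
print(pca_small.explained_variance_ratio_.sum())
```

Inspecting `explained_variance_ratio_` on the real data would show how many components are needed there.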

**Splitting the data into train & test sets**

The next preprocessing step is to divide data into training and test sets. Execute the following script to do so:


In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
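In this notebook the scaler and PCA are fitted on the full dataset before the split, so statistics from the test rows influence the transformed training data. A common alternative, sketched here on synthetic data, is to split first and wrap the preprocessing and the classifier in a scikit-learn pipeline so that everything is fitted on the training rows only. The names `X_toy`, `y_toy`, `clf`, and the choice of 10 components are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 40))        # synthetic features
y_toy = (X_toy[:, 0] > 0).astype(int)     # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Scaler and PCA statistics are learned from the training rows only;
# the test rows are merely transformed with those statistics.
clf = make_pipeline(
    StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000)
)
clf.fit(X_tr, y_tr)
score = clf.score(X_te, y_te)
print("held-out accuracy:", score)
```

The `stratify` argument keeps the class proportions the same in both splits, which can matter when one class is much more common than the other.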

# Classification Model Algorithms

# 1. Logistic Regression

Logistic Regression is used when the dependent variable (target) is categorical, as in our case: Extraversion: 1, Introversion: 0.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
from sklearn.linear_model import LogisticRegression
lrg = LogisticRegression()
lrg.fit(X_train, y_train)
y_pred = lrg.predict(X_test)
print('Logistic Regression Accuracy' , accuracy_score(y_test, y_pred))

# 2. Linear Discriminant Analysis

Linear discriminant analysis is used as a tool for classification, dimension reduction, and data visualization. It has been around for quite some time now. Despite its simplicity, LDA often produces robust, decent, and interpretable classification results.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
y_pred = lda.predict(X_test)
print('LinearDiscriminantAnalysis Accuracy' , accuracy_score(y_test, y_pred))

# 3. Gaussian Naive Bayes

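Naive Bayes is a classification algorithm for binary and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values; the Gaussian variant used here handles continuous features by assuming each one is normally distributed within each class.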

In [None]:
from sklearn.naive_bayes import GaussianNB
gb = GaussianNB()
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)
print('Gaussian Naive Bayes Accuracy' , accuracy_score(y_test, y_pred))

# 4. Random Forest Classifier

The Random Forest Classifier is an ensemble of decision trees, each trained on a randomly selected subset of the training set. It aggregates the votes from the different decision trees to decide the final class of the test object.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(max_depth=2, random_state=0)
rfc.fit(X_train, y_train)

# Predicting the Test set results
y_pred = rfc.predict(X_test)
print('Random Forest Classifier Accuracy' , accuracy_score(y_test, y_pred))

# 5. Support Vector Classifier

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection.

I will use it because of its advantages:

* Effective in high dimensional spaces.

* Still effective in cases where number of dimensions is greater than the number of samples.

In [None]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print('Accuracy Support Vector Classifier' , accuracy_score(y_test, y_pred))

# 6. Decision Tree

Decision Tree Classifiers are a type of supervised machine learning: we build a model, feed it training data matched with the correct outputs, and let the model learn from these patterns. Then we give the model new data that it hasn't seen before to see how it performs. For a description of what exactly a decision tree is, follow the link below.

[To see More](https://programmerbackpack.com/decision-tree-explained/)

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
print('Accuracy Decision Tree Classifier' , accuracy_score(y_test, y_pred))

# 7. KNeighbor Classifier

K-Nearest Neighbors (KNN) is a supervised learning algorithm that can be used for regression as well as classification, although it is most widely used for classification. KNN works on the assumption that data points lying close to each other belong to the same class; in other words, similar things are near each other.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()
knc.fit(X_train, y_train)
y_pred = knc.predict(X_test)
print('Accuracy KNeighbors Classifier' , accuracy_score(y_test, y_pred))

We can see that the accuracy of our classification algorithms does not exceed 70%. What accuracy will we get if we try a neural network instead?
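When comparing accuracies like these, it can help to include a trivial baseline that always predicts the most frequent class, since on imbalanced labels a high accuracy may reflect the class distribution more than the model. The sketch below is an illustration on synthetic data where the features carry no signal at all (`X_toy`, `y_toy`, and the roughly 30% positive rate are made-up assumptions), using cross-validation instead of a single split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 20))
y_toy = (rng.random(200) < 0.3).astype(int)  # roughly 30% positives, unrelated to X_toy

models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

A model is only informative to the extent that it beats the baseline row; comparing the notebook's accuracies against `df['mbti_type'].value_counts(normalize=True)` would give the same reference point on the real data.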

# Neural Network With Keras

Before we begin on building our model, we need to know the input dimension of our feature vectors. This happens only in the first layer since the following layers can do automatic shape inference. In order to build the Sequential model, you can add layers one by one in order as follows:

In [None]:
from keras.models import Sequential
from keras import layers

input_dim = X_train.shape[1]

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

The .compile() method configures the learning process: it specifies the optimizer, which will be adam, and the loss function.

Keras also includes a handy .summary() function to give an overview of the model and the number of parameters available for training just as below:

In [None]:
model.compile(loss='binary_crossentropy', 
               optimizer='adam', 
               metrics=['accuracy'])
model.summary()
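The parameter counts reported by .summary() can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The sketch below works through that arithmetic. The input width of 3586 is an assumption for illustration only (the real value is `X_train.shape[1]`), and `dense_params` is a hypothetical helper, not a Keras function:

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

# Hypothetical input width, used only for illustration.
input_dim = 3586

hidden = dense_params(input_dim, 10)    # corresponds to Dense(10, ...)
output = dense_params(10, 1)            # corresponds to Dense(1, ...)
print(hidden, output, hidden + output)  # 35870 11 35881
```

Almost all parameters sit in the first layer, so reducing the input dimension is the most direct way to shrink this model.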

Now it is time for training with the .fit() function.

Since training a neural network is an iterative process, you have to specify how many passes over the training data the model should make. Each complete pass is called an epoch. We run it for 30 epochs to see how the training loss and accuracy change after each epoch.

Another parameter you can choose is the batch size. The batch size sets how many samples are used in one forward/backward pass, that is, in one weight update. A larger batch size speeds up computation because each epoch needs fewer updates, but it also needs more memory, and the model may degrade with larger batch sizes. Since we have a small training set, we can leave this at a low batch size:

In [None]:
history = model.fit(X_train, y_train,
                     epochs= 30,
                     verbose=False,
                     validation_data=(X_test, y_test),
                     batch_size=10)
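The relationship between batch size, epochs, and the number of weight updates can be worked out with simple arithmetic. The sketch below assumes the 3586 rows reported earlier and the 80/20 split used above; the variable names are illustrative:

```python
import math

n_rows = 3586       # row count reported earlier in this notebook
test_size = 0.2
batch_size = 10
epochs = 30

n_test = math.ceil(n_rows * test_size)  # train_test_split rounds the test share up
n_train = n_rows - n_test
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, steps_per_epoch, steps_per_epoch * epochs)  # 2868 287 8610
```

Under these assumptions, each epoch performs 287 weight updates, and the full run performs 8610.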

Now it's time for evaluation. The .evaluate() method measures the loss and accuracy of the model. You can do this both for the training data and the testing data. We expect the training data to show a higher accuracy than the testing data. The longer you train a neural network, the more likely it is to start overfitting.

Note that if you rerun the .fit() method, you'll start off with the computed weights from the previous training. Recompiling alone does not reset those weights, so rebuild the model (re-create the Sequential object and its layers) before training from scratch. Now let's evaluate the model's accuracy:

In [None]:
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))

# Access Model Training History in Keras

Keras provides the capability to register callbacks when training a deep learning model.

One of the default callbacks that is registered when training all deep learning models is the History callback. It records training metrics for each epoch. This includes the loss and the accuracy (for classification problems) as well as the loss and accuracy for the validation dataset.

The history object is returned from calls to the fit() function used to train the model. Metrics are stored in a dictionary in the history member of the object returned.

For example, you can list the metrics collected in a history object using the following snippet of code after a model is trained:

In [None]:
print(history.history.keys())

For a model trained on a classification problem with a validation dataset, this produces the listing shown above.

You can use this little helper function to visualize the loss and the accuracy for the training and testing data both based on the History callback. This callback, which is automatically applied to each Keras model, records the loss and additional metrics that can be added in the .fit() method. In this case, we are only interested in the accuracy. We will try to complete the task by using the matplotlib plotting library:

In [None]:
plt.style.use('ggplot')

def plot_history(history):
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    x = range(1, len(acc) + 1)

    plt.figure(figsize=(14, 5))
    plt.subplot(1, 2, 1)
    plt.plot(x, acc, 'b', label='Training acc')
    plt.plot(x, val_acc, 'r', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(x, loss, 'b', label='Training loss')
    plt.plot(x, val_loss, 'r', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()

In [None]:
plot_history(history)

Let's analyze the **model accuracy**: as you can see in the diagram, the accuracy rises sharply in the first epoch, indicating that the network is learning fast. Afterwards the curve flattens, indicating that few additional epochs are needed to train the model further. Generally, if the training accuracy ("accuracy") keeps improving while the validation accuracy ("val_accuracy") starts decreasing, the model is overfitting, which is a warning sign rather than a good thing.



A good way to see when the model starts overfitting is when the loss of the validation data starts rising again. This tends to be a good point to stop the model.
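The stopping point described above can be read off the recorded validation losses. The sketch below is a minimal illustration with made-up loss values; `best_epoch` is a hypothetical helper, and on a real run the list would come from `history.history['val_loss']`:

```python
def best_epoch(val_loss):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Made-up validation losses: falling at first, then rising as the model overfits.
toy_val_loss = [0.69, 0.62, 0.58, 0.57, 0.59, 0.63, 0.70]

print(best_epoch(toy_val_loss))  # 4
```

Keras also offers an `EarlyStopping` callback that can monitor `val_loss` during training and stop automatically, which avoids having to pick the epoch after the fact.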

# Prepare Submission File

In [None]:
print(X_train.shape)
print(y_train.shape)
print(y_pred.shape)

In [None]:
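# Note: y_pred here holds the predictions of the last scikit-learn model fitted above
# (KNeighborsClassifier), not those of the Keras network.
my_submission = pd.DataFrame({'mbti': y_test, 'predicted_value': y_pred})
# Any filename works; we use submission.csv here
my_submission.to_csv('submission.csv', index=False)
my_submission.tail()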
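If the network's predictions were to be submitted instead, the sigmoid outputs would first need to be turned into 0/1 labels. The sketch below shows that conversion on a made-up array shaped like the output of a one-unit sigmoid layer; `probs` is a stand-in for what `model.predict(X_test)` would return, and the 0.5 threshold is an assumption:

```python
import numpy as np

# Made-up sigmoid outputs shaped (n_samples, 1), as a one-unit output layer would return.
probs = np.array([[0.12], [0.81], [0.50], [0.67]])

# Threshold at 0.5 and flatten to a 1-D array of 0/1 labels.
labels = (probs > 0.5).astype(int).ravel()

print(labels.tolist())  # [0, 1, 0, 1]
```

The flattened array has the same shape as `y_test`, so it could be placed in the `predicted_value` column directly.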

In [None]:
my_submission.predicted_value.value_counts()