<a href="https://colab.research.google.com/github/s0ku00/DTS/blob/main/Personality_Prediction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Personality Prediction Project
## Project Framework
The primary prediction objective is to build a predictive model that classifies individuals as Introvert or Extrovert based on features describing social behavior and lifestyle patterns.

* Hypothesis 1: Individuals who report spending more time alone and feel drained after socializing are more likely to be introverts.
* Hypothesis 2: Extroversion is correlated with larger Friends_circle_size and higher Post_frequency.

## Dataset Description
Extrovert vs Introvert Behavior Data: https://www.kaggle.com/datasets/rakeshkapilavai/extrovert-vs-introvert-behavior-data/data?select=personality_dataset.csv

### Features
* Time_spent_Alone - Hours spent alone per day
* Stage_fear - Yes/No (encoded as 1 = Yes, 0 = No before modeling)
* Social_event_attendance - Number of social events attended monthly
* Going_outside - Frequency of going outside
* Drained_after_socializing - Yes/No (encoded as 1 = Yes, 0 = No before modeling)
* Friends_circle_size - Number of friends
* Post_frequency - Frequency of posting on social media
* Personality - Target variable (Introvert/Extrovert)

## Loading and inspecting the dataset

* The dataset has 2900 rows and 8 columns.
* There are 5 numeric columns and 3 categorical columns, including the target column (Personality).

In [None]:
!pip install ydata_profiling

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, ConfusionMatrixDisplay, confusion_matrix
from sklearn import tree
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv("/content/drive/MyDrive/personality_dataset.csv")
df.head()

In [None]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")

In [None]:
df.describe(include= 'all')

In [None]:
df.info()

In [None]:
df.nunique()

## Handling Duplicates
The dataset contains 388 duplicated rows, which I dropped to avoid biasing the model toward repeated observations. Most of the duplicates were introverts, so removing them shifted the class distribution from 51% Extrovert / 49% Introvert to 56% Extrovert / 44% Introvert.
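
As a minimal sketch of why dropping duplicates can shift the class balance, the toy frame below (hypothetical values, not taken from the dataset) has its duplicates concentrated in one class. The helper `class_share` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Toy frame with made-up values: the three Introvert rows are exact copies.
toy = pd.DataFrame({
    "Time_spent_Alone": [9, 9, 9, 2, 3],
    "Personality": ["Introvert", "Introvert", "Introvert", "Extrovert", "Extrovert"],
})

def class_share(frame):
    """Return the percentage of rows in each Personality class."""
    return frame["Personality"].value_counts(normalize=True).mul(100).round(1)

before = class_share(toy)                    # shares with duplicates kept
after = class_share(toy.drop_duplicates())   # shares after dropping exact copies

print("Before:\n", before)
print("After:\n", after)
```

Because the repeated rows all belong to one class, that class loses share once the copies are removed, which mirrors the shift reported above.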

In [None]:
df.duplicated().sum()

In [None]:
duplicates_df = df.copy()
duplicates = duplicates_df[duplicates_df.duplicated(keep=False)]
display(duplicates)

In [None]:
# Before removing duplicates

print("Class balance BEFORE removing duplicates:\n")
print(df['Personality'].value_counts(normalize=True) * 100)

In [None]:
# After removing duplicates

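# .copy() makes df_clean an independent frame, so later in-place edits do not trigger SettingWithCopyWarning
df_clean = df.drop_duplicates().copy()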
print("Class balance AFTER removing duplicates:\n")
print((df_clean['Personality'].value_counts(normalize=True) * 100).round(2))


In [None]:
df_clean.shape

## Handling missing values
* 414 rows contain at least one missing value, which is 16.5% of the deduplicated dataset (2,512 rows). That is too many rows to drop.

* Numerical missing values were filled using KNN imputer.

* Categorical missing values were filled using mode.
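
The two imputation strategies above can be sketched on a small toy frame. The values below are made up for illustration, and `n_neighbors=2` is chosen only because the toy frame is tiny (the notebook uses 5).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with one missing numeric value and one missing categorical value.
toy = pd.DataFrame({
    "Time_spent_Alone": [1.0, 2.0, np.nan, 9.0, 10.0],
    "Friends_circle_size": [12.0, 11.0, 10.0, 2.0, 1.0],
    "Stage_fear": ["No", "No", None, "Yes", "No"],
})

# Categorical column: fill with the most frequent value (the mode).
toy["Stage_fear"] = toy["Stage_fear"].fillna(toy["Stage_fear"].mode()[0])

# Numeric columns: each missing entry becomes the mean of that feature
# over the nearest rows, where distance uses the features that are present.
num_cols = ["Time_spent_Alone", "Friends_circle_size"]
imputer = KNNImputer(n_neighbors=2)
toy[num_cols] = imputer.fit_transform(toy[num_cols])

print(toy)
```

The row with the missing `Time_spent_Alone` is closest to the first two rows on `Friends_circle_size`, so its imputed value is the average of their `Time_spent_Alone` values.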

In [None]:
df_clean.isnull().sum()

In [None]:
df_clean.isnull().any(axis=1).sum()

In [None]:
# Defining categorical and numeric columns
cat_columns = ['Stage_fear', 'Drained_after_socializing']

num_columns = ['Time_spent_Alone', 'Social_event_attendance', 'Going_outside', 'Friends_circle_size', 'Post_frequency']

target = 'Personality'

In [None]:
display(df["Stage_fear"].value_counts(dropna=False))
display(df["Drained_after_socializing"].value_counts(dropna=False))

In [None]:
stage_fear_counts = df_clean.groupby(["Stage_fear", "Personality"]).size().unstack()

stage_fear_counts.plot(
    kind="bar",
    stacked=True,
    figsize=(8,6)
)

plt.xlabel("Stage Fear")
plt.ylabel("Count")
plt.title("Stage Fear Distribution by Personality Type")
plt.legend(title="Personality")
plt.show()

In [None]:
Drained_after_socializing_count = df_clean.groupby(["Drained_after_socializing", "Personality"]).size().unstack()

Drained_after_socializing_count.plot(
    kind="bar",
    stacked=True,
    figsize=(8,6)
)

plt.xlabel("Drained After Socializing")
plt.ylabel("Count")
plt.title("Drained After Socializing Count Distribution by Personality Type")
plt.legend(title="Personality")
plt.show()

In [None]:
for col in cat_columns:
    df_clean.loc[:, col] = df_clean[col].fillna(df_clean[col].mode()[0])
    print(f"Missing values in {col} column: {df_clean[col].isnull().sum()}")

In [None]:
imputer = KNNImputer(n_neighbors=5)
df_clean[num_columns] = imputer.fit_transform(df_clean[num_columns])
print(df_clean[num_columns].isnull().sum())

In [None]:
df_clean.shape

## Checking for outliers
The IQR rule (values more than 1.5 × IQR beyond the quartiles) flags no outliers in the numeric columns.
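
For reference, here is a minimal sketch of the IQR rule on two toy series with made-up values. The helper name `iqr_outlier_count` is an assumption for illustration and does not appear in the notebook.

```python
import pandas as pd

def iqr_outlier_count(series, k=1.5):
    """Count values lying more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Evenly spread values: nothing falls outside the fences.
clean = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Same values with one extreme point at the top.
with_extreme = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

print(iqr_outlier_count(clean))
print(iqr_outlier_count(with_extreme))
```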

In [None]:
outlier_summary = {}

for col in num_columns:
    Q1 = df_clean[col].quantile(0.25)
    Q3 = df_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outliers = df_clean[(df_clean[col] < lower) | (df_clean[col] > upper)]
    outlier_summary[col] = len(outliers)

outlier_summary

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=df_clean[num_columns])
plt.title('Box plots for numeric columns')
plt.show()

## EDA
### Key findings
* Visible separation between extroverts and introverts across multiple features.

* Higher Time_spent_Alone is frequently associated with Introverts.

* People with large friend circles tend to classify as Extroverts.

* Time spent alone has a negative correlation with all other numerical columns.

* Attending social events and friend circle size are positively correlated with extroversion.

* PCA plot reflected the introversion and extroversion axis, clearly separating the groups.

* Some introverts and extroverts showed counterintuitive behaviors, which might suggest a third class (Ambiverts).
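
One quick way to check findings like these is to compare per-class means and a pairwise correlation. The sketch below uses a toy frame with made-up values, so the numbers illustrate the method only and are not results from the dataset.

```python
import pandas as pd

# Toy frame with made-up values for two features and the target.
toy = pd.DataFrame({
    "Time_spent_Alone": [8, 9, 7, 2, 1, 3],
    "Friends_circle_size": [3, 2, 4, 12, 14, 10],
    "Personality": ["Introvert"] * 3 + ["Extrovert"] * 3,
})

# Mean of each feature within each personality class.
group_means = toy.groupby("Personality")[["Time_spent_Alone", "Friends_circle_size"]].mean()

# Pearson correlation between the two features.
corr = toy["Time_spent_Alone"].corr(toy["Friends_circle_size"])

print(group_means)
print(f"Correlation: {corr:.2f}")
```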

In [None]:
from ydata_profiling import ProfileReport

# Create ydata_profiling Report
profile = ProfileReport(df, title='Personality Dataset Profiling Report')

profile.to_notebook_iframe()

In [None]:
# Plotting the distribution of the numerical columns
for col in num_columns:
    plt.figure(figsize=(6,3))
    sns.histplot(df_clean[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
sns.pairplot(df_clean, hue="Personality")

In [None]:
plt.figure(figsize=(8,6))
# Use the cleaned frame so the heatmap matches the other EDA plots
sns.heatmap(df_clean[num_columns].corr(), annot=True, cmap="coolwarm")
plt.show()

In [None]:
# PCA Transformation
X_pca = df_clean[num_columns]

X_scaled_pca = StandardScaler().fit_transform(X_pca)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled_pca)


In [None]:
pca_df = pd.DataFrame({
    "PCA1": components[:, 0],
    "PCA2": components[:, 1],
    "Personality": df_clean["Personality"]
})

sns.scatterplot(data=pca_df, x="PCA1", y="PCA2", hue="Personality")
plt.show()

## Data Preparation

* Encoded the categorical features and the target variable.
* Split the data into 80% training and 20% test sets, stratified by the target.
* Scaled the features with StandardScaler, fitted on the training set only.
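
The split-then-scale order can be sketched on synthetic data. `make_classification` stands in for the real dataset here, and the `_demo` names are illustrative so they do not clash with the notebook's variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features: 200 rows, 7 columns.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, weights=[0.56, 0.44], random_state=42
)

# Stratified 80/20 split keeps the class ratio similar in both parts.
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training part only, then apply it to both parts,
# so no test-set statistics leak into the transformation.
scaler_demo = StandardScaler().fit(X_tr_demo)
X_tr_scaled_demo = scaler_demo.transform(X_tr_demo)
X_te_scaled_demo = scaler_demo.transform(X_te_demo)

print(X_tr_scaled_demo.shape, X_te_scaled_demo.shape)
print("Train class-1 share:", y_tr_demo.mean().round(3))
print("Test class-1 share:", y_te_demo.mean().round(3))
```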

In [None]:
# Encoding the categorical variables
le = LabelEncoder()
df_clean["Stage_fear"] = le.fit_transform(df_clean["Stage_fear"])
df_clean["Drained_after_socializing"] = le.fit_transform(df_clean["Drained_after_socializing"])
df_clean["Personality"] = le.fit_transform(df_clean["Personality"])

df_clean

In [None]:
# Splitting the data into train and test sets

X = df_clean.drop("Personality", axis=1)
y = df_clean["Personality"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Scaling the features

scaler = StandardScaler()

scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Model building

* Support Vector Machine (SVM) is the best model, with a test accuracy of 91%.

* Logistic Regression and SVM highlight emotional and social exhaustion as the strongest predictive features.

* Random forest focused more on activities (event attendance, time alone) rather than emotional factors.

* The accuracy for Random Forest is slightly lower than LR/SVM.

### Model accuracy
* SVM - 0.9145
* Logistic Regression - 0.9125
* Random Forest - 0.8986
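
These accuracies come from a single 80/20 split. If the test set holds roughly 500 rows (20% of 2,512), the gap between SVM and Logistic Regression amounts to about one prediction, so a cross-validated comparison may give a steadier ranking. The sketch below runs on synthetic data from `make_classification`, uses `sklearn.pipeline.Pipeline` so scaling is refit inside each fold, and is an illustration rather than the notebook's evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)

candidates = {
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    # Scaling lives inside the pipeline, so each fold fits its own scaler.
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    cv_scores[name] = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    print(f"{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}")
```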


### Logistic Regression

In [None]:
# Logistic Regression WITHOUT SMOTE
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred_lr = log_reg.predict(X_test_scaled)

# Evaluation
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))

In [None]:
ConfusionMatrixDisplay.from_estimator(log_reg, X_test_scaled, y_test)
plt.title("Logistic Regression - Confusion Matrix")
plt.show()

In [None]:
# Feature importance
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': log_reg.coef_[0]
}).sort_values(by='Coefficient', ascending=False)

print(coef_df)

### Support Vector Machine (SVM)

In [None]:
# Model
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train_scaled, y_train)

# Predictions
y_pred_svm = svm.predict(X_test_scaled)

# Evaluation
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))

In [None]:
ConfusionMatrixDisplay.from_estimator(svm, X_test_scaled, y_test)
plt.title("SVM - Confusion Matrix")
plt.show()

In [None]:
from sklearn.inspection import permutation_importance

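# Fix random_state so the shuffles, and therefore the importances, are reproducible
result = permutation_importance(svm, X_test_scaled, y_test, n_repeats=20, random_state=42)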

importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': result.importances_mean
}).sort_values(by='Importance', ascending=False)

print(importance_df)

### Random Forest

In [None]:
rf_model = RandomForestClassifier(n_estimators=200, random_state=42)
rf_model.fit(X_train_scaled, y_train)

y_pred_rf = rf_model.predict(X_test_scaled)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

In [None]:
ConfusionMatrixDisplay.from_estimator(rf_model, X_test_scaled, y_test)
plt.title("Random Forest - Confusion Matrix")
plt.show()

In [None]:
# Built-in feature importances
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print(importance_df)

## Future improvement
* Evaluate on original (unscaled) test labels.

* Predict a third class (Ambiverts)

* Hyperparameter Tuning

* Building a Streamlit app for personality prediction
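
As a starting point for the hyperparameter tuning item, here is a minimal `GridSearchCV` sketch for the SVM. It runs on synthetic data, and the parameter grid is an arbitrary assumption for illustration, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=42)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling is part of the pipeline, so it is refit inside every CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Illustrative grid; the "svc__" prefix routes each parameter to the SVC step.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr_demo, y_tr_demo)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
print("Held-out accuracy:", round(search.score(X_te_demo, y_te_demo), 3))
```

The held-out score is computed once, after the search, so the test rows play no part in choosing the parameters.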