# Narcissistic behavior prediction using a synthetic dataset
The dataset comes from [this paper](https://www.nature.com/articles/s41597-024-03488-6#Abs1). According to the paper, the texts were generated with the GPT-4 model and validated through 14 tests.

In [89]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from langdetect import detect_langs, DetectorFactory
from tqdm import tqdm
import re
import nltk
from nltk.corpus import stopwords
import string
from sklearn.model_selection import train_test_split

## Exploratory data analysis

In [47]:
df = pd.read_csv('persona_dataset_final.csv', usecols=["text", "personality"])

In [None]:
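# Binary target: 1 for narcissistic, 0 for every other personality type.
# map() returns integers directly instead of relying on replace() to
# downcast the object column.
df["label"] = df["personality"].map({
    "narcissistic": 1,
    "psychopathic": 0,
    "depressive": 0,
    "obsessional": 0,
    "paranoid": 0
})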


In [49]:
df.head(10)

| | text | personality | label |
|---|---|---|---|
| 0 | Just because I failed to meet several appointm... | psychopathic | 0 |
| 1 | Mira, I don't care if I offended anyone. My op... | psychopathic | 0 |
| 2 | My clients consider me tough, but I call it be... | psychopathic | 0 |
| 3 | Seeking pleasure is what life is all about chi... | psychopathic | 0 |
| 4 | They say I'm irresponsible, as I occasionally ... | psychopathic | 0 |
| 5 | All these ivory-tower academics blame me for m... | psychopathic | 0 |
| 6 | Honestly, the pleasantries bore me. I'd rather... | psychopathic | 0 |
| 7 | My husband says I'm too manipulative, always g... | psychopathic | 0 |
| 8 | Mis hijos think they can con their mamá... but... | psychopathic | 0 |
| 9 | I'm utterly sick and tired of them judging me ... | psychopathic | 0 |


In [50]:
df.isnull().sum()

text           0
personality    0
label          0
dtype: int64

Looking at the dataset, we can see that there are no missing values. We might, however, have a class imbalance problem: the narcissistic class (see the label column) appears at a ratio of 1 narcissistic label for every 4 non-narcissistic labels. Since [Google](https://developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets) categorizes this degree of imbalance as mild, we will use the imbalanced dataset as is for the baseline result.
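
The class ratio described above can be checked directly from the label column. The snippet below is a minimal sketch on a small made-up frame (the rows in `toy` are invented for illustration and are not taken from the dataset); on the real data, the same `value_counts` call would be applied to `df["label"]`.

```python
import pandas as pd

# Toy stand-in for the real dataset: two rows per personality type (illustrative only)
toy = pd.DataFrame({
    "personality": ["narcissistic", "psychopathic", "depressive",
                    "obsessional", "paranoid"] * 2
})

# Binary target: 1 for narcissistic, 0 for every other personality type
toy["label"] = (toy["personality"] == "narcissistic").astype(int)

# normalize=True returns proportions instead of raw counts
class_share = toy["label"].value_counts(normalize=True)
minority_share = class_share.loc[1]
print(class_share)
```

A minority share between 20% and 40% falls in the band that the linked Google guide labels as mild.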

We can also see that some texts mix languages, for example containing both English and Spanish.

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4973 entries, 0 to 4972
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   text         4973 non-null   object
 1   personality  4973 non-null   object
 2   label        4973 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 116.7+ KB


## Data pre-process

We will follow the general preprocessing procedure: lowercasing, then removing unnecessary tags, stopwords, punctuation, and so on.

Looking through the dataset, we can see that some texts are written entirely in Spanish, for example index 3984: "siento que mis colegas conspiran en mi contra porque les intimida mi éxito." Others contain both Spanish and English, for example index 8: "mis hijos think they can con their mamá... but you see, i'm always two steps ahead, and they will learn with time." To handle this, we use a language detector to flag the texts whose most probable language is not English and drop them. Note that a mixed text that is still predominantly English can pass this filter.

In [74]:
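from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 42  # make langdetect's results reproducible

def is_english_text(text):
    """Return True if the most probable detected language is English."""
    try:
        langs = detect_langs(text)  # candidates ordered by probability
        return langs[0].lang == 'en'
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty or digits only)
        return False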

In [82]:
df_processed = df.copy()

In [76]:
tqdm.pandas()
df_processed = df_processed[df_processed['text'].progress_apply(is_english_text)]

100%|██████████| 4973/4973 [00:13<00:00, 359.30it/s]
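
Because `is_english_text` only looks at the top-ranked language, a stricter variant could also require a minimum probability. The sketch below is one possible way to do that, not what this notebook runs: `is_confidently_english`, `min_prob`, and the `detector` parameter are hypothetical names, and the 0.9 threshold is an arbitrary assumption. It relies on the `lang` and `prob` attributes of the objects returned by langdetect's `detect_langs`. The `fake_detector` and its probabilities are made up purely to show the thresholding without calling the real detector.

```python
from types import SimpleNamespace

def is_confidently_english(text, min_prob=0.9, detector=None):
    """Return True only if English is the top language with probability >= min_prob."""
    if detector is None:
        # Imported lazily so the function can be exercised with a stand-in detector
        from langdetect import detect_langs
        detector = detect_langs
    try:
        langs = detector(text)
    except Exception:
        # Treat any detection failure as "not English"
        return False
    if not langs:
        return False
    top = langs[0]
    return top.lang == "en" and top.prob >= min_prob

# Stand-in detector with invented probabilities (illustration only)
def fake_detector(text):
    if "hijos" in text:
        return [SimpleNamespace(lang="en", prob=0.71),
                SimpleNamespace(lang="es", prob=0.29)]
    return [SimpleNamespace(lang="en", prob=0.99)]

mixed = is_confidently_english("mis hijos think they can con their mamá",
                               detector=fake_detector)
plain = is_confidently_english("they will learn with time",
                               detector=fake_detector)
print(mixed, plain)
```

With the real detector, the threshold would need tuning against a sample of rows, since a value that is too strict would also discard short English texts.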


Next, we apply a standard NLP text-cleaning pipeline.

In [77]:
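def remove_HUP(text):
    """Remove HTML tags, URLs, and ASCII punctuation from text."""
    text = str(text)
    text = re.sub(r'<.*?>', '', text)  # HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    # string.punctuation covers ASCII punctuation only
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text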
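
To make the effect of `remove_HUP` concrete, here is a small self-contained run on two invented sample strings. The function body is copied from the cell above. The samples are assumptions chosen to show one limitation: `string.punctuation` contains only ASCII characters, so a typographic apostrophe (U+2019) is left in place, which is why forms such as "it’s" can still appear in the processed output.

```python
import re
import string

def remove_HUP(text):
    text = str(text)
    text = re.sub(r'<.*?>', '', text)                   # drop HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # drop URLs
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

# Invented sample with a tag, a URL, and ASCII punctuation
cleaned = remove_HUP("<b>visit</b> https://example.com now, it's great!")

# Invented sample with a typographic apostrophe, which is not in string.punctuation
curly = remove_HUP("it\u2019s fine")

print(cleaned.split())
print(curly)
```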

In [78]:
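# Requires the NLTK stopwords corpus: run nltk.download('stopwords') once.
# A set gives constant-time membership checks.
stopword = set(stopwords.words('english'))

def remove_stopwords(text):
    """Drop English stopwords and rejoin the remaining words with single spaces."""
    # Skipping stopwords (instead of appending '') avoids runs of extra spaces
    return " ".join(word for word in text.split() if word not in stopword)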
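
The order of the cleaning steps matters for contractions. In this notebook punctuation is stripped before stopwords are removed, so a word like "don't" becomes "dont" and no longer matches a stopword entry, which is consistent with tokens such as "dont" and "im" surviving in the processed text. The sketch below illustrates this with a tiny hypothetical stopword set (`STOPWORDS` is an assumption for the example, not NLTK's full list) and simplified helper functions.

```python
import string

# Tiny illustrative subset, NOT the full NLTK English stopword list
STOPWORDS = {"i", "don't", "i'm", "the", "a", "if"}

def strip_punctuation(text):
    # Remove ASCII punctuation, as remove_HUP does in its last step
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text, stopwords=STOPWORDS):
    # Keep only the words that are not in the stopword set
    return " ".join(w for w in text.split() if w not in stopwords)

sentence = "i don't care if i'm the villain"

# Order used in the notebook: punctuation first, then stopwords
punct_first = remove_stopwords(strip_punctuation(sentence))

# Alternative order: stopwords first, then punctuation
stop_first = strip_punctuation(remove_stopwords(sentence))

print(punct_first)
print(stop_first)
```

Whether the contractions should be kept is a modeling choice; the point is only that the two orders produce different tokens.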

In [79]:
def is_keyboard_character(text):
    return bool(re.fullmatch(r"[\w\s!@#$%^&*()_+\-=\[\]{};':\"\\|,.<>/?`~]+", text))

In [84]:
df_processed['text'] = df_processed['text'].str.lower()  # Lowercasing
df_processed['text'] = df_processed['text'].apply(remove_HUP)
df_processed['text'] = df_processed['text'].apply(remove_stopwords)
# df_processed = df_processed[df_processed['text'].apply(is_keyboard_character)]

A quick visualization of the class distribution after preprocessing:

In [None]:
df_processed.groupby('label').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

In [87]:
df_processed

| | text | personality | label |
|---|---|---|---|
| 0 | failed meet several appointments week the... | psychopathic | 0 |
| 1 | mira dont care offended anyone opinions s... | psychopathic | 0 |
| 2 | clients consider tough call straightforw... | psychopathic | 0 |
| 3 | seeking pleasure life chico rules b... | psychopathic | 0 |
| 4 | say im irresponsible occasionally dodge wor... | psychopathic | 0 |
| ... | ... | ... | ... |
| 4968 | didnt handle work youre trying infi... | paranoid | 0 |
| 4969 | keep asking latest research huh want sel... | paranoid | 0 |
| 4970 | see whispering it’s isn’t always tar... | paranoid | 0 |
| 4971 | college kids conference clearly envious ... | paranoid | 0 |


## Building the model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_processed['text'], df_processed['label'], test_size=0.2, random_state=42)
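
Given the class imbalance noted earlier, one optional variation is a stratified split, which keeps the share of narcissistic labels the same in the training and test sets. The split above does not do this; the sketch below only shows what passing scikit-learn's `stratify` argument changes. The `texts` and `labels` lists are made-up toy data with 20% positives, not the real dataset.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, every fifth one is positive (20 positives, 80 negatives)
texts = [f"sample text {i}" for i in range(100)]
labels = [1 if i % 5 == 0 else 0 for i in range(100)]

# stratify=labels preserves the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

train_positive_share = sum(y_train) / len(y_train)
test_positive_share = sum(y_test) / len(y_test)
print(train_positive_share, test_positive_share)
```

On the notebook's data, the equivalent change would be adding `stratify=df_processed['label']` to the existing call.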