# Psychological Artificial Intelligence Research (PAIR) Project
---
**Author:** Tyler Chang | **Last Updated:** December 11, 2023 | **Code Repo:** TBD | **Contact:** TBD

---

## Overview of the Project

The **PAIR Project** is an ongoing effort to build AI frameworks for identifying psychological
epidemics in populations during specific epochs. 

The project began in Spring 2023 with a team of 4 developers led by Tyler Chang. The first release 
focused on building classification models that predict a person's levels of dark triad traits relative 
to others based on the SD3 (*short dark triad*) questionnaire. The SD3 questionnaire includes questions 
meant to measure a person's level of psychopathy, narcissism, and Machiavellianism, and can be 
administered without the presence of a psychologist (it is **NOT** a diagnostic tool, 
regardless of whether a psychologist is present). The models developed were each trained 
on a subset of the questions associated with two of the three dark triad traits to predict 
a person's average score for questions associated with the third trait. With accuracies 
ranging from approximately 67% to 78%, the models provided limited insight into the 
connections between the dark triad traits. 

Following the initial release, the original project team was disbanded (due to changing
professional obligations) and further work was temporarily paused. Development resumed
in late Summer 2023, and the project was restructured to focus on loneliness and Machiavellianism. 

### Information about the Project

PAIR is an interdisciplinary project, utilizing research and techniques from psychology,
data science, sociology, and ethical philosophy. 

The primary programming language used is Python (currently version 3.11.3). Additional 
languages, including R, SQL, HTML, and CSS, may be used to supplement the Python code 
where advisable. 

All current work (as of December 2023) is being developed by the Project Lead. No 
collaborators are currently being sought (the project is expected to be reopened to 
potential collaborators in January 2024). 

Data collection is expected to begin in January/February 2024. I am currently working
on a way to administer the questionnaire, record the time data, and ensure data
privacy and security. The questionnaire is expected to contain the MACH-IV questions and survey, 
the SD3 questions, a loneliness assessment questionnaire (exact design currently being 
researched), and a time tracker component (time to answer individual questions and time to 
complete sections). Questions will be presented in English and shown in random 
order. The optional personal information survey will always be the final part; it covers education
level, age, gender, native language, religion, sexual orientation, ethnicity, nationality,
marital status, number of children, number of siblings, and university major (if applicable).
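
The ordering rule described above (scored questions in random order, personal information survey last) could be sketched as follows. This is only an illustration: the section names, question IDs, and the `build_question_order` helper are hypothetical, and the sketch assumes all scored questions are shuffled into a single pool. Shuffling within each section instead is an equally plausible reading of the design.

```python
import random

# Hypothetical section names and question IDs; the real design is not final.
SECTIONS = {
    "mach_iv": ["M1", "M2", "M3"],
    "sd3": ["S1", "S2", "S3"],
    "loneliness": ["L1", "L2"],
}
FINAL_SECTION = ("personal_info", ["education", "age", "gender"])


def build_question_order(sections, final_section, seed=None):
    """Return (section, question_id) pairs: shuffled pool, then the final section."""
    rng = random.Random(seed)  # a seed makes the order reproducible for testing
    # Pool every scored question, then shuffle the pool as a whole (assumption).
    pool = [(name, q) for name, questions in sections.items() for q in questions]
    rng.shuffle(pool)
    # The optional personal information survey is always appended at the end,
    # in its original (unshuffled) order.
    final_name, final_questions = final_section
    return pool + [(final_name, q) for q in final_questions]


order = build_question_order(SECTIONS, FINAL_SECTION, seed=42)
```

Keeping the section name attached to each question ID makes it possible to compute per-section completion times later, even when the questions are interleaved.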

### Disclaimers
- This project's team does not currently include any medical professionals. 
No claims made in this report should be taken as a substitute for a medical
diagnosis.
- The data used in parts 1-3 has not been collected by the PAIR development team. 
As such, the findings of these sections should be taken as frameworks for use with
live data (to be clear, the data used is real data but the PAIR team has not been
able to verify that the advertised collection methods were followed or data integrity standards
were met).
- The project is ongoing and objectives may change as it progresses. If any significant
changes are made, they will be reflected in the project's GitHub repo README and in 
the documentation of this report.

---

## Table of Contents (updated as work progresses; titles are subject to change)

**Part 0: Loading Libraries and Importing Datasets**
1. Loading all Libraries
2. Importing the Datasets (parts 1-3)
3. Connecting to Live Data (expected early 2024)

**Part 1: MACH-IV 2017-2019**
1. The MACH-IV Dataset
2. Cleaning the Dataset
3. Analysis of the Dataset
4. Modeling the Dataset
5. Conclusions

**Part 2: SD3 [unknown time]**

**Part 3: Loneliness Dataset (TBD)**

**Part 4 Onward: Live Datasets (obtained via questionnaire)**

---

## Part 0: Loading Libraries and Importing Datasets

This section has 3 blocks of code:
1. Loading all required libraries
- Not all libraries may be used in every section.
- I am loading all of the libraries here instead of per section for the sake of clarity. 
The libraries required for each section will be noted in case you wish to load only
part of the report (or reproduce the code).
2. Loading the MACH-IV, SD3, and Loneliness datasets
- For now, I am loading the data locally (so I can work offline while traveling); 
all datasets will be made accessible online via the project's GitHub repo (likely 
via links to sites better suited to storing larger datasets). 
- The loneliness dataset is currently being sought and may end up as multiple datasets.
It is currently expected that this will involve extracting information from text instead
of relying only on questionnaire-style information.
3. Loading the Live questionnaire data
- This will be connected directly to a database for ease of access and updates.
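
Since the live data connection has not been built yet, the following is only a sketch of what reading questionnaire responses from a database into pandas might look like. It assumes an in-memory SQLite database as a stand-in for the eventual store; the `responses` table and its column names are hypothetical.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the live store.
conn = sqlite3.connect(":memory:")

# Hypothetical table layout: one row per respondent per question.
conn.execute(
    "CREATE TABLE responses ("
    "respondent_id INTEGER, question_id TEXT, answer INTEGER, elapsed_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO responses VALUES (?, ?, ?, ?)",
    [(1, "Q1", 3, 21017), (1, "Q2", 5, 18600), (2, "Q1", 4, 3818)],
)
conn.commit()

# Pull the table into a DataFrame, then release the connection.
live_df = pd.read_sql_query("SELECT * FROM responses", conn)
conn.close()
```

Swapping the SQLite connection for a hosted database would mainly change how `conn` is created; the query step would stay much the same.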

In [7]:
### LOADING ALL LIBRARIES (subject to change)

# data manipulation
import numpy as np
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# machine learning
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn import naive_bayes
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve, auc, make_scorer
from xgboost import XGBClassifier   # I am loading this separately to make the code later shorter
import xgboost as xgb
from imblearn.over_sampling import SMOTE

# NLP (may be used with loneliness dataset)

# Deep Learning (depends on size of later datasets)

# miscellaneous
from IPython import display
import joblib   # useful for saving a model
from pandasql import sqldf  # required to use SQL commands


In [5]:
### LOADING THE DATASETS

# MACH-IV data (tab-delimited file)
mdf = pd.read_csv('data.csv', delimiter="\t")

# SD3 data (tab-delimited file)
sdf = pd.read_csv('SD3/data.csv', delimiter="\t")

# Loneliness data
#ldf =


## Part 1: MACH-IV 2017-2019

ADD PREAMBLE LATER
- Explain what the MACH-IV survey consists of and how it is interpreted.
- Add a summary of Part 1's contents
- Add an appropriate image or graphic to better visually separate the section (do the 
same thing for subsequent sections).

### Cleaning the MACH-IV Data

1. Overview of the dataset
- table dimensions & peek at data
- data types in the table (and how many of each type)
2. Handling missing values
3. Handling improper values
- Keep in mind that the dataset includes a check for whether a person may be answering
dishonestly. One section of the survey asks respondents whether they are sure they know 
the definitions of a series of words, three of which are fake. A person who claims to know 
a fake word may be lying in their responses, although they may also be confusing 
it with a real word.
4. Changing the time values to seconds (they are in milliseconds by default)
5. Final checks (basically reconfirm that there are no missing or improper values)
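
Steps 3 and 4 above could be sketched roughly as follows, using a toy frame in place of `mdf`. Two assumptions are made here and should be checked against the dataset's codebook: that the per-question time columns follow the `Q<n>E` naming seen in the table preview, and that the three fake-word items are stored in columns named `VCL6`, `VCL9`, and `VCL12` (with 1 meaning the respondent claimed to know the word). The helper names are hypothetical.

```python
import pandas as pd

# Toy frame standing in for mdf; values are illustrative only.
toy = pd.DataFrame({
    "Q1A": [3, 5, 5],
    "Q1E": [21017.0, 3818.0, 4186.0],   # assumed: elapsed time in milliseconds
    "Q2E": [18600.0, 7850.0, 2900.0],
    "VCL6": [0, 1, 0],                  # assumed fake-word columns (1 = "I know it")
    "VCL9": [0, 0, 0],
    "VCL12": [0, 0, 1],
})

FAKE_WORD_COLS = ["VCL6", "VCL9", "VCL12"]  # assumption, not confirmed by the source


def flag_fake_word_claims(df, fake_cols):
    """Count claimed fake words per respondent and flag anyone with at least one."""
    out = df.copy()
    out["fake_word_claims"] = out[fake_cols].sum(axis=1)
    out["possibly_unreliable"] = out["fake_word_claims"] > 0
    return out


def times_to_seconds(df):
    """Convert every column named Q<n>E from milliseconds to seconds."""
    out = df.copy()
    time_cols = out.filter(regex=r"^Q\d+E$").columns
    out[time_cols] = out[time_cols] / 1000.0
    return out


cleaned = times_to_seconds(flag_fake_word_claims(toy, FAKE_WORD_COLS))
```

Flagging rather than dropping respondents keeps the decision about how to treat possibly unreliable rows open until the analysis stage.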


In [14]:
### TABLE DIMENSIONS
mdf.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73489 entries, 0 to 73488
Columns: 105 entries, Q1A to major
dtypes: float64(64), int64(39), object(2)
memory usage: 58.9+ MB


The raw dataset includes 73,489 rows (respondents) and 105 columns (features), with 103 numeric and 2 text-based columns. This 
size is expected to be sufficient to train a non-deep model but is likely too small to build a robust deep model. 
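
The numeric versus text split reported by `mdf.info()` can also be computed directly. The sketch below uses a small toy frame rather than the real data, so the counts it produces are not those of the MACH-IV dataset.

```python
import pandas as pd

# Toy frame standing in for mdf; values are illustrative only.
toy = pd.DataFrame({
    "Q1A": [3.0, 5.0],
    "Q1E": [21017.0, 3818.0],
    "hand": [1, 1],
    "major": ["Marketing", "mathematics"],
})

# Number of columns per dtype (cast to str so the index is easy to read).
dtype_counts = toy.dtypes.astype(str).value_counts()

# Split column names into numeric and non-numeric groups.
numeric_cols = toy.select_dtypes(include="number").columns
text_cols = toy.select_dtypes(exclude="number").columns

print(dtype_counts)
print(f"{len(numeric_cols)} numeric, {len(text_cols)} text-based")
```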

In [16]:
### LOOKING AT THE TABLE
mdf.head(3)


Unnamed: 0,Q1A,Q1I,Q1E,Q2A,Q2I,Q2E,Q3A,Q3I,Q3E,Q4A,...,screenw,screenh,hand,religion,orientation,race,voted,married,familysize,major
0,3.0,6.0,21017.0,3.0,7.0,18600.0,5.0,20.0,14957.0,2.0,...,1440.0,900.0,1,7,1,30,1,2,5,Marketing
1,5.0,17.0,3818.0,5.0,9.0,7850.0,1.0,16.0,5902.0,3.0,...,1536.0,864.0,1,1,1,60,2,1,2,mathematics
2,5.0,16.0,4186.0,5.0,12.0,2900.0,1.0,2.0,7160.0,1.0,...,375.0,667.0,1,2,2,10,2,1,2,Chemistry


Several of the columns where answers would be expected to be non-numeric have already been encoded:
married, voted, race, orientation, religion, hand, engnat ("English native"), gender, urban, and education. To make these
encoded values easier to translate into readable formats later, I have created a dictionary for each feature with its respective values.

In [17]:
### DICTIONARIES FOR TRANSLATING ENCODED FEATURES
"""
I am making a dictionary for each of the features that are not straightforward
to interpret.
"""
marriage_dict = {1: 'Never married', 2: 'Currently married', 3: 'Previously married'}
voted_dict = {1: 'Yes', 2: 'No'}    # this is if the person voted in a national election in the last year
race_dict = {10: 'Asian', 20: 'Arab', 30: 'Black', 40: 'Indigenous Australian',
             50: 'Native American', 60: 'White', 70: 'Other'}
sex_orient_dict = {1: 'Heterosexual', 2: 'Bisexual', 3: 'Homosexual',
                   4: 'Asexual', 5: 'Other'}
religion_dict = {1: 'Agnostic', 2: 'Atheist', 3: 'Buddhist', 4: 'Catholic',
                 5: 'Mormon', 6: 'Protestant', 7: 'Christian (Other)',
                 8: 'Hindu', 9: 'Jewish', 10: 'Muslim', 11: 'Sikh',
                 12: 'Other'}
hand_dict = {1: 'Right', 2: 'Left', 3: 'Both'}   # this is which hand you write with
engnat_dict = {1: 'Yes', 2: 'No'}   # this is whether English is your native language
gender_dict = {1: 'Male', 2: 'Female', 3: 'Other'}
urban_dict = {1: 'Rural (country side)', 2: 'Suburban', 3: 'Urban (town/city)'} # this is where you grew up
education_dict = {1: 'Less than high school', 2: 'High school',
                  3: 'University degree', 4: 'Graduate degree'}
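
As a usage sketch, a dictionary like the ones above can be applied to an encoded column with `Series.map`. The example uses a toy frame rather than `mdf`, and the `married_label` column name is hypothetical. Note that any code missing from the dictionary (the 0 below is an assumed example of such a code) becomes `NaN`, which can help surface improper values.

```python
import pandas as pd

marriage_dict = {1: 'Never married', 2: 'Currently married', 3: 'Previously married'}

# Toy column standing in for mdf['married']; the 0 is not a key in the dictionary.
toy = pd.DataFrame({"married": [2, 1, 1, 0]})

# Translate the codes into readable labels; unmapped codes become NaN.
toy["married_label"] = toy["married"].map(marriage_dict)

print(toy)
```

Writing the labels to a new column, rather than overwriting the encoded one, keeps the numeric codes available for modeling.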


I will now check if there are any missing values. 
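
One way such a check might look is sketched below on a toy frame. It is not the project's final cell, and the `summary` table layout is just one possible design.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for mdf; values are illustrative only.
toy = pd.DataFrame({
    "Q1A": [3.0, np.nan, 5.0, 1.0],
    "Q1E": [21017.0, 3818.0, np.nan, np.nan],
    "major": ["Marketing", None, "Chemistry", "physics"],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(2),
})

# Keep only columns that have gaps, worst first.
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```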

In [None]:
### CHECKING FOR MISSING VALUES


### Initial Analysis

### Deeper Analysis and Visualization

### Modeling (Clusters, question scores)

### Conclusions (findings, further work, ethical concerns)

## Part 2: SD3