# OpenProjection 1.0
<hr>
<br>

## Contents:
1. [Intro](#1-intro): What is OpenProjection?

2. [Context](#2-context): Can subjective experience be studied quantitatively?

3. [Data](#3-data)
    1. [Load and explore data](#31-load-and-explore-data)
    
    2. [Extract features and targets](#32-extract-features-and-targets)
        1. [Features](#321-features)

        2. [Targets](#322-targets)
            
            1. [Categorical (for classification)](#3221-categorical)

            2. [Continuous (for regression)](#3222-continuous)
            
                1. [Unsupervised target selection](#32221-unsupervised-target-selection)

4. [Pipelines](#4-pipelines)
    1. [Classification](#41-classification)
    
    2. [Regression](#42-regression)

5. [Conclusion](#5-conclusion)

## **1. Intro:**
OpenProjection is a basic open-source machine learning project I designed during my master's. Its main use is as a very beginner-friendly, step-by-step tutorial for machine learning in Python. It covers different types of supervised learning problems, including both regression and classification, as well as an embedded unsupervised learning problem. The nature of the data used herein also means we'll be touching on aspects of Natural Language Processing (NLP), but won't go into excessive detail. We'll also practice using an API (remotely accessing some website's functions).

The entire thing can be run from start to finish in Google Colab [here](), with an accompanying video walkthrough [here](). Alternatively, the entire script is also available if you wanna just jump right in.
<br>
<br>

## **2. Context:** 
It is said that we do not perceive the world as it is, but as we are. Such poetic truisms capture a key phenomenon that the fields of philosophy, psychology and neuroscience have been grappling with for millennia, centuries and decades, respectively: individuals experience themselves and the world in their own unique and rather impenetrable way. Getting to the core of subjective experience holds great promise for many other herculean tasks, such as understanding, preventing and treating the distortions of subjective experience that lead to mental health diagnoses and, ultimately, revealing the nature of consciousness itself. Obviously, we won't be solving the Hard Problem in this notebook, but we will attempt a first pass at quantifying verbal reports of subjective experience and relating them to mental health (broadly construed).

We'll be working with a classic test in psychology, the Thematic Apperception Test (TAT), designed to assess variations in how individuals interpret ambiguous situations. Following from the popular adage at the beginning of this section, the TAT is founded on the idea that people reveal aspects of themselves when confronted with non-obvious situations. However, the degree to which this is true has never truly been exposed to quantitative statistical verification... until now.

In simple terms, our goal is to turn TAT verbal reports into numeric variables and determine if they hold any significance/possess any predictive power. So, we'll first need some kind of algorithm that turns words into numbers, and we'll also need some target to predict which would validate whether our algorithm is useful or not. 
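
As a toy illustration of what "turning words into numbers" can mean, the sketch below counts word occurrences over a shared vocabulary (a bag-of-words representation). This is not the method used later in this tutorial, which relies on embeddings; the two example sentences and the `bag_of_words` helper are invented for illustration.

```python
from collections import Counter

def bag_of_words(texts):
    """Turn a list of texts into word-count vectors over a shared vocabulary."""
    # Lowercase and split on whitespace: a deliberately crude tokenizer
    tokenized = [text.lower().split() for text in texts]
    # Sorted vocabulary so every vector uses the same column order
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Two invented sentences, loosely in the style of a TAT narrative
texts = ["the boy looks at the violin", "the girl looks at the boy"]
vocab, vectors = bag_of_words(texts)
print(vocab)
print(vectors)
```

Each text becomes a row of numbers with one position per vocabulary word, which is already something a model can consume, even if it ignores word order and meaning.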
<br>

## **3. Data:**
Now we're ready to get our hands dirty, so to speak. Open TAT data has been graciously and freely provided by Middle Tennessee State University [here](https://jewlscholar.mtsu.edu/items/a54a5d18-3cfa-4700-bbf5-7d68a4375df3). To many, this step is the most crucial since machine learning does not learn out of thin air. The quality of our models depends on the quality of our training data. Or, more bluntly, "garbage in, garbage out".
<br>
<br>

### **3.1. Load and explore data**
When taking a first look at fresh data, you'll want to know a couple of important things:
* How many rows (observations) 
* How many columns (variables), and what kind of data they contain
* If there are any missing values, as well as **how many** and **in which columns** they are if so


In [3]:
import pandas as pd

### Load data into dataframe
df = pd.read_excel('TAT Narrative Collection.xlsx')

In [4]:
### Show first five entries
df.head()

|  | Subject ID # | Source/ Permission Statement | Age | Sex | Race | Date of Admin | Card # | Psychiatric Diagnosis (If Any) | Circumstances surrounding Test (clinical, timed, not timed, experiental) | Other | Narrative |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Used with permission of the Magda Arnold estat... | 15 | male | unknown | Prior to 1962 | 1 |  |  | "inverterate truant" | Well, this boy is looking at his violin and tr... |
| 1 | 1 | Used with permission of the Magda Arnold estat... | 15 | male | unknown | Prior to 1962 | 2 |  |  | "inverterate truant" | This girl is going to school because her mothe... |
| 2 | 1 | Used with permission of the Magda Arnold estat... | 15 | male | unknown | Prior to 1962 | 3BM |  |  | "inverterate truant" | This boy is crying ‘cause his mother made him ... |
| 3 | 1 | Used with permission of the Magda Arnold estat... | 15 | male | unknown | Prior to 1962 | 4 |  |  | "inverterate truant" | Oh, that looks like this guy is pretty angry. ... |
| 4 | 1 | Used with permission of the Magda Arnold estat... | 15 | male | unknown | Prior to 1962 | 5 |  |  | "inverterate truant" | Oh, that seems like this woman is watching to ... |


In [39]:
### How many observations
len(df)

664

In [44]:
### How many columns
len(df.columns)

### Here, df.columns is a list of column names, try it out:
# df.columns

11

In [5]:
### Numbers of rows and columns, respectively
df.shape

(664, 11)

In [6]:
### Get datatypes for each column, this will also show us the column names
df.dtypes

Subject ID #                                                                object
Source/ Permission Statement                                                object
Age                                                                         object
Sex                                                                         object
Race                                                                        object
Date of Admin                                                               object
Card #                                                                      object
Psychiatric Diagnosis (If Any)                                              object
Circumstances surrounding Test (clinical, timed, not timed, experiental)    object
Other                                                                       object
Narrative                                                                   object
dtype: object
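
Every column above is stored as `object` (pandas' catch-all type, usually strings), even `Age`. If you want to treat such a column as a number, one option is `pd.to_numeric` with `errors='coerce'`, which turns anything it cannot parse into `NaN`. The toy values below are invented to keep the sketch self-contained; they are not taken from the TAT file.

```python
import pandas as pd

# Invented stand-in for an 'Age' column that mixes numbers and free text
toy = pd.DataFrame({"Age": ["15", 23, "unknown", "40"]})

# errors='coerce' replaces values that cannot be parsed with NaN
toy["Age_numeric"] = pd.to_numeric(toy["Age"], errors="coerce")

print(toy.dtypes)
print(toy["Age_numeric"].isna().sum(), "value(s) could not be converted")
```

Keeping the original column next to the converted one makes it easy to inspect which raw values ended up as `NaN`.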

In [36]:
### Check if there are NaN or null values in the dataframe
df.isnull().values.any()

True

In [37]:
### Since there are NaNs, let's see how many per column
df.isnull().sum()

Subject ID #                                                                  0
Source/ Permission Statement                                                  0
Age                                                                           0
Sex                                                                           0
Race                                                                          0
Date of Admin                                                                 0
Card #                                                                        0
Psychiatric Diagnosis (If Any)                                              483
Circumstances surrounding Test (clinical, timed, not timed, experiental)    418
Other                                                                       548
Narrative                                                                     0
dtype: int64
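
Raw counts are often easier to judge as a share of all rows (483 of 664 is roughly 73%). A minimal sketch of that calculation on a small invented dataframe; `toy` and `missing_share` are illustrative names.

```python
import pandas as pd

# Invented miniature table with one complete and one mostly empty column
toy = pd.DataFrame({
    "Narrative": ["story one", "story two", "story three", "story four"],
    "Other": [None, "note", None, None],
})

# isnull() gives True/False per cell; the mean of booleans is the share of True
missing_share = toy.isnull().mean()
print(missing_share)
```

On the real data, the same line applied to `df` would give the missing share for all eleven columns at once.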

In [7]:
### The 'Psychiatric Diagnosis' column caught my attention, let's see what's in there
### This is *foreshadowing* for the classification problem later on

df['Psychiatric Diagnosis (If Any)'].value_counts()

schizophrenia                                                        40
psychoneurosis                                                       28
personality disorder                                                 20
Psychoneurosis: Somatization                                         20
schizophrenia, paranoid type, in remission                           20
deteriorated organic from syphilis of the central nervious system    20
"Severe problems in the sexual area"                                 20
behavior disorder                                                    11
accute symptoms; inability to stay awake                              1
?                                                                     1
Name: Psychiatric Diagnosis (If Any), dtype: int64
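
Note that the labels above are not spelled consistently (for instance `psychoneurosis` versus `Psychoneurosis: Somatization`). Below is a minimal sketch of how such labels could be tidied before grouping them, using a small invented Series rather than the real column. The grouping rule (keep only the text before the first colon or comma) is an assumption for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Invented sample of diagnosis labels, including a missing value
labels = pd.Series([
    "schizophrenia",
    "psychoneurosis",
    "Psychoneurosis: Somatization",
    "schizophrenia, paranoid type, in remission",
    None,
])

# Lowercase and strip whitespace so spelling variants compare equal
clean = labels.str.lower().str.strip()

# Keep only the text before the first colon or comma as a coarse label
coarse = clean.str.extract(r"^([^:,]+)", expand=False).str.strip()

print(coarse.value_counts(dropna=False))
```

Whether subtypes should be merged like this is a modelling decision; the point is only that the merge can be made explicit and repeatable.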

In [38]:
### The contents of the 'Other' column are also intriguing, let's see which values occur and how often
df['Other'].value_counts()

Subject described as a scholastic in a teaching order                                                                       20
Wants to break free of relationships and become independent                                                                 20
 "quite delusional and incoherent, but cooperative during the examination and in the ward"                                  20
"Problems so intense that they threaten his capacity to control his drives, allay his anxiety, and alleviate his guilt."    20
Executive Development Program participant                                                                                   10
"inverterate truant"                                                                                                         5
Young woman whose mother divorced and remarried                                                                              4
"Poor teacher"                                                                                                 

In [59]:
### For fun, let's create a slider to scroll through verbal reports
### This is a good way to get a sense of the data
import ipywidgets as widgets
from IPython.display import display, Markdown

### Create a slider
slider = widgets.IntSlider(min=0, 
                           max=len(df)-1, 
                           step=1)

### Link slider to dataframe
def view(report):
    """
    Displays narratives in Markdown format.
    
    Args:
        report (int): Row position of the narrative to display, set by the slider.
    """
    narrative = df['Narrative'].iloc[report]
    display(Markdown(narrative))

widgets.interact(view, report=slider)


### **3.2. Extract features and targets**
#### **3.2.1 Features**
Now we'll use the OpenAI API to extract numerical representations (embeddings) of the narratives.

This is the first step in the pipeline and requires you to have an OpenAI account (and a credit card) to generate an API key. Embeddings are ridiculously cheap, costing a fraction of a fraction of a penny for many thousands of tokens.
If you choose to do so, you'll simply need to replace the API key below with your own.

However, if you can't or don't want to create an account, embeddings are free to share and 
can be accessed from the repo: ['https://raw.githubusercontent.com/username/repository-name/master/csv_file1.csv']. 
If you go that route, you can skip the next cell and load the embeddings directly in the cell after it.
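
An embedding is just a list of numbers, and texts with similar meaning tend to get vectors that point in similar directions. A common way to compare two such vectors is cosine similarity. The sketch below uses three made-up 3-dimensional vectors (real embeddings are much longer) and a hypothetical `cosine_similarity` helper; it makes no API call.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy "embeddings": a and b point in similar directions, c does not
story_a = [0.9, 0.1, 0.0]
story_b = [0.8, 0.2, 0.1]
story_c = [0.0, 0.1, 0.9]

print(round(cosine_similarity(story_a, story_b), 3))
print(round(cosine_similarity(story_a, story_c), 3))
```

The first value should come out much closer to 1 than the second, which is the kind of structure a downstream model can pick up on.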

In [60]:
# import openai
# openai.api_key = "your-api-key-here"

### Function to get embeddings. Do not run this needlessly, as it will cost you money.
### Ideally, it's "one and done", so always test on a small sample before running on the full dataset.
### Note: openai.Embedding.create is the interface of openai versions before 1.0.
# def get_embedding(text: str, model="text-embedding-ada-002") -> list[float]:
#    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]

### Create embeddings for each narrative
# embeddings = [get_embedding(i) for i in df.Narrative]

### Load embeddings into a dataframe with customized column names (one column per embedding dimension)
# clean_df = pd.DataFrame(embeddings, columns=["Col" + str(i) for i in range(len(embeddings[0]))])

### Save dataframe to csv. Remember, we don't want to do this again. 
# clean_df.to_csv("TAT_Embeddings.csv", index=False)
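
Because each API call costs money, it helps to make the "one and done" rule hard to break. Below is a minimal sketch of a load-or-compute cache. It assumes a hypothetical `embed_fn` that maps one text to a list of floats; here it is replaced by a fake function so nothing is sent to OpenAI. The `load_or_compute_embeddings` helper and the temporary file location are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_or_compute_embeddings(texts, embed_fn, cache_path):
    """Return embeddings as a DataFrame, computing them only if no cache exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: no call to embed_fn is made
        return pd.read_csv(cache_path)
    vectors = [embed_fn(text) for text in texts]
    out = pd.DataFrame(vectors, columns=[f"Col{i}" for i in range(len(vectors[0]))])
    out.to_csv(cache_path, index=False)
    return out

calls = []

def fake_embed(text):
    """Stand-in for a paid API call: records the call and returns two numbers."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "TAT_Embeddings.csv"
    texts = ["a boy", "a girl and a dog"]
    first = load_or_compute_embeddings(texts, fake_embed, path)
    second = load_or_compute_embeddings(texts, fake_embed, path)

print(len(calls))  # number of times the fake API was actually called
```

The second call reads the CSV instead of calling `fake_embed` again, so rerunning the notebook would not trigger new charges as long as the file is still there.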

In [None]:
### Load dataframe from csv (if you downloaded it from the repo)
# embeddings = pd.read_csv('Path to csv file/TAT_Embeddings.csv')

#### **3.2.2. Targets**
##### **3.2.2.1. Categorical**

In [32]:
### We now have half of the data we need to train a model: the embeddings, or features (X).
### The other half will be the target (y) that we're trying to predict.

### The distinction here is whether the target is a continuous variable (a regression problem)
### or a categorical one (a classification problem).

### Since we'll demonstrate both, we'll create two targets: 
### 1- Psychiatric Diagnosis (Classification)
### 2- Psycholinguistic dimension scores (Regression)

### First, we'll create the classification target.
### We'll use the 'Psychiatric Diagnosis' column, but we'll need to convert it to a numerical format.
### For the sake of simplicity, we'll consider only two classes:
### schizophrenia and psychoneurosis (combined from two labels)

# Create df with only the rows that contain 'schizophrenia' as a diagnosis
schiz = df[df['Psychiatric Diagnosis (If Any)'] == 'schizophrenia']

# Create df with only the rows that contain 'psychoneurosis' as a diagnosis, including those with 'Psychoneurosis: Somatization'
psychoneuro = df[df['Psychiatric Diagnosis (If Any)'].isin(['psychoneurosis', 'Psychoneurosis: Somatization'])]

In [35]:
### Now we have two roughly balanced categories to predict
print(schiz.shape)
print(psychoneuro.shape)

(40, 11)
(48, 11)
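
To turn these two subsets into something a classifier can use, one option is to build a boolean mask per class, keep the rows that belong to either, and pick the matching embedding rows by index. The sketch below does this on tiny invented data. It assumes that row *i* of the embeddings corresponds to row *i* of the narrative table, which holds only if the embeddings were computed in the original row order. The names `toy_df`, `toy_embeddings`, `X` and `y` are illustrative.

```python
import pandas as pd

# Invented miniature versions of the narrative table and its embeddings
toy_df = pd.DataFrame({
    "Psychiatric Diagnosis (If Any)": [
        "schizophrenia", None, "psychoneurosis",
        "Psychoneurosis: Somatization", "behavior disorder",
    ]
})
toy_embeddings = pd.DataFrame(
    {"Col0": [0.1, 0.2, 0.3, 0.4, 0.5], "Col1": [1.0, 0.9, 0.8, 0.7, 0.6]}
)

diagnosis = toy_df["Psychiatric Diagnosis (If Any)"]
is_schiz = diagnosis == "schizophrenia"
is_neuro = diagnosis.isin(["psychoneurosis", "Psychoneurosis: Somatization"])

# Keep only rows from the two classes; the index still points into toy_df
keep = is_schiz | is_neuro
X = toy_embeddings.loc[keep]
# 1 = schizophrenia, 0 = psychoneurosis
y = is_schiz.loc[keep].astype(int).rename("diagnosis")

print(X.shape, y.tolist())
```

Rows with a missing or different diagnosis are simply dropped from both `X` and `y`, so features and labels stay aligned.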


##### **3.2.2.2 Continuous**

In [None]:
### To create our regression targets, I decided to use psycholinguistic dimensions
### inspired by those produced by the Linguistic Inquiry and Word Count (LIWC-22) software.
### Since this software is only for academic use and cannot be redistributed, I have simulated
### the dimensions and provide them freely to you here: ['https://raw.githubusercontent.com/username/repository-name/master/csv_file1.csv']
targets = pd.read_csv("""insert path to csv here""")

##### **3.2.2.2.1 Unsupervised target selection**

In [None]:
# TODO

## **4. Pipelines:**
### **4.1. Classification**

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

def encode_append(X, y):
    """Integer-encodes the target y, then appends it to X as the last column."""
    le = LabelEncoder()
    y_encoded = pd.Series(le.fit_transform(y), name=y.name)
    return pd.concat([X.reset_index(drop=True), y_encoded], axis=1)

class ClassifierPipeline:
    def __init__(self, data, target_col, test_size=0.3):
        self.data = data
        self.target_col = target_col
        self.test_size = test_size
        self.models = {}   # fitted pipelines, keyed by model name
        self.results = {}  # evaluation metrics, keyed by the same names

    def split_data(self):
        X = self.data.drop(columns=[self.target_col])
        y = pd.Categorical(self.data[self.target_col])
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(X, y, test_size=self.test_size)
        
    def scale_fit_lr(self):
        # Standardize the features, then fit a logistic regression classifier
        pipe = Pipeline(steps = [('scaler', StandardScaler()), 
                                 ('lr', LogisticRegression())])
        pipe.fit(self.X_train, self.y_train)
        self.models[f'lr_{self.target_col}'] = pipe

    def evaluate(self, show=False):
        for name, model in self.models.items():
            y_pred = model.predict(self.X_test)

            accuracy = accuracy_score(self.y_test, y_pred)
            conf_matrix = confusion_matrix(self.y_test, y_pred)
            precision = precision_score(self.y_test, y_pred, average='macro', zero_division=0)
            recall = recall_score(self.y_test, y_pred, average='macro', zero_division=0)
            f1 = f1_score(self.y_test, y_pred, average='macro', zero_division=0)
            class_rep = classification_report(self.y_test, y_pred, zero_division=0)

            # Store the metrics separately so the fitted model is not overwritten
            self.results[name] = {
                'accuracy': accuracy,
                'confusion_matrix': conf_matrix,
                'precision': precision,
                'recall': recall,
                'f1': f1,
                'class_rep': class_rep
            }
            
            if show:
                #print(f"{name} Classification report: {class_rep}")
                print(f"{name} accuracy: {accuracy:.3f}")
                print(f"{name} precision: {precision:.3f}")
                print(f"{name} recall: {recall:.3f}")
                print(f"{name} f1: {f1:.3f}")
            
    def average_scores(self):
        # Average each metric over all evaluated models
        scores = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}
        for name, result in self.results.items():
            scores['accuracy'].append(result['accuracy'])
            scores['precision'].append(result['precision'])
            scores['recall'].append(result['recall'])
            scores['f1'].append(result['f1'])

        avg_scores = {k: sum(v) / len(v) for k, v in scores.items()}
        return avg_scores
            
def rankings_all_targets(targets):
    """Fits and evaluates one classifier per target and returns the scores for each target."""
    all_scores = {}
    for target in targets: 
        # lens_cards_all_ada and psych_df are not defined in this section; they are assumed
        # to be the embeddings dataframe and a dataframe of categorical targets, respectively.
        pipeline = ClassifierPipeline(encode_append(lens_cards_all_ada, psych_df[target]), target)
        pipeline.split_data()
        pipeline.scale_fit_lr()
        pipeline.evaluate()
        all_scores[target] = pipeline.average_scores()
    return all_scores

rankings_all_targets(psych_df.iloc[:,6:].columns.tolist())
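
With only a few dozen narratives per class, a single train/test split can give quite different scores from run to run. A common alternative is stratified k-fold cross-validation. The sketch below applies it to the same scaler-plus-logistic-regression pipeline on synthetic data from `make_classification`. The data, the sample size of 88 (echoing the 40 + 48 rows above) and the choice of 5 folds are assumptions for illustration, not results from the TAT data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for embeddings (X) and a binary diagnosis label (y)
X, y = make_classification(n_samples=88, n_features=20, n_informative=5,
                           random_state=0)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Stratification keeps the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")

print(f"accuracy per fold: {scores.round(3)}")
print(f"mean accuracy: {scores.mean():.3f}")
```

The spread across folds gives a rough sense of how much a single split could mislead you.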

### **4.2. Regression**

In [None]:
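import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def append(X, y):
    """Appends the target y to X as the last column, ignoring the original indices."""
    return pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1)

def scale_fit(X, y):
    """Standardizes the features, then fits a linear regression."""
    model = Pipeline(steps=[
            ('scaler', StandardScaler()), 
            ('regressor', LinearRegression())
        ])
    model.fit(X, y)
    return model
        
class RegressionPipeline:
    def __init__(self, data):
        # The last column of data is the target, all other columns are features
        self.data = data
        self.X = self.data.iloc[:,:-1]
        self.y = self.data.iloc[:,-1]
          
    def evaluate(self):
        """Leave-one-out evaluation: each row is predicted by a model trained on all other rows."""
        loo = LeaveOneOut()
        y_pred = []
        y_test_all = []
        for train_index, test_index in loo.split(self.X):
            X_train, X_test = self.X.iloc[train_index], self.X.iloc[test_index]
            y_train, y_test = self.y.iloc[train_index], self.y.iloc[test_index]
            model = scale_fit(X_train, y_train)
            y_pred.append(model.predict(X_test)[0])
            y_test_all.append(y_test.values[0])

        # Metrics are computed once, over the pooled held-out predictions
        mse = mean_squared_error(y_test_all, y_pred)
        r2 = r2_score(y_test_all, y_pred)
        corr = np.corrcoef(y_pred, y_test_all)[0,1]
        return mse, r2, corr
            
def rankings_all_targets(targets):
    for target in targets: 
        # lens_cards_all and targets_df are not defined in this section; they are assumed
        # to be the embeddings dataframe and the dataframe of continuous targets.
        # Note that [::-1] reverses the row order of the target before it is appended.
        pipeline = RegressionPipeline(append(lens_cards_all, targets_df[target][::-1]))                           
        mse, r2, corr = pipeline.evaluate()
        print(f"{target} - MSE: {mse:.3f}, R2: {r2:.3f}, Correlation: {corr:.3f}")

rankings_all_targets(targets_df.columns.tolist())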
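
The manual leave-one-out loop above can also be written with scikit-learn's `cross_val_predict`, which refits the pipeline for every held-out row and collects the predictions. Below is a minimal sketch on synthetic data from `make_regression`; the sample size, noise level and feature count are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for embeddings (X) and one continuous target (y)
X, y = make_regression(n_samples=40, n_features=5, noise=10.0, random_state=0)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# One prediction per row, each made by a model that never saw that row
y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())

mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
corr = np.corrcoef(y_pred, y)[0, 1]
print(f"MSE: {mse:.3f}, R2: {r2:.3f}, Correlation: {corr:.3f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the held-out row never leaks into the scaling statistics.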

## **5. Conclusion**

In [62]:
# TODO