# Mount Notebook to Google Drive
Dataset: https://ars.els-cdn.com/content/image/1-s2.0-S2352340917303487-mmc2.csv

Data Name: '1-s2.0-S2352340917303487-mmc2.csv'

Naming Convention: '/content/drive/MyDrive/CSE6250/1-s2.0-S2352340917303487-mmc2.csv'

In [None]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

# Introduction
## Background of the problem:

**Problem type:** Multi-label classification of Psychotic Disorder Diseases (PDD) using patient medical records.  
**Importance:** Accurate classification of PDD is crucial for early diagnosis, tailored treatment plans, and improved patient outcomes. It can assist clinicians in making informed decisions.  
**Difficulty:** PDD classification is challenging due to overlapping symptoms, heterogeneous patient populations, and the presence of multiple comorbid conditions. Moreover, medical records often contain class imbalances and missing data.  
**State of the art:** Traditional methods rely on clinical expertise and diagnostic criteria like DSM-5. Machine learning approaches such as support vector machines and decision trees have been applied, but their performance is limited by the complexity of PDD.

## Paper explanation:

**Proposal:** The paper "Application of deep and machine learning techniques for multi-label classification performance on psychotic disorder diseases" by Israel Elujide et al. proposes using deep learning (multilayer perceptron) and machine learning techniques (random forest, SVM, decision tree) for multi-label PDD classification. They also employ SMOTE oversampling to handle class imbalance.  
**Innovations:**
- Application of deep learning to PDD classification, which can capture complex non-linear relationships in the data.
- Use of SMOTE oversampling to address class imbalance, which is a common issue in medical datasets.
- Analysis of feature importances and correlations to gain insights into PDD risk factors.

**Effectiveness:** The proposed deep learning model achieved an accuracy of 75.17% on an imbalanced test set, outperforming machine learning techniques. The random forest model obtained 64.1% accuracy on a balanced dataset after SMOTE oversampling.  
**Contributions:**
- Demonstrates the potential of deep learning for improving PDD classification, which could aid in clinical decision support.
- Highlights the importance of addressing class imbalance in medical data using techniques like SMOTE.
- Identifies key features (e.g., age) and correlations (e.g., bipolar disorder and insomnia) that contribute to PDD risk, which could inform future research and interventions.


# Scope of Reproducibility

## Hypotheses to Test:

1. **Hypothesis:** Deep learning with multilayer perceptron (MLP) will perform better than machine learning techniques like SVM, RF, and DT for multi-label psychotic disorder classification on an imbalanced dataset.
   - **Experiment:** Train and evaluate MLP, SVM, RF, and DT models on the imbalanced PDD dataset. Compare their classification accuracies.

2. **Hypothesis:** Machine learning with RF will outperform deep learning on a balanced dataset after using SMOTE oversampling to handle class imbalance.
   - **Experiment:** Apply SMOTE oversampling to the training data to balance the class distribution. Train and evaluate MLP and RF models on the balanced dataset. Compare their classification accuracies.

3. **Hypothesis:** Patient age will be the most important feature contributing to model performance.
   - **Experiment:** Train a random forest model on the PDD dataset and extract feature importances. Examine if age is ranked as the most important feature.

4. **Hypothesis:** There will be a strong correlation between bipolar disorder and insomnia diagnoses.
   - **Experiment:** Calculate the pairwise correlation between bipolar disorder and insomnia labels in the dataset. Check if the correlation coefficient indicates a strong positive relationship.
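
The correlation check for hypothesis 4 can be sketched as follows. For two binary labels, the Pearson coefficient is the same as the phi coefficient. The toy values are invented for illustration; only the column names `Bipolar` and `Insominia` come from the dataset.

```python
import pandas as pd

# Toy 0/1 labels, invented for illustration (not taken from the dataset)
labels = pd.DataFrame({
    "Bipolar":   [1, 1, 0, 0, 1, 0, 1, 0],
    "Insominia": [1, 1, 0, 0, 1, 0, 0, 1],
})

# Pearson correlation between two binary columns is the phi coefficient
phi = labels["Bipolar"].corr(labels["Insominia"])
print(f"phi = {phi:.2f}")
```

A value near +1 would indicate that the two diagnoses tend to occur together, a value near 0 that they are unrelated.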

## Expected Results to Validate:

- The deep learning model will achieve an accuracy of ~75% on the imbalanced test set.
- The RF model will achieve ~64% on the balanced set.

## Experimental Setup:

- Train the MLP model on the imbalanced dataset.
- Train the RF model on the balanced dataset (after SMOTE oversampling).

## Evaluation Metrics:

- Evaluate the accuracies of the MLP and RF models on their respective test sets.
- Compare the achieved accuracies with the reported values in the paper.



# Methodology


In [None]:
# Import the packages used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder,LabelEncoder
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow.keras as keras

# Data Description and Processing

## Data Source:

- **Source:** The dataset used in this study is obtained from the paper "Application of deep and machine learning techniques for multi-label classification performance on psychotic disorder diseases" by Israel Elujide et al.
- **Dataset:** Provided as a supplementary file with the paper (1-s2.0-S2352340917303487-mmc2.csv).
- **Origin:** Yaba Psychiatry Hospital, Yaba, Lagos State, Nigeria.
- **Time Span:** Five years (Jan. 2010–Dec. 2014).
- **Size:** 500 patient records.
- **Features:** 16 variables (11 independent and 5 dependent variables).
- **Class Labels:** 5 psychotic disorder diseases (PDD) - bipolar disorder, vascular dementia, attention-deficit/hyperactivity disorder (ADHD), insomnia, and schizophrenia.
- **Label Distribution:**
  - Bipolar disorder: 40.2% positive
  - Insomnia: 40.6% positive
  - Schizophrenia: 75% positive
  - ADHD: 43.6% positive
  - Vascular dementia: 69.2% positive
- **Train/Test Split:** 70% training set, 30% testing set (a single hold-out split rather than k-fold cross-validation).
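
As a rough sketch of how a label distribution like the one above can be computed: once the labels are encoded as 0/1, the mean of each column is its positive rate. The toy frame is invented; only the column names follow the dataset.

```python
import pandas as pd

# Toy 0/1 label frame; the values are invented for illustration
y_toy = pd.DataFrame({
    "Bipolar":       [1, 0, 0, 1, 0],
    "shizopherania": [1, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the fraction of positive cases
positive_rate = (y_toy.mean() * 100).round(1)
print(positive_rate)
```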

## Data Processing Steps:

1. **Handle Missing Values:**
   - Remove records with missing data.
2. **Encode and Scale Features:**
   - Map binary variables to 0/1, one-hot encode the categorical variables, and standardize the numeric variables.
3. **Data Splitting:**
   - Split the dataset into 70% training and 30% testing sets with a fixed random seed.
4. **Handle Class Imbalance:**
   - Apply SMOTE oversampling on the training set only, so the test set keeps its original class distribution.

## Illustration:

- **Print Dataset Information:**
  - Display the shape, head, and summary statistics of the dataset.
- **Visualize Class Distribution:**
  - Plot bar charts showing the class distribution before and after SMOTE oversampling.
- **Visualize Correlation Matrix:**
  - Display the correlation matrix between features and labels.






In [None]:
# Load the dataset from Google Drive
data = pd.read_csv('/content/drive/MyDrive/CSE6250/1-s2.0-S2352340917303487-mmc2.csv', encoding='latin1')

# Handle missing values first
data = data.dropna()

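# Separate features and labels (the five diagnosis columns are the targets)
label_cols = ['Insominia', 'shizopherania', 'vascula_demetia', 'MBD', 'Bipolar']
X = data.drop(label_cols, axis=1)
# Keep the first letter of each label ('N' or 'P') and map it to an integer 0/1
y = data[label_cols].apply(lambda x: x.str[0]).replace({'N': 0, 'P': 1}).astype(int)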

binaryColumns = ['sex', 'faNoily_status', 'loss_of_parent', 'divorse', 'Injury', 'Spiritual_consult']

for col in binaryColumns:
  X[col] = X[col].map({'Yes': 1, 'No': 0, 'F': 1, 'M': 0})

# One-hot encode categorical variables and normalize
cat_cols = ['religion', 'genetic', 'occupation', 'status', 'agecode']
onehot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
scaler = StandardScaler()
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', onehot, cat_cols),
        ('scaler', scaler, X.select_dtypes(include=['int64', 'float64']).columns)
    ])
X = preprocessor.fit_transform(X)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6250 * 5)

# Apply SMOTE on label combinations (label powerset) so that the resampled
# features and all five labels stay aligned row by row.
# Note: this replaces per-label SMOTE, which produces a different number of
# rows for each label and cannot be stored in a single label frame.
combo = y_train.astype(str).agg(''.join, axis=1)
combo_counts = combo.value_counts()

# SMOTE needs at least 2 samples per class, so singleton combinations are left out
usable = combo.isin(combo_counts[combo_counts >= 2].index).to_numpy()
if (~usable).any():
    print(f"Skipping {(~usable).sum()} training rows whose label combination occurs only once.")

k_neighbors = int(max(1, min(5, combo_counts[combo_counts >= 2].min() - 1)))
smote = SMOTE(random_state=42, k_neighbors=k_neighbors)
X_train_sm, combo_sm = smote.fit_resample(X_train[usable], combo[usable])

# Decode each combination string (e.g. '10110') back into five 0/1 label columns
y_train_sm = pd.DataFrame([list(map(int, c)) for c in combo_sm], columns=y_train.columns)

# No second scaling pass: the preprocessor above already standardized the numeric
# features, and re-scaling X_test alone would no longer match the training features

# Print dataset statistics
print("Dataset size:", data.shape)
print("Training set size:", X_train_sm.shape, y_train_sm.shape)
print("Test set size:", X_test.shape, y_test.shape)

# Visualize class distribution before and after SMOTE (training labels in both panels)
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
sns.countplot(x=y_train['Insominia'], ax=axs[0]).set_title("Before SMOTE")
sns.countplot(x=y_train_sm['Insominia'], ax=axs[1]).set_title("After SMOTE")
plt.tight_layout()
plt.show()

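# Visualize correlations among the numeric features and the 0/1 labels
# (string-valued columns are left out because corr() needs numeric data)
corr = pd.concat([data.select_dtypes(include='number'), y], axis=1).astype(float).corr()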
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

# Model Architecture and Code Structure

## Model Architectures:

### Deep Learning (MLP):
- **Input Layer:** 24 nodes (the feature columns after one-hot encoding and scaling).
- **Hidden Layers:** 3 layers with 20, 20, and 40 nodes respectively.
- **Output Layer:** 5 nodes (one for each PDD label).
- **Activation Functions:** ReLU for hidden layers, sigmoid for output layer.
- **Regularization:** Dropout before each dense layer (rates 0.1, 0.1, 0.5, and 0.5).

### Machine Learning:
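- **Support Vector Machine (SVM) with RBF kernel,** wrapped in a one-vs-rest classifier so that each label gets its own model.
- **Random Forest (RF),** with the number of trees (50, 100, or 200), maximum depth, and minimum split size chosen by grid search.
- **Decision Tree (DT) with Gini impurity criterion,** also wrapped in a one-vs-rest classifier.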

## Training Objectives:

- **Loss Function:** Binary cross-entropy for MLP, as it is a multi-label classification task.
- **Optimizer:** Adam with learning rate 0.01.
- **Epochs:** 40 with early stopping based on validation loss.
- **Batch Size:** 50.
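
To make the loss concrete, here is a small NumPy sketch of binary cross-entropy averaged over all samples and labels. It mirrors the idea behind the loss named above and is not the Keras implementation itself; the function name `binary_cross_entropy` and the toy probabilities are illustrative assumptions.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all samples and labels."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return losses.mean()

# One toy patient with five labels; the probabilities are invented
y_true = np.array([[1, 0, 1, 1, 0]], dtype=float)
y_prob = np.array([[0.9, 0.2, 0.6, 0.8, 0.1]])
print(binary_cross_entropy(y_true, y_prob))
```

Each label contributes its own term, which is why this loss suits a multi-label task where several diagnoses can be positive at once.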

## Other Details:

- The models are trained from scratch without any pretraining.
- 5-fold cross-validation is used for hyperparameter tuning of the RF model.

## Code Structure:

- Define a `create_mlp` function that builds and compiles the Keras model with the specified architecture.
- Implement functions for training, validation, and testing the MLP model.
- Use scikit-learn library for SVM, RF, and DT models.
- Develop functions to train and evaluate these models with the same data splits as MLP.
- Save the best performing models for each architecture.
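
One possible shape for a shared evaluation function is sketched below, so that every model is scored the same way. The helper name `evaluate_multilabel` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_multilabel(y_true, y_pred):
    """Return subset accuracy and micro-averaged precision, recall, and F1."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="micro"),
        "recall": recall_score(y_true, y_pred, average="micro"),
        "f1": f1_score(y_true, y_pred, average="micro"),
    }

# Toy targets and predictions for 2 samples x 3 labels (invented)
y_true_toy = np.array([[1, 0, 1], [0, 1, 1]])
y_pred_toy = np.array([[1, 0, 0], [0, 1, 1]])
metrics = evaluate_multilabel(y_true_toy, y_pred_toy)
print(metrics)
```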


In [None]:
# Convert the data to float32
X_train_sm = np.array(X_train_sm).astype('float32')

# Convert to numpy array and float32 after encoding
y_train_sm = np.array(y_train_sm).astype('float32')


In [None]:
# Replace any NaN values with 0 (np.nan_to_num does not report how many it found)
X_train_sm = np.nan_to_num(X_train_sm)
y_train_sm = np.nan_to_num(y_train_sm)

In [None]:
def create_mlp(input_shape):
    # Dropout precedes each dense layer; the five sigmoid outputs give one probability per label
    model = Sequential([
        keras.Input(shape=(input_shape,)),  # use the actual feature count instead of a hard-coded 24
        Dropout(0.1),
        Dense(20, activation='relu'),
        Dropout(0.1),
        Dense(20, activation='relu'),
        Dropout(0.5),
        Dense(40, activation='relu'),
        Dropout(0.5),
        Dense(5, activation='sigmoid')
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss='binary_crossentropy', metrics=['accuracy'])
    return model

# Train the MLP model with early stopping
# Note: the test set doubles as the validation set here, so early stopping sees test data
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
mlp_model = create_mlp(X_train.shape[1])
history = mlp_model.fit(X_train, y_train, epochs=40, batch_size=50, validation_data=(X_test, y_test), callbacks=[early_stop], verbose=1)

# Plot training and validation loss
plt.figure(figsize=(8, 5))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.tight_layout()
plt.savefig('mlp_loss_plot.png')
plt.show()

# Plot training and validation accuracy
plt.figure(figsize=(8, 5))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.tight_layout()
plt.savefig('mlp_accuracy_plot.png')  # separate file so the loss plot is not overwritten
plt.show()

# Hyperparameter tuning for RF
rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42), rf_params, cv=5)
rf_grid.fit(X_train, y_train)
rf_model = rf_grid.best_estimator_

# Print the best hyperparameters for Random Forest
print("Best hyperparameters for Random Forest:")
print(rf_grid.best_params_)

# SVC handles one target at a time, so wrap it in one-vs-rest to fit one classifier per label
svm_model = OneVsRestClassifier(SVC(random_state=42))
svm_model.fit(X_train, y_train)

# Wrap the decision tree the same way so each label gets its own tree
dt_model = OneVsRestClassifier(DecisionTreeClassifier(random_state=42))
dt_model.fit(X_train, y_train)

# After training the models
mlp_preds = mlp_model.predict(X_test)
rf_preds = rf_model.predict(X_test)
svm_preds = svm_model.predict(X_test)
dt_preds = dt_model.predict(X_test)

# Apply threshold to convert continuous predictions to binary
threshold = 0.5
mlp_preds_binary = (mlp_preds > threshold).astype(int)

# Results

In [None]:
print("y_test shape:", y_test.shape)
print("mlp_preds_binary shape:", mlp_preds_binary.shape)
print("\nFirst few rows of y_test:")
print(y_test.head())
print(mlp_preds_binary[:5,:])

In [None]:
y_test_binary = y_test




In [None]:
# Evaluate the models using multi-label classification metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model_preds = {"MLP": mlp_preds_binary, "RF": rf_preds, "SVM": svm_preds, "DT": dt_preds}

for name, preds in model_preds.items():
    print(f"{name} Metrics:")
    # On multi-label targets, accuracy_score is the exact-match (subset) accuracy
    print("Accuracy:", accuracy_score(y_test_binary, preds))
    print("Precision:", precision_score(y_test_binary, preds, average='micro'))
    print("Recall:", recall_score(y_test_binary, preds, average='micro'))
    print("F1-score:", f1_score(y_test_binary, preds, average='micro'))
    print()

## Model comparison

# Model Evaluation Metrics and Analysis

## MLP Metrics:

- **Accuracy:** 0.3133
- **Precision:** 0.7617
- **Recall:** 0.7990
- **F1-score:** 0.7799

The MLP model has an accuracy of 0.31. Because `accuracy_score` on multi-label targets counts a prediction as correct only when all five labels match, this means the model gets every label right for 31% of the test patients. The micro-averaged precision of 0.76 means that 76% of the labels predicted positive are actually positive, and the recall of 0.80 means that the model finds 80% of the actual positive labels. The F1-score of 0.78 shows a good balance between the two. The gap between the low exact-match accuracy and the much higher per-label scores suggests that we need to rethink how accuracy is calculated before comparing against the paper.

## RF Metrics:

- **Accuracy:** 0.36
- **Precision:** 0.7937
- **Recall:** 0.8015
- **F1-score:** 0.7976

The Random Forest (RF) model has a slightly higher accuracy than the MLP model, at 0.36. Its precision of 0.79 means that 79% of the labels predicted positive are actually positive, and its recall of 0.80 means that it finds 80% of the actual positive labels. The F1-score of 0.80 shows that the random forest is also well balanced, but it suffers from the same low accuracy as the MLP.

## SVM Metrics:

- **Accuracy:** 0.32
- **Precision:** 0.7446
- **Recall:** 0.7574
- **F1-score:** 0.7509

The Support Vector Machine (SVM) model has an accuracy of 0.32, slightly higher than the MLP model but lower than the RF model. Its precision of 0.74 means that 74% of the labels predicted positive are actually positive, and its recall of 0.76 means that it finds 76% of the actual positive labels. The F1-score of 0.75 shows that the SVM is well balanced, but less so than the previous models.

## DT Metrics:

- **Accuracy:** 0.2533
- **Precision:** 0.7330
- **Recall:** 0.7132
- **F1-score:** 0.7230

The Decision Tree (DT) model has the lowest accuracy among all models, at 0.25. Its precision of 0.73 means that 73% of the labels predicted positive are actually positive, and its recall of 0.71 means that it finds 71% of the actual positive labels. The F1-score of 0.72 is also the lowest of the four models, making the DT the weakest performer on every metric.

## Overall Analysis:

Based on the evaluation metrics, the Random Forest (RF) model performs the best among all models, with the highest accuracy, precision, recall, and F1-score. The MLP and SVM models have comparable performance, while the Decision Tree (DT) model has the lowest performance.

However, all models exhibit relatively low accuracies, indicating the challenging nature of the multi-label classification task for psychotic disorder diseases. There may be opportunities for improvement through model architecture adjustments, hyperparameter tuning, and feature engineering. Techniques such as addressing class imbalance and exploring additional features could also enhance model performance. In addition, reporting a per-label accuracy alongside the exact-match accuracy used here could explain the low accuracy scores.
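
The difference between the two ways of measuring accuracy can be sketched on toy data. In scikit-learn, `accuracy_score` on multi-label indicator arrays is the subset (exact-match) accuracy, while `1 - hamming_loss` is the fraction of individual labels predicted correctly. The arrays below are invented, and whether the paper reports one or the other is an open question here, not something this sketch settles.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Toy targets and predictions for 4 patients x 5 labels (invented)
y_true_toy = np.array([[1, 0, 1, 1, 0],
                       [0, 1, 1, 0, 1],
                       [1, 1, 0, 1, 1],
                       [0, 0, 1, 1, 0]])
y_pred_toy = np.array([[1, 0, 1, 0, 0],
                       [0, 1, 1, 0, 1],
                       [1, 0, 0, 1, 1],
                       [0, 0, 1, 1, 1]])

subset_acc = accuracy_score(y_true_toy, y_pred_toy)       # a row counts only if all 5 labels match
per_label_acc = 1 - hamming_loss(y_true_toy, y_pred_toy)  # fraction of individual labels correct
print(subset_acc, per_label_acc)
```

Three of the four toy rows have a single wrong label, so the subset accuracy is low even though most individual labels are correct.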

Further analysis and experimentation are recommended to improve the models' accuracy and effectiveness in predicting psychotic disorder diseases.


# Discussion

The results obtained from the multi-label classification of psychotic disorder diseases using deep learning and machine learning models provide valuable insights into the performance and challenges associated with this task.

## Model Performance Analysis:

- **MLP (Multilayer Perceptron) Model:**

  The MLP model demonstrates relatively high precision and recall but low accuracy, which means it may be getting at least one of the five labels wrong in most rows.

- **Random Forest (RF) Model:**

  The RF model outperforms the MLP model in terms of precision and recall, but the margin is very slight.

- **Support Vector Machine (SVM) Model:**

  The SVM model shows comparable performance to the MLP model with slightly higher accuracy but lower precision and recall.

- **Decision Tree (DT) Model:**

  The DT model exhibits the lowest performance among all models.

## Strategies for Improvement:

1. **Feature Engineering:**
   Explore additional relevant features for better discrimination.
   
2. **Hyperparameter Tuning:**
   Conduct extensive hyperparameter tuning for optimal model performance.
   
3. **Model Ensemble:**
   Combine predictions from multiple models using ensemble techniques.
   
4. **Class Imbalance Handling:**
   Address class imbalance using techniques like class weighting or oversampling.
   
5. **Advanced Architectures:**
   Investigate advanced deep learning architectures like CNNs or RNNs.
   
6. **Data Augmentation:**
   Apply data augmentation techniques to increase diversity in the training data.
   
7. **Interpretability Analysis:**
   Understand model features and patterns for further insights.

8. **Change Classification Target:**
   Restructure the prediction targets to better understand the relationship between the target variables.

9. **Use Balanced Dataset:**
   Train on a SMOTE-balanced dataset in tandem with the imbalanced dataset to test the paper's overall conclusion.
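
As one illustration of the class weighting mentioned in strategy 4, scikit-learn's `compute_class_weight` with the `"balanced"` option assigns each class the weight `n_samples / (n_classes * class_count)`. The label vector is a toy example, not data from this project.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy single-label vector with a 3:1 imbalance (invented)
y_label = np.array([0, 0, 0, 1])

# "balanced" weights are inversely proportional to class frequency
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_label)
print(dict(zip([0, 1], weights)))
```

The minority class gets the larger weight, so errors on it cost more during training.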

## Conclusion:

The challenges associated with multi-label classification of psychotic disorder diseases are evident from the models' performance. However, with optimization strategies and iterative refinement, it is possible to improve model accuracy and reliability for this complex task. To extend the experiment, we also need to test different classification setups to check the robustness of our models: aggregating all five classification targets into a single classification target, and using four of the targets as inputs to predict the fifth.
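
Aggregating the five targets into a single target could look like the following sketch, often called a label powerset encoding: each distinct combination of the five labels becomes one class. The toy rows are invented; the column names follow the dataset.

```python
import pandas as pd

# Toy 0/1 labels for three patients (invented values)
y_toy = pd.DataFrame({
    "Insominia":       [1, 0, 1],
    "shizopherania":   [1, 1, 1],
    "vascula_demetia": [0, 1, 0],
    "MBD":             [0, 0, 0],
    "Bipolar":         [1, 0, 1],
})

# Join each row's labels into one string, e.g. '11001'
combo = y_toy.astype(str).agg("".join, axis=1)

# Turn each distinct combination into an integer class id
codes, uniques = pd.factorize(combo)
print(list(combo), list(codes))
```

A single multi-class model can then be trained on `codes`, at the cost of having up to 32 classes, some of which may be very rare.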


# References

1.  Elujide, Israel, et al. "Application of deep and machine learning techniques for multi-label classification performance on psychotic disorder diseases." Informatics in Medicine Unlocked, Volume 23, 2021. https://www.sciencedirect.com/science/article/pii/S2352914821000356

