# PSP analysis 

In this note, we analyze a Progressive Supranuclear Palsy (PSP) dataset comprising RNA-Seq data from 25 PSP patients and 16 controls, with a total of 102 transcripts.

- Our analysis begins with a stratified 7-fold cross-validation using XGBoost, where we compute and plot the average AUC, specificity, sensitivity, and accuracy with standard-deviation error bars.

- We then compare XGBoost with other models such as Random Forest, CatBoost, and SVM. Our comparison reveals clear overfitting issues in Random Forest, CatBoost, and SVM.

- To address this, we perform a 5-fold cross-validation and explore hyperparameters for these models, but the overfitting issue persists.

- To mitigate this, we rank genes by feature importance for each of the models mentioned above, take the top 30 genes per model, and keep only the genes common to all of these lists, which leaves 5 genes.

- We then apply Logistic Regression and CatBoost to these common genes and observe that, while both models show high sensitivity, their specificity is lower.

- Finally, we investigate Lasso Regression and Ridge Regression as additional methods. We find that Lasso Regression effectively mitigates overfitting, whereas Ridge Regression continues to suffer from overfitting issues.
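
The gene-selection step in the list above (top-ranked genes per model, then the intersection) can be sketched as follows. This is a minimal illustration, not the code used in this analysis: the helper names `top_k_features` and `common_top_features`, the gene names, and the importance scores are all hypothetical, and ranking by absolute value is an assumption made so that signed coefficients (for example from a linear SVM) can be handled alongside tree-based importances.

```python
from typing import Dict, List


def top_k_features(importances: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the largest absolute importance."""
    ranked = sorted(importances, key=lambda name: abs(importances[name]), reverse=True)
    return ranked[:k]


def common_top_features(per_model: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Return the features found in the top-k list of every model, sorted by name."""
    top_sets = [set(top_k_features(scores, k)) for scores in per_model.values()]
    return sorted(set.intersection(*top_sets))


# Hypothetical importance scores for four made-up genes and three models.
toy_importances = {
    "xgboost": {"GENE_A": 0.40, "GENE_B": 0.30, "GENE_C": 0.20, "GENE_D": 0.10},
    "random_forest": {"GENE_A": 0.30, "GENE_B": 0.05, "GENE_C": 0.25, "GENE_D": 0.40},
    "svm": {"GENE_A": -0.90, "GENE_B": 0.10, "GENE_C": 0.80, "GENE_D": 0.20},
}

common = common_top_features(toy_importances, k=3)
print(common)  # ['GENE_A', 'GENE_C']
```

In the analysis described here, `k` would be 30 and each inner dictionary would hold one fitted model's importance per gene.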

In [1]:
import shap
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as pl
import seaborn as sns


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from catboost import CatBoostClassifier, Pool, metrics, cv
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, accuracy_score, auc
from sklearn.metrics import roc_auc_score, confusion_matrix, precision_recall_curve
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier

from sklearn.linear_model import Lasso, Ridge


from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import StratifiedKFold

Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)


In [2]:
path1=Path("/Users/zainabnazari/Desktop/psp")

In [6]:
s_tumor_data=pd.read_csv(path1/"Normalized_mRNA_matrix_GSE198048_102_genes.txt",delimiter='\t')
s_tumor_data.shape

(190, 100)

In [7]:
# Separate features (X) and target variable (y)
X1 = s_tumor_data.drop(['ID', 'Class'], axis=1)

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the 'Class' column into integer labels
label = label_encoder.fit_transform(s_tumor_data['Class'])

# Overwrite 'Class' with the encoded labels (PSP vs. control)
s_tumor_data.loc[:, 'Class'] = label

y = s_tumor_data['Class']
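
Because sensitivity and specificity depend on which class is encoded as 1, it can help to inspect the mapping that `LabelEncoder` produces. The sketch below uses hypothetical class names (`"PSP"` and `"Control"`); the actual strings in the `Class` column are not shown in this note, so treat them as an assumption. `LabelEncoder` assigns integer codes in sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class names; the real values in the 'Class' column may differ.
classes = ["PSP", "Control", "PSP", "Control", "PSP"]

encoder = LabelEncoder()
codes = encoder.fit_transform(classes)

# encoder.classes_ holds the sorted class names; the position is the integer code.
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)      # {'Control': 0, 'PSP': 1}
print(list(codes))  # [1, 0, 1, 0, 1]
```

With this mapping, `recall_score` (whose default `pos_label` is 1) would report sensitivity for the PSP class, and `tn / (tn + fp)` would report specificity for the controls.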

# Cross-Validation with XGBoost model
Cross-validation with 7 folds

In [12]:
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X1)

# Define hyperparameters and seed
hyperparameters = {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 15}
seed = 42

# Initialize the XGBoost model (it is trained inside the cross-validation loop)
xgb_model = XGBClassifier(**hyperparameters, random_state=seed)

# Initialize lists to store evaluation metrics
specificities = []
accuracies = []
sensitivities = []
auc_scores = []

# Stratified K-Fold cross-validation
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=seed)

# Iterate through each fold
for i, (train_index, test_index) in enumerate(cv.split(X_scaled, y), start=1):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Train the model
    xgb_model.fit(X_train, y_train)

    # Make predictions on the testing set
    y_pred_proba = xgb_model.predict_proba(X_test)[:, 1]
    y_pred = xgb_model.predict(X_test)


    # Calculate confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    # Calculate AUC-ROC, specificity, accuracy, and sensitivity
    auc_score = roc_auc_score(y_test, y_pred_proba)
    specificity = tn / (tn + fp)
    accuracy = accuracy_score(y_test, y_pred)
    sensitivity = recall_score(y_test, y_pred)

    # Append metrics to lists
    specificities.append(specificity)
    accuracies.append(accuracy)
    sensitivities.append(sensitivity)
    auc_scores.append(auc_score)

    # Print metrics for each fold
    print(f'Fold {i} - AUC: {auc_score}, Specificity: {specificity:.4f}, Accuracy: {accuracy:.4f}, Sensitivity: {sensitivity:.4f}')

# Print average and standard deviation of metrics
average_auc_score = np.mean(auc_scores)  # mean over all folds, not only the last one
average_specificity = np.mean(specificities)
average_accuracy = np.mean(accuracies)
average_sensitivity = np.mean(sensitivities)
std_specificity = np.std(specificities)
std_accuracy = np.std(accuracies)
std_sensitivity = np.std(sensitivities)
std_auc_score = np.std(auc_scores)

print(f'\nAverage AUC: {average_auc_score:.4f} (±{std_auc_score:.4f})')
print(f'\nAverage Specificity: {average_specificity:.4f} (±{std_specificity:.4f})')
print(f'Average Accuracy: {average_accuracy:.4f} (±{std_accuracy:.4f})')
print(f'Average Sensitivity: {average_sensitivity:.4f} (±{std_sensitivity:.4f})')

Fold 1 - AUC: 0.5847953216374269, Specificity: 0.1111, Accuracy: 0.5714, Sensitivity: 0.7895
Fold 2 - AUC: 0.5986842105263158, Specificity: 0.3750, Accuracy: 0.7407, Sensitivity: 0.8947
Fold 3 - AUC: 0.6049382716049383, Specificity: 0.2222, Accuracy: 0.7037, Sensitivity: 0.9444
Fold 4 - AUC: 0.8209876543209876, Specificity: 0.3333, Accuracy: 0.7407, Sensitivity: 0.9444
Fold 5 - AUC: 0.5987654320987654, Specificity: 0.4444, Accuracy: 0.6296, Sensitivity: 0.7222
Fold 6 - AUC: 0.6790123456790123, Specificity: 0.4444, Accuracy: 0.7037, Sensitivity: 0.8333
Fold 7 - AUC: 0.4938271604938272, Specificity: 0.3333, Accuracy: 0.5926, Sensitivity: 0.7222

Average AUC: 0.6259 (±0.0941)

Average Specificity: 0.3234 (±0.1118)
Average Accuracy: 0.6689 (±0.0650)
Average Sensitivity: 0.8358 (±0.0887)
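
As a quick cross-check, the AUC summary can be recomputed from the per-fold values printed above. The sketch below hard-codes those seven AUC values (rounded to four decimals) and uses `np.std` with its default population formula, the same call as in the cell above. The variable names are illustrative only.

```python
import numpy as np

# Per-fold AUC values copied from the output above, rounded to four decimals.
fold_aucs = np.array([0.5848, 0.5987, 0.6049, 0.8210, 0.5988, 0.6790, 0.4938])

mean_auc = fold_aucs.mean()  # average over all seven folds
std_auc = fold_aucs.std()    # population standard deviation (ddof=0), as np.std uses by default

print(f"Mean AUC: {mean_auc:.4f} (±{std_auc:.4f})")  # Mean AUC: 0.6259 (±0.0941)
```

The mean has to be taken over the full list of fold scores. Averaging a single fold's score simply returns that score, which for the last fold here would be 0.4938.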
