wangk-oj/BioSleep-AI

Bioinformatics Analysis Pipeline for Sleep Deprivation Biomarker Discovery

This repository contains a set of R scripts that implement a complete bioinformatics analysis pipeline. The goal is to identify, validate, and interpret candidate gene biomarkers for sleep deprivation (SD) using public gene expression datasets from the GEO database.

The analysis is divided into three main stages, corresponding to the three R scripts.

Workflow & File Descriptions

  1. DEG.R - Data Preprocessing and Differential Expression Analysis

- Data Acquisition: Downloads and parses the specified GEO datasets (e.g., GSE98582, GSE37667, GSE56931, GSE208668).

- Preprocessing: Performs quality control, probe-to-gene-symbol conversion, outlier removal, and quantile normalization.

- Batch Effect Correction: Merges the datasets and removes technical batch effects with the ComBat algorithm from the sva package.

- Differential Analysis: Uses the limma package to identify differentially expressed genes (DEGs) between the Sleep Deprivation (SD) and Normal control groups.

- Output: Generates key visualizations such as PCA plots (before and after batch correction), volcano plots, and heatmaps. The final output is combined-data.csv, which contains the expression matrix of the DEGs and serves as input to the machine learning stage.
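The batch-correction and differential-expression steps above can be sketched as follows. This is a minimal illustration on simulated data, not the contents of DEG.R: the matrix dimensions, batch labels, effect sizes, significance thresholds, and all variable names are assumptions made for the example. Only the packages (sva and limma) come from this README.

```r
# Sketch only: simulated data stands in for the merged GEO expression matrix.
library(sva)    # provides ComBat()
library(limma)  # provides lmFit(), eBayes(), topTable()

set.seed(1)
n_genes   <- 200
n_samples <- 12

# Genes in rows, samples in columns (the layout ComBat and limma expect).
expr <- matrix(
  rnorm(n_genes * n_samples, mean = 8),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("s", seq_len(n_samples)))
)

# Hypothetical design: two batches (datasets), groups balanced within each.
batch <- rep(c("A", "B"), each = 6)
group <- factor(rep(c("Normal", "SD"), times = 6), levels = c("Normal", "SD"))

expr[, batch == "B"]     <- expr[, batch == "B"] + 1.5      # simulated batch shift
expr[1:10, group == "SD"] <- expr[1:10, group == "SD"] + 2  # simulated SD signal

# Remove the batch effect while protecting the group effect via `mod`.
design    <- model.matrix(~ group)
corrected <- ComBat(dat = expr, batch = batch, mod = design)

# Fit a linear model per gene and moderate the statistics.
fit <- eBayes(lmFit(corrected, design))
deg <- topTable(fit, coef = "groupSD", number = Inf, adjust.method = "BH")

# Assumed cutoffs; the thresholds used in DEG.R are not stated in this README.
sig <- deg[deg$adj.P.Val < 0.05 & abs(deg$logFC) > 1, ]
head(sig)
```

Passing the group design to ComBat through `mod` asks it to preserve the biological contrast while adjusting for batch; whether DEG.R does this is not documented here.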

  2. machine learning.R - Machine Learning Model Training and Interpretation

- Model Training: Takes combined-data.csv as input and trains multiple machine learning models (SVM, ElasticNet, Random Forest, Naive Bayes, XGBoost, MLP) to classify samples as SD or Normal.

- Hyperparameter Tuning: Uses 5-fold cross-validation within the caret framework to select the best hyperparameters for each model.

- Model Interpretation: Applies the SHAP (SHapley Additive exPlanations) method via the fastshap package to explain model predictions and rank genes by feature importance.

- Output: Saves model performance metrics and generates feature importance plots for each model, providing a ranked list of potential biomarkers.
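The tuning and interpretation steps above can be sketched for one of the listed models (ElasticNet, fitted through caret's `glmnet` method, which requires the glmnet package). The data are simulated, and the tuning grid, the number of SHAP simulations, the `pred_sd` wrapper, and every variable name are assumptions for illustration. The layout of combined-data.csv is not described in this README, so the example does not read it.

```r
# Sketch only: simulated features stand in for the DEG expression matrix.
library(caret)
library(fastshap)

set.seed(123)
n <- 60  # samples
p <- 8   # genes (features)

x <- as.data.frame(matrix(rnorm(n * p), nrow = n,
                          dimnames = list(NULL, paste0("gene", seq_len(p)))))
y <- factor(rep(c("Normal", "SD"), each = n / 2), levels = c("Normal", "SD"))
x[y == "SD", "gene1"] <- x[y == "SD", "gene1"] + 2  # one informative gene

# 5-fold cross-validation over a small, assumed ElasticNet grid.
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
grid <- expand.grid(alpha = c(0, 0.5, 1), lambda = c(0.01, 0.1))
fit  <- train(x = x, y = y, method = "glmnet",
              trControl = ctrl, tuneGrid = grid)
print(fit$bestTune)

# fastshap needs a function that returns one numeric prediction per row;
# here that is the predicted probability of the SD class.
pred_sd <- function(object, newdata) {
  predict(object, newdata = newdata, type = "prob")[, "SD"]
}

# Monte Carlo SHAP estimates; nsim = 20 is an arbitrary choice for speed.
shap <- fastshap::explain(fit, X = x, pred_wrapper = pred_sd, nsim = 20)

# Rank genes by mean absolute SHAP value.
importance <- sort(colMeans(abs(as.matrix(shap))), decreasing = TRUE)
print(importance)
```

The same `trainControl` object could be reused with other caret methods to compare models on identical folds, provided the resampling seed is fixed before each call.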

  3. verify.R - Biomarker Validation and Immune Infiltration Analysis

- Biomarker Validation: Evaluates the diagnostic potential of the selected hub genes in independent validation datasets using single-gene Receiver Operating Characteristic (ROC) curve analysis.

- Immune Infiltration: Uses the CIBERSORT algorithm to estimate the relative proportions of 22 immune cell types in each sample.

- Correlation Analysis: Places the hub genes in biological context by correlating their expression levels with the estimated abundance of each immune cell type.

- Output: Generates ROC curve plots and visualizations of immune cell composition and of the correlation analysis (e.g., boxplots, correlation heatmaps).
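The single-gene ROC and correlation steps above can be illustrated in base R. This sketch does not reproduce verify.R: the data are simulated, `auc_rank` is a hypothetical helper that computes the area under the ROC curve from ranks (the Mann-Whitney formulation), the use of Spearman correlation is an assumption, and the three cell-type columns merely stand in for a 22-column CIBERSORT result.

```r
# Sketch only: simulated validation data, base R, no extra packages.
set.seed(42)
n     <- 40
group <- factor(rep(c("Normal", "SD"), each = n / 2), levels = c("Normal", "SD"))

# Hypothetical hub gene, expressed somewhat higher in SD samples.
hub_gene <- rnorm(n, mean = ifelse(group == "SD", 9, 8))

# AUC as the probability that a random positive sample scores higher than a
# random negative one; ties receive half credit through average ranks.
auc_rank <- function(score, label, positive) {
  r     <- rank(score)
  n_pos <- sum(label == positive)
  n_neg <- sum(label != positive)
  (sum(r[label == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_hub <- auc_rank(hub_gene, group, positive = "SD")
print(auc_hub)

# Simulated immune-cell fractions; each row is rescaled to sum to 1,
# as relative proportions do.
fractions <- matrix(runif(n * 3), nrow = n,
                    dimnames = list(NULL, c("Neutrophils", "Monocytes", "T cells CD8")))
fractions <- fractions / rowSums(fractions)

# Spearman correlation between the gene and each cell type.
rho <- cor(fractions, hub_gene, method = "spearman")[, 1]
print(rho)

# P-value for one gene/cell-type pair.
print(cor.test(hub_gene, fractions[, "Neutrophils"], method = "spearman")$p.value)
```

An AUC near 0.5 means the gene does not separate the groups, while values approaching 1 indicate strong single-gene discrimination.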

How to Run

The scripts should be executed in the following order:

  1. DEG.R

  2. machine learning.R

  3. verify.R
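One way to enforce this order from an R session is a small wrapper like the one below. `run_pipeline` is a hypothetical helper, not part of the repository, and it assumes all three scripts sit in the same directory and can be run with `source()` from the current working directory.

```r
# Hypothetical helper: runs the three scripts in the documented order and
# stops early if any of them is missing.
run_pipeline <- function(script_dir = ".",
                         scripts = c("DEG.R", "machine learning.R", "verify.R")) {
  paths   <- file.path(script_dir, scripts)
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("Missing script(s): ", paste(missing, collapse = ", "))
  }
  for (p in paths) {
    message("Running ", p)
    source(p)  # an error in one script aborts the remaining stages
  }
  invisible(paths)
}

# Usage, from the repository root:
# run_pipeline(".")
```

Because machine learning.R reads the combined-data.csv file that DEG.R writes, stopping at the first failure avoids training on a stale or missing file.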
