GitHub - wangk-oj/BioSleep-AI

Bioinformatics Analysis Pipeline for Sleep Deprivation Biomarker Discovery

This repository contains a set of R scripts that implement a complete bioinformatics analysis pipeline. The goal is to identify, validate, and interpret potential gene biomarkers for sleep deprivation (SD) using public gene expression datasets from the GEO database.

The analysis is divided into three main stages, corresponding to the three R scripts.

Workflow & File Descriptions

DEG.R - Data Preprocessing and Differential Expression Analysis

- Data Acquisition: Downloads and parses specified GEO datasets (e.g., GSE98582, GSE37667, GSE56931, GSE208668).

- Preprocessing: Performs quality control, probe-to-gene symbol conversion, outlier removal, and quantile normalization.

- Batch Effect Correction: Uses the ComBat algorithm from the sva package to merge multiple datasets and remove technical batch effects.

- Differential Analysis: Employs the limma package to identify differentially expressed genes (DEGs) between Sleep Deprivation (SD) and Normal control groups.

- Output: Generates key visualizations like PCA plots (before and after batch correction), volcano plots, and heatmaps. The final output is combined-data.csv, which contains the expression matrix of DEGs for machine learning.
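The batch-correction and differential-analysis steps above can be sketched roughly as follows. This is a minimal illustration on simulated data, not the contents of DEG.R: the matrix `expr`, the `batch` and `group` vectors, the batch labels, and the cutoffs (adjusted P < 0.05, |log2FC| > 0.5) are all assumptions made for the example.

```r
library(sva)    # provides ComBat()
library(limma)  # provides lmFit(), eBayes(), topTable()

set.seed(42)
n_genes <- 500
n_samples <- 12

# Simulated log-scale expression matrix: genes in rows, samples in columns.
expr <- matrix(rnorm(n_genes * n_samples, mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)),
                               paste0("s", seq_len(n_samples))))

# Hypothetical annotations: two source datasets, each with both conditions.
batch <- rep(c("GSE_A", "GSE_B"), each = 6)
group <- factor(rep(c("Normal", "SD"), times = 6), levels = c("Normal", "SD"))

# Inject a dataset-wide shift (batch effect) and a true SD effect in 20 genes.
expr[, batch == "GSE_B"] <- expr[, batch == "GSE_B"] + 1.5
expr[1:20, group == "SD"] <- expr[1:20, group == "SD"] + 3

# ComBat removes the batch shift while protecting the SD-vs-Normal contrast,
# which is passed in through the model matrix.
design <- model.matrix(~ group)
corrected <- ComBat(dat = expr, batch = batch, mod = design)

# limma: per-gene linear model with empirical Bayes moderated statistics.
fit <- eBayes(lmFit(corrected, design))
res <- topTable(fit, coef = "groupSD", number = Inf, adjust.method = "BH")

# Assumed DEG cutoffs; the thresholds used in DEG.R are not stated here.
deg <- res[res$adj.P.Val < 0.05 & abs(res$logFC) > 0.5, ]
```

Passing the condition design to ComBat via `mod` is one common way to avoid removing the biological signal along with the batch effect; whether DEG.R does this is not stated in this README.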

machine learning.R - Machine Learning Model Training and Interpretation

- Model Training: Takes combined-data.csv as input and trains multiple machine learning models (SVM, ElasticNet, Random Forest, Naive Bayes, XGBoost, MLP) to classify samples.

- Hyperparameter Tuning: Uses 5-fold cross-validation within the caret framework to find the optimal parameters for each model.

- Model Interpretation: Applies the SHAP (SHapley Additive exPlanations) method using the fastshap package to explain model predictions and rank genes by their feature importance.

- Output: Saves model performance metrics and generates feature importance plots for each model, providing a ranked list of potential biomarkers.
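The tuning and interpretation steps above can be illustrated with one of the listed model types. The sketch below trains an ElasticNet classifier (caret method `glmnet`) with 5-fold cross-validation and ranks features by mean absolute SHAP value. The simulated data, the gene names, the `tuneLength`, the `nsim` value, and the helper `p_sd` are assumptions for the example and are not taken from machine learning.R.

```r
library(caret)     # trainControl(), train()
library(fastshap)  # explain()

set.seed(1)
n <- 60

# Simulated stand-in for the DEG expression table: samples in rows, genes in columns.
X <- as.data.frame(matrix(rnorm(n * 5), ncol = 5,
                          dimnames = list(NULL, paste0("gene", 1:5))))

# Class labels driven mostly by gene1, so it should rank highly.
y <- factor(ifelse(X$gene1 + rnorm(n, sd = 0.5) > 0, "SD", "Normal"),
            levels = c("Normal", "SD"))

# 5-fold cross-validation; classProbs = TRUE so class probabilities are available.
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
fit <- train(x = X, y = y, method = "glmnet",
             trControl = ctrl, tuneLength = 3)

# fastshap needs a wrapper that returns one numeric prediction per row;
# here it is the predicted probability of the SD class.
p_sd <- function(object, newdata) {
  predict(object, newdata = newdata, type = "prob")[, "SD"]
}

# Monte Carlo approximation of SHAP values (nsim repetitions per feature).
shap <- fastshap::explain(fit, X = X, pred_wrapper = p_sd, nsim = 20)

# Global importance: mean absolute SHAP value per gene, largest first.
importance <- sort(colMeans(abs(as.matrix(shap))), decreasing = TRUE)
print(importance)
```

The same `trainControl` object and prediction wrapper could in principle be reused for the other model types by changing `method`, provided the corresponding backend package is installed.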

verify.R - Biomarker Validation and Immune Infiltration Analysis

- Biomarker Validation: Evaluates the diagnostic potential of selected hub genes in independent validation datasets by performing single-gene Receiver Operating Characteristic (ROC) curve analysis.

- Immune Infiltration: Uses the CIBERSORT algorithm to estimate the relative proportions of 22 types of immune cells in the samples.

- Correlation Analysis: Investigates the biological context of the hub genes by calculating the correlation between their expression levels and the abundance of different immune cells.

- Output: Generates ROC curve plots and visualizations of immune cell composition and correlation analysis (e.g., boxplots, correlation heatmaps).
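The single-gene ROC and correlation steps above reduce to two small computations, sketched here in base R. This README does not say which ROC package or correlation method verify.R uses, so the rank-based AUC helper `auc`, the choice of Spearman correlation, and the simulated vectors (`hub_expr`, `immune_frac`) are assumptions. `immune_frac` only stands in for one column of CIBERSORT output; CIBERSORT itself is not run here.

```r
# AUC of a single score via the rank-sum (Mann-Whitney) identity.
# Assumes higher scores indicate the positive class; a value below 0.5
# means the gene is lower in the positive class.
auc <- function(score, label, positive = "SD") {
  pos <- label == positive
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)  # average ranks handle ties
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(7)
label <- rep(c("Normal", "SD"), each = 20)

# Simulated expression of one hub gene, higher on average in SD samples.
hub_expr <- c(rnorm(20, mean = 5), rnorm(20, mean = 6.5))

# Simulated fraction of one immune cell type, loosely tied to the gene.
immune_frac <- 0.1 + 0.03 * hub_expr + rnorm(40, sd = 0.02)

# Diagnostic potential of the gene in this (simulated) validation set.
auc_hub <- auc(hub_expr, label)

# Association between gene expression and immune cell abundance.
ct <- cor.test(hub_expr, immune_frac, method = "spearman")

cat("AUC:", round(auc_hub, 3), "\n")
cat("Spearman rho:", round(unname(ct$estimate), 3),
    " p-value:", signif(ct$p.value, 3), "\n")
```

Repeating the correlation over every hub gene and every immune cell type yields the matrix that a correlation heatmap would display.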

How to Run

The scripts should be executed in the following order:

DEG.R
machine learning.R
verify.R
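One way to run the three scripts in this order from a single R session is sketched below. The helper `run_pipeline` is hypothetical and not part of the repository. It assumes the scripts sit together in one directory and read and write their files relative to it.

```r
# Hypothetical helper: source the three scripts in the documented order.
run_pipeline <- function(dir = ".") {
  scripts <- c("DEG.R", "machine learning.R", "verify.R")

  # Fail early if any script is missing, before running anything.
  missing <- scripts[!file.exists(file.path(dir, scripts))]
  if (length(missing) > 0) {
    stop("Missing script(s): ", paste(missing, collapse = ", "))
  }

  # Run from the script directory so relative paths such as
  # combined-data.csv resolve; restore the working directory afterwards.
  old_wd <- setwd(dir)
  on.exit(setwd(old_wd), add = TRUE)

  for (s in scripts) {
    message("Running ", s)
    source(s)
  }
  invisible(scripts)
}

# Usage (from the repository root):
# run_pipeline()
```

Note that "machine learning.R" contains a space, so it must be quoted if the scripts are launched from a shell instead.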
