This project lets you deconvolute tissue types from RNA-seq expression data using a pretrained deep neural network. The neural network was trained on the GTEx v8 datasets. The project also includes other functions, such as deep-learning single-tissue-type prediction and tissue type deconvolution using non-negative least squares (NNLS). Deconvolution covers 15 tissue types; refer to section 3.5.4 for more details. The neural network is trained on TSG genes: in total, 6500 TSG genes are selected as training features. To get the TSG genes used in our training, refer to the sample data in the test folder.
Datasets are partially released for confidentiality reasons. You can directly download the data and models using this link: https://drive.google.com/file/d/163WrkVO9-WS7i4U4xdvLRl7QeAkSdm5s/view?usp=sharing
Or you can follow the steps below to download and configure the models and data automatically.
Note: Do not download the data folder directly into the project root folder; doing so will cause the following run steps to freeze. If you want to run the models with data, use the following steps to configure the data and models automatically.
You need at least 20 GB of free disk space, at least 8 GB of RAM, and a dual-core CPU, on a Windows, Linux, or macOS system that is compatible with the latest Docker Engine.
Why use Docker? Docker lets us distribute software consistently across all platforms. See this link for how to install Docker on your machine.
https://docs.docker.com/engine/install/
We recommend installing Docker Desktop for simplicity.
Note: after Docker is installed, the commands below can be used across platforms such as Windows, Linux, and macOS.
Run the following command to clone the project.
git clone https://github.com/yay135/GTEx_NNLS_Deep_Learning
Change the directory to the project root folder.
cd GTEx_NNLS_Deep_Learning
Run the docker build command.
docker build -t gtex_nnls_deep .
Note: building the Docker image downloads the data and models for you automatically. The build may take 5-10 minutes, depending on your network speed and available processing power.
Create an empty data folder and move all the CSV files that you want to run tasks on into it. Refer to the sample CSV files in the test folder to format your own files. Your data folder can be anywhere on your local machine.
Make sure your CSV files use Ensembl gene IDs without version suffixes as the column headers. The expression data must be TPM-normalized RNA-seq values. Do not apply any further normalization; the program has built-in log2 transformation and min-max scaling functions.
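If your expression table comes from a source that appends version numbers to gene IDs (for example "ENSG00000141510.16"), the headers need to be cleaned before running a task. The following is a minimal sketch of one way to do that, not part of this project: the function names and the regular expression are illustrative assumptions, and headers that do not look like a versioned Ensembl gene ID (such as a sample ID column) are left unchanged.

```python
import csv
import re

# Assumption: versioned human Ensembl gene IDs look like "ENSG<digits>.<digits>".
_VERSIONED_ID = re.compile(r"^(ENSG\d+)\.\d+$")


def strip_ensembl_version(gene_id: str) -> str:
    """Return the ID without its version suffix; other strings pass through."""
    cleaned = gene_id.strip()
    match = _VERSIONED_ID.match(cleaned)
    return match.group(1) if match else cleaned


def rewrite_header(src_path: str, dst_path: str) -> None:
    """Copy a CSV file, stripping version suffixes from the header row only."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow([strip_ensembl_version(name) for name in header])
        # Expression values are copied untouched: no extra normalization here.
        writer.writerows(reader)
```

Calling rewrite_header("raw.csv", "clean.csv") (both file names are hypothetical) writes a copy whose headers carry no version suffixes while leaving the TPM values as they were.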
Do not mix files for different tasks. The columns of your files can differ from those of the sample test file: the program matches as many columns as possible and fills the missing ones with 0. Files in your data folder may also have different columns from one another. Each column is a gene and each row is a sample. The models work best with CSV files that match more TSG genes.
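To make the column-matching behavior concrete, here is a small sketch of the idea: each sample is reordered to a fixed gene list, and genes absent from the file are filled with 0. This is an illustration under assumptions, not the project's actual code; the function names are hypothetical and the three gene IDs stand in for the real TSG feature list.

```python
def align_to_genes(rows, model_genes):
    """Reorder each sample to the model's gene order, filling absent genes with 0.0.

    rows: list of dicts mapping gene ID -> TPM value (one dict per sample).
    model_genes: gene IDs in the order the model expects (placeholder list here).
    """
    return [[float(row.get(gene, 0.0)) for gene in model_genes] for row in rows]


def matched_fraction(columns, model_genes):
    """Fraction of the model's genes that are present among the file's columns."""
    present = set(columns)
    return sum(gene in present for gene in model_genes) / len(model_genes)


# One sample that is missing the second gene of the placeholder feature list.
rows = [{"ENSG00000000003": "12.5", "ENSG00000000005": "0.0"}]
model_genes = ["ENSG00000000005", "ENSG00000000419", "ENSG00000000003"]

print(align_to_genes(rows, model_genes))  # [[0.0, 0.0, 12.5]]
print(matched_fraction(rows[0].keys(), model_genes))
```

A check like matched_fraction can be useful before running a task, since a file that matches few of the expected genes will be mostly zeros after alignment.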
The following tissue types are included in the output: "Brain","Breast","Colon","Esophagus","Kidney","Liver","Lung","Ovary","Pancreas","Prostate","Skin","Small Intestine","Stomach","Thyroid","Uterus"
The test folder already contains some test CSV files. You can use them to test the build, or you can create your own data folder.
Let's assume your task-ready CSV files are gathered in the folder "test".
cd test
docker run --env RUN_TYPE=deconvolute_deep --rm -v .:/app gtex_nnls_deep
RUN_TYPE specifies the task you are running. It can be RUN_TYPE=deconvolute_deep, RUN_TYPE=deconvolute_nnls, or RUN_TYPE=single_t.
docker run --env RUN_TYPE=deconvolute_nnls --rm -v .:/app gtex_nnls_deep
docker run --env RUN_TYPE=single_t --rm -v .:/app gtex_nnls_deep
cd into your data folder first.
cd [path/to/data]
Run the desired task.
docker run --env RUN_TYPE=[single_t|deconvolute_deep|deconvolute_nnls] --rm -v .:/app gtex_nnls_deep
Replace the RUN_TYPE value and the data folder path with your own.
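If you prefer to launch tasks from a script, the command above can be assembled in Python. The sketch below is an optional convenience and not part of this project; the function names are hypothetical. It mounts the data folder by its absolute path instead of ".", on the assumption that some older Docker releases may reject a relative host path in -v.

```python
import subprocess
from pathlib import Path

# The three task names used in the docker run commands above.
RUN_TYPES = ("single_t", "deconvolute_deep", "deconvolute_nnls")


def docker_command(data_dir, run_type, image="gtex_nnls_deep"):
    """Build the docker run argument list for one task on one data folder."""
    if run_type not in RUN_TYPES:
        raise ValueError(f"RUN_TYPE must be one of {RUN_TYPES}, got {run_type!r}")
    host_dir = Path(data_dir).resolve()  # absolute path of the data folder
    return [
        "docker", "run",
        "--env", f"RUN_TYPE={run_type}",
        "--rm",
        "-v", f"{host_dir}:/app",
        image,
    ]


def run_task(data_dir, run_type):
    """Run the task and raise CalledProcessError if docker exits with an error."""
    subprocess.run(docker_command(data_dir, run_type), check=True)
```

For example, run_task("test", "deconvolute_nnls") would run the NNLS deconvolution on the files in the test folder, provided the image has already been built.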
The output files are written to the same folder as your CSV data files, with a prefix added to the file name. For deconvolution tasks, each output file is named "deconvolute_[csv file name].csv". For the single-tissue-type prediction task, each output file is named "tissue_type_[csv file name].csv".
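Because outputs land next to the inputs, it can help to list them after a run. The following sketch collects the CSV files in a data folder whose names start with one of the two prefixes described above. It is an illustration only: the function name is hypothetical, and an input file that happens to start with one of these prefixes would be listed as well.

```python
from pathlib import Path

# Prefixes that the tasks add to their output file names.
OUTPUT_PREFIXES = ("deconvolute_", "tissue_type_")


def find_outputs(data_dir):
    """Return the sorted names of CSV files that carry an output prefix."""
    return sorted(
        path.name
        for path in Path(data_dir).glob("*.csv")
        if path.name.startswith(OUTPUT_PREFIXES)
    )
```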
Fengyao Yan fxy134@miami.edu