xwqian1123/mStrain

mStrain

mStrain is a novel tool for identifying Yersinia pestis at the strain or lineage level from metagenomic data. It is written in Python, with a small amount of R and Linux shell. mStrain identifies Y. pestis at the strain/lineage level by extracting sufficient single nucleotide polymorphism (SNP) information, which makes it an effective tool for identifying and source-tracking Y. pestis from metagenomic data during plague outbreaks.

Requirements

1. conda packages:

package        version
r-base         =3.6.3
bcftools       =1.14
samtools       >=1.15
iqtree         >=2.2.2.7
bwa            >=0.7.17
bedtools       >=2.31.0
Kraken2        >=2.0.9
ImageMagick    =7.1.0_27
pandas         >=2.0.3
Trimmomatic    >=0.39

2. r packages:

ggtree =2.0.4, ggplot2 =3.3.1, treeio, ape, tidyr, geiger, tibble

3. source code:

jdk-20.0.2, picard =3.1.0

Installation

1. A conda environment named mStrain can be created and activated with:

conda create -n mStrain python=3.9.16
conda activate mStrain

NOTE:

  • mStrain is only an example name; the new conda environment can be given any name
  • Installation, validation, and usage are all performed inside this environment

2. Install dependencies required for mStrain

Clone this repository to your local machine using git

git clone https://github.com/xwqian1123/mStrain.git

Add executable permission to the script 'run_install.sh' in the mStrain directory

cd mStrain
chmod +x run_install.sh

Run the script 'run_install.sh' in the mStrain directory to install the conda packages, the R packages, and jdk-20.0.2

./run_install.sh

NOTE:

  • Make sure that all dependencies were installed successfully before continuing.
  • The R packages and picard that this project depends on are already bundled in the mStrain/packages directory.
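
One rough way to act on the first note is to check that the command-line tools from the Requirements list can be found on PATH inside the activated environment. The sketch below is only an illustration and is not part of mStrain: the function name `find_missing` is hypothetical, and the executable names are assumptions, since conda builds may install a tool under a different name (IQ-TREE 2, for example, is often installed as `iqtree2`).

```python
import shutil

# Executable names are assumptions; adjust them to match what your conda
# environment actually installs (IQ-TREE is left out because its executable
# name varies between builds).
REQUIRED_TOOLS = [
    "samtools", "bcftools", "bwa", "bedtools",
    "kraken2", "trimmomatic", "Rscript",
]


def find_missing(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

A tool being found on PATH does not confirm its version, so the version constraints in the Requirements list still need to be checked separately.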

Validation

The following validation of mStrain was performed on the Ubuntu 23.04 operating system.

1. Dataset

This work uses sim.fastq to validate mStrain. It is a simulated sequencing dataset built by randomly extracting and mixing simulated reads from Yersinia pestis EV76 and the human genome hg38. The repository already contains this dataset in compressed form; unpack the file 'sim.fastq.bz2' in the mStrain directory with the bzip2 command to obtain it.

cd mStrain
bzip2 -d sim.fastq.bz2
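
If the bzip2 command is not available, the same unpacking step can be sketched with Python's standard `bz2` module, followed by a quick sanity check of the result. The function names `decompress_bz2` and `count_fastq_records` are hypothetical helpers, and the record count assumes the common layout of four lines per FASTQ record. Unlike `bzip2 -d`, this sketch leaves the compressed file in place.

```python
import bz2
import shutil


def decompress_bz2(src, dst):
    """Write the decompressed contents of `src` to `dst`; `src` is kept."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def count_fastq_records(path):
    """Count records, assuming four lines per FASTQ record."""
    with open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Example usage from inside the mStrain directory:
#   decompress_bz2("sim.fastq.bz2", "sim.fastq")
#   print(count_fastq_records("sim.fastq"))
```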

After 'sim.fastq.bz2' has been unpacked, the tree structure of the mStrain directory is as follows:

mStrain
├── install_script
│   ├── install.sh
│   ├── list.txt
│   └── RPackage.r
├── main_code
│   ├── get_node.py
│   ├── get_sh.sh
│   ├── get_target_gene.py
│   ├── ggtee_plot.R
│   ├── ggtree_node_table.r
│   ├── process.py
│   └── trantofa.py
├── packages
│   ├── jdk-20.0.2
│   ├── picard.jar
│   └── RPackages
├── README.md
├── ref
│   ├── 133s_2298p.txt
│   ├── 133strain_branch_type.list
│   ├── CO92.chr.fasta
│   ├── CO92.chr.fasta.amb
│   ├── CO92.chr.fasta.ann
│   ├── CO92.chr.fasta.bwt
│   ├── CO92.chr.fasta.fai
│   ├── CO92.chr.fasta.pac
│   ├── CO92.chr.fasta.sa
│   └── trimmomatic.fa
├── run_install.sh
├── run_mStrain.sh
├── sim.fastq
└── test.fq.ls

2. Run mStrain with dataset:

Add executable permission to the script 'run_mStrain.sh' in the mStrain directory

chmod +x run_mStrain.sh

Run the script 'run_mStrain.sh' in the mStrain directory on the dataset

./run_mStrain.sh

Usage

mStrain is an extensible tool: users can change the -i, -o, -d, -t, -k, and -k_db parameters to customize a run.

Explanation of parameters in run_mStrain.sh

``python ./main_code/process.py -i test.fq.ls -r ./ref/CO92.chr.fasta -o sim -m ./ref/133s_2298p.txt -f ./ref/133strain_branch_type.list -g IP32953_outgroup -trim_db ./ref/trimmomatic.fa -d 3 -picardpath ./packages/picard.jar -t 4 -javapath ./packages/jdk-20.0.2/bin/java``

-r,  "--ref_seq",    help="reference genome file name"
-i,  "--input_file", help="input file name,includes sample name and reads path;eg:sample1\tsample_1.fq\tsample_2.fq"
-o,  "--out_dir",    help="output folder name"
-n,  "--num",        default=4, type=int,help="samtools view filtering flag,default:4"
-m,  "--snp_matrix", help="the SNP loci of reference strains,outgroup is placed in the last column"
-f,  "--typefile",   help="the type list of reference strains"
-g,  "--outgroup",   help="customized outgroup_name"
-d,  "--deep",       default=3, type=int, help="sequencing depth,default:3"
-k,  "--kraken",     default=0, type=int, help="species identification,default:0,do not run kraken;1,run kraken"
-t,  "--thread",     default=2,help="number of threads"
-k_db,  "--kraken_database",  help="kraken_database"
-trim_db, "--trim_database",  help="trim_database,this file can be obtained from https://github.com/usadellab/Trimmomatic/tree/main/adapters"
-picardpath, "--picardpath",  help="picardpath"
-javapath,  "--javapath",     help="javapath"
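
To run mStrain on your own reads, the parameters above suggest two steps: write an input list in the tab-separated layout shown for -i, then call process.py with adjusted options. The following is a minimal sketch under stated assumptions, not part of mStrain itself. The helper names `write_sample_list` and `build_command`, the sample and file names, and the Kraken2 database path are all hypothetical; the fixed paths are copied from the command shown above. Whether a sample with a single FASTQ file is listed with just one path column is also an assumption, so compare with the bundled test.fq.ls before relying on it.

```python
import shlex


def write_sample_list(path, samples):
    """Write one tab-separated line per sample: name, then its FASTQ path(s)."""
    with open(path, "w") as handle:
        for name, fastq_paths in samples:
            handle.write("\t".join([name, *fastq_paths]) + "\n")


def build_command(input_list, out_dir, depth=3, threads=4, kraken_db=None):
    """Assemble a process.py argument list modelled on run_mStrain.sh."""
    cmd = [
        "python", "./main_code/process.py",
        "-i", input_list,
        "-r", "./ref/CO92.chr.fasta",
        "-o", out_dir,
        "-m", "./ref/133s_2298p.txt",
        "-f", "./ref/133strain_branch_type.list",
        "-g", "IP32953_outgroup",
        "-trim_db", "./ref/trimmomatic.fa",
        "-d", str(depth),
        "-picardpath", "./packages/picard.jar",
        "-t", str(threads),
        "-javapath", "./packages/jdk-20.0.2/bin/java",
    ]
    if kraken_db is not None:
        # -k 1 turns species identification on; -k_db names the database.
        cmd += ["-k", "1", "-k_db", kraken_db]
    return cmd


if __name__ == "__main__":
    # Hypothetical database path; print the command instead of running it.
    command = build_command("my_samples.ls", "my_run", threads=8,
                            kraken_db="/path/to/kraken2_db")
    print(shlex.join(command))
```

Calling `write_sample_list("my_samples.ls", [("sample1", ["sample_1.fq", "sample_2.fq"])])` from the mStrain directory would create the list file, and the printed command can then be pasted into a shell or into a copy of run_mStrain.sh.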
