mStrain is a novel Yersinia pestis strain- or lineage-level identification tool that uses metagenomic data. It is written in Python with a small amount of R and Linux shell. mStrain successfully identified Y. pestis at the strain/lineage level by extracting sufficient single nucleotide polymorphism (SNP) information, so it can be an effective tool for identifying and source-tracking Y. pestis from metagenomic data during a plague outbreak.
package | version |
---|---|
r-base | =3.6.3 |
bcftools | =1.14 |
samtools | >=1.15 |
iqtree | >=2.2.2.7 |
bwa | >=0.7.17 |
bedtools | >=2.31.0 |
Kraken2 | >=2.0.9 |
ImageMagick | =7.1.0_27 |
pandas | >=2.0.3 |
Trimmomatic | >=0.39 |
R packages | ggtree =2.0.4, ggplot2 =3.3.1, treeio, ape, tidyr, geiger, tibble |
conda create -n mStrain python=3.9.16
conda activate mStrain
NOTE:
- mStrain is the name of the new conda environment and can be changed to any name you prefer
- Installation, validation and usage are all performed in this environment
Clone this repository to your local machine using git
git clone https://github.com/xwqian1123/mStrain.git
Add executable permission to the script 'run_install.sh' in the mStrain directory
cd mStrain
chmod +x run_install.sh
Run the script 'run_install.sh' in the mStrain directory to install conda packages, R packages and jdk-20.0.2
./run_install.sh
NOTE:
- Make sure the dependencies are installed successfully.
- The R packages and Picard that this project depends on are bundled in the mStrain/packages directory.
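As a quick way to act on the note above, the following minimal Python sketch reports which command-line tools cannot be found on PATH inside the active environment. The executable names in `REQUIRED_TOOLS` are assumptions (conda packages can install binaries under slightly different names, for example `iqtree` versus `iqtree2`), so adjust the list to match your installation. It only checks that a tool is present; it does not check versions.

```python
import shutil

# Assumed executable names for the dependencies listed above.
# These are NOT taken from run_install.sh; edit them to match your setup.
REQUIRED_TOOLS = [
    "Rscript", "bcftools", "samtools", "iqtree2",
    "bwa", "bedtools", "kraken2", "trimmomatic",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Run it after `conda activate mStrain`; any name it prints either failed to install or is installed under a different executable name.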
The following validation of mStrain was performed on the Ubuntu 23.04 operating system.
In this work, sim.fastq is used as the dataset to validate mStrain. It is a simulated sequencing dataset, built by simulating sequencing of Yersinia pestis EV76 and the human genome hg38 and then randomly extracting and mixing the resulting reads. This repository already contains the dataset: unpack the file 'sim.fastq.bz2' in the mStrain directory with the bzip2 command to obtain it.
cd mStrain
bzip2 -d sim.fastq.bz2
Once 'sim.fastq.bz2' has been successfully unpacked, the tree structure of the mStrain directory is as follows:
mStrain
├── install_script
│ ├── install.sh
│ ├── list.txt
│ └── RPackage.r
├── main_code
│ ├── get_node.py
│ ├── get_sh.sh
│ ├── get_target_gene.py
│ ├── ggtee_plot.R
│ ├── ggtree_node_table.r
│ ├── process.py
│ └── trantofa.py
├── packages
│ ├── jdk-20.0.2
│ ├── picard.jar
│ └── RPackages
├── README.md
├── ref
│ ├── 133s_2298p.txt
│ ├── 133strain_branch_type.list
│ ├── CO92.chr.fasta
│ ├── CO92.chr.fasta.amb
│ ├── CO92.chr.fasta.ann
│ ├── CO92.chr.fasta.bwt
│ ├── CO92.chr.fasta.fai
│ ├── CO92.chr.fasta.pac
│ ├── CO92.chr.fasta.sa
│ └── trimmomatic.fa
├── run_install.sh
├── run_mStrain.sh
├── sim.fastq
└── test.fq.ls
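Before running the pipeline, it can help to confirm that sim.fastq was unpacked intact. The sketch below is not part of mStrain; it counts the records in an uncompressed FASTQ file under the assumption that every record occupies exactly four lines (no wrapped sequence lines), and raises an error if the line count is not a multiple of four. The function name `count_fastq_records` is hypothetical.

```python
import os


def count_fastq_records(path):
    """Count records in an uncompressed FASTQ file, assuming 4 lines per record."""
    with open(path, "r") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A truncated or line-wrapped file would end up here.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


if __name__ == "__main__":
    # Only report when the file is present in the current directory.
    if os.path.exists("sim.fastq"):
        print("sim.fastq records:", count_fastq_records("sim.fastq"))
```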
Add executable permission to the script 'run_mStrain.sh' in the mStrain directory
chmod +x run_mStrain.sh
Run the script 'run_mStrain.sh' in the mStrain directory on the dataset
./run_mStrain.sh
mStrain is an extensible tool: the user can change the -i, -o, -d, -t, -k, and -k_db parameters to customize a run.
Explanation of parameters in run_mStrain.sh
``python ./main_code/process.py -i test.fq.ls -r ./ref/CO92.chr.fasta -o sim -m ./ref/133s_2298p.txt -f ./ref/133strain_branch_type.list -g IP32953_outgroup -trim_db ./ref/trimmomatic.fa -d 3 -picardpath ./packages/picard.jar -t 4 -javapath ./packages/jdk-20.0.2/bin/java``
-r, "--ref_seq", help="reference genome file name"
-i, "--input_file", help="input file name; each line lists the sample name and reads paths, e.g. sample1\tsample_1.fq\tsample_2.fq"
-o, "--out_dir", help="output folder name"
-n, "--num", default=4, type=int, help="samtools view filtering flag, default: 4"
-m, "--snp_matrix", help="the SNP loci of the reference strains; the outgroup is placed in the last column"
-f, "--typefile", help="the type list of the reference strains"
-g, "--outgroup", help="customized outgroup name"
-d, "--deep", default=3, type=int, help="sequencing depth, default: 3"
-k, "--kraken", default=0, type=int, help="species identification, default: 0 (do not run Kraken); 1: run Kraken"
-t, "--thread", default=2, help="number of threads"
-k_db, "--kraken_database", help="Kraken database"
-trim_db, "--trim_database", help="trim database; this file can be obtained from https://github.com/usadellab/Trimmomatic/tree/main/adapters"
-picardpath, "--picardpath", help="path to picard.jar"
-javapath, "--javapath", help="path to the java executable"
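To illustrate how the options above fit together, here is a minimal Python sketch that writes a tab-separated sample list in the layout given by the -i help text and assembles a customized process.py command line. The helper names `write_sample_list` and `build_command`, and the file and folder names `my_samples.ls` and `my_run`, are hypothetical and not part of mStrain. The fixed paths are copied from the command in run_mStrain.sh. The sketch only builds the argument list; it does not run the pipeline.

```python
def write_sample_list(path, samples):
    """Write one tab-separated line per sample: name, then its read file(s)."""
    with open(path, "w") as handle:
        for name, read_files in samples.items():
            handle.write("\t".join([name, *read_files]) + "\n")


def build_command(input_list, out_dir, depth=3, threads=4, kraken_db=None):
    """Assemble the process.py call from run_mStrain.sh with custom options."""
    cmd = [
        "python", "./main_code/process.py",
        "-i", input_list,
        "-r", "./ref/CO92.chr.fasta",
        "-o", out_dir,
        "-m", "./ref/133s_2298p.txt",
        "-f", "./ref/133strain_branch_type.list",
        "-g", "IP32953_outgroup",
        "-trim_db", "./ref/trimmomatic.fa",
        "-d", str(depth),
        "-picardpath", "./packages/picard.jar",
        "-t", str(threads),
        "-javapath", "./packages/jdk-20.0.2/bin/java",
    ]
    if kraken_db is not None:
        # -k 1 switches on Kraken species identification; -k_db names its database.
        cmd += ["-k", "1", "-k_db", kraken_db]
    return cmd


if __name__ == "__main__":
    # Hypothetical names; printing only, nothing is written or executed here.
    print(" ".join(build_command("my_samples.ls", "my_run", depth=5, threads=8)))
```

Calling `write_sample_list("my_samples.ls", {"sample1": ["sample_1.fq", "sample_2.fq"]})` produces the one-line file described by the -i help text, and the list returned by `build_command` can be passed to `subprocess.run` from the mStrain directory.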