mStrain

mStrain is a novel tool that identifies Yersinia pestis at the strain or lineage level from metagenomic data. It is written in Python with a small amount of R and Linux shell. mStrain successfully identified Y. pestis at the strain/lineage level by extracting sufficient single nucleotide polymorphism (SNP) information, so it can be an effective tool for identification and source tracking of Y. pestis from metagenomic data during plague outbreaks.

Requirements

1. conda packages:

package	version
r-base	=3.6.3
bcftools	=1.14
samtools	>=1.15
iqtree	>=2.2.2.7
bwa	>=0.7.17
bedtools	>=2.31.0
Kraken2	>=2.0.9
ImageMagick	=7.1.0_27
pandas	>=2.0.3
Trimmomatic	>=0.39

2. r packages:

ggtree =2.0.4, ggplot2 =3.3.1, treeio, ape, tidyr, geiger, tibble

3. source code:

jdk-20.0.2, picard =3.1.0
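
Before running the pipeline it can help to confirm that the command-line tools listed above are visible on PATH inside the conda environment. The following is a minimal sketch, not part of mStrain: the executable names are assumptions (for example, IQ-TREE 2 may be installed as `iqtree2` or `iqtree`), versions are not checked, and pandas is not covered because it is a Python package rather than a command-line tool.

```python
import shutil

# Candidate executable names per package. These names are assumptions and
# may differ depending on how conda installed each tool.
REQUIRED_TOOLS = {
    "r-base": ["Rscript"],
    "bcftools": ["bcftools"],
    "samtools": ["samtools"],
    "iqtree": ["iqtree2", "iqtree"],
    "bwa": ["bwa"],
    "bedtools": ["bedtools"],
    "Kraken2": ["kraken2"],
    "ImageMagick": ["magick", "convert"],
    "Trimmomatic": ["trimmomatic"],
}


def find_missing(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return package names for which no candidate executable is on PATH."""
    missing = []
    for package, candidates in tools.items():
        if not any(which(name) for name in candidates):
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed command-line tools were found on PATH.")
```

A tool reported as missing may simply be installed under a different executable name, so treat the output as a hint rather than a verdict.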

Installation

1. A conda environment named `mStrain` can be created and activated with:

conda create -n mStrain python=3.9.16
conda activate mStrain

NOTE:

mStrain is a customizable name for the new environment created with the conda command.
Installation, validation and usage are all performed in this environment.

2. Install dependencies required for mStrain

Clone this repository to your local machine using git

git clone https://github.com/xwqian1123/mStrain.git

Add executable permission to the script 'run_install.sh' in the mStrain directory

cd mStrain
chmod +x run_install.sh

Run the script 'run_install.sh' in the mStrain directory to install conda packages, R packages and jdk-20.0.2

./run_install.sh

NOTE:

Make sure all dependencies have been installed successfully.
The R packages and picard that this project depends on are already packaged in the mStrain/packages directory.

Validation

The following validation of mStrain was performed on the Ubuntu 23.04 operating system.

1. Dataset

In this work, sim.fastq is used to validate mStrain. It is a simulated dataset: sequencing of Yersinia pestis EV76 and the human genome hg38 was simulated, and reads were then randomly extracted and mixed. This repository already contains the dataset in compressed form. To obtain sim.fastq, unpack the file 'sim.fastq.bz2' in the mStrain directory using the bzip2 command.

cd mStrain
bzip2 -d sim.fastq.bz2
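
To confirm that the file was unpacked completely, a quick structural check of the FASTQ can be useful. The sketch below is not part of mStrain and assumes the common four-line FASTQ layout (header, sequence, `+` line, quality); the function name `count_fastq_records` is hypothetical.

```python
def count_fastq_records(lines):
    """Count records in an iterable of FASTQ lines (four-line layout assumed).

    Raises ValueError on a truncated or malformed record.
    """
    it = iter(lines)
    records = 0
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(it, "")
        plus = next(it, "")
        qual = next(it, "")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near: {header.strip()!r}")
        if len(seq.strip()) != len(qual.strip()):
            raise ValueError(f"sequence/quality length mismatch: {header.strip()!r}")
        records += 1
    return records


# Example use from the mStrain directory:
#     with open("sim.fastq") as handle:
#         print(count_fastq_records(handle))
```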

Once 'sim.fastq.bz2' has been unpacked, the tree structure of the mStrain directory is as follows:

mStrain
├── install_script
│   ├── install.sh
│   ├── list.txt
│   └── RPackage.r
├── main_code
│   ├── get_node.py
│   ├── get_sh.sh
│   ├── get_target_gene.py
│   ├── ggtee_plot.R
│   ├── ggtree_node_table.r
│   ├── process.py
│   └── trantofa.py
├── packages
│   ├── jdk-20.0.2
│   ├── picard.jar
│   └── RPackages
├── README.md
├── ref
│   ├── 133s_2298p.txt
│   ├── 133strain_branch_type.list
│   ├── CO92.chr.fasta
│   ├── CO92.chr.fasta.amb
│   ├── CO92.chr.fasta.ann
│   ├── CO92.chr.fasta.bwt
│   ├── CO92.chr.fasta.fai
│   ├── CO92.chr.fasta.pac
│   ├── CO92.chr.fasta.sa
│   └── trimmomatic.fa
├── run_install.sh
├── run_mStrain.sh
├── sim.fastq
└── test.fq.ls
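
The layout above can be checked programmatically before a run. The following is a minimal sketch that only tests whether the paths used by the example command and the bwa index files from the tree exist; the `EXPECTED` list is copied from the tree, and the helper name `missing_paths` is hypothetical.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the mStrain directory.
EXPECTED = [
    "main_code/process.py",
    "packages/jdk-20.0.2",
    "packages/picard.jar",
    "ref/133s_2298p.txt",
    "ref/133strain_branch_type.list",
    "ref/CO92.chr.fasta",
    "ref/CO92.chr.fasta.amb",
    "ref/CO92.chr.fasta.ann",
    "ref/CO92.chr.fasta.bwt",
    "ref/CO92.chr.fasta.fai",
    "ref/CO92.chr.fasta.pac",
    "ref/CO92.chr.fasta.sa",
    "ref/trimmomatic.fa",
    "run_mStrain.sh",
    "sim.fastq",
    "test.fq.ls",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Example use from the mStrain directory:
#     print(missing_paths("."))
```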

2. Run mStrain with the dataset:

Add executable permission to the script 'run_mStrain.sh' in the mStrain directory

chmod +x run_mStrain.sh

Run the script 'run_mStrain.sh' in the mStrain directory with the dataset

./run_mStrain.sh

Usage

mStrain is an extensible tool: the user can change the -i, -o, -d, -t, -k, and -k_db parameters to customize a run.
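
As an illustration of customizing a run, the sketch below assembles a `process.py` command line from the values used in run_mStrain.sh and applies user overrides. It is not part of mStrain: `BASE_PARAMS` and `build_command` are hypothetical names, and the Kraken database path in the example is a placeholder.

```python
import shlex

# Values copied from the example command in run_mStrain.sh.
BASE_PARAMS = {
    "-i": "test.fq.ls",
    "-r": "./ref/CO92.chr.fasta",
    "-o": "sim",
    "-m": "./ref/133s_2298p.txt",
    "-f": "./ref/133strain_branch_type.list",
    "-g": "IP32953_outgroup",
    "-trim_db": "./ref/trimmomatic.fa",
    "-d": 3,
    "-picardpath": "./packages/picard.jar",
    "-t": 4,
    "-javapath": "./packages/jdk-20.0.2/bin/java",
}


def build_command(overrides=None):
    """Return the argument list for process.py with overrides applied."""
    params = dict(BASE_PARAMS)
    params.update(overrides or {})
    cmd = ["python", "./main_code/process.py"]
    for flag, value in params.items():
        cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Example: 8 threads, depth 5, and Kraken enabled with a placeholder path.
    custom = build_command(
        {"-t": 8, "-d": 5, "-k": 1, "-k_db": "/path/to/kraken_db"}
    )
    print(shlex.join(custom))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` from the mStrain directory.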

1.

2.

3.

Explanation of parameters in run_mStrain.sh

``python ./main_code/process.py -i test.fq.ls -r ./ref/CO92.chr.fasta -o sim -m ./ref/133s_2298p.txt -f ./ref/133strain_branch_type.list -g IP32953_outgroup -trim_db ./ref/trimmomatic.fa -d 3 -picardpath ./packages/picard.jar -t 4 -javapath ./packages/jdk-20.0.2/bin/java``

-r,  "--ref_seq",    help="reference genome file name"
-i,  "--input_file", help="input file name; each line includes the sample name and the reads paths, e.g. sample1\tsample_1.fq\tsample_2.fq"
-o,  "--out_dir",    help="output folder name"
-n,  "--num",        default=4, type=int, help="samtools view filtering flag, default: 4"
-m,  "--snp_matrix", help="the SNP loci of the reference strains; the outgroup is placed in the last column"
-f,  "--typefile",   help="the type list of the reference strains"
-g,  "--outgroup",   help="customized outgroup name"
-d,  "--deep",       default=3, type=int, help="sequencing depth, default: 3"
-k,  "--kraken",     default=0, type=int, help="species identification, default: 0 (do not run Kraken); 1 (run Kraken)"
-t,  "--thread",     default=2, help="number of threads"
-k_db,  "--kraken_database",  help="Kraken database"
-trim_db, "--trim_database",  help="trim database; this file can be obtained from https://github.com/usadellab/Trimmomatic/tree/main/adapters"
-picardpath, "--picardpath",  help="path to picard"
-javapath,  "--javapath",     help="path to java"
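
The input file passed with -i is a tab-separated list with one sample per line, as in the example given for that parameter. The sketch below writes such a list for paired-end samples. It is an illustration only: the helper name `format_input_list`, the sample names, and the read paths are hypothetical, and the layout for single-end data is not described here, so it is not covered.

```python
def format_input_list(samples):
    """Build the text of an input list from (name, reads_1, reads_2) tuples.

    Each line is: sample name, first reads file, second reads file,
    separated by tabs.
    """
    lines = []
    for name, reads_1, reads_2 in samples:
        fields = (name, reads_1, reads_2)
        if any("\t" in field or "\n" in field for field in fields):
            raise ValueError(f"tab or newline in entry for sample {name!r}")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Hypothetical samples; replace with real names and paths.
    text = format_input_list(
        [
            ("sample1", "sample1_1.fq", "sample1_2.fq"),
            ("sample2", "sample2_1.fq", "sample2_2.fq"),
        ]
    )
    print(text, end="")
```

Writing the returned text to a file (for example `my_samples.ls`, a placeholder name) gives a list that can then be passed with -i.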
