This is a collection of custom scripts used for Gaccioli et al. Nat. Microbiol. 2020.
The RNA-Seq reads were trimmed (using cutadapt and Trim Galore!) and mapped to the primary chromosomal assemblies of the GRCh38.p3 version of the human reference genome using TopHat2, a splice-aware mapper built on top of Bowtie2 short-read aligner. For more details, read Gong et al., Epigenetics, 2018 and Gong et al., JCI Insight, 2018.
The initial ‘unmapped’ reads were filtered out to remove poor quality reads, based on the following conditions:
- base-quality score (i.e. Phred score) < 30,
- read-length < 50bp,
- undetermined base (i.e. Ns) > 5bp, and
- poly A/T > 5bp, and
- low-complexity reads (defined by the dust score) > 7.
Following shell script handles what's been described above: preprocess_unmapped_reads.sh
In order to remove as many reads of human origin as possible, additional human reads were subtracted if they aligned to sequences present in the following databases:
- GRCh38.p5,
- human RefSeq,
- all human contigs and clone sequences from NCBI NT.
See create_human_db_for_bowtie2.sh
for details.
The remaining reads of each sample were mapped to a custom Kraken reference database, including the default bacterial and viral genomes and few additional eukaryotic genomes to remove residual unmapped human reads. Kraken (v0.10.6) was run using the metagm_run_kraken
option.
Using the -noclean
option on the paired filtered reads:
metagm_run_kraken -t 4 -noclean scanner_DB Sample(x).report Sample(x)_kneaddata_paired_1.fastq Sample(x)_kneaddata_paired_2.fastq
This will generate both a .report
file and a .report.kraken_out
file.
Write down all of the ID’s of the group of interest that you want to extract (penultimate column). This list can be very short or somewhat long, like in the case of E. coli. Make a column in a txt file of this list (Target_ids.tab).
Step 3: Run the Perl script (shown below, or here) on all of the Sample(x).report.kraken_out files.
#!/usr/bin/perl
#use strict;
my $path = $ARGV[0];
if (scalar($path)==0) {
$path = `pwd`;
chomp $path;
}
undef @files;
opendir PATH, $path;
@files = grep /\.kraken_out/, readdir PATH;
closedir PATH;
my $ids = "";
open FILE, "$path\/Target\_ids\.tab";
while (<FILE>) {
chomp;
$ids .= "\|$_\|";
}
close FILE;
foreach (@files){
my $f = $_;
open OUT, ">$path\/$f\.selection";
open FILE, "$path\/$f";
while (<FILE>) {
my $line = $_;
my @line = split /\t/, $line;
if ($ids =~ /\|$line[2]\|/) {
print OUT "$line[1]\n";
}
}
close FILE;
close OUT;
}
less Sample(x)_kneaddata_paired_1.fastq | grep @HX1 > Sample(x).tab
read_names <- scan("Sample(x).tab", what="\n")
selection <- scan("Sample(x).report.kraken_out.selection", what="\n")
selection <- gsub(selection, pattern="/1", replacement="")
selection <- as.numeric(selection)
read_selection <- read_names[selection]
read_selection <- gsub(read_selection, pattern="@", replacement="")
write(read_selection, file="Sample(x)Target.txt", ncolumns=1, sep="\t")
seqtk subseq Sample(x)_kneaddata_paired_1.fastq Sample(x)Target.txt > Sample(x)Target1.fq
seqtk subseq Sample(x)_kneaddata_paired_2.fastq Sample(x)Target.txt > Sample(x)Target2.fq