GitHub - sung/HHV6-Nat-Micr-2020: A collection of custom scripts used for Gaccioli et al. Nat. Microbiol. 2020.

Introduction

This is a collection of custom scripts used for Gaccioli et al. Nat. Microbiol. 2020.

Initial mapping of RNA-Seq data

The RNA-Seq reads were trimmed (using cutadapt and Trim Galore!) and mapped to the primary chromosomal assemblies of the GRCh38.p3 version of the human reference genome using TopHat2, a splice-aware mapper built on top of Bowtie2 short-read aligner. For more details, read Gong et al., Epigenetics, 2018 and Gong et al., JCI Insight, 2018.

Preprocessing of unmapped RNA-Seq data

The initial ‘unmapped’ reads were filtered out to remove poor quality reads, based on the following conditions:

base-quality score (i.e. Phred score) < 30,
read-length < 50bp,
undetermined base (i.e. Ns) > 5bp, and
poly A/T > 5bp, and
low-complexity reads (defined by the dust score) > 7.

Following shell script handles what's been described above: preprocess_unmapped_reads.sh

A custom human sequence DB

In order to remove as many reads of human origin as possible, additional human reads were subtracted if they aligned to sequences present in the following databases:

GRCh38.p5,
human RefSeq,
all human contigs and clone sequences from NCBI NT.

See create_human_db_for_bowtie2.sh for details.

Post processing

The remaining reads of each sample were mapped to a custom Kraken reference database, including the default bacterial and viral genomes and few additional eukaryotic genomes to remove residual unmapped human reads. Kraken (v0.10.6) was run using the metagm_run_kraken option.

Step 1: Run Kraken

Using the -noclean option on the paired filtered reads:

metagm_run_kraken -t 4 -noclean scanner_DB Sample(x).report Sample(x)_kneaddata_paired_1.fastq Sample(x)_kneaddata_paired_2.fastq

This will generate both a .report file and a .report.kraken_out file.

Step 2: Open the Sample(x).report file.

Write down all of the ID’s of the group of interest that you want to extract (penultimate column). This list can be very short or somewhat long, like in the case of E. coli. Make a column in a txt file of this list (Target_ids.tab).

Step 3: Run the Perl script (shown below, or here) on all of the Sample(x).report.kraken_out files.

#!/usr/bin/perl
#use strict;

my $path = $ARGV[0];
if (scalar($path)==0) {
    $path = `pwd`;
    chomp $path;
}

undef @files;
opendir PATH, $path;
@files = grep /\.kraken_out/, readdir PATH;
closedir PATH;

my $ids = "";
open FILE, "$path\/Target\_ids\.tab";
while (<FILE>) {
    chomp;
    $ids .= "\|$_\|";
}
close FILE;

foreach (@files){
    my $f = $_;
    open OUT, ">$path\/$f\.selection";
    open FILE, "$path\/$f";
    while (<FILE>) {
        my $line = $_;
        my @line = split /\t/, $line;
        if ($ids =~ /\|$line[2]\|/) {
            print OUT "$line[1]\n";
        }
    }
    close FILE;
    close OUT;
}

Step 4: Unix: Extract read pair headers of all reads of each sample:

less Sample(x)_kneaddata_paired_1.fastq | grep @HX1 > Sample(x).tab

Step 5: Run the following R code using the output from steps 3 and 4:

read_names <- scan("Sample(x).tab", what="\n")
selection <- scan("Sample(x).report.kraken_out.selection", what="\n")
selection <- gsub(selection, pattern="/1", replacement="")
selection <- as.numeric(selection)

read_selection <- read_names[selection]
read_selection <- gsub(read_selection, pattern="@", replacement="")

write(read_selection, file="Sample(x)Target.txt", ncolumns=1, sep="\t")

Step 6: Unix: Extract all the target reads (forward and reverse).

seqtk subseq Sample(x)_kneaddata_paired_1.fastq Sample(x)Target.txt > Sample(x)Target1.fq
seqtk subseq Sample(x)_kneaddata_paired_2.fastq Sample(x)Target.txt > Sample(x)Target2.fq

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
README.md		README.md
create_human_db_for_bowtie2.sh		create_human_db_for_bowtie2.sh
parse_kraken_out.pl		parse_kraken_out.pl
preprocess_unmapped_reads.sh		preprocess_unmapped_reads.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Initial mapping of RNA-Seq data

Preprocessing of unmapped RNA-Seq data

A custom human sequence DB

Post processing

Step 1: Run Kraken

Step 2: Open the Sample(x).report file.

Step 3: Run the Perl script (shown below, or here) on all of the Sample(x).report.kraken_out files.

Step 4: Unix: Extract read pair headers of all reads of each sample:

Step 5: Run the following R code using the output from steps 3 and 4:

Step 6: Unix: Extract all the target reads (forward and reverse).

A list of dependent software and database

Contacts

About

Releases

Packages

Contributors 2

Languages

sung/HHV6-Nat-Micr-2020

Folders and files

Latest commit

History

Repository files navigation

Introduction

Initial mapping of RNA-Seq data

Preprocessing of unmapped RNA-Seq data

A custom human sequence DB

Post processing

Step 1: Run Kraken

Step 2: Open the Sample(x).report file.

Step 3: Run the Perl script (shown below, or here) on all of the Sample(x).report.kraken_out files.

Step 4: Unix: Extract read pair headers of all reads of each sample:

Step 5: Run the following R code using the output from steps 3 and 4:

Step 6: Unix: Extract all the target reads (forward and reverse).

A list of dependent software and database

Contacts

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages