# Read pre-processing for HMP 16S Data

Last Updated: 2022-04-05  
Quang Nguyen

Here, we attempt to follow the procedure outlined in the HMP data processing [document](https://www.hmpdacc.org/hmp/doc/16S_SOP.pdf) to pre-process our 454 sequencing reads:

1. Keep only reads that start with an initial TCAG.
2. If the next few bases of the read do not have an unambiguous needle (\*) match to one of the expected reverse barcodes, allowing for at most one substitution or indel, the read is discarded.
3. If the next few bases of the read do not have an unambiguous needle (\*) match to one of the 16S reverse primers that is supposed to accompany the barcode matched in step 2, allowing for at most four substitutions and/or indels, the read is discarded.
4. The TCAG, reverse barcode, and reverse primer are removed from the sequence of the read.
5. Convert the SFF files to FASTQ.
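
Steps 1 to 4 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the SOP uses a needle alignment, whereas this sketch substitutes a simple edit-distance match anchored at the start of the read, and it treats "unambiguous" as "no tie for the best barcode". The names `prefix_match` and `trim_read`, and the barcode in the usage example, are hypothetical and do not come from the SOP.

```python
# IUPAC nucleotide codes: each primer symbol maps to the read bases it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def prefix_match(pattern, text, max_edits):
    """Align pattern to the start of text; return (edits, end) or None."""
    window = text[: len(pattern) + max_edits]
    # prev[j] = edit distance between pattern[:i] and window[:j]
    prev = list(range(len(window) + 1))
    for i, p in enumerate(pattern, start=1):
        curr = [i]
        for j, t in enumerate(window, start=1):
            cost = 0 if t in IUPAC.get(p, p) else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    # Fewest edits wins; ties go to the end closest to the pattern length.
    end = min(range(len(prev)), key=lambda j: (prev[j], abs(j - len(pattern))))
    return (prev[end], end) if prev[end] <= max_edits else None


def trim_read(seq, barcode_to_primer, key="TCAG"):
    """Return the read with key, barcode and primer removed, or None if discarded."""
    if not seq.startswith(key):  # step 1
        return None
    rest = seq[len(key):]
    hits = []
    for barcode in barcode_to_primer:  # step 2: at most one edit
        match = prefix_match(barcode, rest, 1)
        if match is not None:
            hits.append((match[0], match[1], barcode))
    hits.sort()
    if not hits or (len(hits) > 1 and hits[0][0] == hits[1][0]):
        return None  # no barcode, or a tie between barcodes (assumed ambiguous)
    _, end, barcode = hits[0]
    rest = rest[end:]
    match = prefix_match(barcode_to_primer[barcode], rest, 4)  # step 3
    if match is None:
        return None
    return rest[match[1]:]  # step 4


# Hypothetical barcode paired with the reverse primer used in this notebook.
barcodes = {"ACGAGTGCGT": "CCGTCAATTCMTTTRAGT"}
read = "TCAG" + "ACGAGTGCGT" + "CCGTCAATTCATTTGAGT" + "GGGTTTAAA"
print(trim_read(read, barcodes))
```

The tie-breaking rule for where a match ends is a heuristic; a real needle alignment may place indels differently near the barcode and primer boundaries.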

In [2]:
from Bio import SeqIO
from Bio.Seq import Seq
from io import StringIO
import os
dpath = "/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/sff/"
opath = "/dartfs-hpc/rc/lab/H/HoenA/Lab/QNguyen/ResultsFiles/data/hmp_16s/fastq/"

In [52]:
files = os.listdir(dpath)
# CCGTCAATTCMTTTRAGT
primer = Seq("CCGTCAATTCMTTTRAGT")
primer.reverse_complement()

Seq('ACTYAAAKGAATTGACGG')

In [56]:
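# os.path.splitext drops only the extension; str.strip(".sff") strips any
# leading or trailing ".", "s", or "f" characters from the whole name
fastq_path = os.path.join(opath, os.path.splitext(files[0])[0] + ".fastq")
count = SeqIO.convert(os.path.join(dpath, files[0]), "sff", fastq_path, "fastq")
print("Converted {} records".format(count))

Converted 7349 records


In [58]:
with open(fastq_path) as handle:
    test = []
    for record in SeqIO.parse(handle, "fastq"):
        # rfind returns -1 when the literal string is absent, so each miss adds 0
        test.append(record.seq.rfind("ACTYAAAKGAATTGACGG") + 1)
    print(sum(test))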

0
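
The total of 0 above means `rfind` returned -1 for every read. One likely reason is that `rfind` compares characters literally, so the ambiguity codes `Y` and `K` in the reverse-complemented primer only match a literal `Y` or `K` in the read, not the bases they stand for. A minimal sketch of an IUPAC-aware search follows; `IUPAC_CLASSES`, `iupac_to_regex`, and the example read are illustrative names and data, not part of the notebook or of Biopython.

```python
import re

# Regex character class for each IUPAC nucleotide code.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(primer):
    """Compile a degenerate primer into a regex that matches concrete bases."""
    return re.compile("".join(IUPAC_CLASSES[base] for base in primer.upper()))


primer_rc = "ACTYAAAKGAATTGACGG"
pattern = iupac_to_regex(primer_rc)

# Made-up read: four leading bases, then one concrete instance of the primer.
read = "GGTT" + "ACTCAAATGAATTGACGG"

print(read.rfind(primer_rc))   # literal search fails: -1
hit = pattern.search(read)
print(hit.start(), hit.end())  # 4 22
```

This only handles exact matches to the degenerate primer; tolerating substitutions or indels would need an alignment or edit-distance approach instead.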


In [41]:
"test.sff".strip(".sff")

'test'
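
The check above happens to return `'test'`, but `str.strip` treats its argument as a set of characters to remove from both ends, not as a suffix. A file name whose stem begins or ends with `.`, `s`, or `f` would be mangled. The sketch below shows the difference with a made-up file name; `fastq_name` is a hypothetical helper.

```python
import os


def fastq_name(sff_name):
    """Swap the file extension for .fastq without touching the stem."""
    stem, _ext = os.path.splitext(sff_name)
    return stem + ".fastq"


print("miss.sff".strip(".sff"))  # 'mi': trailing f, f, s, ., s, s are all stripped
print(fastq_name("miss.sff"))    # 'miss.fastq'
print(fastq_name("test.sff"))    # 'test.fastq'
```

On Python 3.9 or later, `str.removesuffix(".sff")` is another option that removes only an exact trailing match.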

In [55]:
for record in SeqIO.parse(dpath + files[0], "sff"):
    print(record.seq[record.annotations["clip_qual_left"]:].endswith("CCGTCAATTCMTTTRAGT"))

False
False
False
[... output truncated; every line printed is False ...]