# Sorting and QCing sequences (version 1) #

One of the routine tasks in bioinformatics is quality control of DNA sequences.

In this version of the problem, we're opening a FASTQ file, which is the format that I showed you in class today. FASTQ is a format for very short DNA sequence reads (50-200 nucleotides), and each record consists of 4 lines in a repeating pattern.
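
To make the four-line pattern concrete, here is a small sketch that takes one record apart. The record is made up for illustration (the read name `@read_1`, the sequence, and the quality string are assumptions, not lines from sequences.fastq).

```python
# A made-up FASTQ record, for illustration only (not taken from sequences.fastq)
made_up_sequence = "TTAGGCACAGGTGTTAACTT"
fastq_text = "\n".join([
    "@read_1",                    # line 0: header, starts with '@'
    made_up_sequence,             # line 1: the nucleotides
    "+",                          # line 2: separator
    "I" * len(made_up_sequence),  # line 3: one quality character per nucleotide
])

# split the record back into its four lines
header, sequence, separator, quality = fastq_text.split("\n")

print(header)
print(sequence)
print(separator)
print(quality)

# the quality string has the same length as the sequence
print(len(sequence) == len(quality))
```

In a real file this pattern just repeats, so line numbers 0, 4, 8, ... are headers, 1, 5, 9, ... are sequences, and so on.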

We want to get those lines out and test them to see how many of the sequences contain specific patterns, called barcodes. 

The barcode sequences that might occur are:

['ATCACG','CGATGT','TTAGGC','TGACCA','ACATGT','GCCAAT','CAGATC','ACTTGA','GATCAG','TAGCTT','GGCTAG','CTTGTA']

Barcodes will show up at the beginning of the sequence. If the pattern is found somewhere else it's probably random and not a real barcode.
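
Here is a quick sketch of the difference between "the pattern is somewhere in the read" and "the read starts with the pattern". Both reads are made up for illustration.

```python
barcode = "TTAGGC"

# made-up reads, for illustration only
read_with_barcode = "TTAGGCACAGGTGTTAACTT"     # barcode at the very start
read_with_chance_hit = "ACAGGTTTAGGCGTTAACTT"  # same six letters, but in the middle

for read in (read_with_barcode, read_with_chance_hit):
    anywhere = barcode in read           # True if the pattern occurs anywhere in the read
    at_start = read.startswith(barcode)  # True only if the read begins with the pattern
    print(read, anywhere, at_start)
```

Only the `startswith` test separates the two reads, which is why it is the one we want for barcodes.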

We also only want to count the sequences that are long enough. If a sequence is not over 80 nucleotides long, we don't want to count it.
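
"Over 80" means strictly more than 80, so a read of exactly 80 nucleotides should be skipped. A minimal sketch of that check follows; the names `MIN_LENGTH` and `long_enough` are hypothetical helpers, not part of the notebook.

```python
MIN_LENGTH = 80  # from the text: reads must be over 80 nucleotides

def long_enough(seq, min_length=MIN_LENGTH):
    """Return True only if seq is strictly longer than min_length."""
    return len(seq) > min_length

# made-up reads of length 79, 80 and 81 to check the boundary
for n in (79, 80, 81):
    read = "A" * n
    print(n, long_enough(read))
```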

In [2]:
def read_fastq(fastq):
	"""generator: reads a fastq file line by line and yields (name, seq, qual) for each record"""
	name,seq,qual = None,None,None # initialize variables
	for i,line in enumerate(fastq): #gets the lines plus a counter
		line = line.strip() #cleans up the whitespace
		if i % 4 == 0: #decides what to do based on the index -- is it line 0,1,2 or 3 in the pattern
			name = line # if it's a 0, it's a header
		elif i % 4 == 1: # if it's a 1, it's a sequence
			seq = line
		elif i % 4 == 3: # if it's a 3, it's a quality string
			qual = line
			yield name,seq,qual # done with line 3, yield all values and go back and get the next line
			name,seq,qual = None,None,None #set values to None after yield just to be sure
		else:
			pass #this happens if it's line 2 (the '+' separator), which we don't need
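
If you want to try the four-lines-at-a-time idea without a real file, here is a sketch that feeds made-up records through `io.StringIO`. Note that this is a different technique from the `i % 4` counter above: it uses `zip` on a single iterator to pull four lines per step. The name `read_fastq_grouped` and both records are assumptions for illustration.

```python
import io

def read_fastq_grouped(handle):
    """Yield (name, seq, qual) by pulling four lines at a time from handle."""
    lines = (line.strip() for line in handle)
    # zipping the same iterator with itself four times gives consecutive groups of 4 lines
    for name, seq, _separator, qual in zip(lines, lines, lines, lines):
        yield name, seq, qual

# two made-up records standing in for an open fastq file
fake_file = io.StringIO(
    "@read_1\nTTAGGCAC\n+\nIIIIIIII\n"
    "@read_2\nACGTACGT\n+\nHHHHHHHH\n"
)

records = list(read_fastq_grouped(fake_file))
for name, seq, qual in records:
    print(name, seq, qual)
```

Like the counter version, this only yields complete records; a trailing group of fewer than four lines is dropped.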

In [8]:
with open("sequences.fastq") as infile: #open() gives us a file object we can loop over line by line; with closes it when we're done
    barcodes = ['ATCACG','CGATGT','TTAGGC','TGACCA','ACATGT','GCCAAT','CAGATC','ACTTGA','GATCAG','TAGCTT','GGCTAG','CTTGTA']
    count = [0] * len(barcodes) #one counter per barcode, in the same order as barcodes
    for n,s,q in read_fastq(infile): #this is how we call read_fastq on the iterable
        if len(s) <= 80:
            continue #skip this read if it's not over 80 nucleotides
        #if we get here then len(s) > 80
        #now test if the sequence starts with one of the barcodes
        for i, barcode in enumerate(barcodes):
            if s.startswith(barcode):
                count[i] += 1
#                 print(barcodes[i])
#                 print(s)
#                 print(n)
#                 print("----")
        
    print(barcodes)
    print(count)
        

(6,
 'TTAGGCACAGGTGTTAACTTAGGTATCTTTGGTGCATCGAAAGTAAACCAAGAAATGACTGGTGCGTCANNNGGTNTCTTCAACNNGNNTANTNNNNANACCACAG')
(9,
 'TTAGGCGACAGGNACAAAGCCTAAAGGATTGGGTATCTTCAGGATGAAGATTTGATAACTTAGGATGANNNNTCANCAAGGACNNNNNNCTNTNNNNNNGAGTTAC')
(11,
 'TTAGGCAGGGGGAAGTGCTTGTGAAATTATTTTTTGCTTCGGATTTGCATGGCAGCCTAGAGGCGACAGNNNAAANATTGGCGTTNTTTGAACNANNAGGAGCGGA')
(25,
 'TTAGGCACACATCGAACATCCAGCCTTACGGCTTGCAAAGTGGCACAAAAATTGCTTATGAAAAAAGCGNNNGCGNGCCTAGTGNNTNNGTCGNNNNANGCAACCC')
(28,
 'TTAGGCACGCCGCTTAACGATTTGGCGAAGAAATGGTTTGACCCGAATGACTATCAGATGGTCGTTGTTNNNGATNCAAAAACGCNTNNCCCANANNTGGAAAAGT')
(29,
 'TTAGGCACTAGGATGGTTACTACTAAGGAAGGCAATGGACACCTCTGGATGAGGCAAGGACTGAACATCNNNAAANTGTTAGGGNNANNGCTCNNNNAACAAGTGA')
(49,
 'TTAGGCACCAAAGTTGAGTTACTCCATTTTTATCTCGGAGTGGGATCTGATGATCGGTACTGGTTTTTACACCGATGACATTGACGCTATCGTGATGGAGATGGAG')
(52,
 'TTAGGCAGATGAAGTTCGAGGCCGTGATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACGCAGCCATCAACGAGGTGGAACAAAA')
(62,
 'TTAGGCGCAACAGAGGAAACGCATTTATGAAACTAAACATCACTGGTAAAAACATCGAAATCACCTC