# Lesson 3.5

## Topics:
 - File Handling
 - Strings
 - Packages
 - Generators
 - Basic plotting
 
 
### A real life problem

Sequencing reads from the Illumina platform are stored as FASTQ files. Each sequencing "read" takes up four lines. Here is an example:

`@SRR036139.1 11_26_8:5:1:749:1421 length=36
GCTGGCTGCGTCTGTGGTGGGTTTCATGTTAAGGTC
+SRR036139.1 11_26_8:5:1:749:1421 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII`

Description from FASTQ wiki page:
 - Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
 - Line 2 is the raw sequence letters.
 - Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
 - Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
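
The four-line layout described above can be checked in code. This is a minimal sketch, not part of the lesson's own cells: `parse_fastq_record` is a hypothetical helper name, and the record is the example read shown earlier (the quality line is written as `"I" * 36` for brevity).

```python
# Hypothetical helper: split one four-line FASTQ record into its parts
# and check the rules listed in the description above.
def parse_fastq_record(lines):
    """Return (identifier, sequence, quality) from a list of four lines."""
    header, sequence, separator, quality = [l.rstrip("\n") for l in lines]
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality

record = [
    "@SRR036139.1 11_26_8:5:1:749:1421 length=36\n",
    "GCTGGCTGCGTCTGTGGTGGGTTTCATGTTAAGGTC\n",
    "+SRR036139.1 11_26_8:5:1:749:1421 length=36\n",
    "I" * 36 + "\n",
]
identifier, sequence, quality = parse_fastq_record(record)
print(identifier)
print(len(sequence), len(quality))
```

A record whose quality line is shorter or longer than its sequence would raise a `ValueError` here, which is a cheap way to catch a damaged file early.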
 
We want to determine the GC content of the reads in any FASTQ file. GC content can reveal bias in the library, and if you are sequencing a new species, for example, it tells you about the properties of that species' genome. So, if you have a FASTQ file, how would you determine the GC content of the reads?

- Somehow get the sequences from the file into a data structure you learned
- Examine the sequence to get GC content
- Present the data

In [None]:
# How to get the sequences from the file into your code?
# First, let us open a dummy file
a_file = open('a_simple_file.txt', 'r' )
a_file.close()

### open() command
 - Two inputs - filename, mode
 - Mode can be 
     - 'r' for read
     - 'w' for write (overwrites existing file)
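     - 'a' for append (adds to existing file)

"a_file" is a file object: a handle to the open file, not a copy of its contents. The lines are read from disk only when you ask for them, for example by looping over the file.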

In [None]:
type(a_file)
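
To see how the three modes differ, here is a small sketch that writes, appends to, and then overwrites a scratch file. The file name `modes_demo.txt` and the temporary directory are assumptions made for illustration, so that no real file is touched. It also uses the `with` statement, which closes the file automatically at the end of the block.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w") as fh:   # 'w' creates the file (or empties an existing one)
    fh.write("first line\n")

with open(path, "a") as fh:   # 'a' adds to the end of the existing file
    fh.write("second line\n")

with open(path, "r") as fh:   # 'r' reads the file
    after_append = fh.readlines()

with open(path, "w") as fh:   # 'w' again: the old contents are lost
    fh.write("replaced\n")

with open(path, "r") as fh:
    after_overwrite = fh.readlines()

print(after_append)
print(after_overwrite)
```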

In [None]:
#print every line
a_file = open('a_simple_file.txt', 'r' )
for line in a_file:
    print(line)
a_file.close()

In [None]:
#What is the variable "line"?
a_file = open('a_simple_file.txt', 'r' )
for line in a_file:
    print(type(line))
a_file.close()

In [None]:
#How do we track line numbers?
line_no = 0
a_file = open('a_simple_file.txt', 'r' )
for line in a_file:
    line_no += 1
    print(line_no,line)
a_file.close()

In [None]:
#Split into words, count the words
a_file = open('a_simple_file.txt', 'r' )
for line in a_file:
    lineL = line.split()
    print(lineL, "\nno. of words in this line: ", len(lineL))
a_file.close()

To get GC content of the sequences in a fastq file:
 - Find 
 
When you write new code, don't run it on real data first: real data may be too complicated for you to judge whether your code is working properly. Instead, start with a toy model, something that gives a trivial result you already know. For example, here is a toy fastq file that has only two sequences, with GC content of 100% and 50%:
`@TEST1
GCGCGCGCGCGC
+TEST1
IIIIIIIIIIII
@TEST2
ATGCATGCATGC
+TEST2
IIIIIIIIIIII`
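
If you do not have the toy file on disk yet, a sketch like the following could create it. The temporary directory is an assumption made so the example does not overwrite anything, and the quality lines are written with 12 symbols each so that they match the 12-letter sequences. The last loop works out the expected GC content directly, to confirm what the toy file should give.

```python
import os
import tempfile

# The two toy records, one string per record (four lines each).
toy_records = (
    "@TEST1\nGCGCGCGCGCGC\n+TEST1\nIIIIIIIIIIII\n"
    "@TEST2\nATGCATGCATGC\n+TEST2\nIIIIIIIIIIII\n"
)

# Hypothetical location: a temporary directory instead of the notebook folder.
toy_path = os.path.join(tempfile.mkdtemp(), "toy.fastq")
with open(toy_path, "w") as fh:
    fh.write(toy_records)

# Read it back and work out the expected answer for each sequence.
with open(toy_path) as fh:
    toy_lines = fh.read().splitlines()

expected = {}
for seq in toy_lines[1::4]:  # every 4th line, starting from the 2nd
    expected[seq] = (seq.count("G") + seq.count("C")) * 100 / len(seq)
    print(seq, expected[seq])
```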

In [None]:
seqs={} # A dictionary to store sequences if needed later
gc_content={} # A dictionary to store GC content if needed later
line_no = 0 # Variable to count lines
fastq_file = open('toy.fastq', 'r' ) #Open the fastq file
for line in fastq_file: # Iterate through the fastq file
    line = line.rstrip() #Remove the trailing newline so it is not counted as a letter
    line_no+=1 #counting line numbers
    if (line_no-2)%4 == 0: # Does the line number belong to the series 2, 6, 10, 14, etc. ?
        key = (line_no-2)//4 # Convert the line_no to a sequence ID (0, 1, 2, ...)
        seqs[key] = line #Populate seqs dictionary
        letters = list(line) #Get each letter of the sequence
        gc=0 # Initialize gc for THIS sequence
        for i in letters: # Iterate through the letters of the sequence
            if(i == "G" or i == "C"): #Check if it is G or C
                gc+=1 
        gc=round(gc*100/len(letters),2) #make it percentage, make it look pretty with round()
        gc_content[key] = gc
        print(key+1,line,gc) #print the results
fastq_file.close() #Close the file when done

In [None]:
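# Each line read from a file ends with a newline character ('\n'), which len() counts.
# line = line.rstrip() removes it, so it does not distort the GC percentage.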
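
The effect of the trailing newline is easy to see on a single line. In this sketch, `gc_percent` is a hypothetical helper (it uses `str.count` instead of a letter-by-letter loop), and `raw_line` imitates a line exactly as it comes out of a file.

```python
def gc_percent(sequence):
    """Hypothetical helper: percentage of G and C letters in a string."""
    gc = sequence.count("G") + sequence.count("C")
    return round(gc * 100 / len(sequence), 2)

raw_line = "GCGCGCGCGCGC\n"            # a line as read from a file

print(len(raw_line))                   # 13: the newline counts as a character
print(gc_percent(raw_line))            # 92.31: the newline lowers the percentage
print(gc_percent(raw_line.rstrip()))   # 100.0: the trivial result we expected
```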

Let us try the real data now (the first 250 sequences from the SRA file SRR036139).

In [None]:
#Let us write sequence details in a file called "seq_details.txt"
fh = open("seq_details.txt","w")
seqs={} # A dictionary to store sequences if needed later
gc_content={} # A dictionary to store GC content if needed later
line_no = 0 # Variable to count lines
fastq_file = open('example.fastq', 'r' ) #Open the fastq file
for line in fastq_file: # Iterate through the fastq file
    line=line.rstrip() #get that pesky newline out
    line_no+=1 #counting line numbers
    if (line_no-2)%4 == 0: # Does the line number belong to the series 2, 6, 10, 14, etc. ?
        key = int((line_no-2)/4) # Convert the line_no to a sequence ID
        seqs[key] = line #Populate seqs dictionary
        letters = list(line) #Get each letter of the sequence
        gc=0 # Initialize gc for THIS sequence
        for i in letters: # Iterate through the letters of the sequence
            if(i == "G" or i == "C"): #Check if it is G or C
                gc+=1 
        gc=round(gc*100/len(letters),2) #make it percentage, make it look pretty with round()
        gc_content[key] = gc
        #print(key+1,line,gc) #print the results
        write_str = str(key+1) + " " + line + " GC Content: " + str(gc) + "\n"
        fh.write(write_str)
fastq_file.close() #Close the input file as well
fh.close()
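
Since generators are one of this lesson's topics, here is one possible way to separate "reading the file" from "doing something with each sequence". This is a sketch of an alternative, not the approach used in the cell above: `read_sequences` is a hypothetical name, and `io.StringIO` stands in for an open file so the example runs without any file on disk.

```python
import io

def read_sequences(handle):
    """Yield the sequence line (2nd of every 4 lines) from an open FASTQ handle."""
    for line_no, line in enumerate(handle, start=1):
        if line_no % 4 == 2:          # lines 2, 6, 10, 14, ...
            yield line.rstrip()       # drop the trailing newline

# Stand-in for open('toy.fastq'): the two toy records held in memory.
toy_text = (
    "@TEST1\nGCGCGCGCGCGC\n+TEST1\nIIIIIIIIIIII\n"
    "@TEST2\nATGCATGCATGC\n+TEST2\nIIIIIIIIIIII\n"
)

gc_values = []
for seq in read_sequences(io.StringIO(toy_text)):
    gc = round((seq.count("G") + seq.count("C")) * 100 / len(seq), 2)
    gc_values.append(gc)
    print(seq, gc)
```

The generator hands over one sequence at a time, so the whole file never has to sit in memory, which matters for real FASTQ files with millions of reads.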

This was not really useful because there are 250 numbers and we don't have a sense of what the data is telling us. A histogram may be more useful. There is a direct histogram function we will discuss later, but here we will do it from scratch.

In [None]:
#let us assume bin size of 5 percent
#let us write the histogram into a file called "GC_hist.txt"
fh = open("GC_hist.txt","w")
bin_size = 5
hist = {} #declare the histogram
for i in gc_content:
    gc_bin = int(gc_content[i]/bin_size + 0.5)*bin_size #which bin does this number belong to? We round to the nearest
                                                         # multiple of bin_size, so the bin of 5 goes from >=2.5 to <7.5
    if gc_bin in hist:
        hist[gc_bin] += 1
    else:
        hist[gc_bin] = 1
for i in sorted(hist):
    hist[i]=hist[i]*100/len(gc_content)
    write_str =str(i) + " " + str(hist[i]) + "\n"
    fh.write(write_str)
fh.close()
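
To check what the binning formula does, it helps to run it on a few numbers chosen near the bin edges. The values below are made up for illustration, `nearest_bin` is a hypothetical helper wrapping the same formula, and `collections.Counter` is used here only as a shortcut for the "if in dictionary, add 1, else set to 1" pattern.

```python
from collections import Counter

bin_size = 5

def nearest_bin(value, bin_size):
    """Round a value to the nearest multiple of bin_size (halves round up)."""
    return int(value / bin_size + 0.5) * bin_size

# Made-up GC percentages, picked to sit on or near bin boundaries.
values = [2.4, 2.5, 7.4, 7.5, 48.0, 52.0, 52.4]

for v in values:
    print(v, "->", nearest_bin(v, bin_size))

counts = Counter(nearest_bin(v, bin_size) for v in values)
print(sorted(counts.items()))
```

With a bin size of 5, the values 2.5 and 7.4 both land in bin 5, while 2.4 falls into bin 0 and 7.5 into bin 10.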

It would be nice if we could see a plot. I didn't know how to make a simple plot in Jupyter, so I just googled and found this link: https://matplotlib.org/gallery/lines_bars_and_markers/simple_plot.html

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# x: the bins in sorted order; y: percentage of sequences in each bin
x = np.asarray(sorted(hist))
y = np.asarray([hist[i] for i in sorted(hist)])

fig, ax = plt.subplots()
ax.plot(x, y)

ax.set(xlabel='GC%', ylabel='% of sequences')

plt.show()

### Another real life example:

#### Let us assume that each paired-end read we obtain from a sequencing experiment is in a bed file. We want to convert the reads into a genome browser track.

A bed file has the following format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1):<br>

`chromosome    (start-1)    end    name    score    strand`

Let us break it down into columns:
 - First Column: Chromosome
 - Second Column: (Start position - 1)
 - Third Column: End position
 - Fourth Column: Name
 - Fifth Column: Score
 - Sixth Column: Strand
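
As a sketch of how one such line could be turned into named fields: `parse_bed_line` is a hypothetical helper, and the example line is made up for illustration. It assumes the six columns are separated by tabs or spaces and that the name contains no spaces.

```python
def parse_bed_line(line):
    """Hypothetical helper: split a six-column BED line into named fields."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return {
        "chrom": chrom,
        "start": int(start),   # second column: (start position - 1)
        "end": int(end),       # third column: end position
        "name": name,
        "score": score,
        "strand": strand,
    }

example_line = "chr1\t99\t150\tread_1\t0\t+\n"   # made-up read
read = parse_bed_line(example_line)

# Because the file stores (start - 1), the read starts at position start + 1
# and its length is simply end - start.
print(read["chrom"], read["start"] + 1, read["end"])
print(read["end"] - read["start"])
```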
 
In this example, we will ignore strand. For making a browser track, here is our strategy:

`123456789`<br>
`----_____`<br>
`___----__`<br>
`_____----`<br>
`______---`<br>
`______---`<br>
`111212433`<br>
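
Reading the diagram: the top row numbers the positions 1 to 9, each dashed row is one read, and the bottom row appears to be the number of reads covering each position. A minimal sketch of that counting is below. The read coordinates are read off the diagram by hand (1-based, both ends included) and are not taken from a real bed file.

```python
# Reads from the diagram as (start, end), 1-based and inclusive.
reads = [(1, 4), (4, 7), (6, 9), (7, 9), (7, 9)]
region_length = 9

# One counter per position, all starting at zero.
coverage = [0] * region_length

for start, end in reads:
    for pos in range(start, end + 1):   # every position the read covers
        coverage[pos - 1] += 1          # list index is position - 1

print("".join(str(c) for c in coverage))   # 111212433
```

With coordinates taken from a bed file, the second column is already (start - 1), so `range(start, end)` would visit the same list indices without any further shifting.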
