# Python Tutorial - Part 2

---

# Example Program

# Translating a Nucleotide Sequence

---

A common task in bioinformatics is using a scripting language, such as Python, to parse the output from an analysis program, such as BLAST, and produce a file of edited data.

Scripting languages can be used to rapidly write code to read files, manipulate the text and produce an output.

An example might be to read a fasta file of nucleotide sequence and produce the possible amino acid sequence. In other words, translate the sequence.

This will demonstrate the use of file handling, control structures, string (list) manipulation and dictionaries.

---

## Input Files

---

For the example we will have 2 input files.

The fasta sequence:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;>BF246290<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ggcgtcgtagtctcctgcagcgtctggggtttccgttgcagtcctcggaaccaggacctc<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ggcgtggcctagcgagttatggcgacgaaggccgtgtgcgtgctgaagggcgacgg<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cccagtgcagggctatcatcaattcgagcagaaggaaagtaatggcaccagtgaag<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;…<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cacagatggtgtgccgatgtgtctatggaacgattctgtgatctcactctcaggagacca<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tggcatcatgtggccgcacaactgtggtccatgaaaaagcaagatgactgtgggcca<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ggg</b><br>

NOTE: The sequence is not stored on one line; it is split across several lines by line breaks.

A file of codon translations:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ttt<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;F<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ttc<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;F<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tta<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;L<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ttg<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;L<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;etc</b><br>

---


## Building the Dictionary

---

The codon dictionary is built from the codons file:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;with open('codons.txt') as in_file:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Open the file<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;codons_list = in_file.readlines()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Read the file to a list<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;codons = {}&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Initialise the dictionary<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for count in range(0, len(codons_list), 2):&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Work through the list<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;key = codons_list[count].rstrip()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Get the codon<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;key = key.lower()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;value = codons_list[count+1].rstrip()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Get the aa, on next line<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;value = value.lower()<br>   
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;codons[key] = value&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Add to dictionary</b><br><br>
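
As a minimal sketch of the same idea, the codon lines and amino acid lines can be separated by slicing the list and then paired with zip(). The list `example_lines` is a hypothetical stand-in for the lines read from codons.txt, so the snippet runs without the file.

```python
# Hypothetical stand-in for the lines read from codons.txt
example_lines = ["ttt\n", "F\n", "ttc\n", "F\n", "tta\n", "L\n", "ttg\n", "L\n"]

# Even-indexed lines are codons, odd-indexed lines are amino acids
codon_lines = example_lines[0::2]
aa_lines = example_lines[1::2]

example_codons = {}
for codon, aa in zip(codon_lines, aa_lines):
    example_codons[codon.rstrip().lower()] = aa.rstrip().lower()

print(example_codons)
```

This produces the same kind of dictionary as the index-based loop; which form to use is largely a matter of taste.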

---


## Storing the Sequence

---

The fasta sequence is read from the file into a single string:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;with open('sequence.txt') as in_file:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Open the file<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq_list = in_file.readlines()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Read the file to a list<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq_name = seq_list.pop(0)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Remove the sequence name<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq = seq_list.pop(0)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Initialise the sequence<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq = seq.rstrip()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Remove the newline character<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq = seq.lower()&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Lower case the sequence<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for line in seq_list:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;#  Append the rest of the sequence to seq<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq += line.rstrip().lower()</b><br>

NOTE: Methods can be chained, one after another - seq += line.rstrip().lower()
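
One possible alternative, sketched here under the assumption that the file holds a single record, is to clean every sequence line and join them in one step. The text in `fasta_text` is made up for illustration and io.StringIO stands in for the real file.

```python
import io

# Hypothetical in-memory fasta record standing in for sequence.txt
fasta_text = ">example_id\nGGCGTC\ngtagtc\nTCC\n"

with io.StringIO(fasta_text) as in_file:
    header = in_file.readline().rstrip()   # First line is the description line
    # Strip and lower-case every remaining line, then join into one string
    joined_seq = "".join(line.rstrip().lower() for line in in_file)

print(header)
print(joined_seq)
```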

---



## Translating the Sequence

---

Now that we have the sequence and the codon translations, we can create the amino acid sequence:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for count in range(0, len(seq), 3):&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Work through the list<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;codon = seq[count:count+3]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Get 3 nucleotides<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;aa = codons[codon]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# Get the associated aa<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print (aa, end=" ")<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# In 2.7 this would be:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# print aa,</b><br><br>

---



## Complete Program

---

Putting it all together, with a few short cuts:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;with open('sequence.txt') as in_file:<br>                                 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq_list = in_file.readlines()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq_name = seq_list.pop(0)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq = seq_list.pop(0).rstrip().lower()<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for line in seq_list:<br> 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;seq += line.rstrip().lower()<br><br>     
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;with open('codons.txt') as in_file:<br>     
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;codons_list = in_file.readlines()<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;codons = {}  # Initialise the dictionary<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for count in range(0, len(codons_list), 2):<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;codons[codons_list[count].rstrip().lower()] = codons_list[count+1].rstrip().lower()<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;for count in range(0, len(seq), 3):<br> 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;codon = seq[count:count+3]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if codon in codons:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;aa = codons[codon]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else:<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;aa = '-'<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print (aa, end=" ")</b><br><br>
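
The lookup step can also be wrapped in a function. The sketch below is an illustration rather than part of the original program: the function name `translate`, its `unknown` parameter and the three-entry `small_table` are all assumptions made for the example. It uses dict.get() to supply a fallback when a codon is missing from the table.

```python
def translate(seq, codon_table, unknown="-"):
    """Translate seq three bases at a time using codon_table."""
    protein = []
    for start in range(0, len(seq), 3):
        codon = seq[start:start + 3]
        # dict.get returns the fallback when the codon is not in the table
        protein.append(codon_table.get(codon, unknown))
    return "".join(protein)

# Hypothetical, deliberately tiny codon table
small_table = {"atg": "m", "ttt": "f", "tta": "l"}

print(translate("atgttttta", small_table))   # Complete, known codons only
print(translate("atgnnntt", small_table))    # Unknown and incomplete codons
```

An incomplete codon at the end of the sequence is simply not found in the table, so it is reported with the same placeholder as an unknown codon.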

---

## Exercise 9

---

Write a program to test the sequence translation code. Modify the code so that it translates the sequence in 3 reading frames and prints them to a file. The sequences should be in fasta format with the frame number in the description line.

The fasta file (<b>sequence.txt</b>) and codons file (<b>codons.txt</b>) needed to test the script can be downloaded from the <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">Exercise Answers page</a>.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)

---

In [1]:
# we go through above process in detail
# process sequence list
with open('sequence.txt') as in_file:
    seq_list = in_file.readlines()
    seq_name = seq_list.pop(0)  # removes the indexed item from seq_list
    

    seq = seq_list.pop(0)  # Initialise the sequence
    seq = seq.rstrip()  # Remove the newline character
    seq = seq.lower()  # Lower case the sequence
    print(seq)
    

atggatttagagaaaaattatccgactcctcggaccagcaggacaggacatggaggagtg


In [2]:
print(seq_list)

['aatcagcttgggggggtttttgtgaatggacggccactcccggatgtagtccgccagagg\n', 'atagtggaacttgctcatcaaggtgtcaggccctgcgacatctccaggcagcttcgggtc\n', 'agccatggttgtgtcagcaaaattcttggcaggtattatgagacaggaagcatcaagcct\n', 'ggggtaattggaggatccaaaccaaaggtcgccacacccaaagtggtggaaaaaatcgct\n', 'gagtataaacgccaaaatcccaccatgtttgcctgggagatcaggggccggctgctggca\n', 'gagcgggtgtgtgacaatgacaccgtgcctagcgtcagttccatcaacaggatcatccgg\n', 'acaaaagtacagcagccacccaaccaaccagtcccagcttccagtcacagcatagtgtcc\n', 'actggctccgtgacgcaggtgtcctcggtgagcacggattcggccggctcgtcgtactcc\n', 'atcagcggcatcctgggcatcacgtcccccagcgccgacaccaacaagcgcaagagagac\n', 'gaaggtattcaggagtctccggtgccgaacggctactcgcttccgggcagagacttcctc\n', 'cggaagcagatgcggggagacttgttcacacagcagcagctggaggtgctggaccgcgtg\n', 'tttgagaggcagcactactcagacatcttcaccaccacagagcccatcaagcccgagcag\n', 'accacagagtattcagccatggcctcgctggctggtgggctggacgacatgaaggccaat\n', 'ctggccagccccacccctgctgacatcgggagcagtgtgccaggcccgcagtcctacccc\n', 'attgtgacaggccgtgacttggcgagcacgaccctccccgggtaccctccacacgtcccc\n', 'cccgctgg

In [3]:
for line in seq_list:  #  Append the rest of the sequence to seq
    seq += line.rstrip().lower()
        

In [4]:
seq

'atggatttagagaaaaattatccgactcctcggaccagcaggacaggacatggaggagtgaatcagcttgggggggtttttgtgaatggacggccactcccggatgtagtccgccagaggatagtggaacttgctcatcaaggtgtcaggccctgcgacatctccaggcagcttcgggtcagccatggttgtgtcagcaaaattcttggcaggtattatgagacaggaagcatcaagcctggggtaattggaggatccaaaccaaaggtcgccacacccaaagtggtggaaaaaatcgctgagtataaacgccaaaatcccaccatgtttgcctgggagatcaggggccggctgctggcagagcgggtgtgtgacaatgacaccgtgcctagcgtcagttccatcaacaggatcatccggacaaaagtacagcagccacccaaccaaccagtcccagcttccagtcacagcatagtgtccactggctccgtgacgcaggtgtcctcggtgagcacggattcggccggctcgtcgtactccatcagcggcatcctgggcatcacgtcccccagcgccgacaccaacaagcgcaagagagacgaaggtattcaggagtctccggtgccgaacggctactcgcttccgggcagagacttcctccggaagcagatgcggggagacttgttcacacagcagcagctggaggtgctggaccgcgtgtttgagaggcagcactactcagacatcttcaccaccacagagcccatcaagcccgagcagaccacagagtattcagccatggcctcgctggctggtgggctggacgacatgaaggccaatctggccagccccacccctgctgacatcgggagcagtgtgccaggcccgcagtcctaccccattgtgacaggccgtgacttggcgagcacgaccctccccgggtaccctccacacgtcccccccgctggacagggcagctactcagcaccgacgctgaca

In [5]:
# codon read

In [6]:
# read in codon file into dictionary
with open('codons.txt') as codons_file:
    codons_list = codons_file.readlines()

In [7]:
codons = {}  # Initialise the dictionary
for count in range(0, len(codons_list), 2):  # Work through the list
    key = codons_list[count].rstrip()   # Get the codon
    key = key.lower()
    value = codons_list[count+1].rstrip()  # Get the aa, on next line
    value = value.lower()
    codons[key] = value  # Add to dictionary

In [8]:
codons['tgg']

'w'

In [9]:

with open("translated_seqs.fasta", "w") as out_file:
    for frame in range(0, 3):
        out_file.write("> Frame " + str(frame) + "\n")  # Description line for this frame

        for count in range(frame, len(seq), 3):  # Work through the sequence
            codon = seq[count:count+3]   # Get 3 nucleotides
            if codon in codons:
                aa = codons[codon]   # Get the associated aa
            else:
                aa = '-'   # Unknown or incomplete codon
            out_file.write(aa)

        out_file.write("\n") # Sequence is finished so write a newline before the next sequence


---





# Modules

---

Modules, or libraries, are common to most programming languages, including Perl, C++ and Java.

Modules provide code for particular tasks that can be included in your own code. They are essentially programs with functions that can be called from your own program.

Python has a large standard library of modules, and many third-party packages, including Biopython and PyCogent, are also available. Modules are imported with the "import" statement.

The most commonly used is probably the sys module which contains system-specific functionality. This is required for functions that are often included by default in other languages.

---

# Command Line Arguments

---

Often when writing a Python script you want to include the option to read values into the script (command line arguments).

In many languages this is a built in function but with Python you need to import the <b>sys</b> module.

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;import sys</b><br>

Arguments can then be included when you run the script. For example, if the script is called test.py:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;python test.py arg1 arg2 arg3</b><br>

To read them into the script the <b>sys.argv</b> list is used. The first item, sys.argv[0], is the name of the script, so the arguments start at index 1:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;import sys<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a1 = sys.argv[1]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a2 = sys.argv[2]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a3 = sys.argv[3]</b><br>

Now the variable a1 has the value of arg1 etc.
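
Indexing sys.argv directly raises an IndexError if too few arguments are supplied, so a script may want to check the count first. The sketch below is one way to do that; the helper `get_arguments` is hypothetical, and it takes the argument list as a parameter so that it can be tried out with a simulated command line.

```python
import sys

def get_arguments(argv, expected):
    """Return the command line arguments, or None if the count is wrong."""
    # argv[0] is the script name, so the arguments start at index 1
    arguments = argv[1:]
    if len(arguments) != expected:
        return None
    return arguments

# Simulated command line: python test.py arg1 arg2 arg3
print(get_arguments(["test.py", "arg1", "arg2", "arg3"], 3))
print(get_arguments(["test.py", "arg1"], 3))

# In a real script the list would come from sys.argv:
# args = get_arguments(sys.argv, 3)
```

Note that command line arguments always arrive as strings, so numeric values need converting with int() or float() before use.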

It is possible to avoid having to write sys each time argv is used by explicitly importing that name.

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;from sys import argv</b><br>

Or to import all names from the module:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;from sys import *</b><br>

These options work for all modules and using either of the above the rest of the code is now simply:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a1 = argv[1]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a2 = argv[2]<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a3 = argv[3]</b><br>

---

## Exercise 10

---

Write a script that reads two numbers as command line arguments, adds them together and prints the result.
 
(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)

---


# String Methods

---

Strings contain numerous built in methods enabling you to manipulate and interrogate them. Examples include:

## String Searches

---

<b>startswith(string)</b> – This will return True if the string starts with the argument string, for example:


In [10]:

protein = 'MEFTIKRDYFITQLNDT' 

# Does the protein start with methionine
if protein.startswith('M'): 
    print ('Yes, the protein starts with methionine')
           
# Will print “Yes, the protein starts with methionine”


Yes, the protein starts with methionine



A substring can also be searched for within a string: 
    

In [11]:

protein = 'MEFTIKRDYFITQLNDT'

# Does the protein contain DYF
if 'DYF' in protein: 
    print ('Yes, it contains the pattern DYF')


Yes, it contains the pattern DYF
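
A few related string methods are not covered above but follow the same pattern. As a brief sketch using the same protein string: endswith() tests the end of the string, find() reports where a substring starts, and count() reports how often it occurs.

```python
protein = 'MEFTIKRDYFITQLNDT'

print(protein.endswith('DT'))   # Does the protein end with DT
print(protein.find('DYF'))      # Index of the first match, or -1 if absent
print(protein.find('WWW'))
print(protein.count('T'))       # Number of non-overlapping occurrences
```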



---

## Splitting Strings

---

Strings have an inbuilt split method that splits the string based on a delimiter and returns the split items in a list:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;str.split([delim[, maxsplit]])</b><br><br>

If no arguments are provided the string will be split on whitespace:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;' 1  2   3  '.split()<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;returns ['1', '2', '3']</b><br><br>

With delimiter:

<b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'1XX2XX3'.split('XX')<br><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;returns ['1', '2', '3']</b><br><br>


In [12]:

test_string =  '1 2 3'

split_list = test_string.split()

print (split_list)


['1', '2', '3']



Alternatively a delimiter can be used:


In [13]:

test_string =  '1XX2XX3'   

split_list = test_string.split('XX')

print (split_list)


['1', '2', '3']



If the optional <b>maxsplit</b> argument is provided it sets the maximum number of splits made in the string:


In [14]:

test_string =  '1 2 3 4 5 6'

split_list = test_string.split(" ", 1)

print (split_list)
split_list[1]


['1', '2 3 4 5 6']


'2 3 4 5 6'
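
As a small sketch of how these options combine, the example below splits a tab-separated record, uses maxsplit to peel off the first field, and rebuilds a string with join(). The contents of `record` are made up for illustration.

```python
# Hypothetical tab-separated record, made up for illustration
record = "seq1\tprogram\tgene\t100\t250"

fields = record.split("\t")          # Split on tabs only
print(fields)

first, rest = record.split("\t", 1)  # One split gives exactly two items
print(first)
print(rest)

print(",".join(fields))              # join() reverses split()
```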


---

## Other Methods

---

<b>lower() / upper()</b> - Returns a copy of the string converted to lower/upper case.

<b>capitalize()</b> - Returns a copy of the string with only its first character capitalized.

<b>replace(old, new[, count])</b> - Returns a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
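
A short sketch of these methods on a made-up lower case protein string, including the optional count argument of replace():

```python
protein = "vlspadktnv"

print(protein.upper())                # Whole string in upper case
print(protein.capitalize())           # Only the first character in upper case
print(protein.replace("v", "y", 1))   # Only the first v is replaced
print(protein)                        # The original string is unchanged
```

Each method returns a new string; strings themselves are never modified in place.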

There are many more.


In [15]:
# Replace valine with tyrosine


protein = "vlspadktnv"
print(protein)

new_protein = protein.replace("v", "y")

print(new_protein)


vlspadktnv
ylspadktny


In [16]:
# Replace more than one amino acid

protein = "vlspadktnv"
print(protein)

new_protein = protein.replace("pad", "tyc")

print(new_protein)

vlspadktnv
vlstycktnv


In [17]:
# WORKED EXAMPLE

with open('CAUH01000012.gff') as in_file:
    for line in in_file: 
        if line.startswith('CAUH01000012'):
            values = line.split() # With no arguments, split() uses whitespace, so each column of the line can be accessed by index
            if values[2] == 'CDS':
                print ("Start =", values[3], "and end =", values[4])


Start = 4550 and end = 4636
Start = 4698 and end = 4796
Start = 4852 and end = 5565
Start = 5623 and end = 5715
Start = 7647 and end = 7759
Start = 7854 and end = 8216
Start = 8267 and end = 8660
Start = 17433 and end = 17709
Start = 17243 and end = 17370


---

## Exercise 11

---

The file CAUH01000012.gff is a sequence annotation file in GFF format and is available on the <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">Exercise Answers</a> page.

a) Write a script that reads this file and prints out the lines that start with “CAUH01000012“, the ID for the sequence.

b) Modify the script to print out the start and end position of the CDS regions.

The columns in the GFF file are Sequence ID, Source, Type, Start, End, Score, Strand, Phase and Attributes. The CDS is identified in the third column and start and end in the fourth and fifth columns. The columns are all separated by tabs, which are whitespace.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)

---


# Regular Expressions

---

A regular expression (regex) is a special text string for describing a search pattern. In Python regular expressions are provided by the regular expression (re) module. The re module provides an interface to the regular expression engine, allowing you to compile REs into objects and then perform matches with them. This enables far more complex search patterns to be created than are possible with the built in string methods detailed previously.

To compile and use a re:

In [18]:

import re

p = re.compile(r'\d')   # Match a single digit
m = p.match('ONE to forty five')  # Apply the re at the start of the string
if m:
    print ('Match found')
else:
    print ('No match')
            

No match



Note the ‘r’ prefix on the regular expression string; it stands for ‘raw string’. The backslash character is used in regular expressions to indicate special forms or to allow special characters to be used without invoking their special meaning. However, Python string literals use the backslash for their own escape sequences, and the two uses collide: for example, “\b” is the word boundary in a regular expression but the backspace character in a Python string literal. To avoid this confusion the raw string is used, so r“\b” is a two-character string for a word boundary, whereas “\b” is a one-character string for a backspace.

The “r” isn’t essential for all regular expressions but it should generally be used in all cases to ensure the expression is interpreted correctly.
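
The difference can be checked directly. This is a minimal sketch; the sentence in `text` is made up for illustration.

```python
import re

print(len("\b"))    # 1: a single backspace character
print(len(r"\b"))   # 2: a backslash followed by the letter b

text = "the cat sat"
print(re.search(r"\bcat\b", text) is not None)   # Word boundaries: a match
print(re.search("\bcat\b", text) is not None)    # Backspace characters: no match
```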

In an RE there are plenty of special characters, and it is these that both give them their power and make them appear very complicated. It's best to build up your use of REs slowly; their creation can be something of an art form.

Here are some special RE characters and their meanings:

    .      # Any single character except a newline
    ^      # The beginning of the line or string
    $      # The end of the line or string
    *      # Zero or more of the last character
    +      # One or more of the last character
    ?      # Zero or one of the last character

Example matches:

    t.e       # t followed by anything followed by e
              # This will match the, tre and tle but not te or tale
    ^f        # f at the beginning of a line
    ^ftp      # ftp at the beginning of a line
    e$        # e at the end of a line
    tle$      # tle at the end of a line
    und*      # un followed by zero or more d characters
              # This will match un, und, undd, unddd (etc)
    .*        # Any string without a newline. This is because the . matches
              # anything except a newline and the * means zero or more of these
    ^$        # A line with nothing in it
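
Patterns like these are easy to try out with re.search(). The sketch below runs a handful of them against short test strings; the strings and the expected results in `checks` are chosen for illustration.

```python
import re

# (pattern, text, expected result) triples based on the examples above
checks = [
    (r"t.e", "the", True),
    (r"t.e", "te", False),
    (r"t.e", "tale", False),
    (r"^ftp", "ftp server", True),
    (r"^ftp", "sftp server", False),
    (r"tle$", "little", True),
    (r"und*", "run", True),
    (r"^$", "", True),
]

results = []
for pattern, text, expected in checks:
    found = re.search(pattern, text) is not None
    results.append(found)
    print(pattern, repr(text), found)
```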

There are even more options. Square brackets are used to match any one of the characters inside them. Inside square brackets a - indicates "between" and a ^ at the beginning means "not":

    [qjk]        # Either q or j or k
    [^qjk]       # Neither q nor j nor k
    [a-z]        # Anything from a to z inclusive
    [^a-z]       # No lower case letters
    [a-zA-Z]     # Any letter
    [a-z]+       # Any non-zero sequence of lower case letters

A vertical bar | represents an "or" and parentheses (...) can be used to group things together:

    jelly|cream    # Either jelly or cream
    (eg|le)gs      # Either eggs or legs
    (da)+          # Either da or dada or dadada or...
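
A brief sketch of character classes, alternation and grouping in use. The test strings are made up, and re.fullmatch() is used where the whole string has to match the pattern.

```python
import re

print(re.findall(r"[^a-z]", "aBc1"))             # Characters that are not a-z
print(re.findall(r"[a-z]+", "one TWO three"))    # Runs of lower case letters
print(re.findall(r"jelly|cream", "ice cream and jelly"))
print(re.fullmatch(r"(eg|le)gs", "legs") is not None)
print(re.fullmatch(r"(da)+", "dadada") is not None)
```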

Here are some more special characters:

    \n        # A newline
    \t        # A tab
    \w        # Any alphanumeric (word) character.
              # The same as [a-zA-Z0-9_]
    \W        # Any non-word character. The same as [^a-zA-Z0-9_]
    \d        # Any digit. The same as [0-9]
    \D        # Any non-digit. The same as [^0-9]
    \s        # Any whitespace character: space, tab, newline, etc
    \S        # Any non-whitespace character
    \b        # A word boundary, outside [] only
    \B        # No word boundary
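
The following sketch applies some of these special sequences to a made-up line of text. Note how the word boundary changes what \d+ finds: the 1 in gene_1 is not a whole word, because the underscore counts as a word character.

```python
import re

line = "gene_1\tstart 42 end 107"

print(re.findall(r"\d+", line))        # Every run of digits
print(re.findall(r"\b\d+\b", line))    # Only digits that form a whole word
print(re.findall(r"\w+", line))        # Runs of word characters
print(re.split(r"\s+", line))          # Split on any run of whitespace
```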

Characters like |, [, ), \ and so on have special meanings in regular expressions. If you want to match one of them literally you have to precede it with a backslash. (The forward slash / is not special in Python REs, although escaping it is harmless.) So:

    \|        # Vertical bar
    \[        # An open square bracket
    \)        # A closing parenthesis
    \*        # An asterisk
    \^        # A caret symbol
    \/        # A slash
    \\        # A backslash

and so on.
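
As a sketch, special characters can be escaped by hand, or the re.escape() function can add the backslashes to a literal string automatically. The example text is made up.

```python
import re

text = "cost (in $) is 3*4"

# Escaped by hand: match a literal "(in $)"
by_hand = re.search(r"\(in \$\)", text)
print(by_hand.group())

# re.escape adds the backslashes needed to match a string literally
literal = "3*4"
escaped = re.search(re.escape(literal), text)
print(escaped.group())
```

re.escape() is most useful when the text to search for comes from user input and may contain special characters.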

Some example REs:

    [01]          # Either "0" or "1"
    \/0           # A division by zero: "/0"
    \/ 0          # A division by zero with a space: "/ 0"
    \/\s0         # A division by zero with a whitespace:
                  # "/ 0" where the space may be a tab etc.
    \/ *0         # A division by zero with possibly some
                  # spaces: "/0" or "/ 0" or "/  0" etc.
    \/\s*0        # A division by zero with possibly some whitespace.
    \/\s*0\.0*    # As the previous one, but with decimal point and maybe
                  # some 0s after it. Accepts "/0." and  "/0.0" and "/0.00"
                  # etc and "/ 0." and "/  0.0" and "/   0.00" etc.
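
The last pattern can be tested as follows. This is a sketch with made-up test strings; the slash is written without a backslash here because it has no special meaning in Python REs.

```python
import re

pattern = re.compile(r"/\s*0\.0*")   # Slash, optional whitespace, 0, a dot, any 0s

tests = ["x/0.", "x/ 0.0", "x /\t0.00", "x/0", "x/5.0"]
outcomes = [pattern.search(text) is not None for text in tests]

for text, outcome in zip(tests, outcomes):
    print(repr(text), outcome)
```

The fourth string fails because the pattern requires a decimal point, and the fifth because the digit after the slash is not 0.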





---

## Search and Findall

---

The <b>match</b> method only checks if the RE matches at the start of a string.

The <b>search</b> method matches anywhere within the string.


In [19]:

import re

p = re.compile(r'\d+')   # Match any digits
m = p.match('Try 1 to 45')  # Apply the re

if m:
    print ('Match found')
else:
    print ('No match')

m = p.search('Try 1 to 45')  # Apply the re
   
if m:
    print ('Search found')
    
else:
    print ('No search found' )


No match
Search found


A biological example of using a regular expression is to search for the presence of a restriction enzyme site in a DNA sequence. For example, to see if the sequence contains the EcoRI site:

In [20]:

import re

p = re.compile(r"GAATTC")  
dna = "ATCGCGAATTCAC"
if p.search(dna):
    print("EcoRI restriction site found!")
else:
    print("EcoRI restriction site not found!")
    

EcoRI restriction site found!


You don't have to produce a re object and call its methods; the re module also provides top-level functions called match(), search(), sub(), etc. 

These functions take the same arguments as the corresponding object method, with the RE string added as the first argument, and still return either None or an object instance. 

In [21]:

import re

dna = "ATCGCGAATTCAC"
if re.search(r"GAATTC", dna):
    print("EcoRI restriction site found!")
else:
     print("EcoRI restriction site not found!")


EcoRI restriction site found!


If a re is to be used more than once the compiled version is probably more efficient but for a single search the module level function may be preferred.

The restriction enzyme AvaII matches 2 sites, GGACC and GGTCC, which could be searched using:

In [22]:

import re

dna = "ATCGCGAATTCAC"
if re.search(r"GGACC", dna) or re.search(r"GGTCC", dna):
    print("AvaII restriction site found!")
else:
    print("AvaII restriction site not found!")


AvaII restriction site not found!


However, this can be improved with a single regular expression:

In [23]:

import re

dna = "ATCGCGAATTCAC"
if re.search(r"GG(A|T)CC", dna):
    print("AvaII restriction site found!")
else:
    print("AvaII restriction site not found!")


AvaII restriction site not found!


If there are multiple single-character options they can be grouped in square brackets rather than using individual ‘|’. The BisI restriction enzyme pattern is GCNGC, where N represents any base. This can be searched with:

In [24]:

import re

dna = "ATCGCGGCTTCAC"
if re.search(r"GC[ATGC]GC", dna):
    print("BisI restriction site found!")
else:
    print("BisI restriction site not found!")


BisI restriction site found!


There are many options for matching characters listed above, for example ‘.’ matches any character except a newline, ‘?’ matches zero or one of the last character, etc.

When used in combination complex patterns can be created. For example, to match full length eukaryotic mRNA sequences:

&nbsp;&nbsp;<b>^ATG[ATGC]{30,1000}A{5,10}$</b>

Matches:

&nbsp;&nbsp;An ATG start codon at the beginning of the sequence<br>
&nbsp;&nbsp;Followed by between 30 and 1000 bases which can be A, T, G or C<br>
&nbsp;&nbsp;Finally, a poly-A tail of between 5 and 10 bases at the end of the sequence<br>
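
A quick way to check a pattern like this is to run it against sequences built to pass or fail each part. The sequences below are artificial and exist only to exercise the pattern.

```python
import re

mrna_pattern = re.compile(r"^ATG[ATGC]{30,1000}A{5,10}$")

body = "GCT" * 12                      # 36 bases of made-up sequence
good = "ATG" + body + "AAAAAAA"        # Start codon, body, poly-A tail of 7
no_start = "CTG" + body + "AAAAAAA"    # Does not begin with ATG
short_tail = "ATG" + body + "AAA"      # Only 3 A bases at the end

for name, seq in [("good", good), ("no_start", no_start), ("short_tail", short_tail)]:
    print(name, mrna_pattern.search(seq) is not None)
```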
    
The re match object has methods and attributes that can be used to return information about the matching string. The most important ones are: 

&nbsp;&nbsp;<b>group()</b>&nbsp;&nbsp;Return the string matched by the RE<br>
&nbsp;&nbsp;<b>start()</b>&nbsp;&nbsp;Return the starting position of the match<br>
&nbsp;&nbsp;<b>end()</b>&nbsp;&nbsp;Return the ending position of the match<br>
&nbsp;&nbsp;<b>span()</b>&nbsp;&nbsp;Return a tuple containing the (start, end) positions of the match<br>	

To extract the text that matched a pattern, the group() method can be used with search:


In [25]:

import re

dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA[ATGC]{3}AC", dna)
# print(m)
print(m.group())


GACGTAC


If you want to extract more than one group, parentheses are used in the pattern and the groups are referred to in numerical order:

In [None]:
# GROUPS

In [2]:

import re

dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("Entire match: " + m.group())
print("First part: " + m.group(1))
print("Second part: " + m.group(2))
# there is no group 3 
# print("Second part: " + m.group(3))


Entire match: GACGTACGTAC
First part: CGT
Second part: GT


The start and end positions can be extracted with the start() and end() methods:

In [27]:
# POSITION

In [32]:

import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("Start position: " + str(m.start()))
print("End position: " + str(m.end()))


Start position: 2
End position: 13


The start and ends of individual groups can also be extracted:

In [4]:

import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("Start position: " + str(m.start()))
print("End position: " + str(m.end()))


print("Group one start: " + str(m.start(1)))
print("Group one end: " + str(m.end(1)))
print("Group two start: " + str(m.start(2)))
print("Group two end: " + str(m.end(2)))


Start position: 2
End position: 13
Group one start: 4
Group one end: 7
Group two start: 9
Group two end: 11


The re function findall() returns a list of all the matching strings. To find all runs of A and T in a DNA sequence that are at least four bases long:

In [10]:

import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
atrun = re.findall(r"[AT]{4,100}", dna)

print(atrun)

# print(m.group())

['ATTATAT', 'AAATTATA']


An alternative, which provides greater flexibility, is finditer(). This returns an iterator of match objects which can be accessed in a loop:

In [None]:

import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{3,100}", dna)
for match in runs:
    run_start = match.start()
    run_end = match.end()
    print("AT rich region from " + str(run_start) + " to " + str(run_end))
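
Because each item is a full match object, the matched text and its coordinates can be collected together. A minimal sketch using the same sequence and pattern (the variable name `regions` is just for illustration):

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"

# Each match object knows its own text and coordinates
regions = [(m.group(), m.start(), m.end())
           for m in re.finditer(r"[AT]{3,100}", dna)]

print(regions)
```

As with slices, the end position is the index just after the last matched base.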


Regular expressions provide extensive ability to search and manipulate text.


---

## Exercise 12

a) Write a script that opens a text file and checks each line for the presence of a particular word. Keep track of the line number and if a word is found print the line number and number of instances of the word.

You can use the entamoeba.txt file for this exercise available on the <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">Exercise Answers</a> page. A suitable word to match may be “and”.

b) Modify the script so that the word to search for is prompted for when the script runs and is entered by the user. You will need to use the input() function for this.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)



---

# Split

---

Strings can also be split using re patterns as the string delimiter: 


In [None]:

import re

p = re.compile(r'\d+') # Set any digits to be the delimiter

l = p.split('There are 2 numbers and 8 words in this string.')

print (l)



With the maximum split number set:
    

In [None]:

import re

p = re.compile(r'\d+') # Set any digits to be the delimiter

l = p.split('There are 2 numbers and 8 words in this string.', 1)

print (l)


For example, suppose you have a consensus DNA sequence that contains ambiguity codes and you want to extract all of the runs of contiguous unambiguous bases. Split can be used to split the DNA string wherever there is a base that isn't A, T, G or C:

In [None]:

import re

dna = "ACTNGCATRGCTACGTYACGATSCGA"

p = re.compile(r"[^ATGC]")
unambig = p.split(dna)

print (unambig)


The ^ at the start of the character class negates it, so the string is split on any character other than A, T, G or C.

If capturing parentheses are used in the re, then everything is returned in the list, including the ambiguity codes where the split was made:
    

In [None]:

import re

dna = "ACTNGCATRGCTACGTYACGATSCGA"

p = re.compile(r"([^ATGC])")
unambig = p.split( dna)

print (unambig)



Note that the module-level function <b>re.split()</b> takes the re as its first argument, but is otherwise the same.


In [None]:

import re

dna = "ACTNGCATRGCTACGTYACGATSCGA"

l = re.split(r"[^ATGC]", dna)
    
print (l)
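
When capturing parentheses are used, the runs and the ambiguity codes alternate in the returned list, so they can be separated again by slicing. The sketch below shows one way to do this with the same sequence; the variable names are chosen for illustration.

```python
import re

dna = "ACTNGCATRGCTACGTYACGATSCGA"

pieces = re.split(r"([^ATGC])", dna)
runs = pieces[0::2]    # Even positions hold the unambiguous runs
codes = pieces[1::2]   # Odd positions hold the captured ambiguity codes

print(runs)
print(codes)
print(max(runs, key=len))   # The longest unambiguous run
```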



---

## Exercise 13

Modify the previous script to take one command line string and split any line that contains that string. Print out each element of the list produced by split on a separate, numbered, line.

You can use the entamoeba.txt file for this exercise available on the <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">Exercise Answers</a> page. 

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)

---

<b>The third part of the Python tutorial is available <a href="PythonTutorial_Pt3.ipynb">here</a>.</b><br><br>


