# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.com

---

# String Methods
There are a number of methods to manipulate strings. These are simple single-word methods that achieve difficult (but often boring) tasks, and they can be incredibly useful. We have seen some of these already because they are so common, but let's see the full set:

Cleaning and printing outputs
* ```.strip()``` cleans off whitespace, or other noise, from the beginning and end of a string (whitespace meaning spaces, tabs (```\t```), or newlines (```\n```)).
* ```.upper()```, ```.title()```, and ```.lower()``` adjust the casing of your string.

Searching and modifying the string
* ```.replace()``` replaces all instances of a character/string in a string with another character/string.
* ```.find()``` searches a string for a character/string and returns the index at which it is first found (or ```-1``` if it is not found).

Making/breaking lists
* ```.split()``` takes a string and creates a list of substrings.
* ```.join()``` takes a list of strings and creates a string.

A few examples of them in action:

In [1]:
my_gene = "ATGTCGACCAATTCCTAACGACCAATGCTCGACCAACGGCaaaaaaaaaaaaaa"
article = "\n  a study to show how one little bit of dna became a gene!   \n\n"

print(my_gene)
print("~~~~~~")

# Format strings
print(my_gene.upper())
print("~~~~~~")
print(article.strip())
print("~~~~~~")
print(article.title())
print("~~~~~~")
print(article.strip().title())

ATGTCGACCAATTCCTAACGACCAATGCTCGACCAACGGCaaaaaaaaaaaaaa
~~~~~~
ATGTCGACCAATTCCTAACGACCAATGCTCGACCAACGGCAAAAAAAAAAAAAA
~~~~~~
a study to show how one little bit of dna became a gene!
~~~~~~

  A Study To Show How One Little Bit Of Dna Became A Gene!   


~~~~~~
A Study To Show How One Little Bit Of Dna Became A Gene!
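
The "other noise" mentioned for ```.strip()``` can be sketched as follows: when given an argument, ```.strip()``` removes any of those characters from both ends of the string instead of whitespace. This is a minimal illustration that reuses the example gene from above; the variable name ```trimmed_gene``` is just a placeholder.

```python
my_gene = "ATGTCGACCAATTCCTAACGACCAATGCTCGACCAACGGCaaaaaaaaaaaaaa"

# Passing characters to .strip() removes them from both ends of the string.
# Here only the lowercase "a" tail is removed: matching is case-sensitive,
# so the uppercase "A" at the start is left alone.
trimmed_gene = my_gene.strip("a")

print(trimmed_gene)
print(len(my_gene), len(trimmed_gene))
```

Because the characters are only removed from the ends, any lowercase ```a``` in the middle of a string would be left untouched.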


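Let's use ```.split()``` to separate out the DNA code. We can use anything as a split delimiter (usually a comma (```,```) or tab (```\t```) character), but here let's be bioinformatic and split on a three-base motif, ```CGA```. This motif is used purely as an example delimiter; the real stop codons are ```TAA```, ```TAG``` and ```TGA```.

In [2]:
# Split the sequence at every occurrence of the CGA motif
splitted_gene = my_gene.split("CGA")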
print(splitted_gene)

['ATGT', 'CCAATTCCTAA', 'CCAATGCT', 'CCAACGGCaaaaaaaaaaaaaa']
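
For the more common tab delimiter, a minimal sketch might look like the following. The record and its field values are invented for illustration and are not taken from any real dataset.

```python
# Hypothetical tab-separated record: gene name, chromosome, length
record = "geneA\tchr1\t1500"

# Split on the tab character to get one list element per column
fields = record.split("\t")
print(fields)

# With no argument, .split() splits on any run of whitespace
# and ignores leading/trailing whitespace
codons = "  ATG   CCA\tTAA \n".split()
print(codons)
```

Note that every element is still a string, so a numeric column such as the length would need converting with ```int()``` before doing any arithmetic on it.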


In [3]:
# Output the second element
middle_CDS = splitted_gene[1]
print(middle_CDS)

# For fun let's use the replace function to convert to RNA
converted_middle_CDS = middle_CDS.replace("T", "U")
print(converted_middle_CDS)

# Note that .replace() returns a new string; the original string is unchanged
print(middle_CDS)

CCAATTCCTAA
CCAAUUCCUAA
CCAATTCCTAA
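
The examples so far have not used ```.find()``` or ```.join()```, so here is a short sketch of both using the same gene. The variable names are placeholders chosen for this illustration.

```python
my_gene = "ATGTCGACCAATTCCTAACGACCAATGCTCGACCAACGGCaaaaaaaaaaaaaa"

# .find() returns the index of the first match (counting from 0)
first_hit = my_gene.find("CGA")
print(first_hit)

# If the substring is not present, .find() returns -1
no_hit = my_gene.find("GGGG")
print(no_hit)

# .join() is the reverse of .split(): the string you call it on
# is placed between each element of the list
pieces = my_gene.split("CGA")
rejoined = "CGA".join(pieces)
print(rejoined == my_gene)

# Any string can be used as the "glue"
print("-".join(pieces))
```

Splitting on a delimiter and joining with the same delimiter gives back the original string, which is a handy way to check that nothing was lost along the way.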


# Exercises

We have a collection of multiple sequence alignments but they are fragmented with lots of gaps.

1. Create a loop to go through the dictionary of sequences. Use ```.split()``` to cut each sequence where there are gaps indicated by ```-``` characters.
2. Output the longest fragment from each sequence, and its length. You could do this using a loop (for each fragment, ```if length > previous``` ....)
3. Print the average length of the fragments in each sequence. This may be more difficult than first imagined. Why? How could you solve that?

**Extension:** There are some more advanced functions that can be more helpful (and often faster) than running loops:
- ```max(list, key=len)``` allows you to find the longest string in a list. Try replacing your internal loop with that!
- ```filter(None, list)``` can be used to remove empty strings from a list (using a loop is fine too).
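
As a hint for how these two functions behave, here is a small sketch on a made-up toy sequence (not one of the exercise sequences). It also shows why the average can be misleading: consecutive gaps produce empty strings when splitting.

```python
# Toy sequence invented for this illustration
toy_seq = "ATGC--TTGACA-GG"

# Two gaps in a row leave an empty string in the list
fragments = toy_seq.split("-")
print(fragments)

# filter(None, ...) drops "falsy" items, which includes empty strings.
# It returns an iterator, so wrap it in list() to get a list back.
non_empty = list(filter(None, fragments))
print(non_empty)

# max() with key=len compares the items by their length
longest = max(non_empty, key=len)
print(longest, len(longest))

# The empty strings would drag the average down if left in
average_with_empties = sum(len(f) for f in fragments) / len(fragments)
average_clean = sum(len(f) for f in non_empty) / len(non_empty)
print(average_with_empties, average_clean)
```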

In [26]:
sequences = {
    "ID1": "TTTCAAGATCCTGCAACACCAGT--TATGGAAGGTATTATAA--ACTTTCATCATGATTTAATGTTTTTTTTAATTATTGTAACTGTTTTTGGATGTTATTTAGAGTTATTATTCTTGTTTTG---AAAAAAAAAATCC",
    "ID2": "TTTCAAGA-TCCTGACAACACCAGTTATGGAAGGTATTATAAACTTTCATCATGATTTAATGTTTTTTTTAAT-------GTTTTTGTTTGTTGGATGTTATTTATATT--TCAAAATTTTTGATGAAAAAAAAAATCC",
    "ID3": "TTTCAAGATCCTGCA--ACACCATTGTTATGGAAGGTATTATAAACTTTCATCATGATTATGTTTTTTTTAATTATTGTA-------TGTGGTGGATGTTA-TTTAGAGTTATTATTCTTTTTGATGAAAAAAAAAATCC",
    "ID4": "TTTCAAGCTGCAACACCAGGTTATGGAAGGTATTATA-AACTAATTCATCATGATTTAATGT--TTTTTTTAGATTATTGTAACTGTTTTTAGTTTGTTGGATGTTA---ATTATTCTTTGTTGATGA---AAAAAATCC",
    "ID5": "TTTCAAGATCCTGCAACACCAGTTATGGAAGGTGGATTATAAACTTTCATCATGATTTAATGTTTTT-TTAATTATTGTAACTGTTTATTGTTGGATGTTATTTAGAGTTATTATTCTTTTTGATGAAAAAAAAAATCC",
}
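# Your code here

# all_frags must be a dictionary: a list has no .update() method,
# so starting with [] raises an AttributeError
all_frags = {}

for ID, seq in sequences.items():
    fragments = seq.split("-")

    # Keep only the lengths of non-empty fragments
    # (consecutive gaps such as "--" produce empty strings)
    fragment_list = []
    for frag in fragments:
        if len(frag) > 0:
            fragment_list.append(len(frag))

    # Store the list of fragment lengths under this sequence ID
    all_frags[ID] = fragment_list

print(all_frags)

print(all_frags.get("ID3"))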