# Working with Strings

## Indexing and slicing

Just like lists, you can access the characters in a string using indices and slices.

In [41]:
sentence = "thisis3asentence0.\\?:;\\(\\),"

In [2]:
sentence[-1]

','

In [3]:
sentence[10:]

'ntence0.\\?:;\\(\\),'
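
As a further sketch, the cell below uses a made-up string (`word`, not part of the original notebook) to show negative indices, open-ended slices and a step value:

```python
# Hypothetical example string, chosen only for illustration
word = "bioinformatics"

first = word[0]             # first character
last = word[-1]             # negative indices count from the end
prefix = word[:3]           # from the start up to (not including) index 3
suffix = word[-4:]          # the last four characters
every_other = word[::2]     # every second character
reversed_word = word[::-1]  # a step of -1 walks the string backwards

print(first, last, prefix, suffix, every_other, reversed_word)
```

A slice never raises an `IndexError`, even when its bounds run past the end of the string; a single index does.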

## Operators and functions
`min()`, `max()`, `sorted()` and many other built-in functions can be used on strings. They compare characters by their Unicode code points, so what they return will not always be intuitive! Redefine the `sentence` variable and test them.

In [27]:
min(sentence)

'('

In [19]:
max(sentence)

't'

In [42]:
print(sorted(sentence))

['(', ')', ',', '.', '0', '3', ':', ';', '?', '\\', '\\', '\\', 'a', 'c', 'e', 'e', 'e', 'h', 'i', 'i', 'n', 'n', 's', 's', 's', 't', 't']
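
The ordering becomes less surprising once you know that Python compares characters by their Unicode code point, which `ord()` reveals. A small sketch, using an illustrative string (`sample`) that is not from the notebook:

```python
# Hypothetical string mixing a space, punctuation, a digit and both letter cases
sample = "aB3 ."

# ord() gives the Unicode code point that min(), max() and sorted() compare
for ch in sorted(sample):
    print(repr(ch), ord(ch))

smallest = min(sample)   # the space has the lowest code point here
largest = max(sample)    # lowercase letters come after digits and uppercase
```

In this sample the order is space, punctuation, digit, uppercase letter, lowercase letter.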


In [43]:
'u' in sentence

False

In [44]:
"ten" in sentence

True

In [45]:
"abc" == "ABC"

False

In [46]:
"abc".casefold() == "ABC".casefold()

True

In [16]:
sentence.upper()

'THISIS3ASENTENCE0.\\?:;\\(\\),'

In [48]:
sentence.replace("This", "That")

'thisis3asentence0.\\?:;\\(\\),'

What happens if you use "this" in the line above? 
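
Two properties are worth keeping in mind when experimenting with `replace()`: the search is case-sensitive, and the method returns a new string rather than changing the original. The sketch below uses a hypothetical string (`greeting`) so as not to give away the answer for `sentence`:

```python
greeting = "Hello world"   # hypothetical string for illustration

changed = greeting.replace("world", "there")    # match found: new string returned
unchanged = greeting.replace("World", "there")  # no match: the search is case-sensitive

print(changed)
print(unchanged)
print(greeting)   # the original is untouched, because strings are immutable
```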

In [19]:
"abc" + "ABC"

'abcABC'

In [50]:
words = sentence.split("s")
words

['thi', 'i', '3a', 'entence0.\\?:;\\(\\),']

In [24]:
'|_|'.join(words)

'thi|_|i|_|3a|_|entence0.\\?:;\\(\\),'

## f-strings

You can ask Python to output a mix of variables and text, and format them a certain way, using f-strings. Note the `f` preceding the string on the last line! This is how you tell Python that what follows is an f-string rather than a standard string.

In [53]:
a_int = 4
a_float = 3.45

f"We have an int: {a_int}, and a float: {a_float}."

'We have an int: 4, and a float: 3.45.'

In [57]:
f"We can specify how we want them formatted: {a_int:3.8f}, {a_float:4.0f}"

'We can specify how we want them formatted: 4.00000000,    3'

Try changing the formatting above!
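
A few variations to start from, as a sketch; the variable names (`value`, `count`) and numbers are made up for illustration. The part after the colon follows the pattern `[align][width][.precision][type]`:

```python
value = 3.14159
count = 42

right = f"{value:10.2f}"   # width 10, 2 decimals; numbers are right-aligned by default
left = f"{value:<10.2f}"   # '<' left-aligns within the width
zero = f"{count:05d}"      # pad an integer with leading zeros to width 5
sci = f"{value:.3e}"       # scientific notation with 3 decimals
pct = f"{0.256:.1%}"       # multiply by 100 and append a percent sign

for text in (right, left, zero, sci, pct):
    print(repr(text))      # repr() makes the padding spaces visible
```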

## String methods
Here is a rather trivial example combining four string methods: **replace()**, **strip()**, **casefold()** and **format()**. I've deliberately omitted comments; see if you can work out what is happening at each step.

In [59]:
s1 = 'a)  essen  '
s2 = 'Eßen'

s1 = s1.replace('a)', '') 
s1 = s1.strip()           
if s1.casefold() == s2.casefold():
    print('{} is the same as {}'.format(s1, s2))

essen is the same as Eßen
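
To see why the example uses `casefold()` rather than `lower()`, it may help to compare the two on the same string. A minimal sketch (the names `lowered` and `folded` are introduced here):

```python
s2 = 'Eßen'

lowered = s2.lower()     # lower() keeps the 'ß' character
folded = s2.casefold()   # casefold() expands 'ß' to 'ss'

print(lowered, folded)
print(lowered == 'essen', folded == 'essen')
```

`casefold()` is the more aggressive of the two and is intended for caseless comparisons like the one above.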


Here is another example using methods **count()** and **split()**:

In [60]:
s = 'ILE TRP GLU LEU LYS LYS ASP VAL'

# Count the number of occurrences of LYS (lysine)
print('Number of lysines:', s.count('LYS'))

# Count the number of amino acids in the string
aa_list = s.split()
print('Total number of amino acids:', len(aa_list))

Number of lysines: 2
Total number of amino acids: 8


Now take `aa_list` and turn it back into the original string using the **join()** method:

In [63]:
# write your code here
print(aa_list)
" ".join(aa_list)

['ILE', 'TRP', 'GLU', 'LEU', 'LYS', 'LYS', 'ASP', 'VAL']


'ILE TRP GLU LEU LYS LYS ASP VAL'

## String extraction

A common bioinformatics task is to extract the data you want (here the species names) from the full set of data in a file. Here is the file:

In [64]:
%%bash
cat ../data/pdb_species.txt

PDB 1BRTA: Homo sapiens
PDB 1JWQC: Mus musculus
PDB 2SDAB: Homo sapiens
PDB 1NRSA: Homo sapiens
PDB 1QTZA: Homo sapiens
PDB 1GGEA: Mus musculus
PDB 1MNUB: Bos taurus
PDB 1PASA: Homo sapiens
PDB 3TEAC: Mus musculus
PDB 1FFWA: Sus scrofa
PDB 1JYKA: Homo sapiens
PDB 1RELA: Homo sapiens


The following version using a **slice** only works if the position at which the species name starts is guaranteed to be fixed:

In [65]:
species_fname = '../data/pdb_species.txt'
lines = []
with open(species_fname, 'r') as f:
    lines = f.read().splitlines()

all_species = []
for s in lines:
    species = s[11:]
    all_species.append(species)
print(all_species)

['Homo sapiens', 'Mus musculus', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Mus musculus', 'Bos taurus', 'Homo sapiens', 'Mus musculus', 'Sus scrofa', 'Homo sapiens', 'Homo sapiens']


This alternative version using **split()** only works if the name of the species is guaranteed to consist of the last two words in the line:

In [66]:
all_species2 = []
for s in lines:
    tokens = s.split()
    species = tokens[-2] + ' ' + tokens[-1]
    all_species2.append(species)
print(all_species2)

['Homo sapiens', 'Mus musculus', 'Homo sapiens', 'Homo sapiens', 'Homo sapiens', 'Mus musculus', 'Bos taurus', 'Homo sapiens', 'Mus musculus', 'Sus scrofa', 'Homo sapiens', 'Homo sapiens']


To handle data with less formulaic formats, one needs to use **regular expressions**, the subject of session 5.
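
For lines like these, where a fixed separator sits between the identifier and the species name, `str.partition()` may already be enough. The sketch below assumes that `': '` first appears directly before the species name; `sample_lines` and `all_species3` are names introduced here, and the two lines are copied from the file listing above:

```python
# Two sample lines copied from the file listing above
sample_lines = ['PDB 1BRTA: Homo sapiens', 'PDB 1MNUB: Bos taurus']

all_species3 = []
for line in sample_lines:
    # partition() splits on the first occurrence of the separator and
    # returns a 3-tuple: (text before, separator, text after)
    _, _, species = line.partition(': ')
    all_species3.append(species)
print(all_species3)
```

Unlike the slice and `split()` versions, this does not depend on the length of the identifier or on the number of words in the species name, only on the separator.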

## Formatting numbers
Extend this script so that it prints out the numbers in `numbers.txt` a) with a single digit after the decimal point, and b) with the decimal points vertically aligned:

In [110]:
numbers_fname = '../data/numbers.txt'
numbers = []
with open(numbers_fname, 'r') as f:
    numbers = f.read().splitlines()

for n in numbers:
    n = float(n)
    print(f"{n:7.1f}")

print(numbers)

    1.6
  -32.8
   -4.1
   25.3
    8.0
   31.3
  780.5
 -422.3
   87.6
  928.7
 -187.0
 1153.0
    4.2
    0.9
    5.7
-8205.9
-2749.7
    5.9
 2347.1
   39.2
   61.5
['1.63', '-32.78', '-4.1', '25.307', '8.0', '31.33333', '780.4592', '-422.343', '87.612', '928.7', '-187.0', '1153.04', '4.2', '0.932', '5.65', '-8205.9', '-2749.655', '5.912', '2347.105', '39.2', '61.5']


In [83]:
help(round)

Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.
    
    The return value is an integer if ndigits is omitted or None.  Otherwise
    the return value has the same type as the number.  ndigits may be negative.
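
`round()` and f-string formatting are easy to confuse, so here is a short comparison as a sketch. The value `1153.04` is taken from the list printed above; the names `rounded`, `as_text` and `ties` are introduced here:

```python
n = 1153.04

rounded = round(n, 1)   # still a float, so it carries no width or padding
as_text = f"{n:7.1f}"   # a string, padded on the left to 7 characters

print(type(rounded).__name__, rounded)
print(type(as_text).__name__, repr(as_text))

# With ndigits omitted, round() returns an int and rounds ties
# to the nearest even integer
ties = [round(0.5), round(1.5), round(2.5)]
print(ties)
```

Because alignment is a property of the printed text rather than of the number, the format specification is the tool that lines up the decimal points.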



## Strings and lists
Given the string `Session` and a string containing session labels (`A`, `B`, `C`, etc.), can you generate a list containing the desired session titles (`Session A`, `Session B`, `Session C`, etc.) and print it out? 

*Hint:* use the string concatenation (`+`) operator.    

In [112]:
prefix = 'Session'
session_labels = 'ABCDEFG'

session_labels.split()


['ABCDEFG']
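
`split()` with no argument splits on whitespace, so a string without spaces comes back as a single-item list, as the output above shows. One possible approach, offered as a sketch rather than the intended solution, is to loop over the characters directly; `session_titles` is a name introduced here:

```python
prefix = 'Session'
session_labels = 'ABCDEFG'

session_titles = []
for label in session_labels:   # iterating over a string yields its characters
    session_titles.append(prefix + ' ' + label)
print(session_titles)
```

The same list could also be built with a list comprehension: `[prefix + ' ' + label for label in session_labels]`.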

## Working with sequences

This code calculates the **sequence identity** (i.e. the percentage of positions at which the characters match) between two sequences of the same length. The **enumerate()** function arose in the section on lists, but here it is applied to the characters in a string. Make sure you understand how it works.

In [58]:
seq1 = 'ACWQTEDGSSAKLCRYIPRMTASWFSERAHIKLTYRV'
seq2 = 'ACWQTFDGDSAKLCRYIRRMTASWFSFRAHIKIYYRV'

same_aa_count = 0
if len(seq1) == len(seq2):
    for i, aa in enumerate(seq1):
        if aa == seq2[i]:
            same_aa_count += 1
    print('Sequence identity (%) =', (same_aa_count / len(seq1)) * 100)
else:
    print('Sequences are not the same length')

Sequence identity (%) = 83.78378378378379
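
An alternative way to walk through two sequences in step is `zip()`, shown here as a sketch using the same two sequences; `matches` and `identity` are names introduced for this example:

```python
seq1 = 'ACWQTEDGSSAKLCRYIPRMTASWFSERAHIKLTYRV'
seq2 = 'ACWQTFDGDSAKLCRYIRRMTASWFSFRAHIKIYYRV'

# zip() pairs up the characters found at the same position in each string
matches = sum(1 for a, b in zip(seq1, seq2) if a == b)
identity = matches / len(seq1) * 100
print(f'Sequence identity (%) = {identity:.1f}')
```

Note that `zip()` silently stops at the end of the shorter sequence, so the length check from the cell above is still needed when the inputs might differ in length.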
