
![Py4Eng](img/logo.png)

# Sequences: strings, lists, for loops

# Strings

Strings are ordered collections of _characters_. 

_Ordered collections_ means that elements are numbered with __indexes__: 0, 1, 2, 3, 4...  

### Note that the first index is 0, __not__ 1!

We can create a new string using single or double quotes, `'` or `"`, as long as the same kind of quote is used on both sides.

In [None]:
x = "Jupyter"
y = 'I love Python'
print(x)
print(y)

Strings are objects of type `str`:

In [None]:
type(x)

We can concatenate (join end to end) strings:

In [None]:
print(x + "2021")

We can convert strings to numbers and vice versa (when the content is appropriate):

In [None]:
x = "4"
y = int(x)
print("y + 1 =", y + 1)

If we skip the conversion and try to add a number to a string, we get an error message (a `TypeError`)...

In [None]:
print("x + 1 =", x + 1)

In [None]:
x = str(y)
print("x =", x)

In [None]:
x = "3.14"
y = float(x)
print("y*2 =", y * 2)

Strings are text but can represent other things, too. For example, DNA sequences.

#### We can concatenate strings:

In [None]:
upstream = "AAA"
downstream = "GGG"
dna = upstream + "ATG" + downstream
print(dna)

We can find the length of a string using the built-in function `len`:

In [None]:
n = len(dna)
print("The length of the DNA variable is", n)

dna = dna + "AGCTGA"
print("Now it is", len(dna))

With strings, too, we can use the *syntactic sugar* `dna += x` as shorthand for `dna = dna + x`:

In [None]:
print(dna)
dna += "AGCTGA"
print(dna)

As we've seen, it also works with numbers and other operators:

In [None]:
x = 10
x *= 7
print(x)

## Access: Indexing

![string photo](https://www.w3resource.com/w3r_images/python-string-slice.png)

Each character in a string has an index.

We can access specific characters (sequence items) in a string using square brackets (`[]`).

#### Note that Python uses **zero-based** indexing: the first element has index 0.

In [None]:
text = "A musician wakes from a terrible nightmare."

In [None]:
print(text[0])
print(text[5])

Python also supports reverse indexing using negative numbers.

The last element has index -1, the one before it has index -2, and so on.

In [None]:
print(text[-1])
print(text[-4])
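To make the relation between negative and non-negative indexes concrete, here is a small sketch. It relies only on `len`, introduced above: the index `-k` refers to the same character as the index `len(text) - k`.

```python
text = "A musician wakes from a terrible nightmare."

# A negative index -k points at the same character as len(text) - k
last_by_negative = text[-1]
last_by_length = text[len(text) - 1]
print(last_by_negative, last_by_length)

fourth_from_end = text[-4]
same_character = text[len(text) - 4]
print(fourth_from_end, same_character)
```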

## Access: Slicing

We can extract subsets of a string by using __slicing__, with the corresponding indexes.  
Remember: indexes start from **0**!

We can get a range of indexes using _\[start : end : step\]_

This returns the characters from index `start` (**included**) up to index `end` (**not included**), in jumps of size `step`.

If `step` is not provided, Python uses `1` as the default.

In [None]:
text = "A musician wakes from a terrible nightmare."

# get the 3rd to 8th letters
print(text[2:8])

Notice that the _start_ position is included, but not the _end_ position. 

We actually take the characters with indexes 2, 3, 4, 5, 6, 7.

And what do we get?

In [None]:
type(text[2:8])
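Two consequences of the "start included, end excluded" rule are worth seeing in code. The following sketch shows that a slice with step 1 has exactly `end - start` characters, and that, unlike indexing, a slice that runs past the end of the string does not raise an error. The numbers 35 and 1000 are arbitrary values chosen for illustration.

```python
text = "A musician wakes from a terrible nightmare."

start, end = 2, 8
piece = text[start:end]
# Because `end` is excluded, the slice has exactly end - start characters
print(piece, len(piece))

# Slicing past the end of the string is allowed:
# the slice simply stops at the last character
tail = text[35:1000]
print(tail)
```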

There are shortcuts for taking the first and last characters:

In [None]:
# get the first 5 letters
print(text[0:5])
# or simply:
print(text[:5])

# get the 4th letter (index 3) through the last:
print(text[3:len(text)])
print(text[3:])

# last 3 letters
print(text[-3:])

In [None]:
# get every second letter starting from the first
print(text[0:len(text):2])
# or simply
print(text[::2])

# get every second letter starting from the last and going backwards
print(text[::-2])

In [None]:
# reverse the string - very useful!
print(text[::-1])

#### Note that creating a slice does not change the original string:

In [None]:
a = text[::-1]
print(a)
print(text)

#### Changing (mutating) a character within a string is not possible - string objects are immutable

In [None]:
text[5] = 'X'
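Since a string cannot be changed in place, the usual approach is to build a *new* string. A minimal sketch, using only slicing and concatenation from above (the variable name `changed` is ours):

```python
text = "A musician wakes from a terrible nightmare."

# Build a new string from the parts before and after index 5,
# with 'X' in place of the original character
changed = text[:5] + "X" + text[6:]
print(changed)
print(text)  # the original string is unchanged
```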

## Exercise

The sequence below (named _seq_) consists of 20 characters. 

1. Print the 2nd and 7th characters.
2. Print the 2nd character from the end.
3. Slice the first half of the sequence.  
4. Slice the second half of the sequence.  
5. Slice the middle 10 characters

In [None]:
seq = "CAAGTAATGGCAGCCATTAA"


## Formatting strings

There are several ways to do this:
1. [`printf`-style formatting](https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting), using `%`
2. [String formatting](https://docs.python.org/3.7/library/string.html#format-string-syntax) using the `format` method
3. [Formatted string literals](https://docs.python.org/3/reference/lexical_analysis.html#f-strings) (f-strings), using the `f` prefix
4. [Template strings](https://docs.python.org/3/library/string.html#template-strings)
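This section concentrates on options 2 and 3. For orientation only, here is a brief sketch of options 1 and 4, producing the same message both ways. The names `name`, `apples` and `greeting` are made up for this example; `Template` comes from the standard `string` module.

```python
from string import Template

name = "Adam"
apples = 3

# 1. printf-style formatting with the % operator:
#    %s marks a string placeholder, %d an integer placeholder
printf_message = "Hello %s, you have %d apples" % (name, apples)
print(printf_message)

# 4. Template strings with $-placeholders
greeting = Template("Hello $name, you have $apples apples")
template_message = greeting.substitute(name=name, apples=apples)
print(template_message)
```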


### `format` method

The `format` method works on a string template, with placeholders marked by curly brackets (who said Python doesn't like curly brackets?). 

The arguments passed to the method fill the placeholders, in order:

In [None]:
message = "Hello {}, would you like {} or {} apples?"
message = message.format("Adam Price", 1, 2)
print(message)

We can also specify which argument replaces each placeholder using indices:

In [None]:
message = 'Hello {0}, my name is {1}, if your name is not {0}, please let me know'
message = message.format('Adam', 'Wendy')
print(message)

Finally, we can also use named placeholders and specify the values as keyword arguments:

In [None]:
message = 'Hello {guest}, my name is {host}, if your name is not {guest}, please let me know'
message = message.format(guest='Adam', host='Wendy')
print(message)

Format automatically handles numbers and other string conversions:

In [None]:
print("Snow White and the {} dwarfs".format(7))
print("Snow White and the {} dwarfs".format(7.0))
print("Snow White and the {} dwarfs".format(7+0j))

But we can specify how to convert numbers, if we want. 

For example, we can specify the number of decimal digits we want:

In [None]:
x = 7.0554332
print("Snow White and the {:.0f} dwarfs".format(x))
print("Snow White and the {:.4f} dwarfs".format(x))
print("Snow White and the {:.6f} dwarfs".format(x))

See all formatting options in the [docs](https://docs.python.org/3.6/library/string.html#format-string-syntax).
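Besides the number of decimal digits, the format specification can control separators, alignment, padding and percentages. The sketch below shows a few of these options; the values are arbitrary and chosen only for illustration.

```python
value = 1234567.891

# Thousands separator with two decimal digits
with_commas = "{:,.2f}".format(value)
print(with_commas)

# Right-align text inside a field 10 characters wide
right_aligned = "[{:>10}]".format("ape")
print(right_aligned)

# Pad an integer with leading zeros to a width of 5
zero_padded = "{:05d}".format(42)
print(zero_padded)

# Show a fraction as a percentage with one decimal digit
percentage = "{:.1%}".format(0.256)
print(percentage)
```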

Python 3.6 added a new string formatting option using formatted string literals, or [f-strings](https://docs.python.org/3/reference/lexical_analysis.html#f-strings).

In [None]:
name = "John Levin"
age = 31
address = "42 Main st., Sunnyvale, CA"

print(f"His name is {name}, he is {age} and he lives in {address}.")

Note the `f` before the printed string!
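The curly brackets of an f-string can hold any expression, not just a variable name, and they accept the same format specifications as `format`. A short sketch (the `height` value is hypothetical, added only for this example):

```python
name = "John Levin"
age = 31
height = 1.8349  # hypothetical value, for illustration only

# An arithmetic expression inside the placeholder
next_year = f"{name} will be {age + 1} next year."
print(next_year)

# A format specification after the colon, as with the format method
height_text = f"His height is {height:.2f} m."
print(height_text)

# Function and method calls also work
name_text = f"{name.upper()} has {len(name)} characters in his name."
print(name_text)
```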

## Exercise: bottles of beer

Write a template and fill it with values using either `format` or f-strings to produce the following text:

```
3 bottles of beer on the wall, 3 bottles of beer.
Take one down, pass it around, 2 bottles of beer on the wall...
2 bottles of beer on the wall, 2 bottles of beer.
Take one down, pass it around, 1 bottles of beer on the wall...
1 bottles of beer on the wall, 1 bottles of beer.
Take one down, pass it around, 0 bottles of beer on the wall...
```

## String methods

We can change a string to lowercase:

In [None]:
text = 'A Musician Wakes From a Terrible Nightmare.'
text = text.lower()
print(text)

and to uppercase:

In [None]:
text = text.upper()
print(text)

We can replace characters:

In [None]:
dna = 'AAAATGGGGAGCTGAAGCTGA'
rna = dna.replace("T", "U")
print(dna)
print(rna)

#### Count
We can count characters using the method `some_string.count(character)`.

For example, let's count the number of histidine (`H`) and proline (`P`) residues in the [amino-acid](http://upload.wikimedia.org/wikipedia/commons/a/a9/Amino_Acids.svg) sequence of the [Human Insulin](http://www.uniprot.org/blast/?about=P01308) protein:

In [None]:
insulin = 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN'
print("# of histidine:", insulin.count('H'))
print("# of proline:", insulin.count('P'))

#### Find and Index
We can find the position of a substring within a string using the `index` method.
For example, we can look for the character `D` in the insulin sequence.

In [None]:
pos = insulin.index('D')
print(pos)

In [None]:
type(pos)

In [None]:
print(insulin[pos])

The result is the index (position) of the first `D` found in the sequence.
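Strings also have a `find` method. The sketch below compares the two: both return the position of the first match, but when there is no match `find` returns `-1`, whereas `index` raises an error. Both accept an optional start position, used here to look for the next `D`. The letter `X` is simply a character that does not occur in this sequence.

```python
insulin = 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN'

# When the substring exists, find and index give the same position
first_d = insulin.find('D')
print(first_d, insulin.index('D'))

# When the substring does not exist, find returns -1 instead of raising an error
missing = insulin.find('X')
print(missing)

# The optional second argument is the position to start searching from
second_d = insulin.find('D', first_d + 1)
print(second_d)
```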

We can also look for longer substrings, representing motifs. 

For example, let's find the position of the Insulin [B-chain](http://www.uniprot.org/blast/?about=P01308[25-54]) - a specific subsequence - in the entire protein sequence:

In [None]:
b_chain = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"
position = insulin.index(b_chain)
print("Position:", position)

In [None]:
print(len(b_chain))

In [None]:
# slicing the B-chain motif
found = insulin[position : position + len(b_chain)] # slicing (notice the ':')

In [None]:
print(b_chain == found)
print("Original:", b_chain)
print("Found:   ", found)

#### Split

We can split a string on every occurrence of a separator character:

In [None]:
names = "banana,ananas,potato,tomato"
foods = names.split(",")
print(foods)

What do we get?

In [None]:
type(foods)
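The reverse operation is the string method `join`, which glues a list of strings back together. Note that it is called on the separator, not on the list. A minimal sketch:

```python
names = "banana,ananas,potato,tomato"
foods = names.split(",")

# join is called on the separator and receives the list of strings
rejoined = ",".join(foods)
print(rejoined == names)

# Any separator string can be used when joining
with_bars = " | ".join(foods)
print(with_bars)
```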

# Lists

Lists are similar to strings in being sequential, only they can contain **any type of data**, not just characters. 

They are also mutable (we'll get back to that distinction).

Lists could even include mixed variable types.

We define a list by placing the elements inside square brackets `[ ]`, separated by commas `,`.

In [None]:
# a list of strings
apes = ["Human", "Gorilla", "Chimpanzee"]
print(apes)

![Gorilla](http://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Western_Lowland_Gorilla_at_Bronx_Zoo_2_cropped.jpg/338px-Western_Lowland_Gorilla_at_Bronx_Zoo_2_cropped.jpg)

In [None]:
# a list of numbers
nums = [7, 13, 2, 400]
print(nums)

In [None]:
# a mixed list
mixed = [12, 'Mouse', True]
print(mixed)

You can access list elements just like strings, using indexes (starting from 0):

In [None]:
print(apes[0])
print(apes[-1])

Lists are dynamic and mutable - you can append, remove and insert into them. 

This is done using _list methods_.

We can access and change list elements:

In [None]:
new_apes = apes.copy() # make a copy of the apes list
print(new_apes)
new_apes[2] = 'Bonobo'
print(new_apes)

This __does NOT__ work with strings though...

In [None]:
print(dna)
dna[5] = 'G'

This is because strings are **immutable** whereas lists are **mutable**. We'll get back to this notion soon.

### More list methods

Add element to the end of the list:

In [None]:
apes = ["Human", "Gorilla", "Chimpanzee"]
print(apes)
apes.append("Macaco") # this changes the apes list
print(apes)

Add **elements** to the end of the list:
* Notice that the given input is a list

In [None]:
print(apes)
apes.extend(["Orangutan", "Gibbon"])
print(apes)

Insert element at a given index:

In [None]:
print(apes)
apes.insert(2, "Kofiko")
print(apes)

Remove element from list:

In [None]:
print(apes)
apes.remove("Human")
print(apes)

To remove a list item by index we can use the `pop` method, which does two things:
* returns the value at the given index
* removes that item from the list

In [None]:
print(apes)
out_ape = apes.pop(3)
# if we only want to remove the item at index 3, we can skip the assignment
# and just call apes.pop(3)

print(out_ape)
print(apes)

We can concatenate lists, just like strings:

In [None]:
con_apes = apes + ["Orangutan", "Baboon"]
print(apes)
print(con_apes)

![Orangutan](http://upload.wikimedia.org/wikipedia/commons/thumb/b/be/Orang_Utan%2C_Semenggok_Forest_Reserve%2C_Sarawak%2C_Borneo%2C_Malaysia.JPG/220px-Orang_Utan%2C_Semenggok_Forest_Reserve%2C_Sarawak%2C_Borneo%2C_Malaysia.JPG)

Searching in lists is done using `index` (not `find`):

In [None]:
i = apes.index('Kofiko')
print(i)
print(apes[i])

If the value is not found, an error (a `ValueError`) is raised. We'll learn how to deal with exceptions in another session. 

For now, we can just find it with a loop, which we will learn in a few minutes.

You can also check if something is in a list (works as well for strings):

In [None]:
if 'Panda' in apes:
    print('Panda is an ape')
else:
    print('Panda is not an ape')

In [None]:
print('N' in 'DNA')
print('B' in 'DNA')

print('DN' in 'DNA')

print(1 in [1,2,3])
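Combining `in` with `index` gives a simple way to search a list without risking an error. The sketch below uses a short hypothetical list so that it runs on its own, and uses `-1` as our own "not found" marker (mirroring what `str.find` does; lists have no such built-in behavior).

```python
# hypothetical list, defined here so the example runs on its own
apes = ["Gorilla", "Kofiko", "Chimpanzee"]

name = "Kofiko"
if name in apes:
    kofiko_position = apes.index(name)  # safe: we checked that the value is there
else:
    kofiko_position = -1                # our own "not found" marker
print(name, kofiko_position)

name = "Panda"
if name in apes:
    panda_position = apes.index(name)
else:
    panda_position = -1
print(name, panda_position)
```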

## Lists of numbers

Suppose we have a list of experimental measurements and we want to do basic statistics: 

count the number of results, calculate the average, and find the maximum and minimum.

In [None]:
measurements = [33,55,45,87,88,95,34,76,87,56,45,98,87,89,45,67,45,67,76,73,33,87,12,100,77,89,92]

count = len(measurements)
avg = sum(measurements) / len(measurements)
maximum = max(measurements)
minimum = min(measurements)

print(count, "measurements with average", avg, "maximum", maximum, "minimum", minimum)

We'll see a better way to work with sequences of numbers, though, using NumPy.

## Sorting lists
  
We can sort lists using the built-in `sorted` function, which returns a new sorted list and leaves the original unchanged.  
If the list is made __entirely__ of numbers, then sorting is straightforward:

In [None]:
sorted_measurements = sorted(measurements)
print(measurements)
print()
print(sorted_measurements)

A list of strings will be sorted lexicographically (think about the way '<' and '>' work on strings):

In [None]:
sorted_apes = sorted(apes)
print(sorted_apes)
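Two related tools are worth knowing: the `reverse` argument of `sorted`, and the list method `sort`, which sorts the list itself instead of creating a new one. A short sketch with a small list of numbers:

```python
nums = [7, 13, 2, 400]

# sorted returns a new list; reverse=True puts the largest value first
descending = sorted(nums, reverse=True)
print(descending)
print(nums)  # the original list is unchanged

# The list method sort changes the list in place (and returns None)
nums.sort()
print(nums)
```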

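But beware of mixed lists! In Python 3, sorting a list that mixes strings and numbers raises a `TypeError`, because the two types cannot be compared with `<`: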

In [None]:
mixed = apes + measurements
print(mixed)
print(sorted(mixed))
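If a mixed list really must be sorted, one possible workaround is the `key` argument of `sorted`, which names a function applied to each element before comparing. This is only a sketch: with `key=str` the numbers are ordered as text rather than by value, which may or may not be what you want. The list here is a short hypothetical one.

```python
# hypothetical mixed list, for illustration only
mixed = ["Gorilla", 33, "Human", 12]

# Comparing by the string form of each element avoids comparing str with int;
# the elements themselves keep their original types in the result
as_text = sorted(mixed, key=str)
print(as_text)
```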

## List of lists (nested lists)
  
List elements can be of any type, including lists!  
For example:

In [None]:
birds = ['Gallus gallus', 'Corvus corone', 'Passer domesticus']
snakes = ['Ophiophagus hannah', 'Vipera palaestinae', 'Python bivittatus']
animals = [apes, birds, snakes]
print(animals)

In [None]:
len(animals)

We access lists of lists using double-indexes. For example, to get the 3rd snake:

In [None]:
print(animals[2][2])

Note that the elements of the outer list are __lists__ themselves, not strings. For example:

In [None]:
type(animals[1])

## Slicing
  
We can slice lists just like we did with strings, to get partial lists.  
For example:

In [None]:
# get the first 10 measurements
print(measurements[:10])
# get the last 3 measurements
print(measurements[-3:])

## Exercise: birds and snakes

- Use the lists `birds` and `snakes` to create a single list of strings with the animal names. 
- Add the string `Mus musculus` to the list. 
- Remove the `Corvus corone` from the list. 
- Print the 2nd to 5th elements of the resulting list, sorted alphabetically.

In [None]:
birds = ['Gallus gallus', 'Corvus corone', 'Passer domesticus']
snakes = ['Ophiophagus hannah', 'Vipera palaestinae', 'Python bivittatus']

In [None]:
# write code here:

# `for` loops

![Python loop](http://2.bp.blogspot.com/-7lXe1_Gou3k/UX92PWche3I/AAAAAAAAAFA/JxD4u8St-9g/s1600/python+loop.jpg)


#### Say we want to print each element of our list:

Python’s `for` loop syntax allows us to iterate over the elements of a `list`, or any `iterable` value. 

```py
for loop_variable in iterable:
    statement1
    statement2
    statement3
    ...
```

In [None]:
for ape in apes:
    print(ape, "is an ape")

A more complex loop will go over each ape name and print some stats:

In [None]:
for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape, "is an ape. Its name starts with", first_letter, end = '. ')
    print("Its name has", name_length, "letters")

print('processing finished...')

### String loop

In [None]:
for letter in 'ACGT':
    print(letter)

Let's go over the insulin amino-acid sequence and count the number of prolines manually. Reminder: `insulin` is a `str`, not a `list`.

In [None]:
count = 0
for aa in insulin:
    if aa == "P": 
        count += 1
    # alternative option: 
    #   count += aa == "P"
print("# of prolines:", count)

Note that you can perform arithmetic with booleans: `True` counts as 1 and `False` as 0.

In [None]:
print(0+True)
print(True+True)
print(True+False)

Do you remember another way of counting the prolines?

Let's count how many measurements (see above) are above the average:

In [None]:
print(measurements)
print(avg)

In [None]:
over = 0
for x in measurements:
    over += x > avg
    # or, more explicitly:
    # if x > avg:
    #     over += 1
print(over, "measurements are over the average.")

## Exercise: insulin

Complete the code below to compute the _ratio_ of electrically charged amino acids in the insulin sequence (the number of charged residues divided by the sequence length).

In [None]:
charged = ['R','H','K','D','E']
insulin = 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN'

# Your code here

################
print("Ratio of charged amino acids is:", charged_ratio)

# `range`

Sometimes we want to loop over consecutive numbers.

This is accomplished using the `range` function.

`range` accepts one, two, or three arguments: the lower limit, the upper limit, and the step size.  
The lower limit can be omitted (the default is zero), and so can the step (the default is one).  
The upper limit is __not__ included.

This is similar to slicing, but the arguments are separated by `,` instead of `:`.

In [None]:
for i in range(10): # == range(0, 10, 1)
    print(i)

In [None]:
for i in range(10, 20):
    print(i, end=' ')    # print ends with space instead of newline

In [None]:
for i in range(100, 1000, 10):
    print(i, end=' ')

We can turn the range into a list:

In [None]:
list(range(10))
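As with slicing, the step can be negative, in which case the range counts down and the lower limit is the one that is excluded. A range whose limits leave nothing to count is simply empty. A short sketch:

```python
# Count down from 10 to 2 in steps of 2 (the limit 0 is not included)
countdown = list(range(10, 0, -2))
print(countdown)

# Start and stop are equal, so there is nothing to produce
empty = list(range(5, 5))
print(empty)

# Counting up with a negative step also gives an empty range
also_empty = list(range(0, 10, -1))
print(also_empty)
```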

We can also use `range()` to loop over the indices of a list instead of the elements themselves. This is useful in some cases.

In [None]:
for i in range(len(apes)):
    print(apes[i])
print()
for elem in apes:
    print(elem)

Suppose, for example, that we want to iterate over a string and print each triplet of characters:
```py
seq = 'ATGCCAGATTCAGCT'
# desired output:
# ATG
# CCA
# GAT
# TCA
# GCT
```

```py
seq = 'ATGCCAGATTCAGCT'
for elem in seq:
    print(elem....?)
```
We cannot do so (easily) by iterating over the characters of `seq`.

Instead, we iterate over the string indices in steps of 3:

In [None]:
seq = 'ATGCCAGATTCAGCT'
list(range(0, len(seq), 3))

In [None]:
seq = 'ATGCCAGATTCAGCT'
for i in range(0, len(seq), 3):
    print(seq[i : i+3])

### Example: Bin2Dec

Let's convert a binary string to a decimal number.

In [None]:
binary = "1010"
n = len(binary)
decimal = 0

for i in range(n):
    if binary[-i-1] == "1":
        decimal += 2**(i)
print(decimal)
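We can check the loop against Python's built-ins: `int` accepts a second argument giving the base of the number in the string, and `bin` converts in the opposite direction. The following sketch repeats the loop and compares the results:

```python
binary = "1010"

# Same algorithm as above: the i-th character from the end contributes 2**i
decimal = 0
for i in range(len(binary)):
    if binary[-i-1] == "1":
        decimal += 2**i

# int with base 2 performs the same conversion
print(decimal == int(binary, 2))

# bin goes the other way and adds a '0b' prefix
back_to_binary = bin(decimal)
print(back_to_binary)
```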

#### Set even elements in a given list to zero:

In [None]:
lst = [1, 4, 3,  6, 8]
for elem in lst:
    if elem % 2 == 0:
        elem = 0
print(lst)

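The variable `elem` was changed, but not `lst`! Assigning to the loop variable only makes the name `elem` refer to a new value; it does not modify the list. To change the list itself, we assign through an index: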

In [None]:
print(lst)
for i in range(len(lst)):
    if lst[i] % 2 == 0:
        lst[i] = 0
print(lst)

## `enumerate`

Another elegant way to iterate over lists is with the `enumerate` function. `enumerate` provides two loop variables for every item in the list -- the index and the element:

In [None]:
cities = ['Tel-Aviv', 'Jerusalem', 'Haifa', 'Rehovot']
for i, city in enumerate(cities):
    print("The", i, "city is", city)
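Two common uses of `enumerate`, sketched below: counting from 1 instead of 0 with the optional `start` argument, and changing list elements in place, where the index is needed for the assignment and the element for the test.

```python
cities = ['Tel-Aviv', 'Jerusalem', 'Haifa', 'Rehovot']

# start=1 makes the counter begin at 1; the elements are unaffected
for i, city in enumerate(cities, start=1):
    print("City number", i, "is", city)

# Set even elements to zero: test the element, assign through the index
lst = [1, 4, 3, 6, 8]
for i, elem in enumerate(lst):
    if elem % 2 == 0:
        lst[i] = 0
print(lst)
```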

# Tuples

[Tuples](https://docs.python.org/3.5/tutorial/datastructures.html#tuples-and-sequences) are another data structure for sequential data. They, too, can contain any type and mixed types. The main difference between tuples and lists is that tuples are **immutable**.

Tuples are denoted by round brackets `()`:

In [None]:
t = (15, 76, 'a')
print(t)
type(t)
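Because tuples are immutable, "changing" a tuple means building a new one. A minimal sketch of two ways to do that, by concatenation and by converting to a list and back (the variable names are ours):

```python
t = (15, 76, 'a')

# Concatenation creates a new tuple; ('b',) is a tuple with a single element
longer = t + ('b',)
print(longer)
print(t)  # the original tuple is unchanged

# To replace an element, convert to a list, change it, and convert back
as_list = list(t)
as_list[0] = 0
changed = tuple(as_list)
print(changed)
```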

Tuples are commonly packed and unpacked in Python:

In [None]:
a, b, c = t # unpacking
print('a:', a, 'b:', b, 'c:', c)
t = a, b # packing
print(t)
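Packing and unpacking together give a compact way to swap two variables without a temporary one. The sketch below also unpacks a small hypothetical tuple into descriptive names:

```python
a = 15
b = 76

# The right-hand side is packed into a tuple, then unpacked into a and b
a, b = b, a
print('a:', a, 'b:', b)

# Unpacking gives each item of a tuple its own name
ape_info = ("Gorilla", 7)  # hypothetical (name, name length) pair
name, name_length = ape_info
print(name, "has", name_length, "letters")
```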

You can also create empty and singleton tuples:

In [None]:
t0 = ()
type(t0)

In [None]:
t1 = (5,) # notice the comma
type(t1)

In [None]:
t2 = (5)
print(t2, type(t2))

# Solutions

## Solution: bottles of beer

In [None]:
template = """{0} bottles of beer on the wall, {0} bottles of beer.
Take one down, pass it around, {1} bottles of beer on the wall..."""

print(template.format(3, 2))
print(template.format(2, 1))
print(template.format(1, 0))

## Solution: birds and snakes

In [None]:
animals = birds + snakes
animals.append('Mus musculus')
animals.remove('Corvus corone')
print(sorted(animals[1:5]))

## Solution: insulin

In [None]:
charged = ['R','H','K','D','E']
insulin = 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN'

charged_count = 0
for c in charged:
    charged_count += insulin.count(c)
charged_ratio = charged_count / len(insulin)

print("Ratio of charged amino acids is:", charged_ratio)

## Colophon
This notebook was written by [Yoav Ram](http://python.yoavram.com).

The notebook was written using [Python](http://python.org/) 3.7.
Dependencies listed in [environment.yml](../environment.yml).

This work is licensed under a CC BY-NC-SA 4.0 International License.

![Python logo](https://www.python.org/static/community_logos/python-logo.png)