# More fun with strings #

# Cleaning up whitespace with .strip() methods #

Another useful set of string methods removes extra whitespace characters (spaces, tabs, newlines, etc.) from around a string. This is useful in many situations, for instance when you're parsing a file based on positions in a line, which often leaves whitespace around some values.

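- string.strip() removes whitespace from both ends of the string.
- string.rstrip() removes whitespace from the end (right side) of the string.
- string.lstrip() removes whitespace from the start (left side) of the string.

Each of these methods returns a new string; the original string is left unchanged, so store the result in a variable if you want to keep it.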

In the box below, try answering the input request with strings that have whitespace at the beginning and/or end. Use len() to see how many characters are counted in the string, and try out the .strip() methods to remove those spaces. Then re-count the length of the string with len().

Getting the correct length and position in the string sometimes depends on those empty spaces!

In [3]:
teststring = input("Give me a string:")
print(len(teststring))
teststring2 = teststring.strip()
print(len(teststring2))

Give me a string:    I am hungry.   
19
12
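
To see the three methods side by side, here is a minimal sketch that skips input() and uses a hard-coded sample string instead (the variable names are just for illustration). Wrapping each value in repr() prints the quotes too, which makes any leftover spaces easy to spot.

```python
# Sample string (an assumption for illustration): 4 leading spaces, 3 trailing spaces
padded = "    I am hungry.   "

both = padded.strip()     # whitespace removed from both ends
right = padded.rstrip()   # whitespace removed from the end only
left = padded.lstrip()    # whitespace removed from the start only

# repr() shows the surrounding quotes, so remaining spaces are visible
print(repr(padded), len(padded))
print(repr(both), len(both))
print(repr(right), len(right))
print(repr(left), len(left))
```

The original string, padded, is not modified by any of these calls; each method hands back a new string.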


Whitespace will change the outcome of your slices, too, so you have to pay attention to it. In the example below, what is at index 0 and at index -1? Which index matches up with the first A? Use slice statements to find out, and then use the strip methods to fix the string.

In [1]:
teststring2 = "  ABCDEFGHIJKLMNOPQRSTUVWXYZ "
teststring = teststring2.strip()



print(len(teststring2))
print(teststring2)

print(len(teststring))
print(teststring)


29
  ABCDEFGHIJKLMNOPQRSTUVWXYZ 
26
ABCDEFGHIJKLMNOPQRSTUVWXYZ
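
The questions above can be answered with a few index and slice expressions. This is one possible sketch; the names padded and clean are only for illustration.

```python
padded = "  ABCDEFGHIJKLMNOPQRSTUVWXYZ "
clean = padded.strip()

# Before stripping, both ends of the string are spaces
print(repr(padded[0]))     # ' '
print(repr(padded[-1]))    # ' '

# The first A sits at index 2 because of the two leading spaces
print(padded.index("A"))   # 2
print(padded[2:5])         # ABC

# After stripping, the indexes line up with the letters
print(clean[0], clean[-1])  # A Z
print(clean[0:3])           # ABC
```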


# Converting strings into lists with list() #

For Tuesday's exercise, I wanted to randomly shuffle a string of characters, but the random.shuffle function that I wanted to use only works on lists. So I wanted to convert my string of letters into a list. I used the list() function to do that.

```alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
codekey = list(alphabet)```

The argument that the list() function takes should be a string or another iterable (we'll get into that below). For today, assume that we're going to be passing a string value to list.

- In the cell below, convert the text string "bananas" to a list.
- Store the result of the list function in a new variable name.
- Check the type of that variable name. You should see it's a list object and not a string anymore.
- Then print the list to see what that object type looks like when it's printed.

In [6]:
bananas = "bananas"
new_var = list(bananas)
print(type(new_var))
print(new_var)

<class 'list'>
['b', 'a', 'n', 'a', 'n', 'a', 's']
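
For reference, the shuffle mentioned at the top of this section might look roughly like the sketch below. random.shuffle rearranges the list in place and returns None, so there is nothing to assign; the order you get will differ on every run.

```python
import random

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
codekey = list(alphabet)    # a string can't be shuffled in place, a list can

random.shuffle(codekey)     # shuffles codekey itself; the return value is None
print(codekey)              # same 26 letters, in a random order
```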


# Splitting strings with .split() #

It's not too often that we get to work with data that makes sense to break up into individual characters. More often, you're going to have a string like this:

```NC_007898.3	RefSeq	gene	155371	155461	.	+	.	ID=gene141;Dbxref=GeneID:16976781;Name=rps19;Note=truncated copy of ribosomal protein S19;gbkey=Gene;gene=rps19;locus_tag=LyesC2p001;pseudo=true```

and you are going to want to break out every piece of that line into an individual data item.

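Python has a method for this; it's called .split(). If you don't put an argument into the parentheses, Python will split the string on any run of whitespace (spaces, tabs, newlines). So the line above would be split into a list like

['NC_007898.3','RefSeq','gene','155371','155461','.','+',...etc.

Notice that the last column of that line contains spaces of its own (Note=truncated copy of ribosomal protein S19), so splitting on whitespace would break that column apart as well. If the columns are separated by tabs, splitting on the tab character with .split("\t") keeps each column whole.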

If we had string data that started out separated by commas, like so:

```fruits = "apples,bananas,pears,oranges,blueberries"```

We would need to split on the comma, so we would type:

```fruits.split(",")```

In the cell below, type the correct split argument to get the string to split properly.

In [6]:
aminos = "ALA,ARG,ASN,ASP,CYS,GLN,GLU,GLY,HIS,ILE,LEU,LYS,MET,PHE,PRO,SER,THR,TRP,TYR,VAL"
print(aminos.split(","))
names = "Maryam&Aiden&Brett&Amit&Maya&Shruti"
print(names.split("&"))
houses = "Slytherin Gryffindor Hufflepuff Ravenclaw"
print(houses.split())

['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL']
['Maryam', 'Aiden', 'Brett', 'Amit', 'Maya', 'Shruti']
['Slytherin', 'Gryffindor', 'Hufflepuff', 'Ravenclaw']
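
Going back to the long annotation line at the top of this section, here is a sketch of how the choice of separator changes the result. It assumes the columns in that line are separated by tab characters (written as \t in the code); the variable names are made up for illustration.

```python
# The annotation line from above, assuming its columns are tab-separated
gff_line = (
    "NC_007898.3\tRefSeq\tgene\t155371\t155461\t.\t+\t.\t"
    "ID=gene141;Dbxref=GeneID:16976781;Name=rps19;"
    "Note=truncated copy of ribosomal protein S19;"
    "gbkey=Gene;gene=rps19;locus_tag=LyesC2p001;pseudo=true"
)

by_whitespace = gff_line.split()     # also splits inside the Note text
by_tab = gff_line.split("\t")        # one item per column

print(len(by_whitespace))   # 14
print(len(by_tab))          # 9

# The last column can be split again, this time on semicolons
attributes = by_tab[-1].split(";")
print(attributes[0])        # ID=gene141
```

Splitting in two passes, first on the column separator and then on the separator used inside a column, is a common pattern when a line packs several kinds of data together.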


# Joining things into strings with .join() #

Lists can be joined back into strings with .join().

Let's see how this works. Run the cell below for an illustration.

In [32]:
names = "Maryam&Aiden&Brett&Amit&Maya&Shruti"
namelist = names.split("&")
print(namelist)
namestring = " ".join(namelist)
print(namestring)

['Maryam', 'Aiden', 'Brett', 'Amit', 'Maya', 'Shruti']
Maryam Aiden Brett Amit Maya Shruti


# Join has a strange syntax #

Last time, we bound .maketrans() onto str itself (the string type) to make our translation table, rather than binding it onto one of the strings we were using.

This time, when we use .join() on namelist, we don't bind it to namelist -- we bind it to whatever string characters we are using to join namelist together. I wanted to take the names in the list and separate them with a space, so I bound .join() onto the string literal " ", a single space, and then passed it namelist as an argument.

In the cell below, turn the alphabet string "ABCDEFGHIJKLMNOPQRSTUVWXYZ" into a list, and then join it back together with spaces between each letter.

To join list items back together with no spaces, you can bind .join() to a pair of quotes with no space in between, "".join(). Try this as well. We'll use "".join() pretty frequently when we're parsing information that we want out of files, so definitely make sure you get your head around it!

In [8]:
alphabet =  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
alpa_lis = list(alphabet)
print(alpa_lis)
" ".join(alpa_lis)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


'A B C D E F G H I J K L M N O P Q R S T U V W X Y Z'
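
The second half of the exercise, joining with no spaces, can be sketched the same way. The separator is whatever string .join() is bound to, so an empty string glues the items straight back together. The variable names here are only for illustration.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
letters = list(alphabet)

spaced = " ".join(letters)       # a single space between each letter
rejoined = "".join(letters)      # empty separator: nothing between the letters
dashed = "-".join(letters[:3])   # any string can be the separator

print(spaced)
print(rejoined)
print(dashed)                    # A-B-C
print(rejoined == alphabet)      # True: list() followed by "".join() is a round trip
```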

# Iterating over strings with the for keyword #

There are lots of ways to get a loop going and to define how long it will go on. There are also lots of kinds of data we can loop over. We are only going to see the very first one of these today, but we're looking at it because it is one of the built-in ways that Python handles strings. Under the hood in a lot of Python data object types, there is a basic method called ```__iter__```, which hands back an iterator object that has a ```__next__``` method. You don't usually call them directly, but you'll be able to see ```__iter__``` in the dir() for the object type if it's available to use. Together they're the iterator protocol. What this means is that an object that supports the iterator protocol can give you its items one by one.

Strings are one such type of object.

The keyword for asks the object to give up its items one-by-one, according to the iterator protocol. for gets an item, uses it in the code contained inside the for block, and automatically goes to the next item when it's done.

To get a string to give up its characters one by one, you can write something like:

```for char in string:```

That begins the for block.
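
To make the iterator protocol a little less abstract, here is a sketch of what for does behind the scenes, done by hand with the built-in functions iter() and next(). You would not normally write loops this way; it is only meant to show the one-item-at-a-time behavior.

```python
word = "ATG"

letters = iter(word)      # ask the string for an iterator (this uses __iter__)
first = next(letters)     # each next() call fetches one item (this uses __next__)
second = next(letters)
third = next(letters)
print(first, second, third)   # A T G

# One more next(letters) would raise StopIteration,
# which is the signal a for loop uses to know it is done.

print("__iter__" in dir(word))      # True: the string can hand out an iterator
print("__next__" in dir(word))      # False: the string itself is not the iterator
print("__next__" in dir(letters))   # True: the iterator is what steps through the items
```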

# Indentation #

We're going to see lots of code organized into "blocks" from now on. for and while blocks, which make loops. if/else blocks, which decide between conditions. def blocks, which define functions.

The rule is, if you start a block like this, then everything you are doing inside the block has to be indented one level. That's how the Python interpreter knows where the things you want to do FOR everything in OBJECT come to an end.

```for char in string:
    print(char)```
    
The print line is what we want to do for each character. It is indented by exactly one level (four spaces in the example above). You can use spaces or tabs, but the rule is each indent level has to work the same. You can't mix it up by using a tab for one indent and then using four spaces for another.

In the cell below, make a for block that adds 1 to the value of counting_var for each character in the string. Remember you can use += to update the value of a variable. I started you off with counting_var set to zero. The loop should come AFTER the variable is initialized.

Once the loop is done, you can stop indenting. Make a print statement that prints the value of the counting_var after the loop finishes.

In [1]:
words = "newstringofletters"
counting_var = 0
for char in words:
    counting_var += 1
    
print(counting_var)    


18
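
The counting cell above depends on indentation to decide which lines repeat. As a small illustration (with made-up variable names), the sketch below updates one counter inside the loop and another one after it.

```python
words = "abc"
inside_count = 0
after_count = 0

for char in words:
    inside_count += 1    # indented: runs once for every character

after_count += 1         # not indented: runs a single time, after the loop ends

print(inside_count, after_count)   # 3 1
```

If the after_count line were indented to match the line above it, it would become part of the loop and both counters would end up at 3.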


What built-in python function did you just invent? Put the answer in the cell below.

In [19]:
print("len")

len


# Fibonacci with a loop #

Last week we met the Fibonacci series. To make a Fibonacci series, we start with the values of two numbers, a and b, both set to 1. Then we update them, following the simple rule that the old value of b becomes the new value of a, and the old value of a + b becomes the new value of b.

Or in math terms: new a = old b, and new b = old a + old b.

Python lets us update both of these numbers at once in a statement like a, b = new_a, new_b. This is useful if two numbers have to update at once.
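
Here is a sketch of why the simultaneous form matters. The starting values 1 and 2 are chosen only so that the difference shows up; with two separate statements, a has already been overwritten by the time b is calculated.

```python
# Two separate statements: the second line sees the NEW value of a
a, b = 1, 2
a = b            # a is now 2
b = a + b        # 2 + 2 = 4, not the 1 + 2 we wanted
sequential = (a, b)
print(sequential)      # (2, 4)

# One simultaneous statement: the right-hand side is worked out
# from the OLD values before either name is updated
a, b = 1, 2
a, b = b, a + b
simultaneous = (a, b)
print(simultaneous)    # (2, 3)
```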

The Fibonacci series is an infinite series. We could literally generate numbers forever if we put it in a loop. If we just want the loop to repeat itself ten times, we could write a new kind of for statement:

```for i in range(0,10):```

Put your Fibonacci statement from last week into the loop below, and add a print statement to make it print the first 10 pairs of a,b in the Fibonacci infinite series. You should also start by setting initial values of a and b OUTSIDE THE LOOP.

In [18]:
# Set the starting values once, before the loop begins
a = 1
b = 1
for i in range(0, 10):
    # Print the current pair, then update both values at the same time
    print(a, b)
    a, b = b, a + b

1 1
1 2
2 3
3 5
5 8
8 13
13 21
21 34
34 55
55 89


# Translating one character at a time #

The maketrans() and translate() methods that you used last time will work on any string, whether it's one character long or a whole phrase. Go back and look at your code from last time, and let's adapt it so that you build the translation table before the loop starts, and then do the translating character by character inside the loop.

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
code = "WJXSDGMAHUTFVIQBLZEROYPKNC "
phrase = "THIS IS MY SECRET PYTHON DIARY"

Print out pairs of the original character and the encoded character.

In [11]:
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
code = "WJXSDGMAHUTFVIQBLZEROYPKNC "
phrase = "THIS IS MY SECRET PYTHON DIARY"
transtable = str.maketrans(alphabet,code)
for s in phrase:
    print(s,s.translate(transtable))


T R
H A
I H
S E
   
I H
S E
   
M V
Y N
   
S E
E D
C X
R Z
E D
T R
   
P B
Y N
T R
H A
O Q
N I
   
D S
I H
A W
R Z
Y N
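
As a follow-up, the same loop can rebuild the whole encoded phrase instead of printing pairs. This is a sketch rather than the required answer: it starts with an empty string, adds one translated character per pass with +=, and then checks the result against translating the phrase in a single call.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
code = "WJXSDGMAHUTFVIQBLZEROYPKNC "
phrase = "THIS IS MY SECRET PYTHON DIARY"

# Build the table once, before the loop starts
transtable = str.maketrans(alphabet, code)

encoded = ""
for s in phrase:
    encoded += s.translate(transtable)   # add one encoded character per pass

print(encoded)
print(encoded == phrase.translate(transtable))   # True
```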


# Generating all the possible DNA codons #

Loops can nest inside each other. Each nested loop is indented one level further than the previous one.

```for i in string:
    for j in string2:
        for k in string3:
            do something with i,j,k```

If we have a string ATGC, let's think about how we'd generate all the possible three-letter codons. Build a set of loops over the string ATGC that will run through all the combinations of i,j,k and print them out. You can give for either a variable name or a string literal.

In [19]:
for i in "ATGC":
    for j in "ATGC":
        for k in "ATGC":
            print(i,j,k)

A A A
A A T
A A G
A A C
A T A
A T T
A T G
A T C
A G A
A G T
A G G
A G C
A C A
A C T
A C G
A C C
T A A
T A T
T A G
T A C
T T A
T T T
T T G
T T C
T G A
T G T
T G G
T G C
T C A
T C T
T C G
T C C
G A A
G A T
G A G
G A C
G T A
G T T
G T G
G T C
G G A
G G T
G G G
G G C
G C A
G C T
G C G
G C C
C A A
C A T
C A G
C A C
C T A
C T T
C T G
C T C
C G A
C G T
C G G
C G C
C C A
C C T
C C G
C C C
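
If you want to keep the codons rather than just print them, one option is to collect them in a list. The sketch below uses the list method .append(), which adds one item to the end of a list, together with "".join() from earlier in this lesson; the names bases and codons are only for illustration.

```python
bases = "ATGC"
codons = []                  # start with an empty list

for i in bases:
    for j in bases:
        for k in bases:
            # glue the three letters into one string and store it
            codons.append("".join([i, j, k]))

print(len(codons))             # 64, which is 4 * 4 * 4
print(codons[0], codons[-1])   # AAA CCC
```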


# Getting the item index with enumerate() #

If it's important for you to know the position of the item in the iterable string, as well as the value that is there, you can use the enumerate() function to get it.  See how this works by using enumerate() on the ATGC string, using the syntax below, where xxxxxx can be either a variable name or a string literal.

```for i in enumerate(xxxxxx):
    print(i)```

In [20]:
for i in enumerate("ATGC"):
    print(i)

(0, 'A')
(1, 'T')
(2, 'G')
(3, 'C')
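
Each item that enumerate() hands back is a pair of (index, value), so it can be unpacked into two names right in the for line. The sketch below also shows the optional start argument, which changes the number the count begins from; the variable names are made up for illustration.

```python
# Unpack each (index, value) pair into two separate names
for index, base in enumerate("ATGC"):
    print(index, base)

# Wrapping enumerate() in list() shows all of the pairs at once
positions = list(enumerate("ATGC"))
print(positions)

# start=1 begins the count at 1 instead of 0
one_based = list(enumerate("ATGC", start=1))
print(one_based)
```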
