In [None]:
#In Class Oct. 31 Submission
#Nov 19, 2020
#Zeke Van Dehy

# Python's built-in .format() #

Python has an important set of built-in functionality that we haven't explored yet, namely string and number formatting. There are two entirely different styles for this, and they both work in Python 3. We are going to learn the new style, which is more consistent with other methods you have learned. A complete reference for both types of formatting can be found here:

https://pyformat.info/

# .format() is sort of like .join() #

Remember when we used ''.join(list) to join all of the elements of a list into a string? .format() is a little like that. Like .join(), .format() is a method of the string you are creating (the template), not of the arguments inside its parentheses.

The arguments inside the parentheses are put into the formatting placeholders (indicated by {}) in the order they are encountered.

```'{}{}{}'.format(thing1,thing2,thing3)```
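
Because .format() belongs to the string, the template can be stored in a variable and reused with different arguments. A minimal sketch of that idea (the names template, first_line and second_line are made up for illustration):

```python
# The template string owns the .format() method; the arguments fill its
# {} placeholders from left to right.
template = '{} is the complement of {}'

first_line = template.format('ATGC', 'TACG')
second_line = template.format('GGCC', 'CCGG')

print(first_line)
print(second_line)

# ''.join() is also called on a string, but it takes one list and puts the
# string between the elements instead of filling placeholders.
joined = ' '.join(['ATGC', 'TACG'])
print(joined)
```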

Run the cells below to see how each of the examples works.

In [1]:
# creates a string with a space between one and two
'{} {}'.format('one','two')

'one two'

In [2]:
# creates a string with no space between one and two
'{}{}'.format('one','two') 

'onetwo'

In [3]:
# joins the two strings and includes other text
'{} is less than {}'.format('one','two')

'one is less than two'

In [4]:
# concatenated and a newline at the end
'{}{}\n'.format('one','two')

'onetwo\n'

In [5]:
# concatenate arguments with a specified order using argument index
'{1}{0}'.format('one','two')  #will print twoone 

'twoone'

In [6]:
# .format() result passed directly to print()
print('{}{}'.format('one','two'))

onetwo


And so on. In the cells below, use the two strings "ATGCATGCATGC" and "TACGTACGTACG". Use them as literals, for now (strings in quotes). Write .format() statements that:

- concatenate them with a space
- concatenate them with no space
- concatenate them with the index order switched
- concatenate them with the words "is the complement of" in between them
- concatenate them with a newline at the end

In [7]:
"{} {}".format("ATGCATGCATGC","TACGTACGTACG")

'ATGCATGCATGC TACGTACGTACG'

In [8]:
"{}{}".format("ATGCATGCATGC","TACGTACGTACG")

'ATGCATGCATGCTACGTACGTACG'

In [9]:
"{1} {0}".format("ATGCATGCATGC","TACGTACGTACG")

'TACGTACGTACG ATGCATGCATGC'

In [10]:
"{} is the complement of {}".format("ATGCATGCATGC","TACGTACGTACG")

'ATGCATGCATGC is the complement of TACGTACGTACG'

In [11]:
"{} {}\n".format("ATGCATGCATGC","TACGTACGTACG")

'ATGCATGCATGC TACGTACGTACG\n'

# Using .format() with variables as arguments #

Of course, Python being Python, I can always use variables as arguments as well.

```seq1 = "ATGCATGCATGC"
seq2 = "TACGTACGTACG" 
"{}{}".format(seq1,seq2)```

will concatenate the two strings as output. 

In the cell below, assign your first name and last name to two variables, and then print a format statement that combines them into your name.

In [15]:
first = "Zeke"
last = "Van Dehy" 
print("{} {}".format(first,last))
print("{1}, {0}".format(first,last))

Zeke Van Dehy
Van Dehy, Zeke


# Padding, justifying, and truncating strings #

We can pad a string with spaces using .format(). For example, in the PDB specification, characters 0-5 in the line contain the record identifier (ATOM or HETATM or SEQRES, etc). What if we have a record identifier that is only 4 characters long but needs to take up six characters?

``` '{:<6}'.format('ATOM') ```

will give the string

``` 'ATOM  ' ```

In the .format() command above, the : separates the (optional) argument index on its left from the format specification on its right. Here the specification is <6: the number 6 is the minimum width of the field, not an argument index, and the < character left-justifies the string within that width. ^ would center the string and > would right-justify it.

In the cell below, again using your name, create a formatted string that prints your first name right justified in a 10 character space, your middle name centered in a 10 character space, and your last name left-justified in a 10 character space.

If one of your names is longer than 10 characters, then use a longer space for it.

In [18]:
print("{:<6}".format("ATOM"))
print("{:^6}".format("ATOM"))
print("{:>6}".format("ATOM"))

ATOM  
 ATOM 
  ATOM


In [20]:
print("'{:>10} {:^20} {:<10}'".format("Ezekiel","Quincy Alexander", "Van Dehy"))

'   Ezekiel   Quincy Alexander   Van Dehy  '


# Truncating a string #

We can also truncate a string to fit in a space. If I wanted to include only the first initial of my first name in a formatted string, I could write:

```myname = "Cynthia"
"{:.1}".format(myname)```

This gives the string 'C'. The number after the . inside the brackets is the number of characters that will be kept, starting at the beginning of the string. If I wanted to keep "Cyn" instead, I would replace the 1 with a 3.
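
Truncation can be combined with the padding from the previous section by writing the width before the . and the number of kept characters after it. A short sketch, reusing the example name from above (the variable names are only for illustration):

```python
myname = "Cynthia"  # same example name as in the text

initial = "{:.1}".format(myname)    # keep 1 character
short = "{:.3}".format(myname)      # keep 3 characters
padded = "{:<6.3}".format(myname)   # keep 3, then pad to 6, left-justified

# repr() adds quotes so trailing spaces are visible.
print(repr(initial))
print(repr(short))
print(repr(padded))
```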

In the cell below, pretend it's the 1980s. Use .format() commands to create a username string that is made up of the first three letters of your first name, your middle initial, and the first four letters of your last name.

In [23]:
"{:.1}{:.1}".format("Ezekiel", "Van Dehy")

'EV'

In [24]:
"{:.3}{:.1}{:.4}".format("Ezekiel","Quincy Alexander","Van_Dehy")

'EzeQVan_'

# Formatting numbers #

Numbers can also be formatted.

In the cell below, make a .format() statement that will print out the numbers 867 and 5309, with a - between them.

In [25]:
"{}-{}".format(867,5309)

'867-5309'

# Padding and justifying numbers #

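You can also justify and pad numbers. For integers, this works very much as if the numbers were strings, except that numbers are right-justified by default while strings are left-justified. The character "d" inside the brackets {} states that the value is an integer; it is optional for int values, but including it makes .format() raise an error if a string or a float is passed by mistake.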

``` '{:4d}'.format(1234) # formats the number 1234 in 4 spaces
'{:^4d}'.format(12) # centers the number 12 in 4 spaces ```

In the cell below, create a string from the number 42, left justified, right justified, and centered in a 5 character space.
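
One possible way to approach this exercise is sketched here; the variable names are arbitrary, and the quotes are added only to make the padding visible.

```python
answer = 42

left = '{:<5d}'.format(answer)      # left-justified in 5 characters
right = '{:>5d}'.format(answer)     # right-justified in 5 characters
centered = '{:^5d}'.format(answer)  # centered in 5 characters

for text in (left, right, centered):
    print("'{}'".format(text))
```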

In [32]:
print("{:4d}".format(1234))
print("{:4b}".format(1234))
print("{:<4x}".format(1234))

1234
10011010010
4d2 


# Numbers vs strings in PDB files #

Last time your assignment was to find ways to break down the lines in a PDB file by slicing. I suggested you make the following lists to put these slices into, based on the PDB specification.

```type,number,atom,amino,chain,residue,xs,ys,zs,occ,bs,element = [],[],[],[],[],[],[],[],[],[],[],[]```

Each of these fields has a natural type() -- type, atom, amino, chain and element are all strings. number and residue are integers. xs,ys,zs,occ and bs are all floating point numbers. As you're making lists, you need to append the numbers in their correct formats, for example:

```			type.append(t)
			number.append(int(n))
			atom.append(a)
			amino.append(am)
			chain.append(c)
			residue.append(int(r))
			xs.append(float(x))
			ys.append(float(y))
			zs.append(float(z))
			occ.append(float(o))
			bs.append(float(b))
			element.append(e)```
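
The slicing and type conversion can be tried on a single record without opening a file. This is a sketch under stated assumptions: the line is assembled piece by piece from the values of the first ATOM record printed later in this notebook, and the variable names are chosen here for illustration.

```python
# One 80-column ATOM record, built field by field so the widths are explicit.
line = (
    "ATOM  "      # columns 1-6: record name
    + "    1"     # columns 7-11: atom serial number
    + " "         # column 12: unused
    + " N  "      # columns 13-16: atom name
    + " "         # column 17: alternate location indicator
    + "ALA"       # columns 18-20: residue name
    + " "         # column 21: unused
    + "A"         # column 22: chain identifier
    + "  21"      # columns 23-26: residue sequence number
    + "    "      # columns 27-30: insertion code and unused columns
    + "  18.577"  # columns 31-38: x
    + "  14.018"  # columns 39-46: y
    + "  38.844"  # columns 47-54: z
    + "  1.00"    # columns 55-60: occupancy
    + " 55.98"    # columns 61-66: b-factor
    + " " * 10    # columns 67-76: unused
    + " N"        # columns 77-78: element
    + "  "        # columns 79-80: charge
)

# Slices are 0-based and end-exclusive, so columns 7-11 become line[6:11].
record = line[0:6].strip()         # string
serial = int(line[6:11])           # integer; int() ignores the padding
atom_name = line[12:16].strip()    # string
res_name = line[17:20]             # string
chain_id = line[21]                # string
res_seq = int(line[22:26])         # integer
x = float(line[30:38])             # floats
y = float(line[38:46])
z = float(line[46:54])
occupancy = float(line[54:60])
b_factor = float(line[60:66])
element_symbol = line[76:78].strip()

print(record, serial, atom_name, res_name, chain_id, res_seq)
print(x, y, z, occupancy, b_factor, element_symbol)
```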

The "number" list, or the second field in the PDB file line, which is line[6:11] in slice notation, contains an integer that needs to be right-justified. How would you use .format() to format an integer that needs to fit within 5 spaces?

In [134]:
types,number,atom,amino,chain,residue,xs,ys,zs,occ,bs,element = [],[],[],[],[],[],[],[],[],[],[],[]

with open("1nap.pdb") as fo:
    for line in fo.readlines():
        if line[0:6].strip() == "ATOM":
            types.append(line[0:6].strip()) #record name
            number.append(int(line[6:11].strip())) #atom serial number
            atom.append(line[12:16].strip()) #atom name
            amino.append(line[16]) #column 17: alternate location indicator, kept in the "amino" list
            residue.append(line[17:20].strip()) #columns 18-20: residue name, kept in the "residue" list
            chain.append(line[21]) #chain identifier
            
            
            #x,y,z coords
            xs.append(float(line[30:38]))
            ys.append(float(line[38:46]))
            zs.append(float(line[46:54]))
            
            occ.append(float(line[54:60]))
            bs.append(float(line[60:66]))
            element.append(line[76:78])

for i in range(0,len(number),50):
    print("'{:>5d}'".format(number[i]))
    

'    1'
'   51'
'  101'
'  151'
'  201'
'  251'
'  301'
'  351'
'  401'
'  451'
'  501'
'  552'
'  602'
'  652'
'  702'
'  752'
'  802'
'  852'
'  902'
'  952'
' 1002'
' 1053'
' 1103'
' 1153'
' 1203'
' 1253'
' 1303'
' 1353'
' 1403'
' 1453'
' 1504'
' 1554'
' 1604'
' 1654'
' 1704'
' 1754'
' 1804'
' 1854'
' 1904'
' 1954'


# Floating point numbers #

Floating point numbers work differently with format than integers or strings, but still in a way that makes sense.

For instance, if I have the number pi = 3.141592653589793, I can format it in various ways. With the f type code and no precision given, .format() rounds the number to 6 digits after the decimal point.

``` '{:f}'.format(pi) ```

Try this out in the cell below.

In [135]:
pi = 3.141592653589793
print("'{:8.3f}'".format(pi))
print("'{:15.10f}'".format(pi))


'   3.142'
'   3.1415926536'
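
The default precision described above can be checked directly. This short sketch compares the f type code without a precision, a bare placeholder, and a precision with no width (the variable names are made up for the example):

```python
pi = 3.141592653589793

default_f = '{:f}'.format(pi)      # f with no precision: 6 digits after the point
no_code = '{}'.format(pi)          # no type code: same text as str(pi)
two_places = '{:.2f}'.format(pi)   # precision only, no minimum width

print(default_f)   # 3.141593
print(no_code)
print(two_places)  # 3.14
```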


# Floating point numbers and the PDB #

In the PDB specification, we see two different kinds of floating point numbers. The coordinates are in real(8.3) format, which means they take up 8 characters total with 3 places after the decimal point. Occupancy and b-factor are in real(6.2) format, which means they take up 6 characters total with two places after the decimal point.

To put pi into real(8.3) format, we would use the command:

``` '{:8.3f}'.format(pi) ```

And to put pi into real(6.2) format, we would use the command:

``` '{:6.2f}'.format(pi) ```
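
One thing worth knowing when writing fixed-column files: the width in a format specification is a minimum, so a value that needs more characters is not cut off and will push later fields to the right. The sketch below shows both formats and one overflow case; the value named big is a hypothetical coordinate chosen to be too wide.

```python
pi = 3.141592653589793
big = -12345.6789  # hypothetical value that does not fit in real(8.3)

coord = '{:8.3f}'.format(pi)     # real(8.3)
occ_style = '{:6.2f}'.format(pi)  # real(6.2)
overflow = '{:8.3f}'.format(big)  # needs 10 characters, so the field grows

for text in (coord, occ_style, overflow):
    print(repr(text), len(text))
```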

In the cell below, see if you can get .format() to put pi into 15.10 real format. This is more than the default 6 digits after the decimal point. Is it allowed in Python?

In [136]:
for i in range(0,20):
    print('{:8.3f},{:8.3f},{:8.3f}'.format(xs[i],ys[i],zs[i]))

  18.577,  14.018,  38.844
  19.315,  12.883,  38.317
  20.289,  13.297,  37.218
  20.302,  14.478,  36.836
  18.316,  11.877,  37.695
  21.048,  12.311,  36.753
  21.995,  12.560,  35.654
  21.332,  11.900,  34.423
  21.224,  12.592,  33.413
  23.404,  12.093,  35.776
  24.095,  11.715,  34.460
  25.560,  11.992,  34.343
  26.424,  11.628,  35.136
  25.812,  12.663,  33.291
  20.912,  10.667,  34.610
  20.201,   9.891,  33.555
  18.693,  10.190,  33.751
  18.011,   9.811,  34.707
  20.662,   8.475,  33.562
  19.813,   7.261,  33.323


# Formatting some PDB fields into a line #

Let's say I just wanted to put the three coordinate fields from my PDB file into my line string and nothing else. I could do it this way:

``` for i in range(len(xs)):
        print('{:8.3f} {:8.3f} {:8.3f}'.format(xs[i],ys[i],zs[i])) ```
        
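If I wanted to write those to a file that I had previously opened for appending, I could substitute outfile.write() for print(). Unlike print(), write() does not add a newline, so the format string would also need a '\n' at the end.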
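
A minimal sketch of the file version, assuming two sample coordinates copied from the output above and a hypothetical file name inside a temporary directory:

```python
import os
import tempfile

# Sample coordinates (first two atoms from the output above).
xs = [18.577, 19.315]
ys = [14.018, 12.883]
zs = [38.844, 38.317]

# Hypothetical output path; a temporary directory keeps the example tidy.
out_path = os.path.join(tempfile.mkdtemp(), 'coords.txt')

with open(out_path, 'a') as outfile:  # 'a' opens the file for appending
    for x, y, z in zip(xs, ys, zs):
        # write() adds nothing, so the newline is part of the format string.
        outfile.write('{:8.3f} {:8.3f} {:8.3f}\n'.format(x, y, z))

with open(out_path) as infile:
    written = infile.read()

print(written, end='')
```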

In the cell below, show how you could .format() a string that included the five values number[i], atom[i], xs[i], ys[i] and zs[i]. number should be right justified in five characters, atom should be left justified in four, and xs, ys, and zs should all be formatted as real(8.3). There should be a space between each field.
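
One way the five-field string could look is sketched below. The short lists stand in for the ones filled from the PDB file and hold the first two atoms from the output above.

```python
# Stand-in lists; in the notebook these come from reading the PDB file.
number = [1, 2]
atom = ['N', 'CA']
xs = [18.577, 19.315]
ys = [14.018, 12.883]
zs = [38.844, 38.317]

rows = []
for i in range(len(number)):
    # >5d: integer right-justified in 5; <4: string left-justified in 4;
    # 8.3f: real(8.3). The literal spaces separate the fields.
    rows.append('{:>5d} {:<4} {:8.3f} {:8.3f} {:8.3f}'.format(
        number[i], atom[i], xs[i], ys[i], zs[i]))

for row in rows:
    print("'{}'".format(row))
```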

# Format the complete PDB line #

If you can do that, you know everything you need to know to format the complete PDB line. Remember, there are a couple of places where you will have to leave a space for a field that we didn't use, or just because there's a space in the PDB format that's not ever used (e.g. character 12). 

Write a .format statement for a complete PDB line that uses all 12 fields that we put into lists.
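
For comparison, here is a sketch of a formatter that places every field in the columns listed in the cell comments, including the residue sequence number. The function name format_atom_line and its parameters are hypothetical, and the atom-name rule is a simplification that is only assumed to hold for one-letter elements such as N, C and O.

```python
def format_atom_line(serial, atom, alt_loc, res_name, chain, res_seq,
                     x, y, z, occupancy, b_factor, element):
    """Build one 80-column ATOM line (hypothetical helper, simplified)."""
    # Assumption: names shorter than 4 characters start in column 14.
    name_field = atom if len(atom) == 4 else ' ' + atom
    template = (
        '{:<6}{:>5d} {:<4}{:1}{:>3} {:1}{:>4d}{:1}'  # columns 1-27
        + ' ' * 3                                    # columns 28-30, unused
        + '{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}'      # columns 31-66
        + ' ' * 10                                   # columns 67-76, unused
        + '{:>2}'                                    # columns 77-78, element
        + ' ' * 2                                    # columns 79-80, charge
    )
    # The single ' ' argument fills the insertion code (column 27).
    return template.format('ATOM', serial, name_field, alt_loc, res_name,
                           chain, res_seq, ' ', x, y, z,
                           occupancy, b_factor, element)


line = format_atom_line(1, 'N', ' ', 'ALA', 'A', 21,
                        18.577, 14.018, 38.844, 1.00, 55.98, 'N')
print(line)
print(len(line))
```

Giving each field its own width keeps the columns fixed even when a value is shorter than its field.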

In [137]:
#types,number,atom,amino,chain,residue,xs,ys,zs,occ,bs,element

print("ATOM      1  N   ALA A  21      18.577  14.018  38.844  1.00 55.98           N  ")
print("ATOM     12  CD  GLU A  22      25.560  11.992  34.343  1.00 57.20           C  ")
print("ATOM     13  OE1 GLU A  22      26.424  11.628  35.136  1.00 57.09           O  ")
print("12345678901234567890123456789012345678901234567890123456789012345678901234567890")
print("00000000011111111112222222222333333333344444444445555555555666666666677777777778")#tens place
print("--")

for i in range(0,len(types)):
    print("{0:6}{1:5d}  {2:4}{5:2}{3}{4}  ..    {6:8.3f}{7:8.3f}{8:8.3f}{9:6.2f}{10:6.2f}          {11:2}  ".format(
            types[i], #1-6
            number[i], #7-11
            atom[i], #13-16 (14-17)
            amino[i], #17 #alternate
            chain[i], #22
            residue[i], #18-20
            #omit res seq: 23-26
            #omit iCode: #27
            xs[i], #31-38
            ys[i], #39-46
            zs[i], #47-54
            occ[i], #55-60
            bs[i], #61-66
            element[i]#77-78
            #omit charge: 79-80
         )
     )


ATOM      1  N   ALA A  21      18.577  14.018  38.844  1.00 55.98           N  
ATOM     12  CD  GLU A  22      25.560  11.992  34.343  1.00 57.20           C  
ATOM     13  OE1 GLU A  22      26.424  11.628  35.136  1.00 57.09           O  
12345678901234567890123456789012345678901234567890123456789012345678901234567890
00000000011111111112222222222333333333344444444445555555555666666666677777777778
--
ATOM      1  N   ALA A  ..      18.577  14.018  38.844  1.00 55.98           N  
ATOM      2  CA  ALA A  ..      19.315  12.883  38.317  1.00 56.10           C  
ATOM      3  C   ALA A  ..      20.289  13.297  37.218  1.00 55.61           C  
ATOM      4  O   ALA A  ..      20.302  14.478  36.836  1.00 56.54           O  
ATOM      5  CB  ALA A  ..      18.316  11.877  37.695  1.00 57.49           C  
ATOM      6  N   GLU A  ..      21.048  12.311  36.753  1.00 54.12           N  
ATOM      7  CA  GLU A  ..      21.995  12.560  35.654  1.00 52.71           C  
ATOM      8  C   GLU A  .

# Put it all together #

Take ONE of the examples from Tuesday where you break out the different fields in the PDB file -- getting a single chain, getting the backbone, or moving the coordinates. Put that together with the formatting that we learned today to write out a modified PDB file. 

If you are feeling adventurous, try to come up with a way to preserve the HEADER and TITLE records at the top of the file, and the END record at the end of the file.
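
One possible approach to the adventurous version is sketched here. It uses a different technique from the full reformat above: only the coordinate columns of each ATOM line are replaced by slicing, and every other record is passed through unchanged. The function name, the shift values and the HEADER and TITLE text are placeholders made up for the example.

```python
def shift_pdb_lines(lines, dx, dy, dz):
    """Return new lines with ATOM coordinates shifted; other records kept."""
    out = []
    for line in lines:
        if line[0:6].strip() == "ATOM":
            x = float(line[30:38]) + dx
            y = float(line[38:46]) + dy
            z = float(line[46:54]) + dz
            coords = '{:8.3f}{:8.3f}{:8.3f}'.format(x, y, z)
            # Keep columns 1-30 and 55-80 exactly as they were.
            out.append(line[:30] + coords + line[54:])
        else:
            # HEADER, TITLE, END and anything else pass through untouched.
            out.append(line)
    return out


# A tiny stand-in for a PDB file; the HEADER and TITLE text is placeholder.
atom_prefix = "ATOM      1  N   ALA A  21".ljust(30)
sample = [
    "HEADER    (placeholder header text)\n",
    "TITLE     (placeholder title text)\n",
    atom_prefix + "  18.577  14.018  38.844" + "  1.00 55.98" + " " * 10 + " N  \n",
    "END\n",
]

shifted = shift_pdb_lines(sample, -1.0, 2.0, 0.5)
for line in shifted:
    print(line, end='')
```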

In [138]:
import statistics

def modifyCoordinates(xs,ys,zs):
    """Shift the coordinates so that the median x, y and z each become 0."""
    #the "center" here is the per-axis median, not the mean or center of mass
    center = (statistics.median(xs),statistics.median(ys),statistics.median(zs))
    newXs, newYs, newZs = [],[],[]
    for i in range(len(xs)):
        newXs.append(xs[i]-center[0])
        newYs.append(ys[i]-center[1])
        newZs.append(zs[i]-center[2])
    return (newXs,newYs,newZs)

In [140]:
lines = ""
#modified coordinates
mcs = modifyCoordinates(xs,ys,zs)
for i in range(0,len(types)):
    lines += ("{0:6}{1:5d}  {2:4}{5:2}{3}{4}  ..    {6:8.3f}{7:8.3f}{8:8.3f}{9:6.2f}{10:6.2f}          {11:2}  \n".format(
            types[i], #1-6
            number[i], #7-11
            atom[i], #13-16 (14-17)
            amino[i], #17
            chain[i], #22
            residue[i], #18-20
            #omit res seq: 23-26
            #omit iCode: #27
            mcs[0][i], #31-38
            mcs[1][i], #39-46
            mcs[2][i], #47-54
            occ[i], #55-60
            bs[i], #61-66
            element[i]#77-78
            #omit charge: 79-80
         )
     )
with open("newPDB.pdb","w") as fo:
    fo.write(lines)

In [143]:
#test the new file by reading it
types,number,atom,amino,chain,residue,xs,ys,zs,occ,bs,element = [],[],[],[],[],[],[],[],[],[],[],[]

with open("newPDB.pdb") as fo:
    for line in fo.readlines():
        if line[0:6].strip() == "ATOM":
            types.append(line[0:6].strip()) #record name
            number.append(int(line[6:11].strip())) #atom serial number
            atom.append(line[12:16].strip()) #atom name
            amino.append(line[16]) #column 17: alternate location indicator, kept in the "amino" list
            residue.append(line[17:20].strip()) #columns 18-20: residue name, kept in the "residue" list
            chain.append(line[21]) #chain identifier
            
            
            #x,y,z coords
            xs.append(float(line[30:38]))
            ys.append(float(line[38:46]))
            zs.append(float(line[46:54]))
            
            occ.append(float(line[54:60]))
            bs.append(float(line[60:66]))
            element.append(line[76:78])
print("Center: ", (statistics.median(xs),statistics.median(ys),statistics.median(zs)))
            
for i in range(0,len(types)):
    print("{0:6}{1:5d}  {2:4}{5:2}{3}{4}  ..    {6:8.3f}{7:8.3f}{8:8.3f}{9:6.2f}{10:6.2f}          {11:2}  ".format(
            types[i], #1-6
            number[i], #7-11
            atom[i], #13-16
            amino[i], #17 (blank in this file)
            chain[i], #22
            residue[i], #18-20
            #omit res seq: 23-26
            #omit iCode: #27
            xs[i], #31-38
            ys[i], #39-46
            zs[i], #47-54
            occ[i], #55-60
            bs[i], #61-66
            element[i]#77-78
            #omit charge: 79-80
         )
     )

Center:  (0.0, 0.0, 0.0)
ATOM      1  N   ALA A  ..      11.224  -4.268  20.000  1.00 55.98           N  
ATOM      2  CA  ALA A  ..      11.962  -5.403  19.473  1.00 56.10           C  
ATOM      3  C   ALA A  ..      12.936  -4.989  18.374  1.00 55.61           C  
ATOM      4  O   ALA A  ..      12.949  -3.808  17.992  1.00 56.54           O  
ATOM      5  CB  ALA A  ..      10.963  -6.409  18.851  1.00 57.49           C  
ATOM      6  N   GLU A  ..      13.695  -5.975  17.909  1.00 54.12           N  
ATOM      7  CA  GLU A  ..      14.642  -5.726  16.810  1.00 52.71           C  
ATOM      8  C   GLU A  ..      13.979  -6.386  15.579  1.00 51.37           C  
ATOM      9  O   GLU A  ..      13.871  -5.694  14.569  1.00 52.06           O  
ATOM     10  CB  GLU A  ..      16.051  -6.193  16.932  1.00 53.64           C  
ATOM     11  CG  GLU A  ..      16.742  -6.571  15.616  1.00 55.66           C  
ATOM     12  CD  GLU A  ..      18.207  -6.294  15.499  1.00 57.20           C  
ATO