## Creating a pandas data frame for experimental PF values

Wan et al. (JCTC 2020) trained a forward model on experimentally measured HDX protection factors:

* 72 PF values for backbone amides of ubiquitin, taken from Craig et al.
* 30 (of 53) PF values for backbone amides of BPTI, taken from Persson and Halle.

In this notebook we have converted the published values to $\ln$ PF (natural log).
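
For reference, changing logarithm base is a single multiplication by $\ln 10$. The sketch below assumes the published values were reported as $\log_{10}$ PF, which this notebook does not state explicitly. The function name `log10_to_ln` and the inputs are illustrative only.

```python
import numpy as np

def log10_to_ln(log10_pf):
    """Convert log10(PF) values to ln(PF) using ln(x) = ln(10) * log10(x)."""
    return np.log(10.0) * np.asarray(log10_pf, dtype=float)

# Illustrative inputs only, not taken from either paper
example_log10_pf = [0.0, 1.0, 2.5]
print(log10_to_ln(example_log10_pf))
```

Under that assumption, a published value of 2.5 maps to roughly 5.756463, which happens to match the TYR 10 entry in the BPTI table.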

### References

Wan, Hongbin, Yunhui Ge, Asghar Razavi, and Vincent A. Voelz. “Reconciling Simulated Ensembles of Apomyoglobin with Experimental Hydrogen/Deuterium Exchange Data Using Bayesian Inference and Multiensemble Markov State Models.” Journal of Chemical Theory and Computation 16, no. 2 (February 11, 2020): 1333–48. https://doi.org/10.1021/acs.jctc.9b01240.

Craig, P. O.; Lätzer, J.; Weinkam, P.; Hoffman, R. M. B.; Ferreiro, D. U.; Komives, E. A.; Wolynes, P. G. Journal of the American Chemical Society 2011, 133, 17463–17472.

Persson, F.; Halle, B. Proceedings of the National Academy of Sciences 2015, 112, 10383–10388.


In [50]:
import os, sys
import numpy as np
import pandas as pd

### Ubiquitin

ubiquitin_text ="""#residue\tresnum\tln PF \\
GLN	& 2	& 6.210072 \\
ILE	& 3	& 13.7372227 \\
PHE	& 4	& 13.4839383 \\
VAL	& 5	& 13.1523661 \\
LYS	& 6	& 10.6909026 \\
THR	& 7	& 10.4629467 \\
LEU	& 8	& 0.67005226 \\
THR	& 9	& 0 \\
GLY	& 10	& 3.93511792 \\
LYS	& 11	& 5.21075007 \\
THR	& 12	& 3.2443424 \\
ILE	& 13	& 11.0570136 \\
THR	& 14	& 2.53514619 \\
LEU	& 15	& 11.5336487 \\
GLU	& 16	& 5.14167251 \\
VAL	& 17	& 12.2405424 \\
GLU	& 18	& 7.97845735 \\
SER	& 20	& 3.82459384 \\
ASP	& 21	& 12.9796722 \\
THR	& 22	& 7.73438333 \\
ILE	& 23	& 11.2135894 \\
GLU	& 24	& 5.00581999 \\
ASN	& 25	& 10.5550501 \\
VAL	& 26	& 14.6214153 \\
LYS	& 27	& 14.8539764 \\
ALA	& 28	& 11.4806893 \\
LYS	& 29	& 13.4517021 \\
ILE	& 30	& 14.2483966 \\
GLN	& 31	& 7.7689221 \\
ASP	& 32	& 5.6229128 \\
LYS	& 33	& 4.56602624 \\
GLU	& 34	& 6.21467717 \\
GLY	& 35	& 6.26763662 \\
ILE	& 36	& 6.981438 \\
ASP	& 39	& 1.64174317 \\
GLN	& 40	& 6.45875119 \\
GLN	& 41	& 8.26167531 \\
ARG	& 42	& 9.13435506 \\
LEU	& 43	& 6.56927527 \\
ILE	& 44	& 12.6573103 \\
PHE	& 45	& 6.49328996 \\
ALA	& 46	& 0 \\
GLY	& 47	& 4.1699816 \\
LYS	& 48	& 7.86102551 \\
GLN	& 49	& 3.80617316 \\
LEU	& 50	& 8.45969763 \\
GLU	& 51	& 5.41798272 \\
ASP	& 52	& 2.73316851 \\
GLY	& 53	& 3.31802512 \\
ARG	& 54	& 9.9932193 \\
THR	& 55	& 11.3494419 \\
LEU	& 56	& 13.0211187 \\
SER	& 57	& 6.68440452 \\
ASP	& 58	& 6.84098031 \\
TYR	& 59	& 10.9580025 \\
ASN	& 60	& 5.11404149 \\
ILE	& 61	& 8.54949845 \\
GLN	& 62	& 8.27088565 \\
LYS	& 63	& 3.09927954 \\
GLU	& 64	& 6.37125295 \\
SER	& 65	& 7.52484808 \\
THR	& 66	& 6.1939539 \\
LEU	& 67	& 7.75740918 \\
HIS	& 68	& 8.49884158 \\
LEU	& 69	& 7.95773408 \\
VAL	& 70	& 9.09981629 \\
LEU	& 71	& 3.0278994 \\
ARG	& 72	& 2.5788953 \\
LEU	& 73	& 0 \\
ARG	& 74	& 0 \\
GLY	& 75	& 0 \\
GLY	& 76	& 0 \\"""

# Convert LaTeX-style table rows to tab-separated text
ubiquitin_text = ubiquitin_text.replace('\t& ', '\t').replace(' \\', '')

# print(ubiquitin_text)
with open('ubiquitin_lnPF.txt', 'w') as fout:
    fout.write(ubiquitin_text)

ubi = pd.read_csv('ubiquitin_lnPF.txt', header=0, sep='\t')
ubi

Unnamed: 0,#residue,resnum,ln PF
0,GLN,2,6.210072
1,ILE,3,13.737223
2,PHE,4,13.483938
3,VAL,5,13.152366
4,LYS,6,10.690903
5,THR,7,10.462947
6,LEU,8,0.670052
7,THR,9,0.000000
8,GLY,10,3.935118
9,LYS,11,5.210750
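
The intermediate text file is not strictly required for building the data frame: the cleaned string can be passed to `pd.read_csv` through `io.StringIO`. A minimal sketch follows, using a three-row excerpt of the ubiquitin table. The helper name `latex_rows_to_frame` is hypothetical and not part of the original notebook.

```python
import io
import pandas as pd

def latex_rows_to_frame(text):
    """Parse LaTeX-style table rows (tab, ampersand, trailing backslash) into a DataFrame."""
    cleaned = text.replace('\t& ', '\t').replace(' \\', '')
    return pd.read_csv(io.StringIO(cleaned), header=0, sep='\t')

# Three-row excerpt in the same layout as the table above
excerpt = (
    "#residue\tresnum\tln PF \\\n"
    "GLN\t& 2\t& 6.210072 \\\n"
    "ILE\t& 3\t& 13.7372227 \\\n"
    "PHE\t& 4\t& 13.4839383 \\"
)
demo = latex_rows_to_frame(excerpt)
print(demo)
```

Writing the file is still useful when other tools need the tab-separated version, so this is an alternative rather than a replacement.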


In [51]:
### BPTI

bpti_text ="""#residue\tresnum\tln PF \\
CYS	& 5	& 8.52877518 \\
LEU	& 6	& 7.43504727 \\
GLU	& 7	& 8.229439 \\
TYR	& 10	& 5.756463 \\
GLY	& 12	& 3.840712 \\
ALA	& 16	& 6.963017 \\
ARG	& 17	& 1.752267 \\
ILE	& 18	& 12.37639 \\
ILE	& 19	& 2.256533 \\
ALA	& 25	& 3.04632 \\
GLY	& 28	& 7.676819 \\
LEU	& 29	& 10.85899 \\
CYS	& 30	& 3.677228 \\
THR	& 32	& 5.701201 \\
VAL	& 34	& 3.734793 \\
TYR	& 35	& 11.23431 \\
GLY	& 36	& 9.574149 \\
GLY	& 37	& 11.43924 \\
CYS	& 38	& 4.503856 \\
LYS	& 41	& 6.988346 \\
ARG	& 42	& 2.141404 \\
ASN	& 43	& 5.125554 \\
ASN	& 44	& 14.02274 \\
SER	& 47	& 4.503856 \\
ALA	& 48	& 2.403899 \\
MET	& 52	& 11.02938 \\
ARG	& 53	& 9.825131 \\
THR	& 54	& 7.962339 \\
CYS	& 55	& 12.1139 \\
GLY	& 56	& 8.008391 \\"""

# Convert LaTeX-style table rows to tab-separated text
bpti_text = bpti_text.replace('\t& ', '\t').replace(' \\', '')

# print(bpti_text)
with open('bpti_lnPF.txt', 'w') as fout:
    fout.write(bpti_text)

bpti = pd.read_csv('bpti_lnPF.txt', header=0, sep='\t')
bpti

Unnamed: 0,#residue,resnum,ln PF
0,CYS,5,8.528775
1,LEU,6,7.435047
2,GLU,7,8.229439
3,TYR,10,5.756463
4,GLY,12,3.840712
5,ALA,16,6.963017
6,ARG,17,1.752267
7,ILE,18,12.37639
8,ILE,19,2.256533
9,ALA,25,3.04632


In [83]:
### Finally, we make a data frame that concatenates the ubiquitin and BPTI values
ubi_bpti = pd.concat([ubi, bpti], ignore_index=True)
ubi_bpti

# Also write a text file version, skipping the BPTI header line so the header appears only once
ubi_lines = ubiquitin_text.split('\n')
bpti_lines = bpti_text.split('\n')
ubi_bpti_lines = ubi_lines + bpti_lines[1:]

with open('ubi_bpti_lnPF.txt', 'w') as fout:
    fout.writelines("%s\n" % l for l in ubi_bpti_lines)
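
The concatenated frame no longer records which protein each row came from, and residue numbers overlap between the two proteins. One option, sketched here with small stand-in frames rather than the real `ubi` and `bpti`, is to add a label column before concatenating. The column name `protein` is an assumption for illustration.

```python
import pandas as pd

# Stand-in frames with the same columns as `ubi` and `bpti`
ubi_demo = pd.DataFrame(
    {'#residue': ['GLN', 'ILE'], 'resnum': [2, 3], 'ln PF': [6.210072, 13.7372227]}
)
bpti_demo = pd.DataFrame(
    {'#residue': ['CYS', 'LEU'], 'resnum': [5, 6], 'ln PF': [8.52877518, 7.43504727]}
)

# assign() returns a copy with the extra column, so the inputs are left unchanged
labeled = pd.concat(
    [ubi_demo.assign(protein='ubiquitin'), bpti_demo.assign(protein='BPTI')],
    ignore_index=True,
)
print(labeled)
```

Row order and the `ln PF` column are unchanged by the extra label, so any array extracted from that column keeps the same ordering.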


In [84]:
# write all the data frames to JSON
ubi.to_json('ubiquitin_lnPF.json')
bpti.to_json('bpti_lnPF.json')
ubi_bpti.to_json('ubi_bpti_lnPF.json')
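
A round trip through JSON is a cheap way to confirm that the export preserves the values. The sketch below uses a two-row stand-in frame and an in-memory string instead of the files written above, so the frame `demo` and the variable names are illustrative only.

```python
import io
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {'#residue': ['GLN', 'ILE'], 'resnum': [2, 3], 'ln PF': [6.210072, 13.7372227]}
)

# to_json() without a path returns the JSON text; read_json parses it back
json_text = demo.to_json()
restored = pd.read_json(io.StringIO(json_text))

print(restored)
print(np.allclose(restored['ln PF'], demo['ln PF']))
```

Note that `to_json` rounds floats to a fixed number of decimal places by default, so comparing with a tolerance is safer than testing exact equality.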

In [92]:
# write a numpy array of JUST the lnPF values
all_lnPF_values = ubi_bpti['ln PF'].to_numpy()
np.save('ubi_bpti_lnPF.npy', all_lnPF_values)

### VERIFY that this data matches Hongbin's earlier data file
all_lnPF_values_HONGBIN = np.load('ubi_bpti_all_exp_data_in_ln.npy')

print('all_lnPF_values.shape', all_lnPF_values.shape)
print('all_lnPF_values_HONGBIN.shape', all_lnPF_values_HONGBIN.shape)

for i in range(all_lnPF_values_HONGBIN.shape[0]):
    print(all_lnPF_values[i], all_lnPF_values_HONGBIN[i])

all_lnPF_values.shape (102,)
all_lnPF_values_HONGBIN.shape (102,)
6.210072 6.210071995804942
13.737222699999998 13.737222664802477
13.4839383 13.483938304573131
13.1523661 13.152366051181989
10.6909026 10.690902586771355
10.4629467 10.462946662564942
0.67005226 0.6700522620612673
0.0 0.0
3.93511792 3.9351179239268244
5.21075007 5.210750065445525
3.2443424 3.2443423960286104
11.0570136 11.057013616557407
2.53514619 2.5351461873864443
11.533648699999999 11.533648730807176
5.14167251 5.141672512655704
12.2405424 12.240542354356347
7.97845735 7.978457347224368
3.82459384 3.82459383946311
12.9796722 12.979672169207435
7.73438333 7.7343833273669995
11.2135894 11.213589402881002
5.00581999 5.005819992169055
10.555050099999999 10.555050066284705
14.621415300000002 14.62141534051219
14.853976399999999 14.853976434904588
11.480689300000002 11.480689273668311
13.4517021 13.451702113271214
14.248396599999998 14.248396555447155
7.768922099999999 7.76892210376191
5.6229128 5.62291279709146
4.5660262