## How To Use The Dataframe

This directory contains a dataframe for each existing dataset, as well as the Jupyter notebook that was used to construct the master dataframe, `master_FEP.pkl`. The master dataframe records the identity, score, and v1/v2/v3 project and run for each ligand of each dataset.

Below are some sample commands for querying the `pandas` dataframe. If you have other questions, or suggestions for things to add here, ask Matt (tug27224@temple.edu)!

***

To read in the dataframe, use:


In [1]:
import pandas as pd
df = pd.read_pickle('master_FEP.pkl')

You can select and print a subset of the dataframe by filtering on any of its columns. First, let's list the columns:

In [6]:
df.columns

Index(['dataset', 'identity', 'receptor', 'score', 'v1_project', 'v1_run',
       'v2_project', 'v2_run', 'v3_project', 'v3_run'],
      dtype='object')

In [7]:
# The data in these rows look like this:
df

| | dataset | identity | receptor | score | v1_project | v1_run | v2_project | v2_run | v3_project | v3_run |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72_RL | CCNCC(COC)Oc1ccccc1 | receptor-270-343.pdb | 0.999790 | 14600 | 0 | 14700 | 0 | 14800 | 0 |
| 1 | 72_RL | O=C(Cc1cccnc1)c1ccccc1 | receptor-343.pdb | 0.999652 | 14600 | 1 | 14700 | 1 | 14800 | 1 |
| 2 | 72_RL | CCCCC(N)c1cc(C)ccn1 | receptor-343.pdb | 0.999256 | 14600 | 2 | 14700 | 2 | 14800 | 2 |
| 3 | 72_RL | COCC(C)Nc1ccncn1 | receptor-343.pdb | 0.999096 | 14600 | 3 | 14700 | 3 | 14800 | 3 |
| 4 | 72_RL | CCN(CC)CCNc1ccc(C#N)cn1 | receptor-270-343.pdb | 0.998980 | 14600 | 4 | 14700 | 4 | 14800 | 4 |
| 5 | 72_RL | NNCc1ccccc1S(N)(=O)=O | receptor-270-343.pdb | 0.998902 | 14600 | 5 | 14700 | 5 | 14800 | 5 |
| 6 | 72_RL | CCNC(=O)C(C)Nc1cc(C)ccn1 | receptor-270-343.pdb | 0.998787 | 14600 | 6 | 14700 | 6 | 14800 | 6 |
| 7 | 72_RL | CNCCNc1ccc(C#N)cn1 | receptor-270-343.pdb | 0.998752 | 14600 | 7 | 14700 | 7 | 14800 | 7 |
| 8 | 72_RL | CC1NCCCN(CC(N)=O)C1=O | receptor-270-343.pdb | 0.998738 | 14600 | 8 | 14700 | 8 | 14800 | 8 |
| 9 | 72_RL | N#Cc1cccnc1NCC=CCN | receptor-270-343.pdb | 0.998718 | 14600 | 9 | 14700 | 9 | 14800 | 9 |
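
Before filtering, it can help to check which dataset names are present and how many rows each one has. The snippet below is a minimal sketch that uses a small toy dataframe (named `toy` here, an assumption for illustration) with the same columns as `master_FEP.pkl`; its three rows are copied from the outputs in this README. With the real file, you would run the same calls on `df`.

```python
import pandas as pd

# Toy stand-in for master_FEP.pkl with the same columns.
# The three rows are copied from the sample outputs in this README.
toy = pd.DataFrame({
    'dataset': ['72_RL', '72_RL', '387_RL'],
    'identity': ['CCNCC(COC)Oc1ccccc1',
                 'O=C(Cc1cccnc1)c1ccccc1',
                 'COC(=O)c1cccc(C(=N)NO)c1'],
    'receptor': ['receptor-270-343.pdb', 'receptor-343.pdb', 'protein-0387.pdb'],
    'score': [0.999790, 0.999652, 0.999915],
    'v1_project': [14600, 14600, 14645],
    'v1_run': [0, 1, 1],
    'v2_project': [14700, 14700, 14701],
    'v2_run': [0, 1, 1],
    'v3_project': [14800, 14800, 14801],
    'v3_run': [0, 1, 1],
})

# Distinct dataset names, sorted for readability.
dataset_names = sorted(toy['dataset'].unique())

# Number of rows (ligands) recorded for each dataset.
rows_per_dataset = toy['dataset'].value_counts()

print(dataset_names)
print(rows_per_dataset)
print(toy.head(2))  # first two rows only
```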


Next, let's select every row whose dataset name contains 'MS0323':

In [8]:
df.loc[df.dataset.str.contains('MS0323')]

| | dataset | identity | receptor | score | v1_project | v1_run | v2_project | v2_run | v3_project | v3_run |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MS0323_RL | DAR-DIA-43a-4 | protein-0387.pdb | -8.98986 | 14363 | 0 | 14722 | 0 | 14822 | 0 |
| 1 | MS0323_RL | SAL-INS-1c7-9 | protein-0387.pdb | -8.12044 | 14363 | 1 | 14722 | 1 | 14822 | 1 |
| 2 | MS0323_RL | AGN-NEW-891-6 | protein-0387.pdb | -7.88928 | 14363 | 2 | 14722 | 2 | 14822 | 2 |
| 3 | MS0323_RL | CHR-SOS-1f3-3 | protein-0387.pdb | -7.54667 | 14363 | 3 | 14722 | 3 | 14822 | 3 |
| 4 | MS0323_RL | CHR-SOS-709-13 | protein-0387.pdb | -7.44988 | 14363 | 4 | 14722 | 4 | 14822 | 4 |
| 5 | MS0323_RL | DAR-DIA-033-13 | protein-0387.pdb | -7.42735 | 14363 | 5 | 14722 | 5 | 14822 | 5 |
| 6 | MS0323_RL | CHR-SOS-6c4-4 | protein-0387.pdb | -7.40943 | 14363 | 6 | 14722 | 6 | 14822 | 6 |
| 7 | MS0323_RL | AGN-NEW-891-7 | protein-0387.pdb | -7.39885 | 14363 | 7 | 14722 | 7 | 14822 | 7 |
| 8 | MS0323_RL | CHR-SOS-6c4-6 | protein-0387.pdb | -7.37415 | 14363 | 8 | 14722 | 8 | 14822 | 8 |
| 9 | MS0323_RL | JOR-UNI-2fc-2 | protein-0387.pdb | -7.37377 | 14363 | 9 | 14722 | 9 | 14822 | 9 |
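
Note that `str.contains` matches substrings, so it returns every dataset whose name includes the pattern, not only an exact name. The sketch below illustrates the difference on a toy dataframe. The name `MS0323_L`, the missing entry, and all scores except the first are made up for illustration and may not exist in `master_FEP.pkl`.

```python
import pandas as pd

# Toy data: 'MS0323_L' and the None entry are hypothetical,
# included only to show how the two kinds of match differ.
toy = pd.DataFrame({
    'dataset': ['MS0323_RL', 'MS0323_L', '387_RL', None],
    'score': [-8.98986, -8.0, 0.999989, 0.5],
})

# Substring match. regex=False treats the pattern as literal text,
# and na=False makes missing dataset names count as non-matching.
contains_mask = toy['dataset'].str.contains('MS0323', regex=False, na=False)

# Exact match on the full dataset name.
exact_mask = toy['dataset'] == 'MS0323_RL'

print(toy.loc[contains_mask])  # both MS0323_* rows
print(toy.loc[exact_mask])     # only the MS0323_RL row
```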


Or, we can just examine the RL (receptor-ligand) systems of the '387' dataset with:

In [9]:
df.loc[df['dataset'] == '387_RL']

| | dataset | identity | receptor | score | v1_project | v1_run | v2_project | v2_run | v3_project | v3_run |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 387_RL | NC(=NO)NCCc1ccc(S(N)(=O)=O)cc1 | protein-0387.pdb | 0.999989 | 14645 | 0 | 14701 | 0 | 14801 | 0 |
| 1 | 387_RL | COC(=O)c1cccc(C(=N)NO)c1 | protein-0387.pdb | 0.999915 | 14645 | 1 | 14701 | 1 | 14801 | 1 |
| 2 | 387_RL | CC(CC(N)=NO)NC(C)c1ccccc1 | protein-0387.pdb | 0.999833 | 14645 | 2 | 14701 | 2 | 14801 | 2 |
| 3 | 387_RL | CCCN(CCC(N)=NO)Cc1ccccc1 | protein-0387.pdb | 0.999794 | 14645 | 3 | 14701 | 3 | 14801 | 3 |
| 4 | 387_RL | C/C(=N\N/C(N)=N/O)c1ccccc1 | protein-0387.pdb | 0.999724 | 14645 | 4 | 14701 | 4 | 14801 | 4 |
| 5 | 387_RL | CC(OCc1ccccc1)C(=O)NCC(N)=NO | protein-0387.pdb | 0.999577 | 14645 | 5 | 14701 | 5 | 14801 | 5 |
| 6 | 387_RL | N/C(=N/O)NCc1ncccn1 | protein-0387.pdb | 0.999169 | 14645 | 6 | 14701 | 6 | 14801 | 6 |
| 7 | 387_RL | CC(NO)c1ccc(S(N)(=O)=O)cc1 | protein-0387.pdb | 0.998972 | 14645 | 7 | 14701 | 7 | 14801 | 7 |
| 8 | 387_RL | CN1CCCc2cc(C(=N)NO)ccc21 | protein-0387.pdb | 0.998815 | 14645 | 8 | 14701 | 8 | 14801 | 8 |
| 9 | 387_RL | CN1CCCc2cc(/C(N)=N\O)ccc21 | protein-0387.pdb | 0.998607 | 14645 | 9 | 14701 | 9 | 14801 | 9 |
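
If you need several datasets at once, `isin` is a convenient alternative to chaining `==` comparisons, and `query` offers a string-based form of the same exact match. The following is a rough sketch on a toy dataframe whose rows are copied from the outputs above. Sorting by `score` is shown only as an example; whether a higher or lower score is better depends on the dataset and is not assumed here.

```python
import pandas as pd

# Toy rows copied from the sample outputs in this README.
toy = pd.DataFrame({
    'dataset': ['387_RL', '387_RL', '72_RL', 'MS0323_RL'],
    'identity': ['COC(=O)c1cccc(C(=N)NO)c1',
                 'NC(=NO)NCCc1ccc(S(N)(=O)=O)cc1',
                 'CCNCC(COC)Oc1ccccc1',
                 'DAR-DIA-43a-4'],
    'score': [0.999915, 0.999989, 0.999790, -8.98986],
})

# Keep rows belonging to any of the listed datasets.
wanted = ['387_RL', '72_RL']
subset = toy.loc[toy['dataset'].isin(wanted)]

# Order the selection by score, largest first.
top = subset.sort_values('score', ascending=False)

# Equivalent exact match written with DataFrame.query.
via_query = toy.query("dataset == '387_RL'")

print(top)
print(via_query)
```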


To iterate over the rows that match your selection, use `iterrows()`:

In [12]:
for index, row in df.loc[df['dataset'] == '72_RL'].iterrows():
    print(f"cp p{row['v1_project']}/RUN{row['v1_run']} p{row['v2_project']}/RUN{row['v2_run']}")

cp p14600/RUN0 p14700/RUN0
cp p14600/RUN1 p14700/RUN1
cp p14600/RUN2 p14700/RUN2
cp p14600/RUN3 p14700/RUN3
cp p14600/RUN4 p14700/RUN4
cp p14600/RUN5 p14700/RUN5
cp p14600/RUN6 p14700/RUN6
cp p14600/RUN7 p14700/RUN7
cp p14600/RUN8 p14700/RUN8
cp p14600/RUN9 p14700/RUN9
cp p14600/RUN10 p14700/RUN10
cp p14600/RUN11 p14700/RUN11
cp p14600/RUN12 p14700/RUN12
cp p14600/RUN13 p14700/RUN13
cp p14600/RUN14 p14700/RUN14
cp p14600/RUN15 p14700/RUN15
cp p14600/RUN16 p14700/RUN16
cp p14600/RUN17 p14700/RUN17
cp p14600/RUN18 p14700/RUN18
cp p14600/RUN19 p14700/RUN19
cp p14600/RUN20 p14700/RUN20
cp p14600/RUN21 p14700/RUN21
cp p14600/RUN22 p14700/RUN22
cp p14600/RUN23 p14700/RUN23
cp p14600/RUN24 p14700/RUN24
cp p14600/RUN25 p14700/RUN25
cp p14600/RUN26 p14700/RUN26
cp p14600/RUN27 p14700/RUN27
cp p14600/RUN28 p14700/RUN28
cp p14600/RUN29 p14700/RUN29
cp p14600/RUN30 p14700/RUN30
cp p14600/RUN31 p14700/RUN31
cp p14600/RUN32 p14700/RUN32
cp p14600/RUN33 p14700/RUN33
cp p14600/RUN34 p14700/RUN34
cp p1

cp p14609/RUN96 p14700/RUN1066
cp p14610/RUN0 p14700/RUN1067
cp p14610/RUN1 p14700/RUN1068
cp p14610/RUN2 p14700/RUN1069
cp p14610/RUN3 p14700/RUN1070
cp p14610/RUN4 p14700/RUN1071
cp p14610/RUN5 p14700/RUN1072
cp p14610/RUN6 p14700/RUN1073
cp p14610/RUN7 p14700/RUN1074
cp p14610/RUN8 p14700/RUN1075
cp p14610/RUN9 p14700/RUN1076
cp p14610/RUN10 p14700/RUN1077
cp p14610/RUN11 p14700/RUN1078
cp p14610/RUN12 p14700/RUN1079
cp p14610/RUN13 p14700/RUN1080
cp p14610/RUN14 p14700/RUN1081
cp p14610/RUN15 p14700/RUN1082
cp p14610/RUN16 p14700/RUN1083
cp p14610/RUN17 p14700/RUN1084
cp p14610/RUN18 p14700/RUN1085
cp p14610/RUN19 p14700/RUN1086
cp p14610/RUN20 p14700/RUN1087
cp p14610/RUN21 p14700/RUN1088
cp p14610/RUN22 p14700/RUN1089
cp p14610/RUN23 p14700/RUN1090
cp p14610/RUN24 p14700/RUN1091
cp p14610/RUN25 p14700/RUN1092
cp p14610/RUN26 p14700/RUN1093
cp p14610/RUN27 p14700/RUN1094
cp p14610/RUN28 p14700/RUN1095
cp p14610/RUN29 p14700/RUN1096
cp p14610/RUN30 p14700/RUN1097
cp p14610/RUN31 p1
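
The loop above can also be written with `itertuples`, which gives attribute access to the columns and is usually faster than `iterrows` on large selections. This is a minimal sketch on a toy dataframe. Collecting the strings in a list called `commands` is an illustrative choice, so that they can be reviewed or written to a file before anything is run.

```python
import pandas as pd

# Toy stand-in with only the columns needed for the command strings.
toy = pd.DataFrame({
    'dataset': ['72_RL', '72_RL', '387_RL'],
    'v1_project': [14600, 14600, 14645],
    'v1_run': [0, 1, 0],
    'v2_project': [14700, 14700, 14701],
    'v2_run': [0, 1, 0],
})

subset = toy.loc[toy['dataset'] == '72_RL']

# Build one command string per matching row; nothing is executed here.
commands = [
    f"cp p{row.v1_project}/RUN{row.v1_run} p{row.v2_project}/RUN{row.v2_run}"
    for row in subset.itertuples(index=False)
]

print("\n".join(commands))
```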

You can combine multiple selections with `&` in a similar loop (wrap each condition in parentheses):

In [26]:
for index, row in df.loc[(df['dataset'] == '72_RL') & (df['v2_run'] < 100)].iterrows():
    print(row['dataset'], row['v1_project'], row['v1_run'], row['v2_project'], row['v2_run'])

72_RL 14600 0 14700 0
72_RL 14600 1 14700 1
72_RL 14600 2 14700 2
72_RL 14600 3 14700 3
72_RL 14600 4 14700 4
72_RL 14600 5 14700 5
72_RL 14600 6 14700 6
72_RL 14600 7 14700 7
72_RL 14600 8 14700 8
72_RL 14600 9 14700 9
72_RL 14600 10 14700 10
72_RL 14600 11 14700 11
72_RL 14600 12 14700 12
72_RL 14600 13 14700 13
72_RL 14600 14 14700 14
72_RL 14600 15 14700 15
72_RL 14600 16 14700 16
72_RL 14600 17 14700 17
72_RL 14600 18 14700 18
72_RL 14600 19 14700 19
72_RL 14600 20 14700 20
72_RL 14600 21 14700 21
72_RL 14600 22 14700 22
72_RL 14600 23 14700 23
72_RL 14600 24 14700 24
72_RL 14600 25 14700 25
72_RL 14600 26 14700 26
72_RL 14600 27 14700 27
72_RL 14600 28 14700 28
72_RL 14600 29 14700 29
72_RL 14600 30 14700 30
72_RL 14600 31 14700 31
72_RL 14600 32 14700 32
72_RL 14600 33 14700 33
72_RL 14600 34 14700 34
72_RL 14600 35 14700 35
72_RL 14600 36 14700 36
72_RL 14600 37 14700 37
72_RL 14600 38 14700 38
72_RL 14600 39 14700 39
72_RL 14600 40 14700 40
72_RL 14600 41 14700 41
72_RL 14600 