# Complex/Ligand dictionaries

In the database of complexes, neither the metals nor the coordinated ligands are given explicitly. We combine all the data we have into a single dataframe for easier access.

In [1]:
import numpy as np
import pandas as pd
import json

First, we create a dictionary whose keys are the metals and whose values are the complexes in which they appear.

In [75]:
df_ligands1 = pd.read_csv('ligands_misc_info.csv', sep=';')
df_ligands1 = df_ligands1[['name', 'parent_metal_occurrences']]
df_complex = pd.read_csv('Separate complex info.csv')
df_ligands = pd.read_csv('Separate ligands info.csv')

# Map each metal to the IDs of the complexes in which it occurs.
metal_complex = {}
for string in df_ligands1['parent_metal_occurrences']:
    # The column stores dicts written with single quotes; convert them to valid JSON.
    acceptable_string = string.replace("'", "\"")
    dico = json.loads(acceptable_string)
    for k, o in dico.items():
        complexes = metal_complex.setdefault(k, [])
        for j in o:
            # Keep only the first six characters, which are compared with the complex IDs below.
            complexes.append(j[:6])
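As a minimal sketch of the same idea on toy data: assuming each entry of `parent_metal_occurrences` is a stringified dict that maps a metal symbol to a list of occurrence strings starting with the six-character complex ID (the sample strings and the helper name `build_metal_complex` below are made up for illustration), the mapping can be built in a single pass. `ast.literal_eval` reads the single-quoted text directly, so no quote replacement is needed in this variant.

```python
import ast
from collections import defaultdict

# Hypothetical sample rows, shaped like 'parent_metal_occurrences' entries.
sample_rows = [
    "{'Pd': ['KUVSUP-02-a'], 'Ru': ['WURFEW-01-a', 'WURFEW-01-b']}",
    "{'Ru': ['WURFEW-03-a'], 'Au': ['WAWJEM-01-a']}",
]

def build_metal_complex(rows):
    """Map each metal to the list of complex IDs it occurs in."""
    mapping = defaultdict(list)
    for row in rows:
        for metal, occurrences in ast.literal_eval(row).items():
            # The first six characters of an occurrence are taken as the complex ID.
            mapping[metal].extend(occ[:6] for occ in occurrences)
    return dict(mapping)

metal_complex_demo = build_metal_complex(sample_rows)
print(metal_complex_demo)
```

A complex ID appears once per occurrence in this sketch, so a list can contain repeats, just as in the cell above.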

In [17]:
df_complex["Metals"] = [[] for _ in range(len(df_complex))]

# Attach to each complex the metals whose complex lists contain its ID.
for j, id_value in enumerate(df_complex['id']):
    for metal, complex_list in metal_complex.items():
        if id_value in complex_list:
            df_complex.at[j, "Metals"].append(metal)
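The membership test above scans every metal's list for every complex. A possible alternative, sketched here on made-up data (the names `metal_complex_demo`, `df_demo` and `complex_metals` are illustrative only), is to invert the dictionary once into complex to metals and then do a single lookup per row.

```python
import pandas as pd

# Hypothetical toy inputs mirroring metal_complex and df_complex['id'].
metal_complex_demo = {"Pd": ["KUVSUP"], "Ru": ["WURFEW", "WURFEW"], "Au": ["WAWJEM"]}
df_demo = pd.DataFrame({"id": ["TUTCAM", "WAWJEM", "KUVSUP", "WURFEW"]})

# Invert metal -> complexes into complex -> metals (each metal listed once).
complex_metals = {}
for metal, complex_ids in metal_complex_demo.items():
    for complex_id in complex_ids:
        metals = complex_metals.setdefault(complex_id, [])
        if metal not in metals:
            metals.append(metal)

# One dictionary lookup per row; complexes that are absent get an empty list.
df_demo["Metals"] = [list(complex_metals.get(cid, [])) for cid in df_demo["id"]]
print(df_demo)
```

The inversion touches each occurrence once, so the cost grows with the number of occurrences rather than with the product of complexes and metals.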

Save our dataframe that now indicates the metals in the complexes.

In [80]:
df_complex.to_csv('Separate complex info.csv')
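One thing to keep in mind when saving list-valued columns: CSV stores them as text, and the row index is written as an extra unnamed column unless `index=False` is passed. The sketch below uses an in-memory buffer and a made-up two-row dataframe (the ID `ABCDEF` is hypothetical) to show one way of reading the lists back with the `converters` argument of `pd.read_csv`.

```python
import ast
import io

import pandas as pd

df_demo = pd.DataFrame({"id": ["TUTCAM", "ABCDEF"], "Metals": [["Cd"], []]})

# index=False avoids writing the row index as an extra unnamed column.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Lists are stored as text such as "['Cd']"; literal_eval turns them back into lists.
df_back = pd.read_csv(buffer, converters={"Metals": ast.literal_eval})
print(df_back.columns.tolist())
print(df_back["Metals"].tolist())
```

Whether to drop the index depends on later cells: any code that selects columns by position would see the columns shift by one.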

The different databases do not all have the same complexes, so there can be missing data. We look at the number of complexes with missing information.

In [24]:
# Count the complexes whose metal list is empty.
print(sum(len(metals) == 0 for metals in df_complex["Metals"]))

3754


That is 3754 out of roughly 60,000 complexes (about 6%), which is acceptable.

We collect all complex IDs in a list so that we can iterate over them easily.

In [50]:
tot_complex = df_complex['id'].tolist()

In [106]:
df_complex["Ligands and number of occurance"]=[{} for _ in range(len(df_complex))]
df_ligands["Complex in which they appear and number of appearances"]=[{} for _ in range(len(df_ligands))]

Vectorized string operations make this step faster. Here, we record in the ligand dataframe the complexes each ligand appears in and the number of appearances.

In [66]:
col = "Complex in which they appear and number of appearances"
for complex_id in tot_complex:
    # Vectorized count of the complex ID in every ligand's occurrence string.
    counts = df_ligands1['parent_metal_occurrences'].str.count(complex_id)
    # Store each ligand's own count (df_ligands and df_ligands1 are assumed to share the same row index).
    for idx in counts.index[counts > 0]:
        df_ligands.at[idx, col][complex_id] = int(counts[idx])
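A small sketch on made-up data of counting per ligand: `Series.str.count` returns one count per row, so each ligand can receive its own number of appearances. The occurrence strings and the `appearances` list are hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Hypothetical occurrence strings for three ligands.
occurrences = pd.Series([
    "{'Ru': ['WURFEW-01-a', 'WURFEW-01-b']}",
    "{'Ru': ['WURFEW-03-a'], 'Au': ['WAWJEM-01-a']}",
    "{'Pd': ['KUVSUP-02-a']}",
])
appearances = [{} for _ in range(len(occurrences))]

for complex_id in ["WURFEW", "WAWJEM"]:
    counts = occurrences.str.count(complex_id)  # one count per ligand row
    for row in counts.index[counts > 0]:
        appearances[row][complex_id] = int(counts[row])

print(appearances)
```

The first ligand contains `WURFEW` twice and the second only once, which a single shared count could not represent.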

We display the dataframe to check the result.

In [94]:
df_ligands

Unnamed: 0.1,Unnamed: 0,ID,Smiles,Stoichiometry,Charge,Nb atoms,Complex in which they appear and number of appearances
0,0,ligand0-0,N1(C([H])([H])c2c3c(c([H])c(c(c3[H])[H])[H])c(...,C20H18N2,0,40,"{'IRIXOW': 2, 'UKEBAN': 2, 'IRIXUC': 2, 'IRIYA..."
1,1,ligand1-0,[Cl],Cl,-1,1,"{'WAWJEM': 1, 'KUVSUP': 1, 'WURFEW': 2, 'FUTWE..."
2,2,ligand2-0,[C]1(C([H])(C(C([H])([H])[C](C(C([C]1[H])([H])...,C8H13,-1,21,{'WAGBIQ': 1}
3,3,ligand3-0,[Br],Br,-1,1,"{'UFUMUG': 2, 'MARNEY': 1, 'IGUGOH': 2, 'WEYVI..."
4,4,ligand4-0,O=[C],CO,0,2,"{'YAXBUU': 3, 'WURFEW': 2, 'LEYSAK': 4, 'ZEQJA..."
...,...,...,...,...,...,...,...
29726,29726,ligand31886-0,C1(=C(C(=C([C]1[H])[H])[H])c1nc([H])c(c(c1[H])...,C10H8N,-1,19,{'LECZUQ': 3}
29727,29727,ligand31887-0,s1c(c(c(c1[H])[H])[H])C(=O)[N]/N=C(/[H])\c1nc2...,C13H8N3OS2,-1,27,{'GEDMAG': 2}
29728,29728,ligand31888-0,n1n(c(c(c1[H])[H])[H])C(N(c1c([H])c([H])c(c(c1...,C14H14FN5,0,34,{'KAZKUT': 2}
29729,29729,ligand31890-0,Brc1n(nc(c1Br)Br)[B]([H])([H])n1nc(c(c1Br)Br)Br,C6H2BBr6N4,-1,19,{'EFUFIU': 2}


Save our progress.

In [79]:
df_ligands.to_csv('Separate ligand info.csv')

Now we add to the complex dataframe which ligands appear in each complex and their number of occurrences. The code is written with efficiency in mind, given the amount of data we are going through.

In [109]:
for i, complex_id in enumerate(df_complex['id']):
    for j, ligand_dico in enumerate(df_ligands['Complex in which they appear and number of appearances']):
        if complex_id in ligand_dico:
            # Key the entry by the ligand's ID column rather than by column position.
            df_complex["Ligands and number of occurance"][i][df_ligands["ID"].iloc[j]] = ligand_dico[complex_id]
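Since each ligand already stores a dictionary of the complexes it appears in, the complex-side dictionaries can also be obtained by inverting that mapping in one pass. The following is a sketch on toy data under that assumption; `ligand_appearances`, `complex_ids` and `complex_ligands` are illustrative names, not objects from the notebook.

```python
# Hypothetical toy data: ligand ID -> {complex ID: count}.
ligand_appearances = {
    "ligand1-0": {"WAWJEM": 1, "KUVSUP": 1, "WURFEW": 2},
    "ligand4-0": {"YAXBUU": 3, "WURFEW": 2},
}
complex_ids = ["WAWJEM", "WURFEW", "YAXBUU", "TUTCAM"]

# Start with an empty dict per complex so that complexes without ligands stay visible.
complex_ligands = {cid: {} for cid in complex_ids}
for ligand_id, per_complex in ligand_appearances.items():
    for cid, count in per_complex.items():
        # Ignore complexes that are not part of the complex table.
        if cid in complex_ligands:
            complex_ligands[cid][ligand_id] = count

print(complex_ligands)
```

This visits every (ligand, complex) pair once instead of testing every complex against every ligand.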

In [110]:
df_complex

Unnamed: 0.1,Unnamed: 0,id,charge,molecular_mass,n_atoms,n_electrons,Stoichiometry,Metals,Ligands and number of occurance
0,0,TUTCAM,0,728.12688,81,374,,[Cd],"{'ligand34-0': 2, 'ligand27213-0': 2}"
1,1,WAWJEM,0,692.22093,75,346,,[Au],"{'ligand1-0': 1, 'ligand25035-0': 1}"
2,2,KUVSUP,0,492.00654,47,248,C20H19ClN2O4Pd,[Pd],"{'ligand1-0': 1, 'ligand115-0': 1, 'ligand1313..."
3,3,WURFEW,0,516.06263,56,264,,[Ru],"{'ligand1-0': 2, 'ligand4-0': 2, 'ligand12916-..."
4,4,YAXBUU,1,496.14599,49,242,,[Re],"{'ligand4-0': 3, 'ligand4191-0': 3}"
...,...,...,...,...,...,...,...,...,...
60794,60794,WIBQUU,0,478.92184,25,224,,[Pt],{'ligand19880-0': 2}
60795,60795,CERJOY,0,410.07147,47,212,,[Mn],"{'ligand4-0': 2, 'ligand4353-0': 2}"
60796,60796,VUPXIO,0,552.89084,31,262,,[Pt],"{'ligand1-0': 2, 'ligand21337-0': 2}"
60797,60797,KONBEW,0,598.99920,39,284,C20H10Cl2N6Pt,[Pt],"{'ligand1-0': 2, 'ligand4916-0': 2}"


We check that the number of complexes with missing data is the same as before; otherwise, something went wrong.

In [112]:
# Count the complexes whose ligand dictionary is empty.
print(sum(len(ligands) == 0 for ligands in df_complex["Ligands and number of occurance"]))

3754
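Equal counts do not by themselves guarantee that the same complexes are missing in both columns. A stricter check, sketched here on a made-up three-row dataframe (the column name `Ligands` and the ID `ABCDEF` are placeholders), compares the two boolean masks row by row.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "id": ["TUTCAM", "WAWJEM", "ABCDEF"],
    "Metals": [["Cd"], ["Au"], []],
    "Ligands": [{"ligand34-0": 2}, {"ligand1-0": 1}, {}],
})

no_metal = df_demo["Metals"].map(len) == 0
no_ligand = df_demo["Ligands"].map(len) == 0

# IDs where exactly one of the two columns is empty; expected to be an empty list.
mismatched = df_demo.loc[no_metal != no_ligand, "id"].tolist()
print(int(no_metal.sum()), int(no_ligand.sum()), mismatched)
```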


Save our progress.

In [113]:
df_complex.to_csv('Separate complex info.csv')