# Embeddings for drugs from DrugBank using ChatGPT

This notebook creates and saves embeddings for drugs from [DrugBank](https://go.drugbank.com/releases/latest#biotech-sequences). The dataset contains 11582 drug SMILES strings and 8723 drug descriptions.
Embeddings are created from:
1) SMILES representations
2) drug names and drug descriptions.


 * [OpenAI GPT-3 Text Embeddings - Really a new state-of-the-art in dense text embeddings?](https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9)
 * [Comparison of embeddings in LLM](https://betterprogramming.pub/openais-embedding-model-with-vector-database-b69014f04433)
 * [Introducing text and code embeddings](https://openai.com/blog/introducing-text-and-code-embeddings), OpenAI, Jan 2022 (note that the models used in this tutorial are not the most recent ones. In Dec 2022 the models were updated to text-embedding-ada-002 and others, and further new models are expected after Jan 2024.)

In [None]:
# The cells below call openai.Embedding.create, which is only available in openai < 1.0
!pip install "openai<1.0"
!pip install deepchem
!pip install rdkit




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import openai
import numpy as np
from google.colab import drive
import deepchem as dc
from rdkit import Chem

# Add your OpenAI API key
openai.api_key = '....'
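Hard-coding the key in a shared notebook is easy to leak. A minimal sketch of reading it from an environment variable instead; the helper name `read_api_key` and the variable name `OPENAI_API_KEY` are assumptions for illustration, not something this notebook defines.

```python
import os


def read_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Both the helper and the default variable name are hypothetical;
    adapt them to however the key is provided in your environment.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


# Possible usage in the cell above (not executed here):
# openai.api_key = read_api_key()
```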

## SMILES
Extracted from the All drugs.xml file from [DrugBank](https://go.drugbank.com/releases/latest)

In [None]:
url = 'https://raw.githubusercontent.com/sshaghayeghs/DDI-LLM/main/Dataset/Drug%20Information/structure%20links.csv'

In [None]:
df = pd.read_csv(url)

# extract the drug IDs and SMILES strings, drop rows with missing SMILES,
# and reset the index (chaining avoids modifying a slice of df in place,
# which triggers pandas' SettingWithCopyWarning)
df_smiles = df[["DrugBank ID", "SMILES"]].dropna().reset_index(drop=True)

# display the resulting dataframe
df_smiles


Unnamed: 0,DrugBank ID,SMILES
0,DB00006,CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@...
1,DB00007,CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=...
2,DB00014,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H...
3,DB00027,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(...
4,DB00035,NC(=O)CC[C@@H]1NC(=O)[C@H](CC2=CC=CC=C2)NC(=O)...
...,...,...
11578,DB17379,CC(C)C1=C(O)C(O)=C(C=O)C2=C(O)C(=C(C)C=C12)C1=...
11579,DB17383,CN1CCN(CC2=CC=C(NC(=O)C3=NNC=C3NC3=C4C=CNC4=NC...
11580,DB17384,CC1=C2N=C(C3=CC=CC=C3Cl)C3=C(NC2=NN1)C=C(N=C3)...
11581,DB17385,CC[C@@]1(OC(=O)C(C)ON=C2C3=C(C4=C2C=C(C=C4[N+]...


In [None]:
df

Unnamed: 0,DrugBank ID,Name,CAS Number,Drug Groups,InChIKey,InChI,SMILES,Formula,KEGG Compound ID,KEGG Drug ID,PubChem Compound ID,PubChem Substance ID,ChEBI ID,ChEMBL ID,HET ID,ChemSpider ID,BindingDB ID
0,DB00006,Bivalirudin,128270-60-0,approved; investigational,OIRCOABEOLEUMC-GEJPAHFPSA-N,InChI=1S/C98H138N24O33/c1-5-52(4)82(96(153)122...,CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@...,C98H138N24O33,,D03136,16129704.0,46507415.0,59173.0,CHEMBL2103749,,10482069.0,50248103.0
1,DB00007,Leuprolide,53714-56-0,approved; investigational,GFIJNRVAKGFPGQ-LIJARHBVSA-N,InChI=1S/C59H84N16O12/c1-6-63-57(86)48-14-10-2...,CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=...,C59H84N16O12,C07612,D08113,,46507635.0,6427.0,CHEMBL1201199,,571356.0,50369395.0
2,DB00014,Goserelin,65807-02-5,approved,BLCLNMBMMGCOAS-URPVMXJPSA-N,InChI=1S/C59H84N18O14/c1-31(2)22-40(49(82)68-3...,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H...,C59H84N18O14,,D00573,5311128.0,46507336.0,5523.0,CHEMBL1201247,,4470656.0,
3,DB00027,Gramicidin D,1405-97-6,approved,NDAYQJDHGXTBJL-MWWSRJDJSA-N,InChI=1S/C96H135N19O16/c1-50(2)36-71(105-79(11...,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(...,C96H135N19O16,,D04369,45267103.0,46507412.0,,CHEMBL557217,,24623445.0,
4,DB00035,Desmopressin,16679-58-6,approved,NFLWUMRGJYTJIN-PNIOQBSNSA-N,InChI=1S/C46H64N14O12S2/c47-35(62)15-14-29-40(...,NC(=O)CC[C@@H]1NC(=O)[C@H](CC2=CC=CC=C2)NC(=O)...,C46H64N14O12S2,C06944,D00291,,,4450.0,CHEMBL1429,,4470602.0,50205308.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12222,DB17382,AUM-601,,investigational,,,,,,,,,,,,,
12223,DB17383,FN-1501,1429515-59-2,investigational,VXLAKHWYGRKCGI-UHFFFAOYSA-N,InChI=1S/C22H25N9O/c1-30-8-10-31(11-9-30)13-15...,CN1CCN(CC2=CC=C(NC(=O)C3=NNC=C3NC3=C4C=CNC4=NC...,C22H25N9O,,,,,,CHEMBL4077071,,67886359.0,50270304.0
12224,DB17384,Tinengotinib,2230490-29-4,investigational,DQFCVOOFMXEPOC-UHFFFAOYSA-N,InChI=1S/C20H19ClN6O/c1-12-18-20(26-25-12)23-1...,CC1=C2N=C(C3=CC=CC=C3Cl)C3=C(NC2=NN1)C=C(N=C3)...,C20H19ClN6O,,,,,,,,115008005.0,
12225,DB17385,Lipotecan,1432468-79-5,investigational,JCCCLGDYMMTBPM-HXDHBHDHSA-N,InChI=1S/C39H30N8O15/c1-5-39(26-13-30-34-18(14...,CC[C@@]1(OC(=O)C(C)ON=C2C3=C(C4=C2C=C(C=C4[N+]...,C39H30N8O15,,,,,,CHEMBL4594352,,64854346.0,


In [None]:
EMBEDDING_MODEL = "text-embedding-ada-002"

# Request one embedding per SMILES string (one API call per drug)
embeddings = []
for s in df_smiles['SMILES']:
    response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=s)
    SMILES_Embedding = [e["embedding"] for e in response["data"]]
    embeddings.extend(SMILES_Embedding)

# Stack the vectors into a matrix and append them as columns 0..1535
matrix = np.array(embeddings)
df_smiles = pd.concat([df_smiles, pd.DataFrame(matrix)], axis=1)
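The loop above sends one request per SMILES string, which means more than 11,000 API calls. The embeddings endpoint also accepts a list of inputs, so the requests could be grouped. The following is a minimal sketch of that idea, not the method used in this notebook: `embed_in_batches` and `embed_fn` are hypothetical names, and the batch size of 100 is an arbitrary assumption. The embedding call is passed in as a function so the helper can be tried without network access.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed texts in chunks of batch_size, preserving input order.

    embed_fn is a caller-supplied function that takes a list of strings
    and returns one vector per string (hypothetical interface).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_fn returned a different number of vectors")
        vectors.extend(batch_vectors)
    return vectors


# A wrapper around the call used in this notebook might look like this
# (assumes openai < 1.0; not executed here):
# def openai_embed(batch):
#     response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
#     return [e["embedding"] for e in response["data"]]

# Offline demonstration with a stand-in embedder that returns the string length.
calls = []

def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(s))] for s in batch]

demo_vectors = embed_in_batches(["CCO", "CCN", "C", "CC(=O)O", "O"], fake_embed, batch_size=2)
```

With a real wrapper, the call would be `embed_in_batches(df_smiles['SMILES'].tolist(), openai_embed)`, and the result can be turned into `matrix` with `np.array` as in the cell above.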


In [None]:
df_smiles.head()

Unnamed: 0,DrugBank ID,SMILES,0,1,2,3,4,5,6,7,...,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535
0,DB00006,CC[C@H](C)[C@H](NC(=O)[C@H](CCC(O)=O)NC(=O)[C@...,-0.004058,-0.012137,-0.015942,-0.033304,0.006861,0.001978,0.011687,-0.020355,...,0.005235,-0.009962,0.035181,-0.01447,-0.019873,0.014509,0.00113,-0.001672,0.004309,-0.014585
1,DB00007,CCNC(=O)[C@@H]1CCCN1C(=O)[C@H](CCCNC(N)=N)NC(=...,0.000901,-0.006647,-0.016257,-0.016899,-0.002548,0.008476,0.000789,-0.021973,...,-0.001559,-0.012075,0.050344,-0.025788,-0.008194,0.013622,0.005254,-0.006264,0.005844,-0.020649
2,DB00014,CC(C)C[C@H](NC(=O)[C@@H](COC(C)(C)C)NC(=O)[C@H...,-0.012944,-0.010268,-0.021658,-0.025679,0.001797,0.003518,0.00303,-0.005819,...,0.009178,-0.012109,0.041594,-0.02846,-0.020196,0.020836,0.002066,-0.004938,-0.000288,-0.018212
3,DB00027,CC(C)C[C@@H](NC(=O)CNC(=O)[C@@H](NC=O)C(C)C)C(...,-0.010805,-0.017782,-0.008534,-0.025193,0.007859,0.006013,0.001504,-0.011974,...,0.003835,-0.003586,0.040185,-0.025674,-0.024752,0.018989,0.002612,-0.002705,-0.001424,-0.008385
4,DB00035,NC(=O)CC[C@@H]1NC(=O)[C@H](CC2=CC=CC=C2)NC(=O)...,-0.003569,-0.007164,-0.015846,-0.019412,-0.003754,0.002498,0.005443,-0.026427,...,0.004552,-0.012663,0.046916,-0.023211,-0.013473,0.011068,0.001153,-0.002503,-0.002738,-0.02456


In [None]:
df_smiles.to_csv('/content/drive/MyDrive/Shaghayegh Sadeghi/Drug embedding/SMILES_GPT.csv', sep='\t')
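The file is written tab-separated, with the dataframe index as the first column and the embedding dimensions as columns named `0` to `1535`. A minimal sketch of reading such a file back and separating the IDs from the embedding matrix is shown below; `load_embeddings` is a hypothetical helper, and the demonstration uses a tiny in-memory table rather than the real file.

```python
import io

import numpy as np
import pandas as pd


def load_embeddings(source, id_col, sep="\t"):
    """Read a table written by DataFrame.to_csv(sep='\t') and return (ids, matrix).

    Embedding columns are recognised by their purely numeric names,
    which is how pd.concat labelled them in this notebook.
    """
    table = pd.read_csv(source, sep=sep, index_col=0)
    dim_cols = [c for c in table.columns if str(c).isdigit()]
    return table[id_col].tolist(), table[dim_cols].to_numpy(dtype=float)


# Small stand-in for df_smiles with 3-dimensional vectors (made-up values).
demo = pd.concat(
    [
        pd.DataFrame({"DrugBank ID": ["DB1", "DB2"], "SMILES": ["CCO", "CCN"]}),
        pd.DataFrame(np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])),
    ],
    axis=1,
)
buffer = io.StringIO()
demo.to_csv(buffer, sep="\t")
buffer.seek(0)

ids, vectors = load_embeddings(buffer, id_col="DrugBank ID")
```

For the saved file, `source` would be the path passed to `to_csv` above, and `id_col` would be `"DrugBank ID"`.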

## Drug Description
Extracted from the All drugs.xml file from [DrugBank](https://go.drugbank.com/releases/latest)

In [None]:
urlDes='https://raw.githubusercontent.com/sshaghayeghs/DDI-LLM/main/Dataset/Drug%20Information/Drug_description.csv'

In [None]:
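Drug = pd.read_csv(urlDes)

# Drop rows whose description is only ";;;;", then reset the index.
# Note: the column is spelled "Discription" in the CSV.
Drug = Drug[Drug.Discription != ";;;;"]
Drug = Drug.reset_index(drop=True)
Drug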

Unnamed: 0,Drug ID,Drug Name,Discription
0,DB00001,Lepirudin,Lepirudin is a recombinant hirudin formed by 6...
1,DB00002,Cetuximab,Cetuximab is a recombinant chimeric human/mous...
2,DB00003,Dornase alfa,Dornase alfa is a biosynthetic form of human d...
3,DB00004,Denileukin diftitox,A recombinant DNA-derived cytotoxic protein co...
4,DB00005,Etanercept,Dimeric fusion protein consisting of the extra...
...,...,...,...
8718,DB17378,MVR-T3011,MVR-T3011 is a genetically modified oncolytic ...
8719,DB17379,(-)-Gossypol,"(-)-Gossypol or (R)-Gossypol, is the R-isomer ..."
8720,DB17380,Saposin C,"Saposin C is a lysosomal protein. Recombinant,..."
8721,DB17381,Nadofaragene firadenovec,Nadofaragene firadenovec (nadofaragene firaden...


In [None]:
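EMBEDDING_MODEL = "text-embedding-ada-002"

# Request one embedding per drug description (one API call per drug).
# The column is spelled "Discription" in the CSV, so that name must be used here.
embeddings = []
for discript in Drug['Discription']:
    response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=discript)
    Drug_Embedding = [e["embedding"] for e in response["data"]]
    embeddings.extend(Drug_Embedding)

# Stack the vectors into a matrix and append them as columns 0..1535
matrix = np.array(embeddings)
Drug = pd.concat([Drug, pd.DataFrame(matrix)], axis=1)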
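A loop of several thousand API calls can fail partway through on a transient error such as a rate limit or a timeout, losing the work done so far. One common mitigation is to retry each call with exponential backoff. The sketch below is an assumption about how that could be done, not part of the original notebook: `call_with_retry` is a hypothetical helper, and it retries on any `Exception` by default because the exact error classes depend on the installed `openai` version.

```python
import time


def call_with_retry(fn, *args, max_attempts=5, base_delay=1.0,
                    retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between attempts
    and re-raises the last error once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Offline demonstration: a function that fails twice, then succeeds.
attempts = {"count": 0}
recorded_delays = []

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = call_with_retry(flaky, sleep=recorded_delays.append)
```

In the loop above, the request could then be written as `call_with_retry(openai.Embedding.create, model=EMBEDDING_MODEL, input=discript)`.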

In [None]:
Drug.head()

Unnamed: 0,Drug ID,Drug Name,Discription,0,1,2,3,4,5,6,...,1526,1527,1528,1529,1530,1531,1532,1533,1534,1535
0,DB00001,Lepirudin,Lepirudin is a recombinant hirudin formed by 6...,-0.023243,-0.003074,-0.012402,-0.001186,-0.000536,0.003723,-0.008107,...,0.013473,-0.018314,0.053811,-0.038109,-0.01997,0.010234,0.000348,-0.016429,-0.003619,0.006386
1,DB00002,Cetuximab,Cetuximab is a recombinant chimeric human/mous...,-0.039815,-0.00662,-0.013445,-0.024023,0.011249,0.006637,-0.027192,...,0.015463,0.010131,0.033266,-0.010604,-0.031268,0.001842,-0.041787,-0.012721,-0.009013,-0.001828
2,DB00003,Dornase alfa,Dornase alfa is a biosynthetic form of human d...,-0.032719,0.000878,-0.014393,-0.015269,-0.009142,0.033176,-0.027683,...,0.036435,-0.007951,0.005945,-0.004393,-0.010314,-0.005931,-0.007217,-0.008685,0.015861,0.005022
3,DB00004,Denileukin diftitox,A recombinant DNA-derived cytotoxic protein co...,-0.042006,-0.000229,-0.053231,-0.014633,0.003604,0.037218,-0.013145,...,0.021611,-0.00635,0.020989,-0.02575,-0.033107,0.009663,-0.009947,-0.026453,-0.004561,-0.006079
4,DB00005,Etanercept,Dimeric fusion protein consisting of the extra...,-0.021316,0.011136,-0.027183,-0.007539,-0.003882,0.007035,-0.031855,...,0.011249,-0.010711,0.043535,-0.009829,-0.010452,0.015994,0.003451,-0.03066,0.0176,-0.029333


In [None]:
Drug.to_csv('/content/drive/MyDrive/Shaghayegh Sadeghi/Drug embedding/Dr_Desc_GPT.csv', sep='\t')