<a href="https://colab.research.google.com/github/seshiu/pubchem_rdkit/blob/main/PubChem_PUG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [30]:
import requests
import bs4
from bs4 import BeautifulSoup
import re
from statistics import mode
import pandas as pd
import numpy as np

# Working with PUG-REST

Let's try to get all the data PUG-REST has on aspirin.



In [None]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin")
data

<Response [400]>

Hmmm... A 400 status code means Bad Request: the server could not make sense of the URL. It looks like we have to specify the output format at the end of the URL. Otherwise, the request fails.

In [None]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/xml")
data

<Response [200]>
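To make the role of the output format explicit, here is a minimal sketch of a URL builder for this kind of name lookup. The helper name pug_rest_url and its parameters are hypothetical (they are not part of any PubChem client). It only assembles the same URL pattern used above, percent-encodes the compound name, and refuses to build a URL without an output format.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def pug_rest_url(name, output="xml"):
    """Build a compound-by-name PUG-REST URL (hypothetical helper)."""
    if not output:
        # Leaving the format off is what produced the 400 response above.
        raise ValueError("PUG-REST needs an output format, e.g. 'xml'")
    # safe='' also encodes '/' so an odd name cannot change the URL path.
    return f"{PUG_REST_BASE}/compound/name/{quote(name, safe='')}/{output}"

print(pug_rest_url("aspirin"))
```

The resulting string can be passed straight to requests.get, exactly like the hand-written URL.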

Ok. Let's parse this information using Beautiful Soup.

In [None]:
html = BeautifulSoup(data.content, "xml")

Let's find out what kind of information we can get.

In [None]:
html.find_all("PC-Urn_label")

[<PC-Urn_label>Compound</PC-Urn_label>,
 <PC-Urn_label>Compound Complexity</PC-Urn_label>,
 <PC-Urn_label>Count</PC-Urn_label>,
 <PC-Urn_label>Count</PC-Urn_label>,
 <PC-Urn_label>Count</PC-Urn_label>,
 <PC-Urn_label>Fingerprint</PC-Urn_label>,
 <PC-Urn_label>IUPAC Name</PC-Urn_label>,
 <PC-Urn_label>IUPAC Name</PC-Urn_label>,
 <PC-Urn_label>IUPAC Name</PC-Urn_label>,
 <PC-Urn_label>IUPAC Name</PC-Urn_label>,
 <PC-Urn_label>IUPAC Name</PC-Urn_label>,
 <PC-Urn_label>IUPAC Name</PC-Urn_label>,
 <PC-Urn_label>InChI</PC-Urn_label>,
 <PC-Urn_label>InChIKey</PC-Urn_label>,
 <PC-Urn_label>Log P</PC-Urn_label>,
 <PC-Urn_label>Mass</PC-Urn_label>,
 <PC-Urn_label>Molecular Formula</PC-Urn_label>,
 <PC-Urn_label>Molecular Weight</PC-Urn_label>,
 <PC-Urn_label>SMILES</PC-Urn_label>,
 <PC-Urn_label>SMILES</PC-Urn_label>,
 <PC-Urn_label>Topological</PC-Urn_label>,
 <PC-Urn_label>Weight</PC-Urn_label>]

Let's get the Molecular Weight. First, we locate the tag.

In [None]:
mw_tag = html.find(name="PC-Urn_label", string="Molecular Weight")
mw_tag

<PC-Urn_label>Molecular Weight</PC-Urn_label>

Then, let's take a look at the parent of this tag.

In [None]:
mw_parents = mw_tag.find_parent("PC-InfoData")
mw_parents

<PC-InfoData>
<PC-InfoData_urn>
<PC-Urn>
<PC-Urn_label>Molecular Weight</PC-Urn_label>
<PC-Urn_datatype>
<PC-UrnDataType value="string">1</PC-UrnDataType>
</PC-Urn_datatype>
<PC-Urn_version>2.1</PC-Urn_version>
<PC-Urn_software>PubChem</PC-Urn_software>
<PC-Urn_source>ncbi.nlm.nih.gov</PC-Urn_source>
<PC-Urn_release>2021.05.07</PC-Urn_release>
</PC-Urn>
</PC-InfoData_urn>
<PC-InfoData_value>
<PC-InfoData_value_sval>180.16</PC-InfoData_value_sval>
</PC-InfoData_value>
</PC-InfoData>

The value we want sits inside the 'PC-InfoData_value_sval' tag, so let's grab it.

In [None]:
mw = mw_parents.find('PC-InfoData_value_sval').string
mw

'180.16'

There you go. It takes some playing around. You can always open the URL directly, which renders the data in your browser. That's what I did to help navigate this tree.
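The same three steps (find the label, climb to 'PC-InfoData', read the value) can be wrapped in one function. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree instead of Beautiful Soup, and it runs on a trimmed copy of the record printed above so that it works offline. The name value_for_label is hypothetical. I'm assuming the full response may declare an XML namespace, so tag names are compared with any namespace prefix removed.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the Molecular Weight record printed above.
SAMPLE = """<PC-InfoData>
  <PC-InfoData_urn>
    <PC-Urn>
      <PC-Urn_label>Molecular Weight</PC-Urn_label>
    </PC-Urn>
  </PC-InfoData_urn>
  <PC-InfoData_value>
    <PC-InfoData_value_sval>180.16</PC-InfoData_value_sval>
  </PC-InfoData_value>
</PC-InfoData>"""

def local_name(tag):
    """Drop a '{namespace}' prefix if the parser added one."""
    return tag.rsplit("}", 1)[-1]

def value_for_label(xml_text, label):
    """Return the text of the first value stored under `label`, or None."""
    root = ET.fromstring(xml_text)
    for info in root.iter():
        if local_name(info.tag) != "PC-InfoData":
            continue
        labels = [e.text for e in info.iter()
                  if local_name(e.tag) == "PC-Urn_label"]
        if label not in labels:
            continue
        # The value lives in a child such as PC-InfoData_value_sval.
        for e in info.iter():
            if local_name(e.tag).startswith("PC-InfoData_value_"):
                return e.text
    return None

print(value_for_label(SAMPLE, "Molecular Weight"))
```

It returns only the first match, so labels that appear several times in the list above (for example 'IUPAC Name') would need an extra criterion to tell them apart.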

# Working with PUG-View

Let's do the same for PUG-View and see what we can get from it for aspirin.

In [None]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/2244/xml")
data

<Response [200]>

In [None]:
html = BeautifulSoup(data.content, "xml")

To get an idea of what information is available, we can look up TOCHeadings with html.find_all('TOCHeading'). I'm not gonna run the code here, because it's a really long list. Let's say we want to look up melting points.

In [None]:
mp_tag = html.find(name='TOCHeading', string='Melting Point')
mp_tag

<TOCHeading>Melting Point</TOCHeading>

If you look at the XML structure, you'll see that this TOCHeading has sibling tags named 'Information'. The values we want are children of those 'Information' tags, stored inside 'String' tags. So, to find the first value, we can do this:

In [None]:
mp_tag.find_next_sibling('Information').find(name='String').string

'275 °F (NTP, 1992)'

As you can see, these values come with not just units but also annotations about where the data comes from. On top of that, there are multiple values for the melting point, since it's measured experimentally. This was quite a long way to find the melting point. An easier way is to put the heading we want in the URL.

In [2]:
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/2244/xml?heading=Melting+Point")
data

<Response [200]>
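The '+' in Melting+Point is just a URL-encoded space. As a small sketch, the query string can be generated from the plain heading text with urllib.parse.urlencode. The helper name pug_view_url is hypothetical, and it reproduces only the URL pattern used in the cell above.

```python
from urllib.parse import urlencode

PUG_VIEW_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound"

def pug_view_url(cid, heading=None, output="xml"):
    """Build a PUG-View URL, optionally limited to one heading (hypothetical helper)."""
    url = f"{PUG_VIEW_BASE}/{int(cid)}/{output}"
    if heading:
        # urlencode turns "Melting Point" into "heading=Melting+Point".
        url += "?" + urlencode({"heading": heading})
    return url

print(pug_view_url(2244, "Melting Point"))
```

This way the heading can be written exactly as it appears in the TOCHeading tag.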

In [3]:
html = BeautifulSoup(data.content, "xml")

In [15]:
html.find_all('String')

[<String>275 °F (NTP, 1992)</String>,
 <String>138-140</String>,
 <String>135 °C (rapid heating)</String>,
 <String>135 °C</String>,
 <String>135 °C</String>,
 <String>275 °F</String>,
 <String>275 °F</String>]

In [23]:
mode(html.find_all('String')).text

'135 °C'
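Taking the mode of the raw strings treats '135 °C' and '275 °F' as different values, even though they describe the same temperature. Here is a rough sketch of how the strings listed above could be converted to degrees Celsius before comparing them. The regular expression and the function name to_celsius are my own assumptions about how these strings are usually written, not a PubChem format guarantee. Strings without an explicit unit (like '138-140') are skipped, and for a range only the number directly before the unit is used.

```python
import re
from statistics import mode

# Values copied from the find_all('String') output above.
RAW = ["275 °F (NTP, 1992)", "138-140", "135 °C (rapid heating)",
       "135 °C", "135 °C", "275 °F", "275 °F"]

# A number (optionally negative) followed by °C or °F.
# The lookbehind stops the '-' of a range such as '135-136 °C'
# from being read as a minus sign.
TEMPERATURE = re.compile(r"(?<![\d.])(-?\d+(?:\.\d+)?)\s*°\s*([CF])")

def to_celsius(text):
    """Return the temperature in °C, or None if no number with a unit is found."""
    match = TEMPERATURE.search(text)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    if unit == "F":
        value = (value - 32) * 5 / 9
    return round(value, 2)

celsius = [to_celsius(s) for s in RAW]
known = [c for c in celsius if c is not None]
print(celsius)
print(mode(known))
```

With the values above, every entry that carries a unit converts to the same temperature, which makes the choice of '135 °C' look a lot less arbitrary.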

In [32]:
CID_list = [2345,
 23235,
 8698,
 21265,
 13654,
 8437,
 7194,
 7193,
 7165,
 5705112,
 32611,
 25054,
 33023,
 91234,
 134692469,
 111835,
 87390959,
 76959962,
 44150341,
 21989361]

df_properties = pd.DataFrame(index = CID_list, columns = ['mp'])

for CID in CID_list:
  data = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{CID}/xml?heading=Melting+Point")
  html = BeautifulSoup(data.content, "xml")
  strings = html.find_all('String')
  if len(strings) > 0:
    # Several values may be reported; keep the most frequent one
    value = mode(strings).text
    print(value)
    df_properties.loc[CID,'mp'] = value
  else:
    # No melting point is recorded for this compound
    df_properties.loc[CID,'mp'] = np.nan

df_properties

21 °C
-22 °C
-28 °C
-34 °C
38.5 °C
-49 °F (USCG, 1999)
-77.8 °C


| CID | mp |
|---|---|
| 2345 | 21 °C |
| 23235 | NaN |
| 8698 | -22 °C |
| 21265 | NaN |
| 13654 | NaN |
| 8437 | -28 °C |
| 7194 | NaN |
| 7193 | NaN |
| 7165 | -34 °C |
| 5705112 | 38.5 °C |
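The loop above sends one request per CID as fast as the server answers. As far as I know, PubChem's usage policy asks for no more than five requests per second, so for longer CID lists it may be worth spacing the calls out. Below is a minimal sketch of such a throttle. The class name Throttle and the injectable clock and sleep arguments are my own choices, added so the logic can be checked without actually waiting.

```python
import time

class Throttle:
    """Space out calls so that at most `per_second` of them happen each second."""

    def __init__(self, per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self.last_call = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

throttle = Throttle(per_second=5)
throttle.wait()  # the first call returns immediately
```

In the loop, calling throttle.wait() right before each requests.get would be enough.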


In [37]:
def cid_to_property(CID_list, property_name, property_heading):
  """Return a DataFrame with the most frequent PUG-View value of one property per CID.

  property_heading is the TOCHeading with spaces written as '+', e.g. 'Refractive+Index'.
  """
  df_properties = pd.DataFrame(index = CID_list, columns = [property_name])

  for CID in CID_list:
    data = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{CID}/xml?heading={property_heading}")
    html = BeautifulSoup(data.content, "xml")
    strings = html.find_all('String')
    if len(strings) > 0:
      # Several values may be reported; keep the most frequent one
      value = mode(strings).text
      print(value)
      df_properties.loc[CID,property_name] = value
    else:
      # Nothing is reported under this heading
      df_properties.loc[CID,property_name] = np.nan

  return df_properties

df_properties = cid_to_property(CID_list, 'RI', 'Refractive+Index')







Index of refraction: 1.5680 at 20 °C/D
1.490-1.500
INDEX OF REFRACTION: 1.4940 @ 25 °C/D
1.514-1.521
1.492-1.497
Refractive index = 1.5424
1.491-1.497
1.502-1.508
INDEX OF REFRACTION: 1.4449 @ 20 °C/D


In [38]:
df_properties

| CID | RI |
|---|---|
| 2345 | Index of refraction: 1.5680 at 20 °C/D |
| 23235 | 1.490-1.500 |
| 8698 | INDEX OF REFRACTION: 1.4940 @ 25 °C/D |
| 21265 | 1.514-1.521 |
| 13654 | 1.492-1.497 |
| 8437 | Refractive index = 1.5424 |
| 7194 | NaN |
| 7193 | 1.491-1.497 |
| 7165 | 1.502-1.508 |
| 5705112 | NaN |
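The refractive index strings are even less uniform than the melting points: some are bare ranges, and some carry a label and a measurement temperature. Here is a heuristic sketch for pulling a single number out of them. It assumes the refractive index is written as 1.xxx or 1.xxxx (which holds for the values above, but is not guaranteed in general) and averages the two ends of a range. The function name parse_refractive_index is hypothetical.

```python
import re

# A number like 1.490 or 1.5680 that is not part of a longer number.
RI_NUMBER = re.compile(r"(?<![\d.])1\.\d{3,4}(?!\d)")

def parse_refractive_index(text):
    """Return one float per string: the value itself, or the midpoint of a range."""
    numbers = [float(m) for m in RI_NUMBER.findall(text)]
    if not numbers:
        return None
    return round(sum(numbers) / len(numbers), 4)

for raw in ["Index of refraction: 1.5680 at 20 °C/D",
            "1.490-1.500",
            "INDEX OF REFRACTION: 1.4940 @ 25 °C/D",
            "Refractive index = 1.5424"]:
    print(raw, "->", parse_refractive_index(raw))
```

The measurement temperature (20 or 25 °C) is ignored here, since it never matches the 1.xxx pattern.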


In [44]:
def cid_to_exp_properties(CID_list, property_name_list, property_heading_list):
  """Collect several experimental properties per CID.

  The two lists are matched by position: property_name_list[i] is the column
  name for the PUG-View heading property_heading_list[i].
  """
  df_properties = pd.DataFrame(index = CID_list, columns = property_name_list)

  for property_name, property_heading in zip(property_name_list, property_heading_list):

    print(property_heading)

    for CID in CID_list:
      data = requests.get(f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{CID}/xml?heading={property_heading}")
      html = BeautifulSoup(data.content, "xml")
      strings = html.find_all('String')
      if len(strings) > 0:
        # Several values may be reported; keep the most frequent one
        value = mode(strings).text
        print(value)
        df_properties.loc[CID,property_name] = value
      else:
        # Nothing is reported under this heading
        df_properties.loc[CID,property_name] = np.nan

  return df_properties

df_properties = cid_to_exp_properties(CID_list, ['MP','RI'], ['Melting+Point','Refractive+Index'])

Melting+Point
21 °C
-22 °C
-28 °C
-34 °C
38.5 °C
-49 °F (USCG, 1999)
-77.8 °C
Refractive+Index
Index of refraction: 1.5680 at 20 °C/D
1.490-1.500
INDEX OF REFRACTION: 1.4940 @ 25 °C/D
1.514-1.521
1.492-1.497
1.491-1.497
1.502-1.508
INDEX OF REFRACTION: 1.4449 @ 20 °C/D


In [45]:
df_properties

| CID | MP | RI |
|---|---|---|
| 2345 | 21 °C | Index of refraction: 1.5680 at 20 °C/D |
| 23235 | NaN | 1.490-1.500 |
| 8698 | -22 °C | INDEX OF REFRACTION: 1.4940 @ 25 °C/D |
| 21265 | NaN | 1.514-1.521 |
| 13654 | NaN | 1.492-1.497 |
| 8437 | -28 °C | NaN |
| 7194 | NaN | NaN |
| 7193 | NaN | 1.491-1.497 |
| 7165 | -34 °C | 1.502-1.508 |
| 5705112 | 38.5 °C | NaN |


In [24]:
print(bs4.__version__)

4.11.2


In [10]:
print(xml.__version__)

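NameError: name 'xml' is not defined

This fails because no module called xml was ever imported. The "xml" parser we passed to Beautiful Soup is provided by lxml, so that is the package whose version we need to check.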

In [12]:
import lxml
print(lxml.__version__)

4.9.2
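Not every package exposes a __version__ attribute, so here is a small sketch of a more uniform check using the standard library's importlib.metadata (Python 3.8 or newer). The helper name installed_version is hypothetical. Note that it takes the distribution name used by pip, which for bs4 is beautifulsoup4.

```python
from importlib import metadata

def installed_version(distribution):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None

for name in ("beautifulsoup4", "lxml", "requests", "pandas"):
    print(name, installed_version(name))
```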
