# Getting Molecular Properties through PUG-REST

adapted from https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics

## Objectives

- Learn the basic approach to getting data from PubChem through PUG-REST.
- Retrieve a single property of a single compound.
- Retrieve a single property of multiple compounds.
- Retrieve multiple properties of multiple compounds.
- Write a `for` loop to make the same kind of request repeatedly.
- Process a large amount of data by splitting it into smaller chunks.

## 1. The Shortest Code to Get PubChem Data

Let's suppose that we want to get the molecular formula of water from PubChem through PUG-REST.  You can get this data from a web browser (Chrome, Safari, Internet Explorer, etc.) via the following URL:<br>
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt<br>
Getting the same data using a computer program is not very difficult.  This task can be done with three lines of code.

**Line 1:** First, the "requests" python library is imported.  The "requests" library contains a set of pre-written functions that allows you to access information on the web.

In [24]:
import requests

**Line 2:** Get the desired information using the `get()` function in the requests library.  The PUG-REST request URL (enclosed within a pair of quotes, `''`) is provided within the parentheses.  The result will be stored in a variable called `res`.

In [25]:
res = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt')

**Line 3:** The `res` variable (which means "result" or "response") contains not only the requested data but also some information about the request.  To view the returned data, you need to get the data from `res` and print it out.

In [26]:
print(res.text)

H2O



As another example, the following code retrieves the number of heavy (non-hydrogen) atoms of butadiene.

In [27]:
res = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/butadiene/property/HeavyAtomCount/txt')
print(res.text)

4



Note that in this example, we did not import the `requests` library because it has already been imported (in the very first example for getting the molecular formula of water).
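Note also that `res.text` is always a string, and the blank line printed after each result suggests that it ends with a newline character.  If you want to do arithmetic with a value such as the heavy atom count, it can be converted to a number first.  Below is a minimal sketch of one way to do this; the helper name `to_number` is hypothetical (it is not part of `requests` or PUG-REST), and the input string is typed by hand to mimic the `res.text` value shown above, so no request is sent.

```python
def to_number(text):
    """Convert a PUG-REST 'txt' response holding one value into int or float."""
    value = text.strip()          # drop the trailing newline and any spaces
    try:
        return int(value)         # e.g. a heavy atom count
    except ValueError:
        return float(value)       # e.g. a value with a decimal point


heavy_atoms = to_number("4\n")    # same text as printed for butadiene above
print(heavy_atoms + 1)            # arithmetic now works on the number
```

With a live response you would call `to_number(res.text)`.  A non-numeric response, such as a molecular formula, raises a `ValueError`.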

**Exercise 1a:**  Retrieve the molecular weight of ethanol in a "text" format.

In [28]:
# Write your code in this cell:
res = requests.get('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/ethanol/property/MolecularWeight/txt')

print(res.text)


46.07



**Exercise 1b:** Retrieve the number of hydrogen-bond acceptors of aspirin in a "text" format.

In [29]:
# Write your code in this cell:




## 2. Formulating PUG-REST request URLs using variables

In the previous examples, the PUG-REST request URLs were provided directly to `requests.get()` by explicitly typing the URL within the parentheses.  However, it is also possible to provide the URL using a variable.  The following example shows how to formulate the PUG-REST request URL using variables and pass it to `requests.get()`.

In [30]:
pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
pugin   = "compound/name/water"
pugoper = "property/MolecularFormula"
pugout  = "txt"

url     = pugrest + '/' + pugin + '/' + pugoper + '/' + pugout
print(url)

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt


A PUG-REST request URL encodes three pieces of information (input, operation, output), preceded by the prologue common to all requests.  In the above code cell, these pieces of information are stored in four different variables (`pugrest`, `pugin`, `pugoper`, `pugout`) and combined into a new variable `url`.

One can also generate the same URL using the `join()` function, available for a string.

In [31]:
url = "/".join( [pugrest, pugin, pugoper, pugout] )
print(url)

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/water/property/MolecularFormula/txt


Here, the strings stored in the four variables are joined by the "/" character as a separator.  Note that the four variables are enclosed within square brackets (`[]`), meaning that a list containing them as elements is provided to `join()`.

Then, the URL can be passed to `requests.get()`.

In [32]:
res = requests.get(url)
print(res.text)

H2O
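
The URL assembly shown in this section can also be wrapped in a small function so that the same pattern is reusable.  The sketch below is only an illustration: the function name `build_property_url` and its parameters are hypothetical, and the use of `urllib.parse.quote()` to percent-encode names containing spaces is an added precaution, not something the original cells do.  It only builds and prints URLs; no request is sent.

```python
from urllib.parse import quote

PUGREST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"   # prologue common to all requests


def build_property_url(name, properties, output="txt"):
    """Join prologue, input, operation, and output into one request URL."""
    if isinstance(properties, str):
        properties = [properties]                 # accept one property or a list
    pugin = "compound/name/" + quote(name)       # input: percent-encode the name
    pugoper = "property/" + ",".join(properties)  # operation: comma-separated properties
    return "/".join([PUGREST, pugin, pugoper, output])


print(build_property_url("water", "MolecularFormula"))
print(build_property_url("acetic acid", ["MolecularWeight", "XLogP"], output="csv"))
```

The first call reproduces the URL used above, so its result could be passed to `requests.get()` in the same way.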



## 3. Making multiple requests using a for loop

The approach in the previous section (using variables to construct a request URL) looks inconvenient compared to the three-line code shown at the beginning, where the request URL is provided directly to `requests.get()`.  If you are making only one request, it is simpler to provide the URL directly to `requests.get()` rather than assigning the pieces to variables, constructing the URL from them, and passing it to the function.<br>
However, if you are making a large number of requests, it would be very time consuming to type the respective request URLs for all requests.  In that case, you want to store common parts as variables and use them in a loop.  For example, suppose that you want to retrieve the SMILES strings of 5 chemicals.

In [33]:
names = ['cytosine', 'benzene', 'motrin', 'aspirin', 'zolpidem']

Now the chemical names are stored in a list called `names`.  Using a `for` loop, you can loop over each chemical name, formulating the request URL and retrieving the desired data, as shown below.

In [34]:
pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
pugoper = "property/CanonicalSMILES"
pugout  = "txt"

for myname in names:    # loop over each element in the "names" list
    
    pugin = "compound/name/" + myname
    
    url = "/".join( [pugrest, pugin, pugoper, pugout] )
    res = requests.get(url)
    print(myname, ":", res.text)


cytosine : C1=C(NC(=O)N=C1)N

benzene : C1=CC=CC=C1

motrin : CC(C)CC1=CC=C(C=C1)C(C)C(=O)O

aspirin : CC(=O)OC1=CC=CC=C1C(=O)O

zolpidem : CC1=CC=C(C=C1)C2=C(N3C=C(C=CC3=N2)C)CC(=O)N(C)C



**Warning:** When you make a lot of programmatic access requests using a loop, you should limit your request rate to or below **five requests per second**.  Please read the following document to learn more about PubChem's usage policies:
https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations<br>
**Violation of usage policies** may result in the user being **temporarily blocked** from accessing PubChem (or NCBI) resources.

In the for-loop example above, we have only five input chemical names to process, so it is unlikely to violate the five-requests-per-second limit.  However, if you have thousands of names to process, the above code will exceed the limit (considering that requests of this kind usually finish very quickly).  Therefore, the request rate should be adjusted by using the **`sleep()`** function in the **`time`** module.  For simplicity, let's suppose that you have 12 chemical names to process (in reality, you could have many more).

In [35]:
names = [ 'water', 'benzene', 'methanol', 'ethene', 'ethanol', \
          'propene','1-propanol', '2-propanol', 'butadiene', '1-butanol', \
          '2-butanol', 'tert-butanol']

In [36]:
import time

# pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
# pugoper = "property/CanonicalSMILES"
# pugout  = "txt"

for i in range(len(names)):    # loop over each index (position) in the "names" list
    
    pugin = "compound/name/" + names[i]    # names[i] = the ith element in the names list.
    
    url = "/".join( [pugrest, pugin, pugoper, pugout] )
    res = requests.get(url)
    print(names[i], ":", res.text)
    # print(i % 5)
    if i % 5 == 4:  # % is the modulo operator (remainder of a division); true when i = 4, 9, 14, ...
        print('hello')
        time.sleep(1)

water : O

benzene : C1=CC=CC=C1

methanol : CO

ethene : C=C

ethanol : CCO

hello
propene : CC=C

1-propanol : CCCO

2-propanol : CC(C)O

butadiene : C=CC=C

1-butanol : CCCCO

hello
2-butanol : CCC(C)O

tert-butanol : CC(C)(C)O



There are three things noteworthy in the above example (compared to the previous examples with the five chemical name queries).
- First, the for loop iterates from 0 to [`len(names)` − 1], that is, [0, 1, 2, 3, ..., 11].
- Second, the variable `i` is used (in `names[i]`) to generate the input part (`pugin`) of the PUG-REST request URL.
- Third, the variable `i` is used (in the `if` statement) to pause the program for one second after every five requests.
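
The pause-after-every-five pattern described above can also be separated from the request code.  The following is a rough sketch under stated assumptions: the generator name `throttled` and its parameters are hypothetical, and it uses `enumerate()` instead of `range(len(...))` to get the index together with the element.  In the demonstration, the pause is recorded in a list instead of actually sleeping, so the cell runs instantly and sends no request.

```python
import time


def throttled(items, batch=5, pause=1.0, sleep=time.sleep):
    """Yield items one by one, pausing after every `batch` items."""
    for i, item in enumerate(items):
        yield item
        # This runs when the caller asks for the next item,
        # that is, after the caller has finished processing `item`.
        if i % batch == batch - 1:
            sleep(pause)


pauses = []   # record the pauses instead of really waiting
demo_names = ["water", "benzene", "methanol", "ethene", "ethanol", "propene"]

for name in throttled(demo_names, sleep=pauses.append):
    print(name)   # in real use, build the URL and call requests.get() here

print("pauses:", pauses)
```

Leaving out the `sleep=` argument makes the generator call `time.sleep()` as in the loop above.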

It should be noted that the request volume limit can be lowered through the dynamic traffic control at times of excessive load (https://pubchemdocs.ncbi.nlm.nih.gov/dynamic-request-throttling).  Throttling information is provided in the HTTP header response, indicating the system-load state and the per-user limits.  Based on this throttling information, the user should moderate the speed at which requests are sent to PubChem.  We will cover this topic later in this course.

**Exercise 3a:**  Retrieve the XLogP values of linear alkanes with 1 to 12 carbons.<br>
- Use the chemical names as inputs
- Use a for loop to retrieve the XlogP value for each alkane.
- Use the sleep() function to stop the program for one second for every **five** requests.

In [37]:
# Write your code in this cell:




**Exercise 3b:** Retrieve the **isomeric** SMILES of the 20 common amino acids.
- Use the chemical names as inputs. Because the 20 common amino acids in living organisms predominantly exist as one chiral form (the L-form), the names should be prefixed with **"L-"** (e.g., "L-alanine", rather than "alanine"), except for "glycine" (which does not have a chiral center).
- Use a for loop to retrieve the isomeric SMILES for each amino acid.
- Use the sleep() function to stop the program for one second for every **five** requests.

In [38]:
# Write your code in this cell:

 


## 4. Getting multiple molecular properties

All the examples we have seen in this notebook retrieved a single molecular property for a single compound (although we were able to get a desired property for a group of compounds using a for loop).  However, it is possible to get multiple properties for multiple compounds with a single request.

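The following example retrieves several properties (hydrogen-bond donor count, XLogP, and TPSA) for 5 compounds, represented by PubChem Compound IDs (CIDs), in a comma-separated values (CSV) format.  Note that the request as written lists `HBondDonorCount` twice, so the donor count column appears twice in the output; to get the hydrogen-bond acceptor count as well, the second entry should be `HBondAcceptorCount`.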

In [39]:
pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
pugin   = "compound/cid/4485,4499,5026,5734,8082"
pugoper = "property/HBondDonorCount,HBondDonorCount,XLogP,TPSA"
pugout  = "csv"

url = "/".join([pugrest, pugin, pugoper, pugout])   # Construct the URL
print(url)
print("-" * 30)   # Print "-" 30 times (to print a line for readability)

res = requests.get(url)
print(res.text)

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/4485,4499,5026,5734,8082/property/HBondDonorCount,HBondDonorCount,XLogP,TPSA/csv
------------------------------
"CID","HBondDonorCount","HBondDonorCount","XLogP","TPSA"
4485,1,1,2.200,110.0
4499,1,1,3.300,110.0
5026,1,1,4.300,123.0
5734,1,1,0.2,94.6
8082,1,1,0.800,12.0



In [40]:
res.text.rstrip()

'"CID","HBondDonorCount","HBondDonorCount","XLogP","TPSA"\n4485,1,1,2.200,110.0\n4499,1,1,3.300,110.0\n5026,1,1,4.300,123.0\n5734,1,1,0.2,94.6\n8082,1,1,0.800,12.0'

PubChem has a standard time limit of **30 seconds per request**.  When you try to retrieve too many properties for too many compounds with a single request, it can take longer than the 30-second limit and a time-out error will be returned.  Therefore, you may need to split the compound list into smaller chunks and process one chunk at a time.

In [41]:
cids = [ 443422,  72301,   8082,    4485,    5353740, 5282230, 5282138, 1547484, 941361, 5734,  \
         5494,    5422,    5417,    5290,    5245,    5026,    4746,    4507,    4499,   4497,  \
         4494,    4474,    4418,    4386,    4009,    4008,    3949,    3926,    3878,   3784,  \
         3698,    3547,    3546,    3336,    3333,    3236,    3076,    2585,    2520,   2351,  \
         2312,    2162,    1236,    1234,    292331,  275182,  235244,  108144,  104972, 77157, \
         5942250, 5311217, 4564402, 4715169, 5311501]

In [42]:
chunk_size = 10

if ( len(cids) % chunk_size == 0 ) : # check if total number of cids is divisible by 10 with no remainder
    num_chunks = len(cids) // chunk_size # sets number of chunks
else : # if divide by 10 results in remainder
    num_chunks = len(cids) // chunk_size + 1 # add one more chunk

print("# Number of CIDs:", len(cids) )
print("# Number of chunks:", num_chunks )

# Number of CIDs: 55
# Number of chunks: 6
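
The chunk count computed above can be cross-checked with ceiling division, and the slicing itself can be written as a small generator.  This is only a sketch: the name `chunked` is hypothetical, and `demo_ids` is a list of placeholder integers, not real CIDs.

```python
import math


def chunked(items, size):
    """Yield successive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


demo_ids = list(range(1, 56))            # 55 placeholder IDs, not real CIDs
pieces = list(chunked(demo_ids, 10))

print(len(pieces))                       # number of chunks
print(math.ceil(len(demo_ids) / 10))     # same count via ceiling division
print(len(pieces[-1]))                   # size of the final, shorter chunk
```

Slicing past the end of a list is safe in Python, which is why the last chunk needs no special handling.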


In [None]:
pugrest = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
pugoper = "property/HBondDonorCount,HBondAcceptorCount,XLogP,TPSA"
pugout  = "csv"

csv = ""   # initialize an empty string to collect the comma-separated output

for i in range(num_chunks) : # sets number of requests to number of data chunks as determined above
    
    idx1 = chunk_size * i        # start index of the current chunk in the cids list
    idx2 = chunk_size * (i + 1)  # end index (exclusive) of the current chunk

    pugin = "compound/cid/" + ",".join([ str(x) for x in cids[idx1:idx2] ]) # build pug input for chunks of data
    url = "/".join( [pugrest, pugin, pugoper, pugout] )   # Construct the URL
    
    res = requests.get(url)

    if i == 0: # if this is the first request, store result in empty csv variable
        csv = res.text 
    else :          # if this is a subsequent request, add the request to the csv variable adding a new line between chunks
        csv = csv + "\n".join(res.text.split()[1:]) + "\n" 
    
    if i % 5 == 4:  
        time.sleep(1)

print(csv)

In [21]:
print(type(csv))

<class 'str'>
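
Because the merged result is a plain string, it usually needs to be parsed before the values can be used.  The sketch below shows one possible way with Python's standard `csv` module, together with a line-based merge that keeps the header row only once.  The function name `merge_csv_chunks` is hypothetical, and the two chunk strings are typed by hand from values that appear in the output of the earlier five-compound example, so no request is sent.  Note that the notebook cell above stores its result in a variable named `csv`, which would hide the `csv` module if both were used in the same session; the sketch therefore uses the name `merged`.

```python
import csv
import io


def merge_csv_chunks(chunk_texts):
    """Join CSV texts that share a header row, keeping the header only once."""
    merged_lines = []
    for i, text in enumerate(chunk_texts):
        lines = text.strip().splitlines()              # split on line breaks only
        merged_lines.extend(lines if i == 0 else lines[1:])
    return "\n".join(merged_lines) + "\n"


# Hand-typed stand-ins for two res.text values
chunk_a = '"CID","XLogP","TPSA"\n4485,2.200,110.0\n4499,3.300,110.0\n'
chunk_b = '"CID","XLogP","TPSA"\n8082,0.800,12.0\n'

merged = merge_csv_chunks([chunk_a, chunk_b])
rows = list(csv.DictReader(io.StringIO(merged)))       # one dict per compound

for row in rows:
    print(row["CID"], float(row["XLogP"]), float(row["TPSA"]))
```

Splitting on line breaks rather than on any whitespace keeps a row intact even if a field contains a space.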


**Exercise 4a:** Below is the list of CIDs of known anti-inflammatory agents (obtained from PubChem via the URL: https://www.ncbi.nlm.nih.gov/pccompound?LinkName=mesh_pccompound&from_uid=68000893).  Use GPT to write a script that downloads the following properties of those compounds in a comma-separated format: heavy atom count, rotatable bond count, molecular weight, XLogP, hydrogen bond donor count, hydrogen bond acceptor count, TPSA, and isomeric SMILES.

- Split the input CID list into small chunks (with a chunk size of 100 CIDs).
- Process one chunk at a time using a for loop.
- Do not forget to add sleep() to comply with the usage policy.

What problems do you encounter in using GPT and how can you solve these?

In [22]:
# Write your code in this cell.


