# Accessing PubChem from Python

This code illustrates a basic PubChem request in Python and how to fetch its data. It starts from a substructure search for all compounds whose structure contains that of acetylsalicylic acid (CID 2244), expressed as the following request:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsubstructure/cid/2244/cids/XML
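
The URL above follows a fixed pattern, so it can be assembled from its parts. The following is a minimal sketch; the helper name `substructure_cids_url` and the constant `PUG_REST` are hypothetical names introduced here for illustration, not part of any library.

```python
from urllib.parse import quote

# Base address of the PubChem PUG REST service, taken from the URL above
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def substructure_cids_url(cid, fmt="XML"):
    """Build the fast substructure search URL for a compound id.

    Hypothetical helper: it only concatenates the path segments
    seen in the request above and performs no network access.
    """
    return f"{PUG_REST}/compound/fastsubstructure/cid/{quote(str(cid))}/cids/{fmt}"


print(substructure_cids_url(2244))
```

Calling the helper with `2244` should reproduce the request shown above, which makes it easy to repeat the search for another compound id.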

It then iterates over the returned compounds and prints the ids of the first 20:

In [14]:
import io
import requests
import xml.etree.ElementTree as et
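# Substructure search: compounds whose structure contains that of CID 2244
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsubstructure/cid/2244/cids/XML"
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
data = response.content
tree = et.parse(io.StringIO(data.decode("utf-8")))
pc = tree.getroot()
# Each child of the root element holds one compound id (CID)
cont = 0
for child in pc:
    cont = cont + 1
    if cont <= 20:
        print(child.text)
print(cont, ' records')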

2244
137329
91626
9905405
9871508
24666
24847961
16099592
9938610
71586929
56841578
44153517
24936226
24847798
11980079
145904
83966
56841602
54681542
24847819
612  records
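
The parsing step can also be tried without network access. The sketch below runs the same logic on a small hand-written document; the sample only mimics the assumed shape of the response (a root element with one `CID` child per compound), and the function name `first_ids` is hypothetical.

```python
import xml.etree.ElementTree as et

# Hand-written sample that mimics the assumed shape of a CID list response.
# The three ids are the first three printed in the output above.
SAMPLE = b"""<?xml version="1.0"?>
<IdentifierList xmlns="http://pubchem.ncbi.nlm.nih.gov/pug_rest">
  <CID>2244</CID>
  <CID>137329</CID>
  <CID>91626</CID>
</IdentifierList>"""


def first_ids(xml_bytes, limit=20):
    """Return the first `limit` ids and the total number of ids."""
    root = et.fromstring(xml_bytes)
    # Iterating over the root yields its child elements, one per id
    ids = [int(child.text) for child in root]
    return ids[:limit], len(ids)


ids, total = first_ids(SAMPLE, limit=2)
print(ids, total, "records")
```

Because the loop reads the children of the root directly, the namespace declared on the root element does not need to be handled here; it would matter only when looking up elements by tag name.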


# Enriching XML with Python

This notebook starts from the list of all PubChem substances that have a cross-reference to ChEBI, described here:

https://github.com/santanche/lab2learn/blob/master/xml/lab04-xquery-drom-pubchem.md

This code issues one REST request per id to retrieve the synonyms of each substance from PubChem. It illustrates how to use Python to enrich XML resources.
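
The enrichment amounts to merging many small XML responses into one document: each fragment loses its XML declaration and the attributes of its root element, and all fragments are wrapped in a single new root. The sketch below shows that merge on two hand-written fragments. The fragment contents (SIDs 1 and 2 and their synonyms) are made up for illustration and only mimic the assumed shape of a synonyms response; `merge_fragments` is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as et

# Hand-written fragments with made-up ids and names (an assumption about
# the response shape, not real PubChem data)
FRAGMENTS = [
    '<?xml version="1.0"?>\n'
    '<InformationList xmlns="http://pubchem.ncbi.nlm.nih.gov/pug_rest">'
    '<Information><SID>1</SID><Synonym>aspirin</Synonym></Information>'
    '</InformationList>',
    '<?xml version="1.0"?>\n'
    '<InformationList xmlns="http://pubchem.ncbi.nlm.nih.gov/pug_rest">'
    '<Information><SID>2</SID><Synonym>caffeine</Synonym></Information>'
    '</InformationList>',
]


def merge_fragments(fragments, root_tag="PC-DataSet"):
    """Wrap several XML fragments in a single root element."""
    parts = [f"<{root_tag}>"]
    for text in fragments:
        # Drop the XML declaration: it is only allowed at the very start
        text = re.sub(r"<\?xml[^>]*\?>", "", text)
        # Strip the attributes (including namespaces) of the fragment root
        text = re.sub(r"<InformationList[^>]*>", "<InformationList>", text)
        parts.append(text.strip())
    parts.append(f"</{root_tag}>")
    return "".join(parts)


merged = merge_fragments(FRAGMENTS)
root = et.fromstring(merged)  # parsing succeeds only if the result is well-formed
print(len(root.findall("InformationList")), "fragments merged")
print([s.text for s in root.iter("Synonym")])
```

Removing the namespace attributes is a simplification: it lets plain tag names such as `Synonym` be used in later queries, at the cost of discarding the namespace information.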

In [None]:
import io
import requests
import xml.etree.ElementTree as et
import re
# url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sourceall/ChEBI/xrefs/RegistryID/XML"
url = "https://raw.githubusercontent.com/santanche/lab2learn/master/api/pubchem/pubchem-dron-join.xml"
data = requests.get(url).content
tree = et.parse(io.StringIO(data.decode("utf-8")))
pc = tree.getroot()
cont = 0
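# The with-block closes the output file even if a request fails midway
with open("pubchem-chebi-synonyms.xml", "w", encoding="utf-8") as f:
    f.write('<PC-DataSet>')
    for regid in pc.iter('SID'):
        cont = cont + 1
        print(regid.text)
        subst = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/' + regid.text + '/synonyms/XML'
        datas = requests.get(subst).content
        datastr = datas.decode("utf-8")
        # Drop the XML declaration so the fragment can be nested in one document
        datastr = datastr.replace('<?xml version="1.0"?>', '')
        # Strip the attributes of the fragment's root element. No flags are needed;
        # the fourth positional argument of re.sub is count, not flags.
        datastr = re.sub(r'<InformationList[^>]*>', '<InformationList>', datastr)
        f.write(datastr)
    f.write('</PC-DataSet>')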
print(cont, ' records')

10318855
10318863
10318864
10318874
10318895
111978170
11533154
11533208
11533222
11533325
11533347
11533358
11533496
11533499
11533784
11533933
11534102
11534105
124403616
124403618
124403681
124403703
14717642
14717661
14717665
14717772
14717784
14718342
14718462
160644656
160962750
163425655
171571835
17425133
17425146
17425376
17425442
17425478
17425507
223438296
223438340
223438417
223438428
223438430
223438431
223438432
223438434
223438436
223438453
223438458
223438482
223438483
223438485
223438492
223439750
24398251
24434790
24434920
24712284
255509821
26697085
26697092
26697116
26697206
26697284
26697306
26697359
26697417
26697544
26744180
26744226
29214789
29214791
29214813
29214861
329554132
329554134
340096730
340096731
340096735
341102794
355203982
374393741
374393743
375561408
375561414
405081458
405081460
405081461
46530514
46530623
49658626
49658718
49658851
49658919
49693580
49742702
49836633
49836727
50139237
50139240
50139262
50139266
50139270
53801116
53801152
538011