# Reading HTML Documents

This section presents a modified version of the example on the [reading documents page](http://chemdataextractor.org/docs/reading) of the ChemDataExtractor (CDE) documentation. The HTML document used in this example is available at:
[Silveri, F., Quesne, M. G., Roldan, A., De Leeuw, N. H., & Catlow, C. R. A. (2019). Hydrogen adsorption on transition metal carbides: a DFT study. Physical Chemistry Chemical Physics, 21(10), 5335-5343](https://pubs.rsc.org/ko/content/articlehtml/2019/cp/c8cp05975f). 


In [1]:
# Import the Document class from the CDE library
from chemdataextractor import Document

# Library for making HTTP requests (reading online pages)
import requests

# The HTML document used in this example is the online version of:
# Silveri, F., Quesne, M. G., Roldan, A., De Leeuw, N. H., & Catlow, C. R. A. (2019). 
# Hydrogen adsorption on transition metal carbides: a DFT study. Physical Chemistry 
# Chemical Physics, 21(10), 5335-5343.

# The URL of the HTML version of the article is used directly
article_url = "https://pubs.rsc.org/ko/content/articlehtml/2019/cp/c8cp05975f"

# Set a browser-like User-Agent request header
req_head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
# Get the page content
html_response = requests.get(article_url, headers=req_head)

# Save the content as a temporary file on the local disk.
# Note: str() applied to bytes writes the escaped b'...' representation,
# which is why sequences such as \\n and \\xc2\\xa0 appear literally in the
# outputs that follow. Writing html_response.content to a file opened in
# "wb" mode would store the page bytes unchanged.
with open("temp.html", "w+") as f:
    f.write(str(html_response.content))

# Open the temporary file in binary mode and create a Document object from it
with open("temp.html", "rb") as f:
    doc = Document.from_file(f)
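
The cell above saves str(html_response.content), which is the printable representation of a bytes object rather than the page itself. The following minimal sketch contrasts that with writing the bytes directly. It uses a short made-up byte string (sample_content) in place of the downloaded page, and the helper names and temporary file names are illustrative assumptions, not part of CDE or requests.

```python
import os
import tempfile

# Stand-in for html_response.content (requests exposes the body as bytes).
sample_content = "<p>C\u2013H bond, 9 \u00d7 10 \u00c5</p>\n".encode("utf-8")


def save_as_text_repr(content: bytes, path: str) -> None:
    """Mimic the cell above: write str(bytes), i.e. the b'...' representation."""
    with open(path, "w") as f:
        f.write(str(content))


def save_as_bytes(content: bytes, path: str) -> None:
    """Write the bytes unchanged by opening the file in binary mode."""
    with open(path, "wb") as f:
        f.write(content)


tmp_dir = tempfile.mkdtemp()
repr_path = os.path.join(tmp_dir, "repr.html")
raw_path = os.path.join(tmp_dir, "raw.html")
save_as_text_repr(sample_content, repr_path)
save_as_bytes(sample_content, raw_path)

with open(repr_path, "rb") as f:
    repr_bytes = f.read()
with open(raw_path, "rb") as f:
    raw_bytes = f.read()

print(repr_bytes)                  # starts with b"b'" and contains escaped \x.. sequences
print(raw_bytes.decode("utf-8"))   # the original text, with the dash and symbols intact
```

With the real page, the second approach would correspond to opening temp.html in "wb" mode and writing html_response.content; the outputs recorded in this section were produced with the first approach.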

The document is made up of elements such as titles and paragraphs (some paragraphs are short because they contain only a title).

The elements of a document are stored in a list, so ordinary list operations can be used to explore them.

In [2]:
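# Number of elements in the document (a notebook cell displays only its last expression)
len(doc.elements)
# The elements list itself, which is what the output below shows
doc.elements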

[Paragraph(id='wrapper', references=[], text=' Open Access Article'),
 Paragraph(id='wrapper', references=[], text='\\nThis Open Access Article is licensed under a '),
 Paragraph(id='wrapper', references=[], text='Creative Commons Attribution 3.0 Unported Licence'),
 Paragraph(id='wrapper', references=[], text=' DOI:\\xc2\\xa010.1039/C8CP05975F\\n(Paper)\\nPhys. Chem. Chem. Phys., 2019, 21, 5335-5343'),
 Title(id='sect187', references=['fn1'], text='Hydrogen adsorption on transition metal carbides: a DFT study'),
 Paragraph(id='wrapper', references=[], text='\\n \\n \\n \\n Fabrizio \\n Silveri\\n \\n \\n *, \\n \\n \\n \\n Matthew G. \\n Quesne\\n \\n \\n , \\n \\n \\n \\n Alberto \\n Roldan\\n \\n \\n , \\n \\n \\n \\n Nora H. \\n de Leeuw\\n \\n \\n  and \\n \\n \\n \\n C. Richard A. \\n Catlow\\n \\n \\n \\n '),
 Paragraph(id='wrapper', references=[], text='School of Chemistry, Cardiff University, Main Building, Park Place, Cardiff CF10 3AT, UK. E-mail: SilveriF@Cardiff.ac.uk; Catl
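
As a rough illustration of such list operations, the sketch below counts elements by class, locates titles, and takes a slice. The Paragraph and Title classes defined here are hypothetical stand-ins so that the snippet runs without CDE installed; the helper function names are also assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical stand-ins for the CDE element classes, used only so that this
# sketch runs on its own. With CDE, pass doc.elements to the helpers instead.
@dataclass
class Paragraph:
    text: str


@dataclass
class Title:
    text: str


elements = [
    Paragraph(" Open Access Article"),
    Paragraph("Creative Commons Attribution 3.0 Unported Licence"),
    Title("Hydrogen adsorption on transition metal carbides: a DFT study"),
]


def count_element_types(elements):
    """Count how many elements of each class the list contains."""
    return Counter(type(el).__name__ for el in elements)


def find_elements(elements, class_name):
    """Return (index, element) pairs whose class name equals class_name."""
    return [(i, el) for i, el in enumerate(elements)
            if type(el).__name__ == class_name]


type_counts = count_element_types(elements)
titles = find_elements(elements, "Title")
first_two = elements[:2]  # slicing works as on any list

print(type_counts)
print(titles)
```

Looking up elements by class name in this way avoids having to know in advance at which index the title sits.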

An element of the document can be accessed by its index, as follows.

In [3]:
# access an element on the list 
# the fifth element is the article title
para = doc.elements[4]
para

Each element can be explored further, as it is divided into sentences and tokens (tokens rather than words, because numbers, symbols, and punctuation characters also count as tokens).

In [4]:
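print(para)
# Number of sentences in the element
print("Sentences:", len(para.sentences))
# para.tokens holds one list of tokens per sentence
print("Tokens:", para.tokens)
# Number of token lists (one per sentence)
print("Tokens:", len(para.tokens))
# Number of tokens in the first sentence
print("Tokens:", len(para.tokens[0]))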

Hydrogen adsorption on transition metal carbides: a DFT study
Sentences: 1
Tokens: [[Token('Hydrogen', 0, 8), Token('adsorption', 9, 19), Token('on', 20, 22), Token('transition', 23, 33), Token('metal', 34, 39), Token('carbides', 40, 48), Token(':', 48, 49), Token('a', 50, 51), Token('DFT', 52, 55), Token('study', 56, 61)]]
Tokens: 1
Tokens: 10
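
The two numbers in each Token are character offsets into the text. The sketch below illustrates the idea with a deliberately simple regular-expression tokenizer, which is an assumption made for this example and not the tokenizer that CDE uses.

```python
import re

title = "Hydrogen adsorption on transition metal carbides: a DFT study"


def simple_tokens(text):
    """Split text into word and punctuation tokens with character offsets.

    A simple regex-based illustration, not CDE's own tokenizer.
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


tokens = simple_tokens(title)

# Each (start, end) pair indexes back into the original string.
for token_text, start, end in tokens:
    assert title[start:end] == token_text

print(len(tokens))   # 10
print(tokens[6])     # (':', 48, 49)
```

For this title the simple tokenizer happens to give the same ten tokens and offsets as the output above, with the colon counted as a token of its own.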


CDE also creates a special list of chemical entity mentions (elements, compounds, types, etc.), which can be accessed as the cems list. The numbers next to each entity name are the start and end character positions of the mention within the text of the element that contains it.

In [5]:
doc.cems

[Span('hydrogens', 726, 735),
 Span('hydrogens', 374, 383),
 Span('carbon', 463, 469),
 Span('carbon', 111, 117),
 Span('C8CP05975F\\n(Paper)\\nPhys', 21, 46),
 Span('transition metal carbides', 23, 48),
 Span('hydro-carbon', 249, 261),
 Span('carbide', 347, 354),
 Span('TiC', 480, 483),
 Span('ZrC', 461, 464),
 Span('TiC', 284, 287),
 Span('hydrogen', 247, 255),
 Span('hydrogen', 691, 699),
 Span('hydrogen', 375, 383),
 Span('TiC', 265, 268),
 Span('hydrogen', 103, 111),
 Span('hydrogens', 160, 169),
 Span('Pt', 486, 488),
 Span('TiC', 638, 641),
 Span('VC(001)', 1043, 1050),
 Span('hydrogens', 929, 938),
 Span('transition metal carbides', 74, 99),
 Span('carbide', 143, 150),
 Span('ZrC', 893, 896),
 Span('hydrogen', 14, 22),
 Span('hydrogen', 846, 854),
 Span('VC', 0, 2),
 Span('hydrogens', 1389, 1398),
 Span('hydrogen', 526, 534),
 Span('VC', 108, 110),
 Span('H', 309, 310),
 Span('C\\xe2\\x80\\x93H', 682, 696),
 Span('TiC', 103, 106),
 Span('carbide', 394, 401),
 Span('H', 603, 604
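
Because the same entity appears many times, it can help to count the mentions. The sketch below does this with a few (text, start, end) entries copied from the output above, stored as plain tuples so that it runs without CDE. With CDE, the text of each Span would be used instead (assumed here to be available as span.text).

```python
from collections import Counter

# A few (text, start, end) entries copied from the cems output above.
spans = [
    ("hydrogens", 726, 735),
    ("carbon", 463, 469),
    ("transition metal carbides", 23, 48),
    ("TiC", 480, 483),
    ("ZrC", 461, 464),
    ("TiC", 284, 287),
    ("hydrogen", 247, 255),
    ("Pt", 486, 488),
]

# Count how often each entity text occurs and list the distinct names.
mention_counts = Counter(text for text, _, _ in spans)
unique_names = sorted(mention_counts)

# The offsets are relative to the element containing the mention.
# The span (23, 48) points into the article title:
title = "Hydrogen adsorption on transition metal carbides: a DFT study"
title_mention = title[23:48]

print(mention_counts.most_common(3))
print(unique_names)
print(title_mention)
```

This counting treats 'hydrogen' and 'hydrogens' as different names; merging such variants is a separate step.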

The cems list contains every occurrence of each chemical entity, so it has repetitions. CDE provides the serialize operation on the document records to summarise the entities, merging abbreviations with full names and capitalised with uncapitalised occurrences.

In [6]:
# Serialise the document records into a list of dictionaries
serialised = doc.records.serialize()
len(serialised)

47

In [7]:
serialised

[{'names': ['C8CP05975F\\n(Paper)\\nPhys']},
 {'names': ['SilveriF']},
 {'names': ['H2O']},
 {'names': ['H2.\\n'], 'roles': ['product']},
 {'names': ['Monocarbides']},
 {'names': ['hydrocarbons']},
 {'names': ['methane']},
 {'names': ['platinum']},
 {'names': ['TiC, ZrC']},
 {'names': ['carbon-']},
 {'names': ['Perdew\\xe2\\x80\\x93Burke\\xe2\\x80\\x93Ernzerhof', 'PBE']},
 {'names': ['Eslab+nH']},
 {'names': ['9 \\xc3\\x97 10 \\xc3\\x97 11 \\xc3\\x85']},
 {'names': ['Hydrogen adsorption']},
 {'names': ['Titanium']},
 {'names': ['hydro-carbon']},
 {'names': ['TiC(011)']},
 {'names': ['ZrC(001)']},
 {'names': ['TiC(001)']},
 {'names': ['VC(001)']},
 {'names': ['(111)-M']},
 {'names': ['Pt']},
 {'names': ['transition metals']},
 {'names': ['Ti']},
 {'names': ['C\\xe2\\x80\\x93H']},
 {'names': ['(111)-C']},
 {'names': ['N / D']},
 {'names': ['ZrC(111)-M']},
 {'names': ['\\n nH']},
 {'names': ['\\n \\xce\\xb8\\n']},
 {'names': ['TiC, VC']},
 {'names': ['CO2']},
 {'names': ['\\xce\\x94Eads']
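
The serialised result is a plain list of dictionaries, so it can be processed with ordinary Python. The sketch below uses a few records copied from the output above (as a hard-coded sample, so it runs without CDE) to flatten the names and to select records that carry a given key. The helper function names are assumptions made for this example.

```python
# A few records copied from the serialised output above.
records = [
    {"names": ["H2O"]},
    {"names": ["H2.\\n"], "roles": ["product"]},
    {"names": ["methane"]},
    {"names": ["platinum"]},
    {"names": ["Perdew\\xe2\\x80\\x93Burke\\xe2\\x80\\x93Ernzerhof", "PBE"]},
    {"names": ["TiC(001)"]},
]


def all_names(records):
    """Flatten the 'names' lists of all records into a single list."""
    return [name for record in records for name in record.get("names", [])]


def records_with_key(records, key):
    """Return the records that contain the given key, e.g. 'roles'."""
    return [record for record in records if key in record]


names = all_names(records)
with_roles = records_with_key(records, "roles")
multi_name = [record for record in records if len(record["names"]) > 1]

print(names)
print(with_roles)
print(multi_name)
```

Records with more than one name, such as the one ending in 'PBE', are cases where an abbreviation has been merged with its full name.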