# Reading PDF Documents

This section presents an example adapted from the [reading documents page](http://chemdataextractor.org/docs/reading) of the ChemDataExtractor (CDE) documentation. The file used in this example was obtained from:
[Silveri, F., Quesne, M. G., Roldan, A., De Leeuw, N. H., & Catlow, C. R. A. (2019). Hydrogen adsorption on transition metal carbides: a DFT study. Physical Chemistry Chemical Physics, 21(10), 5335-5343](https://pubs.rsc.org/ko/content/articlehtml/2019/cp/c8cp05975f). 


In [1]:
# Import the Document class from the CDE library
from chemdataextractor import Document

# define the input file (put it in the same directory as the source files or add 
# the complete path to the file, e.g.
# 'c:/users/user1/documents/file.pdf')

# The file used in this example was obtained from:
# Silveri, F., Quesne, M. G., Roldan, A., De Leeuw, N. H., & Catlow, C. R. A. (2019). 
# Hydrogen adsorption on transition metal carbides: a DFT study. Physical Chemistry 
# Chemical Physics, 21(10), 5335-5343.
filepath = "pdfs/c8cp05975f.pdf"

# open the file in read-binary mode and create a document object from it;
# the with block closes the file once the document has been read
with open(filepath, 'rb') as f:
    doc = Document.from_file(f)

The document is made up of paragraphs, and each paragraph is an element (some are short because they contain only a title).

The elements of a document are stored in a list, so list operations can be used to explore them.

In [2]:
# see the length of the elements list
len(doc.elements)

312
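
Because the elements are held in an ordinary list, standard-library tools are enough for a first overview. The sketch below is only an illustration: `count_element_types` and `preview` are hypothetical helper names, not part of CDE, and they rely only on generic Python behaviour (class names and `str()`), so they accept any list.

```python
from collections import Counter


def count_element_types(elements):
    """Return a Counter mapping each element's class name to its frequency."""
    return Counter(type(element).__name__ for element in elements)


def preview(elements, start=0, stop=5, width=60):
    """Return (index, class name, shortened text) for a slice of the list."""
    rows = []
    for index, element in enumerate(elements[start:stop], start=start):
        # Collapse line breaks so each preview fits on one line.
        text = str(element).replace("\n", " ")
        rows.append((index, type(element).__name__, text[:width]))
    return rows
```

With the document loaded as above, `count_element_types(doc.elements)` would summarise which kinds of element the list contains, and `preview(doc.elements, 10, 20)` would help locate a paragraph of interest before indexing into it.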

A paragraph in the document can be accessed using an index, as follows.

In [3]:
# access an element in the list
para = doc.elements[14]
para

Each element can be explored further, as it is divided into sentences and tokens (not words, because numbers, symbols, and punctuation characters are also considered tokens).

In [4]:
print(para)
print("Sentences:", len(para.sentences))
print("Tokens:", para.tokens)
# para.tokens holds one list of tokens per sentence
print("Token lists (one per sentence):", len(para.tokens))
print("Tokens in first sentence:", len(para.tokens[0]))

Transition metal carbides are a class of materials widely known for both their interesting physical
properties and catalytic activity. In this work, we have used plane-wave DFT methods to study the
Sentences: 2
Tokens: [[Token('Transition', 0, 10), Token('metal', 11, 16), Token('carbides', 17, 25), Token('are', 26, 29), Token('a', 30, 31), Token('class', 32, 37), Token('of', 38, 40), Token('materials', 41, 50), Token('widely', 51, 57), Token('known', 58, 63), Token('for', 64, 67), Token('both', 68, 72), Token('their', 73, 78), Token('interesting', 79, 90), Token('physical', 91, 99), Token('properties', 100, 110), Token('and', 111, 114), Token('catalytic', 115, 124), Token('activity', 125, 133), Token('.', 133, 134)], [Token('In', 135, 137), Token('this', 138, 142), Token('work', 143, 147), Token(',', 147, 148), Token('we', 149, 151), Token('have', 152, 156), Token('used', 157, 161), Token('plane', 162, 167), Token('-', 167, 168), Token('wave', 168, 172), Token('DFT', 173, 176), Token('
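
The token output shows two things worth spelling out: the tokens are nested (one inner list per sentence), and each token carries start and end character offsets in which the end is exclusive, as in a Python slice. The following is a minimal sketch using plain strings and tuples rather than CDE objects; `flatten` and `slice_by_offsets` are hypothetical helper names, and the nested list is a hand-made stand-in for `para.tokens`.

```python
from itertools import chain


def flatten(token_lists):
    """Join per-sentence token lists into one flat list."""
    return list(chain.from_iterable(token_lists))


def slice_by_offsets(text, spans):
    """Return the substring for each (start, end) pair of character offsets."""
    return [text[start:end] for start, end in spans]


# Text and offsets copied from the first tokens in the output above.
text = "Transition metal carbides are a class of materials"
offsets = [(0, 10), (11, 16), (17, 25)]
words = slice_by_offsets(text, offsets)

# Hand-made stand-in for para.tokens: one inner list per sentence.
nested = [["Transition", "metal", "carbides"], ["In", "this", "work", ","]]
flat = flatten(nested)
print(words)
print(len(flat))
```

Applied to the real paragraph, `len(flatten(para.tokens))` would give the total number of tokens across all of its sentences.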

CDE also creates a special list of chemical entities (elements, compounds, types, etc.), which can be accessed through the cems list. The numbers next to each entity name give the position of the entity in the text as start and end character offsets.

In [5]:
doc.cems

[Span('(111)-C', 92, 99),
 Span('H', 303, 304),
 Span('hydrogen', 113, 121),
 Span('hydrogen', 328, 336),
 Span('TMCs', 66, 70),
 Span('hydrogen', 1620, 1628),
 Span('hydrogen', 68, 76),
 Span('hydrogen', 84, 92),
 Span('transition metals', 237, 254),
 Span('ZrC', 0, 3),
 Span('TiC(011)', 407, 415),
 Span('hydrogen', 458, 466),
 Span('hydrogen', 682, 690),
 Span('hydrogen', 1258, 1266),
 Span('carbon', 10, 16),
 Span('transition metal carbides', 178, 203),
 Span('O', 32, 33),
 Span('Ti', 631, 633),
 Span('hydrogen', 815, 823),
 Span('hydrogens', 414, 423),
 Span('hydrogen', 111, 119),
 Span('H', 14, 15),
 Span('hydrogen', 47, 55),
 Span('carbides', 129, 137),
 Span('carbon', 219, 225),
 Span('ZrC', 174, 177),
 Span('hydrogen', 149, 157),
 Span('TiC', 30, 33),
 Span('transition metals', 481, 498),
 Span('p\np0', 0, 4),
 Span('TiC', 1585, 1588),
 Span('carbon', 777, 783),
 Span('Hydrogen', 36, 44),
 Span('nH', 12, 14),
 Span('H2', 111, 113),
 Span('hydrogen', 901, 909),
 Span('carbon', 2

The cems list contains every occurrence of each chemical entity, so there are repetitions. CDE provides the serialize operation for summarising all entities, joining together abbreviations and capitalised or uncapitalised occurrences.

In [6]:
serialised = doc.records.serialize()
len(serialised)

54
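
To gauge how repetitive the cems list is, the mention strings can be tallied with the standard library. This is a rough illustration only and not CDE's own merging logic: `tally_mentions` is a hypothetical helper, and its case folding is a crude simplification that would wrongly merge symbols such as "Co" and "CO".

```python
from collections import Counter


def tally_mentions(mention_texts, ignore_case=True):
    """Count mention strings, optionally ignoring upper/lower case."""
    if ignore_case:
        # Crude normalisation: unsafe for case-sensitive chemical symbols.
        mention_texts = (text.casefold() for text in mention_texts)
    return Counter(mention_texts)


# Mention strings copied from the first twelve entries of the doc.cems output.
sample = [
    "(111)-C", "H", "hydrogen", "hydrogen", "TMCs", "hydrogen",
    "hydrogen", "hydrogen", "transition metals", "ZrC", "TiC(011)",
    "hydrogen",
]
counts = tally_mentions(sample)
print(counts.most_common(1))
```

For the full list, the strings could be collected first, for example with `[span.text for span in doc.cems]`, assuming each span exposes its string as a `text` attribute.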

In [7]:
serialised

[{'names': ['ISSN 1463-9076']},
 {'names': ['H2O']},
 {'names': ['Pt']},
 {'labels': ['H2'], 'roles': ['product']},
 {'names': ['transition metals']},
 {'names': ['SilveriF']},
 {'names': ['hydrocarbons']},
 {'names': ['methane']},
 {'names': ['platinum']},
 {'names': ['TiC, ZrC']},
 {'names': ['carbon-']},
 {'names': ['Eads ¼']},
 {'names': ['EslabþnH']},
 {'names': ['Eslab+nH']},
 {'names': ['Fslab e']},
 {'names': ['n A']},
 {'names': ['p p0']},
 {'names': ['Hydrogen adsorption']},
 {'names': ['Titanium']},
 {'names': ['hydro-carbon']},
 {'names': ['4 1 2 1 2']},
 {'names': ['ZrC(001)']},
 {'names': ['TiC(001)']},
 {'names': ['VC(001)']},
 {'names': ['TiC(011)']},
 {'names': ['(111)-M']},
 {'names': ['J. Ha¨glund']},
 {'names': ['O']},
 {'names': ['Nitrides']},
 {'names': ['n', 'N']},
 {'names': ['Ti']},
 {'names': ['C–H']},
 {'names': ['(111)-C']},
 {'names': ['ZrC(111)-M']},
 {'names': ['nH']},
 {'names': ['8 1 4 1 2 1']},
 {'names': ['TiC, VC']},
 {'names': ['S']},
 {'names': ['C