# Reading IR Data from PDF Documents

This section presents a modified version of an example from the [reading documents page](http://chemdataextractor.org/docs/reading) of the ChemDataExtractor (CDE) documentation.
The example uses the IR parser to extract IR data from documents, when such data is present.

The first part replicates the results of the test cases from the [test pages](https://github.com/CambridgeMolecularEngineering/chemdataextractor/blob/master/tests/test_parse_ir.py).

In [1]:
# libraries for accessing sentences and for parsing IR data
from chemdataextractor.doc.text import Sentence
from chemdataextractor.parse.ir import ir, IrParser

def ir_parse_print(sentence_text):
    # print the raw sentence, then every IR record the parser finds in it
    print("Input text:\n", sentence_text)
    s = Sentence(sentence_text)
    for c in IrParser().parse(s.tagged_tokens):
        print("Parsed IR Data:\n ", c.serialize())

#test 1
text = 'IR (ATR): ṽ [cm−1] 3024 (w), 2980 (w), 2918 (w), 1601 (w), 1485 (m), 1460 (m), 1438 (w), 1358 (w), ' \
    '1290 (w), 1188 (w), 1115 (w), 1002 (m), 954 (m), 912 (w), 853 (m), 814 (s), 793 (s), 762 (s), 739 (m), ' \
    '687 (s), 671 (m).'
ir_parse_print(text)
    
#test 2    
text = 'IR (KBr/cm–1): 4321, 2222, 1734, 1300, 1049, 777, 620.'
ir_parse_print(text)

#test 3
text = 'FTIR (KBr): ν/cm‒1 3315, 3002, 1630, 1593 (νCH=N), 1251;'
ir_parse_print(text)

#test 4
text = 'IR-ATR:  3380, 3190,  2973, 2873, 1669, 1646, 1602, 1495, 1178, 828 cm-1.'
ir_parse_print(text)

# the type of sentences in the article with band data
from_article = "Nicotine: FT-IR (KBr, cm-1): ν = 3025, 2970, 2870, 1691, 1677, 904, 717."
ir_parse_print(from_article)
from_article = "[MBNT]Cl: FT-IR (KBr, cm-1): ν =3412, 3043, 2958, 1632, 1610, 1451, 1197, 916, 703, 472"
ir_parse_print(from_article)

Input text:
 IR (ATR): ṽ [cm−1] 3024 (w), 2980 (w), 2918 (w), 1601 (w), 1485 (m), 1460 (m), 1438 (w), 1358 (w), 1290 (w), 1188 (w), 1115 (w), 1002 (m), 954 (m), 912 (w), 853 (m), 814 (s), 793 (s), 762 (s), 739 (m), 687 (s), 671 (m).
Parsed IR Data:
  {'ir_spectra': [{'solvent': 'ATR', 'peaks': [{'value': '3024', 'units': '[cm−1]', 'strength': 'w'}, {'value': '2980', 'units': '[cm−1]', 'strength': 'w'}, {'value': '2918', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1601', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1485', 'units': '[cm−1]', 'strength': 'm'}, {'value': '1460', 'units': '[cm−1]', 'strength': 'm'}, {'value': '1438', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1358', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1290', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1188', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1115', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1002', 'units': '[cm−1]', 'strength': 'm'}, {'value': '954', 'units': '[cm−1]', 'streng
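
The serialised output is a nested dictionary, which is hard to scan by eye. A minimal sketch of flattening it into rows follows. `peak_rows` is a hypothetical helper (not part of CDE), the sample record copies only the first three peaks of the output above, and the code assumes that peak values are integer strings, as they are here.

```python
def peak_rows(record):
    """Flatten a serialised IR record into (solvent, wavenumber, strength) tuples.

    Hypothetical helper, not part of CDE. Assumes peak values are integer strings.
    """
    rows = []
    for spectrum in record.get("ir_spectra", []):
        for peak in spectrum.get("peaks", []):
            # 'strength' is optional: not every sentence reports band intensities
            rows.append((spectrum.get("solvent"), int(peak["value"]), peak.get("strength")))
    return rows


# First three peaks of the serialised output for test 1
record = {"ir_spectra": [{"solvent": "ATR", "peaks": [
    {"value": "3024", "units": "[cm−1]", "strength": "w"},
    {"value": "2980", "units": "[cm−1]", "strength": "w"},
    {"value": "2918", "units": "[cm−1]", "strength": "w"},
]}]}

for row in peak_rows(record):
    print(row)
```

Records without an `ir_spectra` key simply yield an empty list, so the helper can be applied to any serialised record.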

The next part reads similar data strings from documents and parses them. This requires a document containing sentences of the same type as the examples above. Most articles do not present the information in these formats; however, an article with supplementary information can be parsed to obtain the same data.
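
Before running the parser over a whole document, it can help to check which text blocks look like the sentences above. The sketch below is a plain regular-expression pre-screen, not a CDE feature. `IR_HINT` and `looks_like_ir_sentence` are illustrative names, and the pattern only covers the spellings seen in this section.

```python
import re

# "IR", "FTIR" or "FT-IR", followed later in the same string by "cm-1",
# where the minus may be a hyphen, a minus sign, an en dash or a figure dash.
IR_HINT = re.compile(r"\b(?:FT-?)?IR\b.*cm\s*[-\u2212\u2013\u2012]\s*1")


def looks_like_ir_sentence(text):
    """Return True if the text mentions IR and a cm-1 unit (illustrative check only)."""
    return IR_HINT.search(text) is not None


candidates = [
    "Nicotine: FT-IR (KBr, cm-1): ν = 3025, 2970, 2870, 1691, 1677, 904, 717.",
    "Figure 1 FT-IR spectra, a)Nicotine, b)[MBNT]Cl, c)[MBNT][Cu4Cl5]",
]
for text in candidates:
    print(looks_like_ir_sentence(text), "|", text)
```

A pre-screen like this only indicates whether a document is worth parsing; the extraction itself is still left to the IR parser.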

In [2]:
# The line below imports the Document class from the CDE library
from chemdataextractor import Document

# define the input file (put it in the same directory as the source files or add 
# the complete path to the file, e.g.
# 'c:/users/user1/documents/file.pdf')

# The file used in this example was obtained from:
# http://www.rsc.org/suppdata/c5/ra/c5ra19028b/c5ra19028b1.pdf
# This is the supplementary material for the article 
# https://pubs.rsc.org/en/content/articlelanding/2015/ra/c5ra19028b
# Hajipour, A. R., Boostani, E., & Mohammadsaleh, F. (2015). 
# Copper (I) catalyzed Sonogashira reactions promoted by monobenzyl nicotinium chloride, 
# a N-donor quaternary ammonium salt. RSC Advances, 5(114), 94369-94374.
filepath = "IRSpectroscopy/c5ra19028b1.pdf"

# open the file in read binary mode and create a document object from it;
# the with statement closes the file once the document has been read
with open(filepath, 'rb') as f:
    doc = Document.from_file(f)

The document is made up of paragraphs; each paragraph is an element (some are short because they contain only a title).

The elements of a document are stored in a list, so list operations can be used to explore them.

In [3]:
#To see the length of the elements list
len(doc.elements)

70

A paragraph in the document can be accessed using an index, as follows.

In [4]:
# access an element on the list
para = doc.elements[14]
para

Each element can be explored further, since it is divided into sentences and tokens (not words, because numbers, symbols, and punctuation characters are also treated as tokens).

In [5]:
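print(para)
print("Sentences:", len(para.sentences))
print("Tokens:", para.tokens)            # tokens, grouped in one list per sentence
print("Tokens:", len(para.tokens))       # number of token lists (one per sentence)
print("Tokens:", len(para.tokens[0]))    # number of tokens in the first sentence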

Figure 1 FT-IR spectra, a)Nicotine, b)[MBNT]Cl, c)[MBNT][Cu4Cl5]
Sentences: 1
Tokens: [[Token('Figure', 0, 6), Token('1', 7, 8), Token('FT-IR', 9, 14), Token('spectra', 15, 22), Token(',', 22, 23), Token('a)Nicotine', 24, 34), Token(',', 34, 35), Token('b)[MBNT]Cl', 36, 46), Token(',', 46, 47), Token('c)[MBNT][Cu4Cl5]', 48, 64)]]
Tokens: 1
Tokens: 10
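
To make the difference between tokens and words concrete, the sketch below rebuilds the token list printed above with a stand-in `Token` namedtuple (an assumption for illustration; the real objects come from CDE) and compares it with a whitespace split. It also checks that the two numbers in each token are the start and end character offsets within the paragraph text.

```python
from collections import namedtuple

# Stand-in for the CDE token objects, mirroring the printed repr (text, start, end)
Token = namedtuple("Token", ["text", "start", "end"])

paragraph = "Figure 1 FT-IR spectra, a)Nicotine, b)[MBNT]Cl, c)[MBNT][Cu4Cl5]"
tokens = [
    Token("Figure", 0, 6), Token("1", 7, 8), Token("FT-IR", 9, 14),
    Token("spectra", 15, 22), Token(",", 22, 23), Token("a)Nicotine", 24, 34),
    Token(",", 34, 35), Token("b)[MBNT]Cl", 36, 46), Token(",", 46, 47),
    Token("c)[MBNT][Cu4Cl5]", 48, 64),
]

# A whitespace split keeps the commas attached to the preceding word
words = paragraph.split()
print("Whitespace-separated words:", len(words))
print("Tokens:", len(tokens))

# Each (start, end) pair slices the token text out of the paragraph
offsets_ok = all(paragraph[t.start:t.end] == t.text for t in tokens)
print("Offsets match the text:", offsets_ok)
```

The whitespace split gives fewer items than the tokeniser because the three commas are separate tokens.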


CDE also builds a list of chemical entity mentions (elements, compounds, types, etc.), which can be accessed as the cems list. The numbers next to each entity name give the start and end character positions of the mention in the text.

In [6]:
doc.cems

[Span('monobenzyl \nnicotinium chloride a N- donor quaternary ammonium', 54, 116),
 Span('monobenzyl nicotinium chloride', 47, 77),
 Span('DBNT][Cu4Cl5', 1, 13),
 Span('1H', 10, 12),
 Span('dibenzyl nicotinium chloride', 46, 74),
 Span('N', 0, 1),
 Span('OMe', 0, 3),
 Span('CDCl3', 18, 23),
 Span('Cl', 0, 2),
 Span('1H', 0, 2),
 Span('Copper(I)', 0, 9),
 Span('DMSO', 8, 12),
 Span('CDCl3', 8, 13),
 Span('13C', 10, 13),
 Span('dibenzyl nicotinium chlorid', 45, 72),
 Span('CDCl3', 19, 24),
 Span('monobenzyl nicotinium chlorid', 46, 75),
 Span('CDCl3', 17, 22),
 Span('CDCl3', 9, 14),
 Span('Nicotine', 0, 8),
 Span('13C', 0, 3),
 Span('1H', 9, 11)]
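
Because the same entity can be mentioned many times, a quick frequency count shows which mentions dominate. The sketch below uses a stand-in `Span` namedtuple and a small subset of the spans listed above. It assumes that the real objects expose their string as `.text`, which is an assumption based on the printed repr rather than something shown in this section.

```python
from collections import Counter, namedtuple

# Stand-in for the CDE span objects, mirroring the printed repr (text, start, end)
Span = namedtuple("Span", ["text", "start", "end"])

# A subset of the mentions listed above
cems_subset = [
    Span("1H", 10, 12),
    Span("CDCl3", 18, 23),
    Span("1H", 0, 2),
    Span("CDCl3", 8, 13),
    Span("CDCl3", 19, 24),
    Span("Nicotine", 0, 8),
]

# Count mentions by their text, ignoring the character offsets
counts = Counter(span.text for span in cems_subset)
for name, n in counts.most_common():
    print(n, name)
```

In this document the most frequent mentions are NMR solvents and nuclei (CDCl3, 1H, 13C), so a count like this also helps to spot entities that are not the compounds of interest.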

The cems list contains every occurrence of a chemical entity, so it includes repetitions. CDE provides the serialize operation for summarising all entities, merging abbreviations and capitalised or uncapitalised occurrences.
As the following example shows, it also extracts data when they are found in a format suitable for parsing.

In [7]:
serialised = doc.records.serialize()
len(serialised)

18
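
Each serialised record is a plain dictionary, so the records that carry IR data can be separated from the name-only records by checking for the `ir_spectra` key. The sketch below works on a small hand-written list in the same shape as the serialised output; `sample_records` is illustrative and contains only a few of the values extracted from this document.

```python
# A shortened list in the same shape as the serialised records
sample_records = [
    {"names": ["Copper(I)"]},
    {"names": ["Nicotine"]},
    {"ir_spectra": [{"solvent": "KBr", "peaks": [
        {"value": "3025", "units": "cm-1"},
        {"value": "2970", "units": "cm-1"},
        {"value": "2870", "units": "cm-1"},
    ]}]},
]

# Keep only the records that contain parsed IR spectra
ir_records = [r for r in sample_records if "ir_spectra" in r]
name_records = [r for r in sample_records if "names" in r]

print("Records with IR data:", len(ir_records))
print("Records with names:", len(name_records))

# Number of peaks in each spectrum of each IR record
peak_counts = [len(s["peaks"]) for r in ir_records for s in r["ir_spectra"]]
print("Peaks per spectrum:", peak_counts)
```

Note that in the output of this document the IR records do not include a compound name, so linking a spectrum to its compound needs extra care.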

In [8]:
serialised

[{'names': ['Copper(I)']},
 {'names': ['monobenzyl nicotinium chloride a N- donor quaternary ammonium']},
 {'names': ['Nicotine']},
 {'ir_spectra': [{'solvent': 'KBr',
    'peaks': [{'value': '3025', 'units': 'cm-1'},
     {'value': '2970', 'units': 'cm-1'},
     {'value': '2870', 'units': 'cm-1'},
     {'value': '1691', 'units': 'cm-1'},
     {'value': '1677', 'units': 'cm-1'},
     {'value': '904', 'units': 'cm-1'},
     {'value': '717', 'units': 'cm-1'}]}]},
 {'ir_spectra': [{'solvent': 'KBr',
    'peaks': [{'value': '3412', 'units': 'cm-1'},
     {'value': '3043', 'units': 'cm-1'},
     {'value': '2958', 'units': 'cm-1'},
     {'value': '1632', 'units': 'cm-1'},
     {'value': '1610', 'units': 'cm-1'},
     {'value': '1451', 'units': 'cm-1'},
     {'value': '1197', 'units': 'cm-1'},
     {'value': '916', 'units': 'cm-1'},
     {'value': '703', 'units': 'cm-1'},
     {'value': '472', 'units': 'cm-1'}]}]},
 {'names': ['[DBNT][Cu4Cl5]']},
 {'ir_spectra': [{'solvent': 'KBr',
    'peaks