# Reading IR Data from PDF Documents

This section presents a modified example based on the [reading documents page](http://chemdataextractor.org/docs/reading) of the ChemDataExtractor (CDE) documentation.
The example uses the IR parser to extract IR data from documents when such data is present.

The first part replicates the results of the test cases from the [IR parser test file](https://github.com/CambridgeMolecularEngineering/chemdataextractor/blob/master/tests/test_parse_ir.py).

In [1]:
# libraries for accessing sentences and for parsing IR data
from chemdataextractor.doc.text import Sentence
from chemdataextractor.parse.ir import IrParser

def ir_parse_print(text):
    """Parse one sentence with the IR parser and print every record found."""
    print("Input text:\n", text)
    s = Sentence(text)
    # nothing is printed when the parser finds no IR data in the sentence
    for c in IrParser().parse(s.tagged_tokens):
        print("Parsed IR Data:\n ", c.serialize())

#test 1
text = 'IR (ATR): ṽ [cm−1] 3024 (w), 2980 (w), 2918 (w), 1601 (w), 1485 (m), 1460 (m), 1438 (w), 1358 (w), ' \
    '1290 (w), 1188 (w), 1115 (w), 1002 (m), 954 (m), 912 (w), 853 (m), 814 (s), 793 (s), 762 (s), 739 (m), ' \
    '687 (s), 671 (m).'
ir_parse_print(text)
    
#test 2    
text = 'IR (KBr/cm–1): 4321, 2222, 1734, 1300, 1049, 777, 620.'
ir_parse_print(text)

#test 3
text = 'FTIR (KBr): ν/cm‒1 3315, 3002, 1630, 1593 (νCH=N), 1251;'
ir_parse_print(text)

#test 4
text = 'IR-ATR:  3380, 3190,  2973, 2873, 1669, 1646, 1602, 1495, 1178, 828 cm-1.'
ir_parse_print(text)

# sentences of the kind found in the article, containing band data
from_article = "Nicotine: FT-IR (KBr, cm-1): ν = 3025, 2970, 2870, 1691, 1677, 904, 717."
ir_parse_print(from_article)
from_article = "[MBNT]Cl: FT-IR (KBr, cm-1): ν =3412, 3043, 2958, 1632, 1610, 1451, 1197, 916, 703, 472"
ir_parse_print(from_article)

Input text:
 IR (ATR): ṽ [cm−1] 3024 (w), 2980 (w), 2918 (w), 1601 (w), 1485 (m), 1460 (m), 1438 (w), 1358 (w), 1290 (w), 1188 (w), 1115 (w), 1002 (m), 954 (m), 912 (w), 853 (m), 814 (s), 793 (s), 762 (s), 739 (m), 687 (s), 671 (m).
Parsed IR Data:
  {'ir_spectra': [{'solvent': 'ATR', 'peaks': [{'value': '3024', 'units': '[cm−1]', 'strength': 'w'}, {'value': '2980', 'units': '[cm−1]', 'strength': 'w'}, {'value': '2918', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1601', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1485', 'units': '[cm−1]', 'strength': 'm'}, {'value': '1460', 'units': '[cm−1]', 'strength': 'm'}, {'value': '1438', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1358', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1290', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1188', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1115', 'units': '[cm−1]', 'strength': 'w'}, {'value': '1002', 'units': '[cm−1]', 'strength': 'm'}, {'value': '954', 'units': '[cm−1]', 'streng
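Because the serialized record is a plain dictionary, it can be post-processed with ordinary Python. The following is a minimal sketch, assuming the record layout shown in the output above (`ir_spectra`, `peaks`, `value`, `strength`). The helper name `peaks_by_strength` is hypothetical and not part of CDE, and the sample record is a shortened copy of the printed one.

```python
def peaks_by_strength(record, wanted=("m", "s")):
    """Return (wavenumber, strength) pairs whose strength label is in `wanted`."""
    selected = []
    for spectrum in record.get("ir_spectra", []):
        for peak in spectrum.get("peaks", []):
            strength = peak.get("strength")  # None when no label was reported
            if strength in wanted:
                # assumes the value is a plain integer string, as in the output above
                selected.append((int(peak["value"]), strength))
    return selected

# Shortened copy of the record printed above
record = {"ir_spectra": [{"solvent": "ATR", "peaks": [
    {"value": "3024", "units": "[cm−1]", "strength": "w"},
    {"value": "1485", "units": "[cm−1]", "strength": "m"},
    {"value": "1460", "units": "[cm−1]", "strength": "m"},
    {"value": "1002", "units": "[cm−1]", "strength": "m"},
]}]}

print(peaks_by_strength(record))
```

Records without IR data, such as those holding only names, simply return an empty list.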

The next part reads similar data strings from a document and parses them. This requires a document containing the same type of sentences as the examples above. Most articles do not present IR data in these formats; however, the supplementary information of an article can be parsed to obtain the same kind of data.
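Before running the parser on a whole document, it can help to check whether a text contains sentences of this type at all. The sketch below is a rough screening heuristic written with the standard `re` module; it is not how CDE detects IR data, and the pattern `IR_HINT` and the function `looks_like_ir_sentence` are assumptions made for illustration.

```python
import re

# Heuristic: an IR / FTIR / FT-IR label followed, in the same string, by a "cm-1" unit.
# The character class covers the hyphen, minus sign, en dash and figure dash,
# all of which occur in the test sentences above.
IR_HINT = re.compile(r"\b(?:FT-?)?IR\b.*?cm\s*[-\u2212\u2013\u2012]\s*1")

def looks_like_ir_sentence(text):
    """Return True when the text resembles the IR band listings parsed above."""
    return IR_HINT.search(text) is not None

candidates = [
    "IR (KBr/cm\u20131): 4321, 2222, 1734, 1300, 1049, 777, 620.",
    "Nicotine: FT-IR (KBr, cm-1): \u03bd = 3025, 2970, 2870, 1691, 1677, 904, 717.",
    "The mixture was stirred at room temperature for 2 h.",
]
for text in candidates:
    print(looks_like_ir_sentence(text), "|", text)
```

A match only suggests that the sentence is worth passing to the parser; it does not guarantee that the parser will extract anything from it.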

In [2]:
# import the Document class from the CDE library
from chemdataextractor import Document

# define the input file (put it in the same directory as the source files or add 
# the complete path to the file, e.g.
# 'c:/users/user1/documents/file.pdf')

# The file used in this example was obtained from:
# http://www.rsc.org/suppdata/c5/ra/c5ra19028b/c5ra19028b1.pdf
# This is the supplementary material for the article 
# https://pubs.rsc.org/en/content/articlelanding/2015/ra/c5ra19028b
# Hajipour, A. R., Boostani, E., & Mohammadsaleh, F. (2015). 
# Copper (I) catalyzed Sonogashira reactions promoted by monobenzyl nicotinium chloride, 
# a N-donor quaternary ammonium salt. RSC Advances, 5(114), 94369-94374.
filepath = "IRSpectroscopy/c5ra19028b1.pdf"

# open the file in read binary mode and create a document object from it
with open(filepath, 'rb') as f:
    doc = Document.from_file(f)

CDE provides the serialize operation for summarising all entities, joining together abbreviations and capitalised or uncapitalised occurrences.
As the following example shows, it also extracts data when it is found in a format the parsers can handle.

In [3]:
serialised = doc.records.serialize()
print(len(serialised))
serialised

18


[{'names': ['Copper(I)']},
 {'names': ['monobenzyl nicotinium chloride a N- donor quaternary ammonium']},
 {'names': ['Nicotine']},
 {'ir_spectra': [{'solvent': 'KBr',
    'peaks': [{'value': '3025', 'units': 'cm-1'},
     {'value': '2970', 'units': 'cm-1'},
     {'value': '2870', 'units': 'cm-1'},
     {'value': '1691', 'units': 'cm-1'},
     {'value': '1677', 'units': 'cm-1'},
     {'value': '904', 'units': 'cm-1'},
     {'value': '717', 'units': 'cm-1'}]}]},
 {'ir_spectra': [{'solvent': 'KBr',
    'peaks': [{'value': '3412', 'units': 'cm-1'},
     {'value': '3043', 'units': 'cm-1'},
     {'value': '2958', 'units': 'cm-1'},
     {'value': '1632', 'units': 'cm-1'},
     {'value': '1610', 'units': 'cm-1'},
     {'value': '1451', 'units': 'cm-1'},
     {'value': '1197', 'units': 'cm-1'},
     {'value': '916', 'units': 'cm-1'},
     {'value': '703', 'units': 'cm-1'},
     {'value': '472', 'units': 'cm-1'}]}]},
 {'names': ['[DBNT][Cu4Cl5]']},
 {'ir_spectra': [{'solvent': 'KBr',
    'peaks
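The serialized list mixes records that hold only names with records that hold spectra. A minimal sketch for pulling out just the IR spectra follows, assuming the record layout shown above. The function `collect_ir_spectra` is a hypothetical helper, not part of CDE, and the sample list is copied from the first records of the output.

```python
def collect_ir_spectra(records):
    """Return one (solvent, [wavenumbers]) tuple per IR spectrum in the records."""
    spectra = []
    for record in records:
        # records without an 'ir_spectra' key contribute nothing
        for spectrum in record.get("ir_spectra", []):
            # assumes every peak value is a plain integer string
            wavenumbers = [int(p["value"]) for p in spectrum.get("peaks", [])]
            spectra.append((spectrum.get("solvent"), wavenumbers))
    return spectra

# Copied from the first records of the output above
sample = [
    {"names": ["Copper(I)"]},
    {"names": ["Nicotine"]},
    {"ir_spectra": [{"solvent": "KBr", "peaks": [
        {"value": "3025", "units": "cm-1"},
        {"value": "2970", "units": "cm-1"},
        {"value": "2870", "units": "cm-1"},
        {"value": "1691", "units": "cm-1"},
        {"value": "1677", "units": "cm-1"},
        {"value": "904", "units": "cm-1"},
        {"value": "717", "units": "cm-1"},
    ]}]},
]

print(collect_ir_spectra(sample))
```

Note that in the output above the spectra appear in records without a names key, so this sketch deliberately does not try to attach a compound name to each spectrum.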

The same processing on the full article yields additional data, such as NMR spectra.

In [5]:
filepath = "IRSpectroscopy/CooperICatalyzedSogashira.html"

# open the file in read binary mode and create a document object from it
with open(filepath, 'rb') as f:
    doc = Document.from_file(f)

serialised = doc.records.serialize()
print(len(serialised))
serialised

131


[{'names': ['N-donor quaternary ammonium']},
 {'names': ['alkynes']},
 {'names': ['acetylene']},
 {'names': ['aryl alkynes']},
 {'names': ['aryl']},
 {'names': ['C–N']},
 {'names': ['C–O']},
 {'names': ['triphenylphosphine']},
 {'names': ['1,10-phenanthroline']},
 {'names': ['ethylenediamine']},
 {'names': ['N,N-dimethylglycine']},
 {'names': ['choline chloride']},
 {'names': ['C–C and C–S']},
 {'names': ['carbon dioxide']},
 {'names': ['cyclic carbonates']},
 {'names': ['monobenzylnicotinium bromide']},
 {'names': ['dibenzylnicotinium bromide']},
 {'names': ['1-benzyl-4-aza-1-azoniabicyclo[2.2.2]octane chloride']},
 {'names': ['DABCO chloride']},
 {'names': ['1-benzyl-1-methyl-2-(pyridin-3-yl)pyrrolidin-1-ium chloride']},
 {'names': ['1-benzyl-3-(1-benzyl-1-methylpyrrolidin-1-ium-2-yl)-pyridin-1-ium dichloride']},
 {'names': ['Monobenzylnicotininium chloride [MBNT]Cl']},
 {'names': ['dibenzyl nicotinium dichloride']},
 {'names': ['quaternary ammonium cation N,N′-diallylmorpholinium']}
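With 131 records, scrolling through the list is not a practical way to see what was extracted. One option is to count how often each top-level key occurs. The sketch below assumes only that every record is a dictionary, as in the output above. `count_record_keys` is a hypothetical helper, and the sample list is a small stand-in built from records shown in this section.

```python
from collections import Counter

def count_record_keys(records):
    """Count how many records carry each top-level key (e.g. 'names', 'ir_spectra')."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts

# Small stand-in for the serialised list, built from records shown in this section
sample = [
    {"names": ["alkynes"]},
    {"names": ["acetylene"]},
    {"ir_spectra": [{"solvent": "KBr", "peaks": [{"value": "3025", "units": "cm-1"}]}]},
]

for key, n in count_record_keys(sample).most_common():
    print(key, n)
```

Calling `count_record_keys(serialised)` on the real list would show in the same way whether other kinds of records, such as NMR spectra, are present.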