# Reading a List of Documents from a CSV File

This section contains a modified example based on the [reading documents page](http://chemdataextractor.org/docs/reading) of the ChemDataExtractor (CDE) documentation.

A simple csv file containing the details of a series of documents is included (articles_list.csv).

Several of the functions used in the cde_read_files.py example are reused here. The main change is the way the list of files to process is acquired; two functions for reading and writing csv files are also included.

In [1]:
# The line of code (LOC) below imports the Document class from the CDE library
from chemdataextractor import Document

# import library for managing files
from pathlib import Path
import sys

# A function for getting a list of pdf files from the directory
# This will be modified to get the list from a csv file
def get_files_list(source_dir):
    # glob gives no ordering guarantee, so sort the paths for stable output
    return sorted(source_dir.glob('*.pdf'))

# A function for getting the unique chemical entity mentions
# returns a dictionary mapping each mention text to its occurrence count
def get_uniques(cde_doc):
    uniques = {}
    for chement in cde_doc.cems:
        if chement.text not in uniques:
            uniques[chement.text] = 1
        else:
            uniques[chement.text] += 1
    return uniques

# A function for getting the entity with the most occurrences
# returns two values: the entity name and the count
def get_max(uniques):
    max_val = 0
    max_lbl = ""
    for chement in uniques:
        if uniques[chement] > max_val:
            max_val = uniques[chement]
            max_lbl = chement.replace('\n',' ')
    return max_lbl, max_val

# A function which reads the pdf files found in a directory
# and performs a basic analysis of the documents looking
# for the most mentioned entity
def cde_read_pdfs(pdf_path = "./pdfs"):
    pdf_dir = Path(pdf_path)
    files_list = get_files_list(pdf_dir)
    print(files_list)
    for a_file in files_list:
        file_name = a_file.name
        # open in binary mode; the with block closes the file after parsing
        with open(a_file, 'rb') as pdf_f:
            doc = Document.from_file(pdf_f)
        uniques = get_uniques(doc)
        max_lbl, max_val = get_max(uniques)
        print(file_name, "Unique entities:", len(uniques), "Most common entity:", max_lbl, max_val)

After the functions are declared, we can call the cde_read_pdfs function directly and see its results.

In [2]:
# Run the analysis on every pdf file in the ./pdfs directory
cde_read_pdfs("./pdfs")

[WindowsPath('pdfs/1-s2.0-S0926860X18305003-main.pdf'), WindowsPath('pdfs/acscatal.9b01820.pdf'), WindowsPath('pdfs/acscatal.9b04186.pdf'), WindowsPath('pdfs/c8cp05975f.pdf'), WindowsPath('pdfs/c8ob00066b.pdf'), WindowsPath('pdfs/cs5b01936_si_Proof.pdf'), WindowsPath('pdfs/cs5b01936MainProof.pdf'), WindowsPath('pdfs/Decarolis2018_Article_EffectOfParticleSizeAndSupport.pdf'), WindowsPath('pdfs/fchem-07-00182.pdf'), WindowsPath('pdfs/PhysRevB.66.224405.pdf')]
1-s2.0-S0926860X18305003-main.pdf Unique entities: 108 Most common entity: DME 29
acscatal.9b01820.pdf Unique entities: 148 Most common entity: methanol 40
acscatal.9b04186.pdf Unique entities: 279 Most common entity: H 25
c8cp05975f.pdf Unique entities: 56 Most common entity: hydrogen 54
c8ob00066b.pdf Unique entities: 123 Most common entity: Arg 37
cs5b01936_si_Proof.pdf Unique entities: 79 Most common entity: TiO2 27
cs5b01936MainProof.pdf Unique entities: 88 Most common entity: ReOx 30
Decarolis2018_Article_EffectOfParticleSizeA
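
The counting done by get_uniques and get_max can also be expressed with `collections.Counter` from the standard library. The block below is a minimal sketch of that alternative, not part of the original example: the names `count_mentions` and `most_common_mention` are hypothetical, and the list of strings is made-up data standing in for the mention texts that would come from `doc.cems`.

```python
from collections import Counter

def count_mentions(mentions):
    """Count how often each mention text occurs (same role as get_uniques)."""
    return Counter(mentions)

def most_common_mention(counts):
    """Return (label, count) for the most frequent mention, or ("", 0) if empty."""
    if not counts:
        return "", 0
    label, count = counts.most_common(1)[0]
    # same clean-up as get_max: line breaks inside a mention become spaces
    return label.replace("\n", " "), count

# Made-up mention texts; with CDE these would be [cem.text for cem in doc.cems]
sample_mentions = ["DME", "methanol", "DME", "ZSM-5", "DME", "methanol"]
counts = count_mentions(sample_mentions)
print("Unique entities:", len(counts), "Most common entity:", *most_common_mention(counts))
```

When two mentions have the same count, `most_common` keeps the one seen first, which matches the behaviour of the strict `>` comparison in get_max.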

## Modifications to read from csv

First, we create two functions to read from and write to csv files.

In [3]:
# import library for managing csv files
import csv

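# get the data from the csv_file, assuming the id_field column holds integer ids
# returns a dictionary of rows keyed by id and the list of column names
def get_csv_data(input_file, id_field):
    csv_data = {}
    fieldnames = []
    # the encoding is given explicitly (assuming the file is saved as UTF-8) so
    # non-ASCII characters in titles do not depend on the platform default
    with open(input_file, newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            if fieldnames == []:
                fieldnames = list(row.keys())
            csv_data[int(row[id_field])] = row
    return csv_data, fieldnames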

# writes data to the given file name
# values is a dictionary of rows, in the form returned by get_csv_data
def write_csv_data(values, filename):
    fieldnames = []
    # collect every column name used by any row, in first-seen order
    for item in values.keys():
        for key in values[item].keys():
            if key not in fieldnames:
                fieldnames.append(key)
    # write back to a new csv file
    with open(filename, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for key in values.keys():
            writer.writerow(values[key])

The use of the function for reading articles from the csv file is shown below. The get_csv_data function returns two values: a dictionary with the contents of the file, keyed by the integer id of each row, and a simple list of the column headers.

In [5]:
articles_list, column_names = get_csv_data("./articles_list.csv", "id")
print("The first article in the list:\n\t", articles_list[1])
print("The names of the columns in the file:\n\t", column_names)

The first article in the list:
	 OrderedDict([('id', '1'), ('filename', 'pdfs/1-s2.0-S0926860X18305003-main.pdf'), ('title', 'Investigation of ZSM-5 catalysts for dimethylether conversion using inelastic neutron scattering'), ('doi', '10.1016/j.apcata.2018.10.010'), ('url', '')])
The names of the columns in the file:
	 ['id', 'filename', 'title', 'doi', 'url']
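
The writing function is not exercised in this section, so the block below is a small sketch of a full round trip under stated assumptions. It uses in-memory text instead of files, the rows are made-up data that only borrow the column names of articles_list.csv, and the `most_common` column is a hypothetical addition. It follows the same approach as get_csv_data and write_csv_data: rows are stored under their integer id, and the header is built from the union of the keys of all rows.

```python
import csv
import io

# Made-up rows using the same column names as articles_list.csv
sample_csv = (
    "id,filename,title,doi,url\n"
    "1,pdfs/a.pdf,First paper,doi-a,\n"
    "2,pdfs/b.pdf,Second paper,doi-b,\n"
)

# Read: one dictionary per row, stored under its integer id
reader = csv.DictReader(io.StringIO(sample_csv))
rows_by_id = {int(row["id"]): row for row in reader}

# Every value is a string, and an empty cell comes back as ''
print(rows_by_id[2]["title"], repr(rows_by_id[2]["url"]))

# Add a new column to one row only (hypothetical analysis result)
rows_by_id[1]["most_common"] = "DME"

# Write: collect the union of keys so the new column gets a header
fieldnames = []
for row in rows_by_id.values():
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

out = io.StringIO(newline="")
writer = csv.DictWriter(out, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows_by_id.values())

lines = out.getvalue().splitlines()
print(lines)
```

Rows that lack the new key are still written; `csv.DictWriter` fills the missing cell with an empty string by default.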


The modified version of cde_read_pdfs, which takes the list of files from the csv file instead of from a directory, is shown below.

In [6]:
# A function which reads a list of files from a csv file
# and performs a basic analysis of the documents looking
# for the most mentioned entity
# modified version of the one which reads from a directory
def cde_read_pdfs_csv(csv_name = "./articles_list.csv"):
    articles_list, column_names = get_csv_data(csv_name, "id")
    for a_file in articles_list:
        file_name = articles_list[a_file]['filename']
        file_title = articles_list[a_file]['title']
        # open in binary mode; the with block closes the file after parsing
        with open(file_name, 'rb') as pdf_f:
            doc = Document.from_file(pdf_f)
        uniques = get_uniques(doc)
        max_lbl, max_val = get_max(uniques)
        print(file_title, "Unique entities:", len(uniques), "Most common entity:", max_lbl, max_val)
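
Because the file names now come from the csv file rather than from a directory listing, a row may point to a file that is not on disk, and `open` would then raise `FileNotFoundError` part way through the run. One possible safeguard, sketched here as an assumption rather than as part of the original example, is to split the rows into found and missing before processing. The function name `split_existing` and the file names in the demonstration are hypothetical.

```python
from pathlib import Path
import tempfile

def split_existing(rows_by_id, base_dir="."):
    """Split rows into (found, missing) depending on whether 'filename' exists."""
    found, missing = {}, {}
    for row_id, row in rows_by_id.items():
        target = found if (Path(base_dir) / row["filename"]).is_file() else missing
        target[row_id] = row
    return found, missing

# Demonstration with a temporary directory and made-up file names
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "present.pdf").write_bytes(b"%PDF-1.4")
    rows = {1: {"filename": "present.pdf"}, 2: {"filename": "absent.pdf"}}
    found, missing = split_existing(rows, tmp)
    print("found:", sorted(found), "missing:", sorted(missing))
```

The `found` dictionary has the same shape as the one returned by get_csv_data, so it could be looped over in the same way, while `missing` can be reported to the user.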

In [7]:
cde_read_pdfs_csv("./articles_list.csv")

Investigation of ZSM-5 catalysts for dimethylether conversion using inelastic neutron scattering Unique entities: 108 Most common entity: DME 29
Elementary Steps in the Formation of Hydrocarbons from Surface Methoxy Groups in HZSM‑5 Seen by Synchrotron Infrared Microspectroscopy Unique entities: 148 Most common entity: methanol 40
Machine Learning for Catalysis Informatics: Recent Applications and Prospects Unique entities: 279 Most common entity: H 25
Hydrogen adsorption on transition metal carbides: a DFT study Unique entities: 56 Most common entity: hydrogen 54
QM/MM simulations identify the determinants of catalytic activity differences between type II dehydroquinase enzymes Unique entities: 123 Most common entity: Arg 37
Supporting Information ReOx/TiO2 – a recyclable solid catalyst for deoxydehydration Unique entities: 79 Most common entity: TiO2 27
ReOx/TiO2: A Recyclable Solid Catalyst for Deoxydehydration Unique entities: 88 Most common entity: ReOx 30
Effect of Particle S