# Preprocessing: PDF to Text Conversion

This notebook converts articles in `*.pdf` format into plain text, which is stored in a `*.json` file together with the article's metadata.

Running this notebook in full converts everything in the `../articles/` folder and places the output files in the `../data/` folder.
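As a rough sketch of that batch step, the loop below walks a folder of PDFs and writes one JSON file per article. The function name `convert_folder`, the `convert` parameter, and the record keys `source_file` and `text` are assumptions made for illustration; the notebook does not specify the JSON layout. Any callable that maps a PDF path to a string can be passed as `convert`.

```python
import json
import os


def convert_folder(article_dir, data_dir, convert):
    """Convert every PDF in article_dir and write one JSON file per article.

    `convert` is any callable that maps a PDF path to a text string.
    """
    os.makedirs(data_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(article_dir)):
        if not name.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        text = convert(os.path.join(article_dir, name))
        out_path = os.path.join(data_dir, os.path.splitext(name)[0] + ".json")
        # Hypothetical record layout: the JSON schema is an assumption
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump({"source_file": name, "text": text}, f)
        written.append(out_path)
    return written
```

Passing the converter in as an argument keeps the loop independent of any particular PDF library, so it can be tried out with a stub before running it on real articles.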

In [9]:
import json
import os
import re
import subprocess

import slate3k as sl

**Paths to the input and output folders**

In [2]:
article_dir = '../articles/'
data_dir_ = '../data/'

In [3]:
os.listdir(article_dir)[1]

**Function for processing an individual file in the directory**

Script adapted from the `slate` package documentation:

In [4]:
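def convert_pdf(path):
    """Extract the text of the PDF at `path` as one cleaned string."""
    with open(path, 'rb') as f:
        extracted_text = sl.PDF(f)

    # Strip line breaks, tabs, non-breaking spaces, and form feeds from each page,
    # then join the pages with single spaces
    clean_text = " ".join(
        pg.replace("\n", "").replace("\t", "").replace('\xa0', '').replace('\x0c', '')
        for pg in extracted_text
    )

    return clean_text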
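Deleting line breaks outright can fuse the last word of one line with the first word of the next. A possible alternative, shown here only as a sketch, is to collapse every run of whitespace into a single space. The helper name `normalize_whitespace` is hypothetical and is not part of `slate3k`.

```python
import re


def normalize_whitespace(page_text):
    """Collapse runs of whitespace into single spaces and trim the ends.

    For str patterns, \\s also matches the non-breaking space (\\xa0)
    and the form feed (\\x0c), so no separate replacements are needed.
    """
    return re.sub(r"\s+", " ", page_text).strip()
```

This does not rejoin words that were hyphenated across a line break; that would need a separate step.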

In [20]:
def capture_doi(text):
    # Raw string so that the regex escapes (\d, \., \/, \w) reach the regex engine unchanged
    return re.findall(r'DOI:(\d+\.\d+\/\w.\d+)', text)[0]
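The `[0]` index raises an `IndexError` when the text contains no match. A more forgiving variant might look like the sketch below, which returns `None` instead. The name `find_doi` and the character class for the DOI suffix are assumptions: the pattern is a commonly used approximation of DOI syntax, not a complete grammar, and it still relies on the `DOI:` label being present in the text.

```python
import re

# Assumed pattern: "DOI:" label, optional whitespace, then prefix/suffix
DOI_PATTERN = re.compile(
    r"DOI:\s*(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)", re.IGNORECASE
)


def find_doi(text):
    """Return the first DOI that follows a 'DOI:' label, or None if absent."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop sentence punctuation that may directly follow the DOI
    return match.group(1).rstrip(".,;")
```

Callers can then skip articles without a detectable DOI rather than stopping the whole run.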

In [21]:
temp_path = os.path.join(article_dir, os.listdir(article_dir)[0])

In [22]:
text = convert_pdf(temp_path)



**Pulling metadata from DOI**

In [16]:
def doi_pull(doi):
    """Returns the metadata from a cURL of the input doi.
    """

    if not isinstance(doi, str):
        raise ValueError('doi needs to be a string.')

    # cURL the doi metadata in json format
    proc = subprocess.Popen(["curl", "-LH", "Accept: application/json",
                             "http://dx.doi.org/" + doi], stdout=subprocess.PIPE)
    (out, err) = proc.communicate()

    # decode the raw bytes and parse them as json
    cleaned_meta = json.loads(out.decode("utf-8"))

    return cleaned_meta
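The same request can probably be made without shelling out to `curl`. The sketch below uses only the standard library; `urllib.request.urlopen` follows redirects by default, which plays the role of curl's `-L` flag. The names `build_doi_request` and `doi_pull_urllib` and the `timeout` default are assumptions, and the resolver URL and `Accept` header are copied from the cell above.

```python
import json
import urllib.request


def build_doi_request(doi, resolver="http://dx.doi.org/"):
    """Build (but do not send) a request for a DOI's metadata as JSON."""
    if not isinstance(doi, str):
        raise ValueError("doi needs to be a string.")
    return urllib.request.Request(
        resolver + doi, headers={"Accept": "application/json"}
    )


def doi_pull_urllib(doi, timeout=10):
    """Send the request and parse the JSON body; requires network access."""
    with urllib.request.urlopen(build_doi_request(doi), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes it possible to check the URL and headers without touching the network.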

In [23]:
doi_pull(capture_doi(text))

{'indexed': {'date-parts': [[2019, 12, 2]],
  'date-time': '2019-12-02T19:23:54Z',
  'timestamp': 1575314634111},
 'reference-count': 0,
 'publisher': 'American Psychological Association (APA)',
 'issue': '5',
 'content-domain': {'domain': [], 'crossmark-restriction': False},
 'DOI': '10.1037/a0034040',
 'type': 'article-journal',
 'created': {'date-parts': [[2013, 8, 19]],
  'date-time': '2013-08-19T15:52:32Z',
  'timestamp': 1376927552000},
 'page': '799-809',
 'source': 'Crossref',
 'is-referenced-by-count': 33,
 'title': 'A field experiment: Reducing interpersonal discrimination toward pregnant job applicants.',
 'prefix': '10.1037',
 'volume': '98',
 'author': [{'given': 'Whitney Botsford',
   'family': 'Morgan',
   'sequence': 'first',
   'affiliation': []},
  {'given': 'Sarah Singletary',
   'family': 'Walker',
   'sequence': 'additional',
   'affiliation': []},
  {'given': 'Michelle (Mikki) R.',
   'family': 'Hebl',
   'sequence': 'additional',
   'affiliation': []},
  {'given'