In [1]:
import os
import re
import sys
import glob
import math
import logging
from pathlib import Path
from pprint import pprint

import numpy as np
import scipy as sp
import sklearn

import spacy
import tika
from tika import parser

%load_ext autoreload
%autoreload 2

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context("poster")
sns.set(rc={'figure.figsize': (16, 9.)})
sns.set_style("whitegrid")

import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

logging.basicConfig(level=logging.INFO, stream=sys.stdout)

In [2]:
from wb_nlp.processing import document

In [3]:
## Hints

# nlp = spacy.load('en_core_web_sm')

This notebook shows how to use the `PDFDoc2Txt` class to convert PDF documents into formatted text. Other methods of the class can also be applied to raw text that has already been extracted from PDFs.

We start by creating an instance of `PDFDoc2Txt` named `pdf_parser`.

In [4]:
pdf_parser = document.PDFDoc2Txt()

# Parsing a pdf file

Parsing a PDF file starts with the `parse` method. It accepts either a bytes buffer or a string holding a URL or a file path. The source type must be specified through `source_type` so that the parser handles the input correctly.

The source of the `parse` method, displayed with `??pdf_parser.parse`, shows how this works. Tika is the main driver of the parser. We pass the `xmlContent` flag to request XML-formatted output, which preserves structure that we can use to make an informed reconstruction of the document.
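
As a rough illustration of the dispatch described above, the following sketch routes the three kinds of input to the `tika` Python client with `xmlContent=True`. It is not the `PDFDoc2Txt.parse` implementation: the helper name `parse_to_xml` is hypothetical, the `'file'` value for `source_type` is an assumption (only `'buffer'` and `'url'` appear in this notebook), and downloading the URL with `urllib` is a choice made here for simplicity.

```python
from typing import Union
from urllib.request import urlopen


def parse_to_xml(source: Union[bytes, str], source_type: str = "buffer") -> str:
    """Return Tika's XHTML for a PDF (hypothetical helper, not PDFDoc2Txt.parse)."""
    # 'file' is an assumed value; the notebook only shows 'buffer' and 'url'.
    if source_type not in {"buffer", "file", "url"}:
        raise ValueError(f"Unsupported source_type: {source_type!r}")

    # Imported lazily so the argument check above works without tika installed.
    from tika import parser

    if source_type == "file":
        parsed = parser.from_file(source, xmlContent=True)
    else:
        if source_type == "url":
            # Download the PDF ourselves and hand the bytes to Tika.
            with urlopen(source) as response:
                source = response.read()
        parsed = parser.from_buffer(source, xmlContent=True)

    # tika-python returns a dict; "content" holds the XHTML string or None.
    return parsed.get("content") or ""
```

Calling `parse_to_xml(pdf_bytes)` requires a reachable Tika server; the argument check alone runs without one.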

In [5]:
??pdf_parser.parse

Signature: pdf_parser.parse(source: Union[bytes, str], source_type: str = 'buffer') -> str
Source:
    def parse(self, source: Union[bytes, str], source_type: str='buffer') -> str:
        """Parse a PDF document to text from different source types.

        Args:
            source:
                Source of the PDF that needs to be converted.
                The source could be a url, a path, or a buffer/file-like object

## Processing a single page

The XML returned by Tika marks page boundaries with `div` tags. We use these to process documents page by page.

The `process_page` method takes the tag element for an extracted page and applies page-level processing, such as consolidating the paragraphs on the page and fixing footnote citations. It also concatenates paragraphs that were likely fragmented.
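
The page-by-page idea can be sketched with the standard library alone. This example assumes that each page is wrapped in a `<div class="page">` element inside an XHTML document and that the markup is well-formed XML; it uses `xml.etree.ElementTree` rather than the BeautifulSoup tags that `process_page` works with, and `split_pages` is a hypothetical name. The sample string is hand-written, not real Tika output.

```python
import xml.etree.ElementTree as ET
from typing import List

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def split_pages(xhtml: str) -> List[List[str]]:
    """Return one list of paragraph strings per page-level <div>."""
    root = ET.fromstring(xhtml)
    pages = []
    for div in root.iter(f"{XHTML_NS}div"):
        # Assumption: page containers carry class="page".
        if div.get("class") != "page":
            continue
        paragraphs = ["".join(p.itertext()) for p in div.iter(f"{XHTML_NS}p")]
        pages.append(paragraphs)
    return pages


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="page"><p>First page, first paragraph.</p>'
    "<p>Second paragraph.</p></div>"
    '<div class="page"><p>Second page.</p></div>'
    "</body></html>"
)

pages = split_pages(sample)
print(len(pages))
```

Each inner list can then be cleaned up paragraph by paragraph before the page text is joined back together.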

In [6]:
??pdf_parser.process_page

Signature: pdf_parser.process_page(page: bs4.element.Tag) -> str
Docstring: <no docstring>
Source:
    @staticmethod
    def process_page(page: bs4.element.Tag) -> str:
        paragraphs = []

        for p in page.find_all('p'):
            paragraph = PDFDoc2Txt.consolidate_paragraph(p.text)

# Paragraph consolidation algorithm

The `consolidate_paragraph` method contains the heuristics for identifying fragmented paragraphs and sentences extracted from the PDF file.

It is a static method, so it can be applied to any text that contains sentence-level fragmentation caused by OCR or other X-to-text conversion.
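
To make the goal concrete, here is a deliberately simple sketch of the most basic consolidation step: joining the lines of one extracted paragraph into a single logical paragraph. It does not reproduce the heuristics of `consolidate_paragraph` (whose source is only partly visible in this notebook); `join_fragments` is a hypothetical helper and ignores parameters such as `min_fragment_len`.

```python
import re


def join_fragments(text_paragraph: str) -> str:
    """Join the non-empty lines of a paragraph with single spaces."""
    lines = [line.strip() for line in text_paragraph.splitlines()]
    lines = [line for line in lines if line]
    # Collapse any remaining runs of whitespace inside the joined text.
    return re.sub(r"\s+", " ", " ".join(lines))


fragmented = (
    "The Malawi Economic Monitor (MEM) provides an\n"
    "analysis of economic and structural development\n"
    "\n"
    "issues in Malawi."
)

print(join_fragments(fragmented))
```

A real implementation needs more than this, for example to decide when a short line is a heading rather than a fragment of the surrounding sentence.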

In [7]:
??pdf_parser.consolidate_paragraph

Signature:
pdf_parser.consolidate_paragraph(
    text_paragraph: str,
    min_fragment_len: int = 3,
) -> str
Source:
    @staticmethod
    def consolidate_paragraph(text_paragraph: str, min_fragment_len: int=3) -> str:
        """Consolidate a `text_paragraph` with possible multiple newlines into one logical paragraph.

        Tika provides access to extracted text by paragraph. These paragraphs, however, may contain
        multiple newlines that br

# Example

### Run tika docker image first.

https://hub.docker.com/r/apache/tika

```
sudo docker pull apache/tika
sudo docker run -d -p 9998:9998 apache/tika
```
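
Before parsing, it can help to confirm that the container is answering. The sketch below assumes the server is published on `localhost:9998` as in the command above and that it exposes the `/version` endpoint of the Tika server REST API; `tika_is_up` is a hypothetical helper. The endpoint is read from the `TIKA_SERVER_ENDPOINT` environment variable, which the `tika` Python client also uses.

```python
import os
import urllib.error
import urllib.request

# Fall back to the port published by the docker command above.
ENDPOINT = os.environ.get("TIKA_SERVER_ENDPOINT", "http://localhost:9998")


def tika_is_up(endpoint: str = ENDPOINT, timeout: float = 5.0) -> bool:
    """Return True if GET <endpoint>/version answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{endpoint}/version", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Tika server reachable:", tika_is_up())
```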

The WB Docs repository contains pdf and txt versions of documents. However, some text versions are not formatted properly.

In [20]:
import requests

# Plain-text version of the document as published in the WB Docs repository.
url = 'http://documents1.worldbank.org/curated/en/735931527600661308/text/126663-WP-PUBLIC-P164538-Malawi-Economic-Monitor-7-Realizing-Safety-Nets-Potential.txt'
txt_original = requests.get(url).content.decode('utf-8')

# PDF version of the same document, parsed by PDFDoc2Txt.
pdf_url = 'http://documents1.worldbank.org/curated/en/735931527600661308/pdf/126663-WP-PUBLIC-P164538-Malawi-Economic-Monitor-7-Realizing-Safety-Nets-Potential.pdf'
txt_parsed = pdf_parser.parse(source=pdf_url, source_type='url')

# Result from direct txt version

In [36]:
print(txt_original[12850:20000])

MALAWI ECONOMIC MONITOR MAY 2018
OVERVIEW
                                                          challenges related to erratic energy and water
The Malawi Economic Monitor (MEM) provides an
                                                          supply, which had a particularly negative impact
analysis of economic and structural development
                                                          on      manufacturing.    Within  services,    the
issues in Malawi. This edition was published in May
                                                          performance of the wholesale and retail trade and
2018. It follows on from the six previous editions of
                                                          distribution sub-sectors declined as a result of
the MEM and is part of an ongoing series, with
                                                          subdued domestic demand.
future editions to follow twice per year.
                                                 

# Result from PDFDoc2Txt

In [39]:
print(txt[4])

1 « MALAWI ECONOMIC MONITOR MAY 2018

OVERVIEW The Malawi Economic Monitor (MEM) provides an analysis of economic and structural development issues in Malawi. This edition was published in May 2018. It follows on from the six previous editions of the MEM and is part of an ongoing series, with future editions to follow twice per year.

The aim of the publication is to foster better- informed policy analysis and debate regarding the key challenges that Malawi faces in its endeavor to achieve high rates of stable, inclusive and sustainable economic growth.

The MEM consists of two parts: Part 1 presents a review of recent economic developments and a macroeconomic outlook. Part 2 focuses on a special selected topic relevant to Malawi's development prospects.

In this edition, the special topic focuses on Social Safety Nets. This is a defining moment for Malawi to transform its safety net. The recently approved second Malawi National Social Support Program (MNSSP II) works towards the creat

## Observations

The text version from the WB Docs repository has a fragmented structure: the columns of a PDF page are placed literally side by side, so lines from different columns interleave. This is a challenge because recovering sentences is not as straightforward as replacing line breaks with spaces.

In contrast, `PDFDoc2Txt` recovers the logical sentences in the text. It concatenates fragmented sentences correctly and identifies which lines belong to the same column flow.
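
The column problem can be reproduced on a few lines taken from the txt output above. This sketch compares a naive join with a split based on leading indentation. The indentation rule only fits this particular layout, `naive_join` and `split_columns` are hypothetical helpers, and none of this is how `PDFDoc2Txt` works, since it reads structure from Tika's XML instead.

```python
import re
from typing import Tuple

PAD = " " * 58  # the right-hand column is indented in the txt version

sample = "\n".join([
    "The Malawi Economic Monitor (MEM) provides an",
    PAD + "supply, which had a particularly negative impact",
    "analysis of economic and structural development",
    PAD + "on      manufacturing.    Within  services,    the",
    "issues in Malawi. This edition was published in May",
])


def naive_join(text: str) -> str:
    """Replace every run of whitespace, including line breaks, with one space."""
    return re.sub(r"\s+", " ", text).strip()


def split_columns(text: str, indent: int = 20) -> Tuple[str, str]:
    """Assign each line to the left or right column by its leading spaces."""
    left, right = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        leading = len(line) - len(line.lstrip(" "))
        target = right if leading >= indent else left
        target.append(" ".join(line.split()))
    return " ".join(left), " ".join(right)


left, right = split_columns(sample)
print(naive_join(sample))
print(left)
print(right)
```

The naive join interleaves the two columns ("provides an supply, which had ..."), while the indentation split keeps each column's sentences together.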