In [1]:
import os
import re
import sys
import glob
import math
import logging
from pathlib import Path
from pprint import pprint

import numpy as np
import scipy as sp
import sklearn

import spacy
import tika
from tika import parser

%load_ext autoreload
%autoreload 2

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
sns.set_context("poster")
sns.set(rc={'figure.figsize': (16, 9.)})
sns.set_style("whitegrid")

import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

logging.basicConfig(level=logging.INFO, stream=sys.stdout)

In [2]:
from wb_nlp.processing import document

In [3]:
## Hints

# nlp = spacy.load('en_core_web_sm')

This notebook contains examples of how the `PDFDoc2Txt` class can be used to convert PDF documents into formatted text. Additional methods implemented in this class can also be applied to raw text extracted from PDFs.

We start by creating an instance of `PDFDoc2Txt`, named `pdf_parser`.

In [4]:
pdf_parser = document.PDFDoc2Txt()

# Parsing a pdf file

Parsing a PDF file starts with the `parse` method. This method accepts either a bytes buffer or a string containing a URL or file path. The source type must be specified through `source_type` so that the parser handles the input correctly.

Below is the implementation of the `parse` method. Tika is the main driver of the parser. We use the `xmlContent` flag to request XML-formatted output. The XML output preserves document structure that we can leverage to generate an informed reconstruction of the document.

In [5]:
??pdf_parser.parse

Signature: pdf_parser.parse(source: Union[bytes, str], source_type: str = 'buffer') -> str
Source:   
    def parse(self, source: Union[bytes, str], source_type: str='buffer') -> str:
        """Parse a PDF document to text from different source types.

        Args:
            source:
                Source of the PDF that needs to be converted.
                The source could be a url, a path, or a buffer/file-like object

## Processing a single page

The XML returned by Tika captures page information in `div` tags. We use this to process documents page by page.

The `process_page` method takes the tag element corresponding to an extracted page. It then applies page-level processing, such as consolidating the paragraphs in the page and fixing footnote citations. We also concatenate paragraphs that are likely fragmented.
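To make the page structure concrete, here is a minimal sketch that groups paragraph text by page using only the standard library. It is not the code used by `PDFDoc2Txt`; the `PageParagraphCollector` class and the `sample_xml` string are hypothetical stand-ins, and the sample is hand-written rather than real Tika output. It only assumes what this notebook relies on: pages are `div` tags with class `page`, and paragraphs are `p` tags.

```python
from html.parser import HTMLParser


class PageParagraphCollector(HTMLParser):
    """Collect the text of each <p> tag, grouped by the most recent <div class="page">."""

    def __init__(self):
        super().__init__()
        self.pages = []  # one list of paragraph strings per page
        self._in_paragraph = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "page":
            # A new page starts: open a fresh paragraph list.
            self.pages.append([])
        elif tag == "p" and self.pages:
            self._in_paragraph = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_paragraph:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.pages[-1].append("".join(self._chunks))
            self._in_paragraph = False


# Hand-written stand-in for Tika's XML output (assumption, not real output).
sample_xml = (
    "<html><body>"
    '<div class="page"><p>First page,\nfirst paragraph.</p><p>Second paragraph.</p></div>'
    '<div class="page"><p>Second page.</p></div>'
    "</body></html>"
)

collector = PageParagraphCollector()
collector.feed(sample_xml)
collector.close()

for page_number, paragraphs in enumerate(collector.pages):
    print(page_number, paragraphs)
```

Note that the first paragraph still contains a raw newline. Removing such line breaks safely is the job of the paragraph consolidation step.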

In [6]:
??pdf_parser.process_page

Signature: pdf_parser.process_page(page: bs4.element.Tag) -> str
Docstring: <no docstring>
Source:   
    @staticmethod
    def process_page(page: bs4.element.Tag) -> str:
        paragraphs = []

        for p in page.find_all('p'):
            paragraph = PDFDoc2Txt.consolidate_paragraph(p.text)

# Paragraph consolidation algorithm

The `consolidate_paragraph` method contains the heuristics for identifying fragmented paragraphs and sentences extracted from the PDF file.

Because it is a static method, we can also apply it to arbitrary text that may contain sentence-level fragmentation due to OCR or other X-to-text conversion.
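As a rough illustration of what such heuristics can look like, the sketch below joins the lines of one extracted paragraph. It is not the logic of `consolidate_paragraph`; the function name `merge_fragments` and both rules (skip blank lines, rejoin a word hyphenated across a line break) are assumptions chosen for the example.

```python
import re


def merge_fragments(text_paragraph: str) -> str:
    """Join the lines of one extracted paragraph into a single logical paragraph.

    Illustrative heuristics only; not the rules used by PDFDoc2Txt.
    """
    lines = [line.strip() for line in text_paragraph.split("\n")]
    merged = ""

    for line in lines:
        if not line:
            # Blank lines inside a paragraph are treated as extraction noise.
            continue
        if merged.endswith("-") and line[0].islower():
            # Word hyphenated across a line break: drop the hyphen and join.
            merged = merged[:-1] + line
        elif merged:
            merged = merged + " " + line
        else:
            merged = line

    # Collapse any repeated whitespace left over from the source.
    return re.sub(r"\s+", " ", merged).strip()


fragmented = "The project pro-\nvided loans to women\n\nentrepreneurs in six cities."
print(merge_fragments(fragmented))
```

The hyphen rule is deliberately naive: it would also turn a genuine compound such as "well-" followed by "known" into "wellknown", which is why real heuristics need more context than this sketch uses.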

In [7]:
??pdf_parser.consolidate_paragraph

Signature:
pdf_parser.consolidate_paragraph(
    text_paragraph: str,
    min_fragment_len: int = 3,
) -> str
Source:   
    @staticmethod
    def consolidate_paragraph(text_paragraph: str, min_fragment_len: int=3) -> str:
        """Consolidate a `text_paragraph` with possible multiple newlines into one logical paragraph.

        Tika provides access to extracted text by paragraph. These paragraphs, however, may contain
        multiple newlines that br

# Example

### Run tika docker image first.

https://hub.docker.com/r/apache/tika

```
sudo docker pull apache/tika
sudo docker run -d -p 9998:9998 apache/tika
```

The WB Docs repository contains PDF and text versions of documents. However, some of the text versions are not formatted properly.

In [129]:
import requests

url = 'http://documents1.worldbank.org/curated/en/735931527600661308/text/126663-WP-PUBLIC-P164538-Malawi-Economic-Monitor-7-Realizing-Safety-Nets-Potential.txt'
txt_original = requests.get(url).content.decode('utf-8')

# pdf_url = 'http://documents1.worldbank.org/curated/en/735931527600661308/pdf/126663-WP-PUBLIC-P164538-Malawi-Economic-Monitor-7-Realizing-Safety-Nets-Potential.pdf'

pdf_url = 'https://openknowledge.worldbank.org/bitstream/handle/10986/34013/Designing-a-Credit-Facility-for-Women-Entrepreneurs-Lessons-from-the-Ethiopia-Women-Entrepreneurship-Development-Project.pdf?sequence=4&isAllowed=y'
txt_parsed = pdf_parser.parse(source=pdf_url, source_type='url')

In [130]:
buffer = requests.get(pdf_url).content
xml = document.parser.from_buffer(buffer, xmlContent=True)['content']

In [131]:
xmlB = document.BeautifulSoup(xml, features='html.parser')

In [286]:
nlp = spacy.load('en_core_web_sm')

Cleaning process:

- If the source is a PDF, parse it with Tika into XML using `xmlContent=True`.
- Process the text page by page. If a page contains very few or no sentences, we could drop it, since it is likely a page full of tables or other non-prose details.
- Expand acronyms.
- Check length of text.
- Lemmatize.
- Remove noise.
- Detect language.
- Spell check.
- Respeller.
- Plural-singular map?
- Create recognizers of entities, e.g., countries, names, places, etc.
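The page-dropping step in the list above could be sketched as follows. This is only an illustration: the function names, the regular expression used to approximate sentence boundaries, and both thresholds (`min_sentences`, `min_alpha_ratio`) are assumptions, and a real pipeline would more likely count sentences with spaCy as done later in this notebook.

```python
import re

# Approximate a sentence boundary as ., ! or ? followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?](?:\s|$)")


def count_sentences(text: str) -> int:
    """Return a rough sentence count based on terminal punctuation."""
    return len(SENTENCE_END.findall(text))


def keep_page(text: str, min_sentences: int = 3, min_alpha_ratio: float = 0.6) -> bool:
    """Return True if the page looks like running prose rather than a table."""
    text = text.strip()
    if not text:
        return False
    non_space = [c for c in text if not c.isspace()]
    # Share of letters among non-whitespace characters; tables are digit-heavy.
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    return count_sentences(text) >= min_sentences and alpha_ratio >= min_alpha_ratio


prose_page = (
    "WEDP loans began disbursing in January 2014. "
    "Repayment rates were healthy. Demand for credit was high."
)
table_page = "2014 12.5 456.6\n2015 13.1 470.2\n2016 14.0 481.9"

print(keep_page(prose_page), keep_page(table_page))
```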





In [302]:
# https://spacy.io/api/annotation
POS_TAGS = ['POS', 'ADJ', 'ADP', 'ADV', 'AUX', 'CONJ', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X', 'SPACE']

LDA_POS_TAGS = [
    'ADJ', 'NOUN', 'PROPN', 'VERB'
]

EMBEDDING_POS_TAGS = [
    'ADJ', 'NOUN'
]


INVALID_ENT_TYPE = [
    'DATE', 'MONEY', 'CARDINAL', 'PERCENT',
]


def clean_text(text):
    # text = xmlB.find_all('div', attrs={'class': 'page'})[page_num].text
    text = (
        text
        .replace('\n', ' ')
        .replace('’', "'")
        .replace('“', '"')
        .replace('”', '"')
    )

    text = re.sub(r'\s+', ' ', text).strip()

    doc = nlp(text)

    for t in doc:
        if t.ent_type_:
            print((t.text, t.lemma_, t.pos_, t.ent_type_, t.ent_iob_))

In [303]:
clean_text(text)

('Le', 'Le', 'PROPN', 'FAC', 'B')
('s', 's', 'PROPN', 'FAC', 'I')
('s', 's', 'NOUN', 'FAC', 'I')
('WEDP', 'WEDP', 'PROPN', 'FAC', 'B')
('January', 'January', 'PROPN', 'DATE', 'B')
('2014', '2014', 'NUM', 'DATE', 'I')
('12', '12', 'NUM', 'CARDINAL', 'B')
('6', '6', 'NUM', 'CARDINAL', 'B')
('the', 'the', 'DET', 'DATE', 'B')
('end', 'end', 'NOUN', 'DATE', 'I')
('of', 'of', 'ADP', 'DATE', 'I')
('the', 'the', 'DET', 'DATE', 'I')
('calendar', 'calendar', 'NOUN', 'DATE', 'I')
('year', 'year', 'NOUN', 'DATE', 'I')
('ETB', 'ETB', 'PROPN', 'MONEY', 'B')
('456.6', '456.6', 'NUM', 'MONEY', 'I')
('million', 'million', 'NUM', 'MONEY', 'I')
('USD', 'USD', 'PROPN', 'MONEY', 'B')
('23.3', '23.3', 'NUM', 'MONEY', 'I')
('million', 'million', 'NUM', 'MONEY', 'I')
('1,863', '1,863', 'NUM', 'CARDINAL', 'B')
('over', 'over', 'ADP', 'CARDINAL', 'B')
('half', 'half', 'DET', 'CARDINAL', 'I')
('WEDP', 'WEDP', 'PROPN', 'ORG', 'B')
('Ethiopia', 'Ethiopia', 'PROPN', 'GPE', 'B')
('66', '66', 'NUM', 'PERCENT', 'B')
(

In [304]:
spacy.__version__

'2.3.2'
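The `clean_text` function above only prints tokens that carry an entity type; the `LDA_POS_TAGS` and `INVALID_ENT_TYPE` lists are defined but not yet applied. One possible way to use them is sketched below. The `select_lemmas` helper is hypothetical, and it works on plain tuples in the same `(text, lemma, pos, ent_type, ent_iob)` layout printed above, so it can be tried without loading a spaCy model.

```python
LDA_POS_TAGS = {"ADJ", "NOUN", "PROPN", "VERB"}
INVALID_ENT_TYPE = {"DATE", "MONEY", "CARDINAL", "PERCENT"}


def select_lemmas(tokens, pos_tags=LDA_POS_TAGS, invalid_ent_types=INVALID_ENT_TYPE):
    """Keep lemmas whose POS tag is allowed and whose entity type is not excluded."""
    return [
        lemma
        for _text, lemma, pos, ent_type, _iob in tokens
        if pos in pos_tags and ent_type not in invalid_ent_types
    ]


# Tuples in the (text, lemma, pos, ent_type, ent_iob) layout used in this notebook.
tokens = [
    ("WEDP", "WEDP", "PROPN", "ORG", "B"),
    ("January", "January", "PROPN", "DATE", "B"),
    ("2014", "2014", "NUM", "DATE", "I"),
    ("repaid", "repay", "VERB", "", "O"),
    ("the", "the", "DET", "", "O"),
    ("authority", "authority", "NOUN", "", "O"),
]

print(select_lemmas(tokens))
```

With a spaCy `doc`, the same tuples can be built with `[(t.text, t.lemma_, t.pos_, t.ent_type_, t.ent_iob_) for t in doc]`.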

In [299]:
page_num = 3
page = xmlB.find_all('div', attrs={'class': 'page'})[page_num]
text = (
    page.text
    .replace('\n', ' ')
    .replace('’', "'")
    .replace('“', '"')
    .replace('”', '"')
)
text = re.sub(r'\s+', ' ', text).strip()
doc1 = nlp(text)

docs1 = [s for s in doc1.sents]

# for s in docs1:
#     print(s)
#     print('------')

doc2 = nlp(re.sub(r'\s+', ' ', txt_parsed[page_num].replace('\n', ' ')))

docs2 = [s for s in doc2.sents]

# for s in docs2:
#     print(s)
#     print('------')

len(docs1) == len(docs2)

np.where(np.array([d1.text.strip() == d2.text.strip() for d1, d2 in zip(docs1, docs2)]) != True)

(array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36,
        37, 38, 39, 40, 41, 42, 43, 44, 45]),)

In [288]:
ix = 29
docs1[ix - 1], docs2[ix]

(At the end of 2017, after securing external financing from Italy (USD 15.8 million) and Japan (USD 50 million), WEDP entered its second phase, expanding into 4 additional cities,4 adopting new technologies and innovations, and consolidating its interventions.,
 _3 Shortly thereafter, a WEDP revolving fund was approved to replenish the line of credit from repaid principals.)

In [289]:
for e, s in enumerate(doc2.sents):
    # print(s)
    # print('------')
    if e == ix:
        break

In [290]:
t = s[0]
s

_3 Shortly thereafter, a WEDP revolving fund was approved to replenish the line of credit from repaid principals.

In [292]:
%%time

import pkg_resources
from symspellpy.symspellpy import SymSpell

# Set max_dictionary_edit_distance to 0 to disable spelling correction
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt")
# term_index is the column of the term and count_index is the
# column of the term frequency
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# a sentence without any spaces
# input_term = "thequickbrownfoxjumpsoverthelazydog"
input_term = re.sub(r'\s+', '', text)
result = sym_spell.word_segmentation(input_term)
# print("{}, {}, {}".format(result.corrected_string, result.distance_sum,
#                           result.log_prob_sum))

CPU times: user 1.37 s, sys: 53.8 ms, total: 1.43 s
Wall time: 1.57 s


In [296]:
print(result.corrected_string)

De signing aCr edit Facility for Wo men Ent rep rene ursLes sons from the EthiopiaWo men Ent rep rene ur ship D eve lop men tProject(WEDP)4WEDP loans began disbursing in J anu ary2014 through 12 partner micro finance institutions (MFIs) across 6 targeted cities in Ethi-opia.1By the end of the calendar year , the project had issued ETB456.6 million (USD23.3 million )to1,863 women entrepreneurs – over half the dedicated line of credit .2WEDP clients represented abroad spectrum of Ethiopia's business community , ranging from retail stores to restaurants to beauty salons .A mong those who borrowed ,66 percent were first - time borrowers , and yet repayment rates were healthy , standing at 99.1 percent .In tandem ,3,083 women had participated in entrepreneurship trainings .By mid -2015, the high demand for credit had led to a rapid depletion of WEDP funds , prompting the project leadership to explore additional sources of financing .In light of the declining balance , partner MFIs began dis

In [280]:
s = nlp('repaid principal is beng given to the authority')
print(s._.performed_spellCheck) 
[(t.text, t.lemma_, t.pos_, t.ent_type_, t.ent_iob_) for t in s]

True


[('repaid', 'repay', 'VERB', '', 'O'),
 ('principal', 'principal', 'NOUN', '', 'O'),
 ('is', 'be', 'AUX', '', 'O'),
 ('beng', 'beng', 'NOUN', 'PERSON', 'B'),
 ('given', 'give', 'VERB', '', 'O'),
 ('to', 'to', 'ADP', '', 'O'),
 ('the', 'the', 'DET', '', 'O'),
 ('authority', 'authority', 'NOUN', '', 'O')]

# Result from direct txt version

In [36]:
print(txt_original[12850:20000])

MALAWI ECONOMIC MONITOR MAY 2018
OVERVIEW
                                                          challenges related to erratic energy and water
The Malawi Economic Monitor (MEM) provides an
                                                          supply, which had a particularly negative impact
analysis of economic and structural development
                                                          on      manufacturing.    Within  services,    the
issues in Malawi. This edition was published in May
                                                          performance of the wholesale and retail trade and
2018. It follows on from the six previous editions of
                                                          distribution sub-sectors declined as a result of
the MEM and is part of an ongoing series, with
                                                          subdued domestic demand.
future editions to follow twice per year.
                                                 

# Result from PDFDoc2Txt

In [39]:
print(txt[4])

1 « MALAWI ECONOMIC MONITOR MAY 2018

OVERVIEW The Malawi Economic Monitor (MEM) provides an analysis of economic and structural development issues in Malawi. This edition was published in May 2018. It follows on from the six previous editions of the MEM and is part of an ongoing series, with future editions to follow twice per year.

The aim of the publication is to foster better- informed policy analysis and debate regarding the key challenges that Malawi faces in its endeavor to achieve high rates of stable, inclusive and sustainable economic growth.

The MEM consists of two parts: Part 1 presents a review of recent economic developments and a macroeconomic outlook. Part 2 focuses on a special selected topic relevant to Malawi's development prospects.

In this edition, the special topic focuses on Social Safety Nets. This is a defining moment for Malawi to transform its safety net. The recently approved second Malawi National Social Support Program (MNSSP II) works towards the creat

## Observations

The text version from the WB Docs repository has a fragmented structure. Column flows in a PDF page are literally placed side by side. This is a challenge because recovering sentences is not as straightforward as simply replacing line breaks with spaces.

On the other hand, `PDFDoc2Txt` has managed to recover the logical sentences in the text. It properly concatenated fragmented sentences and also identified which fragments belong to a single column flow.
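The problem with the side-by-side layout can be reproduced on a few lines taken from the text version shown earlier. The sketch below is only a demonstration: `split_columns` and its `indent_threshold` are assumptions that happen to fit this particular layout, and this is not how `PDFDoc2Txt` recovers columns, since it works from Tika's XML rather than from indentation.

```python
left_column = [
    "The Malawi Economic Monitor (MEM) provides an",
    "analysis of economic and structural development",
    "issues in Malawi.",
]
right_column = [
    "challenges related to erratic energy and water",
    "supply, which had a particularly negative impact",
    "on manufacturing.",
]

# Rebuild the layout of the text export: right-column lines are indented
# and alternate with left-column lines.
interleaved = []
for left, right in zip(left_column, right_column):
    interleaved.append(" " * 58 + right)
    interleaved.append(left)

# Naive approach: replace line breaks with spaces. The two columns get mixed.
naive = " ".join(line.strip() for line in interleaved)
print(naive)


def split_columns(lines, indent_threshold=20):
    """Assign each line to the left or right column based on its indentation."""
    left, right = [], []
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        (right if indent >= indent_threshold else left).append(line.strip())
    return " ".join(left), " ".join(right)


left_text, right_text = split_columns(interleaved)
print(left_text)
print(right_text)
```

The naive join produces text in which every sentence is interrupted by fragments of the other column, while separating the columns first yields readable sentences.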