# Text cleaning

The aim is to try different scripts and libraries for cleaning text in various formats.

**Don't forget to install the modules in requirements.txt**

## PDF

Code_40.pdf is the French environmental code. It is a good example of what bills, articles, amendments, or treaties look like as PDFs: unlike a letter or a manifesto, they are long documents with many different headings.

* PDF format
* Just text but very structured, no tables or weird formatting
* Will try to extract the text and categories based on the titles and headings
* Would be interesting to merge the pdfs?

### With PyPDF2

This works fine but only extracts the text, not the structure. Ideally we also want the headings etc.

In [None]:
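import PyPDF2
#text = textract.process("code/Code_40.pdf")

# Current PyPDF2 names (required from 3.0 on); older releases spell these
# PdfFileReader, numPages, getPage and extractText.
with open('code/Code_40.pdf', 'rb') as pdf_file:
    # read object
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # print number of pages
    print(len(pdf_reader.pages))

    # print the text of the second page (pages are zero-indexed)
    page = pdf_reader.pages[1]
    print(page.extract_text())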


**Structure of the document:**

Just from the raw text we can see the structure: 
* Each BOOK starts like this: BOOK I\nCommon provisions\n\nArticles L121-1 to\nL110-2\n\n
* ENVIRONMENTAL CODE on top left of each page
* \n (line breaks) indicate paragraphs
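
The observations above suggest a simple post-processing step on the raw text. The following is a minimal sketch, not a tested pipeline: it assumes every BOOK heading sits alone on its own line as `BOOK` plus a Roman numeral, and that the running header is exactly `ENVIRONMENTAL CODE`. The function name `split_books` and the sample string (including the Book II title) are made up for illustration.

```python
import re

# Assumed running header, as observed at the top of each page.
PAGE_HEADER = "ENVIRONMENTAL CODE"
# Assumption: a BOOK heading is "BOOK" plus a Roman numeral, alone on a line.
BOOK_HEADING = re.compile(r"^BOOK ([IVXLC]+)[ \t]*$", re.MULTILINE)


def split_books(text):
    """Return {book_numeral: book_text} for raw text extracted from the PDF."""
    # Drop lines that contain only the running page header.
    lines = [line for line in text.splitlines() if line.strip() != PAGE_HEADER]
    cleaned = "\n".join(lines)

    matches = list(BOOK_HEADING.finditer(cleaned))
    books = {}
    for i, match in enumerate(matches):
        # A book runs until the next BOOK heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(cleaned)
        books[match.group(1)] = cleaned[match.end():end].strip()
    return books


# Hypothetical sample imitating the extracted text.
sample = (
    "ENVIRONMENTAL CODE\n"
    "BOOK I\nCommon provisions\n\nArticles L121-1 to\nL110-2\n\n"
    "Article L110-1\n"
    "ENVIRONMENTAL CODE\n"
    "BOOK II\nPhysical environments\n"
)
books = split_books(sample)
print(list(books))  # ['I', 'II']
```

Paragraphs inside each book could then be recovered by splitting on blank lines, but real pages will probably need extra rules (page footers, headings broken across lines).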


### With PDFminer

**Useful scripts:**
* convert pdf: https://gist.github.com/terencezl/61fe3f28c44a763dd1e9f060b8ff6f2e
* get tags: https://gist.github.com/joelhsmith/5e6ec7ee70ab4b89d7bc5700e9e07fde
* converting to html: https://stackoverflow.com/questions/3637781/converting-a-pdf-to-text-html-in-python-so-i-can-parse-it

**Problem**: the original pdfminer package is no longer maintained, and all of these scripts use deprecated functions or modules. Tried pdfminer.six and pdfminer.3k, but ran into the same errors.

In [20]:
# INTERNAL ERRORS using pdfminer - doesn't look well maintained
# a lot of version problems between scripts and package installed via pip

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter, XMLConverter, HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def convert_pdf(path, format='text', password=''):
    """Convert the PDF at `path` to a string of text, HTML or XML."""
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    laparams = LAParams()
    if format == 'text':
        device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    elif format == 'html':
        device = HTMLConverter(rsrcmgr, retstr, laparams=laparams)
    elif format == 'xml':
        device = XMLConverter(rsrcmgr, retstr, laparams=laparams)
    else:
        raise ValueError('provide format, either text, html or xml!')

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages

    # the with-block closes the file even if a page fails to parse
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)

    # close the device before reading the buffer, since some converters
    # write trailing output on close
    device.close()
    text = retstr.getvalue().decode()
    retstr.close()

    return text

ModuleNotFoundError: No module named 'pdfminer.converter'

In [None]:
# convert to text, print first 100 chars
convert_pdf('Code_40.pdf')[0:100]

### With Tika

This works really well at keeping the structure, although we still need to work out how to extract the different parts (e.g. separate BOOK I from BOOK II).

It is very slow.

In [28]:
from tika import parser

raw = parser.from_file('Code_40.pdf')
raw.keys()
print(raw['content'][50:500])


ENVIRONMENTAL CODE

With the cooperation of Michael Faure
Professor of Comparative and International Environmental Law and Academic Director of METRO, the Institute for
Transnational Legal Research of the Universiteit Maastricht.

BOOK I
Common provisions Articles L121-1 to

L110-2
Article L110-1
(Act no. 2002-276 of 27 February 2002 Article 132 Official Journal of 28 February 2002)
       I. - Natural areas, resources and habitats, sites and la
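
Since parsing is slow, one way to avoid repeating it while experimenting is to cache the parsed result on disk. This is only a sketch under assumptions: `cached_parse` and the `.tika_cache` directory are hypothetical names, and it relies on the parser returning a JSON-serialisable dict (as the `raw` dict above appears to be).

```python
import hashlib
import json
from pathlib import Path


def cached_parse(pdf_path, parse_fn, cache_dir=".tika_cache"):
    """Return parse_fn(pdf_path), reusing a JSON copy on disk when one exists.

    The cache key is a hash of the file contents, so a modified PDF is
    parsed again. `parse_fn` is any callable that takes a path string and
    returns a JSON-serialisable dict.
    """
    pdf_path = Path(pdf_path)
    digest = hashlib.sha256(pdf_path.read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    parsed = parse_fn(str(pdf_path))
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(parsed), encoding="utf-8")
    return parsed
```

With the cell above, this would be called as `raw = cached_parse('Code_40.pdf', parser.from_file)`; only the first call pays the parsing cost.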


In [41]:
# setting xmlContent=True adds the html markup which can be useful to detect titles, paragraphs etc.
# can then separate the parts using custom script e.g. https://cbrownley.wordpress.com/2016/06/26/parsing-pdfs-in-python-with-tika/

raw_xml = parser.from_file('Code_40.pdf', xmlContent=True)
print(raw_xml['content'][6000:7000])


  In addition, the National Public Debate Commission ensures the upkeep of good conditions for informing the public
</p>
<p>Updated 04/10/2006 - Page 1/201</p>
<p />
</div>
<div class="page"><p />
<p>ENVIRONMENTAL CODE
throughout the implementation phase of the projects referred to it, up to the receipt of equipment and works.
       This Commission advises the competent authorities and any developer, at their request, on any question relating to
dialogue with the public throughout the development of the project.
       The National Public Debate Commission is also entrusted with the role of issuing all and any opinions and
recommendations of a general or methodological nature likely to encourage and develop dialogue with the public.
       The National Public Debate Commission and individual commissions do not comment on the substance of the
projects submitted to them.
</p>
<p>Article L121-2
(Act no. 2002-276 of 27 February 2002 Article 134 Official Journal of 28 February 2002)
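
The markup shown in this output can be walked with the standard library to recover pages and paragraphs. The sketch below is illustrative only: `ParagraphCollector` is a hypothetical helper, it assumes pages are wrapped in `<div class="page">` and paragraphs in `<p>` as in the output above, and the article regex assumes headings of the form `Article L121-2`. The sample string is a shortened, made-up excerpt.

```python
import re
from html.parser import HTMLParser

# Assumption: article headings look like "Article L121-2".
ARTICLE_HEADING = re.compile(r"^Article\s+(L\d+-\d+)")


class ParagraphCollector(HTMLParser):
    """Collect (page_number, paragraph_text) pairs from the XHTML content."""

    def __init__(self):
        super().__init__()
        self.page = 0
        self.paragraphs = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "page") in attrs:
            self.page += 1
        elif tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:  # skip empty <p /> elements
                self.paragraphs.append((self.page, text))
            self._buffer = None


# Shortened, hypothetical excerpt in the same shape as the output above.
sample = """<div class="page"><p />
<p>ENVIRONMENTAL CODE
projects submitted to them.
</p>
<p>Article L121-2
(Act no. 2002-276 of 27 February 2002 Article 134 Official Journal of 28 February 2002)
</p>
</div>"""

collector = ParagraphCollector()
collector.feed(sample)

articles = []
for page, text in collector.paragraphs:
    match = ARTICLE_HEADING.match(text)
    if match:
        articles.append((page, match.group(1)))
print(articles)  # [(1, 'L121-2')]
```

On the real document this would be fed `raw_xml['content']`. Note that a paragraph can continue across a page break, as in the output above, so paragraphs may need to be stitched back together.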
      

### PyMuPDF

Good explanation here: https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467

Most advanced library to extract headings / structure so far, but not great with messy PDFs - not as straightforward as it seems. 

In [44]:
import fitz

In [45]:
doc = fitz.open("Code_40.pdf")     

In [60]:
page = doc[3]
# "blocks" works pretty well with a list of articles / bills;
# other options include html, dict, xml, xhtml and plain text
# (older PyMuPDF releases call this method getText)
text = page.get_text("blocks")
text[2]

(31.190000534057617,
 346.58001708984375,
 534.1900634765625,
 387.25,
 '                CHAPTER II\n                Environmental evaluation Articles L122-1 to\nL122-11',
 4,
 0)
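
Each block is a tuple whose fifth element is the text, so headings can be flagged with a simple pattern on that text. This is a rough sketch rather than a robust classifier: `classify_block` is a hypothetical helper, and the keyword list (BOOK, TITLE, CHAPTER, SECTION) is an assumption about how this document names its levels; only BOOK and CHAPTER appear in the outputs above.

```python
import re

# Assumed heading keywords, each followed by a Roman numeral.
HEADING = re.compile(r"^(BOOK|TITLE|CHAPTER|SECTION)\s+[IVXLC]+\b")


def classify_block(block):
    """Return (kind, flattened_text) for one block tuple."""
    x0, y0, x1, y1, text, block_no, block_type = block
    # Collapse the layout whitespace and line breaks inside the block.
    flat = " ".join(text.split())
    match = HEADING.match(flat)
    kind = match.group(1).lower() if match else "body"
    return kind, flat


# The block printed above, with coordinates rounded.
block = (
    31.19, 346.58, 534.19, 387.25,
    "                CHAPTER II\n                Environmental evaluation Articles L122-1 to\nL122-11",
    4,
    0,
)
print(classify_block(block))
# ('chapter', 'CHAPTER II Environmental evaluation Articles L122-1 to L122-11')
```

Messy PDFs will likely break this, for example when a heading and its first paragraph land in the same block, so font size from the "dict" output may be a better signal there.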

## JSON

A GET request to the UK Parliament API, to see what it returns (how clean and how straightforward it is, etc.).

It is very easy to use, the text is clean, and the metadata is easy to store in a pandas DataFrame.

In [None]:
import requests

response = requests.get("http://lda.data.parliament.uk/lordswrittenquestions.json?_view=Written+Questions&_pageSize=500&_page=0")

In [None]:
import pandas as pd

def get_text(response):
    # response.json() parses the body, so the json module is not needed
    items = response.json()['result']['items']
    df = pd.DataFrame({'AnswerDate': [item['AnswerDate']['_value'] for item in items],
                       'AnsweringBody': [item['AnsweringBody'][0]['_value'] for item in items],
                       'questionText': [item['questionText'] for item in items],
                       'tablingMember': [item['tablingMemberPrinted'][0]['_value'] for item in items]})

    return df


In [None]:
get_text(response)
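
The direct indexing in `get_text` raises a `KeyError` or `IndexError` as soon as one item lacks a field. A more tolerant variant could look like the sketch below. The helpers `first_value` and `items_to_rows` are hypothetical, the sample items are made up, and whether the API ever omits these fields is an assumption that has not been checked here.

```python
def first_value(field):
    """Return the '_value' of a field that may be a dict, a list of dicts, or missing."""
    if isinstance(field, list):
        field = field[0] if field else None
    if isinstance(field, dict):
        return field.get("_value")
    return field


def items_to_rows(items):
    """Flatten API items into plain dicts, using None for missing fields."""
    return [
        {
            "AnswerDate": first_value(item.get("AnswerDate")),
            "AnsweringBody": first_value(item.get("AnsweringBody")),
            "questionText": item.get("questionText"),
            "tablingMember": first_value(item.get("tablingMemberPrinted")),
        }
        for item in items
    ]


# Made-up items in the same shape that get_text indexes into.
sample_items = [
    {
        "AnswerDate": {"_value": "2020-01-01"},
        "AnsweringBody": [{"_value": "Example Department"}],
        "questionText": "An example question?",
        "tablingMemberPrinted": [{"_value": "Example Member"}],
    },
    {"questionText": "A question with no answer yet?"},
]
rows = items_to_rows(sample_items)
print(rows[1]["AnswerDate"])  # None
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`.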

## HTML

## XML