# Text cleaning

The aim is to try different scripts and libraries for cleaning text in various formats.

**Don't forget to install the modules in requirements.txt**

## Some challenges 

Challenges observed so far when preparing text for analysis:
* Recovering the structure (title, headings etc.), from PDFs and XML in particular.
* Removing or extracting noise, e.g. page numbers embedded in the text, footers repeated once per page, reference numbers etc.
* Extracting data based on tags is easy, but each document has its own custom structure, so there is no general script for this. One option might be to transform the document into a dictionary and keep the keys that on average hold the most text, or something similar, but this might again run into exceptions.

## PDF

sample_pdf_french_law.pdf is the French environmental code. It is a good example of what bills, articles, amendments or treaties look like as a PDF (as opposed to a letter or manifesto): many different headings, long documents etc. However, such texts can often be found in easier formats than PDF and have often been extracted already.

* PDF format
* Just text but very structured, no tables or weird formatting
* Will try to extract the text and categories based on the titles and headings

### With PyPDF2 and regex

This works fine but only extracts text, not structure. Ideally we want to get headings etc. Regex can be used for this, but it is very manual. There might be patterns across the most commonly used document types; for more custom structures, perhaps the structure could be described by the user?
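
To make the "described by the user" idea concrete, here is a minimal sketch in which the user supplies one regular expression per heading level and each line is labelled with the first pattern it matches. The pattern names, the patterns themselves and the helper `label_lines` are assumptions for illustration, loosely based on the markers seen in the French environmental code (BOOK, Article, I. -, 1°, a)); the last three sample lines are made up.

```python
import re

# Hypothetical user-supplied description of the document structure:
# one regular expression per heading level, checked in this order.
HEADING_PATTERNS = {
    "book": r"^BOOK [IVXLC]+$",
    "article": r"^Article L\d+-\d+$",
    "section": r"^[IVX]+\. -",
    "item": r"^\d+°",
    "subitem": r"^[a-z]\)",
}

def label_lines(text, patterns=HEADING_PATTERNS):
    """Return (label, line) pairs; lines matching no pattern are labelled 'text'."""
    compiled = [(name, re.compile(rx)) for name, rx in patterns.items()]
    labelled = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip empty lines
        label = next((name for name, rx in compiled if rx.search(stripped)), "text")
        labelled.append((label, stripped))
    return labelled

# Sample: the first three lines come from the extracted text, the rest are made up.
sample = "\n".join([
    "BOOK I",
    "Article L110-1",
    "       I. - Natural areas, resources and habitats",
    "1° (made-up item)",
    "a) (made-up sub-item)",
    "(made-up plain sentence)",
])
labelled_sample = label_lines(sample)
```

This only labels lines; it does not yet group the text under its headings, and a pattern that also occurs inside running text would still produce false positives.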

In [2]:
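import PyPDF2
# note: PyPDF2 3.0+ removed the camelCase names used below; the equivalents there are
# PdfReader, len(reader.pages), reader.pages[i] and page.extract_text()
#text = textract.process("sample_pdf_french_law.pdf")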

pdfFileObj = open('sample_pdf_french_law.pdf', 'rb')
# read object

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# print number of pages
print(pdfReader.numPages)

pageObj = pdfReader.getPage(1)
print(pageObj.extractText()[0:100])

201
ENVIRONMENTAL CODEthroughout the implementation phase of the projects referred to it, up to the rece


In [4]:
print(pageObj.extractText()[0:600])

# ENVIRONMENTAL CODE = Top of page
# €€€€€€€ = indent 
# Article XXXX = new article but also sometimes in text, so not enough to extract articles
# Within articles: I. -, 1°, a)
# Updated 04/10/2006 - Page XXX/XXX = end of page

ENVIRONMENTAL CODEthroughout the implementation phase of the projects referred to it, up to the receipt of equipment and works.€€€€€€€This Commission advises the competent authorities and any developer, at their request, on any question relating todialogue with the public throughout the development of the project.€€€€€€€The National Public Debate Commission is also entrusted with the role of issuing all and any opinions andrecommendations of a general or methodological nature likely to encourage and develop dialogue with the public.€€€€€€€The National Public Debate Commission and individual co
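
The markers listed in the comments above (page header, runs of €, page footer) can be stripped with a few regular expressions. This is only a sketch under the assumption that the header is always the literal string "ENVIRONMENTAL CODE" at the start of the page and that the footer always follows the "Updated dd/mm/yyyy - Page x/y" pattern; `clean_page` is a hypothetical helper, and the sample is an abridged, hand-assembled fragment rather than real extractor output.

```python
import re

HEADER = "ENVIRONMENTAL CODE"  # assumed to open every page
FOOTER_RE = re.compile(r"Updated \d{2}/\d{2}/\d{4} - Page \d+/\d+")
INDENT_RE = re.compile(r"€{2,}")  # runs of € mark an indent, i.e. a new paragraph

def clean_page(page_text):
    """Remove the page header and footer and turn indent markers into newlines."""
    text = page_text
    if text.startswith(HEADER):
        text = text[len(HEADER):]
    text = FOOTER_RE.sub("", text)
    text = INDENT_RE.sub("\n", text)
    return text.strip()

# Abridged, hand-assembled sample based on the output above.
sample = (
    "ENVIRONMENTAL CODEthroughout the implementation phase of the projects "
    "referred to it, up to the receipt of equipment and works."
    "€€€€€€€This Commission advises the competent authorities."
    "Updated 04/10/2006 - Page 1/201"
)
cleaned = clean_page(sample)
```

Missing spaces inside the text (e.g. "todialogue") are a separate extraction problem that this sketch does not address.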


### With PDFminer

**Useful scripts:**
* convert pdf: https://gist.github.com/terencezl/61fe3f28c44a763dd1e9f060b8ff6f2e
* get tags: https://gist.github.com/joelhsmith/5e6ec7ee70ab4b89d7bc5700e9e07fde
* converting to html: https://stackoverflow.com/questions/3637781/converting-a-pdf-to-text-html-in-python-so-i-can-parse-it

**Problem**: there are many versions of pdfminer, and all of those scripts use deprecated functions/modules. Tried pdfminer.six and pdfminer3k but same story.

In [7]:
# INTERNAL ERRORS using pdfminer - doesn't look well maintained
# a lot of version problems between scripts and package installed via pip

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter, XMLConverter, HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO
import ply 

def convert_pdf(path, format='text', password=''):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    laparams = LAParams()
    if format == 'text':
        device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    elif format == 'html':
        device = HTMLConverter(rsrcmgr, retstr, laparams=laparams)
    elif format == 'xml':
        device = XMLConverter(rsrcmgr, retstr, laparams=laparams)
    else:
        raise ValueError('provide format, either text, html or xml!')
    
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0
    caching = True
    pagenos=set()
    
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue().decode()
    fp.close()
    device.close()
    retstr.close()
    
    return text

# convert to text, print first 100 chars
convert_pdf('sample_pdf_french_law.pdf')[0:100]

ModuleNotFoundError: No module named 'pdfminer'

### With Tika

This works really well in terms of keeping the structure (still need to figure out how to extract the different parts e.g. separate BOOK I and BOOK II). 

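Very slow, at least on this first run, where the Tika server JAR is downloaded before parsing starts (see the log below).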

In [8]:
from tika import parser

raw = parser.from_file('sample_pdf_french_law.pdf')
raw.keys()
print(raw['content'][50:500])

2020-06-25 11:35:00,083 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /var/folders/8g/jhnx9x0d2vxg4p121bvdz1gr0000gn/T/tika-server.jar.
2020-06-25 11:35:19,650 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /var/folders/8g/jhnx9x0d2vxg4p121bvdz1gr0000gn/T/tika-server.jar.md5.
2020-06-25 11:35:21,280 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...



ENVIRONMENTAL CODE

With the cooperation of Michael Faure
Professor of Comparative and International Environmental Law and Academic Director of METRO, the Institute for
Transnational Legal Research of the Universiteit Maastricht.

BOOK I
Common provisions Articles L121-1 to

L110-2
Article L110-1
(Act no. 2002-276 of 27 February 2002 Article 132 Official Journal of 28 February 2002)
       I. - Natural areas, resources and habitats, sites and la
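
Since Tika keeps each heading on its own line, one way to start separating the parts is to split on lines that consist only of an article number. The sketch below assumes article headings always look like "Article L110-1" alone on a line, so mentions of an article inside a sentence are not treated as headings; `split_articles` is a hypothetical helper and the sample text is abridged and partly made up. The same approach could presumably be applied to the BOOK headings.

```python
import re

# A heading is "Article Lxxx-x" alone on its line (assumption based on the output above).
ARTICLE_RE = re.compile(r"^Article (L\d+-\d+)[ \t]*$", re.MULTILINE)

def split_articles(text):
    """Map article number -> body text, using headings that sit alone on a line."""
    matches = list(ARTICLE_RE.finditer(text))
    articles = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        articles[current.group(1)] = text[current.end():end].strip()
    return articles

# Abridged sample; the in-sentence reference and the second body are made up.
sample = """BOOK I
Article L110-1
(Act no. 2002-276 of 27 February 2002 Article 132 Official Journal of 28 February 2002)
       I. - Natural areas (made-up sentence referring to Article L110-2 of this code).
Article L110-2
(made-up body text)
"""
articles = split_articles(sample)
```

Text before the first article heading (here "BOOK I") is dropped by this sketch and would need separate handling.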


In [9]:
# setting xmlContent=True adds the html markup which can be useful to detect titles, paragraphs etc.
# can then separate the parts using custom script e.g. https://cbrownley.wordpress.com/2016/06/26/parsing-pdfs-in-python-with-tika/

raw_xml = parser.from_file('sample_pdf_french_law.pdf', xmlContent=True)
print(raw_xml['content'][6000:7000])


ublique.
       In addition, the National Public Debate Commission ensures the upkeep of good conditions for informing the public
</p>
<p>Updated 04/10/2006 - Page 1/201</p>
<p />
</div>
<div class="page"><p />
<p>ENVIRONMENTAL CODE
throughout the implementation phase of the projects referred to it, up to the receipt of equipment and works.
       This Commission advises the competent authorities and any developer, at their request, on any question relating to
dialogue with the public throughout the development of the project.
       The National Public Debate Commission is also entrusted with the role of issuing all and any opinions and
recommendations of a general or methodological nature likely to encourage and develop dialogue with the public.
       The National Public Debate Commission and individual commissions do not comment on the substance of the
projects submitted to them.
</p>
<p>Article L121-2
(Act no. 2002-276 of 27 February 2002 Article 134 Official Journal of 28 Februar
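
The markup above suggests a simple way to regroup the text: collect the `<p>` elements inside each `<div class="page">` and drop the footer paragraph. The sketch below uses the standard library's `html.parser` and assumes the markup always follows the pattern shown in the output; `PageParagraphs` is a hypothetical helper and the sample string is abridged from that output.

```python
import re
from html.parser import HTMLParser

FOOTER_RE = re.compile(r"^Updated \d{2}/\d{2}/\d{4} - Page \d+/\d+$")

class PageParagraphs(HTMLParser):
    """Collect the text of <p> elements, grouped by <div class="page">."""

    def __init__(self):
        super().__init__()
        self.pages = []       # one list of paragraphs per page
        self._buffer = None   # text pieces of the <p> being read, None outside <p>

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "page") in attrs:
            self.pages.append([])
        elif tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            self._buffer = None
            # skip empty paragraphs, footers and anything outside a page
            if text and self.pages and not FOOTER_RE.match(text):
                self.pages[-1].append(text)

# Abridged sample in the shape of the Tika output above.
sample = """<div class="page"><p />
<p>informing the public
</p>
<p>Updated 04/10/2006 - Page 1/201</p>
<p />
</div>
<div class="page"><p />
<p>ENVIRONMENTAL CODE
throughout the implementation phase of the projects referred to it.
</p>
<p>Article L121-2
(Act no. 2002-276 of 27 February 2002 Article 134)</p>
</div>"""

page_parser = PageParagraphs()
page_parser.feed(sample)
page_parser.close()
pages = page_parser.pages
```

With the real file, the input would be `raw_xml['content']`. The page header still sits at the start of the first paragraph of each page and would need to be stripped separately.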

Trying it again with a more complex PDF, a **party manifesto** with images and a lot of formatting. The file is named sample_pdf_afd_manifesto.pdf, but the extracted text below shows that it is the FPÖ (Austria) programme for the 2017 National Council election.
* Surprisingly fast
* But the order of elements is messy (e.g. the name of the author of the foreword is located far from the foreword)
* Recognizes paragraphs, but sometimes oddly because of formatting (e.g. if there is a picture)

In [39]:
from tika import parser

raw = parser.from_file('sample_pdf_afd_manifesto.pdf')
raw.keys()
print(raw['content'][50:500])

out 1


Freiheitliches Wahlprogramm 
zur Nationalratswahl 2017



100 FPÖ-Forderungen zur 
Beseitigung der Fairness-Krise
Österreicher verdienen Fairness. Denn Österreich durchleidet spürbar eine massive
Fairnesskrise. Wir haben die höchste Steuerbelastung bei einem aufgeblähten
Staatsapparat, eine Einschränkung aller Freiheitsräume durch Überregulierung (Ge-
werbeordnung, überbordende Gesetzesflut) und eine doppelte Umverteilung: einer-

seits v


### With PyMuPDF

Good explanation here: https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467

Most advanced library to extract headings / structure so far, but not great with messy PDFs - not as straightforward as it seems. 

In [10]:
import fitz

In [11]:
doc = fitz.open("sample_pdf_french_law.pdf")     

In [12]:
page = doc[3]
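# other output options include "text" (plain text), "html", "dict", "xml" and "xhtml";
# "blocks" works pretty well with lists of articles / bills
# (newer PyMuPDF releases name this method page.get_text)
text = page.getText("blocks")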
text[2]

(31.190000534057617,
 244.5900421142578,
 564.0989379882812,
 303.6300048828125,
 'Article L121-14\n(Inserted by Act no. 2002-276 of 27 February 2002 Article 134 Official Journal of 28 February 2002)\n       No irregularity with regard to the provisions of the present Chapter may be invoked when the notice by which the\nNational Public Debate Commission has opted not to organise a public debate or the notice mentioned in Article L.\n121-13 has become final.',
 2,
 0)

Trying with the other sample PDF. Same as before: useful if the formatting is on point, but it requires quite a bit of human effort to check or indicate what the headings might be. OK if the formatting is the same across all pages, a pain otherwise.

In [83]:
doc = fitz.open("sample_pdf_afd_manifesto.pdf")     
page = doc[3]
# other output options include "text" (plain text), "html", "dict", "xml" and "xhtml";
# "blocks" works pretty well with lists of articles / bills
text = page.getText("blocks")
text[0:2]

[(56.692901611328125,
  110.83820343017578,
  412.9449157714844,
  171.78221130371094,
  'Unsere Souveränität \nund Selbstbestimmung schützen',
  0,
  0),
 (56.69260025024414,
  226.3517608642578,
  291.9969787597656,
  452.3187255859375,
  'E\ns entspricht freiheitlicher Geisteshaltung, dem\neinzelnen Menschen die Freiheit als höchstes\nGut einzuräumen und darin gleichzeitig einen un-\nverzichtbaren Wert zu sehen.  Der einzelne Mensch\nist jedoch stets in eine Gemeinschaft gestellt, die\nebenfalls selbständig Träger von Freiheitsrechten\nist – von der Familie bis zum Volk. Wir Freiheitliche\nsind daher bestrebt, eine Gesellschaftsordnung zu\nverwirklichen, die dem Einzelnen einen durch\nGrund- und Freiheitsrechte garantierten, staats-\nfreien Raum gewährleistet. Auf der anderen Seite\nwollen wir unsere Heimat als möglichst autonomen\nund autarken Staat in der internationalen Staaten-\ngemeinschaft etabliert wissen.',
  1,
  0)]
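
To reduce the manual effort of indicating headings, one heuristic (the idea behind the article linked above) is to treat the most common font size as body text and anything larger as a heading candidate. The sketch below works on the nested structure that the "dict" output option is documented to return (blocks, then lines, then spans carrying "size" and "text"); that layout is taken as an assumption here, `find_headings` is a hypothetical helper, and the sample dictionary, including its font sizes, is hand-made rather than real output.

```python
from collections import Counter

def find_headings(page_dict):
    """Return span texts set in a font size larger than the most common (body) size."""
    spans = [
        span
        for block in page_dict.get("blocks", [])
        for line in block.get("lines", [])   # image blocks carry no "lines"
        for span in line.get("spans", [])
        if span["text"].strip()
    ]
    if not spans:
        return []
    # Weight each font size by the number of characters set in it.
    size_counts = Counter()
    for span in spans:
        size_counts[round(span["size"], 1)] += len(span["text"])
    body_size = size_counts.most_common(1)[0][0]
    return [span["text"].strip() for span in spans if round(span["size"], 1) > body_size]

# Hand-made sample: the font sizes are invented for illustration.
sample_page = {
    "blocks": [
        {"lines": [
            {"spans": [{"size": 24.0, "text": "Unsere Souveränität "}]},
            {"spans": [{"size": 24.0, "text": "und Selbstbestimmung schützen"}]},
        ]},
        {"lines": [
            {"spans": [{"size": 10.0, "text": "Der einzelne Mensch ist jedoch stets in eine Gemeinschaft gestellt, "}]},
            {"spans": [{"size": 10.0, "text": "die ebenfalls selbständig Träger von Freiheitsrechten ist."}]},
        ]},
    ]
}
headings = find_headings(sample_page)
```

Decorative elements such as the drop cap "E" in the block output above would presumably also be flagged as headings, so the result still needs a human check.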

## JSON

A GET request to the UK Parliament API to see what it returns (how clean it is, how straightforward it is to use etc.).

Very easy to use, the text is clean, and the metadata is easy to store in a pandas DataFrame.

In [13]:
import requests

response = requests.get("http://lda.data.parliament.uk/lordswrittenquestions.json?_view=Written+Questions&_pageSize=500&_page=0")

In [16]:
import pandas as pd
import json

def get_text(response):
    # each entry in result.items is one written question
    items = json.loads(response.text)['result']['items']
    df = pd.DataFrame({'AnswerDate': [item['AnswerDate']['_value'] for item in items],
                       'AnsweringBody': [item['AnsweringBody'][0]['_value'] for item in items],
                       'questionText': [item['questionText'] for item in items],
                       'tablingMember': [item['tablingMemberPrinted'][0]['_value'] for item in items]})

    return df


In [17]:
get_text(response)

| | AnswerDate | AnsweringBody | questionText | tablingMember |
|---|---|---|---|---|
| 0 | 2020-07-08 | Foreign and Commonwealth Office | To ask Her Majesty's Government what assessmen... | Lord Alton of Liverpool |
| 1 | 2020-07-08 | Foreign and Commonwealth Office | To ask Her Majesty's Government, further to re... | Lord Alton of Liverpool |
| 2 | 2020-07-08 | Home Office | To ask Her Majesty's Government what measures ... | Lord Alton of Liverpool |
| 3 | 2020-07-08 | Foreign and Commonwealth Office | To ask Her Majesty's Government what assessmen... | Lord Alton of Liverpool |
| 4 | 2020-07-08 | Foreign and Commonwealth Office | To ask Her Majesty's Government what plans the... | Lord Alton of Liverpool |
| ... | ... | ... | ... | ... |
| 495 | 2020-06-29 | Department of Health and Social Care | Her Majesty's Government how they will ensure ... | Lord Roberts of Llandudno |
| 496 | 2020-06-29 | Department for International Trade | To ask Her Majesty's Government what discussio... | Lord Roberts of Llandudno |
| 497 | 2020-06-29 | Department of Health and Social Care | Her Majesty's Government what assessment they ... | Lord Roberts of Llandudno |
| 498 | 2020-06-29 | Department for Work and Pensions | To ask Her Majesty's Government what plans the... | Lord Roberts of Llandudno |
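
The extraction above indexes straight into each item, which raises an error as soon as one item lacks a field. Whether the API ever omits these fields is not known, so the sketch below is only a defensive variant under that assumption: `flatten_item` is a hypothetical helper that returns None for anything missing, and the two sample items are hand-written in the shape used above.

```python
def flatten_item(item):
    """Pull the four fields used above out of one API item, tolerating missing keys."""

    def first_value(key):
        # these fields are lists of {"_value": ...} dictionaries
        entries = item.get(key) or [{}]
        return entries[0].get("_value")

    return {
        "AnswerDate": (item.get("AnswerDate") or {}).get("_value"),
        "AnsweringBody": first_value("AnsweringBody"),
        "questionText": item.get("questionText"),
        "tablingMember": first_value("tablingMemberPrinted"),
    }

# Hand-written sample items; the second one deliberately lacks AnsweringBody.
sample_items = [
    {
        "AnswerDate": {"_value": "2020-07-08"},
        "AnsweringBody": [{"_value": "Home Office"}],
        "questionText": "To ask Her Majesty's Government what measures ...",
        "tablingMemberPrinted": [{"_value": "Lord Alton of Liverpool"}],
    },
    {
        "AnswerDate": {"_value": "2020-06-29"},
        "questionText": "(made-up question text)",
        "tablingMemberPrinted": [{"_value": "Lord Roberts of Llandudno"}],
    },
]
rows = [flatten_item(item) for item in sample_items]
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)`.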


## XML

Sample file is the Welsh Record of Proceedings for a meeting of the Culture, Welsh Language and Communications Committee (11/06/2020 13:29).

In [18]:
import xml.etree.ElementTree as ET
tree = ET.parse('sample_xml_welsh_parl.xml')
root = tree.getroot()

In [19]:
# tree for first element
[elem.tag for elem in root[0].iter()]

['XML_CultureWelshLanguageAndCommunicationsCommittee_English',
 'Meeting_ID',
 'Assembly',
 'MeetingDate',
 'Contribution_ID',
 'Contribution_Order_ID',
 'contribution_language',
 'ContributionTime',
 'contribution_spoken_seneddTv',
 'contribution_translated_seneddTv',
 'Agenda_Item_ID',
 'Agenda_item_welsh',
 'Agenda_item_english',
 'contribution_type',
 'Attendee_Id',
 'Member_Id',
 'Member_name_English',
 'Member_biog_English',
 'Member_biog_Welsh',
 'Member_job_title_English',
 'Member_job_title_Welsh',
 'Contribution_English',
 'Contribution_Welsh',
 'contribution_verbatim',
 'contribution_translated']

In [25]:
(ET.tostring(root, encoding='utf8').decode('utf8'))[0:1000]

'<?xml version=\'1.0\' encoding=\'utf8\'?>\n<dataroot generated="2020-06-12T15:33:07">\n  <XML_CultureWelshLanguageAndCommunicationsCommittee_English>\n    <Meeting_ID>6356</Meeting_ID>\n    <Assembly>5</Assembly>\n    <MeetingDate>2020-06-11T13:29:46</MeetingDate>\n    <Contribution_ID>293839</Contribution_ID>\n    <Contribution_Order_ID>0</Contribution_Order_ID>\n    <contribution_language>Cy</contribution_language>\n    <ContributionTime />\n    <contribution_spoken_seneddTv />\n    <contribution_translated_seneddTv />\n    <Agenda_Item_ID>200611-0</Agenda_Item_ID>\n    <Agenda_item_welsh />\n    <Agenda_item_english />\n    <contribution_type>I</contribution_type>\n    <Attendee_Id />\n    <Member_Id />\n    <Member_name_English />\n    <Member_biog_English />\n    <Member_biog_Welsh />\n    <Member_job_title_English />\n    <Member_job_title_Welsh />\n    <Contribution_English>&lt;p&gt;The proceedings are reported in the language in which they were spoken in the committee. In addi

In [26]:
# easier to work with as a dictionary -> convert and export as json

import xmltodict
import json
with open('sample_xml_welsh_parl.xml') as in_file:
    xml = in_file.read()

# json.dump writes to the file and returns None, so there is nothing to assign
with open('transformed_welsh_parl.json', 'w') as out_file:
    json.dump(xmltodict.parse(xml), out_file)

In [27]:
with open('transformed_welsh_parl.json') as in_file:
    json_wales = json.loads(in_file.read())

In [29]:
contributions = json_wales["dataroot"]['XML_CultureWelshLanguageAndCommunicationsCommittee_English']
english_text = [contribution["Contribution_English"] for contribution in contributions]
english_text[0:3]

['<p>The proceedings are reported in the language in which they were spoken in the committee. In addition, a transcription of the simultaneous interpretation is included. This is a draft version of the record. The final version will be published within five working days.</p>',
 '<p>The committee met by video-conference.</p>\n<p>The meeting began at 13:29.</p>',
 "<p>Good afternoon, everyone, and a warm welcome to this meeting of the Culture, Welsh Language and Communications Committee at our Senedd. In accordance with Standing Order 34.19, I have determined that the public are excluded from attending this committee meeting in order to protect public health. This meeting is, however, being broadcast live on Senedd.tv, with all participants joining via video-conference. A transcript of the meeting will be published as usual. Aside from the procedure adaptations relating to conducting proceedings remotely, all other Standing Order requirements remain in place. This meeting is bilingual, w

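The contributions still carry `<p>` markup. A minimal way to get plain text, using only the standard library, is sketched below; `strip_tags` is a hypothetical helper, and the sample string is the second contribution from the output above.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Keep only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html_text):
    """Return html_text without its tags, with surrounding whitespace trimmed."""
    stripper = _TextOnly()
    stripper.feed(html_text)
    stripper.close()
    return "".join(stripper.parts).strip()

sample = '<p>The committee met by video-conference.</p>\n<p>The meeting began at 13:29.</p>'
plain = strip_tags(sample)
```

Applied to the whole list, this would be `[strip_tags(text) for text in english_text]`, assuming every entry is a string.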
New sample, a more complex XML: **Written Questions to the Ministers (France)**. There is more nesting and each XML file contains one question.
* Straightforward for a human, harder for a machine
* If the keys have a clear name (e.g. text) then it's very easy, perhaps even for a machine: at first just grep anything that sounds like text, or rank keys by the length of their values; in the future maybe ML, although the structures are very idiosyncratic
* Potentially print all keys, build the structure and ask the user to select where the text is, but this is not very scalable
* The tag listing used for the Welsh file doesn't work as-is in this case (it only returns the namespaced uid tag), so converting to a dictionary is useful
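
The "length of values for each key" idea can be sketched as follows: walk the nested dictionary, record the length of every string under its key path, and rank the paths by average length. `iter_strings` and `rank_text_keys` are hypothetical helpers; the sample mimics the shape of the French file, with a made-up uid and type.

```python
from collections import defaultdict

def iter_strings(node, path=()):
    """Yield (key_path, string) for every string leaf in nested dicts and lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, path + (key,))
    elif isinstance(node, list):
        for value in node:
            yield from iter_strings(value, path)  # list positions are ignored on purpose
    elif isinstance(node, str):
        yield path, node

def rank_text_keys(document):
    """Return (dotted key path, average string length) pairs, longest first."""
    lengths = defaultdict(list)
    for path, value in iter_strings(document):
        lengths[".".join(path)].append(len(value))
    averages = {path: sum(values) / len(values) for path, values in lengths.items()}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)

# Sample in the shape of the French file; uid and type values are made up.
sample = {
    "question": {
        "uid": "MADE-UP-UID",
        "type": "XX",
        "textesQuestion": {
            "texteQuestion": {
                "texte": "M. Antoine Herth attire l'attention de M. le ministre d'État, "
                         "ministre de l'intérieur, sur les diffi"
            }
        },
    }
}
ranking = rank_text_keys(sample)
```

Long non-text values (URLs, identifiers, encoded blobs) would also rank highly, so this is a starting point for a human to confirm rather than a reliable detector.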

In [40]:
tree = ET.parse('sample_xml_france_parl.xml')
root = tree.getroot()

In [77]:
# tree for first element
[elem.tag for elem in root[0].iter()]

['{http://schemas.assemblee-nationale.fr/referentiel}uid']
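
The single namespaced tag suggests that the root element is the question itself, so `root[0]` is just its first child, and that every tag carries the namespace as a prefix. Under that assumption, ElementTree can still be used by iterating from the root and passing a namespace map to `find`. The XML string below is a made-up miniature of the file; only the namespace URI and the element names come from the outputs in this section, and `local_name` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """'{namespace}name' -> 'name' (tags without a namespace are returned unchanged)."""
    return tag.rsplit("}", 1)[-1]

# Made-up miniature document using the namespace seen in the output above.
sample = """<question xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>MADE-UP-UID</uid>
  <textesQuestion>
    <texteQuestion>
      <texte>Made-up question text</texte>
    </texteQuestion>
  </textesQuestion>
</question>"""

sample_root = ET.fromstring(sample)

# Iterate from the root itself rather than from its first child.
tags = [local_name(elem.tag) for elem in sample_root.iter()]

# Namespace-aware lookup: the prefix "an" is an arbitrary local alias.
ns = {"an": "http://schemas.assemblee-nationale.fr/referentiel"}
texte = sample_root.find("an:textesQuestion/an:texteQuestion/an:texte", ns).text
```

Converting to a dictionary remains convenient for exploring an unknown structure; this only shows that the tree route is available once the namespace is handled.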

In [47]:
import xmltodict
import json
with open('sample_xml_france_parl.xml') as in_file:
    xml = in_file.read()

# json.dump writes to the file and returns None, so there is nothing to assign
with open('sample_xml_france_parl.json', 'w') as out_file:
    json.dump(xmltodict.parse(xml), out_file)

In [49]:
with open('sample_xml_france_parl.json') as in_file:
    json_france = json.loads(in_file.read())  # renamed from json_wales: this is the French file

In [57]:
json_france.keys() # dict_keys(['question'])
json_france['question'].keys()

dict_keys(['@xmlns', '@xmlns:xsi', '@xsi:type', 'uid', 'identifiant', 'type', 'indexationAN', 'auteur', 'minInt', 'minAttribs', 'textesQuestion', 'textesReponse', 'cloture', 'signalement', 'renouvellements'])

In [68]:
# question
json_france['question']['textesQuestion']['texteQuestion']['texte'][0:100]

"M. Antoine Herth attire l'attention de M. le ministre d'État, ministre de l'intérieur, sur les diffi"

In [75]:
# answer
json_france['question']['textesReponse']['texteReponse']['texte'][0:100]

'Le déploiement des télé-procédures dans le cadre du plan préfecture nouvelle génération (PPNG) a int'