# Module 8. Parsing text
# Reading PDFs

## Lecture objectives
1. Demonstrate how to load in text data from PDFs
2. Show how to clean and simplify text data using `regex` 

## Getting text into Python
Before we process any text, we need to take a step back and figure out how to get that text into Python. Typically, plans and other policy documents come as PDFs, which are a pain to read. There are dozens of PDF readers for Python, all of which are flawed in different ways. (See some discussions [here](https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7), [here](https://johannesfilter.com/python-and-pdf-a-review-of-existing-tools/) and [here](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file).) We'll use `pdfminer.six`, which is fairly robust and is easier to install than some alternatives. YMMV.

Let's start with an adaptation of the [LA Times analysis of California High-Speed Rail](https://github.com/datadesk/hsr-document-analysis). If you look at their code, they use the `urllib` library to download files. You can do the same but with a couple of extra steps using `requests`. 
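As a rough sketch of the download step, here is one way to fetch a file with the standard-library `urllib`. The helper name `download_file` is hypothetical, and `urllib.request.urlopen` is simply one reasonable call to use; the LA Times code may do this differently.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest):
    """Stream the resource at `url` to the local path `dest` and return that path.

    Hypothetical helper for illustration; not taken from the LA Times repository.
    """
    dest = Path(dest)
    # Create the target folder if it does not exist yet
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy the response to disk in chunks rather than loading it all into memory
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

You would call it with the PDF's URL and a local path of your choosing, for example `download_file(url, '../data/myfile.pdf')`. Some servers reject requests that lack a browser-like user agent, so treat this as a starting point.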

But for now, let's work with just one of their files: [the EIR section on air quality and climate change, for the Bakersfield to Palmdale segment](https://hsr.ca.gov/wp-content/uploads/docs/programs/bakersfield-palmdale/BP_Draft_EIRS_Vol_1_CH_3.3_Air_Quality_and_Global_Climate_Change.pdf). It's in your git repository.

We'll read the text using `pdfminer.six`. Its simplest function is `extract_text`; see the `pdfminer.six` documentation for the full set of options.

Note that you will often have to experiment with other PDF parsers if you get unintelligible results. `PyPDF2` is another commonly used package.

In [None]:
from pdfminer.high_level import extract_text

fn = '../data/BP_Draft_EIRS_Vol_1_CH_3.3_Air_Quality_and_Global_Climate_Change.pdf'
eirtext = extract_text(fn)
print('Text is {} characters long'.format(len(eirtext)))

Let's look at a few random extracts. We read the file into a string, so we can use our standard string slicing syntax.

For example, let's look at 1,000 characters starting at the 200,000th character.

In [None]:
print(eirtext[199999:200999])
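If the off-by-one in that slice looks odd, here is a small sketch on a short string. The helper `excerpt` is hypothetical and is only meant to make the counting explicit.

```python
def excerpt(text, start_char, length):
    """Return `length` characters beginning at the 1-based position `start_char`."""
    begin = start_char - 1  # Python indexes from 0, so the Nth character is at N - 1
    return text[begin:begin + length]


alphabet = "abcdefghijklmnopqrstuvwxyz"
print(excerpt(alphabet, 5, 3))  # efg
```

Slices that run past the end of the string do not raise an error; they just return whatever is there.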

And another slice.

In [None]:
print(eirtext[400000:401000])

## Cleaning up the text
So we've got a bunch of text in, but clearly the formatting leaves something to be desired. In particular, there are a lot of random line breaks. Let's use `regex` to convert all whitespace (spaces, tabs (`\t`), and newlines (`\n` or `\r\n`)) to a single space.

`regex` is short for "regular expression," and is essentially a pattern matching tool for text. Think of it as a souped-up version of `replace`. 

`regex` is extremely powerful and has an extremely unfriendly syntax. But there are thousands of examples online. [Here's a good place to start](https://regexone.com/) if you want to explore more. And [this website](https://regex101.com) helps you test and debug your expressions.

Let's look at an example – `r"\s+"`:
- The `r` tells Python that what follows is a "raw string," and thus the `\` character should be interpreted literally
- `\s` matches whitespace
- `+` matches one or more occurrences

So basically, we are matching any run of whitespace, however long.
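To make the bullets above concrete, here is a minimal sketch. The `sample` string is made up for illustration. It shows what the `r` prefix changes, and uses `re.findall` to list exactly what `\s+` matches.

```python
import re

# A regular string turns \t into a single tab character;
# a raw string keeps the backslash and the letter t as two characters.
print(len("\t"), len(r"\t"))  # 1 2

sample = "HSR\tis  an\nexpensive"
# Each match is one unbroken run of whitespace
runs = re.findall(r"\s+", sample)
print(runs)  # ['\t', '  ', '\n']
```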

Let's then use `re.sub` to replace that whitespace. The first argument is the pattern to match. The second argument is what we replace our matched substrings with. The third argument is the string to apply the substitution to. Note that our example string has some spaces, some tabs (`\t`), and some newlines (`\n`).

In [None]:
import re
print(re.sub(r"\s+", " ", "HSR\tis     an\nexpensive    boondoggle"))

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> What happens if you omit the <strong>+</strong> sign from <strong>r"\s+"</strong>? How can you explain your results?
</div>

If we omit the `+` and just specify `r"\s"`, each whitespace character is matched on its own rather than as a run. So 4 spaces are replaced with 4 spaces, rather than a single space. But the tabs and newlines are still converted to spaces.

In [None]:
print(re.sub(r"\s", " ", "HSR\twill     \ntransform     California"))

I won't pass judgment on the content of either of these claims.
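To compare the two patterns side by side, one option is `re.subn`, which works like `re.sub` but also returns the number of substitutions made. The `sample` string here is made up for illustration.

```python
import re

sample = "a\tb" + " " * 4 + "c\nd"  # one tab, four spaces, one newline

# \s replaces each whitespace character separately
per_char, n_char = re.subn(r"\s", " ", sample)
# \s+ replaces each run of whitespace with a single space
per_run, n_run = re.subn(r"\s+", " ", sample)

print(repr(per_char), n_char)  # 'a b    c d' 6
print(repr(per_run), n_run)    # 'a b c d' 3
```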

Let's apply the `regex` to the text that we pulled out of the EIR.

In [None]:
eirtext = re.sub(r"\s+", " ", eirtext)
print(eirtext[200001:201001])

In [None]:
print(eirtext[400001:401001])

We can also use `regex` to get rid of punctuation, digits, etc. 

Here:
* `[]` means match any one character listed within the brackets
* `^` at the start of the brackets means not
* `A-Za-z` is any ASCII letter in either case (not `A-z`, which also covers a few punctuation characters)
* `\s` is any whitespace (which is just spaces, since we converted other whitespace like tabs to spaces)

So `[^A-Za-z\s]` captures anything that is not a letter or whitespace. 

Since we might want the punctuation at a later date, let's assign our cleaned text to a new variable, `eirtext_wordsonly`.

In [None]:
eirtext_wordsonly = re.sub(r"[^A-Za-z\s]", "", eirtext)
eirtext_wordsonly[400001:401001]
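One caveat about letter ranges is worth checking for yourself. A range in brackets follows character codes, and in ASCII the characters `[`, `\`, `]`, `^`, `_` and `` ` `` sit between `Z` and `a`. So the range `A-z` keeps those too, while `A-Za-z` keeps only letters. A quick sketch on a made-up string:

```python
import re

sample = "snake_case [note] x^2"

# A-z spans everything from "A" to "z", including some punctuation
loose = re.sub(r"[^A-z\s]", "", sample)
# A-Za-z lists the two letter ranges separately
strict = re.sub(r"[^A-Za-z\s]", "", sample)

print(loose)   # snake_case [note] x^
print(strict)  # snakecase note x
```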

Notice that removing some digits, etc. means that we now have extra spaces. For example, `Table 3.3-46 provides a summary` becomes `Table  provides a summary`.

So let's use our same process from before to remove duplicate spaces.

In [None]:
eirtext_wordsonly = re.sub(r"\s+", " ", eirtext_wordsonly)
eirtext_wordsonly[394890:395200]

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> The regex above also removes numbers. Adapt <strong>r"[^A-Za-z\s]"</strong> so that it retains the digits 0 through 9. <em>Hint:</em> Look at the websites linked above for tips on the syntax.
</div>

Note that `0-9` represents any digit. So we can add it to the set of characters inside the brackets, which (because of the `^`) are the characters we keep rather than remove.

In [None]:
eirtext_wordsonly = re.sub(r"[^A-Za-z0-9\s]", "", eirtext)
# and remove duplicate spaces
eirtext_wordsonly = re.sub(r"\s+", " ", eirtext_wordsonly)
eirtext_wordsonly[400001:401001]
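If you find yourself repeating these two steps, you could wrap them in a small function. This is just a sketch: `clean_text` and its `keep_digits` flag are hypothetical names, the letter range is written as `A-Za-z` so that only letters are kept, and the final `.strip()` (which trims leading and trailing spaces) is an extra step not used above.

```python
import re


def clean_text(text, keep_digits=True):
    """Drop punctuation (and optionally digits), then collapse whitespace."""
    allowed = r"A-Za-z0-9\s" if keep_digits else r"A-Za-z\s"
    # Step 1: remove every character that is not in the allowed set
    stripped = re.sub(f"[^{allowed}]", "", text)
    # Step 2: replace each run of whitespace with a single space
    return re.sub(r"\s+", " ", stripped).strip()


print(clean_text("Table 3.3-46 provides\n a summary."))
# Table 3346 provides a summary
print(clean_text("Table 3.3-46 provides\n a summary.", keep_digits=False))
# Table provides a summary
```

Note that `3.3-46` turns into `3346` once the punctuation between the digits is gone, so table and section numbers lose their meaning. Whether that matters depends on your analysis.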

This looks much better! We now have some clean text to analyze.

Let's pause here. We'll save the text to a file, so that we can load it in at the start of the next lecture.

Note here that `open` returns the file object `f`. We then write `eirtext` to the file. The `with` syntax helps because it automatically closes the file afterwards, even if an error occurs along the way.

In [None]:
# the encoding keyword is needed on Windows, which does not use UTF-8 by default
with open('../scratch/eirtext.txt', 'w', encoding="utf-8") as f:
    f.write(eirtext)
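Reading the file back in is the mirror image: open in `'r'` mode with the same encoding and call `read()`. Here is a minimal round-trip sketch. It uses a temporary folder and a made-up `sample` string so that it runs anywhere, rather than the `../scratch` path above.

```python
import tempfile
from pathlib import Path

sample = "Air quality in San José"  # includes a non-ASCII character on purpose

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "eirtext.txt"
    # Write the text out as UTF-8
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    # Read it back using the same encoding
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

print(loaded == sample)  # True
```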

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>PDFs are difficult to work with. pdfminer is a good starting point, but make sure to inspect your output.</li>
  <li>regex is a powerful tool to clean up text, e.g. removing whitespace and punctuation.</li>
</ul>
</div>