## Introduction to the task

The task is to extract usable information from documents written in natural language.

### Quick setup guide

Follow these steps to set up and run the project:
- ensure `Python` (version `3.7` or newer) is installed and added to `PATH`.
- clone this repository using Git:
    ```shell
    git clone <repository-url>
    ```
- navigate to the project folder and run the setup script:
    ```shell
    python setup.py
    ```
- run the tests to verify the setup (make sure you are inside the `data-processing` folder):
    ```shell
    python -m unittest discover
    ```

### A closer look at the domain of the problem

The candidate is required to figure out what kind of information would be useful to extract, as the `README` gives no explanation.
Looking through the sample documents provided in the `pdfs` folder, you can see that their body text is entirely placeholder content.
I'll therefore assume that nothing needs to be extracted from the paragraphs, since they would otherwise contain actual content.
This narrows the scope down to the 'captions' and 'labels', which already carry evidently meaningful information.

### Getting started

A first approach involves natural language processing.
There are pre-trained models that can do named entity recognition.
In this demonstration, I will use the `spaCy` module to get up and running quickly.

First off, I want to extract the text from a PDF document.
I will use `PyPDF2` for this.


In [None]:
import warnings
import docproc as dp
from spacy import Language
import spacy

warnings.filterwarnings("ignore", category=FutureWarning)

sample3 = dp.load_file('./pdfs/sample-3.pdf')
sample3_text = dp.get_document_text(sample3)

print(sample3_text)
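
The internals of `dp.load_file` and `dp.get_document_text` are not shown here. As a minimal sketch of how page text might be collected with `PyPDF2`, assuming version 2.x or newer (where `PdfReader`, `reader.pages` and `page.extract_text()` are available), something like the following could work. The helper name `pages_to_text` and the stand-in reader are hypothetical and only there so the example runs without a PDF file.

```python
from types import SimpleNamespace


def pages_to_text(reader):
    """Join the text of every page of a PyPDF2-style reader (hypothetical helper)."""
    # extract_text() can return an empty value for pages without a text layer,
    # so fall back to an empty string before joining.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


# With PyPDF2 2.x or newer installed, a real reader would be created like this:
#     from PyPDF2 import PdfReader
#     text = pages_to_text(PdfReader("./pdfs/sample-3.pdf"))

# Stand-in reader (an assumption for illustration) so the sketch runs as-is.
fake_reader = SimpleNamespace(pages=[
    SimpleNamespace(extract_text=lambda: "Page one."),
    SimpleNamespace(extract_text=lambda: None),
    SimpleNamespace(extract_text=lambda: "Page three."),
])
print(pages_to_text(fake_reader))
```

Pages without extractable text contribute an empty line rather than raising an error, which keeps page boundaries visible in the output.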

Note that, in some cases, information becomes useless once it is extracted out of its context.
To avoid a complete (or partial) loss of semantics in the data that we extract, we can split the text into chunks and process each chunk separately.

Assuming that each document covers a single macro-topic, elaborate chunking is probably overkill, since most sentences will share a similar context anyway.
`spaCy` gives us an easy way to split documents into sentences, which should be enough for our purpose.

In [None]:
nlp: Language = spacy.load('en_core_web_trf')

sentences = dp.get_sentences(nlp, sample3_text)

for sentence in sentences:
    print(sentence, end='\n\n')

Once the document is split into sentences, we can proceed to extract individual features.
The following example extracts all the dates in the document, sentence by sentence.

In [None]:
for n, sentence in enumerate(dp.get_sentences(nlp, sample3_text)):
    print(f'Sentence {n}:')
    print('\tDates:')
    for date in dp.extract_dates(nlp, sentence.text):
        print('\t\t', date, end='\n\n')
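
The implementation of `dp.extract_dates` is not shown. One plausible way to do this with spaCy's named entity recognition is to keep only the entities labelled `DATE`. The sketch below assumes exactly that; `dates_from_doc` is a hypothetical name, and the stand-in objects only mimic the `ents`, `text` and `label_` attributes of a processed spaCy `Doc` so the example runs without downloading a model.

```python
from types import SimpleNamespace


def dates_from_doc(doc):
    """Return the text of every entity labelled DATE (hypothetical helper)."""
    return [ent.text for ent in doc.ents if ent.label_ == "DATE"]


# With a loaded pipeline, the call would look like:
#     dates_from_doc(nlp(sentence_text))

# Stand-in for a processed Doc (an assumption for illustration only).
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="12 March 2021", label_="DATE"),
    SimpleNamespace(text="Acme Corp", label_="ORG"),
    SimpleNamespace(text="last year", label_="DATE"),
])
print(dates_from_doc(fake_doc))
```

Filtering on `label_` keeps the helper independent of the model that produced the entities, as long as that model uses the `DATE` label.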

### Limitations of spaCy's out-of-the-box pre-trained models

There are two fundamental limitations to the default configurations that spaCy provides:
- sentence recognition is done by a pre-trained model that takes grammar into account. This is more accurate than the rule-based alternative, `Sentencizer`, but both fall short when dealing with languages the model is unfamiliar with (e.g. the pseudo-Latin of Lorem Ipsum).
- named entity recognition only supports 18 entity types by default. Although it is possible to train a custom NER pipeline, the process requires a considerable amount of good-quality data.
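
To illustrate what a purely rule-based splitter does, here is a deliberately simplified sketch in plain Python. It is not how spaCy's `Sentencizer` is implemented; it only splits after `.`, `!` or `?` followed by whitespace, and `naive_sentences` is a hypothetical name. It handles placeholder text without any language knowledge, but it also shows the typical weakness of punctuation rules.

```python
import re


def naive_sentences(text):
    """Split text after ., ! or ? followed by whitespace (simplified rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop the empty string that re.split returns for empty input.
    return [part for part in parts if part]


# Placeholder text is split purely on punctuation, no grammar needed.
print(naive_sentences("Lorem ipsum dolor sit amet. Consectetur adipiscing elit! Sed do eiusmod?"))

# The same rule wrongly treats the period of an abbreviation as a boundary.
print(naive_sentences("Dr. Smith arrived on 3 May. He left early."))
```

The second call returns `"Dr."` as a sentence of its own, which is the kind of error a grammar-aware model is meant to avoid.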

### Enter regular expressions

Regular expressions are handy when handling data that matches a certain pattern.
For example, the pre-trained NER in `spaCy` has no entity type for email addresses.

With a regex, it is just a matter of finding the tokens that have the structure of an email address.
Here's an updated version of the code above that also extracts email addresses.

In [None]:
for n, sentence in enumerate(dp.get_sentences(nlp, sample3_text)):
    print(f'Sentence {n}:')
    print('\tDates:')
    for date in dp.extract_dates(nlp, sentence.text):
        print('\t\t', date)
    print()
    print('\tEmail Addresses:')
    for addr in dp.extract_emails(nlp, sentence.text):
        print('\t\t', addr, end='\n\n')
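
The body of `dp.extract_emails` is not shown, so the following is only a minimal sketch of the regex idea using Python's standard `re` module. The pattern is a common simplification and an assumption on my part: it does not cover every address that is valid under the email RFCs, and `find_emails` is a hypothetical name.

```python
import re

# Simplified pattern: local part, "@", domain labels, and a top-level
# domain of at least two letters. Assumed for illustration only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


sample = "Contact jane.doe@example.com or sales@example.org."
print(find_emails(sample))
```

Because the pattern requires the address to end in letters, the full stop that closes the sample sentence is not captured as part of the second address.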