# Data Extraction and Management

*Author*: Zachary del Rosario

### Learning outcomes

By working through this exercise, you will be able to:

- Prevent future data headaches by collecting data *defensively*
- Extract data from tables in published documents with Tabula
- Liberate data from graphs with WebPlotDigitizer


In [None]:
import numpy as np
import pandas as pd
import grama as gr

DF = gr.Intention()


## Designing Data Collection

---

### Choosing columns

Agree on the columns and their formats *before* starting data collection. The examples that follow show what can go wrong when this step is skipped.
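
One way to make that agreement concrete is to write the columns down as a template before any data are recorded. The following is a minimal sketch, not a prescribed workflow: the column names, the `SCHEMA` dictionary, and the helper functions are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical schema: column name -> pandas dtype.
# Units are part of the column name, so every value in a column shares one unit.
SCHEMA = {
    "run": "object",                 # text label for the run
    "knob_a_m_per_s": "float64",
    "knob_b_g": "float64",
    "recorded_value_eV": "float64",
    "date": "object",                # text, e.g. "2021-08-11"
    "notes": "object",
}


def make_template(schema):
    """Return an empty DataFrame with the agreed columns and dtypes."""
    return pd.DataFrame(
        {name: pd.Series(dtype=dtype) for name, dtype in schema.items()}
    )


def check_columns(df, schema):
    """Return (missing, unexpected) column names relative to the schema."""
    missing = [c for c in schema if c not in df.columns]
    unexpected = [c for c in df.columns if c not in schema]
    return missing, unexpected


df_template = make_template(SCHEMA)

# A dataset recorded without the agreed columns is flagged immediately
df_unplanned = pd.DataFrame({"run": ["1"], "Observation": ["It worked"]})
missing, unexpected = check_columns(df_unplanned, SCHEMA)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

Sharing a template like `df_template` with everyone who collects data is one way to keep columns and units consistent across people and days.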


### Example 1: Poor data planning

| Run | Observation |
|-----|-------------|
|  1  | It worked   |
|  2  | It failed   |
|  3  | It worked   |

### Q1: Suppose you were reviewing data in this form; would you be able to answer the following questions? Why or why not?

- How many observations were collected?
- What were the experimental settings for each observation?
- Were all of the planned experiments run?
- What would you do to resolve the issue with Observation 2?


### Example 2: More details

| Run |  Knob A | Knob B | Recorded Value | Notes |
|-----|---------|--------|----------------|-------|
|  1  | 1.0 m/s | 2.3 kg | 1.6 eV         |       |
|  2  | 1.0 mph | 10.0 g | NA   | Sample did not survive characterization |
|  3  | 0.5 mph | 50.0 g | 1e-17 J        |       |

### Q2: Suppose you were reviewing data in this form; would you be able to answer the following questions? Why or why not?

- What were the experimental settings for each observation?
- How do the experimental settings compare for each observation?
- Were all of the planned experiments run?
- What would you do to resolve the issue with Observation 2?


### Example 3: Even Better

| Run |  Knob A (m/s) | Knob B (g) | Recorded Value (eV) | Date | Notes |
|-----|---------------|------------|---------------------|------|-------|
| 1 / 2 |    1.00 |  2.3e3 | 1.60          | 2021-08-11 | - |
| 2 / 2 |    0.45 |  1.0e1 | NA   | 2021-08-11 | Sample did not survive characterization |
| 1 / 1 |    0.22 |  5.0e1 | 62.42          | 2021-08-12 | - |

### Q3: Suppose you were reviewing data in this form; would you be able to answer the following questions? Why or why not?

- How do the experimental settings compare for each observation?
- Were all of the planned experiments run?
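
Example 2 mixes units within a column, while Example 3 commits to one unit per column. The conversion between the two styles can be sketched as follows. This is a minimal sketch: the helper `convert`, the factor dictionaries, and the column names are hypothetical, and the conversion factors are the standard defined values.

```python
import pandas as pd

# Conversion factors (exact by definition)
M_PER_S_PER_MPH = 0.44704        # 1 mph = 0.44704 m/s
G_PER_KG = 1000.0                # 1 kg = 1000 g
J_PER_EV = 1.602176634e-19       # 1 eV = 1.602176634e-19 J

# Each dictionary maps a unit label to its multiplier into the target unit
TO_M_PER_S = {"m/s": 1.0, "mph": M_PER_S_PER_MPH}
TO_G = {"g": 1.0, "kg": G_PER_KG}
TO_EV = {"eV": 1.0, "J": 1.0 / J_PER_EV}


def convert(entry, factors):
    """Convert a 'value unit' string to the target unit; 'NA' becomes NaN."""
    if entry.strip() == "NA":
        return float("nan")
    value, unit = entry.split()
    return float(value) * factors[unit]


# The entries of Example 2, typed in as strings with their units
df_raw = pd.DataFrame({
    "knob_a": ["1.0 m/s", "1.0 mph", "0.5 mph"],
    "knob_b": ["2.3 kg", "10.0 g", "50.0 g"],
    "recorded": ["1.6 eV", "NA", "1e-17 J"],
})

# One unit per column, with the unit recorded in the column name
df_clean = pd.DataFrame({
    "knob_a_m_per_s": [convert(x, TO_M_PER_S) for x in df_raw["knob_a"]],
    "knob_b_g": [convert(x, TO_G) for x in df_raw["knob_b"]],
    "recorded_eV": [convert(x, TO_EV) for x in df_raw["recorded"]],
})
print(df_clean.round(2))
```

Rounded to two decimals, the converted speeds are 0.45 and 0.22 m/s and the converted energy is 62.42 eV. Once every column has a single unit, comparing settings across runs is a matter of comparing numbers.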


## Tabula: Extracting tables from documents

---

*Background*: [Tabula](https://tabula.technology/) is a piece of software developed for journalists carrying out investigative reporting. It was developed with support from organizations like [ProPublica](http://propublica.org/) and [The New York Times](http://www.nytimes.com/). This tool is meant to help investigators parse unwieldy PDFs and liberate useful information.

### Q4: Download and install [Tabula](https://tabula.technology/); the webpage has installation instructions.

*Note*: Tabula runs as a locally hosted server, and you interact with it through your browser. Launching Tabula should open a browser window automatically; if it does not, open [http://localhost:8080/](http://localhost:8080/) after you've launched Tabula.


### Q5: Download [this example PDF](https://github.com/zdelrosario/mi101/blob/main/mi101/data/weibull1939-table4.pdf) and import it into Tabula for data extraction.

![Tabula's interface: Click `Browse` to find the example PDF](../images/tabula-front.png)

Click `Browse` to find the example PDF, click `Import` to load the file into Tabula, then click `Extract Data` to enter the data extraction interface.


### Q6: Enter the Extraction menu, and drag-select a box to target the table of data.

![Tabula's interface: Click and drag to draw a box around the data you want to extract](../images/tabula-select.png)

Click and drag to draw a box around the data you want to extract. Make sure to exclude the table title and any other non-data text.


### Q7: Once selected, click `Preview & Export Extracted Data`.

![Tabula's interface: View of selected region for data extraction](../images/tabula-selected.png)




### Q8: Choose between `Stream` and `Lattice` options to help Tabula extract the data correctly.

![Tabula's interface: Preview of the extracted data using the `Stream` option](../images/tabula-preview-stream.png)

The `Stream` option looks for whitespace, while the `Lattice` option looks for vertical and horizontal bars that denote data entries. For this case, both options work fine.
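
To build intuition for the difference, here is a toy analogy in plain Python. It is *not* how Tabula is implemented (Tabula works on the geometry of the PDF page, not on lines of text), and the two rows below are made-up values, not entries from the example PDF. The function names are hypothetical.

```python
import re


def split_stream(line):
    """Split a row on runs of two or more whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def split_lattice(line, bar="|"):
    """Split a row on explicit ruling characters (drawn cell borders)."""
    return [cell.strip() for cell in line.strip().strip(bar).split(bar)]


# The same made-up row, laid out in two different ways
stream_row = "Specimen 1    12.5    0.31"
lattice_row = "| Specimen 1 | 12.5 | 0.31 |"

print(split_stream(stream_row))
print(split_lattice(lattice_row))
```

Both calls recover the same three cells. A whitespace-based approach can misfire when a cell itself contains wide gaps, and a ruling-based approach has nothing to work with when a table has no drawn borders, which is why it helps to preview both options.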


### Q9: Once satisfied, click `Export` to download the data.

After exporting, you should have a CSV file that looks like the following:

In [None]:
df_extracted = pd.read_csv("../data/tabula-weibull1939-table4.csv")
df_extracted
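
Extracted tables are worth a quick sanity check before analysis, since extraction can garble individual entries. The following is a minimal sketch of such a check. It uses a small made-up CSV rather than the real export: the column names, the values, and the helper `find_non_numeric` are all hypothetical.

```python
import io

import pandas as pd

# Made-up CSV standing in for an extracted table; the real file will differ.
# The "l2.5" entry mimics a garbled digit (letter l in place of digit 1).
csv_text = """specimen,load,strain
1,10.0,0.21
2,l2.5,0.25
3,15.0,
"""

df_check = pd.read_csv(io.StringIO(csv_text))


def find_non_numeric(df, columns):
    """Return {column: [row indices]} where a non-missing value is not a number."""
    problems = {}
    for col in columns:
        parsed = pd.to_numeric(df[col], errors="coerce")
        # Flag values that exist in the raw data but failed numeric parsing
        bad = df.index[parsed.isna() & df[col].notna()].tolist()
        if bad:
            problems[col] = bad
    return problems


problems = find_non_numeric(df_check, ["load", "strain"])
n_missing = df_check.isna().sum().to_dict()

print("Non-numeric entries:", problems)
print("Missing values per column:", n_missing)
```

Here the check separates two different problems: a value that is present but unparseable (row 1 of `load`) and a value that is simply missing (one entry in `strain`). Both are worth comparing against the original PDF by eye.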

### Important caveat! Image-based PDFs.

Tabula tends to work best with more modern, fully digital documents. For PDFs of older documents, you might get the following:

![Tabula's warning of an image-based PDF](../images/tabula-warning.png)

This means the PDF contains no digitized text; it is just a scan of an old document. The `Help` tab in Tabula suggests things you can try. For instance, the help page links to *optical character recognition* (OCR) machine learning tools that pre-process an image into text data, which you can then treat with Tabula.


## WebPlotDigitizer: Liberating data from images

---

Sometimes data are messy; we'll learn how to deal with that later in the workshop. Other times data are "locked up" in a format we can't easily analyze, such as an image. In this exercise you'll learn how to *liberate* data from a plot using WebPlotDigitizer.
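
The basic idea behind digitizing a plot is axis calibration: you identify pixels whose data values are known (such as axis ticks), and every other pixel is converted by interpolation. The following is a minimal sketch of that idea, assuming linear axes. It is not WebPlotDigitizer's implementation, and the function name and all pixel positions are hypothetical.

```python
def make_axis_map(pixel_1, value_1, pixel_2, value_2):
    """Return a function mapping a pixel coordinate to a data value (linear axis)."""
    slope = (value_2 - value_1) / (pixel_2 - pixel_1)
    return lambda pixel: value_1 + slope * (pixel - pixel_1)


# Hypothetical calibration: pixel positions of two known ticks on each axis
x_map = make_axis_map(100, 0.0, 500, 10.0)  # x = 0 at pixel 100, x = 10 at pixel 500
y_map = make_axis_map(400, 0.0, 50, 7.0)    # image rows count downward, so y decreases

# Hypothetical pixel locations of two data markers on the plot
clicked = [(300, 225), (500, 50)]
points = [(x_map(px), y_map(py)) for px, py in clicked]
print(points)
```

A logarithmic axis would need the same interpolation applied to the logarithm of the axis values, which is one reason digitizing tools ask what kind of axes a plot uses.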
