# Assignment 1

<br> 
    <center>
        <img src="src/A1_ViT_illustration.png" width="300"/>  <img src="src/A1_Earshells_illustration.png" width="300"/>
    </center>
</br>

In the first assignment, you can revise your **general Python skills**. The overall task is to extract information from PDFs in the form of text strings and Pandas DataFrame tables. In particular, you can practice

* installing new libraries and setting up a new environment for your project,
* creating and using **generators** and **decorators**,
* working with **Pandas DataFrames**.

Please provide solutions to all exercises below and send me all notebooks by **31st of May 2024**.

***

## Part I: Setting up a new project environment

In order to reuse tools or libraries shared through, for example, GitHub repositories, you need to be 
at ease with installing new libraries and environments in your own Python setup. 

To this end, the first part of the assignment is to 
* set up Python on your own machine (if you haven't done it yet)
* and install libraries provided in ```src/environment_A1.yml```

You can use the information I shared in my first e-mail on the course details. If you prefer a **more beginner-friendly** route, you can

1. watch my YouTube video "[Python Course: Set up Python](https://youtu.be/-RJnYbxVZTg)",
2. follow my brief instructions on how to set up the [Python environment in Anaconda Navigator](src/A1_Install_environment_in_Anaconda_Navigator.pdf),
3. and then switch to the new environment ```A1apml``` and start JupyterLab in Anaconda Navigator.

If you are **more experienced** and Python as well as Anaconda are already set up, you can

1. install the environment from the console / terminal (Linux / macOS) or Anaconda Prompt (Windows) with the command ```conda env create -f environment_A1.yml``` (you need to do this in the directory ```.../Assignments/src/``` where ```environment_A1.yml``` is stored),
2. activate the environment with ```conda activate A1apml```,
3. and start jupyter lab with ```jupyter lab```.


In either case, you might need to close this JupyterLab window and restart it from the new environment.

In this notebook, we will **work with this new environment** ```A1apml```, which provides you with
JupyterLab, NumPy, Pandas, and **PyMuPDF**. 
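
A quick way to confirm that the notebook is really running in the intended environment is to check whether the expected packages can be found. The following is a minimal sketch and not part of the original setup instructions: the helper name `check_packages` is hypothetical, and it assumes the import names `numpy`, `pandas`, and `fitz` (the name under which PyMuPDF is imported in this notebook).

```python
import importlib.util
import sys


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Path of the running interpreter; it should point into the A1apml environment.
print(sys.executable)

for name, found in check_packages(["numpy", "pandas", "fitz"]).items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If a package is reported as missing, the notebook was most likely started from a different environment.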

***

## Part II: Read text from a PDF

In the second part, we want to retrieve some information from the paper
on the influential *vision transformer* (ViT) architecture which is provided in
```data/A1_ViT_paper.pdf```.

For this, we use PyMuPDF which is a high-performance Python library for data extraction, 
analysis, conversion, and manipulation of information stored in PDFs. You will import it through

```Python
import fitz
```

Find more information on the functionalities in this 
[overview](https://pymupdf.readthedocs.io/en/latest/the-basics.html). 

Execute the following cells to load the data and get a feeling for the library.

In [None]:
import fitz 
import pandas as pd
import numpy as np

paper_path = 'data/A1_ViT_paper.pdf'

doc = fitz.open(paper_path)

In [None]:
print(f"The paper '{paper_path}' has {len(doc)} pages.")

In [None]:
for page_nr, page in enumerate(doc): 
    text = page.get_text()
    print(f"----- Page {page_nr+1}\n{text}")

### Exercise II.1
Write a **generator** ```page_provider``` that provides the pages of the PDF as strings.
The generator should take the path to the PDF, ```paper_path```,
as its argument and yield the text of the next page each time it is advanced,
for example in a loop.
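
As a reminder of the generator pattern, here is a minimal sketch that works on plain strings rather than PDF pages. The function `paragraph_provider` and the sample text are hypothetical and only illustrate how `yield` hands out one item at a time; your own `page_provider` will still need to open the PDF itself.

```python
def paragraph_provider(text):
    """Yield the non-empty paragraphs of `text` one at a time."""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if paragraph:
            # Execution pauses here until the caller asks for the next item.
            yield paragraph


sample_text = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird paragraph."

get_paragraph = paragraph_provider(sample_text)
print(next(get_paragraph))        # fetch a single item explicitly
for paragraph in get_paragraph:   # the loop continues with the remaining items
    print(paragraph)
```

Note that a generator object is exhausted after one pass; to iterate again, call the generator function once more.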

Fill in the cell below:

### Exercise II.2
Extract the abstract and conclusion sections from the PDF into a string variable.

For this, call ```page_provider``` and assign the resulting generator to a variable, e.g. ```get_page```. Loop over the pages, extract the sections 'ABSTRACT'
and 'CONCLUSION', and store their text in one string variable ```result_text```.

Ideas on strings provided in [Notebook 1](../Notebooks/1-Python_Concepts.ipynb) might help you with that.
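
One possible building block is cutting out the text between two marker strings with `str.find` and slicing. The sketch below is an illustration on a made-up string: the helper `extract_between`, the toy page, and the end marker `1 INTRODUCTION` are assumptions, and the headings in the actual PDF may be spelled or split differently.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text from start_marker up to (not including) end_marker.

    Returns an empty string if start_marker is missing. If end_marker is
    missing, everything from start_marker to the end of the text is returned.
    """
    start = text.find(start_marker)
    if start == -1:
        return ""
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return text[start:]
    return text[start:end]


toy_page = "TITLE\nABSTRACT\nWe study toy data.\n1 INTRODUCTION\nMore text."
print(extract_between(toy_page, "ABSTRACT", "1 INTRODUCTION"))
```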

Fill in the cell below:

In [None]:
print(result_text)

***

## Part III: Read a table from a PDF

In the third part, we tackle the slightly more challenging task of extracting 
tables from PDFs. We use the example table in ```data/A1_table_example.pdf``` for this.

The following cell provides a wrapper function ```get_pandas_df``` to load the first table 
occurring on a specified page into a Pandas DataFrame. You can ignore the arguments
```search_strategy``` and ```search_region``` for now. Read up on the ```PyMuPDF``` function used here 
in the [find_tables documentation](https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables). 

In [None]:
import fitz
import pandas as pd
import numpy as np

def get_pandas_df(page, search_strategy='text', search_region=None):
    tabs = page.find_tables(strategy=search_strategy, clip=search_region) 
    
    tab = tabs[0]
    df = tab.to_pandas()

    df = df.convert_dtypes()

    return df

In [None]:
paper_path = "data/A1_table_example.pdf"

doc = fitz.open(paper_path)

print(f"The paper '{paper_path}' has {len(doc)} pages.")

table_df = get_pandas_df(doc[0])

In [None]:
table_df

### Exercise III.1

As you have probably noticed, ```table_df``` requires some additional processing
before it can be used properly as a DataFrame.

Thus, write a **decorator** ```nicer_df``` which can be used to conveniently extend
the functionality of our ```get_pandas_df```. In particular, please do not change the ```get_pandas_df``` function, but define a decorator as discussed in [Notebook 1](../Notebooks/1-Python_Concepts.ipynb).
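
As a reminder of the general pattern, the following minimal sketch shows a decorator that post-processes the return value of a function without touching the function itself. The names `sorted_result` and `get_words` are hypothetical and unrelated to the assignment data; your `nicer_df` would post-process a DataFrame instead of a list.

```python
import functools


def sorted_result(func):
    """Decorator that sorts the list returned by `func`."""

    @functools.wraps(func)  # keep the name and docstring of the wrapped function
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)   # call the original, unchanged function
        return sorted(result)            # post-process its return value

    return wrapper


@sorted_result
def get_words(sentence):
    """Split a sentence into words."""
    return sentence.split()


print(get_words("vision transformer architecture"))
```

Passing `*args` and `**kwargs` through the wrapper keeps the decorated function callable with the same arguments as before.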

Complete the cell below and perform the following steps:

* Drop all rows in which at least half of the entries are empty strings ```''```
* Convert columns with numeric values to floats or integers, respectively
* For each numeric column, append the mean and the standard deviation as the last two rows

Hints: Your solution could include the following parts

*  ```== ''```
*  ```.sum(axis=1)```
*  ```~``` operator for negation, i.e. ```~np.array([True, True, False])``` becomes ```array([False, False,  True])```
*  ```pd.to_numeric(..., errors='ignore')``` (note that ```errors='ignore'``` is deprecated as of pandas 2.2, so newer versions emit a warning)
*  ```df.loc[row_index, column_name]```

Note that your own solution might not require all of these.
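
The hints above can be tried out on a small toy table first. The sketch below is only an illustration under assumptions: `toy_df` and its column names are made up, and it deliberately stops short of the full exercise.

```python
import pandas as pd

# Hypothetical toy table, unrelated to the assignment data.
toy_df = pd.DataFrame({
    "name": ["a", "", "c"],
    "value": ["1.5", "", "3"],
    "count": ["2", "", "4"],
})

is_empty = toy_df == ''                 # boolean DataFrame, True where a cell is ''
empty_per_row = is_empty.sum(axis=1)    # number of empty cells in each row
has_empty = (empty_per_row > 0).to_numpy()

print(empty_per_row.tolist())
print(~has_empty)                       # negated mask: rows without empty cells

# Select rows with the mask and one column by name, then convert strings to numbers.
values = pd.to_numeric(toy_df.loc[~has_empty, "value"])
print(values.dtype)
print(toy_df.loc[0, "name"])            # a single cell by row label and column name
```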

In [None]:
def nicer_df(func): 
    # Here comes your implementation

Once you have implemented all the outlined steps,
you can *decorate* ```get_pandas_df``` like this:

In [None]:
@nicer_df
def get_pandas_df(page, search_strategy='text', search_region=None):
    tabs = page.find_tables(strategy=search_strategy, clip=search_region) 
    
    tab = tabs[0]
    df = tab.to_pandas()
    
    df = df.convert_dtypes()

    return df

Re-run the table extraction with the extended functionality:

In [None]:
import fitz
import pandas as pd
import numpy as np

paper_path = "data/A1_table_example.pdf"

doc = fitz.open(paper_path)

print(f"The paper '{paper_path}' has {len(doc)} pages.")

nicer_table_df = get_pandas_df(doc[0])

Here is a comparison between the "old" ```table_df```
and the "new" ```nicer_table_df```:

In [None]:
table_df

In [None]:
nicer_table_df

***

## Bonus Part IV: Read a table from a PDF by specifying the section

#### This last part is not required to complete assignment 1!

You have probably realised that
the table in ```data/A1_table_example.pdf``` is a simplified example. If you consider
the vision transformer paper ```data/A1_ViT_paper.pdf```, you will notice that it is more 
challenging to read tables from "real" documents correctly.

In this last section, you can explore how to extract tables in more realistic
scenarios if you are interested.

We define the "undecorated" function ```get_pandas_df``` again for this.

In [None]:
def get_pandas_df(page, search_strategy='text', search_region=None):
    tabs = page.find_tables(strategy=search_strategy, clip=search_region) 
    
    tab = tabs[0]
    df = tab.to_pandas()
    
    df = df.convert_dtypes()

    return df

The additional arguments ```search_strategy``` / ```strategy```  and 
```search_region``` / ```clip``` provide additional control over how and 
where tables are searched for in the PDF. We focus on ```search_region``` / ```clip``` in the following.
As before, you can find more in the [find_tables documentation](https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables).

With ```search_region``` / ```clip```, you can specify a particular region of the 
page that should be parsed.
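
To make experimenting with regions easier, it can help to describe a region as fractions of the page instead of absolute coordinates. The following is a minimal sketch under assumptions: the helper `fractional_region` is hypothetical, and the page size of 612 x 792 points is only an example value. In PyMuPDF page coordinates, the origin is the top-left corner and y grows downwards.

```python
def fractional_region(bounds, left=0.0, top=0.0, right=1.0, bottom=1.0):
    """Return (x0, y0, x1, y1) for a sub-region given as fractions of `bounds`.

    `bounds` is the full rectangle as (x0, y0, x1, y1). Fractions are measured
    from the top-left corner, with y growing downwards.
    """
    x0, y0, x1, y1 = bounds
    width, height = x1 - x0, y1 - y0
    return (x0 + left * width, y0 + top * height,
            x0 + right * width, y0 + bottom * height)


# Assumed page size of 612 x 792 points, for illustration only.
page_bounds = (0, 0, 612, 792)
upper_half = fractional_region(page_bounds, bottom=0.5)
print(upper_half)
```

In the notebook, the four numbers of such a tuple can be passed on as ```fitz.Rect(*upper_half)```, with the bounds taken from the ```x0```, ```y0```, ```x1```, and ```y1``` values of ```page.rect```.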

#### Bonus Exercise IV.1
Explore different settings for ```custom_rectangle``` by changing the arguments of
```fitz.Rect(...)``` and see how well you can parse the tables!

In [None]:
import fitz
import pandas as pd
import numpy as np

doc = fitz.open("data/A1_ViT_paper.pdf")

# Simpler table on page 5
page = doc[4]

# Slightly more complicated table on page 6
# page = doc[5]

pdf_rectangle = page.rect
custom_rectangle = fitz.Rect(pdf_rectangle.x0, 
                             pdf_rectangle.y0, 
                             pdf_rectangle.x1, 
                             pdf_rectangle.y1
                            )

print(f"pdf_rectangle:\t\t{pdf_rectangle}")
print(f"custom_rectangle:\t{custom_rectangle}\n")

table_vit_df = get_pandas_df(page, search_strategy='text', search_region=custom_rectangle)

doc.close()

table_vit_df