# 2. Reading data from an HTML web page

The goal of this notebook session is to read a data file that is displayed as an "HTML table" on a website. HTML consists of text enclosed by descriptive tags, which dictate how that text is displayed in a web browser. Each tag is bounded by angle brackets `<` and `>` and tells the web browser how to display the content it encloses. For example, the `<table>` tag indicates that a 2-dimensional table structure is to follow. Rows in the table are bounded by `<tr>` tags, and each row is either a header row (with `<th>` cells) or a data row (with `<td>` cells).
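To make the tag structure concrete, the sketch below counts the opening tags in a tiny table using only the Python standard library. The `sample_html` string and the `TagCounter` class are hypothetical, written for illustration; the column names are borrowed from the data set used later, but the values are invented.

```python
from html.parser import HTMLParser

# Hypothetical two-column table; the values are invented for illustration
sample_html = """
<table>
  <tr><th>Date</th><th>water_temp</th></tr>
  <tr><td>01/02/1980</td><td>11.5</td></tr>
  <tr><td>01/03/1980</td><td>12.0</td></tr>
</table>
"""

class TagCounter(HTMLParser):
    """Count how many times each opening tag appears."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser encounters
        self.counts[tag] = self.counts.get(tag, 0) + 1

counter = TagCounter()
counter.feed(sample_html)
print(counter.counts)  # {'table': 1, 'tr': 3, 'th': 2, 'td': 4}
```

One `<table>` holds three `<tr>` rows: one header row with two `<th>` cells and two data rows with two `<td>` cells each.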

<img src='./table_html.png' width='55%'/>

The data we will use in this session are from an inactive research site of the *Long Term Ecological Research Network*, called *North Inlet LTER*. The data consist of daily water samples from 1978 to 1992. These data are available from the *Environmental Data Initiative* (EDI) [data repository](https://portal.edirepository.org/nis) under the repository identifier [knb-lter-nin.1.1](https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-nin&identifier=1).

Although we do not have a website from which to scrape this data table, the HTML data table can be loaded from the desktop and operated on as if you had downloaded it directly from a website. We will use a Python package called `BeautifulSoup` to parse and scrape the HTML. BeautifulSoup reads the full HTML page and creates an in-memory document object model (a tree-like hierarchy) of the page structure and content. To access the data table, we will find the `<table>` element, which contains the header and data rows. From that table element, we will skip the header row and parse only the rows that contain `<td>` data elements into a Python `list` data structure.
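As a minimal sketch of that approach, the example below parses a small inline HTML string rather than the real `data.html` file. The `sample_html` content is hypothetical, and the sketch uses Python's built-in `'html.parser'` backend instead of `lxml` so that it runs with only `beautifulsoup4` installed. Rather than slicing off the first row, it keeps only rows that contain `<td>` cells, which is one possible way to skip the header.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; the values are invented
sample_html = """
<html><body>
<table>
  <tr><th>Date</th><th>water_temp</th></tr>
  <tr><td>01/02/1980</td><td>11.5</td></tr>
  <tr><td>01/03/1980</td><td>12.0</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')  # Build the in-memory DOM
html_table = soup.find('table')                   # First <table> element in the page

rows = []
for tr in html_table.find_all('tr'):
    cells = tr.find_all('td')
    if cells:  # The header row holds only <th> cells, so it yields an empty list
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)  # [['01/02/1980', '11.5'], ['01/03/1980', '12.0']]
```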

Sidenote: If you were to install the Python `requests` package, you could use the following code to download an HTML web page for parsing and scraping by BeautifulSoup:

```Python
import requests

url = 'https://datawebsite.org/data.html'
r = requests.get(url)
html_data_table = r.text
```

#### References

1. [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#)
1. [lxml](http://lxml.de/)

In [None]:
!pip install lxml
!pip install beautifulsoup4

### Read the data table HTML file `data.html` into a `BeautifulSoup` webpage DOM object. 

The `data.html` file contains the same data as the `LTER.NIN.DWS.csv` data file.

In [None]:
from bs4 import BeautifulSoup

with open('./data.html', 'r') as f: # Open html file for reading
    soup = BeautifulSoup(f, 'lxml') # Create a beautiful soup DOM object

In [None]:
html_table = soup.find('table') # Find the single, known data table tag object in the HTML web page
table_rows = html_table('tr') # Returns a list of "<tr>" tag objects

table = []
for table_row in table_rows[1:]:
    row = []
    row_data = table_row('td') # Returns a list of "<td>" tag objects
    for data_token in row_data:
        row.append(data_token.text) # Extract the text contents of the "<td>" element
    table.append(row)

In [None]:
for head in table[:9]:
    print(head)

In [None]:
len(table)
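
Besides the row count, it can be useful to confirm that every parsed row has the same number of fields before converting values by position. The sketch below is one way to do that; `sample_table` is a hypothetical stand-in for the parsed `table`, with one deliberately short row.

```python
from collections import Counter

# Hypothetical stand-in for the parsed `table`; the values are invented
sample_table = [
    ['01/02/1980', 'A', '11.5'],
    ['01/03/1980', 'A', '12.0'],
    ['01/04/1980', 'A'],  # Deliberately short row
]

# Tally how many rows have each width
width_counts = Counter(len(row) for row in sample_table)

# Treat the most common width as the expected one and flag the rest
expected_width = width_counts.most_common(1)[0][0]
bad_rows = [i for i, row in enumerate(sample_table) if len(row) != expected_width]

print(expected_width)  # 3
print(bad_rows)        # [2]
```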

In [None]:
# Populate data frame with coerced (converted) values from data table in column-major order

from datetime import datetime

df = [[] for _ in range(15)]  # One empty list per column (15 columns)
for row in table:
    date = datetime.strptime(row[0], '%m/%d/%Y')
    df[0].append(date)              # Date as datetime
    df[1].append(row[1])           # transect as unicode string
    df[2].append(float(row[2]))    # water_temp as float
    df[3].append(float(row[3]))    # SAL as float
    df[4].append(float(row[4]))    # TNW as float
    df[5].append(float(row[5]))    # TNF as float
    df[6].append(float(row[6]))    # TPW as float
    df[7].append(float(row[7]))    # TPF as float
    df[8].append(float(row[8]))    # POP as float
    df[9].append(float(row[9]))    # NHN as float
    df[10].append(float(row[10]))  # NNN as float
    df[11].append(int(row[11]))    # CHEM as integer
    df[12].append(float(row[12]))  # TOC as float
    df[13].append(float(row[13]))  # DOC as float
    df[14].append(float(row[14]))  # POC as float
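
The same row-to-column conversion can be written more compactly by pairing each column with a converter function. This is only a sketch of an alternative, not the approach used in this notebook: `sample_table` and `converters` are hypothetical names, and the three-column data is invented for illustration.

```python
from datetime import datetime

# Hypothetical three-column stand-in for the parsed `table`
sample_table = [
    ['01/02/1980', 'A', '11.5'],
    ['01/03/1980', 'A', '12.0'],
]

# One converter per column, in column order
converters = [
    lambda s: datetime.strptime(s, '%m/%d/%Y'),  # Date as datetime
    str,                                         # transect as string
    float,                                       # water_temp as float
]

# zip(*sample_table) transposes the rows into columns (column-major order)
columns = [
    [convert(value) for value in column]
    for convert, column in zip(converters, zip(*sample_table))
]

print(columns[2])  # [11.5, 12.0]
```

Keeping the converters in a list places the type of each column in one spot, at the cost of losing the per-column comments of the explicit loop.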

In [None]:
# Access the "Date" and "water_temp" columns and plot the data

import matplotlib.pyplot as plt

date = df[0]
water_temp = df[2]
    
fig, ax = plt.subplots()
ax.plot(date, water_temp, label='Water Temp')
ax.grid(True)
fig.autofmt_xdate()
fig.set_size_inches(10, 8)
plt.legend()
plt.show()