### [The DOM of an HTML page](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model) and Table Scraping

[Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) is a Python library for pulling data out of HTML and XML files. It builds a navigable tree on top of a parser such as Python's built-in html.parser, html5lib, or lxml.

Note:  
DOM stands for Document Object Model. It [represents a document as a logical tree](https://en.wikipedia.org/wiki/Document_Object_Model#/media/File:DOM-model.svg). Each branch of the tree ends in a node, and each node contains objects. DOM methods allow programmatic access to the tree.
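
To make the tree idea concrete, here is a minimal sketch that walks a small document node by node. It uses the standard library's `xml.dom.minidom` rather than Beautiful Soup, and the HTML snippet and the `walk` helper are made up for illustration. Note that minidom is an XML parser, so it only accepts well-formed markup, which real web pages often are not.

```python
from xml.dom import minidom

# Hypothetical, well-formed snippet used only for illustration.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><table><tr><td>A</td><td>B</td></tr></table></body></html>"
)

doc = minidom.parseString(html)


def walk(node, depth=0):
    """Return one indented line per element or non-empty text node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data.strip()))
    for child in node.childNodes:
        lines.extend(walk(child, depth + 1))
    return lines


# Print the tree, one node per line, indented by depth.
tree_lines = walk(doc.documentElement)
print("\n".join(tree_lines))

# A DOM method gives direct access to every <td> element in the tree.
cells = [td.firstChild.data for td in doc.getElementsByTagName("td")]
print(cells)
```

The cells below do the same kind of navigation with Beautiful Soup, which also copes with markup that is not well formed.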

In [None]:
import requests
# Uncomment to install the dependencies:
# !pip install beautifulsoup4
# !pip install html5lib   # or: !pip install lxml
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
url = "https://raw.githubusercontent.com/mdn/css-examples/main/learn/tasks/tables/table-download.html"
r = requests.get(url)

In [None]:
r.text

In [None]:
soup = BeautifulSoup(r.text, 'html.parser')
#soup = BeautifulSoup(r.text, 'lxml')
soup

In [None]:
soup.head

In [None]:
soup.body

In [None]:
soup.body.table

In [None]:
soup.body.table.thead

In [None]:
ths=soup.body.table.thead.find_all('th')
ths

In [None]:
# Each element of ths is a header cell, not a row
columns = [th.get_text() for th in ths]
columns

In [None]:
soup.body.table.tbody

In [None]:
trs = soup.body.table.tbody.find_all('tr')
trs

In [None]:
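for row in trs:
    # A body row may start with a <th> row header followed by <td> data cells
    row_ths = row.find_all('th')
    row_tds = row.find_all('td')

    # Header cells are listed first, then data cells
    row_data = [cell.get_text() for cell in row_ths + row_tds]
    print(row_data)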
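
The loop above only prints each row. A minimal sketch of turning the header names and the row lists into structured records might look like the following. The `columns` and `rows` values here are hypothetical stand-ins for the scraped header cells and body rows, not the actual contents of the table.

```python
# Hypothetical values standing in for the scraped header cells and body rows.
columns = ["Name", "Founded", "Albums"]
rows = [
    ["Band A", "1976", "6"],
    ["Band B", "1977", "4"],
]

# Pair each cell with its column name to get one dict per row.
records = [dict(zip(columns, row)) for row in rows]

# Scraped text always arrives as str, so numeric columns need explicit conversion.
for record in records:
    record["Founded"] = int(record["Founded"])
    record["Albums"] = int(record["Albums"])

print(records[0])
```

With pandas imported as in the first cell, `pd.DataFrame(rows, columns=columns)` builds the equivalent DataFrame directly from the row lists.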

In [None]:
soup.body.table.tfoot

### [pd.read_html](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)
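A convenience function that parses every HTML table on a page into a list of DataFrames. By default it tries lxml first and falls back to Beautiful Soup with html5lib, so at least one of those parsers must be installed.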

In [None]:
url = "https://ourworldindata.org/the-worlds-deadliest-earthquakes"
tables = pd.read_html(url, encoding='utf-8')
len(tables)

In [None]:
tables[0]

### [tabula-py](https://nbviewer.org/github/chezou/tabula-py/blob/master/examples/tabula_example.ipynb): A PDF table scraper
tabula-py converts tables in PDF files to pandas DataFrames. It is a wrapper around tabula-java, so Java must be installed on your machine.
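
Because of the Java requirement, it can help to check for a Java runtime before calling tabula-py. The sketch below uses only the standard library; `java_available` is a hypothetical helper name, not part of tabula-py, and it only checks that a `java` executable is on the PATH, not that it actually runs.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


if java_available():
    print("Java found at:", shutil.which("java"))
else:
    print("Java not found on PATH; tabula-py will likely fail to run.")
```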

In [None]:
#!pip install tabula-py
from tabula import read_pdf
url = "https://github.com/chezou/tabula-py/raw/master/tests/resources/data.pdf"
tables = read_pdf(url, pages='all')
len(tables)

In [None]:
tables[0]

In [None]:
tables[1]