# <center>Extracting Tabular Data from HTML</center>

<center>Dr. W.J.B. Mattingly</center>

<center>Smithsonian Data Science Lab and United States Holocaust Memorial Museum</center>

<center>March 2022</center>

## Covered in this Chapter

1) Table Tags<br>
2) TR Tags<br>
3) TH and TD Tags<br>
4) How to parse an HTML Table<br>

## Introduction

In this chapter, we will put our skills to work! We will learn how to extract tabular data with requests and BeautifulSoup. We will use many of the commands and methods we saw in the last chapter, but instead of extracting p tags, we will extract tabular data from the same Wikipedia page. All of this will prepare you for the final challenge of this textbook (introduced in the next chapter).

First, let's import the same libraries as in the last chapter, requests and BeautifulSoup.

In [1]:
import requests
from bs4 import BeautifulSoup

Let's also go ahead and make the same string object, the url of the page we want to scrape.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_French_monarchs"

Now, let's dive in!

## Finding the Tables on the Page

In the cell below, we will make the call to the Wikipedia server and convert the HTML content into a soup object as we saw in the last chapter.

In [3]:
s = requests.get(url)
soup = BeautifulSoup(s.content, "html.parser")

Now that we have our soup object, we can start to parse it. HTML tables are structured in much the same way across most sites. The main tag is the table tag. Wikipedia uses several kinds of tables; the ones we want have the class "wikitable". Let's go ahead and grab all of these tables and print how many there are on the page.

In [7]:
tables = soup.find_all("table", {"class": "wikitable"})
print (len(tables))

17
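To see how the class filter narrows the result, here is a minimal sketch that runs against a small hand-written HTML string rather than the live Wikipedia page. The `sample_html` string and the variable names are made up for illustration; only the `find_all` call mirrors the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: two tables, only one of which carries the "wikitable" class.
sample_html = """
<table class="infobox"><tr><td>Sidebar</td></tr></table>
<table class="wikitable sortable"><tr><th>Name</th></tr><tr><td>Louis III</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Without a filter, every table on the page is returned.
all_tables = sample_soup.find_all("table")

# With the class filter, only tables whose class list includes "wikitable" match.
wiki_tables = sample_soup.find_all("table", {"class": "wikitable"})

print(len(all_tables))
print(len(wiki_tables))
```

Note that the second table still matches even though it has an extra class ("sortable"), because BeautifulSoup checks each class on the tag individually.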


## Grabbing Rows of a Table

Excellent! Now that we have the tables, let's take a look at the first one's HTML.

In [10]:
first_table = tables[0]
print (first_table)

<table class="wikitable" width="95%">
<tbody><tr>
<th width="8%">Portrait
</th>
<th width="20%">Name
</th>
<th width="7%">King from
</th>
<th width="7%">King until
</th>
<th width="20%">Relationship with predecessor(s)
</th>
<th width="13%">Title
</th></tr>
<tr>
<td align="center"><a class="image" href="/wiki/File:Biblioth%C3%A8que_nationale_de_France_-_Bible_de_Vivien_Ms._Latin_1_folio_423r_d%C3%A9tail_Le_comte_Vivien_offre_le_manuscrit_de_la_Bible_faite_%C3%A0_l%27abbaye_de_Saint-Martin_de_Tours_%C3%A0_Charles_le_Chauve.jpg"><img alt="Bibliothèque nationale de France - Bible de Vivien Ms. Latin 1 folio 423r détail Le comte Vivien offre le manuscrit de la Bible faite à l'abbaye de Saint-Martin de Tours à Charles le Chauve.jpg" data-file-height="2462" data-file-width="2068" decoding="async" height="119" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/40/Biblioth%C3%A8que_nationale_de_France_-_Bible_de_Vivien_Ms._Latin_1_folio_423r_d%C3%A9tail_Le_comte_Vivien_offre_le_manuscrit_de

A table is a combination of two things: rows and cells. Rows will almost always be "tr" tags. This stands for table row. Cells will either be th tags or td tags. The th tag stands for table header. It is usually used in the first row to give the name of each column. The td tag stands for table data cell. These cells start on the first row that contains data and continue down until the table ends. In a simple table like this one, there are as many headers as there are columns of data. We can use this structure to our advantage.
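The structure described above can be checked on a tiny table. The sketch below uses a hand-written `sample_html` string (an assumption for illustration, with values borrowed from the Wikipedia table) and counts the th and td cells in each row.

```python
from bs4 import BeautifulSoup

# Hypothetical three-column table, written by hand for illustration.
sample_html = """
<table>
  <tr><th>Name</th><th>King from</th><th>King until</th></tr>
  <tr><td>Louis III</td><td>10 April 879</td><td>5 August 882</td></tr>
  <tr><td>Carloman II</td><td>5 August 882</td><td>6 December 884</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

# Record (number of th cells, number of td cells) for every row.
cell_counts = []
for row in sample_table.find_all("tr"):
    cell_counts.append((len(row.find_all("th")), len(row.find_all("td"))))

print(cell_counts)
```

The header row contains only th cells, and each data row contains the same number of td cells as there are headers.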

Let's find all the rows in the first table.

In [12]:
rows = first_table.find_all("tr")

Excellent! Now, let's iterate over all the rows.

In [14]:
for row in rows:
    print (row)

<tr>
<th width="8%">Portrait
</th>
<th width="20%">Name
</th>
<th width="7%">King from
</th>
<th width="7%">King until
</th>
<th width="20%">Relationship with predecessor(s)
</th>
<th width="13%">Title
</th></tr>
<tr>
<td align="center"><a class="image" href="/wiki/File:Biblioth%C3%A8que_nationale_de_France_-_Bible_de_Vivien_Ms._Latin_1_folio_423r_d%C3%A9tail_Le_comte_Vivien_offre_le_manuscrit_de_la_Bible_faite_%C3%A0_l%27abbaye_de_Saint-Martin_de_Tours_%C3%A0_Charles_le_Chauve.jpg"><img alt="Bibliothèque nationale de France - Bible de Vivien Ms. Latin 1 folio 423r détail Le comte Vivien offre le manuscrit de la Bible faite à l'abbaye de Saint-Martin de Tours à Charles le Chauve.jpg" data-file-height="2462" data-file-width="2068" decoding="async" height="119" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/40/Biblioth%C3%A8que_nationale_de_France_-_Bible_de_Vivien_Ms._Latin_1_folio_423r_d%C3%A9tail_Le_comte_Vivien_offre_le_manuscrit_de_la_Bible_faite_%C3%A0_l%27abbaye_de_Saint-Ma

We see exactly the same HTML that we saw above. Now that we can access the rows, let's try to access the first row.

## Find Cells

To do this, we will need to find each cell. Remember, on row 1, we are working with th tags because these are headers.

In [15]:
first_row = rows[0]
print (first_row)

<tr>
<th width="8%">Portrait
</th>
<th width="20%">Name
</th>
<th width="7%">King from
</th>
<th width="7%">King until
</th>
<th width="20%">Relationship with predecessor(s)
</th>
<th width="13%">Title
</th></tr>


Let's now try to find all the cells in the first row.

In [25]:
cells = first_row.find_all("th")
print (cells)

[<th width="8%">Portrait
</th>, <th width="20%">Name
</th>, <th width="7%">King from
</th>, <th width="7%">King until
</th>, <th width="20%">Relationship with predecessor(s)
</th>, <th width="13%">Title
</th>]


We can iterate over these cells and print off their text. I am adding .strip() here to remove the leading and trailing whitespace, including line breaks.

In [18]:
for cell in cells:
    print (cell.text.strip())

Portrait
Name
King from
King until
Relationship with predecessor(s)
Title
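To make the effect of .strip() visible, the following sketch compares the raw and the stripped text of two header cells. The `sample_html` string is a hand-written stand-in that mimics the line breaks in the Wikipedia HTML, and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written header row that keeps the line break before each closing th tag.
sample_html = """<table><tr>
<th width="8%">Portrait
</th>
<th width="20%">Name
</th></tr></table>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_cells = sample_soup.find_all("th")

# The raw text keeps the trailing line break from the HTML source.
raw = [cell.text for cell in sample_cells]

# .strip() removes whitespace and line breaks from both ends of each string.
headers = [cell.text.strip() for cell in sample_cells]

print(raw)
print(headers)
```

Collecting the stripped text in a list, rather than printing it, keeps the header names available for later use.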


## Iterating Across the Entire Table

Now that we know how to grab all tables, all rows within a table, and all cells within a row, let's try and iterate over the entire first table. First, let's grab the headers.

In [23]:
rows = first_table.find_all("tr")
for row in rows:
    cells = row.find_all("th")
    for cell in cells:
        print (cell.text.strip())

Portrait
Name
King from
King until
Relationship with predecessor(s)
Title


Now that we know all the headers, let's explore all the table data cells, which start on the next row.

In [24]:
rows = first_table.find_all("tr")
for row in rows:
    cells = row.find_all("td")
    for cell in cells:
        print (cell.text.strip())


Charles II the Bald
August 843(King of the Franks from 20 June 840)
6 October 877
• Son of Louis I the Pious
King of the FranksEmperor of the Romans (875–77)

Louis II the Stammerer
6 October 877
10 April 879
• Son of Charles II the Bald
King of the Franks

Louis III
10 April 879
5 August 882
• Son of Louis II the Stammerer
King of the Franks

Carloman II
5 August 882
6 December 884
• Son of Louis II the Stammerer
 • Younger brother of Louis III
King of the Franks

Charles the Fat
20 May 885
13 January 888
• Son of Louis II the German • Cousin once removed of Carloman II and Louis III • Grandson of Louis I the Pious
King of the FranksEmperor of the Romans (881–88)
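The blank lines in the output above come from the portrait cells, which hold an image and no text. As a rough sketch of how the two loops could be combined, the example below walks a small hand-written table once and keeps the text of each row together in its own list. The `sample_html` string and the `table_rows` name are assumptions for illustration, not part of the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical table with an image-only portrait cell, as in the Wikipedia table.
sample_html = """
<table>
  <tr><th>Portrait</th><th>Name</th><th>King from</th></tr>
  <tr><td><img src="portrait.jpg" alt=""/></td><td>Louis III</td><td>10 April 879</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

table_rows = []
for row in sample_table.find_all("tr"):
    # Passing a list of tag names matches th and td cells in document order.
    cells = row.find_all(["th", "td"])
    table_rows.append([cell.text.strip() for cell in cells])

print(table_rows)
```

Grouping the cells by row keeps the row boundaries that are lost when every cell is printed on its own line, and the image-only cell shows up as an empty string.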


Now that you know how to grab all of this data, it is time to bring everything together for your final task. In the next chapter, we will extract this data into a properly structured form stored within Python, then load it into a Pandas DataFrame, and finally save it as a .csv file.