# Pandas read_html()

This is a notebook for the Medium article [Pandas read_html](https://bindichen.medium.com/all-pandas-read-html-you-should-know-for-scraping-data-from-html-tables-a3cbb5ce8274).

Please check out the article for instructions.

**License**: [BSD 2-Clause](https://opensource.org/licenses/BSD-2-Clause)

In [1]:
import pandas as pd
from IPython.display import display_html

## 1. Read an HTML table from a string

In [2]:
html_string = """
<table>
  <thead>
    <tr>
      <th>date</th>
      <th>name</th>
      <th>year</th>
      <th>cost</th>
      <th>region</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2020-01-01</td>
      <td>Jenny</td>
      <td>1998</td>
      <td>0.2</td>
      <td>South</td>
    </tr>
    <tr>
      <td>2020-01-02</td>
      <td>Alice</td>
      <td>1992</td>
      <td>-1.34</td>
      <td>East</td>
    </tr>
    <tr>
      <td>2020-01-03</td>
      <td>Tomas</td>
      <td>1982</td>
      <td>1.00023</td>
      <td>South</td>
    </tr>
  </tbody>
</table>
"""

# Display the HTML representation
display_html(html_string, raw=True)

date,name,year,cost,region
2020-01-01,Jenny,1998,0.2,South
2020-01-02,Alice,1992,-1.34,East
2020-01-03,Tomas,1982,1.00023,South


### Read html string into a list of DataFrames

In [3]:
dfs = pd.read_html(html_string)
# The result is a list of DataFrames, not a single DataFrame
dfs

[         date   name  year     cost region
 0  2020-01-01  Jenny  1998  0.20000  South
 1  2020-01-02  Alice  1992 -1.34000   East
 2  2020-01-03  Tomas  1982  1.00023  South]

In [4]:
# Result is a Python list
type(dfs)

list

In [5]:
# Use index to get the table
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,0.2,South
1,2020-01-02,Alice,1992,-1.34,East
2,2020-01-03,Tomas,1982,1.00023,South


In [6]:
dfs[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    3 non-null      object 
 1   name    3 non-null      object 
 2   year    3 non-null      int64  
 3   cost    3 non-null      float64
 4   region  3 non-null      object 
dtypes: float64(1), int64(1), object(3)
memory usage: 248.0+ bytes
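
Recent pandas releases (2.1 and later) deprecate passing a literal HTML string straight to `read_html()` and recommend wrapping it in `io.StringIO` instead. The following is a minimal sketch of that variant; the two-row table and the name `small_html` are made up for illustration and are not part of the article.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row table, used only for this illustration
small_html = """
<table>
  <thead>
    <tr><th>name</th><th>year</th></tr>
  </thead>
  <tbody>
    <tr><td>Jenny</td><td>1998</td></tr>
    <tr><td>Alice</td><td>1992</td></tr>
  </tbody>
</table>
"""

# Wrapping the string in StringIO turns it into a file-like object
tables = pd.read_html(StringIO(small_html))

# read_html() returns a list, even when only one table is found
df = tables[0]
print(df)
```

The same wrapping should work for every `pd.read_html(html_string, ...)` call in this notebook.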


## 2. Reading tables from a URL

In [7]:
URL = 'https://en.wikipedia.org/wiki/London'

dfs = pd.read_html(URL)

print(f'Total tables: {len(dfs)}')

Total tables: 31


In [8]:
dfs[6]

Unnamed: 0_level_0,2011 United Kingdom Census[225],2011 United Kingdom Census[225]
Unnamed: 0_level_1,Country of birth,Population
0,United Kingdom,5175677
1,India,262247
2,Poland,158300
3,Ireland,129807
4,Nigeria,114718
5,Pakistan,112457
6,Bangladesh,109948
7,Jamaica,87467
8,Sri Lanka,84542
9,France,66654


## 3. Reading tables from a file

In [9]:
file_path = 'html_string.txt'
with open(file_path, 'r') as f:
    dfs = pd.read_html(f.read())

dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,0.2,South
1,2020-01-02,Alice,1992,-1.34,East
2,2020-01-03,Tomas,1982,1.00023,South
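
Instead of opening the file and passing its contents, a path can also be handed to `read_html()` directly. A minimal sketch, assuming a temporary file created just for the demo; the file name `table.html` and the variable names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical table content, written to a temporary file for the demo
table_html = """
<table>
  <thead>
    <tr><th>name</th><th>region</th></tr>
  </thead>
  <tbody>
    <tr><td>Jenny</td><td>South</td></tr>
    <tr><td>Alice</td><td>East</td></tr>
  </tbody>
</table>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "table.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(table_html)

    # Pass the path itself and let pandas open and parse the file
    from_path = pd.read_html(path)[0]

print(from_path)
```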


## 4. Parsing date columns using `parse_dates`

In [10]:
dfs = pd.read_html(html_string, parse_dates=['date'])

In [11]:
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,0.2,South
1,2020-01-02,Alice,1992,-1.34,East
2,2020-01-03,Tomas,1982,1.00023,South


In [12]:
dfs[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    3 non-null      datetime64[ns]
 1   name    3 non-null      object        
 2   year    3 non-null      int64         
 3   cost    3 non-null      float64       
 4   region  3 non-null      object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 248.0+ bytes
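
As an alternative to `parse_dates`, the column can be converted after reading with `pd.to_datetime()`. The sketch below assumes a small made-up table (`dated_html`) and an explicit `%Y-%m-%d` format, so values that do not fit the format raise an error instead of being guessed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with an ISO-formatted date column
dated_html = """
<table>
  <thead>
    <tr><th>date</th><th>cost</th></tr>
  </thead>
  <tbody>
    <tr><td>2020-01-01</td><td>0.2</td></tr>
    <tr><td>2020-01-02</td><td>-1.34</td></tr>
  </tbody>
</table>
"""

df = pd.read_html(StringIO(dated_html))[0]

# Convert the text column to datetime after reading
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

print(df.dtypes)
```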


## 5. Converting values with `converters`

In [13]:
html_string = """
<table>
  <tr>
    <th>ID</th>
    <th>date</th>
    <th>name</th>
    <th>year</th>
    <th>cost</th>
    <th>region</th>
  </tr>
  <tr>
    <td>0001</td>
    <td>2020-01-01</td>
    <td>Jenny</td>
    <td>1998</td>
    <td>0.2</td>
    <td>South</td>
  </tr>
  <tr>
    <td>0002</td>
    <td>2020-01-02</td>
    <td>Alice</td>
    <td>1992</td>
    <td>-1.34</td>
    <td>East</td>
  </tr>
  <tr>
    <td>0003</td>
    <td>2020-01-03</td>
    <td>Tomas</td>
    <td>1982</td>
    <td>1.00023</td>
    <td>South</td>
  </tr>
</table>
"""

# Display the HTML representation
display_html(html_string, raw=True)

ID,date,name,year,cost,region
0001,2020-01-01,Jenny,1998,0.2,South
0002,2020-01-02,Alice,1992,-1.34,East
0003,2020-01-03,Tomas,1982,1.00023,South


In [14]:
dfs = pd.read_html(html_string, converters={
    'ID': str,
    'year': int,
    'cost': float,
})

In [15]:
dfs[0]

Unnamed: 0,ID,date,name,year,cost,region
0,0001,2020-01-01,Jenny,1998,0.2,South
1,0002,2020-01-02,Alice,1992,-1.34,East
2,0003,2020-01-03,Tomas,1982,1.00023,South


In [16]:
dfs[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      3 non-null      object 
 1   date    3 non-null      object 
 2   name    3 non-null      object 
 3   year    3 non-null      int64  
 4   cost    3 non-null      float64
 5   region  3 non-null      object 
dtypes: float64(1), int64(1), object(4)
memory usage: 272.0+ bytes
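
To make the effect of `converters` visible, the sketch below reads the same table twice, once with the default type inference and once with `str` applied to the `ID` column. The table `id_html` is a shortened, made-up version of the one above.

```python
from io import StringIO

import pandas as pd

# Hypothetical table whose ID column has leading zeros
id_html = """
<table>
  <thead>
    <tr><th>ID</th><th>name</th></tr>
  </thead>
  <tbody>
    <tr><td>0001</td><td>Jenny</td></tr>
    <tr><td>0002</td><td>Alice</td></tr>
  </tbody>
</table>
"""

# Default behaviour: ID is inferred as an integer, so the leading zeros are lost
default_df = pd.read_html(StringIO(id_html))[0]

# With a converter: ID is kept as text, including the leading zeros
string_df = pd.read_html(StringIO(id_html), converters={"ID": str})[0]

print(default_df["ID"].tolist())
print(string_df["ID"].tolist())
```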


## 6. MultiIndex, header, and index column

In [17]:
html_string = """
<table>
  <thead>
    <tr>
      <th colspan="5">Year 2020</th>
    </tr>
    <tr>
      <th>date</th>
      <th>name</th>
      <th>year</th>
      <th>cost</th>
      <th>region</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2020-01-01</td>
      <td>Jenny</td>
      <td>1998</td>
      <td>1.2</td>
      <td>South</td>
    </tr>
    <tr>
      <td>2020-01-02</td>
      <td>Alice</td>
      <td>1992</td>
      <td>-1.34</td>
      <td>East</td>
    </tr>
  </tbody>
</table>
"""

# Display the HTML representation
display_html(html_string, raw=True)

Year 2020,Year 2020,Year 2020,Year 2020,Year 2020
date,name,year,cost,region
2020-01-01,Jenny,1998,1.2,South
2020-01-02,Alice,1992,-1.34,East


In [18]:
# A MultiIndex is created because <thead> contains multiple rows
dfs = pd.read_html(html_string)
dfs[0]

Unnamed: 0_level_0,Year 2020,Year 2020,Year 2020,Year 2020,Year 2020
Unnamed: 0_level_1,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,-1.34,East


In [19]:
# Specify a header row:
dfs = pd.read_html(html_string, header=1)
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,-1.34,East


In [20]:
# Specify an index column
dfs = pd.read_html(html_string, header=1, index_col=0)
dfs[0]

date,name,year,cost,region
2020-01-01,Jenny,1998,1.2,South
2020-01-02,Alice,1992,-1.34,East
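
When the MultiIndex header is not wanted, another option is to read the table as-is and drop the outer level afterwards. A minimal sketch on a made-up two-column table (`multi_html`), using `droplevel(0)` on the columns:

```python
from io import StringIO

import pandas as pd

# Hypothetical table with two header rows in <thead>
multi_html = """
<table>
  <thead>
    <tr><th colspan="2">Year 2020</th></tr>
    <tr><th>date</th><th>name</th></tr>
  </thead>
  <tbody>
    <tr><td>2020-01-01</td><td>Jenny</td></tr>
    <tr><td>2020-01-02</td><td>Alice</td></tr>
  </tbody>
</table>
"""

df = pd.read_html(StringIO(multi_html))[0]

# Work on a copy and keep only the innermost header level
flat = df.copy()
flat.columns = flat.columns.droplevel(0)

print(flat)
```

Unlike `header=1`, this keeps the outer level available in `df` in case it is needed later.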


## 7. Matching a table with the `match` argument

In [21]:
# table with caption
html_string = """
<table id="report">
  <caption>2020 report</caption>
  <thead>
    <tr>
      <th>date</th>
      <th>name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2020-01-01</td>
      <td>Jenny</td>
    </tr>
    <tr>
      <td>2020-01-02</td>
      <td>Alice</td>
    </tr>
  </tbody>
</table>

<table>
  <caption>Average income</caption>
  <thead>
    <tr>
      <th>name</th>
      <th>income</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tom</td>
      <td>200</td>
    </tr>
    <tr>
      <td>James</td>
      <td>300</td>
    </tr>
  </tbody>
</table>
"""
# Display the HTML representation
display_html(html_string, raw=True)

date,name
2020-01-01,Jenny
2020-01-02,Alice

name,income
Tom,200
James,300


In [22]:
# matching caption
dfs = pd.read_html(html_string, match='2020 report')
print(len(dfs))

1


In [23]:
dfs[0]

Unnamed: 0,date,name
0,2020-01-01,Jenny
1,2020-01-02,Alice


In [24]:
# matching value
dfs = pd.read_html(html_string, match='James')
print(len(dfs))

1


In [57]:
dfs[0]

Unnamed: 0,name,income
0,Tom,200
1,James,300
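
`match` also accepts a regular expression, which helps when the exact text is not known in advance. The sketch below uses a made-up pair of tables (`two_tables_html`) and a hypothetical pattern that matches either "income" or "salary".

```python
import re
from io import StringIO

import pandas as pd

# Hypothetical page with two tables
two_tables_html = """
<table>
  <caption>2020 report</caption>
  <thead>
    <tr><th>date</th><th>name</th></tr>
  </thead>
  <tbody>
    <tr><td>2020-01-01</td><td>Jenny</td></tr>
  </tbody>
</table>

<table>
  <caption>Average income</caption>
  <thead>
    <tr><th>name</th><th>income</th></tr>
  </thead>
  <tbody>
    <tr><td>Tom</td><td>200</td></tr>
  </tbody>
</table>
"""

# Only tables containing text that matches the pattern are returned
pattern = re.compile(r"income|salary")
tables = pd.read_html(StringIO(two_tables_html), match=pattern)

print(len(tables))
print(tables[0])
```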


## 8. Filtering tables with `attrs`

In [25]:
dfs = pd.read_html(html_string, attrs={'id': 'report'})

In [26]:
dfs[0]

Unnamed: 0,date,name
0,2020-01-01,Jenny
1,2020-01-02,Alice


## 9. Working with missing values

In [27]:
html_string = """
<table>
  <tr>
    <th>date</th>
    <th>name</th>
    <th>year</th>
    <th>cost</th>
    <th>region</th>
  </tr>
  <tr>
    <td>2020-01-01</td>
    <td>Jenny</td>
    <td>1998</td>
    <td>1.2</td>
    <td>South</td>
  </tr>
  <tr>
    <td>2020-01-02</td>
    <td>Alice</td>
    <td>1992</td>
    <td></td>
    <td>East</td>
  </tr>
  <tr>
    <td>2020-01-03</td>
    <td>Tomas</td>
    <td>1982</td>
    <td></td>
    <td>South</td>
  </tr>
</table>
"""

# Display the HTML representation
display_html(html_string, raw=True)

date,name,year,cost,region
2020-01-01,Jenny,1998,1.2,South
2020-01-02,Alice,1992,,East
2020-01-03,Tomas,1982,,South


In [28]:
# By default, all empty strings are treated as missing values and read as NaN
dfs = pd.read_html(html_string)
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,,East
2,2020-01-03,Tomas,1982,,South


In [29]:
# Keep empty strings instead of converting them to NaN
dfs = pd.read_html(html_string, keep_default_na=False)
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,,East
2,2020-01-03,Tomas,1982,,South


In [30]:
html_string = """
<table>
  <tr>
    <th>date</th>
    <th>name</th>
    <th>year</th>
    <th>cost</th>
    <th>region</th>
  </tr>
  <tr>
    <td>2020-01-01</td>
    <td>Jenny</td>
    <td>1998</td>
    <td>1.2</td>
    <td>South</td>
  </tr>
  <tr>
    <td>2020-01-02</td>
    <td>Alice</td>
    <td>1992</td>
    <td>?</td>
    <td>East</td>
  </tr>
  <tr>
    <td>2020-01-03</td>
    <td>Tomas</td>
    <td>1982</td>
    <td>&</td>
    <td>South</td>
  </tr>
</table>
"""

# Display the HTML representation
display_html(html_string, raw=True)

date,name,year,cost,region
2020-01-01,Jenny,1998,1.2,South
2020-01-02,Alice,1992,?,East
2020-01-03,Tomas,1982,&,South


In [31]:
# Treat other characters as missing values
dfs = pd.read_html(html_string, na_values=['?', '&'])
dfs[0]

Unnamed: 0,date,name,year,cost,region
0,2020-01-01,Jenny,1998,1.2,South
1,2020-01-02,Alice,1992,,East
2,2020-01-03,Tomas,1982,,South
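
Once the placeholders are read as NaN, the usual pandas tools for missing data apply. A minimal sketch, assuming a made-up table (`missing_html`) in which `?` and `-` mark missing costs, that counts the missing values and fills them with `0.0`:

```python
from io import StringIO

import pandas as pd

# Hypothetical table where "?" and "-" stand for missing costs
missing_html = """
<table>
  <thead>
    <tr><th>name</th><th>cost</th></tr>
  </thead>
  <tbody>
    <tr><td>Jenny</td><td>1.2</td></tr>
    <tr><td>Alice</td><td>?</td></tr>
    <tr><td>Tomas</td><td>-</td></tr>
  </tbody>
</table>
"""

df = pd.read_html(StringIO(missing_html), na_values=["?", "-"])[0]

# Count the missing values in each column
missing_per_column = df.isna().sum()

# Fill the missing costs with a default value (0.0 is an arbitrary choice)
filled = df.fillna({"cost": 0.0})

print(missing_per_column)
print(filled)
```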
