
In [70]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [115]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index'

In [116]:
response = requests.get(url)
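Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, not part of the original notebook: `fetch_html` is a hypothetical helper and the User-Agent string is a placeholder. Wikimedia asks automated clients to identify themselves, so a request without a descriptive User-Agent may be rejected.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML, raising requests.HTTPError on a 4xx/5xx status."""
    # Placeholder identification string (an assumption); replace with your own details.
    headers = {"User-Agent": "hdi-scraping-notebook/0.1 (educational use)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Fail loudly here instead of parsing an error page later on.
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index')
```

Raising early keeps a blocked or failed request from surfacing later as a confusing "table not found" problem.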

In [117]:
soup = BeautifulSoup(response.text, 'html.parser')

In [118]:
print(f"There are {len(soup.find_all('table'))} 'table' tags.")

There are 14 'table' tags.


On inspecting the page, I find that the table I am interested in has class = 'sortable' and is the first table with that class.

In [119]:
print(f"There are {len(soup.find_all('table', {'class' : 'sortable'}))} 'table' tags having class = 'sortable'.")

There are 4 'table' tags having class = 'sortable'.
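When counting matching tags, note that `find` returns a single tag, and `len()` of a tag is its number of children, whereas `find_all` returns every match. A small sketch on a made-up HTML snippet (not the Wikipedia page; all names here are illustrative):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = """
<table class="wikitable sortable"><tr><td>a</td></tr></table>
<table class="wikitable"><tr><td>b</td></tr></table>
<table class="sortable"><tr><td>c</td></tr></table>
"""
toy_soup = BeautifulSoup(html, "html.parser")

# find_all returns all matching tags; a tag matches if 'sortable' is one of its classes.
sortable_tables = toy_soup.find_all("table", {"class": "sortable"})
n_sortable = len(sortable_tables)

# find returns only the first match; len() of a tag counts its direct children.
first_table = toy_soup.find("table", {"class": "sortable"})
n_children = len(first_table)

print(f"matching tables: {n_sortable}, children of the first match: {n_children}")
```

The two numbers answer different questions, so only `len(find_all(...))` is a count of tables.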


I access the first one only.

In [120]:
tables_with_sortable = soup.find_all('table', {'class' : 'sortable'})

my_table = tables_with_sortable[0]

However, in the table I need, not all rows have the same number of tags. Let us check which tag counts occur.

In [121]:
all_rows = my_table.find_all('tr')
len(all_rows)

194

In [122]:
all_rows[0] # the header row

<tr>
<th scope="col">Rank
</th>
<th data-sort-type="number" scope="col"><abbr title="Change since 2015">Δ</abbr>
</th>
<th scope="col" style="width:17em;">Country or territory
</th>
<th scope="col">HDI value
</th>
<th data-sort-type="number" scope="col">%<br/>annual growth<br/>(2010–2023)
</th></tr>

In [123]:
data_row_lengths = [len(row.find_all(['th', 'td'])) for row in all_rows[1:]]  # skip the header row

print(f"Different number of 'th' and 'td' tags in data rows are {np.unique(np.array(data_row_lengths))}.")

Different number of 'th' and 'td' tags in data rows are [3 5].


I am not interested in the 2nd column (Δ) of the table.

If a row has 5 'th' and 'td' tags in total, they correspond to the following :

$\bullet$ 1st 'td' : rank

$\bullet$ 2nd 'td' : Δ (not needed)

$\bullet$ 1st 'th' : country_name

$\bullet$ 3rd 'td' : HDI

$\bullet$ 4th 'td' : annual_growth

If a row has only 3 'th' and 'td' tags in total, the rank and HDI tags are absent (presumably because tied countries share the rank and HDI cells of the row above), and the remaining tags correspond to the following :

$\bullet$ 1st 'td' : Δ (not needed)

$\bullet$ 1st 'th' : country_name

$\bullet$ 2nd 'td' : annual_growth
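One plausible reason for the 3-tag rows is that the page merges the rank and HDI cells of tied countries with `rowspan`; this is an assumption, not verified here against the page source. Under that assumption, a toy table reproduces the 5-versus-3 pattern. The Norway and Switzerland values come from the scraped output, while the empty Δ cells are placeholders:

```python
from bs4 import BeautifulSoup

# Made-up two-row table: the rank and HDI cells span both rows (a tie).
html = """
<table>
<tr><td rowspan="2">2</td><td></td><th>Norway</th><td rowspan="2">0.970</td><td>0.25%</td></tr>
<tr><td></td><th>Switzerland</th><td>0.24%</td></tr>
</table>
"""
toy_table = BeautifulSoup(html, "html.parser")

# The second row has no rank or HDI tag of its own, so it holds only 3 tags.
tag_counts = [len(tr.find_all(["th", "td"])) for tr in toy_table.find_all("tr")]
print(tag_counts)
```

A merged cell appears only in the first row it covers, which is why the later rows come out shorter when the tags are counted row by row.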

In [124]:
header_row = [entry.text.strip() for entry in all_rows[0].find_all('th')]
header_row

['Rank', 'Δ', 'Country or territory', 'HDI value', '%annual growth(2010–2023)']

I remove the 2nd element as I am not interested in the 2nd column of the table.

In [125]:
header_row = header_row[:1] + header_row[2:]
header_row

['Rank', 'Country or territory', 'HDI value', '%annual growth(2010–2023)']

Now I work on the data rows.

In [126]:
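rows = [header_row]

for tr in all_rows[1:] :
  cells = tr.find_all(['th', 'td'])

  a_full_row = [entry.text.strip() for entry in cells]

  if len(cells) == 5 :
    # full row : drop the 2nd cell (Δ)
    my_row = a_full_row[:1] + a_full_row[2:]
  elif len(cells) == 3 :
    # rank and HDI are absent : drop the 1st cell (Δ) and pad with None
    my_row = [None] + a_full_row[1:2] + [None] + a_full_row[2:]
  else :
    # unexpected layout : fail instead of silently re-using the previous my_row
    raise ValueError(f"Unexpected number of cells : {len(cells)}")

  rows.append(my_row)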

In [127]:
df = pd.DataFrame(rows)

In [128]:
df.shape

(194, 4)

In [129]:
df.head(10)

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Rank | Country or territory | HDI value | %annual growth(2010–2023) |
| 1 | 1 | Iceland | 0.972 | 0.28% |
| 2 | 2 | Norway | 0.970 | 0.25% |
| 3 |  | Switzerland |  | 0.24% |
| 4 | 4 | Denmark | 0.962 | 0.35% |
| 5 | 5 | Germany | 0.959 | 0.19% |
| 6 |  | Sweden |  | 0.38% |
| 7 | 7 | Australia | 0.958 | 0.20% |
| 8 | 8 | Netherlands | 0.955 | 0.26% |
| 9 |  | Hong Kong |  | 0.38% |
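In this frame the header is still an ordinary data row, and tied countries have no rank or HDI. A possible clean-up is sketched below on a few rows copied from the output above. The column names are hypothetical, and carrying values forward assumes that a missing rank or HDI always means a tie with the row above.

```python
import pandas as pd

# A few rows copied from the scraped output; the first row holds the header.
rows = [
    ["Rank", "Country or territory", "HDI value", "%annual growth(2010–2023)"],
    ["1", "Iceland", "0.972", "0.28%"],
    ["2", "Norway", "0.970", "0.25%"],
    [None, "Switzerland", None, "0.24%"],
    ["4", "Denmark", "0.962", "0.35%"],
]

# Use the data rows only, with short hypothetical column names.
clean = pd.DataFrame(rows[1:], columns=["rank", "country", "hdi", "annual_growth_pct"])

# Assumption: a missing rank/HDI is a tie, so carry the previous value forward.
clean["rank"] = pd.to_numeric(clean["rank"]).ffill().astype(int)
clean["hdi"] = pd.to_numeric(clean["hdi"]).ffill()

# Strip the percent sign; also map a Unicode minus sign to ASCII in case one occurs.
clean["annual_growth_pct"] = (
    clean["annual_growth_pct"]
    .str.rstrip("%")
    .str.replace("\u2212", "-", regex=False)
    .astype(float)
)

print(clean)
```

Applied to the full `rows` list, the same steps would give numeric columns that can be sorted or merged with other country-level data.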
