# Extracting GDP per Capita
I am practicing scraping data, cleaning it, and presenting it in a dataframe.

# Dependencies
I'll be using cuDF from [RAPIDS](https://rapids.ai/start.html#get-rapids) (a GPU dataframe library with a pandas-like API) to store the data as a dataframe.

I'll use [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) to extract the data.

A conda environment file is provided, which I recommend you install using [mamba](https://github.com/mamba-org/mamba).


In [3]:
# create a python environment with rapids
#! mamba create -n data_scraping -c rapidsai -c nvidia -c conda-forge rapids=22.08 python=3.9 cudatoolkit=11.5 beautifulsoup4 -y


Looking for: ['rapids=22.08', 'python=3.9', 'cudatoolkit=11.5', 'beautifulsoup4']

conda-forge/linux-64         

In [1]:
# write environment to file (from https://www.machinelearningplus.com/deployment/conda-create-environment-and-everything-you-need-to-know-to-manage-conda-virtual-environment/)
# !mamba activate data_scraping
# !mamba env export --from-history > environment.yml

Run 'mamba init' to be able to run mamba activate/deactivate
and start a new shell session. Or use conda to activate/deactivate.



In [None]:
# install the environment like this
!mamba env create -f environment.yml

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [3]:
DATA_URL = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita"
data = urlopen(DATA_URL)
soup = BeautifulSoup(data, 'html.parser')

In [10]:
data_table = soup.find("table", class_="wikitable")
data_table

<table border="1" class="wikitable sortable static-row-numbers plainrowheaders srn-white-background" style="text-align:right;">
<caption>GDP per capita (US$ PPP) by country or <link href="mw-data:TemplateStyles:r981673959" rel="mw-deduplicated-inline-style"/><span class="legend-color" style="background-color:#F0E891; color:black;"> territory or non IMF members </span>
</caption>
<tbody><tr class="static-row-header" style="text-align:center;vertical-align:bottom;">
<th rowspan="2">Country/Territory
</th>
<th rowspan="2"><a href="/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN Region</a>
</th>
<th colspan="2"><a href="/wiki/International_Monetary_Fund" title="International Monetary Fund">IMF</a><sup class="reference" id="cite_ref-IMF_6-0"><a href="#cite_note-IMF-6">[5]</a></sup><sup class="reference" id="cite_ref-7"><a href="#cite_note-7">[6]</a></sup>
</th>
<th colspan="2"><a href="/wiki/World_Bank" title="World Bank">World Bank</a><sup class="reference" id="cite_ref

In [160]:
import re


header_row: list[str] = [elm.get_text(strip=True) for elm in data_table.tr.find_all('th')]
# Remove citation markers (assuming every citation is wrapped in square brackets)
clean_header_row: list[str] = [re.sub(r'\[[^\]]*\]', '', header) for header in header_row]
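
To see what the citation clean-up does in isolation, here is a minimal sketch that runs on a few made-up header strings rather than the live page. The helper name `strip_citations` and the sample headers are illustrative assumptions, not part of the notebook.

```python
import re

# Matches one bracketed citation marker such as "[5]" or "[note 1]".
CITATION_PATTERN = re.compile(r"\[[^\]]*\]")


def strip_citations(text: str) -> str:
    """Remove every bracketed citation marker from a header string."""
    return CITATION_PATTERN.sub("", text).strip()


# Hypothetical sample headers, shaped like the ones scraped from the table.
sample_headers = ["IMF[5][6]", "World Bank[7]", "UN Region"]
cleaned = [strip_citations(header) for header in sample_headers]
print(cleaned)  # ['IMF', 'World Bank', 'UN Region']
```

Compiling the pattern once keeps the intent readable and avoids repeating the regex wherever headers or cells need the same treatment.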


In [161]:
# Each source header spans an estimate column and a year column, so expand it into two headers
source_headers = clean_header_row[2:]
zipped_headers = [f"{header} {suffix}" for header in source_headers for suffix in ("estimate", "year")]
clean_header_row = [*clean_header_row[0:2], *zipped_headers]
clean_header_row

['Country/Territory',
 'UN Region',
 'IMF estimate',
 'IMF year',
 'World Bank estimate',
 'World Bank year',
 'CIA estimate',
 'CIA year']
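
The header expansion can also be written as a small function, which makes it safe to call more than once (re-running the cell above would expand the already expanded list again). This is a sketch only; the function name `expand_source_headers` is an assumption introduced for illustration.

```python
from itertools import chain


def expand_source_headers(fixed: list[str], sources: list[str]) -> list[str]:
    """Return the fixed headers followed by an estimate/year pair per source."""
    pairs = ((f"{source} estimate", f"{source} year") for source in sources)
    return [*fixed, *chain.from_iterable(pairs)]


headers = expand_source_headers(
    ["Country/Territory", "UN Region"],
    ["IMF", "World Bank", "CIA"],
)
print(headers)
```

Because the function takes its inputs as arguments instead of overwriting `clean_header_row`, the result is the same no matter how many times it is run.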

In [54]:
# cuDF is aliased to pd because its DataFrame API mirrors pandas
import cudf as pd
import bs4

In [169]:
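data_rows: list[bs4.Tag] = data_table.find_all('tr')
list_of_data = []
# Skip the two header rows
for row in data_rows[2:]:
    data_row = []
    for datum in row.find_all('td'):
        # Strip the narrow no-break space plus asterisk that trails some values
        cleaned_data = re.sub(r'\u202f\*', '', datum.get_text(strip=True))
        # Skip cells that contain only a citation marker such as "[5]"
        if re.fullmatch(r'\[\d+\]', cleaned_data):
            continue
        # A dash cell stands in for both the estimate and the year, so add it twice
        if cleaned_data.startswith('—'):
            data_row.append(cleaned_data)
        data_row.append(cleaned_data)
    list_of_data.append(data_row)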
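
The loop above appends a dash cell twice, which suggests that a single dash cell covers both the estimate and the year column. A more general way to express that idea is to repeat each cell once per column it spans. The sketch below assumes each cell has already been reduced to a `(text, colspan)` pair; with Beautiful Soup the span could be read with something like `int(datum.get('colspan', 1))`, though that is an assumption about the page markup. The name `expand_cells` and the sample row are illustrative.

```python
def expand_cells(cells: list[tuple[str, int]]) -> list[str]:
    """Repeat each cell's text once per column it spans."""
    row: list[str] = []
    for text, colspan in cells:
        row.extend([text] * colspan)
    return row


# Hypothetical row: the IMF dash is assumed to span the estimate and year columns.
monaco_cells = [
    ("Monaco", 1),
    ("Europe", 1),
    ("—", 2),
    ("190,513", 1),
    ("2019", 1),
    ("115,700", 1),
    ("2015", 1),
]
print(expand_cells(monaco_cells))
```

Driving the duplication from the span rather than from the cell's text would keep rows aligned even if a spanning cell held something other than a dash.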

In [170]:
list_of_data

[['Monaco', 'Europe', '—', '—', '190,513', '2019', '115,700', '2015'],
 ['Liechtenstein', 'Europe', '—', '—', '180,367', '2018', '139,100', '2009'],
 ['Luxembourg',
  'Europe',
  '140,694',
  '2022',
  '118,360',
  '2020',
  '110,300',
  '2020'],
 ['Singapore', 'Asia', '131,580', '2022', '98,526', '2020', '93,400', '2020'],
 ['Ireland', 'Europe', '124,596', '2022', '93,612', '2020', '89,700', '2020'],
 ['Qatar', 'Asia', '112,789', '2022', '89,949', '2020', '85,300', '2020'],
 ['Macau', 'Asia', '85,612', '2022', '57,807', '2020', '54,800', '2020'],
 ['Switzerland',
  'Europe',
  '84,658',
  '2022',
  '71,352',
  '2020',
  '68,400',
  '2020'],
 ['Isle of Man', 'Europe', '—', '—', '—', '—', '84,600', '2014'],
 ['Bermuda', 'Americas', '—', '—', '80,830', '2020', '81,800', '2019'],
 ['United Arab Emirates',
  'Asia',
  '78,255',
  '2022',
  '69,958',
  '2019',
  '67,100',
  '2019'],
 ['Norway', 'Europe', '77,808', '2022', '63,198', '2020', '63,600', '2020'],
 ['United States',
  'Americas',

In [171]:
clean_header_row

['Country/Territory',
 'UN Region',
 'IMF estimate',
 'IMF year',
 'World Bank estimate',
 'World Bank year',
 'CIA estimate',
 'CIA year']

In [172]:
gdp_df = pd.DataFrame(list_of_data, columns=clean_header_row)


In [173]:
gdp_df

| | Country/Territory | UN Region | IMF estimate | IMF year | World Bank estimate | World Bank year | CIA estimate | CIA year |
|---|---|---|---|---|---|---|---|---|
| 0 | Monaco | Europe | — | — | 190513 | 2019 | 115700 | 2015 |
| 1 | Liechtenstein | Europe | — | — | 180367 | 2018 | 139100 | 2009 |
| 2 | Luxembourg | Europe | 140694 | 2022 | 118360 | 2020 | 110300 | 2020 |
| 3 | Singapore | Asia | 131580 | 2022 | 98526 | 2020 | 93400 | 2020 |
| 4 | Ireland | Europe | 124596 | 2022 | 93612 | 2020 | 89700 | 2020 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 225 | Somalia | Africa | 1322 | 2022 | 875.2 | 2020 | 800 | 2020 |
| 226 | DR Congo | Africa | 1316 | 2022 | 1131 | 2020 | 1098 | 2019 |
| 227 | Central African Republic | Africa | 1102 | 2022 | 979.6 | 2020 | 945 | 2019 |
| 228 | South Sudan | Africa | 928 | 2022 | 1235 | 2015 | 1600 | 2017 |
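
The scraped values are still strings, so a possible next cleaning step is converting them to numbers before doing any arithmetic. The sketch below uses only the standard library and treats the dash as a missing value; the helper name `parse_number` and the choice of `None` for missing data are assumptions, not something the notebook does.

```python
from typing import Optional

# The table uses a dash for values that a source does not report.
MISSING_MARKER = "—"


def parse_number(text: str) -> Optional[float]:
    """Convert a scraped cell such as '190,513' to a float, or None if missing."""
    cleaned = text.strip()
    if cleaned == "" or cleaned == MISSING_MARKER:
        return None
    # Drop thousands separators before converting.
    return float(cleaned.replace(",", ""))


row = ['Monaco', 'Europe', '—', '—', '190,513', '2019', '115,700', '2015']
numeric = [parse_number(value) for value in row[2:]]
print(numeric)  # [None, None, 190513.0, 2019.0, 115700.0, 2015.0]
```

Applying a function like this to each estimate and year column before building the dataframe would let the columns hold numeric types instead of strings.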
