# Scraping Tables

[pandas](https://pandas.pydata.org) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

pandas makes it easy to scrape HTML tables from a web page.

Consider this Wikipedia page: [List of countries by GDP (nominal) per capita](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita).

```Python
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita'

tables = pd.read_html(url)
```
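`pd.read_html` returns a list with one DataFrame per `<table>` element it finds. As a minimal sketch, the example below runs it on a small inline HTML string (made up for illustration, so no network access is needed) and uses the `match` parameter to keep only the tables whose text contains a given string. It assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, inlined so the example runs offline.
html = """
<table>
  <tr><th>Country</th><th>GDP per capita</th></tr>
  <tr><td>Monaco</td><td>234317</td></tr>
  <tr><td>Luxembourg</td><td>133745</td></tr>
</table>
<table>
  <tr><th>Abbreviation</th><th>Meaning</th></tr>
  <tr><td>IMF</td><td>International Monetary Fund</td></tr>
</table>
"""

# Without a filter, every <table> becomes one DataFrame.
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing the given text are returned.
gdp_tables = pd.read_html(StringIO(html), match="GDP per capita")

print(len(all_tables), len(gdp_tables))
print(gdp_tables[0])
```

Selecting by `match` can be less fragile than a fixed index such as `tables[1]`, since the position of a table may change when the page is edited.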

In [3]:
type(tables)

list

In [4]:
len(tables)

6

In [5]:
type(tables[0])

pandas.core.frame.DataFrame

In [7]:
tables[1].head()

| | Country/Territory | UN Region | IMF[4][5] Estimate | IMF[4][5] Year | World Bank[6] Estimate | World Bank[6] Year | United Nations[7] Estimate | United Nations[7] Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | | | | | | | |
| 1 | Monaco * | Europe | — | — | 234316.0 | 2021.0 | 234317.0 | 2021.0 |
| 2 | Liechtenstein * | Europe | — | — | 157755.0 | 2020.0 | 169260.0 | 2021.0 |
| 3 | Luxembourg * | Europe | 127673 | 2022 | 133590.0 | 2021.0 | 133745.0 | 2021.0 |
| 4 | Bermuda * | Americas | — | — | 114090.0 | 2021.0 | 112653.0 | 2021.0 |
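The table header spans two rows, so pandas builds two-level (`MultiIndex`) columns. The sketch below uses a small stand-in DataFrame with the same header layout (not the scraped data itself) to show how a column can be selected through both levels, and one possible way to flatten the header into single names.

```python
import pandas as pd

# Small stand-in for the scraped table, with the same two-level header layout.
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("UN Region", "UN Region"),
    ("United Nations[7]", "Estimate"),
    ("United Nations[7]", "Year"),
])
df = pd.DataFrame(
    [["Monaco", "Europe", "234317", "2021"],
     ["Luxembourg", "Europe", "133745", "2021"]],
    columns=columns,
)

# Select one column by giving both header levels as a tuple.
un_estimate = df[("United Nations[7]", "Estimate")]

# Collapse the two levels into single names, dropping repeated labels.
flat = df.copy()
flat.columns = [
    top if top == bottom else f"{top} {bottom}"
    for top, bottom in df.columns
]

print(un_estimate.tolist())
print(flat.columns.tolist())
```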


In [8]:
data = tables[1]

In [9]:
type(data)

pandas.core.frame.DataFrame

In [10]:
data.dtypes

Country/Territory  Country/Territory    object
UN Region          UN Region            object
IMF[4][5]          Estimate             object
                   Year                 object
World Bank[6]      Estimate             object
                   Year                 object
United Nations[7]  Estimate             object
                   Year                 object
dtype: object
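Every column is reported as `object`, which suggests the numbers are stored as strings alongside the `—` placeholder. As an alternative to converting values by hand, `pd.to_numeric` with `errors="coerce"` can do the conversion in one step. The sketch below uses a hypothetical stand-in column rather than the scraped data.

```python
import pandas as pd

# Stand-in for one scraped column: numbers as strings, "—" for missing values.
estimates = pd.Series(["234317", "—", "133745"], dtype=object, name="Estimate")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(estimates, errors="coerce")

print(numeric.dtype)
print(numeric.tolist())
```

Note that the result is a float column, because NaN cannot be stored in a plain integer column.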

In [13]:
data_list = data.values.tolist()[1:]

In [15]:
data_list[0]

['Monaco\u202f*', 'Europe', '—', '—', '234316', '2021', '234317', '2021']

In [16]:
data_items = []

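for item in data_list:
    country = item[0]
    # Second-to-last column: the United Nations estimate
    gdp_pc_str = item[-2]

    # Strip the trailing narrow no-break space and asterisk from the name
    if country.endswith('\u202f*'):
        country = country.replace('\u202f*', '')

    # '—' marks a missing estimate
    if gdp_pc_str == '—':
        gdp_pc = None
    else:
        gdp_pc = int(gdp_pc_str)

    data_items.append(
        {
            'country': country,
            'gdp pc': gdp_pc
        }
    )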
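The `endswith` check above only handles names that end exactly in `'\u202f*'`, so a name followed by a footnote marker such as `[n 1]` is left untouched. A slightly more defensive variant is sketched below. The helper names `clean_country` and `parse_int` are hypothetical and not part of the original notebook.

```python
import re

# Bracketed footnote markers such as "[n 1]" or "[4]".
FOOTNOTE = re.compile(r"\[[^\]]*\]")

def clean_country(name):
    """Strip footnote markers, the trailing '*' and surrounding whitespace."""
    name = FOOTNOTE.sub("", name)
    name = name.replace("\u202f", " ")  # narrow no-break space to plain space
    return name.rstrip("* ").strip()

def parse_int(text):
    """Return an int, or None when the text is not a number (e.g. '—')."""
    try:
        return int(float(text.replace(",", "")))
    except ValueError:
        return None

print(clean_country("Monaco\u202f*"))
print(clean_country("European Union\u202f*[n 1]"))
print(parse_int("234317"), parse_int("—"))
```

Going through `float` first means a string such as `'234316.0'` is also accepted, at the cost of silently truncating any fractional part.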

In [17]:
data_items[0]

{'country': 'Monaco', 'gdp pc': 234317}

In [18]:
for item in data_items:
    print(item['country'], item['gdp pc'])

Monaco 234317
Liechtenstein 169260
Luxembourg 133745
Bermuda 112653
Ireland 101109
Norway 89242
Switzerland 93525
Isle of Man None
Cayman Islands 85250
Qatar 66799
Singapore 66822
United States 69185
Channel Islands None
Iceland 69133
Faroe Islands None
Australia 66916
Denmark 68037
Greenland 58185
Canada 52112
Sweden 60730
Netherlands 57871
Israel 54111
Austria 53840
Finland 53703
Belgium 51166
Hong Kong 49259
British Virgin Islands 49444
Germany 51073
United Arab Emirates 43295
San Marino 50425
United Kingdom 46542
New Zealand 48824
Brunei 31449
France 44229
Andorra 42066
U.S. Virgin Islands None
Puerto Rico 32716
Kuwait 32150
European Union *[n 1] 31875
New Caledonia 34994
Guam None
Taiwan 33011
Japan 39650
Italy 35579
Macau 43555
South Korea 34940
Malta 33642
Bahamas 27478
Aruba 29342
Cyprus 32281
Slovenia 29135
Estonia 27991
Spain 30058
Bahrain 26563
Czech Republic 26809
Saudi Arabia 23186
Sint Maarten (Dutch part) 26199
Portugal 24651
Lithuania 23844
Northern Mariana Islands None
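The printed list follows the row order of the scraped table, which is not strictly ordered by the United Nations estimate, and it includes entries with no estimate. One possible next step, sketched here on a few entries copied from the output above, is to drop the `None` values and rank the rest.

```python
# A few entries in the same shape as data_items (values taken from the output).
data_items = [
    {"country": "Norway", "gdp pc": 89242},
    {"country": "Isle of Man", "gdp pc": None},
    {"country": "Switzerland", "gdp pc": 93525},
    {"country": "Monaco", "gdp pc": 234317},
]

# Drop entries without an estimate, then sort from highest to lowest.
ranked = sorted(
    (item for item in data_items if item["gdp pc"] is not None),
    key=lambda item: item["gdp pc"],
    reverse=True,
)

for rank, item in enumerate(ranked, start=1):
    print(rank, item["country"], item["gdp pc"])
```

Filtering before sorting avoids comparing `None` with integers, which raises a `TypeError` in Python 3.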