# Practice Project: GDP extraction and processing
course: Python for Data Science, AI and Development (IBM)
## Project Scenario:
An international firm that is looking to expand its business in different countries across the world has recruited you. You have been hired as a junior Data Engineer and are tasked with creating a script that can extract the list of the top 10 largest economies of the world in descending order of their GDPs in Billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF).

The required data appears to be available at the URL below:

URL: https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
url = 'https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29'

In [24]:
# extract tables from webpage
tables = pd.read_html(url)
df = tables[3] # index 3 is the fourth table on the page (indexing is zero-based)
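
A hard-coded position such as `tables[3]` depends on the page layout staying the same. As a minimal sketch, the table could instead be chosen by what its column labels contain. The helper `pick_table` and the two demo tables below are hypothetical and only stand in for the list that `pd.read_html` returns; the keyword "IMF" is an assumption about the header text.

```python
import pandas as pd

def pick_table(tables, keyword):
    """Return the first DataFrame whose column labels mention `keyword`."""
    for table in tables:
        # str() also handles tuple labels from multi-level headers
        if any(keyword in str(label) for label in table.columns):
            return table
    raise ValueError(f"no table with a column containing {keyword!r}")

# Made-up stand-ins for the DataFrames returned by pd.read_html(url)
demo_tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        ("Country/Territory", "Country/Territory"): ["United States", "China"],
        ("IMF", "Estimate"): ["26854599", "19373586"],
    }),
]

picked = pick_table(demo_tables, "IMF")
print(picked.shape)
```

If no table matches, the helper raises an error instead of silently returning the wrong table.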

In [25]:
# replacing the column headers with column numbers
df.columns = range(df.shape[1])

In [26]:
# retain the columns at positions 0 and 2 (country name and GDP)
df = df[[0, 2]]

In [27]:
# retain rows with index 1 to 10 indicating the top 10 economies
df = df.iloc[1:11, :].copy() # copy so later column assignments do not act on a slice

In [28]:
# assign column names as "Country" and "GDP (Million USD)"
df.columns = ['Country', 'GDP (Million USD)']

In [29]:
print(df)

           Country GDP (Million USD)
1    United States          26854599
2            China          19373586
3            Japan           4409738
4          Germany           4308854
5            India           3736882
6   United Kingdom           3158938
7           France           2923489
8            Italy           2169745
9           Canada           2089672
10          Brazil           2081235


In [30]:
# changing the datatype of GDP column to int
df['GDP (Million USD)'] = df['GDP (Million USD)'].astype(int)
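
The cast above works because the ten retained rows hold plain digit strings. A minimal sketch of a more forgiving conversion follows, assuming some rows could contain a non-numeric placeholder. The sample series is illustrative, and 'N/A' is a hypothetical stand-in for such a placeholder.

```python
import pandas as pd

# Illustrative values: two real-looking figures and one hypothetical placeholder
raw = pd.Series(["26854599", "19373586", "N/A"], name="GDP (Million USD)")

# errors="coerce" turns anything unparseable into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# Drop the missing entries before casting, since NaN cannot be stored as int
clean = numeric.dropna().astype(int)
print(clean.tolist())
```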

In [31]:
# converting million to billion
df['GDP (Million USD)'] = df['GDP (Million USD)']/1000

In [32]:
# rounding off to 2 decimal places
df['GDP (Million USD)'] = df['GDP (Million USD)'].round(2)

In [33]:
# renaming column header
# rename() returns a new DataFrame, so assign the result back before displaying it
df = df.rename(columns={'GDP (Million USD)': 'GDP (Billion USD)'})
df

           Country  GDP (Billion USD)
1    United States           26854.60
2            China           19373.59
3            Japan            4409.74
4          Germany            4308.85
5            India            3736.88
6   United Kingdom            3158.94
7           France            2923.49
8            Italy            2169.74
9           Canada            2089.67
10          Brazil            2081.24
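
The transformation steps above (select two columns, keep rows 1 to n, cast, convert million to billion, round) can be sketched as one function. This is an illustrative consolidation, not the course's reference solution. The name `top_economies` is hypothetical, and the demo table reuses three figures from the output above plus a made-up row 0 and a made-up middle column.

```python
import pandas as pd

def top_economies(table, n=10):
    """Return the first n economies with GDP in billion USD, rounded to 2 places."""
    # Positions 0 and 2 hold country and GDP; row 0 is skipped, as in the cells above
    out = table.iloc[1:n + 1, [0, 2]].copy()
    out.columns = ["Country", "GDP (Billion USD)"]
    # Cast the digit strings to int, convert million to billion, then round
    out["GDP (Billion USD)"] = (out["GDP (Billion USD)"].astype(int) / 1000).round(2)
    return out

# Demo input: row 0 and the middle column are made-up placeholders
demo = pd.DataFrame([
    ["World", "x", "100000000"],
    ["United States", "x", "26854599"],
    ["China", "x", "19373586"],
    ["Japan", "x", "4409738"],
])

result = top_economies(demo, n=3)
print(result)
```

For example, 26854599 million USD becomes 26854.599 billion USD, which rounds to 26854.60.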


In [34]:
df.to_csv("10LargestEconomies.csv")
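
By default `to_csv` writes the row index as an unlabeled first column. A minimal sketch of labeling that column and checking the round trip is shown below. It writes to an in-memory buffer instead of a file, and the label "Rank" and the two-row demo frame are assumptions made for illustration.

```python
import io
import pandas as pd

# Two rows taken from the output above, indexed by rank
df_demo = pd.DataFrame(
    {"Country": ["United States", "China"],
     "GDP (Billion USD)": [26854.6, 19373.59]},
    index=[1, 2],
)

buffer = io.StringIO()
df_demo.to_csv(buffer, index_label="Rank")  # give the index column a header

buffer.seek(0)
restored = pd.read_csv(buffer, index_col="Rank")
print(restored)
```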