In this practice project, you will use the skills acquired in the course to build a complete ETL pipeline that accesses data from a website and processes it to meet the requirements.

## Project Scenario

An international firm looking to expand its business into countries around the world has hired you as a junior Data Engineer. Your task is to create an automated script that extracts the list of all countries in order of their GDP in billion USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF). Since the IMF releases this evaluation twice a year, the organization will rerun this code to extract the information as it is updated.

You can find the required data on this [webpage](https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29).

The required information needs to be made accessible as a CSV file 'Countries_by_GDP.csv' as well as a table 'Countries_by_GDP' in a database file 'World_Economies.db' with attributes 'Country' and 'GDP_USD_billion'.

Your boss wants you to demonstrate the success of this code by running a query on the database table that displays only the entries with an economy larger than 100 billion USD. Also, log the entire execution process in a file named 'etl_project_log.txt'.

You must create a Python script 'etl_project_gdp.py' that performs all the required tasks.

## Objectives

You have to complete the following tasks for this project:

1. Write a data extraction function to retrieve the relevant information from the required URL.

2. Transform the available GDP information into 'Billion USD' from 'Million USD'.

3. Load the transformed information into the required CSV file and into a database table.

4. Run the required query on the database.

5. Log the progress of the code with appropriate timestamps.

In [1]:
import myfunc as fun

1. Write a data extraction function to retrieve the relevant information from the required URL.

In [2]:
url = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"
df = fun.extract_scrap_data(url=url)
df

| | Country/Territory | UN_Region | IMF_Estimate | IMF_Year |
|---|---|---|---|---|
| 0 | World | — | 105568776 | 2023 |
| 1 | United States | Americas | 26854599 | 2023 |
| 2 | China | Asia | 19373586 | [n 1]2023 |
| 3 | Japan | Asia | 4409738 | 2023 |
| 4 | Germany | Europe | 4308854 | 2023 |
| ... | ... | ... | ... | ... |
| 209 | Anguilla | Americas | — | — |
| 210 | Kiribati | Oceania | 248 | 2023 |
| 211 | Nauru | Oceania | 151 | 2023 |
| 212 | Montserrat | Americas | — | — |
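The `myfunc` module behind `fun.extract_scrap_data` is not shown, so the following is only a minimal sketch of how table rows could be pulled out of an HTML page, assuming nothing beyond the Python standard library. The names `TableRowParser`, `parse_rows` and `fetch_html` are hypothetical, and the `sample` markup is a made-up fragment that reuses two rows from the output above. The real Wikipedia page contains several tables and footnote markers (such as the `[n 1]` visible in the China row), so a real extractor would need to select the right table and clean those up.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. links) still belongs to the open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(html):
    """Return all table rows found in an HTML string."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_html(url):
    """Download a page as text (not called below, to keep the demo offline)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# Made-up fragment for illustration; values copied from the output above.
sample = """
<table>
<tr><th>Country/Territory</th><th>UN_Region</th><th>IMF_Estimate</th><th>IMF_Year</th></tr>
<tr><td><a href="#">Japan</a></td><td>Asia</td><td>4409738</td><td>2023</td></tr>
<tr><td>Nauru</td><td>Oceania</td><td>151</td><td>2023</td></tr>
</table>
"""
rows = parse_rows(sample)
```

The first row holds the header cells and the remaining rows hold the data, so wrapping `rows[1:]` in a DataFrame with `rows[0]` as column names would be one way to get a frame shaped like the output above.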


2. Transform the available GDP information into 'Billion USD' from 'Million USD'.

In [3]:
df = fun.transform_data(df=df)
df

| | Country/Territory | UN_Region | IMF_Estimate_USD_Mio | IMF_Year | IMF_Estimate_USD_Bio |
|---|---|---|---|---|---|
| 0 | World | — | 105568776 | 2023 | 105568.78 |
| 1 | United States | Americas | 26854599 | 2023 | 26854.60 |
| 2 | China | Asia | 19373586 | 2023 | 19373.59 |
| 3 | Japan | Asia | 4409738 | 2023 | 4409.74 |
| 4 | Germany | Europe | 4308854 | 2023 | 4308.85 |
| ... | ... | ... | ... | ... | ... |
| 209 | Anguilla | Americas | 0 | 0 | 0.00 |
| 210 | Kiribati | Oceania | 248 | 2023 | 0.25 |
| 211 | Nauru | Oceania | 151 | 2023 | 0.15 |
| 212 | Montserrat | Americas | 0 | 0 | 0.00 |
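The output above suggests that the transformation divides the million-USD estimate by 1000, rounds to 2 decimal places, and turns the "—" placeholders into 0. Since the body of `fun.transform_data` is not shown, here is a hedged sketch of that conversion, assuming the data lives in a pandas DataFrame with an `IMF_Estimate` column of strings. It does not reproduce the column rename or the `IMF_Year` cleanup seen in the output.

```python
import pandas as pd


def transform_data(df):
    """Add a billion-USD column; non-numeric estimates become 0."""
    out = df.copy()
    # Placeholders such as "—" are not numbers: coerce them to NaN, then to 0.
    millions = pd.to_numeric(out["IMF_Estimate"], errors="coerce").fillna(0)
    # 1 billion = 1000 million; round to 2 decimal places as required.
    out["IMF_Estimate_USD_Bio"] = (millions / 1000).round(2)
    return out


# Small illustrative frame using values from the output above.
raw = pd.DataFrame(
    {
        "Country/Territory": ["Japan", "Nauru", "Anguilla"],
        "IMF_Estimate": ["4409738", "151", "—"],
    }
)
result = transform_data(raw)
```

Mapping missing estimates to 0 mirrors what the output shows, though keeping them as missing values would avoid mistaking "no data" for a GDP of zero in later queries.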


3. Load the transformed information into the required CSV file and into a database table.

In [4]:
filename = 'Countries_by_GDP.csv'
table_name = 'Countries_by_GDP'
db_name = 'World_Economies.db'
attributes = ['Country', 'GDP_USD_billion']

In [5]:
fun.load_data(df=df, filename=filename, table_name=table_name, db_name=db_name, attributes=attributes)
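The call above produces no output, so it may help to see what a loader with this signature could look like. The following is a sketch under stated assumptions, not the actual `myfunc` code: it uses pandas with the standard-library `sqlite3` driver, and it assumes the frame already carries the column names listed in `attributes` (the real function presumably renames `Country/Territory` and `IMF_Estimate_USD_Bio` first). The demo writes into a temporary directory instead of the project folder.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def load_data(df, filename, table_name, db_name, attributes):
    """Write the selected columns to a CSV file and to a SQLite table."""
    subset = df[attributes]
    subset.to_csv(filename, index=False)
    conn = sqlite3.connect(db_name)
    try:
        # Replace the table so that rerunning the pipeline does not duplicate rows.
        subset.to_sql(table_name, conn, if_exists="replace", index=False)
        conn.commit()
    finally:
        conn.close()


# Illustrative frame; column names are assumed to match `attributes` already.
frame = pd.DataFrame(
    {"Country": ["Japan", "Nauru"], "GDP_USD_billion": [4409.74, 0.15]}
)
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "Countries_by_GDP.csv")
db_path = os.path.join(workdir, "World_Economies.db")
load_data(frame, csv_path, "Countries_by_GDP", db_path, ["Country", "GDP_USD_billion"])
```

Passing `index=False` in both calls keeps the DataFrame index out of the files; when the index is written and later read back, it tends to show up as an extra unnamed column.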

4. Run the required query on the database.

In [6]:
fun.run_query(query=f"SELECT * FROM {table_name} WHERE GDP_USD_billion > 100", db_name=db_name)

| | Country | GDP_USD_billion |
|---|---|---|
| 0 | World | 105568.78 |
| 1 | United States | 26854.60 |
| 2 | China | 19373.59 |
| 3 | Japan | 4409.74 |
| 4 | Germany | 4308.85 |
| ... | ... | ... |
| 65 | Kenya | 118.13 |
| 66 | Angola | 117.88 |
| 67 | Oman | 104.90 |
| 68 | Guatemala | 102.31 |
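To make the query step concrete, here is a small sketch of a `run_query` helper built only on the standard-library `sqlite3` module. It is an assumption about how such a helper might work, not the `myfunc` implementation; the optional `params` argument and the demo database are hypothetical additions. The threshold is passed as a bound parameter, while the table name stays in the SQL text because SQLite placeholders cannot stand in for identifiers.

```python
import os
import sqlite3
import tempfile


def run_query(query, db_name, params=()):
    """Run a query and return (column_names, rows)."""
    conn = sqlite3.connect(db_name)
    try:
        cursor = conn.execute(query, params)
        columns = [col[0] for col in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()


# Build a tiny demo database with values taken from the outputs above.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billion REAL)")
conn.executemany(
    "INSERT INTO Countries_by_GDP VALUES (?, ?)",
    [("Japan", 4409.74), ("Guatemala", 102.31), ("Nauru", 0.15)],
)
conn.commit()
conn.close()

columns, rows = run_query(
    "SELECT * FROM Countries_by_GDP WHERE GDP_USD_billion > ?", db_path, (100,)
)
```

Note that the query result above also contains the aggregate "World" row, which is not a country; adding a condition such as `AND Country != 'World'` would be one way to exclude it if only individual economies are wanted.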


5. Log the progress of the code with appropriate timestamps.

In [7]:
message = "Query has been run successfully"
logfile = 'etl_project_log.txt'

fun.execution_log(message=message, logfile=logfile)
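A timestamped logger of this kind can be very small. The sketch below shows one possible shape for `execution_log`, assuming a plain-text log with one line per event. The timestamp format and the `" : "` separator are illustrative choices, not something the project specifies, and the demo writes to a temporary file.

```python
import os
import tempfile
from datetime import datetime


def execution_log(message, logfile):
    """Append one timestamped line to the log file."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Mode "a" appends, so earlier entries from the same run are preserved.
    with open(logfile, "a", encoding="utf-8") as handle:
        handle.write(f"{timestamp} : {message}\n")


# Demo: log two pipeline stages to a temporary file.
log_path = os.path.join(tempfile.mkdtemp(), "etl_project_log.txt")
execution_log("Data extraction complete", log_path)
execution_log("Query has been run successfully", log_path)

with open(log_path, encoding="utf-8") as handle:
    log_lines = handle.read().splitlines()
```

Calling the logger after each stage (extract, transform, load, query), not only at the end, gives the log the step-by-step record that the project objectives ask for.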