# ETL Process for Global Inflation Dashboard Project 

## Inflation data

#### What is Inflation?

Inflation is the rate at which the general level of prices for goods and services rises, leading to a decrease in the purchasing power of a currency. Essentially, it means that over time, you will be able to buy less with the same amount of money.

#### Key Concepts of Inflation
1. Purchasing Power:

- As prices increase, the purchasing power of money decreases. For example, if the inflation rate is 2% per year, a $1 item today will cost $1.02 in one year.

2. Measurement:

- Inflation is typically measured by the Consumer Price Index (CPI) or the Producer Price Index (PPI).
- CPI tracks the cost of a basket of consumer goods and services over time.
- PPI measures the average changes in selling prices received by domestic producers for their output.

3. Types of Inflation:

- Demand-Pull Inflation: Occurs when the demand for goods and services exceeds their supply, driving up prices.
- Cost-Push Inflation: Happens when the costs of production increase (e.g., higher wages, raw materials), leading producers to raise prices to maintain profit margins.
- Built-In Inflation: Also known as wage-price inflation, where businesses increase prices to compensate for rising wages, and workers demand higher wages to keep up with increased living costs, creating a feedback loop.

4. Hyperinflation:

- Extremely high and typically accelerating inflation. This can lead to the collapse of a currency and a significant loss of confidence in the economy.

5. Deflation:

- The opposite of inflation, where the general price level of goods and services decreases. While it might seem beneficial, deflation can lead to decreased economic activity as consumers delay purchases in anticipation of lower prices.

#### Causes of Inflation

1. Monetary Factors:

- Increase in the money supply without a corresponding increase in economic output can lead to inflation. This is often due to central banks printing more money or reducing interest rates.

2. Demand-Side Factors:

- Increased consumer spending, higher government expenditure, and increased investment can drive up demand for goods and services, leading to inflation.

3. Supply-Side Factors:

- Increased production costs, such as rising wages or raw material prices, can cause businesses to raise prices, resulting in cost-push inflation.

4. Expectations:

- If people expect inflation to rise, they may demand higher wages and spend more quickly, which can contribute to higher inflation.

#### Effects of Inflation

1. Decreased Purchasing Power:

- Inflation erodes the value of money, reducing the purchasing power of consumers.

2. Impact on Savings and Investments:

- Inflation can erode the value of savings if the return on investments is lower than the inflation rate. Conversely, it can benefit borrowers as the real value of debt decreases over time.

3. Menu Costs:

- Businesses incur costs when they have to change prices frequently due to inflation (e.g., reprinting menus, updating systems).

4. Uncertainty:

- High and unpredictable inflation can create uncertainty in the economy, leading to reduced investment and economic growth.

#### Controlling Inflation

1. Monetary Policy:

- Central banks use tools like interest rates and control of the money supply to manage inflation. Raising interest rates can help reduce inflation by decreasing spending and borrowing.

2. Fiscal Policy:

- Governments can use taxation and spending policies to influence the economy. Reducing government spending or increasing taxes can help curb inflation by reducing overall demand.

3. Supply-Side Policies:

- Measures to improve productivity and efficiency in the economy can help control inflation by reducing production costs.

#### Examples

1. Hyperinflation in Zimbabwe:

- In the late 2000s, Zimbabwe experienced hyperinflation, with prices doubling every day at its peak. This was due to excessive printing of money and loss of confidence in the currency.

2. Moderate Inflation in Developed Economies:

- Most developed economies aim for a moderate inflation rate of around 2% per year, which is considered healthy for economic growth. For instance, the Federal Reserve in the United States targets a 2% inflation rate as part of its monetary policy.

#### Conclusion

Inflation is a critical economic concept that affects everyone's daily life. Understanding its causes, effects, and how it is measured can help individuals and policymakers make informed decisions. Proper management of inflation through monetary and fiscal policies is essential for maintaining economic stability and growth.

---
What follows here is an ETL process for the Global Inflation Dashboard Project.

In [2]:
# Import dependencies
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)

In [3]:
# Read inflation data into Pandas dataframe
inflation_df = pd.read_csv('Dataset/global_inflation_data_kaggle_raw.csv')
inflation_df.head()

Unnamed: 0,country_name,indicator_name,1980,1981,1982,1983,1984,1985,1986,1987,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,Afghanistan,Annual average inflation (consumer prices) rate,13.4,22.2,18.2,15.9,20.4,8.7,-2.1,18.4,...,-0.66,4.38,4.98,0.63,2.3,5.44,5.06,13.71,9.1,
1,Albania,Annual average inflation (consumer prices) rate,,,,,,,,,...,1.9,1.3,2.0,2.0,1.4,1.6,2.0,6.7,4.8,4.0
2,Algeria,Annual average inflation (consumer prices) rate,9.7,14.6,6.6,7.8,6.3,10.4,14.0,5.9,...,4.8,6.4,5.6,4.3,2.0,2.4,7.2,9.3,9.0,6.8
3,Andorra,Annual average inflation (consumer prices) rate,,,,,,,,,...,-1.1,-0.4,2.6,1.0,0.5,0.1,1.7,6.2,5.2,3.5
4,Angola,Annual average inflation (consumer prices) rate,46.7,1.4,1.8,1.8,1.8,1.8,1.8,1.8,...,9.2,30.7,29.8,19.6,17.1,22.3,25.8,21.4,13.1,22.3


In [4]:
# Get a brief summary of the inflation data
inflation_df.describe()

Unnamed: 0,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
count,140.0,144.0,145.0,145.0,145.0,145.0,145.0,147.0,147.0,147.0,...,194.0,194.0,195.0,195.0,195.0,194.0,194.0,194.0,192.0,191.0
mean,21.757143,17.796528,17.029655,19.177241,26.97931,103.215172,25.262069,111.294558,58.635374,101.246259,...,4.116186,6.594742,7.656821,339.688359,107.294872,19.83268,16.577629,13.616031,13.736458,9.309424
std,33.656118,18.992691,22.797064,34.806824,111.889811,975.748316,86.93121,1081.094434,400.370989,679.792142,...,10.763149,31.096216,34.954954,4681.227548,1425.256254,173.722612,117.154632,25.282229,39.667874,25.195589
min,-7.3,0.0,-0.9,-8.5,-7.4,-16.0,-17.6,-31.2,-13.0,-9.6,...,-3.8,-5.6,-13.3,-44.4,-3.2,-2.6,-3.0,-3.2,-0.8,1.2
25%,9.55,8.6,6.1,5.0,3.8,2.8,1.8,2.15,2.55,3.35,...,0.1,0.1,1.15,1.3,0.8,0.4,1.925,5.5,4.0,2.8
50%,13.85,12.5,10.3,8.7,8.0,7.1,5.8,5.9,6.8,6.9,...,1.5,1.5,2.4,2.5,2.2,1.9,3.5,8.1,5.8,4.0
75%,20.525,19.8,16.7,16.0,17.1,16.8,18.2,16.65,17.8,16.7,...,4.8,5.125,5.2,4.3,4.0,4.575,5.975,11.975,9.925,5.8
max,316.6,116.8,123.6,275.6,1281.3,11749.6,885.2,13109.5,4775.2,7428.7,...,121.7,346.1,438.1,65374.1,19906.0,2355.1,1588.5,193.4,360.0,222.4


In [5]:
inflation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196 entries, 0 to 195
Data columns (total 47 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country_name    196 non-null    object 
 1   indicator_name  196 non-null    object 
 2   1980            140 non-null    float64
 3   1981            144 non-null    float64
 4   1982            145 non-null    float64
 5   1983            145 non-null    float64
 6   1984            145 non-null    float64
 7   1985            145 non-null    float64
 8   1986            145 non-null    float64
 9   1987            147 non-null    float64
 10  1988            147 non-null    float64
 11  1989            147 non-null    float64
 12  1990            150 non-null    float64
 13  1991            155 non-null    float64
 14  1992            158 non-null    float64
 15  1993            169 non-null    float64
 16  1994            171 non-null    float64
 17  1995            172 non-null    flo

In [6]:
# Get the inflation_df columns
inflation_df_columns_list = inflation_df.columns
inflation_df_columns_list

Index(['country_name', 'indicator_name', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019',
       '2020', '2021', '2022', '2023', '2024'],
      dtype='object')

In [7]:
# Melt the DataFrame to long format
time_series_df = pd.melt(inflation_df, 
                         id_vars=['country_name', 'indicator_name'], 
                         var_name='year', 
                         value_name='inflation_rate')

# Convert year to integer
time_series_df['year'] = time_series_df['year'].astype(int)

# Create a unique identifier for each country-year combination
time_series_df['id'] = time_series_df['country_name'] + '_' + time_series_df['year'].astype(str)

time_series_df.rename(columns={
 'country_name': 'Country',
 'indicator_name': 'Indicator',
 'year': 'Year',
 'inflation_rate': 'Inflation_Rate',
 'id': 'ID'
}, inplace=True)

# Check the transformed DataFrame
time_series_df.head()


Unnamed: 0,Country,Indicator,Year,Inflation_Rate,ID
0,Afghanistan,Annual average inflation (consumer prices) rate,1980,13.4,Afghanistan_1980
1,Albania,Annual average inflation (consumer prices) rate,1980,,Albania_1980
2,Algeria,Annual average inflation (consumer prices) rate,1980,9.7,Algeria_1980
3,Andorra,Annual average inflation (consumer prices) rate,1980,,Andorra_1980
4,Angola,Annual average inflation (consumer prices) rate,1980,46.7,Angola_1980


In [8]:
# Export time_series_df as CSV files.
time_series_df.to_csv("cleaned_data/inflation.csv", index=False)


## GDP data

#### GDP
=
𝐶
+
𝐼
+
𝐺
+
(
𝑋
−
𝑀
)
GDP=C+I+G+(X−M)

- C = Consumption

- I = Investment

- G = Government spending

- X = Exports

- M = Imports

In [9]:
# Read inflation data into Pandas dataframe
gdp_df = pd.read_csv('Dataset/gdp_kaggle_raw.csv')
gdp_df.head()

Unnamed: 0,Country Name,Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
0,Aruba,ABW,,,,,,,,,...,2534637000.0,2727850000.0,2790849000.0,2962905000.0,2983637000.0,3092430000.0,3202189000.0,,,
1,Africa Eastern and Southern,AFE,19313110000.0,19723490000.0,21493920000.0,25733210000.0,23527440000.0,26810570000.0,29152160000.0,30173170000.0,...,950521400000.0,964242400000.0,984807100000.0,919930000000.0,873354900000.0,985355700000.0,1012853000000.0,1009910000000.0,920792300000.0,
2,Afghanistan,AFG,537777800.0,548888900.0,546666700.0,751111200.0,800000000.0,1006667000.0,1400000000.0,1673333000.0,...,19907320000.0,20146400000.0,20497130000.0,19134210000.0,18116560000.0,18753470000.0,18053230000.0,18799450000.0,20116140000.0,
3,Africa Western and Central,AFW,10404280000.0,11128050000.0,11943350000.0,12676520000.0,13838580000.0,14862470000.0,15832850000.0,14426430000.0,...,727571400000.0,820787600000.0,864966600000.0,760729700000.0,690543000000.0,683741600000.0,741691600000.0,794572500000.0,784587600000.0,
4,Angola,AGO,,,,,,,,,...,128052900000.0,136709900000.0,145712200000.0,116193600000.0,101123900000.0,122123800000.0,101353200000.0,89417190000.0,58375980000.0,


In [10]:
# Get a brief summary of the inflation data
gdp_df.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
count,128.0,134.0,137.0,137.0,137.0,148.0,151.0,154.0,159.0,159.0,...,257.0,258.0,258.0,257.0,256.0,256.0,256.0,253.0,242.0,0.0
mean,72089900000.0,72495790000.0,75601040000.0,81532730000.0,89551090000.0,90929830000.0,101185300000.0,104972200000.0,110041100000.0,121683300000.0,...,2405010000000.0,2482543000000.0,2553374000000.0,2407493000000.0,2447400000000.0,2620694000000.0,2786905000000.0,2864964000000.0,2892666000000.0,
std,217556700000.0,221811200000.0,235645100000.0,253509700000.0,277236200000.0,291155100000.0,318782800000.0,337165900000.0,359081800000.0,395216700000.0,...,8162961000000.0,8388441000000.0,8616240000000.0,8171350000000.0,8321991000000.0,8852537000000.0,9425221000000.0,9620595000000.0,9530218000000.0,
min,12012010.0,11592010.0,9122751.0,10840100.0,12712470.0,13593930.0,14469080.0,15835180.0,14600000.0,15850000.0,...,37671770.0,37509080.0,37290610.0,35492070.0,36547800.0,40619250.0,42588160.0,47271460.0,48855550.0,
25%,493017100.0,500733800.0,531736500.0,516147800.0,542578400.0,586371600.0,638099500.0,623858400.0,644007100.0,683482000.0,...,8709165000.0,8747774000.0,9297231000.0,8738203000.0,8666853000.0,9565595000.0,10462330000.0,11314950000.0,12049960000.0,
50%,2661047000.0,2966849000.0,2814319000.0,3540403000.0,3405333000.0,3038595000.0,3170500000.0,3377453000.0,3941700000.0,4485778000.0,...,46580460000.0,49816760000.0,51143880000.0,50065950000.0,48869130000.0,53322710000.0,56144040000.0,61136870000.0,62128300000.0,
75%,22184500000.0,29567130000.0,29292290000.0,33956040000.0,31226320000.0,27194100000.0,28936610000.0,30376760000.0,33403760000.0,37457990000.0,...,552483700000.0,544709200000.0,545626600000.0,505103800000.0,526123800000.0,566671200000.0,563444500000.0,597280600000.0,744174700000.0,
max,1387318000000.0,1443856000000.0,1545481000000.0,1666138000000.0,1824277000000.0,1987340000000.0,2156806000000.0,2294908000000.0,2476959000000.0,2732048000000.0,...,75312280000000.0,77439510000000.0,79557660000000.0,75112440000000.0,76305060000000.0,81193290000000.0,86267600000000.0,87568050000000.0,84746980000000.0,


In [11]:
gdp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 64 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  266 non-null    object 
 1   Code          266 non-null    object 
 2   1960          128 non-null    float64
 3   1961          134 non-null    float64
 4   1962          137 non-null    float64
 5   1963          137 non-null    float64
 6   1964          137 non-null    float64
 7   1965          148 non-null    float64
 8   1966          151 non-null    float64
 9   1967          154 non-null    float64
 10  1968          159 non-null    float64
 11  1969          159 non-null    float64
 12  1970          168 non-null    float64
 13  1971          171 non-null    float64
 14  1972          171 non-null    float64
 15  1973          171 non-null    float64
 16  1974          172 non-null    float64
 17  1975          174 non-null    float64
 18  1976          175 non-null    

In [12]:
# Get the gdp_df columns
gdp_df_columns_list = gdp_df.columns
gdp_df_columns_list

Index(['Country Name', 'Code', '1960', '1961', '1962', '1963', '1964', '1965',
       '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974',
       '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019',
       '2020', 'Unnamed: 65'],
      dtype='object')

In [14]:
# Select only the intended columns for melting
year_columns = [col for col in gdp_df.columns if col.isdigit()]
id_vars = ['Country Name', 'Code']

# Melt the DataFrame to long format
gdp_time_series_df = pd.melt(gdp_df, 
                             id_vars=id_vars, 
                             value_vars=year_columns, 
                             var_name='year', 
                             value_name='gdp')

# Convert year to integer
gdp_time_series_df['year'] = gdp_time_series_df['year'].astype(int)

# Create a unique identifier for each country-year combination
gdp_time_series_df['id'] = gdp_time_series_df['Country Name'] + '_' + gdp_time_series_df['year'].astype(str)

gdp_time_series_df.rename(columns={
 'Country Name': 'Country',
 'year': 'Year',
 'gdp': 'GDP',
 'id': 'ID'
}, inplace=True)

# Check the transformed DataFrame
print(gdp_time_series_df.head())

                       Country Code  Year           GDP  \
0                        Aruba  ABW  1960           NaN   
1  Africa Eastern and Southern  AFE  1960  1.931311e+10   
2                  Afghanistan  AFG  1960  5.377778e+08   
3   Africa Western and Central  AFW  1960  1.040428e+10   
4                       Angola  AGO  1960           NaN   

                                 ID  
0                        Aruba_1960  
1  Africa Eastern and Southern_1960  
2                  Afghanistan_1960  
3   Africa Western and Central_1960  
4                       Angola_1960  


In [15]:
# Export gdp_time_series_df as CSV files.
gdp_time_series_df.to_csv("cleaned_data/gdp.csv", index=False)

## GDP growth per capita data

#### GDP per capita
=
GDP per Capita= 
GDP / Population


In [16]:
# Read gdp growth per capita data into Pandas dataframe
gdp_growth_df = pd.read_csv('Dataset/gdp_per_capita_growth_kaggle_raw.csv')
gdp_growth_df.head()

Unnamed: 0,Country Name,Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
0,Aruba,ABW,,,,,,,,,...,-1.865105,3.592223,-0.290534,5.129657,1.587869,1.519821,,,,
1,Africa Eastern and Southern,AFE,,,,,,,,,...,-0.769509,1.505305,1.20326,0.18786,-0.674533,-0.14471,-0.185406,-0.544414,-5.40382,
2,Afghanistan,AFG,,,,,,,,,...,8.974865,1.974166,-0.665291,-1.622857,-0.541416,0.064764,-1.1949,1.535637,-4.575032,
3,Africa Western and Central,AFW,,-0.232405,1.602299,4.990675,3.12468,1.783947,-3.946431,-11.557321,...,2.315175,3.260886,3.096784,0.007402,-2.533562,-0.390665,0.241531,0.492953,-3.453976,
4,Angola,AGO,,,,,,,,,...,4.706519,1.291994,1.219881,-2.468737,-5.816188,-3.409983,-5.162112,-3.795608,-8.396241,


In [17]:
gdp_growth_df.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
count,0.0,119.0,124.0,124.0,124.0,124.0,132.0,136.0,138.0,142.0,...,253.0,253.0,255.0,254.0,254.0,254.0,253.0,251.0,244.0,0.0
mean,,1.265832,2.717553,2.611405,4.001137,2.965614,2.346891,1.879621,3.68699,4.584436,...,1.823515,1.832804,1.951132,1.386062,1.794568,1.916788,1.828986,1.534935,-5.797238,
std,,5.676197,4.469364,4.998234,4.349242,4.336827,4.314371,6.96594,7.254843,4.643365,...,9.052478,4.626217,3.442476,4.465341,3.663534,3.538378,2.980404,2.92505,7.167678,
min,,-26.527644,-21.644507,-14.574477,-14.092751,-15.204696,-10.653985,-17.553376,-6.758929,-9.249391,...,-47.590601,-36.55692,-24.498011,-29.827145,-13.02043,-9.167725,-19.821379,-11.645791,-54.641447,
25%,,-0.410944,1.017317,0.552989,1.831906,0.382805,-0.537042,-0.671153,0.584387,2.040628,...,-0.195108,0.163871,0.556265,0.106375,0.027065,0.492579,0.48518,0.053584,-8.057641,
50%,,1.977003,2.322061,3.059142,4.347052,3.03523,2.40898,1.836612,3.560811,4.278922,...,1.424974,1.888643,1.841026,1.651083,1.772708,1.9415,1.899148,1.59897,-4.752616,
75%,,4.152542,4.772294,4.80438,5.968369,4.959529,4.934897,4.223501,4.979455,6.735885,...,3.745866,3.650819,3.555055,3.418883,3.3856,3.709334,3.650746,3.150985,-2.495986,
max,,13.66159,20.599898,31.010546,23.94715,14.197889,15.95135,61.658941,76.675456,21.876232,...,121.779472,29.690999,25.566139,23.999093,28.273852,24.976038,13.446087,17.21141,42.7893,


In [18]:
gdp_growth_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 64 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  266 non-null    object 
 1   Code          266 non-null    object 
 2   1960          0 non-null      float64
 3   1961          119 non-null    float64
 4   1962          124 non-null    float64
 5   1963          124 non-null    float64
 6   1964          124 non-null    float64
 7   1965          124 non-null    float64
 8   1966          132 non-null    float64
 9   1967          136 non-null    float64
 10  1968          138 non-null    float64
 11  1969          142 non-null    float64
 12  1970          139 non-null    float64
 13  1971          153 non-null    float64
 14  1972          153 non-null    float64
 15  1973          153 non-null    float64
 16  1974          153 non-null    float64
 17  1975          155 non-null    float64
 18  1976          159 non-null    

In [19]:
# Get the gdp_growth_df columns
gdp_growth_df_columns_list = gdp_growth_df.columns
gdp_growth_df_columns_list

Index(['Country Name', 'Code', '1960', '1961', '1962', '1963', '1964', '1965',
       '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974',
       '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019',
       '2020', 'Unnamed: 65'],
      dtype='object')

In [20]:
# Select only the intended columns for melting
year_columns = [col for col in gdp_growth_df.columns if col.isdigit()]
id_vars = ['Country Name', 'Code']

# Melt the DataFrame to long format
gdp_growth_time_series_df = pd.melt(gdp_growth_df, 
                             id_vars=id_vars, 
                             value_vars=year_columns, 
                             var_name='year', 
                             value_name='gdp_growth')

# Convert year to integer
gdp_growth_time_series_df['year'] = gdp_growth_time_series_df['year'].astype(int)

# Create a unique identifier for each country-year combination
gdp_growth_time_series_df['id'] = gdp_growth_time_series_df['Country Name'] + '_' + gdp_growth_time_series_df['year'].astype(str)

gdp_growth_time_series_df.rename(columns={
 'Country Name': 'Country',
 'year': 'Year',
 'gdp_growth': 'GDP_Growth',
 'id': 'ID'
}, inplace=True)

# Check the transformed DataFrame
print(gdp_growth_time_series_df.head())

                       Country Code  Year  GDP_Growth  \
0                        Aruba  ABW  1960         NaN   
1  Africa Eastern and Southern  AFE  1960         NaN   
2                  Afghanistan  AFG  1960         NaN   
3   Africa Western and Central  AFW  1960         NaN   
4                       Angola  AGO  1960         NaN   

                                 ID  
0                        Aruba_1960  
1  Africa Eastern and Southern_1960  
2                  Afghanistan_1960  
3   Africa Western and Central_1960  
4                       Angola_1960  


In [21]:
# Export gdp_time_series_df as CSV files.
gdp_growth_time_series_df.to_csv("cleaned_data/gdp_growth.csv", index=False)

## GDP per capita data

`GDP per Capita` = 
(`GDP per Capita in Current Period` − `GDP per Capita in Previous Period`) / (`GDP per Capita in Previous Period`) * 100
​


In [22]:
# Read gdp per capita data into Pandas dataframe
gdp_percapita_df = pd.read_csv('Dataset/gdp_per_capita_kaggle_raw.csv')
gdp_percapita_df.head()

Unnamed: 0,Country Name,Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
0,Aruba,ABW,,,,,,,,,...,24712.493263,26441.619936,26893.011506,28396.908423,28452.170615,29350.805019,30253.279358,,,
1,Africa Eastern and Southern,AFE,147.612227,147.014904,156.189192,182.243917,162.347592,180.214908,190.845484,192.337167,...,1736.16656,1713.899299,1703.596298,1549.03794,1431.778723,1573.063386,1574.978648,1530.059177,1359.618224,
2,Afghanistan,AFG,59.773234,59.8609,58.458009,78.706429,82.095307,101.108325,137.594298,160.898434,...,638.845852,624.315455,614.223342,556.007221,512.012778,516.679862,485.668419,494.17935,516.747871,
3,Africa Western and Central,AFW,107.932233,113.081647,118.831107,123.442888,131.854402,138.526332,144.326212,128.58247,...,1965.118485,2157.481149,2212.853135,1894.310195,1673.835527,1613.473553,1704.139603,1777.918672,1710.073363,
4,Angola,AGO,,,,,,,,,...,5100.097027,5254.881126,5408.4117,4166.979833,3506.073128,4095.810057,3289.643995,2809.626088,1776.166868,


In [23]:
gdp_percapita_df.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
count,128.0,134.0,137.0,137.0,137.0,148.0,151.0,154.0,159.0,159.0,...,257.0,258.0,258.0,257.0,256.0,256.0,256.0,253.0,242.0,0.0
mean,482.725314,491.115624,513.103203,544.566319,590.805091,654.822693,710.326019,725.993794,742.735017,804.348363,...,16120.560273,16659.720923,16964.258521,15359.98064,15524.456868,16308.827347,17288.675629,17144.609643,14354.539354,
std,626.040168,639.846845,669.060194,708.534576,774.925619,864.084238,935.096499,966.070357,994.686146,1071.190027,...,23550.237516,25086.503778,25768.767745,23277.449054,23522.761455,24336.089896,25989.582741,25903.117527,21898.003148,
min,40.537211,26.308357,26.98592,28.44943,20.035487,16.596459,12.802812,12.915456,20.418277,20.700642,...,252.358871,256.975653,274.857836,305.549653,260.565221,253.826354,238.783467,228.213589,238.990726,
25%,104.407303,109.078291,114.581807,121.76213,123.645925,138.512407,145.574804,155.377495,154.043379,160.648668,...,1965.118485,2092.400443,2156.75641,2049.851666,2093.428898,2101.842601,2202.590063,2246.625578,1936.184305,
50%,197.897848,197.158225,203.43737,213.896759,234.633358,249.112369,262.312268,252.705895,281.925786,293.387951,...,6528.971775,6797.232338,6655.280874,6175.87603,5921.203011,6336.133639,6962.584227,6853.693411,5606.538092,
75%,478.826988,475.192902,520.206131,583.411562,629.591526,683.03031,780.032061,750.374486,766.275654,826.348388,...,19870.801212,20140.95134,20148.31804,18028.973431,18356.000551,19689.335613,19952.381716,19575.768481,15650.348914,
max,3007.123445,3066.562869,3243.843078,3374.515171,3573.941185,4443.452338,4571.155742,4336.426587,4695.92339,5032.144743,...,157520.219427,177673.745368,189432.370013,167313.26628,170028.655718,171253.964254,185978.609251,189487.147128,173688.18936,


In [24]:
gdp_percapita_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 64 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  266 non-null    object 
 1   Code          266 non-null    object 
 2   1960          128 non-null    float64
 3   1961          134 non-null    float64
 4   1962          137 non-null    float64
 5   1963          137 non-null    float64
 6   1964          137 non-null    float64
 7   1965          148 non-null    float64
 8   1966          151 non-null    float64
 9   1967          154 non-null    float64
 10  1968          159 non-null    float64
 11  1969          159 non-null    float64
 12  1970          168 non-null    float64
 13  1971          171 non-null    float64
 14  1972          171 non-null    float64
 15  1973          171 non-null    float64
 16  1974          172 non-null    float64
 17  1975          174 non-null    float64
 18  1976          175 non-null    

In [25]:
# Get the gdp_percapita_df columns
gdp_percapita_df_columns_list = gdp_percapita_df.columns
gdp_percapita_df_columns_list

Index(['Country Name', 'Code', '1960', '1961', '1962', '1963', '1964', '1965',
       '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974',
       '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019',
       '2020', 'Unnamed: 65'],
      dtype='object')

In [26]:
# Select only the intended columns for melting
year_columns = [col for col in gdp_percapita_df.columns if col.isdigit()]
id_vars = ['Country Name', 'Code']

# Melt the DataFrame to long format
gdp_percapita_time_series_df = pd.melt(gdp_percapita_df, 
                             id_vars=id_vars, 
                             value_vars=year_columns, 
                             var_name='year', 
                             value_name='gdp_per_capita')

# Convert year to integer
gdp_percapita_time_series_df['year'] = gdp_percapita_time_series_df['year'].astype(int)

# Create a unique identifier for each country-year combination
gdp_percapita_time_series_df['id'] = gdp_percapita_time_series_df['Country Name'] + '_' + gdp_percapita_time_series_df['year'].astype(str)

gdp_percapita_time_series_df.rename(columns={
 'Country Name': 'Country',
 'year': 'Year',
 'gdp_per_capita': 'GDP_per_Capita',
 'id': 'ID'
}, inplace=True)


# Check the transformed DataFrame
print(gdp_percapita_time_series_df.head())

# Export gdp_time_series_df as CSV files.
gdp_percapita_time_series_df.to_csv("cleaned_data/gdp_percapita.csv", index=False)

                       Country Code  Year  GDP_per_Capita  \
0                        Aruba  ABW  1960             NaN   
1  Africa Eastern and Southern  AFE  1960      147.612227   
2                  Afghanistan  AFG  1960       59.773234   
3   Africa Western and Central  AFW  1960      107.932233   
4                       Angola  AGO  1960             NaN   

                                 ID  
0                        Aruba_1960  
1  Africa Eastern and Southern_1960  
2                  Afghanistan_1960  
3   Africa Western and Central_1960  
4                       Angola_1960  


## GDP ppp data

#### Calculation of GDP PPP - purchase power parity

The calculation of `GDP PPP` involves the following steps:

1. Identify a Basket of Goods:

- Select a representative basket of goods and services for comparison.

2. Calculate the Price of the Basket:

- Determine the cost of the basket in each country’s local currency.

3. Determine the PPP Exchange Rate:

- Calculate the PPP exchange rate by comparing the price of the basket in each country.

4. Adjust GDP:

- Adjust the nominal GDP using the PPP exchange rate to obtain GDP PPP.

In [27]:
# Read gdp ppp data into Pandas dataframe
gdp_ppp_df = pd.read_csv('Dataset/gdp_ppp_kaggle_raw.csv')
gdp_ppp_df.head()

Unnamed: 0,Country Name,Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
0,Aruba,ABW,,,,,,,,,...,3442856000.0,3799467000.0,3816822000.0,3893071000.0,3941354000.0,4098240000.0,,,,
1,Africa Eastern and Southern,AFE,,,,,,,,,...,1772679000000.0,1893539000000.0,2025402000000.0,2098286000000.0,2212573000000.0,2319151000000.0,2438518000000.0,2536280000000.0,2495345000000.0,
2,Afghanistan,AFG,,,,,,,,,...,59667000000.0,65039840000.0,69058340000.0,71831700000.0,70097960000.0,74711920000.0,77415570000.0,81879800000.0,80918340000.0,
3,Africa Western and Central,AFW,,,,,,,,,...,1396677000000.0,1526772000000.0,1645122000000.0,1662297000000.0,1678674000000.0,1744087000000.0,1841811000000.0,1937451000000.0,1946297000000.0,
4,Angola,AGO,,,,,,,,,...,186124200000.0,199865600000.0,220364800000.0,204603600000.0,204874700000.0,217987300000.0,218748600000.0,221262800000.0,211837300000.0,


In [28]:
gdp_ppp_df.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
count,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,242.0,244.0,243.0,243.0,242.0,242.0,241.0,240.0,238.0,0.0
mean,,,,,,,,,,,...,3704909000000.0,3865470000000.0,4032503000000.0,4116129000000.0,4307111000000.0,4547134000000.0,4857924000000.0,5100159000000.0,5082028000000.0,
std,,,,,,,,,,,...,11255580000000.0,11792730000000.0,12272920000000.0,12554410000000.0,13105830000000.0,13827320000000.0,14744780000000.0,15449700000000.0,15346520000000.0,
min,,,,,,,,,,,...,33140650.0,35265650.0,36402180.0,40108840.0,41759650.0,44286550.0,46483600.0,51929350.0,54867630.0,
25%,,,,,,,,,,,...,21775890000.0,22964670000.0,24182400000.0,24861170000.0,24901490000.0,27321490000.0,28774700000.0,30139070000.0,30631610000.0,
50%,,,,,,,,,,,...,114468100000.0,118577100000.0,119064600000.0,121023900000.0,128907300000.0,134078800000.0,144631300000.0,153513100000.0,160044200000.0,
75%,,,,,,,,,,,...,1155848000000.0,1088341000000.0,1140403000000.0,1094342000000.0,1145318000000.0,1202053000000.0,1286524000000.0,1423722000000.0,1563385000000.0,
max,,,,,,,,,,,...,100045000000000.0,104963500000000.0,108910800000000.0,111242600000000.0,115694600000000.0,121890100000000.0,129316300000000.0,134828900000000.0,132999100000000.0,


In [29]:
gdp_ppp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 64 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  266 non-null    object 
 1   Code          266 non-null    object 
 2   1960          0 non-null      float64
 3   1961          0 non-null      float64
 4   1962          0 non-null      float64
 5   1963          0 non-null      float64
 6   1964          0 non-null      float64
 7   1965          0 non-null      float64
 8   1966          0 non-null      float64
 9   1967          0 non-null      float64
 10  1968          0 non-null      float64
 11  1969          0 non-null      float64
 12  1970          0 non-null      float64
 13  1971          0 non-null      float64
 14  1972          0 non-null      float64
 15  1973          0 non-null      float64
 16  1974          0 non-null      float64
 17  1975          0 non-null      float64
 18  1976          0 non-null      

In [30]:
# Get the gdp_ppp_df columns
gdp_ppp_df_columns_list = gdp_ppp_df.columns
gdp_ppp_df_columns_list

Index(['Country Name', 'Code', '1960', '1961', '1962', '1963', '1964', '1965',
       '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974',
       '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019',
       '2020', 'Unnamed: 65'],
      dtype='object')

In [31]:
# Select only the intended columns for melting
year_columns = [col for col in gdp_ppp_df.columns if col.isdigit()]
id_vars = ['Country Name', 'Code']

# Melt the DataFrame to long format
gdp_ppp_time_series_df = pd.melt(gdp_ppp_df, 
                             id_vars=id_vars, 
                             value_vars=year_columns, 
                             var_name='year', 
                             value_name='gdp_ppp')

# Convert year to integer
gdp_ppp_time_series_df['year'] = gdp_ppp_time_series_df['year'].astype(int)

# Create a unique identifier for each country-year combination
gdp_ppp_time_series_df['id'] = gdp_ppp_time_series_df['Country Name'] + '_' + gdp_ppp_time_series_df['year'].astype(str)

gdp_ppp_time_series_df.rename(columns={
 'Country Name': 'Country',
 'year': 'Year',
 'gdp_ppp': 'GDP_PPP',
 'id': 'ID'
}, inplace=True)

# Check the transformed DataFrame
print(gdp_ppp_time_series_df.head())

# Export gdp_time_series_df as CSV files.
gdp_ppp_time_series_df.to_csv("cleaned_data/gdp_ppp.csv", index=False)

                       Country Code  Year  GDP_PPP  \
0                        Aruba  ABW  1960      NaN   
1  Africa Eastern and Southern  AFE  1960      NaN   
2                  Afghanistan  AFG  1960      NaN   
3   Africa Western and Central  AFW  1960      NaN   
4                       Angola  AGO  1960      NaN   

                                 ID  
0                        Aruba_1960  
1  Africa Eastern and Southern_1960  
2                  Afghanistan_1960  
3   Africa Western and Central_1960  
4                       Angola_1960  


## GDP ppp per capita data

`GDP PPP per capita` is the gross domestic product (GDP) of a country adjusted for purchasing power parity (PPP) divided by its population. It provides a per-person measure of economic output that accounts for differences in price levels across countries, offering a more accurate comparison of living standards and economic well-being between nations.

Formula:
GDP PPP per Capita = (GDP PPP) / Population

Use:

- Comparing Living Standards: Offers a fairer comparison of average income and standard of living across countries.
- Economic Analysis: Helps assess economic productivity and well-being on a per-person basis, adjusted for cost of living differences.

In [32]:
# Read gdp ppp per capita data into Pandas dataframe
gdp_ppp_percapita_df = pd.read_csv('Dataset/gdp_ppp_per_capita_kaggle_raw.csv')
gdp_ppp_percapita_df.head()

Unnamed: 0,Country Name,Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
0,Aruba,ABW,,,,,,,,,...,33567.550017,36829.032774,36779.429429,37311.75032,37585.025079,38897.122666,,,,
1,Africa Eastern and Southern,AFE,,,,,,,,,...,3237.870658,3365.682977,3503.699383,3533.230819,3627.294348,3702.390685,3791.875407,3842.578511,3684.562623,
2,Afghanistan,AFG,,,,,,,,,...,1914.774228,2015.514775,2069.424022,2087.305323,1981.118069,2058.400221,2082.635648,2152.366489,2078.648615,
3,Africa Western and Central,AFW,,,,,,,,,...,3772.323802,4013.196523,4208.73178,4139.323364,4069.005667,4115.64548,4231.815774,4335.199547,4242.114489,
4,Angola,AGO,,,,,,,,,...,7412.967035,7682.475386,8179.297828,7337.569822,7103.226431,7310.896589,7099.971958,6952.419362,6445.432873,


In [33]:
gdp_ppp_percapita_df.describe()

Unnamed: 0,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,Unnamed: 65
count,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,242.0,244.0,243.0,243.0,242.0,242.0,241.0,240.0,238.0,0.0
mean,,,,,,,,,,,...,18328.390712,18751.443091,19168.560583,18814.45804,19403.458591,20370.527037,21163.156817,21637.501483,20327.58553,
std,,,,,,,,,,,...,20835.907707,21334.683819,21276.21636,19543.702698,19843.679781,20984.733337,21876.165789,22189.879498,20575.707251,
min,,,,,,,,,,,...,669.567443,738.474892,720.32411,787.012906,796.944083,773.572859,779.808176,783.451983,771.163242,
25%,,,,,,,,,,,...,3859.104915,4088.445436,4307.315072,4403.744735,4694.439582,4834.519273,5045.510271,5089.415179,5046.672965,
50%,,,,,,,,,,,...,11166.256302,11262.851557,11713.175908,12015.640528,12543.977976,13142.54232,13964.98408,14247.089866,13325.063671,
75%,,,,,,,,,,,...,24298.929126,24824.099061,25443.367959,25536.303599,26669.298888,28432.100384,29309.206273,30510.233694,27920.15188,
max,,,,,,,,,,,...,141634.703825,153563.91096,152856.341085,116298.729552,114893.015331,126144.104058,135551.807873,132654.8967,117500.207222,


In [34]:
gdp_ppp_percapita_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 64 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  266 non-null    object 
 1   Code          266 non-null    object 
 2   1960          0 non-null      float64
 3   1961          0 non-null      float64
 4   1962          0 non-null      float64
 5   1963          0 non-null      float64
 6   1964          0 non-null      float64
 7   1965          0 non-null      float64
 8   1966          0 non-null      float64
 9   1967          0 non-null      float64
 10  1968          0 non-null      float64
 11  1969          0 non-null      float64
 12  1970          0 non-null      float64
 13  1971          0 non-null      float64
 14  1972          0 non-null      float64
 15  1973          0 non-null      float64
 16  1974          0 non-null      float64
 17  1975          0 non-null      float64
 18  1976          0 non-null      

In [35]:
# Get the gdp_ppp_percapita_df columns
gdp_ppp_percapita_df_columns_list = gdp_ppp_percapita_df.columns
gdp_ppp_percapita_df_columns_list

Index(['Country Name', 'Code', '1960', '1961', '1962', '1963', '1964', '1965',
       '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974',
       '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019',
       '2020', 'Unnamed: 65'],
      dtype='object')

In [36]:
# Select only the intended columns for melting
year_columns = [col for col in gdp_ppp_percapita_df.columns if col.isdigit()]
id_vars = ['Country Name', 'Code']

# Melt the DataFrame to long format
gdp_ppp_percapita_time_series_df = pd.melt(gdp_ppp_percapita_df, 
                             id_vars=id_vars, 
                             value_vars=year_columns, 
                             var_name='year', 
                             value_name='gdp_ppp_percapita')

# Convert year to integer
gdp_ppp_percapita_time_series_df['year'] = gdp_ppp_percapita_time_series_df['year'].astype(int)

# Create a unique identifier for each country-year combination
gdp_ppp_percapita_time_series_df['id'] = gdp_ppp_percapita_time_series_df['Country Name'] + '_' + gdp_ppp_percapita_time_series_df['year'].astype(str)

gdp_ppp_percapita_time_series_df.rename(columns={
 'Country Name': 'Country',
 'year': 'Year',
 'gdp_ppp_percapita': 'GDP_PPP_per_Capita',
 'id': 'ID'
}, inplace=True)

# Check the transformed DataFrame
print(gdp_ppp_percapita_time_series_df.head())

# Export gdp_time_series_df as CSV files.
gdp_ppp_percapita_time_series_df.to_csv("cleaned_data/gdp_ppp_percapita.csv", index=False)

                       Country Code  Year  GDP_PPP_per_Capita  \
0                        Aruba  ABW  1960                 NaN   
1  Africa Eastern and Southern  AFE  1960                 NaN   
2                  Afghanistan  AFG  1960                 NaN   
3   Africa Western and Central  AFW  1960                 NaN   
4                       Angola  AGO  1960                 NaN   

                                 ID  
0                        Aruba_1960  
1  Africa Eastern and Southern_1960  
2                  Afghanistan_1960  
3   Africa Western and Central_1960  
4                       Angola_1960  


## Global unemployment data - first data set

### Global Unemployment Data

**Global unemployment data** refers to the statistical measurement of the proportion of the labor force that is jobless and actively seeking employment across various countries. This data helps understand the economic health and labor market conditions of different regions.

### Key Points

1. **Unemployment Rate**:
   - The unemployment rate is the percentage of the labor force that is unemployed and actively seeking work.
   - Formula: 
   \[
   \text{Unemployment Rate} = \left( \frac{\text{Number of Unemployed}}{\text{Labor Force}} \right) \times 100
   \]

2. **Data Sources**:
   - **International Labour Organization (ILO)**: Provides comprehensive global labor statistics.
   - **World Bank**: Offers data on unemployment rates worldwide.
   - **OECD**: Provides detailed labor market data for member countries.

3. **Usage**:
   - **Economic Analysis**: Helps assess the health of economies and the effectiveness of labor policies.
   - **Policy Making**: Guides governments in formulating policies to reduce unemployment.
   - **Comparative Studies**: Enables comparison of unemployment rates between countries and regions.

4. **Factors Influencing Unemployment**:
   - **Economic Conditions**: Recessions and economic downturns typically increase unemployment rates.
   - **Technological Changes**: Automation and technological advancements can affect job availability.
   - **Education and Skills**: Higher levels of education and skill development can impact employability.
   - **Government Policies**: Labor laws, minimum wage regulations, and job creation programs play a role.

5. **Types of Unemployment**:
   - **Frictional Unemployment**: Short-term unemployment during the transition between jobs.
   - **Structural Unemployment**: Long-term unemployment due to changes in the economy that make certain skills obsolete.
   - **Cyclical Unemployment**: Unemployment correlated with the business cycle, rising during recessions and falling during economic expansions.

### Summary

Global unemployment data is crucial for understanding the labor market and economic health of countries. It is used for economic analysis, policy making, and comparative studies to address unemployment issues effectively.

In [37]:
# Read unemployment data into Pandas dataframe (dataset 1)
gdp_unemployment_df1 = pd.read_csv('Dataset/global_unemployment_data_kaggle_raw.csv')
gdp_unemployment_df1.head()

Unnamed: 0,country_name,indicator_name,sex,age_group,age_categories,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,Afghanistan,Unemployment rate by sex and age,Female,15-24,Youth,13.34,15.974,18.57,21.137,20.649,20.154,21.228,21.64,30.561,32.2,33.332
1,Afghanistan,Unemployment rate by sex and age,Female,25+,Adults,8.576,9.014,9.463,9.92,11.223,12.587,14.079,14.415,23.818,26.192,28.298
2,Afghanistan,Unemployment rate by sex and age,Female,Under 15,Children,10.306,11.552,12.789,14.017,14.706,15.418,16.783,17.134,26.746,29.193,30.956
3,Afghanistan,Unemployment rate by sex and age,Male,15-24,Youth,9.206,11.502,13.772,16.027,15.199,14.361,14.452,15.099,16.655,18.512,19.77
4,Afghanistan,Unemployment rate by sex and age,Male,25+,Adults,6.463,6.879,7.301,7.728,7.833,7.961,8.732,9.199,11.357,12.327,13.087


In [38]:
gdp_unemployment_df1.describe()

Unnamed: 0,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
count,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1134.0,1128.0,1122.0,1122.0
mean,11.3878,11.272444,11.122963,10.863516,10.516499,10.311452,11.851285,11.422645,10.340361,9.985181,9.940089
std,11.119002,10.915942,10.742947,10.64098,10.527773,10.297952,11.23158,10.873412,10.26481,9.987778,9.977512
min,0.027,0.034,0.038,0.035,0.044,0.036,0.056,0.064,0.067,0.063,0.06
25%,3.9335,3.9935,3.94525,3.7475,3.67275,3.5385,4.3345,4.1535,3.55525,3.4775,3.45975
50%,7.6975,7.5475,7.5045,7.1405,6.706,6.6275,8.0675,7.5425,6.5715,6.466,6.364
75%,15.05075,14.76625,14.4675,14.142,13.343,13.2855,15.31625,14.8815,13.41,12.9145,12.68775
max,74.485,74.655,74.72,75.416,76.395,77.173,83.99,82.135,78.776,78.541,78.644


In [39]:
gdp_unemployment_df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1134 entries, 0 to 1133
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   country_name    1134 non-null   object 
 1   indicator_name  1134 non-null   object 
 2   sex             1134 non-null   object 
 3   age_group       1134 non-null   object 
 4   age_categories  1134 non-null   object 
 5   2014            1134 non-null   float64
 6   2015            1134 non-null   float64
 7   2016            1134 non-null   float64
 8   2017            1134 non-null   float64
 9   2018            1134 non-null   float64
 10  2019            1134 non-null   float64
 11  2020            1134 non-null   float64
 12  2021            1134 non-null   float64
 13  2022            1128 non-null   float64
 14  2023            1122 non-null   float64
 15  2024            1122 non-null   float64
dtypes: float64(11), object(5)
memory usage: 141.9+ KB


In [40]:
# Get the gdp_ppp_percapita_df columns
gdp_unemployment_df1_columns_list = gdp_unemployment_df1.columns
gdp_unemployment_df1_columns_list

Index(['country_name', 'indicator_name', 'sex', 'age_group', 'age_categories',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
       '2023', '2024'],
      dtype='object')

In [42]:
# Select only the intended columns for melting
year_columns = [col for col in gdp_unemployment_df1.columns if col.isdigit()]
id_vars = ['country_name', 'indicator_name', 'sex', 'age_group', 'age_categories']

# Melt the DataFrame to long format
gdp_unemployment_time_series_df = pd.melt(gdp_unemployment_df1, 
                             id_vars=id_vars, 
                             value_vars=year_columns, 
                             var_name='year', 
                             value_name='unemployment')

# Convert year to integer
gdp_unemployment_time_series_df['year'] = gdp_unemployment_time_series_df['year'].astype(int)

# Create a unique identifier for each country-year combination
gdp_unemployment_time_series_df['id'] = gdp_unemployment_time_series_df['country_name'] + '_' + gdp_unemployment_time_series_df['sex'] + '_' + gdp_unemployment_time_series_df['age_group'] + '_' + gdp_unemployment_time_series_df['age_categories'] + '_' + gdp_unemployment_time_series_df['year'].astype(str)

gdp_unemployment_time_series_df.rename(columns={
 'country_name': 'Country',
 'indicator_name': 'Indicator',
 'sex': 'Gender',
 'age_group': 'Age_Group',
 'age_category': 'Age_Category',
 'year': 'Year',
 'unemployment': 'Unemployment_Rate',
 'id': 'ID'
}, inplace=True)

# Check the transformed DataFrame
print(gdp_unemployment_time_series_df.head())

# Export gdp_time_series_df as CSV files.
gdp_unemployment_time_series_df.to_csv("cleaned_data/gdp_unemployment.csv", index=False)

       Country                         Indicator  Gender Age_Group  \
0  Afghanistan  Unemployment rate by sex and age  Female     15-24   
1  Afghanistan  Unemployment rate by sex and age  Female       25+   
2  Afghanistan  Unemployment rate by sex and age  Female  Under 15   
3  Afghanistan  Unemployment rate by sex and age    Male     15-24   
4  Afghanistan  Unemployment rate by sex and age    Male       25+   

  age_categories  Year  Unemployment_Rate  \
0          Youth  2014             13.340   
1         Adults  2014              8.576   
2       Children  2014             10.306   
3          Youth  2014              9.206   
4         Adults  2014              6.463   

                                          ID  
0        Afghanistan_Female_15-24_Youth_2014  
1         Afghanistan_Female_25+_Adults_2014  
2  Afghanistan_Female_Under 15_Children_2014  
3          Afghanistan_Male_15-24_Youth_2014  
4           Afghanistan_Male_25+_Adults_2014  


## Global unemployment data - second data set

In [46]:
# Read unemployment data into Pandas dataframe (dataset 2)
gdp_unemployment_df2 = pd.read_csv('Dataset/unemployment analysis_kaggle_raw.csv')
gdp_unemployment_df2.head()

Unnamed: 0,Country Name,Country Code,1991,1992,1993,1994,1995,1996,1997,1998,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Africa Eastern and Southern,AFE,7.8,7.84,7.85,7.84,7.83,7.84,7.86,7.81,...,6.56,6.45,6.41,6.49,6.61,6.71,6.73,6.91,7.56,8.11
1,Afghanistan,AFG,10.65,10.82,10.72,10.73,11.18,10.96,10.78,10.8,...,11.34,11.19,11.14,11.13,11.16,11.18,11.15,11.22,11.71,13.28
2,Africa Western and Central,AFW,4.42,4.53,4.55,4.54,4.53,4.57,4.6,4.66,...,4.64,4.41,4.69,4.63,5.57,6.02,6.04,6.06,6.77,6.84
3,Angola,AGO,4.21,4.21,4.23,4.16,4.11,4.1,4.09,4.07,...,7.35,7.37,7.37,7.39,7.41,7.41,7.42,7.42,8.33,8.53
4,Albania,ALB,10.31,30.01,25.26,20.84,14.61,13.93,16.88,20.05,...,13.38,15.87,18.05,17.19,15.42,13.62,12.3,11.47,13.33,11.82


In [47]:
gdp_unemployment_df2.describe()

Unnamed: 0,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
count,235.0,235.0,235.0,235.0,235.0,235.0,235.0,235.0,235.0,235.0,...,235.0,235.0,235.0,235.0,235.0,235.0,235.0,235.0,235.0,235.0
mean,7.278,7.62634,8.070766,8.246043,8.333915,8.494894,8.394043,8.441064,8.568043,8.438979,...,8.062553,8.086468,7.92434,7.818426,7.720979,7.485404,7.247404,7.087362,8.278809,8.21966
std,6.013749,6.296617,6.335855,6.243778,6.330822,6.358431,6.206845,6.133045,6.088361,6.126318,...,5.780173,5.832019,5.699899,5.574759,5.456333,5.318381,5.240429,5.129146,5.470319,5.506914
min,0.3,0.34,0.41,0.47,0.5,0.56,0.54,0.56,0.57,0.58,...,0.48,0.25,0.2,0.17,0.15,0.14,0.11,0.1,0.21,0.26
25%,2.945,3.14,3.7,3.89,3.945,3.995,4.02,4.085,4.275,4.07,...,4.09,4.245,4.2,4.315,4.31,4.075,3.875,3.805,4.62,4.75
50%,5.41,5.71,6.03,6.55,6.7,7.05,6.93,6.89,6.69,6.53,...,6.45,6.29,6.15,6.08,6.01,5.8,5.62,5.53,6.8,6.58
75%,9.815,10.17,10.895,11.11,11.05,11.405,11.09,11.5,11.845,11.565,...,10.655,10.465,10.29,10.08,9.895,9.445,9.06,8.605,10.23,10.245
max,36.12,36.39,36.74,36.98,37.34,38.8,37.94,37.16,36.35,35.46,...,31.02,29.0,28.03,27.69,26.54,27.04,26.91,28.47,29.22,33.56


In [48]:
gdp_unemployment_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 33 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Country Name  235 non-null    object 
 1   Country Code  235 non-null    object 
 2   1991          235 non-null    float64
 3   1992          235 non-null    float64
 4   1993          235 non-null    float64
 5   1994          235 non-null    float64
 6   1995          235 non-null    float64
 7   1996          235 non-null    float64
 8   1997          235 non-null    float64
 9   1998          235 non-null    float64
 10  1999          235 non-null    float64
 11  2000          235 non-null    float64
 12  2001          235 non-null    float64
 13  2002          235 non-null    float64
 14  2003          235 non-null    float64
 15  2004          235 non-null    float64
 16  2005          235 non-null    float64
 17  2006          235 non-null    float64
 18  2007          235 non-null    

In [49]:
# Get the gdp_ppp_percapita_df columns
gdp_unemployment_df2_columns_list = gdp_unemployment_df2.columns
gdp_unemployment_df2_columns_list

Index(['Country Name', 'Country Code', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021'],
      dtype='object')

In [52]:
# Select only the intended columns for melting
year_columns = [col for col in gdp_unemployment_df2.columns if col.isdigit()]
id_vars = ['Country Name', 'Country Code']

# Melt the DataFrame to long format
gdp_unemployment2_time_series_df = pd.melt(gdp_unemployment_df2, 
                             id_vars=id_vars, 
                             value_vars=year_columns, 
                             var_name='year', 
                             value_name='unemployment')

# Convert year to integer
gdp_unemployment2_time_series_df['year'] = gdp_unemployment2_time_series_df['year'].astype(int)

# Create a unique identifier for each country-year combination
gdp_unemployment2_time_series_df['id'] = gdp_unemployment2_time_series_df['Country Name'] + '_' + gdp_unemployment2_time_series_df['year'].astype(str)

gdp_unemployment2_time_series_df.rename(columns={
 'Country Name': 'Country',
 'Country Code': 'Code',
 'year': 'Year',
 'unemployment': 'Unemployment_Rate',
 'id': 'ID'
}, inplace=True)

# Check the transformed DataFrame
print(gdp_unemployment2_time_series_df.head())

# Export gdp_time_series_df as CSV files.
gdp_unemployment2_time_series_df.to_csv("cleaned_data/gdp_unemployment_for_use.csv", index=False)

                       Country Code  Year  Unemployment_Rate  \
0  Africa Eastern and Southern  AFE  1991               7.80   
1                  Afghanistan  AFG  1991              10.65   
2   Africa Western and Central  AFW  1991               4.42   
3                       Angola  AGO  1991               4.21   
4                      Albania  ALB  1991              10.31   

                                 ID  
0  Africa Eastern and Southern_1991  
1                  Afghanistan_1991  
2   Africa Western and Central_1991  
3                       Angola_1991  
4                      Albania_1991  


# Database Schema

## ERD Diagram

[Used this Site for ERD Diagram](https://www.quickdatabasediagrams.com/)

[global_inflation_ERD](https://github.com/umasel/Global_Inflation_Trends_Dashboard/blob/main/postgreSQL_db/sql_code.sql)

![inflation_ERD.png](attachment:inflation_ERD.png)

In [53]:
# Load inflation data
inflation_df = pd.read_csv('C:\\Users\\wware\\Desktop\\UWA Bootcamp\\Challenges\\Global_Inflation_Trends_Dashboard\\cleaned_data\\inflation.csv')

# Load gdp data
gdp_df = pd.read_csv('C:\\Users\\wware\\Desktop\\UWA Bootcamp\\Challenges\\Global_Inflation_Trends_Dashboard\\cleaned_data\\gdp.csv')

In [55]:
# Get a list of IDs in the inflation data
inflation_ids = set(inflation_df['ID'])

# Get a list of IDs in the gdp data
gdp_ids = set(gdp_df['ID'])

# Identify missing IDs
missing_ids_in_gdp = gdp_ids - inflation_ids
missing_ids_in_inflation = inflation_ids - gdp_ids

print(f"Missing IDs in GDP: {missing_ids_in_gdp}")
print(f"Missing IDs in Inflation: {missing_ids_in_inflation}")



Missing IDs in GDP: {'Peru_1962', 'Middle East & North Africa_1973', 'Middle East & North Africa (IDA & IBRD countries)_1971', 'Austria_1968', 'High income_2000', 'Liberia_1960', 'Uruguay_1973', 'Africa Western and Central_2007', 'Chile_1973', 'Azerbaijan_1971', 'Latin America & Caribbean_1993', 'Turks and Caicos Islands_1970', 'French Polynesia_2004', 'Israel_1964', 'French Polynesia_1996', 'Caribbean small states_1966', 'Greenland_1998', 'Guam_2018', 'Panama_1971', 'Burundi_1976', 'Least developed countries: UN classification_2016', 'Heavily indebted poor countries (HIPC)_2005', 'Malaysia_1978', 'Arab World_2005', 'Yemen, Rep._1983', 'South Sudan_1975', 'Not classified_1985', 'Isle of Man_1962', 'Caribbean small states_2007', 'Sierra Leone_1961', 'Austria_1978', 'Middle income_2013', 'Latin America & the Caribbean (IDA & IBRD countries)_1987', 'Other small states_1996', 'Virgin Islands (U.S.)_2020', 'Afghanistan_1961', 'Fragile and conflict affected situations_2003', 'South Asia_2015

In [56]:
# Export missing IDs to CSV files for review
missing_gdp_df = gdp_df[gdp_df['ID'].isin(missing_ids_in_gdp)]
missing_gdp_df.to_csv('C:\\Users\\wware\\Desktop\\UWA Bootcamp\\Challenges\\Global_Inflation_Trends_Dashboard\\cleaned_data\\missing_gdp.csv', index=False)

missing_inflation_df = inflation_df[inflation_df['ID'].isin(missing_ids_in_inflation)]
missing_inflation_df.to_csv('C:\\Users\\wware\\Desktop\\UWA Bootcamp\\Challenges\\Global_Inflation_Trends_Dashboard\\cleaned_data\\missing_inflation.csv', index=False)
