The first step is to import the World Bank datasets and combine them into a single dataset that we can then merge with the carbon emissions dataset. The following cells show the steps to consolidate all of the data of interest from the World Bank into a single dataframe:

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

main_df = pd.read_csv("world_bank_data/API_11_DS2_en_csv_v2_2163688.csv", skiprows=4).drop('Unnamed: 65', axis=1)
main_df.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Aruba,ABW,"Survey mean consumption or income per capita, bottom 40% of population (2011 PPP $ per day)",SI.SPR.PC40,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Aruba,ABW,Poverty gap at $5.50 a day (2011 PPP) (%),SI.POV.UMIC.GP,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Aruba,ABW,Poverty headcount ratio at $5.50 a day (2011 PPP) (% of population),SI.POV.UMIC,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,Aruba,ABW,Poverty headcount ratio at national poverty lines (% of population),SI.POV.NAHC,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,Aruba,ABW,Multidimensional poverty index (scale 0-1),SI.POV.MDIM.XQ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
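
The skiprows=4 argument and the dropped 'Unnamed: 65' column suggest that the World Bank export has a few preamble lines before the header and a trailing comma on each row, which pandas reads as an extra, empty column. The sketch below reproduces that behavior on a small synthetic CSV (the file contents are made up for illustration and are not taken from the real download) and drops the unnamed column by name pattern rather than by a hard-coded position.

```python
import io
import pandas as pd

# Synthetic stand-in for a World Bank export (NOT the real file):
# four preamble lines, then a header and rows that each end with a trailing comma.
raw = (
    '"Data Source","Hypothetical example"\n'
    '"Note","preamble line 2"\n'
    '"Note","preamble line 3"\n'
    '"Note","preamble line 4"\n'
    '"Country Name","Country Code","1960","1961",\n'
    '"Aruba","ABW","","",\n'
    '"Angola","AGO","1.0","2.0",\n'
)

# Skip the preamble so the fifth line is used as the header
toy_df = pd.read_csv(io.StringIO(raw), skiprows=4)
print(toy_df.columns.tolist())

# The trailing comma produces an empty column that pandas labels "Unnamed: <position>"
unnamed_cols = [c for c in toy_df.columns if c.startswith("Unnamed:")]
toy_clean = toy_df.drop(columns=unnamed_cols)
print(toy_clean)
```

Dropping by pattern is only a convenience: if a later download adds a year column, the position of the unnamed column shifts and a hard-coded 'Unnamed: 65' would no longer match.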


Next, read in another World Bank dataset, which gives us each country's region and income level.

In [2]:
counties_df = pd.read_csv("world_bank_data/Metadata_Country_API_11_DS2_en_csv_v2_2163688.csv").drop('Unnamed: 5', axis=1)
# If IncomeGroup is NaN, drop the row, because we only want income data for countries;
# the rows without an income group are continents and other aggregates, not countries
counties_df = counties_df[pd.notna(counties_df['IncomeGroup'])]
# only columns we care about
counties_df = counties_df[['Country Code', 'Region', 'IncomeGroup']]
counties_df = counties_df.rename(columns={'IncomeGroup':'Income'})
counties_df.head()

Unnamed: 0,Country Code,Region,Income
0,ABW,Latin America & Caribbean,High income
1,AFG,South Asia,Low income
2,AGO,Sub-Saharan Africa,Lower middle income
3,ALB,Europe & Central Asia,Upper middle income
4,AND,Europe & Central Asia,High income


Perform an inner merge with the first World Bank dataset to get the country name alongside the region and income information.

In [3]:
# merge with other df on "Country Code"
merged_counties = pd.merge(main_df, counties_df, how='inner', on='Country Code')
# filter to only the columns we care about
merged_counties = merged_counties[['Country Name', 'Country Code', 'Region','Income']]
# drop duplicate rows, because each country had 29 duplicates
merged_counties = merged_counties.drop_duplicates().reset_index(drop=True)
merged_counties.head(10)

Unnamed: 0,Country Name,Country Code,Region,Income
0,Aruba,ABW,Latin America & Caribbean,High income
1,Afghanistan,AFG,South Asia,Low income
2,Angola,AGO,Sub-Saharan Africa,Lower middle income
3,Albania,ALB,Europe & Central Asia,Upper middle income
4,Andorra,AND,Europe & Central Asia,High income
5,United Arab Emirates,ARE,Middle East & North Africa,High income
6,Argentina,ARG,Latin America & Caribbean,Upper middle income
7,Armenia,ARM,Europe & Central Asia,Upper middle income
8,American Samoa,ASM,East Asia & Pacific,Upper middle income
9,Antigua and Barbuda,ATG,Latin America & Caribbean,High income
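
The filter-merge-deduplicate pattern used in the last two cells can be sketched on toy data. The frames below are hypothetical stand-ins for main_df and the country metadata (the names indicators, metadata, and lookup are ours), and the validate argument is an optional extra check that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the indicator data and the country metadata
indicators = pd.DataFrame({
    "Country Name": ["Aruba", "Aruba", "Angola", "Angola", "World", "World"],
    "Country Code": ["ABW", "ABW", "AGO", "AGO", "WLD", "WLD"],
    "Indicator Code": ["X1", "X2", "X1", "X2", "X1", "X2"],
})
metadata = pd.DataFrame({
    "Country Code": ["ABW", "AGO", "WLD"],
    "Region": ["Latin America & Caribbean", "Sub-Saharan Africa", None],
    "IncomeGroup": ["High income", "Lower middle income", None],
})

# Aggregates such as "World" carry no income group, so dropping NaN removes them
countries_only = metadata.dropna(subset=["IncomeGroup"])

# validate="many_to_one" raises a MergeError if the metadata has duplicate country codes
merged = pd.merge(indicators, countries_only, how="inner",
                  on="Country Code", validate="many_to_one")

# One row per indicator collapses to one row per country after deduplication
lookup = (
    merged[["Country Name", "Country Code", "Region", "IncomeGroup"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(lookup)
```

Because the merge is an inner join, the aggregate rows disappear from the result without any extra filtering on the indicator side.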


With the income data in the desired format, we next need to add GDP information. We start by importing the data and then use the percent_missing_cols function (defined below) to see the percentage of missing values in each column of the gdp_df dataframe:

In [4]:
gdp_df = pd.read_csv("world_bank_data/API_NY.GDP.MKTP.CD_DS2_en_csv_v2_2163564.csv", skiprows=4).drop('Unnamed: 65', axis=1)
gdp_df

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Aruba,ABW,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,,,,,,,,,,,,,,,,,,,,,4.054634e+08,4.876025e+08,5.964236e+08,6.953044e+08,7.648871e+08,8.721387e+08,9.584632e+08,1.082980e+09,1.245688e+09,1.320475e+09,1.379961e+09,1.531944e+09,1.665101e+09,1.722799e+09,1.873453e+09,1.920112e+09,1.941341e+09,2.021229e+09,2.228492e+09,2.330726e+09,2.424581e+09,2.615084e+09,2.745251e+09,2.498883e+09,2.390503e+09,2.549721e+09,2.534637e+09,2.701676e+09,2.765363e+09,2.919553e+09,2.965922e+09,3.056425e+09,,,
1,Afghanistan,AFG,GDP (current US$),NY.GDP.MKTP.CD,5.377778e+08,5.488889e+08,5.466667e+08,7.511112e+08,8.000000e+08,1.006667e+09,1.400000e+09,1.673333e+09,1.373333e+09,1.408889e+09,1.748887e+09,1.831109e+09,1.595555e+09,1.733333e+09,2.155555e+09,2.366667e+09,2.555556e+09,2.953333e+09,3.300000e+09,3.697940e+09,3.641723e+09,3.478788e+09,,,,,,,,,,,,,,,,,,,,,4.055180e+09,4.515559e+09,5.226779e+09,6.209138e+09,6.971286e+09,9.747880e+09,1.010923e+10,1.243909e+10,1.585657e+10,1.780429e+10,2.000160e+10,2.056107e+10,2.048489e+10,1.990711e+10,1.801775e+10,1.886995e+10,1.835388e+10,1.929110e+10,
2,Angola,AGO,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,,,,,,,,,,,,,,,5.930503e+09,5.550483e+09,5.550483e+09,5.784342e+09,6.131475e+09,7.553560e+09,7.072063e+09,8.083872e+09,8.769251e+09,1.020110e+10,1.122876e+10,1.060378e+10,8.307811e+09,5.768720e+09,4.438321e+09,5.538749e+09,7.526447e+09,7.648377e+09,6.506230e+09,6.152923e+09,9.129595e+09,8.936064e+09,1.528559e+10,1.781271e+10,2.355205e+10,3.697092e+10,5.238101e+10,6.526645e+10,8.853861e+10,7.030716e+10,8.379950e+10,1.117897e+11,1.280529e+11,1.367099e+11,1.457122e+11,1.161936e+11,1.011239e+11,1.221238e+11,1.013532e+11,8.881570e+10,
3,Albania,ALB,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,,,,,,,,,,,,,,,,,,,1.857338e+09,1.897050e+09,2.097326e+09,2.080796e+09,2.051236e+09,2.253090e+09,2.028554e+09,1.099559e+09,6.521750e+08,1.185315e+09,1.880952e+09,2.392765e+09,3.199643e+09,2.258516e+09,2.545967e+09,3.212119e+09,3.480355e+09,3.922099e+09,4.348070e+09,5.611492e+09,7.184681e+09,8.052076e+09,8.896074e+09,1.067732e+10,1.288135e+10,1.204422e+10,1.192693e+10,1.289077e+10,1.231983e+10,1.277622e+10,1.322814e+10,1.138685e+10,1.186120e+10,1.301969e+10,1.514702e+10,1.527918e+10,
4,Andorra,AND,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,,,,,7.861921e+07,8.940982e+07,1.134082e+08,1.508201e+08,1.865587e+08,2.201272e+08,2.272810e+08,2.540202e+08,3.080089e+08,4.115783e+08,4.464161e+08,3.889587e+08,3.758960e+08,3.278618e+08,3.300707e+08,3.467380e+08,4.820006e+08,6.113164e+08,7.214259e+08,7.954493e+08,1.029048e+09,1.106929e+09,1.210014e+09,1.007026e+09,1.017549e+09,1.178739e+09,1.223945e+09,1.180597e+09,1.211932e+09,1.239876e+09,1.429049e+09,1.546926e+09,1.755910e+09,2.361727e+09,2.894922e+09,3.159905e+09,3.456442e+09,3.952601e+09,4.085631e+09,3.674410e+09,3.449967e+09,3.629204e+09,3.188809e+09,3.193704e+09,3.271808e+09,2.789870e+09,2.896679e+09,3.000181e+09,3.218316e+09,3.154058e+09,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259,Kosovo,XKX,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.849196e+09,2.535334e+09,2.406271e+09,2.790456e+09,3.556757e+09,3.663102e+09,3.846820e+09,4.655899e+09,5.687418e+09,5.653793e+09,5.835874e+09,6.701698e+09,6.499807e+09,7.074778e+09,7.396705e+09,6.442916e+09,6.719172e+09,7.245707e+09,7.942962e+09,7.926134e+09,
260,"Yemen, Rep.",YEM,GDP (current US$),NY.GDP.MKTP.CD,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.647119e+09,5.930370e+09,6.463650e+09,5.368271e+09,4.167356e+09,4.258789e+09,5.785685e+09,6.838557e+09,6.325142e+09,7.641103e+09,9.652436e+09,9.861560e+09,1.069463e+10,1.177797e+10,1.387279e+10,1.674634e+10,1.906198e+10,2.165053e+10,2.691085e+10,2.513027e+10,3.090675e+10,3.272642e+10,3.540134e+10,4.041524e+10,4.320647e+10,4.245062e+10,3.093598e+10,2.673614e+10,2.348627e+10,2.258108e+10,
261,South Africa,ZAF,GDP (current US$),NY.GDP.MKTP.CD,7.575397e+09,7.972997e+09,8.497997e+09,9.423396e+09,1.037400e+10,1.133440e+10,1.235500e+10,1.377739e+10,1.489459e+10,1.678039e+10,1.841839e+10,2.033369e+10,2.135744e+10,2.929567e+10,3.680772e+10,3.811454e+10,3.660335e+10,4.065135e+10,4.673945e+10,5.764572e+10,8.298048e+10,8.545442e+10,7.842306e+10,8.741585e+10,7.734409e+10,5.908264e+10,6.752160e+10,8.857370e+10,9.517664e+10,9.903086e+10,1.155523e+11,1.239428e+11,1.345446e+11,1.343081e+11,1.397525e+11,1.554609e+11,1.476063e+11,1.525874e+11,1.377748e+11,1.366323e+11,1.363613e+11,1.215147e+11,1.154824e+11,1.752569e+11,2.285900e+11,2.577727e+11,2.716385e+11,2.994155e+11,2.867698e+11,2.959365e+11,3.753494e+11,4.164189e+11,3.963327e+11,3.668294e+11,3.509046e+11,3.176205e+11,2.963573e+11,3.495541e+11,3.682889e+11,3.514316e+11,
262,Zambia,ZMB,GDP (current US$),NY.GDP.MKTP.CD,7.130000e+08,6.962857e+08,6.931429e+08,7.187143e+08,8.394286e+08,1.082857e+09,1.264286e+09,1.368000e+09,1.605857e+09,1.965714e+09,1.825286e+09,1.687000e+09,1.910714e+09,2.268714e+09,3.121833e+09,2.618667e+09,2.746714e+09,2.483000e+09,2.813375e+09,3.325500e+09,3.829500e+09,3.872667e+09,3.994778e+09,3.216308e+09,2.739444e+09,2.281258e+09,1.661949e+09,2.269895e+09,3.713614e+09,3.998638e+09,3.285217e+09,3.378882e+09,3.181922e+09,3.273238e+09,3.656648e+09,3.807067e+09,3.597221e+09,4.303282e+09,3.537683e+09,3.404312e+09,3.600683e+09,4.094481e+09,4.193846e+09,4.901840e+09,6.221078e+09,8.331870e+09,1.275686e+10,1.405696e+10,1.791086e+10,1.532834e+10,2.026556e+10,2.345952e+10,2.550306e+10,2.804555e+10,2.715073e+10,2.124334e+10,2.095475e+10,2.586817e+10,2.631214e+10,2.330977e+10,


This function calculates the percentage of missing values in each column. Source: https://stackoverflow.com/questions/51070985/find-out-the-percentage-of-missing-values-in-each-column-in-the-given-dataset/51071037

In [5]:
def percent_missing_cols(df):
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_value_df = pd.DataFrame({'percent_missing': percent_missing})
    return missing_value_df

In [6]:
percent_missing_cols(gdp_df).tail(5)

Unnamed: 0,percent_missing
2016,4.924242
2017,4.924242
2018,5.681818
2019,12.878788
2020,100.0


Because about 5.7% of the values are missing for 2018 but almost 13% are missing for 2019, we use the 2018 GDP column for the GDP analysis. In this analysis, GDP only serves as a measure of a country's current economic status, so historical GDP information plays no role in this project.
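
The choice of year can also be expressed as a rule: take the most recent year whose share of missing values stays under some cutoff. The sketch below applies that rule to toy data. The 10% threshold and the names toy_gdp, threshold, and latest_year are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy GDP table (hypothetical values) with year columns stored as strings
toy_gdp = pd.DataFrame({
    "Country Code": ["AAA", "BBB", "CCC", "DDD"],
    "2017": [1.0, 2.0, 3.0, 4.0],
    "2018": [1.1, 2.1, 3.1, 4.1],
    "2019": [1.2, np.nan, np.nan, 4.2],
    "2020": [np.nan] * 4,
})

# isna().mean() * 100 is equivalent to isnull().sum() * 100 / len(df)
year_cols = [c for c in toy_gdp.columns if c.isdigit()]
pct_missing = toy_gdp[year_cols].isna().mean() * 100

threshold = 10.0  # assumed cutoff, in percent
eligible = pct_missing[pct_missing <= threshold]

# Compare the column labels as integers so that the latest year wins
latest_year = max(eligible.index, key=int)
print(pct_missing)
print("Most recent year under the threshold:", latest_year)
```

With the percentages shown above for the real data, a 10% cutoff would likewise select 2018 over 2019.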

In [7]:
merged_counties.columns

Index(['Country Name', 'Country Code', 'Region', 'Income'], dtype='object')

The following code merges the GDP information with the income information of each country and filters on the 2018 GDP information to create a merged dataframe with all the economic information we need for this analysis:

In [8]:
country_wealth_data = pd.merge(merged_counties, gdp_df, how='inner', on='Country Code').drop("Country Name_y", axis=1).rename(columns={'2018':'2018 GDP', 'Country Name_x':'Country Name'})
country_wealth_data = country_wealth_data[["Country Name", "Country Code", "Region", "Income", "2018 GDP"]]
country_wealth_data.head()

Unnamed: 0,Country Name,Country Code,Region,Income,2018 GDP
0,Aruba,ABW,Latin America & Caribbean,High income,
1,Afghanistan,AFG,South Asia,Low income,18353880000.0
2,Angola,AGO,Sub-Saharan Africa,Lower middle income,101353200000.0
3,Albania,ALB,Europe & Central Asia,Upper middle income,15147020000.0
4,Andorra,AND,Europe & Central Asia,High income,3218316000.0


The dataset below contains greenhouse gas emissions data by country and year.

In [9]:
co2_data = pd.read_csv("co2-data/owid-co2-data.csv")

In [10]:
co2_data.head()

Unnamed: 0,iso_code,country,year,co2,co2_growth_prct,co2_growth_abs,consumption_co2,trade_co2,trade_co2_share,co2_per_capita,consumption_co2_per_capita,share_global_co2,cumulative_co2,share_global_cumulative_co2,co2_per_gdp,consumption_co2_per_gdp,co2_per_unit_energy,cement_co2,coal_co2,flaring_co2,gas_co2,oil_co2,other_industry_co2,cement_co2_per_capita,coal_co2_per_capita,flaring_co2_per_capita,gas_co2_per_capita,oil_co2_per_capita,other_co2_per_capita,share_global_coal_co2,share_global_oil_co2,share_global_gas_co2,share_global_flaring_co2,share_global_cement_co2,cumulative_coal_co2,cumulative_oil_co2,cumulative_gas_co2,cumulative_flaring_co2,cumulative_cement_co2,share_global_cumulative_coal_co2,share_global_cumulative_oil_co2,share_global_cumulative_gas_co2,share_global_cumulative_flaring_co2,share_global_cumulative_cement_co2,total_ghg,ghg_per_capita,methane,methane_per_capita,nitrous_oxide,nitrous_oxide_per_capita,primary_energy_consumption,energy_per_capita,energy_per_gdp,population,gdp
0,AFG,Afghanistan,1949,0.015,,,,,,0.002,,0.0,0.015,0.0,,,,,0.015,,,,,,0.002,,,,,0.0,,,,,0.015,,,,,0.0,,,,,,,,,,,,,,7663783.0,
1,AFG,Afghanistan,1950,0.084,475.0,0.07,,,,0.011,,0.001,0.099,0.0,0.004,,,,0.021,,,0.063,,,0.003,,,0.008,,0.001,0.004,,,,0.036,0.063,,,,0.0,0.0,,,,,,,,,,,,,7752000.0,19494800000.0
2,AFG,Afghanistan,1951,0.092,8.696,0.007,,,,0.012,,0.001,0.191,0.0,0.005,,,,0.026,,,0.066,,,0.003,,,0.008,,0.001,0.004,,,,0.061,0.129,,,,0.0,0.0,,,,,,,,,,,,,7840000.0,20063850000.0
3,AFG,Afghanistan,1952,0.092,,,,,,0.012,,0.001,0.282,0.0,0.004,,,,0.032,,,0.06,,,0.004,,,0.008,,0.001,0.003,,,,0.093,0.189,,,,0.0,0.001,,,,,,,,,,,,,7936000.0,20742350000.0
4,AFG,Afghanistan,1953,0.106,16.0,0.015,,,,0.013,,0.002,0.388,0.0,0.005,,,,0.038,,,0.068,,,0.005,,,0.008,,0.001,0.003,,,,0.131,0.257,,,,0.0,0.001,,,,,,,,,,,,,8040000.0,22015460000.0


Now we can finally create a merged dataframe combining the cleaned GDP/income data with the CO2 emissions data. The result contains information for 202 countries, from the first year data was available for each country (for some countries this goes back to 1750) until 2019 (the 2020 data was not available at the time of this writing). In total there are 20,098 entries in "merged_df"; after the column filtering and the derived column added in the cells further down, the dataframe spans 19 columns.

In [11]:
merged_df = pd.merge(country_wealth_data, co2_data, how='inner', left_on='Country Code', right_on='iso_code')
merged_df.head()

Unnamed: 0,Country Name,Country Code,Region,Income,2018 GDP,iso_code,country,year,co2,co2_growth_prct,co2_growth_abs,consumption_co2,trade_co2,trade_co2_share,co2_per_capita,consumption_co2_per_capita,share_global_co2,cumulative_co2,share_global_cumulative_co2,co2_per_gdp,consumption_co2_per_gdp,co2_per_unit_energy,cement_co2,coal_co2,flaring_co2,gas_co2,oil_co2,other_industry_co2,cement_co2_per_capita,coal_co2_per_capita,flaring_co2_per_capita,gas_co2_per_capita,oil_co2_per_capita,other_co2_per_capita,share_global_coal_co2,share_global_oil_co2,share_global_gas_co2,share_global_flaring_co2,share_global_cement_co2,cumulative_coal_co2,cumulative_oil_co2,cumulative_gas_co2,cumulative_flaring_co2,cumulative_cement_co2,share_global_cumulative_coal_co2,share_global_cumulative_oil_co2,share_global_cumulative_gas_co2,share_global_cumulative_flaring_co2,share_global_cumulative_cement_co2,total_ghg,ghg_per_capita,methane,methane_per_capita,nitrous_oxide,nitrous_oxide_per_capita,primary_energy_consumption,energy_per_capita,energy_per_gdp,population,gdp
0,Aruba,ABW,Latin America & Caribbean,High income,,ABW,Aruba,1926,0.033,,0.033,,,,,,0.001,0.033,0.0,,,,,,,,0.033,,,,,,,,,0.007,,,,,0.033,,,,,0.0,,,,,,,,,,,,,,
1,Aruba,ABW,Latin America & Caribbean,High income,,ABW,Aruba,1927,0.036,7.082,0.002,,,,,,0.001,0.069,0.0,,,,,,,,0.036,,,,,,,,,0.007,,,,,0.069,,,,,0.001,,,,,,,,,,,,,,
2,Aruba,ABW,Latin America & Caribbean,High income,,ABW,Aruba,1928,0.083,131.402,0.047,,,,,,0.002,0.152,0.0,,,,,,,,0.083,,,,,,,,,0.015,,,,,0.152,,,,,0.002,,,,,,,,,,,,,,
3,Aruba,ABW,Latin America & Caribbean,High income,,ABW,Aruba,1929,0.103,24.971,0.021,,,,,,0.002,0.255,0.0,,,,,,,,0.103,,,,,,,,,0.017,,,,,0.255,,,,,0.003,,,,,,,,,,,,,,
4,Aruba,ABW,Latin America & Caribbean,High income,,ABW,Aruba,1930,0.135,30.135,0.031,,,,,,0.003,0.39,0.0,,,,,,,,0.135,,,,,,,,,0.023,,,,,0.39,,,,,0.004,,,,,,,,,,,,,,


In [12]:
# how many countries are in the dataset
display(len(merged_df['Country Code'].unique()))

202
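
An inner merge silently drops any country code that appears in only one of the two sources. One way to see what was dropped is an outer merge with indicator=True, sketched below on toy data. The codes and values are illustrative placeholders and make no claim about which codes actually fail to match in the real files.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for country_wealth_data and co2_data
wealth = pd.DataFrame({
    "Country Code": ["ABW", "AGO", "XKX"],
    "Income": ["High income", "Lower middle income", "Upper middle income"],
})
emissions = pd.DataFrame({
    "iso_code": ["ABW", "AGO", "AGO", "OWID_WRL"],
    "year": [2018, 2017, 2018, 2018],
    "co2": [0.9, 24.0, 23.0, 36000.0],
})

# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
audit = pd.merge(wealth, emissions, how="outer",
                 left_on="Country Code", right_on="iso_code", indicator=True)

# Codes present on only one side are exactly the rows an inner merge would discard
left_only = audit.loc[audit["_merge"] == "left_only", "Country Code"].unique().tolist()
right_only = audit.loc[audit["_merge"] == "right_only", "iso_code"].unique().tolist()

print("Only in wealth data:   ", left_only)
print("Only in emissions data:", right_only)
```

Rows tagged "both" are the ones that survive the inner merge, so their count should equal the length of the inner-merged dataframe.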

In [13]:
percent_missing_cols(merged_df)

Unnamed: 0,percent_missing
Country Name,0.0
Country Code,0.0
Region,0.0
Income,0.0
2018 GDP,3.861081
iso_code,0.0
country,0.0
year,0.0
co2,2.438054
co2_growth_prct,6.831526


In [14]:
merged_df = merged_df.reset_index(drop=True).drop(columns='country')

In [15]:
merged_df.head(2)

Unnamed: 0,Country Name,Country Code,Region,Income,2018 GDP,iso_code,year,co2,co2_growth_prct,co2_growth_abs,consumption_co2,trade_co2,trade_co2_share,co2_per_capita,consumption_co2_per_capita,share_global_co2,cumulative_co2,share_global_cumulative_co2,co2_per_gdp,consumption_co2_per_gdp,co2_per_unit_energy,cement_co2,coal_co2,flaring_co2,gas_co2,oil_co2,other_industry_co2,cement_co2_per_capita,coal_co2_per_capita,flaring_co2_per_capita,gas_co2_per_capita,oil_co2_per_capita,other_co2_per_capita,share_global_coal_co2,share_global_oil_co2,share_global_gas_co2,share_global_flaring_co2,share_global_cement_co2,cumulative_coal_co2,cumulative_oil_co2,cumulative_gas_co2,cumulative_flaring_co2,cumulative_cement_co2,share_global_cumulative_coal_co2,share_global_cumulative_oil_co2,share_global_cumulative_gas_co2,share_global_cumulative_flaring_co2,share_global_cumulative_cement_co2,total_ghg,ghg_per_capita,methane,methane_per_capita,nitrous_oxide,nitrous_oxide_per_capita,primary_energy_consumption,energy_per_capita,energy_per_gdp,population,gdp
0,Aruba,ABW,Latin America & Caribbean,High income,,ABW,1926,0.033,,0.033,,,,,,0.001,0.033,0.0,,,,,,,,0.033,,,,,,,,,0.007,,,,,0.033,,,,,0.0,,,,,,,,,,,,,,
1,Aruba,ABW,Latin America & Caribbean,High income,,ABW,1927,0.036,7.082,0.002,,,,,,0.001,0.069,0.0,,,,,,,,0.036,,,,,,,,,0.007,,,,,0.069,,,,,0.001,,,,,,,,,,,,,,


The following code keeps only the columns of interest and then renames some of them to include units.

In [16]:
merged_df = merged_df[['Country Name', 'Country Code', "Region", "Income", '2018 GDP', 'year',
       'co2', 'co2_growth_prct', 'co2_growth_abs',
       'consumption_co2', 'trade_co2', 'trade_co2_share',
       'co2_per_capita', 'consumption_co2_per_capita',
       'share_global_co2', 'cumulative_co2',
       'share_global_cumulative_co2', 'population']]

In [17]:
merged_df = merged_df.rename(columns=
                               {'Country Name':'country',
                                'Country Code': 'country code',
                                'co2':'co2 (M Tonnes)', 
                                'co2_growth_abs':'co2_growth_abs (M Tonnes)',
                                'consumption_co2':'consumption_co2 (M Tonnes)',
                                'trade_co2': 'trade_co2 (M Tonnes)',
                                'co2_per_capita':'co2_per_capita (Tonnes)',
                                'consumption_co2_per_capita':'consumption_co2_per_capita (Tonnes)',
                                'cumulative_co2':'cumulative_co2 (M Tonnes)'
                                })

In [18]:
percent_missing_cols(merged_df)

Unnamed: 0,percent_missing
country,0.0
country code,0.0
Region,0.0
Income,0.0
2018 GDP,3.861081
year,0.0
co2 (M Tonnes),2.438054
co2_growth_prct,6.831526
co2_growth_abs (M Tonnes),6.38372
consumption_co2 (M Tonnes),83.475968


In [19]:
merged_df['co2 per 2018 GDP (M Tonnes/USD)'] = merged_df['co2 (M Tonnes)']/merged_df['2018 GDP']

In [20]:
merged_df.head(10)

Unnamed: 0,country,country code,Region,Income,2018 GDP,year,co2 (M Tonnes),co2_growth_prct,co2_growth_abs (M Tonnes),consumption_co2 (M Tonnes),trade_co2 (M Tonnes),trade_co2_share,co2_per_capita (Tonnes),consumption_co2_per_capita (Tonnes),share_global_co2,cumulative_co2 (M Tonnes),share_global_cumulative_co2,population,co2 per 2018 GDP (M Tonnes/USD)
0,Aruba,ABW,Latin America & Caribbean,High income,,1926,0.033,,0.033,,,,,,0.001,0.033,0.0,,
1,Aruba,ABW,Latin America & Caribbean,High income,,1927,0.036,7.082,0.002,,,,,,0.001,0.069,0.0,,
2,Aruba,ABW,Latin America & Caribbean,High income,,1928,0.083,131.402,0.047,,,,,,0.002,0.152,0.0,,
3,Aruba,ABW,Latin America & Caribbean,High income,,1929,0.103,24.971,0.021,,,,,,0.002,0.255,0.0,,
4,Aruba,ABW,Latin America & Caribbean,High income,,1930,0.135,30.135,0.031,,,,,,0.003,0.39,0.0,,
5,Aruba,ABW,Latin America & Caribbean,High income,,1931,0.112,-16.937,-0.023,,,,,,0.003,0.502,0.0,,
6,Aruba,ABW,Latin America & Caribbean,High income,,1932,0.109,-2.169,-0.002,,,,,,0.003,0.611,0.0,,
7,Aruba,ABW,Latin America & Caribbean,High income,,1933,0.117,7.009,0.008,,,,,,0.004,0.728,0.0,,
8,Aruba,ABW,Latin America & Caribbean,High income,,1934,0.129,10.585,0.012,,,,,,0.004,0.857,0.001,,
9,Aruba,ABW,Latin America & Caribbean,High income,,1935,0.145,12.04,0.016,,,,,,0.004,1.002,0.001,,
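
The derived column divides emissions in millions of tonnes by GDP in US dollars, so its values are very small numbers that are hard to read. The sketch below works through the arithmetic for one made-up row and converts the ratio to kilograms of CO2 per dollar. The input values and variable names are hypothetical; only the unit conversion (1 million tonnes = 1e9 kg) is fixed.

```python
# Hypothetical inputs, chosen to make the arithmetic easy to follow
co2_m_tonnes = 10.0    # annual emissions, in millions of tonnes
gdp_2018_usd = 2.0e10  # 2018 GDP, in current US dollars

# Same operation as the notebook cell: millions of tonnes per dollar
co2_per_gdp = co2_m_tonnes / gdp_2018_usd

# 1 million tonnes = 1e6 tonnes = 1e9 kg
KG_PER_M_TONNE = 1e9
kg_per_usd = co2_per_gdp * KG_PER_M_TONNE

print(f"{co2_per_gdp:.1e} M Tonnes/USD is about {kg_per_usd:.2f} kg of CO2 per USD")
```

Note that the denominator is the 2018 GDP for every row, so for years other than 2018 the ratio compares that year's emissions with 2018 economic output rather than with the GDP of the same year.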


The "Income" column above includes the word "income" after every income category (for example, "High income", "Low income", "Upper middle income"), so the following code removes the trailing "income" to make the column less repetitive.

In [21]:
# Split off the last word of each label ("High income" -> "High") with vectorized string methods
merged_df['Income'] = merged_df['Income'].str.rsplit(' ', n=1).str[0]
merged_df.head()

Unnamed: 0,country,country code,Region,Income,2018 GDP,year,co2 (M Tonnes),co2_growth_prct,co2_growth_abs (M Tonnes),consumption_co2 (M Tonnes),trade_co2 (M Tonnes),trade_co2_share,co2_per_capita (Tonnes),consumption_co2_per_capita (Tonnes),share_global_co2,cumulative_co2 (M Tonnes),share_global_cumulative_co2,population,co2 per 2018 GDP (M Tonnes/USD)
0,Aruba,ABW,Latin America & Caribbean,High,,1926,0.033,,0.033,,,,,,0.001,0.033,0.0,,
1,Aruba,ABW,Latin America & Caribbean,High,,1927,0.036,7.082,0.002,,,,,,0.001,0.069,0.0,,
2,Aruba,ABW,Latin America & Caribbean,High,,1928,0.083,131.402,0.047,,,,,,0.002,0.152,0.0,,
3,Aruba,ABW,Latin America & Caribbean,High,,1929,0.103,24.971,0.021,,,,,,0.002,0.255,0.0,,
4,Aruba,ABW,Latin America & Caribbean,High,,1930,0.135,30.135,0.031,,,,,,0.003,0.39,0.0,,


We now need to check the data types of the columns and confirm that they are what we want.

In [22]:
display(merged_df.dtypes)

country                                 object
country code                            object
Region                                  object
Income                                  object
2018 GDP                               float64
year                                     int64
co2 (M Tonnes)                         float64
co2_growth_prct                        float64
co2_growth_abs (M Tonnes)              float64
consumption_co2 (M Tonnes)             float64
trade_co2 (M Tonnes)                   float64
trade_co2_share                        float64
co2_per_capita (Tonnes)                float64
consumption_co2_per_capita (Tonnes)    float64
share_global_co2                       float64
cumulative_co2 (M Tonnes)              float64
share_global_cumulative_co2            float64
population                             float64
co2 per 2018 GDP (M Tonnes/USD)        float64
dtype: object

While most of these are fine, we can change the data type of two columns. The first is the "Income" column: because "Income" is an ordinal variable (a ranked categorical variable), it should be converted to type "category". This can be done with the following code.

In [23]:
display(merged_df['Income'].unique())

array(['High', 'Low', 'Lower middle', 'Upper middle'], dtype=object)

In [24]:
# Define the category ranking, from lowest to highest
my_categories = pd.CategoricalDtype(categories = ['Low', 'Lower middle', 'Upper middle', 'High'],
                                    ordered=True)

# Convert income column from type object to type category
# Now we have ordering to the Income variable, where 'High' is at the top and 'Low' is at the bottom
merged_df['Income'] = merged_df['Income'].astype(my_categories)
display(merged_df.head())
display(merged_df.dtypes)

Unnamed: 0,country,country code,Region,Income,2018 GDP,year,co2 (M Tonnes),co2_growth_prct,co2_growth_abs (M Tonnes),consumption_co2 (M Tonnes),trade_co2 (M Tonnes),trade_co2_share,co2_per_capita (Tonnes),consumption_co2_per_capita (Tonnes),share_global_co2,cumulative_co2 (M Tonnes),share_global_cumulative_co2,population,co2 per 2018 GDP (M Tonnes/USD)
0,Aruba,ABW,Latin America & Caribbean,High,,1926,0.033,,0.033,,,,,,0.001,0.033,0.0,,
1,Aruba,ABW,Latin America & Caribbean,High,,1927,0.036,7.082,0.002,,,,,,0.001,0.069,0.0,,
2,Aruba,ABW,Latin America & Caribbean,High,,1928,0.083,131.402,0.047,,,,,,0.002,0.152,0.0,,
3,Aruba,ABW,Latin America & Caribbean,High,,1929,0.103,24.971,0.021,,,,,,0.002,0.255,0.0,,
4,Aruba,ABW,Latin America & Caribbean,High,,1930,0.135,30.135,0.031,,,,,,0.003,0.39,0.0,,


country                                  object
country code                             object
Region                                   object
Income                                 category
2018 GDP                                float64
year                                      int64
co2 (M Tonnes)                          float64
co2_growth_prct                         float64
co2_growth_abs (M Tonnes)               float64
consumption_co2 (M Tonnes)              float64
trade_co2 (M Tonnes)                    float64
trade_co2_share                         float64
co2_per_capita (Tonnes)                 float64
consumption_co2_per_capita (Tonnes)     float64
share_global_co2                        float64
cumulative_co2 (M Tonnes)               float64
share_global_cumulative_co2             float64
population                              float64
co2 per 2018 GDP (M Tonnes/USD)         float64
dtype: object

Now that this is a category, we can make queries such as the following, which returns only countries with incomes of "Upper middle" or higher:

In [25]:
upper_or_higher = merged_df[merged_df['Income'] >= 'Upper middle'][['country', 'Income']]
display(upper_or_higher.drop_duplicates().reset_index(drop=True))

Unnamed: 0,country,Income
0,Aruba,High
1,Albania,Upper middle
2,Andorra,High
3,United Arab Emirates,High
4,Argentina,Upper middle
...,...,...
119,St. Vincent and the Grenadines,Upper middle
120,"Venezuela, RB",Upper middle
121,British Virgin Islands,High
122,Samoa,Upper middle
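
Beyond threshold queries, an ordered category also gives meaningful sorting, minimum and maximum values, and integer codes that follow the ranking. The following is a minimal sketch on a toy Series that reuses the same four levels; the variable names are ours.

```python
import pandas as pd

# Same ranking as in the notebook, from lowest to highest
income_type = pd.CategoricalDtype(
    categories=["Low", "Lower middle", "Upper middle", "High"], ordered=True
)
income = pd.Series(["High", "Low", "Upper middle", "Lower middle"], dtype=income_type)

# Comparisons follow the declared order, not alphabetical order
at_least_upper = income[income >= "Upper middle"].tolist()

# Sorting and min/max also respect the ranking
sorted_levels = income.sort_values().tolist()
lowest, highest = income.min(), income.max()

# Integer codes give the position of each value in the ranking (0 = "Low")
codes = income.cat.codes.tolist()

print(at_least_upper)
print(sorted_levels)
print(lowest, highest, codes)
```

With a plain object column the same comparison would be alphabetical, which would rank "Low" above "High".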


The other column whose data type we can change is the "year" column. Although this is not strictly necessary, the conversion lets us practice working with time series data. With the following code, we convert the "year" column from type int to type datetime64.

In [26]:
merged_df['year'] = pd.to_datetime(merged_df['year'], format='%Y')
merged_df.dtypes

country                                        object
country code                                   object
Region                                         object
Income                                       category
2018 GDP                                      float64
year                                   datetime64[ns]
co2 (M Tonnes)                                float64
co2_growth_prct                               float64
co2_growth_abs (M Tonnes)                     float64
consumption_co2 (M Tonnes)                    float64
trade_co2 (M Tonnes)                          float64
trade_co2_share                               float64
co2_per_capita (Tonnes)                       float64
consumption_co2_per_capita (Tonnes)           float64
share_global_co2                              float64
cumulative_co2 (M Tonnes)                     float64
share_global_cumulative_co2                   float64
population                                    float64
co2 per 2018 GDP (M Tonnes/USD)               float64
dtype: object
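
Once the column holds datetimes, each year is stored as January 1 of that year, the .dt accessor recovers the integer year, and rows can be filtered by comparing against a timestamp. The sketch below shows this on a toy Series. It casts the integers to strings before parsing, which is a defensive choice on our part rather than something the notebook does.

```python
import pandas as pd

years = pd.Series([1926, 1990, 2019])

# Parse each year as a timestamp; the month and day default to January 1
as_dt = pd.to_datetime(years.astype(str), format="%Y")

# Recover the integer year through the .dt accessor
year_numbers = as_dt.dt.year.tolist()

# Filter by comparing against a timestamp
recent = as_dt[as_dt >= pd.Timestamp("1990-01-01")]

print(as_dt)
print(year_numbers)
print(recent.dt.year.tolist())
```

Nanosecond-resolution timestamps in pandas only reach back to 1677, which still covers the earliest year mentioned above (1750).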

In [27]:
merged_df.to_csv('cleaned_data.csv')
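
One caveat when reloading the saved file: CSV stores only text, so the ordered category and the datetime type are not preserved and have to be reapplied after reading. The sketch below demonstrates the round trip on a toy frame using an in-memory buffer instead of a file. The frame contents and the use of index=False are assumptions for the example; the cell above writes the index as well.

```python
import io
import pandas as pd

income_type = pd.CategoricalDtype(
    categories=["Low", "Lower middle", "Upper middle", "High"], ordered=True
)

# Toy frame (hypothetical rows) with the same two special dtypes as merged_df
toy = pd.DataFrame({
    "country": ["Aruba", "Angola"],
    "Income": pd.Series(["High", "Lower middle"], dtype=income_type),
    "year": pd.to_datetime(["1926", "1980"], format="%Y"),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime column; the category must be reapplied by hand
restored = pd.read_csv(buffer, parse_dates=["year"])
restored["Income"] = restored["Income"].astype(income_type)

print(restored.dtypes)
```

If the ranking matters in later analysis, keeping the CategoricalDtype definition in one shared place avoids retyping the level order after every reload.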