# Import/Export-Adjusted Greenhouse Gas Emissions
## Final Assignment for EPA1333 Computer Engineering for Scientific Computing

Authors:

Patrick Steinmann #4623991

Stefan Wigman     #4016246







## Abstract

High-pollutant industrial processes often take place in developing countries, with the resulting products often exported to developed countries. We analyze this "offshoring" of greenhouse gas (GHG) emissions by considering country-to-country import/export balances and national GHG emissions. We attempt to assign each country its "true" GHG emissions by determining which emissions that country causes in other countries, and then attributing these "offshored" emissions accordingly. We find that #TODO

## Introduction

In partial fulfillment of the course requirements of EPA1333, we were tasked to conduct an original and non-trivial data analysis related to climate change.

We chose to investigate the phenomenon of "outsourcing" greenhouse gas (GHG) emissions. Many emission-intensive activities take place in countries with poor emissions records - however, these countries often export the products of these activities to countries with much better emissions records. In essence, the emissions are being outsourced. A simple example is the import of electrical energy - highly polluting coal is burned in a power plant in poorly developed country A, and the generated energy is exported to highly developed country B. Country B can claim low GHG emissions - after all, the coal is being burned in A, which, as a poorly developed country, has much more leeway regarding pollution. However, the resulting emissions should really be attributed to country B, since that is where the energy ends up.

Our research question therefore is as follows:

*Do a country's import/export-adjusted emissions differ significantly from its claimed emissions record, and how has this difference developed over time?*

## Methodology

### Approach

We tackled our research question by first finding, importing and cleaning import/export data between countries. Specifically, we were interested in total import/export (that is, goods and services) from and to each country, for our time range of interest. We defined this time range as 1995 to 2015, giving us a long enough 20-year interval while staying largely inside the time range where useful data was available.

We then obtained data on every country's GDP and GHG emissions. These emissions are reported as total GHG emitted over a year in a country, irrespective of use/destination.

$ emissions_{nominal} = emissions_{self} + emissions_{export} $

By comparing export volume and GDP, we were able to determine what percentage of a country's GHG emissions is "self-caused", and what percentage is "offshored" - that is, emissions created by products destined to be exported. In essence, these emissions are the fault of the country importing those products, not the emitter's.

We could then assign each country, based on its imports, a percentage of its import partners' GHG emissions, thus arriving at each country's import/export-adjusted (or "true") emissions.

$ emissions_{true} = emissions_{self} + emissions_{import} $
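
The two equations above can be illustrated with a small worked example. This is only a sketch under stated assumptions: the three countries, all numbers, and the variable names are made up for illustration, and the export share of GDP is used as the fraction of nominal emissions attributed to exports, as described in the approach.

```python
import pandas as pd

# Hypothetical toy data for three countries (all numbers are invented).
nominal = pd.Series({"A": 100.0, "B": 50.0, "C": 200.0})   # emissions_nominal
export_share = pd.Series({"A": 0.5, "B": 0.2, "C": 0.25})  # exports as a fraction of GDP

# Row = exporter, column = importer: fraction of the exporter's exports per partner.
destinations = pd.DataFrame(
    [[0.0, 0.75, 0.25],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)

# emissions_nominal = emissions_self + emissions_export
emissions_export = nominal * export_share
emissions_self = nominal - emissions_export

# Each importer receives its share of every exporter's export emissions.
emissions_import = destinations.mul(emissions_export, axis=0).sum(axis=0)

# emissions_true = emissions_self + emissions_import
emissions_true = emissions_self + emissions_import
print(emissions_true)
```

Because every exported emission is re-assigned to exactly one importer, the global total is unchanged in this toy setup; only the attribution between countries shifts.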

### Assumptions & Simplifications

* countries export a broadly similar product palette to every export partner
* no re-export or re-import

## Results/Work

### Setup

In a first step, we import all packages used throughout this notebook. These packages add functionality and features. Most of the packages are Anaconda-default. wbdata is the exception - this package is essentially an API for accessing World Bank Development Indicators data in an efficient, pandas-integrated fashion.

We also import a custom .py file called ProjectFunctions. It holds all the functions created for and applied in this analysis. Maintaining an external functions package keeps this notebook cleaner and easier to understand.

In [11]:
import requests
import pandas as pd
from pathlib import Path
import numpy as np
import os
from ProjectFunctions import *
import datetime
import wbdata

We override a default pandas option to make chained assignments not throw warnings.

In [4]:
pd.options.mode.chained_assignment = None  # default='warn'

As we intend to use a pandas multi-index dataframe, we create an IndexSlice object to make multi index slicing syntax more natural. This is optional.

In [5]:
idx = pd.IndexSlice

### Data Import & Cleaning

#### Country to Country Trade Data

We first import the raw country-to-country trade data from a CSV file, using suitable encoding.

In [6]:
trade_data=pd.read_csv("raw_data/DataJobID-1257172_1257172_TestQuery.csv" , encoding = "ISO-8859-1")

A thorny aspect of dealing with country-level data is the wildly differing standards for labelling the data. Various databases use full country names in various spellings, two-character ISO codes, three-character ISO codes, three-character IOC (International Olympic Committee) codes, or other identifiers. Thus, data alignment can be an issue. We decide to use ISO3 as our common identifier, and therefore create a dictionary to manage the conversions.

In [8]:
dic_cols=['ReporterISO3', 'ReporterName']
dic_df=trade_data[dic_cols].drop_duplicates()
country_dic=dic_df.set_index('ReporterName')['ReporterISO3'].to_dict()
inv_country_dic = {v: k for k, v in country_dic.items()}

We intend to build a multi-index dataframe to hold trade data between countries over a range of years. Multi-index dataframes are n-dimensional dataframes. In our case, we will use three dimensions - for each year (time being the third dimension), a two-dimensional dataframe holds the country-to-country trade data.

To build the multi-index, we need to define the indices first.

In [9]:
years = list(range(1995,2016))
countries=list(trade_data['ReporterName'].unique())

We can then build the structure of the multi-index dataframe.

In [12]:
data = build_multi_index_df(years,countries)
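
The function build_multi_index_df lives in ProjectFunctions and is not shown in this notebook. The following is a minimal sketch of how such a structure could be built, not the actual implementation: the function name `build_multi_index_df_sketch`, the level names, and the choice of NaN as the initial value are assumptions.

```python
import numpy as np
import pandas as pd

def build_multi_index_df_sketch(years, countries):
    """Hypothetical stand-in: an empty (year, exporter) x partner dataframe."""
    index = pd.MultiIndex.from_product([years, countries], names=["year", "exporter"])
    # Every cell starts as NaN and is filled with trade values later.
    return pd.DataFrame(np.nan, index=index, columns=countries)

demo = build_multi_index_df_sketch([1995, 1996], ["Aruba", "Angola", "Albania"])

# Selecting one year returns the two-dimensional country-to-country "layer".
layer_1995 = demo.loc[1995]
print(layer_1995.shape)
```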

We can then fill the structure with values from the trade data. This row-by-row approach is quite slow. We use IPython magic to measure execution time. Anecdotally, execution time is around 6-8 minutes.

In [14]:
%%timeit -n1 -r1

#Caution, takes roughly 6-8 minutes!
for index, row in trade_data.iterrows():
    for year in years:
        year_key=str(year)+" in 1000 USD "
        data.loc[year][row['ReporterName']][row['PartnerName']]=row[year_key]

1 loop, best of 1: 7min 13s per loop
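
A reshape-based alternative to the row-by-row loop might look as follows. This is a sketch on toy data, not a drop-in replacement: it assumes that each reporter/partner pair appears at most once, that the year columns follow the naming used in the loop above, and that reporters form the rows. The names `toy_trade`, `toy_years`, `long` and `wide` are introduced here for illustration only.

```python
import pandas as pd

# Toy stand-in for trade_data, using the same column naming as the loop above.
toy_trade = pd.DataFrame({
    "ReporterName": ["Aruba", "Aruba", "Angola"],
    "PartnerName": ["Angola", "Albania", "Aruba"],
    "1995 in 1000 USD ": [10.0, 5.0, 7.0],
    "1996 in 1000 USD ": [12.0, None, 8.0],
})
toy_years = [1995, 1996]

# Wide -> long: one row per (reporter, partner, year column).
year_cols = [f"{year} in 1000 USD " for year in toy_years]
long = toy_trade.melt(id_vars=["ReporterName", "PartnerName"], value_vars=year_cols,
                      var_name="column", value_name="value")

# The first four characters of the column label hold the year.
long["year"] = long["column"].str[:4].astype(int)

# Long -> multi-index: rows are (year, reporter), columns are partners.
wide = long.set_index(["year", "ReporterName", "PartnerName"])["value"].unstack("PartnerName")
print(wide)
```

Whether reporters should be the rows or the columns depends on how the target dataframe is laid out, so the orientation should be checked against the structure being filled.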


This data contains many NaN (Not a Number) values, which we fill with 0.

In [15]:
data_filled=data.fillna(0)

To make data handling easier, we write the created multi-index dataframe to a TSV (tab-separated values) file.

In [16]:
data_filled.to_csv('trade_data.tsv', sep='\t')

We then re-import that TSV file. This makes working with the data much easier: instead of recreating the dataframe every time we run the notebook, we can simply load it from the TSV file.

In [17]:
imported_data = pd.read_table('trade_data.tsv', index_col=[0,1])

To ensure the data has not been re-shaped during the write/read, we compare it to the original.

In [18]:
all(imported_data == data_filled)

True
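
One caveat about this check: Python's built-in all() iterates over a dataframe's column labels rather than its cells, so it returns True whenever the labels are truthy. A cell-by-cell check can be sketched as below; the frames `left` and `right` are toy data introduced for illustration.

```python
import pandas as pd

left = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
right = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 99.0]})  # differs in one cell

# Built-in all() only sees the column labels "a" and "b", which are truthy strings.
labels_only = all(left == right)

# Reducing over both axes inspects every cell instead.
cellwise = bool((left == right).all().all())

print(labels_only, cellwise)
```

DataFrame.equals is another option, though it additionally requires matching dtypes, which a write/read round trip does not always preserve.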

#### World Bank: World Development Indicators Data

In an external Excel sheet, we first define which WDI indicators we would like to import through the wbdata API.

In [19]:
indicator_dataframe, indicators, tabnames=GetIndicatorsWB(file='Selected_Indicators.xlsx', sheet='Indicators')

We first import income and region data for every country.

In [20]:
countries1=GetRegionIncomeDataWB()

We then import WDI data for the selected indicators based on 2015 numbers. Our custom function for this attempts to fill in missing values using older data where possible, going back to 2010 at the earliest.

In [21]:
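# Named wb_indicator_data so that the imported wbdata module is not shadowed.
wb_indicator_data = GetDataWB(indicators,2010, 2015)

We add the indicators data to the countries' income and region data.

In [22]:
wb_data_countries = countries1.join(wb_indicator_data, how='inner')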

To account for missing data, we use two functions. The first function identifies which countries are missing data, and then attempts to find other countries in that country's region with comparable income levels to fill the data. We do this because we assume that similarly developed countries in the same region will have comparable WDI statistics.

As this does not cover all countries, we then run a simplified version of this method, matching only on region. This guarantees that there will be data for every country, but the data is less accurate.

In [23]:
region_income_data=FillByRegionAndIncomeWB(wb_data_countries)
region_income_data=FillByRegionWB(region_income_data)
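
The two custom functions are defined in ProjectFunctions and not shown here. The two-stage idea can be sketched as follows, assuming (and this is only an assumption) that gaps are filled with the group mean; the actual functions may pick values differently. The toy frame, its index codes, and the column name "GDP" are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Region": ["Europe", "Europe", "Europe", "Asia", "Asia"],
    "IncomeGroup": ["High", "High", "Low", "Low", "High"],
    "GDP": [100.0, None, None, 40.0, None],
}, index=["AAA", "BBB", "CCC", "DDD", "EEE"])

# Stage 1: fill gaps with the mean of countries sharing region AND income group.
by_both = toy.groupby(["Region", "IncomeGroup"])["GDP"].transform("mean")
step1 = toy["GDP"].fillna(by_both)

# Stage 2: fall back to the regional mean for whatever is still missing.
by_region = step1.groupby(toy["Region"]).transform("mean")
step2 = step1.fillna(by_region)

print(step2)
```

In this toy case the first stage fills one gap and the second stage fills the remaining two, mirroring the trade-off described above: full coverage at the cost of accuracy.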

We verify that we have a complete data set using another custom function.

In [24]:
DataCompleteness(region_income_data)

Country Data                                                    100.0
Region                                                          100.0
IncomeGroup                                                     100.0
Exports of goods and services (% of GDP)                        100.0
GDP (current US$)                                               100.0
Total greenhouse gas emissions (kt of CO2 equivalent)           100.0
Exports of goods and services (% of GDP) source                 100.0
GDP (current US$) source                                        100.0
Total greenhouse gas emissions (kt of CO2 equivalent) source    100.0
dtype: float64


We create a dictionary to match country names to codes, and vice versa. We will be able to use this to match country names spelled differently in various datasets.

In [25]:
dic_cols_wb=countries1['Country Data']
country_dic_wb=dic_cols_wb.to_dict()
inv_country_dic_wb = {v: k for k, v in country_dic_wb.items()}

We compare the dictionaries for World Bank data and trade data to find discrepancies in country names.

In [26]:
for item in inv_country_dic_wb:
    if item in inv_country_dic:
        continue
    else:
        print(item, inv_country_dic_wb[item])
        
print('---------------------------')
for item in inv_country_dic:
    if item in inv_country_dic_wb:
        continue
    else:
        print(item, inv_country_dic[item])

ASM American Samoa
VGB British Virgin Islands
CYM Cayman Islands
TCD Chad
CHI Channel Islands
COD Congo, Dem. Rep.
CUW Curacao
GNQ Equatorial Guinea
GIB Gibraltar
GUM Guam
IMN Isle of Man
PRK Korea, Dem. People’s Rep.
XKX Kosovo
LBR Liberia
LIE Liechtenstein
MHL Marshall Islands
FSM Micronesia, Fed. Sts.
MCO Monaco
MNE Montenegro
NRU Nauru
MNP Northern Mariana Islands
PRI Puerto Rico
ROU Romania
SMR San Marino
SRB Serbia
SXM Sint Maarten (Dutch part)
SOM Somalia
SSD South Sudan
MAF St. Martin (French part)
TLS Timor-Leste
UZB Uzbekistan
VIR Virgin Islands (U.S.)
---------------------------
AIA Anguila
ANT Netherlands Antilles
BLX Belgium-Luxembourg
COK Cook Islands
EUN European Union
GLP Guadeloupe
GUF French Guiana
MNT Montenegro
MSR Montserrat
MTQ Martinique
MYT Mayotte
OAS Other Asia, nes
REU Reunion
ROM Romania
SER Serbia, FR(Serbia/Montenegro)
SUD Sudan
TMP East Timor
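
The same comparison can also be expressed with set differences. The sketch below uses two miniature, hypothetical code-to-name dictionaries (`wb_codes` and `trade_codes`) in place of the real ones, and additionally suggests conversions where the country name is spelled identically under two different codes.

```python
# Hypothetical miniature versions of the two code -> name dictionaries.
wb_codes = {"ROU": "Romania", "SRB": "Serbia", "ABW": "Aruba"}
trade_codes = {"ROM": "Romania", "SER": "Serbia, FR(Serbia/Montenegro)", "ABW": "Aruba"}

# Codes present in one dataset but not the other.
only_in_wb = sorted(set(wb_codes) - set(trade_codes))
only_in_trade = sorted(set(trade_codes) - set(wb_codes))

# Identical names under different codes are likely the same country.
name_to_wb_code = {name: code for code, name in wb_codes.items()}
suggested = {code: name_to_wb_code[trade_codes[code]]
             for code in only_in_trade if trade_codes[code] in name_to_wb_code}

print(only_in_wb, only_in_trade, suggested)
```

Names that differ in spelling, such as the Serbia entry here, are not matched automatically and still need manual review.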


Differences in spelling are readily recognized; the appropriate conversions are written to a conversion dictionary.

In [33]:
conversion_dic={'SER':'SRB',
               'SUD':'SSD',
               'ROM':'ROU'}

### Shaping Data

#### Trade Percentages

Our data for trade between countries is currently in thousand USD. We align the data by instead expressing it in percentages - that is, X percent of country A's total exports go to country B, Y percent go to country C, etc. This will make the emissions calculations easier to execute later on.

We first make a copy of the multi-index dataframe imported from the trade data TSV file. This ensures we don't accidentally manipulate our base data.

In [35]:
percentages_multi = imported_data.copy()

We then calculate the percentage-wise exports for every exporter for every year ("layer") in the multi-index dataframe.

In [36]:
for year in years:
    this = percentages_multi.loc[year].div(percentages_multi.loc[year].sum(axis=1), axis=0)
    this_filled = this.fillna(0)
    percentages_multi.loc[year].update(this_filled)
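
Because every row of the multi-index dataframe already corresponds to one exporter in one year, the same normalization can in principle be done without a loop. The following is a sketch on toy data (the frame `toy`, its level names, and the values are assumptions), not a verified replacement for the cell above.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2014, 2015], ["A", "B"]], names=["year", "exporter"])
toy = pd.DataFrame([[0.0, 30.0, 10.0],
                    [5.0, 0.0, 15.0],
                    [0.0, 0.0, 0.0],
                    [2.0, 0.0, 6.0]], index=index, columns=["A", "B", "C"])

# Divide each row by its own total; rows without any trade give 0/0 = NaN, filled with 0.
shares = toy.div(toy.sum(axis=1), axis=0).fillna(0)

print(shares)
```

Note that the resulting values are fractions between 0 and 1 rather than percentages between 0 and 100; each row sums to 1, or to 0 where no trade was recorded.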

To see how much data is available, we build a custom dataframe showing the year-wise export destinations for each exporter, and show it.

As the first year is used to build the dataframe, and then additional years are attached as columns, the first year must be skipped in the iterator. DataFrame.assign() interprets the given keyword ("temp") literally as the column name; therefore, the column names must be re-written in each iteration of the loop.

In [32]:
data_points = (percentages_multi.loc[1995] != 0).sum(axis=1).to_frame()
data_points.columns = ['1995']

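# Start at the second year; i counts how many year columns exist after each step.
for i, year in enumerate(years[1:], start=2):
    this = (percentages_multi.loc[year] != 0).sum(axis=1)
    data_points = data_points.assign(temp=this)
    # Flat list of string labels, one per year column so far.
    data_points.columns = [str(y) for y in years[:i]]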
    
data_points

| exporter | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | ... | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aruba | 38 | 48 | 52 | 49 | 59 | 63 | 68 | 66 | 70 | 72 | ... | 76 | 76 | 72 | 80 | 78 | 78 | 80 | 77 | 82 | 76 |
| Afghanistan | 57 | 63 | 64 | 71 | 68 | 74 | 72 | 85 | 89 | 93 | ... | 94 | 96 | 97 | 105 | 105 | 97 | 98 | 96 | 91 | 94 |
| Angola | 65 | 76 | 75 | 81 | 88 | 100 | 104 | 106 | 108 | 114 | ... | 111 | 117 | 114 | 118 | 120 | 125 | 119 | 117 | 116 | 111 |
| Anguila | 33 | 35 | 47 | 45 | 47 | 55 | 48 | 50 | 57 | 56 | ... | 60 | 56 | 61 | 61 | 65 | 60 | 59 | 56 | 53 | 54 |
| Albania | 60 | 65 | 67 | 71 | 74 | 77 | 81 | 87 | 86 | 86 | ... | 96 | 98 | 102 | 103 | 97 | 106 | 98 | 98 | 96 | 97 |
| Andorra | 41 | 43 | 45 | 49 | 47 | 54 | 62 | 66 | 70 | 70 | ... | 81 | 85 | 76 | 84 | 81 | 82 | 78 | 74 | 81 | 72 |
| Netherlands Antilles | 70 | 73 | 77 | 73 | 84 | 92 | 96 | 93 | 94 | 94 | ... | 105 | 99 | 104 | 105 | 97 | 0 | 0 | 0 | 0 | 0 |
| United Arab Emirates | 94 | 102 | 110 | 111 | 119 | 134 | 135 | 135 | 143 | 144 | ... | 147 | 152 | 151 | 148 | 157 | 155 | 151 | 150 | 144 | 140 |
| Argentina | 84 | 95 | 99 | 103 | 105 | 116 | 117 | 113 | 116 | 119 | ... | 122 | 117 | 123 | 127 | 124 | 125 | 123 | 115 | 116 | 108 |
| Armenia | 49 | 53 | 61 | 63 | 66 | 70 | 74 | 74 | 79 | 83 | ... | 85 | 86 | 93 | 89 | 96 | 97 | 95 | 94 | 93 | 90 |


### Connecting Data

Nota bene: Due to time constraints, we were not able to conduct our analysis for multiple years. Instead, we decided to focus on a single year which, in our opinion, showed high data quality and availability. With more time, the analysis could have been conducted for all years using the same procedure.

We isolate a single yearly slice of the multi-index dataframe for analysis.

In [42]:
percentages = percentages_multi.loc[2014]

# percentages=pd.DataFrame()
# percentage_to_country=imported_data.loc[2014]/imported_data.loc[2014].sum(axis=0)
# percentages=percentages.append(percentage_to_country)
# percentages=percentages.fillna(0)
#TODO decide.

We merge the percentages dataframe and the WB data dataframe into one using a custom function, which accounts for different country name spellings by comparing them to the dictionaries created earlier. This will make dataframe operations easier.

In [43]:
filled_dataframe=MergeDataFrames(region_income_data, percentages, country_dic_wb, country_dic, conversion_dic)        

KeyError: 'Anguila'

### Visualizations

## Analysis

In [None]:
discuss why data cleaning/interpolation/completion only for WB data, not trade data

## Conclusion

## Reflection

importance of checking data while working (custom function)

## References & Sources