# Data Cleaning & Abatement Calculations (Target Development)
This notebook outlines the process of developing the target variable: CO2 units abated annually over time. The calculation uses datasets from the [World Bank](https://data.worldbank.org/). The target is developed, rather than relying only on the metrics given, to focus the analysis on the long-term goal: improving environmental quality by reducing overall emissions.

## Notebook Contents
- [Loading in and merging data sources](#loading_and_merging)
    - [Countries in the dataset](#full_country_list)
- [Developing abatement equation](#equation)
- [Historical Abatement Calculations](#abatement_calculations)
- [Completed CSV File](#csv)

In [74]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### _Loading in and merging data sources_
<a id='loading_and_merging'></a>

Although most of our data is from the same source, each attribute needed to be pulled and saved as a separate table, then read in individually. The datasets initially did not read in cleanly because of how the Yearbook is formatted, so the `kwargs` and the `.replace` cell were added after trial-and-error read-ins.

Otherwise, no real cleaning is needed, except to [fill nulls for the patent dataset](#fill_nulls).

In [75]:
#Loading in excel files
total_emissions = pd.read_excel(
    './data_worldbank/co2_emissions.xls', index_col='Country Name')

#NOTE: this path is the same file that electric_emissions reads below;
#confirm it points at the intended electricity consumption table
electric_consumption_pc = pd.read_excel(
    './data_worldbank/pct_co2_emissions_electric.xls', index_col='Country Name')

electric_emissions = pd.read_excel(
    './data_worldbank/pct_co2_emissions_electric.xls', index_col='Country Name')

#Each variable reads the file whose name matches it (renewables vs. hydro)
electric_production_reneg = pd.read_excel(
    './data_worldbank/pct_share_electric_reneg.xls', index_col='Country Name')

electric_production_hydro = pd.read_excel(
    './data_worldbank/pct_share_electricity_hydro.xls', index_col='Country Name')

### There's a shortcut to cleaning multiple, complementary datasets at once.
### Save the names of the datasets in a list of strings when you first import everything; later we'll use the `eval` function to clean them in loops.

In [76]:
#aggregate objects that need to be removed
non_countries = ['East Asia & Pacific (excluding high income)','Early-demographic dividend',
'East Asia & Pacific','Europe & Central Asia (excluding high income)','Europe & Central Asia',
'Euro area','European Union','Fragile and conflict affected situations','High income',
'IBRD only','IDA & IBRD total','IDA total','IDA blend','IDA only','Not classified',
'Latin America & Caribbean (excluding high income)','Latin America & Caribbean',
'Least developed countries: UN classification','Low income','Lower middle income','Low & middle income',
'Late-demographic dividend','Middle East & North Africa (excluding high income)','North America',
'OECD members','Other small states','Pre-demographic dividend','Post-demographic dividend',
'Sub-Saharan Africa (excluding high income)','Small states','East Asia & Pacific (IDA & IBRD countries)',
'Europe & Central Asia (IDA & IBRD countries)','Latin America & the Caribbean (IDA & IBRD countries)',
'Middle East & North Africa (IDA & IBRD countries)','South Asia (IDA & IBRD)',
'Sub-Saharan Africa (IDA & IBRD countries)','Upper middle income']

In [77]:
datasets = ['total_emissions','electric_consumption_pc','electric_emissions','electric_production_hydro','electric_production_reneg']

In [78]:
#loop through the datasets and drop the aggregate (non-country) rows
for i in datasets:
    eval(i).drop(labels=non_countries, axis=0, inplace=True)

In [79]:
#building list of countries that are completely null
#saving them as a list will also allow you to track changes and follow up

completely_null = []
for i in total_emissions.T:
    if total_emissions.T[i].isnull().sum()==55:
        completely_null.append(str(i))

In [80]:
completely_null

['American Samoa',
 'Channel Islands',
 'Guam',
 'Isle of Man',
 'St. Martin (French part)',
 'Monaco',
 'Northern Mariana Islands',
 'Puerto Rico',
 'San Marino',
 'Virgin Islands (U.S.)',
 'Kosovo']
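
The loop above hard-codes a null count of 55, which only works while the table has exactly that many year columns. As a minimal sketch of a more general check, the same list could be built with a vectorized mask. The toy frame and the rule "year columns are the ones whose labels are all digits" are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for total_emissions (index = country name)
toy = pd.DataFrame(
    {
        "Country Code": ["AAA", "BBB", "CCC"],
        "1990": [1.0, np.nan, np.nan],
        "1991": [2.0, np.nan, 3.0],
    },
    index=pd.Index(["A-land", "B-land", "C-land"], name="Country Name"),
)

# Assumption: year columns are the ones whose labels are all digits
year_cols = [c for c in toy.columns if str(c).isdigit()]

# Keep the countries where every year column is null
all_null_mask = toy[year_cols].isnull().all(axis=1)
completely_null_toy = toy.index[all_null_mask].tolist()
print(completely_null_toy)
```

Only `B-land` is null in every year column, so it is the only country that ends up in the list; `C-land` has one real value and is kept.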

In [82]:
for i in datasets:
    eval(i).drop(labels=completely_null, axis=0, inplace=True)

In [88]:
for i in datasets:
    eval(i).fillna(value=0, inplace=True)

In [90]:
#Saving the cleaned datasets as CSVs
for i in datasets:
    eval(i).to_csv(f'./cleaned_data/{i}.csv')
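
The `eval` loops above work, but they depend on the variable names matching the strings in `datasets`. One possible alternative, sketched here under assumptions, is to keep the frames in a dictionary keyed by name and clean them in a single pass. The toy frames and the `drop_rows` list are hypothetical stand-ins for the World Bank tables and `non_countries`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the World Bank tables
rows = ["A-land", "B-land", "World region"]
frames = {
    "total_emissions": pd.DataFrame({"2014": [10.0, np.nan, 99.0]}, index=rows),
    "electric_emissions": pd.DataFrame({"2014": [40.0, np.nan, 50.0]}, index=rows),
}

# Plays the role of non_countries in the notebook
drop_rows = ["World region"]

# Drop the aggregate rows and fill nulls, returning new frames
# instead of modifying the originals in place
cleaned = {
    name: df.drop(labels=drop_rows, axis=0).fillna(0)
    for name, df in frames.items()
}

for name, df in cleaned.items():
    print(name, df.shape)
```

Because the dictionary key is also the dataset name, the same loop could write each frame with `df.to_csv(f'./cleaned_data/{name}.csv')`.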

<a id='full_country_list'></a>

### _Developing the Abatement Equation_
<a id='equation'></a>

To develop this equation, the datasets were merged one at a time to avoid errors and column name confusion. The datasets included in this process hold emissions data, percentage of emissions from electricity and heat production, total electricity production, and percentage of electricity production from renewable resources (solar, wind, and geothermal).

You can also [skip to the equation broken down step-by-step](#equation_breakdown).


In [16]:
#In order to build out the equation and check the output I'm creating subsets for only 2014 data
#The process of merging follows the steps of the calculation so that it can be calculated across
#the dataframe

total_emissions_14 = total_emissions[['Country', '2014']]
electric_emissions_14 = electric_emissions[['Country Name', '2014']]
electric_production_14 = electric_production[['Country', '2014']]
reneg_production_14 = reneg_production[['Country','2014']]

In [17]:
#Creating the dataset, designated here as abatement_dataset
#Beginning with total emissions and adding in electric emissions 
#These will create our first step in the equation, CO2 emissions from electricity & heat

#left_on/right_on already name the join keys, so left_index is not passed
abatement_dataset = pd.merge(total_emissions_14, electric_emissions_14, how='left',
                             left_on='Country', right_on='Country Name',
                             suffixes=('_co2_emissions', '_electric_emissions_pct'))

#Dropping double country columns and renaming to keep track
abatement_dataset.drop(labels='Country Name', axis=1, inplace=True)
abatement_dataset.columns = ['country','co2_emissions','electric_emissions_pct']

In [18]:
#Adding on electric production to the dataset
#This will be the first part of our CO2 units per unit of energy calculation

abatement_dataset = pd.merge(abatement_dataset, electric_production_14, how='left',
                             left_on='country', right_on='Country')

#Dropping double country column and renaming others 
abatement_dataset.columns=['country', 'co2_emissions', 'electric_emissions_pct',
                           'country1', 'electric_production']
abatement_dataset.drop(labels='country1', axis=1, inplace=True)

In [19]:

#Adding the percent share of renewable energy for the second component of the CO2 units per energy unit

abatement_dataset = pd.merge(abatement_dataset, reneg_production_14, how='left',
                             left_on='country', right_on='Country')

#Dropping double country column and renaming others
abatement_dataset.columns=['country', 'co2_emissions', 'electric_emissions_pct',
                           'electric_production', 'country1','reneg_production']
abatement_dataset.drop(labels='country1', axis=1, inplace=True)
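
The three merge-rename-drop steps above could also be collapsed into one pass by indexing each one-year subset by country and concatenating the columns. This is only a sketch: the helper `as_series` is hypothetical, the two-row frames are toy data using values from this notebook's preview, and `pd.concat` performs an outer join on the country index rather than the left join used above.

```python
import pandas as pd

# Toy one-year subsets; values come from the preview rows in this notebook
total_emissions_14 = pd.DataFrame({"Country": ["Algeria", "Brazil"], "2014": [133, 475]})
electric_emissions_14 = pd.DataFrame({"Country Name": ["Algeria", "Brazil"], "2014": [38.83, 26.31]})
electric_production_14 = pd.DataFrame({"Country": ["Algeria", "Brazil"], "2014": [71, 591]})
reneg_production_14 = pd.DataFrame({"Country": ["Algeria", "Brazil"], "2014": [0.28, 73.08]})


def as_series(df, key, name):
    """Hypothetical helper: index a one-year subset by country and name its values."""
    return df.set_index(key)["2014"].rename(name)


# Align the four columns on the country index, then restore a 'country' column
combined = pd.concat(
    [
        as_series(total_emissions_14, "Country", "co2_emissions"),
        as_series(electric_emissions_14, "Country Name", "electric_emissions_pct"),
        as_series(electric_production_14, "Country", "electric_production"),
        as_series(reneg_production_14, "Country", "reneg_production"),
    ],
    axis=1,
).rename_axis("country").reset_index()

print(combined)
```

Naming each series before concatenating removes the need to overwrite `columns` by position after every merge, which is where a column-order mistake would otherwise go unnoticed.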

In [20]:
#This column was throwing an error, converting to int so it will calculate without issues
abatement_dataset.co2_emissions = abatement_dataset.co2_emissions.astype(int)

In [21]:
#Preview of the dataset
abatement_dataset.head()

| | country | co2_emissions | electric_emissions_pct | electric_production | reneg_production |
|---|---|---|---|---|---|
| 0 | Algeria | 133 | 38.83 | 71 | 0.28 |
| 1 | Argentina | 198 | 38.04 | 142 | 31.69 |
| 2 | Australia | 378 | 58.36 | 248 | 14.92 |
| 3 | Belgium | 89 | 25.85 | 73 | 18.44 |
| 4 | Brazil | 475 | 26.31 | 591 | 73.08 |


## _Equation Breakdown_
<a id='equation_breakdown'></a>

This equation was developed with a few goals in mind. To measure abatement, it was first important to determine how much of a country's emissions were caused by the generation of electricity and heat ($φυ$), because these are the two consumption routes that renewable energy sources are a part of. You can then divide this value by the amount of conventional production ($λ(1-γ)$) to get the units of CO2 emitted per unit of energy produced with conventional methods. Finally, we can multiply that number by the amount of energy produced with renewable methods ($γλ$) to estimate how many units of CO2 were _saved_ by producing that energy renewably.


# $$\frac{φυ}{λ(1-γ)}γλ$$
**co2_abated (Final Output)**: _metric tons of CO2 abated by the energy produced using renewable sources_  
Where:
- $φ$ **co2_emissions**: _Total CO2 Units emitted annually, measured in metric tons of CO2_  
- $υ$ **electric_emissions_pct**: _Percentage of CO2 emissions from producing heat & electricity_  
- $λ$ **electric_production**: _Amount of electricity produced annually, measured in metric tons of energy_  
- $γ$ **reneg_production**: _Percentage of electricity production from renewable sources_  
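
One observation that follows directly from the formula: $λ$ appears in both the numerator and the denominator, so it cancels. Assuming $λ \neq 0$ and $γ \neq 1$, and treating $υ$ and $γ$ as fractions (the code divides the percentages by 100), the expression reduces to:

$$\frac{φυ}{λ(1-γ)}γλ = \frac{φυγ}{1-γ}$$

In this form the result depends only on electricity-related emissions ($φυ$) and the renewable share ($γ$). It is undefined when all production is renewable ($γ = 1$).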



In [22]:
#1. Determine actual emissions from electricity and heat production alone,
#using the % of emissions that comes from generating heat & electricity
abatement_dataset['co2_electricity'] = abatement_dataset.co2_emissions * (abatement_dataset.electric_emissions_pct/100)

#2. Split electricity production into its renewable and conventional parts
#(despite the "_share" names, both columns hold amounts of energy, not percentages)
abatement_dataset['reneg_production_share'] = (abatement_dataset.reneg_production/100)*abatement_dataset.electric_production
abatement_dataset['conventional_production_share'] = abatement_dataset.electric_production*(1-(abatement_dataset.reneg_production/100))

#3. CO2 emitted per unit of conventionally produced energy
abatement_dataset['co2_units'] = abatement_dataset.co2_electricity/abatement_dataset.conventional_production_share

#4. CO2 that was not emitted because part of production was renewable
#CO2 units * renewable energy produced = CO2 saved
abatement_dataset['co2_abated'] = abatement_dataset.co2_units*abatement_dataset.reneg_production_share

In [23]:
abatement_dataset[['country','co2_abated']].head()

| | country | co2_abated |
|---|---|---|
| 0 | Algeria | 0.145009 |
| 1 | Argentina | 34.94167 |
| 2 | Australia | 38.685519 |
| 3 | Belgium | 5.201568 |
| 4 | Brazil | 339.264127 |
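
As a sanity check on the output above, the calculation can be worked by hand for a single row. The sketch below uses Algeria's 2014 inputs from the dataset preview. The local variable names are chosen for illustration, and the second form is the algebraically equivalent expression with electricity production cancelled out.

```python
# Algeria's 2014 inputs, copied from the dataset preview in this notebook
co2_emissions = 133             # total CO2 emitted
electric_emissions_pct = 38.83  # % of emissions from electricity & heat
electric_production = 71        # electricity produced
reneg_production = 0.28         # % of production from renewables

# Step 1: emissions attributable to electricity and heat
co2_electricity = co2_emissions * electric_emissions_pct / 100

# Step 2: split production into renewable and conventional amounts
renewable_output = electric_production * reneg_production / 100
conventional_output = electric_production - renewable_output

# Step 3: CO2 per unit of conventional energy, then CO2 abated
co2_per_conventional_unit = co2_electricity / conventional_output
co2_abated = co2_per_conventional_unit * renewable_output

# Equivalent form: electricity production cancels out of the equation
renewable_fraction = reneg_production / 100
co2_abated_simplified = co2_electricity * renewable_fraction / (1 - renewable_fraction)

print(round(co2_abated, 6), round(co2_abated_simplified, 6))
```

Both forms reproduce the 0.145009 reported for Algeria, which suggests the column arithmetic and the equation agree for this row.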


### _Historical Abatement Calculations_

<a id='abatement_calculations'></a>

In [24]:
datasets = ['total_emissions',
'electric_consumption',
'electric_emissions',
'electric_production',
'reneg_production']

In [25]:
years = ['1990','1991','1992','1993','1994','1995','1996','1997', '1998','1999','2000','2001',
         '2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014']

In [26]:
def abatement_calculator(datasets, no_years):
    """Apply the abatement equation to every year in no_years (one output column per year)."""
    abatement = pd.DataFrame()
    for year in no_years:
        for dataset in datasets:
            #eval looks up the DataFrame whose variable name matches the string
            if dataset == 'total_emissions':
                country = eval(dataset)['Country']
                co2_emissions = eval(dataset)[year].astype(int).values
            elif dataset == 'electric_emissions':
                electric_emissions_pct = eval(dataset)[year].values/100
            elif dataset == 'electric_production':
                electric_prod = eval(dataset)[year].values
            elif dataset == 'reneg_production':
                reneg_prod_pct = eval(dataset)[year].values/100

        abatement['country'] = country
        #(CO2 from electricity / conventional production) * renewable production
        abatement[year] = ((co2_emissions*electric_emissions_pct)/(electric_prod-(reneg_prod_pct*electric_prod)))*(reneg_prod_pct*electric_prod)
    return abatement
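
The function above loops over years and looks frames up by name. A vectorized variant is sketched below under stated assumptions: each input is a DataFrame indexed by country with one column per year, and percentages are on a 0-100 scale. The function name `abatement_vectorized` and the toy inputs are hypothetical. Unlike the original, this sketch does not cast emissions to `int`, and it leaves `NaN` where conventional production is zero, a case that can occur because nulls were filled with 0 during cleaning.

```python
import numpy as np
import pandas as pd


def abatement_vectorized(co2, elec_pct, production, reneg_pct, years):
    """Hypothetical vectorized variant of the abatement equation.

    All inputs are DataFrames indexed by country with one column per year.
    """
    co2_electricity = co2[years] * elec_pct[years] / 100
    renewable = production[years] * reneg_pct[years] / 100
    conventional = production[years] - renewable
    # Mask zero conventional production so the division yields NaN, not inf
    co2_per_unit = co2_electricity / conventional.where(conventional != 0)
    return co2_per_unit * renewable


# Toy inputs: Algeria's 2014 values plus a row that was zero-filled
idx = pd.Index(["Algeria", "No-data-land"], name="country")
yrs = ["2014"]
co2 = pd.DataFrame({"2014": [133, 0]}, index=idx)
elec_pct = pd.DataFrame({"2014": [38.83, 0.0]}, index=idx)
production = pd.DataFrame({"2014": [71, 0]}, index=idx)
reneg_pct = pd.DataFrame({"2014": [0.28, 0.0]}, index=idx)

result = abatement_vectorized(co2, elec_pct, production, reneg_pct, yrs)
print(result)
```

Whether a zero-filled row should come out as `NaN` or as 0 is a modelling decision; the sketch picks `NaN` so that missing inputs stay visible in the final table.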

In [27]:
abate = abatement_calculator(datasets, years)

### _Completed CSV File_
<a id='csv'></a>

In [28]:
abate.to_csv('./Data/abatement_calculations.csv')