# Data Cleaning & Abatement Calculations (Target Development)
This notebook outlines the process of developing the target variable: CO2 units abated annually over time. The calculation uses datasets from the [World Bank](https://data.worldbank.org/). The intention of developing this target, instead of only using the metrics given, is to focus the analysis on the long-term goal (improving environmental quality by reducing overall emissions).  

## Notebook Contents
- [Loading in and merging data sources](#loading_and_merging)
    - [Countries in the dataset](#full_country_list)
- [Developing abatement equation](#equation)
- [Historical Abatement Calculations](#abatement_calculations)
- [Completed CSV File](#csv)

In [127]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### _Loading in and merging data sources_
<a id='loading_and_merging'></a>

Although most of our data is from the same source, each attribute needed to be pulled and saved as a separate table, then read in individually. There were initially a few issues with how the datasets read in because of how the Yearbook is formatted, so the `kwargs` and the `.replace` cell were included as a result of trial-and-error read-ins.  

Otherwise, there is no true cleaning needed, except to [fill nulls for the patent dataset](#fill_nulls).

In [128]:
# Load the Excel files, indexed by country name
total_emissions = pd.read_excel(
    './data_worldbank/co2_emissions.xls', index_col='Country Name')

# NOTE: this path is the same file that electric_emissions reads below;
# confirm the intended source file for electricity consumption
electric_consumption_pc = pd.read_excel(
    './data_worldbank/pct_co2_emissions_electric.xls', index_col='Country Name')

electric_emissions = pd.read_excel(
    './data_worldbank/pct_co2_emissions_electric.xls', index_col='Country Name')

# Each variable now reads the file that matches its name (the two paths were swapped)
electric_production_reneg = pd.read_excel(
    './data_worldbank/pct_share_electric_reneg.xls', index_col='Country Name')

electric_production_hydro = pd.read_excel(
    './data_worldbank/pct_share_electricity_hydro.xls', index_col='Country Name')

### There's a shortcut to cleaning multiple, complementary datasets at once.
### Save the names of the datasets in a list of strings when you first import everything; later we'll use the `eval` function to clean them in loops.
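
As an alternative to `eval`, the same kind of loop can run over a dictionary keyed by dataset name. The sketch below is a minimal illustration on two made-up toy tables (`toy_frames`, `aggregates`, and all row labels and values are hypothetical, not the World Bank data). Passing `errors='ignore'` lets `drop` skip labels that a given table does not contain.

```python
import pandas as pd

# Hypothetical toy tables standing in for the World Bank sheets
toy_frames = {
    'total_emissions': pd.DataFrame(
        {'1990': [1.0, 2.0, 3.0]}, index=['Chile', 'High income', 'Peru']),
    'electric_emissions': pd.DataFrame(
        {'1990': [10.0, 20.0]}, index=['Chile', 'Peru']),
}
aggregates = ['High income', 'OECD members']

# Apply the same cleaning step to every table, looked up by name
for name, frame in toy_frames.items():
    toy_frames[name] = frame.drop(labels=aggregates, axis=0, errors='ignore')

print({name: list(frame.index) for name, frame in toy_frames.items()})
```

Keeping the tables in a dictionary avoids evaluating strings as code, at the cost of writing `toy_frames['total_emissions']` instead of a bare variable name.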

In [129]:
#aggregate objects that need to be removed
non_countries = ['East Asia & Pacific (excluding high income)','Early-demographic dividend',
'East Asia & Pacific','Europe & Central Asia (excluding high income)','Europe & Central Asia',
'Euro area','European Union','Fragile and conflict affected situations','High income',
'IBRD only','IDA & IBRD total','IDA total','IDA blend','IDA only','Not classified',
'Latin America & Caribbean (excluding high income)','Latin America & Caribbean',
'Least developed countries: UN classification','Low income','Lower middle income','Low & middle income',
'Late-demographic dividend','Middle East & North Africa (excluding high income)','North America',
'OECD members','Other small states','Pre-demographic dividend','Post-demographic dividend',
'Sub-Saharan Africa (excluding high income)','Small states','East Asia & Pacific (IDA & IBRD countries)',
'Europe & Central Asia (IDA & IBRD countries)','Latin America & the Caribbean (IDA & IBRD countries)',
'Middle East & North Africa (IDA & IBRD countries)','South Asia (IDA & IBRD)',
'Sub-Saharan Africa (IDA & IBRD countries)','Upper middle income','Central Europe and the Baltics',
'Heavily indebted poor countries (HIPC)','Middle East & North Africa','Middle income' ]

In [130]:
datasets = ['total_emissions','electric_consumption_pc','electric_emissions','electric_production_hydro','electric_production_reneg']

In [131]:
# Loop through the datasets and drop the non-country rows from each
for i in datasets:
    eval(i).drop(labels=non_countries, axis=0, inplace=True)

In [132]:
# Build a list of countries with at least one missing year in total_emissions
# Saving them as a list will also allow you to track changes and follow up

has_nulls = []
for i in total_emissions.T:
    if total_emissions.T[i].isnull().sum()>0:
        has_nulls.append(str(i))

In [133]:
for i in datasets:
    eval(i).dropna(axis=0, inplace=True)
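
The difference between a country with some missing years and a country with no data at all affects how the dropped list should be read. A minimal sketch on a hypothetical toy table (`toy` and its values are made up) shows how `isnull().any(axis=1)` and `isnull().all(axis=1)` separate the two cases, and that `dropna(axis=0)` removes both.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one complete row, one partly missing, one fully missing
toy = pd.DataFrame(
    {'1990': [1.0, np.nan, np.nan], '1991': [2.0, 3.0, np.nan]},
    index=['Chile', 'Germany', 'Aruba'])

# Rows with at least one missing value
some_missing = toy.index[toy.isnull().any(axis=1)].tolist()
# Rows where every value is missing
all_missing = toy.index[toy.isnull().all(axis=1)].tolist()
# dropna(axis=0) keeps only rows with no missing values
complete = toy.dropna(axis=0)

print(some_missing)
print(all_missing)
print(list(complete.index))
```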

In [134]:
# Save the cleaned datasets as CSVs
for i in datasets:
    eval(i).to_csv(f'./cleaned_data/{i}.csv')

In [135]:
print(f'There are {total_emissions.shape[0]} countries in our dataset without nulls.\
\nWe eliminated {len(has_nulls)} countries that had at least one missing value in total_emissions:\
\n{has_nulls}')

There are 156 countries in our dataset without nulls.
We eliminated 67 countries that had at least one missing value in total_emissions:
['Aruba', 'Andorra', 'Armenia', 'American Samoa', 'Azerbaijan', 'Burundi', 'Bangladesh', 'Bosnia and Herzegovina', 'Belarus', 'Bhutan', 'Botswana', 'Channel Islands', 'Curacao', 'Czech Republic', 'Germany', 'Eritrea', 'Estonia', 'Micronesia, Fed. Sts.', 'Georgia', 'Guam', 'Croatia', 'Isle of Man', 'Kazakhstan', 'Kyrgyz Republic', 'Kiribati', 'Liechtenstein', 'Lesotho', 'Lithuania', 'Latvia', 'St. Martin (French part)', 'Monaco', 'Moldova', 'Maldives', 'Marshall Islands', 'Macedonia, FYR', 'Montenegro', 'Northern Mariana Islands', 'Malawi', 'Malaysia', 'Namibia', 'Nauru', 'Oman', 'Puerto Rico', 'Korea, Dem. People’s Rep.', 'West Bank and Gaza', 'Russian Federation', 'San Marino', 'Serbia', 'South Sudan', 'Slovak Republic', 'Slovenia', 'Swaziland', 'Sint Maarten (Dutch part)', 'Seychelles', 'Turks and Caicos Islands', 'Tajikistan', 'Turkmenistan', 'Tim

<a id='full_country_list'></a>

### _Developing the Abatement Equation_
<a id='equation'></a>

To develop this equation, the datasets needed to be merged individually to avoid errors and column-name confusion. The datasets included in this process house emissions data, percentage of emissions from electricity and heat production, total electricity production, and percentage of electricity production from renewable resources (solar, wind, and geothermal).

You can also [skip to the equation broken down step-by-step](#equation_breakdown).


In [136]:
electric_production_reneg_total = electric_production_hydro+electric_production_reneg
datasets.append('electric_production_reneg_total')

In [137]:
for i in datasets:
    print(i)


total_emissions
electric_consumption_pc
electric_emissions
electric_production_hydro
electric_production_reneg
electric_production_reneg_total


In [138]:
total_emissions = pd.read_csv('./cleaned_data/total_emissions.csv', index_col='Country Name')
electric_consumption_pc = pd.read_csv('./cleaned_data/electric_consumption_pc.csv', index_col='Country Name')
electric_production_reneg = pd.read_csv('./cleaned_data/electric_production_reneg.csv', index_col='Country Name')
electric_production_hydro = pd.read_csv('./cleaned_data/electric_production_hydro.csv', index_col='Country Name')
electric_emissions = pd.read_csv('./cleaned_data/electric_emissions.csv', index_col='Country Name')

In [139]:
years = list(total_emissions.columns)

## _Equation Breakdown_
<a id='equation_breakdown'></a>

This equation was developed with a few goals in mind. To truly measure abatement, it was first important to determine how much of a country's emissions were caused by the generation of electricity and heat ($φυ$), because these are the two consumption routes that renewable energy sources are a part of. You can then divide this value by the amount of energy produced conventionally ($λ(1-γ)$) to get the units of CO2 emitted per unit of energy produced with conventional methods. Finally, we can multiply that number by the amount of energy produced with renewable methods ($γλ$) to estimate how many units of CO2 were _saved_ by renewable production.  


# $$\frac{φυ}{λ(1-γ)}γλ$$
**co2_abated (Final Output)**: _metric tons of CO2 abated by the energy produced using renewable sources_  
Where:
- $φ$ **co2_emissions**: _Total CO2 units emitted annually, measured in metric tons of CO2_  
- $υ$ **electric_emissions_pct**: _Percentage of CO2 emissions from producing heat & electricity_  
- $λ$ **electric_production**: _Amount of electricity produced annually, measured in metric tons of energy_  
- $γ$ **reneg_production**: _Percentage of electricity production from renewable sources_  
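
As a worked check of the formula, the sketch below plugs in made-up values (all numbers and the helper name `co2_abated` are hypothetical). It also illustrates that $λ$ cancels algebraically, so the result equals $φυγ/(1-γ)$ for any nonzero $λ$.

```python
def co2_abated(phi, upsilon, lam, gamma):
    """Evaluate (phi * upsilon) / (lam * (1 - gamma)) * (gamma * lam)."""
    # CO2 emitted per unit of conventionally produced energy
    per_conventional_unit = (phi * upsilon) / (lam * (1 - gamma))
    # Scale by the amount of energy produced from renewable sources
    return per_conventional_unit * (gamma * lam)

# Hypothetical inputs: 1000 t CO2, 40% from heat and electricity, 25% renewable
result = co2_abated(phi=1000.0, upsilon=0.40, lam=500.0, gamma=0.25)

# The same quantity with lambda cancelled out
simplified = 1000.0 * 0.40 * 0.25 / (1 - 0.25)
print(result, simplified)
```

Note that the expression is undefined when $γ = 1$ (fully renewable production), since the conventional share in the denominator is then zero.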



In [145]:
def abatement_calculator(dataset_names, no_years):
    # Look up each DataFrame by name and make sure it is indexed by country
    frames = {}
    for name in dataset_names:
        df = eval(name)
        frames[name] = df.set_index('Country Name') if 'Country Name' in df.columns else df
    abated = {}
    for year in no_years:
        # Align the inputs on country; the inner join keeps countries present in every dataset
        calculation = pd.concat(
            {'total_emissions': frames['total_emissions'][year],
             'electric_emissions': frames['electric_emissions'][year]/100,
             'electric_consumption_pc': frames['electric_consumption_pc'][year],
             'electric_production_reneg_total': frames['electric_production_reneg_total'][year]/100},
            axis=1, join='inner')
        conventional = calculation['electric_consumption_pc']*(1-calculation['electric_production_reneg_total'])
        renewable = calculation['electric_production_reneg_total']*calculation['electric_consumption_pc']
        # Emissions per unit of conventional energy, times renewable energy produced
        abated[year] = calculation['total_emissions']*calculation['electric_emissions']/conventional*renewable
    # One row per country, one column per year
    return pd.DataFrame(abated)

In [146]:
abate = abatement_calculator(datasets, years)

In [129]:
abate.to_csv('./Data/abatement_calculations.csv')