<h2>EIA-923 validation, NOx emissions reference and NOx Emissions Factor reference</h2>
This notebook intends to:

- validate EIA-923 as a source for Power Plant fuel consumption and energy generation information (at the month level), extracted from the EIA-923 Annual Report for 2018 (EIA is U.S. Energy Information Administration) 
- provide a reference for NOx emissions in the larger Power Plants in Puerto Rico, in order to be compared later (hopefully, before the end of the competition) with the emissions inferred from data provided by the satellites. For this purpose, EIA-923 fuel consumption information is complemented with standard Emission Factors obtained from an eGRID document (eGRID belongs to EPA, U.S. Environmental Protection Agency)
- incorporate this information into the original dataset gppd_120_pr.csv

After applying this information the conclusion is: <h3>The estimated reference NOx Emissions Factor for the whole grid during the year 2018 is 2.114 ton/GWh, or 4.228 lb/MWh</h3>

Note: Due to unavailability of information for the year 2019, the numbers are provided just for 2018, and consequently just the second half is inside the period under study.

<h2>Validation of EIA-923 information</h2>
In a previous Notebook (https://www.kaggle.com/ajulian/gppd-120-pr-csv-and-the-capacity-factor/) I showed the provided Power Plant dataset (gppd_120_pr.csv) is reliable except for energy generation, thus I decided to look for another data source.

EIA-923 is a reliable information source for energy generation and fuel consumption in the US, and has a good level of granularity: the information is provided for every month and every production area in each Power Plant.

However, when applied to our case, Puerto Rico in the period July 2018 - June 2019, some issues should be mentioned:
- information from Puerto Rico began being included in EIA-923 in 2017, and with a red note: "*Generation and fuel consumption data collected from Puerto Rico power plants may be incomplete or may contain errors (...)*" 
- the information for the rest of the US is added to EIA-923 on a monthly basis, with a delay of two months (Dec 2019 info was included in Feb 2020); however, the information from Puerto Rico is included later and for the whole year, so currently (March 2020) there is no information yet about Puerto Rico for 2019.

<h3>EIA923_PR_2018.csv</h3>
EIA923_PR_2018.csv is an extract of the sheet "Page 1 Puerto Rico 2018" in EIA923_Schedules_2_3_4_5_M_12_2018_Final_Revision.xlsx, which is one of the files of the 2018 EIA-923 Annual Report (f923_2018.zip) in https://www.eia.gov/electricity/data/eia923/ (credit: this file was linked by @pcjimmmy in https://www.kaggle.com/c/ds4g-environmental-insights-explorer/discussion/131395)

In [None]:
import numpy as np
import pandas as pd
months = ["January", "February", "March", "April", "May", "June", 
          "July", "August", "September", "October", "November", "December"]
eia923_df = pd.read_csv("../input/eia923-pr-2018-v0/EIA923_PR_2018_v0.csv", sep=";")
eia923_df = eia923_df.drop(columns="YEAR")
eia923_df.head()

In the original Excel file there is also a sheet "Page 7 File Layout" detailing all the possible values in the columns. Some values that occur in Puerto Rico are:
- **Reported Prime Mover**: CA (Steam part of a Combined-Cycle plant), CT (Combustion turbine of a Combined-Cycle plant), IC (Internal Combustion Engine), ST (Steam Turbine); there are also values for renewable energy, such as PV (PhotoVoltaic) for Solar plants, WT (Wind Turbine), HY (Hydraulic Turbine)
- **Reported Fuel Type Code**: BIT (Bituminous coal), DFO (Distillate Fuel Oil), RFO (Residual Fuel Oil), NG (Natural Gas); regarding renewable energy, we have WAT (Water at a conventional Hydroelectric plant), WND (Wind)
- **Physical Unit Label (of the fuel)**: typical values are "mcf" (thousands of cubic feet) for gas, "short tons" for coal and "barrels" for oil

There are also twelve "Quantity _" columns for the fuel consumption monthly values and twelve "Netgen _" columns for the energy generation monthly values.
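
This twelve-plus-twelve monthly column layout can be awkward when filtering or plotting by month. A minimal sketch of reshaping it into one row per plant and month, assuming the "Quantity <Month>" and "Netgen <Month>" column naming used later in this notebook; the data is invented and the helper name `to_long` is illustrative only:

```python
import pandas as pd

months = ["January", "February", "March"]  # shortened for the example

# Toy data, NOT taken from EIA-923
wide = pd.DataFrame({
    "Plant Name": ["Plant A", "Plant B"],
    "Quantity January": [10.0, 20.0],
    "Quantity February": [11.0, 21.0],
    "Quantity March": [12.0, 22.0],
    "Netgen January": [100.0, 200.0],
    "Netgen February": [110.0, 210.0],
    "Netgen March": [120.0, 220.0],
})

def to_long(df, months):
    """Hypothetical helper: one row per (plant, month) with fuel and generation."""
    frames = []
    for month in months:
        frames.append(pd.DataFrame({
            "Plant Name": df["Plant Name"],
            "Month": month,  # scalar, broadcast to every row
            "Quantity": df["Quantity " + month],
            "Netgen": df["Netgen " + month],
        }))
    return pd.concat(frames, ignore_index=True)

long_df = to_long(wide, months)
print(long_df)
```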
<h3>Capacity Factor consistency</h3>
First, we will do the same as in the previous Notebook https://www.kaggle.com/ajulian/gppd-120-pr-csv-and-the-capacity-factor/: check that the Capacity Factors are < 1. For this we need the capacity values from gppd_120_pr.csv; they are at the Power Plant level, while generation data in EIA-923 is at the sub-plant level, so first we must group the EIA-923 values by Power Plant.
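
As a reminder, the Capacity Factor is the energy actually generated divided by the energy the plant would generate running at full capacity for the whole period. A small sketch with made-up numbers (the function name `capacity_factor` is mine, and 8760 hours assumes a non-leap year such as 2018):

```python
HOURS_PER_YEAR = 24 * 365  # 2018 is not a leap year

def capacity_factor(net_generation_mwh, capacity_mw, hours=HOURS_PER_YEAR):
    """Fraction of the theoretical maximum energy actually generated."""
    return net_generation_mwh / (capacity_mw * hours)

# Toy plant (invented): 500 MW producing 2,190,000 MWh in a year
cf = capacity_factor(2_190_000, 500)
print(round(cf, 3))
```

A value above 1 would mean the reported generation exceeds what the plant can physically produce, which flags a data problem.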

In [None]:
gppd = pd.read_csv('../input/ds4g-environmental-insights-explorer/eie_data/gppd/gppd_120_pr.csv')
# EIA-923 has one row per sub-plant (with monthly columns), so rows must first be grouped by Power Plant.
# However, plant names do not match between the two files and have to be fixed by hand,
# so I will only check Capacity Factors for the six largest Power Plants

# Keep the annual Generation column and group it by Power Plant
tmp_eia923_df = eia923_df.groupby("Plant Name", as_index=False)["Net Generation MWh"].sum()
tmp_eia923_df = tmp_eia923_df.sort_values('Net Generation MWh', ascending=False).reset_index(drop=True)
# Check Capacity Factor: rows are sorted by generation, so positions 0-5 are the six largest plants
tmp_eia923_df["capacity_mw"] = 0.0
tmp_eia923_df.iloc[0, 2] = gppd[gppd["name"]=="Costa Sur"]["capacity_mw"].sum() # "Costa Sur Plant" in EIA-923
tmp_eia923_df.iloc[1, 2] = gppd[gppd["name"]=="EcoEléctrica"]["capacity_mw"].sum() # EcoElectrica (no accent)
tmp_eia923_df.iloc[2, 2] = gppd[gppd["name"]=="San Juan CC"]["capacity_mw"].sum() # Central San Juan Plant
tmp_eia923_df.iloc[3, 2] = gppd[gppd["name"]=="A.E.S. Corp."]["capacity_mw"].sum() # AES Puerto Rico
tmp_eia923_df.iloc[4, 2] = gppd[gppd["name"]=="Aguirre"]["capacity_mw"].sum() # Aguirre Plant
tmp_eia923_df.iloc[5, 2] = gppd[gppd["name"]=="Palo Seco"]["capacity_mw"].sum() # Palo Seco Plant
tmp_eia923_df = tmp_eia923_df[:6]
tmp_eia923_df["capacity_factor"] = tmp_eia923_df["Net Generation MWh"]/(tmp_eia923_df["capacity_mw"]*24*365)
tmp_eia923_df

We can see the Capacity Factor is < 1 for the larger Power Plants, thus we can consider the EIA-923 consistent from the energy generation point of view. 

Let's display scatter plots of Electricity generated versus Fuel for some Power plants to check additional consistency.
<h3>Electricity generated versus Fuel</h3>

In [None]:
import matplotlib.pyplot as plt

# Let's check generated power vs gas in EcoEléctrica, a gas Combined-Cycle Power Plant
eia923_eco = eia923_df.iloc[2, 4:-2]
eco_fuel = eia923_eco[:12]
eco_power = eia923_eco[12:]
plt.scatter(eco_fuel, eco_power)
plt.xlabel("Monthly Natural Gas Consumption (mcf)")
plt.ylabel("Monthly Electricity generated (MWh)")
plt.title("Gas Consumption vs. Electricity Generated in EcoEléctrica PP")
plt.show()

# Same with the coal Power Plant, from AES Corp (the slices below skip the first month of each series)
eia923_aes = eia923_df.iloc[4, 4:-2]
aes_fuel = eia923_aes[1:12]
aes_power = eia923_aes[13:]
plt.scatter(aes_fuel, aes_power)
plt.xlabel("Monthly Coal Consumption (tons)")
plt.ylabel("Monthly Electricity generated (MWh)")
plt.title("Coal Consumption vs. Electricity Generated in A.E.S. Corp. PP")
plt.show()


We can see the relationship between Fuel Consumption and Electricity Generated is quite linear, which suggests the fuel consumption values in the EIA-923 dataset for Puerto Rico are reliable enough to apply standard Emission Factors and obtain a reference for the NOx emissions and the Emissions Factor.
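
The visual impression of linearity can also be quantified. A minimal sketch on synthetic data (not EIA-923 values), using the Pearson correlation coefficient and a least-squares line; the arrays `fuel` and `power` stand in for the monthly series plotted above:

```python
import numpy as np

# Synthetic monthly series: power roughly proportional to fuel, plus small noise
rng = np.random.default_rng(0)
fuel = np.linspace(1000.0, 2000.0, 12)
power = 0.1 * fuel + rng.normal(0.0, 1.0, size=12)

r = np.corrcoef(fuel, power)[0, 1]             # Pearson correlation
slope, intercept = np.polyfit(fuel, power, 1)  # least-squares straight line

print("r =", round(r, 4))
print("slope =", round(slope, 4), "MWh per fuel unit")
```

With the real rows, the series would first need converting with `.astype(float)`, since a row slice of a mixed-type dataframe has object dtype.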

<h2>NOx emissions reference</h2>
In general, developing a new methodology requires as much validation against current approaches as possible. In our case of remote sensing of NOx emissions, a typical reference would be CEMS (Continuous Emission Monitoring Systems) installed in Power Plants, whose information is provided periodically and can be found in an EIA-923 file for US Power Plants; unfortunately, Puerto Rico is not (yet) included.

Another option is to look for engineering tests performed during a limited period. I did find a 2016 report with emissions information per Power Plant in Puerto Rico, but unfortunately it does not monitor NO2 (["Perfil de Emisiones Tóxicas de Puerto Rico 2016", in Spanish](https://estadisticas.pr/files/Publicaciones/IEPR_Perfil_de_Emisiones_Toxicas_2016.pdf)).

Thus, the only reference I could think of is to apply the corresponding Emission Factors to the monthly fuel consumption quantities in EIA-923, and aggregate the resulting emissions per Power Plant.
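
In other words, for each sub-plant and month the estimate is the fuel quantity times the Emission Factor, converted from pounds to short tons. A toy illustration with invented numbers (the function name `nox_short_tons` is mine):

```python
LB_PER_SHORT_TON = 2000

def nox_short_tons(fuel_quantity, ef_lb_per_unit):
    """Estimated NOx (short tons) = fuel quantity x emission factor (lb/unit) / 2000."""
    return fuel_quantity * ef_lb_per_unit / LB_PER_SHORT_TON

# Invented example: 10,000 barrels of oil at 2 lb NOx per barrel
tons = nox_short_tons(10_000, 2.0)
print(tons)
```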

<h3>EF_eGRID.csv</h3>
The Emission Factors in EF_eGRID.csv have been gathered from ["The Emissions & Generation Resource Integrated Database"](https://www.epa.gov/sites/production/files/2020-01/documents/egrid2018_technical_support_document.pdf), Table C-2 on page 107 and onwards (this file was mentioned by @pcjimmmy in https://www.kaggle.com/c/ds4g-environmental-insights-explorer/discussion/131395).

The Emissions & Generation Resource Integrated Database (eGRID) is a comprehensive source of data on the environmental characteristics of almost all electric power generated in the United States, including Emission Factors. 

In [None]:
ef_egrid_df = pd.read_csv("../input/emissions-factors-egrid/EF_eGRID.csv", sep=";")
ef_egrid_df = ef_egrid_df.drop(columns="Boiler firing type") # N/A in Puerto Rico data
ef_egrid_df

Here we can find the following columns:
- **Primer mover**: same as "Reported Prime Mover" in EIA-923
- **Primary fuel type**: same as "Reported Fuel Type Code" in EIA-923
- **Nox Emission Factor**: actual Emission Factor values
- (Emission) **Numerator Unit**: standard Emission Factors are typically expressed as a fraction, with an Emission Unit in the Numerator and a Fuel Unit in the Denominator; the typical Emission Unit is the pound (lb)
- (Fuel) **Denominator Unit**: same as "Physical Unit Label" in the EIA-923 file, with values mcf, (short) tons, barrels (there is one EF in MMBtu, millions of British Thermal Units, a heat unit, but for Natural Gas one MMBtu is almost equivalent to one mcf)
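
For the Emission Factor expressed per MMBtu, converting to a per-mcf basis requires the heat content of the gas. A hedged sketch, assuming a heat content of about 1.037 MMBtu per mcf (an approximate U.S. average; the actual value for the gas burned in Puerto Rico may differ) and a made-up factor value:

```python
MMBTU_PER_MCF = 1.037  # ASSUMPTION: approximate average heat content of natural gas

def ef_per_mcf(ef_lb_per_mmbtu, mmbtu_per_mcf=MMBTU_PER_MCF):
    """Convert an emission factor from lb/MMBtu to lb/mcf."""
    return ef_lb_per_mmbtu * mmbtu_per_mcf

ef_mmbtu = 0.5  # made-up value, lb NOx per MMBtu
ef_mcf = ef_per_mcf(ef_mmbtu)
print(round(ef_mcf, 4), "lb/mcf")
```

With this assumed heat content the difference is under 4%, which is why treating one MMBtu as one mcf is a tolerable approximation here.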

Luckily, the code values in "Primer mover" and "Primary fuel type" in the eGRID file are the same as in "Reported Prime Mover" and "Reported Fuel Type Code" in the EIA file, so we can perform a **left outer join** to apply Emission Factors in the eGRID dataframe to the sub-plants in the EIA-923 dataframe based on the mover and fuel type.
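
A left join silently leaves NaN where no Emission Factor matches, so it can be worth checking which rows went unmatched. A small sketch with toy frames (all values are invented; column names follow the ones used in this notebook), using the `indicator` and `validate` options of `pd.merge`:

```python
import pandas as pd

# Toy sub-plant rows (invented values)
plants = pd.DataFrame({
    "Plant Name": ["Plant A", "Plant A", "Plant B"],
    "Reported Prime Mover": ["ST", "CT", "PV"],
    "Reported Fuel Type Code": ["RFO", "NG", "SUN"],
})
# Toy emission factors (invented values)
factors = pd.DataFrame({
    "Primer mover": ["ST", "CT"],
    "Primary fuel type": ["RFO", "NG"],
    "Nox Emission Factor": [1.5, 0.3],
})

merged = pd.merge(
    plants, factors, how="left",
    left_on=["Reported Prime Mover", "Reported Fuel Type Code"],
    right_on=["Primer mover", "Primary fuel type"],
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # raises if a (mover, fuel) pair has several factors
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["Plant Name", "Reported Prime Mover", "Reported Fuel Type Code"]])
```

If a fossil-fuel row showed up as unmatched, its emissions would be silently missing from the totals, so this check is cheap insurance.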

In [None]:
eia923_egrid_df = pd.merge(eia923_df, ef_egrid_df, how="left", 
                              left_on=['Reported Prime Mover', 'Reported Fuel Type Code'], 
                              right_on=['Primer mover', 'Primary fuel type'])
# Let's remove duplicated columns
eia923_egrid_df = eia923_egrid_df.drop(columns=["Primer mover", "Primary fuel type"])
eia923_egrid_df.head(10)

There are rows with NaN values in the "Nox Emission Factor" column, which correspond to emission-free renewable energy plants. We will keep only the fossil-fuel sub-plants (reminder from the EIA-923 file description, at the beginning of the Notebook: NG => Natural Gas, BIT => Bituminous coal, DFO => Distillate Fuel Oil, RFO => Residual Fuel Oil):

In [None]:
eia923_egrid_df = eia923_egrid_df[eia923_egrid_df["Reported Fuel Type Code"].isin(["NG", "BIT", "DFO", "RFO"])]
eia923_egrid_df.head()

<h3>Now we can compute the emissions (short tons) per sub-plant</h3>
The Numerator Unit of the Emission Factors is the pound (lb), so we have to divide by 2000 to obtain (short) tons.

In [None]:
for month in months:
    eia923_egrid_df["Emissions ton " + month] = \
        eia923_egrid_df["Quantity " + month] * eia923_egrid_df["Nox Emission Factor"]/2000 # 1 (short) ton = 2000 pounds (lb)
eia923_egrid_df.head(10)   

<h3>Final steps:</h3>
- group by Power Plant
- compute Emissions for 2018 (ton) per Power Plant
- plot the eight Power Plants with the highest NOx emissions
- plot the eight largest Power Generating Plants
- incorporate all this information into the original Power Plant dataset, gppd_120_pr.csv
- calculate the NOx Emissions Factor
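
The grouping and yearly totals in the first two steps can be sketched on a toy frame as follows (invented numbers, three months instead of twelve). Summing across the monthly columns with `sum(axis=1)` is one alternative to accumulating month by month:

```python
import pandas as pd

months = ["January", "February", "March"]  # shortened for the example

# Toy sub-plant rows (invented values)
df = pd.DataFrame({
    "Plant Name": ["Plant A", "Plant A", "Plant B"],
    "Netgen January": [1000.0, 500.0, 2000.0],
    "Netgen February": [1100.0, 400.0, 2100.0],
    "Netgen March": [900.0, 600.0, 1900.0],
    "Emissions ton January": [1.0, 0.5, 4.0],
    "Emissions ton February": [1.2, 0.4, 4.1],
    "Emissions ton March": [0.8, 0.6, 3.9],
})

netgen_cols = ["Netgen " + m for m in months]
emis_cols = ["Emissions ton " + m for m in months]

# Sub-plant rows -> one row per Power Plant
by_plant = df.groupby("Plant Name", as_index=False)[netgen_cols + emis_cols].sum()

# Totals over all months (MWh -> GWh for generation)
by_plant["Electricity Generated (GWh)"] = by_plant[netgen_cols].sum(axis=1) / 1000
by_plant["Emissions (ton)"] = by_plant[emis_cols].sum(axis=1)
print(by_plant[["Plant Name", "Electricity Generated (GWh)", "Emissions (ton)"]])
```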

In [None]:
# We keep the Generation and Emissions monthly columns and group them by Power Plant 
cols_idx = eia923_egrid_df.columns
keep_cols = [col for col in cols_idx if col[:3] in ("Emi", "Net")]
eia923_egrid_df = eia923_egrid_df.groupby("Plant Name", as_index=False)[keep_cols].sum()

# Now we can compute "Electricity Generated (GWh)" and "Emissions (ton)" per Power Plant and the whole year
for month in months:
    if month=="January": # initialize Emissions Total
        eia923_egrid_df["Electricity Generated 2018 (GWh)"] = eia923_egrid_df["Netgen January"]/1000
        eia923_egrid_df["Emissions 2018 (ton)"] = eia923_egrid_df["Emissions ton January"]        
    else:
        eia923_egrid_df["Electricity Generated 2018 (GWh)"] += eia923_egrid_df["Netgen " + month]/1000
        eia923_egrid_df["Emissions 2018 (ton)"] += eia923_egrid_df["Emissions ton " + month]

# Sort by total Emissions and plot
eia923_egrid_df = eia923_egrid_df.sort_values('Emissions 2018 (ton)', ascending=False).reset_index(drop=True)
eia923_egrid_df.loc[0, "Plant Name"] = "San Juan"

import seaborn as sns
sns.barplot(y=eia923_egrid_df.loc[:7, "Plant Name"], 
           x=eia923_egrid_df.loc[:7, "Emissions 2018 (ton)"], orient="h")

In [None]:
import seaborn as sns
sns.barplot(y=eia923_egrid_df.loc[:7, "Plant Name"], 
           x=eia923_egrid_df.loc[:7, "Electricity Generated 2018 (GWh)"], orient="h")

We can see that the two largest Power Plants, EcoEléctrica and Costa Sur (both Natural Gas fueled), are far less polluting than San Juan (Oil fueled) and AES (Coal fueled).

Now we can add all this (real) electricity generation and (estimated) emissions information to the original dataset, gppd_120_pr.csv, after performing the same fuel-type filtering.
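
Because plant names differ between the two files, one way to combine them is an explicit name mapping followed by a merge on the mapped name. A sketch under that assumption, using the name pairs noted in the comments of the Capacity Factor cell and invented numeric values (the dictionary `GPPD_TO_EIA` and the column `eia_name` are illustrative, not part of either dataset):

```python
import pandas as pd

# gppd name -> EIA-923 name, as noted in the Capacity Factor cell comments
GPPD_TO_EIA = {
    "Costa Sur": "Costa Sur Plant",
    "EcoEléctrica": "EcoElectrica",
    "San Juan CC": "Central San Juan Plant",
    "A.E.S. Corp.": "AES Puerto Rico",
    "Aguirre": "Aguirre Plant",
    "Palo Seco": "Palo Seco Plant",
}

# Toy frames with invented values
gppd_toy = pd.DataFrame({"name": ["Aguirre", "Palo Seco"],
                         "capacity_mw": [100.0, 50.0]})
eia_toy = pd.DataFrame({"Plant Name": ["Palo Seco Plant", "Aguirre Plant"],
                        "Emissions 2018 (ton)": [5.0, 9.0]})

# Translate the gppd names, then join on the translated name
gppd_toy["eia_name"] = gppd_toy["name"].map(GPPD_TO_EIA)
combined = gppd_toy.merge(eia_toy, how="left",
                          left_on="eia_name", right_on="Plant Name")
combined = combined.drop(columns=["eia_name", "Plant Name"])
print(combined)
```

A merge on names does not depend on both frames happening to sort into the same order.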

In [None]:
gppd_EF = gppd[gppd["primary_fuel"].isin(["Oil", "Gas", "Coal"])]
gppd_EF = gppd_EF.groupby("name", as_index=False).agg({"capacity_mw": "sum", "commissioning_year": "first",
                                             "primary_fuel": "first", "owner": "first", ".geo": "first"})
gppd_EF = gppd_EF.sort_values(["capacity_mw"], ascending=False).reset_index(drop=True)

<h2>NOx Emissions Factor reference</h2>
Now we can sum up the NOx emissions and electricity generation for the whole grid in 2018 and estimate the NOx Emissions Factor reference:

In [None]:
print("The estimated annual NOx emissions for the whole grid are", round(eia923_egrid_df["Emissions 2018 (ton)"].sum(), 1), "tons")
gen_GWh_2018 = eia923_df["Net Generation MWh"].sum()/1000
print("The estimated annual electricity generation for the whole grid is", round(gen_GWh_2018, 1), "GWh")
EF_2018_ton_GWh = eia923_egrid_df["Emissions 2018 (ton)"].sum() / gen_GWh_2018
EF_2018_lb_MWh = EF_2018_ton_GWh*2000/1000
print("The estimated NOx Emissions Factor reference for 2018 is", round(EF_2018_ton_GWh, 3), "ton/GWh, or", round(EF_2018_lb_MWh, 3), "lb/MWh")

In [None]:
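# Keep the eight largest plants (by capacity) and the eight top emitters, each sorted by name.
# NOTE: the column-wise concat below pairs rows by position, so it assumes both frames
# contain the same eight plants in the same alphabetical order.
gppd_EF = gppd_EF.loc[:7].sort_values("name").reset_index(drop=True)
eia923_egrid_eight_df = eia923_egrid_df.loc[:7].sort_values("Plant Name").reset_index(drop=True)
gppd_EF = pd.concat([gppd_EF, eia923_egrid_eight_df], axis=1).drop(columns=["Plant Name", "Net Generation MWh"])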

<h3>Last note</h3>
We have complemented the original Power Plants dataset with (real) power generation and (estimated) emissions per Power Plant, for the year 2018 and for each month, which will eventually help in development/tuning/validation of monthly Emission Factors.

We have also calculated a reference NOx Emissions Factor of 2.114 ton/GWh, or 4.228 lb/MWh.
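
The two figures are the same quantity in different units: one short ton is 2000 lb and one GWh is 1000 MWh, so a factor in ton/GWh is doubled to obtain lb/MWh. A minimal check using the 2.114 ton/GWh figure reported in this notebook (the function name is illustrative):

```python
LB_PER_SHORT_TON = 2000
MWH_PER_GWH = 1000

def ton_per_gwh_to_lb_per_mwh(ef_ton_per_gwh):
    """Convert an emission factor from short ton/GWh to lb/MWh."""
    return ef_ton_per_gwh * LB_PER_SHORT_TON / MWH_PER_GWH

ef_lb_mwh = ton_per_gwh_to_lb_per_mwh(2.114)
print(round(ef_lb_mwh, 3))
```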

We can now save the dataset into /kaggle/working/ for future use:

In [None]:
gppd_EF.to_csv("/kaggle/working/gppd_120_pr_ef.csv", index=False)

test_df = pd.read_csv("/kaggle/working/gppd_120_pr_ef.csv")
test_df