# **DS4G: Predicting emissions of power plants**

# Project overview and problem statement

PROJECT OVERVIEW
Develop a methodology to calculate an average historical emissions factor of electricity generated for a sub-national region, using remote sensing data and techniques.

The Environmental Insights Explorer team at Google is keen to gather insights on ways to improve calculations of global emissions factors for sub-national regions. The ultimate goal of this challenge is to test if calculations of emissions factors using remote sensing techniques are possible and on par with calculations of emissions factors from current methodologies.

PROBLEM STATEMENT
Current emissions factors methodologies are based on time-consuming data collection and may include errors derived from a lack of access to granular datasets, inability to refresh data on a frequent basis, overly general modeling assumptions, and inaccurate reporting of emissions sources like fuel consumption. This begs the question: What if there was a different way to calculate or measure emissions factors? We’re challenging the Kaggle community to see if it’s possible to use remote sensing techniques to better model emissions factors. You will develop a methodology to calculate an average historical emissions factor for electricity generation in a sub-national region.

We’ve provided an initial list of datasets covering the geographic boundary of Puerto Rico to serve as the foundation for this analysis. As an island, there are fewer confounding factors from nearby areas. Puerto Rico also offers a unique fuel mix and distinctive energy system layout that should make it easier to isolate pollution attributable to power generation in the remote sensing data.

Participants will be tasked with developing a methodology to calculate an average annual historical emissions factor for the sub-national region. Participants will also be asked to provide an explanation of the conditions that would result in a higher/lower emissions factor, as well as a recommendation for how the methodology could be applied to calculate the emissions factor of electricity for another geospatial area using similar techniques. Bonus points will be awarded for smaller time slices of the average historical emissions factors, such as one per month for the 12-month period, and additional bonus points will be awarded for participants that develop methodologies for calculating marginal emissions factors for the sub-national region.

This notebook uses the notebook of Paul Mooney as a starting point for analyzing geographical information (see link below)
https://www.kaggle.com/paultimothymooney/how-to-get-started-with-the-earth-engine-data

Following a suggestion by Rasyid Ridha, the images are flipped along the y-axis during processing so that they line up with the Earth Engine images:
https://www.kaggle.com/rasyidstat/fixing-the-reversed-y-axis


The notebook has the following contents:
1. Summary
2. Properties of power plants on Puerto Rico  
3. Emission data on NO2 and aerosols  
4. A prediction model for emission with local emission data
5. Emission data for locations with and without a fossil fuel power plant
6. Climate data on temperature, humidity and wind speed 
7. A prediction model for emission with local emission and weather data   


# 1: Summary #


This notebook investigates whether emission data (from the Sentinel 5P satellite) and weather data (from the GFS weather forecast model) can be used to calculate a historical emissions factor for power plants on Puerto Rico.

A historical emissions factor is successfully calculated for each fossil fuel powered plant on Puerto Rico. The calculation needs only a few data sources: NO2 in the troposphere from Sentinel 5P, a list with the locations and capacities of the power plants in the region, and the yearly electrical power generation/consumption for the region. The calculation distributes the total electrical power for the region over the individual plants according to the geographical distribution of NO2 emissions. This distribution is then optimized under constraints such as the maximum capacity of each plant. From this distribution the historical emission level is determined.

For 2019 the generated power for Puerto Rico is reported as 19480 GWh. With this reference the model calculates the emission from power generation for the region as 37477 kg NO2/year.

Several investigations were done to check whether additional information would lead to better results. A calculation was made for all power plants, including plants powered by renewable energy. A difference in emissions was expected between locations with a fossil fuel plant and locations with a plant that does not run on fossil fuel. However, this difference could not be observed: for Puerto Rico, local NO2 emissions cannot distinguish between these types of locations.

It was also investigated whether a weekly pattern (weekday vs weekend) can be identified. Such a pattern is common for power plants in an industrial area. In the NO2 emissions data for Puerto Rico no such pattern is observed. Possible explanations are the absence of industrial activities on Puerto Rico, or the granularity of the emission measurements: NO2 in the higher atmosphere and/or from other sources may blur the weekly variations in the lower atmosphere that are expected from industrial activities such as power generation.

Finally, it was investigated whether weather data could improve the model described above. Weather data is available on temperature, humidity, wind speed and precipitable water. The local weather data and the local emission data are used to fit a model that predicts local emissions. The relative importance of the features in the model is evaluated to see which features contribute to a better prediction. This investigation did not improve the calculation of the emission factor compared to the first, simpler model. Therefore the simpler model is proposed to derive historical emissions for power plants in a region.

The proposed model is based on an analysis for Puerto Rico. Further investigation is needed to see whether the model is also valid for other types of areas and environments.


# 2: Properties of power plants on Puerto Rico #

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import rasterio as rio
import folium
import tifffile as tiff
import seaborn as sns

def plot_points_on_map(dataframe,begin_index,end_index,latitude_column,latitude_value,longitude_column,longitude_value,zoom):
    df = dataframe[begin_index:end_index]
    location = [latitude_value,longitude_value]
    plot = folium.Map(location=location,zoom_start=zoom)
    color={ 'Hydro' : 'lightblue', 'Solar' : 'orange', 'Oil' : 'darkblue', 'Coal' : 'black', 'Gas' : 'lightgray', 'Wind' : 'green' }
    for i in range(0,len(df)):
        popup = folium.Popup(str(df.plant.iloc[i]))   # positional lookup, consistent with the .iloc calls below
        folium.Marker([df[latitude_column].iloc[i],df[longitude_column].iloc[i]],popup=popup, 
                      icon=folium.Icon(color=color[df.primary_fuel.iloc[i]])).add_to(plot)
    return(plot)

def overlay_image_on_puerto_rico(file_name,band_layer):
    band = rio.open(file_name).read(band_layer)
    m = folium.Map([lat, lon], zoom_start=8, width=500, height=400)
    folium.raster_layers.ImageOverlay(
        image=band,
        bounds = [[18.6,-67.3,],[17.9,-65.2]],
        colormap=lambda x: (1, 0, 0, x),
    ).add_to(m)
    return m

def plot_scaled(file_name):
    vmin, vmax = np.nanpercentile(file_name, (5,95))  # 5-95% stretch
    img_plt = plt.imshow(file_name, cmap='gray', vmin=vmin, vmax=vmax)
    plt.show()

def split_column_into_new_columns(dataframe,column_to_split,new_column_one,begin_column_one,end_column_one):
    for i in range(0, len(dataframe)):
        dataframe.loc[i, new_column_one] = dataframe.loc[i, column_to_split][begin_column_one:end_column_one]
    return dataframe

In [None]:
power_plants = pd.read_csv('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gppd/gppd_120_pr.csv')
power_plants = split_column_into_new_columns(power_plants,'.geo','latitude',50,66)
power_plants = split_column_into_new_columns(power_plants,'.geo','longitude',31,48)
power_plants['latitude'] = power_plants['latitude'].astype(float)
a = np.array(power_plants['latitude'].values.tolist()) # 18 instead of 8
power_plants['latitude'] = np.where(a < 10, a+10, a).tolist() 

In [None]:
power_plants_df = power_plants.sort_values('capacity_mw',ascending=False).reset_index()

bounds = [[18.6,-67.3,],[17.9,-65.2]]

power_plants_df['img_idx_lt']=(((18.6-power_plants_df.latitude)*148/(18.6-17.9))).astype(int)
power_plants_df['img_idx_lg']=((67.3+power_plants_df.longitude.astype(float))*475/(67.3-65.2)).astype(int)
power_plants_df['plant']=power_plants_df.name.str[:3]+power_plants_df.name.str[-1]+'_'+power_plants_df.primary_fuel

power_plants=power_plants_df[['name','latitude','longitude','primary_fuel','capacity_mw','img_idx_lt','img_idx_lg','plant']]

power_plants

Above is a list of power plants on Puerto Rico in descending order of capacity.  

There are 34 power plants on the island. 16 plants use fossil fuel. 18 plants use a renewable source of energy. All fossil fuel powered plants are located near the shore (see map). Of the power plants that use a renewable source of energy some are located inland and some are located near the shore. 
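
The `img_idx_lt` and `img_idx_lg` columns in the table locate each plant in the 148 x 475 pixel satellite rasters. A minimal sketch of that mapping, assuming the bounding box used throughout this notebook (latitude 17.9 to 18.6, longitude -67.3 to -65.2) and that row 0 lies on the northern edge. The helper name `latlon_to_pixel` and the sample coordinates are illustrative only and do not come from the dataset:

```python
# Bounding box and raster shape as used elsewhere in this notebook.
LAT_NORTH, LAT_SOUTH = 18.6, 17.9
LON_WEST, LON_EAST = -67.3, -65.2
N_ROWS, N_COLS = 148, 475

def latlon_to_pixel(lat, lon):
    """Map a latitude/longitude pair to (row, col) in the raster (hypothetical helper)."""
    row = int((LAT_NORTH - lat) * N_ROWS / (LAT_NORTH - LAT_SOUTH))
    col = int((lon - LON_WEST) * N_COLS / (LON_EAST - LON_WEST))
    # clip so that points on the southern/eastern edge stay inside the array
    row = min(max(row, 0), N_ROWS - 1)
    col = min(max(col, 0), N_COLS - 1)
    return row, col

# illustrative coordinates near the centre of the island, not a real plant
row, col = latlon_to_pixel(18.2, -66.3)
print(row, col)
```

Clipping at the border is an addition in this sketch; the notebook cell computes the indices without it, which works as long as all plants lie strictly inside the bounding box.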


In [None]:
lat=18.200178; lon=-66.3 #-66.664513
plot_points_on_map(power_plants,0,425,'latitude',lat,'longitude',lon,9)


In [None]:
print('Total green (solar, wind, hydro) energy capacity in MW :', power_plants_df.loc[((power_plants_df['primary_fuel']=='Hydro') | (power_plants_df['primary_fuel']=='Solar') | (power_plants_df['primary_fuel']=='Wind'))
                    ,'capacity_mw'].sum())

print('Total gray (oil, gas, coal) energy capacity in MW :',power_plants_df.loc[((power_plants_df['primary_fuel']=='Coal') | (power_plants_df['primary_fuel']=='Oil') | (power_plants_df['primary_fuel']=='Gas'))
                    ,'capacity_mw'].sum())

In [None]:
import matplotlib.patches as mpatches

fig1 = plt.figure(figsize=(10, 5))

color={ 'Hydro' : 'lightblue', 'Solar' : 'orange', 'Oil' : 'darkblue', 'Coal' : 'black', 'Gas' : 'lightgray', 'Wind' : 'green' }
barcolor=[]
for fuel in power_plants_df.primary_fuel : barcolor.append(color[fuel]) 

fig1 = fig1.add_subplot(111)
fig1.bar(x=power_plants_df.index, height=power_plants_df.capacity_mw, width=0.6, color=barcolor)     
    
plt.yscale('log')
plt.title('Power plants in Puerto Rico by primary fuel and in descending order of capacity ')
plt.ylabel('Capacity (MW, log-scale)')
plt.xlabel('Powerplants in Puerto Rico')

patches=[]
for key, value in color.items(): patches.append(mpatches.Patch(color=value, label=key))
fig1.legend(handles=patches)

plt.show()

The total capacity of 'gray' fossil fuel powered energy plants is 5760 MW.  
The total capacity of 'green' renewable source energy plants is 387 MW.  

The largest power unit uses oil as primary fuel and has a capacity over 1000 MW.  
The largest wind unit and the largest solar unit approximate the capacity of one of the smaller oil/gas units (approx. 50 to 100 MW).    
The capacity of all hydro units together approximates the capacity of one of the smaller oil/gas units (approx. 100 MW).  

Additional information is taken from the internet to calculate the activity factor for the powerplants. 

Information from eia.gov on the electricity consumption of Puerto Rico gives a power consumption of 19.48 billion kWh (= 19,480,000 MWh) for the year 2019.  

Information from index.mundi.com on the fuel consumption of power generation in Puerto Rico gives a fuel distribution oil/gas/coal/renewables of 40%/40%/18%/2%.

The calculation of a simple emission factor is based on the following (values used for oil) :  
The emission factor per day for the oil powerplants is calculated as 40% of total power consumption on Puerto Rico divided by the number of days in a year.  
The emission factor per oil powerplant is based on the fraction of the capacity of the plant vs the total capacity of the oil powerplants.  

This emission factor is expressed in MWh/day production per plant. The idea behind it is that production is the driving force behind emission of the plant and it is assumed that there is a linear relationship between the two. A production figure can easily be compared to capacity constraints of a plant.
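
The allocation described above can be sketched as follows. The yearly total (19,480,000 MWh) and the 40% oil share are the figures quoted in this notebook; the plant names and capacities are made-up placeholders, not values from the dataset, and `daily_production_per_plant` is a hypothetical helper name.

```python
TOTAL_MWH_2019 = 19_480_000   # yearly consumption quoted from eia.gov
OIL_SHARE = 0.40              # oil share of the fuel mix quoted above
DAYS = 365

def daily_production_per_plant(capacities_mw, fuel_share, total_mwh=TOTAL_MWH_2019):
    """Split the daily production of one fuel type over its plants by capacity share."""
    fuel_mwh_per_day = total_mwh * fuel_share / DAYS
    total_capacity = sum(capacities_mw.values())
    return {name: fuel_mwh_per_day * cap / total_capacity
            for name, cap in capacities_mw.items()}

# hypothetical oil plants with made-up capacities (MW)
oil_plants = {'plant_a': 600.0, 'plant_b': 300.0, 'plant_c': 100.0}
production = daily_production_per_plant(oil_plants, OIL_SHARE)

for name, mwh in production.items():
    # the allocated production can be compared with the physical limit capacity * 24 h
    print(name, round(mwh), 'MWh/day, limit', oil_plants[name] * 24, 'MWh/day')
```

Expressing the factor as production makes the capacity check a direct comparison: an allocation above `capacity_mw * 24` would be physically impossible.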

In [None]:
# add information on capacity, type of fuel and activity factor

# Information from eia.gov on electricity consumption of Puerto Rico gives a power consumption of 19.48 billion kWh (= 19,480,000 MWh) for the year 2019
# Information from index.mundi.com on the fuel consumption of power generation in Puerto Rico gives a distribution of 40%/40%/18%/2% for oil/gas/coal/renewables

Prod_day=int(19480000/365) # MWh/day
print('Average emission factor per day (production in MWh/day) : ',Prod_day)


# With the above information the drivers for the emission factor are calculated on a daily basis:

EF_oil=19480000*0.4/365   # MWh/day
EF_gas=19480000*0.4/365   # MWh/day
EF_coal=19480000*0.18/365  # MWh/day

# With the available capacity for oil, gas and coal plants the daily activity factor A is calculated

print('Emission factor (production in MWh/day) per day for oil: ',int(EF_oil),' gas: ',int(EF_gas),' and coal: ',int(EF_coal))

#print(gray.groupby(by='primary_fuel').capacity_mw.sum())

A_oil=EF_oil/power_plants_df.loc[power_plants_df.primary_fuel=='Oil','capacity_mw'].sum() 
A_gas=EF_gas/power_plants_df.loc[power_plants_df.primary_fuel=='Gas','capacity_mw'].sum() 
A_coal=EF_coal/power_plants_df.loc[power_plants_df.primary_fuel=='Coal','capacity_mw'].sum() 

print('Activity factor of power plants (fraction of the day at full capacity) for oil: ',A_oil/24,' gas: ',A_gas/24,' and coal: ',A_coal/24)




The conclusion from the calculated activity factor is that the gas and coal plants are running at high activity rates (80% for gas and 88% for coal).   
The oil plants are running at a low activity factor (21%). This could also mean that some oil plants are stand-by and not running at all, where other oil plants are running at a higher activity factor.   

# 3. Emission data on NO2 and aerosols  #

In [None]:
# inspection of image information
image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180708T172237_20180714T190743.tif'
img=rio.open(image)

# print('Shape of array with data points :',tiff.imread(image).shape)
img.descriptions


The satellite images from Sentinel 5P contain information presented in the list above.  

A first quick analysis of the relevance of this information is done for the first 7 bands. Based on this quick scan the following bands are analysed in more depth: 
* NO2_column_number_density   
* tropospheric_NO2_column_number_density  
* absorbing_aerosol_index  
* cloud_fraction

In [None]:
from datetime import datetime

files=[]
for dirname, _, filenames in os.walk('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2'):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))


#date_sync=1     # MSE = 0.0187 - starting value and value submitted in the baseline version
#date_sync=2 # MSE 0.0211
#date_sync=3 # MSE 0.0213, poor result for features (variation > feature importance)
#date_sync=4 # MSE 0.0211, does not seem very critical for the feature outcome ???
#date_sync=0 # MSE = 0.01836, best result, also with features
#date_sync=-2 # MSE = 0.01857, y_pred/y_test less good, features less good
date_sync=-1 # MSE = 0.01630, y_pred/y_test very good, features slightly worse than sync=0 - best MSE so default, date_sync=0 may also work well


# read the selected band of all images into one list of arrays (variables keep the aai_ prefix, but the default band is tropospheric NO2)
aai_first_day=[]
aai_first_key=[]
aai_last_day=[]
aai_arr=[]
#band=0 # no2 column - reasonable local information
band=1 # no2 troposphere - good local differentiation visible
#band=2 # no2 stratosphere - no local information
#band=3 # no2 slant column - reasonable local information, but less than band 1

#band=5 # aerosol index - good local differentiation visible
#band=6 # cloud fraction
for i in range(0,len(files)):
    aai_first_day.append(datetime.strptime(files[i][76:91], '%Y%m%dT%H%M%S').date())
    aai_first_key.append(datetime.strptime(files[i][76:91], '%Y%m%dT%H%M%S').toordinal()+date_sync) # shift by date_sync days in order to sync on climate data
    aai_last_day.append(datetime.strptime(files[i][92:107], '%Y%m%dT%H%M%S').date())
    aai_arr.append(rio.open(files[i]).read(band+1))
    



**Data cleaning** 

In [None]:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()


# a=[]
# for i,arr in enumerate(aai_arr): a.append(np.nanmean(arr))
a=[]
a_pos=[]
nll=[]
for i in range(0,len(aai_arr)): 
    a.append(np.nanmean(aai_arr[i]))
    a_pos.append(np.nanmean(np.clip(aai_arr[i],0,10000)))
    nll.append(pd.isnull(aai_arr[i]).sum().sum())

aai_rgn=pd.DataFrame({ 'first': aai_first_day,'last':aai_last_day,'aai_rgn' : a_pos, 'nll' : nll, 'aai_raw' : a,'key_date' : aai_first_key })
aai_rgn=aai_rgn.sort_values('first')
aai_rgn=aai_rgn.reset_index()

fig1 = plt.figure(figsize=(20, 10))
fig1.suptitle("data cleaning - regional mean of emission data and number of nan per observation for 1) raw data (upper graphs) 2) cleaned data, #nan < 5% (middle graphs) 3) #nan < 1% (lower graphs)")
ax1 = fig1.add_subplot(321)
ax1.plot(aai_rgn.iloc[:,1], aai_rgn.iloc[:,5], label='average NO2 in troposphere per day - raw data', color='b')
ax1.legend()
ax2 = fig1.add_subplot(322)
ax2.plot(aai_rgn.iloc[:,1], aai_rgn.iloc[:,4], label='# nan per observation', color='b')
ax2.legend()

aai_rgn=aai_rgn.loc[aai_rgn.nll <3515,:] # only select observations with # nan < 5%

ax3 = fig1.add_subplot(323)
ax3.plot(aai_rgn.iloc[:,1], aai_rgn.iloc[:,5], label='average NO2 in troposphere per day - cleaned for data with nan > 5%', color='b')
ax3.legend()
ax4 = fig1.add_subplot(324)
ax4.plot(aai_rgn.iloc[:,1], aai_rgn.iloc[:,4], label='# nan per observation', color='b')
ax4.legend()

aai_rgn=aai_rgn.loc[aai_rgn.nll <3515/5,:] # only select observations with # nan < 1%

ax5 = fig1.add_subplot(325)
#ax5.plot(aai_rgn.iloc[:,1], aai_rgn.iloc[:,3], label='average aai per day - data <0 clipped to 0', color='b') 
ax5.plot(aai_rgn.iloc[:,1], aai_rgn.iloc[:,5], label='average NO2 in troposphere per day - cleaned for data with nan > 1%', color='b')
ax5.legend()
ax6 = fig1.add_subplot(326)
ax6.plot(aai_rgn.iloc[:,1], aai_rgn.iloc[:,4], label='# nan per observation', color='b')
ax6.legend();

The graphs to the left show NO2 in troposphere for the region as a function of time.  
The graphs to the right show the number of nan (not a number) values in the data. Some daily data has a very high number of nan values and cannot be used. Data with more than 5% nan values is discarded. 

From top to bottom the effect of the data cleaning is presented:  
upper graphs: raw data   
middle graphs: cleaned data (#nan < 5%)   
lower graphs: cleaned data (#nan < 1%)  
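
The cleaning rule can be sketched as a small filter. Each raster has 148 x 475 = 70300 pixels, so the 5% limit corresponds to the 3515 nan values used as threshold in the code. The rasters below are synthetic stand-ins, and the helper names `nan_fraction` and `keep_clean_images` are assumptions made for this illustration:

```python
import numpy as np

def nan_fraction(image):
    """Fraction of pixels in a 2-D array that are nan."""
    return float(np.isnan(image).mean())

def keep_clean_images(images, max_nan_fraction=0.05):
    """Return the images whose nan fraction is below the threshold."""
    return [img for img in images if nan_fraction(img) < max_nan_fraction]

# synthetic stand-ins for the 148 x 475 satellite rasters
rng = np.random.default_rng(0)
clean = rng.random((148, 475))
cloudy = clean.copy()
cloudy[:30, :] = np.nan      # about 20% of the pixels missing

kept = keep_clean_images([clean, cloudy])
print(len(kept), 'of 2 images kept')
```

Working with a fraction instead of an absolute count keeps the rule valid for rasters of a different size, for example when the method is applied to another region.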

In [None]:
# read the four selected bands, only for images with fewer than 5% nan values per band, into lists of arrays for calculation of local emission data
files=[]
for dirname, _, filenames in os.walk('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2'):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

ef_day=[]
ef_dow=[]
ef_last_day=[]
ef_first_key=[]
no2_col_arr=[] ; trop_arr=[] ; aai_arr=[] ; clfr_arr=[]
no2_rgn=[] ; trop_rgn=[] ; aai_rgn=[] ; clfr_rgn=[]
#band=0 # no2 column - reasonable local information
#band=1 # no2 troposphere - good local differentiation visible
#band=2 # no2 stratosphere - no local information
#band=3 # no2 slant column - reasonable local information, but less than band 1
#band=5 # aerosol index - good local differentiation visible
#band=6 # cloud fraction
#print(band,img.descriptions[band])
max_nan=3515 # max 5% (3515) of data points nan
for i in range(0,len(files)):
#     a=rio.open(files[i]).read(0+1) # no2 column
#     b=rio.open(files[i]).read(1+1) # no2 troposphere
#     c=rio.open(files[i]).read(5+1) # aerosol index
#     d=rio.open(files[i]).read(6+1) # cloud fraction
    a=np.flip(rio.open(files[i]).read(0+1),0) # no2 column - image flipped to reverse the y-axis as suggested by https://www.kaggle.com/rasyidstat/fixing-the-reversed-y-axis
    b=np.flip(rio.open(files[i]).read(1+1),0) # no2 troposphere - see remark above
    c=np.flip(rio.open(files[i]).read(5+1),0) # aerosol index - see remark above
    d=np.flip(rio.open(files[i]).read(6+1),0) # cloud fraction - see remark above
    if ((pd.isnull(a).sum().sum() < max_nan) & (pd.isnull(b).sum().sum() < max_nan) & (pd.isnull(c).sum().sum() < max_nan) & (pd.isnull(d).sum().sum() < max_nan)):
        ef_day.append(datetime.strptime(files[i][76:91], '%Y%m%dT%H%M%S').date()) 
        ef_dow.append(datetime.strptime(files[i][76:91], '%Y%m%dT%H%M%S').isoweekday())
        ef_first_key.append(datetime.strptime(files[i][76:91], '%Y%m%dT%H%M%S').toordinal()+date_sync) # shift by date_sync days in order to sync on climate data
        ef_last_day.append(datetime.strptime(files[i][92:107], '%Y%m%dT%H%M%S').date())        
        no2_col_arr.append(a)
        trop_arr.append(b)
        aai_arr.append(np.clip(c,0,10000))  # clip negative values to zero
#        aai_arr.append(c)  # clip negative values to zero
        clfr_arr.append(d)  
        no2_rgn.append(np.nanmean(a))
        trop_rgn.append(np.nanmean(b))
        aai_rgn.append(np.nanmean(np.clip(c,0,10000)))        
#        aai_rgn.append(np.nanmean(c))
        clfr_rgn.append(np.nanmean(d))

        
ef_rgn=pd.DataFrame({'day': ef_day,'no2' : no2_rgn, 'trop' : trop_rgn, 'aai' : aai_rgn, 'clfr' : clfr_rgn, 'ef_key': ef_first_key, 'dow' : ef_dow})
ef_rgn=ef_rgn.sort_values('day')
ef_rgn=ef_rgn.reset_index()        


**Visual inspection of measurements with emission**

In [None]:
image1 = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180711T162527_20180718T185658.tif'
image2 = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180816T164847_20180822T182145.tif'
image3 = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180708T172237_20180714T190743.tif'

image=image1
img=rio.open(image)

print('inspection of an image with some emissions  - original images') #, img.descriptions[band])

image_band = rio.open(image).read(band+1)

f2 = folium.Figure(width=500, height=400, title=img.descriptions[band])
m = folium.Map([lat, lon], min_zoom=8, max_zoom=8, width='100%', height='100%').add_to(f2)
folium.raster_layers.ImageOverlay(
    image=image_band,
    bounds = [[18.6,-67.3,],[17.9,-65.2]],
    colormap=lambda x: (1, 0, 0, x),
).add_to(m)
f2

In [None]:
#with y-axis flipped

image=image1
img=rio.open(image)

print('inspection of an image with some emissions - images flipped along y-axis') #, img.descriptions[band])

image_band = np.flip(rio.open(image).read(band+1),0)

f2 = folium.Figure(width=500, height=400, title=img.descriptions[band])
m = folium.Map([lat, lon], min_zoom=8, max_zoom=8, width='100%', height='100%').add_to(f2)
folium.raster_layers.ImageOverlay(
    image=image_band,
    bounds = [[18.6,-67.3,],[17.9,-65.2]],
    colormap=lambda x: (1, 0, 0, x),
).add_to(m)
f2

# 4. A prediction model for emission with local emission data

In [None]:
def plot_locations_model(n):

    # mask of the areas (+/- n pixels) around the gray power plants
    locations=np.zeros((148,475))

    for j in range(0,len(gray)):
        idx_lt=gray.iloc[j,3]
        idx_lg=gray.iloc[j,4]
        # a scalar assignment also works when the window is cut off at the image border
        locations[max(idx_lt-n,0):idx_lt+n,max(idx_lg-n,0):idx_lg+n]=1

    #plot_scaled(locations)
    print('Overview of areas on Puerto Rico that are selected for modelling local emission factors')

    f1 = folium.Figure(width=500, height=400)
    m = folium.Map([lat, lon], min_zoom=8, max_zoom=8, width='100%', height='100%').add_to(f1)  #zoom_start=8
    folium.raster_layers.ImageOverlay(
        image=locations,
        bounds = [[18.6,-67.3,],[17.9,-65.2]],
        colormap=lambda x: (1, 0, 0, x),
    ).add_to(m)

    return f1

In [None]:
def plot_timeview_local_emission(df,mn,days):   # df is array with emission factor per location (in the columns) as function of time (in the rows)
                                            # mn is list of mean for the region as function of time
    fig3 = plt.figure(figsize=(20, 10))
    fig3.suptitle("emission as function of time for different power plants")

    wkday=[]
    for i,x in enumerate(ef_rgn.dow) : 
        if x>5: #<6: 
            wkday.append(ef_rgn.day[i]) 

    offset=0

    ax=[]
    for i in range(0,np.min((8,df.shape[1]-3))):
        ax.append(fig3.add_subplot(421+i))
        ax[i].plot(df.iloc[:days,1], mn[:days], label='mean_region', color='r')
#        ax[i].plot(df.iloc[:days,1], wktrop[:days], label='weekdays', color='g')
        ax[i].plot(df.iloc[:days,1], df.iloc[:days,3+i+offset], label=df.columns[3+i+offset], color='b')
    #    ax[i].set(ylim=(0, 2))                       #xlim=(-3, 3), ylim=(-3, 3))
        ax[i].set_xlabel('time')
        ax[i].set_ylabel('emission')
    #    ax[i].set_title("aerosol index as function of time")
        ax[i].legend()
        for x in wkday[:int(days*5/7)]:   # np.int was removed from NumPy; the builtin int is equivalent here
            ax[i].axvline(x, color='g',linewidth=4,alpha=0.15)
    
    if df.shape[1]-3 > 8 :
        fig4 = plt.figure(figsize=(20, 10))
        
        offset=8

        ax2=[]
        for i in range(0,min((8,df.shape[1]-3-8-1))):
            ax2.append(fig4.add_subplot(421+i))
            ax2[i].plot(df.iloc[:,1], mn, label='mean_region', color='r')
            ax2[i].plot(df.iloc[:,1], df.iloc[:,3+i+offset], label=df.columns[3+i+offset], color='b')
        #    ax2[i].set(ylim=(0, 2))                       #xlim=(-3, 3), ylim=(-3, 3))
            ax2[i].set_xlabel('time')
            ax2[i].set_ylabel('emission')
        #    ax2[i].set_title("aerosol index as function of time")
            ax2[i].legend()
            for x in wkday[:int(days*5/7)]:   # np.int was removed from NumPy; the builtin int is equivalent here
                ax2[i].axvline(x, color='g',linewidth=4,alpha=0.15)
    
    return


In [None]:
def plot_locationsview_local_data(df1,rgn1,titel1,df2,rgn2,titel2 ):

    #yearly average of the local data per plant location (mean over all days)

    df_plant=df1.drop(columns=['key_date','first','dow']).mean()

    #print('yearly average emission for the whole region : ',np.mean(rgn1))
    labels_x=df_plant.index
    x = np.arange(len(labels_x))  # the label location
    width = 0.35 
    fig5 = plt.figure(figsize=(20, 4))
    fig5.suptitle("yearly average data per plant location :  "+titel1+" (left)  -   "+titel2+"  (right)")
    ax5 = fig5.add_subplot(121)
    ax5.bar(x- width/2, df_plant.values, width, label='local average per cluster', color='b')
    #ax5.bar(x+ width/2, np.ones((len(aai_plant)))*aai_rgn.aai_rgn.mean(), width, label='average for the region', color='r')
    ax5.plot(x, np.ones((len(df_plant)))*np.mean(rgn1), label='average for the whole region', color='r')
    ax5.plot(x, np.ones((len(df_plant)))*np.mean(df_plant.values), label='average for the locations', color='g')
    ax5.set_xticks(x)
    ax5.set_xticklabels(labels_x)
    plt.setp(ax5.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    ax5.legend()
    
    df_plant=df2.drop(columns=['key_date','first','dow']).mean()

    #print('yearly average emission for the whole region : ',np.mean(rgn2))
    labels_x=df_plant.index
    x = np.arange(len(labels_x))  # the label location
    width = 0.35 
    #fig5.suptitle("yearly average data per plant location - "+titel2)
    ax5 = fig5.add_subplot(122)
    ax5.bar(x- width/2, df_plant.values, width, label='local average per cluster', color='b')
    #ax5.bar(x+ width/2, np.ones((len(aai_plant)))*aai_rgn.aai_rgn.mean(), width, label='average for the region', color='r')
    ax5.plot(x, np.ones((len(df_plant)))*np.mean(rgn2), label='average for the whole region', color='r')
    ax5.plot(x, np.ones((len(df_plant)))*np.mean(df_plant.values), label='average for the locations', color='g')
    ax5.set_xticks(x)
    ax5.set_xticklabels(labels_x)
    plt.setp(ax5.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    ax5.legend()
        
    return

In [None]:
gray_plant=power_plants.loc[((power_plants['primary_fuel']=='Coal') | (power_plants['primary_fuel']=='Oil') | (power_plants['primary_fuel']=='Gas')),
                         ['name','primary_fuel','capacity_mw','img_idx_lt','img_idx_lg','plant']]
#gray_plant=power_plants.loc[:,['name','primary_fuel','capacity_mw','img_idx_lt','img_idx_lg','plant']]


gray_plant=gray_plant.sort_values('primary_fuel')

# calculation of maximum daily production (MWh) based on max. capacity of power plants
gray_plant['Max_Prod']=(gray_plant['capacity_mw']*24).astype(int)

print('Dataframe for local calculation of emission factors : ')
#print(gray_plant.head())

n=11 # size of area around a plant where emission is averaged for further calculation
#n=4
#n=3

gray=pd.DataFrame({})
for j in range(0,len(gray_plant)):
    idx_lt=gray_plant.iloc[j,3]
    idx_lg=gray_plant.iloc[j,4]
    temp=gray_plant.loc[(((gray_plant.img_idx_lt>idx_lt-n) & (gray_plant.img_idx_lt<idx_lt+n)) & 
                         ((gray_plant.img_idx_lg>idx_lg-n) & (gray_plant.img_idx_lg<idx_lg+n))),gray_plant.columns[:7]]
    if len(temp)>1 : # if there are more plants in 1 location, group them together into 1 source
        temp1=pd.DataFrame([[str(len(temp))+' locations','comb',int(temp.capacity_mw.sum()),int(temp.img_idx_lt.mean()),int(temp.img_idx_lg.mean()),
                           temp.plant.sum(),temp.Max_Prod.sum()]], columns=temp.columns)
        gray_plant.loc[(((gray_plant.img_idx_lt>idx_lt-n) & (gray_plant.img_idx_lt<idx_lt+n)) & 
                         ((gray_plant.img_idx_lg>idx_lg-n) & (gray_plant.img_idx_lg<idx_lg+n))),'cluster']=temp.plant.sum()
        gray=pd.concat([gray,temp1])
    else :    # if there is 1 plant in a location, keep this one as the source
        gray=pd.concat([gray,temp])
        gray_plant.loc[(((gray_plant.img_idx_lt>idx_lt-n) & (gray_plant.img_idx_lt<idx_lt+n)) & 
                         ((gray_plant.img_idx_lg>idx_lg-n) & (gray_plant.img_idx_lg<idx_lg+n))),'cluster']=temp.plant.sum()
        
gray=gray.drop_duplicates()

gray=gray.rename(columns = { 'plant': 'cluster', 'Max_Prod': 'Max_Prod_clst'})

print(gray)

plot_locations_model(n)

The picture above shows the locations of the fossil fuel powered plants that are used in further calculations. Overlapping areas of power plants that are in close proximity to each other are merged into one area (cluster) with a production capacity equal to the sum of the capacities of the individual plants.  

The table above gives the clusters that are generated for the calculation. Max_Prod_clst is the maximum daily production of the cluster in MWh/day.
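
The merging step can be illustrated with a small self-contained sketch. It is not the notebook's exact logic: it uses a union-find (single-linkage) grouping, so chains of nearby plants end up in one cluster, whereas the cell above looks at the neighbours of each plant separately. The plant names, capacities and pixel positions are made up, and `cluster_plants` is a hypothetical helper.

```python
def cluster_plants(plants, n):
    """Group plants whose pixel positions differ by less than n in both directions.

    plants: list of dicts with keys 'plant', 'capacity_mw', 'row', 'col'.
    Returns a list of clusters with summed capacity and mean position.
    """
    parent = list(range(len(plants)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(plants)):
        for j in range(i + 1, len(plants)):
            close = (abs(plants[i]['row'] - plants[j]['row']) < n and
                     abs(plants[i]['col'] - plants[j]['col']) < n)
            if close:
                parent[find(j)] = find(i)

    groups = {}
    for i, p in enumerate(plants):
        groups.setdefault(find(i), []).append(p)

    clusters = []
    for members in groups.values():
        capacity = sum(m['capacity_mw'] for m in members)
        clusters.append({
            'cluster': ''.join(m['plant'] for m in members),
            'capacity_mw': capacity,
            'row': int(sum(m['row'] for m in members) / len(members)),
            'col': int(sum(m['col'] for m in members) / len(members)),
            'max_prod_mwh_day': capacity * 24,   # maximum daily production
        })
    return clusters

# made-up plants: the first two are close together, the third stands alone
plants = [
    {'plant': 'AaaA_Oil', 'capacity_mw': 200, 'row': 40, 'col': 100},
    {'plant': 'BbbB_Gas', 'capacity_mw': 100, 'row': 45, 'col': 105},
    {'plant': 'CccC_Coal', 'capacity_mw': 400, 'row': 120, 'col': 300},
]
clusters = cluster_plants(plants, n=11)
for c in clusters:
    print(c)
```

As in the table above, the cluster name is the concatenation of the plant labels and the maximum daily production is the summed capacity times 24 hours.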

In [None]:
# emission value in proximity of all plants with all locations in location mask - proximity is +/- n points from location of plant
no2=[]
trop=[]
aai=[]
clfr=[]
for j in range(0,len(gray)):
    idx_lt=gray.iloc[j,3]
    idx_lg=gray.iloc[j,4]
        
    no2_j=[]
    trop_j=[]
    aai_j=[]
    clfr_j=[]
    for i in range(0,len(trop_arr)):
        no2_j.append(np.nanmean(no2_col_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # local average of the NO2 column around the plant
        trop_j.append(np.nanmean(trop_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # local average of tropospheric NO2 around the plant
        aai_j.append(np.nanmean(aai_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # local average of the aerosol index (aai) around the plant
        clfr_j.append(np.nanmean(clfr_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # local average of the cloud fraction around the plant
        
    no2.append(no2_j)
    trop.append(trop_j)
    aai.append(aai_j)
    clfr.append(clfr_j)
    
    
# list to DataFrame with dates   
no=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow})
tro=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow}) 
aa=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow})
clf=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow})

for j in range(0,len(gray)): # add the averaged signal per location as a column named after the cluster (gray.cluster)
    no[gray.iloc[j,5]]=no2[j]
    tro[gray.iloc[j,5]]=trop[j]
    aa[gray.iloc[j,5]]=aai[j]
    clf[gray.iloc[j,5]]=clfr[j]
    
#print('size of dataframe with emission data for gray-energy power-plant locations: ',no.shape, tro.shape, aa.shape, clf.shape)

# sorting dataframe on date to produce ordered time series
no=no.sort_values('key_date') ; tro=tro.sort_values('key_date') ; aa=aa.sort_values('key_date') ; clf=clf.sort_values('key_date')

no=no.reset_index() ; tro=tro.reset_index() ; aa=aa.reset_index() ; clf=clf.reset_index()
no=no.drop(columns=['index']) ; tro=tro.drop(columns=['index']) ; aa=aa.drop(columns=['index']) ; clf=clf.drop(columns=['index'])
no=no.fillna(0) ; tro=tro.fillna(0) ; aa=aa.fillna(0) ; clf=clf.fillna(0)
    
#emission_correlation() # show locations with high correlation

#plot_timeview_local_emission() # show timeview of emission per location

plot_locationsview_local_data(no,no2_rgn,'NO2 column',tro,trop_rgn, 'NO2 in troposphere') # show emissions per location vs average for the region
plot_locationsview_local_data(aa,aai_rgn, 'Aerosol absorption index (aai)',clf,clfr_rgn, 'Cloud fraction') # show emissions per location vs average for the region



The four graphs above present the local measurements of emission data averaged over the year. At first glance, NO2 column and NO2 in troposphere show a similar pattern, but NO2 in troposphere gives a better distinction between locations. In the lower graphs the high values for Mayz_Gas are curious, since Mayz_Gas does not have the highest production capacity.

For the aerosol absorption index some pre-processing of the data is done. According to the reference below, only positive values of the aerosol absorption index indicate absorption by dust and smoke. Negative values in the data are therefore clipped to zero.
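
A minimal sketch of such a clipping step, assuming the data is held in a NumPy array with NaN for missing pixels. The tile values are made up for illustration.

```python
import numpy as np

# made-up 3x3 tile of aerosol index values; NaN marks a missing pixel
aai_tile = np.array([[-1.2, 0.4, np.nan],
                     [0.8, -0.3, 0.0],
                     [np.nan, 1.5, -0.6]])

# negative values are set to zero; NaN stays NaN
aai_clipped = np.clip(aai_tile, 0, None)

# NaN-aware mean over the tile, so missing pixels do not bias the average
tile_mean = np.nanmean(aai_clipped)
print(aai_clipped)
print(round(float(tile_mean), 4))
```

Clipping before averaging matters: averaging first and clipping afterwards would let negative pixels cancel positive ones.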

https://disc.gsfc.nasa.gov/information/glossary?title=Aerosol%20Index

Text from the reference above:
"Aerosol Index. It is an index that detects the presence of uv-absorbing aerosols such as dust and soot. Positive values of Aerosol Index generally represent absorbing aerosols (dust and smoke) while small or negative values represent nonabsorbing aerosols and clouds."

In [None]:
plot_timeview_local_emission(tro,ef_rgn.trop,324) # show timeview of emission per location


The graphs show the measured emission for each location over the year. The green background in the graphs marks the weekends (Saturday and Sunday). The idea behind this is that man-made emissions may follow a weekly pattern (higher on weekdays, lower at the weekend). From visual inspection it is unclear whether such a pattern exists in the emission data.  

In [None]:

data_dow=ef_rgn.pivot(columns='dow',values='trop')

days=['mon','tue', 'wed','thu','fri','sat','sun']

f11, ax11 = plt.subplots(1, 2, figsize=(20, 4))
chart=sns.boxplot(data=data_dow, ax=ax11[0])
chart.set_xticklabels(days, rotation=45, horizontalalignment='right')
chart.set_title('Regional NO2 in troposphere')
chart.set_ylabel('emission (NO2 troposphere)') 
#data_dow


p=4 # number selects location to show
data_dow=tro.pivot(columns='dow',values=tro.columns[p])

chart=sns.boxplot(data=data_dow, ax=ax11[1])
chart.set_xticklabels(days, rotation=45, horizontalalignment='right')
chart.set_title(tro.columns[p])
chart.set_ylabel('emission (NO2 troposphere)') ;



The box plots show the measured emission per weekday for the region (left) and for one location (right).  
No difference in emissions between weekdays can be observed. A weekly pattern is common for power plants in industrialized areas. Since it cannot be observed in the emissions on Puerto Rico, one of the following may hold:  
* there is no weekly pattern because there is little industry   
* the weekly pattern is blurred by background emissions  
* aggregation of NO2 in the higher atmosphere conceals weekly variations in the lower parts.     
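
The visual check can be complemented by a simple number. The sketch below is an illustration only: it assumes `dow` runs from 0 (Monday) to 6 (Sunday), uses a made-up series, and `weekday_weekend_contrast` is a hypothetical helper.

```python
from statistics import mean, stdev

def weekday_weekend_contrast(values, dows):
    """Return (weekday mean, weekend mean, standardised difference)."""
    week = [v for v, d in zip(values, dows) if d < 5]      # mon..fri
    weekend = [v for v, d in zip(values, dows) if d >= 5]  # sat, sun
    diff = mean(week) - mean(weekend)
    # scale by the overall spread so the result is unit-free
    return mean(week), mean(weekend), diff / stdev(values)

# made-up series: two weeks of daily values, slightly lower at weekends
dows = [0, 1, 2, 3, 4, 5, 6] * 2
values = [5.0, 5.2, 4.9, 5.1, 5.0, 4.6, 4.4] * 2
wk, we, effect = weekday_weekend_contrast(values, dows)
print(wk, we, effect)
```

A standardised difference close to zero would support the reading that no weekly pattern is present in the signal.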

In [None]:
# calculation of monthly emission factors
import seaborn as sns
#sns.set(style="ticks", palette="pastel")


def monthly_data_location(df1,df2,titel1,titel2) :
    # box plots of the monthly mean signal per location, scaled to production (MWh/day)
    f1, ax1 = plt.subplots(1, 2, figsize=(20, 4))

    for df, titel, ax in zip((df1,df2),(titel1,titel2),ax1) :
        maand=df.copy() # work on a copy so the caller's DataFrame does not get an extra 'month' column
        # key_date is a date ordinal; correction +1 to sync first date on 1st day of July
        maand['month']=[datetime.fromordinal(x+1).month for x in maand.iloc[:,0]]
        maand=maand.drop(columns=['key_date','first','dow'])
        maand=maand.groupby(by='month').agg('mean')

        # distribute the regional daily production over the locations by their share of the signal
        emis_maand=(maand.T*Prod_day/maand.T.sum(axis=0)).astype(int)

        chart=sns.boxplot(data=emis_maand.T, ax=ax)
        chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')
        chart.set_title(titel)
        chart.set_ylabel('production (MWh/day)')
    return

monthly_data_location(no,tro,'monthly data NO2 column','monthly data troposphere' )
monthly_data_location(aa,clf,'monthly data aerosol absorption index','monthly data cloud fraction' )

The box plots show the variability in measured emission per location per month. Emission is expressed as a corresponding production figure so that it can be compared with the available production capacity (see graph below).   

The upper plots give the best distinction between locations. The variation of emission levels over the months is more or less the same for all locations. From this it is concluded that there is little monthly variation in emissions between locations. An emission factor calculated for the year or for the month will therefore not differ much in this case. Larger variations in monthly emissions per location would indicate that emission factors per location change over time.   

Since NO2 in troposphere gives the best distinction between locations, this measurement is used to determine the emission factor in further calculations. 
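
The monthly grouping itself can be sketched without pandas. This is an illustration with made-up values; it assumes the date key is a plain date ordinal and leaves out the notebook's one-day offset. `monthly_means` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date

def monthly_means(key_dates, values):
    """Average daily values per calendar month; key_dates are date ordinals."""
    buckets = defaultdict(list)
    for k, v in zip(key_dates, values):
        buckets[date.fromordinal(k).month].append(v)
    return {m: sum(vs) / len(vs) for m, vs in sorted(buckets.items())}

# made-up example: last two days of July 2018 and first two days of August 2018
start = date(2018, 7, 30).toordinal()
keys = [start + i for i in range(4)]
vals = [1.0, 3.0, 10.0, 20.0]
print(monthly_means(keys, vals))
```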

In [None]:
fig6 = plt.figure(figsize=(20, 4))
#fig5.suptitle("yearly average data per plant location :  "+titel1+" (left)  -   "+titel2+"  (right)")
ax6 = fig6.add_subplot(121)
labels_x=gray.cluster
x = np.arange(len(labels_x))  # the label location
width = 0.35 
ax6.bar(x-width/2,gray.Max_Prod_clst.values, width, label='Max_Prod (daily production capacity for the cluster)', color='b')
#ax5.bar(x+ width/2, np.ones((len(aai_plant)))*aai_rgn.aai_rgn.mean(), width, label='average for the region', color='r')
ax6.plot(np.ones((len(gray)))*np.mean(gray.Max_Prod_clst), label='average daily capacity per location for the region', color='r')
ax6.plot(np.ones((len(gray)))*Prod_day/len(gray), label='daily production level for the region averaged per location', color='g')
ax6.set_xticks(x)
ax6.set_xticklabels(labels_x)
ax6.set_ylabel('Production (MWh/day)')
plt.setp(ax6.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
ax6.legend();

The graph shows the available production capacity (the driver of emissions) per cluster.

The table below gives the measured emission (EF_contrib) per location. The total daily production of electricity is distributed over the locations by weight: the emission at the location divided by the sum of emissions for all locations. This leads to a daily production per location (hist_prod_day, in MWh per day) and an activity factor (Activity_%) (see next table).    
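
A minimal sketch of this proportional distribution, using plain Python. The signal levels and capacities are made up; the regional total uses the yearly production of 19,480 GWh quoted in this notebook. `distribute_production` is a hypothetical helper, not the notebook's code.

```python
PROD_DAY = 19_480_000 / 365  # MWh/day, from the yearly figure of 19,480 GWh

def distribute_production(ef_contrib, max_prod, prod_day=PROD_DAY):
    """Split regional production over clusters by their share of the signal."""
    total = sum(ef_contrib.values())
    out = {}
    for name, ef in ef_contrib.items():
        hist = int(ef * prod_day / total)                      # MWh/day
        activity = min(100, int(hist * 100 / max_prod[name]))  # capped at 100 %
        out[name] = (hist, activity)
    return out

# made-up signal levels (mol/m^2) and capacities (MWh/day)
ef = {'X': 6.0e-6, 'Y': 3.0e-6, 'Z': 1.0e-6}
cap = {'X': 40000, 'Y': 10000, 'Z': 20000}
out = distribute_production(ef, cap)
print(out)
```

In this made-up case cluster Y is allocated more than its capacity, which is the situation the capped activity factor hides and a second distribution has to resolve.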

In [None]:
gray=gray.iloc[:,:7]

local_plant=tro.drop(columns=['key_date','first','dow']).mean()

avg_location=pd.DataFrame(local_plant, columns=['EF_contrib']).reset_index().rename(columns = {'index': 'cluster'})

gray=pd.merge(gray, avg_location.loc[:,['cluster','EF_contrib']], how='left', on='cluster')

gray=gray.drop(columns=['img_idx_lt','img_idx_lg'])

gray=gray.fillna(0)
gray



In [None]:
#print('Contribution of production factors in the model to the measured total emissions (%) :', gray.EF_contrib.sum()*100)

gray=gray.sort_values('EF_contrib', ascending=False)

# # calculation of maximum daily production (MWh) based on max. capacity of power plants
# gray['EF_max_MWh_day']=(gray['capacity_mw']*24).astype(int)

# daily energy production for the region (MWh) distributed to plants according to emission distribution from the model
gray['hist_prod_day']=(gray['EF_contrib']*Prod_day/gray.EF_contrib.sum()).astype(int)

# activity factor calculated from model emission distribution and maximum daily production
gray['Activity_%']=(gray['hist_prod_day']*100/gray['Max_Prod_clst']).clip(upper=100).astype(int)

# multiplier: if activity > 100 (%) then emissions cannot be explained by power production because the plant is running beyond maximum capacity.
# the multiplier can explain these additional emissions attributable to choice of primary_fuel and generation of technology (age of plant).
gray['multiplier']=(gray['hist_prod_day']/gray['Max_Prod_clst']).astype(int)

gray['EF_MaxCap']=gray['EF_contrib']
gray.loc[gray.hist_prod_day>gray.Max_Prod_clst,'EF_MaxCap']=0
gray['hist2_prod_day']=(gray['EF_MaxCap']*(Prod_day-gray.loc[gray.hist_prod_day>gray.Max_Prod_clst,'Max_Prod_clst'].sum())/gray.EF_MaxCap.sum()).astype(int)
gray.loc[gray.hist_prod_day>gray.Max_Prod_clst,'hist2_prod_day']=gray.Max_Prod_clst
gray['Activity2_%']=(gray['hist2_prod_day']*100/gray['Max_Prod_clst']).astype(int)     #.clip(upper=100).astype(int)

gray['EF_hist2']=gray['EF_MaxCap']
gray.loc[gray.EF_MaxCap==0,'EF_hist2']=gray['EF_contrib']*gray['hist2_prod_day']/gray['hist_prod_day']

gray_cluster=gray
gray_cluster

The column 'hist_prod_day' gives the calculated production per day in MWh. For this calculation the yearly production of 19,480 GWh from eia.gov is divided by 365 days to reach an average daily production of 53,369 MWh for the region. The daily production is distributed to the locations by the fraction of 'EF_contrib' for the location divided by the sum of 'EF_contrib' over all locations.

The column 'Activity_%' is 'hist_prod_day' divided by 'Max_Prod_clst'. The maximum is 100% since a plant cannot produce more power than its maximum capacity.

When comparing the values in the columns 'hist_prod_day' and 'Max_Prod_clst' one notices that values in the first column are sometimes higher, meaning that a higher production level is required to explain the emissions. Since Activity is already at 100%, other mechanisms are needed to explain the additional emissions.

For this explanation an additional multiplier is calculated (column 'multiplier'), which can be thought of as a multiplier applied to the emission from the power plant running at maximum capacity to reach the measured level. The assumption is that the measured level can only in part be explained by power generation, since it exceeds the maximum emission capability of the plant.

To get a better distribution of power generation across plants, with the constraint of maximum production capacity per plant, a second distribution is calculated ('hist2_prod_day'). In this calculation the production of power plants with a multiplier is set at the maximum daily production. At the same time the contribution of these plants to the emission is neglected ('EF_MaxCap' set to zero), since the measured emission exceeds what their capacity can explain. The production for the other plants is calculated according to the remaining weights in the measured emission.

This leads to a production distribution in column 'hist2_prod_day' and an associated activity factor 'Activity2_%', which is not capped at 100% so that it can be checked whether the calculated plant production level remains within the available capacity.
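
The second distribution can be sketched as a single pass over the clusters. This is a simplified illustration with made-up weights and capacities, not the notebook's code; `capped_distribution` is a hypothetical helper.

```python
def capped_distribution(ef_contrib, max_prod, prod_day):
    """One-pass redistribution: cap over-allocated clusters, re-split the rest."""
    total = sum(ef_contrib.values())
    first = {k: ef * prod_day / total for k, ef in ef_contrib.items()}
    capped = {k for k in first if first[k] > max_prod[k]}

    # production left for the clusters that were not capped, and their total weight
    remaining = prod_day - sum(max_prod[k] for k in capped)
    free_ef = sum(ef for k, ef in ef_contrib.items() if k not in capped)

    second = {}
    for k, ef in ef_contrib.items():
        if k in capped:
            second[k] = max_prod[k]                    # runs at full capacity
        else:
            second[k] = int(ef * remaining / free_ef)  # share of what is left
    return second

ef = {'X': 6.0, 'Y': 3.0, 'Z': 1.0}          # made-up signal weights
cap = {'X': 40000, 'Y': 10000, 'Z': 20000}   # made-up capacities, MWh/day
result = capped_distribution(ef, cap, prod_day=50000)
print(result)
```

Because this is a single pass, a cluster that was within capacity in the first split can exceed it after the second, which is why an uncapped activity factor is useful as a check.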

Finally a historical emission factor 'EF_hist2' is calculated, based on historical measurements and scaled back from the measurement if the maximum production of the plant cannot explain (is lower than) the measured emission level.  

EF_hist2 has the same dimension as the measurement NO2 in troposphere (mol/m^2), from which an absolute emission value still needs to be derived. 

A different approach to reach an absolute emission value is based on NO2 emission factors known for electricity generation from coal, oil and gas, e.g. from http://data.ec.gc.ca/data/substances/monitor/canada-s-official-greenhouse-gas-inventory/E-Tables-Electricity-Canada-Provinces-Territories/?lang=en referenced on the EIE page https://insights.sustainability.google/methodology#some-notes   

The reference gives values for coal: 0.01 g NO2/kWh (Tab A13-11), oil: 0.001 g NO2/kWh (Tab A13-2) and natural gas: 0.0007 g NO2/kWh (Tab A13-11). With the columns 'primary_fuel' and 'hist_prod_day' in the table below, the absolute NO2 emission per plant can be calculated.

*It is interesting to note that the calculated absolute emission for plant 'A.E._Coal' is 10x the calculated absolute emission for plant 'Palo_Oil'. Both run at approximately the same production level (6000 MWh/day +/- 10%). The emission factors from the troposphere measurements are more or less the same: 6.5e-6 vs 7.2e-6. The higher absolute emission level of a coal plant as reported in the reference is not observed in the local emission measurements.*
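
The per-plant calculation is a single multiplication. The sketch below uses the factors quoted above and two hypothetical plants at the same production level; `absolute_emission` is a hypothetical helper.

```python
# factors quoted in the text, in g/kWh (numerically equal to kg/MWh)
NO_FACTORS = {'Coal': 0.01, 'Oil': 0.001, 'Gas': 0.0007}

def absolute_emission(primary_fuel, prod_mwh_day):
    """Daily emission in kg/day for one plant."""
    return NO_FACTORS[primary_fuel] * prod_mwh_day

# two hypothetical plants at the same production level of 6000 MWh/day
coal = absolute_emission('Coal', 6000)
oil = absolute_emission('Oil', 6000)
print(coal, oil, coal / oil)
```

At equal production the ratio between the two plants is simply the ratio of the fuel factors.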

In [None]:
# post-processing - for each plant
gray_plant=gray_plant.iloc[:,:8]
#gray=gray.rename(columns = {'plant':'cluster'})

gray_plant=pd.merge(gray_plant, gray[['hist2_prod_day','Max_Prod_clst','cluster','EF_hist2']], how='left', on='cluster')

gray_plant['hist_prod_day']=(gray_plant['Max_Prod']*gray_plant['hist2_prod_day']/gray_plant['Max_Prod_clst']).astype(int)

gray_plant['Activity_%']=(gray_plant['hist_prod_day']*100/gray_plant['Max_Prod']).astype(int)  

gray_plant['EF_hist']=(gray_plant['Max_Prod']*gray_plant['EF_hist2']/gray_plant['Max_Prod_clst'])

NO_factors={ 'Coal': 0.01 , 'Oil': 0.001, 'Gas': 0.0007 } # g/kWh is the same as kg/MWh

# absolute emission per plant (kg/day) = fuel-specific factor (kg/MWh) * daily production (MWh/day)
gray_plant['EF_abs']=gray_plant['primary_fuel'].map(NO_factors)*gray_plant['hist_prod_day']

gray_plant=gray_plant.drop(columns = ['hist2_prod_day', 'Max_Prod_clst','EF_hist2'])
  
gray_location=gray_plant

gray_location

In [None]:
# fuel distribution purely based on historical emissions
print('fuel distribution based on historical emissions')
print(gray_plant.groupby(by='primary_fuel').hist_prod_day.sum()/gray_plant.hist_prod_day.sum())
print('  ')


The baseline for emission of NO2 from electricity generation in Puerto Rico is calculated as the daily power generation multiplied by the emission factor for coal/oil/gas.

In [None]:
print('Baseline for emission of NO2 from electricity generation in Puerto Rico is ',int(gray_location.EF_abs.sum()*365),'kg NO2/year' )

To put this value into perspective, a simple approximation of the emission is calculated as the product of the electricity generation in the year, the fuel distribution and the emission per kWh for each fuel. 

In [None]:
print('Simple approximation of emission of NO2 from electricity generation in Puerto Rico : ',int(19480000*(.2*0.01+.4*0.001+.4*0.0007)),'kg NO2/year' )

This value is higher than the value from the model because the model calculates a different fuel distribution.  

The estimate of NO2 emission from satellite measurements is based on the average NO2 density (mol/m^2) at plant locations. The molecular weight of NO2 is 46.0055 g/mol.   
Furthermore we need to estimate the area over which the emission is generated and the lifetime of the generated emissions. For simplicity we use 1 square kilometer for the area and 1 day for the lifetime of the emissions.
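
The chain of units can be written out explicitly. The sketch below is an illustration under the stated assumptions (1 square kilometer, lifetime of 1 day) and is not the notebook's own calculation; the density of 7e-6 mol/m^2 is a hypothetical value and `density_to_kg_per_year` is a hypothetical helper.

```python
M_NO2 = 46.0055          # g/mol
AREA_M2 = 1000 * 1000    # 1 square kilometer, as assumed in the text
LIFETIME_DAYS = 1.0      # assumed lifetime, as in the text

def density_to_kg_per_year(density_mol_m2):
    """Convert a mean column density (mol/m^2) to kg NO2 per year for one area."""
    grams_in_column = density_mol_m2 * M_NO2 * AREA_M2   # g present at any time
    kg_per_day = grams_in_column / 1000 / LIFETIME_DAYS  # replaced once per lifetime
    return kg_per_day * 365

# hypothetical density, for illustration only
print(density_to_kg_per_year(7e-6))
```

The result scales linearly with the assumed area and inversely with the assumed lifetime, so both choices dominate the outcome.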

In [None]:
print('Estimated NO2 emission based on measurements at plant locations ', int(tro.iloc[:,3:13].mean().mean()*365*10*1000*1000/1000), 'kg NO2/year')

This does not give a very useful value considering the baseline. Bringing this value in line with the baseline by, for instance, reducing the lifetime of NO2 emissions at the plant location (1 sq km) would require a lifetime of only a couple of minutes.    

Reading about the conversion from NO2 density measurements to absolute values of NO2 does not yield a straightforward conversion method. Indications are that density measurements are useful for comparing emissions, as is done here in estimating the distribution of electricity generation between plants.
They may also be used to confirm or question the absolute emission factors between plants. For this purpose we calculate the emission factors per MWh for coal, gas and oil as seen in the measurements and compare these with the reported values (see table below).

There is no differentiation in the measured values for emission per MWh between coal, gas and oil plants.

In [None]:
kengetal=gray_plant.loc[:,['primary_fuel','EF_hist','EF_abs','hist_prod_day']].groupby(by='primary_fuel').agg('sum')
kengetal['EF_measured_MWh']=kengetal['EF_hist']/kengetal['hist_prod_day']
kengetal['EF_abs_MWh']=kengetal['EF_abs']/kengetal['hist_prod_day']
kengetal

With the knowledge of the reported distribution of oil/gas/coal/renewables of 40%/40%/18%/2%, a better calculated distribution may be obtained by a different choice for calculating individual plant contributions within a cluster.  
Instead of attributing production to plants pro rata to their capacity within a cluster, a preference for gas is used. That is: all gas plants in a cluster are set at maximum production (only if this does not exceed the calculated cluster production). The remaining production in the cluster is attributed to oil and coal.  

This gives the following table and a fuel usage distribution coal/gas/oil of 10%/39%/51%, which is closer to the reported distribution for gas but still needs tweaking to bring the distribution for coal and oil in line with reported values. It shows that the model can bring calculated historical production and emission figures more in line with reported figures if additional, relevant information is added. 
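
The gas-preference rule for a single cluster can be sketched as follows. This is a simplified illustration with a made-up cluster, not the notebook's code; `allocate_gas_first` is a hypothetical helper.

```python
def allocate_gas_first(plants, cluster_prod):
    """Allocate cluster production to plants, filling gas plants first.

    plants: list of (name, fuel, max_prod) tuples for one cluster.
    cluster_prod: production attributed to the cluster, MWh/day.
    """
    gas_cap = sum(m for _, f, m in plants if f == 'Gas')
    other_cap = sum(m for _, f, m in plants if f != 'Gas')
    alloc = {}
    if gas_cap > cluster_prod:
        # gas alone would exceed the cluster total: scale gas down, others idle
        for name, fuel, m in plants:
            alloc[name] = int(m * cluster_prod / gas_cap) if fuel == 'Gas' else 0
    else:
        # gas at maximum, remainder split pro rata capacity over oil and coal
        rest = cluster_prod - gas_cap
        for name, fuel, m in plants:
            alloc[name] = m if fuel == 'Gas' else int(rest * m / other_cap)
    return alloc

# made-up cluster with one gas plant and two oil plants
cluster = [('G1', 'Gas', 3000), ('O1', 'Oil', 6000), ('O2', 'Oil', 2000)]
high = allocate_gas_first(cluster, cluster_prod=7000)
low = allocate_gas_first(cluster, cluster_prod=1500)
print(high)
print(low)
```

The two calls show both branches: with enough cluster production the gas plant runs at maximum and oil takes the remainder, otherwise gas is scaled down and oil stays idle.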



In [None]:
# post-processing - for each plant
gray_plant=gray_plant.iloc[:,:8]

gray_plant=pd.merge(gray_plant, gray[['hist2_prod_day','Max_Prod_clst','cluster','EF_hist2']], how='left', on='cluster')

for x in gray_plant.cluster :
    gray_plant.loc[((gray_plant.primary_fuel=='Gas')&(gray_plant.cluster==x)),'scen_gas_max']=gray_plant['Max_Prod']
    if len(gray_plant.loc[(gray_plant.primary_fuel!='Gas')&(gray_plant.cluster==x),:])>0 :
        gray_plant.loc[((gray_plant.primary_fuel!='Gas')&(gray_plant.cluster==x)),'scen_gas_max']=((gray.loc[gray.cluster==x,'hist2_prod_day'].values-gray_plant.loc[((gray_plant.primary_fuel=='Gas')&(gray_plant.cluster==x)),'scen_gas_max'].sum())*gray_plant['Max_Prod']/gray_plant.loc[((gray_plant.primary_fuel!='Gas')&(gray_plant.cluster==x)),'Max_Prod'].sum()).astype(int)
    
    if gray_plant.loc[((gray_plant.primary_fuel=='Gas')&(gray_plant.cluster==x)),'scen_gas_max'].sum() > gray.loc[gray.cluster==x,'hist2_prod_day'].values:
        gray_plant.loc[((gray_plant.primary_fuel=='Gas')&(gray_plant.cluster==x)),'scen_gas_max']=(gray_plant['Max_Prod']*gray.loc[gray.cluster==x,'hist2_prod_day'].values/gray_plant.loc[((gray_plant.primary_fuel=='Gas')&(gray_plant.cluster==x)),'Max_Prod'].sum()).astype(int)     
        gray_plant.loc[((gray_plant.primary_fuel!='Gas')&(gray_plant.cluster==x)),'scen_gas_max']=0
        

gray_plant['Activity_%']=(gray_plant['scen_gas_max']*100/gray_plant['Max_Prod']).astype(int)  

gray_plant['EF_gas_max']=(gray_plant['Max_Prod']*gray_plant['EF_hist2']/gray_plant['Max_Prod_clst'])

#gray_plant['EF_p_MWh']=(gray_plant['EF_hist']/gray_plant['hist_prod_day'])
 
#gray_plant

In [None]:
print('fuel distribution based on historical emissions')
print(gray_plant.groupby(by='primary_fuel').scen_gas_max.sum()/gray_plant.scen_gas_max.sum())
print('  ')

# 5. Emission data for locations with and without a fossil fuel power plant #

Emission levels in areas other than fossil fuel power plant locations are explored further to find out whether emissions at fossil fuel power plant locations stand out. To this end the full table of power plants is analyzed, including hydro, wind and solar locations, from which no emissions are expected. 

In [None]:
gray_plant=power_plants.loc[:,['name','primary_fuel','capacity_mw','img_idx_lt','img_idx_lg','plant']]


gray_plant=gray_plant.sort_values('primary_fuel')

# calculation of maximum daily production (MWh) based on max. capacity of power plants
gray_plant['Max_Prod']=(gray_plant['capacity_mw']*24).astype(int)

print('Dataframe for local calculation of emission factors : ')
#print(gray_plant.head())

n=11 # size of area around a plant where emission is averaged for further calculation
#n=4
#n=3

gray=pd.DataFrame({})
for j in range(0,len(gray_plant)):
    idx_lt=gray_plant.iloc[j,3]
    idx_lg=gray_plant.iloc[j,4]
    temp=gray_plant.loc[(((gray_plant.img_idx_lt>idx_lt-n) & (gray_plant.img_idx_lt<idx_lt+n)) & 
                         ((gray_plant.img_idx_lg>idx_lg-n) & (gray_plant.img_idx_lg<idx_lg+n))),gray_plant.columns[:7]]
    if len(temp)>1 : # if there are more plants in 1 location, group them together into 1 source
        # built-in int() replaces np.int, which was removed in NumPy 1.24
        temp1=pd.DataFrame([[str(len(temp))+' locations','comb',int(temp.capacity_mw.sum()),int(temp.img_idx_lt.mean()),int(temp.img_idx_lg.mean()),
                           temp.plant.sum(),temp.Max_Prod.sum()]], columns=temp.columns)
        gray_plant.loc[(((gray_plant.img_idx_lt>idx_lt-n) & (gray_plant.img_idx_lt<idx_lt+n)) & 
                         ((gray_plant.img_idx_lg>idx_lg-n) & (gray_plant.img_idx_lg<idx_lg+n))),'cluster']=temp.plant.sum()
        gray=pd.concat([gray,temp1])
    else :    # if there is 1 plant in a location, keep this one as the source
        gray=pd.concat([gray,temp])
        gray_plant.loc[(((gray_plant.img_idx_lt>idx_lt-n) & (gray_plant.img_idx_lt<idx_lt+n)) & 
                         ((gray_plant.img_idx_lg>idx_lg-n) & (gray_plant.img_idx_lg<idx_lg+n))),'cluster']=temp.plant.sum()
        
gray=gray.drop_duplicates()

gray=gray.rename(columns = { 'plant': 'cluster', 'Max_Prod': 'Max_Prod_clst'})

print(gray)

plot_locations_model(n)

In [None]:
# emission value in proximity of all plants with all locations in location mask - proximity is +/- n points from location of plant
no2=[]
trop=[]
aai=[]
clfr=[]
for j in range(0,len(gray)):
    idx_lt=gray.iloc[j,3]
    idx_lg=gray.iloc[j,4]
        
    no2_j=[]
    trop_j=[]
    aai_j=[]
    clfr_j=[]
    for i in range(0,len(trop_arr)):
        no2_j.append(np.nanmean(no2_col_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # average NO2 column density around the plant
        trop_j.append(np.nanmean(trop_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # average tropospheric NO2 around the plant
        aai_j.append(np.nanmean(aai_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # average aerosol absorption index around the plant
        clfr_j.append(np.nanmean(clfr_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # average cloud fraction around the plant
        
    no2.append(no2_j)
    trop.append(trop_j)
    aai.append(aai_j)
    clfr.append(clfr_j)
    
    
# list to DataFrame with dates   
no=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow})
tro=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow}) 
aa=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow})
clf=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow})

for j in range(0,len(gray)): # add the averaged signal per location as a column named after the cluster (gray.cluster)
    no[gray.iloc[j,5]]=no2[j]
    tro[gray.iloc[j,5]]=trop[j]
    aa[gray.iloc[j,5]]=aai[j]
    clf[gray.iloc[j,5]]=clfr[j]
    
print('size of dataframe with emission data for gray-energy power-plant locations: ',no.shape, tro.shape, aa.shape, clf.shape)

# sorting dataframe on date to produce ordered time series
no=no.sort_values('key_date') ; tro=tro.sort_values('key_date') ; aa=aa.sort_values('key_date') ; clf=clf.sort_values('key_date')

no=no.reset_index() ; tro=tro.reset_index() ; aa=aa.reset_index() ; clf=clf.reset_index()
no=no.drop(columns=['index']) ; tro=tro.drop(columns=['index']) ; aa=aa.drop(columns=['index']) ; clf=clf.drop(columns=['index'])
no=no.fillna(0) ; tro=tro.fillna(0) ; aa=aa.fillna(0) ; clf=clf.fillna(0)
    
#emission_correlation() # show locations with high correlation

#plot_timeview_local_emission() # show timeview of emission per location

plot_locationsview_local_data(no,no2_rgn,'NO2 column',tro,trop_rgn, 'NO2 in troposphere') # show emissions per location vs average for the region
plot_locationsview_local_data(aa,aai_rgn, 'Aerosol absorption index (aai)',clf,clfr_rgn, 'Cloud fraction') # show emissions per location vs average for the region



Measured emission values at fossil fuel power plant locations do not stand out from those at locations without fossil fuel power plants. The same holds for the box plots below.   

The main conclusion is that locations of fossil fuel power plants cannot be distinguished solely from emission measurements of NO2 in troposphere. Additional information on the locations of fossil fuel power plants is necessary to construct a model.

In [None]:

data_dow=ef_rgn.pivot(columns='dow',values='trop')

days=['mon','tue', 'wed','thu','fri','sat','sun']

f11, ax11 = plt.subplots(1, 2, figsize=(20, 4))
chart=sns.boxplot(data=data_dow, ax=ax11[0])
chart.set_xticklabels(days, rotation=45, horizontalalignment='right')
chart.set_title('Regional NO2 in troposphere')
chart.set_ylabel('emission (NO2 troposphere)') 

p=13 # number selects location to show
data_dow=tro.pivot(columns='dow',values=tro.columns[p])

chart=sns.boxplot(data=data_dow, ax=ax11[1])
chart.set_xticklabels(days, rotation=45, horizontalalignment='right')
chart.set_title(tro.columns[p])
chart.set_ylabel('emission (NO2 troposphere)') ;

monthly_data_location(no,tro,'monthly data NO2 column','monthly data troposphere' )
monthly_data_location(aa,clf,'monthly data aerosol absorption index','monthly data cloud fraction' )

# 6. Climate data on temperature, humidity and wind speed #

In [None]:
gray_plant=power_plants.loc[((power_plants['primary_fuel']=='Coal') | (power_plants['primary_fuel']=='Oil') | (power_plants['primary_fuel']=='Gas')),
                         ['name','primary_fuel','capacity_mw','img_idx_lt','img_idx_lg','plant']]


gray_plant=gray_plant.sort_values('primary_fuel')

# calculation of maximum daily production (MWh) based on max. capacity of power plants
gray_plant['Max_Prod']=(gray_plant['capacity_mw']*24).astype(int)

print('Dataframe for local calculation of emission factors : ')
#print(gray_plant.head())

n=11 # size of area around a plant where emission is averaged for further calculation
#n=4
#n=3

gray=pd.DataFrame({})
for j in range(0,len(gray_plant)):
    idx_lt=gray_plant.iloc[j,3]
    idx_lg=gray_plant.iloc[j,4]
    temp=gray_plant.loc[(((gray_plant.img_idx_lt>idx_lt-n) & (gray_plant.img_idx_lt<idx_lt+n)) & 
                         ((gray_plant.img_idx_lg>idx_lg-n) & (gray_plant.img_idx_lg<idx_lg+n))),gray_plant.columns[:7]]
    if len(temp)>1 : # if there are more plants in 1 location, group them together into 1 source
        # built-in int() replaces np.int, which was removed in NumPy 1.24
        temp1=pd.DataFrame([[str(len(temp))+' locations','comb',int(temp.capacity_mw.sum()),int(temp.img_idx_lt.mean()),int(temp.img_idx_lg.mean()),
                           temp.plant.sum(),temp.Max_Prod.sum()]], columns=temp.columns)
        gray_plant.loc[(((gray_plant.img_idx_lt>idx_lt-n) & (gray_plant.img_idx_lt<idx_lt+n)) & 
                         ((gray_plant.img_idx_lg>idx_lg-n) & (gray_plant.img_idx_lg<idx_lg+n))),'cluster']=temp.plant.sum()
        gray=pd.concat([gray,temp1])
    else :    # if there is 1 plant in a location, keep this one as the source
        gray=pd.concat([gray,temp])
        gray_plant.loc[(((gray_plant.img_idx_lt>idx_lt-n) & (gray_plant.img_idx_lt<idx_lt+n)) & 
                         ((gray_plant.img_idx_lg>idx_lg-n) & (gray_plant.img_idx_lg<idx_lg+n))),'cluster']=temp.plant.sum()
        
gray=gray.drop_duplicates()

gray=gray.rename(columns = { 'plant': 'cluster', 'Max_Prod': 'Max_Prod_clst'})

print(gray)

plot_locations_model(n)

In [None]:
# emission value in proximity of all plants with all locations in location mask - proximity is +/- n points from location of plant
no2=[]
trop=[]
aai=[]
clfr=[]
for j in range(0,len(gray)):
    idx_lt=gray.iloc[j,3]
    idx_lg=gray.iloc[j,4]
        
    no2_j=[]
    trop_j=[]
    aai_j=[]
    clfr_j=[]
    for i in range(0,len(trop_arr)):
        no2_j.append(np.nanmean(no2_col_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # average NO2 column density around the plant
        trop_j.append(np.nanmean(trop_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # average tropospheric NO2 around the plant
        aai_j.append(np.nanmean(aai_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # average aerosol absorption index around the plant
        clfr_j.append(np.nanmean(clfr_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # average cloud fraction around the plant
        
    no2.append(no2_j)
    trop.append(trop_j)
    aai.append(aai_j)
    clfr.append(clfr_j)
    
    
# list to DataFrame with dates   
no=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow})
tro=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow}) 
aa=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow})
clf=pd.DataFrame({'key_date':np.array(ef_first_key), 'first': ef_day,'dow':ef_dow})

for j in range(0,len(gray)): # add the averaged signal per location as a column named after the cluster (gray.cluster)
    no[gray.iloc[j,5]]=no2[j]
    tro[gray.iloc[j,5]]=trop[j]
    aa[gray.iloc[j,5]]=aai[j]
    clf[gray.iloc[j,5]]=clfr[j]
    
print('size of dataframe with emission data for gray-energy power-plant locations: ',no.shape, tro.shape, aa.shape, clf.shape)

# sorting dataframe on date to produce ordered time series
no=no.sort_values('key_date') ; tro=tro.sort_values('key_date') ; aa=aa.sort_values('key_date') ; clf=clf.sort_values('key_date')

#aa=pd.merge(aa,aai_rgn.loc[:,['key_date','aai_rgn']], how='inner',on='key_date')

no=no.reset_index() ; tro=tro.reset_index() ; aa=aa.reset_index() ; clf=clf.reset_index()
no=no.drop(columns=['index']) ; tro=tro.drop(columns=['index']) ; aa=aa.drop(columns=['index']) ; clf=clf.drop(columns=['index'])
no=no.fillna(0) ; tro=tro.fillna(0) ; aa=aa.fillna(0) ; clf=clf.fillna(0)
#aa.head()
    

In [None]:
image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gfs/gfs_2018072118.tif'
img2=rio.open(image)
print('Available information on climate factors')

for i in range(1,7):
    image_band = img2.read(i) # reuse the dataset opened above instead of reopening the file for every band
    print(img2.descriptions[i-1])
    plot_scaled(image_band)


In [None]:
files=[]
for dirname, _, filenames in os.walk('/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gfs'):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

# read all the climate data into lists of arrays
gfs_day=[]
gfs_key=[]
temp_arr=[]
spec_hum_arr=[]
rel_hum_arr=[]
u_wind_arr=[]
v_wind_arr=[]
pr_water_arr=[]
#band=0 #temperature_2m_above_ground
#band=6 # cloud fraction
for i in range(0,len(files)):
    gfs_day.append(datetime.strptime(files[i][68:78], '%Y%m%d%H').date())
    gfs_key.append(datetime.strptime(files[i][68:78], '%Y%m%d%H').toordinal())
    temp_arr.append(rio.open(files[i]).read(1)) #temperature_2m_above_ground
    spec_hum_arr.append(rio.open(files[i]).read(2)) #specific_humidity_2m_above_ground
    rel_hum_arr.append(rio.open(files[i]).read(3)) # relative_humidity_2m_above_ground
    u_wind_arr.append(rio.open(files[i]).read(4)) # u_component_of_wind_10m_above_ground
    v_wind_arr.append(rio.open(files[i]).read(5)) # v_component_of_wind_10m_above_ground
    pr_water_arr.append(rio.open(files[i]).read(6)) # precipitable_water_entire_atmosphere
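
The loop above reads the timestamp from fixed character positions (`files[i][68:78]`), which only works while the input directory path keeps exactly the same length. A minimal sketch of a more tolerant alternative follows, assuming every GFS file name contains a ten-digit `YYYYMMDDHH` stamp as in `gfs_2018072118.tif`. The helper name `gfs_timestamp` is hypothetical and not part of the original notebook.

```python
import os
import re
from datetime import datetime


def gfs_timestamp(path):
    """Return the datetime encoded in a GFS file name such as gfs_2018072118.tif."""
    name = os.path.basename(path)
    match = re.search(r"(\d{10})", name)  # ten digits: YYYYMMDDHH
    if match is None:
        raise ValueError(f"no YYYYMMDDHH timestamp found in {name!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H")


example = "/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gfs/gfs_2018072118.tif"
stamp = gfs_timestamp(example)
print(stamp.date(), stamp.toordinal())  # same two values the loop stores in gfs_day and gfs_key
```

With such a helper, `gfs_day` and `gfs_key` could be filled from `gfs_timestamp(files[i])` without depending on the directory depth.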
    

In [None]:
# gfs data is clean - no NaN in data!

t=[] ; s=[] ; r=[] ; u=[] ; v=[] ; p=[]

for i in range(0,len(temp_arr)): 
    t.append(np.nanmean(temp_arr[i]))
    s.append(np.nanmean(spec_hum_arr[i]))
    r.append(np.nanmean(rel_hum_arr[i]))
    u.append(np.nanmean(u_wind_arr[i]))
    v.append(np.nanmean(v_wind_arr[i]))
    p.append(np.nanmean(pr_water_arr[i]))

gfs_rgn=pd.DataFrame({'day': gfs_day,'temp' : t, 'spec_hum' : s, 'rel_hum' : r, 'u_wind' : u, 'v_wind' : v, 'pr_water' : p, 'gfs_key': gfs_key })
gfs_rgn=gfs_rgn.sort_values('day')
gfs_rgn['wind']=np.sqrt(np.multiply(gfs_rgn['u_wind'],gfs_rgn['u_wind'])+np.multiply(gfs_rgn['v_wind'],gfs_rgn['v_wind']))
gfs_rgn['log_wind']=np.log(gfs_rgn['wind'])
gfs_rgn=gfs_rgn.reset_index()

fig10 = plt.figure(figsize=(20, 10))
fig10.suptitle("Average values for the region: temperature, specific_humidity, relative_humidity, wind_speed, precipitable_water (all left axis), NO2 in troposphere on right axis")
ax1 = fig10.add_subplot(321)
ax1.plot(gfs_rgn.iloc[:,1], gfs_rgn.iloc[:,2], label='average temperature', color='b')
ax12 = ax1.twinx()
ax12.plot(ef_rgn.iloc[:,6], ef_rgn.iloc[:,3], label='average '+ef_rgn.columns[3] , color='r') # aai_rgn.iloc[:,5] : value 3 gives only positive values of aai
ax1.legend() ; ax12.legend()
ax2 = fig10.add_subplot(322)
ax2.plot(gfs_rgn.iloc[:,1], gfs_rgn.iloc[:,3], label='average specific_humidity', color='b')
ax22 = ax2.twinx()
ax22.plot(ef_rgn.iloc[:,6], ef_rgn.iloc[:,3], label='average '+ef_rgn.columns[3] , color='r')
ax2.legend() ; ax22.legend()
ax3 = fig10.add_subplot(323)
ax3.plot(gfs_rgn.iloc[:,1], gfs_rgn.iloc[:,4], label='average relative_humidity', color='b')
ax32 = ax3.twinx()
ax32.plot(ef_rgn.iloc[:,6], ef_rgn.iloc[:,3], label='average '+ef_rgn.columns[3] , color='r')
ax3.legend() ; ax32.legend()
ax4 = fig10.add_subplot(324)
ax4.plot(gfs_rgn.iloc[:,1], gfs_rgn.iloc[:,9], label='average wind', color='b')
ax42 = ax4.twinx()
ax42.plot(ef_rgn.iloc[:,6], ef_rgn.iloc[:,3], label='average '+ef_rgn.columns[3] , color='r')
ax4.legend() ; ax42.legend()
ax5 = fig10.add_subplot(325)
ax5.plot(gfs_rgn.iloc[:,1], gfs_rgn.iloc[:,7], label='average precipitable_water', color='b')
ax52 = ax5.twinx()
ax52.plot(ef_rgn.iloc[:,6], ef_rgn.iloc[:,3], label='average '+ef_rgn.columns[3] , color='r')
ax5.legend() ; ax52.legend();
# ax6 = fig10.add_subplot(326)
# ax6.plot(gfs_rgn.iloc[:,1], gfs_rgn.iloc[:,7], label='average precipitable_water', color='b')
# ax62 = ax6.twinx()
# ax62.plot(ef_rgn.iloc[:,6], ef_rgn.iloc[:,3], label='average '+ef_rgn.columns[3] , color='r')
# ax6.legend() ; ax62.legend()

In [None]:
# climate values in proximity of all plants with all locations in location mask - proximity is +/- n points from location of plant

temp=[]
spec_hum=[]
rel_hum=[]
u_wind=[]
v_wind=[]
pr_water=[]
for j in range(0,len(gray)):
    idx_lt=gray.iloc[j,3]
    idx_lg=gray.iloc[j,4]
    
    temp_j=[] ; rel_hum_j=[] ; spec_hum_j=[] ; u_wind_j=[] ; v_wind_j=[] ; pr_water_j=[]
    for i in range(0,len(temp_arr)):
        temp_j.append(np.nanmean(temp_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n])) # calculate average of temp for location of plant
        spec_hum_j.append(np.nanmean(spec_hum_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n]))
        rel_hum_j.append(np.nanmean(rel_hum_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n]))
        u_wind_j.append(np.nanmean(u_wind_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n]))
        v_wind_j.append(np.nanmean(v_wind_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n]))
        pr_water_j.append(np.nanmean(pr_water_arr[i][idx_lt-n:idx_lt+n,idx_lg-n:idx_lg+n]))
    temp.append(temp_j)
    spec_hum.append(spec_hum_j)
    rel_hum.append(rel_hum_j)
    u_wind.append(u_wind_j)
    v_wind.append(v_wind_j)
    pr_water.append(pr_water_j)
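
The window slice `idx_lt-n:idx_lt+n` used above silently becomes empty when a plant lies closer than `n` cells to the top or left edge of the raster, because a negative start index wraps around. The sketch below shows one way to guard against that. It is an illustration under stated assumptions, not the notebook's own method: `window_mean` is a hypothetical helper, and it keeps the same half-open window convention as the code above.

```python
import numpy as np


def window_mean(arr, row, col, n):
    """NaN-ignoring mean of arr[row-n:row+n, col-n:col+n], clipped to the array bounds.

    Returns NaN when the clipped window is empty or contains only NaN values.
    """
    r0, r1 = max(row - n, 0), min(row + n, arr.shape[0])
    c0, c1 = max(col - n, 0), min(col + n, arr.shape[1])
    window = arr[r0:r1, c0:c1]
    if window.size == 0 or np.all(np.isnan(window)):
        return float("nan")
    return float(np.nanmean(window))


# toy 5x5 raster with one missing value
grid = np.arange(25, dtype=float).reshape(5, 5)
grid[2, 2] = np.nan

interior = window_mean(grid, 2, 2, 1)  # window fully inside the raster
corner = window_mean(grid, 0, 0, 2)    # window clipped at the top-left corner
print(interior, corner)
```

Each `np.nanmean(...)` call in the loops above could then be replaced by `window_mean(temp_arr[i], idx_lt, idx_lg, n)` and its counterparts for the other variables.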
    

In [None]:
gray=gray.iloc[:,:7]

# weight of each powerplant as input to the emission model is the same. The model will calculate the relative weights of each plant.
gray.loc[:,'EF_wght']=1
#gray.loc[:,'EF_wght']=gray_location.hist2_EF_MWh_day/gray_location.hist2_EF_MWh_day.sum()


# aggregation of climate data per plant location into one dataframe, addition of aai data per plant location 
# only use data for the dates that coincide for aai-data and for climate data

ww=pd.DataFrame({'key_date':gfs_key})

XX=pd.DataFrame({})
for j in range(0,len(gray)):
    #ww[gray.iloc[j,5]]=temp[j]  #add average of aai for location of plant to dataframe with column name from df gray.plant
    ww['temp']=temp[j]
#    ww['temp_2d']=ww['temp']+ww['temp'].shift(1)
#    ww['temp_3d']=ww['temp']+ww['temp'].shift(1)+ww['temp'].shift(2)
#    ww['temp_5d']=ww['temp']+ww['temp'].shift(1)+ww['temp'].shift(2)+ww['temp'].shift(3)+ww['temp'].shift(4)   
    ww['spec_hum']=spec_hum[j]
    ww['rel_hum']=rel_hum[j]
    ww['u_wind']=u_wind[j]
    ww['v_wind']=v_wind[j]
    ww['pr_water']=pr_water[j]
    ww['wind']=np.sqrt(np.multiply(ww['u_wind'],ww['u_wind'])+np.multiply(ww['v_wind'],ww['v_wind']))
#    ww['wind_2d']=ww['wind']+ww['wind'].shift(1)
#    ww=ww.drop(columns=['u_wind','v_wind'])
#    ww['wind_^3']=np.multiply(np.multiply(ww['wind'],ww['wind']),ww['wind'])
#    ww['wind_3d']=ww['wind']+ww['wind'].shift(-1)+ww['wind'].shift(-2)
#    ww['pr_water_2d']=ww['pr_water']+ww['pr_water'].shift(1)
#    ww['pr_water^3']=np.multiply(np.multiply(ww['pr_water'],ww['pr_water']),ww['pr_water'])
#    ww['pr_water_3d']=ww['pr_water']+ww['pr_water'].shift(1)+ww['pr_water'].shift(2)
    
    x=ww.groupby(by='key_date').agg(['max','min','mean','std'])
    
    X=pd.merge(clf.loc[:,['key_date',gray.iloc[j,5]]], x, how='inner', on='key_date')  #addition of cloud fraction
    X=X.rename(columns = {gray.iloc[j,5]:'clf'})
    
#     c=gray.iloc[j,5]   #'EF_'+gray.iloc[j,5]
#     X[c]=np.ones((len(X)))*gray.iloc[j,7] # addition of EF_wght for each plant to the dataframe
# #    X['EF_distribution']=np.ones((len(X)))*gray.iloc[j,7] # addition of EF_wght for each plant to the dataframe


#   XX=pd.concat([XX,X], axis=0, sort=False) # aggregation of dataframe per plantlocation into one dataframe

    X=pd.merge(tro.loc[:,['key_date',gray.iloc[j,5]]], X, how='inner', on='key_date')  #addition of no2 troposphere
    X=X.rename(columns = {gray.iloc[j,5]:'trop'})
    
    c=gray.iloc[j,5]   #'EF_'+gray.iloc[j,5]
    X[c]=np.ones((len(X)))*gray.iloc[j,7] # addition of EF_wght for each plant to the dataframe
#    X['EF_distribution']=np.ones((len(X)))*gray.iloc[j,7] # addition of EF_wght for each plant to the dataframe
    
  
    XX=pd.concat([XX,X], axis=0, sort=False) # aggregation of dataframe per plantlocation into one dataframe
    
        
XX=XX.fillna(0) 
XX=XX.reset_index()

print('DataFrame with local features from climate factors and presence of specific power plants one-hot encoded' )
XX    


# 7. A prediction model for emission with local emission and weather data  

The general model for the actual emission is proposed as:

E = EF × A × (1 − R)

E = actual emission (for this calculation, NO2 in the troposphere is taken as input)  
EF = emission factor (production capacity and type of power plant)  
A = activity factor (fraction of the day that the power plant is active, or fraction of capacity at which the power plant is running)  
R = external factors that reduce the actual emission (climate variables)  

In the model, the factor R is expanded to 5 climate variables (temperature, relative and specific humidity, wind speed and precipitable water) that may have an effect on the actual emissions.  

Both groups of factors (emission factors and reduction factors) are used as features (X) of the model in order to predict the measured emission (y), which serves as a reference for the total emission.   

The feature importance produced by the model indicates how important the reduction factors are, relative to the emission factors, for predicting the total emission.  
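
To make the proposed relation concrete, the sketch below evaluates E = EF × A × (1 − R) for toy numbers. It assumes that A and R are both expressed as fractions between 0 and 1, which the text suggests but does not state explicitly. The function name and the input values are hypothetical.

```python
def actual_emission(ef, activity, reduction):
    """Evaluate E = EF * A * (1 - R).

    ef        : emission factor (arbitrary units, toy value here)
    activity  : activity factor A, assumed to be a fraction in [0, 1]
    reduction : reduction factor R, assumed to be a fraction in [0, 1]
    """
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must lie in [0, 1]")
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must lie in [0, 1]")
    return ef * activity * (1.0 - reduction)


# same plant and activity, without and with a climate-related reduction
e_no_reduction = actual_emission(ef=100.0, activity=0.5, reduction=0.0)
e_with_reduction = actual_emission(ef=100.0, activity=0.5, reduction=0.25)
print(e_no_reduction, e_with_reduction)
```

In the notebook itself R is not computed explicitly. The climate variables enter the regression as features, and the model is left to learn their combined effect.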


When two features are highly correlated (correlation coefficient above 0.90), one is kept and the other is removed, so that the remaining features are approximately independent.
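
A compact sketch of this pruning step is given here, assuming the features sit in a numeric pandas DataFrame. It keeps the first feature of each correlated pair, like a greedy pairwise loop, but differs in one respect that is a deliberate variation and not the notebook's behaviour: it compares the absolute correlation, so strongly negative pairs are pruned as well. The function name and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def correlated_features_to_drop(frame, threshold=0.90):
    """Greedily list features whose |correlation| with an earlier kept feature exceeds threshold."""
    corr = frame.corr().abs()
    to_remove = []
    for i, feat_a in enumerate(corr.columns):
        if feat_a in to_remove:
            continue  # a removed feature no longer eliminates others
        for feat_b in corr.columns[i + 1:]:
            if feat_b not in to_remove and corr.loc[feat_a, feat_b] > threshold:
                to_remove.append(feat_b)
    return to_remove


# toy data: one near-duplicate, one mirrored feature, one independent feature
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "temp_mean": base,
    "temp_max": base + 0.01 * rng.normal(size=200),
    "neg_temp": -base,
    "wind_mean": rng.normal(size=200),
})

dropped = correlated_features_to_drop(demo)
print(dropped)
```

Computing the correlation matrix once avoids calling `np.corrcoef` for every pair, which matters when the number of features grows.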


In [None]:
y=XX['trop']

X=XX.drop(columns=['index','key_date','trop'])

# removing features with high correlation
features=X.columns

counter = 0
to_remove = []
for feat_a in features:
    for feat_b in features:
        if feat_a != feat_b and feat_a not in to_remove and feat_b not in to_remove:
            c = np.corrcoef(X[feat_a], X[feat_b])[0][1]
            if c > 0.90:  #c > 0.995
                counter += 1
                to_remove.append(feat_b)
                print('{}: FEAT_A: {} FEAT_B: {} - Correlation: {}'.format(counter, feat_a, feat_b, c))
                
print('Features that are removed : ', to_remove)

XT=X.drop(columns=to_remove)
feature_imp=pd.DataFrame({'feature': XT.columns})

XT.head()

In [None]:
#effect of scaling: no visible difference in y_test vs y_pred, MSE is approx 0.8, not much difference in features

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X=scaler.fit_transform(XT)

y=scaler.fit_transform(np.ravel(y).reshape(-1, 1))

feature_imp=pd.DataFrame({'feature': XT.columns})

#print(XT.shape,y.shape,X.shape)


In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import xgboost as xgb

max_depth = 4
min_child_weight = 5
subsample = 0.7
colsample_bytree = 0.7
objective = "reg:squarederror" #'reg:linear',#"reg:squarederror"
num_estimators = 500 #2000 #1000 #3000  #200
learning_rate =  0.1 #0.01  #0.05 #0.003 # 0.3

xgb_reg = xgb.XGBRegressor(max_depth=max_depth,
            min_child_weight=min_child_weight,
            subsample=subsample,
            colsample_bytree=colsample_bytree,
            objective=objective,
            n_estimators=num_estimators,
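            learning_rate=learning_rate)
# early_stopping_rounds and num_boost_round are left out: early stopping needs a validation set (eval_set),
# which fit() below does not receive, and the number of boosting rounds is already set by n_estimators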

kf = KFold(n_splits=4, random_state=42, shuffle=True) # n_splits was 5


i=0
testscore=[]
#feature_imp=pd.DataFrame({'feature': X.columns})
for train_index, test_index in kf.split(X, y):
#for train_index, test_index in gkf.split(X, y, groups):
#    X_train, X_test = X.iloc[train_index,:], X.iloc[test_index,:] # for unscaled dataframe
    X_train, X_test = X[train_index,:], X[test_index,:] # for scaled version
    y_train, y_test = y[train_index], y[test_index]
                      
    xgb_reg.fit(X_train, np.ravel(y_train)) 

    y_pred = xgb_reg.predict(X_test)
    test_score1 = mean_squared_error(y_test, y_pred)  
    
    testscore.append(test_score1)
    i=i+1
    feature_imp['importance'+str(i)]=xgb_reg.feature_importances_

# columns 1..i hold the importances of all i folds (the slice end is exclusive, hence i+1)
feature_imp['mean']=feature_imp.iloc[:,1:i+1].mean(axis=1)
feature_imp['std']=feature_imp.iloc[:,1:i+1].std(axis=1)    
print('mean_squared_error on test_set:', testscore, np.mean(testscore))

In [None]:
fig12 = plt.figure(figsize=(20, 5))
fig12.suptitle("visual inspection of prediction of emission (y_pred) vs measured value of emission (y_test)")
ax1 = fig12.add_subplot(111)
ax1.plot(range(0,len(y_test)), y_test, label='y_test', color='b')
#ax12 = ax1.twinx()
ax1.plot(range(0,len(y_test)), y_pred, label='y_pred', color='r') # y_test and y_pred are those of the last cross-validation fold
ax1.legend(); #; ax12.legend()

In [None]:
feature_imp=feature_imp.sort_values('mean', ascending=False)
#feature_imp

In [None]:
feature_imp=feature_imp.sort_values('mean', ascending=True)

plt.figure(figsize=(16, 12))
plt.title("Feature importances in emission model of power plants in Puerto Rico")
plt.barh(range(X.shape[1]), feature_imp['mean'],
       color="r", xerr=feature_imp['std'], align="center")
# If you want to define your own labels,
# change indices to a list of labels on the following line.
plt.yticks(range(X.shape[1]), feature_imp['feature'])
plt.ylim([-1, X.shape[1]])
plt.show()

**Interpretation of feature importance**

The feature importance indicates how strongly each feature (climate factors and production factors) contributes to the model's prediction of the measured emission.

The feature importance of the production factors can be interpreted as the contribution of electricity production to the total emission at the specific location. 

The feature importance of the climate factors can be interpreted as the weight of the reduction factors in the modelling of emissions. 

The feature importances show that a large part of the variability in the predicted emission is explained by the emission factors. Climate factors are not dominant in this model.
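
One way to quantify the comparison between production and climate factors is to sum the mean importances per group. The sketch below does this for a toy importance table. The feature names, the importance values and the set `plant_features` are hypothetical and are not results from the notebook.

```python
import pandas as pd

# toy importance table in the same layout as feature_imp (feature name + mean importance)
feature_imp_demo = pd.DataFrame({
    "feature": ["cluster_A", "cluster_B", "temp_mean", "wind_mean", "clf"],
    "mean": [0.40, 0.25, 0.15, 0.12, 0.08],
})

# hypothetical names of the plant-indicator (production) columns
plant_features = {"cluster_A", "cluster_B"}

# label each feature as production or climate, then sum the importances per group
is_plant = feature_imp_demo["feature"].isin(plant_features)
group = is_plant.map({True: "production", False: "climate"})
shares = feature_imp_demo.groupby(group)["mean"].sum() / feature_imp_demo["mean"].sum()
print(shares)
```

In the notebook, the production group would consist of the columns named after `gray.cluster`, and everything else would count as a climate feature.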

We use the feature importances of the model to calculate emission factors for the locations: 



In [None]:
gray=gray.iloc[:,:7]

prod_features=feature_imp.rename(columns= {'feature':'cluster'})

# from k-fold
gray=pd.merge(gray, prod_features.loc[:,['cluster','mean']], how='left', on='cluster')
gray=gray.rename(columns= {'mean':'EF_contrib'}) # emission contribution as calculated from the feature importances of the model
gray=gray.fillna(0)
#gray

In [None]:
#print('Contribution of production factors in the model to the measured total emissions (%) :', gray.EF_contrib.sum()*100)

gray=gray.sort_values('EF_contrib', ascending=False)

gray=gray.drop(columns=['img_idx_lt','img_idx_lg'])

# # calculation of maximum daily production (MWh) based on max. capacity of power plants
# gray['EF_max_MWh_day']=(gray['capacity_mw']*24).astype(int)

# daily energy production for the region (MWh) distributed to plants according to emission distribution from the model
gray['hist_prod_day']=(gray['EF_contrib']*Prod_day/gray.EF_contrib.sum()).astype(int)

# activity factor calculated from model emission distribution and maximum daily production
gray['Activity_%']=(gray['hist_prod_day']*100/gray['Max_Prod_clst']).clip(upper=100).astype(int)

# multiplier: if activity > 100 (%), the emissions cannot be explained by power production alone, because the plant would be running beyond its maximum capacity.
# the multiplier can account for these additional emissions, which are attributable to the choice of primary_fuel and the technology generation (age of plant).
gray['multiplier']=(gray['hist_prod_day']/gray['Max_Prod_clst']).astype(int)

gray['hist_prod_mdl1']=gray_cluster['hist_prod_day']
gray['activity_mdl1']=gray_cluster['Activity_%']
#gray_plant['multiplier_mdl1']=gray_cluster['multiplier']
gray['multiplier_mdl1']=gray_cluster['multiplier']

gray

The table above is based on the weight of the feature importances. 

On the right-hand side of the table, three columns are added with the values from the earlier model, which was based only on local emissions. 

Comparing the columns shows that the values differ, but that a multiplier is calculated for the same locations as in the first model. A second step is still necessary to take the capacity constraints of the power plants into account. The model with climate features does not lead to a better distribution of emissions over the power plants. The observation is therefore that the model with climate factors adds little to the simpler model that uses location only. For this reason, the advice is to use the simpler model.
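
The allocation behind the table can be illustrated with toy numbers. The sketch below follows the same three steps as the cell above: distribute the regional daily production in proportion to the contributions, cap the activity at 100 %, and take the whole-number ratio to maximum production as the multiplier. The function name and all input values are hypothetical.

```python
import pandas as pd


def allocate_production(contrib, max_prod, regional_prod_day):
    """Split regional daily production over plants in proportion to their contribution."""
    hist_prod = contrib * regional_prod_day / contrib.sum()
    ratio = hist_prod / max_prod
    return pd.DataFrame({
        "hist_prod_day": hist_prod,
        "Activity_%": (ratio * 100).clip(upper=100),  # activity is capped at full capacity
        "multiplier": ratio.astype(int),              # whole multiples of maximum production
    })


# toy inputs: three locations, regional production of 4000 MWh per day
contrib = pd.Series([0.5, 0.25, 0.25], index=["A", "B", "C"])
max_prod = pd.Series([800.0, 2000.0, 1250.0], index=["A", "B", "C"])

table = allocate_production(contrib, max_prod, regional_prod_day=4000.0)
print(table)
```

Location A receives more production than its capacity allows, so its activity is capped and the excess shows up in the multiplier. This is the situation that the second, capacity-aware step mentioned above has to resolve.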
