<h2>Remote sensing and Emissions Factors</h2>
This notebook shows the work I have done with the S5P and GFS data (I haven't used GLDAS data yet). It also draws on other notebooks I made:
- gppd with additional information (monthly generation and emissions estimates per Power Plant, and the corresponding estimate of the global Emissions Factor for 2018) for the second half of 2018 (first part of the rolling period). See https://www.kaggle.com/ajulian/eia-923-input-nox-emissions-and-ef-reference
- an estimate of the annual emissions from mobile fuel combustion. See https://www.kaggle.com/ajulian/activities-and-ghg-precursor-gases

<h3>Some details about "plot_ee_data_on_map"</h3>
First I'll describe the function plot_ee_data_on_map, which is extensively used in the notebook.

<h4>Main parameters</h4>
"products" is a list of S5P products; the accepted values are:
- "NO2" (default): the column will be "tropospheric_NO2_column_number_density". EE provides S5P NO2 images since 2018-06-28
- "CO": the column will be "CO_column_number_density". EE provides S5P CO images since 2018-06-28
- "SO2": the column will be "SO2_column_number_density". EE provides S5P SO2 images since 2018-12-05 (more than five months later than NO2 and CO)
- additionally, for "NO2" and "SO2" the band "cloud_fraction" can be displayed; it is hidden by default
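
The naming rule in the list above can be written as a small helper. This is only a sketch that mirrors the string logic used later in `plot_ee_data_on_map`; the function name `s5p_dataset_and_band` is hypothetical, and it only builds strings without contacting Earth Engine.

```python
def s5p_dataset_and_band(product="NO2"):
    """Return (EE dataset id, density band name) for an S5P product.

    Hypothetical helper: it builds the names only, no EE call is made.
    """
    if product not in ("NO2", "CO", "SO2"):
        raise ValueError("Unsupported S5P product: " + product)
    dataset = "COPERNICUS/S5P/OFFL/L3_" + product
    # Only the NO2 band carries the "tropospheric_" prefix.
    prefix = "tropospheric_" if product == "NO2" else ""
    band = prefix + product + "_column_number_density"
    return dataset, band


for p in ("NO2", "CO", "SO2"):
    print(p, s5p_dataset_and_band(p))
```

Raising on an unknown product is a choice made for this sketch; the notebook function itself simply skips products it does not recognise.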

"zoom_country" has two possible values:
- True (default): zooms at the country level; shows images of the selected product(s) and wind arrows around Puerto Rico (or around (lat, long) if they are provided)
- False: zooms at the Power Plant level; shows the selected product(s) around the Power Plant identified by PPindex. The direction of the wind arrows is given by the GFS u and v wind bands, and the arrows are shown around the selected location; the same u and v bands are used to rotate the product rectangle so that it faces the emissions plume.
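
To illustrate the rotation mentioned for zoom_country=False, here is a minimal sketch. It assumes the same clockwise rotation convention as `rotate_around_point` in the code cell of this notebook and treats longitude and latitude as plane coordinates; the helper `downwind_axis` and the wind values are hypothetical.

```python
import math


def rotate_around_point(point, radians, origin=(0.0, 0.0)):
    """Rotate a point clockwise around origin (same convention as the notebook)."""
    x, y = point
    ox, oy = origin
    qx = ox + math.cos(radians) * (x - ox) + math.sin(radians) * (y - oy)
    qy = oy - math.sin(radians) * (x - ox) + math.cos(radians) * (y - oy)
    return qx, qy


def downwind_axis(u, v):
    """Direction taken by a north-pointing unit vector after the rotation."""
    alpha = math.atan2(u, v)  # wind vector angle, measured clockwise from north
    return rotate_around_point((0.0, 1.0), alpha)


# Hypothetical wind: 3 m/s towards the east, 4 m/s towards the north.
dx, dy = downwind_axis(3.0, 4.0)
print(round(dx, 3), round(dy, 3))  # 0.6 0.8
```

The result is the unit vector of the wind itself, which is why a rectangle that initially extends north of the Power Plant ends up extending downwind after the rotation.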

lat, long: 
- if zoom_country is True and (lat, long) are not provided, by default they are set to the center of Puerto Rico
- if zoom_country is False, (lat, long) are obtained through PPindex 

PPindex is an integer in [0, 7] identifying one of the eight largest Power Plants

"fit2country" is a flag (False by default) that clips the images to the country shape. It makes the images look nice, but the wind pushes many emissions from the North out to sea, and with the flag set they are not shown, so I rarely use it.

"gppd_gdf" is the originally provided Power Plants dataset, extended with extra information:
- real power generation and estimated NO2 emissions at the month and Power Plant levels
   (I worked this in another notebook: https://www.kaggle.com/ajulian/eia-923-input-nox-emissions-and-ef-reference)
- geojson info, taken from https://www.kaggle.com/maxlenormand/saving-the-power-plants-csv-to-geojson

<h4>Other displayable data: Temperature and wind</h4>
- Temperature: the GFS band "temperature_2m_above_ground" is displayed as image, but hidden by default
- Wind: the GFS "u" and "v" wind bands are displayed as arrows, and visible by default 

<h4>Visualization</h4>
Regarding color visualization: min (green) and max (red) are computed from the image being displayed (which is not as easy in EE as in numpy); the approach is taken from https://www.kaggle.com/ianakoto/emmision-factor
This means a "red" at the Power Plant level may be an "orange" or less when displayed at the country level. Since this can be confusing, the min and max values are also printed (a colormap bar would be even better).
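
A small sketch of why the same value changes color between zoom levels. It assumes a linear stretch between min and max, with values outside the range clamped to the palette ends; the function `palette_position` and all numbers are hypothetical.

```python
def palette_position(value, vmin, vmax):
    """Linear position of value between vmin (0.0, green) and vmax (1.0, red)."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    # Clamp so values outside the stretch saturate at the palette ends.
    return min(1.0, max(0.0, (value - vmin) / (vmax - vmin)))


value = 40e-6  # mol/m^2, hypothetical pixel value
plant_level = palette_position(value, vmin=10e-6, vmax=40e-6)
country_level = palette_position(value, vmin=5e-6, vmax=75e-6)
print(plant_level, country_level)
```

With these numbers the pixel sits at the red end of the Power Plant map but only around the middle of the palette on the country map, because the country image has a higher maximum.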

Remember: the icon at the top right of each map lets you switch each layer on and off.

In [None]:
import numpy as np # linear algebra
import math
import folium
import matplotlib.pyplot as plt
import mplleaflet # matplotlib to leaflet

# Connect to Earth Engine
import ee
from kaggle_secrets import UserSecretsClient
from google.oauth2.credentials import Credentials

# Trigger the authentication flow.
#ee.Authenticate()

# Retrieve your refresh token.
#!cat ~/.config/earthengine/credentials

user_secret = "AJR_EIE_test" # Your user secret, defined in the add-on menu of the notebook editor
refresh_token = UserSecretsClient().get_secret(user_secret)
credentials = Credentials(
        None,
        refresh_token=refresh_token,
        token_uri=ee.oauth.TOKEN_URI,
        client_id=ee.oauth.CLIENT_ID,
        client_secret=ee.oauth.CLIENT_SECRET,
        scopes=ee.oauth.SCOPES)

# Initialize GEE
ee.Initialize(credentials=credentials)
s5p_NOx_clean = None
s5p_NOx_clean2 = None
gppd_gdf = None

In [None]:
def add_ee_layer(self, ee_image_object, vis_params, name, show=True):
    map_id_dict = ee.Image(ee_image_object).getMapId(vis_params)
    folium.raster_layers.TileLayer(
        tiles = map_id_dict['tile_fetcher'].url_format,
        attr = "Map Data © Google Earth Engine",
        name = name,
        overlay = True,
        control = True,
        show = show
    ).add_to(self)

def rotate_around_point(point, radians, origin=(0, 0)):
    """Rotate a point around a given point."""
    x, y = point
    ox, oy = origin

    qx = ox + math.cos(radians) * (x - ox) + math.sin(radians) * (y - oy)
    qy = oy + -math.sin(radians) * (x - ox) + math.cos(radians) * (y - oy)

    return qx, qy

#def rect_rotated(long, lat, side, u_pp, v_pp):
def rect_rotated(long, lat, rectangle, u_pp, v_pp):
    # clockwise
    #long1 = long - side/2; lat1 = lat  # left bottom
    #long2 = long + side/2; lat2 = lat + side # right top
    # one getInfo() call (one server round trip) instead of four
    coords = rectangle.bounds().getInfo()['coordinates'][0]
    long1, lat1 = coords[0]
    long2 = coords[1][0]
    lat2 = coords[2][1]

    alpha = np.arctan2(u_pp, v_pp)
    
    nlong1, nlat1 = rotate_around_point((long1, lat1), alpha, origin=(long, lat))
    nlong2, nlat2 = rotate_around_point((long1, lat2), alpha, origin=(long, lat))
    nlong3, nlat3 = rotate_around_point((long2, lat2), alpha, origin=(long, lat))
    nlong4, nlat4 = rotate_around_point((long2, lat1), alpha, origin=(long, lat))
    
    rect = ee.Geometry.Polygon([[nlong1, nlat1], [nlong2, nlat2], [nlong3, nlat3], [nlong4, nlat4]])
    return rect

def add_gfs_wind_arrows_layer(gfs_image_col, band_u, band_v, begin_date, end_date, 
                              lat, long, zoom_country, rectangle, showMode):
    wind_speed_scale_factor = 50.0
    # dates are transformed to "previous" hours before begin_date
    previous = 6 # hours
    begin_date = ee.Date(begin_date).advance(-previous, "hour")
    end_date = ee.Date(end_date).advance(-previous, "hour")
    #print(begin_date.format().getInfo(), end_date.format().getInfo())
    image_uv = (gfs_image_col
        .filterDate(begin_date, end_date)
        .first())
    
    img = image_uv.addBands(ee.Image.pixelLonLat()) # generates bands "latitude" and "longitude"

    if zoom_country:
        # at the country level, rectangle is given
        scale = 10000
        imgList = img.reduceRegion(reducer=ee.Reducer.toList(),\
                                        geometry=rectangle,\
                                        maxPixels=1e13,\
                                        scale=scale); # WARNING: scale=1000 blocks the Puerto Rico map
        # TODO compute for Puerto Rico
        u_pp_mean = 0
        v_pp_mean = 0
    else:
        # at the PP level, rectangle is computed here:
        # - PP is at the middle of a side
        # - rotation is averaged from few (u, v) "pixels"
        # - rotated "rectangle" (in fact is a polygon) returns for the emissions image
        scale = 8000
        wind_side = 0.2 # degrees; GFS resolution is 0.25 degrees (dataset NOAA/GFS0P25)
        lat1 = lat-wind_side/2; long1 = long-wind_side/2
        lat2 = lat+wind_side/2; long2 = long+wind_side/2
        wind_rect = ee.Geometry.Rectangle([long1, lat1, long2, lat2])
        img_min_uvList = image_uv.reduceRegion(reducer=ee.Reducer.toList(),\
                                        geometry=wind_rect,\
                                        scale=scale);
        u_pp = img_min_uvList.get(band_u).getInfo()
        v_pp = img_min_uvList.get(band_v).getInfo()
        # print("The wind is", u_pp, v_pp)
        imgList = img.reduceRegion(reducer=ee.Reducer.toList(),\
                                geometry=wind_rect,\
                                maxPixels=1e13,\
                                scale=scale);
        
        u_pp_mean = np.mean(u_pp); v_pp_mean = np.mean(v_pp) # useful
        rectangle = rect_rotated(long, lat, rectangle, u_pp_mean, v_pp_mean)
       
    y = imgList.get("latitude").getInfo() # list
    x = imgList.get("longitude").getInfo()
    u_orig = np.array((ee.Array(imgList.get(band_u)).getInfo()))
    v_orig = np.array((ee.Array(imgList.get(band_v)).getInfo()))
    u = u_orig / wind_speed_scale_factor; v = v_orig / wind_speed_scale_factor

    x_mesh, y_mesh = np.meshgrid(x, y, sparse=True)
    # print("x after", x_mesh); print("y after", y_mesh)
    U = u.T
    V = v.T
    # print("U", U); print("V", V)
    fig, ax = plt.subplots()
    kw = dict(color='black', alpha=0.8, scale=1)
    q = ax.quiver(x_mesh, y_mesh, U, V, **kw)
    # fig has no data before plotted (ax.quiver) in matplotlib
    gj = mplleaflet.fig_to_geojson(fig=fig)

    # feature group allows to have all wind arrows as a layer
    feature_group = folium.map.FeatureGroup(name="Wind arrows")
    for feature in gj['features']:
        if feature['geometry']['type'] == 'Point':
            x_long, y_lat = feature['geometry']['coordinates']
            div = feature['properties']['html']

            icon_anchor = (feature['properties']['anchor_x'],
                           feature['properties']['anchor_y'])

            icon = folium.features.DivIcon(div, icon_anchor=icon_anchor)
            marker = folium.Marker(location=(y_lat, x_long), icon=icon)
            feature_group.add_child(marker)
            # folium.Marker(location=(y_lat, x_long), icon=icon).add_to(Map)
        else:
            msg = "Unexpected geometry {}".format
            raise ValueError(msg(feature['geometry']))
            
    return feature_group, rectangle, u_pp_mean, v_pp_mean

def add_gfs_layers(Map, begin_date, end_date, lat, long, zoom_country, rectangle, showMode):
    dataset = "NOAA/GFS0P25"
    band_temp = 'temperature_2m_above_ground'
    band_u = "u_component_of_wind_10m_above_ground"
    band_v = "v_component_of_wind_10m_above_ground"
    gfs_image_col = (ee.ImageCollection(dataset)
        .select(band_temp, band_u, band_v)
        # in GFS, every 6 hours in a day (00, 06, 12, 18) 384 files are generated; 
        # F000 is the first, contains no forecasting but real-time measures 
        .filterMetadata("system:index", "contains", "F000")
      )

    feature_group1, rectangle, u_pp_mean, v_pp_mean = \
        add_gfs_wind_arrows_layer(
            gfs_image_col, band_u, band_v,
            begin_date, end_date, lat, long, zoom_country, rectangle, showMode)
        
    if showMode==True:    
        feature_group1.add_to(Map)

        vis_temp_params = {
          'min': -10,
          'max': 40,
          'opacity': 0.5,
          'palette': ['blue', 'purple', 'cyan', 'green', 'yellow', 'red']}

        gfs_image = gfs_image_col.filterDate(begin_date, end_date).first().clip(rectangle).select(band_temp)
        Map.add_ee_layer(gfs_image, vis_temp_params, name="Temp 2m", show=False)
    return rectangle, u_pp_mean, v_pp_mean

# Products is a list; the products admitted are:
# - "NO2" (default): the column will be "tropospheric_NO2_column_number_density"
# - "SO2": the column will be "SO2_column_number_density"
# - "CO": the column will be "CO_column_number_density"
# - additionally, for "NO2" and "SO2" the band "cloud_fraction" is (can be) displayed, but hidden by default
# 
# zoom_country has two possible values:
# - True (default): shows the product(s) selected and wind arrows around Puerto Rico
# - False: shows the product(s) selected around a Power Plant identified by PPidx.
#          the image rectangle follows the wind direction given by the u, v GFS bands
#
# lat, long: if zoom_country is True and (lat, long) are not provided, by default they are set to the center of Puerto Rico
#            if zoom_country is False, (lat, long) are obtained through PPindex 
#
# PPindex  is a number [0, 7] identifying one of the eight largest Power Plants
#
# gppd_gdf is the Power Plants dataset originally provided with extra information:
# - real power generation and estimated NO2 emissions at the month and Power Plant levels
#   (I worked this in another notebook)
# - geojson info, taken from https://www.kaggle.com/maxlenormand/saving-the-power-plants-csv-to-geojson
#
# modified from https://www.kaggle.com/paultimothymooney/how-to-get-started-with-the-earth-engine-data
def plot_ee_data_on_map(begin_date, end_date, products=["NO2"], 
                        zoom_country=True, lat=18.232527, long=-66.257565, 
                        PPindex=0, fit2country=False, opacity=1.0, height=500, side=0.2, showMode=True):

    if zoom_country: # zoom at the country level; if (lat, long) are not provided, gets those from Puerto Rico
        zoom_start = 9
        half_x = 1.0667494 # from original kaggle bounding box
        half_y = 0.3323767 # from original kaggle bounding box
        lat1 = lat - 1.5*half_y; long1 = long - 1.5*half_x # more at the West, due to San Juan, and South
        lat2 = lat + 1.5*half_y; long2 = long + half_x # more at the North, due to San Juan 
    else: # zoom at a Power Plant level
        print("Zoom on", gppd_gdf.iloc[PPindex]["name"])
        lat = gppd_gdf.iloc[PPindex]["latitude"]
        long = gppd_gdf.iloc[PPindex]["longitude"]
        if showMode==True:
            search_cols = gppd_gdf.columns
            search_cols_gen = [search_cols[i] for i in range(len(search_cols)) \
                               if search_cols[i][:3]=="Net"][6:] # second semester
            search_cols_emi = [search_cols[i] for i in range(len(search_cols)) \
                               if search_cols[i][:3]=="Emi"][6:-1] # second semester

            col_slice_gen = gppd_gdf.iloc[PPindex][search_cols_gen]
            col_slice_emi = gppd_gdf.iloc[PPindex][search_cols_emi]
            #print(col_slice_gen, col_slice_emi)
            print("The max gen value for", gppd_gdf.iloc[PPindex]["name"], "was", max(col_slice_gen), 
                  "MWh and happened in", search_cols_gen[col_slice_gen.values.argmax()].split()[1], "2018")
            print("The max emissions value for", gppd_gdf.iloc[PPindex]["name"], "was", round(max(col_slice_emi)), 
                  "ton and happened in", search_cols_emi[col_slice_emi.values.argmax()].split()[2], "2018")
        zoom_start = 10
        long1 = long - side/2; lat1 = lat  # left bottom
        long2 = long + side/2; lat2 = lat + side # right top


    Map = folium.Map(location=[lat, long], zoom_start=zoom_start, height=height)
    # Map.add_ee_layer = add_ee_layer # does not work
    folium.Map.add_ee_layer = add_ee_layer
    
    rectangle = ee.Geometry.Rectangle([long1, lat1, long2, lat2]) # for PP level will be rotated
    rectangle, u_pp_mean, v_pp_mean = add_gfs_layers(Map, begin_date, end_date, 
                                                     lat, long, zoom_country, rectangle, showMode)
            
    for product in products:
        if product in ("NO2", "CO", "SO2"):
            region_scale = 3000 # A nominal scale in meters of the projection to work in.
            #Sentinel-5P Nitrogen Dioxide, Carbon Monoxide or Sulfur Dioxide
            dataset = "COPERNICUS/S5P/OFFL/L3_" + product
            # in case product=='NO2', the beginning of the column name is 'tropospheric_', otherwise ''
            column = (product=='NO2')* 'tropospheric_'
            # the end of the column is '_column_number_density' for the three S5p products
            column +=  product + '_column_number_density'
            if product=="CO":
                band_cloud_height = "cloud_height"
            else:
                # for NOx and SO2 there are AAI and cloud bands
                band_aai = "absorbing_aerosol_index" # not used (yet?)
                band_cloud_fraction = "cloud_fraction"
                
                if showMode==True:   
                    if product=="NO2" and s5p_NOx_clean is not None:
                        cloud_image = (s5p_NOx_clean
                           .select(band_cloud_fraction)
                           .filterDate(begin_date, end_date)
                           .mean()
                          )
                    else:
                        cloud_image = (ee.ImageCollection(dataset)
                           .select(band_cloud_fraction)
                           .filterDate(begin_date, end_date)
                           .mean()
                          )

                    vis_cloud_params = {
                          'min': 0, # no clouds, black pixel
                          'max': 1, # cloudy, white pixel
                          'opacity': 1,
                          'palette': ['black', 'white']}

                    Map.add_ee_layer(cloud_image.clip(rectangle), vis_cloud_params, name="Clouds_"+product, show=False)

            if product=="NO2" and s5p_NOx_clean is not None:
                sat_image = (s5p_NOx_clean
                   .select(column)
                   .filterDate(begin_date, end_date)
                   .mean()
                  )
            else:
                sat_image = (ee.ImageCollection(dataset)
                   .select(column)
                   .filterDate(begin_date, end_date)
                   .mean()
                  )
                
            if showMode==True:
                # min and max taken from https://www.kaggle.com/ianakoto/emmision-factor
                min_value = sat_image.reduceRegion(ee.Reducer.min(), rectangle, region_scale).getInfo()[column]   
                max_value = sat_image.reduceRegion(ee.Reducer.max(), rectangle, region_scale).getInfo()[column]
                if type(min_value) is not float or type(max_value) is not float:
                    print("Error: there may be no", product,"images between", begin_date, "and", end_date, "or the quality is bad")
                    return

                print(product + " min is", round(min_value*1e6, 2), "*10^-6 mol/m^2; max is", round(max_value*1e6, 2), "*10^-6 mol/m^2")

                vis_params = {
                  'min': min_value,
                  'max': max_value,
                  'opacity': opacity,
                  'palette': ['green', 'blue', 'yellow', 'red']}

                # fit2country clips ee image to the country borders; only makes sense with zoom at the country level
                if zoom_country and fit2country:
                    countries = ee.FeatureCollection("USDOS/LSIB_SIMPLE/2017")
                    Map.add_ee_layer(sat_image.clip(rectangle).clip(countries), vis_params, product)
                else:
                    Map.add_ee_layer(sat_image.clip(rectangle), vis_params, product)
                # Map.add_child(folium.map.LayerControl())
           
        elif product == "GLDAS":
            dataset = "NASA/GLDAS/V021/NOAH/G025/T3H" # not used (yet?)
    
    if gppd_gdf is not None and showMode==True:
        # taken from https://www.kaggle.com/maxlenormand/saving-the-power-plants-csv-to-geojson
        for plant_lat, plant_long in zip(gppd_gdf.latitude, gppd_gdf.longitude):
            folium.Marker(location = (plant_lat, plant_long), icon = folium.Icon(color='blue')).add_to(Map)
    
    Map.add_child(folium.map.LayerControl()) # needs to be at the end
    
    if showMode==True:
        display(Map)
    else: # computing mode
        return sat_image, rectangle, u_pp_mean, v_pp_mean

<h2>Zoom at the Puerto Rico level</h2>
First we zoom at the country level to see two maps:
- the first shows the East-West wind pattern typical of Puerto Rico
- the second is one month later, on 14-15 September 2017, days before Hurricane Maria crossed Puerto Rico (landfall was on 20 September 2017), when the East-West wind pattern changed completely.

No S5P products are selected, and in fact none can be: Sentinel-5P was launched in October 2017 and EE provides S5P images only since 2018-06-28.

In [None]:
# lat = 40.416775; long = -3.703790 # Madrid
# lat = 25.761681; long=-80.191788 # Miami

begin_date = '2017-08-14'; end_date = '2017-08-15'
plot_ee_data_on_map(begin_date, end_date, products=[])
begin_date = '2017-09-14'; end_date = '2017-09-15'
plot_ee_data_on_map(begin_date, end_date, products=[])

<h2>Power Plant level</h2>
To work at the Power Plant level we must first load the gppd file with the geo info. "gppd_gdf" is the Power Plants dataset originally provided with extra information:
- real power generation and estimated NO2 emissions at the month and Power Plant levels
   (I worked this in another notebook: https://www.kaggle.com/ajulian/eia-923-input-nox-emissions-and-ef-reference)
- geojson info, taken from https://www.kaggle.com/maxlenormand/saving-the-power-plants-csv-to-geojson

The Power Plants have been filtered by power generation, so just the eight largest are included.
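
The cell that follows extracts longitude and latitude from the '.geo' column by splitting strings. As a hedged alternative, and assuming that column holds GeoJSON Point strings of the form shown in the sample, the coordinates could also be read with the standard `json` module. The function name `lon_lat_from_geo` and the sample value are hypothetical.

```python
import json


def lon_lat_from_geo(geo_string):
    """Return (longitude, latitude) from a GeoJSON Point string."""
    geometry = json.loads(geo_string)
    if geometry.get("type") != "Point":
        raise ValueError("Expected a Point geometry")
    longitude, latitude = geometry["coordinates"][:2]
    return float(longitude), float(latitude)


# Hypothetical value in the shape assumed for the '.geo' column.
sample = '{"type":"Point","coordinates":[-66.1057,18.4274]}'
print(lon_lat_from_geo(sample))
```

Parsing the JSON fails loudly on a malformed row instead of silently producing a wrong slice, which is the main reason to prefer it over string splitting.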

In [None]:
# geo stuff taken from https://www.kaggle.com/maxlenormand/saving-the-power-plants-csv-to-geojson
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

gppd_df = pd.read_csv('/kaggle/input/eia-923-validation-and-nox-emissions-reference/gppd_120_pr_ef.csv') # just the eight largest plants
# Extract longitude and latitude from the '.geo' column into two new columns
gppd_df['longitude'] = [float(gppd_df['.geo'][point].split("[")[1][:-2].split(",")[0]) for point in range(gppd_df.shape[0])]
gppd_df['latitude'] = [float(gppd_df['.geo'][point].split("[")[1][:-2].split(",")[1]) for point in range(gppd_df.shape[0])]

geometry_power_plants = [Point(x,y) for x,y in zip(gppd_df.longitude, gppd_df.latitude)]
gppd_gdf = gpd.GeoDataFrame(gppd_df, crs = "EPSG:4326", geometry = geometry_power_plants) # the {'init': ...} dict form is deprecated

# Saving the geodataframe for easy use later
# gppd_gdf.to_file('Geolocated_gppd_120_pr.geojson', driver='GeoJSON')
print("These are the eight largest Power Plants considered:")
gppd_gdf["name"]

To zoom in on the surroundings of a Power Plant we must set zoom_country=False and specify a PP index (0-7). A text message shows the months with the highest electricity generation and the highest emissions.

In [None]:
zoom_country = False
PPindex = 1 # 0-7, the 8 largest Power Plants, sorted alphabetically
# begin_date = '2019-04-15'; end_date = '2019-04-16' # max 50 (alternative period, overridden by the next line)
begin_date = '2018-07-18'; end_date = '2018-07-20' # max 40
plot_ee_data_on_map(begin_date, end_date, zoom_country=zoom_country, 
                    PPindex=PPindex)

The wind information from the GFS u and v bands (represented jointly by those arrows around the Power Plant icon) usually allows positioning the NO2 emissions square quite accurately. It is difficult, however, to separate emissions from San Juan and Palo Seco since they are quite close to each other.



<h1>Emissions Factors</h1>
<h2>Cleaning images in S5P NO2 product</h2>
I have tried to work mostly in EE rather than with the provided TIFF images for several reasons, the main ones being the ability to use other S5P products (SO2, CO) or data from other sources (GFS), and to work in areas other than Puerto Rico. But first some image filtering is needed to keep only the images that show Puerto Rico; since I haven't been able to do that filtering in EE, I took the image names from the provided TIFF images.

Also, the number of images is 387, which is more than the number of days in a year. The reason is that there are pairs of images with the same date; since they are usually partial images that do not cover the whole of Puerto Rico, I keep the largest one in each pair.
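
The "keep the largest image per date" rule can be sketched without Earth Engine. This is a simplified illustration, not the notebook's implementation: the file names follow the pattern of the provided TIFF files, but the names and pixel counts below are hypothetical, and `keep_largest_per_date` is an assumed helper name.

```python
def keep_largest_per_date(names, pixel_counts):
    """Keep one file per acquisition date: the one with the most valid pixels."""
    best = {}
    for name, count in zip(names, pixel_counts):
        # "s5p_no2_20180701T161259_..." -> "20180701"
        date = name.split("s5p_no2_")[1][:8]
        if date not in best or count > best[date][1]:
            best[date] = (name, count)
    return [best[date][0] for date in sorted(best)]


# Hypothetical file names and pixel counts.
names = [
    "s5p_no2_20180701T161259_20180707T175356.tif",
    "s5p_no2_20180701T175429_20180707T193503.tif",
    "s5p_no2_20180702T173526_20180708T192358.tif",
]
pixel_counts = [120, 480, 500]
print(keep_largest_per_date(names, pixel_counts))
```

Grouping by date in a dictionary also handles more than two images on the same day, a case the pairwise comparison does not cover.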

In [None]:
import os
import pandas as pd

s5p_files_path = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/'
s5p_files = os.listdir(s5p_files_path)
print("Number of TIFF images provided:", len(s5p_files)) # 387

black_df = pd.read_csv("/kaggle/input/black-list/black_pd.csv")
black_list = list(black_df["img_name"])

dataset = "COPERNICUS/S5P/OFFL/L3_NO2"
band = "tropospheric_NO2_column_number_density"
lat=18.232527; long=-66.257565 # in case lat, long are provided, they are ignored at the country level
half_x = 1.0667494 # from original kaggle bounding box
half_y = 0.3323767 # from original kaggle bounding box
lat1 = lat - 1.5*half_y; long1 = long - 1.5*half_x # more at the West, due to San Juan, and South
lat2 = lat + 1.5*half_y; long2 = long + half_x # more at the North, due to San Juan 
rectangle = ee.Geometry.Rectangle([long1, lat1, long2, lat2])

s5p = ee.ImageCollection(dataset)
s5p_files.sort() # ascending in time
date_prev=""
repeated = 0
s5p_images_repeated_list = []
s5p_images_list = []
for i in range(len(s5p_files)-1, -1, -1):
    if s5p_files[i] not in black_list:
        product_id = s5p_files[i].split(".")[0].split("s5p_no2_")[1][:-12]
        # print(product_id)
        date = product_id[:8]
        if (date == date_prev):
            s5p_images_repeated_list.append(i) # points to position in list
            repeated += 1

        img = s5p.filterMetadata("PRODUCT_ID", "contains", product_id).first()
        s5p_images_list.append(img)

        date_prev = date
        product_id_prev = product_id
    else:
        s5p_files.pop(i)
print("Number of ee.Image in the original list:", len(s5p_images_list))
print("Number of image pairs with same date:", repeated)

# we are going to pop one image for each pair in repeated date, 
# so we must reverse the list
s5p_images_repeated_list.sort(reverse=True)
# for images in same date, take the largest !! 
print("Removing the smallest image when there are two in same date. SLOW!! (getInfo() inside a loop)")
# TODO Avoid loops calling getInfo(), which I think is a communication from server to client
black_list = []
for i in range(len(s5p_images_repeated_list)):
    s5p_images_list_id = s5p_images_repeated_list[i]
    img = s5p_images_list[s5p_images_list_id]
    img = img.reduceRegion(reducer=ee.Reducer.toList(),geometry=rectangle,\
                            maxPixels=1e13,scale=1000)
    no2 = np.array(ee.Array(img.get(band)).getInfo())

    img_prev = s5p_images_list[s5p_images_list_id-1]
    img_prev = img_prev.reduceRegion(reducer=ee.Reducer.toList(),geometry=rectangle,\
                            maxPixels=1e13,scale=1000)
    no2_prev = np.array(ee.Array(img_prev.get(band)).getInfo())
    print("Image 1 size:", no2.shape, "Image 2 size:", no2_prev.shape, 
          "  (", i+1, "/", len(s5p_images_repeated_list), ")")
    if (no2_prev.shape > no2.shape):
        s5p_images_list.pop(s5p_images_list_id)
        bad_image_name = s5p_files.pop(s5p_images_list_id)
    else:
        s5p_images_list.pop(s5p_images_list_id-1)
        bad_image_name = s5p_files.pop(s5p_images_list_id-1)
    black_list.append(bad_image_name)
if len(black_list) > 0:
    black_pd = pd.DataFrame(black_list, columns=["img_name"])
    black_pd.to_csv("/kaggle/working/black_pd.csv", index=False)

print(black_list)
print("Number of ee.Image after removing same date images from the original list:", len(s5p_images_list))
s5p_NOx_clean = ee.ImageCollection(s5p_images_list)
print("Created clean collection with", s5p_NOx_clean.size().getInfo(), "images")
print("Created filename collection with", len(s5p_files), "filenames")

We have 344 images, which means in the 365 days of the rolling year there are 21 days without an image.
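
The count above tells how many days are missing but not which ones. A minimal sketch for listing them, assuming the image dates are available as `datetime.date` objects; the helper `missing_days` and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def missing_days(image_dates, first_day, last_day):
    """Return the days in [first_day, last_day] that have no image."""
    covered = set(image_dates)
    n_days = (last_day - first_day).days + 1
    all_days = [first_day + timedelta(days=i) for i in range(n_days)]
    return [d for d in all_days if d not in covered]


# Hypothetical example: a five-day window with two gaps.
image_dates = [date(2018, 7, 1), date(2018, 7, 2), date(2018, 7, 4)]
gaps = missing_days(image_dates, date(2018, 7, 1), date(2018, 7, 5))
print([d.isoformat() for d in gaps])  # ['2018-07-03', '2018-07-05']
```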

<h2>NO2 plot for Puerto Rico</h2>
Now we can plot the mean NO2 emissions (as seen once a day from the s5p satellite) for the rolling year. This is not directly the NOx emissions in the Power Plants or any other emission source (such as vehicles), but it's related.

In [None]:
# toBands concatenates in a single image all the bands (in our case just the selected one) 
# from all images in an image collection 
region_scale = 2000
#no2_list = s5p_NOx_clean.select(band).toBands().reduceRegion(ee.Reducer.sum(), rectangle, region_scale).toArray().getInfo()
no2_list = s5p_NOx_clean.select(band).toBands().reduceRegion(ee.Reducer.mean(), rectangle, region_scale).toArray().getInfo()
print("One point per day:", len(no2_list))

In [None]:
# We can take the dates from the imageCollection, which again is slow since it uses getInfo() inside a loop.
# Two ways but both slow
# datetime_list = s5p_NOx_clean.aggregate_array("system:time_start").getInfo()
# datetime_list = s5p_NOx_clean.reduceColumns(ee.Reducer.toList(), ["system:time_start"]).get('list').getInfo()
# datetime_list = [ee.Date(millisec).format("YYYY-MM-dd HH:mm:ss").getInfo() for millisec in datetime_list]
datetime_list = [f.split(".")[0].split("s5p_no2_")[1].split("_")[0].replace("T", " ") for f in s5p_files]
datetime_list = [dt[:4] + "-" + dt[4:6] + "-" + dt[6:8] + " " + dt[9:11] + ":" + dt[11:13] + ":" + dt[13:15] for dt in datetime_list]


In [None]:
import plotly.graph_objects as go
fig = go.Figure()
date_list = [datetime.split(" ")[0] for datetime in datetime_list]
fig.add_trace(go.Scatter(x=date_list, y=no2_list, name="NO2 emissions"))
fig.update_layout(title="NO2 mean emissions in tropospheric vertical column",
                  yaxis_title="NO2 emissions (mol/m^2)")
fig.show()

<h2>Cloud influence</h2>
Clouds make NO2 measurements useless. We can take advantage of the cloud information in S5P: the NO2 and SO2 products have a band named "cloud_fraction" whose pixels have values between 0 (no cloud) and 1 (cloudy). So let's mask the cloudy pixels and check the plot again:
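
What the mask does to a regional mean can be shown on plain lists, outside Earth Engine. This is only a sketch of the idea (keep pixels with cloud fraction at or below the threshold, then average); the function `cloud_masked_mean` and the pixel values are hypothetical.

```python
def cloud_masked_mean(no2, cloud_fraction, threshold=0.5):
    """Mean of no2 over pixels whose cloud fraction is <= threshold."""
    kept = [n for n, c in zip(no2, cloud_fraction) if c <= threshold]
    if not kept:
        return float("nan")  # every pixel is cloudy
    return sum(kept) / len(kept)


# Hypothetical flattened image, NO2 values in mol/m^2.
no2 = [10e-6, 20e-6, 30e-6, 90e-6]
cloud = [0.1, 0.5, 0.2, 0.9]
print(cloud_masked_mean(no2, cloud))
```

In this example the last pixel is dropped, so the mean is taken over the first three values only. A day that is cloudy everywhere yields no value at all, rather than a misleading one.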

In [None]:
cloud_threshold = 0.5 # pixels with value of cloud_fraction > 0.5 (50%) will be masked
cloud_band = "cloud_fraction"
def clean_band(image):
    cloud = image.select(cloud_band)
    mask = cloud.lte(cloud_threshold)
    return image.updateMask(mask)

s5p_NOx_clean2 = s5p_NOx_clean.map(clean_band)
no2_clean_list = s5p_NOx_clean2.select(band).toBands().reduceRegion(ee.Reducer.mean(), rectangle, region_scale).toArray().getInfo()
#s5p_NOx_clean = s5p_NOx_clean.map(clean_band)
#no2_clean_list = s5p_NOx_clean.select(band).toBands().reduceRegion(ee.Reducer.mean(), rectangle, region_scale).toArray().getInfo()

fig = go.Figure()
fig.add_trace(go.Scatter(x=date_list, y=no2_list, name="NO2 emissions (as provided)"))
fig.add_trace(go.Scatter(x=date_list, y=no2_clean_list, name="NO2 emissions (filtering cloud)"))
fig.update_layout(title="NO2 mean emissions in tropospheric vertical column",
                  yaxis_title="NO2 emissions (mol/m^2)")
fig.show()

Not much has changed. Let's check when the maximum happened and look at the map.

In [None]:
max_date = date_list[no2_clean_list.index(max(no2_clean_list))] # gets the day with the max value
print("The day with a maximum of average NO2 was", max_date)
begin_date = '2019-04-15'; end_date = '2019-04-17'
plot_ee_data_on_map(begin_date, end_date)

In [None]:
fig = go.Figure()
time_list = [datetime.split(" ")[1] for datetime in datetime_list]
df = pd.DataFrame({"no2": no2_clean_list, "time": time_list})
df.sort_values("time", inplace=True)
fig.add_trace(go.Scatter(x=df.time, y=df.no2, mode="markers"))
fig.update_layout(title="NO2 emissions vs. time in tropospheric vertical column",
                  yaxis_title="NO2 emissions (mol/m^2)")
fig.show()

I cannot see any variation of the NO2 measurements with the time of day, which would reflect a difference in lifetime. In fact, this should be easier to check at the Power Plant level, but it would require daily measurements of some activity: electricity generation or emissions.

<h2>Emissions Factors at the Power Plant and month level</h2>
In order to compute the Emissions Factors, we need to estimate the emissions in the Power Plant.
We are going to apply a basic box model to convert from NO2 detected by the satellite (in mol/m2) to NO emitted in the Power Plant (in kg/h). This is a classic model which comes from "Introduction to atmospheric chemistry" (D.J.Jacob, 1999); other models have been considered but the box model still can be applied, and is very simple.

The conversion formula is E = f NO2 w^2 / (K tau), where:
- E is the NO emissions in the Power Plant (kg/h)
- NO2 is the average NO2 detected in an S5P image (mol/m2)
- w is the width of the box; in my case, the box rotates each time a measure is taken, so I am taking an average
- K is a conversion factor, embedded in the rest of conversions
- f is the ratio between the NO emitted and the NO2 detected; let's say S5P detects 100 parts of NO2; actually, the Power Plant emitted 132 of NO, out of which 32 did not convert to NO2 and remained as NO. So we have to multiply the NO2 detected by 1.32 to infer the NO emitted. This parameter is justified by [[Beirle, 2011](http://projects.knmi.nl/publications/fulltexts/1737.full.pdf)]
- tau is a combination of times: 1/tau = 1/tau_loss + 1/tau_wind; tau_loss is the chemical lifetime, the characteristic time in which NO2 converts to other gases; it depends on many things: time of day (it is due mainly to sunlight), season of the year, etc. There is an interesting figure showing different tau_loss at different times and seasons [[Siyang, 2018](https://www.sciencedirect.com/science/article/pii/S1001074218327426)]; the interesting thing is that tau_loss varies between 4 and 8 hours, meaning that most of the emitted NO2 will be lost during the day even if there is no wind
- tau_wind is sometimes called the residence time: wind carries the NO2 out of the box, and in Puerto Rico, which is quite windy, tau_wind is likely to be much lower than tau_loss => its inverse is bigger and we do not have to care too much about tau_loss. The formula for tau_wind is w/(2U); since "w" is usually in km and U in m/s, a conversion factor must be applied to obtain hours, the typical unit for these times
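
As a rough numerical check of the lifetimes described above, the sketch below combines tau_wind and tau_loss for illustrative values. The 21 km box width is the value used later in this notebook; the 5 m/s wind speed and the 4 h chemical lifetime (the lower end of the 4-8 h range) are assumptions chosen only for illustration, and `tau_wind_hours` and `combined_tau` are hypothetical helpers, not part of the notebook's code.

```python
def tau_wind_hours(w_km, U_ms):
    """Residence time w/(2U) in hours, with w in km and U in m/s."""
    w_m = w_km * 1000.0        # km => m
    U_mh = U_ms * 3600.0       # m/s => m/h
    return w_m / (2.0 * U_mh)

def combined_tau(tau_loss_h, tau_wind_h):
    """Effective lifetime from 1/tau = 1/tau_loss + 1/tau_wind (hours)."""
    return 1.0 / (1.0 / tau_loss_h + 1.0 / tau_wind_h)

w_km = 21.0        # box width used later in this notebook
U_ms = 5.0         # assumed wind speed, for illustration only
tau_loss_h = 4.0   # assumed chemical lifetime (lower end of the 4-8 h range)

tw = tau_wind_hours(w_km, U_ms)
tau = combined_tau(tau_loss_h, tw)
print("tau_wind = %.2f h, combined tau = %.2f h" % (tw, tau))
# E is proportional to 1/tau, so using tau_wind alone gives a lower estimate
print("underestimate if tau_loss is neglected: %.0f%%" % (100 * (1 - tau / tw)))
```

With these assumed values tau_wind is about 0.58 h and the combined tau about 0.51 h, so neglecting tau_loss would lower the estimated emissions by roughly 13%. With stronger wind the difference shrinks further.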

In [None]:
# Parameters:
# - NO2 in mol/m2
# - U speed of wind, in m/s (sqrt(u^2+v^2))
# - w width of box, in km
def emissions_jacob(NO2, U, w): # inputs (mol/m2) (m/s) (km); output kg/h
    NO2 = NO2 * 46.0/1000    # 1 mol NO2/m2 = 46 g/m2 = 46/1000 kg/m2
    w = w * 1000             # km => m
    U = U * 3600             # m/s => m/h
    tau = w/(2*U)            # m/(m/h) => h; only tau_wind is used, tau_loss is neglected
    f = 1.32                 # NO emitted per NO2 detected [Beirle, 2011]
    E = NO2 * w**2 * f / tau
    return E

In [None]:
from haversine import haversine
import math

def emissions_PP(begin_date, end_date, band, PPindex, side, scale):
    w = 21 # km

    box, box_rect, u_pp_mean, v_pp_mean = plot_ee_data_on_map(begin_date, end_date, zoom_country=False, 
                    PPindex=PPindex, side=side, showMode=False)
    boxList = box.reduceRegion(reducer=ee.Reducer.toList(),
                               geometry=box_rect,
                               scale=scale)
    NO2_list = boxList.get(band).getInfo()
    NO2_mean = np.mean(NO2_list)
    U = math.sqrt(u_pp_mean**2 + v_pp_mean**2)
    E_kg_h = emissions_jacob(NO2_mean, U, w)
    E_tonne_day = E_kg_h*24/1000
    # print(E_tonne_day, "tonne/day")
    return E_tonne_day

In [None]:
def emissions_slices(PPindex, side, scale):
    emissions_l = [0, 0, 0, 0, 0, 0]
    # last minute problem is not allowing to begin before september 2018
    begin_date = ee.Date("2018-09-01")
    band = "tropospheric_NO2_column_number_density"

    for m in range(4):
        print("Month change")
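        # start each month on its first day; reusing the previous end_month
        # (the last day of that month) would shift every month one day earlier
        begin_month = begin_date.advance(m, "month")
        end_month = begin_month.advance(1, "month").advance(-1, "day")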
        # print(begin_month.format("YYYY-MM-dd").getInfo(), end_month.format("YYYY-MM-dd").getInfo())
        for slice in range(n_slices):
            if slice==0:
                begin_slice = begin_month
            else:
                begin_slice = end_slice # last end_slice
            if slice==n_slices-1:
                end_slice = end_month
                # TODO end month emissions
            else:
                end_slice = begin_slice.advance(period, "day")

            emissions_l[m] += period*emissions_PP(begin_slice, end_slice, band, PPindex, side, scale)
            # print(begin_slice.format("YYYY-MM-dd").getInfo(), end_slice.format("YYYY-MM-dd").getInfo())
    return emissions_l
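
The loop above weights every slice by `period` days, even when the last slice of a month is shorter (see the TODO). As a minimal sketch of how the slice boundaries and their actual lengths could be computed, the example below uses the standard `datetime` module instead of `ee.Date`. `month_slices` is a hypothetical helper, and treating the end date of each slice as exclusive is an assumption that would have to match how `plot_ee_data_on_map` filters images.

```python
from datetime import date, timedelta

def month_slices(year, month, period=5):
    """Split a calendar month into consecutive slices of `period` days.

    Returns (begin, end, n_days) tuples with `end` exclusive; the last
    slice is shortened so that it stops at the first day of the next month.
    """
    begin_month = date(year, month, 1)
    # first day of the following month (exclusive end of this month)
    end_month = date(year + month // 12, month % 12 + 1, 1)
    slices = []
    begin = begin_month
    while begin < end_month:
        end = min(begin + timedelta(days=period), end_month)
        slices.append((begin, end, (end - begin).days))
        begin = end
    return slices

for begin, end, n_days in month_slices(2018, 9):
    print(begin, end, n_days)
```

The monthly total would then be the sum of `n_days` times the daily emissions of each slice, so that a 31-day month gets one extra 1-day slice instead of a sixth slice weighted as 5 days.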

In [None]:
PPindex = 1; # 0-7, the 8 largest Power plants, sorted alphabetically
scale = 3000
side = 0.2 # This is the default in plot_ee function
n_months = 6 # we can only compare with data from 2018
n_slices = 6 
period = 5

emissions_l = emissions_slices(PPindex, side, scale)

<h2>Monthly Emissions Factors for Aguirre (sept-dec 2018)</h2>

In [None]:
print("The emissions inferred from the system are", emissions_l[:4])
print("The emissions estimated from EIA were", gppd_gdf.iloc[PPindex]["Emissions ton September"],
     gppd_gdf.iloc[PPindex]["Emissions ton October"], gppd_gdf.iloc[PPindex]["Emissions ton November"],
     gppd_gdf.iloc[PPindex]["Emissions ton December"])
print("The Emissions Factors are", round(1000*emissions_l[0]/gppd_gdf.iloc[PPindex]["Netgen September"], 3),
     round(1000*emissions_l[1]/gppd_gdf.iloc[PPindex]["Netgen October"], 3),
     round(1000*emissions_l[2]/gppd_gdf.iloc[PPindex]["Netgen November"], 3),
     round(1000*emissions_l[3]/gppd_gdf.iloc[PPindex]["Netgen December"], 3))
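
The four expressions above repeat the same unit conversion: monthly emissions in tonnes, divided by net generation in MWh, times 1000 to obtain kg/MWh. A minimal sketch of that calculation as a helper is shown below; `emissions_factor_kg_per_mwh` is a hypothetical name, and the numbers are made up for illustration only (they are not taken from the EIA data or from the results of this notebook).

```python
def emissions_factor_kg_per_mwh(emissions_tonne, netgen_mwh):
    """Emissions factor in kg/MWh from monthly emissions (tonnes) and net generation (MWh)."""
    return round(1000.0 * emissions_tonne / netgen_mwh, 3)  # tonne => kg

# Hypothetical values, for illustration only
example_emissions_tonne = [50.0, 40.0]
example_netgen_mwh = [50000.0, 80000.0]

factors = [emissions_factor_kg_per_mwh(e, g)
           for e, g in zip(example_emissions_tonne, example_netgen_mwh)]
print(factors)
```

In the notebook, the two lists would come from `emissions_l` and from the "Netgen September" to "Netgen December" columns of `gppd_gdf`.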

<h2>The Emissions Factors for Aguirre are 1.008, 0.722, 0.885, 0.72 kg/MWh</h2>
For the rest of the Power Plants we would proceed in the same way (but I am running out of time...)