In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Objective

This Data Science for Good Competition intends to use **remote sensing** techniques to understand Environmental Emissions. Since the whole concept of satellite imagery can be a little overwhelming, this is just an introductory kernel in which I try to explain the various terms and datasets related to satellite imagery.

# Problem Statement: Measuring **Emissions factors** from Satellite Data
Air Quality Management is an important area and influences a lot of decisions taken by countries. But how does one ascertain the Air quality of a place? This is done by calculating the Emissions Factor of that area.

What is the Emission factor? 
Many activities today release greenhouse gases (GHG) into the atmosphere, such as burning fuel, driving vehicles and running power plants. To estimate GHG emissions per unit of activity, we use a factor called the emission factor (EF).[source]
For example: how many kgs of GHG are emitted by 1 kWh of natural gas?
Thus, an emission factor is a coefficient that converts any activity's data into GHG emissions. [This factor attempts to relate the quantity of a pollutant released to the atmosphere with an activity associated with the release of that pollutant.](https://www.epa.gov/air-emissions-factors-and-quantification/basic-information-air-emissions-factors-and-quantification#About%20Emissions%20Factors)

![](https://cdn-images-1.medium.com/max/800/1*3ToZXr2ObHrT5vlTvxU7pg.png)

![](https://cdn-images-1.medium.com/max/800/1*K4nA9SlqCrXHFLQOGP5_dg.png)

[Source](https://www.epa.gov/air-emissions-factors-and-quantification/basic-information-air-emissions-factors-and-quantification#About%20Emissions%20Factors)
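
The general relationship described in the EPA source above can be written as E = A × EF × (1 − ER/100), where E is the emissions, A the activity rate, EF the emission factor and ER the overall emission reduction efficiency in percent. Below is a minimal sketch of that arithmetic. The function name and the numbers are made up for illustration and are not taken from the competition data.

```python
def estimate_emissions(activity, emission_factor, reduction_efficiency_pct=0.0):
    """Estimate emissions as E = A * EF * (1 - ER / 100)."""
    return activity * emission_factor * (1 - reduction_efficiency_pct / 100)

# Hypothetical example: 1,000 units of activity, 2.0 kg of pollutant per unit
uncontrolled = estimate_emissions(1000, 2.0)
# Same activity with controls that remove half of the emissions
controlled = estimate_emissions(1000, 2.0, reduction_efficiency_pct=50)
print(uncontrolled, controlled)
```

The point of the competition is that A and EF are usually assembled from ground-based reporting, so any error in either term carries straight through to E.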

# Where does Satellite Data fit in?

Today, calculating emission factors entails a long and time-consuming process of data collection. Data collection can itself be error-prone and can introduce disparities. Here is an example of a typical emission factor:

* pounds of NOx per million cubic feet of natural gas combusted (the million cubic feet of natural gas is the activity unit [A])[source]

Calculating emission factors is one of the most common problems when estimating the emissions of a piece of equipment, so what if there were a better way to calculate them? This is what the competition is all about. We need to see [if it's possible to use remote sensing techniques to better model emissions factors](https://www.kaggle.com/c/ds4g-environmental-insights-explorer/overview/description).

# Remote Sensing

Remote sensing is the process of gathering information about an object or place without any physical contact with it. In the case of satellites, remote sensing means using satellites to gather data.

![](https://cdn-images-1.medium.com/max/800/1*qeaNfhcHLLGMpJI3mkBRRA.png)

*a NASA program comprising a series of satellite missions as of 2 February 2015*

Satellite imagery consists of images of Earth (or other planets) collected by imaging satellites. Satellites have been collecting Earth observation data for decades. Governments or private firms may own these satellites. Some of the programs whose imagery has been made publicly available are:

* [**Landsat**](https://en.wikipedia.org/wiki/Landsat_program): It is the oldest continuous Earth-observing satellite imaging program. The [Landsat 7](https://en.wikipedia.org/wiki/Landsat_7) and [Landsat 8](https://en.wikipedia.org/wiki/Landsat_8 "Landsat 8") satellites are currently in orbit. [Landsat 9](https://en.wikipedia.org/wiki/Landsat_9 "Landsat 9") is planned.

![](https://cdn-images-1.medium.com/max/800/1*-OR24t1D6OoJ7CLW4IMGKQ.png)

*LANDSAT*

* **MODIS** : [MODIS](https://modis.gsfc.nasa.gov/about/) stands for **Moderate Resolution Imaging Spectroradiometer**. It is a key instrument aboard the [Terra](http://terra.nasa.gov/) and [Aqua](http://aqua.nasa.gov/) satellites, which view the entire Earth’s surface every 1 to 2 days.

* [**Sentinel**](https://en.wikipedia.org/wiki/Copernicus_Programme#Sentinel_missions) : The Sentinel missions by the European Space Agency ([ESA](https://en.wikipedia.org/wiki/European_Space_Agency)) include radar and super-spectral imaging for land, ocean and atmospheric monitoring.

![](https://cdn-images-1.medium.com/max/800/1*a6keWwYKkPUFHycGP6l_Kw.png)

*Sentinel*


# Analysing the different Datasets

The following datasets have been provided as a starter kit to get started with the competition. Let’s understand them briefly:

#### 1. [Global Power Plant Database](https://developers.google.com/earth-engine/datasets/catalog/WRI_GPPD_power_plants) by [World Resources Institute](http://datasets.wri.org/dataset/globalpowerplantdatabase)(WRI)

The Global Power Plant Database is a comprehensive, fully open-source database (licensed under [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)) that includes details of power plants around the world. The database covers approximately 30,000 power plants from 164 countries and includes both thermal and renewable power plants.

#### 2. [Sentinel 5P OFFL NO2](https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_NO2)

**Sentinel-5 Precursor** (**Sentinel-5P**) is an [Earth observation satellite](https://en.wikipedia.org/wiki/Earth_observation_satellite "Earth observation satellite") developed by [ESA](https://en.wikipedia.org/wiki/European_Space_Agency "European Space Agency") as part of the [Copernicus Programme](https://en.wikipedia.org/wiki/Copernicus_Programme "Copernicus Programme"). It is the first [Copernicus](https://en.wikipedia.org/wiki/Copernicus_Programme "Copernicus Programme") mission dedicated to monitoring [air pollution](https://en.wikipedia.org/wiki/Air_pollution "Air pollution"). It carries an instrument called [Tropomi](https://en.wikipedia.org/wiki/Sentinel-5_Precursor) (TROPOspheric Monitoring Instrument), a spectrometer that monitors [ozone](https://en.wikipedia.org/wiki/Ozone "Ozone"), [methane](https://en.wikipedia.org/wiki/Methane "Methane"), [formaldehyde](https://en.wikipedia.org/wiki/Formaldehyde "Formaldehyde"), [aerosol](https://en.wikipedia.org/wiki/Aerosol "Aerosol"), [carbon monoxide](https://en.wikipedia.org/wiki/Carbon_monoxide "Carbon monoxide"), [NO2](https://en.wikipedia.org/wiki/Nitrogen_dioxide "Nitrogen dioxide") and [SO2](https://en.wikipedia.org/wiki/Sulfur_dioxide "Sulfur dioxide") in the atmosphere.

The **OFFL/NO2 dataset** provides offline high-resolution imagery of NO2 concentrations.

#### 3. [Global Forecast System 384-Hour Predicted Atmosphere Data](https://developers.google.com/earth-engine/datasets/catalog/NOAA_GFS0P25)

The Global Forecast System (GFS) is a model that forecasts weather. The GFS is a coupled model, composed of an atmosphere model, an ocean model, a land/soil model, and a sea ice model, which work together to provide an accurate picture of weather conditions.

![](https://cdn-images-1.medium.com/max/800/1*10SiCHb5aTr5zeTrHLMbVA.gif)

[An animated image of GFS simulated total atmospheric ozone concentration](https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs)

#### 4. [GLDAS-2.1: Global Land Data Assimilation System](https://developers.google.com/earth-engine/datasets/catalog/NASA_GLDAS_V021_NOAH_G025_T3H)

This dataset, provided by NASA, ingests satellite- and ground-based observational data products, using advanced land surface modeling and data assimilation techniques, in order to generate optimal fields of land surface states and fluxes (Rodell et al., 2004a).


In [None]:
## Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Analysing datetime
import datetime as dt
from datetime import datetime 

# Plotting geographical data
import folium
import rasterio as rio

# File system management
import os

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

# Analysing the various datasets

We have been given access to data for Puerto Rico from July 2018 to July 2019. This data has been exported from Earth Engine.

## 1. Exploring the Global Power Plant Database for Puerto Rico

The given `gppd_120_pr.csv` lists all the power plants in Puerto Rico, an unincorporated territory of the United States located in the northeast Caribbean Sea. The **latitude** of Puerto Rico is **18.200178**, and the **longitude** is **-66.664513**. The island has been chosen for the analysis since [there are fewer confounding factors from nearby areas](https://www.kaggle.com/c/ds4g-environmental-insights-explorer). Puerto Rico also offers a unique fuel mix and distinctive energy system layout that should make it easier to isolate pollution attributable to power generation in the remote sensing data.

In [None]:
# Total power plants in Puerto Rico
global_power_plants = pd.read_csv('../input/ds4g-environmental-insights-explorer/eie_data/gppd/gppd_120_pr.csv')
global_power_plants.head().T

In [None]:
# Number of power plants (rows) and attributes (columns)
global_power_plants.shape

Before analysing further, let's understand what some of the attributes mean. [[source](http://datasets.wri.org/dataset/globalpowerplantdatabase)]
>
* **capacity_mw** - electrical generating capacity in megawatts
* **commissioning_year** - year of plant operation, weighted by unit-capacity when data is available
* **estimated_generation_gwh** - estimated annual electricity generation in gigawatt-hours 
* **generation_gwh_2013** - electricity generation in gigawatt-hours for the year 2013
* **gppd_idnr** - 10 or 12 character identifier for the power plant
* **name** - name or title of the power plant
* **primary_fuel** - energy source used in primary electricity generation or export
* **wepp_id** - a reference to a unique plant identifier in the widely-used PLATTS-WEPP database
* **year_of_capacity_data** - year the capacity information was reported
* **source** - entity reporting the data
* **owner** - majority shareholder of the power plant

### Kinds of Power Plants based on primary Fuel used

In [None]:
# Let's check the different kinds of Power Plants based on primary Fuel used.
fuel_counts = global_power_plants['primary_fuel'].value_counts()
sns.barplot(x=fuel_counts.index, y=fuel_counts.values)
plt.ylabel('Count')


### How old are the plants
Power plants built decades ago tend to pollute more since they do not meet the newer anti-pollution requirements.

In [None]:
# dropna=False also counts the plants with a missing commissioning year
global_power_plants['commissioning_year'].value_counts(dropna=False)

Well, a lot of power plants do not have a commissioning date. The oldest plant dates back to 1942 and the newest to 2012.

### The different sources of data

In [None]:
fig = plt.gcf()
fig.set_size_inches(10, 6)
colors = ['dodgerblue', 'plum', '#F0A30A','#8c564b','orange','green','yellow'] 
global_power_plants['source'].value_counts(ascending=True).plot(kind='barh',color=colors,linewidth=2,edgecolor='black')

The majority of the data came from CEPR, followed by PREPA (Puerto Rico Electric Power Authority).

### Who owns the Power Plants

In [None]:
# Owner - majority shareholder of the power plant

fig = plt.gcf()
fig.set_size_inches(10, 6)
colors = ['dodgerblue', 'plum', '#F0A30A','#8c564b','orange','green','yellow'] 
global_power_plants['owner'].value_counts(ascending=True).plot(kind='barh',color=colors)

PREPA is a government agency that owns the electricity transmission and distribution systems for the main island, Vieques, and Culebra, as well as 80% of the electricity generating capacity ([source](https://www.eia.gov/state/analysis.php?sid=RQ#25)).

### Total Installed Capacity

The total installed capacity of a power plant refers to the maximum output of electricity that it can produce under ideal conditions, but this won’t necessarily be the actual amount of electricity produced. It is usually expressed in megawatts (MW).

In [None]:
# Total capacity of all the plants
total_capacity_mw = global_power_plants['capacity_mw'].sum()
print('Total Installed Capacity: '+'{:.2f}'.format(total_capacity_mw) + ' MW')


In [None]:
capacity = (global_power_plants.groupby(['primary_fuel'])['capacity_mw'].sum()).to_frame()
capacity = capacity.sort_values('capacity_mw',ascending=False)
capacity['percentage_of_total'] = (capacity['capacity_mw']/total_capacity_mw)*100
capacity

In [None]:
fig = plt.gcf()
fig.set_size_inches(10, 6)
colors = ['dodgerblue', 'plum', '#F0A30A','#8c564b','orange','green','yellow'] 
capacity['percentage_of_total'].plot(kind='bar',color=colors)


Oil-fired plants constitute about 68% of Puerto Rico’s total installed capacity, and natural gas accounts for 18%. Coal accounts for about 7%, while renewables make up around 5.5%.

### Estimated generation

Electricity generation, on the other hand, refers to the amount of electricity that is produced over a specific period of time. This is usually measured in kilowatt-hours, megawatt-hours or gigawatt-hours.
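
One way to relate capacity and generation is the capacity factor: the energy actually generated over a period divided by what the plant would have produced running at full capacity for the whole period. This is not part of the analysis in this kernel. The sketch below only illustrates the unit conversion between MW and GWh, and the function name and input values are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours (non-leap year)

def capacity_factor(capacity_mw, annual_generation_gwh):
    """Fraction of the theoretical maximum annual output actually generated."""
    max_generation_mwh = capacity_mw * HOURS_PER_YEAR
    return (annual_generation_gwh * 1000) / max_generation_mwh  # 1 GWh = 1000 MWh

# Hypothetical plant: 100 MW of capacity generating 438 GWh in a year
cf = capacity_factor(capacity_mw=100, annual_generation_gwh=438)
print('Capacity factor: {:.0%}'.format(cf))
```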

In [None]:
# Total generation of all the plants
total_gen_mw = global_power_plants['estimated_generation_gwh'].sum()
print('Total Generation: '+'{:.2f}'.format(total_gen_mw) + ' GWh')

In [None]:
generation = (global_power_plants.groupby(['primary_fuel'])['estimated_generation_gwh'].sum()).to_frame()
generation = generation.sort_values('estimated_generation_gwh',ascending=False)
generation['percentage_of_total'] = (generation['estimated_generation_gwh']/total_gen_mw)*100
generation

More than 90% of estimated generation comes from Fossil Fuel powered plants while only a minority share can be attributed to the Renewable Resources fueled plants.

## A geographical view of the various Power Plants

We can use the power plant dataset to visualise the locations of the various power plants. We will extract the latitudes and longitudes from the `.geo` column.



In [None]:
# Code source: https://www.kaggle.com/paultimothymooney/overview-of-the-eie-analytics-challenge
from folium import plugins      
def plot_points_on_map(dataframe,begin_index,end_index,latitude_column,latitude_value,longitude_column,longitude_value,zoom):
    df = dataframe[begin_index:end_index]
    location = [latitude_value,longitude_value]
    plot = folium.Map(location=location,zoom_start=zoom,tiles = 'Stamen Terrain')
    

    for i in range(0,len(df)):
        # Show the plant's primary fuel when its marker is clicked
        popup = folium.Popup(str(df['primary_fuel'].iloc[i]))
        folium.Marker([df[latitude_column].iloc[i],
                       df[longitude_column].iloc[i]],
                       popup=popup,icon=folium.Icon(color='white',icon_color='red',icon ='bolt',prefix='fa',)).add_to(plot)
    return(plot)

def overlay_image_on_puerto_rico(file_name,band_layer,lat,lon,zoom):
    band = rio.open(file_name).read(band_layer)
    m = folium.Map([lat, lon], zoom_start=zoom)
    folium.raster_layers.ImageOverlay(
        image=band,
        bounds = [[18.6,-67.3,],[17.9,-65.2]],
        colormap=lambda x: (1, 0, 0, x),
    ).add_to(m)
    return m

def split_column_into_new_columns(dataframe,column_to_split,new_column_one,begin_column_one,end_column_one):
    for i in range(0, len(dataframe)):
        dataframe.loc[i, new_column_one] = dataframe.loc[i, column_to_split][begin_column_one:end_column_one]
    return dataframe

In [None]:

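# The `.geo` column holds the coordinates as text, so slice out latitude and longitude by character position
global_power_plants = split_column_into_new_columns(global_power_plants,'.geo','latitude',50,66)
global_power_plants = split_column_into_new_columns(global_power_plants,'.geo','longitude',31,48)
global_power_plants['latitude'] = global_power_plants['latitude'].astype(float)
# Latitudes below 10 are shifted up by 10 (Puerto Rico lies near 18 degrees north),
# presumably because the fixed-width slice dropped their leading digit
a = np.array(global_power_plants['latitude'].values.tolist()) 
global_power_plants['latitude'] = np.where(a < 10, a+10, a).tolist() 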

lat=18.200178; lon=-66.664513 # Puerto Rico's co-ordinates
plot_points_on_map(global_power_plants,0,425,'latitude',lat,'longitude',lon,9)
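
The fixed-width slicing above depends on every coordinate string having the same number of characters. A less fragile alternative, sketched below, is to parse the string instead. This assumes the `.geo` column holds a GeoJSON point of the form `{"type":"Point","coordinates":[longitude, latitude]}`, which is what the slice positions suggest but is not confirmed here. The helper name and the sample string are hypothetical.

```python
import json

def geo_to_lat_lon(geo_string):
    """Return (latitude, longitude) from a GeoJSON Point string."""
    # GeoJSON stores coordinates as [longitude, latitude]
    longitude, latitude = json.loads(geo_string)['coordinates']
    return float(latitude), float(longitude)

# Hypothetical sample value, using Puerto Rico's coordinates
sample_geo = '{"type":"Point","coordinates":[-66.664513,18.200178]}'
latitude, longitude = geo_to_lat_lon(sample_geo)
print(latitude, longitude)
```

If the assumption holds, the helper could be applied to the real data with `global_power_plants['.geo'].apply(geo_to_lat_lon)`, which yields one `(latitude, longitude)` tuple per plant.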

## 2. Exploring the Sentinel 5P OFFL NO2 dataset

This dataset provides offline high-resolution imagery of NO2 concentrations in the troposphere and the stratosphere. Nitrogen oxides are predominantly released during the burning of fossil fuels, and also by other processes such as wildfires, lightning and microbiological processes in soils. The dataset is named `s5p_no2` and consists of 387 `.tif` files. Before analysing the NO2 emissions, let us look at a single image and see what information it contains.

### Analysing images using the Rasterio module

[Rasterio](https://automating-gis-processes.github.io/CSC18/lessons/L6/reading-raster.html) is a module for reading and writing several different raster formats in Python.

A [raster image](https://www.computerhope.com/jargon/r/raster.htm) is made up of a grid of pixels, each of which has one or more numbers associated with it. The numbers define the location, size, or color of the pixels. Raster images are commonly stored as .BMP, .GIF, .JPEG, .PNG, and .TIFF files. Today, almost all of the images you see on the Internet and images taken by a digital camera are raster images.

Let’s start with inspecting one of the files we downloaded:

In [None]:

image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180701T161259_20180707T175356.tif'

# Opening the file
raster = rio.open(image)

# All Metadata for the whole raster dataset
raster.meta

* driver : data format
* dtype : data type
* width and height : the dimensions of the image, 475 x 148
* count : the number of bands in the image, 12 here
* crs : the Coordinate Reference System, which describes how the spatial data maps onto the Earth’s surface. A particular CRS can be referenced by its EPSG code (e.g., epsg:4121). The EPSG registry is a structured dataset of CRSs and coordinate transformations ([EPSG registry](http://www.epsg-registry.org/), [spatialreference.org](http://spatialreference.org/)).
* transform : affine transform (how the raster is scaled, rotated, skewed, and/or translated)


In [None]:
from rasterio.plot import show
show(raster)

In [None]:
# Plotting band 4 with a red colormap (the bands here are data layers, not RGB colour channels)
show((raster, 4), cmap='Reds')

## Bands

Multispectral satellites image the Earth in several spectral bands. For example, the Sentinel-2 satellites cover the full Earth in 13 bands with a revisit time of 5 days.

![](https://miro.medium.com/max/419/1*rN7V6sE1qpkSV0nVHl23uA.png)

source: https://arxiv.org/pdf/1709.00029.pdf

In [None]:
# Calculating the extent of the image in the units of the dataset's CRS (not necessarily metres)
sat_data = raster

width_in_projected_units = sat_data.bounds.right - sat_data.bounds.left
height_in_projected_units = sat_data.bounds.top - sat_data.bounds.bottom
print("Width: {}, Height: {}".format(width_in_projected_units, height_in_projected_units))
print("Rows: {}, Columns: {}".format(sat_data.height, sat_data.width))

### Converting the pixel co-ordinates to longitudes and latitudes

In [None]:
# Upper left pixel
row_min = 0
col_min = 0
# Lower right pixel.  Rows and columns are zero-indexed.
row_max = sat_data.height - 1
col_max = sat_data.width - 1
# Transform coordinates with the dataset's affine transformation.
# The transform maps (column, row) pixel positions to (x, y) coordinates.
topleft = sat_data.transform * (col_min, row_min)
botright = sat_data.transform * (col_max, row_max)
print("Top left corner coordinates: {}".format(topleft))
print("Bottom right corner coordinates: {}".format(botright))
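
As a rough illustration of what multiplying an affine transform by a pixel position does, the sketch below applies the six affine coefficients by hand. Note that the column (x direction) comes before the row (y direction). The coefficients are hypothetical and do not come from the raster opened above, and `pixel_to_coords` is a made-up helper rather than part of Rasterio.

```python
def pixel_to_coords(transform, row, col):
    """Map a pixel (row, col) to (x, y) using affine coefficients (a, b, c, d, e, f)."""
    a, b, c, d, e, f = transform
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y

# Hypothetical north-up raster: 0.25-degree pixels, upper-left corner at (-67.0, 18.5)
transform = (0.25, 0.0, -67.0, 0.0, -0.25, 18.5)

print(pixel_to_coords(transform, row=0, col=0))
print(pixel_to_coords(transform, row=2, col=4))
```

For a north-up image, `b` and `d` are zero and `e` is negative, so x grows with the column index while y shrinks as the row index increases.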

### Bands
The image that we are inspecting consists of 12 bands. Rather than colour channels, each band holds a different data layer of the NO2 product. Each band is stored as a NumPy array.

In [None]:
print(sat_data.count)

# sequence of band indexes
print(sat_data.indexes)

### Visualising the Satellite Imagery
We will use matplotlib to visualise the image since it essentially consists of arrays.

In [None]:
# Load the 12 bands into 2d arrays
b01, b02, b03, b04,b05,b06,b07,b08, b09,b10, b11, b12 = sat_data.read()

In [None]:
# Displaying the second band.

fig = plt.imshow(b02)
plt.show()

In [None]:
fig = plt.imshow(b03)
fig.set_cmap('gist_earth')
plt.show()

In [None]:
fig = plt.imshow(b04)
fig.set_cmap('inferno')
plt.colorbar()
plt.show()


In [None]:
# Displaying the eighth band.

fig = plt.imshow(b08)
fig.set_cmap('winter')
plt.colorbar()
plt.show()