***Objective***

This Data Science for Good competition intends to use **remote sensing** techniques to understand environmental emissions. Since the whole concept of satellite imagery can be a little overwhelming, this is just an introductory kernel in which I try to explain the various terms and datasets related to satellite imagery.

**Problem Statement: Measuring Emission Factors from Satellite Data**

What is an emission factor?

The release of GHG into the atmosphere depends mainly on the activity and the product. In order to estimate GHG emissions per unit of available activity, we need to use a factor called emission factor (EF).

For example: how many kg of GHG are emitted by 1 kWh of natural gas?

Thus, the emission factor is the sum of the CO2eq emissions of the human activity, expressed as mass of CO2eq per unit of reference flow. For example, the EF for natural gas is the sum of the combustion emissions (0.205 kg CO2eq / kWh ICV) and the upstream emissions, i.e. the production and transport of the gas (0.0389 kg CO2eq / kWh ICV), which gives 0.2439 kg CO2eq / kWh ICV.

![](https://cdn-images-1.medium.com/max/800/1*3ToZXr2ObHrT5vlTvxU7pg.png) 


The general equation for emissions estimation is:

**E = A x EF x (1-ER/100)**

where:

E = emissions; A = activity rate; EF = emission factor; and ER = overall emission reduction efficiency (%).
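To make the equation concrete, here is a minimal sketch that plugs in the natural gas numbers quoted above. The function name `estimate_emissions`, the activity of 1000 kWh and the 20 % reduction efficiency are my own assumptions for illustration, not values from the competition data.

```python
def estimate_emissions(activity, emission_factor, reduction_efficiency_pct=0.0):
    """Apply E = A x EF x (1 - ER/100)."""
    return activity * emission_factor * (1 - reduction_efficiency_pct / 100)

# Emission factor for natural gas, built from the two parts quoted above
combustion_ef = 0.205   # kg CO2eq / kWh
upstream_ef = 0.0389    # kg CO2eq / kWh
natural_gas_ef = combustion_ef + upstream_ef

# Hypothetical activity: 1000 kWh of natural gas with no abatement
emissions_no_abatement = estimate_emissions(1000, natural_gas_ef)

# Same activity with a hypothetical 20 % overall reduction efficiency
emissions_with_abatement = estimate_emissions(1000, natural_gas_ef, 20)

print(round(natural_gas_ef, 4), round(emissions_no_abatement, 1), round(emissions_with_abatement, 2))
```

With ER = 0 the formula reduces to E = A x EF, which is the form used in the rest of this kernel.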

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns 
import matplotlib.pyplot as plt 
import missingno as msno
import rasterio as rio
import folium
import tifffile as tiff 
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
data = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gppd/gppd_120_pr.csv'
df = pd.read_csv(data)

* Get the latitude and longitude from the `.geo` column

In [None]:
def split_latnlog_into_new_columns_fromgeo(dataframe, column_to_split, new_column_one, begin_column_one, end_column_one):
    """Copy characters [begin:end] of each string in `column_to_split` into a new column."""
    # Vectorised string slicing; the new column holds strings, not floats
    dataframe[new_column_one] = dataframe[column_to_split].str[begin_column_one:end_column_one]
    return dataframe

In [None]:
power_plants = split_latnlog_into_new_columns_fromgeo(df,'.geo','latitude',50,66)
power_plants = split_latnlog_into_new_columns_fromgeo(df,'.geo','longitude',31,48)

In [None]:
power_plants.head()
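The fixed character offsets used above only work while every `.geo` string has exactly the same layout. Since the column appears to hold GeoJSON points (coordinates in `[longitude, latitude]` order), a possibly more robust alternative is to parse the JSON. This is only a sketch: the helper name `add_lat_lon_from_geo` and the toy rows are made up for illustration.

```python
import json
import pandas as pd

# Toy rows shaped like the `.geo` column (values are made up for illustration)
toy_geo = pd.DataFrame({
    '.geo': [
        '{"type":"Point","coordinates":[-66.1045,18.4527]}',
        '{"type":"Point","coordinates":[-65.44,18.1429]}',
    ]
})

def add_lat_lon_from_geo(dataframe, geo_column='.geo'):
    """Parse GeoJSON point strings and add numeric longitude/latitude columns."""
    coords = dataframe[geo_column].apply(lambda s: json.loads(s)['coordinates'])
    out = dataframe.copy()
    # GeoJSON stores a point as [longitude, latitude]
    out['longitude'] = coords.apply(lambda c: float(c[0]))
    out['latitude'] = coords.apply(lambda c: float(c[1]))
    return out

toy_with_coords = add_lat_lon_from_geo(toy_geo)
print(toy_with_coords[['latitude', 'longitude']])
```

Unlike the slicing approach, this gives float columns, so they can be used directly for plotting or distance calculations.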

* **Now let's make some visualizations to understand the data :)**

In [None]:
# catplot creates its own figure, so a separate plt.figure() call is not needed
sns.catplot(x='primary_fuel', data=power_plants, kind='count', alpha=0.7, height=6, aspect=3.5)

# Get current axis on current figure
ax = plt.gca()

# Annotate each bar with its count, centred above the bar
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()),
            fontsize=13, color='blue', ha='center', va='bottom')
plt.title('Frequency plot of primary_fuel', fontsize=20, color='black')
plt.show()

* **Hydro and Gas plants are more numerous than the others.**

* **Oil, Gas and Coal appear to be the polluting primary fuels.**

In [None]:
# catplot creates its own figure, so a separate plt.figure() call is not needed
sns.catplot(x='source', data=power_plants, kind='count', alpha=0.7, height=6, aspect=3.5)

# Get current axis on current figure
ax = plt.gca()

# Annotate each bar with its count, centred above the bar
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()),
            fontsize=13, color='blue', ha='center', va='bottom')
plt.title('Frequency plot of source', fontsize=20, color='black')
plt.show()

In [None]:
# catplot creates its own figure, so a separate plt.figure() call is not needed
sns.catplot(x='owner', data=power_plants, kind='count', alpha=0.7, height=6, aspect=3.5)

# Get current axis on current figure
ax = plt.gca()

# Annotate each bar with its count, centred above the bar
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '%d' % int(p.get_height()),
            fontsize=13, color='blue', ha='center', va='bottom')
plt.title('Frequency plot of owner', fontsize=20, color='black')
plt.show()

* **Let's take a look at `estimated_generation_gwh`, the estimated annual electricity generation in gigawatt-hours.**

In [None]:
from scipy import stats

sns.distplot(power_plants['estimated_generation_gwh'] , fit=stats.norm);

# Get the fitted parameters used by the function
(mu, sigma) = stats.norm.fit(power_plants['estimated_generation_gwh'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('estimated_generation_gwh distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(power_plants['estimated_generation_gwh'], plot=plt)
plt.show()

* **The distribution does not look normal, so let's apply a log transformation and look again.**

In [None]:
# We use the numpy function log1p, which applies log(1+x) to all elements of the column.
# The result is kept in a separate Series so the original GWh values stay intact
# for the map and the emissions factor calculation further down.
log_generation = np.log1p(power_plants["estimated_generation_gwh"])

# Check the new distribution
sns.distplot(log_generation, fit=stats.norm);

# Get the fitted parameters used by the function
(mu, sigma) = stats.norm.fit(log_generation)
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('log1p(estimated_generation_gwh) distribution')

# Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(log_generation, plot=plt)
plt.show()

* **This looks good :)**
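One way to put a number on "looks good" is to compare the skewness before and after the transformation. The sketch below uses synthetic log-normal values as a stand-in, not the real GPPD column, so the printed numbers say nothing about the actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "generation" values (an assumption, not the real data)
synthetic_gwh = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

# Skewness is close to 0 for a symmetric distribution
skew_raw = stats.skew(synthetic_gwh)
skew_log = stats.skew(np.log1p(synthetic_gwh))

print('skew before: {:.2f}, after log1p: {:.2f}'.format(skew_raw, skew_log))
```

The same two `stats.skew` calls could be run on the real column (after dropping missing values) to check whether the log transformation helps there too.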

In [None]:
# catplot creates its own figure, so a separate plt.figure() call is not needed
sns.catplot(x="primary_fuel",
            y="capacity_mw",
            data=power_plants,
            jitter=False,
           )
plt.show()

* **Distribution of `capacity_mw` for each `primary_fuel`.**

* **OK, now let's check for missing values and for the columns that should be dropped.**

In [None]:
missing_df = df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name','missing_count']
missing_df = missing_df.loc[missing_df['missing_count']>0]
missing_df = missing_df.sort_values(by='missing_count')

ind = np.arange(missing_df.shape[0])
width = 0.5
fig,ax = plt.subplots(figsize=(12,18))
rects = ax.barh(ind,missing_df.missing_count.values,color='blue')
ax.set_yticks(ind)
ax.set_yticklabels(missing_df.column_name.values, rotation='horizontal')
ax.set_xlabel("Count of missing values")
ax.set_title("Number of missing values in each column")
plt.show()

* **So other_fuel1, other_fuel2 and other_fuel3 can be dropped.**

In [None]:
generation_gwh_years = ["generation_gwh_2013", 
                        "generation_gwh_2014", 
                        "generation_gwh_2015", 
                        "generation_gwh_2016",
                        "generation_gwh_2017"]

power_plants.loc[:, generation_gwh_years].sum()

* **All of these generation_gwh year columns can be dropped.**
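Instead of picking the columns by eye, the drop can be driven by the share of missing values. This is a minimal sketch on a toy frame: the helper name `drop_sparse_columns` and the 0.5 threshold are my own assumptions, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; the column names mirror the notebook
toy_missing = pd.DataFrame({
    'capacity_mw': [10.0, 20.0, 30.0, 40.0],
    'other_fuel1': [np.nan, np.nan, np.nan, 'Oil'],
    'generation_gwh_2013': [np.nan, np.nan, np.nan, np.nan],
})

def drop_sparse_columns(dataframe, max_missing_fraction=0.5):
    """Drop columns whose share of missing values exceeds the threshold."""
    missing_fraction = dataframe.isnull().mean()
    to_drop = missing_fraction[missing_fraction > max_missing_fraction].index
    return dataframe.drop(columns=to_drop)

cleaned = drop_sparse_columns(toy_missing)
print(list(cleaned.columns))
```

The threshold is a judgment call; a column that is mostly empty but still informative for a few plants may be worth keeping.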

In [None]:
power_plants_df = power_plants.sort_values('capacity_mw',ascending=False).reset_index()
power_plants_df=power_plants_df[['name','latitude','longitude','primary_fuel','owner','capacity_mw','estimated_generation_gwh',]]
power_plants_df.head()

* **Now that we have the latitude and longitude, let's plot the power plants on a map and see what we get.**

In [None]:
world_d = dict(
    name=list(power_plants['country']),
    lat=list(power_plants['latitude']),
    lon=list(power_plants['longitude']),
    estimated_generation_gwh=list(power_plants['estimated_generation_gwh'])
)
world_data = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in world_d.items()]))
world_data = world_data.ffill()  # forward-fill any gaps


# create map and display it
world_map = folium.Map(location=[18, -66], zoom_start=9)

for lat, lon, value, name in zip(world_data['lat'], world_data['lon'], world_data['estimated_generation_gwh'], world_data['name']):
    folium.CircleMarker([lat, lon],
                        radius=10,
                        popup = ('<strong>country</strong>: ' + str(name).capitalize() + '<br>'
                                '<strong>estimated_generation_gwh</strong>: ' + str(value) + '<br>'),
                        color='red',
                        
                        fill_color='red',
                        fill_opacity=0.7 ).add_to(world_map)
world_map

* **On the map above, to see the estimated annual electricity generation (in gigawatt-hours) of a power plant, just click its red circle :)**

* **Now let's move on to the satellite image datasets.**

DATASET SELECTION STARTER PACK

* Global Power Plant database by WRI
* Sentinel 5P OFFL NO2 by EU/ESA/Copernicus
* Global Forecast System 384-Hour Predicted Atmosphere Data by NOAA/NCEP/EMC
* Global Land Data Assimilation System by NASA

In [None]:
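def plot_tif_img_on_map(file_name, lat, lon, zoom):
    """Overlay a single raster band on a folium map centred on (lat, lon)."""
    wor_map = folium.Map([lat, lon], zoom_start=zoom)
    folium.raster_layers.ImageOverlay(
        image=file_name,
        # Hard-coded bounding box for Puerto Rico, given as two [lat, lon] corners
        bounds=[[18.6, -67.3], [17.9, -65.2]],
        # Draw in red, using the pixel value as the opacity
        colormap=lambda x: (1, 0, 0, x),
    ).add_to(wor_map)
    return wor_map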

In [None]:
image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180708T172237_20180714T190743.tif'
band = rio.open(image).read(7)
print(band.shape)
vmin, vmax = np.nanpercentile(band, (5,95))  # 5-95% stretch
img_plt = plt.imshow(band, cmap='Oranges', vmin=vmin, vmax=vmax)
plt.show()
latitude=18.1429005246921; longitude=-65.4440010699994
plot_tif_img_on_map(band,latitude,longitude,9)
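The colormap in `plot_tif_img_on_map` uses the raw pixel value as the opacity, and as far as I understand folium expects colour components between 0 and 1. Raw band values can be tiny or missing, so it may help to rescale the band first with the same 5-95 % stretch used for `plt.imshow`. A minimal sketch, assuming a helper name `stretch_to_unit_range` of my own and a synthetic band in place of the real raster:

```python
import numpy as np

def stretch_to_unit_range(band, lower_pct=5, upper_pct=95):
    """Percentile-stretch a 2-D array into [0, 1]; NaN pixels become 0."""
    vmin, vmax = np.nanpercentile(band, (lower_pct, upper_pct))
    if vmax == vmin:
        # Constant band: nothing to stretch
        return np.zeros_like(band, dtype=float)
    scaled = (band - vmin) / (vmax - vmin)
    return np.nan_to_num(np.clip(scaled, 0.0, 1.0), nan=0.0)

# Synthetic stand-in for a raster band with small values and one missing pixel
rng = np.random.default_rng(0)
fake_band = rng.random((148, 475)) * 1e-4
fake_band[0, 0] = np.nan

scaled_band = stretch_to_unit_range(fake_band)
print(scaled_band.min(), scaled_band.max())
```

The rescaled array could then be passed to `plot_tif_img_on_map` in place of the raw band.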

In [None]:
image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180707T174140_20180713T191854.tif'
band = rio.open(image).read(7)
print(band.shape)
vmin, vmax = np.nanpercentile(band, (5,95))  # 5-95% stretch
img_plt = plt.imshow(band, cmap='Oranges', vmin=vmin, vmax=vmax)
plt.show()
latitude=18.1429005246921; longitude=-65.4440010699994
plot_tif_img_on_map(band,latitude,longitude,9)

In [None]:
image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gfs/gfs_2018070106.tif'
band = rio.open(image).read(6)
vmin, vmax = np.nanpercentile(band, (5,95))  # 5-95% stretch
img_plt = plt.imshow(band, cmap='Oranges', vmin=vmin, vmax=vmax)
plt.show()
latitude=18.1429005246921; longitude=-65.4440010699994
plot_tif_img_on_map(band,latitude,longitude,9)

* **OK, it's better to check the properties of the TIF images :)**

In [None]:
from rasterio.plot import show
image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gldas/gldas_20180701_0300.tif'

#load the image
band = rio.open(image)
show(band)
#All Metadata for the whole raster dataset
band.meta

* driver : data format
* dtype : data type
* width and height : the dimensions of the image; in the **GLDAS** image data they are 475 x 148
* count : the number of bands; there are 12 bands in this image
* crs : Coordinate Reference System, which describes how the spatial data map onto the earth's surface. A particular CRS can be referenced by its EPSG code
* transform : affine transform (how the raster is scaled, rotated, skewed, and/or translated)
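To show what the affine transform does, here is a small sketch in plain Python that maps a pixel position to a longitude/latitude. The six coefficients follow the usual `(a, b, c, d, e, f)` ordering; the bounding box is an assumption taken from the hard-coded bounds in `plot_tif_img_on_map`, not read from the file's metadata.

```python
def pixel_to_lonlat(row, col, transform):
    """Map a pixel (row, col) to the coordinates of its upper-left corner.

    `transform` is (a, b, c, d, e, f) with
    x = a*col + b*row + c and y = d*col + e*row + f.
    """
    a, b, c, d, e, f = transform
    return a * col + b * row + c, d * col + e * row + f

# Assumed values for illustration: a 475 x 148 north-up raster covering
# longitudes -67.3 to -65.2 and latitudes 18.6 down to 17.9
width, height = 475, 148
west, north, east, south = -67.3, 18.6, -65.2, 17.9
transform = ((east - west) / width, 0.0, west,
             0.0, (south - north) / height, north)

top_left = pixel_to_lonlat(0, 0, transform)
bottom_right = pixel_to_lonlat(height, width, transform)
print(top_left, bottom_right)
```

The negative `e` coefficient reflects that row numbers increase downwards while latitude increases upwards.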

***The images in the gldas dataset have 12 channels.***

We will check the same for the Global Forecast System by NOAA (gfs).

In [None]:
image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/gfs/gfs_2018070106.tif'

#load the image
band = rio.open(image)
show(band)
#All Metadata for the whole raster dataset
band.meta

* ***The images in the gfs dataset have 6 channels.***

We will check the same for the Sentinel 5P OFFL NO2 by EU/ESA/Copernicus (s5p_no2).

In [None]:
image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180718T173458_20180724T193016.tif'

#load the image
band = rio.open(image)
show(band)
#All Metadata for the whole raster dataset
band.meta

* ***The images in the s5p_no2 dataset have 12 channels.***

In [None]:
quantity_of_electricity_generated = power_plants_df['estimated_generation_gwh'][29:30].values
print('Quantity of Electricity Generated: ', quantity_of_electricity_generated)

In [None]:
image = '/kaggle/input/ds4g-environmental-insights-explorer/eie_data/s5p_no2/s5p_no2_20180718T173458_20180724T193016.tif'

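# Note: this averages over every band in the file, not only the NO2 density band;
# nanmean ignores missing pixels
average_no2_emission = [np.nanmean(tiff.imread(image))]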
print('Average NO2 emissions value: ', average_no2_emission)

In [None]:
# Both inputs hold a single value, so take the first element of each before dividing
simplified_emissions_factor = float(np.ravel(average_no2_emission)[0] / np.ravel(quantity_of_electricity_generated)[0])
print('Simplified emissions factor (S.E.F.) for a single power plant on the island of Vieques =  \n\n', simplified_emissions_factor, 'S.E.F. units')

* **This is just a sample emission factor calculation.**
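The sample calculation can be wrapped in a small function so it can be repeated for other images and plants. This is only a sketch of the simplified ratio, not a validated emissions factor: the function name, the toy band and the 50 GWh figure are assumptions for illustration.

```python
import numpy as np

def simplified_emissions_factor(no2_band, generation_gwh):
    """Mean NO2 value (ignoring NaN pixels) divided by electricity generated."""
    if generation_gwh <= 0:
        raise ValueError('generation_gwh must be positive')
    return float(np.nanmean(no2_band) / generation_gwh)

# Made-up inputs: a tiny "NO2 band" with one missing pixel, and 50 GWh
toy_band = np.array([[1e-4, 2e-4], [np.nan, 3e-4]])
toy_sef = simplified_emissions_factor(toy_band, 50.0)
print(toy_sef)
```

Passing a single band rather than the whole multi-band image keeps unrelated channels out of the average.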

**To find the emission factor, we need every variable in the formula:**

**E = A x EF x (1-ER/100)**

where:

**E = emissions; A = activity rate; EF = emission factor; and ER = overall emission reduction efficiency (%).**

**To get all of these factors, I think we need to include more data and see whether remote sensing techniques can be used to better model emissions factors, and to develop a methodology for calculating an average historical emissions factor for electricity generation in a sub-national region.**

**I am planning a more accurate methodology and will update this kernel soon.**

**Please upvote if you like it; that motivates me to keep working :)**