# Understanding Sentinel-5P OFFL NO2

Before we get into modelling, it's critical that we understand exactly what the data represents. Unlike many other competitions on Kaggle (that I've done before), this one requires a lot of scientific background, and the datasets are complex.

Let's look at the Sentinel-5P OFFL NO2 dataset as it is one of the most important ones for predicting emission factors at distance.

Firstly, what does the name even mean? 

Sentinel-5P OFFL NO2:
* Sentinel-5P: This is the name of the satellite, the Sentinel-5 Precursor which was launched on 13 October 2017 by the European Space Agency to monitor air pollution. (pictured below)
* OFFL: Offline data. This is in contrast to NRTI (near real time) data, which is released sooner after the measurement is taken; the offline stream is processed later.
* NO2: Nitrogen dioxide. Nitrogen oxides (NO2 and NO) are important trace gases in the Earth's atmosphere, present in both the troposphere and the stratosphere. They enter the atmosphere as a result of anthropogenic activities (in our case, power generation) and natural processes (e.g. wildfires and lightning, we'll have to watch out for these events). Note that here, NO2 is used to represent concentrations of collective nitrogen oxides because during daytime, i.e. in the presence of sunlight, a photochemical cycle involving ozone (O3) converts NO into NO2 and vice versa on a timescale of minutes. So our final values won't just be NO2.

<img src="https://airbus-h.assetsadobe2.com/is/image/content/dam/products-and-solutions/space/earth-observation/sentinel/Sentinel-5-Precursor-inOrbit-Copyright-Max-Alexander-Airbus2017.jpg?wid=1920&fit=fit,1&qlt=85,0" alt="Drawing" style="width: 600px;"/>

Image from airbus: <https://airbus-h.assetsadobe2.com/is/image/content/dam/products-and-solutions/space/earth-observation/sentinel/Sentinel-5-Precursor-inOrbit-Copyright-Max-Alexander-Airbus2017.jpg>

You can also check out this inspiring video: https://youtu.be/zP0UVKAwdMc.
Or this blog post from the Sentinel team themselves: https://medium.com/sentinel-hub/measuring-air-pollution-from-space-7492f5dad7bc.

Information from: <https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S5P_OFFL_L3_NO2>.


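You might also be wondering how a satellite even records NO2 levels. Satellites estimate what is in the atmosphere from sunlight. For particles, they compare how much light reaches the surface of the Earth with how much is reflected off the aerosols, a measurement called aerosol optical depth or aerosol optical thickness. For a trace gas like NO2, the instrument on Sentinel-5P (TROPOMI, a spectrometer) instead looks at how strongly the reflected sunlight is absorbed at the wavelengths characteristic of NO2.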

Information from: <https://terra.nasa.gov/citizen-science/air-quality/part-ii-track-pollution-from-space>.


Alright, that's step one. Now we know roughly what to expect from the data. Let's open a sample image.

In [None]:
import tifffile as tiff

import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Import Data

We'll work with just the first image for now, and assume that similar conclusions can be drawn about the other images.

In [None]:
data_path = '/kaggle/input/ds4g-environmental-insights-explorer'

image = '/eie_data/s5p_no2/s5p_no2_20190629T174803_20190705T194117.tif'

data = tiff.imread(data_path + image)

print('Data shape:', data.shape)

In [None]:
print('Data sample')
print(data[:2])

Eek, that's a lot of NaNs, and the numbers vary from 10^-5 to 10^1. What does it all mean?
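
To see where that spread comes from, it can help to print the minimum and maximum along the last axis of the array. This is a minimal sketch on a small synthetic array standing in for `data`; the helper name `band_ranges` and the demo values are made up for illustration.

```python
import numpy as np

def band_ranges(arr):
    """Return (min, max) per slice of the last axis, ignoring NaNs."""
    flat = arr.reshape(-1, arr.shape[-1])
    return list(zip(np.nanmin(flat, axis=0), np.nanmax(flat, axis=0)))

# Synthetic stand-in for `data`: two slices on very different scales, with NaNs
demo = np.array([[[1e-5, 10.0], [np.nan, 20.0]],
                 [[3e-5, 30.0], [2e-5, np.nan]]])

for i, (lo, hi) in enumerate(band_ranges(demo)):
    print('slice {}: min={:.2E}, max={:.2E}'.format(i, lo, hi))
```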

## Measurements Overview

Jumping back to our dataset description, we can find further information.

The dataset is divided into bands, where each band represents something that was measured. In this case, we have 12 bands. Each band holds one value per location on the grid (in arc degrees).

So above we saw we have 148 x 475 x 12. We have 12 different measurements each with 148 x 475 = 70,300 locations.

Let's see just one location:

In [None]:
data[0][0]

As noted before, we have a wide range of values. Let's list exactly what they mean.

![](https://i.imgur.com/I2xNQ1Y.png)

The first band is good to see: it's the amount of NO2 in mol per square meter (a column density), which is exactly what we're looking for. The creators of this dataset have done all the hard work in calculating the number for us.

We could just stop here, but let's have a look at the other values.

The next three measurements are related to different parts of the NO2 measurement which can be split out. Notably, we see that the majority of the NO2 sits in the troposphere.

The next three measurements describe some weather conditions.

The last five measurements describe the satellite's altitude and the sensor and solar angles.
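
To avoid juggling raw indices, the bands can be addressed by name. The sketch below uses the band order that the plots in this notebook rely on (indices 0 to 11); the helper `split_bands` and the zero-filled demo array are illustrative, not part of the competition data.

```python
import numpy as np

# Band order as used by the plots in this notebook (indices 0-11)
BAND_NAMES = [
    'NO2_column_number_density',
    'tropospheric_NO2_column_number_density',
    'stratospheric_NO2_column_number_density',
    'NO2_slant_column_number_density',
    'tropopause_pressure',
    'absorbing_aerosol_index',
    'cloud_fraction',
    'sensor_altitude',
    'sensor_azimuth_angle',
    'sensor_zenith_angle',
    'solar_azimuth_angle',
    'solar_zenith_angle',
]

def split_bands(arr, names=BAND_NAMES):
    """Map each band name to its 2-D (rows, cols) slice of the array."""
    if arr.shape[-1] != len(names):
        raise ValueError('expected {} bands, got {}'.format(len(names), arr.shape[-1]))
    return {name: arr[:, :, i] for i, name in enumerate(names)}

# Placeholder array with the same shape as the sample image
demo = np.zeros((148, 475, 12))
bands = split_bands(demo)
print(bands['cloud_fraction'].shape)
```

With the real image, `split_bands(data)['cloud_fraction']` would be equivalent to `data[:, :, 6]`.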


### Null Values

Before we look at the detailed values, do you remember seeing the NaN values above? Let's check how many there are - and try to work out why.

In [None]:
f = plt.figure()
f.set_size_inches(12, 9)
for i in range(12):
    plt.subplot(3, 4, i+1)
    # One heatmap per band; NaN cells are left blank by seaborn
    sns.heatmap(data[:, :, i], cbar=False)
    ax = plt.gca()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    # https://seaborn.pydata.org/generated/seaborn.heatmap.html

So we can see that at least half of the values are missing, and they're consistent across the different measurements. This is the boundary for the latitude / longitude that this dataset contains.
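
The impression that the gaps are consistent across measurements can be checked directly rather than by eye. A minimal sketch, using a small synthetic array in place of `data`; the helper name `nan_mask_is_shared` is made up for this example.

```python
import numpy as np

def nan_mask_is_shared(arr):
    """True if every band is NaN at exactly the same pixels as band 0."""
    mask = np.isnan(arr)
    # Broadcast band 0's mask against all bands and compare
    return bool((mask == mask[:, :, :1]).all())

demo = np.ones((4, 6, 3))
demo[:, 3:, :] = np.nan        # right half missing in all bands
print(nan_mask_is_shared(demo))

demo[0, 0, 1] = np.nan         # one extra gap in band 1 only
print(nan_mask_is_shared(demo))
```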

Let's calculate the exact value.

In [None]:
print('{:.0f}% of the dataset is null'.format(
    np.isnan(data[:, :, 0]).sum() / np.multiply(*data[:, :, 0].shape)*100
))

print('Last measurement at y index {}'.format(np.argwhere(np.isnan(data[0, :, 0])).min()-1))

## Detailed Look

Now let's take a detailed look at the parts of data (ignoring the nulls).

In [None]:
data_nn = data[:, :177, :]  # no nulls
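
The cut-off column above is hard-coded from the previous output. For other images it may differ, so here is a hedged sketch of deriving it from the array itself. Unlike the earlier check, which only looks at the first row, this version looks at every row; `first_nan_column` is an illustrative helper and `demo` is synthetic.

```python
import numpy as np

def first_nan_column(band):
    """Index of the first column containing a NaN, or the column count if none."""
    cols_with_nan = np.isnan(band).any(axis=0)
    return int(np.argmax(cols_with_nan)) if cols_with_nan.any() else band.shape[1]

demo = np.ones((3, 8))
demo[:, 5:] = np.nan           # columns 5..7 are missing

cut = first_nan_column(demo)
print(cut, np.isnan(demo[:, :cut]).any())
```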

### NO2

We saw four different measurements of NO2, how do they differ?

In [None]:
titles = ['NO2_column_number_density',
          'tropospheric_NO2_column_number_density', 
          'stratospheric_NO2_column_number_density',
          'NO2_slant_column_number_density']

f = plt.figure()
f.set_size_inches(8, 8)
for i in range(4):
    plt.subplot(2, 2, i+1)
    # Bands 0-3 are the four NO2 measurements
    sns.heatmap(data_nn[:, :, i], cbar=False)
    ax = plt.gca()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    plt.title(titles[i], fontsize=10)
    # https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
for i in range(4):
    print('{}: {:.2E} mol/m^2'.format(titles[i], np.nanmean(data_nn[:, :, i])))


Three of the four measurements all look basically the same, which suggests that our assumption to simply take the first value is good. The outlier is the stratospheric value, which seems to gradually increase from top to bottom. It doesn't affect the first measurement since the magnitude is very small.
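
One simple way to compare the magnitudes is the ratio of the band means. The sketch below uses made-up values in mol/m^2, not numbers from the image; `mean_share` is a hypothetical helper.

```python
import numpy as np

def mean_share(part, total):
    """Ratio of the NaN-ignoring means of two bands."""
    return np.nanmean(part) / np.nanmean(total)

# Hypothetical values in mol/m^2
total = np.array([[6.0e-5, 7.0e-5], [np.nan, 7.0e-5]])
tropo = np.array([[4.0e-5, 5.0e-5], [np.nan, 6.0e-5]])

print('{:.0%}'.format(mean_share(tropo, total)))
```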

Bonus (for those who need a refresher): the troposphere is the lowest level of the atmosphere; it is where humans live, and where weather happens. The stratosphere sits above, has a much lower air density / pressure and contains the ozone layer. See the image below.

![](https://scied.ucar.edu/sites/default/files/images/large_image_for_image_content/atmosphere_layers_diagram_720x440.jpg)

Image from the Center for Science Education: <https://scied.ucar.edu>.

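We see that the slant column number density is slightly higher than the first. The slant column is the amount of NO2 along the slanted path that the light actually travels through the atmosphere; dividing it by the total air mass factor (the ratio of the slant column to the vertical column) reduces it to the vertical column in the first band.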
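
To make the relationship between the two columns concrete, here is a minimal sketch that treats the effective air mass factor as the per-pixel ratio of slant to vertical column. The arrays are hypothetical values in mol/m^2 and `effective_air_mass_factor` is an illustrative name, not something provided by the dataset.

```python
import numpy as np

def effective_air_mass_factor(slant, vertical):
    """Per-pixel ratio of slant to vertical column; NaN where either is missing."""
    return slant / vertical

# Hypothetical values in mol/m^2
vertical = np.array([6.0e-5, 7.0e-5, np.nan])
slant = np.array([1.5e-4, 1.4e-4, 1.6e-4])

amf = effective_air_mass_factor(slant, vertical)
print(np.round(amf, 2))
```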

Let's just reiterate that this first measurement NO2_column_number_density appears to be the measurement we want to use for this competition. For example, I could say that during this period the average concentration was 6.636E-05 mol / m^2 which can be fed straight into our Emissions Factor equation 

$$EF = \frac{E}{A  \times (1-ER/100)}$$

as E. We can then combine it with our Activity data from the powerplant dataset.
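
As a worked example of the equation, here is a small sketch. It assumes `ER` is an emission reduction efficiency expressed in percent and `A` is the activity; the input numbers are hypothetical and their units are deliberately left abstract.

```python
def emission_factor(emissions, activity, reduction_pct=0.0):
    """EF = E / (A * (1 - ER/100)), with ER given in percent."""
    if not 0 <= reduction_pct < 100:
        raise ValueError('reduction_pct must be in [0, 100)')
    if activity <= 0:
        raise ValueError('activity must be positive')
    return emissions / (activity * (1 - reduction_pct / 100))

# Hypothetical inputs
print(emission_factor(emissions=50.0, activity=200.0))
print(emission_factor(emissions=50.0, activity=200.0, reduction_pct=50.0))
```

Note how a larger reduction efficiency raises the factor: the same measured emissions then imply more uncontrolled emissions per unit of activity.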


### Weather

Let's look at the weather:
* tropopause_pressure
* absorbing_aerosol_index
* cloud_fraction

In [None]:
titles = ['tropopause_pressure', 'absorbing_aerosol_index', 'cloud_fraction']
f = plt.figure()
f.set_size_inches(12,4)
for i in range(3):
    plt.subplot(1, 3, i+1)
    # Bands 4-6 are the weather-related measurements
    sns.heatmap(data_nn[:, :, 4+i], cbar=False)
    ax = plt.gca()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    plt.title(titles[i], fontsize=16)
    # https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
for i in range(3):
    print('{}: {:.2f}'.format(titles[i], np.nanmean(data_nn[:, :, i+4])))


We can see from the latter two figures that there was some cloud coverage on this day towards the center-right. The tropopause pressure is mostly inversely correlated with it, but not exactly.
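
One way to put a number on that visual impression is a Pearson correlation over the pixels where both bands have values. A minimal sketch with made-up arrays; `nan_corr` is an illustrative helper.

```python
import numpy as np

def nan_corr(a, b):
    """Pearson correlation of two bands over pixels where both are finite."""
    a, b = np.ravel(a), np.ravel(b)
    keep = np.isfinite(a) & np.isfinite(b)
    return np.corrcoef(a[keep], b[keep])[0, 1]

# Hypothetical bands: pressure falls as cloud fraction rises
cloud = np.array([[0.0, 0.2], [0.4, np.nan]])
pressure = np.array([[30.0, 20.0], [10.0, 5.0]])

print(round(nan_corr(cloud, pressure), 2))
```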

The weather information is important because it tells us whether our data is reliable. For example, cloud-covered scenes or partially snow/ice covered scenes can cause problems.

Information from: <http://www.tropomi.eu/sites/default/files/files/publicSentinel-5P-Nitrogen-Dioxide-Level-2-Product-Readme-File.pdf>.
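
If cloudy pixels are a concern, one option is to blank them out before averaging. The sketch below is only an illustration: the threshold of 0.3 is an arbitrary assumption, not a recommended value, and the arrays are made up.

```python
import numpy as np

def mask_cloudy(no2, cloud_fraction, max_cloud=0.3):
    """Copy of no2 with NaN where cloud_fraction exceeds max_cloud or is missing."""
    return np.where(cloud_fraction <= max_cloud, no2, np.nan)

# Hypothetical values: third pixel is cloudy, fourth has no cloud information
no2 = np.array([5e-5, 6e-5, 9e-5, 7e-5])
cloud = np.array([0.1, 0.2, 0.8, np.nan])

clear = mask_cloudy(no2, cloud)
print('{:.2E} mol/m^2'.format(np.nanmean(clear)))
```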

### Satellite Information

Finally, we can look at the satellite information:

In [None]:
titles = ['sensor_altitude', 'sensor_azimuth_angle', 'sensor_zenith_angle',
          'solar_azimuth_angle', 'solar_zenith_angle']

f = plt.figure()
f.set_size_inches(12, 8)
for i in range(5):
    plt.subplot(2, 3, i+1)
    # Bands 7-11 describe the sensor and solar geometry
    sns.heatmap(data_nn[:, :, 7+i], cbar=False)
    ax = plt.gca()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    plt.title(titles[i], fontsize=16)
    # https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [None]:
for i in range(5):
    print('{}: {:.2f}'.format(titles[i], np.nanmean(data_nn[:, :, i+7])))

Not much to see here, but interestingly, our satellite is sitting at 838km (relative to WGS84). WGS84 is a reference ellipsoid that approximates the Earth's surface, not the center of the Earth, so this means our satellite is >800km above the surface. That may seem high, but it is in line with the low Earth, sun-synchronous orbit that Sentinel-5P flies in (nominally around 824km).


WGS84 notes: https://confluence.qps.nl/qinsy/latest/en/world-geodetic-system-1984-wgs84-182618391.html
Orbits notes: https://earthobservatory.nasa.gov/features/OrbitsCatalog

## Conclusion

Looking into the data, we've understood that:
* the data has a large amount of nulls
* each band represents a different type of measurement (some more valuable than others), and
* how we can get an estimate of the NO2 concentration from a TIFF file for use in this competition

As a next step, we can begin to investigate:
* temporal effects from other TIFF files
* other correlations between measurements
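
For the temporal angle, the file names already carry two timestamps. The sketch below parses them, assuming (this is not confirmed by the dataset description) that they mark the start and end of the period covered by the image; `parse_s5p_window` is an illustrative helper.

```python
from datetime import datetime
import os

def parse_s5p_window(path):
    """Return two datetimes parsed from the last two '_' fields of the file name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    first, second = stem.split('_')[-2:]
    fmt = '%Y%m%dT%H%M%S'
    return datetime.strptime(first, fmt), datetime.strptime(second, fmt)

# File name of the sample image used in this notebook
start, end = parse_s5p_window(
    '/eie_data/s5p_no2/s5p_no2_20190629T174803_20190705T194117.tif')
print(start, end, end - start)
```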

Further reading:
* Readme for the dataset: <http://www.tropomi.eu/sites/default/files/files/publicSentinel-5P-Nitrogen-Dioxide-Level-2-Product-Readme-File.pdf>