# CSS 120: Environmental Data Science

## Remote Sensing

### Umberto Mignozzetti (UCSD)

(Based on Project Pythia and ClimateMatch)

## Packages

In [None]:
# Packages
# !pip install s3fs --quiet

In [None]:
# Packages
import s3fs
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs
import datetime
import boto3
import botocore
import pooch
import os
import tempfile

## Announcements

1. PS02 and Lab03 will be posted later this weekend.

2. I will post our make-up lecture every Thursday.

We have three make-up lectures, and three weeks left, so it matches perfectly.

In the make-up, I am focusing on climate change science, so that we do most of the computational stuff in class.

## Today's Lecture

Satellites provide a wealth of reliable data about the environment.

In this lecture, we will study:

1. An array of long-term satellite remote sensing datasets that are tailored for climate applications, brought to you by three leading data providers: NOAA, NASA, and ESA.

2. How to navigate and use these datasets to study vegetation and precipitation.

## Exploring Satellite Climate Data Records

In 2004, a committee convened by the US National Research Council defined a [***Climate Data Record (CDR)***](https://www.ncei.noaa.gov/products/climate-data-records) as "a time series of measurements of sufficient length, consistency and continuity to determine climate variability and change."

Although there is no specific number that determines the "sufficient length", the typical climate length is considered to be at least 30 years.

To achieve a stable, consistent, and reliable satellite CDR, we need to carefully calibrate the raw satellite data. 

## Satellite Missions for Environmental Monitoring

The image below illustrates a timeline showcasing a selection of polar-orbiting satellite missions whose data are commonly employed to generate satellite Climate Data Records (CDRs). 

![satellite timeline](https://github.com/ClimateMatchAcademy/course-content/blob/main/tutorials/W1D3_RemoteSensingLandOceanandAtmosphere/asset/img/t2_satellite_timeline.png?raw=true)

Credit: Douglas Rao

## Inter-satellite Calibration

To address the differences that are caused by sensor and satellite changes, we often perform an **inter-satellite calibration**.

This adjusts the raw data collected by different satellites to a pre-defined reference to remove or minimize the systematic difference between data. 

The pre-defined reference is usually determined using data during the period of time when the satellites overlap. 
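
As a rough illustration of this idea, the sketch below estimates a constant offset between two sensors from their overlap period and applies it to the newer record. The numbers, the variable names, and the choice of a simple mean-offset correction are all assumptions made for illustration; operational calibrations are typically more elaborate.

```python
from statistics import mean

# Hypothetical yearly observations (arbitrary units) from two satellites.
# Satellite A is treated as the pre-defined reference; B reads systematically higher.
sat_a = {1998: 10.1, 1999: 10.3, 2000: 10.2, 2001: 10.4}
sat_b = {2000: 10.7, 2001: 10.9, 2002: 11.0, 2003: 11.2}

# 1. Find the years when both satellites were observing.
overlap = sorted(set(sat_a) & set(sat_b))

# 2. Estimate the systematic difference (B minus A) over the overlap.
offset = mean(sat_b[y] - sat_a[y] for y in overlap)

# 3. Shift B onto A's scale and merge, preferring the reference where both exist.
sat_b_adjusted = {y: v - offset for y, v in sat_b.items()}
merged = {**sat_b_adjusted, **sat_a}

print(f"overlap years: {overlap}")
print(f"estimated offset: {offset:.2f}")
```

After the adjustment, the merged record no longer jumps at the hand-over between the two satellites, which is the property a climate trend analysis needs.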

## Inter-satellite Calibration

![](https://github.com/ClimateMatchAcademy/course-content/blob/main/tutorials/W1D3_RemoteSensingLandOceanandAtmosphere/asset/img/t2_calibration_pt2.png?raw=true)

## Inter-satellite Calibration

When choosing satellite data for climate applications, you should be asking yourself a few questions:

* Are the data that you are planning to use collected by the same sensor and satellite?
* If the data are from multiple satellites/sensors, are there steps to ensure the data are consistent across different satellites/sensors?
* Can you find out how the inter-satellite calibration is done and what is the level of difference between satellites/sensors?

These questions will help you determine if the remotely sensed data is appropriate for climate applications that you are interested in.

## NOAA Climate Data Records

The **National Oceanic and Atmospheric Administration (NOAA)** implemented the recommendation from the US National Research Council to develop satellite-based climate data records in the 2000s. 

They have maintained a suite of operational CDRs that can be used to study different aspects of the changing climate system since then.

All NOAA CDR data are available freely to the public via [NOAA National Centers for Environmental Information](https://www.ncei.noaa.gov/products/climate-data-records). 

Recently, the [NOAA Open Data Dissemination Program](https://www.noaa.gov/information-technology/open-data-dissemination) also made all NOAA CDRs available on three major commercial cloud service providers (i.e., Amazon Web Services, Google Cloud, and Microsoft Azure).

The NOAA Climate Data Records (CDRs) are available to anyone interested in accessing the data and are typically free of charge.

## NOAA Climate Data Records

NOAA CDRs have two different categories with different purposes:

* _Fundamental CDR (FCDR)_: This category consists of high-quality, low-level processed satellite sensor data, such as reflectance and brightness temperature. The FCDR datasets are carefully calibrated between satellites and sensors to ensure accuracy and consistency. These datasets are primarily used to assess and improve Earth system models (which you will learn about next week).
* _Thematic CDR_: Thematic CDRs provide valuable information for understanding climate processes and changes in various domains.  The thematic CDRs are divided into terrestrial, atmospheric, and ocean categories to reflect the different components of the climate system.

## NOAA Climate Data Records

The table below lists a selection of thematic CDRs operated by NOAA. You can find out more about all NOAA CDRs by visiting the webpage of each CDR category:

* [Fundamental CDR](https://www.ncei.noaa.gov/products/climate-data-records/fundamental) - 16 datasets
* [Thematic CDR: Terrestrial](https://www.ncei.noaa.gov/products/climate-data-records/terrestrial) - 4 datasets
* [Thematic CDR: Atmospheric](https://www.ncei.noaa.gov/products/climate-data-records/atmospheric) - 18 datasets
* [Thematic CDR: Oceanic](https://www.ncei.noaa.gov/products/climate-data-records/oceanic) - 5 datasets

## NOAA Climate Data Records

| Dataset | Category | Start Year | Frequency | Spatial Resolution | Example Application Areas |
|:--|:--:|:--:|:--:|:--:|:--|
|Leaf Area Index and FAPAR|Terrestrial|1981|Daily|0.05°| Vegetation status monitoring; Agriculture monitoring; Crop yield/food security|
|Normalized Difference Vegetation Index (NDVI)|Terrestrial|1981|Daily|0.05°|Vegetation status monitoring; Vegetation phenology study|
|Snow Cover Extent (Northern Hemisphere)|Terrestrial|1966|Weekly (prior to 1999-06) <br><br> Daily (after 1999-06)|~190 km| Hydrology; Water resources; Snow-climate feedback|
|Aerosol Optical Thickness|Atmospheric|1981|Daily & Monthly|0.1°|Air quality; Aerosol-climate feedback|
|PATMOS-x Cloud Properties|Atmospheric|1979|Daily|0.1°|Cloud process; Cloud-climate feedback|
|Precipitation - PERSIANN|Atmospheric|1982|Daily|0.25° <br><br> (60°S–60°N)|Hydrology; Water resources; Extreme events|
|Sea Surface Temperature - Optimum Interpolation|Oceanic|1981|Daily|0.25°|Climate variability; Marine heatwave; Marine ecosystem|
|Sea Ice Concentration|Oceanic|1978|Daily & Monthly|25 km|Cryosphere study; Wildlife conservation; Ocean/climate modeling|

## ESA Climate Change Initiative

The **European Space Agency (ESA)** initiated a similar effort around 2010 to develop consistent satellite-based long-term records that support the mission of climate monitoring for societal benefit.

**[ESA Climate Change Initiative (CCI)](https://climate.esa.int/en/esa-climate/esa-cci/)** has established more than 26 projects to develop satellite-based CDRs and directly engage with downstream users.

Through CCI, there is very strong emphasis on applications to support the monitoring of [**essential climate variables** (ECVs) defined by Global Climate Observing System (GCOS)](https://public.wmo.int/en/programmes/global-climate-observing-system/essential-climate-variables). 

An ECV is defined as "a physical, chemical or biological variable or a group of linked variables that critically contributes to the characterization of Earth's climate."

The table below lists a selection of ESA CCI datasets and their example application areas.

## ESA Climate Change Initiative

| Dataset | Category | Duration | Frequency | Spatial Resolution | Example Application Areas |
|:--|:--:|:--:|:--:|:--:|:--|
|Sea Level|Oceanic|1992-2015|Monthly|0.25°|Sea level rise; Ocean modeling|
|Water Vapor|Atmospheric|1985-2019|Monthly|5°|Water vapor-climate feedback; Hydrology|
|Fire|Terrestrial|1981-2020|Monthly|0.05° (pixel dataset) <br><br> 0.25° (grid dataset)|Ecosystem disturbance; Extreme events; Social impact|
|Land Cover|Terrestrial|1992-2020|Yearly|300 m|Terrestrial modeling|
|Soil Moisture|Terrestrial|1978-2021|Daily|0.25°|Hydrology; Ecosystem impacts; Extreme events|

## ESA Climate Change Initiative

You may notice that some datasets do not span the typical duration needed for climate studies (for instance, 30 years). This can happen for several reasons, such as:

- Legacy sensors were not designed or capable of accurately capturing the ECVs of interest.
- The CCI project is executed in stages, initiating with the most recent satellite missions/sensors. However, plans are underway to incorporate older data from heritage satellite missions/sensors.

Moreover, each ESA CCI project frequently offers different versions of ECV variables, each designed for specific applications.

The specifications of these ECV variables might deviate from the table above if they represent a subset of the time period, utilizing data from the latest sensors. 

The table primarily provides information on the longest time record for each CCI.

## ESA Climate Change Initiative

All ESA CCI data are openly accessible and free of charge to users without any restrictions. 

All these resources can be accessed via the [**ESA CCI Open Data Portal**](https://climate.esa.int/en/odp/#/dashboard).

To further assist users in accessing and analyzing the CCI data, ESA has also developed the [**CCI Analysis Toolbox (Cate)**](https://climate.esa.int/en/explore/analyse-climate-data/). 

It is described as a "cloud-enabled computing environment geared for scientists who need to analyze, process, and visualize ESA’s climate data and other spatiotemporal data."

## NASA Earth System Data Records

Similar to the other two satellite data providers, the **National Aeronautics and Space Administration (NASA)** also produces and distributes long-term satellite-based data records that may be suitable for different climate applications. 

NASA [Earth System Data Records (ESDRs)](https://www.earthdata.nasa.gov/esds/competitive-programs/measures?page=0) are defined as "a unified and coherent set of observations of a given parameter of the Earth system, which is optimized to meet specific requirements in addressing science questions."

While NASA's ESDR does not specifically target climate, these records are often created to monitor and study various components of the climate system.

For instance, surface temperature, global forest coverage change, atmospheric compositions, and ice sheet velocity are all areas of focus.

The table below showcases a selection of NASA ESDR datasets that nicely complement the satellite Climate Data Records (CDRs) offered by NOAA and ESA.

## NASA Earth System Data Records

| Dataset | Category | Duration | Frequency | Spatial Resolution | Example Application Areas |
|:--|:--:|:--:|:--:|:--:|:--|
|[Sulfur Dioxide](https://www.earthdata.nasa.gov/esds/competitive-programs/measures/multi-decadal-sulfur-dioxide)|Atmospheric|1978-2022|Daily|50 km|Atmospheric modeling; Air quality|
|[Ozone](https://disc.gsfc.nasa.gov/datasets/MSO3L3zm5_1/summary)|Atmospheric|1970-2013|Monthly|5°|Ozone monitoring; Air quality|
|[Sea Surface Height](https://podaac.jpl.nasa.gov/dataset/SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205)|Oceanic|1992-ongoing|5-Day|1/6°|Sea level rise; Ocean modeling|
|[GPCP Satellite-Gauge Combined Precipitation Data](https://disc.gsfc.nasa.gov/datasets/GPCPMON_3.2/summary)|Atmospheric|1983-2020|Daily & Monthly|0.5°|Hydrology; Extreme events|

## NASA Earth System Data Records

If you've visited the linked NASA ESDR page above, you may have noticed that it appears less structured compared to the NOAA CDR program and the ESA CCI open data portal.

This is partly because NASA operates different data centers for distinct application areas (e.g., atmosphere, snow/ice, land, etc.).

However, you can always visit [**NASA's Earth Data Search**](https://search.earthdata.nasa.gov/search) – a comprehensive portal for accessing datasets provided by NASA's Earth science data system.

To access the data, you'll be required to create a user account. Rest assured, registration for the NASA Earthdata system is free and open to anyone.

# Visualizing Satellite CDR - Global Vegetation Mapping

##  Helper functions

In [None]:
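def pooch_load(filelocation=None, filename=None, processor=None):
    """Return a local path for `filename`, downloading it from `filelocation` if needed."""
    # expand "~" so that the check below really looks in the user's home directory
    shared_location = os.path.expanduser("~/")
    user_temp_cache = tempfile.gettempdir()

    # reuse a local copy when one exists; otherwise download into the temp cache
    if os.path.exists(os.path.join(shared_location, filename)):
        file = os.path.join(shared_location, filename)
    else:
        file = pooch.retrieve(
            filelocation,
            known_hash=None,  # skip checksum verification
            fname=os.path.join(user_temp_cache, filename),
            processor=processor,
        )

    return file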

## Satellite Monitoring of Vegetation Status

All the **National Oceanic and Atmospheric Administration Climate Data Record (NOAA-CDR)** datasets are available both at NOAA National Centers for Environmental Information (NCEI) and on commercial cloud platforms. 

Here, we are accessing the data directly via **Amazon Web Services (AWS)**. You can find information about the NOAA CDRs on AWS's Open Data Registry.

* [NOAA Fundamental CDR on AWS](https://registry.opendata.aws/noaa-cdr-fundamental/) 
* [NOAA Terrestrial CDR on AWS](https://registry.opendata.aws/noaa-cdr-terrestrial/) 
* [NOAA Atmospheric CDR on AWS](https://registry.opendata.aws/noaa-cdr-atmospheric/) 
* [NOAA Oceanic CDR on AWS](https://registry.opendata.aws/noaa-cdr-oceanic/) 

## Satellite Monitoring of Vegetation Status

The index we will use in this lecture is the **Normalized Difference Vegetation Index (NDVI)**.

It measures the "greenness" of vegetation, and is useful in understanding vegetation density and assessing changes in plant health.

For example, NDVI can be used to study the impact of drought, heatwave, and insect infestation on plants covering Earth's surface.

## Access NOAA NDVI CDR Data from AWS

If we go to the [cloud storage space (or an S3 bucket)](https://noaa-cdr-ndvi-pds.s3.amazonaws.com/index.html#data/) that hosts NOAA NDVI CDR data, we will see the pattern of how the NOAA NDVI CDR is organized:

`s3://noaa-cdr-ndvi-pds/data/1981/AVHRR-Land_v005_AVH13C1_NOAA-07_19810624_c20170610041337.nc`

We can take advantage of the pattern to search for the data file systematically. 

> Parent directory: `s3://noaa-cdr-ndvi-pds/data/`  
> Sub-directory for each year: `1981/`  
> File name of each day: `AVHRR-Land_v005_AVH13C1_NOAA-07_19810624_c20170610041337.nc`

## Access NOAA NDVI CDR Data from AWS

The file name also has a clear pattern:

> Sensor name: `AVHRR`  
> Product category: `Land`  
> Product version: `v005`  
> Product code: `AVH13C1`  
> Satellite platform: `NOAA-07`  
> Date of the data: `19810624`  
> Processing time: `c20170610041337` (*This will change for each file based on when the file was processed*)  
> File format: `.nc` (*netCDF-4 format*)

In other words, if we are looking for the data of a specific day, we can easily locate where the file might be. 
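
The naming pattern above can be sketched as a regular expression that splits a file name into its parts. The pattern is inferred from the single example file name shown earlier, so it is an assumption that every file in the bucket follows it exactly; the group names are illustrative labels.

```python
import re
from datetime import datetime

# Pattern inferred from the example file name; an assumption, not an official spec.
NDVI_NAME = re.compile(
    r"(?P<sensor>[A-Z]+)-(?P<category>[A-Za-z]+)_(?P<version>v\d+)_"
    r"(?P<code>[A-Z0-9]+)_(?P<platform>[A-Za-z0-9-]+)_"
    r"(?P<date>\d{8})_c(?P<processed>\d{14})\.nc"
)

name = "AVHRR-Land_v005_AVH13C1_NOAA-07_19810624_c20170610041337.nc"
parts = NDVI_NAME.fullmatch(name).groupdict()

# Turn the YYYYMMDD text into a proper date object.
obs_date = datetime.strptime(parts["date"], "%Y%m%d").date()

print(parts["sensor"], parts["platform"], obs_date)
```

Parsing the name this way makes it easy to sort or filter files by date or platform without opening them.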

## Access NOAA NDVI CDR Data from AWS

For example, if we want to find the AVHRR data for the day of *2002-03-12 (or March 12, 2002)*, we can use:

`s3://noaa-cdr-ndvi-pds/data/2002/AVHRR-Land_v005_AVH13C1_*_20020312_c*.nc`

The reason that we put `*` in the above path is that we are not sure which satellite platform this data is from or when the data was processed.

The `*` is called a **wildcard**. We use it because we want *all* the files that match our specific criteria, without having to specify the other pieces of the filename that we are not sure about yet.

It should return all the data satisfying those initial criteria, and you can refine further once you see what is available. Essentially, this first step helps to narrow down the data search.
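
To see how a wildcard narrows a search without touching the network, the sketch below matches a pattern against a small list of object keys using Python's standard `fnmatch` module. The keys are hypothetical: the platform name and processing timestamps for these 2002 files are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical object keys; platform and processing stamps are invented.
keys = [
    "noaa-cdr-ndvi-pds/data/2002/AVHRR-Land_v005_AVH13C1_NOAA-16_20020311_c20170614000000.nc",
    "noaa-cdr-ndvi-pds/data/2002/AVHRR-Land_v005_AVH13C1_NOAA-16_20020312_c20170614000001.nc",
    "noaa-cdr-ndvi-pds/data/2002/AVHRR-Land_v005_AVH13C1_NOAA-16_20020313_c20170614000002.nc",
]

# Each * stands for "any run of characters": the platform and the processing time.
pattern = "noaa-cdr-ndvi-pds/data/2002/AVHRR-Land_v005_AVH13C1_*_20020312_c*.nc"

matches = [k for k in keys if fnmatchcase(k, pattern)]
print(matches)
```

Only the key dated `20020312` survives, even though we never specified the platform or the processing time.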

## Access NOAA NDVI CDR Data from AWS

In [None]:
# to access the NDVI data from the AWS S3 bucket, we first need to connect to the bucket
fs = s3fs.S3FileSystem(anon = True)

# we can now check whether the file exists in this cloud storage bucket using the file name pattern we just described
date_sel = datetime.datetime(
    2002, 3, 12, 0
)  # select a desired date and hour (midnight is zero)

# build the file name pattern from date_sel. we use strftime (string format time) to get the date as text.
file_location = fs.glob(
    "s3://noaa-cdr-ndvi-pds/data/"
    + date_sel.strftime("%Y")
    + "/AVHRR-Land_v005_AVH13C1_*"
    + date_sel.strftime("%Y%m%d")
    + "_c*.nc"
)
# now let's check whether any file matches the pattern for the date that we are interested in.
file_location

### Your turn

NDVI CDR data switched sensors in 2014 from AVHRR (the older generation sensor) to VIIRS (the newest generation sensor).

Using the code above and the [list of data names](https://noaa-cdr-ndvi-pds.s3.amazonaws.com/index.html#data/) for VIIRS, find data from a day after 2014. You will need to modify the string passed to `glob()` to do so.

In [None]:
# select a desired date and hours (midnight is zero)
exercise_date_sel = ...

# build the file name pattern from exercise_date_sel. we use strftime (string format time) to get the date as text.
exercise_file_location = ...

# now let's check whether any file matches the pattern for the date that we are interested in.
exercise_file_location

## Read NDVI CDR Data

Now that we have the location of the NDVI data for a specific date, we can read in the data using the Python library `xarray` to open the [netCDF-4 file](https://pro.arcgis.com/en/pro-app/latest/help/data/multidimensional/what-is-netcdf-data.htm).

First, we need to open the connection to the file object of the selected date. We are still using the date of 2002-03-12 as the example here.

For consistency, we are going to use `boto3` and `pooch` to open the file, but `s3fs` can also open files from S3 remotely.

## Read NDVI CDR Data

In [None]:
client = boto3.client(
    "s3", config=botocore.client.Config(signature_version=botocore.UNSIGNED)
)  # initialize aws s3 bucket client

ds = xr.open_dataset(
    pooch_load(
        filelocation = "http://s3.amazonaws.com/" + file_location[0],
        filename = file_location[0],
    ),
    decode_times = False # to address overflow issue
)  # open the file
ds

## Read NDVI CDR Data

The output from the code block tells us that the NDVI data file of 2002-03-12 has dimensions of `3600x7200`.

This makes sense for a dataset with the spatial resolution of 0.05°×0.05° that spans 180° of latitude and 360° of longitude.

There is another dimension of the dataset named `time`. Since it is a daily data file, it only contains one value.
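
The grid size follows directly from the resolution, as the short arithmetic sketch below shows. The block size of 5 used at the end is only an illustrative choice for aggregating neighbouring cells.

```python
resolution_deg = 0.05  # grid spacing stated for the NDVI CDR

n_lat = round(180 / resolution_deg)  # rows from 90°S to 90°N
n_lon = round(360 / resolution_deg)  # columns around the globe
n_cells = n_lat * n_lon              # values in a single daily map

block = 5  # illustrative block size for aggregating cells
coarse_shape = (n_lat // block, n_lon // block)

print(n_lat, n_lon, n_cells)
print(coarse_shape)
```

A single day therefore holds roughly 26 million grid cells, which is why reducing the resolution before plotting can save a lot of time and memory.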

## Read NDVI CDR Data

The two main data variables in this dataset are `NDVI` and `QA`.

* `NDVI` is the variable that contains the value of the Normalized Difference Vegetation Index (NDVI, which ranges between -1 and 1) and can be used to measure vegetation greenness.  

* `QA` is the variable that indicates the quality of the NDVI value for each corresponding grid cell. It reflects whether the data is of high quality or should be discarded for various reasons (e.g., bad sensor data, potential contamination by clouds).

## Visualize NDVI CDR Data

In [None]:
# examine NDVI values from the dataset
ndvi = ds.NDVI
ndvi

## Visualize NDVI CDR Data

To visualize the raw data, we will plot it using `matplotlib` by calling `.plot()` on our xarray `DataArray`.

Figure settings:
1. `vmin` & `vmax`: Minimum and maximum values for the legend
1. `aspect`: Setting the aspect ratio of the figure, must be combined with `size`
1. `size`: setting the overall size of the figure

In [None]:
# to make plotting faster and less memory intensive, use coarsen to reduce the number of pixels
ndvi.coarsen(latitude=5).mean().coarsen(longitude = 5).mean().plot(
    vmin=-0.1, vmax=1.0, aspect=1.8, size=5
)

## Mask NDVI Data Using a Quality Flag

There is also a variable `QA` that indicates the quality of the NDVI value for each grid cell.

This quality information is very important when using satellite data to ensure the climate analysis is done using only the highest quality data.

The NDVI CDR uses a complex quality flag system that is represented with 16 bits.

To read it, each `QA` value needs to be converted to a 16-bit binary value; the quality flags are then identified using the information listed in the table below. 

## Mask NDVI Data Using a Quality Flag

| Bit No. | Description | Value=1 | Value=0 |
|-:|:-|:-:|:-:|
|15|Flag to indicate if the pixel is in polar region|Yes|No|
|14|Flag to indicate BRDF-correction issues|Yes|No|
|13|Flag to indicate RH03 value is invalid|Yes|No|
|12|Flag to indicate AVHRR Channel 5 value is invalid|Yes|No|
|11|Flag to indicate AVHRR Channel 4 value is invalid|Yes|No|
|10|Flag to indicate AVHRR Channel 3 value is invalid|Yes|No|
| 9|Flag to indicate AVHRR Channel 2 value is invalid|Yes|No|
| 8|Flag to indicate AVHRR Channel 1 value is invalid|Yes|No|
| 7|Flag to indicate all 5 AVHRR Channels are valid|Yes|No|
| 6|Flag to indicate the pixel is at night (no visible channel data)|Yes|No|
| 5|Flag to indicate the pixel is over dense dark vegetation|Yes|No|
| 4|Flag to indicate the pixel is over sunglint (over ocean)|Yes|No|
| 3|Flag to indicate the pixel is over water|Yes|No|
| 2|Flag to indicate the pixel contains cloud shadow|Yes|No|
| 1|Flag to indicate the pixel is cloudy|Yes|No|
| 0|(Unused)|Yes|No|

## Mask NDVI Data Using a Quality Flag

This shows the complex system used to ensure that satellite CDR data is of high quality for climate applications. But how can we decipher the quality of a given pixel? 

Assume that we have a grid cell with `QA=162`. Converted into a binary value with a length of 16 bits, it becomes `0000000010100010`.

That is, every `QA` value is converted into a sequence of 1's and 0's that is 16 digits long. 

Converting our example value of 162, we have:

|Bit15|Bit14|Bit13|Bit12|Bit11|Bit10|Bit9|Bit8|Bit7|Bit6|Bit5|Bit4|Bit3|Bit2|Bit1|Bit0|
|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|
|0|0|0|0|0|0|0|0|1|0|1|0|0|0|1|0|
|No|No|No|No|No|No|No|No|Yes|No|Yes|No|No|No|Yes|No|

## Mask NDVI Data Using a Quality Flag

|Bit15|Bit14|Bit13|Bit12|Bit11|Bit10|Bit9|Bit8|Bit7|Bit6|Bit5|Bit4|Bit3|Bit2|Bit1|Bit0|
|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|-:|
|0|0|0|0|0|0|0|0|1|0|1|0|0|0|1|0|
|No|No|No|No|No|No|No|No|Yes|No|Yes|No|No|No|Yes|No|

Interpreting the table, for a quality flag of 162 (128 + 32 + 2):

1. The NDVI is retrieved from valid values of AVHRR channels (`Bit7=1`) 

2. The grid cell is over dense dark vegetation (`Bit5=1`)

3. But the grid cell is cloudy (`Bit1=1`). 

Therefore, the QA tells us that we should not use this grid cell since it is covered by clouds and does not reflect vegetation information on the land surface.
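
The same bit-by-bit reading can be sketched in plain Python with the built-in `format` function and bitwise operators. The bit positions come from the flag table; the short flag names, the helper `decode_qa`, and the two QA values are hypothetical and chosen only for illustration.

```python
# Bit positions taken from the flag table; the short names are illustrative labels.
FLAGS = {
    1: "cloudy",
    2: "cloud shadow",
    3: "water",
    4: "sunglint",
    5: "dense dark vegetation",
    6: "night",
    7: "all channels valid",
}


def decode_qa(qa):
    """Return the 16-bit string and the names of the flags set in `qa`."""
    bits = format(qa, "016b")  # leftmost character is bit 15, rightmost is bit 0
    active = [name for bit, name in FLAGS.items() if (qa >> bit) & 1]
    return bits, active


for qa in (128, 130):  # hypothetical QA values
    print(qa, *decode_qa(qa))
```

Here 128 sets only bit 7 (all channels valid), while 130 additionally sets bit 1, so that grid cell would be rejected as cloudy.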

## Mask NDVI Data Using a Quality Flag

In [None]:
# Function to extract quality
def get_quality_info(QA):
    """
    QA: the QA value read in from the NDVI data

    High quality NDVI should meet the following criteria:
    Bit 7: 1 (All AVHRR channels have valid values)
    Bit 2: 0 (The pixel is not covered by cloud shadow)
    Bit 1: 0 (The pixel is not covered by cloud)

    Output:
    True: high quality
    False: low quality
    """
    # unpack quality assurance flag for cloud (bit 1)
    cld_flag = (QA % (2**2)) // 2
    # unpack quality assurance flag for cloud shadow (bit 2)
    cld_shadow = (QA % (2**3)) // 2**2
    # unpack quality assurance flag for AVHRR values (bit 7)
    value_valid = (QA % (2**8)) // 2**7

    mask = (cld_flag == 0) & (cld_shadow == 0) & (value_valid == 1)

    return mask

## Mask NDVI Data Using a Quality Flag

In [None]:
# get the quality assurance value from NDVI data
QA = ds.QA

# create the high quality information mask
mask = get_quality_info(QA)

# check the quality flag mask information
mask

## Mask NDVI Data Using a Quality Flag

The output of the previous operation gives us a data array with logical values to indicate if a grid has high quality NDVI values or not.

Now let's mask out the NDVI data array with this quality information to see if this will make a difference in the final map.

In [None]:
# use `.where` to only keep the NDVI values with high quality flag
ndvi_masked = ndvi.where(mask)
ndvi_masked

## Mask NDVI Data Using a Quality Flag

A lot of the NDVI values in the masked data array become `nan`, which means "not a number". 

This means that the grid does not have a high quality NDVI value based on the QA value.

Now, let's plot the map one more time to see the difference after the quality masking.

In [None]:
# re-plot the NDVI map using masked data
ndvi_masked.coarsen(latitude=5).mean().coarsen(longitude=5).mean().plot(
    vmin=-0.1, vmax=1.0, aspect=1.8, size=5
)

## Mask NDVI Data Using a Quality Flag

Note the large difference after the quality mask was applied and you removed data that was compromised due to clouds.

Since the NDVI value is calculated using the reflectance values of the red and near-infrared spectral bands, this value is only useful for vegetation and surface monitoring when there are no clouds present.

Thus, we always need to remove the grid with clouds in the data.
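
To make the effect of the mask concrete, here is a plain-Python sketch of what masking does conceptually: values that fail the quality check are replaced by NaN and then skipped when averaging. The four (NDVI, QA) pairs are hypothetical, and `is_high_quality` is an illustrative stand-in that applies the same three criteria as `get_quality_info`.

```python
import math

# Hypothetical (NDVI, QA) pairs for four grid cells.
cells = [(0.62, 128), (0.05, 130), (0.58, 128), (0.10, 132)]


def is_high_quality(qa):
    # Same three criteria as get_quality_info: bit 7 set, bits 1 and 2 clear.
    return ((qa >> 7) & 1) == 1 and ((qa >> 1) & 1) == 0 and ((qa >> 2) & 1) == 0


# Conceptually what `.where(mask)` does: keep good values, replace the rest with NaN.
masked = [ndvi if is_high_quality(qa) else math.nan for ndvi, qa in cells]

# Average with and without the low-quality cells.
valid = [v for v in masked if not math.isnan(v)]
raw_mean = sum(v for v, _ in cells) / len(cells)
masked_mean = sum(valid) / len(valid)

print(masked)
print(round(raw_mean, 4), round(masked_mean, 4))
```

In this toy case the cloudy and shadowed cells pull the unmasked average well below the value of the clear-sky vegetation, which mirrors the difference between the two maps.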

## Your turn

You just learned how to use `xarray` and `matplotlib` to access NDVI CDR data from AWS and visualize it. 

Find a different date that you are interested in and visualize the high quality NDVI data of that day.

Note the solution is just an example of a date that you could choose.

In [None]:
# define the date of your interest YYYYMMDD (e.g., 20030701)
# select a desired date and hours (midnight is zero)
date_sel_exercise = ...

# locate the data in the AWS S3 bucket
# hint: use the file pattern that we described
file_location_exercise = ...

# open a connection to the file in the AWS S3 bucket and use xarray to open the NDVI CDR file
# open the file
ds_exercise = ...

# get the QA value, extract the high quality data mask, and mask the NDVI data to keep only high quality values
# hint: reuse the get_quality_info helper function we defined
ndvi_masked_exercise = ...

# plot high quality NDVI data
# hint: use plot() function
# ndvi_masked_exercise.coarsen(latitude=5).mean().coarsen(longitude=5).mean().plot(
#     vmin=-0.1, vmax=1.0, aspect=1.8, size=5
# )

# Understanding Climatology Through Precipitation Data

## Obtain Monthly Precipitation Data

We will calculate the long-term precipitation climatology using monthly precipitation climate data records from NOAA. 

We will use the [Global Precipitation Climatology Project (GPCP) Monthly Precipitation Climate Data Record (CDR)](https://www.ncei.noaa.gov/products/climate-data-records/precipitation-gpcp-monthly).

This dataset contains monthly satellite-gauge data and corresponding precipitation error estimates from January 1979 to the present, gridded at a 2.5°×2.5° resolution. 

*Satellite-gauge* means that the climate data record (CDR) is a compilation of precipitation data from multiple satellites and in-situ sources, combined into a final product that optimizes the advantages of each type of data.

While a higher-resolution (1°×1°) daily product exists for varied applications, we will restrict ourselves to the coarser-resolution monthly data due to computational limitations. 

However, you are encouraged to delve into the daily higher resolution data for your specific project needs.
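
A back-of-the-envelope comparison shows why the coarser monthly product is easier to handle. The sketch below counts grid values per year for the two products, assuming a full global latitude-longitude grid and 365 days per year; `n_cells` is a hypothetical helper.

```python
def n_cells(resolution_deg):
    """Number of cells in a global latitude-longitude grid at this spacing."""
    return round(180 / resolution_deg) * round(360 / resolution_deg)


monthly_cells = n_cells(2.5)  # 72 x 144
daily_cells = n_cells(1.0)    # 180 x 360

# Values per year: 12 monthly grids versus roughly 365 daily grids.
monthly_per_year = monthly_cells * 12
daily_per_year = daily_cells * 365

ratio = round(daily_per_year / monthly_per_year)
print(monthly_cells, daily_cells, ratio)
```

Under these assumptions the daily product carries on the order of 190 times more values per year than the monthly one.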

## Access GPCP Monthly CDR Data on AWS

To perform analysis, we will need to access the monthly data files from AWS first:

In [None]:
# Step 1: Connect to the AWS S3 bucket for the GPCP Monthly Precipitation CDR data
fs = s3fs.S3FileSystem(anon = True)

# Step 2: Get the list of all data files in the AWS S3 bucket that fit the data file pattern.
file_pattern = "noaa-cdr-precip-gpcp-monthly-pds/data/*/gpcp_v02r03_monthly_*.nc"
file_location = fs.glob(file_pattern)

In [None]:
print("Total number of GPCP Monthly precipitation data files:")
print(len(file_location))

## Access GPCP Monthly CDR Data on AWS

We have more than 500 GPCP monthly precipitation CDR data files in the AWS S3 bucket.

Each data file contains the data of each month globally starting from January 1979.

Let's open a single data file to look at the data structure before we open all data files.

In [None]:
# first, open a client connection
client = boto3.client(
    "s3", config=botocore.client.Config(signature_version=botocore.UNSIGNED)
)  # initialize aws s3 bucket client

# read single data file to understand the file structure
# ds_single = xr.open_dataset(pooch.retrieve('http://s3.amazonaws.com/'+file_location[0],known_hash=None )) # open the file
ds_single = xr.open_dataset(
    pooch_load(
        filelocation="http://s3.amazonaws.com/" + file_location[0],
        filename=file_location[0],
    )
)

# Check variables in the data file
ds_single.data_vars

## Access GPCP Monthly CDR Data on AWS

From the information provided by `xarray`, there are a total of five data variables in this monthly data file, including `precip` for the monthly precipitation and `precip_error` for the monthly precipitation error.

In [None]:
# check the coordinates for the data file
ds_single.coords

## Access GPCP Monthly CDR Data on AWS

All data is organized in three dimensions: `latitude`, `longitude`, and `time`.

We want to create a three-dimensional data array for the monthly precipitation data across the entire data period (from January 1979 until present), so we must open all the available files.

In [None]:
# open all the monthly data files
# this process will take ~5 minutes to complete due to the number of data files.

# file_ob = [pooch.retrieve('http://s3.amazonaws.com/'+file,known_hash=None ) for file in file_location]
file_ob = [
    pooch_load(filelocation="http://s3.amazonaws.com/" + file, filename=file)
    for file in file_location
]

## Access GPCP Monthly CDR Data on AWS

In [None]:
# using this function instead of 'open_dataset' will concatenate the data along the dimension we specify
ds = xr.open_mfdataset(file_ob, combine="nested", concat_dim="time")
ds

## Access GPCP Monthly CDR Data on AWS

We used `combine='nested', concat_dim='time'` to combine all monthly precipitation data into one data array along the dimension of `time`. 

This command is very useful when reading in multiple data files of the same structure but covering different parts of the full data record.
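
Because nested combining concatenates the files in the order they are listed, it can help to confirm that the list is in chronological order before opening it. The sketch below does this for a few hypothetical file keys modelled on the glob pattern used earlier; the `dYYYYMM` part of the names and the helper `year_month` are assumptions made for this illustration.

```python
import re

# Hypothetical file keys; the "dYYYYMM" suffix is an assumption for this sketch.
files = [
    "noaa-cdr-precip-gpcp-monthly-pds/data/1979/gpcp_v02r03_monthly_d197902.nc",
    "noaa-cdr-precip-gpcp-monthly-pds/data/1980/gpcp_v02r03_monthly_d198001.nc",
    "noaa-cdr-precip-gpcp-monthly-pds/data/1979/gpcp_v02r03_monthly_d197901.nc",
]


def year_month(key):
    """Extract (year, month) from the assumed dYYYYMM part of the name."""
    stamp = re.search(r"_d(\d{4})(\d{2})", key)
    return int(stamp.group(1)), int(stamp.group(2))


# Sort by the date embedded in the name so the time axis comes out in order.
ordered = sorted(files, key=year_month)
print([year_month(f) for f in ordered])
```

If the list is already sorted, this step changes nothing; if it is not, sorting first avoids a shuffled time axis in the combined dataset.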

Since we are interested in the precipitation data globally at this moment, let's extract the entire data array of precipitation from the entire dataset.

In [None]:
# examine the precipitation data variable
precip = ds.precip
precip

## Access GPCP Monthly CDR Data on AWS

As you can see, the data array has the dimensions `time`, `latitude`, and `longitude`.

Before delving into further analysis, let's visualize the precipitation data to gain a better understanding of its patterns and characteristics. 

## Visualize GPCP Data Using Cartopy

We already learned how to make a simple visualization with `matplotlib`, using `latitude` and `longitude` as the y-axis and x-axis.

In [None]:
# create simple map of the GPCP precipitation data using matplotlib
fig, ax = plt.subplots(figsize=(9, 6))

# use the first month of data as an example
precip.sel(time="1979-01-01").plot(ax=ax)

## Visualize GPCP Data Using Cartopy

In this figure, the boundary between land and ocean, especially for North and South America, is only vaguely visible.

To overcome this limitation, we use `cartopy`. With `cartopy`, we can incorporate additional elements such as coastlines, major grid markings, and specific map projections.

In [None]:
# visualize the precipitation data of a selected month using cartopy

# select data for the month of interest
data = precip.sel(time="1979-01-01", method="nearest")

# initiate plot with the specific figure size
fig, ax = plt.subplots(subplot_kw={"projection": ccrs.Robinson()}, figsize=(9, 6))

# add coastal lines to indicate land/ocean
ax.coastlines()

# add major grid lines for latitude and longitude
ax.gridlines()

# add the precipitation data with map projection transformation
# also specify the maximum and minimum values shown on the map to increase
# the contrast in the map.
data.plot(
    ax=ax,
    transform=ccrs.PlateCarree(),
    vmin=0,
    vmax=20,
    cbar_kwargs=dict(shrink=0.5, label="GPCP Monthly Precipitation \n(mm/day)"),
)

## Visualize GPCP Data Using Cartopy

The updated map is much easier to read: the coastlines and grid lines place the GPCP monthly precipitation field in its geographic context.

From the visualization, we can observe that regions such as the Amazon rainforest, the northern part of Australia, and other tropical areas exhibit higher levels of monthly precipitation in January 1979.

These patterns align with our basic geographical knowledge, reinforcing the validity of the data and representation.

# Climatology

## Plot Time Series of Data at a Specific Location

We have over 40 years of monthly precipitation data.

Let's examine a specific location throughout the entire time span covered by the GPCP monthly data. For this purpose, we will focus on the data point located at (0°N, 0°E).
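
A point such as (0°N, 0°E) usually does not coincide with a grid-cell center, which is why nearest-neighbor selection is useful here. The sketch below assumes a hypothetical 2.5° grid whose cell centers are offset from zero; the grid values and the `field` array are illustrative and not read from the GPCP files.

```python
import numpy as np
import xarray as xr

# hypothetical 2.5-degree grid whose cell centers never fall exactly on 0
lat = np.arange(-88.75, 90, 2.5)
lon = np.arange(1.25, 360, 2.5)
field = xr.DataArray(
    np.zeros((lat.size, lon.size)),
    coords={"latitude": lat, "longitude": lon},
    dims=("latitude", "longitude"),
)

# an exact lookup fails because no cell center sits at (0, 0)
try:
    field.sel(latitude=0, longitude=0)
except KeyError:
    print("no grid center at exactly (0, 0)")

# method="nearest" returns the closest cell center instead
cell = field.sel(latitude=0, longitude=0, method="nearest")
print(float(cell.latitude), float(cell.longitude))
```

On such a grid the selected cell is centered 1.25° away from the requested point in each direction, so the time series describes the surrounding grid cell rather than the exact coordinates.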

In [None]:
# select the entire time series for the grid that contains the location of (0N, 0E)
grid = ds.precip.sel(latitude=0, longitude=0, method="nearest")

# initiate plot
fig, ax = plt.subplots(figsize=(9, 6))

# plot the data
grid.plot(ax=ax)

# remove the automatically generated title
ax.set_title("")

## Plot Time Series of Data at a Specific Location

From the time series plot, note a repeating pattern with a seasonal cycle (roughly the same ups and downs over the course of a year, for each year).

We can apply the same climatology calculation we learned before to this data to investigate the annual cycle of precipitation at this location.

## Calculate the Climatology

A climatology is typically calculated over a 30-year reference period. In this case, let's use the reference period of 1981-2010.
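
As a quick check on what a 30-year reference period means for monthly data, the sketch below slices a synthetic monthly series (a stand-in for the real record, with made-up values) and counts the time steps. Label-based slices in xarray include both endpoints.

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic monthly series covering 1979-2022, one value per month start
time = pd.date_range("1979-01-01", "2022-12-01", freq="MS")
series = xr.DataArray(
    np.arange(time.size, dtype=float), coords={"time": time}, dims="time"
)

# label-based slicing keeps every month from Jan 1981 through Dec 2010
ref = series.sel(time=slice("1981-01-01", "2010-12-31"))
print(ref.sizes["time"])  # 360 = 30 years x 12 months
```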

In [None]:
# first, let's extract the data for the time period that we want (1981-2010)
precip_30yr = ds.precip.sel(time=slice("1981-01-01", "2010-12-31"))
precip_30yr

## Calculate the Climatology

Now we can use Xarray's `.groupby()` functionality to calculate the monthly climatology.

Recall that `.groupby()` splits the data based on a specific criterion (in this case, the month of the year) and then applies a process (in our case, calculating the mean value across 30 years for that specific month) to each group before recombining the data together.
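
The split-apply-combine idea can be checked on a synthetic series where the answer is known in advance. In this sketch (all values are made up), each month's value is its calendar month number plus a small offset that grows by 0.1 per year, so the 30-year mean for each month should be the month number plus 1.45.

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic monthly series for 1981-2010
time = pd.date_range("1981-01-01", "2010-12-01", freq="MS")
months = time.month.to_numpy()
years = time.year.to_numpy()

# value = calendar month + 0.1 per year since 1981
values = months + 0.1 * (years - 1981)
series = xr.DataArray(values.astype(float), coords={"time": time}, dims="time")

# split by calendar month, average each group over the 30 years, recombine
clim = series.groupby("time.month").mean(dim="time")
print(clim.dims, clim.sizes["month"])
print(clim.values)
```

The `time` dimension is replaced by a new `month` dimension with 12 entries, which is the same structure that `precip_clim` has for each grid cell.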

In [None]:
# use groupby to calculate monthly climatology (1981-2010)
precip_clim = precip_30yr.groupby("time.month").mean(dim="time")
precip_clim

## Calculate the Climatology

With the resulting climatology data array, we can make a set of maps that show the monthly climatology for one month from each of the four seasons.

In [None]:
# define the figure and each axis for the 2 rows and 2 columns
fig, axs = plt.subplots(
    nrows=2, ncols=2, subplot_kw={"projection": ccrs.Robinson()}, figsize=(12, 8)
)

# axs is a 2 dimensional array of `GeoAxes`.  We will flatten it into a 1-D array
axs = axs.flatten()

# loop over selected months (Jan, Apr, Jul, Oct)
for i, month in enumerate([1, 4, 7, 10]):

    # Draw the coastlines and major gridlines for each subplot
    axs[i].coastlines()
    axs[i].gridlines()

    # Draw the precipitation data
    precip_clim.sel(month=month).plot(
        ax=axs[i],
        transform=ccrs.PlateCarree(),
        vmin=0,
        vmax=15,  # use the same range of max and min value
        cbar_kwargs=dict(shrink=0.5, label="GPCP Climatology\n(mm/day)"),
    )

## Calculate the Climatology

Across the four seasonal climatology maps, we can observe a clear pattern of precipitation across the globe.

The tropics exhibit a higher amount of precipitation compared to other regions.

Additionally, the map illustrates the seasonal patterns of precipitation changes across different regions of the globe, including areas influenced by monsoons.

## Calculate the Climatology

Now let's examine the climatology of the location we previously analyzed throughout the entire time series, specifically at (0°N, 0°E).

In [None]:
# initiate plot with the specific figure size
fig, ax = plt.subplots(figsize=(9, 6))

precip_clim.sel(latitude=0, longitude=0, method="nearest").plot(ax=ax)
# Remove the automatically generated title
ax.set_title("")

## Calculate the Climatology

The monthly climatology time series for the point of interest demonstrates a noticeable seasonal pattern, with dry and rainy months observed in the region.

Precipitation is typically more abundant between December and May, while August is the driest month of the year.
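
Reading the wettest and driest months off a plot can also be done programmatically. The sketch below uses a hypothetical 12-value climatology for a single grid cell (the numbers are made up for illustration, not taken from GPCP) and finds the extreme months with `idxmax` and `idxmin`.

```python
import numpy as np
import xarray as xr

# hypothetical monthly climatology (mm/day) for one grid cell; values are made up
clim_point = xr.DataArray(
    [6.0, 7.5, 8.0, 7.0, 5.5, 2.0, 0.8, 0.5, 1.2, 2.5, 4.0, 5.0],
    coords={"month": np.arange(1, 13)},
    dims="month",
)

# idxmax / idxmin return the coordinate label (the month) of the extreme value
wettest = int(clim_point.idxmax(dim="month").item())
driest = int(clim_point.idxmin(dim="month").item())
print(wettest, driest)  # 3 8
```

The same calls could be applied to `precip_clim.sel(latitude=0, longitude=0, method="nearest")` to confirm the months identified from the figure.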

## Your turn (Will be in PS03)

As climate changes, the climatology of precipitation may also change. 

In fact, climate researchers recalculate climatology every 10 years. This allows climate scientists to monitor how the norms of our climate system change. 

In this exercise, you will visualize how the climatology of our dataset changes depending on the reference period used.

Calculate the climatology for a different reference period (1991-2020) and compare it to the climatology that we just generated for the 1981-2010 reference period.

Be sure to compare the two and note differences. Can you see why it is important to re-calculate this climatology?

In [None]:
# extract 30 year data for 1991-2020
precip_30yr_exercise = ...

# calculate climatology for 1991-2020
precip_clim_exercise = ...

# find difference in climatologies: (1981-2010) minus (1991-2020)
precip_diff_exercise = ...

# Compare the climatology for four different seasons by generating the
# difference maps for January, April, July, and October with colorbar max and min = 1, -1

# Define the figure and each axis for the 2 rows and 2 columns
fig, axs = plt.subplots(
    nrows=2, ncols=2, subplot_kw={"projection": ccrs.Robinson()}, figsize=(12, 8)
)

# axs is a 2 dimensional array of `GeoAxes`.  We will flatten it into a 1-D array
axs = ...

# Loop over selected months (Jan, Apr, Jul, Oct)
for i, month in enumerate([1, 4, 7, 10]):
    ...

## Questions?

## See you next class!