# Vegetation attributes and time-series extraction

Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook is part of the EStreams publication and was used to extract and aggregate the vegetation time-series from the MODIS dataset (i.e., LAI and NDVI).

* Note that this code enables not only the replicability of the current database but also the extrapolation to new catchment areas. 
* Additionally, the user should download the original raw data and place it in the folder of the same name before running this code. 
* The original third-party data used were not made available in this repository due to redistribution and storage-space reasons.  

## Requirements
**Python:**
* Python>=3.6
* Jupyter
* geopandas=0.10.2
* glob
* numpy
* os
* pandas
* tqdm

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**
* data/shapefiles/estreams_catchments.shp
* data/gee/vegetation/LAI/EStreams_modis_LAI_mean_gee_{}.csv. LAI time-series CSV-file(s) exported from GEE.
* data/gee/vegetation/NDVI/EStreams_modis_NDVI_mean_gee_{}.csv. NDVI time-series CSV-file(s) exported from GEE.

**Directory:**
* Clone the GitHub directory locally
* Place any third-party data in their respective directories.
* Update ONLY the "PATH" variable in the section "Configurations", setting it to the relative path to your local EStreams directory. 
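
Before running the notebook, it can help to confirm that the inputs listed under **Files** are in place. The following is a minimal sketch and not part of the original workflow: the helper name `find_missing_inputs` is hypothetical, and the `{}` placeholder in the CSV file names is assumed to be matched by a wildcard.

```python
import glob
import os


def find_missing_inputs(base_path):
    """Return the required input patterns that match no file under base_path.

    Hypothetical helper for illustration; the patterns mirror the Files list.
    """
    patterns = [
        "data/shapefiles/estreams_catchments.shp",
        "data/gee/vegetation/LAI/EStreams_modis_LAI_mean_gee_*.csv",
        "data/gee/vegetation/NDVI/EStreams_modis_NDVI_mean_gee_*.csv",
    ]
    return [p for p in patterns if not glob.glob(os.path.join(base_path, p))]


# Example: use the same relative path as the "PATH" variable.
missing = find_missing_inputs("../../..")
if missing:
    print("Missing inputs:", missing)
```

An empty list means every pattern matched at least one file.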

## References

* Didan, K. MODIS/Terra Vegetation Indices 16-Day L3 Global 500m SIN Grid V061 [Data set]. NASA EOSDIS Land Processes Distributed Active Archive Center https://doi.org/10.5067/MODIS/MOD13A1.061 (2021).
* Myneni, R., Knyazikhin, Y. & Park, T. MODIS/Terra Leaf Area Index/FPAR 8-Day L4 Global 500m SIN Grid V061 [Data set]. NASA EOSDIS Land Processes Distributed Active Archive Center https://doi.org/10.5067/MODIS/MOD15A2H.061 (2021).

## License

* LAI and NDVI: Open access: "MODIS data and products acquired through the LP DAAC have no restrictions on subsequent use, sale, or redistribution." https://lpdaac.usgs.gov/products/mod13a1v061/; https://lpdaac.usgs.gov/products/mod15a2hv061/ (last access: 23 November 2023)

## Observations
* This notebook assumes that the GEE scripts that export the LAI and NDVI mean time-series from the MODIS dataset (EStreams_landscape_timeseries_LAI_gee.txt; EStreams_landscape_timeseries_NDVI_gee.txt) have already been run on the GEE platform and that the output CSV files are available locally. 
* It is not possible to export all 17,130 catchments to a single CSV file, so the time-series may be stored separately across many files. 
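
To make the file layout concrete, here is a minimal sketch of how several such exports can be reshaped and combined. It is an illustration on made-up data, not the notebook's exact code: the basin IDs, values, and the helper name `reshape_export` are hypothetical, and the time-step columns are assumed to be named "0", "1", "2", ... in chronological order.

```python
import io

import pandas as pd

# Two toy exports in the assumed GEE layout: one row per catchment,
# one column per monthly time step, plus metadata columns.
chunk_a = io.StringIO(
    "system:index,0,1,2,basin_id,.geo\n"
    "a0,10,20,30,AT000001,x\n"
    "a1,11,21,31,AT000002,x\n"
)
chunk_b = io.StringIO(
    "system:index,0,1,2,basin_id,.geo\n"
    "b0,12,22,32,BE000001,x\n"
)

# Three month-end dates for the toy example (the notebook covers 2001-2022).
dates = pd.to_datetime(["2001-01-31", "2001-02-28", "2001-03-31"])


def reshape_export(csv_file, dates):
    """Turn one wide export into a date-indexed frame, one column per basin."""
    table = pd.read_csv(csv_file).drop(columns=["system:index", ".geo"])
    table = table.set_index("basin_id").T   # rows: time steps, columns: basins
    table.index = table.index.astype(int)   # "0", "1", ... -> 0, 1, ...
    table = table.sort_index()
    table.index = dates[table.index.to_numpy()]  # step number -> date
    return table


# Combine the chunks side by side; rows are aligned on the date index.
combined = pd.concat([reshape_export(f, dates) for f in (chunk_a, chunk_b)], axis=1)
print(combined)
```

Each chunk contributes its own catchment columns, so the combined table has one column per catchment regardless of how the exports were split.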

# Import modules

In [None]:
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import tqdm as tqdm
import glob

# Configurations

In [None]:
# Only editable variables:
# Relative path to your local directory
PATH = "../../.."

* #### Users should NOT change anything in the code below this point.

In [None]:
# Non-editable variables:
PATH_OUTPUT_TS = "results/timeseries/vegetationindices"
PATH_OUTPUT_ST = "results/staticattributes"

# Set the directory:
os.chdir(PATH)

# Import data
## Catchment boundaries

In [None]:
catchment_boundaries = gpd.read_file('data/shapefiles/estreams_catchments.shp')
catchment_boundaries

In [None]:
print("The total number of catchments to be processed is:", len(catchment_boundaries))

## GEE outputs
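
For both variables, catchments that do not appear in any export are added as all-NaN columns, and the columns are then sorted. The sketch below illustrates that step on toy data using `DataFrame.reindex`, a compact alternative to the explicit loop used in this notebook; the basin IDs and values are made up.

```python
import pandas as pd

# Toy monthly series for two catchments (values are illustrative only).
ts = pd.DataFrame(
    {"BE000001": [0.4, 0.5], "AT000001": [0.2, 0.3]},
    index=pd.to_datetime(["2001-01-31", "2001-02-28"]),
)

# Full list of catchment IDs, e.g. catchment_boundaries.basin_id.tolist().
all_ids = ["AT000001", "AT000002", "BE000001"]

# Keep the existing columns, add the absent ones as NaN, and sort them.
# reindex alone would drop columns not in all_ids, so the union is taken first.
full = ts.reindex(columns=sorted(set(ts.columns) | set(all_ids)))
print(full)
```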
### Leaf Area Index (LAI)

In [None]:
# Check the files in the subdirectory:
filenames = glob.glob("data/gee/vegetation/LAI/*.csv")
print("Number of files:", len(filenames))
print("First file:", filenames[0])

In [None]:
# First, we create an empty DataFrame for the data with a datetime index:
LAI_df = pd.DataFrame(index=pd.date_range(start='2001-01-01', end='2022-12-31', freq='M'))

# Loop for reading and concatenating the data:
for file in tqdm.tqdm(filenames):
    
    # Read the data from the CSV file:
    LAI_file = pd.read_csv(file)
    LAI_file.drop(["system:index", ".geo"], axis=1, inplace=True)
    LAI_file = LAI_file.T
    
    # Set columns based on the "basin_id" row and drop it
    LAI_file.columns = LAI_file.loc["basin_id", :].tolist()
    LAI_file.drop(["basin_id"], axis=0, inplace=True)
    
    # Convert the index to integers and sort it
    LAI_file.index = LAI_file.index.astype(int)
    LAI_file.sort_index(inplace=True)
    
    # Create a new DataFrame with datetime index and assign values
    LAI_file_df = pd.DataFrame(columns=LAI_file.columns)
    LAI_file_df["dates"] = pd.date_range(start='2001-01-01', end='2022-12-31', freq='M')
    LAI_file_df.loc[:, LAI_file.columns] = LAI_file
    LAI_file_df.set_index("dates", inplace=True)
    LAI_file_df.index.name = ""
    
    # Concatenate the DataFrames along the columns (axis=1)
    LAI_df = pd.concat([LAI_df, LAI_file_df], axis=1)
    
# Apply the scale factor from Google Earth Engine (GEE)
LAI_df = LAI_df * 0.01
LAI_df

In [None]:
# Here we add the columns of the catchments that were not processed
# Adding new columns with NaN values only if they don't exist
for col in catchment_boundaries.basin_id.tolist():
    if col not in LAI_df.columns:
        LAI_df[col] = np.nan
LAI_df

In [None]:
# Here we sort the columns:
LAI_df = LAI_df.sort_index(axis=1)
LAI_df

In [None]:
# Resample to yearly mean
LAI_yr = LAI_df.resample('Y').mean()
LAI_yr

In [None]:
# Calculate the mean for each month across all years (month of the year)
LAI_moy = LAI_df.groupby(LAI_df.index.month).mean()

# Rename the index to the three-letter month abbreviations
LAI_moy.index = pd.to_datetime(LAI_moy.index, format='%m').strftime('%b')

LAI_moy

### Normalized Difference Vegetation Index (NDVI)

In [None]:
# Check the files in the subdirectory:
filenames = glob.glob("data/gee/vegetation/NDVI/*.csv")
print("Number of files:", len(filenames))
print("First file:", filenames[0])

In [None]:
# First, we create an empty DataFrame for the data with a datetime index:
ndvi_df = pd.DataFrame(index=pd.date_range(start='2001-01-01', end='2022-12-31', freq='M'))

# Loop for reading and concatenating the data:
for file in tqdm.tqdm(filenames):
    
    # Read the data from the CSV file:
    ndvi_file = pd.read_csv(file)
    ndvi_file.drop(["system:index", ".geo"], axis=1, inplace=True)
    ndvi_file = ndvi_file.T
    
    # Set columns based on the "basin_id" row and drop it
    ndvi_file.columns = ndvi_file.loc["basin_id", :].tolist()
    ndvi_file.drop(["basin_id"], axis=0, inplace=True)
    
    # Convert the index to integers and sort it
    ndvi_file.index = ndvi_file.index.astype(int)
    ndvi_file.sort_index(inplace=True)
    
    # Create a new DataFrame with datetime index and assign values
    ndvi_file_df = pd.DataFrame(columns=ndvi_file.columns)
    ndvi_file_df["dates"] = pd.date_range(start='2001-01-01', end='2022-12-31', freq='M')
    ndvi_file_df.loc[:, ndvi_file.columns] = ndvi_file
    ndvi_file_df.set_index("dates", inplace=True)
    ndvi_file_df.index.name = ""
    
    # Concatenate the DataFrames along the columns (axis=1)
    ndvi_df = pd.concat([ndvi_df, ndvi_file_df], axis=1)
    
# Apply the scale factor from Google Earth Engine (GEE)
ndvi_df = ndvi_df * 0.0001
ndvi_df

In [None]:
# Here we add the columns of the catchments that were not processed
# Adding new columns with NaN values only if they don't exist
for col in catchment_boundaries.basin_id.tolist():
    if col not in ndvi_df.columns:
        ndvi_df[col] = np.nan
ndvi_df

In [None]:
# Here we sort the columns:
ndvi_df = ndvi_df.sort_index(axis=1)
ndvi_df

In [None]:
# Resample to yearly mean
ndvi_yr = ndvi_df.resample('Y').mean()
ndvi_yr

In [None]:
# Calculate the mean for each month across all years (month of the year)
ndvi_moy = ndvi_df.groupby(ndvi_df.index.month).mean()

# Rename the index to the three-letter month abbreviations
ndvi_moy.index = pd.to_datetime(ndvi_moy.index, format='%m').strftime('%b')

ndvi_moy

# Final aggregation (static attributes)
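
The static attributes are the month-of-year means per catchment plus their overall mean. A minimal sketch on toy data follows; the basin ID and values are made up, and the month numbers are formatted directly rather than going through the month abbreviations used in the cells of this notebook.

```python
import numpy as np
import pandas as pd

# Two years of toy monthly values for one hypothetical catchment.
dates = pd.date_range("2001-01-01", periods=24, freq="MS")
values = np.tile(np.arange(1.0, 13.0), 2)  # Jan=1, ..., Dec=12 in both years
ts = pd.DataFrame({"AT000001": values}, index=dates)

# Month-of-year mean: one row per calendar month (1..12).
moy = ts.groupby(ts.index.month).mean()

# One row per catchment, columns lai_01 ... lai_12, plus the overall mean.
static = moy.T
static.columns = [f"lai_{month:02d}" for month in static.columns]
static["lai_mean"] = static.mean(axis=1)
static.index.name = "basin_id"
print(static)
```

Note that `lai_mean` here is the mean of the twelve month-of-year means, which matches the aggregation in this section.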

In [None]:
# LAI:
LAI_moy_T = LAI_moy.T
LAI_moy_T.columns = pd.to_datetime(LAI_moy_T.columns, format='%b').strftime('%m')
LAI_moy_T.columns = "lai_" + LAI_moy_T.columns

LAI_moy_T["lai_mean"] = LAI_moy_T.mean(axis = 1)

LAI_moy_T

In [None]:
# NDVI:
ndvi_moy_T = ndvi_moy.T
ndvi_moy_T.columns = pd.to_datetime(ndvi_moy_T.columns, format='%b').strftime('%m')
ndvi_moy_T.columns = "ndvi_" + ndvi_moy_T.columns

ndvi_moy_T["ndvi_mean"] = ndvi_moy_T.mean(axis = 1)

ndvi_moy_T

In [None]:
# First we create an empty DataFrame to assign the values to
vegetation_df = pd.DataFrame(index = ndvi_moy_T.index)

# Now we proceed with the concatenation:
vegetation_df = pd.concat([LAI_moy_T, ndvi_moy_T], axis=1)

vegetation_df

In [None]:
# Name the index (catchment identifiers) "basin_id":
vegetation_df.index.name = "basin_id"

In [None]:
# Assign the "date" to the df index:
LAI_df.index.name = "date"
LAI_yr.index.name = "date"
ndvi_df.index.name = "date"
ndvi_yr.index.name = "date"

In [None]:
# Round the data to 3 decimals
LAI_df = LAI_df.astype(float).round(3)
LAI_yr = LAI_yr.astype(float).round(3)
ndvi_df = ndvi_df.astype(float).round(3)
ndvi_yr = ndvi_yr.astype(float).round(3)
vegetation_df = vegetation_df.astype(float).round(3)

# Data export
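
`DataFrame.to_csv` does not create missing folders, so the output directories must exist before exporting. The sketch below shows one way to ensure this; it is an optional precaution rather than part of the original workflow, and it uses a temporary directory and a made-up table and file name in place of the real EStreams root and datasets.

```python
import os
import tempfile

import pandas as pd

# Toy table standing in for one of the final datasets.
df = pd.DataFrame(
    {"AT000001": [0.123]},
    index=pd.Index(["2001-01-31"], name="date"),
)

# A temporary directory stands in for the EStreams root in this sketch.
root = tempfile.mkdtemp()
out_dir = os.path.join(root, "results", "timeseries", "vegetationindices")

# Create the output folder (and any parents) if it does not exist yet.
os.makedirs(out_dir, exist_ok=True)

out_file = os.path.join(out_dir, "example.csv")  # hypothetical file name
df.to_csv(out_file)
```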

In [None]:
# Export the final datasets:
# Time-series:
LAI_df.to_csv(PATH_OUTPUT_TS+"/estreams_LAI_monhtly.csv")
LAI_yr.to_csv(PATH_OUTPUT_TS+"/estreams_LAI_yearly.csv")

ndvi_df.to_csv(PATH_OUTPUT_TS+"/estreams_NDVI_monhtly.csv")
ndvi_yr.to_csv(PATH_OUTPUT_TS+"/estreams_NDVI_yearly.csv")

# Static attributes:
vegetation_df.to_csv(PATH_OUTPUT_ST+"/estreams_vegetation_attributes.csv")

# End