# Landcover attributes extraction

Author: Thiago Nascimento (thiago.nascimento@eawag.ch)

This notebook is part of the EStreams publication and was used to extract and aggregate the landcover characteristics from the Corine dataset.

* Note that this code enables not only the replication of the current database but also its extension to new catchment areas.
* Before running this code, the user should download the original raw data and place it in the folder of the same name.
* The original third-party data are not available in this repository for redistribution and storage-space reasons.


## Requirements
**Python:**
* Python>=3.6
* Jupyter
* geopandas=0.10.2
* glob (standard library)
* numpy
* os (standard library)
* pandas
* tqdm

Check the GitHub repository for an environment.yml (for conda environments) or requirements.txt (pip) file.

**Files:**
* data/gee/landcover/EStreams_lulc{1990, 2000, 2006, 2012, 2018}_attributes_gee.csv. Landcover attributes CSV-files exported from GEE
* data/shapefiles/estreams_catchments.shp

**Directory:**
* Clone the GitHub repository locally.
* Place any third-party data in their respective directories.
* Update ONLY the "PATH" variable in the "Configurations" section, setting it to the relative path to your local EStreams directory.

## References

* CORINE Land Cover — Copernicus Land Monitoring Service. European Environment Agency [data set], Copenhagen, Denmark https://land.copernicus.eu/en/products/corine-land-cover.

## Licenses
* Corine: Open access. "The Copernicus land monitoring products and services are made available on a principle of full, open and free access, as established by the Commission Delegated Regulation (EU) No 1159/2013 of 12 July 2013." https://land.copernicus.eu/en/data-policy (Last access 27 November 2023)

## Observations
* This notebook assumes that the GEE code that exports the landcover descriptors from the Corine dataset (EStreams_landscape_attributes_landcover_gee.txt) has already been run on the GEE platform and that all the output CSV files are available locally.
* There may be more than one CSV file per year if the user decided to split the catchments into smaller groups to optimize the export.
* All the lulc CSV files must be placed together in a single folder.
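
When the export for one year is split across several CSV files, concatenating the pivoted tables side by side yields repeated column names, each filled for a different subset of catchments. The reading loop in this notebook merges them with a `ffill().bfill().iloc[0]` step. The sketch below illustrates the same idea on made-up data using `groupby(level=0).first()`, which is a simpler stand-in and not the notebook's exact call. The catchment codes `A`, `B`, `C` and all values are hypothetical.

```python
import pandas as pd

# Two hypothetical partial exports for the same year and class,
# each covering a different subset of catchments.
part_a = pd.DataFrame({"lulc_2018_211": [0.6, 0.3]}, index=["A", "B"])
part_b = pd.DataFrame({"lulc_2018_211": [0.5]}, index=["C"])

# Side-by-side concatenation keeps both columns under the same name,
# with NaN where a catchment is absent from one of the parts.
combined = pd.concat([part_a, part_b], axis=1)

# Group the duplicated column names and keep the first non-null value
# available for each catchment.
coalesced = combined.T.groupby(level=0).first().T

print(coalesced)
```

After this step each column name appears once and every catchment keeps the value from whichever file contained it.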

# Import modules

In [2]:
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import tqdm
import glob
from utils.landcover import *  # provides get_majority_columns, used below



# Configurations

In [4]:
# Only editable variables:
# Relative path to your local directory
PATH = "../../.."

* #### Users should NOT change anything in the code below.

In [5]:
# Non-editable variables:
PATH_LULC = "data/gee/landcover"
PATH_OUTPUT = "results/staticattributes/"

# Set the directory:
os.chdir(PATH)
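
Because every later cell relies on relative paths, it can help to confirm that the expected inputs exist right after changing the working directory. The following is a minimal sketch under that assumption; `find_missing_inputs` is a hypothetical helper, not part of the EStreams code, and the paths are the ones listed in the Requirements section.

```python
import os

def find_missing_inputs(root, relative_paths):
    """Return the relative paths that do not exist under root."""
    return [p for p in relative_paths
            if not os.path.exists(os.path.join(root, p))]

# Inputs named in this notebook, checked relative to the current directory.
required = ["data/gee/landcover", "data/shapefiles/estreams_catchments.shp"]
missing = find_missing_inputs(".", required)
if missing:
    print("Missing inputs:", missing)

# The output folder could be created up front in the same spirit, e.g.:
# os.makedirs("results/staticattributes/", exist_ok=True)
```

An empty `missing` list means both inputs were found from the current working directory.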

# Import data
## Catchment boundaries

In [6]:
catchment_boundaries = gpd.read_file('data/shapefiles/estreams_catchments.shp')
catchment_boundaries.head()

Unnamed: 0,id,area_km2,outlet_lat,outlet_lng,area_offic,area_calc,Code,basin_id,label_area,name,geometry
0,DE01945,144000,50.937,6.963,144232.0,144432.885,DE01945,DE01945,0,BundespegelKoeln,"POLYGON ((7.96208 46.53708, 7.96625 46.53291, ..."
1,DE01946,148000,51.226,6.77,147680.0,147934.665,DE01946,DE01946,0,BundespegelDuesseldorf,"POLYGON ((7.96208 46.53708, 7.96625 46.53291, ..."
2,DE01947,144000,50.937,6.963,144232.0,144432.885,DE01947,DE01947,0,BundespegelKoeln,"POLYGON ((7.96208 46.53708, 7.96625 46.53291, ..."
3,DE01948,148000,51.226,6.77,147680.0,147934.665,DE01948,DE01948,0,BundespegelDuesseldorf,"POLYGON ((7.96208 46.53708, 7.96625 46.53291, ..."
4,DE01949,159000,51.757,6.395,159300.0,159352.653,DE01949,DE01949,0,BundespegelRees,"POLYGON ((7.96208 46.53708, 7.96625 46.53291, ..."


In [7]:
print("The total number of catchments to be processed is:", len(catchment_boundaries))

The total number of catchments to be processed is: 166


# Reproject to a projected coordinate system

In [8]:
# Define the target CRS: ETRS89 / LAEA Europe, an equal-area projection in metres
target_crs = 'EPSG:3035'

# Reproject the GeoDataFrame to the target CRS
catchment_boundaries_reprojected = catchment_boundaries.to_crs(target_crs)
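
The reprojection matters because areas computed on geographic coordinates would be in squared degrees, whereas the GEE areas are in square metres. The sketch below shows the effect on a made-up 0.1 by 0.1 degree box; the `TOY001` identifier and the coordinates are assumptions for illustration only.

```python
import geopandas as gpd
from shapely.geometry import box

# A made-up 0.1 x 0.1 degree box in geographic coordinates (EPSG:4326).
toy = gpd.GeoDataFrame(
    {"basin_id": ["TOY001"]},
    geometry=[box(8.0, 47.0, 8.1, 47.1)],
    crs="EPSG:4326",
)

# Reproject to ETRS89 / LAEA Europe so that .area returns square metres.
toy_laea = toy.to_crs("EPSG:3035")
area_km2 = toy_laea.area / 1e6

print(area_km2)
```

At this latitude the box should come out at several tens of square kilometres, which is the order of magnitude expected from its side lengths.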

## GEE outputs

In [9]:
# Check the files in the subdirectory:
filenames = glob.glob(PATH_LULC + "/*.csv")
filenames

['data/gee/landcover/EStreams_lulc2018_attributes_gee.csv',
 'data/gee/landcover/EStreams_lulc2006_attributes_gee.csv',
 'data/gee/landcover/EStreams_lulc1990_attributes_gee.csv',
 'data/gee/landcover/EStreams_lulc2000_attributes_gee.csv',
 'data/gee/landcover/EStreams_lulc2012_attributes_gee.csv']
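
Note that `glob.glob` returns the files in arbitrary order, as the listing above shows. If a deterministic order or a per-year overview is wanted, the file names can be sorted and grouped by the year embedded in them. This is only a sketch: `group_files_by_year` is a hypothetical helper, and it assumes the `lulc<year>` naming pattern shown in the listing.

```python
import re
from collections import defaultdict

def group_files_by_year(paths):
    """Map each 4-digit year found after 'lulc' in the file name to its files."""
    groups = defaultdict(list)
    for path in sorted(paths):
        match = re.search(r"lulc(\d{4})", path)
        if match:
            groups[int(match.group(1))].append(path)
    return dict(sorted(groups.items()))

example = [
    "data/gee/landcover/EStreams_lulc2018_attributes_gee.csv",
    "data/gee/landcover/EStreams_lulc1990_attributes_gee.csv",
]
files_by_year = group_files_by_year(example)
print(files_by_year)
```

A year that maps to more than one file indicates a split export for that year.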

In [10]:
# First we create an empty dataframe for the data:
landcover_df = pd.DataFrame()

# Loop for reading and concatenating the data:
for file in tqdm.tqdm(filenames):
    
    # First we read our data:
    landcover_file = pd.read_csv(file)
    landcover_file.drop(["system:index", ".geo"], axis = 1, inplace = True)
    landcover_file["class_name"] = "lulc_" + landcover_file["year"].astype(str) + "_" + landcover_file["class"].astype(str)
    year = landcover_file.loc[0, "year"]
    
    # Here we can create a pivot-table to organize our dataset:
    landcover_pivot = pd.pivot_table(
        landcover_file,
        values='area_sqm',          
        index='code',               # Rows are based on 'code'
        columns='class_name',       # Columns are based on 'class_name'
        fill_value=np.nan)
    
    # Total area per year; each class column is then converted to a fraction of that total:
    landcover_pivot["tot_area_"+str(year)] = landcover_pivot.sum(axis = 1)
    landcover_pivot.iloc[:, :-1] = landcover_pivot.iloc[:, :-1].div(landcover_pivot["tot_area_"+str(year)], axis=0)
    
    # Now we proceed with the concatenation:
    landcover_df = pd.concat([landcover_df, landcover_pivot], axis=1)
    
    # Handle the case of more than one file for the same year by merging columns that share a name:
    landcover_df = landcover_df.T.groupby(level=0).apply(lambda group: group.ffill().bfill().iloc[0]).T

100%|██████████| 5/5 [00:00<00:00, 46.10it/s]
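
The core of the loop above is a reshape from a long table (one row per catchment and class) to a wide table (one row per catchment), followed by a division by the row total. The sketch below reproduces these two steps on made-up data; the catchment codes and areas are hypothetical, and the column names follow those used in the loop.

```python
import pandas as pd

# Hypothetical long-format export: one row per catchment and landcover class.
records = pd.DataFrame({
    "code": ["A", "A", "B", "B"],
    "year": [2018, 2018, 2018, 2018],
    "class": [211, 311, 211, 311],
    "area_sqm": [30.0, 10.0, 5.0, 15.0],
})
records["class_name"] = (
    "lulc_" + records["year"].astype(str) + "_" + records["class"].astype(str)
)

# Wide table: one row per catchment, one column per class.
pivot = records.pivot_table(values="area_sqm", index="code", columns="class_name")

# Total classified area per catchment, then the fraction held by each class.
total = pivot.sum(axis=1)
fractions = pivot.div(total, axis=0)

print(fractions)
```

Each row of `fractions` sums to 1, which is a quick sanity check worth running on the real table as well.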


In [11]:
# Here we add the majority class for each basin:
landcover_df = pd.concat([landcover_df, landcover_df.apply(get_majority_columns, axis=1)], axis=1)

landcover_df

code,lulc_1990_111,lulc_1990_112,lulc_1990_121,lulc_1990_122,lulc_1990_123,lulc_1990_124,lulc_1990_131,lulc_1990_132,lulc_1990_133,lulc_1990_141,...,tot_area_1990,tot_area_2000,tot_area_2006,tot_area_2012,tot_area_2018,lulc_dom_2012,lulc_dom_2000,lulc_dom_1990,lulc_dom_2018,lulc_dom_2006
HUGR019,0.00026,0.049822,0.003864,0.000298,0.0,0.000135,0.000539,0.000379,4.6e-05,0.000159,...,50304010000.0,50304100000.0,50083040000.0,50173320000.0,50173320000.0,211,211,211,211,211
HUGR020,0.000101,0.050807,0.004048,8.9e-05,0.0,0.0,0.000258,0.000135,0.0,9.5e-05,...,9595701000.0,9595701000.0,9595701000.0,9595701000.0,9595701000.0,211,211,211,211,211
HUGR021,0.000631,0.045165,0.004286,0.000356,4.2e-05,0.000728,0.000946,0.000181,0.000456,0.000594,...,186852800000.0,188596500000.0,188596500000.0,188596500000.0,188596500000.0,211,211,211,211,211
HUGR022,0.000329,0.05206,0.002848,5.1e-05,0.0,1.2e-05,0.000536,0.000202,1.3e-05,0.000188,...,21087860000.0,21087890000.0,20901960000.0,20933280000.0,20933280000.0,311,311,311,311,311
HUGR023,0.000632,0.045184,0.004282,0.000357,4.2e-05,0.000729,0.000944,0.000181,0.000457,0.000595,...,186541900000.0,188285600000.0,188285600000.0,188285600000.0,188285600000.0,211,211,211,211,211
HUGR024,0.0,0.025958,0.00374,0.0,0.0,0.0,0.000763,0.000583,0.0,0.0,...,3243604000.0,3243604000.0,3243604000.0,3243604000.0,3243604000.0,311,311,311,311,311
HUGR025,0.0,0.055027,0.006998,0.000562,0.0,0.0,0.000636,0.001302,0.0,0.0,...,1206414000.0,1206414000.0,1206414000.0,1206414000.0,1206414000.0,211,211,211,211,211
HUGR026,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,109630100.0,109630100.0,109630100.0,109630100.0,109630100.0,211,211,211,211,211
HUGR027,0.000403,0.049085,0.006017,0.000427,0.0,0.001066,0.000975,0.000401,0.000392,0.000198,...,4494349000.0,4494349000.0,4494349000.0,4494349000.0,4494349000.0,211,211,211,211,211
HUGR028,0.000284,0.032889,0.002534,5.5e-05,0.0,0.000904,0.000568,0.000116,0.0,0.000365,...,5773433000.0,5773433000.0,5773433000.0,5773433000.0,5773433000.0,211,211,211,211,211
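
The `get_majority_columns` function comes from `utils.landcover` and is not shown in this notebook. As a rough idea of what such a step could look like, the sketch below picks, for each year, the class with the largest fraction in a row. `dominant_class_per_year` is a hypothetical reimplementation written for illustration, assuming the `lulc_<year>_<class>` column naming used above; it may differ from the actual function.

```python
import re
import pandas as pd

def dominant_class_per_year(row):
    """For each year, return the class code with the largest fraction in the row."""
    best = {}
    for column, value in row.items():
        match = re.fullmatch(r"lulc_(\d{4})_(\d+)", str(column))
        if match is None or pd.isna(value):
            continue  # skip totals, other columns, and missing values
        year, land_class = match.groups()
        if year not in best or value > best[year][1]:
            best[year] = (land_class, value)
    return pd.Series(
        {f"lulc_dom_{year}": cls for year, (cls, _) in best.items()},
        dtype=object,
    )

# Made-up fractions for two catchments and two years.
toy = pd.DataFrame(
    {
        "lulc_1990_211": [0.4, 0.7],
        "lulc_1990_311": [0.6, 0.3],
        "lulc_2018_211": [0.8, 0.2],
        "lulc_2018_311": [0.2, 0.8],
    },
    index=["A", "B"],
)
dominant = toy.apply(dominant_class_per_year, axis=1)
print(dominant)
```

In this sketch ties are resolved in favour of the class that appears first in the column order.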


In [12]:
# Here we convert the total areas to the fraction of each catchment area covered by Corine (some countries are not covered):
columns_tot_areas = ["tot_area_1990", "tot_area_2000", "tot_area_2006", "tot_area_2012", "tot_area_2018"]

landcover_df.loc[:, columns_tot_areas] = landcover_df.loc[:, columns_tot_areas].div(catchment_boundaries_reprojected.set_index("basin_id").area, axis=0)
landcover_df

code,lulc_1990_111,lulc_1990_112,lulc_1990_121,lulc_1990_122,lulc_1990_123,lulc_1990_124,lulc_1990_131,lulc_1990_132,lulc_1990_133,lulc_1990_141,...,tot_area_1990,tot_area_2000,tot_area_2006,tot_area_2012,tot_area_2018,lulc_dom_2012,lulc_dom_2000,lulc_dom_1990,lulc_dom_2018,lulc_dom_2006
HUGR019,0.00026,0.049822,0.003864,0.000298,0.0,0.000135,0.000539,0.000379,4.6e-05,0.000159,...,,,,,,211,211,211,211,211
HUGR020,0.000101,0.050807,0.004048,8.9e-05,0.0,0.0,0.000258,0.000135,0.0,9.5e-05,...,,,,,,211,211,211,211,211
HUGR021,0.000631,0.045165,0.004286,0.000356,4.2e-05,0.000728,0.000946,0.000181,0.000456,0.000594,...,,,,,,211,211,211,211,211
HUGR022,0.000329,0.05206,0.002848,5.1e-05,0.0,1.2e-05,0.000536,0.000202,1.3e-05,0.000188,...,,,,,,311,311,311,311,311
HUGR023,0.000632,0.045184,0.004282,0.000357,4.2e-05,0.000729,0.000944,0.000181,0.000457,0.000595,...,,,,,,211,211,211,211,211
HUGR024,0.0,0.025958,0.00374,0.0,0.0,0.0,0.000763,0.000583,0.0,0.0,...,,,,,,311,311,311,311,311
HUGR025,0.0,0.055027,0.006998,0.000562,0.0,0.0,0.000636,0.001302,0.0,0.0,...,,,,,,211,211,211,211,211
HUGR026,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,211,211,211,211,211
HUGR027,0.000403,0.049085,0.006017,0.000427,0.0,0.001066,0.000975,0.000401,0.000392,0.000198,...,,,,,,211,211,211,211,211
HUGR028,0.000284,0.032889,0.002534,5.5e-05,0.0,0.000904,0.000568,0.000116,0.0,0.000365,...,,,,,,211,211,211,211,211
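
The division above aligns the two operands on their index labels, so any catchment code that has no matching `basin_id` in the shapefile ends up as NaN. The NaN values in the `tot_area_*` columns of the output shown here are consistent with such a mismatch, although other causes cannot be ruled out from the output alone. The sketch below, on made-up data, shows how unmatched labels could be listed before dividing.

```python
import pandas as pd

# Hypothetical covered areas (from GEE) and catchment areas (from the shapefile).
covered_area = pd.DataFrame({"tot_area_2018": [50.0, 80.0]}, index=["A", "B"])
catchment_area = pd.Series({"A": 100.0, "C": 200.0})

# Labels present in the landcover table but absent from the catchment areas.
missing = covered_area.index.difference(catchment_area.index)

# Division aligned on the landcover index; unmatched labels become NaN.
coverage = covered_area.div(catchment_area.reindex(covered_area.index), axis=0)

print("Unmatched catchments:", list(missing))
print(coverage)
```

A non-empty `missing` index is a signal to compare the identifiers in the two sources before trusting the coverage fractions.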


In [13]:
# Here we sort the index:
landcover_df = landcover_df.sort_index(axis=0)
landcover_df.index.name = "basin_id"
landcover_df

basin_id,lulc_1990_111,lulc_1990_112,lulc_1990_121,lulc_1990_122,lulc_1990_123,lulc_1990_124,lulc_1990_131,lulc_1990_132,lulc_1990_133,lulc_1990_141,...,tot_area_1990,tot_area_2000,tot_area_2006,tot_area_2012,tot_area_2018,lulc_dom_2012,lulc_dom_2000,lulc_dom_1990,lulc_dom_2018,lulc_dom_2006
HUGR019,0.00026,0.049822,0.003864,0.000298,0.0,0.000135,0.000539,0.000379,4.6e-05,0.000159,...,,,,,,211,211,211,211,211
HUGR020,0.000101,0.050807,0.004048,8.9e-05,0.0,0.0,0.000258,0.000135,0.0,9.5e-05,...,,,,,,211,211,211,211,211
HUGR021,0.000631,0.045165,0.004286,0.000356,4.2e-05,0.000728,0.000946,0.000181,0.000456,0.000594,...,,,,,,211,211,211,211,211
HUGR022,0.000329,0.05206,0.002848,5.1e-05,0.0,1.2e-05,0.000536,0.000202,1.3e-05,0.000188,...,,,,,,311,311,311,311,311
HUGR023,0.000632,0.045184,0.004282,0.000357,4.2e-05,0.000729,0.000944,0.000181,0.000457,0.000595,...,,,,,,211,211,211,211,211
HUGR024,0.0,0.025958,0.00374,0.0,0.0,0.0,0.000763,0.000583,0.0,0.0,...,,,,,,311,311,311,311,311
HUGR025,0.0,0.055027,0.006998,0.000562,0.0,0.0,0.000636,0.001302,0.0,0.0,...,,,,,,211,211,211,211,211
HUGR026,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,211,211,211,211,211
HUGR027,0.000403,0.049085,0.006017,0.000427,0.0,0.001066,0.000975,0.000401,0.000392,0.000198,...,,,,,,211,211,211,211,211
HUGR028,0.000284,0.032889,0.002534,5.5e-05,0.0,0.000904,0.000568,0.000116,0.0,0.000365,...,,,,,,211,211,211,211,211


In [15]:
# Round all columns except the last five (the lulc_dom columns) to 3 decimals:
landcover_df.iloc[:, 0:-5] = landcover_df.iloc[:, 0:-5].astype(float).round(3)
landcover_df

basin_id,lulc_1990_111,lulc_1990_112,lulc_1990_121,lulc_1990_122,lulc_1990_123,lulc_1990_124,lulc_1990_131,lulc_1990_132,lulc_1990_133,lulc_1990_141,...,tot_area_1990,tot_area_2000,tot_area_2006,tot_area_2012,tot_area_2018,lulc_dom_2012,lulc_dom_2000,lulc_dom_1990,lulc_dom_2018,lulc_dom_2006
HUGR019,0.0,0.05,0.004,0.0,0.0,0.0,0.001,0.0,0.0,0.0,...,,,,,,211,211,211,211,211
HUGR020,0.0,0.051,0.004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,211,211,211,211,211
HUGR021,0.001,0.045,0.004,0.0,0.0,0.001,0.001,0.0,0.0,0.001,...,,,,,,211,211,211,211,211
HUGR022,0.0,0.052,0.003,0.0,0.0,0.0,0.001,0.0,0.0,0.0,...,,,,,,311,311,311,311,311
HUGR023,0.001,0.045,0.004,0.0,0.0,0.001,0.001,0.0,0.0,0.001,...,,,,,,211,211,211,211,211
HUGR024,0.0,0.026,0.004,0.0,0.0,0.0,0.001,0.001,0.0,0.0,...,,,,,,311,311,311,311,311
HUGR025,0.0,0.055,0.007,0.001,0.0,0.0,0.001,0.001,0.0,0.0,...,,,,,,211,211,211,211,211
HUGR026,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,211,211,211,211,211
HUGR027,0.0,0.049,0.006,0.0,0.0,0.001,0.001,0.0,0.0,0.0,...,,,,,,211,211,211,211,211
HUGR028,0.0,0.033,0.003,0.0,0.0,0.001,0.001,0.0,0.0,0.0,...,,,,,,211,211,211,211,211
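
The rounding cell selects columns by position, which relies on the five `lulc_dom_*` columns being last. A name-based selection is a possible alternative that does not depend on column order. The sketch below assumes the column prefixes used in this notebook and runs on a made-up one-row table.

```python
import pandas as pd

# Made-up row with one fraction, one coverage value and one dominant class.
toy = pd.DataFrame(
    {
        "lulc_2018_211": [0.123456],
        "tot_area_2018": [0.98765],
        "lulc_dom_2018": ["211"],
    },
    index=["TOY001"],
)

# Select every column except the dominant-class labels, by name.
numeric_cols = [c for c in toy.columns if not c.startswith("lulc_dom_")]
toy[numeric_cols] = toy[numeric_cols].astype(float).round(3)

print(toy)
```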


# Data export

In [12]:
# Export the final dataset:
landcover_df.to_csv(PATH_OUTPUT+"estreams_landcover_attributes.csv")
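
When the exported file is read back, the `basin_id` index has to be restored explicitly. The sketch below shows a round trip through an in-memory buffer instead of the real output file; the data are made up.

```python
import io
import pandas as pd

# Made-up table with the same index name as the exported dataset.
toy = pd.DataFrame(
    {"lulc_2018_211": [0.75], "lulc_dom_2018": ["211"]},
    index=pd.Index(["TOY001"], name="basin_id"),
)

# Write to an in-memory buffer and read it back with basin_id as the index.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="basin_id")

print(restored.dtypes)
```

Note that the dominant-class codes come back as integers here, because `read_csv` infers numeric types unless a `dtype` is passed.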

# End