This notebook details the data structure and shows how to load the data.

In [None]:
%pylab inline --no-import-all

from pathlib import Path

We first need to clone our code:

In [None]:
!rm -rf GLC
!git clone https://github.com/maximiliense/GLC

Then, we need to define the path to the data:

In [None]:
# Change this path to adapt to where you downloaded the data
DATA_PATH = Path("../input/geolifeclef-2022-lifeclef-2022-fgvc9/")

This folder is the root directory where the data was downloaded and extracted:

In [None]:
ls -L $DATA_PATH

We can now look into these subfolders and the data they contain.

# Observations

The `observations` subfolder contains 4 CSV files:

In [None]:
ls $DATA_PATH/observations

Each line of those files corresponds to a single observation.

In the files corresponding to the training data, there are 5 columns:
- `observation_id`: unique identifier of the observation
- `latitude`: latitude coordinates of this observation
- `longitude`: longitude coordinates of this observation
- `species_id`: identifier of the species observed at that location
- `subset`: proposed train/val split, obtained with the same splitting procedure as the one used for train and test (equal to either "train" or "val")

In the files corresponding to the test data, there are only 3 columns:
- `observation_id`: unique identifier of the observation
- `latitude`: latitude coordinates of this observation
- `longitude`: longitude coordinates of this observation

The goal is then to predict the identifier of the species observed at that location.

Let's load these CSV files using [pandas](https://pandas.pydata.org/):

In [None]:
import pandas as pd

In [None]:
df_obs_fr = pd.read_csv(DATA_PATH / "observations" / "observations_fr_train.csv", sep=";", index_col="observation_id")
df_obs_us = pd.read_csv(DATA_PATH / "observations" / "observations_us_train.csv", sep=";", index_col="observation_id")

df_obs = pd.concat((df_obs_fr, df_obs_us))

print("Number of observations for training: {}".format(len(df_obs)))

df_obs.head()
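
As a minimal sketch of how the `subset` column can be used, the following example separates training and validation rows with boolean masks. It runs on a tiny synthetic table (the ids and values are made up for illustration); on the real data, the same masks would be applied to `df_obs`.

```python
import pandas as pd

# Tiny synthetic stand-in for df_obs (all values are made up for illustration)
df_demo = pd.DataFrame(
    {
        "latitude": [43.6, 43.7, 37.1, 40.2],
        "longitude": [3.9, 3.8, -120.5, -105.3],
        "species_id": [12, 7, 12, 3],
        "subset": ["train", "val", "train", "train"],
    },
    index=pd.Index([10000001, 10000002, 20000001, 20000002], name="observation_id"),
)

# Boolean masks on the `subset` column give the proposed split
df_train = df_demo[df_demo["subset"] == "train"]
df_val = df_demo[df_demo["subset"] == "val"]

print("train: {}, val: {}".format(len(df_train), len(df_val)))
```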

In [None]:
df_obs_fr_test = pd.read_csv(DATA_PATH / "observations" / "observations_fr_test.csv", sep=";", index_col="observation_id")
df_obs_us_test = pd.read_csv(DATA_PATH / "observations" / "observations_us_test.csv", sep=";", index_col="observation_id")

df_obs_test = pd.concat((df_obs_fr_test, df_obs_us_test))

print("Number of observations for testing: {}".format(len(df_obs_test)))

df_obs_test.head()

The observations are not uniformly sampled in the two countries, as shown in the following plots.
The training observations are shown in blue while the test ones are shown in red.

In [None]:
from GLC.plotting import plot_map


def plot_observations_distribution(ax, df_obs, df_obs_test=None, **kwargs):
    default_kwargs = {
        "zorder": 1,
        "alpha": 0.1,
        "s": 0.5
    }
    default_kwargs.update(kwargs)
    kwargs = default_kwargs
    
    ax.scatter(df_obs.longitude, df_obs.latitude, color="blue", **kwargs)
    
    if df_obs_test is not None:
        ax.scatter(df_obs_test.longitude, df_obs_test.latitude, color="red", **kwargs)


fig = plt.figure(figsize=(10, 5.5))
ax = plot_map(region="us")
plot_observations_distribution(ax, df_obs_us, df_obs_us_test)
ax.set_title("Observations distribution (US)")

fig = plt.figure(figsize=(8, 8))
ax = plot_map(region="fr")
plot_observations_distribution(ax, df_obs_fr, df_obs_fr_test)
ax.set_title("Observations distribution (France)")

A close-up view of the region around Montpellier, France, shows the train/test splitting procedure.

Note that there is no geographical overlap between training and test sets.

In [None]:
def select_samples_around_point(df_obs, lon_min, lon_max, lat_min, lat_max):
    ind = (
        (lon_min <= df_obs.longitude) & (df_obs.longitude <= lon_max)
        & (lat_min <= df_obs.latitude) & (df_obs.latitude <= lat_max)
    )
    return df_obs[ind]


extent = [3, 4.5, 43.25, 44.25]

fig = plt.figure(figsize=(9.5, 7))
ax = plot_map(extent=extent)

df_obs_zoom = select_samples_around_point(df_obs_fr, *extent)
df_obs_zoom_test = select_samples_around_point(df_obs_fr_test, *extent)

kwargs = {
    "alpha": 0.2,
    "s": 5,
}
plot_observations_distribution(ax, df_obs_zoom, df_obs_zoom_test, **kwargs)
ax.set_title("Observations distribution around Montpellier, France")

The dataset contains 17K species and is imbalanced.

In [None]:
species_value_counts = df_obs["species_id"].value_counts()

print("Total number of species: {}".format(len(species_value_counts)))


fig = plt.figure()
ax = fig.gca()

x = np.arange(len(species_value_counts))
ax.plot(x, species_value_counts)

ax.set_yscale("log")

ax.set_xlabel("ranked species")
ax.set_ylabel("number of observations per species")
ax.set_title("Species observations distribution")

ax.grid()
ax.autoscale(tight=True)
ax.set_ylim(bottom=1)
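
To put numbers on the imbalance, one option is to compute a few summary statistics from the value counts. The sketch below uses synthetic labels and an arbitrary rarity threshold (`min_obs` is a hypothetical name); on the real data, `species_value_counts` would take the place of `counts`.

```python
import pandas as pd

# Synthetic species labels (made up): one dominant species and a long tail
species_demo = pd.Series([1] * 50 + [2] * 10 + [3] * 3 + [4, 5, 6], name="species_id")
counts = species_demo.value_counts()

# Share of all observations covered by the most frequent species
top_share = counts.iloc[0] / counts.sum()

# Number of species with fewer than `min_obs` observations (arbitrary threshold)
min_obs = 5
n_rare = int((counts < min_obs).sum())

print("Most frequent species covers {:.1%} of observations".format(top_share))
print("Species with fewer than {} observations: {} / {}".format(min_obs, n_rare, len(counts)))
```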

# Metadata

In the `metadata` folder, some additional data is provided.
There are 4 files containing:
1. GBIF species, genus, family and kingdom names associated with the species ids used in the observations, in `species_details.csv`
2. The description of the environmental (bioclimatic and pedological) variables in `environmental_variables.csv`
3. The labels corresponding to the original land cover codes in `landcover_original_labels.csv`
4. The suggested alignment of land cover codes between France and US in `landcover_suggested_alignment.csv`

In [None]:
df_species = pd.read_csv(DATA_PATH / "metadata" / "species_details.csv", sep=";")

print("Total number of species: {}".format(len(df_species)))

print("\nNumber of species in each kingdom:")
print(df_species.GBIF_kingdom_name.value_counts())

df_species.head()

In [None]:
df_obs = df_obs.reset_index().merge(df_species, on="species_id", how="left").set_index(df_obs.index.names)

print("Number of observations of each kingdom:")
print(df_obs.GBIF_kingdom_name.value_counts())

df_obs.head()
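
When joining observations with the species table, it can be worth checking that no row was duplicated and spotting observations without metadata. The sketch below does this on synthetic tables using the `validate` and `indicator` options of `DataFrame.merge`; whether the real data contains unmatched species is not something this example asserts.

```python
import pandas as pd

# Synthetic observations; species 99 deliberately has no metadata entry
obs_demo = pd.DataFrame(
    {"species_id": [1, 2, 2, 99]},
    index=pd.Index([10000001, 10000002, 20000001, 20000002], name="observation_id"),
)
species_demo = pd.DataFrame(
    {"species_id": [1, 2], "GBIF_kingdom_name": ["Plantae", "Animalia"]}
)

merged = (
    obs_demo.reset_index()
    # validate raises if a species id appears twice in the species table
    .merge(species_demo, on="species_id", how="left", validate="many_to_one", indicator=True)
    .set_index("observation_id")
)

# A left merge against unique keys keeps every observation exactly once
assert len(merged) == len(obs_demo)

# Rows flagged "left_only" have no matching entry in the species table
missing = merged[merged["_merge"] == "left_only"]
print("Observations without species metadata: {}".format(len(missing)))
```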

In [None]:
df_env_vars = pd.read_csv(DATA_PATH / "metadata" / "environmental_variables.csv", sep=";")
df_env_vars.head()

In [None]:
df_landcover_labels = pd.read_csv(DATA_PATH / "metadata" / "landcover_original_labels.csv", sep=";")
df_landcover_labels.head()

In [None]:
df_suggested_landcover_alignment = pd.read_csv(DATA_PATH / "metadata" / "landcover_suggested_alignment.csv", sep=";")
df_suggested_landcover_alignment.head()

# Patches

The patches consist of images centered at each observation's location capturing three types of information in the 250m x 250m neighboring square:
1. remote sensing imagery under the form of RGB-IR images
2. land cover data
3. altitude data

They are located in the two subfolders `patches-fr` and `patches-us`, one for each country:

In [None]:
ls $DATA_PATH

The first digit of the observation id tells the country it belongs to:
- `1` for France, thus to be found in subfolder `patches-fr`
- `2` for US, thus to be found in subfolder `patches-us`

For instance, `10561900` is an observation made in France (on the Pic Saint-Loup mountain) whereas `22068175` was observed in the US.

Inside those folders, there are two levels of hierarchy, corresponding to the last four digits of the observation id:

In [None]:
ls $DATA_PATH/patches-fr

and

In [None]:
ls $DATA_PATH/patches-fr/00

To find the files corresponding to an observation:
1. the first subfolder corresponds to the last two digits,
2. the second subfolder corresponds to the two digits right before them.

For instance, the patches corresponding to observation `10171444` can be found in `patches-fr/44/14`, whereas those of `22068100` can be found in `patches-us/00/81`:

In [None]:
ls $DATA_PATH/patches-fr/44/14/10171444*

and

In [None]:
ls $DATA_PATH/patches-us/00/81/22068100*
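
The lookup rule described above can be sketched as a small helper. `patch_directory` is a hypothetical name introduced for illustration and is not part of the GLC package; it only builds the folder path and does not touch the file system.

```python
from pathlib import Path


def patch_directory(observation_id, data_path):
    """Return the folder expected to hold the patches of an observation."""
    obs = str(observation_id)
    # The first digit encodes the country: 1 for France, 2 for the US
    country = {"1": "fr", "2": "us"}[obs[0]]
    # First level: last two digits; second level: the two digits before them
    return Path(data_path) / "patches-{}".format(country) / obs[-2:] / obs[-4:-2]


print(patch_directory(10171444, "data"))
print(patch_directory(22068100, "data"))
```

An id starting with any other digit raises a `KeyError`, which makes unexpected ids visible early.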

There are 4 files for each observation:
- a color JPEG image containing an RGB image (`*_rgb.jpg`)
- a grayscale JPEG image containing a near-infrared image (`*_near_ir.jpg`)
- a TIFF with Deflate compression containing altitude data (`*_altitude.tif`)
- a TIFF with Deflate compression containing land cover data (`*_landcover.tif`)

We provide a loading function which, given an observation id, loads all this data at once, using [Pillow](https://pillow.readthedocs.io/en/stable/) for the images and [tifffile](https://github.com/cgohlke/tifffile) for the TIFF files, and returns it as a tuple `(rgb, near-ir, altitude, landcover)`:

In [None]:
from GLC.data_loading.common import load_patch

patch = load_patch(10171444, DATA_PATH)

print("Number of data sources: {}".format(len(patch)))
print("Arrays shape: {}".format([p.shape for p in patch]))
print("Data types: {}".format([p.dtype for p in patch]))

It can also automatically perform the land cover alignment if necessary:

In [None]:
landcover_mapping = df_suggested_landcover_alignment["suggested_landcover_code"].values
patch = load_patch(10171444, DATA_PATH, landcover_mapping=landcover_mapping)
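
One plausible way such a mapping array can be applied is NumPy integer-array indexing. The sketch below assumes that position `i` of the mapping holds the aligned code for original code `i`; both the mapping and the patch are made up, and this is not necessarily how `load_patch` is implemented.

```python
import numpy as np

# Hypothetical alignment: position i holds the aligned code for original code i
mapping_demo = np.array([0, 1, 1, 2, 2, 2])

# Hypothetical 3x3 land cover patch containing original codes
landcover_demo = np.array([
    [0, 1, 2],
    [3, 4, 5],
    [5, 5, 0],
])

# Integer-array indexing replaces every pixel by its aligned code in one step
aligned = mapping_demo[landcover_demo]
print(aligned)
```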

We also provide a visualization function for the patches:

In [None]:
from GLC.plotting import visualize_observation_patch

# Extracts land cover labels
landcover_labels = df_suggested_landcover_alignment[["suggested_landcover_code", "suggested_landcover_label"]].drop_duplicates().sort_values("suggested_landcover_code")["suggested_landcover_label"].values

# Use the metadata of the same observation as the loaded patch (10171444)
visualize_observation_patch(patch, observation_data=df_obs.loc[10171444], landcover_labels=landcover_labels)

Similarly, for the observation `22068100`:

In [None]:
patch = load_patch(22068100, DATA_PATH, landcover_mapping=landcover_mapping)

visualize_observation_patch(patch, observation_data=df_obs.loc[22068100], landcover_labels=landcover_labels)

# Environmental rasters

The rasters contain low-resolution environmental data - bioclimatic and pedological data.

There are two ways to use this data:
1. directly use the pre-extracted environmental vectors found in the CSV file `pre-extracted/environmental_vectors.csv`
2. manually extract patches centered at each observation using the rasters located in the `rasters` subfolder

## Pre-extracted environmental vectors

These vectors are ready to be used - see the Random Forest training baseline in the corresponding notebook.

They are easy to load as they are provided as a CSV file.

Each line of this file corresponds to an observation and each column to one of the environmental variables.

In [None]:
df_env = pd.read_csv(DATA_PATH / "pre-extracted" / "environmental_vectors.csv", sep=";", index_col="observation_id")
df_env.head()

Note that this table typically contains NaN values, due to the absence of data over seas and oceans for both types of data, as well as over rivers and other areas for the pedologic data.

In [None]:
print("Variables which can contain NaN values:")
df_env.isna().any()
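
Many models cannot handle NaN inputs, so the missing values usually need some treatment first. One simple option, shown here on a synthetic table with illustrative column names, is to replace them by the per-column median; this is only a sketch and not necessarily what the provided baselines do.

```python
import numpy as np
import pandas as pd

# Synthetic environmental vectors (column names and values are illustrative)
env_demo = pd.DataFrame(
    {
        "bio_1": [10.0, 12.0, np.nan, 14.0],
        "sndppt": [40.0, np.nan, np.nan, 60.0],
    },
    index=pd.Index([1, 2, 3, 4], name="observation_id"),
)

# Fraction of missing values per variable
print(env_demo.isna().mean())

# Replace NaN by the median of each column
# (in practice, the medians would ideally come from the training subset only)
env_filled = env_demo.fillna(env_demo.median())
print(env_filled)
```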

## Patch extraction from rasters

To more easily extract patches from the rasters, we provide a `PatchExtractor` class which uses [rasterio](https://github.com/mapbox/rasterio).

In [None]:
from GLC.data_loading.environmental_raster import PatchExtractor

The following code loads the rasters for all the variables and prepares to extract patches of size 256x256.

Here the patches are not of the same resolution as the provided ones as one pixel corresponds to 30arcsec (~1km) for the bioclimatic data and to 250m for the pedologic data.
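
To get a feel for the ground footprint this implies, the arithmetic can be sketched as follows, using only the pixel sizes stated above and the 256-pixel patch side.

```python
size = 256  # patch side in pixels

# Bioclimatic rasters: 30 arc-seconds per pixel, 3600 arc-seconds per degree
bioclim_extent_deg = size * 30 / 3600

# Pedologic rasters: 250 m per pixel
pedo_extent_km = size * 250 / 1000

print("Bioclimatic patch side: {:.2f} degrees".format(bioclim_extent_deg))
print("Pedologic patch side: {:.0f} km".format(pedo_extent_km))
```

A bioclimatic patch therefore spans a little over 2 degrees per side, while a pedologic patch spans 64 km, so the two cover very different areas around the same observation.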

Note that this uses quite a lot of memory (~18 GB), as all the rasters are loaded into RAM.

To avoid this issue, we will only load the bioclimatic rasters here.

In [None]:
extractor = PatchExtractor(DATA_PATH / "rasters", size=256)
extractor.add_all_bioclimatic_rasters()

print("Number of rasters: {}".format(len(extractor)))

To load all the rasters use:
```
extractor.add_all_rasters()
```
To load all the pedologic rasters use:
```
extractor.add_all_pedologic_rasters()
```

A patch can then easily be extracted for a given location (latitude, longitude) using:

In [None]:
patch = extractor[43.61, 3.88]

print("Patch shape: {}".format(patch.shape))
print("Data type: {}".format(patch.dtype))

Note that a patch typically contains NaN values, due to the absence of data over seas and oceans for both types of data, as well as over rivers and other areas for the pedologic data.

In [None]:
print("Contains NaN: {}".format(np.isnan(patch).any()))
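
If NaN pixels need to be filled before feeding a patch to a model, one possible approach is to use the mean of each variable computed over its valid pixels. The sketch below assumes a channel-first layout `(n_variables, height, width)` and uses a made-up array rather than the output of the extractor.

```python
import numpy as np

# Hypothetical patch: 2 variables of 2x2 pixels, channel-first layout assumed
patch_demo = np.array([
    [[1.0, np.nan], [3.0, 5.0]],
    [[10.0, 20.0], [np.nan, np.nan]],
])

# Per-variable mean, ignoring NaN pixels
channel_means = np.nanmean(patch_demo, axis=(1, 2))

# Fill each NaN pixel with the mean of its own variable
filled = np.where(np.isnan(patch_demo), channel_means[:, None, None], patch_demo)

print("Contains NaN after filling: {}".format(np.isnan(filled).any()))
```

A variable that is NaN everywhere in the patch would still yield NaN here (and a NumPy warning), so such cases need a separate fallback value.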

A helper function to plot the patches is also provided.

The following example displays the patches obtained around the region of Montpellier, France.

In [None]:
fig = plt.figure(figsize=(14, 10))
extractor.plot((43.61, 3.88), fig=fig)