<a href="https://colab.research.google.com/github/whrc/ARTS/blob/main/Tutorial/data_formatting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Metadata formatting for the ARTS data set

Heidi Rodenhizer, Yili Yang

Jan 2024

## Dependencies

In [None]:
import uuid
import numpy as np
import pandas as pd
import geopandas as gpd
import warnings
import re
from datetime import datetime
import os
from os.path import dirname
from pathlib import Path

## User-Defined Input

Are you using Colab to run this script? Set `colab` to `True` or `False`:

In [None]:
colab = False

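if colab:
    # %%capture is a cell magic and must be the first line of a cell, so it
    # cannot sit inside this if block; the %pip line magic with --quiet
    # installs the package with minimal output instead.
    %pip install --quiet git+https://github.com/whrc/ARTS.git
    from google.colab import drive

    drive.mount("/content/drive")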

from ARTS import dataformatting

Would you like to run this as a demo with the mock data set? Set `demo` to `True` (mock data demo) or `False` (processing a new contribution):

In [None]:
demo = True

Before starting, copy your new RTS data set (a shapefile or a GeoJSON file) into a directory called "input_data" inside the directory in which you would like to work. If you leave the following code chunk unchanged, the working directory defaults to the directory one level above this script when you run it locally, **or** to a folder called "ARTS" in MyDrive when you use Colab (you will have to create this folder manually).

Provide the location of the directory in which you are working:

In [None]:
if colab:
    base_dir = Path("/content/drive/MyDrive/ARTS")
else:
    base_dir = Path('..')

print('Your base directory is ' + str(base_dir.resolve()))
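
Before moving on, it can help to confirm that the expected folder layout is in place. The helper below is a hypothetical sketch and is not part of the ARTS package. It only assumes the "input_data" directory name described above, and it runs against a temporary directory so that no real files are touched.

```python
from pathlib import Path
import tempfile


def find_rts_inputs(base_dir):
    """List the .geojson and .shp files found in base_dir / 'input_data'.

    Hypothetical helper for illustration; not part of the ARTS package.
    """
    input_dir = Path(base_dir) / "input_data"
    if not input_dir.is_dir():
        raise FileNotFoundError(f"Expected a directory at {input_dir}")
    return sorted(
        path.name
        for path in input_dir.iterdir()
        if path.suffix.lower() in {".geojson", ".shp"}
    )


# Demonstrate on a throwaway directory with one mock input file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "input_data").mkdir()
    (Path(tmp) / "input_data" / "my_rts.geojson").touch()
    found_inputs = find_rts_inputs(tmp)
    print(found_inputs)
```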

Provide the file name of the data:

In [None]:
# Set this to the file name of your data (ignored when demo is True)
your_file = 'rts_dataset_test_polygons_new.geojson'

# leave everything else in this chunk alone
if demo:
    # RTS data set to be processed
    your_rts_dataset_file = 'rts_dataset_test_polygons_new.geojson'
    your_rts_dataset_filepath = base_dir / 'Tutorial' / 'mock_dataset' / 'input_data' / your_rts_dataset_file
    
    # ARTS main dataset to be appended
    ARTS_main_dataset_filepath = base_dir / 'Tutorial' / 'mock_dataset' / 'input_data' / 'rts_dataset_test_polygons_current.geojson'
    
else:
    # RTS data set to be processed
    your_rts_dataset_file = your_file
    your_rts_dataset_filepath = base_dir / 'input_data' / your_rts_dataset_file
    
    # ARTS main dataset to be appended
    ARTS_main_dataset_filepath = base_dir / 'ARTS_main_dataset' / 'ARTS_main_dataset.geojson'
    
# Metadata Description file
metadata_filepath = base_dir / 'Metadata_Format_Summary.csv'


Provide the names of any metadata fields in your new file that you would like included in the compiled data set and that are not already in the official RTS Data Set (please check the list first to ensure that the field has not been included previously):

In [None]:
# Provide new metadata fields as a list of column names (strings). If there are no new fields, leave the empty list as is.
# If your new file is a shapefile, also provide a list of the abbreviated names
# Example:
# new_fields = ['CustomColumn1', 'CustomColumn2']
# Shapefile example:
# new_fields_abbreviated = ['CstmCl1', 'CstmCl2']
new_fields = []

Have you already created RTS centroid columns, or would you like them to be created within this script? Set `calculate_centroid` to `True` if the columns do not exist yet, or `False` if you have already created them:

In [None]:
# Example:
# calculate_centroid = False
calculate_centroid = True

Would you like your formatted new data to be written to its own file or appended to the compiled data set? With a separate file, you email the file of new features to us and we merge it with the compiled data set. With an appended file, you commit your updated file to your forked GitHub repository and create a pull request to add the file to the official GitHub repository. Your decision should mostly be based on your comfort with GitHub; if the GitHub workflow is unfamiliar to you, please opt for the separate file and email it to us. Set `separate_file` to `True` for a separate file or `False` to append.
If you choose to create a separate file, make a directory called "output" in `base_dir`.

In [None]:
# Example
# separate_file = True
separate_file = True
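
The "output" directory can also be created from Python rather than by hand. This is a minimal sketch: the helper name `ensure_output_dir` is hypothetical, and a temporary directory stands in for `base_dir`.

```python
from pathlib import Path
import tempfile


def ensure_output_dir(base_dir):
    """Create base_dir / 'output' if it does not exist and return its path.

    Hypothetical helper for illustration; not part of the ARTS package.
    """
    output_dir = Path(base_dir) / "output"
    # exist_ok=True makes repeated calls harmless
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir


with tempfile.TemporaryDirectory() as tmp:
    output_dir = ensure_output_dir(tmp)
    created = output_dir.is_dir()
    ensure_output_dir(tmp)  # second call does not raise
```

In the notebook itself, the equivalent call would be `ensure_output_dir(base_dir)`.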

## Import Metadata Format Summary

In [None]:
metadata_format_summary = pd.read_csv(metadata_filepath)

required_fields = list(
    metadata_format_summary[metadata_format_summary.Required == "True"].FieldName.values
)

generated_fields = list(
    metadata_format_summary[
        metadata_format_summary.Required == "Generated"
    ].FieldName.values
)

optional_fields = list(
    metadata_format_summary[
        metadata_format_summary.Required == "False"
    ].FieldName.values
)

all_fields = required_fields + generated_fields + optional_fields + new_fields

metadata_format_summary
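
The cell above compares the `Required` column against the strings "True", "Generated", and "False", which suggests the column holds three text categories rather than booleans. The sketch below reproduces the same grouping with the standard library on a small mock table. The `FieldName` and `Required` headers come from the cell above, while the rows (`FieldA` to `FieldD`) are invented for illustration.

```python
import csv
import io

# Mock stand-in for Metadata_Format_Summary.csv; the rows are hypothetical
mock_summary = io.StringIO(
    "FieldName,Required\n"
    "FieldA,True\n"
    "FieldB,Generated\n"
    "FieldC,False\n"
    "FieldD,True\n"
)

# Group field names by the text value of the Required column
groups = {"True": [], "Generated": [], "False": []}
for row in csv.DictReader(mock_summary):
    groups[row["Required"]].append(row["FieldName"])

required = groups["True"]
generated = groups["Generated"]
optional = groups["False"]
print(required, generated, optional)
```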

## Load the Main ARTS Data Set

In [None]:
ARTS_main_dataset = gpd.read_file(ARTS_main_dataset_filepath).filter(
    items=required_fields + generated_fields + optional_fields + ["geometry"]
)

ARTS_main_dataset.ContributionDate = pd.to_datetime(ARTS_main_dataset.ContributionDate)

for field in required_fields:  # Check if all required columns are present
    if field not in ARTS_main_dataset.columns:
        raise ValueError(
            "{field} is missing. Has the RTS data set been modified since download?".format(
                field=repr(field)
            )
        )

ARTS_main_dataset
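
The loop above stops at the first missing field. When several columns may be absent, it can be convenient to collect them all before reporting. This is a sketch of that variation, assuming a hypothetical helper name `missing_required` and mock column lists.

```python
def missing_required(columns, required_fields):
    """Return the required fields absent from columns, in their original order.

    Hypothetical helper for illustration; not part of the ARTS package.
    """
    present = set(columns)
    return [field for field in required_fields if field not in present]


# Mock column names standing in for ARTS_main_dataset.columns
missing = missing_required(
    columns=["UID", "geometry"],
    required_fields=["UID", "ContributionDate"],
)
if missing:
    print("Missing required fields: " + ", ".join(missing))
```

In the notebook, the first argument would be `ARTS_main_dataset.columns` and the second `required_fields`.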

## Load Your New RTS Data Set

In [None]:
# pre-processing your rts data
new_dataset = dataformatting.preprocessing(
    your_rts_dataset_filepath,
    required_fields,
    generated_fields,
    optional_fields,
    new_fields,
    calculate_centroid
)
new_dataset

## Check Metadata Format of New Data

In [None]:
dataformatting.run_formatting_checks(new_dataset)

## Generate UIDs

Set seed for UID generation (R) by concatenating all required metadata columns (except UID) into a single string

In [None]:
dataformatting.seed_gen(new_dataset)
new_dataset.seed

Generate UIDs

In [None]:
new_dataset["UID"] = [
    str(uuid.uuid5(uuid.NAMESPACE_DNS, name=seed)) for seed in new_dataset.seed
]
new_dataset.UID
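
Because `uuid.uuid5` hashes a namespace and a name, the same seed string always produces the same UID, and any change to the seed produces a different one. The sketch below illustrates this with mock records. How `dataformatting.seed_gen` joins the columns is not shown in this notebook, so the plain concatenation in `make_seed` and the field names `FieldA` and `FieldB` are assumptions.

```python
import uuid


def make_seed(record, fields):
    """Concatenate the chosen fields of a record into one string.

    Assumed behavior for illustration; seed_gen may format values differently.
    """
    return "".join(str(record[field]) for field in fields)


def make_uid(seed):
    """Derive a deterministic version-5 UUID from a seed string."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, seed))


fields = ["FieldA", "FieldB"]  # hypothetical required fields
record_1 = {"FieldA": "x", "FieldB": 1}
record_2 = {"FieldA": "x", "FieldB": 2}

uid_a = make_uid(make_seed(record_1, fields))
uid_a_again = make_uid(make_seed(record_1, fields))
uid_b = make_uid(make_seed(record_2, fields))

print(uid_a == uid_a_again)  # identical metadata gives an identical UID
print(uid_a == uid_b)        # different metadata gives a different UID
```

One consequence is that two features with identical values in every seed column would receive the same UID, so the seed columns need to distinguish each feature.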

## Check for Intersections with RTS Data Set

Find intersecting RTS polygons from the official RTS data set and retrieve their UIDs. Create empty columns to manually classify the repeated polygons.

In [None]:
if demo:
    intersections_output_filepath = (
        base_dir / 'Tutorial' / 'mock_dataset' / 'output' / (
            str(your_rts_dataset_file).split('.')[0] + "_overlapping_polygons.geojson"
        )
    )
else:
    intersections_output_filepath = (
        base_dir / 'output' / (
            str(your_rts_dataset_file).split('.')[0] + "_overlapping_polygons.geojson"
        )
    )

new_dataset = dataformatting.check_intersections(
    new_dataset, ARTS_main_dataset, intersections_output_filepath, demo
)
new_dataset

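At this point, you will need to manually check all polygons with intersections against the polygons in the official RTS data set in your preferred GIS software and save the result as a GeoJSON file at the path that the next code cell reads (`edited_filepath`, which is the same path as `intersections_output_filepath` above).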

When possible/necessary, try to find imagery that matches the date of the intersecting polygons - this may require contacting the lab that did the original delineation.

Your job is to compare each of the polygons listed in the 'Intersections' column with the new RTS feature and manually copy and paste the UIDs from the 'Intersections' column into the 'RepeatRTS', 'StabilizedRTS', 'MergedRTS', or 'AccidentalOverlap' column, based on the relationship between the two polygons.

- Paste the UID into the RepeatRTS column when the new RTS feature is the same RTS feature as the RTS feature in the 'Intersections' column, but was delineated at a different point in time, by a different lab at the same point in time, or from different imagery at the same point in time. The RTS feature is the same when it was the result of the same RTS initiation event.

- Paste the UID into the StabilizedRTS column when the RTS feature in the 'Intersections' column is a stabilized RTS scar as of the date of the imagery used in the new RTS delineations.

- Paste the UID into the MergedRTS column when multiple RTS features in the 'Intersections' column merged to form the new RTS feature.

- Paste the UID into the AccidentalOverlap column when inaccuracies in delineation of separate RTS features lead to overlap (e.g. features that are very close to each other and the polygons barely touch).

When this is done, each of the UIDs in the Intersections column should have been copied into one (and only one) of the 'RepeatRTS', 'StabilizedRTS', 'MergedRTS', or 'AccidentalOverlap' columns.
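
The "one and only one" rule above can be spot-checked before loading the edited file back in. The sketch below is a hypothetical pre-check and not the package's own validation. It assumes that each column stores its UIDs as a comma-separated string, which may not match the actual file format.

```python
CLASS_COLUMNS = ["RepeatRTS", "StabilizedRTS", "MergedRTS", "AccidentalOverlap"]


def split_uids(value):
    """Split a comma-separated UID string into a list (assumed storage format)."""
    if not value:
        return []
    return [uid.strip() for uid in value.split(",") if uid.strip()]


def classification_problems(row):
    """Map each intersecting UID to its usage count when that count is not 1."""
    counts = {}
    for column in CLASS_COLUMNS:
        for uid in split_uids(row.get(column)):
            counts[uid] = counts.get(uid, 0) + 1
    return {
        uid: counts.get(uid, 0)
        for uid in split_uids(row.get("Intersections"))
        if counts.get(uid, 0) != 1
    }


# Mock row: 'a1' is classified twice, 'b2' not at all, 'c3' exactly once
row = {
    "Intersections": "a1,b2,c3",
    "RepeatRTS": "a1",
    "StabilizedRTS": "",
    "MergedRTS": None,
    "AccidentalOverlap": "a1, c3",
}
problems = classification_problems(row)
print(problems)
```

An empty result means every intersecting UID in that row was assigned to exactly one column.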


## Load Manually Edited File and Join to Processed Data

Add the 'RepeatRTS', 'StabilizedRTS', and 'MergedRTS' columns that you just edited back into `new_dataset`.

In [None]:
# path to the manually-edited file
if demo:
    edited_filepath = base_dir / 'Tutorial' / 'mock_dataset' / 'output' / (
        str(your_rts_dataset_file).split('.')[0] + "_overlapping_polygons.geojson"
    )

else:
    edited_filepath = base_dir / 'output' / (
        str(your_rts_dataset_file).split('.')[0] + "_overlapping_polygons.geojson"
    )

merged_data = dataformatting.merge_data(new_dataset, edited_filepath)
merged_data

## Check for Intersections within New RTS Data Set

Intersections within the new data set are assumed to be repeat delineations of the same RTS feature. If this is not true (e.g. if you have delineated an old RTS scar and an active RTS feature on top of it), this code will not assign UIDs properly. In this case, please get in touch with us to determine how to proceed.

In [None]:
merged_data_self_intersect = dataformatting.self_intersection(merged_data)
merged_data_self_intersect

## Check Completeness of Intersection Information

In [None]:
dataformatting.check_intersection_info(merged_data_self_intersect)

## Final Column Selection

In [None]:
formatted_data = dataformatting.add_empty_columns(
    merged_data_self_intersect,
    [col for col in optional_fields],
)

formatted_data = formatted_data[all_fields + ["geometry"]]

formatted_data

In [None]:
dataformatting.output(
    formatted_data,
    ARTS_main_dataset,
    optional_fields,
    all_fields,
    base_dir,
    your_rts_dataset_file,
    ARTS_main_dataset_filepath,
    separate_file,
    demo
)

Now you are ready to submit the above file.