<img src="https://raw.githubusercontent.com/storm-platform/storm-visual-design/main/src/logo/storm-txt-vertical.png" align="right" width="128"/>

# Deforestation Detection experiment - Reproducibility analysis
<hr style="border:2px solid #0077b9;">


In [1]:
import pandas as pd
import rasterio as rio

from pydash import py_
from pathlib import Path

from is_equal import is_equal

### Auxiliary functions

In [2]:
def visualize_checksum_file(checksum_list_as_text, start_=0, end_=5):
    """Return the selected lines of a checksum list (given as a string)."""
    return "\n".join(checksum_list_as_text.split('\n')[start_:end_])

def textsum2frame(checksum_list_as_text):
    """Transform a checksum list (given as a string) into a pd.DataFrame."""
    return pd.DataFrame(
        (
            py_.chain(checksum_list_as_text)
                .thru(lambda x: x.split("\n"))
                .map(lambda x: x.split(" "))
                .map(lambda x: (x[0], x[-1]))
        ).value(),
        columns = ["checksum", "filepath"]
    )
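
`textsum2frame` splits each line on single spaces and keeps the first and last tokens. That works for the paths shown in this notebook, but it would truncate a path that contains spaces, and it turns an empty trailing line into an empty row. A minimal sketch of a stricter parser, assuming every line follows the `<digest>  <path>` layout seen here (the name `parse_checksum_lines` and the example paths are hypothetical and not part of the original analysis):

```python
def parse_checksum_lines(checksum_list_as_text):
    """Parse '<digest>  <path>' lines into (checksum, filepath) tuples."""
    rows = []
    for line in checksum_list_as_text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. the one left by a trailing newline
        # Split once, on the first run of whitespace, so that
        # spaces inside the path are preserved.
        checksum, filepath = line.split(None, 1)
        rows.append((checksum, filepath))
    return rows


# Hypothetical input: the first path contains a space.
example = (
    "d41d8cd98f00b204e9800998ecf8427e  ./compendium/my dir/README.md\n"
    "d41d8cd98f00b204e9800998ecf8427e  ./compendium/.gitkeep\n"
)
rows = parse_checksum_lines(example)
```

The resulting list can be passed to `pd.DataFrame(rows, columns=["checksum", "filepath"])` to obtain the same two columns that `textsum2frame` produces.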

### General definitions

In [3]:
#
# Checksum location
#

# Base checksum directory
checksum_base = Path("../checksums/")

# MD5 checksums
checksum_files_md5 = checksum_base / "md5sums.txt"

# SHA256 checksums
checksum_files_sha256 = checksum_base / "sha256sums.txt"

# SHA512 checksums
checksum_files_sha512 = checksum_base / "sha512sums.txt"

### Reading and visualizing the checksum files

**MD5 checksum**

In [4]:
md5sum = checksum_files_md5.read_text()

print(
    visualize_checksum_file(md5sum, 0, -1)
)

3f4c6e8707e9be0b16d7e27cf7c867a5  ./compendium/evaluate-classification/Dockerfile
d41d8cd98f00b204e9800998ecf8427e  ./compendium/evaluate-classification/README.md
a68b9d22347112680476218a6b303d35  ./compendium/evaluate-classification/environment.yml
3620751421846f017c02bf5363b4081c  ./compendium/evaluate-classification/poetry.lock
731ec82b9c640739ec6dad2db28e03c0  ./compendium/evaluate-classification/.dockerignore
0b9f38463da0933e669a83b3e43f54c0  ./compendium/evaluate-classification/data/raw_data/deforestation-map/LANDSAT-8_OLI_038047_2018-08-29_2019-08-29_bayes_v1.tif
f5a04bbeb10092c171ee971e51543fbd  ./compendium/evaluate-classification/data/raw_data/deforestation-map/LANDSAT-8_OLI_038047_2018-08-29_2019-08-29_probs_v1.tif
d41d8cd98f00b204e9800998ecf8427e  ./compendium/evaluate-classification/data/raw_data/deforestation-map/.gitkeep
4b9c8e42956bc733129059fd46250e92  ./compendium/evaluate-classification/data/raw_data/deforestation-map/LANDSAT-8_OLI_038047_2018-08-29_2019-08-29_class_

**SHA256 checksum**

In [5]:
sha256sum = checksum_files_sha256.read_text()

print(
    visualize_checksum_file(sha256sum)
)

796dd7cc09b78179722673a07b3ede094af83cb1015f2c13270121224ca4c1d2  ./compendium/evaluate-classification/Dockerfile
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  ./compendium/evaluate-classification/README.md
1d83419bf126c349c3394a50f209236a1035747fd643a10fffc64a985ad773c1  ./compendium/evaluate-classification/environment.yml
3fdefd161895d437a626af378456d3a88d95099742abb3b9e4501990aebfa11e  ./compendium/evaluate-classification/poetry.lock
86c742777cedd8b71110d9ca09b613b9c2954c32387fae1fc4887e83022724ea  ./compendium/evaluate-classification/.dockerignore


**SHA512 checksum**

In [6]:
sha512sum = checksum_files_sha512.read_text()

print(
    visualize_checksum_file(sha512sum)
)

ee530f21b44694018ba5bf26f77343b7a04ccb2255addf0fb244282afd078bd04423443879be32bee6af2bcfa9c6795b891d5fb66fbcb3958a1f718fd8a04eaa  ./compendium/evaluate-classification/Dockerfile
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  ./compendium/evaluate-classification/README.md
36841e93bba29e789a467bbcb978f1dfbc61ab16da600f76c6d64fdddb12ddfdc671626a6b996fb59925342469735b9d225876ecb92bbfa27dac6a3645fe60cf  ./compendium/evaluate-classification/environment.yml
00d5e5ecab17d82bed5ceb0f7c2e35bccbadd6cfa41d4287e95b0068bc9ab354a550bc25cc70564a874f4b67e21ede6d9a56943040378b285a6dfd06fecf09d0  ./compendium/evaluate-classification/poetry.lock
1940024dd99b947856ec730f06b061a89012e7bbc92260e6e23c9b411ddea17ab7bf82ed6465176d0bb14515e5a3b10d7b7ae271993a887d16f71a4f3d8a0429  ./compendium/evaluate-classification/.dockerignore
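
The `README.md` entries in the three listings carry the digests that each algorithm produces for empty input, which suggests that this file is empty in the compendium. A short check with Python's standard `hashlib` module (an illustration, not part of the original analysis):

```python
import hashlib

# Digests of zero bytes of input for the three algorithms used in this notebook.
empty_digests = {
    name: hashlib.new(name, b"").hexdigest()
    for name in ("md5", "sha256", "sha512")
}

for name, digest in empty_digests.items():
    print(f"{name}: {digest}")
```

For the same reason, any two empty files share the same checksum regardless of their names, which matters when files are later matched by checksum.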


### Preparing the data for data wrangling



**MD5 checksum**

In [7]:
md5sum_df = textsum2frame(md5sum)

md5sum_df.head()

Unnamed: 0,checksum,filepath
0,3f4c6e8707e9be0b16d7e27cf7c867a5,./compendium/evaluate-classification/Dockerfile
1,d41d8cd98f00b204e9800998ecf8427e,./compendium/evaluate-classification/README.md
2,a68b9d22347112680476218a6b303d35,./compendium/evaluate-classification/environme...
3,3620751421846f017c02bf5363b4081c,./compendium/evaluate-classification/poetry.lock
4,731ec82b9c640739ec6dad2db28e03c0,./compendium/evaluate-classification/.dockerig...


**SHA256 checksum**

In [8]:
sha256sum_df = textsum2frame(sha256sum)

sha256sum_df.head()

Unnamed: 0,checksum,filepath
0,796dd7cc09b78179722673a07b3ede094af83cb1015f2c...,./compendium/evaluate-classification/Dockerfile
1,e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b93...,./compendium/evaluate-classification/README.md
2,1d83419bf126c349c3394a50f209236a1035747fd643a1...,./compendium/evaluate-classification/environme...
3,3fdefd161895d437a626af378456d3a88d95099742abb3...,./compendium/evaluate-classification/poetry.lock
4,86c742777cedd8b71110d9ca09b613b9c2954c32387fae...,./compendium/evaluate-classification/.dockerig...


**SHA512 checksum**

In [9]:
sha512sum_df = textsum2frame(sha512sum)

sha512sum_df.head()

Unnamed: 0,checksum,filepath
0,ee530f21b44694018ba5bf26f77343b7a04ccb2255addf...,./compendium/evaluate-classification/Dockerfile
1,cf83e1357eefb8bdf1542850d66d8007d620e4050b5715...,./compendium/evaluate-classification/README.md
2,36841e93bba29e789a467bbcb978f1dfbc61ab16da600f...,./compendium/evaluate-classification/environme...
3,00d5e5ecab17d82bed5ceb0f7c2e35bccbadd6cfa41d42...,./compendium/evaluate-classification/poetry.lock
4,1940024dd99b947856ec730f06b061a89012e7bbc92260...,./compendium/evaluate-classification/.dockerig...


### Merging data (using the [Tidy data](https://r4ds.had.co.nz/tidy-data.html) format)

In [10]:
#
# Classifying checksums
#
md5sum_df["type"] = "md5"

sha256sum_df["type"] = "sha256"

sha512sum_df["type"] = "sha512"

#
# Merging
#
checksumdf = pd.concat([md5sum_df, sha256sum_df, sha512sum_df])
checksumdf

Unnamed: 0,checksum,filepath,type
0,3f4c6e8707e9be0b16d7e27cf7c867a5,./compendium/evaluate-classification/Dockerfile,md5
1,d41d8cd98f00b204e9800998ecf8427e,./compendium/evaluate-classification/README.md,md5
2,a68b9d22347112680476218a6b303d35,./compendium/evaluate-classification/environme...,md5
3,3620751421846f017c02bf5363b4081c,./compendium/evaluate-classification/poetry.lock,md5
4,731ec82b9c640739ec6dad2db28e03c0,./compendium/evaluate-classification/.dockerig...,md5
...,...,...,...
1149,ef5e08f6f80d76434bef8a2fdd0a9735df01e1cd951447...,./compendium/fraction-classification/.Rbuildig...,sha512
1150,7896ac00ff863202d2c1d80d3b90a181e692077cf2d2d9...,./compendium/fraction-classification/renv/acti...,sha512
1151,1c088564419f58e53ab9ee1aeec5f4cd9f30bc9252d75e...,./compendium/fraction-classification/renv/.git...,sha512
1152,e6d5fe4047123bd621ecc02395530bdae9975b18c0303b...,./compendium/fraction-classification/.Rprofile,sha512


### Classifying the file origin (original or reproduction)

This classification is based on each file's path. Every file whose path contains the directory `execution-compendium` is labeled `reproduction`, since this directory and its files are generated only during the reproduction step. All other files are labeled `original`.

In [11]:
checksumdf = checksumdf.assign(origin = (
    lambda x: x["filepath"].map((
        lambda y: 'reproduction' if 'execution-compendium' in y else 'original'
    ))
))

checksumdf

Unnamed: 0,checksum,filepath,type,origin
0,3f4c6e8707e9be0b16d7e27cf7c867a5,./compendium/evaluate-classification/Dockerfile,md5,original
1,d41d8cd98f00b204e9800998ecf8427e,./compendium/evaluate-classification/README.md,md5,original
2,a68b9d22347112680476218a6b303d35,./compendium/evaluate-classification/environme...,md5,original
3,3620751421846f017c02bf5363b4081c,./compendium/evaluate-classification/poetry.lock,md5,original
4,731ec82b9c640739ec6dad2db28e03c0,./compendium/evaluate-classification/.dockerig...,md5,original
...,...,...,...,...
1149,ef5e08f6f80d76434bef8a2fdd0a9735df01e1cd951447...,./compendium/fraction-classification/.Rbuildig...,sha512,original
1150,7896ac00ff863202d2c1d80d3b90a181e692077cf2d2d9...,./compendium/fraction-classification/renv/acti...,sha512,original
1151,1c088564419f58e53ab9ee1aeec5f4cd9f30bc9252d75e...,./compendium/fraction-classification/renv/.git...,sha512,original
1152,e6d5fe4047123bd621ecc02395530bdae9975b18c0303b...,./compendium/fraction-classification/.Rprofile,sha512,original


### Visualizing the file quantities

To check how many files fall into each group (`type` and `origin`), the data were grouped and counted. The counts show that there are fewer reproduction files than original files. This is because the reproduction process generates only the **final results**.

In [12]:
checksumdf.groupby(['type', 'origin']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,checksum,filepath
type,origin,Unnamed: 2_level_1,Unnamed: 3_level_1
md5,original,1007,1007
md5,reproduction,147,147
sha256,original,1007,1007
sha256,reproduction,147,147
sha512,original,1007,1007
sha512,reproduction,147,147


### Splitting files by compendium


**Removing empty filepaths**

In [13]:
checksumdf = checksumdf[checksumdf['filepath'].str.len() > 0]

**Extracting the experiment of each file**

In [14]:
checksumdf = checksumdf.assign(
    experiment = checksumdf['filepath'].transform(lambda x: x.split('/')[2])
)

checksumdf

Unnamed: 0,checksum,filepath,type,origin,experiment
0,3f4c6e8707e9be0b16d7e27cf7c867a5,./compendium/evaluate-classification/Dockerfile,md5,original,evaluate-classification
1,d41d8cd98f00b204e9800998ecf8427e,./compendium/evaluate-classification/README.md,md5,original,evaluate-classification
2,a68b9d22347112680476218a6b303d35,./compendium/evaluate-classification/environme...,md5,original,evaluate-classification
3,3620751421846f017c02bf5363b4081c,./compendium/evaluate-classification/poetry.lock,md5,original,evaluate-classification
4,731ec82b9c640739ec6dad2db28e03c0,./compendium/evaluate-classification/.dockerig...,md5,original,evaluate-classification
...,...,...,...,...,...
1148,559a789eb911e77eeb8056ffc6f793d323e8648745b1d3...,./compendium/fraction-classification/analysis/...,sha512,original,fraction-classification
1149,ef5e08f6f80d76434bef8a2fdd0a9735df01e1cd951447...,./compendium/fraction-classification/.Rbuildig...,sha512,original,fraction-classification
1150,7896ac00ff863202d2c1d80d3b90a181e692077cf2d2d9...,./compendium/fraction-classification/renv/acti...,sha512,original,fraction-classification
1151,1c088564419f58e53ab9ee1aeec5f4cd9f30bc9252d75e...,./compendium/fraction-classification/renv/.git...,sha512,original,fraction-classification


### Comparing files

The original data are now compared with the data generated in the reproduction. The comparison is performed in two steps:

1. Comparison based on the **name** of the files;
2. Comparison based on the **checksum** of the files.

If the reproduction is successful, the values observed in both steps are expected to be the same for the original and the reproduced files.
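
The two steps can be sketched on a toy table as follows. The sketch assumes a frame with `filename`, `type`, `origin`, and `checksum` columns like the one built in this notebook. The rows, the digests, and the merge-based approach are illustrative only and differ from the grouping used in the cells that follow.

```python
import pandas as pd

# Toy data (hypothetical file names and digests, not taken from the experiment).
toy = pd.DataFrame(
    [
        ("metrics.json", "md5", "original", "aaa"),
        ("metrics.json", "md5", "reproduction", "aaa"),
        ("map.tif", "md5", "original", "bbb"),
        ("map.tif", "md5", "reproduction", "ccc"),
        ("Dockerfile", "md5", "original", "ddd"),
    ],
    columns=["filename", "type", "origin", "checksum"],
)

original = toy[toy["origin"] == "original"]
reproduction = toy[toy["origin"] == "reproduction"]

# Step 1: match files by name (and checksum type). The default inner join
# drops files that exist on only one side, such as "Dockerfile" here.
paired = original.merge(
    reproduction,
    on=["filename", "type"],
    suffixes=("_original", "_reproduction"),
)

# Step 2: compare the checksums of the matched files.
paired["identical"] = paired["checksum_original"] == paired["checksum_reproduction"]
```

On real data the join keys would probably also need the `experiment` column, and file names that occur in several directories would produce more than one pairing per name.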

**Extracting the file names**

In [15]:
checksumdf = checksumdf.assign(
    filename = checksumdf["filepath"].transform(Path).transform(lambda x: x.name)
)

checksumdf

Unnamed: 0,checksum,filepath,type,origin,experiment,filename
0,3f4c6e8707e9be0b16d7e27cf7c867a5,./compendium/evaluate-classification/Dockerfile,md5,original,evaluate-classification,Dockerfile
1,d41d8cd98f00b204e9800998ecf8427e,./compendium/evaluate-classification/README.md,md5,original,evaluate-classification,README.md
2,a68b9d22347112680476218a6b303d35,./compendium/evaluate-classification/environme...,md5,original,evaluate-classification,environment.yml
3,3620751421846f017c02bf5363b4081c,./compendium/evaluate-classification/poetry.lock,md5,original,evaluate-classification,poetry.lock
4,731ec82b9c640739ec6dad2db28e03c0,./compendium/evaluate-classification/.dockerig...,md5,original,evaluate-classification,.dockerignore
...,...,...,...,...,...,...
1148,559a789eb911e77eeb8056ffc6f793d323e8648745b1d3...,./compendium/fraction-classification/analysis/...,sha512,original,fraction-classification,workflow.yaml
1149,ef5e08f6f80d76434bef8a2fdd0a9735df01e1cd951447...,./compendium/fraction-classification/.Rbuildig...,sha512,original,fraction-classification,.Rbuildignore
1150,7896ac00ff863202d2c1d80d3b90a181e692077cf2d2d9...,./compendium/fraction-classification/renv/acti...,sha512,original,fraction-classification,activate.R
1151,1c088564419f58e53ab9ee1aeec5f4cd9f30bc9252d75e...,./compendium/fraction-classification/renv/.git...,sha512,original,fraction-classification,.gitignore


**Grouping data**

In [16]:
checksum_groups = checksumdf.groupby(['filename', 'experiment', 'type', 'checksum'])

**Filtering out groups that contain a single file**

In [17]:
checksum_groups = checksum_groups.filter(lambda x: x["filename"].shape[0] > 1)

**Grouping again and counting**


In [18]:
checksum_groups = checksum_groups.groupby(['filename', 'experiment', 'type', 'checksum'])

checksum_groups_counts = checksum_groups.count()
checksum_groups_counts

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,filepath,origin
filename,experiment,type,checksum,Unnamed: 4_level_1,Unnamed: 5_level_1
.gitkeep,evaluate-classification,md5,d41d8cd98f00b204e9800998ecf8427e,2,2
.gitkeep,evaluate-classification,sha256,e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855,2,2
.gitkeep,evaluate-classification,sha512,cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e,2,2
LC8_30_16D_STK_v001_038047_2018-08-29_2018-09-13.tif,fraction-classification,md5,fbf500eb2cec054ced7603fed732f326,2,2
LC8_30_16D_STK_v001_038047_2018-08-29_2018-09-13.tif,fraction-classification,sha256,c53a680f257d3edfd3440324214cc6cea4c98b25bffcaa38ee83ed8634c3eb09,2,2
...,...,...,...,...,...
index.json,endmember-selection,sha256,0e086574b4acf545b549ad9bdc1d085263239308e6d3e12f5a34d68f4eda6188,2,2
index.json,endmember-selection,sha512,9057e642f0cfe7338c7eef6aa156eef9d669f2f8cea07688cab746686661d9f6f6f9282b3c240722902f3f27a5690a1c96d3cae763ecf3f582ee4e9c6b6b0f05,2,2
metrics.json,evaluate-classification,md5,14f376cd90059fdea204dd4825ba2f12,2,2
metrics.json,evaluate-classification,sha256,c453b45c81c7fe21918a92069c297dda8b109443e1655641df9f62dadb2ace34,2,2
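
A count of 2 shows that two files share a name, experiment, checksum type, and checksum. It does not show on its own that one of them is an original and the other a reproduction: two original files with identical content and name would also form a group of size 2. A hedged sketch of an additional check, using toy rows with the same column names as `checksumdf` (the rows and digests below are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    [
        ("metrics.json", "evaluate-classification", "md5", "aaa", "original"),
        ("metrics.json", "evaluate-classification", "md5", "aaa", "reproduction"),
        (".gitkeep", "evaluate-classification", "md5", "eee", "original"),
        (".gitkeep", "evaluate-classification", "md5", "eee", "original"),
    ],
    columns=["filename", "experiment", "type", "checksum", "origin"],
)

keys = ["filename", "experiment", "type", "checksum"]

# Number of distinct origins per group: 2 means the same checksum was
# found in both the original and the reproduction.
origins_per_group = toy.groupby(keys)["origin"].nunique()
matched_across_origins = origins_per_group[origins_per_group == 2]
```

In this toy table only `metrics.json` remains in `matched_across_origins`, while the `.gitkeep` group is excluded because both of its rows are originals.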
