Integrate other large datasets. #340

Open
bw4sz opened this issue Aug 27, 2022 · 37 comments
Labels
Ideas for Machine Learning! (machine learning ideas and papers that could be useful for DeepForest models; high level)
Performance (question on model performance and accuracy)

Comments

@bw4sz
Collaborator

bw4sz commented Aug 27, 2022

This issue is meant for cataloging data that should eventually go into a deepforest baseline. Happy to have contributions from the community on this issue.

https://arxiv.org/pdf/2208.10607.pdf
https://github.com/jonathanventura/urban-tree-detection-data

A portion of this could be used.
https://google.github.io/auto-arborist/

More forest data

https://lila.science/datasets/forest-damages-larch-casebearer/

Along with this issue we need a strategy for updating the baseline model and potential tradeoffs to new model weights.

nightonion/yosemite-tree-dataset#2

Roadmap

  1. Format train and test data to meet deepforest inputs.
  2. Assess test, or portion of train, on existing backbone.
  3. Document a training strategy on new data
  4. Assess test on new model
  5. Assess combined model on NeonTreeEvaluation benchmark to assess tradeoffs.
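
Steps 2, 4 and 5 all come down to scoring predicted boxes against ground-truth boxes. A minimal sketch of that scoring in plain Python, assuming boxes are (xmin, ymin, xmax, ymax) tuples in pixel coordinates and a greedy one-to-one match. The 0.4 IoU threshold and the helper names `iou` and `box_recall_precision` are illustrative assumptions, not DeepForest's own evaluation code:

```python
def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def box_recall_precision(truth, preds, threshold=0.4):
    """Greedy matching: each ground-truth box claims its best unused prediction."""
    used = set()
    matched = 0
    for t in truth:
        best_j, best = None, threshold
        for j, p in enumerate(preds):
            if j in used:
                continue
            score = iou(t, p)
            if score >= best:
                best_j, best = j, score
        if best_j is not None:
            used.add(best_j)
            matched += 1
    recall = matched / len(truth) if truth else 0.0
    precision = matched / len(preds) if preds else 0.0
    return recall, precision
```

Running the same function on a new dataset's test split and on the NeonTreeEvaluation benchmark, before and after retraining, would give the tradeoff comparison the roadmap asks for.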
@bw4sz
Collaborator Author

bw4sz commented Sep 15, 2022

Keep an eye on https://www.fruitpunch.ai/challenges/ai-for-trees.

We have downloaded this dataset from Kenya, but currently cannot get the annotations to overlap the imagery.

In orange:

Uploading 000089_jpg.rf.d593d62c4021840b13e90f1aa6625de9.png…

@bw4sz
Collaborator Author

bw4sz commented Sep 21, 2022

There may be data associated with
Methods Ecol Evol - 2021 - Tang - Large‐scale image‐based tree species mapping in a tropical forest using artificial.pdf

Ground labels of tree species are available for a 16-ha area, the Luquillo Forest Dynamics Plot (LFDP), covering <1% of the area captured by the aerial images (Figure 1). All stems ≥1 cm in diameter in the plot were censused in 2016, including information on species taxonomy and stem diameters, and all trees are fully georeferenced (see Thompson et al., 2002 for details). In this paper, we used ground label information from all live stems ≥20 cm diameter in the 2016 census. This size class is expected to include most individuals with visible canopies in the aerial photographs and provides more reliable supervision information for the proposed method. We also ran experiments using 15 cm and 25 cm as the threshold, and results are similar (Appendix C).

@bw4sz
Collaborator Author

bw4sz commented Oct 3, 2022

https://essd.copernicus.org/preprints/essd-2022-312/ Germany tree species

@bw4sz
Collaborator Author

bw4sz commented Nov 3, 2022

Synthetic trees from ground view, useful in understanding generalization.

https://arxiv.org/pdf/2210.17424.pdf

@bw4sz
Collaborator Author

bw4sz commented Nov 17, 2022

https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13860
drones. Part of https://openforestobservatory.org/dataset-details/ and the larger conversation with Derek Young at UC Davis. I wrote Derek directly about a larger collaboration.

@bw4sz
Collaborator Author

bw4sz commented Nov 21, 2022

@bw4sz
Collaborator Author

bw4sz commented Feb 19, 2023

Is there any overlapping RGB data here? Will need to contact each individually. https://open-research-europe.ec.europa.eu/articles/3-32

@bw4sz
Collaborator Author

bw4sz commented Mar 8, 2023

@henrykironde can we download these datasets and start listing them here? I'll do some and you can do some as we prepare for GSOC students. For each dataset:

  1. How many training trees?
  2. How many training images?
  3. What format are the annotations?
  4. Drop a sample image with annotations overlaid (shapefiles?).

Overall, we can start moving them to

/blue/ewhite/DeepForest

In general my hope is to create a train and test split for each dataset where possible.
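
The first two questions in the list above can be answered mechanically once a dataset's annotations are in a CSV with one row per tree. A minimal sketch, assuming an `image_path` column as in the DeepForest annotation format (`image_path, xmin, ymin, xmax, ymax, label`); the function name `summarize_annotations` is hypothetical:

```python
import csv
import io
from collections import Counter


def summarize_annotations(csv_text):
    """Count annotated trees and distinct images in a one-row-per-tree CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    per_image = Counter(row["image_path"] for row in reader)
    return {"trees": sum(per_image.values()), "images": len(per_image)}
```

For a file on disk, pass `open(path).read()` as `csv_text`. Unlike `wc -l`, this does not count the header row.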

@bw4sz
Collaborator Author

bw4sz commented Mar 8, 2023

Summary of https://lila.science/datasets/forest-damages-larch-casebearer/.
which is now on /blue/ewhite/DeepForest

(base) [b.weinstein@login3 Radogoshi_Sweden]$ ls
Bebehojd_20190527  Bebehojd_20190819  Data_Set_Larch_Casebearer  Ekbacka_20190527  Ekbacka_20190819  images  Jallasvag_20190527  Jallasvag_20190819  Kampe_20190527  Kampe_20190819  Nordkap_20190527  Nordkap_20190819  test.csv  train.csv

There are about 60,900 training trees.

(base) [b.weinstein@login3 Radogoshi_Sweden]$ cat train.csv | wc -l
60919

There are 41,148 test trees. That seems like too many test trees relative to train. We should redo that split.
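
One way to redo the split is to pool train.csv and test.csv and reassign whole images, so that no image contributes trees to both sides. A minimal sketch, assuming each row is a dict with an `image_path` key; the 10% test fraction and the names `assign_split` and `split_rows` are illustrative assumptions:

```python
import hashlib


def assign_split(image_path, test_fraction=0.1):
    """Deterministically assign an image to 'train' or 'test' from its name."""
    digest = hashlib.md5(image_path.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_rows(rows, test_fraction=0.1):
    """Send every annotation row to the split chosen for its image."""
    train, test = [], []
    for row in rows:
        if assign_split(row["image_path"], test_fraction) == "test":
            test.append(row)
        else:
            train.append(row)
    return train, test
```

Hashing the file name keeps the assignment stable across reruns and across machines, which matters if more images are added to the dataset later.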

(DeepTreeAttention) [b.weinstein@login3 Images]$ rio info B06_0051.JPG
/home/b.weinstein/miniconda3/envs/DeepTreeAttention/lib/python3.8/site-packages/rasterio/__init__.py:277: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)
{"bounds": [0.0, 1500.0, 1500.0, 0.0], "colorinterp": ["red", "green", "blue"], "compress": "jpeg", "count": 3, "crs": null, "descriptions": [null, null, null], "driver": "JPEG", "dtype": "uint8", "height": 1500, "indexes": [1, 2, 3], "interleave": "pixel", "mask_flags": [["all_valid"], ["all_valid"], ["all_valid"]], "nodata": null, "photometric": "ycbcr", "res": [1.0, 1.0], "shape": [1500, 1500], "tiled": false, "transform": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0], "units": [null, null, null], "width": 1500}

The resolution of the image is unknown, but looks to be about 10 cm.
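
The `rio info` output above already shows why the resolution is unknown: `crs` is null and `transform` is the identity matrix, so the reported `res` of 1.0 is in pixels, not metres. A minimal sketch that flags such files from the JSON and estimates the ground footprint under an assumed pixel size; the 0.10 m value is only the visual estimate from this comment, and the function names are hypothetical:

```python
import json

# Row-major 3x3 identity, as printed by `rio info` for images without georeferencing.
IDENTITY = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]


def is_georeferenced(rio_info_json):
    """True if the `rio info` JSON reports a CRS and a non-identity transform."""
    info = json.loads(rio_info_json)
    return info.get("crs") is not None and info.get("transform") != IDENTITY


def footprint_m(rio_info_json, gsd_m):
    """Ground footprint (width, height) in metres for an ASSUMED pixel size."""
    info = json.loads(rio_info_json)
    return info["width"] * gsd_m, info["height"] * gsd_m
```

At an assumed 10 cm per pixel, a 1500 x 1500 image would cover roughly 150 m on a side.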

@bw4sz
Collaborator Author

bw4sz commented Mar 15, 2023

Here is a report for https://docs.google.com/document/d/16kKik2clGutKejU8uqZevNY6JALf4aVk2ELxLeR-msQ/edit

I have sent the maintainers an email. Things didn't quite line up.

Hello,
I maintain a small python tool for object detection in forested landscapes. 
https://deepforest.readthedocs.io/en/latest/
https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13472

The tree detection tool has become quite popular in the scientific community and we are looking to bolster it with images around the world. I've been collecting links to tree annotations and imagery and have a student this summer who will start training new models. I went to inspect the dataset linked from https://docs.google.com/document/d/16kKik2clGutKejU8uqZevNY6JALf4aVk2ELxLeR-msQ/edit, but I became a bit confused. 
I downloaded the .shp and overlaid it on the imagery. I had initial trouble because in the document here: https://docs.google.com/document/d/16kKik2clGutKejU8uqZevNY6JALf4aVk2ELxLeR-msQ/edit, the geotiff that is linked does not spatially overlap with the .shp. 

Looking around https://map.openaerialmap.org/#/-175.26188850402832,-21.1463924204947,13/square/20002233030/5a82999d5a9ef7cb5d5ae685?_k=ofo82l I found a geotiff that does overlap the area. However, zooming in, it looks as if there isn't strong georeferencing, the tree points don't correspond to trees. They seem more or less randomly distributed. I'm assuming that this is not the tile that was intended to be georeferenced to this shapefile.

Screenshot 2023-03-15 at 10 14 47 AM

image

All help appreciated.
Ben Weinstein

There are 13,000 labeled trees. We may need to clean up non-coconut trees with new annotations.

@bw4sz
Collaborator Author

bw4sz commented Mar 15, 2023

image

https://zenodo.org/record/7528566

https://www.mdpi.com/2072-4292/15/5/1463

This is a private dataset and cannot be shared or re-used in any capacity besides improving the model baseline. I can see the images but haven't checked the .xml annotation overlap. The image quality looks dark but acceptable.
000002640

@bw4sz
Collaborator Author

bw4sz commented Mar 15, 2023

Report on

We present a training dataset of tropical Northern Australia savanna woodland tree species that was generated using RPAS and on-ground surveys to confirm species labels. RPAS-derived imagery was annotated, resulting in 2547 polygons representing 36 tree species.

This dataset is now on /blue/ewhite/DeepForest/Australia_Savannah

The figure is from the paper and shows trained-model output, but it gives a sense of the resolution and annotation format. Data is still downloading from Zenodo.

Screenshot 2023-03-15 at 11 15 21 AM

@bw4sz
Collaborator Author

bw4sz commented Mar 15, 2023

I wrote the corresponding author:

https://www.sciencedirect.com/science/article/pii/S0303243422000903

image

A small number of polygon annotations from Central Park, NY.

@bw4sz
Collaborator Author

bw4sz commented Mar 15, 2023

Here is a report on https://github.com/jonathanventura/urban-tree-detection-data

from article https://arxiv.org/pdf/2208.10607.pdf

Nice summary on the README. Excellent paper.
image

https://github.com/jonathanventura/urban-tree-detection-data#readme
bishop_2020_1
Screenshot 2023-03-15 at 11 45 58 AM

Annotations are made in point format! This may be more trouble than it's worth, but boxes could be generated in a semi-supervised manner using the existing deepforest baseline, as a way of improving the model across resolutions and in urban scenes. NAIP imagery is in high demand from deepforest users. At the least, this data could be used in an evaluation benchmark against NAIP.
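
A rough sketch of what the semi-supervised step could look like: run the existing baseline to get candidate boxes, then keep only the boxes that a human-annotated point confirms. The "exactly one point per box" rule and the function name `boxes_confirmed_by_points` are assumptions for illustration, not a tested recipe:

```python
def boxes_confirmed_by_points(pred_boxes, points):
    """Keep predicted (xmin, ymin, xmax, ymax) boxes containing exactly one point."""
    kept = []
    for xmin, ymin, xmax, ymax in pred_boxes:
        inside = [
            (x, y) for x, y in points
            if xmin <= x <= xmax and ymin <= y <= ymax
        ]
        # Zero points: likely a false positive. Several points: the box probably
        # merges neighbouring crowns, so it is not a clean single-tree label.
        if len(inside) == 1:
            kept.append((xmin, ymin, xmax, ymax))
    return kept
```

Points that end up in no kept box could still serve for evaluation, since point-in-box recall needs no ground-truth boxes.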

@bw4sz
Collaborator Author

bw4sz commented Mar 15, 2023

Exploring https://doi.pangaea.de/10.1594/PANGAEA.933263?format=html#download. Siberia trees from.
Screenshot 2023-03-15 at 3 01 09 PM

This Individual-labelled trees dataset is a part of the SiDroForest data collection (https://www.pangaea.de/?q=keyword%3A%22SiDroForest%22) and contains spatial data in the form of points and polygons of 872 trees and shrubs that were recorded in Siberia during a 2-month fieldwork expedition in 2018 by the Alfred Wegener Institute Helmholtz Centre for Polar and Marine Research in Germany (Kruse et al., 2019). During the fieldwork, 15 m radius vegetation surveys were performed during which information such as height, species, crown diameter were recorded for individual trees. I

The files are too large to download, so I don't know which orthomosaics go with the crown polygons. I downloaded "Kruse_et_al_SiDroForest_RGB_Orthomosiac.zip", but there are two others.

image

This image makes it seem like there are small image chips, not one giant tile, but I don't see that in the manifest.

There are also synthetic trees that I have not investigated: https://doi.pangaea.de/10.1594/PANGAEA.932795

@bw4sz
Collaborator Author

bw4sz commented Mar 16, 2023

DeepForest was used to preprocess training data in Ecuador. It is important to use this dataset only for training, since it is weakly annotated and has already touched our system. https://arxiv.org/pdf/2201.11192.pdf

https://github.com/gyrrei/ReforesTree

@bw4sz
Collaborator Author

bw4sz commented Mar 27, 2023

We have very little data from South Asia. We may be interested in annotating some tiles from https://www.mdpi.com/1999-4907/14/3/586

@bw4sz
Collaborator Author

bw4sz commented Apr 21, 2023

9,000 trees from the TEAK and SOAP NEON sites, with alive/dead labels.

https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2022JG007234
TEAK_trees.zip

available on github

data will need to be cropped

image

@bw4sz
Collaborator Author

bw4sz commented Apr 24, 2023

We’ve received approval by our field partner (LEAD Foundation) too that they are willing to share the datasets that we collected together.

We have drone imagery of 17 villages in Tanzania where we implement an agroforestry program. For 13 of these villages, we have data over 3 consecutive years (2018 , 2019, 2020) in this cloud bucket:

https://console.cloud.google.com/storage/browser/justdiggit-drone

Some of the data has been labeled. The file ‘justdiggit-drone/label_sample/Annotations_trees_only.json’ contains the most complete annotations. There is also another file 'label_sample_COCO_RDP - Tree annotation.json' which also contains a class 'messy vegetation' in addition to trees. However, there are very few annotations for this class compared to the trees.

We have some more recent data too, but that hasn’t been added to the cloud bucket yet. Let me know if you would need that too.

For some of the villages we also have 50 cm resolution satellite images (Planet SkySat) captured around the same time as we have the drone flights (as we want to use the detected trees on our drone imagery to detect trees on this satellite imagery too). They are in this bucket:

https://console.cloud.google.com/storage/browser/justdiggit-skysat

Has been downloaded to

/blue/ewhite/DeepForest/justdiggit/

very large, over 300GB.
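
If the annotation JSON follows the standard COCO layout, as the second file name suggests, converting it to DeepForest-style rows is mostly a matter of turning `[x, y, width, height]` boxes into corner coordinates and dropping the non-tree class. A minimal sketch; the category name `"tree"` and the function name `coco_to_rows` are assumptions, since the actual category names in the files have not been checked here:

```python
import json


def coco_to_rows(coco_text, keep_category="tree"):
    """Convert COCO bbox annotations to image_path/xmin/ymin/xmax/ymax/label rows."""
    coco = json.loads(coco_text)
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    keep_ids = {
        c["id"] for c in coco["categories"]
        if c["name"].lower() == keep_category  # assumed category name
    }
    rows = []
    for ann in coco["annotations"]:
        if ann["category_id"] not in keep_ids:
            continue  # e.g. skip the 'messy vegetation' class
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        rows.append({
            "image_path": file_names[ann["image_id"]],
            "xmin": x, "ymin": y, "xmax": x + w, "ymax": y + h,
            "label": "Tree",
        })
    return rows
```

The resulting rows can be written out with `csv.DictWriter` to produce per-dataset train and test files.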

@bw4sz
Collaborator Author

bw4sz commented Jul 18, 2023

@bw4sz
Collaborator Author

bw4sz commented Jul 21, 2023

I had a meeting with Josh Veitch Michaelis from ETH about https://restor.eco/?lat=26&lng=14.23&zoom=3. OpenStreetMap data, a lot of labels, not sure of the status.

@bw4sz
Collaborator Author

bw4sz commented Aug 9, 2023

https://www.biorxiv.org/content/10.1101/2023.08.03.548604v1.full.pdf polygon annotations, 14 species

The dataset created here (23,000 segmented individual tree crowns) includes 13 species and genera,
which have an environmental and an economic importance in Northeastern North America.
Moreover, these trees are segmented on seven different dates, totalling almost 161,000 annotated
tree crowns. This dataset is available online for use by others. 

I emailed the authors. Nice paper.

Dataset is out with RGB and LiDAR imagery. https://zenodo.org/records/8148479

@bw4sz
Collaborator Author

bw4sz commented Aug 9, 2023 via email

@bw4sz
Collaborator Author

bw4sz commented Aug 10, 2023

https://zenodo.org/record/8008028

https://www.mdpi.com/2072-4292/15/14/3599

image

Netherlands

Data is corrupt as far as I can tell? Data was downloaded, provided by authors.

@bw4sz
Collaborator Author

bw4sz commented Aug 26, 2023

Data was not made available, but we could probably ask the authors. Netherlands. https://www.mdpi.com/2072-4292/15/17/4128#

@bw4sz
Collaborator Author

bw4sz commented Sep 7, 2023

image LiDAR benchmark, ask Puliti if there is overlapping RGB data?

@bw4sz
Collaborator Author

bw4sz commented Sep 7, 2023

There is a LiDAR dataset, but one figure has an ortho basemap. I emailed the authors.

https://essd.copernicus.org/articles/14/2989/2022/essd-14-2989-2022.pdf

image

@bw4sz
Collaborator Author

bw4sz commented Sep 7, 2023

An additional Siberia dataset? https://doi.pangaea.de/10.1594/PANGAEA.957253

@bw4sz
Collaborator Author

bw4sz commented Sep 29, 2023

@bw4sz
Collaborator Author

bw4sz commented Sep 29, 2023

Need to write
image

@bw4sz
Collaborator Author

bw4sz commented Oct 11, 2023

France LiDAR paper, are there orthos? https://www.mdpi.com/2072-4292/14/5/1083

@bw4sz bw4sz added the Performance and Ideas for Machine Learning! labels Oct 15, 2023
@bw4sz
Collaborator Author

bw4sz commented Oct 22, 2023

unlabeled https://datadryad.org/stash/dataset/doi:10.5061/dryad.21t1805 tropical forest liana

@tyfoan

tyfoan commented Nov 2, 2023

Here is a report for https://docs.google.com/document/d/16kKik2clGutKejU8uqZevNY6JALf4aVk2ELxLeR-msQ/edit

I have sent the maintainers an email. Things didn't quite line up.

Hello, I maintain a small python tool for object detection in forested landscapes.  https://deepforest.readthedocs.io/en/latest/ https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.13472

The tree detection tool has become quite popular in the scientific community and we are looking to bolster it with images around the world. I've been collecting links to tree annotations and imagery and have a student this summer who will start training new models. I went to inspect the dataset linked from https://docs.google.com/document/d/16kKik2clGutKejU8uqZevNY6JALf4aVk2ELxLeR-msQ/edit, but I became a bit confused.  I downloaded the .shp and overlaid it on the imagery. I had initial trouble because in the document here: https://docs.google.com/document/d/16kKik2clGutKejU8uqZevNY6JALf4aVk2ELxLeR-msQ/edit, the geotiff that is linked does not spatially overlap with the .shp. 

Looking around https://map.openaerialmap.org/#/-175.26188850402832,-21.1463924204947,13/square/20002233030/5a82999d5a9ef7cb5d5ae685?_k=ofo82l I found a geotiff that does overlap the area. However, zooming in, it looks as if there isn't strong georeferencing, the tree points don't correspond to trees. They seem more or less randomly distributed. I'm assuming that this is not the tile that was intended to be georeferenced to this shapefile.

Screenshot 2023-03-15 at 10 14 47 AM image

All help appreciated. Ben Weinstein

There are 13,000 labeled trees. We may need to clean up non-coconut trees with new annotations.

If you still need help, then you should change the CRS to EPSG:3857 for the .tif layer.
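
For context on why a CRS mismatch scatters the points: EPSG:3857 (Web Mercator) coordinates are in metres, while longitude/latitude are in degrees, so the same location has very different numbers in each. A minimal sketch of the forward conversion, assuming the shapefile points are in longitude/latitude degrees (that is an assumption, the thread does not state the shapefile's CRS). In practice a GIS or a projection library would do this:

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to EPSG:3857 x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y
```

Converting one known tree location by hand and comparing it with the raster's bounds is a quick check on whether the layer's declared CRS is the real problem.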

image

@rlrognstad

Ran across datasets for this paper and repo.
An interesting use case to complement RGB with thermal imagery for detecting trees in shadow.
Potential immediate use of RGB labeled images and future use if high-res thermal becomes more widely available (HotSat or similar)

image

@bw4sz
Collaborator Author

bw4sz commented Jan 26, 2024

Thanks @rlrognstad. We got this one, and are in contact with the authors. I appreciate you looking with us! I'm compiling a list below. This issue is pretty old and pre-dates any formal attempt in pulling data together. They actually used DeepForest for a portion of their training structure.

I suppose I can close this issue and point to the sheet.

https://docs.google.com/spreadsheets/d/1-Q6ekQNE7TZBHQnrbjGl_2tcfh_X9154Kbzn3YFFruM/edit?usp=sharing

Definitely let me know about absolutely anything you see!
