
provide option to plot actual (realized) distribution on map #21

Closed
timadriaens opened this issue Jun 1, 2020 · 32 comments
@timadriaens
Member

Hi, I think it would be useful to give users the option to plot the actual distribution (from the cube or from GBIF directly), so they can explore to what extent the potential area in Belgium has already been invaded. We need to think about proper date cut-offs (e.g. from 2000) and the type of data to use (human observations or other).

@niconoe niconoe self-assigned this Jun 2, 2020
@niconoe
Collaborator

niconoe commented Jun 2, 2020

@timadriaens: yep, I can see the usefulness of this!

I can see two approaches:

  • Ask GBIF directly and dynamically to generate the tiles for us via their maps API, and add the ability to display such overlays on our maps.
  • Write some scripts to transform the data cube (and/or additional data) into a "current distribution" GeoTiff that can be shown via the existing machinery.

I lean towards the former approach since it's simpler to implement and seems (at first look) to do all we need. The drawback is that it's a bit less flexible: since GBIF does all the data preparation work, we are stuck with their restrictions (only GBIF data, and not every cut-off/filtering option we want may be available).

I'd be tempted to try this solution soon and see how it works. To make sure we are working as efficiently as possible:

  • @timadriaens and @SoVDH: can you already think about the cut-offs and the type of data to use (so I can check they're indeed available via the GBIF APIs)? Are you comfortable with the fact that all this data will come from GBIF (no other sources)?
  • @peterdesmet: I'm interested in your opinion on all of this. Do you agree with the suggested approach(es)?

@peterdesmet
Member

I would definitely try the GBIF maps API first. The fact that it is GBIF data only is not a problem; the fact that it isn't processed into a gridded cube might be (though it can show more detail). I therefore suggest creating a quick test and having @timadriaens and @SoVDH see if that meets the demand.

@timadriaens
Member Author

Indeed, I too prefer to plot the real xy data from GBIF rather than the "squarified" data; this will also make for a clearer distinction between the gridded risk maps and the real observations. I am in favour of using dots.

@niconoe Selection criteria: basisOfRecord = human observation mostly (excluding records with geospatial issues), but you can probably simply use the code for the indicators discussed in this issue, which uses everything but fossil specimens:

    occ_clean <- occ %>%
      filter(basisOfRecord != "FOSSIL_SPECIMEN") %>%
      filter(hasCoordinate == "TRUE") %>%
      filter(hasGeospatialIssues == "FALSE") %>%
      filter(is.na(coordinateUncertaintyInMeters) | coordinateUncertaintyInMeters < 708) %>%
      # select desirable variables
      select(taxonKey, species, scientificName, decimalLatitude, decimalLongitude,
             eventDate, year, coordinateUncertaintyInMeters, datasetKey,
             countryCode, establishmentMeans) %>%
      # drop coordinates with one decimal place or fewer (too imprecise)
      filter(!grepl("^[0-9]+(\\.[0-9]{0,1})?$", decimalLatitude)) %>%
      filter(!grepl("^[0-9]+(\\.[0-9]{0,1})?$", decimalLongitude))

I would however perhaps exclude data with too large a coordinate uncertainty. If we eventually plot maps in Harmonia, it's probably best to make sure these use the same criteria to select occurrences from GBIF, so there are no discrepancies between TrIAS products anywhere.

@niconoe
Collaborator

niconoe commented Jun 4, 2020

I've now implemented a first version of this feature, visible here as usual.

I've taken the simplest approach discussed above (make GBIF render the maps for us!). I therefore had to live with some limitations, both in terms of data selection and visual rendering, and to diverge slightly from what was discussed before:

In terms of selection criteria:

(@timadriaens: unfortunately we can't just run random R code in the web API)

  • Unfortunately, I can't filter on coordinateUncertaintyInMeters.
  • I can't filter on hasGeospatialIssues (but I assume those records are excluded from the maps API anyway - we can check with GBIF).
  • I asked for the following values for basisOfRecord: OBSERVATION, HUMAN_OBSERVATION, MACHINE_OBSERVATION, MATERIAL_SAMPLE, PRESERVED_SPECIMEN, LIVING_SPECIMEN, LITERATURE. I assume this is equivalent to excluding FOSSIL_SPECIMEN and UNKNOWN (should I include the latter?).

In terms of display:

  • Showing a point per occurrence isn't really visible against the map background, because it's only one pixel per occurrence.
  • I therefore had to show a density map (occurrences aggregated into squares or hexagons - I chose hexagons for now).
  • Showing those hexagons and the squared model simultaneously is completely unreadable, so I made it possible to show one or the other.
  • I don't think GBIF can give us a color legend ("X occurrences gives color Y"), contrary to what we do with Amy's models.

I'm interested in all feedback, but my main question to @timadriaens and @SoVDH is: can we live with the limitations listed above? (Other changes such as colors, application interface, display logic, ... can still take place.) If yes, the work on this issue is almost done 🥳. Otherwise, just tell me and I'll look at a heavier but more flexible approach!
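For readers following along, the tile requests described above can be sketched as follows. This is a minimal illustration assuming the public GBIF v2 maps API (density layer, repeatable basisOfRecord parameter, hexagon binning); the taxon key and tile coordinates are placeholder values, not ones used by the project.

```python
from urllib.parse import urlencode

BASE = "https://api.gbif.org/v2/map/occurrence/density"

def density_tile_url(taxon_key, z, x, y, basis_of_record, bin_mode="hex"):
    """Build the URL of one density tile for a taxon, binned into hexagons."""
    params = [("taxonKey", taxon_key), ("bin", bin_mode), ("srs", "EPSG:3857")]
    # basisOfRecord is a repeatable parameter: one pair per accepted value
    params += [("basisOfRecord", b) for b in basis_of_record]
    return f"{BASE}/{z}/{x}/{y}@1x.png?" + urlencode(params)

# The values niconoe lists above (everything except FOSSIL_SPECIMEN/UNKNOWN)
accepted = ["OBSERVATION", "HUMAN_OBSERVATION", "MACHINE_OBSERVATION",
            "MATERIAL_SAMPLE", "PRESERVED_SPECIMEN", "LIVING_SPECIMEN",
            "LITERATURE"]
url = density_tile_url(2436775, 8, 131, 86, accepted)  # placeholder taxon/tile
```

Each such tile is a PNG that a web map library can stack as an overlay on a base layer; `bin=hex` is what produces the hexagon density rendering mentioned above.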

@timadriaens
Member Author

Looking great, but it would be handier if we could visualize those superimposed on the risk map. Do you think this is possible?

@SoVDH

SoVDH commented Jun 4, 2020

This is GREAT! Regarding Tim's comment above, I would even suggest being able to choose 'risk map', 'realized distribution' or 'superimposed'.
Thanks a lot for this, Nico. This is a great achievement already :-)

@niconoe
Collaborator

niconoe commented Jun 4, 2020

@timadriaens and @SoVDH: I made a few quick tests previously and the result was visually messy and hard to read. But I'll try again with different settings and colours and see what we can get!

@niconoe
Collaborator

niconoe commented Jun 5, 2020

@timadriaens and @SoVDH: I implemented Sonia's suggestion and tweaked a few things (colours, how the opacity settings work, ...) and I think we have reached something decent.

Can you have a look?

@SoVDH

SoVDH commented Jun 5, 2020

Yes! Giving the choice between the 3 visualizations is indeed a good idea, and by playing with the opacity I find it very readable. Thanks Nico, it's really nice :-)

@peterdesmet
Member

Nice work! I noticed an error ("surimposed" rather than "superimposed") in the labels, and would rename them to:

  • Modelled data
  • Occurrence data
  • Both

I'm not sure we should use "realized distribution", because there might be more distribution than there is occurrence data. I think it is fine not to mention GBIF in the label name, as we'll need to explain anyway that both the modelled and the occurrence data are based on GBIF.

@peterdesmet
Member

Also, why is the modelled data still showing when I select "Occurrence data"?

niconoe pushed a commit that referenced this issue Jun 9, 2020
@niconoe
Collaborator

niconoe commented Jun 9, 2020

Thanks @peterdesmet: I fixed a couple of bugs in the display logic and updated the labels, I think the situation is better now.

@niconoe
Collaborator

niconoe commented Jun 11, 2020

If the current implementation seems decent to everyone, I suggest closing this issue.

@qgroom
Contributor

qgroom commented Jun 14, 2020

It looks good to me

@niconoe niconoe closed this as completed Jun 16, 2020
@amyjsdavis

amyjsdavis commented Jul 1, 2020

Hi all: I am sorry I missed this last month. I had asked Nico (while I was unaware of this thread) to show only the occurrence data I used to make the risk models, because there is sometimes a big difference between the data I used and the occurrences that are shown in the risk mapping application. Now that I've read through this thread, it seems the biggest differences between the occurrence data he is showing and the data I used for the risk modelling are the time period and the exclusion of data based on coordinate uncertainty. In order to align the models with the historic climate data, only data from 1976 to 2005 are used in my models. I had prepared shapefiles that show the occurrence data used in the models for Belgium, and that is ideally what I would like to use, since he can't filter based on coordinate uncertainty. The second-best option is to filter the data based on time; I think that would make the occurrence data closely approximate what was used. What do you all think? @niconoe @timadriaens @SoVDH @DiederikStrubbe

@niconoe
Collaborator

niconoe commented Jul 1, 2020

Indeed, let's continue the discussion here rather than by e-mail. My two cents:

  • About which data to include: I'll let the scientists answer :)
  • About the webapp implementation, if I understand correctly: the way the occurrences are shown and the user interface should stay as they are now (same logic, styling, ...), the difference being that the data source is now @amyjsdavis's shapefiles instead of data loaded directly via the GBIF API. Is this correct? (That's basically all I need to know to implement it properly.)

Related questions:

  • What's the timeline for this change? If I understand correctly, @SoVDH would like to use the tool ASAP?
  • @amyjsdavis: would it be possible to change the shapefiles' naming convention so it's consistent with the modelled GeoTiff files (for example, using the GBIF taxon ID rather than the scientific name)? This would make things much simpler and smoother in terms of automation (it's easier for a machine to match the different files related to a given species/taxon).
  • @amyjsdavis: as discussed in other threads, we think it would be great and more in line with the project philosophy if your various data transformations were fully available and documented at all times on GitHub, and repeatable, commentable and improvable by everyone. I am thinking that maybe the generation process of those shapefiles is a good candidate for a fully "open" workflow from scratch? I understand working with GitHub and pipelining multiple tools together can be time-consuming at first and a bit out of your comfort zone, so if you think a short four-hands "hackathon" could help, just tell me and I'll free some time for you.

@damianooldoni / @timadriaens / @SoVDH / @peterdesmet / @qgroom / @DiederikStrubbe : as usual, your opinion is appreciated!

@timadriaens
Member Author

@amyjsdavis The idea is to give the assessor an idea of the realised niche in Belgium. I see no reason why you would show only the data you used for drafting your models. And certainly, it makes little sense not to provide a risk assessor with the last 15 years (post-2005) of data. This would in effect mean you would not show any waarnemingen.be data, as that recording platform only started in 2006. But indeed, we need to show verified data only, and if this preprocessing is not possible using the GBIF API, perhaps it's better to use the cube as a data source for the visualization? I feel a hackathon would certainly be useful. Perhaps there could also be a session alongside it, for dummies like me, to actually explain how to run the trias packages and produce graphs and maps for species. As end users, we don't need to crack the functions in depth, but we will want to use them to produce the indicator graphs and risk maps for the species we want.

@amyjsdavis

amyjsdavis commented Jul 1, 2020

@niconoe: they don't need to be shapefiles; they can be text files if that makes life easier. And sorry, I meant to rename them using the taxon key before sending.
Also, my entire workflow is already on GitHub, with the exception of this last step, where the EU occurrences were clipped to the Belgian border. I did not publish this latest workflow (or data transformation) because it is still undecided whether these data will be used.
Update: I used the TrIAS workflow to create my global download, but this is not evident in my script. I will change that, and I will add the scripts that create all the data products used in the modelling.

@amyjsdavis

amyjsdavis commented Jul 1, 2020

@timadriaens: Indeed, I can see the utility of showing more recent occurrences, but not the old ones that predate the climate data. However, all the data used to show the realized niche should follow the same filtering criteria used to make the models, with the exception of the time period: the logic being that if the data were not good enough to include in the model, they are not good enough to indicate the realized niche.

@damianooldoni

I have just updated the occurrence cubes. 🥳 🎈 I still have to upload them to Zenodo; I want to double-check them after I return from holiday.
So, maybe @niconoe can use the Belgian cube for the visualization? It is made using verified data only, and it is easy to remove squares based on year or minimal coordinate uncertainty. As it makes use of the 1 km squares of the EEA, the shapefiles can be read as well. Just an idea.

@amyjsdavis

@damianooldoni : I really like this idea.

@niconoe
Collaborator

niconoe commented Jul 2, 2020

@timadriaens / @SoVDH / @DiederikStrubbe: do you agree the Belgian cube should be the source of data displayed in the map viewer?

@ALL: in that case, I suppose the appropriate rendering would be:

  • show the EEA grid squares where we have occurrences of the selected species (after filtering by year and minimal coordinate uncertainty)
  • have the colour of each square reflect the number of occurrences (darker = more occurrences, or something similar)
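The filter-then-shade logic proposed above could be sketched like this. This is only an illustration: the row shape `(year, eea_cell_code, n_occurrences, min_coord_uncertainty_m)`, the cut-off values and the three shade buckets are assumptions for the example, not the Belgian cube's actual schema or styling.

```python
from collections import defaultdict

def cells_to_plot(rows, min_year=2000, max_uncertainty_m=708):
    """Aggregate occurrence counts per EEA 1 km cell after filtering."""
    counts = defaultdict(int)
    for year, cell, n, uncertainty in rows:
        if year >= min_year and uncertainty <= max_uncertainty_m:
            counts[cell] += n
    return dict(counts)

def shade(count, buckets=(1, 10, 100)):
    """Map an occurrence count to a darkness level (higher = darker)."""
    return sum(count >= b for b in buckets)

rows = [
    (1999, "1kmE3910N3110", 4, 500),   # dropped: predates the year cut-off
    (2005, "1kmE3910N3110", 2, 500),
    (2010, "1kmE3910N3110", 9, 30),
    (2010, "1kmE3911N3110", 1, 2000),  # dropped: uncertainty too large
    (2012, "1kmE3912N3111", 120, 100),
]
counts = cells_to_plot(rows)
# counts == {"1kmE3910N3110": 11, "1kmE3912N3111": 120}
# shade(11) == 2, shade(120) == 3
```

The two filters mirror the selection criteria discussed earlier in the thread (year cut-off and the 708 m coordinate-uncertainty threshold), and the shade level is what the map would translate into square colours.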

@DiederikStrubbe

DiederikStrubbe commented Jul 2, 2020 via email

@timadriaens
Member Author

Regarding point (c): I wonder if it is very useful to show a risk assessor which data were used for the model versus which were not (unless you provide him/her with a lot of explanation of why this is, they will not understand why the model did not incorporate everything). Could a simple legend showing the temporal range of the occurrences not be more informative? For example, black dots for post-2000 records, hollow dots for pre-2000 ones?

As a general remark, I feel we should perhaps avoid showing different distribution maps in TrIAS at different places?

@DiederikStrubbe

DiederikStrubbe commented Jul 2, 2020 via email

@DiederikStrubbe

DiederikStrubbe commented Jul 2, 2020 via email

@amyjsdavis

If we can find a date for a hackathon, that would be great! :-) In that case, I can go over the model with those interested and we can work together to improve the modelling code. I am signing off for now! Ciao

@qgroom
Contributor

qgroom commented Jul 2, 2020

I'd also be interested in a hackathon, as would some of the team members you don't usually meet.

Quoting @timadriaens: "Regarding the point (c): I wonder if it is very useful to show a risk assessor the data that were used for the model versus the data that were not used (unless you provide him/her with lots of explanation why this is, they will not understand why the model did not incorporate all). Could a simple legend showing temporal range of the occurrences not be more informative? For example, black dots for >2000 records, hollow dots for <2000?"

BTW: you have to be cautious aggregating data in the cube across years. Due to the random assignment of observations to grid squares, it is not impossible for a single isolated tree to be assigned to more than 5 different grid cells. This is not such a problem for single years, but the risk increases as you aggregate.
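A toy simulation makes this caution concrete. It is an illustration of the idea only, not the cube's actual assignment algorithm; the location, the ~700 m uncertainty and the 1 km cell size are made-up values.

```python
import random

def assign_cell(x_m, y_m, uncertainty_m, rng, cell_size_m=1000):
    """Randomly place an observation within its uncertainty box, then snap it
    to the grid cell containing the randomized point."""
    px = x_m + rng.uniform(-uncertainty_m, uncertainty_m)
    py = y_m + rng.uniform(-uncertainty_m, uncertainty_m)
    return (int(px // cell_size_m), int(py // cell_size_m))

rng = random.Random(42)                # fixed seed for reproducibility
tree = (3910500.0, 3110500.0)          # one isolated tree, at a cell centre
# One record per year for 10 years: each is independently re-randomized,
# so aggregating across years can spread the same tree over several cells.
cells = {assign_cell(*tree, 700, rng) for _ in range(10)}
```

Within a single year there is only one assignment, but the aggregated 10-year set `cells` can contain up to nine distinct 1 km cells (the 3x3 neighbourhood the uncertainty circle overlaps), which is exactly the inflation qgroom warns about.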

@timadriaens
Member Author

--> Not sure what this remark is referring to? Correspondence between risk maps and temporal extent of occurrence maps?

No. I just mean that the distribution maps we'll eventually show on Harmonia (at least, I thought that was the idea) should not deviate from distribution maps in other TrIAS products, such as this tool to explore the risk maps. Unless, of course, the idea is to integrate this tool entirely.

@SoVDH

SoVDH commented Jul 7, 2020

I agree with Tim about the maps produced. As much as possible, we should avoid producing a diversity of maps. We should aim for a cartographic tool that is as 'generalist' as possible and that can be reused as much as possible for TrIAS and for future integration in Harmonia or in regional portals. After talking with Nico, I understand that this wish is a bit illusory and that it is rare that 'recycling' for other purposes can be envisaged. I believe, however, that we must try to maximize the possible uses.

Also, regarding 'Regarding the point (c): I wonder if it is very useful to show a risk assessor the data that were used for the model versus the data that were not used (unless you provide him/her with lots of explanation why this is, they will not understand why the model did not incorporate all).'
--> I am unconvinced of the value of showing the PRA assessor the difference between points used for risk mapping and points not used, in addition to the occurrence data. It is important to keep it simple and only give information that is useful for assessing current and future establishment capacity. They are not asked to assess the quality of the modelling, only to consider the map and its associated uncertainty.

'Could a simple legend showing temporal range of the occurrences not be more informative? For example, black dots for >2000 records, hollow dots for <2000?'
--> Indeed, this may be informative.

@niconoe
Collaborator

niconoe commented May 19, 2021

I have to admit I am a bit lost in this huge thread going in multiple directions.

Can we close it, or are there still active action/discussion points?

@peterdesmet
Member

I am fine with closing it. The scope of the current application should be kept limited, especially since the RShiny dashboards will likely provide much more.
