How to tackle uncertainty due to range of year_of_introduction variable #18

Closed
damianooldoni opened this issue Mar 27, 2018 · 13 comments

@damianooldoni
Contributor

No description provided.

@timadriaens
Member

timadriaens commented Apr 12, 2018

Several options are envisaged to tackle uncertainty associated with time periods or uncertainty around the date of first introduction of a species:

  • a method for imputation of missing data cf. the methods described in Onkelinx et al.
  • to randomly select a year within the range to avoid arbitrary peaks cf. Seebens et al.
  • to fit functions to the time series to test for different shapes of the temporal trends of first record rates and to reduce deviation between predicted and observed series using an optimization algorithm cf. Seebens et al.

Here, we chose to simply show uncertainty around time periods using the minimum and maximum year with time intervals.
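
As an illustration of the second option listed above (randomly selecting a year within the range), a minimal R sketch could look as follows. The data frame, its column names (`year_min`, `year_max`) and the helper `sample_year()` are hypothetical and only serve the example; this is not the procedure of Seebens et al., just one simple way to draw a year uniformly from each interval.

```r
# Hypothetical example data: one row per species with an uncertain
# introduction year somewhere between year_min and year_max.
species <- data.frame(
  name     = c("sp_a", "sp_b", "sp_c"),
  year_min = c(1900L, 1950L, 2000L),
  year_max = c(1910L, 1950L, 2018L)
)

# Draw one year uniformly from each closed interval [lo, hi].
# sample.int() is used instead of sample(lo:hi, 1) because sample()
# treats a single number n as 1:n, which would break when lo == hi.
sample_year <- function(lo, hi) {
  width <- hi - lo + 1L
  lo + vapply(width, function(w) sample.int(w, 1L), integer(1)) - 1L
}

set.seed(1)  # fixed seed so the draw is reproducible
species$year_sampled <- sample_year(species$year_min, species$year_max)
species
```

Repeating the draw many times and averaging the resulting yearly counts would smooth out arbitrary peaks, at the cost of a non-deterministic indicator.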

@peterdesmet
Member

Here, we chose to simply show uncertainty around time periods using the minimum and maximum year with time intervals.

How?

@damianooldoni
Contributor Author

The point is that the columns year_of_introduction_min and year_of_introduction_max in the data output, as explained in issue #17 (which is also related to issues #19 and #20), have a different meaning than we first thought.
What we called year_of_introduction_min is actually year_of_introduction, while year_of_introduction_max is a kind of year_of_extinction (@peterdesmet: maybe you can find a better name for this column?).
It is therefore difficult to apply the methods listed above to calculate uncertainty.

@peterdesmet
Member

peterdesmet commented Apr 12, 2018

Indeed, the dates are: first observed / last observed. Quite often those are the same (i.e. we only have a single date).

Plotting first observed

I played a bit with the Alien plant data (only), and this is what we would get if you just plot first observed:

first-observed

The line is cumulative and will never drop. As you can see, there are some arbitrary peaks in the data, which I don't consider a problem.

Plotting first/last observed

I think it would be worth taking an approach where for each year, you check for each species if its date range includes that year:

first-last-observed

It has the advantage that it makes use of first observed and last observed, so those species that are only recorded for a short time also drop from the timeline. It also shows a steep drop at the end, because of the last assessment year in the checklist, which does reflect the actual data we have.

Note: I couldn't find a smart algorithm to generate this chart: it was created in a spreadsheet.
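
A minimal R sketch of the per-year check described above, as an alternative to a spreadsheet. The example data and the column names `first_observed` and `last_observed` are assumptions made for illustration; the real checklist columns may be named differently.

```r
# Hypothetical example data: first and last year each species was observed.
species <- data.frame(
  name           = c("sp_a", "sp_b", "sp_c", "sp_d"),
  first_observed = c(2000L, 2001L, 2001L, 2003L),
  last_observed  = c(2004L, 2001L, 2003L, 2004L)
)

# All years covered by at least one date range
years <- seq(min(species$first_observed), max(species$last_observed))

# For each year, count the species whose [first, last] range includes it
n_recorded <- vapply(
  years,
  function(y) sum(species$first_observed <= y & species$last_observed >= y),
  integer(1)
)

data.frame(year = years, n_recorded = n_recorded)
```

The loop is quadratic in the worst case (years times species), which should be acceptable for a checklist of a few thousand species.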

@damianooldoni
Contributor Author

Using first_observed gives us the cumulative number of species, see issue #20.

@peterdesmet
Member

It would be cool if we could switch between the two charts, but it might be best to shelve that for later.

@peterdesmet
Member

If we want a line chart as proposed in #20, then the error margin could be the line in plot 2. Not sure how to explain this without a visual example, but the trend line = cumulative first year of introduction and the error line = cumulative first year minus last year. The error area will always lie below the trend line and could be shown as a shaded area. It expresses the lack of recent assessments.
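
One possible reading of "cumulative first year minus last year", sketched in R with made-up data: the number of species recorded in a year equals the cumulative count of first-observed years up to that year, minus the cumulative count of last-observed years strictly before it. The vectors and variable names below are hypothetical.

```r
# Hypothetical first/last observation years for four species
first_observed <- c(2000L, 2001L, 2001L, 2003L)
last_observed  <- c(2004L, 2001L, 2003L, 2004L)
years <- 2000:2004

# Number of species first / last observed in each year
n_first <- tabulate(match(first_observed, years), nbins = length(years))
n_last  <- tabulate(match(last_observed, years), nbins = length(years))

# Trend line: cumulative number of first observations
cum_first <- cumsum(n_first)

# Species drop out the year AFTER their last observation,
# so shift the cumulative count of last observations by one year
cum_dropped <- c(0L, head(cumsum(n_last), -1))

# Error line: species still recorded in each year
n_recorded <- cum_first - cum_dropped

data.frame(year = years, cum_first = cum_first, n_recorded = n_recorded)
```

Under this reading the error line can never exceed the trend line, which is consistent with the shaded area always lying below it.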

@peterdesmet
Member

@stijnvanhoey @SanderDevisscher can you make a version of #20 (comment) as described in the comment above? The line you currently have would be the error line

@stijnvanhoey
Contributor

stijnvanhoey commented Apr 17, 2018

@peterdesmet as a first test:

The data used to create the plot are shown below: n_defined is the count created for #25 (the number of species whose date range includes that year), n_introduced is the number of species introduced in that specific year (cf. #17), and cum_n_introduced is the cumulative number of species introduced:

# A tibble: 6 x 4
   year n_defined n_introduced cum_n_introduced
  <int>     <int>        <dbl>            <dbl>
1  1201         1            1                1
2  1202         1            0                1
3  1203         1            0                1
4  1204         1            0                1
5  1205         1            0                1
6  1206         1            0                1
...
   2016      1263           42             2537
   2017      1096           40             2577
   2018       741           17             2594

Plotting these data gives the following result:

image

image

@stijnvanhoey
Contributor

Should we include this in the cumulative indicator? Maybe someone can provide more appropriate names for both lines?

As a reference, the figure above was created with this code:

library(dplyr)
library(tidyr)
library(ggplot2)
library(INBOtheme)  # provides theme_inbo()

# Number of species introduced (first observed) in each year
introduction_count <- df_cleaned %>%
    group_by(startDate) %>%
    count() %>%
    ungroup() %>%
    rename(year = startDate,
           n_introduced = n)

# Expand each species to one row per year in its startDate:endDate range
df_extended <- df_cleaned %>%
    rowwise() %>%
    mutate(year = list(startDate:endDate)) %>%
    ungroup() %>%
    unnest(year)

# Number of species whose date range includes each year
totals <- df_extended %>%
    group_by(year) %>%
    count() %>%
    ungroup() %>%
    rename(n_defined = n)

start_year_plot <- 1900
x_scale_stepsize <- 10  # assumed value; not defined in the original snippet
plot_info <- left_join(totals, introduction_count,
                       by = "year") %>%
    replace_na(list(n_introduced = 0)) %>%
    arrange(year) %>%  # cumsum() below relies on chronological order
    mutate(cum_n_introduced = cumsum(n_introduced))

maxDate <- max(df_extended$year)
# geom_line() has no `label` argument; the legend text comes from `color`
plot <- ggplot(plot_info, aes(x = year)) +
    geom_line(mapping = aes(y = n_defined,
                            color = "described alien species")) +
    geom_line(mapping = aes(y = cum_n_introduced,
                            color = "cumulative number of introductions")) +
    geom_ribbon(aes(ymin = n_defined, ymax = cum_n_introduced),
                fill = "grey", alpha = 0.5) +  # alpha must be numeric
    xlab("Year") +
    ylab("Number of alien species") +
    scale_x_continuous(breaks = seq(start_year_plot, maxDate,
                                    x_scale_stepsize),
                       limits = c(start_year_plot, maxDate)) +
    theme_inbo()
plot

@peterdesmet
Member

peterdesmet commented Apr 17, 2018

n_introduced and cum_n_introduced look fine as names. For n_defined I would use n_recorded ("Recorded alien species"... for that year)

@timadriaens
Member

After some discussion with Wolfgang Rabitsch and @damianooldoni, we concluded that it does not make sense to put a cumulative and a non-cumulative graph on the same plot.

@damianooldoni
Contributor Author

Thanks @timadriaens ! I think we can close this issue.
