Functions to import reported daily sub-national new cases and deaths from LMIC #5

Open
4 tasks
ffinger opened this issue Apr 9, 2020 · 22 comments
Labels
enhancement Add new features to an existing package high_priority Urgent for COVID19 analytics medium_complexity Can be completed by 1 person in a few (<5) days.

ffinger commented Apr 9, 2020

Description

A function for each country that accesses public data sources and makes the data accessible in R.
This package has good examples for a number of countries: https://github.com/epiforecasts/NCoVUtils

Functions already exist for most European and Asian countries and the US. We are currently looking for data and functions for LMIC (low- and middle-income countries), especially African countries.

I suggest we use this issue to keep track of

  1. Needs for data and R functions for specific countries
  2. Available data sources for specific countries
  3. Functions that import the data sources identified in 2.

See below for google sheet tracking those.

Output

The output format of each function should be a long data frame containing the following columns:

  • country
  • admin_subdivision_level_1
  • admin_subdivision_level_2 (if available)
  • (more levels if available)
  • date
  • cases
  • deaths

where cases and deaths stand for the newly reported cases and deaths on that day.
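
To make the target format concrete, here is a minimal sketch of a check that a country function's output has the columns listed above. The helper name `check_output_format()` is hypothetical (it is not part of NCoVUtils), and the requirements that `date` is of class `Date` and that the counts are numeric are assumptions, since the issue does not specify column types. The example rows are made up.

```r
# Hypothetical helper (not part of NCoVUtils): checks that a data frame
# follows the long format described above.
check_output_format <- function(df) {
  required <- c("country", "admin_subdivision_level_1", "date", "cases", "deaths")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Assumption: dates are stored as Date, counts as numeric
  if (!inherits(df$date, "Date")) {
    stop("`date` must be of class Date")
  }
  for (col in c("cases", "deaths")) {
    if (!is.numeric(df[[col]])) stop("`", col, "` must be numeric")
  }
  invisible(df)
}

# Made-up example rows, for illustration only (not real data)
example <- data.frame(
  country = "Exampleland",
  admin_subdivision_level_1 = c("Region A", "Region B"),
  date = as.Date("2020-04-09"),
  cases = c(3, 0),
  deaths = c(0, 1),
  stringsAsFactors = FALSE
)
check_output_format(example)
```

Deeper levels such as admin_subdivision_level_2 are optional, so the sketch does not require them.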

Functions can either be added to https://github.com/epiforecasts/NCoVUtils via pull request, or we can start our own package that wraps NCoVUtils and other solutions for the already implemented countries.

Countries already covered:

https://github.com/epiforecasts/NCoVUtils covers the following countries so far:

  • Belgium
  • Canada
  • France
  • Germany
  • Italy
  • Spain
  • United Kingdom
  • United States
  • Japan
  • Korea
  • Afghanistan

Countries to be done

Spreadsheet to track requested countries, data sources and implementations:
https://docs.google.com/spreadsheets/d/1uvg07BAmwKqLqhKvkejhkX7uvXiGCre4sz11Au3pz9Q/edit?usp=sharing

My own priority list:

  • Burkina Faso
  • Iraq
  • Democratic Republic of the Congo
  • Syria

Links

A few places where data sources are indexed:

https://data.humdata.org/event/covid-19
https://coronavirustechhandbook.com/home
https://www.europeandataportal.eu/data/datasets?locale=en&categories=heal&page=1&query=covid

@ffinger ffinger added high_priority Urgent for COVID19 analytics medium_complexity Can be completed by 1 person in a few (<5) days. enhancement Add new features to an existing package new_package Create a new R package labels Apr 9, 2020

ffinger commented Apr 9, 2020

scottyaz commented Apr 9, 2020

Would be good to start a google spreadsheet (if it doesn't exist) with sources for each country. For example http://covid19.health.gov.mw is a good source for Malawi.

seabbs commented Apr 9, 2020

We are keen to have contributions to our package, but we are also happy for this to be a separate project if that makes sense. We've been talking about how, and whether, we want to support it as a more widely known data resource, and that seems to make more and more sense.

ffinger commented Apr 9, 2020

Google spreadsheet to track requests for countries, data sources and implementations:
https://docs.google.com/spreadsheets/d/1uvg07BAmwKqLqhKvkejhkX7uvXiGCre4sz11Au3pz9Q/edit?usp=sharing

@ffinger ffinger removed the new_package Create a new R package label Apr 9, 2020

ffinger commented Apr 9, 2020

@seabbs, happy to contribute to NCoVUtils

xt-21 commented Apr 10, 2020

Would like to work on Burkina Faso

ColinFay commented Apr 12, 2020

Hey,

Can you explain the process for contributing?

Do you want us to PR {NCovUtils}?

Also, there is: https://www.worldometers.info/coronavirus/

Here's a function to get today's and yesterday's data frames:

get_worldmeter_df <- function(){
  # Parse the page and pull out every HTML table it contains
  page <- xml2::read_html(
    "https://www.worldometers.info/coronavirus/"
  )
  tbls <- rvest::html_table(page)
  # Drop the first 7 rows of each table, keeping rows 8 onward
  tbls[[1]] <- tbls[[1]][8:nrow(tbls[[1]]),]
  tbls[[2]] <- tbls[[2]][8:nrow(tbls[[2]]),]
  list(
    today = tbls[[1]], 
    yesterday = tbls[[2]]
  )
}
get_worldmeter_df()

Has Burkina Faso, Iraq, Congo and Syria

tod <- get_worldmeter_df()$today
# "Irak" matches no row in the output below (the table presumably spells it "Iraq")
tod[
  tod$`Country,Other` %in% c("Burkina Faso", "Irak", "Congo", "Syria"),
]
    Country,Other TotalCases NewCases TotalDeaths NewDeaths
99   Burkina Faso        484                   27          
147         Congo         60                    5          
166         Syria         25                    2          
    TotalRecovered ActiveCases Serious,Critical
99             155         302                 
147              5          50                 
166              5          18                 
    Tot Cases/1M pop Deaths/1M pop TotalTests Tests/1M pop
99                23             1                        
147               11           0.9                        
166                1           0.1                        
    Continent
99     Africa
147    Africa
166      Asia
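
The blank cells in the output above suggest that the count columns come back as text rather than numbers. Assuming that is the case, and assuming values may carry thousands separators or a leading plus sign, a small helper along these lines could convert them. The name `to_number()` is made up for illustration.

```r
# Hypothetical helper: keep only digits and the decimal point, then convert.
# Empty strings become NA. Note that this also drops minus signs, so it is
# only suitable for non-negative counts.
to_number <- function(x) {
  cleaned <- gsub("[^0-9.]", "", as.character(x))
  cleaned[cleaned == ""] <- NA
  as.numeric(cleaned)
}

# Made-up input strings covering the assumed cases
to_number(c("484", "1,234", "+27", ""))
```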

ffinger commented Apr 12, 2020

Hi @ColinFay,
yes, the best is to PR NCovUtils.

I haven't seen any sub-national data (by region, province or similar) on Worldometers; am I missing something?

I think it would still be a good addition to the existing functions that get national data from ECDC, WHO, JHU or similar, especially since there seems to be data on testing.

@ColinFay

@ffinger not that I know of

ColinFay commented Apr 12, 2020

Another possible source for Burkina Faso:
https://www.humanitarianresponse.info/en/op%C3%A9rations/burkina-faso/documents/table/themes/covid-19

The PDF(s) would need to be scraped.

ffinger commented Apr 12, 2020

Thanks for this!
Does anyone have time to implement the scraping?

ffinger commented Apr 12, 2020

There is a figure here too; its sources are probably the previously linked sitreps:
https://fr.wikipedia.org/wiki/Pand%C3%A9mie_de_Covid-19_au_Burkina_Faso

image

It is probably possible to scrape, since the data seem to be in the code of the figure (click on "modify code" to see it).

@ColinFay

Here's the code to download all the pdfs:

dir.create("burkina_covid")
for (
  i in c(
    "https://www.humanitarianresponse.info/en/op%C3%A9rations/burkina-faso/documents/table/themes/covid-19", 
    "https://www.humanitarianresponse.info/en/op%C3%A9rations/burkina-faso/documents/table/themes/covid-19?page=1", 
    "https://www.humanitarianresponse.info/en/op%C3%A9rations/burkina-faso/documents/table/themes/covid-19?page=2", 
    "https://www.humanitarianresponse.info/en/op%C3%A9rations/burkina-faso/documents/table/themes/covid-19?page=3"
  )
){
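  # Parse one listing page
  page <- xml2::read_html(
    i
  )
  # The download links are the dropdown menu entries
  but <- rvest::html_nodes(page, ".dropdown-menu a")
  lapply(
    rvest::html_attr(but, "href"), 
    function(x){
      download.file(
        x, 
        file.path(
          "burkina_covid", 
          basename(x)
        ),
        mode = "wb" # binary mode, so the files are not corrupted on Windows
      )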
    }
  )
}

> fs::dir_tree("burkina_covid/")
burkina_covid/
├── covidresponseplanremarks-french.docx
├── ghrp-covid19-en.pdf
├── ghrp-covid19-fr.pdf
├── integration_du_covid-19_dans_la_reponse_humanitaire.pdf
├── plan_de_riposte_covid19-revise_def.pdf
├── reach_bfa_suivi_situation_humanitaire_resultats_pertinents_covid19_region_centre_nord_fevrier_2020-1.pdf
├── reach_bfa_suivi_situation_humanitaire_resultats_pertinents_covid19_region_centre_nord_fevrier_2020-1_1.pdf
├── reach_bfa_suivi_situation_humanitaire_resultats_pertinents_covid19_region_sahel_fevrier_2020.pdf
├── sitrep_n27_du_24_03_20.pdf
├── sitrep_n_29_0.pdf
├── sitrep_n_32_covid-19_du_29_mars_2020_0.pdf
├── sitrep_n_33.pdf
├── sitrep_ndeg17_du_14_03_20.pdf
├── sitrep_ndeg21.pdf
├── sitrep_ndeg24_du_21_03_20.pdf
├── sitrep_ndeg25.pdf
├── sitrep_ndeg28.pdf
├── sitrep_ndeg35_0.pdf
├── sitrep_ndeg_20_du_17_03_20.pdf
├── sitrep_ndeg_22_du_19_03_20.pdf
├── sitrep_ndeg_26_du_23_03_20.pdf
├── sitrep_ndeg_31.pdf
├── sitrep_ndeg_34.pdf
├── sitrep_ndeg_36.pdf
├── sitrep_ndeg_37.pdf
├── sitrep_ndeg_38_covid_bfa_au_04_04_2020.pdf
├── sitrep_ndeg_39_1.pdf
├── sitrep_ndeg_40_0.pdf
├── sitrep_ndeg_41_au_7_avril_2020_1.pdf
├── sitrep_ndeg_42_covid-19_burkina_faso.pdf
├── sitrep_ndeg_43.pdf
└── sitrep_ndeg_44.pdf
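
The sitrep file names in the listing above are not named consistently (`sitrep_n27…`, `sitrep_n_29…`, `sitrep_ndeg17…`, `sitrep_ndeg_44…`), so picking the latest report by sorting the names would be unreliable. As a rough sketch, the report number could be parsed out of the name instead. The regular expression below is an assumption based only on the names shown here and may need adjusting for future files.

```r
# A few file names copied from the listing above; in practice this vector
# would come from list.files("burkina_covid")
files <- c(
  "ghrp-covid19-en.pdf",
  "sitrep_n27_du_24_03_20.pdf",
  "sitrep_n_29_0.pdf",
  "sitrep_ndeg17_du_14_03_20.pdf",
  "sitrep_ndeg_43.pdf",
  "sitrep_ndeg_44.pdf"
)

# Keep only the situation reports
sitreps <- grep("^sitrep_n", files, value = TRUE)

# The report number is the first run of digits after "sitrep_n",
# optionally preceded by "deg" and/or an underscore
sitrep_no <- as.integer(
  sub("^sitrep_n(deg)?_?([0-9]+).*$", "\\2", sitreps)
)

# File name of the report with the highest number
latest <- sitreps[which.max(sitrep_no)]
latest
```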

@ColinFay

Here's a piece of code to extract data from the latest pdf:

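library(tabulizer)
# Read the raw text of the latest sitrep and split it into lines
res <- tabulizer::extract_text("burkina_covid/sitrep_ndeg_44.pdf")
res <- strsplit(res, "\n")[[1]]
# For each line matching `txt`, keep the digits that follow the first ": "
num_extr <- function(
  res, txt
){
  gsub(
    "[^:]*: ([0-9]*).*", 
    "\\1", 
    grep(txt, res, value = TRUE)
  )
}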

cont <- c(
  "Cumul personnes contacts listées",
  "Contacts confirmés COVID-19 depuis le début", 
  "Nbre de contacts sortis de suivi ce jours", 
  "Cumul de contacts sortis après 14 jours de suivis", 
  "Nombre de contacts à suivre", 
  "Nombre de contacts vus", 
  "Nombre de contacts non vus", 
  "Nombre de contacts devenus suspects", 
  "Nombre de nouveaux contacts"
)

x <- sapply(
  cont, function(x){
    num_extr(res, x)
  }
) 

tibble::rownames_to_column(
  as.data.frame(x), 
  "type"
)
                                               type    x
1                  Cumul personnes contacts listées 2409
2       Contacts confirmés COVID-19 depuis le début  272
3         Nbre de contacts sortis de suivi ce jours   31
4 Cumul de contacts sortis après 14 jours de suivis 1076
5                       Nombre de contacts à suivre 1061
6                            Nombre de contacts vus    1
7                        Nombre de contacts non vus   19
8               Nombre de contacts devenus suspects   10
9                       Nombre de nouveaux contacts   99
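
The extraction above returns character values, and it can return zero or several results when a label is missing or appears on more than one line. A slightly stricter variant, sketched below, always returns a single integer or `NA`. The function name `extract_count()` is hypothetical, and the sample lines are made up to imitate the assumed "label : number" layout of the PDF text; only the two numbers are taken from the output above.

```r
# Hypothetical stricter variant of the extraction: returns one integer,
# or NA when the label is absent or no number follows the colon
extract_count <- function(lines, label) {
  hit <- grep(label, lines, value = TRUE, fixed = TRUE)
  if (length(hit) == 0) return(NA_integer_)
  # First ":" followed by optional whitespace and digits, on the first hit
  m <- regmatches(hit[1], regexpr(":[[:space:]]*[0-9]+", hit[1]))
  if (length(m) == 0) return(NA_integer_)
  as.integer(gsub("[^0-9]", "", m))
}

# Made-up sample lines in the assumed layout
sample_lines <- c(
  "Nombre de nouveaux contacts : 99",
  "Nombre de contacts non vus : 19 (trailing text)",
  "Nombre de contacts devenus suspects : non disponible"
)

extract_count(sample_lines, "Nombre de nouveaux contacts")
extract_count(sample_lines, "Nombre de contacts devenus suspects")
```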

I'm French, so these seem to me to be the interesting parts. But as I'm no expert in the field, it would be helpful to have someone with domain knowledge point me to the relevant parts of the PDF.

PaulC91 commented Apr 12, 2020

New confirmed cases and deaths by district would be great, but there doesn't seem to be any pattern in the way this information is given in the PDF (unlike the contact-tracing section, "suivi des contacts", above), so I'm guessing it would be difficult to scrape consistently.

@ColinFay

Here's an attempt at a package to download and scrape the data: https://github.com/ColinFay/covidbf

Let me know if you want me to work more on this.

ffinger commented Apr 12, 2020

@ColinFay thanks a lot.
As mentioned by @PaulC91, the information you are scraping is the contact-tracing report. The new cases per region or per district are buried in the text and, it seems, not consistently reported. There is also the map at the beginning that gives new cases by district, but I believe it would be very hard to scrape.

@ColinFay

Just to check, have you tried contacting the people listed at the bottom of the pdf? They might be willing to share the data

@ColinFay

Oh and, what's LMIC?

xt-21 commented Apr 12, 2020 via email

ffinger commented Apr 12, 2020

@ColinFay yes, we are in contact with authorities.

ffinger commented Apr 28, 2020

I added some new countries and potential data sources to the spreadsheet.

See here for details:
epiforecasts/NCoVUtils#72 (comment)


6 participants