Discussion on accessing Australian data #17

Open
njtierney opened this Issue Oct 16, 2017 · 23 comments

Collaborator

njtierney commented Oct 16, 2017

There are a lot of Australian data sources available through resources such as data.gov.au, which contains a huge amount of public data.

However, it is almost guaranteed that you need to invest a solid chunk of time into cleaning the data and preparing it for analysis and checking the data quality.

I'd like to develop a catalog/table/similar that describes Australian datasets for analysis that are ready, or near-ready to analyse. Or perhaps even just discuss this idea here on the repo.

What I imagine is something like a table where you have columns like:

  • keywords
  • abstract
  • data_type
  • api_available (e.g., the ukpolice data has an official API, and the Australian census data doesn't)
  • cleaned (excess whitespace trimmed, variable names make sense - no names like 2016 Rainfall at Toowong_Bowls_Club etc.)
  • data_in_software: Is the data in an R package or other software?
  • data_access_in_software: Can you access the data with an R package or other software?

This could help direct the efforts of researchers and analysts, knowing the state of what is ready to access, and also identify those data sources that might be ripe for an R package containing the data, or a way to access it.
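As a rough illustration of the table described above, here is a minimal sketch in base R. The column names follow the list above; the extra `name` column and the two example rows are assumptions for illustration, filled in only from the package descriptions in this thread, with `NA` wherever the thread does not say.

```r
# Illustrative catalogue: one row per dataset/package, columns as proposed above.
# Values are taken from descriptions in this thread; NA means "not stated here".
catalogue <- data.frame(
  name = c("eechidna", "bomrang"),  # hypothetical identifier column
  keywords = c("election; census; shapefiles", "weather; Bureau of Meteorology"),
  abstract = c(
    "Australian election and census data with shapefiles, cleaned and collected together",
    "Fetch Australian Bureau of Meteorology data"
  ),
  data_type = c(NA_character_, NA_character_),
  api_available = c(NA, NA),
  cleaned = c(TRUE, NA),
  data_in_software = c(TRUE, NA),
  data_access_in_software = c(NA, TRUE),
  stringsAsFactors = FALSE
)

# Entries known to be cleaned and shipped inside a package
# (subset() drops rows where the condition is NA)
ready <- subset(catalogue, cleaned & data_in_software)

# Entries with at least one unknown field, i.e. still needing review
needs_review <- catalogue[!complete.cases(catalogue), "name"]
```

Keeping unknowns as `NA` rather than guessing makes it easy to see which entries still need someone to check them.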

I can think of a few R packages and datasets that we could add right now:

  • eechidna, contains Australian election and census data along with shapefiles, which were downloaded from the ABS and cleaned up and collected together so that they are ready for analysis.
  • bomrang - fetch Australian Bureau of Meteorology data
  • GSODR Global summary daily weather data in R
  • ausmacrodata - facilitates the use of quantitative, publicly available Australian macroeconomic data. blogpost

Related to this, there was an R package developed to access data from data.gov.au - ozdata, which could be very useful in accessing the data.

@stevage, do you have any ideas of where we could start looking? Or thoughts on this topic?

HughParsonage commented Oct 16, 2017

I'd add my own packages to this list: (suggestions for API changes welcome)

The ABS contains some very rich data. However, their interface leaves a lot of room for improvement.
I have had some experience with accessing ABS APIs. While I have a lot of sympathy for the ABS, in practice I find it easier, faster, and less error-prone to just download the relevant Excel file and then type out the data into a tsv file manually than to use their API or automate any of the process. This is obviously a bit too laborious to do (and keep updated) for all of the ABS catalogue, though I would be willing to do it if I knew it would be widely used. ABS cooperation would make the process easier, but would not be absolutely necessary.

The ATO has tidier data and a more proactive approach to releasing data. Some quick enhancements to the taxstats package could include adding their Excel tables to the package, and possibly some contemplation of the 16% sample file.

Collaborator

njtierney commented Oct 16, 2017

Wow, that's great, thanks @HughParsonage ! 💯

An outcome of the ozunconf could be to improve upon existing package documentation, perhaps making a pull request to your packages with READMEs or vignettes with examples, and perhaps even package websites, to improve accessibility.

This issue might also fracture out into multiple projects, for example, one group working on documentation, another group working on searching and finding datasets and APIs, and another writing packages for existing data.

Really excited for this issue!

stephdesilva commented Oct 16, 2017

This would be a huge productivity benefit for many people.

On a related, but probably separate issue: has anybody been following the discussions around Indigenous Data Sovereignty? It seems to me that open source software/data and projects like this would interface very nicely and provide support for indigenous communities to control, preserve and generate data -> under the leadership and/or acquiescence of the indigenous community, obviously.

dfalster commented Oct 17, 2017

Great ideas re data.gov.au and ABS data (they were on my list to bring up).

In relation to the data from data.gov.au, it's also worth pointing to the great NationalMap portal developed by Data61 (formerly NICTA) for displaying spatial data. As far as I can understand, much of that data is all pulled from data.gov.au (see nationalmap.gov.au/help/data-catalogue.html). It's reasonably clean and has standardised formatting. The portal provides a nice way to visualise it, but we could consider what is needed to pull layers into R.

Collaborator

njtierney commented Oct 17, 2017

Another example, the Australian Road Deaths Database, contains monthly, quite clean data.

An example of a super brief analysis is here

It looks like there's a bunch of other interesting data, like the airport traffic data.

Collaborator

njtierney commented Oct 17, 2017

@dfalster good idea re the national map portal. It looks like another great trove of data. If there is a way to search it and then pull the layers/shapefiles into R, that would be a huge win. @mdsumner, you might be able to speak a bit to this.

mdsumner commented Oct 17, 2017

I love this topic, I have some explorations of TAS open data, cadastre, address, roads etc. Collating sources is a very good plan, I think the synching/reading is pretty well covered by other general tools, but I'm probably going to need to outline the bowerbird way-of-life to show why. (And maybe a good example for a shared vm to prepare...)

mdsumner commented Oct 17, 2017

It looks as though the portal is primarily WMS (images rendered) and CSV, which is not much good. From a quick scan it looks as though going to state-based opendata sources will be better, but happy to be shown otherwise.

Collaborator

njtierney commented Oct 18, 2017

Another resource that might be useful:

Collaborator

njtierney commented Oct 18, 2017

@mdsumner I'm keen to see the bowerbird way of life! It would be great if we can determine a way to get the shapefiles out from these sources, or even if we can point to where they are stored so we can access them.

Member

raymondben commented Oct 18, 2017

An example of throwing bowerbird at a data.gov.au dataset:

# Install bowerbird from GitHub (requires the devtools package)
devtools::install_github("AustralianAntarcticDivision/bowerbird")
library(bowerbird)

# Describe the data source: what it is, where it lives, and how to fetch it
my_source <- bb_source(
    name="Bike Paths - Greater Geelong",
    id="http://data.gov.au/dataset/7af9cf59-a4ea-47b2-8652-5e5eeed19611",
    description="Polyline data of bike path locations for the City of Greater Geelong.",
    reference="https://data.gov.au/dataset/geelong-bike-paths",
    citation="Not provided, see https://data.gov.au/dataset/geelong-bike-paths",
    source_url="https://data.gov.au/dataset/7af9cf59-a4ea-47b2-8652-5e5eeed19611",
    license="CC-BY",
    # Fetch with wget: follow links one level deep, keeping only URLs matching "download"
    method=quote(bb_handler_wget),
    method_flags=c("--recursive","--level",1,"--accept-regex=download","--adjust-extension"),
    # Unzip any downloaded archives
    postprocess=quote(bb_unzip),
    collection_size=0.002)

# Add the source to a configuration rooted at a local directory, then mirror it
cf <- bb_config("/temp/data") %>% bb_add(my_source)
bb_sync(cf)

This will create the local directory /temp/data/data.gov.au/dataset/7af9cf59-a4ea-47b2-8652-5e5eeed19611 and mirror the data files associated with that data.gov.au entry. Here there are two subdirectories, one with a kml and the other with a shapefile (unzipped for you, ready to use).

I'm assuming that roughly the same template (with different source_url and other dataset-specific details) would work with other data.gov.au datasets as well. Some of the entries are metadata that are intended for humans to refer to (description, reference, citation, license).

We like bowerbird because (a) it's recursive, so you generally only need to specify the top-level directory, even if the data set contains many files; and (b) it will do incremental updates, so you can run the sync process again later and it will only download what has changed. The ckanr package offers another way of interacting with data.gov.au, but for data retrieval (assuming you know which data sets you want) we find bowerbird to be easier.

jonocarroll commented Oct 18, 2017

FYI, ozdata (the data.gov.au part) never really got wrapped up because we hit a roadblock going down a path we probably didn't need to. I've since cleaned up the functionality and intend to have it in working order (if not on CRAN) before ozunconf17. Searching and mapping work fine now.

Collaborator

njtierney commented Oct 19, 2017

Another data source to potentially look at - Queensland police data: https://www.police.qld.gov.au/online/data/default.htm

Collaborator

njtierney commented Oct 19, 2017

@jonocarroll awesome! I think that for the scope of the ozunconf and to make it easier to maintain, it might be easiest to wrap up the data sources into individual packages and then get ozdata to import them?

ellisp commented Oct 22, 2017

Progress in this space looks both useful and achievable. Also, I'm officially wearing a Stats NZ hat (metaphorically) at this conference and there could be some useful suggestions / opportunities to feed back to New Zealand on this.

timchurches commented Oct 22, 2017

Extending and/or generalising the Census2016 packages by Hugh Parsonage at https://github.com/hughparsonage/Census2016 and https://github.com/hughparsonage/Census2016.Datapack would be great - right down to SA1 level. As Hugh notes, ABS still don't seem to understand that most researchers want clean raw data, not data facsimiles of nicely presented tables with subtotals and totals and weird partial aggregations littered through them...

It probably isn't necessary to include the raw data in such packages, because it is all freely available online, and thus can be downloaded on-demand by functions in the package. ABS are reasonably good at keeping data resources available at specific URLs, once published (but some maintenance is inevitable). It may even be possible to spider and parse the ABS web site pages to dynamically determine data download URLs, which would be more robust.
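A minimal sketch of the on-demand idea in base R, assuming a package function that downloads a file the first time it is requested and reuses the cached copy afterwards. The function name `fetch_on_demand`, the cache location, and the placeholder URL in the comment are all hypothetical and are not taken from any existing package.

```r
# Download a file only if it is not already cached locally.
# `url` is whatever stable download URL the data provider publishes.
fetch_on_demand <- function(url,
                            cache_dir = file.path(tempdir(), "data_cache"),
                            refresh = FALSE) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))
  if (refresh || !file.exists(dest)) {
    # Download to a temporary file first so a failed download
    # never leaves a partial file in the cache
    tmp <- tempfile(tmpdir = cache_dir)
    on.exit(unlink(tmp), add = TRUE)
    status <- utils::download.file(url, destfile = tmp, mode = "wb", quiet = TRUE)
    if (status != 0) stop("Download failed for ", url)
    file.rename(tmp, dest)
  }
  dest  # path to the local copy
}

# Example call (placeholder URL, not a real data file):
# path <- fetch_on_demand("https://example.org/some-table.xls")
```

A real package would probably want a persistent cache directory rather than `tempdir()`, plus some handling for URLs that move, which is the maintenance burden mentioned above.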

Member

raymondben commented Oct 23, 2017

Assorted comments (sorry for a very bowerbird focus)!

  • having had a brief rummage around in ozdata's code I am not sure that bowerbird offers a lot here - there is already functionality to do the downloading and I don't think that bowerbird would add much to this

  • however, for data that are not in data.gov.au, bowerbird might be worth considering. There is an eechidna-style election data example in the bowerbird readme. The Queensland police data mentioned above should be fairly straightforward too. This sort of approach would follow @timchurches's on-demand suggestion above.

  • @jonocarroll if you are working on ozdata, maybe worth thinking about propagating "citation" info through to the user? That is, data sets that are released under a CC-BY license should have an associated citation that users are obliged to cite when using the data. Making this information easily accessible to the user would be helpful, BUT I am not sure that citation info is part of the standard CKAN schema. I know that some data sets provide a specific "here's how to cite this data set" entry, but maybe this is not consistent enough to include in ozdata in a general manner?

  • @timchurches - spidering is basically what bowerbird does. Maybe useful in that context.

stevage commented Oct 23, 2017

@dfalster As far as I can understand, much of that data is all pulled from data.gov.au

NationalMap uses the CKAN API to list datasets in data.*.gov.au, but also has lots of other sources of data - manually listed datasets, the ABS SDMX API, various Esri services etc. Most of the "national datasets" are hand-curated.

stevage commented Oct 23, 2017

Hey everyone, and sorry to chime in so late. (Had a very full-on last week). I used to work on NationalMap, and have been pretty active around the open data space in Australia for a few years, working with many government bodies at different levels. (I'm generally working on data that is "useful but boring", rather than ripe for statistical analysis, machine learning etc... however).

But lots of the aggregators and links in the above thread are new to me - that's awesome.

Just to add to a list, I've been working on http://opencouncildata.github.io/Platform, which is another approach to aggregating data - it focuses on data that meet the Open Council Data Standards. The main relevance would be some of the aggregated datasets, like the 500,000 odd trees that power opentrees.org, that might be of interest.

There is also Magda, which I think is meant to eventually replace CKAN as the registry for data.gov.au. It was just being started around the time I left Data61, so I don't know much about it.

Finally, one more interesting dataset you may like is http://github.com/stevage/BikeTrafficCounts, which is - well, read the README.

A dream I've had for a while is to map out the whole potential open data universe as some kind of grid, and start filling in the boxes, based on whether data exists and is public, exists but is not public, or does not exist. That is, instead of starting, like most catalogues, from the question of "what is available" and trying to organise those into some useful structure, I'd like to start from the question of "what do people want", and provide definitive answers like "that is not available". It should be possible to start at some high level like "water", and subdivide that domain into "freshwater > river levels > Yarra River > ..." But I'm a bit scared of the ontological work required to make that meaningful :)

(I do suspect that that approach, where you map out the domain, and draw attention to blanks, will yield useful pressure - much in the way that map.opencouncildata.org has been surprisingly useful at encouraging councils to join the open data movement.)
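One way such a grid could be represented, as a rough sketch in base R: each box is a path through the domain hierarchy plus one of three statuses. The status labels and the two "example domain" rows are placeholders invented for illustration; only the Yarra River path comes from the description above, and its status is left as `NA` because nobody here has assessed it.

```r
# The three possible answers for each box in the grid
status_levels <- c("public", "exists_not_public", "does_not_exist")

universe <- data.frame(
  path = c(
    "water > freshwater > river levels > Yarra River",  # path from the comment above
    "example domain > example topic A",                 # placeholder row
    "example domain > example topic B"                  # placeholder row
  ),
  status = factor(c(NA, "public", "does_not_exist"), levels = status_levels),
  stringsAsFactors = FALSE
)

# Top-level domain of each path, so the grid can be grouped or filtered by domain
universe$domain <- vapply(
  strsplit(universe$path, " > ", fixed = TRUE), `[`, character(1), 1
)

# Count boxes per status, keeping unassessed (NA) boxes visible
status_counts <- table(universe$status, useNA = "ifany")

# The "blanks" worth drawing attention to: anything not confirmed public
blanks <- universe[is.na(universe$status) | universe$status != "public", ]
```

This sidesteps the ontology question entirely by treating the hierarchy as free-text paths, which would probably need replacing with an agreed vocabulary before it became useful.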

Anyway, I'm really looking forward to supporting whatever project people want to work on, however best. (Caveat: I don't know R :) )

stephdesilva commented Oct 23, 2017

Steve I think that's a really useful approach - a resource where people can see what's available, what could be for the asking/pressing and what's not would be useful across all sorts of domains.

stevage commented Oct 23, 2017

(I should mention that there is the open data census but it's really about scoring organisations on a very small number of datasets rather than actually facilitating access to data.)

katerobsau commented Oct 24, 2017

In terms of Aussie data I have been curious about Australian real estate prices, e.g. sold, rental etc. I think there is definitely some interesting data mining and analysis that could be done there. @HughParsonage I see you've got a package for NSW property prices. Is this something that would be worth generalising to other parts of Aus, like Vic or Qld?

Also @stevage I like the idea of being able to look at what datasets are available for a given gridded location - so often we search by data type, not the other way around.

HughParsonage commented Oct 25, 2017

@SAUNDERSK1 While it would be certainly worth generalizing, I'm not aware that the other state governments have released such data.
