Identify the core datasets that we want to include in the inventory #1

Closed
waldoj opened this Issue Jan 7, 2015 · 15 comments

waldoj commented Jan 7, 2015

No description provided.

waldoj Jan 12, 2015

Contributor

At U.S. Open Data, we've identified five core datasets that we advise every state government to publish:

  • registered corporations
  • legislation and legislators
  • laws and regulations
  • address points
  • campaign finance

Although each of these datasets is of varying value on its own, what makes them valuable is that they make it possible to connect other datasets together, substantially increasing their value. For instance, without a list of registered corporations (and data about each of those corporations), it's impossible to know that a campaign contribution by Acme, LLC actually came from John Smith of Springfield (who owns 100% of the shares in Acme). And without data about every law, it's impossible to know what it means that a court ruling struck down §1.23-45. And so on. See the linked U.S. Open Data page for more about this.

So, those are five recommendations for datasets to include in this census.

waldoj Feb 3, 2015

Contributor

I don't know what to do with transportation (transit and otherwise) data. Some states simply have no public transit (beyond a couple of cities with buses), and most states have nothing in the way of commuter rail. Those states that do have such services generally run them at a municipal level, rather than as a state program (which is to say that it's out of states' hands as to whether that data exists). There are a lot of transportation datasets that are important, so I think it's going to be complicated to figure out which ones to look for, how to score, etc.

Solvable, but a challenge.

waldoj Feb 3, 2015

Contributor

I'm interested in nuts-and-bolts data, like the state budget, agency checkbooks, the org chart of all agencies and employees, their blue book, all boards and their members, FOIA requests/responses, all state websites, attorney general opinions, etc. I don't think that any of these warrant being their own category (I could be persuaded otherwise re: budget/checkbook), but I think there's sense in having a category that lumps all of these together. A lot of work with open data is impossible or impractical without having this kind of nuts-and-bolts data to connect it to.

emily878 Feb 6, 2015

Contributor

States are a lot closer to countries than cities are, so we could use more of the G8 National Action Plan definition of "high value datasets":

| Data Category | Example datasets |
| --- | --- |
| Companies | Company/business register |
| Crime and Justice | Crime statistics, safety |
| Earth observation | Meteorological/weather, agriculture, forestry, fishing, and hunting |
| Education | List of schools; performance of schools, digital skills |
| Energy and Environment | Pollution levels, energy consumption |
| Finance and contracts | Transaction spend, contracts let, call for tender, future tenders, local budget, national budget (planned and spent) |
| Geospatial | Topography, postcodes, national maps, local maps |
| Global Development | Aid, food security, extractives, land |
| Government Accountability and Democracy | Government contact points, election results, legislation and statutes, salaries (pay scales), hospitality/gifts |
| Health | Prescription data, performance data |
| Science and Research | Genome data, research and educational activity, experiment results |
| Statistics | National Statistics, Census, infrastructure, wealth, skills |
| Social mobility and welfare | Housing, health insurance and unemployment benefits |
| Transport and Infrastructure | Public transport timetables, access points broadband penetration |

emily878 Feb 6, 2015

Contributor

If we went with the full G8 list, then we could do a test with a data-friendly state to find out what kind of specific datasets we could, under ideal conditions, currently expect to get from these categories.

waldoj Feb 6, 2015

Contributor

I hadn't realized that the core state-level datasets that I've identified are all found within the G8 list, at least as subsets of them. 👍

rebeccawilliams Feb 13, 2015

Yes, a State Census!

Looking at this list, I think you'll need a ton of experts and/or 14 different Censuses. Modifying the Census for nesting I think would be useful overall and is something @ondrae and I had discussed back in the day, maybe there could be a joint effort? Or perhaps the Local Censuses could get the upgraded Global Open Data Index design treatment.

In addition to the G8 Categories, were other categories explored? Can data that sits squarely in between the US City Open Data Census and Global Open Data Index datasets be prioritized for review?

For Companies, Open Corporates already collects a lot of state info, can new information be suggested to them rather than having a separate place for review?

I think you two both have these notes, but if helpful, some of this would apply to states as well:

waldoj Feb 18, 2015

Contributor

> Modifying the Census for nesting I think would be useful overall and is something @ondrae and I had discussed back in the day, maybe there could be a joint effort?

That's out of the scope of what we're doing in this initial effort but, yeah, that sure would be helpful. I've been frustrated by the limitations of Open Data Census, but that's probably just because I'm trying to do things with it that it wasn't meant for. :) At this point, I think we're going to have to try to muddle through with the existing architecture.

> In addition to the G8 Categories, were other categories explored?

Just what you see above in this thread.

> Can data that sits squarely in between the US City Open Data Census and Global Open Data Index datasets be prioritized for review?

I'm not sure what you mean—could you explain?

> For Companies, Open Corporates already collects a lot of state info, can new information be suggested to them rather than having a separate place for review?

It'd be impossible for us to provide comprehensive scores for each state without including business data. (We can't just say "see this other ranking.") But it seems to me that we'd want to just incorporate Open Corporates' metrics, rather than trying to do our own thing there.

dsmorgan77 Feb 18, 2015

Doing my part to be an expert in transportation and infrastructure here...the issue you raise about "public transport timetables" is interesting. The thing is, in the United States, transit is very rarely a state government matter. According to the National Transit Database (http://www.ntdprogram.gov/ntdprogram/datbase/2013_database/2013%20Agency%20Information.xls), only 20 of the 857 reporting transit agencies are state governments.

The top two types of transit agencies are city/county/local government (430) and independent authority (255). The numbers drop off really quickly after that: non-profit corporations (32), metropolitan planning organizations & councils of government (30), private for-profit corporations (25), and state governments (20).

Given this, I would recommend against including transit in a state-level census.
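
To put the figures quoted above in proportion, here is a small Python sketch that converts the counts into shares of the 857 reporting agencies. The counts are copied from this comment; the names `TOTAL_REPORTING`, `AGENCY_COUNTS`, `share`, and `summarize` are illustrative only. The "other" row is simply what remains after subtracting the six listed types from the total, not a figure taken from the database.

```python
# Agency counts by organization type, as quoted in the comment above.
TOTAL_REPORTING = 857
AGENCY_COUNTS = {
    "city/county/local government": 430,
    "independent authority": 255,
    "non-profit corporation": 32,
    "MPO / council of governments": 30,
    "private for-profit corporation": 25,
    "state government": 20,
}


def share(count, total=TOTAL_REPORTING):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100.0 * count / total, 1)


def summarize(counts=AGENCY_COUNTS, total=TOTAL_REPORTING):
    """Return (type, count, percent) rows, plus a row for unlisted types."""
    rows = [(name, n, share(n, total)) for name, n in counts.items()]
    remainder = total - sum(counts.values())
    # Assumption: agencies not covered by the six listed types are lumped
    # together here; the comment does not break them down.
    rows.append(("other (not listed above)", remainder, share(remainder, total)))
    return rows


if __name__ == "__main__":
    for name, n, pct in summarize():
        print(f"{name:35s} {n:4d} {pct:5.1f}%")
```

On these numbers, state governments account for roughly 2% of reporting agencies, which is the basis for the recommendation above.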

waldoj Feb 18, 2015

Contributor

👍 That was my gut feeling on transit, but having hard numbers is wonderful. Thank you!

rebeccawilliams Feb 18, 2015

Can data that sits squarely in between the US City Open Data Census and Global Open Data Index datasets be prioritized for review?

I'm not sure what you mean—could you explain?

With this list, there are 134 discrete types of data you are looking to evaluate (and a lot of those could be broken down further, e.g. weather). I am assuming that these won't happen in the same pass. With that in mind, I think it'd be really useful to prioritize the areas that were covered by the Global Open Data Index and the US City Open Data Census because perhaps we'd see whole verticals of open data emerge.

I made a table, with total overlap in bold and some overlap in italics:

| US Index | US State Census Possibilities | US City Census |
| --- | --- | --- |
| | Government Accountability and Democracy: asset disclosure | Asset Disclosure |
| Government Budget | Finance and contracts: add budget to the list | Budget |
| Company Register | Companies / see Open Corporates. | Business Listings |
| | Government Accountability and Democracy: campaign finance / see Open Secrets | Campaign Finance Contributions |
| | | Code Enforcement Violations |
| | | Construction Permits |
| | Crime and Justice: crime statistics | Crime |
| | Earth observation | |
| | Economic Development | |
| | Education | |
| Election Results | Government Accountability and Democracy: election results, see Open Elections | |
| | Health | |
| Legislation | Government Accountability and Democracy: legislation / see Open States | |
| | Government Accountability and Democracy: lobbyist activity | Lobbyist Activity |
| National Map | Geospatial: maps you list | |
| National Statistics | Statistics | |
| | Geospatial: parcel maps | Parcels (shapefiles) |
| Pollutant Emissions | Energy and Environment: air quality? | |
| Postcodes / Zipcodes | Geospatial: postcodes | |
| | Finance and contracts: contracts | Procurement Contracts |
| | | Property Assessment |
| | | Property Deeds |
| | | Public Buildings |
| | | Restaurant Inspections |
| | Science and Research | |
| | | Service Requests (311) |
| | Social mobility and welfare | |
| Government Spending | Finance and contracts: expenditures / see OSPIRG | Spending |
| Transport Timetables | Transport and Infrastructure (though this gets regional fast) | Transit |
| | | Zoning (shapefiles) |
| | | Web Analytics |

So support/coordinate with:

  • Open Corporates in company data assessment
  • Open Elections in election result data assessment
  • Open Secrets in state campaign finance data assessment (is this complete?) @boblannon?
  • Open States in legislative data assessment (would anything be additive here?)
  • OSPIRG in spending data/checkbook assessment

Honestly, after writing those all down, creating a site that points to all of those open state data efforts would be useful in and of itself I think as a form of Open Everything Advocacy.

Based on overlap, it'd be great to prioritize:

  • open budget data, under Finance and contracts; it's not currently listed? Do most states have it?
  • Transport and Infrastructure -- @dsmorgan77 is right, though I worry the regional authorities will never be assessed FWIW

Then:

  • Crime and Justice: crime statistics
  • Energy and Environment: air quality
  • Finance and contracts:
    • contracts
    • procurement processes
  • Geospatial [w/ NSGIC and/or @sbma44]:
    • parcels
    • post codes
    • etc
  • Government Accountability and Democracy:
    • asset disclosure
    • lobbyist activity
  • Statistics - though I'm not sure this one is clear

And very last, not because they aren't important, but because there is less open data assessment overlap currently, these categories:

  • Earth observation
  • Economic Development
  • Education
  • Health
  • Science and Research
  • Social mobility and welfare

dsmorgan77 Feb 18, 2015

We should talk to NSGIC about overlap between some of the transport datasets we have and their efforts with Transportation for the Nation (http://www.nsgic.org/transportation-for-the-nation). Some, but not all, of the transportation data sets in this Census are geospatial and should rightly be expected to be published via State geospatial portals.

The work we've done to get full roadway centerline information from the states via the Highway Performance Monitoring System has been pretty helpful here (@sbma44 should know, cuz MapBox has written about it).

As for partners in getting the transportation bits of the Census done, I'd recommend AASHTO (http://www.transportation.org/Pages/Default.aspx), the association of all the State DOTs. They have a bunch of committees, like GIS-T (http://www.gis-t.org/) and I know NSGIC coordinates with them.

As for regional (transit) authorities, this is super complicated. There is plenty of regional transit that crosses state boundaries (and here I'm limiting the discussion to fixed-route service ... human services transportation, especially on-demand services, does cross state boundaries). Obviously, we know WMATA/VRE, but there's also PATCO/SEPTA (greater Philly; PA/NJ), PATH (greater NYC; NY/NJ), MBTA (greater Boston, including bits of NH, RI), etc.

Then there are independent transit agencies that don't cross state lines, and each of their charters is different. For instance, the Chicago Transit Authority was created by Illinois state law. Similarly, Atlanta's MARTA was created by state legislation and approved by the counties that were impacted. For situations where an independent transit authority was established by State law, I could see reasons for including them in the Census. Nevertheless, a State may have multiple transit authorities, and States don't always have to pass laws to form independent authorities: Sound Transit in Seattle was formed by the Snohomish, King, and Pierce County Councils.

All that to say this: there's no one-size-fits-all approach to this issue, but it's clearly not just a State census issue. We could have a lot of fun investigating the various quasi-governmental entities, but I think that's an entirely different Census and set of issues.

rebeccawilliams Feb 18, 2015

Thanks @dsmorgan77! Maybe @kpwebb & friends will have the motivation for a transit deep dive that addresses all the forms of authority some day. Some folks assessed some regional transit on the US City Census, but, yes, it was all over the place, and the authority matters for advocacy.

waldoj Feb 18, 2015

Contributor

> Honestly, after writing those all down, creating a site that points to all of those open state data efforts would be useful in and of itself I think as a form of Open Everything Advocacy.

I think you're right. It does look awfully useful. :) I imagine that was a lot of work!

We're definitely going for an iterative process here. That is, our aspirations are high, we'll start with something meh, and we'll improve from there. So maybe, at first, we identify just one core dataset for each area—the one dataset that defines it. Or maybe we change the scoring criteria, so that we instead score on the basis of whether a series of datasets exist. I don't know. But it seems best to start with something limited, and then work up to something complete!
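
As a rough illustration of the second option (scoring on whether a series of datasets exists), here is a minimal Python sketch. Everything in it is hypothetical: the category groupings borrow dataset names mentioned in this thread, the `published` set is made up, and this is not how Open Data Census actually scores anything.

```python
# Hypothetical category definitions; dataset names are borrowed from this thread.
CATEGORIES = {
    "Finance and contracts": ["budget", "contracts", "expenditures"],
    "Government Accountability and Democracy": [
        "asset disclosure",
        "campaign finance",
        "election results",
        "legislation",
        "lobbyist activity",
    ],
}


def category_score(expected, published):
    """Fraction of the expected datasets that appear in the published set."""
    if not expected:
        return 0.0
    found = sum(1 for name in expected if name in published)
    return found / len(expected)


def score_state(published, categories=CATEGORIES):
    """Score every category for one state, given the datasets it publishes."""
    return {
        category: category_score(names, published)
        for category, names in categories.items()
    }


if __name__ == "__main__":
    # Made-up example: a state that publishes three of the eight datasets.
    example_published = {"budget", "expenditures", "legislation"}
    for category, score in score_state(example_published).items():
        print(f"{category}: {score:.0%}")
```

Starting with one defining dataset per area, as suggested above, would just mean each list has a single entry, so the same function covers both approaches.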

waldoj Mar 12, 2015

Contributor

We've broken this up in a bunch of different issues now—closing as a duplicate.
