Data discussion #1

Closed
tanmaysharma19 opened this issue Jan 11, 2021 · 8 comments

@tanmaysharma19
Collaborator

Discussing ideas for finalizing the dataset for the dashboard.

@tanmaysharma19 tanmaysharma19 linked a pull request Jan 12, 2021 that will close this issue
@tanmaysharma19 tanmaysharma19 removed a link to a pull request Jan 12, 2021
@rtaph
Collaborator

rtaph commented Jan 12, 2021

Cancer dataset:

  • What I like:

    • Many quantitative variables. This opens up potential for interesting graphs.
    • Few variables have missing data
  • Limitations (what I like less):

    • There only seems to be one categorical dimension (the Geography variable). If we want to build in dimensions as disaggregations/filters this might be a bit limiting. Of course, we could bin some categories if we wanted.
    • The data is at the level of the county, not the individual (it is not microdata). This might make it a bit tougher to disaggregate and uncover patterns, since it restricts us to inferences at the level of the county and above. Many variables are themselves semi-aggregated (e.g. pctbachdeg18_24), which will make the data hard to reshape because we cannot cross those columns. We can probably only analyse the data marginally, rather than drill down by multiple variables.
    • Even if we can aggregate and slice the data above the county level, there is an additional complication: we would likely need to weight every single measure by the population size of the county.
    • There are no time-series to plot. This is not mission critical but would be nice to have.
    • This is a combined sample. This means that some variables were likely clustered in the sampling design itself, which might leave a lot of holes in our data if we try to do geographic mapping at a certain level.
  • Potential uses / personas:

    • Clinical researcher / scientific audience
    • Policy planner (municipal gov’t)
    • Physician

See auto-generated data profile report.
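
To make the weighting point above concrete, here is a minimal sketch of a population-weighted mean over county rows. The keys `population` and `rate` and the two counties are hypothetical, made up for illustration; they are not taken from the cancer dataset.

```python
from typing import Iterable, Mapping


def weighted_mean(rows: Iterable[Mapping], value_key: str, weight_key: str) -> float:
    """Mean of `value_key` across rows, weighting each row by `weight_key`."""
    weighted_sum = 0.0
    total_weight = 0.0
    for row in rows:
        weighted_sum += row[value_key] * row[weight_key]
        total_weight += row[weight_key]
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return weighted_sum / total_weight


# Hypothetical counties; the numbers are invented for illustration only.
counties = [
    {"county": "A", "population": 1000, "rate": 10.0},
    {"county": "B", "population": 9000, "rate": 20.0},
]

# A plain average treats both counties as equally important.
unweighted = sum(c["rate"] for c in counties) / len(counties)
# The weighted average lets the larger county dominate, as it should.
weighted = weighted_mean(counties, "rate", "population")
print(unweighted, weighted)
```

With one small and one large county, the two averages differ noticeably, which is why any state-level or national roll-up on the dashboard would probably need the weighted version.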

@rtaph
Collaborator

rtaph commented Jan 12, 2021

NYC agency performance indicators from the FY20 Mayor's Management Report:

  • What I like:

    • It's a collection of KPIs, which is something that people naturally create dashboards for.
    • It has time depth (FY16-FY20, probably more years in older datasets), so it can be visualized as several series.
    • It has target data. We could make something similar to this scorecard I made (unrelated), or use bullet charts.
    • Appears quite complete and clean.
    • There probably is a written report somewhere where we can get a lot more information about the data, and validate that our summaries match.
  • Limitations (what I like less):

    • Other than the year, there are not many disaggregating variables. Maybe we need/want more than what exists in the data? We might be able to augment the data ourselves by, say, classifying KPIs into themes.
    • It could also be overwhelming if we have too many KPIs.
    • It might be hard to determine which indicators can safely be summed from year to year (vs. being cumulative or needing distinct counts).
    • The value variable is represented in different ways for different KPIs (e.g. 72 calls, 0:11 average wait time, $12,300, "↓" target, 99.2%).
    • KPIs have subsetted sections making it a bit tricky to work with. E.g. "– Robbery" is a sub-bullet KPI of "Major felony crime"
  • Potential uses / personas:

    • Mayor of NYC
    • Residents of NYC (taxpayer), accountability dashboard
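
On the mixed value formats noted in the limitations, one possible approach is to normalize each raw cell into a number plus a unit before plotting. The sketch below is only an illustration: the function name `parse_kpi_value`, the unit labels, and the reading of `0:11` as minutes:seconds are all assumptions, not something confirmed by the dataset.

```python
import re
from typing import Optional, Tuple


def _to_float(text: str) -> Optional[float]:
    """Convert text such as '12,300' to a float, or return None on failure."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def parse_kpi_value(raw: str) -> Tuple[Optional[float], str]:
    """Return (number, unit) for one raw KPI cell; number is None if unparseable."""
    text = raw.strip()
    # Direction-only targets such as "↓" carry no number.
    if text in {"↓", "↑"}:
        return None, "direction"
    # Durations, ASSUMED to be minutes:seconds, e.g. "0:11" becomes 11 seconds.
    match = re.fullmatch(r"(\d+):(\d{2})", text)
    if match:
        minutes, seconds = int(match.group(1)), int(match.group(2))
        return float(minutes * 60 + seconds), "seconds"
    # Percentages, e.g. "99.2%".
    if text.endswith("%"):
        return _to_float(text[:-1]), "percent"
    # Dollar amounts, e.g. "$12,300".
    if text.startswith("$"):
        return _to_float(text[1:]), "dollars"
    # Anything else is tried as a plain count, e.g. "72".
    number = _to_float(text)
    return number, "count" if number is not None else "unknown"


for cell in ["72", "0:11", "$12,300", "↓", "99.2%"]:
    print(cell, parse_kpi_value(cell))
```

Keeping the unit alongside the number would let the dashboard choose an axis format per KPI instead of forcing everything onto one scale.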

@rtaph rtaph moved this from To do to In progress in Group1-dashboard Jan 12, 2021
@rtaph
Collaborator

rtaph commented Jan 12, 2021

OECD Business Tendency Data:

  • What I like:

    • OECD data is generally very complete and reliable, especially on economics.
    • Monthly data for all OECD countries going back a decade
    • Many economic metrics
    • Can disaggregate/filter by countries (or regions) and industry
  • Limitations (what I like less):

    • Would be nicer to have more disaggregating variables
    • I don't think we can access micro-data.
    • Likely requires weights for aggregation.
    • Metrics are all on a relative scale. Might make it a little less intuitive for the average person.
  • Potential uses / personas:

    • Economists (which metrics are trending up or down)
    • Public servants (planning agencies)
  • Related ideas:

    • Compare these series with Coronavirus time series to see the impact the pandemic has had on numerous economic measures

@dusty736
Collaborator

World Happiness Report

  • What I like:
    • Complete Dataset
    • Contains spatiotemporal features
    • All numeric features
    • Clear features to filter on (freedom level, life expectancy, generosity, etc)
  • Limitations:
    • Only 9 common features across data
    • Some years have different features
  • Potential Uses:
    • Travel companies (moving recommendations)
    • Public servants (seeing what makes people happy)
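
Since some years have different features, a quick way to see what can be compared across years is to intersect the column names per year. This is a rough sketch; the years and column names below are hypothetical placeholders, not the actual World Happiness Report headers.

```python
from typing import Dict, List


def common_columns(columns_by_year: Dict[int, List[str]]) -> List[str]:
    """Columns present in every year, sorted alphabetically."""
    column_sets = [set(cols) for cols in columns_by_year.values()]
    if not column_sets:
        return []
    return sorted(set.intersection(*column_sets))


def missing_columns(columns_by_year: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """For each year, the columns that appear in some other year but not this one."""
    all_columns = set().union(*columns_by_year.values())
    return {
        year: sorted(all_columns - set(cols))
        for year, cols in columns_by_year.items()
    }


# Hypothetical headers for two yearly files, for illustration only.
columns_by_year = {
    2015: ["country", "score", "freedom", "generosity", "region"],
    2016: ["country", "score", "freedom", "generosity", "trust"],
}

print(common_columns(columns_by_year))
print(missing_columns(columns_by_year))
```

The common set is what a multi-year chart could safely use; the per-year gaps show which features would need to be dropped or imputed.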

@jraza19
Collaborator

jraza19 commented Jan 12, 2021

Obesity dataset

  • What I like:

    • simple dataset with time as a variable
    • imagine a nice interactive map that could be created
    • data was pre-cleaned
  • Limitations

    • Only 3 variables
  • Personas/usage

    • international government agencies
    • non profit organizations
    • dieticians/public health professionals

@tanmaysharma19
Collaborator Author

COVID data

  • What I like:

    • contemporary dataset
    • 55 variables
    • time series data
    • data for all countries
    • can subset data easily across time, countries and attributes
  • Limitations

    • some missing data from first half of 2020
  • Personas/usage

    • covid researchers
    • public health agencies
    • government agencies

@rtaph
Collaborator

rtaph commented Jan 13, 2021

Desiderata:
• Micro-data (each row represents one unit, without aggregation)
• 5+ categorical dimensions for filtering/disaggregating
• 5+ numeric measures
• Geographic variables (ideally hierarchical, e.g. municipality -> province -> country).
• time-series data
• little to no missing data
• no need for weighting
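
Some of these desiderata can be checked mechanically on a candidate dataset. The sketch below covers only the column counts and the missing-data criterion, on rows held as plain dictionaries. The function names and the 5% missing-data threshold are assumptions chosen for illustration; the list above only says "little to no missing data".

```python
from numbers import Number
from typing import Dict, List, Mapping


def profile_columns(rows: List[Mapping]) -> Dict[str, dict]:
    """Classify each column as numeric/categorical and compute its missing share."""
    columns = sorted({key for row in rows for key in row})
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None and v != ""]
        if not present:
            kind = "empty"
        elif all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
            kind = "numeric"
        else:
            kind = "categorical"
        profile[col] = {
            "kind": kind,
            "missing_share": 1 - len(present) / len(values),
        }
    return profile


def meets_desiderata(rows, min_categorical=5, min_numeric=5, max_missing_share=0.05):
    """Check column counts and missingness; the 0.05 threshold is an assumed default."""
    profile = profile_columns(rows)
    kinds = [info["kind"] for info in profile.values()]
    return {
        "categorical_ok": kinds.count("categorical") >= min_categorical,
        "numeric_ok": kinds.count("numeric") >= min_numeric,
        "missing_ok": all(
            info["missing_share"] <= max_missing_share for info in profile.values()
        ),
    }


# Tiny hypothetical sample, for illustration only.
sample = [
    {"country": "X", "year": 2000, "rate": 1.5},
    {"country": "Y", "year": 2001, "rate": None},
]
print(meets_desiderata(sample, min_categorical=1, min_numeric=2))
```

Criteria such as "micro-data" or "no need for weighting" depend on how the data was collected, so they would still need a manual read of the documentation.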

@rtaph
Collaborator

rtaph commented Jan 16, 2021

Closing this issue out. The team decided to go with the obesity data during a team meeting.

@rtaph rtaph closed this as completed Jan 16, 2021
Group1-dashboard automation moved this from In progress to Done Jan 16, 2021

4 participants