Data Story: Generate datasets from Wikipedia dumps #11

Open
4 of 11 tasks
feep opened this issue Apr 7, 2020 · 2 comments

@feep
Contributor

feep commented Apr 7, 2020

Content

  • Parse out all tables
  • Trim garbage tables
  • Dataset/tables: table title, row count, column count, url, page title, hit_count, id, revision.timestamp
  • Multiple datasets: generate datasets for top tables by hits over XX rows
  • Multiple datasets: generate datasets for hand-picked useful tables
  • Transform script to generate page from raw mediawiki markup or single page html
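
A minimal sketch of the "parse out all tables" and "trim garbage tables" steps, assuming the input is raw MediaWiki markup for a single page. The names `TableSummary`, `summarize_tables`, and `is_useful`, and the `min_rows`/`min_cols` thresholds, are hypothetical and not taken from this project. It only fills in the table title, row count, and column count fields, and it does not handle nested tables or templates.

```python
import re
from dataclasses import dataclass


@dataclass
class TableSummary:
    title: str          # text of the |+ caption line, or "" if none
    row_count: int      # rows that contain at least one cell (header rows included)
    column_count: int   # widest row, counted in cells


def summarize_tables(wikitext):
    """Return one TableSummary per top-level {| ... |} table in the markup."""
    summaries = []
    in_table = False
    title = ""
    rows = []      # number of cells in each finished row
    current = 0    # number of cells seen so far in the current row
    for raw in wikitext.splitlines():
        line = raw.strip()
        if line.startswith("{|"):
            # Table start. A nested table would reset the state here,
            # which is a known limitation of this sketch.
            in_table, title, rows, current = True, "", [], 0
        elif not in_table:
            continue
        elif line.startswith("|}"):
            if current:
                rows.append(current)
            summaries.append(TableSummary(title, len(rows), max(rows, default=0)))
            in_table = False
        elif line.startswith("|+"):
            title = line[2:].strip()
        elif line.startswith("|-"):
            # Row separator: close the current row if it had any cells.
            if current:
                rows.append(current)
            current = 0
        elif line.startswith("|") or line.startswith("!"):
            # Cells on one line are separated by || (data) or !! (header).
            current += len(re.split(r"\|\||!!", line[1:]))
    return summaries


def is_useful(summary, min_rows=2, min_cols=2):
    """Hypothetical garbage filter: drop tables that are too small to matter."""
    return summary.row_count >= min_rows and summary.column_count >= min_cols
```

For a table with a `|+ Films` caption, one header row, and two data rows of two cells each, `summarize_tables` returns a single summary with title `Films`, 3 rows, and 2 columns. The remaining fields in the list above (url, page title, hit_count, id, revision.timestamp) would come from the dump's page metadata and the pageviews dataset, not from the table markup.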

Story? The tables available and how they were generated? A histogram of table sizes?

Probably out of scope

  • Possible longitudinal tables, e.g. the Directed by: field from the infobox of every (film) page

Pageviews

These are available as an API, but I’m not going to hit the API for every page on Wikipedia. They need to be available as a dataset.

  • Aggregate and subsample
  • Trim garbage pages (Special:, Main_Page...)
  • Script to generate
  • Dataset for relative hits for all pages, month of 202003.
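
A minimal sketch of the aggregate and trim steps, assuming the pageview records have already been read into `(page_title, count)` pairs and that "relative hits" means each page's share of all remaining hits. Both are assumptions, as are the names `is_garbage` and `relative_hits` and the contents of the exclusion lists. Subsampling is left out.

```python
from collections import Counter

# Assumed exclusion rules, based on the examples in the list above.
EXCLUDED_TITLES = {"Main_Page"}
EXCLUDED_PREFIXES = ("Special:",)


def is_garbage(title):
    """True for pages that should not appear in the hit_count dataset."""
    return title in EXCLUDED_TITLES or title.startswith(EXCLUDED_PREFIXES)


def relative_hits(records):
    """Sum counts per page, drop garbage pages, and normalise to shares.

    `records` is any iterable of (page_title, count) pairs, for example
    one pair per page per hour over the month being aggregated.
    """
    totals = Counter()
    for title, count in records:
        if not is_garbage(title):
            totals[title] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {title: hits / grand_total for title, hits in totals.items()}
```

Because the function accepts any iterable, the records can be streamed from disk one file at a time, so only the per-page totals need to fit in memory.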

No story at this time. The data is only used for the hit_count column in the third checkbox under Content.

@feep
Contributor Author

feep commented Apr 22, 2020

@rgardaphe

Testing your @ settings. You get this?

@rgardaphe
Member

rgardaphe commented Apr 22, 2020 via email

@feep feep added this to To do in Sprint I Apr 30, 2020
@chriswhong chriswhong added this to To Do in High Value Datasets May 5, 2020
@chriswhong chriswhong moved this from To Do to In Progress in High Value Datasets May 20, 2020