Look into automation of SQL insertion code generation from country breakdown pages #22

Closed
vipulnaik opened this issue May 4, 2019 · 7 comments

Owner

@vipulnaik vipulnaik commented May 4, 2019

The idea is to read from pages like https://stats.wikimedia.org/archive/squid_reports/2018-01/SquidReportPageViewsPerCountryBreakdownHuge.htm and output SQL insert statements like those in https://github.com/vipulnaik/wikipediaviews/blob/master/sql/country-language-data.sql

Collaborator

@riceissa riceissa commented May 8, 2019

Work in progress: https://github.com/riceissa/wikipediaviews-country-breakdown (the basic scraping for each page is done, but I still need to loop through all the months and print the SQL).

I ran into some difficulties with strange-looking Wikipedia names, specifically "Other", "Portal", "m Wp", "Commons Wp", "wwwhttp Wp", "zero Wp", "plhttp Wp", "enhttp Wp", "eshttp Wp". Right now I'm just ignoring these, but maybe you'd like to do something else.
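
A minimal sketch of the "just ignore these" approach, for illustration only. The ignore list is copied from the names above; the `(name, views)` row format, the function name, and the sample rows are assumptions, not the scraper's actual data model.

```python
# Names reported in this thread as not matching a real Wikipedia.
IGNORED_NAMES = {
    "Other", "Portal", "m Wp", "Commons Wp", "wwwhttp Wp",
    "zero Wp", "plhttp Wp", "enhttp Wp", "eshttp Wp",
}


def keep_real_wikipedias(rows):
    """Drop rows whose project name is in IGNORED_NAMES.

    `rows` is assumed to be an iterable of (name, views) pairs;
    the real scraper's row format may differ.
    """
    return [(name, views) for name, views in rows if name not in IGNORED_NAMES]


# Made-up sample rows, not real view counts.
sample = [("En Wp", 100), ("Portal", 5), ("zero Wp", 2), ("Zh Wp", 40)]
filtered = keep_real_wikipedias(sample)
```

Keeping the names in one set makes it easy to change the policy later, for example to fold some of them into an existing Wikipedia instead of dropping them.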

Am I correct to assume that December 2014 is the first time these reports became available? Also I think the reports stopped in September 2018.

Owner Author

@vipulnaik vipulnaik commented May 8, 2019

@riceissa Yes, December 2014 is the start. September 2018 is the last month available so far, but more months may be added later (the data is updated with significant lag).

Collaborator

@riceissa riceissa commented May 9, 2019

Ok, this is done now at https://github.com/riceissa/wikipediaviews-country-breakdown

You can find the output SQL at out-2019-05-09.sql.

Right now the year/month combinations are hard-coded, so if the script is run in the future, they will need to be updated.
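
One way to avoid the hard-coding would be to generate the month list from a start and end date. The sketch below is not the code in the repository. It assumes every month's report lives at the same URL pattern as the 2018-01 page linked at the top of this issue, which has not been checked for each month; `month_range` and `report_urls` are hypothetical names.

```python
# URL pattern inferred from the 2018-01 report linked in this issue (assumption).
URL_TEMPLATE = (
    "https://stats.wikimedia.org/archive/squid_reports/"
    "{year:04d}-{month:02d}/SquidReportPageViewsPerCountryBreakdownHuge.htm"
)


def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def report_urls(start=(2014, 12), end=(2018, 9)):
    """Build one report URL per month; defaults follow the dates in this thread."""
    return [URL_TEMPLATE.format(year=y, month=m) for y, m in month_range(start, end)]


urls = report_urls()
```

If more months are published later, only the `end` argument would need to change.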

Owner Author

@vipulnaik vipulnaik commented Jun 4, 2019

@riceissa It looks like this is failing to insert, because it includes zh-tw as a language. However, there is no separate zh-tw Wikipedia; it redirects to zh Wikipedia.

The situation seems a bit complicated. It is explained here in some more detail: https://en.wikipedia.org/wiki/Chinese_Wikipedia#Automatic_conversion_between_traditional_and_simplified_Chinese_characters

My suggested approach: For countries that have zh-tw, add the zh-tw count into the zh count. Alternative ideas are also welcome. Also, can you check if there are any other languages with similar issues?
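
The suggested approach could look roughly like the sketch below. It assumes the scraped data is a dict keyed by `(country, language)`; that shape, the `MERGE_INTO` name, and the sample numbers are made up for illustration. Only the zh-tw to zh mapping comes from this thread.

```python
from collections import defaultdict

# Language codes whose counts are folded into another language.
# Only zh-tw -> zh is proposed so far; any other entries are undecided.
MERGE_INTO = {"zh-tw": "zh"}


def merge_language_counts(counts, merge_into=MERGE_INTO):
    """Add the counts of redirected languages into their target language.

    `counts` maps (country, language) to a view count (assumed shape).
    """
    merged = defaultdict(int)
    for (country, language), views in counts.items():
        target = merge_into.get(language, language)
        merged[(country, target)] += views
    return dict(merged)


# Made-up numbers, not real view counts.
sample = {
    ("TW", "zh"): 100,
    ("TW", "zh-tw"): 25,
    ("TW", "en"): 10,
    ("HK", "zh-tw"): 7,
}
merged = merge_language_counts(sample)
```

A country that has zh-tw but no zh row ends up with a new zh row, so no views are lost in the merge.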

Collaborator

@riceissa riceissa commented Jun 4, 2019

Other similar languages:

Aside from the above, I was able to get the SQL to insert.

My guess is that for some of them (e.g. zh-tw -> zh) we'll want to add the counts to an existing language, but for others (e.g. gag, xmf) we'll want to add the language to the enum for the viewcountsbymonth table schema (because there is no similar language to add it to).

riceissa added a commit to riceissa/wikipediaviews-country-breakdown that referenced this issue Jun 5, 2019:

See vipulnaik/wikipediaviews#22 (comment)
and vipulnaik/wikipediaviews#22 (comment)

This commit doesn't have the actual list of languages that should be
added together, since that hasn't been decided yet.

Collaborator

@riceissa riceissa commented Jun 5, 2019

I've made code changes over at https://github.com/riceissa/wikipediaviews-country-breakdown to make it easy to combine view counts for different languages, so now I just need a list of languages that should be combined.

Collaborator

@riceissa riceissa commented Jun 5, 2019

I wrote a small script to compare the list of Wikipedias on Wikipedia Views with the one at Wikimedia meta-wiki: https://gist.github.com/riceissa/1b633cf2cf0629e0e87f715c828326a3

The output of the script is:

  • Things in meta not in Vipul {'sat', 'pfl', 'din', 'gor', 'olo', 'xmf', 'inh', 'atj', 'lrc', 'jam', 'shn', 'gag', 'kbp', 'be-tarask', 'azb', 'lfn', 'dty', 'hyw', 'ady', 'tcy'}
  • Things in Vipul not in meta {'mo'}

The "mo" entry is explained on the meta-wiki list: "mo: has been redirected to ro:, as of December 2017" (https://meta.wikimedia.org/wiki/List_of_Wikipedias).
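
The comparison itself comes down to two set differences. The sketch below is not the gist's code; it only illustrates the idea with a small hand-picked subset of language codes, and `compare_language_sets` is a hypothetical name.

```python
def compare_language_sets(meta, local):
    """Return (codes only in meta, codes only in the local list)."""
    meta, local = set(meta), set(local)
    return meta - local, local - meta


# Small illustrative subsets, not the full lists from either source.
meta_codes = {"en", "zh", "ro", "gag", "xmf"}
local_codes = {"en", "zh", "ro", "mo"}

only_in_meta, only_in_local = compare_language_sets(meta_codes, local_codes)
```

Codes in `only_in_meta` are candidates for adding to the language enum, while codes in `only_in_local` (such as "mo") may be redirects that should be merged into another language.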
