Look into automation of SQL insertion code generation from country breakdown pages #22

Closed
vipulnaik opened this issue May 4, 2019 · 7 comments


@vipulnaik vipulnaik referenced this issue May 4, 2019

Closed

Vipul Saturday checklist | 2019-05-04 #63

9 of 14 tasks complete
@riceissa

Collaborator

commented May 8, 2019

Work in progress: https://github.com/riceissa/wikipediaviews-country-breakdown (the basic scraping for each page is done, but I still need to loop through all the months and print the SQL).

I ran into some difficulties with strange-looking Wikipedia names, specifically "Other", "Portal", "m Wp", "Commons Wp", "wwwhttp Wp", "zero Wp", "plhttp Wp", "enhttp Wp", "eshttp Wp". Right now I'm just ignoring these, but maybe you'd like to do something else.

Am I correct to assume that December 2014 is the first time these reports became available? Also I think the reports stopped in September 2018.

@vipulnaik

Owner Author

commented May 8, 2019

@riceissa Yes, December 2014 is the start. September 2018 is the latest month available so far, but they may add more months later (they update the data with significant lag).

@riceissa

Collaborator

commented May 9, 2019

Ok, this is done now at https://github.com/riceissa/wikipediaviews-country-breakdown

You can find the output SQL at out-2019-05-09.sql.

Right now the year/month combinations are hard-coded, so they will need to be updated if the script is run again in the future.
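
One way to avoid hard-coding the months would be to generate them from a start and end pair. The following is only a sketch, not the code in the repository: `month_range`, `FIRST_REPORT`, and `LAST_REPORT` are hypothetical names, and the end month would need to be bumped if more reports are published.

```python
def month_range(start, end):
    """Yield (year, month) pairs from start to end, both inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            # Roll over from December to January of the next year.
            year, month = year + 1, 1


# First and last report months as discussed in this thread.
FIRST_REPORT = (2014, 12)
LAST_REPORT = (2018, 9)

months = list(month_range(FIRST_REPORT, LAST_REPORT))
```

The scraper could then loop over `months` instead of a literal list, so picking up later reports would only mean changing `LAST_REPORT`.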

@vipulnaik

Owner Author

commented Jun 4, 2019

@riceissa It looks like this is failing to insert, because it includes zh-tw as a language. However, there is no separate zh-tw Wikipedia; it redirects to zh Wikipedia.

The situation seems a bit complicated. It is explained here in some more detail: https://en.wikipedia.org/wiki/Chinese_Wikipedia#Automatic_conversion_between_traditional_and_simplified_Chinese_characters

My suggested approach: For countries that have zh-tw, add the zh-tw count into the zh count. Alternative ideas are also welcome. Also, can you check if there are any other languages with similar issues?
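
A rough sketch of that approach, assuming the scraped data is available as `(country, language, views)` tuples. The tuple layout, the `MERGE_INTO` mapping, and the `merge_language_counts` helper are all hypothetical; only the zh-tw to zh pairing comes from this thread, and the numbers are made up.

```python
from collections import defaultdict

# Hypothetical mapping from a scraped language code to the code it should be
# counted under. Only zh-tw -> zh is taken from this thread.
MERGE_INTO = {"zh-tw": "zh"}


def merge_language_counts(rows, merge_into=MERGE_INTO):
    """Sum views per (country, language) after remapping language codes.

    rows is an iterable of (country, language, views) tuples.
    """
    totals = defaultdict(int)
    for country, language, views in rows:
        target = merge_into.get(language, language)
        totals[(country, target)] += views
    return dict(totals)


# Made-up numbers, for illustration only.
rows = [("Taiwan", "zh", 100), ("Taiwan", "zh-tw", 25), ("Taiwan", "en", 40)]
merged = merge_language_counts(rows)
```

Keeping the pairs in a mapping means other redirected languages could be handled by adding entries rather than changing the summing logic.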

@vipulnaik vipulnaik referenced this issue Jun 4, 2019

Closed

Vipul Tuesday checklist | 2019-06-04 #80

1 of 1 task complete
@riceissa

Collaborator

commented Jun 4, 2019

Other similar languages:

Aside from the above, I was able to get the SQL to insert.

My guess is that for some of them (e.g. zh-tw -> zh) we'll want to add the counts to an existing language, but for others (e.g. gag, xmf) we'll want to add the language to the enum for the viewcountsbymonth table schema (because there is no similar language to add it to).

riceissa added a commit to riceissa/wikipediaviews-country-breakdown that referenced this issue Jun 5, 2019

Allow similar language views to be added
See vipulnaik/wikipediaviews#22 (comment)
and vipulnaik/wikipediaviews#22 (comment)

This commit doesn't have the actual list of languages that should be
added together, since that hasn't been decided yet.
@riceissa

Collaborator

commented Jun 5, 2019

I've made code changes over at https://github.com/riceissa/wikipediaviews-country-breakdown to make it easy to combine view counts for different languages, so now I just need a list of languages that should be combined.

@riceissa

Collaborator

commented Jun 5, 2019

I wrote a small script to compare the list of Wikipedias on Wikipedia Views with the one at Wikimedia meta-wiki: https://gist.github.com/riceissa/1b633cf2cf0629e0e87f715c828326a3

The output of the script is:

  • Things in meta not in Vipul {'sat', 'pfl', 'din', 'gor', 'olo', 'xmf', 'inh', 'atj', 'lrc', 'jam', 'shn', 'gag', 'kbp', 'be-tarask', 'azb', 'lfn', 'dty', 'hyw', 'ady', 'tcy'}
  • Things in Vipul not in meta {'mo'}
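
The comparison itself comes down to two set differences. This is a minimal sketch and not the gist's actual code: `compare_language_lists` is a hypothetical helper, and the two input lists are tiny stand-ins rather than the real lists from meta-wiki and Wikipedia Views.

```python
def compare_language_lists(meta_codes, local_codes):
    """Return (codes only in meta, codes only in the local list) as sets."""
    meta, local = set(meta_codes), set(local_codes)
    return meta - local, local - meta


# Small stand-in lists, for illustration only.
meta_codes = ["en", "zh", "ro", "gag", "xmf"]
local_codes = ["en", "zh", "ro", "mo"]

only_in_meta, only_in_local = compare_language_lists(meta_codes, local_codes)
print("Things in meta not in Vipul", sorted(only_in_meta))
print("Things in Vipul not in meta", sorted(only_in_local))
```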

"mo: has been redirected to ro:, as of December 2017" https://meta.wikimedia.org/wiki/List_of_Wikipedias

riceissa added a commit to riceissa/wikipediaviews that referenced this issue Jun 6, 2019

@vipulnaik vipulnaik referenced this issue Jun 8, 2019

Closed

Vipul Saturday checklist | 2019-06-08 #72

6 of 6 tasks complete

@vipulnaik vipulnaik closed this in 1d6a89e Jun 9, 2019

@vipulnaik vipulnaik referenced this issue Jun 9, 2019

Closed

Vipul Sunday checklist | 2019-06-09 #82

9 of 10 tasks complete