-
-
Notifications
You must be signed in to change notification settings - Fork 403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wikipedia: retrieve section summaries #1163
Conversation
d1ede00
to
144f6a1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple things.
First, this needs rebasing to fix merge conflicts if it's going to be included.
Second, it would be ideal not to add a new dependency. Meaning, if there's a suitable alternative to including bs4
, that would be great. (If there really isn't a good alternative, add the used version of bs4
to requirements.txt
so Travis can run the tests properly.)
144f6a1
to
7587539
Compare
7587539
to
6bcb54b
Compare
I've pushed a very rough initial feature. It's not suitable for merging yet, and fails style checks. I'll continue working on it soon. |
@alanhuang122 Do you still intend to finish this feature, or does someone need to adopt it? 😸 |
6bcb54b
to
d776b6f
Compare
This is now a more complete ("complete") function. There will almost certainly be stylistic issues (at the very least) that would be simple to correct. One known issue is that the snippet feature does not recognize section titles that are italicized (e.g., this page [comparing |
Happy to see activity on your PRs whenever you have the time and motivation! As I said on IRC, the milestones I assigned are just when I think it'll be ready; if something's not done in time, I don't have any problem with bumping it to the next release unless it's a really important bugfix. Take your time! 😸
The JSON pulled down to find the section also includes a field named |
Co-Authored-By: alanhuang122 <Alan.Huang@utdallas.edu>
|
||
snippet = mw_section(server, query, section) | ||
if not snippet: | ||
bot.say("[WIKIPEDIA] Error fetching section \"{}\" for page \"{}\".".format(section, page_name)) | ||
return | ||
|
||
msg = '[WIKIPEDIA] {} - {} | "{}"'.format(page_name, section, snippet) | ||
msg = '[WIKIPEDIA] {} - {} | "{}"'.format(page_name, section.replace('_', ' '), snippet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a couple thoughts while looking at this before bed. Cursory stuff, not a formal review (and not that this PR needs another until it's "really ready"), but things that might inform the evolution as it's getting close to done.
I wonder how many users would be confused if that separating -
turned into a §
… Somehow the hyphen doesn't seem right. Section symbol, some Unicode arrow, >
… one of those maybe?
Also: What do you think about returning a tuple from mw_section()
, say (section_title, snippet)
? Have to think about the possibility of HTML, character entities, etc. in the section title, though. The replace
here is certainly the simplest way to go!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the hyphen to fit better than the alternatives, but that's wholly a personal opinion.
Wrt. the tuple suggestion, I presume you're referring to taking the entry from the MediaWiki sections listing. That'd involve at least some HTML parsing in some cases (replacing <i></i>
with a call to formatting.italic()
, e.g.) which I'd remove from my implementation, but could be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think parsing HTML in the section title is worth it. Stripping tags is fine. It's just that the anchor name is sanitized of more than just HTML tags, and might be significantly different from the "real" section title that mw_section()
has access to.
§
seems appropriate because that's literally the reason that symbol exists—to denote a document section—but I'm not going to push. It can always be changed later. 😸
I feel like this one is either stale or need to be kept for when we extract some plugins out of the core—this module would be a great candidate to be an external module. |
@Exirel Why do you say this is a great candidate for an external module? It requires no API keys or authentication of any kind, and is broadly useful (everyone links Wikipedia on IRC eventually). Grabbing the relevant section when someone links directly to one will only make it more broadly useful. |
Because of its increasing complexity, it becomes more and more subject to issues, bugs, that will ultimately require a faster release cycle than the bot itself. All I see in this PR looks good, but also more complex, more prone to errors and bugs. (That being said... it also lacks some unit-test.) |
I'm far more comfortable with the output format of Wikipedia remaining the same than with many of the APIs that we use in other core modules, honestly. All of this discussion could be moot, though, if @alanhuang122 no longer wants to work on this and we can't find anyone to take over the work. I'm willing, but there are so many more important things to do… |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, on a lark, I locally merged this into current master
and it works just fine… The conflicts were a little gnarly, but they're sorted. I haven't found anything that broke. Output for links to sections looks good, even if it gets a bit cluttered for sections with a lot of "See also" and "Main article" annotations. We learned during the run-up to 7.0 that we mustn't allow perfect to be the enemy of good. 😸 There's always another tweak to make in the next release.
Unfortunately I don't think Git will let me take both this and my local merge of #1178, but that one had a really simple conflict-resolution step, and I wouldn't mind doing it again.
@Exirel Would you be OK with me taking this for 7.1, with the option to add unit tests when/if you see fit?
Kinda messy as this introduces a double-merge, but I don't want to rebase #1163 again. Doing it was far too annoying the *first* time...
Module will now retrieve text from the relevant section
instead of the beginning of the article irrespective of
section designations in the URL.