Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wikipedia: retrieve section summaries #1163

Merged
merged 8 commits into from
May 17, 2020
Merged

Conversation

alanhuang122
Copy link
Contributor

Module will now retrieve text from the relevant section
instead of the beginning of the article irrespective of
section designations in the URL.

Copy link
Member

@dgw dgw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple things.

First, this needs rebasing to fix merge conflicts if it's going to be included.

Second, it would be ideal not to add a new dependency. Meaning, if there's a suitable alternative to including bs4, that would be great. (If there really isn't a good alternative, add the used version of bs4 to requirements.txt so Travis can run the tests properly.)

sopel/modules/wikipedia.py Outdated Show resolved Hide resolved
@dgw dgw added this to the 6.6.0 milestone Mar 22, 2018
@alanhuang122 alanhuang122 force-pushed the wiki-sections branch 2 times, most recently from 144f6a1 to 7587539 Compare March 22, 2018 10:27
@alanhuang122
Copy link
Contributor Author

I've pushed a very rough initial feature. It's not suitable for merging yet, and fails style checks. I'll continue working on it soon.

@dgw
Copy link
Member

dgw commented Nov 5, 2018

@alanhuang122 Do you still intend to finish this feature, or does someone need to adopt it? 😸

@alanhuang122
Copy link
Contributor Author

This is now a more complete ("complete") function. There will almost certainly be stylistic issues (at the very least) that would be simple to correct.

One known issue is that the snippet feature does not recognize section titles that are italicized (e.g., this page [comparing <i>Secret History</i> to Secret History]). I could special-case the removal of HTML tags, but that assumes that HTML tags will never be part of any section title.

sopel/modules/wikipedia.py Outdated Show resolved Hide resolved
@dgw
Copy link
Member

dgw commented Nov 15, 2018

Happy to see activity on your PRs whenever you have the time and motivation! As I said on IRC, the milestones I assigned are just when I think it'll be ready; if something's not done in time, I don't have any problem with bumping it to the next release unless it's a really important bugfix. Take your time! 😸

One known issue is that the snippet feature does not recognize section titles that are italicized (e.g., this page [comparing <i>Secret History</i> to Secret History]). I could special-case the removal of HTML tags, but that assumes that HTML tags will never be part of any section title.

The JSON pulled down to find the section also includes a field named anchor, which should exactly match the trigger group that captures the section anchor off the URL. I doubt it would introduce any new issues to rewrite the code to match based on that instead, and it would fix the known issue you described.


snippet = mw_section(server, query, section)
if not snippet:
bot.say("[WIKIPEDIA] Error fetching section \"{}\" for page \"{}\".".format(section, page_name))
return

msg = '[WIKIPEDIA] {} - {} | "{}"'.format(page_name, section, snippet)
msg = '[WIKIPEDIA] {} - {} | "{}"'.format(page_name, section.replace('_', ' '), snippet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a couple thoughts while looking at this before bed. Cursory stuff, not a formal review (and not that this PR needs another until it's "really ready"), but things that might inform the evolution as it's getting close to done.

I wonder how many users would be confused if that separating - turned into a §… Somehow the hyphen doesn't seem right. Section symbol, some Unicode arrow, >… one of those maybe?

Also: What do you think about returning a tuple from mw_section(), say (section_title, snippet)? Have to think about the possibility of HTML, character entities, etc. in the section title, though. The replace here is certainly the simplest way to go!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the hyphen to fit better than the alternatives, but that's wholly a personal opinion.

Wrt. the tuple suggestion, I presume you're referring to taking the entry from the MediaWiki sections listing. That'd involve at least some HTML parsing in some cases (replacing <i></i> with a call to formatting.italic(), e.g.) which I'd remove from my implementation, but could be done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think parsing HTML in the section title is worth it. Stripping tags is fine. It's just that the anchor name is sanitized of more than just HTML tags, and might be significantly different from the "real" section title that mw_section() has access to.

§ seems appropriate because that's literally the reason that symbol exists—to denote a document section—but I'm not going to push. It can always be changed later. 😸

@dgw dgw modified the milestones: 6.6.0, 7.0.0 Dec 30, 2018
@Exirel
Copy link
Contributor

Exirel commented Apr 21, 2019

I feel like this one is either stale or need to be kept for when we extract some plugins out of the core—this module would be a great candidate to be an external module.

@dgw
Copy link
Member

dgw commented Apr 22, 2019

@Exirel Why do you say this is a great candidate for an external module? It requires no API keys or authentication of any kind, and is broadly useful (everyone links Wikipedia on IRC eventually). Grabbing the relevant section when someone links directly to one will only make it more broadly useful.

@Exirel
Copy link
Contributor

Exirel commented Apr 22, 2019

@Exirel Why do you say this is a great candidate for an external module?

Because of its increasing complexity, it becomes more and more subject to issues, bugs, that will ultimately require a faster release cycle than the bot itself. All I see in this PR looks good, but also more complex, more prone to errors and bugs.

(That being said... it also lacks some unit-test.)

@dgw
Copy link
Member

dgw commented Apr 22, 2019

I'm far more comfortable with the output format of Wikipedia remaining the same than with many of the APIs that we use in other core modules, honestly.

All of this discussion could be moot, though, if @alanhuang122 no longer wants to work on this and we can't find anyone to take over the work. I'm willing, but there are so many more important things to do…

@dgw dgw modified the milestones: 7.0.0, 7.1.0 Jul 5, 2019
@dgw dgw added Stale Mostly used for PRs that no longer work and need updating before re-review/merge. and removed Tweak Work In Progress labels Apr 25, 2020
Copy link
Member

@dgw dgw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, on a lark, I locally merged this into current master and it works just fine… The conflicts were a little gnarly, but they're sorted. I haven't found anything that broke. Output for links to sections looks good, even if it gets a bit cluttered for sections with a lot of "See also" and "Main article" annotations. We learned during the run-up to 7.0 that we mustn't allow perfect to be the enemy of good. 😸 There's always another tweak to make in the next release.

Unfortunately I don't think Git will let me take both this and my local merge of #1178, but that one had a really simple conflict-resolution step, and I wouldn't mind doing it again.

@Exirel Would you be OK with me taking this for 7.1, with the option to add unit tests when/if you see fit?

@dgw dgw removed the Stale Mostly used for PRs that no longer work and need updating before re-review/merge. label May 8, 2020
dgw added a commit that referenced this pull request May 17, 2020
Kinda messy as this introduces a double-merge, but I don't want to
rebase #1163 again. Doing it was far too annoying the *first* time...
@dgw dgw merged commit 48c1c56 into sopel-irc:master May 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants