QSURLExtractor needs improving #1146

Closed
pjrobertson opened this Issue Sep 28, 2012 · 3 comments

Owner

pjrobertson commented Sep 28, 2012

A couple of things:

  • On/after line 58 there should be another check for the contents of the <a></a> tag using thisLink['title'] = link.contents
    What I've done previously is to then strip out all the HTML tags from this using the common file from the webscraping import
    so
from webscraping import common
…
        if thisLink['title'] is None:
            thisLink['title'] = common.normalize(link.contents)
  • The link var should be an ordered set if possible. It's a bit unnerving having results show up in any order on the page
  • The script doesn't seem to correctly convert all HTML entities properly. Put http://www.wordreference.com/fren/grand in QS's 1st pane, then right arrow into it. Notice how there are lots of things like &quot; etc.
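For reference, the title fallback in the first bullet could probably be done without the extra webscraping dependency. This is only a rough sketch, assuming Python 3's standard `html.parser`; the names `strip_tags` and `link_title` are made up for illustration and are not part of QSURLExtractor. It also treats an empty title the same as a missing one, which is a slight change from the `is None` check above.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True also turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


def link_title(title_attr, inner_html):
    """Prefer the title attribute; otherwise fall back to the link's own text."""
    if title_attr:
        return title_attr
    return strip_tags(inner_html)
```

Something like `link_title(None, '<b>Quick</b> <i>silver</i>')` would then give the plain link text instead of leaving the title empty.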
Owner

skurfer commented Sep 28, 2012

using the common file from the webscraping import

This isn't available by default on users’ systems. Not a deal breaker, but it complicates things quite a bit.

The link var should be an ordered set if possible.

I looked into this a bit the other day. I think the best solution is to just use a list and manually prevent duplicates.
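A minimal sketch of what "a list plus manual duplicate checks" might look like. It assumes each link is a dict with a `'url'` key, which is a guess at the script's data shape, and `add_link` is a hypothetical helper name.

```python
def add_link(links, seen, link):
    """Append link to links unless its URL was already added.

    links: list that keeps the order in which links appeared on the page
    seen:  set of URLs already in links, used only for fast membership tests
    Returns True if the link was added, False if it was a duplicate.
    """
    url = link["url"]  # assumed key name, not confirmed by the script
    if url in seen:
        return False
    seen.add(url)
    links.append(link)
    return True


links, seen = [], set()
for candidate in (
    {"url": "http://example.com/a", "title": "A"},
    {"url": "http://example.com/b", "title": "B"},
    {"url": "http://example.com/a", "title": "A again"},
):
    add_link(links, seen, candidate)
```

The list preserves page order, and the side set keeps the duplicate check cheap even on pages with many links.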

The script doesn't seem to correctly convert all HTML entities properly.

I don’t remember, but more likely, it doesn’t convert any of them. Properly or otherwise. :-)
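If the script does end up converting entities, a rough sketch of one way to do it is below. It assumes Python 3.4 or later, where `html.unescape` is in the standard library; `clean_text` is a hypothetical name, and how this would hook into QSURLExtractor is left open.

```python
import html


def clean_text(text):
    """Replace HTML entities such as &quot; and &amp; with the characters they stand for."""
    return html.unescape(text)
```

With this, a string like `&quot;grand&quot; &amp; petit` would come out with real quote marks and an ampersand.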

Owner

pjrobertson commented Sep 28, 2012

using the common file from the webscraping import

This isn't available by default on users’ systems. Not a deal breaker,
but it complicates things quite a bit.

I used the assumption that since we're already packaging BeautifulSoup.py
with QS, why not package one more? :-)

We could probably just pinch the method (or are they functions in Python?)
out of the file and keep it in QSURLExtractor if you like.

Let me know if you're looking into these Pythony things, or I will ;-)

Owner

skurfer commented Sep 28, 2012

I used the assumption that since we're already packaging BeautifulSoup.py with QS, why not package one more? :-)

Yeah, but Beautiful Soup was designed to be one file. That other thing appears to be a more traditional "module". I'm sure there's a way to include it if it's important enough, though.

Let me know if you're looking into these Pythony things, or I will ;-)

It's interesting, but no, probably not any time soon.

@skurfer skurfer closed this in 918b69a Oct 2, 2012
