Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QSURLExtractor needs improving #1146

Closed
pjrobertson opened this issue Sep 28, 2012 · 3 comments
Closed

QSURLExtractor needs improving #1146

pjrobertson opened this issue Sep 28, 2012 · 3 comments

Comments

@pjrobertson
Copy link
Member

@pjrobertson pjrobertson commented Sep 28, 2012

A couple of things:

  • On/after line 58 there should be another check for the contents of the <a></a> tag using thisLink['title'] = link.contents
    What I've done previously is to then strip out all the HTML tags from this using the common file from the webscraping import
    so
from web scraping import commonif thisLink['title'] is None:
        thisLink['title'] = common.normalize(link.contents)
  • The link var should be an ordered set if possible. It's a bit unnerving having results show up in any order on the page
  • The script doesn't seem to correctly convert all HTML entities properly. Put http://www.wordreference.com/fren/grand in QS's 1st pane, then right arrow into it. Notice how there are lots of things like &quot; etc.
@skurfer
Copy link
Member

@skurfer skurfer commented Sep 28, 2012

using the common file from the webscraping import

This isn't available by default on users’ systems. Not a deal breaker, but it complicates things quite a bit.

The link var should be an ordered set if possible.

I looked into this a bit the other day. I think the best solution is to just use a list and manually prevent duplicates.

The script doesn't seem to correctly convert all HTML entities properly.

I don’t remember, but more likely, it doesn’t convert any of them. Properly or otherwise. :-)

@pjrobertson
Copy link
Member Author

@pjrobertson pjrobertson commented Sep 28, 2012

using the common file from the webscraping import

This isn't available by default on users’ systems. Not a deal breaker,
but it complicates things quite a bit.

I used the assumption that since we're already packaging BeautifulSoup.py
with QS, why not package one more? :-)

We could probably just pinch the method (or are they functions in Python?)
out of the file and keep it QSURLExtractor if you like.

Let me know if you're looking into these Pythony things, or I will ;-)

On 28 September 2012 13:54, Rob McBroom notifications@github.com wrote:

using the common file from the webscraping import

This isn't available by default on users’ systems. Not a deal breaker, but
it complicates things quite a bit.

The link var should be an ordered set if possible.

I looked into this a bit the other day. I think the best solution is to
just use a list and manually prevent duplicates.

The script doesn't seem to correctly convert all HTML entities properly.

I don’t remember, but more likely, it doesn’t convert any of them.
Properly or otherwise. :-)


Reply to this email directly or view it on GitHubhttps://github.com//issues/1146#issuecomment-8975438.

@skurfer
Copy link
Member

@skurfer skurfer commented Sep 28, 2012

I used the assumption that since we're already packaging BeautifulSoup.py with QS, why not package one more? :-)

Yeah, but Beautiful Soup was designed to be one file. That other thing appears to be a more traditional "module". I'm sure there's a way to include it if it's important enough, though.

Let me know if you're looking into these Pythony things, or I will ;-)

It's interesting, but no, probably not any time soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants