QSURLExtractor needs improving #1146

Closed
pjrobertson opened this Issue · 3 comments

2 participants

@pjrobertson
Owner

A couple of things:

  • On/after line 58 there should be another check for the contents of the `<a></a>` tag, using `thisLink['title'] = link.contents`. What I've done previously is to then strip out all the HTML tags from this using the `common` module from the `webscraping` package, so:
    from webscraping import common
    …
    # Fall back to the normalized tag text when no title attribute is present
    if thisLink['title'] is None:
        thisLink['title'] = common.normalize(link.contents)
  • The `link` var should be an ordered set if possible. It's a bit unnerving having results show up in an arbitrary order on the page.

  • The script doesn't seem to convert all HTML entities properly. Put http://www.wordreference.com/fren/grand in QS's first pane, then right-arrow into it. Notice how there are lots of things like `&quot;` etc.

@skurfer
Owner

> using the `common` module from the `webscraping` package

This isn't available by default on users’ systems. Not a deal breaker, but it complicates things quite a bit.
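
One way around the dependency, since BeautifulSoup.py is already bundled with QS: strip the tags with Beautiful Soup itself. A minimal sketch, assuming the bundled single-file BeautifulSoup 3 API and that `link` is the anchor `Tag` from the existing parse (`title_from_tag` is a hypothetical helper, not part of the plugin):

    # Sketch: build a plain-text title from an <a> tag using only the
    # already-bundled BeautifulSoup 3, instead of webscraping.common.
    def title_from_tag(link):
        # findAll(text=True) returns every NavigableString inside the
        # tag, so joining them discards the markup.
        text = u''.join(link.findAll(text=True))
        # Collapse whitespace left behind by nested tags.
        return u' '.join(text.split())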

> The `link` var should be an ordered set if possible.

I looked into this a bit the other day. I think the best solution is to just use a list and manually prevent duplicates.
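
A minimal sketch of that approach (names are illustrative; `found_urls` stands in for whatever the extractor yields): a plain list keeps first-seen order, and a companion set makes the duplicate check cheap:

    # Sketch: preserve first-seen order while dropping duplicate URLs.
    seen = set()
    links = []
    for url in found_urls:
        if url not in seen:
            seen.add(url)
            links.append(url)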

> The script doesn't seem to convert all HTML entities properly.

I don’t remember, but more likely, it doesn’t convert any of them. Properly or otherwise. :-)
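
For what it's worth, the standard library can decode entities without a new dependency. A minimal sketch for Python 2 (which QS plugins run against); `HTMLParser.unescape` is undocumented but handles named and numeric entities, and Python 3 exposes the same thing as `html.unescape`:

    # Sketch: decode entities such as &quot; using only the stdlib.
    from HTMLParser import HTMLParser

    def decode_entities(text):
        # Turns 'grand &quot;tall&quot;' into u'grand "tall"'.
        return HTMLParser().unescape(text)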

@pjrobertson
Owner

I used the assumption that since we're already packaging BeautifulSoup.py with QS, why not package one more? :-)

Let me know if you're looking into these Pythony things, or I will ;-)

@skurfer
Owner

> I used the assumption that since we're already packaging BeautifulSoup.py with QS, why not package one more? :-)

Yeah, but Beautiful Soup was designed to be one file. That other thing appears to be a more traditional "module". I'm sure there's a way to include it if it's important enough, though.

> Let me know if you're looking into these Pythony things, or I will ;-)

It's interesting, but no, probably not any time soon.

@skurfer skurfer closed this in 918b69a