
New HTML parser (based on Beautiful Soup) #345

Merged
merged 7 commits into from Jul 22, 2011
Conversation

@skurfer (Member) commented Jun 1, 2011

This is a total rewrite using Beautiful Soup, which should handle badly written documents gracefully. Here are some sites to test with:

http://www.greenspec.co.uk/small-wind-turbines.php (the original Perl parser and the current Python parser choke on this)
http://www.wired.com/ (lots of links)
http://github.com/ (even more links - if you’re logged in via Safari)

As I feared, this parser is slower than the original and the current one. The price you pay for robust fault tolerance, I suppose. The web is a nasty, unpredictable place. (I’ve always felt like this should be done in Obj-C using WebKit or something, but that’s beyond my abilities for now.) But keep in mind, just like before, the differences in parsing speed are dwarfed by the time it takes to get the HTML over the network so I think it’s acceptable. (I can’t really tell a difference actually.)
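For illustration only, the basic link-scraping idea can be sketched with the standard library's html.parser. This is not the plugin's actual code (which uses Beautiful Soup, precisely because it tolerates malformed HTML better than a strict parser); the class and variable names here are made up:

```python
# Sketch of the link-extraction idea: collect (title, href) pairs from HTML.
# The real parser uses Beautiful Soup; this stdlib version is illustrative.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []      # list of (title, href) tuples
        self._href = None    # href of the <a> currently being read, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            title = ''.join(self._text).strip()
            # fall back to the URL when the link has no text
            self.links.append((title or self._href, self._href))
            self._href = None

p = LinkExtractor()
p.feed('<a href="http://qsapp.com/">Quicksilver</a> <a href="http://x.test/"></a>')
# p.links now holds ('Quicksilver', ...) and a URL-titled entry for the empty link
```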

@pjrobertson (Member) commented Jul 20, 2011

Oh yeah that was why I never looked at this: I don't speak python :o

I'll merge it and just have a play. If nobody else comments, then we have the benefit of the doubt that it's all fine :)

@pjrobertson (Member) commented Jul 20, 2011

Oh deary me. Just checked this out and lost all my changes that I've made to the results view stuff :(

I still have a diff file open, so that's my only hope. Looks like this'll have to wait a little longer though...

@skurfer (Member, Author) commented Jul 20, 2011

Oh deary me. Just checked this out and lost all my changes that I've made to the results view stuff

Sorry. That wasn’t my fault was it? I did get a merge conflict which I think was between this and your paste pull request, but it was just a couple of file references in the xcodeproj.

@pjrobertson (Member) commented Jul 20, 2011

Nah wasn't your fault! I just got carried away and forgot to git stash :(

The diff was helpful though, got it all back now so I can look at this :)


@pjrobertson (Member) commented Jul 22, 2011

Hmmm.... trying to right arrow into http://github.com I get this:

    Traceback (most recent call last):
    File "/tmp/QS/build/Debug/Quicksilver.app/Contents/PlugIns/Core Support.qsplugin/Contents/Resources/QSURLExtractor.py", line 60, in <module>
        thisLink['title'].strip(),
    AttributeError: 'NoneType' object has no attribute 'strip'

And it means I just get the bump and can't load the contents. That's with me logged into GitHub in Safari
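For context, the traceback above is the standard Python error for calling a string method on `None`. A minimal reproduction (the `title` name is illustrative, not the plugin's code):

```python
# Minimal reproduction of the AttributeError in the traceback above:
# the parser got None for a link's title and then called .strip() on it.
title = None  # what an empty <a></a> produced
try:
    title.strip()
except AttributeError as e:
    msg = str(e)  # "'NoneType' object has no attribute 'strip'"
```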

@pjrobertson (Member) commented Jul 22, 2011

Edit: here's a random link I came across when just browsing any old site that gives the same error: http://devlink.net

Just in case GitHub doesn't give it for you :)

@skurfer (Member, Author) commented Jul 22, 2011

This is why we require a second pair of eyes before merging. :)

GitHub works fine for me, but I was able to reproduce the problem with devlink. I thought an empty <span> would return an empty string instead of None. Guess not.

@pjrobertson (Member) commented Jul 22, 2011

I thought my eyes would be useless as I can't really follow the Python, but fortunately not!

Works fine now :)
Just looking at the code and what you said - is it worth doing

    if thisLink['title'] is None or thisLink['title'] == '':

(dunno if the syntax is right...) just in case it can ever be an empty string? Don't know, just throwing it out there.

All good from my "don't know Python" point of view. I have looked over it and can't see anything you may have forgotten.

..oh, maybe. I made an "I wanna break this" page, and it looks like UTF-8 encoding isn't working.
Check out http://qsapp.com/urls.html in a browser (the meta tag has the encoding set to UTF-8), then check it in Quicksilver.
Also - since links with nothing between them don't show up in browsers, should they show up in Quicksilver? Perhaps the code should be

    if thisLink['title'] is None or thisLink['title'] == '':
        continue

Sorry for trying to break it! :P


@pjrobertson (Member) commented Jul 22, 2011

P.S. it looks like the thingy handles UTF-8 data in the actual URLs (href bit) but not in the title bit (between the <a></a>)

@skurfer (Member, Author) commented Jul 22, 2011

Also - since links with nothing between them don't show up in browsers, should they show up in Quicksilver?

An empty <a> tag results in None, so it will never match '' (which is the behavior that caused the exception you found). If there’s no text, it’ll use the URL instead. I suppose we could drop it altogether, but does it hurt anything to leave these in?
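A guard along the lines pjrobertson proposed could look like this. The helper name is hypothetical and this is not the actual patch; it just illustrates skipping empty or whitespace-only titles the way a browser hides such links:

```python
# Hypothetical helper illustrating the proposed guard: Beautiful Soup yields
# None for an empty <a></a>, so test for None (and blank strings) before
# calling .strip(); returning None means "skip this link", like a browser.
def usable_title(title):
    if title is None or title.strip() == '':
        return None  # caller should skip this link (continue)
    return title.strip()
```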

it looks like the thingy handles UTF-8 data in the actual URLs (href bit) but not in the title bit (between the <a></a>)

Hmm. It looks correct in the Terminal. I wonder if it’s an issue with NSTask or something else in Quicksilver that needs a specific encoding. B60 shows the same behavior. I did something I haven’t done in a long time: dragged out ol’ B54 (with the Perl parser). It’s even worse:

[screenshot: UTF-8 in B54]

@pjrobertson (Member) commented Jul 22, 2011

I suppose we could drop it altogether, but does it hurt anything to leave these in?

I'm just thinking some sites add 'hidden' links for some things - search engines, forms, etc. Not sure how good it would be to expose them. I'm not too fussed, but I think it should behave as much like a browser as possible.

It looks correct in the Terminal.

OK cool, so it's not this code then? Maybe there's some problem with the names...
Got it: changing lines 70 and 30 of QSHTMLLinkParser.m to NSUTF8StringEncoding fixes the problem :)


@skurfer (Member, Author) commented Jul 22, 2011

I think it should behave as much like a browser as possible.

Agreed and changed.

changing lines 70 and 30 of QSHTMLLinkParser.m to NSUTF8StringEncoding fixes the problem

Oh, so I could save some time in the future by just letting you track these things down. :) Yeah, I found the same thing. I went back to where the encoding variable is set and fixed it there.
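The symptom being fixed here is a classic encoding mismatch: UTF-8 bytes decoded as some other single-byte encoding turn into mojibake in the link titles. A small Python illustration (not Quicksilver code; the Obj-C fix itself is in QSHTMLLinkParser.m):

```python
# Illustration of the bug's symptom: UTF-8 bytes read back with the wrong
# encoding produce mojibake, which is what the titles looked like.
data = 'Zürich'.encode('utf-8')   # the bytes coming over the wire

wrong = data.decode('latin-1')    # mismatched encoding: 'ZÃ¼rich'
right = data.decode('utf-8')      # correct encoding:    'Zürich'
```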

@pjrobertson (Member) commented Jul 22, 2011

Oh, so I could save some time in the future by just letting you track these things down. :)

Meh!

Yeah, I found the same thing. I went back to where the encoding variable is set and fixed it there.

Good stuff. Setting that one will probably stop some of those 'invalid encoding' console logs as well.

Almost 2 months down the line... merging!

pjrobertson added a commit that referenced this issue Jul 22, 2011
New HTML parser (based on Beautiful Soup)
@pjrobertson pjrobertson merged commit 774abb5 into quicksilver:master Jul 22, 2011
@skurfer (Member, Author) commented Jul 22, 2011

Happy that it’s merged, but now I’m going to have to deal with the merge conflict with the paste plain branch. Sneaky.

Well, I’ve done it 4 times now, so I think I can handle it.

Setting that one will probably stop some of those 'invalid encoding' console logs as well.

Hopefully. I briefly scanned the rest of the project for any BS encodings and didn’t see any.
