Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Facebook info extractor (v2) #81

Closed
wants to merge 1 commit into from
Closed

Conversation

imrehg
Copy link
Contributor

@imrehg imrehg commented Feb 20, 2011

Sorry for taking so long to answer. I've been a bit busy this week. The code looks good and no doubt I'll merge it as long as you commit to support the InfoExtractor. It seems to integrate very well with the existing code. I'll make a few comments so you can address possible issues.

Sure, I take responsibility for it. :) As for the issues you mentioned:

It prints "Facebook" with a capital F. Normally we print the name all in lowercase, so you could change that if you are going to make more changes. It's a minor detail.

Fixed.

Second, there is a block of code in the InfoExtractor indented with 4 spaces instead of tabs. You could probably use tabs to be consistent with the rest of the code.

Took a little time to find it but now fixed. Not sure what went wrong, will keep an eye out next time.

In that same loop, line 2394, you use the word "format" as a variable name. "format" is a reserved word in some Python versions and, while you can assign another value to it, it's dangerous because we could add code that uses it later and it wouldn't work.

Sure, fixed, I don't think there's any good reason to cause conflicts, I didn't know it was reserved previously.

Finally, I don't mind importing more modules if they allow for doing something easily, as long as the modules are available in Python 2.5. My worry here is that the function you're using, unicode_escape_decode, is not mentioned in the documentation. Could you investigate this a little bit? If you explain me what it's supposed to do with an example, maybe we can code a safer version of that routine.

I think it should be safe, tested with Python 2.5.5, and it works just the same. I used something like:

import codecs
assert codecs.unicode_escape_decode('\x61\x62\x63') == (u'abc', 3)

The webpage originally downloaded into an 'str' type, with the URL we are interested in encoded in unicode as "https\u00253A\u00252F\u00252F", which should be turned into "https://". Let's say:

>>> s = "https\u00253A\u00252F\u00252F"
>>> s.decode('utf-8')
u'https\\u00253A\\u00252F\\u00252F'
>>> import codecs
>>> codecs.unicode_escape_decode(s)
(u'https%3A%2F%2F', 29)

Here the last line contains quoted format that we can now convert back to the right string to use. This whole unicode thing is pretty confusing, but I'm trying to learn more. Any thoughts?

Let me know if there's anything else I missed. :)
Cheers!

This IE should be full-featured.

Public videos can be downloaded without login, e.g:
https://www.facebook.com/video/video.php?v=696729990595

Private videos need login, and subject to login rate limit of
a couple of tries / minute.
@rg3
Copy link
Collaborator

rg3 commented Feb 20, 2011

In your example, I think

s.decode('unicode_escape')

Should produce the same results. I'll review it again later and, if no problems are found, I'll merge it.

@imrehg
Copy link
Contributor Author

imrehg commented Feb 20, 2011

That seems to be right. Awesome, I remove the module and send another pull request. Sorry to give you so much work. Keep them suggestions coming if there's any other unreasonable code.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants