
Doing a post request #874

Closed
yasoob opened this issue Jun 6, 2013 · 24 comments
@yasoob (Contributor) commented Jun 6, 2013

Hi, I am making an IE for http://vbox7.com/. It is near completion. I wanted to know how to do a POST request in InfoExtractors using _download_webpage or _download_webpage_handle. Can anyone tell me how to do a POST request?

@yasoob (Contributor, Author) commented Jun 6, 2013

Is urllib.urlopen(url, data) the best way to do it, or is there another way?

@yasoob (Contributor, Author) commented Jun 6, 2013

Okay, I have got it:

request = compat_urllib_request.Request(info_url,data)
info_response = self._download_webpage(request, video_id) 
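
For reference, the same pattern sketched in plain Python 3, with urllib.request standing in for youtube-dl's compat_urllib_request (the URL and form fields are the ones used later in this thread): attaching a data payload to a Request is what turns it into a POST.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Build the POST body; 'as3' and 'vid' are the fields used in this thread.
data = urlencode({'as3': '1', 'vid': '03fbd68d4e'}).encode('ascii')

# A Request with a data payload defaults to the POST method.
request = Request('http://vbox7.com/play/magare.do', data)
print(request.get_method())  # prints: POST
```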

Now here's another question. The main URL for one of vbox7's videos is "http://vbox7.com/play:03fbd68d4e", but when we open it, it first redirects us to "http://vbox7.com/show:misscookie?back_to=%2Fplay%3A249bb972c2", then again to "http://vbox7.com/show:missjavascript?back_to=%2Fplay%3A249bb972c2", and finally back to the video page "http://vbox7.com/play:03fbd68d4e". How can I simulate these redirects?

My main IE code is:

class Vbox7IE(InfoExtractor):
    """Information Extractor for hypem"""
    _VALID_URL = r'(?:http://)?(?:www\.)?vbox7\.com/play:([^/]+)'

    def _real_extract(self,url):
        mobj = re.match(self._VALID_URL, url)
        self.to_screen("hi")
        if mobj is None:
            raise ExtractorError(u'Invalid URL: %s' % url)
        video_id = mobj.group(1)
        webpage = self._download_webpage(url, video_id)
        self.report_extraction(video_id)
        title = re.search(r'<title>(.*)</title>',webpage)
        title = (title.group(1)).split('/')[0]
        ext = "flv"
        info_url = "http://vbox7.com/play/magare.do"
        data = urllib.urlencode({'as3':'1','vid':video_id})
        request = compat_urllib_request.Request(info_url,data)
        info_response = self._download_webpage(request, video_id)
        if info_response is None:
            raise ExtractorError(u'Unable to extract the media url')
        final_url = (info_response.split('&')[0]).split('=')[1]
        return [{
            'id':       video_id,
            'url':      final_url,
            'ext':      ext,
            'title':    title,
        }]

Can you guys suggest a solution?

@phihag (Contributor) commented Jun 7, 2013

Our HTTP handler should already follow HTTP redirects by default. But if you look at the URLs, it seems like we may have to jump through hoops to get the cookie, and simulate active JavaScript.
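If one did want the opposite, stopping the automatic redirect-following to inspect each hop by hand, a minimal sketch would be to override HTTPRedirectHandler (plain Python 3 urllib.request here, not youtube-dl's actual opener):

```python
import urllib.error
import urllib.request


class NoRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Refuse to follow 3xx responses; surface them as HTTPError instead."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Raising here exposes the Location target of each intermediate hop
        # instead of silently following it.
        raise urllib.error.HTTPError(
            req.full_url, code, 'redirect disabled: ' + newurl, headers, fp)


opener = urllib.request.build_opener(NoRedirectHandler())
# opener.open(url) would now raise HTTPError on any redirect.
```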

@yasoob (Contributor, Author) commented Jun 7, 2013

If I do this in the Python interpreter, I have to do the following, and it works:

>>> import requests
>>> url = "http://vbox7.com/play:03fbd68d4e"
>>> _VALID_URL = r'(?:http://)?(?:www\.)?vbox7\.com/play:([^/]+)'
>>> import re
>>> mobj = re.match(_VALID_URL, url)
>>> video_id = mobj.group(1)
>>> webpage = requests.get(url) 
>>> title = re.search(r'<title>(.*)</title>',webpage.text)
>>> title = (title.group(1)).split('/')[0]
>>> info_url = "http://vbox7.com/play/magare.do"
>>> import urllib
>>> data = urllib.urlencode({'as3':'1','vid':video_id})
>>> request = urllib.urlopen(info_url,data)
>>> final_url = ((request.read()).split('&')[0]).split('=')[1]
>>> print final_url
'http://media12.vbox7.com/s/03/03fbd68d4e.flv'

I am trying to achieve this result. You can use this code to do a quick check.
One solution that came to my mind is to not follow any redirects; that way we won't have any problem.
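As an aside, the manual split('&')/split('=') chain above is fragile if the server ever reorders fields; parsing the reply as a query string is safer. A sketch against a response modeled on the output printed above (the sample string and the field name 'url' are assumptions, not a captured response):

```python
from urllib.parse import parse_qs

# Hypothetical response body, modeled on the final_url printed above;
# the field name 'url' is an assumption about vbox7's actual reply.
info_response = 'url=http://media12.vbox7.com/s/03/03fbd68d4e.flv&flv=1'

# parse_qs maps each field name to a list of its values.
fields = parse_qs(info_response)
final_url = fields['url'][0]
print(final_url)  # prints: http://media12.vbox7.com/s/03/03fbd68d4e.flv
```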

@FiloSottile (Collaborator) commented Jun 7, 2013

I would pay a lot to be able to use requests (and BeautifulSoup)... Sigh.

My bet is on cookies, let me have a try.

@yasoob (Contributor, Author) commented Jun 7, 2013

Hmm, let's see what you can come up with. BTW, I think we should stay away from BeautifulSoup for as long as possible, because it would greatly slow down youtube-dl.


@yasoob (Contributor, Author) commented Jun 7, 2013

Here is the code without requests, which means it should be compatible with youtube-dl:

import urllib
import re
import sys
import subprocess

def give_info(url):
    url = "http://vbox7.com/play:03fbd68d4e"
    _VALID_URL = r'(?:http://)?(?:www\.)?vbox7\.com/play:([^/]+)'
    mobj = re.match(_VALID_URL, url)
    if mobj is None:
        print "[vbox7]  Error: The url is incorrect."
        sys.exit()
    video_id = mobj.group(1)
    print "[vbox7]  Opening the main webpage."
    webpage = urllib.urlopen(url) 
    title = re.search(r'<title>(.*)</title>',webpage.read())
    if title:
        title = (title.group(1)).split('/')[0]
    else:
        print "[vbox7]  Unable to extract title."
    info_url = "http://vbox7.com/play/magare.do"
    data = urllib.urlencode({'as3':'1','vid':video_id})
    print "[vbox7]  Extracting the absolute url of the video."
    try:
        request = urllib.urlopen(info_url,data)
    except:
        print "[vbox7]  Error: Check your internet connection."
        sys.exit()
    final_url = ((request.read()).split('&')[0]).split('=')[1]
    ext = "flv"
    print [{
            'id':       video_id,
            'url':      final_url,
            'ext':      ext,
            'title':    title,
    }]
    cmd = 'wget -O "%s.flv" "%s"' % (title,final_url) 
    process = subprocess.Popen(cmd, shell=True)
    try:
        process.wait() #Wait for wget to finish
    except KeyboardInterrupt: #If we are interrupted by the user
        print "\n[vbox7]  Download cancelled by the user."

if __name__ == '__main__':
    url = sys.argv[-1]#raw_input("What is the url of the video ?  ")
    give_info(url)

@FiloSottile (Collaborator) commented Jun 7, 2013

Ok, the cookie protection is easily bypassed with request.add_header('cookie', 'checkCookies=yes'). Now I imagine I'll have to reverse some JS...
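
For completeness, that header trick sketched as standalone Python 3, with urllib.request standing in for compat_urllib_request:

```python
from urllib.request import Request

request = Request('http://vbox7.com/play:03fbd68d4e')
# Pre-set the cookie the site checks for, instead of round-tripping
# through the misscookie redirect page.
request.add_header('Cookie', 'checkCookies=yes')
```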

@FiloSottile (Collaborator) commented Jun 7, 2013

(Shouldn't our opener handle cookies transparently?)

@yasoob (Contributor, Author) commented Jun 7, 2013

Hey, why are we thinking about the cookies? Is there no way to bypass the redirection check in youtube-dl and simply pass the URL to the info extractors? I have made a repository with working code to download from vbox7. The code runs on Python 2.6 up to 3.3 without any dependencies. In my code I haven't given any importance to cookies or JS. Sometimes there are small solutions to big problems :) Repository link: https://github.com/yasoob/Vbox7-dl

@jaimeMF (Collaborator) commented Jun 7, 2013

@yasoob What's your exact problem? It downloads the video for me. If it works without needing to care about cookies or JavaScript, go ahead and implement it in youtube-dl.

@yasoob (Contributor, Author) commented Jun 7, 2013

The problem is that when I run the code I mentioned above, it gives me the following output.

root@bt:~/Desktop/youtube-dl# python youtube_dl/__main__.py "http://vbox7.com/play:03fbd68d4e"
[redirect] Following redirect to http://vbox7.com/show:misscookie?back_to=%2Fplay%3A03fbd68d4e
WARNING: Falling back on generic information extractor.
[generic] show:misscookie?back_to=%2Fplay%3A03fbd68d4e: Downloading webpage
[generic] show:misscookie?back_to=%2Fplay%3A03fbd68d4e: Extracting information
ERROR: Invalid URL: http://vbox7.com/show:misscookie?back_to=%2Fplay%3A03fbd68d4e

@jaimeMF (Collaborator) commented Jun 7, 2013

That means either you have added Vbox7IE() in the wrong place in the list, or the _VALID_URL pattern is not matching the URL (I have tested it and it matches).

@yasoob (Contributor, Author) commented Jun 7, 2013

Oh, thanks :p It was a logical problem. I forgot to add Vbox7IE() to the list of IEs. Sorry :(

@yasoob yasoob closed this Jun 7, 2013

@jaimeMF (Collaborator) commented Jun 7, 2013

Yeah, my usual workflow, once I have an overall idea of how to extract the video, is to start by making sure I match the URL and then do the real work. Don't worry ;)

@FiloSottile (Collaborator) commented Jun 7, 2013

Great, don't worry! (Still, I wonder how it can work; the missjavascript redirection is done with window.location.)

@yasoob (Contributor, Author) commented Jun 7, 2013

Hey, another little problem. In the following code:

class Vbox7IE(InfoExtractor):
    """Information Extractor for hypem"""
    _VALID_URL = r'(?:http://)?(?:www\.)?vbox7\.com/play:([^/]+)'

    def _real_extract(self,url):
        mobj = re.match(self._VALID_URL, url)
        if mobj is None:
            raise ExtractorError(u'Invalid URL: %s' % url)
        video_id = mobj.group(1)
        webpage = self._download_webpage(url, video_id)
        title = re.search(r'<title>(.*)</title>',webpage)
        title = (title.group(1)).split('/')[0]
        ext = "flv"
        info_url = "http://vbox7.com/play/magare.do"
        data = urllib.urlencode({'as3':'1','vid':video_id})
        info_response = urllib.urlopen(info_url, data).read()
        if info_response is None:
            raise ExtractorError(u'Unable to extract the media url')
        final_url = (info_response.split('&')[0]).split('=')[1]
        return [{
            'id':       video_id,
            'url':      final_url,
            'ext':      ext,
            'title':    title,
        }]

Check the info_response variable. If I use urllib.urlopen(info_url, data) then everything works like a charm, but if I use compat_urllib_request.urlopen(info_url, data) then I do not receive the correct response and the IE breaks. What's the matter? Basically I am doing a POST request here.

@yasoob yasoob reopened this Jun 8, 2013

@FiloSottile (Collaborator) commented Jun 8, 2013

Umh, the missjavascript thing wasn't fixed; it only took the title of the error page.

Fix:

redirect_page, urlh = self._download_webpage_handle(url, video_id)
redirect_url = urlh.geturl() + re.search(r'window\.location = \'(.*)\';', redirect_page).group(1)
webpage = self._download_webpage(redirect_url, video_id, u'Downloading redirect page')
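
The regex in that fix can be exercised offline against a stub of the redirect page (the HTML below is a guess at the page's shape, based only on the pattern in the fix, not captured markup):

```python
import re

# Stub of vbox7's missjavascript page; the real markup is assumed.
redirect_page = ("<html><script>"
                 "window.location = '/play:03fbd68d4e';"
                 "</script></html>")

# Same pattern as in the fix above: capture the window.location target.
path = re.search(r"window\.location = '(.*)';", redirect_page).group(1)
print(path)  # prints: /play:03fbd68d4e
```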

@FiloSottile (Collaborator) commented Jun 8, 2013

Looking into the urllib/urllib2 difference.

@FiloSottile (Collaborator) commented Jun 8, 2013

Uff, made it!

urllib2 didn't set the Content-Type: application/x-www-form-urlencoded header

class Vbox7IE(InfoExtractor):
    """Information Extractor for hypem"""
    _VALID_URL = r'(?:http://)?(?:www\.)?vbox7\.com/play:([^/]+)'

    def _real_extract(self,url):
        mobj = re.match(self._VALID_URL, url)
        if mobj is None:
            raise ExtractorError(u'Invalid URL: %s' % url)
        video_id = mobj.group(1)

        redirect_page, urlh = self._download_webpage_handle(url, video_id)
        redirect_url = urlh.geturl() + re.search(r'window\.location = \'(.*)\';', redirect_page).group(1)
        webpage = self._download_webpage(redirect_url, video_id, u'Downloading redirect page')

        title = re.search(r'<title>(.*)</title>', webpage)
        title = (title.group(1)).split('/')[0].strip()

        ext = "flv"
        info_url = "http://vbox7.com/play/magare.do"
        data = compat_urllib_parse.urlencode({'as3':'1','vid':video_id})
        info_request = compat_urllib_request.Request(info_url, data)
        info_request.add_header('Content-Type', 'application/x-www-form-urlencoded')
        info_response = self._download_webpage(info_request, video_id, u'Downloading info webpage')
        if info_response is None:
            raise ExtractorError(u'Unable to extract the media url')
        final_url = (info_response.split('&')[0]).split('=')[1]

        return [{
            'id':       video_id,
            'url':      final_url,
            'ext':      ext,
            'title':    title,
        }]
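
The decisive detail in that fix, setting Content-Type explicitly, can be checked in isolation (Python 3's urllib.request used as a stand-in for compat_urllib_request):

```python
from urllib.parse import urlencode
from urllib.request import Request

data = urlencode({'as3': '1', 'vid': '03fbd68d4e'}).encode('ascii')
info_request = Request('http://vbox7.com/play/magare.do', data)
# urllib2 on Python 2 did not add this header on its own, which is what
# broke the vbox7 endpoint; set it explicitly on the request.
info_request.add_header('Content-Type', 'application/x-www-form-urlencoded')
```

(Python 3's urllib.request adds this header itself at send time when a data payload is present and no Content-Type is set, but setting it explicitly is harmless and documents the intent.)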

@yasoob (Contributor, Author) commented Jun 8, 2013

Oh, so finally with a combined effort we managed to make an IE for Vbox7.com. I'll do a pull request with the tests and the IE shortly. After that you can close this issue, as well as #284.

@FiloSottile (Collaborator) commented Jun 8, 2013

Great. Ah, change the docstring: """Information Extractor for hypem"""

@yasoob (Contributor, Author) commented Jun 8, 2013

Ah okay, I'll do it ;)

@yasoob yasoob closed this Jun 8, 2013

@yasoob (Contributor, Author) commented Jun 8, 2013

I have done the pull request: #878
