[pornhub.com] ValueError: unknown url type: /*c783868bc0e01f259bca*/c783868bc0e01f259bca/*fdcbae*/... #12129

Closed
awei78 opened this issue Feb 14, 2017 · 18 comments

@awei78 awei78 commented Feb 14, 2017

Please follow the guide below

  • You will be asked some questions and requested to provide some information, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your issue (like that [x])
  • Use Preview tab to see how your issue will actually look like

Make sure you are using the latest version: run youtube-dl --version and ensure your version is 2017.02.14. If it's not, read this FAQ entry and update. Issues with outdated versions will be rejected.

  • I've verified and I assure that I'm running youtube-dl 2017.02.14

Before submitting an issue make sure you have:

  • At least skimmed through README and most notably FAQ and BUGS sections
  • Searched the bugtracker for similar issues including closed ones

What is the purpose of your issue?

  • Bug report (encountered problems with youtube-dl)
  • Site support request (request for adding support for a new site)
  • Feature request (request for a new functionality)
  • Question
  • Other

[issue]
http://www.pornhub.com/view_video.php?viewkey=ph588edef3b9f03
This video can't be downloaded. The error message is:
ValueError: unknown url type: /c783868bc0e01f259bca/c783868bc0e01f259bca/fdcbae/d5d92256fcc46f27d/edf2a757f7d5215004/fdcbae/a533b33fb773c0cee02/e19d715e98acc2f05022739abbe/a2f07b472cc9ea5506c14e6b/a2f07b472cc9ea5506c14e6b/d9efb8fed99c/d9efb8fed99c/b5160c30aac2a/c76952c6c6e3a5ec743d8ae3d4d2d/e19d715e98acc2f05022739abbe/fa1551ae8084a6b68a11f/e45adc12/bd62/be2f6ebe2a35af6378/e45adc12/b5160c30aac2a/be2f6ebe2a35af6378/cc51db0f9206969b8af8f6/da8bd/a533b33fb773c0cee02/a533b33fb773c0cee02/ed724b928c777b6b/ade/d9efb8fed99c/b5160c30aac2a/ade/c73100/be2f6ebe2a35af6378/ed724b928c777b6b/c76952c6c6e3a5ec743d8ae3d4d2d/ce0229a65a8cf91ff7731720/bd62/edf2a757f7d5215004/bd62/cd1742706bd8e18279b42/fdcbae/cc51db0f9206969b8af8f6
Reason: the website's code changed.

[Resolution]
Revised file: youtube_dl\extractor\pornhub.py
Replace the old code (commented out below) with:

        # from line 159, version: 2017.02.14
        # video_variables = {}
        # for video_variablename, quote, video_variable in re.findall(
        #         r'(player_quality_[0-9]{3,4}p\w+)\s*=\s*(["\'])(.+?)\2;', webpage):
        #     video_variables[video_variablename] = video_variable
        #
        # video_urls = []
        # for encoded_video_url in re.findall(
        #         r'player_quality_[0-9]{3,4}p\s*=(.+?);', webpage):
        #     for varname, varval in video_variables.items():
        #         encoded_video_url = encoded_video_url.replace(varname, varval)
        #     video_urls.append(re.sub(r'[\s+]', '', encoded_video_url))
        # Isolate the script text between the player_mp4_seek line and "flashvars"
        pattern = r'var\s*player_mp4_seek\s*=\s*.*((\s|.)*?)flashvars'
        new_webpage = re.findall(pattern, webpage)[0][0]
        video_variables = {}
        for video_variablename, quote, video_variable in re.findall(
                r'(\w+?)=(["\'])(.+?)\2;', new_webpage):
            video_variables[video_variablename] = video_variable.replace('" + "', '')

        video_urls = []
        for encoded_video_url in re.findall(
                r'player_quality_[0-9]{3,4}p\s*=(.+?);', new_webpage):
            pattern = r'\/\*(\s|.)*?\*\/'
            encoded_video_url = re.sub(pattern, '', encoded_video_url).strip()
            for varname, varval in video_variables.items():
                encoded_video_url = encoded_video_url.replace(varname, varval)
                encoded_video_url = encoded_video_url.replace(' + ', '')
            video_urls.append(re.sub(r'[\s+]', '', encoded_video_url))

Problem solved.
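
The same idea can be sketched without chained str.replace calls, which can misfire when one variable name happens to be a substring of another name or of a value that was already substituted. This is a minimal, self-contained sketch and not the extractor's actual code: `resolve_expression` and the sample names `a1`, `b2`, `zz` are hypothetical, and it assumes the variables have already been decoded into a dict.

```python
import re

# Non-greedy block-comment matcher. re.DOTALL lets "." span newlines,
# which is what the (\s|.)*? construct in the patch emulates.
COMMENT_RE = re.compile(r'/\*.*?\*/', re.DOTALL)


def resolve_expression(expr, variables):
    """Resolve an obfuscated `a + /* junk */ b` expression to a string.

    `variables` maps JS variable names to their decoded string values.
    Raises KeyError if the expression names an unknown variable.
    """
    without_comments = COMMENT_RE.sub('', expr)
    names = [part.strip() for part in without_comments.split('+')]
    return ''.join(variables[name] for name in names if name)


# Hypothetical sample data, shaped like the page's obfuscated script.
variables = {'a1': 'http://cdn.example.com/', 'b2': 'video.mp4', 'zz': 'junk'}
url = resolve_expression('/* + zz + */a1 + /* + zz + */b2', variables)
print(url)
```

Looking up whole identifiers keeps the result independent of the order in which the variables are visited.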

Contributor

@ThomasChr ThomasChr commented Feb 14, 2017

@awei78 That would be a quick fix. But @dstftw made it clear that he didn't want it.
Instead he wants to use the JSInterpreter to interpret the JavaScript code.
I have no idea if he is working on it at the moment...

Author

@awei78 awei78 commented Feb 14, 2017

@ThomasChr Yeah!
Using the JSInterpreter, as @dstftw wants, would be a good solution, and we look forward to using it in the next version.
Thanks!

Contributor

@ThomasChr ThomasChr commented Feb 14, 2017

Even this simple calculation:

   jsi = JSInterpreter('')
   mycode = '1+1+2+3'
   myvars = []
   ret = jsi.interpret_expression(mycode, myvars, True)

Leads to:

ERROR: Recursion limit reached;

I think the JSInterpreter Class is not quite there yet...

UPDATE: Okay, my fault. You need to set allow_recursion to a number, not to 'True' or 'False'.
BUT: It does have its problems with comments in the JS code:

ERROR: Unsupported JS expression u'*';
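
On the allow_recursion point: one way to see why `True` trips the limit is to assume that the argument is an integer budget which is decremented on every nested call and checked for dropping below zero. That is a guess about JSInterpreter's internals, not something confirmed in this thread. Python treats `True` as 1, so such a budget would run out after two levels. The toy evaluator below is hypothetical and only models that assumption; it is not JSInterpreter code.

```python
class RecursionBudgetExceeded(Exception):
    """Raised by the toy evaluator when its depth budget is used up."""


def evaluate_sum(expr, budget):
    """Evaluate an expression like '1+1+2+3' with a depth budget."""
    if budget < 0:
        raise RecursionBudgetExceeded('Recursion limit reached')
    left, plus, right = expr.partition('+')
    if not plus:
        # No operator left: the expression is a single integer literal.
        return int(expr)
    # Each nested evaluation spends one unit of the budget.
    return evaluate_sum(left, budget - 1) + evaluate_sum(right, budget - 1)


print(evaluate_sum('1+1+2+3', 100))

try:
    evaluate_sum('1+1+2+3', True)  # True behaves like the integer 1
except RecursionBudgetExceeded as error:
    print('ERROR: %s' % error)
```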

Author

@awei78 awei78 commented Feb 14, 2017

@ThomasChr, @dstftw:
JSInterpreter would be the best solution. We tried it yesterday, but it failed too, so I modified the regular expressions instead, as in the code I posted above.
The reasons for the failure are similar to yours: besides the comments, the backslashes also cause failures.
So I think the JS code must be preprocessed before using JSInterpreter: remove the comments and some of the backslashes, then let JSInterpreter parse it. I think that would be a good idea.
We look forward to it!

Contributor

@ThomasChr ThomasChr commented Feb 14, 2017

It has a problem with comments, and it has a problem with string concatenation.
I'm not quite sure it's a good idea to write the code for JSInterpreter ourselves rather than use some ready-made code.

This one works, but it's kind of stupid: the only thing the JSInterpreter is capable of here is replacing the variables with their values:

    video_urls = []
    jsi = JSInterpreter('')
    for encoded_video_url in re.findall(
            r'player_quality_[0-9]{3,4}p\s*=(.+?);', new_webpage):
        # JSInterpreter does not like JS comments at the moment
        print('Step 1: ' + str(encoded_video_url))
        encoded_video_url = re.sub(r'\/\*(\s|.)*?\*\/', '', encoded_video_url).strip()
        print('Step 2: ' + str(encoded_video_url))
        encoded_video_url = jsi.interpret_expression(encoded_video_url, video_variables, 100)
        print('Step 3: ' + str(encoded_video_url))
        # JSInterpreter does not do string concatenation
        encoded_video_url = encoded_video_url.replace('+', '')
        encoded_video_url = encoded_video_url.replace(' ', '')
        encoded_video_url = encoded_video_url.replace('"', '')
        print('Step 4: ' + str(encoded_video_url))
        video_urls.append(encoded_video_url)

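The code performs three actions. The printed "Step 1" is the raw input, so each action's result is printed under the next step number:
Step 1: Remove comments
Step 2: Interpret JS code
Step 3: Do 'manual' string concatenation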

The output is:

Step 1: /* + f6a66ad70ec2463911b4cdf3c2d434d8 + */de63399d92ab8a4867f59d86d03a4b + /* + c33c8e321bdc8bef05e36a6 + */d44fad6c352320e9cd + /* + f2b233ec499430c9e8f88607a6 + */e36479b3782487c82d8 + /* + c37f + */c37f + /* + c34390e1004 + */adff86868 + /* + c95744290770e3e + */f2b233ec499430c9e8f88607a6 + /* + e9f136c86a69bd1ba02b0df7069546c + */e34319750f948ee1c77ec + /* + c34390e1004 + */bcb30f28276eae5ffad031b5abede19 + /* + eb87243e520655ca5 + */cd5bd941aeed07cc9d4ef9fb0f38f4 + /* + bcb30f28276eae5ffad031b5abede19 + */f6a66ad70ec2463911b4cdf3c2d434d8 + /* + b3598b4bf7dee + */a469c07e0 + /* + c33c8e321bdc8bef05e36a6 + */b3598b4bf7dee + /* + de63399d92ab8a4867f59d86d03a4b + */c34390e1004 + /* + a469c07e0 + */c0b05da574633516d332404923be161f + /* + c8a939cb0ef7f2 + */b6bad3f6489bd + /* + bcb30f28276eae5ffad031b5abede19 + */e9f136c86a69bd1ba02b0df7069546c + /* + c34390e1004 + */c95744290770e3e + /* + c37f + */bb2391f00c32e + /* + cd5bd941aeed07cc9d4ef9fb0f38f4 + */eb87243e520655ca5 + /* + adff86868 + */c8a939cb0ef7f2 + /* + de63399d92ab8a4867f59d86d03a4b + */c33c8e321bdc8bef05e36a6

Step 2: de63399d92ab8a4867f59d86d03a4b + d44fad6c352320e9cd + e36479b3782487c82d8 + c37f + adff86868 + f2b233ec499430c9e8f88607a6 + e34319750f948ee1c77ec + bcb30f28276eae5ffad031b5abede19 + cd5bd941aeed07cc9d4ef9fb0f38f4 + f6a66ad70ec2463911b4cdf3c2d434d8 + a469c07e0 + b3598b4bf7dee + c34390e1004 + c0b05da574633516d332404923be161f + b6bad3f6489bd + e9f136c86a69bd1ba02b0df7069546c + c95744290770e3e + bb2391f00c32e + eb87243e520655ca5 + c8a939cb0ef7f2 + c33c8e321bdc8bef05e36a6

Step 3: http://cdn2b.v" + "ideo.porn" + "hub.phncdn.com/videos/201701/30/104168" + "582/4" + "80P" + "_600K_104168582.mp4?ipa=18" + "8.68.35" + ".13" + "8&rs=1" + "09&" + "ri=1400&s=" + "14870" + "615" + "93&e=1487068793&h=7" + "62d6987e787971c28b84c2a8660e9ac

Step 4: http://cdn2b.video.pornhub.phncdn.com/videos/201701/30/104168582/480P_600K_104168582.mp4?ipa=188.68.35.138&rs=109&ri=1400&s=1487061593&e=1487068793&h=762d6987e787971c28b84c2a8660e9ac
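
Removing every '+', space and '"' works for the URL above, but it would also eat a '+' or a space that legitimately belongs to a URL. A narrower alternative is to fold only the `" + "` joins between adjacent string literals. The sketch below is hypothetical (`fold_string_concat` is not part of youtube-dl) and assumes the expression consists solely of string literals joined by '+', as in the Step 3 output.

```python
import re

# A closing quote, a plus sign, and an opening quote of the same kind.
_CONCAT_RE = re.compile(r'(["\'])\s*\+\s*\1')


def fold_string_concat(expr):
    """Join adjacent JS string literals: "ab" + "cd" -> abcd."""
    joined = _CONCAT_RE.sub('', expr.strip())
    # Drop one pair of surrounding quotes, if present.
    if len(joined) >= 2 and joined[0] == joined[-1] and joined[0] in '"\'':
        joined = joined[1:-1]
    return joined


# Hypothetical input: the query string contains a literal '+'.
folded = fold_string_concat('"http://cdn.example/vid" + "eo.mp4?tag=a+b"')
print(folded)
```

The '+' inside `tag=a+b` survives because it is not surrounded by quotes.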


@dhwz dhwz commented Feb 14, 2017

Try js2py which is working perfectly, I'm already using it with our own PH parsing code.

Contributor

@ThomasChr ThomasChr commented Feb 14, 2017

js2py seems perfect. I gave it a quick try and had no problems. I'm not sure whether there is a better way to pass the variables, but it's working:

    pattern = r'var\s*player_mp4_seek\s*=\s*.*((\s|.)*?)flashvars'
    new_webpage = re.findall(pattern, webpage)[0][0]
    # Build a JS prelude that re-declares every extracted variable
    video_variables = ''
    for video_variablename, quote, video_variable in re.findall(
            r'(\w+?)=(["\'])(.+?)\2;', new_webpage):
        video_variables = video_variables + 'var ' + video_variablename + '="' + video_variable + '"; '

    print('x: ' + video_variables)
    video_urls = []
    for encoded_video_url in re.findall(
            r'player_quality_[0-9]{3,4}p\s*=(.+?);', new_webpage):
        # Unlike with JSInterpreter, the raw expression is passed on, comments included
        print('Step 1: ' + str(encoded_video_url))
        encoded_video_url = js2py.eval_js(video_variables + encoded_video_url)
        print('Step 2: ' + str(encoded_video_url))
        video_urls.append(encoded_video_url)
Collaborator

@yan12125 yan12125 commented Feb 14, 2017

You're free to use js2py in your forks, but I won't accept patches with js2py as it uses exec() [1], which is a big security hole.

[1] https://github.com/PiotrDabkowski/Js2Py/blob/627a6b9/js2py/evaljs.py#L175

Contributor

@ThomasChr ThomasChr commented Feb 14, 2017

@yan12125 Don't panic. I'm not trying to get any of my code accepted. I just want to give you some suggestions.

Yeah, exec() is bad - we all know that.


@dhwz dhwz commented Feb 14, 2017

Me either. ;)

Collaborator

@yan12125 yan12125 commented Feb 14, 2017

Just a reminder :-)

Collaborator

@dstftw dstftw commented Feb 14, 2017

I did not state that the solution must use JSInterpreter. I said that I personally don't see much point in ad hoc solutions, since they will most likely change the obfuscation shortly.

dstftw added a commit that referenced this issue Feb 14, 2017
Contributor

@sudovijay sudovijay commented Feb 14, 2017

Hey, sorry if I'm late to the conversation. Why use JSInterpreter for this? You can easily do it in Python too! I implemented a solution in PHP. You just need to extract the one with "*/" in front; all the rest are null values. Let me try doing that in Python...

Contributor

@sudovijay sudovijay commented Feb 14, 2017

Nah, just tried; it's kind of complicated for me! Not worth my time, as I'm a bit busy with another project too. I don't do much Python anyway! Here's my code:

// get_between() and startsWith() are helper functions that are not shown here.
// Just get the data between these two markers
$video_data = get_between($page_data, 'var player_mp4_seek', '</script>');

if (!preg_match_all('/var\s*?player_quality_([0-9]{3,4}p)=([^;]+?);/', $video_data, $match))
    return;

$j = 0;

foreach ($match[1] as $k) {

    $code = $match[2][$j];

    // Only the variable right after each '*/' carries a real value; the rest are null
    $expCode = explode('*/', $code);

    $path = '';

    foreach ($expCode as $o) {

        if (startsWith($o, '/*'))
            continue;

        $break = explode(' ', $o);

        $var = $break[0];

        if (!preg_match('/var\s'.$var.'=("[^;]+?");/', $video_data, $mt))
            continue;

        // JavaScript string concatenation
        $plus = explode('" + "', $mt[1]);
        $path .= trim(implode('', $plus), '"');

    }

    $links[] = $path; // you can actually store the links now

    $j++;
}
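
For comparison, a rough Python port of the PHP approach above might look like the following. It is a sketch under assumptions, not tested against the real site: `extract_links` and the sample script are hypothetical, and unlike the PHP version it picks the variable name with a `\w+` match instead of splitting on a space, so it also copes with expressions that have no spaces around '+'.

```python
import re

QUALITY_RE = re.compile(r'var\s*player_quality_([0-9]{3,4}p)\s*=\s*([^;]+);')


def extract_links(script):
    """Return {quality: url} using the split-on-'*/' trick from the PHP code."""
    links = {}
    for quality, code in QUALITY_RE.findall(script):
        path = ''
        for chunk in code.split('*/'):
            chunk = chunk.strip()
            # A chunk that opens with a comment holds no real variable.
            if chunk.startswith('/*'):
                continue
            # The real variable name is the first identifier in the chunk.
            name_match = re.match(r'\w+', chunk)
            if not name_match:
                continue
            assignment = re.search(
                r'var\s+%s\s*=\s*("[^;]+?");' % re.escape(name_match.group(0)),
                script)
            if not assignment:
                continue
            # Undo the JavaScript string concatenation: "a" + "b" -> ab
            path += assignment.group(1).replace('" + "', '').strip('"')
        links[quality] = path
    return links


# Hypothetical sample, shaped like the obfuscated page script.
sample_script = '''
var a1="http://cdn.exam" + "ple.com/";
var b2="video" + ".mp4";
var zz="junk";
var player_quality_480p=/* + zz + */a1 + /* + zz + */b2;
'''
links = extract_links(sample_script)
print(links)
```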
Author

@awei78 awei78 commented Feb 15, 2017

Today, 2017-02-15, pornhub.com changed again... it seems to have changed back!
We get the correct video_url, but the download fails:
ERROR: unable to download video data: <urlopen error unknown url type: "http>

Reason: the double quotes left in video_urls break the protocol parsing.
Fix: replace them with an empty string.
encoded_video_url = encoded_video_url.replace('"', '')

Contributor

@sudovijay sudovijay commented Feb 15, 2017

Yep, confirmed! They just reverted to the same old technique!
Use this:

player_quality_[0-9]{3,4}p\s*=\s*(["\'])(.+?)\1;
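
As an illustration of how this pattern behaves: the first group captures the opening quote, the `\1` backreference requires the same quote to close the value, and the second group therefore holds the URL without any quotes, so no separate quote stripping is needed. The sample text below is made up for the example.

```python
import re

QUALITY_URL_RE = re.compile(r'player_quality_[0-9]{3,4}p\s*=\s*(["\'])(.+?)\1;')

# Hypothetical page fragment using both quote styles.
sample = (
    'var player_quality_720p = "http://cdn.example.com/720.mp4";\n'
    "var player_quality_480p = 'http://cdn.example.com/480.mp4';\n"
)

# group(1) is the quote character, group(2) is the bare URL.
video_urls = [match.group(2) for match in QUALITY_URL_RE.finditer(sample)]
print(video_urls)
```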


@dhwz dhwz commented Feb 15, 2017

This won't be fun, I think they are tracking your GIT repo :)

Author

@awei78 awei78 commented Feb 15, 2017

@dhwz: I think so too! :p

@dstftw dstftw removed the broken-IE label Feb 16, 2017
@dstftw dstftw closed this Feb 17, 2017