[Clarification-needed] Brightcove code for Kijk.nl. #6243

Closed
Reino17 opened this issue Jul 15, 2015 · 6 comments

Reino17 commented Jul 15, 2015

Although I'm creating an "Issue" here, what I'm actually looking for is an explanation of some specific Python code.

I'm creating a good old batch script to download videos / extract video URLs from npo.nl, rtlxl.nl and kijk.nl. For the first two it's already working, but I'm having a hard time understanding the code for kijk.nl:
https://github.com/rg3/youtube-dl/blob/master/youtube_dl/extractor/generic.py#L216
https://github.com/rg3/youtube-dl/blob/master/youtube_dl/extractor/generic.py#L1152
https://github.com/rg3/youtube-dl/blob/master/youtube_dl/extractor/brightcove.py#L194

Let's consider http://www.kijk.nl/sbs6/wegmisbruikers/videos/ouhfqhGaSYBE/aflevering-304 as input.
First of all, what does 'url': smuggle_url(bc_url, {'Referer': url}) (#L1158 of generic.py), and smuggle_url in particular do?

With Xidel I can use XQuery in my batchscript and with its help I managed to extract the url from <meta property="og:video" content="{.}" /> (#L197 of brightcove.py: url_m = re.search(...,webpage)):

http://c.brightcove.com/services/viewer/federated_f9?isVid=1&isUI=1&publisherID=20318290001&playerID=2234112204001&autoStart=false&domain=embed&videoId=4350788533001&branding=sbs&playertitle=true&linkBaseURL=http://www.kijk.nl/sbs6/wegmisbruikers/videos/ouhfqhGaSYBE/aflevering-304?sbs_device=pc

But since I have very little experience with Python, I just don't understand what happens then.
It obviously checks whether the url contains "playerKey", or "videoId", but before that, what does url = unescapeHTML(url_m.group(1)) actually do?

And then matches = re.findall(...,webpage)... I guess it's searching for the xml string <object class="BrightcoveExperience">{params}</object>, but this string doesn't appear at all in the html-code of http://www.kijk.nl/sbs6/wegmisbruikers/videos/ouhfqhGaSYBE/aflevering-304.

If anyone can explain the url-extraction to me, I'd be very grateful, because as you can see, I need some clarification. :)

jaimeMF (Collaborator) commented Jul 15, 2015

> what does url = unescapeHTML(url_m.group(1)) actually do?

Since the URL comes from an HTML file, it may contain escape sequences like &amp;amp; (an escaped &amp;), so you need to decode them before using the URL.
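
To illustrate the point about escape sequences, here is a minimal sketch that uses the standard library's `html.unescape` as a stand-in for youtube-dl's own `unescapeHTML` helper. The input string is an assumption: a shortened version of the kijk.nl URL, written the way it could appear inside an HTML attribute.

```python
import html

# Assumed raw attribute value: "&" is written as "&amp;" inside HTML.
raw = ('http://c.brightcove.com/services/viewer/federated_f9'
       '?isVid=1&amp;isUI=1&amp;videoId=4350788533001')

# Decode the HTML entities so the query string is usable as a real URL.
url = html.unescape(raw)
print(url)
```

Without this step the query parameters would be named `amp;isUI` and `amp;videoId` instead of `isUI` and `videoId`.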

> And then matches = re.findall(...,webpage)... I guess it's searching for the xml string <object class="BrightcoveExperience">{params}</object>, but this string doesn't appear at all in the html-code of http://www.kijk.nl/sbs6/wegmisbruikers/videos/ouhfqhGaSYBE/aflevering-304.

It's only searched for if the URL in og:video can't be used. I don't know whether that's the case for that webpage.
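
A rough sketch of that fallback order, not the actual code in brightcove.py. The function name and both regular expressions are made up for illustration; only the checks for "playerKey" and "videoId" and the `BrightcoveExperience` object come from this thread.

```python
import html
import re

# Hypothetical, simplified patterns (the real extractor's regexes differ).
OG_VIDEO_RE = r'<meta\s+property="og:video"\s+content="([^"]+)"'
OBJECT_RE = r'(?s)<object[^>]+class="BrightcoveExperience"[^>]*>.*?</object>'


def find_brightcove_candidates(webpage):
    """Return the og:video URL if usable, else any <object> embeds found."""
    m = re.search(OG_VIDEO_RE, webpage)
    if m:
        url = html.unescape(m.group(1))
        # The og:video URL is only useful if it identifies a player or video.
        if 'playerKey' in url or 'videoId' in url:
            return [url]
    # Fallback: look for classic <object class="BrightcoveExperience"> embeds.
    return re.findall(OBJECT_RE, webpage)
```

For a page like the kijk.nl one, whose og:video URL contains `videoId`, the first branch returns and the `<object>` search never runs, which would explain why that markup need not be present in the page.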

jaimeMF (Collaborator) commented Jul 15, 2015

> Let's consider http://www.kijk.nl/sbs6/wegmisbruikers/videos/ouhfqhGaSYBE/aflevering-304 as input.
> First of all, what does 'url': smuggle_url(bc_url, {'Referer': url}) (#L1158 of generic.py), and smuggle_url in particular do?

See its definition: https://github.com/rg3/youtube-dl/blob/master/youtube_dl/utils.py#L1129-L1143. It just embeds extra info in a URL so that the extractor receiving the URL can use it.
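
One way to picture "embedding extra info in a URL" is the simplified sketch below. It is not copied from utils.py: the function names `smuggle` and `unsmuggle`, the marker `__smuggled`, and the choice of carrying JSON in the URL fragment are assumptions made for illustration.

```python
import json
from urllib.parse import parse_qs, urlencode

SMUGGLE_KEY = '__smuggled'  # hypothetical marker name


def smuggle(url, data):
    """Append `data` (JSON-serialisable) to `url` as a fragment."""
    return url + '#' + urlencode({SMUGGLE_KEY: json.dumps(data)})


def unsmuggle(smuggled_url, default=None):
    """Split a smuggled URL back into (original_url, data)."""
    if '#' + SMUGGLE_KEY not in smuggled_url:
        return smuggled_url, default
    url, _, fragment = smuggled_url.rpartition('#')
    data = json.loads(parse_qs(fragment)[SMUGGLE_KEY][0])
    return url, data


bc_url = 'http://c.brightcove.com/services/viewer/federated_f9?isVid=1&videoId=4350788533001'
page_url = 'http://www.kijk.nl/sbs6/wegmisbruikers/videos/ouhfqhGaSYBE/aflevering-304'

smuggled = smuggle(bc_url, {'Referer': page_url})
original, extra = unsmuggle(smuggled)
print(original)
print(extra)
```

In this picture, the generic extractor attaches the page URL as a `Referer` hint, and the Brightcove extractor unpacks it again before making its own requests.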

Reino17 (Author) commented Jul 15, 2015

I guess I can skip to def _real_extract(self, url): then.
But then,... man, this is on a completely different level than npo.nl and rtlxl.nl. I quit. I just don't understand this.
Thanks for your help.

Reino17 closed this Jul 15, 2015

Reino17 (Author) commented Jul 17, 2015

A good friend saw my posts here and pointed me to another project's source code. That code was more straightforward, and in the end I could recover the 'manifest-playlist' URL with just one Xidel command line:

xidel.exe -q "[kijk.nl-url]" ^
          -f ^"//meta[@name='video_src']/@content ^
             ! replace(.,'federated_f9','htmlFederated') ^
             ! replace(.,'videoId','@videoPlayer')^" ^
          -e ^"json(concat('{', extract(.,'experienceJSON = {(.*?)};',1), '}')) ^
             //mediaDTO/(renditions)()[size='0']/defaultURL^"

I guess the Brightcove code has become so complex because of all the added extractors and development in general. But if I can recover the m3u8 with just one command line, it makes me wonder whether the Brightcove code is too complex.
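
For readers who don't use Xidel, the steps of the command above can be spelled out as a Python sketch. The function names are hypothetical, and the JSON layout (a `mediaDTO` object holding a `renditions` list whose entries have `size` and `defaultURL`) is inferred only from the query in the command, not from any Brightcove documentation. Fetching the pages is left out.

```python
import json
import re


def to_html_player_url(video_src):
    """Rewrite the video_src URL the same way the two replace() calls do."""
    return (video_src
            .replace('federated_f9', 'htmlFederated')
            .replace('videoId', '@videoPlayer'))


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def manifest_urls(player_page):
    """Extract defaultURL of every size-0 rendition from the player page."""
    m = re.search(r'experienceJSON = \{(.*?)\};', player_page, re.S)
    if not m:
        return []
    experience = json.loads('{' + m.group(1) + '}')
    urls = []
    for media in find_key(experience, 'mediaDTO'):
        if not isinstance(media, dict):
            continue
        for rendition in media.get('renditions', []):
            # The Xidel query compares against the string '0'.
            if str(rendition.get('size')) == '0':
                urls.append(rendition['defaultURL'])
    return urls
```

Like the one-liner, this only handles the one embed style it was written against.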

dstftw (Collaborator) commented Jul 17, 2015

With this code you've handled just a subset of the possible Brightcove embeds. Even if this one happens to be used on kijk.nl, that doesn't mean different embeds aren't used somewhere else.

Reino17 (Author) commented Jul 18, 2015

Alright. Understood. Thank you, dstftw.
