Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[hketv] Add new extractor #18696

Closed
wants to merge 3 commits into from
Closed

Conversation

anthonyfok
Copy link
Contributor

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

The Educational Bureau of the Government of Hong Kong has been producing HKETV (香港教育電視), i.e. Educational Television (ETV, eTV), for students from kindergarten to secondary schools in Hong Kong for over 40 years. Their website is at https://etv.edb.gov.hk/ .

This new extractor pulls information from the website through https://apps.hkedcity.net/media/play/handler.php which has recently (mid-2018?) become geo-restricted, i.e. viewable in Hong Kong only. The final URL to the MPEG-4 video looks something like https://video1.hkedcity.net/streaming/79ea4e9feec5e8291a7ffc72632ce19e/5c28ab09/channels/etv/201810/20181024230027_355992_720.mp4 which is not geo-restricted, but due to URL (Token) signing, the URL expires after an hour or so.

Note that the official ETV Android app and iOS app are not currently geo-restricted, though they have not been studied in depth for this extractor.

Please do not hesitate to let me know if you have any questions.

Cheers,
Anthony

youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
youtube_dl/extractor/hketv.py Show resolved Hide resolved
youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
webpage = self._download_webpage(url, video_id)

file_id = self._html_search_regex(r'post_var\["file_id"\] = ([0-9]+);', webpage, 'file ID')
curr_url = self._html_search_regex(r'post_var\["curr_url"\] = "(.+?)";', webpage, 'curr URL')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relax regexes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the regex to r'post_var\["file_id"\]\s*=\s*(.+?);' for example.
Hope that is relax enough.

youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
elif label == 'SD':
h = 360
w = h * width // height
urlh = self._downloader.urlopen(urljoin(_APPS_BASE_URL, fmt.get('file')))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

urljoin replaced with +.

Or did you mean the use of self._downloader.urlopen()? That was intentionally done so that the download link of the video could be downloaded without geo-restriction, i.e. without proxy, and without the PHPSESSID cookie. (The geo-restriction is done within their PHP script only.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, may I keep using self._downloader.urlopen() for the MPEG-4 video file? :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, what for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that no cookie is needed with the link given by -g. The link given by urljoin(_APPS_BASE_URL, fmt.get('file')) looks something like this:

https://apps.hkedcity.net/media/mediaplayer/filename.php?t=85ee74dae3477d95b05cc3091f26ff5a&m=mp4&p=fc61b1acf92b52238ad5c616409759360b4d914f183691d79c03e27a335a1c7e99a0f5d12b7183530daad9edc79fb17d97c3b6c0d3c445d68b96c92fa88b3a2f&file=551cc229795bb70003e9962bda09899612a7ae10b183be5e84e136419d91c684c63c863a8e3237a753ed31921500c2d28aa4db43a8684e37caac000571948110&e=.mp4

which requires the correct cookies to work. It gives a "HTTP/1.1 302 Found" redirecting to a final URL that looks like https://video1.hkedcity.net/streaming/b3848d0a3296b85d67d88b749642c29a/5c3738d1/channels/etv/201507/20150716151039_989507_720.mp4 which requires no cookie. This final URL is easier for me to copy-and-paste to other programs or even to stream to other devices.

That said, I have decided to take your advice and not "fully resolve" to the final URL because the intermediate URL, with the appropriate cookies saved with --cookies, remains valid for many hours, whereas the fully resolved URL would be "410 Gone" within one or two hours. Also, I realized self._downloader.urlopen was being run twice in the for fmt in fmts: loop, leading to quite a few seconds of delay. (The HKETV server seems rather slow in responding to this request.)

youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
@anthonyfok
Copy link
Contributor Author

Sorry, this recent push is incorrect and incomplete; I will push again soon.

youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
}),
headers={'Content-Type': 'application/x-www-form-urlencoded'},
fatal=False)
if emotion.get('result'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emotion is not guaranteed to be a dict.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Line removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Line removed.

As in

    like_count = int_or_none(try_get(emotion, lambda x: x['data']['emotion_data'][0]['count']))

alone will work just fine without the extraneous if emotion.get('result'): check.

youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
h = 720
elif label == 'SD':
h = 360
w = h * width // height
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaks on uninitialized h. Breaks on None width and height.

Copy link
Contributor Author

@anthonyfok anthonyfok Jan 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Indeed, some older videos were uploaded without width and height information, as in they are set to "0". Changed to the following:

            for fmt in fmts:
                label = fmt.get('label')
                w = None
                h = None
                if label == 'HD':
                    h = 720
                elif label == 'SD':
                    h = 360
                if h:
                    if width and height:
                        w = h * width // height
                    else:
                        w = h * 4 // 3

elif label == 'SD':
h = 360
w = h * width // height
urlh = self._downloader.urlopen(_APPS_BASE_URL + fmt.get('file'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breaks on missing file key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Sorry for my oversight. I have restructured that code a little bit to check that fmt.get('file') is not None before continuing with appending to formats.

elif label == 'SD':
h = 360
w = h * width // height
urlh = self._downloader.urlopen(urljoin(_APPS_BASE_URL, fmt.get('file')))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, what for?

youtube_dl/extractor/hketv.py Outdated Show resolved Hide resolved
@anthonyfok
Copy link
Contributor Author

Thank you for your second round of thorough review.

I apologize for taking so long to complete the revision as I had been busy, and also it took me a while to think through the various issues you pointed out in your review.

I think I have finally fixed all the issues that you pointed out, and additionally added an additional test case of an 2007 HKETV video which has no subtitle, has width and height set as "0", and used to fail with my earlier code due to insufficient error checking. Thank you indeed for correctly pointing out these issues in your review!

@dstftw dstftw closed this in 73c19aa Jan 20, 2019
@dstftw
Copy link
Collaborator

dstftw commented Jan 20, 2019

Cherry-picked, thanks.

@anthonyfok
Copy link
Contributor Author

Thank you for your review, improvement and cherry-pick!
Much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants