Can't download US Senate committee hearings #13399

Open
hub2git opened this issue Jun 16, 2017 · 14 comments

Comments

@hub2git
Contributor

@hub2git hub2git commented Jun 16, 2017

  • I've verified and I assure that I'm running youtube-dl 2017.06.12
  • At least skimmed through README and most notably FAQ and BUGS sections
  • Searched the bugtracker for similar issues including closed ones

What is the purpose of your issue?

  • Site support request (request for adding support for a new site)


 youtube-dl  https://www.budget.senate.gov/hearings/watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B -v
[debug] System config: [u'-o', u'~/Videos/youtube-dl/%(title)s_%(id)s.%(ext)s', u'--netrc', u'--restrict-filenames', u'--write-description', u'--write-sub', u'--yes-playlist', u'--ignore-errors']
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'https://www.budget.senate.gov/hearings/watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B', u'-v']
[debug] Encodings: locale UTF-8, fs UTF-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.06.12
[debug] Python version 2.7.12 - Linux-4.4.0-71-generic-x86_64-with-LinuxMint-18.1-serena
[debug] exe versions: ffmpeg 2.8.11-0ubuntu0.16.04.1, ffprobe 2.8.11-0ubuntu0.16.04.1
[debug] Proxy map: {}
[generic] watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B: Requesting header
WARNING: Falling back on generic information extractor.
[generic] watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B: Downloading webpage
[generic] watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B: Extracting information
[SenateISVP] budget060717: Downloading webpage
WARNING: There's no description to write.
[debug] Invoking downloader on u'http://ussenate-f.akamaihd.net/budget060717.mp4?v=3.1.0&fp=&r=&g='
ERROR: unable to download video data: HTTP Error 404: Not Found
Traceback (most recent call last):
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 1826, in process_info
    success = dl(filename, info_dict)
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 1768, in dl
    return fd.download(name, info)
  File "/usr/local/bin/youtube-dl/youtube_dl/downloader/common.py", line 360, in download
    return self.real_download(filename, info_dict)
  File "/usr/local/bin/youtube-dl/youtube_dl/downloader/http.py", line 61, in real_download
    data = self.ydl.urlopen(request)
  File "/usr/local/bin/youtube-dl/youtube_dl/YoutubeDL.py", line 2129, in urlopen
    return self._opener.open(req, timeout=self._socket_timeout)
  File "/usr/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found



If the purpose of this issue is a site support request please provide all kinds of example URLs support for which should be included (replace following example URLs by yours):


Description of your issue, suggested solution and other information

senate.gov has many "subcategories", better known as committees. The list of committees is at https://www.senate.gov/committees/committees_home.htm.
For example,
veterans.senate.gov
indian.senate.gov

Each committee's hearings are available on /hearings, for example, veterans.senate.gov/hearings.
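
To make the URL shape concrete, here is a minimal Python 3 sketch that builds a hearings listing URL from a committee subdomain. The helper name hearings_url is hypothetical. The https scheme and the www prefix are assumptions taken from the budget.senate.gov URL in the log above, and individual committees may not follow this layout.

```python
def hearings_url(committee):
    """Build the hearings listing URL for a committee subdomain such as 'veterans'.

    Assumption: every committee site answers on https://www.<name>.senate.gov
    and exposes its hearings under /hearings.
    """
    name = committee.strip().lower()
    return "https://www.%s.senate.gov/hearings" % name


if __name__ == "__main__":
    for committee in ("veterans", "indian"):
        print(hearings_url(committee))
```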

@johnhawkinson
Contributor

@johnhawkinson johnhawkinson commented Jun 18, 2017

A bit tangentially, when I first ran this I got a redirect that didn't seem to make much sense to me:

pb3:Downloads jhawk$ youtube-dl -v https://www.budget.senate.gov/hearings/watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B 
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'https://www.budget.senate.gov/hearings/watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.06.12
[debug] Python version 2.7.10 - Darwin-14.5.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg git-2017-02-28-7f62368, ffprobe git-2017-02-28-7f62368, rtmpdump 2.4
[debug] Proxy map: {}
[generic] watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B: Requesting header
[redirect] Following redirect to https://www.senate.gov/pagelayout/general/one_item_and_teasers/waf.htm
[generic] waf: Requesting header
WARNING: Falling back on generic information extractor.
[generic] waf: Downloading webpage
[generic] waf: Extracting information
...
UnsupportedError: Unsupported URL: https://www.senate.gov/pagelayout/general/one_item_and_teasers/waf.htm

but then when I reran with --print-traffic I saw the same 404 error downloading an mp4.

Oh wait, here we go:

[generic] watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B: Requesting header
send: u'HEAD /hearings/watch?hearingid=E9CA49E5-5056-A066-6056-62BB2BE6BA6B HTTP/1.1\r\nAccept-Language: en-us,en;q=0.5\r\nAccept-Encoding: gzip, deflate\r\nConnection: close\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20150101 Firefox/47.0 (Chrome)\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\nHost: www.budget.senate.gov\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
header: Server: Apache
header: Location: https://www.senate.gov/pagelayout/general/one_item_and_teasers/waf.htm
header: Content-Type: text/html; charset=iso-8859-1
header: Content-Length: 0
header: Date: Sun, 18 Jun 2017 01:42:53 GMT
header: Connection: close

bizarre!

Anyhow, as a workaround: it seems that the SenateISVP extractor takes a special path when type=arch is specified, which is the case for this URL:

https://www.senate.gov/isvp/?comm=budget&type=arch&stt=&filename=budget060717&auto_play=false&wmode=transparent&poster=https%3A%2F%2Fwww%2Ebudget%2Esenate%2Egov%2Fthemes%2Fbudget%2Fimages%2Fvideo%2Dposter%2Dflash%2Dfit%2Epng

But changing type=arch to something else (e.g. type=xarch) seems to work here:

pb3:extractor jhawk$ youtube-dl -v 'https://www.senate.gov/isvp/?comm=budget&type=xarch&stt=&filename=budget060717&auto_play=false&wmode=transparent&poster=https%3A%2F%2Fwww%2Ebudget%2Esenate%2Egov%2Fthemes%2Fbudget%2Fimages%2Fvideo%2Dposter%2Dflash%2Dfit%2Epng' 
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: [u'-v', u'https://www.senate.gov/isvp/?comm=budget&type=xarch&stt=&filename=budget060717&auto_play=false&wmode=transparent&poster=https%3A%2F%2Fwww%2Ebudget%2Esenate%2Egov%2Fthemes%2Fbudget%2Fimages%2Fvideo%2Dposter%2Dflash%2Dfit%2Epng']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2017.06.12
[debug] Python version 2.7.10 - Darwin-14.5.0-x86_64-i386-64bit
[debug] exe versions: ffmpeg git-2017-02-28-7f62368, ffprobe git-2017-02-28-7f62368, rtmpdump 2.4
[debug] Proxy map: {}
[SenateISVP] budget060717: Downloading webpage
[SenateISVP] budget060717: Downloading f4m manifest
[SenateISVP] budget060717: Downloading m3u8 information
[debug] Invoking downloader on u'http://budget-f.akamaihd.net/i/budget060717_1@76447/index_348_av-p.m3u8?sd=10&rebase=on'
[download] Destination: Integrated Senate Video Player-budget060717.mp4
[debug] ffmpeg command line: ffmpeg -y -loglevel verbose -headers 'Accept-Language: en-us,en;q=0.5

I don't know the semantic meaning of this type parameter, or whether the extractor should just stop special-casing the arch type, or maybe try to validate it and fall back to the hdcore/M3U8 type automatically.
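
For anyone who wants to script the workaround, here is a minimal Python 3 sketch that rewrites only the type parameter of an ISVP URL. The helper name with_isvp_type is hypothetical and not part of youtube-dl; it uses only the standard library. Re-encoding the query normalises percent-escapes (for example, %2E comes back as a plain dot), so the result should be equivalent but is not byte-identical to the input.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_isvp_type(url, new_type):
    """Return `url` with the value of its `type` query parameter replaced.

    If the URL has no `type` parameter, nothing is added.
    """
    parts = urlsplit(url)
    # keep_blank_values preserves empty parameters such as `stt=`.
    params = parse_qsl(parts.query, keep_blank_values=True)
    params = [(key, new_type if key == "type" else value) for key, value in params]
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    original = (
        "https://www.senate.gov/isvp/?comm=budget&type=arch&stt="
        "&filename=budget060717&auto_play=false&wmode=transparent"
    )
    print(with_isvp_type(original, "xarch"))
```

The rewritten URL can then be passed to youtube-dl in single quotes, as in the command above.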

@hub2git
Contributor Author

@hub2git hub2git commented Jun 18, 2017

@remitamine, I noticed you added the geo-restricted label. What led you to conclude that senate.gov is geo-restricted?

@hub2git
Contributor Author

@hub2git hub2git commented Jun 18, 2017

@johnhawkinson thank you for looking into this.

@hub2git hub2git changed the title from "youtube-dl can't download US Senate committee hearings" to "download US Senate committee hearings" Jun 18, 2017
@hub2git hub2git changed the title from "download US Senate committee hearings" to "Can't download US Senate committee hearings" Jun 18, 2017
@remitamine
Collaborator

@remitamine remitamine commented Jun 18, 2017

I'm getting this:

Access Denied

You don't have permission to access "http://serve-403-cfpremium.www.senate.gov/" on this server.

Adding the geo-restricted label is an indication that developers wanting to work on this might need to work around geo-restriction.

@hub2git
Contributor Author

@hub2git hub2git commented Jun 18, 2017

@remitamine
Collaborator

@remitamine remitamine commented Jun 19, 2017

How is it when you try to access the video normally (through a browser)?

The message that I posted is what I get in the browser when trying to access any of the US Senate URLs posted in this issue.

@schmod

@schmod schmod commented Jul 17, 2017

arch stands for "archived" (as opposed to a live webcast), and is largely a holdover from the days when Senate committees had a single permalink for all live webcasts, and lacked DVR functionality.

Committee hearings hosted on senate.gov generally fall into two categories:

  • Live webcast, or the "DVR recording" of a live webcast
  • MP4 upload (anything that's been edited, and many older videos)

Videos in the second category are hosted on a different CDN from the first. The type=arch or type=live query parameter provides a hint about which one to try first (if the video fails to load, the player automatically falls back to the alternative).

Because the parameter is only a hint, youtube-dl should probably not treat it as a special case.

"Live" content is served via Adobe HDS (and does not work on non-Flash devices), while "Archived" content appears to be available as a normal HTTP download.
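
The hint-then-fall-back behaviour described above could be sketched roughly as follows in Python 3. This is an illustration under stated assumptions, not the player's or youtube-dl's actual logic: the labels "http" and "hds", the function names, and the probes mapping are all hypothetical.

```python
def source_order(type_hint):
    """Order in which to try the two kinds of source, given the URL's `type` hint."""
    # Illustrative labels: "http" stands for the plain MP4 download,
    # "hds" for the Adobe HDS stream used by live/DVR content.
    if type_hint == "arch":
        return ["http", "hds"]
    return ["hds", "http"]


def first_available(type_hint, probes):
    """Return (kind, result) from the first probe that succeeds.

    `probes` maps "http" and "hds" to zero-argument callables that return a
    result, or raise an exception when that source is unavailable.
    """
    errors = []
    for kind in source_order(type_hint):
        try:
            return kind, probes[kind]()
        except Exception as exc:  # a real extractor would catch narrower errors
            errors.append((kind, exc))
    raise RuntimeError("no source available: %r" % errors)
```

With this shape, a wrong hint only costs one failed request instead of aborting the download, which is the failure seen in the original report.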


FWIW, all Senate committees are effectively independent, and manage their own websites. The only thing that is shared is the senate.gov/isvp player. You may need a different extractor for each committee.

I do not believe that the Senate employs geo-blocking (but, given the current leadership, who the heck knows).

@GeitHub

@GeitHub GeitHub commented Jul 29, 2018

I'd like to add that it's not only the committee hearings, but the actual Senate feed itself that cannot be downloaded:

https://www.senate.gov/floor/

Example URL:
https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=2956

Footage from the House (http://houselive.gov/) also can't be downloaded via youtube-dl, but it's less of an issue because they provide a download link right in the video player. The Senate does not.

@galgeek
Contributor

@galgeek galgeek commented Jul 22, 2019

still an issue...

$ youtube-dl -v 'https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125'
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2019.07.16
[debug] Git HEAD: afe22563a
[debug] Python version 3.7.2 (CPython) - Darwin-16.7.0-x86_64-i386-64bit
[debug] exe versions: none
[debug] Proxy map: {}
[generic] MediaPlayer: Requesting header
WARNING: Falling back on generic information extractor.
[generic] MediaPlayer: Downloading webpage
[generic] MediaPlayer: Extracting information
ERROR: Unsupported URL: https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125
Traceback (most recent call last):
  File "/Users/bara/Dev/ytd/youtube-dl/youtube_dl/YoutubeDL.py", line 796, in extract_info
    ie_result = ie.extract(url)
  File "/Users/bara/Dev/ytd/youtube-dl/youtube_dl/extractor/common.py", line 530, in extract
    ie_result = self._real_extract(url)
  File "/Users/bara/Dev/ytd/youtube-dl/youtube_dl/extractor/generic.py", line 3333, in _real_extract
    raise UnsupportedError(url)
youtube_dl.utils.UnsupportedError: Unsupported URL: https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125
@raleeper
Contributor

@raleeper raleeper commented Jul 22, 2019

@galgeek
Contributor

@galgeek galgeek commented Jul 30, 2019

Thanks, @raleeper! That url works for me, as well.

@galgeek
Contributor

@galgeek galgeek commented Aug 15, 2019

The rendered page for URLs like
https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125
contains iframe src
//floor.senate.gov/videos/3385/player?autoplay=1
which then contains
<source type="application/x-mpegurl" src="//archive-stream.granicus.com/OnDemand/_definst_/mp4:senate/senate_222b240e-4dd2-44c9-8218-ab9a67c50778.mp4/playlist.m3u8">
from which we can create the URL for the mp4
http://archive-media.granicus.com:443/OnDemand/senate/senate_ff605d76-86c3-4e8d-9991-9f32efd782de.mp4
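
A minimal Python 3 sketch of that last step, mapping the playlist URL to the direct MP4 URL. The regex and the helper name granicus_mp4_url are hypothetical and inferred only from the shape of the two URLs above (which carry different IDs, so the mapping itself is an assumption); other Granicus sites may differ.

```python
import re

# Pattern inferred from the single <source> example above.
_M3U8_RE = re.compile(
    r"^(?:https?:)?//archive-stream\.granicus\.com/OnDemand/_definst_/"
    r"mp4:(?P<path>[^?#]+\.mp4)/playlist\.m3u8$"
)


def granicus_mp4_url(m3u8_url):
    """Map a Granicus playlist URL to a direct MP4 URL, or None if it does not match."""
    match = _M3U8_RE.match(m3u8_url)
    if match is None:
        return None
    # Host and port follow the archive-media URL quoted above.
    return "http://archive-media.granicus.com:443/OnDemand/" + match.group("path")


if __name__ == "__main__":
    src = (
        "//archive-stream.granicus.com/OnDemand/_definst_/"
        "mp4:senate/senate_222b240e-4dd2-44c9-8218-ab9a67c50778.mp4/playlist.m3u8"
    )
    print(granicus_mp4_url(src))
```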

@galgeek galgeek mentioned this issue Aug 22, 2019
5 of 9 tasks complete
@galgeek
Contributor

@galgeek galgeek commented Sep 23, 2019

I've updated PR #22181, making the URL regexes more general. This should make it easier to copy and paste the code to create additional extractors for other organizations using granicus.com.

Maybe someone who's more familiar with youtube-dl code could make a still more general granicus mp4 extractor?

@galgeek
Contributor

@galgeek galgeek commented Oct 22, 2019

The current version of youtube-dl remains unable to download the video at urls like https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125

$ youtube-dl -v 'https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125'
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-v', 'https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125']
[debug] Encodings: locale UTF-8, fs utf-8, out UTF-8, pref UTF-8
[debug] youtube-dl version 2019.10.22
[debug] Git HEAD: 820215f0e
[debug] Python version 3.7.2 (CPython) - Darwin-16.7.0-x86_64-i386-64bit
[debug] exe versions: none
[debug] Proxy map: {}
[generic] MediaPlayer: Requesting header
WARNING: Falling back on generic information extractor.
[generic] MediaPlayer: Downloading webpage
[generic] MediaPlayer: Extracting information
ERROR: Unsupported URL: https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125
Traceback (most recent call last):
  File "/Users/bara/Dev/ytd/youtube-dl/youtube_dl/YoutubeDL.py", line 796, in extract_info
    ie_result = ie.extract(url)
  File "/Users/bara/Dev/ytd/youtube-dl/youtube_dl/extractor/common.py", line 530, in extract
    ie_result = self._real_extract(url)
  File "/Users/bara/Dev/ytd/youtube-dl/youtube_dl/extractor/generic.py", line 3353, in _real_extract
    raise UnsupportedError(url)
youtube_dl.utils.UnsupportedError: Unsupported URL: https://floor.senate.gov/MediaPlayer.php?view_id=2&clip_id=3125
@LameLemon LameLemon mentioned this issue Oct 28, 2019
6 of 9 tasks complete