NBCNEWS falls back to generic extractor #6922

Closed
RingoTheDog opened this issue Sep 22, 2015 · 1 comment

Comments


@RingoTheDog RingoTheDog commented Sep 22, 2015

URL: http://www.nbcnews.com/business/autos/volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456

C:>youtube-dl -v "http://www.nbcnews.com/business/autos/volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456"
[debug] System config: []
[debug] User config: []
[debug] Command-line args: [u'-v', u'http://www.nbcnews.com/business/autos/volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456']
[debug] Encodings: locale cp1252, fs mbcs, out cp850, pref cp1252
[debug] youtube-dl version 2015.09.09
[debug] Python version 2.7.8 - Windows-7-6.1.7601-SP1
[debug] exe versions: ffmpeg N-71346-gdf4fca2
[debug] Proxy map: {}
[generic] volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456: Requesting header
WARNING: Falling back on generic information extractor.
[generic] volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456: Downloading webpage
[generic] volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456: Extracting information
ERROR: Unsupported URL: http://www.nbcnews.com/business/autos/volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456
Traceback (most recent call last):
  File "youtube_dl\extractor\generic.pyo", line 1222, in _real_extract
  File "youtube_dl\utils.pyo", line 1656, in parse_xml
  File "xml\etree\ElementTree.pyo", line 1300, in XML
  File "xml\etree\ElementTree.pyo", line 1642, in feed
  File "xml\etree\ElementTree.pyo", line 1506, in _raiseerror
ParseError: syntax error: line 1, column 0
Traceback (most recent call last):
  File "youtube_dl\YoutubeDL.pyo", line 660, in extract_info
  File "youtube_dl\extractor\common.pyo", line 287, in extract
  File "youtube_dl\extractor\generic.pyo", line 1820, in _real_extract
UnsupportedError: Unsupported URL: http://www.nbcnews.com/business/autos/volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456

Thanks
Ringo


@ChanderG ChanderG commented Sep 23, 2015

Going through the NBCNewsIE class, the regex for a valid URL is:

(?x)https?://(?:www\.)?nbcnews\.com/(?:video/.+?/(?P<id>\d+)|(?:watch|feature|nightly-news)/[^/]+/(?P<title>.+))

If I am right, it only accepts pages from the video, watch, feature and nightly-news sections, so a URL under business never reaches NBCNewsIE and falls through to the generic extractor.
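
A quick way to check this outside youtube-dl is to run the pattern through Python's `re` module directly. This is only a sketch: the pattern is copied from above, the first URL is the one from this issue, and the nightly-news URL is a made-up example used purely to show a path shape the pattern accepts.

```python
import re

# Pattern copied verbatim from the NBCNewsIE class as quoted above.
NBCNEWS_URL_RE = (
    r'(?x)https?://(?:www\.)?nbcnews\.com/'
    r'(?:video/.+?/(?P<id>\d+)|(?:watch|feature|nightly-news)/[^/]+/(?P<title>.+))'
)

# URL reported in this issue.
issue_url = (
    'http://www.nbcnews.com/business/autos/'
    'volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456'
)

# Hypothetical URL, not a real page; it only illustrates an accepted section.
sample_url = 'http://www.nbcnews.com/nightly-news/video/some-story-title'


def matches(url):
    """Return True if the NBCNewsIE pattern accepts the URL."""
    return re.match(NBCNEWS_URL_RE, url) is not None


print(matches(issue_url))   # business section is not in the alternation
print(matches(sample_url))  # nightly-news section is
```

The first call reports no match, which is consistent with the "Falling back on generic information extractor" warning in the log.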

On testing:
nightly-news -> working
watch -> not sure whether NBC News still has a watch section
feature -> fails with a different error, "unable to extract bootstrap json"
video -> the URL format appears to have changed; the id is no longer separated by a /

For this issue, adding business to the matcher makes the program pick the correct extractor, but the same "unable to extract bootstrap json" error then comes up, so the URL regex alone does not fix it.
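
To illustrate the matcher change, here is a minimal sketch that assumes "adding business" means adding it to the section alternation next to watch, feature and nightly-news. `PATCHED_URL_RE` is a name invented for this example, not something from the youtube-dl source, and the sketch only covers URL routing, not the bootstrap JSON extraction that still fails.

```python
import re

# Assumption: business is added to the existing section alternation.
# PATCHED_URL_RE is a hypothetical name used only in this sketch.
PATCHED_URL_RE = (
    r'(?x)https?://(?:www\.)?nbcnews\.com/'
    r'(?:video/.+?/(?P<id>\d+)'
    r'|(?:watch|feature|nightly-news|business)/[^/]+/(?P<title>.+))'
)

# URL reported in this issue.
issue_url = (
    'http://www.nbcnews.com/business/autos/'
    'volkswagen-11-million-vehicles-could-have-suspect-software-emissions-scandal-n431456'
)

m = re.match(PATCHED_URL_RE, issue_url)

# The URL is now claimed by the pattern; the slug lands in the title group
# and the id group stays empty because the video/ branch did not match.
print(m.group('title'))
print(m.group('id'))
```

With this pattern the URL is routed to the NBC News extractor instead of the generic one, which is the point where the bootstrap JSON error takes over.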
