[91porn] fix title & comment extraction #5932

Merged 21 commits on Feb 12, 2023


Contributor

@pmitchell86 commented Jan 3, 2023

IMPORTANT: PRs without the template will be CLOSED

Description of your pull request and other information.

Fix extraction of Title and Comments fields for the Porn91 info extractor.

I noticed that ytdl-org/youtube-dl#29876 attempts to do the same, but it's not working and appears abandoned.

Fixes #3256

Template

In order to be accepted and merged into yt-dlp each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

yt_dlp/extractor/porn91.py: 5 outdated review threads (resolved)
@pukkandan added the site-bug (Issue with a specific website) and pending-fixes (PR has had changes requested) labels on Jan 3, 2023
@pukkandan removed the pending-fixes (PR has had changes requested) label on Jan 3, 2023
Member

@pukkandan left a comment

Can you give me an example for m3u8?

yt_dlp/extractor/porn91.py: 4 outdated review threads (resolved)
@pukkandan added the pending-fixes (PR has had changes requested) label on Jan 6, 2023
@pukkandan removed the pending-fixes (PR has had changes requested) label on Jan 16, 2023
@pmitchell86
Contributor Author

@pukkandan As for the m3u8 example: I don't have one. The latest version of the PR uses determine_ext(), which seems to return the correct info in my testing. The files always download as mp4.

'info_dict': {
'id': '726186267387ffe1e5e6',
'title': '见过卖老婆的,那你见过卖亲闺女的吗?',
'description': '疫情当下,如何约炮?\n--19kn.cc--\n拥有全国线下学生、少妇、反差婊、兼职良家。\n并且免费!!!\n只需要一个电话,一个定位,就能送炮上门。可提前查看照片\n(妹子自带48小时核酸报告)\n约炮,我们是认真的!\n并且拥有三大优势!\n\n1、各种求包养母狗,学生妹资源。为你解决各种需要。--19kn.cc--\n\n2、所有女性会员经过实名视频验证,平台严选,杜绝各种骗红包,口嗨者。--19kn.cc--\n\n3、5年大平台,91许多约炮案例,包括知名博主女伴,均是我们撮合成功的,保障会员隐私,并且约炮3次可自行联系平台进行信息发布。--19kn.cc--\n\n平台5周年庆活动,特回馈91狼友\n\n1、所有女性会员,如果参假,举报客服,核实成功奖励10000人民币。\n\n2、约炮成功并且反馈客服,赠送91vip自拍达人号\n\n3、情侣入驻,可享受专属奖励(奖金5000元)\n\n年关将近,平台大放血,只为各位狼友能找到固定性伴侣,度过美好新年!\n\n约炮渠道请登录--19kn.cc--\n\nPS:招网络客服,对接客户,安排妹子(要求耐心,熟悉客服流程优先,有电脑优先)工作时间:12小时制,\n\n招男模,女模(要求形象气质佳,需提供体检报告)\n有意可以联系官方招聘邮箱[email\xa0protected]',
Member

use md5
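
For context, a minimal sketch of what an md5 expectation amounts to. It assumes, as I understand yt-dlp's test helper, that an expected string starting with `md5:` is compared against the MD5 hex digest of the extracted value rather than against the full text. `md5_marker` and `field_matches` are hypothetical names used only for illustration, not yt-dlp functions.

```python
import hashlib


def md5_marker(text):
    """Return the 'md5:<hexdigest>' form of a long field value."""
    return 'md5:' + hashlib.md5(text.encode('utf-8')).hexdigest()


def field_matches(expected, got):
    """Compare an expected test value with an extracted one.

    Assumption: values starting with 'md5:' are compared by digest,
    everything else is compared literally.
    """
    if expected.startswith('md5:'):
        return expected == md5_marker(got)
    return expected == got


# A long description can then be stored in the test as a short digest.
print(md5_marker('hello'))
```

The digest form keeps long, site-supplied descriptions out of the test definitions while still detecting any change in the extracted text.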

'info_dict': {
'id': '7e42283b4f5ab36da134',
'title': '18岁大一漂亮学妹,水嫩性感,再爽一次!',
'description': '想看我拍新的系列都请帮我加精跟5星好评哦!希望大家鼎力支持,谢过了。我再重申,这次是朋友介绍安排的漂亮学生,费用不低,不过胜在年轻听话,水嫩性感,很超值的女生(6分05有91验证)。PS:本人强壮耐久,事业型男,愿意结交江浙沪的漂亮学妹,加Q:2889560495,语音验证性别,欢迎女生约我,或者靠谱男来一起泡美眉。',
Member

same

@@ -29,32 +51,42 @@ def _real_extract(self, url):
webpage = self._download_webpage(
'http://91porn.com/view_video.php?viewkey=%s' % video_id, video_id)

if '作为游客,你每天只可观看10个视频' in webpage:
raise ExtractorError('91 Porn says: Daily limit 10 videos exceeded', expected=True)
if '作为游客,你每天只可观看15个视频' in webpage:
Member

Why not use regex to extract out the number?
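
One way this could look, as a standalone sketch using only `re`. It assumes the guest-limit notice keeps the same wording and that only the number changes; `LIMIT_RE` and `daily_limit_message` are hypothetical names, and the real extractor would presumably use `self._search_regex` and raise `ExtractorError` instead of returning a string.

```python
import re

# Assumption: only the number in the notice varies. Both the ASCII and the
# full-width comma are accepted, since the page may use either.
LIMIT_RE = re.compile(r'作为游客[,,]你每天只可观看(\d+)个视频')


def daily_limit_message(webpage):
    """Return an error message if the guest daily limit notice is present."""
    match = LIMIT_RE.search(webpage)
    if match is None:
        return None
    return '91 Porn says: Daily limit %s videos exceeded' % match.group(1)
```

With the number captured, the check no longer needs updating each time the site changes the limit (10 before, 15 now).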

r'<textarea[^>]+id=["\']fm-video_link[^>]+>([^<]+)</textarea>',
webpage, 'video link')
videopage = self._download_webpage(video_link_url, video_id)
r'document\.write\(\s*strencode2\s*\(\s*((?:"[^"]+")|(?:\'[^\']+\'))\s*\)\s*\)', webpage, 'video link')
Member

Suggested change
r'document\.write\(\s*strencode2\s*\(\s*((?:"[^"]+")|(?:\'[^\']+\'))\s*\)\s*\)', webpage, 'video link')
r'document\.write\(\s*strencode2\s*\(\s*((?:"[^"]+")|(?:\'[^\']+\'))', webpage, 'video link')
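
A small sketch of why the trailing `\s*\)\s*\)` can go: the capture group is identical with or without it, and the shorter pattern tolerates changes after the quoted argument. The sample strings are hypothetical, not taken from the site, and `first_group` is an illustrative helper.

```python
import re

LONG_RE = r'document\.write\(\s*strencode2\s*\(\s*((?:"[^"]+")|(?:\'[^\']+\'))\s*\)\s*\)'
SHORT_RE = r'document\.write\(\s*strencode2\s*\(\s*((?:"[^"]+")|(?:\'[^\']+\'))'


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical page fragments.
sample = 'document.write(strencode2("%3c%73%6f%75"));'
variant = "document.write(strencode2('%3c') + '');"

# Both patterns capture the same quoted argument on the plain call.
print(first_group(LONG_RE, sample), first_group(SHORT_RE, sample))
# Only the shorter pattern still matches when the tail of the call differs.
print(first_group(LONG_RE, variant), first_group(SHORT_RE, variant))
```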

'title': title,
'upload_date': upload_date,
Member

Suggested change
'upload_date': upload_date,
'upload_date': unified_strdate(self._search_regex(
r'<span\s+class=["\']title-yakov["\']>(\d{4}-\d{2}-\d{2})</span>',
webpage, 'upload_date', fatal=False)),

etc


duration = parse_duration(self._search_regex(
r'时长:\s*</span>\s*(\d+:\d+)', webpage, 'duration', fatal=False))
r'时长:\s*<span[^>]*>\s*(\d+:\d+:\d+)\s*</span>', webpage, 'duration', fatal=False))
Member

Suggested change
r'时长:\s*<span[^>]*>\s*(\d+:\d+:\d+)\s*</span>', webpage, 'duration', fatal=False))
r'时长:\s*<span[^>]*>\s*(\d+(?::\d+){1,2})', webpage, 'duration', fatal=False))
  • {1,2} to support old format too
  • Is </span> needed?
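
A standalone sketch of the suggested pattern, showing that `{1,2}` accepts both `mm:ss` and `hh:mm:ss`. The HTML fragments are hypothetical, and the seconds conversion is a simplified stand-in for `parse_duration`, not its actual implementation.

```python
import re

DURATION_RE = r'时长:\s*<span[^>]*>\s*(\d+(?::\d+){1,2})'


def extract_duration_seconds(html):
    """Return the duration in seconds, or None if the label is not found."""
    match = re.search(DURATION_RE, html)
    if match is None:
        return None
    seconds = 0
    # Fold 'hh:mm:ss' or 'mm:ss' into seconds, most significant part first.
    for part in match.group(1).split(':'):
        seconds = seconds * 60 + int(part)
    return seconds
```

Because the group stops after the digits, the pattern matches whether or not a closing `</span>` follows immediately.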

upload_date = unified_strdate(upload_date)

description = self._html_search_regex(
r'<span\s+class=["\']more title["\']>\s*(.*(?!</span>))\s*</span>', webpage, 'description', fatal=False)
Member

The regex is wrong. You have to group the . with (?!) and * the whole thing to do what you want. But it's better to just do:

Suggested change
r'<span\s+class=["\']more title["\']>\s*(.*(?!</span>))\s*</span>', webpage, 'description', fatal=False)
r'<span\s+class=["\']more title["\']>\s*([^<]+)', webpage, 'description', fatal=False)
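
To illustrate the point about grouping, a sketch comparing three variants on a hypothetical fragment: the original pattern, a version where the lookahead is grouped with `.` and the whole group is repeated, and the simpler `[^<]+`. The names `ORIGINAL`, `TEMPERED`, `SIMPLE` and `capture` are illustrative only.

```python
import re

PREFIX = r'<span\s+class=["\']more title["\']>\s*'

# Original: the lookahead runs once, after `.*` has finished matching.
ORIGINAL = PREFIX + r'(.*(?!</span>))\s*</span>'
# Grouped: the lookahead is checked before every single character.
TEMPERED = PREFIX + r'((?:(?!</span>).)*)</span>'
# Simplest: take everything up to the next tag.
SIMPLE = PREFIX + r'([^<]+)'


def capture(pattern, html):
    """Return capture group 1 of the first match, or None."""
    match = re.search(pattern, html)
    return match.group(1) if match else None


# Hypothetical fragment with no whitespace before the closing tag.
html = '<span class="more title">some text</span><span>other</span>'

for name, pattern in (('original', ORIGINAL), ('tempered', TEMPERED), ('simple', SIMPLE)):
    print(name, repr(capture(pattern, html)))
```

On this fragment the original pattern does not match at all: the single lookahead forbids `</span>` at the position where the pattern then requires it, unless whitespace sits in between. Note also that `.` does not cross newlines without `re.DOTALL`, whereas `[^<]+` does, which matters for multi-line descriptions.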

'id': video_id,
'url': video_link_url,
'ext': determine_ext(video_link_url),
Member

Suggested change
'ext': determine_ext(video_link_url),

Unnecessary

'id': '726186267387ffe1e5e6',
'title': '见过卖老婆的,那你见过卖亲闺女的吗?',
'description': '疫情当下,如何约炮?\n--19kn.cc--\n拥有全国线下学生、少妇、反差婊、兼职良家。\n并且免费!!!\n只需要一个电话,一个定位,就能送炮上门。可提前查看照片\n(妹子自带48小时核酸报告)\n约炮,我们是认真的!\n并且拥有三大优势!\n\n1、各种求包养母狗,学生妹资源。为你解决各种需要。--19kn.cc--\n\n2、所有女性会员经过实名视频验证,平台严选,杜绝各种骗红包,口嗨者。--19kn.cc--\n\n3、5年大平台,91许多约炮案例,包括知名博主女伴,均是我们撮合成功的,保障会员隐私,并且约炮3次可自行联系平台进行信息发布。--19kn.cc--\n\n平台5周年庆活动,特回馈91狼友\n\n1、所有女性会员,如果参假,举报客服,核实成功奖励10000人民币。\n\n2、约炮成功并且反馈客服,赠送91vip自拍达人号\n\n3、情侣入驻,可享受专属奖励(奖金5000元)\n\n年关将近,平台大放血,只为各位狼友能找到固定性伴侣,度过美好新年!\n\n约炮渠道请登录--19kn.cc--\n\nPS:招网络客服,对接客户,安排妹子(要求耐心,熟悉客服流程优先,有电脑优先)工作时间:12小时制,\n\n招男模,女模(要求形象气质佳,需提供体检报告)\n有意可以联系官方招聘邮箱[email\xa0protected]',
'ext': 'm3u8',
Member

Wrong. Either the test definition is wrong, or the code needs to use _extract_m3u8_formats_and_subtitles.

ExtractorError,
)


class Porn91IE(InfoExtractor):
IE_NAME = '91porn'
_VALID_URL = r'(?:https?://)(?:www\.|)91porn\.com/.+?\?viewkey=(?P<id>[\w\d]+)'
_VALID_URL = r'(?:https?://)(?:www\.|)91porn\.com/.*([\?&])viewkey=(?P<id>[\w\d]+)'
Member

Suggested change
_VALID_URL = r'(?:https?://)(?:www\.|)91porn\.com/.*([\?&])viewkey=(?P<id>[\w\d]+)'
_VALID_URL = r'(?:https?://)(?:www\.|)91porn\.com/view_video.php\?([^#]+&)?viewkey=(?P<id>\w+)'
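
A quick sketch comparing the original `_VALID_URL` with the suggested one. The first URL is the one quoted from the tests later in this thread; the `page=2` variant is a hypothetical example of `viewkey` not being the first query parameter, and `video_id` is an illustrative helper.

```python
import re

OLD = r'(?:https?://)(?:www\.|)91porn\.com/.+?\?viewkey=(?P<id>[\w\d]+)'
SUGGESTED = r'(?:https?://)(?:www\.|)91porn\.com/view_video.php\?([^#]+&)?viewkey=(?P<id>\w+)'


def video_id(pattern, url):
    """Return the `id` group if the URL matches from its start, else None."""
    match = re.match(pattern, url)
    return match.group('id') if match else None


plain = 'https://91porn.com/view_video.php?viewkey=7ef0cf3d362c699ab91c'
# Hypothetical: an extra parameter before viewkey.
with_extra = 'https://91porn.com/view_video.php?page=2&viewkey=7ef0cf3d362c699ab91c'

print(video_id(OLD, plain), video_id(SUGGESTED, plain))
print(video_id(OLD, with_extra), video_id(SUGGESTED, with_extra))
```

The suggested pattern pins the path to `view_video.php`, allows other parameters before `viewkey`, and replaces `[\w\d]` with `\w`, which already includes digits.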

@pukkandan added the pending-fixes (PR has had changes requested) label on Jan 28, 2023
@pukkandan
Member

Explain 6a7a551. You shouldn't just pass master m3u8 without processing

@pukkandan added the pending-fixes (PR has had changes requested) label on Feb 3, 2023
@pmitchell86
Contributor Author

> Explain 6a7a551. You shouldn't just pass master m3u8 without processing

In 6a7a551, all the URLs I used as examples downloaded as *.m3u8 files and were not playable. Renaming the file to .mp4 let VLC play it.

@pukkandan
Member

If it's an m3u8, use _extract_m3u8_formats_and_subtitles. Like I said before:

> You shouldn't just pass master m3u8 without processing

@pmitchell86
Contributor Author

pmitchell86 commented Feb 4, 2023

https://91porn.com/view_video.php?viewkey=7ef0cf3d362c699ab91c is an example (taken from the tests) where:

  • determine_ext() returns m3u8
  • _extract_m3u8_formats_and_subtitles() returns: [{'format_id': '', 'format_index': None, 'url': 'https://cdn77.91p49.com/m3u8/731633/731633.m3u8', 'ext': None, 'protocol': 'm3u8_native', 'preference': None, 'quality': None, 'has_drm': None}]

If the hard-coded 'ext': 'mp4' is removed from the return dict, the downloaded file gets an .m3u8 extension.

yt_dlp/extractor/porn91.py: 3 outdated review threads (resolved)
pmitchell86 and others added 3 commits February 4, 2023 09:36
Co-authored-by: pukkandan <pukkandan.ytdlp@gmail.com>
Co-authored-by: pukkandan <pukkandan.ytdlp@gmail.com>
Co-authored-by: pukkandan <pukkandan.ytdlp@gmail.com>
@pukkandan removed the pending-fixes (PR has had changes requested) label on Feb 4, 2023
@pukkandan merged commit c085cc2 into yt-dlp:master on Feb 12, 2023
aalsuwaidi pushed a commit to aalsuwaidi/yt-dlp that referenced this pull request Apr 21, 2024
Labels
NSFW site-bug Issue with a specific website
Development

Successfully merging this pull request may close these issues.

Fix 91porn.com
2 participants