Broken Scrapers #123
Looks like the JavLibrary scraper can be broken sometimes. Idea:
But I don't think it would be useful for other scrapers. |
@brumouta thanks for the feedback. welivetogether and babes have now been moved to a separate one |
RealityKings has some more broken domains: |
Thanks for the feedback @budislov |
Looks like RealityKingsOL is broken. Tried to scrape from both babes.com and bellesafilms.com and only the tags came through. It appears that the div classes used in the scraper have changed. Will investigate further. |
|
iafd.com performer scraper not working |
IAFD fixed, thanks for the report @Ziatexataor and for the fix @Belleyy |
TransSensual.yml seems to be broken. Tested with new and older scenes and can't pull the data |
@malibustacynewhat thanks for the report |
JAVLibrary is broken https://github.com/stashapp/CommunityScrapers/blob/master/scrapers/javlibrary.yml Looks to be a Cloudflare error but using the CDP driver didn't resolve it for me when testing:
<!DOCTYPE html><html lang="en-US"><head>
<meta charset="UTF-8"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1"/>
<meta name="robots" content="noindex, nofollow"/>
<meta name="viewport" content="width=device-width,initial-scale=1"/>
<title>Just a moment...</title>
<style type="text/css">
html, body {width: 100%; height: 100%; margin: 0; padding: 0;}
body {background-color: #ffffff; color: #000000; font-family:-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Helvetica Neue",Arial, sans-serif; font-size: 16px; line-height: 1.7em;-webkit-font-smoothing: antialiased;}
h1 { text-align: center; font-weight:700; margin: 16px 0; font-size: 32px; color:#000000; line-height: 1.25;}
p {font-size: 20px; font-weight: 400; margin: 8px 0;}
p, .attribution, {text-align: center;}
#spinner {margin: 0 auto 30px auto; display: block;}
.attribution {margin-top: 32px;}
@keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
@-webkit-keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
#cf-bubbles > .bubbles { animation: fader 1.6s infinite;}
#cf-bubbles > .bubbles:nth-child(2) { animation-delay: .2s;}
#cf-bubbles > .bubbles:nth-child(3) { animation-delay: .4s;}
.bubbles { background-color: #f58220; width:20px; height: 20px; margin:2px; border-radius:100%; display:inline-block; }
a { color: #2c7cb0; text-decoration: none; -moz-transition: color 0.15s ease; -o-transition: color 0.15s ease; -webkit-transition: color 0.15s ease; transition: color 0.15s ease; }
a:hover{color: #f4a15d}
.attribution{font-size: 16px; line-height: 1.5;}
.ray_id{display: block; margin-top: 8px;}
#cf-wrapper #challenge-form { padding-top:25px; padding-bottom:25px; }
#cf-hcaptcha-container { text-align:center;}
#cf-hcaptcha-container iframe { display: inline-block;}
</style>
<meta http-equiv="refresh" content="12"/>
<script type="text/javascript">
//<![CDATA[
(function(){
window._cf_chl_opt={
cvId: "1",
cType: "non-interactive",
cNounce: "90957",
cRay: "5e0bb321ef7bca98",
cHash: "da202b537a470c2",
cFPWv: "g",
cRq: {
ru: "aHR0cDovL3d3dy5qYXZsaWJyYXJ5LmNvbS9lbi8/dj1qYXZtZXpiZTNh",
ra: "TW96aWxsYS81LjAgKE1hY2ludG9zaDsgSW50ZWwgTWFjIE9TIFggMTBfMTVfNSkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzgzLjAuNDEwMy4xMDYgU2FmYXJpLzUzNy4zNg==",
rm: "R0VU",
d: "q4jiR7WSBtf4fLzLz9igfZOdIxwSKG18lkM8oKJ2oB8n30GM2iyW8aiQ9atzUZsOBOiOCY1F45Ok0xoQE9LhBiZfXlfVJaHdOBUlqNu1cCbboEIdvJX1FuypXHYYwXjfaKTC2p4xeTL5nAkfqvaQqkt1H/1p0rqFLGuv5JXJ3gBxB6Y/uALdxdsFi+lSlCG6Qe3X2Lj+WYyKl3todU7QjK8vUNythAJOrMTlR1fGrfbfXESvY4tSMJo7OEhwZymfB+AKhpzlHeTcuo+T40qfUHcXUDFRZCqSIvBynJ532Jn2bbqiZ1XffuBhRCVhBxK+kkJ9NurfuchvBr0bA3lk+Dnyykdr0hUr5lE34hioN0t6bDwXnGSBMCsX40Hx6TDDQa+utstnZqYk3G1jtYupvATJXzjvxhaNDHgOwHJomiUip/glK6aw52FuNwxXEj7ZJmdJPg4omti3B/1l7wy5+Z1rERc/nHgZE2JBxsOMDFpFXx6oNX/ZCk1//+mIVxGVFfNCBIGI1eyIKCP6LkCcsw1+aeO2YHmOzBkz9Ebx3drg5ouDQU0bmnNNsuh6vtMZ2eydA3b8y1H2mfO+UoUwB7Ej5u0cR1gJGbuSHpK+imsOFpqmwJdDPhqXYl5xcy6nVCnU2xeyqXJP/HMHGjU4h3Op/vlZKIuhtqFPC6Guk0FIUbFTI4JGMG7u3UwcuuYUrnmYXFX1vupeVrqsjsRFJXnqRhnWc+EJ62b3QYIqf/pFpb/eKU8DpE4wKEmd05vkzLCS1DZQ29AxACho6Zf0brScVV2/qvY5qVsNlk9QCSJdmmR7eyfAPju4BoRmFWdRVEymwQHM7raS1XGdZvcFDw==",
t: "MTYwMjQ1MjAwOS4yNzIwMDA=",
m: "jAJ8FygcOeJXMeDg2+r+pIbPCZvv7uD3AA/cCQ2MIkQ=",
i1: "2tfaQpq68/qtCUW9AL9YZA==",
i2: "DT4KsCiUsfsu8FZXKRmHjg==",
uh: "TprDV0CpLyfpdzs+8x+WX/Btsv1e+OQLx8NzEGjSfMY=",
hh: "3htzUBXaqug0moZaVaRPWNYG1rRQQxdDndKhxQafs0M=",
}
}
window._cf_chl_enter = function(){window._cf_chl_opt.p=1};
var a = function() {try{return !!window.addEventListener} catch(e) {return !1} },
b = function(b, c) {a() ? document.addEventListener("DOMContentLoaded", b, c) : document.attachEvent("onreadystatechange", b)};
b(function(){
var cookiesEnabled=(navigator.cookieEnabled)? true : false;
var cookieSupportInfix=cookiesEnabled?'/nocookie':'/cookie';
var a = document.getElementById('cf-content');a.style.display = 'block';
var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
var trkjs = isIE ? new Image() : document.createElement('img');
trkjs.setAttribute("src", "/cdn-cgi/images/trace/jschal/js"+cookieSupportInfix+"/transparent.gif?ray=5e0bb321ef7bca98");
trkjs.id = "trk_jschal_js";
trkjs.setAttribute("alt", "");
document.body.appendChild(trkjs);
var cpo = document.createElement('script');
cpo.type = 'text/javascript';
cpo.src = "/cdn-cgi/challenge-platform/h/g/orchestrate/jsch/v1";
var done = false;
cpo.onload = cpo.onreadystatechange = function() {
if (!done && (!this.readyState || this.readyState === "loaded" || this.readyState === "complete")) {
done = true;
cpo.onload = cpo.onreadystatechange = null;
window._cf_chl_enter()
}
};
document.getElementsByTagName('head')[0].appendChild(cpo);
}, false);
})();
//]]>
</script>
</head>
<body>
<div style="display: none;"><a href="http://bt50.org/nonalignedfrequent.php?pl=0">table</a></div><table width="100%" height="100%" cellpadding="20">
<tbody><tr>
<td align="center" valign="middle">
<div class="cf-browser-verification cf-im-under-attack">
<noscript>
<h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
</noscript>
<div id="cf-content" style="display:none">
<div id="cf-bubbles">
<div class="bubbles"></div>
<div class="bubbles"></div>
<div class="bubbles"></div>
</div>
<h1><span data-translate="checking_browser">Checking your browser before accessing</span> javlibrary.com.</h1>
<div id="no-cookie-warning" data-translate="turn_on_cookies" style="display:none">
<p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
</div>
<p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
<p data-translate="allow_5_secs">Please allow up to 5 seconds…</p>
</div>
<form class="challenge-form" id="challenge-form" action="/en/?v=javmezbe3a&__cf_chl_jschl_tk__=c44b146f044ddd9d0b23bf4928759e99e7ddef0e-1602452009-0-Ab8hTl3noYmOwwAWI1D0d_6zhaYO-4vHBJD8JW4VCFmZKjqal-xVCdpCdbztfKStCEp8QJa2ganoOGB_Jnq-Qwtu6BnG7zySJxaY_Oc54OgSHPG3Mt1wJ-nYfmFjU8ShDtM6t2VT15V5I0rsRAGRc5RZPs1OE8Vi3aozMxTjxatgWYLmnk0ozVyDVudpWURh7xhqtqs9M9vv_jAfqIUgHIwFe1MVURVaxrV4jOsccyGYHvJ8ZLFmpzrqf8LPPa2N3M1SG-T4vUDhsLgjgeIkfOC6_U3zZBVNKUY8HU47JaiTLjHHnOMHfzeA4iz76Sb2MQ" method="POST" enctype="application/x-www-form-urlencoded">
<input type="hidden" name="r" value="c77fde06d76dccbdf1aa275a6824657ec7878994-1602452009-0-AefHkw7YBHV4yapfdyGgNFofr2bk+ZNLsmu1vxzyTAyFPQickf2DVbsdFnOKYI9Zs5D6PO21kZcj5siVtnYOhmEJ7HOBLBCp4lS+GBW8iyR62pXG9ezmP6Fu4qRomUkK8uCSsqveohhquzDEYroSgMpZT0eIJXFIprAfC6uIux7NSx6mo8wGMKFoW3TJJFmAN4FKgZdHpkLShowC8AaRocTx86yZzOOrEywJ5CGsOzw5vNg4GvS4gK6MB+pR3iKfGRnXamisWHrWYZWDyfiGHOfcD8LmcCWzeIEMfD+nADV4477P2jWOHIDvEqtS7Yi0G3qKvH16LmR28qALhOLv8PAhv2GBzp8EOUcdXkJfFN1Jloqm5JU2eoCn/5uBxE0xl80s8Xfaa9vhkhqRicv3XnmHpJRhXgNvauGiYLcmaJ0189RtB6eEhZ6j1N9o9pfstDcSa00ur7vPLgDCd2AqiVrVz8SG8zb+8L+wlfrTaBCIlAiecjoTFLHTPEZW2V4eaVYzY9ECAb69YOhnGBhUXDiDk8wjSLZv8uZYMIxwW+jEsdzAtJ9TkMq5VXrE/sORd24lamS6K3Lr8g9BasZTjJdR3Omni9UmlQVaVDXUIPQBAb6x1nhf57/47lvWjDgrjuEw47NDosN3IHSDoyKYUMg="/>
<input type="hidden" value="3715604b2b146b25182bb17d479ebda2" id="jschl-vc" name="jschl_vc"/>
<!-- <input type="hidden" value="" id="jschl-vc" name="jschl_vc"/> -->
<input type="hidden" name="pass" value="1602452013.272-qzCPIXiuVG"/>
<input type="hidden" id="jschl-answer" name="jschl_answer"/>
</form>
<div id="trk_jschal_nojs" style="background-image:url('/cdn-cgi/images/trace/jschal/nojs/transparent.gif?ray=5e0bb321ef7bca98')"> </div>
</div>
<div class="attribution">
DDoS protection by <a href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>
<br/>
<span class="ray_id">Ray ID: <code>5e0bb321ef7bca98</code></span>
</div>
</td>
</tr>
</tbody></table>
</body></html> |
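When a scraper returns nothing, one quick check is whether the site answered with a Cloudflare challenge page like the one quoted above instead of the real page. The following is only a minimal sketch: the marker strings are copied from the quoted HTML, while the function name `looks_like_cloudflare_challenge` and the sample strings are hypothetical and not part of stash or any scraper.

```python
# Marker strings taken from the challenge page quoted in this thread.
# Cloudflare may change them at any time, so treat this as a heuristic only.
CHALLENGE_MARKERS = (
    "<title>Just a moment...</title>",
    "cf-browser-verification",
    "__cf_chl_jschl_tk__",
)


def looks_like_cloudflare_challenge(html: str) -> bool:
    """Return True if the HTML contains any marker seen in the quoted challenge page."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


# Hypothetical sample inputs for illustration.
challenge_sample = (
    "<html><head><title>Just a moment...</title></head>"
    '<body><div class="cf-browser-verification cf-im-under-attack"></div></body></html>'
)
normal_sample = "<html><head><title>Some scene</title></head><body></body></html>"

print(looks_like_cloudflare_challenge(challenge_sample))
print(looks_like_cloudflare_challenge(normal_sample))
```

A check like this could be run by hand against the HTML that shows up in the log, to tell a blocked request apart from a selector that no longer matches.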
teamskeet.com |
Teamskeet only works for a single query and then Cloudflare blocks the IP, I think. For JavLibrary, with the last update you can change the URL to one of the mirrors and it should work |
Vixen Network sites now require you to login when opening a scene page, thus the scraper no longer works. |
Already solved in discord, but for other people: |
teenfidelity.com doesn't work (part of /scrapers/KellyMadisonMedia.yml) |
@Threak are you sure you set up CDP correctly? Just tried and it seems to work. The first request has something to do with their site protection, not necessarily a cookie. Add
debug:
  printHTML: true
and have a look at the log so that you can see what the site returns to stash. |
@bnkai I think there is a difference between headless Chromium and using normal Chrome. I use a Chromium executable and this scraper doesn't work for me, like the Teamskeet scraper, so I think there is a difference between headless and classic. |
@Belleyy you might be right |
@Belleyy upon further investigation it seems that the docker container method maintains some cookies, which I assume the executable one doesn't.
stash-osx uploaded to url: "https://gofile.io/d/lq6J3w"
stash-win.exe uploaded to url: "https://gofile.io/d/DozNwz"
stash-linux uploaded to url: "https://gofile.io/d/5Tl0hi"
First make sure to set the log level to debug. Then do a scrape. After the scrape get the nats values that are printed in the log and replace in the yml file the |
@bnkai Just tested with a few scenes and it works 👍 (with and without CDP)
|
@Belleyy this seems to verify what I thought. We'll probably have to update the scraper to mention that a CDP remote instance is required (a plain executable is not enough) until the cookies PR is merged. |
I have the following 4 errors regarding the field cookies.
|
@JDRanpariya Are you using the dev build? This scraper needs a version of stash >= v0.4.0-14. |
I'm using the following build: the 24 Nov one |
@JDRanpariya you need to switch to a recent dev version, as stated in the scrapers list: v0.4.0-14 at least for cookie support. The one you have doesn't support that as it's v0.4.0 (14 commits older than what you need) |
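The version requirement above can be checked mechanically. This is a rough sketch that assumes version strings have the shape `v<major>.<minor>.<patch>` with an optional `-<commits since the tag>` suffix, as the thread implies; the helper names `parse_version` and `supports_cookies` are hypothetical and not part of stash.

```python
REQUIRED = "v0.4.0-14"  # minimum version named in this thread for cookie support


def parse_version(version: str) -> tuple:
    """Turn 'v0.4.0-14' into (0, 4, 0, 14); a plain 'v0.4.0' becomes (0, 4, 0, 0)."""
    parts = version.lstrip("v").split("-")
    release = tuple(int(n) for n in parts[0].split("."))
    # Assumption: the first suffix after '-' is the number of commits past the tag.
    commits = int(parts[1]) if len(parts) > 1 else 0
    return release + (commits,)


def supports_cookies(installed: str) -> bool:
    """Compare the installed version against REQUIRED using tuple ordering."""
    return parse_version(installed) >= parse_version(REQUIRED)


print(supports_cookies("v0.4.0"))     # the release build discussed above
print(supports_cookies("v0.4.0-14"))  # the first dev build with cookie support
```

Tuple comparison keeps the logic short: a plain release compares as zero commits past its tag, so v0.4.0 sorts before v0.4.0-14.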
@Maista6969 Thank you! I'm not sure how to reopen an issue, so I left a comment on the one you referenced. |
DesperateAmateurs: I have the scraper installed (Stash has been restarted a few times since it was installed) but on a scene, clicking "Scrape with..." DA will not be in the list, pasting in a DA URL and clicking the white download/scrape button returns nothing at all (no dialog box, no message, no data) and clicking "Scrape with URL" from the "Scrape with..." list displays the message "No scenes found" - so either I'm doing something completely wrong or the scraper is 100% borked... could be either. Or both. The scraper is listed as the current/latest version. |
I just installed the scraper through the community scrapers installer in v0.25.0 and it is able to get data when entering a DA URL and clicking the scrape button. Make sure you are clicking the reload scraper button, as well as refreshing the browser window where you are attempting to scrape the URL on a scene. I had to refresh my scene page or else the scrape button was greyed out.
|
I've: |
What URL are you scraping? This could be a networking issue, but we can rule out the scraper itself being broken: I have tested it with several scenes now (like this one) and it works as expected. The only problem I can see is that the URL pattern in this scraper is too liberal: it will accept any URL that contains Can you use other scrapers without any issues or is this the first/only one you've tried? |
Having an issue with dc-onlyfans scraper
|
This scraper is very particular about file structures: your folder is named |
I still can't seem to get it working.
|
Got it working. |
I Want Clips: |
I am unable to reproduce this so the scraper isn't broken. Status code 504 is gateway timeout so it's definitely a networking issue, but not necessarily something you can do something about. It could just be a transient problem that will pass on its own 🙂 |
That's fair. Thanks for looking at it. |
scraper Brazzers: error running scraper script |
Need more info to be able to help with this, but first: please look at the README for some manual steps required to use Python scrapers at this time |
Thanks for the reply.
I installed Python but the Brazzers scraper does not work. In a scene I click Edit, then "Scrape with..." and Brazzers, and it shows "error running scraper script". I tried another site but the same thing happened.
|
Can you check the logs at Debug level to see what's going wrong? I can't see your screen from where I'm sitting |
The X-Art scraper was recently updated to support galleries by @Ksrx01 in #1698 (thanks!). It returns blank details on some galleries, due to inconsistent HTML structures by the studio. In Bohemian Rhapsody and First Loves, the description is in the paragraph inside the one with ID
<p id="desc"><p>It's a "Bohemian [...] Colette</p></p>
<p id="desc"><p>Chelsea is [...] so cute!</p></p>
The XPath expression is on line 39:
gallery:
  Title: //div[@class="small-12 medium-12 large-6 columns info"]/h1[@class="show-for-large-up"]
  Details: //div[@class="small-12 medium-12 large-6 columns info"]/p[@id="desc"]
  Date:
    selector: //div[@class="small-12 medium-12 large-6 columns info"]/h2[1]/text() |
Noticed that issue too, shortly after updating it. |
The descriptions are actually adjacent to the paragraph with the ID In the meantime I've pushed a fix that ensures that we can scrape the full description for galleries on X-Art 🙂 |
Thanks! Unfortunately I can't remember which galleries I had issues with. |
Confirmed fixed! Thanks! |
I'm having an issue with Redgifs
ERRO[2024-04-26 19:54:50] [Scrape / Redgifs] HTTP Error: 404 |
IAFD updated their URL for performers. It used to be "https://www.iafd.com/person.rme/perfid=" and it is now "https://www.iafd.com/person.rme/id=" I didn't check all parts of the scraper, but all of the data returned from the scraper looks correct. |
Thank you for bringing this up, I've pushed a new version of the scraper YAML that will let it trigger on the new patterns as well as the old since those will redirect to the new and still scrape fine 🙂 |
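The idea of triggering on both the old and new IAFD patterns can be sketched as follows. This is an illustration only, not the scraper's actual YAML: the two prefixes are the ones quoted above, while the function name `is_iafd_performer_url`, the substring-style matching, and the performer identifiers in the sample URLs are hypothetical.

```python
# URL prefixes quoted in this thread: the old one redirects to the new one.
IAFD_PERFORMER_PATTERNS = (
    "https://www.iafd.com/person.rme/perfid=",  # old pattern
    "https://www.iafd.com/person.rme/id=",      # new pattern
)


def is_iafd_performer_url(url: str) -> bool:
    """Return True if the URL contains either the old or the new performer pattern."""
    return any(pattern in url for pattern in IAFD_PERFORMER_PATTERNS)


# The identifiers after '=' are made up for illustration.
print(is_iafd_performer_url("https://www.iafd.com/person.rme/perfid=example_performer"))
print(is_iafd_performer_url("https://www.iafd.com/person.rme/id=example-id"))
print(is_iafd_performer_url("https://www.iafd.com/title.rme/id=example-id"))
```

Accepting both patterns means URLs already saved in the old format keep working without users having to edit them.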
Closing this in favor of creating individual issues for broken scrapers: if you've come here to report a broken scraper, please open a new issue |
Any issues with scrapers not working should be mentioned here
The name of the scraper, and the xpath or part not working, would be appreciated.
Known Issues
updated 2022-09-25