Scraping is blocked #195

eliavm · 2018-11-07T11:39:44Z

When trying to scrape a page, I get an empty page containing a single div: <div id="distilIdentificationBlock"> </div>

Found something on stackoverflow that might be relevant: https://stackoverflow.com/questions/45060011/crawling-web-using-selenium-chrome-driver-but-still-blocked

Any idea how to bypass this?

Using Scrapy 1.5.1 + scrapy-splash 0.7.2


al-serebrov · 2018-11-09T16:49:09Z

Hi @eliavm!
Unfortunately, the website you are hitting uses Distil anti-bot measures, and there is no "silver bullet" I can offer.
There are some solutions and thoughts available on this topic, e.g. here (https://www.reddit.com/r/webdev/comments/5q1ypx/what_is_your_approach_on_scraping_distil_networks/) and here (https://www.blackhatworld.com/seo/python-scraping-distil-protected-sites.988967/), but generally you need to use proxies and make your spider present a realistic browser fingerprint (that is, look like an actual human visitor). All of this is fairly complicated and not really related to Splash itself, as you can get the same responses with other headless browser setups (e.g. Selenium).
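For anyone landing here with the same symptom, a minimal sketch of how a spider could at least recognise the block page described in this issue. It only detects the response, it does not bypass anything. The function name `looks_like_distil_block` is hypothetical, and the only marker it relies on is the `distilIdentificationBlock` id quoted in the original report; other block pages may look different.

```python
import re

# Marker taken from the block page quoted in this issue (assumption: the
# id attribute is written with either single or double quotes).
DISTIL_MARKER = re.compile(r'id=["\']distilIdentificationBlock["\']')


def looks_like_distil_block(html: str) -> bool:
    """Return True if the HTML contains the Distil identification div."""
    return bool(DISTIL_MARKER.search(html))
```

In a Scrapy callback this could be called on `response.text` to log the blocked URL, or to skip parsing, instead of silently yielding nothing.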

JavierRuano · 2018-11-11T12:03:54Z

I have "read" about this here. Another resource: https://www.blackhatworld.com/seo/python-scraping-distil-protected-sites.988967/

I don't know whether this one is useful, sorry, as I don't speak Chinese: https://www.jianshu.com/p/be856bc15afb


Gallaecio · 2019-05-09T12:13:48Z

@eliavm Could you close this issue, as it is not really about scrapy-splash?

eliavm closed this as completed May 12, 2019
