This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

Scraping Google in the client side #1608

Closed
agnelvishal opened this issue May 27, 2019 · 17 comments

Comments

@agnelvishal

agnelvishal commented May 27, 2019

Since Google introduces a captcha when there are too many requests from an IP (#729), it might be better to scrape Google on the client side. This would also decrease server load.

@unixfox
Member

unixfox commented May 28, 2019

This seems quite difficult to implement and probably out of scope for Searx.

It would require that, when the page loads, the client makes a request with JavaScript to fetch the Google results, generates new HTML from them, and finally injects it into the page.
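Something like this hypothetical sketch, for illustration only (the #results container id and the .r a selector are my assumptions, not actual Searx code; and in practice the browser blocks this cross-origin fetch, as discussed later in this thread):

// Hypothetical sketch: fetch a Google results page from the browser,
// parse it, and inject the result links into the current page.
// In practice the same-origin policy blocks this cross-origin fetch.
async function fetchGoogleResults(query) {
    const response = await fetch(
        "https://www.google.com/search?q=" + encodeURIComponent(query));
    const html = await response.text();
    const doc = new DOMParser().parseFromString(html, "text/html");
    const container = document.querySelector("#results"); // assumed container id
    for (const link of doc.querySelectorAll(".r a")) {    // old Google markup
        const item = document.createElement("a");
        item.href = link.href;
        item.textContent = link.textContent;
        container.appendChild(item);
    }
}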

I'm pretty sure it would add quite a lot of delay and complexity.

The main issue is not Google's captcha but the bots that abuse the public Searx instances. If we got rid of these bots we wouldn't have any issue with Google, as is already the case on my instance: https://searx.be

@eggercomputerchaos

@unixfox
nice,
how did you get rid of these bots?

@unixfox
Member

unixfox commented May 29, 2019

@eggercomputerchaos I talked about it here: #1584 (comment) and here: #1034 (comment)

@eggercomputerchaos

@unixfox
"Meanwhile, you can use my Searx instance: https://searx.be. It works by default with Google without any issue."
thx

@marmistrz

Would it be possible to just solve the captcha on the client side?

@agnelvishal
Author

This code will scrape results from Google in the client

function main() {
    // Collect result links from a Google results page (old Google UI markup).
    var urls = document.querySelectorAll(".r a[target='_blank']:not(.fl)");
    if (urls.length != 0) {
        var urlList = {};
        for (let i = 0; i < urls.length; i++) {
            var url = urls[i].href;
            urlList[i] = url;
        }
        console.log(urlList);
    }
}
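For example (assuming Google still serves the old .r markup), running main() from the browser's developer console on a google.com results page should log an object mapping result indexes to URLs.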

@eggercomputerchaos

@agnelvishal
thx
where do I insert the code?

@agnelvishal
Author

@agnelvishal
thx
where do I insert the code?

The integration is yet to be implemented.

@eggercomputerchaos

ok, but in which file can I insert the code, or is it more complicated than I think?

@return42
Contributor

This code will scrape results from Google in the client

Where is the sense in parsing in the web client?

@return42
Contributor

The main issue is not Google's captcha but the bots that abuse the public Searx instances. If we got rid of these bots we wouldn't have any issue with Google, as is already the case on my instance: https://searx.be

My experience is that just a group of more than one user, or even a single heavy user, can result in CAPTCHA questions. Today it hit me on my instance https://darmarit.cloud/searx/ and I checked the Apache log: I was the only user, no bots at this time (today).

@return42
Contributor

return42 commented Dec 6, 2019

Update: the old Google UI is now completely dead #1748 (comment)

@unixfox
Member

unixfox commented Dec 6, 2019

This issue isn't really about the old UI but about parsing the (new) UI from the client side.

@return42
Contributor

return42 commented Dec 6, 2019

Aaargh, my fail .. so many open issues / I got confused.

Anyway, parsing in the client is nonsense / the solution for the new UI is #1628 / or did I miss something .. again :)

@agnelvishal
Author

This seems quite difficult to implement and probably out of scope for Searx.

It would require that, when the page loads, the client makes a request with JavaScript to fetch the Google results, generates new HTML from them, and finally injects it into the page.

I'm pretty sure it would add quite a lot of delay and complexity.

The main issue is not Google's captcha but the bots that abuse the public Searx instances. If we got rid of these bots we wouldn't have any issue with Google, as is already the case on my instance: https://searx.be

If scraping is done on the client side, that will prevent bot abuse, because the bots would have to send their requests to Google themselves.

@jingyu9575

@agnelvishal Is it possible to do it on the client side? I'm not sure whether the same-origin policy limits it.

@unixfox
Member

unixfox commented Dec 9, 2019

@agnelvishal we already know the benefits of parsing on the client side, and I completely agree with you that it would be really great for Searx instances.
But the issue is that we can't actually make the browser do requests to google.com, due to the same-origin policy: https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy
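For example, a hypothetical snippet like this, run from a page served by a Searx instance, fails because google.com doesn't send an Access-Control-Allow-Origin header for our origin:

// Hypothetical: run from a page served by a Searx instance.
// The browser refuses to expose the response because google.com
// sends no Access-Control-Allow-Origin header for this origin.
fetch("https://www.google.com/search?q=searx")
    .then(response => response.text())
    .then(html => console.log(html.length))
    .catch(error => console.error("Blocked by the same-origin policy:", error));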

@asciimoo @return42 I guess you can close this issue because it's technically not possible to do what has been requested.

@return42 return42 closed this as completed Dec 9, 2019