This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

Scraping Google in the client side #1608

Closed
agnelvishal opened this issue May 27, 2019 · 17 comments

Comments

@agnelvishal

agnelvishal commented May 27, 2019

Since Google introduces a captcha when there are too many requests from an IP (#729), it might be better to scrape Google on the client side. This would also decrease server load.

@unixfox
Member

unixfox commented May 28, 2019

This seems quite difficult to implement and probably out of scope for Searx.

It would require that, when the page loads, the client makes a request with JavaScript to fetch the Google results, generates new HTML from them, and finally injects it into the page.
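Something like this hypothetical sketch, for illustration only (the #results container id and the .r a selector are my assumptions, not actual Searx code; and in practice the browser blocks this cross-origin fetch, as discussed later in this thread):

// Hypothetical sketch: fetch a Google results page from the browser,
// parse it, and inject the result links into the current page.
// In practice the same-origin policy blocks this cross-origin fetch.
async function fetchGoogleResults(query) {
    const response = await fetch(
        "https://www.google.com/search?q=" + encodeURIComponent(query));
    const html = await response.text();
    const doc = new DOMParser().parseFromString(html, "text/html");
    const container = document.querySelector("#results"); // assumed container id
    for (const link of doc.querySelectorAll(".r a")) {    // old Google markup
        const item = document.createElement("a");
        item.href = link.href;
        item.textContent = link.textContent;
        container.appendChild(item);
    }
}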

I'm pretty sure it would add quite a lot of delay and complexity.

The main issue is not Google's captcha but the bots that abuse the public Searx instances. If we got rid of these bots we wouldn't have any issue with Google, as is already the case on my instance: https://searx.be

@eggercomputerchaos

@unixfox
nice,
how did you get rid of these bots?

@unixfox
Member

unixfox commented May 29, 2019

@eggercomputerchaos I talked about it here: #1584 (comment) and here: #1034 (comment)

@eggercomputerchaos

@unixfox
"Meanwhile, you can use my Searx instance: https://searx.be. It works by default with Google without any issue."
thx

@marmistrz

Would it be possible to just solve the captcha on the client side?

@agnelvishal
Author

This code will scrape results from Google in the client

function main() {
    // Collect result links from a Google results page (old Google UI markup).
    var urls = document.querySelectorAll(".r a[target='_blank']:not(.fl)");
    if (urls.length != 0) {
        var urlList = {};
        for (let i = 0; i < urls.length; i++) {
            var url = urls[i].href;
            urlList[i] = url;
        }
        console.log(urlList);
    }
}
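For example (assuming Google still serves the old .r markup), running main() from the browser's developer console on a google.com results page should log an object mapping result indexes to URLs.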

@eggercomputerchaos

@agnelvishal
thx
where do I insert the code?

@agnelvishal
Author

@agnelvishal
thx
where do I insert the code?

The integration is yet to be implemented.

@eggercomputerchaos

ok, but in which file can I insert the code, or is it more complicated than I think?

@return42
Contributor

This code will scrape results from Google in the client

Where is the sense in parsing in the web client?

@return42
Contributor

The main issue is not Google's captcha but the bots that abuse the public Searx instances. If we got rid of these bots we wouldn't have any issue with Google, as is already the case on my instance: https://searx.be

My experience is that just a group of more than one user, or even a single heavy user, can result in CAPTCHA questions. Today it hit me on my instance https://darmarit.cloud/searx/ and I checked the Apache log: I was the only user, no bots at this time (today).

@return42
Contributor

return42 commented Dec 6, 2019

Update: the old Google UI is now completely dead #1748 (comment)

@unixfox
Member

unixfox commented Dec 6, 2019

This issue isn't really about the old UI but about parsing the (new) UI from the client side.

@return42
Contributor

return42 commented Dec 6, 2019

Aaargh, my fail .. so many open issues / I got confused.

Anyway, parsing in the client is nonsense / the solution for the new UI is #1628 / or did I miss something .. again :)

@agnelvishal
Author

This seems quite difficult to implement and probably out of scope for Searx.

It would require that, when the page loads, the client makes a request with JavaScript to fetch the Google results, generates new HTML from them, and finally injects it into the page.

I'm pretty sure it would add quite a lot of delay and complexity.

The main issue is not Google's captcha but the bots that abuse the public Searx instances. If we got rid of these bots we wouldn't have any issue with Google, as is already the case on my instance: https://searx.be

If scraping is done on the client side, that will prevent bot abuse, because the bots would have to send their requests to Google themselves.

@jingyu9575

@agnelvishal Is it possible to do it on the client side? I'm not sure whether the same-origin policy limits it.

@unixfox
Member

unixfox commented Dec 9, 2019

@agnelvishal we already know the benefits of parsing on the client side, and I completely agree with you that it would be really great for Searx instances.
But the issue is that we can't actually make the browser do requests to google.com, due to the same-origin policy: https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy
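For example, a hypothetical snippet like this, run from a page served by a Searx instance, fails because google.com doesn't send an Access-Control-Allow-Origin header for our origin:

// Hypothetical: run from a page served by a Searx instance.
// The browser refuses to expose the response because google.com
// sends no Access-Control-Allow-Origin header for this origin.
fetch("https://www.google.com/search?q=searx")
    .then(response => response.text())
    .then(html => console.log(html.length))
    .catch(error => console.error("Blocked by the same-origin policy:", error));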

@asciimoo @return42 I guess you can close this issue because it's technically not possible to do what has been requested.

@return42 return42 closed this as completed Dec 9, 2019