[fix] google images engine: Fix 'scrap_img_by_id' function #910

Merged
merged 2 commits into from
Feb 20, 2022

Conversation

tiekoetter
Member

What does this PR do?

The 'scrap_img_by_id' function didn't return anything useful. This fix allows the google images engine to present the full source image instead of only the thumbnail.

How to test this PR locally?

  • make run
  • Search !goi land
  • Check if full image is loaded in result card view

Author's checklist

Note: If an image URL redirects, the image proxy fails.

DEBUG   searx.network.image_proxy     : HTTP Request: GET https://www.pik-potsdam.de/en/institute/departments/activities/resolveuid/1921254915ba45a9a661fd8228588a90 "HTTP/1.1 301 Moved Permanently"
DEBUG   searx.webapp                  : image-proxy: wrong response code: 301
INFO    werkzeug                      : 192.168.178.129 - - [19/Feb/2022 01:37:55] "GET /image_proxy?url=https%3A%2F%2Fwww.pik-potsdam.de%2Fen%2Finstitute%2Fdepartments%2Factivities%2Fresolveuid%2F1921254915ba45a9a661fd8228588a90&h=89c65601c562be381d299b79447c42899f3d68b7939ba5691a1513822d763bf0 HTTP/1.0" 400 -

Related issues

Closes #909

@return42 return42 self-requested a review February 19, 2022 08:09
@return42
Member

While I am still reviewing this PR, here are some comments:

The 'scrap_img_by_id' function didn't return anything useful.

This implementation is mine :-) ... an ugly hack that parses JS code with a regexp to pick out thumbs. As far as I remember, my intention was to get thumbnails instead of full-size images to save bandwidth.

Note: If an image URL redirects, the image proxy fails.

In #878 we found a solution to avoid redirects by calculating the direct URL. If we can't avoid redirects here in the google engine, we can allow redirects in the image_proxy implementation. Here is a possible solution @dalf posted on Matrix a few days ago:

diff --git a/searx/webapp.py b/searx/webapp.py
index 5e05f978..eb08d63d 100755
--- a/searx/webapp.py
+++ b/searx/webapp.py
@@ -1132,7 +1132,7 @@ def image_proxy():
             'DNT': '1',
         }
         set_context_network_name('image_proxy')
-        resp, stream = http_stream(method='GET', url=url, headers=request_headers)
+        resp, stream = http_stream(method='GET', url=url, headers=request_headers, allow_redirects=True)
         content_length = resp.headers.get('Content-Length')
         if content_length and content_length.isdigit() and int(content_length) > maximum_size:
             return 'Max size', 400
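
To illustrate why the 301 in the log above ends in a 400: without redirect following, the proxy treats the redirect response itself as the final answer. The following is only a minimal sketch of bounded redirect following, not the searx implementation. `resolve_redirects`, `fake_fetch`, and the example.org URLs are hypothetical names made up for this illustration; the real fix simply passes `allow_redirects=True` to `http_stream` as shown in the diff.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_redirects(fetch, url, max_redirects=5):
    """Follow redirects using a hypothetical `fetch(url) -> (status, headers)`.

    Returns the final URL and status code, or raises RuntimeError when
    more than `max_redirects` redirects are encountered.
    """
    for _ in range(max_redirects + 1):
        status, headers = fetch(url)
        if status not in REDIRECT_CODES:
            return url, status
        location = headers.get('Location')
        if not location:
            # A redirect without a target cannot be followed.
            return url, status
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError('too many redirects')


# Stand-in for a network call, so the sketch runs offline (made-up URLs).
_FAKE_RESPONSES = {
    'https://example.org/resolveuid/abc': (301, {'Location': '/images/land.jpg'}),
    'https://example.org/images/land.jpg': (200, {'Content-Type': 'image/jpeg'}),
}


def fake_fetch(url):
    return _FAKE_RESPONSES[url]


final_url, final_status = resolve_redirects(fake_fetch, 'https://example.org/resolveuid/abc')
```

The hop limit matters for a proxy: following redirects without a bound would let a misbehaving origin keep a request busy indefinitely.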

@tiekoetter
Member Author

This implementation is mine :-) ... an ugly hack that parses JS code with a regexp to pick out thumbs. As far as I remember, my intention was to get thumbnails instead of full-size images to save bandwidth.

scrap_out_thumbs does the thumbnail stuff.

scrap_img_by_id should get the source image, which it does now.

If we can't avoid redirects here in the google engine, we can allow redirects in the image_proxy implementation.

I don't think we can avoid this here, because the images don't all come from one source, and I don't think google caches the full image from every source.

@dalf
Member

dalf commented Feb 19, 2022

While trying to understand scrap_img_by_id, I fell down the rabbit hole: https://gist.github.com/dalf/ec228b4aec97033de96ec92a504cf988

  • use only the data from AF_initDataCallback
  • custom ECMAScript to JSON translation. It seems good enough in a few tests, but it may have issues here and there.
  • it doesn't use the base64 thumbnails
  • return the full-size images from the original website. Note: some images are bigger than 10MB.
  • I don't know how often the expression data['data'][31][0][12][2] needs to be updated.
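
The approach in the list above could be sketched roughly as follows. This is not the code from the gist: `extract_callback_data` and the sample payload are hypothetical, and the sketch assumes the `data:` value happens to be valid JSON. Real responses are ECMAScript rather than JSON, which is why the gist needs its own translation step; payloads that do not parse are simply skipped here.

```python
import json
import re

# Assumed layout: AF_initDataCallback({key: ..., hash: ..., data:[...], ...});
# The prefix before "data:" must not contain "[", so the match stops at the array.
CALLBACK_RE = re.compile(r"AF_initDataCallback\(\{[^\[]*?data:\s*")


def extract_callback_data(html):
    """Return the decoded `data` value of every AF_initDataCallback call.

    Hypothetical helper: it only handles payloads that are valid JSON.
    """
    decoder = json.JSONDecoder()
    results = []
    for match in CALLBACK_RE.finditer(html):
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever follows it.
            data, _end = decoder.raw_decode(html, match.end())
        except json.JSONDecodeError:
            continue
        results.append(data)
    return results


# Made-up sample, much simpler than a real response.
SAMPLE_HTML = (
    "<script>AF_initDataCallback({key: 'ds:1', hash: '2', "
    'data:[null, ["id-1", ["https://example.org/full.jpg", 1080, 1920]]], '
    "sideChannel: {}});</script>"
)

payloads = extract_callback_data(SAMPLE_HTML)
full_url = payloads[0][1][1][0]
```

Using `raw_decode` avoids matching nested brackets with a regular expression, but a fixed index path such as `payloads[0][1][1][0]` has the same fragility as the `data['data'][31][0][12][2]` expression mentioned above: it breaks whenever the layout changes.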

return42 added a commit to return42/searxng that referenced this pull request Feb 19, 2022
Without redirects the load of various images will fail when image_proxy is
enabled [1].

[1] searxng#910 (comment)
Suggested-by: @dalf [1]
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The 'scrap_img_by_id' function no longer returned anything useful.  This
fix allows the google images engine to present the full source image instead of
only the thumbnail.

The function scrap_img_by_id() is replaced by a full rewrite that parses image
URLs with a regular expression. The new function parse_urls_img_from_js(dom)
returns a mapping of data-id to image URL.

Closes: searxng#909
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
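
As a rough illustration of what "a mapping of data-id to image URL" built by a regular expression can look like, here is a minimal sketch. It is not the merged `parse_urls_img_from_js(dom)`: the helper `map_ids_to_urls`, the pattern, and the sample JS layout are all assumptions made up for this example, and it works on a plain string instead of a DOM.

```python
import re

# Assumed layout per entry (hypothetical):
#   ["<id>",["<thumbnail url>",h,w],["<full image url>",h,w]]
ID_URL_RE = re.compile(
    r'\["(?P<id>[\w-]+)",\s*'
    r'\["(?P<thumb>https?://[^"]+)",\s*\d+,\s*\d+\],\s*'
    r'\["(?P<url>https?://[^"]+)",\s*\d+,\s*\d+\]'
)


def map_ids_to_urls(js_text):
    """Map each data-id found in `js_text` to its full-size image URL."""
    return {m.group('id'): m.group('url') for m in ID_URL_RE.finditer(js_text)}


# Made-up sample data following the assumed layout.
SAMPLE_JS = '''
["abc123",["https://thumbs.example.org/t1.jpg",180,320],["https://example.org/full1.jpg",1080,1920]],
["def-456",["https://thumbs.example.org/t2.jpg",180,320],["https://example.org/full2.png",600,800]]
'''

id_to_url = map_ids_to_urls(SAMPLE_JS)
```

A result's data-id can then be looked up with `id_to_url.get(img_id)`, falling back to the thumbnail when no entry is found.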

@return42 return42 left a comment

@tiekoetter, @dalf: the response from google has changed, see my comment above.

I implemented an alternative solution in my branch fix-909-mhe .. would you please have a look / see commit

[fix] image_proxy: allow HTTP redirects

In this fix-909-mhe branch I also implemented the redirect when image_proxy is enabled, see commit:

[fix] image_proxy: allow HTTP redirects

@return42
Member

While trying to understand scrap_img_by_id, I fell down the rabbit hole: https://gist.github.com/dalf/ec228b4aec97033de96ec92a504cf988

  • use only the data from AF_initDataCallback
  • custom ECMAScript to JSON translation. It seems good enough in a few tests, but it may have issues here and there.

I was also in this rabbit hole when I implemented it the first time :-) .. see my comment: return42@4a28b59#diff-7f888885ad7eb66947d95abd3ffa0dabe8a1faa6b1a4b5d4ea0b40908bf0afa6R100-R105

@unixfox
Member

unixfox commented Feb 19, 2022

Awesome, thanks for the PR. The small images annoyed me too.


@return42 return42 left a comment

Thanks!

@return42 return42 merged commit 36aee70 into searxng:master Feb 20, 2022
@tiekoetter tiekoetter deleted the fix-909 branch February 21, 2022 20:49
Development

Successfully merging this pull request may close these issues.

[BUG ?] Images on search engine are mostly small.
4 participants