[fix] Update xpaths for new Google results page #1628

MarcAbonce · 2019-06-26T23:00:25Z

Should fix #1609, or at least most of it.

This is very similar to #1603 in scope, with a few differences:

Once you remove the hardcoded user-agent, the old xpaths are never used anymore, so they can be completely replaced. This way, we don't have to maintain two different versions of result parsing.
This uses the old code for parsing results, which already handles some edge cases like extracting text from nested divs.
Suggestions and spelling corrections are also fixed.
The unit test was updated.

Instant answers and number of results are still not working, but I think they can be left for later.

rachmadaniHaryono · 2019-06-26T23:35:42Z

can you test it against html here?

#1596 (comment)

e:

- [{'content': 'Your source for entertainment news, celebrities, celeb news, and '
-              'celebrity gossip\u200b. Check out the hottest fashion, photos, '
-              'movies and TV shows!',
+ [{'content': 'Your source for entertainment\n'
+              'news, celebrities, celeb news, and celebrity gossip\u200b. Check '
+              'out the\n'
+              'hottest fashion, photos, movies and TV shows!',
    'title': 'E! News',
    'url': 'https://www.eonline.com/'},
   {'content': 'E! Watch later. Share. 7:42. 0:00 / 7:42.',
    'title': 'E! Entertainment - YouTube',
    'url': 'https://www.youtube.com/channel/UCj7V_ikJOXO9RC8at6kYfHQ'},
   {'content': 'Welcome to the E! News YouTube Channel.\u200b ... From latest '
               'celebrity and pop culture news to freshest E! Original series on '
               'Youtube, the E! News Youtube Channel is the #1 place to get the '
               'scoop.\u200b ... Join Will Marfuggi, Melanie Bromley and E! News '
               'correspondents as they discuss the hottest pop ...',
    'title': 'E! News - YouTube',
    'url': 'https://www.youtube.com/channel/UCjDsbbzHgTrGc4Ff26TJtsA'},
+  {'content': '',
-  {'content': '05 Nov 2017 · The family travels to Cleveland to support Tristan '
-              'during one of his big games.\u200b Plus, Kourtney ...Duration: '
-              '2:27 Posted: 05 Nov 2017',
    'title': '"Keeping Up With the Kardashians" Katch-Up S14, EP.6 | E! - '
             'YouTube',
    'url': 'https://www.youtube.com/watch?v=u4GoS0-R9Sg'},
   {'content': 'E! (an initialism for Entertainment Television) is an American '
               'pay television channel that is owned by the NBCUniversal Cable '
               'Entertainment Group division of NBCUniversal, all owned by '
               'Comcast.',
    'title': 'E! - Wikipedia',
    'url': 'https://en.wikipedia.org/wiki/E!'},
   {'content': 'Wil je online spelen? Ontdek de e-games van de Nationale Loterij '
               'Win for Life, Presto, Subito, Cash, 21, Spaarpot Smash, '
               'SuperSafe, Rabbits Run.',
    'title': 'E-games Nationale Loterij - Online Spelen - e-lotto.be',
    'url': 'https://www.e-lotto.be/NL/eGames'},
   {'content': 'Th!nk E presents its activities in the field of smart energy, '
               'smart grids, smart buildings and EU projects.',
    'title': 'Th!nk E',
    'url': 'https://www.think-e.be/'}]

~~that is the diff result from #1603 with new html from @unixfox from #1596~~

~~i think it is better because it get that missing content on https://www.youtube.com/watch?v=u4GoS0-R9Sg~~

see updated below

MarcAbonce · 2019-06-27T00:16:21Z

Yeah, if I parse newUIGoogle.html with my branch, I get the following 7 results:
['https://www.eonline.com/', 'https://www.youtube.com/channel/UCj7V_ikJOXO9RC8at6kYfHQ', 'https://www.youtube.com/channel/UCjDsbbzHgTrGc4Ff26TJtsA', 'https://www.youtube.com/watch?v=u4GoS0-R9Sg', 'https://en.wikipedia.org/wiki/E!', 'https://www.e-lotto.be/NL/eGames', 'https://www.think-e.be/']

I get the exact same results with oldUIGoogle.html using the version in master.

rachmadaniHaryono · 2019-06-27T00:46:10Z

Once you remove the hardcoded user-agent, the old xpaths are never used anymore, so they can be completely replaced. This way, we don't have to maintain two different versions of result parsing.

i actually doubt this method, because this assume user/searx instance will never got old google html format, which may or may not be true.

i would love if there is some data about this

i actually feel conflicted about this because i make #1603

i want to integrate this to my pr, but it kinda hard because different method (keeping the old parser vs change it)

The unit test was updated.

this show that engine test need to be changed (and why i create #1606)

searx should have html paired with parsed content data, so engine developer can compare between each version/commit similar to seedpeer engine test

rachmadaniHaryono · 2019-06-27T10:20:46Z

diff result from #1603 with new html from @unixfox from #1596

- [{'content': 'Your source for entertainment news, celebrities, celeb news, and '
-              'celebrity gossip\u200b. Check out the hottest fashion, photos, '
-              'movies and TV shows!',
+ [{'content': 'Your source for entertainment\n'
+              'news, celebrities, celeb news, and celebrity gossip\u200b. Check '
+              'out the\n'
+              'hottest fashion, photos, movies and TV shows!',
    'title': 'E! News',
    'url': 'https://www.eonline.com/'},
   {'content': 'E! Watch later. Share. 7:42. 0:00 / 7:42.',
    'title': 'E! Entertainment - YouTube',
    'url': 'https://www.youtube.com/channel/UCj7V_ikJOXO9RC8at6kYfHQ'},
   {'content': 'Welcome to the E! News YouTube Channel.\u200b ... From latest '
               'celebrity and pop culture news to freshest E! Original series on '
               'Youtube, the E! News Youtube Channel is the #1 place to get the '
               'scoop.\u200b ... Join Will Marfuggi, Melanie Bromley and E! News '
               'correspondents as they discuss the hottest pop ...',
    'title': 'E! News - YouTube',
    'url': 'https://www.youtube.com/channel/UCjDsbbzHgTrGc4Ff26TJtsA'},
   {'content': '05 Nov 2017 · The family travels to Cleveland to support Tristan '
               'during one of his big games.\u200b Plus, Kourtney ...Duration: '
               '2:27 Posted: 05 Nov 2017',
    'title': '"Keeping Up With the Kardashians" Katch-Up S14, EP.6 | E! - '
             'YouTube',
    'url': 'https://www.youtube.com/watch?v=u4GoS0-R9Sg'},
   {'content': 'E! (an initialism for Entertainment Television) is an American '
               'pay television channel that is owned by the NBCUniversal Cable '
               'Entertainment Group division of NBCUniversal, all owned by '
               'Comcast.',
    'title': 'E! - Wikipedia',
    'url': 'https://en.wikipedia.org/wiki/E!'},
   {'content': 'Wil je online spelen? Ontdek de e-games van de Nationale Loterij '
               'Win for Life, Presto, Subito, Cash, 21, Spaarpot Smash, '
               'SuperSafe, Rabbits Run.',
    'title': 'E-games Nationale Loterij - Online Spelen - e-lotto.be',
    'url': 'https://www.e-lotto.be/NL/eGames'},
   {'content': 'Th!nk E presents its activities in the field of smart energy, '
               'smart grids, smart buildings and EU projects.',
    'title': 'Th!nk E',
    'url': 'https://www.think-e.be/'}]

the different is only newline here

unixfox · 2019-06-27T12:01:35Z

i actually doubt this method, because this assume user/searx instance will never got old google html format, which may or may not be true.

What if we send the user agent of Google Chrome to be sure to have the new Google ui? Would that work?

rachmadaniHaryono · 2019-06-27T12:46:26Z

imo there should be a way to count if the parser meet old or new format

i will look into searx logging and maybe create a script that will count that

MarcAbonce · 2019-06-28T03:44:50Z

I don't have any stats, but it looks like Google has been returning this new HTMLs every time since #1596 was posted two months ago. That's why I think it's reasonable to assume that Google always returns this HTML, at least as long as the user agent is from Firefox browser.

However, I agree that it would be nice to have some stats to be sure.

rachmadaniHaryono · 2019-07-15T23:05:51Z

https://github.com/rachmadaniHaryono/searx/tree/feature/logging-and-script

it consist of change on init file and a script file to print out the number of log message and it's count. it require tabulate to print the counter table. there is optionals appdirs package requirement to put log file to user data dir instead cwd

to enable the message first set env var SEARX_LOG_TO_FILE to 1 or true

e: also only tested on python3

vuori · 2019-11-22T14:05:12Z

I'm seeing class kCrYT instead of jfp3ef now, but otherwise the XPaths here work. Starting from a elements that have a well-known href prefix (such as /url) and guessing class-of-the-day from parent elements might work as a more lasting heuristic.

MarcAbonce · 2019-11-24T02:20:25Z

I just updated this PR with the new xpath. I've still never seen Google return the older HTML with any of the Firefox user agents.

Making the xpaths more generic would be great, but last time I tried that it turned out to be tricky because Google used very similar looking elements for normal results, suggestions and other stuff. So for the time being, I'm just going to keep it like this.

unixfox · 2019-11-24T10:06:49Z

You should check out the HTML given by Google when using IE as a user agent, it seems it's a better looking code than just Google without JS enabled.
HTML code with just JS disabled: https://paste.ee/p/Z2uRj
HTML code with IE user agent (and JS disabled): https://paste.ee/p/Nkj0D

Maybe it would be simpler to parse than the HTML code with just JS disabled.

MarcAbonce · 2019-11-25T05:21:20Z

Yeah, but I don't know for how long are this old browsers going to work with the old UI, so eventually we'll probably have to switch to the new UI anyway.

But both approaches work for now anyway, so either of our PRs could be merged.

unixfox · 2019-11-25T06:36:46Z

It's actually the new UI that IE has now but a simpler version of it because it doesn't support some features. That's why I provided the HTML source code in case it would be easier to make a parser for this version of the new UI.

return42 · 2019-12-01T16:31:47Z

@MarcAbonce can please solve the conflict with a appropriate patch / thanks a lot!

return42 · 2019-12-06T17:38:47Z

Merged into my dm-cloud branch / tested locally and it works / thanks a lot for this straight forward solution!

ATM the CAPTCHA problem resist on my server, hopefully this wil be fixed one day this link should show you the working patch: !go foo

unixfox · 2019-12-06T19:04:50Z

I just tested this PR on my public Searx instance and it seems to work without any issue.
When the conflicts have been resolved, this PR can be merged to restore the Google engine on Searx.

immanuelfodor · 2019-12-06T19:27:22Z

Also tested this PR. After merging the master into it to have the latest changes, there is a conflict as it has a hardcoded Safari browser agent what this branch not. Resolved it by removing the master changes, then gave it a go, and the Google engine works fine. Thanks for the effort!

immanuelfodor · 2019-12-06T19:29:31Z

I had the issue open on mobile view where Unixfox's comment just appeared after posting the comment, sorry for the duplication, at least two of us says it's fine 😃

return42 · 2019-12-07T12:16:25Z

at least two of us says it's fine

three! .. you forgot me :)

there is a conflict as it has a hardcoded Safari browser agent what this branch not. Resolved it by removing the master changes,

correct, when merging this PR, you have to remove the user-agent filed, which comes from the interim solution (PR #1749) for the old google UI.

return42 · 2019-12-07T16:40:01Z

We need to merge this PR ASAP

see my #729 (comment)

MarcAbonce · 2019-12-07T23:57:46Z

Sorry, I hadn't noticed this posts here.
I just rebased this PR.

asciimoo

Great, thanks for the PR and for the notes.

rachmadaniHaryono mentioned this pull request Jun 27, 2019

Feature/parse obfuscated google #1603

Closed

dalf added the engine label Aug 2, 2019

MarcAbonce force-pushed the google_fix branch 2 times, most recently from 73a46b7 to 4389dca Compare November 24, 2019 01:56

return42 mentioned this pull request Dec 6, 2019

Scraping Google in the client side #1608

Closed

unixfox mentioned this pull request Dec 6, 2019

Google never returns results: Sorry! we didn't find any results. Please use another query or search in more categories. #1748

Closed

return42 mentioned this pull request Dec 7, 2019

Google Captcha #729

Open

[fix] update xpaths for new google results page

ccaf6ca

MarcAbonce force-pushed the google_fix branch from 4389dca to ccaf6ca Compare December 7, 2019 23:37

asciimoo approved these changes Dec 9, 2019

View reviewed changes

asciimoo merged commit 4cddb82 into searx:master Dec 9, 2019

MarcAbonce deleted the google_fix branch December 10, 2019 04:41

unixfox mentioned this pull request Dec 10, 2019

less Google results depending on the SearX language option #1236

Closed

unixfox mentioned this pull request Dec 21, 2019

No search results on searx.me and other instances #1780

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] Update xpaths for new Google results page #1628

[fix] Update xpaths for new Google results page #1628

MarcAbonce commented Jun 26, 2019

rachmadaniHaryono commented Jun 26, 2019 •

edited

Loading

MarcAbonce commented Jun 27, 2019

rachmadaniHaryono commented Jun 27, 2019 •

edited

Loading

rachmadaniHaryono commented Jun 27, 2019 •

edited

Loading

unixfox commented Jun 27, 2019

rachmadaniHaryono commented Jun 27, 2019

MarcAbonce commented Jun 28, 2019

rachmadaniHaryono commented Jul 15, 2019 •

edited

Loading

vuori commented Nov 22, 2019

MarcAbonce commented Nov 24, 2019

unixfox commented Nov 24, 2019

MarcAbonce commented Nov 25, 2019

unixfox commented Nov 25, 2019

return42 commented Dec 1, 2019

return42 commented Dec 6, 2019

unixfox commented Dec 6, 2019

immanuelfodor commented Dec 6, 2019

immanuelfodor commented Dec 6, 2019

return42 commented Dec 7, 2019

return42 commented Dec 7, 2019

MarcAbonce commented Dec 7, 2019

asciimoo left a comment

[fix] Update xpaths for new Google results page #1628

[fix] Update xpaths for new Google results page #1628

Conversation

MarcAbonce commented Jun 26, 2019

rachmadaniHaryono commented Jun 26, 2019 • edited Loading

MarcAbonce commented Jun 27, 2019

rachmadaniHaryono commented Jun 27, 2019 • edited Loading

rachmadaniHaryono commented Jun 27, 2019 • edited Loading

unixfox commented Jun 27, 2019

rachmadaniHaryono commented Jun 27, 2019

MarcAbonce commented Jun 28, 2019

rachmadaniHaryono commented Jul 15, 2019 • edited Loading

vuori commented Nov 22, 2019

MarcAbonce commented Nov 24, 2019

unixfox commented Nov 24, 2019

MarcAbonce commented Nov 25, 2019

unixfox commented Nov 25, 2019

return42 commented Dec 1, 2019

return42 commented Dec 6, 2019

unixfox commented Dec 6, 2019

immanuelfodor commented Dec 6, 2019

immanuelfodor commented Dec 6, 2019

return42 commented Dec 7, 2019

return42 commented Dec 7, 2019

We need to merge this PR ASAP

MarcAbonce commented Dec 7, 2019

asciimoo left a comment

Choose a reason for hiding this comment

rachmadaniHaryono commented Jun 26, 2019 •

edited

Loading

rachmadaniHaryono commented Jun 27, 2019 •

edited

Loading

rachmadaniHaryono commented Jun 27, 2019 •

edited

Loading

rachmadaniHaryono commented Jul 15, 2019 •

edited

Loading