Skip to content
This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

[fix] Update xpaths for new Google results page #1628

Merged
merged 1 commit into from
Dec 9, 2019

Conversation

MarcAbonce
Copy link
Contributor

Should fix #1609, or at least most of it.

This is very similar to #1603 in scope, with a few differences:

  • Once you remove the hardcoded user-agent, the old xpaths are never used anymore, so they can be completely replaced. This way, we don't have to maintain two different versions of result parsing.
  • This uses the old code for parsing results, which already handles some edge cases like extracting text from nested divs.
  • Suggestions and spelling corrections are also fixed.
  • The unit test was updated.

Instant answers and number of results are still not working, but I think they can be left for later.

@rachmadaniHaryono
Copy link
Contributor

rachmadaniHaryono commented Jun 26, 2019

can you test it against html here?

#1596 (comment)

e:

- [{'content': 'Your source for entertainment news, celebrities, celeb news, and '
-              'celebrity gossip\u200b. Check out the hottest fashion, photos, '
-              'movies and TV shows!',
+ [{'content': 'Your source for entertainment\n'
+              'news, celebrities, celeb news, and celebrity gossip\u200b. Check '
+              'out the\n'
+              'hottest fashion, photos, movies and TV shows!',
    'title': 'E! News',
    'url': 'https://www.eonline.com/'},
   {'content': 'E! Watch later. Share. 7:42. 0:00 / 7:42.',
    'title': 'E! Entertainment - YouTube',
    'url': 'https://www.youtube.com/channel/UCj7V_ikJOXO9RC8at6kYfHQ'},
   {'content': 'Welcome to the E! News YouTube Channel.\u200b ... From latest '
               'celebrity and pop culture news to freshest E! Original series on '
               'Youtube, the E! News Youtube Channel is the #1 place to get the '
               'scoop.\u200b ... Join Will Marfuggi, Melanie Bromley and E! News '
               'correspondents as they discuss the hottest pop ...',
    'title': 'E! News - YouTube',
    'url': 'https://www.youtube.com/channel/UCjDsbbzHgTrGc4Ff26TJtsA'},
+  {'content': '',
-  {'content': '05 Nov 2017 · The family travels to Cleveland to support Tristan '
-              'during one of his big games.\u200b Plus, Kourtney ...Duration: '
-              '2:27 Posted: 05 Nov 2017',
    'title': '"Keeping Up With the Kardashians" Katch-Up S14, EP.6 | E! - '
             'YouTube',
    'url': 'https://www.youtube.com/watch?v=u4GoS0-R9Sg'},
   {'content': 'E! (an initialism for Entertainment Television) is an American '
               'pay television channel that is owned by the NBCUniversal Cable '
               'Entertainment Group division of NBCUniversal, all owned by '
               'Comcast.',
    'title': 'E! - Wikipedia',
    'url': 'https://en.wikipedia.org/wiki/E!'},
   {'content': 'Wil je online spelen? Ontdek de e-games van de Nationale Loterij '
               'Win for Life, Presto, Subito, Cash, 21, Spaarpot Smash, '
               'SuperSafe, Rabbits Run.',
    'title': 'E-games Nationale Loterij - Online Spelen - e-lotto.be',
    'url': 'https://www.e-lotto.be/NL/eGames'},
   {'content': 'Th!nk E presents its activities in the field of smart energy, '
               'smart grids, smart buildings and EU projects.',
    'title': 'Th!nk E',
    'url': 'https://www.think-e.be/'}]

that is the diff result from #1603 with new html from @unixfox from #1596

i think it is better because it get that missing content on https://www.youtube.com/watch?v=u4GoS0-R9Sg

see updated below

@MarcAbonce
Copy link
Contributor Author

Yeah, if I parse newUIGoogle.html with my branch, I get the following 7 results:
['https://www.eonline.com/', 'https://www.youtube.com/channel/UCj7V_ikJOXO9RC8at6kYfHQ', 'https://www.youtube.com/channel/UCjDsbbzHgTrGc4Ff26TJtsA', 'https://www.youtube.com/watch?v=u4GoS0-R9Sg', 'https://en.wikipedia.org/wiki/E!', 'https://www.e-lotto.be/NL/eGames', 'https://www.think-e.be/']

I get the exact same results with oldUIGoogle.html using the version in master.

@rachmadaniHaryono
Copy link
Contributor

rachmadaniHaryono commented Jun 27, 2019

Once you remove the hardcoded user-agent, the old xpaths are never used anymore, so they can be completely replaced. This way, we don't have to maintain two different versions of result parsing.

i actually doubt this method, because this assume user/searx instance will never got old google html format, which may or may not be true.

i would love if there is some data about this


i actually feel conflicted about this because i make #1603

i want to integrate this to my pr, but it kinda hard because different method (keeping the old parser vs change it)


The unit test was updated.

this show that engine test need to be changed (and why i create #1606)

searx should have html paired with parsed content data, so engine developer can compare between each version/commit similar to seedpeer engine test

@rachmadaniHaryono
Copy link
Contributor

rachmadaniHaryono commented Jun 27, 2019

diff result from #1603 with new html from @unixfox from #1596

- [{'content': 'Your source for entertainment news, celebrities, celeb news, and '
-              'celebrity gossip\u200b. Check out the hottest fashion, photos, '
-              'movies and TV shows!',
+ [{'content': 'Your source for entertainment\n'
+              'news, celebrities, celeb news, and celebrity gossip\u200b. Check '
+              'out the\n'
+              'hottest fashion, photos, movies and TV shows!',
    'title': 'E! News',
    'url': 'https://www.eonline.com/'},
   {'content': 'E! Watch later. Share. 7:42. 0:00 / 7:42.',
    'title': 'E! Entertainment - YouTube',
    'url': 'https://www.youtube.com/channel/UCj7V_ikJOXO9RC8at6kYfHQ'},
   {'content': 'Welcome to the E! News YouTube Channel.\u200b ... From latest '
               'celebrity and pop culture news to freshest E! Original series on '
               'Youtube, the E! News Youtube Channel is the #1 place to get the '
               'scoop.\u200b ... Join Will Marfuggi, Melanie Bromley and E! News '
               'correspondents as they discuss the hottest pop ...',
    'title': 'E! News - YouTube',
    'url': 'https://www.youtube.com/channel/UCjDsbbzHgTrGc4Ff26TJtsA'},
   {'content': '05 Nov 2017 · The family travels to Cleveland to support Tristan '
               'during one of his big games.\u200b Plus, Kourtney ...Duration: '
               '2:27 Posted: 05 Nov 2017',
    'title': '"Keeping Up With the Kardashians" Katch-Up S14, EP.6 | E! - '
             'YouTube',
    'url': 'https://www.youtube.com/watch?v=u4GoS0-R9Sg'},
   {'content': 'E! (an initialism for Entertainment Television) is an American '
               'pay television channel that is owned by the NBCUniversal Cable '
               'Entertainment Group division of NBCUniversal, all owned by '
               'Comcast.',
    'title': 'E! - Wikipedia',
    'url': 'https://en.wikipedia.org/wiki/E!'},
   {'content': 'Wil je online spelen? Ontdek de e-games van de Nationale Loterij '
               'Win for Life, Presto, Subito, Cash, 21, Spaarpot Smash, '
               'SuperSafe, Rabbits Run.',
    'title': 'E-games Nationale Loterij - Online Spelen - e-lotto.be',
    'url': 'https://www.e-lotto.be/NL/eGames'},
   {'content': 'Th!nk E presents its activities in the field of smart energy, '
               'smart grids, smart buildings and EU projects.',
    'title': 'Th!nk E',
    'url': 'https://www.think-e.be/'}]

the different is only newline here

@unixfox
Copy link
Member

unixfox commented Jun 27, 2019

i actually doubt this method, because this assume user/searx instance will never got old google html format, which may or may not be true.

What if we send the user agent of Google Chrome to be sure to have the new Google ui? Would that work?

@rachmadaniHaryono
Copy link
Contributor

imo there should be a way to count if the parser meet old or new format

i will look into searx logging and maybe create a script that will count that

@MarcAbonce
Copy link
Contributor Author

I don't have any stats, but it looks like Google has been returning this new HTMLs every time since #1596 was posted two months ago. That's why I think it's reasonable to assume that Google always returns this HTML, at least as long as the user agent is from Firefox browser.

However, I agree that it would be nice to have some stats to be sure.

@rachmadaniHaryono
Copy link
Contributor

rachmadaniHaryono commented Jul 15, 2019

https://github.com/rachmadaniHaryono/searx/tree/feature/logging-and-script

it consist of change on init file and a script file to print out the number of log message and it's count. it require tabulate to print the counter table. there is optionals appdirs package requirement to put log file to user data dir instead cwd

to enable the message first set env var SEARX_LOG_TO_FILE to 1 or true

e: also only tested on python3

@dalf dalf added the engine label Aug 2, 2019
@vuori
Copy link

vuori commented Nov 22, 2019

I'm seeing class kCrYT instead of jfp3ef now, but otherwise the XPaths here work. Starting from a elements that have a well-known href prefix (such as /url) and guessing class-of-the-day from parent elements might work as a more lasting heuristic.

@MarcAbonce MarcAbonce force-pushed the google_fix branch 2 times, most recently from 73a46b7 to 4389dca Compare November 24, 2019 01:56
@MarcAbonce
Copy link
Contributor Author

I just updated this PR with the new xpath. I've still never seen Google return the older HTML with any of the Firefox user agents.

Making the xpaths more generic would be great, but last time I tried that it turned out to be tricky because Google used very similar looking elements for normal results, suggestions and other stuff. So for the time being, I'm just going to keep it like this.

@unixfox
Copy link
Member

unixfox commented Nov 24, 2019

You should check out the HTML given by Google when using IE as a user agent, it seems it's a better looking code than just Google without JS enabled.
HTML code with just JS disabled: https://paste.ee/p/Z2uRj
HTML code with IE user agent (and JS disabled): https://paste.ee/p/Nkj0D

Maybe it would be simpler to parse than the HTML code with just JS disabled.

@MarcAbonce
Copy link
Contributor Author

Yeah, but I don't know for how long are this old browsers going to work with the old UI, so eventually we'll probably have to switch to the new UI anyway.

But both approaches work for now anyway, so either of our PRs could be merged.

@unixfox
Copy link
Member

unixfox commented Nov 25, 2019

It's actually the new UI that IE has now but a simpler version of it because it doesn't support some features. That's why I provided the HTML source code in case it would be easier to make a parser for this version of the new UI.

@return42
Copy link
Contributor

return42 commented Dec 1, 2019

@MarcAbonce can please solve the conflict with a appropriate patch / thanks a lot!

@return42
Copy link
Contributor

return42 commented Dec 6, 2019

Merged into my dm-cloud branch / tested locally and it works / thanks a lot for this straight forward solution!

ATM the CAPTCHA problem resist on my server, hopefully this wil be fixed one day this link should show you the working patch: !go foo

@unixfox
Copy link
Member

unixfox commented Dec 6, 2019

I just tested this PR on my public Searx instance and it seems to work without any issue.
When the conflicts have been resolved, this PR can be merged to restore the Google engine on Searx.

@immanuelfodor
Copy link

Also tested this PR. After merging the master into it to have the latest changes, there is a conflict as it has a hardcoded Safari browser agent what this branch not. Resolved it by removing the master changes, then gave it a go, and the Google engine works fine. Thanks for the effort!

@immanuelfodor
Copy link

I had the issue open on mobile view where Unixfox's comment just appeared after posting the comment, sorry for the duplication, at least two of us says it's fine 😃

@return42
Copy link
Contributor

return42 commented Dec 7, 2019

at least two of us says it's fine

three! .. you forgot me :)

there is a conflict as it has a hardcoded Safari browser agent what this branch not. Resolved it by removing the master changes,

correct, when merging this PR, you have to remove the user-agent filed, which comes from the interim solution (PR #1749) for the old google UI.

@return42 return42 mentioned this pull request Dec 7, 2019
@return42
Copy link
Contributor

return42 commented Dec 7, 2019

We need to merge this PR ASAP

see my #729 (comment)

@MarcAbonce
Copy link
Contributor Author

Sorry, I hadn't noticed this posts here.
I just rebased this PR.

Copy link
Member

@asciimoo asciimoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for the PR and for the notes.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle parsing of the new Google UI
8 participants