[fix] Update xpaths for new Google results page #1628
Can you test it against the HTML here? e:
- [{'content': 'Your source for entertainment news, celebrities, celeb news, and '
- 'celebrity gossip\u200b. Check out the hottest fashion, photos, '
- 'movies and TV shows!',
+ [{'content': 'Your source for entertainment\n'
+ 'news, celebrities, celeb news, and celebrity gossip\u200b. Check '
+ 'out the\n'
+ 'hottest fashion, photos, movies and TV shows!',
'title': 'E! News',
'url': 'https://www.eonline.com/'},
{'content': 'E! Watch later. Share. 7:42. 0:00 / 7:42.',
'title': 'E! Entertainment - YouTube',
'url': 'https://www.youtube.com/channel/UCj7V_ikJOXO9RC8at6kYfHQ'},
{'content': 'Welcome to the E! News YouTube Channel.\u200b ... From latest '
'celebrity and pop culture news to freshest E! Original series on '
'Youtube, the E! News Youtube Channel is the #1 place to get the '
'scoop.\u200b ... Join Will Marfuggi, Melanie Bromley and E! News '
'correspondents as they discuss the hottest pop ...',
'title': 'E! News - YouTube',
'url': 'https://www.youtube.com/channel/UCjDsbbzHgTrGc4Ff26TJtsA'},
+ {'content': '',
- {'content': '05 Nov 2017 · The family travels to Cleveland to support Tristan '
- 'during one of his big games.\u200b Plus, Kourtney ...Duration: '
- '2:27 Posted: 05 Nov 2017',
'title': '"Keeping Up With the Kardashians" Katch-Up S14, EP.6 | E! - '
'YouTube',
'url': 'https://www.youtube.com/watch?v=u4GoS0-R9Sg'},
{'content': 'E! (an initialism for Entertainment Television) is an American '
'pay television channel that is owned by the NBCUniversal Cable '
'Entertainment Group division of NBCUniversal, all owned by '
'Comcast.',
'title': 'E! - Wikipedia',
'url': 'https://en.wikipedia.org/wiki/E!'},
{'content': 'Wil je online spelen? Ontdek de e-games van de Nationale Loterij '
'Win for Life, Presto, Subito, Cash, 21, Spaarpot Smash, '
'SuperSafe, Rabbits Run.',
'title': 'E-games Nationale Loterij - Online Spelen - e-lotto.be',
'url': 'https://www.e-lotto.be/NL/eGames'},
{'content': 'Th!nk E presents its activities in the field of smart energy, '
'smart grids, smart buildings and EU projects.',
'title': 'Th!nk E',
'url': 'https://www.think-e.be/'}]
see updated below |
Yeah, if I parse newUIGoogle.html with my branch, I get the following 7 results: I get the exact same results with oldUIGoogle.html using the version in master. |
I actually doubt this method, because it assumes that a user or searx instance will never get the old Google HTML format, which may or may not be true. I would love to see some data about this. I also feel conflicted because I made #1603. I want to integrate this into my PR, but that is hard because the two take different approaches (keeping the old parser vs. replacing it).
This shows that the engine tests need to change (which is why I created #1606). searx should have HTML paired with the parsed content data, so engine developers can compare results between versions/commits, similar to the seedpeer engine test.
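The pairing idea could look roughly like the sketch below: a saved HTML page is stored next to the results it is expected to parse into, and a check compares the two. This is only an illustration. `parse_results`, `check_fixture`, the sample markup and the expected data are all hypothetical; they are not searx's engine code or Google's real HTML.

```python
import json
from html.parser import HTMLParser


class _LinkTitleParser(HTMLParser):
    """Toy parser: collects url/title for every <a href=...><h3>title</h3></a>."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None     # href of the <a> we are currently inside
        self._in_h3 = False   # True while inside an <h3> under that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "h3" and self._href:
            self._in_h3 = True
            self.results.append({"url": self._href, "title": ""})

    def handle_data(self, data):
        if self._in_h3:
            self.results[-1]["title"] += data

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "a":
            self._href = None


def parse_results(html):
    """Hypothetical stand-in for an engine's response parser."""
    parser = _LinkTitleParser()
    parser.feed(html)
    return parser.results


# In a real test suite these would be two files per saved page (HTML + JSON).
SAVED_HTML = '<div><a href="https://example.org/"><h3>Example</h3></a></div>'
EXPECTED_JSON = '[{"url": "https://example.org/", "title": "Example"}]'


def check_fixture(html, expected_json):
    """Return True when the parser output matches the stored expectation."""
    return parse_results(html) == json.loads(expected_json)
```

With one such pair per saved page, the same fixture can be rerun against each commit to see whether the parsed output changed.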
Diff of the results from #1603 against the new HTML that @unixfox posted in #1596:
- [{'content': 'Your source for entertainment news, celebrities, celeb news, and '
- 'celebrity gossip\u200b. Check out the hottest fashion, photos, '
- 'movies and TV shows!',
+ [{'content': 'Your source for entertainment\n'
+ 'news, celebrities, celeb news, and celebrity gossip\u200b. Check '
+ 'out the\n'
+ 'hottest fashion, photos, movies and TV shows!',
'title': 'E! News',
'url': 'https://www.eonline.com/'},
{'content': 'E! Watch later. Share. 7:42. 0:00 / 7:42.',
'title': 'E! Entertainment - YouTube',
'url': 'https://www.youtube.com/channel/UCj7V_ikJOXO9RC8at6kYfHQ'},
{'content': 'Welcome to the E! News YouTube Channel.\u200b ... From latest '
'celebrity and pop culture news to freshest E! Original series on '
'Youtube, the E! News Youtube Channel is the #1 place to get the '
'scoop.\u200b ... Join Will Marfuggi, Melanie Bromley and E! News '
'correspondents as they discuss the hottest pop ...',
'title': 'E! News - YouTube',
'url': 'https://www.youtube.com/channel/UCjDsbbzHgTrGc4Ff26TJtsA'},
{'content': '05 Nov 2017 · The family travels to Cleveland to support Tristan '
'during one of his big games.\u200b Plus, Kourtney ...Duration: '
'2:27 Posted: 05 Nov 2017',
'title': '"Keeping Up With the Kardashians" Katch-Up S14, EP.6 | E! - '
'YouTube',
'url': 'https://www.youtube.com/watch?v=u4GoS0-R9Sg'},
{'content': 'E! (an initialism for Entertainment Television) is an American '
'pay television channel that is owned by the NBCUniversal Cable '
'Entertainment Group division of NBCUniversal, all owned by '
'Comcast.',
'title': 'E! - Wikipedia',
'url': 'https://en.wikipedia.org/wiki/E!'},
{'content': 'Wil je online spelen? Ontdek de e-games van de Nationale Loterij '
'Win for Life, Presto, Subito, Cash, 21, Spaarpot Smash, '
'SuperSafe, Rabbits Run.',
'title': 'E-games Nationale Loterij - Online Spelen - e-lotto.be',
'url': 'https://www.e-lotto.be/NL/eGames'},
{'content': 'Th!nk E presents its activities in the field of smart energy, '
'smart grids, smart buildings and EU projects.',
'title': 'Th!nk E',
'url': 'https://www.think-e.be/'}]
The only difference here is the newlines.
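Since the two outputs differ only in line breaks inside `content`, a comparison that collapses whitespace first would treat them as equal. A minimal sketch, assuming the results are lists of dicts shaped like the ones in the diff; `normalize` and `same_results` are hypothetical helper names, and the sample strings are shortened from the first result above.

```python
import re


def normalize(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def same_results(old, new):
    """Compare two result lists, ignoring whitespace differences in 'content'."""
    def canon(results):
        return [dict(r, content=normalize(r.get("content", ""))) for r in results]
    return canon(old) == canon(new)


old = [{"content": "Your source for entertainment news, celebrities",
        "title": "E! News", "url": "https://www.eonline.com/"}]
new = [{"content": "Your source for entertainment\nnews, celebrities",
        "title": "E! News", "url": "https://www.eonline.com/"}]
```

Here `old == new` is false, while `same_results(old, new)` is true.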
What if we send the Google Chrome user agent, to be sure we get the new Google UI? Would that work?
IMO there should be a way to count whether the parser meets the old or the new format. I will look into searx logging and maybe create a script that counts that.
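One way such counting might work is a simple tally keyed by which layout a response appears to use. This is a rough sketch only: `record_format` is a hypothetical helper, and `new_ui_marker` is a placeholder substring, since the thread does not say what actually distinguishes the two formats.

```python
from collections import Counter

# Running tally of how often each layout was seen.
format_counter = Counter()


def record_format(html, new_ui_marker):
    """Tally whether a response looks like the new or the old layout.

    `new_ui_marker` is a placeholder; a real check would test for
    whatever element or class is unique to the new page.
    """
    kind = "new" if new_ui_marker in html else "old"
    format_counter[kind] += 1
    return kind


# Three made-up responses: two contain the placeholder marker, one does not.
for page in ('<div class="hypothetical-new">', "<div>", '<div class="hypothetical-new">'):
    record_format(page, "hypothetical-new")
```

After enough real requests, the ratio of `format_counter["old"]` to `format_counter["new"]` would give the kind of data asked for above.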
I don't have any stats, but it looks like Google has been returning this new HTML every time since #1596 was posted two months ago. That's why I think it's reasonable to assume that Google always returns this HTML, at least as long as the user agent is from a Firefox browser. However, I agree that it would be nice to have some stats to be sure.
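For reference, pinning the user agent on a request can be sketched with the standard library alone. This is not how searx builds its requests; `build_search_request` and the user agent string are assumptions for illustration, and the request is only constructed, never sent.

```python
from urllib.request import Request

# Placeholder value; a real deployment would use a current browser user agent.
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"


def build_search_request(query_url, user_agent=FIREFOX_UA):
    """Build (but do not send) a request that pins the User-Agent header."""
    return Request(query_url, headers={"User-Agent": user_agent})


req = build_search_request("https://www.google.com/search?q=test")
```

Swapping in a Chrome or IE string through the `user_agent` argument would be the way to try the other suggestions in this thread.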
https://github.com/rachmadaniHaryono/searx/tree/feature/logging-and-script It consists of a change to the init file and a script file that prints each log message and its count. It requires tabulate to print the counter table. There is an optional appdirs package requirement to put the log file in the user data dir instead of the cwd. To enable the messages, first set the env var e: Also, this has only been tested on Python 3.
I'm seeing class |
73a46b7 to 4389dca
I just updated this PR with the new xpath. I've still never seen Google return the older HTML with any of the Firefox user agents. Making the xpaths more generic would be great, but last time I tried that it turned out to be tricky because Google used very similar looking elements for normal results, suggestions and other stuff. So for the time being, I'm just going to keep it like this. |
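The trade-off between generic and specific xpaths can be illustrated on made-up markup. The snippet below is not Google's HTML and the class names are invented. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath; the engine itself may rely on a different parser.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup: one normal result and one suggestion
# that share the same <a><h3> structure.
PAGE = """
<div id="main">
  <div class="result"><a href="https://example.org/a"><h3>Normal result</h3></a></div>
  <div class="suggestion"><a href="https://example.org/s"><h3>Suggested search</h3></a></div>
</div>
"""

root = ET.fromstring(PAGE.strip())

# Generic path: matches every linked heading, so the suggestion sneaks in.
generic_titles = [h3.text for h3 in root.findall(".//a/h3")]

# Specific path: restricted to the container class used for normal results.
specific_titles = [h3.text for h3 in root.findall(".//div[@class='result']/a/h3")]
```

The generic path survives a class rename but also picks up the suggestion, while the specific path stays clean only as long as the class name does not change.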
You should check out the HTML Google returns when using IE as the user agent. It seems to be cleaner-looking code than what Google returns with JS disabled, so it might be simpler to parse.
Yeah, but I don't know how long these old browsers are going to keep working with the old UI, so eventually we'll probably have to switch to the new UI anyway. But both approaches work for now, so either of our PRs could be merged.
It's actually the new UI that IE has now but a simpler version of it because it doesn't support some features. That's why I provided the HTML source code in case it would be easier to make a parser for this version of the new UI. |
@MarcAbonce can you please resolve the conflict with an appropriate patch? Thanks a lot!
I just tested this PR on my public Searx instance and it seems to work without any issue. |
Also tested this PR. After merging master into it to get the latest changes, there is a conflict because master has a hardcoded Safari user agent that this branch does not. I resolved it by removing the master changes, then gave it a go, and the Google engine works fine. Thanks for the effort!
I had the issue open in mobile view, where Unixfox's comment only appeared after I posted mine. Sorry for the duplication; at least two of us say it's fine 😃
three! .. you forgot me :)
Correct: when merging this PR, you have to remove the user-agent field, which comes from the interim solution (PR #1749) for the old Google UI.
We need to merge this PR ASAP; see my #729 (comment)
4389dca to ccaf6ca
Sorry, I hadn't noticed these posts here.
Great, thanks for the PR and for the notes.
Should fix #1609, or at least most of it.
This is very similar to #1603 in scope, with a few differences:
Instant answers and number of results are still not working, but I think they can be left for later.