Extend kennankole's solution #385
Open
Judahmeek wants to merge 9 commits into
Parse the bundled SERP HTML into a SerpApi-shaped array of
`{name, extensions, link, image}` without making HTTP requests.
- Detect carousel tiles by structural signal (`/search?...&stick=...` sibling groups),
not volatile Google CSS classes, so the parser works across Van Gogh and variant fixtures.
- Resolve thumbnails by parsing `_setImagesSrc(ii, s, r)` blocks into an `id -> image` map, including unescaping `\x3d` and `\/` values emitted in inline JS.
- Extract `extensions` from leaf text nodes under each anchor to avoid container-text noise (for example, concatenated `name+year`).
- Resolve `image` from values already present in the page file:
inline JS mapping, inline non-placeholder data URIs, and in-file `data-src`/`src` URLs.
- Add comprehensive RSpec coverage for golden output, cross-layout fixtures, item parsing, thumbnail indexing, and carousel selection behavior.
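For reviewers who want a feel for the thumbnail-indexing step, here is a minimal sketch rather than the code in this PR. It assumes each inline block has the shape `var s='...';var ii=['id', ...];_setImagesSrc(ii,s);`. The names `SET_IMAGES_BLOCK`, `unescape_js`, and `thumbnail_index` are hypothetical and exist only for illustration.

```ruby
# Assumed block shape (not verified against every fixture):
#   var s='data:image/...';var ii=['dimg_1'];_setImagesSrc(ii,s);
SET_IMAGES_BLOCK =
  /var\s+s\s*=\s*'([^']*)'\s*;\s*var\s+ii\s*=\s*\[([^\]]*)\]\s*;\s*_setImagesSrc\(/

# Undo the JS string escapes mentioned above: \xNN hex escapes and \/ slashes.
def unescape_js(value)
  value
    .gsub(/\\x([0-9a-fA-F]{2})/) { [Regexp.last_match(1).hex].pack('U') }
    .gsub('\\/', '/')
end

# Build an id -> image map from every matching block in the page source.
def thumbnail_index(html)
  html.scan(SET_IMAGES_BLOCK).each_with_object({}) do |(src, ids), index|
    image = unescape_js(src)
    ids.scan(/'([^']+)'/).flatten.each { |id| index[id] = image }
  end
end
```

A tile's `image` can then be looked up by the `id` attribute of its `<img>` element, falling back to the other in-file sources listed above when the id is missing from the map.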
This is because I discovered that interactive search results, such as the results for "Tom Cruise Movies", do not contain anchors in the initial HTML.
CI fix
fix for missed @anchor reference
7b9fda8 to dc8b05e
Adding Tom Cruise filmography results to contrast with the Tom Cruise movies results. Adding the U.S. Presidents results because their parent data-attrid doesn't start with 'kc:' like most grid results.
The changes to the group score method are what I'm most proud of. The original method returned an array, which, when run through the max function (in the tiles method, around line 36), acts like a series of tiebreakers. This gives an overwhelming amount of weight to whatever quality proxy is measured first. The other aspect of my changes that I would like to draw your attention to is the use of environment variables. It's a basic feature, but one I don't recall seeing in my competitors' PRs.
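To illustrate the difference, here is a small sketch and not the PR's actual scoring code. The `Group` fields, the weight defaults, and the `SCORE_W_*` variable names are all assumptions made for the example. Ruby compares arrays element by element, so an array score lets the first signal decide unless there is an exact tie, while a weighted sum lets every signal contribute.

```ruby
# Hypothetical quality proxies for a candidate carousel group.
Group = Struct.new(:name, :tile_count, :image_ratio, :text_ratio)

# Array score: max/max_by compares lexicographically, so later entries
# only matter when every earlier entry is exactly equal.
def tuple_score(group)
  [group.tile_count, group.image_ratio, group.text_ratio]
end

# Weights come from the environment with defaults (variable names assumed).
def weights(env = ENV)
  {
    tile_count: Float(env.fetch('SCORE_W_TILES', '0.1')),
    image_ratio: Float(env.fetch('SCORE_W_IMAGES', '1.0')),
    text_ratio: Float(env.fetch('SCORE_W_TEXT', '1.0'))
  }
end

# Scalar score: each proxy contributes in proportion to its weight.
def weighted_score(group, w = weights)
  w.sum { |field, weight| weight * group[field] }
end

wide = Group.new('wide', 9, 0.1, 0.1)
rich = Group.new('rich', 8, 1.0, 1.0)

by_tuple  = [wide, rich].max_by { |g| tuple_score(g) }
by_weight = [wide, rich].max_by { |g| weighted_score(g, weights({})) }
```

With these made-up numbers, `by_tuple` is the `wide` group because its tile count is higher by one, while `by_weight` is the `rich` group because its image and text ratios outweigh that small difference. Setting one of the assumed variables, for example `SCORE_W_TILES=10`, shifts the balance without a code change.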
dc8b05e to 81a2bb2
One flaw I noticed in nearly all competitors was relying on Google's image lazy-load script not to change in any way. A more robust solution than mine would account for the _setImagesSrc function name to also possibly change & probably try only relying on the data:image structure as the initial clue. It would make scanning the first script more computationally expensive, but detected variables could then be used to speed up processing of subsequent scripts. Hopefully, Google never decides to combine all their lazy-loading scripts together. I'm not sure how that could be detected performantly, but I'm sure I could find a way, given enough time.
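As a rough sketch of that idea (not implemented in this PR), the loader's name could be inferred from the first script rather than hard-coded. The sketch assumes the script assigns a single-quoted `data:image/...` literal to a variable and then passes that variable to one function call. `DATA_URI_ASSIGNMENT` and `detect_loader_name` are hypothetical names.

```ruby
# Find a variable that is assigned a quoted data:image literal.
DATA_URI_ASSIGNMENT =
  /(?:var|let|const)\s+([A-Za-z_$][\w$]*)\s*=\s*'(data:image\/[^']+)'/

# Return the name of the first function called with that variable after the
# assignment, or nil when the script does not match the assumed shape.
def detect_loader_name(script)
  assignment = DATA_URI_ASSIGNMENT.match(script)
  return nil unless assignment

  variable = Regexp.escape(assignment[1])
  call = /([A-Za-z_$][\w$.]*)\s*\(\s*[^()]*\b#{variable}\b[^()]*\)/
  found = call.match(script, assignment.end(0))
  found && found[1]
end
```

The detected name could then be cached and used to build a cheaper, anchored pattern for the remaining scripts. This still would not cope with all the lazy-loading scripts being combined into one.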
42197b4 to e38aca0
Obviously, instead of trying to come up with my own solution & tests, I went & checked out the competition. @kennankole's solution (#379) was by far the most robust (although it also proved overly complicated). The only real flaws that it seemed to have are that it was more computationally expensive & that it had no way to detect drift, while a more brittle solution, such as one depending on CSS, will be faster and will break as soon as Google changes whatever CSS it depends on.
I initially tried to solve this by merging #381 & #379 together (you can see my initial refactor of @DanTaiko's work here), with the idea that parts of @kennankole's logic would serve as a backup for scenarios that @DanTaiko's solution couldn't cover, but the longer I worked on it, the more it felt unnecessarily complex, so I scrapped that idea & tried using #379 as a base.
That said, the ideal is still robust logic that addresses most possible variants & forms of drift, sitting behind a search index that provides performance for known solutions, and I hope that my code illustrates an approximate example of that.
...
P.S. The search results for "Tom Cruise films" are probably way outside the scope of what y'all expected us to cover, but if I were going to address them more effectively, I would have copied @dsojevic's solution of replacing the html mappings, since the initial search results for that particular kind of query even have the anchor links lazy-loaded.
Of course, redirecting users who searched for "Tom Cruise films" to search results for "Tom Cruise filmography" would probably be the best course of action.