fix: WestSuffolkCouncil — User-Agent + renamed h4 heading #1978
InertiaUK wants to merge 1 commit into robbrad:master from
Conversation
Two small upstream changes have been breaking the West Suffolk scraper:

1. westsuffolk.gov.uk's IIS now 404s requests with no `User-Agent`. Send a realistic browser UA so the bin-day page returns 200 with the collection panel.
2. The council renamed the collection panel heading from "Bin collection days" to "Bin collection days current", which broke the exact-string `find_all(..., string="Bin collection days")` guard and caused an empty result. Match as a substring via a lambda so either label works.

Verified against the live site for UPRN 10009739960 (IP28 6DR) — returns Black / Blue / Brown collection dates.

Supersedes the earlier WestSuffolk fix in robbrad#1955 (that branch also picked up unrelated commits from stacked council fixes and only addressed part of the problem).
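The substring match described in point 2 can be sketched with BeautifulSoup. This is an illustrative standalone snippet, not the scraper's actual code; `panel_search` and the sample HTML are assumptions:

```python
from bs4 import BeautifulSoup

# Sample markup using the council's *renamed* heading (illustrative).
HTML = """
<body>
  <h4>Bin collection days current</h4>
  <div>Black Bins: 29/04/2026</div>
</body>
"""

def panel_search(tag):
    # Relaxed match: accept any <h4> whose text *contains* the old label,
    # so both "Bin collection days" and "Bin collection days current" work.
    return tag.name == "h4" and "Bin collection days" in tag.get_text()

soup = BeautifulSoup(HTML, "html.parser")
matches = soup.body.find_all(panel_search)
print(len(matches))  # 1
```

An exact-string `find_all("h4", string="Bin collection days")` would return nothing here, which is the silent-empty failure this PR fixes.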
📝 Walkthrough
Updated the WestSuffolkCouncil scraper to include a realistic User-Agent header in HTTP requests and relaxed the HTML matching logic to handle heading label variations, enabling the parser to correctly retrieve bin collection data from the endpoint.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py (1)
45-53: ⚠️ Potential issue | 🟠 Major: Raise when the collection panel is missing.

The relaxed heading match is good, but if the panel still is not found, this returns `{"bins": []}` and reintroduces the silent-empty failure mode this PR is fixing.

Proposed fix:

```diff
-        collection_tag = soup.body.find_all(panel_search)
+        if soup.body is None:
+            raise ValueError("West Suffolk response did not contain a <body> element")
+
+        collection_tag = soup.body.find_all(panel_search)
+        if not collection_tag:
+            raise ValueError("West Suffolk bin collection panel not found")
```

Based on learnings, parsing council bin collection data should prefer explicit failures over silent defaults or swallowed errors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py` around lines 45 - 53, The panel search currently falls back to returning no data when the collection panel isn't found; after computing collection_tag = soup.body.find_all(panel_search) you must detect an empty result and raise an explicit exception (e.g., RuntimeError or a custom ParseError) instead of letting the code continue and return {"bins": []}; update the code around collection_tag and the panel_search helper so that if not collection_tag you raise a clear error indicating the WestSuffolkCouncil collection panel is missing, referencing panel_search and collection_tag so reviewers can find the change.
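As a standalone illustration of the failure-mode change this comment asks for, the sketch below raises on a missing panel instead of returning an empty result. Names like `find_collection_panel` are hypothetical, not from the repo:

```python
from bs4 import BeautifulSoup

def find_collection_panel(html: str):
    # Hypothetical helper: raise instead of silently yielding {"bins": []}
    # when the heading that anchors the collection panel is missing.
    soup = BeautifulSoup(html, "html.parser")
    if soup.body is None:
        raise ValueError("response did not contain a <body> element")
    tags = soup.body.find_all(
        lambda t: t.name == "h4" and "Bin collection days" in t.get_text()
    )
    if not tags:
        raise ValueError("bin collection panel not found")
    return tags

# A page without the expected heading now fails loudly:
try:
    find_collection_panel("<body><h4>Something else</h4></body>")
    outcome = "no error"
except ValueError as exc:
    outcome = str(exc)
print(outcome)  # bin collection panel not found
```

An explicit exception surfaces upstream page changes immediately, rather than reporting "no bins" to end users.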
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py`:
- Line 26: The HTTP fetch in WestSuffolkCouncil (where requests.get(api_url,
headers=headers) is called) must include a timeout and validate the response
before parsing: add a timeout argument to requests.get (e.g. timeout=10) and
call response.raise_for_status() immediately after the request and before
creating the BeautifulSoup object; ensure this mirrors the pattern used in
BelfastCityCouncil (use requests.get, response.raise_for_status(), then
BeautifulSoup) so 4xx/5xx responses raise and are not parsed silently.
📒 Files selected for processing (1)
uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py
```python
        "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
    )
}
response = requests.get(api_url, headers=headers)
```
🧩 Analysis chain

🏁 Scripts executed (repository: robbrad/UKBinCollectionData):

```shell
rg -nP --type=python -C3 'requests\.get\(api_url, headers=headers|raise_for_status\(|timeout='

cat -n uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py

rg -nA5 -B5 'requests\.get.*api_url' uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py

# Check what other council parsers do with HTTP error handling
rg -l 'raise_for_status' uk_bin_collection/uk_bin_collection/councils/ | head -5

# Check timeout usage patterns in council parsers
rg 'timeout=' uk_bin_collection/uk_bin_collection/councils/ | head -10

rg -B5 -A5 'raise_for_status' uk_bin_collection/uk_bin_collection/councils/BelfastCityCouncil.py | head -20

# Check if there's a base class or common pattern for HTTP requests
rg -B3 -A3 'requests\.get.*timeout' uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py | head -20
```
Add timeout and HTTP status validation before parsing the response.
The User-Agent header fixes the immediate 404, but this still parses any 4xx/5xx/error response silently and returns empty results. Add both a timeout and raise_for_status() before BeautifulSoup, consistent with patterns used in other council parsers like BelfastCityCouncil.
Proposed fix:

```diff
-        response = requests.get(api_url, headers=headers)
+        response = requests.get(api_url, headers=headers, timeout=30)
+        response.raise_for_status()
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
response = requests.get(api_url, headers=headers, timeout=30)
response.raise_for_status()
```
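The suggested pattern can be exercised without touching the network. The 30-second timeout mirrors the suggestion above; `fetch_bin_page` is an illustrative name, not a function in the repo:

```python
import requests

def fetch_bin_page(url: str, headers: dict, timeout: int = 30) -> str:
    # Fail fast: a 4xx/5xx now raises requests.HTTPError instead of
    # letting an error page flow into BeautifulSoup.
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text

# raise_for_status() behaviour, shown on a synthetic response (no network):
error_response = requests.models.Response()
error_response.status_code = 404
try:
    error_response.raise_for_status()
    outcome = "no error"
except requests.HTTPError:
    outcome = "HTTPError raised"
print(outcome)  # HTTPError raised
```

Without `raise_for_status()`, a 404 body would be parsed as HTML and quietly yield an empty bin list; with it, the error propagates to the caller.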
Works for me.
Included in May 2026 Release PR #1992. Closing. |
…esolved conflict with robbrad#1955)
Two upstream changes on the West Suffolk bin-day page had left the scraper silently failing (empty result, no exception):
- maps.westsuffolk.gov.uk's IIS returns 404 to requests with no `User-Agent`. Sending a realistic browser UA fixes the 404.
- The `<h4>` that sits above the bin panel was renamed from `Bin collection days` to `Bin collection days current`. The existing scraper used `find_all("h4", string="Bin collection days")`, which is an exact match, so the panel filter never matched and the result came back empty. Matching as a substring via a lambda covers either label.

Verification
Tested live against UPRN `10009739960` (1 The Drift, Culford, IP28 6DR):

```json
{"bins": [
  {"type": "Black Bins", "collectionDate": "29/04/2026"},
  {"type": "Blue Bins", "collectionDate": "22/04/2026"},
  {"type": "Brown Bins", "collectionDate": "22/04/2026"}
]}
```

Also confirmed via the Kepthouse `/resolve-v2` API (lat/lng → UPRN → council → scraper) — returned `status: ok` with the bin list in ~22 s on a cold Selenium grid.

Supersedes #1955
#1955 only addressed the User-Agent side of this and was based on a branch that accidentally picked up four unrelated stacked council fixes. This PR is a clean standalone against current `master` with both the UA fix and the h4 rename fix in one commit. Happy to close #1955 once this merges.

Summary by CodeRabbit
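The collection dates in the verified output above are day-first strings. A minimal parsing sketch; the `%d/%m/%Y` format is inferred from that sample, not taken from the scraper source:

```python
from datetime import datetime

entry = {"type": "Black Bins", "collectionDate": "29/04/2026"}

# Assumed day/month/year layout, based on the sample output in this PR.
parsed = datetime.strptime(entry["collectionDate"], "%d/%m/%Y").date()
print(parsed.isoformat())  # 2026-04-29
```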