
fix: WestSuffolkCouncil — User-Agent + renamed h4 heading #1978

Closed
InertiaUK wants to merge 1 commit into robbrad:master from InertiaUK:fix/west-suffolk-ua-and-header

Conversation

Contributor

@InertiaUK InertiaUK commented Apr 22, 2026

Two upstream changes on the West Suffolk bin-day page had left the scraper silently failing (empty result, no exception):

  1. User-Agent now required. maps.westsuffolk.gov.uk's IIS returns 404 to requests with no User-Agent. Sending a realistic browser UA fixes the 404.
  2. Collection-panel heading renamed. The <h4> that sits above the bin panel was renamed from Bin collection days to Bin collection days current. The existing scraper used find_all("h4", string="Bin collection days") which is an exact match, so the panel filter never matched and the result came back empty. Matching as a substring via a lambda covers either label.
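The heading change can be reproduced in isolation. The snippet below (illustrative markup, not the council's real page) shows why the exact-string filter returns nothing while a substring lambda matches either label:

```python
from bs4 import BeautifulSoup

# Illustrative HTML: the upstream heading gained a trailing word.
html = "<body><h4>Bin collection days current</h4></body>"
soup = BeautifulSoup(html, "html.parser")

# Old filter: exact string equality no longer matches the renamed heading.
exact = soup.find_all("h4", string="Bin collection days")

# New filter: substring match covers both the old and the new label.
relaxed = soup.find_all(
    lambda t: t.name == "h4" and "Bin collection days" in t.get_text()
)

print(len(exact), len(relaxed))  # prints: 0 1
```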

Verification

Tested live against UPRN 10009739960 (1 The Drift, Culford, IP28 6DR):

{"bins": [
  {"type": "Black Bins", "collectionDate": "29/04/2026"},
  {"type": "Blue Bins",  "collectionDate": "22/04/2026"},
  {"type": "Brown Bins", "collectionDate": "22/04/2026"}
]}
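The collectionDate values above are UK day-first strings; anything consuming this output should parse them with an explicit format rather than relying on locale (a sketch, not code from this PR):

```python
from datetime import datetime

entry = {"type": "Black Bins", "collectionDate": "29/04/2026"}

# %d/%m/%Y makes the day-first ordering explicit.
parsed = datetime.strptime(entry["collectionDate"], "%d/%m/%Y").date()
print(parsed.isoformat())  # prints: 2026-04-29
```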

Also confirmed via the Kepthouse /resolve-v2 API (lat/lng → UPRN → council → scraper), which returned status: ok with the bin list in ~22 s on a cold Selenium grid.

Supersedes #1955

#1955 only addressed the User-Agent side of this and was based on a branch that accidentally picked up four unrelated stacked council fixes. This PR is a clean standalone against current master with both the UA fix and the h4 rename fix in one commit. Happy to close #1955 once this merges.

Summary by CodeRabbit

  • Bug Fixes
    • Improved bin collection schedule parsing to handle variations in data formatting across different council responses.
    • Enhanced HTTP request handling with browser-compatibility headers for more reliable retrieval of collection information.

Two small upstream changes have been breaking the West Suffolk scraper:

1. westsuffolk.gov.uk's IIS now 404s requests with no User-Agent. Send
   a realistic browser UA so the bin-day page returns 200 with the
   collection panel.

2. The council renamed the collection panel heading from
   "Bin collection days" to "Bin collection days current", which broke
   the exact-string `find_all(..., string="Bin collection days")` guard
   and caused an empty result. Match as a substring via a lambda so
   either label works.

Verified against live site for UPRN 10009739960 (IP28 6DR) — returns
Black / Blue / Brown collection dates.

Supersedes the earlier WestSuffolk fix in robbrad#1955 (that branch also
picked up unrelated commits from stacked council fixes and only
addressed part of the problem).
@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

📝 Walkthrough

Walkthrough

Updated the WestSuffolkCouncil scraper to include a realistic User-Agent header in HTTP requests and relaxed HTML matching logic to handle header label variations, enabling the parser to correctly retrieve bin collection data from the API endpoint.

Changes

User-Agent Header & HTML Matching
  • uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py: Added a User-Agent header to the requests.get() call to restore the full response from the IIS endpoint; relaxed the panel_search HTML matching from exact string equality to a substring check to accommodate the heading label change.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Possibly related PRs

  • PR #1733: Applies identical User-Agent header addition pattern to a different council scraper module.

Suggested reviewers

  • dp247

Poem

🐰 A browser's voice was all we lacked,
IIS turned our requests back.
With User-Agent's gentle call,
The bins now dance and never fall! 📦✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly summarizes the two main changes: adding a User-Agent header and handling the renamed h4 heading in WestSuffolkCouncil.
  • Linked Issues Check: ✅ Passed. The pull request implements both requirements from #1955: adding a User-Agent header to requests and handling the renamed h4 heading for robust parsing.
  • Out of Scope Changes Check: ✅ Passed. All changes are directly related to fixing the WestSuffolkCouncil scraper; no unrelated modifications were introduced outside the stated objectives.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py (1)

45-53: ⚠️ Potential issue | 🟠 Major

Raise when the collection panel is missing.

The relaxed heading match is good, but if the panel is still not found, this returns {"bins": []} and reintroduces the silent-empty failure mode this PR is fixing.

Proposed fix
-        collection_tag = soup.body.find_all(panel_search)
+        if soup.body is None:
+            raise ValueError("West Suffolk response did not contain a <body> element")
+
+        collection_tag = soup.body.find_all(panel_search)
+        if not collection_tag:
+            raise ValueError("West Suffolk bin collection panel not found")

Based on learnings, parsing council bin collection data should prefer explicit failures over silent defaults or swallowed errors.
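The explicit-failure pattern described above can be sketched as a small standalone parser (the function name and markup are hypothetical; the real logic lives in WestSuffolkCouncil.py):

```python
from bs4 import BeautifulSoup


def find_collection_panel(html: str):
    """Locate the bin collection panel, raising instead of returning nothing."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.body is None:
        raise ValueError("Response did not contain a <body> element")

    def panel_search(tag):
        # Substring match tolerates the renamed heading.
        return tag.name == "h4" and "Bin collection days" in tag.get_text()

    panels = soup.body.find_all(panel_search)
    if not panels:
        raise ValueError("Bin collection panel not found")
    return panels
```

With this shape, a missing panel surfaces as a ValueError rather than an empty {"bins": []}, so future upstream layout changes fail loudly instead of silently.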

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py` around
lines 45 - 53, The panel search currently falls back to returning no data when
the collection panel isn't found; after computing collection_tag =
soup.body.find_all(panel_search) you must detect an empty result and raise an
explicit exception (e.g., RuntimeError or a custom ParseError) instead of
letting the code continue and return {"bins": []}; update the code around
collection_tag and the panel_search helper so that if not collection_tag you
raise a clear error indicating the WestSuffolkCouncil collection panel is
missing, referencing panel_search and collection_tag so reviewers can find the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py`:
- Line 26: The HTTP fetch in WestSuffolkCouncil (where requests.get(api_url,
headers=headers) is called) must include a timeout and validate the response
before parsing: add a timeout argument to requests.get (e.g. timeout=10) and
call response.raise_for_status() immediately after the request and before
creating the BeautifulSoup object; ensure this mirrors the pattern used in
BelfastCityCouncil (use requests.get, response.raise_for_status(), then
BeautifulSoup) so 4xx/5xx responses raise and are not parsed silently.

---

Outside diff comments:
In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py`:
- Around line 45-53: The panel search currently falls back to returning no data
when the collection panel isn't found; after computing collection_tag =
soup.body.find_all(panel_search) you must detect an empty result and raise an
explicit exception (e.g., RuntimeError or a custom ParseError) instead of
letting the code continue and return {"bins": []}; update the code around
collection_tag and the panel_search helper so that if not collection_tag you
raise a clear error indicating the WestSuffolkCouncil collection panel is
missing, referencing panel_search and collection_tag so reviewers can find the
change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 62e18e1d-002a-49ca-a399-c6b75f914480

📥 Commits

Reviewing files that changed from the base of the PR and between 60bd3cc and c25dba4.

📒 Files selected for processing (1)
  • uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py

"(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36"
)
}
response = requests.get(api_url, headers=headers)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -nP --type=python -C3 'requests\.get\(api_url, headers=headers|raise_for_status\(|timeout='

Repository: robbrad/UKBinCollectionData

Length of output: 101


🏁 Script executed:

cat -n uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py

Repository: robbrad/UKBinCollectionData

Length of output: 4403


🏁 Script executed:

rg -nA5 -B5 'requests\.get.*api_url' uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py

Repository: robbrad/UKBinCollectionData

Length of output: 496


🏁 Script executed:

# Check what other council parsers do with HTTP error handling
rg -l 'raise_for_status' uk_bin_collection/uk_bin_collection/councils/ | head -5

Repository: robbrad/UKBinCollectionData

Length of output: 420


🏁 Script executed:

# Check timeout usage patterns in council parsers
rg 'timeout=' uk_bin_collection/uk_bin_collection/councils/ | head -10

Repository: robbrad/UKBinCollectionData

Length of output: 1241


🏁 Script executed:

rg -B5 -A5 'raise_for_status' uk_bin_collection/uk_bin_collection/councils/BelfastCityCouncil.py | head -20

Repository: robbrad/UKBinCollectionData

Length of output: 894


🏁 Script executed:

# Check if there's a base class or common pattern for HTTP requests
rg -B3 -A3 'requests\.get.*timeout' uk_bin_collection/uk_bin_collection/councils/BlackpoolCouncil.py | head -20

Repository: robbrad/UKBinCollectionData

Length of output: 53


Add timeout and HTTP status validation before parsing the response.

The User-Agent header fixes the immediate 404, but this still parses any 4xx/5xx/error response silently and returns empty results. Add both a timeout and raise_for_status() before BeautifulSoup, consistent with patterns used in other council parsers like BelfastCityCouncil.

Proposed fix
-        response = requests.get(api_url, headers=headers)
+        response = requests.get(api_url, headers=headers, timeout=30)
+        response.raise_for_status()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        response = requests.get(api_url, headers=headers)
+        response = requests.get(api_url, headers=headers, timeout=30)
+        response.raise_for_status()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@uk_bin_collection/uk_bin_collection/councils/WestSuffolkCouncil.py` at line
26, The HTTP fetch in WestSuffolkCouncil (where requests.get(api_url,
headers=headers) is called) must include a timeout and validate the response
before parsing: add a timeout argument to requests.get (e.g. timeout=10) and
call response.raise_for_status() immediately after the request and before
creating the BeautifulSoup object; ensure this mirrors the pattern used in
BelfastCityCouncil (use requests.get, response.raise_for_status(), then
BeautifulSoup) so 4xx/5xx responses raise and are not parsed silently.

@blackandwhitetux

Works for me.

@robbrad robbrad mentioned this pull request May 1, 2026
@robbrad
Owner

robbrad commented May 1, 2026

Included in May 2026 Release PR #1992. Closing.

@robbrad robbrad closed this May 1, 2026
pull Bot pushed a commit to mrw298/UKBinCollectionData that referenced this pull request May 1, 2026
@robbrad robbrad mentioned this pull request May 2, 2026
