
Conversation

@johnwalz97
Contributor

Pull Request Description

What and why?

Integration tests were failing with 403 errors when downloading the scikit-learn California housing dataset. This bundles the CSV in the repo to prevent that.

How to test

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)


github-actions bot commented Jan 7, 2026

Pull requests must include at least one of the required labels: internal (no release notes required), highlight, enhancement, bug, deprecation, documentation. Except for internal, pull requests must also include a description in the release notes section.


github-actions bot commented Jan 7, 2026

PR Summary

This pull request introduces significant enhancements to the California housing dataset module. The changes primarily focus on improving the data loading mechanism by:

  1. Introducing a source parameter in the load_data function that supports a 'bundled' data source as an alternative to the default sklearn fetch. The function first attempts to load the dataset from a bundled CSV file. If the file is absent, or its columns do not match the expected ones, it falls back to fetching the data via sklearn.

  2. Adding robust error handling in the helper function _load_from_sklearn to capture common issues such as HTTP 403 errors or network-related problems. Detailed error messages are provided to guide the user on potential resolutions, including instructions for manually downloading the dataset if necessary.

  3. Including a helper script generate_california_housing_csv.py that downloads the dataset (using the same fallback mechanisms) and saves it as a CSV file in the repository. This script assists in generating the bundled version of the dataset, ensuring that the repository can serve the dataset without always relying on an external download.

These changes aim to improve data reliability, user experience, and local caching of the dataset while providing clear diagnostic feedback when operations fail.
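The bundled-then-fallback flow described above could look roughly like the sketch below. This is a hypothetical reconstruction, not the PR's actual code: the column names follow sklearn's California housing frame, and the function names (load_data, _load_from_sklearn) and csv_path parameter are taken from or inferred from the summary.

```python
# Hypothetical sketch of the bundled-CSV-with-sklearn-fallback loading logic.
import csv
import os

# Column names as they appear in sklearn's fetch_california_housing(as_frame=True).
EXPECTED_COLUMNS = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude", "MedHouseVal",
]


def _load_from_sklearn():
    # Fallback: fetch via sklearn, wrapping network failures (e.g. HTTP 403)
    # in a message that tells the user how to proceed manually.
    try:
        from sklearn.datasets import fetch_california_housing
        return fetch_california_housing(as_frame=True).frame
    except Exception as exc:
        raise RuntimeError(
            "Could not fetch the California housing dataset "
            "(possibly an HTTP 403 or network error); consider downloading "
            "the CSV manually and bundling it alongside this module."
        ) from exc


def load_data(source="bundled", csv_path="california_housing.csv"):
    if source not in ("bundled", "sklearn"):
        raise ValueError(f"Unknown source: {source!r}")
    if source == "bundled" and os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f))
        # Only trust the bundled copy if its columns match expectations;
        # otherwise fall through to the sklearn fetch.
        if rows and set(rows[0]) == set(EXPECTED_COLUMNS):
            return rows
    return _load_from_sklearn()
```

A caller would use load_data() with the default 'bundled' source and only hit the network when the local CSV is missing or malformed.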

Test Suggestions

  • Test loading data with the 'bundled' source when the CSV file exists and contains the correct columns.
  • Test loading data with the 'bundled' source when the CSV file is missing to ensure it falls back to the sklearn fetch.
  • Simulate a scenario where the bundled CSV is present but has missing or incorrect columns to validate the fallback mechanism.
  • Test for error conditions by providing an invalid source parameter to ensure the appropriate ValueError is raised.
  • Test the helper script by running it in an environment without a cached dataset to ensure it can download and generate the CSV file.
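The second suggestion (fallback when the CSV is missing) could be written as an offline test along these lines. This is a sketch under stated assumptions: both load_data and the stubbed fetch here are minimal stand-ins for the PR's real code, so the test never touches the network.

```python
# Hypothetical offline test of the bundled -> sklearn fallback.
import os
import tempfile


def fake_fetch():
    # Stand-in for the real sklearn download, so the test stays offline.
    return "fetched-from-sklearn"


def load_data(source, csv_path, fetch=fake_fetch):
    # Minimal stand-in for the PR's load_data, just enough for this test.
    if source != "bundled":
        raise ValueError(f"Unknown source: {source!r}")
    if os.path.exists(csv_path):
        return "loaded-from-bundled-csv"
    return fetch()


def test_falls_back_when_csv_missing():
    missing = os.path.join(tempfile.mkdtemp(), "no_such_file.csv")
    assert load_data("bundled", missing) == "fetched-from-sklearn"


test_falls_back_when_csv_missing()
```

With pytest, the same idea would typically use monkeypatch or a fixture to stub the fetch on the real module rather than a default argument.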

@johnwalz97 johnwalz97 added the internal label (Not to be externalized in the release notes) Jan 7, 2026
@johnwalz97 johnwalz97 merged commit 494754b into main Jan 7, 2026
17 of 18 checks passed
@johnwalz97 johnwalz97 deleted the fix-integration-tests branch January 7, 2026 20:54
