test: test whether CV is effective #649

Closed
wants to merge 10 commits
add pseudocode for CV
WinstonLiyt committed Mar 4, 2025
commit bfaa601e5f0a208c2d00ce1df8c114fddaced58e
21 changes: 21 additions & 0 deletions rdagent/components/coder/data_science/raw_data_loader/prompts.yaml
@@ -273,6 +273,27 @@ spec:
- The dataset returned by `load_data` is not pre-split. After calling `feat_eng`, split the data into training and test sets.
- [Notice] Apply cross-validation (e.g. KFold) on the training set (`X_transformed`, `y_transformed`) to ensure a reliable assessment of model performance.
- Keep the test set (`X_test_transformed`) unchanged, as it is only used for generating the final predictions.
- Pseudocode logic for reference:
```
Set number of splits and initialize KFold cross-validator.

Create dictionaries for validation and test predictions.

For each model file:
Import the model dynamically.
Initialize arrays for out-of-fold (OOF) and test predictions.

For each fold in KFold:
Split data into training and validation sets.
Run model workflow to get validation and test predictions.
Validate shapes.
Store validation and test predictions.

Compute average test predictions across folds.
Save OOF and averaged test predictions.

Ensemble predictions from all models and print the final shape.
```
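The pseudocode above can be sketched as runnable Python. This is a minimal illustration, not the repository's actual implementation: it assumes scikit-learn and NumPy, and it stands in a simple `Ridge` model for the dynamically imported model workflow. All variable names here (`oof`, `test_preds`, etc.) are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Toy stand-in data; in the real workflow these come from load_data/feat_eng.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
X_test = rng.normal(size=(20, 5))

# Set number of splits and initialize the KFold cross-validator.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Arrays for out-of-fold (OOF) and per-fold test predictions.
oof = np.zeros(len(y))
test_preds = np.zeros((kf.get_n_splits(), len(X_test)))

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Split data into training and validation sets, then fit the model
    # (a Ridge regressor stands in for the model workflow here).
    model = Ridge().fit(X[train_idx], y[train_idx])

    # Get validation and test predictions, and validate shapes.
    val_pred = model.predict(X[val_idx])
    assert val_pred.shape == y[val_idx].shape

    # Store validation and test predictions.
    oof[val_idx] = val_pred
    test_preds[fold] = model.predict(X_test)

# Average test predictions across folds; X_test itself was never touched.
test_pred_avg = test_preds.mean(axis=0)
print(oof.shape, test_pred_avg.shape)
```

In a multi-model setup, this loop would run once per model file, with the per-model `oof` and `test_pred_avg` arrays then ensembled (e.g. averaged) into the final predictions.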

4. Submission File:
- Save the final predictions as `submission.csv`, ensuring the format matches the competition requirements (refer to `sample_submission` in the Folder Description for the correct structure).
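A minimal sketch of the submission step, assuming pandas and a hypothetical `sample_submission` with `id` and `target` columns (the real column names come from the competition's sample file):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the competition's sample_submission file.
sample_submission = pd.DataFrame({"id": range(5), "target": 0.0})
final_preds = np.arange(5, dtype=float)  # e.g. the ensembled test predictions

# Copy the sample so the column names and row order match exactly,
# then overwrite only the prediction column.
submission = sample_submission.copy()
submission["target"] = final_preds
submission.to_csv("submission.csv", index=False)
```

Reusing the sample file as a template is the simplest way to guarantee the required format.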