Feature/groupby2 #58
Conversation
Feature/groupby
Remove file to upload changed version
btw, I would like you to add some unit tests for the groupby features.
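For context, a minimal sketch of the kind of test being asked for, assuming "delta_mean" means "value minus its group mean"; the commented transformer call is hypothetical, and only its arguments mirror the constructor reviewed later in this PR:

import pandas as pd


def test_delta_mean_groupby_semantics():
    df = pd.DataFrame(
        {"city": ["a", "a", "b", "b"], "age": [10.0, 20.0, 30.0, 50.0]}
    )
    # Group means: a -> 15, b -> 40, so the expected deltas are [-5, 5, -10, 10].
    expected = [-5.0, 5.0, -10.0, 10.0]

    # Pandas reference for the "delta_mean" transform; a real test would call
    # the PR's transformer here instead, e.g. (hypothetical API):
    # out = GroupByTransformer(group_col="city", numeric_cols=["age"],
    #                          used_transforms=["delta_mean"]).fit_transform(df)
    out = df["age"] - df.groupby("city")["age"].transform("mean")

    assert out.tolist() == expected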
examples/demo14.py (Outdated)
train, test = train_test_split(data, test_size=2000, random_state=42)
remove redundant empty lines
examples/demo14.py (Outdated)
    general_params={"use_algos": [["lgb"]]},
    gbm_pipeline_params={"use_groupby": True, "groupby_triplets": groupby_triplets},
)
_ = automl.fit_predict(train, roles=roles)
Suggested change:
automl.fit_predict(train, roles=roles)
Co-authored-by: Rinchin <57899558+dev-rinchin@users.noreply.github.com>
examples/demo14.py (Outdated)
    gbm_pipeline_params={"use_groupby": True, "groupby_triplets": groupby_triplets},
)
automl.fit_predict(train, roles=roles)
add comment for feature_scores
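A hedged sketch of what that comment might look like in demo14.py; the get_feature_scores accessor is an assumption borrowed from LightAutoML tutorials, not confirmed by this diff:

# Feature importances after fitting; groupby-generated features can be
# identified by their name prefix.
feature_scores = automl.get_feature_scores("fast")
print(feature_scores.head(10))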
examples/demo14.py (Outdated)
# Custom pipeline with groupby features defined by importance
print("\nTry custom pipeline with groupby features defined by importance:\n")
add comment for custom pipeline
examples/demo14.py (Outdated)
pipe = LGBAdvancedPipeline(
    use_groupby=True, pre_selector=selector, groupby_types=["delta_median", "std"], groupby_top_based_on="importance"
feats_imp
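If this means the demo should pass the selector through LGBAdvancedPipeline's feats_imp argument (consistent with the feats_imp=pre_selector diffs below), the call would presumably become something like:

pipe = LGBAdvancedPipeline(
    use_groupby=True,
    feats_imp=selector,  # renamed from pre_selector=selector, per the review
    groupby_types=["delta_median", "std"],
    groupby_top_based_on="importance",
)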
@@ -429,7 +429,7 @@ def get_gbms(
     pre_selector: Optional[SelectionPipeline] = None,
 ):
-    gbm_feats = LGBAdvancedPipeline(**self.gbm_pipeline_params)
+    gbm_feats = LGBAdvancedPipeline(**self.gbm_pipeline_params, feats_imp=pre_selector)
add pre_selector to linear_l2_feats init in get_linear
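A sketch of what that might look like; the LinearFeatures name and its acceptance of feats_imp are assumptions mirroring the gbm-side change:

# Inside get_linear (assumed shape, mirroring the change above):
linear_l2_feats = LinearFeatures(feats_imp=pre_selector, **self.linear_pipeline_params)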
@@ -348,7 +348,7 @@ def get_gbms(
 ):
     text_gbm_feats = self.get_nlp_pipe(self.gbm_pipeline_params["text_features"])
-    gbm_feats = LGBAdvancedPipeline(output_categories=False, **self.gbm_pipeline_params)
+    gbm_feats = LGBAdvancedPipeline(feats_imp=pre_selector, output_categories=False, **self.gbm_pipeline_params)
do we need feats_imp=pre_selector in get_linear?
rm file
rm file
group_col: str,
numeric_cols: Optional[List[str]] = None,
categorical_cols: Optional[List[str]] = None,
used_transforms: Optional[List[str]] = None,
rename
self._features = [f"{self._fname_prefix}__{self.group_col}__{t}__{f}" for f, t in self.transformations_list]
self._features_mapping = {self._features[i]: k for i, k in enumerate(self.transformations_list)}

self._group_ids_dict = self._calculate_group_ids(dataset)
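To make the naming scheme concrete, an illustrative example (the prefix and column names here are made up):

# With _fname_prefix = "grb", group_col = "city", and
# transformations_list = [("age", "delta_mean"), ("income", "std")],
# the generated names are:
#   grb__city__delta_mean__age
#   grb__city__std__income
# and _features_mapping maps each generated name back to its
# (feature, transform) pair.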
rm _group_ids_dict
def _set_feature_indices(self):
    feat_idx = dict()
    feat_idx[self.group_col] = 0
do not hardcode values
The thing is that we don't have the original feature names there, because they are affected by different encoders. So hardcoding the indices is the only way to link them with the original names. Fixing this would require a more general improvement: preserving the original feature names through one-to-one transformations.
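A minimal sketch of that general improvement, with hypothetical names: each one-to-one transformer records a mapping from its output names back to the original names, so downstream code can look names up instead of hardcoding indices.

from typing import Dict, List


class OneToOneNameMixin:
    """Hypothetical mixin: preserve original names through 1-to-1 transforms."""

    def _record_names(self, input_names: List[str], output_names: List[str]) -> None:
        # A one-to-one transform keeps positional correspondence,
        # so the mapping is a simple zip.
        self.name_mapping: Dict[str, str] = dict(zip(output_names, input_names))

    def original_name(self, output_name: str) -> str:
        # Look up the original feature name instead of using a hardcoded index.
        return self.name_mapping[output_name]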
Groupby features introduced: delta mean, delta median, min, max, std, and cat mode. Config files updated; the features are not used by default.
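For reference, the listed transforms have direct pandas analogues; a hedged sketch of their intended semantics (the PR's implementation may differ in details such as NaN handling):

import pandas as pd

df = pd.DataFrame({
    "city": ["a", "a", "b", "b"],
    "age": [10.0, 20.0, 30.0, 50.0],
    "job": ["x", "y", "y", "y"],
})
g = df.groupby("city")

delta_mean = df["age"] - g["age"].transform("mean")          # delta mean
delta_median = df["age"] - g["age"].transform("median")      # delta median
group_min = g["age"].transform("min")                        # min
group_max = g["age"].transform("max")                        # max
group_std = g["age"].transform("std")                        # std
group_mode = g["job"].transform(lambda s: s.mode().iloc[0])  # cat mode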