Issue with Overly Aggressive Feature Removal in CollinearityThreshold Class #36

Closed
Pacman1984 opened this issue Nov 26, 2023 · 0 comments

Comments

@Pacman1984
Contributor

Description

Problem

The CollinearityThreshold class is intended to remove collinear features from a dataset. However, it also drops features whose association values are below the specified collinearity threshold, which can discard important data. One example is the 'age' column of the titanic dataset provided in the examples: it is removed even though its association values are below the set threshold.

Expected Behavior

The class should only remove features that are collinear above the specified threshold. Features with association values below this threshold should be retained in the dataset.
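
The retention criterion above can be sketched in a few lines. This is only an illustration: the helper name `features_above_threshold`, the nested-dict layout, and the association values are assumptions made for the example and are not taken from the class or from the titanic data.

```python
def features_above_threshold(assoc, threshold):
    """Return features having at least one pairwise |association| > threshold."""
    flagged = set()
    for a, row in assoc.items():
        for b, value in row.items():
            # Skip the diagonal: a feature is always perfectly associated with itself.
            if a != b and abs(value) > threshold:
                flagged.add(a)
    return flagged


# Made-up association values, for illustration only.
assoc = {
    "x1": {"x1": 1.0, "x2": 0.1, "x3": 0.3},
    "x2": {"x1": 0.1, "x2": 1.0, "x3": 0.5},
    "x3": {"x1": 0.3, "x2": 0.5, "x3": 1.0},
}

print(features_above_threshold(assoc, 0.8))  # nothing exceeds 0.8, so nothing should be removed
print(features_above_threshold(assoc, 0.4))  # only x2 and x3 are removal candidates
```

Under this reading, a selector with a threshold of 0.8 would be expected to keep all three features.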

Current Behavior

The class removes features that do not exceed the collinearity threshold. The behavior originates in the recursive collinear elimination step (_recursive_collinear_elimination), where features are dropped even when no remaining association value is above the threshold.

Steps to Reproduce

  1. Initialize the CollinearityThreshold with a specific threshold.
  2. Fit the selector to a dataset.
  3. Observe that features with association values below the threshold are also being removed.

Suggested Fix

Modify the _recursive_collinear_elimination method so that it removes only features that exceed the specified collinearity threshold. The proposed change adds a condition that breaks the while loop as soon as no remaining feature exceeds the threshold, which prevents the unnecessary removal of features.

Old Version

def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []
    most_collinear_feature, to_drop = _most_collinear(association_matrix, threshold)
    # Problem: the top-ranked feature is dropped without checking whether to_drop is empty
    most_collinear_features.append(most_collinear_feature)
    dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)

    while len(to_drop) > 1:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)
        # Problem: the feature is dropped before the refreshed to_drop is checked
        most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features

New Version

def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []

    while True:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)

        # Break if no more features to drop
        if not to_drop:
            break

        # Defensive check: a dropped feature is no longer in dum, so it should not reappear
        if most_collinear_feature not in most_collinear_features:
            most_collinear_features.append(most_collinear_feature)
            dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)

    return most_collinear_features
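
To compare the two versions side by side, here is a minimal sketch on toy matrices. The body of `_most_collinear` is not shown in this issue, so the one below is a hypothetical stand-in (rank by summed off-diagonal association, flag every feature with a pairwise value above the threshold); the names `old_elimination`, `new_elimination`, `make_matrix` and all association values are likewise made up for the illustration.

```python
import pandas as pd


def _most_collinear(association_matrix, threshold):
    # Hypothetical stand-in, NOT the project's helper: returns the feature with
    # the largest summed off-diagonal association, plus every feature that has
    # at least one pairwise association above the threshold.
    off_diag = association_matrix.abs()  # .abs() returns a new frame
    for name in off_diag.index:
        off_diag.loc[name, name] = 0.0  # ignore self-association
    to_drop = [c for c in off_diag.columns if (off_diag[c] > threshold).any()]
    return off_diag.sum(axis=1).idxmax(), to_drop


def old_elimination(association_matrix, threshold):
    # "Old Version" from the issue, renamed for the comparison.
    dum = association_matrix.copy()
    most_collinear_features = []
    most_collinear_feature, to_drop = _most_collinear(association_matrix, threshold)
    most_collinear_features.append(most_collinear_feature)
    dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    while len(to_drop) > 1:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)
        most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features


def new_elimination(association_matrix, threshold):
    # "New Version" from the issue, renamed for the comparison.
    dum = association_matrix.copy()
    most_collinear_features = []
    while True:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)
        if not to_drop:
            break
        if most_collinear_feature not in most_collinear_features:
            most_collinear_features.append(most_collinear_feature)
            dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features


def make_matrix(names, pairs):
    # Build a symmetric association matrix with ones on the diagonal.
    mat = pd.DataFrame(1.0, index=names, columns=names)
    for (a, b), value in pairs.items():
        mat.loc[a, b] = mat.loc[b, a] = value
    return mat


# Made-up values: no pair exceeds 0.8 in `below`; only (a, b) does in `one_pair`.
below = make_matrix(["a", "b", "c"],
                    {("a", "b"): 0.1, ("a", "c"): 0.3, ("b", "c"): 0.5})
one_pair = make_matrix(["a", "b", "c"],
                       {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.3})

print("below:   ", old_elimination(below, 0.8), new_elimination(below, 0.8))
print("one_pair:", old_elimination(one_pair, 0.8), new_elimination(one_pair, 0.8))
```

With this stand-in, the old version reports a feature to drop for `below` even though no pair exceeds 0.8, and reports two features for `one_pair` where dropping one already resolves the only offending pair. The new version reports none and one, respectively. The actual helper may rank features differently, so the specific features chosen could differ in the real class.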
