
Possible bugs in CollinearityThreshold #40

Closed
harper357 opened this issue Feb 6, 2024 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@harper357
Contributor

harper357 commented Feb 6, 2024

Hi! Thanks for writing such a great module!

After running CollinearityThreshold.fit_transform() on some of my data, I was trying to work out which unselected features are collinear with my selected features. Looking at assoc_matrix_, I found that 502/643 features had no values above my threshold, in contrast to the number of selected features, which was only 231/643. Spot-checking some of the unselected features showed that they also never had a value above the threshold. That led me to the code for dropping features, and I am a little confused by it.

def _most_collinear(association_matrix, threshold):
    cols_to_drop = [
        column
        for column in association_matrix.columns
        if any(association_matrix.loc[:, column].abs() > threshold)
    ]
    rows_to_drop = [
        row
        for row in association_matrix.index
        if any(association_matrix.loc[row, :].abs() > threshold)
    ]
    to_drop = list(set(cols_to_drop).union(set(rows_to_drop)))
    most_collinear_series = (
        association_matrix[to_drop].abs().sum(axis=1).sort_values(ascending=False)
    )
    most_collinear_series += (
        association_matrix[to_drop].abs().sum(axis=0).sort_values(ascending=False)
    )
    most_collinear_series /= 2
    return most_collinear_series.index[0], to_drop

def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []
    while True:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)
        # Break if no more features to drop
        if not to_drop:
            break
        if most_collinear_feature not in most_collinear_features:
            most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features

In lines 438-444, it looks like you are trying to sum the row and the column for a given feature and return the feature with the highest average, correct? However, association_matrix[to_drop] == association_matrix.loc[:, to_drop], so in L439 the resulting index covers all features instead of just the features in to_drop.

Second, in L439 and L442 you sort the series, but I believe you should only be doing a single final sort in L444 or L445.

>>> series = pd.Series([1, 0], index=['A', 'B'])
>>> series += pd.Series([1.4, 0.1], index=['B', 'A'])
>>> print(series)
A    1.1
B    1.4
dtype: float64

Combined, these two things seem to result in the incorrect feature being dropped.
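To illustrate the first point, here is a toy example (the matrix and names are made up, not from my data) showing that selecting columns with association_matrix[to_drop] does not restrict the row index, so the summed Series still covers every feature:

```python
import pandas as pd

# Hypothetical toy association matrix; only 'C' is strongly associated with 'A'
m = pd.DataFrame(
    [[0.0, 0.1, 0.9],
     [0.1, 0.0, 0.2],
     [0.9, 0.2, 0.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)
to_drop = ["C"]

# Column selection keeps ALL rows: the summed Series is indexed by
# every feature, not just the features in to_drop.
col_sums = m[to_drop].abs().sum(axis=1)
print(list(col_sums.index))  # ['A', 'B', 'C']
```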

Related, but not a bug, I found changing L427-L436 to the following resulted in a huge speedup (998 ms ± 26.7 ms per loop vs 27.3 s ± 558 ms per loop) in calling _recursive_collinear_elimination for me (n_features=643).

cols_to_drop = association_matrix.loc[
        :, (association_matrix.abs() > threshold).any(axis=0)
    ].columns.values
rows_to_drop = association_matrix.loc[
        (association_matrix.abs() > threshold).any(axis=1), :
    ].index.values
@ThomasBury ThomasBury self-assigned this Feb 7, 2024
@ThomasBury ThomasBury added the bug Something isn't working label Feb 7, 2024
@ThomasBury
Owner

Hi @harper357, thank you for the kind words and the careful review (a proper CI/CD test suite is still missing). I'll go through your comments and PR as soon as my schedule allows.

@ThomasBury
Owner

ThomasBury commented Feb 9, 2024

For improved clarity, I'll move the series sorting to the end. As you demonstrated, pandas handles value placement correctly using the index, ensuring the order remains consistent.

Agreed, using pandas is indeed more Pythonic than list comprehension in this case.

In the tutorial for version 2.2.3, everything seems fine. To help diagnose the issue with features not exceeding the threshold, a reproducible example would be very helpful (please update to version 2.2.3). Thanks!

@ThomasBury ThomasBury added question Further information is requested and removed bug Something isn't working labels Feb 9, 2024
@harper357
Contributor Author

harper357 commented Feb 9, 2024 via email

@harper357
Contributor Author

I see you merged my changes so the bug should be fixed. But just to be complete, I uploaded a notebook to my fork that illustrates everything. Notebook

It should be pretty easy to follow: I used the cancer data set to show that there was a problem, followed by some minimal examples of what was happening.

@ThomasBury ThomasBury added bug Something isn't working and removed question Further information is requested labels Feb 10, 2024
@ThomasBury
Owner

ThomasBury commented Feb 10, 2024

Thanks for the detailed example! I appreciate it.

I missed that pandas aligns on the index during addition, preserving the order of the first series. Your PR solved it thanks to the subsequent sorting.
Based on your example, I've updated the tutorial.

Regarding overall performance, using n_jobs > 1 launches multiple processes (essentially, multiple Python workers). This initialization incurs a time overhead. However, subsequent runs (e.g., the second time you run the function) may be faster than n_jobs=1. Nevertheless, the current parallelization approach is not very sophisticated, and I recommend keeping n_jobs=1 for now. The upcoming Pandas 3 will natively improve parallelization.

Let me know if the fix works and you can close this issue, thank you for contributing.

@harper357
Contributor Author

Sorry, I have been pretty busy with work so it took me a while to get back to this.

I haven't had the time to formally test it, but it seems like the scipy.stats.spearmanr calculation is much faster than arfs' implementation (it computed rho for 14k features in under a minute on a laptop). I am not sure whether the other association measures are available in other common packages, though.

I think it is fine to close this.

@ThomasBury
Owner

scipy.stats.spearmanr


Computation Time Comparison:

Running the code with a single job (n_jobs=1) results in similar overall computation time for both implementations:

  • ARFS: 231 ns ± 188 ns per loop
  • SciPy Spearmanr: 137 ns ± 89 ns per loop

While SciPy is slightly faster (around 1.7 times), keep in mind that:

ARFS Implementation:
* Uses NumPy and Pandas for potentially better scalability with larger datasets.
* Supports weighting, which is crucial for specific applications like Poisson regression.
* While parallelizing with n_jobs > 1 can be tempting for speed gains, keep in mind that it spawns multiple Python processes, which incurs start-up overhead (a known cost of working around Python's Global Interpreter Lock). Use it only for bigger datasets, or not at all; I wrote it before pandas 2.0 was released, so let's see what pandas 3.0 brings to the table.

I agree that if one doesn't need weighting (the weight vector is None), defaulting to the SciPy/NumPy implementation for speed would make perfect sense. Perhaps in the next release.
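A minimal sketch of that dispatch idea (the function name and signature are hypothetical, not part of the arfs API): take the fast SciPy path only when no sample weights are supplied, and fall back otherwise:

```python
import pandas as pd
from scipy.stats import spearmanr

def spearman_assoc(X: pd.DataFrame, sample_weight=None) -> pd.DataFrame:
    """Hypothetical dispatcher: use the fast SciPy path when unweighted."""
    if sample_weight is None:
        rho, _ = spearmanr(X.to_numpy())
        return pd.DataFrame(rho, index=X.columns, columns=X.columns)
    # The weighted case would fall back to the slower pairwise implementation
    raise NotImplementedError("weighted Spearman not sketched here")
```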

I hope this is clearer. Thanks for contributing.

@harper357
Contributor Author

My larger dataset was showing a bigger speed up, but I totally get your points about weights.

Adding Numba support could also offer quite a speedup, but it might involve more code changes than it is worth if most people aren't using large datasets.

@ThomasBury
Owner

This implementation isn't lightning-fast, but some functions are inherently difficult to vectorize. Numba could offer potential speedups, yet it has compatibility issues with recent Python and NumPy versions, often requiring significant rewrites for uncertain gains.

Therefore, for now, I'm sticking with this simpler approach and awaiting the expected performance improvements in pandas 3.0.

Another option could be migrating to Polars, but it's unclear if it outperforms the upcoming pandas version. Stay tuned for further updates!
