
Possible bugs in CollinearityThreshold #40

Closed
harper357 opened this issue Feb 6, 2024 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@harper357
Contributor

harper357 commented Feb 6, 2024

Hi! Thanks for writing such a great module!

After running CollinearityThreshold.fit_transform() on some of my data, I was trying to work out which unselected features are collinear with my selected features. Looking at assoc_matrix_, I found that 502/643 features had no values above my threshold, in contrast to the number of selected features, which was only 231/643. Spot-checking some of the unselected features showed that they also never had a value above the threshold. That led me to the code for dropping features, and I am a little confused by it.

def _most_collinear(association_matrix, threshold):
    cols_to_drop = [
        column
        for column in association_matrix.columns
        if any(association_matrix.loc[:, column].abs() > threshold)
    ]
    rows_to_drop = [
        row
        for row in association_matrix.index
        if any(association_matrix.loc[row, :].abs() > threshold)
    ]
    to_drop = list(set(cols_to_drop).union(set(rows_to_drop)))
    most_collinear_series = (
        association_matrix[to_drop].abs().sum(axis=1).sort_values(ascending=False)
    )
    most_collinear_series += (
        association_matrix[to_drop].abs().sum(axis=0).sort_values(ascending=False)
    )
    most_collinear_series /= 2
    return most_collinear_series.index[0], to_drop

def _recursive_collinear_elimination(association_matrix, threshold):
    dum = association_matrix.copy()
    most_collinear_features = []
    while True:
        most_collinear_feature, to_drop = _most_collinear(dum, threshold)
        # Break if no more features to drop
        if not to_drop:
            break
        if most_collinear_feature not in most_collinear_features:
            most_collinear_features.append(most_collinear_feature)
        dum = dum.drop(columns=most_collinear_feature, index=most_collinear_feature)
    return most_collinear_features

In lines 438-444, it looks like you are trying to sum the row and the column for a given feature and return the feature with the highest average, correct? However, association_matrix[to_drop] == association_matrix.loc[:, to_drop], so in L439 the resulting index covers all features instead of just the features in to_drop.

Second, in L439 and L442 you sort the series, but I believe you should only be doing a single final sort in L444 or L445.

>>> series = pd.Series([1, 0], index=['A', 'B'])
>>> series += pd.Series([1.4, 0.1], index=['B', 'A'])
>>> print(series)
A    1.1
B    1.4
dtype: float64

Combined, these two things seem to result in the incorrect feature being dropped.
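To illustrate the first point, here is a toy example (the matrix and names are made up, not from my data) showing that selecting columns with association_matrix[to_drop] does not restrict the row index, so the summed Series still covers every feature:

```python
import pandas as pd

# Hypothetical toy association matrix; only 'C' is strongly associated with 'A'
m = pd.DataFrame(
    [[0.0, 0.1, 0.9],
     [0.1, 0.0, 0.2],
     [0.9, 0.2, 0.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)
to_drop = ["C"]

# Column selection keeps ALL rows: the summed Series is indexed by
# every feature, not just the features in to_drop.
col_sums = m[to_drop].abs().sum(axis=1)
print(list(col_sums.index))  # ['A', 'B', 'C']
```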

Related, but not a bug, I found changing L427-L436 to the following resulted in a huge speedup (998 ms ± 26.7 ms per loop vs 27.3 s ± 558 ms per loop) in calling _recursive_collinear_elimination for me (n_features=643).

cols_to_drop = association_matrix.loc[
        :, (association_matrix.abs() > threshold).any(axis=0)
    ].columns.values
rows_to_drop = association_matrix.loc[
        (association_matrix.abs() > threshold).any(axis=1), :
    ].index.values
@ThomasBury ThomasBury self-assigned this Feb 7, 2024
@ThomasBury ThomasBury added the bug Something isn't working label Feb 7, 2024
@ThomasBury
Owner

Hi @harper357, thank you for the kind words and the careful review (a proper CI/CD test suite is still missing). I'll go through your comments and PR as soon as my schedule allows.

@ThomasBury
Owner

ThomasBury commented Feb 9, 2024

For improved clarity, I'll move the series sorting to the end. As you demonstrated, pandas handles value placement correctly using the index, ensuring the order remains consistent.

Agreed, using pandas is indeed more Pythonic than list comprehension in this case.

In the tutorial for version 2.2.3, everything seems fine. To help diagnose the issue with features not exceeding the threshold, a reproducible example would be very helpful (please update to version 2.2.3). Thanks!

@ThomasBury ThomasBury added question Further information is requested and removed bug Something isn't working labels Feb 9, 2024
@harper357
Contributor Author

harper357 commented Feb 9, 2024 via email

@harper357
Contributor Author

I see you merged my changes so the bug should be fixed. But just to be complete, I uploaded a notebook to my fork that illustrates everything. Notebook

It should be pretty easy to follow: I used the cancer data set to show that there was a problem, followed by some minimal examples of what was happening.

@ThomasBury ThomasBury added bug Something isn't working and removed question Further information is requested labels Feb 10, 2024
@ThomasBury
Owner

ThomasBury commented Feb 10, 2024

Thanks for the detailed example! I appreciate it.

I missed that pandas aligns on the index during addition, preserving the order of the first series. Your PR solved it thanks to the subsequent sorting.
Based on your example, I've updated the tutorial.

Regarding overall performance, using n_jobs > 1 launches multiple processes (essentially, multiple Python workers). This initialization incurs a time overhead. However, subsequent runs (e.g., the second time you run the function) may be faster than n_jobs=1. Nevertheless, the current parallelization approach is not very sophisticated, and I recommend keeping n_jobs=1 for now. The upcoming Pandas 3 will natively improve parallelization.

Let me know if the fix works and you can close this issue, thank you for contributing.

@harper357
Contributor Author

Sorry, I have been pretty busy with work so it took me a while to get back to this.

I haven't had the time to formally test it, but it seems like the scipy.stats.spearmanr calculation is much faster than arfs' implementation (it computed rho for 14k features in under a minute on a laptop). I am not sure whether the other association measures are available in other common packages, though.

I think it is fine to close this.

@ThomasBury
Owner

scipy.stats.spearmanr


Computation Time Comparison:

Running the code with a single job (n_jobs=1) results in similar overall computation time for both implementations:

  • ARFS: 231 ns ± 188 ns per loop
  • SciPy Spearmanr: 137 ns ± 89 ns per loop

While SciPy is slightly faster (around 1.7 times), keep in mind that:

ARFS Implementation:
* Uses NumPy and Pandas for potentially better scalability with larger datasets.
* Supports weighting, which is crucial for specific applications like Poisson regression.
* While parallelizing with n_jobs > 1 can be tempting for speed gains, keep in mind that it spawns multiple Python processes, which incurs start-up overhead (a known cost of working around Python's Global Interpreter Lock). Use it only for bigger datasets, or not at all; I wrote it before pandas 2.0 was released, so let's see what pandas 3.0 brings to the table.

I agree that if one doesn't need weighting (the weight vector is None), defaulting to the SciPy/NumPy implementation for speed would make perfect sense. Perhaps in the next release.
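A minimal sketch of that dispatch idea (the function name and signature are hypothetical, not part of the arfs API): take the fast SciPy path only when no sample weights are supplied, and fall back otherwise:

```python
import pandas as pd
from scipy.stats import spearmanr

def spearman_assoc(X: pd.DataFrame, sample_weight=None) -> pd.DataFrame:
    """Hypothetical dispatcher: use the fast SciPy path when unweighted."""
    if sample_weight is None:
        rho, _ = spearmanr(X.to_numpy())
        return pd.DataFrame(rho, index=X.columns, columns=X.columns)
    # The weighted case would fall back to the slower pairwise implementation
    raise NotImplementedError("weighted Spearman not sketched here")
```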

I hope this is clearer. Thanks for contributing.

@harper357
Contributor Author

My larger dataset was showing a bigger speed up, but I totally get your points about weights.

Adding Numba support could also offer quite a speedup, but it might involve more code changes than it is worth if most people aren't using large datasets.

@ThomasBury
Owner

This implementation isn't lightning-fast, but some functions are inherently difficult to vectorize. Numba could offer potential speedups, yet it has compatibility issues with recent Python and NumPy versions, often requiring significant rewrites for uncertain gains.

Therefore, for now, I'm sticking with this simpler approach and awaiting the expected performance improvements in pandas 3.0.

Another option could be migrating to Polars, but it's unclear if it outperforms the upcoming pandas version. Stay tuned for further updates!
