Possible bugs in CollinearityThreshold
#40
Comments
Hi @harper357, thank you for the kind words and the careful review (a proper CI/CD test suite is still missing). I'll go through your comments and PR as soon as my schedule allows.
For improved clarity, I'll move the series sorting to the end. As you demonstrated, pandas handles value placement correctly using the index, ensuring the order remains consistent. Agreed, using pandas is indeed more Pythonic than a list comprehension in this case. In the 2.2.3 version tutorial, everything seems fine. To help diagnose the issue with features not exceeding the threshold, a reproducible example would be very helpful (please update to version
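The index alignment being described can be sketched as follows (a minimal example with hypothetical feature names, not ARFS data):

```python
import pandas as pd

# Two series over the same labels but in different orders.
s1 = pd.Series([0.9, 0.2, 0.5], index=["f1", "f2", "f3"])
s2 = pd.Series([0.5, 0.9, 0.2], index=["f3", "f1", "f2"])

# Addition aligns on the index labels, not on positions,
# so each feature's values are combined correctly.
total = s1 + s2
print(total["f1"], total["f2"], total["f3"])  # 1.8 0.4 1.0

# Sorting can therefore safely be deferred to the very end.
print(total.sort_values(ascending=False).index[0])  # 'f1'
```

Because alignment is label-based, intermediate sorts change nothing about correctness, which is why moving the sort to the end is purely a clarity improvement.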
Looking back at my example, it doesn't show what I was trying to show. I'll try posting a better example and find an example data set for the threshold part later today.
…On Fri, Feb 9, 2024, 1:44 AM Thomas Bury ***@***.***> wrote:
For improved clarity, I'll move the series sorting to the end. As you
demonstrated, pandas handles value placement correctly using the index,
ensuring the order remains consistent.
Agreed, using pandas is indeed more Pythonic than list comprehension in
this case.
I'll review the user messages to see if they might be the source of the
inconsistencies. To help diagnose the issue with features not exceeding the
threshold, providing a reproducible example would be very helpful.
I see you merged my changes, so the bug should be fixed. But just to be complete, I uploaded a notebook to my fork that illustrates everything (Notebook). It should be pretty easy to understand: I used the cancer data set to show that there was a problem, and then some minimal examples of what was happening.
Thanks for the detailed example! I appreciate it. I missed that pandas preserves the order of the first series during addition; your PR solved it thanks to the subsequent sorting. Regarding overall performance, using … Let me know if the fix works and then you can close this issue. Thank you for contributing.
Sorry, I have been pretty busy with work, so it took me a while to get back to this. I haven't had the time to formally test it, but it seems like the … I think it is fine to close this.
Computation time comparison: running the code with a single job (…), SciPy is slightly faster (around 1.7 times), but keep in mind the ARFS implementation supports a weight vector. I agree that if one doesn't need weighting (the weight vector is None), then defaulting to the SciPy/NumPy implementation for speed would make perfect sense, perhaps in the next release. I hope this is clearer. Thanks for contributing.
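A minimal timing harness along these lines can reproduce the comparison; the exact ratio depends on data shape, hardware, and library versions, and this uses random data rather than either poster's dataset:

```python
import time
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(2000, 50)))

# SciPy computes the full Spearman matrix in one call on the raw array.
t0 = time.perf_counter()
rho_scipy, _ = stats.spearmanr(X.to_numpy())
t_scipy = time.perf_counter() - t0

# pandas computes the same matrix via its corr method.
t0 = time.perf_counter()
rho_pandas = X.corr(method="spearman")
t_pandas = time.perf_counter() - t0

# Both paths agree on the correlation values themselves.
assert np.allclose(rho_scipy, rho_pandas.to_numpy())
print(f"scipy: {t_scipy:.3f}s, pandas: {t_pandas:.3f}s")
```

Note that `scipy.stats.spearmanr` handles the unweighted case only, which is the trade-off discussed above.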
My larger dataset was showing a bigger speed-up, but I totally get your points about weights. Adding Numba support could also offer quite a speed-up, but it might involve more code changes than it's worth if most people aren't using large datasets.
While this implementation isn't lightning fast, some functions are inherently difficult to vectorize. Numba could offer speedups, but it has compatibility issues with recent Python and NumPy versions, often requiring significant code rewrites with uncertain gains. Therefore, for now, I'm sticking with the simpler approach and awaiting the expected performance improvements in pandas 3.0. Another option could be migrating to Polars, but it's unclear whether it outperforms the upcoming pandas version. Stay tuned for further updates!
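The weighting point is the crux of the thread: SciPy's correlation functions take no observation weights, which is presumably why a custom implementation exists. A generic weighted Pearson correlation (an illustrative sketch, not the ARFS code) can still be written in plain vectorized NumPy:

```python
import numpy as np

def weighted_pearson(x, y, w=None):
    """Pearson correlation with optional observation weights.

    w=None falls back to uniform weights, matching the unweighted case.
    (Illustrative sketch; not the ARFS implementation.)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    w = w / w.sum()
    # Weighted means, covariance, and variances.
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
print(weighted_pearson(x, y))                  # 0.8 (uniform weights)
print(weighted_pearson(x, y, w=[1, 1, 1, 5]))  # weights shift the estimate
```

A dispatch along the lines suggested above would call the SciPy/NumPy fast path when `w is None` and a function like this otherwise.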
Hi, thanks for writing such a great module!
After running `CollinearityThreshold.fit_transform()` on some of my data, I was trying to look into which unselected features are collinear with my selected features. Looking at the `assoc_matrix_`, I found that 502/643 features had no values above my threshold. This is in contrast to the number of selected features, which was only 231/643. Spot-checking some of the not-selected features showed that they also never had a value above the threshold. This led me to the code for dropping features, and I am a little confused by it.
arfs/src/arfs/feature_selection/unsupervised.py
Lines 426 to 463 in bbcc785
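As an aside, the kind of spot-check described above (counting features with any association above the threshold) can be sketched on a toy symmetric matrix; the data and variable names here are made up, not ARFS attributes:

```python
import numpy as np
import pandas as pd

# Hypothetical symmetric association matrix for four features.
assoc = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.3, 0.1],
     [0.1, 0.3, 1.0, 0.4],
     [0.2, 0.1, 0.4, 1.0]],
    index=list("abcd"), columns=list("abcd"),
)
threshold = 0.8

# Mask the diagonal (self-association) before comparing to the threshold.
off_diag = assoc.mask(np.eye(len(assoc), dtype=bool))
has_collinear_partner = (off_diag > threshold).any(axis=1)
print(has_collinear_partner.sum())  # 2: only 'a' and 'b' exceed 0.8
```

If far fewer features exceed the threshold than the selector actually drops, something in the dropping logic is suspect, which is the point of the report below.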
In lines 438-444, it looks like you are trying to sum the row and column for a given feature and return the feature with the highest average, correct? However, `association_matrix[to_drop] == association_matrix.loc[:, to_drop]` (both select columns), so in L439 your index would be all features instead of just the features in `to_drop`.
Second, in L439 and L442 you sort the series, but I believe you should just be doing a final sort in L444 or L445.
Combined, these two things seem to result in the incorrect feature being dropped.
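The column-selection equivalence claimed here is easy to verify on a toy frame (made-up data):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=["a", "b", "c"], columns=["a", "b", "c"],
)
to_drop = ["a", "b"]

# Passing a list to [] selects COLUMNS, exactly like .loc[:, to_drop]:
assert df[to_drop].equals(df.loc[:, to_drop])
print(df[to_drop].index.tolist())   # ['a', 'b', 'c'] - all features remain in the index

# Selecting the ROWS for those labels needs .loc with a row indexer:
print(df.loc[to_drop].index.tolist())  # ['a', 'b'] - just to_drop
```

So any subsequent per-index aggregation over `df[to_drop]` runs over every feature, not just the candidates in `to_drop`.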
Related, but not a bug: I found changing L427-L436 to the following resulted in a huge speedup (998 ms ± 26.7 ms per loop vs 27.3 s ± 558 ms per loop) when calling `_recursive_collinear_elimination` for me (n_features=643).
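For orientation, a generic recursive-elimination loop of the kind being discussed might look like the sketch below. This is illustrative only: the function name mirrors the one above, but the mean-based drop rule and the toy matrix are made up, and the actual `_recursive_collinear_elimination` differs.

```python
import numpy as np
import pandas as pd

def recursive_collinear_elimination(assoc: pd.DataFrame, threshold: float):
    """Repeatedly drop the feature most associated with the rest
    until no off-diagonal value exceeds the threshold (sketch)."""
    assoc = assoc.copy()
    dropped = []
    while True:
        # Mask the diagonal so self-association is ignored.
        off_diag = assoc.mask(np.eye(len(assoc), dtype=bool))
        if not (off_diag > threshold).any().any():
            break
        # Drop the feature with the highest mean association to the others.
        worst = off_diag.mean(axis=1).idxmax()
        dropped.append(worst)
        assoc = assoc.drop(index=worst, columns=worst)
    return dropped

# Tiny example: 'a' and 'b' are highly associated; one of them gets dropped.
assoc = pd.DataFrame(
    [[1.0, 0.95, 0.10],
     [0.95, 1.0, 0.20],
     [0.10, 0.20, 1.0]],
    index=list("abc"), columns=list("abc"),
)
print(recursive_collinear_elimination(assoc, 0.8))  # ['b']
```

Because the matrix shrinks each iteration, recomputing masks over the whole frame dominates the cost, which is where vectorized reimplementations like the one reported above win.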