Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A Table Matcher module for matching identical noisy observations across two tables. #4238

Closed
wants to merge 6 commits into from

Conversation

Raahul-Singh
Copy link
Contributor

This adds a module to match the rows across two dataframes, where the rows correspond to the same observation but vary in the exact values of the measured values.

This module requires adding Scikit-learn as a dependency. If this is not acceptable (which I assume it isn't), I will need to implement the pairwise cosine similarity, euclidean distance, etc. algorithms which will be used here.

Currently, this supports the following metrics:

  • Cosine Similarity
  • Euclidean Distance
  • Pairwise Euclidean Distance
  • Fuzzy matching for string values

Possible impronvements:

  • Passing feature specific match score threshold values
  • Using different algorithms for different features.
  • Making the table matcher compatible with Astropy tables.

This was earlier part of #4233 but on further discussion with @dpshelio, it was decided that this should be implemented as a separate module.

@pep8speaks
Copy link

pep8speaks commented Jun 3, 2020

Hello @Raahul-Singh! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 11:1: W191 indentation contains tabs
Line 12:1: W191 indentation contains tabs
Line 13:1: W191 indentation contains tabs
Line 14:1: W191 indentation contains tabs
Line 15:1: W191 indentation contains tabs
Line 16:1: W191 indentation contains tabs
Line 17:1: W191 indentation contains tabs
Line 18:1: W191 indentation contains tabs
Line 19:1: W191 indentation contains tabs
Line 24:1: E101 indentation contains mixed spaces and tabs
Line 51:1: E101 indentation contains mixed spaces and tabs
Line 51:1: W191 indentation contains tabs
Line 56:1: W191 indentation contains tabs
Line 61:1: W191 indentation contains tabs
Line 65:1: W191 indentation contains tabs
Line 66:1: W191 indentation contains tabs
Line 70:1: W191 indentation contains tabs
Line 71:1: W191 indentation contains tabs
Line 75:1: W191 indentation contains tabs
Line 76:1: W191 indentation contains tabs
Line 78:1: W191 indentation contains tabs
Line 82:1: W191 indentation contains tabs
Line 84:1: W191 indentation contains tabs
Line 85:1: W191 indentation contains tabs
Line 89:1: W191 indentation contains tabs
Line 90:1: W191 indentation contains tabs
Line 92:1: W191 indentation contains tabs
Line 96:1: W191 indentation contains tabs
Line 98:1: W191 indentation contains tabs
Line 99:1: W191 indentation contains tabs

Comment last updated at 2020-06-03 16:07:26 UTC

@nabobalis nabobalis added this to the 2.1 milestone Jun 3, 2020
@nabobalis nabobalis added [WIP] util Issues relating to sunpy.util labels Jun 3, 2020
setup.cfg Outdated
@@ -40,6 +40,7 @@ install_requires =
pandas>=0.23.0
astropy>=3.2
parfive[ftp]>=1.1.0
scikit-learn>=0.23.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be moved into its own extra dependency (as below) and the imports moved into the functions with an import check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the changes I have made suffice?

@wtbarnes
Copy link
Member

wtbarnes commented Jun 8, 2020

Re: the discussion at the community meeting last week, I think there needs to be a larger discussion about whether this feature is really in the scope of sunpy core.

I'm of course not debating whether this functionality is useful. It most certainly is! But this seems to have broad utility outside of solar physics and may be more widely used/discoverable if it was placed in a package with a broader scope than sunpy.

It looks like this just works on pandas dataframes currently (please correct me if I'm wrong!). If this operated on astropy tables, perhaps it could be upstreamed to astropy? Or maybe even to pandas in its current state? I recognize there's probably a significantly larger effort involved in merging it into either of those two, particularly the latter.

Thoughts?

@Raahul-Singh
Copy link
Contributor Author

Raahul-Singh commented Jun 8, 2020

Re: the discussion at the community meeting last week, I think there needs to be a larger discussion about whether this feature is really in the scope of sunpy core.

I'm of course not debating whether this functionality is useful. It most certainly is! But this seems to have broad utility outside of solar physics and may be more widely used/discoverable if it was placed in a package with a broader scope than sunpy.

It looks like this just works on pandas dataframes currently (please correct me if I'm wrong!). If this operated on astropy tables, perhaps it could be upstreamed to astropy? Or maybe even to pandas in its current state? I recognize there's probably a significantly larger effort involved in merging it into either of those two, particularly the latter.

Thoughts?

I'm working on making it compliant with Astropy tables. It should be done soon enough. Hopefully by the next community meeting. 😅 I also request everyone to suggest any other metrics that would be useful, besides the cosines similarity and euclidean distance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
util Issues relating to sunpy.util
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants