Using Multiple Similarity Metrics and Features for SVM and RandomForest Model #64

Open
wu-s-john opened this issue Aug 5, 2015 · 2 comments

Comments

@wu-s-john

Hi,

I was reading the documentation (http://sampleclean.org/guide/) and I see that you can use any similarity metric to compare two strings in a single column (attribute). Can you use multiple similarity metrics to compare two strings rather than just one? If so, how do you include them?

Also, what is the matrix that is fed into the SVM and RandomForest models? What are its columns? Are the values the outputs of different string metrics?

@sjyk
Owner

sjyk commented Aug 5, 2015

Hi,
The API exposed in the guide is a subset of what you can do. See the Scala docs (especially http://sampleclean.org/api/#sampleclean.clean.featurize.AnnotatedSimilarityFeaturizer and http://sampleclean.org/api/#sampleclean.clean.featurize.Featurizer).

You can define your own metrics between a set of strings and use the included libraries for similarity. However, there is no guarantee that our internal optimizations, such as prefix filtering, will still hold.

The deduplication step learns a discriminative model from feature vectors that represent the similarities between pairs of strings. For N records there are N^2 pairwise similarities, so only a subset L \subset N^2 of the pairs is labeled. The features are an ensemble of similarity metrics comparing the two strings. This is flexible as well: you are free to write your own featurizer. The active learning should be agnostic to the choice of featurization.
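
To make the shape of that input concrete, here is a minimal sketch in plain Python. It is not SampleClean's Scala API: the function names, the two metrics, and the sample records are all assumptions chosen for illustration. Each row corresponds to one pair of strings and each column to one similarity metric, so d metrics give a feature vector in R^d per pair. The sketch enumerates unordered pairs, which still grows on the order of N^2.

```python
from difflib import SequenceMatcher
from itertools import combinations


def sequence_ratio(a, b):
    # Character-level similarity in [0, 1] from the standard library.
    return SequenceMatcher(None, a, b).ratio()


def token_jaccard(a, b):
    # Jaccard overlap of lowercased whitespace tokens, in [0, 1].
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


# Hypothetical ensemble: one column of the matrix per metric.
METRICS = [sequence_ratio, token_jaccard]


def pair_features(records):
    # Rows are unordered record pairs (i, j); columns follow METRICS.
    pairs = list(combinations(range(len(records)), 2))
    matrix = [[m(records[i], records[j]) for m in METRICS] for i, j in pairs]
    return pairs, matrix


# Made-up sample data, not taken from any SampleClean dataset.
records = ["Main St Cafe", "main st cafe", "Elm Street Diner", "Elm St Diner"]
pairs, matrix = pair_features(records)

# Only some pairs carry a label (1 = duplicate, 0 = distinct); these labels
# are invented here. A classifier would be trained on the labeled rows only.
labels = {(0, 1): 1, (0, 2): 0}
train_rows = [matrix[pairs.index(p)] for p in labels]
```

Swapping or adding entries in `METRICS` changes the number of columns without touching the rest, which is one way to read the point that the learner is agnostic to the featurization.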

@wu-s-john
Author

Thank you for the response. I see in the API that it returns a list of R^d elements. By default, if I feed the system a list of strings from one column (attribute), does it use an ensemble of string metrics? If so, which metrics? Also, can you show a brief example of using AnnotatedSimilarityFeaturizer and Featurizer, and of how I can plug my own similarity metrics into them? Specifically, if I have Jaro distance, edit distance, and LCS as metrics, how would I use these abstract classes to write my own class?
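
For reference, the three metrics named above can be sketched in plain Python as follows. This is only an illustration of the metrics themselves and of combining them into one feature vector; it does not use or reproduce AnnotatedSimilarityFeaturizer or Featurizer, and the `featurize` helper is hypothetical. LCS is assumed here to mean longest common subsequence (not substring), and the normalization by the longer string's length is also an assumption.

```python
def edit_distance(a, b):
    # Levenshtein distance with a rolling row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    # Length of the longest common subsequence of a and b.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def jaro(a, b):
    # Jaro similarity in [0, 1]; 1.0 means identical strings.
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    # Matched characters that appear in a different order, counted in pairs.
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def featurize(a, b):
    # Hypothetical helper: maps a pair of strings to a vector in R^3,
    # with every component scaled to [0, 1] (higher = more similar).
    longest = max(len(a), len(b)) or 1
    return [jaro(a, b),
            1 - edit_distance(a, b) / longest,
            lcs_length(a, b) / longest]
```

Calling `featurize("MARTHA", "MARHTA")` returns one row of the kind of matrix discussed earlier in the thread, with one column per metric.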
