Using Multiple Similarity Metrics and Features for SVM and RandomForest Model #64

wu-s-john · 2015-08-05T02:09:22Z

Hi,

I was reading the documentation (http://sampleclean.org/guide/) and I see that you can use any similarity metric to compare two strings in a single column attribute. Is it possible to combine multiple similarity metrics when comparing two strings, rather than using just one? If so, how do I include them?

Also, what is the matrix that is fed into the SVM and RandomForest models? What are its columns? Are the values different string metrics?

sjyk · 2015-08-05T23:42:46Z

Hi,
The API exposed in the guide is a subset of what you can do. See the Scala docs (esp. http://sampleclean.org/api/#sampleclean.clean.featurize.AnnotatedSimilarityFeaturizer, http://sampleclean.org/api/#sampleclean.clean.featurize.Featurizer).

You can define metrics between a set of strings and use the included libraries for similarity. However, there is no guarantee that our internal optimizations, such as prefix filtering, will hold.

The deduplication step learns a discriminative model from feature vectors that represent similarities between strings. For N records there are on the order of N^2 pairwise similarities, so only a subset L \subset N^2 of the pairs is labeled. The features are an ensemble of similarity metrics comparing the two strings. This is flexible as well: you are free to write your own featurizer. The Active Learning should be agnostic to the choice of featurization.
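
As a rough illustration of the idea (not SampleClean's actual API, which is Scala), here is a minimal Python sketch of how an ensemble of similarity metrics can turn a pair of strings into one feature vector. Every name here (`featurize`, `METRICS`, `candidate_pairs`, and the individual metric functions) is hypothetical, and the three metrics are chosen only for illustration; they are not necessarily the ones the library uses by default.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def lcs_similarity(a: str, b: str) -> float:
    """LCS length rescaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return 1.0 if not union else len(ta & tb) / len(union)


# Hypothetical ensemble: one feature-vector column per metric.
METRICS = [edit_similarity, lcs_similarity, jaccard_similarity]


def featurize(a: str, b: str) -> list:
    """Map a pair of strings to a vector in R^d, with d = len(METRICS)."""
    return [metric(a, b) for metric in METRICS]


def candidate_pairs(records: list) -> list:
    """All unordered pairs of records: N * (N - 1) / 2 of them."""
    return list(combinations(records, 2))
```

Under these assumptions, each row of the matrix handed to a classifier would be `featurize(a, b)` for one labeled pair, with the label saying whether the pair is a duplicate. Adding another metric means appending one more function to `METRICS`, which adds one column.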

wu-s-john · 2015-08-06T06:28:25Z

Thank you for the response. I see in the API that it returns a list of R^d elements. By default, if I feed the system a list of strings from one column or attribute, does it use an ensemble of string metrics? If so, which metrics? Also, could you show a brief example of using AnnotatedSimilarityFeaturizer and Featurizer, and of how I can plug my own similarity metrics into them? Specifically, if I have the metrics Jaro distance, edit distance, and LCS, how would I use these abstract classes to write my own class?
