I was reading this documentation (http://sampleclean.org/guide/) and I see that you can use any similarity metric to find the similarity between two strings on one column attribute. Can you use multiple similarity metrics to find the similarity between two strings rather than one? If so, how can you include multiple similarity metrics?
Also, what is the matrix that is fed into SVM and RandomForest? What are the columns of this matrix? Are the values different string metrics?
You can define metrics between a set of strings and use the included libraries for similarity. However, there is no guarantee that our internal optimizations, such as prefix filtering, will hold.
The deduplication step learns a discriminative model over feature vectors that represent the similarities between pairs of strings. For N records there are on the order of N^2 pairwise similarities, so only a subset L \subset N^2 of the pairs is labeled. The features are an ensemble of similarity metrics comparing the two strings. This is flexible, though: you are free to write your own featurizer, and the Active Learning should be agnostic to the choice of featurization.
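To make the shape of that matrix concrete, here is a minimal sketch in plain Python. It is not SampleClean's API (the library itself is written in Scala), and the three metrics and the names `METRICS` and `featurize_pairs` are assumptions chosen for illustration. Each row is one candidate pair of records and each column is one similarity metric; a labeled subset of these rows is what a classifier such as an SVM would train on.

```python
from difflib import SequenceMatcher
from itertools import combinations


def token_jaccard(a, b):
    """Jaccard similarity over lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def trigram_jaccard(a, b):
    """Jaccard similarity over lowercase character trigrams."""
    def grams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


def sequence_ratio(a, b):
    """difflib's matching-block ratio, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical ensemble: one matrix column per metric.
METRICS = [token_jaccard, trigram_jaccard, sequence_ratio]


def featurize_pairs(records):
    """Return (pairs, matrix): one row of metric values per unordered pair."""
    pairs = list(combinations(range(len(records)), 2))
    matrix = [[metric(records[i], records[j]) for metric in METRICS]
              for i, j in pairs]
    return pairs, matrix


records = ["Cafe Ritz-Carlton", "cafe ritz carlton", "Joe's Diner"]
pairs, X = featurize_pairs(records)
for (i, j), row in zip(pairs, X):
    print(records[i], "|", records[j], "->", [round(v, 3) for v in row])
```

With N records this produces N(N-1)/2 rows, which is why only a subset of the pairs can realistically be labeled.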
Thank you for the response. I see in the API that it returns a list of R^d elements. By default, if I feed the system a list of strings coming from one column or attribute, would it use an ensemble of string metrics? If so, what are the metrics? Also, can you show a brief example of using AnnotatedSimilarityFeaturizer and Featurizer, and how I can plug my own similarity metrics into them? Specifically, if I have the metrics Jaro distance, edit distance, and LCS, how would I use these abstract classes to write my own class?
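For reference, the general pattern being asked about can be sketched independently of SampleClean. The Python below is only an illustration: `PairFeaturizer` and `ThreeMetricFeaturizer` are hypothetical names, not the library's `Featurizer` or `AnnotatedSimilarityFeaturizer` classes, whose actual signatures are not shown in this thread. LCS is taken here to mean longest common subsequence, which is also an assumption. Each metric is normalized to [0, 1] so that the resulting vector in R^3 has comparable columns.

```python
from abc import ABC, abstractmethod
from typing import List


def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def jaro(a, b):
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(a)):
        if a_hit[i]:
            while not b_hit[k]:
                k += 1
            out_of_order += a[i] != b[k]
            k += 1
    t = out_of_order / 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3


class PairFeaturizer(ABC):
    """Hypothetical base class: maps a pair of strings to a vector in R^d."""

    @abstractmethod
    def featurize(self, a: str, b: str) -> List[float]:
        ...


class ThreeMetricFeaturizer(PairFeaturizer):
    """Feature vector: [Jaro, normalized edit similarity, normalized LCS]."""

    def featurize(self, a: str, b: str) -> List[float]:
        longest = max(len(a), len(b)) or 1  # avoid division by zero
        return [jaro(a, b),
                1 - edit_distance(a, b) / longest,
                lcs_length(a, b) / longest]


features = ThreeMetricFeaturizer().featurize("kitten", "sitting")
print(features)
```

Adding a fourth metric in this sketch only means appending another entry to the returned list; a learner that treats the vector as opaque does not need to change.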