## Detecting near-duplicate keywords using dedupe

The dedupe Python library is a machine learning library that uses a combination of blocking, hierarchical clustering and logistic regression to group similar records into clusters. It focuses on structured records; a good overview of its use can be found in the [Basics of Entity Resolution with Python and Dedupe](https://medium.com/district-data-labs/basics-of-entity-resolution-with-python-and-dedupe-bc87440b64d4) article.

Because our keywords vary in length, we will use 3-char shingles and feature hashing to reduce each keyword to a fixed-length integer array of 25 features.

In [1]:
import os
import nltk
import numpy as np

from sklearn.feature_extraction import FeatureHasher
# NOTE: jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
DATA_DIR = "../data"

CURATED_KEYWORDS = os.path.join(DATA_DIR, "raw_keywords.txt")
# dedupe wants (or suggests) a CSV file
CURATED_KEYWORD_HASHES = os.path.join(DATA_DIR, "curated_keywords_hash.csv")
# output of dedupe
KEYWORD_DEDUPE_MAPPINGS = os.path.join(DATA_DIR, "keyword_dedupe_mappings.tsv")

### Encode keywords

For each keyword, we create 3-char shingles, then use the sklearn FeatureHasher to hash the list of shingles to a fixed-length integer array. The next notebook cell shows that passing these arrays through a similarity measure such as Jaccard returns intuitively reasonable values.
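To make the two steps concrete, here is a minimal pure-Python sketch of character shingling and the hashing trick. The helper names `char_shingles` and `hash_features` are hypothetical, and `hashlib.md5` is only a stand-in hash: sklearn's `FeatureHasher` uses a signed MurmurHash3, so the actual values will differ from the ones the notebook prints.

```python
import hashlib


def char_shingles(text, n=3):
    """Return the overlapping n-character substrings of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash_features(tokens, n_features=25):
    """Hashing trick: each token adds +1 or -1 to one of n_features buckets."""
    vec = [0] * n_features
    for tok in tokens:
        # md5 is an assumption made for illustration, not what sklearn uses
        digest = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        index = digest % n_features                 # bucket for this token
        sign = 1 if (digest >> 64) & 1 else -1      # sign taken from another bit
        vec[index] += sign
    return vec


v1 = hash_features(char_shingles("absolute value"))
v2 = hash_features(char_shingles("absolute values"))
print(v1)
print(v2)
```

Because "absolute values" adds exactly one shingle ("ues") to those of "absolute value", the two vectors differ in a single bucket, which is why near-duplicate keywords end up with nearly identical arrays.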

In [3]:
hasher = FeatureHasher(input_type="string", n_features=25, dtype=np.int32)
keywords = ["absolute value", "absolute values"]
hashes = []
for keyword in keywords:
    shingles = ["".join(trigram) for trigram in nltk.trigrams([c for c in keyword])]
    keyword_hash = hasher.transform([shingles]).toarray()
    hashes.append(keyword_hash[0])
    print(keyword, keyword_hash[0])

print("jaccard:", jaccard_similarity_score(hashes[0], hashes[1]))

('absolute value', array([ 0,  0,  1,  0,  0,  0,  0, -1, -1,  0,  0, -1,  0,  0,  0,  0,  0,
        0,  0, -2, -1,  0,  1,  0,  0], dtype=int32))
('absolute values', array([ 0,  0,  1,  0,  0,  0,  0, -2, -1,  0,  0, -1,  0,  0,  0,  0,  0,
        0,  0, -2, -1,  0,  1,  0,  0], dtype=int32))
('jaccard:', 0.96)
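A note on the 0.96 above: for integer arrays like these, `jaccard_similarity_score` treated the inputs as multiclass labels, and in that case the scikit-learn documentation described it as equivalent to `accuracy_score`, i.e. the fraction of positions where the two arrays agree. The sketch below reproduces that number from the two arrays printed above and, for comparison, computes a set-based Jaccard similarity directly on the shingles. The helper `shingle_set` is hypothetical and only for illustration.

```python
# The two hashed arrays, copied from the cell output above
a = [0, 0, 1, 0, 0, 0, 0, -1, -1, 0, 0, -1, 0, 0, 0, 0, 0,
     0, 0, -2, -1, 0, 1, 0, 0]
b = [0, 0, 1, 0, 0, 0, 0, -2, -1, 0, 0, -1, 0, 0, 0, 0, 0,
     0, 0, -2, -1, 0, 1, 0, 0]

# Fraction of positions that agree (what the removed function computed here)
position_agreement = sum(x == y for x, y in zip(a, b)) / float(len(a))


def shingle_set(text, n=3):
    """Set of overlapping n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# Set-based Jaccard: |intersection| / |union| of the two shingle sets
s1 = shingle_set("absolute value")
s2 = shingle_set("absolute values")
set_jaccard = len(s1 & s2) / float(len(s1 | s2))

print("position agreement:", position_agreement)
print("set jaccard:", set_jaccard)
```

The two measures answer slightly different questions: the first compares hashed buckets position by position, the second compares the raw shingle sets before hashing.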


In [4]:
fcurated = open(CURATED_KEYWORDS, "r")
fhashed = open(CURATED_KEYWORD_HASHES, "w")

# header
cols = ["id"]
cols.extend(["col_{:d}".format(i+1) for i in range(25)])
cols.append("keyword")
fhashed.write("{:s}\n".format(",".join(cols)))

# shingle each word into 3-char trigrams, then hash to 25 features
hasher = FeatureHasher(input_type="string", n_features=25, dtype=np.int32)
for rowid, keyword in enumerate(fcurated):
    keyword = keyword.strip()
    shingles = ["".join(trigram) for trigram in nltk.trigrams([c for c in keyword])]
    keyword_hash = hasher.transform([shingles]).todense().tolist()[0]
    cols = [str(rowid)]
    cols.append(",".join([str(h) for h in keyword_hash]))
    cols.append(keyword)
    fhashed.write("{:s}\n".format(",".join(cols)))

fhashed.close()
fcurated.close()
# rowid is the zero-based index of the last keyword read, i.e. one less than the count
print("num keywords: {:d}".format(rowid))

num keywords: 2281


### Cluster encoded keywords using dedupe

The pipeline is closely modeled after the [CSV example in the dedupe-examples repository](https://github.com/dedupeio/dedupe-examples). The code lives in the script `../scripts/dedupe_keyword_train.py`. Its input is the `../data/curated_keywords_hash.csv` file we generated in this notebook. The first time the model trains, the active learning step writes a settings file and a labels file, so you don't have to repeat the labeling exercise on every run. The final output is a set of keyword pairs, similar to the one we generated using `simhash` in the previous notebook.
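The training script itself is not reproduced here. As a rough sketch of its first step only, and assuming it follows the CSV example's convention of loading records into a dictionary keyed by row id, the hashed file could be read as shown below. The function name `read_hashed_keywords` is hypothetical, and the dedupe field definitions and training calls are deliberately left out because they are not shown in this notebook.

```python
import csv


def read_hashed_keywords(path):
    """Load the hashed-keyword CSV into {row_id: {column_name: value}}.

    Assumes the header written earlier in this notebook:
    id, col_1 ... col_25, keyword. Values are kept as strings.
    """
    records = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            records[int(row["id"])] = dict(row)
    return records
```

A call such as `read_hashed_keywords("../data/curated_keywords_hash.csv")` would then return one dictionary per keyword. Note that the file is written without quoting, so a keyword containing a comma would not parse cleanly with this reader.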

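The output looks like this. As a rough check, the cell below treats a pair as a true duplicate when its edit distance is at most 2, predicts a duplicate when the dedupe score is above 0.75, and prints the first 11 pairs with their score and edit distance: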

In [5]:
i = 0
labels, preds = [], []
f = open(KEYWORD_DEDUPE_MAPPINGS, "r")
for line in f:
    keyword_left, keyword_right, score = line.strip().split("\t")
    score = float(score)
    preds.append(1 if score > 0.75 else 0)
    edit_dist = nltk.edit_distance(keyword_left, keyword_right)
    labels.append(1 if edit_dist <= 2 else 0)
    if i <= 10:
        print("{:25s}\t{:25s}\t{:.3f}\t{:.3f}".format(keyword_left, keyword_right, 
                                                      score, edit_dist))
    i += 1
f.close()

acc = accuracy_score(labels, preds)
cm = confusion_matrix(labels, preds)
cr = classification_report(labels, preds)

print("---")
print("accuracy: {:.3f}".format(acc))
print("---")
print("confusion matrix")
print(cm)
print("---")
print("classification report")
print(cr)

learning approaches      	learning approach        	0.776	2.000
absolute values          	absolute value           	0.796	1.000
dual variables           	dual variable            	0.878	1.000
synaptic weights         	synaptic weight          	0.816	1.000
performance measures     	performance measure      	0.818	1.000
synthetic dataset        	synthetic data           	0.684	3.000
dynamical systems        	dynamical system         	0.836	1.000
action pairs             	action pair              	0.877	1.000
action potentials        	action potential         	0.853	1.000
learning models          	learning model           	0.816	1.000
action spaces            	action space             	0.816	1.000
---
accuracy: 0.889
---
confusion matrix
[[ 52   5]
 [ 31 235]]
---
classification report
             precision    recall  f1-score   support

          0       0.63      0.91      0.74        57
          1       0.98      0.88      0.93       266

avg / total       0.92      0.89      0.90   
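The numbers in the report can be recomputed by hand from the confusion matrix. The short sketch below does the arithmetic, relying on sklearn's convention that rows are true labels and columns are predicted labels; the variable names are chosen here for illustration.

```python
# Confusion matrix copied from the output above:
# rows = true label (0, 1), columns = predicted label (0, 1)
cm = [[52, 5],
      [31, 235]]

tn, fp = cm[0]   # true label 0: predicted 0, predicted 1
fn, tp = cm[1]   # true label 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tp + tn) / float(total)
precision_1 = tp / float(tp + fp)   # of pairs predicted duplicate, how many are
recall_1 = tp / float(tp + fn)      # of true duplicates, how many were found
precision_0 = tn / float(tn + fn)
recall_0 = tn / float(tn + fp)

print("accuracy: {:.3f}".format(accuracy))
print("class 1: precision {:.2f}, recall {:.2f}".format(precision_1, recall_1))
print("class 0: precision {:.2f}, recall {:.2f}".format(precision_0, recall_0))
```

Most of the errors are the 31 pairs with edit distance at most 2 that scored 0.75 or below, which is what pulls recall for class 1 down to 0.88.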

### What about clustering?

I also tried encoding the keywords as described and recursively splitting them with different clustering algorithms (KMeans and Spectral) until each output cluster was smaller than a preset threshold. Unfortunately, the clustering algorithms split off one row at a time, ending with N single-row clusters, where N is the size of the original dataset. So naive clustering is probably not the way to go here.
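The code for that experiment is not included in this notebook. For reference, a minimal sketch of what recursive splitting with KMeans might look like is shown below; the function name `recursive_split`, the choice of two clusters per split, and the random stand-in data are all assumptions, so this is not a reproduction of the original experiment or its result.

```python
import numpy as np
from sklearn.cluster import KMeans


def recursive_split(X, indices, max_size):
    """Bisect the rows X[indices] with KMeans until every cluster
    has at most max_size rows. Returns a list of index arrays."""
    if len(indices) <= max_size:
        return [indices]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) == 0 or len(right) == 0:
        # degenerate split (e.g. identical rows): stop instead of looping forever
        return [indices]
    return recursive_split(X, left, max_size) + recursive_split(X, right, max_size)


# Random stand-in for the 25-feature keyword hashes (assumption, not real data)
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 25))
clusters = recursive_split(X, np.arange(len(X)), max_size=8)
print([len(c) for c in clusters])
```

Printing the cluster sizes is a quick way to spot the failure mode described above: a long run of size-1 clusters means each split is peeling off a single row.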