## Keyword Deduping

As part of cleaning up the keywords, we want to dedupe keywords that are nearly identical, such as the singular and plural forms of the same term, so that they don't show up together as separate facets. To do this, we use the [simhash algorithm](https://moz.com/devblog/near-duplicate-detection/) via the [simhash-py](https://github.com/seomoz/simhash-py) module to find near-duplicate keywords, and write out a mapping that can be used to normalize them when loading the index.
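To make the idea concrete, here is a minimal pure-Python sketch of how a simhash fingerprint can be built from character shingles and compared by Hamming distance. It is not how simhash-py is implemented; the helper names (`char_shingles`, `toy_simhash`, `hamming_distance`) and the use of MD5 as the per-shingle hash are assumptions made purely for illustration.

```python
import hashlib


def char_shingles(text, window=3):
    """Return overlapping character n-grams; short strings yield themselves."""
    if len(text) <= window:
        return [text]
    return [text[i:i + window] for i in range(len(text) - window + 1)]


def toy_simhash(text, bits=64):
    """Combine per-shingle hashes into one fingerprint by per-bit majority vote."""
    counts = [0] * bits
    for shingle in char_shingles(text):
        # Assumption: MD5 truncated to `bits` bits stands in for the shingle hash.
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Keywords that share most of their shingles, such as a singular and its plural, tend to produce fingerprints that differ in only a few bits, which is the property the bit-distance threshold relies on.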

In [1]:
import os
import simhash

from nltk.metrics.distance import edit_distance

In [2]:
DATA_DIR = "../data"

MERGED_KEYWORDS = os.path.join(DATA_DIR, "raw_keywords.txt")
NEAR_DUPS = os.path.join(DATA_DIR, "keyword_neardup_mappings.tsv")

In [3]:
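def create_hash(s):
    # Build character 3-grams (shingles), hash each one, and combine the
    # shingle hashes into a single simhash for the keyword.
    shingles = ["".join(chars) for chars in simhash.shingle(s, window=3)]
    hashes = [simhash.unsigned_hash(shingle) for shingle in shingles]
    return simhash.compute(hashes)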


hashes = []
hash2keyword = {}
with open(MERGED_KEYWORDS, "r") as fmrgk:
    for line in fmrgk:
        keyword = line.strip()
        hash_val = create_hash(keyword)
        hashes.append(hash_val)
        hash2keyword[hash_val] = keyword

In [4]:
# Number of bits that may differ in matching pairs
distance = 10
# Number of blocks to split each hash into (must be greater than distance)
blocks = distance + 1
matches = simhash.find_all(hashes, blocks, distance)
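
The choice of `blocks = distance + 1` can be motivated by a pigeonhole argument: if two 64-bit hashes differ in at most `distance` bits and each hash is cut into `distance + 1` blocks, at least one block must be identical in both, so candidate pairs can be found by exact match on a block. The sketch below illustrates only that argument; `split_blocks` and `shares_block` are hypothetical helpers, not part of simhash-py, and the library's internal search may be organized differently.

```python
def split_blocks(value, num_blocks, bits=64):
    """Cut a bits-wide integer into num_blocks contiguous chunks."""
    base, extra = divmod(bits, num_blocks)
    # The first `extra` blocks get one more bit so that all bits are covered.
    sizes = [base + 1] * extra + [base] * (num_blocks - extra)
    blocks = []
    for size in sizes:
        blocks.append(value & ((1 << size) - 1))
        value >>= size
    return blocks


def shares_block(a, b, num_blocks, bits=64):
    """True if a and b agree exactly on at least one block."""
    pairs = zip(split_blocks(a, num_blocks, bits), split_blocks(b, num_blocks, bits))
    return any(x == y for x, y in pairs)


distance = 10
a = 0x0123456789ABCDEF
b = a
for bit in range(0, 60, 6):  # flip 10 bits: positions 0, 6, ..., 54
    b ^= 1 << bit

# 10 flipped bits can touch at most 10 of the 11 blocks.
print(shares_block(a, b, distance + 1))
```

With fewer blocks than `distance + 1`, a pair within the distance threshold could differ in every block and be missed by a block-based lookup.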

In [5]:
num_mappings = 0
with open(NEAR_DUPS, "w") as fndm:
    for lhs, rhs in matches:
        lhs_keyword = hash2keyword[lhs]
        rhs_keyword = hash2keyword[rhs]
        # Simhash matches are only candidates; keep pairs within 2 edits
        if edit_distance(lhs_keyword, rhs_keyword) > 2:
            continue
        # Write the longer keyword first, followed by the shorter one
        if len(lhs_keyword) < len(rhs_keyword):
            fndm.write("{:s}\t{:s}\n".format(rhs_keyword, lhs_keyword))
        else:
            fndm.write("{:s}\t{:s}\n".format(lhs_keyword, rhs_keyword))
        num_mappings += 1

print("number of mappings: {:d}".format(num_mappings))

number of mappings: 259
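
One way the mapping file could be applied when loading the index is sketched below. It assumes, based on the write order above, that each row holds the longer variant in the first column and the shorter form in the second, and that the shorter form is the one to keep. The sample rows and the helper names `load_mappings` and `normalize_keyword` are hypothetical.

```python
import io


def load_mappings(lines):
    """Parse 'variant<TAB>canonical' rows into a dict."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        variant, canonical = line.split("\t", 1)
        if variant != canonical:
            mapping[variant] = canonical
    return mapping


def normalize_keyword(keyword, mapping):
    """Follow mappings until a keyword with no further mapping is reached."""
    seen = set()  # guards against cycles such as a -> b -> a
    while keyword in mapping and keyword not in seen:
        seen.add(keyword)
        keyword = mapping[keyword]
    return keyword


# Hypothetical sample rows standing in for keyword_neardup_mappings.tsv
sample = io.StringIO("neural networks\tneural network\nkernels\tkernel\n")
mapping = load_mappings(sample)
print(normalize_keyword("neural networks", mapping))
print(normalize_keyword("svm", mapping))
```

Keywords without an entry in the mapping pass through unchanged, so the same function can be applied to every keyword at load time.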
