# String Similarity III
Now, we will put the last few lectures together to study a concept called "Entity Resolution". Entity resolution is the task of identifying records or mentions that refer to the same real-world entity, and then linking and grouping them. For example, a text may refer to the same person in several ways, a business may be listed under different addresses, or several photos may show the same object. All of the following strings, for instance, represent the same country:

```
USA
U.S.A
United States
United States of America
```

Given a list of strings where there might be duplications, we must resolve each duplicate representation to a single canonical representation.

## Similarity Metrics
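Suppose we use the matchers from the previous lecture to compare and group these strings. What goes wrong? There are a couple of things that we need to decide. First, when two strings $a$ and $b$ match, which one should we merge into the other? There is also a more subtle issue: in general, similarity measures are not *transitive*. A similarity relation is transitive if $a \approx b$ and $b \approx c$ together imply $a \approx c$. For example, with a normalized edit similarity (one minus the edit distance divided by the length of the longer string) and a threshold of 0.5, `US` matches `USA` and `USA` matches `U.S.A`, but `US` does not match `U.S.A`.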
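The lack of transitivity can be checked directly. The sketch below is a minimal illustration, not the matcher from the previous lecture: `levenshtein` is a small pure-Python stand-in for a library edit-distance function, and the helper names `edit_similarity` and `similar`, as well as the threshold of 0.5, are assumptions chosen for the example.

```python
def levenshtein(s, t):
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    # similarity in [0, 1]; assumes at least one string is non-empty
    return 1 - levenshtein(s, t) / max(len(s), len(t))


def similar(s, t, thresh=0.5):
    # hypothetical helper: thresholded similarity
    return edit_similarity(s, t) >= thresh


print(similar('US', 'USA'))     # True  (distance 1, similarity ~0.67)
print(similar('USA', 'U.S.A'))  # True  (distance 2, similarity 0.6)
print(similar('US', 'U.S.A'))   # False (distance 3, similarity 0.4)
```

The first two pairs match but the third does not, so chaining pairwise matches can group strings that would never match each other directly.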

Let's try a first pass at this. Suppose we prefer the shorter string. How could we get around the transitivity issue? One solution is to repeatedly merge pairs of strings until the list reaches a "fixed point", that is, until a full pass makes no further changes.

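In [26]:
import distance

# normalize the edit distance into a similarity between 0 and 1
def editSimilarity(s, t):
    return 1 - distance.levenshtein(s, t) / max(len(s), len(t))


# one pass: replace each string with a similar, strictly shorter string
def mergeone(strlist, thresh):
    changes = 0

    for i in range(len(strlist)):
        for t in strlist:
            s = strlist[i]  # re-read, since an earlier merge may have replaced it
            if s != t and editSimilarity(s, t) >= thresh and len(t) < len(s):
                strlist[i] = t
                changes += 1
    return changes, strlist


# repeat passes until nothing changes (the fixed point)
# thresh is the similarity threshold; 0.5 is an assumed default
def recurseMerge(strlist, thresh=0.5):
    changes, strlist = mergeone(strlist, thresh)

    while changes != 0:
        changes, strlist = mergeone(strlist, thresh)

    return strlist

strlist = ['USA', 'US', 'U.S.A', 'China', 'Chia', 'Belize']
print(recurseMerge(strlist))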

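Pairwise merging like this is sensitive to the order in which pairs are compared. Another approach is to build a graph in which each string is a node and an edge connects every pair of strings whose similarity meets the threshold: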

In [18]:
import distance 

#normalize the distance between 0 and 1
def editSimilarity(s,t):
    return 1 - distance.levenshtein(s,t)/max(len(s), len(t))

#build a graph of strings
def build_graph(strlist, thresh):
    graph = {}
    for s in set(strlist):
        graph[s] = set()
        for t in set(strlist):
            if s != t and editSimilarity(s,t) >= thresh:
                graph[s].add(t)
    return graph    

When running this code, pay close attention to `U.S.A` and `US`: each is linked to `USA`, but they are not linked to each other.

In [19]:
strlist = ['USA', 'US', 'U.S.A', 'China','Chia', 'Belize']
print(build_graph(strlist, 0.5))

{'Belize': set(), 'US': {'USA'}, 'USA': {'US', 'U.S.A'}, 'China': {'Chia'}, 'Chia': {'China'}, 'U.S.A': {'USA'}}


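Now, we can run the transitive closure algorithm, which groups together all strings that are connected by a path in the graph (its connected components):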

In [24]:
def transitive_closure(graph):

    # nodes already assigned to a connected component
    already_seen = set()

    result = []

    # iterate through each node in the graph
    for node in graph:

        # if you haven't seen it before
        if node not in already_seen:

            # find all nodes connected to it
            connected_group, already_seen = get_connected_group(node, already_seen, graph)
            result.append(connected_group)

    return result


# the main thing you have to do
def get_connected_group(node, already_seen, graph):
    result = []

    # start with yourself, a set of things to expand
    nodes = set([node])

    # while you still have stuff to expand
    while nodes:

        # take an arbitrary node from the expansion set
        node = nodes.pop()
        already_seen.add(node)

        # add every neighbor you haven't already seen to the expansion set
        nodes = nodes | (graph[node] - already_seen)

        # add to result
        result.append(node)
    return result, already_seen
    

print(transitive_closure(build_graph(strlist, 0.5)))

[['Belize'], ['US', 'USA', 'U.S.A'], ['China', 'Chia']]
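To finish the resolution step, each group still needs a single canonical representation. The sketch below assumes the earlier preference for the shortest string, with ties broken alphabetically; the function name `canonical_map` is hypothetical, and the groups are copied from the output above so the example runs on its own.

```python
def canonical_map(groups):
    """Map every string to the shortest member of its group.

    Ties in length are broken alphabetically (an assumption for this sketch).
    """
    mapping = {}
    for group in groups:
        representative = min(group, key=lambda s: (len(s), s))
        for s in group:
            mapping[s] = representative
    return mapping


# connected components as printed by transitive_closure above
groups = [['Belize'], ['US', 'USA', 'U.S.A'], ['China', 'Chia']]
mapping = canonical_map(groups)

strlist = ['USA', 'US', 'U.S.A', 'China', 'Chia', 'Belize']
resolved = [mapping[s] for s in strlist]
print(resolved)  # ['US', 'US', 'US', 'Chia', 'Chia', 'Belize']
```

Preferring the shortest string picks `Chia` over `China`, which shows that the choice of canonical form deserves as much care as the grouping itself.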
