# String Search

We ended the previous lecture by noting the challenges of working with unstructured strings. The treatment of string attributes in a database requires further discussion. For string attributes, databases are good at looking up or joining on exact matches using hashing. For more complicated conditions (e.g., find all near matches in two tables), the story is less simple. We now begin our study of string-searching algorithms, sometimes called string-matching algorithms. These are an important class of string algorithms that try to find where one or several strings (also called patterns) occur within a larger string or collection of strings.

A string similarity measure is a function that evaluates how similar (or different) two strings are. Let $s$ and $t$ be two strings over the same alphabet. A string similarity function $m$ takes the two strings as arguments and returns a value between 0 and 1, that is, $m(s,t) \in [0,1]$. A value of 1 implies the strings are equal and a value of 0 implies they are maximally different.

For the purposes of this lecture, we will only consider strings from the lower-case Latin alphabet and similarity measures over them.

## Tokenization and Preprocessing
The first step in designing a good string similarity metric is tokenization. Tokenization fixes the minimum granularity of matching that we will consider. Think of tokens as the important features of a string that are worth comparing. For example, we might tokenize a string on word boundaries:

In [1]:
st = 'the quick brown fox jumped over a lazy dog'
print(st.split(' '))

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']


Or, we might tokenize a string on character boundaries:

In [2]:
st = 'the quick brown fox jumped over a lazy dog'
print(list(st))

['t', 'h', 'e', ' ', 'q', 'u', 'i', 'c', 'k', ' ', 'b', 'r', 'o', 'w', 'n', ' ', 'f', 'o', 'x', ' ', 'j', 'u', 'm', 'p', 'e', 'd', ' ', 'o', 'v', 'e', 'r', ' ', 'a', ' ', 'l', 'a', 'z', 'y', ' ', 'd', 'o', 'g']


A general tokenization class takes a string as input and returns an iterator over tokens.

In [3]:
class WordTokenization():
    '''
    This class tokenizes a string
    into a sequence of words.
    '''
    
    def __init__(self, input):
        self.input = input
    
    def __iter__(self):
        self.it = iter(self.input.split())
        return self
    
    def __next__(self):
        return next(self.it)


for i in WordTokenization('the cow jumped over the moon'):
    print(i)

the
cow
jumped
over
the
moon
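
The character-level tokenization shown earlier fits the same iterator interface. As a minimal sketch, the class below mirrors `WordTokenization`; the name `CharTokenization` is hypothetical and is not part of the lecture code.

```python
class CharTokenization():
    '''
    Hypothetical counterpart to WordTokenization:
    tokenizes a string into individual characters.
    '''

    def __init__(self, input):
        self.input = input

    def __iter__(self):
        # a string is already iterable character by character
        self.it = iter(self.input)
        return self

    def __next__(self):
        # StopIteration from next() signals the end of the string
        return next(self.it)


print(list(CharTokenization('lazy dog')))
```

Because both classes expose the same `__iter__`/`__next__` interface, any code that consumes a stream of tokens should work with either granularity.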


Often, we will further process the returned tokens. For example, we may not be interested in articles or other "stop words".

In [4]:
class StopWordFilter():
    '''
    This removes stop words from a tokenized
    stream
    '''
    
    def __init__(self, input):
        self.input = input
    
    def __iter__(self):
        self.it = iter(self.input)
        return self
    
    def __next__(self):
        rtn = next(self.it)
        # skip over stop words; next() raises StopIteration at end of input
        while rtn in ('a', 'an', 'the'):
            rtn = next(self.it)
        return rtn

for i in StopWordFilter(WordTokenization('the cow jumped over the moon')):
    print(i)

cow
jumped
over
moon


We could similarly filter the tokenized output to only return unique words (remove duplicates):

In [6]:
class DedupFilter():
    '''
    This removes duplicate tokens from a tokenized
    stream, keeping the first occurrence of each
    '''
    
    def __init__(self, input):
        self.input = input
    
    def __iter__(self):
        self.it = iter(self.input)
        self.prev = set()
        return self
    
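    def __next__(self):
        rtn = next(self.it)

        # skip tokens we have already returned; a loop (rather than
        # recursion) keeps long runs of duplicates from overflowing the stack
        while rtn in self.prev:
            rtn = next(self.it)

        self.prev.add(rtn)
        return rtn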

for i in DedupFilter(WordTokenization('the cow jumped over the moon')):
    print(i)

the
cow
jumped
over
moon
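
The filters above compose because each one consumes an iterable and produces an iterator. As a minimal sketch of the same pipeline, the classes could be written as generator functions instead. The names `word_tokens`, `drop_stop_words`, `dedup`, and `STOP_WORDS` are illustrative assumptions and not part of the lecture code.

```python
# Sketch: the tokenizer and filters written as generator functions.
# All names here are illustrative assumptions, not the lecture's classes.

STOP_WORDS = {'a', 'an', 'the'}  # the same three articles StopWordFilter removes


def word_tokens(text):
    '''Yield the whitespace-separated words of text.'''
    yield from text.split()


def drop_stop_words(tokens):
    '''Yield only the tokens that are not stop words.'''
    for tok in tokens:
        if tok not in STOP_WORDS:
            yield tok


def dedup(tokens):
    '''Yield each distinct token once, in order of first appearance.'''
    seen = set()
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            yield tok


# Filters chain by nesting, just like the class-based versions
pipeline = dedup(drop_stop_words(word_tokens('the cow jumped over the moon and the cow')))
print(list(pipeline))  # ['cow', 'jumped', 'over', 'moon', 'and']
```

A generator keeps its position implicitly, so there is no `__iter__`/`__next__` bookkeeping to write. The trade-off is that a generator can be consumed only once, whereas the classes can be re-iterated.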


## Jaccard Similarity 
Over a sequence of tokens, one of the simplest (but ubiquitous!) similarity measures is Jaccard similarity. Jaccard similarity, also known as Intersection over Union, is defined as the size of the intersection (how many unique tokens appear in both) divided by the size of the union (how many unique tokens appear in either).
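
In symbols, writing $S$ and $T$ for the sets of unique tokens of the two strings (notation introduced here for illustration), the definition can be stated as:

$$J(S, T) = \frac{|S \cap T|}{|S \cup T|}$$

For instance, the token sets $\{\text{the}, \text{quick}, \text{brown}\}$ and $\{\text{quick}, \text{silver}\}$ share one token out of four unique tokens, so $J = 1/4 = 0.25$. The value is 1 when the two token sets coincide and 0 when they are disjoint.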

In [22]:
import itertools

def intersection(s_tokens, t_tokens):
    count = 0
    for s in DedupFilter(s_tokens):
        for t in DedupFilter(t_tokens):    
            if s == t:
                count += 1

    return count

def union(s_tokens, t_tokens):
    count = 0
    for s in DedupFilter(itertools.chain(s_tokens,t_tokens)):
        count += 1
    return count

def jaccard(s_tokens, t_tokens):
    return intersection(s_tokens, t_tokens)/union(s_tokens, t_tokens)

Consider the following examples:

In [23]:
s = WordTokenization('the quick brown')
t = WordTokenization('quick silver')

print(jaccard(s,t))

0.25


In [19]:
s = WordTokenization('the quick brown')
t = WordTokenization('the quick brown')

print(jaccard(s,t))

1.0


In [20]:
s = WordTokenization('the quick brown')
t = WordTokenization('a big dog')

print(jaccard(s,t))

0.0


In [24]:
s = WordTokenization('the quick the')
t = WordTokenization('the big dog')

print(jaccard(s,t))

0.25
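
Because Jaccard similarity depends only on the sets of unique tokens, the same quantity can be computed with Python's built-in `set` type. The following is a minimal sketch rather than the lecture's implementation. The function name `jaccard_sets` and the choice to return 1.0 when both inputs are empty are assumptions made here.

```python
def jaccard_sets(s_tokens, t_tokens):
    '''
    Jaccard similarity of two iterables of tokens,
    computed with set intersection and union.
    '''
    s_set, t_set = set(s_tokens), set(t_tokens)
    union = s_set | t_set
    if not union:
        # Assumption: two empty token sets count as identical.
        # The nested-loop version above would divide by zero here.
        return 1.0
    return len(s_set & t_set) / len(union)


print(jaccard_sets('the quick brown'.split(), 'quick silver'.split()))  # 0.25
print(jaccard_sets('the quick the'.split(), 'the big dog'.split()))     # 0.25
```

Hash-based set operations avoid the nested loop in `intersection`, so the expected cost grows with the total number of tokens rather than with the product of the two token counts.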


## A Jaccard Matcher
The Jaccard similarity can be used in more complex operators. Given two lists of strings `strlist1` and `strlist2`, find all pairs of strings with a Jaccard similarity greater than `thresh`. Why might we want to do this?

In [25]:
strlist1 = ['the big bear', 'dog in the woods', 'alphabet soup', 'kermit surprise']
strlist2 = ['big bear', 'dogs', 'alphabet soup 1', 'kermit surprise']

This is often a core primitive in data integration problems. Data integration is the combination of technical and business processes used to combine data from disparate sources into meaningful and valuable information. Sometimes the two sources have slightly different descriptions of the same real-world entities.

In [26]:
class JaccardMatchOperator:


    def __init__(self, input, thresh):
        '''
        Takes a tuple of two input iterables (i1,i2)
        and a Jaccard similarity threshold
        '''
        self.in1, self.in2 = input
        self.thresh = thresh
        
    def __iter__(self):
        '''
        Initializes the iterators and fetches the first element
        '''

        self.it1 = iter(self.in1) # initialize the first input
        self.it2 = iter(self.in2) # initialize the second input
        
        self.i = next(self.it1)
        self.j = next(self.it2)
        
        return self


    """
    Below are two helper methods. Conceptually,
    we are running the following pattern:
    for i in it1:
        for j in it2:
            if jaccard(i,j) > thresh:
                return (i,j)
    To implement this with iterators, we need two
    helper methods _reset_or_inc2 (go back to the
    beginning of the inner for loop), or _inc1_or_end
    (increment the first for loop or stop)
    """

    def _reset_or_inc2(self):
        try:
            self.j = next(self.it2)

        except StopIteration:
            self.it2 = iter(self.in2)
            self.j = next(self.it2)
            self._inc1_or_end()

    def _inc1_or_end(self):
        try:
            self.i = next(self.it1)
        except StopIteration:
            self.i = None
            self.j = None


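    def __next__(self):
        '''
        The next method fetches the next matching pair
        '''

        # loop (rather than recurse) so that long runs of
        # non-matching pairs cannot exhaust the call stack
        while True:
            rtn = (self.i, self.j)

            self._reset_or_inc2()

            # both inputs are exhausted
            if rtn[0] is None:
                raise StopIteration()

            # skip non-pairs: only return pairs above the threshold
            if jaccard(WordTokenization(rtn[0]),WordTokenization(rtn[1])) > self.thresh:
                return rtn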

In [29]:
for i,j in JaccardMatchOperator((strlist1, strlist2),0.5):
    print(i,"<->",j)

the big bear <-> big bear
alphabet soup <-> alphabet soup 1
kermit surprise <-> kermit surprise
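
The nested loop described in the operator's comment can also be sketched as a generator, which lets Python manage the iteration state. This is an illustrative alternative, not the lecture's operator. `jaccard_match` and `word_jaccard` are hypothetical names, the empty-string convention is an assumption, and the strict `>` comparison mirrors the operator above.

```python
import itertools


def word_jaccard(s, t):
    '''Jaccard similarity over the unique whitespace-separated words of s and t.'''
    s_set, t_set = set(s.split()), set(t.split())
    union = s_set | t_set
    # Assumption: two strings with no tokens count as a perfect match
    return len(s_set & t_set) / len(union) if union else 1.0


def jaccard_match(strs1, strs2, thresh):
    '''Yield every pair (i, j) whose Jaccard similarity exceeds thresh.'''
    # itertools.product enumerates pairs in the same order as
    # "for i in strs1: for j in strs2"
    for i, j in itertools.product(strs1, strs2):
        if word_jaccard(i, j) > thresh:
            yield (i, j)


strlist1 = ['the big bear', 'dog in the woods', 'alphabet soup', 'kermit surprise']
strlist2 = ['big bear', 'dogs', 'alphabet soup 1', 'kermit surprise']

for i, j in jaccard_match(strlist1, strlist2, 0.5):
    print(i, "<->", j)
```

On the example lists this prints the same three pairs as the operator. Both versions still evaluate the similarity for every pair, so the work grows with the product of the two list lengths. Note that `itertools.product` reads each input fully into memory before producing pairs.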
