<h1><center>cs1001.py , Tel Aviv University, Spring 2018</center></h1>
<img src="http://www.pngall.com/wp-content/uploads/2016/05/Python-Logo-PNG-Image-180x180.png" width=50/>

## Recitation 10

We discussed iterators and generators, and the Karp-Rabin algorithm for string matching.

###### Takeaways:
- A generator function is a function that contains a `yield` statement; calling it returns a generator object.
- Make sure you read our KR <a href="http://tau-cs1001-py.wdfiles.com/local--files/recitation-logs-2017b/KR-summary_new.pdf">summary.</a>
- A naive solution for the string-matching problem has $O(m(n-m+1))$ time complexity, since each of the $n-m+1$ candidate positions requires up to $m$ character comparisons.
- By allowing "false-positives" (with small probability), we can obtain a linear time solution for the string-matching problem.
- Make sure you understand the way the algorithm works, and in particular the "rolling hash mechanism", that is, how to compute the fingerprint of the next substring in $O(1)$ time, given the fingerprint of the current substring.
- Make sure you understand the "arithmetization" step used by the algorithm.
- Make sure you understand the question we solved.

#### Code for printing several outputs in one cell (not part of the recitation):

In [34]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

##  Iterators and Generators

In [35]:
l = [1,2,3]
li = iter(l)
type(li)
li2 = iter(l)



list_iterator

In [36]:
next(li)
next(li)
z = next(li)
print("z is", z)
next(li2)
#next(li)
print("before loop")
for elem in li2:
    print(elem)

1

2

z is 3


1

before loop
2
3


In [37]:
next(li)

StopIteration: 

In [38]:
next(li2)

StopIteration: 

### Try Except example

In [48]:
lst = [1,2,3]
li = iter(lst)
while True:
    item = next(li)
    print ("item is ", item)


item is  1
item is  2
item is  3


StopIteration: 

In [47]:
lst = [1,2,3]
li = iter(lst)
try:
    while True:
        item = next(li)
        print ("item is ", item)
except StopIteration:
    print ("Catching the error!!!")

item is  1
item is  2
item is  3
Catching the error!!!
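
A `for` loop performs the same steps behind the scenes: it calls `iter` once, then `next` repeatedly, and stops quietly once `StopIteration` is raised. A minimal sketch of that equivalence (the helper name `manual_for` is ours, for illustration only):

```python
def manual_for(iterable):
    """Collect the items of iterable the way a for loop would (illustrative helper)."""
    items = []
    it = iter(iterable)          # a for loop first asks for an iterator
    while True:
        try:
            item = next(it)      # then it repeatedly asks for the next item
        except StopIteration:    # and it treats StopIteration as "end of loop"
            break
        items.append(item)
    return items

print(manual_for([1, 2, 3]))     # [1, 2, 3]
```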


A function which produces a list of all non-negative even numbers smaller than $n$

In [None]:
def Evens_list(n):
    ''' a list of evens up to n '''
    return [num for num in range(n) if num%2==0]

A generator function (it contains a `yield` statement) that returns a generator producing all non-negative even numbers smaller than $n$, one at a time

In [40]:
def Evens_gen(n):
    ''' returns a generator of evens up to n '''
    print("before loop")
    for i in range(n):
        print("current i:", i)
        if i%2 == 0:
            print("before yield")
            yield i
            print("after yield")

In [41]:
type(Evens_gen)
g = Evens_gen(10)
type(g)

function

generator

In [42]:
next(g)

before loop
current i: 0
before yield


0

In [43]:
a = next(g)

after yield
current i: 1
current i: 2
before yield


In [45]:
print(a)
next(g)
next(g)
next(g)
next(g)

2
after yield
current i: 3
current i: 4
before yield


4

after yield
current i: 5
current i: 6
before yield


6

after yield
current i: 7
current i: 8
before yield


8

after yield
current i: 9


StopIteration: 

A generator function which produces the **infinite** sequence of all non-negative even numbers

In [None]:
def All_evens_gen():
    i=0
    while True:
        yield i
        i+=2
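
An infinite generator cannot be turned into a list directly, since `list(All_evens_gen())` would never terminate. One possible way to look at a finite prefix is `itertools.islice` from the standard library. The sketch below repeats the definition so it can run on its own; the name `first_five` is ours.

```python
from itertools import islice

def All_evens_gen():
    i = 0
    while True:
        yield i
        i += 2

# islice(gen, 5) lazily takes only the first 5 items of the infinite sequence
first_five = list(islice(All_evens_gen(), 5))
print(first_five)   # [0, 2, 4, 6, 8]
```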

#### Solving this <a href="http://tau-cs1001-py.wdfiles.com/local--files/recitation-logs-2017a/a10_geneators_exam_q.pdf"> exam question about generators</a>

Section (b)

In [None]:
def SomePairs():
    i=0
    while True:
        for j in range(i):
            yield(i,j)
        i=i+1

Section (c)

In [None]:
def RevGen(PairsGen):
    pairs = PairsGen()
    while True:
        pair = next(pairs)
        yield(pair[1],pair[0])

Section (d1)

In [None]:
def UnionGenerators(gen1, gen2):
    while True:
        yield next(gen1)
        yield next(gen2)

Section (d2)

In [None]:
def EqPairs():
    i=0
    while True:
        yield (i,i)
        i=i+1

In [None]:
def AllPairs():
    return UnionGenerators(SomePairs(),
                           UnionGenerators(EqPairs(),
                                           RevGen(SomePairs)))
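
To get a feel for the order in which `AllPairs` enumerates pairs, we can print a finite prefix of it. The sketch below repeats the definitions above so that it runs on its own, and uses `itertools.islice` (our choice, not part of the exam solution) to take the first 8 pairs.

```python
from itertools import islice

def SomePairs():                 # pairs (i, j) with j < i
    i = 0
    while True:
        for j in range(i):
            yield (i, j)
        i = i + 1

def RevGen(PairsGen):            # pairs (j, i) with j < i
    pairs = PairsGen()
    while True:
        pair = next(pairs)
        yield (pair[1], pair[0])

def UnionGenerators(gen1, gen2): # alternates between two infinite generators
    while True:
        yield next(gen1)
        yield next(gen2)

def EqPairs():                   # pairs (i, i)
    i = 0
    while True:
        yield (i, i)
        i = i + 1

def AllPairs():
    return UnionGenerators(SomePairs(),
                           UnionGenerators(EqPairs(), RevGen(SomePairs)))

first_pairs = list(islice(AllPairs(), 8))
print(first_pairs)
# [(1, 0), (0, 0), (2, 0), (0, 1), (2, 1), (1, 1), (3, 0), (0, 2)]
```

Every pair of non-negative integers shows up after finitely many steps, because each of the three underlying generators is visited infinitely often.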

## The string-matching problem

Given a string $text$ of length $n$, and a short(er) string $pattern$ of length $m$ ($m\leq n$), report all occurrences of $pattern$ in $text$.

Example:

$text = $"abracadabra",  $pattern = $"abr"

The requested output is $[0,7]$, since $pattern$ appears in $text$ starting at indices $0$ and $7$.
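
For reference, the naive solution mentioned in the takeaways can be sketched as follows. The function name `find_matches_naive` is ours; it simply compares the pattern against every length-$m$ substring of the text.

```python
def find_matches_naive(pattern, text):
    """Return all indices s such that text[s:s+m] == pattern (illustrative sketch)."""
    n, m = len(text), len(pattern)
    # n-m+1 candidate starting positions, each compared in O(m) time
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]

print(find_matches_naive("abr", "abracadabra"))   # [0, 7]
```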

## Karp-Rabin Algorithm

In [2]:
def fingerprint(text, basis=2**16, r=2**32-3):
    """ used to compute karp-rabin fingerprint of the pattern
        employs Horner method (modulo r) """
    partial_sum = 0
    for ch in text:
        partial_sum =(partial_sum*basis + ord(ch)) % r
    return partial_sum

def text_fingerprint(text, m, basis=2**16, r=2**32-3):
    """ computes karp-rabin fingerprint of the text """
    f=[]
    b_power = pow(basis, m-1, r)
    list.append(f, fingerprint(text[0:m], basis, r))
    # f[0] equals first text fingerprint 
    for s in range(1, len(text)-m+1):
        new_fingerprint = ( (f[s-1] - ord(text[s-1])*b_power)*basis
                         +ord(text[s+m-1]) ) % r
            # compute f[s], based on f[s-1]
        list.append(f,new_fingerprint)# append f[s] to existing f       
    return f

def find_matches_KR(pattern, text, basis=2**16, r=2**32-3):
    """ find all occurrences of pattern in text
        using coin flipping Karp-Rabin algorithm """
    if len(pattern) > len(text):
        return []
    p = fingerprint(pattern, basis, r)
    f = text_fingerprint(text, len(pattern), basis, r)
    matches = [s for (s,f_s) in enumerate(f) if f_s == p]
    # list comprehension 
    return matches

In [49]:
text = "abracadabra"
pattern = "abr"

In [4]:
fingerprint("abr")

6422933

In [50]:
base = 2**16
arit = ord("a")*(base**2) + ord("b")*(base**1) + ord("r")*(base**0)
arit
r = 2**32 - 3
fp = arit%r
fp

416618250354

6422933

In [6]:
text_fingerprint(text, 3)

[6422933,
 7471495,
 6357433,
 6488452,
 6357389,
 6553988,
 6357390,
 6422933,
 7471495]
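
The rolling hash step can be checked by hand on the first two windows of the text, "abr" and "bra". The sketch below redefines `fingerprint` so that it runs on its own; the names `f_abr` and `f_bra` are ours. It removes the contribution of the leading "a", shifts by one position, and adds the new trailing "a", all modulo $r$.

```python
basis, r = 2**16, 2**32 - 3
m = 3

def fingerprint(text, basis=basis, r=r):
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum * basis + ord(ch)) % r
    return partial_sum

f_abr = fingerprint("abr")            # fingerprint of the window text[0:3]
b_power = pow(basis, m - 1, r)        # weight of the leading character

# O(1) update: drop text[0] = "a", shift, append text[3] = "a"
f_bra = ((f_abr - ord("a") * b_power) * basis + ord("a")) % r

print(f_bra)                          # 7471495
print(f_bra == fingerprint("bra"))    # True
```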

In [7]:
find_matches_KR(pattern, text)

[0, 7]

### Safe version
Makes sure no false positives occur. In the worst case, when all $n-m+1$ possible substrings are indeed matches, it behaves like the naive solution in terms of time complexity.

In [8]:
def find_matches_KR_safe(pattern, text, basis=2**16, r=2**32-3):
    """ a safe version of KR
        checks every suspect for a match """

    if len(pattern) > len(text):
        return []
    p = fingerprint(pattern, basis, r)
    f = text_fingerprint(text, len(pattern), basis, r)
    matches = [s for (s,f_s) in enumerate(f) if f_s == p \
               and text[s:s+len(pattern)]==pattern]
    # note that Python performs short-circuit (lazy) evaluation of 'and':
    # the substring comparison runs only when the fingerprints are equal
    return matches

#### Competition between versions on single char string.
This is the worst-case scenario for the safe version.
Changing $m$ has a greater effect on the safe version than on the standard KR.

In [31]:
import time

text = "a"*1000000
print("text = 'a'*",len(text))
for pattern in ["a"*100, "a"*1000, "a"*10000, "a"*100000]:
    print("pattern = 'a'*",len(pattern))
    for f in [find_matches_KR, find_matches_KR_safe]:
        t0=time.perf_counter()  # time.clock() was removed in Python 3.8
        res=f(pattern, text)
        t1=time.perf_counter()
        print (f.__name__, t1-t0)
    print("") #newline

text = 'a'* 1000000
pattern = 'a'* 100
find_matches_KR 2.0220746595732635
find_matches_KR_safe 2.267512647478725

pattern = 'a'* 1000
find_matches_KR 1.8958205700764665
find_matches_KR_safe 2.3651200483727735

pattern = 'a'* 10000
find_matches_KR 1.7998994991503423
find_matches_KR_safe 3.4687894740054617

pattern = 'a'* 100000
find_matches_KR 1.7552307019359432
find_matches_KR_safe 15.491687561574508



#### Competition between versions on random strings. 

Note that the standard and safe versions of KR have similar running times. Moreover, as $m$ increases, the running time decreases slightly, since there are fewer candidates to consider.

In [33]:
import random
def gen_str(size):
    ''' Generate a random lowercase English string of length size'''
    s=""
    for i in range(size):
        s+=random.choice("abcdefghijklmnopqrstuvwxyz")
    return s


n=1000000
m=1000
text = gen_str(n)
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=",m)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0=time.perf_counter()  # time.clock() was removed in Python 3.8
    f(pattern, text)
    t1=time.perf_counter()
    print (f.__name__, t1-t0)
    

m=10000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=",m)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0=time.perf_counter()
    f(pattern, text)
    t1=time.perf_counter()
    print (f.__name__, t1-t0)
    
m=100000
pattern = gen_str(m)
print("random str of len n=", n, " , random pattern of length m=",m)
for f in [find_matches_KR, find_matches_KR_safe]:
    t0=time.clock()
    f(pattern, text)
    t1=time.clock()
    print (f.__name__, t1-t0)

random str of len n= 1000000  , random pattern of length m= 1000


[]

find_matches_KR 1.8567712938674958


[]

find_matches_KR_safe 1.876404880254995
random str of len n= 1000000  , random pattern of length m= 10000


[]

find_matches_KR 1.8769306741451146


[]

find_matches_KR_safe 1.8435759195854189
random str of len n= 1000000  , random pattern of length m= 100000


[]

find_matches_KR 1.6260555352055235


[]

find_matches_KR_safe 1.6044790377200115


### Choice of $r$

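Setting $r$ to a power of the base, say $r=base$, produces many more false positives: every term of the arithmetization except the last one is a multiple of $base$, so the fingerprint modulo $r$ depends only on the last character of the substring. This may serve as an intuition for choosing a prime $r$.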

In [51]:
find_matches_KR("da", "abracadabra", basis=2**16, r=2**16)
find_matches_KR_safe("da", "abracadabra", basis=2**16, r=2**16)

[2, 4, 6, 9]

[6]

In [None]:
fingerprint("da", 2**16, r=2**16)
ord("d")*(2**16)**1 + ord("a")
ord("a")

fingerprint("ca", 2**16, r=2**16)
ord("c")*(2**16)**1 + ord("a")
(ord("c")*(2**16)**1 + ord("a") )%2**16

In [13]:
base = 2**16
r = 2**16
fingerprint("bda", base, r)
ord("b")*(base**2) + ord("d")*(base**1) + ord("a")
(ord("b")*base + ord("d"))*base + ord("a")
((ord("b")*base + ord("d"))*base + ord("a"))%r == ord("a")%r


fingerprint("cda", base, r)
(ord("c")*base + ord("d"))*base + ord("a")
((ord("c")*base + ord("d"))*base + ord("a"))%r == ord("a")%r

97

420913348705

420913348705

True

97

425208316001

True

In [14]:
find_matches_KR("Humpty", "Humpty Dumpty", r=2**(16*5))

[0, 7]

In [15]:
fingerprint("Humpty", r=2**(16*5))
fingerprint("Dumpty", r=2**(16*5))

2158299737877522940025

2158299737877522940025

In [16]:
text_fingerprint("Humpty Dumpty", 6, r=2**(16*5))

[2158299737877522940025,
 2010726629729956855840,
 2066067987872461357124,
 2139856371159933386869,
 2232065040410175930477,
 590314951159640293488,
 1254411530052683432052,
 2158299737877522940025]

In [17]:
find_matches_KR("Humpty", "Humpty Dumpty", r=2**(16*6))

[0]
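
The last two results follow the same pattern: when $r = base^k$, the fingerprint retains only the last $k$ characters of the string. "Humpty" and "Dumpty" share their last 5 characters, so they collide for $k=5$ but not for $k=6$. A small sketch to check this (it redefines `fingerprint` so that it runs on its own; the dictionary `collision` is ours):

```python
def fingerprint(text, basis=2**16, r=2**32-3):
    partial_sum = 0
    for ch in text:
        partial_sum = (partial_sum * basis + ord(ch)) % r
    return partial_sum

basis = 2**16
collision = {}
for k in [5, 6]:
    r = basis**k    # a power of the base: keeps only the last k characters
    collision[k] = fingerprint("Humpty", basis, r) == fingerprint("Dumpty", basis, r)
    print("k =", k, "collision:", collision[k])
# k = 5 collision: True
# k = 6 collision: False

# with r = basis**5, the leading "H" does not affect the fingerprint at all
print(fingerprint("Humpty", basis, basis**5) == fingerprint("umpty", basis, basis**5))  # True
```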