<h1><center>cs1001.py , Tel Aviv University, Fall 2018/19</center></h1>
<img src="http://www.pngall.com/wp-content/uploads/2016/05/Python-Logo-PNG-Image-180x180.png" width=50/>

# Recitation 9

We discussed hash tables, iterators, and generators.

#### Takeaways:
- Hash tables can be useful for many algorithms, including memoization. 
- Make sure you understand the complexity analysis for hash tables (see the links below).
- A generator function is a function that contains the yield statement and returns a generator object. 
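
As a minimal sketch of the last takeaway (the name `countdown` is made up for illustration and is not part of the recitation): calling a generator function does not run its body. It only returns a generator object, and the body then runs step by step on each call to `next`.

```python
import types

def countdown(n):
    """Generator function: yields n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)                                   # the body has not started running yet
is_generator = isinstance(gen, types.GeneratorType)  # True: we got a generator object
first = next(gen)                                    # runs the body up to the first yield
rest = list(gen)                                     # consumes the remaining values
print(is_generator, first, rest)                     # True 3 [2, 1]
```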

#### Code for printing several outputs in one cell (not part of the recitation):

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

##  Hash

We wish to have a data structure that implements the operations: insert, search and delete in **expected** $O(1)$ time. 

Summary of the insert, search and delete complexity of the data structures that we have seen already:

| implementation                | insert                   | search   | delete              |
|-------------------------------|--------------------------|----------|---------------------|
| Python list                   | O(1) always at the end   | O(n)     | O(n)                |
| Python ordered list           | O(n)                     | O(log n) | O(n)                |
| Linked list                   | O(1) always at the start | O(n)     | O(1) given the node before the one to delete |
| Sorted linked list            | O(n)                     | O(n)     | O(1) given the node before the one to delete |
| Unbalanced Binary Search Tree | O(n)                     | O(n)     | O(n)                |
| Balanced Binary Search Tree   | O(log n)                 | O(log n) | O(log n)            |

Please read <a href="https://github.com/taucsrec/recitations/blob/master/2018b/Michal/rec9/DataStructures_summary.pdf"> the following summary</a> on the various data structures mentioned in class.

A detailed summary on the complexity of insert/search operations using hash tables can be found <a href="http://tau-cs1001-py.wdfiles.com/local--files/recitation-logs-2017a/hashtable_find_and_insert_complexity.pdf">here</a>. Make sure you read it.

### Exercise: 
Given a string $st$ of length $n$ and a small integer $\ell$, write a function that checks whether there is a substring in $st$ of length $\ell$ that appears more than once.

#### Solution #1: Naive

The complexity is $O(\ell(n-\ell)^2)$. 
There are $O((n-\ell)^2)$ iterations (make sure you understand why), and in each iteration we compare two substrings in $O(\ell)$ time.

In [2]:
def repeat_naive(st, l): 
    for i in range(len(st)-l+1):
        for j in range(i+1,len(st)-l+1):
            if st[i:i+l]==st[j:j+l]:
                return True
    return False

repeat_naive("hello", 1)
repeat_naive("hello"*10, 45)
repeat_naive("hello"*10, 46)

True

True

False
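
To see where the $O((n-\ell)^2)$ iteration count comes from, here is a small sketch that counts the inner-loop iterations of the naive solution when no repeat is found. The helper name `count_iterations` is an assumption made for this illustration. With $m = n-\ell+1$ substrings, every pair $i<j$ is compared once, which gives $m(m-1)/2$ iterations.

```python
def count_iterations(n, l):
    """Count inner-loop iterations of the naive solution when no repeat is found."""
    count = 0
    for i in range(n - l + 1):
        for j in range(i + 1, n - l + 1):
            count += 1
    return count

n, l = 50, 10
m = n - l + 1                                       # number of substrings of length l
print(count_iterations(n, l), m * (m - 1) // 2)     # 820 820
```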

Let's test our algorithm by generating a random string of a given size.

In [17]:
import random
def gen_str(size, alphabet = "abcdefghijklmnopqrstuvwxyz"):
    ''' Generate a random string of length size over alphabet '''
    s=""
    for i in range(size):
        s += random.choice(alphabet)
    return s
rndstr = gen_str(1000)
print(rndstr)
repeat_naive(rndstr, 3)
repeat_naive(rndstr, 10)



kkbfhilnpmeraulchrukbbhjkvfoafngncqgwtxoxqmuptkglzpuxspwuopiulvrolartazpsxnjfrwwvmwcziwtkunmtnlfhmjpjckehhzvypkvmrsagsflspgthqukwdcbwpliobauyuurlyursaguhjtzanznevjkuhzyiowbterxtnrstmlhultyqfijiqxvipwoxuugblvsccnloimhpqtgkubevvqzaeuklriishbuteumwywvoktgofqnnflfkyqqtgjhqfjoilalmdxomwlpmtdrgpitiqiburfbevsvifvpwfsluvkbjynduzmplywwkfvqscnbpkfvvobssktwqkogdotnbojqkvxshamjxvxvvhtgesmgcdqthtbaooisvpfapuuqjjplhowvuvgpejbapmkwwpdsbaxkejmzfqpwyyummsfzscfsjappgvkjgvrehwsdcurmcupebqaifurobvrebgkrsuxumavaoejxcsdypapwgdnspflricoskwogwbdlaoufgwmdabkwvzkxzqugwrxhrwktkqgclvdszfjemsekinhixeqgrguhtmwwcyfqqpuyqkgcmcqsgdlfjzvgaqehhbuzfjbbvwuoylpflpkgjlbybpuhapqfylymmsoqbvorytoafchmhczeeamwtmtfooescpnjorrxmthaqstfybjnzqidwcmifitbdkvzgkezccbxafxffvjlowevjunmysmgjbodztsiuzvfzpvckeccuicxrnujllhipjfuetuljwjemcjevrdwmaycmigvhhgmrpjinoedevmgopqmmhhjtunrguofokeptkrulvtirawoqejbayeprlerpnmrhnqsdhptgzwtipzafddtxhxqutrtyskrkbgzesivfbbiungwobzxqjlmowcqfhmdvdeoujlrbmpnixotvzaftjvlydjjatoyghwgbkujptjosiqeaddyofacmyqnejfu

True

False

For bigger $n$ and $\ell$, this could be quite slow:

In [18]:
rndstr = gen_str(10000)
repeat_naive(rndstr, 3)
repeat_naive(rndstr, 10)

True

False

### The class Hashtable from the lectures

In [44]:
class Hashtable:
    def __init__(self, m, hash_func=hash):
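        """ initialize a hash table with m empty entries """
        ##bogus initialization #1: []*m is just [], so the table has a single entry
        #self.table = [[]*m]
        ##bogus initialization #2: all m entries refer to the same list object
        #empty=[]
        #self.table = [empty for i in range(m)]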
        
        self.table = [ [] for i in range(m)]
        self.hash_mod = lambda x: hash_func(x) % m # using python hash function

    def __repr__(self):
        L = [self.table[i] for i in range(len(self.table))]
        return "".join([str(i) + " " + str(L[i]) + "\n" for i in range(len(self.table))])
    
    def find(self, item):
        """ returns True if item in hashtable, False otherwise  """
        i = self.hash_mod(item)
        return item in self.table[i]
        #if item in self.table[i]:
        #    return True
        #else:
        #    return False

    def insert(self, item):
        """ insert an item into table """
        i = self.hash_mod(item)
        if item not in self.table[i]:
            self.table[i].append(item)
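
The goal stated at the beginning of this section also includes delete, which the class above does not implement. One possible sketch, not taken from the lectures: the subclass name `HashtableWithDelete` and its `delete` method are assumptions, and `Hashtable` is repeated here in condensed form only so that the example runs on its own.

```python
class Hashtable:
    """Condensed copy of the class above (chaining with Python lists)."""
    def __init__(self, m, hash_func=hash):
        self.table = [[] for i in range(m)]
        self.hash_mod = lambda x: hash_func(x) % m

    def find(self, item):
        return item in self.table[self.hash_mod(item)]

    def insert(self, item):
        i = self.hash_mod(item)
        if item not in self.table[i]:
            self.table[i].append(item)


class HashtableWithDelete(Hashtable):
    """Hypothetical extension, not part of the lecture code."""
    def delete(self, item):
        """Remove item if present; return True if something was removed."""
        chain = self.table[self.hash_mod(item)]
        if item in chain:
            chain.remove(item)   # linear in the length of this chain
            return True
        return False


ht = HashtableWithDelete(5)
ht.insert("abc")
print(ht.delete("abc"), ht.find("abc"), ht.delete("abc"))   # True False False
```

Like find and insert, this delete only touches one chain, so its cost depends on the chain length rather than on the total number of items.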


#### Solution #2: using the class Hashtable

In [5]:
def repeat_hash1(st, l):
    m=len(st)-l+1
    htable = Hashtable(m)
    for i in range(len(st)-l+1):
        if not htable.find(st[i:i+l]):
            htable.insert(st[i:i+l])
        else:
            return True
    return False

The expected (average) complexity is: $O(\ell(n-\ell))$

Creating the table takes $O(n-\ell)$ time, and there are $O(n-\ell)$ iterations, each taking expected $O(\ell)$ time.



The worst case complexity is: $O(\ell(n-\ell)^2)$

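Creating the table takes $O(n-\ell)$ time. In the worst case all substrings are mapped to the same table entry, so the $i$-th iteration scans a chain of length $i$, and the time for executing the loop is
$\ell\cdot\sum_{i=0}^{n-\ell}{i}= O(\ell(n-\ell)^2)$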
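
The worst case can be provoked on purpose with a hash function that sends every item to the same entry. The sketch below counts item comparisons made while scanning chains. The helper `count_chain_comparisons` and the constant hash function are assumptions made for this illustration; each counted comparison is a string comparison that costs $O(\ell)$.

```python
def count_chain_comparisons(items, m, hash_func=hash):
    """Insert items into a chained table of size m and count item comparisons."""
    table = [[] for _ in range(m)]
    comparisons = 0
    for item in items:
        chain = table[hash_func(item) % m]
        found = False
        for other in chain:          # explicit scan, so that we can count
            comparisons += 1
            if other == item:
                found = True
                break
        if not found:
            chain.append(item)
    return comparisons

st = "abcdefghij"
l = 3
substrings = [st[i:i+l] for i in range(len(st) - l + 1)]   # 8 distinct substrings
worst = count_chain_comparisons(substrings, len(substrings), hash_func=lambda x: 0)
print(worst)   # 0 + 1 + ... + 7 = 28
```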

Which of Python's native data structures fits the solution?

#### Solution #3: using Python's set implementation

In [6]:
def repeat_hash2(st, l):
    htable = set() #Python sets use hash functions for fast lookup
    for i in range(len(st)-l+1):
        if st[i:i+l] not in htable:
            htable.add(st[i:i+l])
        else: return True
    return False

#### Competition between the 3 solutions

In [56]:
import time
str_len=1000
st=gen_str(str_len)
l=10
for f in [repeat_naive,repeat_hash1,repeat_hash2]:
    t0=time.perf_counter()
    res=f(st, l)
    t1=time.perf_counter()
    print(f.__name__, t1-t0, "found?",res)

repeat_naive 0.10627349599963054 found? False
repeat_hash1 0.001666572003159672 found? False
repeat_hash2 0.0003140879998682067 found? False


In [55]:
str_len=2000
st=gen_str(str_len)
l=10
for f in [repeat_naive,repeat_hash1,repeat_hash2]:
    t0=time.perf_counter()
    res=f(st, l)
    t1=time.perf_counter()
    print(f.__name__, t1-t0, "found?",res)

repeat_naive 0.39586318099463824 found? False
repeat_hash1 0.0024104119947878644 found? False
repeat_hash2 0.0006739280070178211 found? False


For a random string of size $n=1000$ and for $\ell=10$, the running time of repeat_hash2 is the smallest, while the running time of repeat_naive is the largest.

When increasing $n$ to 2000, the running time of repeat_naive increases by a factor of ~4, while the running times of repeat_hash1 and repeat_hash2 increase by a factor of roughly 2. This is consistent with quadratic and linear growth in $n$, respectively.

#### Time spent on creating the table
When $st$ is "a"$*1000$, all three functions find a repeated substring almost immediately, so repeat_hash1 is the slowest: it first spends time on creating an empty table of size 991.

In [67]:
st="a"*1000
l=10
for f in [repeat_naive,repeat_hash1,repeat_hash2]:
    t0=time.perf_counter()
    res=f(st, l)
    t1=time.perf_counter()
    print(f.__name__, t1-t0, "found?",res)

repeat_naive 7.19599484000355e-06 found? True
repeat_hash1 0.00020459499501157552 found? True
repeat_hash2 7.709997589699924e-06 found? True


### The effect of table size

The second solution, with control over the table size

In [68]:
def repeat_hash1_var_size(st, l, m=0):
    if m==0: #default hash table size is ~number of substrings to be inserted
        m=len(st)-l+1
    htable = Hashtable(m)
    for i in range(len(st)-l+1):
        if not htable.find(st[i:i+l]):
            htable.insert(st[i:i+l])
        else:
            return True
    return False

Comparing different table sizes

In [72]:
import time
str_len=1000
st=gen_str(str_len)
l=10
print("str_len=",str_len, "repeating substring len=",l)
for m in [1, 10, 100, 1000, 1500, 10000, 100000]:
    t0=time.perf_counter()
    res=repeat_hash1_var_size(st, l, m)
    t1=time.perf_counter()
    print(t1-t0, "found?",res, "table size=",m)

str_len= 1000 repeating substring len= 10
0.020394168997881934 found? False table size= 1
0.0034462350013200194 found? False table size= 10
0.0013869249960407615 found? False table size= 100
0.001538057011202909 found? False table size= 1000
0.0014074869977775961 found? False table size= 1500
0.013536139013012871 found? False table size= 10000
0.030332424998050556 found? False table size= 100000
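
One way to make sense of these measurements is to look at the chain lengths for each table size. The sketch below is an illustration under assumptions: the helper `chain_stats` is made up for it, and it uses Python's built-in `hash`, whose values for strings differ between runs, so the printed numbers vary.

```python
import random

def chain_stats(items, m):
    """Insert items into m chains; return (longest chain, average chain length)."""
    table = [[] for _ in range(m)]
    for item in items:
        chain = table[hash(item) % m]
        if item not in chain:
            chain.append(item)
    lengths = [len(chain) for chain in table]
    return max(lengths), sum(lengths) / m

alphabet = "abcdefghijklmnopqrstuvwxyz"
st = "".join(random.choice(alphabet) for _ in range(1000))
l = 10
substrings = [st[i:i+l] for i in range(len(st) - l + 1)]
for m in [1, 10, 100, 1000, 10000]:
    longest, average = chain_stats(substrings, m)
    print("table size=", m, "longest chain=", longest, "average chain length=", average)
```

With a very small table the chains are long, so every find scans many items. With a very large table the chains are short, but most entries stay empty and creating them still takes $O(m)$ time.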


### Summary

Make sure you read the following <a href="http://tau-cs1001-py.wdfiles.com/local--files/recitation-logs-2016b/m_10_repeating_substring_additional_material.pdf">summary</a> that includes a detailed explanation on the experiments.

##  Iterators and Generators

In [7]:
l = [1,2,3]
li = iter(l)
type(li)
li2 = iter(l)

list_iterator

In [8]:
next(li)
next(li)
z = next(li)
print("z is", z)
next(li2)
next(li)


1

2

z is 3


1

StopIteration: 

In [9]:
next(li)

StopIteration: 

In [19]:
print("before loop")
for elem in li2:
    print(elem)

before loop
2
3
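
The for loop above picks up where `li2` stopped because, roughly speaking, a for loop calls `iter` on its argument and then calls `next` repeatedly until `StopIteration` is raised. A minimal sketch of that behaviour (the function name `manual_for` is an assumption made for this illustration):

```python
def manual_for(iterable):
    """Collect the elements of iterable the way a for loop visits them."""
    result = []
    it = iter(iterable)          # for an iterator, iter(it) returns it itself
    while True:
        try:
            elem = next(it)
        except StopIteration:    # a for loop handles this silently
            break
        result.append(elem)
    return result

l = [1, 2, 3]
li2 = iter(l)
next(li2)                        # consume the first element, as above
print(manual_for(li2))           # [2, 3]
print(manual_for(l))             # [1, 2, 3], a list gives a fresh iterator each time
```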


#### Solving this <a href="http://tau-cs1001-py.wdfiles.com/local--files/recitation-logs-2017a/a10_geneators_exam_q.pdf"> exam question about generators</a>

Section (b)

In [3]:
def SomePairs():
    i=0
    while True:
        for j in range(i):
            yield(i,j)
        i=i+1

gen = SomePairs()
[next(gen) for x in range(10)]

[(1, 0),
 (2, 0),
 (2, 1),
 (3, 0),
 (3, 1),
 (3, 2),
 (4, 0),
 (4, 1),
 (4, 2),
 (4, 3)]

Section (c)

In [4]:
def RevGen(PairsGen):
    pairs = PairsGen()
    while True:
        pair = next(pairs)
        yield(pair[1],pair[0])

gen = RevGen(SomePairs)
[next(gen) for i in range(10)]

[(0, 1),
 (0, 2),
 (1, 2),
 (0, 3),
 (1, 3),
 (2, 3),
 (0, 4),
 (1, 4),
 (2, 4),
 (3, 4)]

Section (d1)

In [5]:
def UnionGenerators(gen1, gen2):
    while True:
        yield next(gen1)
        yield next(gen2)
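
UnionGenerators works here because both generators are infinite. If one of them were finite, `next` would raise `StopIteration` inside the generator body, which Python 3.7 and later turn into a `RuntimeError`. The following is a sketch of a variant for that case, not part of the exam solution; the name `union_generators_safe` is an assumption.

```python
def union_generators_safe(gen1, gen2):
    """Alternate between two iterators; when one ends, keep going with the other."""
    iterators = [iter(gen1), iter(gen2)]
    while iterators:
        for it in list(iterators):       # iterate over a copy, since we may remove
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)     # this iterator is exhausted

merged = list(union_generators_safe(iter([1, 2, 3]), iter("ab")))
print(merged)   # [1, 'a', 2, 'b', 3]
```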

Section (d2)

In [6]:
def EqPairs():
    i=0
    while True:
        yield (i,i)
        i=i+1
        
def AllPairs():
    return UnionGenerators(SomePairs(),
                           UnionGenerators(EqPairs(),
                                           RevGen(SomePairs)))

gen = AllPairs()
[next(gen) for i in range(10)]

[(1, 0),
 (0, 0),
 (2, 0),
 (0, 1),
 (2, 1),
 (1, 1),
 (3, 0),
 (0, 2),
 (3, 1),
 (2, 2)]
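
As a sanity check, the sketch below takes a finite prefix of AllPairs with `itertools.islice` and verifies that no pair repeats and that every pair $(a,b)$ with $a,b<5$ already appears. The generator definitions are copied from the cells above so that the example runs on its own; the prefix length 100 and the bound 5 are arbitrary choices made for this illustration.

```python
from itertools import islice

def SomePairs():
    i = 0
    while True:
        for j in range(i):
            yield (i, j)
        i = i + 1

def RevGen(PairsGen):
    pairs = PairsGen()
    while True:
        pair = next(pairs)
        yield (pair[1], pair[0])

def UnionGenerators(gen1, gen2):
    while True:
        yield next(gen1)
        yield next(gen2)

def EqPairs():
    i = 0
    while True:
        yield (i, i)
        i = i + 1

def AllPairs():
    return UnionGenerators(SomePairs(),
                           UnionGenerators(EqPairs(), RevGen(SomePairs)))

prefix = list(islice(AllPairs(), 100))       # first 100 pairs only
no_repeats = len(set(prefix)) == len(prefix)
covers_small = all((a, b) in prefix for a in range(5) for b in range(5))
print(no_repeats, covers_small)              # True True
```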