<h1><center>cs1001.py , Tel Aviv University, Fall 2018/19</center></h1>
<img src="http://www.pngall.com/wp-content/uploads/2016/05/Python-Logo-PNG-Image-180x180.png" width=50/>

## Exam recitation

We went over various questions from previous exams.

###### Takeaways:
- The exam is easy, all you have to do is write down the correct answers
- When in doubt, bet on 42

#### Code for printing several outputs in one cell (not part of the recitation):

In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 2016bb: q3 (recursive generators)

Reminder: in the change problem we are given an amount of money to be paid $amount$ and a list of coins $coins$ and we need to return the number of possible combinations for paying back the amount using the coins.

Our recursive solution had the following logic:
- If we need to return $amount = 0$ then we can do so in only a single way
- If we need to return a negative amount or if we don't have any coins we cannot do so at all
- Otherwise, we can either:
    - Not use the last coin in our possession at all
    - Or, use it *at least* once
    
This implies a simple recursive algorithm.

In [3]:
def change(amount, coins):
    if amount == 0:
        return 1
    if amount < 0 or coins == []:
        return 0
    return change(amount, coins[:-1]) + change(amount - coins[-1], coins)

In [4]:
change(5, [1, 2, 3])

5

### Solving with recursion and generators
Given $amount$ and $coins$, write a generator function that returns all possible combinations.

In [5]:
def change_gen(amount, coins):
    if (amount == 0):
        yield []
    elif not (amount < 0 or coins == []):
        g = change_gen(amount, coins[:-1])        
        for change in g:
            yield change
            
        g = change_gen(amount - coins[-1], coins)            
        for change in g:
            yield change+[coins[-1]]

### Testing

In [7]:
for change in change_gen(5, [1, 2, 3]):
    print(change)

[1, 1, 1, 1, 1]
[1, 1, 1, 2]
[1, 2, 2]
[1, 1, 3]
[2, 3]
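
As a sanity check, the number of lists produced by the generator should equal the count returned by the counting recursion. The sketch below restates both functions compactly; the names `count_change` and `gen_change` are hypothetical and are used only to avoid overwriting the cells above.

```python
def count_change(amount, coins):
    # Same recursion as change(): skip the last coin, or use it at least once.
    if amount == 0:
        return 1
    if amount < 0 or not coins:
        return 0
    return count_change(amount, coins[:-1]) + count_change(amount - coins[-1], coins)


def gen_change(amount, coins):
    # Same recursion as change_gen(), written with "yield from".
    if amount == 0:
        yield []
    elif amount > 0 and coins:
        yield from gen_change(amount, coins[:-1])
        for combo in gen_change(amount - coins[-1], coins):
            yield combo + [coins[-1]]


for amount, coins in [(5, [1, 2, 3]), (10, [2, 5]), (7, [2, 4])]:
    combos = list(gen_change(amount, coins))
    # Every generated list must sum to the amount, and the counts must agree.
    assert all(sum(c) == amount for c in combos)
    assert len(combos) == count_change(amount, coins)
    print(amount, coins, len(combos))
```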


## Limit on list length

Now we are given $amount$, $coins$ and $max\_len$. We need to return a single generated list of length at most $max\_len$, or $None$ if no such list exists. (In the code below this parameter is called $max\_coins$.)

In [8]:
def change_gen_maxlim(amount, coins, max_coins):
    for change in change_gen(amount, coins):
        if len(change) <= max_coins:
            return change
    return None

### Testing

In [10]:
change_gen_maxlim(5, [1,2,3], 2)
change_gen_maxlim(5, [1,2,3], 1) == None

[2, 3]

True

## 2017aa: q4 (repetition code)

Consider the following code for encoding a message using the $m$-repetition code:

In [12]:
def encode(msg, m):
    return "".join([bit*m for bit in msg])

In [13]:
encode('110', 3)

'111111000'

For a message of length $k$, what is the length and distance of the code?
- Length: $mk$
- Distance: $m$

Give an example of a corrupt message whose distance to two or more legal codewords is the same. Specify values of $m,k$ used.
- Parameters: $k=1, m=2$
- Word: $01$, has distance $1$ from both $00$ and $11$
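
The length, distance and tie claims above can be checked by brute force for small parameters. This is only a sketch: `hamming`, `codewords` and `min_distance` are helper names introduced here, not part of the exam code.

```python
from itertools import product


def encode(msg, m):
    # m-repetition code, as in the cell above
    return "".join(bit * m for bit in msg)


def hamming(u, v):
    # number of positions in which two equal-length words differ
    return sum(a != b for a, b in zip(u, v))


def codewords(k, m):
    # all legal codewords for messages of length k
    return [encode("".join(bits), m) for bits in product("01", repeat=k)]


def min_distance(k, m):
    words = codewords(k, m)
    return min(hamming(u, v) for u in words for v in words if u != v)


print(len(encode("110", 3)), min_distance(3, 4))
# distances of the corrupt word "01" from the two codewords of k=1, m=2
print([hamming("01", c) for c in codewords(1, 2)])
```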

Given the following majority function, write a decoding function for the code above. The decoding function will work as follows:
- If there is a single nearest codeword, return it
- Otherwise, return None

In [17]:
def majority(bin_s):
    n = len(bin_s)
    cnt = bin_s.count("1")
    if cnt > n/2:
        return "1"
    elif cnt < n/2:
        return "0"
    else:
        return None

### Decoding function

In [20]:
def decode_rep(trans, m):
    msg_len = len(trans)//m
    msg = ""
    for i in range(msg_len):
        bit = majority(trans[i*m:(i*m)+m])
        if bit == None:
            return None
        else:
            msg += bit
    return msg

### Testing

In [24]:
decode_rep('111000', 3)
decode_rep('111001', 3)
decode_rep('10', 2) == None


'10'

'10'

True

## Maximal number of errors

What is the maximal number of errors such that there is still a unique decoding?

There are $k$ blocks, each block of length $m$. In each block we need a majority, thus we can allow $\left\lfloor\frac{m-1}{2}\right\rfloor$ errors.

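In total: up to $k \cdot \left\lfloor\frac{m-1}{2}\right\rfloor$ errors, provided that no single block contains more than $\left\lfloor\frac{m-1}{2}\right\rfloor$ of them.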
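
A small experiment with $m=3$ (so one error per block is tolerated) illustrates the bound. The decoder is restated so the cell runs on its own, and `flip` is a hypothetical helper that corrupts chosen positions.

```python
def majority(block):
    ones = block.count("1")
    if 2 * ones > len(block):
        return "1"
    if 2 * ones < len(block):
        return "0"
    return None  # tie: no unique nearest codeword


def decode_rep(trans, m):
    msg = ""
    for i in range(len(trans) // m):
        bit = majority(trans[i * m:(i + 1) * m])
        if bit is None:
            return None
        msg += bit
    return msg


def flip(word, positions):
    # flip the bits of word at the given indices
    return "".join(("1" if b == "0" else "0") if i in positions else b
                   for i, b in enumerate(word))


sent = "111000"  # the codeword of "10" for m = 3
print(decode_rep(flip(sent, {0, 5}), 3))  # one error in each block
print(decode_rep(flip(sent, {0, 1}), 3))  # two errors in the same block
```

With one error per block the original message is recovered, while two errors inside a single block flip its majority and the decoder silently returns a different message.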

## 2017bb: q3 (memory allocation)

Rate the following snippets by the amount of memory required for each snippet and decide whether its execution will crash due to a lack of memory.

In [12]:
# Snip 1
c = 2**1000+19

# Snip 2
d = 2**(2**1000)+11

# Snip 3
def what():
    n=0
    while n < 2**10000+1:
        yield n
        n+=1

maybe = what()
for i in range(2**20+1):
    a = next(maybe)

# Snip 4
lst = [i for i in range(2**(2**1000)+11)] 

We go over them one by one:
- The first snippet requires around $1000$ bits (doable, easy)
- The second snippet requires around $2^{1000}$ bits. A regular PC has, say, 16GB of RAM, which is $2^{34}$ bytes $= 2^{37}$ bits. Clearly not doable.
- The third snippet creates a generator whose loop could run up to $2^{10000}+1$ times, but the values are produced lazily and only $2^{20}+1$ of them are requested. The counter and the loop bound need at most around $10000$ bits, so we don't have any problem here and this is indeed doable.
- The fourth snippet requires at least $1$ bit per entry for a list of length $2^{2^{1000}}$. Clearly not doable.

Space used, in ascending order: 1, 3, 2, 4
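
The sizes quoted above can be checked directly with `int.bit_length()`. In this sketch only the first 1000 values of the generator are requested (instead of $2^{20}+1$) to keep the run short; that number is an arbitrary choice made here.

```python
c = 2**1000 + 19
print(c.bit_length())  # number of bits needed for snippet 1

ram_bits = 16 * 2**30 * 8  # 16GB expressed in bits
print(ram_bits == 2**37)


def what():
    n = 0
    while n < 2**10000 + 1:
        yield n
        n += 1


maybe = what()
for _ in range(1000):  # assumption: far fewer steps than in the exam snippet
    a = next(maybe)
# the counter only ever holds the last value that was produced
print(a, a.bit_length())
```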

## 2017bb: q5 (LZ minimum rep. length)

In class we've discussed the benefit of not compressing repetitions of length $2$ in the LZ algorithm.

Our implementation required a repetition of length at least $3$. A natural question is whether this is the optimal choice of parameter?

I.e., can we find examples where requiring $min\_rep = 4$ is better than requiring $min\_rep = 3$? And how about the other way around?

Here is the code for the algorithm where we have added the $min\_rep$ optional parameter for the compression function:


In [30]:
def maxmatch(T, p, w=2**12-1, max_length=2**5-1):
    """ finds a maximum match of length k<=2**5-1 in a
    w long window, T[p:p+k] with T[p-m:p-m+k].
    Returns m (offset) and k (match length) """
    assert isinstance(T,str)
    n = len(T)
    maxmatch = 0
    offset = 0
    # Why do we need the min here?
    for m in range(1, min(p+1, w)):
        k = 0
        # Why do we need the min here?
        while k < min(n-p, max_length) and T[p-m+k] == T[p+k]:
            k += 1
        # at this point, T[p-m:p-m+k]==T[p:p+k]
        if maxmatch < k:  
            maxmatch = k
            offset = m
    return offset, maxmatch
# returned offset is smallest one (closest to p) among
# all max matches (m starts at 1)



def LZW_compress(text, w=2**12-1, max_length=2**5-1, min_rep=3):
    """LZW compression of an ascii text. Produces
       a list comprising of either ascii characters
       or pairs [m,k] where m is an offset and
       k is a match (both are non negative integers) """
    result = []
    n = len(text)
    p = 0
    while p<n:
        m,k = maxmatch(text, p, w, max_length)
        if k<min_rep:
            result.append(text[p]) #  a single char
            p += 1
        else:
            result.append([m,k])   # min_rep or more chars in match
            p += k
    return result  # produces a list composed of chars and pairs

We claim that we can find both examples where a minimal repetition of $3$ is better and ones where a minimal repetition of $4$ is better. Why is that?
The case for $3$ is easy: sometimes the longest repetition has length exactly $3$, e.g. $"abcabc"$.

The case for $4$ is a little trickier - the idea is that sometimes a greedy choice that comes in early will hurt us in the long run. Consider the string $"aab ~|~ abcd ~|~ aabcd"$ (the dividers are just for visual clarification). 

The first two blocks are left uncompressed in either case. Next, if we take $min\_rep = 3$ then we will "catch" the appearance of $"aab"$ in the last block and will then be left with $c,d$ as single characters. However, if we take $min\_rep = 4$ then we will "skip" the first $"a"$ in the last block, but we will catch $"abcd"$.


In [31]:
LZW_compress('abcabc', min_rep=3)
LZW_compress('abcabc', min_rep=4)

LZW_compress('aababcdaabcd', min_rep=3)
LZW_compress('aababcdaabcd', min_rep=4)

['a', 'b', 'c', [3, 3]]

['a', 'b', 'c', 'a', 'b', 'c']

['a', 'a', 'b', 'a', 'b', 'c', 'd', [7, 3], 'c', 'd']

['a', 'a', 'b', 'a', 'b', 'c', 'd', 'a', [5, 4]]
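
To compare the four outputs in bits, we can charge 8 bits per single character (1 flag bit + 7 ASCII bits) and 18 bits per pair (1 flag bit + 12 offset bits + 5 length bits). These costs are an assumption that matches the default window and match-length parameters; `inter_bits` is a helper name introduced here.

```python
def inter_bits(inter, char_bits=8, pair_bits=18):
    # strings are single characters, lists are [offset, length] pairs
    return sum(char_bits if isinstance(e, str) else pair_bits for e in inter)


# the intermediate representations printed above, keyed by (text, min_rep)
outputs = {
    ("abcabc", 3): ['a', 'b', 'c', [3, 3]],
    ("abcabc", 4): ['a', 'b', 'c', 'a', 'b', 'c'],
    ("aababcdaabcd", 3): ['a', 'a', 'b', 'a', 'b', 'c', 'd', [7, 3], 'c', 'd'],
    ("aababcdaabcd", 4): ['a', 'a', 'b', 'a', 'b', 'c', 'd', 'a', [5, 4]],
}

for (text, min_rep), inter in outputs.items():
    print(text, min_rep, inter_bits(inter))
```

The first string is shorter with $min\_rep = 3$ (42 versus 48 bits) and the second is shorter with $min\_rep = 4$ (82 versus 90 bits).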

## 2018ba: q2 (LZ and Huffman compression ratios)

For each of the following strings give their Huffman Tree, LZ intermediate representation and compression ratio (for both compression algorithms):

In [32]:
#st1 = "abc" * k (1 < k < 10)
#st2 = "abcabcdabcdeabcde"
#st3 = ''.join(['a'*i+'b'*i+'c'*i for i in range(5)]) 

The Huffman code is attached below

In [34]:
def ascii2bit_stream(text):
    """ Translates ASCII text to binary representation using
        7 bits per character. Assume only ASCII chars """
    return "".join([bin(ord(c))[2:].zfill(7) for c in text])



########################################################
#### HUFFMAN CODE
########################################################

def char_count(text):
    """ Counts the number of each character in text.
        Returns a dictionary, with keys being the observed characters,
        values being the counts """
    d = {}
    for ch in text:
        if ch in d:
            d[ch] += 1
        else:
            d[ch] = 1
    return d


def build_huffman_tree(char_count_dict):
    """ Receives dictionary with char:count entries
        Generates a LIST structure representing
        the binary Huffman encoding tree """
    queue = [(c,cnt) for (c,cnt) in char_count_dict.items()]

    while len(queue) > 1:
        #print(queue)
        # combine two smallest elements
        A, cntA = extract_min(queue)    # smallest in queue
        B, cntB = extract_min(queue)    # next smallest
        chars = [A,B]
        weight = cntA + cntB            # combined weight
        queue.append((chars, weight))   # insert combined node

    # only root node left
    #print("final queue:", queue)
    root, weight_trash = extract_min(queue) # weight_trash unused
    return root                             #a LIST representing the tree structure


def extract_min(queue): 
    """ queue is a list of 2-tuples (x,y).
        remove and return the tuple with minimal y """ 
    min_pair = min(queue, key = lambda pair: pair[1])
    queue.remove(min_pair)
    return min_pair



def generate_code(huff_tree, prefix=""):
    """ Receives a Huffman tree with embedded encoding,
        and a prefix of encodings.
        returns a dictionary where characters are
        keys and associated binary strings are values."""
    if isinstance(huff_tree, str): # a leaf
        return {huff_tree: prefix}
    else:
        lchild, rchild = huff_tree[0], huff_tree[1]
        codebook = {}

        codebook.update(generate_code(lchild, prefix+'0'))
        codebook.update(generate_code(rchild, prefix+'1'))
        #   oh, the beauty of recursion...
        return codebook

    
def compress(text, encoding_dict):
    """ compress text using encoding dictionary """
    assert isinstance(text, str)
    return "".join(encoding_dict[ch] for ch in text)


def reverse_dict(d):
    """ build the "reverse" of encoding dictionary """
    return {y:x for (x,y) in d.items()}


def decompress(bits, decoding_dict):
    prefix = ""
    result = []
    for bit in bits:
        prefix += bit
        if prefix in decoding_dict:
            result.append(decoding_dict[prefix])
            prefix = ""  #restart
    assert prefix == "" # must finish last codeword
    return "".join(result)  # converts list of chars to a string

## First string
For the first string, we get the following tree (here $k=1$ which of course doesn't matter because any multiple will still give a uniform distribution of characters)

In [37]:
t1 = build_huffman_tree(char_count("abc"*1))
print(t1)

['c', ['a', 'b']]


So we have $c = 0, a = 10, b = 11$, so we require $5\cdot k$ bits to represent the compressed string as opposed to $7\cdot 3k = 21k$ bits uncompressed.
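
The codewords can be read directly off the printed tree: going to the left child appends 0 and going to the right child appends 1. The sketch below restates the code-generation recursion so that it runs on its own and checks the bit count for every allowed $k$.

```python
def generate_code(tree, prefix=""):
    # a leaf is a single character; an inner node is a [left, right] list
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    code = generate_code(left, prefix + "0")
    code.update(generate_code(right, prefix + "1"))
    return code


code = generate_code(['c', ['a', 'b']])
print(code)

for k in range(2, 10):
    text = "abc" * k
    compressed_bits = sum(len(code[ch]) for ch in text)
    assert compressed_bits == 5 * k
    print(k, compressed_bits, 7 * len(text))
```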

The LZ intermediate representation is (for $k=9$):

In [41]:
print(LZW_compress('abc'*9))

['a', 'b', 'c', [3, 24]]


So the LZ representation requires $3 \cdot 8 + 1 \cdot 18 = 42$ bits (for any $1 < k < 10$!) versus the uncompressed $7\cdot 3k = 21k$ bits

## Second string
For the second string, we get the following char count ($a=b=c=4,d=3,e=2$) and thus the following tree:

In [43]:
t2 = build_huffman_tree(char_count("abcabcdabcdeabcde"))
print(t2)

[['a', 'b'], ['c', ['e', 'd']]]


So we have $a = 00, b = 01, c = 10, e = 110, d = 111$, so we require $(4+4+4)\cdot 2$ bits to represent all occurrences of $a,b,c$ and $(3+2) \cdot 3$ for $d,e$. In total, $39$ bits as opposed to $17 \cdot 7 = 119$ uncompressed.

The LZ intermediate representation is:

In [44]:
print(LZW_compress("abcabcdabcdeabcde"))

print(LZW_compress(''.join(['a'*i+'b'*i+'c'*i for i in range(5)])))



['a', 'b', 'c', [3, 3], 'd', [4, 4], 'e', [5, 5]]
['a', 'b', 'c', 'a', 'a', 'b', 'b', 'c', [6, 3], [7, 3], [8, 3], [9, 4], [10, 4], [11, 4], 'c']


So the LZ representation requires $5 \cdot 8 + 3 \cdot 18$ bits versus the uncompressed $7\cdot 17$ bits

## Third string
For the third string, we get again a uniform distribution so the tree is similar to the first one:

In [46]:
t3 = build_huffman_tree(char_count(''.join(['a'*i+'b'*i+'c'*i for i in range(5)]) ))
print(t3)

['c', ['a', 'b']]


So we have again $c = 0, a = 10, b = 11$. Now, each character appears $1+2+3+4 = 10$ times so we require $10\cdot (1+2+2) = 50$ bits to represent the compressed string as opposed to $7\cdot 30 = 210$ bits uncompressed.

The LZ intermediate representation is:

In [47]:
print(LZW_compress(''.join(['a'*i+'b'*i+'c'*i for i in range(5)])))



['a', 'b', 'c', 'a', 'a', 'b', 'b', 'c', [6, 3], [7, 3], [8, 3], [9, 4], [10, 4], [11, 4], 'c']


So the LZ representation requires $9 \cdot 8 + 6 \cdot 18$ bits versus the uncompressed $7\cdot 30$ bits
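
The question also asks for the compression ratios. The sketch below assumes the ratio is defined as compressed bits divided by uncompressed bits (7 per character), ignores the cost of storing the Huffman tree, and takes $k=9$ for the first string; the bit counts are derived from the code lengths and the intermediate representations shown above.

```python
lengths = {"st1": 27, "st2": 17, "st3": 30}  # number of characters

huffman_bits = {
    "st1": 9 * 1 + 9 * 2 + 9 * 2,          # c has a 1-bit code, a and b 2 bits
    "st2": (4 + 4 + 4) * 2 + (3 + 2) * 3,  # a, b, c: 2 bits; d, e: 3 bits
    "st3": 10 * 1 + 10 * 2 + 10 * 2,
}

lz_bits = {
    "st1": 3 * 8 + 1 * 18,  # 8 bits per char, 18 bits per [offset, length] pair
    "st2": 5 * 8 + 3 * 18,
    "st3": 9 * 8 + 6 * 18,
}

for name, n_chars in lengths.items():
    raw = 7 * n_chars
    print(name, raw,
          round(huffman_bits[name] / raw, 3),
          round(lz_bits[name] / raw, 3))
```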

## 2016ab: q3 (recursive comparison of strings)

Implement a recursive function $comp(st1,st2)$ which returns true iff $st1 == st2$.

Guidelines:
- Use recursion
- You may only compare single characters (and not substrings)

In [1]:
def comp(s1, s2):
    if len(s1) == 0 and len(s2) == 0:
        return True
    if len(s1) == 0 or len(s2) == 0:
        return False
    if s1[0] != s2[0]:
        return False
    return comp(s1[1:], s2[1:])

## Testing


In [4]:
comp('ab','ab')
comp('ab','abx')
comp('','')

True

False

True

## Adding a single joker

In this implementation $s1$ can contain the character $+$ which is equal to any single character in $s2$.


Guidelines: Add two lines in the commented area to accommodate the change

In [8]:
def comp_plus(s1, s2):
    if len(s1) == 0 and len(s2) == 0:
        return True
    if len(s1) == 0 or len(s2) == 0:
        return False
    
    ### ADD HERE ###
    if s1[0] == '+':
        return comp_plus(s1[1:], s2[1:])
    ### ADD HERE ###
    
    if s1[0] != s2[0]:
        return False
    return comp_plus(s1[1:], s2[1:])

## Testing


In [10]:
comp_plus('ab','ab')
comp_plus('ab+','abx')
comp_plus('ab+','abxx')
comp_plus('a++x','abxx')


True

True

False

True

## Adding a multiple length joker

In this implementation $s1$ can contain the character $*$ which is equal to any substring of length $1$ or more in $s2$


Guidelines: 
- Add two lines in the commented area to accommodate the change
- Here $s1$ will not contain the plus sign from the previous section

In [36]:
def comp_wild(s1, s2):
    if len(s1) == 0 and len(s2) == 0:
        return True
    if len(s1) == 0 or len(s2) == 0:
        return False
    
    ### ADD HERE ###
    if s1[0] == '*':
        return comp_wild(s1, s2[1:]) or comp_wild(s1[1:], s2[1:])
    ### ADD HERE ###
    
    if s1[0] != s2[0]:
        return False
    return comp_wild(s1[1:], s2[1:])

## Testing


In [37]:
comp_wild('ab','ab')
comp_wild('ab*','abx')
comp_wild('ab*z','abxyz')
comp_wild('a*','a')


True

True

True

False
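
One way to gain confidence in the wildcard rule is to compare it against Python's `re` module, translating each $*$ to the regular expression `.+` (one or more characters). This is only a cross-check sketch; `as_regex` is a hypothetical helper and the function is restated so the cell runs on its own.

```python
import re


def comp_wild(s1, s2):
    if len(s1) == 0 and len(s2) == 0:
        return True
    if len(s1) == 0 or len(s2) == 0:
        return False
    if s1[0] == '*':
        # '*' absorbs s2[0]; it then either keeps absorbing or ends here
        return comp_wild(s1, s2[1:]) or comp_wild(s1[1:], s2[1:])
    if s1[0] != s2[0]:
        return False
    return comp_wild(s1[1:], s2[1:])


def as_regex(pattern):
    # '*' means "one or more arbitrary characters"; the rest is literal
    return "".join(".+" if ch == "*" else re.escape(ch) for ch in pattern)


cases = [("ab", "ab"), ("ab*", "abx"), ("ab*z", "abxyz"),
         ("a*", "a"), ("*b*", "abc"), ("**", "a"), ("**", "ab")]
for pattern, text in cases:
    expected = re.fullmatch(as_regex(pattern), text, flags=re.DOTALL) is not None
    assert comp_wild(pattern, text) == expected
    print(pattern, text, expected)
```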

## 2015bb: q5 (fun with Catalan)

The Catalan sequence is the following recursive sequence:
- Base case: $C(0) = 1$
- Recursive rule: $$C(n) = \sum_{i=0}^{n-1}C(i)\cdot C(n - i - 1)$$

## The iterative case

Implement an iterative function that, given $n$, returns $C(n)$.

In [25]:
def catalan_iter(n):
    cat = [0]*(n+1)
    cat[0] = 1
    for i in range(1, n+1):
        for j in range(i):
            cat[i] += cat[j] * cat[i - j - 1]
    return cat[n]

## Testing


In [28]:
print([catalan_iter(i) for i in range(10)])

[1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]


## Running time

Assuming all arithmetic operations run in $O(1)$, what is the running time complexity of the code above given $n$?

To initialize the array we need time $O(n)$. Apart from that, all operations are constant so we simply need to compute how many loops we do, and this is our beloved arithmetic sequence $1+ 2 + 3 + \cdots + n = O(n^2)$
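
The count of inner-loop iterations can be confirmed empirically. The sketch below mirrors the loop structure of the iterative function and counts iterations instead of multiplying; `inner_iterations` is a name introduced here.

```python
def inner_iterations(n):
    # same loop structure as catalan_iter, counting the inner-loop body
    count = 0
    for i in range(1, n + 1):
        for j in range(i):
            count += 1
    return count


for n in (1, 10, 100):
    # 1 + 2 + ... + n = n(n+1)/2
    assert inner_iterations(n) == n * (n + 1) // 2
    print(n, inner_iterations(n))
```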

## The recursive case

Implement a memoized recursive function that, given $n$, returns $C(n)$.

In [33]:
def catalan_rec(n, d = {}):
    if n == 0:
        return 1
    if n in d:
        return d[n]
    temp = 0
    for i in range(n):
        temp += catalan_rec(i) * catalan_rec(n - i - 1)
    d[n] = temp
    return temp

## Testing


In [31]:
print([catalan_rec(i) for i in range(10)])

[1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862]


## Recursion depth

What is the maximum recursion depth of the code?

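Each recursive call is made with a strictly smaller argument, and in the first iteration ($i=0$) the second recursive call is $catalan\_rec(n - 1)$. The deepest chain of calls is therefore $n \to n-1 \to \cdots \to 0$, so the recursion depth is $O(n)$.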

## Running time

Assuming all arithmetic operations run in $O(1)$, what is the running time complexity of the code above given $n$?

We first note that once we compute $cat(k)$ for some $k$ it is in the dictionary and we never need to compute it again.

Next, note that when we compute $cat(k)$ we need to access all $cat(j)$ where $j < k$, and there are at most $n$ such values.

It follows that to compute each value requires $O(n)$ once all lower indices are computed, and each lower index is computed once in time at most $O(n)$, thus the running time is still $O(n^2)$
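
The argument can be checked by counting calls. The sketch below is an instrumented variant that keeps the memo in an explicit dictionary rather than a default argument; `catalan_calls` is a hypothetical name. Each value $k \geq 1$ is computed in full exactly once and issues $2k$ calls, so the total number of calls is $1 + n(n+1)$.

```python
def catalan_calls(n):
    memo = {}
    calls = 0

    def rec(m):
        nonlocal calls
        calls += 1
        if m == 0:
            return 1
        if m in memo:
            return memo[m]
        total = 0
        for i in range(m):
            total += rec(i) * rec(m - i - 1)
        memo[m] = total
        return total

    value = rec(n)
    return value, calls


for n in range(12):
    value, calls = catalan_calls(n)
    assert calls == 1 + n * (n + 1)
    print(n, value, calls)
```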

## Closed form evaluation

Apart from the recursive formula, the Catalan numbers also follow the following formula:
$$C(n) = \prod_{i=2}^{n} \frac{n+i}{i}$$

Given that, explain the following phenomenon:

In [35]:
def catalan_closed(n):
    res = 1
    for i in range(2, n+1):
        res *= ((n+i)/i)
    return round(res)

catalan_iter(35)
catalan_closed(35)

3116285494907301262

1622730281582592000

The discrepancy arises from the fact that the closed-form formula uses division, which produces floating point numbers. Floats have limited precision, so as the numbers get very large the rounding errors accumulate and we get differing answers (the iterative answer, which uses only integers, is of course the correct one).
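
One way to avoid the problem is to keep everything in integers: accumulate the numerator and the denominator of the product separately and divide once at the end. The intermediate products of the formula are not always integers, so dividing at every step with `//` would not be safe. `catalan_exact` is a name introduced here for this sketch.

```python
def catalan_exact(n):
    num, den = 1, 1
    for i in range(2, n + 1):
        num *= n + i  # numerator of the product
        den *= i      # denominator of the product
    return num // den  # exact: the full product is an integer


print([catalan_exact(i) for i in range(10)])
print(catalan_exact(35))
```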

## 2015ab - Q5

We define a new (recursive) function LZW_compress_new(text, start, w, max_length):

In [6]:
import math



def maxmatch(T, p, w=2**12-1, max_length=2**5-1):
    """ finds a maximum match of length k<=2**5-1 in a
    w long window, T[p:p+k] with T[p-m:p-m+k].
    Returns m (offset) and k (match length) """
    assert isinstance(T,str)
    n = len(T)
    maxmatch = 0
    offset = 0
    # Why do we need the min here?
    for m in range(1, min(p+1, w)):
        k = 0
        # Why do we need the min here?
        while k < min(n-p, max_length) and T[p-m+k] == T[p+k]:
            k += 1
        # at this point, T[p-m:p-m+k]==T[p:p+k]
        if maxmatch < k:  
            maxmatch = k
            offset = m
    return offset, maxmatch
# returned offset is smallest one (closest to p) among
# all max matches (m starts at 1)


 
                          
def LZW_decompress(compressed, w=2**12-1, max_length=2**5-1):
    """LZW decompression from intermediate format to ascii text"""
    result = []
    n = len(compressed)
    p = 0
    while p<n:
        if type(compressed[p]) == str:  # char, as opposed to a pair
            result.append(compressed[p])
            p+=1
        else:
            m,k = compressed[p]
            p += 1
            for i in range(0,k):
                # append k times to result;  
                result.append(result[-m])
                # fixed offset m "to the left", as result itself grows
    return "".join(result)



def LZW_compress2(text, w=2**12-1, max_length=2**5-1):
    """LZW compression of an ascii text. Produces
       a list comprising of either ascii characters
       or pairs [m,k] where m is an offset and
       k>=3 is a match length (both are non negative integers) """
    result = []
    n = len(text)
    p = 0
    while p<n:
        m,k = maxmatch(text, p, w, max_length)
        if k<3: # modified from k<2
            result.append(text[p]) # a single char
            p += 1 #even if k was 2 (why?)
        else:
            result.append([m,k])   # three or more chars in match
            p += k
    return result  # produces a list composed of chars and pairs

          

def inter_to_bin(lst, w=2**12-1, max_length=2**5-1):
    """ converts intermediate format compressed list
       to a string of bits"""
    offset_width = math.ceil(math.log(w,2))
    match_width = math.ceil(math.log(max_length, 2))
    #print(offset_width,match_width)   # for debugging
    result = []
    for elem in lst:
        if type(elem) == str:
            result.append("0")
            result.append('{:07b}'.format(ord(elem)))
        elif type(elem) == list:
            result.append("1")
            m,k = elem
            result.append(str(bin(m)[2:]).zfill(offset_width))
            result.append(str(bin(k)[2:]).zfill(match_width))     
    return "".join(ch for ch in result)
   
def LZW_compress_new(text, start=0, w=2**12-1, max_length=2**5-1):
    n = len(text)
    if start >= n:
        return []
    #find the maximal length matching
    m,k = maxmatch(text, start, w, max_length)
    res1 = [text[start]] + \
    LZW_compress_new(text, start+1, w, max_length)
    res1_len = len(inter_to_bin(res1, w, max_length))
    if k < 3:
        return res1
    res2 = [[m,k]] + LZW_compress_new(text, start+k, w, max_length)
    res2_len = len(inter_to_bin(res2, w, max_length))

    if (res2_len < res1_len):
        return res2
    return res1
   

Let s = "ababcabcd". What will be the output of LZW_compress2(s) and LZW_compress_new(s, 0)?

In [3]:
s = "ababcabcd"

In [4]:
LZW_compress2(s)

['a', 'b', 'a', 'b', 'c', [3, 3], 'd']

In [7]:
LZW_compress_new(s, 0)

['a', 'b', 'a', 'b', 'c', [3, 3], 'd']

Claim: There exists a string s for which


    len(inter_to_bin(LZW_compress_new(s, 0))) < len(inter_to_bin(LZW_compress2(s)))
    
Give an example for such a string s or explain why such a string does not exist

In [14]:
s = "aababcdaabcd"

In [15]:
LZW_compress2(s)

['a', 'a', 'b', 'a', 'b', 'c', 'd', [7, 3], 'c', 'd']

In [18]:
len(inter_to_bin(LZW_compress2(s)))

90

In [16]:
LZW_compress_new(s, 0)

['a', 'a', 'b', 'a', 'b', 'c', 'd', 'a', [5, 4]]

In [19]:
len(inter_to_bin(LZW_compress_new(s, 0)))

82

Note that this is the example from before. At each position the algorithm tries two options: (1) writing a repetition or (2) writing a character. Among the two options, it selects the one that yields the shorter binary string.

Claim: There exists a string s for which


    len(inter_to_bin(LZW_compress_new(s, 0))) > len(inter_to_bin(LZW_compress2(s)))
    
Give an example for such a string s or explain why such a string does not exist

Well, this is clearly impossible (there is no such string s).
By drawing the recursion tree of the call LZW_compress_new(s,0), we see that one of the paths from the root to a leaf is identical to the selections made by the greedy LZW_compress2 algorithm. Therefore, if LZW_compress2(s) returned the shortest compression among all compressions tried out by LZW_compress_new(s,0), then this will also be the result of LZW_compress_new.

In other words, the following is true for every string s:
    
    len(inter_to_bin(LZW_compress_new(s, 0))) <= len(inter_to_bin(LZW_compress2(s)))
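
The inequality can also be tested exhaustively on short strings. The sketch below restates both algorithms compactly under the names `greedy` and `exhaustive` (hypothetical names used here), and measures size as 8 bits per character and 18 bits per pair, which is what inter_to_bin produces with the default parameters. The recursive version takes exponential time, so only strings of length at most 8 over the alphabet "ab" are tried.

```python
from itertools import product


def maxmatch(text, p, w=2**12 - 1, max_length=2**5 - 1):
    best_m, best_k = 0, 0
    for m in range(1, min(p + 1, w)):
        k = 0
        while k < min(len(text) - p, max_length) and text[p - m + k] == text[p + k]:
            k += 1
        if k > best_k:
            best_m, best_k = m, k
    return best_m, best_k


def bits(inter):
    return sum(8 if isinstance(e, str) else 18 for e in inter)


def greedy(text):
    # same choices as LZW_compress2: take a match whenever k >= 3
    out, p = [], 0
    while p < len(text):
        m, k = maxmatch(text, p)
        if k < 3:
            out.append(text[p])
            p += 1
        else:
            out.append([m, k])
            p += k
    return out


def exhaustive(text, start=0):
    # same choices as LZW_compress_new: try both options, keep the shorter
    if start >= len(text):
        return []
    m, k = maxmatch(text, start)
    res1 = [text[start]] + exhaustive(text, start + 1)
    if k < 3:
        return res1
    res2 = [[m, k]] + exhaustive(text, start + k)
    return res2 if bits(res2) < bits(res1) else res1


for n in range(1, 9):
    for chars in product("ab", repeat=n):
        s = "".join(chars)
        assert bits(exhaustive(s)) <= bits(greedy(s))

s = "aababcdaabcd"
print(bits(greedy(s)), bits(exhaustive(s)))
```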
    