## Sequences and their Abstractions

### What is a sequence?

In Python, a sequence is any object that follows the sequence protocol, which entails defining the `__len__` and `__getitem__` methods. A list is one example.

In [9]:
alist=[1,2,3,4]
len(alist)#calls alist.__len__

4

In [11]:
alist[2]#calls alist.__getitem__(2)

3

Lists also support slicing. How does this work?

In [12]:
alist[2:4]

[3, 4]

To see this, let's create a dummy sequence that shows us what happens. This sequence does not create any storage; it just implements the protocol.

In [14]:
class DummySeq:
    
    def __len__(self):
        return 42
    
    def __getitem__(self, index):
        return index

In [15]:
d = DummySeq()
len(d)

42

In [16]:
d[5]

5

In [17]:
d[67:98]

slice(67, 98, None)

Aha, this is interesting. Slicing creates a `slice` object for us of the form `slice(start, stop, step)`, and then Python calls `seq.__getitem__(slice(start, stop, step))`.

What about two-dimensional indexing, if we wanted to create a two-dimensional structure?

In [18]:
d[67:98:2,1]

(slice(67, 98, 2), 1)

In [19]:
d[67:98:2,1:10]

(slice(67, 98, 2), slice(1, 10, None))

A comma-separated index arrives as a tuple of integers and `slice` objects. As sequence writers, our job is to interpret these in `__getitem__`.
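As a minimal sketch of what "interpreting" a tuple index might look like, here is a hypothetical `Grid` class (not part of the original notes) backed by a list of rows. It assumes the index is always a `(row, column)` pair and does no further validation.

```python
class Grid:
    """A tiny 2-D container backed by a list of rows (hypothetical example)."""

    def __init__(self, rows):
        self._rows = [list(r) for r in rows]

    def __getitem__(self, index):
        # A two-dimensional index arrives as a tuple: (row_index, col_index).
        if not (isinstance(index, tuple) and len(index) == 2):
            raise TypeError("Grid indices must be a (row, column) pair")
        row_idx, col_idx = index
        if isinstance(row_idx, slice):
            # Several rows selected: apply the column index to each of them.
            return [row[col_idx] for row in self._rows[row_idx]]
        # A single row selected: the column index may be an int or a slice.
        return self._rows[row_idx][col_idx]


g = Grid([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(g[1, 2])        # 6
print(g[0:2, 1])      # [2, 5]
print(g[0:2, 0:2])    # [[1, 2], [4, 5]]
```

The underlying lists do the real slicing work here; the class only decides which part of the tuple applies to rows and which to columns.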

In [21]:
dir(slice)#introspect the slice class

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'indices',
 'start',
 'step',
 'stop']
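Of the attributes listed, `indices` is probably the least obvious. The short sketch below shows that slicing syntax and an explicit `slice` object are interchangeable, and that `indices(length)` normalizes a slice against a sequence of a given length.

```python
alist = [1, 2, 3, 4]

# Slicing syntax and an explicit slice object are interchangeable.
print(alist[1:3] == alist.__getitem__(slice(1, 3, None)))   # True

# indices(length) clamps start/stop to the sequence and fills in defaults.
s = slice(67, 98, 2)
print(s.indices(len(alist)))                 # (4, 4, 2)

# Negative and missing values are normalized too.
print(slice(None, -1).indices(10))           # (0, 9, 1)
print(slice(None, None, -1).indices(4))      # (3, -1, -1)
```

A custom `__getitem__` that manages its own storage can feed the resulting triple straight into `range(start, stop, step)`.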

Let us build out our dummy sequence a bit more now.

In [1]:
#taken from Fluent Python
import numbers, reprlib

class NotSoDummySeq:    
    def __init__(self, iterator):
        self._storage=list(iterator)
        
    def __repr__(self):
        components = reprlib.repr(self._storage)
        components = components[components.find('['):]
        return 'NotSoDummySeq({})'.format(components)
    
    def __len__(self):
        return len(self._storage)
    
    def __getitem__(self, index):
        cls = type(self)
        if isinstance(index, slice):
            return cls(self._storage[index])
        elif isinstance(index, numbers.Integral): 
            return self._storage[index]
        else:
            msg = '{cls.__name__} indices must be integers' 
            raise TypeError(msg.format(cls=cls))


In [2]:
d2 = NotSoDummySeq(range(10))
len(d2)

10

In [3]:
d2

NotSoDummySeq([0, 1, 2, 3, 4, 5, ...])

In [60]:
d2[4]

4

In [61]:
reprlib.repr([1,2,3])

'[1, 2, 3]'

In [62]:
d2[2:4]

NotSoDummySeq([2, 3])

In [63]:
d2[1,4]

TypeError: NotSoDummySeq indices must be integers

It might seem that slices are only useful in writing dunder methods like `__getitem__`, but this is not the case. Imagine you are writing code to parse a file:

In [69]:
mystring="""
Ralph 1 2
Midge 3 4
"""
name = slice(0,5)
weight =  slice(5,7)
height =  slice(7,9)
for line in mystring.split('\n')[1:-1]:
    print(line[name], line[weight].strip(), line[height].strip())

Ralph 1 2
Midge 3 4


### Linked Lists

Let's think about how a list in which each element holds a pointer to the next might be constructed in Python.

#### Nested Pairs

![](http://wla.berkeley.edu/~cs61a/fa11/lectures/img/pair.png)

(from Berkeley CS61A, http://wla.berkeley.edu/~cs61a/fa11/lectures/objects.html#nested-pairs)

This way of visualizing a pair is called **box and pointer** notation. Pairs can be used to construct many abstractions, including trees.

Nesting pairs generalizes to a sequence, which looks like this:

![](http://wla.berkeley.edu/~cs61a/fa11/lectures/img/sequence.png)
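Before turning to C, the nested-pair picture can be sketched directly in Python with tuples. The helper names `first`, `rest`, `length`, and `get` are illustrative choices, and `None` is assumed to mark the end of the sequence.

```python
# Each pair is (value, rest); None marks the end of the sequence.
empty = None
seq = (1, (2, (3, (4, empty))))

def first(pair):
    return pair[0]

def rest(pair):
    return pair[1]

def length(pair):
    """Count elements by following the 'rest' pointers."""
    count = 0
    while pair is not empty:
        count += 1
        pair = rest(pair)
    return count

def get(pair, index):
    """Return the element at position index, walking from the front."""
    while index > 0:
        pair = rest(pair)
        index -= 1
    return first(pair)

print(length(seq))   # 4
print(get(seq, 2))   # 3
```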

In [100]:
%%file linked_list.c

#include <stdlib.h>
#include <stdio.h>

typedef struct item {
    int value;
    struct item* rest;
} Item;

Item* new_item(int value){
    Item* newitem = (Item *) malloc(sizeof(Item));
    newitem->value = value;
    newitem->rest = NULL;
    return newitem;
}

Item* insert_front(Item* listptr, int value){
    Item* newitem = new_item(value);
    newitem->rest = listptr;
    return newitem;
}


int get(Item* listptr, int index){
    int ctr = 0;
    Item* p;
    for(p = listptr; p!= NULL; p = p->rest){
        if (ctr==index){
            return p->value;
        }
        ctr++;
    }
    return -1;
}

int find_min_index(Item* listptr){
    int ctr = 0;
    int minctr = 0;
    Item* p;
    int min = listptr -> value;
    for(p = listptr; p!= NULL; p = p->rest){
        if (p->value <= min) {
            min = p->value;
            minctr = ctr;
        }
        ctr++;
    }
    return minctr;
}

void free_all(Item* listptr) {
    Item *p;
    Item *next;
    for(p = listptr; p!= NULL; p = next){
        next = p->rest;
        free(p);
    }
}

int main(){
    Item* listptr;
    int i;
    int minidx;
    listptr = new_item(0);
    for (i=1; i < 6; i++){
        listptr=insert_front(listptr, 10*i - 5);
    }
    for (i=0; i < 6; i++){
        printf("i %d Item %d\n", i, get(listptr, i));
    }
    minidx = find_min_index(listptr);
    printf("min index %d min value %d\n", minidx, get(listptr, minidx) );
    free_all(listptr);
}

Overwriting linked_list.c


In [101]:
!gcc -o linked_list -g linked_list.c

In [102]:
!./linked_list

i 0 Item 45
i 1 Item 35
i 2 Item 25
i 3 Item 15
i 4 Item 5
i 5 Item 0
min index 5 min value 0


### Linked List vs Array Complexities

From this implementation we can see some things:

- Insertion at the front of a linked list is O(1): `insert_front` just allocates one node and points it at the old head. Insertion at the back is O(n), because you must traverse the list to find the last node. For comparison, inserting at the front of an array is O(n), since every element must shift, while appending to a dynamic array is amortized O(1).
- Searching a linked list (as `find_min_index` does when it scans for the minimum) is like linear search on an array: O(n).
- What about deletion? See the lab.
- What about indexing? It's O(1) for arrays (you just seek). But in linked lists it's O(n), as you must follow the pointers (`get`).
- Setting a value (not implemented here) follows the indexing complexity.

The space complexity of both arrays and lists is O(n).
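The indexing and insertion costs can be made concrete by counting pointer hops. This is a rough sketch using `[value, rest]` list nodes; `build_linked` and `linked_get` are hypothetical helpers, not part of the C program above.

```python
def build_linked(values):
    """Build a [value, rest] linked list; returns the head node."""
    head = None
    for v in reversed(values):
        head = [v, head]
    return head

def linked_get(head, index):
    """Return (value, hops): hops counts how many pointers were followed."""
    hops = 0
    node = head
    while hops < index:
        node = node[1]
        hops += 1
    return node[0], hops

values = list(range(1000))
head = build_linked(values)

# Indexing deep into the linked list follows one pointer per position.
print(linked_get(head, 0))     # (0, 0)
print(linked_get(head, 999))   # (999, 999)

# Inserting at the front is a single new node, whatever the length.
head = [-1, head]
print(linked_get(head, 0))     # (-1, 0)
```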

The linked list has some pluses:

- memory won't overflow a preallocated block; you just allocate more
- insertion and deletion (especially at the end) are rather simple, and don't come with attendant resizing or unused-capacity worries
- if the records (here ints) are themselves large (say they are heap-allocated C structs or complex Python objects), the memory impact could actually be lower

But we have given up efficient random access and cache locality (for the smaller records, anyway).

### A Python Implementation

In [77]:
#todo: make a copy instead????
from doctest import run_docstring_examples as dtest
import numbers
import reprlib
class LL:
    """
    >>> A = LL()  
    >>> A[0]
    Traceback (most recent call last):
        ...
    IndexError: trying to index an empty LL
    >>> A.insert_front(1)
    >>> A[0]
    1
    >>> A.insert_back(2)
    >>> A[1]
    2
    >>> A
    LL([1,...])
    >>> myll = LL.from_components([1,2])
    >>> myll[1]
    1
    >>> len(myll)
    2
    >>> myll[2]
    Traceback (most recent call last):
        ...
    IndexError: LL index out of range
    >>> myll[0:1]
    Traceback (most recent call last):
        ...
    TypeError: LL indices must be integers
    """
    @classmethod
    def from_components(cls, components):
        inst = cls(components[0])
        for c in components[1:]:
            inst.insert_front(c)
        return inst
        
    def __init__(self, head=None):
        if head is None:
            self._headNode = None
        else:
            self._headNode = [head, None]
            
    def insert_front(self, element):
        new_node = [element, None]
        new_node[1] = self._headNode
        self._headNode = new_node
        
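    def insert_back(self, element):
        new_node = [element, None]
        if self._headNode is None:
            # empty list: the new node becomes the head
            self._headNode = new_node
            return
        curr_ptr = self._headNode
        while curr_ptr[1] is not None:
            curr_ptr = curr_ptr[1]
        curr_ptr[1] = new_node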
        
    def __repr__(self):
        class_name = type(self).__name__
        if len(self)==0:
            components=""
        else:
            components = reprlib.repr(self[0])
        return '{}([{},...])'.format(class_name,components)


    def __len__(self):
        curr_ptr = self._headNode
        count = 0
        if curr_ptr==None:
            return 0
        while 1:
            count = count + 1
            if curr_ptr[1] is None:
                break
            curr_ptr = curr_ptr[1]
        return count    
    
    def __getitem__(self, index):
        class_name = type(self).__name__
        if isinstance(index, numbers.Integral): 
            curr_ptr = self._headNode
            if curr_ptr is None:
                msg = 'trying to index an empty {class_name}' 
                raise IndexError(msg.format(class_name=class_name))
            count = 0
            while 1:
                if index == count:
                    return curr_ptr[0]
                if curr_ptr[1] is None:
                    msg = '{class_name} index out of range' 
                    raise IndexError(msg.format(class_name=class_name))       
                count += 1
                curr_ptr = curr_ptr[1]
        else:
            msg = '{class_name} indices must be integers' 
            raise TypeError(msg.format(class_name=class_name))

In [78]:
from doctest import run_docstring_examples as dtest
dtest(LL, globals(), verbose = True)

Finding tests in NoName
Trying:
    A = LL()  
Expecting nothing
ok
Trying:
    A[0]
Expecting:
    Traceback (most recent call last):
        ...
    IndexError: trying to index an empty LL
ok
Trying:
    A.insert_front(1)
Expecting nothing
ok
Trying:
    A[0]
Expecting:
    1
ok
Trying:
    A.insert_back(2)
Expecting nothing
ok
Trying:
    A[1]
Expecting:
    2
ok
Trying:
    A
Expecting:
    LL([1,...])
ok
Trying:
    myll = LL.from_components([1,2])
Expecting nothing
ok
Trying:
    myll[1]
Expecting:
    1
ok
Trying:
    len(myll)
Expecting:
    2
ok
Trying:
    myll[2]
Expecting:
    Traceback (most recent call last):
        ...
    IndexError: LL index out of range
ok
Trying:
    myll[0:1]
Expecting:
    Traceback (most recent call last):
        ...
    TypeError: LL indices must be integers
ok


In [81]:
myll=LL.from_components([1,2,32,-4,5])
myll

LL([5,...])

In [83]:
min(myll), max(myll)

(-4, 32)

In [85]:
min([1,2,32,-4,5]),max([1,2,32,-4,5])

(-4, 32)

### From pointers to Iterators

In [114]:
%%file simplelist.c
#include <stdlib.h>
#include <stdio.h>

int find_min(int* alist, int size){
    int minctr = 0;
    int min = *alist;
    for(int i =0; i < size; i++){
        if (*(alist+i) <= min) {
            min = *(alist+i);
            minctr=i;
        }
    }
    return minctr;
}

int main() {
    int alist[10];
    int i;
    int minctr;
    
    for (i = 0; i < 10; i++) {
        alist[i] =-10*i+5;
        printf("alist %d :  %d\n", i, alist[i]);
    }
    for (i = 0; i < 10; i++) {
        printf("alist %d :  %d\n", i, *(alist + i));
    }
    minctr = find_min(alist, 10);
    printf("min index %d value %d\n", minctr, alist[minctr] );
}

Overwriting simplelist.c


So we can find minimums on both linked lists and arrays:

In [115]:
!gcc -o simplelist simplelist.c

In [116]:
!./simplelist

alist 0 :  5
alist 1 :  -5
alist 2 :  -15
alist 3 :  -25
alist 4 :  -35
alist 5 :  -45
alist 6 :  -55
alist 7 :  -65
alist 8 :  -75
alist 9 :  -85
alist 0 :  5
alist 1 :  -5
alist 2 :  -15
alist 3 :  -25
alist 4 :  -35
alist 5 :  -45
alist 6 :  -55
alist 7 :  -65
alist 8 :  -75
alist 9 :  -85
min index 9 value -85


Notice that you can access elements in the C array, one by one, by **POSITION**, simply by pointer arithmetic, without ever indexing. In a similar but not identical fashion, one can simply follow the `rest` pointers to the next **POSITION** in a linked list. This suggests abstracting the **position**, or pointer, into an **iterator**: an abstraction which allows us to treat arrays and linked lists with an identical interface. The salient points of this abstraction are:

- the notion of a `next` abstracting away the actual gymnastics of where to go next in a storage system
- the notion of a `first` to a `last` that `next` takes us on a journey from and to respectively

This is not to say we haven't already abstracted a linked list to a sequence by implementing the sequence protocol, but to suggest an additional abstraction that is more fundamental than the notion of a sequence: the **iterator**.
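As a sketch of this idea, the two hypothetical classes below wrap an array-like list and a `[value, rest]` linked list behind the same "give me the next value" interface. They use Python's `__next__` hook, which the next section introduces. `find_min` then mirrors the C `find_min` without knowing which kind of storage it is walking.

```python
class ArrayPosition:
    """Walks a Python list by index (stands in for pointer arithmetic)."""
    def __init__(self, data):
        self._data = data
        self._i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._i >= len(self._data):
            raise StopIteration
        value = self._data[self._i]
        self._i += 1
        return value


class LinkedPosition:
    """Walks [value, rest] nodes by following the rest pointer."""
    def __init__(self, head):
        self._node = head

    def __iter__(self):
        return self

    def __next__(self):
        if self._node is None:
            raise StopIteration
        value, self._node = self._node
        return value


def find_min(position):
    """Smallest value reachable from a position, using only next()."""
    smallest = next(position)
    for value in position:
        if value < smallest:
            smallest = value
    return smallest


arr = [5, -5, -15, -25]
head = [45, [35, [25, [0, None]]]]
print(find_min(ArrayPosition(arr)))     # -25
print(find_min(LinkedPosition(head)))   # 0
```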

### Iterators in python

Iteration in Python is even simpler. Just as a sequence is something implementing `__getitem__` and `__len__`, an **iterable** is something implementing `__iter__`. `__len__` is not needed and indeed may not make sense. Instead of `__iter__`, an iterable might implement `__getitem__`; Python looks for `__iter__` first and falls back to `__getitem__`.

The following example is taken from Fluent Python

In [33]:
import reprlib
class Sentence:
    def __init__(self, text): 
        self.text = text
        self.words = text.split()
        
    def __getitem__(self, index):
        return self.words[index] 
    
    def __len__(self):
        #completes sequence protocol, but not needed for iterable
        return len(self.words) 
    
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)

In [34]:
#sequence
a= Sentence("Mary had a little lamb whose fleece was white as snow.")
len(a), a[3], a

(11, 'little', Sentence('Mary had a l...hite as snow.'))

In [35]:
list(a)

['Mary',
 'had',
 'a',
 'little',
 'lamb',
 'whose',
 'fleece',
 'was',
 'white',
 'as',
 'snow.']

To iterate over an object `x`, Python automatically calls `iter(x)`. An iterable is something which, when `iter` is called on it, returns an iterator. `iter(x)` proceeds as follows:

(1) if `__iter__` is defined, it calls that to obtain an iterator.

(2) if not, but `__getitem__` is defined, it creates an iterator that fetches items with `__getitem__`, starting from index 0.

(3) otherwise, it raises `TypeError`.

Any Python sequence is iterable because it implements `__getitem__`. The standard sequences also implement `__iter__`; for future-proofing you should too, because (2) might be deprecated in a future version of Python.
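The three cases can be checked with a few throwaway classes. The class names here are made up for illustration.

```python
class OnlyGetitem:
    """No __iter__: iter() falls back to __getitem__ with 0, 1, 2, ..."""
    def __getitem__(self, index):
        if index >= 3:
            raise IndexError(index)   # signals the end of iteration
        return index * 10

class WithIter:
    """__iter__ is found first, so __getitem__ is never consulted."""
    def __iter__(self):
        return iter(["from", "__iter__"])
    def __getitem__(self, index):
        raise AssertionError("not used for iteration")

class Neither:
    pass

print(list(OnlyGetitem()))   # [0, 10, 20]
print(list(WithIter()))      # ['from', '__iter__']
try:
    iter(Neither())
except TypeError as exc:
    print("TypeError:", exc)
```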

This:

In [38]:
for i in a:
    print(i)

Mary
had
a
little
lamb
whose
fleece
was
white
as
snow.


is implemented something like this:

In [39]:
it = iter(a)
while True:
    try:
        nextval = next(it)
        print(nextval)
    except StopIteration:
        del it
        break

Mary
had
a
little
lamb
whose
fleece
was
white
as
snow.


`it` is an iterator. An iterator defines both `__iter__` and `__next__` (the first is only required to make sure an *iterator* IS an *iterable*). Notice that calling `next` on an iterator triggers a call to `__next__`.

In [41]:
it=iter(a)#an iterator defines `__iter__` and can thus be used as an iterable
for i in it:
    print(i)

Mary
had
a
little
lamb
whose
fleece
was
white
as
snow.


In [42]:
it = iter(a)
next(it), next(it), next(it)

('Mary', 'had', 'a')

So now we can completely abstract away a sequence in favor of an iterable (i.e., we don't need to support indexing anymore). From Fluent Python:

In [90]:
class SentenceIterator:
    def __init__(self, words): 
        self.words = words 
        self.index = 0
        
    def __next__(self): 
        try:
            word = self.words[self.index] 
        except IndexError:
            raise StopIteration() 
        self.index += 1
        return word 

    def __iter__(self):
        return self
    
class Sentence:#an iterable
    def __init__(self, text): 
        self.text = text
        self.words = text.split()
        
    def __iter__(self):
        return SentenceIterator(self.words)
    
    def __repr__(self):
        return 'Sentence(%s)' % reprlib.repr(self.text)

In [91]:
s2 = Sentence("While we could have implemented `__next__` in Sentence itself, making it an iterator, we will run into the problem of 'exhausting an iterator'.")

In [92]:
len(s2)

TypeError: object of type 'Sentence' has no len()

In [93]:
for i in s2:
    print(i)

While
we
could
have
implemented
`__next__`
in
Sentence
itself,
making
it
an
iterator,
we
will
run
into
the
problem
of
'exhausting
an
iterator'.


In [95]:
s2it=iter(s2)
print(next(s2it))
s2it2=iter(s2)
next(s2it),next(s2it2)

While


('we', 'While')

While we could have implemented `__next__` in `Sentence` itself, making it an iterator, we would run into the problem of "exhausting an iterator". The iterator above keeps state in `self.index`, and we must be able to start anew by creating a new instance if we want to re-iterate. Thus the `__iter__` in the iterable simply returns a new `SentenceIterator`.
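To see the exhaustion problem concretely, here is a hypothetical `OneShotSentence` that is its own iterator. It is a sketch of the design being argued against, not code from Fluent Python.

```python
class OneShotSentence:
    """Hypothetical variant: the sentence is its own iterator."""
    def __init__(self, text):
        self.words = text.split()
        self.index = 0

    def __iter__(self):
        return self          # no fresh state: every loop shares self.index

    def __next__(self):
        if self.index >= len(self.words):
            raise StopIteration
        word = self.words[self.index]
        self.index += 1
        return word

s = OneShotSentence("Mary had a little lamb")
first_pass = list(s)
second_pass = list(s)
print(first_pass)    # ['Mary', 'had', 'a', 'little', 'lamb']
print(second_pass)   # []
```

The second pass is empty because `self.index` was never reset. Handing out a separate iterator object per `iter()` call avoids this.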

In [97]:
min(s2), max(s2)

("'exhausting", 'will')

Note that `min` and `max` work even though we now DO NOT satisfy the sequence protocol, but rather the ITERABLE protocol, as they only need pairwise comparisons of successive items. The take-home message is that in programming with these iterators, these generalizations of pointers, we don't need either the length or indexing to implement many algorithms: we have abstracted these away.
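A rough sketch of how such algorithms can be written against the iterable protocol alone follows. `my_max` and `count_items` are illustrative names, and this is not how the built-ins are actually implemented.

```python
def my_max(iterable):
    """Maximum by pairwise comparison; needs neither len() nor indexing."""
    it = iter(iterable)
    try:
        largest = next(it)
    except StopIteration:
        raise ValueError("my_max() arg is an empty iterable") from None
    for item in it:
        if item > largest:
            largest = item
    return largest

def count_items(iterable):
    """Length recovered by walking the iterable once."""
    count = 0
    for _ in iterable:
        count += 1
    return count

words = "Mary had a little lamb".split()
# iter(words) is a bare iterator: it supports neither len() nor indexing.
print(my_max(iter(words)))        # little
print(count_items(iter(words)))   # 5
```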

### Back to Containers vs Flats

Last time we saw how Python lists contained references to integer ("digit") + metadata based structs on the heap. And above we just saw linked lists.

We call sequences that hold such "references" to objects on the heap **Container Sequences**. Examples of such container sequences are `list, tuple, collections.deque`.

There are collections in Python which hold their items in contiguous memory (which itself is allocated on the heap). We call these **Flat Sequences**. In such containers, the iteration and access to elements is done at the C level, not at the Python level, for reasons that will become clearer below. Such containers in Python 3 are:

`str, bytes, bytearray, memoryview, array.array`

You have probably extensively used one of these which is not mentioned yet: numpy's `ndarray`, created with `np.array`.

From Fluent Python:
> Container sequences hold references to the objects they contain, which may be of any type, while flat sequences physically store the value of each item within its own memory space, and not as distinct objects. Thus, flat sequences are more compact, but they are limited to holding primitive values like characters, bytes, and numbers.

This is also a more general way of thinking about data structures. 

>**Contiguously-allocated** structures are composed of single slabs of memory, and include arrays, matrices, heaps, and hash tables.

>**Linked** data structures are composed of distinct chunks of memory bound together by pointers, and include lists, trees, and graph adjacency lists.

(Steven S Skiena. The Algorithm Design Manual)

A critical advantage of something like a contiguous memory array is that indexing is a constant-time operation, as opposed to worst-case O(n), as we saw in linked lists. Other benefits include a tighter size and a locality of memory which benefits caching and general memory transport.
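The "tighter size" claim can be checked roughly with `sys.getsizeof`. This sketch assumes CPython, where `getsizeof` on a list counts only the references, so the float objects are added separately. Exact byte counts vary by platform and version.

```python
import sys
from array import array

n = 1000
values = [float(i) for i in range(n)]

container = list(values)        # holds n references to float objects
flat = array('d', values)       # holds n raw doubles in one buffer

# The list's own size counts only the references; the float objects are extra.
container_bytes = sys.getsizeof(container) + sum(sys.getsizeof(v) for v in container)
flat_bytes = sys.getsizeof(flat)

print(flat.itemsize * len(flat))      # raw payload: 8000 where a double is 8 bytes
print(container_bytes > flat_bytes)   # True on CPython
```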

### Mutable vs Immutable

A recurrent theme in this course is that of the mutability of objects. One can also study containers based on their mutability. **Mutable Sequences** in python 3 are:

`list, bytearray, array.array, collections.deque, memoryview`

whereas immutable sequences in Python 3 are

`tuple, str, bytes`
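A quick way to see the split is to try item assignment on each type. `try_set_first` is a hypothetical helper written for this sketch.

```python
from array import array
from collections import deque

def try_set_first(seq, value):
    """Return True if seq[0] = value succeeds, False if the type forbids it."""
    try:
        seq[0] = value
    except TypeError:
        return False
    return True

mutable = [
    try_set_first([1, 2], 9),
    try_set_first(bytearray(b"ab"), 120),
    try_set_first(array('d', [1.0]), 2.0),
    try_set_first(deque([1]), 9),
    try_set_first(memoryview(bytearray(b"ab")), 120),
]
immutable = [
    try_set_first((1, 2), 9),
    try_set_first("ab", "x"),
    try_set_first(b"ab", 120),
]
print(mutable)     # [True, True, True, True, True]
print(immutable)   # [False, False, False]
```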

Let's learn about some of these collections in Python.

### array.array

The list type is nice and very flexible, but if you need to store many (millions of) floating-point values, `array.array` is a better option. It stores just the bytes representing each value, so it's just like a contiguous C array in RAM, and also just like a numpy array.

`array.array` IS mutable, and you don't need to allocate ahead of time (reallocation is done for you as it grows).

The constructor is: 

`array(typecode [, initializer]) -- create a new array`


In [71]:
from array import array
from random import random
#generator expression instead of list comprehension
floats_aa=array('d', (random() for i in range(10**8)))

In [79]:
floats_aa.itemsize

8

In [72]:
type(floats_aa)

array.array

In [73]:
floats_aa[5]

0.2788214037543454

In [74]:
floats_list=[random() for i in range(10**8)]

Some behavior that you see might be unexpected

In [75]:
%%time
for f in floats_aa:
    pass

CPU times: user 3.68 s, sys: 435 ms, total: 4.12 s
Wall time: 4.19 s


In [76]:
%%time
for f in floats_list:
    pass

CPU times: user 4.05 s, sys: 3.61 s, total: 7.66 s
Wall time: 10.2 s


OK, so iterating over a regular Python list of 100 million floats only costs about double. Why would you use `array.array` then? And why is accessing floats in an `array.array` so slow? The answer to the latter is that with standard Python access, as in a `for` loop, each float is **boxed** by the Python runtime. What does this mean?

Remember the int-based structs we had earlier? In an `array.array`, or in `numpy` for that matter, when you "iterate" over the array and use the values you get, what Python does is take those 32 or 64 bits from memory, wrap them up into one of these structs, and hand it to you.
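Boxing can be observed from Python by checking object identity. The behavior shown is what CPython does; other implementations may differ, so treat this as an illustrative sketch.

```python
from array import array

floats_list = [0.5, 1.5, 2.5]
floats_aa = array('d', floats_list)

# A list stores references, so two lookups return the very same object.
print(floats_list[0] is floats_list[0])   # True

# An array stores raw doubles, so each lookup boxes the bits into a new float.
print(floats_aa[0] is floats_aa[0])       # False on CPython

# The values are equal even though the boxes are distinct objects.
print(floats_aa[0] == floats_list[0])     # True
```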

What it also means is that operations on `array.array` which can be done in C are fast, but access from Python is slow. But `array.array` exposes no complex operations that run under the hood, so its current use remains limited to reading really big pieces of data and then being OK with slower access to them (it's frugal). In numpy, by contrast, functionality defined at the C level is exposed and is fast (like dot products).

In [77]:
%%time
floats_aa[1000000]

CPU times: user 4 µs, sys: 4 µs, total: 8 µs
Wall time: 36 µs


0.9597984913557022

In [78]:
%%time
floats_list[1000000]

CPU times: user 5 µs, sys: 15 µs, total: 20 µs
Wall time: 21.9 µs


0.6974724338617144

If you want to do numerical work, use `numpy` arrays. But `array.array`s are still useful when a buffer needs to be schlepped between Python and C, for quick access to things. If you are using legacy C code which operates on these arrays, this is the way to do it fast. Otherwise, lists can indeed be faster because of this "boxing" penalty (see https://www.python.org/doc/essays/list2str/).
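As a small sketch of moving a buffer around without boxing every element, the standard library lets you copy the raw bytes out and back in, or share the memory through a `memoryview`. No real C code is involved here; the `memoryview` stands in for whatever consumer would read the buffer.

```python
from array import array

a = array('d', [1.0, 2.0, 3.0])

# tobytes()/frombytes() copy the raw machine values out and back in.
raw = a.tobytes()
b = array('d')
b.frombytes(raw)
print(len(raw) == a.itemsize * len(a), b == a)   # True True

# A memoryview shares the buffer instead of copying it.
with memoryview(a) as view:
    view[0] = 42.0            # writes straight into a's memory
print(a[0])                   # 42.0
```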

### Boxing and unboxing (or why is python slow)

(based on jakevdp's blog)


In [99]:
import ctypes
class IntStruct(ctypes.Structure):
    _fields_ = [("ob_refcnt", ctypes.c_long),
                           ("ob_type", ctypes.c_void_p),
                           ("ob_size", ctypes.c_ulong),
                           ("ob_digit", ctypes.c_long)]
    def __repr__(self):
        return ("IntStruct(ob_digit={self.ob_digit}, "
                           "refcount={self.ob_refcnt})").format(self=self)

In [100]:
num = 42
IntStruct.from_address(id(42))

IntStruct(ob_digit=42, refcount=52)

In [12]:
def test_add():
    a=1
    b=2
    c = a+b
    return c

In [98]:
import dis
dis.dis(test_add)

  2           0 LOAD_CONST               1 (1)
              3 STORE_FAST               0 (a)

  3           6 LOAD_CONST               2 (2)
              9 STORE_FAST               1 (b)

  4          12 LOAD_FAST                0 (a)
             15 LOAD_FAST                1 (b)
             18 BINARY_ADD
             19 STORE_FAST               2 (c)

  5          22 LOAD_FAST                2 (c)
             25 RETURN_VALUE


Python Addition

1. Assign 1 to a
    - 1a. Set a->PyObject_HEAD->typecode to integer
    - 1b. Set a->val = 1
2. Assign 2 to b
    - 2a. Set b->PyObject_HEAD->typecode to integer 
    - 2b. Set b->val = 2
3. call binary_add(a,b)
    - 3a. find typecode in a->PyObject_HEAD
    - 3b. a is an integer; value is a->val
    - 3c. find typecode in b->PyObject_HEAD
    - 3d. b is an integer; value is b->val
    - 3e. call `binary_add<int, int>(a->val, b->val)` 
    - 3f. result of this is result, and is an integer.
4. Create a Python object c
    - 4a. set c->PyObject_HEAD->typecode to integer
    - 4b. set c->val to result
    
So it's never a simple addition... there is all this machinery around it. And this is to add two ints, where `binary_add` is optimized in C. If these were user-defined classes, there would be additional overhead from the dunder methods for addition, which we will talk about later.
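The steps listed above can be mimicked with a toy model in pure Python. This is only an illustration of the bookkeeping: `Boxed`, `ADDERS`, and `add_int_int` are invented names, and CPython's real machinery lives in C and is considerably more involved.

```python
class Boxed:
    """Toy stand-in for a PyObject: a type tag plus a raw value."""
    def __init__(self, typecode, val):
        self.typecode = typecode
        self.val = val

def add_int_int(x, y):
    # Stands in for the fast, C-level addition on raw machine values.
    return x + y

# Dispatch table keyed on the operands' type tags (hypothetical).
ADDERS = {("int", "int"): add_int_int}

def binary_add(a, b):
    # 3a-3d: inspect both headers to learn what the operands are.
    key = (a.typecode, b.typecode)
    try:
        adder = ADDERS[key]
    except KeyError:
        raise TypeError("unsupported operand types: {} and {}".format(*key)) from None
    # 3e-3f: unbox, add the raw values, and note the result type.
    result = adder(a.val, b.val)
    # 4: box the result in a brand-new object.
    return Boxed("int", result)

a = Boxed("int", 1)       # step 1
b = Boxed("int", 2)       # step 2
c = binary_add(a, b)      # steps 3 and 4
print(c.typecode, c.val)  # int 3
```

Even in this toy version, one addition costs two tag lookups, a table lookup, and an allocation on top of the arithmetic itself.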