# Implementing a Set with a nested List


## Profiling tools

1. The first call to `%memit` gives a baseline for the memory the IPython session uses before we create any data structures (about 179 MiB in the run below):

In [26]:
%reload_ext memory_profiler
%memit

peak memory: 179.06 MiB, increment: 0.18 MiB


2. Import modules into main program:

In [27]:
import ipython_memory_usage.ipython_memory_usage as imu
import memory_profiler
import time
import timeit

3. Notice how difficult it is to make certain outputs stand out in the sea of outputs, notes, etc.?

In [28]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

printmd("**When there's a lot of noise, we can bold our output by including markdown in it**")

def printmd(string, color=None):
    colorstr = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(colorstr))

printmd("**We extended further to add color support. Bold... and blue...**", color="blue")

**When there's a lot of noise, we can bold our output by including markdown in it**

<span style='color:blue'>**We extended further to add color support. Bold... and blue...**</span>

## DynamicArraySet
I initially implemented it as a 1-D list of integers, then extended it to a 2-D list of lists containing integers, as described in the "Set" section of the `README.md`.

The list mimics a naive implementation of a **dynamic array** for educational purposes, providing **dynamic memory allocation**, **traversal**, **insertion** and **removal**. In reality a Python list is already a dynamic array, just like the Ruby `Array`, as opposed to a static array like in Java.

**This implementation limits the valid set of contained elements to integers**, as the integers are used to compute an index into an array of **buckets** (sub-arrays). Bucket lookup is O(1), then an O(n) scan, where n is the bucket's length, finds the correct element in the bucket.
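A minimal sketch of that two-step lookup, using a hypothetical standalone `bucket_index` helper and a hand-built `store` rather than the class itself:

```python
def bucket_index(value: int, size: int) -> int:
    """Map an integer to one of `size` buckets (hypothetical helper)."""
    return value % size


# Four buckets, filled by hand for illustration.
store = [[] for _ in range(4)]
for value in (3, 7, 8, 13):
    store[bucket_index(value, len(store))].append(value)

# 3 and 7 collide in bucket 3; 8 lands in bucket 0 and 13 in bucket 1.
print(store)  # [[8], [13], [], [3, 7]]

# Lookup: constant-time bucket selection, then a linear scan of that bucket.
found = 7 in store[bucket_index(7, len(store))]
print(found)  # True
```

Python's `%` always returns a non-negative result for a positive modulus, so negative integers still map to a valid bucket.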

In [29]:
class DynamicArraySet:
    """Alternate array-based Set implemented with a 2-D list.""" 

####  `__init__`
We use a **list comprehension** when assigning the structure: `[[] for _ in range(size)]`.
**This is the appropriate way to create a nested list.**

Creating it using `[obj] * size` is a **common pitfall**. The resulting list contains `size` *references* to the
same object, not copies of it, so mutating `obj` through one index changes what every other index sees. To avoid this behavior, **only do this if `obj` is immutable**. For example, if we were to preallocate the inner lists, we could do:

`[[None for _ in range(size)] for _ in range(size)]` or `[[None] * size for _ in range(size)]`

because `None` is immutable.

**Better yet, just use the comprehension throughout.**
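The difference between the two constructions is easy to see in a short, self-contained sketch (the variable names here are illustrative only):

```python
size = 3

# Pitfall: a single inner list, referenced three times.
aliased = [[]] * size
aliased[0].append(1)
print(aliased)  # [[1], [1], [1]]

# Comprehension: three distinct inner lists.
independent = [[] for _ in range(size)]
independent[0].append(1)
print(independent)  # [[1], [], []]

# `is` confirms whether two rows are the same object.
print(aliased[0] is aliased[1])          # True
print(independent[0] is independent[1])  # False
```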

In [30]:
class DynamicArraySet(DynamicArraySet):
    def __init__(self, size: int = 4) -> None:
        """Constructs a new DynamicArraySet instance.

        :param int size: The initial set size; can be thought of as the number
                         of buckets / sub-lists / rows / rank / space in memory.
                         Defaults to 4
        :return: None
        """
        self.n = 0

        # A size of 0 would leave no buckets and make `value % size` fail.
        if size < 1:
            raise ValueError("size should be a positive integer")

        self.store = [[] for _ in range(size)]

#### MutableSet ABC

The remainder of our magic methods are defined below with docstrings. According to Python's `MutableSet` ABC, in order for our class to provide a `MutableSet` interface it needs to define implementations for the following methods:
    
   1. **`__contains__`**: defines how we perform a membership test when using the `in` operator

   2. **`__iter__`**: defines how we iterate when using `for`

   3. **`__len__`**: defines length when we use `len`

   4. **`add`**: defines how we add an element

   5. **`discard`**: defines how we remove an element. Should not raise an exception if the element is absent

Then if we subclassed `MutableSet`, it would provide generic implementations of all the other methods that support a container API.

We're not going to subclass `MutableSet`, because this class only serves to demonstrate concepts; it isn't intended to be robust or used in production. The methods identified above, though, are a reminder of why `insert`, `include?` and `delete` are our minimal Set API as defined in the README.

Additionally, we implement:
   1. **`__repr__`**: the official representation of our class
   2. **`__str__`**: output of `print`ing our class
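For comparison, here is a minimal sketch of what subclassing `collections.abc.MutableSet` would give us for free. `ListBackedSet` is a hypothetical toy class backed by a flat list, not the implementation in this notebook; only the five abstract methods are written by hand, and the operators and helpers used at the bottom come from the ABC's mixins.

```python
from collections.abc import MutableSet


class ListBackedSet(MutableSet):
    """Hypothetical toy set backed by a flat list, for illustration only."""

    def __init__(self, iterable=()):
        self._items = []
        for item in iterable:
            self.add(item)

    def __contains__(self, value):
        return value in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, value):
        if value not in self._items:
            self._items.append(value)

    def discard(self, value):
        if value in self._items:
            self._items.remove(value)


a = ListBackedSet([1, 2, 3])
b = ListBackedSet([3, 4])

# Everything below is inherited from the ABC's mixin methods.
print(sorted(a | b))    # [1, 2, 3, 4]
print(sorted(a & b))    # [3]
print(sorted(a - b))    # [1, 2]
a.remove(1)             # mixin built on top of discard; raises KeyError if absent
print(a.isdisjoint(b))  # False
```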

In [31]:
class DynamicArraySet(DynamicArraySet):
    
    @property
    def size(self):
        """Readonly. The number of buckets."""
        return len(self)

    def contains(self, value: int) -> bool:
        """
        Check if the passed value exists in the set.
        """
        if self.n == 0 or not 0 <= value % self.size <= self.size - 1:
            return False
        return value in self.store[value % self.size]

    __contains__ = contains

    def __iter__(self) -> iter:
        """Makes our class iterable, ie. allows `[sub_list for sub_list in self]`"""
        return iter(self.store)
    
    def __len__(self) -> int:
        """Implements `len` on our class, allows `len(self)`"""
        return len(self.store)

    def __repr__(self) -> str:
        """
        The official representation of this class, accessed by `repr(self)`.
        The output matches the `__repr__` of Python's built-in `set`
        """
        comp = [str(e) for _ in self for e in _ if e is not None]
        return f'{{{", ".join(comp)}}}'
    
    def __str__(self) -> str:
        """Pretty print class representation, accessed by `print(self)`"""
        return f'{self.__class__.__name__}(size: {self.size}, count: {self.n}, store: {self.store})'

### `add`
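Place the value in the first empty (`None`) slot of its bucket, or append it if the bucket has no empty slots, then resize once the number of elements exceeds the number of buckets. Scanning for the first free slot is loosely modeled on the probing used in **open addressing**, although keeping colliding values together in a per-bucket list is closer to **separate chaining**.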

In [32]:
class DynamicArraySet(DynamicArraySet):
    def add(self, value: int) -> bool:
        """Add the value to the set if it does not already exist in the set"""
        if value not in self:
            sub_list = self.store[value % self.size]
            for idx, el in enumerate(sub_list):
                if el is None:
                    sub_list[idx] = value
                    self.n += 1
                    return True

            sub_list.append(value)
            self.n += 1

            if self.size < self.n:
                self._resize()

            return True
        return False
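For contrast with the bucket-based `add` above, here is a minimal sketch of linear probing, the simplest form of open addressing, where every value lives directly in one flat array and a collision moves on to the next slot. `probe_insert` and `probe_contains` are hypothetical helpers written for this illustration; they are not part of `DynamicArraySet`, and deletion is deliberately left out.

```python
def probe_insert(slots: list, value: int) -> bool:
    """Insert `value` with linear probing.

    Returns False if the value is already present or the table is full.
    """
    size = len(slots)
    start = value % size
    for step in range(size):
        idx = (start + step) % size
        if slots[idx] is None:
            slots[idx] = value
            return True
        if slots[idx] == value:
            return False
    return False


def probe_contains(slots: list, value: int) -> bool:
    """Follow the same probe sequence until the value or an empty slot is found."""
    size = len(slots)
    start = value % size
    for step in range(size):
        idx = (start + step) % size
        if slots[idx] is None:
            return False  # an empty slot ends the probe sequence
        if slots[idx] == value:
            return True
    return False


slots = [None] * 8  # safe to multiply: None is immutable
for v in (1, 9, 17, 2):
    probe_insert(slots, v)

# 1, 9 and 17 all hash to slot 1, so 9 and 17 spill into slots 2 and 3,
# which in turn pushes 2 from its home slot into slot 4.
print(slots)  # [None, 1, 9, 17, 2, None, None, None]
```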

### `discard`
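In a lower-level implementation the slot would not actually be deleted but set to `None`. That drops the reference to the value, leaving the object to be reclaimed once nothing else refers to it. Here we simply `del` the element from its bucket, which releases the reference in the same way.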

In [33]:
class DynamicArraySet(DynamicArraySet):

    def discard(self, value: int) -> bool:
        """Delete the value from the set if it exists"""
        if value in self:
            sub_list = self.store[value % self.size]
            for idx, el in enumerate(sub_list):
                if el == value:
                    del sub_list[idx]
                    self.n -= 1
                    return True
        return False

###  `_resize`

This method is meant to grow the set if there is not enough free space or shrink it if there is too much
space. We have only implemented growth, which is sufficient for this learning example.

**Steps:**

When a list of size `N` is first appended to, Python must:

1. Create a new list that is big enough to hold the original `N` items
   in addition to the extra one that is being appended.

2. Allocate `M` items, where `M` > `N`, in order to provide extra headroom
   for future appends

3. Copy the data from the old list to the new list

4. Destroy the old list

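The list allocation rule used below is the one CPython used for its lists in versions up to 3.8 (later versions use a slightly different formula). The terms after `N` are the extra headroom added on top of the requested size:

        M = N + (N >> 3) + (3 if N < 9 else 6)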

        N  0  1-4  5-8  9-16  17-25  26-35  36-46  ...  991-1120
        --------------------------------------------------------
        M  0  4    8    16    25     35     46     ...  1120
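The table can be reproduced with a short sketch. It assumes the formula's result is headroom added on top of `N` (which is what makes the numbers match the table); `over_allocate` is a hypothetical helper name.

```python
def over_allocate(n: int) -> int:
    """Capacity reserved for a list that has just grown to `n` items,
    following the growth rule quoted above."""
    if n == 0:
        return 0
    return n + (n >> 3) + (3 if n < 9 else 6)


capacities = []
capacity = 0
for n in range(1, 47):
    if n > capacity:  # the list is full, so it has to grow
        capacity = over_allocate(n)
        capacities.append(capacity)

print(capacities)          # [4, 8, 16, 25, 35, 46]
print(over_allocate(991))  # 1120
```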
        
**Entropy**: the minimum number of bits required, on average, to encode the outcomes of a random variable.

In [34]:
class DynamicArraySet(DynamicArraySet):

    def _resize(self):
        """Resize the underlying array"""
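        # Flatten every bucket into one list of the current elements.
        flat_list = [e for sl in self for e in sl if e is not None]
        # Extra headroom, following the list growth rule described above.
        alloc_memory = (self.n >> 3) + (3 if self.n < 9 else 6)

        # Allocate the larger bucket array; the elements are re-added below so
        # each one lands in the bucket that matches the new size.
        self.store = [[] for _ in range(self.size + alloc_memory + 1)]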
        self.n = 0
        for el in flat_list:
            self.add(el)

We also include some standard mathematical set operations:
1. **`difference`**
2. **`intersection`**
    - Defining `__and__` allows using `&` as an alias for `intersection`
3. **`union`**
    - Defining `__or__` allows using `|` as an alias for `union`

In [35]:
class DynamicArraySet(DynamicArraySet):
    
    def difference(self, s: iter) -> list:
        """Return a list of the elements present in this set but not in `s`"""
        return [e for _ in self for e in _ if e is not None and e not in s]

    def intersection(self, s2: iter) -> list:
        """Return a list of the elements common to this set and `s2`"""
        return [e for _ in self for e in _ if e is not None and e in s2]
    
    __and__ = intersection

    def union(self, s2: iter) -> list:
        """Return a list of all distinct elements present in either set"""
        flat_l = [e for _ in self for e in _ if e is not None]
        other_list = [e for e in s2 if e not in flat_l]
        flat_l.extend(other_list)
        return flat_l
    
    __or__ = union
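The aliasing pattern can be seen in isolation with a small stand-in class. `TinySet` is hypothetical and far simpler than `DynamicArraySet`; it only exists to show that assigning a method to `__and__` or `__or__` at class level makes the operator call that method.

```python
class TinySet:
    """Hypothetical stand-in that uses the same aliasing pattern."""

    def __init__(self, items):
        self.items = list(dict.fromkeys(items))  # drop duplicates, keep order

    def __iter__(self):
        return iter(self.items)

    def __contains__(self, value):
        return value in self.items

    def intersection(self, other):
        return [e for e in self if e in other]

    def union(self, other):
        return self.items + [e for e in other if e not in self]

    # Class-level aliases: `a & b` calls a.intersection(b), `a | b` calls a.union(b).
    __and__ = intersection
    __or__ = union


a = TinySet([1, 2, 3])
print(a & [2, 3, 4])  # [2, 3]
print(a | [3, 4])     # [1, 2, 3, 4]
```

The operators only work with the custom set on the left-hand side; supporting `[2, 3] & a` would additionally require `__rand__` and `__ror__`.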

## Profiling

### Memory

`ipython_memory_usage` reports memory usage deltas for every command you type. It helps you figure out which commands use a lot of RAM and take a long time to run, which is very useful when working with large matrices. It also reports the peak memory usage while a command is running, which might be higher (due to temporary objects) than the final RAM usage. It is built on `memory_profiler`.
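The gap between peak and final usage can be illustrated outside IPython with the standard library's `tracemalloc`. This is a different tool from `ipython_memory_usage`: it only tracks allocations made through Python's allocator, so treat it as a rough sketch of the idea rather than a reproduction of the numbers reported in this notebook.

```python
import tracemalloc

tracemalloc.start()

# A large temporary list raises the peak, then is freed before we measure.
temporary = list(range(200_000))
total = sum(temporary)
del temporary

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 2**20:.2f} MiB, peak: {peak / 2**20:.2f} MiB")
```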

In [36]:
imu.start_watching_memory()

In [36] used -28.1758 MiB RAM in 5953.72s, peaked 28.18 MiB above current, total RAM usage 179.45 MiB


<div class="alert alert-block alert-info">
    <p><b>MiB</b><span> - A mebibyte contains $1024^{2}$ bytes, so 1 MiB = 1,048,576 bytes.</span></p><br>
    <code>memory_profiler</code> reports in MiB, so an output of <code>0 MiB RAM</code> means our data structure used an amount of memory too small to register at that resolution; it would be better expressed in bytes.
    The same goes for a reported time of <code>0s</code>, which would be better expressed in ms.
 <pre>
 s1 = DynamicArraySet()
 for i in range(100):
     s1.add(i * 4)
 </pre>
</div>

In [37]:
printmd(f'RAM at start: {memory_profiler.memory_usage()[0]:0.1f}MiB', color="blue")
t1 = time.time()

s1 = DynamicArraySet()
for i in range(100):
    s1.add(i * 4)

printmd(f'Loading: {s1.n} elements', color="blue")
t2 = time.time()

printmd(f'RAM after creating list: {memory_profiler.memory_usage()[0]:0.1f}MiB, took {t2 - t1:0.1f}s', color="blue")

<span style='color:blue'>RAM at start: 179.6MiB</span>

<span style='color:blue'>Loading: 100 elements</span>

<span style='color:blue'>RAM after creating list: 179.6MiB, took 0.0s</span>

In [37] used 0.1094 MiB RAM in 0.32s, peaked 0.00 MiB above current, total RAM usage 179.56 MiB


<div class="alert alert-block alert-info">
    With a larger set of elements, MiB and s are more useful.
  <pre>
  s2 = DynamicArraySet()
  for i in range(1000000):
      s2.add(i * 4)
  </pre>
</div>

In [49]:
printmd(f'RAM at start: {memory_profiler.memory_usage()[0]:0.1f}MiB', color="blue")
t1 = time.time()

s2 = DynamicArraySet()
for i in range(1000000):
    s2.add(i * 4)

printmd(f'Loading: {s2.n} elements', color="blue")
t2 = time.time()

printmd(f'RAM after creating list: {memory_profiler.memory_usage()[0]:0.1f}MiB, took {t2 - t1:0.1f}s', color="blue")

<span style='color:blue'>RAM at start: 15.5MiB</span>

<span style='color:blue'>Loading: 1000000 elements</span>

<span style='color:blue'>RAM after creating list: 193.9MiB, took 32.3s</span>

/usr/local/lib/python3.9/site-packages/ipython_memory_usage/ipython_memory_usage.py SOMETHING WEIRD HAPPENED AND THIS RAN FOR TOO LONG, THIS THREAD IS KILLING ITSELF


In [39]:
imu.stop_watching_memory()

IPython's `%%capture` cell magic runs the cell and captures `stdout`, `stderr`, and IPython's rich `display()` calls. Give it a variable name to keep a reference to the captured output so it can be displayed later; otherwise the captured output is discarded. I used it to suppress my `print` output, which is too long and cannot be hidden in static environments such as GitHub.

The captured output can be replayed later by calling the variable with `()`. For example, to show the print output that was captured as `cap_s`, we execute `cap_s()`.

In [40]:
%%capture cap_s
print(s1, repr(s1), sep='\n\n')

In [41]:
cap_s()

In [42]:
%%capture cap_s2
print(s2)

## `in` membership test runs in constant time O(1):

In [43]:
time_cost = sum(timeit.repeat(stmt="15 in s1",
                              setup="from __main__ import s1",
                              number=1,
                              repeat=10000))
printmd(f'Size of s1: {s1.size}, Count of elements: {s1.n}, Summed time to look up 15 is {time_cost:0.4f}s', color="blue") 

<span style='color:blue'>Size of s1: 106, Count of elements: 100, Summed time to look up 15 is 0.0131s</span>

In [44]:
time_cost2 = sum(timeit.repeat(stmt="15 in s2",
                              setup="from __main__ import s2",
                              number=1,
                              repeat=10000))
printmd(f'Size of s2: {s2.size}, Count of elements: {s2.n}, Summed time to look up 15 is {time_cost2:0.4f}s', color="blue")

<span style='color:blue'>Size of s2: 1087175, Count of elements: 1000000, Summed time to look up 15 is 0.0149s</span>
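One way to see why the two timings are so close is to look at how full the buckets get. The sketch below recomputes bucket occupancy directly, without the class, using the element counts and bucket counts reported above; `longest_bucket` is a hypothetical helper.

```python
from collections import Counter


def longest_bucket(count: int, size: int) -> int:
    """Length of the fullest bucket after adding 0, 4, 8, ... (`count` values)
    to `size` buckets chosen by `value % size`."""
    occupancy = Counter((i * 4) % size for i in range(count))
    return max(occupancy.values())


# Element counts and bucket counts taken from the two runs above.
print(longest_bucket(100, 106))              # 2
print(longest_bucket(1_000_000, 1_087_175))  # 1
```

Because the set resizes whenever the element count exceeds the bucket count, the scan inside a bucket stays short no matter how many elements are stored.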

## DynamicHashSet

We don't want to be limited to integers, so we subclass the `DynamicArraySet` to implement a more specific version that uses **hashing**. With this data type **we can now store keys of any immutable (hashable) type**.
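A minimal sketch of how `hash()` turns arbitrary hashable keys into bucket indices. The keys and the `size` value are illustrative only; string hashes are randomized per process, so the actual index for `"Lorem"` varies between runs.

```python
size = 8

# Any hashable value maps to a valid bucket index.
indices = []
for key in ("Lorem", 3.5, (1, 2), 15):
    indices.append(hash(key) % size)

# `%` with a positive modulus is never negative, even when hash() is.
print(all(0 <= idx < size for idx in indices))  # True

# Small integers hash to themselves, so DynamicArraySet's `value % size`
# is a special case of `hash(value) % size`.
print(hash(15) == 15)  # True

# Mutable built-ins such as lists are not hashable and cannot be stored.
try:
    hash([1, 2])
except TypeError as exc:
    message = str(exc)
    print(message)  # unhashable type: 'list'
```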

In [45]:
from typing import Hashable


class DynamicHashSet(DynamicArraySet):

    def contains(self, value: Hashable) -> bool:
        if self.n == 0 or not 0 <= hash(value) % self.size <= self.size - 1:
            return False
        return value in self.store[hash(value) % self.size]

    __contains__ = contains

    def add(self, value: Hashable) -> bool:
        if value not in self:
            sub_list = self.store[hash(value) % self.size]
            for idx, el in enumerate(sub_list):
                if el is None:
                    sub_list[idx] = value
                    self.n += 1
                    return True

            sub_list.append(value)
            self.n += 1

            if self.size < self.n:
                self._resize()

            return True
        return False

    def discard(self, value: Hashable) -> bool:
        if value in self:
            sub_list = self.store[hash(value) % self.size]
            for idx, el in enumerate(sub_list):
                if el == value:
                    del sub_list[idx]
                    self.n -= 1
                    return True
        return False

<div class="alert alert-block alert-info">
We instantiate a <code>DynamicHashSet</code> instance now so we can see it's the same song and dance.
  <pre>
  text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
  word_list = text.split()
  hs1 = DynamicHashSet()
  for word in word_list:
      hs1.add(word)
      for char in word:
          hs1.add(char)
  </pre>
</div>

In [46]:
imu.start_watching_memory()
printmd(f'RAM at start: {memory_profiler.memory_usage()[0]:0.1f}MiB', color="blue")
t1 = time.time()

# Named `text` rather than `str` so the built-in `str` (used by `__repr__`) is not shadowed.
text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.'
word_list = text.split()
hs1 = DynamicHashSet()
for word in word_list:
    hs1.add(word)
    for char in word:
        hs1.add(char)

t2 = time.time()
printmd(f'Loading: {hs1.n} elements', color="blue")
printmd(f'RAM after creating list: {memory_profiler.memory_usage()[0]:0.1f}MiB, took {t2 - t1:0.1f}s', color="blue")



<span style='color:blue'>RAM at start: 227.2MiB</span>

<span style='color:blue'>Loading: 26 elements</span>

<span style='color:blue'>RAM after creating list: 227.2MiB, took 0.0s</span>

In [46] used 4.0273 MiB RAM in 0.81s, peaked 0.00 MiB above current, total RAM usage 227.21 MiB


In [47]:
imu.stop_watching_memory()

<div class="alert alert-block alert-info">
    We can use the set computation methods we implemented on <code>DynamicArraySet</code> to check that the results of the for loop were as expected.
</div>

In [48]:
printmd(f'Elements in \'hs1\' not in \'word_list\': {hs1.difference(word_list)}', color="blue")
printmd(f'Elements in \'word_list\' not in \'hs1\': {set(word_list).difference([e for _ in hs1.store for e in _ if e is not None])}', color="blue")
printmd(f'Count of unique words + unique chars in \'str\': {len(set(word_list + [char for word in word_list for char in word if char]))}', color="blue")
printmd(f'Length of \'hs1\': {hs1.n}', color="blue")

<span style='color:blue'>Elements in 'hs1' not in 'word_list': ['o', 'l', 'L', 't', 'i', 'a', 'g', 'n', 'p', 'd', 'u', 'r', ',', 'm', 's', 'c', '.', 'e']</span>

<span style='color:blue'>Elements in 'word_list' not in 'hs1': set()</span>

<span style='color:blue'>Count of unique words + unique chars in 'str': 26</span>

<span style='color:blue'>Length of 'hs1': 26</span>