
## Software design and Binary Search

### Linear Search

In [1]:
def linear_search(da_array, needle):
    for i, item in enumerate(da_array):
        if item==needle:
            return i
    return -1

In [2]:
a=[1, 7, 4, 5, 63, 4, 35, 32, 21]
print(linear_search(a, 4))
print(linear_search(a, 6))

2
-1


This takes O(n) storage if you count the list itself, and O(n) time, as we saw last time.

This is the algorithm that `list.index` uses.
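The algorithm is the same, but the interface differs: `list.index` raises `ValueError` for a missing value instead of returning -1. A minimal sketch of bridging the two conventions, where the wrapper name `index_or_minus_one` is hypothetical:

```python
def index_or_minus_one(items, needle):
    """Wrap list.index so a missing value gives -1, like linear_search above."""
    try:
        return items.index(needle)
    except ValueError:
        # list.index signals "not found" with an exception, not a sentinel value
        return -1

a = [1, 7, 4, 5, 63, 4, 35, 32, 21]
print(index_or_minus_one(a, 4))   # index of the first 4
print(index_or_minus_one(a, 6))   # 6 is absent
```

Which convention is better is itself a design decision, and one we come back to when documenting what the search returns.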

In [55]:
l2 = range(10000000)

In [64]:
%%time
linear_search(l2, 10)

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 8.11 µs


10

In [65]:
%%time
linear_search(l2, 1000)

CPU times: user 96 µs, sys: 0 ns, total: 96 µs
Wall time: 98.9 µs


1000

In [66]:
%%time
linear_search(l2, 10000)

CPU times: user 985 µs, sys: 0 ns, total: 985 µs
Wall time: 989 µs


10000

### Binary Search

How can we increase this speed? If we are given sorted data we can use **binary search** to do this in O(lg(n)) time.

The idea is simple: look at the middle of the list and compare it with the desired value. If the desired value is higher, it must be on the right side, so we look there, and so on. This is an example of what is called a divide and conquer algorithm.

Animation: http://www.cs.armstrong.edu/liang/animation/web/BinarySearch.html

In [1]:
#there are problems in this code. But, later
def binary_search_simple(da_array, needle):
    rangemin = 0
    rangemax = len(da_array) - 1
    while True:
        midpoint = (rangemin+rangemax)//2 #whats the problem with this
        if da_array[midpoint] > needle:#lower part
            rangemax = midpoint - 1
        elif da_array[midpoint] < needle:
            rangemin = midpoint + 1
        else:
            return midpoint

In [69]:
%%time
binary_search_simple(l2, 10000)

CPU times: user 12 µs, sys: 1 µs, total: 13 µs
Wall time: 16 µs


10000

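The running time of binary search can be understood by setting up a simple recurrence: each step does a constant amount of work c and then continues on half of the array.

T(n) = T(n/2) + c

Unrolling this about lg(n) times, until the range is exhausted, gives

T(n) = lg(n)*c + d 

which means that binary search is O(lg(n)) in time. What about space? Once the sorted array is allocated, no more allocation is needed: we just narrow down the range within the array.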
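To see the logarithm in practice, here is a small sketch that counts how many midpoints get examined. The helper `count_probes` is hypothetical and searches `range(n)` implicitly (the value at index `i` is `i`), so no large list has to be built. Searching for `n`, which is larger than every element, forces the longest path.

```python
import math

def count_probes(n, needle):
    """Binary search over the values 0..n-1, returning the number of midpoints examined."""
    lo, hi, probes = 0, n - 1, 0
    while lo <= hi:
        mid = lo + (hi - lo) // 2
        probes += 1
        if mid < needle:      # the value stored at index mid is mid itself
            lo = mid + 1
        elif mid > needle:
            hi = mid - 1
        else:
            break
    return probes

for n in (10, 1000, 10**6):
    # a miss on the far right, compared with floor(lg(n)) + 1
    print(n, count_probes(n, n), math.floor(math.log2(n)) + 1)
```

The two numbers in each row agree, which matches the `lg(n)*c + d` form of the solution.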

The best case performance is O(1), which happens when the value we are looking for sits at the first midpoint we probe (the median of the sorted array).

Binary search can be used in many different places. Examples:

- finding square roots with the bisection method, starting with l = 1 and r = n
- this generalizes to finding a root of any continuous function f, given l and r where f(l) < 0 and f(r) > 0
- we can use `git bisect` to find the commit that introduced a bug
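The first item can be sketched in a few lines. This is an illustration under stated assumptions, not a production routine: the name `bisect_sqrt`, the tolerance, and the cap of 200 iterations are all choices made here, and the sketch assumes n >= 1 so that the root lies in [1, n].

```python
def bisect_sqrt(n, tol=1e-10):
    """Approximate sqrt(n) for n >= 1 by bisection on f(x) = x*x - n."""
    if n < 1:
        raise ValueError("this sketch assumes n >= 1 so the root lies in [1, n]")
    left, right = 1.0, float(n)
    for _ in range(200):               # hard cap so the loop always terminates
        if right - left <= tol:
            break
        mid = left + (right - left) / 2
        if mid * mid > n:
            right = mid                # root is in the lower half
        else:
            left = mid                 # root is in the upper half (or exactly at mid)
    return left + (right - left) / 2

print(bisect_sqrt(2))
print(bisect_sqrt(144))
```

The structure is the same as binary search on an array: keep a bracket that is known to contain the answer and halve it on every step.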

There are problems with the previous implementation. What if the value we are searching for is not in the array? How do we terminate?

In [18]:
#there are problems in this code, but later
def binary_search(da_array, needle):
    rangemin = 0
    rangemax = len(da_array) - 1
    tries=0
    while True:
        print("at top", rangemin, rangemax)
        if tries > len(da_array):
            print("No Success")
            break
        midpoint = (rangemin+rangemax)//2 #whats the problem with this
        print(rangemin, midpoint, rangemax, "|", da_array[midpoint], needle)
        if da_array[midpoint] > needle:#lower part
            rangemax = midpoint - 1
        elif da_array[midpoint] < needle:
            rangemin = midpoint + 1
        else:
            return midpoint
        tries += 1

In [19]:
input=list(range(10))
binary_search(input,4)

at top 0 9
0 4 9 | 4 4


4

In [20]:
binary_search(input,4.5)        

at top 0 9
0 4 9 | 4 4.5
at top 5 9
5 7 9 | 7 4.5
at top 5 6
5 5 6 | 5 4.5
at top 5 4
5 4 4 | 4 4.5
at top 5 4
5 4 4 | 4 4.5
at top 5 4
5 4 4 | 4 4.5
at top 5 4
5 4 4 | 4 4.5
at top 5 4
5 4 4 | 4 4.5
at top 5 4
5 4 4 | 4 4.5
at top 5 4
5 4 4 | 4 4.5
at top 5 4
5 4 4 | 4 4.5
at top 5 4
No Success


In [21]:
def binary_search(da_array, needle):
    rangemin = 0
    rangemax = len(da_array) - 1
    tries=0
    while True:
        print("at top", rangemin, rangemax)
        if tries > len(da_array):
            print("No Success")
            break
        if rangemin > rangemax:
            return -1
        midpoint = (rangemin+rangemax)//2 #whats the problem with this
        print(rangemin, midpoint, rangemax, "|", da_array[midpoint], needle)
        if da_array[midpoint] > needle:#lower part
            rangemax = midpoint - 1
        elif da_array[midpoint] < needle:
            rangemin = midpoint + 1
        else:
            return midpoint
        tries += 1

In [22]:
binary_search(input,4.5)        

at top 0 9
0 4 9 | 4 4.5
at top 5 9
5 7 9 | 7 4.5
at top 5 6
5 5 6 | 5 4.5
at top 5 4


-1

### Issues that the implementation raises

- What if the value is not there?
- What if rangemax is so high as to create overflow?
- What if the array is not sorted?
- What are we returning (why the -1)?
- Are we consistent in what we return?
- What if da_array was not an array?
- What types are we supporting?
- What happens if we have a NaN in our array? Infinity?
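The last question is easy to explore directly. A small sketch of how NaN and infinity behave under the comparisons that binary search relies on (the variable names are just for illustration):

```python
import math

nan = float("nan")

# Every ordering comparison involving NaN is False, and NaN is not equal to itself.
nan_checks = (nan < 1.0, nan > 1.0, nan == nan)
print(nan_checks)

# Infinity, by contrast, orders like any other float.
inf_checks = (1.0 < math.inf, -math.inf < 1.0)
print(inf_checks)

# So a list containing NaN cannot satisfy "if i < j then a[i] <= a[j]".
with_nan = [1.0, nan, 2.0]
pairs_ok = [with_nan[i] <= with_nan[i + 1] for i in range(len(with_nan) - 1)]
print(pairs_ok)
```

Since both `>` and `<` fail against NaN, a search that lands on a NaN falls through to the "equal" branch, which is clearly not what we want.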

#### One should play with all of one's concerns

Here is an example

In [43]:
list(enumerate(a))

[(0, 1), (1, 7), (2, 4), (3, 5), (4, 63), (5, 4), (6, 35), (7, 32), (8, 21)]

In [44]:
binary_search(a, 32)#would go on infinitely but we have to terminate

4 0 9
2 0 4
3 3 4


-1

In [45]:
a2=sorted(a)
a2

[1, 4, 4, 5, 7, 21, 32, 35, 63]

In [46]:
binary_search(a2,32)

4 0 9
7 5 9
6 5 7


6

### Software Design

Desirables:

- Documentation
    - names (understandable names)
    - pre+post conditions or requirements
    
- Maintainability
    - Extensibility
    - Modularity and Encapsulation
    
- Portability
- Installability

- Generality
    - Data Abstraction (change types, change data structures)
    - Functional Abstraction (the object model, overloading)
    - Robustness
        - Provability: Invariants, preconditions, postconditions
        - User Proofing, Adversarial Inputs
        
- Efficiency
    - Use of appropriate algorithms and data structures
    - Optimization (but no premature optimization)

### A Document for binary search

- That we are returning -1 when the value is not in the array (we could have used None or False instead)
- That we are returning the index, not the value
- The precondition that we have a sorted array

```
@pre da_array is sorted
@post for return value idx, da_array[idx]==needle if needle in da_array else -1
```

This means that whatever types you support in da_array must support < and ==. With that, we can state the precondition more precisely:

```
@pre if i < j then da_array[i] <= da_array[j] 
@post for return value idx, da_array[idx]==needle if needle in da_array else -1
```

Pre-conditions and post-conditions are a contract: you don't violate the contract as a user of my search, and I don't violate it as a library writer.

**Documentation is a contract between a user (client) and an implementor (library writer).**
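One way to make the contract tangible during development is to check both sides of it around every call. The decorator below is a hypothetical helper, not part of the lecture's code, and both checks are O(n), so it is only meant for testing:

```python
def check_contract(search):
    """Wrap a search function so each call verifies the pre- and post-conditions."""
    def wrapped(da_array, needle):
        # @pre: if i < j then da_array[i] <= da_array[j]  (the user's side)
        assert all(da_array[i] <= da_array[i + 1]
                   for i in range(len(da_array) - 1)), "precondition: not sorted"
        idx = search(da_array, needle)
        # @post: a valid index holding needle, or -1  (the implementor's side)
        if needle in da_array:
            assert 0 <= idx < len(da_array) and da_array[idx] == needle, \
                "postcondition: wrong index"
        else:
            assert idx == -1, "postcondition: expected -1"
        return idx
    return wrapped

@check_contract
def linear_search(da_array, needle):
    for i, item in enumerate(da_array):
        if item == needle:
            return i
    return -1

print(linear_search([1, 4, 4, 5, 7], 5))
```

Because the wrapper only looks at inputs and outputs, any implementation that honours the same contract can be swapped in underneath it.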

#### Even Better Document

We must make ascending order clear. And we also want a guarantee on the runtime (the implementor could have written a linear search):

```
Searches an immutable array for a value using binary search
@param[in] da_array: The immutable array to search
@param[in] needle: The value to search for
@pre da_array: The array is sorted in ascending order: for all i,j if i<j, da_array[i]<=da_array[j]
Operates in O(lg(n)) ops
@returns idx such that da_array[idx]==needle or (idx==-1 and there is no i such that da_array[i]==needle)
```

What happens if multiple elements in the array have the value `needle`? We could constrain this further, but won't.
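If we did want to constrain it, one option is to promise the smallest matching index. A sketch using the standard library's `bisect.bisect_left`, with the wrapper name `leftmost_index` being hypothetical:

```python
from bisect import bisect_left

def leftmost_index(da_array, needle):
    """Return the smallest index holding needle, or -1 if it is absent."""
    i = bisect_left(da_array, needle)   # first position where needle could be inserted
    if i < len(da_array) and da_array[i] == needle:
        return i
    return -1

a2 = [1, 4, 4, 5, 7, 21, 32, 35, 63]
print(leftmost_index(a2, 4))    # the first of the two 4s
print(leftmost_index(a2, 6))    # 6 is absent
```

A stronger post-condition like this is more useful to the client but gives the implementor less freedom, which is why we leave it out of our contract.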

### Invariants

An invariant is something that is true at some point in the code.

Invariants and the contract are what we use to guide our implementation.

Pre-conditions and post-conditions are special cases of invariants.

**Pre's are true at function entry. They constrain the user.
Post's are true at function exit. They constrain the implementation.**

You can change implementations, stuff under the hood, etc., but once the software is in the wild **you can't change these**, since the client is depending upon them.




### Documentation

But this is not written in any coherent way which might help a client of our code use it. For that we ought to be systematically documenting our code. You can choose any convention you like, but the important thing is to be consistent in your codebase with it. Let's see how to write `binary_search` in a well documented format, which makes these invariants, along with the inputs and outputs of the code, clear.

Here I'll use the numpy conventions: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt .



In [1]:
def binary_search(da_array: list, needle, left:int=0, right:int=-1) -> int:
    """
    An algorithm that operates in O(lg(n)) to search for an element
    in an array sorted in ascending order.
    
    Parameters
    ----------
    da_array : list
        a list of items sorted in non-descending order
    needle: an item to find in the array; it may or may not
        be in the array
    left: int, optional
        the left index in the array to search from. Default 0
    right: int, optional
        the right index in the array to search to. Default is -1
        in which case we will use the end of the array `len(da_array) - 1`
        
    Returns
    -------
    index: int
        an integer representing the index of `needle` if found, and -1
        otherwise
        
    Notes
    -----
    PRE: `da_array` is sorted in non-decreasing order
    POST: 
        - `da_array` is not changed by this function
        - returns `index`=-1 if `needle` is not in `da_array`
        - returns an int `index` in [0:len(da_array)] if
          `needle` is in `da_array`
    INVARIANTS:
        - If `needle` in `da_array`, needle in `da_array[rangemin:rangemax+1]`
          is a loop invariant in the while loop below.
    
    WARNINGS:
        - If you provide an unsorted array this function is not guaranteed to terminate
        - for multiple copies of a value in the array searched for, the one returned
          is not guaranteed to be the smallest one.
    """
    if left==0:
        rangemin = 0
    else:
        rangemin = left
    if right==-1:
        rangemax=len(da_array) - 1
    else:
        rangemax=right
    while True:
        "needle in da_array => needle in da_array[rangemin:rangemax+1]"   
        if rangemin > rangemax:
            index = -1
            return index
        midpoint = rangemin + (rangemax - rangemin)//2 
        if da_array[midpoint] > needle:#lower part
            rangemax = midpoint - 1
        elif da_array[midpoint] < needle:
            rangemin = midpoint + 1
        else:
            index = midpoint
            return index

In [2]:
binary_search.__annotations__

{'da_array': list, 'left': int, 'return': int, 'right': int}

In [6]:
from pydoc import doc as pydoc
from doctest import run_docstring_examples as dtest

In [7]:
pydoc(binary_search)

Python Library Documentation: function binary_search in module __main__

binary_search(da_array:list, needle, left:int=0, right:int=-1) -> int
    An algorithm that operates in O(lg(n)) to search for an element
    in an array sorted in ascending order.
    
    Parameters
    ----------
    da_array : list
        a list of items sorted in non-descending order
    needle: an item to find in the array; it may or may not
        be in the array
    left: int, optional
        the left index in the array to search from. Default 0
    right: int, optional
        the right index in the array to search to. Default is -1
        in which case we will use the end of the array `len(da_array) - 1`
        
    Returns
    -------
    index: int
        an integer representing the index of `needle` if found, and -1
        otherwise
        
    Notes
    -----
    PRE: `da_array` is sorted in non-decreasing order
    POST: 
        - `da_array` is not changed by this function
        - returns

### Assertions

In [5]:
test_list=[1,3,4,5,6,8,9]
assert binary_search(test_list,4)==3, "this is not true"

AssertionError: this is not true

### Testing and Doctests

While developing our algorithm, it is not a bad idea to use assert to make sure that our loop invariant is true. Such asserts could help us catch bugs. But because the membership test is linear in the size of the array, you will want to remove it afterwards.

Similarly you might want to check that the sortedness precondition is met, so that mistakes in testing data don't send your code into an infinite loop. Note that this changes the performance characteristics, so this should not be in the final code.
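A development-only version with the loop invariant asserted might look like the sketch below. The function name `binary_search_debug` is hypothetical, and the membership tests make every iteration O(n). Python skips `assert` statements when run with the `-O` flag, but removing them from the final code is still the clearer choice.

```python
def binary_search_debug(da_array, needle):
    """Development version: checks the loop invariant on every iteration."""
    rangemin, rangemax = 0, len(da_array) - 1
    while rangemin <= rangemax:
        # Loop invariant: if needle is present at all, it is inside the live window.
        assert (needle not in da_array
                or needle in da_array[rangemin:rangemax + 1]), "invariant broken"
        midpoint = rangemin + (rangemax - rangemin) // 2
        if da_array[midpoint] > needle:
            rangemax = midpoint - 1
        elif da_array[midpoint] < needle:
            rangemin = midpoint + 1
        else:
            return midpoint
    return -1

print(binary_search_debug([1, 4, 4, 5, 7, 21, 32, 35, 63], 32))
```

On the unsorted list `a` from earlier, a search for 32 trips the assertion on the second iteration, which points at the bad test data instead of silently returning -1.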

The post-conditions are what the clients of your code see. These are typically handled as **Unit Tests**, of which doctests are a special kind which document the interface of the function by example.

In [9]:
def binary_search(da_array: list, needle, left:int=0, right:int=-1) -> int:
    """
    An algorithm that operates in O(lg(n)) to search for an element
    in an array sorted in ascending order.
    
    Parameters
    ----------
    da_array : list
        a list of "comparable" items sorted in non-descending order
    needle: an item to find in the array; it may or may not
        be in the array
    left: int, optional
        the left index in the array to search from. Default 0
    right: int, optional
        the right index in the array to search to. Default is -1
        in which case we will use the end of the array `len(da_array) - 1`
        
    Returns
    -------
    index: int
        an integer representing the index of `needle` if found, and -1
        otherwise
        
    Notes
    -----
    PRE: `da_array` is sorted in non-decreasing order (thus items in
        `da_array` must be comparable: implement < and ==)
    POST: 
        - `da_array` is not changed by this function (immutable)
        - returns `index`=-1 if `needle` is not in `da_array`
        - returns an int `index` in [0:len(da_array)] if
          `needle` is in `da_array`
    INVARIANTS:
        - If `needle` in `da_array`, needle in `da_array[rangemin:rangemax+1]`
          is a loop invariant in the while loop below.
    WARNINGS:
        - If you provide an unsorted array this function is not guaranteed to terminate
        - for multiple copies of a value in the array searched for, the one returned
          is not guaranteed to be the smallest one.
        
    Examples
    --------
    >>> input = list(range(10))
    >>> binary_search(input, 5)
    5
    >>> binary_search(input, 4.5)
    -1
    >>> binary_search(input, 10)
    -1
    >>> binary_search([5], 5)
    0
    >>> binary_search([5], 4)
    -1
    >>> import numpy as np
    >>> binary_search([1,2,np.inf], 2)
    1
    >>> binary_search([1,2,np.inf], np.inf)
    2
    """
    if left==0:
        rangemin = 0
    else:
        rangemin = left
    if right==-1:
        rangemax=len(da_array) - 1
    else:
        rangemax=right
    while True:
        "needle in da_array => needle in da_array[rangemin:rangemax+1]"   
        if rangemin > rangemax:
            index = -1
            return index
        #If rangemin and rangemax are both very high we do not want overflow,
        #so get the midpoint like this:
        midpoint = rangemin + (rangemax - rangemin)//2
        if da_array[midpoint] > needle:#lower part
            rangemax = midpoint - 1
        elif da_array[midpoint] < needle:
            rangemin = midpoint + 1
        else:
            index = midpoint
            return index

In [10]:
dtest(binary_search, globals(), verbose=True)

Finding tests in NoName
Trying:
    input = list(range(10))
Expecting nothing
ok
Trying:
    binary_search(input, 5)
Expecting:
    5
ok
Trying:
    binary_search(input, 4.5)
Expecting:
    -1
ok
Trying:
    binary_search(input, 10)
Expecting:
    -1
ok
Trying:
    binary_search([5], 5)
Expecting:
    0
ok
Trying:
    binary_search([5], 4)
Expecting:
    -1
ok
Trying:
    import numpy as np
Expecting nothing
ok
Trying:
    binary_search([1,2,np.inf], 2)
Expecting:
    1
ok
Trying:
    binary_search([1,2,np.inf], np.inf)
Expecting:
    2
ok


### What's happened to our issues from before?

- What if the value is not in the array? What if it is there multiple times? What are we returning (why the -1)? Are we consistent in what we return?

We return -1 if the element is not in the array. If it is there multiple times, we will return the index of one of them: it is not defined which. We are consistent by always returning an int, choosing one which cannot be a valid non-negative index. (In Python `da_array[-1]` is still legal, so callers must test for -1 before indexing.)

- What if rangemax is so high as to create overflow?

We fixed it by computing the midpoint from the difference, `rangemin + (rangemax - rangemin)//2`, and have documented it in the code.
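Python's own integers have arbitrary precision, so `(rangemin+rangemax)//2` cannot actually overflow here; the concern is real in languages with fixed-width integers. The sketch below simulates 32-bit signed addition by hand (the helper `int32_add` is hypothetical) to show what goes wrong there:

```python
def int32_add(x, y):
    """Add two ints the way a 32-bit signed register would, wrapping on overflow."""
    return (x + y + 2**31) % 2**32 - 2**31

rangemin, rangemax = 2000000000, 2100000000    # each fits in a signed 32-bit int

naive = int32_add(rangemin, rangemax) // 2      # the sum wraps to a negative number
safe = rangemin + (rangemax - rangemin) // 2    # no intermediate value exceeds rangemax

print(naive)
print(safe)
```

The naive midpoint comes out negative, while the difference form stays between `rangemin` and `rangemax`.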


- What types are we supporting?

It seems that as long as we have a notion of equals `==`, and a notion of `<` to support sorting we are set. We should document this.
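As a quick check of that claim, here is a sketch with a hypothetical `Version` class that defines only `==` and `<`. The compact `search` function is a stand-in written for this example; it uses only `<`, treating "neither is smaller" as equality, which is what a sorted order lets us assume.

```python
class Version:
    """Hypothetical comparable type: only == and < are defined."""
    def __init__(self, major, minor):
        self.major, self.minor = major, minor
    def __eq__(self, other):
        return (self.major, self.minor) == (other.major, other.minor)
    def __lt__(self, other):
        return (self.major, self.minor) < (other.major, other.minor)

def search(da_array, needle):
    rangemin, rangemax = 0, len(da_array) - 1
    while rangemin <= rangemax:
        midpoint = rangemin + (rangemax - rangemin) // 2
        if needle < da_array[midpoint]:        # needle is in the lower part
            rangemax = midpoint - 1
        elif da_array[midpoint] < needle:      # needle is in the upper part
            rangemin = midpoint + 1
        else:
            return midpoint
    return -1

releases = sorted([Version(3, 5), Version(2, 7), Version(3, 4)])  # sorted() needs only <
print(search(releases, Version(3, 4)))
print(search(releases, Version(3, 6)))
```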

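- What happens if we have a NaN in our array? Infinity?

Infinity compares like any other number, and the doctests above cover it. A NaN breaks the ordering, so the precondition is violated. If our preconditions are violated by the user, we can do anything. Handling it nicely might be costly, so we won't.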


- what if da_array was not an array?

The user violated the pre-conditions. Anything could happen. We could check for a list,
but then that would hurt a special class which implemented the Python sequence protocol.
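To make that trade-off concrete, here is a sketch of such a class. `Squares` is hypothetical, and the standard library's `bisect_left` stands in for our own search so the block stays short. An `isinstance(..., list)` check would reject it even though it can be searched perfectly well.

```python
from bisect import bisect_left
from collections.abc import Sequence

class Squares(Sequence):
    """Hypothetical lazy sequence of i*i for i in range(n); never builds a list."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i

def search(seq, needle):
    """Binary search on anything that supports len() and integer indexing."""
    i = bisect_left(seq, needle)
    return i if i < len(seq) and seq[i] == needle else -1

squares = Squares(1000)
print(isinstance(squares, list), isinstance(squares, Sequence))
print(search(squares, 49))
print(search(squares, 50))
```

The built-in `range` is in the same position: it is a `Sequence` but not a `list`, and the timing cells earlier already relied on searching one.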

- What happens if the array is not sorted?

The user violated the pre-conditions. We could raise an error, or violate the post-conditions. If we sorted it ourselves we would violate the O(lg(n)) guarantee, and fixing the input for the user seems dubious anyway. Can we check if it is sorted? The naive check is O(n) and breaks our performance specification. We can add a guard that stops the search after more than n iterations to cover the infinite-loop case, or let the user have enough rope to hang themselves.




In [13]:
%%file binsearch.py
def binary_search(da_array: list, needle, left:int=0, right:int=-1) -> int:
    """
    An algorithm that operates in O(lg(n)) to search for an element
    in an array sorted in ascending order.
    
    Parameters
    ----------
    da_array : list
        a list of "comparable" items sorted in non-descending order
    needle: an item to find in the array; it may or may not
        be in the array
    left: int, optional
        the left index in the array to search from. Default 0
    right: int, optional
        the right index in the array to search to. Default is -1
        in which case we will use the end of the array `len(da_array) - 1`
        
    Returns
    -------
    index: int
        an integer representing the index of `needle` if found, and -1
        otherwise
        
    Notes
    -----
    PRE: `da_array` is sorted in non-decreasing order (thus items in
        `da_array` must be comparable: implement < and ==)
    POST: 
        - `da_array` is not changed by this function (immutable)
        - returns `index`=-1 if `needle` is not in `da_array`
        - returns an int `index` in [0:len(da_array)] if
          `needle` is in `da_array`
    INVARIANTS:
        - If `needle` in `da_array`, needle in `da_array[rangemin:rangemax+1]`
          is a loop invariant in the while loop below.
    WARNINGS:
        - If you provide an unsorted array this function is not guaranteed to terminate
        - for multiple copies of a value in the array searched for, the one returned
          is not guaranteed to be the smallest one.
        
    Examples
    --------
    >>> input = list(range(10))
    >>> binary_search(input, 5)
    5
    >>> binary_search(input, 4.5)
    -1
    >>> binary_search(input, 10)
    -1
    >>> binary_search([5], 5)
    0
    >>> binary_search([5], 4)
    -1
    >>> import numpy as np
    >>> binary_search([1,2,np.inf], 2)
    1
    >>> binary_search([1,2,np.inf], np.inf)
    2
    >>> binary_search(input, 5, 1,3)
    -1
    >>> binary_search(input, 2, 1,3)
    2
    >>> binary_search(input, 2, 3, 1)
    -1
    >>> binary_search(input, 2, 2, 2)
    2
    >>> binary_search(input, 5, 2, 2)
    -1
    """
    if left==0:
        rangemin = 0
    else:
        rangemin = left
    if right==-1:
        rangemax=len(da_array) - 1
    else:
        rangemax=right
    while True:
        "needle in da_array => needle in da_array[rangemin:rangemax+1]"   
        if rangemin > rangemax:
            index = -1
            return index
        #If rangemin and rangemax are both very high we do not want overflow,
        #so get the midpoint like this:
        midpoint = rangemin + (rangemax - rangemin)//2
        if da_array[midpoint] > needle:#lower part
            rangemax = midpoint - 1
        elif da_array[midpoint] < needle:
            rangemin = midpoint + 1
        else:
            index = midpoint
            return index



Writing binsearch.py


In [14]:
!/anaconda/envs/py35/bin/python3 -m doctest -v binsearch.py

Trying:
    input = list(range(10))
Expecting nothing
ok
Trying:
    binary_search(input, 5)
Expecting:
    5
ok
Trying:
    binary_search(input, 4.5)
Expecting:
    -1
ok
Trying:
    binary_search(input, 10)
Expecting:
    -1
ok
Trying:
    binary_search([5], 5)
Expecting:
    0
ok
Trying:
    binary_search([5], 4)
Expecting:
    -1
ok
Trying:
    import numpy as np
Expecting nothing
ok
Trying:
    binary_search([1,2,np.inf], 2)
Expecting:
    1
ok
Trying:
    binary_search([1,2,np.inf], np.inf)
Expecting:
    2
ok
Trying:
    binary_search(input, 5, 1,3)
Expecting:
    -1
ok
Trying:
    binary_search(input, 2, 1,3)
Expecting:
    2
ok
Trying:
    binary_search(input, 2, 3, 1)
Expecting:
    -1
ok
Trying:
    binary_search(input, 2, 2, 2)
Expecting:
    2
ok
Trying:
    binary_search(input, 5, 2, 2)
Expecting:
    -1
ok
1 items had no tests:
    binsearch
1 items passed all tests:
  14 tests in binsearch.binary_search
14 tests in 2 items.
14 passed and 0 failed.
Test passed.


In [15]:
!/anaconda/envs/py35/bin/pydoc3 binsearch

Help on module binsearch:

NNAAMMEE
    binsearch

FFUUNNCCTTIIOONNSS
    bbiinnaarryy__sseeaarrcchh(da_array:list, needle, left:int=0, right:int=-1) -> int
        An algorithm that operates in O(lg(n)) to search for an element
        in an array sorted in ascending order.
        
        Parameters
        ----------
        da_array : list
            a list of "comparable" items sorted in non-descending order
        needle: an item to find in the array; it may or may not
            be in the array
        left: int, optional
            the left index in the array to search from. Default 0
        right: int, optional
            the right index in the array to search to. Default is -1
            in which case we will use the end of the array `len(da_array) - 1`
            
        Returns
        -------
        index: int
            an integer representing the index of `needle` if found, and -1
            otherwise
            
        Notes
     