# Class 4 - 25.3.18

# Errors and Exceptions

A much-debated feature of Python (and other scripting languages) is its fear of failing. Python tries to coerce unfamiliar expressions into something it can work with. For example, arithmetic between `bool`s and numbers is fully supported, since `bool` is a subclass of `int`: `False` is treated as 0 and `True` as 1.

In [1]:
True - 1.0

0.0

In [2]:
False + 10

10

However, many other statements will result in an error, or _Exception_ in Python's terms:

In [3]:
'3' + 3  # TypeError

TypeError: must be str, not int

In [4]:
a = [0, 1, 2]
a[3]  # IndexError

IndexError: list index out of range

In [5]:
camel  # NameError

NameError: name 'camel' is not defined

There are countless exceptions in Python, and most modules you'll use define their own dedicated exceptions. Modules and packages do this because an exception is _meaningful_: each one conveys information about what went wrong at runtime. Since it's more than a simple error, we can use this information to decide in advance what to do when an exception occurs. This is called _catching_ an exception.

The keywords involved are: `try`, `except`, `else` and `finally`:

```python
try:
    # Do something that might fail
except PermissionError:
    # If we don't have permission to do the operation (e.g. write to protected disk), do the following
except IsADirectoryError:
    # Trying to do a file operation on a directory - so do the following
except (NameError, TypeError):
    # If we encounter either a non-existent variable or an unsupported operation on variables, do the following
except Exception:
    # General error, not caught by previous exceptions
else:
    # If the operation under "try" succeeded, do the following
finally:
    # Regardless of the result - success or failure - do this.
```

Let's break it down:

In [6]:
# Simplest form of exception handling:
# a = 2
try:
    b = a + 1
except NameError:  # a isn't defined
    a = 1
    b = 2

# We could catch other exceptions
try:
    b = a + 1
except TypeError:  # a isn't a float/int
    b = 2


TypeError: can only concatenate list (not "int") to list

In [7]:
# With the else clause
current_key = 'John'
default_val = 'Cohen'
dict_1 = {'John': 'Doe', 'Jane': 'Doe'}
try:
    johns = dict_1.pop(current_key)
except KeyError:  # Non-existent key
    dict_1[current_key] = default_val
else:
    print(f"{len(dict_1)} remaining key(s) in the dictionary")

1 remaining key(s) in the dictionary


In [8]:
# Another else example
tup = (1,)
try:
    a, b = tup[0], tup[1]
except IndexError as e:
    print("IndexError")
    print(f"Exception: {e}; tup: {tup}")
else:
    # process_data(a, b)
    print(a, b)

IndexError
Exception: tuple index out of range; tup: (1,)


In [9]:
# With the finally clause
def divisor(a, b):
    """
    Divides two numbers.
    a, b - numbers (int, float)
    returns a tuple of the result and a possible error.
    """
    try:
        ans = a / b
    except ZeroDivisionError as e:
        ans = None
        err = e
    except TypeError as e:
        ans = None
        err = e
    else:
        err = None
    finally:
        return ans, err


In [10]:
# Should work:
ans, err = divisor(1, 2)
print(ans, " ----", err)

# ZeroDivisionError:
ans, err = divisor(1, 0)
print(ans, "----", err)

# TypeError
ans, err = divisor(1, 'a')
print(ans, "----", err)

0.5  ---- None
None ---- division by zero
None ---- unsupported operand type(s) for /: 'int' and 'str'
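
Packages define their own exceptions in the same spirit, usually by subclassing `Exception`. The following is only a minimal sketch: `EmptyFolderError` and `count_files` are hypothetical names invented for this illustration, not part of any library.

```python
class EmptyFolderError(Exception):
    """Hypothetical exception: raised when no files were found."""


def count_files(filenames):
    """Return the number of filenames, refusing to accept an empty list."""
    if len(filenames) == 0:
        raise EmptyFolderError("No files were found.")
    return len(filenames)


try:
    n_files = count_files([])
except EmptyFolderError as e:
    # The exception's type tells us exactly what went wrong
    n_files = 0
    message = str(e)
else:
    message = "OK"

print(n_files, "----", message)
```

Because the exception has its own type, callers can catch it specifically and let unrelated errors propagate.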


Exception handling is used almost everywhere in the Python world. We always expect our operations to fail, and catch the errors as our backup plan. This is considered more Pythonic than other options. Here's a "real-world" example:

In [21]:
# Integer conversion. We check before doing it to make sure it won't raise errors
def int_conversion(s):
    """ Convert a string to int """
    if not isinstance(s, str) or not s.isdigit():
        return None
    elif len(s) > 10:    # too many digits, treat the input as invalid
        return None
    else:
        return int(s)

In [22]:
# Same purpose - more Pythonic
def pythonic_int_conversion(s):
    """ Convert a string to int """
    try:
        return int(s)
    except (TypeError, ValueError, OverflowError):
        return None
# This is also sometimes phrased as "easier to ask for forgiveness than permission"
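
To see what the "forgiveness" version does with awkward inputs, here is a short usage sketch. The function is repeated so the snippet is self-contained, and the list of inputs is an arbitrary selection made up for illustration.

```python
def pythonic_int_conversion(s):
    """ Convert a string to int, returning None on failure """
    try:
        return int(s)
    except (TypeError, ValueError, OverflowError):
        return None


# Arbitrary test inputs: valid numbers, surrounding whitespace, text and None
inputs = ['42', '-1', ' 7 ', 'lizard', None, '999999999999']
results = [pythonic_int_conversion(item) for item in inputs]
print(results)
```

Note that `int()` itself accepts negative numbers and surrounding whitespace, cases that a hand-written `isdigit()` check would reject.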

## Exercise - User Input Verification

The user's input is always a very error-prone area in an application. A famous joke describes this situation in the following manner: 

> A Quality Assurance (QA) Engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 999999999 beers. Orders a lizard. Orders -1 beers. Orders a sfdeljknesv.


A decent application should not only handle all possible incoming inputs, but should also convey back to the user the information of what went wrong. In this exercise you'll write a `verify_input` function that handles file and folder names.

### Short Intro - `pathlib`

For file I/O and other disk operations, some of which are required in this exercise, Pythonistas use `pathlib`, a module in the Python standard library designed for working with files and folders (`pathlib2` in Python 2). Its basic premise is that files and folders are objects themselves, and certain operations are allowed between these objects.

In [11]:
from pathlib import Path

In [23]:
p1 = Path(r'C:\Users\Hagai\Documents\Classes\python-course-for-students')  # notice the "raw" string r'',
# it makes Python treat backslashes literally, so they don't have to be doubled

In [24]:
p1

WindowsPath('C:/Users/Hagai/Documents/Classes/python-course-for-students')

In [25]:
p1.parent

WindowsPath('C:/Users/Hagai/Documents/Classes')

In [26]:
list(p1.parents)

[WindowsPath('C:/Users/Hagai/Documents/Classes'),
 WindowsPath('C:/Users/Hagai/Documents'),
 WindowsPath('C:/Users/Hagai'),
 WindowsPath('C:/Users'),
 WindowsPath('C:/')]

In [27]:
p1.exists()  # is it actually a folder/file?

True

In [28]:
p1.parts

('C:\\',
 'Users',
 'Hagai',
 'Documents',
 'Classes',
 'python-course-for-students')

In [29]:
p1.name

'python-course-for-students'

In [30]:
for file in p1.iterdir():
    print(file)

C:\Users\Hagai\Documents\Classes\python-course-for-students\general
C:\Users\Hagai\Documents\Classes\python-course-for-students\hw1
C:\Users\Hagai\Documents\Classes\python-course-for-students\hw2


In [32]:
# Traversing the file system
p2 = Path('C:/Users/Hagai/Documents')
p2 / 'Classes' / 'python-course-for-students'
# Operator overloading

WindowsPath('C:/Users/Hagai/Documents/Classes/python-course-for-students')

#### The exercise:

In [33]:
class UserInputVerifier:
    """
    Assert that the input from a user is a valid folder name. A valid folder is a folder
    containing the following files: "a.py", "b.py", "c.py", and the data file "data.txt". However, the class
    should be able to deal with any arbitrary filename, or an iterable of filenames.
    If the given folder doesn't contain them, it's possible the user gave us a parent folder of the
    folder that contains these files. Look into any sub-folders for these files, and return the
    "actual" true folder, i.e. the top-most folder containing all the files.
    Input - Foldername, string
    Output - A pathlib object. If the input isn't valid, i.e. the files weren't found, 
    the class should raise an exception.
    """

### Exercise solution below...

In [34]:
class UserInputVerifier:
    """
    Assert that the given foldername contains files in "filenames".
    """
    def __init__(self, foldername, filenames=('a.py', 'b.py', 'c.py', 'data.txt')):
        self.raw_folder = Path(str(foldername))  # first possible error
        self.filenames = self._verify_filenames(filenames)
    
    def _verify_filenames(self, filenames):
        """ Verify the input filenames, and return it as an iterable. """
        
        typ = type(filenames)
        if typ not in (str, Path, list, tuple, set):
            raise TypeError("Filenames should be an iterable, a Path object or a string.")
        if typ in (str, Path):
            return [filenames]
        return filenames
        
    def check_folder(self):
        """ Assert that the files are indeed in the folder or in one of its subfolders """
        
        existing_files = []
        missing_files = []
        if not self.raw_folder.exists():
            raise UserWarning(f"Folder {self.raw_folder} doesn't exist.")
            
        # Make sure that each file we're looking for exists exactly once
        for file_to_look in self.filenames:
            found_files = [str(file) for file in self.raw_folder.rglob(file_to_look)]
            if len(found_files) == 0:
                raise UserWarning(f"File '{file_to_look}' was missing from folder '{self.raw_folder}'.")
            if len(found_files) > 1:
                raise UserWarning(f"More than one file named '{file_to_look}' was found in '{self.raw_folder}'.")
        return True

In [35]:
foldername = r'./mock'
verifier = UserInputVerifier(foldername)
verifier.check_folder()

True

## File Input/Output

Yet another error-prone area in applications is their I/O (input-output) module. Interfacing with objects outside the scope of your own project should always be handled carefully. You never know what's really out there.

The class we created in the previous exercise should help with the first step of file I/O, but that's not all of it.

Assume we wish to write some data to a file - a list filled with counts of some sort, for example.

To write to (or read from) a file, you have to perform several operations:
1. Define the file path and name.
2. Open the file with the appropriate mode - read, write, etc.
3. Flush out the data.
4. Close the file.

Here's a mediocre example of how it's done:

In [36]:
data_to_write = 'A B C D E F'
filename = 'data.txt'
file = open(filename, 'w')  # w is write
file.write(data_to_write)
file.close()

The real issue stems from the fact that these steps are very error-prone. For example, you can open a file to write something to it, but while the file is open someone else (or some other process) might modify or even delete it.

Another example - some connection error might occur after you've flushed the data into the file, but before you managed to close it, leading to a file that can't be accessed by the operating system.

Luckily, Python is here to help, and its main tool for doing so is the context manager, invoked with the `with` keyword. Context managers are awesome, and I'll only briefly describe their capabilities. That being said, they shine the most when doing I/O, like in the following example:

In [37]:
data_to_write = 'A B C D E F'
filename = 'data.txt'
with open(filename, 'w') as file:
    file.write(data_to_write)

The unique thing here is that once we've opened the file, the `with` block guarantees that the file will be closed, regardless of what code is executed. 

Even if an error occurs while the file is open - the context manager will ensure proper handling of the file and prevent our data from disappearing into the void of the file system.
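
For comparison, the guarantee that `with` provides for files is roughly what you would get by writing the `try`/`finally` yourself. This is only a sketch of the idea; the temporary folder is an assumption added so that the example doesn't overwrite real files.

```python
import tempfile
from pathlib import Path

data_to_write = 'A B C D E F'
# A temporary folder is assumed here, so that the sketch doesn't overwrite real files
filename = Path(tempfile.mkdtemp()) / 'data.txt'

file = open(filename, 'w')
try:
    file.write(data_to_write)
finally:
    # Runs whether or not write() raised an exception
    file.close()

print(file.closed)
```

The `with` version is shorter and harder to get wrong, which is why it's the preferred form.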

### How do they work (advanced)?

Like everything in Python, a context manager is an object. Moreover, any class can become a context manager by simply defining two methods, `__enter__` and `__exit__`:

In [38]:
class MyFile(object):
    def __init__(self, file_name, method):
        self.file_obj = open(file_name, method)

    def __enter__(self):
        return self.file_obj

    def __exit__(self, error_type, value, traceback):
        self.file_obj.close()

# Usage
with MyFile('demo.txt', 'w') as opened_file:
    opened_file.write('Hola!')

The `__enter__` method defines the object bound to the name after the `as` keyword (in this case - a file object). The `__exit__` method defines what happens when we leave the context manager's context. As you can see, `__exit__` takes three extra parameters that are used for error handling.
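
To see what `__exit__` actually receives, here is a small sketch. `Recorder` is a hypothetical class written for this illustration; it simply stores the error type it was handed.

```python
class Recorder:
    """Hypothetical context manager that records what __exit__ receives."""

    def __enter__(self):
        self.error_type = None
        return self

    def __exit__(self, error_type, value, traceback):
        # All three arguments are None if the block finished without an error
        self.error_type = error_type
        return False  # a falsy return value lets the exception propagate


rec = Recorder()
try:
    with rec:
        1 / 0
except ZeroDivisionError:
    print("The error reached us after __exit__ ran")

print(rec.error_type)
```

Returning a truthy value from `__exit__` would suppress the exception instead, which is rarely what you want.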

There's an even more advanced way to create context managers, using decorators, which we'll discuss later in the course.

# Scientific Python

## Introduction

![The scientific Python stack](scientific_python.jpg)

The onion-like scientific stack of Python is composed of the most important packages used in the Python scientific world. Here's a quick overview of the most important names in the stack, before we dive in deep:
1. IPython: A REPL, much like the command window in MATLAB. Lets you write Python on the fly, debug and check the performance of your code, and much (much!) more.
2. NumPy: Standard implementation of multi-dimensional arrays in Python. 
3. Jupyter: Notebooks aimed at data exploration. Write code and see its output immediately, with useful Markdown annotations to accompany it - just like this notebook!
4. SciPy: Implements advanced algorithms widely used in scientific programming. Advanced linear algebra, data structures, curve fitting, signal processing, modeling and more.
5. Matplotlib: Visualizations of your data in Python with a MATLAB-like interface.
6. Pandas: Tabular data manipulation.

This scientific stack is one of the main reasons Python has risen in recent years to the prominent spot it holds today.

Throughout the course you will become familiar with all of the mentioned tools, and more.

## NumPy

NumPy is the de-facto implementation of arrays in Python. They're very similar to MATLAB arrays (and to other implementations in other programming languages), which will hopefully make us feel comfortable with this module.

>[This](https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html) page contains a thorough comparison of similar functions in MATLAB and numpy. If you want to use MATLAB's `find` function and can't seem to find a counterpart in numpy (it's `nonzero`, BTW), or if you wish to see how different arithmetic functions compare between the two, that page can be of great assistance.

In [39]:
import numpy as np  # standard alias for numpy

In [40]:
l = [1, 2, 3]
arr = np.array(l)  # obviously equal to np.array([1, 2, 3])
arr

array([1, 2, 3])

In [41]:
# Check the dimensionality of the array:
print("Shape: ", arr.shape)
print("NDim: ", arr.ndim)
# But...
a = np.array(1)  # a zero-dimensional array! If you want a vector, write np.array([1]) instead
print("Shape: ", a.shape)
print("NDim: ", a.ndim)


Shape:  (3,)
NDim:  1
Shape:  ()
NDim:  0


We've instantiated an array by calling the `np.array()` function on an iterable. This will create a one-dimensional numpy array, or _vector_. This is the first _important_ difference from MATLAB. In MATLAB, all arrays are by default n-dimensional. If you write:
```matlab
a = 1
a(:, :, :, :, :, 1) = 2
```
then you just received a 6-dimensional array. Numpy doesn't allow that. The dimensions of an array are generally the dimensions it had when it was created. You can always use `reshape`, like MATLAB, to define the exact number of dimensions you wish to have.
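
As a rough sketch of how dimensions are set explicitly in numpy, using `reshape` and `np.newaxis` (the variable names below are arbitrary choices for this illustration):

```python
import numpy as np

vec = np.arange(6)            # shape (6,): one dimension
mat = vec.reshape((2, 3))     # shape (2, 3): same six elements, two dimensions
row = vec[np.newaxis, :]      # shape (1, 6): np.newaxis adds a dimension of size 1
col = vec.reshape((-1, 1))    # shape (6, 1): -1 means "infer this size"

for name, array in [('vec', vec), ('mat', mat), ('row', row), ('col', col)]:
    print(name, array.ndim, array.shape)
```

None of these operations changes the number of elements; only the way they are arranged into dimensions.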

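The second difference is the idea of vectors: a numpy array can be truly one-dimensional, while in MATLAB every array has at least two dimensions.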

In [42]:
print(f"The number of dimensions is {arr.ndim}.")
print(f"The shape is {arr.shape}.")

The number of dimensions is 1.
The shape is (3,).


As you see, `ndim` returns the number of dimensions. The `shape` is returned as a tuple, the length of which is the number of dimensions, while the values represent the number of elements in each dimension. `shape` is very similar to the `size()` function of MATLAB.

Creating arrays with more than one dimension might look odd at first glance:

In [43]:
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
arr_2d

array([[1, 2, 3],
       [4, 5, 6]])

In [44]:
print(f"The number of dimensions is {arr_2d.ndim}.")
print(f"The shape is {arr_2d.shape}.")
print(f"The len of the array is {len(arr_2d)} - which corresponds to the value of the 0th dimension.")

The number of dimensions is 2.
The shape is (2, 3).
The len of the array is 2 - which corresponds to the value of the 0th dimension.


As we see, each list was considered a single "row" (dimension number 0) in the final array, very much like MATLAB, although the syntax can be confusing at first. Here's another slightly confusing example:

In [45]:
c = np.array([[[1], [2]], [[3], [4]]])
c

array([[[1],
        [2]],

       [[3],
        [4]]])

In [46]:
c.shape

(2, 2, 1)

### Getting help

Numpy has a ton of features, and to find our way around them we can use `lookfor` and the `?` sign, in addition to the official online reference:

In [47]:
np.lookfor("Create array")

Search results for 'create array'
---------------------------------
numpy.array
    Create an array.
numpy.memmap
    Create a memory-map to an array stored in a *binary* file on disk.
numpy.diagflat
    Create a two-dimensional array with the flattened input as a diagonal.
numpy.fromiter
    Create a new 1-dimensional array from an iterable object.
numpy.partition
    Return a partitioned copy of an array.
numpy.ctypeslib.as_array
    Create a numpy array from a ctypes array or a ctypes POINTER.
numpy.ma.diagflat
    Create a two-dimensional array with the flattened input as a diagonal.
numpy.ma.make_mask
    Create a boolean mask from an array.
numpy.ctypeslib.as_ctypes
    Create and return a ctypes object from a numpy array.  Actually
numpy.ma.mrecords.fromarrays
    Creates a mrecarray from a (flat) list of masked arrays.
numpy.ma.mvoid.__new__
    Create a new masked array from scratch.
numpy.lib.format.open_memmap
    Open a .npy file as a memory-mapped array.
numpy.ma.MaskedArr

In [48]:
np.con*?

### Creating arrays

In [49]:
np.arange(10)  # similar to MATLAB's 0:9 (and to Python's range())

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [50]:
np.arange(start=1, stop=8, step=3)  # MATLAB's 1:3:8

array([1, 4, 7])

In [51]:
np.zeros((2, 3))  # notice that the shape is a tuple

array([[0., 0., 0.],
       [0., 0., 0.]])

In [52]:
np.zeros_like(arr_2d)

array([[0, 0, 0],
       [0, 0, 0]])

In [53]:
np.ones((1, 4), dtype='int64')  # 2-d array with the dtype argument

array([[1, 1, 1, 1]], dtype=int64)

In [54]:
np.full((4, 2), 1e6)

array([[1000000., 1000000.],
       [1000000., 1000000.],
       [1000000., 1000000.],
       [1000000., 1000000.]])

In [55]:
np.linspace(0, 10, 3)  # start, stop, number of points (endpoint as a keyword argument)

array([ 0.,  5., 10.])

In [56]:
np.linspace(0, 10, 3, endpoint=False, dtype=np.float32)

array([0.       , 3.3333333, 6.6666665], dtype=float32)

In [57]:
np.eye(3, dtype='uint8')

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]], dtype=uint8)

In [58]:
np.diag([19, 20])

array([[19,  0],
       [ 0, 20]])

In [59]:
# Some more interesting features
np.array([True, False, False])

array([ True, False, False])

In [60]:
np.array([1+2j]).dtype

dtype('complex128')

In [61]:
np.array(['a', 'bc', 'd', 'e'])  # <U2 are strings containing maximum 2 unicode letters
# No such thing as a cell array, arrays can contain strings.

array(['a', 'bc', 'd', 'e'], dtype='<U2')

In [98]:
# Arrays can be heterogeneous, although usually this is discouraged.
# To make a heterogeneous array, you have to specify its data type as "object":
np.array([1, 'b', True, {'b': 1}], dtype=object)

array([1, 'b', True, {'b': 1}], dtype=object)

The last few examples showed us that numpy arrays are a superset of MATLAB's matrices, as they also contain the cell array functionality within them.

When multiplying numpy arrays, multiplication is done cell-by-cell (elementwise), rather than by the rules of linear algebra. 

You can still multiply vectors and matrices using one of two options:
* The `@` operator (preferred): `arr1 @ arr2`
* `arr1.dot(arr2)`

Just remember that the default behavior is different than MATLAB's.

Also, numpy does contain a matrix-like array - `np.matrix('1 2; 3 4')` - which behaves like a linear algebra 2D matrix, but its use is discouraged.
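
The difference between the two kinds of multiplication can be sketched as follows (`arr1` and `arr2` hold arbitrary values chosen for illustration):

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[0, 1], [1, 0]])

elementwise = arr1 * arr2   # each element multiplied by its counterpart
matrix_prod = arr1 @ arr2   # linear-algebra matrix product
same_as_dot = arr1.dot(arr2)

print(elementwise)
print(matrix_prod)
```

Here `arr1 * arr2` corresponds to MATLAB's `.*`, while `arr1 @ arr2` corresponds to MATLAB's `*`.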

### Difference between numpy arrays and lists

You might have asked yourselves by this point what's the difference between lists and numpy arrays. In other words, if Python already has an array-like data structure, why does it need another one?

Lists are truly amazing, but their flexibility comes at a cost. You won't notice it when you're writing some short script, but when you start working with real datasets you notice that their performance is _lacking_.

Naive iteration over lists is good enough when the lists contain up to a few thousand elements, but somewhere after this invisible threshold you will start to notice the runtimes of your app.

#### Why are lists slower?

Lists are slower due to their implementation. Because a list has to support heterogeneous elements, it is actually an object containing references to other objects, which are the elements of the data contained in the list. This nesting can go even deeper, since elements of lists can be other lists, or other complex data types.

Iterating over such a complicated object carries a lot of "overhead", i.e. time the interpreter spends figuring out what it is actually facing - is the current value an integer? A float? A dictionary of tuples?

![Numpy arrays versus lists](array_vs_list.png)

This is where numpy arrays come into the picture. They require the contained data to be homogeneous (disregarding the "object" datatype), leading to a simpler structure of the underlying implementation: A numpy array is an object with a pointer to a contiguous block of data in the memory.

Numpy forces the elements in its array to be homogeneous, by means of "upcasting":

In [99]:
arr = np.array([1, 2, 3.14, 4])
arr  # upcasted to float rather than to "object", since object arrays are much slower

array([1.  , 2.  , 3.14, 4.  ])

This homogeneity and contiguity allow numpy to use "vectorized" functions, which are simply functions that can iterate in C code over the array, as opposed to regular functions which have to use the Python iteration rules.

This means that while the following two pieces of code do essentially the same thing, the numpy version is much faster because its loop is done in C rather than in Python (this is the case in MATLAB as well):

In [63]:
%%timeit
python_array = list(range(1000000))
for item in python_array:
    item += 1

103 ms ± 4.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [64]:
%%timeit 
numpy_array = np.arange(1000000)
numpy_array += 1  # inside, a C loop is adding 1 to each item.

# Roughly a 30-fold improvement for a pretty small (1M elements) array.
# This is approximately the size of a 1024x1024 pixel image.

3.59 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


If you recall the first class, you'll remember that we mentioned how Python code passes through an interpreter, itself written in C, before anything runs on the machine. Numpy's vectorized operations take a "shortcut" here: they call precompiled, efficient C code without the per-element Python overhead. When using vectorized numpy operations, the loop that runs backstage is very similar to a loop that a C programmer would have written by hand.

Another small but significant benefit of numpy arrays is a smaller memory footprint:

In [100]:
from sys import getsizeof

python_array = list(range(1000000))
list_size = getsizeof(python_array) / 1e6  # in MB
print(f"Python list size (1M elements, MB): {list_size}")

numpy_array = np.arange(1000000)
numpy_size = numpy_array.nbytes / 1e6
print(f"Numpy array size (1M elements, MB): {numpy_size}")

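# Why do a million elements take 4 MB? Each element has to weigh 4 bytes, or 32 bits. This means that
# np.arange() generated int32 values by default on this machine (the default integer size is
# platform-dependent). Also note that getsizeof() counts only the list's references, not the
# integer objects they point to, so the list's real footprint is even larger.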

Python list size (1M elements, MB): 9.000112
Numpy array size (1M elements, MB): 4.0


### Indexing and slicing

In [102]:
a = np.arange(10)
print(f"a = {a}")
a[0], a[3], a[-2]  # Python indexing is always done with square brackets

a = [0 1 2 3 4 5 6 7 8 9]


(0, 3, 8)

In [67]:
# The beautiful reverse slicing is here as well
a[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [68]:
arr_2d

array([[1, 2, 3],
       [4, 5, 6]])

In [69]:
arr_2d[0, 2]  # first row, third column

3

In [70]:
arr_2d[0, :]  # first row, all columns

array([1, 2, 3])

In [71]:
arr_2d[0]  # first item in the first dimension, and all of its corresponding elements
# Similar to arr_2d[0, :]

array([1, 2, 3])

In [72]:
a[1::2]

array([1, 3, 5, 7, 9])

In [73]:
a[:2]  # last index isn't included

array([0, 1])

In [74]:
# In numpy, basic slicing creates a view of the array, not a copy:
b = a[::2]
b

array([0, 2, 4, 6, 8])

In [75]:
b[0] = 100
b

array([100,   2,   4,   6,   8])

In [76]:
a  # a is also changed!

array([100,   1,   2,   3,   4,   5,   6,   7,   8,   9])

In [103]:
# Copying is done with .copy()
a_copy = a[::-1].copy()
a_copy

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [104]:
a_copy[1] = 200
a_copy

array([  9, 200,   7,   6,   5,   4,   3,   2,   1,   0])

In [105]:
a_copy  # only the copy changed; a itself is untouched

array([  9, 200,   7,   6,   5,   4,   3,   2,   1,   0])

This behavior allows Numpy to save time and memory. It will usually go unnoticed during day-to-day use. Occasionally, however, it can result in ugly bugs which are hard to locate.
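
When hunting for such bugs, one way to check whether two arrays share data is `np.shares_memory`. A minimal sketch (the names `view` and `copy` are arbitrary):

```python
import numpy as np

a = np.arange(10)
view = a[::2]          # basic slicing: a view on the same memory
copy = a[::2].copy()   # an independent copy

print(np.shares_memory(a, view))
print(np.shares_memory(a, copy))

view[0] = 100          # writes through to a
copy[1] = 200          # leaves a untouched
print(a)
```

Only the assignment through `view` is visible in `a`.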

In [80]:
a[:3]

array([100,   1,   2])

In [81]:
a[3:]

array([3, 4, 5, 6, 7, 8, 9])

![Numpy indexing](numpy_indexing.png)

### Aggregation

In [82]:
arr = np.random.random(1000)
sum(arr)  # The native Python sum works on numpy array, but its use is discouraged in this context

492.6043817490305

`sum()` is a built-in Python function, capable of working on lists and other native Python data structures. Numpy has its own sum functions: You can either write `np.sum(arr)`, or use `arr.sum()`:

In [83]:
%timeit sum(arr)
%timeit np.sum(arr)
%timeit arr.sum()

215 µs ± 24.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
5.15 µs ± 583 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.44 µs ± 265 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Keep in mind that the built-in `sum()` and numpy's `np.sum()` aren't identical. `np.sum()` by default calculates the sum of _all axes_, returning a single number for multi-dimensional arrays.

In MATLAB, this behavior is replicated with `sum(arr(:))`.
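
A small sketch of this difference on a 2D array (the array is redefined here so the snippet stands on its own):

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])

builtin_result = sum(arr_2d)         # iterates over the rows and adds them up
numpy_result = np.sum(arr_2d)        # sums over all axes: a single number
per_column = np.sum(arr_2d, axis=0)  # collapses axis 0 (the rows)
per_row = arr_2d.sum(axis=1)         # collapses axis 1 (the columns)

print(builtin_result, numpy_result, per_column, per_row)
```

The built-in `sum()` only knows how to iterate over the first axis, so here it happens to agree with `np.sum(arr_2d, axis=0)`.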

Likewise, `min()` and `max()` also have two "competing" versions:

In [84]:
%timeit min(arr)  # Native Python
%timeit np.min(arr)
%timeit arr.min()

82.8 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
4.36 µs ± 662 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.33 µs ± 442 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Calculating the `min` of an array will, again, result in a single number:

In [85]:
arr2 = np.random.random((100, 100, 100))
arr2.min()

5.823232201995765e-07

If you wish to calculate it across an axis, use the `axis` keyword:

In [86]:
# 2D output - axis number 0 was "dropped"
print(arr2.min(axis=0))

print('------')
# 1D output - the maximum was taken over the first two axes
print(arr2.max(axis=(0, 1)))

[[1.30105370e-04 1.85247986e-02 5.76981265e-04 ... 2.17054628e-03
  1.74017353e-03 1.71648417e-02]
 [9.74708434e-04 1.42110375e-02 2.27249914e-04 ... 1.84564336e-04
  9.91460849e-03 5.03250994e-04]
 [2.10501609e-03 3.11640007e-03 9.31462323e-03 ... 4.80578692e-03
  2.53371128e-03 1.15754554e-02]
 ...
 [2.64927481e-03 2.05463086e-03 2.03506809e-02 ... 1.71246031e-03
  1.22983732e-02 1.23869152e-02]
 [6.90173657e-04 3.43165101e-03 8.55869758e-03 ... 7.83710115e-04
  1.33042948e-02 7.60761611e-03]
 [3.24140890e-03 7.34097009e-03 9.79868560e-03 ... 3.08316230e-02
  6.87046457e-03 2.45869442e-06]]
------
[0.99996561 0.99998143 0.99979485 0.99995817 0.99992053 0.99996168
 0.9997315  0.99988793 0.99991571 0.9999547  0.99999641 0.99999241
 0.99995579 0.99997798 0.99991344 0.99989906 0.9997946  0.9999751
 0.99997769 0.99986947 0.99986118 0.99993987 0.99994818 0.99999912
 0.99978118 0.999931   0.99992138 0.9997643  0.99999697 0.99999074
 0.99997707 0.9998866  0.99990193 0.99990738 0.99998946 0.9

Many other aggregation functions exist, including:
```
- np.var
- np.std
- np.argmin / np.argmax
- np.median
- ...
```
Most of them have an object-oriented version, i.e. `arr.var()`, and a procedural version, i.e. `np.var(arr)`.

### Fancy indexing

In Numpy, indexing an array with a different array is called "fancy" indexing. Perhaps confusingly, it creates a copy, not a view.

In [106]:
basic_array = np.arange(10)
basic_array

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [107]:
mask = (basic_array % 2 == 0)
mask

array([ True, False,  True, False,  True, False,  True, False,  True,
       False])

In [108]:
basic_array[mask]  # also basic_array[basic_array % 2 == 0]

array([0, 2, 4, 6, 8])

MATLAB veterans shouldn't be surprised by this feature, but if you haven't seen it before make sure to have a firm grasp on it, as it's a very powerful technique.

In [90]:
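basic_array = basic_array.astype(object)  # an integer array can't hold strings, an "object" array can
basic_array[mask] = 'a'
basic_array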

array(['a', 1, 'a', 3, 'a', 5, 'a', 7, 'a', 9], dtype=object)

In [91]:
float_array = np.arange(start=0, stop=20, step=1.5)
float_array

array([ 0. ,  1.5,  3. ,  4.5,  6. ,  7.5,  9. , 10.5, 12. , 13.5, 15. ,
       16.5, 18. , 19.5])

In [92]:
float_array[[1, 2, 5, 5, 10]]  # copy, not a view. Meaning that the resulting array is a new
# instance of the original array, independent of it, in a different location in memory.

array([ 1.5,  3. ,  7.5,  7.5, 15. ])

Counting the number of values that satisfy our condition can be done as follows:

In [109]:
basic_array = np.arange(10)
count_nonzero_result = np.count_nonzero(basic_array % 2 == 0)
sum_result = np.sum(basic_array % 2 == 0)

print(f"np.count_nonzero result: {count_nonzero_result}")
print(f"np.sum result: {sum_result}")
# In the latter case, True is 1 and False is 0

np.count_nonzero result: 5
np.sum result: 5


If we wish to add more conditions into the mix, we can use the `&, |, ~, ^` operators (and, or, not, xor):

In [94]:
np.sum((basic_array % 2 == 1) & (basic_array != 3))  # odd values that are different from 3

4

Note that numpy uses the `&, |, ~, ^` operators for element-by-element logical operations, while the reserved `and` and `or` keywords evaluate the truth value of the entire object:

In [110]:
np.sum((basic_array % 2 == 1) and (basic_array != 3))   # doesn't work, "and" isn't used this way here

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
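
As the error message suggests, reducing a boolean array to a single truth value has to be done explicitly. A short sketch of the working alternatives (the intermediate names `odd`, `not_three` and `both` are arbitrary):

```python
import numpy as np

basic_array = np.arange(10)
odd = basic_array % 2 == 1
not_three = basic_array != 3

both = np.logical_and(odd, not_three)  # element-by-element, same as odd & not_three
print(both.sum())  # how many elements satisfy both conditions
print(both.any())  # does at least one element satisfy them?
print(both.all())  # do all elements satisfy them?
```

`any()` and `all()` are the explicit ways to turn a whole array into one `True` or `False`.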

### Sorting

Again we face a situation in which Python has its own sorting functions (`sorted()` and `list.sort()`), but numpy's `np.sort()` is much better suited for arrays. We can either sort in-place, or get a new object back:

In [111]:
# Return a new object:
arr = np.array([ 3, 4, 1, 8, 10, 2, 4])
print(np.sort(arr))

# Sort in-place:
arr.sort()
print(arr)

[ 1  2  3  4  4  8 10]
[ 1  2  3  4  4  8 10]


The default implementation is a _quicksort_, but other sorting algorithms can be found as well.

In [112]:
# np.argsort will return the indices of the sorted array:
arr = np.array([ 3, 4, 1, 8, 10, 2, 4])
print("Sorted indices: {}".format(np.argsort(arr)))

# Usage is as follows:
arr[np.argsort(arr)]

Sorted indices: [2 5 0 1 6 3 4]


array([ 1,  2,  3,  4,  4,  8, 10])
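
The sorting algorithm mentioned earlier is chosen with the `kind` keyword, which both `np.sort` and `np.argsort` accept. A brief sketch using the same array:

```python
import numpy as np

arr = np.array([3, 4, 1, 8, 10, 2, 4])

# mergesort is stable: equal values (the two 4s) keep their original order
stable_indices = np.argsort(arr, kind='mergesort')
heap_sorted = np.sort(arr, kind='heapsort')

print(stable_indices)
print(heap_sorted)
```

A stable sort matters when the order of equal elements carries meaning, for instance when sorting one array by the values of another.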

### Concatenation

Concatenation (and splitting) in numpy works as follows:

In [113]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])  # concatenates along the first axis

array([1, 2, 3, 3, 2, 1])

In [114]:
# Concatenate more than two arrays
z = [99, 99, 99]
np.concatenate([x, y, z])  # notice how the function argument is an iterable!

array([ 1,  2,  3,  3,  2,  1, 99, 99, 99])

In [115]:
# 2D - along the last axis
twod_1 = np.array([[0, 1, 2],
                   [3, 4, 5]])
twod_2 = np.array([[6, 7, 8],
                   [9, 10, 11]])
np.concatenate((twod_1, twod_2), axis=-1)

array([[ 0,  1,  2,  6,  7,  8],
       [ 3,  4,  5,  9, 10, 11]])

In [116]:
# np.vstack and np.hstack are also an option
sim1 = np.array([0, 1, 2])
sim2 = np.array([[30, 40, 50],
                 [60, 70, 80]])
np.vstack((sim1, sim2))

array([[ 0,  1,  2],
       [30, 40, 50],
       [60, 70, 80]])

Splitting up arrays is also quite easy:

In [144]:
arr = np.arange(16).reshape((4, 4))  # see the tuple? Shapes of arrays always come in tuples
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [151]:
upper, lower = np.split(arr, [3])  # splits before row index 3 (along the first axis, by default)
print(upper)
print('---')
print(lower)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
---
[[12 13 14 15]]
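`np.split` also accepts an integer (the number of equal parts) and an `axis` argument. The sketch below reuses the same 4x4 array; the names `top`, `bottom`, `left`, `middle` and `right` are only illustrative:

```python
import numpy as np

arr = np.arange(16).reshape((4, 4))

# An integer asks for that many equal parts along the first axis.
top, bottom = np.split(arr, 2)
print(top.shape, bottom.shape)

# A list of indices with axis=1 cuts between columns:
# columns [0], columns [1, 2] and column [3].
left, middle, right = np.split(arr, [1, 3], axis=1)
print(left.shape, middle.shape, right.shape)
```

Passing an integer that does not divide the axis length evenly raises a `ValueError`.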


### Exercise

#### a. _Familiarization:_

Create two random 3D arrays, at least 3x3x3 in size. The first should contain random integers between 0 and 100, while the second should contain floating-point numbers drawn from a normal distribution with a mean of -1 and a standard deviation of 2.

1. What are the default datatypes of integer and floating-point arrays in numpy?

2. Center the distribution of the first integer array around 0, with values between -1 and 1.

3. Calculate their mean, their standard deviation along the last axis, and their sum along the second axis. Remember that arrays are, like everything in Python, objects.


#### b. _Iteration over an array:_ 

Find and return the indices of the two random arrays only where _both elements_ comply with -0.5 <= X <= 0.5. Do so in at least two distinct ways. These can include elementwise iteration, boolean masks, and more.

Time the execution of each solution using the IPython `%%timeit` magic, if you have it available.

#### c. _Broadcasting:_
1. Create a vector of length 100, filled with ones, and a 2D zero-filled array whose last dimension has a size of 100.
2. Using `np.tile`, add the vector and the array.
3. Did you have to use `np.tile` in this case? What feature of numpy allows this?
4. What happens to the non-tiled addition when the first dimension of the matrix is 100? Why?
5. Bonus: How can one add the matrix (in its new shape) and the vector without `np.tile()`?

### Exercise solutions below...

In [232]:
# a
# 1
low = 0
high = 100
arr1 = np.random.randint(low=low, high=high, size=(10, 10, 10))  # dtype is platform-dependent: np.int64 on most systems, np.int32 on Windows with numpy < 2.0
arr2 = np.random.randn(10, 10, 10) * 2 - 1  # dtype is np.float64

# 2
center = (low + high) / 2  # midpoint of the range, also correct when low != 0
half_range = (high - low) / 2
arr1 = (arr1 - center) / half_range

# 3
print("arr1 (integers) stats:")
print(f"Mean: {arr1.mean()},\nSTD: {arr1.std(axis=-1)},\nSum: {arr1.sum(axis=1)}")
print("arr2 (normal distribution) stats:")
print(f"Mean: {arr2.mean()},\nSTD: {arr2.std(axis=-1)},\nSum: {arr2.sum(axis=1)}")

arr1 (integers) stats:
Mean: -0.020619999999999996,
STD: [[0.5359291  0.61333514 0.50833454 0.33021205 0.66885275 0.61260428
  0.44156087 0.50123448 0.4065907  0.3777883 ]
 [0.46545032 0.58760871 0.52640669 0.61347861 0.46119844 0.58208247
  0.58282416 0.40878356 0.57019646 0.55656446]
 [0.37253725 0.43644473 0.42942287 0.61514226 0.48916255 0.63286333
  0.46882406 0.61070779 0.53604477 0.51823161]
 [0.52177006 0.47644097 0.53019242 0.39339039 0.72874961 0.44749972
  0.4968863  0.38956899 0.51682105 0.57946527]
 [0.36334557 0.63012697 0.63956548 0.59572141 0.55644946 0.45217696
  0.42390565 0.58275552 0.27791366 0.67945861]
 [0.47340891 0.51949591 0.46296436 0.65697489 0.42663333 0.5181853
  0.67374773 0.54335624 0.51433841 0.5383679 ]
 [0.45888561 0.67546725 0.62923446 0.51753261 0.50825191 0.49243883
  0.38665747 0.56633559 0.42068991 0.37653154]
 [0.63020314 0.51550364 0.46206493 0.53044887 0.53194361 0.31815091
  0.42720019 0.47252936 0.56773233 0.4936436 ]
 [0.72770598 0.55211955 

In [233]:
# %%timeit
# b
# 1 - elementwise
result = []
for idx, (item1, item2) in enumerate(zip(arr1.flat, arr2.flat)):
    if (-0.5 <= item1 <= 0.5) and (-0.5 <= item2 <= 0.5):
        result.append(idx)
result

554 µs ± 36.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [234]:
# %%timeit
# b
# 2 - boolean masks created with np.ravel(), indices found with np.where()
mask1 = ((-0.5 <= arr1.ravel()) & (arr1.ravel() <= 0.5))
mask2 = ((-0.5 <= arr2.ravel()) & (arr2.ravel() <= 0.5))
both_in_range = np.logical_and(mask1, mask2)
result = np.where(both_in_range)
result

15.5 µs ± 649 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


As the timings show, the second solution is far more efficient than the first (roughly 35 times faster here), and once you have read a few such code blocks, its syntax can be at least as clear as that of the first solution.
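Both solutions above return indices into the flattened arrays. If the 3D positions are needed instead, one possible variant is sketched below. It assumes small stand-in arrays built with a seeded `np.random.default_rng` generator (not the arrays from the exercise) and uses `np.abs` as a shorthand for the symmetric range check:

```python
import numpy as np

# Stand-in data, an assumption made only to keep the sketch self-contained.
rng = np.random.default_rng(0)
arr1 = rng.uniform(-1, 1, size=(4, 4, 4))
arr2 = rng.normal(loc=-1, scale=2, size=(4, 4, 4))

# -0.5 <= x <= 0.5 is the same as |x| <= 0.5; combine both masks element-wise.
in_range = (np.abs(arr1) <= 0.5) & (np.abs(arr2) <= 0.5)

# Flat indices, comparable to the solutions above.
flat_idx = np.flatnonzero(in_range)

# One row of (i, j, k) per matching element.
multi_idx = np.argwhere(in_range)

# The two views describe the same elements.
rebuilt = np.column_stack(np.unravel_index(flat_idx, in_range.shape))
print(np.array_equal(rebuilt, multi_idx))
```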

In [250]:
# c
# 1
vec = np.ones(100)
mat = np.zeros((10, 100))

# 2
tiled_vec = np.tile(vec, (10, 1))  # repeat the vector 10 times along a new first axis -> shape (10, 100)
ans_tiled = mat + tiled_vec

# 3
ans_simple = mat + vec  # It's called array broadcasting

# 4
mat_transposed = mat.T
print(mat_transposed.shape)
# ans_trans = mat_transposed + vec  # fails: shapes are aligned from the last dimension, and 10 != 100

# 5 - np.newaxis is the answer - very performant
ans_newaxis = mat_transposed + vec[:, np.newaxis]

(100, 10)
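The failure in step 4 and the fix in step 5 can be checked directly. The sketch below is only an illustration; the names `mat_t` and `column` are made up, and `np.broadcast_shapes` requires numpy 1.20 or later:

```python
import numpy as np

vec = np.ones(100)
mat_t = np.zeros((100, 10))

# Shapes are compared from the last dimension backwards:
# (100, 10) against (100,) pairs 10 with 100, which cannot be broadcast.
try:
    mat_t + vec
    failed = False
except ValueError as err:
    failed = True
    print(f"Broadcasting failed: {err}")

# Turning the vector into a (100, 1) column pairs 10 with 1 and 100 with 100.
column = vec.reshape(-1, 1)  # same result as vec[:, np.newaxis]
result = mat_t + column
print(result.shape)

# Ask numpy for the resulting shape without allocating any data.
out_shape = np.broadcast_shapes(mat_t.shape, column.shape)
print(out_shape)
```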
