# <img width=400 src="http://www.numpy.org/_static/numpy_logo.png" alt="Numpy"/>


## Why do we need numpy?

* You may have heard that "Python is slow". This is true when it comes to looping over many small Python objects.
* Python is dynamically typed and everything is an object, even an `int`. There are no primitive types.
* Numpy's main feature is the `ndarray` class, a fixed-length, homogeneously typed array class.
* Numpy implements a lot of functionality in fast C, Cython and Fortran code that works on these arrays.
* Python with vectorized operations using numpy can be blazingly fast.

See: [Python is not C](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Python_Is_Not_C?lang=en)
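
The points about objects and homogeneous typing can be made concrete with a small sketch. It assumes CPython, and the 64-bit integer dtype is chosen explicitly here for illustration; the variable names are made up for this example.

```python
import sys

import numpy as np

# A Python list stores references to full-blown int objects.
values = list(range(1000))
print(type(values[0]), sys.getsizeof(values[0]), "bytes per int object (CPython)")

# An ndarray stores raw, equally sized numbers in one block of memory.
# dtype=np.int64 is set explicitly so the item size is predictable.
array = np.array(values, dtype=np.int64)
print(array.dtype, array.itemsize, "bytes per element")
print(array.nbytes, "bytes for the whole data buffer")
```

Because every element has the same type and size, compiled code can loop over the buffer without inspecting each object.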

But the most important reason:

* More beautiful code

In [None]:
import numpy as np

## More beautiful code through vectorisation

Pure Python with a list comprehension

In [None]:
voltages = [10.1, 15.1, 9.5]
currents = [1.2, 2.4, 5.2]

resistances = [U / I for U, I in zip(voltages, currents)]
resistances

Using numpy

In [None]:
U = np.array([10.1, 15.1, 9.5])
I = np.array([1.2, 2.4, 5.2])

R = U / I
R

### Finding the point with the smallest distance

In [None]:
import math

def euclidean_distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

point = (1, 2)
points = [(3, 2), (4, 2), (3, 0)]

min_distance = float('inf')
for other in points:
    distance = euclidean_distance(point, other)
    if distance < min_distance:
        closest = other
        min_distance = distance 

print(min_distance, closest)

In [None]:
point = np.array([1, 2])
points = np.array([(3, 2), (4, 2), (3, 0)])

distance = np.linalg.norm(point - points, axis=1)
idx = np.argmin(distance)

print(distance[idx], points[idx])

## Small example timings

In [None]:
import math


def var(data):
    '''
    Knuth's presentation of Welford's algorithm for one-pass calculation
    of the variance. Avoids the rounding errors that the naive approach
    `sum(v**2 for v in data) / n - (sum(data) / n)**2`
    suffers from for large numbers.
    '''
    
    n = 0
    mean = 0.0
    m2 = 0.0
    
    if len(data) < 2:
        return float('nan')

    for value in data:
        n += 1
        delta = value - mean
        mean += delta / n
        delta2 = value - mean
        m2 += delta * delta2

    return m2 / n 
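
The rounding problem mentioned in the docstring can be seen with a few numbers that have a small spread on top of a large offset. This is a sketch: `naive_var` and `one_pass_var` are illustrative names, and the data values are made up so that the exact variance is 22.5.

```python
def naive_var(data):
    # textbook formula: mean of the squares minus square of the mean
    n = len(data)
    return sum(v**2 for v in data) / n - (sum(data) / n) ** 2


def one_pass_var(data):
    # same update rule as in var() above
    n = 0
    mean = 0.0
    m2 = 0.0
    for value in data:
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
    return m2 / n


# deviations from the mean are -6, -3, 3, 6, so the variance is 90 / 4 = 22.5
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print(naive_var(data))     # far off: two numbers of order 1e18 are subtracted
print(one_pass_var(data))  # 22.5
```

The naive version subtracts two numbers of order 1e18, where neighbouring doubles are more than 100 apart, so the small variance is lost in the rounding.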

In [None]:
list(range(10))

In [None]:
l = list(range(1000))
a = np.array(l)

In [None]:
%timeit var(l)

In [None]:
%timeit np.var(a)

## Basic math: vectorized

Operations on numpy arrays work vectorized, element-by-element

**Lose your loops**

In [None]:
# create a numpy array from a python list
a = np.array([1.0, 3.5, 7.1, 4, 6])

In [None]:
2 * a

In [None]:
a**2

In [None]:
a**a

In [None]:
np.cos(a)

**Attention: You need the `cos` from numpy!**

In [None]:
math.cos(a)  # raises a TypeError: math.cos only works on scalars

Most normal python functions with basic operators like `*`, `+`, `**` simply work because
of operator overloading:

In [None]:
def poly(x):
    return x + 2 * x**2 - x**3

poly(a)

In [None]:
poly(np.e), poly(np.pi)

## Useful properties

In [None]:
a

In [None]:
len(a)

In [None]:
a.size

In [None]:
a.shape

In [None]:
a.dtype

In [None]:
a.ndim

## Arbitrary dimension arrays

In [None]:
# two-dimensional array
y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

y + y

In [None]:
# since Python 3.5, @ is the matrix product
y @ y

In [None]:
# Broadcasting, changing array dimensions to fit the larger one

y + np.array([1, 2, 3])

In [None]:
# trick for column broadcasting

y = np.ones((3, 5))

(y.T + np.arange(3)).T

In [None]:
np.ones((3,5))

In [None]:
np.ones((3,5)).T

In [None]:
np.arange(3)

Create new axis to enable broadcasting for unmatched shapes

In [None]:
a = np.arange(3)
b = np.arange(4, 7)

a[:, np.newaxis] * b

In [None]:
a[:]

In [None]:
a[:, np.newaxis]
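
The transpose trick for column broadcasting shown earlier can also be written with `np.newaxis`. A minimal sketch, with array names chosen here for illustration:

```python
import numpy as np

y = np.ones((3, 5))
offsets = np.arange(3)

# transpose trick: add along the last axis of y.T, then transpose back
via_transpose = (y.T + offsets).T

# newaxis: turn offsets into a (3, 1) column that broadcasts over the 5 columns
via_newaxis = y + offsets[:, np.newaxis]

print(offsets[:, np.newaxis].shape)  # (3, 1)
print(np.array_equal(via_transpose, via_newaxis))  # True
```

Both give the same result; the `np.newaxis` version states the intended shape directly.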

## Reduction operations

Numpy has many operations that reduce the dimensionality of arrays

In [None]:
x = np.random.normal(0, 1, 1000)

In [None]:
np.sum(x)

In [None]:
np.prod(x)

In [None]:
np.mean(x)

Population Standard Deviation (the default, `ddof=0`)

In [None]:
np.std(x)

Sample Standard Deviation

In [None]:
np.std(x, ddof=1)
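
What `ddof` changes is only the divisor. The following sketch computes both variants by hand on a small made-up data set and compares them with `np.std`:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)
squared_deviations = (x - x.mean()) ** 2

# ddof=0 (default): divide by n
std_population = np.sqrt(squared_deviations.sum() / n)
# ddof=1: divide by n - 1 (Bessel's correction)
std_sample = np.sqrt(squared_deviations.sum() / (n - 1))

print(std_population, np.std(x))
print(std_sample, np.std(x, ddof=1))
```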

Most of the numpy functions are also methods of the array

In [None]:
x.mean(), x.std(), x.max(), x.min()

Difference between neighbor elements

In [None]:
z = np.arange(10)**2
diff_z = np.diff(z)

print(z)
print(diff_z)

### Reductions on multi-dimensional arrays


In [None]:
array2d = np.arange(20).reshape(4, 5)

array2d

In [None]:
np.sum(array2d, axis=0)

In [None]:
np.var(array2d, axis=1)
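
The `axis` argument names the axis that is collapsed by the reduction. A short sketch on the same array, looking at the shapes of the results:

```python
import numpy as np

array2d = np.arange(20).reshape(4, 5)

# axis=0 collapses the 4 rows: one result per column
column_sums = np.sum(array2d, axis=0)
# axis=1 collapses the 5 columns: one result per row
row_sums = np.sum(array2d, axis=1)

print(column_sums.shape, column_sums)  # (5,) [30 34 38 42 46]
print(row_sums.shape, row_sums)        # (4,) [10 35 60 85]

# keepdims=True keeps the reduced axis with length 1, handy for broadcasting
row_means = np.mean(array2d, axis=1, keepdims=True)
print(row_means.shape)  # (4, 1)
```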

## Exercise 1

Write a function that calculates the analytical linear regression for a set of
x and y values.

Reminder:

$$ f(x) = a \cdot x + b$$

with 

$$
\hat{a} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} \\
\hat{b} = \bar{y} - \hat{a} \cdot \bar{x}
$$

In [None]:
def linear_regression(x, y):
    a = np.nan
    b = np.nan
    return a, b

In [None]:
# %load solutions/numpy_linear.py
def linear_regression(x, y):

    cov_matrix = np.cov(x, y)
    a = cov_matrix[0, 1] / cov_matrix[0, 0]
    b = np.mean(y) - a * np.mean(x)

    return a, b


In [None]:
x = np.linspace(0, 1, 50)
y = 5 * np.random.normal(x, 0.1) + 2  # see section on random numbers later

a, b = linear_regression(x, y)
a, b
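
One way to check the result is to compare it with a least-squares fit of a first-degree polynomial. The sketch below repeats the solution so that it runs on its own and uses `np.polyfit`, which is not otherwise part of this notebook:

```python
import numpy as np


def linear_regression(x, y):
    # slope = Cov(x, y) / Var(x), intercept = mean(y) - slope * mean(x)
    cov_matrix = np.cov(x, y)
    a = cov_matrix[0, 1] / cov_matrix[0, 0]
    b = np.mean(y) - a * np.mean(x)
    return a, b


x = np.linspace(0, 1, 50)
y = 5 * np.random.normal(x, 0.1) + 2

a, b = linear_regression(x, y)
# np.polyfit returns the coefficients from highest to lowest degree
a_fit, b_fit = np.polyfit(x, y, deg=1)

print(a, a_fit)
print(b, b_fit)
```

Both approaches minimise the same sum of squared residuals, so the values should agree up to floating point precision.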

## Helpers for creating arrays

In [None]:
np.zeros(10)

In [None]:
np.ones((5, 2))

In [None]:
np.full(5, np.nan)

In [None]:
np.empty(10)  # attention: uninitialised memory, be careful

In [None]:
np.linspace(-2, 1, 1)  # the third argument is the number of points

In [None]:
# like range() for arrays:
np.arange(5)

In [None]:
np.arange(2, 10, 2)

In [None]:
np.logspace(-4, 5, 10)

In [None]:
np.logspace(1, 4, 4, base=2)

## Numpy Indexing

* Element access
* Slicing

![index1d](images/Indexing1D.svg)

In [None]:
x = np.arange(0, 10)

# like lists:
x

In [None]:
# like lists:
x[0]

In [None]:
# all elements with indices ≥1 and <4:
x[1:4]

In [None]:
# negative indices count from the end
x[-1], x[-2]

In [None]:
# combination:
x[3:-2]

In [None]:
# step size
x[::2]

In [None]:
# trick for reversal: negative step
x[::-1]

![index2d](images/Indexing2D.svg)

In [None]:
y = np.array([x, x + 10, x + 20, x + 30])
y

In [None]:
# only one index ⇒ one-dimensional array
y[2]

In [None]:
# other axis: (: alone means the whole axis)
y[:, 3]

In [None]:
# inspecting the number of elements per axis:
y[:, 1:3].shape

## Changing array content

In [None]:
y

In [None]:
y[:, 3] = 0
y

Using slices on both sides

In [None]:
y[:,0] = x[3:7]
y

Transposing inverts the order of the dimensions

In [None]:
y

In [None]:
y.shape

In [None]:
y.T

In [None]:
y.T.shape

## Masks

* A boolean array can be used to select only the elements where it contains `True`.
* This is a very powerful tool to select the elements that fulfill a certain condition.

In [None]:
a = np.linspace(0, 2, 11)
b = np.random.normal(0, 1, 11)

print(b >= 0)
print(a[b >= 0])

In [None]:
a[[0, 2]] = np.nan
a

In [None]:
a[np.isnan(a)] = -1
a
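
Masks can be combined element-wise with `&` (and), `|` (or) and `~` (not). The Python keywords `and`, `or` and `not` do not work element-wise on arrays. A small sketch:

```python
import numpy as np

a = np.linspace(0, 2, 11)

# each comparison needs its own parentheses, because & binds tighter than > and <
inside = (a > 0.5) & (a < 1.5)
outside = ~inside

print(a[inside])
print(np.count_nonzero(inside), np.count_nonzero(outside))  # 5 6
```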

## Random numbers

* Numpy has a large number of distributions built in

10 random numbers between -1 and 1, uniformly distributed

In [None]:
np.random.uniform(-1, 1, 10)

Draw a $2\times 10$ matrix from a normal distribution centered around 0 with a standard deviation of 5

In [None]:
np.random.normal(0, 5, (2, 10))

In [None]:
np.random.poisson(5, 2)
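
For reproducible results, the random number generator can be seeded. The sketch below uses the `Generator` interface created by `np.random.default_rng`, which is an assumption on top of this notebook: it needs NumPy 1.17 or later, and the seed value is arbitrary.

```python
import numpy as np

# two generators with the same seed produce the same stream of numbers
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

sample_a = rng_a.normal(0, 5, size=(2, 10))
sample_b = rng_b.normal(0, 5, size=(2, 10))

print(np.array_equal(sample_a, sample_b))  # True

# the generator offers the same distributions as methods
print(rng_a.uniform(-1, 1, 3))
print(rng_a.poisson(5, 2))
```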

## Calculating pi through monte-carlo simulation

* We draw random points in a square with a side length of 2
* We count the points which are inside the circle of radius 1

The area of the square is

$$
A_\mathrm{square} = a^2 = 4
$$

The area of the circle is
$$
A_\mathrm{circle} = \pi r^2 = \pi
$$

With 
$$
\frac{n_\mathrm{circle}}{n_\mathrm{square}} = \frac{A_\mathrm{circle}}{A_\mathrm{square}}
$$
we can calculate $\pi$:

$$
\pi = 4 \frac{n_\mathrm{circle}}{n_\mathrm{square}}
$$

In [None]:
n_square = 1000000

x = np.random.uniform(-1, 1, n_square)
y = np.random.uniform(-1, 1, n_square)

radius = np.sqrt(x**2 + y**2)

n_circle = np.sum(radius <= 1.0)

print(4 * n_circle / n_square)
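
The estimate comes with a statistical uncertainty. Each point lands inside the circle with probability $p = \pi / 4$, so $n_\mathrm{circle}$ follows a binomial distribution and the standard error of the estimate is $4 \sqrt{p (1 - p) / n_\mathrm{square}}$. A sketch under these assumptions, with an arbitrarily chosen fixed seed and `np.random.default_rng` (NumPy 1.17 or later):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n_square = 1_000_000

x = rng.uniform(-1, 1, n_square)
y = rng.uniform(-1, 1, n_square)

# comparing squared radii avoids the square root
n_circle = np.count_nonzero(x**2 + y**2 <= 1.0)

p = n_circle / n_square
pi_estimate = 4 * p
# binomial standard error of p, scaled by the same factor 4
pi_uncertainty = 4 * np.sqrt(p * (1 - p) / n_square)

print(f"pi = {pi_estimate:.4f} +/- {pi_uncertainty:.4f}")
```

The uncertainty shrinks only with the square root of the number of points, so one more digit costs a hundred times more samples.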

## Exercise

1. Draw 10000 Gaussian random numbers with a mean of $\mu = 2$ and a standard deviation of $\sigma = 3$
2. Calculate the mean and the standard deviation of the sample
3. What percentage of the numbers are outside of $[\mu - \sigma, \mu + \sigma]$?
4. How many of the numbers are $> 0$?
5. Calculate the mean and the standard deviation of all numbers ${} > 0$

In [None]:
# %load solutions/numpy_gaussian.py

## Exercise

Monte-Carlo uncertainty propagation

* The Hubble constant as measured by Planck is
$$
H_0 = (67.74 \pm 0.47)\,\frac{\mathrm{km}}{\mathrm{s}\cdot\mathrm{Mpc}}
$$

* Estimate the mean and the uncertainty of the velocity of a galaxy which is measured to be $(500 \pm 100)\,\mathrm{Mpc}$ away
using Monte Carlo methods

In [None]:
# %load solutions/numpy_hubble.py

## Simple I/O functions

In [None]:
idx = np.arange(100)
x = np.random.normal(0, 1e5, 100)
y = np.random.normal(0, 1, 100)
n = np.random.poisson(20, 100)

In [None]:
idx.shape, x.shape, y.shape, n.shape

In [None]:
np.savetxt(
    'data.txt',
    np.column_stack([idx, x, y, n]),
)

In [None]:
!head data.txt

In [None]:
# Load back the data, unpack=True is needed to read the data columnwise and not row-wise
idx, x, y, n = np.genfromtxt('data.txt', unpack=True)

idx.dtype, x.dtype

Better give it a header, otherwise you won't remember what the hell you saved there

In [None]:
np.savetxt('data.txt', np.column_stack([idx, x, y, n]), header="idx x y n")

In [None]:
# the with block closes the file automatically
with open('data.txt', 'r') as f:
    print(f.read())

### Problems

* Everything is a float
* Way larger file than necessary because of too many digits for floats
* No column names

## Numpy recarrays

* Numpy structured arrays (loosely called recarrays here) can store columns of different types
* Rows are addressed by integer index
* Columns are addressed by strings

Solution for our I/O problem → column names, different types

In [None]:
# for more options on formatting see
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html

np.savetxt(
    'data.csv',
    np.column_stack([idx, x, y, n]),
    delimiter=',', # true csv file
    header=','.join(['idx', 'x', 'y', 'n']),
    fmt=['%d', '%.4g', '%.4g', '%d'],  # One formatter for each column
    comments='',
)

In [None]:
!head data.csv

In [None]:
data = np.genfromtxt(
    'data.csv',
    names=True, # load column names from first row
    dtype=None, # Automagically determines best data type for each column
    delimiter=',',
)

In [None]:
data[:10]

In [None]:
data[0]

In [None]:
data['n']

In [None]:
data.dtype
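
The same loading step can be tried without a file on disk. The sketch below reads a small in-memory CSV (the values are made up for illustration) and shows the field access described above:

```python
import io

import numpy as np

# a small in-memory stand-in for data.csv
csv_text = "idx,x,n\n0,1.5,20\n1,-2.25,17\n2,0.5,23\n"

data = np.genfromtxt(
    io.StringIO(csv_text),
    names=True,   # column names from the first row
    dtype=None,   # guess the type of each column
    delimiter=',',
)

print(data.dtype.names)      # ('idx', 'x', 'n')
print(data['x'].mean())      # columns are addressed by name
print(data[1])               # rows are addressed by integer index
print(data[data['n'] > 18])  # masks work here too
```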

## Linear algebra

Numpy offers a lot of linear algebra functionality, mostly wrapping LAPACK

In [None]:
# symmetric matrix, use eigh
# If not symmetric, use eig
mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4]
])

eig_vals, eig_vecs = np.linalg.eigh(mat)

eig_vals, eig_vecs
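
A quick way to check the result is to insert it into the eigenvalue equation. The eigenvectors are the columns of the returned matrix. A minimal sketch:

```python
import numpy as np

mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4],
])

eig_vals, eig_vecs = np.linalg.eigh(mat)

# column i of eig_vecs belongs to eig_vals[i]: mat @ v_i == lambda_i * v_i
print(np.allclose(mat @ eig_vecs, eig_vecs * eig_vals))  # True

# for a symmetric matrix the eigenvectors are orthonormal
print(np.allclose(eig_vecs.T @ eig_vecs, np.eye(3)))  # True

# eigh returns the eigenvalues in ascending order
print(eig_vals)
```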

In [None]:
np.linalg.inv(mat)

Singular value decomposition:

In [None]:
np.linalg.svd(mat)
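
`np.linalg.svd` returns three arrays: the matrix U, the singular values as a one-dimensional array, and the conjugate transpose of V. A short sketch that multiplies them back together:

```python
import numpy as np

mat = np.array([
    [4, 2, 0],
    [2, 1, -3],
    [0, -3, 4],
])

u, s, vh = np.linalg.svd(mat)

# the singular values come as a 1D array, np.diag turns them into a matrix
reconstructed = u @ np.diag(s) @ vh

print(np.allclose(reconstructed, mat))  # True
print(s)  # non-negative, sorted in descending order
```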

... and many many more, read the docs!

## Mathematical constants

In [None]:
np.pi

In [None]:
np.e

In [None]:
np.euler_gamma

In [None]:
np.inf

Complex numbers

In [None]:
1j  # Python's notation for the imaginary unit, 1*i

In [None]:
1 + 1j

In [None]:
abs(1+1j)

In [None]:
x = [i+1j for i in range(10)]

In [None]:
abs(x)  # raises a TypeError: abs() is not defined for lists

Use either `np.abs()` or create a numpy array instead of a `list`

In [None]:
np.abs(x)

In [None]:
abs(np.array([i + 1j for i in range(10)]))

In [None]:
np.exp(1+1j)

In [None]:
np.sin(1+1j)

In [None]:
np.sqrt(1j)

In [None]:
np.sqrt(-1)  # assumes real floats by default, so the result is nan (with a warning)

In [None]:
np.sqrt(-1 +0j)

In [None]:
import cmath

cmath.sqrt(-1)  # but doesn't work for lists
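
For whole arrays of complex numbers, numpy provides the usual parts and functions element-wise. A small sketch with made-up values:

```python
import numpy as np

z = np.array([1 + 1j, 2 - 1j, 3j])

print(z.dtype)
print(z.real, z.imag)         # real and imaginary parts as float arrays
print(np.abs(z))              # magnitudes
print(np.angle(z, deg=True))  # phases in degrees
print(np.conj(z))             # complex conjugates

# with a complex dtype, the square root of negative numbers works
negative = np.array([-1, -4], dtype=complex)
print(np.sqrt(negative))
```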