In [None]:
%matplotlib inline

# Introduction to NumPy



## Why use NumPy?

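NumPy brings fast arrays and fast numerical processing to Python.

Pure Python is slow for numerical work: a loop handles one element at a time in the interpreter, whereas NumPy runs the same operation over a whole array in compiled code.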

In [1]:
import math as m

In [2]:
m.log10(100)

2.0

## How do I import NumPy?

In [None]:
#!pip install numpy

In [3]:
# standard alias
import numpy as np
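With `np` available, the earlier `math.log10` call can be compared with its NumPy counterpart. This is a minimal sketch; the `values` list is made up for illustration.

```python
import math
import numpy as np

values = [1, 10, 100, 1000]  # illustrative data

# Plain Python: math.log10 takes one number, so we loop over the list
logs_loop = [math.log10(v) for v in values]

# NumPy: np.log10 is applied to every element in a single call
logs_np = np.log10(np.array(values))

print(logs_loop)
print(logs_np)
```

Both approaches give the same numbers; the NumPy version avoids the explicit loop.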

## ndarray?

**n-dimensional array**
### n
* `n` is an arbitrary number: NumPy can handle datasets of any shape, with any number of dimensions.

### dimensional
* e.g. 1-dimensional array == column/vector
* e.g. 2-dimensional array == table/matrix
* e.g. 3-dimensional array == many tables == tensor

### array
* fast data structure
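The number of dimensions can be inspected with the `.ndim` and `.shape` attributes. A small sketch, using made-up arrays named `vector`, `table` and `stack`:

```python
import numpy as np

vector = np.array([1, 2, 3])                  # 1 dimension
table = np.array([[1, 2, 3], [4, 5, 6]])      # 2 dimensions: 2 rows, 3 columns
stack = np.array([[[1, 2], [3, 4]],
                  [[5, 6], [7, 8]]])          # 3 dimensions: 2 tables of 2x2

for arr in (vector, table, stack):
    print(arr.ndim, arr.shape)
```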

## How do you create a NumPy array?

You could start with a list, and then convert it:

In [4]:
x_age = [18, 22, 33, 41]

# x is a NumPy array; computations on it are much faster than on the list
x = np.array(x_age)

In [5]:
print(x_age)

[18, 22, 33, 41]


In [6]:
print(x)

[18 22 33 41]


In [7]:
x

array([18, 22, 33, 41])

In [9]:
x.mean()

28.5

You can also create NumPy arrays using specific utilities...

In [11]:
print(np.arange(0, 10, 0.2)) # a range of numbers from 0 up to (not including) 10, in steps of 0.2

[0.  0.2 0.4 0.6 0.8 1.  1.2 1.4 1.6 1.8 2.  2.2 2.4 2.6 2.8 3.  3.2 3.4
 3.6 3.8 4.  4.2 4.4 4.6 4.8 5.  5.2 5.4 5.6 5.8 6.  6.2 6.4 6.6 6.8 7.
 7.2 7.4 7.6 7.8 8.  8.2 8.4 8.6 8.8 9.  9.2 9.4 9.6 9.8]


In [12]:
np.repeat([0, 1], 5) # repeat each element of [0, 1] five times

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

In [14]:
np.random.choice([1, 2, 3, 4, 5, 6], 10) # 10 rolls of a die

array([4, 2, 2, 5, 1, 1, 6, 6, 3, 6])

In [15]:
np.random.choice(["h", "t"], p=[0.8, 0.2], size=10)

array(['h', 'h', 'h', 'h', 'h', 'h', 't', 'h', 'h', 'h'], dtype='<U1')

...here we are using the `random` module within NumPy to simulate experimental data (i.e., drawing a number out of the set 1 to 6, 10 times, or flipping a biased coin 10 times).
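Random draws change on every run. If repeatable output is wanted, one option is a seeded generator from `np.random.default_rng`. The sketch below assumes an arbitrary seed of `42`; the names `rng`, `rolls` and `flips` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

rolls = rng.choice([1, 2, 3, 4, 5, 6], size=10)          # 10 rolls of a die
flips = rng.choice(["h", "t"], p=[0.8, 0.2], size=10)    # 10 flips of a biased coin

# A second generator with the same seed reproduces the same rolls
rng_again = np.random.default_rng(seed=42)
rolls_again = rng_again.choice([1, 2, 3, 4, 5, 6], size=10)

print(rolls)
print(flips)
print((rolls == rolls_again).all())
```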

In [None]:
type(x)

## How do you compute with arrays?

Suppose we generate an array which represents the ages of ten people (mean $30$, standard deviation $5$)

In [16]:
x_age = np.random.normal(30, 5, 10)

x_age

array([27.28037405, 33.54172964, 30.45697115, 29.54637611, 27.00654053,
       28.7486022 , 37.53497642, 37.23138295, 29.47369223, 31.62070602])

Suppose we need to compute $3x_{age} + 1$, then we write:

$y = 3 \times x_{age} + 1$

In [17]:
y = 3 * x_age + 1

y

array([ 82.84112216, 101.62518892,  92.37091344,  89.63912832,
        82.0196216 ,  87.24580661, 113.60492925, 112.69414886,
        89.4210767 ,  95.86211806])

Notice that `3*` is run on every element, as is `+1`. 

This is called **vectorization**. 

Aside: note that this doesn't work with lists:

In [19]:
x = [1, 2, 3]
#x = np.array([1, 2, 3])

x + 1

TypeError: can only concatenate list (not "int") to list
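To get the same result from a plain list, each element has to be visited explicitly, for example with a list comprehension. A short sketch with illustrative names (`x_list`, `y_list`, `y_array`):

```python
import numpy as np

x_list = [1, 2, 3]

# Plain Python: apply the formula to each element in turn
y_list = [3 * value + 1 for value in x_list]

# NumPy: the same formula, applied to the whole array at once
y_array = 3 * np.array(x_list) + 1

# Note that * on a list means repetition, not multiplication
doubled_list = x_list * 2

print(y_list)
print(y_array)
print(doubled_list)
```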

## What is a Sequence?

In [20]:
print(x_age)

[27.28037405 33.54172964 30.45697115 29.54637611 27.00654053 28.7486022
 37.53497642 37.23138295 29.47369223 31.62070602]


The shape of this array defines how it is structured for calculations, $(10,)$ -- a sequence of 10 elements...

In [21]:
x_age.shape

(10,)

In [23]:
x_age.size

10

In [24]:
len(x_age)

10

In [25]:
x_age[7]

37.231382952490705

### How do I index a sequence?

Just the same as Python lists...

In [26]:
x_age[0]

27.280374054673114

In [27]:
x_age[0:2]

array([27.28037405, 33.54172964])

In [28]:
x_age[-1]

31.620706019507857
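Slices with negative bounds and steps also behave as they do for lists. A small sketch on a made-up `ages` array:

```python
import numpy as np

ages = np.array([18, 22, 33, 41, 57])  # illustrative data

last_two = ages[-2:]      # the last two elements
every_other = ages[::2]   # every second element, starting at index 0
middle = ages[1:4]        # indexes 1, 2 and 3 (the stop index is excluded)

print(last_two, every_other, middle)
```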

### What is a Vector?
A vector is a matrix of one column.

In [29]:
x_profit = np.array(
    [[10],[11],[12]]
)

print(x_profit)

[[10]
 [11]
 [12]]


In [30]:
x_profit.shape

(3, 1)

In [31]:
x_profit[0, 0]

10

In [32]:
x_profit[1, 0]

11

In [33]:
x_profit[:, 0]

array([10, 11, 12])

In [34]:
x_profit.size

3

In [35]:
print(x_profit.reshape(1, 3))

[[10 11 12]]


In [36]:
x_profit.T

array([[10, 11, 12]])

If we don't like the dimensions, we can reshape. Passing `-1` asks NumPy to infer that dimension from the size of the array:

In [39]:
x_profit.reshape(1, -1)

array([[10, 11, 12]])
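The `-1` placeholder can appear in either position, or on its own to flatten the array. A brief sketch reusing the `x_profit` values from above (the names `row`, `flat` and `column` are illustrative):

```python
import numpy as np

x_profit = np.array([[10], [11], [12]])   # shape (3, 1)

row = x_profit.reshape(1, -1)     # one row, as many columns as needed
flat = x_profit.reshape(-1)       # a flat sequence
column = flat.reshape(-1, 1)      # back to a single column

print(row.shape, flat.shape, column.shape)
```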

## What is a Matrix?

A table of numbers...

In [52]:
M = np.array(
    [(1000, 12, +1), #eg., Loan, Duration, Settle
    (2000, 9,   -1), #eg., Loan, Duration, Settle  
    (3000, 6,   -1), #eg., Loan, Duration, Settle  
    ]
)

In [41]:
print(M)

[[1000   12    1]
 [2000    9   -1]
 [3000    6   -1]]


### How do I index a matrix?

`M[row-index, col-index]`

Note, both indexes work like list indexes -- except now there are two. 

In [42]:
M[0, 2]

1

In [43]:
M[0, 0] # first row, first column

1000

In [44]:
M[1, 0] # second row, first column

2000

In [45]:
M[0:2, -1] # first two rows, last column

array([ 1, -1])

In [53]:
M.shape

(3, 3)

In [54]:
M.size

9

### What is a Tensor?


In [46]:
M = np.array([
    (1000, 12, +1), #eg., Loan, Duration, Settle
    (2000, 9, -1), #eg., Loan, Duration, Settle  
    (3000, 6, -1), #eg., Loan, Duration, Settle  
    (4000, 3, +1),
])

In [47]:
M.shape

(4, 3)

In [48]:
M.size

12

In [49]:
M.reshape(3,4)

array([[1000,   12,    1, 2000],
       [   9,   -1, 3000,    6],
       [  -1, 4000,    3,    1]])

In [None]:
M.T

In [None]:
M.T.shape
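`reshape` and `.T` can produce the same shape while arranging the values differently: `reshape` refills the new shape in reading order, whereas `.T` swaps rows and columns. A sketch using the 4x3 `M` from above:

```python
import numpy as np

M = np.array([
    (1000, 12, +1),
    (2000, 9, -1),
    (3000, 6, -1),
    (4000, 3, +1),
])

reshaped = M.reshape(3, 4)   # values re-read left to right, top to bottom
transposed = M.T             # rows become columns

print(reshaped[0])     # first four values in reading order
print(transposed[0])   # the first column of M (the loan amounts)
```

Both results have shape `(3, 4)`, but only the transpose keeps each original column together.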

Make a 3-dimensional tensor (3 sheets, each 2x2)

In [50]:
#[sheet, row, column]

tensor = M.reshape(3, 2, 2)

print(tensor)

[[[1000   12]
  [   1 2000]]

 [[   9   -1]
  [3000    6]]

 [[  -1 4000]
  [   3    1]]]


In [None]:
tensor[1,0,0]
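Indexing a tensor follows the `[sheet, row, column]` order noted above. A short sketch that rebuilds the same `tensor` and picks out a single value, a whole sheet, and one position across all sheets:

```python
import numpy as np

M = np.array([
    (1000, 12, +1),
    (2000, 9, -1),
    (3000, 6, -1),
    (4000, 3, +1),
])
tensor = M.reshape(3, 2, 2)   # [sheet, row, column]

single = tensor[1, 0, 0]            # sheet 1, row 0, column 0
sheet = tensor[1]                   # the whole of sheet 1
top_left_each = tensor[:, 0, 0]     # row 0, column 0 of every sheet

print(single)
print(sheet)
print(top_left_each)
```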

Machine learning libraries expect our features ($X$) to be formatted as a matrix.

Each row of the feature matrix $X$ *must* be one complete observation. This is assumed in how these libraries process data. 
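One way to build such a matrix from separate feature arrays is `np.column_stack`, which places each array in its own column. This is a sketch only; the `x_budget` values are invented for illustration.

```python
import numpy as np

x_age = np.array([18, 22, 33, 41])
x_budget = np.array([10, 15, 20, 12])   # hypothetical values

# Each input array becomes a column, so each row is one observation
X = np.column_stack((x_age, x_budget))

print(X.shape)
print(X[0])   # the first observation: (age, budget)
```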

## How do I select multiple elements?

In [55]:
M

array([[1000,   12,    1],
       [2000,    9,   -1],
       [3000,    6,   -1]])

`:2` means from index `0` up to, but not including, `2`

In [56]:
M[:2, 0] # 0:2   :2

array([1000, 2000])

`:` - from the beginning to the end

In [57]:
M[:, 0]

array([1000, 2000, 3000])

NB. you can just read `:` as "all".

So, `M[:, 0]` means `M[all rows, first column]`

In [58]:
M[ [0, 2], :] # choose rows indexed [0, 2] and all columns

array([[1000,   12,    1],
       [3000,    6,   -1]])

Remember:  `label[index]`  <- always means FIND `index` in `label`

Remember: `[data,]` <- always means `list`


## How do I select elements by a condition?

Comparisons are also *vectorized*, meaning they run across every element:

In [59]:
x_age

array([27.28037405, 33.54172964, 30.45697115, 29.54637611, 27.00654053,
       28.7486022 , 37.53497642, 37.23138295, 29.47369223, 31.62070602])

In [60]:
x_age < 30

array([ True, False, False,  True,  True,  True, False, False,  True,
       False])

`np.where` tells you the indices of the `True` values...

In [61]:
np.where(x_age < 30)

(array([0, 3, 4, 5, 8], dtype=int64),)

In [62]:
x_age[ np.where(x_age < 30)  ]  # here I select the elements which match this condition

array([27.28037405, 29.54637611, 27.00654053, 28.7486022 , 29.47369223])

In [63]:
x_age[x_age < 30]  # SELECT age FROM x_age WHERE age < 30

array([27.28037405, 29.54637611, 27.00654053, 28.7486022 , 29.47369223])
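Because `True` counts as `1` and `False` as `0`, a boolean mask can also be summed or averaged to count matches or get their proportion. A sketch on a small made-up `ages` array:

```python
import numpy as np

ages = np.array([27.3, 33.5, 30.5, 29.5, 27.0])  # illustrative data

under_30 = ages < 30              # boolean mask, one entry per element

n_under_30 = under_30.sum()       # how many elements match
share_under_30 = under_30.mean()  # what fraction of elements match

print(n_under_30, share_under_30)
```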

Aside: to do this in plain Python, we would use a loop and a condition:

NOTE: far far slower... 

In [64]:
keep = []
for age in x_age:
    if age < 30:
        keep.append(age)
keep

[27.280374054673114,
 29.546376105772154,
 27.006540533832943,
 28.748602202825428,
 29.473692233650475]

## How do I combine conditions?

Recall, in python:

In [None]:
age = 18
email = "michael.burgess@qa.com"

(age <= 20) and ("@" in email)

The problem with using `and` (`or`, `not`, etc.) with NumPy is that they only work on *single* truth values, not on arrays of comparisons.

In [66]:
temp = np.array([19, 21, 23]) # eg., temp of a room 
hours = np.array([0, 0.5, 1]) # eg., duration of heating

To combine comparisons across an array we must use *vectorized* operators (i.e., ones which work with arrays).

* `&` and 
* `|` or
* `~` not

In [67]:
(temp > 20) & (hours < 0.75)

array([False,  True, False])

In [69]:
(temp > 20) | (hours >= 0.75)

array([False,  True,  True])

In [70]:
~((temp > 20) & (hours < 0.75))

array([ True, False,  True])
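For contrast, here is a sketch of what happens when `and` is applied to the same arrays: Python asks each array for a single truth value, and NumPy raises a `ValueError`. The `message` variable is only there to capture the error text.

```python
import numpy as np

temp = np.array([19, 21, 23])
hours = np.array([0, 0.5, 1])

message = ""
try:
    # `and` needs one True/False per side, but each side is a whole array
    result = (temp > 20) and (hours < 0.75)
except ValueError as error:
    message = str(error)

print(message)

# The vectorized operator works element by element
both = (temp > 20) & (hours < 0.75)

# np.logical_and is the function form of the same operation
both_fn = np.logical_and(temp > 20, hours < 0.75)

print(both)
```

The parentheses around each comparison matter, because `&` binds more tightly than `>` and `<`.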

## How do you simulate real-valued data?

10 random values, whose mean will be approximately $30$, and which will deviate from $30$, on average, by about $5$ (the standard deviation)...

In [None]:
np.random.normal(30, 5, 10)
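With only 10 draws the sample mean and standard deviation can land some way from 30 and 5. A sketch with a larger sample shows them settling near the requested values; it assumes a seeded generator (`np.random.default_rng`, seed chosen arbitrarily) so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed, for repeatable output

# loc is the mean, scale is the standard deviation
sample = rng.normal(loc=30, scale=5, size=10_000)

print(sample.mean())   # close to 30
print(sample.std())    # close to 5
```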

## How do you simulate categorical data?

Categorical data is represented as *labels* (eg., die faces, cards, answers to questions, locations)....

In [71]:
x_like_film = np.random.choice(["YES", "NO"], 10)
x_like_film

array(['NO', 'NO', 'YES', 'NO', 'NO', 'NO', 'YES', 'NO', 'YES', 'YES'],
      dtype='<U3')

In numpy, `random.choice` is the easiest way to simulate a categorical variable (eg., `x_like_film`). 

A categorical variable *IS NOT* numerical in the ordinary sense, so if we wish to compute statistics on it, we typically convert it to a frequency distribution (i.e., we count the entries).

In [None]:
categories, counts = np.unique(x_like_film, return_counts=True)

print(categories)
print(counts)

The rate of "NO" (i.e., $P(x=\text{NO})$) is the first count divided by the total, since `np.unique` returns the categories in sorted order:

In [None]:
counts[0] / sum(counts)
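Dividing all the counts by their total gives the whole frequency distribution at once. The sketch below reuses the ten answers printed earlier for `x_like_film`; the `rates` dictionary is an illustrative convenience.

```python
import numpy as np

answers = np.array(['NO', 'NO', 'YES', 'NO', 'NO', 'NO', 'YES', 'NO', 'YES', 'YES'])

categories, counts = np.unique(answers, return_counts=True)

# Vectorized division: every count divided by the total
proportions = counts / counts.sum()

# Pair each label with its proportion for easier reading
rates = dict(zip(categories.tolist(), proportions.tolist()))

print(rates)
```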

## Exercise (10 min)

Consider the matrix below...

Import numpy

In [None]:
import numpy as np

In [None]:
np.column_stack(([1, 2, 3], [2, 4, 5]))

In [None]:
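import pandas as pd

# assumes X is the 4x3 array defined further down in this exercise
df = pd.DataFrame(X, columns=["Temp", "Power", "Window"])

# temp and power for the rows where the window is open
df.loc[df["Window"] == 1, ["Temp", "Power"]]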

$f(X; W, b) = W_0X_0 + W_1X_1 + \dots + b$

In [72]:
X = np.array([
    [21, 1_000, False],  # temp, power, window_open
    [19, 1_000, False],
    [24, 3_000, False],
    [26, 3_000, True],
])

X[   X[:, 0] > 20 ,  -1]   # the last column of the rows
                           # where the first column is > 20

array([0, 0, 1])

In [73]:
X

array([[  21, 1000,    0],
       [  19, 1000,    0],
       [  24, 3000,    0],
       [  26, 3000,    1]])

In [None]:
X.shape

In [None]:
a = np.array([1, 2, 3, 4, 5])
a.shape

$P(Window=Open | Temp > 21) = P(X_2=1 | X_0 > 21) = $ 

In [None]:
X[X[:, 0] > 21, 2].mean()

$P(Window = Open, Temp > 21) = P(X_2 = 1, X_0 > 21) = $

In [None]:
((X[:, 0] > 21) & (X[:, 2] == 1)).mean()
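The conditional and joint probabilities are linked by $P(A \mid B) = P(A, B) / P(B)$. A sketch that checks this on the same `X`, using illustrative mask names `hot` and `window_open`:

```python
import numpy as np

X = np.array([
    [21, 1_000, False],  # temp, power, window_open
    [19, 1_000, False],
    [24, 3_000, False],
    [26, 3_000, True],
])

hot = X[:, 0] > 21            # B: temperature above 21
window_open = X[:, 2] == 1    # A: window is open

p_hot = hot.mean()                       # P(B)
p_joint = (hot & window_open).mean()     # P(A, B)
p_conditional = X[hot, 2].mean()         # P(A | B), computed directly

print(p_hot, p_joint, p_conditional)
print(p_joint / p_hot)                   # matches the conditional probability
```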

## Exercise 1: Select Values
* the temperature column
    * HINT: all rows of column 0
* the power column
    * HINT: all rows of column 1
* the last column
    * HINT: all rows of column -1
* the first observation row
    * HINT: row 0 of all columns
* the last observation row
    * HINT: row -1 of all columns
* the temp and power of the first two observations
    * HINT: the first two rows of the first two columns
    * HINT: `0` until `2`
* the temp and power when the window is open
    * HINT: we want the first two columns with a *row* condition (ie., mask, test, ..)
    * HINT: the condition is that the third column `X[:, 2]` is `True`
* the power when it is closed
    * HINT: as above, condition is that third column is `False`
    

Aside: note that a NumPy array uses a single type for the entire data structure. Here it has chosen an integer type, so `False` and `True` are stored as `0` and `1`.

In [None]:
X.dtype
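NumPy picks the most general type needed to hold every value. A sketch with three small made-up arrays shows the effect:

```python
import numpy as np

mixed_int = np.array([21, 1_000, False])   # booleans are upcast to integers
mixed_float = np.array([21, 0.5, False])   # one float makes everything a float
mixed_text = np.array([21, "a"])           # one string makes everything a string

print(mixed_int.dtype)     # an integer dtype (exact width depends on the platform)
print(mixed_float.dtype)
print(mixed_text.dtype)
```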

## Exercise 2 (EXTRA)

You are hired by a cinema to make film recommendations to customers as they speak to your front desk staff.

Your staff may observe: their age, budget, like_action, like_comedy. 

Note, $x : (age, budget, action, comedy) = (18, 10, +1, -1)$

Let's simulate some data:

$x_{age} \sim N(\mu=35, \sigma=5) \in \mathbb{R}^{25}$
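One possible way to set up the simulation is sketched below. Only the age distribution comes from the formula above; the budget range and the equal odds for liking action or comedy are assumptions made up for illustration, as is the seed.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed, for repeatable output
n_customers = 25

# From the formula above: ages ~ N(35, 5), 25 values
x_age = rng.normal(loc=35, scale=5, size=n_customers)

# ASSUMPTION: budgets spread evenly between 5 and 20
x_budget = rng.uniform(low=5, high=20, size=n_customers)

# ASSUMPTION: likes (+1) and dislikes (-1) are equally probable
x_action = rng.choice([+1, -1], size=n_customers)
x_comedy = rng.choice([+1, -1], size=n_customers)

print(x_age[:3], x_budget[:3], x_action[:3], x_comedy[:3])
```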

### Q1. Import and Compute
* import the numpy library
    * recall, use `np`
    
* you are given a regression and classification formula
* use numpy to compute the $y$ predictions for each person
* a formula to compute likely spend on concessions (food counter)
    * $y = f(x_{age}, x_{budget}, x_{action}, x_{comedy}) = 0.1x_{age} + 0.1x_{budget} + x_{action} - x_{comedy}$

* what is the expected (ie., average) spend for these customers?
    * HINT: `.mean()`
    
#### EXTRA
* a formula to compute whether they will like the blockbuster currently showing
    * $y = f(x_{age}, x_{budget}, x_{action}, x_{comedy}) = (x_{age} < 18) \text{ or } (x_{budget} > 10) \text{ and } x_{action}$
* HINT:
    * `(age < 18) | (budget > 10) & (action == 1)`
    
* what is $P(y=LikeFilm)$ ?
    * HINT: `.mean()`

### Q2. Select Elements & Describe

* Produce a report of the simulated data
    * show `.mean()` of all x
    * show `.std()` of all x
    * `.min()`, `.max()`
    
* Show sample observations
    * first, last
    * first two, last two
    * extra: the median
    
* EXTRA: Show the budget of people who are adults
    * HINT: `x_budget[ x_age ... ]`
    
    * and other conditions of interest...

In [None]:
a = np.array([1,2,3]).reshape(-1,1)

In [None]:
X @ a

In [None]:
X.dot(a)
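`X @ a` and `X.dot(a)` compute the same matrix product: each row of `X` is multiplied element by element with the weights in `a` and summed. A sketch that checks this against the vectorized weighted sum, using the `X` from the exercise:

```python
import numpy as np

X = np.array([
    [21, 1_000, False],  # temp, power, window_open
    [19, 1_000, False],
    [24, 3_000, False],
    [26, 3_000, True],
])
a = np.array([1, 2, 3]).reshape(-1, 1)   # one weight per column of X, shape (3, 1)

product = X @ a                 # shape (4, 1): one value per observation
product_dot = X.dot(a)          # the method form of the same product

# The same calculation written out column by column
manual = 1 * X[:, 0] + 2 * X[:, 1] + 3 * X[:, 2]

print(product)
```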