# <img style="float: left; padding-right: 100px; width: 300px" src="images/logo.png">AI4SG Bootcamp:


## Python Numerical Stack
**Authors:** Faustine


---

## 2 Numpy Basics

NumPy is the fundamental package for scientific computing with Python. It provides high-performance
vector, matrix, and higher-dimensional data
structures and offers MATLAB-like capabilities within Python.

It contains among other things:

* a powerful N-dimensional array/vector/matrix object
* sophisticated (broadcasting) functions
* functions implemented in C/Fortran, which gives good performance when code is vectorized
* tools for integrating C/C++ and Fortran code
* useful linear algebra, Fourier transform, and random number capabilities

Working with whole arrays rather than individual elements is also known as *array-oriented computing*. The recommended convention for importing numpy is:

In [None]:
import numpy as np
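As a small illustration of the broadcasting mentioned in the list above, the sketch below multiplies a 2D array by a 1D array of per-column factors. The names `features` and `scale` and their values are made up for this example.

```python
import numpy as np

# Hypothetical data: 3 clients (rows) x 2 features (columns).
features = np.array([[80.0, 1.2],
                     [60.0, 2.5],
                     [18.0, 1.0]])

# One scaling factor per column, shape (2,).
scale = np.array([0.5, 10.0])

# Broadcasting stretches `scale` across every row: (3, 2) * (2,) -> (3, 2).
scaled = features * scale
print(scaled)
print(scaled.shape)
```

No explicit loop over rows is needed; NumPy matches the trailing dimensions of the two arrays.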

### 2.1 Creating numpy arrays

There are a number of ways to initialize new numpy arrays, for example:

* from a Python list or tuple, or
* using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, `empty`, and `zeros`.

In [None]:
client_age = [80, 60,18, 32]
a = np.array(client_age)
a

In [None]:
m = np.array([[80,1.2],[60,2.5], [18,1], [32,0.5]])
m

For larger arrays it is impractical to initialize the data manually using explicit Python lists.
Instead, we can use one of the many functions in numpy that generate arrays of different forms.

Some of the more common are:

* np.arange;
* np.linspace;
* np.logspace;
* np.diag;
* np.zeros;
* np.ones;
* np.empty;


In [None]:
np.arange(0,10)

In [None]:
np.linspace(-1,1,20)

In [None]:
np.zeros((2,2))

In [None]:
np.ones((3,2))

In [None]:
np.empty((2,3))
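The cells above do not show `np.logspace` or `np.diag`, so here is a minimal sketch of both, plus a reminder that `np.empty` only allocates memory. The variable names are illustrative.

```python
import numpy as np

# 4 points spaced evenly on a log scale, from 10**0 to 10**3.
powers = np.logspace(0, 3, 4)
print(powers)

# A square matrix with the given values on its diagonal and zeros elsewhere.
d = np.diag([1, 2, 3])
print(d)

# np.empty does not initialize its contents: the values are arbitrary
# until we assign to them.
buf = np.empty((2, 3))
buf[:] = 7.0
print(buf)
```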

In [None]:
m

In [None]:
m.shape

In [None]:
a.shape

In [None]:
a = a.reshape(4, 1)
print(a.shape)  # a bare `a.shape` in the middle of a cell would not be displayed
a

<div class="alert alert-success">
    <b>Activity 1</b>: Create a vector with values ranging from 10 to 49 with steps of 1
</div>

#### Random numbers and seeds

In [None]:
# uniform random numbers in [0, 1)
np.random.rand(5,5)

In [None]:
# standard normal distributed random numbers
np.random.randn(5,5)

#### Random seed

Setting a seed makes the random number generator produce the same sequence on every run, which is what we want for repeatable (reproducible) results.

In [None]:
np.random.seed(77)
x=np.random.rand(8,2)
print(x)
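A quick way to see what the seed does is to reset it and draw again. This sketch reuses the seed value 77 from the cell above; the names `first`, `second`, and `third` are just for illustration.

```python
import numpy as np

np.random.seed(77)
first = np.random.rand(3)

# Resetting to the same seed restarts the same sequence.
np.random.seed(77)
second = np.random.rand(3)

# Without resetting, the generator simply continues the sequence.
third = np.random.rand(3)

print(np.array_equal(first, second))
print(np.array_equal(second, third))
```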

### Shape, size, dimension and dtype

In [None]:
print(x)

In [None]:
x.shape

In [None]:
x.size

In [None]:
x.ndim

In [None]:
x.dtype
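To make the four attributes concrete, here is a small sketch on an array with the same shape as `x` above. The array `grid` is a stand-in filled with zeros so that the results are deterministic.

```python
import numpy as np

grid = np.zeros((8, 2))

print(grid.shape)  # length along each axis: (8, 2)
print(grid.size)   # total number of elements: 16
print(grid.ndim)   # number of axes: 2
print(grid.dtype)  # element type: float64
```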

####  Shape Manipulation
The shape of an array can be changed with various commands:

In [None]:
x = np.random.rand(20)
print(x)

In [None]:
x.shape

In [None]:
x_new=x.reshape(-1,1)

In [None]:
x_new.shape

In [None]:
x = np.random.rand(10, 2)
print(x)

In [None]:
x.flatten()
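Two details of the cells above are worth spelling out: a `-1` in `reshape` asks NumPy to infer that dimension from the array's size, and `flatten` returns a copy. A minimal sketch with illustrative names:

```python
import numpy as np

v = np.arange(6)

# -1 means "work this dimension out from the total size".
col = v.reshape(-1, 1)    # 6 elements, 1 column -> shape (6, 1)
pairs = v.reshape(-1, 2)  # 6 elements, 2 columns -> shape (3, 2)
print(col.shape, pairs.shape)

# flatten() returns a new 1D array, so editing it leaves `v` untouched.
flat = pairs.flatten()
flat[0] = 99
print(v[0], flat[0])
```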

#### vstack and hstack

In [None]:
x = np.ones((5, 2))
print(x)

In [None]:
y = np.zeros((5, 2))
print(y)

In [None]:
z = np.hstack((x,y))
print(z)

In [None]:
z = np.vstack((x,y))
print(z)

In [None]:
client_price = np.array([1,0.1,4,2])

In [None]:
client_price.reshape(-1,1).shape

In [None]:
# join the two column vectors side by side (axis=1 means along columns)
np.concatenate((a.reshape(-1, 1), client_price.reshape(-1, 1)), axis=1)
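The relationship between `np.concatenate`, `np.hstack`, and `np.vstack` can be sketched as follows. The arrays `ages` and `prices` repeat the client values used above so that the example is self-contained.

```python
import numpy as np

ages = np.array([80, 60, 18, 32]).reshape(-1, 1)
prices = np.array([1, 0.1, 4, 2]).reshape(-1, 1)

# axis=1 joins along columns: two (4, 1) arrays become one (4, 2) array.
side_by_side = np.concatenate((ages, prices), axis=1)

# axis=0 joins along rows: two (4, 1) arrays become one (8, 1) array.
stacked = np.concatenate((ages, prices), axis=0)

print(side_by_side.shape, stacked.shape)

# For 2D inputs, hstack and vstack give the same results.
print(np.array_equal(side_by_side, np.hstack((ages, prices))))
print(np.array_equal(stacked, np.vstack((ages, prices))))
```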

### Indexing and slicing

In [None]:
data = np.random.randint(25,37, size=10)
print(data)

In [None]:
data[7:]

In [None]:
#print the first sensor data
print(data[0])

In [None]:
# print data from index 3 up to (but not including) index 7
print(data[3:7])

In [None]:
# print the last three values
print(data[7:])

In [None]:
# We can also use negative index
print(data[-1])

A multidimensional array behaves like a dataframe or matrix (i.e. it has rows and columns). Consider the following 2D array.

In [None]:
data = np.random.randint(25,37, size=(10,3))
print(data)

In [None]:
data[:,2]

In [None]:
data[2:5,[0,2]]

In [None]:
mask = data>30
data[mask]

In [None]:
np.sqrt(data)

In [None]:
# View the first column of the array
data[:,0]

In [None]:
# View the first row of the array
data[0,]

In [None]:
# View the first two rows
data[:2,]

In [None]:
# View the first element (row 0, column 0)
data[0,0]

#### Fancy indexing

In [None]:
## view all data that is less than 30
mask = data<30
data[mask]

In [None]:
if (data > 30).any():
    print("at least one element in data is larger than 30")
else:
    print("no element in data is larger than 30")
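Since the cells above use random data, here is a deterministic sketch of the same ideas. It also shows indexing with a list of integers next to indexing with a boolean mask. The array `readings` and its values are invented for this example.

```python
import numpy as np

readings = np.array([[25, 31, 36],
                     [29, 30, 27],
                     [33, 26, 35]])

# A comparison produces a boolean array of the same shape.
mask = readings > 30

# Indexing with the mask returns the selected values as a 1D array.
selected = readings[mask]
print(selected)

# Indexing with a list of integers picks whole rows (here rows 0 and 2).
rows = readings[[0, 2]]
print(rows)

# any() is True if at least one element matches; all() only if every one does.
print(mask.any(), mask.all())
```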

<div class="exercise"><b>Exercise</b></div>
* Create a two-dimensional array of size $3\times 5$ and do the following:
  * Print out the array
  * Print out the shape of the array
  * Create two slices of the array:
    1. The first slice should be the last row and the third through last column
    2. The second slice should be rows $1-3$ and columns $3-5$
  * Square each element in the array and print the result

### Save and load numpy data to/from file

In [None]:
np.save("data/sensor_data.npy",data)

In [None]:
sensor_data = np.load("data/sensor_data.npy")
print(sensor_data)

In [None]:
# Load from a text file
sms = np.loadtxt("data/sms.txt")
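The cells above assume that a `data/` folder with the right files already exists. The sketch below is a self-contained round trip that writes to a temporary directory instead. The file names and the `counts` array are hypothetical.

```python
import os
import tempfile

import numpy as np

counts = np.array([[3.0, 12.0],
                   [7.0, 0.0]])

with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "counts.npy")
    txt_path = os.path.join(tmp, "counts.txt")

    # Binary .npy format: compact, and keeps dtype and shape.
    np.save(npy_path, counts)
    from_npy = np.load(npy_path)

    # Plain text format: human readable, one row per line.
    np.savetxt(txt_path, counts)
    from_txt = np.loadtxt(txt_path)

print(np.array_equal(counts, from_npy))
print(np.allclose(counts, from_txt))
```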

### Calculations

Often it is useful to store datasets in Numpy arrays. Numpy provides a number of functions to calculate statistics of datasets in arrays. 

In [None]:
#mean
sms.mean()

In [None]:
#std
sms.std()

In [None]:
#min
sms.min()

In [None]:
#max
sms.max()
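The same statistics can also be computed per column or per row with the `axis` argument. A minimal sketch on a small made-up array, `table`:

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# With no axis, the statistic covers every element.
overall_mean = table.mean()

# axis=0 collapses the rows, giving one value per column.
column_means = table.mean(axis=0)

# axis=1 collapses the columns, giving one value per row.
row_max = table.max(axis=1)

print(overall_mean, column_means, row_max)
```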

### Numpy calculations are element-wise

In [None]:
x = np.arange(1,10)
print(x)

In [None]:
print(x+2)

In [None]:
#print(x**2)
np.square(x)

In [None]:
np.log(x)
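Element-wise behaviour is easiest to appreciate next to a plain Python list, where the same operators mean something different. A short sketch with illustrative values:

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# On a list, * repeats the list.
repeated = values * 2

# On an array, * multiplies every element.
doubled = arr * 2

# Arithmetic between two arrays of the same shape pairs elements up.
squared = arr * arr

print(repeated)
print(doubled)
print(squared)
```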

### `Numpy` Arrays vs. `Python` Lists?

1. Why the need for `numpy` arrays?  Can't we just use `Python` lists?
2. Iterating over `numpy` arrays is slow. Slicing is faster.

`Python` lists may contain items of different types. This flexibility comes at a price: `Python` lists store *pointers* to memory locations.  On the other hand, `numpy` arrays are typed, where the default type is floating point.  Because of this, the system knows how much memory to allocate, and if you ask for an array of size $100$, it will allocate one hundred contiguous spots in memory, where the size of each spot is based on the type.  This makes access extremely fast.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png" alt="Drawing" style="width: 500px;"/>

(image from Jake Vanderplas's Data Science Handbook)
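The fixed, typed memory layout described above can be inspected directly. This sketch uses `np.zeros` as an example of a function whose default type is floating point; the variable names are illustrative.

```python
import numpy as np

arr = np.zeros(100)

print(arr.dtype)     # float64 by default for np.zeros
print(arr.itemsize)  # bytes per element: 8
print(arr.nbytes)    # 100 elements * 8 bytes = 800

# Choosing a smaller type shrinks every slot.
ints = np.zeros(100, dtype=np.int32)
print(ints.nbytes)   # 100 elements * 4 bytes = 400

# A Python list, by contrast, can mix types freely.
mixed = [1, 2.5, "three"]
print([type(item).__name__ for item in mixed])
```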

Unfortunately, looping over an array slows things down. In general, you should not access `numpy` array elements by iteration, because of type conversion. `Numpy` stores integers and floating-point numbers in `C`-language format. When you operate on array elements through iteration, `Python` needs to convert each element to a `Python` `int` or `float`, which is a more complex beast (a `struct` in `C` jargon). This has a cost.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/cint_vs_pyint.png" alt="Drawing" style="width: 500px;"/>

(image from Jake Vanderplas's Data Science Handbook)
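The cost of iteration can be measured with a rough timing sketch such as the one below. The two helper functions are hypothetical, and exact timings depend on the machine, so treat the printed numbers as indicative only. On most setups the vectorized version is considerably faster.

```python
import timeit

import numpy as np

x = np.arange(100_000, dtype=np.float64)


def loop_sum_of_squares(arr):
    """Sum of squares using a Python loop over the array elements."""
    total = 0.0
    for value in arr:
        total += value * value
    return total


def vectorized_sum_of_squares(arr):
    """Sum of squares using a single C-level NumPy call."""
    return np.dot(arr, arr)


loop_time = timeit.timeit(lambda: loop_sum_of_squares(x), number=5)
vec_time = timeit.timeit(lambda: vectorized_sum_of_squares(x), number=5)

print(f"loop:       {loop_time:.4f} s")
print(f"vectorized: {vec_time:.4f} s")
```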

If you want to know more, we suggest that you read
- [Jake Vanderplas's Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/). 
- [Wes McKinney's Python for Data Analysis](https://hollis.harvard.edu/primo-explore/fulldisplay?docid=01HVD_ALMA512247401160003941&context=L&vid=HVD2&lang=en_US&search_scope=everything&adaptor=Local%20Search%20Engine&tab=everything&query=any,contains,Wes%20McKinney%27s%20Python%20for%20Data%20Analysis&sortby=rank&offset=0) (HOLLIS)<br>
You will find them both incredible resources for this class.

Why is slicing faster? The reason is technical: slicing provides a *view* onto the memory occupied by a `numpy` array, instead of creating a new array. That is the reason the code above this cell works nicely as well. However, if you iterate over a slice, then you have gone back to the slow access.
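The view behaviour can be checked with a short sketch. The names `base`, `window`, and `independent` are illustrative.

```python
import numpy as np

base = np.arange(10)

# A slice is a view: it shares memory with the original array.
window = base[2:5]
print(np.shares_memory(base, window))

# Writing through the view therefore changes the original.
window[0] = -1
print(base[2])

# An explicit copy is independent of the original.
independent = base[2:5].copy()
independent[0] = 100
print(base[2])
```

Use `.copy()` whenever you need to modify a slice without touching the array it came from.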

By contrast, functions such as `np.dot` are implemented at the `C` level, do not do this type conversion, and access contiguous memory. If you want this kind of access in `Python`, use the `struct` module or `Cython`. Indeed, many fast algorithms in `numpy` and `pandas` are either implemented at the `C` level or employ `Cython`.

# Concept Check:
Answer these questions and see the bottom of the lab to check your answers.

1. What is the major benefit of working in `numpy`?
2. Why is it vital to use built-in numpy functions whenever possible?
3. You need to find the running total of each element in a numpy array. For example, the input \[2,5,3,2\] should give the output \[2,7,10,12\]. Rather than writing your own code, think of at least two Google queries that would help you find a relevant numpy function.
4. What `numpy` function does the job above?

## References

- [python4datascience-atc](https://github.com/pythontz/python4datascience-atc)
- [PythonDataScienceHandbook](https://github.com/jakevdp/PythonDataScienceHandbook)
- [DS-python-data-analysis](https://github.com/jorisvandenbossche/DS-python-data-analysis)