<center><h2><strong><font color="blue"> Advanced Programming for Data Science (APDS)</font></strong></h2></center>

<center><img alt="" src="images/covers/taudata-cover.jpg"/></center>

<center><h2><strong><font color="blue">APDS-04: Introduction to Python Numpy for Scientific Computing & Data Science</font></strong></h2></center>

<center><h3><b>(C) Taufik Sutanto</b></h3></center>

### Reading materials / Reference:
* https://github.com/donnemartin/data-science-ipython-notebooks/tree/master/numpy
* https://github.com/numpy/numpy-tutorials
* https://www.w3schools.com/python/numpy/default.asp
* https://numpy.org/numpy-tutorials/

<center><img alt="" src="images/APDS/Numpy-Project.jpg"  width=600/></center>

> NumPy is the fundamental package for scientific computing with Python.
* Website: https://numpy.org
* Documentation: https://numpy.org/doc
* Source code: https://github.com/numpy/numpy

<center><img alt="" src="images/APDS/Numpy-About.jpg" width=800/></center>

<center><h2><strong><font color="blue">About Numpy</font></strong></h2></center>

## What is NumPy?
* NumPy is a Python library used for working with arrays.
* It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
* NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely.
* NumPy stands for **Numerical Python**.

## Why Use NumPy?
* In Python we have lists that serve the purpose of arrays, but they are slow for numerical processing.
* NumPy aims to provide an array object that is much faster than traditional Python lists (figures of up to 50x are often quoted; the actual speedup depends on the operation and the array size).
* The array object in NumPy is called *ndarray*; it provides many supporting functions that make working with arrays very easy.
* Arrays are very **frequently used in data science**, where speed and resources are very important.

## Why is NumPy Faster Than Lists?
* NumPy arrays are stored in one contiguous block of memory, unlike lists, so processes can access and manipulate them very efficiently.
* This behavior is called locality of reference in computer science.
* This is the main reason why NumPy is faster than lists. It is also optimized to work with modern CPU architectures.
* NumPy is a Python library and is written partially in Python, but most of the parts that require fast computation are written in C or C++.
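
The contiguous layout described above can be inspected directly on an array. A minimal sketch (the variable name `arr` and the choice of `int64` are illustrative, not part of the original notebook):

```python
import numpy as np

# A small array with an explicit 8-byte integer dtype
arr = np.arange(6, dtype=np.int64)

print(arr.flags['C_CONTIGUOUS'])  # True: the elements sit in one contiguous block
print(arr.itemsize)               # bytes per element: 8 for int64
print(arr.strides)                # bytes to step to the next element: (8,)
print(arr.nbytes)                 # total bytes of data: 6 * 8 = 48
```

Every element has the same size and sits right next to its neighbour, which is what lets compiled loops walk the data efficiently.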

<center><h2><strong><font color="blue">Numpy and Data Science</font></strong></h2></center>

<center><img alt="" src="images/APDS/Numpy-DS.jpg"  width=800/></center>

<center><h2><strong><font color="blue">Numpy and Machine Learning</font></strong></h2></center>

<center><img alt="" src="images/APDS/tensorflow-ml-anim.gif"  width=300/></center>

NumPy underpins scientific and machine learning libraries such as SciPy and scikit-learn. As machine learning grows, so does the list of libraries that build on or interoperate with NumPy. TensorFlow’s deep learning capabilities have broad applications, among them speech and image recognition, text-based applications, time-series analysis, and video detection. PyTorch, another deep learning library, is popular among researchers in computer vision and natural language processing.

<center><h2><strong><font color="blue">Numpy Common Functions</font></strong></h2></center>

### Installing Numpy
> **conda install numpy** OR **pip install numpy**

<center><img alt="" src="images/APDS/Numpy-common-functions.jpg"  width=400/></center>

In [None]:
import numpy as np #importing Numpy using an alias

np.__version__ #this is a (rather) universal way to check a Python module's version

In [None]:
l = [2.1, 2.8, 1.9, 2.5, 2.7, 2.3, 1.8, 1.2, 0.9, 0.1]
a = np.array(l)
print(a)
a.shape

In [None]:
# element-wise operations
print(a * 2+1)

In [None]:
try:
    print(l * 2+1) # TypeError: l * 2 repeats the list, and an int cannot be added to a list
except Exception as err_:
    print(err_)

In [None]:
print(a)
print(a*a)

In [None]:
print(np.dot(a, a)) # dot product of a with itself = squared Euclidean norm; Euclidean distances (e.g., in k-Means) are built from it
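
To connect the dot product to an actual Euclidean distance, here is a small sketch. The points `p` and `q` are made-up values chosen so the result is easy to check by hand:

```python
import numpy as np

# Two hypothetical points in 3-D space
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

diff = p - q                              # [-3., -4., 0.]
dist_dot = np.sqrt(np.dot(diff, diff))    # sqrt(9 + 16 + 0)
dist_norm = np.linalg.norm(diff)          # same quantity via the norm function

print(dist_dot, dist_norm)                # both are 5.0
```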

In [None]:
# Matrix
A = [ [1,2], [3,4] ]
B = np.array(A)
print(B.shape)
B

In [None]:
np.matmul(B,B) # matrix product; the NumPy equivalent of MATLAB's B*B

In [None]:
# Default is element-wise operation
B*B
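
The difference between the two products is easy to miss, so here is a side-by-side sketch. It uses the `@` operator, which is Python's infix form of `np.matmul` (the variable names are illustrative):

```python
import numpy as np

B = np.array([[1, 2], [3, 4]])

matrix_product = B @ B   # same result as np.matmul(B, B)
elementwise = B * B      # each entry multiplied by itself

print(matrix_product.tolist())  # [[7, 10], [15, 22]]
print(elementwise.tolist())     # [[1, 4], [9, 16]]
```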

In [None]:
B.transpose() # the NumPy equivalent of MATLAB's B'

In [None]:
#similarly
B.T

In [None]:
inv = np.linalg.inv # alias
#np.linalg.inv(B) # this is the MATLAB version of inv(B)
inv(B)

In [None]:
det = np.linalg.det
det(B) # Determinant of Matrix B
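
A quick sanity check for the inverse and the determinant, sketched with the same 2x2 matrix. Because the results are floating-point, the comparison uses a tolerance (`np.allclose`, `np.isclose`) rather than `==`:

```python
import numpy as np

B = np.array([[1.0, 2.0], [3.0, 4.0]])
B_inv = np.linalg.inv(B)

# B times its inverse should be the identity matrix (up to rounding error)
is_identity = np.allclose(B @ B_inv, np.eye(2))

# For a 2x2 matrix the determinant is ad - bc = 1*4 - 2*3 = -2
det_matches = np.isclose(np.linalg.det(B), -2.0)

print(is_identity, det_matches)
```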

In [None]:
# Multidimensional Array
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)

In [None]:
# in summary

a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)

In [None]:
# even More
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('number of dimensions :', arr.ndim)

## Accessing/Slicing a NumPy Array is similar to a List

In [None]:
b[:-2]
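
Beyond list-style slicing, NumPy arrays also accept one index or slice per dimension, separated by commas. A small sketch with a made-up 3x3 array `m`:

```python
import numpy as np

b = np.array([1, 2, 3, 4, 5])
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(b[:-2].tolist())      # [1, 2, 3]  same syntax as a list
print(m[0, 1])              # 2          row 0, column 1
print(m[:2, 1:].tolist())   # [[2, 3], [5, 6]]  first two rows, last two columns
print(m[:, 0].tolist())     # [1, 4, 7]  the whole first column
```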

<h2 id="List-VS-Array-:-Best-use-scenario">List VS Array : Best use scenario</h2>

In [None]:
# Memory Usage Comparison (getsizeof reports bytes, not bits)
from sys import getsizeof as size

a = np.array([24, 12, 57])
b = np.array([])
c = []
d = [24, 12, 57]
print(size(a),size(b),size(c),size(d))
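
One caveat when reading these numbers: `getsizeof` on a list counts only the list object itself (its table of references), not the integer objects it points to. The sketch below adds the two parts together for a rough comparison. It assumes CPython, and the size `n = 1000` is arbitrary:

```python
import sys
import numpy as np

n = 1000
values = list(range(n))
arr = np.arange(n, dtype=np.int64)

# The list container plus every int object it references
list_container = sys.getsizeof(values)
list_elements = sum(sys.getsizeof(v) for v in values)
list_total = list_container + list_elements

print(list_total)   # rough total for the list, in bytes (varies by Python version)
print(arr.nbytes)   # raw data of the array: 1000 * 8 = 8000 bytes
```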

In [None]:
# Let's test the speed comparison between numpy and list
# In Data Science, EFFICIENCY is very important
N = 10000
A = [i+1 for i in range(N)] # [1,2,3,...,N]
B = [i*2 for i in range(N)]
C = np.array(A)
D = np.array(B)
D[:10]

In [None]:
%%timeit
E = [a+b for a,b in zip(A,B)]

In [None]:
%%timeit
F = np.add(C,D)

## Simple Visualization

In [None]:
import matplotlib.pyplot as plt # conda install matplotlib

a = np.array([2.1, 2.8, 1.9, 2.5, 2.7, 2.3, 1.8, 1.2, 0.9, 0.1])
plt.plot(a)
plt.show()

In [None]:
X = np.linspace(-2 * np.pi, 2 * np.pi, 50, endpoint=True)
F1 = 3 * np.sin(X)
F2 = np.sin(2*X)
F3 = 0.3 * np.sin(X)
F1

In [None]:
startx, endx = -2 * np.pi - 0.1, 2*np.pi + 0.1
starty, endy = -3.1, 3.1
startx, endx

In [None]:
plt.axis([startx, endx, starty, endy])
plt.plot(X,F1)
plt.plot(X,F2)
plt.plot(X,F3)
plt.plot(X, F1, 'ro')
plt.plot(X, F2, 'bx')
plt.show()
# Comment out the first three plt.plot(...) calls (the line plots) for a scatter plot example

<center><h2><strong><font color="blue">Data Types</font></strong></h2></center>

### Data Types in Python
* **strings** - used to represent text data; the text is written inside quotation marks. e.g. "ABCD"
* **integer** - used to represent integer numbers. e.g. -1, -2, -3
* **float** - used to represent real numbers. e.g. 1.2, 42.42
* **boolean** - used to represent True or False.
* **complex** - used to represent complex numbers. e.g. 1.0 + 2.0j, 1.5 + 2.5j

### Data Types in NumPy
<center><img alt="" src="images/APDS/numpy-data-types.PNG" width=400/></center>

In [None]:
arr = np.array([1, 2, 3, 4])
arr.dtype

In [None]:
arr = np.array(['apple', 'banana', 'cherry'])
arr.dtype

In [None]:
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)

In [None]:
arr = np.array([1, 2, 3, 4], dtype='i4')

print(arr)
print(arr.dtype)

In [None]:
# Warning: values that cannot be converted to the requested dtype raise a ValueError

try:
    arr = np.array(['a', '2', '3'], dtype='i')
except Exception as err_:
    print(err_)
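
An existing array can also be converted to another data type with `astype`, which returns a new array. A short sketch with made-up values:

```python
import numpy as np

arr = np.array([1.9, 2.1, -3.7])

as_int = arr.astype('i4')    # float -> int truncates toward zero
as_bool = arr.astype(bool)   # any non-zero value becomes True
from_text = np.array(['1', '2', '3']).astype(int)  # numeric strings convert cleanly

print(as_int.tolist())     # [1, 2, -3]
print(as_bool.tolist())    # [True, True, True]
print(from_text.tolist())  # [1, 2, 3]
```

Note that the float-to-int conversion drops the fractional part rather than rounding.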

## View vs. Copy in NumPy Array

In [None]:
arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
arr[0] = 42

print(arr)
print(x)

In [None]:
arr = np.array([1, 2, 3, 4, 5])
x = arr.view()
arr[0] = 42

print(arr)
print(x)

### View or Copy?

In [None]:
arr = np.array([1, 2, 3, 4, 5])

x = arr.copy()
y = arr.view()

print(x.base)
print(y.base)
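
The `base` attribute is `None` when an array owns its data and refers to the original array when it is a view. The sketch below adds two further checks that are not in the cells above: a slice `z`, and `np.shares_memory`:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
x = arr.copy()
y = arr.view()
z = arr[1:4]   # basic slicing also returns a view

print(x.base is None)             # True: the copy owns its data
print(y.base is arr)              # True: the view borrows arr's data
print(z.base is arr)              # True
print(np.shares_memory(arr, x))   # False
print(np.shares_memory(arr, z))   # True
```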

## Shape and Reshape

* Often used in AI ... Why?

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
print(arr.shape)
print(arr)
print()
newarr = arr.reshape(4, 3)
print(newarr)

### What if the number of elements does not agree?

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
try:
    newarr = arr.reshape(3, 3)
    print(newarr)
except Exception as err_:
    print(err_)

## Unknown Dimension

* You are allowed to have one "unknown" dimension.
* That is, you do not have to specify an exact number for one of the dimensions in the reshape method.
* Pass -1 as the value, and NumPy will calculate this number for you.
* Used a lot in Deep learning (computer vision, NLP, etc)

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

newarr = arr.reshape(2, 2, -1)
print(newarr)
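
A typical deep learning use of `-1` is flattening each sample in a batch while keeping the batch dimension. The sketch below uses a hypothetical batch of 4 tiny "images" of 2x3 pixels:

```python
import numpy as np

# Hypothetical batch: 4 samples, each 2 x 3
batch = np.arange(24).reshape(4, 2, 3)

flat = batch.reshape(4, -1)            # keep the 4 samples, flatten the rest
column = np.arange(5).reshape(-1, 1)   # turn a 1-D array into a column
everything = batch.reshape(-1)         # flatten completely

print(flat.shape)        # (4, 6)
print(column.shape)      # (5, 1)
print(everything.shape)  # (24,)
```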

## Iterating Over an Array

In [None]:
arr = np.array([1, 2, 3])

for x in arr:
  print(x)

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
  print(x)

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])

for x in arr:
  for y in x:
    print(y)

### nditer & ndenumerate

In [None]:
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

for x in np.nditer(arr):
  print(x)

In [None]:
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

for idx, x in np.ndenumerate(arr):
  print(idx, x)

## Joining NumPy Arrays

In [None]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))

print(arr)

In [None]:
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=1)
print(arr1)
print(arr2)
print(arr)

In [None]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.stack((arr1, arr2), axis=1)
print(arr1)
print(arr2)
print(arr)

In [None]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.vstack((arr1, arr2))
print(arr1)
print(arr2)
print(arr)

In [None]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.dstack((arr1, arr2))
print(arr1)
print(arr2)
print(arr)
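
The joining functions differ mainly in the shape of the result. As a compact summary, this sketch prints the output shape of each call for the same two 1-D inputs (`np.hstack` is included for completeness, although it is not used in the cells above):

```python
import numpy as np

a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])

print(np.concatenate((a1, a2)).shape)    # (6,)      joined end to end
print(np.stack((a1, a2), axis=0).shape)  # (2, 3)    new leading axis
print(np.stack((a1, a2), axis=1).shape)  # (3, 2)    new trailing axis
print(np.vstack((a1, a2)).shape)         # (2, 3)    stacked as rows
print(np.hstack((a1, a2)).shape)         # (6,)      stacked side by side
print(np.dstack((a1, a2)).shape)         # (1, 3, 2) stacked along a third axis
```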

## Array Search

In [None]:
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])
x = np.where(arr%2 == 0)
print(x)

In [None]:
# Why is it faster if sorted?
arr = np.array([6, 7, 8, 9])
x = np.searchsorted(arr, 7)
print(x)
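
`searchsorted` uses binary search, which is why it needs sorted input: each comparison halves the remaining range instead of scanning every element. It returns the index at which a value would be inserted to keep the array sorted. A short sketch of the `side` parameter and of searching for several values at once:

```python
import numpy as np

arr = np.array([6, 7, 8, 9])

left = np.searchsorted(arr, 7)                  # insert before the existing 7
right = np.searchsorted(arr, 7, side='right')   # insert after the existing 7
many = np.searchsorted(arr, [2, 8, 10])         # several values in one call

print(left)           # 1
print(right)          # 2
print(many.tolist())  # [0, 2, 4]
```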

## Array Sort

In [None]:
arr = np.array(['banana', 'cherry', 'apple'])
print(np.sort(arr))

In [None]:
arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr))
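
For a 2-D array, `np.sort` sorts along the last axis by default, so each row is sorted independently. The sketch below also shows sorting down the columns and `np.argsort`, which returns the ordering instead of the sorted values (neither appears in the cells above):

```python
import numpy as np

arr = np.array([[3, 2, 4], [5, 0, 1]])

rows_sorted = np.sort(arr)          # default axis=-1: sort within each row
cols_sorted = np.sort(arr, axis=0)  # sort within each column
order = np.argsort(arr[0])          # indices that would sort the first row

print(rows_sorted.tolist())  # [[2, 3, 4], [0, 1, 5]]
print(cols_sorted.tolist())  # [[3, 0, 1], [5, 2, 4]]
print(order.tolist())        # [1, 0, 2]
```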

## Filtering Arrays

In [None]:
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
newarr = arr[x]
print(newarr)

In [None]:
arr = np.array([41, 42, 43, 44])
filter_arr = arr > 42
newarr = arr[filter_arr]
print(filter_arr)
print(newarr)
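
Several conditions can be combined into one filter with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A minimal sketch on the same array:

```python
import numpy as np

arr = np.array([41, 42, 43, 44])

big_and_even = arr[(arr > 41) & (arr % 2 == 0)]
outer_values = arr[(arr < 42) | (arr > 43)]
not_even = arr[~(arr % 2 == 0)]

print(big_and_even.tolist())  # [42, 44]
print(outer_values.tolist())  # [41, 44]
print(not_even.tolist())      # [41, 43]
```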

## NumPy Pseudo-Random Number Generator

* Random numbers produced by a deterministic algorithm are called pseudo-random.

<center><img alt="" src="images/rng.png"/></center>

In [None]:
from numpy import random

x = random.randint(8)
print(x)

In [None]:
x = random.randint(100, size=(3, 5))

print(x)

In [None]:
x = random.rand()
print(x)

In [None]:
x = random.randn(100)
print(x[:10])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns # conda install seaborn

sns.displot(x, kind="kde")
plt.show()

In [None]:
x = random.choice([3, 5, 7, 9])
print(x)

In [None]:
x = random.choice([3, 5, 7, 9], size=(3, 5))

print(x)

In [None]:
arr = np.array([1, 2, 3, 4, 5])
random.shuffle(arr)
print(arr)
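
Because pseudo-random numbers come from a deterministic algorithm, fixing the seed reproduces the same sequence, which is useful for repeatable experiments. The sketch below uses `np.random.default_rng`, the Generator interface that the NumPy documentation recommends for new code. It is an alternative to the `numpy.random` functions used in the cells above, and the seed value 42 is arbitrary:

```python
import numpy as np

# Two generators created with the same seed
rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

a = rng1.integers(0, 100, size=5)   # integers in [0, 100)
b = rng2.integers(0, 100, size=5)

print(np.array_equal(a, b))         # True: same seed, same sequence
```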

## NumPy ufuncs

* ufuncs stands for "Universal Functions" and they are NumPy functions that operate on the ndarray object.

### Why use ufuncs?
* ufuncs are used to implement vectorization in NumPy, which is much faster than iterating over elements in Python.
* They also provide **broadcasting** and additional methods like **reduce, accumulate** etc. that are very helpful for computation.
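
The `reduce` and `accumulate` methods mentioned above are available on binary ufuncs such as `np.add` and `np.multiply`. A minimal sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4])

total = np.add.reduce(x)            # 1 + 2 + 3 + 4
running = np.add.accumulate(x)      # running sum
product = np.multiply.reduce(x)     # 1 * 2 * 3 * 4

print(total)             # 10
print(running.tolist())  # [1, 3, 6, 10]
print(product)           # 24
```

`reduce` collapses the array to a single value, while `accumulate` keeps every intermediate result.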

### What is Vectorization?
* Converting iterative statements into a vector based operation is called vectorization.
* It is faster as modern CPUs are optimized for such operations.

In [None]:
x = [1, 2, 3, 4]
y = [4, 5, 6, 7]
z = []

for i, j in zip(x, y):
  z.append(i + j)
z

In [None]:
x = [1, 2, 3, 4]
y = [4, 5, 6, 7]
z = np.add(x, y)

z

## Custom Function

In [None]:
def myadd(x, y):
  return x+y

myadd = np.frompyfunc(myadd, 2, 1)
print(myadd([1, 2, 3, 4], [5, 6, 7, 8]))
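
One thing to keep in mind: a ufunc created with `np.frompyfunc` still calls the Python function once per element, and it returns an array of dtype `object`. The sketch below makes this visible and converts the result to a numeric dtype. The names `py_add` and `ufunc_add` are illustrative:

```python
import numpy as np

def py_add(x, y):
    return x + y

ufunc_add = np.frompyfunc(py_add, 2, 1)   # 2 inputs, 1 output
result = ufunc_add([1, 2, 3, 4], [5, 6, 7, 8])

print(result.dtype)             # object
as_int = result.astype(np.int64)
print(as_int.tolist())          # [6, 8, 10, 12]
```

For simple arithmetic like this, the built-in `np.add` is the better choice; `frompyfunc` is mainly a convenience for logic that has no ready-made ufunc.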

# Aggregations: Min, Max, and Everything In Between

Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question.
Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).

NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.

## Summing the Values in an Array

As a quick example, consider computing the sum of all values in an array.
Python itself can do this using the built-in ``sum`` function:

In [None]:
import numpy as np

L = np.random.random(100)
L[:8]

In [None]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

## Minimum and Maximum

Similarly, Python has built-in ``min`` and ``max`` functions, used to find the minimum value and maximum value of any given array:

In [None]:
%timeit min(big_array)
%timeit np.min(big_array)

### Multi dimensional aggregates

One common type of aggregation operation is an aggregate along a row or column.
Say you have some data stored in a two-dimensional array:

In [None]:
M = np.random.random((3, 4))
print(M)

In [None]:
M.sum(), M.min(axis=0), M.max(axis=1)
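
Since the random matrix changes on every run, here is the same idea with a fixed array so the results can be checked by hand. The `axis` argument names the dimension that is collapsed: `axis=0` aggregates down each column, `axis=1` across each row.

```python
import numpy as np

# Fixed 3 x 4 matrix:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
M = np.arange(12).reshape(3, 4)

print(M.sum())                 # 66
print(M.min(axis=0).tolist())  # [0, 1, 2, 3]     minimum of each column
print(M.max(axis=1).tolist())  # [3, 7, 11]       maximum of each row
print(M.sum(axis=0).tolist())  # [12, 15, 18, 21] column sums
print(M.mean(axis=1).tolist()) # [1.5, 5.5, 9.5]  row means
```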

<center><img alt="" src="images/APDS/Numpy-Aggregate.jpg"/></center>

# Computation on Arrays: Broadcasting

We saw in the previous section how NumPy's universal functions can be used to *vectorize* operations and thereby remove slow Python loops.
Another means of vectorizing operations is to use NumPy's *broadcasting* functionality.
Broadcasting is simply a set of rules for applying binary ufuncs (e.g., addition, subtraction, multiplication, etc.) on arrays of different sizes.

Recall that for arrays of the same size, binary operations are performed on an element-by-element basis:

In [None]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b

Broadcasting allows these types of binary operations to be performed on arrays of different sizes. For example, we can just as easily add a scalar (think of it as a zero-dimensional array) to an array:

In [None]:
a + 5

We can think of this as an operation that stretches or duplicates the value 5 into the array [5, 5, 5], and adds the results. The advantage of NumPy's broadcasting is that this duplication of values does not actually take place, but it is a useful mental model as we think about broadcasting.
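
The claim that no duplication takes place can be made visible with `np.broadcast_to`, which returns a read-only view of the stretched value. In this sketch (not part of the original notebook), the stride of 0 bytes shows that all three positions read the same stored number:

```python
import numpy as np

a = np.array([0, 1, 2])
five = np.broadcast_to(5, a.shape)   # looks like [5, 5, 5]

print(five.tolist())         # [5, 5, 5]
print(five.strides)          # (0,): stepping to the next element moves 0 bytes
print((a + five).tolist())   # [5, 6, 7], same as a + 5
```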

We can similarly extend this to arrays of higher dimension. Observe the result when we add a one-dimensional array to a two-dimensional array:

In [None]:
M = np.ones((3, 3))
M

In [None]:
M + a

Here the one-dimensional array ``a`` is stretched, or broadcast, along the first axis (it is repeated for each row) in order to match the shape of ``M``.

While these examples are relatively easy to understand, more complicated cases can involve broadcasting of both arrays. Consider the following example:

In [None]:
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

print(a)
print(b)

In [None]:
a + b

Just as before we stretched or broadcast one value to match the shape of the other, here we've stretched *both* ``a`` and ``b`` to match a common shape, and the result is a two-dimensional array!
The geometry of these examples is visualized in the following figure.

<center><img alt="" src="images/APDS/Numpy-Broadcast.jpg"/></center>



## Rules of Broadcasting

Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:

- Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is *padded* with ones on its leading (left) side.
- Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.

To make these rules clear, let's consider a few examples in detail.

In [None]:
M = np.ones((2, 3))
a = np.arange(3)

Let's consider an operation on these two arrays. The shapes of the arrays are:

- ``M.shape = (2, 3)``
- ``a.shape = (3,)``

We see by rule 1 that the array ``a`` has fewer dimensions, so we pad it on the left with ones:

- ``M.shape -> (2, 3)``
- ``a.shape -> (1, 3)``

By rule 2, we now see that the first dimension disagrees, so we stretch this dimension to match:

- ``M.shape -> (2, 3)``
- ``a.shape -> (2, 3)``

The shapes match, and we see that the final shape will be ``(2, 3)``:

In [None]:
M + a

### Broadcasting example 2

Let's take a look at an example where both arrays need to be broadcast:

In [None]:
a = np.arange(3).reshape((3, 1))
b = np.arange(3)

a, b

Again, we'll start by writing out the shape of the arrays:

- ``a.shape = (3, 1)``
- ``b.shape = (3,)``

Rule 1 says we must pad the shape of ``b`` with ones:

- ``a.shape -> (3, 1)``
- ``b.shape -> (1, 3)``

And rule 2 tells us that we upgrade each of these ones to match the corresponding size of the other array:

- ``a.shape -> (3, 3)``
- ``b.shape -> (3, 3)``

Because the result matches, these shapes are compatible. We can see this here:

In [None]:
a+b

### Broadcasting example 3

Now let's take a look at an example in which the two arrays are not compatible:

In [None]:
M = np.ones((3, 2))
a = np.arange(3)

This is just a slightly different situation than in the first example: the matrix ``M`` is transposed.
How does this affect the calculation? The shapes of the arrays are:

- ``M.shape = (3, 2)``
- ``a.shape = (3,)``

Again, rule 1 tells us that we must pad the shape of ``a`` with ones:

- ``M.shape -> (3, 2)``
- ``a.shape -> (1, 3)``

By rule 2, the first dimension of ``a`` is stretched to match that of ``M``:

- ``M.shape -> (3, 2)``
- ``a.shape -> (3, 3)``

Now we hit rule 3: the final shapes do not match, so these two arrays are incompatible, as we can observe by attempting this operation:

In [None]:
try:
    M + a
except Exception as err_:
    print(err_)
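
The three worked examples can also be checked without creating any data, using `np.broadcast_shapes`. This function is an addition to the original walkthrough and requires NumPy 1.20 or newer. The last lines sketch one common way to make the incompatible case work: give ``a`` an explicit second axis so that its shape becomes ``(3, 1)``.

```python
import numpy as np

shape1 = np.broadcast_shapes((2, 3), (3,))   # example 1
shape2 = np.broadcast_shapes((3, 1), (3,))   # example 2
print(shape1, shape2)                        # (2, 3) (3, 3)

# Example 3: the shapes (3, 2) and (3,) are incompatible
try:
    np.broadcast_shapes((3, 2), (3,))
    failed = False
except ValueError as err:
    failed = True
    print(err)

# Reshaping a to a column, shape (3, 1), makes the addition valid
M = np.ones((3, 2))
a = np.arange(3)
fixed = M + a[:, np.newaxis]
print(fixed.shape)                           # (3, 2)
```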

<center><h2><strong><font color="blue">End of Module</font></strong></h2></center>
<hr>
<center><img alt="" src="images/meme-cartoon/numpy-trump-meme.png"/></center>