# Python for Machine Learning - NumPy

> Introduction to NumPy - Python's Feature-Rich Mathematical Library.

- toc: true
- badges: true
- comments: false
- categories: ['Python for ML','NumPy','Machine Learning']
- image: images/numpy-logo.png

# Importing NumPy

[NumPy](https://numpy.org/) is a scientific computing library for Python. It's an extensive collection of pre-written code that optimizes and extends, among other things, the Python array (i.e. `list`) object into an n-dimensional NumPy array called `ndarray`. It comes with a variety of tools, such as matrix operations and common mathematical functions, that enable Python to perform complex linear algebraic tasks, generate pseudo-random numbers, perform Fourier analysis, etc.

We import NumPy, as we import any other library, using the `import` keyword (with or without a shorthand).

```python
import numpy
```
Or, alternatively, with the conventional shorthand `np`, which the rest of this post uses:

```python
import numpy as np
```

# Optimizations

As we've briefly discussed in the ["Python for Machine Learning - Pandas"](https://v-poghosyan.github.io/blog/python%20for%20ml/pandas/machine%20learning/2021/12/28/Python-for-ML-Pandas.html#Broadcasted-and-Vectorized-Operations) post, NumPy works by delegating tasks to well-optimized C code under the hood. In this way it keeps the flexibility of Python while bypassing the speed limitations of an *interpreted* language in favor of faster, compiled code.


## Scalable Memory Representation

As far as memory optimization is concerned, one of the things NumPy optimizes is data storage. In contrast to Python 3.x's scalable memory representation of numeric values such as integers, which can grow to accommodate a given number, NumPy stores numeric types in fixed-size blocks of memory (such as `int32` or `int64`). This means NumPy is able to take advantage of modern processors' low-level instructions designed for fixed-size numeric types. Another advantage of fixed-size storage is that consecutive blocks of memory can be allocated, which enables the libraries upon which NumPy relies to do extremely performant computations. The trade-off is that a fixed-size integer can overflow, whereas a Python integer cannot. This enforcement of fixed-size data types is part of the optimization strategy NumPy uses, which is called vectorization.
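
To make the contrast concrete, here is a small sketch comparing a Python integer with a fixed-size NumPy integer. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# A Python int grows as needed, so very large values are stored exactly.
python_int = 2**100

# A NumPy array stores every element in the same fixed number of bytes.
fixed = np.array([1, 2, 3], dtype=np.int32)

bytes_per_element = fixed.itemsize       # 4 bytes for int32
total_bytes = fixed.nbytes               # 3 elements * 4 bytes each
largest_int32 = np.iinfo(np.int32).max   # upper limit of the int32 range

print(python_int)
print(bytes_per_element, total_bytes, largest_int32)  # 4 12 2147483647
```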

## Vectorization

As already discussed in the [aforementioned post](https://v-poghosyan.github.io/blog/python%20for%20ml/pandas/machine%20learning/2021/12/28/Python-for-ML-Pandas.html#Broadcasted-and-Vectorized-Operations), *vectorization* relies on NumPy storing an array internally in a contiguous block of memory and restricting its contents to a single data type. Because this data type is known in advance, NumPy can skip the per-iteration type checking that Python normally does, which speeds up our code. Optimizing the array data structure in such a way enables NumPy to delegate most of the operations on such arrays to pre-written C code under the hood. In effect, this simply means that looping occurs in C instead of Python.
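
As a rough illustration, the sketch below computes the same result twice, once with an explicit Python loop and once with a vectorized expression. It also shows that mixed inputs are converted to one common type. The variable names are hypothetical.

```python
import numpy as np

values = np.array([1, 2, 3, 4], dtype=np.float64)

# Explicit Python loop: every iteration is handled by the interpreter.
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * 2.0

# Vectorized form: the same loop runs inside NumPy's compiled code.
vectorized = values * 2.0

# Mixed inputs are converted to a single common type, here float64.
mixed = np.array([1, 2.5, 3])

print(np.array_equal(looped, vectorized))  # True
print(mixed.dtype)                         # float64
print(values.flags['C_CONTIGUOUS'])        # True
```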

## Broadcasting

The term *broadcasting* describes the process by which NumPy performs arithmetic operations on arrays of different shapes. The process is usually as follows: the smaller array is “broadcast” across the larger array so that the two arrays have compatible shapes. Two dimensions are compatible when they are equal or when one of them is 1. Broadcasting provides a means of vectorizing array operations.
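
A minimal sketch of broadcasting, using a small hypothetical matrix `M` together with a scalar, a row, and a column:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
row = np.array([10, 20, 30])       # shape (3,)
col = np.array([[100], [200]])     # shape (2, 1)

plus_scalar = M + 1    # the scalar is stretched to every entry
plus_row = M + row     # the row is reused for each of the 2 rows
plus_col = M + col     # the column is reused for each of the 3 columns

print(plus_scalar)
print(plus_row)
print(plus_col)
```

In none of these cases does NumPy need to physically copy the smaller operand; the result is the same as if it had been repeated to match the shape `(2, 3)`.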

## Comparing Runtime

To demonstrate the performance optimizations of NumPy, let's time squaring every element of a `1,000,000`-element array and summing the results, first with a Python list and then with a NumPy array.

### Using a Python List

First, we will use a Python list:

```python
unoptimized_list = list(range(1000000))
```

Squaring each element and summing:

```python
import numpy as np
%time np.sum([i**2 for i in unoptimized_list])
# Wall time: 312 ms
# 333332833333500000
```

> Note: Even though we're using NumPy's `sum()` function, the squaring happens in a Python list comprehension and the input we're passing is a regular Python list, so NumPy's optimizations are not applied to the bulk of the work.

<br>

As we can see, the whole thing took `312 ms`.

### Using a NumPy Array

Now let's do the same with a NumPy array, which also gives us the opportunity to introduce the syntax for defining one using a range.

```python
optimized_array = np.arange(1000000)
```

Let's check the type of `optimized_array` to convince ourselves that it is, indeed, a NumPy `ndarray`.

```python
type(optimized_array) # numpy.ndarray
```

Now, finally, let's square each element and sum the results:

```python
%time np.sum(optimized_array**2)
# Wall time: 2.48 ms
# 584144992
```

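Remarkably, the run-time was cut from `312 ms` to only `2.48 ms`! Note, however, that the result `584144992` does not match the `333332833333500000` obtained from the list. The default integer type on the machine used here is `int32`, and the exact sum is too large for 32 bits, so the computation silently overflowed and wrapped around. Creating the array with `np.arange(1000000, dtype=np.int64)` avoids this.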
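
Because the exact sum (about $3.3 \times 10^{17}$) needs more than 32 bits, it is worth checking the NumPy result against plain Python. The sketch below requests a 64-bit integer type explicitly; the `dtype=np.int64` argument and the variable names are additions that do not appear in the original cells, and no timing is claimed for this variant.

```python
import numpy as np

n = 1_000_000

# Exact reference value: Python ints never overflow.
expected = sum(i**2 for i in range(n))

# Ask for 64-bit integers so neither the squares nor their sum overflow.
arr64 = np.arange(n, dtype=np.int64)
total64 = int(np.sum(arr64**2))

print(total64 == expected)  # True
```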

# NumPy Basics

Let's explore some of the ways in which we can represent arrays and matrices in NumPy.

## Creating Arrays

We've already seen how we can create a 1-dimensional NumPy array of consecutive integers $0,1,...,999999$ using the `arange()` function.

The standard way of creating a NumPy array is passing a Python list to the `array()` function like so:

```python
a = np.array([1,2,3])
a # array([1, 2, 3])
```

## Representing Matrices

Let's represent a $2 \times 3$ matrix 
$
A = \begin{bmatrix}
1 & 2 & 3\\
4 & 5 & 6
\end{bmatrix}
$
using NumPy:

```python
A = np.array([
    [1,2,3],
    [4,5,6]
])
A
# array([[1, 2, 3],
#        [4, 5, 6]])
```

## Indexing

Indexing a 1-dimensional NumPy array is done as expected, through the use of the trusty brackets `[]`. Indexing an n-dimensional array in NumPy still uses `[]`, but it introduces a new, improved syntax.

Suppose we'd like to access the element in the first row and last column of `A`. The standard way, as with nested Python lists, would be:

```python
A[0][2] # 3
```

As we can see, that still works. But the recommended and, subjectively speaking, prettier way is:

```python
A[0,2] # 3
```

Of course, slicing still works as expected.

For example, let's print the entire first row of `A`:

```python
A[0,:] # array([1, 2, 3])
```

The entire first column: 

```python
A[:,0] # array([1, 4])
```

Finally, let's print the submatrix $
\begin{bmatrix}
2 & 3
\end{bmatrix}
$:

```python
A[0,1:] # array([2, 3])
```
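
A few more indexing patterns on the same matrix, sketched here with illustrative variable names:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

last_in_first_row = A[0, -1]   # negative indices count from the end
last_two_columns = A[:, 1:]    # every row, columns 1 and 2
second_row = A[1]              # a single index selects a whole row

print(last_in_first_row)   # 3
print(last_two_columns)
print(second_row)
```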

## Properties and Methods of NumPy Arrays

A few of the useful properties and methods of `ndarray` are highlighted in this section.


* `arange()` - a NumPy function (rather than an `ndarray` method) that takes an integer $n$ as input and creates the sequential array $0,...,n-1$.
```python
np.arange(5) # array([0, 1, 2, 3, 4])
```
* `shape` - returns the shape of the matrix as an $(m,n)$ pair
```python
A.shape # (2,3)
```

* `ndim` - returns the number of dimensions (axes) of the array as a single integer
> Note: The output of the `ndim` property is the number of axes of the array, i.e. the level of nestedness in the data structure sense. It should not be confused with the linear algebraic dimension of a vector space: every matrix has `ndim` equal to 2, whatever its number of rows and columns.
```python
A.ndim # 2
```

* `size` - returns the total number of elements in the matrix
```python
A.size # 6
```

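* `dtype` - returns the data type of the elements in the matrix.
> Note: If the nested lists are ragged, so that the `ndarray` does not represent a matrix, such as `B = np.array([[1,2,3],[4,5]], dtype=object)`, then `dtype` outputs `O`, signifying that the entries are general Python objects. In such a case, the array loses its optimizations. NumPy versions 1.24 and later raise a `ValueError` for ragged input unless `dtype=object` is passed explicitly.
```python
A.dtype # dtype('int32') here; platform-dependent, int64 on most Linux and macOS systems
```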
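
To see how these properties relate to one another beyond the $2 \times 3$ case, here is a sketch using a hypothetical array with three axes, created with `np.zeros()`:

```python
import numpy as np

T = np.zeros((2, 3, 4))   # hypothetical array with three axes
v = np.arange(5)          # 1-dimensional array

t_ndim = T.ndim     # 3, one per axis
t_shape = T.shape   # (2, 3, 4)
t_size = T.size     # 24, the product of the entries of shape
t_dtype = T.dtype   # float64, the default for np.zeros()

v_ndim = v.ndim     # 1
v_shape = v.shape   # (5,)

print(t_ndim, t_shape, t_size, t_dtype)
print(v_ndim, v_shape)
```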

### Statistical and Mathematical Methods

There is also a vast selection of statistical and, more generally, mathematical methods that an `ndarray` comes with. Here are a few of the common ones:

* `sum()` - returns the sum of all the entries
```python
A.sum() # 21
```
It also accepts an `axis` argument, where `axis = 0` sums down each column (collapsing the rows), and `axis = 1` sums across each row (collapsing the columns).
```python
A.sum(axis = 0) # [5,7,9]
A.sum(axis = 1) # [6,15]
```
* `mean()` - returns the empirical mean of all the entries 
```python
A.mean() # 3.5
```
* `var()` - returns the variance of the entries
```python
A.var() # 2.9166666666666665
```
* `std()` - returns the standard deviation of the entries
```python
A.std() # 1.707825127659933
```
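
The `axis` argument works the same way for the other methods. The sketch below also contrasts the default population variance with the sample variance, which can be requested through the `ddof` argument; the variable names are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

col_means = A.mean(axis=0)   # one mean per column
row_means = A.mean(axis=1)   # one mean per row

population_var = A.var()     # divides by N = 6 (the default, ddof=0)
sample_var = A.var(ddof=1)   # divides by N - 1 = 5

print(col_means)   # [2.5 3.5 4.5]
print(row_means)   # [2. 5.]
print(population_var, sample_var)
```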

## Multi-Indexing, Filtering, and Broadcasted Operations

Recall from the [Pandas article](https://v-poghosyan.github.io/blog/python%20for%20ml/pandas/machine%20learning/2021/12/28/Python-for-ML-Pandas.html) the ways in which we were able to multi-index and filter, and how we eliminated the need for Python loops and list comprehensions by using broadcasted operators instead. Since both a Pandas `Series` and a `DataFrame` are built on top of NumPy's `ndarray`, all of these apply here as well.
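
As a brief reminder of what these look like directly on an `ndarray`, here is a minimal sketch using the matrix `A` from earlier; the other variable names are hypothetical.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

# Filtering: a boolean mask of the same shape selects matching entries.
mask = A > 3
filtered = A[mask]

# Multi-indexing: lists of row and column indices are paired element-wise,
# so this picks the entries at positions (0, 2) and (1, 0).
picked = A[[0, 1], [2, 0]]

# Broadcasted operation: no Python loop or list comprehension needed.
doubled = A * 2

print(filtered)   # [4 5 6]
print(picked)     # [3 4]
print(doubled)
```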