# Numpy


> <font size=+1>Numpy is a scientific standard for handling multidimensional data in Python</font>

__This is one of the most important libraries for Python, if not the most important__, because:
- It is at the core of many scientific stacks:
    - Underlying library for __Pandas__ (we will learn about it later)
    - API parity (or similarity) with __PyTorch__ and __TensorFlow__, two of the main Deep Learning libraries for Python
    - Many third-party libraries implement ideas we will see here, such as dimensionality
    
Its popularity could be attributed to a few key traits:
- Ease of use
- Efficiency: Numpy's core is implemented in C (Python acts as a front-end)
- Intuitive syntax
- It "just works" as you'd expect (and would like it to)



## Installation


We can install `numpy` via `pip` or `conda` (it is available in the main `conda` channel):

```bash
pip install numpy
```

```bash
conda install numpy
```

In [None]:
!pip install numpy



Once installed, the canonical way to import it in Python is giving it the alias `np`, so it will look like:

`import numpy as np`

Let's take a look at some of the most common elements you will find in Numpy.

## np.ndarray


> <font size=+1>`np.ndarray` is a highly efficient data abstraction written in C, with Python bindings for easier usage</font>

Important traits about `np.ndarray`:
- Can have an arbitrary number of dimensions
- Single `dtype` (type of data), usually numeric, for example:
    - `float32` (a.k.a. `float` in C)
    - `float64` (a.k.a. `double` in C); __default__ for floating point values
    - `int32` (a.k.a. `int` in C on most platforms)
- __Has to be "rectangle-like"__:
    - We cannot have `3` lists of different sizes in a single `np.ndarray`

You can create an `ndarray` by calling the `np.ndarray` constructor directly. However, it might not be very intuitive at the beginning: we pass the shape of the array we want, and the values are whatever happened to be in the allocated memory (the array is not initialised, so these are not proper random numbers).

In [None]:
import numpy as np

nd_array1 = np.ndarray((2, 2))
print(nd_array1)

[[-1.49166815e-154 -2.68678217e+154]
 [ 9.88131292e-324  2.78134232e-309]]
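
As a small sketch of the difference between uninitialised and initialised creation: `np.empty` behaves like the constructor call above (memory is allocated but not filled), while `np.zeros` fills the array. The variable names are only illustrative.

```python
import numpy as np

# np.empty allocates memory without filling it, so the contents are arbitrary
uninitialised = np.empty((2, 2))

# np.zeros allocates and fills, so the contents are predictable
initialised = np.zeros((2, 2))

print(uninitialised.shape, uninitialised.dtype)  # only shape and dtype are guaranteed
print(initialised)
```

Prefer a filling routine such as `np.zeros` unless every element is going to be overwritten anyway.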


To create an `ndarray` out of an object we already have, for example a list, we can use the `np.array` function.


### np.array vs np.ndarray


> __`np.array` IS A FACTORY FUNCTION which creates `np.ndarray` (numpy N-dimensional array) objects__

What is a factory function?

> A factory function is a function that builds and returns an object for us. `np.array` always returns an `np.ndarray`, but __its `dtype` and shape depend on the input we pass to it__

Let's see:
- How to create `np.ndarray` object from Python's objects (`list` and `tuple`)
- How the type is inferred based on content
- Uniform presentation of arrays on Python level (`type(array)`)

> __You should always use `np.array` in order to create an array because it infers `dtype` correctly!__

In [None]:
import numpy as np  # always use this alias!

# Defining arrays
arr1d_int = np.array([1, 2, 3, 4])
arr2d_float = np.array(((1, 2, 3, 4), (5, 6, 7, 8.0)))  # Notice 8.0

print(arr1d_int)
print(arr2d_float)
arr1d_int.dtype, type(arr1d_int), arr2d_float.dtype, type(arr2d_float)

[1 2 3 4]
[[1. 2. 3. 4.]
 [5. 6. 7. 8.]]


(dtype('int64'), numpy.ndarray, dtype('float64'), numpy.ndarray)

Notice that, just by adding a float to the array, the whole array now contains solely floats.
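
The inference rule can be checked directly. This is a minimal sketch with illustrative variable names; `np.result_type` reports the type NumPy would pick when two dtypes are combined.

```python
import numpy as np

ints = np.array([1, 2, 3])        # only Python ints -> integer dtype
mixed = np.array([1, 2, 3.0])     # one float is enough -> float64
flags = np.array([True, False])   # only bools -> bool

print(ints.dtype.kind)   # 'i' means signed integer (exact width depends on the platform)
print(mixed.dtype)       # float64
print(flags.dtype)       # bool

# The common type of two dtypes
print(np.result_type(np.int8, np.float32))  # float32
```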

### Changing data type


Sometimes we would like to use a different data type than the one inferred.

There are two basic approaches to achieve that:
- Specifying it during creation
- __Casting__ via `.astype` (__a new array is created, THIS IS NOT DONE IN-PLACE__)

Let's see casting first:

In [None]:
# Failed attempt, new array returned 
arr1d_int.astype("int8")
print(arr1d_int.dtype)

# Correct way, new object is assigned to itself
arr1d_int = arr1d_int.astype("int8")

arr1d_int.dtype

int64


dtype('int8')

In [None]:
# We can also specify it as `np.TYPE` object
new_arr = np.array([1, 2, 3], dtype=np.int8) # or "int8" string
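
Casting can also lose information. As a short sketch (variable names are illustrative), converting floats to integers drops the fractional part, and the result never shares memory with the original:

```python
import numpy as np

floats = np.array([1.7, -1.7, 2.0])
as_ints = floats.astype(np.int64)  # fractional part is dropped (truncation toward zero)

print(as_ints)                             # [ 1 -1  2]
print(floats)                              # the original array is unchanged
print(np.shares_memory(floats, as_ints))   # False
```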

## Data layout

> __`np.ndarray` is kept in memory as a `1D` array of contiguous values__

If so, how can we have, for example, a `3D` array? Numpy has everything stored in a "single line", but each array has an attribute called `strides` that describes how the data is laid out along each dimension.

### strides

> __Strides define HOW MANY BYTES one needs to traverse in order to get to the next element along each dimension__

<p align=center><img src=https://github.com/life-efficient/Data-Engineering/blob/main/2.%20Data%20Formats%20and%20Pandas/0.%20Numpy/images/numpy_memory_layout.png?raw=1 width=600></p>

<p align=center><img src=https://github.com/life-efficient/Data-Engineering/blob/main/2.%20Data%20Formats%20and%20Pandas/0.%20Numpy/images/numpy_strides.svg?raw=1 width=600></p>

Let's see what these are for our two arrays:

In [None]:
print(
    f"""Int1D itemsize: {arr1d_int.itemsize}
Int1D strides: {arr1d_int.strides}
Float2D itemsize: {arr2d_float.itemsize}
Float2D strides: {arr2d_float.strides}
    """
)

Int1D itemsize: 1
Int1D strides: (1,)
Float2D itemsize: 8
Float2D strides: (32, 8)
    


- `itemsize` - specifies how many bytes are used for a single element of the data type
- `strides` - specifies, for each dimension, how many bytes we have to jump in order to move to the next element along that dimension
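
To make the two attributes concrete, here is a hedged sketch that locates one element by hand: the byte offset of element `[i, j]` is `i * strides[0] + j * strides[1]`. The array and the indices are chosen only for illustration.

```python
import numpy as np

arr2d = np.arange(12, dtype=np.int64).reshape(3, 4)
row_stride, col_stride = arr2d.strides           # (32, 8) for int64 with 4 columns

i, j = 2, 1
byte_offset = i * row_stride + j * col_stride    # 2 * 32 + 1 * 8 = 72
flat_index = byte_offset // arr2d.itemsize       # 72 / 8 = 9

print(byte_offset, flat_index)
print(arr2d.ravel()[flat_index] == arr2d[i, j])  # True
```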

In [None]:
# Explain values below based on the code and output

arr = np.arange(9).reshape(3, 3)

print(arr)
print(f'The data type of each element is: {arr.dtype}')
print(f'The length of each element in bytes is: {arr.itemsize}')
print(f'The strides of the data types is: {arr.strides}')

[[0 1 2]
 [3 4 5]
 [6 7 8]]
The data type of each element is: int64
The length of each element in bytes is: 8
The strides of the data types is: (24, 8)


Makes sense, right? The second element in the tuple is the number of bytes we need to "move to the right", and the first element is the number of bytes we need to "move to the next row".

We can also transpose our array. Let's see how this changes our strides:

<p align=center><img src=https://github.com/life-efficient/Data-Engineering/blob/main/2.%20Data%20Formats%20and%20Pandas/0.%20Numpy/images/numpy_strides_transposed.svg?raw=1 width=600></p>

__Take note that__:
- In the picture, our internal data appears to be "moved" around
- __Would we really need that? Wouldn't a change in strides suffice?__ Compare the strides below with the ones above

In [None]:
transposed = arr.T

print(transposed)
transposed.strides

[[0 3 6]
 [1 4 7]
 [2 5 8]]


(8, 24)
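
The swapped strides suggest that transposing does not move any data. A minimal way to check this, using illustrative names, is to ask NumPy whether the two arrays share memory:

```python
import numpy as np

square = np.arange(9).reshape(3, 3)
square_t = square.T

print(np.shares_memory(square, square_t))    # True: the transpose is a view
print(square_t.flags["C_CONTIGUOUS"])        # False: rows are no longer adjacent in memory

# Data is only rearranged when we explicitly ask for a contiguous copy
contiguous = np.ascontiguousarray(square_t)
print(np.shares_memory(square, contiguous))  # False
print(contiguous.strides == square.strides)  # True
```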

## shape


> `<our_array>.shape` returns a tuple with the size of `<our_array>` along each dimension

It is one of the most often used attributes in `numpy` and scientific computing so keep that in mind!
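
A quick sketch of `shape` next to the related attributes `ndim` and `size` (the array is arbitrary):

```python
import numpy as np

tensor = np.zeros((2, 3, 4))

print(tensor.shape)  # (2, 3, 4): size along each dimension
print(tensor.ndim)   # 3: number of dimensions
print(tensor.size)   # 24: total number of elements
print(len(tensor))   # 2: size of the first dimension only
```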


## Creating `np.ndarray`s


Numpy allows us to easily create data in multiple ways, namely:
- __From standard Python structures (`list`s or `tuple`s)__ (possibly nested)
- __Direct creation of `np.ndarray`__ via:
    - random operations (elements are taken from some distribution)
    - using single value (zeros, ones, `eye` with some value)
    
Let's see a few creation routines (__all of them are listed [here](https://numpy.org/doc/stable/reference/routines.array-creation.html)__). Usually, the argument we pass to them is the shape we want the array to have.

In [None]:
ones = np.ones((3, 2)) # 2D matrix filled with ones
zeros = np.zeros_like(ones) # matrix filled with zeros, with the same shape and dtype as `ones`
identity = np.eye(3)

print(ones)
print(f'Shape of "one" is: {ones.shape}')
print(zeros)
print(f'Shape of "zeros" is: {zeros.shape}')
print(identity)
print(f'Shape of "identity" is: {identity.shape}')


[[1. 1.]
 [1. 1.]
 [1. 1.]]
Shape of "one" is: (3, 2)
[[0. 0.]
 [0. 0.]
 [0. 0.]]
Shape of "zeros" is: (3, 2)
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Shape of "identity" is: (3, 3)
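
A few more creation routines from the same list, shown as a brief sketch. `np.arange` and `np.full` are already used elsewhere in this notebook, and `np.linspace` is added here for comparison.

```python
import numpy as np

steps = np.arange(0, 10, 2)        # start, stop (excluded), step
grid = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced points, both ends included
sevens = np.full((2, 3), 7)        # given shape, filled with a single value

print(steps)   # [0 2 4 6 8]
print(grid)    # [0.   0.25 0.5  0.75 1.  ]
print(sevens)
```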


## Creating random np.array


> `numpy` provides means to create random arrays (for example defined by some distribution)

[Here](https://numpy.org/doc/stable/reference/random/index.html) you can see a full list of possibilities,
__all of them are located in `random` module__.

Example usage:

In [None]:
vals = np.random.standard_normal(10)

vals

array([-0.24254289,  0.7083943 , -0.33446633,  0.42426222, -1.99369232,
        0.80625689, -0.54320899,  2.36829641, -0.20734505, -0.59212398])

In [None]:
# Random NORMAL distribution (mean: 0 and stddev: 1)
vals = np.random.randn(3, 4)

vals

array([[-0.74751218,  0.09683241,  0.48627724, -0.60063956],
       [ 1.47518089,  2.11843449, -1.02584517,  0.19524806],
       [ 1.11244104, -0.59861959,  0.61149984,  0.67560329]])

In [None]:
# Random UNIFORM distribution (0, 1 range)
vals = np.random.rand(3, 4)

vals

array([[0.98468015, 0.37304705, 0.53834982, 0.33175601],
       [0.3772029 , 0.5516284 , 0.16610994, 0.2366541 ],
       [0.50624113, 0.64051987, 0.27737038, 0.32808795]])
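
The functions above live directly in `np.random`. NumPy's documentation also describes a `Generator` interface created with `np.random.default_rng`. The following sketch assumes a reasonably recent NumPy version, and the seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seeded generator for reproducible results

normal = rng.standard_normal((3, 4))   # mean 0, stddev 1
uniform = rng.random((3, 4))           # uniform in [0, 1)
scores = rng.integers(0, 11, size=5)   # integers from 0 to 10 (upper bound excluded)

# The same seed reproduces the same numbers
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(normal, rng_again.standard_normal((3, 4))))  # True
```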

## Operations on arrays


> __`numpy` provides a lot of mathematical functions via easy to use notation__

Most of the operations are done "element-wise" (each element with respective elements of the other array):
- addition: `+`
- subtraction: `-`
- multiplication: `*`
- bitwise operations (when the arrays are boolean or integer)

and many others (__see [here](https://scipy-lectures.org/intro/numpy/operations.html) for more examples__)

In [None]:
arr1 = np.full((3, 3), fill_value = 5)
identity_1 = np.eye(arr1.shape[0], dtype='int64') 
print(arr1)
print('    +    ')
print(identity_1)
print('    =    ')
print(arr1 + identity_1)

[[5 5 5]
 [5 5 5]
 [5 5 5]]
    +    
[[1 0 0]
 [0 1 0]
 [0 0 1]]
    =    
[[6 5 5]
 [5 6 5]
 [5 5 6]]
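
Comparisons and bitwise operators are element-wise too. A short sketch with an illustrative array:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])

# Each comparison returns a boolean array; & combines them element by element
in_range = (values > 2) & (values < 6)

print(in_range)          # [False False  True  True  True False]
print(~in_range)         # element-wise negation
print(values * values)   # [ 1  4  9 16 25 36]
```

The parentheses around each comparison are needed because `&` binds more tightly than `>` and `<` in Python.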


## Mathematical functions


> `numpy` provides a lot of math functions (e.g. trigonometric)

Traits:
- Works on any array (usually element-wise) with some edge-case exceptions
- Optimized C implementations
- Provided in the `np` namespace

> All of the available operations are listed [here](https://numpy.org/doc/stable/reference/routines.math.html)

Let's see an example below:

In [None]:
# np.e and np.pi are predefined constants

(np.cos(arr1) - np.sin(arr1) ** 3) / (np.eye(arr1.shape[0]) * np.e + 0.1)

array([[ 0.41352406, 11.65427351, 11.65427351],
       [11.65427351,  0.41352406, 11.65427351],
       [11.65427351, 11.65427351,  0.41352406]])

## Linear algebra operations

> `numpy` provides linear algebra functionalities __located within `np.linalg` submodule__

See [here](https://numpy.org/doc/stable/reference/routines.linalg.html) for available functionalities.

Some of them are provided as overloaded operations, namely __matrix multiplication__:

In [None]:
# Inner dimensions must match!

X, y = np.random.randn(10, 5), np.random.randn(6, 3)

(X @ y).shape

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 6 is different from 5)
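
For contrast, here is a sketch of a multiplication whose inner dimensions do match: `(10, 5) @ (5, 3)` gives `(10, 3)`. The names `features` and `weights` are illustrative only.

```python
import numpy as np

features = np.random.randn(10, 5)
weights = np.random.randn(5, 3)

product = features @ weights   # inner dimensions (5 and 5) match

print(product.shape)                                        # (10, 3)
print(np.allclose(product, np.matmul(features, weights)))   # True: @ is matmul
```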

In [None]:
x = np.ndarray(2, 3)  # wrong: the shape has to be a single tuple, so 3 is read as the dtype

TypeError: Cannot interpret '3' as a data type

In [None]:
A = np.random.randn(10)
B = np.random.randn(10)

np.inner(A, B)

1.7650168848086591

In [None]:
# Eigenvalues and eigenvectors

np.linalg.eig(np.random.randn(10, 10))

(array([ 0.00653762+2.73914504j,  0.00653762-2.73914504j,
        -1.45754244+1.25433401j, -1.45754244-1.25433401j,
        -0.53901557+1.05835501j, -0.53901557-1.05835501j,
         2.6212444 +0.85592199j,  2.6212444 -0.85592199j,
         0.65357259+0.j        ,  1.35407656+0.j        ]),
 array([[ 5.63840421e-01+0.j        ,  5.63840421e-01-0.j        ,
          4.43458617e-01+0.j        ,  4.43458617e-01-0.j        ,
         -1.42880129e-01+0.18902739j, -1.42880129e-01-0.18902739j,
         -4.06683218e-02+0.03470009j, -4.06683218e-02-0.03470009j,
          1.03792994e-01+0.j        ,  1.04679952e-01+0.j        ],
        [ 3.65214582e-02-0.25958872j,  3.65214582e-02+0.25958872j,
          2.46625247e-01+0.06615217j,  2.46625247e-01-0.06615217j,
          3.58238532e-01+0.27337585j,  3.58238532e-01-0.27337585j,
          2.41131070e-01+0.10407014j,  2.41131070e-01-0.10407014j,
         -2.50490697e-01+0.j        ,  4.24520201e-01+0.j        ],
        [-1.37788927e-02+0.12618169j

## Accessing elements


> `numpy` allows us to access data in multiple ways

Before we move on to accessing data (and more advanced ways to do that), you should keep the following in mind (__all the time!__):
- __ALWAYS USE `numpy` OPERATIONS__
- __NO FOR LOOPS KNOWN FROM PYTHON__ (every operation you do should be done purely in `numpy`)
- You will learn more and more ways to avoid loops as we go through the course materials
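
As a small illustration of the rule, both versions below compute the same sum of squares, but the second one leaves the loop to `numpy`. The array contents are arbitrary.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Python loop: one interpreter step per element
loop_total = 0.0
for v in values:
    loop_total += v * v

# Vectorised: the loop runs inside numpy's compiled code
vectorised_total = np.sum(values ** 2)

print(np.isclose(loop_total, vectorised_total))  # True
```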

## Standard index-based access

First, let's create a `2D` array we will use:

In [None]:
matrix = np.arange(20).reshape(5, 4)

matrix

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [None]:
# Obtaining single element

matrix[0, 0], matrix[1, 0], matrix[0, 1], matrix[2][0]

(0, 4, 1, 8)

In [None]:
# first row

matrix[0]

array([0, 1, 2, 3])

In [None]:
# : means all elements
# 0 means 0th column

matrix[:, 0]

array([ 0,  4,  8, 12, 16])

In [None]:
# Rows from index 2 onwards
matrix[2:]

array([[ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [None]:
# Columns from index 2 onwards
matrix[:, 2:]

array([[ 2,  3],
       [ 6,  7],
       [10, 11],
       [14, 15],
       [18, 19]])

In [None]:
# Rows 0 to 2 (index 3 is excluded), column 0
matrix[:3, 0]

array([0, 4, 8])

In [None]:
# Reverse the order of columns

matrix[:, ::-1]

array([[ 3,  2,  1,  0],
       [ 7,  6,  5,  4],
       [11, 10,  9,  8],
       [15, 14, 13, 12],
       [19, 18, 17, 16]])

In [None]:
matrix[::-1, ::-1]

array([[19, 18, 17, 16],
       [15, 14, 13, 12],
       [11, 10,  9,  8],
       [ 7,  6,  5,  4],
       [ 3,  2,  1,  0]])

In [None]:
matrix[::-3, ::-2]

array([[19, 17],
       [ 7,  5]])

In [None]:
# change to 3D tensor
temp = matrix.reshape(2, 2, -1)
temp.shape

(2, 2, 5)

In [None]:
# Same as temp[:, :, -1], i.e. the last element along the last dimension
# The remaining dimensions are left intact
# Result has shape (2, 2), as the last dimension (of size 5) is indexed away

temp[..., -1]

array([[ 4,  9],
       [14, 19]])

## Fancy indexing

One of `numpy`'s killer features:

> __Fancy indexing allows us to choose elements FROM ANY DIMENSION based on indices we provide__

![](https://github.com/life-efficient/Data-Engineering/blob/main/2.%20Data%20Formats%20and%20Pandas/0.%20Numpy/images/numpy_fancy_indexing.png?raw=1)

Once again, we will use our `2D` matrix:

In [None]:
matrix

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [None]:
# Column 0 and 2
matrix[:, [0, 2]]

array([[ 0,  2],
       [ 4,  6],
       [ 8, 10],
       [12, 14],
       [16, 18]])

In [None]:
# Take last row twice

matrix[[-1, -1]]

array([[16, 17, 18, 19],
       [16, 17, 18, 19]])

In [None]:
# Shuffle rows using indices

indices = np.arange(matrix.shape[0])
print(indices)
permuted = np.random.permutation(indices)
print(permuted)

matrix[permuted]

[0 1 2 3 4]
[1 4 3 0 2]


array([[ 4,  5,  6,  7],
       [16, 17, 18, 19],
       [12, 13, 14, 15],
       [ 0,  1,  2,  3],
       [ 8,  9, 10, 11]])

In [None]:
# Obtain rows based on boolean values

matrix[[True, True, False, True, False]]

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [12, 13, 14, 15]])

In [None]:
# Obtaining only elements which fulfill condition
# In this case elements larger than 5

print(matrix > 5)

# Result is flat, as the selected elements no longer form a rectangular structure
matrix[matrix > 5]

[[False False False False]
 [False False  True  True]
 [ True  True  True  True]
 [ True  True  True  True]
 [ True  True  True  True]]


array([ 6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
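
A boolean mask can also be used on the left-hand side of an assignment, and `np.where` picks between two values per element. This sketch rebuilds the same `matrix` so that it is self-contained. The cap of `5` and the `0`/`1` labels are arbitrary choices.

```python
import numpy as np

matrix = np.arange(20).reshape(5, 4)

# Assign through a mask: cap every element larger than 5 (on a copy)
capped = matrix.copy()
capped[capped > 5] = 5

# np.where(condition, value_if_true, value_if_false) keeps the original shape
labels = np.where(matrix > 5, 1, 0)

print(capped.max())   # 5
print(labels.shape)   # (5, 4)
print(labels.sum())   # 14 elements were larger than 5
```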

## Reshape and view

> __`.reshape(x, y, z)` will change the way we access our array__

It is important to note that:
- reshape __USUALLY DOES NOT COPY UNDERLYING DATA__ (it is merely changing `strides` and the way we access it)
- __COPY OF `np.ndarray`s IS USUALLY NOT DONE__ (unless necessary)
- It almost never creates any problem for us (as long as we're working with `numpy` reasonably)

The first option (without a copy) is called a __`view`__, while the other one is called a __`copy`__.

![](https://github.com/life-efficient/Data-Engineering/blob/main/2.%20Data%20Formats%20and%20Pandas/0.%20Numpy/images/numpy_copy_view.png?raw=1)

What does "working reasonably" mean?
- __After reshaping DON'T CHANGE ELEMENTS IN EITHER OF THE VIEWS__
- Use them in "functional" manner returning new objects (e.g. addition after reshape)
- See examples below

In [None]:
# elements 0-17 reshaped into shape (3, 2, 3)

arr = np.arange(18)

print(arr.shape, arr.strides)

reshaped = arr.reshape(3, 2, -1)

print(reshaped.shape, reshaped.strides)

print(f"Sharing underlying memory: {np.may_share_memory(arr, reshaped)}")

(18,) (8,)
(3, 2, 3) (48, 24, 8)
Sharing underlying memory: True


In [None]:
# Will change both arrays
arr[7] = 99999.

print(arr)
print(reshaped)

[    0     1     2     3     4     5     6 99999     8     9    10    11
    12    13    14    15    16    17]
[[[    0     1     2]
  [    3     4     5]]

 [[    6 99999     8]
  [    9    10    11]]

 [[   12    13    14]
  [   15    16    17]]]
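
When an independent array is needed, an explicit `.copy()` breaks the link. One way to tell the two cases apart, sketched here with illustrative names, is the `.base` attribute: a view points at the array that owns the data, while a copy owns its own data.

```python
import numpy as np

original = np.arange(6)
view = original.reshape(2, 3)                 # shares memory with `original`
independent = original.reshape(2, 3).copy()   # owns its own memory

print(view.base is original)      # True
print(independent.base is None)   # True

original[0] = 100
print(view[0, 0], independent[0, 0])   # 100 0
```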


In [None]:
# Correct usage, will not change underlying memory
# View will be used to multiply values within X1

X1 = np.random.randn(128, 10)

X2 = np.random.rand(1280)

X1 * X2.reshape(X1.shape)

array([[ 0.06607302, -0.03757907, -0.22012916, ..., -0.33735095,
         0.12948773, -0.32322123],
       [ 0.40553952, -0.10047234, -0.07102722, ..., -0.2595114 ,
         0.29522957,  0.04071672],
       [-0.11908935, -0.5060945 , -0.01849313, ...,  0.45426441,
         0.13575486,  0.35651088],
       ...,
       [-0.19608282,  0.14517763,  1.42295067, ...,  0.06966209,
         1.05039197,  0.7748389 ],
       [ 0.96215246, -0.0456957 ,  0.47949846, ...,  0.33081944,
        -0.20928297,  0.37043871],
       [ 0.84462781, -0.59371553,  0.01023492, ...,  1.03324647,
        -0.1447829 ,  0.06648208]])

## -1 in reshape

> `-1` is used in order to __infer__ missing dimensionality

It is pretty useful when:
- __we don't know some dimension beforehand__
- __we write a function that has to work independently of some dimension__

Let's see a dummy example:

In [None]:
np.random.randn(5, 6, 8).reshape(-1, 10).shape

(24, 10)

In [None]:
def make_second_dimension_10(array):
    assert array.size % 10 == 0, "Number of array elements has to be divisible by 10"
    return array.reshape(-1, 10)


print(make_second_dimension_10(np.random.randn(5, 6, 8)).shape)
make_second_dimension_10(np.random.randn(120)).shape

(24, 10)


(12, 10)
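
A common use of `-1`, sketched under the assumption that the first dimension is a batch of samples: keep that dimension and flatten everything else, whatever its size happens to be.

```python
import numpy as np

batch = np.zeros((32, 28, 28))   # e.g. 32 samples of 28 x 28 values (illustrative sizes)

# Keep the first dimension, let numpy infer the rest: 28 * 28 = 784
flat_per_sample = batch.reshape(batch.shape[0], -1)

# A single -1 flattens the whole array
fully_flat = batch.reshape(-1)

print(flat_per_sample.shape)   # (32, 784)
print(fully_flat.shape)        # (25088,)
```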

# Broadcasting

After `fancy indexing` and `reshape`, the third killer feature of `numpy` is introduced:

> __Broadcasting means automatic expansion of smaller array to a larger one__

![](https://github.com/life-efficient/Data-Engineering/blob/main/2.%20Data%20Formats%20and%20Pandas/0.%20Numpy/images/numpy_broadcasting.png?raw=1)

Looking at the picture above:
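- __Arrays have to be expandable__, e.g.:
    - `(3, 10)` and `(10,)`: shapes are aligned from the right, so the second one is treated as `(1, 10)` and expanded to `(3, 10)`
    - `(3, 10)` and `(3,)` __WILL NOT WORK__ as the last dimensions (`10` and `3`) do not match
    - We have to reshape the second array to `(3, 1)`, so the dimension of size `1` will be expanded to `10`
- __Dimensions have to match__ (example above): compared from the right, each pair has to be equal or one of them has to be `1`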
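
NumPy compares shapes from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is `1`. The sketch below checks a few shape pairs without creating any data. It relies on `np.broadcast_shapes`, which is assumed to be available (it was added in NumPy 1.20).

```python
import numpy as np

# Trailing dimensions match (10 and 10); the missing leading one is treated as 1
print(np.broadcast_shapes((3, 10), (10,)))    # (3, 10)

# A dimension of size 1 is stretched to match the other array
print(np.broadcast_shapes((3, 10), (3, 1)))   # (3, 10)

# Trailing dimensions 10 and 3 are different and neither is 1
try:
    np.broadcast_shapes((3, 10), (3,))
    compatible = True
except ValueError:
    compatible = False
print(compatible)                             # False
```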

Let's see a few examples:

In [None]:
import numpy as np
(np.array([[1], [2], [3]]) * np.array([[1, 2]])).shape

(3, 2)

In [None]:
# Broadcasting for both arrays

arr1 = np.random.randn(10, 3)
arr2 = np.random.randn(10, 5)

result = arr1.reshape(-1, 1, 3) * arr2.reshape(10, -1, 1)
result.shape

(10, 5, 3)

In [None]:
# Will not work
a = np.random.randn(1, 10)
b = np.random.randn(3)

a + b

ValueError: operands could not be broadcast together with shapes (1,10) (3,) 

In [None]:
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([0, 2, 0]).reshape(3, 1)

x * y

array([[ 0,  0,  0],
       [ 8, 10, 12],
       [ 0,  0,  0]])

In [None]:
a = np.random.randn(3, 3)
b = np.random.randn(3)

a - b

array([[-0.31557747, -3.0836088 , -2.65585039],
       [ 1.21277083, -2.12886569,  0.18391569],
       [-1.09073187, -0.96382431, -2.76317141]])

# Working with shapes

`numpy` is a framework which allows us to work with `N` dimensional arrays.

Due to that, we should try to __think in terms of shapes__, not in terms of specific elements.

Throughout the course you will often see (also today) that we will define many tasks in terms of __dimensions__ and __what each dimension represents__.


An example could be data of shape `(users, movies)`, which holds the rating given:
- by every user (rows)
- for every movie (columns)

Visually (assume `?` are equal to zero):

![](https://github.com/life-efficient/Data-Engineering/blob/main/2.%20Data%20Formats%20and%20Pandas/0.%20Numpy/images/numpy_example_matrix.png?raw=1)

Let's create such data and see operations one can do on it:

In [None]:
import numpy as np

users = 24
movies = 10

data = np.random.randint(0, 11, size=(users, movies)) # 11 as it's one more than maximum 10 score

data

array([[ 8,  0,  9,  7, 10,  6,  0,  7,  1,  3],
       [ 4,  8,  0,  2,  3, 10,  6,  8,  4,  5],
       [ 4,  5,  0,  6, 10,  3,  6,  6,  5,  4],
       [ 6,  2,  4,  9, 10,  4,  5,  2,  6,  7],
       [ 1,  2,  8,  0, 10,  2,  4,  8,  1, 10],
       [ 9, 10,  7,  8,  5,  4,  6,  4,  8,  9],
       [10,  3,  6,  7,  6,  9,  9,  4,  1,  1],
       [ 5,  3,  0,  9,  5,  2,  2, 10,  8,  0],
       [ 9,  8,  5,  2,  2,  7,  3,  6,  3,  4],
       [ 2, 10,  6,  3,  8,  1,  7,  1, 10,  4],
       [ 9,  3,  0,  8,  1,  4,  5, 10,  0,  4],
       [ 1,  4,  1,  1,  7,  5,  6,  5,  4,  9],
       [ 3,  9,  6,  6,  6,  4,  5,  6,  6, 10],
       [ 3,  3,  1,  8,  8,  6,  5,  3,  3,  8],
       [10,  7,  5,  9,  8,  6,  4,  8,  5,  1],
       [ 6,  4, 10,  0,  8,  4,  1,  4,  0,  8],
       [ 5, 10, 10,  4,  3, 10,  2,  3, 10,  1],
       [ 2,  4, 10,  6, 10,  3, 10,  0,  7,  5],
       [ 9,  8,  1,  9,  3,  9,  9,  5,  9,  9],
       [ 3,  9,  1,  2,  4,  9,  2,  7,  9,  1],
       [10,  8,  7, 

__Please notice__:
- If we just look at the numbers they do not convey too much information
- If, instead, we think about what the dimensions represent, we can more easily reason about various operations.

> __Most `numpy` math (and not only math) operations allow us to specify an `axis` argument__

> __`axis` allows us to carry out an operation across a specific dimension__

__TIPS:__

- __WRITE DATA SHAPES IN CODE COMMENTS AS YOU APPLY TRANSFORMATIONS__
- __THE DIMENSION ACROSS WHICH WE CARRY OUT THE OPERATION IS USUALLY REMOVED__



Let's see how one could __find average rating for each user__:

In [None]:
# data: (users, movies)

# total_ratings: (users,)
total_ratings = data.sum(axis=1) # sum over the movies axis (across the columns)

# mean_ratings: (users,)
mean_ratings = total_ratings / data.shape[1] # divide by total number of available movies

mean_ratings

array([5.1, 5. , 4.9, 5.5, 4.6, 7. , 5.6, 4.4, 4.9, 5.2, 4.4, 4.3, 6.1,
       4.8, 6.3, 4.5, 5.8, 5.7, 7.1, 4.7, 4.8, 3.4, 6.1, 4. ])

Average rating for a movie (__almost the same as previously, just changing dimensions!__):

In [None]:
# data: (users, movies)

# total_ratings: (movies,)
total_ratings = data.sum(axis=0) # sum over the users axis (down the rows)

# mean_ratings: (movies,)
mean_ratings = total_ratings / data.shape[0] # divide by total number of users which gave the movie rating

mean_ratings

array([5.375     , 5.29166667, 4.70833333, 5.5       , 6.08333333,
       4.875     , 4.75      , 5.20833333, 4.91666667, 5.04166667])
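
The sum-then-divide pattern can also be written with `mean` and an `axis`. The sketch below uses its own randomly generated `ratings` array (an assumption, so it does not touch `data`) and the `keepdims` argument, which keeps the reduced dimension with size `1` so that the result broadcasts back against the original.

```python
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.integers(0, 11, size=(24, 10))          # (users, movies)

user_mean = ratings.mean(axis=1)                      # (users,)
user_mean_col = ratings.mean(axis=1, keepdims=True)   # (users, 1)

# Subtract each user's own average from their ratings via broadcasting
centred = ratings - user_mean_col                     # (users, movies)

print(user_mean.shape, user_mean_col.shape, centred.shape)
print(np.allclose(user_mean, ratings.sum(axis=1) / ratings.shape[1]))   # True
```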

Highest rating given to any movie by each user:

In [None]:
data.max(axis=1)

array([10, 10, 10, 10, 10, 10, 10, 10,  9, 10, 10,  9, 10,  8, 10, 10, 10,
       10,  9,  9, 10,  9, 10, 10])

Which movie (__movie index__) got the lowest score for each user:

In [None]:
data.argmin(axis=1)

array([1, 2, 2, 1, 3, 5, 8, 2, 3, 5, 2, 0, 0, 2, 9, 3, 9, 7, 2, 2, 5, 0,
       8, 1])

And which movie was most often scored the lowest amongst all users:

In [None]:
# Movie which got the lowest score per-user

lowest = data.argmin(axis=1) # (users, )

# Calculate how often each movie index occurred as the lowest rated one
# minlength specifies number of entries (10 in our case as there are 10 movies)

counts = np.bincount(lowest, minlength=data.shape[1]) # (movies,)

# Get the movie which was rated lowest most frequently:

np.argmax(counts) # scalar (movie index)

2

# Summary

What we have learned in this notebook:

- How to create NumPy arrays and perform mathematical operations on them.
- Different ways to index and slice arrays to get the data we need.
- How to reshape arrays to get the dimensions we want.
- How broadcasting expands smaller arrays so they can be combined with larger ones.
- How to apply operations along an array's axis to analyse data.