<a href="https://colab.research.google.com/github/ranvirsahota/AiCore/blob/pandas/8_numpy_reshape_and_broadcasting/notebook_lesson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Reshape and view

> __`.reshape(x, y, z)` will change the way we access our array__

It is important to note that:
- reshape __USUALLY DOES NOT COPY UNDERLYING DATA__ (it is merely changing `strides` and the way we access it)
- __COPY OF `np.ndarray`s IS USUALLY NOT DONE__ (unless necessary)
- It almost never creates any problem for us (as long as we're working with `numpy` reasonably)

An array that shares memory with the original (no copy) is called a __`view`__, while an array that owns its own data is called a __`copy`__.

![](https://github.com/AI-Core/Content-Public/blob/main/Content/units/Data-Handling/1.%20Numpy/2.%20Numpy%20-%20Reshape%20and%20Broadcasting/images/numpy_copy_view.png?raw=1)
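
The difference between a view and a copy can be checked directly. Below is a minimal sketch using `np.shares_memory` and the `.base` attribute; the array names are only illustrative. It also shows one case where `reshape` has to copy: flattening a transposed (non-contiguous) array.

```python
import numpy as np

arr = np.arange(6)

# reshape of a contiguous array returns a view: no data is copied
view = arr.reshape(2, 3)
print(np.shares_memory(arr, view))    # True
print(view.base is arr)               # True, the view points back to arr

# .copy() always allocates new memory
copied = arr.reshape(2, 3).copy()
print(np.shares_memory(arr, copied))  # False

# reshape has to copy when the new layout cannot be expressed with strides,
# e.g. when flattening a transposed array
flat = view.T.reshape(-1)
print(np.shares_memory(arr, flat))    # False
```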

What does "working reasonably" mean?
- __After reshaping DON'T CHANGE ELEMENTS IN EITHER OF THE VIEWS__
- Use them in "functional" manner returning new objects (e.g. addition after reshape)
- See examples below

In [12]:
# elements 0-17 reshaped into shape (3, 2, 3)
import numpy as np

arr = np.arange(18)

print(arr.shape, arr.strides)

reshaped = arr.reshape(3, 2, -1)

print(reshaped.shape, reshaped.strides)

print(f"Sharing underlying memory: {np.may_share_memory(arr, reshaped)}")

(18,) (8,)
(3, 2, 3) (48, 24, 8)
Sharing underlying memory: True
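
The strides printed above are the number of bytes to step in memory to move by one element along each dimension. As a rough sketch, they can be recomputed by hand for a C-contiguous array; `c_order_strides` is a hypothetical helper written for this illustration, and `dtype=np.int64` is set explicitly so that each element takes 8 bytes on every platform.

```python
import numpy as np

def c_order_strides(shape, itemsize):
    """Bytes to step along each dimension of a C-contiguous array."""
    strides = []
    step = itemsize
    # walk from the last dimension to the first;
    # the step grows by the size of each dimension we pass
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

arr = np.arange(18, dtype=np.int64)  # 8 bytes per element
reshaped = arr.reshape(3, 2, 3)

print(c_order_strides(reshaped.shape, arr.itemsize))  # (48, 24, 8)
print(reshaped.strides)                               # (48, 24, 8)
```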


In [14]:
# Changing an element of arr also changes reshaped, because they share memory
arr[7] = 99999

print(arr)
print(reshaped)

[    0     1     2     3     4     5     6 99999     8     9    10    11
    12    13    14    15    16    17]
[[[    0     1     2]
  [    3     4     5]]

 [[    6 99999     8]
  [    9    10    11]]

 [[   12    13    14]
  [   15    16    17]]]


In [15]:
# Correct usage: the view is only read, so the underlying memory is not changed
# The reshaped view of X2 is multiplied element-wise with X1, which returns a new array

X1 = np.random.randn(128, 10)

X2 = np.random.rand(1280)

X1 * X2.reshape(X1.shape)

array([[ 1.09695311, -0.57103464, -0.37362709, ...,  0.18364325,
         0.2739754 , -0.23963659],
       [-0.43891596,  0.52566686,  0.04451527, ...,  0.02906756,
         0.15693211, -0.01929025],
       [ 0.03280199,  0.49542968,  0.11077619, ..., -0.07418581,
        -0.05601378,  0.28917859],
       ...,
       [-1.09293388, -0.40692231,  0.4142739 , ...,  0.45215304,
         1.04179811, -0.1182217 ],
       [-0.84412205,  0.5119517 ,  0.03951486, ..., -0.01404378,
         0.48855078,  0.06924123],
       [-0.45700664,  0.96316058,  0.06289761, ..., -0.14675977,
        -0.16111533, -0.43377249]])

## -1 in reshape

> `-1` is used in order to __infer__ missing dimensionality

It is pretty useful when:
- __we don't know some dimension beforehand__
- __we write a function that has to work independently of some dimension__

Let's see a dummy example:

In [17]:
np.random.randn(5, 6, 8).reshape(-1, 10).shape

(24, 10)

In [18]:
def make_second_dimension_10(array):
    assert array.size % 10 == 0, "Number of array elements has to be divisible by 10"
    return array.reshape(-1, 10)


print(make_second_dimension_10(np.random.randn(5, 6, 8)).shape)
make_second_dimension_10(np.random.randn(120)).shape

(24, 10)


(12, 10)
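
A common use of `-1` is to keep one dimension fixed and merge all the others, whatever their number. The sketch below uses a hypothetical helper, `flatten_per_sample`, and made-up input shapes to show the idea.

```python
import numpy as np

def flatten_per_sample(batch):
    """Keep the first dimension and merge all remaining ones into a single axis."""
    # batch: (samples, ...) -> (samples, features)
    return batch.reshape(batch.shape[0], -1)

print(flatten_per_sample(np.zeros((32, 28, 28))).shape)  # (32, 784)
print(flatten_per_sample(np.zeros((5, 3, 4, 2))).shape)  # (5, 24)
```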

# Broadcasting

After explaining `fancy indexing` and `reshape`, let's take a look at a third, powerful feature of `numpy`:

> __Broadcasting means automatic expansion of a smaller array to match a larger one__

![](https://github.com/AI-Core/Content-Public/blob/main/Content/units/Data-Handling/1.%20Numpy/2.%20Numpy%20-%20Reshape%20and%20Broadcasting/images/numpy_broadcasting.png?raw=1)

Looking at the picture above:
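- __Arrays have to be expandable__. Shapes are compared starting from the last (trailing) dimension, e.g.:
    - `(3, 10)` and `(10,)` work: the second one is treated as `(1, 10)` and its size-1 dimension is expanded to `3`
    - `(3, 10)` and `(3,)` __WILL NOT WORK__ as the trailing dimensions (`10` and `3`) do not match
    - We have to reshape the second array to `(3, 1)`, so its size-1 dimension will be expanded to `10`
- __Dimensions have to match__ or one of them has to be `1` (examples above)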
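
Shape compatibility can be checked without creating any arrays. NumPy aligns shapes from the trailing dimension, and two dimensions are compatible when they are equal or one of them is `1`. The sketch below assumes NumPy 1.20 or newer, where `np.broadcast_shapes` is available; `broadcast_shape_or_none` is a hypothetical helper written for this illustration.

```python
import numpy as np

def broadcast_shape_or_none(*shapes):
    """Return the broadcast shape, or None when the shapes are incompatible."""
    try:
        return np.broadcast_shapes(*shapes)
    except ValueError:
        return None

print(broadcast_shape_or_none((3, 10), (10,)))          # (3, 10)
print(broadcast_shape_or_none((3, 10), (3,)))           # None
print(broadcast_shape_or_none((3, 10), (3, 1)))         # (3, 10)
print(broadcast_shape_or_none((10, 1, 3), (10, 5, 1)))  # (10, 5, 3)
```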

Let's see a few examples:

In [19]:
import numpy as np
(np.array([[1], [2], [3]]) * np.array([[1, 2]])).shape

(3, 2)

In [20]:
# Broadcasting for both arrays

arr1 = np.random.randn(10, 3)
arr2 = np.random.randn(10, 5)

result = arr1.reshape(-1, 1, 3) * arr2.reshape(10, -1, 1)
result.shape

(10, 5, 3)

In [21]:
# Will not work
a = np.random.randn(1, 10)
b = np.random.randn(3)

a + b

ValueError: operands could not be broadcast together with shapes (1,10) (3,)
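
One way to make the failing example work is to turn `b` into a column, so that its size-1 dimension lines up with the `10` in `a`. A minimal sketch:

```python
import numpy as np

a = np.random.randn(1, 10)
b = np.random.randn(3)

# b: (3,) -> (3, 1); now (1, 10) and (3, 1) broadcast to (3, 10)
result = a + b.reshape(3, 1)
print(result.shape)  # (3, 10)

# equivalent spelling with np.newaxis
print((a + b[:, np.newaxis]).shape)  # (3, 10)
```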

In [22]:
x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([0, 2, 0]).reshape(3, 1)

x * y

array([[ 0,  0,  0],
       [ 8, 10, 12],
       [ 0,  0,  0]])

In [23]:
a = np.random.randn(3, 3)
b = np.random.randn(3)

a - b

array([[ 0.9851471 , -0.0893894 , -0.38150211],
       [ 2.21758504,  1.35615692, -0.99540993],
       [ 1.09978655,  1.20897105, -1.70510513]])
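
With a `(3, 3)` array and a `(3,)` array it is easy to lose track of which direction the smaller array is applied in. The sketch below uses small hand-picked values (not the random ones above) to show both directions.

```python
import numpy as np

a = np.array([[10, 20, 30],
              [40, 50, 60],
              [70, 80, 90]])
b = np.array([1, 2, 3])

# b: (3,) is treated as (1, 3): b[j] is subtracted from column j
print(a - b)
# [[ 9 18 27]
#  [39 48 57]
#  [69 78 87]]

# b: (3, 1): b[i] is subtracted from every element of row i
print(a - b.reshape(3, 1))
# [[ 9 19 29]
#  [38 48 58]
#  [67 77 87]]
```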

# Working with shapes

`numpy` is a library which allows us to work with `N` dimensional arrays.

Due to that, we should try to __think in terms of shapes__, not in terms of specific elements.

Throughout the course you will often see (also today) that we will define many tasks in terms of __dimensions__ and __what each dimension represents__.


An example could be data of shape `(users, movies)`, where each entry is the rating a user gave to a movie:
- One row for every user
- One column for every movie

Visually (assume `?` are equal to zero):

![](https://github.com/AI-Core/Content-Public/blob/main/Content/units/Data-Handling/1.%20Numpy/2.%20Numpy%20-%20Reshape%20and%20Broadcasting/images/numpy_example_matrix.png?raw=1)

Let's create such data and see operations one can do on it:

In [24]:
import numpy as np

users = 24
movies = 10

data = np.random.randint(0, 11, size=(users, movies)) # 11 as it's one more than maximum 10 score

data

array([[ 6,  0, 10,  1,  1,  1,  9, 10,  6,  3],
       [ 5,  6,  8,  7,  3,  4,  6,  8,  1,  8],
       [10,  5,  4,  4,  1,  4,  9,  6, 10,  1],
       [ 5,  2, 10,  6,  5,  6,  7,  1,  8,  1],
       [ 9, 10,  7, 10,  0,  8,  5,  5,  8,  9],
       [ 8,  9,  2,  0,  4,  5,  9,  1,  2,  6],
       [ 9,  3,  8,  8,  9,  3,  8, 10,  6,  9],
       [ 1,  2,  4,  3,  3,  2, 10,  6,  4,  5],
       [ 2,  3,  1,  3,  7,  9, 10,  2, 10,  1],
       [ 7,  3,  8,  0,  8,  8,  0, 10,  9,  0],
       [ 5,  2, 10,  6,  5,  2,  1,  7,  4,  0],
       [ 8,  9,  2, 10,  1,  6,  0,  3,  3,  1],
       [ 0,  8,  2,  9,  3,  2,  8,  9,  4,  8],
       [10,  8,  8,  1,  7,  0,  5,  2,  5,  0],
       [ 7,  4, 10,  3,  1,  2,  5, 10,  4, 10],
       [ 4,  1,  8,  7,  7,  5,  9,  9,  5,  1],
       [ 5,  3,  6,  8,  7,  9,  0,  5,  6,  8],
       [ 8,  6,  9,  1,  6,  7,  0,  0,  2,  5],
       [ 6,  3,  2,  3,  0,  5,  1,  0,  1,  0],
       [ 0,  5,  5,  8,  5,  3,  0,  1,  7,  0],
       [ 2,  9,  0, 

__Please notice__:
- If we just look at the numbers alone, they do not convey too much information
- If, instead, we think about what the dimensions represent, we can more easily reason about various operations

> __Most `numpy` math (and not only math) operations allow us to specify an `axis` argument__

> __`axis` allows us to carry out an operation across a specific dimension__

__TIPS:__

- __WRITE DATA SHAPES IN CODE COMMENTS AS YOU APPLY TRANSFORMATIONS__
- __THE DIMENSION ACROSS WHICH WE CARRY OUT THE OPERATION IS OFTEN REMOVED__
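
The second tip can be seen directly in the shapes. The sketch below uses a tiny hand-made `(users, movies)` array and also shows `keepdims=True`, which keeps the reduced dimension with size 1 so that the result can be broadcast against the original data.

```python
import numpy as np

data = np.array([[10, 8, 6],
                 [ 2, 4, 6]])  # (users, movies) = (2, 3)

# the reduced axis disappears by default
print(data.mean(axis=1).shape)                # (2,)

# keepdims=True keeps it as a size-1 axis, ready for broadcasting
user_mean = data.mean(axis=1, keepdims=True)  # (users, 1)
print(user_mean.shape)                        # (2, 1)

# centered: (users, movies), each user's ratings relative to their own mean
centered = data - user_mean
print(centered)
# [[ 2.  0. -2.]
#  [-2.  0.  2.]]
```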



Let's see how one could __find the average rating for each user__:

In [25]:
# data: (users, movies)

# total_ratings: (users,)
total_ratings = data.sum(axis=1) # sum across the columns (over all movies)

# mean_ratings: (users,)
mean_ratings = total_ratings / data.shape[1] # divide by the total number of movies

mean_ratings

array([4.7, 5.6, 5.4, 5.1, 7.1, 4.6, 7.3, 4. , 4.8, 5.3, 4.2, 4.3, 5.3,
       4.6, 5.6, 5.6, 5.7, 4.4, 2.1, 3.4, 3.4, 3.9, 5.1, 4.4])

Average rating for a movie (__almost the same as previously, just changing dimensions!__):

In [26]:
# data: (users, movies)

# total_ratings: (movies,)
total_ratings = data.sum(axis=0) # sum across the rows (over all users)

# mean_ratings: (movies,)
mean_ratings = total_ratings / data.shape[0] # divide by the total number of users who rated the movie

mean_ratings

array([5.70833333, 4.41666667, 5.66666667, 5.25      , 4.25      ,
       4.54166667, 4.79166667, 4.83333333, 4.66666667, 4.16666667])
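
The sum-then-divide steps above can also be written with `mean` and an `axis` argument. A small sketch, using freshly generated random ratings (so the numbers differ from the ones printed above):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 11, size=(24, 10))  # (users, movies)

per_user = data.mean(axis=1)   # (users,)
per_movie = data.mean(axis=0)  # (movies,)

# same result as the sum-then-divide version
assert np.allclose(per_user, data.sum(axis=1) / data.shape[1])
assert np.allclose(per_movie, data.sum(axis=0) / data.shape[0])

print(per_user.shape, per_movie.shape)  # (24,) (10,)
```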

Highest rating given to any movie by each user:

In [27]:
data.max(axis=1)

array([10,  8, 10, 10, 10,  9, 10, 10, 10, 10, 10, 10,  9, 10, 10,  9,  9,
        9,  6,  8,  9,  9, 10,  8])

Which movie (__movie index__) got the lowest score for each user:

In [28]:
data.argmin(axis=1)

array([1, 8, 4, 7, 4, 3, 1, 0, 2, 3, 9, 6, 0, 5, 4, 1, 6, 6, 4, 0, 2, 1,
       1, 4])

And which movie was most often a user's lowest-rated one:

In [29]:
# Movie which got the lowest score per-user

lowest = data.argmin(axis=1) # (users,)

# Count how often each movie was a user's lowest-rated one
# minlength specifies number of entries (10 in our case as there are 10 movies)

counts = np.bincount(lowest, minlength=data.shape[1]) # (movies,)

# Get the movie which was rated lowest most frequently:

np.argmax(counts) # scalar (a single movie index)

1
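
"Lowest amongst all users" can be read in more than one way. The sketch below uses a small hand-made array to compare the counting approach above with a different definition, the movie with the lowest average rating; the two do not have to agree.

```python
import numpy as np

# (users, movies) = (3, 4); values chosen by hand for illustration
data = np.array([[5,  1, 9, 2],
                 [6,  1, 8, 2],
                 [7, 10, 9, 2]])

# movie that is most often a user's lowest-rated one
lowest = data.argmin(axis=1)                            # (users,)
counts = np.bincount(lowest, minlength=data.shape[1])   # (movies,)
most_often_lowest = counts.argmax()
print(most_often_lowest)   # 1

# movie with the lowest average rating across all users
lowest_mean = data.mean(axis=0).argmin()
print(lowest_mean)         # 3
```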

# Key Takeaways

- `reshape` changes the way we access an array and usually does not copy the underlying data. It only changes the shape and `strides`
- A `view` is an array that shares memory with the original, so changing one changes the other
- A `copy` owns its data; use `.copy()` when we want to physically copy the data
- Passing `-1` as one dimension in `reshape` lets `numpy` infer it from the remaining ones. It's especially useful when we don't know some dimension beforehand
- Broadcasting is a NumPy feature which automatically expands smaller arrays to match larger ones
- It's suggested to think of data stored in arrays in terms of shapes, rather than specific elements


# More Resources
- Numpy : https://youtu.be/Lfd776JSicY?feature=shared&t=2027
 - Broadcasting: start at **33:47** and finish at **54:02**
 - Reshape: start at **1:17:11** and finish at **1:18:49**
 - Aggregation: start at **53:02** and finish at **1:01:10**