<a href="https://colab.research.google.com/github/sdadi/ScalerDSML/blob/main/1_numpy/Class1/Postread_Numpy_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Numpy 1

## **Introduction to DAV (Data Analysis and Visualization) Module**

With this lecture, we're starting the DAV module.

It will contain 3 sections -

1. DAV-1: Python Libraries
 - Numpy
 - Pandas
 - Matplotlib & Seaborn
2. DAV-2: Probability Statistics
3. DAV-3: Hypothesis Testing

---

## **Content**

- Introduction to DAV
- Python Lists vs Numpy Array
  - Importing Numpy
- Dimension & Shape
- Type Conversion in Numpy Arrays
- Indexing & Slicing
- Working with 2D arrays (Matrices)
  - Transpose
  - Indexing
  - Slicing


---

## **Python Lists vs Numpy Arrays**

### **Homogeneity of data**

So far, we've been working with Python lists, which can hold **heterogeneous data**.

In [None]:
a = [1, 2, 3, "Michael", True]
a

[1, 2, 3, 'Michael', True]

Because of this heterogeneity, the elements of a Python list are not stored together in memory (RAM).

- Each element is stored at a different location.
- Only the addresses (references) of the elements are stored together.
- So a list just holds references to these locations, and follows them to access the actual elements.

\
On the other hand, NumPy stores only **homogeneous data**, i.e. a NumPy array cannot contain mixed data types.

It will either
- ONLY contain integers
- ONLY contain floats
- ONLY contain characters

... and so on.

Because all the items are of the same type, NumPy can store them together in one contiguous block of memory.
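To make the difference concrete, here is a small sketch (the variable names `mixed` and `arr` are only illustrative). It shows that a list can hold several Python types, while every element of a NumPy array shares one `dtype` and therefore one fixed size in bytes.

```python
import numpy as np

# A list can mix types; each element is a separate Python object.
mixed = [1, 2.5, "Michael", True]
print([type(x).__name__ for x in mixed])  # ['int', 'float', 'str', 'bool']

# A NumPy array has a single dtype, so every element takes the same number of bytes.
arr = np.array([1, 2, 3, 4], dtype=np.int64)
print(arr.dtype)     # int64
print(arr.itemsize)  # 8 bytes per element
print(arr.nbytes)    # 32 bytes in total (4 elements * 8 bytes)
```

The `dtype` is set explicitly to `np.int64` here so that the sizes are the same on every platform.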

<img src="https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/063/995/original/download.png?1706870327" width=700 height=175>

### **Speed**

Programming languages also differ in speed.

Broadly speaking,
- Java is a decently fast language.
- Python is a comparatively slow language.
- C, one of the older languages still in wide use, is very fast.

This is because C is compiled to machine code and gives the programmer direct control over memory (manual memory allocation, pointers, etc.).

#### **How is this possible?**

With NumPy, we write our code in Python, but behind the scenes most of the heavy lifting is done by code written in the **C programming language**, which makes it faster.

Because of this, a NumPy array can be significantly faster than a Python list at performing the same operation.

This is very important to us, because in data science we deal with huge amounts of data.


### **Properties**

- **In-built Functions**
 - For a Python list `a`, we had built-in functions like `sum(a)`, etc.
 - NumPy arrays also come with such built-in functions.

- **Slicing**
 - Recall that we were able to perform list slicing.
 - All of that is still applicable here.
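As a quick illustration of both properties, the sketch below sums and slices the same data as a list and as an array (the names `list_total` and `arr_total` are just for this example).

```python
import numpy as np

a = [1, 2, 3, 4, 5]
arr = np.array(a)

# Built-in sum() for the list, the .sum() method for the array.
list_total = sum(a)
arr_total = arr.sum()
print(list_total, arr_total)  # 15 15

# Slicing uses the same start:stop syntax for both.
print(a[1:4])    # [2, 3, 4]
print(arr[1:4])  # [2 3 4]
```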


## **Importing Numpy**

Recall how we used to import a module/library in Python.

* In order to use Python lists, we do not need to import anything extra.
* However, to use NumPy arrays, we need to import NumPy into our environment, as it is a library.

By convention, we import it under the alias **`np`**.

In [None]:
import numpy as np

**Note:**
- In this notebook, NumPy is already installed, as we are working on Google Colab.
- However, when working in an environment that does not have it installed, you'll have to install it once before first use.
- This can be done with the command: `!pip install numpy`

---

## Why Use NumPy? — Speed & Simplicity

Let’s say you have a list of numbers and you want to square each element:

In [None]:
a = [1, 2, 3, 4, 5]
res = [i**2 for i in a]
print(res)

[1, 4, 9, 16, 25]


This works, but let’s try the same with NumPy:

In [None]:
import numpy as np
b = np.array(a)
print(b**2)  # Output: [ 1  4  9 16 25]

[ 1  4  9 16 25]


Much cleaner: no loops needed, and the syntax is concise.

### Speed Comparison
Let’s square 1 million numbers using both methods:

In [None]:
l = range(1000000)
%timeit [i**2 for i in l]  # list comprehension; exact timings vary by machine

l_np = np.array(range(1000000))
%timeit l_np**2            # vectorized NumPy operation

92.8 ms ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
971 µs ± 45.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In this run, NumPy is nearly 100x faster!

**Why is NumPy faster?**

- NumPy arrays are homogeneous and densely packed in memory.
- Under the hood, NumPy uses C code (not Python loops).
- Operations are vectorized: a single call processes the whole array, and some routines can additionally use SIMD instructions or multithreaded libraries.
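`%timeit` only works inside a notebook. Outside one, a rough equivalent can be sketched with the standard-library `timeit` module, as below. The array size and repeat count are arbitrary choices for this example, and the measured times will differ from machine to machine.

```python
import timeit
import numpy as np

n = 100_000
values = list(range(n))
arr = np.arange(n, dtype=np.int64)  # explicit 64-bit ints to avoid overflow when squaring

# Time 10 repetitions of each approach.
list_time = timeit.timeit(lambda: [v**2 for v in values], number=10)
numpy_time = timeit.timeit(lambda: arr**2, number=10)

print(f"list comprehension: {list_time:.4f} s")
print(f"numpy vectorized:   {numpy_time:.4f} s")

# Both approaches compute the same values.
same = [v**2 for v in values] == (arr**2).tolist()
print(same)  # True
```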




---

## **Dimensions and Shape**

**We can get the number of dimensions of an array using the `ndim` attribute.**

In [None]:
arr1 = np.array(range(1000000))
arr1.ndim

1

**NumPy arrays have another attribute called `shape` that tells us the number of elements along each dimension.**

In [None]:
arr1.shape

(1000000,)

This means that the array `arr1` has 1000000 elements in a single dimension.

Let's take another example to understand `shape` and `ndim` better.

In [None]:
arr2 = np.array([[1, 2, 3], [4, 5, 6], [10, 11, 12]])
print(arr2)

[[ 1  2  3]
 [ 4  5  6]
 [10 11 12]]


**What do you think will be the shape & dimension of this array?**

In [None]:
arr2.ndim

2

In [None]:
arr2.shape

(3, 3)

`ndim` specifies the number of dimensions of the array, i.e. 1D (1), 2D (2), 3D (3) and so on.

`shape` gives the size along every dimension: here `(3, 3)`, meaning 3 elements along axis 0 (rows) and 3 along axis 1 (columns).
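The same attributes extend to higher dimensions. As a small sketch, the example below builds a 3D array of zeros with `np.zeros` (the name `cube` and the chosen shape are only illustrative) and reads off `ndim`, `shape` and `size`.

```python
import numpy as np

# A 3D array: 2 blocks, each with 3 rows and 4 columns.
cube = np.zeros((2, 3, 4))

print(cube.ndim)   # 3
print(cube.shape)  # (2, 3, 4)
print(cube.size)   # 24 (total number of elements: 2 * 3 * 4)
```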

<img src="https://drive.google.com/uc?id=1GSV_E1CaCc_Ur7pWJ-Kqv0VKvBRwByR1">

---

### **`np.arange()`**

Let's create some sequences in NumPy.

We can pass a **starting** point, an **ending** point (not included in the array) and a **step-size**.

**Syntax:**
- `np.arange(start, stop, step)`

In [None]:
arr2_step = np.arange(1, 5, 2)
arr2_step

array([1, 3])

`np.arange()` behaves in the same way as the built-in `range()` function, but returns a NumPy array.

**But then why not call it np.range?**

- Unlike `range()`, `np.arange()` also accepts a **floating point number** as the **step-size**.

In [None]:
arr3 = np.arange(1, 5, 0.5)
arr3

array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
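To see the difference from `range()` directly, the sketch below tries the same float step with both. The built-in `range()` rejects it with a `TypeError`, while `np.arange()` returns a float array (the name `halves` is just for this example).

```python
import numpy as np

# The built-in range() only accepts integers.
try:
    range(1, 5, 0.5)
except TypeError as exc:
    print("range() failed:", exc)

# np.arange() accepts a float step and produces a float array.
halves = np.arange(1, 5, 0.5)
print(halves.dtype)  # float64
print(len(halves))   # 8
```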

---

## **Type Conversion in Numpy Arrays**

NumPy arrays are homogeneous, so let's see what happens when we mix data types, starting with a **float** among the integers of a **NumPy array**:

In [None]:
import numpy as np

arr1 = np.array([1, 2, 3, 4.0])
print(arr1)


[1. 2. 3. 4.]


Notice that the integers were automatically upcast to float.
- This is because a NumPy array can only store **homogeneous data**, i.e. values of one data type.

What about strings and numbers together?

In [None]:
arr2 = np.array(["Harry Potter", 1, 2, 3])
print(arr2)

['Harry Potter' '1' '2' '3']


Everything is converted to strings (NumPy's Unicode string dtype).
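One way to confirm these conversions is to inspect the `dtype` of each array. The sketch below uses `dtype.kind`, a one-letter code for the general category of the type (`'i'` for signed integers, `'f'` for floats, `'U'` for Unicode strings). The variable names are only illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_float = np.array([1, 2, 3, 4.0])
mixed_str = np.array(["Harry Potter", 1, 2, 3])

print(ints.dtype.kind)         # i  (integer)
print(mixed_float.dtype)       # float64
print(mixed_str.dtype.kind)    # U  (Unicode string)
print(mixed_str.tolist())      # ['Harry Potter', '1', '2', '3']
```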

### **Specifying Data Types with `dtype`**

The `np.array()` function has a `dtype` parameter.

**What if we set the `dtype` of an array containing integer values to `float`?**

In [None]:
arr3 = np.array([1, 2, 3, 4], dtype='float')
print(arr3)

[1. 2. 3. 4.]


You can control the data type directly using the dtype argument.

**Question:** What will happen in the following code?

In [None]:
np.array(["Shivank", "Bipin"], dtype=float)
# Raises ValueError: could not convert string to float

ValueError: could not convert string to float: 'Shivank'

Since strings of letters cannot be converted to floats, this raises a `ValueError`.

\
We can also convert the data type with the `astype()` method.

In [None]:
arr4 = np.array([10, 20, 30])
arr4 = arr4.astype('float64')
print(arr4)

[10. 20. 30.]
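Converting in the other direction deserves some care. As a sketch, the example below converts floats to integers with `astype()`: the fractional part is dropped (truncation toward zero, not rounding), and the original array is left unchanged because `astype()` returns a new array. The name `prices` is only illustrative.

```python
import numpy as np

prices = np.array([1.9, 2.5, -3.7])

# Float -> int drops the fractional part (truncates toward zero).
as_int = prices.astype('int64')
print(as_int)        # [ 1  2 -3]

# astype() returns a new array; the original keeps its dtype.
print(prices.dtype)  # float64
```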


---

## **Indexing**

- Similar to Python lists

In [None]:
m1 = np.arange(12)
m1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [None]:
m1[-1] # negative indexing in numpy array

11

You can also index with a list of indices in NumPy.

In [None]:
m1 = np.array([100,200,300,400,500,600])

In [None]:
m1[[2,3,4,1,2,2]]

array([300, 400, 500, 200, 300, 300])

Did you notice how a single index can be repeated multiple times in the list of indices?

**Note:**
- To extract elements at multiple indices, pass the indices as a list, which results in two sets of square brackets `[[ ]]`.
- Otherwise, NumPy treats each comma-separated value as an index along a separate dimension, and a 1D array raises an error.



In [None]:
m1[2,3,4,1,2,2]

IndexError: too many indices for array: array is 1-dimensional, but 6 were indexed

---

## **Slicing**

- Similar to Python lists

In [None]:
m1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [None]:
m1[:5]

array([1, 2, 3, 4, 5])

**Question:** What'll be the output of `m1[-5:-1]`?

In [None]:
m1[-5:-1]

array([6, 7, 8, 9])

**Question:** What'll be the output of `m1[-5:-1:-1]`?




In [None]:
m1[-5: -1: -1]

array([], dtype=int64)
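The result is empty because, with a negative step, the slice walks backwards, so the start index must lie to the right of the stop index. Starting at `-5` and moving left can never reach `-1`. The sketch below contrasts this with slices that do run backwards.

```python
import numpy as np

m1 = np.arange(1, 11)  # [ 1  2 ... 10]

# Start (-5) is already left of stop (-1), so walking backwards selects nothing.
print(m1[-5:-1:-1])  # []

# Swapping start and stop walks from the last element back towards index -5 (excluded).
print(m1[-1:-5:-1])  # [10  9  8  7]

# Omitting start and stop reverses the whole array.
print(m1[::-1][:3])  # [10  9  8]
```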

---

## **Working with 2D arrays (Matrices)**

Let's create an array -

In [None]:
import numpy as np
a = np.array(range(16))
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

What will be its shape and number of dimensions?

In [None]:
a.shape

(16,)

In [None]:
a.ndim

1

#### How can we convert this array to a 2-dimensional array?

- Using `reshape()`

For a 2D array, we have to specify the following:
- **First argument** is **no. of rows**
- **Second argument** is **no. of columns**

\
Let's try converting it into an `8x2` array.

In [None]:
a.reshape(8, 2)

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15]])

In [None]:
a.reshape(4, 5)

ValueError: cannot reshape array of size 16 into shape (4,5)

**This will give an Error. Why?**

* We have 16 elements in `a`, but `reshape(4, 5)` is trying to fill in `4x5 = 20` elements.
* Therefore, the shape we reshape to must hold exactly the number of elements that we have.


In [None]:
a.reshape(8, -1)

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15]])

Notice that NumPy automatically figured out what should replace the `-1` argument, given that the first argument is `8`.

We can also put `-1` as the first argument. As long as one argument is given, NumPy will calculate the other one.

**What if we pass both args as `-1`?**

In [None]:
a.reshape(-1, -1)

ValueError: can only specify one unknown dimension

- At most one dimension can be `-1`; you need to give the other one explicitly.
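The sketch below shows how the `-1` placeholder is resolved for a few row counts, and what happens when the given dimension does not divide the number of elements evenly (the name `cols_first` is only illustrative).

```python
import numpy as np

a = np.arange(16)

# NumPy infers the missing dimension as 16 / rows.
for rows in (2, 4, 8):
    print(rows, a.reshape(rows, -1).shape)
# 2 (2, 8)
# 4 (4, 4)
# 8 (8, 2)

# -1 can also stand in for the number of rows.
cols_first = a.reshape(-1, 4)
print(cols_first.shape)  # (4, 4)

# 16 is not divisible by 3, so no valid second dimension exists.
try:
    a.reshape(3, -1)
except ValueError as exc:
    print("reshape failed:", exc)
```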

Let's save `a` as an `8 x 2` array (matrix) for now.

In [None]:
a = a.reshape(8, 2)

**What will be the length of `a`?**

* It will be 8, since it contains 8 rows (1D arrays) as its elements.
* Each of these rows has 2 elements, but `len()` does not count those.

**Explanation: `len()` of an n-dimensional array gives the size of its first dimension.**

In [None]:
len(a)

8

In [None]:
len(a[0])

2
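To keep the different notions of "length" apart, this short sketch compares `len()`, `shape` and `size` on the same `8 x 2` array.

```python
import numpy as np

a = np.arange(16).reshape(8, 2)

print(len(a))      # 8   (size of the first dimension)
print(a.shape[0])  # 8   (same value, read from the shape)
print(a.shape[1])  # 2   (size of the second dimension)
print(a.size)      # 16  (total number of elements)
```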

---

### **Transpose**

Let's create a 2D numpy array.

In [None]:
a = np.arange(12).reshape(3,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [None]:
a.shape

(3, 4)

There is another operation on multi-dimensional arrays, known as **Transpose**.

It swaps rows and columns: an array with shape `(3, 4)` becomes one with shape `(4, 3)`.

In [None]:
a.T

array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

Let's verify the shape of this transposed array -

In [None]:
a.T.shape

(4, 3)
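At the element level, transposing means that the value at position `(i, j)` moves to position `(j, i)`. The sketch below checks this for one element and confirms that transposing twice gives back the original array (the name `t` is only illustrative).

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
t = a.T

# Element (1, 2) of a is element (2, 1) of the transpose.
print(a[1, 2], t[2, 1])  # 6 6

# Transposing twice returns the original layout.
print(np.array_equal(t.T, a))  # True
```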

---

### Indexing in NumPy 2D Arrays
Just like Python lists, NumPy supports intuitive indexing for multi-dimensional arrays, but with added flexibility and speed.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/054/693/original/2dnp.png?1697949471 height = "600" width = "700">

In [None]:
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])


You can access elements in row, column format:

In [None]:
a[1, 2]     # Output: 6 (2nd row, 3rd column)

np.int64(6)

In [None]:
a[1][2]     # Works the same way

np.int64(6)

NumPy supports both comma-based and nested indexing.
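The two forms give the same element but work slightly differently. As a sketch: `a[1][2]` first extracts the whole row `a[1]` as a 1D array and then indexes into it, whereas `a[1, 2]` looks up the element in a single step (the name `row` is only illustrative).

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Nested indexing happens in two steps.
row = a[1]      # the second row as a 1D array
print(row)      # [4 5 6]
print(row[2])   # 6

# Comma-based indexing does the same lookup in one step.
print(a[1, 2])  # 6
```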

In [None]:
m1 = np.arange(1,10).reshape((3,3))
m1

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

**What will be the output of this?**

In [None]:
m1[1, 1] # m1[row,column]

5

We saw how we can use a list of indices with a NumPy array.

In [None]:
m1 = np.array([100,200,300,400,500,600])

**Will this work now?**

In [None]:
m1[2, 3]

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

**Note:**
- Since `m1` is a 1D array, this will not work.
- This is because a 1D array has no separate rows and columns.

Therefore, you cannot use the same syntax for 1D arrays, as you did with 2D arrays, and vice-versa.

\
However, with a small tweak to this code, we can access elements of `m1` at different positions/indices.

In [None]:
m1[[2, 3]]

array([300, 400])

#### **How will you print the diagonal elements of the following 2D array?**

In [None]:
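m1 = np.arange(1,10).reshape((3,3)) # recreate the 3x3 matrix, since m1 was reassigned to a 1D array above
m1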

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [None]:
m1[[0,1,2],[0,1,2]] # picking up element (0,0), (1,1) and (2,2)

array([1, 5, 9])

When a list of indices is provided for both rows and columns, for example `m1[[0,1,2],[0,1,2]]`, the two lists are paired up position by position.

It selects individual elements, i.e. `m1[0][0]`, `m1[1][1]` and `m1[2][2]`.
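The pairing works for any positions, not only the diagonal. The sketch below picks two off-diagonal elements and, for comparison, also reads the diagonal with `np.diag`, which returns the main diagonal when given a 2D array.

```python
import numpy as np

m1 = np.arange(1, 10).reshape(3, 3)

# Row and column lists are paired: this picks elements (0, 1) and (2, 0).
print(m1[[0, 2], [1, 0]])        # [2 7]

# The diagonal via paired index lists, and via np.diag for comparison.
print(m1[[0, 1, 2], [0, 1, 2]])  # [1 5 9]
print(np.diag(m1))               # [1 5 9]
```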




---

## **Slicing in 2D arrays**

- We need to **provide two slice ranges**, one for **row** and one for **column**.
- We can also **mix Indexing and Slicing**

In [None]:
m1 = np.arange(12).reshape(3,4)
m1

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [None]:
m1[:2] # gives first two rows

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

#### **How can we get columns from a 2D array?**

In [None]:
m1[:, :2] # gives first two columns

array([[0, 1],
       [4, 5],
       [8, 9]])

In [None]:
m1[:, 1:3] # gives 2nd and 3rd col

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]])
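Mixing indexing and slicing, mentioned at the start of this section, can be sketched as follows. An integer index selects a single row or column and drops that dimension, while a slice keeps it.

```python
import numpy as np

m1 = np.arange(12).reshape(3, 4)

# Integer for the row, slice for the columns: part of the 2nd row (1D result).
print(m1[1, 1:3])   # [5 6]

# Slice for the rows, integer for the column: the whole 1st column (1D result).
print(m1[:, 0])     # [0 4 8]

# Slices on both axes: a 2x2 sub-matrix (2D result).
print(m1[:2, 1:3])
# [[1 2]
#  [5 6]]
```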

---