
## Programming for Data Science

### Lecture 3: Data Structures, Part 2

### Instructor: Farhad Pourkamali 



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/farhad-pourkamali/CUSucceedProgrammingForDataScience/blob/main/Lecture3_DataStructures_Part2.ipynb)


### Introduction
<hr style="border:2px solid gray">

* NumPy (https://numpy.org/) is the fundamental library for scientific computing and data science with Python.

* It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions, random number generators, linear algebra routines, and more.

* The fundamental data structure in NumPy is the `ndarray`, which stands for N-dimensional array.
    * `ndarrays` can have multiple dimensions, allowing for the representation of matrices, tensors, and other multi-dimensional data structures.
    
    * `ndarrays` are more memory-efficient and faster than built-in Python lists, especially for large data sets. This efficiency is crucial for numerical and scientific computing.
    * Many other libraries in the Python ecosystem, such as pandas, scikit-learn, and TensorFlow, build upon NumPy. 


* To utilize NumPy, it must be imported first. A common practice is to import it with the abbreviated name "np".

In [1]:
import numpy as np 

print(np.__version__)

1.26.2
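As a rough illustration of the efficiency point made in the introduction, the sketch below looks at how an `ndarray` stores its data and applies one vectorized operation. The variable names are illustrative only, and the example deliberately avoids timing claims, since those depend on the machine.

```python
import numpy as np

values_list = list(range(1000))
values_array = np.arange(1000, dtype=np.int64)

# An ndarray stores its elements in one block: 1000 items * 8 bytes each
print(values_array.itemsize)   # 8
print(values_array.nbytes)     # 8000

# Vectorized arithmetic applies to every element without an explicit Python loop
doubled_array = values_array * 2
doubled_list = [v * 2 for v in values_list]

print(np.array_equal(doubled_array, doubled_list))   # True
```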


* In the context of a NumPy ndarray, an `axis` represents a specific dimension along which the array is defined. 

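* The number of axes in an ndarray is referred to as its "rank" and is available through the `ndim` attribute. This is not the same as the rank of a matrix in linear algebra.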
    * For example, a 1D array has a rank of 1, a 2D array has a rank of 2, and so on. 



<img src="numpy.png" width=600>

* Think of a 2D array (matrix) as a table: each row runs horizontally (left to right) and each column runs vertically (top to bottom). In NumPy, axis 0 indexes the rows and axis 1 indexes the columns. A matrix with $n$ rows and $m$ columns is written as:


$$\mathbf{A}=\begin{bmatrix}a_{11}&\ldots&a_{1m}\\
\vdots&&\vdots\\a_{n1}&\ldots&a_{nm}\end{bmatrix}\in\mathbb{R}^{n\times m}$$

In [2]:
a = np.zeros(4)

a


array([0., 0., 0., 0.])

In [3]:
# shape 

a.shape

(4,)

In [4]:
# Number of axes or rank

a.ndim

1

In [5]:
type(a)

numpy.ndarray

In [6]:
# 2D array: provide a tuple with the desired number of rows and columns

A = np.zeros((2,3))

A

array([[0., 0., 0.],
       [0., 0., 0.]])

In [7]:
A.shape

(2, 3)

In [8]:
A.ndim

2

In [9]:
# 3D array 

B = np.zeros((2,3,3))

B

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

In [10]:
B.shape

(2, 3, 3)

In [11]:
B.ndim


3

In [12]:
# 3x4 matrix full of ones

np.ones((3,4))


array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [13]:
# np.full creates ndarray with a specified shape and 
# fills it with a constant value

# Create a 2x3 array filled with the value 7

filled_array = np.full((2, 3), 7)

filled_array

array([[7, 7, 7],
       [7, 7, 7]])

* You can also use the `np.array` function to convert a Python list (or a nested list) into an `ndarray`.

In [14]:
x = np.array([1, 4, 3])

x

array([1, 4, 3])

In [15]:
x.shape

(3,)

In [16]:
y = np.array([[2, 4, 5], [9, 2, 7]])

y

array([[2, 4, 5],
       [9, 2, 7]])

In [17]:
y.shape

(2, 3)

In [18]:
y.ndim

2

* `np.arange` is a NumPy function used to create an ndarray with regularly spaced values within a specified range. The syntax for `np.arange` is as follows:

    * `np.arange([start, ]stop, [step, ]dtype=None)`

        * `start`: (Optional) The start of the range. If not specified, the default is 0.
        * `stop`: The end of the range (exclusive). The created array will not include this value.
        * `step`: (Optional) The step size between values. If not specified, the default is 1.
        * `dtype`: (Optional) Data type of the array. If not specified, the data type is inferred.


In [19]:
z = np.arange(10)

z

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [20]:
t = np.arange(0, 10, 2)

t

array([0, 2, 4, 6, 8])

In [21]:
np.arange(1, 5)


array([1, 2, 3, 4])

* `np.linspace` creates an array with a specified number of evenly spaced values over a specified range (inclusive endpoints).

In [22]:
np.linspace(0, 10, 5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])
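To make the contrast between `np.linspace` and `np.arange` concrete, the short sketch below builds grids over the same range both ways. The variable names are illustrative; `endpoint` is a documented keyword of `np.linspace`.

```python
import numpy as np

# np.linspace: you choose the number of points; the stop value is included by default
grid_inclusive = np.linspace(0, 10, 5)         # values 0, 2.5, 5, 7.5, 10

# endpoint=False excludes the stop value, so the spacing becomes (10 - 0) / 5 = 2
grid_exclusive = np.linspace(0, 10, 5, endpoint=False)   # values 0, 2, 4, 6, 8

# np.arange: you choose the step; the stop value is never included
grid_arange = np.arange(0, 10, 2)              # values 0, 2, 4, 6, 8

print(grid_inclusive[-1])                          # 10.0
print(np.allclose(grid_exclusive, grid_arange))    # True
```

As a rule of thumb, `np.linspace` is convenient when the number of points matters, and `np.arange` when the step size matters.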

### Random numbers
<hr style="border:2px solid gray">

* `np.random` is a submodule in NumPy that provides functions for generating random numbers. Here are some key distributions available in `np.random`, with brief explanations:

1. Uniform Distribution:
    * Function: `np.random.rand()`
    * Description: Generates random numbers from a uniform distribution over the interval $[0, 1)$.

In [23]:
# Uniform 

np.random.rand(5)

array([0.68400166, 0.89658574, 0.37392833, 0.34418198, 0.29328848])

2. Normal (Gaussian) Distribution:
    * Function: `np.random.randn()`
    * Description: Generates random numbers from a standard normal distribution (mean=0, standard deviation=1).

In [24]:
# Normal 

np.random.randn(5)

array([ 2.73815073,  0.79264182, -1.78950834, -0.44839618,  0.08471108])

In [25]:
# We can change the mean and standard deviation

np.random.normal(loc=10, scale=1, size=5) # loc: mean, scale: Standard deviation

array([11.42366573, 10.03669191, 10.05378577, 10.48300035, 10.36884011])

* Documentation for `np.random.normal`: https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html

3. Uniform Integer Distribution:

    * Function: `np.random.randint()`
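    * Description: Generates random integers from `low` (inclusive) to `high` (exclusive). For example, `np.random.randint(1, 5)` can return 1, 2, 3, or 4, but never 5.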

In [26]:
np.random.randint(1, 5, size=(2, 5))

array([[4, 4, 1, 3, 3],
       [4, 1, 1, 4, 4]])
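The bounds of `np.random.randint` can be checked empirically. The sketch below draws many integers and inspects the smallest and largest values; the seed value 0 and the sample size are arbitrary choices made for this illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used only to make the example repeatable

draws = np.random.randint(1, 5, size=1000)

# The lower bound (1) can appear, the upper bound (5) cannot
print(draws.min(), draws.max())   # 1 4
print(np.unique(draws))           # [1 2 3 4]
```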

4. Binomial Distribution:
    * Function: `np.random.binomial()`
    * Description: Generates random numbers from a binomial distribution with specified number of trials and probability of success.
    * The probability mass function (PMF) of a binomial distribution is given by the following equation:
    $$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$
    * $n$: the number of trials or experiments. 
    * $k$: the number of successes.
    * $p$: the probability of success on a single trial.
    

In [27]:
np.random.binomial(n=10, p=0.7, size=5)

array([7, 7, 7, 7, 9])
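The PMF above can be evaluated directly and compared with simulated draws. This is a minimal sketch: the helper `binomial_pmf` is a hypothetical name introduced here, and the seed and sample size are arbitrary. It uses `np.random.default_rng`, NumPy's newer generator interface, in place of the global functions shown above.

```python
import math
import numpy as np

n, p = 10, 0.7

def binomial_pmf(k, n, p):
    """Evaluate P(X = k) for a binomial distribution (hypothetical helper)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Probabilities for k = 0, 1, ..., n; they should sum to 1
pmf = np.array([binomial_pmf(k, n, p) for k in range(n + 1)])
print(round(pmf.sum(), 6))   # 1.0
print(round(pmf[7], 4))      # 0.2668

# Compare with the fraction of simulated draws equal to 7
rng = np.random.default_rng(0)
samples = rng.binomial(n=n, p=p, size=100_000)
empirical = np.mean(samples == 7)
print(abs(empirical - pmf[7]) < 0.01)   # True
```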

* `np.random.seed` is a function that initializes the random number generator with a specified seed value. This is useful when you want to ensure reproducibility in your code.

In [28]:
# Set a seed value (e.g., 42)
np.random.seed(42)

# Generate random numbers
np.random.rand(5)

array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864])
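To see what reproducibility means in practice, the sketch below resets the seed and draws again. The second half uses `np.random.default_rng`, NumPy's newer generator interface, as an alternative to the global seed; the variable names are illustrative.

```python
import numpy as np

np.random.seed(42)
first_draw = np.random.rand(3)

np.random.seed(42)               # resetting the seed restarts the sequence
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))   # True

# Alternative: independent Generator objects created with the same seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
same_from_generators = np.array_equal(rng_a.random(3), rng_b.random(3))
print(same_from_generators)                      # True
```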

### Reshaping an array
<hr style="border:2px solid gray">

* You can change the shape of an `ndarray` in place by assigning a new tuple to its `shape` attribute. The total number of elements must remain unchanged during this operation.

In [29]:
a = np.arange(12)

a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [30]:
a.shape = (3, 4)

a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [31]:
a.ndim

2

In [32]:
a.shape

(3, 4)

* The `reshape` method returns a new `ndarray` object that, whenever possible, points at the same data (a view). In that case, modifying one array will also modify the other.

In [33]:
b = a.reshape(6,2)

b

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

In [34]:
# change a 
a[0, 0] = 20

a

array([[20,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [35]:
# check b 

b

array([[20,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])
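Whether two arrays share data can be checked with `np.shares_memory`. The sketch below contrasts a reshaped view with an explicit copy; the names `view` and `independent` are illustrative.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

view = a.reshape(6, 2)                 # shares data with a
independent = a.reshape(6, 2).copy()   # owns a separate copy of the data

a[0, 0] = 20

print(view[0, 0])          # 20, the change in a is visible through the view
print(independent[0, 0])   # 0, the copy is unaffected

print(np.shares_memory(a, view))          # True
print(np.shares_memory(a, independent))   # False
```

Calling `.copy()` is a simple way to protect the original array when the reshaped result will be modified.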

* The `ravel` function flattens a multi-dimensional array into a one-dimensional array. It returns a view of the input array when possible and copies the data only when needed.

In [36]:
# Use the ravel function 

b.ravel()

array([20,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
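The view-versus-copy behavior of `ravel` can be inspected with `np.shares_memory`. The sketch below also shows `flatten`, a related method that always returns a copy; the array and variable names are illustrative.

```python
import numpy as np

b = np.arange(6).reshape(3, 2)

flat_view = b.ravel()      # contiguous input: ravel can return a view
flat_copy = b.flatten()    # flatten always returns a copy

print(np.shares_memory(b, flat_view))   # True
print(np.shares_memory(b, flat_copy))   # False

# The transpose is not laid out row by row in memory, so ravel has to copy
flat_from_transpose = b.T.ravel()
print(flat_from_transpose)                          # [0 2 4 1 3 5]
print(np.shares_memory(b, flat_from_transpose))     # False
```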

### Mathematical and statistical functions
<hr style="border:2px solid gray">

* Here are some key points about mathematical and statistical functions in NumPy:
    1. Element-wise Operations
        * np.power, np.equal, np.greater,...
    
    2. Universal Functions
        * np.sin, np.cos, np.sinh, np.exp, np.log, ...
    
    3. Descriptive Statistics
        * np.mean, np.std, np.min, np.max, np.percentile, ...
    
    4. Linear Algebra 
        * np.dot, np.linalg.inv, np.linalg.eig, ...



In [37]:
my_array = np.array([2, 3, 8])

my_array

array([2, 3, 8])

In [38]:
np.power(my_array, 2)

array([ 4,  9, 64])

In [39]:
my_array2 = np.array([2, 5, 6])

my_array2

array([2, 5, 6])

In [40]:
np.equal(my_array, my_array2)

array([ True, False, False])

In [41]:
np.greater(my_array, my_array2)

array([False, False,  True])

In [42]:
# Create a 2x2 array
my_array = np.array([[1, 2],
                     [3, 4]])

# Calculate overall mean
overall_mean = np.mean(my_array)

overall_mean

2.5

In [43]:
# Calculate mean along axis 0 (column-wise)
mean_axis_0 = np.mean(my_array, axis=0)

mean_axis_0

array([2., 3.])

In [44]:
# Calculate mean along axis 1 (row-wise)
mean_axis_1 = np.mean(my_array, axis=1)

mean_axis_1

array([1.5, 3.5])
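The same `axis` argument works for the other descriptive statistics listed above. The following sketch applies a few of them to the same 2x2 array; the variable names are illustrative.

```python
import numpy as np

my_array = np.array([[1, 2],
                     [3, 4]])

overall_std = np.std(my_array)            # standard deviation of all four values
col_max = np.max(my_array, axis=0)        # largest value in each column
row_min = np.min(my_array, axis=1)        # smallest value in each row
median = np.percentile(my_array, 50)      # 50th percentile of all values

print(round(overall_std, 4))   # 1.118
print(col_max)                 # [3 4]
print(row_min)                 # [1 3]
print(median)                  # 2.5
```

A helpful way to read `axis=0` is "collapse the rows", which leaves one result per column; `axis=1` collapses the columns and leaves one result per row.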

* The dot product (also known as the scalar product or inner product) of two vectors $\mathbf{a}$ and $\mathbf{b}$ of length $n$ is defined as:

$$\langle \mathbf{a},  \mathbf{b}\rangle = a_1 \cdot b_1 + a_2 \cdot b_2 + \ldots + a_n \cdot b_n$$


In [45]:
# Define two 1-D arrays
vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])

# Calculate the dot product using np.dot
dot_product = np.dot(vector_a, vector_b)

dot_product

32
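The result can be verified against the definition by summing the element-wise products. The sketch below also uses the `@` operator, which computes the same product for 1-D arrays; the variable names other than `vector_a` and `vector_b` are illustrative.

```python
import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])

# Direct translation of the formula: 1*4 + 2*5 + 3*6
manual = sum(a_i * b_i for a_i, b_i in zip(vector_a, vector_b))

# Element-wise product followed by a sum
elementwise = np.sum(vector_a * vector_b)

# The @ operator
with_operator = vector_a @ vector_b

print(manual, elementwise, with_operator)   # 32 32 32
```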

* Matrix-vector multiplication

* Method 1: a linear combination of the columns of the matrix, weighted by the entries of the vector
$$\begin{bmatrix}a_{11} & \ldots & a_{1n}\\\vdots & & \vdots\\a_{m1}&\ldots &a_{mn}\end{bmatrix}\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}=x_1\begin{bmatrix}a_{11}\\\vdots\\a_{m1}\end{bmatrix}+\ldots+x_n\begin{bmatrix}a_{1n}\\\vdots\\a_{mn}\end{bmatrix}$$

* Method 2: the dot product of each row of the matrix with the vector
$$\begin{bmatrix}a_{11} & \ldots & a_{1n}\\\vdots & & \vdots\\a_{m1}&\ldots &a_{mn}\end{bmatrix}\begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}=\begin{bmatrix}\sum_{i=1}^n a_{1i}x_i\\\vdots\\\sum_{i=1}^na_{mi}x_i\end{bmatrix}$$

In [46]:
A = np.array([[1,2,3],[4,5,6]])

x = np.array([2,4,5]) 

z = np.matmul(A,x)

print(z)

[25 58]


In [47]:
A.shape, x.shape, z.shape

((2, 3), (3,), (2,))

In [48]:
A

array([[1, 2, 3],
       [4, 5, 6]])

In [49]:
# verify your result 

np.dot(A[0,:], x), np.dot(A[1,:], x)

(25, 58)
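Both methods can be written out explicitly and compared with `np.matmul`. This is a small sketch for the same `A` and `x`; the names `method_1` and `method_2` are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
x = np.array([2, 4, 5])

# Method 1: weighted sum of the columns of A
method_1 = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]

# Method 2: dot product of each row of A with x
method_2 = np.array([np.dot(A[i, :], x) for i in range(A.shape[0])])

print(method_1)   # [25 58]
print(method_2)   # [25 58]
print(np.array_equal(method_1, np.matmul(A, x)))   # True
```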

### Array indexing
<hr style="border:2px solid gray">

* One-dimensional NumPy arrays can be indexed and sliced much like regular Python lists.


In [50]:
my_array = np.array([1, 2, 3, 4, 5])

element = my_array[2]  # Accesses the element at index 2 

element

3

In [51]:
my_array[2:]

array([3, 4, 5])

* For two-dimensional arrays, use two indices separated by a comma: the first selects the row and the second selects the column.


In [52]:
my_2d_array = np.array([[1, 2, 3], [4, 5, 6]])

my_2d_array


array([[1, 2, 3],
       [4, 5, 6]])

In [53]:
my_2d_array[1, 2]  # Accesses the element at row 1, column 2 (zero-based: second row, third column)

6

In [54]:
my_2d_array[:, 1:3]  # Selects columns 1 to 2 for all rows

array([[2, 3],
       [5, 6]])
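A few more indexing patterns follow the same rules as Python lists, including negative indices and step sizes. The sketch below also illustrates that a basic slice is a view of the original data; the array `grid` is a hypothetical copy made only for that demonstration.

```python
import numpy as np

my_2d_array = np.array([[1, 2, 3], [4, 5, 6]])

last_row = my_2d_array[-1, :]       # negative index counts from the end
every_other = my_2d_array[0, ::2]   # row 0, every second column
first_column = my_2d_array[:, 0]    # all rows, column 0

print(last_row)       # [4 5 6]
print(every_other)    # [1 3]
print(first_column)   # [1 4]

# Basic slices are views: writing through the slice changes the source array
grid = my_2d_array.copy()
grid_column = grid[:, 0]
grid_column[0] = 100
print(grid[0, 0])          # 100
print(my_2d_array[0, 0])   # 1, the original is untouched because grid is a copy
```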

* In NumPy, the ellipsis (`...`) is a convenient shorthand for representing multiple colons in array slicing. It is often used when you have arrays with more than two dimensions, and you want to specify a slice along a particular dimension without explicitly writing out all the colons.

In [55]:
arr_3d = np.random.rand(3, 4, 5)

arr_3d

array([[[0.15599452, 0.05808361, 0.86617615, 0.60111501, 0.70807258],
        [0.02058449, 0.96990985, 0.83244264, 0.21233911, 0.18182497],
        [0.18340451, 0.30424224, 0.52475643, 0.43194502, 0.29122914],
        [0.61185289, 0.13949386, 0.29214465, 0.36636184, 0.45606998]],

       [[0.78517596, 0.19967378, 0.51423444, 0.59241457, 0.04645041],
        [0.60754485, 0.17052412, 0.06505159, 0.94888554, 0.96563203],
        [0.80839735, 0.30461377, 0.09767211, 0.68423303, 0.44015249],
        [0.12203823, 0.49517691, 0.03438852, 0.9093204 , 0.25877998]],

       [[0.66252228, 0.31171108, 0.52006802, 0.54671028, 0.18485446],
        [0.96958463, 0.77513282, 0.93949894, 0.89482735, 0.59789998],
        [0.92187424, 0.0884925 , 0.19598286, 0.04522729, 0.32533033],
        [0.38867729, 0.27134903, 0.82873751, 0.35675333, 0.28093451]]])

In [56]:
slice_result = arr_3d[2, :, :]  # Without ellipsis

slice_result

array([[0.66252228, 0.31171108, 0.52006802, 0.54671028, 0.18485446],
       [0.96958463, 0.77513282, 0.93949894, 0.89482735, 0.59789998],
       [0.92187424, 0.0884925 , 0.19598286, 0.04522729, 0.32533033],
       [0.38867729, 0.27134903, 0.82873751, 0.35675333, 0.28093451]])

In [57]:
ellipsis_result = arr_3d[2, ...]  # With ellipsis

ellipsis_result

array([[0.66252228, 0.31171108, 0.52006802, 0.54671028, 0.18485446],
       [0.96958463, 0.77513282, 0.93949894, 0.89482735, 0.59789998],
       [0.92187424, 0.0884925 , 0.19598286, 0.04522729, 0.32533033],
       [0.38867729, 0.27134903, 0.82873751, 0.35675333, 0.28093451]])
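The ellipsis can also stand in for the leading axes, which is where it saves the most typing. The sketch below uses a small integer array so the shapes are easy to follow; the variable names are illustrative.

```python
import numpy as np

arr_3d = np.arange(24).reshape(2, 3, 4)

leading = arr_3d[1, ...]     # same as arr_3d[1, :, :]
trailing = arr_3d[..., 0]    # same as arr_3d[:, :, 0]

print(leading.shape)    # (3, 4)
print(trailing.shape)   # (2, 3)
print(np.array_equal(trailing, arr_3d[:, :, 0]))   # True
```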

* Boolean indexing in NumPy involves using boolean arrays (arrays of True and False values) to index or filter elements from another array. The boolean array acts as a mask, indicating which elements from the original array should be included in the result.

In [58]:
arr = np.array([1, 2, 3, 4, 5])

mask = np.array([True, False, True, False, True])

result = arr[mask]

print(result)

[1 3 5]


* Let's say you have an array and you want to select only the elements that are greater than a certain threshold.

In [59]:
arr = np.array([10, 5, 8, 12, 3])

threshold = 8

# Create a boolean array based on the condition (elements greater than the threshold)
mask = arr > threshold

# Use boolean indexing to select elements greater than the threshold
result = arr[mask]

print(result)

[10 12]
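Masks can be combined and can also be used on the left-hand side of an assignment. The sketch below shows both, using `&` for an element-wise "and"; note that each condition needs its own parentheses. The bounds 5 and 10 and the variable names are arbitrary choices for this illustration.

```python
import numpy as np

arr = np.array([10, 5, 8, 12, 3])

# Keep elements between 5 and 10, inclusive
in_range = arr[(arr >= 5) & (arr <= 10)]
print(in_range)   # [10  5  8]

# Assign through a mask: cap every value above 8 at 8 (on a copy)
capped = arr.copy()
capped[capped > 8] = 8
print(capped)     # [8 5 8 8 3]
```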


### Concatenating arrays 
<hr style="border:2px solid gray">

* `np.concatenate` is a NumPy function used for concatenating (joining together) arrays along a specified axis. 
    * Syntax: `np.concatenate((array1, array2, ...), axis=0)`
    * Parameters:

        * arrays: Sequence of arrays to be concatenated.
        * axis: Axis along which the arrays will be joined. Default is 0 (arrays are stacked vertically, adding rows). The arrays must have the same shape except along this axis.

In [60]:
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6]])

arr1, arr2

(array([[1, 2],
        [3, 4]]),
 array([[5, 6]]))

In [61]:
# Concatenate along rows (axis=0)
result = np.concatenate((arr1, arr2), axis=0)

print(result)

[[1 2]
 [3 4]
 [5 6]]
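Joining along `axis=1` places arrays side by side, so the number of rows must match. In the sketch below, `arr2` is transposed to shape `(2, 1)` before it is appended as a new column; the variable names `side_by_side` and `shapes_compatible` are illustrative.

```python
import numpy as np

arr1 = np.array([[1, 2], [3, 4]])   # shape (2, 2)
arr2 = np.array([[5, 6]])           # shape (1, 2)

# arr2.T has shape (2, 1), which matches arr1 along axis 0
side_by_side = np.concatenate((arr1, arr2.T), axis=1)
print(side_by_side)
# [[1 2 5]
#  [3 4 6]]

# Without the transpose the row counts differ (2 vs 1) and NumPy raises an error
try:
    np.concatenate((arr1, arr2), axis=1)
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(shapes_compatible)   # False
```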


### Saving and loading arrays 

<hr style="border:2px solid gray">

* The functions `np.savetxt` and `np.loadtxt` in NumPy are used for saving and loading arrays to and from text files, respectively.
    * Syntax: `np.savetxt(fname, arr)`
    * Parameters:
        * fname: File name or file object where the data will be saved.
        * arr: Array or data to be saved.

In [62]:
arr = np.array([[1., 2, 3], [4, 5, 6]])

arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [63]:
# Save to a text file
np.savetxt('my_array.txt', arr)

# Load from a text file
loaded_arr = np.loadtxt('my_array.txt')

loaded_arr

array([[1., 2., 3.],
       [4., 5., 6.]])
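Both functions accept a `delimiter` argument, and `np.savetxt` accepts a `fmt` argument that controls how numbers are written. The sketch below writes a comma-separated file into a temporary directory so that it leaves no files behind; the file name and format string are arbitrary choices for this illustration.

```python
import os
import tempfile

import numpy as np

arr = np.array([[1., 2, 3], [4, 5, 6]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_array.csv")

    # Write comma-separated values with two decimal places
    np.savetxt(path, arr, delimiter=",", fmt="%.2f")

    # Look at the raw text that was written
    with open(path) as f:
        text = f.read()

    # The same delimiter is needed to read the file back
    loaded = np.loadtxt(path, delimiter=",")

print(text)
# 1.00,2.00,3.00
# 4.00,5.00,6.00
print(np.array_equal(loaded, arr))   # True
```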

### HW 3

1. Consider the following NumPy array. Write Python code that counts the zero elements in the array.

In [None]:
arr = np.array([0, 5, 0, 3, 7, 0, 8, 0, 2])


2. Given a 2D NumPy array `org_arr`, write Python code that returns a new array with a border of 0s on all sides (shown below).

In [None]:
org_arr = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]])

desired_arr = np.array(
    [[0, 0, 0, 0, 0],
     [0, 1, 2, 3, 0],
     [0, 4, 5, 6, 0],
     [0, 7, 8, 9, 0],
     [0, 0, 0, 0, 0]])

print(f"Your solution should be:\n {desired_arr}")

3. Create a 2×4 zero array. Change the second column in the array to 1.

4. Given a NumPy array `arr`, devise a solution to find the value that occurs most frequently in the array using `np.unique()`: https://numpy.org/doc/stable/reference/generated/numpy.unique.html.

In [None]:
arr = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])


5. Given a 2D NumPy array `matrix`, compute the sum of each row and the sum of each column, and store them in two separate arrays: `sums_of_rows` and `sums_of_columns`.

In [None]:
matrix = np.array([[2, 4, 6],
                   [1, 3, 5],
                   [7, 8, 9],
                   [10, 11, 12]])

matrix

6. Write down the shapes of `matrix`, `sums_of_rows`, and `sums_of_columns`. 

7. Consider two NumPy arrays, `array1` and `array2`, both of length `n`. Write Python code that generates a new 2D NumPy array whose rows contain all possible combinations of elements from `array1` and `array2`. The NumPy functions `np.repeat` and `np.tile` are useful here.
    * `np.repeat(array, repeats)` repeats each element of the array a specified number of times: https://numpy.org/doc/stable/reference/generated/numpy.repeat.html.
    * `np.tile(array, reps)` constructs an array by repeating the entire array a specified number of times: https://numpy.org/doc/stable/reference/generated/numpy.tile.html.

In [None]:
array1 = np.array([1, 2])

array2 = np.array([3, 4])

8. Write Python code to get the common items between two NumPy arrays `a` and `b`.

In [None]:
a = np.array([1,2,3,2,3,4,3,4,5,6])

b = np.array([7,2,10,2,7,4,9,4,9,8])

9. Write Python code to normalize the following array so that its values range exactly between 0 and 1. Min-max scaling is a normalization technique used to scale and transform the values of a dataset into a specific range, usually between 0 and 1. The formula for min-max scaling is:
$$X_{\text{scaled}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
$$

In [None]:
unnormalized_arr = np.array([3, 2, 8, 9, 1, 4])

10. You have an array of exam scores. Create a new NumPy array called `result` such that for each score, if it's greater than or equal to 60, the corresponding value in `result` is "Pass"; otherwise it is "Fail". `np.where` is a NumPy function that, when called as `np.where(condition, x, y)`, returns an array whose elements are taken from `x` where the condition is true and from `y` otherwise; when called with only the condition, it returns the indices of the elements that satisfy it (https://numpy.org/doc/stable/reference/generated/numpy.where.html).

In [None]:
exam_scores = np.array([75, 48, 90, 30, 65, 80])
