In [1]:
import numpy as np

## Introduction

* What is a NumPy ndarray?
  * A NumPy ndarray is a multidimensional array object that is the core data structure of the NumPy library. It's designed for efficient storage and manipulation of numerical data. All elements in an ndarray must be of the same data type, specified by the dtype attribute.
 
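* What is the difference between NumPy and standard Python?
  * A Python list can hold objects of mixed types and needs an explicit loop for element-wise math; an ndarray holds a single data type and supports vectorized operations.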

* What are the pros and cons?
  * Pros
    * Efficient numerical computations due to vectorized operations.
    * Homogeneous data type for memory efficiency.
    * Fast indexing and slicing capabilities.
    * Support for linear algebra and other mathematical functions.
  * Cons
    * Fixed size after creation.
    * Requires all elements to be the same data type.

* When do you use it?
  * When you need to perform numerical computations efficiently on large datasets, especially in data analysis, scientific computing, and machine learning. It is used when dealing with arrays, matrices, and other numerical data structures.
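
The two cons listed above (fixed size, single data type) can be seen in a minimal sketch. The array contents and variable names are illustrative assumptions, not part of the original exercise.

```python
import numpy as np

# Mixing ints and floats: NumPy upcasts every element to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# An ndarray cannot grow in place; np.append returns a new, larger array
# and leaves the original untouched.
original = np.array([1, 2, 3])
extended = np.append(original, 4)
print(original.shape, extended.shape)  # (3,) (4,)
```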
 

### Coding Question:
* Create a 1D NumPy array
* Create a 2D NumPy array
* Create a 3x3 NumPy ndarray named my_array with random integers between 1 and 10. Print its shape, data type, and the maximum and minimum values in the array.

### Conceptual Questions:
* Why is an ndarray more suitable for numerical computations than a standard Python list?
  * Because ndarrays support vectorized operations and are more memory efficient due to their homogeneous data type. Python lists do not support vectorized operations without looping.

* What is the significance of the dtype attribute in an ndarray?
  * The dtype specifies the data type of the elements and ensures efficient storage and computation by allowing NumPy to handle memory in a more performant manner.
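
As a rough illustration of how dtype drives memory use, the sketch below compares two arrays of the same length with different dtypes. The array length and names are assumptions chosen for the example.

```python
import numpy as np

# Same number of elements, different dtypes.
compact = np.zeros(1000, dtype=np.int8)   # 1 byte per element
wide = np.zeros(1000, dtype=np.int64)     # 8 bytes per element

print(compact.itemsize, compact.nbytes)   # 1 1000
print(wide.itemsize, wide.nbytes)         # 8 8000

# astype returns a new array converted to the requested dtype.
as_float = compact.astype(np.float32)
print(as_float.dtype)                     # float32
```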

In [8]:
# Create
my_array1D = np.array([1, 2, 3])

print("Array: \n", my_array1D)
print("Shape: ", my_array1D.shape)
print("Data Type: ", my_array1D.dtype)
print("Maximum Value:", np.max(my_array1D))
print("Minimum Value:", np.min(my_array1D))
print("Array Dimension:", my_array1D.ndim)

my_array2D = np.array([[1, 2, 3], [4, 5 ,6]])

print("Array: \n", my_array2D)
print("Shape: ", my_array2D.shape)
print("Data Type: ", my_array2D.dtype)
print("Maximum Value:", np.max(my_array2D))
print("Minimum Value:", np.min(my_array2D))
print("Array Dimension:", my_array2D.ndim)


# Create a 3x3 NumPy ndarray named my_array with random integers between 1 and 10. Print its shape, data type, and the maximum and minimum values in the array.
my_array = np.random.randint(1, 11, size=(3, 3))

print("Array: \n", my_array)
print("Shape: ", my_array.shape)
print("Data Type: ", my_array.dtype)
print("Maximum Value:", np.max(my_array))
print("Minimum Value:", np.min(my_array))
print("Array Dimension:", my_array.ndim)

Array: 
 [1 2 3]
Shape:  (3,)
Data Type:  int64
Maximum Value: 3
Minimum Value: 1
Array Dimension: 1
Array: 
 [[1 2 3]
 [4 5 6]]
Shape:  (2, 3)
Data Type:  int64
Maximum Value: 6
Minimum Value: 1
Array Dimension: 2
Array: 
 [[ 4  6  8]
 [ 3 10  7]
 [ 2  8  9]]
Shape:  (3, 3)
Data Type:  int64
Maximum Value: 10
Minimum Value: 2
Array Dimension: 2


## Vectorized Operations

* What are vectorized operations in NumPy?
  * Vectorized operations are operations applied to entire NumPy arrays without writing explicit loops. They enable *efficient element-wise computation*. (Remember that the shapes must be compatible, e.g. (3, 1) + (1, 3).)
 

* What are the pros and cons?
  * Pros
    * Significant performance improvement over using loops in Python.
    * More concise and readable code.
    * Leverages optimized low-level implementations.
  * Cons
    * Can be less intuitive to write for complex operations that cannot be easily vectorized.

* When do you use it?
  * Whenever you need to perform element-wise computations on arrays, such as arithmetic, mathematical, and logical operations. This is a core practice for all NumPy array calculations.
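
Element-wise operations need compatible shapes. A small sketch of a (3, 1) array added to a (1, 3) array follows; the values are made up for illustration, and NumPy's broadcasting stretches each size-1 axis to match the other operand.

```python
import numpy as np

col = np.array([[0], [10], [20]])  # shape (3, 1)
row = np.array([[1, 2, 3]])        # shape (1, 3)

# Each size-1 axis is broadcast against the other array, giving shape (3, 3).
result = col + row
print(result.shape)  # (3, 3)
print(result)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]
```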
 

### Coding Question:
* Create two NumPy arrays, arr1 and arr2, both of shape (5,). Calculate the element-wise addition, subtraction, multiplication, and division of these arrays using vectorized operations.


### Conceptual Questions:
* Why are vectorized operations more efficient than using loops in Python for numerical computations?
  * Vectorized operations are implemented in optimized low-level code, avoiding Python's interpreter overhead. Loops in Python are significantly slower than NumPy's vectorized operations.

* How can vectorized operations improve the readability and maintainability of your code?
  * Vectorized operations allow you to express computations concisely, making the code easier to understand and modify.
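
To make the readability point concrete, here is a sketch of the same computation written as an explicit loop and as a vectorized expression. The `prices` and `quantities` arrays are hypothetical data.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])  # hypothetical data
quantities = np.array([1, 2, 3])       # hypothetical data

# Explicit Python loop: one multiplication per iteration.
totals_loop = []
for price, quantity in zip(prices, quantities):
    totals_loop.append(price * quantity)

# Vectorized: a single expression over whole arrays.
totals_vec = prices * quantities

print(totals_vec)                               # [10. 40. 90.]
print(np.array_equal(totals_loop, totals_vec))  # True
```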

In [10]:
# Demo
# Python lists: + concatenates the lists, it does not add element-wise
print([1, 2, 3] + [4, 5, 6])

# element-wise computation - NumPy
print(np.array([1, 2, 3]) + np.array([4, 5, 6]))

[1, 2, 3, 4, 5, 6]
[5 7 9]


In [20]:
# Create two NumPy arrays, arr1 and arr2, both of shape (5,). Calculate the element-wise addition, subtraction, multiplication, and division of these arrays using vectorized operations.

arr1 = np.random.randint(0, 100, size=(5,))
# Start arr2 at 1 so the division below never divides by zero
arr2 = np.random.randint(1, 100, size=(5,))

print("Array1: \n", arr1)
print("Array2: \n", arr2)
print("Addition: ", arr1 + arr2)
print("Subtraction: ", arr1 - arr2)
print("Multiplication: ", arr1 * arr2)
print("Division: ", arr1 / arr2)

Array1: 
 [ 2 76 15 90 45]
Array2: 
 [93 58 33 30 28]
Addition:  [ 95 134  48 120  73]
Subtraction:  [-91  18 -18  60  17]
Multiplication:  [ 186 4408  495 2700 1260]
Division:  [0.02150538 1.31034483 0.45454545 3.         1.60714286]


## Boolean Indexing

* What is boolean indexing in NumPy?
  * Boolean indexing uses a boolean array (mask) to select elements from a NumPy array based on specified conditions.

* What are the pros and cons?
  * Pros (Boolean indexing saves time, improves code clarity, and handles large datasets efficiently.)
    * Flexible and efficient data filtering.
    * Clear and concise way to express data selection.
  * Cons
    * Requires creating a boolean mask with the same shape as the array, and the selection returns a copy rather than a view, so both use additional memory.

* When do you use it?
  * When you need to select or filter specific elements from an array based on certain conditions. This technique is widely used when cleaning and processing data.
  * Data Filtering
    * Use boolean conditions to filter out unwanted or invalid data, such as selecting values that meet a certain threshold
    * Example: Filtering all temperatures above 30°C in climate data.
  * Data Cleaning
    * Identify and handle missing values, outliers, or incorrect data using boolean masks
    * Example: Setting invalid data points (e.g., negative age values) to NaN or excluding them.
  * Feature Selection in Machine Learning
    * Use boolean indexing to extract rows or columns that meet specific conditions.
    * Example: Selecting samples with labels or features that satisfy a criterion.
  * Efficient Operations
    * Instead of writing slow loops, boolean indexing lets you perform operations on specific subsets of data efficiently using NumPy’s optimized array operations.
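
The data-cleaning use case above can be sketched as follows. The `ages` values and the 0 to 120 validity range are assumptions made for the example; conditions are combined with `|` (or) and inverted with `~` (not).

```python
import numpy as np

# Hypothetical data; float dtype so that NaN can be stored.
ages = np.array([25.0, -3.0, 41.0, 130.0, 37.0])

# Combine conditions with | (or) and & (and); each comparison needs parentheses.
invalid = (ages < 0) | (ages > 120)

# Option 1: overwrite invalid entries with NaN on a copy.
cleaned = ages.copy()
cleaned[invalid] = np.nan

# Option 2: keep only the valid entries.
valid_only = ages[~invalid]

print(invalid.sum())  # 2
print(valid_only)     # [25. 41. 37.]
```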
 

### Coding Question:
* Create a NumPy array named my_array of 20 random numbers. Use boolean indexing to create a new array named filtered_array containing only the values greater than 0.5. Also, calculate how many of the values in the original array are greater than 0.5 using the sum method.

### Conceptual Questions:
* How can boolean indexing help in cleaning and pre-processing data?
  * Boolean indexing can be used to easily identify and remove invalid or outlier data. This makes data cleaning much easier to implement.

* Why is boolean indexing more efficient for data filtering than using loops?
  * Boolean indexing operates directly on the array, using optimized low-level implementations, which is more efficient than using loops in Python.

In [24]:
# Create a NumPy array named my_array of 20 random numbers. Use boolean indexing to create a new array named filtered_array containing only the values greater than 0.5. Also, calculate how many of the values in the original array are greater than 0.5 using the sum method.

# Create a NumPy array of 20 random numbers between 0 and 1
my_array = np.random.rand(20)

# Use boolean indexing to filter values greater than 0.5
filtered_array = my_array[my_array > 0.5]

# Calculate how many values are greater than 0.5 using the sum method
count = (my_array > 0.5).sum()

print("Original Array:\n", my_array)
print("\nFiltered Array (values > 0.5):\n", filtered_array)
print("\nNumber of values greater than 0.5:", count)

Original Array:
 [0.62261092 0.40563352 0.00937596 0.61674144 0.09075599 0.42716926
 0.48076599 0.39251715 0.18754434 0.6341253  0.33284906 0.72073806
 0.19892235 0.4055414  0.14665279 0.83493904 0.88315604 0.70813899
 0.9694146  0.05374326]

Filtered Array (values > 0.5):
 [0.62261092 0.61674144 0.6341253  0.72073806 0.83493904 0.88315604
 0.70813899 0.9694146 ]

Number of values greater than 0.5: 8


In [35]:
# Suppose we have a dictionary mapping client IDs to transaction amounts. We want to keep only the transactions above 500.

transactions = {
    "client1": 150,
    "client2": 1200,
    "client3": 800,
    "client4": 50,
    "client5": 1300
}

# Convert to NumPy arrays
client_ids = np.array(list(transactions.keys()))
transaction_amounts = np.array(list(transactions.values()))

# Apply boolean indexing: filtering transactions > $500
filtered_indices = transaction_amounts > 500
filtered_clients = client_ids[filtered_indices]
filtered_amounts = transaction_amounts[filtered_indices]

# Combine filtered results into a new dictionary
filtered_transactions = dict(zip(filtered_clients.tolist(), filtered_amounts.tolist()))
print("Filtered Transactions:", filtered_transactions)

Filtered Transactions: {'client2': 1200, 'client3': 800, 'client5': 1300}


## Indexing and Slicing

### Coding Question
* Extract the second row.
* Extract the first column.
* Extract the element in the third row and second column.
* Extract a subarray containing the first two rows and last two columns.

### Explanation
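* array[n]: returns the row at index n (indices start at 0)
* array[n:m]: returns the rows at indices n through m - 1
* array[n][m]: returns the element at row index n and column index m (indexes twice, creating an intermediate row)
* array[n, m]: returns the same element in a single indexing operation; this is the preferred form
* array[n:, m:]: returns the subarray from row index n and column index m to the end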
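
The indexing forms listed above can be tried on a small array. This is a minimal sketch; the name `grid` is an assumption, and the last lines show that a basic slice is a view onto the original data.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(grid[2, 1])    # 8, single indexing operation
print(grid[2][1])    # 8, same element via an intermediate row
print(grid[1:3])     # rows at indices 1 and 2
print(grid[1:, 1:])  # [[5 6], [8 9]]

# Basic slices are views: writing through the slice changes the original.
sub = grid[:2, 1:]
sub[0, 0] = 99
print(grid[0, 1])    # 99
```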

In [52]:
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Second row: ", array[1])
print("First column: ", array[:, 0])
print("Element at (3, 2):", array[2][1])
print("Subarray: \n", array[0:2, 1:])

Second row:  [4 5 6]
First column:  [1 4 7]
Element at (3, 2): 8
Subarray: 
 [[2 3]
 [5 6]]


## Universal Functions (ufuncs)

* What are universal functions (ufuncs) in NumPy?
  * Universal functions are functions that perform *element-wise operations* on NumPy arrays.

* What are the pros and cons?
  * Pros
    * Highly efficient for element-wise operations.
    * Can be used on arrays of any shape.
    * Concise syntax for common operations.
  * Cons
    * Less suited to operations where each output depends on several input elements at once, such as sorting or sliding-window calculations.

* When do you use it?
  * When you need to apply a mathematical or logical function to every element of an array, such as square roots, exponents, or trigonometric functions.

### Coding Question:

* Create a NumPy array of 10 values. Calculate the square root and the exponential of each element using ufuncs.


### Conceptual Questions:

* How do ufuncs contribute to the efficiency of NumPy?
  * ufuncs are implemented using optimized low-level code, which makes them faster than similar operations done using Python loops.

* What are some common use cases for ufuncs in data analysis?
  * ufuncs are used for transforming data, such as scaling values, taking logarithms, or applying trigonometric functions to features of your data set.
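
A short sketch of the transformations mentioned above: a logarithm, min-max scaling, and a trigonometric function. The `incomes` and `angles` arrays are hypothetical feature values.

```python
import numpy as np

incomes = np.array([1_000.0, 10_000.0, 100_000.0])  # hypothetical feature

# Logarithm: compresses a wide range of values.
log_incomes = np.log10(incomes)

# Min-max scaling to [0, 1], built from element-wise subtraction and division.
scaled = (incomes - incomes.min()) / (incomes.max() - incomes.min())

# Trigonometric ufunc applied to every element.
angles = np.array([0.0, np.pi / 2, np.pi])
sines = np.sin(angles)

print(log_incomes)
print(scaled)
print(sines)
```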

In [53]:
# Create a NumPy array with 10 values
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Calculate the square root of each element
square_roots = np.sqrt(values)

# Calculate the exponential of each element
exponentials = np.exp(values)

print("Original Array:", values)
print("Square Roots:", square_roots)
print("Exponentials:", exponentials)

Original Array: [ 1  2  3  4  5  6  7  8  9 10]
Square Roots: [1.         1.41421356 1.73205081 2.         2.23606798 2.44948974
 2.64575131 2.82842712 3.         3.16227766]
Exponentials: [2.71828183e+00 7.38905610e+00 2.00855369e+01 5.45981500e+01
 1.48413159e+02 4.03428793e+02 1.09663316e+03 2.98095799e+03
 8.10308393e+03 2.20264658e+04]


## Array Transposition

* What is array transposition in NumPy?
  * Array transposition swaps the rows and columns of a 2D NumPy array. For arrays with more dimensions, it reverses the order of the axes.

* What are the pros and cons?
  * Pros
    * Useful for reshaping data for linear algebra and other operations.
    * Provides a view rather than a copy of the original array, saving memory.
  * Cons
    * Can alter data layout, potentially leading to non-contiguous arrays.

* When do you use it?
  * When you need to change the orientation of your data. This can occur when preparing data for linear algebra operations or when needing to switch rows to columns.

### Coding Question:
* Create a 2x4 NumPy array and calculate its transpose using the .T attribute.

### Conceptual Questions:

* Why is transposition useful in the context of linear algebra?
  * Transposition is often needed to align the dimensions of arrays for linear algebra operations such as matrix multiplication.

* How does the transpose operation avoid making a copy of the data?
  * Transpose returns a view of the underlying data, not a copy. This improves efficiency, especially when working with large arrays.
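
The view behaviour and the dimension-alignment use case can be checked with a small sketch. The name `matrix` is an assumption; `np.shares_memory` and the `flags` attribute are used only to inspect the result.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8]])
transposed = matrix.T

# Aligning dimensions for matrix multiplication: (2, 4) @ (4, 2) -> (2, 2).
product = matrix @ transposed
print(product.shape)  # (2, 2)

# The transpose shares the original buffer instead of copying it.
print(np.shares_memory(matrix, transposed))  # True

# The view is no longer C-contiguous, which is the con noted above.
print(matrix.flags['C_CONTIGUOUS'], transposed.flags['C_CONTIGUOUS'])  # True False

# Writing through the view changes the original array.
transposed[0, 1] = 50
print(matrix[1, 0])  # 50
```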

In [56]:
my_array = np.array([[1, 2, 3, 4],
                     [5, 6, 7, 8]])
transposed_array = my_array.T

print("Original Array (2x4):")
print(my_array)
print("\nTransposed Array (4x2):")
print(transposed_array)

Original Array (2x4):
[[1 2 3 4]
 [5 6 7 8]]

Transposed Array (4x2):
[[1 5]
 [2 6]
 [3 7]
 [4 8]]
