# Introduction to NumPy for Machine Learning
In this notebook, we will practice Python with NumPy, the fundamental package for scientific computing in Python. Understanding NumPy is essential for Machine Learning because:

- NumPy provides efficient multi-dimensional array operations

- Most ML libraries either build on NumPy arrays (scikit-learn) or convert to and from them (TensorFlow, PyTorch)

- Numerical computations on arrays are much faster than on Python lists

- Linear algebra operations are built-in and optimized

We will go step by step:

- NumPy basics (arrays, data types, attributes)

- Array creation methods

- Array indexing and slicing

- Array operations and broadcasting

- Mathematical functions

- Linear algebra operations

- Random number generation

- Reshaping and manipulating arrays

Run each code cell and read the explanation carefully. Try small variations yourself!

## Import NumPy

In [None]:
import numpy as np

# Check NumPy version
print("NumPy version:", np.__version__)

## What is NumPy?
NumPy (Numerical Python) is the core library for numerical computing in Python. It provides:

- ndarray: Efficient multi-dimensional array object

- Mathematical functions for array operations

- Tools for integrating C/C++ and Fortran code

- Linear algebra, Fourier transform, and random number capabilities

The key advantage over Python lists is speed and convenience for numerical operations.

In [None]:
# Simple comparison: Python list vs NumPy array
import time

# Using Python list
python_list = list(range(1000000))
start_time = time.perf_counter()  # high-resolution clock intended for timing
result_list = [x * 2 for x in python_list]
list_time = time.perf_counter() - start_time

# Using NumPy array
numpy_array = np.arange(1000000)
start_time = time.perf_counter()
result_array = numpy_array * 2
array_time = time.perf_counter() - start_time

print(f"Python list time: {list_time:.4f} seconds")
print(f"NumPy array time: {array_time:.4f} seconds")
print(f"NumPy is {list_time/array_time:.1f}x faster!")
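
A single timed run can be noisy. As a rough sketch, the standard-library `timeit` module can repeat each measurement and keep the best run. The array size and the repeat counts below are arbitrary choices for illustration, not values from this notebook.

```python
import timeit

import numpy as np

n = 100_000  # smaller than above so the repeated runs stay quick
python_list = list(range(n))
numpy_array = np.arange(n)

# timeit.repeat runs each statement `number` times per trial and returns
# one total per trial; the minimum is usually the least noisy estimate
list_time = min(timeit.repeat(lambda: [x * 2 for x in python_list], number=5, repeat=3))
array_time = min(timeit.repeat(lambda: numpy_array * 2, number=5, repeat=3))

print(f"Python list: {list_time:.4f} s for 5 runs")
print(f"NumPy array: {array_time:.4f} s for 5 runs")

# Both approaches compute the same values
same = [x * 2 for x in python_list] == (numpy_array * 2).tolist()
print("Same result:", same)
```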

## NumPy Arrays: The ndarray
The ndarray (n-dimensional array) is NumPy's core data structure. Key properties:

- Homogeneous: All elements must be the same data type

- Fixed size: Size cannot be changed after creation (unlike Python lists)

- Efficient: Optimized for numerical operations

- Multi-dimensional: Can have 1, 2, 3, or more dimensions

In [None]:
# Creating a simple 1D array
arr1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1d)
print("Type:", type(arr1d))
print("Shape:", arr1d.shape)
print("Dimensions:", arr1d.ndim)
print("Data type:", arr1d.dtype)
print("Size (total elements):", arr1d.size)
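
The "homogeneous" and "fixed size" properties listed above can be seen in a small sketch. The variable names are illustrative only.

```python
import numpy as np

# Homogeneous: mixed inputs are converted to one common dtype
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# Assigning a float into an integer array truncates it to fit the dtype
ints = np.array([1, 2, 3])
ints[0] = 9.7
print(ints[0])  # 9

# Fixed size: np.append returns a new array; the original is unchanged
original = np.array([1, 2, 3])
extended = np.append(original, 4)
print(original.size, extended.size)  # 3 4
```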

## Creating NumPy Arrays
There are multiple ways to create NumPy arrays:

In [None]:
# From Python lists
list_data = [1, 2, 3, 4, 5]
arr_from_list = np.array(list_data)
print("From list:", arr_from_list)

In [None]:
# 2D array from nested lists
matrix_data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr_2d = np.array(matrix_data)
print("\n2D Array:")
print(arr_2d)
print("Shape:", arr_2d.shape)

## Special Array Creation Functions
NumPy provides many functions to create arrays with specific patterns:

In [None]:
# Array of zeros
zeros_arr = np.zeros(5)
print("Zeros:", zeros_arr)

In [None]:
# Array of ones
ones_arr = np.ones((3, 3))  # 3x3 matrix of ones
print("\nOnes (3x3):")
print(ones_arr)

In [None]:
# Identity matrix
identity_matrix = np.eye(4)
print("\nIdentity Matrix (4x4):")
print(identity_matrix)

In [None]:
# Array with a range of values
range_arr = np.arange(0, 10, 2)  # start, stop, step
print("\nRange (0 to 10, step 2):", range_arr)

In [None]:
# Array with evenly spaced values
linspace_arr = np.linspace(0, 1, 5)  # start, stop, num_points
print("\nLinspace (0 to 1, 5 points):", linspace_arr)
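
One detail worth checking yourself: `np.arange` excludes the stop value, while `np.linspace` includes it by default. The sketch below also shows `np.full` and `np.zeros_like`, two related creation functions not used elsewhere in this notebook.

```python
import numpy as np

# arange excludes the stop value; linspace includes it by default
range_arr = np.arange(0, 10, 2)
linspace_arr = np.linspace(0, 1, 5)
print(range_arr)     # [0 2 4 6 8]
print(linspace_arr)  # [0.   0.25 0.5  0.75 1.  ]

# Constant-filled array of a given shape
full_arr = np.full((2, 3), 7)

# New array of zeros with the same shape and dtype as an existing one
template = np.array([[1.5, 2.5], [3.5, 4.5]])
zeros_like_arr = np.zeros_like(template)
print(zeros_like_arr.shape, zeros_like_arr.dtype)  # (2, 2) float64
```

For fractional spacing, `np.linspace` is usually the safer choice, because a floating-point step in `np.arange` can make the number of elements hard to predict.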

## Array Data Types
NumPy arrays have specific data types for memory efficiency:

In [None]:
# Different data types
int_arr = np.array([1, 2, 3], dtype=np.int32)
float_arr = np.array([1.0, 2.0, 3.0], dtype=np.float64)
bool_arr = np.array([True, False, True], dtype=np.bool_)

print("Integer array:", int_arr, "dtype:", int_arr.dtype)
print("Float array:", float_arr, "dtype:", float_arr.dtype)
print("Boolean array:", bool_arr, "dtype:", bool_arr.dtype)

# Converting data types
converted = int_arr.astype(np.float64)
print("\nConverted to float:", converted, "dtype:", converted.dtype)
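
To make the memory-efficiency point concrete, here is a minimal sketch comparing the footprint of the same values under two dtypes. The array length of 1000 is an arbitrary choice.

```python
import numpy as np

values = np.arange(1000, dtype=np.int64)
small = values.astype(np.int16)  # every value in 0..999 fits in int16

print(values.itemsize, values.nbytes)  # 8 8000
print(small.itemsize, small.nbytes)    # 2 2000

# Float -> int conversion truncates toward zero; it does not round
truncated = np.array([1.9, -1.9]).astype(np.int32)
print(truncated)  # [ 1 -1]
```

Choosing a smaller dtype saves memory only when every value fits in its range; values outside the range are not preserved.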

## Array Indexing and Slicing
Accessing elements in NumPy arrays is similar to Python lists but with multi-dimensional support:

In [None]:
# Create a sample 2D array
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

print("Original array:")
print(arr)

In [None]:
# Basic indexing
print("\nElement at row 1, column 2:", arr[1, 2])
print("First row:", arr[0])
print("Last column:", arr[:, -1])

In [None]:
# Slicing
print("\nFirst two rows:")
print(arr[:2])
print("\nMiddle two columns:")
print(arr[:, 1:3])
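
Unlike list slices, basic NumPy slices are views onto the original data. The following sketch uses a fresh example array to show the difference between a view and an explicit copy.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

view = arr[:2, 1:3]   # basic slicing returns a view
view[0, 0] = 100      # writes through to arr
print(arr[0, 1])      # 100

independent = arr[:2, 1:3].copy()
independent[0, 0] = -1  # arr is unaffected
print(arr[0, 1])        # 100

print(np.shares_memory(arr, view))         # True
print(np.shares_memory(arr, independent))  # False
```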

## Boolean Indexing
Select elements based on conditions:

In [None]:
# Create sample data
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

print("Original data:")
print(data)

# Boolean indexing
mask = data > 5
print("\nBoolean mask (elements > 5):")
print(mask)

print("\nElements greater than 5:")
print(data[data > 5])

In [None]:
# Multiple conditions: combine masks with & (and) or | (or), parenthesizing each comparison
print("\nElements strictly between 3 and 7:")
print(data[(data > 3) & (data < 7)])
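
Boolean masks are useful for more than selection. As a short sketch on the same 3x3 data, a mask can also be counted, used for assignment, or passed to `np.where` and `np.nonzero`.

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# True counts as 1, so summing a mask counts the matching elements
count = np.sum(data > 5)
print(count)  # 4

# Assign through a mask (on a copy, to keep the original intact)
capped = data.copy()
capped[capped > 5] = 5

# np.where picks between two values element by element
labels = np.where(data > 5, 1, 0)

# np.nonzero returns the row and column indices of the True entries
rows, cols = np.nonzero(data > 5)
print(rows, cols)  # [1 2 2 2] [2 0 1 2]
```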

## Array Operations
NumPy supports element-wise operations:

In [None]:
# Create sample arrays
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

print("Array a:", a)
print("Array b:", b)

In [None]:
# Arithmetic operations
print("\nAddition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)
print("Exponentiation:", a ** 2)

In [None]:
# Comparison operations
print("\na > 2:", a > 2)
print("a == b:", a == b)
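
Two details of these operations are easy to miss: `/` always produces floats, and `==` on floats compares exactly. The sketch below shows both; `np.isclose` and `np.allclose` are the usual tolerance-based alternatives.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

print((a / b).dtype)  # float64: true division yields floats
print(b // a)         # [5 3 2 2]: floor division keeps integers

# Exact equality is fragile for floats because of rounding error
x = np.array([0.1 + 0.2, 0.5])
y = np.array([0.3, 0.5])
print(x == y)             # [False  True]
print(np.isclose(x, y))   # [ True  True]
print(np.allclose(x, y))  # True
```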

## Broadcasting
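Through broadcasting, NumPy can perform operations on arrays of different shapes, as long as the shapes are compatible: comparing dimensions from the last one backwards, each pair of sizes must be equal or one of them must be 1.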

In [None]:
# Array and scalar
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
print("Original array:")
print(arr)

In [None]:
print("\nArray + 5:")
print(arr + 5)

In [None]:
print("\nArray * 2:")
print(arr * 2)

In [None]:
# Array and 1D array
vector = np.array([10, 20, 30])
print("\nArray + vector:")
print(arr + vector)
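
As a further sketch, broadcasting also works along the other axis with a column vector, and it fails with a `ValueError` when the shapes do not line up. `np.broadcast_shapes` is assumed to be available (NumPy 1.20 or later).

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])        # shape (2, 3)
column = np.array([[100], [200]])  # shape (2, 1)

# The size-1 axis of `column` is stretched across the 3 columns
result = arr + column
print(result)
print(np.broadcast_shapes(arr.shape, column.shape))  # (2, 3)

# Shapes (2, 3) and (2,) are incompatible: trailing sizes 3 and 2 differ
try:
    arr + np.array([1, 2])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Broadcast error:", err)
```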

## Mathematical Functions
NumPy provides comprehensive mathematical functions:

In [None]:
# Create sample data
arr = np.array([1, 4, 9, 16, 25])

print("Array:", arr)

In [None]:
# Basic math functions
print("\nSquare root:", np.sqrt(arr))
print("Exponential:", np.exp(arr))
print("Logarithm:", np.log(arr))
print("Sine:", np.sin(arr))

In [None]:
# Statistical functions
print("\nSum:", np.sum(arr))
print("Mean:", np.mean(arr))
print("Standard deviation:", np.std(arr))
print("Minimum:", np.min(arr))
print("Maximum:", np.max(arr))
print("Index of maximum:", np.argmax(arr))
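
On multi-dimensional arrays, the statistical functions accept an `axis` argument. A minimal sketch on a small 2x3 matrix, which also shows the `ddof` parameter of `np.std`:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m))           # 21: all elements
print(np.sum(m, axis=0))   # [5 7 9]: collapse rows, one value per column
print(np.sum(m, axis=1))   # [ 6 15]: collapse columns, one value per row
print(np.mean(m, axis=1))  # [2. 5.]

# np.std defaults to the population formula (ddof=0);
# ddof=1 gives the sample standard deviation
sample = np.array([1, 4, 9, 16, 25])
population_std = np.std(sample)
sample_std = np.std(sample, ddof=1)
```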

## Array Manipulation
Reshaping and manipulating arrays:

In [None]:
# Create sample array
arr = np.arange(12)
print("Original 1D array:", arr)

In [None]:
# Reshape to 2D
arr_2d = arr.reshape(3, 4)
print("\nReshaped to 3x4:")
print(arr_2d)

In [None]:
# Flatten back to 1D
arr_flat = arr_2d.flatten()
print("\nFlattened:", arr_flat)

In [None]:
# Transpose
arr_transposed = arr_2d.T
print("\nTransposed:")
print(arr_transposed)

In [None]:
# Stacking arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print("\nVertical stack:")
print(np.vstack([a, b]))
print("\nHorizontal stack:")
print(np.hstack([a, b]))
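
Two reshaping details are worth a quick sketch: passing `-1` lets NumPy infer one dimension, and `flatten()` always copies while `ravel()` returns a view when it can (as it does here, for a contiguous array).

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer the missing dimension (12 / 3 = 4)
auto = arr.reshape(3, -1)
print(auto.shape)  # (3, 4)

# flatten() returns a copy: changing it leaves `auto` untouched
flat_copy = auto.flatten()
flat_copy[0] = 99
print(auto[0, 0])  # 0

# ravel() avoids the copy when the memory layout allows it
raveled = auto.ravel()
print(np.shares_memory(auto, raveled))    # True
print(np.shares_memory(auto, flat_copy))  # False
```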

## Linear Algebra Operations
NumPy has extensive linear algebra capabilities:

In [None]:
# Create matrices
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print("Matrix A:")
print(A)
print("\nMatrix B:")
print(B)

In [None]:
# Matrix operations
print("\nMatrix multiplication:")
print(np.dot(A, B))

In [None]:
# Alternative: A @ B
print("\nMatrix multiplication (alternative):")
print(A @ B)

In [None]:
print("\nElement-wise multiplication:")
print(A * B)

In [None]:
print("\nDeterminant of A:", np.linalg.det(A))

In [None]:
print("Inverse of A:")
print(np.linalg.inv(A))

In [None]:
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("\nEigenvalues of A:", eigenvalues)
print("Eigenvectors of A:")
print(eigenvectors)
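
As a sketch of how these pieces fit together, the code below solves a linear system with `np.linalg.solve` (generally preferred over computing the inverse explicitly) and checks the eigen-decomposition numerically. The right-hand side `b` is made up for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([5.0, 6.0])  # illustrative right-hand side

# Solve A @ x = b without forming the inverse
x = np.linalg.solve(A, b)
print("Solution x:", x)
print("A @ x reproduces b:", np.allclose(A @ x, b))

# Each column v of `eigenvectors` satisfies A @ v = lambda * v
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigen check:", np.allclose(A @ eigenvectors, eigenvectors * eigenvalues))
```

Note that `eigenvectors * eigenvalues` relies on broadcasting: each column is scaled by its own eigenvalue.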

## Random Number Generation
Essential for machine learning initialization and simulations:

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Various random distributions
print("Random floats [0,1):", np.random.rand(5))
print("\nRandom integers [0,10):", np.random.randint(0, 10, 5))
print("\nRandom normal distribution:", np.random.randn(5))

In [None]:
# Random 2D array
random_matrix = np.random.rand(3, 3)
print("\nRandom 3x3 matrix:")
print(random_matrix)

In [None]:
# Random choice from array
choices = ['a', 'b', 'c', 'd']
print("\nRandom choices:", np.random.choice(choices, size=10))
print("Random choices with probabilities:",
      np.random.choice(choices, size=10, p=[0.1, 0.2, 0.3, 0.4]))
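
The `np.random.seed` and `np.random.rand` functions used above belong to NumPy's legacy random interface. Newer code often uses a `Generator` created with `np.random.default_rng`, which keeps the random state in an object instead of a global. A minimal sketch of the equivalent calls:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator object

floats = rng.random(5)                            # uniform floats in [0, 1)
ints = rng.integers(0, 10, size=5)                # integers in [0, 10)
normals = rng.normal(loc=0.0, scale=1.0, size=5)  # standard normal
picks = rng.choice(['a', 'b', 'c', 'd'], size=10, p=[0.1, 0.2, 0.3, 0.4])

print(floats)
print(ints)
print(normals)
print(picks)

# A second generator with the same seed reproduces the same stream
rng_again = np.random.default_rng(42)
print("Reproducible:", np.array_equal(floats, rng_again.random(5)))
```

The legacy functions and the `Generator` methods produce different numbers for the same seed, so do not expect outputs to match between the two styles.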

## Working with Real Data
Let's apply NumPy to a more realistic scenario:

In [None]:
# Simulate a dataset (like you might load from a CSV)
# Let's create sample data for house prices
np.random.seed(123)

In [None]:
# Features: size (sq ft), bedrooms, age (years)
n_samples = 100
house_sizes = np.random.normal(2000, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
house_ages = np.random.randint(0, 50, n_samples)

In [None]:
# Create feature matrix
X = np.column_stack([house_sizes, bedrooms, house_ages])
print("Feature matrix shape:", X.shape)
print("First 5 rows:")
print(X[:5])

In [None]:
# Target variable: price (simplified calculation)
prices = (house_sizes * 100 + bedrooms * 10000 - house_ages * 1000 +
          np.random.normal(0, 10000, n_samples))

print("\nPrice statistics:")
print(f"Mean: ${np.mean(prices):.2f}")
print(f"Std: ${np.std(prices):.2f}")
print(f"Min: ${np.min(prices):.2f}")
print(f"Max: ${np.max(prices):.2f}")

In [None]:
# Normalize features (common in ML)
X_normalized = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
print("\nFirst 5 normalized features:")
print(X_normalized[:5])
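
A quick way to check a normalization like the one above is to confirm that each column ends up with mean 0 and standard deviation 1. The sketch below regenerates similar data with its own generator and seed, so the numbers differ from the cell above.

```python
import numpy as np

rng = np.random.default_rng(123)  # seed chosen for this sketch only
X = np.column_stack([
    rng.normal(2000, 500, 100),  # size (sq ft)
    rng.integers(1, 6, 100),     # bedrooms
    rng.integers(0, 50, 100),    # age (years)
])

# Keep the statistics so the same scaling can be reused or undone later
mean = X.mean(axis=0)
std = X.std(axis=0)
X_normalized = (X - mean) / std

print("Column means ~ 0:", np.allclose(X_normalized.mean(axis=0), 0))
print("Column stds  ~ 1:", np.allclose(X_normalized.std(axis=0), 1))

# Undo the scaling to recover the original values
X_restored = X_normalized * std + mean
print("Restored:", np.allclose(X_restored, X))
```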

## Saving and Loading Arrays
Persist your data for later use:

In [None]:
# Create sample data
data = np.random.randn(5, 3)
print("Sample data:")
print(data)

In [None]:
# Save to file
np.save('my_array.npy', data)

In [None]:
# Load from file
loaded_data = np.load('my_array.npy')
print("\nLoaded data:")
print(loaded_data)

In [None]:
# Save and load text files (CSV-like)
np.savetxt('my_array.csv', data, delimiter=',')
loaded_csv = np.loadtxt('my_array.csv', delimiter=',')
print("\nLoaded from CSV:")
print(loaded_csv)

In [None]:
# Verify they're the same
print("\nArrays are equal:", np.array_equal(data, loaded_data))
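
When several related arrays belong together (for example features and targets), `np.savez` can store them in one `.npz` file under names you choose. This is a sketch with made-up arrays; it writes to a temporary directory so it leaves no files behind.

```python
import os
import tempfile

import numpy as np

features = np.arange(6, dtype=np.float64).reshape(3, 2)
targets = np.array([10, 20, 30])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.npz")  # hypothetical file name

    # Keyword names become the keys inside the archive
    np.savez(path, features=features, targets=targets)

    # np.load returns a dict-like archive; close it via the context manager
    with np.load(path) as archive:
        names = sorted(archive.files)
        loaded_features = archive["features"]
        loaded_targets = archive["targets"]

print("Stored arrays:", names)
print("Round trip OK:",
      np.array_equal(features, loaded_features)
      and np.array_equal(targets, loaded_targets))
```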

## Advanced Indexing: Fancy Indexing
Selecting multiple non-adjacent elements:

In [None]:
# Create sample array
arr = np.arange(0, 100, 10)
print("Original array:", arr)

In [None]:
# Fancy indexing with integer arrays
indices = [1, 3, 5, 7]
print("\nElements at indices", indices, ":", arr[indices])

In [None]:
# Boolean fancy indexing
mask = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)
print("\nElements where mask is True:", arr[mask])

In [None]:
# Multi-dimensional fancy indexing
matrix = np.random.rand(4, 4)
print("\nRandom 4x4 matrix:")
print(matrix)

In [None]:
# Select individual elements: row and column indices are paired up one by one
selected = matrix[[0, 2, 3], [1, 2, 0]]  # (0,1), (2,2), (3,0)
print("\nSelected elements:", selected)
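
Passing two index lists pairs them element by element. To select a block made of every combination of the chosen rows and columns instead, `np.ix_` can be used. A small sketch on a deterministic matrix, so the results are easy to check by eye:

```python
import numpy as np

matrix = np.arange(16).reshape(4, 4)
print(matrix)

# Paired indexing: picks (0,1), (2,2), (3,0)
paired = matrix[[0, 2, 3], [1, 2, 0]]
print(paired)  # [ 1 10 12]

# np.ix_ builds an open mesh: rows 0 and 2 crossed with columns 1 and 3
block = matrix[np.ix_([0, 2], [1, 3])]
print(block)

# Fancy indexing returns a copy, so editing `block` leaves `matrix` alone
block[0, 0] = -1
print(matrix[0, 1])  # 1
```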

## Practical Example: Image Processing
NumPy arrays are perfect for image manipulation:

In [None]:
# Simulate a grayscale image (2D array)
image = np.random.randint(0, 256, (10, 10), dtype=np.uint8)
print("Original 'image' (10x10 pixels):")
print(image)

In [None]:
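# Image operations
# Widen the dtype before adding: uint8 arithmetic wraps around past 255,
# so clipping the uint8 sum directly would come too late
brightened = np.clip(image.astype(np.int16) + 50, 0, 255).astype(np.uint8)  # Brighten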
print("\nBrightened image:")
print(brightened)

In [None]:
# Edge detection kernel (simplified)
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

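# We'd normally use convolution, but for simplicity we loop over the interior pixels.
# The output needs a signed dtype: kernel responses can be negative or exceed 255,
# which would not fit in the image's uint8.
edges = np.zeros(image.shape, dtype=np.int32)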
for i in range(1, image.shape[0] - 1):
    for j in range(1, image.shape[1] - 1):
        neighborhood = image[i-1:i+2, j-1:j+2]
        edges[i, j] = np.sum(neighborhood * kernel)

# let's see the final output
print("\nEdges detected:")
print(edges)
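
The nested loop above can also be written without Python loops. The sketch below is one possible vectorized version, assuming `sliding_window_view` is available (NumPy 1.20 or later). It builds its own random image and checks the result against a loop-based reference.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)  # seed chosen for this sketch only
image = rng.integers(0, 256, size=(10, 10), dtype=np.uint8)
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Work on a signed copy so negative responses are representable
signed = image.astype(np.int32)

# windows[i, j] is the 3x3 neighborhood whose top-left corner is (i, j)
windows = sliding_window_view(signed, (3, 3))  # shape (8, 8, 3, 3)

# Multiply every window by the kernel and sum over the window axes
edges = np.zeros(image.shape, dtype=np.int32)
edges[1:-1, 1:-1] = np.sum(windows * kernel, axis=(2, 3))

# Loop-based reference for comparison
loop_edges = np.zeros(image.shape, dtype=np.int32)
for i in range(1, image.shape[0] - 1):
    for j in range(1, image.shape[1] - 1):
        loop_edges[i, j] = np.sum(signed[i-1:i+2, j-1:j+2] * kernel)

print("Vectorized matches loop:", np.array_equal(edges, loop_edges))
```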

In [None]:
# Rescaling image
scaled = image[::2, ::2]  # Downsample by 2
print("\nScaled image (5x5):")
print(scaled)

## Practice
Create a complete data analysis workflow:

1.    Generate sample student data: scores in Math, Physics, Chemistry (0-100)

2.    Calculate average score for each student

3.    Find students with average > 80

4.    Normalize all scores

5.    Calculate correlation matrix between subjects

6.    Save the processed data to a file



In [None]:
# Try your solution here


<details>
<summary>👉 Click to show solution</summary>

```python
# Solution
import numpy as np
np.random.seed(42)

# 1. Generate sample data
n_students = 50
math_scores = np.random.randint(50, 101, n_students)
physics_scores = np.random.randint(55, 98, n_students)
chemistry_scores = np.random.randint(60, 95, n_students)

scores = np.column_stack([math_scores, physics_scores, chemistry_scores])
print("1. Student scores (first 5 rows):")
print(scores[:5])

# 2. Calculate averages
averages = np.mean(scores, axis=1)
print("\n2. Student averages (first 5):")
print(averages[:5])

# 3. High performing students
high_performers = scores[averages > 80]
print(f"\n3. High performers (average > 80): {len(high_performers)} students")

# 4. Normalize scores
scores_normalized = (scores - np.mean(scores, axis=0)) / np.std(scores, axis=0)
print("\n4. Normalized scores (first 5 rows):")
print(scores_normalized[:5])

# 5. Correlation matrix
correlation_matrix = np.corrcoef(scores.T)
print("\n5. Correlation matrix:")
print("   Math  Physics Chemistry")
print(correlation_matrix)

# 6. Save data
np.savetxt('student_scores.csv', scores, delimiter=',', fmt='%d')
print("\n6. Data saved to 'student_scores.csv'")

# Verify by loading
loaded_scores = np.loadtxt('student_scores.csv', delimiter=',')
print("Data verified:", np.array_equal(scores, loaded_scores))
```
</details>

## Summary and Best Practices
Key Takeaways:

- For numerical data, NumPy arrays are faster and more memory-efficient than Python lists

- Broadcasting allows operations on differently shaped arrays

- Vectorized operations are preferred over loops

- Understand array shapes and dimensions for ML applications

Common Pitfalls to Avoid:

- Modifying a slice (which is a view) and unintentionally changing the original array; call `.copy()` when you need independent data

- Ignoring array shapes and dimensions

- Using Python loops instead of vectorized operations

- Forgetting about data types and memory usage