# Lecture 1: The Ultimate Guide to Representing Data (A NumPy Masterclass)

Welcome to the first lecture and notebook for the series on Linear Algebra for Machine Learning! In this session, we'll build the entire foundation for our journey. We'll cover:

1.  **The "Why":** A tour of real-world machine learning problems to see how data is represented.
2.  **The "What":** Formal definitions of the core data containers: Scalars, Vectors, Matrices, and Tensors.
3.  **The "How":** A practical, hands-on masterclass in NumPy to create, inspect, and manipulate these objects.

## Setup

First, let's import NumPy, pandas, and our visualization libraries, plus a sample-image loader from scikit-learn.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_sample_image

# Set plot style for better visuals
# (the 'seaborn-v0_8-*' style names require Matplotlib 3.6 or newer)
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context('talk')
%matplotlib inline

---

## Part 1: Why Linear Algebra? The Language of Data

Machine learning models can only process numbers. We use linear algebra to represent real-world data in a structured, numerical format.

### The Tour: 1. Tabular Data

- **Description:** Standard spreadsheets or tables.
- **Representation:** The entire table is a **Matrix**. A single row (e.g., one house) is a **Vector**.

In [None]:
# Create a Pandas DataFrame to represent a spreadsheet
house_df = pd.DataFrame({
    'SquareFootage': [1500, 2000, 1200, 1800],
    'Bedrooms': [3, 4, 2, 3],
    'Bathrooms': [2, 3, 1, 2],
    'YearBuilt': [1990, 2000, 1975, 1985]
})

# Convert the DataFrame to a NumPy matrix
houses_matrix = house_df.to_numpy()

# Visualize the matrix with a heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(houses_matrix, annot=True, fmt='d', cmap='viridis',
            xticklabels=house_df.columns, yticklabels=[f'House {i+1}' for i in range(4)])
plt.title('Tabular Data as a Matrix')
plt.show()

# Extract a single row as a vector
house_vector = houses_matrix[0, :]
print(f"A single house as a vector: {house_vector}")
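To make the row/column picture concrete, here is a minimal sketch that rebuilds the same house values as a plain NumPy array (the names `houses`, `first_house`, and `bedrooms` are illustrative, not part of the notebook above). A row slice gives one house's feature vector; a column slice gives one feature across all houses.

```python
import numpy as np

# Same values as the house table above, rebuilt so this cell runs on its own
houses = np.array([
    [1500, 3, 2, 1990],
    [2000, 4, 3, 2000],
    [1200, 2, 1, 1975],
    [1800, 3, 2, 1985],
])

print(houses.shape)         # (4, 4): 4 houses (rows) x 4 features (columns)

first_house = houses[0, :]  # one row = one house (a feature vector)
bedrooms = houses[:, 1]     # one column = one feature across all houses

print(first_house)          # values 1500, 3, 2, 1990
print(bedrooms)             # values 3, 4, 2, 3
```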

### The Tour: 2. Image Data

- **Description:** Images are grids of pixels.
- **Representation:** A grayscale image is a 2D **Matrix**. A color image is a 3D **Tensor** (height, width, color channels).

In [None]:
# Load a sample color image from scikit-learn
china = load_sample_image('china.jpg')

# The image is a 3D NumPy array (Tensor)
print(f"Image Tensor Shape: {china.shape}") # (height, width, channels)

# Separate the channels
red_channel = china[:, :, 0]
green_channel = china[:, :, 1]
blue_channel = china[:, :, 2]

# Visualize the full image and its channels
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

axes[0].imshow(china)
axes[0].set_title('Original Color Image (Tensor)')

axes[1].imshow(red_channel, cmap='Reds')
axes[1].set_title('Red Channel (Matrix)')

axes[2].imshow(green_channel, cmap='Greens')
axes[2].set_title('Green Channel (Matrix)')

axes[3].imshow(blue_channel, cmap='Blues')
axes[3].set_title('Blue Channel (Matrix)')

for ax in axes:
    ax.axis('off')

plt.tight_layout()
plt.show()
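The matrix-versus-tensor distinction can also be checked without plotting. The sketch below uses a tiny, made-up 2x2 RGB image (`tiny_rgb` is hypothetical data) and collapses the channel axis with a plain average. This is only one simple way to get a grayscale matrix; other channel weightings are common.

```python
import numpy as np

# A tiny synthetic 2x2 RGB image (made-up values), shape (height, width, channels)
tiny_rgb = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

# Averaging over the last axis (the channels) leaves a 2D matrix
tiny_gray = tiny_rgb.mean(axis=2)

print(tiny_rgb.shape, tiny_rgb.ndim)    # (2, 2, 3) 3  -> a 3D tensor
print(tiny_gray.shape, tiny_gray.ndim)  # (2, 2) 2     -> a 2D matrix
```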

### The Tour: 3. Text Data (NLP)

- **Description:** Sentences and documents.
- **Representation:** A sentence can be a **Matrix**, where each row is a **Vector** (a 'word embedding') representing a single word.

In [None]:
# Create a fake word embedding matrix for a sentence
sentence = "Linear algebra is fun"
words = sentence.split()
embedding_dim = 8 # Each word is represented by an 8-dimensional vector

# Each row is a word vector
embedding_matrix = np.random.rand(len(words), embedding_dim)

plt.figure(figsize=(10, 3))
sns.heatmap(embedding_matrix, annot=False, cmap='plasma',
            yticklabels=words, xticklabels=[f'dim {i+1}' for i in range(embedding_dim)])
plt.title('Sentence as a Word Embedding Matrix')
plt.show()
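Since each row of the embedding matrix stands for one word, looking up a word's vector is just row indexing. A minimal sketch, assuming a hypothetical `word_to_row` dictionary and seeded random values standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the fake values are reproducible

words = "Linear algebra is fun".split()
embedding_dim = 8
embeddings = rng.random((len(words), embedding_dim))  # one row per word

# Map each word to its row index, then pull out that row
word_to_row = {word: i for i, word in enumerate(words)}
algebra_vector = embeddings[word_to_row["algebra"]]

print(embeddings.shape)      # (4, 8): 4 words x 8 dimensions
print(algebra_vector.shape)  # (8,): a single word vector
```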

### The Tour: 4. Recommender Systems

- **Description:** User interactions with items (e.g., movie ratings).
- **Representation:** A large user-item interaction **Matrix**, where rows are users, columns are movies, and cells are ratings.

In [None]:
# Create a sample user-item matrix. 0 represents an unrated movie.
user_item_matrix = np.array([
    [5, 4, 0, 1, 0],
    [0, 5, 4, 0, 0],
    [1, 0, 0, 5, 4],
    [0, 0, 2, 4, 5],
    [4, 4, 0, 0, 0]
])

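# Build a boolean mask for the zeros (missing ratings) so those cells are left blank.
# seaborn's heatmap takes the mask through its `mask` argument; the mask carried by
# a NumPy masked array may be dropped when seaborn converts the input to a plain array.
missing_mask = user_item_matrix == 0

plt.figure(figsize=(8, 6))
sns.heatmap(user_item_matrix, mask=missing_mask, annot=True, fmt='d', cmap='coolwarm',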
            xticklabels=[f'Movie {i+1}' for i in range(5)],
            yticklabels=[f'User {i+1}' for i in range(5)])
plt.title('Recommender System Data as a Matrix')
plt.show()
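Treating 0 as "unrated" matters for arithmetic as well as for plotting. The sketch below (variable names are illustrative) compares a naive row mean, which counts the zeros as real ratings, with a mean taken over rated movies only.

```python
import numpy as np

ratings = np.array([
    [5, 4, 0, 1, 0],
    [0, 5, 4, 0, 0],
    [1, 0, 0, 5, 4],
    [0, 0, 2, 4, 5],
    [4, 4, 0, 0, 0],
])

rated = ratings != 0                  # True where a rating exists
num_rated = rated.sum(axis=1)         # how many movies each user rated

naive_mean = ratings.mean(axis=1)             # wrongly treats 0 as a rating
rated_mean = ratings.sum(axis=1) / num_rated  # averages only the real ratings

print(num_rated)    # 3, 2, 3, 3, 2 ratings per user
print(naive_mean)
print(rated_mean)
```

For the first user, the naive mean is 2.0, while the mean over the three movies actually rated is 10 / 3.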

---

## Part 2: The NumPy Masterclass

Now we'll dive deep into the essential NumPy operations for creating and manipulating these data structures.

### 2.1 Key Terminology & Inspecting Attributes

In [None]:
M = np.array([
    [4, 1800, 25],
    [3, 1500, 40]
])

print("A Matrix is a 2D array of numbers.")
print(f"Matrix:\n{M}")
print("---------------------------------------")
print("A Scalar is a single number.")
s = M[0, 0]
print(f"Scalar extracted from matrix: {s}")
print("---------------------------------------")
print("A Vector is an ordered list of numbers.")
v = M[0, :]
print(f"Vector extracted from matrix: {v}")
print("---------------------------------------")
print("Dimension can mean the number of elements in a vector, the size of a matrix along one axis, OR the number of axes of an array (NumPy's ndim).")
print(f"Shape (rows, columns): {M.shape}")
print(f"Number of dimensions (axes): {M.ndim}")
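The same attributes apply to all four containers from this lecture. A small sketch (the example values are arbitrary) that lines up `ndim`, `shape`, and `size` for a scalar, a vector, a matrix, and a 3D tensor:

```python
import numpy as np

scalar = np.array(25)                               # 0 axes
vector = np.array([4, 1800, 25])                    # 1 axis
matrix = np.array([[4, 1800, 25], [3, 1500, 40]])   # 2 axes
tensor = np.zeros((2, 3, 4))                        # 3 axes

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    # ndim = number of axes, shape = length of each axis, size = total elements
    print(f"{name:>6}: ndim={arr.ndim}, shape={arr.shape}, size={arr.size}")
```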

### 2.2 Array Creation Routines

In [None]:
print("np.zeros:")
print(np.zeros((2, 3)))

print("\nnp.ones:")
print(np.ones(4))

print("\nnp.arange:")
print(np.arange(5, 15, 2))

print("\nnp.linspace:")
print(np.linspace(0, 10, 5))

print("\nnp.random.randint (a 3x3 matrix of integers from 1 to 10):")
print(np.random.randint(1, 11, size=(3, 3)))
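Two details of these routines are easy to miss: `np.arange` excludes its stop value, while `np.linspace` includes it by default. The sketch below also shows a seeded generator (`np.random.default_rng`), which is one way to make random arrays reproducible; the seed 42 is arbitrary.

```python
import numpy as np

stepped = np.arange(5, 15, 2)   # 5, 7, 9, 11, 13 (the stop value 15 is excluded)
spaced = np.linspace(0, 10, 5)  # 0, 2.5, 5, 7.5, 10 (both endpoints included)

# Two generators with the same seed produce the same "random" matrix
rolls_a = np.random.default_rng(42).integers(1, 11, size=(3, 3))
rolls_b = np.random.default_rng(42).integers(1, 11, size=(3, 3))

print(stepped)
print(spaced)
print(np.array_equal(rolls_a, rolls_b))  # True
```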

### 2.3 Indexing and Slicing (The Most Important Skill!)

In [None]:
M = np.random.randint(10, 100, size=(5, 5))
print(f"Original 5x5 Matrix:\n{M}")

print(f"\nElement at row 0, col 1: {M[0, 1]}")

print(f"\nSecond row (index 1): {M[1, :]}")

print(f"\nThird column (index 2): {M[:, 2]}")

print(f"\nTop-right 2x2 sub-grid:\n{M[0:2, 3:5]}")
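Two follow-ups to basic slicing are worth a quick sketch. First, a basic slice is a view, so writing to it changes the original array unless you call `.copy()`. Second, a boolean condition can be used as an index to pick out matching elements. The array `grid` below is a made-up example with predictable values.

```python
import numpy as np

grid = np.arange(1, 26).reshape(5, 5)  # values 1..25, row by row

# A basic slice is a view: writing to it changes `grid` too
corner = grid[0:2, 3:5]
corner[0, 0] = -1
print(grid[0, 3])            # -1

# .copy() gives an independent array
safe = grid[0:2, 3:5].copy()
safe[0, 0] = 99
print(grid[0, 3])            # still -1

# Boolean indexing: keep only the elements where the condition is True
large = grid[grid > 20]
print(large)                 # values 21 through 25, as a 1D array
```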

### 2.4 Reshaping and Broadcasting

In [None]:
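# Broadcasting visualization: the (3,) vector is added to every row of the (3, 3) matrix
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
vector = np.array([10, 20, 30])
result = matrix + vector  # NumPy stretches the vector across the rows; no explicit loop is needed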

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sns.heatmap(matrix, annot=True, fmt='d', cmap='viridis', cbar=False, ax=axes[0])
axes[0].set_title('Original Matrix')

sns.heatmap(vector.reshape(1, -1), annot=True, fmt='d', cmap='viridis', cbar=False, ax=axes[1])
axes[1].set_title('Vector to Broadcast (+)')
axes[1].set_yticks([]) # Hide y-axis ticks

sns.heatmap(result, annot=True, fmt='d', cmap='viridis', cbar=False, ax=axes[2])
axes[2].set_title('Result after Broadcasting (=)')

plt.tight_layout()
plt.show()
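As a plot-free companion to the figure above, here is a short sketch of both halves of this section: `reshape` (including `-1` to let NumPy infer a dimension) and the difference between broadcasting a row vector and a column vector. Variable names are illustrative.

```python
import numpy as np

# Reshaping: same 6 elements, different shapes
flat = np.arange(6)              # shape (6,)
as_matrix = flat.reshape(2, 3)   # shape (2, 3)
inferred = flat.reshape(-1, 2)   # -1 lets NumPy work out the row count: (3, 2)

# Broadcasting: a (3,) vector lines up with the columns of a (3, 3) matrix
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row = np.array([10, 20, 30])
plus_row = matrix + row          # adds 10, 20, 30 to columns 0, 1, 2

# Reshaped to (3, 1), the same numbers line up with the rows instead
col = row.reshape(3, 1)
plus_col = matrix + col          # adds 10 to row 0, 20 to row 1, 30 to row 2

print(as_matrix.shape, inferred.shape)
print(plus_row)
print(plus_col)
```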

---

## Part 3: Exercises

Now it's your turn to practice. Complete the following exercises to solidify your understanding.

### Exercise 1 (Creation)
Create a 1D NumPy array containing the numbers from 10 to 50, with a step of 5.

In [None]:
# Your solution here
ex1_array = np.arange(10, 51, 5)
print(ex1_array)

### Exercise 2 (Reshaping)
Take the array from Ex 1. Notice it has 9 elements. Reshape it into a 3x3 matrix.

In [None]:
# Your solution here
ex2_matrix = ex1_array.reshape(3, 3)
print(ex2_matrix)

### Exercise 3 (Data Representation)
Create a random 5x4 matrix representing a small dataset of 5 students and 4 exam scores. The scores should be integers between 50 and 100.

In [None]:
# Your solution here
student_scores = np.random.randint(50, 101, size=(5, 4))
print(student_scores)

### Exercise 4 (Slicing)
From the student matrix in Ex 3, select:
1. The scores of the 3rd student (index 2).
2. The scores of all students for the 4th exam (index 3).

In [None]:
# Your solution here
third_student_scores = student_scores[2, :]
fourth_exam_scores = student_scores[:, 3]
print(f"Scores for student 3: {third_student_scores}")
print(f"Scores for exam 4: {fourth_exam_scores}")

### Exercise 5 (Boolean Indexing)
From the student matrix, find all scores that are considered 'A' grades (>= 90).

In [None]:
# Your solution here
a_grades = student_scores[student_scores >= 90]
print(f"'A' Grades: {a_grades}")

### Exercise 6 (Broadcasting Challenge)
You have the 5x4 student matrix. 'Curve' the grades by adding the following points to each of the 4 exams respectively: `[2, 5, 0, 3]`. This will require broadcasting a 1D vector onto the 2D matrix.

In [None]:
# Your solution here
curve = np.array([2, 5, 0, 3])
curved_scores = student_scores + curve
print("Curved Scores:")
print(curved_scores)

### Bonus Exercise (Image Processing)
Create a random 3D tensor of shape `(32, 32, 3)` representing a small color image. Use slicing to turn the top-left 10x10 pixels pure green: set the green channel to 1.0 and the red and blue channels to 0. Visualize the image before and after the modification.

In [None]:
# Your solution here
image = np.random.rand(32, 32, 3)  # Values are between 0 and 1
modified_image = image.copy()

# Set the R and B channels to 0 and the G channel to 1 in the top-left corner
modified_image[0:10, 0:10, 0] = 0 # Red channel
modified_image[0:10, 0:10, 1] = 1 # Green channel
modified_image[0:10, 0:10, 2] = 0 # Blue channel

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(image)
axes[0].set_title('Original Random Image')
axes[1].imshow(modified_image)
axes[1].set_title('Modified Image')
plt.show()

---

## Next Steps

Congratulations on completing the first lecture! You now have a solid foundation in how data is represented numerically and how to manipulate it with NumPy.

If you're comfortable with all the material here, you're ready to move on to **Lecture 2: The Dot Product - The Heart of Machine Learning!**