# Lecture 1: The Ultimate Guide to Representing Data (A NumPy Masterclass)

Welcome to the first lecture and notebook for the series on Linear Algebra for Machine Learning! 

In this session, we'll build the entire foundation for our journey. We'll cover:

1.  **The "Why":** A tour of real-world machine learning problems to see how data is represented.
2.  **The "What":** Formal definitions of the core data containers: Scalars, Vectors, Matrices, and Tensors.
3.  **The "How":** A practical, hands-on masterclass in NumPy to create, inspect, and manipulate these objects.

Let's get started!

## Setup

First, let's import NumPy, our fundamental library for numerical computations. We'll also import some visualization libraries that we'll use for examples.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

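# Set plot style. The bare 'seaborn' Matplotlib style name was deprecated in
# Matplotlib 3.6 and fails on recent versions, so use seaborn's own theme instead.
sns.set_theme()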
%matplotlib inline

## Part 1: Why Linear Algebra? The Language of Data

Machine learning models, no matter how complex, can only process numbers. Linear algebra provides the tools and structures to represent different kinds of data in a numerical format.

### The Tour: 1. Tabular Data

- **Description:** Standard spreadsheets or tables.
- **Representation:** The entire table is a **Matrix**. A single row (e.g., one house) is a **Vector**.

### The Tour: 2. Image Data

- **Description:** Images are grids of pixels.
- **Representation:** A grayscale image is a 2D **Matrix**. A color image is a 3D **Tensor** (height, width, color channels).

### The Tour: 3. Text Data (NLP)

- **Description:** Sentences and documents.
- **Representation:** A sentence can be a **Matrix**, where each row is a **Vector** (a 'word embedding') representing a single word.

### The Tour: 4. Recommender Systems

- **Description:** User interactions with items (e.g., movie ratings).
- **Representation:** A large user-item interaction **Matrix**, where rows are users, columns are movies, and cells are ratings.

Let's see some examples!

In [None]:
# Example 1: Representing a data point
# Each feature is one dimension in our vector
house_features = np.array([1500,  # Square footage
                          3,      # Bedrooms
                          2,      # Bathrooms
                          1990])  # Year built

print("House features as a vector:", house_features)
print("Vector shape:", house_features.shape)

### From Single Points to Datasets

In machine learning, we typically work with many data points. This is where matrices come in!

In [None]:
# Example 2: Multiple houses as a matrix
houses_dataset = np.array([
    [1500, 3, 2, 1990],  # House 1
    [2000, 4, 3, 2000],  # House 2
    [1200, 2, 1, 1975],  # House 3
    [1800, 3, 2, 1985]   # House 4
])

print("Houses dataset as a matrix:")
print(houses_dataset)
print("\nMatrix shape:", houses_dataset.shape)
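The same row-per-entity layout carries over to the other examples from the tour. The sketch below builds a tiny user-item ratings matrix and a sentence matrix of word embeddings. Everything in it is made up for illustration: the ratings, the use of 0 to mean "not rated", the three-word sentence, and the 4-dimensional embedding values are assumptions, not real data. Real embeddings are learned and usually much wider.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies.
# Assumption: 0 marks "not rated yet" (real systems often use NaN or a sparse format).
ratings = np.array([
    [5, 0, 3],   # User 1
    [4, 2, 0],   # User 2
    [0, 5, 4],   # User 3
    [1, 0, 5],   # User 4
])
print("Ratings matrix shape:", ratings.shape)    # (users, movies)
print("All ratings by user 2:", ratings[1])
print("All ratings for movie 3:", ratings[:, 2])

# Hypothetical sentence "cats chase mice": one row per word.
# The embedding values are invented for illustration only.
sentence = np.array([
    [0.2, -0.1,  0.7, 0.0],   # "cats"
    [0.9,  0.3, -0.4, 0.1],   # "chase"
    [0.1, -0.2,  0.6, 0.2],   # "mice"
])
print("Sentence matrix shape:", sentence.shape)  # (words, embedding_dim)
```

In both cases a single row is a vector describing one entity (a user or a word), just like one house in the dataset above.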

### Images as Matrices

A grayscale image is naturally represented as a matrix in which each element holds the intensity of one pixel. We'll visualize one in Part 2.

## Part 2: The Core Data Containers & The Power of Shape

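The formal definitions are collected under Key Terminology further below. First, let's look at the property you will inspect most often in practice: an array's shape.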

### Decoding Tensor Shapes: The Key to Debugging Deep Learning

The `shape` of a NumPy array is a tuple that tells you the size of the array along each dimension.

- **A Color Image:** `(height, width, channels)` -> e.g., `(1080, 1920, 3)`
- **A Batch of Images (for a CNN):** `(batch_size, height, width, channels)` -> e.g., `(64, 224, 224, 3)`
- **NLP Data (A Batch of Sentences):** `(batch_size, sequence_length, embedding_dim)` -> e.g., `(32, 50, 300)`
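A quick way to build intuition for these shapes is to allocate placeholder arrays and inspect them. The following is a minimal sketch using zero-filled arrays with the example shapes listed above. The variable names and the choice of `dtype` are illustrative assumptions.

```python
import numpy as np

# Placeholder arrays with the example shapes (all zeros, no real data).
# uint8 (0-255) is a common pixel storage type and keeps memory use small.
color_image = np.zeros((1080, 1920, 3), dtype=np.uint8)
image_batch = np.zeros((64, 224, 224, 3), dtype=np.uint8)
sentence_batch = np.zeros((32, 50, 300), dtype=np.float32)

for name, arr in [("color_image", color_image),
                  ("image_batch", image_batch),
                  ("sentence_batch", sentence_batch)]:
    print(f"{name}: shape={arr.shape}, ndim={arr.ndim}")

# Indexing along the first axis removes that axis:
first_image = image_batch[0]        # one image: (height, width, channels)
first_sentence = sentence_batch[0]  # one sentence: (sequence_length, embedding_dim)
print(first_image.shape, first_sentence.shape)
```

When a deep learning model complains about mismatched shapes, printing `arr.shape` at each step like this is usually the fastest way to find where an axis went missing or got added.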

Let's explore this with a visual example:

In [None]:
# Create a simple 8x8 grayscale image
simple_image = np.zeros((8, 8))
simple_image[2:6, 2:6] = 1  # Create a white square in the middle

plt.figure(figsize=(10, 4))

# Show the image
plt.subplot(121)
plt.imshow(simple_image, cmap='gray')
plt.title('Simple 8x8 Image')

# Show the matrix values
plt.subplot(122)
sns.heatmap(simple_image, annot=True, fmt='.1f', cmap='gray')
plt.title('Image as a Matrix')

plt.tight_layout()
plt.show()

### Key Terminology

Let's introduce some key terms we'll use throughout the series:

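- **Scalar**: A single number
- **Vector**: An ordered list of numbers (like our house features)
- **Matrix**: A 2D array of numbers (like our houses dataset)
- **Tensor**: An array with any number of axes; in deep learning the term usually refers to 3 or more (like a color image)
- **Dimension**: For a vector, the number of components (our house vector has 4). For a NumPy array, the number of axes, reported by `ndim`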

We'll expand on these concepts in the upcoming lectures!

## Part 3: The NumPy Masterclass

Now that we understand what these objects are, let's learn how to create and manipulate them in Python using NumPy. We'll cover:

1. Creating Arrays & Inspecting Attributes
2. Array Creation Routines
3. Indexing and Slicing (The Most Important Skill!)
4. Reshaping and Broadcasting

### 3.1 Creating Arrays & Inspecting Attributes

Let's start with the fundamentals: creating NumPy arrays of different dimensions and inspecting their properties.
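As a starting point, here is a minimal sketch that builds one array of each kind and prints the attributes you will check most often. `ndim`, `shape`, `size`, and `dtype` are standard NumPy array attributes; the variable names and values are illustrative.

```python
import numpy as np

scalar = np.array(42)                        # 0 axes
vector = np.array([1500, 3, 2, 1990])        # 1 axis
matrix = np.array([[1, 2, 3], [4, 5, 6]])    # 2 axes
tensor = np.zeros((2, 3, 4))                 # 3 axes

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    # ndim: number of axes, shape: size along each axis,
    # size: total number of elements, dtype: element type
    print(f"{name:>6}: ndim={arr.ndim}, shape={arr.shape}, "
          f"size={arr.size}, dtype={arr.dtype}")
```

Note that `size` is always the product of the entries in `shape`, and that the integer `dtype` NumPy picks by default can differ between platforms.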

## Part 4: Exercises

Now it's your turn to practice these concepts. Each exercise builds upon the previous ones and covers a different aspect of data representation and manipulation.

### Exercise 1 (Creation)
Create a 1D NumPy array containing the numbers from 10 to 50, with a step of 5.

### Exercise 2 (Reshaping)
Take the array from Ex 1 and reshape it into a 3x3 matrix.

### Exercise 3 (Data Representation)
Create a random 5x4 matrix representing a small dataset of 5 students and 4 exam scores. The scores should be integers between 50 and 100.

### Exercise 4 (Slicing)
From the student matrix in Ex 3:
1. Select the scores of the 3rd student (index 2)
2. Select the scores of all students for the 4th exam (index 3)

### Exercise 5 (Boolean Indexing)
From the student matrix, find all scores that are considered 'A' grades (>= 90).

### Exercise 6 (Broadcasting Challenge)
You have the 5x4 student matrix. 'Curve' the grades by adding the following points to each of the 4 exams respectively: `[2, 5, 0, 3]`. This will require broadcasting a 1D vector onto the 2D matrix.

### Bonus Exercise (Image Processing)
Create a random 3D tensor of shape (32, 32, 3) representing a small color image. Use slicing to set the 'green' channel of the top-left 10x10 pixels to its maximum value (1.0), which tints that patch green. Visualize the image before and after the modification.

In [None]:
# Exercise 1: Creation
# Your code here

# Example solution:
# array = np.arange(10, 51, 5)
# print(array)

In [None]:
# Exercise 2: Reshaping
# Your code here

# Example solution:
# reshaped = array.reshape(3, 3)  # array has exactly 9 elements (10, 15, ..., 50), so it fits a 3x3 matrix
# print(reshaped)

In [None]:
# Exercise 3: Data Representation
# Your code here

# Example solution:
# student_scores = np.random.randint(50, 101, size=(5, 4))
# print("Student scores matrix:")
# print(student_scores)
# print("\nShape:", student_scores.shape)

In [None]:
# Exercise 4: Slicing
# Your code here

# Example solution:
# third_student = student_scores[2, :]  # All scores for the third student
# fourth_exam = student_scores[:, 3]    # Fourth exam scores for all students
# print("Third student's scores:", third_student)
# print("Fourth exam scores:", fourth_exam)

In [None]:
# Exercise 5: Boolean Indexing
# Your code here

# Example solution:
# a_grades_mask = student_scores >= 90
# a_grades = student_scores[a_grades_mask]
# print("A grades:", a_grades)
# print("Number of A grades:", len(a_grades))

In [None]:
# Exercise 6: Broadcasting Challenge
# Your code here

# Example solution:
# curve = np.array([2, 5, 0, 3])  # Points to add to each exam
# curved_scores = student_scores + curve  # Broadcasting in action!
# print("Original scores:\n", student_scores)
# print("\nCurved scores:\n", curved_scores)

In [None]:
# Bonus Exercise: Image Processing
# Your code here

# Example solution:
# Create a random RGB image
# image = np.random.random((32, 32, 3))  # Values between 0 and 1

# Make a copy for comparison
# modified_image = image.copy()

# Set the green channel (index 1) to 1.0 in the top-left 10x10 pixels
# modified_image[:10, :10, 1] = 1.0

# Visualize the results
# plt.figure(figsize=(12, 4))
# plt.subplot(121)
# plt.imshow(image)
# plt.title('Original Image')
# plt.subplot(122)
# plt.imshow(modified_image)
# plt.title('Modified Image (Green Patch)')
# plt.show()

In [None]:
# Your solution here!


## Next Steps

In this lecture, we've built a solid foundation in representing data numerically using NumPy. In the next lecture, we'll dive deep into the dot product, the fundamental operation that powers similarity measures, neural networks, and much more.

What you've learned:
1. How different types of data are represented numerically
2. The core data containers: scalars, vectors, matrices, and tensors
3. Essential NumPy operations for data manipulation
4. Practical skills in indexing, slicing, and broadcasting

Before moving on:
1. Complete all the exercises
2. Try modifying the exercises to create your own challenges
3. Think about how these concepts apply to your own data problems
4. Watch the accompanying video lecture if you haven't already

If you're comfortable with all the material here, you're ready to move on to Lecture 2: The Dot Product - The Heart of Machine Learning!

## Additional Exercises

### Exercise 1: User Feature Matrix
Create a matrix representing a batch of 5 user feature vectors, where each user is described by [age, city_id, num_friends, avg_daily_minutes].

In [None]:
# Your solution here
# Create a matrix with 5 rows (users) and 4 columns (features)
# Features: [age, city_id, num_friends, avg_daily_minutes]

### Exercise 2: Matrix Slicing
Using the matrix from Exercise 1:
1. Select the first 3 users
2. Select only the age and num_friends columns for all users

In [None]:
# Your solution here
# 1. Use array slicing to get first 3 users
# 2. Use array slicing to select specific columns

### Exercise 3: Boolean Indexing
Using the matrix from Exercise 1, use boolean indexing to find all users older than 30.

In [None]:
# Your solution here
# Use boolean indexing to filter users by age
# Hint: If your matrix from Exercise 1 is named user_data, user_data[:, 0] is the age column

### Exercise 4: Matrix Normalization
Create a random 5x5 matrix and "normalize" it by:
1. Subtracting the mean of the whole matrix from every element
2. Dividing the result by the standard deviation of the whole matrix

This is a common preprocessing step in machine learning known as standardization.
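In symbols, and assuming $\mu$ and $\sigma$ denote the mean and standard deviation taken over all $n$ entries $x_{ij}$ of the matrix (notation introduced here for illustration), each standardized entry is:

$$
z_{ij} = \frac{x_{ij} - \mu}{\sigma}, \qquad
\mu = \frac{1}{n}\sum_{i,j} x_{ij}, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i,j} \left(x_{ij} - \mu\right)^2}
$$

The $1/n$ form of $\sigma$ matches the default behavior of `np.std()` (`ddof=0`). For the 5x5 matrix in this exercise, $n = 25$, and the result has mean 0 and standard deviation 1.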

In [None]:
# Your solution here
# 1. Create a random 5x5 matrix
# 2. Calculate its mean and standard deviation
# 3. Normalize it using broadcasting
# Hint: Use np.mean() and np.std()

### Exercise 5: Tensor Operations
Create a random 3D tensor of shape (32, 32, 3) representing a small color image. Use slicing to set the 'green' channel of the top-left 10x10 pixels to its maximum value (1.0), which tints that patch green.

Bonus: Visualize the image before and after the modification.

In [None]:
# Your solution here
# 1. Create a random 32x32x3 tensor
# 2. Modify the green channel in the top-left corner
# 3. Visualize the results using plt.imshow()
# Hint: The green channel is at index 1