# Lab 1: Data & Model Building

In this notebook, we'll create synthetic linear data, split it into training and test sets, visualize it, and build our first PyTorch model.

Our goal: Build a model that can learn the pattern of a straight line.

## Install Dependencies

First, let's install the required libraries by running the following cell.

In [None]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install matplotlib

## Import Libraries

We need:
- `torch`: Core PyTorch library for tensors
- `torch.nn`: Contains building blocks for neural networks
- `matplotlib`: For visualization

In [None]:
import torch
from torch import nn
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")

## 1. Creating Synthetic Data

We'll create data using the linear regression formula: `y = weight * X + bias`

We set **known parameters** that our model will try to learn:
- `weight = 0.4` (the slope)
- `bias = 0.1` (the y-intercept)

We create 50 evenly spaced X values from 0 up to (but not including) 1, in steps of 0.02, then compute the corresponding y values.

In [None]:
# Create *known* parameters
weight = 0.4
bias = 0.1

# Create data
start = 0
end = 1
step = 0.02
X = torch.arange(start, end, step).unsqueeze(dim=1)
y = weight * X + bias

X[:10], y[:10]
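
As a quick sanity check of the formula, the sketch below rebuilds the same tensors and compares the first few labels against values worked out by hand. The name `expected_first_three` is introduced here purely for illustration.

```python
import torch

weight, bias = 0.4, 0.1

# Rebuild the data the same way as in the cell above
X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)
y = weight * X + bias

# Hand-computed labels for X = 0.00, 0.02, 0.04 (illustrative name)
expected_first_three = torch.tensor([[0.100], [0.108], [0.116]])

print(len(X))                                        # number of samples
print(torch.allclose(y[:3], expected_first_three))   # does the formula hold?
```

`torch.allclose` is used instead of `==` because floating-point results can differ from the hand-computed values by tiny rounding errors.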

Now we're going to build a model that can learn the relationship between `X` (features) and `y` (labels).

### Why unsqueeze?

`unsqueeze(dim=1)` adds an extra dimension to `X`, changing its shape from `[50]` to `[50, 1]`. Most PyTorch layers, such as `nn.Linear`, expect input in the format `[batch_size, features]`, so we follow that convention here.

In [None]:
print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"Number of samples: {len(X)}")
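
To see the effect of `unsqueeze` in isolation, here is a minimal sketch on a tiny tensor. The names `flat` and `column` are made up for this example and are not used elsewhere in the lab.

```python
import torch

flat = torch.arange(0, 1, 0.25)     # 1-D tensor holding 4 values
column = flat.unsqueeze(dim=1)      # add a feature dimension at position 1

print(flat.shape)
print(column.shape)

# squeeze(dim=1) removes the added dimension again
print(torch.equal(column.squeeze(dim=1), flat))
```

The values are unchanged; only the shape differs, so each sample becomes a row with one feature.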

## 2. Splitting Data into Train and Test Sets

One of the most important steps in a machine learning project is creating a training and test set.

![Train Test Split](https://raw.githubusercontent.com/poridhiEng/lab-asset/7008e578e0c9c57813d1b267134700911793d762/tensorcode/Deep-learning-with-pytorch/LinearRegression/lab-01/images/train-test-split.svg)

- **Training set**: The model learns from this data (~80%)
- **Test set**: The model gets evaluated on this data (~20%)

We want our model to learn from training data and then evaluate it on test data to see how well it **generalizes** to unseen examples.

In [None]:
# Create train/test split
train_split = int(0.8 * len(X))  # 80% for training, 20% for testing

X_train, y_train = X[:train_split], y[:train_split]
X_test, y_test = X[train_split:], y[train_split:]

len(X_train), len(y_train), len(X_test), len(y_test)

We've got 40 samples for training and 10 samples for testing.

The model will try to learn the relationship between `X_train` & `y_train`, then we'll evaluate what it learned on `X_test` and `y_test`.
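
The split logic can also be wrapped in a small function. The sketch below assumes a hypothetical helper named `split_train_test` (not part of PyTorch or this lab) and checks that the two parts cover the data without overlap.

```python
import torch

def split_train_test(X, y, train_fraction=0.8):
    """Hypothetical helper: sequential split, first part used for training."""
    split_index = int(train_fraction * len(X))
    return X[:split_index], y[:split_index], X[split_index:], y[split_index:]

# Rebuild the lab's data so this example runs on its own
X = torch.arange(0, 1, 0.02).unsqueeze(dim=1)
y = 0.4 * X + 0.1

X_train, y_train, X_test, y_test = split_train_test(X, y)

print(len(X_train), len(X_test))
# The split is sequential, so every test X lies beyond the training range
print(bool(X_test.min() > X_train.max()))
```

Because the data is split in order rather than shuffled, the test set asks the model to extend the line past the X values it trained on.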

## 3. Visualizing the Data

Right now our data is just numbers on a page. Let's create a function to visualize it.

The `plot_predictions()` function creates a scatter plot showing training data (blue), test data (red), and optionally model predictions (green). This helps us visually compare how well our model's predictions match the actual data.

In [None]:
def plot_predictions(train_data=X_train, 
                     train_labels=y_train, 
                     test_data=X_test, 
                     test_labels=y_test, 
                     predictions=None):
    """
    Plots training data, test data and compares predictions.
    """
    plt.figure(figsize=(10, 7))

    # Plot training data in blue
    plt.scatter(train_data, train_labels, c="b", s=4, label="Training data")
    
    # Plot test data in red
    plt.scatter(test_data, test_labels, c="r", s=4, label="Testing data")

    if predictions is not None:
        # Plot the predictions in green (predictions were made on the test data)
        plt.scatter(test_data, predictions, c="g", s=4, label="Predictions")

    # Show the legend
    plt.legend(prop={"size": 14})
    plt.xlabel("X")
    plt.ylabel("y")
    plt.show()

Now let's visualize our data. We should see a straight line pattern.

In [None]:
plot_predictions()

Now instead of just numbers on a page, our data is a straight line. Blue dots are training data, red dots are test data.

## 4. Building a Linear Regression Model

Now that we've got some data, let's build a model that uses the **blue dots to predict the red dots**.

![Linear Model](https://raw.githubusercontent.com/poridhiEng/lab-asset/7008e578e0c9c57813d1b267134700911793d762/tensorcode/Deep-learning-with-pytorch/LinearRegression/lab-01/images/linear-model.svg)

We'll create a class that subclasses `nn.Module`. The model has two learnable parameters:
- `self.weights`: Initialized with random values, multiplied with input X
- `self.bias`: Initialized with random values, added to the result

Both parameters are wrapped in `nn.Parameter` with `requires_grad=True` (the default for `nn.Parameter`, written out here for clarity), which tells PyTorch to track gradients so the values can be updated during training.

The `forward()` method defines how data flows through the model — it computes `y = weights * x + bias`.

In [None]:
class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(1, dtype=torch.float), requires_grad=True)
        self.bias = nn.Parameter(torch.randn(1, dtype=torch.float), requires_grad=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weights * x + self.bias
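
To confirm what `forward()` does, the sketch below redefines the same class so it runs on its own, then compares the model's output with the formula computed by hand. The variable names `by_model` and `by_hand` are illustrative.

```python
import torch
from torch import nn

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # requires_grad=True is the default for nn.Parameter
        self.weights = nn.Parameter(torch.randn(1, dtype=torch.float))
        self.bias = nn.Parameter(torch.randn(1, dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weights * x + self.bias

model = LinearRegressionModel()
x = torch.tensor([[0.0], [0.5], [1.0]])   # shape [3, 1]

by_model = model(x)                        # calling the model runs forward()
by_hand = model.weights * x + model.bias   # same formula written out

print(by_model.shape)                      # one output per input row
print(torch.allclose(by_model, by_hand))
print(model.weights.requires_grad)
```

Calling `model(x)` rather than `model.forward(x)` is the usual convention, since `nn.Module` routes the call through `forward()` for you.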

### Checking the Model Contents

Let's create a model instance and check its parameters using `.parameters()` and `.state_dict()`.

In [None]:
# Set a manual seed, since the nn.Parameter values are randomly initialized
torch.manual_seed(42)

# Create an instance of the model
model_0 = LinearRegressionModel()

# Check the nn.Parameter(s) within the nn.Module subclass
list(model_0.parameters())

We can also get the state (what the model contains) using `.state_dict()`.

In [None]:
# List named parameters
model_0.state_dict()

Notice how the values for `weights` and `bias` come out as **random float tensors**?

This is because we initialized them using `torch.randn()`.
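
Since the starting values are random, the seed decides which numbers you see. The sketch below, using illustrative names `model_a` and `model_b`, suggests that resetting the seed before each instantiation gives identical parameters.

```python
import torch
from torch import nn

class LinearRegressionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(1, dtype=torch.float))
        self.bias = nn.Parameter(torch.randn(1, dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weights * x + self.bias

torch.manual_seed(42)
model_a = LinearRegressionModel()

torch.manual_seed(42)          # reset the seed before building the second model
model_b = LinearRegressionModel()

state_a, state_b = model_a.state_dict(), model_b.state_dict()

print(list(state_a.keys()))    # parameter names, in registration order
same = all(torch.equal(state_a[name], state_b[name]) for name in state_a)
print(same)
```

Without the second `torch.manual_seed(42)` call, `model_b` would draw the next values from the random stream and start from different parameters.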

We want to start from random parameters and get the model to update them towards the target values:
- Target weight: **0.4**
- Target bias: **0.1**

Because our model starts with random values, right now it'll have **poor predictive power**.

In [None]:
print("Current model parameters:")
print(f"  weights: {model_0.state_dict()['weights'].item():.4f}")
print(f"  bias:    {model_0.state_dict()['bias'].item():.4f}")

print("\nTarget parameters:")
print(f"  weight: {weight}")
print(f"  bias:   {bias}")

## 5. Making Predictions (Before Training)

Let's see what predictions our untrained model makes. We use `torch.inference_mode()` as a context manager.

`torch.inference_mode()` turns off gradient tracking and related autograd bookkeeping, which makes forward passes faster. Use it whenever you only need predictions (inference) and are not training.

In [None]:
# Make predictions with model
with torch.inference_mode():
    y_preds = model_0(X_test)

# Note: in older PyTorch code you might see torch.no_grad()
# with torch.no_grad():
#   y_preds = model_0(X_test)
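
The effect of the context manager can be observed directly. This minimal sketch uses stand-in parameters `w` and `b` (assumptions for illustration, not the lab's `model_0`) and compares a normal forward pass with one run under `torch.inference_mode()`.

```python
import torch
from torch import nn

# Stand-in parameters for illustration only
w = nn.Parameter(torch.randn(1))
b = nn.Parameter(torch.randn(1))
x = torch.tensor([[0.8], [0.9]])

tracked = w * x + b                  # normal forward pass: autograd records it

with torch.inference_mode():
    untracked = w * x + b            # nothing is recorded for backpropagation

print(tracked.requires_grad)
print(untracked.requires_grad)
print(torch.allclose(tracked.detach(), untracked))   # same numbers either way
```

The predicted values are the same in both cases; only the gradient bookkeeping differs.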

Let's check the predictions.

In [None]:
print(f"Number of testing samples: {len(X_test)}")
print(f"Number of predictions made: {len(y_preds)}")
print(f"Predicted values:\n{y_preds}")

Notice how there's one prediction value per testing sample. For our straight line, one X value maps to one y value.

### Visualizing Untrained Predictions

Our predictions are still numbers on a page. Let's visualize them. We'll see green dots, which are our model's predictions before training. There may be gaps between the green dots and the red dots, which are the actual values.

In [None]:
plot_predictions(predictions=y_preds)

### How Far Off Are We?

Let's calculate the difference between predictions and actual values.

In [None]:
y_test - y_preds

**Those predictions look pretty bad!**

This makes sense though — our model is just using **random parameter values** to make predictions. It hasn't even looked at the blue dots to try to predict the red dots.
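
One common way to summarize the per-sample differences as a single number is the mean absolute error (MAE). The sketch below uses made-up tensors `y_true` and `y_pred` as stand-ins for `y_test` and `y_preds`; the values are hypothetical and not taken from the lab's output.

```python
import torch

# Hypothetical stand-ins for y_test and y_preds
y_true = torch.tensor([[0.42], [0.44], [0.46]])
y_pred = torch.tensor([[0.40], [0.47], [0.41]])

errors = y_true - y_pred        # signed difference for each sample
mae = errors.abs().mean()       # average size of the error, ignoring sign

print(errors)
print(f"MAE: {mae.item():.4f}")
```

Taking the absolute value first keeps positive and negative errors from cancelling each other out in the average.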

Time to change that! In our next lab, we'll train the model to learn the correct parameters.

## Summary

In this lab, we:

1. **Created synthetic data** with known parameters (weight=0.4, bias=0.1)
2. **Split the data** into training (40 samples) and test sets (10 samples)
3. **Visualized the data** as a straight line
4. **Built a LinearRegressionModel** with random parameters
5. **Made predictions** with the untrained model (they were bad!)