**Technical Note for the Student:**

* **Kernel:** Select the **`Data Science 3.0`** (or `conda_python3`) kernel when opening this notebook in JupyterLab.
* **Permissions:** Ensure your SageMaker Execution Role has `AmazonSageMakerFullAccess` and `AmazonS3FullAccess` (or specific bucket access).

---

# Lab: California Housing Prediction (SageMaker Edition)

**Role:** SageMaker Data Scientist
**Objective:** Build an end-to-end ML pipeline to predict median house values. You will wrangle data locally in the notebook, train a model at scale using **AWS Linear Learner** on a separate training cluster, and deploy a real-time inference endpoint.

---

## 1. Environment Setup

In SageMaker JupyterLab, the environment is pre-configured with the AWS SDK (`boto3`) and the high-level `sagemaker` library. We initialize our session and define where our data will live in S3.

In [None]:
import sagemaker
import boto3
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Initialize SageMaker Session
session = sagemaker.Session()
region = session.boto_region_name

# 2. Get Execution Role (The permissions this notebook has)
# In SageMaker Studio/JupyterLab, this automatically grabs the role attached to your user profile.
role = sagemaker.get_execution_role()

# 3. Define S3 Bucket and Prefix
# We use the default bucket created by SageMaker for you.
bucket = session.default_bucket()
prefix = 'labs/california-housing'

print(f"Region: {region}")
print(f"S3 Bucket: {bucket}")
print(f"IAM Role: {role.split('/')[-1]}") # Printing just the role name for readability

---

## 2. Data Ingestion & Exploration (EDA)

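We will use `sklearn` to fetch the classic California Housing dataset, which is derived from the 1990 U.S. census. It has 20,640 rows (one per census block group), and the target `MedHouseVal` is expressed in units of $100,000.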

In [None]:
from sklearn.datasets import fetch_california_housing

# Download dataset
data = fetch_california_housing(as_frame=True)
df = data.frame

# Quick look at the data
print(f"Dataset Dimensions: {df.shape}")
df.head()

### 2.1 Visualizing the Geography

Since this is geospatial data, plotting the longitude and latitude reveals the map of California.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=df,
    x="Longitude",
    y="Latitude",
    size="Population",
    hue="MedHouseVal",
    palette="viridis",
    alpha=0.5,
    sizes=(10, 200) # Control the size of the dots
)
plt.title("California Housing: Price & Population Density")
plt.legend(title="Median House Value", loc="upper right", bbox_to_anchor=(1.2, 1))
plt.show()

### 2.2 Correlation Analysis

We need to see which features are most strongly correlated with housing prices.

* **Note:** `MedInc` (Median Income) usually has the strongest correlation with `MedHouseVal`.

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Feature Correlation Matrix")
plt.show()
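For intuition about what each cell of the heatmap holds: `df.corr()` computes the Pearson correlation coefficient by default. The sketch below computes that coefficient by hand for two short lists. The `income` and `value` numbers are hypothetical and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values (not real rows from the dataset)
income = [1.5, 2.0, 3.5, 5.0, 8.0]
value = [0.9, 1.4, 2.1, 2.6, 4.8]

r = pearson(income, value)
print(f"Pearson r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship. It says nothing about causation.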

---

## 3. Data Wrangling & Feature Engineering

SageMaker's Linear Learner expects specific data formats. We will perform the following processing:

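1. **Feature Engineering:** Create two ratio features: `RoomsPerHousehold` (`AveRooms / AveOccup`) and `BedroomsPerRoom` (`AveBedrms / AveRooms`). Because `AveRooms` is already a per-household average, `RoomsPerHousehold` as computed here is effectively rooms per occupant.
2. **Imputation:** The sklearn version of this dataset is pre-cleaned, so nothing is needed here; in real scenarios, you would handle `NaN` values at this step.
3. **Splitting:** 70% Train, 15% Validation, 15% Test.
4. **Scaling:** Standardize the features (mean 0, variance 1).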
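The 70/15/15 split in step 3 is done as two successive splits: 30% is held out first, then that held-out part is halved. The sketch below works through the arithmetic for the 20,640 rows of this dataset. The `split_sizes` helper is hypothetical, and its rounding rule (test share rounded up) is an assumption meant to mirror how `train_test_split` sizes its outputs.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 20640  # rows in the California Housing dataset

# First split: 70% train vs 30% temporary hold-out
n_train, n_temp = split_sizes(n_total, 0.3)

# Second split: the hold-out is halved into validation and test
n_val, n_test = split_sizes(n_temp, 0.5)

print(f"train={n_train}, validation={n_val}, test={n_test}")
print(f"fractions: {n_train / n_total:.2f}, {n_val / n_total:.2f}, {n_test / n_total:.2f}")
```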

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Feature Engineering
df['RoomsPerHousehold'] = df['AveRooms'] / df['AveOccup']
df['BedroomsPerRoom'] = df['AveBedrms'] / df['AveRooms']

# 2. Prepare X (Features) and y (Target)
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

# 3. Train / Validation / Test Split
# First split: Train (70%) vs Temp (30%)
X_train_raw, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Second split: Temp into Validation (15%) and Test (15%)
X_val_raw, X_test_raw, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# 4. Standardization (Scaling)
# Fit the scaler on the training data only, then reuse it for validation and test, to prevent data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_val = scaler.transform(X_val_raw)
X_test = scaler.transform(X_test_raw)

print(f"Training Data:   {X_train.shape}")
print(f"Validation Data: {X_val.shape}")
print(f"Test Data:       {X_test.shape}")
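To make the leakage point concrete, here is a minimal sketch of standardization on a single feature, using only the standard library. The numbers are hypothetical. The mean and standard deviation come from the training values alone and are then reused for the validation values, which is the same pattern as `fit_transform` followed by `transform`.

```python
import statistics

# Hypothetical values for one feature
train = [2.0, 4.0, 6.0, 8.0]
val = [5.0, 10.0]

# Statistics are estimated from the training split only
mu = statistics.fmean(train)
sigma = statistics.pstdev(train)  # population std (ddof=0), as StandardScaler uses

def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train, mu, sigma)
val_scaled = standardize(val, mu, sigma)

print("train:", [round(v, 3) for v in train_scaled])
print("val:  ", [round(v, 3) for v in val_scaled])
```

The scaled training values have mean 0 and variance 1 by construction. The validation values generally do not, and that is expected: they are measured against the training distribution, exactly as new data will be at inference time.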

### 3.1 Uploading Data to S3

When training with SageMaker, the compute instance (the "training job") spins up in the background and needs to download data from S3.

**Format Requirement:** For CSV input to Linear Learner, the **first column must be the target variable**, and there should be no headers.

In [None]:
import io

def upload_to_s3(X, y, channel_name):
    # Stack target (y) as the first column, followed by features (X)
    data = np.column_stack((y, X))

    # Write to a CSV buffer (in-memory)
    csv_buffer = io.BytesIO()
    np.savetxt(csv_buffer, data, delimiter=',', fmt='%g')

    # Construct S3 Key
    s3_key = f"{prefix}/{channel_name}/data.csv"

    # Upload
    boto3.resource('s3').Bucket(bucket).put_object(Key=s3_key, Body=csv_buffer.getvalue())
    s3_uri = f"s3://{bucket}/{s3_key}"
    print(f"Uploaded {channel_name} data to: {s3_uri}")
    return s3_uri

# Upload Train and Validation sets
s3_train_uri = upload_to_s3(X_train, y_train, 'train')
s3_val_uri = upload_to_s3(X_val, y_val, 'validation')
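As a small illustration of the layout described above, the sketch below builds a header-less CSV with the target in the first column and checks that every line has `feature_dim + 1` fields. The `to_headerless_csv` helper and the numbers are hypothetical; the lab itself writes the file with `np.savetxt`.

```python
import csv
import io

def to_headerless_csv(features, targets):
    """Serialize rows as 'target,feature_1,...,feature_n' with no header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row, target in zip(features, targets):
        writer.writerow([target, *row])
    return buffer.getvalue()

# Hypothetical values: two rows with three features each
features = [[0.5, -1.2, 0.3], [1.1, 0.0, -0.7]]
targets = [4.526, 3.585]

payload = to_headerless_csv(features, targets)
print(payload)

# Sanity check: each line holds the target plus feature_dim features
feature_dim = len(features[0])
for line in payload.splitlines():
    assert len(line.split(",")) == feature_dim + 1
```

The same count is what the `feature_dim` hyperparameter must match later: the number of columns in the file minus the target column.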

---

## 4. Training: AWS Linear Learner

We will now launch a **Training Job**. This happens on a separate EC2 instance managed by SageMaker, not in this notebook.

### 4.1 Define the Estimator

We use `sagemaker.image_uris.retrieve` to look up the Docker container image for the Linear Learner algorithm in our region.

In [None]:
from sagemaker.image_uris import retrieve

# 1. Retrieve the container image
container = retrieve('linear-learner', region)

# 2. Define the Estimator (The configuration for the training job)
ll_estimator = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large', # Balanced general purpose instance
    output_path=f"s3://{bucket}/{prefix}/output",
    sagemaker_session=session
)

# 3. Set Hyperparameters
ll_estimator.set_hyperparameters(
    feature_dim=X_train.shape[1], # Must match number of columns in X
    predictor_type='regressor',   # Regression problem
    mini_batch_size=100,
    epochs=15,
    normalize_data=False,         # We already scaled it
    loss='squared_loss'           # Optimizing for MSE
)

### 4.2 Execute Training

This block will output logs from the remote instance. Wait for it to complete (approx 3-4 minutes).

In [None]:
# Define the data channels
train_input = sagemaker.inputs.TrainingInput(s3_train_uri, content_type='text/csv')
val_input = sagemaker.inputs.TrainingInput(s3_val_uri, content_type='text/csv')

# Start the job
ll_estimator.fit({'train': train_input, 'validation': val_input})

---

## 5. Deployment & Inference

Once training is successful, we deploy the model to an **Endpoint**. This creates a persistent REST API that we can query.

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Deploy (approx 3-5 mins)
predictor = ll_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium', # Cost-effective for lab inference
    serializer=CSVSerializer(),   # Converts list -> CSV for the endpoint
    deserializer=JSONDeserializer() # Parses JSON response from endpoint
)

print(f"Endpoint deployed: {predictor.endpoint_name}")

### 5.1 Testing the Prediction

Let's send a single record from our test set to the endpoint to see the result.

In [None]:
# Take one sample
sample_input = X_test[0]
actual_value = y_test.iloc[0]

# Query the endpoint
response = predictor.predict(sample_input)

# Extract prediction
predicted_value = response['predictions'][0]['score']

# MedHouseVal is expressed in units of $100,000, so multiply to get dollars
print(f"Actual Value:    ${actual_value * 100000:,.2f}")
print(f"Predicted Value: ${predicted_value * 100000:,.2f}")
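The cell above indexes into the response as `response['predictions'][0]['score']`. The sketch below shows that structure on a hand-written JSON body so it can be explored without a live endpoint. The body and the score of 2.31 are hypothetical, and a real response may carry additional fields.

```python
import json

# Hypothetical response body; 2.31 is a made-up score, not a real prediction
raw_body = '{"predictions": [{"score": 2.31}]}'

response = json.loads(raw_body)

# One entry per input row, each holding the predicted value under "score"
scores = [p["score"] for p in response["predictions"]]

# The target is in units of $100,000, so convert to dollars for display
dollars = [s * 100_000 for s in scores]
print(f"Predicted Value: ${dollars[0]:,.2f}")
```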

---

## 6. Benchmarking: SageMaker vs. Scikit-Learn

To understand if our model is "good," we compare it to a reference implementation running locally in the notebook.

### 6.1 Scikit-Learn Reference (Local)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Train Local Model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict Local
y_pred_sklearn = lr.predict(X_test)

# Calculate RMSE
rmse_sklearn = np.sqrt(mean_squared_error(y_test, y_pred_sklearn))
print(f"Scikit-Learn RMSE: {rmse_sklearn:.4f}")
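For reference, RMSE is the square root of the mean of the squared prediction errors. The sketch below computes it by hand on four hypothetical target/prediction pairs, which is the same quantity as `np.sqrt(mean_squared_error(...))` above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical targets and predictions, in units of $100,000
y_true = [2.0, 3.0, 1.5, 4.0]
y_pred = [2.5, 2.5, 1.5, 5.0]

error = rmse(y_true, y_pred)
print(f"RMSE: {error:.4f} (about ${error * 100_000:,.0f})")
```

Because the target is in units of $100,000, an RMSE of about 0.61 in this toy case corresponds to a typical error of roughly $61,000.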

### 6.2 SageMaker Evaluation (Remote)

We batch predict the entire test set using the endpoint.
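Before running the helper, it may help to see how many requests it will send. The sketch below reproduces the chunk-count expression used in `predict_batch` and the way `np.array_split` distributes rows across chunks. The function names are hypothetical, and `n_chunks_ceil` is an alternative shown only for comparison.

```python
import math

def chunk_sizes(n_rows, n_chunks):
    """Sizes np.array_split produces: the first n_rows % n_chunks chunks get one extra row."""
    base, extra = divmod(n_rows, n_chunks)
    return [base + 1] * extra + [base] * (n_chunks - extra)

def n_chunks_lab(n_rows, rows=500):
    # Same expression as in predict_batch
    return int(n_rows / float(rows) + 1)

def n_chunks_ceil(n_rows, rows=500):
    # Alternative: avoids an extra chunk when n_rows is an exact multiple of rows
    return max(1, math.ceil(n_rows / rows))

n_test = 3096  # test-set size from the 70/15/15 split of 20,640 rows
print(n_chunks_lab(n_test), chunk_sizes(n_test, n_chunks_lab(n_test)))

# With an exact multiple, the lab expression creates one more (smaller) chunk than needed
print(n_chunks_lab(1000), n_chunks_ceil(1000))
```

For the 3,096-row test set this gives 7 requests of 442 or 443 rows each. The extra chunk in the exact-multiple case is harmless; it only means slightly smaller payloads.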

In [None]:
def predict_batch(data, predictor, rows=500):
    # Split data into chunks to respect payload limits
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = []
    for array in split_array:
        result = predictor.predict(array)
        predictions += [r['score'] for r in result['predictions']]
    return np.array(predictions)

# Get all predictions
y_pred_aws = predict_batch(X_test, predictor)

# Calculate RMSE
rmse_aws = np.sqrt(mean_squared_error(y_test, y_pred_aws))

print("------ RESULTS ------")
print(f"AWS Linear Learner RMSE: {rmse_aws:.4f}")
print(f"Scikit-Learn RMSE:       {rmse_sklearn:.4f}")
print("---------------------")

*Note: The results should be very similar as both use linear regression techniques. Slight differences arise from optimization solvers (SGD vs OLS) and regularization defaults.*
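To illustrate why an iterative solver and a closed-form solver can disagree slightly, the sketch below fits a one-feature linear model both ways on hypothetical data. The gradient descent here is a plain full-batch version used as a stand-in; it is not Linear Learner's actual optimizer, and no regularization is applied.

```python
# Hypothetical 1-D data, roughly following y = 2x + 1
xs = [-1.5, -0.5, 0.0, 0.5, 1.5]
ys = [-2.1, 0.1, 1.0, 1.9, 4.1]
n = len(xs)

# Closed-form ordinary least squares for a single feature
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope_ols = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
intercept_ols = mean_y - slope_ols * mean_x

def gradient_descent(steps, learning_rate=0.1):
    """Minimize mean squared error with full-batch gradient descent."""
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_intercept = 2 * sum(residuals) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

slope_15, intercept_15 = gradient_descent(15)
slope_500, intercept_500 = gradient_descent(500)

print(f"OLS:             slope={slope_ols:.4f}, intercept={intercept_ols:.4f}")
print(f"GD (15 steps):   slope={slope_15:.4f}, intercept={intercept_15:.4f}")
print(f"GD (500 steps):  slope={slope_500:.4f}, intercept={intercept_500:.4f}")
```

After a handful of passes the iterative estimate is close to, but not exactly at, the least-squares solution; given enough steps it converges to it. The same effect, along with any regularization, explains small RMSE gaps between the two models.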

---

## 7. Cleanup

**⚠️ IMPORTANT:** Delete the endpoint to stop billing.

In [None]:
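# Optional: also delete the model for a clean slate. Run this BEFORE deleting the
# endpoint, since the predictor may need the endpoint configuration to look up the model name.
# predictor.delete_model()

# Delete the endpoint (by default this also deletes its endpoint configuration)
predictor.delete_endpoint()

print("Endpoint deleted. Lab complete.")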

**SageMaker Canvas** is a "no-code" machine learning interface that allows you to build the same kind of regression model (predicting housing prices) without writing a single line of Python. It uses AutoML technology (Amazon SageMaker Autopilot) under the hood.

The following **No-Code Lab** walks through the same California Housing task in Canvas.

---

# 🎨 Lab: California Housing Prediction (No-Code Edition)

**Role:** Business Analyst / Citizen Data Scientist
**Tool:** Amazon SageMaker Canvas
**Objective:** Build a regression model to predict `MedHouseVal` using a visual drag-and-drop interface.

---

## 1. Preparation

Before starting, ensure you have the dataset file on your local computer.

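* **Download:** If you don't have it, you can download `housing.csv` from the [standard repository](https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv) or export it from your previous notebook using `df.to_csv('housing.csv', index=False)`. The two files are not identical: the downloaded file uses lowercase column names (for example `median_house_value` and `median_income`), reports house values in dollars, and includes an `ocean_proximity` column, while the notebook export uses the sklearn names (`MedHouseVal`, `MedInc`) with values in units of $100,000. The steps below use the sklearn names.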

## 2. Launch SageMaker Canvas

1. Open the **AWS Console** and navigate to **Amazon SageMaker**.
2. In the left sidebar, select **Canvas**.
3. Select your User Profile and click **Open Canvas**. (This may take 1-2 minutes to initialize).

---

## 3. Step-by-Step Walkthrough

### Step 1: Import Data

We need to load the California Housing CSV into Canvas.

1. On the left sidebar, click the **Datasets** icon (database symbol).
2. Click **Import** (top right).
3. Choose **Upload** (from your computer) and select your `housing.csv` file.
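4. Click **Preview** to verify the columns look correct (`Longitude`, `Latitude`, `MedHouseVal`, etc. for the notebook export; `longitude`, `latitude`, `median_house_value`, etc. for the downloaded file).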
5. Click **Import Data**.

### Step 2: Build the Model

Now we define what we want to predict.

1. Go to the **My models** tab (model symbol) on the left sidebar.
2. Click **New model** and name it `CaliforniaHousing-Regression`.
3. Select the **Predictive Analysis** radio button and click **Create**.
4. **Select Dataset:** Choose the `housing.csv` you just imported and click **Select dataset**.

### Step 3: Configure Target & Model Type

1. **Select Target Column:** In the "Target column" dropdown, select **`MedHouseVal`** (Median House Value), or `median_house_value` if you are using the downloaded file.
2. **Verify Model Type:** Canvas will automatically detect this is a **Numeric Prediction** (Regression) problem.
3. **Data Preview:** You will see a distribution histogram of house prices.
4. **Build Options:** You have two choices:
   * **Quick Build:** Takes 2-15 minutes. Good for rapid prototyping. (Recommended for this lab).
   * **Standard Build:** Takes 1-2 hours. Provides more detailed analysis and higher accuracy.
5. Click **Quick Build**.

### Step 4: Analyze Results

Once the build is complete (approx. 10 mins), Canvas presents a visual dashboard.

1. **Overview Tab:** Look at the **RMSE** (Root Mean Squared Error).
   * *Compare this number to the Python notebook result. It usually achieves similar performance. Make sure both are in the same units: the notebook's RMSE is in units of $100,000, whereas the downloaded `housing.csv` reports values in dollars.*
2. **Column Impact:** This is the equivalent of "Feature Importance."
   * You will likely see **`MedInc`** (Median Income) having the highest percentage impact, consistent with our Python analysis.
   * You can click on a column (e.g., `Latitude`) to see how changing its value impacts the predicted price (Partial Dependence Plots).

### Step 5: Generate Predictions

1. Click the **Predict** button at the bottom of the analysis page.
2. **Single Prediction (What-if analysis):**
   * Manually adjust sliders for `MedInc` or `AveRooms` to see how the predicted House Value changes in real-time.
3. **Batch Prediction:**
   * If you have a separate test file (without targets), you can upload it here. Canvas will generate a CSV with predictions for every row.

---

## Summary: Code vs. Canvas

| Feature | SageMaker Notebook (Python) | SageMaker Canvas (No-Code) |
| --- | --- | --- |
| **User** | Data Scientist / ML Engineer | Business Analyst / Domain Expert |
| **Flexibility** | High (Custom feature engineering) | Moderate (Auto-inferred) |
| **Speed** | Slow (requires coding/debugging) | Fast (Click & Go) |
| **Under the hood** | You choose (XGBoost, Linear Learner) | AutoML (Ensemble of models) |

**Conclusion:** Canvas is excellent for establishing a "baseline" performance or allowing non-coders to validate hypotheses before a data scientist builds a production pipeline.

---

### Video Tutorial

For a visual guide on building a regression model in Canvas, watch this video:
[Build a Regression Model in 11 minutes with Amazon Sagemaker Canvas](https://www.youtube.com/watch?v=o_vPaVQ8D1o)

This video demonstrates a similar workflow using a different dataset, but the steps for selecting the target column and interpreting the column impact are identical to what you will do with the housing data.