# Linear Regression Example with the Diabetes Dataset

## Introduction

In this activity, we will perform a linear regression analysis using the Diabetes dataset. The objective is to predict the progression of diabetes (measured by a target variable) based on various features. This exercise will help you understand the process of building and evaluating a linear regression model.

## The Dataset

The Diabetes dataset contains the following features:
- `age`: Age of the patient
- `sex`: Sex of the patient
- `bmi`: Body Mass Index
- `bp`: Average blood pressure
- `s1` to `s6`: Six blood serum measurements

The target variable is:
- `target`: A quantitative measure of disease progression one year after baseline

**Note:** Each of these 10 feature variables has been mean-centered and scaled by the standard deviation times the square root of `n_samples`, so the sum of squares of each column equals 1.
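
The scaling described in the note can be checked directly. The sketch below is not part of the exercise: it assumes the data used in this activity match the copy bundled with scikit-learn, which it loads through `sklearn.datasets.load_diabetes` (a loader this notebook does not otherwise import). If your file differs, the numbers may differ too.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Assumption: scikit-learn's bundled copy matches the data used in this activity.
features = load_diabetes(as_frame=True).data

# Mean-centered columns should average to (approximately) zero.
column_means = features.mean(axis=0)

# After scaling, the squared values in each column should sum to about 1.
column_sum_squares = (features ** 2).sum(axis=0)

print(column_means.round(6))
print(column_sum_squares.round(6))
```

Because every column is on the same scale, the raw values no longer carry their original units (for example, `age` is not in years).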

## Objective

You will:
1. Load and explore the dataset.
2. Split the data into training and testing sets.
3. Train a linear regression model.
4. Evaluate the model's performance.
5. Visualize the results.

Let's get started!

## Import necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Load and explore the dataset

In [None]:
# Load the Diabetes dataset
data = pd.---(---)
# Display the first few rows of the dataset
---

## Splitting the dataset

In [None]:
# TODO: Separate the features (X) from the target (y)
X = data.drop(---, axis=--)
y = data[---]

In [None]:
# Use train_test_split to create training and testing sets
# Use a test size of 20% and a fixed random state for reproducibility
# HINT: Use the function `train_test_split` from sklearn.model_selection
X_train, X_test, y_train, y_test = ---(--, --, test_size=--, random_state=--)

## Training the model

In [None]:
# TODO: Create an instance of LinearRegression and fit it to the training data
# HINT: Use the class `LinearRegression` from sklearn.linear_model
model = ---()

# Fit the model to the training data
model.---(---, ---)

## Making predictions

In [None]:
# TODO: Use the model to make predictions on the testing data
# HINT: Use the `predict` method of the model
y_pred = model.---(---)

## Evaluating the model

In [None]:
# TODO: Calculate the Mean Squared Error (MSE) and R² score for the model
# HINT: Use the functions `mean_squared_error` and `r2_score` from sklearn.metrics
mse = ---(---, ---)
r2 = ---(---, ---)

print(f'Mean Squared Error: {mse}')
print(f'R² Score: {r2}')
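
To make the two metrics concrete, here is a minimal sketch that computes them by hand with NumPy instead of the scikit-learn helpers. The arrays `y_true_toy` and `y_pred_toy` are made-up values for illustration only; they are not taken from the Diabetes dataset and do not replace the exercise above.

```python
import numpy as np

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow.
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.5, 5.5, 7.0, 10.0])

# Residuals: how far each prediction is from the actual value.
residuals = y_true_toy - y_pred_toy

# MSE is the average of the squared residuals.
mse_toy = np.mean(residuals ** 2)

# R² compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

print(f"MSE: {mse_toy:.3f}")
print(f"R²: {r2_toy:.3f}")
```

MSE is expressed in squared units of the target, so it is only meaningful relative to the target's scale. R² is unitless: 1 means perfect predictions, 0 means no better than predicting the mean, and it can be negative on test data.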

## Visualize the results

In [None]:
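# TODO: Create a scatter plot of the actual vs. predicted values
# HINT: Match the argument order to the axis labels below
plt.scatter(---, ---)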
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.show()

## Discussion

1. What does the Mean Squared Error (MSE) tell us about the model's performance?
2. What is the significance of the R² score, and what does it indicate about our model?
3. How do the actual values compare to the predicted values in the scatter plot?
4. Are there any patterns or trends you observe in the scatter plot that could indicate potential issues with the model?
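
As one illustration of the kind of pattern question 4 refers to, the sketch below fits a straight line to deliberately curved synthetic data and inspects the residuals. Everything here is hypothetical: the data are generated, not drawn from the Diabetes dataset, and `np.polyfit` stands in for the model only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a curved (quadratic) relationship plus a little noise.
x = np.linspace(-3, 3, 61)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)

# Fit a straight line, which cannot capture the curvature.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted

# With a well-specified model, residuals look like noise around zero.
# Here they are systematically positive at the edges and negative in the middle.
edge_mean = residuals[np.abs(x) > 2].mean()
middle_mean = residuals[np.abs(x) < 1].mean()

print(f"Mean residual at the edges: {edge_mean:.2f}")
print(f"Mean residual in the middle: {middle_mean:.2f}")
```

Systematic structure like this suggests the linear form is missing something, whereas residuals scattered evenly around zero are what a reasonable fit tends to produce.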