# Task 2: Implementing a Simple Linear Regression Model

Purpose: Build a basic linear regression model that predicts a numeric target variable from a single numeric predictor.

Dataset Selection:
I will use the Boston Housing dataset, which contains various features related to housing in the Boston area. The goal is to predict the median value of owner-occupied homes (MEDV, in $1000s) from the average number of rooms per dwelling (RM).

Data Preparation:
We start by loading the dataset into a DataFrame and selecting the predictor (RM) and the target (MEDV).

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

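# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston

# Load the features into a DataFrame and attach the target as a column
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

# Predictor as a 2-D DataFrame (one column), target as a 1-D Series
X = df[['RM']]
y = df['MEDV']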
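
The double brackets in `df[['RM']]` matter because scikit-learn estimators expect a two-dimensional feature matrix. A small sketch, using a made-up three-row DataFrame (`toy`, not the real dataset), shows the difference in shape:

```python
import pandas as pd

# Hypothetical toy values standing in for the real dataset
toy = pd.DataFrame({"RM": [5.9, 6.4, 7.1], "MEDV": [19.0, 23.5, 33.2]})

X_2d = toy[["RM"]]  # double brackets: DataFrame with one column
x_1d = toy["RM"]    # single brackets: Series

print(X_2d.shape)  # two-dimensional: (rows, 1)
print(x_1d.shape)  # one-dimensional: (rows,)
```

Passing the one-dimensional Series as `X` to `fit` would raise an error, which is why the predictor is selected with a list of column names.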

# Data Splitting:
We split the dataset into a training set (80%) and a test set (20%) using the `train_test_split` function from scikit-learn. Fixing `random_state` makes the split reproducible.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
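
To see what `test_size=0.2` and `random_state=42` do in practice, the sketch below splits placeholder arrays with 506 rows, which is the size of the Boston Housing dataset. The arrays `X_demo` and `y_demo` are stand-ins for illustration, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 506  # number of rows in the Boston Housing dataset
X_demo = np.arange(n_samples).reshape(-1, 1)  # placeholder feature column
y_demo = np.arange(n_samples)                 # placeholder target

# Split twice with the same seed to check reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(len(X_tr), len(X_te))         # train and test sizes
print(np.array_equal(X_te, X_te2))  # same seed, same test rows
```

Changing or omitting `random_state` would shuffle the rows differently on each run, so the reported metrics would vary slightly between runs.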

# Model Creation:
Next, we create a linear regression model using scikit-learn's `LinearRegression` class.


In [None]:
model = LinearRegression()

# Model Training:
We will train the linear regression model using the training data.


In [None]:
model.fit(X_train, y_train)
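
After fitting, the estimated slope and intercept are available as the `coef_` and `intercept_` attributes. The sketch below fits synthetic points that lie exactly on the line y = 3x + 2 (an assumption chosen so the result is easy to check) and reads the parameters back.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data on an exact line y = 3x + 2 (illustrative, not the housing data)
x = np.array([[4.0], [5.0], [6.0], [7.0], [8.0]])
y_line = 3.0 * x.ravel() + 2.0

demo_model = LinearRegression().fit(x, y_line)

slope = demo_model.coef_[0]         # one coefficient per feature
intercept = demo_model.intercept_   # scalar for a single target
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

For the model in this notebook, `model.coef_[0]` would be the estimated change in MEDV (in $1000s) for each additional room.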

# Prediction:
Now, we will use the trained model to make predictions on the testing data.


In [None]:
y_pred = model.predict(X_test)
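
For a single predictor, `predict` simply evaluates intercept + slope * x for every row of its input. A minimal sketch on synthetic data (the values below are made up for illustration) compares `predict` with that manual calculation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data lying on y = 9x - 26 (illustrative only)
x_fit = np.array([[4.0], [5.0], [6.0], [7.0]])
y_fit = np.array([10.0, 19.0, 28.0, 37.0])
demo_model = LinearRegression().fit(x_fit, y_fit)

# New inputs must be 2-D: one row per sample, one column per feature
x_new = np.array([[5.5], [6.5]])
pred = demo_model.predict(x_new)

# The same predictions computed by hand from the fitted parameters
manual = demo_model.intercept_ + demo_model.coef_[0] * x_new.ravel()
print(pred)
print(manual)
```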

# Model Evaluation:
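We evaluate the model on the test set with two metrics: the mean squared error (MSE), which is the average squared difference between actual and predicted values, and the R-squared (R2) score, which is the proportion of variance in the target that the model explains.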


In [None]:
mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R2): {r2:.2f}")
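
Both metrics can be reproduced by hand, which helps when interpreting them. The sketch below uses four made-up actual and predicted values (`y_true` and `y_hat` are assumptions for illustration) and checks the manual formulas against scikit-learn. It also computes the root mean squared error (RMSE), which is in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values (illustrative only)
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_hat = np.array([22.0, 24.0, 29.0, 37.0])

residuals = y_true - y_hat

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# RMSE: square root of MSE, in the units of the target
rmse = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
print(rmse)
```

Since MEDV is measured in $1000s, the RMSE of the notebook's model can be read as a typical prediction error in thousands of dollars.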

# Visualization (Optional):
We can plot the test points together with the fitted regression line to visualize the model's fit.


In [None]:
plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.xlabel("Average Number of Rooms (RM)")
plt.ylabel("Median Value of Homes (MEDV)")
plt.title("Linear Regression: RM vs. MEDV")
plt.show()

# Documentation:
- Dataset Used: Boston Housing dataset (numeric predictor: RM; target variable: MEDV).
- Steps Taken: Data preparation, data splitting, model creation (linear regression), model training, prediction, model evaluation (MSE and R2), and optional visualization.
- Model's Performance: The model achieved an MSE of [MSE Value] and an R-squared of [R2 Value], indicating [description of model performance].