# Linear Regression Model to Predict Estate Prices from Surface Area

## Introduction
This notebook walks through the process of building a linear regression model, making predictions with it, and evaluating it. We use simple linear regression, which fits a straight line to a single feature, so the data and the fitted line can be plotted in 2D. The model predicts the **Price** of an estate (the target) from the **Surface** of that estate (the feature).
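
As a reference point, fitting a line to one feature can be written as the least-squares problem below. The symbols $x_i$ (surface), $y_i$ (price), $a$ (slope), and $b$ (intercept) are notation chosen here to match the `Y = aX + b` form printed on the final plot. Ordinary least squares is the criterion that scikit-learn's `LinearRegression` minimizes.

```latex
\hat{y} = a x + b, \qquad
(a, b) = \arg\min_{a,\, b} \sum_{i=1}^{n} \bigl( y_i - (a x_i + b) \bigr)^2
```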

## Installing and Importing the Libraries needed
We will be using:

- **Pandas**: to create DataFrames and to read, inspect, and modify the data.
- **Matplotlib**: a library for visualizing data by plotting graphs.
- **Scikit-learn**: the library that provides the linear regression model, the train/test split helper, and the evaluation metrics.
- **NumPy**: used to generate evenly spaced values when drawing the regression line.
- **Gdown**: the tool used to download the data from Google Drive.

In [None]:
!pip install pandas matplotlib scikit-learn gdown

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

## Importing and Cleaning the Data
First, we must import the data: a housing dataset that contains the **Surface** of each estate and its corresponding **Price** (stored in a column named `Prix`).

In [None]:
!gdown --id 1zdrro3Xdt4EqfMBLSar2KYa0vW1e9mQf -O Housing_Data.csv

After downloading the data, we load the CSV file into a DataFrame with Pandas.

In [None]:
df = pd.read_csv("Housing_Data.csv")

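As part of cleaning the data, we remove the column named **Unnamed: 0** if it exists. Such a column is typically a leftover index written when the CSV was saved. The `errors='ignore'` argument keeps the call from failing when the column is absent.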

In [None]:
df = df.drop(columns=["Unnamed: 0"], errors='ignore')
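
To illustrate where an `Unnamed: 0` column usually comes from, here is a minimal sketch on a hypothetical two-row table (the values are made up and are not taken from `Housing_Data.csv`). It assumes the column was produced by saving a DataFrame with its index, which may or may not be the case for this dataset.

```python
import io
import pandas as pd

# Hypothetical two-row table, used only for illustration
sample = pd.DataFrame({"Surface": [50, 80], "Prix": [100000, 160000]})

# With the default index=True, to_csv writes the index as a first column with no header
csv_text = sample.to_csv()

# Reading the text back gives that header-less column the name "Unnamed: 0"
reloaded = pd.read_csv(io.StringIO(csv_text))
print(list(reloaded.columns))

# Alternative to dropping it afterwards: declare the first column as the index
reloaded_idx = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(reloaded_idx.columns))
```

Dropping the column after loading, as done above, and reading with `index_col=0` lead to the same two data columns.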

Next, we check for missing values by printing the number of null entries in each column.

In [None]:
print("Missing values:")
print(df.isnull().sum())
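
If the check reports missing values, one simple option is to drop the affected rows before training. The sketch below shows this on a hypothetical three-row table. Whether dropping or filling is appropriate depends on the dataset, so treat it as one possible approach rather than the required step.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing Surface and one missing Prix
sample = pd.DataFrame({
    "Surface": [50.0, np.nan, 80.0],
    "Prix": [100000.0, 120000.0, np.nan],
})

# Keep only the rows where both the feature and the target are present
cleaned = sample.dropna(subset=["Surface", "Prix"]).reset_index(drop=True)

print(cleaned)
print(cleaned.isnull().sum())
```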

## Data Visualization
Here we draw a scatter plot to visualize how the price varies with the surface.

In [None]:
plt.figure(figsize=(8, 5))
plt.scatter(df["Surface"], df["Prix"], color="blue", alpha=0.5)
plt.xlabel("Surface (square meters)")
plt.ylabel("Prix ($)")
plt.title("Housing Prices vs Surface Area")
plt.grid(True)
plt.show()

## Data Preparation
We must prepare the **Feature** and the **Target** data.

In [None]:
# Define the feature (X) and the target (y)
X = df[["Surface"]]  # Double brackets keep X two-dimensional, as scikit-learn expects for features
y = df["Prix"]       # Target variable

Now we split the data: 80% for training and 20% for testing the model. Setting `random_state` makes the split reproducible.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
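
To make the 80/20 proportion concrete, this small sketch runs the same call on a hypothetical 10-row table and prints the resulting shapes. The `*_demo` and `*_tr`/`*_te` names are illustrative and chosen so they do not overwrite the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row dataset with a perfectly linear price
demo = pd.DataFrame({"Surface": np.arange(10, 110, 10)})
demo["Prix"] = demo["Surface"] * 2000

X_demo = demo[["Surface"]]
y_demo = demo["Prix"]

# Same parameters as in the notebook: 20% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```

With 10 rows, 8 end up in the training set and 2 in the test set, and no row appears in both.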

## Creating and Training the Linear Regression Model with scikit-learn
We create the model and train it on the training set, which finds the best-fitting line.

In [None]:
model = LinearRegression() #creates the model
model.fit(X_train, y_train) #training
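
To show what "finding the line" means for a single feature, the sketch below computes the least-squares slope and intercept by hand and compares them with `LinearRegression`. The five data points are hypothetical, and the `*_demo` and `*_manual` names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical surfaces (square meters) and prices (in thousands)
x_demo = np.array([50.0, 60.0, 80.0, 100.0, 120.0])
y_demo = np.array([110.0, 125.0, 170.0, 195.0, 250.0])

# Closed-form least-squares solution for one feature:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
x_mean, y_mean = x_demo.mean(), y_demo.mean()
a_manual = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
b_manual = y_mean - a_manual * x_mean

# The same fit with scikit-learn (features must be 2D, hence the reshape)
demo_model = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)

print(a_manual, demo_model.coef_[0])
print(b_manual, demo_model.intercept_)
```

Both approaches should agree up to floating-point precision, since they minimize the same sum of squared errors.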

## Predicting with the Trained Model
We feed the model the testing features (the 20% we split off earlier), so that it returns the predicted target values (`y_pred`).

In [None]:
y_pred = model.predict(X_test)
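
The same `predict` call also works for a single new estate, as long as the input keeps the 2D shape and column name used during training. This is a minimal sketch on hypothetical data; `train_demo`, `demo_model`, and `new_estate` are illustrative names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data where the price is exactly 2000 per square meter
train_demo = pd.DataFrame({
    "Surface": [50, 80, 100],
    "Prix": [100000, 160000, 200000],
})
demo_model = LinearRegression().fit(train_demo[["Surface"]], train_demo["Prix"])

# A single new estate, passed as a one-row DataFrame with the same column name
new_estate = pd.DataFrame({"Surface": [70]})
predicted_price = demo_model.predict(new_estate)[0]

print(round(predicted_price, 2))
```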

## Model Evaluation
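We will analyze the **mean squared error** (MSE) and the **coefficient of determination** (R²). For a good model, the mean squared error should be small relative to the scale of the prices, since it is expressed in squared price units. The coefficient of determination should be close to 1; a value near 0 means the model does little better than always predicting the mean price.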

In [None]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared (Coefficient of Determination): {r2}")
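
To clarify what the two metrics measure, the sketch below recomputes them with NumPy on hypothetical values and checks them against scikit-learn. It also derives the root mean squared error (RMSE), an addition not used elsewhere in this notebook, which is often easier to read because it is in the same unit as the price.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices (in thousands)
y_true_demo = np.array([100.0, 150.0, 200.0, 250.0])
y_pred_demo = np.array([110.0, 140.0, 210.0, 240.0])

# MSE: average of the squared residuals
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: square root of the MSE, in the same unit as the target
rmse_manual = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true_demo, y_pred_demo))
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
print(rmse_manual)
```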

## Visualising the Regression Line
Here is the regression line, along with its parameters.

In [None]:
# Plot the scatter plot of actual data
plt.figure(figsize=(8, 5))
plt.scatter(X, y, color="blue", alpha=0.5, label="Actual Data")

# Regression line
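# Evenly spaced X values for the line, in a DataFrame with the same column name used in training
X_line = pd.DataFrame({"Surface": np.linspace(df["Surface"].min(), df["Surface"].max(), 100)})
y_line = model.predict(X_line)  # Predict Y values
plt.plot(X_line["Surface"], y_line, color="red", linewidth=2, label="Regression Line")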

# Equation of the line: Y = aX + b, where a and b are the model parameters
a = model.coef_[0]  # Slope
b = model.intercept_  # Intercept
equation_text = f"Y = {a:.2f}X + {b:.2f}"

# Display equation on the plot (scalar coordinates taken from the Surface column)
plt.text(df["Surface"].max() * 0.6, y.max() * 0.9, equation_text, fontsize=12, color="red")

# Calculate and display R² (computed on the test set)
r2 = r2_score(y_test, y_pred)
plt.text(df["Surface"].max() * 0.6, y.max() * 0.85, f"R² = {r2:.2f}", fontsize=12, color="red")

# Labels and title
plt.xlabel("Surface (square meters)")
plt.ylabel("Prix ($)")
plt.title("Housing Prices vs Surface Area with Regression Line")
plt.legend()
plt.grid(True)
plt.show()
