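Model validation and evaluation metrics are essential for assessing the performance of regression models. This walkthrough fits a linear regression to synthetic house-price data, holds out a test set, and scores the predictions with four common metrics: MAE, MSE, RMSE, and R².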

**Example Dataset:**
Suppose we have a dataset of houses with features such as the number of bedrooms, square footage, and age; the target variable is the house price. The rows below only illustrate the layout; the code that follows generates its own synthetic values.

| Bedrooms | Square Footage | Age | Price  |
|----------|----------------|-----|--------|
| 3        | 2000           | 10  | 300000 |
| 2        | 1500           | 5   | 250000 |
| 4        | 2200           | 15  | 350000 |
| 3        | 1800           | 8   | 280000 |
| 2        | 1600           | 6   | 260000 |

In [1]:
import pandas as pd
import numpy as np

# Generate sample data
np.random.seed(0)  # for reproducibility
num_houses = 1000

# Generate features (the upper bound of np.random.randint is exclusive)
bedrooms = np.random.randint(1, 6, num_houses)  # 1 to 5 bedrooms
square_footage = np.random.randint(1000, 3000, num_houses)  # 1000 to 2999 square feet
age = np.random.randint(1, 50, num_houses)  # 1 to 49 years

# Generate target variable (price)
# Assuming a linear relationship between features and price with some noise
price = 50000 * bedrooms + 200 * square_footage - 1000 * age + np.random.normal(0, 50000, num_houses)

# Create DataFrame
data = pd.DataFrame({'Bedrooms': bedrooms, 'Square Footage': square_footage, 'Age': age, 'Price': price})

# Save DataFrame to CSV
data.to_csv('house_data.csv', index=False)

print("Sample house data saved to house_data.csv")

Sample house data saved to house_data.csv


In [2]:
df = pd.read_csv('house_data.csv')
df.head()

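|   | Bedrooms | Square Footage | Age | Price         |
|---|----------|----------------|-----|---------------|
| 0 | 5        | 2787           | 19  | 765379.044361 |
| 1 | 1        | 1304           | 29  | 323936.629851 |
| 2 | 4        | 2407           | 36  | 630311.989335 |
| 3 | 4        | 2674           | 27  | 785096.293552 |
| 4 | 4        | 2348           | 40  | 657990.472173 |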


In [3]:
# Features are every column except the last one; the target is the Price column
X = df.iloc[:, :-1]
y = df['Price']

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [5]:
# Hold out 20% of the rows as a test set; random_state fixes the split for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [7]:
# Predict prices for the held-out test set
y_pred = model.predict(X_test)
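
Because the prices were generated from a known formula (50000 per bedroom, 200 per square foot, -1000 per year of age, no intercept), the fitted coefficients can be checked against it before any metric is computed. The sketch below is self-contained: it regenerates the same data with the same seed and, as a simplifying assumption, fits on all 1000 rows rather than on the training split. The name `check_model` is hypothetical and is not the `model` fitted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate the synthetic data exactly as in the first cell (same seed, same draw order)
np.random.seed(0)
num_houses = 1000
bedrooms = np.random.randint(1, 6, num_houses)
square_footage = np.random.randint(1000, 3000, num_houses)
age = np.random.randint(1, 50, num_houses)
price = (50000 * bedrooms + 200 * square_footage - 1000 * age
         + np.random.normal(0, 50000, num_houses))

# Assumption: fit on every row (no train/test split) just to inspect the coefficients
features = np.column_stack([bedrooms, square_footage, age])
check_model = LinearRegression().fit(features, price)

# Coefficients used in the generating formula, in the same column order as `features`
true_coefs = {"Bedrooms": 50000, "Square Footage": 200, "Age": -1000}
for (name, true_value), fitted in zip(true_coefs.items(), check_model.coef_):
    print(f"{name}: true {true_value}, fitted {fitted:.1f}")
print(f"Intercept: true 0, fitted {check_model.intercept_:.1f}")
```

The fitted values should land near the generating ones but not on them, since each price includes random noise.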

1. **Mean Absolute Error (MAE):**
   MAE is the average of the absolute differences between the actual and predicted values. It is expressed in the same units as the target variable.

In [8]:
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

Mean Absolute Error: 38483.87801468218
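
To make the definition concrete, here is a minimal sketch that computes MAE by hand with NumPy. The three actual and predicted prices are hypothetical values chosen so the arithmetic is easy to follow; they are not taken from the test set.

```python
import numpy as np

# Hypothetical actual and predicted prices for three houses
actual = np.array([300000.0, 250000.0, 350000.0])
predicted = np.array([310000.0, 245000.0, 320000.0])

# Absolute error per house: 10000, 5000, 30000
abs_errors = np.abs(actual - predicted)

# MAE is their mean: (10000 + 5000 + 30000) / 3
mae_by_hand = abs_errors.mean()
print("MAE by hand:", mae_by_hand)
```

This is the same quantity that `mean_absolute_error` returns when given the same two arrays.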


2. **Mean Squared Error (MSE):**
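   MSE is the average of the squared differences between the actual and predicted values. Squaring weights large errors more heavily than small ones, and the result is in squared units of the target variable.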

In [9]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 2257252715.9433484
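
One way to see how squaring changes the picture is to compare two sets of predictions with the same MAE but different error distributions. The sketch below uses hypothetical prices and errors; `mae_np` and `mse_np` are illustrative helper names, not library functions.

```python
import numpy as np

def mae_np(y_true, y_hat):
    """Mean of the absolute errors."""
    return np.mean(np.abs(y_true - y_hat))

def mse_np(y_true, y_hat):
    """Mean of the squared errors."""
    return np.mean((y_true - y_hat) ** 2)

# Hypothetical actual prices
actual = np.array([300000.0, 250000.0, 350000.0, 280000.0])

# Case 1: every prediction is off by 10000
even_errors = actual + np.array([10000.0, -10000.0, 10000.0, -10000.0])
# Case 2: three exact predictions and one that is off by 40000
one_big_error = actual + np.array([0.0, 0.0, 0.0, 40000.0])

for label, preds in [("even errors", even_errors), ("one big error", one_big_error)]:
    print(f"{label}: MAE = {mae_np(actual, preds):.0f}, MSE = {mse_np(actual, preds):.0f}")
```

Both cases have an MAE of 10000, but the MSE of the second case is four times that of the first, because the single 40000 error is squared before averaging.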


3. **Root Mean Squared Error (RMSE):**
   RMSE is the square root of the MSE, providing an interpretable measure of the error in the same units as the target variable.

In [10]:
import numpy as np

rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)

Root Mean Squared Error: 47510.55373223247
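
Since RMSE is in the same units as price, it can be compared directly with the noise that was added when the data was generated (standard deviation 50000). The sketch below regenerates the data with the same seed and measures the RMSE of the noise-free generating formula against the noisy prices, which gives a rough reference level for this dataset. The name `rmse_of_true_function` is hypothetical.

```python
import numpy as np

# Regenerate the synthetic data exactly as in the first cell (same seed, same draw order)
np.random.seed(0)
num_houses = 1000
bedrooms = np.random.randint(1, 6, num_houses)
square_footage = np.random.randint(1000, 3000, num_houses)
age = np.random.randint(1, 50, num_houses)
noise = np.random.normal(0, 50000, num_houses)

# The noise-free generating formula and the observed (noisy) prices
true_signal = 50000 * bedrooms + 200 * square_footage - 1000 * age
price = true_signal + noise

# RMSE of the exact generating formula: this is just the spread of the noise
rmse_of_true_function = np.sqrt(np.mean((price - true_signal) ** 2))
print("RMSE of the noise-free formula:", rmse_of_true_function)
```

The RMSE of roughly 47,500 reported above is close to the 50000 noise level, which suggests that most of the remaining error is noise in the data rather than a shortcoming of the linear model.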


4. **R-squared (R²) Score:**
   R² is the proportion of the variance in the target variable that the model explains. A score of 1 indicates a perfect fit, and 0 matches a model that always predicts the mean of the actual values; on held-out data the score can be negative when the model does worse than that baseline.

In [11]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print("R-squared Score:", r2)

R-squared Score: 0.9060062813349279
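
R² is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the definition `r2_score` uses. The sketch below applies it by hand to the five illustrative prices from the example table. `r2_by_hand` is a hypothetical helper, and the constant 400000 prediction is an assumed "bad model" chosen to show a score below 0.

```python
import numpy as np

def r2_by_hand(y_true, y_hat):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Prices from the illustrative table at the top of this section
actual = np.array([300000.0, 250000.0, 350000.0, 280000.0, 260000.0])

perfect = actual.copy()                              # predictions equal to the actual values
mean_baseline = np.full_like(actual, actual.mean())  # always predict the mean price
too_high = np.full_like(actual, 400000.0)            # hypothetical model that always predicts 400000

print("Perfect predictions:", r2_by_hand(actual, perfect))
print("Mean baseline:      ", r2_by_hand(actual, mean_baseline))
print("Always 400000:      ", r2_by_hand(actual, too_high))
```

The first two cases give 1 and 0, while the third is negative because its squared errors exceed those of the mean baseline.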
