## 🔷 Step 1: Load and Understand the Dataset

### ✅ What we’ll do in this step:

- Load the CSV using pandas

- View the first few rows

- Check the shape (rows × columns)

- See data types and missing values

- View basic stats (mean, std, etc.)

- Identify features vs. target (Price)

In [2]:
import pandas as pd

# 1. Load the dataset
df = pd.read_csv("house_prices.csv")

# 2. Display first 5 rows
print("First 5 rows of the dataset:", df.head())

# 3. Shape of the dataset
print("\n Dataset Shape (rows,columns):", df.shape)

# 4. Info about data types and missing values
# df.info() prints its report itself and returns None, so call it on its own line
print("\n Dataset Info:")
df.info()

# 5. Summary statistics
print("\n Summary Statistics: ", df.describe(include="all"))

# 6. Count missing values per column
print("\n Missing Values per Column", df.isnull().sum())

First 5 rows of the dataset:    Index                                              Title  \
0      0  1 BHK Ready to Occupy Flat for sale in Srushti...   
1      1  2 BHK Ready to Occupy Flat for sale in Dosti V...   
2      2  2 BHK Ready to Occupy Flat for sale in Sunrise...   
3      3        1 BHK Ready to Occupy Flat for sale Kasheli   
4      4  2 BHK Ready to Occupy Flat for sale in TenX Ha...   

                                         Description Amount(in rupees)  \
0  Bhiwandi, Thane has an attractive 1 BHK Flat f...           42 Lac    
1  One can find this stunning 2 BHK flat for sale...           98 Lac    
2  Up for immediate sale is a 2 BHK apartment in ...          1.40 Cr    
3  This beautiful 1 BHK Flat is available for sal...           25 Lac    
4  This lovely 2 BHK Flat in Pokhran Road, Thane ...          1.60 Cr    

   Price (in rupees) location Carpet Area         Status         Floor  \
0             6000.0    thane    500 sqft  Ready to Move  10 out of 11   

### 🧠  Key Insights from Dataset Overview

**🔹 1. Dataset Shape and Size**
    
- Rows (properties): 187,531

- Columns (features): 21

- This is a large dataset, great for training machine learning models — but also requires careful preprocessing due to potential data quality issues.

**🔹 2. Target Variable Candidate**
    
- A strong candidate for prediction is likely "Amount(in rupees)" or "Price (in rupees)":

- Amount(in rupees) is in text format (e.g., "42 Lac", "1.40 Cr", "Call for Price").

- Price (in rupees) is numeric (float64) and usable for regression, but has ~17,665 missing values.

➡️ We'll most likely use Price (in rupees) as the target, after dropping the rows where it is missing (imputing a target would mean training on made-up labels).

**🔹 3. Presence of Text and Categorical Features**
    
- Many columns like Title, Description, location, Status, Transaction, Furnishing, facing, Society, etc., are textual/categorical.

- These will need encoding (like LabelEncoding or OneHotEncoding) before feeding into ML models.

**🔹 4. Mixed-Type or Object-Type Numerical Features**
    
Some important columns that look numerical but are stored as object (string):

- Carpet Area → values like "500 sqft" need to be cleaned

- Floor → format like "10 out of 11" → needs parsing

- Balcony, Bathroom → stored as object, but should be numeric

➡️ These need to be converted to numeric properly before modeling.

**🔹 5. Severe Missing Values**

Let’s break it down:

| Column                    | Missing % | Comment                                             |
| ------------------------- | --------- | --------------------------------------------------- |
| `Description`             | \~1.6%    | Okay to keep, not critical                          |
| `Price`                   | \~9.4%    | ⚠️ Important! Needs to be handled carefully         |
| `Carpet Area`             | \~43%     | High missing rate                                   |
| `Society`                 | \~58%     | May consider dropping if not useful                 |
| `Car Parking`             | \~55%     | Clean/encode carefully                              |
| `Super Area`              | \~57%     | Very high missing rate                              |
| `Dimensions`, `Plot Area` | **100%**  | 💣 These are completely missing — should be dropped |

➡️ You'll need to drop fully-null columns and strategically fill/clean the others.
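The percentages in the table can be recomputed directly from the DataFrame. Here is a minimal sketch on a tiny made-up frame; the `sample` data and its values are illustrative only, not taken from the real CSV:

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real dataset (illustrative values only)
sample = pd.DataFrame({
    "Carpet Area": ["500 sqft", None, "650 sqft", None],
    "Society": [None, None, None, "Some Society"],
    "Dimensions": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = sample.isnull().mean() * 100

# Columns that are entirely empty are candidates for dropping
fully_null = missing_pct[missing_pct == 100].index.tolist()

print(missing_pct.sort_values(ascending=False))
print("Fully null columns:", fully_null)
```

On the real data, the same two lines applied to `df` should list the columns worth dropping before any imputation.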

**🔹 6. Duplicate or Redundant Columns**
    
- Index column is just a row number — we can drop it.

- Title and Description contain free-text — useful for NLP, but not for simple regression without preprocessing.

**🔹 7. Suspicious Values in "Amount(in rupees)"**
                                                                    
- This column contains values like:

"42 Lac", "1.40 Cr", "Call for Price" → stored as object

- Not usable for regression unless cleaned and converted to numerical rupees

### ✅ Summary of Next Steps (after Insights):

**Drop useless columns:**

- Index, Dimensions, Plot Area

**Handle missing values:**

- Drop columns with all nulls

- Fill others (median for numeric, mode for categorical)

**Clean string/object columns:**

- Parse Carpet Area, Floor, Amount(in rupees), etc.

**Encode categorical variables**

**Scale numeric features**

## 🔷 Step 2: Data Preprocessing

We'll divide it into mini-parts:

🔹 A. Drop Unnecessary or Useless Columns

🔹 B. Handle Missing Values

🔹 C. Clean Mixed-Type Object Columns (like Carpet Area, Floor, etc.)

🔹 D. Encode Categorical Columns

🔹 E. Scale Numerical Columns

### A. Drop Unnecessary Columns

In [3]:
# Drop completely useless or fully missing columns
df.drop(["Index", "Dimensions", "Plot Area"], axis=1, inplace=True)

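# Drop rows with a missing target (Price): they can't be used for supervised training
# .copy() makes df an independent DataFrame, avoiding SettingWithCopyWarning in later assignments
df = df[df["Price (in rupees)"].notnull()].copy()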

# 🔍 What This Does:
# Removes columns that are either identifiers (Index) or completely empty.
# Drops rows where the target (Price) is missing, since we can’t train on them.


### B. Handle Missing Values (Cleanly)

In [4]:
# 1. Separate numerical and categorical columns
num_cols = df.select_dtypes(include=["int64", "float64"]).columns #select_dtypes(...): Helps separate numeric and categorical columns.
cat_cols = df.select_dtypes(include=["object"]).columns

# 2. Fill missing values
# 👉 For numerical columns: fill with median
df[num_cols] = df[num_cols].fillna(df[num_cols].median()) # fillna(median): Replaces nulls in each numeric column with that column's median, which is less sensitive to outliers than the mean.

# 👉 For categorical columns: fill with mode (most frequent value)
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0]) #df[col].mode()[0]: Fills nulls in text columns with the most common value (mode).

# 3. Confirm no more missing values
print("Missing Values after imputation")
print(df.isnull().sum().sum()) # .sum().sum(): Confirms that all missing values are handled (should be 0 if successful).

# 🔍 What This Does:
# Fills numerical nulls using median to avoid outlier distortion.
# Fills categorical nulls using most frequent value (mode).

Missing Values after imputation
0


### C. Clean Mixed-Type Object Columns

Some object-type columns that should be numerical need to be cleaned:

**🛠 Convert "Carpet Area" like "1000 sqft" → numeric**

In [5]:
# Step 1: Extract only numbers
df["Carpet Area"] = df["Carpet Area"].str.extract(r'(\d+\.?\d*)', expand=False)  # expand=False returns a Series instead of a one-column DataFrame
# \d+	One or more digits (e.g., 3683)
# \.?	An optional decimal point
# \d*	Zero or more digits after the decimal
# ( ... )	Capturing group to extract matched number

# Step 2: Convert the result to float
df["Carpet Area"] = pd.to_numeric(df["Carpet Area"], errors="coerce")

print(df["Carpet Area"].isnull().sum())

0


**🛠 Convert "Floor" like "10 out of 22" → Extract current floor**

In [6]:
df["Floor"] = df["Floor"].str.extract(r'(\d+)', expand=False).astype(float)
# \d+ → Matches one or more digits
# (\d+) → Captures those digits so .str.extract() can return them
# .astype(float) → Converts the result to float for numerical use

# From each floor entry, extract the first number (if any), and convert it to a float number so we can use it as a numeric feature.

.extract(r'(\d+)')	This uses a regular expression (regex) to extract only the first number from the string.

- 🔹 \d+ means: “one or more digits” (like 10, 3, 1, etc.)

- 🔹 r'(\d+)' is a raw string, which is a cleaner way to write regex in Python.

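👉 So, for entries that start with a number, it extracts the current floor:

"10 out of 22" → 10

"3 out of 10" → 3

"Ground out of 4" → 4 (the first digits found are the total floor count, so ground-floor entries are parsed incorrectly and need separate handling)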

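Because `.str.extract()` searches the whole string, an unanchored `(\d+)` picks up the total floor count when an entry does not start with a number. One possible refinement is sketched below on made-up strings. The helper name `parse_floor` is hypothetical, and mapping "Ground" to floor 0 is an assumption that may or may not suit your analysis:

```python
import pandas as pd

def parse_floor(series: pd.Series) -> pd.Series:
    """Return the current floor as a float; 'Ground ...' is mapped to 0 (an assumption)."""
    text = series.astype(str).str.strip()
    # Anchor at the start so only a leading number counts as the current floor
    floor = pd.to_numeric(text.str.extract(r"^(\d+)", expand=False), errors="coerce")
    # Entries beginning with "Ground" have no leading digits, so set them explicitly
    is_ground = text.str.lower().str.startswith("ground", na=False)
    return floor.mask(is_ground, 0.0)

floors = pd.Series(["10 out of 22", "3 out of 10", "Ground out of 4", None])
parsed = parse_floor(floors)
print(parsed.tolist())
```

Other non-numeric labels, if the data contains any, would still come out as NaN and would need their own rule.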
**🛠 Convert "Amount(in rupees)" like "1.4 Cr", "25 Lac" → rupees**

*💡 Problem We’re Solving*

"Amount(in rupees)" column has textual prices like:

"1.4 Cr"

"98 Lac"

"Call for Price"

"42 Lac"

These are not directly usable for calculations or machine learning models — so we clean and convert them to numeric rupee values (like 14000000, 9800000, etc.).

In [7]:
import numpy as np  # needed for np.nan below (numpy is otherwise not imported until Step 5)

def convert_to_rupees(value):
    value = str(value).strip() 
    #Ensures the input is treated as a string, and .strip() removes any extra spaces from start or end.
    if "Cr" in value:
        return float(value.replace("Cr", "").strip()) * 10000000
        # Removes "Cr" from the string, Strips spaces, Converts the number (e.g. "1.4") to float → 1.4, Multiplies by 1 crore = 1,00,00,000 → returns 1.4 * 10000000 = 14000000
    elif "Lac" in value:
        return float(value.replace("Lac", "").strip()) * 100000
    elif value.lower() == "call for price":
        return np.nan
        # If the value is "Call for Price" (which means no price is disclosed), we return np.nan (missing value) so we can handle it later.
    else:
        try:
            return float(value)
        except ValueError:
            return np.nan
            # This handles edge cases like plain numeric strings ("4200000") — it tries to convert them to float. If it fails (maybe string is something weird), it returns np.nan.

#Apply Function to the Dataset
df["Amount(in rupees)"] = df["Amount(in rupees)"].apply(convert_to_rupees) #This applies the function to each row of the "Amount(in rupees)" column — converting everything to proper numbers.

# Drop Failed Conversions
df = df[df["Amount(in rupees)"].notnull()] #After conversion, this keeps only rows where conversion succeeded — drops rows where "Call for Price" or any invalid data caused the value to become NaN.

print(df["Amount(in rupees)"].isnull().sum())

0


### D. Encode Categorical Variables

**🧠 Why We Do This:**

Most machine learning models (like Linear Regression, Random Forest, etc.) cannot handle text/categorical data directly. So we need to convert them into numbers.

**📌 What We’ll Do:**

We’ll use Label Encoding (for simplicity, since we are not using OneHotEncoding in this project). Label encoding assigns a unique integer to each unique category.

In [8]:
from sklearn.preprocessing import LabelEncoder #LabelEncoder(): A scikit-learn tool to convert categorical values into integers (e.g., "Yes", "No" → 1, 0).

# 1. Create a LabelEncoder object
le = LabelEncoder()

# 2. Loop through all categorical columns and encode them
for col in cat_cols: # for col in cat_cols:: We apply the encoding to each categorical column one by one.
    df[col] = le.fit_transform(df[col].astype(str)) 
    #.astype(str): Ensures that even missing or numeric-looking categories are treated as strings (to avoid errors).
    #df[col] = le.fit_transform(...): Replaces original text categories with encoded numbers.

# 3. Confirm changes
print("Sample of Encoded Data:\n")
print(df[cat_cols].head())

# 4. Check if any null values are left
print(df[cat_cols].isnull().sum())

Sample of Encoded Data:

   Title  Description  Amount(in rupees)  location  Carpet Area  Status  \
0   3217         5020                712        67         1764       0   
1  10105        24480               1532        67         1730       0   
2  14467        56098                104        67         2108       0   
4  14665        45436                156        67         1934       0   
5   3424         9356                770        67            2       0   

   Floor  Transaction  Furnishing  facing  overlooking  Society  Bathroom  \
0      1            3           2       0            6     8230         0   
1     23            3           1       0            0     2119         2   
2      1            3           2       0            0     8379         2   
4     12            3           2       7            1     8671         2   
5     11            3           2       0            1     9404         0   

   Balcony  Car Parking  Ownership  Super Area  
0        2  
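The sample above lists `Amount(in rupees)`, `Carpet Area` and `Floor` among the encoded columns. `cat_cols` was built in part B, before part C converted those columns to numbers, so the loop label-encodes them again and their numeric meaning is lost. One way to avoid this is to rebuild the list from the current dtypes just before encoding, sketched here on a small made-up frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows standing in for the dataset after part C (illustrative values only)
sample = pd.DataFrame({
    "Carpet Area": [500.0, 650.0, 1000.0],   # already numeric after cleaning
    "Floor": [10.0, 3.0, 1.0],               # already numeric after cleaning
    "Furnishing": ["Unfurnished", "Furnished", "Unfurnished"],
    "location": ["thane", "thane", "mumbai"],
})

# Rebuild the list from the current dtypes instead of reusing an earlier one
cat_cols_now = sample.select_dtypes(exclude=["number"]).columns.tolist()

encoders = {}
for col in cat_cols_now:
    # One encoder per column, so each mapping can be inverted later
    encoders[col] = LabelEncoder()
    sample[col] = encoders[col].fit_transform(sample[col].astype(str))

print(cat_cols_now)
print(sample)
```

Keeping one encoder per column also makes it possible to call `inverse_transform` later, which a single reused `LabelEncoder` does not allow.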

### E. Scale Numerical Features

**🧠 Why We Do This:**

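Machine learning models, especially distance-based ones (like KNN and SVM) and those trained with gradient descent, often perform better when features are on a similar scale. Otherwise, columns with larger numeric ranges dominate. scikit-learn's LinearRegression solves ordinary least squares directly, so scaling matters less for its fit than for regularized models such as Ridge and Lasso.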

We’ll use StandardScaler to standardize the features:

- Centers each column at mean 0 and rescales it to standard deviation 1.

- Works well for many algorithms.

**⚠️ Important Note:**
            
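We scale only the columns in num_cols, not the label-encoded categorical ones, because label-encoded values don’t have real numeric meaning (e.g., 0, 1, 2 for “furnishing” isn't an actual scale). Note that num_cols was built in part B, when Price (in rupees) was the only numeric column left, so in this notebook only the target ends up scaled.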

In [9]:
from sklearn.preprocessing import StandardScaler #StandardScaler(): A scaler that makes all numerical columns have mean 0 and standard deviation 1.

# 1. Create a StandardScaler object
scaler = StandardScaler()

# 2. Fit and transform only numerical columns
df[num_cols] = scaler.fit_transform(df[num_cols]) 
#scaler.fit_transform(...): Learns from the data and transforms it.
#df[num_cols] = ...: Updates the original dataframe with the scaled values.

# 3. Confirm changes
print("Summary of Scaled Numerical Columns:\n")
print(df[num_cols].describe())


Summary of Scaled Numerical Columns:

       Price (in rupees)
count       1.698660e+05
mean        5.354189e-18
std         1.000003e+00
min        -2.783891e-01
25%        -1.206526e-01
50%        -5.688985e-02
75%         6.850649e-02
max         2.456688e+02
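The summary shows that only `Price (in rupees)` was standardized, so every error metric later in the notebook is in standard deviations rather than rupees. A common alternative is to leave the target in its original units and fit the scaler on the training features only, so that test-set statistics do not leak into training. This is not what the notebook does. It is sketched here on synthetic data with illustrative column names:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (values are random, for illustration only)
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "Carpet Area": rng.uniform(300, 2000, size=200),
    "Floor": rng.integers(0, 30, size=200).astype(float),
})
target = pd.Series(rng.uniform(3000, 20000, size=200), name="Price (in rupees)")

x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn mean/std from training rows only
x_test_scaled = scaler.transform(x_test)        # reuse the training statistics

print(x_train_scaled.mean(axis=0).round(6))
print(x_train_scaled.std(axis=0).round(6))
```

With this arrangement the split happens before scaling, and MAE and RMSE on `y_test` can be read directly in the target's original units.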


**Final Check: Confirm Dataset is Clean**

In [10]:
print("Any Missing Values Left?:", df.isnull().sum().sum())
print("\nFinal Shape of Dataset:", df.shape)
df.head()


Any Missing Values Left?: 0

Final Shape of Dataset: (169866, 18)


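|   | Title | Description | Amount(in rupees) | Price (in rupees) | location | Carpet Area | Status | Floor | Transaction | Furnishing | facing | overlooking | Society | Bathroom | Balcony | Car Parking | Ownership | Super Area |
| - | ----- | ----------- | ----------------- | ----------------- | -------- | ----------- | ------ | ----- | ----------- | ---------- | ------ | ----------- | ------- | -------- | ------- | ----------- | --------- | ---------- |
| 0 | 3217  | 5020  | 712  | -0.058138 | 67 | 1764 | 0 | 1  | 3 | 2 | 0 | 6 | 8230 | 0 | 2 | 0 | 1 | 130  |
| 1 | 10105 | 24480 | 1532 | 0.228152  | 67 | 1730 | 0 | 23 | 3 | 1 | 0 | 0 | 2119 | 2 | 2 | 2 | 1 | 130  |
| 2 | 14467 | 56098 | 104  | 0.36401   | 67 | 2108 | 0 | 1  | 3 | 2 | 0 | 0 | 8379 | 2 | 2 | 0 | 1 | 130  |
| 4 | 14665 | 45436 | 156  | 0.412612  | 67 | 1934 | 0 | 12 | 3 | 2 | 7 | 1 | 8671 | 2 | 2 | 0 | 0 | 130  |
| 5 | 3424  | 9356  | 770  | -0.035452 | 67 | 2    | 0 | 11 | 3 | 2 | 0 | 1 | 9404 | 0 | 0 | 0 | 0 | 2497 |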


## 🔷 Step 3: Split the Data

### ✅ Use train_test_split() to divide data

In [11]:
from sklearn.model_selection import train_test_split

# Separate Features and Target
# We'll predict "Price (in rupees)", so it's our target (y), and all other columns are features (X):
y = df["Price (in rupees)"] # target variable
x = df.drop("Price (in rupees)", axis=1)  # All other columns are features

# Split the Dataset using train_test_split()
# We'll split 80% for training and 20% for testing. We'll also use random_state=42 to make the split reproducible.
x_train , x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

### ✅ Understand the shape of training vs. test sets

In [12]:
#Let's confirm how the data is split:
print("Shape of Training Features (x_train):", x_train.shape)
print("Shape of Testing Features (x_test):", x_test.shape)
print("Shape of Training Labels (y_train):", y_train.shape)
print("Shape of Testing Labels (y_test):", y_test.shape)

Shape of Training Features (x_train): (135892, 17)
Shape of Testing Features (x_test): (33974, 17)
Shape of Training Labels (y_train): (135892,)
Shape of Testing Labels (y_test): (33974,)


### ✅ Discuss what random_state does

**🔍 What is random_state?**

- random_state controls the randomness of the train-test split.

- When you set a fixed value (like 42), it ensures that every time you run the code, the split is the same.

- Think of it like setting the same seed value.

**🧠 Example:**

- If random_state=42, you and I will get the same train/test sets.

- If you leave it out, the split will be random each time you run the code.
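The effect of the seed is easy to check on a toy array. In this minimal sketch the array and the second seed value (7) are arbitrary choices made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)

# Same seed: the split is identical on every call
a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)
print("Same seed, same test set:", np.array_equal(a_test, b_test))

# A different seed will generally select different rows
c_train, c_test = train_test_split(data, test_size=0.2, random_state=7)
print("Seed 42 test rows:", a_test)
print("Seed 7 test rows: ", c_test)
```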

## 🔷 Step 4: Train the Regression Model

In [13]:
from sklearn.linear_model import LinearRegression #LinearRegression is a simple yet powerful regression algorithm that finds the best-fitting straight line through the data to predict a continuous value.

# Initialize the Linear Regression model
lr_model = LinearRegression() #We create a LinearRegression object called lr_model. This is the model that will learn from the training data.

# Fit the model on training data
lr_model.fit(x_train, y_train)
# .fit() tells the model to:
# Look at the features (x_train)
# Understand how they relate to the target (y_train)
# Learn the best coefficients (weights) to make predictions

# Predict house prices on test data
y_pred = lr_model.predict(x_test) #.predict() uses the trained model to make predictions on unseen data (x_test). These predictions are stored in y_pred and represent the estimated house prices.

## 🔷 Step 5: Evaluate the Model

| Metric       | What It Measures                              | Interpretation                                                        |
| ------------ | --------------------------------------------- | --------------------------------------------------------------------- |
| **MAE**      | Average absolute error, in the same unit as the target | Lower is better. Tells how much you're wrong on average |
| **MSE**      | Squared error → penalizes larger errors more  | Lower is better. Useful if you care about big errors                  |
| **RMSE**     | Square root of MSE (back in the target's unit) | Easier to interpret than MSE. Much larger than MAE when a few errors are very large |
| **R² Score** | Share of variance explained by the model      | At most 1 (perfect prediction). 0 matches always predicting the mean, and negative values are worse than that |
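To make the definitions concrete, the four metrics can be computed by hand on a handful of made-up numbers and compared with scikit-learn's functions. The values below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_true - y_hat                        # [1, 0, -1, -3]
mae_manual = np.mean(np.abs(errors))           # (1 + 0 + 1 + 3) / 4
mse_manual = np.mean(errors ** 2)              # (1 + 0 + 1 + 9) / 4
rmse_manual = np.sqrt(mse_manual)

ss_res = np.sum(errors ** 2)                   # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2) # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mse_manual, rmse_manual, r2_manual)
print(mean_absolute_error(y_true, y_hat),
      mean_squared_error(y_true, y_hat),
      r2_score(y_true, y_hat))
```

The R² formula shows why the score turns negative whenever the model's squared error exceeds that of simply predicting the mean.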


In [14]:
#1. Import Evaluation Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# mean_absolute_error: Measures how far predictions are from actual values (on average).
# mean_squared_error: Squares the errors (so large mistakes hurt more).
# r2_score: Shows how well the model explains the variance (closer to 1 = better).
# numpy (np): Needed to calculate Root of Mean Squared Error (RMSE).

#2. Evaluate the Model
# Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)

# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# R² Score (Coefficient of Determination)
r2 = r2_score(y_test, y_pred)

# Print the Result
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R² Score: {r2:.4f}")

Mean Absolute Error (MAE): 0.12
Mean Squared Error (MSE): 2.66
Root Mean Squared Error (RMSE): 1.63
R² Score: 0.0031


**🔍 What These Results Mean:**

| Metric       | Value  | Interpretation                                                            |
| ------------ | ------ | ------------------------------------------------------------------------- |
| **MAE**      | 0.12   | Low → on average, predictions are about 0.12 standard deviations off (the target was standardized in Step 2E, so this is not in rupees) |
| **MSE**      | 2.66   | Error squared → harder to interpret directly                              |
| **RMSE**     | 1.63   | Square root of MSE, in the same standardized units; far larger than MAE, so a few errors are very large |
| **R² Score** | 0.0031 | 🚨 Very low! The model is **barely better than guessing the average**     |

**💡 Conclusion:**

- The MAE is low, suggesting most predictions aren't very far off, but the much larger RMSE shows that a few predictions miss badly.

- The R² score is almost zero, meaning the model explains less than 1% of the variance in price.

This typically means:

- There’s a lot of noise in the data,

- Or some important features are missing or poorly preprocessed,

- Or the model (Linear Regression) is too simple for this dataset.
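To make "barely better than guessing the average" concrete, a model's R² can be compared with a baseline that always predicts the training mean. The sketch below uses synthetic data that is uninformative by construction, not the house-price dataset:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)  # target is unrelated to X by construction

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(x_train, y_train)
linear = LinearRegression().fit(x_train, y_train)

baseline_r2 = r2_score(y_test, baseline.predict(x_test))
linear_r2 = r2_score(y_test, linear.predict(x_test))

print(f"Baseline R²: {baseline_r2:.4f}")
print(f"Linear   R²: {linear_r2:.4f}")
```

A constant prediction can never score above 0 on the test set, so an R² of 0.0031 sits right next to this baseline.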

## 🔷 Step 6: Improve the Model

### Step 6.1: Try Decision Tree Regressor

In [15]:
# 1. Import the model
from sklearn.tree import DecisionTreeRegressor #This imports the Decision Tree regression model from scikit-learn.

# 2. Initialize the model
dt_model = DecisionTreeRegressor(random_state=42) #We create the model. The random_state=42 ensures reproducibility of the tree structure.

# 3. Train the model on training data
dt_model.fit(x_train, y_train) #This trains the model on the training data (x_train, y_train).

# 4. Predict on the test set
dt_preds = dt_model.predict(x_test) #The trained model now predicts house prices for the unseen test set.

**Evaluate the Decision Tree Model**

In [16]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Evaluate Decision Tree performance
dt_mae = mean_absolute_error(y_test, dt_preds)
dt_mse = mean_squared_error(y_test, dt_preds)
dt_rmse = np.sqrt(dt_mse)
dt_r2 = r2_score(y_test, dt_preds)

# Print the result
print("Decision Tree Regressor:")
print(f"Mean Absolute Error (MAE): {dt_mae:.2f}")
print(f"Mean Squared Error (MSE): {dt_mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {dt_rmse:.2f}")
print(f"R² Score: {dt_r2:.4f}")

Decision Tree Regressor:
Mean Absolute Error (MAE): 0.03
Mean Squared Error (MSE): 2.92
Root Mean Squared Error (RMSE): 1.71
R² Score: -0.0961


**🔍 Comparison: Linear Regression vs. Decision Tree**

| Metric   | Linear Regression | Decision Tree Regressor        |
| -------- | ----------------- | ------------------------------ |
| MAE      | 0.12              | **0.03** ✅ *(lower is better)* |
| MSE      | 2.66              | 2.92                           |
| RMSE     | 1.63              | 1.71                           |
| R² Score | 0.0031            | **-0.0961** ❌ *(worse)*        |

**📌 Interpretation:**
  
- MAE improved significantly with the Decision Tree → good!

- But the R² score is negative, which means the model does worse on the test set than a horizontal line at the average.

- A low MAE with a negative R² suggests that most predictions are close while a few are far off. An unconstrained tree can also memorize the training data, so it is likely overfitting and not generalizing well to test data.
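A low MAE and a negative R² can coexist because R² is built on squared errors, so a single large miss can outweigh many accurate predictions. The scaled target's maximum of about 245 in the earlier summary suggests such extreme values exist in this data. Here is a toy illustration with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Four ordinary values and one extreme value (made-up numbers)
y_true = np.array([0.0, 0.0, 0.0, 0.0, 10.0])

# Model A: always predicts the mean of y_true
pred_mean = np.full(5, 2.0)

# Model B: exact on the ordinary values, badly wrong on the extreme one
pred_sharp = np.array([0.0, 0.0, 0.0, 0.0, -1.0])

mae_mean = mean_absolute_error(y_true, pred_mean)
mae_sharp = mean_absolute_error(y_true, pred_sharp)
r2_mean = r2_score(y_true, pred_mean)
r2_sharp = r2_score(y_true, pred_sharp)

print(f"Mean predictor : MAE={mae_mean:.2f}, R²={r2_mean:.4f}")
print(f"Sharp predictor: MAE={mae_sharp:.2f}, R²={r2_sharp:.4f}")
```

Model B has the lower MAE yet a negative R², the same pattern as the Decision Tree results above.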

### Step 6.2: Try Random Forest Regressor

In [17]:
# 1. Import the model
from sklearn.ensemble import RandomForestRegressor #RandomForestRegressor: Builds multiple trees and averages their results to reduce overfitting.

# 2. Initialize the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42) #n_estimators=100: Builds 100 trees. More trees usually give more stable predictions, with diminishing returns and longer training time.

# 3. Train the model
rf_model.fit(x_train, y_train)

# 4. Predict on the test set
rf_preds = rf_model.predict(x_test)

**Evaluate the Random Forest Model**

In [18]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Then evaluate as usual
rf_mae = mean_absolute_error(y_test, rf_preds)
rf_mse = mean_squared_error(y_test, rf_preds)
rf_rmse = np.sqrt(rf_mse)
rf_r2 = r2_score(y_test, rf_preds)

print("Random Forest Regressor:")
print(f"MAE: {rf_mae:.2f}")
print(f"MSE: {rf_mse:.2f}")
print(f"RMSE: {rf_rmse:.2f}")
print(f"R² Score: {rf_r2:.4f}")


Random Forest Regressor:
MAE: 0.03
MSE: 2.66
RMSE: 1.63
R² Score: 0.0025


**🔍 Model Comparison (Based on Your Outputs)**

| Metric                  | **Linear Regression** | **Decision Tree** | **Random Forest** |
| ----------------------- | --------------------- | ----------------- | ----------------- |
| **MAE** (↓ better)      | 0.12                  | **0.03**          | **0.03**          |
| **MSE** (↓ better)      | 2.66                  | 2.92              | **2.66**          |
| **RMSE** (↓ better)     | 1.63                  | 1.71              | **1.63**          |
| **R² Score** (↑ better) | 0.0031                | **–0.0961 ❌**     | **0.0025**        |

**✅ Interpretation**
    
- MAE: Lower is better → Decision Tree and Random Forest are doing great here.

- MSE & RMSE: Random Forest has slightly better performance than Decision Tree.

- R² Score:

  - Measures how well your model explains the variance.

  - Linear Regression and Random Forest have positive but close to zero R² → they explain very little variance.

  - Decision Tree has negative R², which means it performs worse than just predicting the mean every time.

### Step 6.3: Try Regularization: Ridge, Lasso 

**🔘 Try Regularization models – Ridge and Lasso**

These are:

🔹 Still linear models like Linear Regression

🔹 Used when the model overfits or when features are high-dimensional or correlated

🔹 Can help improve generalization, but only when regular Linear Regression is struggling due to multicollinearity or overfitting

After preprocessing, the dataset has 169,866 rows and only 17 features, so a plain linear model has little room to overfit and big gains are unlikely. Ridge and Lasso are still worth trying, both as a check and for exposure to regularization techniques 🧠

**🔷 Let’s Start with: Ridge Regression**

In [19]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# 1. Create a Ridge model with alpha (regularization strength)
ridge_model = Ridge(alpha=1.0)  # alpha=1.0 is default 
#Ridge(alpha=1.0) Creates a Ridge Regression model. alpha controls how much regularization to apply (higher = more shrinkage).

# 2. Train the model on training data
ridge_model.fit(x_train, y_train)

# 3. Predict on test data
ridge_preds = ridge_model.predict(x_test)

# 4. Evaluate performance
ridge_mae = mean_absolute_error(y_test, ridge_preds)
ridge_mse = mean_squared_error(y_test, ridge_preds)
ridge_rmse = np.sqrt(ridge_mse)
ridge_r2 = r2_score(y_test, ridge_preds)

# 5. Print results
print("Ridge Regression:")
print(f"MAE: {ridge_mae:.2f}")
print(f"MSE: {ridge_mse:.2f}")
print(f"RMSE: {ridge_rmse:.2f}")
print(f"R² Score: {ridge_r2:.4f}")

Ridge Regression:
MAE: 0.12
MSE: 2.66
RMSE: 1.63
R² Score: 0.0031


*🔍 Interpretation:*

- MAE (0.12): On average, the model’s predictions are off by 0.12 units (in the same units as your target).

- MSE / RMSE (2.66 / 1.63): The same as Linear Regression to two decimal places.

- R² (0.0031): Very low—this means the model is only able to explain ~0.3% of the variation in house prices.

⚠️ This confirms that the data may be too noisy, or that important features are missing or not yet properly cleaned.

**🔷 Lasso Regression**

Just like Ridge, but with L1 regularization, which can also shrink some feature weights to zero (feature selection).

In [20]:
from sklearn.linear_model import Lasso

# 1. Create a Lasso model with alpha (regularization strength)
lasso_model = Lasso(alpha=1.0) #Builds a Lasso regression model. L1 regularization shrinks some coefficients to 0, performing automatic feature selection.

# 2. Train the model on training data
lasso_model.fit(x_train, y_train)

# 3. Predict on test data
lasso_preds = lasso_model.predict(x_test)

# 4. Evaluate performance
lasso_mae = mean_absolute_error(y_test, lasso_preds)
lasso_mse = mean_squared_error(y_test, lasso_preds)
lasso_rmse = np.sqrt(lasso_mse)
lasso_r2 = r2_score(y_test, lasso_preds)

# 5. Print results
print("Lasso Regression:")
print(f"MAE: {lasso_mae:.2f}")
print(f"MSE: {lasso_mse:.2f}")
print(f"RMSE: {lasso_rmse:.2f}")
print(f"R² Score: {lasso_r2:.4f}")


Lasso Regression:
MAE: 0.13
MSE: 2.66
RMSE: 1.63
R² Score: 0.0018
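To see the feature-selection effect of L1 regularization in isolation, Ridge and Lasso coefficients can be compared on synthetic data where only two of six features matter. The data, the coefficients 3 and -2, and `alpha=0.5` are all illustrative choices, not settings from this notebook:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Only the first two features influence the target
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Ridge coefficients:", ridge.coef_.round(3))
print("Lasso coefficients:", lasso.coef_.round(3))
print("Coefficients set exactly to zero by Lasso:", n_zero_lasso)
```

Ridge shrinks the irrelevant coefficients toward zero without removing them, while Lasso drops them entirely and also shrinks the two relevant ones.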


### Step 6.4: Compare model performance

**📊 Regression Model Comparison Table**

| Model                       | MAE  | MSE  | RMSE | R² Score      |
| --------------------------- | ---- | ---- | ---- | ------------- |
| **Linear Regression**       | 0.12 | 2.66 | 1.63 | **0.0031** ✅  |
| **Ridge Regression**        | 0.12 | 2.66 | 1.63 | **0.0031** ✅  |
| **Lasso Regression**        | 0.13 | 2.66 | 1.63 | 0.0018        |
| **Decision Tree Regressor** | 0.03 | 2.92 | 1.71 | **-0.0961** ❌ |
| **Random Forest Regressor** | 0.03 | 2.66 | 1.63 | 0.0025        |

**🧠 Interpretation & Insights**

*✅ Best Overall*

- Linear and Ridge Regression have the highest R² (0.0031), with Random Forest close behind (0.0025). Random Forest pairs that with the lowest MAE (0.03) at the same RMSE, so it is a reasonable overall pick, but every R² here is still very low.

*🤔 What does this mean?*

- All models are giving very similar performance, and R² is near 0, which indicates:

- Your current features do not explain the target (price per sqft) very well.

- Possibly important features like location quality, amenities, flat age, etc., are missing or not fully cleaned.

*⚠️ Decision Tree had:*
    
- Low MAE but negative R², meaning it overfitted on the training data and did poorly on test data.

📌 Final Notes

The models are working, but the data's predictive power is weak — that’s common in real estate datasets that aren’t cleaned deeply or lack powerful features.

Further improvement needs:

- Feature engineering (e.g., extract BHK count from title)

- Grouping locations into zones

- Dropping more noisy columns (e.g., Society, Description, etc.)
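As an example of the first idea, the BHK count can be pulled out of the title with a regular expression. In this minimal sketch the first two titles follow the pattern shown in the dataset preview, and the third is made up to show the no-match case:

```python
import pandas as pd

titles = pd.Series([
    "1 BHK Ready to Occupy Flat for sale Kasheli",
    "2 BHK Ready to Occupy Flat for sale in Dosti V...",
    "Plot for sale",  # made-up title without a BHK count
])

# Capture the digits immediately before "BHK"; titles without a match become NaN
bhk = pd.to_numeric(titles.str.extract(r"(\d+)\s*BHK", expand=False), errors="coerce")

print(bhk.tolist())
```

This has to run on the raw `Title` text, so it would need to happen before the label-encoding step in Step 2D.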

**✅ Final Ranking (based on this data):**

🥇 Random Forest

🥈 Linear / Ridge

🥉 Lasso

❌ Decision Tree (Overfit)

## 🔷 Step 7: Wrap Up

### Save the Best Model

Make a nice GitHub repo and README

Post progress on LinkedIn (optional)

In [21]:
import joblib # joblib is great for saving ML models

# Save the best model (Random Forest Regressor)
# joblib.dump(...): Saves your trained model to a .pkl file
# Later, you can load it using joblib.load("random_forest_regressor_model.pkl")
joblib.dump(rf_model, "random_forest_regressor_model.pkl")
print("model saved as random_forest_regressor_model.pkl")

model saved as random_forest_regressor_model.pkl
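Loading the model back works the same way in reverse. The round trip is sketched below with a small throwaway model and a temporary file, so it runs without the notebook's data. The file name `demo_model.pkl` and the synthetic data are illustrative:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic problem, just to have a fitted model to save
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same_predictions)
```

A saved model expects the same preprocessing at prediction time, so any encoders and scalers fitted earlier usually need to be saved alongside it.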
