# 🏠 Predicting House Prices Using Linear Regression

## ✨ Foreword

This project explores house price prediction using a simple linear regression model trained on features such as average income, number of rooms, house age, and population.  
The aim is to build a foundational understanding of predictive modeling and bridge theory with practical implementation in both **Jupyter Notebook** and a **Flask-based web interface**.

---

## 🗂️ Table of Contents
1. [Importing Libraries](#importing-libraries)
2. [Loading Dataset](#loading-dataset)
3. [Exploratory Data Analysis](#exploratory-data-analysis)
4. [Feature Selection](#feature-selection)
5. [Model Building](#model-building)
6. [Evaluation](#evaluation)
7. [Conclusion](#conclusion)

---

## 1. 📦 Importing Libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error,r2_score

In [None]:
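# Only one backend is active at a time: the second magic overrides the first,
# so plots open in a separate Qt window rather than inline.
%matplotlib inline
%matplotlib qt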

## 2. 📂 Loading Dataset

In [None]:
df=pd.read_csv('USA_Housing.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
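# count() returns the number of rows per column (null or not), not the number of nulls;
# the next cell uses sum() to count missing values
df.isnull().count()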

In [None]:
df.isnull().sum()

In [None]:
df.columns

## 3. 📊 Exploratory Data Analysis (EDA)

We begin by visualizing the data distribution and relationships between features.

### 🔍 Pairplot Insight
This plot helps identify correlations and possible linear relationships. For example, `Avg. Area Income` shows a positive trend with `Price`.

In [None]:
# This dataset is artificially created.
# The pairplot shows histograms and pairwise scatterplots of all numeric columns.
# The histograms show that everything is more or less normally distributed,
# except the average number of bedrooms, which clusters around 2, 3, 4, 5, and 6.
# There is some noise around those values because the column is an area average
# (a single house cannot have 4.5 bedrooms).
sns.pairplot(df)
plt.show(block=True)

In [None]:
sns.displot(df)
plt.show(block=True)

In [None]:
# Distribution of the Price column (our target column).
# Style settings must be applied before plotting to take effect.
sns.set_palette("GnBu_d")
sns.set_style('whitegrid')
# distplot is deprecated in recent seaborn versions; histplot + rugplot give the same view
sns.histplot(df['Price'], kde=True, bins=30)
sns.rugplot(x=df['Price'])
plt.show(block=True)
# The average price falls somewhere between 1M and 1.5M.
# The data is already clean, so we will not do any further cleaning here.

In [None]:
# simple plots to checkout data
plt.hist(x=df['Price'])
plt.show()

In [None]:
sns.lmplot(x='Avg. Area Income',y='Price',data=df)
plt.show(block=True)

In [None]:
df.select_dtypes(include='number').corr()

In [None]:
# df.drop('Address',axis=1,inplace=True)
df_numeric = df.select_dtypes(include='number')

In [None]:
# Correlation matrix of the numeric columns (plotted as a heatmap below)
df_corr=df_numeric.corr()
df_corr

In [None]:
# The diagonal shows a correlation of 1: each column is perfectly correlated with itself.
# Many cells are dark, which indicates low correlation between those column pairs.
sns.heatmap(df_corr,annot=True)
plt.show(block=True)

In [None]:
sns.lmplot(data=df,x='Price',y='Avg. Area Income')
# plt.show(block=True)

## 4. 🧩 Feature Selection

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# The Address column was excluded earlier because it is non-numeric
# (df_numeric = df.select_dtypes(include='number'))
df_numeric.head()

In [None]:
df_numeric.columns

In [None]:
X=df_numeric[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population']]

In [None]:
X.head()

In [None]:
y=df_numeric['Price']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:
from sklearn.linear_model import LinearRegression

## 5. 🧠 Model Building

We now train a linear regression model using scikit-learn.
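
As a rough illustration of what `LinearRegression` does when it is fitted, the sketch below solves the same ordinary least squares problem with NumPy on a small synthetic dataset. The data and the names `true_coef`, `true_intercept`, `X_demo`, and `y_demo` are assumptions made up for this example; none of them come from `USA_Housing.csv`.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the housing dataset)
rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 2))
true_coef = np.array([3.0, -2.0])   # hypothetical coefficients
true_intercept = 5.0                # hypothetical intercept
noise = rng.normal(scale=0.1, size=n_samples)
y_demo = X_demo @ true_coef + true_intercept + noise

# Prepend a column of ones so the intercept is estimated with the coefficients
X_design = np.column_stack([np.ones(n_samples), X_demo])

# Ordinary least squares: minimise the sum of squared residuals
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
intercept_hat, coef_hat = beta[0], beta[1:]

print("intercept:", round(float(intercept_hat), 2))
print("coefficients:", np.round(coef_hat, 2))
```

The recovered intercept and coefficients should land close to the hypothetical values used to generate the data, which is the same kind of estimate that `intercept_` and `coef_` expose after fitting.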

In [None]:
linear_model=LinearRegression()

In [None]:
linear_model.fit(X_train,y_train)

In [None]:
# Inspect the fitted model: intercept first, then coefficients
print(linear_model.intercept_)

In [None]:
linear_model.coef_

In [None]:
pd.DataFrame(linear_model.coef_.T,X_train.columns,columns=['Coeff'])
# Holding the other features fixed, a one-unit increase in Avg. Area House Age
# is associated with an increase of 165221.119872 in the predicted Price
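
The interpretation of a coefficient can be checked numerically. The sketch below uses a hypothetical two-feature linear model (the `intercept` and `coef` values are made up, not the fitted values from this notebook) and shows that raising one feature by one unit, with the other held fixed, changes the prediction by exactly that feature's coefficient.

```python
import numpy as np

# Hypothetical fitted parameters (made-up values for illustration)
intercept = 10.0
coef = np.array([2.0, 0.5])

def predict_price(features):
    """Linear prediction: intercept + dot(coef, features)."""
    return intercept + float(np.dot(coef, features))

base = np.array([4.0, 6.0])
bumped = base + np.array([1.0, 0.0])  # raise only the first feature by one unit

change = predict_price(bumped) - predict_price(base)
print("change in prediction:", change)
print("first coefficient:   ", coef[0])
```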

In [None]:
# Prediction
predict = linear_model.predict(X_test)

In [None]:
predict

## 6. 📈 Model Evaluation

We evaluate the model using RMSE, R² Score, and residual analysis.

In [None]:
# Evaluation
plt.scatter(y_test,predict)
plt.xlabel('Y Test (Target)')
plt.ylabel('Predicted Y')
plt.show(block=True)

**Residual** = $ \text{Actual} - \text{Predicted} = y_{\text{test}} - \hat{y} $
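
A minimal sketch of this definition, using made-up actual and predicted values rather than the notebook's test set: a positive residual means the model underpredicted, and a negative one means it overpredicted.

```python
import numpy as np

# Made-up values for illustration (not taken from y_test or predict)
actual = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 195.0, 260.0])

# Residual = actual - predicted
residuals = actual - predicted

print("residuals:", residuals)
print("mean residual:", residuals.mean())
```

A mean residual close to zero, together with a roughly symmetric histogram, is what a well-behaved linear fit is expected to produce.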

In [None]:
sns.histplot((y_test - predict), bins=50, kde=True)
plt.show(block=True)

In [None]:
sns.histplot((y_test - predict), bins=50, kde=True)
plt.title("Distribution of Residuals (Actual - Predicted)")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show(block=True)

In [None]:
sns.set_style("whitegrid")
plt.figure(figsize=(8,6))

sns.lineplot(data=y_test, label="Actual Price" , color="green" ,linestyle="--" ,linewidth=2,markers="o")
sns.lineplot(data=y_pred_sorted,label="Predicted Price", color="blue",linewidth=2, linestyle="-",markers="s")
plt.title('Actual vs Predicted Car Prices (Line Chart)')


In [None]:
# Sort actual and predicted for better comparison
y_test_sorted, y_pred_sorted = zip(*sorted(zip(y_test, predict)))

sns.set_style("whitegrid")
plt.figure(figsize=(10, 6))
sns.lineplot(data=y_test_sorted, label="Actual Price", color="green", linestyle="--", linewidth=2)
sns.lineplot(data=y_pred_sorted, label="Predicted Price", color="blue", linestyle="-", linewidth=2)
plt.title("Linear Regression: Actual vs Predicted House Prices")
plt.xlabel("Sample Index")
plt.ylabel("Price")
plt.legend()
plt.tight_layout()
plt.show(block=True)

In [None]:
# Visually compare the actual vs predicted prices for the first 20 rows of the test data
# to examine how well the model performs on unseen data.
n = 20
indices = np.arange(n)
bar_width = 0.35

plt.figure(figsize=(12, 6))
plt.bar(indices, y_test.iloc[:n], width=bar_width, label='Actual Price', color='steelblue')
plt.bar(indices + bar_width, predict[:n], width=bar_width, label='Predicted Price', color='coral')
plt.xlabel('Sample Index')
plt.ylabel('Price')
plt.title('Linear Regression: Actual vs Predicted Prices (First 20 Samples)')
plt.xticks(indices + bar_width / 2, labels=[str(i) for i in range(n)])
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show(block=True)

## Insights from the Bar Chart
### 1. Model Performance is Generally Reasonable
Most coral bars (predicted prices) are close to the steel-blue bars (actual prices), which suggests the linear regression model has learned the general pricing pattern.

### 2. Underprediction Trend
In several samples, for example:

* Sample 2
* Sample 8
* Sample 13
* Sample 18

the predicted price is lower than the actual price.

This could mean the model slightly underpredicts higher-priced homes. Linear regression can sometimes struggle with extreme values or skewed distributions.

### 3. Low-Price Range Still Varies
At the lower end (e.g., samples 5, 7, and 16), the prediction is again lower than the actual price, though the margin is smaller.

This might mean the model underfits slightly or does not fully capture nonlinear influences (such as area or room count).

In [None]:
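r2 = r2_score(y_test, predict)
mse = mean_squared_error(y_test, predict)
rmse = np.sqrt(mse)  # RMSE is expressed in the same units as Price

print(f"R² Score: {r2:.2f}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")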
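
To make the metrics less of a black box, the sketch below computes MSE, RMSE, and R² by hand with NumPy. The arrays `actual_demo` and `predicted_demo` hold made-up numbers for illustration and are not the notebook's test data.

```python
import numpy as np

# Made-up values for illustration
actual_demo = np.array([3.0, 5.0, 7.0, 9.0])
predicted_demo = np.array([2.5, 5.5, 7.0, 8.0])

residuals_demo = actual_demo - predicted_demo

# Mean squared error and its square root
mse_demo = np.mean(residuals_demo ** 2)
rmse_demo = np.sqrt(mse_demo)

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals_demo ** 2)
ss_tot = np.sum((actual_demo - actual_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse_demo:.3f}")
print(f"RMSE: {rmse_demo:.3f}")
print(f"R²:   {r2_demo:.3f}")
```

RMSE is often easier to read than MSE because it is in the same units as the target, while R² reports the share of variance in the target that the predictions account for.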

## Model Improvement
For a better result we are going to:
1. Log-transform the target variable `Price`. Log transformation helps stabilize variance and normalize skewed price distributions.
2. Add an interaction feature: `Avg. Area Number of Rooms × Avg. Area Income`.
3. Drop the most expensive houses (prices at or above the 99th percentile) as outliers.
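
The sketch below illustrates the log transform and the interaction feature on a few made-up numbers (the arrays `prices_demo`, `rooms_demo`, and `income_demo` are assumptions, not rows from the dataset). It shows that `np.exp` undoes `np.log`, and that the interaction feature is just an element-wise product.

```python
import numpy as np

# Made-up prices for illustration
prices_demo = np.array([250_000.0, 1_000_000.0, 2_400_000.0])

# Log transform compresses large values; exp maps them back to dollars
log_prices = np.log(prices_demo)
recovered = np.exp(log_prices)

# Interaction feature: element-wise product of two columns
rooms_demo = np.array([5.0, 6.5, 8.0])
income_demo = np.array([50_000.0, 68_000.0, 90_000.0])
interaction_demo = rooms_demo * income_demo

print("log prices:", np.round(log_prices, 3))
print("round trip matches:", np.allclose(recovered, prices_demo))
print("interaction:", interaction_demo)
```

A model fitted on the log of the price predicts log prices, so its output has to be passed through `np.exp` before it is compared with prices in dollars.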

In [None]:
df['Rooms_Income_Interaction'] = (
    df['Avg. Area Number of Rooms'] * df['Avg. Area Income']
)

In [None]:
# Histogram of original prices
sns.histplot(df['Price'], bins=50, kde=True)
plt.title('Distribution of House Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show(block=True)

In [None]:

# Boxplot to visualize outliers
sns.boxplot(x=df['Price'])
plt.title('Boxplot of House Prices')
plt.show(block=True)

In [None]:
# Optional outlier trimming: keep only rows below the 99th percentile of Price
# (rows are dropped, not clamped to the cap)
price_cap = df['Price'].quantile(0.99)
df = df[df['Price'] < price_cap]
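
The filter `df[df['Price'] < price_cap]` removes rows, which is usually called trimming. Capping (also called winsorizing) keeps every row and clamps large values to the threshold instead. The sketch below contrasts the two on made-up values; `values_demo` is an assumption for illustration only.

```python
import numpy as np

# Made-up values 1..100 for illustration
values_demo = np.arange(1.0, 101.0)

# Threshold at the 99th percentile
cap_demo = np.quantile(values_demo, 0.99)

# Trimming: drop everything at or above the threshold (what the cell above does)
trimmed = values_demo[values_demo < cap_demo]

# Capping: keep every value but clamp large ones to the threshold
capped = np.minimum(values_demo, cap_demo)

print("threshold:", cap_demo)
print("trimmed length:", len(trimmed), "max:", trimmed.max())
print("capped length: ", len(capped), "max:", capped.max())
```

Trimming shrinks the dataset slightly, while capping preserves the row count at the cost of altering the largest values; which one is preferable depends on whether the extreme rows are considered valid observations.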


In [None]:
# Histogram of prices after removing the top 1%
sns.histplot(df['Price'], bins=50, kde=True)
plt.title('Distribution of House Prices (Top 1% Removed)')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show(block=True)

In [None]:
X = df[['Avg. Area Income', 'Avg. Area House Age', 
        'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 
        'Area Population', 'Rooms_Income_Interaction']]

In [None]:
y = np.log(df['Price'])

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

In [None]:

# Train model
model = LinearRegression()
model.fit(X_train, y_train)


In [None]:
# Predict and reverse log
y_pred_log = model.predict(X_test)

In [None]:
y_pred = np.exp(y_pred_log)

In [None]:
y_test_original = np.exp(y_test)

In [None]:
# Evaluation
print("R² Score:", r2_score(y_test_original, y_pred))
print("MSE:", mean_squared_error(y_test_original, y_pred))

In [None]:
n = 20
indices = np.arange(n)
bar_width = 0.35

plt.figure(figsize=(12, 6))
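# y_test now holds log prices, so compare it with the new model's log-scale predictions
# (the old `predict` array belongs to the first model and a different test split)
plt.bar(indices, y_test.iloc[:n], width=bar_width, label='Actual log(Price)', color='steelblue')
plt.bar(indices + bar_width, y_pred_log[:n], width=bar_width, label='Predicted log(Price)', color='coral')
plt.xlabel('Sample Index')
plt.ylabel('log(Price)')
plt.title('Linear Regression (Log-Transformed): Actual vs Predicted log(Price) (First 20 Samples)')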
plt.xticks(indices + bar_width / 2, labels=[str(i) for i in range(n)])
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show(block=True)

In [None]:
n = 20
indices = np.arange(n)
bar_width = 0.35

# Take first 20 actual and predicted prices
actual_prices = y_test_original[:n]
predicted_prices = y_pred[:n]

# Plot
plt.figure(figsize=(12, 6))
plt.bar(indices, actual_prices, width=bar_width, label='Actual Price', color='steelblue')
plt.bar(indices + bar_width, predicted_prices, width=bar_width, label='Predicted Price', color='coral')
plt.xlabel('Sample Index')
plt.ylabel('Price (in USD)')
plt.title('Linear Regression (Log-Transformed): Actual vs Predicted Prices (First 20 Samples)')
plt.xticks(indices + bar_width / 2, labels=[str(i) for i in range(n)])
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

## Insights from the Bar Chart After Log Transformation, Outlier Removal, and the Interaction Feature
### 1. Overall Fit is Better Now
Compared to the previous (non-transformed) linear regression model, the predictions are:

* Closer to the actual values
* Less under- or over-predicted at the extremes

**This suggests that log transformation, outlier removal, and the interaction feature together improved the model's generalization.**

### 2. Still Slight Overprediction on Some Samples
For example, samples 14 and 16 have slightly higher predictions than the actual values.

These could be homes with some localized issue or features not captured by our existing variables.

But overall, the margin is small and to be expected.

### 3. Balanced Error Spread
Both overpredictions and underpredictions are present across the 20 samples.

This balance suggests our model is not systematically biased toward over- or under-estimation.

## Saving the Model

In [None]:
import joblib

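# Save the baseline model (`linear_model`: five raw features, predicts Price directly).
# The log-transformed model is stored in `model` and is not saved here.
joblib.dump(linear_model, 'house_price_model.pkl')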



## 7. ✅ Conclusion

This notebook shows how basic regression can effectively model relationships between housing features and price.  
We also built a companion Flask app to allow users to interact with the model using a web interface, enabling real-time predictions.

Key takeaways:
- `Avg. Area Income` is highly predictive of house price.
- The residual distribution and the evaluation metrics indicate a reasonable model fit.
- Flask integration brings data science to life with interactivity.
