
# 🏠 Exploratory Data Analysis & Price Prediction for Real Estate Housing Data

**Project Type:** Data Analytics / Machine Learning  
**Tools Used:** Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn  

---
This notebook presents a **complete end-to-end analysis** of housing price data including:
- Data understanding & cleaning  
- Exploratory Data Analysis (EDA)  
- Feature engineering  
- Visualization  
- Machine learning models  
- Clustering  
- Inferences, conclusions & recommendations  

This notebook is **submission-ready** for academic and professional evaluation.


## 1. Import Required Libraries

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

sns.set(style="whitegrid")
%matplotlib inline



## 2. Load Dataset

The dataset contains structural, qualitative, temporal, and amenity-related attributes of houses.
The target variable is **SalePrice**.


In [None]:

df = pd.read_csv("housing_data.csv")
df.head()



## 3. Data Overview & Understanding

We examine:
- Dataset shape  
- Data types  
- Missing values  
- Statistical summary  


In [None]:

# df.info() prints its report and returns None, so call each inspection separately
print(df.shape)
df.info()
df.describe()



## 4. Data Cleaning

### Actions Performed:
- Missing numerical values → filled using **median**
- Missing categorical values → filled using **mode**
- Duplicates checked


In [None]:

# Print explicitly: only the last expression in a cell is displayed automatically
print(df.isnull().sum().sort_values(ascending=False).head(15))

num_cols = df.select_dtypes(include=np.number).columns
cat_cols = df.select_dtypes(exclude=np.number).columns

df[num_cols] = df[num_cols].fillna(df[num_cols].median())
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

df.isnull().sum().max()
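The duplicate check listed among the cleaning actions is not shown in the cell above. A minimal sketch of one way to do it follows. The helper name `drop_exact_duplicates` and the tiny demo frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def drop_exact_duplicates(frame: pd.DataFrame) -> pd.DataFrame:
    """Report fully duplicated rows and drop them, keeping the first occurrence."""
    n_dupes = int(frame.duplicated().sum())
    print(f"Duplicate rows found: {n_dupes}")
    return frame.drop_duplicates().reset_index(drop=True)

# Tiny illustrative frame (not the housing data): the third row repeats the first.
demo = pd.DataFrame({
    "GrLivArea": [1500, 2000, 1500],
    "SalePrice": [200000, 300000, 200000],
})
deduped = drop_exact_duplicates(demo)
```

On the real data this would be called as `df = drop_exact_duplicates(df)`. If the dataset has a unique identifier column, exact-row duplicates will never match, so that column may need to be excluded from the comparison first.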



## 5. Univariate Analysis

### Objective:
Understand the distribution of house prices and identify skewness and outliers.


In [None]:

plt.figure(figsize=(8,5))
sns.histplot(df["SalePrice"], kde=True)
plt.title("Distribution of House Prices")
plt.show()



### Inference:
- Sale prices are **right-skewed**
- Presence of **high-value outliers**
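The visual impression of skewness and outliers can be backed with numbers. The sketch below computes sample skewness before and after a `log1p` transform and flags high values with the 1.5 × IQR rule. The helper names, the IQR threshold, and the synthetic log-normal prices are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def skew_report(prices: pd.Series) -> dict:
    """Sample skewness of the raw prices and of log1p(prices)."""
    return {
        "raw_skew": float(prices.skew()),
        "log_skew": float(np.log1p(prices).skew()),
    }

def iqr_outlier_mask(prices: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values above Q3 + k * IQR (high-side outliers only)."""
    q1, q3 = prices.quantile([0.25, 0.75])
    return prices > q3 + k * (q3 - q1)

# Synthetic right-skewed prices, NOT the housing data.
rng = np.random.default_rng(0)
demo_prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

report = skew_report(demo_prices)
high = iqr_outlier_mask(demo_prices)
print(report, "high-side outliers:", int(high.sum()))
```

On the notebook's data the same helpers would be applied to `df["SalePrice"]`. A log-skew much closer to zero than the raw skew suggests that modelling the log of the price may be worth trying.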



## 6. Bivariate Analysis

### Relationship between Living Area and Sale Price


In [None]:

plt.figure(figsize=(8,5))
sns.scatterplot(x="GrLivArea", y="SalePrice", data=df)
plt.title("Living Area vs Sale Price")
plt.show()



### Inference:
- Strong **positive correlation**
- Larger houses command higher prices



## 7. Correlation Analysis


In [None]:

plt.figure(figsize=(14,10))
sns.heatmap(df[num_cols].corr(), cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()



### Inference:
- SalePrice strongly correlates with:
  - OverallQual  
  - GrLivArea  
  - GarageCars  
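A heatmap without annotations makes it hard to read exact values. One way to check which columns correlate most with the target is to rank them directly, as sketched here. The helper `top_correlations` and the demo frame are hypothetical.

```python
import pandas as pd

def top_correlations(frame: pd.DataFrame, target: str = "SalePrice", n: int = 10) -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(n)

# Small illustrative frame (not the housing data); "Style" is non-numeric and is ignored.
demo = pd.DataFrame({
    "SalePrice": [100, 200, 300, 400],
    "GrLivArea": [10, 20, 30, 41],
    "Noise": [3, 1, 4, 1],
    "Style": list("abab"),
})
ranked = top_correlations(demo)
print(ranked)
```

Ranking by absolute value keeps strong negative correlations visible as well as positive ones.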



## 8. Feature Engineering

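New features are created to enhance model interpretability. `PricePerSqFt` is derived from the target `SalePrice`, so it is used for analysis only and is not among the model features in Section 10.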


In [None]:

df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
df["PricePerSqFt"] = df["SalePrice"] / df["GrLivArea"]

df[["HouseAge", "PricePerSqFt"]].head()



### Inference:
- New variables capture **depreciation** and **value efficiency**
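Both derived columns can misbehave on unusual rows: a sale recorded before the build year gives a negative age, and a zero living area gives an infinite price per square foot. The sketch below guards against both. Clipping negative ages to zero and replacing zero areas with `NaN` are assumptions about how such rows should be treated, and the helper name is hypothetical.

```python
import numpy as np
import pandas as pd

def add_engineered_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with HouseAge and PricePerSqFt, guarding edge cases."""
    out = frame.copy()
    # Assumption: a negative age is a data-entry quirk, so clip it to 0.
    out["HouseAge"] = (out["YrSold"] - out["YearBuilt"]).clip(lower=0)
    # Assumption: a zero area is invalid, so the ratio becomes NaN rather than inf.
    area = out["GrLivArea"].replace(0, np.nan)
    out["PricePerSqFt"] = out["SalePrice"] / area
    return out

# Illustrative rows (not the housing data): the second row triggers both guards.
demo = pd.DataFrame({
    "YrSold": [2008, 2008],
    "YearBuilt": [2000, 2009],
    "GrLivArea": [1600, 0],
    "SalePrice": [200000, 150000],
})
engineered = add_engineered_features(demo)
print(engineered[["HouseAge", "PricePerSqFt"]])
```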



## 9. Advanced Visualizations


In [None]:

sns.pairplot(df[["SalePrice","GrLivArea","OverallQual","GarageCars","HouseAge"]])
plt.show()

plt.figure(figsize=(8,5))
sns.violinplot(x="OverallQual", y="SalePrice", data=df)
plt.show()



### Inference:
- Quality has a **non-linear impact** on price
- Each step up in `OverallQual` is associated with a larger price increase than the step before, giving a roughly exponential shape



## 10. Price Prediction Models

Two models are used:
- Linear Regression  
- Random Forest Regressor  


In [None]:

features = ["GrLivArea","OverallQual","GarageCars","TotalBsmtSF","HouseAge"]
X = df[features]
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=200, random_state=42)

lr.fit(X_train, y_train)
rf.fit(X_train, y_train)

lr_preds = lr.predict(X_test)
rf_preds = rf.predict(X_test)



## 11. Model Evaluation


In [None]:

def evaluate(y_true, y_pred, model):
    print(model)
    print("MAE:", mean_absolute_error(y_true, y_pred))
    # The `squared` argument was removed in recent scikit-learn releases; take the root explicitly
    print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
    print("R2:", r2_score(y_true, y_pred))
    print()

evaluate(y_test, lr_preds, "Linear Regression")
evaluate(y_test, rf_preds, "Random Forest")



### Inference:
- Random Forest **outperforms** Linear Regression on the held-out 20% test split
- It captures **non-linear patterns** that a linear model cannot represent
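A single train/test split can favour one model by chance. A hedged way to check the comparison is k-fold cross-validation, sketched below on synthetic data from `make_regression`. The helper name, the fold count, and the smaller forest (50 trees, to keep the example fast) are assumptions. The synthetic data is linear by construction, so the scores say nothing about the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def compare_models_cv(X, y, n_splits=5, seed=42):
    """Mean and standard deviation of R2 across shuffled folds for both models."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(n_estimators=50, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
        results[name] = (float(scores.mean()), float(scores.std()))
    return results

# Synthetic regression problem, NOT the housing data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
cv_results = compare_models_cv(X_demo, y_demo)
for name, (mean_r2, std_r2) in cv_results.items():
    print(f"{name}: R2 = {mean_r2:.3f} +/- {std_r2:.3f}")
```

With the notebook's variables the call would be `compare_models_cv(X, y)`. If the gap between the two means is small relative to the fold-to-fold spread, the difference should not be called significant.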



## 12. Feature Importance (Random Forest)


In [None]:

pd.Series(rf.feature_importances_, index=features).sort_values().plot(kind="barh", figsize=(8,5))
plt.show()



### Inference:
- OverallQual & GrLivArea are the **strongest predictors**
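Impurity-based `feature_importances_` are computed on the training data and can overstate some features. Permutation importance on held-out data is a common cross-check. The sketch below uses scikit-learn's `permutation_importance` on synthetic data. The column names `signal`, `weak`, and `noise` are made up for the demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data, NOT the housing data: the target depends mostly on "signal".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
y_demo = 5.0 * X_demo["signal"] + 0.5 * X_demo["weak"] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the test set in turn and measure the drop in R2.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
perm_importance = pd.Series(result.importances_mean, index=X_demo.columns)
perm_importance = perm_importance.sort_values(ascending=False)
print(perm_importance)
```

For the notebook's model the equivalent call would be `permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=42)`.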



## 13. Clustering Analysis (Market Segmentation)


In [None]:

cluster_features = ["GrLivArea","GarageCars","Fireplaces","OverallQual"]
X_scaled = StandardScaler().fit_transform(df[cluster_features])

# n_init is set explicitly because its default differs across scikit-learn versions
df["Cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_scaled)

sns.scatterplot(x="GrLivArea", y="SalePrice", hue="Cluster", data=df)
plt.show()



### Inference:
- The four clusters separate houses along a spectrum from **budget** through **mid-range** to **premium** segments
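Segment names are easier to justify with a per-cluster summary than from the scatter plot alone. The sketch below fits K-Means on scaled features and tabulates cluster size, median price, and feature means. The helper `profile_clusters` and the two-group synthetic data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def profile_clusters(frame, cluster_features, target="SalePrice", k=4, seed=42):
    """Fit K-Means on scaled features and summarise each cluster, cheapest first."""
    X_scaled = StandardScaler().fit_transform(frame[cluster_features])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_scaled)
    out = frame.assign(Cluster=labels)
    summary = out.groupby("Cluster").agg(
        n_houses=(target, "size"),
        median_price=(target, "median"),
        **{f"mean_{col}": (col, "mean") for col in cluster_features},
    )
    return summary.sort_values("median_price")

# Synthetic data with two obvious groups, NOT the housing data.
rng = np.random.default_rng(0)
area = np.concatenate([rng.normal(1000, 50, 30), rng.normal(3000, 50, 30)])
qual = np.concatenate([rng.integers(3, 6, 30), rng.integers(8, 11, 30)])
demo = pd.DataFrame({
    "GrLivArea": area,
    "OverallQual": qual,
    "SalePrice": 100 * area + rng.normal(0, 5000, 60),
})
summary = profile_clusters(demo, ["GrLivArea", "OverallQual"], k=2)
print(summary)
```

With the notebook's data the call would be `profile_clusters(df, cluster_features, k=4)`. Sorting by median price makes it straightforward to label the rows from the lowest-priced segment to the highest.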



## 14. Conclusion

- House prices are strongly influenced by **size, quality, and amenities**
- Random Forest provides robust predictive performance
- Feature engineering improves interpretability



## 15. Recommendations

- Focus pricing strategy on **quality upgrades**
- Segment market using clustering
- Use ensemble ML models for pricing decisions



## 16. Limitations & Future Scope

### Limitations:
- No geospatial or economic indicators

### Future Scope:
- Include location coordinates
- Use XGBoost / LightGBM
- Integrate macroeconomic data
