# **Decoding Real Estate Trends: Exploring Regression Models for House Price Prediction**

## **Introduction**

House prices are influenced by a wide array of factors, from physical attributes like square footage to seemingly obscure details like proximity to railroads. Understanding the nuances behind these price variations is crucial for developing effective predictive models. This project explores the Kaggle competition dataset, "House Prices – Advanced Regression Techniques," to tackle the challenge of predicting sales prices using regression methods.

The aim of this project is to provide hands-on experience in regression analysis, with a particular focus on linear regression. By analyzing the Ames Housing dataset, which contains 79 explanatory variables describing residential homes, the intricate relationships between these features and house prices are explored. Through creative feature engineering and experimentation with different regression techniques, models are developed to deliver accurate predictions while minimizing errors.

Beyond technical skill development, this project offers an opportunity to gain a deeper understanding of the pre-processing steps required for regression problems. Various transformations are applied, features are selected or engineered, and models are refined to improve prediction performance. This process highlights the real-world applications of regression in domains such as economics and real estate.

## **Data**

The dataset used in this project is the "House Prices - Advanced Regression Techniques" dataset, sourced from Kaggle. It contains extensive information about residential homes in Ames, Iowa, and includes 79 explanatory variables that describe various aspects influencing house prices. Below are key columns from the dataset:

- **Id:** Unique identifier for each house
- **MSSubClass:** The type of dwelling involved in the sale
- **MSZoning:** The general zoning classification
- **LotArea:** Lot size in square feet
- **OverallQual:** Overall material and finish quality
- **OverallCond:** Overall condition rating
- **YearBuilt:** Original construction date
- **YearRemodAdd:** Remodel date
- **Exterior1st:** Exterior covering on the house
- **Exterior2nd:** Exterior covering (if more than one material)
- **SalePrice:** The property's sale price (target variable)

The dataset also includes other variables related to building features, neighborhood information, and house condition, offering ample opportunities for feature engineering and regression modeling.

This data is publicly available at: [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview).

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.impute import KNNImputer

# Load training data
train_df = pd.read_csv("train.csv")

# Load test data
test_df = pd.read_csv("test.csv")

# Display first few rows
train_df.head()

print(train_df.isnull().sum().sort_values(ascending=False).head(20))  # Top missing values
```

```text
PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
MasVnrType       872
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtFinType2      38
BsmtExposure      38
BsmtQual          37
BsmtCond          37
BsmtFinType1      37
MasVnrArea         8
Electrical         1
Id                 0
dtype: int64
```


### Data Pre-Processing

Prior to data analysis, several preprocessing steps were undertaken to prepare the data for effective modeling:

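- **Handling Missing Values:** Missing values in the training and test datasets were imputed using `KNNImputer`, which replaces each missing value with the average of that feature across the five nearest neighbors. In the code this step runs after one-hot encoding, because the imputer needs purely numeric input, and the imputer is fitted on the training data before being applied to the test data. This approach can give a more informed imputation than simpler methods like mean replacement.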

- **Encoding Categorical Variables:** Categorical features were transformed into numerical representations using one-hot encoding with `pd.get_dummies`, which creates a binary column for each category. With the default settings, a missing categorical value is encoded as zeros in all of that feature's columns, so it is not imputed later. To ensure consistency, the test dataset was then aligned to the training columns so that both contained the same set of features.

- **Creating Polynomial Features:** Polynomial transformations were applied to the data to capture interaction terms and nonlinear relationships between features. A second-degree polynomial expansion was implemented, creating additional features that could enhance the predictive power of regression models.

- **Standardizing Features:** All columns of the expanded feature matrix, including the one-hot columns and their polynomial terms, were standardized using `StandardScaler`. The scaler was fitted on the training data, so each non-constant training feature has a mean of 0 and a standard deviation of 1, which removes discrepancies in scale between features.

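- **Splitting Data into Training and Validation Sets:** The preprocessed training data was further divided into training and validation subsets using an 80-20 split. This allowed model performance to be evaluated on rows held out from model fitting before moving forward with predictions. Because the imputer and scaler were fitted before the split, the validation rows still influenced the preprocessing, so the validation error may be slightly optimistic.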
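The encoding, alignment, and imputation steps are easier to follow on a tiny example. The sketch below uses two hypothetical toy frames (not the Ames data) and makes one assumption that differs from the notebook: a category that never appears in the test set is filled with zeros via `reindex(..., fill_value=0.0)` instead of being left for the imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frames standing in for the train and test features.
train = pd.DataFrame({
    "LotArea": [8000.0, 9500.0, np.nan, 12000.0],
    "MSZoning": ["RL", "RM", "RL", "FV"],
})
test = pd.DataFrame({
    "LotArea": [10000.0, np.nan],
    "MSZoning": ["RL", "RM"],  # "FV" never appears in the test frame
})

# One-hot encode each frame separately, as the notebook does.
train_enc = pd.get_dummies(train, dtype=float)
test_enc = pd.get_dummies(test, dtype=float)

# Give the test frame exactly the training columns. A category missing from
# the test frame becomes a column of zeros, meaning "not this category".
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0.0)

# Fit the imputer on the training rows only, then apply it to both frames.
imputer = KNNImputer(n_neighbors=2)
train_imp = pd.DataFrame(imputer.fit_transform(train_enc), columns=train_enc.columns)
test_imp = pd.DataFrame(imputer.transform(test_enc), columns=train_enc.columns)

print(test_imp)
```

After this, both frames share the same columns and contain no missing values, and the only imputed entries are the numeric `LotArea` gaps.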

These preprocessing steps prepared the dataset for regression modeling by addressing missing values, ensuring feature consistency between the training and test sets, and putting all features on a common scale.

```python
# Separate features and target variable for training
X_train = train_df.drop(columns=['Id', 'SalePrice'], errors='ignore')
y_train = train_df['SalePrice']

# Prepare test data (the test set has no 'SalePrice' column, so only 'Id' is dropped)
X_test = test_df.drop(columns=['Id'], errors='ignore')

# Perform one-hot encoding on both training and test sets
# (a missing categorical value becomes zeros in all of that feature's columns)
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

# Align test set with train set (ensure they have the same columns).
# Categories absent from the test set become all-NaN columns here and are
# filled by the imputer below rather than being set to 0.
X_train, X_test = X_train.align(X_test, join='left', axis=1)

# Handle missing values using KNN Imputer (instead of SimpleImputer)
imputer = KNNImputer(n_neighbors=5)
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Apply Polynomial Features to capture interaction terms (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Standardize the features (scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_poly)
X_test_scaled = scaler.transform(X_test_poly)

# Split training data into training and validation sets
X_train_sub, X_valid, y_train_sub, y_valid = train_test_split(X_train_scaled, y_train, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train_sub, y_train_sub)

# Validate the model
y_pred = model.predict(X_valid)
mae = mean_absolute_error(y_valid, y_pred)
print("Validation MAE:", mae)

# Perform cross-validation to get a more reliable estimate of model performance.
# The imputer and scaler were already fitted on all training rows above,
# so the folds share those preprocessing statistics.
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='neg_mean_absolute_error')
print("Cross-validation MAE:", -cv_scores.mean())

# Make predictions on the test set (the model was fitted on the 80% training subset)
y_test_pred = model.predict(X_test_scaled)

# Create a submission file
submission = pd.DataFrame({'Id': test_df['Id'], 'SalePrice': y_test_pred})
submission.to_csv('submission.csv', index=False)

print("Submission file saved!")
```

```text
Validation MAE: 19123.809036286973
Cross-validation MAE: 24918.387121881035
Submission file saved!
```


### Linear Regression Model: Explanation and Evaluation

The linear regression model serves as a foundational approach for predicting house prices in this project. It aims to model the relationship between the explanatory variables (features) and the target variable (`SalePrice`) by fitting a linear equation to the data.
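In symbols, using notation introduced here only for illustration ($x_{ij}$ for feature $j$ of house $i$, $y_i$ for its sale price, and $\beta_j$ for the coefficients), the fitted equation and the least-squares criterion used to choose it can be written as:

$$
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

Here $n$ is the number of training houses and $p$ counts the columns after preprocessing, so squared and interaction terms enter as additional $x_{ij}$ while the model remains linear in the coefficients.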

#### **Model Overview**
Linear regression operates under the assumption that the target variable has a linear relationship with the predictors. The model minimizes the sum of squared residuals (the differences between observed and predicted values) to find the best-fit line. For this project, the following steps were implemented:

- Polynomial transformations were applied to capture nonlinear relationships and interaction terms.
- Features were standardized using `StandardScaler` so that the original and polynomial terms sit on comparable scales, which helps numerical stability.
- The model was trained on the processed data and validated on an 80-20 train-validation split.
- Cross-validation was employed to assess the robustness of the model across multiple folds.
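One way to run the same steps so that each cross-validation fold fits its own imputer, polynomial expansion, and scaler is to wrap them in a scikit-learn pipeline. The following is a minimal sketch under stated assumptions, not the notebook's code: it uses synthetic data from `make_regression` in place of the encoded housing features, and the 5% missing rate is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the encoded housing features (not the Ames data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5% of the entries

# Same step order as the notebook: impute, expand, scale, then fit.
pipeline = make_pipeline(
    KNNImputer(n_neighbors=5),
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)

# Each fold refits every step on its own training portion only.
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_absolute_error")
cv_mae = -cv_scores.mean()
print("Cross-validation MAE:", cv_mae)
```

Because the preprocessing lives inside the estimator passed to `cross_val_score`, the held-out rows of each fold never contribute to the imputation neighbors or the scaling statistics.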

#### **Evaluation**
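The model achieved a Kaggle submission score of **0.16427**. Kaggle scores this competition by the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price, so errors on cheap and expensive houses count equally, and lower values indicate better accuracy. This score is therefore on a different scale from the validation and cross-validation MAE printed above, which are measured in dollars. While the score is a reasonable starting point, it leaves room for improvement in model performance.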
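To make the leaderboard metric concrete, the sketch below computes the RMSE between log prices using only the standard library. The function name `rmse_of_logs` and the example prices are hypothetical, and the function assumes all prices are strictly positive (a plain linear regression on raw prices does not guarantee that).

```python
import math

def rmse_of_logs(y_true, y_pred):
    """Root-mean-squared error between log(actual) and log(predicted) prices."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [
        (math.log(pred) - math.log(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices: every prediction overshoots by 10%.
actual = [100_000.0, 200_000.0, 300_000.0]
predicted = [price * 1.10 for price in actual]

score = rmse_of_logs(actual, predicted)
print(round(score, 4))
```

A uniform 10% overshoot gives the same log error, `log(1.10)`, for every house regardless of its price, which is why this metric treats relative rather than absolute errors as equal.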

#### **Performance Analysis**
- **Strengths:**
  - The simplicity of linear regression makes it interpretable and computationally efficient.
  - By incorporating second-degree polynomial terms, the model can capture interactions and curvature that a linear regression on the raw features cannot.

- **Limitations:**
  - Linear regression may struggle with capturing highly nonlinear relationships inherent in the dataset.
  - The assumption of a linear relationship between features and the target variable might not hold true for all predictors.
  - Outliers in the data can influence the model significantly, potentially skewing predictions.
  
#### **Opportunities for Improvement**
To enhance the model's performance:
- Experiment with advanced regression techniques such as Ridge, Lasso, or ElasticNet, which can regularize the model and prevent overfitting.
- Implement tree-based ensemble methods like Random Forest or Gradient Boosting, as these methods handle nonlinearity and interactions naturally.
- Perform feature selection to remove irrelevant or redundant features, improving the model's ability to generalize.
- Consider hyperparameter tuning for the polynomial degree and the regularization parameters in advanced models.
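As an illustration of the first and last suggestions, the sketch below tunes the polynomial degree and the Ridge regularization strength together with a grid search. It is a sketch under assumptions, not a result from this project: the data is synthetic (`make_regression`), and the grid values are arbitrary starting points.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the preprocessed housing features (not the Ames data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

pipeline = make_pipeline(
    PolynomialFeatures(include_bias=False),
    StandardScaler(),
    Ridge(),
)

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "polynomialfeatures__degree": [1, 2],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validation MAE:", -search.best_score_)
```

Swapping `Ridge` for `Lasso` or `ElasticNet` only changes the final step and the corresponding parameter names in the grid.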

The linear regression model provides valuable insights and serves as a solid baseline for the project. However, exploring and experimenting with alternative models and techniques is essential to achieve better accuracy and lower error on the Kaggle leaderboard.