
# House Price Prediction

## Overview
This project was developed as part of the **Large Scale Data Processing** coursework. It aims to predict house prices based on various attributes such as location, area, build year, and other features. The dataset is processed using Hadoop and analyzed using regression models like Ridge, Lasso, and Elastic Net.

## Objectives
1. Simulate dataset loading and processing using Hadoop.
2. Perform data preprocessing, including handling missing values and feature normalization.
3. Conduct exploratory data analysis (EDA) for data insights.
4. Implement regression models for predicting house prices.
5. Compare model performance using evaluation metrics like R-squared, RMSE, and MAE.
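
As a small illustration of the three metrics named in objective 5, the sketch below computes them by hand with NumPy. The helper name `regression_metrics` and the price values are made up for illustration and are not part of the project code.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R-squared, RMSE, and MAE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }

# Made-up prices (in thousands), for illustration only
metrics = regression_metrics([200, 150, 320, 410], [210, 140, 300, 400])
print(metrics)
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a rough sense of whether a few large misses dominate the error.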

---



## Hadoop Integration

The dataset is stored in the Hadoop Distributed File System (HDFS) so that it can be processed in a distributed, scalable way. This notebook only simulates the upload; the actual commands, shown as comments in the cell below, were executed on a Hadoop cluster.


In [None]:

# Simulate Hadoop dataset upload
print("Simulating Hadoop commands for dataset upload...")
# Command example (executed in actual Hadoop environment):
# hadoop fs -put train.csv /user/house-prices/train.csv
# hadoop fs -put test.csv /user/house-prices/test.csv
print("Dataset successfully uploaded to HDFS (simulation).")
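
For readers who want to issue the upload from Python rather than a shell, the following is a minimal sketch that wraps the same `hadoop fs -put` commands with `subprocess`. The helper names `hdfs_put_command` and `upload_to_hdfs` are hypothetical, and the upload is only attempted when the `hadoop` CLI is found on the PATH.

```python
import shutil
import subprocess

def hdfs_put_command(local_path, hdfs_path):
    """Build the argument list for `hadoop fs -put <local> <hdfs>`."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]

def upload_to_hdfs(local_path, hdfs_path):
    """Run the upload if the hadoop CLI is installed; otherwise only report it."""
    command = hdfs_put_command(local_path, hdfs_path)
    if shutil.which("hadoop") is None:
        print("hadoop CLI not found, skipping:", " ".join(command))
        return False
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return True

# Build (but do not run) the two commands used in this project
commands = [
    hdfs_put_command("train.csv", "/user/house-prices/train.csv"),
    hdfs_put_command("test.csv", "/user/house-prices/test.csv"),
]
for command in commands:
    print(" ".join(command))
```

Calling `upload_to_hdfs("train.csv", "/user/house-prices/train.csv")` on a machine with a configured Hadoop client would perform the actual upload.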



## Data Preprocessing

Data preprocessing includes:
- Handling missing values.
- Normalizing features.
- Encoding categorical variables.


In [None]:

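import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load datasets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Numeric feature columns. The target is excluded so that it is not scaled,
# and because test.csv has no SalePrice column.
numerical_features = train_data.select_dtypes(include='number').columns.drop('SalePrice')

# Handle missing values: impute numeric columns with the training means
train_means = train_data[numerical_features].mean()
train_data[numerical_features] = train_data[numerical_features].fillna(train_means)
test_data[numerical_features] = test_data[numerical_features].fillna(train_means)

# Feature normalization: fit the scaler on the training data only
scaler = StandardScaler()
train_data[numerical_features] = scaler.fit_transform(train_data[numerical_features])
test_data[numerical_features] = scaler.transform(test_data[numerical_features])

# Encode categorical variables as one-hot columns (a missing category becomes all zeros)
X = pd.get_dummies(train_data.drop(['SalePrice'], axis=1), dtype=float)
y = train_data['SalePrice']

# Split the training data into a fitting set and a held-out evaluation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)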
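
The three preprocessing steps can also be bundled into a single scikit-learn transformer, which makes it harder to accidentally fit imputation or scaling statistics on held-out data. The sketch below uses a tiny made-up frame; the column names are hypothetical and this is not necessarily how the coursework was implemented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame; the column names are hypothetical
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 11250.0],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "NoRidge"],
})

numeric_cols = ["LotArea", "YearBuilt"]
categorical_cols = ["Neighborhood"]

preprocessor = ColumnTransformer(
    [
        # Mean imputation followed by standardization for numeric columns
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # One-hot encoding; categories unseen during fit become all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocessor.fit_transform(toy)
print(features.shape)

# A new row with a missing value and an unseen category is still transformable
new_row = pd.DataFrame({
    "LotArea": [np.nan], "YearBuilt": [1990], "Neighborhood": ["Unknown"],
})
new_features = preprocessor.transform(new_row)
print(new_features)
```

Because the transformer is fitted once and then reused, the same training statistics are applied to any later data, including `test.csv`.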



## Exploratory Data Analysis (EDA)

This section visualizes the distribution of the target variable and the pairwise correlations between numeric columns.


In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize target variable
sns.histplot(train_data['SalePrice'], kde=True)
plt.title('Distribution of Sale Prices')
plt.show()

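# Correlation heatmap (numeric columns only; recent pandas versions raise an
# error if non-numeric columns are passed to corr())
correlation_matrix = train_data.corr(numeric_only=True)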
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
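
A heatmap of many columns can be hard to read, so a common follow-up is to rank features by the strength of their correlation with the target. The sketch below does this on synthetic data; the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; column names are hypothetical
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "GrLivArea": rng.uniform(50, 250, size=200),
    "YearBuilt": rng.integers(1950, 2020, size=200),
    "Noise": rng.normal(size=200),
    "Neighborhood": rng.choice(["A", "B"], size=200),  # non-numeric, ignored below
})
toy["SalePrice"] = 1000 * toy["GrLivArea"] + rng.normal(scale=5000, size=200)

# Correlation of every numeric feature with the target, strongest first
target_corr = (
    toy.corr(numeric_only=True)["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top as well.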



## Regression Models

We implement and evaluate the following regression models:
1. Ridge Regression
2. Lasso Regression
3. Elastic Net Regression
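
For reference, the three models differ only in the penalty added to the least-squares loss. The objectives below follow the scikit-learn formulation, with $n$ samples, feature matrix $X$, coefficient vector $w$, penalty strength $\alpha$ (`alpha`), and mixing parameter $\rho$ (`l1_ratio`); the symbols are introduced here for illustration.

$$
\begin{aligned}
\text{Ridge:} \quad & \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

The L1 term can drive coefficients exactly to zero, while the L2 term only shrinks them. Note that the Ridge loss is not divided by $n$, so the same numeric `alpha` does not imply the same penalty strength across the three estimators.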


In [None]:

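import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Ridge Regression (L2 penalty)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
ridge_predictions = ridge_model.predict(X_test)

# Evaluation on the held-out split. RMSE is computed as sqrt(MSE) because the
# `squared=False` argument is deprecated and was removed from recent scikit-learn releases.
ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge_predictions))
ridge_mae = mean_absolute_error(y_test, ridge_predictions)
ridge_r2 = r2_score(y_test, ridge_predictions)
print(f"Ridge Regression RMSE: {ridge_rmse:.4f}, MAE: {ridge_mae:.4f}, R2: {ridge_r2:.4f}")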


In [None]:

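import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Lasso Regression (L1 penalty)
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
lasso_predictions = lasso_model.predict(X_test)

# Evaluation on the held-out split (RMSE computed as sqrt(MSE))
lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso_predictions))
lasso_mae = mean_absolute_error(y_test, lasso_predictions)
lasso_r2 = r2_score(y_test, lasso_predictions)
print(f"Lasso Regression RMSE: {lasso_rmse:.4f}, MAE: {lasso_mae:.4f}, R2: {lasso_r2:.4f}")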


In [None]:

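import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Elastic Net Regression (mix of L1 and L2 penalties)
elastic_net_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net_model.fit(X_train, y_train)
elastic_net_predictions = elastic_net_model.predict(X_test)

# Evaluation on the held-out split (RMSE computed as sqrt(MSE))
elastic_net_rmse = np.sqrt(mean_squared_error(y_test, elastic_net_predictions))
elastic_net_mae = mean_absolute_error(y_test, elastic_net_predictions)
elastic_net_r2 = r2_score(y_test, elastic_net_predictions)
print(f"Elastic Net RMSE: {elastic_net_rmse:.4f}, MAE: {elastic_net_mae:.4f}, R2: {elastic_net_r2:.4f}")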
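
To compare the three models side by side, the per-model cells can be condensed into one loop that collects the metrics in a table. The sketch below runs on synthetic data from `make_regression` rather than the house-price dataset, so its numbers say nothing about the project's results; the `_demo` variable names are hypothetical and chosen to avoid clashing with the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the preprocessed house-price features
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same hyperparameters as in the cells above
models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        "model": name,
        "rmse": np.sqrt(mean_squared_error(y_te, pred)),
        "mae": mean_absolute_error(y_te, pred),
        "r2": r2_score(y_te, pred),
    })

# One row per model, best (lowest) RMSE first
comparison = pd.DataFrame(rows).set_index("model").sort_values("rmse")
print(comparison)
```

Replacing the synthetic split with the notebook's `X_train`, `X_test`, `y_train`, and `y_test` would produce the corresponding table for the actual data.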



## Conclusion

The project demonstrated the application of regression models for predicting house prices. The RMSE and R2 of each model on the held-out split are printed by the evaluation cells above and stored in the following variables:
- Ridge Regression: `ridge_rmse`, `ridge_r2`
- Lasso Regression: `lasso_rmse`, `lasso_r2`
- Elastic Net Regression: `elastic_net_rmse`, `elastic_net_r2`

This project highlights the importance of data preprocessing, model selection, and performance evaluation in real-world scenarios.
