# House price prediction

## 1. Introduction

### 1.1 Problem statement

House price estimation in the real estate market often lacks transparency and accuracy, which makes informed decisions difficult. Current methods rely on outdated data and manual evaluations, and they fail to capture the market's dynamic nature. A machine learning model can bridge this gap by using property features, location metrics, and historical data to deliver more accurate price predictions.

### 1.2 Objective

To design and implement a machine learning-based system that predicts house prices in Delhi, India with high accuracy.

## 2. Dataset
The dataset, sourced from Kaggle ([Kaggle link](https://www.kaggle.com/datasets/saipavansaketh/pune-house-data?select=Delhi+house+data.csv)), provides housing data with features relevant to price estimation.
##### Features Used:
1. **Area**: Total built-up area of the house (in sq. ft.).
2. **Number of bedrooms**: Total number of bedrooms in the house.
3. **Parking Space**: Number of available parking spaces.
##### Target:
**House price**, measured in INR (Indian Rupees).

## 3. Packages and imports

In [2]:
# for array computations and loading data
import numpy as np
import pandas as pd


# for building linear regression models and preparing data
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# helper function for plotting data (local module, currently disabled)
#from function_module import plot_data

## 4. Model evaluation and selection
### 4.1 Initial evaluation of the dataset
This step examines the data: it verifies that the input and target arrays have matching sizes and produces a preliminary visualization of each feature against the price.
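
The size verification described above can be sketched as a small helper. This is a minimal illustration, not part of the original notebook: the function name `check_inputs_and_targets` and the toy arrays are assumptions, and the values only stand in for (area, bedrooms, parking) and price.

```python
import numpy as np

def check_inputs_and_targets(x, y):
    """Summarize x and y; raise ValueError if their shapes are inconsistent."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 2:
        raise ValueError(f"expected 2-D inputs, got shape {x.shape}")
    if x.shape[0] != y.shape[0]:
        raise ValueError(f"row mismatch: {x.shape[0]} inputs vs {y.shape[0]} targets")
    return {
        "n_samples": x.shape[0],
        "n_features": x.shape[1],
        "missing_inputs": int(np.isnan(x).sum()),   # count of NaN cells in x
        "missing_targets": int(np.isnan(y).sum()),  # count of NaN cells in y
    }

# Toy data (hypothetical values, NOT taken from the Delhi dataset)
x_demo = np.array([[1200.0, 3, 1], [850.0, 2, np.nan], [2000.0, 4, 2]])
y_demo = np.array([[9.5e6], [6.0e6], [1.8e7]])

summary = check_inputs_and_targets(x_demo, y_demo)
print(summary)
```

Counting missing values alongside the shapes is a design choice here: `pd.read_csv` is called with `na_values`, so empty fields may arrive as NaN and would otherwise surface only later, during model fitting.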

In [7]:
import matplotlib.pyplot as plt

# Data for plotting
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Create the plot
plt.plot(x, y, label="y = 2x", color="blue", marker="o")

# Adding title and labels
plt.title("Simple Line Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

# Adding a legend
plt.legend()

# Display the plot
plt.show()



ModuleNotFoundError: No module named 'matplotlib.backends.registry'
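
The plotting code in this cell is valid, so the error above most likely comes from the environment rather than from the code. One plausible cause, stated here as an assumption to verify, is an older or partially upgraded Matplotlib installation that does not ship the `matplotlib.backends.registry` module. The sketch below only checks whether the modules can be located; the helper name `module_available` is hypothetical.

```python
import importlib.util

def module_available(name):
    """Return True if the module `name` can be located in this environment.

    Note: locating a dotted name imports its parent packages.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A parent package is missing, or the name is malformed
        return False

for name in ("matplotlib", "matplotlib.backends.registry"):
    status = "found" if module_available(name) else "missing"
    print(f"{name}: {status}")
```

If `matplotlib` is found but the submodule is missing, reinstalling or upgrading Matplotlib in the notebook's environment is a reasonable first step.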

In [3]:
# Load the dataset from the csv file
data = pd.read_csv('./Data/Delhi house data.csv', delimiter=',', skiprows=1, na_values=[''])

# Split the inputs and outputs into separate arrays using .iloc for integer-based indexing
x = data.iloc[:, 0:3].values
y = data.iloc[:, 3].values

# Convert the 1-D target array into a 2-D column vector
y = np.expand_dims(y, axis=1)

# Display the first few rows of x and y to verify
#print(x[:5])
#print(y[:5])

print(f"the shape of the inputs x is: {x.shape}")
print(f"the shape of the targets y is: {y.shape}")

the shape of the inputs x is: (1258, 3)
the shape of the targets y is: (1258, 1)


In [4]:
# Data visualization
import matplotlib.pyplot as plt

# Feature names in column order (assumed to match the feature list in Section 2)
feature_names = ["Area", "Number of bedrooms", "Parking space"]

fig, axs = plt.subplots(1, 3, figsize=(10, 3))

# Loop through each feature in x and plot it against the price
for i in range(x.shape[1]):
    axs[i].scatter(x[:, i], y, marker='x', c='r')
    axs[i].set_title(f"Price vs {feature_names[i]}")
    axs[i].set_xlabel(feature_names[i])
    axs[i].set_ylabel("Price")
    axs[i].grid(True)

# Adjust layout for better spacing
plt.tight_layout()
plt.show()
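
As a numeric complement to the scatter plots, the Pearson correlation between each feature and the price gives a quick first impression of which inputs move with the target. This is a sketch under stated assumptions: the helper `feature_target_correlations`, the feature labels, and the toy data are illustrative, and the function assumes there are no missing values.

```python
import numpy as np

def feature_target_correlations(x, y, names):
    """Return {feature name: Pearson correlation with the target}."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).ravel()  # accept (n,) or (n, 1) targets
    return {
        name: float(np.corrcoef(x[:, i], y)[0, 1])
        for i, name in enumerate(names)
    }

# Toy data (hypothetical): price is exactly proportional to area
x_demo = np.array([[1000.0, 2, 1], [1500.0, 3, 1], [2000.0, 3, 2], [2500.0, 4, 2]])
y_demo = 5000.0 * x_demo[:, 0]

corr = feature_target_correlations(x_demo, y_demo, ["Area", "Bedrooms", "Parking"])
for name, value in corr.items():
    print(f"{name}: {value:.3f}")
```

Correlation only captures linear association, so a low value does not rule out a useful nonlinear relationship; the scatter plots remain the better guide for that.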

### 4.2 Split the dataset into training, cross validation, and test sets
* ***training set*** - utilized to train the model.
* ***cross validation set*** - used to assess various models.
* ***test set*** - provides an unbiased estimate of the model's performance on unseen data.

Split the entire dataset into 60% training, 20% cross validation, and 20% test.

In [8]:
# Get 60% of the dataset as the training set. Put the remaining 40% in temporary variables: x_ and y_.
x_train, x_, y_train, y_ = train_test_split(x, y, test_size=0.40, random_state=1)

# Split the 40% subset above into two: one half for cross validation and the other for the test set
x_cv, x_test, y_cv, y_test = train_test_split(x_, y_, test_size=0.50, random_state=1)

# Delete temporary variables
del x_, y_

print(f"the shape of the training set (input) is: {x_train.shape}")
print(f"the shape of the training set (target) is: {y_train.shape}\n")
print(f"the shape of the cross validation set (input) is: {x_cv.shape}")
print(f"the shape of the cross validation set (target) is: {y_cv.shape}\n")
print(f"the shape of the test set (input) is: {x_test.shape}")
print(f"the shape of the test set (target) is: {y_test.shape}")

the shape of the training set (input) is: (754, 3)
the shape of the training set (target) is: (754, 1)

the shape of the cross validation set (input) is: (252, 3)
the shape of the cross validation set (target) is: (252, 1)

the shape of the test set (input) is: (252, 3)
the shape of the test set (target) is: (252, 1)
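
To illustrate how the three sets are typically used together, the sketch below fits polynomial models of several degrees on the training set, picks the degree with the lowest cross validation error, and reports the test error once for the chosen model. It is an illustration only: the data is synthetic (not the Delhi dataset), the `_demo` names and the candidate degrees are assumptions, and the pipeline is one possible arrangement of the classes imported in Section 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: price is quadratic in area, plus noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(500, 3000, size=(300, 1))
y_demo = 2.0 * x_demo[:, 0] ** 2 + rng.normal(0, 1e5, size=300)

# Same 60/20/20 split as in the cell above
x_train_demo, x_tmp, y_train_demo, y_tmp = train_test_split(
    x_demo, y_demo, test_size=0.40, random_state=1)
x_cv_demo, x_test_demo, y_cv_demo, y_test_demo = train_test_split(
    x_tmp, y_tmp, test_size=0.50, random_state=1)

models, cv_mse = {}, {}
for degree in (1, 2, 3):
    # Expand features, scale them, then fit a linear model
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(x_train_demo, y_train_demo)          # training set: fit parameters
    models[degree] = model
    cv_mse[degree] = mean_squared_error(           # CV set: compare candidates
        y_cv_demo, model.predict(x_cv_demo))

best_degree = min(cv_mse, key=cv_mse.get)
test_mse = mean_squared_error(                     # test set: final estimate only
    y_test_demo, models[best_degree].predict(x_test_demo))

print(f"CV MSE per degree: {cv_mse}")
print(f"selected degree: {best_degree}, test MSE: {test_mse:.3e}")
```

Keeping the test set out of the selection loop is what makes its error an unbiased estimate: the cross validation set has already influenced the choice of degree, so its error is optimistic for the selected model.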
