<a href="https://colab.research.google.com/github/srijitt/house-price-prediction/blob/main/housePrice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **House Price Prediction**

In this notebook, I build my first machine learning project: a house price model using `RandomForestRegressor` from the scikit-learn package. It is a basic example of how an ML workflow fits together. The data is a ready-made dataset of house prices in India, downloaded from Kaggle.

*Link to the dataset: [here](https://www.kaggle.com/datasets/mohamedafsal007/house-price-dataset-of-india)*


Libraries used:
*   `Pandas`: for loading, organising and cleaning the dataset.
*   `Scikit-learn`: for the machine learning algorithm, the train/validation split and the evaluation metrics.



In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

First, we specify the path of the dataset and use the `read_csv()` function to read the file into a pandas DataFrame. Then we specify the *target*, i.e. the column we want to predict, which is `Price`. After that, we specify the features on which the model will train and predict. To keep the model basic, the features used to train it are:


*   Lot area
*   Living area
*   Condition of the house
*   Built year
*   Number of bedrooms
*   Number of bathrooms
*   Number of floors
*   Distance from the airport
*   Number of schools nearby

Having done that, we split the data into `training` and `validation` datasets using the `train_test_split()` function, keeping 75% of the rows for training.
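To see what the split does in isolation, here is a minimal sketch on a tiny made-up table. The rows and prices are invented for illustration and are not taken from the Kaggle dataset; only the `train_test_split()` call mirrors the one used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 invented rows, not from the real dataset
toy = pd.DataFrame({
    "lot area": [5000, 4200, 9100, 3000, 7600, 6100, 8800, 4500],
    "number of bedrooms": [3, 2, 5, 2, 4, 3, 4, 3],
    "Price": [500000, 410000, 930000, 300000, 760000, 600000, 870000, 450000],
})

toy_X = toy[["lot area", "number of bedrooms"]]
toy_y = toy["Price"]

# train_size=0.75 keeps 6 of the 8 rows for training and 2 for validation;
# random_state fixes the shuffle so the split is reproducible
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=1
)

print(toy_train_X.shape, toy_val_X.shape)  # (6, 2) (2, 2)
```

The features and the target are shuffled together, so each row of `toy_train_X` still lines up with its price in `toy_train_y`.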

In [10]:
# Load the dataset
data_path = 'House Price India.csv'
home_data = pd.read_csv(data_path)

# Specify the target (y) and the feature columns (X)
y = home_data['Price']

features = ['lot area', 'living area', 'condition of the house', 'Built Year',
            'number of bedrooms', 'number of bathrooms', 'number of floors',
            'Distance from the airport', 'Number of schools nearby']
X = home_data[features]
print(X.head())

# Hold out 25% of the rows for validation; random_state makes the split reproducible
train_X, val_X, train_y, val_y = train_test_split(X, y, train_size=0.75, random_state=1)

   lot area  living area  condition of the house  Built Year  \
0      9050         3650                       5        1921   
1      4000         2920                       5        1909   
2      9480         2910                       3        1939   
3     42998         3310                       3        2001   
4      4500         2710                       4        1929   

   number of bedrooms  number of bathrooms  number of floors  \
0                   5                 2.50               2.0   
1                   4                 2.50               1.5   
2                   5                 2.75               1.5   
3                   4                 2.50               2.0   
4                   3                 2.00               1.5   

   Distance from the airport  Number of schools nearby  
0                         58                         2  
1                         51                         2  
2                         53                         1  
3 

Once the training data is ready, we create the model and fit it to the training data with the `fit()` method. We then make predictions with the `predict()` method. *Remember: fit the model on the training data (`train_X`, `train_y`), but make predictions on the validation data (`val_X`), so the model is judged on rows it has not seen.*

In [11]:
# Train the model
model = RandomForestRegressor(random_state=1)
model.fit(train_X, train_y)

# Predict on the held-out validation features
predictions = model.predict(val_X)
print(predictions)

[ 337050.25 1047900.5   282245.3  ... 1614919.5   332273.01  630333.49]


Measuring error is an essential step in evaluating ML predictions. We therefore calculate the mean absolute error (`mean_absolute_error`) and the R2 score (`r2_score`) by comparing the predicted results with our validation target data, `val_y`.

In [12]:
# scikit-learn metrics expect (y_true, y_pred), in that order
mae = mean_absolute_error(val_y, predictions)
r2 = r2_score(val_y, predictions)

print("Validation MAE for Random Forest Model: {:,.0f}".format(mae))
print("Validation R2 score for Random Forest Model: ", r2)

Validation MAE for Random Forest Model: 141,627
Validation R2 score for Random Forest Model:  0.5871672568586559


R2 score criteria (a rough rule of thumb; what counts as a good score depends on the problem):

*   less than 0.3: weak
*   0.3 to 0.5: moderate
*   greater than 0.5: strong
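To make the two metrics concrete, here is a small sketch that computes them by hand in plain Python. The four actual and predicted values are invented for illustration. MAE is the average absolute difference between actual and predicted values, and R2 is one minus the ratio of the squared prediction error to the squared spread of the actual values around their mean, which are the quantities `mean_absolute_error` and `r2_score` report.

```python
# Hypothetical actual and predicted prices (made-up numbers)
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 380.0]
n = len(actual)

# Mean absolute error: average size of the prediction errors
mae_by_hand = sum(abs(a - p) for a, p in zip(actual, predicted)) / n

# R2: 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2_by_hand = 1 - ss_res / ss_tot

print("MAE:", mae_by_hand)             # MAE: 17.5
print("R2: {:.2f}".format(r2_by_hand))  # R2: 0.97
```

An R2 of 1 would mean perfect predictions, while 0 means the model does no better than always predicting the mean of the actual values.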



Since this is my very first ML project, here are some improvements I plan to work on as I delve deeper into machine learning:

*   Working on more concise model fitting
*   Improving accuracy
*   Adding more meaningful training data
*   Managing overfitting and underfitting issues
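As one possible starting point for the overfitting and underfitting item, the sketch below compares training and validation MAE for forests of different depths. It uses synthetic data from `make_regression` rather than the house price dataset, and the depths tried (2, 5 and unlimited) and `n_estimators=50` are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Kaggle house price dataset)
demo_X, demo_y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1)
tr_X, va_X, tr_y, va_y = train_test_split(demo_X, demo_y, train_size=0.75, random_state=1)

results = {}
for depth in (2, 5, None):  # None lets each tree grow until its leaves are pure
    forest = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=1)
    forest.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, forest.predict(tr_X))
    val_mae = mean_absolute_error(va_y, forest.predict(va_X))
    results[depth] = (train_mae, val_mae)
    print("max_depth={}: train MAE {:.1f}, validation MAE {:.1f}".format(depth, train_mae, val_mae))
```

A high error on both sets suggests underfitting, while a training error far below the validation error suggests overfitting. The same comparison could be run on `train_X`, `val_X`, `train_y` and `val_y` from this notebook.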



`End of Project`