# Bengaluru (India) House Price Prediction

**Dataset:** Bengaluru House Price Prediction

**Explanatory Variables:** a) Area type, b) Availability, c) Location, d) Size, e) Total sqft, f) Bath

**Response Variable:** House Price

**Research Question:** Is there a linear relationship between house price and the number of bathrooms (bath) and bedrooms (bhk)?




In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import seaborn as sns     
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv("../input/bengaluru-house-price-data/datasets_20710_26737_Bengaluru_House_Data.csv")
data.head()

In [None]:
data.info()

In [None]:
print('Shape of the dataset',data.shape)
print('\n')
print('Features of the dataset', data.columns)
print('\n')
print(data['area_type'].value_counts())

In [None]:
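# Drop rows with missing values; copy so the new column is added to an independent frame
data = data.dropna().copy()

# Extract the number of bedrooms from the leading number in `size`
data['bhk'] = data['size'].apply(lambda x: int(x.split(' ')[0]))

data.describe()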
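The `bhk` extraction above can be illustrated in isolation. This is a minimal sketch, assuming `size` values consist of a leading integer followed by a label; the sample strings below are hypothetical and are not read from the dataset.

```python
import pandas as pd

# Hypothetical sample values in the style of the dataset's `size` column.
sizes = pd.Series(["2 BHK", "4 Bedroom", "3 BHK", "1 RK"])

# Take the leading token of each string and convert it to an integer.
bhk = sizes.str.split(" ").str[0].astype(int)

print(bhk.tolist())  # [2, 4, 3, 1]
```

The vectorised `.str` accessor is one alternative to `apply` with a lambda; both fail on missing values, which is why the null rows are dropped first.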

In [None]:
# New data

data.head(5)

In [None]:
sns.pairplot(data)

### **Finding:** The pairplot suggests that house price rises roughly linearly with bhk and bath

**Explanatory Variables:** bhk and Bath

**Response Variables:** House Price

In [None]:
num_vars = ["bath", "balcony", "price", "bhk"]
sns.heatmap(data[num_vars].corr(),cmap="coolwarm", annot=True)

### Finding:

### a) The correlation matrix shows that house price is correlated with bhk and bath.

### b) bhk and bath are also highly correlated with each other.


**Explanatory Variables:** bhk and Bath

**Response Variables:** House Price
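One way to probe the research question more directly is to fit an ordinary least squares model on `bath` and `bhk` alone and inspect its coefficients and R². The sketch below does this on synthetic data, not on the Bengaluru dataset: the frame `toy`, the coefficients 15 and 20, and the noise level are all made-up assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): bath tracks bhk, price is linear plus noise.
rng = np.random.default_rng(0)
n = 500
bhk = rng.integers(1, 6, size=n)                             # 1 to 5 bedrooms
bath = np.clip(bhk + rng.integers(-1, 2, size=n), 1, None)   # correlated with bhk
price = 20 * bhk + 15 * bath + rng.normal(0, 10, size=n)
toy = pd.DataFrame({"bath": bath, "bhk": bhk, "price": price})

# Fit a plain linear regression on the two explanatory variables.
features = ["bath", "bhk"]
lin = LinearRegression().fit(toy[features], toy["price"])

# R^2 measures how much of the price variance the linear fit explains.
r2 = lin.score(toy[features], toy["price"])
print(dict(zip(features, lin.coef_.round(2))), round(r2, 3))
```

On the real data, the same two lines of fitting code could be applied to `data[["bath", "bhk"]]` and `data["price"]`; a low R² would indicate that a linear model in these two features captures only part of the price variation.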

In [None]:
data['bhk'].value_counts().plot(kind='bar')

### Finding: 2 and 3 bhk homes are the most common in Bengaluru, India

In [None]:
data['bath'].value_counts().plot(kind='bar')

### Finding: Most homes in Bengaluru (India) have 2 or 3 bathrooms

### Model Building

In [None]:
X = data.drop(['price','area_type','location','availability','size','society','total_sqft'],axis='columns')
Y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.2,random_state=10)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

def find_best_model_using_gridsearchcv(X,y):
    algos = {
        'linear_regression' : {
            'model': LinearRegression(),
            'params': {
                # `normalize` was removed in scikit-learn 1.2; tune `fit_intercept` instead
                'fit_intercept': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1,2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                # 'mse' was renamed to 'squared_error' in scikit-learn 1.0
                'criterion' : ['squared_error','friedman_mse'],
                'splitter': ['best','random']
            }
        },
    
        'RandomForest': {
            'model': RandomForestRegressor(),
            'params': {
                'n_estimators': [200, 500],
                # 'auto' (all features) was removed in scikit-learn 1.3; 1.0 is the equivalent
                'max_features': [1.0, 'sqrt', 'log2'],
                'max_depth' : [4,5,6,7,8],
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs =  GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X,y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['model','best_score','best_params'])

find_best_model_using_gridsearchcv(X,Y)

## Finding:

### Random Forest Regressor gives the best cross-validated R² score, 0.4644 (46.44%), with max_depth = 5, max_features = sqrt and n_estimators = 500, so this model will be used for training.

In [None]:
model = RandomForestRegressor(max_depth=5, n_estimators=500, max_features='sqrt')
model.fit(X_train, y_train)
model.score(X_test, y_test)

### **Finding: The model's R² score on the test set is 0.4869 (48.69%)**
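For a scikit-learn regressor, `score` returns the coefficient of determination R² rather than a classification accuracy. The sketch below checks this on synthetic data by recomputing R² by hand; the arrays `X_toy` and `y_toy` and the weights used to generate them are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption): three integer features, linear target plus noise.
rng = np.random.default_rng(1)
X_toy = rng.integers(1, 6, size=(200, 3)).astype(float)
y_toy = X_toy @ np.array([15.0, 5.0, 20.0]) + rng.normal(0, 10, size=200)

# Simple split: first 150 rows for training, last 50 for testing.
X_tr, X_te, y_tr, y_te = X_toy[:150], X_toy[150:], y_toy[:150], y_toy[150:]

rf = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
rf.fit(X_tr, y_tr)
y_pred = rf.predict(X_te)

# R^2 = 1 - SS_res / SS_tot, computed by hand.
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

score = rf.score(X_te, y_te)
print(round(score, 4), round(manual_r2, 4), round(r2_score(y_te, y_pred), 4))
```

All three printed values agree, so a score of 0.4869 means the model explains roughly 49% of the variance in the test prices.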

## Prediction

In [None]:
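# Single example to predict; column names and order must match the training features
sample = pd.DataFrame({'bath': [3],
    'balcony':[2],
    'bhk':[3]})
print(sample)
prediction_price = model.predict(sample)
print('Predicted house price', prediction_price[0])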

## Conclusion:

### **From this data, house price in Bengaluru (India) is associated with the number of bathrooms and the number of bedrooms (bhk). The analysis supports a roughly linear relationship between house price and bhk and bath, although these features explain only about half of the variance in price (test R² of 0.4869).**