# Overview
Welcome to the House Price Prediction Challenge, which tests your regression skills by asking you to design an algorithm that accurately predicts house prices in India. Accurately predicting house prices can be a daunting task. Buyers are not concerned only with the size (square feet) of a house; various other factors play a key role in deciding the price of a house or property. It can be extremely difficult to figure out the right set of attributes that explain buyer behavior. This dataset has been collected from various property aggregators across India. In this competition, given the 12 influencing factors, your role as a data scientist is to predict the prices as accurately as possible.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
import numpy as np
from sklearn.metrics import mean_squared_log_error,r2_score
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')
data=pd.read_csv('../input/house-price-prediction-challenge/train.csv')
data.head()

In [None]:
data.info()

Here we can see that there are no null values in the dataset.
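
As a complement to reading the `info()` output, null values can also be counted per column directly. The sketch below uses a small made-up frame (the values are hypothetical; only the column names `RERA` and `BHK_NO.` come from this notebook) to show what such a check reports when a value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "RERA": [1.0, np.nan, 0.0],
    "BHK_NO.": [2, 3, 2],
})

# Count missing values per column; a column of zeros means no nulls.
null_counts = toy.isnull().sum()
print(null_counts)
```

On the real data, the same call would be `data.isnull().sum()`.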

## POSTED_BY

Here we look at how the price of a house depends on who posted the listing (dealer, owner, or builder).

In [None]:
data.POSTED_BY.value_counts()

In [None]:
plt.figure(figsize=(10,8))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=data.POSTED_BY,y=data['TARGET(PRICE_IN_LACS)'],hue=data.POSTED_BY)

From the above scatterplot it is quite evident that there are some outliers in the provided data.

In [None]:
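def remove_outliers(main_data,column,threshold=3):
    # Drop, in place, every row whose z-score in `column` exceeds `threshold`
    mean=np.mean(column)
    std=np.std(column)
    z_scores=(column-mean)/std
    # Work with index labels so the function also behaves on a non-default index
    outlier_index=z_scores[np.abs(z_scores)>threshold].index
    main_data.drop(index=outlier_index,inplace=True)
    return main_data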

In [None]:
remove_outliers(data,data['TARGET(PRICE_IN_LACS)'])
plt.figure(figsize=(10,8))
sns.scatterplot(x=data.POSTED_BY,y=data['TARGET(PRICE_IN_LACS)'],hue=data.POSTED_BY)

In the scatterplot above, the outliers have been removed.
Properties posted by dealers also tend to be priced higher than those posted by owners or builders.
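
To make the z-score rule concrete, here is a minimal sketch on made-up prices (the numbers are hypothetical, not taken from the dataset). It applies the same idea as `remove_outliers`, keeping only values within 3 standard deviations of the mean, but in vectorized form.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: 19 ordinary listings and one extreme value.
prices = pd.Series([50.0] * 19 + [5000.0])

# z-score using the population standard deviation (ddof=0).
z_scores = (prices - prices.mean()) / prices.std(ddof=0)

# Keep only the values within 3 standard deviations of the mean.
filtered = prices[np.abs(z_scores) <= 3]
print(len(prices), len(filtered))
```

Note that with the population standard deviation, no point in a sample of n values can have a z-score above sqrt(n - 1), so a threshold of 3 only starts to flag points once the sample has more than 10 rows.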

In [None]:
data.POSTED_BY=pd.Categorical(data.POSTED_BY,categories=['Owner','Builder','Dealer'],ordered=True).codes


In [None]:
price=data['TARGET(PRICE_IN_LACS)']
data.drop(['TARGET(PRICE_IN_LACS)'],axis=1,inplace=True)

## Under-Construction 

In [None]:
data.UNDER_CONSTRUCTION.value_counts()

- 0: construction completed
- 1: under construction

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(x=data.UNDER_CONSTRUCTION,y=price,hue=data.UNDER_CONSTRUCTION)

The above box plot indicates that prices of under-construction houses are generally lower than those of fully constructed houses.
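
A comparison like this can also be expressed as a number by taking the median price per group. The sketch below uses a made-up frame (the prices and the `PRICE` column name are hypothetical) to show the pattern.

```python
import pandas as pd

# Hypothetical toy data; the prices are made up for illustration only.
toy = pd.DataFrame({
    "UNDER_CONSTRUCTION": [0, 0, 0, 1, 1, 1],
    "PRICE": [60.0, 75.0, 90.0, 40.0, 55.0, 65.0],
})

# Median price for completed (0) and under-construction (1) houses.
median_price = toy.groupby("UNDER_CONSTRUCTION")["PRICE"].median()
print(median_price)
```

In this notebook the equivalent call would be `price.groupby(data.UNDER_CONSTRUCTION).median()`, since the target has already been split off into `price`.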

## RERA

In [None]:
data.RERA.value_counts()

- 0: RERA not approved
- 1: RERA approved

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(x=data.RERA,y=price,hue=data.RERA)

Here the prices are spread over a wide range in both cases.

## BHK_NO 

In [None]:
data['BHK_NO.'].value_counts()

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(x=data['BHK_NO.'],y=price,hue=data['BHK_NO.'])

## BHK OR RK

In [None]:
data.BHK_OR_RK.value_counts()

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(x=data.BHK_OR_RK,y=price,hue=data.BHK_OR_RK)

Assigning:

- 0: RK
- 1: BHK
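
The effect of this kind of mapping can be sketched on a toy series. With `pd.Categorical`, each value receives the position of its category in the `categories` list, and any value that is not listed receives the code -1. The `"Studio"` entry below is a hypothetical value added only to show that case.

```python
import pandas as pd

# Toy values; "Studio" is hypothetical and not one of the listed categories.
toy = pd.Series(["RK", "BHK", "BHK", "Studio"])

# Codes follow the order of `categories`: RK -> 0, BHK -> 1, unlisted -> -1.
codes = pd.Categorical(toy, categories=["RK", "BHK"], ordered=True).codes
print(codes.tolist())
```

Checking for -1 after encoding is a cheap way to catch unexpected labels before they reach a model.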

In [None]:
data.BHK_OR_RK=pd.Categorical(data.BHK_OR_RK,categories=['RK','BHK'],ordered=True).codes

## READY TO MOVE 

In [None]:
data.READY_TO_MOVE.value_counts()

- 0: not ready to move
- 1: ready to move

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(x=data.READY_TO_MOVE,y=price,hue=data.READY_TO_MOVE)

Ready-to-move houses tend to cost considerably more than houses that are not ready to move.

## RESALE 

In [None]:
data.RESALE.value_counts()

In [None]:
plt.figure(figsize=(10,8))
sns.scatterplot(x=data.RESALE,y=price,hue=data.RESALE)

## ADDRESS 

In [None]:
data.ADDRESS.value_counts()

In [None]:
data.ADDRESS=pd.Categorical(data.ADDRESS).codes

## Splitting of data

In [None]:
train_data,test_data,train_price,test_price=train_test_split(data,price,test_size=.2)
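
The split above is not seeded, so each run produces a different partition and slightly different scores. As a minimal sketch, assuming reproducibility is wanted, passing `random_state` fixes the partition. The toy frame and the seed value 0 below are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and target, ten rows each.
features = pd.DataFrame({"x": range(10)})
target = pd.Series(range(10))

# The same random_state yields the same partition on every call.
first = train_test_split(features, target, test_size=.2, random_state=0)
second = train_test_split(features, target, test_size=.2, random_state=0)

X_train, X_test, y_train, y_test = first
same = list(X_test.index) == list(second[1].index)
print(len(X_train), len(X_test), same)
```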

## DecisionTreeRegressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
regressor1=DecisionTreeRegressor(max_depth=9)
regressor1.fit(train_data,train_price)

In [None]:
print("root_mean_squared_log Train Error is ",np.sqrt(mean_squared_log_error(train_price,regressor1.predict(train_data))))
print("root_mean_squared_log Test Error is ",np.sqrt(mean_squared_log_error(test_price,regressor1.predict(test_data))))
print('r2_score of train data is:-',r2_score(train_price,regressor1.predict(train_data)))
print('r2_score of test data is:-',r2_score(test_price,regressor1.predict(test_data)))
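
The cell above takes the square root of `mean_squared_log_error`, which gives the root mean squared log error (RMSLE). As a rough sketch of what that number measures, the hypothetical helper `rmsle` below writes the formula out with NumPy, using log(1 + y) as scikit-learn's metric does. The input values are made up so that the result is easy to verify by hand.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, computed on log(1 + y)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_log_diff = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(np.mean(squared_log_diff)))

# Made-up values chosen so the log differences are exactly 0 and 1.
y_true = np.expm1([0.0, 1.0])
y_pred = np.expm1([0.0, 2.0])
value = rmsle(y_true, y_pred)
print(value)
```

Because the error is taken on the log scale, it penalizes relative differences, so a miss on an expensive house does not dominate the score the way it would with a plain squared error.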

## RandomForestRegressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor2=RandomForestRegressor(max_depth=12,random_state=35)
regressor2.fit(train_data,train_price)

In [None]:
print("root_mean_squared_log Train Error is ",np.sqrt(mean_squared_log_error(train_price,regressor2.predict(train_data))))
print("root_mean_squared_log Test Error is ",np.sqrt(mean_squared_log_error(test_price,regressor2.predict(test_data))))
print('r2_score of train data is:-',r2_score(train_price,regressor2.predict(train_data)))
print('r2_score of test data is:-',r2_score(test_price,regressor2.predict(test_data)))
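
For interpreting the `r2_score` values printed above, it helps to recall that R² is one minus the ratio of the residual sum of squares to the total sum of squares. The hypothetical helper `r2` below is a minimal sketch of that definition on made-up numbers, not a replacement for scikit-learn's function.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up targets for illustration.
y_true = [10.0, 20.0, 30.0, 40.0]
perfect = r2(y_true, y_true)        # predictions equal the targets
baseline = r2(y_true, [25.0] * 4)   # always predicting the mean
print(perfect, baseline)
```

A perfect model scores 1 and a model that always predicts the mean scores 0, so the value reads as the share of variance explained.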

## Conclusion

The RandomForestRegressor works quite well, with an r2_score of 0.76 on the test dataset, meaning it explains about 76% of the variance in price.