# House Prices In India

This data set was taken from: https://www.kaggle.com/anmolkumar/house-price-prediction-challenge

Feature columns:

POSTED_BY - Category marking who has listed the property

UNDER_CONSTRUCTION - Under Construction or Not

RERA - RERA approved or not

BHK_NO - Number of Rooms

BHK_OR_RK - Type of property

SQUARE_FT - Total area of the house in square feet 

READY_TO_MOVE - Category marking ready to move or not

RESALE - Category marking Resale or not

ADDRESS - Address of the property

LONGITUDE - Longitude of the property

LATITUDE - Latitude of the property

TARGET(PRICE_IN_LACS) - The price of the property

In total we have 12 columns: 11 features plus the target.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
file_path_test = "../input/house-price-prediction-challenge/test.csv"
file_path_train = "../input/house-price-prediction-challenge/train.csv"

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
train_data = pd.read_csv(file_path_train)
test_data = pd.read_csv(file_path_test)

In [None]:
train_data.head(12)

# Feature Engineering

Here we study the features and their importance for predicting the house price.

1) Null Values

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

2) Price Outliers

In [None]:
sns.distplot(train_data['TARGET(PRICE_IN_LACS)'])

The distribution has a long right tail: prices from about 4000 lacs upwards are outliers, and since we have very little data in that range we can remove them.
Let's start by removing the top 3% of the prices.

In [None]:
train_data['TARGET(PRICE_IN_LACS)'].value_counts()

In [None]:
len(train_data)*(0.03)

In [None]:
# Remove the price outliers: sort by price (descending) and drop the 884 most expensive rows (about 3%)
train_data = train_data.sort_values('TARGET(PRICE_IN_LACS)', ascending=False).iloc[884:]
train_data.describe()

In [None]:
sns.distplot(train_data['TARGET(PRICE_IN_LACS)'])

The distribution plot now covers the most common prices, so the price outliers have been cleaned.
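
The trimming above drops a fixed number of rows after sorting. A minimal sketch of the same idea expressed as a quantile cutoff, on made-up prices, is shown here. The helper name `trim_top_fraction` and the toy data are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def trim_top_fraction(df, column, fraction):
    """Keep only rows whose `column` value is at or below the (1 - fraction) quantile."""
    cutoff = df[column].quantile(1 - fraction)
    return df[df[column] <= cutoff]

# Toy data (made up): 99 ordinary prices and one extreme value
toy = pd.DataFrame({'TARGET(PRICE_IN_LACS)': list(range(1, 100)) + [30000]})
trimmed = trim_top_fraction(toy, 'TARGET(PRICE_IN_LACS)', 0.03)
print(len(toy), len(trimmed), trimmed['TARGET(PRICE_IN_LACS)'].max())
```

Unlike slicing with `iloc`, a quantile cutoff keeps every row tied with the cutoff value, so the number of removed rows can differ slightly from `len(df) * fraction`.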

3) Latitude and Longitude

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x='LONGITUDE',y='LATITUDE',data=train_data,hue='TARGET(PRICE_IN_LACS)', 
                palette = 'RdYlGn', edgecolor = None, alpha = 0.2)

The scatter plot of longitude and latitude shows where most of the houses for sale are located and where the most expensive ones lie, which highlights the importance of these two features to the model. Since latitude and longitude already encode the location, we can discard the address, which also helps to avoid overfitting.

4) Square_Ft

In [None]:
sns.distplot(train_data['SQUARE_FT'])

The square feet feature also has some outliers that we need to clean.

In [None]:
train_data['SQUARE_FT'].value_counts()

In [None]:
len(train_data)*(0.03)

In [None]:
# Clean the SQUARE_FT outliers: drop the 852 largest rows (about 3%)
train_data = train_data.sort_values('SQUARE_FT',ascending=False).iloc[852:]
train_data.describe()

In [None]:
sns.distplot(train_data['SQUARE_FT'])

We have now cleaned the 'SQUARE_FT' feature by eliminating its outliers.

5) BHK or RK

In [None]:
train_data['BHK_OR_RK'].value_counts()

In [None]:
# Turn BHK_OR_RK into a binary variable (BHK -> 1, RK -> 0)
train_data['BHK_OR_RK'] = train_data['BHK_OR_RK'].replace(to_replace=['BHK', 'RK'], value=[1, 0])

In [None]:
train_data.head()

The 'BHK_OR_RK' feature is now encoded as 0 and 1, which makes it easier to analyse.

6) Conversion of 'POSTED_BY' to a numerical feature

In [None]:
train_data['POSTED_BY'] = train_data['POSTED_BY'].replace(to_replace=['Owner', 'Dealer','Builder'], value=[0, 1, 2]) 
test_data['POSTED_BY'] = test_data['POSTED_BY'].replace(to_replace=['Owner', 'Dealer','Builder'], value=[0, 1, 2])   
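
One thing `replace` does not do is warn when a label is missing from the list: such values silently stay as strings. A hedged sketch of a stricter variant using `Series.map` is given here. The dictionary `POSTED_BY_CODES` and the helper `encode_posted_by` are names invented for this illustration, and the toy series is made up.

```python
import pandas as pd

# Hypothetical mapping that mirrors the replace() calls above
POSTED_BY_CODES = {'Owner': 0, 'Dealer': 1, 'Builder': 2}

def encode_posted_by(series):
    """Map category labels to integer codes and fail loudly on unknown labels."""
    encoded = series.map(POSTED_BY_CODES)  # labels missing from the dict become NaN
    if encoded.isnull().any():
        unknown = sorted(series[encoded.isnull()].unique())
        raise ValueError(f"Unmapped POSTED_BY labels: {unknown}")
    return encoded.astype(int)

toy = pd.Series(['Owner', 'Dealer', 'Builder', 'Owner'])
codes = encode_posted_by(toy)
print(codes.tolist())
```

Applying the same function to both the training and the test frame keeps the two encodings consistent.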

7) Data Correlation

In [None]:
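# ADDRESS is still a string column here, so restrict the correlation to numeric columns (pandas >= 1.5)
train_data.corr(numeric_only=True)['TARGET(PRICE_IN_LACS)']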

The correlation shows that the price is most strongly related to the resale flag, the number of rooms, the square feet and who posted the listing. Let's see the whole picture for all the features using a seaborn heatmap.

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(train_data.corr(numeric_only=True), annot=True, cmap='coolwarm')  # numeric_only skips the string ADDRESS column

According to the heatmap, the 'READY_TO_MOVE' feature is perfectly correlated with the 'UNDER_CONSTRUCTION' feature, so one of them is redundant and we need to remove it to avoid overfitting.
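
A perfect correlation between two binary flags usually means that one is the exact complement of the other. The sketch below shows what that looks like on a made-up frame. The toy values are an assumption for illustration, and the same two checks could be run on `train_data` to confirm the relationship.

```python
import pandas as pd

# Toy frame (made up) in which one flag is the exact complement of the other
toy = pd.DataFrame({'UNDER_CONSTRUCTION': [0, 1, 0, 0, 1, 1]})
toy['READY_TO_MOVE'] = 1 - toy['UNDER_CONSTRUCTION']

# Pearson correlation between the two flags
corr = toy['UNDER_CONSTRUCTION'].corr(toy['READY_TO_MOVE'])

# Direct check: the two flags always sum to 1
is_complement = bool((toy['UNDER_CONSTRUCTION'] + toy['READY_TO_MOVE'] == 1).all())
print(round(corr, 6), is_complement)
```

When the direct check holds, either column can be reconstructed from the other, so keeping both adds no information.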

# Processing the Data

Based on the conclusions of the previous section, we will drop the disposable features and the features that may cause overfitting.

1) We concluded that the address is disposable since we have the latitude and longitude, so let's drop it from both datasets.

In [None]:
train_data.drop(['ADDRESS'], axis = 1, inplace = True)
test_data.drop(['ADDRESS'], axis = 1, inplace = True)

2) The BHK_OR_RK feature is clearly unbalanced and only weakly correlated with the price, so we can drop it.

In [None]:
train_data.drop(['BHK_OR_RK'], axis = 1, inplace = True)
test_data.drop(['BHK_OR_RK'], axis = 1, inplace = True)

3) As mentioned in the previous section, the 'READY_TO_MOVE' feature is perfectly correlated with the 'UNDER_CONSTRUCTION' feature, so to avoid overfitting we drop one of them.

In [None]:
train_data.drop(['READY_TO_MOVE'], axis = 1, inplace = True)
test_data.drop(['READY_TO_MOVE'], axis = 1, inplace = True)

In [None]:
train_data.head()

# Train Test Split

In [None]:
X_train = train_data.drop('TARGET(PRICE_IN_LACS)',axis=1).values
Y_train = train_data['TARGET(PRICE_IN_LACS)'].values

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_train,Y_train,test_size=0.1,random_state=42)

# Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)

In [None]:
X_train.shape

In [None]:
X_test.shape
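
`MinMaxScaler` with its default range rescales every column to [0, 1] using the minimum and maximum seen during `fit`. The sketch below reproduces that arithmetic with NumPy on made-up numbers, to show why the scaler is fitted on the training split only and then reused on the test split. It ignores constant columns, which scikit-learn handles separately.

```python
import numpy as np

# Toy feature matrices (made-up values); rows are samples, columns are features
train = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 200.0]])
test = np.array([[2.0, 400.0]])

# "fit": learn the per-column minimum and range from the training data only
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# "transform": apply the training statistics to both splits
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range
print(train_scaled.min(axis=0), train_scaled.max(axis=0), test_scaled)
```

The training columns end up exactly in [0, 1], while a test value larger than anything seen in training lands above 1. That is expected and avoids leaking test statistics into the scaler.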

# Creating a Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import Callback, ModelCheckpoint
from tensorflow.keras.layers import Dropout

In [None]:
model_price = Sequential()

# Number of neurons equal to the number of features in the dataset
model_price.add(Dense(8,activation='relu',input_shape=(8,)))
model_price.add(Dropout(0.5))
model_price.add(Dense(8,activation='relu'))
model_price.add(Dropout(0.5))
model_price.add(Dense(8,activation='relu'))
model_price.add(Dropout(0.5))
model_price.add(Dense(8,activation='relu'))
model_price.add(Dropout(0.5))
model_price.add(Dense(1, activation = 'linear'))

model_price.compile(optimizer='adam',loss='mae')
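
As a quick sanity check on the size of this network, the trainable parameters can be counted by hand. This is a minimal sketch, assuming the layer widths from the cell above; Dropout layers add no parameters, and `dense_params` is a helper name made up for this illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Input width followed by the units of each Dense layer in model_price
layer_sizes = [8, 8, 8, 8, 8, 1]
total = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total)
```

The result can be compared with the total reported by `model_price.summary()`.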

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
cb = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)
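
With `patience=25`, training stops once the validation loss has gone 25 consecutive epochs without improving on its best value. The following is a simplified pure-Python illustration of that rule, not Keras's actual implementation: it ignores options such as `min_delta` and `restore_best_weights`, and the function name and loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it runs to the end."""
    best = float('inf')
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.55, 3.4]
print(stopping_epoch(losses, patience=2), stopping_epoch(losses, patience=5))
```

A small patience stops at the first short plateau, while a larger one lets training ride through it and pick up the later improvement.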

# Training The Model

In [None]:
model_price.fit(x=X_train,y=Y_train, validation_data=(X_test, Y_test), batch_size=128, epochs=150, callbacks=[cb])
# batch_size is chosen as a power of two

In [None]:
losses = pd.DataFrame(model_price.history.history)
losses.plot()

# Test and Evaluate the Model

In [None]:
from sklearn.metrics import mean_absolute_error, max_error, mean_squared_error

In [None]:
y_pred = model_price.predict(X_test).reshape(X_test.shape[0])

# Create a dataframe to put the two columns of the true value and the prediction
pred_df = pd.DataFrame({'Actual value':Y_test, 'Predicted value':y_pred})
print(pred_df.head())

In [None]:
mean_absolute_error(y_true=pred_df['Actual value'], y_pred=pred_df['Predicted value'])

In [None]:
print(mean_squared_error(y_true=pred_df['Actual value'], y_pred=pred_df['Predicted value']))

In [None]:
from sklearn.metrics import explained_variance_score
explained_variance_score(Y_test, y_pred)
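
The three metrics used above have simple closed forms. The sketch below recomputes them with NumPy on made-up values, taking explained variance as 1 - Var(y - y_hat) / Var(y), which is the definition scikit-learn documents for `explained_variance_score`. The arrays are assumptions for illustration only.

```python
import numpy as np

# Made-up true prices and predictions (in lacs)
y_true = np.array([50.0, 80.0, 120.0, 60.0])
y_hat = np.array([55.0, 70.0, 125.0, 60.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))                              # mean absolute error
mse = np.mean(errors ** 2)                                 # mean squared error
explained_variance = 1 - np.var(errors) / np.var(y_true)   # share of variance explained
print(mae, mse, explained_variance)
```

MAE is in the same unit as the price, which makes it the easiest of the three to interpret; MSE penalises large errors more heavily.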

In [None]:
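# Predict on the competition test set (y_pred above only covers the validation split)
test_pred = model_price.predict(scaler.transform(test_data.values)).reshape(-1)
pd.DataFrame(test_pred).to_csv('submission.csv')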