# STATUS: MARKED AS FINAL

# House Price Prediction Challenge

Welcome to the House Price Prediction Challenge, where you will test your regression skills by designing an algorithm to accurately predict house prices in India. Accurately predicting house prices can be a daunting task. Buyers are not concerned only with the size (square feet) of the house; various other factors play a key role in deciding the price of a house or property. It can be extremely difficult to figure out the right set of attributes that contribute to understanding buyer behavior. This dataset has been collected from various property aggregators across India. In this competition, given the 12 influencing factors, your role as a data scientist is to predict the prices as accurately as possible.

Also, in this competition, you will get a lot of room for feature engineering and mastering advanced regression techniques such as Random Forest, Deep Neural Nets, and various other ensembling techniques. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')
test = pd.read_csv('/kaggle/input/house-price-prediction-challenge/test.csv')
sample_submission = pd.read_csv('/kaggle/input/house-price-prediction-challenge/sample_submission.csv')

# First look at data

The following are the features available in the dataset:

1. POSTED_BY - Category marking who has listed the property
2. UNDER_CONSTRUCTION - Under construction or not
3. RERA - Approved under the Real Estate (Regulation and Development) Act, 2016, or not
4. BHK_NO. - Number of rooms
5. BHK_OR_RK - Type of property: Room and Kitchen (RK) or Bedroom, Hall, Kitchen (BHK)
6. SQUARE_FT - Total area of the house in square feet
7. READY_TO_MOVE - Category marking ready to move or not
8. RESALE - Category marking resale or not
9. ADDRESS - Address of the property
10. LONGITUDE - Longitude of the property
11. LATITUDE - Latitude of the property

The target column is the Price in Lacs.

In [None]:
train.head(5)

My main interest in the dataset will be how the different features affect the price and then building some kind of a model for prediction.

All the features are categorical except the square footage (the longitude and latitude are also numeric, and the address is free text).

As a side task, I will also try to create some kind of a map using the Longitudes and Latitudes features.

Since the test dataset has no actual prices available, I will further split the train dataset into train and test sets to measure how well the model performs.

# Exploring the data

## Taking a look at the price column

Looks like there is an outlier in the dataset.

In [None]:
sns.violinplot(data=train, y='TARGET(PRICE_IN_LACS)')
plt.show()

The large gap between the median and the mean points to outliers. The maximum price is 30,000 lacs. Is this a data error?

In [None]:
train['TARGET(PRICE_IN_LACS)'].describe()
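
To make the mean-versus-median argument concrete, here is a minimal sketch on toy prices (made-up values, not taken from the dataset). It shows how a single extreme value pulls the mean far above the median, and how a hand-picked threshold isolates that value.

```python
import pandas as pd

# Toy prices in lacs (illustrative values, not taken from the dataset)
prices = pd.Series([40.0, 55.0, 62.0, 75.0, 90.0, 30000.0])

mean_price = prices.mean()
median_price = prices.median()

# A single extreme value drags the mean far above the median
print(f"mean={mean_price:.1f}, median={median_price:.1f}")

# Rows above a hand-picked threshold, the same kind of filter used on the real data
extreme = prices[prices > 3999]
print(extreme.tolist())
```

The median barely moves when the extreme row is removed, which is why comparing it with the mean is a quick first check for outliers.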

I found three houses priced above 3,999 lacs, which I will classify as outliers. They are located in Bangalore and cover a huge area. In fact, they are the only listings with an area of more than 10,000,000 square feet.

In [None]:
train[train['TARGET(PRICE_IN_LACS)']>3999]

Taking a look at houses with an area of more than 10,000,000 square feet:

In [None]:
train[train['SQUARE_FT']>10000000]

# Area covered and relation to price

In [None]:
f, axes = plt.subplots(1,1,figsize=(15,5))
sns.scatterplot(data=train, x='SQUARE_FT', y='TARGET(PRICE_IN_LACS)')
plt.show()

In [None]:
f, axes = plt.subplots(1,2,figsize=(15,5))
# Left: listings below the area threshold; right: the remaining very large listings
sns.scatterplot(data=train[train['SQUARE_FT']<399999], x='SQUARE_FT', y='TARGET(PRICE_IN_LACS)', ax=axes[0])
sns.scatterplot(data=train[train['SQUARE_FT']>=399999], x='SQUARE_FT', y='TARGET(PRICE_IN_LACS)', ax=axes[1])
plt.show()

# Extracting the cities

In [None]:
def get_city_name(address):
    return address[address.find(',')+1:]

train['CITY'] = train['ADDRESS'].apply(get_city_name)

In [None]:
len(train['CITY'].unique())
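
The slicing in `get_city_name` keeps everything after the first comma. The sketch below compares it with a hypothetical variant that keeps only the text after the last comma, on made-up addresses. Whether the real addresses ever contain more than one comma is an assumption I have not checked here.

```python
def city_after_first_comma(address: str) -> str:
    """Same slicing as get_city_name above: everything after the first comma."""
    return address[address.find(',') + 1:]


def city_after_last_comma(address: str) -> str:
    """Hypothetical variant: only the last comma-separated token, stripped."""
    return address.rsplit(',', 1)[-1].strip()


# Made-up addresses for illustration
examples = ['Green Park,Bangalore', 'Sector 5, Block A, Noida', 'Mumbai']

first = [city_after_first_comma(a) for a in examples]
last = [city_after_last_comma(a) for a in examples]

print(first)  # ['Bangalore', ' Block A, Noida', 'Mumbai']
print(last)   # ['Bangalore', 'Noida', 'Mumbai']
```

Both versions agree when an address has exactly one comma or none. They differ only when there are several commas, which would inflate the number of unique "cities".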

# Converting more than 5 BHK to 5 BHK

In [None]:
def BHK(BHK_NO):
    if BHK_NO > 5:
        return 5
    else:
        return BHK_NO

train['BHK_NO.'] = train['BHK_NO.'].apply(BHK)

In [None]:
train['BHK_NO.']
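
The same capping can be written without a helper function. This is a small sketch on toy values using pandas' `Series.clip`, which should behave like the `BHK` function above for this column.

```python
import pandas as pd

# Toy room counts (made-up values)
bhk = pd.Series([1, 2, 3, 5, 6, 12], name='BHK_NO.')

# Values above 5 are replaced by 5; everything else is left unchanged
capped = bhk.clip(upper=5)
print(capped.tolist())  # [1, 2, 3, 5, 5, 5]
```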

# The other categorical features

In [None]:
train.columns

In [None]:
features = ['POSTED_BY', 'UNDER_CONSTRUCTION', 'RERA', 'BHK_NO.', 
            'BHK_OR_RK', 'READY_TO_MOVE', 'RESALE']

for feature in features:

    f, axes = plt.subplots(1,2,figsize=(15,5))

    sns.countplot(data=train, x=feature, ax=axes[0])
    sns.violinplot(data=train, x=feature, y='TARGET(PRICE_IN_LACS)', ax=axes[1])
    plt.show()

# Train and test data

I will further split the available train data into train and test sets.

I will experiment with different types of data transformations.

In all scenarios, the decision tree regressor gives the best results.

## The lazy model

Using the dataset as it is, the winner is the decision tree regressor. The linear regression has a negative r2 score.

In [None]:
df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

categorical_columns = []

for column in df.columns:
    if df[column].dtype == 'object':
        categorical_columns.append(column)
        
df = pd.get_dummies(df,columns=categorical_columns, dtype=int, drop_first=True)
df.fillna(0, inplace=True)

y = df['TARGET(PRICE_IN_LACS)']
X = df.drop(labels = ['TARGET(PRICE_IN_LACS)'], axis = 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor(), LinearRegression(), Ridge(),  Lasso()]

for model in models:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    from sklearn import metrics
    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score (y_test, y_pred))
    print('-------------------------------------')
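
A negative r2 score can look surprising, so here is a small worked sketch of the usual definition, r2 = 1 - SS_res / SS_tot, on toy numbers. The `r2` helper is hypothetical and written only for illustration. It is meant to follow the same formula as `metrics.r2_score`.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot


y_true = [10.0, 20.0, 30.0, 40.0]
baseline = [25.0] * 4           # always predict the mean of y_true
bad = [40.0, 30.0, 20.0, 10.0]  # predictions in reverse order

print(r2(y_true, baseline))  # 0.0
print(r2(y_true, bad))       # -3.0
```

A score below zero therefore means the model does worse on the held-out rows than a constant prediction of the mean would.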

## Dropping the latitude and longitude columns

The decision tree regressor has improved. The linear regression still has a negative r2 score. Ridge and Lasso are almost the same as before.

In [None]:
df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')
df.drop(labels=['LONGITUDE','LATITUDE'],axis=1, inplace=True)

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

categorical_columns = []

for column in df.columns:
    if df[column].dtype == 'object':
        categorical_columns.append(column)
        
df = pd.get_dummies(df,columns=categorical_columns, dtype=int, drop_first=True)
df.fillna(0, inplace=True)

y = df['TARGET(PRICE_IN_LACS)']
X = df.drop(labels = ['TARGET(PRICE_IN_LACS)'], axis = 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor(), LinearRegression(), Ridge(),  Lasso()]

for model in models:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    from sklearn import metrics
    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score (y_test, y_pred))
    print('-------------------------------------')
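
The fit-and-score loop is repeated in every scenario, so it could be wrapped in one function. The sketch below is one way to do that. `evaluate_models` is a hypothetical helper, the data is synthetic, and fixing `random_state` on the tree is my own assumption so that repeated runs are comparable.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate_models(X, y, models, test_size=0.33, random_state=42):
    """Fit each model on one shared split and return its metrics (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, shuffle=True
    )
    results = {}
    for model in models:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        mse = metrics.mean_squared_error(y_test, y_pred)
        results[type(model).__name__] = {
            'mae': metrics.mean_absolute_error(y_test, y_pred),
            'mse': mse,
            'rmse': float(np.sqrt(mse)),
            'r2': metrics.r2_score(y_test, y_pred),
        }
    return results


# Synthetic data: price grows linearly with area plus a little noise
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200).reshape(-1, 1)
price = 0.05 * area.ravel() + rng.normal(0, 1, size=200)

scores = evaluate_models(
    area, price, [DecisionTreeRegressor(random_state=0), LinearRegression()]
)
for name, s in scores.items():
    print(name, round(s['r2'], 3))
```

Each scenario would then only need to build `X` and `y` and call the helper, which keeps the split and the metrics identical across experiments.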

## Dropping the address, latitude and longitude columns

The linear regression model has improved a lot, but it is still a long way from the decision tree regressor. Ridge and Lasso still show no change.

In [None]:
df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')
df.drop(labels=['LONGITUDE','LATITUDE', 'ADDRESS'],axis=1, inplace=True)

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

categorical_columns = []

for column in df.columns:
    if df[column].dtype == 'object':
        categorical_columns.append(column)
        
df = pd.get_dummies(df,columns=categorical_columns, dtype=int, drop_first=True)
df.fillna(0, inplace=True)

y = df['TARGET(PRICE_IN_LACS)']
X = df.drop(labels = ['TARGET(PRICE_IN_LACS)'], axis = 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor(), LinearRegression(), Ridge(),  Lasso()]

for model in models:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    from sklearn import metrics
    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score (y_test, y_pred))
    print('-------------------------------------')
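
One plausible reason, not verified here, for the linear regression changing so much once `ADDRESS` is dropped is the width of the one-hot encoding. `pd.get_dummies` creates one indicator column per distinct value, minus one with `drop_first=True`. The toy frame below uses made-up values to show the effect.

```python
import pandas as pd

# Toy frame with made-up values; every address is distinct
toy = pd.DataFrame({
    'SQUARE_FT': [900.0, 1200.0, 1500.0, 700.0],
    'ADDRESS': ['Area A,Pune', 'Area B,Pune', 'Area C,Delhi', 'Area D,Delhi'],
})

# Four distinct addresses become three indicator columns (drop_first removes one)
encoded = pd.get_dummies(toy, columns=['ADDRESS'], dtype=int, drop_first=True)
print(encoded.shape)  # (4, 4): SQUARE_FT plus three address indicators

without_address = toy.drop(columns=['ADDRESS'])
print(without_address.shape)  # (4, 1)
```

When nearly every row has its own address, the encoded frame has almost as many columns as rows, and a plain linear model can fit the training rows closely while generalising poorly.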

## Using only the area of the house

All the r2 scores are worse than in the previous scenario. However, this may be because of the outliers present in both the price and the area.

In [None]:
df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')
df.drop(labels=['LONGITUDE','LATITUDE', 'ADDRESS'],axis=1, inplace=True)

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

categorical_columns = []

for column in df.columns:
    if df[column].dtype == 'object':
        categorical_columns.append(column)
        
df = pd.get_dummies(df,columns=categorical_columns, dtype=int, drop_first=True)
df.fillna(0, inplace=True)

y = df['TARGET(PRICE_IN_LACS)']
X = df['SQUARE_FT'].to_numpy().reshape(-1, 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor(), LinearRegression(), Ridge(),  Lasso()]

for model in models:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    from sklearn import metrics
    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score (y_test, y_pred))
    print('-------------------------------------')

# Extracting city out of address

Hmm. This actually showed slightly reduced performance compared to using the full addresses.

In [None]:
df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')

def get_city_name(address):
    return address[address.find(',')+1:]

# Derive CITY on df, the frame used for modelling, before ADDRESS is dropped
df['CITY'] = df['ADDRESS'].apply(get_city_name)

df.drop(labels=['LONGITUDE','LATITUDE', 'ADDRESS'],axis=1, inplace=True)

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

categorical_columns = []

for column in df.columns:
    if df[column].dtype == 'object':
        categorical_columns.append(column)
        
df = pd.get_dummies(df,columns=categorical_columns, dtype=int, drop_first=True)
df.fillna(0, inplace=True)

y = df['TARGET(PRICE_IN_LACS)']
X = df.drop(labels = ['TARGET(PRICE_IN_LACS)'], axis = 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor(), LinearRegression(), Ridge(),  Lasso()]

for model in models:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    from sklearn import metrics
    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score (y_test, y_pred))
    print('-------------------------------------')

# Using transformed BHK feature

In [None]:
df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')

def get_city_name(address):
    return address[address.find(',')+1:]



# Derive CITY on df, the frame used for modelling, before ADDRESS is dropped
df['CITY'] = df['ADDRESS'].apply(get_city_name)

def BHK(BHK_NO):
    if BHK_NO > 5:
        return 5
    else:
        return BHK_NO

# Cap the room count on df so the transformed feature reaches the models
df['BHK_NO.'] = df['BHK_NO.'].apply(BHK)

df.drop(labels=['LONGITUDE','LATITUDE', 'ADDRESS'],axis=1, inplace=True)

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

categorical_columns = []

for column in df.columns:
    if df[column].dtype == 'object':
        categorical_columns.append(column)
        
df = pd.get_dummies(df,columns=categorical_columns, dtype=int, drop_first=True)
df.fillna(0, inplace=True)

y = df['TARGET(PRICE_IN_LACS)']
X = df.drop(labels = ['TARGET(PRICE_IN_LACS)'], axis = 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor(), LinearRegression(), Ridge(),  Lasso()]

for model in models:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    from sklearn import metrics
    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score (y_test, y_pred))
    print('-------------------------------------')

In [None]:
df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')


def BHK(BHK_NO):
    if BHK_NO > 5:
        return 5
    else:
        return BHK_NO

# Cap the room count on df so the transformed feature reaches the models
df['BHK_NO.'] = df['BHK_NO.'].apply(BHK)

df.drop(labels=['LONGITUDE','LATITUDE', 'ADDRESS'],axis=1, inplace=True)

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

categorical_columns = []

for column in df.columns:
    if df[column].dtype == 'object':
        categorical_columns.append(column)
        
df = pd.get_dummies(df,columns=categorical_columns, dtype=int, drop_first=True)
df.fillna(0, inplace=True)

y = df['TARGET(PRICE_IN_LACS)']
X = df.drop(labels = ['TARGET(PRICE_IN_LACS)'], axis = 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor(), LinearRegression(), Ridge(),  Lasso()]

for model in models:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    from sklearn import metrics
    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score (y_test, y_pred))
    print('-------------------------------------')

# And the winner is:

In [None]:
df = pd.read_csv('/kaggle/input/house-price-prediction-challenge/train.csv')
df.drop(labels=['LONGITUDE','LATITUDE', 'ADDRESS'],axis=1, inplace=True)

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso

categorical_columns = []

for column in df.columns:
    if df[column].dtype == 'object':
        categorical_columns.append(column)
        
df = pd.get_dummies(df,columns=categorical_columns, dtype=int, drop_first=True)
df.fillna(0, inplace=True)

y = df['TARGET(PRICE_IN_LACS)']
X = df.drop(labels = ['TARGET(PRICE_IN_LACS)'], axis = 1)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, shuffle=True)

models = [DecisionTreeRegressor()]

for model in models:
    
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    from sklearn import metrics
    print('Model:', model)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('r2_score:', metrics.r2_score (y_test, y_pred))
    print('-------------------------------------')
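
To use the winning model on the real test set, the test frame needs the same columns as the training frame after one-hot encoding. The sketch below shows one way to align them with `DataFrame.reindex`. The two toy frames are made-up stand-ins for `train.csv` and `test.csv`, and the fixed `random_state` is my own assumption.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy stand-ins for train.csv and test.csv (made-up values)
train_df = pd.DataFrame({
    'POSTED_BY': ['Owner', 'Dealer', 'Builder', 'Owner'],
    'SQUARE_FT': [900.0, 1200.0, 1500.0, 700.0],
    'TARGET(PRICE_IN_LACS)': [45.0, 60.0, 80.0, 35.0],
})
test_df = pd.DataFrame({
    'POSTED_BY': ['Owner', 'Dealer'],
    'SQUARE_FT': [1000.0, 1300.0],
})

target = 'TARGET(PRICE_IN_LACS)'
X_train = pd.get_dummies(train_df.drop(columns=[target]), columns=['POSTED_BY'], dtype=int)
y_train = train_df[target]

# Encode the test frame, then force it onto the training columns;
# categories missing from the test frame become all-zero columns
X_test = pd.get_dummies(test_df, columns=['POSTED_BY'], dtype=int)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)
print(list(X_test.columns))
print(predictions)
```

Without the `reindex` step the test frame here would be missing the `POSTED_BY_Builder` column, and `predict` would not receive the features the model was fitted on.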