# What matters the most when you buy a house? Location, location, location!

#### That is a very popular saying nowadays, and I believe in it, because it holds true in every country around the world.

#### Based on everyday experience, house prices are strongly related to the factors below:
1. Which city? A well-developed, modern city is usually more attractive than others.
2. Which area of the city? A house in the CBD or city center usually costs more than one in a suburb.
3. Within the same area? Other factors, such as transportation, convenience, and education, can move the sale price. They normally raise or lower the price only slightly relative to the area's average, and they are not strong enough to push a countryside house above a downtown apartment.

#### Looking at the column descriptions below
* ADDRESS + LONGITUDE + LATITUDE might help identify the location?
* The other columns might be the factors that move the price above or below the average?

| Column | Description |
| --- | --- |
| POSTED_BY | Category marking who has listed the property |
| UNDER_CONSTRUCTION | Under construction or not |
| RERA | RERA approved or not |
| BHK_NO. | Number of rooms |
| BHK_OR_RK | Type of property |
| SQUARE_FT | Total area of the house in square feet |
| READY_TO_MOVE | Category marking ready to move or not |
| RESALE | Category marking resale or not |
| ADDRESS | Address of the property |
| LONGITUDE | Longitude of the property |
| LATITUDE | Latitude of the property |




### So those are my initial thoughts after a first glance at the data. Let's do a quick EDA and see whether they apply to this dataset.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.metrics import confusion_matrix, roc_curve, auc
import xgboost as xgb 
import sklearn.metrics as metrics
from sklearn.metrics import mean_squared_error as MSE 

import seaborn as sns
sns.set_style('darkgrid')

import warnings
import datetime as dt
warnings.filterwarnings('ignore')
pd.options.display.max_rows = None
pd.options.display.max_columns = None
train_file = '/kaggle/input/house-price-prediction-challenge/train.csv'
train_df = pd.read_csv(train_file)

#### Check the data types and cleanliness
* Datatype
* Description
* Null value ratio
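
As a small illustration of the last bullet, the null value ratio per column can be computed in one line. This is only a sketch on a made-up toy frame (`toy_df` and its values are assumptions standing in for `train_df`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame({
    'SQUARE_FT': [1000.0, np.nan, 850.0, 1200.0],
    'ADDRESS': ['A,Pune', 'B,Mumbai', None, 'D,Pune'],
})

# isnull() gives a boolean frame; its column mean is the fraction of nulls.
null_ratio = toy_df.isnull().mean() * 100
print(null_ratio)
```

The same expression applied to `train_df` would give the percentage of missing values for every column at once.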

In [None]:
train_df.head()

In [None]:
train_df.info()
# Data has no null value, pretty clean

In [None]:
train_df.describe(include='all')
# On BHK_OR_RK field, there are 29427 BHKs out of 29451, might not be useful in our final model

## Check the distribution of the target column and what might be most related to it.
Size (SQUARE_FT) should be strongly correlated with price.
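
One quick way to put a number on that expectation is a correlation coefficient. The sketch below uses made-up values (`toy_df` is an assumption, not the real data) and compares Pearson correlation with a rank-based (Spearman-style) correlation, which is less sensitive to a single extreme listing:

```python
import pandas as pd

# Made-up sizes and prices; the last row is an artificially huge property.
toy_df = pd.DataFrame({
    'SQUARE_FT': [500.0, 800.0, 1000.0, 1500.0, 2000.0, 20000.0],
    'PRICE': [20.0, 35.0, 45.0, 70.0, 95.0, 100.0],
})

# Pearson correlation measures linear association and is pulled down by the outlier.
pearson = toy_df['SQUARE_FT'].corr(toy_df['PRICE'])

# Spearman correlation is the Pearson correlation of the ranks,
# so it only cares whether price rises with size.
spearman = toy_df['SQUARE_FT'].rank().corr(toy_df['PRICE'].rank())

print('Pearson:', pearson)
print('Spearman:', spearman)
```

If the two numbers differ a lot on the real data, that is a hint that outliers are distorting the linear relationship.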

In [None]:
fig,ax = plt.subplots(ncols=2,nrows=3,dpi=100,figsize=(20,20))
sns.distplot(a=train_df['TARGET(PRICE_IN_LACS)'], kde=False, ax=ax[0][0])
df1 = train_df[train_df['TARGET(PRICE_IN_LACS)']<500]
sns.distplot(a=df1['TARGET(PRICE_IN_LACS)'], kde=False, ax=ax[0][1])


sns.distplot(a=train_df['SQUARE_FT'], kde=False, ax=ax[1][0])
df1 = train_df[train_df['SQUARE_FT'] < 25000]
sns.distplot(a=df1['SQUARE_FT'], kde=False, ax=ax[1][1])


sns.scatterplot(x=train_df['SQUARE_FT'], y=train_df['TARGET(PRICE_IN_LACS)'], ax=ax[2][0])
df1 = train_df[(train_df['TARGET(PRICE_IN_LACS)']<500) & (train_df['SQUARE_FT'] < 25000)]
sns.scatterplot(x=df1['SQUARE_FT'], y=df1['TARGET(PRICE_IN_LACS)'], ax=ax[2][1])

ax[0][0].set_title('TARGET(PRICE_IN_LACS) Histogram')
ax[0][1].set_title('TARGET(PRICE_IN_LACS) Histogram (<500)')

ax[1][0].set_title('SQUARE_FT Histogram')
ax[1][1].set_title('SQUARE_FT Histogram (<25000)')

ax[2][0].set_title('SQUARE_FT vs TARGET(PRICE_IN_LACS)')
ax[2][1].set_title('SQUARE_FT vs TARGET(PRICE_IN_LACS) without outlier')
# ax[1].set_title('AC power & DC power during day hours')

### We can see that SQUARE_FT is somewhat correlated with the target price. There are a lot of outliers with a very low Price/SQF. We need to keep exploring other fields to see if we can find something useful to address them.

#### For categorical fields, we can see all the distinct values

In [None]:
for x in train_df.columns:
    if train_df[x].dtype != 'float64':        
        print(x, train_df[x].unique())
        print('-'*10)


In [None]:
fig,ax = plt.subplots(ncols=2,nrows=3,dpi=100,figsize=(20,20))
df1 = train_df[(train_df['TARGET(PRICE_IN_LACS)']<500) & (train_df['SQUARE_FT'] < 25000)]
sns.scatterplot(x=df1['SQUARE_FT'], y=df1['TARGET(PRICE_IN_LACS)'], hue=df1['POSTED_BY'], ax=ax[0][0])
sns.scatterplot(x=df1['SQUARE_FT'], y=df1['TARGET(PRICE_IN_LACS)'], hue=df1['UNDER_CONSTRUCTION'], ax=ax[0][1])
sns.scatterplot(x=df1['SQUARE_FT'], y=df1['TARGET(PRICE_IN_LACS)'], hue=df1['RERA'], ax=ax[1][0])
sns.scatterplot(x=df1['SQUARE_FT'], y=df1['TARGET(PRICE_IN_LACS)'], hue=df1['BHK_OR_RK'], ax=ax[1][1])
sns.scatterplot(x=df1['SQUARE_FT'], y=df1['TARGET(PRICE_IN_LACS)'], hue=df1['READY_TO_MOVE'], ax=ax[2][0])
sns.scatterplot(x=df1['SQUARE_FT'], y=df1['TARGET(PRICE_IN_LACS)'], hue=df1['RESALE'], ax=ax[2][1])


## Check correlation heatmap between those features

In [None]:
#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=10)

## Map POSTED_BY to an integer code
posted_by_map = {'Owner':1, 'Dealer':2, 'Builder':3}
train_df['POSTED_BY_CODE'] = train_df['POSTED_BY'].map(posted_by_map)

candidates_col = ['POSTED_BY_CODE', 'UNDER_CONSTRUCTION', 'RERA', 'BHK_NO.', 'READY_TO_MOVE', 'RESALE']
correlation_heatmap(train_df[candidates_col])

### From the diagrams above, we can draw these conclusions
1. These fields have a very limited effect on price. A model built on them alone is unlikely to work.
2. POSTED_BY is the exception: houses posted by owners are priced slightly lower than the others.
3. UNDER_CONSTRUCTION and READY_TO_MOVE carry the same information (corr == -1), so we only need to keep one of them for model training.
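
The third point can be double-checked directly instead of reading it off the heatmap. A minimal sketch on made-up flags (`toy_df` is an assumption): two 0/1 columns are exact complements when they always sum to 1, which also gives a correlation of -1.

```python
import pandas as pd

# Made-up 0/1 flags that mirror each other, like the two real columns seem to.
toy_df = pd.DataFrame({
    'UNDER_CONSTRUCTION': [0, 1, 0, 0, 1],
    'READY_TO_MOVE':      [1, 0, 1, 1, 0],
})

# Exact complements always sum to 1 row by row.
is_complement = (toy_df['UNDER_CONSTRUCTION'] + toy_df['READY_TO_MOVE'] == 1).all()

# Their Pearson correlation is then -1 (up to floating point error).
corr = toy_df['UNDER_CONSTRUCTION'].corr(toy_df['READY_TO_MOVE'])

print(is_complement, corr)
```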

### This convinces me even more that I need extra location-related features

### Let's encode those categorical fields first, then explore the features below further
* BHK_NO.
* ADDRESS
* LONGITUDE/LATITUDE

In [None]:
## Generate a unit price feature for each house
train_df['Price/SQF'] = train_df['TARGET(PRICE_IN_LACS)']/train_df['SQUARE_FT'] * 1000


## Explore BHK_OR_RK
#### BHK_OR_RK is too imbalanced, so we won't use it in model training

In [None]:
## Check how Price/SQF relates to BHK_OR_RK
train_df[['BHK_OR_RK','Price/SQF']].groupby('BHK_OR_RK', as_index=False).agg({'Price/SQF':['mean','count']})

## Explore BHK_NO.

In [None]:
## Check how Price/SQF relates to BHK_NO.
train_df[['BHK_NO.','Price/SQF']].groupby('BHK_NO.', as_index=False).agg({'Price/SQF':['mean','count']})

### We can see there is very little price data for houses with more than 7 rooms, and the unit price is very erratic when BHK_NO. > 7.
So let's replace BHK_NO. > 7 with 2 (the mode of the dataset)
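
A vectorized variant of this replacement could look like the sketch below. It computes the mode instead of hard-coding it; the series `bhk` is a made-up example, not the real column.

```python
import pandas as pd

# Made-up room counts; 8 and 12 play the role of the rare large values.
bhk = pd.Series([1, 2, 2, 3, 2, 8, 12, 4], name='BHK_NO.')

# mode() returns a Series (there can be ties); take the first value.
mode_value = bhk.mode().iloc[0]

# where() keeps values that satisfy the condition and replaces the rest.
capped = bhk.where(bhk <= 7, mode_value)

print(capped.tolist())
```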

In [None]:
train_df['BHK_NO.'] = train_df['BHK_NO.'].apply(lambda x: 2 if x > 7 else x)

### Explore ADDRESS / LONGITUDE / LATITUDE
* In the address field we can see the street, sometimes a building or block number, and the city at the end
    * e.g. Ksfc Layout,Bangalore
    * Sector-1 Vaishali,Ghaziabad
* Based on the initial assumption, city is one of the most decisive factors for price

### Looks like we can derive the city from each address by splitting on ',' and taking the last element.
### So let's do it.
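
For reference, the same split can be sketched with pandas string methods, which handle street and city in one pass. The third address and the whitespace stripping are my own assumptions for illustration; the first two addresses are the examples quoted above.

```python
import pandas as pd

addresses = pd.Series([
    'Ksfc Layout,Bangalore',
    'Sector-1 Vaishali,Ghaziabad',
    'Block A, Phase 2, Pune',   # hypothetical address with extra commas and spaces
])

# Split once from the right: everything before the last comma is the street,
# the last element is the city.
parts = addresses.str.rsplit(',', n=1, expand=True)
street = parts[0].str.strip()
city = parts[1].str.strip()

print(city.tolist())
```

Stripping whitespace guards against the same city being counted twice as `'Pune'` and `' Pune'`, if such rows exist.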


In [None]:
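def extract_street(x):
    # Everything before the last comma is treated as the street part
    parts = x.split(',')
    return ' '.join(parts[:-1])

# The city is the last comma-separated element of the address
train_df['City'] = train_df['ADDRESS'].apply(lambda x: x.split(',')[-1])
train_df['Street'] = train_df['ADDRESS'].apply(extract_street)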
city_count = train_df.groupby('City', as_index=False)['Price/SQF'].count()
city_count.rename(columns={'Price/SQF':'count'},inplace=True)
print("City number of the dataset:", city_count.shape[0])

In [None]:
city_count.sort_values(by='count').head()

In [None]:
city_count.sort_values(by='count').tail()

### Looks like we now have the 'City'. We can see well-known cities like Bangalore, Lalitpur, Mumbai, and Pune dominating the dataset, which is expected.

### Let's verify whether city greatly impacts the unit price (Price/SQF)

In [None]:
df = train_df[['City','Price/SQF']] \
        .groupby('City', as_index=False)['Price/SQF'] \
        .mean() \
        .sort_values(by='Price/SQF', ascending=False)
df.head()

In [None]:
## Showing the top 20 only, there are 256 in total.
plt.figure(figsize=(14,7))
g = sns.barplot(x=df.head(20)['City'], y=df.head(20)['Price/SQF'])
g.set_xticklabels(g.get_xticklabels(), rotation=30)

### Looks like we are getting somewhere, but Hajipur and Haldia are outliers: they have only 1 row each in train_df.

### Let's remove them and see how it goes.

In [None]:
## Showing the top 30 only, there are 256 in total.
df = df[~df['City'].isin(['Hajipur','Haldia'])]
plt.figure(figsize=(14,7))
g = sns.barplot(x=df.head(30)['City'], y=df.head(30)['Price/SQF'])
g.set_xticklabels(g.get_xticklabels(), rotation=30)

### City could be a very decisive feature, so we will include it in the model. But what might differentiate prices within the same city?
* As mentioned before, the CBD or downtown would have higher prices than the suburbs

### So let's check if LONGITUDE/LATITUDE can help with that
#### Initial thoughts on how to use these 2 features
1. Assuming the CBD has the highest house prices, we can rank the houses in each city by price, take the top percentile, and use their average long/lat as the city center (named C_LONG, C_LAT)
2. Calculate the distance to (C_LONG, C_LAT). The further from the CBD, the lower the price should be
3. Calculate the average price for each city as a price baseline, assuming that at prediction time each house fluctuates around its own city's average. In short, the training set and the prediction set should presumably share the same average price.
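
The three steps above can be sketched on a single toy city. Everything in `toy_city` is made up, and `top_n = 2` is an arbitrary stand-in for "top percentile"; the distance is a plain straight-line distance in degrees, which is a rough approximation rather than a true distance in kilometres.

```python
import numpy as np
import pandas as pd

# Made-up listings for one city: prices fall as we move away from the first rows.
toy_city = pd.DataFrame({
    'LONGITUDE': [19.00, 19.02, 19.10, 19.30, 19.50],
    'LATITUDE':  [72.80, 72.82, 72.90, 73.00, 73.20],
    'Price/SQF': [30.0, 28.0, 15.0, 9.0, 6.0],
})

# Step 1: city center = mean coordinates of the most expensive listings.
top_n = 2
centre = toy_city.nlargest(top_n, 'Price/SQF')[['LONGITUDE', 'LATITUDE']].mean()
c_long, c_lat = centre['LONGITUDE'], centre['LATITUDE']

# Step 2: straight-line distance (in degrees) from each house to the center.
toy_city['distance'] = np.hypot(toy_city['LONGITUDE'] - c_long,
                                toy_city['LATITUDE'] - c_lat)

# Step 3: the city's average unit price, used as a baseline feature.
city_mean_price = toy_city['Price/SQF'].mean()

print(c_long, c_lat, city_mean_price)
print(toy_city[['distance', 'Price/SQF']])
```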

### Now, let's use Mumbai as an example to explore

In [None]:
Mumbai_df = train_df[train_df['City']=='Mumbai'].copy()
fig,ax = plt.subplots(ncols=2,nrows=1,dpi=100,figsize=(20,5))
sns.distplot(a=Mumbai_df['LONGITUDE'], kde=False, ax=ax[0])
sns.distplot(a=Mumbai_df['LATITUDE'], kde=False, ax=ax[1])

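### There are some outliers
* Geographically, India spans roughly latitude 6.75 to 35.5 and longitude 68.12 to 97.42. In this dataset the LONGITUDE column appears to hold the latitude values and the LATITUDE column the longitude values, so the check below expects LONGITUDE in (6.75, 35.5) and LATITUDE in (68.12, 97.42)
* Neither value should be negative

### So there is a data quality issue here. Does that mean we can't use these features?

### Not yet. Not all the values are wrong, so maybe we can derive an approximate distance for the bad ones.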
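
The range check can also be written without a row-wise `apply`. This is only a sketch on made-up coordinates (`toy_df` is an assumption); it uses the same bounds as the notebook and the same column convention.

```python
import pandas as pd

# Made-up coordinates: rows 0 and 2 fall inside the bounds, rows 1 and 3 do not.
toy_df = pd.DataFrame({
    'LONGITUDE': [19.07, -39.9, 12.97, 59.9],
    'LATITUDE':  [72.88, -71.0, 77.59, 10.7],
})

# Element-wise comparisons on whole columns, combined with a logical AND.
in_india = (
    (toy_df['LONGITUDE'] > 6.75) & (toy_df['LONGITUDE'] < 35.5)
    & (toy_df['LATITUDE'] > 68.12) & (toy_df['LATITUDE'] < 97.42)
)
toy_df['IS_IN_INDIA'] = in_india.astype(int)

print(toy_df)
```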

In [None]:
def is_in_India(long, lat):
    if 6.75 < long < 35.5 and 68.12 < lat < 97.42:
        return 1
    else:
        return 0

Mumbai_df['IS_IN_INDIA'] = Mumbai_df.apply(lambda x: is_in_India(x['LONGITUDE'], x['LATITUDE']), axis=1)
print(Mumbai_df[['IS_IN_INDIA','Price/SQF']].groupby('IS_IN_INDIA',as_index=False).count())


In [None]:
city_df = train_df.copy()
city_df['IS_IN_INDIA'] = city_df.apply(lambda x: is_in_India(x['LONGITUDE'], x['LATITUDE']), axis=1)
city_df = city_df.merge(city_count, on='City')
df = city_df[['City','IS_IN_INDIA','count']].groupby(['City','count'],as_index=False).sum()
df['outlier_ratio'] = (1 - df['IS_IN_INDIA']/df['count'])*100
df.sort_values(by='outlier_ratio',ascending=False).head()

In [None]:
## How many cities have count < 10
(df['count'] < 10).sum()

### Only 1 city has no long/lat inside India. I think that's good enough to derive C_LONG/C_LAT and calculate a distance for each house, if we make the assumptions below
1. Big cities have lots of price data, while small cities have very little.
2. A small city's house prices are very close to its neighbor's.
3. Deriving distance from the neighbor's C_LONG/C_LAT might give a larger distance than the real one, but I don't think it has much impact. If the neighbor is also a small city, prices in its center are still very low compared to a big city. If the neighbor is a big city, the larger distance simply means a larger decline from a higher price.

### Below are the basic idea and the code I came up with after considering the different edge cases (feel free to skip it if it's too detailed). If the rules don't apply, I just remove the data from the training set


1. Divide the cities into big and small
2. For a big city
    - Get the mode of long/lat
    - Get C_LONG/C_LAT = average LONG/LAT of the houses in the top 1% ranked by price (LONG/LAT has to be in India)
    - If a long/lat is far from the mode, replace it with the mode so the house is treated as being in the same city
    - Calculate the distance between LONG/LAT and C_LONG/C_LAT
    - Record C_LONG/C_LAT and the mean price for each big city in a lookup table (used later for feature engineering on the test set)
3. For a small city
    - Get the mean long/lat of the rows that are in India
    - Find its neighbor by comparing against the big cities' C_LONG/C_LAT
    - Use the distance to the nearest big-city center as the distance, and the mean long/lat as C_LONG/C_LAT
    - Record C_LONG/C_LAT and the mean price in the lookup table (used later for feature engineering on the test set)
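
The "find the neighbor" step for a small city can be sketched as a nearest-center lookup. The helper `nearest_centre`, the `centres` table, and the city names are all hypothetical; the column names follow the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical table of big-city centers, shaped like long_lat_center_df.
centres = pd.DataFrame({
    'City': ['BigA', 'BigB'],
    'C_LONG': [19.0, 12.9],
    'C_LAT': [72.8, 77.6],
    'Price/SQF Mean': [20.0, 8.0],
})

def nearest_centre(long, lat, centres):
    """Return (city, distance) of the closest center, in degrees."""
    d = np.hypot(centres['C_LONG'] - long, centres['C_LAT'] - lat)
    idx = d.idxmin()  # NaN distances (centers without coordinates) are skipped
    return centres.loc[idx, 'City'], float(d.loc[idx])

# A small city whose mean coordinates sit close to BigB.
city, dist = nearest_centre(13.0, 77.5, centres)
print(city, dist)
```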
    

In [None]:
city_count = train_df.groupby('City', as_index=False)['Price/SQF'].count()
city_count.rename(columns={'Price/SQF':'count'},inplace=True)
# print('City less than 100 price', city_count[city_count['count']<10].shape[0])
# city_count = city_count[city_count['count'] > 10]
large_cities = city_count[city_count['count'] > 10]['City'].unique().tolist()
small_cities = city_count[city_count['count'] <= 10]['City'].unique().tolist()

def is_in_India(long, lat):
    if 6.75 < long < 35.5 and 68.12 < lat < 97.42:
        return 1
    else:
        return 0
    
city_df = train_df.merge(city_count, on='City', how='inner')
city_df['IS_COOR_INDIA'] = city_df.apply(lambda x: is_in_India(x['LONGITUDE'], x['LATITUDE']), axis=1)


long_lat_list = []

def fix_long_lat(l, l_mode):
    if l < l_mode - 0.3:
        return l_mode
    if l > l_mode + 0.3:
        return l_mode
    return l

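def get_distance_to_center(long, lat, c_long, c_lat, is_coor_in_india):
    # The flag is IS_COOR_INDIA: 1 = coordinates inside India, 0 = not.
    # Rows with unusable coordinates get a fixed default distance.
    if is_coor_in_india == 0:
        return 0.25
    else:
        return np.sqrt((long-c_long)**2 + (lat-c_lat)**2)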

city_df_with_dist_list = []

for city in large_cities:
    df = city_df[(city_df['City'] == city) & (city_df['IS_COOR_INDIA'] == 1)].copy()
    long_mode = df['LONGITUDE'].mode().values[0]
    lat_mode = df['LATITUDE'].mode().values[0]
    count = df['count'].mode().values[0]
    df['LONG_MODE'] = long_mode
    df['LAT_MODE'] = lat_mode
    df['LONGITUDE'] = df.apply(lambda x: fix_long_lat(x['LONGITUDE'],x['LONG_MODE']), axis=1)
    df['LATITUDE'] = df.apply(lambda x: fix_long_lat(x['LATITUDE'],x['LAT_MODE']), axis=1)
    
    # city_suburbs.head()
    
    top_percentile = 1 if count < 10 else (10 if count < 100 else int(count * .01))
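    # Sort descending so that head() picks the most expensive listings, as described above
    city_center_long_lat = df.sort_values(by='Price/SQF', ascending=False).head(top_percentile)[['LONGITUDE','LATITUDE']].mean().values.tolist()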
    city_center_long = city_center_long_lat[0]
    city_center_lat = city_center_long_lat[1]
    df['distance'] = df.apply(lambda x: get_distance_to_center(x['LONGITUDE'], x['LATITUDE'],city_center_long, city_center_lat, x['IS_COOR_INDIA']), \
                              axis=1)
    city_df_with_dist_list.append(df)
    city_mean_price = df['Price/SQF'].mean()
    long_lat_list.append([city, city_center_long, city_center_lat, city_mean_price])

long_lat_center_df = pd.DataFrame(data=long_lat_list, columns=['City','C_LONG','C_LAT', 'Price/SQF Mean'])

# long_lat_center_df
for city in small_cities:
    df = city_df[city_df['City'] == city].copy()
    if df[df['IS_COOR_INDIA']==1].shape[0] > 1:
        long_mean = df[df['IS_COOR_INDIA'] == 1]['LONGITUDE'].mean()
        lat_mean = df[df['IS_COOR_INDIA'] == 1]['LATITUDE'].mean()
        ll_df = long_lat_center_df.copy()
        ll_df['LONG_MEAN'] = long_mean
        ll_df['LAT_MEAN'] = lat_mean
        ll_df['distance'] = ll_df.apply(lambda x: get_distance_to_center(x['LONG_MEAN'], x['LAT_MEAN'], x['C_LONG'], x['C_LAT'], 1), axis=1)
        min_distance = ll_df.sort_values(by='distance')['distance'].values[0]
        df['distance'] = min_distance
        city_df_with_dist_list.append(df)
        city_center_long = df[df['IS_COOR_INDIA'] == 1]['LONGITUDE'].mean()
        city_center_lat = df[df['IS_COOR_INDIA'] == 1]['LATITUDE'].mean()
        city_mean_price = df[df['IS_COOR_INDIA'] == 1]['Price/SQF'].mean()
        long_lat_list.append([city, city_center_long, city_center_lat, city_mean_price])
    else:
        df['distance'] = 0
        city_df_with_dist_list.append(df)
        city_mean_price = df['Price/SQF'].mean()
        long_lat_list.append([city, None, None, city_mean_price])

long_lat_center_df = pd.DataFrame(data=long_lat_list, columns=['City','C_LONG','C_LAT', 'Price/SQF Mean'])
city_df_with_dist = pd.concat(city_df_with_dist_list)
city_df_with_dist.head()

### Now we've derived the features below to represent the location factor
* distance: how far the house is from the CBD (of its own city, or of the neighbor if the long/lat is wrong)
* Price/SQF Mean: baseline house price for each city

### Let's check the distribution of each feature

In [None]:
fig,ax = plt.subplots(ncols=2,nrows=1,dpi=100,figsize=(20,5))
df = city_df_with_dist[city_df_with_dist['Price/SQF'] < 1500] # remove outlier
df = df.merge(long_lat_center_df, on='City', how='inner')
df = df[df['Price/SQF Mean'] < 1500] 
sns.scatterplot(x=df['distance'], y=df['Price/SQF'], ax=ax[0])
sns.scatterplot(x=df['Price/SQF Mean'], y=df['Price/SQF'], ax=ax[1])
ax[0].set_title('Distance')
ax[1].set_title('Price/SQF Mean')

### Now we can start training the model. Here I'm using an XGBoost regressor with hyperparameter tuning
### Here are the features being used
* POSTED_BY_CODE
* UNDER_CONSTRUCTION
* RERA
* BHK_NO.
* RESALE
* distance
* SQUARE_FT
* Price/SQF Mean

### Target
* TARGET(PRICE_IN_LACS) (I'm not predicting Price/SQF, because there seems to be a size effect: a small house might have a higher unit price, and vice versa)

#### Set up the parameter tuning function

In [None]:
def hyper_param_tuning(X_train, X_test, Y_train, Y_test):
    params = {
        # Parameters that we are going to tune.
        'max_depth':6,
        'min_child_weight': 1,
        'eta':.3,
        'subsample': 1,
        'colsample_bytree': 1,
        # Other parameters
        'objective':'reg:squarederror',
    }
    dtrain = xgb.DMatrix(X_train, label=Y_train)
    dtest = xgb.DMatrix(X_test, label=Y_test)

    params['eval_metric'] = "mae"
    num_boost_round = 999

    gridsearch_params = [
        (max_depth, min_child_weight)
        for max_depth in range(5,20)
        for min_child_weight in range(3,10)
    ]
    min_mae = float("Inf")
    best_params = None
    for max_depth, min_child_weight in gridsearch_params:
        print("CV with max_depth={}, min_child_weight={}".format(
                                 max_depth,
                                 min_child_weight))
        # Update our parameters
        params['max_depth'] = max_depth
        params['min_child_weight'] = min_child_weight
        # Run CV
        cv_results = xgb.cv(
            params,
            dtrain,
            num_boost_round=num_boost_round,
            seed=42,
            nfold=5,
            metrics={'mae'},
            early_stopping_rounds=10
        )
        # Update best MAE
        mean_mae = cv_results['test-mae-mean'].min()
        boost_rounds = cv_results['test-mae-mean'].argmin()
        print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
        if mean_mae < min_mae:
            min_mae = mean_mae
            best_params = (max_depth,min_child_weight)
    print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

In [None]:
city_df_with_dist_df = city_df_with_dist.merge(long_lat_center_df, on='City', how='inner')
selected_cols = ['POSTED_BY_CODE', 'UNDER_CONSTRUCTION', 'RERA', 'BHK_NO.','RESALE','distance', 'SQUARE_FT', 'Price/SQF Mean'] # removed READY_TO_MOVE
target_col = 'TARGET(PRICE_IN_LACS)'
X = city_df_with_dist_df[selected_cols]
Y = city_df_with_dist_df[target_col]

# Create train & test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.25, random_state=1)
hyper_param_tuning(X_train, X_test, Y_train, Y_test)


### Start training the model

In [None]:
xgb_r = xgb.XGBRegressor(objective ='reg:squarederror', 
                  max_depth=12, min_child_weight=3)
xgb_r.fit(X_train, Y_train) 
  
# Predict the model 
Y_test_pred = xgb_r.predict(X_test) 
# test_predictedvalues = np.exp(test_predictedvalues) - 1

plt.figure(figsize=(8,8))
plt.scatter(Y_test, Y_test_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.tight_layout()

# Model evaluation: score() returns R^2 for a regressor, not accuracy
r2 = xgb_r.score(X_test, Y_test)
print("R^2 score is ", r2)
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_test_pred))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_test_pred)))

### The diagram looks to be in very good shape, but looking carefully there are some negative values at the bottom. That is not something we want (a negative house price doesn't make sense).

In [None]:
(Y_test_pred < 0).sum()

### To avoid that, we can try transforming Y -> np.log(Y+1) instead.
### So let's try again with the transformation
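
A small sketch of the transformation and its inverse, on made-up prices. `np.log1p` and `np.expm1` are the numerically stable equivalents of `np.log(1 + y)` and `np.exp(y) - 1`. Note that the back-transform is bounded below by -1, so a prediction in log space can no longer turn into a large negative price, although values between -1 and 0 are still possible in principle.

```python
import numpy as np

# Made-up prices in lacs, including a very large one.
prices = np.array([0.0, 25.0, 62.5, 30000.0])

# Forward transform used for training.
y_log = np.log1p(prices)

# Inverse transform used on predictions.
restored = np.expm1(y_log)

# Even a strongly negative prediction in log space stays above -1 after the inverse.
low_pred = np.expm1(-5.0)

print(y_log)
print(restored)
print(low_pred)
```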

In [None]:
Y_train = np.log(1+Y_train)
Y_test = np.log(1+Y_test)
hyper_param_tuning(X_train, X_test, Y_train, Y_test)

In [None]:
# Instantiation 
# objective = 'count:poisson'
xgb_r = xgb.XGBRegressor(objective ='reg:squarederror', 
                  max_depth=9, min_child_weight=5)
xgb_r.fit(X_train, Y_train) 

# Predict the model
Y_test_pred = xgb_r.predict(X_test) 
# test_predictedvalues = np.exp(test_predictedvalues) - 1

plt.figure(figsize=(8,8))
plt.scatter(Y_test, Y_test_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.tight_layout()

# Model evaluation: score() returns R^2; all metrics here are on the log-transformed target
r2 = xgb_r.score(X_test, Y_test)
print("R^2 score is ", r2)
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_test_pred))
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_test_pred)))

### Cool, I think we have a model now. We just need to redo the feature engineering for test.csv
1. Label encoding for the categorical fields: POSTED_BY, BHK_NO.
2. Enrich with the City field
3. Derive distance and Price/SQF Mean (we can look them up in the table created previously)
4. If there's a new city in test.csv, find its neighbor city, take Price/SQF Mean from the neighbor, and calculate the distance based on the neighbor's C_LONG/C_LAT
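
Steps 3 and 4 boil down to a lookup with a fallback. Here is a minimal sketch with made-up values: `city_mean_price`, `neighbour_of`, and the city `Newtown` are hypothetical, and in the real notebook the neighbor would come from the nearest-center search rather than a hand-written dict.

```python
import pandas as pd

# Hypothetical lookup built from the training set.
city_mean_price = {'Mumbai': 20.0, 'Pune': 9.0}

# Toy test rows; 'Newtown' was never seen during training.
test_toy = pd.DataFrame({'City': ['Pune', 'Mumbai', 'Newtown']})

# Hypothetical result of the neighbor search for unseen cities.
neighbour_of = {'Newtown': 'Pune'}

# Step 3: direct lookup; unseen cities come back as NaN.
test_toy['Price/SQF Mean'] = test_toy['City'].map(city_mean_price)

# Step 4: fill the gaps with the neighbor's mean price.
missing = test_toy['Price/SQF Mean'].isnull()
test_toy.loc[missing, 'Price/SQF Mean'] = (
    test_toy.loc[missing, 'City'].map(neighbour_of).map(city_mean_price)
)

print(test_toy)
```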

In [None]:
test_file = '/kaggle/input/house-price-prediction-challenge/test.csv'
test_df = pd.read_csv(test_file)

posted_by_map = {'Owner':1, 'Dealer':2, 'Builder':3}
test_df['POSTED_BY_CODE'] = test_df['POSTED_BY'].map(posted_by_map)
test_df['BHK_NO.'] = test_df['BHK_NO.'].apply(lambda x: 2 if x > 7 else x)
test_df['City'] = test_df['ADDRESS'].apply(lambda x: x.split(',')[-1])


In [None]:
city_count = test_df.groupby('City', as_index=False)['POSTED_BY'].count()
city_count.rename(columns={'POSTED_BY':'count'},inplace=True)

def is_in_India(long, lat):
    if 6.75 < long < 35.5 and 68.12 < lat < 97.42:
        return 1
    else:
        return 0
    
city_df = test_df.merge(city_count, on='City', how='inner')
city_df['IS_COOR_INDIA'] = city_df.apply(lambda x: is_in_India(x['LONGITUDE'], x['LATITUDE']), axis=1)

In [None]:
def get_distance_to_center(long, lat, c_long, c_lat, is_coor_in_india):
    # The flag is IS_COOR_INDIA: 1 = coordinates inside India, 0 = not.
    if is_coor_in_india == 0:
        return 0
    else:
        return np.sqrt((long-c_long)**2 + (lat-c_lat)**2)
    
city_df = city_df.merge(long_lat_center_df, on='City', how='left')
city_df['distance'] = city_df.apply(lambda x: get_distance_to_center(x['LONGITUDE'], x['LATITUDE'], x['C_LONG'], x['C_LAT'], x['IS_COOR_INDIA']), axis=1)
print('Null Price/SQF mean',city_df['Price/SQF Mean'].isnull().sum())
print('We need to derive those with below code')



In [None]:

def fix_long_lat(l, l_mode):
    if l < l_mode - 0.3:
        return l_mode
    if l > l_mode + 0.3:
        return l_mode
    return l

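def get_distance_to_center(long, lat, c_long, c_lat, is_coor_in_india):
    # The flag is IS_COOR_INDIA: 1 = coordinates inside India, 0 = not.
    # Rows with unusable coordinates get a fixed default distance.
    if is_coor_in_india == 0:
        return 0.25
    else:
        return np.sqrt((long-c_long)**2 + (lat-c_lat)**2)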

large_cities = city_count[city_count['count'] > 10]['City'].unique().tolist()
small_cities = city_count[city_count['count'] <= 10]['City'].unique().tolist()

city_df_with_dist_list = []

for city in large_cities:
    df = city_df[city_df['City'] == city].copy()
    long_mode = df['LONGITUDE'].mode().values[0]
    lat_mode = df['LATITUDE'].mode().values[0]
    count = df['count'].mode().values[0]
    df['LONG_MODE'] = long_mode
    df['LAT_MODE'] = lat_mode
    df['LONGITUDE'] = df.apply(lambda x: fix_long_lat(x['LONGITUDE'],x['LONG_MODE']), axis=1)
    df['LATITUDE'] = df.apply(lambda x: fix_long_lat(x['LATITUDE'],x['LAT_MODE']), axis=1)    
#     print(df.columns)
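    # Note: passing False (== 0) makes get_distance_to_center return its 0.25 default
    # for every large-city row, so no per-house distance is computed here.
    df['distance'] = df.apply(lambda x: get_distance_to_center(x['LONGITUDE'], x['LATITUDE'],x['C_LONG'], x['C_LAT'], False), \
                              axis=1)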
    city_df_with_dist_list.append(df)
    
    
# long_lat_center_df
for city in small_cities:
    df = city_df[city_df['City'] == city].copy()
    if df[df['IS_COOR_INDIA']==1].shape[0] > 1:
        long_mean = df[df['IS_COOR_INDIA'] == 1]['LONGITUDE'].mean()
        lat_mean = df[df['IS_COOR_INDIA'] == 1]['LATITUDE'].mean()
        ll_df = long_lat_center_df.copy()
        ll_df['LONG_MEAN'] = long_mean
        ll_df['LAT_MEAN'] = lat_mean
        ll_df['distance'] = ll_df.apply(lambda x: get_distance_to_center(x['LONG_MEAN'], x['LAT_MEAN'], x['C_LONG'], x['C_LAT'], True), axis=1)
        min_distance = ll_df.sort_values(by='distance')['distance'].values[0]
        df['distance'] = min_distance
        city_df_with_dist_list.append(df)

    else:
        df['distance'] = 0
        city_df_with_dist_list.append(df)

city_df_with_dist = pd.concat(city_df_with_dist_list)


city_with_null_mean_price = city_df_with_dist[city_df_with_dist['Price/SQF Mean'].isnull()]['City'].unique()
price_distance_map = {}

for city in city_with_null_mean_price:
    df = city_df_with_dist[city_df_with_dist['City'] == city].copy()
    if df[df['IS_COOR_INDIA']==1].shape[0] > 1:
        long_mean = df[df['IS_COOR_INDIA'] == 1]['LONGITUDE'].mean()
        lat_mean = df[df['IS_COOR_INDIA'] == 1]['LATITUDE'].mean()
        ll_df = long_lat_center_df.copy()
        ll_df['LONG_MEAN'] = long_mean
        ll_df['LAT_MEAN'] = lat_mean
        ll_df['distance'] = ll_df.apply(lambda x: get_distance_to_center(x['LONG_MEAN'], x['LAT_MEAN'], x['C_LONG'], x['C_LAT'], 1), axis=1)
        values = ll_df.sort_values(by='distance')[['distance','Price/SQF Mean']].head(1).values.tolist()
        min_distance = values[0][0]
        price_mean = values[0][1]
        price_distance_map[city] = (min_distance, price_mean)
    else:
        long_mean = df['LONGITUDE'].mean()
        lat_mean = df['LATITUDE'].mean()
        ll_df = long_lat_center_df.copy()
        ll_df['LONG_MEAN'] = long_mean
        ll_df['LAT_MEAN'] = lat_mean
        ll_df['distance'] = ll_df.apply(lambda x: get_distance_to_center(x['LONG_MEAN'], x['LAT_MEAN'], x['C_LONG'], x['C_LAT'], 0), axis=1)
        min_distance = ll_df.sort_values(by='distance')['distance'].values[0]
        values = ll_df.sort_values(by='distance')[['distance','Price/SQF Mean']].head(1).values.tolist()
        # wrong coordinate, find a closest city based on wrong coordinate, and assign distance 0
        # Data quality issue on the raw data, can't do much about it
        min_distance = 0 
        price_mean = values[0][1]
        price_distance_map[city] = (0, price_mean)

import math

def fillna_price_mean(city, price_mean):
    if math.isnan(price_mean):
        return price_distance_map[city][1]
    return price_mean

def fillna_distance(city, distance):
    if math.isnan(distance):
        return price_distance_map[city][0]
    return distance

city_df_with_dist['Price/SQF Mean'] = city_df_with_dist.apply(lambda x: fillna_price_mean(x['City'],x['Price/SQF Mean']), axis=1)
city_df_with_dist['distance'] = city_df_with_dist.apply(lambda x: fillna_distance(x['City'],x['distance']), axis=1)
city_df_with_dist.head()

### Prediction with the trained model

In [None]:
selected_cols = ['POSTED_BY_CODE', 'UNDER_CONSTRUCTION', 'RERA', 'BHK_NO.','RESALE','distance', 'SQUARE_FT', 'Price/SQF Mean'] # removed READY_TO_MOVE
# target_col = 'Price/SQF'
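# Rows were regrouped by city above; sort by index to restore the original test.csv row order
X = city_df_with_dist.sort_index()[selected_cols]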

test_predictedvalues = xgb_r.predict(X) 

### Do not forget to transform back from np.log(Y+1) -> np.exp(Y_pred) - 1

In [None]:
test_predictedvalues = np.exp(test_predictedvalues) - 1

### Submit

In [None]:
pd.DataFrame(test_predictedvalues).to_csv('submission.csv')

# Summary

### 1. Thanks for the dataset and the contest. The dataset is pretty clean and its structure is easy to understand
### 2. The useful features for house price prediction are (ranked by importance) ADDRESS, LONGITUDE, LATITUDE, POSTED_BY, and then the others
### 3. Some discussions say that longitude and latitude might not be useful, but I think they are very important features. Location always matters most for a property. It would be great if the long/lat values were correct. Without the data quality issue, I believe we could have a better model here.

## Thanks for reading and please upvote if you like it!