# India House Price Prediction Challenge

This is a regression problem: the target variable is continuous, and we use the remaining features to train our model.
The datasets contain the 12 variables listed below.


Column and its description:

* POSTED_BY - Category marking who has listed the property
* UNDER_CONSTRUCTION - Under construction or not
* RERA - RERA approved or not
* BHK_NO. - Number of rooms
* BHK_OR_RK - Type of property
* SQUARE_FT - Total area of the house in square feet
* READY_TO_MOVE - Category marking ready to move or not
* RESALE - Category marking resale or not
* ADDRESS - Address of the property
* LONGITUDE - Longitude of the property
* LATITUDE - Latitude of the property
* TARGET(PRICE_IN_LACS) - Price in lacs

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")
from sklearn.preprocessing import  StandardScaler
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV,train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [None]:
trainD=pd.read_csv('../input/house-price-prediction-challenge/train.csv')
testD =pd.read_csv('../input/house-price-prediction-challenge/test.csv')

In [None]:
#getting to know the kind of data we have
trainD.shape,testD.shape

In [None]:
trainD.head()

In [None]:
testD.head()

In [None]:
# display more details about the data: column dtypes and non-null counts
trainD.info()

The output above shows that no column has missing values.
It also lists the data type of each column.
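
The `info()` summary can be backed by an explicit count of missing values. Below is a minimal sketch on a small made-up frame (`toy` is a hypothetical stand-in for `trainD`; the column names mirror the dataset but the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for trainD; values are made up for illustration.
toy = pd.DataFrame({
    "POSTED_BY": ["Owner", "Dealer", "Builder", "Dealer"],
    "SQUARE_FT": [1300.0, 1275.0, np.nan, 929.0],
    "RERA": [0, 1, 0, 1],
})

# Count missing values per column; an all-zero result means no gaps.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single boolean answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print("Any missing values:", has_missing)
```

Running the same two calls on `trainD` should report zero for every column if the reading of `info()` above is right.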

In [None]:
#brief statistics  about numerical variables
trainD.describe()

The summary above shows some very expensive houses, priced at about 30000 lacs; these are probably luxury properties. The median price is about 62 lacs, so most houses are far cheaper than the maximum.
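
A large gap between the mean and the median is a quick sign of right skew. Here is a small sketch with invented prices (not taken from the dataset) that shows how one extreme listing pulls the mean far above the median:

```python
import pandas as pd

# Made-up prices in lacs: mostly modest values plus one extreme listing.
prices = pd.Series([25.0, 40.0, 55.0, 62.0, 70.0, 95.0, 30000.0])

mean_price = prices.mean()      # dragged upwards by the extreme value
median_price = prices.median()  # robust to the extreme value
skewness = prices.skew()        # positive for a right-skewed sample

print(f"mean={mean_price:.1f}, median={median_price:.1f}, skew={skewness:.2f}")
```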

In [None]:
#brief statistics  about categorical variables
trainD.describe(include=['O'])

The output above shows that we have 3 categorical variables and that ADDRESS has a very large number of unique values.

In [None]:
# let's examine the count of values in the POSTED_BY categorical variable
posted_counts = trainD['POSTED_BY'].value_counts()
posted_counts

The output above shows that most listings were posted by dealers, followed by owners; builders posted the fewest.

**Let's do an exploratory data analysis of the data we will use to train our models.**

In [None]:
# A count plot of the POSTED_BY variable
plt.figure(figsize=(10,8))
# countplot draws on the figure created above (catplot would open its own figure)
ax=sns.countplot(x='POSTED_BY',data=trainD)
plt.xlabel('POSTED_BY')
plt.ylabel('Count of POSTED_BY records')
plt.title('A count distribution of POSTED_BY categories')
plt.show()

In [None]:
UNDER_CONSTRUCTION_counts = trainD['UNDER_CONSTRUCTION'].value_counts()
UNDER_CONSTRUCTION_counts

In [None]:
# A plot of UNDER_CONSTRUCTION categories
plt.figure(figsize=(10,8))
ax=sns.countplot(x='UNDER_CONSTRUCTION',data=trainD)
plt.xlabel('UNDER_CONSTRUCTION')
plt.ylabel('Count of UNDER_CONSTRUCTION records')
plt.title('A count distribution of UNDER_CONSTRUCTION categories')
plt.show()

The counts and plot above show about 24k records for houses not under construction and about 5k records for houses under construction.

In [None]:
RERA_counts = trainD['RERA'].value_counts()
RERA_counts

In [None]:
# A plot of RERA categories
plt.figure(figsize=(10,8))
ax=sns.countplot(x='RERA',data=trainD)
plt.xlabel('RERA')
plt.ylabel('Count of RERA records')
plt.title('A count distribution of RERA categories')
plt.show()

The counts and plot above show about 20k records that are not approved by RERA and about 10k records that are approved.

In [None]:
BHK_NO_counts = trainD['BHK_NO.'].value_counts()
BHK_NO_counts

In [None]:
# A plot of BHK_NO. categories
plt.figure(figsize=(10,8))
ax=sns.countplot(x='BHK_NO.',data=trainD)
plt.xlabel('Number of rooms')
plt.ylabel('Count of houses')
plt.title('A count distribution of number of rooms')
plt.show()

The counts and plot above show that most houses have 2 or 3 rooms, followed by 1, 4 and 5. Higher room counts appear in only a few records.

In [None]:
BHK_OR_RKcounts = trainD['BHK_OR_RK'].value_counts()
BHK_OR_RKcounts

In [None]:
# A plot of property type
plt.figure(figsize=(10,8))
ax=sns.countplot(x='BHK_OR_RK',data=trainD)
plt.xlabel('Type of property')
plt.ylabel('Count of properties')
plt.title('A count distribution of type of property')
plt.show()

The counts and plot above show that BHK is by far the dominant property type, with close to 30k records, which is almost every listing.

In [None]:
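# visualise the distribution of square feet with a seaborn histogram and KDE
# (histplot replaces the deprecated distplot; bins=50 keeps the bin count bounded)
ax=sns.histplot(trainD.SQUARE_FT,bins=50,kde=True,stat='density')
plt.title('Square feet distribution')
plt.show()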

In [None]:
# decreasing the skewness in square feet feature using log transformation in both train and test data
trainD['SQUARE_FT'] = np.log(trainD['SQUARE_FT'])
testD['SQUARE_FT'] = np.log(testD['SQUARE_FT'])

In [None]:
# The plot is closer to a normal distribution now that the skewness has decreased
ax=sns.histplot(trainD.SQUARE_FT,bins=50,kde=True,stat='density')
plt.title('Distribution of the log of total house area in square feet')
plt.show()
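
The effect of the log transformation can also be checked numerically rather than only by eye. The sketch below uses synthetic log-normal "areas" (an assumption for illustration, not the real `SQUARE_FT` column) and compares skewness before and after. Note that `np.log` needs strictly positive values; `np.log1p` is a common alternative when zeros can occur.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed "areas"; not the real data.
area = pd.Series(rng.lognormal(mean=7.0, sigma=0.6, size=5000))

skew_raw = area.skew()
skew_log = np.log(area).skew()        # requires every value > 0
skew_log1p = np.log1p(area).skew()    # log(1 + x), also defined at 0

print(f"raw={skew_raw:.2f}, log={skew_log:.2f}, log1p={skew_log1p:.2f}")
```

A skewness close to zero after the transformation suggests the feature is now roughly symmetric.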

In [None]:
READY_TO_MOVE_counts = trainD['READY_TO_MOVE'].value_counts()
READY_TO_MOVE_counts

In [None]:
# A plot of the category marking ready to move or not
plt.figure(figsize=(10,8))
ax=sns.countplot(x='READY_TO_MOVE',data=trainD)
plt.xlabel('Ready to move or not')
plt.ylabel('Count of records')
plt.title('A count distribution of ready to move or not')
plt.show()

In [None]:
Resale_counts = trainD['RESALE'].value_counts()
Resale_counts

In [None]:
# A plot of resale records or not
plt.figure(figsize=(10,8))
ax=sns.countplot(x='RESALE',data=trainD)
plt.xlabel('Resale or not')
plt.ylabel('Count of records')
plt.title('A count distribution of resale or not')
plt.show()

The counts and plot above show that over 90% of the houses are available for resale; only a few are not.

**For this analysis, I won't include the ADDRESS variable. It is a text variable, and our models work best with numbers, so it needs further manipulation before it becomes informative enough to use in modelling.**
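
One possible form of that further manipulation is to pull a coarse city label out of the address and one-hot encode it. This is only a sketch: the addresses below are hypothetical and assume a "locality,city" layout, which the real ADDRESS column may or may not follow.

```python
import pandas as pd

# Hypothetical addresses in a "locality,city" layout; the real column may differ.
addresses = pd.Series([
    "Ksfc Layout,Bangalore",
    "Vishweshwara Nagar,Mysore",
    "Jigani,Bangalore",
])

# Take the text after the last comma as a coarse city label.
city = addresses.str.rsplit(",", n=1).str[-1].str.strip()

# One-hot encode the city so a model can consume it as numbers.
city_dummies = pd.get_dummies(city, prefix="CITY")
print(city_dummies)
```

With many distinct cities, the number of dummy columns grows quickly, so grouping rare cities into an "other" bucket may be worth considering.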

In [None]:
# visualise the distribution of longitude
ax=sns.histplot(trainD.LONGITUDE,bins=50,kde=True,stat='density')
plt.title('Longitude distribution')
plt.show()

In [None]:
# visualise the distribution of latitude
ax=sns.histplot(trainD.LATITUDE,bins=50,kde=True,stat='density')
plt.title('Latitude distribution')
plt.show()

In [None]:
# visualise the distribution of the target price
ax=sns.histplot(trainD['TARGET(PRICE_IN_LACS)'],bins=50,kde=True,stat='density')
plt.title('Price distribution')
plt.show()

In [None]:
trainD['TARGET(PRICE_IN_LACS)'].describe()

In [None]:
# Let's examine the house with the highest price
trainD[trainD['TARGET(PRICE_IN_LACS)'] == trainD['TARGET(PRICE_IN_LACS)'].max()]

The output above shows that the most expensive house was posted by a dealer, is available for resale, has 3 rooms, and is priced at 30000 lacs.

In [None]:
# Let's examine the house with the lowest price
trainD[trainD['TARGET(PRICE_IN_LACS)'] == trainD['TARGET(PRICE_IN_LACS)'].min()]

The output above shows that the cheapest house was posted by an owner, is not under construction, has 3 rooms, is available for resale, and is ready to move into.

In [None]:
plt.figure(figsize=(10,6))
# correlations are only defined for numeric columns, so select those first
sns.heatmap(trainD.select_dtypes(include='number').corr(),vmax=0.8, annot=True)

The heat map above shows that the target is more strongly correlated with SQUARE_FT than with any other numerical column.
We will probably use all of these features in our modelling process.
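
The heat map can be complemented by a ranked list of correlations with the target. The sketch below runs on a synthetic frame (`toy`, with an invented `PRICE` column that depends on area by construction), so the numbers say nothing about the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
# Synthetic frame: price depends on area, not on latitude. Not the real data.
area = rng.uniform(500, 3000, size=n)
toy = pd.DataFrame({
    "POSTED_BY": rng.choice(["Owner", "Dealer", "Builder"], size=n),
    "SQUARE_FT": area,
    "LATITUDE": rng.uniform(8, 35, size=n),
    "PRICE": 0.05 * area + rng.normal(0, 10, size=n),
})

# numeric_only=True skips the text column, which corr() cannot handle.
corr_with_price = (
    toy.corr(numeric_only=True)["PRICE"]
    .drop("PRICE")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_price)
```

Keep in mind that Pearson correlation only captures linear relationships, so a low value does not prove a feature is useless to a tree-based model.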

**Let's investigate whether there is a relationship between the number of rooms and the price.**

In [None]:
BHK_NO_counts = trainD['BHK_NO.'].value_counts()
# keep only the room counts that occur in more than 1000 listings
BHK_NO_counts = list(BHK_NO_counts[BHK_NO_counts > 1000].index)

In [None]:
BHK_NO_counts

In [None]:
# Plot of the distribution of prices by number of rooms
plt.figure(figsize=(8,6))

# Plot each room distribution of prices
for BHK in BHK_NO_counts:
    # Select the room category
    subset = trainD[trainD['BHK_NO.'] == BHK]
    
    # Density plot of prices (fill replaces the deprecated shade argument)
    sns.kdeplot(subset['TARGET(PRICE_IN_LACS)'],label = BHK,fill = False, alpha = 0.8);
    
# label the plot
plt.xlabel('House prices', size = 10); plt.ylabel('Density', size = 10); 
plt.title('Density Plot of house prices by rooms', size = 8);
plt.legend(title='Rooms');

The price distribution for every room category is strongly skewed to the right, with the 4-room category showing a high-density distribution.
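
Density curves are hard to compare when they are this skewed, so a per-category summary table can help. A minimal sketch on invented listings (the `PRICE` column and its values are made up):

```python
import pandas as pd

# Made-up listings; values are illustrative only.
toy = pd.DataFrame({
    "BHK_NO.": [1, 1, 2, 2, 2, 3, 3, 4],
    "PRICE": [20.0, 30.0, 45.0, 55.0, 60.0, 90.0, 110.0, 400.0],
})

# Median and count per room category summarise each density curve in one row.
price_by_rooms = toy.groupby("BHK_NO.")["PRICE"].agg(["count", "median"])
print(price_by_rooms)
```

The same `groupby` on `trainD` with `'TARGET(PRICE_IN_LACS)'` would give the real medians per room count.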

In [None]:
# let's investigate the relationship between POSTED_BY and prices
posted_counts = trainD['POSTED_BY'].value_counts()
posts=posted_counts.index

In [None]:
# Plot of distribution of prices for POSTED_BY
plt.figure(figsize=(8,6))

# Plot each POSTED_BY distribution of prices
for post in posts:
    # Select the POSTED_BY type
    subset = trainD[trainD['POSTED_BY'] == post]
    
    # Density plot of prices (fill replaces the deprecated shade argument)
    sns.kdeplot(subset['TARGET(PRICE_IN_LACS)'],label = post,fill = False, alpha = 0.8);
    
# label the plot
plt.xlabel('House prices', size = 10); plt.ylabel('Density', size = 10); 
plt.title('Density Plot of house prices by POSTED_BY', size = 10);
plt.legend(title='POSTED_BY');

The price distribution for every POSTED_BY category is strongly skewed to the right. Dealers generally have higher prices, while prices in the builder category are centred around the median and therefore show a high-density distribution.

In [None]:
base_features = ['POSTED_BY','UNDER_CONSTRUCTION','RERA','BHK_NO.','BHK_OR_RK','SQUARE_FT','LONGITUDE','LATITUDE']

In [None]:
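# copy so that the scaling step below modifies these frames, not views of trainD/testD
train_data = trainD[base_features].copy()
test_data = testD[base_features].copy()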

In [None]:
train_data.shape,test_data.shape

In [None]:
y = trainD['TARGET(PRICE_IN_LACS)']

In [None]:
cat_cols = [cname for cname in train_data.columns 
                    if  train_data[cname].dtype == "object"]

In [None]:
Train_cat_colsOH= pd.get_dummies(train_data[cat_cols])
Test_cat_colsOH= pd.get_dummies(test_data[cat_cols])
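
Encoding train and test separately only works if both contain the same categories. If a category is missing from one split, the dummy columns will not line up and prediction will fail. A small sketch of one way to guard against this, using hypothetical frames in which the test split lacks "Builder":

```python
import pandas as pd

# Hypothetical frames: the test split lacks the "Builder" category.
train_cat = pd.DataFrame({"POSTED_BY": ["Owner", "Dealer", "Builder"]})
test_cat = pd.DataFrame({"POSTED_BY": ["Dealer", "Owner"]})

train_oh = pd.get_dummies(train_cat)
test_oh = pd.get_dummies(test_cat)

# Force the test frame onto the training columns; absent categories become 0.
test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)

print(list(train_oh.columns))
print(list(test_oh.columns))
```

Applying the same `reindex` to `Test_cat_colsOH` with `Train_cat_colsOH.columns` is harmless when the columns already match.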

In [None]:
#Select numerical columns
num_cols = [cname for cname in train_data.columns 
            if train_data[cname].dtype in ['int64', 'float64']]

scaler = StandardScaler()
train_data[num_cols] = scaler.fit_transform(train_data[num_cols] )

In [None]:
test_data[num_cols] = scaler.transform(test_data[num_cols] )

In [None]:
train_num_data = pd.DataFrame(train_data[num_cols])
test_num_data = pd.DataFrame(test_data[num_cols])

In [None]:
train_data =pd.concat([Train_cat_colsOH, train_num_data],axis=1) 
test_data =pd.concat([Test_cat_colsOH, test_num_data],axis=1) 

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
# split validation set from training data
X_train, X_val, y_train, y_val =train_test_split(train_data,y,test_size=0.2,random_state=0)

In [None]:
# function to train a given model and evaluate it on the validation set
def fit_and_evaluate(model):
    # Train the model on the training split
    model.fit(X_train, y_train)
    # Predict on the held-out validation split
    y_pred = model.predict(X_val)
    mae = mean_absolute_error(y_val, y_pred)
    r2 = r2_score(y_val, y_pred)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    print("Root mean squared error: {:.2f}".format(rmse))
    print("R2 score:                {:.4f}".format(r2))
    print("Mean absolute error:     {:.2f}".format(mae))
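
To make the three metrics concrete, the sketch below computes them by hand on four made-up price pairs and checks the results against scikit-learn. The numbers are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted prices (lacs) for illustration.
y_true = np.array([50.0, 60.0, 80.0, 120.0])
y_hat = np.array([55.0, 58.0, 70.0, 130.0])

errors = y_true - y_hat
mae_manual = np.mean(np.abs(errors))          # average absolute miss
rmse_manual = np.sqrt(np.mean(errors ** 2))   # penalises large misses more
# Share of the variance around the mean that the predictions explain.
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-computed values agree with scikit-learn's implementations.
print(mae_manual, mean_absolute_error(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```

Because RMSE squares the errors, a handful of very expensive houses can dominate it, which is one reason to look at MAE as well.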

In [None]:
Ridge_model = Ridge()

fit_and_evaluate(Ridge_model) 

In [None]:
linear_model = LinearRegression()

fit_and_evaluate(linear_model)

In [None]:
random_forest = RandomForestRegressor(random_state=0)

fit_and_evaluate(random_forest)

In [None]:
gradient_boosted = GradientBoostingRegressor(random_state=4)

fit_and_evaluate(gradient_boosted) 

**The metric I care most about is the root mean squared error, so I want the model with the lowest value. I am going to take the gradient boosting model for this prediction, since it has more parameters to tune than the random forest. To further improve accuracy, I am going to use another metric, the mean absolute error: a randomized search provides the baseline, and a grid search then fine-tunes the number of estimators.**
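
The two-stage idea can be sketched on synthetic data before running it on the real features. Everything below is an illustration under assumptions: `make_regression` stands in for the housing data, the search space is deliberately tiny so it runs in seconds, and `coarse_search` and `fine_search` are names made up for this sketch.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic regression data standing in for the housing features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Stage 1: sample a few combinations from a broad space.
broad_space = {
    "n_estimators": [50, 100, 150],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.05, 0.1, 0.2],
}
coarse_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=broad_space,
    n_iter=5, cv=3, scoring="neg_mean_absolute_error", random_state=0,
)
coarse_search.fit(X, y)

# Stage 2: keep the best settings and sweep only the number of trees.
best = dict(coarse_search.best_params_)
best.pop("n_estimators")
fine_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0, **best),
    param_grid={"n_estimators": [50, 100, 150, 200]},
    cv=3, scoring="neg_mean_absolute_error",
)
fine_search.fit(X, y)

print(coarse_search.best_params_)
print(fine_search.best_params_, -fine_search.best_score_)
```

The random stage is cheap because it tries only `n_iter` combinations; the grid stage is exhaustive but one-dimensional, so it stays affordable.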

In [None]:
# Number of trees used in the boosting process
n_estimators = [100, 500, 900, 1100, 1500]

# Loss function to be minimized
# ('squared_error' and 'absolute_error' were called 'ls' and 'lad' in older scikit-learn)
loss = ['squared_error', 'absolute_error', 'huber']

# Maximum depth of each tree
max_depth = [2, 3, 5, 10, 15]

# How much the contribution of each tree is shrunk
learning_rate = [0.005,0.01,0.05,0.1,0.5]

# Minimum number of samples to split a node
min_samples_split = [2, 4, 6, 10]

# Maximum number of features to consider for making splits
# (None uses all features, which is what the removed 'auto' option did for this estimator)
max_features = ['sqrt', 'log2', None]

In [None]:
# Define the grid of hyperparameters to search
hyperparameter_grid = {'loss': loss,
                       'learning_rate':learning_rate,
                       'n_estimators': n_estimators,
                       'max_depth': max_depth,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}

In [None]:
model = GradientBoostingRegressor(random_state=4)

In [None]:
random_cv = RandomizedSearchCV(estimator=model,
                               param_distributions=hyperparameter_grid,
                               scoring='neg_mean_absolute_error',
                               cv=5, n_iter=30, 
                               n_jobs = -1, verbose = 1, 
                               return_train_score = True,
                               random_state=42)

In [None]:
random_cv.fit(X_train,y_train)

In [None]:
# Get all of the cv results and sort by the test performance
random_results = pd.DataFrame(random_cv.cv_results_).sort_values('mean_test_score', ascending = False)
random_results.head(3)

In [None]:
random_cv.best_estimator_

In [None]:
# Create a range of trees to evaluate
param_grid = {'n_estimators': [200, 300, 400, 500, 800, 1000]}
# The remaining hyperparameters are held fixed during the grid search
model = GradientBoostingRegressor(max_depth=15,
                                  loss='squared_error',
                                  alpha=0.9,
                                  learning_rate=0.1,
                                  min_samples_leaf=1,
                                  min_samples_split=6,
                                  max_features=None,
                                  max_leaf_nodes=None,
                                  random_state=4)


In [None]:
# Grid search object using the range of trees and the gradient boosting model
grid_search = GridSearchCV(estimator = model, param_grid=param_grid, cv = 5, 
                           scoring = 'neg_mean_absolute_error', verbose = 1,
                           n_jobs = -1, return_train_score = True)

In [None]:
grid_search.fit(X_train,y_train)

In [None]:
# Get the results into a dataframe
results = pd.DataFrame(grid_search.cv_results_)

# Plot the training and cross-validation error vs number of trees
plt.figure(figsize=(8, 8))
plt.style.use('fivethirtyeight')
plt.plot(results['param_n_estimators'], -1 * results['mean_test_score'], label = 'Test_Err')
plt.plot(results['param_n_estimators'], -1 * results['mean_train_score'], label = 'Train_Err')
plt.xlabel('Number of Trees'); plt.ylabel('Mean Absolute Error'); plt.legend(loc='best');
plt.title('Performance vs Number of Trees');

As we increase the number of trees, the model starts to overfit.
We therefore take the optimum reached before overfitting sets in, at 200 estimators.
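
A cheaper way to see where overfitting starts is to fit one large model and score it after every added tree with `staged_predict`, instead of refitting for each value of `n_estimators`. The sketch below uses synthetic data and a hold-out split as stand-ins (both are assumptions, not the notebook's data or its 5-fold setup):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data and a simple hold-out split, for illustration only.
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=300, max_depth=3, random_state=0)
gbr.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
val_mae = [mean_absolute_error(y_va, pred) for pred in gbr.staged_predict(X_va)]
train_mae = [mean_absolute_error(y_tr, pred) for pred in gbr.staged_predict(X_tr)]

# Number of trees with the lowest validation error (index 0 means 1 tree).
best_n_trees = int(np.argmin(val_mae)) + 1
print("Best number of trees on validation data:", best_n_trees)
print("Validation MAE there:", round(val_mae[best_n_trees - 1], 2))
```

A training error that keeps falling while the validation error flattens or rises is the overfitting pattern described above.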

In [None]:
results.sort_values('mean_test_score', ascending = False).head(3)

In [None]:
# Select the best model
final_modelGBR = grid_search.best_estimator_
final_modelGBR

In [None]:
#final model performance

fit_and_evaluate(final_modelGBR)

**After fitting my final model, price prediction improved greatly: predictions are on average within about 30 lacs of the true market price, with a root mean squared error of 129 and an R2 score of about 0.96.**

In [None]:
SalesPrediction = final_modelGBR.predict(test_data)

In [None]:
submission = pd.DataFrame({'Id': test_data.index, 'SalePrice': SalesPrediction})
submission.to_csv('submission.csv', index=False)
submission.head(10).set_index('Id')

All comments are welcome! There is always room for improvement.