# Description

Housing costs demand a significant investment from both consumers and developers. And when it comes to planning a budget, whether personal or corporate, the last thing anyone needs is uncertainty about one of their biggest expenses. Sberbank, Russia’s oldest and largest bank, helps their customers by making predictions about realty prices so renters, developers, and lenders are more confident when they sign a lease or purchase a building.

Although the housing market is relatively stable in Russia, the country’s volatile economy makes forecasting prices as a function of apartment characteristics a unique challenge. Complex interactions between housing features such as number of bedrooms and location are enough to make pricing predictions complicated. Adding an unstable economy to the mix means Sberbank and their customers need more than simple regression models in their arsenal.

In this competition, Sberbank is challenging Kagglers to develop algorithms which use a broad spectrum of features to predict realty prices. Competitors will rely on a rich dataset that includes housing data and macroeconomic patterns. An accurate forecasting model will allow Sberbank to provide more certainty to their customers in an uncertain economy.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# To check all the columns present
pd.set_option('display.max_columns', 500)

# Unzip Files

In [None]:
! unzip /kaggle/input/sberbank-russian-housing-market/train.csv.zip

In [None]:
! unzip /kaggle/input/sberbank-russian-housing-market/test.csv.zip

In [None]:
! ls /kaggle/working

**Reading train and test data**

In [None]:
train_df = pd.read_csv("/kaggle/working/train.csv")
test_df = pd.read_csv("/kaggle/working/test.csv")
print(f"train data shape:- {train_df.shape}")
print(f"test data shape:- {test_df.shape}")

*Convert the `timestamp` column to datetime*

In [None]:
train_df['timestamp'] =pd.to_datetime(train_df.timestamp)
test_df['timestamp'] =pd.to_datetime(test_df.timestamp)
train_df.head()

**Sort the training data by timestamp, because house prices tend to increase over time**

In [None]:
train_df = train_df.sort_values(by=['timestamp'])
train_df.head()

In [None]:
test_df.head()

**Let's dig deeper into the output variable `price_doc`**

In [None]:
# Derive calendar features from the timestamp. `DatetimeIndex.week` was removed
# in pandas 2.0, so the ISO week is taken from `.dt.isocalendar()` instead.
for df in (train_df, test_df):
    df['year'] = df['timestamp'].dt.year
    df['month'] = df['timestamp'].dt.month
    df['day'] = df['timestamp'].dt.day
    df['week'] = df['timestamp'].dt.isocalendar().week.astype(int)

In [None]:
# Groupby year in price mean
mean_year_df = train_df.groupby("year")["price_doc"].agg("mean").reset_index()
plt.figure(figsize=(8,6))
plt.scatter(range(mean_year_df.shape[0]), mean_year_df.price_doc.values)
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()

> From the price plot above, we can see that the mean price has increased every year.

> The median price, and how much the price differs from year to year, are also worth checking.
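
The median check mentioned above could look roughly like the sketch below. It runs on a small invented table rather than the real `train_df`, so the numbers are purely illustrative; only the column names `year` and `price_doc` come from this notebook.

```python
import pandas as pd

# Toy stand-in for train_df; the prices are invented for illustration only.
toy = pd.DataFrame({
    "year": [2011, 2011, 2012, 2012, 2013, 2013],
    "price_doc": [4.0e6, 6.0e6, 5.0e6, 9.0e6, 7.0e6, 8.0e6],
})

# Mean and median price per year.
yearly = toy.groupby("year")["price_doc"].agg(["mean", "median"])

# Year-over-year change of the median, absolute and relative.
yearly["median_diff"] = yearly["median"].diff()
yearly["median_pct_change"] = yearly["median"].pct_change()

print(yearly)
```

The median is less sensitive than the mean to a handful of very expensive flats, so comparing the two columns is a quick way to see how skewed each year is.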

In [None]:
# Groupby month on price
mean_month_df = train_df.groupby("month")["price_doc"].agg("mean").reset_index()
plt.figure(figsize=(8,6))
plt.scatter(range(mean_month_df.shape[0]), mean_month_df.price_doc.values)
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()

> From the plot above, we can see that the mean price is highest between the 2nd and 4th months.

In [None]:
# groupby year and month on price mean
mean_year_month_df = train_df.groupby(["year", "month"])["price_doc"].agg("mean").reset_index()
plt.figure(figsize=(8,6))
plt.scatter(range(mean_year_month_df.shape[0]), mean_year_month_df.price_doc.values)
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()

> The graph above shows how the mean price has increased across successive year and month groups.

In [None]:
mean_month_year_df = train_df.groupby(["month", "year"])["price_doc"].agg("mean").reset_index()
plt.figure(figsize=(8,6))
plt.scatter(range(mean_month_year_df.shape[0]), mean_month_year_df.price_doc.values)
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()

In [None]:

train_df.plot.scatter(x='product_type', y='price_doc')


> It seems that people tend to pay more when they are buying for investment purposes.

In [None]:

train_df.plot.scatter(x='num_room', y='price_doc')

> The price increases as the number of rooms increases.

> Some rows have more than 10 rooms but very low prices. These look like outliers, so we can remove them.

In [None]:
train_df.shape

In [None]:
# Keep rows with fewer than 10 rooms, plus rows where num_room is missing
train_df = train_df[(train_df["num_room"]<10) | (train_df["num_room"].isnull())]
train_df.plot.scatter(x='num_room', y='price_doc')

In [None]:
train_df.shape

**Note: the same kind of bivariate analysis can be done for the other categorical variables as well**
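
As a rough illustration of that kind of analysis, the sketch below summarises the price per category with a `groupby`. The table is invented; `product_type` and `price_doc` are the only names taken from this notebook.

```python
import pandas as pd

# Toy stand-in for train_df with one categorical column; values are invented.
toy = pd.DataFrame({
    "product_type": ["Investment", "Investment", "OwnerOccupier", "OwnerOccupier", "Investment"],
    "price_doc": [6.0e6, 8.0e6, 5.0e6, 7.0e6, 10.0e6],
})

# Count, median and mean price for each category, highest median first.
summary = (
    toy.groupby("product_type")["price_doc"]
    .agg(["count", "median", "mean"])
    .sort_values("median", ascending=False)
)
print(summary)
```

Looking at the count next to the median helps avoid reading too much into categories that only have a few rows.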

In [None]:
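# Pass axis by keyword; the positional form drop(labels, 1) no longer works in pandas 2.x
train_df = train_df.drop(["id", "timestamp"], axis=1)
test_df = test_df.drop(["id", "timestamp"], axis=1)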


In [None]:
# Separating the target price as series
target = train_df["price_doc"]
#train_df = train_df.drop(["price_doc"], 1)
train_df.shape

In [None]:
target 

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(range(target.shape[0]), np.sort(target.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()

> This shows the prices sorted in ascending order. No prices are filtered out for now.

**without sorting**

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
plt.scatter(range(target.shape[0]), target.values)
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()

> It seems that there are some outliers in the price as well, but let's model it without filtering them first.
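
If the price outliers are filtered later, one common option is the interquartile-range rule. The sketch below is only an illustration on invented prices; the 1.5 multiplier is the usual convention and not something tuned for this dataset.

```python
import pandas as pd

# Invented prices with one suspiciously low and one suspiciously high value.
prices = pd.Series([4.0e6, 5.0e6, 5.5e6, 6.0e6, 6.5e6, 7.0e6, 1.0e5, 9.0e7])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences instead of dropping them right away.
is_outlier = (prices < lower) | (prices > upper)
print(prices[is_outlier])
```

Flagging first, rather than deleting, keeps the option of comparing models trained with and without the flagged rows.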

# Merging both train and test for preprocessing

In [None]:
#merged_df = pd.concat([train_df,test_df])

In [None]:
train_df.info()

In [None]:
# All cols present
all_cols = train_df.columns
# numerical columns
num_and_float_cols = train_df._get_numeric_data().columns
# categories columns
object_cols = list(set(all_cols) - set(num_and_float_cols))
print(f"total numeric cols :- {len(num_and_float_cols)}, categorical cols:- {len(object_cols)}")

In [None]:
# Filtering out all the columns which contains some nan values
null_cols_val = {all_cols[col_idx]:val for col_idx, val in enumerate(train_df.isnull().sum()) if val>0}
null_cols = [i[0] for i in null_cols_val.items() ]
# List of all columns having null value
null_cols

In [None]:
train_df.head()

# Multicollinearity check starts here

In [None]:
corelated_df = train_df[num_and_float_cols].corr()#.reset_index()

In [None]:
corelated_df.head()

In [None]:
price_important_feat = corelated_df.loc["price_doc"].to_dict()

In [None]:
# Bar plot of each numeric feature's correlation with price_doc
plt.figure(figsize=(50,50))
plt.bar(range(len(price_important_feat)), list(price_important_feat.values()), align='center')
plt.xticks(range(len(price_important_feat)), list(price_important_feat.keys()), rotation=90)
plt.show()

In [None]:
#correlated_features = set()
# Correlated sets: feature -> list of features it is highly correlated with
correlated_features = {}
# Features that are highly correlated with price
features_related_to_price = []
corr_cols = corelated_df.columns
already_done = []
for i in range(len(corr_cols)):
    correlated_features[corr_cols[i]] = []
    for j in range(len(corr_cols)):
        # Skip the diagonal and any pair that was already recorded in either order
        if i != j and [i, j] not in already_done and [j, i] not in already_done and abs(corelated_df.iloc[i, j]) > 0.8:
            already_done.append([i,j])
            if corr_cols[i]=="price_doc":
                features_related_to_price.append([corr_cols[j],corelated_df.iloc[i, j]])
            elif corr_cols[j]=="price_doc":
                features_related_to_price.append([corr_cols[i],corelated_df.iloc[i, j]])    
            else:    
                #correlated_features.add(corr_cols[i])
                correlated_features[corr_cols[i]].append(corr_cols[j])
    if not correlated_features[corr_cols[i]]:
        del correlated_features[corr_cols[i]]
           

In [None]:
# Collect every feature that appears on the right-hand side of a correlated pair
value_correlated_features = []
for kv in correlated_features.items():
    value_correlated_features.extend(kv[1])
restricted_sets = set(value_correlated_features)
# Also add the left-hand feature of each pair, so both members are marked for removal
for kv in correlated_features.items():
    if kv[0] not in restricted_sets:
        restricted_sets.add(kv[0])

In [None]:
# Uncomment to check features
#corelated_sets_to_remove

> `corelated_sets_to_remove` holds the features that are correlated with another feature by more than 80% (absolute correlation above 0.8).

In [None]:
print(len(restricted_sets))
corelated_sets_to_remove = list(restricted_sets)
train_df = train_df.drop(corelated_sets_to_remove,axis=1)
test_df = test_df.drop(corelated_sets_to_remove,axis=1)
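
The cells above drop every feature that takes part in a highly correlated pair. A different and commonly used variant keeps the first feature of each pair and drops only the later one. The sketch below shows that variant on invented data; `correlated_columns_to_drop` is a hypothetical helper name, not something defined elsewhere in this notebook, and it is not the method used above.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(df, target, threshold=0.8):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.drop(columns=[target]).corr().abs()
    # Keep only the strict upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Invented data: life_sq is almost a scaled copy of full_sq, floor is unrelated.
toy = pd.DataFrame({
    "full_sq": [30, 40, 50, 60, 70, 80, 90, 100],
    "life_sq": [18, 25, 30, 37, 42, 49, 54, 61],
    "floor": [3, 1, 4, 1, 5, 9, 2, 6],
    "price_doc": [4.1, 5.0, 6.2, 6.9, 8.3, 9.0, 9.8, 11.5],
})

to_drop = correlated_columns_to_drop(toy, target="price_doc")
print(to_drop)
```

This keeps `full_sq` as the representative of the pair. The trade-off is that the result depends on column order, so it can be worth ordering the columns by their correlation with the target first.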

In [None]:
new_cols_list = train_df.columns
print(train_df.shape)
num_and_float_cols = [col for col in num_and_float_cols if col in new_cols_list]
object_cols = [col for col in object_cols if col in new_cols_list]

**Correlation with price for the remaining features, after removing the inter-correlated ones, and plots**

In [None]:
limited_price_important_feat = {i[0]:i[1] for i in price_important_feat.items() if i[0] not in corelated_sets_to_remove+["price_doc"]}
plt.figure(figsize=(20,20))
plt.barh(*zip(*limited_price_important_feat.items()))
# plt.bar(range(len(limited_price_important_feat)), list(limited_price_important_feat.values()), align='center')
# plt.xticks(range(len(limited_price_important_feat)), list(limited_price_important_feat.keys()))
plt.show()

> It seems that `num_room` contributes more than `full_sq`, followed by the others.

In [None]:
train_df[object_cols[-5]]

In [None]:
len(object_cols), len(num_and_float_cols)

**Univariate plots of the numeric features and their distributions**

In [None]:
for col in num_and_float_cols:
    if col in null_cols:
        print(col)
        plt.figure(figsize=(10,6))
        # sns.distplot is deprecated; histplot with kde=True draws a comparable plot
        sns.histplot(train_df[col].dropna(), bins=50, kde=True)
        plt.xlabel(col, fontsize=12)
        plt.show()

# Feature Engineering

In [None]:
train_df.head()

# Bin some of the features

In [None]:
# Converting continuous variables into a limited number of bins based on quantiles
from sklearn.preprocessing import KBinsDiscretizer

features_to_bin = ["industrial_km", "big_market_km", "market_shop_km", "church_synagogue_km", "incineration_km", "big_road1_km", "bus_terminal_avto_km", "mosque_km"]
binned_features = []
def binning():
    global binned_features
    for feature in features_to_bin:
        binf = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
        binf = binf.fit(train_df[feature].values.reshape(-1,1))#.astype(int)
        train_df[f"{feature}_bin"] = binf.transform(train_df[feature].values.reshape(-1,1)).astype(int)
        test_df[f"{feature}_bin"] = binf.transform(test_df[feature].values.reshape(-1,1)).astype(int)
        binned_features.append(f"{feature}_bin")

         
binning()

In [None]:
train_df.head()

In [None]:
def feature_engneering(data):
    numeric_data_added = []
    categoric_data_added = []
    
    
    # When someone buys a home to live in, they make sure a school, a hospital,
    # a metro station, a market and water are nearby
    data["sub_area_hospital_centres"] = data["sub_area"] + data["healthcare_centers_raion"].astype("str")
    categoric_data_added.append("sub_area_hospital_centres")
    data["sub_area_school"] = data["sub_area"] + data["school_education_centers_top_20_raion"].astype("str")
    categoric_data_added.append("sub_area_school")
    data["sub_area_market"] = data["sub_area"] + data["big_market_raion"].astype("str")
    categoric_data_added.append("sub_area_market")
    data["sub_area_metro"] = data["sub_area"] + data["ID_metro"].astype("str")
    categoric_data_added.append("sub_area_metro")
    for feature in binned_features:
        data[f"sub_area_{feature}"] = data["sub_area"] + data[feature].astype("str")
        categoric_data_added.append(f"sub_area_{feature}")
        
    return data, categoric_data_added, numeric_data_added
    
    
train_df, categoric_data_added, numeric_data_added = feature_engneering(train_df) 
test_df, categoric_data_added, numeric_data_added = feature_engneering(test_df) 


# Note: Only some basic feature engineering has been done here. It could be extended with a few more features, but I am stopping with these

In [None]:
#err

**Filling null values with -1. For continuous features we could also try the mean or median. I used -1 to keep missing values as a separate value, with the intent that they contribute less during training**

In [None]:
for col in num_and_float_cols:
    if col in null_cols:
        #print(null_cols_val[col])
        #print(abs(price_important_feat[col]))
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        train_df[col] = train_df[col].fillna(-1)
        test_df[col] = test_df[col].fillna(-1)
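
The cell above uses -1 as the fill value. The median alternative mentioned earlier could be sketched as follows, with an extra indicator column so the model can still tell which values were missing. The data is invented and the `_missing` suffix is just a naming choice for this example.

```python
import numpy as np
import pandas as pd

# Invented train and test frames with gaps in one numeric column.
train_toy = pd.DataFrame({"life_sq": [30.0, np.nan, 50.0, 40.0]})
test_toy = pd.DataFrame({"life_sq": [np.nan, 35.0]})

# Record where values were missing before filling them.
for frame in (train_toy, test_toy):
    frame["life_sq_missing"] = frame["life_sq"].isna().astype(int)

# Learn the median from the training rows only, then apply it to both frames.
median = train_toy["life_sq"].median()
train_toy["life_sq"] = train_toy["life_sq"].fillna(median)
test_toy["life_sq"] = test_toy["life_sq"].fillna(median)

print(train_toy)
print(test_toy)
```

Computing the median on the training data alone avoids leaking information from the test set into the features.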

**In my experience one-hot encoding would probably have worked better, but it would also increase the number of features, so I decided to use target encoding (just for testing)**

In [None]:
# Import label encoder
#from sklearn import preprocessing
#label_encoder = preprocessing.LabelEncoder()
from category_encoders import TargetEncoder
 
for col in object_cols+categoric_data_added:
    if col in null_cols:
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        train_df[col] = train_df[col].fillna(f"{col}nan")
        test_df[col] = test_df[col].fillna(f"{col}nan")
    encoder = TargetEncoder()
    encoder = encoder.fit(train_df[col], train_df['price_doc'])
    train_df[col] = encoder.transform(train_df[col])
    test_df[col] = encoder.transform(test_df[col])
#     label_encoder= label_encoder.fit(train_df[col])
#     train_df[col] = label_encoder.transform(train_df[col])
#     test_df[col] = label_encoder.transform(test_df[col])
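
One known risk with target encoding is that each training row sees its own price in the encoded value. A common mitigation is out-of-fold encoding, sketched below with plain pandas and NumPy. This is not what the cell above does; `out_of_fold_target_encode` is a hypothetical helper and the data is invented.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with its category's target mean computed on the other folds."""
    categories = categories.reset_index(drop=True)
    target = target.reset_index(drop=True)
    global_mean = target.mean()
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    # Assign every row to one of n_splits folds at random.
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(categories)) % n_splits

    for fold in range(n_splits):
        held_out = fold_ids == fold
        # Category means learned without the held-out rows.
        means = target[~held_out].groupby(categories[~held_out]).mean()
        encoded.loc[held_out] = categories[held_out].map(means).to_numpy()

    # Categories that never appear in the other folds fall back to the global mean.
    return encoded.fillna(global_mean)

toy = pd.DataFrame({
    "sub_area": ["A", "A", "A", "B", "B", "C"],
    "price_doc": [4.0, 6.0, 8.0, 10.0, 12.0, 20.0],
})
toy["sub_area_te"] = out_of_fold_target_encode(toy["sub_area"], toy["price_doc"], n_splits=3)
print(toy)
```

For the test set, the encoding would still be learned from the full training data, as in the cell above.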

In [None]:
train_df.head()

**Uncomment the code below to get univariate box plots for spotting outliers. Since I am currently keeping the outliers, it is commented out**

In [None]:
# for cols in num_and_float_cols:
#     fig = plt.figure(figsize =(5, 3))

#     # Creating plot
#     plt.boxplot(train_df[cols].values)

#     # show plot
#     plt.show()

In [None]:
# Get all the X variables
X = train_df.drop(["price_doc"], axis=1)


In [None]:
target = target#.iloc[:10000]

In [None]:
#X.info()

**Split the data into train (80%) and cross-validation (`X_test`, 20%) sets**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2, random_state=0)
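
`train_test_split` shuffles the rows, so the validation set above mixes early and late sales. Since the data was sorted by time earlier, a chronological hold-out is another option worth comparing. The sketch below shows the idea on invented rows; `chronological_split` is a hypothetical helper, not part of scikit-learn.

```python
import pandas as pd

def chronological_split(df, time_col, valid_fraction=0.2):
    """Use the most recent rows as the validation set."""
    ordered = df.sort_values(time_col)
    n_valid = int(round(len(ordered) * valid_fraction))
    cut = len(ordered) - n_valid
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Invented rows, deliberately out of order.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2013-05-01", "2011-08-20", "2014-01-15", "2012-03-10", "2015-06-30"]
    ),
    "price_doc": [5.0e6, 4.0e6, 7.0e6, 4.5e6, 8.0e6],
})

train_part, valid_part = chronological_split(toy, "timestamp", valid_fraction=0.2)
print(train_part)
print(valid_part)
```

A validation score from the latest period is usually a closer proxy for performance on future sales than a score from a random split.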


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Modelling

In [None]:
# Despite the variable name, this is a linear regression model, not logistic regression
logreg = LinearRegression(n_jobs=-1)
logreg.fit(X_train, y_train)

In [None]:
y_pred_log = logreg.predict(X_test)
print('R^2 of the linear regression model on the validation set: {:.2f}'.format(logreg.score(X_test, y_test)))

In [None]:
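# The model was trained on min-max scaled features, so scale the test data the same way
y_pred_test = logreg.predict(scaler.transform(test_df))
print('R^2 of the linear regression model on the validation set: {:.2f}'.format(logreg.score(X_test, y_test)))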

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error


In [None]:
mean_squared_error(y_pred_log, y_test)

In [None]:
mean_absolute_error(y_pred_log, y_test)

# XGBoost

In [None]:
# Train XGBoost model and validate results

import xgboost as xgb
from sklearn import metrics
clf = xgb.XGBRegressor(n_estimators=150, max_depth=7, learning_rate=0.01, min_child_weight=20)
clf.fit(X_train, y_train)

print(metrics.mean_squared_error(y_test, clf.predict(X_test))**0.5)

In [None]:
# Plot importances of XGBoost model
# Some of created features can be noticed in top 50 important features!
fig, ax = plt.subplots(1, 1, figsize=(8, 16))
xgb.plot_importance(clf, max_num_features=50, height=0.5, ax=ax);

In [None]:
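# Plot true values vs predicted ones

plt.scatter(y_train, clf.predict(X_train), alpha=0.3, c='red')
plt.scatter(y_test, clf.predict(X_test), alpha=0.3, c='blue');
plt.xlabel('true values')
plt.ylabel('predicted values')
# Take the axis limits from the data: price_doc is not log-transformed in this notebook,
# so a fixed [13, 19] range would leave the plot empty
lims = [target.min(), target.max()]
plt.axis(lims + lims)
plt.plot(lims, lims);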


In [None]:
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error


**Reference: https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f**

In [None]:
import numpy as np
# "Learn" the mean from the training data
mean_train = np.mean(y_train)
# Get predictions on the test set
baseline_predictions = np.ones(y_test.shape) * mean_train
# Compute MAE
mae_baseline = mean_absolute_error(y_test, baseline_predictions)
print("Baseline MAE is {:.2f}".format(mae_baseline))

In [None]:
# I chose these parameters for initial testing.
# We can also use GridSearchCV or RandomizedSearchCV to search for better hyperparameters
# 'silent' and 'reg:linear' are deprecated names, replaced here by 'verbosity' and 'reg:squarederror'
params = {'eta': 0.05, 'max_depth': 5, 'subsample': 0.8, 'colsample_bytree': 0.8, 'verbosity': 0,
          'min_child_weight': 1, 'gamma': 0, 'objective': 'reg:squarederror', 'eval_metric': 'rmse'} # initial params

In [None]:
#params['eval_metric'] = "rmse"


In [None]:
num_boost_round = 999


In [None]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=20
)

In [None]:
#https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f
xgb_cv = model.predict(dtest)
mean_absolute_error(xgb_cv, y_test)


In [None]:
# r2_score is not symmetric: the true values go first, then the predictions
r2_score(y_test, xgb_cv)


In [None]:
mean_squared_error(model.predict(dtest), y_test)

# Simple LSTM time series (the browser started freezing at this point)

# Things left in time series.

> Prepare the data as sequences, so that the previous observations are used to predict future values, as in the link below

> https://towardsdatascience.com/simple-multivariate-time-series-forecasting-7fa0e05579b2
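
A minimal sketch of that preparation step, assuming a single series such as the monthly mean price: each sample holds the previous `window` values and the label is the value that follows. `make_windows` is a hypothetical helper and the numbers are invented.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (samples, timesteps, 1) inputs and next-step labels."""
    series = np.asarray(series, dtype=float)
    if len(series) <= window:
        raise ValueError("series must be longer than the window")
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    # Keras LSTM layers expect inputs shaped (samples, timesteps, features).
    return X[..., np.newaxis], y

# Invented monthly mean prices, in millions.
monthly_mean_price = [5.0, 5.2, 5.1, 5.6, 5.9, 6.0]
X_seq, y_seq = make_windows(monthly_mean_price, window=3)

print(X_seq.shape, y_seq.shape)
```

With this layout the time steps of each sample are consecutive observations, which differs from the reshape used further down, where each feature column of a single row is treated as a time step.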

In [None]:
# !pip install --upgrade tensorflow
# !pip install --upgrade tensorflow-gpu

In [None]:
#! pip install keras

In [None]:
import numpy
import matplotlib.pyplot as plt
import pandas
import math
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

In [None]:
# normalize the dataset (X_train and X_test were already min-max scaled above, so this refit leaves them unchanged)
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# normalize the dataset
#scaler = MinMaxScaler(feature_range=(0, 1))
#dataset = scaler.fit_transform(dataset)
X_train_new = np.reshape(X_train, X_train.shape + (1,))
X_test_new = np.reshape(X_test, X_test.shape + (1,))

**Training with few epochs and layers because of hanging issue**

In [None]:
# I used LSTM + Dense layers because this combination gave better results.
# The optimizer is Adam
model = Sequential()
model.add(LSTM(4, input_shape=(X_train.shape[1], 1)))
model.add(Dense(50))
model.add(Dense(25))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train_new, y_train, epochs=1000, batch_size=64, verbose=2)

In [None]:
cv_predict = model.predict(X_test_new)


In [None]:
fig, axs = plt.subplots(figsize=(20,20), sharey=True)
plt.title('output plot')
axs.scatter(list(range(len(y_test))),cv_predict, color="red")
axs.scatter(list(range(len(y_test))), y_test, color="blue")

In [None]:
mean_absolute_error(cv_predict, y_test)


In [None]:
mean_squared_error(cv_predict, y_test)

# Evaluation

> I chose mean_squared_error as the metric because it is generally used when large errors are particularly undesirable, and in this case we tend to get large errors.

> The second metric I tried is mean absolute error.

> The LSTM + Dense model appears to have worked better.
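
Since the squared error is in squared price units, its square root (RMSE) is often easier to read. For price targets, a logarithmic variant (RMSLE) is also commonly reported because it measures relative rather than absolute error. The sketch below writes both with NumPy on invented values; `rmse` and `rmsle` are helper names introduced only for this example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; assumes non-negative values."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Invented prices and predictions.
y_true = np.array([1.0e6, 2.0e6, 10.0e6])
y_pred = np.array([1.5e6, 2.0e6, 10.5e6])

print(rmse(y_true, y_pred))
print(rmsle(y_true, y_pred))
```

In this toy case both wrong predictions are off by the same 500,000, so RMSE treats them equally, while RMSLE penalises the miss on the cheaper flat more because it is larger in relative terms.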

# Comparison

In [None]:
! pip install prettytable

In [None]:
from prettytable import PrettyTable
x=PrettyTable()
x.field_names = ["Model", "mean_squared_error", "mean_absolute_error"]

x.add_row(["LSTM", f"{mean_squared_error(y_test, cv_predict)}", f"{mean_absolute_error(y_test, cv_predict)}"])
x.add_row(["XGBoost", f"{mean_squared_error(y_test, xgb_cv)}", f"{mean_absolute_error(y_test, xgb_cv)}"])
x.add_row(["Linear regression", f"{mean_squared_error(y_test, y_pred_log)}", f"{mean_absolute_error(y_test, y_pred_log)}"])

print(x)