# Regression Analysis - Flat Price in Moscow
This notebook analyses flat prices in Moscow. Regression is first used to understand the data; prediction and model tuning follow.

The methods used are:
* linear regression of multiple variables  
* decision tree regression
* XGBoost regression

In [None]:
#importing the necessary libraries
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor 
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import xgboost as xgb

In [None]:
df = pd.read_csv('../input/price-of-flats-in-moscow/flats_moscow.csv', index_col = 0)   #Importing data 
df

The data description (machine-translated) contains one counter-intuitive encoding, in the column named 'floor'.

The description reads: 
> 1 - floor other than the first and last floor, 0 - otherwise.

I would prefer the reverse encoding: 1 for the first and last floor, 0 otherwise. The code below flips this column to make it easier to read. It also drops the remaining categorical columns, "walk" and "brick", which do not appear to be significant factors in housing prices, along with the "code" column.

In [None]:
def floor_changer(f):    #easy function, just changing 1 to 0, and 0 to 1. 
    if f == 1:      
        f = 0
    else: 
        f = 1 
    return f

df.floor = df.floor.apply(floor_changer)   #using this function on 'floor' column

df = df.drop(columns=['brick', 'walk', 'code'])   #dropping unnecessary columns

df #showing dataframe
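
The same flip can also be written without a helper function. This is a minimal sketch that assumes the column holds only 0 and 1; the `demo` frame is made-up illustration data, not the flats data set.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'floor' column
demo = pd.DataFrame({'floor': [1, 0, 1, 1, 0]})

# For a 0/1 column, subtracting from 1 swaps the two values
demo['floor_flipped'] = 1 - demo['floor']

print(demo)
```

Unlike `floor_changer`, which maps every value other than 1 to 1, this form is only valid when the column is strictly binary.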

To get a feel for the data and, if necessary, remove outliers, each variable is plotted against price, which will be the target in the regression models below. Only the categorical variable 'floor' is shown as a bar chart of mean price, split into flats on the first or last floor (1) and all others (0).

In [None]:
plt.figure(figsize=(20,15))

#enumerate from 1 so that each graph takes the next position in the 2x3 grid
for i, variable in enumerate(df.columns[1:], start=1):
    plt.subplot(2,3,i)
    plt.title(variable)
    if variable == 'floor':     #categorical data is shown as a bar chart rather than a scatter plot
        categorical_value_plot = sns.barplot(x='floor', y='price', data=df, color='#191970')
        categorical_value_plot.axhline(df.price.mean(), color='black', lw=2)    #line at the level of the overall mean price
    else:
        plt.scatter(x = df[variable], y=df.price, c='#191970')

As can be seen, the relationships are roughly linear. The bar chart is also consistent with the common perception of ground-floor flats as potentially more dangerous: flats on the first or last floor are on average cheaper than flats on other floors.

One outlier was also found: the highest-priced flat (over $700,000). It also has the largest living space but is average in the other categories. With 2040 observations, removing this outlier will not materially affect the rest of the analysis.

There is, however, another problem with the data. Judging by the graphs and the descriptions of the variables, the variables 'totsp' and 'livesp' are likely to be multicollinear. This is not a big problem for predictive models, but it can make the data harder to interpret.

To test this, a series of linear models is fitted in which each predictor in turn becomes the predicted variable. The R2 of each model gives the Variance Inflation Factor (VIF), which indicates whether collinearity is actually present. The threshold varies across the literature; in general, a VIF greater than 5 is taken to indicate collinearity between the variables.
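
For reference, the VIF is usually written as follows. The symbols $j$ (the index of the variable being predicted) and $R_j^2$ (the R2 of the model that predicts variable $j$ from the remaining ones) are notation introduced here for illustration.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

With the threshold of 5 used here, a variable is flagged when $R_j^2 > 0.8$, i.e. when more than 80% of its variance is explained by the other variables.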

In [None]:
df = df[df.price<700]   #deleting outlier. Only one value has a price above 700. 


def VIF(R_2):              #function to calculate VIF, where R_2 is the R^2 score of a given model
    return 1/(1-R_2)

for i in range(len(df.columns)):       #each column in turn is treated as the explained variable
    model = LinearRegression()         #ordinary least squares linear regression from scikit-learn
    Y = df.iloc[:,i]                   #one column at a time is picked as the explained variable
    X = df.drop(columns = df.columns[i])    #the explained variable is removed from the predictors
    model.fit(X, Y)                         #fitting the model
    R_2 = model.score(X, Y)                 #finding R^2 score
    print('VIF in case of ' + df.columns[i] + " treated as outcome variable: " + str(VIF(R_2)))  #showing VIF

As expected, the variables 'totsp' and 'livesp' show a large VIF. There are several ways to deal with collinearity. This notebook takes a slightly unusual approach: the variable "totsp" is transformed so as to retain as much information as possible while minimising collinearity.

As the description implies, "totsp" is the total space of the flat, of which the living space ("livesp") is one part. The remainder can be understood as additional space, e.g. balconies, gardens, cellars. Subtracting "livesp" from "totsp" therefore gives the "extra square metres". The resulting column is called "extrasp" and replaces the existing "totsp".

In [None]:
df['extrasp'] = df.totsp-df.livesp    #replacing "totsp" with the extra space to reduce collinearity
df = df.drop(columns='totsp')         #dropping the original column

In [None]:
#performing the previous test once again to check for changes in the VIF

for i in range(len(df.columns)):       
    model = LinearRegression()        
    Y = df.iloc[:,i]                  
    X = df.drop(columns = df.columns[i])    
    model.fit(X, Y)                         
    R_2 = model.score(X, Y)                 
    print('VIF in case of ' + df.columns[i] + " treated as outcome variable: " + str(VIF(R_2))) 

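Collinearity has been reduced. Several models will now be fitted with the aim of achieving the highest possible R2 and the lowest possible relative error. The code calls this error MAPE (Mean Absolute Percentage Error), although it is computed as the mean absolute error divided by the mean price.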
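
The error figure reported in the cells below is the mean absolute error divided by the mean price, which differs from the textbook MAPE. The sketch below contrasts the two on made-up numbers. The function names `relative_mae` and `mape` and the demo values are assumptions for illustration, and `relative_mae` divides by the mean of the supplied true values, whereas the notebook divides by the mean of the whole price column.

```python
import numpy as np

def relative_mae(y_true, y_pred):
    """Mean absolute error divided by the mean of the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)

def mape(y_true, y_pred):
    """Textbook MAPE: mean of the absolute errors relative to each true value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

# Hypothetical prices (in thousands of dollars) and predictions
y_true_demo = [100.0, 200.0, 400.0]
y_pred_demo = [110.0, 180.0, 400.0]

print(relative_mae(y_true_demo, y_pred_demo))
print(mape(y_true_demo, y_pred_demo))
```

The two measures agree only when the relative errors are similar across cheap and expensive flats; MAPE weights errors on cheap flats more heavily.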

In [None]:
df

In [None]:
X = df.drop(columns='price')  #price will be outcome variable
Y = df.price

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2) #division into training and test data sets 

In [None]:
lr_model = LinearRegression()          #OLS Model
lr_model.fit(X_train, y_train)         #using just training data 
y_predicted = lr_model.predict(X_test) #prediction 



parameters = list(zip(X.columns, lr_model.coef_))  #creating list with the variable and its parameter
parameters #printing

All coefficients appear plausible. The larger the flat, the higher its price. Distance from the centre and from the metro lowers the price. A flat on the first or last floor is on average cheaper by more than 6 thousand dollars.

In [None]:
lr_model.score(X_test, y_test), mean_absolute_error(y_test, y_predicted)/df.price.mean()  #Showing R2 and the relative error (MAE divided by mean price)

The model explains approximately 65% of the variance in housing prices. The relative error of the forecast is about 15%.

Below is an attempt to get a better result using decision tree regression.

In [None]:
tree_regr = DecisionTreeRegressor(max_depth=3) #There are only 6 predictors, so a depth of 3 seems fairly reasonable
tree_regr.fit(X_train, y_train) #fitting
y_tree_predicted = tree_regr.predict(X_test) #Prediction on the same test sample as used for the linear regression
tree_regr.score(X_test, y_test), mean_absolute_error(y_test, y_tree_predicted)/df.price.mean() #R2 and relative error (MAE divided by mean price) to compare with lr_model

The model is inferior to linear regression. It relies mainly on the variables 'livesp' and 'kitsp', which is not particularly surprising: in the linear regression model these two variables also had large coefficients.

Below is a graphical representation of the tree using a dedicated method from the scikit-learn library.

In [None]:
fig = plt.figure(figsize=(25,20))   #adjusting size
tree.plot_tree(tree_regr, filled=True)  

In [None]:
#To improve the decision tree regressor, this cell searches for the best-fitting tree by varying the max_depth parameter

MAX_DEPTH = 10

for i in range(1, MAX_DEPTH+1):
    tree_model = DecisionTreeRegressor(max_depth=i)
    tree_model.fit(X_train, y_train)
    y_prediction = tree_model.predict(X_test)
    print('For depth {deep_value} R_2: {r_score} and MAPE: {mape}'.format(
        deep_value=i,
        r_score=tree_model.score(X_test, y_test),
        mape=mean_absolute_error(y_test, y_prediction)/df.price.mean()))

Increasing the depth of the decision tree does not significantly improve the model's predictive performance. The linear model, with an R2 of 65%, is still the best. To beat this value, the XGBoost model will be used, which, put very simply, is a more advanced variation on decision trees that combines many of them.
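
To give an intuition for how many small trees can be combined, here is a minimal sketch of plain gradient boosting with squared error: each new tree is fitted to the residuals left by the trees before it. It is not XGBoost's actual algorithm, which adds regularisation and other refinements. The function names, the synthetic data and the parameter values are all assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_trees_fit(X, y, n_rounds=50, learning_rate=0.1, max_depth=1):
    """Fit a sequence of small trees, each one on the current residuals."""
    base = float(np.mean(y))                 # start from the mean of the target
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction           # what the ensemble still gets wrong
        small_tree = DecisionTreeRegressor(max_depth=max_depth)
        small_tree.fit(X, residuals)
        prediction = prediction + learning_rate * small_tree.predict(X)
        trees.append(small_tree)
    return base, trees

def boosted_trees_predict(X, base, trees, learning_rate=0.1):
    """Sum the base value and the scaled contributions of all trees."""
    prediction = np.full(X.shape[0], base)
    for small_tree in trees:
        prediction = prediction + learning_rate * small_tree.predict(X)
    return prediction

# Synthetic one-feature data, unrelated to the flats data set
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

base_demo, trees_demo = boosted_trees_fit(X_demo, y_demo)
mse_mean = np.mean((y_demo - base_demo) ** 2)
mse_boost = np.mean((y_demo - boosted_trees_predict(X_demo, base_demo, trees_demo)) ** 2)
print(mse_mean, mse_boost)
```

The `n_rounds` and `max_depth` arguments play roughly the role of `n_estimators` and `max_depth` in `XGBRegressor`.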

In [None]:
xgb_model = xgb.XGBRegressor(n_estimators = 100, max_depth = 1)
X_train = np.ascontiguousarray(X_train)    #Changing format to avoid XGBoost Warning
y_train = np.ascontiguousarray(y_train) 
X_test = np.ascontiguousarray(X_test)
y_test = np.ascontiguousarray(y_test)

xgb_model.fit(X_train, y_train)

xgb_prediction = xgb_model.predict(X_test)
xgb_model.score(X_test, y_test), mean_absolute_error(y_test, xgb_prediction)/df.price.mean()

XGBoost performs far better than a simple decision tree. It is now necessary to find the most suitable parameters. 

In [None]:
r_score_results = []   #This list collects the R2 score for every value of "n_estimators"
MAPE_result = []       #This list collects the MAPE (MAE divided by mean price) for every value of "n_estimators"

for i in range(5,100):
    xgb_model = xgb.XGBRegressor(n_estimators = i, max_depth = 4)  #After some trials this depth turned out to be the most effective
    xgb_model.fit(X_train, y_train)
    xgb_prediction = xgb_model.predict(X_test)
    r_score_results.append(xgb_model.score(X_test, y_test))
    MAPE_result.append(mean_absolute_error(y_test, xgb_prediction)/df.price.mean())

In [None]:
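#Plotting how R2 and MAPE change in relation to n_estimators
n_values = list(range(5,100))   #the same n_estimators values as in the loop above, used as the x axis
plt.figure(figsize=(12,12))
plt.plot(n_values, r_score_results, c='#191970')
plt.plot(n_values, MAPE_result, c = '#4e6e81')
plt.xlabel('n_estimators')
plt.legend(labels=['R2', 'MAPE'])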

In [None]:
max(r_score_results), min(MAPE_result)

In [None]:
best_index = r_score_results.index(max(r_score_results)) #Finding index of max value
n = list(range(5,100))[best_index]  #Recreating the range used in the loop above to get the matching value of n_estimators
print(n)

In [None]:
#Last try 
xgb_model = xgb.XGBRegressor(n_estimators = n, max_depth = 4)
xgb_model.fit(X_train, y_train)
xgb_prediction = xgb_model.predict(X_test)
xgb_model.score(X_test, y_test), mean_absolute_error(y_test, xgb_prediction)/df.price.mean()

To sum up, a simple linear regression model performs quite well in predicting flat prices in Moscow from the analysed data set. The simplest decision tree regression model does not do as well (possibly because the sample is a bit too small). The highest R2 was achieved using XGBRegressor with the parameters from the last code cell. However, an improvement of this size in R2 over linear regression is probably not worth the loss of model readability.