# Explore data
Before modelling, we explore the data to see how it can be used.

In [1]:
# Load libraries
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot, PredictionError



In [2]:
# Load pickle file
df = pd.read_pickle("../data/intermediate.pkl")

In [None]:
[col for col in df.columns if col.startswith("vve")]

In [3]:
# Temporary fix: clean up two column names
df.rename(columns={"rf_plat dak": "rf_plat_dak", 
                   "address_x": "address"}, 
          inplace=True)

In [None]:
# Use a subset without neighborhoods for correlation check
subset = df[[col for col in df.columns if not col.startswith("ne")]]

## Correlation check
Initially we want to know what factors have a big influence on the asking price.

In [None]:
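# Produce a correlation heatmap (numeric and boolean columns only,
# since text columns such as the address cannot be correlated)
fig, ax = plt.subplots(figsize=(16, 10))
sns.heatmap(subset.select_dtypes(include=["number", "bool"]).corr(), ax=ax)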

There seems to be high correlation between the various VVE columns, so we decide to drop all but 2 of them (vve_contribution and vve_maintenance).

The same applies to 2 roof columns: one roof type (rt_pannen) and one roof form (rf_plat_dak).
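
To back up the visual impression from the heatmap, a small helper along these lines could list the strongly correlated column pairs. This is a minimal sketch: the function name `highly_correlated_pairs`, the 0.8 threshold and the toy column names are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    # Absolute correlations between numeric columns only
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: column names and values are made up for illustration
demo = pd.DataFrame({
    "vve_a": [1, 2, 3, 4, 5],
    "vve_b": [2, 4, 6, 8, 10],  # perfectly correlated with vve_a
    "living_area": [50, 20, 40, 10, 30],
})
print(highly_correlated_pairs(demo))
```

On the real data the call would be `highly_correlated_pairs(subset)`. For each reported pair, one of the two columns is a candidate for removal.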

In [None]:
# Drop columns
vve = [col 
       for col in df.columns 
       if col.startswith("vve") 
       and col not in ["vve_contribution", "vve_maintenance"]]
others = ["rt_pannen", "rf_plat_dak", "address", "price_m2"]

a = df.drop(columns=vve + others)

For reasons not yet understood, the model fails when all of these columns are removed, so in practice only address and price_m2 are removed.

In [None]:
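# Correlation of each numeric column with the asking price, strongest positive first
# (the first row is asking_price itself, which is skipped in the plot below)
corr_series = (df.select_dtypes(include=["number", "bool"])
                 .corr()[["asking_price"]]
                 .sort_values(by="asking_price", ascending=False))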

plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(corr_series[1:20], 
                      vmin=-1, 
                      vmax=1, 
                      annot=True, 
                      cmap='BrBG')
heatmap.set_title("Correlation with asking price", 
                  fontdict={'fontsize':18}, 
                  pad=16);

## Preprocessing

In [None]:
# Select numeric feature columns with more than 2 distinct values
# (binary indicator columns and the target are left unscaled)
num_cols = [col for col in df.columns 
            if df[col].nunique() > 2 
            and df[col].dtype in ["int64", "float64"] 
            and col != "asking_price"]

In [None]:
# Distribution of days_online before scaling
df["days_online"].hist();

In [None]:
# Reset to a default index so the scaled frame built below aligns row by row
df.reset_index(drop=True, inplace=True)

In [None]:
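# Fit the scaler on the numeric columns and write the scaled values back to the dataframe
std = StandardScaler()
scaled_fit = std.fit(df[num_cols])
df[num_cols] = pd.DataFrame(scaled_fit.transform(df[num_cols]), columns=num_cols)
# Same histogram after scaling: the shape is unchanged, only the axis is rescaled
df["days_online"].hist();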
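
As a reminder of what the scaler does to each column, the sketch below applies the same transformation by hand: subtract the mean and divide by the population standard deviation, which is what `StandardScaler` uses. The helper name `standardize` and the sample values are made up for illustration, and the helper assumes the column is not constant.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance: z = (x - mean) / std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof=0)
    return [(x - mean) / std for x in values]


days = [10, 20, 30, 40, 50]  # made-up values
scaled = standardize(days)
print([round(z, 3) for z in scaled])
```

Because this is a linear transformation, the ordering of the values and the shape of their histogram stay the same.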

## Split data

In [None]:
# Separate the features (X) from the target (y)
X = df.drop(columns="asking_price")
y = df["asking_price"]

# Hold out 40% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=7)
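
One caveat worth considering: in the preprocessing step the scaler is fitted on the whole dataframe before the split, so the test rows contribute to the mean and standard deviation. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data. The dataframe `toy`, its column names and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataframe; column names are illustrative
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "living_area": rng.normal(80, 20, size=100),
    "days_online": rng.exponential(30, size=100),
    "asking_price": rng.normal(400_000, 50_000, size=100),
})
feature_cols = ["living_area", "days_online"]

X_train, X_test, y_train, y_test = train_test_split(
    toy[feature_cols], toy["asking_price"], test_size=0.4, random_state=7
)

# Fit the scaler on the training rows only, then reuse its statistics for the test rows
scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), columns=feature_cols, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=feature_cols, index=X_test.index
)

print(X_train_scaled.mean().round(3))  # zero up to floating point error
print(X_test_scaled.mean().round(3))   # generally not exactly zero
```

Keeping the original index on the scaled frames avoids the need for a separate `reset_index` call.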

## Linear regression
Now we fit a linear regression model to see how well a linear relationship explains the asking price.

In [None]:
# Instantiate the estimator
lin_model = linear_model.LinearRegression()
# Fit the model on the training data
lin_model.fit(X_train, y_train)
# score() returns R2; here it is computed on the training data
score = lin_model.score(X_train, y_train)
print(f"R2 (train): {score:.5f}")

Our initial score (R<sup>2</sup> of 0.70 on the training set) is not very high. There are several possible reasons; the most obvious is the way we selected our features.
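
For reference, the R<sup>2</sup> returned by `score` is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand. The function name `r2_manual` and the numbers are made up for illustration.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices and predictions, for illustration only
actual = [300_000, 400_000, 500_000, 600_000]
predicted = [320_000, 380_000, 540_000, 560_000]
print(f"R2: {r2_manual(actual, predicted):.3f}")  # R2: 0.920
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean. Since the score above is computed on the training data, calling `lin_model.score(X_test, y_test)` as well would show how the model does on unseen rows.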

In [None]:
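def visualize_model(plot, model, train, test):
    """Fit a Yellowbrick visualizer on the train split, score it on the test split and show it.

    plot is a visualizer class (e.g. ResidualsPlot); train and test are (X, y) tuples.
    """
    visualizer = plot(model)
    visualizer.fit(*train)
    visualizer.score(*test)
    visualizer.show()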

In [None]:
# Draw the residuals plot and the prediction error plot for the linear model
for plot in (ResidualsPlot, PredictionError):
    visualize_model(plot, lin_model, (X_train, y_train), (X_test, y_test))

#### Interpretation
The points are not randomly dispersed around the horizontal axis, which suggests that a linear regression model is not appropriate for the data and that a non-linear model may fit better. The R<sup>2</sup> for the training set is clearly better than the R<sup>2</sup> for the test set, which is only average. This also shows in the residual distributions: the train data (green) is normally distributed around 0, but the test data is not.
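
To illustrate what a non-random residual pattern looks like, the sketch below fits a straight line to deliberately quadratic synthetic data. The data and variable names are made up, and residuals are taken as actual minus predicted, which is a convention chosen for this example.

```python
import numpy as np

# Synthetic data with a deliberately non-linear (quadratic) relationship
x = np.linspace(0, 10, 50)
y = x ** 2

# Fit a straight line by least squares and compute residuals (actual minus predicted)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# The residuals average out to zero, but their sign depends on x: a systematic pattern
print(f"mean residual: {residuals.mean():.6f}")
print(f"left end: {residuals[0]:.2f}, middle: {residuals[25]:.2f}, right end: {residuals[-1]:.2f}")
```

The residuals are positive at both ends and negative in the middle. A curve like this in a residuals plot is the kind of structure that points towards a non-linear model.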

# Externalize into a script
To be able to run the whole analysis in one go, we move it into an external script.
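
The script itself is not shown here. As a rough idea of its possible shape, the sketch below wraps the split, fit and evaluation steps in one class. Everything in it is an assumption: the class name `DataFrameModelSketch`, the mapping of "LR", "DT" and "RF" to linear regression, decision tree and random forest, the choice to report metrics on the test split, and the synthetic data. Unlike the real class, it takes a dataframe rather than a pickle file name so that the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


class DataFrameModelSketch:
    """Hypothetical stand-in for modelling.DataFrameModel (the real class is not shown)."""

    # Assumed meaning of the short model names
    MODELS = {
        "LR": LinearRegression,
        "DT": DecisionTreeRegressor,
        "RF": RandomForestRegressor,
    }

    def __init__(self, df, target="asking_price", test_size=0.4, random_state=7):
        X = df.drop(columns=[target])
        y = df[target]
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )

    def evaluate_model(self, name):
        """Fit the named model on the training split and report test-set metrics."""
        model = self.MODELS[name]()
        model.fit(self.X_train, self.y_train)
        pred = model.predict(self.X_test)
        mae = mean_absolute_error(self.y_test, pred)
        r2 = r2_score(self.y_test, pred)
        print(f"-----{name}-----")
        print(f"Mean absolute error: {mae:.3f}")
        print(f"R2 score: {r2:.3f}")
        return mae, r2


# Synthetic data, for illustration only
rng = np.random.default_rng(7)
area = rng.uniform(40, 150, size=200)
toy = pd.DataFrame({
    "living_area": area,
    "asking_price": 4_000 * area + rng.normal(0, 20_000, size=200),
})

mdl_sketch = DataFrameModelSketch(toy)
results = {name: mdl_sketch.evaluate_model(name) for name in ["LR", "DT", "RF"]}
```

Returning the metrics in addition to printing them makes it easier to compare models in a table afterwards.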

In [1]:
# Import the external script (modelling.py)
import modelling



In [2]:
mdl = modelling.DataFrameModel("intermediate.pkl")

In [None]:
models = ["LR", "DT", "RF"]
for model in models:
    mdl.evaluate_model(model)


-----LR-----


Model achieved an mean absolute error of 118270.104.
R2 score is 0.608

-----DT-----


Model achieved an mean absolute error of 115000.000.
R2 score is 0.339
