In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # dataviz
import seaborn as sns # dataviz

df = pd.read_csv("../input/apartment-rental-offers-in-germany/immo_data.csv")

%matplotlib inline

In [None]:
df = df.drop(['telekomHybridUploadSpeed', 'noParkSpaces', 'houseNumber',
              'heatingCosts', 'energyEfficiencyClass', 'lastRefurbish',
              'telekomTvOffer', 'telekomUploadSpeed', 'streetPlain',
              'numberOfFloors', 'floor', 'firingTypes', 'serviceCharge',
              'description', 'facilities', 'scoutId', 'pricetrend',
              'baseRentRange', 'noRoomsRange', 'livingSpaceRange',
              'yearConstructedRange'], axis=1) #drop columns that aren't needed

#between() includes both bounds by default
df = df[df.baseRent.between(100, 10000)] #drop extreme rent values
df = df[df.livingSpace.between(10, 500)] #drop extreme and wrongly coded values
df = df[df.noRooms.between(0, 15)] #drop extreme and probably wrongly coded values
df = df[np.isfinite(df['totalRent'])] #drop observations where totalRent isn't available
df = df[df.totalRent.between(100, 10000)] #drop extreme totalRent values


Let's explore the dataset through some visualizations. 

In [None]:
df['baseRent'].hist(bins=30, range=(100,4000), grid=False, color='#86bf91')
plt.title('Distribution of Base Rents')
plt.xlabel('Base Rent')
plt.ylabel('Count')

Let's further explore these values by state.

We can see that many of these rental offers are concentrated in Nordrhein-Westfalen and Sachsen. In most states the mode sits at the lower end of the range (within the first three bars). For Bayern, Baden-Württemberg and Hessen, however, the mode is somewhat higher. This gives us some insight into how rental prices vary between states: the southern states tend to be more expensive than the northern or eastern ones.

Looking at the mean Base Rent by state confirms this trend. In addition to the three states already mentioned, the mean rent in Hamburg and Berlin is relatively high. This makes sense: these are city-states, and big metropolises generally have higher rents. Unlike in other states, there are no rural rental offers to pull the mean down.

In [None]:
df['regio1'].value_counts()

g = sns.FacetGrid(df, col='regio1', col_wrap=4)
g = g.map(plt.hist, 'baseRent', bins=20, range=(100,4000))

In [None]:
df.groupby(['regio1'])['baseRent'].mean()

In [None]:
plt.scatter(x='yearConstructed', y='baseRent', data=df)
plt.title('Price by Year of Construction')
plt.xlabel('Year of Construction')
plt.ylabel('Price')

Plotting a simple scatter plot of price against year of construction, we can see that the vast majority of rental ads are for recently constructed properties and that older properties generally tend to be cheaper. This is probably due to their condition, lack of modern facilities, etc.

In [None]:
sns.regplot(x='livingSpace', y='baseRent', data=df)

Plotting Base Rent against Living Space shows a general trend: bigger apartments tend to be more expensive.

In [None]:
#FEATURE ENGINEERING

#make a single binary variable to indicate if the apartment is refurbished/new
df['refurbished'] = df.condition.isin(['refurbished', 'first_time_use', 'mint_condition',
                                       'fully_renovated',
                                       'first_time_use_after_refurbishment'])

#make a binary variable to indicate if the property is located in a 'rich' state i.e. states where GDP/capita is over 40,000 + Berlin, since it is a metropolis
df['richstates'] = df.regio1.isin(['Bayern', 'Hamburg', 'Baden_Württemberg',
                                   'Hessen', 'Bremen', 'Berlin'])

#make a binary variable to indicate if the property is located in a poor state where property prices are low (the poorest five states of Germany)
df['poorstates'] = df.regio1.isin(['Mecklenburg_Vorpommern', 'Sachsen_Anhalt',
                                   'Thüringen', 'Brandenburg', 'Sachsen'])

#make a binary variable to indicate if the rental property has good interior
df['greatInterior'] = df.interiorQual.isin(['sophisticated', 'luxury'])

#make a binary variable to indicate if the rental property has good heating
df['goodHeating'] = df.heatingType.isin(['central_heating', 'floor_heating',
                                         'self_contained_central_heating'])

#make a binary variable to identify rental ads from last year to factor in any inflationary effects.
df['2018_ads'] = (df.date == 'Sep18')

#transform totalRent into log(totalRent) to get a better distribution + better interpretive quality
df['logRent'] = np.log(df['totalRent'])

In [None]:
y_var = 'logRent'
X_var = ['balcony', 'hasKitchen', 'cellar', 'livingSpace', 'noRooms', 'garden',
         'refurbished', 'richstates', 'poorstates', 'greatInterior', 'newlyConst',
         '2018_ads', 'lift']
y = df[y_var].values #1-D target array, which the scikit-learn regressors expect
X = df[X_var].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, 
                                                    random_state=42)


To build the models and find appropriate hyperparameters for Random Forest and GBM, I ran a Random Search in an IDE on my own computer. I am not copying the Random Search code here since it takes a long time to run and Kaggle kernels (at least for me) sometimes disconnect automatically.
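
The original search code is not reproduced in this notebook. As a rough illustration only, a random search of this kind might be set up as in the sketch below. The parameter ranges, `n_iter`, `cv`, the scoring choice and the synthetic stand-in data are all assumptions made for the example, not the settings actually used.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption): 300 rows and 13 features, like X_var
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 13)
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Illustrative search ranges (assumption, not the ranges from the real search)
param_distributions = {
    'max_depth': randint(2, 30),
    'max_features': randint(1, 14),      # integers 1..13
    'min_samples_leaf': randint(1, 50),
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=1111),
    param_distributions=param_distributions,
    n_iter=5,                            # kept tiny so the sketch runs quickly
    scoring='neg_mean_absolute_error',
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)
best_params = search.best_params_
print(best_params)
```

On the real data one would pass `X_train` and `y_train` instead of the synthetic arrays and use a much larger `n_iter`, which is what makes the search slow.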

I will be running a Linear Regression model, a Random Forest Regressor model and a Gradient Boosting Regressor model. Once all three are done, I will build a simple stacked model out of them and see if it improves predictions.

In [None]:
#LINEAR REGRESSION
from sklearn.linear_model import LinearRegression
from sklearn import metrics

def linearregression(xtrain, ytrain, xtest, ytest):
    linreg = LinearRegression()
    linreg.fit(xtrain, ytrain)
    y_pred = linreg.predict(xtest)
    print('MAE:', metrics.mean_absolute_error(ytest, y_pred))
    print('MSE:', metrics.mean_squared_error(ytest, y_pred))

linearregression(X_train, y_train, X_test, y_test)

In [None]:
#RANDOM FOREST
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

#Best hyperparameters from the Random Search:
#min_samples_leaf: 30, max_features: 11, max_depth: 24

def randomforestreg(msl, mf, md, xtrain, ytrain, xtest, ytest):
    rfr_best = RandomForestRegressor(n_estimators=70, random_state=1111,
                                     max_depth=md, max_features=mf, min_samples_leaf=msl)
    rfr_best.fit(xtrain,ytrain)
    y_pred_rfr = rfr_best.predict(xtest)
    print('MAE:', metrics.mean_absolute_error(ytest, y_pred_rfr))
    print('MSE:', metrics.mean_squared_error(ytest, y_pred_rfr))
    
randomforestreg(30, 11, 24, X_train, y_train, X_test, y_test)

In [None]:
#GRADIENT BOOSTING
from sklearn.ensemble import GradientBoostingRegressor

#Best hyperparameters from the Random Search:
#max_depth: 16, min_samples_leaf: 117, n_estimators: 73, max_features: 10, learning_rate: 0.07
def gradientboostingmachine(md, msl, n, mf, lr, xtrain, ytrain, xtest, ytest):
    gbm_best = GradientBoostingRegressor(n_estimators=n, random_state=1111,
                                         max_depth=md, max_features=mf, 
                                         min_samples_leaf=msl, learning_rate=lr
                                         )
    gbm_best.fit(xtrain, ytrain)
    y_pred_gbm = gbm_best.predict(xtest)
    print('MAE:', metrics.mean_absolute_error(ytest, y_pred_gbm))
    print('MSE:', metrics.mean_squared_error(ytest, y_pred_gbm))
    
gradientboostingmachine(16, 117, 73, 10, 0.07, X_train, y_train, X_test, y_test) 

To build the stacked model, I have used a simple Linear Regression to combine the individual predictions of the three models. Using Lasso, Ridge or Elastic Net as the combiner instead might give a better result.

In [None]:
def stackedmodel(xtrain, ytrain, xtest, ytest):
    x_training, x_valid, y_training, y_valid = train_test_split(xtrain, ytrain,
                                                                test_size=0.5,
                                                                random_state=42)
    model1 = LinearRegression()
    #same hyperparameters as the standalone Random Forest (min_samples_leaf: 30)
    model2 = RandomForestRegressor(n_estimators=70, random_state=1111,
                                   max_depth=24, max_features=11, 
                                   min_samples_leaf=30)
    model3 = GradientBoostingRegressor(n_estimators=73, random_state=1111,
                                       max_depth=16, max_features=10, 
                                       min_samples_leaf=117, learning_rate=0.07)
    
    model1.fit(x_training, y_training)
    model2.fit(x_training, y_training)
    model3.fit(x_training, y_training)
    
    preds1 = model1.predict(x_valid)
    preds2 = model2.predict(x_valid)
    preds3 = model3.predict(x_valid)
    
    testpreds1 = model1.predict(xtest)
    testpreds2 = model2.predict(xtest)
    testpreds3 = model3.predict(xtest)
    
    stackedpredictions = np.column_stack((preds1, preds2, preds3))
    stackedtestpredictions = np.column_stack((testpreds1, testpreds2,
                                              testpreds3))
    
    metamodel = LinearRegression()
    metamodel.fit(stackedpredictions, y_valid)
    final_predictions = metamodel.predict(stackedtestpredictions)
    print('MAE:', metrics.mean_absolute_error(ytest, final_predictions))
    print('MSE:', metrics.mean_squared_error(ytest, final_predictions))

stackedmodel(X_train, y_train, X_test, y_test)

Our results from the stacked model give us a Mean Absolute Error of about 0.17. Since we used log(totalRent) as the dependent variable, the MAE is measured in log units, and for errors of this size it can be read approximately as a Mean Absolute Percentage Error. This means that our stacked model predicts rent with an average absolute error of roughly 17%, which isn't a bad prediction. However, the individual predictions from the Gradient Boosting Regressor were slightly better than those of the stacked model.
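
Reading a log-scale error as a percentage is only an approximation. The small check below takes the 17% figure quoted above as an absolute error of 0.17 in log units and converts it back to the rent scale, to show how close the approximation is.

```python
import numpy as np

log_error = 0.17  # absolute error on log(totalRent), as quoted above

# Predicting 0.17 log units too high multiplies the rent by exp(0.17)
over = np.exp(log_error) - 1      # relative overshoot
# Predicting 0.17 log units too low multiplies the rent by exp(-0.17)
under = 1 - np.exp(-log_error)    # relative undershoot

print(f'overshoot: {over:.1%}, undershoot: {under:.1%}')
```

So an error of 0.17 in log units corresponds to being roughly 16% to 19% off on the actual rent, depending on the direction of the error.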

I haven't tested alternative metamodels for the stack, but there is a chance that they would give a better overall error than a simple Linear Regression metamodel. I think a shrinkage model (such as Ridge or Lasso) in place of the Linear Regression stacker might improve our predictions.
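
As an untested starting point, a shrinkage metamodel could be tried with scikit-learn's `StackingRegressor` and a `RidgeCV` final estimator, as in the sketch below. The synthetic data, the base-model settings and the alpha grid are assumptions chosen to keep the example small. Note also that `StackingRegressor` trains the metamodel on out-of-fold predictions, which differs from the single 50/50 holdout split used in this notebook.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption) with a linear and a non-linear component
rng = np.random.RandomState(0)
X_demo = rng.rand(400, 5)
y_demo = (3.0 * X_demo[:, 0] + np.sin(6 * X_demo[:, 1])
          + rng.normal(scale=0.1, size=400))
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25,
                                      random_state=42)

stack = StackingRegressor(
    estimators=[
        ('lin', LinearRegression()),
        ('rf', RandomForestRegressor(n_estimators=30, random_state=1111)),
        ('gbm', GradientBoostingRegressor(n_estimators=50, random_state=1111)),
    ],
    # Ridge metamodel; the penalty strength is picked by cross-validation
    final_estimator=RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]),
    cv=5,
)
stack.fit(Xtr, ytr)
stack_preds = stack.predict(Xte)
stack_mae = mean_absolute_error(yte, stack_preds)
print('stacked MAE:', stack_mae)
```

Whether this beats the plain Linear Regression stacker on the rental data would have to be checked by running it with the tuned base models on `X_train` and `y_train`.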

It goes without saying that this is an extremely simple model. I think we can improve upon our predictions quite a lot with some heavier feature engineering.

The dataset provides us with more detailed locational variables. Clever locational feature engineering could better capture the differences in rents across locations (urban vs. rural, metro vs. small city, etc.). Moreover, I left several variables out of the analysis because of their high number of missing values. If these values are imputed sensibly, the additional variables might further improve our model.
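
One simple way to bring such columns back in is median imputation with an added missing-value flag, so the model can still tell which values were filled in. The sketch below uses scikit-learn's `SimpleImputer` on a made-up four-row frame. The values are invented for illustration, and the choice of the median strategy is an assumption rather than a recommendation specific to this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (invented values) standing in for two sparsely filled columns
demo = pd.DataFrame({
    'serviceCharge': [120.0, np.nan, 200.0, np.nan],
    'floor': [1.0, 3.0, np.nan, 2.0],
})

# Fill gaps with the column median and append one 0/1 "was missing" flag
# per column that contained missing values
imputer = SimpleImputer(strategy='median', add_indicator=True)
imputed = imputer.fit_transform(demo)

# Output columns: serviceCharge, floor, serviceCharge_missing, floor_missing
print(imputed)
```

The resulting array can be joined back onto the other features before the train/test split. To avoid leakage, the imputer should be fitted on the training rows only.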

Another interesting option is to add Neural Networks to our stacked model. They might predict prices even better; the only reason I didn't include them here was the time constraint.

I encourage those who are interested to try playing around with this model by including more variables.
