![logo](images/Makeathon_Logo.png)
# Notebook 4 - Regression

# House Price Prediction with scikit-learn

**In this notebook you'll get started with regression problems: problems where you have to predict a continuous value (such as a price or a temperature) rather than a class. We'll try to predict house prices from some information on each house (number of bedrooms, whether it has a garage, etc.).**

In [1]:
# Import ML tools
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
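# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer as Imputer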
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor



In [2]:
# Link to dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
# Load and preprocess data
train_data = pd.read_csv('./house_prices/train.csv')
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)

y = train_data.SalePrice
X = train_data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])

train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)
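# Fill missing values with the column means learned from the training split only
my_imputer = Imputer()
train_X = my_imputer.fit_transform(train_X)
# Reuse the training statistics; refitting on the test split would leak test information
test_X = my_imputer.transform(test_X)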

**For regression problems on tabular data like this, decision trees rather than neural networks are often the tool of choice. Why? In many use cases, tree-based models perform similarly well while being easier to train and easier to interpret. So let's test a tree!**

In [3]:
# Load, train, and test model
decision_model = DecisionTreeRegressor()
decision_model.fit(train_X, train_y)
predicted_decision_trees = decision_model.predict(test_X)
print("Mean Absolute Error Decision Trees :", mean_absolute_error(test_y, predicted_decision_trees))

Mean Absolute Error Decision Trees : 27088.09315068493
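**The interpretability claim made earlier can be illustrated with a small sketch. It assumes a hypothetical toy dataset (`area`, `garage` and the price formula are invented for illustration, not taken from the Kaggle data) and uses scikit-learn's `export_text` to print the learned rules of a deliberately shallow tree.**

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data: price depends mainly on living area, plus a garage bonus
rng = np.random.RandomState(0)
area = rng.uniform(50, 250, size=200)
garage = rng.randint(0, 2, size=200)
price = 1000 * area + 20000 * garage + rng.normal(0, 5000, size=200)
X_toy = np.column_stack([area, garage])

# A shallow tree keeps the rule set small enough to read
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
shallow_tree.fit(X_toy, price)

# Each printed line is an if/else split; each leaf shows the predicted value
rules = export_text(shallow_tree, feature_names=["area", "garage"])
print(rules)
```

**The full tree trained on the house data is far deeper, so in practice you would probably cap `max_depth` before printing its rules.**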


In [4]:
# Have a look at the results
y_hat = decision_model.predict(test_X[:10])  # predict once, outside the loop
for i in range(10):
    out = "Pred: {}$| Real: {}$| Diff: {}$".format(int(y_hat[i]), int(test_y[i]), int(y_hat[i] - test_y[i]))
    print(out)

Pred: 395000$| Real: 410000$| Diff: -15000$
Pred: 230000$| Real: 261500$| Diff: -31500$
Pred: 124000$| Real: 135000$| Diff: -11000$
Pred: 188000$| Real: 150000$| Diff: 38000$
Pred: 156500$| Real: 185750$| Diff: -29250$
Pred: 186500$| Real: 177000$| Diff: 9500$
Pred: 82500$| Real: 175500$| Diff: -93000$
Pred: 127000$| Real: 153000$| Diff: -26000$
Pred: 125500$| Real: 112000$| Diff: 13500$
Pred: 180000$| Real: 149000$| Diff: 31000$
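**The mean absolute error is simply the average of the absolute values of the differences printed above. As a minimal sketch, the following recomputes it by hand for the first three rows only and cross-checks the result against `mean_absolute_error`; the variable names are illustrative.**

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First three prediction/target pairs from the decision tree output above
y_true = np.array([410000, 261500, 135000])
y_pred = np.array([395000, 230000, 124000])

# MAE = mean of |prediction - target| = (15000 + 31500 + 11000) / 3
mae_manual = np.mean(np.abs(y_pred - y_true))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

**Because the errors are not squared, a single large miss (like the -93000 row) raises the MAE less than it would raise a squared-error metric.**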


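**As you've seen, decision trees are very easy to train and already give usable predictions, at least for house prices: the mean absolute error is roughly 27,000 dollars, although individual predictions can be off by much more (one of the ten above misses by 93,000 dollars).**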

**However, there is one major problem with decision trees: they tend to overfit the training data. This is why random forests, which train many decision trees on random subsets of the data and average their predictions, are often the better choice.**
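**Overfitting can be made visible by comparing the error on the training data with the error on held-out data. The sketch below assumes a hypothetical noisy toy dataset (not the house data) and an unrestricted tree like the one used above.**

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a linear trend plus random noise
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# With no depth limit the tree keeps splitting until it reproduces every training target
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
test_mae = mean_absolute_error(y_te, tree.predict(X_te))
print("Train MAE:", train_mae)
print("Test MAE: ", test_mae)
```

**A training error of essentially zero next to a clearly larger test error means the tree has memorised the noise in the training set rather than only the underlying trend.**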

In [5]:
# Train a random forest: 100 trees, each limited to depth 10 to curb overfitting
forest_model = RandomForestRegressor(n_estimators=100, max_depth=10)
forest_model.fit(train_X, train_y)
predicted_random_forest = forest_model.predict(test_X)
print("Mean Absolute Error Random Forest:", mean_absolute_error(test_y, predicted_random_forest))

Mean Absolute Error Random Forest: 18295.54722261151
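**That a random forest's prediction is the average over its trees can be checked directly. This is a small sketch on hypothetical toy data; it relies on the fitted forest exposing its individual trees through the `estimators_` attribute.**

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data with two features
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(0, 1.0, size=200)

forest = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=0)
forest.fit(X_toy, y_toy)

# Collect each tree's prediction for a few samples and average them manually
X_new = X_toy[:5]
per_tree = np.stack([t.predict(X_new) for t in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The manual average should match the forest's own prediction
print(np.allclose(manual_average, forest.predict(X_new)))
```

**Averaging smooths out the errors of individual overfitted trees, which is one plausible reason the forest above reaches a lower error than the single tree.**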


In [6]:
# Have a look at the results
y_hat = forest_model.predict(test_X[:10])  # predict once, outside the loop
for i in range(10):
    out = "Pred: {}$| Real: {}$| Diff: {}$".format(int(y_hat[i]), int(test_y[i]), int(y_hat[i] - test_y[i]))
    print(out)

Pred: 367638$| Real: 410000$| Diff: -42361$
Pred: 241377$| Real: 261500$| Diff: -20122$
Pred: 137441$| Real: 135000$| Diff: 2441$
Pred: 151238$| Real: 150000$| Diff: 1238$
Pred: 164177$| Real: 185750$| Diff: -21572$
Pred: 153651$| Real: 177000$| Diff: -23348$
Pred: 164973$| Real: 175500$| Diff: -10526$
Pred: 139116$| Real: 153000$| Diff: -13883$
Pred: 117771$| Real: 112000$| Diff: 5771$
Pred: 163035$| Real: 149000$| Diff: 14035$


**Looks noticeably better: the mean absolute error drops from about 27,000 to about 18,300 dollars. Of course, as shown in the NLP notebook, you could now use widgets for this type of problem too, giving you a nice little UI to interact with your regression model of choice.**