# "House Prices: Advanced Regression Techniques" Kaggle competition

## Authors: David Fernández & Rafael Lazcano

## Introduction:

This document is the final report for the "Artificial Intelligence and Machine Learning" subject. It illustrates a complete prediction workflow, specifically the challenge of modelling house prices from a wide set of features.

The only rule is to use models covered by the subject's syllabus.



## Data cleansing 

    - general goals
    - fill NA values: with 0 or with the mean
    - one-hot encode: the default for categorical features, although feature engineering sometimes converts categories into ordinal values
    - merge one-hot encoded columns
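
The cleansing steps listed above could look roughly like the following sketch. It assumes pandas and a toy frame; the helper name `clean`, the per-column fill policy and the sample rows are illustrative assumptions, not the authors' actual code. Train and test are concatenated before encoding so that both end up with the same one-hot columns.

```python
import numpy as np
import pandas as pd

def clean(train: pd.DataFrame, test: pd.DataFrame):
    """Fill missing values and one-hot encode train and test consistently."""
    n_train = len(train)
    # Stack both sets so they end up with identical dummy columns.
    full = pd.concat([train, test], axis=0, ignore_index=True)

    # Assumed policy: a missing garage area means "no garage", so fill with 0.
    full["GarageArea"] = full["GarageArea"].fillna(0)
    # Assumed policy: fill lot frontage with the training-set mean.
    full["LotFrontage"] = full["LotFrontage"].fillna(train["LotFrontage"].mean())

    # One-hot encode every remaining categorical column.
    full = pd.get_dummies(full, dtype=int)

    cleaned_train = full.iloc[:n_train].copy()
    cleaned_test = full.iloc[n_train:].reset_index(drop=True)
    return cleaned_train, cleaned_test

# Toy rows standing in for the competition data.
train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "GarageArea": [400.0, np.nan, 500.0],
    "Neighborhood": ["NAmes", "OldTown", "NAmes"],
})
test = pd.DataFrame({
    "LotFrontage": [np.nan],
    "GarageArea": [300.0],
    "Neighborhood": ["Edwards"],
})

clean_train, clean_test = clean(train, test)
print(clean_train.columns.tolist())
```

Encoding the two sets separately would instead require re-aligning their columns afterwards, since a category present in only one set produces a column the other lacks.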


## Feature selection and engineering



    - general goals
    - explain the functions:
        - sum_SF, Baths, Porch
        - categorical to ordinal: some textual features (e.g. basement quality) should be handled as numerical (i.e. ordinal) values
        - transform_sales_to_log_of_sales: brings the target closer to a normal distribution, which gives better performance
        - add_expensive_neighborhood_feature: binary feature marking whether the neighborhood is expensive
        - add_home_quality
        - add...
        - remove.
    - all functions modify the training set, but:
        - some also modify the test set, others do not
        - some require dynamically modifying the features of the test set
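
As an illustration of three of the functions named above, here is a minimal sketch. The function names come from the list, but their bodies, the `TotalSF` column, the `QUALITY_SCALE` mapping and the sample rows are assumptions made for this example. `np.log1p` is used for the target because the final cell of this report inverts predictions with `np.expm1`.

```python
import numpy as np
import pandas as pd

# Assumed ordinal scale for quality-like columns (higher is better).
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def sum_SF(df: pd.DataFrame) -> pd.DataFrame:
    """Add a total surface feature (assumed definition)."""
    df = df.copy()
    df["TotalSF"] = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]
    return df

def categorical_to_ordinal(df: pd.DataFrame, column: str = "BsmtQual") -> pd.DataFrame:
    """Replace quality labels by integers; missing labels become 0."""
    df = df.copy()
    df[column] = df[column].map(QUALITY_SCALE).fillna(0).astype(int)
    return df

def transform_sales_to_log_of_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Log-transform the target; only applicable to the training set."""
    df = df.copy()
    df["SalePrice"] = np.log1p(df["SalePrice"])
    return df

# Toy rows standing in for the competition data.
houses = pd.DataFrame({
    "TotalBsmtSF": [800, 0],
    "1stFlrSF": [900, 1000],
    "2ndFlrSF": [700, 0],
    "BsmtQual": ["Gd", np.nan],
    "SalePrice": [200000.0, 120000.0],
})

houses_fe = transform_sales_to_log_of_sales(categorical_to_ordinal(sum_SF(houses)))
print(houses_fe)
```

Each function returns a copy, which makes it straightforward to apply an arbitrary subset of them to the training set, the test set, or both.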
        
    - FORWARD SELECTION OF FE FUNCTIONS:
        - first, brute force over the whole space: intractable
        - then, a forward selection algorithm: choose one function, fix it, search for the next one. It covers too small a part of the space of combinations (local minimum). Tradeoff: keep the three best functions at each step.
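
The tradeoff described above (keeping the three best candidates at each step instead of one) amounts to a small beam search. The sketch below is a hedged illustration: `beam_forward_selection`, `toy_score` and the candidate names `a` to `d` are hypothetical, the order in which functions are applied is assumed not to matter, and the toy score replaces what would in practice be a cross-validated model score.

```python
def beam_forward_selection(candidates, score, beam_width=3):
    """Forward selection that keeps the `beam_width` best subsets per step.

    `candidates` are identifiers of feature-engineering functions and
    `score` maps a tuple of identifiers to a number (higher is better).
    Returns the best (subset, score) pair found.
    """
    beam = [()]                       # start from the empty selection
    best = ((), score(()))
    while True:
        expansions = {}
        for subset in beam:
            for name in candidates:
                if name in subset:
                    continue
                new_subset = tuple(sorted(subset + (name,)))
                if new_subset not in expansions:
                    expansions[new_subset] = score(new_subset)
        if not expansions:
            break                     # every candidate is already selected
        ranked = sorted(expansions.items(), key=lambda kv: kv[1], reverse=True)
        if ranked[0][1] <= best[1]:
            break                     # no expansion improves on the best so far
        best = ranked[0]
        beam = [subset for subset, _ in ranked[:beam_width]]
    return best

def toy_score(subset):
    """Hypothetical score with interactions between functions."""
    chosen = set(subset)
    gains = {"a": 5, "b": 4, "c": 3, "d": -2}
    total = sum(gains[name] for name in chosen)
    if {"a", "b"} <= chosen:
        total -= 6                    # a and b hurt each other
    if {"b", "c"} <= chosen:
        total += 5                    # b and c work well together
    return total

candidates = ["a", "b", "c", "d"]
greedy = beam_forward_selection(candidates, toy_score, beam_width=1)
beam3 = beam_forward_selection(candidates, toy_score, beam_width=3)
print("beam_width=1:", greedy)
print("beam_width=3:", beam3)
```

With `beam_width=1` the search commits to `a` first and never reaches the pair `("b", "c")`, which is the local-minimum problem mentioned above; a wider beam keeps `b` and `c` alive long enough to find it, at roughly three times the cost per step.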

## Model selection and tuning

Starting from the dataset produced by the feature selection and engineering process, we trained and evaluated the following models:
* Linear Regression (one level above Lasso and Ridge in the hierarchy)??
* Lasso Regression
* Ridge Regression
* Decision trees (one level above Random Forest and XGBoost in the hierarchy)??
* Random Forest
* XGBoost

Among these models, the ones that gave the best results without fine-tuning were *Linear Regression*, *Ridge Regression* and *XGBoost*.
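
A comparison of untuned models such as the one described above can be sketched with cross-validation. This is an illustrative setup, not the authors' evaluation code: the data is synthetic (`make_regression`), the RMSE metric and fold count are assumptions, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost to keep the example dependency-free.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the engineered training set.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Default (untuned) models; GradientBoosting replaces XGBoost in this sketch.
models = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn maximises scores, so RMSE is reported negated.
    scores = cross_val_score(model, X, y, cv=4,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()

# Lowest cross-validated RMSE first.
for name, rmse in sorted(cv_rmse.items(), key=lambda kv: kv[1]):
    print(f"{name:>18}: {rmse:.2f}")
```

The ranking on synthetic linear data says nothing about the ranking on the house price data; the sketch only shows the mechanics of scoring every model on the same folds.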

### Ridge regression:

In [None]:
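# Imports needed to run this cell on its own.
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Load the feature-engineered datasets.
train = pd.read_csv('training_FE_data.csv', index_col=0)
submission_set = pd.read_csv('test_FE_data.csv', index_col=0)

# Split predictors and target.
X = train.loc[:, train.columns != 'SalePrice']
y = train.loc[:, 'SalePrice']

# NOTE: normalize=True was removed in scikit-learn 1.2; on newer versions drop
# it and scale the features in a pipeline instead (not numerically identical).
ridge_regression = Ridge(normalize=True, max_iter=1_000_000, alpha=1)

# 4-fold cross-validation; the default score for regressors is R^2.
scores = cross_val_score(ridge_regression, X, y, cv=4)
print(scores)

# Refit on the full training set and predict the submission set.
ridge_regression.fit(X, y)
result = ridge_regression.predict(submission_set)
# NOTE: if SalePrice was log-transformed during feature engineering,
# invert it (e.g. with np.expm1) before writing the submission.
submission_set['SalePrice'] = result
submission = submission_set[['SalePrice']]
submission.to_csv('FirstRidgeRegression.csv')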

    - models
    - hyperparameter tuning
    - ensemble / stacking

## Stacking

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV, Lasso, RidgeCV, Ridge
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

In [2]:
clean_train = pd.read_csv('training_FE_data.csv', index_col=0)
clean_test = pd.read_csv('test_FE_data.csv', index_col=0)

X = clean_train.loc[:, clean_train.columns != 'SalePrice']
y = clean_train.loc[:, 'SalePrice']


In [3]:
"""
lassoCV = LassoCV(alphas=[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1], cv=5, random_state=1)
"""

In [4]:
bestLasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0001, random_state=1))

In [5]:
bestLinear = make_pipeline(RobustScaler(), LinearRegression())

In [6]:
"""
parameters = {'n_estimators': [500, 1000, 2000, 3000, 5000], 
              'learning_rate': [0.05, 0.1, 0.5, 1],
              'max_depth': [3, 4, 5],
              'min_samples_leaf': [5, 10, 15, 20],
              'min_samples_split': [2, 5, 7, 10]}
GBoost = GridSearchCV(GradientBoostingRegressor(max_features='sqrt', loss='huber', random_state=1), parameters, cv=5)
GBoost.fit(X, y)
GBoost.best_estimator_
"""

In [7]:
bestGBoost = GradientBoostingRegressor(n_estimators=2000, learning_rate=0.05,
                                   max_depth=5, max_features='sqrt',
                                   min_samples_leaf=10, min_samples_split=7, 
                                   loss='huber', random_state=1)

In [8]:
ridge = RidgeCV(alphas=(0.001, 0.005, 0.1, 0.5, 1))
ridge.fit(X, y)
ridge.alpha_

1.0

In [9]:
bestRidge = Ridge(alpha=1)

In [10]:
"""
parameters = {'max_depth': [3, 5, 10],
              'min_samples_leaf': [5, 10, 15, 20],
              'min_samples_split': [2, 5, 7, 10],
              'max_features': ['auto', 'sqrt', 'log2']}
decisionTree = GridSearchCV(DecisionTreeRegressor(random_state=1), parameters, cv=5)
decisionTree.fit(X, y)
decisionTree.best_estimator_
"""

In [11]:
# Best estimator found by the grid search above. Only non-default arguments are
# kept: criterion='mse', max_features='auto', min_impurity_split and presort
# are no longer accepted by recent scikit-learn releases, and omitting them
# leaves the model unchanged.
bestDecisionTree = DecisionTreeRegressor(max_depth=10, min_samples_leaf=15,
                                         random_state=1)

In [12]:
"""
parameters = {
              'max_depth': [3, 5, 7, 10, 50, None],
              'min_samples_leaf': [5, 10, 15, 20],
              'min_samples_split': [2, 5, 7, 10],
              'max_features': ['auto', 'sqrt', 'log2']}
randomForest = GridSearchCV(RandomForestRegressor(random_state=1), parameters, cv=5)
randomForest.fit(X, y)
randomForest.best_estimator_
"""

In [13]:
# Best estimator found by the grid search above, keeping only non-default
# arguments. n_estimators=10 is stated explicitly because it was the default
# when the search ran, while recent scikit-learn releases default to 100.
bestRandomForest = RandomForestRegressor(n_estimators=10, max_depth=10,
                                         max_features='log2', min_samples_leaf=5,
                                         random_state=1)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [15]:
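# NOTE: bestRidge is listed twice, so its effective weight in the blend is the
# sum of two entries; the 7-element weight tuples below depend on this length.
bestModels = [bestLasso, bestLinear, bestRidge, bestGBoost, bestRidge, bestDecisionTree, bestRandomForest]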
for model in bestModels:
    model.fit(X_train, y_train)

In [None]:
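import itertools
from sklearn.metrics import r2_score

# Candidate weight values; a separate name avoids shadowing by the loop variable.
candidate_weights = [0, 0.1, 0.5, 1]

# The base-model predictions do not depend on the weights, so compute them once.
predictions = np.column_stack([
    model.predict(X_test) for model in bestModels
])

all_results = []
# Exhaustive search over every weight combination (4 ** len(bestModels) tuples).
for weights in itertools.product(candidate_weights, repeat=len(bestModels)):
    if np.sum(weights) == 0:
        continue  # all-zero weights cannot be normalised
    # Weighted average of the model predictions for each held-out sample.
    summed_predictions = np.sum(predictions * weights, axis=1)
    result_predictions = summed_predictions / np.sum(weights)
    score = r2_score(y_test, result_predictions)
    all_results.append((weights, score))

# Best-scoring weight combinations first.
all_results.sort(key=lambda tup: tup[1], reverse=True)
all_results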


In [16]:
best_weights = (0, 0, 0.1, 0.95, 0.1, 0, 0.1)
predictions = np.column_stack([
        model.predict(clean_test) for model in bestModels
    ])
weighted_predictions = predictions * best_weights
summed_predictions = np.sum(weighted_predictions, axis=1)
result_predictions = summed_predictions / np.sum(best_weights)

# Map the predictions back from the log scale to prices (inverse of log1p).
result_predictions = np.expm1(result_predictions)

In [17]:
clean_test['SalePrice'] = result_predictions
submission = clean_test[['SalePrice']]
submission.to_csv('stacked.csv')
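
The cells above blend the base models with a fixed weight vector. A related option, not used in this report, is scikit-learn's `StackingRegressor`, which learns the combination with a meta-model fitted on out-of-fold predictions. The sketch below is a hedged illustration on synthetic data; the choice of base models, their hyperparameters and the `RidgeCV` meta-model are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso, Ridge, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered training set.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1)),
        ("lasso", Lasso(alpha=0.001)),
        ("gboost", GradientBoostingRegressor(n_estimators=100, random_state=1)),
    ],
    # The meta-model is trained on 5-fold out-of-fold predictions of the
    # base models, which limits leakage from the training targets.
    final_estimator=RidgeCV(),
    cv=5,
)
stack.fit(X_train, y_train)

stack_r2 = r2_score(y_test, stack.predict(X_test))
print(f"held-out R^2: {stack_r2:.3f}")
```

Compared with the exhaustive weight search, this avoids tuning the weights on the same held-out split that is then used to report the score.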