# Argentina Price Prediction - MD 2022

Notebook for the [Kaggle](https://www.kaggle.com/competitions/fcen-md-2022-prediccion-precio-de-propiedades/leaderboard) competition of the MD course in the DM master's program at UBA.

Author: Tomás Delvechio

# Notebook configuration

The settings defined below affect the rest of the notebook.

In [8]:
import sys
import os.path

In [10]:
EJECUTA_COLAB = 'google.colab' in sys.modules
DATASET_COMPRESSED_LOCAL_NAME = 'ar_properties.csv.gz'
DATASET_RAW_LOCAL_NAME = 'ar_properties.csv'


# Dataset download

Checks whether the dataset has already been downloaded; if not, downloads and decompresses it.

In [12]:
if EJECUTA_COLAB:
    BASE_PATH = '/content'
else:
    BASE_PATH = '.data/'
    ! mkdir -p $BASE_PATH

fulllpath_dataset_properati = os.path.join(
    BASE_PATH, DATASET_COMPRESSED_LOCAL_NAME)
fulllpath_raw_dataset_properati = os.path.join(
    BASE_PATH, DATASET_RAW_LOCAL_NAME)

if os.path.exists(fulllpath_raw_dataset_properati):
    # the decompressed CSV is already present: nothing to do
    print("Properati dataset already downloaded; skipping download.")
else:
    if not os.path.exists(fulllpath_dataset_properati):
        # https://www.properati.com.ar/data
        ! wget -q -O $fulllpath_dataset_properati https://storage.googleapis.com/properati-data-public/ar_properties.csv.gz
    # decompress so that the raw CSV read later actually exists
    ! gzip -d -f $fulllpath_dataset_properati
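The download-and-decompress check above relies on the `wget` and `gzip` shell commands. As a rough sketch of the same logic in plain Python, assuming only the standard library is available, something like the following could be used instead. The helper name `ensure_dataset` and its parameters are hypothetical and not part of the original notebook; unlike `gzip -d`, this version keeps the compressed file on disk.

```python
import gzip
import os
import shutil
import urllib.request


def ensure_dataset(url, compressed_path, raw_path):
    """Hypothetical helper: fetch and decompress only when the raw CSV is missing."""
    if os.path.exists(raw_path):
        return raw_path  # already decompressed, nothing to do
    if not os.path.exists(compressed_path):
        # create the target directory (if any) before downloading
        os.makedirs(os.path.dirname(compressed_path) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, compressed_path)
    # stream-decompress the .gz archive into the raw CSV
    with gzip.open(compressed_path, "rb") as src, open(raw_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return raw_path
```

In this notebook it would be called with the dataset URL, `fulllpath_dataset_properati` and `fulllpath_raw_dataset_properati`.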

# Library imports

In [13]:
import pandas as pd

import sklearn as sk
from sklearn import model_selection
from sklearn import ensemble
from sklearn import metrics

# Reading the data

In [15]:
df = pd.read_csv(fulllpath_raw_dataset_properati, index_col="id")


In [18]:
# filter for quick tests
df = df.loc[(df.l2 == "Córdoba") & (df.operation_type == 'Venta') & (df.property_type == 'Casa')]
df.shape

(17972, 24)

# Dataset analysis

# Data processing

# Modeling

In [19]:
df = df.select_dtypes(include=['float64', 'int64'])
df.fillna(0, inplace=True, downcast= "infer")

In [20]:
X = df[df.columns.drop('price')]
y = df['price']

for n_estimators in [100, 500, 1000]:
    for max_depth in [3, 7, 11]:

        ## You must use RandomForestRegressor, no exceptions. You may change the parameters
        reg = sk.ensemble.RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42, n_jobs=-1)

        ## 10-fold cross-validation (you will see it in AA), -RMSE. Do not touch
        scores = sk.model_selection.cross_val_score(reg, X, y, cv=10, scoring='neg_root_mean_squared_error')

        ## Print the scores (mean RMSE +/- standard deviation). Lower is better
        print(f"n_estimators={n_estimators}, max_depth={max_depth} --> {-scores.mean():.3f} +/- {scores.std():.3f}")


n_estimators=100, max_depth=3 --> 1092877.244 +/- 326348.764
n_estimators=100, max_depth=7 --> 1091396.303 +/- 303108.089
n_estimators=100, max_depth=11 --> 1057311.935 +/- 261268.926
n_estimators=500, max_depth=3 --> 1093316.561 +/- 326591.421
n_estimators=500, max_depth=7 --> 1090799.671 +/- 303480.374
n_estimators=500, max_depth=11 --> 1054131.309 +/- 261351.210
n_estimators=1000, max_depth=3 --> 1093700.609 +/- 326763.124
n_estimators=1000, max_depth=7 --> 1089087.818 +/- 300652.070
n_estimators=1000, max_depth=11 --> 1053844.945 +/- 261281.861
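
The loop above negates `scores.mean()` because scikit-learn scorers follow a "greater is better" convention, so `neg_root_mean_squared_error` returns the RMSE with its sign flipped. As a minimal illustration of the metric itself, here is a plain-Python sketch; the function name `rmse` and the toy prices are made up for this example and are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# toy prices, not taken from the dataset: the errors are -1 and 7
error = rmse([10, 20], [11, 13])
neg_error = -error  # the sign convention used by the scikit-learn scorer
print(error, neg_error)
```

Because the error is squared before averaging, a few very expensive properties can dominate the score, which may partly explain the large standard deviations in the output above.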


In [None]:
df_pred = pd.read_csv('/content/a_predecir.csv', index_col="id")
df_pred.fillna(0, inplace=True)

In [None]:
X_pred = df_pred[X.columns]

In [None]:
# the best hyperparameters found above (lowest mean RMSE)
n_estimators = 1000
max_depth = 11

# training
reg = sk.ensemble.RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42, n_jobs=-1)
reg.fit(X, y)

# prediction
df_pred["price"] = reg.predict(X_pred)
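
As a cross-check on the hyperparameters used for the final model, the following sketch picks the combination with the lowest mean RMSE from the cross-validation output printed earlier. The dictionary simply transcribes those printed means; the variable names are illustrative and not part of the original notebook.

```python
# mean RMSE per (n_estimators, max_depth), copied from the cross-validation output
mean_rmse = {
    (100, 3): 1092877.244,
    (100, 7): 1091396.303,
    (100, 11): 1057311.935,
    (500, 3): 1093316.561,
    (500, 7): 1090799.671,
    (500, 11): 1054131.309,
    (1000, 3): 1093700.609,
    (1000, 7): 1089087.818,
    (1000, 11): 1053844.945,
}

# lower RMSE is better, so pick the key with the smallest value
best_n_estimators, best_max_depth = min(mean_rmse, key=mean_rmse.get)
print(best_n_estimators, best_max_depth)
```

The differences between combinations are small compared with the reported standard deviations (roughly 261,000 to 327,000), so this ranking should be read as indicative rather than conclusive.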

In [None]:
# save the solution
df_pred[["price"]].to_csv("solucion.csv")