# Decision Tree Regressor
### scikit-learn method
---
Based on Chapter 6 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, by Aurélien Géron.

## DecisionTreeRegressor()
---
Check here:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

## Libraries

In [None]:
import os
import tarfile
import urllib.request  # standard library in Python 3; the six compatibility shim is not needed
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error
import numpy as np
from sklearn.tree import DecisionTreeRegressor 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

## Getting the data

In [None]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

In [None]:
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract housing.csv into housing_path."""
    os.makedirs(housing_path, exist_ok=True)  # no-op if the directory already exists
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [None]:
def load_housing_data(housing_path=HOUSING_PATH): 
    csv_path = os.path.join(housing_path, "housing.csv") 
    return pd.read_csv(csv_path)

In [None]:
fetch_housing_data()

In [None]:
housing = load_housing_data()

## Exploratory Data Analysis
---
- Perform an exploratory data analysis:
 - Missing data
 - Histograms
 - Correlations, etc.
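
The checklist above can be sketched as follows. This is a minimal illustration on a small synthetic DataFrame named `demo` (a hypothetical stand-in, not the housing data); in the notebook the same calls would be applied to `housing` once it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing DataFrame (synthetic values, for illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=200),
    "total_bedrooms": rng.integers(1, 1000, size=200).astype("float64"),
})
demo["median_house_value"] = 40000 * demo["median_income"] + rng.normal(0, 20000, size=200)
demo.loc[:4, "total_bedrooms"] = np.nan  # inject 5 missing values (.loc slicing is inclusive)

# Missing data: number of NaN entries per column
missing_counts = demo.isna().sum()
print(missing_counts)

# Summary statistics (count, mean, std, quartiles) per numeric column
print(demo.describe())

# Correlation of every numeric column with the target, strongest first
corr_with_target = demo.corr()["median_house_value"].sort_values(ascending=False)
print(corr_with_target)
```

For histograms, `housing.hist(bins=50, figsize=(20, 15))` followed by `plt.show()` draws one panel per numeric column. Because `housing` also contains the text column `ocean_proximity`, recent pandas versions need `housing.corr(numeric_only=True)` for the correlation step.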

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
            )
plt.legend()
plt.show()

In [None]:
# Keep only the numeric features: drop the categorical column and the target
X = housing.drop(['ocean_proximity', 'median_house_value'], axis=1).astype('float64')

In [None]:
y = housing.median_house_value
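
The `total_bedrooms` column of this dataset contains missing values, and scikit-learn versions before 1.3 raise an error when `DecisionTreeRegressor.fit` receives NaN. One common workaround, shown here only as a sketch, is to fill the gaps with the column median using `SimpleImputer`. The frame `X_demo` below is a hypothetical stand-in for `X`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical feature frame standing in for X (values chosen for illustration)
X_demo = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
X_filled = pd.DataFrame(
    imputer.fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)
print(X_filled)
```

In the notebook, fitting the imputer on the training split only and reusing it to transform the test split avoids leaking test information into the model.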

## X, y: train/test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test , y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

In [None]:
X_train

In [None]:
y_train

## DecisionTreeRegressor

In [None]:
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)

### Train performance metrics

In [None]:
housing_predictions = tree_reg.predict(X_train)
tree_mse = mean_squared_error(y_train, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

### Test performance metrics

In [None]:
housing_predictions_test = tree_reg.predict(X_test)
tree_mse_test = mean_squared_error(y_test, housing_predictions_test)
tree_rmse_test = np.sqrt(tree_mse_test)
print(tree_rmse_test)  # print the RMSE (not the MSE) so it is comparable to the train RMSE

### Question
---
What can you say about the performance of this regression model?
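
One way to approach this question is to compare the train and test RMSE side by side. The sketch below does that on synthetic data (a noisy sine curve, not the housing data), with all names ending in `_demo` being hypothetical. It only illustrates the kind of gap an unconstrained tree can produce.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a sine curve plus Gaussian noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=400)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.5, random_state=1)

# No depth limit: the tree can keep splitting until every leaf is pure
demo_tree = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)

rmse_train_demo = np.sqrt(mean_squared_error(ytr, demo_tree.predict(Xtr)))
rmse_test_demo = np.sqrt(mean_squared_error(yte, demo_tree.predict(Xte)))
print("train RMSE:", rmse_train_demo)
print("test RMSE:", rmse_test_demo)
```

A train error near zero paired with a much larger test error suggests the tree has memorized the training samples, noise included, rather than learned a pattern that generalizes.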

## Cross validation 

In [None]:
scores = cross_val_score(tree_reg, X_train, y_train,
                             scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)


In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
display_scores(tree_rmse_scores)

## GridSearchCV()
---
See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html for how to use this method.
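
As a rough orientation before reading the documentation, the sketch below runs `GridSearchCV` on synthetic data. The grid values, the `min_samples_leaf` parameter choice, and all `demo_` names are assumptions made for illustration, not the settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends linearly on the first feature, plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=300)

# Every combination in the grid (3 x 3 = 9 candidates) is evaluated with 5-fold CV
demo_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}
demo_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    demo_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)

# best_score_ is the mean negated MSE across folds, so negate it before the square root
best_cv_rmse = np.sqrt(-demo_search.best_score_)
print(best_cv_rmse)
```

With the default `refit=True`, the best parameter combination is refitted on the full training data and exposed as `best_estimator_`.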

In [None]:
param_grid = {'max_depth': [1,2,3], 'min_samples_split': [2,3,4,5]}

In [None]:
grid_search = GridSearchCV(tree_reg, param_grid, cv=5,
                               scoring='neg_mean_squared_error',
                               return_train_score=True)

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
best_reg = grid_search.best_estimator_

In [None]:
best_reg

In [None]:
cvres = grid_search.cv_results_

In [None]:
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

In [None]:
train_pred = best_reg.predict(X_train)
test_pred = best_reg.predict(X_test)

In [None]:
# RMSE of the tuned tree on the train and test sets
mse_train = mean_squared_error(y_train, train_pred)
mse_test = mean_squared_error(y_test, test_pred)
print(np.sqrt(mse_train), np.sqrt(mse_test))

## Plot the best tree found with GridSearchCV
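
One possible approach is `sklearn.tree.plot_tree`, which draws a fitted tree with matplotlib. The sketch below fits a small tree on synthetic data so that it runs on its own. The names `demo_tree`, `x0` and `x1` are hypothetical, and the `Agg` backend line is an assumption for headless runs that should be removed inside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line inside Jupyter
from matplotlib import pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Synthetic data and a shallow tree standing in for the tuned estimator
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.5, size=200)
demo_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# plot_tree draws one box per node and returns the matplotlib annotations it created
fig, ax = plt.subplots(figsize=(12, 6))
annotations = plot_tree(
    demo_tree,
    feature_names=["x0", "x1"],
    filled=True,   # color nodes by predicted value
    rounded=True,
    ax=ax,
)
print(len(annotations), "annotations drawn")
```

In this notebook the equivalent call would be `plot_tree(best_reg, feature_names=list(X_train.columns), filled=True)` followed by `plt.show()`.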