The tips below were originally taken from: https://scikit-learn.org/stable/modules/tree.html#

My goal with these notes is to dig deeper into optimization. I plan to reference other studies and to record a few test algorithms right here.

## Tips

1. *Decision trees* tend to overfit on data with a large number of features. Getting the right ratio of samples to features is important, because a tree with few samples in a high-dimensional space is very likely to overfit.
   Consider performing dimensionality reduction (PCA, ICA, or *feature* selection) beforehand to give your decision tree a better chance of finding *features* that are discriminative.
2. Understanding the decision tree structure helps you gain more insight into how the *decision tree* makes its predictions, which is important for understanding which features matter in the data.
3. Visualize your *tree* as you are training, using the export functions. Use `max_depth=3` as an initial tree depth to get a feel for how the tree is fitting your data, and then increase the depth.
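
As a rough illustration of tips 1 to 3, the sketch below reduces a wide synthetic dataset with PCA, fits a shallow tree with `max_depth=3`, and prints its structure as text. The data from `make_regression`, the choice of 5 components, and the names `X_demo`, `y_demo`, and `component_names` are assumptions made for this example only; they are not part of the housing notebook.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for a wide dataset: 50 features, only 10 informative (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=50,
                                 n_informative=10, noise=10.0, random_state=0)

# Tip 1: reduce dimensionality before the tree sees the data
pipeline = make_pipeline(
    PCA(n_components=5, random_state=0),
    DecisionTreeRegressor(max_depth=3, random_state=0),
)
pipeline.fit(X_demo, y_demo)

# Tips 2 and 3: inspect the structure of the shallow tree as plain text
fitted_tree = pipeline.named_steps["decisiontreeregressor"]
component_names = [f"pc{i}" for i in range(5)]  # hypothetical labels for the PCA components
tree_rules = export_text(fitted_tree, feature_names=component_names)
print(tree_rules)
print("depth:", fitted_tree.get_depth(), "leaves:", fitted_tree.get_n_leaves())
```

Once the shallow tree looks sensible, the depth can be raised step by step and the printed rules compared between runs.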

* Remember that the number of samples required to populate the tree doubles for each additional level the tree grows to. Use max_depth to control the size of the tree to prevent overfitting.
* Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered. A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data. Try min_samples_leaf=5 as an initial value. If the sample size varies greatly, a float number can be used as percentage in these two parameters. While min_samples_split can create arbitrarily small leaves, min_samples_leaf guarantees that each leaf has a minimum size, avoiding low-variance, over-fit leaf nodes in regression problems. For classification with few classes, min_samples_leaf=1 is often the best choice.
* Balance your dataset before training to prevent the tree from being biased toward the classes that are dominant. Class balancing can be done by sampling an equal number of samples from each class, or preferably by normalizing the sum of the sample weights (sample_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min_weight_fraction_leaf, will then be less biased toward dominant classes than criteria that are not aware of the sample weights, like min_samples_leaf.
* If the samples are weighted, it will be easier to optimize the tree structure using a weight-based pre-pruning criterion such as min_weight_fraction_leaf, which ensures that leaf nodes contain at least a fraction of the overall sum of the sample weights.
* All decision trees use np.float32 arrays internally. If training data is not in this format, a copy of the dataset will be made.
* If the input matrix X is very sparse, it is recommended to convert to sparse csc_matrix before calling fit and sparse csr_matrix before calling predict. Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples.
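
The bullets on leaf size, class balancing, and sparse input can be tried out with a small sketch like the one below. It is only an illustration under stated assumptions: the data is synthetic, the names `X_dense`, `y_reg`, and `y_cls` are invented for the example, and `compute_sample_weight("balanced", ...)` is one possible way to equalize the per-class weight sums, not necessarily the one the tips had in mind.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.RandomState(0)

# Synthetic, mostly-zero float32 feature matrix (assumption for illustration)
X_dense = rng.rand(500, 20).astype(np.float32)
X_dense[X_dense < 0.9] = 0.0
y_reg = 3.0 * X_dense[:, 0] + rng.normal(scale=0.1, size=500)

# min_samples_leaf=5 guarantees that every leaf holds at least 5 training samples
reg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=0)
reg.fit(csc_matrix(X_dense), y_reg)  # CSC format for fit

# Leaves are the nodes without a left child (marked with -1)
leaf_mask = reg.tree_.children_left == -1
min_leaf_size = reg.tree_.n_node_samples[leaf_mask].min()
print("smallest leaf:", min_leaf_size)

# CSR format for predict; the result matches the dense prediction
pred_sparse = reg.predict(csr_matrix(X_dense))
pred_dense = reg.predict(X_dense)

# Imbalanced labels: 450 samples of class 0, 50 of class 1
y_cls = np.array([0] * 450 + [1] * 50)
# "balanced" weights give every class the same total weight
weights = compute_sample_weight("balanced", y_cls)

clf = DecisionTreeClassifier(max_depth=3, min_weight_fraction_leaf=0.05,
                             random_state=0)
clf.fit(X_dense, y_cls, sample_weight=weights)
print("weight sum per class:",
      weights[y_cls == 0].sum(), weights[y_cls == 1].sum())
```

With weighted samples, `min_weight_fraction_leaf` bounds each leaf by its share of the total weight rather than by a raw sample count, which is why it is paired with `sample_weight` here.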

In [None]:
# Code you have previously used to load data
import sys
import os
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


In [None]:
# Unzip the Sberbank Russian Housing Market training data
import shutil
import zipfile

dataset_name = "sberbank-russian-housing-market"
working_train_file = "./train.csv"

# Optional cleanup of earlier runs (uncomment as needed)
#shutil.rmtree('/kaggle/working')
#shutil.rmtree('/kaggle/working/pdf')
#os.remove('/kaggle/working/Model*')
#os.remove('/kaggle/working/train.csv')

# Extract the archive into the working directory so the CSV can be read
with zipfile.ZipFile("../input/"+dataset_name+"/train.csv.zip","r") as z:
    z.extractall(".")

# Remove the macOS metadata folder; ignore_errors avoids a failure when it is absent
shutil.rmtree('./__MACOSX', ignore_errors=True)

print("unzipped train.csv.zip")

In [None]:
# Path of the file to read
train_data = pd.read_csv(working_train_file,index_col='id')
# Drop every row that contains a missing value
train_data.dropna(inplace=True)

train_data.describe()

In [None]:
# Create X following tip 1: pick a few features to avoid overfitting
"""
price_doc: sale price (this is the target variable)
id: transaction id
timestamp: date of transaction
full_sq: total area in square meters, including loggias, balconies and other non-residential areas
life_sq: living area in square meters, excluding loggias, balconies and other non-residential areas
floor: for apartments, floor of the building
max_floor: number of floors in the building
material: wall material
build_year: year built
num_room: number of living rooms
kitch_sq: kitchen area
state: apartment condition
product_type: owner-occupier purchase or investment
sub_area: name of the district
"""
# features = ['num_room', 'max_floor', 'full_sq', 'life_sq', 'floor', 'material', 'build_year', 'kitch_sq']
# features = ['num_room', 'max_floor', 'kitch_sq', 'full_sq', 'life_sq', 'floor', 'material']
#
# Maxdepth: 05, Random State: 01, Validation MAE: 2,563,367, 
#    Features: ['num_room', 'max_floor', 'kitch_sq', 'full_sq', 'life_sq', 'floor', 'material', 'sub_area', 'product_type']
#
# Maxdepth: 05, Random State: 01, Validation MAE: 2,527,532, 
#    Features: ['num_room', 'max_floor', 'kitch_sq', 'full_sq', 'life_sq', 'floor', 'material']

features = ['num_room', 'max_floor', 'kitch_sq', 
            'full_sq', 'life_sq', 'floor', 
            'material']
targets = ['price_doc']

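# Create the target object y and the feature matrix X.
# .copy() makes X independent of train_data, so later column assignments act on X itself
y = train_data[targets]
X = train_data[features].copy()
X.head()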

In [None]:
y.describe()

In [None]:
X.describe()

In [None]:
## Add these lines to turn off the warnings
import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import LabelEncoder

# Encode the categorical columns as integers, but only if they were selected as features
if 'sub_area' in X:
    sub_area_encoder = LabelEncoder()
    X['sub_area'] = sub_area_encoder.fit_transform(X['sub_area'].astype(str))
if 'product_type' in X:
    product_type_encoder = LabelEncoder()
    X['product_type'] = product_type_encoder.fit_transform(X['product_type'].astype(str))
X.describe()

In [None]:
def testDecisionTree(train_X, train_y, val_X, val_y, max_depth=3, random_state=1):
    """Fit a DecisionTreeRegressor and return (model, validation predictions, validation MAE)."""
    train_model = DecisionTreeRegressor(max_depth=max_depth, random_state=random_state)
    # Fit Model
    train_model.fit(train_X, train_y)

    # Make validation predictions and calculate mean absolute error
    val_predictions = train_model.predict(val_X)
    # mean_absolute_error expects (y_true, y_pred)
    val_mae = mean_absolute_error(val_y, val_predictions)
    #print("Maxdepth: {depth:2d}, Random State: {random:2d}, Validation MAE: {mae:,.0f}"
    #      .format(depth=max_depth, random=random_state, mae=val_mae))
    return train_model, val_predictions, val_mae

In [None]:
import graphviz
from sklearn import tree

def showDecisionTree(model, features, targets, file_name='DecisionTree'):
    """Export the fitted tree to Graphviz and render it to pdf/<file_name>.pdf."""
    # class_names is only relevant for classifiers; it is kept here as in the original call
    dot_data = tree.export_graphviz(model, out_file=None,
                                    feature_names=features,
                                    class_names=targets,
                                    filled=True, rounded=True,
                                    special_characters=True)
    graph = graphviz.Source(dot_data)
    graph.render("pdf/"+file_name)


In [None]:
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

min_random_state = 0
max_random_state = 5
min_depth = 3
max_depth = len(features)* 2

#shutil.rmtree('/kaggle/working/pdf')

# Following tip 3, start with a shallow tree and adjust it based on the results.
# Start with max_depth=3
# Specify Model
best_train_model = None
best_train_depth = 3
best_train_state = 1
best_train_mae = sys.float_info.max
for depth in range(min_depth,max_depth+1):
    for state in range(min_random_state, max_random_state+1):
        train_model, val_predictions, val_mae = testDecisionTree(train_X, train_y, val_X, val_y, max_depth=depth, random_state=state)
        if best_train_mae > val_mae:
            best_train_mae = val_mae
            best_train_depth = depth
            best_train_state = state
            best_train_model = train_model
        print("Maxdepth: {depth:2d}, Random State: {random:2d}, Validation MAE: {mae:,.0f}, Best MAE: {bmae:,.0f}"
            .format(depth=depth, random=state, mae=val_mae, bmae=best_train_mae))
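
To see why the depth search matters for overfitting, it can help to log the training MAE next to the validation MAE for each depth. The sketch below does this on synthetic data; `make_regression`, the chosen depths, and the names `X_syn`, `y_syn`, and `depth_report` are assumptions for illustration and are not taken from the housing data above.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with 7 features (assumption: not the housing dataset)
X_syn, y_syn = make_regression(n_samples=400, n_features=7,
                               noise=20.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, random_state=1)

depth_report = []
for demo_depth in (1, 3, 6, 12):
    demo_model = DecisionTreeRegressor(max_depth=demo_depth, random_state=0)
    demo_model.fit(X_tr, y_tr)
    # Error on the data the tree was fitted on versus error on held-out data
    train_mae = mean_absolute_error(y_tr, demo_model.predict(X_tr))
    valid_mae = mean_absolute_error(y_va, demo_model.predict(X_va))
    depth_report.append((demo_depth, train_mae, valid_mae))
    print(f"max_depth={demo_depth:2d}  train MAE={train_mae:8.2f}  "
          f"validation MAE={valid_mae:8.2f}")
```

A training MAE that keeps falling while the validation MAE stalls or rises is the usual sign that the tree has grown past a useful depth.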

In [None]:
# Render the best model found above (not the last one trained in the loop)
showDecisionTree(best_train_model,features,targets,file_name="ModelDecisionTree_{depth:02d}_{random:02d}_{mae:.0f}"
              .format(depth=best_train_depth, random=best_train_state, mae=best_train_mae))
print("Maxdepth: {depth:02d}, Random State: {random:02d}, Validation MAE: {mae:,.0f}, Features: {f}"
              .format(depth=best_train_depth, random=best_train_state, mae=best_train_mae, f=features))