# Underfitting and Overfitting

Fine-tune your model for better performance.

## Example

This code defines the get_mae function, which computes the Mean Absolute Error (MAE) of a Decision Tree Regressor model on validation data.

In [1]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

1. The import lines:
    * The first imports the mean_absolute_error function from sklearn.metrics to compute the MAE.
    * The second imports the DecisionTreeRegressor class from sklearn.tree to build a decision tree regression model.

2. The function definition creates get_mae, which takes 5 parameters:

    * max_leaf_nodes: The maximum number of leaf nodes in the decision tree.
    * train_X: The training features.
    * val_X: The validation features.
    * train_y: The training target variable.
    * val_y: The validation target variable.

3. The first line of the function body creates a Decision Tree Regressor whose max_leaf_nodes parameter caps the number of leaf nodes in the tree. Setting random_state=0 makes the fitted tree reproducible between runs.

4. The second line trains the model on the training data train_X and train_y.

5. The third line uses the trained model to predict the target variable for the validation data val_X. The predictions are stored in preds_val.

6. The fourth line computes the MAE between the predicted values preds_val and the actual values val_y using the mean_absolute_error function.

7. The last line returns the computed MAE.
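
To make the metric itself concrete, here is a minimal sketch of what MAE measures: the average of the absolute differences between actual and predicted values. The helper name mean_absolute_error_by_hand and the three prices are made up for illustration; in the notebook, sklearn's mean_absolute_error does this work.

```python
def mean_absolute_error_by_hand(actual, predicted):
    """Hypothetical helper: average of the absolute prediction errors."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

# Made-up prices, for illustration only (not from the Melbourne dataset).
actual_prices = [100_000, 200_000, 300_000]
predicted_prices = [110_000, 180_000, 330_000]

# Absolute errors are 10,000, 20,000 and 30,000, so the mean is 20,000.
mae = mean_absolute_error_by_hand(actual_prices, predicted_prices)
print(mae)  # 20000.0
```

A lower MAE means the predictions are, on average, closer to the true prices, which is why get_mae is used to compare trees of different sizes.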

In [3]:
import pandas as pd

melbourne_file_path = 'https://raw.githubusercontent.com/robitalhazmi/intro-to-machine-learning/main/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [4]:
y = melbourne_data.Price

In [5]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

In [6]:
X = melbourne_data[melbourne_features]

In [7]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
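
The comment above says a fixed random_state gives the same split on every run. The idea can be sketched with Python's standard random module instead of scikit-learn; this is an analogy, not how train_test_split is implemented. The helper split_indices and its val_fraction parameter are hypothetical, with 0.25 chosen to mirror the 25% that train_test_split holds out by default.

```python
import random

def split_indices(n_rows, seed, val_fraction=0.25):
    """Hypothetical helper: shuffle row indices with a seeded generator, then split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle order
    n_val = int(n_rows * val_fraction)
    return indices[n_val:], indices[:n_val]  # (training rows, validation rows)

train_a, val_a = split_indices(20, seed=0)
train_b, val_b = split_indices(20, seed=0)

print(train_a == train_b and val_a == val_b)  # True: the split is reproducible
print(len(train_a), len(val_a))               # 15 5
```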

In [8]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  354662
Max leaf nodes: 50  		 Mean Absolute Error:  266447
Max leaf nodes: 500  		 Mean Absolute Error:  231301
Max leaf nodes: 5000  		 Mean Absolute Error:  248846
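
In the output above, the validation MAE falls as the tree grows from 5 to 500 leaf nodes and then rises again at 5000. That pattern is consistent with underfitting at the small sizes and overfitting at the largest. A minimal sketch of picking the best size, using the MAE values printed above (the names scores and best_tree_size are illustrative):

```python
# Validation MAE copied from the output above, keyed by max_leaf_nodes.
# In the notebook this dict could instead be built by calling
# get_mae(n, train_X, val_X, train_y, val_y) for each candidate n.
scores = {5: 354662, 50: 266447, 500: 231301, 5000: 248846}

# The best tree size is the one with the lowest validation MAE.
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)  # 500
```

Only four sizes were tried, so 500 is the best of these candidates and not necessarily the optimum.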
