# Material Overview

Source: https://www.youtube.com/watch?v=LDRbO9a6XPU

Briefly explain what a decision tree is, in your own words!

A decision tree is a machine learning prediction model that uses a tree-like structure to make decisions. It repeatedly divides the data into smaller groups using rules (splits) on feature values, until each group (a leaf of the tree) carries a class prediction.
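To make the idea of splits and leaves concrete, here is a minimal hand-written sketch of a tree for fruit described by a color and a diameter. The function name `classify_fruit` and the specific rules are illustrative assumptions, not a model learned from data.

```python
# A hand-written "tree": each if-statement is a split, each return is a leaf.
# The rules below are illustrative assumptions, not learned from data.
def classify_fruit(color, diameter):
    if diameter >= 3:            # split on a numeric feature
        if color == "Yellow":    # split on a categorical feature
            return "Lemon"       # leaf
        return "Apple"           # leaf
    return "Grape"               # leaf

print(classify_fruit("Green", 3))   # Apple
print(classify_fruit("Red", 1))     # Grape
```

The rest of this notebook builds the same kind of structure automatically by choosing the splits from the training data.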

# Import Data & Libraries

In [1]:
from __future__ import print_function

# column labels
header = ["color", "diameter", "label"]

# training data
training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],]

# test data
testing_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 4, 'Apple'],
    ['Red', 2, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],]

# Basic Functions

In [2]:
# return the set of unique values found in a column

def unique_vals(rows, col):
    return {row[col] for row in rows}

# example usage
print(unique_vals(training_data, 0))
print(unique_vals(training_data, 1))

{'Yellow', 'Red', 'Green'}
{1, 3}


In [3]:
# count how many rows belong to each class (the label is the last column)

def class_counts(rows):
    counts = {}
    for row in rows:
        label = row[-1]
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    return counts

# example usage
print(class_counts(training_data))

{'Apple': 2, 'Grape': 2, 'Lemon': 1}


In [7]:
# check whether a value is numeric (int or float)

def is_numeric(value):
    return isinstance(value, (int, float))

# example usage
print(is_numeric(3))
print(is_numeric(3.5))
print(is_numeric("Apple"))
print(is_numeric("Red"))

True
True
False
False
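One caveat worth knowing: in Python, `bool` is a subclass of `int`, so `is_numeric(True)` returns `True`. The fruit data has no boolean columns, so this does not matter here. If it ever did, a stricter check could look like the sketch below; the name `is_numeric_strict` is hypothetical.

```python
def is_numeric_strict(value):
    # Hypothetical variant: accept int and float, but reject bool
    return isinstance(value, (int, float)) and not isinstance(value, bool)

print(isinstance(True, int))      # True, because bool subclasses int
print(is_numeric_strict(True))    # False
print(is_numeric_strict(3.5))     # True
```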


In [8]:
# a question used to split the data at a node of the decision tree

class Question:

    # store the column index and the value to compare against
    def __init__(self, column, value):
        self.column = column
        self.value = value

    # check whether an example (row) satisfies the question
    def match(self, example):
        val = example[self.column]
        if is_numeric(val):
            return val >= self.value
        else:
            return val == self.value

    # render the question as a readable string
    def __repr__(self):
        condition = ">=" if is_numeric(self.value) else "=="
        return f"Is {header[self.column]} {condition} {self.value}?"

# example usage
q = Question(0, "Green")
print(q)
print(q.match(training_data[0]))
print(q.match(training_data[1]))

q2 = Question(1, 3)
print(q2)
print(q2.match(training_data[0]))
print(q2.match(training_data[2]))


Is color == Green?
True
False
Is diameter >= 3?
True
False


In [9]:
# split a dataset into two lists according to a question

def partition(rows, question):
    true_rows, false_rows = [], []
    for row in rows:
        if question.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows

# example usage
question = Question(0, "Green")
true_rows, false_rows = partition(training_data, question)

print("Question:", question)
print("True rows:", true_rows)
print("False rows:", false_rows)

Question: Is color == Green?
True rows: [['Green', 3, 'Apple']]
False rows: [['Yellow', 3, 'Apple'], ['Red', 1, 'Grape'], ['Red', 1, 'Grape'], ['Yellow', 3, 'Lemon']]


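**What is Gini impurity?**
<br> Gini impurity measures how mixed the class labels are at a node of the tree. It equals 1 minus the sum of the squared class proportions, so it is 0 when every row at the node has the same label and grows as the labels become more evenly mixed.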

In [10]:
# compute the Gini impurity of a dataset

def gini(rows):
    counts = class_counts(rows)  # count the rows in each class
    impurity = 1
    for lbl in counts:
        prob_of_lbl = counts[lbl] / float(len(rows))
        impurity -= prob_of_lbl**2
    return impurity


# example usage
print("Gini(training_data) =", gini(training_data))

Gini(training_data) = 0.6399999999999999
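As a quick sanity check on the formula, the two boundary cases can be computed directly from label lists. This is a self-contained sketch; `gini_from_labels` is a hypothetical helper that takes labels only, unlike `gini` above, which takes full rows.

```python
from collections import Counter

def gini_from_labels(labels):
    # 1 minus the sum of squared class proportions
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_from_labels(["Grape", "Grape"]))   # pure node: 0.0
print(gini_from_labels(["Apple", "Lemon"]))   # even two-class mix: 0.5
```

For the training data the proportions are 2/5, 2/5 and 1/5, giving 1 - (0.16 + 0.16 + 0.04) = 0.64. The printed value differs from 0.64 only by floating-point rounding.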


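**What is information gain?**
<br> Information gain measures how much a split reduces impurity: the impurity of the parent node minus the weighted average impurity of the two child nodes. In this notebook the impurity is the Gini impurity, and a higher gain means the question separates the classes more effectively.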

In [11]:
# compute the information gain of a split

def info_gain(left, right, current_uncertainty):
    p = float(len(left)) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)


# example usage
# e.g. using the question "color == Green"
question = Question(0, "Green")
true_rows, false_rows = partition(training_data, question)
current_uncertainty = gini(training_data)

print("Information Gain:", info_gain(true_rows, false_rows, current_uncertainty))

Information Gain: 0.1399999999999999
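The value above can be reproduced by hand. The sketch below redoes the calculation with exact fractions so that floating-point rounding does not obscure the result; `gini_exact` and `info_gain_exact` are hypothetical helpers that work on label lists only.

```python
from collections import Counter
from fractions import Fraction

def gini_exact(labels):
    # Gini impurity using exact rational arithmetic
    total = len(labels)
    return 1 - sum(Fraction(n, total) ** 2 for n in Counter(labels).values())

def info_gain_exact(left, right):
    # parent impurity minus the size-weighted impurity of the two children
    parent = left + right
    p = Fraction(len(left), len(parent))
    return gini_exact(parent) - p * gini_exact(left) - (1 - p) * gini_exact(right)

# labels on each side of the split "color == Green" in the training data
true_labels = ["Apple"]
false_labels = ["Apple", "Grape", "Grape", "Lemon"]

gain = info_gain_exact(true_labels, false_labels)
print(gain)          # 7/50
print(float(gain))   # 0.14
```

Step by step: the parent impurity is 16/25, the true side is pure (impurity 0), and the false side has impurity 5/8 with weight 4/5, so the gain is 16/25 - 1/2 = 7/50.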


In [12]:
# find the best question to split the dataset on, i.e. the one with the highest information gain

def find_best_split(rows):
    best_gain = 0
    best_question = None
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1

    for col in range(n_features):
        values = set([row[col] for row in rows])

        for val in values:
            question = Question(col, val)

            # split the dataset
            true_rows, false_rows = partition(rows, question)

            # skip the split if it does not actually divide the dataset
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue

            # compute the information gain
            gain = info_gain(true_rows, false_rows, current_uncertainty)

            if gain >= best_gain:
                best_gain, best_question = gain, question

    return best_gain, best_question

# example usage
best_gain, best_question = find_best_split(training_data)

print("Best Gain:", best_gain)
print("Best Question:", best_question)

Best Gain: 0.37333333333333324
Best Question: Is diameter >= 3?
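To see what the search is choosing between, the following self-contained sketch lists every candidate question on the training data together with its gain. The helper names (`gini_of`, `matches`, `candidate_gains`) are hypothetical, and the gains are rounded to four decimals for readability.

```python
from collections import Counter

rows = [
    ["Green", 3, "Apple"],
    ["Yellow", 3, "Apple"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]
names = ["color", "diameter"]

def gini_of(rows):
    total = len(rows)
    return 1 - sum((n / total) ** 2 for n in Counter(r[-1] for r in rows).values())

def matches(row, col, val):
    # numeric features are compared with >=, categorical ones with ==
    if isinstance(val, (int, float)):
        return row[col] >= val
    return row[col] == val

def candidate_gains(rows):
    parent = gini_of(rows)
    results = []
    for col in range(len(rows[0]) - 1):
        for val in sorted({r[col] for r in rows}):
            t = [r for r in rows if matches(r, col, val)]
            f = [r for r in rows if not matches(r, col, val)]
            if not t or not f:
                continue  # the question does not divide the data
            p = len(t) / len(rows)
            gain = parent - p * gini_of(t) - (1 - p) * gini_of(f)
            results.append((round(gain, 4), names[col], val))
    return sorted(results, reverse=True)

gains = candidate_gains(rows)
for gain, name, val in gains:
    print(name, val, gain)
```

Two questions tie for the highest gain: `diameter >= 3` and `color == Red` produce the same two groups. Because `find_best_split` compares with `>=` and visits the color column first, the later diameter question is the one it keeps.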


# Decision Tree Functions

In [13]:
# a leaf node of the decision tree, which holds the prediction

class Leaf:

    # initialise the leaf by counting how often each class occurs in its rows
    def __init__(self, rows):
        self.predictions = class_counts(rows)

# example usage
leaf = Leaf(training_data)
print("Leaf prediction:", leaf.predictions)

Leaf prediction: {'Apple': 2, 'Grape': 2, 'Lemon': 1}


In [14]:
# a decision node, which holds a question and its two branches

class Decision_Node:

    # initialise the node with a question, a true branch, and a false branch
    def __init__(self, question, true_branch, false_branch):
        self.question = question
        self.true_branch = true_branch
        self.false_branch = false_branch

# example usage
q = Question(0, "Green")
true_leaf = Leaf([['Green', 3, 'Apple']])
false_leaf = Leaf([
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
])

node = Decision_Node(q, true_leaf, false_leaf)

print("Question at node:", node.question)
print("True branch (prediction):", node.true_branch.predictions)
print("False branch (prediction):", node.false_branch.predictions)

Question at node: Is color == Green?
True branch (prediction): {'Apple': 1}
False branch (prediction): {'Apple': 1, 'Grape': 2, 'Lemon': 1}


In [15]:
# build the decision tree recursively

def build_tree(rows):
    # find the best question to ask
    gain, question = find_best_split(rows)

    # if no question yields any information gain, return a leaf
    if gain == 0:
        return Leaf(rows)

    # split the dataset
    true_rows, false_rows = partition(rows, question)

    # recursively build the subtrees
    true_branch = build_tree(true_rows)
    false_branch = build_tree(false_rows)

    # return a decision node
    return Decision_Node(question, true_branch, false_branch)

# example usage
tree = build_tree(training_data)
print(tree.question)

Is diameter >= 3?


In [16]:
# print the structure of the decision tree recursively as text

def print_tree(node, spacing=""):

    # base case: we have reached a leaf
    if isinstance(node, Leaf):
        print(spacing + "Predict", node.predictions)
        return

    # print the question at the current node
    print(spacing + str(node.question))

    # print the true branch recursively
    print(spacing + '--> True:')
    print_tree(node.true_branch, spacing + "  ")

    # print the false branch recursively
    print(spacing + '--> False:')
    print_tree(node.false_branch, spacing + "  ")

# example usage
tree = build_tree(training_data)
print_tree(tree)

Is diameter >= 3?
--> True:
  Is color == Yellow?
  --> True:
    Predict {'Apple': 1, 'Lemon': 1}
  --> False:
    Predict {'Apple': 1}
--> False:
  Predict {'Grape': 2}
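One leaf in the printed tree is still mixed (`{'Apple': 1, 'Lemon': 1}`). That happens because two training rows have identical features but different labels, so no question on color or diameter can separate them. A short self-contained check, using a hypothetical `labels_by_features` grouping:

```python
from collections import defaultdict

rows = [
    ["Green", 3, "Apple"],
    ["Yellow", 3, "Apple"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]

# group the labels seen for each distinct feature combination
labels_by_features = defaultdict(set)
for *features, label in rows:
    labels_by_features[tuple(features)].add(label)

# keep only the feature combinations that carry more than one label
conflicts = {f: sorted(l) for f, l in labels_by_features.items() if len(l) > 1}
print(conflicts)   # {('Yellow', 3): ['Apple', 'Lemon']}
```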


In [17]:
# classify a single row of data using the decision tree

def classify(row, node):

    # base case: we have reached a leaf
    if isinstance(node, Leaf):
        return node.predictions

    # if the row matches the question, follow the true branch; otherwise the false branch
    if node.question.match(row):
        return classify(row, node.true_branch)
    else:
        return classify(row, node.false_branch)

# example usage
tree = build_tree(training_data)

# classify the test data
for row in testing_data:
    print("Data:", row)
    print("Prediction:", classify(row, tree))
    print()

Data: ['Green', 3, 'Apple']
Prediction: {'Apple': 1}

Data: ['Yellow', 4, 'Apple']
Prediction: {'Apple': 1, 'Lemon': 1}

Data: ['Red', 2, 'Grape']
Prediction: {'Grape': 2}

Data: ['Red', 1, 'Grape']
Prediction: {'Grape': 2}

Data: ['Yellow', 3, 'Lemon']
Prediction: {'Apple': 1, 'Lemon': 1}


In [18]:
# format the class counts of a leaf as percentages

def print_leaf(counts):
    total = sum(counts.values())
    probs = {}
    for lbl in counts:
        probs[lbl] = str(int(counts[lbl] / total * 100)) + "%"
    return probs

# example usage
tree = build_tree(training_data)

for row in testing_data:
    prediction = classify(row, tree)
    print("Data:", row)
    print("Prediction:", print_leaf(prediction))
    print()

Data: ['Green', 3, 'Apple']
Prediction: {'Apple': '100%'}

Data: ['Yellow', 4, 'Apple']
Prediction: {'Apple': '50%', 'Lemon': '50%'}

Data: ['Red', 2, 'Grape']
Prediction: {'Grape': '100%'}

Data: ['Red', 1, 'Grape']
Prediction: {'Grape': '100%'}

Data: ['Yellow', 3, 'Lemon']
Prediction: {'Apple': '50%', 'Lemon': '50%'}
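Note that `print_leaf` truncates with `int()`, so a leaf with three equally frequent classes would show 33% for each, adding up to 99%. If that matters, one possible alternative is to format with a fixed number of decimals. The function `leaf_percentages` and its `digits` parameter are hypothetical names for this sketch.

```python
def leaf_percentages(counts, digits=1):
    # format each class share with a fixed number of decimals instead of truncating
    total = sum(counts.values())
    return {lbl: f"{100 * n / total:.{digits}f}%" for lbl, n in counts.items()}

print(leaf_percentages({"Apple": 1, "Grape": 1, "Lemon": 1}))
# {'Apple': '33.3%', 'Grape': '33.3%', 'Lemon': '33.3%'}
```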



# Predict Using Decision Tree

In [19]:
# test the decision tree on the test data and compare each prediction with the actual label
for row in testing_data:
    prediction = classify(row, tree)
    print("Data:", row[:-1])
    print("Actual label:", row[-1])
    print("Prediction:", print_leaf(prediction))
    print()

Data: ['Green', 3]
Actual label: Apple
Prediction: {'Apple': '100%'}

Data: ['Yellow', 4]
Actual label: Apple
Prediction: {'Apple': '50%', 'Lemon': '50%'}

Data: ['Red', 2]
Actual label: Grape
Prediction: {'Grape': '100%'}

Data: ['Red', 1]
Actual label: Grape
Prediction: {'Grape': '100%'}

Data: ['Yellow', 3]
Actual label: Lemon
Prediction: {'Apple': '50%', 'Lemon': '50%'}
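The comparison above can be condensed into a single accuracy number by picking the most frequent class in each leaf. The sketch below hard-codes the leaf counts from the output above so it runs on its own. `top_label` is a hypothetical helper, and its tie-breaking rule (the first label in the dict wins) is an assumption, not something the notebook defines.

```python
# (actual label, leaf counts) pairs copied from the output above
results = [
    ("Apple", {"Apple": 1}),
    ("Apple", {"Apple": 1, "Lemon": 1}),
    ("Grape", {"Grape": 2}),
    ("Grape", {"Grape": 2}),
    ("Lemon", {"Apple": 1, "Lemon": 1}),
]

def top_label(counts):
    # Assumption: on a tie, the label that appears first in the dict wins
    return max(counts, key=counts.get)

correct = sum(top_label(counts) == actual for actual, counts in results)
accuracy = correct / len(results)
print(accuracy)   # 0.8
```

With this tie-breaking rule both `Yellow` rows are predicted as `Apple`, so the `Lemon` row is the single miss. A different rule for ties could change which of the two is counted as wrong.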

