# Notebook 01: Introduction to Decision Trees with Play Tennis

## 🎯 What is This Notebook About?

Welcome to your first hands-on experience with AI! This notebook introduces you to **decision trees** - one of the most intuitive machine learning algorithms - through a fun, relatable example: predicting whether to play tennis based on weather conditions.

**What we'll do:**
1. **Explore the problem** - Understand what we're trying to predict and why it matters
2. **Load and examine data** - See what information we have to work with
3. **Prepare the data** - Get it ready for the machine learning algorithm
4. **Train a decision tree** - Teach the computer to make predictions
5. **Evaluate the model** - See how well it performs
6. **Visualize the tree** - Understand how the model makes decisions

**Why this matters:**
- Decision trees are the building blocks of more advanced techniques such as Random Forests and Gradient Boosting
- This example shows how AI can learn patterns from data
- You'll understand the basic workflow used in most supervised machine learning projects
- This knowledge applies directly to IT operations (predicting incidents, classifying problems, routing tickets, etc.)

**How it connects:**
- This is Module 1 of the workshop - your introduction to AI concepts
- The skills you learn here will be used throughout the workshop
- Next modules will build on these foundations to solve real IT operations problems

---

## 📚 Key Concepts Explained

### What is Machine Learning?

**Machine Learning (ML)** is a way for computers to learn patterns from data without being explicitly programmed for every scenario.

**Think of it like:** Teaching a child to recognize cats by showing them many cat pictures, rather than describing every possible cat feature. The child learns the pattern.

**Why it matters:** In IT operations, we can't write rules for every possible incident. ML learns patterns from historical data to help predict and classify new situations.

### What is a Decision Tree?

A **decision tree** is a simple, visual way to make decisions by asking a series of yes/no questions.

**Think of it like:** A flowchart or a game of "20 Questions" - you ask questions that narrow down the possibilities until you reach an answer.

**Example:**
- "Is it sunny?" → Yes → "Is it windy?" → No → "Play tennis!"
- "Is it sunny?" → No → "Is it raining?" → Yes → "Don't play tennis"

**Why it matters:** Decision trees:
- Are easy to understand (you can see exactly how decisions are made)
- Don't require complex math to use
- Work well with categorical data (like weather conditions)
- Form the basis for more advanced algorithms (Random Forests, Gradient Boosting)
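To make the flowchart idea concrete, here is a minimal hand-written sketch of the two example paths above as nested `if` statements. The function name `should_play_tennis` and the fallback rules are illustrative assumptions, not the tree the model learns later in this notebook.

```python
def should_play_tennis(outlook: str, windy: bool) -> str:
    """Hand-written toy rules mirroring the example questions (hypothetical)."""
    if outlook == "Sunny":             # "Is it sunny?" -> Yes
        if not windy:                  # "Is it windy?" -> No
            return "Play tennis!"
        return "Don't play tennis"     # assumption: windy sunny days are skipped
    if outlook == "Rainy":             # "Is it raining?" -> Yes
        return "Don't play tennis"
    return "Play tennis!"              # assumption: any other outlook is fine


print(should_play_tennis("Sunny", windy=False))
print(should_play_tennis("Rainy", windy=True))
```

A trained decision tree has the same shape, but the algorithm chooses the questions and their order from the data instead of a person writing them by hand.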

### What is OpenShift AI?

**OpenShift AI** is Red Hat's platform for building, training, and deploying AI/ML models in enterprise environments.

**Think of it like:** A complete workshop for AI projects - it provides:
- **Jupyter Notebooks** (like this one) for experimentation
- **Compute resources** for training models
- **Model serving** capabilities for production
- **Integration** with your existing IT infrastructure

**Why we're using it:** OpenShift AI gives you a professional, enterprise-ready environment to learn and experiment with AI, just like you'd use in real IT operations.

---


## 🎯 Learning Objectives

By the end of this notebook, you will:
- ✅ Understand what machine learning is and why it matters
- ✅ Know how decision trees work and when to use them
- ✅ Be able to prepare data for machine learning
- ✅ Know how to train and evaluate a simple model
- ✅ Understand how to interpret a decision tree visualization
- ✅ See how AI concepts apply to IT operations

---


## ⚠️ Prerequisites

Before starting, make sure you have:
- [ ] Access to an OpenShift AI environment (or Jupyter Notebook)
- [ ] Python 3.8+ installed
- [ ] Basic understanding of Python (variables, lists, functions)
- [ ] Familiarity with pandas DataFrames (we'll explain as we go)

**Don't worry if:** You're new to machine learning - this notebook is designed for beginners! We'll explain everything step by step.

---


## 📋 Step-by-Step Guide

Let's begin! We'll work through this notebook step by step, explaining what we're doing and why at each stage.

---


In [None]:
# six provides StringIO, used later when rendering the tree
%pip install six

### Install pydotplus

This step ensures the `pydotplus` library is installed, which is needed to generate the graphical representation of the decision tree.

In [None]:
%pip install pydotplus

### Load Libraries for Data Analysis and Machine Learning

We import `numpy` and `pandas` for data manipulation and analysis, and `sklearn` for machine learning functions, including the `metrics` module for evaluating model performance.

**What we import:**
- `pandas` - For working with data tables (like Excel spreadsheets)
- `numpy` - For numerical operations
- `sklearn` - Machine learning library (scikit-learn)
- Visualization tools (`pydotplus`, `StringIO` from `six`, and `Image` from IPython) - For displaying our decision tree

**Why these libraries:** Each serves a specific purpose in our workflow. We'll use pandas to load and explore data, sklearn to build our model, and visualization tools to see how the model makes decisions.

---


In [None]:
# Import our toolkit
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
import pydotplus

print("🎯 Toolkit loaded! Ready to build our decision tree.")

### Step 2: Loading Our Data

**What we're doing:** Loading the Play Tennis dataset - a classic ML example that's perfect for learning.

**Why this dataset:** It's small enough to understand quickly, but demonstrates all the key concepts. Plus, the "play tennis or not?" question is intuitive - you can almost guess the answer just by looking at the weather!

**What to expect:** We'll see weather conditions (outlook, temperature, humidity, wind) and whether tennis was played on each day.

**What we see:**
- Each row represents one day
- **Features** (what we use to predict): `Outlook`, `Temprature`, `Humidity`, `Wind` (the code uses the spelling `Temprature` for the temperature column, so keep it as-is)
- **Target** (what we want to predict): `Play_Tennis` (Yes or No)

**Key observation:** This is a small, simple dataset - perfect for learning! In real-world scenarios, you'd work with much larger datasets, but the concepts are the same.

---


In [None]:
# Load the dataset
df = pd.read_csv("../data/play_tennis.csv")

# Define our features (what we'll use to make predictions)
feature_cols = ['Outlook', 'Temprature', 'Humidity', 'Wind']

print(f"📊 Loaded {len(df)} days of tennis data")
print(f"Features: {', '.join(feature_cols)}")
print(f"Target: Play_Tennis\n")
df

Let's take a quick look at what we're working with. Notice the patterns? In the classic version of this dataset, overcast days always end in tennis, while sunny days with high humidity and rainy days with strong wind do not. Our decision tree will learn patterns like these automatically!

In [None]:
# Quick stats
print(f"Dataset: {df.shape[0]} days, {df.shape[1]} columns")
print(f"Played tennis: {df['Play_Tennis'].value_counts().get('Yes', 0)} days")
print(f"Didn't play: {df['Play_Tennis'].value_counts().get('No', 0)} days")

In [None]:
# First few days
df.head()

In [None]:
# Last few days
df.tail()

**Key observation:** Our data is categorical (text like "Sunny", "Hot") but ML algorithms need numbers. Time to transform it!

---


In [None]:
# Data summary
df.describe()

### Step 3: Preparing Data for Machine Learning

**What we're doing:** Converting categorical text to numbers, then splitting into features (X) and target (y), and finally creating train/test sets.

**Why:** ML algorithms are math-based - they need numbers, not text. We also need separate train/test sets to see if our model actually learned patterns (not just memorized the training data).

**The fun part:** This is where we transform messy real-world data into something the algorithm can understand. It's like translating between human language and computer language!

In [None]:
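# Convert categorical text to numbers using LabelEncoder
# Each column's categories are sorted alphabetically and numbered from 0,
# so in a column holding Overcast/Rainy/Sunny, "Rainy" becomes 1 and "Sunny" becomes 2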
label_encoder = preprocessing.LabelEncoder()
df_encoded = df.apply(label_encoder.fit_transform)

print("✨ Data encoded! Text categories are now numbers:")
print(f"Original shape: {df.shape}")
print(f"Encoded shape: {df_encoded.shape}\n")
df_encoded
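The cell above reuses one encoder for every column, so the text-to-number mapping is not kept afterwards. As a hedged sketch, one way to keep it is to fit a separate `LabelEncoder` per column and read its `classes_` attribute. The `toy` table and the `encoders` dictionary below are illustrative assumptions, not the workshop CSV.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in table (hypothetical rows, not the workshop CSV)
toy = pd.DataFrame({
    "Outlook": ["Sunny", "Overcast", "Rainy", "Sunny"],
    "Wind": ["Weak", "Strong", "Weak", "Strong"],
})

encoders = {}              # one fitted encoder per column
toy_encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    toy_encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the categories in sorted order; the position is the code
outlook_mapping = {
    str(label): code for code, label in enumerate(encoders["Outlook"].classes_)
}
print(outlook_mapping)
print(toy_encoded)
```

Keeping the mapping makes the tree plot easier to read later, because you can translate a threshold on an encoded number back to the original category.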

**What happens here:** The cell above turns the text categories into numbers. The next cells split the data into features and target, then into training and testing sets. The model will learn from the training set, then we'll test it on data it's never seen before.

**Why this matters:** Testing on unseen data tells us if the model actually learned patterns or just memorized the training examples. This is crucial for real-world applications!
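With very small datasets, a random split can leave the test set with few examples of one class. The sketch below shows the split sizes for a hypothetical set of 14 samples and uses the `stratify` argument of `train_test_split` to keep both classes represented. The sample count and labels are assumptions for illustration, and `stratify` is an addition that this notebook's own split does not use.

```python
from sklearn.model_selection import train_test_split

# 14 hypothetical samples: 9 labelled 1 ("Yes") and 5 labelled 0 ("No")
X_toy = [[i] for i in range(14)]
y_toy = [1] * 9 + [0] * 5

# stratify=y_toy asks for roughly the same Yes/No ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

print(f"Training samples: {len(X_tr)}")
print(f"Testing samples: {len(X_te)}")
print(f"'Yes' labels in test set: {sum(y_te)}")
```

Because the sizes are rounded to whole samples, a 30% test split of 14 rows is not exactly 30%.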

---


In [None]:
# Split into features (X) and target (y)
# X = what we use to predict, y = what we want to predict
X = df_encoded[feature_cols]
y = df_encoded['Play_Tennis']

print(f"Features (X): {X.shape[1]} columns - {', '.join(feature_cols)}")
print(f"Target (y): {y.name} - {y.nunique()} unique values")
print(f"Total samples: {len(X)}")

In [None]:
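# Split into training (about 70%) and testing (about 30%) sets
# With a small dataset the sizes are rounded, so the printed ratio may not be exactly 70%
# We train on one set, test on another to see if it actually learned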
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

print(f"📚 Training set: {len(X_train)} samples")
print(f"🧪 Testing set: {len(X_test)} samples")
print(f"Training ratio: {len(X_train)/len(X):.0%}")

### Step 4: Training Our Decision Tree

**What we're doing:** Teaching our decision tree to learn patterns from the training data.

**Why entropy:** We're using "entropy" as our splitting criterion - it measures how "mixed" a group is. The algorithm tries to create pure groups (all "Yes" or all "No") by asking the best questions first.
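To show what "mixed" means in numbers, here is a small plain-Python sketch of entropy and of the information gain a split achieves. The helper names `entropy` and `information_gain` and the label lists are illustrative assumptions. scikit-learn computes this internally, so you never need to call anything like it yourself.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group, 1 for a 50/50 binary mix."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


mixed = ["Yes", "No", "Yes", "No"]               # hypothetical labels
pure_split = [["Yes", "Yes"], ["No", "No"]]      # a question that separates perfectly
useless_split = [["Yes", "No"], ["Yes", "No"]]   # a question that separates nothing

print(entropy(mixed))
print(information_gain(mixed, pure_split))
print(information_gain(mixed, useless_split))
```

At each node the algorithm prefers the question with the highest information gain, which is why the most informative question ends up at the root.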

**The magic:** Watch how the algorithm automatically figures out which questions to ask and in what order. It's learning the decision-making process!

**What to expect:** After training, we'll have a model that can predict "Play Tennis?" for any weather condition.

**What comes next:** After training, the model makes predictions on the test set. Then we'll see how accurate they are.

---


In [None]:
# Create and train our decision tree
# entropy = measure of "mixedness" - algorithm tries to create pure groups
classifier = DecisionTreeClassifier(criterion="entropy", random_state=42)
classifier.fit(X_train, y_train)

print("🎓 Model trained! The decision tree has learned the patterns.")
print(f"Tree depth: {classifier.get_depth()}")
print(f"Number of leaves: {classifier.get_n_leaves()}")

In [None]:
# Make predictions on the test set (data the model hasn't seen)
y_pred = classifier.predict(X_test)

print(f"✅ Made predictions for {len(y_pred)} test samples")
print(f"Predictions: {y_pred}")
print(f"Actual values: {y_test.values}")
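Predicting for a brand-new day requires encoding it the same way as the training data. The sketch below is one possible approach under stated assumptions: it swaps in `OrdinalEncoder` (instead of the `LabelEncoder` used in this notebook) because it encodes several feature columns at once and can transform new rows consistently. The mini-dataset and variable names are hypothetical.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical mini-dataset: [Outlook, Wind] per day (not the workshop CSV)
days = [
    ["Sunny", "Weak"], ["Sunny", "Strong"], ["Overcast", "Weak"],
    ["Rainy", "Weak"], ["Rainy", "Strong"], ["Overcast", "Strong"],
]
played = ["No", "No", "Yes", "Yes", "No", "Yes"]

# Fit the encoder once, then reuse it for training data and new rows
encoder = OrdinalEncoder()
X_days = encoder.fit_transform(days)

model = DecisionTreeClassifier(criterion="entropy", random_state=42)
model.fit(X_days, played)

# Encode the new day with the SAME fitted encoder before predicting
new_day = [["Overcast", "Weak"]]
prediction = model.predict(encoder.transform(new_day))[0]
print(prediction)
```

The key design point is that the encoder is fitted once and reused, so "Overcast" always maps to the same number.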

### Step 5: Evaluating Our Model

**What we're doing:** Measuring how well our model performs on unseen data.

**Why accuracy matters:** Accuracy tells us what percentage of predictions were correct. But we'll also look at a confusion matrix to see where it makes mistakes.

**The reality check:** This is where we find out if our model actually learned useful patterns or just got lucky!

Let's compare predictions vs actual values side-by-side:

In [None]:
# Compare predictions vs actual
comparison = pd.DataFrame({
    'Actual': y_test.values,
    'Predicted': y_pred,
    'Correct': y_test.values == y_pred
})
comparison
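Accuracy is simply the share of rows where the prediction matches the actual value. The sketch below computes it by hand and with scikit-learn's `accuracy_score` on hypothetical labels. The numbers are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score

# Hypothetical encoded labels (1 = "Yes", 0 = "No"), not real results
actual = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 0]

# Manual accuracy: matching predictions divided by total predictions
manual_accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
library_accuracy = accuracy_score(actual, predicted)

print(f"Manual accuracy: {manual_accuracy:.0%}")
print(f"accuracy_score: {library_accuracy:.0%}")
```

In this notebook the same call would be `accuracy_score(y_test, y_pred)`, using the variables defined in the cells above.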

Now let's get a deeper look at performance with a confusion matrix and classification report:

**What these metrics tell us:**
- **Precision:** When we predict "Yes", how often are we right?
- **Recall:** Of all actual "Yes" cases, how many did we catch?
- **F1-score:** Balance between precision and recall

**Key insight:** A good model needs both high precision AND high recall - it should be right when it predicts, and catch most of the actual cases.
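These three metrics can be derived directly from the four counts in a binary confusion matrix. The sketch below works through that arithmetic on hypothetical labels and checks it against scikit-learn's own functions. The label lists are assumptions for illustration only.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical encoded labels (1 = "Yes", 0 = "No"), not real results
actual = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

precision = tp / (tp + fp)   # of the predicted "Yes", how many were right
recall = tp / (tp + fn)      # of the actual "Yes", how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The `classification_report` used in this notebook reports the same quantities for each class.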

---


In [None]:
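# Re-imported so this cell also runs on its own
from sklearn.metrics import classification_report, confusion_matrix

# Confusion matrix: rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
# Per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred))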

### Step 6: Visualizing the Decision Tree

**What we're doing:** Creating a visual representation of how our decision tree makes decisions.

**Why this is cool:** You can actually SEE how the model thinks! Each node asks a question, each branch is an answer, and each leaf is a final prediction.

**How to read it:** Start at the top (root), follow the branches based on conditions, and end at a leaf with the prediction. It's like a flowchart that the algorithm created automatically!

**Try this:** Pick a weather condition and trace through the tree - can you predict what it will say?

- Then compare a few of the predicted values with the actual values to see how accurate the model was.

**How to read this tree:**
1. **Root node (top):** The first question the algorithm asks
2. **Branches:** In scikit-learn's plots, the left branch is taken when the node's condition is true and the right branch when it is false
3. **Internal nodes:** More questions based on previous answers
4. **Leaves (bottom):** Final predictions (No/Yes)

**Example walkthrough:**
- Start at root: "Outlook <= 0.5?" (the threshold compares the encoded number, not the original text)
- If Yes → "Humidity <= 0.5?" → Continue following...
- Eventually reach a leaf with the prediction

**The beauty:** The algorithm automatically figured out which questions to ask first to make the best predictions!

---


In [None]:
from sklearn.tree import export_graphviz
from six import StringIO
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(
    classifier,
    out_file=dot_data,
    filled=True,
    rounded=True,
    special_characters=True,
    feature_names=feature_cols,
    class_names=["No", "Yes"],  # listed in encoded order: "No" is 0, "Yes" is 1
)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
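If the Graphviz rendering is not available in your environment, a plain-text view of the same rules can be a useful fallback. The sketch below uses scikit-learn's `export_text` on a small hypothetical dataset. The data and the two feature names are assumptions for illustration, not the tree trained above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded data: columns are Outlook and Wind codes
X_small = [[2, 1], [2, 0], [0, 1], [1, 1], [1, 0], [0, 0]]
y_small = [0, 0, 1, 1, 0, 1]   # 1 = "Yes", 0 = "No"

tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_small, y_small)

# One indented line per node: the condition, then the predicted class at each leaf
rules = export_text(tree, feature_names=["Outlook", "Wind"])
print(rules)
```

For the notebook's own model, the equivalent call would be `export_text(classifier, feature_names=feature_cols)`.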

## 💼 How This Applies to IT Operations

The same decision tree approach can solve real IT problems:

**Incident Classification:**
- "Is it affecting production?" → "Is it user-reported?" → "Priority: Critical"
- Automatically route tickets to the right team

**Failure Prediction:**
- "Is CPU usage high?" → "Is memory usage high?" → "Risk: High"
- Predict which systems are likely to fail

**Change Risk Assessment:**
- "Is it a production system?" → "Is it during business hours?" → "Risk Level: Medium"
- Evaluate the risk of deployments automatically

**The pattern is the same:** Ask questions, narrow down possibilities, make decisions based on patterns learned from historical data. The difference? Instead of weather, you're using system metrics, logs, and incident data.
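As a rough illustration of swapping weather for incident data, the sketch below trains the same kind of tree on a handful of made-up tickets. The feature flags, priority labels, and variable names are all hypothetical. A real model would need far more historical incidents and carefully chosen features.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical incident history: [affects_production, user_reported] as 0/1 flags
incidents = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
priority = ["Critical", "High", "Medium", "Low", "Critical", "Low"]

triage = DecisionTreeClassifier(criterion="entropy", random_state=42)
triage.fit(incidents, priority)

# A new ticket that affects production and was reported by a user
new_ticket = [[1, 1]]
predicted_priority = triage.predict(new_ticket)[0]
print(predicted_priority)
```

The workflow is unchanged from the tennis example: encode the inputs as numbers, fit the tree, then predict for new cases.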

---


## 🎓 Key Takeaways

- **Machine Learning** learns patterns from data without explicit programming
- **Decision Trees** make decisions by asking a series of questions - you can actually see how they think!
- **Data preparation** is crucial - ML algorithms need data in the right format (numbers, not text)
- **Train/test split** ensures we test on unseen data to verify the model actually learned
- **Evaluation metrics** (accuracy, precision, recall) tell us how well the model performs
- **Visualization** helps us understand and trust the model's decision-making process
- **These concepts** apply directly to IT operations - same algorithms, different data

---


## 🔗 Next Steps

- **Next module:** `2-reducing-mttd/` - Learn how to evaluate and improve AI outputs using multiple evaluation methods
- **Practice:** Try modifying the dataset or parameters - see how it affects the tree!
- **Explore:** Can you trace through the tree for different weather conditions?
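One parameter worth experimenting with is `max_depth`, which limits how many questions the tree may ask in a row. The sketch below compares a depth-limited tree with an unrestricted one on a small hypothetical dataset. The data is made up for illustration and scores on it say nothing about the workshop dataset.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoded data: columns are Outlook and Wind codes
X_small = [[2, 1], [2, 0], [0, 1], [1, 1], [1, 0], [0, 0]]
y_small = [0, 0, 1, 1, 0, 1]   # 1 = "Yes", 0 = "No"

# A "stump" that may ask only one question
shallow = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=42)
shallow.fit(X_small, y_small)

# An unrestricted tree that keeps splitting until every leaf is pure
full = DecisionTreeClassifier(criterion="entropy", random_state=42)
full.fit(X_small, y_small)

for name, model in [("max_depth=1", shallow), ("unrestricted", full)]:
    print(name, "depth:", model.get_depth(), "training accuracy:", model.score(X_small, y_small))
```

A deeper tree fits the training data more closely, but that does not guarantee better results on unseen data, which is why the test set matters.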

**Related Resources:**
- [Scikit-learn Decision Trees Documentation](https://scikit-learn.org/stable/modules/tree.html)
- [OpenShift AI Documentation](https://access.redhat.com/documentation/en-us/red_hat_openshift_ai)
- [Pandas Documentation](https://pandas.pydata.org/docs/)

---

**🎉 Congratulations!** You've built your first machine learning model. The concepts you learned here form the foundation for all the advanced AI techniques we'll explore in the next modules.
