# Welcome to the ML-Entry Workshop! 🚀

## What We’ll Cover:
- Introduction
- How does a Data Science project look?
- How do we choose a model and train it?
- Hands-on experience: Building Tinder! Create a dataset, and train a model to solve a real-world problem.
- Theory key concepts: Hypothesis set, Train vs Validation/Test, loss function, Backpropagation, CNN...
- Take it further: recommended steps if you want to deepen your skills and knowledge

By the end, you'll understand the **core steps** in building an ML model and how it applies to problems like **finding matches in dating apps**.

---

# 1. Introduction to Machine Learning

## Types of Machine Learning  

Machine learning is broadly categorized into three types:  

- **Supervised Learning** – Learning from labeled examples (e.g., spam detection, image classification).  
- **Unsupervised Learning** – Finding patterns in **unlabeled** data (e.g., clustering, anomaly detection).  
- **Reinforcement Learning** – Learning by interacting with an environment (e.g., game-playing agents, robotics).  

Today, we will **focus solely on supervised learning**, the most widely used ML approach in industry.

Throughout this workshop, "ML" means supervised learning, and vice versa.

---

## When Do We Use Machine Learning?

### Supervised learning ~ Test-driven development

Imagine you need to write a function, but instead of defining its logic, you only have a **set of test cases**.

Machine Learning is **like solving a problem by passing tests, without writing explicit rules**:

✅ A pattern exists.  
❌ But we don’t know how to define it with hardcoded logic.  
✅ We have examples (data) showing expected results (Tests).

Instead of manually coding the transformation logic, we **show examples, and a model "learns" from them**, just like refining an implementation until all tests pass.

![Machine Learning Types](data/presentation_files/show_you_the_door.png)


## Core Components of an ML Model

### Defining the Problem  
In machine learning, we aim to approximate an _UNKNOWN_ **Target function**:

$$f: X \rightarrow Y$$

Where:
- **X** is the input space (features, vectors, images, text, ...).
- **Y** is the output space (labels or predictions: {0,1}, [-1,1], {cats,dogs}, {documents_labels}).
- The goal is to learn a function that approximates **f**, i.e. one that best maps inputs to outputs.

---

#### 1️⃣ Dataset D (Examples, Test Cases)  
Our training data consists of **labeled examples**:

$$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$$

- Each **x** is an input (e.g., an image, text, or structured data).
- Each **y** is the correct output (e.g., a category label).
- The dataset acts like a **test suite** for learning.

---

#### 2️⃣ Hypothesis Class (Possible Implementations H)  
- The **hypothesis class** defines the set of functions the model can learn.  
- This may help focus our algorithm in practice. In theory it doesn't limit the model.

---

#### 3️⃣ Learning Algorithm A (Process of Finding f)  
- The **learning algorithm** _searches for the best_ function **h ∈ H**, where **H** is the **hypothesis set**.
- _searches for the best_ = **optimizes parameters** to minimize errors. This process is also called **empirical risk minimization**:
  $$ 
\min_{h \in H} \sum_{i=1}^{n} L(h(x_i), y_i)
$$
- Think of this as an **automated debugging and optimization** process—like refining an implementation until all test cases pass.
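
As a minimal sketch of this minimization, assume a tiny hypothetical hypothesis set of threshold rules and a 0-1 loss. Both are illustrative choices, not part of the workshop code. The "learning algorithm" here is simply a brute-force search for the threshold with the lowest total loss on the dataset:

```python
# Toy dataset D: (x, y) pairs, where y = 1 when x is "large" (hypothetical data)
dataset = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (9, 1)]

def make_hypothesis(threshold):
    """Return h(x) = 1 if x >= threshold else 0."""
    return lambda x: 1 if x >= threshold else 0

def zero_one_loss(prediction, target):
    """L(h(x), y): 0 for a correct prediction, 1 for a wrong one."""
    return 0 if prediction == target else 1

def total_loss(h, data):
    """Sum of L(h(x_i), y_i) over the dataset."""
    return sum(zero_one_loss(h(x), y) for x, y in data)

# Hypothesis set H: one threshold rule per integer from 0 to 10
candidate_thresholds = range(0, 11)

# Learning algorithm A: try every h in H and keep the one with the lowest loss
best_threshold = min(
    candidate_thresholds,
    key=lambda t: total_loss(make_hypothesis(t), dataset),
)
best_loss = total_loss(make_hypothesis(best_threshold), dataset)
print(f"Best threshold: {best_threshold}, total loss: {best_loss}")
```

Real models have far too many candidate functions to enumerate, which is why training relies on smarter search procedures than trying every option.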

---

<div style="text-align: center;">
  <img src="data/presentation_files/learning_paradigm.png" alt="ML Components" width="600" height="400">
</div>

- Different hypothesis sets (e.g., linear regression, decision trees, neural networks) provide different ML models.

In [2]:
# TODO: Enter an example on linear regression
# (x1,y1), .... ()
# H = aX+b -> infinite set
# Learning algorithm = find a and b that explains the dataset best 

# h1
# h2
# h10
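
One possible way to fill in this example, as a minimal sketch: the data points below are made up, the three named hypotheses are arbitrary picks, and the closed-form least-squares formulas stand in for the learning algorithm.

```python
# Hypothetical dataset: points that lie roughly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

def squared_error(a, b):
    """Total squared loss of the hypothesis h(x) = a*x + b on the dataset."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

# H = {a*x + b}: an infinite set, one hypothesis per (a, b) pair.
# Compare a few hand-picked hypotheses first:
for name, (a, b) in {"h1": (0.0, 5.0), "h2": (1.0, 1.0), "h10": (2.0, 1.0)}.items():
    print(f"{name}: a={a}, b={b}, loss={squared_error(a, b):.2f}")

# Learning algorithm: closed-form least squares picks the best (a, b) in H
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a_best = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b_best = mean_y - a_best * mean_x
print(f"best: a={a_best:.2f}, b={b_best:.2f}, loss={squared_error(a_best, b_best):.2f}")
```

The hand-picked hypotheses all belong to the same set H; the learning algorithm only decides which member of H explains the data best.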

# 2. A Data Science Project - Our own Tinder

### Starts with the Problem  

### Key Questions to Ask:  
🔹 What are we trying to solve?  
🔹 Why is ML a good solution for this?  
🔹 What data do we have (or need to collect)?  

Our goal is not to **force ML** but to **determine whether ML is the right approach**.


#### Example when ML is not needed
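
As a hypothetical illustration: when the rule is already known and easy to write down, plain code beats a model. Checking whether a user meets a minimum age needs no training data (the threshold of 18 is an assumption for this sketch):

```python
MINIMUM_AGE = 18  # assumed threshold, for illustration only

def is_old_enough(age):
    """The rule is fully known, so we code it directly instead of learning it."""
    return age >= MINIMUM_AGE

print(is_old_enough(17), is_old_enough(18))
```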

### Build your own Tinder

We want to build a model that **learns what you find attractive**.  

#### Why is this a Good ML Problem?  
✅ **There is a pattern** – Your preferences are not random.  
❌ **It’s hard to code manually** – You can’t write explicit rules for what makes someone attractive.  
✅ **It’s easy to show with examples** – Instead of defining a rule, you can **label examples** of what you like.  

This makes it a **classic supervised learning problem**:  
- **Inputs ($X$)**: Images of people.  
- **Outputs ($Y$)**: Your rating (like/dislike).  
- **Goal**: Learn a function $f: X \to Y$ that predicts your taste.

To build our model, we must:  

### 1️⃣ **Create a Dataset**  
Choosing data can come from two directions:  
1. **Use existing data** – Work with what you already have.  
   - Extract relevant features from it.  
2. **Generate new data** – Collect data based on your understanding of the problem.  

---

### 2️⃣ **Choose a Hypothesis Class**  
The hypothesis class defines the set of functions the model can learn. Common choices include:  

- **Decision Trees** – Learn a series of if-else rules to classify inputs. They need well-defined features (e.g. height, weight, skin_color, eye_color, hair_type, ...).
- **Linear Regression** – Models the output as a weighted sum of the input features. It needs well-defined, numeric, continuous features.
- **Convolutional Neural Networks (CNNs)** – Extract spatial patterns from images, making them ideal for vision tasks.  

---

### 3️⃣ **Our Case: Images + CNN**  
In our case, we will use **images** because:  
- They are the most natural way to represent visual preferences.  
- They allow us to capture complex patterns that are hard to define manually.  

Since CNNs excel at **image-based learning**, we will use a **Convolutional Neural Network (CNN)** to model preferences.  

---

### 4️⃣ **Select a Learning Algorithm**  
Once we choose a hypothesis class, we need an algorithm to **train** the model.  

- In **almost all practical cases**, the model you work with comes with a **built-in learning algorithm**.  
- For **Neural Networks**, the standard training method is **gradient descent**, with the gradients computed by **Backpropagation**.  

(*We will take a brief look at how **Backpropagation** works in the training section.*)

### 5️⃣ **Evaluate results**
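After training, we measure how well the model performs on data it has not seen during training (the validation and test sets), because good results on the training set alone do not show that the model generalizes.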

# 3. Creating Our Dataset - The most important part of them all. Always.

### Why Does Data Matter? 

 
Before training a model, we need **high-quality data**.  

🔹 **You Can’t Optimize Without Data** – In ML, we have **three main components**:  
   - **Data** – The foundation; without it, learning is impossible.  
   - **Hypothesis Class** – The set of possible functions the model can learn.  
   - **Learning Algorithm** – The method to optimize parameters.  

If we **remove the learning algorithm**, we can still train a model manually.  
If we **choose a suboptimal hypothesis class**, we still learn something.  
But **without data, nothing works.**  

🔹 **Garbage In, Garbage Out** – A model is only as good as the data it learns from.  

---

## Our Task: Collect Personal Preference Data  
To train our model, we need labeled examples of **what we find attractive**.  

We will use the **Photo-Rater App** to label images, creating a dataset that reflects individual preferences.  

🔜 Next: Let's start labeling data!  

![Tag it all](data/presentation_files/xally.jpg)

Go now... swipe right and left, and come back when you're done.

# 4. Train a model

## 4.1 Prepare Our Dataset for Training  


Now that we have labeled data, we need to **organize it for training**.  

### 1️⃣ Train, Validation, and Test Split  
To evaluate our model properly, we split the data into three parts:  

- **Training Set** – Used to train the model.  
- **Validation Set** – Used to tune hyperparameters and detect overfitting.  
- **Test Set** – Used to evaluate final model performance on unseen data.  

Finally - some code:

In [3]:
import os
import shutil
import random
from pathlib import Path

# Define dataset paths
DATASET_DIR = "data/dataset"  # Base dataset directory
OUTPUT_DIR = "data"  # Where train/val/test splits will be stored

# Define train, validation, and test split ratios
TRAIN_RATIO = 0.7
VAL_RATIO = 0.15
TEST_RATIO = 0.15

# Set a fixed seed for reproducibility
random.seed(42)

# Ensure the output directories exist
for split in ["train", "val", "test"]:
    for label in ["Like", "Dislike"]:  # Ensure we maintain class labels
        os.makedirs(os.path.join(OUTPUT_DIR, split, label), exist_ok=True)

# Function to split and copy images while maintaining original dataset
def split_and_copy(label):
    label_dir = Path(DATASET_DIR) / label
    all_images = list(label_dir.glob("*.jpg"))  # Adjust for different image formats if needed
    random.shuffle(all_images)  # Shuffle with fixed seed for reproducibility

    # Compute split sizes
    num_images = len(all_images)
    train_split = int(num_images * TRAIN_RATIO)
    val_split = int(num_images * (TRAIN_RATIO + VAL_RATIO))

    # Split dataset
    train_files = all_images[:train_split]
    val_files = all_images[train_split:val_split]
    test_files = all_images[val_split:]

    # Function to copy files instead of moving
    def copy_files(files, split):
        dest_dir = Path(OUTPUT_DIR) / split / label
        existing_files = set(f.name for f in dest_dir.glob("*.jpg"))  # Track existing files
        for file in files:
            if file.name not in existing_files:  # Avoid duplicates if rerunning
                shutil.copy(str(file), os.path.join(dest_dir, file.name))

    # Copy files to respective folders
    copy_files(train_files, "train")
    copy_files(val_files, "val")
    copy_files(test_files, "test")

    return len(train_files), len(val_files), len(test_files)

# Process both classes
train_like, val_like, test_like = split_and_copy("Like")
train_dislike, val_dislike, test_dislike = split_and_copy("Dislike")

# Print summary
print("Dataset split complete! 🎉")
print(f"Train: {train_like + train_dislike} (Like: {train_like}, Dislike: {train_dislike})")
print(f"Validation: {val_like + val_dislike} (Like: {val_like}, Dislike: {val_dislike})")
print(f"Test: {test_like + test_dislike} (Like: {test_like}, Dislike: {test_dislike})")

Dataset split complete! 🎉
Train: 139 (Like: 23, Dislike: 116)
Validation: 30 (Like: 5, Dislike: 25)
Test: 32 (Like: 6, Dislike: 26)


In [4]:
# Sanity check: Count files in each split
def count_files():
    for split in ["train", "val", "test"]:
        for label in ["Like", "Dislike"]:
            path = Path(OUTPUT_DIR) / split / label
            num_files = len(list(path.glob("*.jpg")))
            print(f"{split.capitalize()} - {label}: {num_files} images")

count_files()

Train - Like: 23 images
Train - Dislike: 116 images
Val - Like: 5 images
Val - Dislike: 25 images
Test - Like: 6 images
Test - Dislike: 26 images


### 4.1.1 Why are we doing this?  

### 4.1.2 Overfitting vs. Generalization – A TDD Analogy

🚀 Imagine You’re an Engineer...
I give you **100 test cases** and tell you:

👉 **"Just make sure all these tests pass."**

A straightforward solution would be to look at each test `i`, with its `input_i` and `expected_output_i`, and write:
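```python
# Memorized answers: one entry per known test (hypothetical example values)
KNOWN_TESTS = {"input_1": "expected_output_1", "input_2": "expected_output_2"}

def my_function(some_input):
    # Return the stored answer for a known input; unseen inputs get None
    return KNOWN_TESTS.get(some_input)
```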
This **hardcodes answers** instead of solving the real problem.  
✅ **Passes all known tests.**  
❌ **Fails on new cases.**  
**This is overfitting!** In extreme cases, it's just **memorization**.  

---

##### **How Do We Ensure Generalization?**
Because I'm smart, I **don't give you all 100 test cases**!  
👉 **I give you 80**, but keep **20 hidden.**  

I tell you:  
**"Write a good function based on these 80 examples.  
If it also works on my secret 20 tests, I’ll give you 500 shekels in BuyMe!"**  

##### **Two advantages of this approach:**
✅ **Now you must generalize!**  You can't just memorize; you have to find the real pattern!  
✅ **I can test your generalization**, because I kept 20 examples to myself.
 
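This is **exactly why we split our dataset**. In this analogy the split is:  
- **Train Set (80%)** → The model learns from this.  
- **Test Set (20%)** → The model must perform well on unseen data.  

The split code above uses 70/15/15 instead, with the extra **validation set** used for tuning.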
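
The analogy can be played out in a few lines. This is a minimal sketch with a made-up hidden rule (`output = 2 * input`) and hypothetical function names; it compares a memorizing solution with a generalizing one on the 80 visible and 20 hidden tests:

```python
import random

# 100 hypothetical test cases for the hidden rule "output = 2 * input"
random.seed(0)
all_tests = [(x, 2 * x) for x in range(100)]
random.shuffle(all_tests)
visible_tests, hidden_tests = all_tests[:80], all_tests[80:]

# "Overfit" solution: memorize the 80 visible answers
memorized = dict(visible_tests)

def memorizing_function(some_input):
    return memorized.get(some_input)  # None for anything unseen

# "Generalizing" solution: captures the actual pattern
def general_function(some_input):
    return 2 * some_input

def pass_rate(function, tests):
    """Fraction of tests for which the function returns the expected output."""
    return sum(function(x) == expected for x, expected in tests) / len(tests)

for name, function in [("memorizing", memorizing_function), ("general", general_function)]:
    print(
        f"{name}: visible={pass_rate(function, visible_tests):.0%}, "
        f"hidden={pass_rate(function, hidden_tests):.0%}"
    )
```

Both solutions look identical on the visible tests; only the hidden tests reveal which one actually learned the pattern.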

---

##### **Key Takeaway:**  
💡 A model that **only memorizes the training set is useless**. We need **generalization** for real-world performance!  

---

### 4.1.3 Good Questions at This Point 🤔


1️⃣ **How do we evaluate this performance?**  
📢 We will talk about it in the **Evaluation section** (after training).  

2️⃣ **This explanation doesn't explain how the model knows not to overfit the 80 examples it was given.**  
That is very true! **Splitting the data only allows us to measure generalization performance.**  
Just like we gave the engineer a **500-shekel incentive**, we need ways to **motivate the model not to overfit**.  

This is where **Regularization** comes in! Regularization techniques **penalize complexity** to encourage the model to find simpler, more generalizable patterns. This is an advanced topic, so we may only mention it briefly during the following training session! 🚀
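
To make "penalize complexity" concrete, here is a minimal sketch using an L2 penalty on a one-weight linear model. The data, the model $h(x) = w \cdot x$, and the penalty strengths are all assumptions chosen for illustration; the workshop model does not use this code.

```python
# Hypothetical noisy data, roughly y = 3x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.5, 5.5, 9.5, 11.5]

def regularized_loss(w, penalty_strength):
    """Squared error on the data plus an L2 penalty on the weight."""
    data_loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys))
    return data_loss + penalty_strength * w ** 2

def best_weight(penalty_strength):
    """Closed-form minimizer of regularized_loss for the model h(x) = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / (
        sum(x * x for x in xs) + penalty_strength
    )

for penalty_strength in [0.0, 1.0, 10.0]:
    w = best_weight(penalty_strength)
    print(f"penalty={penalty_strength}: w={w:.3f}")
```

A larger penalty pulls the weight toward zero: the model trades a slightly worse fit on the training data for a simpler function.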


### 4.1.4 A Few words on Preprocessing Before Training

In real-world cases, we usually preprocess the data further before passing it to a model. **Preprocessing** can involve:  

🔹 **Feature Extraction** - Creating additional features from raw data. While this is **less common in images** due to the nature of CNNs, it is **very useful in other models**.
* Example (House Prices) → Instead of using the raw address, we can preprocess it into “distance from the city center”, turning an informative but hard-to-use string into a continuous, easy-to-work-with number.
* Example (NLP) → Before using text in a model, we must convert words into numbers, ideally into meaningful numerical vectors (a.k.a. embeddings) that capture their relationships and meanings.
  
🔹 **Data Manipulation** – Standardizing input formats (e.g., resizing images, filtering out low-resolution images, handling missing values).  
🔹 **Normalization & Scaling** – Ensuring that numerical features are on a similar scale to improve training stability.  
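
As a small sketch of normalization and scaling, the snippet below standardizes two made-up house features so that each has zero mean and unit standard deviation. The feature names and values are hypothetical, and z-score standardization is only one of several common scaling choices.

```python
from statistics import mean, pstdev

# Hypothetical house features on very different scales
areas_m2 = [50.0, 80.0, 120.0, 200.0]
room_counts = [2.0, 3.0, 4.0, 6.0]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    center = mean(values)
    spread = pstdev(values)
    return [(value - center) / spread for value in values]

scaled_areas = standardize(areas_m2)
scaled_rooms = standardize(room_counts)
print([round(v, 2) for v in scaled_areas])
print([round(v, 2) for v in scaled_rooms])
```

After scaling, both features live in a comparable numeric range, so neither dominates simply because of its units.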

##### 🚀 Why Is Preprocessing Crucial?  
Preprocessing is usually where a Data Scientist has the most **room to shine**! Unlike modeling, where architectures and optimizers are often well-defined, **there is no single "correct" way to preprocess data**.  
**_"It is art"_** as some Feinshmekers would say  


## 4.2 (Really) Train

In [5]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
import os

# Set device (GPU if available, otherwise CPU).
# The GPU device is usually "cuda", but on a Mac the local GPU is exposed as "mps".
device = torch.device("mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define dataset paths
train_dir = "data/train"
val_dir = "data/val"

# Define transformations
transform = transforms.Compose([
    transforms.Resize((224, 224)),  # Resize images to match MobileNet input
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Standard ImageNet normalization
])

# Load datasets
train_dataset = datasets.ImageFolder(root=train_dir, transform=transform)
val_dataset = datasets.ImageFolder(root=val_dir, transform=transform)

# Data loaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Check dataset sizes
print(f"Train samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

Using device: mps
Train samples: 139
Validation samples: 30


In [6]:
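# Load pretrained MobileNet model
# (`pretrained=True` is deprecated since torchvision 0.13; IMAGENET1K_V1 is the equivalent weights choice)
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)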

# Modify the classifier for our binary classification task (Like vs. Dislike)
num_features = model.classifier[1].in_features
model.classifier = nn.Sequential(
    nn.Linear(num_features, 2)  # Two output classes (Like & Dislike)
)

# Send model to device
model = model.to(device)





**Assigning a Loss Function**

A **loss function** measures how wrong our model's predictions are.
The **Cross-Entropy** (Log Loss) function works well for **binary classification** (Like vs. Dislike):
$$
L = - \sum_{c} y_c \log(\hat{y}_c)
$$
where $y_c$ is 1 for the correct class and 0 otherwise, and $\hat{y}_c$ is the predicted probability of class $c$.
✅ Encourages the model to assign **high probability to the correct class**.  
✅ Punishes confident wrong predictions **more than uncertain ones**.
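
To see these two properties in numbers, here is a minimal hand-rolled sketch of cross-entropy for a single example. The helper names and the raw output values are hypothetical; in PyTorch, `nn.CrossEntropyLoss` applies the softmax step to the raw model outputs internally.

```python
import math

def softmax(logits):
    """Turn raw model outputs into probabilities that sum to 1."""
    shifted = [value - max(logits) for value in logits]  # for numerical stability
    exps = [math.exp(value) for value in shifted]
    total = sum(exps)
    return [value / total for value in exps]

def cross_entropy(logits, true_class):
    """Loss for one example: minus the log-probability of the correct class."""
    return -math.log(softmax(logits)[true_class])

# Class indices are an assumption for this sketch: 0 = Dislike, 1 = Like
true_class = 1
for description, logits in [
    ("confident and correct", [-2.0, 2.0]),
    ("uncertain", [0.0, 0.0]),
    ("confident and wrong", [2.0, -2.0]),
]:
    print(f"{description}: loss={cross_entropy(logits, true_class):.3f}")
```

The loss grows from nearly zero for a confident correct prediction to a much larger value for a confident wrong one, with the uncertain prediction in between.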

In [7]:
# Define loss function
criterion = nn.CrossEntropyLoss()


**Explaining the Training Loop: Gradient Descent & Backpropagation**

**Gradient Descent – The Optimization Process**
Gradient Descent guides how we update the model’s weights to reduce loss.  

$$
\theta := \theta - \eta \cdot \frac{\partial L}{\partial \theta}
$$

📌 **What’s happening?**  
- $\theta$ = Model parameters (weights & biases).  
- $\eta$ = Learning rate (step size).  
- $\frac{\partial L}{\partial \theta}$ = Gradient of the loss function w.r.t. $\theta$.  

Each step **adjusts the weights** in the direction that lowers the loss.
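
A minimal sketch of this update rule on a toy problem: a single parameter and a hand-written quadratic loss, both chosen only for illustration. Real networks have millions of parameters, but each one is updated in the same way.

```python
def loss(theta):
    """A toy loss with its minimum at theta = 3 (chosen for illustration)."""
    return (theta - 3.0) ** 2

def gradient(theta):
    """Derivative of the toy loss with respect to theta."""
    return 2.0 * (theta - 3.0)

theta = 0.0          # initial parameter value
learning_rate = 0.1  # eta, the step size

for step in range(50):
    theta = theta - learning_rate * gradient(theta)  # the update rule above

print(f"theta after 50 steps: {theta:.4f}, loss: {loss(theta):.6f}")
```

Trying a much larger `learning_rate` (for example 1.5) makes the steps overshoot and the loss grow, which is why the step size matters.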


---


## How Does Training Work? (Gradient Descent in Action)

Each training iteration consists of three key steps:

1️⃣ **Forward Pass** → The model processes input images and makes predictions.  
2️⃣ **Compute Loss** → Compare predictions to the true labels using the loss function.  
3️⃣ **Backward Pass (Backpropagation)** → Propagate the loss **backward through the network**, layer by layer, using the chain rule of derivatives to compute how much each weight contributed to the error. The optimizer then updates each weight based on its gradient.
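
The three steps can be traced by hand on the smallest possible "network": a single neuron with one weight and one bias. This is a sketch under stated assumptions (made-up data, a sigmoid output, binary cross-entropy, and gradients written out manually); PyTorch computes the same kind of gradients automatically when `loss.backward()` is called.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-feature dataset: label 1 for positive x, 0 for negative x
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = 0.0, 0.0
learning_rate = 0.5
losses = []

for epoch in range(20):
    grad_w, grad_b, epoch_loss = 0.0, 0.0, 0.0
    for x, y in data:
        # 1. Forward pass: compute the predicted probability
        p = sigmoid(w * x + b)
        # 2. Compute loss: binary cross-entropy for this example
        epoch_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        # 3. Backward pass: chain rule gives dL/dz = p - y, then dz/dw = x, dz/db = 1
        grad_w += (p - y) * x
        grad_b += (p - y)
    # Weight update (gradient descent step on the averaged gradients)
    w -= learning_rate * grad_w / len(data)
    b -= learning_rate * grad_b / len(data)
    losses.append(epoch_loss / len(data))

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}, w={w:.2f}, b={b:.2f}")
```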

(Add sketch here)


In [8]:
# Use the Adam optimizer (an adaptive variant of gradient descent) with a learning rate of 0.001.
# lr=0.001 is Adam's default in PyTorch and a common starting point.
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 5  # Keep training short for the workshop
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    avg_loss = running_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")

print("Training complete! 🎉")

Epoch [1/5], Loss: 0.6140
Epoch [2/5], Loss: 0.2759
Epoch [3/5], Loss: 0.1210
Epoch [4/5], Loss: 0.0743
Epoch [5/5], Loss: 0.0858
Training complete! 🎉
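
The training loss alone does not say how the model does on unseen data. As a sketch of one sanity check, the hypothetical helper below computes accuracy from lists of predicted and true labels and applies it to a baseline that always answers "Dislike", using the validation counts printed earlier (5 Like, 25 Dislike). It does not evaluate the trained model itself.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Validation labels matching the split summary above: 5 Like, 25 Dislike
val_labels = ["Like"] * 5 + ["Dislike"] * 25

# Baseline "model": always predict the majority class
baseline_predictions = ["Dislike"] * len(val_labels)

print(f"Majority-class baseline accuracy: {accuracy(baseline_predictions, val_labels):.1%}")
```

Because the classes are imbalanced, a model only becomes interesting once its validation accuracy clearly beats this baseline.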
