# 🍓 Strawberry Detection - Direct YOLOv8 Training Notebook

This notebook provides a **direct, ready-to-run** solution for training YOLOv8 models on your strawberry dataset.

## 🚀 Quick Start

1. **Run all cells** in order (Cell → Run All)
2. **Dataset will be automatically downloaded** from Kaggle (requires Kaggle API credentials)
3. **Choose model** (YOLOv8m or YOLOv8l)
4. **Monitor training** with real-time plots
5. **Download results** when complete
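
The automatic download in step 2 goes through the Kaggle CLI, which looks for API credentials either in the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables or in a `~/.kaggle/kaggle.json` token file. A minimal pre-flight check might look like the sketch below; the helper name `kaggle_credentials_available` is hypothetical and not part of the Kaggle package.

```python
import os
from pathlib import Path

def kaggle_credentials_available(home=None, env=None):
    """Return True if the Kaggle CLI is likely to find API credentials."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Option 1: both environment variables are set and non-empty
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: the token file in the default location
    return (home / ".kaggle" / "kaggle.json").is_file()

if not kaggle_credentials_available():
    print("Kaggle credentials not found; the automatic download will likely fail.")
```

Running this before the download cell makes an authentication problem visible early, instead of surfacing as a missing zip file.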

## 📊 Current Status

- ✅ **YOLOv8n**: Trained (98.9% mAP@50)
- ✅ **YOLOv8s**: Trained (98.5% mAP@50)
- ⏳ **YOLOv8m**: Ready for training
- ⏳ **YOLOv8l**: Ready for training

## ⚙️ Hardware Requirements

- **Minimum**: 8GB GPU VRAM (RTX 3070/2070)
- **Recommended**: 10GB+ GPU VRAM (RTX 3080/3090)
- **CPU**: 4+ cores, 16GB RAM
- **Storage**: 10GB free space for models

## 💰 Estimated Training Times

| Model | Epochs | RTX 3080 | RTX 3090 | Cost (RTX 3080, at $0.17/hr) |
|-------|--------|----------|----------|------------------------------|
| YOLOv8m | 120 | 3-4 hours | 2.5-3.5 hours | $0.51-$0.68 |
| YOLOv8l | 150 | 4-5 hours | 3-4 hours | $0.68-$0.85 |

**Total for both**: 7-9 hours, ~$1.19-$1.53
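
The cost figures follow from multiplying the RTX 3080 time range by the $0.17/hr rate quoted in the training cells. A short sketch of that arithmetic (the helper `cost_range` is only for illustration):

```python
HOURLY_RATE_USD = 0.17  # rate quoted in the training cells of this notebook

def cost_range(hours_low, hours_high, rate=HOURLY_RATE_USD):
    """Return the (low, high) cloud cost in USD, rounded to cents."""
    return round(hours_low * rate, 2), round(hours_high * rate, 2)

# RTX 3080 time estimates from the table, in hours
estimates = {"YOLOv8m": (3, 4), "YOLOv8l": (4, 5)}

total_low = total_high = 0.0
for name, (lo, hi) in estimates.items():
    c_lo, c_hi = cost_range(lo, hi)
    total_low += c_lo
    total_high += c_hi
    print(f"{name}: {lo}-{hi} h -> ${c_lo:.2f}-${c_hi:.2f}")
print(f"Total: ${total_low:.2f}-${total_high:.2f}")
```

Changing `HOURLY_RATE_USD` to your provider's actual rate rescales every estimate at once.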

## 📥 Dataset Download from Kaggle

The cells below install the required packages, download the fruit ripeness dataset from Kaggle, and prepare it for YOLOv8 training.
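
For context, YOLO detection labels are plain-text files with one object per line in the form `class_id x_center y_center width height`, where the coordinates are normalized to the image size. The sketch below, with a hypothetical helper name, shows one way to collect the class IDs across all label files so that the class count in `data.yaml` can be checked.

```python
from pathlib import Path

def collect_class_ids(labels_dir):
    """Return the sorted class IDs found in every YOLO label file under labels_dir."""
    class_ids = set()
    for label_file in Path(labels_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if parts:  # skip blank lines
                class_ids.add(int(parts[0]))  # first field is the class ID
    return sorted(class_ids)
```

Because class IDs are zero-based, `nc` should be at least `max(ids) + 1`, for example `collect_class_ids("model/dataset_strawberry_kaggle/train/labels")` after the download has finished.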

In [None]:
# Install required packages including kaggle
!pip install ultralytics torch torchvision opencv-python matplotlib Pillow tqdm tensorboard kaggle -q

In [None]:
import os
import zipfile
import shutil
import yaml
from pathlib import Path
import subprocess
import sys

def download_kaggle_dataset():
    """Download and prepare the Kaggle fruit ripeness dataset."""
    
    print("=" * 60)
    print("📥 DOWNLOADING KAGGLE DATASET")
    print("=" * 60)
    
    # Kaggle dataset details
    kaggle_dataset = "dudinurdiyansah/fruit-ripeness-dataset"
    dataset_dir = "model/dataset_strawberry_kaggle"
    
    # Check if dataset already exists
    if os.path.exists(dataset_dir) and os.path.exists(os.path.join(dataset_dir, "data.yaml")):
        print(f"✅ Dataset already exists at: {dataset_dir}")
        print("Skipping download...")
        return dataset_dir
    
    # Create directory
    os.makedirs(dataset_dir, exist_ok=True)
    
    try:
        # Download using kaggle CLI
        print(f"Downloading dataset: {kaggle_dataset}")
        !kaggle datasets download -d {kaggle_dataset} -p {dataset_dir}
        
        # Find the downloaded zip file
        zip_files = list(Path(dataset_dir).glob("*.zip"))
        if not zip_files:
            print("❌ No zip file found after download")
            return None
            
        zip_path = zip_files[0]
        print(f"Extracting: {zip_path.name}")
        
        # Extract zip file
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(dataset_dir)
        
        # Remove zip file
        os.remove(zip_path)
        
        # Look for the actual dataset structure
        extracted_dirs = list(Path(dataset_dir).glob("*"))
        for item in extracted_dirs:
            if item.is_dir():
                # Check if this looks like the dataset
                train_dir = item / "train"
                valid_dir = item / "valid"
                
                if train_dir.exists() and valid_dir.exists():
                    # Move contents up one level
                    for subitem in item.glob("*"):
                        # dataset_dir is a str, so wrap it in Path before joining with "/"
                        shutil.move(str(subitem), str(Path(dataset_dir) / subitem.name))
                    # Remove the now-empty directory
                    shutil.rmtree(item)
                    break
        
        # Create data.yaml if it doesn't exist
        data_yaml_path = os.path.join(dataset_dir, "data.yaml")
        if not os.path.exists(data_yaml_path):
            print("Creating data.yaml configuration...")
            
            # Find class names from directory structure
            train_path = os.path.join(dataset_dir, "train")
            if os.path.exists(train_path):
                # Scan every label file so that no class ID is missed
                label_files = list(Path(train_path).glob("labels/*.txt"))
                class_ids = set()
                for label_file in label_files:
                    with open(label_file, 'r') as f:
                        for line in f:
                            if line.strip():
                                class_ids.add(int(line.split()[0]))
                
                if class_ids:
                    # YOLO class IDs are zero-based, so cover 0..max(ID) with generic names
                    class_names = [f"class_{i}" for i in range(max(class_ids) + 1)]
                else:
                    # Default to strawberry if we can't determine
                    class_names = ["strawberry"]
            else:
                class_names = ["strawberry"]
            
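            # Create data.yaml with absolute paths so they resolve regardless of
            # the working directory or the Ultralytics datasets directory setting
            data_yaml_content = f"""# Fruit Ripeness Dataset
train: {os.path.abspath(os.path.join(dataset_dir, 'train/images'))}
val: {os.path.abspath(os.path.join(dataset_dir, 'valid/images'))}
test: {os.path.abspath(os.path.join(dataset_dir, 'test/images')) if os.path.exists(os.path.join(dataset_dir, 'test')) else ''}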

nc: {len(class_names)}
names: {class_names}
"""
            
            with open(data_yaml_path, 'w') as f:
                f.write(data_yaml_content)
            
            print(f"✅ Created data.yaml with {len(class_names)} classes")
        
        print(f"✅ Dataset downloaded and prepared at: {dataset_dir}")
        
        # Count images
        train_images = list(Path(dataset_dir).glob("train/images/*.jpg")) + list(Path(dataset_dir).glob("train/images/*.png"))
        val_images = list(Path(dataset_dir).glob("valid/images/*.jpg")) + list(Path(dataset_dir).glob("valid/images/*.png"))
        
        print(f"   Training images: {len(train_images)}")
        print(f"   Validation images: {len(val_images)}")
        
        return dataset_dir
        
    except Exception as e:
        print(f"❌ Error downloading dataset: {e}")
        print("\nManual download instructions:")
        print(f"1. Visit: https://www.kaggle.com/datasets/{kaggle_dataset}")
        print(f"2. Download the dataset manually")
        print(f"3. Extract to: {dataset_dir}")
        return None

# Download the dataset
dataset_path = download_kaggle_dataset()

In [None]:
import torch
import ultralytics
import matplotlib.pyplot as plt
import cv2
import random
import pandas as pd
from datetime import datetime

print("=" * 60)
print("🍓 STRAWBERRY DETECTION - YOLOv8 TRAINING")
print("=" * 60)

# System check
print(f"Python: {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"GPU Memory: {gpu_memory:.1f} GB")
    
    # Recommend batch size
    if gpu_memory >= 24:  # RTX 3090/A5000
        batch_size_m, batch_size_l = 48, 32
    elif gpu_memory >= 10:  # RTX 3080
        batch_size_m, batch_size_l = 24, 16
    elif gpu_memory >= 8:   # RTX 3070/2070
        batch_size_m, batch_size_l = 16, 8
    else:
        batch_size_m, batch_size_l = 8, 4
    print(f"Recommended batch size - YOLOv8m: {batch_size_m}, YOLOv8l: {batch_size_l}")
else:
    print("⚠️  WARNING: No GPU detected! Training will be VERY slow on CPU.")
    batch_size_m, batch_size_l = 4, 2

print(f"Ultralytics: {ultralytics.__version__}")

In [None]:
# Verify dataset
dataset_path = "model/dataset_strawberry_kaggle"
data_yaml = os.path.join(dataset_path, "data.yaml")

if os.path.exists(data_yaml):
    print(f"✅ Dataset found: {data_yaml}")
    with open(data_yaml, 'r') as f:
        data = yaml.safe_load(f)
    print(f"   Classes: {data.get('nc', 'N/A')}")
    print(f"   Class names: {data.get('names', 'N/A')}")
    
    # Check train/val paths
    train_path = data.get('train', '')
    val_path = data.get('val', '')
    
    # Fix relative paths if needed
    if train_path.startswith('../'):
        train_path = os.path.join(os.path.dirname(data_yaml), train_path)
    if val_path.startswith('../'):
        val_path = os.path.join(os.path.dirname(data_yaml), val_path)
    
    # Also check for train/images structure
    if not os.path.exists(train_path):
        # Try alternative paths
        alt_train_path = os.path.join(dataset_path, "train/images")
        if os.path.exists(alt_train_path):
            train_path = alt_train_path
            print(f"   Using alternative train path: {train_path}")
    
    if not os.path.exists(val_path):
        alt_val_path = os.path.join(dataset_path, "valid/images")
        if os.path.exists(alt_val_path):
            val_path = alt_val_path
            print(f"   Using alternative val path: {val_path}")
    
    if train_path and os.path.exists(train_path):
        train_images = list(Path(train_path).glob("*.jpg")) + list(Path(train_path).glob("*.png"))
        print(f"   Training images: {len(train_images)}")
    
    if val_path and os.path.exists(val_path):
        val_images = list(Path(val_path).glob("*.jpg")) + list(Path(val_path).glob("*.png"))
        print(f"   Validation images: {len(val_images)}")
else:
    print(f"❌ Dataset not found: {data_yaml}")
    print("Please ensure the dataset is in the correct location.")

In [None]:
# Visualize sample images
def show_sample_images(num_samples=3):
    """Display sample images from the dataset."""
    if not os.path.exists(data_yaml):
        print("Dataset not found, skipping visualization")
        return
    
    with open(data_yaml, 'r') as f:
        config = yaml.safe_load(f)
    
    train_path = config.get('train', '')
    
    # Fix relative paths if needed
    if train_path.startswith('../'):
        train_path = os.path.join(os.path.dirname(data_yaml), train_path)
    
    # Also check for train/images structure
    if not os.path.exists(train_path):
        alt_train_path = os.path.join(dataset_path, "train/images")
        if os.path.exists(alt_train_path):
            train_path = alt_train_path
            print(f"Using train path: {train_path}")
        else:
            print(f"Training path not found: {train_path}")
            return
    
    image_files = list(Path(train_path).glob("*.jpg")) + list(Path(train_path).glob("*.png"))
    if not image_files:
        # Try to find images in subdirectories
        image_files = list(Path(train_path).rglob("*.jpg")) + list(Path(train_path).rglob("*.png"))
        
    if not image_files:
        print(f"No images found in {train_path}")
        print("Checking directory structure...")
        !find {dataset_path} -name "*.jpg" | head -5
        return
    
    samples = random.sample(image_files, min(num_samples, len(image_files)))
    
    fig, axes = plt.subplots(1, len(samples), figsize=(15, 5))
    if len(samples) == 1:
        axes = [axes]
    
    for idx, (ax, img_path) in enumerate(zip(axes, samples)):
        try:
            img = cv2.imread(str(img_path))
            if img is None:
                ax.text(0.5, 0.5, f"Failed to load\n{img_path.name}", ha='center', va='center')
                ax.axis('off')
                continue
                
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            ax.imshow(img)
            ax.set_title(f"Sample {idx+1}\n{img_path.name}")
            ax.axis('off')
        except Exception as e:
            ax.text(0.5, 0.5, f"Error\n{str(e)[:30]}", ha='center', va='center')
            ax.axis('off')
    
    plt.tight_layout()
    plt.show()
    print(f"📸 Displayed {len(samples)} sample images from {train_path}")

# Show samples
show_sample_images(3)

## 🎯 Training Configuration

In [None]:
from ultralytics import YOLO
import time

def train_model(model_size="m", epochs=120, batch_size=24):
    """Train YOLOv8 model with given parameters."""
    
    print(f"\n{'='*60}")
    print(f"🚀 STARTING YOLOv8{model_size.upper()} TRAINING")
    print(f"{'='*60}")
    print(f"Model: yolov8{model_size}.pt")
    print(f"Epochs: {epochs}")
    print(f"Batch Size: {batch_size}")
    print(f"Dataset: {data_yaml}")
    
    # Configuration
    config = {
        "model": f"yolov8{model_size}.pt",
        "data": data_yaml,
        "epochs": epochs,
        "imgsz": 640,
        "batch": batch_size,
        "workers": 8,
        "device": 0 if torch.cuda.is_available() else "cpu",
        "project": "model/detection",
        "name": f"yolov8{model_size}_direct",
        "exist_ok": True,
        "pretrained": True,
        "optimizer": "AdamW",
        "lr0": 0.01 if model_size == "m" else 0.008,
        "amp": True,  # Mixed precision
        "plots": True,
        "save_period": 10,
        "val": True,
        "save": True,
        "verbose": True,
    }
    
    # Create output directory
    output_dir = os.path.join("model/detection", f"yolov8{model_size}_direct")
    os.makedirs(output_dir, exist_ok=True)
    
    print(f"Output directory: {output_dir}")
    print(f"{'='*60}\n")
    
    # Start training
    start_time = time.time()
    
    try:
        model = YOLO(f"yolov8{model_size}.pt")
        results = model.train(**config)
        
        training_time = time.time() - start_time
        hours = int(training_time // 3600)
        minutes = int((training_time % 3600) // 60)
        seconds = int(training_time % 60)
        
        print(f"\n{'='*60}")
        print(f"✅ TRAINING COMPLETED SUCCESSFULLY!")
        print(f"⏱️  Time: {hours}h {minutes}m {seconds}s")
        print(f"📁 Results saved to: {output_dir}")
        
        if hasattr(results, 'results_dict'):
            metrics = results.results_dict
            # Default to NaN so the float format spec never receives a string
            print(f"📊 Final mAP@50: {metrics.get('metrics/mAP50(B)', float('nan')):.4f}")
            print(f"📊 Final mAP@50-95: {metrics.get('metrics/mAP50-95(B)', float('nan')):.4f}")
        
        return results
        
    except Exception as e:
        print(f"❌ Training failed: {e}")
        import traceback
        traceback.print_exc()
        return None

## 🏋️‍♂️ Train YOLOv8m (Medium Model)

In [None]:
# Train YOLOv8m
print("⚠️  YOLOv8m Training - Estimated: 3-4 hours on RTX 3080")
print("💰 Estimated cost on cloud: $0.51-$0.68 (at $0.17/hr)")

response = input("\nStart YOLOv8m training? (yes/no): ")

if response.lower() == 'yes':
    print("Starting YOLOv8m training...")
    results_m = train_model(model_size="m", epochs=120, batch_size=batch_size_m)
    
    if results_m:
        print("\n🎉 YOLOv8m training completed!")
        print("Next: You can train YOLOv8l or download the model.")
else:
    print("Skipping YOLOv8m training.")

## 🏋️‍♂️ Train YOLOv8l (Large Model)

In [None]:
# Train YOLOv8l (optional)
print("\n" + "="*60)
print("YOLOv8l Training - Estimated: 4-5 hours on RTX 3080")
print("💰 Estimated cost on cloud: $0.68-$0.85 (at $0.17/hr)")

response = input("\nStart YOLOv8l training? (yes/no): ")

if response.lower() == 'yes':
    print("Starting YOLOv8l training...")
    results_l = train_model(model_size="l", epochs=150, batch_size=batch_size_l)
    
    if results_l:
        print("\n🎉 YOLOv8l training completed!")
        print("Both models are now ready for use.")
else:
    print("Skipping YOLOv8l training.")

## 📈 Monitor Training Progress

In [None]:
def monitor_progress(model_size="m"):
    """Monitor training progress."""
    results_dir = f"model/detection/yolov8{model_size}_direct"
    results_csv = os.path.join(results_dir, "results.csv")
    
    if not os.path.exists(results_csv):
        print(f"Results not found: {results_csv}")
        return
    
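    df = pd.read_csv(results_csv)
    # Some Ultralytics releases pad the CSV header names with spaces; strip them so lookups match
    df.columns = df.columns.str.strip()
    
    print(f"\n📊 YOLOv8{model_size.upper()} Training Progress")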
    print(f"Epochs completed: {len(df)}")
    
    if 'metrics/mAP50(B)' in df.columns:
        latest_map = df['metrics/mAP50(B)'].iloc[-1]
        print(f"Latest mAP@50: {latest_map:.4f}")
    
    # Plot
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    
    if 'train/box_loss' in df.columns:
        axes[0, 0].plot(df['epoch'], df['train/box_loss'], label='Box Loss')
        axes[0, 0].set_title('Box Loss')
        axes[0, 0].grid(True)
    
    if 'train/cls_loss' in df.columns:
        axes[0, 1].plot(df['epoch'], df['train/cls_loss'], label='Class Loss', color='orange')
        axes[0, 1].set_title('Class Loss')
        axes[0, 1].grid(True)
    
    if 'metrics/mAP50(B)' in df.columns:
        axes[1, 0].plot(df['epoch'], df['metrics/mAP50(B)'], label='mAP@50', color='green')
        axes[1, 0].set_title('mAP@50')
        axes[1, 0].grid(True)
    
    if 'lr/pg0' in df.columns:
        axes[1, 1].plot(df['epoch'], df['lr/pg0'], label='Learning Rate', color='red')
        axes[1, 1].set_title('Learning Rate')
        axes[1, 1].grid(True)
    
    plt.tight_layout()
    plt.show()
    
    return df

In [None]:
# Monitor YOLOv8m progress
try:
    monitor_progress("m")
except Exception as e:
    print(f"Could not monitor progress: {e}")

In [None]:
# Monitor YOLOv8l progress
try:
    monitor_progress("l")
except Exception as e:
    print(f"Could not monitor progress: {e}")

## 📦 Package Trained Models

In [None]:
import zipfile

def package_models():
    """Package trained models for download."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    zip_filename = f"strawberry_yolov8_models_{timestamp}.zip"
    
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for model_size in ['m', 'l']:
            model_dir = f"model/detection/yolov8{model_size}_direct"
            best_pt = os.path.join(model_dir, "weights", "best.pt")
            
            if os.path.exists(best_pt):
                zipf.write(best_pt, f"yolov8{model_size}/best.pt")
                print(f"✅ Added yolov8{model_size}/best.pt")
            
            # Add results
            results_csv = os.path.join(model_dir, "results.csv")
            if os.path.exists(results_csv):
                zipf.write(results_csv, f"yolov8{model_size}/results.csv")
                print(f"✅ Added yolov8{model_size}/results.csv")
    
    size_mb = os.path.getsize(zip_filename) / 1024 / 1024
    print(f"\n📦 Models packaged: {zip_filename} ({size_mb:.2f} MB)")
    print("\nTo download:")
    print(f"1. Right-click '{zip_filename}' in Jupyter file browser")
    print(f"2. Select 'Download'")
    print(f"3. Or use: `!cp {zip_filename} /mnt/` if mounted")
    
    return zip_filename

In [None]:
# Package models
print("Packaging trained models...")
try:
    zip_file = package_models()
    print(f"\n✅ Ready for download: {zip_file}")
except Exception as e:
    print(f"❌ Error: {e}")

## 🎯 Next Steps After Training

1. **Download the models** using the zip file above
2. **Test inference** with the trained models:
   ```python
   from ultralytics import YOLO
   model = YOLO('model/detection/yolov8m_direct/weights/best.pt')
   results = model('path/to/image.jpg')
   ```

3. **Convert to ONNX** for faster inference:
   ```python
   model.export(format='onnx')
   ```

4. **Deploy to Raspberry Pi** using ONNX Runtime
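
Steps 3 and 4 could be combined roughly as in the sketch below. It assumes `onnxruntime`, `numpy`, and `opencv-python` are installed on the Pi, that the exported model takes a 640x640 input, and that the ONNX file path is supplied by you; the helper names `letterbox_params` and `run_onnx` are hypothetical. The returned array is the raw model output, so box decoding and non-maximum suppression still have to be applied afterwards.

```python
def letterbox_params(width, height, size=640):
    """Return (scale, new_w, new_h, pad_x, pad_y) for an aspect-preserving
    resize of a width x height image onto a size x size canvas."""
    scale = min(size / width, size / height)
    new_w, new_h = round(width * scale), round(height * scale)
    return scale, new_w, new_h, (size - new_w) // 2, (size - new_h) // 2

def run_onnx(image_path, model_path, size=640):
    """Run one image through an exported ONNX model and return the raw output."""
    # Imported lazily so the helper above works without these packages
    import cv2
    import numpy as np
    import onnxruntime as ort

    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    h, w = img.shape[:2]
    _, new_w, new_h, pad_x, pad_y = letterbox_params(w, h, size)

    # Paste the resized image onto a gray square canvas
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)
    canvas[pad_y:pad_y + new_h, pad_x:pad_x + new_w] = cv2.resize(img, (new_w, new_h))

    # BGR -> RGB, HWC -> CHW, add batch dimension, scale to [0, 1]
    blob = canvas[:, :, ::-1].transpose(2, 0, 1)[None].astype(np.float32) / 255.0

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: np.ascontiguousarray(blob)})[0]
```

The scale and padding values from `letterbox_params` are also what you would use to map predicted boxes back to the original image coordinates.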

## 📊 Expected Performance

- **YOLOv8m**: ~99.0% mAP@50, 25-30 FPS on Raspberry Pi 4
- **YOLOv8l**: ~99.2% mAP@50, 15-20 FPS on Raspberry Pi 4
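
Frame rates depend heavily on the device, input size, and runtime, so it is worth measuring them on the target hardware. A minimal timing sketch, assuming `infer` is any zero-argument callable that runs one inference (the helper name `measure_fps` is hypothetical):

```python
import time

def measure_fps(infer, warmup=3, runs=20):
    """Call infer() repeatedly and return the average frames per second."""
    for _ in range(warmup):
        infer()  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for _ in range(runs):
        infer()
    elapsed = time.perf_counter() - start
    return runs / elapsed if elapsed > 0 else float("inf")

# Example (requires a trained model):
# fps = measure_fps(lambda: model('path/to/image.jpg'))
```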

## 🆘 Troubleshooting

- **Out of memory**: Reduce batch size
- **Slow training**: Ensure GPU is being used
- **Dataset issues**: Check `model/dataset_strawberry_kaggle/data.yaml`
- **Installation problems**: Run `!pip install ultralytics --upgrade`
- **Kaggle authentication**: If download fails, manually download from https://www.kaggle.com/datasets/dudinurdiyansah/fruit-ripeness-dataset
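
The out-of-memory advice above can be automated with a retry loop that halves the batch size. The sketch below is one possible approach, not part of this notebook: `train_with_backoff` is a hypothetical helper, and `train_fn` is assumed to be a callable that raises on failure (the notebook's `train_model` catches exceptions and returns `None`, so it would need to call `model.train` directly). PyTorch's CUDA out-of-memory error is a `RuntimeError` whose message contains "out of memory".

```python
def train_with_backoff(train_fn, batch_size, min_batch=1):
    """Call train_fn(batch_size), halving the batch size on out-of-memory errors.

    Returns (result, batch_size_used). Any other error is re-raised unchanged.
    """
    while True:
        try:
            return train_fn(batch_size), batch_size
        except RuntimeError as err:
            is_oom = "out of memory" in str(err).lower()
            if not is_oom or batch_size <= min_batch:
                raise
            batch_size = max(min_batch, batch_size // 2)
            print(f"Out of memory; retrying with batch size {batch_size}")
```

Starting from the recommended batch size and backing off this way avoids guessing a safe value by hand.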

## 📞 Support

- Check the main project README.md
- Review training logs in `model/detection/yolov8*_direct/`
- Monitor GPU usage with `!nvidia-smi`

In [None]:
# Final status check
print("\n" + "="*60)
print("🎉 TRAINING COMPLETE - SUMMARY")
print("="*60)

for model_size in ['m', 'l']:
    model_dir = f"model/detection/yolov8{model_size}_direct"
    best_pt = os.path.join(model_dir, "weights", "best.pt")
    
    if os.path.exists(best_pt):
        size_mb = os.path.getsize(best_pt) / 1024 / 1024
        print(f"✅ YOLOv8{model_size.upper()}: {size_mb:.1f} MB at {best_pt}")
    else:
        print(f"⏳ YOLOv8{model_size.upper()}: Not trained yet")

print("\n📁 All models saved in: model/detection/")
print("📦 Package ready: strawberry_yolov8_models_*.zip")
print("="*60)