# Task 3 – Data Enrichment with Object Detection (YOLO)


## 📖 Overview

In this task, we’ll enrich our raw image data with object-detection metadata using a modern, pre-trained YOLOv8 model. The detected objects will be written back into our data warehouse as a new fact table, linking visual content to our core message data model.

---

## 🎯 Objectives

1. **Environment Setup**  
   - Install the Ultralytics YOLOv8 package:  
     ```bash
     pip install ultralytics
     ```  
2. **Image Discovery**  
   - Write a Python script (`src/yolo_enrich.py`) that scans your data lake for newly scraped images under `data/raw/images/YYYY-MM-DD/<channel>/<message_id>.jpg`.  
3. **Object Detection**  
   - Load the YOLOv8 model in inference mode.  
   - Run detection on each image, extracting:  
     - `detected_object_class`  
     - `confidence_score`  
     - `message_id` (to join back to messages)  
4. **dbt Fact Table**  
   - Define a dbt model `models/marts/fct_image_detections.sql` with columns:  
     - `message_id` (FK → `marts.fct_messages`)  
     - `detected_object_class`  
     - `confidence_score`  
   - Materialize as a table or incremental model.  
5. **Testing & Documentation**  
   - Add dbt tests to ensure:  
     - No null `message_id` or `detected_object_class`.  
     - Confidence scores are between `0.0` and `1.0`.  
   - Document the new model in your dbt `schema.yml` and regenerate docs.
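
The image-discovery step can be sketched as follows. This is a minimal illustration, not the actual `src/yolo_enrich.py`: the names `ImageRef` and `discover_images` are hypothetical, and the sketch assumes the `<root>/YYYY-MM-DD/<channel>/<message_id>.jpg` layout described above with numeric message ids. The root folder is a parameter because the recorded run further down reads from `data/raw/telegram_images` rather than `data/raw/images`.

```python
from pathlib import Path
from typing import Iterator, NamedTuple, Optional


class ImageRef(NamedTuple):
    """Hypothetical record describing one scraped image on disk."""
    date: str        # partition folder name, e.g. "2025-07-13"
    channel: str     # channel folder name
    message_id: int  # file stem, used to join back to messages
    path: Path


def discover_images(root: Path, date: Optional[str] = None) -> Iterator[ImageRef]:
    """Yield one ImageRef per <root>/<date>/<channel>/<message_id>.jpg file.

    If `date` is None, every date folder under `root` is scanned.
    """
    pattern = f"{date or '*'}/*/*.jpg"
    for path in sorted(root.glob(pattern)):
        if not path.stem.isdigit():
            continue  # skip files whose name is not a numeric message id
        yield ImageRef(
            date=path.parent.parent.name,
            channel=path.parent.name,
            message_id=int(path.stem),
            path=path,
        )
```

Calling `list(discover_images(Path("data/raw/images"), "2025-07-13"))` would then return only that day's images, each carrying the `message_id` needed for the join.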





# YOLOv8 Enrichment
Run your `enrich_images()` function and verify the inserted detections.

In [1]:
import sys
import os

# Go two levels up from the notebook to the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))

# Join the path to 'src'
src_path = os.path.join(project_root, "src")

# Add 'src' to Python path
if src_path not in sys.path:
    sys.path.append(src_path)

# Confirm it's added
print("src path added:", src_path)


src path added: c:\Users\ABC\Desktop\10Acadamy\week_7\Shipping-a-Data-Product\src


In [2]:
# 2️⃣ Load .env and add src/ to sys.path
import sys
from dotenv import load_dotenv
from pathlib import Path

# Assumes the working directory is the project root itself or a sibling folder
project_root = Path.cwd().parent / "Shipping-a-Data-Product"
load_dotenv(dotenv_path=project_root / ".env")
sys.path.append(str(project_root / "src"))


In [3]:
# 1️⃣ Setup imports & paths
import os
from pathlib import Path
import psycopg2
import pandas as pd
from dotenv import load_dotenv

from yolo_enrich import enrich_images, get_db_conn

# Load environment variables; load_dotenv() returns False when no variables
# were loaded, e.g. because the .env path does not exist
load_dotenv(dotenv_path=project_root / ".env")


False

In [5]:
# 0️⃣ Move up one directory so relative paths line up
import os
from pathlib import Path

# In the recorded run this lands in notebooks/, not yet the project root;
# adjust this if your notebook lives somewhere else
project_root = Path("..").resolve()

os.chdir(project_root)
print("Working directory is now:", Path.cwd())


Working directory is now: C:\Users\ABC\Desktop\10Acadamy\week_7\Shipping-a-Data-Product\notebooks


In [6]:
# Cell 1: force CWD to the project root, one level up from notebooks/
import os
from pathlib import Path

# If this notebook lives in .../Shipping-a-Data-Product/notebooks,
# then its parent (..) is the project root.
project_root = Path.cwd().parent

# Change into the project root
os.chdir(project_root)

print("Working directory is now:", Path.cwd())


Working directory is now: C:\Users\ABC\Desktop\10Acadamy\week_7\Shipping-a-Data-Product
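
The cells above move up one directory each time they run, so re-running them changes the result. A possible alternative is to search upward for a marker folder. The sketch below is an assumption-based illustration: `find_project_root` is a hypothetical helper, and it assumes the project root is the nearest ancestor that contains a `src` directory.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Optional[Path] = None, marker: str = "src") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is an assumed convention (a `src` folder at the project root).
    """
    current = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (current, *current.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found above {current}")
```

With this helper, `os.chdir(find_project_root())` gives the same working directory no matter how deep the notebook sits or how often the cell is re-run.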


In [7]:
# 2️⃣ Run the enrichment for today's date (or pass a string YYYY-MM-DD)
enrich_images()


image 1/1 C:\Users\ABC\Desktop\10Acadamy\week_7\Shipping-a-Data-Product\data\raw\telegram_images\2025-07-13\CheMed123\33.jpg: 544x640 2 oranges, 1 book, 186.9ms
Speed: 8.0ms preprocess, 186.9ms inference, 2.7ms postprocess per image at shape (1, 3, 544, 640)

image 1/1 C:\Users\ABC\Desktop\10Acadamy\week_7\Shipping-a-Data-Product\data\raw\telegram_images\2025-07-13\CheMed123\34.jpg: 544x640 (no detections), 116.5ms
Speed: 6.6ms preprocess, 116.5ms inference, 1.2ms postprocess per image at shape (1, 3, 544, 640)

image 1/1 C:\Users\ABC\Desktop\10Acadamy\week_7\Shipping-a-Data-Product\data\raw\telegram_images\2025-07-13\CheMed123\38.jpg: 544x640 (no detections), 106.3ms
Speed: 6.9ms preprocess, 106.3ms inference, 1.0ms postprocess per image at shape (1, 3, 544, 640)

image 1/1 C:\Users\ABC\Desktop\10Acadamy\week_7\Shipping-a-Data-Product\data\raw\telegram_images\2025-07-13\CheMed123\39.jpg: 320x640 8 mouses, 97.0ms
Speed: 2.1ms preprocess, 97.0ms inference, 1.7ms postprocess per image a
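
Each detection printed above has to become one row with `message_id`, `detected_object_class`, and `confidence_score`. The sketch below shows one way to flatten results into such rows. The function name `detections_to_rows` is hypothetical, and the code only assumes that each result exposes `names` (class id to label) and `boxes.cls` / `boxes.conf`, as Ultralytics result objects do. The stand-in object and its scores are invented so the example runs without the model installed.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def detections_to_rows(message_id: int, results: Iterable[Any]) -> List[Dict[str, Any]]:
    """Flatten detection results into one dict per detected object."""
    rows: List[Dict[str, Any]] = []
    for result in results:
        # boxes.cls and boxes.conf are parallel sequences, one entry per box;
        # an image with no detections simply contributes no rows.
        for cls_id, conf in zip(result.boxes.cls, result.boxes.conf):
            rows.append({
                "message_id": message_id,
                "detected_object_class": result.names[int(cls_id)],
                "confidence_score": float(conf),
            })
    return rows


# Stand-in for a real result (placeholder values, not actual model output).
fake_result = SimpleNamespace(
    names={0: "person", 1: "bicycle"},
    boxes=SimpleNamespace(cls=[0.0, 1.0], conf=[0.9, 0.4]),
)
rows = detections_to_rows(33, [fake_result])
```

With the real package, the results would come from something like `YOLO("yolov8n.pt")(str(image_path))`; which weights file the project actually uses is not shown in this document.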

In [8]:
# 3️⃣ Query the database to see what was inserted
conn = get_db_conn()
df = pd.read_sql(
    """
    SELECT *
    FROM analytics.fct_image_detections
    ORDER BY detection_time DESC
    LIMIT 20
    """,
    conn
)
conn.close()

df


|   | id | message_id | object_class | confidence_score | detection_time |
|---|----|------------|--------------|------------------|----------------|
| 0 | 246 | 40 | person | 0.719146 | 2025-07-13 11:53:37.734685+00:00 |
| 1 | 253 | 40 | person | 0.436991 | 2025-07-13 11:53:37.734685+00:00 |
| 2 | 241 | 39 | mouse | 0.623763 | 2025-07-13 11:53:37.734685+00:00 |
| 3 | 245 | 39 | mouse | 0.414451 | 2025-07-13 11:53:37.734685+00:00 |
| 4 | 249 | 40 | person | 0.597804 | 2025-07-13 11:53:37.734685+00:00 |
| 5 | 252 | 40 | person | 0.452076 | 2025-07-13 11:53:37.734685+00:00 |
| 6 | 237 | 33 | book | 0.271593 | 2025-07-13 11:53:37.734685+00:00 |
| 7 | 240 | 39 | mouse | 0.892961 | 2025-07-13 11:53:37.734685+00:00 |
| 8 | 243 | 39 | mouse | 0.554678 | 2025-07-13 11:53:37.734685+00:00 |
| 9 | 244 | 39 | mouse | 0.526148 | 2025-07-13 11:53:37.734685+00:00 |
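
The checks that the dbt tests are meant to enforce (no null keys, confidence within `0.0` to `1.0`) can also be run on the queried rows directly. The following is a rough sketch in plain Python; `find_invalid_rows` is a hypothetical helper, and the default column name `object_class` is taken from the query output above rather than from the dbt model definition.

```python
from typing import Any, Dict, Iterable, List


def _is_missing(value: Any) -> bool:
    """True for None and for float NaN (how pandas represents missing numbers)."""
    return value is None or (isinstance(value, float) and value != value)


def find_invalid_rows(
    rows: Iterable[Dict[str, Any]], class_col: str = "object_class"
) -> List[str]:
    """Return a human-readable message for every rule a row violates."""
    problems: List[str] = []
    for i, row in enumerate(rows):
        if _is_missing(row.get("message_id")):
            problems.append(f"row {i}: message_id is null")
        if _is_missing(row.get(class_col)):
            problems.append(f"row {i}: {class_col} is null")
        score = row.get("confidence_score")
        if _is_missing(score) or not 0.0 <= score <= 1.0:
            problems.append(f"row {i}: confidence_score {score!r} outside [0.0, 1.0]")
    return problems
```

For the DataFrame above, `find_invalid_rows(df.to_dict("records"))` should return an empty list when every row passes.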
