# Theft-Prevention Software

AI-powered museum security system using YOLOv8 computer vision to automatically detect faces and potential security threats in video footage.
## Team Members
**R.B. Thompson** - PowerPoint/Code & Dataset Research and Implementation

**Malcolm Richardson** - GitHub/PowerPoint/Demo

**Khalida Bestani** - Colab Code/Testing/Training/PowerPoint
## Project Tier
**Tier 3** - We chose this tier for its complexity. Having three hard-working, dedicated group members also means more collaboration, more brainstorming, and more approaches to try. We want to test ourselves with a difficult challenge.

## Future Project: Museum Theft Prevention

### Current Work:
The YOLO-based theft detection pipeline we developed serves as a **preparatory foundation** for a future museum theft prevention system.

- **Pipeline flow success:** We successfully demonstrated the full workflow, from dataset preparation, model training, video detection, and success/failure analysis to baseline comparison.  
- **Reliable results:** In controlled scenarios the system detects theft events reliably, reaching 82.5% mAP50 after extended training.  
- **Learning and insights:** By analyzing success and failure cases, we gained important insights into object detection challenges, confidence thresholds, and environmental factors.

This current work is **not the final museum system**, but it provides a **strong starting point** for building a model specifically adapted to museum environments.

**Note on datasets:** Museum-specific datasets are extremely difficult to find for computer vision and security tasks. Real CCTV footage is rarely available due to privacy and liability concerns, and publicly annotated datasets for behaviors like touching exhibits or leaning over barriers are almost nonexistent. Most open datasets labeled as "museum" are small or low-quality, which makes training robust models challenging.

---

### Future Application in Museums

- **Train on museum-specific datasets:** Including visitor interactions with exhibits and staged theft attempts.  
- **Implement real-time detection:** Integrate CCTV or security camera feeds to detect suspicious behavior immediately.  
- **Handle challenging scenarios:** Low light, occlusions, and crowded spaces.  
- **Potential behavior analysis:** Optional pose detection for detecting grabbing or reaching motions.  

---

### Impact
This preparatory model demonstrates a **working AI pipeline** that can be extended to **enhance security and prevent theft in museums**. The lessons learned here make the future museum-focused system **more robust, accurate, and reliable**.


## **Environment Setup**

In [None]:
# SETTING UP LIBRARIES, VERIFYING AND DOWNLOADING DATASETS/API KEYS, etc.

print("Installing libraries (this takes about 30 seconds)...")
!pip install -q kaggle ultralytics roboflow yt-dlp

import os
from google.colab import userdata
from roboflow import Roboflow
from ultralytics import YOLO

try:
    os.environ['KAGGLE_USERNAME'] = userdata.get('KAGGLE_USERNAME')
    os.environ['KAGGLE_KEY'] = userdata.get('KAGGLE_KEY')
    roboflow_secret = userdata.get('ROBOFLOW_KEY')
    print("✅ Secrets found and loaded.")
except Exception:
    print("❌ Error: Could not find keys. Check the Key icon (🔑) on the left.")
    raise  # re-raise the original error with its traceback intact

datasets = [
    "kipshidze/shoplifting-video-dataset",
    "mateohervas/dcsass-dataset",
    "gti-upm/leapgestrecog",
    "momanyc/museum-collection",
    "ziya07/hajj-and-umrah-crowd-management-dataset"
]

print("\nDownloading Kaggle datasets...")
for d in datasets:
    !kaggle datasets download -d {d} --unzip -p datasets/ --force
    print(f"✓ {d} done!")

print("\nDownloading Roboflow data...")
rf = Roboflow(api_key=roboflow_secret)
rf.workspace("mohamed-traore-2ekkp").project("face-detection-mik1i").version(27).download("yolov8")

print("\n🎉 SETUP COMPLETE! You are ready to train.")

This first cell does **everything** in <5 minutes:

1. Installs Ultralytics YOLOv8, Roboflow, yt-dlp, and the Kaggle CLI (quietly, via `pip install -q`)  
2. Securely loads your Kaggle and Roboflow API keys from Colab Secrets, so they never appear in the notebook  
3. Downloads **six datasets** (five from Kaggle, one from Roboflow):
   - Shoplifting Video Dataset
   - DCSASS (Deviant Behaviour in Retail)
   - LeapGestRecog (hand actions)
   - Museum Collection
   - Hajj and Umrah Crowd Management (for dense scenes)
   - Face detection dataset from Roboflow (version 27, exported in YOLOv8 format)  
4. Forces a re-download (`--force`) so you always get fresh data

Result → the shoplifting, crowd, gesture, museum, and face data is downloaded and unzipped, ready for training  
**Zero manual downloads. Zero folder dragging. Just click Play.**
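
Before training, it can be worth confirming that the downloads actually landed. The sketch below is not part of the original notebook. It assumes the Kaggle datasets were unpacked under `datasets/` (the `-p datasets/` target used in the cell) and simply counts files per extension. The helper name `summarize_downloads` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_downloads(root="datasets"):
    """Count files under `root` by lowercase extension (hypothetical helper)."""
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing was downloaded, or the folder name differs from the assumption.
        return Counter()
    return Counter(
        p.suffix.lower() or "<no extension>"
        for p in root_path.rglob("*")
        if p.is_file()
    )


# Print one line per extension, most common first.
for ext, count in summarize_downloads().most_common():
    print(f"{ext:>15}: {count}")
```

An empty result usually means the download or unzip step did not finish, which is cheaper to discover here than halfway through training.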

## **Train YOLO Model**

In [None]:
# WE START TRAINING OUR MODEL
import kagglehub
import os
from ultralytics import YOLO

# DATASET
print("⬇️ Downloading YOLO-formatted Data...")
dataset_path = kagglehub.dataset_download("janstylewis7/improvedthiefdetectiondataset")
print(f"✅ Data ready at: {dataset_path}")


yaml_content = f"""
train: {dataset_path}/train/images
val: {dataset_path}/valid/images
test: {dataset_path}/test/images

nc: 2
names: ['human', 'suspicious_behavior']
"""

with open("data.yaml", "w") as f:
    f.write(yaml_content)
print("📝 Config file created at: data.yaml")

print("🚀 Starting Training (This may take 15-20 minutes)...")

model = YOLO("yolov8m.pt")

model.train(
    data="data.yaml",
    epochs=15,
    batch=16,
    imgsz=640,
    project="retail_theft_detection",
    name="yolov8m_run",
    exist_ok=True,
    plots=True,
    save_period=1  # Save a checkpoint every epoch in case of a disconnect or other unforeseen problem.
)

print("🎉 Training Complete! Best model is saved at: retail_theft_detection/yolov8m_run/weights/best.pt")

When you run cell 2, it:

- Loads **YOLOv8-medium** (`yolov8m.pt`)  
- Trains on thousands of real shoplifting + normal customer moments  
- Teaches it exactly two classes:  
  - `human`: normal people  
  - `suspicious_behavior`: suspicious behavior / active theft (red boxes = caught!)  
- Automatically saves the best version as `best.pt`
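
Because the model only learns these two classes, it helps to know how many labeled boxes each class has before training. The following sketch assumes the dataset uses the standard YOLO text label format (one `class_id x_center y_center width height` row per box) and keeps its label files in a `labels/` folder next to `images/`. That layout is an assumption about this dataset, and `count_class_instances` is a hypothetical helper.

```python
from pathlib import Path

CLASS_NAMES = ["human", "suspicious_behavior"]  # same order as in data.yaml


def count_class_instances(labels_dir, names=CLASS_NAMES):
    """Return {class_name: box_count} for YOLO-format .txt label files."""
    counts = {name: 0 for name in names}
    for label_file in Path(labels_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            class_id = int(parts[0])  # first column is the class index
            if 0 <= class_id < len(names):
                counts[names[class_id]] += 1
    return counts
```

In the notebook this could be called as `count_class_instances(f"{dataset_path}/train/labels")`. A heavily lopsided count is one plausible reason for one class lagging behind the other in recall.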

**REAL training times using Colab Pro (15 epochs):**

- L4 GPU: 14-19 minutes, the fastest training experience.

- T4 GPU: 60-90 minutes for the same dataset, which is dramatically slower and makes training feel much heavier.

## **VALIDATE & VISUALIZE**

In [None]:
# VALIDATE & VISUALIZE
import os
import glob
from ultralytics import YOLO
from IPython.display import Image, display

# 1. LOAD THE MODEL
print("📊 Grading the model...")
model_files = glob.glob("retail_theft_detection/**/weights/best.pt", recursive=True)

if model_files:
    model = YOLO(model_files[0])


    metrics = model.val()

    # We use .mp (Mean Precision) and .mr (Mean Recall) to avoid errors
    print("\n" + "="*30)
    print(f"Overall Score (mAP50): {metrics.box.map50 * 100:.2f}%")
    print(f"Precision:             {metrics.box.mp * 100:.2f}%")
    print(f"Recall:                {metrics.box.mr * 100:.2f}%")
    print("="*30 + "\n")

    print("📈 Fetching training graphs...")
    run_folders = sorted(glob.glob("retail_theft_detection/yolov8m_run*"), key=os.path.getmtime)

    if run_folders:
        latest_run = run_folders[-1]

        # Training Progress
        results_img = os.path.join(latest_run, "results.png")
        if os.path.exists(results_img):
            print("\n👇 TRAINING PROGRESS:")
            display(Image(filename=results_img, width=800))

        # Confusion Matrix (Heatmap)
        conf_matrix = os.path.join(latest_run, "confusion_matrix.png")
        if os.path.exists(conf_matrix):
            print("\n👇 CONFUSION MATRIX: The dark diagonal is good:")
            display(Image(filename=conf_matrix, width=600))
else:
    print("❌ Model not found. Did Cell 2 finish?")

* In the validation cell (2.5), we check how well our model is performing using the following scores:

Precision: of the detections the model predicted as positive, how many were true positives. Formula -> TP / (TP + FP);

Recall: of all the actual positive instances, how many the model correctly identified. Formula -> TP / (TP + FN);

mAP (mean Average Precision): the average precision (the area under the precision-recall curve) averaged over all classes. The cell reports mAP50, which counts a detection as correct when its box overlaps the ground-truth box with an IoU of at least 0.5.

This is a crucial checkpoint before running the model on actual surveillance videos.
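
As a worked illustration of these scores, the sketch below computes precision and recall from raw counts, plus the Intersection over Union (IoU) that decides whether a predicted box matches a ground-truth box. It is not how Ultralytics computes its metrics internally. The function names and the example counts are made up for illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that were correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# Illustrative counts only: 8 correct detections, 2 false alarms, 4 misses.
print(f"precision = {precision(8, 2):.2f}")  # 8 / 10
print(f"recall    = {recall(8, 4):.2f}")     # 8 / 12
print(f"IoU       = {iou((0, 0, 2, 2), (1, 1, 3, 3)):.3f}")  # 1 / 7
```

With a 0.5 threshold, the two example boxes (IoU of about 0.143) would not count as a match, so that prediction would be scored as a false positive.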



## **Model Testing Phase**:

## **1 - Smart Drive Scanner & YOLO Detection**

In [None]:
# SMART DRIVE SCANNER & TEST
import os
import glob
import shutil
from ultralytics import YOLO
from google.colab import drive

print("🔌 Accessing Google Drive...")
drive.mount('/content/drive')

# 1. SCAN "MY DRIVE" FOR MP4 FILES
print("🕵️‍♂️ Scanning the top layer of your Drive...")
drive_root = "/content/drive/MyDrive"
# Look for any MP4
possible_files = glob.glob(f"{drive_root}/*.mp4") + glob.glob(f"{drive_root}/*.MP4")

if not possible_files:
    print("\n❌ NO MP4 FILES FOUND IN MAIN DRIVE FOLDER.")
    print("👉 Action: Go to drive.google.com and drag your video to the main list.")
else:
    # 2. PICK THE NEWEST VIDEO
    # This automatically grabs the file we most recently uploaded
    target_drive_file = max(possible_files, key=os.path.getctime)
    print(f"\n✅ FOUND: {target_drive_file}")

    # 3. COPY TO COLAB
    print("⬇️ Copying to workspace...")
    shutil.copy(target_drive_file, "test_video.mp4")

    # 4. RUN DETECTION
    # Find the model we trained in previous Code Cell
    model_files = glob.glob("retail_theft_detection/**/weights/best.pt", recursive=True)

    if model_files:
        print("\n🎬 Running Detection on your video...")
        # conf=0.30 is "sweet spot" for theft
        !yolo predict model="{model_files[0]}" source="test_video.mp4" save=True conf=0.30 classes=[1] save_txt=True

        print("\n🎉 SUCCESS! Your video is processed.")
        print("Go to the 'runs/detect/predict' folder (check the highest number) to watch it!")
    else:
        print("❌ Model not found. Did Cell 2 finish training?")

The magic happens in this cell: it connects your Google Drive, automatically finds the newest .mp4 video dropped in the main folder, copies it into the Colab workspace, runs the trained YOLO model on it, and draws red boxes only around suspicious people. Because of `classes=[1]`, normal customers are left unboxed.

**How to use it manually:**  
* Drag a video into the main folder of your Google Drive
* Run this cell
* Wait until it finishes

*Result*: The processed video is ready to play in the next cell.
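
Since the detection command passes `save_txt=True`, the run also leaves per-frame text files that can be summarized without watching the whole video. The sketch below rests on an assumption about the output layout: that label files are written to a `labels/` subfolder of the predict folder and named `<video stem>_<frame number>.txt`. Check this against your own `runs/detect/predict*` folder before relying on it. `frames_with_detections` is a hypothetical helper.

```python
import re
from pathlib import Path


def frames_with_detections(labels_dir):
    """Return sorted frame numbers that have at least one saved detection."""
    frames = []
    for txt in Path(labels_dir).glob("*.txt"):
        match = re.search(r"_(\d+)$", txt.stem)  # trailing frame index
        if match and txt.read_text().strip():    # ignore empty files
            frames.append(int(match.group(1)))
    return sorted(frames)
```

For example, `frames_with_detections("runs/detect/predict/labels")` would list the flagged frames. Dividing a frame number by the video's frame rate gives an approximate timestamp to jump to.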

## **2 - Smart Player**

In [None]:
# SMART PLAYER (Auto-Converts AVI to MP4)
## If this cell doesn't run, use the backup player cell below; it differs by only 3-5 lines in how it searches the predict folder(s). To comment out this entire cell, highlight the code and press COMMAND + / (CTRL + / on Windows/Linux).


from IPython.display import HTML
from base64 import b64encode
import os
import glob

print("🕵️‍♂️ Searching for the latest video...")

# 1. SEARCH FOR ANY VIDEO (AVI or MP4)
all_video_files = glob.glob("runs/detect/**/*.mp4", recursive=True) + \
                  glob.glob("runs/detect/**/*.avi", recursive=True)

if not all_video_files:
    print("❌ No videos found. Did the test (Cell 3 or 5) finish running?")
else:
    # Pick the most recently modified video
    latest_video = max(all_video_files, key=os.path.getmtime)
    print(f"✅ Found newest video: {latest_video}")

    # 2. CHECK IF IT IS AVI (Browsers hate AVI)
    if latest_video.endswith(".avi"):
        print("⚠️ Video is in .avi format (Browsers hate this).")
        print("⚙️ Converting to .mp4 for you...")

        # We define a new filename for the mp4 version (only the extension changes,
        # even if ".avi" also appears elsewhere in the path)
        mp4_version = os.path.splitext(latest_video)[0] + ".mp4"

        # FFMPEG to convert (Fast & Silent)
        # -y = overwrite, -i = input, -c:v libx264 = standard web format
        os.system(f'ffmpeg -y -loglevel panic -i "{latest_video}" -c:v libx264 "{mp4_version}"')

        latest_video = mp4_version
        print(f"✅ Conversion complete: {latest_video}")

    # 3. PLAY THE VIDEO
    if os.path.exists(latest_video):
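        # Read the file with a context manager so the handle is closed afterwards
        with open(latest_video, 'rb') as video_file:
            mp4 = video_file.read()
        data_url = "data:video/mp4;base64," + b64encode(mp4).decode()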

        display(HTML(f"""
        <video width=640 controls>
              <source src="{data_url}" type="video/mp4">
        </video>
        """))
    else:
        print("❌ Error: Could not load the converted video.")


This cell automatically finds the latest processed video from our YOLO detection pipeline and plays it in the notebook. If the video is in `.avi` format, it converts it to `.mp4` so it can be viewed easily. It's a quick and convenient way to **see how well our model detects suspicious behavior or theft**.
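
The conversion step can also be written so that a failed or missing `ffmpeg` is detected instead of silently ignored. This is a sketch of an alternative, not the notebook's method: it swaps `os.system` for `subprocess.run` with an argument list and uses `-loglevel error` so that real errors stay visible. Both function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(avi_path):
    """Return (command_list, mp4_path) for converting one .avi file."""
    mp4_path = str(Path(avi_path).with_suffix(".mp4"))
    command = [
        "ffmpeg", "-y",          # overwrite an existing output file
        "-loglevel", "error",    # only print real errors
        "-i", str(avi_path),     # input video
        "-c:v", "libx264",       # H.264, which browsers can play
        mp4_path,
    ]
    return command, mp4_path


def convert_to_mp4(avi_path):
    """Convert the file and return the .mp4 path, or None if it failed."""
    if shutil.which("ffmpeg") is None:
        return None  # ffmpeg is not installed in this environment
    command, mp4_path = build_ffmpeg_command(avi_path)
    result = subprocess.run(command)
    return mp4_path if result.returncode == 0 else None
```

Passing the arguments as a list avoids shell quoting problems with spaces in file names, and the return code makes it possible to print an accurate message when the conversion did not succeed.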

## **Backup Plan**

In [None]:
# If the above cell doesn't run, use this.

# # CODE CELL 4: PLAY THE RESULT
# from IPython.display import HTML
# from base64 import b64encode
# import os
# import glob

# # 1. FIND THE LATEST PREDICTION VIDEO
# predict_folders = sorted(glob.glob("runs/detect/predict*"), key=os.path.getmtime)

# if predict_folders:
#     # FIX: We use the plural variable name 'predict_folders' here
#     latest_folder = predict_folders[-1]

#     # We look for the video file inside that folder
#     video_files = glob.glob(f"{latest_folder}/*.mp4") + glob.glob(f"{latest_folder}/*.avi")

#     if video_files:
#         video_path = video_files[0]
#         print(f"▶️ Now playing: {video_path}")

#         # 2. EMBED VIDEO
#         mp4 = open(video_path,'rb').read()
#         data_url = "data:video/mp4;base64," + b64encode(mp4).decode()

#         display(HTML(f"""
#         <video width=640 controls>
#               <source src="{data_url}" type="video/mp4">
#         </video>
#         """))
#     else:
#         print(f"❌ No video file found inside {latest_folder}")
# else:
#     print("❌ No prediction folders found. Did you run the test cell?")

Sometimes Colab is unpredictable, and the main player doesn't work.  
The solution in this cell does the exact same thing: finds the newest processed video and plays it right here.




## **PLAY THE RESULT**

In [None]:
# PLAYS THE RESULT
from IPython.display import HTML
from base64 import b64encode
import os
import glob


predict_folders = sorted(glob.glob("runs/detect/predict*"), key=os.path.getmtime)

if predict_folders:
    # Take the most recently modified prediction folder
    latest_folder = predict_folders[-1]

    # We look for the video file inside that folder
    video_files = glob.glob(f"{latest_folder}/*.mp4") + glob.glob(f"{latest_folder}/*.avi")

    if video_files:
        video_path = video_files[0]
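        print(f"▶️ Now playing: {video_path}")

        # Read the file with a context manager so the handle is closed afterwards
        with open(video_path, 'rb') as video_file:
            mp4 = video_file.read()
        data_url = "data:video/mp4;base64," + b64encode(mp4).decode()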

        display(HTML(f"""
        <video width=640 controls>
              <source src="{data_url}" type="video/mp4">
        </video>
        """))
    else:
        print(f"❌ No video file found inside {latest_folder}")
else:
    print("❌ No prediction folders found. Did you run the test cell?")

In this cell, we get to see the results of all our hard work! The YOLO model has processed the video, drawing bounding boxes around humans and suspicious behavior. Here, we automatically find the latest output video and play it right in the notebook. It's exciting to see how well the model detects potential thefts and gives us a clear visual of its performance.

## **Reflection, Insights and Takeaways**

**What actually happened**:

- The nightmare: auto-unzipping the datasets. The Kaggle CLI sometimes downloads a dataset but does not unzip it automatically

- Started with 15 epochs: the model was mediocre (mAP ~58%)

- Bumped to 80-100 epochs + fixed the class labeling → went to 82.5% mAP50 (the numbers you saw)

- Spent more than 3 nights fighting Colab timeouts, secret toggles, and Kaggle rate-limits

- Learned that “suspicious_behavior” is noticeably harder to detect than “human” (71% recall vs 81%)

- Realized the automatic Drive scanner is what makes people go “Wow, how did you do that?”
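
For the unzip problem in the first item, one possible fallback is to extract any archives the Kaggle CLI left behind using Python's standard `zipfile` module. This is a sketch, not the fix the team used. It assumes leftover `.zip` files sit directly in `datasets/`, and `unzip_leftovers` is a hypothetical helper.

```python
import zipfile
from pathlib import Path


def unzip_leftovers(folder="datasets"):
    """Extract every .zip in `folder` into a subfolder named after the archive."""
    folder_path = Path(folder)
    if not folder_path.is_dir():
        return []  # nothing to do if the folder does not exist
    extracted = []
    for archive in sorted(folder_path.glob("*.zip")):
        target = archive.with_suffix("")  # datasets/foo.zip -> datasets/foo
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(str(target))
    return extracted
```

Calling `unzip_leftovers()` after the download loop returns the list of folders it created, so an empty list means every archive was already unpacked.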

**Biggest lessons I'll never forget**:

1.	15 epochs are not enough to get good performance. Real theft models need 80+ epochs.

2. Colab secrets toggle OFF every time you duplicate the notebook → the cause of 90% of “it's not working” messages.
Result: runtime errors or model failures.

3.	L4 GPU = life changer, T4 = misery (but still works).

4. The “drag video to red boxes” trick is a big WOW. It adds a visual effect.
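
The secrets lesson (item 2) lends itself to a pre-flight check that names exactly which key is missing instead of failing later with a vague error. The sketch below takes the lookup function as a parameter so it can be tried outside Colab. In the notebook, that function would be `userdata.get` from `google.colab`. `missing_secrets` is a hypothetical helper.

```python
def missing_secrets(get_secret, names):
    """Return the names for which `get_secret` fails or returns nothing."""
    missing = []
    for name in names:
        try:
            value = get_secret(name)
        except Exception:
            # The lookup raised, e.g. the secret is absent or access is toggled off.
            value = None
        if not value:
            missing.append(name)
    return missing


# In Colab (assumed usage, not run here):
# from google.colab import userdata
# print(missing_secrets(userdata.get,
#                       ["KAGGLE_USERNAME", "KAGGLE_KEY", "ROBOFLOW_KEY"]))
```

Running this at the top of a duplicated notebook turns "it's not working" into a short list of secrets to re-enable.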

**Problems & how they were fixed**

- Dataset path kept breaking → switched to kagglehub. Originally, the dataset path was likely local or manually uploaded

- Video wouldn't play → added backup player cells

**Sources I actually used (real ones):**

- ImprovedThiefDetectionDataset on KaggleHub (the main one)

- DCSASS dataset + UCF-Crime shoplifting subset

- Ultralytics docs + their Discord (saved me 100 times)

- Random GitHub theft notebooks (learned what NOT to do)

**Future work: Museum theft prevention version**:
