# CS171 Project
## Data Preprocessing
### Nothing in this notebook is finalized or in its final format
**Author** - Helena Thiessen

**Date** - Nov 20/25

### Motivation
This notebook prepares data for training an R-CNN to identify recyclable objects among waste, with the goal of aiding sorting at recycling facilities. It prepares the WARP dataset downloaded from Kaggle for machine learning and constructs a validation set from personally sourced images.

## Import Necessary Libraries

In [29]:
import json
import os
import traceback
from pathlib import Path

import torch
import torchvision.transforms.functional as F
from PIL import Image
from pycocotools.coco import COCO
from torch.utils.data import DataLoader, Dataset
from torchvision import datasets, transforms

In [2]:
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print("Using device:", device)

Using device: cuda


## Pre-Processing for WARP Dataset

In [28]:
class RCNN_Warp_Data(Dataset):
    def __init__(self, root_path, split, transforms=None):

        ## Paths for reading the data;
        ## update these if the directory structure or file names change
        self.root_path = Path(root_path)   # accept either a str or a Path
        self.split_path = split
        self.class_path = "classes.txt"
        self.transforms = transforms

        self.class_names = self.__get_classes()  
        self.image_paths, self.label_paths = self.__get_paths()            

    def __get_paths(self):
        img_dir  = self.root_path / self.split_path / "images"
        label_dir = self.root_path / self.split_path / "labels"
        
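        # Keep only image files (extension list is an assumption), sorted so indices are reproducible
        img_exts = {".jpg", ".jpeg", ".png"}
        img_paths = sorted(p for p in img_dir.iterdir() if p.suffix.lower() in img_exts)
        label_paths = [label_dir / (p.stem + ".txt") for p in img_paths]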

        return img_paths, label_paths

    
    def __get_classes(self):
        ##Get the classes and properly associate them to the labels
        classes = []

        try:
            with open(os.path.join(self.root_path, self.class_path), 'r') as class_file:
                for line in class_file:
                    classes.append(line.strip())
        except FileNotFoundError:
            print("Error: The class file was not found.")
        except Exception as e:
            print(f"[CLASSES ERROR] {self.class_path}: {e}")
            traceback.print_exc()

        # Return the list so that self.class_names is not None
        return classes

    def __read_labels(self, file_path, W, H):
        boxes = []
        labels = []

        try:
            # file_path is already the full path built in __get_paths
            with open(file_path, 'r') as label_file:
                for ln, line in enumerate(label_file, start=1):
                    s = line.strip()

                    if not s or s.startswith("#"):            # <- skip blank/comment
                        continue
                    parts = s.split()
                    if len(parts) < 5:                         # <- malformed line
                        print(f"[BAD LINE] {file_path}:{ln} -> {s!r}")
                        continue

                    item_class = int(parts[0])
                    cx, cy, w, h = map(float, parts[1:5])
                    x_c, y_c = cx * W, cy * H
                    bw, bh   = w * W,  h * H
                    x1 = max(0.0, x_c - bw/2)
                    y1 = max(0.0, y_c - bh/2)
                    x2 = min(float(W), x_c + bw/2)
                    y2 = min(float(H), y_c + bh/2)
                    boxes.append([x1, y1, x2, y2])
                    labels.append(item_class + 1)
                    
        except FileNotFoundError:
            print(f"[LABEL MISSING] {file_path}")
        except Exception as e:
            print(f"[READ LABEL ERROR] {file_path}: {e}")
            traceback.print_exc()

        return boxes, labels

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, index):
        image_path = self.image_paths[index]
        file_path = self.label_paths[index]
        img = Image.open(image_path).convert("RGB")   # force 3 channels for grayscale/RGBA files
        W, H = img.size
        boxes, labels = self.__read_labels(file_path, W, H)

        # reshape keeps the (N, 4) shape even when an image has no labels (N = 0)
        boxes  = torch.as_tensor(boxes,  dtype=torch.float32).reshape(-1, 4)
        labels = torch.as_tensor(labels, dtype=torch.int64)  
        area   = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) 

        target = {
            "boxes": boxes,
            "labels": labels,
            "image_id": torch.tensor([index]),
            "area": area,
            "iscrowd": torch.zeros((len(boxes),), dtype=torch.int64),
        }

        img = F.to_tensor(img)

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target
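
The box arithmetic in `__read_labels` can be checked in isolation. The following is a minimal sketch, assuming the label files store normalized `class cx cy w h` rows (which is what the conversion above implies). `yolo_to_xyxy` is a hypothetical helper name used only for illustration; it is not part of the dataset class.

```python
def yolo_to_xyxy(cx, cy, w, h, img_w, img_h):
    """Convert a normalized centre-format box to pixel corner format.

    The result is clipped to the image bounds, mirroring __read_labels.
    """
    x_c, y_c = cx * img_w, cy * img_h      # box centre in pixels
    bw, bh = w * img_w, h * img_h          # box size in pixels
    x1 = max(0.0, x_c - bw / 2)
    y1 = max(0.0, y_c - bh / 2)
    x2 = min(float(img_w), x_c + bw / 2)
    y2 = min(float(img_h), y_c + bh / 2)
    return [x1, y1, x2, y2]


# A box centred in a 640x480 image, a quarter of the width and half the height
centred = yolo_to_xyxy(0.5, 0.5, 0.25, 0.5, 640, 480)
print(centred)   # [240.0, 120.0, 400.0, 360.0]

# A box that hangs over the left edge is clipped to x1 = 0
clipped = yolo_to_xyxy(0.0625, 0.5, 0.25, 0.5, 640, 480)
print(clipped)   # [0.0, 120.0, 120.0, 360.0]
```

Running a few hand-computed cases like these before training is a cheap way to confirm that the labels line up with the images.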

    

In [4]:
    def __create_super_class_dict(self):
        ##Also create a super class dict so we can analyze at different levels/ combine categories

        self.super_class_dict = {
            "Full Plastic Bottle": [],
            "Crushed Plastic Bottle": [],
            "Glass Bottle": [],
            "Detergent Container": [],
            "Cardboard Carton": [],
            "Metal Can": [],
            "Other Plastic Jug": [],
        }

        for i in self.class_dict:
            if "full" in self.class_dict[i]:
                self.super_class_dict["Full Plastic Bottle"].append(i)
            elif "bottle" in self.class_dict[i]:
                self.super_class_dict["Crushed Plastic Bottle"].append(i)
            elif "glass" in self.class_dict[i]:
                self.super_class_dict["Glass Bottle"].append(i)
            elif "detergent" in self.class_dict[i]:
                self.super_class_dict["Detergent Container"].append(i)
            elif "cardboard" in self.class_dict[i]:
                self.super_class_dict["Cardboard Carton"].append(i)
            elif "canister" in self.class_dict[i]:
                self.super_class_dict["Other Plastic Jug"].append(i)
            elif "can" in self.class_dict[i]:
                self.super_class_dict["Metal Can"].append(i)
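
The keyword matching above depends on the order of the checks: "full" has to be tested before "bottle", and "canister" before "can". As a sketch of the same idea in standalone form, the rules can be written as an ordered list where the first match wins. `RULES` and `group_super_classes` are hypothetical names, and the sample class names are illustrative placeholders rather than the verified contents of `classes.txt`.

```python
# Ordered (keyword, super class) rules; the first matching keyword wins.
RULES = [
    ("full", "Full Plastic Bottle"),
    ("bottle", "Crushed Plastic Bottle"),
    ("glass", "Glass Bottle"),
    ("detergent", "Detergent Container"),
    ("cardboard", "Cardboard Carton"),
    ("canister", "Other Plastic Jug"),   # must come before "can"
    ("can", "Metal Can"),
]


def group_super_classes(class_dict):
    """Group label ids by super class using the ordered keyword rules."""
    groups = {super_name: [] for _, super_name in RULES}
    for class_id, class_name in class_dict.items():
        for keyword, super_name in RULES:
            if keyword in class_name:
                groups[super_name].append(class_id)
                break   # stop at the first rule that matches
    return groups


# Illustrative class names only (assumed, not read from the dataset)
sample = {
    1: "bottle-blue",
    2: "bottle-blue-full",
    3: "glass-green",
    4: "detergent-white",
    5: "milk-cardboard",
    6: "canister",
    7: "cans",
}
groups = group_super_classes(sample)
print(groups["Other Plastic Jug"], groups["Metal Can"])   # [6] [7]
```

Keeping the rules in one list makes the ordering dependency explicit, which is easy to lose in a long `if`/`elif` chain.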

In [8]:
    def __create_class_dict(self):
        ##Get the classes and properly associate them to the labels
        self.class_dict = {}
        i = 1

        try:
            with open(os.path.join(self.root_path, self.class_path), 'r') as class_file:
                for line in class_file:
                    self.class_dict[i] = line.strip()
                    i += 1
        except FileNotFoundError:
            print("Error: The file was not found.")
        except Exception:
            print("An error occurred")

## We might not like WARP, let's try TACO

In [45]:
class TACODataset(Dataset):
    def __init__(self, img_dir, annotation_file, transform=None):
        self.img_dir = img_dir
        self.transform = transform

        with open(annotation_file, 'r') as f:
            self.annotations = json.load(f)

        self.images = self.annotations['images']
        self.categories = self.annotations['categories']
        self.annotations_by_image_id = self._group_annotations_by_image()

    def _group_annotations_by_image(self):
        annotations_map = {}
        for ann in self.annotations['annotations']:
            image_id = ann['image_id']
            if image_id not in annotations_map:
                annotations_map[image_id] = []
            annotations_map[image_id].append(ann)
        return annotations_map

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_info = self.images[idx]
        img_id = img_info['id']
        img_filename = img_info['file_name']
        img_path = os.path.join(self.img_dir, img_filename)

        image = Image.open(img_path).convert('RGB')

        # Get annotations for this image
        anns = self.annotations_by_image_id.get(img_id, [])
        boxes = []
        labels = []
        for ann in anns:
            # COCO bounding box format is [x_min, y_min, width, height]
            x_min, y_min, width, height = ann['bbox']
            x_max = x_min + width
            y_max = y_min + height
            boxes.append([x_min, y_min, x_max, y_max])
            labels.append(ann['category_id'])

        # reshape keeps the (N, 4) shape even when an image has no annotations
        boxes = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
        labels = torch.as_tensor(labels, dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["image_id"] = torch.tensor([img_id])

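        if self.transform:
            image = self.transform(image)

        # Convert to a tensor only if the transform has not already done so
        if not isinstance(image, torch.Tensor):
            image = F.to_tensor(image)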

        return image, target
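
The COCO-to-corner conversion in `__getitem__` can also be tested without loading any images. The sketch below is an assumption-laden variant, not the class's actual behaviour: it additionally drops boxes with non-positive width or height, because torchvision's Faster R-CNN checks during training that every target box has positive width and height. Whether TACO actually contains such boxes has not been checked here. `coco_to_xyxy` and the sample annotations are hypothetical.

```python
def coco_to_xyxy(anns):
    """Convert COCO [x_min, y_min, width, height] boxes to [x1, y1, x2, y2].

    Boxes with non-positive width or height are skipped.
    """
    boxes, labels = [], []
    for ann in anns:
        x_min, y_min, width, height = ann["bbox"]
        if width <= 0 or height <= 0:
            continue   # degenerate box, skip it
        boxes.append([x_min, y_min, x_min + width, y_min + height])
        labels.append(ann["category_id"])
    return boxes, labels


# Made-up annotations: one valid box and one with zero width
sample_anns = [
    {"bbox": [10.0, 20.0, 30.0, 40.0], "category_id": 5},
    {"bbox": [50.0, 60.0, 0.0, 15.0], "category_id": 7},
]
boxes, labels = coco_to_xyxy(sample_anns)
print(boxes, labels)   # [[10.0, 20.0, 40.0, 60.0]] [5]
```

If this filtering were adopted in `TACODataset`, the loop over `anns` could call the helper in place of the inline conversion.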

## Pre-processing for Validation Dataset

In [6]:
##Additional paths that are needed
val_folder_path = "val"

### Description of pre-processing performed by hand
The following pre-processing cannot be shown directly in this notebook because it was done by hand rather than through code. To illustrate these steps, I have included images of the process along with descriptions of what had to be done.

**Obtaining Images**

The images in the WARP dataset follow a specific format, with recyclables interspersed among trash in a sorting facility. To test the model trained on that data, I needed images that were similar enough for the model to handle and that contained the types of recycling present in the WARP classes. To assess the model's ability to generalize, I also wanted images that were formatted differently, to see whether it could correctly locate the recyclables in different situations.