# Data in the Wild! Wildlife Insights Camera Traps


The goal of this notebook is to download a sample dataset from Wildlife Insights, then organize and prepare it for a simple machine learning model.

In [None]:
import pandas as pd

annotations = pd.read_csv("/orange/ewhite/b.weinstein/CameraTraps/wildlife-insights_e658cfd5-62cf-4ef1-8c97-cb27cbfafb81_all-platform-data/images.csv")
annotations.head()

There are 13801 images


Unnamed: 0,project_id,deployment_id,image_id,sequence_id,filename,location,is_blank,identified_by,wi_taxon_id,class,...,individual_id,number_of_objects,individual_animal_notes,behavior,highlighted,markings,cv_confidence,license,fuzzed,deployment_fuzzed
0,2002681,235ae8b3-c867-47c3-bb14-014c693c5b46,5ee97c56-159b-4ad1-9120-a3c9aecd4ae8,,02070366.JPG,https://app.wildlifeinsights.org/download/2007...,,Jessica Pacheco,d62cd43d-a519-4cd5-aef2-b76ae4bd16c1,Mammalia,...,,1,,,False,,,CC-BY,True,False
1,2002681,af1868f3-d47d-4035-9d8a-21b7dc39ea5a,b20cb762-7b69-4404-bd73-58ee46be75e1,,12280095.JPG,https://app.wildlifeinsights.org/download/2007...,,Jessica Pacheco,d4a15eb4-1c47-4b90-8e3f-15fb51dc07e8,Mammalia,...,,1,,,False,,,CC-BY,True,False
2,2002681,f65b9a20-4e64-4ab9-b3fd-f717095f760f,6a03e7c0-ecf4-4daf-bfcb-4f9abefd86c8,,12300191.JPG,https://app.wildlifeinsights.org/download/2007...,,Jessica Pacheco,c5ae795b-300f-4cc7-99f7-e2c2d2810a87,Mammalia,...,,1,,,False,,,CC-BY,True,False
3,2002681,af1868f3-d47d-4035-9d8a-21b7dc39ea5a,fe7b1810-3494-4405-ad2c-9445fe578cd5,,12200074.JPG,https://app.wildlifeinsights.org/download/2007...,,Jessica Pacheco,d4a15eb4-1c47-4b90-8e3f-15fb51dc07e8,Mammalia,...,,1,,,False,,,CC-BY,True,False
4,2002681,f65b9a20-4e64-4ab9-b3fd-f717095f760f,ccc252ad-82e5-4375-8ea3-9829f4a6c845,,12240117.JPG,https://app.wildlifeinsights.org/download/2007...,,Jessica Pacheco,c5ae795b-300f-4cc7-99f7-e2c2d2810a87,Mammalia,...,,1,,,False,,,CC-BY,True,False


In [11]:
annotations["common_name"].value_counts()

common_name
Spectacled Bear               237
Western Mountain Coati        161
Northern Tiger Cat             43
Dwarf Red Brocket              42
Andean White-eared Opossum     41
Black Agouti                   26
Puma                           26
Southern Tamandua              17
Mammal                         12
Tayra                          11
Ocelot                          9
Red-tailed Squirrel             7
Sylvilagus Species              5
White-browed Guan               2
Common Opossum                  2
Spix's Guan                     1
No CV Result                    1
Carnivorous Mammal              1
Name: count, dtype: int64

In [None]:
import os
import requests
from tqdm import tqdm

# Download images
# Create the directory if it doesn't exist
image_dir = "wildlife_insights/"
os.makedirs(image_dir, exist_ok=True)

# Function to download images
def download_images(annotations, image_dir):
    image_list = annotations["filename"].tolist()
    for filename in tqdm(image_list, desc="Downloading images"):
        url = annotations.loc[annotations["filename"] == filename, "location"].values[0]
        file_path = os.path.join(image_dir, filename)
        if not os.path.exists(file_path):  # Avoid re-downloading
            try:
                response = requests.get(url, stream=True)
                if response.status_code == 200:
                    with open(file_path, "wb") as f:
                        for chunk in response.iter_content(1024):
                            f.write(chunk)
                else:
                    print(f"Failed to download {filename}: {response.status_code}")
            except Exception as e:
                print(f"Error downloading {filename}: {e}")

# Download every image listed in the annotations (the train/test split happens later)
download_images(annotations, image_dir)


Downloading images: 100%|██████████| 644/644 [00:58<00:00, 11.03it/s]
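
The download loop skips any file that already exists, so a file left empty by an interrupted download would not be retried on a second run. A minimal sketch of a check for that case follows. The helper name `find_missing_files` and the demo filenames are hypothetical and not part of the notebook.

```python
import os
import tempfile

def find_missing_files(filenames, image_dir):
    """Return the filenames that are absent or zero bytes long in image_dir."""
    missing = []
    for name in filenames:
        path = os.path.join(image_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing

# Hypothetical demo in a temporary directory: one good file, one empty, one absent
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "ok.JPG"), "wb") as f:
        f.write(b"\xff\xd8")
    open(os.path.join(demo_dir, "empty.JPG"), "wb").close()
    missing = find_missing_files(["ok.JPG", "empty.JPG", "absent.JPG"], demo_dir)

print(missing)
```

In the notebook this could be called as `find_missing_files(annotations["filename"].tolist(), image_dir)`. Note that it only catches empty files, not ones that were partially written.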


In [None]:
from PIL import Image

# Filter for an image of an ocelot
ocelot_image_filename = annotations[annotations["common_name"] == "Ocelot"]["filename"].iloc[0]

# __file__ is not defined inside a notebook, so build the path from the working directory
ocelot_image_path = os.path.join(os.getcwd(), image_dir, ocelot_image_filename)

# Display the image if it was downloaded
if os.path.exists(ocelot_image_path):
    image = Image.open(ocelot_image_path)
    image.show()
else:
    print("Ocelot image not found.")

## Key Terminology

In the PyTorch universe there are three key elements: datasets, dataloaders, and models.

Torch datasets are abstractions that handle data loading and preprocessing. They are typically subclasses of `torch.utils.data.Dataset` and define two main methods: `__len__` (returns the size of the dataset) and `__getitem__` (retrieves a data sample).

Dataloaders, implemented via `torch.utils.data.DataLoader`, provide an iterable over a dataset, enabling efficient batching, shuffling, and parallel data loading.

Models in PyTorch are defined as subclasses of `torch.nn.Module`. They encapsulate layers and define the forward pass of the network.

PyTorch Lightning modules integrate these components by organizing the training, validation, and testing logic. A `LightningModule` defines methods like `training_step`, `validation_step`, and `test_step`, and connects datasets, dataloaders, and models into a cohesive workflow. This abstraction simplifies training loops and enables seamless integration with hardware accelerators.
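
To make the first three pieces concrete, here is a minimal sketch that wires a dataset, a dataloader, and a model together on random tensors. `ToyDataset` and `TinyClassifier` are hypothetical names invented for this illustration, and the 3x8x8 "images" are synthetic stand-ins for camera trap photos.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical dataset of random 3x8x8 'images' with integer labels."""
    def __init__(self, n_samples=10, n_classes=3):
        generator = torch.Generator().manual_seed(0)
        self.images = torch.rand(n_samples, 3, 8, 8, generator=generator)
        self.labels = torch.randint(0, n_classes, (n_samples,), generator=generator)

    def __len__(self):
        # Size of the dataset
        return len(self.labels)

    def __getitem__(self, idx):
        # One (image, label) sample
        return self.images[idx], self.labels[idx]

class TinyClassifier(nn.Module):
    """Hypothetical model: flatten the image and apply one linear layer."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, n_classes))

    def forward(self, x):
        return self.net(x)

dataset = ToyDataset()
loader = DataLoader(dataset, batch_size=4, shuffle=False)  # groups samples into batches
model = TinyClassifier()

# Pull one batch from the loader and run the forward pass
images, labels = next(iter(loader))
logits = model(images)
print(images.shape, labels.shape, logits.shape)
```

The dataloader stacks four samples into a single tensor, and the model returns one score per class for each image in the batch.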

In [4]:
from sklearn.model_selection import train_test_split

# Train/test split
# Keep only classes with at least 5 images
class_counts = annotations["common_name"].value_counts()
valid_classes = class_counts[class_counts >= 5].index
filtered_annotations = annotations[annotations["common_name"].isin(valid_classes)]

# Split images into train and test sets
train_images = []
test_images = []

for common_name in valid_classes:
    class_images = filtered_annotations[filtered_annotations["common_name"] == common_name]["filename"].tolist()
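    # Hold out 5 test images per class; a class with exactly 5 images is skipped
    # because the split would leave nothing to train on
    if len(class_images) > 5: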
        train, test = train_test_split(class_images, test_size=5, random_state=42)
        train_images.extend(train)
        test_images.extend(test)

print("valid_classes: ", valid_classes)
print("Number of train images: ", len(train_images))
print("Number of test images: ", len(test_images))

valid_classes:  Index(['Spectacled Bear', 'Western Mountain Coati', 'Northern Tiger Cat',
       'Dwarf Red Brocket', 'Andean White-eared Opossum', 'Black Agouti',
       'Puma', 'Southern Tamandua', 'Mammal', 'Tayra', 'Ocelot',
       'Red-tailed Squirrel', 'Sylvilagus Species'],
      dtype='object', name='common_name')
Number of train images:  572
Number of test images:  60
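
A quick sanity check on a split like this is to confirm that no filename lands in both sets and to count images per class on each side. The sketch below uses only the standard library. The helper name `summarize_split` and the toy filenames are hypothetical.

```python
from collections import Counter

def summarize_split(train_files, test_files, label_of):
    """Return per-class counts for each split and any filenames shared by both."""
    overlap = sorted(set(train_files) & set(test_files))
    train_counts = Counter(label_of[name] for name in train_files)
    test_counts = Counter(label_of[name] for name in test_files)
    return train_counts, test_counts, overlap

# Hypothetical toy data standing in for the notebook's filenames and labels
label_of = {"a.JPG": "Puma", "b.JPG": "Puma", "c.JPG": "Ocelot", "d.JPG": "Ocelot"}
train_counts, test_counts, overlap = summarize_split(
    ["a.JPG", "c.JPG"], ["b.JPG", "d.JPG"], label_of
)
print(train_counts, test_counts, overlap)
```

In the notebook, `label_of` could be built with `filtered_annotations.drop_duplicates("filename").set_index("filename")["common_name"].to_dict()`, assuming each filename carries a single label.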


In [5]:
import os

import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class WildlifeDataset(Dataset):
    def __init__(self, image_paths, annotations, image_dir, transform=None):
        self.image_paths = image_paths
        self.annotations = annotations
        self.image_dir = image_dir
        self.transform = transform
        # CrossEntropyLoss expects integer class indices, so map each name to an index
        self.classes = sorted(annotations["common_name"].unique())
        self.class_to_idx = {name: idx for idx, name in enumerate(self.classes)}

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        filename = self.image_paths[idx]
        common_name = self.annotations.loc[self.annotations['filename'] == filename, 'common_name'].values[0]
        label = self.class_to_idx[common_name]

        # The lists hold bare filenames, so join them with the download directory
        image = Image.open(os.path.join(self.image_dir, filename)).convert("RGB")

        # Apply transformations if provided
        if self.transform:
            image = self.transform(image)

        return image, label

# Resize to a common shape and convert to tensors so images can be stacked into batches
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

train_dataset = WildlifeDataset(train_images, filtered_annotations, image_dir, transform=transform)
test_dataset = WildlifeDataset(test_images, filtered_annotations, image_dir, transform=transform)

In [6]:
from torch.utils.data import DataLoader

# Create DataLoaders for train and test datasets
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Example: iterate through the train_loader and inspect the first batch
for images, labels in train_loader:
    print(f"Batch size: {len(images)}")
    break

## Key Terminology of a PyTorch Lightning Module

### Key Terminology
1. **Datasets**: Abstractions that handle data loading and preprocessing. They are subclasses of `torch.utils.data.Dataset` and define two main methods: `__len__` (returns the size of the dataset) and `__getitem__` (retrieves a data sample).

2. **Dataloaders**: Implemented via `torch.utils.data.DataLoader`, they provide an iterable over a dataset, enabling efficient batching, shuffling, and parallel loading.

3. **Models**: Defined as subclasses of `torch.nn.Module`. They encapsulate layers and define the forward pass of the network.

4. **LightningModule**: A PyTorch Lightning module that organizes the training, validation, and testing logic. It defines methods such as `training_step`, `validation_step`, and `test_step`, connecting datasets, dataloaders, and models into a cohesive workflow. This abstraction simplifies training loops and enables seamless integration with hardware accelerators.



In [None]:
import pytorch_lightning as pl
from torch import nn

import torchvision.models as models
import torch.optim as optim

class ResNetClassifier(pl.LightningModule):
    def __init__(self, num_classes):
        super().__init__()
        # Load a ResNet-18 pre-trained on ImageNet (`weights` replaces the deprecated `pretrained=True`)
        self.model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Replace the final fully connected layer to match the number of classes
        self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        outputs = self(images)
        loss = self.criterion(outputs, labels)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

# Instantiate the model
num_classes = len(valid_classes)
model = ResNetClassifier(num_classes)

# Create a PyTorch Lightning trainer
trainer = pl.Trainer(max_epochs=1)

# Fit the model
trainer.fit(model, train_loader)

KeyboardInterrupt: 
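
The module above only logs the training loss. To make use of `test_loader`, one option is a `test_step` that reports top-1 accuracy. The sketch below shows only the accuracy calculation on made-up logits. The helper name `top1_accuracy` is hypothetical, and calling it from a `test_step` is an assumption rather than something the notebook does.

```python
import torch

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    predictions = logits.argmax(dim=1)
    return (predictions == labels).float().mean().item()

# Made-up scores for 4 samples and 3 classes
logits = torch.tensor([
    [2.0, 0.1, 0.3],
    [0.2, 0.1, 1.5],
    [0.9, 0.8, 0.1],
    [0.1, 3.0, 0.2],
])
labels = torch.tensor([0, 2, 1, 1])

accuracy = top1_accuracy(logits, labels)
print(accuracy)
```

Inside a `LightningModule`, the same calculation could be applied to `self(images)` and passed to `self.log("test_acc", ...)`.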