# Coral Reef Dataset Exploration
This notebook explores the coral reef dataset by jxwleong (https://www.kaggle.com/datasets/jxwleong/coral-reef-dataset) to evaluate its suitability for machine learning applications in marine conservation. 

This exploration aims to answer the following questions:
* What file formats are used for the data?
* How are annotations stored and what fields do they include?
* Do all images have the same dimensions?
* What are the color channels (RGB, grayscale, etc.)?
* What is the label distribution? 
* Is the dataset balanced? If not, how severe is the imbalance?
* Are there ambiguous labels?
* How many annotations exist per image?
* What fraction of pixels are annotated vs. unannotated?
* Are annotations dense (every pixel labeled) or sparse (only specific regions)?
* Are there images with no annotations?
* Is the dataset suitable for developing an AI system for coral monitoring?

In [None]:
# Correct working directory.
# This is necessary for imports because the notebook is not in the main folder of the project.
if "working_directory_corrected" not in vars():
    %cd ..
    working_directory_corrected = True

# Import dependencies

from data.dataset import CoralDataset

## 1. Dataset Initialization and Basic Properties
The dataset consists of two parts:
1) A CSV file containing point-level labels for the images (one labeled pixel per row).
2) A folder containing underwater images of coral reefs.

In [15]:
# Initialize dataset
dataset = CoralDataset()
annotations = dataset.load_annotations()

# Basic statistics
print(f"Total images: {annotations['Name'].nunique()}")
print(f"Total annotations: {len(annotations)}")
print(f"Annotation columns: {list(annotations.columns)}")

# Display sample of annotations
annotations.head()

Total images: 2455
Total annotations: 418310
Annotation columns: ['Name', 'Row', 'Column', 'Label', 'Unnamed: 4']


| | Name | Row | Column | Label | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | i0201a.png | 111 | 94 | broken_coral_rubble | |
| 1 | i0201a.png | 173 | 243 | broken_coral_rubble | |
| 2 | i0201a.png | 84 | 366 | broken_coral_rubble | |
| 3 | i0201a.png | 54 | 802 | broken_coral_rubble | |
| 4 | i0201a.png | 313 | 66 | sand | |


There are 4505 images in the images folder, but only 2455 of them are referenced in the CSV file.
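One way to pin down this mismatch is to compare the filenames on disk with the `Name` column. The sketch below is a minimal illustration rather than code from this project: the helper `compare_files_to_annotations` is hypothetical, and it assumes that `Name` values match the filenames exactly (same case, same extension).

```python
from pathlib import Path


def compare_files_to_annotations(image_dir, annotated_names, extensions=(".png", ".jpg")):
    """Compare image files on disk with the filenames referenced by annotations.

    Hypothetical helper: assumes annotation names match filenames exactly.
    """
    # Collect image filenames in the folder (non-recursive, by extension).
    on_disk = {
        p.name
        for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    }
    annotated = set(annotated_names)
    return {
        "unannotated": sorted(on_disk - annotated),      # on disk, no CSV rows
        "missing_on_disk": sorted(annotated - on_disk),  # in CSV, no file found
        "matched": sorted(on_disk & annotated),
    }
```

In this notebook it could be called as `compare_files_to_annotations("path/to/images", annotations["Name"].unique())`, where the folder path is a placeholder. A non-empty `missing_on_disk` list would point to naming mismatches rather than missing annotations.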

The CSV file contains 418310 entries and records these features:
- **Name**: Filename of the corresponding coral image.
- **Row**: Vertical coordinate of the annotated pixel (0 = top of image).
- **Column**: Horizontal coordinate of the annotated pixel (0 = left edge).
- **Label**: Ecological class of whatever lies at this location (coral or other substrate, e.g. `sand`).
- **Unnamed: 4**: An empty column that can be safely ignored.

This structure suggests a sparse sampling strategy where ecologically significant points are labeled rather than full segmentation masks.
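To put "sparse" into numbers, the annotated fraction of an image is the number of labeled points divided by its pixel count. The sketch below uses the totals printed above; the image size of 1024 x 768 is purely a placeholder assumption (the actual dimensions have not been measured at this point), and the helper `annotated_fraction` is hypothetical. It also assumes that no (Row, Column) pair is repeated within an image.

```python
def annotated_fraction(n_points, height, width):
    """Fraction of an image's pixels that carry a point label."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    if n_points < 0:
        raise ValueError("n_points must be non-negative")
    return n_points / (height * width)


# Counts reported by the cell above.
total_annotations = 418310
annotated_images = 2455
avg_points = total_annotations / annotated_images

# ASSUMPTION: 1024 x 768 is a placeholder image size, not a measured value.
fraction = annotated_fraction(avg_points, height=768, width=1024)
print(f"Average points per image: {avg_points:.1f}")
print(f"Annotated fraction at 1024x768: {fraction:.4%}")
```

Under that assumed size, only a few hundredths of a percent of the pixels would be labeled, which is consistent with point sampling rather than dense masks.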

In [16]:
print(f"Average annotations per image: {len(annotations)/annotations['Name'].nunique():.1f}")

Average annotations per image: 170.4


Initial observations show: 
- The images folder holds more images than the CSV references (4505 vs. 2455), so some images have no annotations and may need to be excluded.
- With 418,310 annotations across 2,455 images, we have approximately 170 annotated points per image on average.
- The Row/Column values represent a discrete sampling approach rather than full segmentation masks.
- Multiple classes are present within single images (as seen in the first five rows showing both 'broken_coral_rubble' and 'sand'). 
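The average hides how annotations are spread across images. A per-image summary could be sketched as follows, assuming the `Name` and `Label` columns shown earlier; the helper `summarize_per_image` is hypothetical, and the toy frame mirrors the five sample rows plus one made-up image (`example.png`) so the block runs on its own.

```python
import pandas as pd


def summarize_per_image(annotations: pd.DataFrame) -> pd.DataFrame:
    """Per-image point count and number of distinct labels."""
    return (
        annotations.groupby("Name")
        .agg(n_points=("Label", "size"), n_labels=("Label", "nunique"))
        .sort_values("n_points", ascending=False)
    )


# Toy data: the five sample rows from above plus two invented rows.
toy = pd.DataFrame(
    {
        "Name": ["i0201a.png"] * 5 + ["example.png"] * 2,
        "Row": [111, 173, 84, 54, 313, 10, 20],
        "Column": [94, 243, 366, 802, 66, 15, 25],
        "Label": ["broken_coral_rubble"] * 4 + ["sand"] + ["sand"] * 2,
    }
)

summary = summarize_per_image(toy)
print(summary)
print(summary["n_points"].describe())
```

Passing the real `annotations` frame instead of `toy` would show whether every image has a similar number of points or whether the average of about 170 masks a wide spread.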

This point-based annotation strategy differs from more common bounding box approaches, and will probably influence the model selection stage later on.

## 2. Image Data Properties
Before examining the label distribution, we should understand the properties of the image data. 
By manual inspection of the image data we can identify:
* The images have different dimensions.
* The images come in two formats: .png and .jpg.
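Manual inspection can be backed up with a programmatic survey. The following is a minimal sketch using Pillow, which is an assumption here since the notebook does not show which image library the project uses; the helper `survey_images` is hypothetical.

```python
from collections import Counter
from pathlib import Path

from PIL import Image


def survey_images(image_dir, extensions=(".png", ".jpg")):
    """Count (width, height) sizes, color modes, and file extensions in a folder."""
    sizes, modes, formats = Counter(), Counter(), Counter()
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() not in extensions:
            continue
        # Image.open is lazy: size and mode are available without decoding pixels.
        with Image.open(path) as img:
            sizes[img.size] += 1   # (width, height)
            modes[img.mode] += 1   # e.g. "RGB", "L" (grayscale), "RGBA"
        formats[path.suffix.lower()] += 1
    return sizes, modes, formats
```

Calling `survey_images("path/to/images")` (placeholder path) and inspecting `sizes.most_common()` would list the distinct dimensions, while `modes` answers the color-channel question from the introduction.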

Differing dimensions and formats indicate that some preprocessing, such as resizing or format conversion, will be needed.
An illustration is shown below: