# **Apple Identification Model** 

See the [README.md](README.md) for a full project overview, setup instructions, and additional context.

## **Imports**

In [8]:
from kaggle.api.kaggle_api_extended import KaggleApi
from pathlib import Path
import shutil

## **Table of Contents**

[**Dataset**](#dataset) — Overview of the dataset and filtering process

- [**Filtering Data**](#filtering-data) — Selecting only apple images from the full dataset  
- [**Data Preprocessing**](#data-preprocessing) — Cleaning and organizing the apple dataset for training  
- [**Dataset Structure**](#dataset-structure) — Explaining the layout of the training, test, and validation sets  
  
[**Model**](#model) — CNN architecture for apple species classification

- [**Model Architecture**](#model-architecture) — Structure and layers of the neural network  
- [**Training**](#training) — Optimizer, loss function, and training parameters  
- [**Evaluation**](#evaluation) — Accuracy, loss, and model performance analysis  

[**Results**](#results) — Final performance metrics and visualizations

[**Conclusion**](#conclusion) — Summary of results and future work

## **Dataset**

The dataset used in this project is the **Fruits-360** dataset from Kaggle, which contains images of a wide variety of fruits. For this project, the dataset is filtered to include only apples and their different varieties.

### **Filtering Data**

The original dataset contains a wide variety of fruit images, but for this project, we are only interested in apple images.

In this step, we will:
- Download the full fruit dataset using the Kaggle API
- Keep only the apple-related data
- Create a clean, organized apple-only dataset to use for training and evaluation

In [9]:
kaggle_api = KaggleApi()
kaggle_api.authenticate()

dataset = Path('fruits_dataset')

if dataset.is_dir():
    print('Fruits dataset already downloaded')
else:
    print('Downloading fruits dataset')
    kaggle_api.dataset_download_files('moltean/fruits', path=dataset, unzip=True)

Fruits dataset already downloaded
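The download step only works when the Kaggle client can find credentials. As a minimal sketch, the check below looks in the two commonly documented places: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, and a `kaggle.json` file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`). The helper name `kaggle_credentials_available` is hypothetical and not part of the Kaggle package, and newer client versions may accept other locations as well.

```python
import os
from pathlib import Path


def kaggle_credentials_available() -> bool:
    """Return True if Kaggle credentials appear to be configured (hypothetical helper)."""
    # Option 1: both environment variables are set
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file exists in the config directory
    config_dir = Path(os.environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()


if not kaggle_credentials_available():
    print("Kaggle credentials not found; configure them before downloading")
```

Running a check like this before authenticating gives a clearer message than a failed download.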


In [10]:
dataset = dataset / 'fruits-360_original-size/fruits-360-original-size'

apple_dataset = Path('apple_dataset')

apple_species = [
    'apple_breaburn', 
    'apple_crimson_snow',
    'apple_golden', 
    'apple_granny_smith', 
    'apple_hit',
    'apple_pink_lady',
    'apple_red', 
    'apple_red_delicious',
    'apple_red_yellow'
]

print(f'{len(apple_species)} species of apples:')
for species in apple_species:
    print(f'  - {species.replace("apple_", "").replace("_", " ").title()}')

9 species of apples:
  - Breaburn
  - Crimson Snow
  - Golden
  - Granny Smith
  - Hit
  - Pink Lady
  - Red
  - Red Delicious
  - Red Yellow


### **Data Preprocessing**

Fortunately, the dataset was already split into training, validation, and test sets, which simplifies the initial setup.

**TODO:**
- Create a filtered dataset containing only apple species
- Relabel folders as needed and remove any irrelevant or unwanted data

In [11]:
def is_apple_species(name: str, species: list[str]) -> bool:
    # True if any species name appears in the lowercased folder name
    return any(s in name.lower() for s in species)

def copy_apple_dirs(source_dir: Path, dest_dir: Path, species_list: list[str] = apple_species) -> None:
    # Copy each matching class folder, skipping folders that were already copied
    for class_dir in source_dir.iterdir():
        if (class_dir.is_dir()
            and is_apple_species(class_dir.name, species_list)
            and not (dest_dir / class_dir.name).exists()
        ):
            shutil.copytree(class_dir, dest_dir / class_dir.name)
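Because the helper matches by substring, a short entry such as `apple_red` also matches longer folder names that contain it. The sketch below redefines the helper so that it runs on its own and applies it to a few folder names. These names are illustrative assumptions, not a listing of the actual dataset.

```python
def is_apple_species(name: str, species: list[str]) -> bool:
    # Same substring test as the notebook helper
    return any(s in name.lower() for s in species)


species = ["apple_golden", "apple_red"]

# Hypothetical folder names, chosen only to illustrate the matching
folders = [
    "apple_golden_1",
    "Apple_Red_2",
    "apple_red_delicious_1",
    "apple_rotten_1",
    "pear_1",
]

matched = [f for f in folders if is_apple_species(f, species)]
print(matched)
# ['apple_golden_1', 'Apple_Red_2', 'apple_red_delicious_1']
```

The lowercase comparison makes the match case-insensitive, and numbered variants such as `apple_golden_1` are picked up without listing each one.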

In [12]:
# Create apple dataset
apple_training = apple_dataset / 'train'
copy_apple_dirs(dataset / 'Training', apple_training)

apple_testing = apple_dataset / 'test'
copy_apple_dirs(dataset / 'Test', apple_testing)

apple_validation = apple_dataset / 'validation'
copy_apple_dirs(dataset / 'Validation', apple_validation)

In [None]:
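# Clean up names and labels: strip the 'apple_' prefix from each class folder
for split_dir in apple_dataset.iterdir():
    if not split_dir.is_dir():
        continue
    # Snapshot the listing so renaming does not disturb the iteration
    for class_dir in list(split_dir.iterdir()):
        class_dir.rename(class_dir.with_name(class_dir.name.replace('apple_', '')))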

**Possible future work:**
- Combine duplicate species, for example: `golden_1`, `golden_2`, `golden_3` $\longrightarrow$ `golden`
- Relabel images with more explicit names, for example: `r0_1.jpg` $\longrightarrow$ `rotation_0_golden_1.jpg`
- Add more apples from other datasets
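One way the first item could be approached is sketched below. It assumes that duplicate classes differ only by a trailing `_<number>` in the folder name and that every file in a class folder should be moved. The function name `merge_numbered_classes` and the filename prefixing scheme are hypothetical choices, not something the project has settled on.

```python
import re
import shutil
from pathlib import Path


def merge_numbered_classes(split_dir: Path) -> None:
    """Merge folders such as golden_1 and golden_2 into a single golden folder."""
    # Snapshot the folder list first so new target folders are not revisited
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        match = re.fullmatch(r"(.+)_(\d+)", class_dir.name)
        if match is None:
            continue  # no numeric suffix, nothing to merge
        base, index = match.groups()
        target = split_dir / base
        target.mkdir(exist_ok=True)
        for image in sorted(class_dir.iterdir()):
            # Prefix with the folder index so equal filenames cannot collide
            shutil.move(str(image), str(target / f"{index}_{image.name}"))
        class_dir.rmdir()  # the source folder is empty at this point
```

Calling `merge_numbered_classes(Path('apple_dataset') / 'train')` would apply this to the training split. The index prefix keeps track of which original folder each image came from.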

### **Dataset Structure**

**[2025-03-23]** — The dataset currently mirrors the structure of the original source in order to maintain clarity and consistency during this initial phase of model development. Future changes to the structure may be implemented as needed and will be documented here.

Each folder represents a distinct apple species class. In some cases, duplicate species are split into separate folders (e.g., `golden_1` and `golden_2`). Image filenames include rotation labels along the third axis, noted as `r0` or `r1`.

The dataset is divided into three subsets:
- **Training** contains `k + 2` images per apple
- **Validation** contains `k + 1` images
- **Test** contains `k + 3` images

This results in the training set comprising approximately 50% of the total images, while the validation and test sets each contain around 25%.
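The split proportions can be checked directly by counting files. The sketch below assumes the `apple_dataset/train`, `apple_dataset/validation`, and `apple_dataset/test` layout created earlier and that all images use the `.jpg` extension. The function name `split_proportions` is hypothetical.

```python
from pathlib import Path


def split_proportions(root: Path, splits=("train", "validation", "test")) -> dict[str, float]:
    """Return each split's share of the total number of .jpg images under root."""
    counts = {}
    for split in splits:
        split_dir = root / split
        # A missing split folder counts as zero images
        counts[split] = sum(1 for _ in split_dir.rglob("*.jpg")) if split_dir.is_dir() else 0
    total = sum(counts.values())
    return {split: (n / total if total else 0.0) for split, n in counts.items()}


root = Path("apple_dataset")
if root.is_dir():
    for split, share in split_proportions(root).items():
        print(f"{split}: {share:.1%}")
```

Rerunning this after any change to the dataset structure keeps the documented percentages in step with the files on disk.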

## **Model** 

### **Model Architecture**

### **Training**

### **Evaluation**

## **Results**

## **Conclusion**