# Dog Breed Identification: Machine Learning for a Kaggle Competition

My name is André Fernandes, and this notebook presents my proposed solution for the competition. Feel free to connect with me on LinkedIn and check out my other projects on GitHub:

[LinkedIn](https://www.linkedin.com/in/andr%C3%A9-fernandes-868006207/)
[GitHub](https://github.com)

Below is the description of the competition and the link to the main page if you want to check it for yourself.

**Competition Description**

The Dog Breed Identification competition challenges participants to identify the breed of a dog in an image. This notebook will guide you through the process of building a predictive model that classifies dog breeds based on image data.

This competition is hosted on Kaggle: [Dog Breed Identification](https://www.kaggle.com/competitions/dog-breed-identification/overview)

## Table of Contents

1. Introduction
2. Data Description
3. Exploratory Data Analysis (EDA)
4. Data Preprocessing
5. Modeling
6. Model Evaluation
7. Conclusion
8. References

## Introduction

The Dog Breed Identification dataset is a collection of dog images categorized into various breeds. This notebook aims to develop a model that accurately classifies dog breeds based on image data.

## Data Description

The dataset consists of three main components:

- **train.zip**: The training dataset containing images of dogs and their corresponding breed labels.
- **test.zip**: The test dataset for which predictions need to be made.
- **labels.csv**: A CSV file containing the breed labels for the training dataset images.
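
Before any modeling, it can help to confirm that every row in `labels.csv` has a matching image in the extracted training folder. The sketch below assumes each image is stored as `<id>.jpg`, where `id` is the value from `labels.csv`; that naming convention and the helper name `find_missing_images` are assumptions made for illustration.

```python
import os

import pandas as pd


def find_missing_images(labels: pd.DataFrame, image_dir: str) -> list:
    """Return the ids in `labels` that have no matching '<id>.jpg' in image_dir."""
    # Assumption: each training image is named '<id>.jpg'
    available = {
        os.path.splitext(name)[0]
        for name in os.listdir(image_dir)
        if name.endswith(".jpg")
    }
    return [image_id for image_id in labels["id"] if image_id not in available]
```

With the paths used later in this notebook, a call would look like `find_missing_images(pd.read_csv('../data/labels.csv'), '../data/train')`; an empty list means every label has an image.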

## Exploratory Data Analysis (EDA)

In this section, we will explore the dataset to understand the distribution of dog breeds and visualize relationships between features.
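
As a starting point for the breed distribution, the per-breed image counts can be derived from the `breed` column alone. This is a minimal sketch on toy data; `breed_distribution` is a hypothetical helper and the toy values are not taken from the real `labels.csv`.

```python
import pandas as pd


def breed_distribution(labels: pd.DataFrame) -> pd.DataFrame:
    """Count images per breed and express each count as a share of the total."""
    counts = labels["breed"].value_counts()  # sorted from most to least frequent
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy data standing in for labels.csv (illustrative values only)
toy_labels = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "breed": ["dingo", "pekinese", "dingo", "dingo"],
})
print(breed_distribution(toy_labels))
```

Applied to the real labels, the same table shows how balanced the classes are, which matters when choosing an evaluation split.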

## Data Preprocessing

Data preprocessing steps include:
...

## Modeling

...

## Model Evaluation

....

## Conclusion

...

## References

- Kaggle Dog Breed Identification Competition: [Kaggle Dog Breed Identification](https://www.kaggle.com/competitions/dog-breed-identification/overview)
- ChatGPT

# ----- Beginning of my Solution Proposal -----
Create a folder named "data" and place the data provided by Kaggle in it: the test and train folders, plus the labels.csv and sample_submission.csv files. If you need to install anything, create a cell below this one and install what's needed :)

# Imports

In [4]:
# Manipulate data and files
import numpy as np
import pandas as pd
from PIL import Image
import os
# To use parallel processing
from concurrent.futures import ThreadPoolExecutor, as_completed

# Get the data

In [6]:
# Get labels
labels = pd.read_csv('../data/labels.csv')

In [7]:
# Let's define some functions for image processing

# Resize every image to the same size, because the originals vary in size
def resize_image(image, target_size):
    return image.resize(target_size, Image.LANCZOS)

# Convert an image to a flat list of pixel values
def process_image(image_path, target_size=(128, 128)):
    try:
        with Image.open(image_path) as img:
            # Resize image to target size
            img = resize_image(img, target_size)
            img_array = np.array(img)
            # Flatten the array and convert to list
            return img_array.flatten().tolist()
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return None
        
# Process the images in parallel for speed, since the dataset has more than 10,000 images
def process_images_in_parallel(image_paths, target_size=(128, 128), max_workers=4):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_image = {executor.submit(process_image, path, target_size): path for path in image_paths}
        # Note: as_completed yields futures in completion order, not submission order
        for future in as_completed(future_to_image):
            result = future.result()
            if result is not None:
                results.append(result)
    return results

# Convert to a pandas dataframe
def images_to_dataframe(image_folder, target_size=(128, 128), max_workers=4, csv_path=None):
    if csv_path and os.path.exists(csv_path):
        print(f"Loading existing DataFrame from {csv_path}")
        return pd.read_csv(csv_path)
    
    print(f"CSV file does not exist. Processing images in {image_folder}...")
    
    # List all image files
    image_paths = [os.path.join(image_folder, file) for file in os.listdir(image_folder) if file.endswith('.jpg')]
    
    # Process images in parallel
    pixel_values = process_images_in_parallel(image_paths, target_size, max_workers=max_workers)
    
    # Create DataFrame
    df = pd.DataFrame(pixel_values)
    
    # Save DataFrame to CSV if path is provided
    if csv_path:
        print(f"Saving DataFrame to {csv_path}")
        df.to_csv(csv_path, index=False)
    
    return df
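
One caveat with `as_completed`: it yields futures in the order they finish, not the order they were submitted, and images that fail to load are skipped. Row *i* of the resulting DataFrame is therefore not guaranteed to correspond to the *i*-th file, or to row *i* of `labels.csv`. Below is a minimal sketch of an order-preserving alternative. It uses `executor.map`, which returns results in input order, and keeps each file's id alongside its pixels. The names `load_pixels` and `process_images_with_ids` are hypothetical, and the conversion to RGB is an added assumption so that every image yields the same number of values.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from PIL import Image


def load_pixels(image_path, target_size=(128, 128)):
    """Return the flattened RGB pixels of one image, or None if it cannot be read."""
    try:
        with Image.open(image_path) as img:
            # Assumption: force 3 channels so every row has the same length
            img = img.convert("RGB").resize(target_size, Image.LANCZOS)
            return np.asarray(img, dtype=np.uint8).ravel()
    except Exception as e:
        print(f"Error processing {image_path}: {e}")
        return None


def process_images_with_ids(image_paths, target_size=(128, 128), max_workers=4):
    """Process images in parallel; return (ids, pixel_matrix) in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as image_paths
        results = list(executor.map(lambda p: load_pixels(p, target_size), image_paths))

    ids, rows = [], []
    for path, pixels in zip(image_paths, results):
        if pixels is not None:
            # File name without extension, e.g. '.../abc.jpg' -> 'abc'
            ids.append(os.path.splitext(os.path.basename(path))[0])
            rows.append(pixels)

    if not rows:
        n_values = target_size[0] * target_size[1] * 3
        return ids, np.empty((0, n_values), dtype=np.uint8)
    return ids, np.stack(rows)
```

The returned ids can then be joined with `labels` on the `id` column, so pixels and breeds line up regardless of file order or skipped images.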

In [8]:
# Set directory paths
train_dir = "../data/train"
test_dir = "../data/test"
train_csv_path = '../data/train_dataframe.csv'
test_csv_path = '../data/test_dataframe.csv'

# Define target size for resizing
target_size = (128, 128)

# Process training and test images
train_df = images_to_dataframe(train_dir, target_size=target_size, max_workers=8, csv_path=train_csv_path)
test_df = images_to_dataframe(test_dir, target_size=target_size, max_workers=8, csv_path=test_csv_path)

CSV file does not exist. Processing images in ../data/train...
Saving DataFrame to ../data/train_dataframe.csv
CSV file does not exist. Processing images in ../data/test...
Saving DataFrame to ../data/test_dataframe.csv


# Let's Perform Exploratory Data Analysis (EDA) & Data Preprocessing

In [10]:
# Let's see the labels info
labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10222 entries, 0 to 10221
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      10222 non-null  object
 1   breed   10222 non-null  object
dtypes: object(2)
memory usage: 159.8+ KB


In [11]:
# Let's see some labels
labels.head()

|   | id | breed |
|---|----|-------|
| 0 | 000bec180eb18c7604dcecc8fe0dba07 | boston_bull |
| 1 | 001513dfcb2ffafc82cccf4d8bbaba97 | dingo |
| 2 | 001cdf01b096e06d78e9e5112d419397 | pekinese |
| 3 | 00214f311d5d2247d5dfe4fe24b2303d | bluetick |
| 4 | 0021f9ceb3235effd7fcde7f7538ed62 | golden_retriever |


In [12]:
# Let's see the training dataset info
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10222 entries, 0 to 10221
Columns: 49152 entries, 0 to 49151
dtypes: int64(49152)
memory usage: 3.7 GB
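
The 3.7 GB figure follows directly from the shape and dtype: 10222 rows × 49152 columns × 8 bytes per `int64` value, which is about 3.7 GiB. Since pixel values lie between 0 and 255, one byte per value is enough, which would bring the same data down to roughly 0.47 GiB. The sketch below works through the arithmetic and shows the downcast on a small stand-in frame; `small_df` is illustrative and is not the real `train_df`.

```python
import numpy as np
import pandas as pd

rows, cols = 10222, 128 * 128 * 3  # shape reported by train_df.info()

bytes_int64 = rows * cols * np.dtype(np.int64).itemsize
bytes_uint8 = rows * cols * np.dtype(np.uint8).itemsize
print(f"int64: {bytes_int64 / 2**30:.1f} GiB")
print(f"uint8: {bytes_uint8 / 2**30:.2f} GiB")

# Stand-in for train_df: random pixel values in the range 0..255
rng = np.random.default_rng(0)
small_df = pd.DataFrame(rng.integers(0, 256, size=(4, 12)))

# Downcasting keeps every value intact because all of them fit in one byte
small_uint8 = small_df.astype(np.uint8)
```

On the real data the equivalent call would be `train_df.astype(np.uint8)`, assuming every column holds pixel values only.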


In [13]:
# Let's see some training image information
train_df.head()

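|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 49142 | 49143 | 49144 | 49145 | 49146 | 49147 | 49148 | 49149 | 49150 | 49151 |
|---|---|---|---|---|---|---|---|---|---|---|-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0 | 85 | 107 | 67 | 88 | 108 | 71 | 80 | 100 | 66 | 81 | ... | 44 | 96 | 122 | 39 | 104 | 128 | 53 | 92 | 116 | 47 |
| 1 | 86 | 83 | 82 | 93 | 96 | 93 | 49 | 58 | 53 | 44 | ... | 107 | 109 | 102 | 81 | 110 | 95 | 72 | 115 | 90 | 84 |
| 2 | 105 | 104 | 99 | 110 | 109 | 104 | 110 | 109 | 104 | 113 | ... | 109 | 115 | 114 | 110 | 114 | 113 | 109 | 110 | 109 | 104 |
| 3 | 156 | 99 | 53 | 177 | 133 | 80 | 192 | 155 | 91 | 157 | ... | 91 | 118 | 42 | 42 | 141 | 76 | 46 | 202 | 150 | 92 |
| 4 | 142 | 135 | 144 | 143 | 135 | 144 | 158 | 150 | 159 | 151 | ... | 148 | 158 | 149 | 158 | 161 | 147 | 156 | 145 | 129 | 140 |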
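
Each row is one image flattened in row-major order, so the 49152 columns map back to a 128 × 128 × 3 array (height, width, RGB channel); columns 0, 1 and 2 hold the R, G and B values of the top-left pixel. Here is a small sketch of the inverse step, which can be handy for checking a row visually. `row_to_image` is a hypothetical helper, and it assumes every image was RGB, as the column count suggests.

```python
import numpy as np
from PIL import Image


def row_to_image(row, target_size=(128, 128)):
    """Rebuild a PIL image from one flattened row of RGB pixel values."""
    width, height = target_size  # PIL sizes are given as (width, height)
    pixels = np.asarray(row, dtype=np.uint8).reshape(height, width, 3)
    return Image.fromarray(pixels)
```

For example, `row_to_image(train_df.iloc[0])` would rebuild the first training image at the resized resolution.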
