# Notebook 1: Fruit Dataset - Exploratory Data Analysis (EDA)

Welcome to the first part of our Fruit Image Classification case study! In this notebook, we'll explore the fruit image dataset to understand its structure, characteristics, and any potential challenges we might face when building our classification model.

**Our Goals for this EDA:**
1. Understand the dataset structure (training and testing sets).
2. Determine the number of fruit classes and the distribution of images across these classes.
3. Examine the properties of the images themselves (e.g., size, brightness).
4. Identify any potential issues like class imbalance or variations that might affect model training.

Let's get started!

## 1. Setup and Library Imports

First, we need to import the necessary Python libraries. We'll be using:
- `os` for interacting with the file system (to find our image files).
- `PIL` (Pillow) for opening and manipulating images.
- `pandas` for working with data in a structured way (like our label files).
- `matplotlib.pyplot` for creating plots and visualizations.
- `random` for selecting random samples if needed.

In [None]:
import os
from PIL import Image
import pandas as pd
import matplotlib.pyplot as plt
import random

# Ensure plots appear inline in the notebook
%matplotlib inline
plt.style.use('ggplot') # Using a visually appealing style for plots

## 2. Define Data Directories

Next, we'll specify the paths to our training and testing image datasets.
**Important:** Make sure your `Fruits_Dataset_Train` and `Fruits_Dataset_Test` folders (which you downloaded separately) and the two label CSV files are placed in a `DATA` directory one level above the folder that contains this notebook, since the paths below start with `../DATA/`.

The dataset is expected to be organized as follows:
```
../DATA/
├── Fruits_Dataset_Train/
│   ├── 1/  (Class 1 images)
│   ├── 2/  (Class 2 images)
│   └── ... (other class folders)
├── Fruits_Dataset_Test/
│   ├── 1/  (Class 1 images)
│   ├── 2/  (Class 2 images)
│   └── ... (other class folders)
├── Labels_Train.csv
└── Labels_Test.csv
```

In [None]:
# Adjust these paths if your DATA folder is located elsewhere relative to the SCRIPTS folder
base_data_dir = "../DATA/"
train_dir = os.path.join(base_data_dir, "Fruits_Dataset_Train")
test_dir = os.path.join(base_data_dir, "Fruits_Dataset_Test")

labels_train_path = os.path.join(base_data_dir, "Labels_Train.csv")
labels_test_path = os.path.join(base_data_dir, "Labels_Test.csv")

# Verify that the directories exist
if not os.path.exists(train_dir):
    print(f"ERROR: Training directory not found at {train_dir}. Please check the path and dataset structure.")
if not os.path.exists(test_dir):
    print(f"ERROR: Testing directory not found at {test_dir}. Please check the path and dataset structure.")
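if not os.path.exists(labels_train_path):
    print(f"ERROR: Training labels CSV not found at {labels_train_path}.")
if not os.path.exists(labels_test_path):
    print(f"ERROR: Testing labels CSV not found at {labels_test_path}.")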

## 3. Image Counts per Class (from Directory Structure)

The images are organized into subdirectories within `Fruits_Dataset_Train` and `Fruits_Dataset_Test`. Each subdirectory (e.g., '1', '2') represents a different fruit class. Let's count how many images are in each class folder for both the training and testing sets.

In [None]:
def count_images_per_class_from_dirs(data_dir):
    """Counts images in each subdirectory (class) of a given directory."""
    class_counts = {}
    if not os.path.exists(data_dir):
        print(f"Directory {data_dir} does not exist.")
        return class_counts

    for class_name in os.listdir(data_dir):
        class_path = os.path.join(data_dir, class_name)
        if os.path.isdir(class_path):
            # Count only image files (e.g., .jpg, .png) to avoid counting other files
            num_images = len([f for f in os.listdir(class_path) if f.lower().endswith(('.png', '.jpg', '.jpeg'))])
            class_counts[class_name] = num_images
    return class_counts

train_counts_dirs = count_images_per_class_from_dirs(train_dir)
test_counts_dirs = count_images_per_class_from_dirs(test_dir)

print("Image counts per class (from Training directory structure):")
print(train_counts_dirs)
print("\nImage counts per class (from Testing directory structure):")
print(test_counts_dirs)

### Visualizing Class Distribution (from Directories)

A bar chart is a good way to visualize these counts and see if there's any class imbalance (i.e., some classes having significantly more or fewer images than others).

In [None]:
# Convert counts to pandas DataFrames for easier plotting
train_df_dirs = pd.DataFrame(list(train_counts_dirs.items()), columns=["Class", "Train Count"]).sort_values(by="Class")
test_df_dirs = pd.DataFrame(list(test_counts_dirs.items()), columns=["Class", "Test Count"]).sort_values(by="Class")

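# Merge the training and testing counts for a combined plot.
# The outer merge keeps classes that appear in only one split; their missing count becomes 0.
class_summary_dirs = pd.merge(train_df_dirs, test_df_dirs, on="Class", how="outer").fillna(0)
# Folder names are strings, so a plain sort puts '10' before '2'.
# Zero-padding the sort key orders numeric folder names numerically.
class_summary_dirs = class_summary_dirs.sort_values(by="Class", key=lambda s: s.str.zfill(8))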

class_summary_dirs.set_index("Class")[["Train Count", "Test Count"]].plot(kind="bar", figsize=(12, 7))
plt.title("Image Count per Class (from Directory Structure)")
plt.ylabel("Number of Images")
plt.xlabel("Class Folder Name")
plt.xticks(rotation=45, ha="right")
plt.grid(axis='y', linestyle='--')
plt.tight_layout() # Adjusts plot to prevent labels from overlapping
plt.show()

**Observation:**
*(e.g., Are the classes balanced? Are there similar numbers of images in train and test for each class?)*
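
To make the observation above more concrete, class imbalance can be summarized with a couple of numbers. The snippet below is a minimal sketch, not part of the original analysis: `summarize_imbalance` is a hypothetical helper, and the values in `example_counts` are made up purely for illustration.

```python
def summarize_imbalance(class_counts):
    """Return simple imbalance statistics for a {class_name: count} dict.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if not class_counts:
        return None
    counts = list(class_counts.values())
    largest, smallest = max(counts), min(counts)
    return {
        "largest_class": max(class_counts, key=class_counts.get),
        "smallest_class": min(class_counts, key=class_counts.get),
        # A ratio of 1.0 means perfectly balanced; inf means some class is empty
        "max_min_ratio": largest / smallest if smallest > 0 else float("inf"),
    }

# Made-up counts, for illustration only
example_counts = {"1": 120, "2": 60, "3": 90}
imbalance_summary = summarize_imbalance(example_counts)
print(imbalance_summary)
```

In this notebook you would pass `train_counts_dirs` (or `test_counts_dirs`) instead of `example_counts`. What ratio counts as "too imbalanced" is a judgment call that depends on the model and the metric you care about.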

## 4. Exploring the Label Files

The dataset also comes with `Labels_Train.csv` and `Labels_Test.csv`. These files provide explicit labels for each image, often in a 'one-hot encoded' format. This means for each image, there's a row, and for each possible fruit type, there's a column with a 1 if the image contains that fruit and 0 otherwise.

Let's load the training labels and see what they look like.

In [None]:
labels_df_train = pd.read_csv(labels_train_path)
print("First 5 rows of Labels_Train.csv:")
labels_df_train.head()

The columns (excluding `FileName`) represent the different fruit types our model will learn to identify. The actual names of these fruits are the column headers.

In [None]:
fruit_names = labels_df_train.columns.drop("FileName").tolist()
num_classes = len(fruit_names)
print(f"There are {num_classes} fruit classes based on the label file:")
print(fruit_names)
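
Summing the label columns is only meaningful as a per-class image count if each row carries exactly one `1`. Here is a minimal sketch of how that assumption could be checked, using a small made-up table. The fruit and file names below are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up example in the same shape as the label file: one row per image,
# one 0/1 column per fruit. Names are illustrative, not from the dataset.
toy_labels = pd.DataFrame({
    "FileName": ["a.jpg", "b.jpg", "c.jpg"],
    "Apple":    [1, 0, 0],
    "Banana":   [0, 1, 0],
    "Cherry":   [0, 0, 1],
})
toy_fruit_names = toy_labels.columns.drop("FileName").tolist()

# In a strictly one-hot table every row sums to exactly 1
labels_per_image = toy_labels[toy_fruit_names].sum(axis=1)
is_one_hot = bool((labels_per_image == 1).all())
print(f"Strictly one-hot: {is_one_hot}")

# If so, the single label of each image is the column holding the 1
toy_labels["Label"] = toy_labels[toy_fruit_names].idxmax(axis=1)
print(toy_labels[["FileName", "Label"]])
```

Applied to the real data, the same check would use `labels_df_train[fruit_names]`. Rows summing to more than 1 would point to multi-label images, and rows summing to 0 to unlabeled ones.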

### Image Counts per Fruit Type (from Label File)

We can sum the '1's in each fruit column in the label file to get the total number of images labeled for each specific fruit type. This gives us another view of class distribution, this time based on the explicit labels rather than just the folder structure.

In [None]:
# Sum each fruit column to count how many images are labeled for that fruit
# This assumes the labels are one-hot encoded or binary for each fruit type.
fruit_label_counts = labels_df_train[fruit_names].sum().sort_values(ascending=False)

print("\nTotal images labeled for each fruit type (from Labels_Train.csv):")
print(fruit_label_counts)

fruit_label_counts.plot(kind="bar", figsize=(12, 7), color='skyblue')
plt.title("Image Count per Labeled Fruit Type (Training Set)")
plt.ylabel("Number of Labeled Images")
plt.xlabel("Fruit Type")
plt.xticks(rotation=45, ha="right")
plt.grid(axis='y', linestyle='--')
plt.tight_layout()
plt.show()

**Observation:**
*(Write your observations here!)*

### Investigating a Specific Fruit (e.g., Apples)
The project description mentions distinguishing between different *types* of apples. Let's see how many images are labeled as 'Apple' in general.

In [None]:
if 'Apple' in labels_df_train.columns:
    apple_df = labels_df_train[labels_df_train["Apple"] == 1]
    print(f"Number of images explicitly labeled as 'Apple' in training set: {len(apple_df)}")
    # You could further explore if other apple-related columns exist, e.g., 'Apple Golden', 'Apple Red'
    # apple_related_cols = [col for col in fruit_names if 'apple' in col.lower()]
    # print(f"Apple-related columns: {apple_related_cols}")
else:
    print("'Apple' column not found in the label file.")

## 5. Image Properties

Now let's look at the images themselves. We'll check their dimensions (width and height) and brightness.

### Image Size Distribution
Neural networks typically require input images to be of a fixed size. Let's see if our images vary in size. We'll take a random sample of images to speed this up.

In [None]:
def get_image_sizes(data_directory, sample_limit=300):
    """Gets the dimensions (width, height) of a random sample of images from the directory."""
    sizes = []
    if not os.path.exists(data_directory):
        print(f"Directory {data_directory} does not exist.")
        return sizes

    # Collect every image path first so the sample is drawn across all classes,
    # not just from the first folders that os.listdir happens to return
    image_paths = []
    for class_name in os.listdir(data_directory):
        class_path = os.path.join(data_directory, class_name)
        if os.path.isdir(class_path):
            for img_file in os.listdir(class_path):
                if img_file.lower().endswith(('.png', '.jpg', '.jpeg')):
                    image_paths.append(os.path.join(class_path, img_file))

    # Random sample (or all images if there are fewer than sample_limit)
    for img_path in random.sample(image_paths, min(len(image_paths), sample_limit)):
        try:
            with Image.open(img_path) as img:
                sizes.append(img.size)  # (width, height)
        except IOError: # Handles corrupted or unreadable images
            print(f"Could not open image: {img_path}")
    return sizes

# Get sizes from a sample of training images
image_sizes_sample = get_image_sizes(train_dir, sample_limit=300)

if image_sizes_sample:
    widths, heights = zip(*image_sizes_sample) # Unzip into separate lists

    plt.figure(figsize=(12, 6))

    plt.subplot(1, 2, 1)
    plt.hist(widths, bins=20, alpha=0.7, color='coral', edgecolor='black')
    plt.title("Distribution of Image Widths (Sample)")
    plt.xlabel("Width (pixels)")
    plt.ylabel("Frequency")
    plt.grid(axis='y', linestyle='--')

    plt.subplot(1, 2, 2)
    plt.hist(heights, bins=20, alpha=0.7, color='teal', edgecolor='black')
    plt.title("Distribution of Image Heights (Sample)")
    plt.xlabel("Height (pixels)")
    plt.ylabel("Frequency")
    plt.grid(axis='y', linestyle='--')

    plt.tight_layout()
    plt.show()

    # Also print some summary statistics
    print(f"\nSampled {len(widths)} images.")
    print(f"Width - Min: {min(widths)}, Max: {max(widths)}, Avg: {sum(widths)/len(widths):.2f}")
    print(f"Height - Min: {min(heights)}, Max: {max(heights)}, Avg: {sum(heights)/len(heights):.2f}")
else:
    print("No image sizes collected. Check dataset path and content.")

**Observation:**
*(e.g., Are the image sizes consistent, or do they vary a lot? If they vary, we'll definitely need a resizing step in our preprocessing.)*
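
Histograms show the spread of widths and heights separately, but not how many distinct (width, height) combinations occur. A minimal sketch of one way to count them with `collections.Counter` follows. The sizes in `example_sizes` are made-up values for illustration.

```python
from collections import Counter

# Made-up (width, height) pairs, for illustration only
example_sizes = [(100, 100), (100, 100), (120, 90), (100, 100)]

size_counts = Counter(example_sizes)
most_common_size, most_common_count = size_counts.most_common(1)[0]
share = most_common_count / len(example_sizes)

print(f"Distinct sizes: {len(size_counts)}")
print(f"Most common size: {most_common_size} ({share:.0%} of sample)")
```

In this notebook, `image_sizes_sample` could be passed to `Counter` in place of `example_sizes`. A single distinct size would suggest that resizing is only needed to match the model's expected input shape.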

### Image Brightness Distribution
Variations in lighting can make classification harder. Let's estimate the average brightness for a sample of images from each class. We can convert images to grayscale and calculate an average pixel intensity.

A Kernel Density Estimation (KDE) plot can help visualize the distribution of brightness scores for each class.

In [None]:
def get_average_brightness(image_pil):
    """Calculates the average brightness of a PIL image."""
    # Convert to grayscale
    grayscale_image = image_pil.convert("L")
    # Get pixel data as a list of values
    pixels = list(grayscale_image.getdata())
    # Calculate average pixel value
    if len(pixels) > 0:
        return sum(pixels) / len(pixels)
    return 0 # Should not happen for valid images

brightness_data = []
samples_per_class = 15 # Take a few samples from each class to estimate brightness

if os.path.exists(train_dir):
    for class_name in os.listdir(train_dir):
        class_path = os.path.join(train_dir, class_name)
        if os.path.isdir(class_path):
            image_files = [f for f in os.listdir(class_path) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
            # Take a random sample if more images than samples_per_class, otherwise take all
            sample_files = random.sample(image_files, min(len(image_files), samples_per_class))

            for img_file in sample_files:
                img_path = os.path.join(class_path, img_file)
                try:
                    with Image.open(img_path) as img:
                        brightness = get_average_brightness(img)
                        brightness_data.append({"Class": class_name, "Brightness": brightness})
                except IOError:
                    print(f"Could not open image for brightness check: {img_path}")
                    continue

if brightness_data:
    df_brightness = pd.DataFrame(brightness_data)

    plt.figure(figsize=(14, 8))
    # Using seaborn for potentially nicer KDE plots
    import seaborn as sns
    sns.kdeplot(data=df_brightness, x="Brightness", hue="Class", fill=True, alpha=.5, linewidth=2)
    plt.title("Brightness Distribution per Class (Sampled)")
    plt.xlabel("Average Brightness (0-255)")
    plt.ylabel("Density")
    plt.grid(axis='y', linestyle='--')
    plt.show()

    # Displaying average brightness per class for another perspective
    print("\nAverage Brightness per Class (Sampled):")
    print(df_brightness.groupby("Class")["Brightness"].mean().sort_values())
else:
    print("No brightness data collected. Check dataset path and content.")

**Observation:**
*(e.g., Do some classes tend to be brighter or darker than others? Significant differences might suggest that brightness normalization or augmentation could be beneficial.)*
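
Before reading too much into the brightness values, it can help to confirm that the measure behaves as expected on images whose brightness is known. The sketch below does this with solid-color images created in memory. It uses Pillow's `ImageStat.Stat` as an alternative way to compute the grayscale mean, which is a substitution for illustration rather than the method used in the cell above, and `mean_brightness` is a hypothetical helper name.

```python
from PIL import Image, ImageStat

def mean_brightness(image_pil):
    """Mean grayscale intensity (0-255) computed with Pillow's ImageStat."""
    return ImageStat.Stat(image_pil.convert("L")).mean[0]

# Synthetic solid-color images, created in memory for illustration
black = Image.new("RGB", (8, 8), (0, 0, 0))
white = Image.new("RGB", (8, 8), (255, 255, 255))
gray = Image.new("RGB", (8, 8), (128, 128, 128))

for name, img in [("black", black), ("white", white), ("gray", gray)]:
    print(f"{name}: {mean_brightness(img):.1f}")
```

A black image should score at the bottom of the 0-255 range, a white one at the top, and mid-gray in between. Real photos with a plain white background will therefore score high even when the fruit itself is dark, which is worth keeping in mind when comparing classes.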

## 6. EDA Summary and Next Steps

This concludes our initial exploration of the fruit dataset!

**Key Takeaways from EDA:**
1.  Dataset structure:
2.  Class distribution:
3.  Image characteristics (size, brightness):
4.  Potential challenges or considerations for modeling:

Based on these findings, we can now move on to preprocessing the data and building our image classification model. The insights gained here will help inform our choices in the next stages.