# 🧪 Soil Classification - Part 2 Challenge

## 📌 Introduction

Soil identification plays a vital role in agriculture, land management, and environmental science. In this competition, the task is to **classify whether an image depicts soil or not**. This binary classification problem is a foundational step toward more complex soil-type identification systems and supports real-world applications like crop planning, geotechnical analysis, and environmental monitoring.

Given that the training data contains **only soil images**, this task is framed as a **One-Class Learning** problem. Specifically, the challenge is to build a robust model that can learn the distribution of soil images and correctly identify whether an unseen image belongs to the soil category or not.

---

## 🎯 Objective

- Develop a **binary classifier** to detect soil images based on their visual features.
- Implement a solution using **One-Class SVM**, a powerful method trained only on positive (soil) examples.
- Use pre-trained deep CNNs (**ResNet50** and **EfficientNet-B0**) to extract high-quality features from raw images.

---

## 📈 Evaluation Metric

The official metric for this challenge is:

> **F1-Score (Binary Classification)**

This score is the harmonic mean of **precision** and **recall**, so it rewards models that keep both false positives and false negatives low. A high F1-score indicates that the model identifies true soil images without mislabeling non-soil images as soil.
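To make the metric concrete, here is a minimal sketch that computes F1 from raw counts. The helper name `f1_from_counts` and the example counts are illustrative assumptions; they are not part of the competition code.

```python
def f1_from_counts(tp, fp, fn):
    """Return the F1-score given true positives, false positives and false negatives."""
    # Precision: fraction of images predicted as soil that really are soil
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of real soil images that were predicted as soil
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 soil images found, 2 false alarms, 4 soil images missed
example_f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(f"F1 = {example_f1:.3f}")
```

With these counts, precision is 0.80 and recall is about 0.67, so the harmonic mean sits closer to the lower of the two values.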

---

## 🧠 Our Approach

- 🧠 **Feature Extraction**: 
  - **ResNet50** and **EfficientNet-B0** pretrained on ImageNet.
  - Use **global average pooling** to convert feature maps to embeddings.
- 🧼 **Preprocessing**:
  - Resize all images to **224×224**.
  - Normalize pixel values and apply appropriate model-specific preprocessing.
- 🧮 **Dimensionality Reduction**:
  - Apply **StandardScaler** and **PCA** to compress the high-dimensional feature space.
- ✅ **Model Training**:
  - Use a **One-Class SVM** to learn the distribution of soil image features.
  - Predict whether a test image is an inlier (soil) or outlier (non-soil).
- 📤 **Submission**:
  - Generate predictions on the hidden test set.
  - Submit a CSV file with binary labels (1 = soil, 0 = not soil).
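The steps above can be sketched end to end as a single scikit-learn pipeline. This is only an illustration under stated assumptions: the random arrays stand in for CNN embeddings, the dimensions (32 features, 16 components) are arbitrary, and the helper `to_submission_labels` is a hypothetical name.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for CNN embeddings (assumption: real features come from a CNN)
soil_like = rng.normal(loc=0.0, scale=1.0, size=(300, 32))
clearly_different = rng.normal(loc=8.0, scale=1.0, size=(20, 32))

# Scale -> compress -> one-class model, fitted on the positive class only
one_class_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=16, whiten=True, random_state=0),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
one_class_pipeline.fit(soil_like)


def to_submission_labels(raw_predictions):
    """Map One-Class SVM outputs (+1 inlier, -1 outlier) to 1 = soil, 0 = not soil."""
    return np.where(raw_predictions == 1, 1, 0)


train_labels = to_submission_labels(one_class_pipeline.predict(soil_like))
far_labels = to_submission_labels(one_class_pipeline.predict(clearly_different))
print("Inlier rate on training data:", train_labels.mean())
print("Inlier rate on shifted data:", far_labels.mean())
```

Wrapping the three steps in one pipeline guarantees that the scaler and PCA fitted on the training features are the ones applied at prediction time.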

---

Let’s build a robust one-class classifier and see if we can dig up the soil from the noise! 🌍🧑‍🌾


In [48]:
# Importing the essential libraries

import numpy as np                    # For numerical operations
import pandas as pd                   # For data manipulation and CSV handling
import os                             # For directory and file operations
import matplotlib.pyplot as plt       # For visualization
from PIL import Image                 # To handle image file reading
from keras.preprocessing import image 
from keras.applications.resnet50 import ResNet50, preprocess_input 
from keras.applications import EfficientNetB0
from keras.applications.efficientnet import preprocess_input as effnet_preprocess
from scipy.stats import uniform


# PyTorch and torchvision libraries

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models

# Sklearn for evaluation metrics

from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import svm
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import OneClassSVM


# tqdm for progress bars

from tqdm import tqdm

import copy  # For saving the best model
import time  # For tracking training time

In [3]:
# Check if GPU is available and use it; else fall back to CPU

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [4]:
# Set random seeds for reproducibility

torch.manual_seed(42)
np.random.seed(42)

In [5]:
# This notebook was written on Kaggle. To run it elsewhere, only the paths
# below need to be changed.

# Define paths to training and test folders

train_dir = '/kaggle/input/soil-classification-part-2/soil_competition-2025/train'
test_dir = '/kaggle/input/soil-classification-part-2/soil_competition-2025/test'

# Load the CSV files with training labels and test image IDs

train_df = pd.read_csv('/kaggle/input/soil-classification-part-2/soil_competition-2025/train_labels.csv')
test_df = pd.read_csv('/kaggle/input/soil-classification-part-2/soil_competition-2025/test_ids.csv')

In [6]:
# Preview training data

train_df.head()

           image_id  label
0  img_ed005410.jpg      1
1  img_0c5ecd2a.jpg      1
2  img_ed713bb5.jpg      1
3  img_12c58874.jpg      1
4  img_eff357af.jpg      1


In [7]:
# Checking the unique labels in the training CSV file

label_counts = train_df['label'].value_counts()

print("Unique label counts in training data:")
print(label_counts)

Unique label counts in training data:
label
1    1222
Name: count, dtype: int64


In [8]:
# Preview testing data
# The training labels cover only soil images, so the model has to learn to flag
# non-soil images in the test set without ever seeing a negative example

test_df.head()

                               image_id
0  6595f1266325552489c7d1635fafb88f.jpg
1  4b614841803d5448b59e2c6ca74ea664.jpg
2  ca30e008692a50638b43d944f46245c8.jpg
3  6a9046a219425f7599729be627df1c1a.jpg
4  97c1e0276d2d5c2f88dddbc87357611e.jpg


In [9]:
# Checking the training data info

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1222 entries, 0 to 1221
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   image_id  1222 non-null   object
 1   label     1222 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 19.2+ KB


In [10]:
# Checking testing data info

test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 967 entries, 0 to 966
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   image_id  967 non-null    object
dtypes: object(1)
memory usage: 7.7+ KB


In [11]:
# Defining image and batch size

IMAGE_SIZE = (224, 224)
BATCH_SIZE = 32

In [14]:
# Loading images with labels for training dataset

def load_images_with_labels(df, image_dir):
    """Load images and their corresponding labels from dataframe"""
    images = []
    labels = []
    for idx, row in tqdm(df.iterrows(), total=len(df)):
        img_path = os.path.join(image_dir, row['image_id'])
        try:
            img = image.load_img(img_path, target_size=IMAGE_SIZE)
            img_array = image.img_to_array(img)
            images.append(img_array)
            labels.append(row['label'])
        except Exception as e:
            print(f"Skipping {row['image_id']} - error: {str(e)}")
    return np.array(images), np.array(labels)

print("Loading training images...")
X_all, y_all = load_images_with_labels(train_df, train_dir)

Loading training images...


100%|██████████| 1222/1222 [00:14<00:00, 85.26it/s] 


In [18]:
# Assuming y_all contains only 1s (soil images) since it's one-class learning


print("Unique labels in y_all:", np.unique(y_all))  # Should output: [1]

# Get all soil indices (since we have no non-soil in training)

soil_indices = np.arange(len(y_all))  # All indices are soil

# Split soil images into train (80%), val (10%), test (10%)

soil_train_idx, soil_temp_idx = train_test_split(
    soil_indices, test_size=0.2, random_state=42
)
soil_val_idx, soil_test_idx = train_test_split(
    soil_temp_idx, test_size=0.5, random_state=42
)

print(f"""
Data splits:
- Training: {len(soil_train_idx)} samples
- Validation: {len(soil_val_idx)} samples
- Test: {len(soil_test_idx)} samples
""")


Unique labels in y_all: [1]

Data splits:
- Training: 977 samples
- Validation: 122 samples
- Test: 123 samples
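The split sizes printed above follow from how `train_test_split` rounds: the test share is rounded up, so 20% of 1222 gives 245 held-out images, and half of those 245 gives 123 test and 122 validation images. The sketch below reproduces the sizes on plain indices and checks that the three splits form a partition; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(1222)  # one index per training image

# Same two-stage 80/10/10 split as in the notebook
train_idx, temp_idx = train_test_split(indices, test_size=0.2, random_state=42)
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

split_sizes = (len(train_idx), len(val_idx), len(test_idx))

# Every index must land in exactly one split
all_assigned = np.sort(np.concatenate([train_idx, val_idx, test_idx]))
is_partition = np.array_equal(all_assigned, indices)

print("Split sizes (train, val, test):", split_sizes)
print("Splits form a partition:", is_partition)
```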



In [19]:
# Create datasets (X_train will contain only soil)

X_train = X_all[soil_train_idx]
y_train = y_all[soil_train_idx]  # Will be all 1s

In [20]:
# Validation and test sets (also only soil in this case)

X_val = X_all[soil_val_idx]
y_val = y_all[soil_val_idx]  # All 1s

X_test = X_all[soil_test_idx]
y_test = y_all[soil_test_idx]  

In [23]:
# Preprocess images for ResNet

def preprocess_images(images):
    return preprocess_input(images.copy())

print("Preprocessing images...")
X_train_preprocessed = preprocess_images(X_train)
X_val_preprocessed = preprocess_images(X_val)
X_test_preprocessed = preprocess_images(X_test)

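# Feature extraction with ResNet-50

def extract_features(images, batch_size=BATCH_SIZE):
    """Run ResNet-preprocessed images through ResNet50 and return one embedding per image"""
    # include_top=False drops the ImageNet classifier; pooling='avg' applies
    # global average pooling to the final feature maps.
    # Note: the backbone is rebuilt on every call; cache it outside the
    # function if that overhead matters.
    base_model = ResNet50(weights='imagenet', include_top=False, pooling='avg')
    num_images = len(images)
    features = []
    
    for i in tqdm(range(0, num_images, batch_size)):
        batch = images[i:i+batch_size]
        batch_features = base_model.predict(batch)
        features.append(batch_features)
    
    # Stack the per-batch arrays into a single (num_images, feature_dim) array
    return np.concatenate(features)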

print("Extracting features from training set...")
X_train_features = extract_features(X_train_preprocessed)

print("Extracting features from validation set...")
X_val_features = extract_features(X_val_preprocessed)

print("Extracting features from test set...")
X_test_features = extract_features(X_test_preprocessed)

Preprocessing images...
Extracting features from training set...


100%|██████████| 31/31 [00:16<00:00,  1.93it/s]

Extracting features from validation set...

100%|██████████| 4/4 [00:10<00:00,  2.56s/it]

Extracting features from test set...


100%|██████████| 4/4 [00:10<00:00,  2.59s/it]
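The extraction loop relies on two ideas: global average pooling, which turns each feature map into a single number per channel, and batching, which explains why 977 training images at a batch size of 32 show up as 31 progress steps. The NumPy-only sketch below illustrates both. The `toy_backbone` function is a hypothetical stand-in for a CNN (it only computes block means), and the image sizes are arbitrary.

```python
import numpy as np


def global_average_pool(feature_maps):
    """Collapse (N, H, W, C) feature maps to (N, C) by averaging over H and W."""
    return feature_maps.mean(axis=(1, 2))


def toy_backbone(batch):
    """Hypothetical stand-in for a CNN: 8x8 grid of block means, then pooling."""
    n, h, w, c = batch.shape
    # Split each image into an 8x8 grid of blocks and average inside each block
    feature_maps = batch.reshape(n, 8, h // 8, 8, w // 8, c).mean(axis=(2, 4))
    return global_average_pool(feature_maps)


def extract_in_batches(predict_fn, images, batch_size=32):
    """Apply predict_fn to consecutive slices of images and stack the results."""
    chunks = [
        predict_fn(images[start:start + batch_size])
        for start in range(0, len(images), batch_size)
    ]
    return np.concatenate(chunks, axis=0)


rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(70, 64, 64, 3))  # 70 fake RGB images

features = extract_in_batches(toy_backbone, images)
num_batches = int(np.ceil(len(images) / 32))

print("Feature matrix shape:", features.shape)
print("Number of batches:", num_batches)
```

The last batch is simply smaller than the others, which is why slicing with `start:start + batch_size` needs no special handling.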


In [43]:
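# ====== 2. NEW FEATURE FUSION SECTION ======
# Note: the EfficientNet features extracted here are not used by the
# One-Class SVM in this notebook, which is trained on ResNet50 features only.

def extract_effnet_features(images, batch_size=32):
    """Extract EfficientNet-B0 features from raw RGB images (pixel values in [0, 255])"""
    base_model = EfficientNetB0(weights='imagenet', include_top=False, pooling='avg')
    num_images = len(images)
    features = []
    
    for i in tqdm(range(0, num_images, batch_size)):
        # Keras EfficientNet rescales inputs inside the model, so this call is a pass-through
        batch = effnet_preprocess(images[i:i+batch_size].copy())
        batch_features = base_model.predict(batch, verbose=0)
        features.append(batch_features)
    
    return np.concatenate(features)

print("\nExtracting EfficientNet features...")
# Pass the raw image arrays: the *_preprocessed arrays were transformed for
# ResNet50 (BGR order, mean-subtracted) and are not valid EfficientNet inputs.
X_train_effnet = extract_effnet_features(X_train)
X_val_effnet = extract_effnet_features(X_val)
X_test_effnet = extract_effnet_features(X_test)
# X_competition_test is loaded in cell In [30]; this cell (In [43]) was run after it
X_competition_effnet = extract_effnet_features(X_competition_test)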


Extracting EfficientNet features...


100%|██████████| 31/31 [00:19<00:00,  1.56it/s]
100%|██████████| 4/4 [00:15<00:00,  3.85s/it]
100%|██████████| 4/4 [00:15<00:00,  3.76s/it]
100%|██████████| 31/31 [00:18<00:00,  1.72it/s]


In [27]:
# Standardization and PCA

print("Applying standardization and PCA...")
ss = StandardScaler()
ss.fit(X_train_features)
X_train_scaled = ss.transform(X_train_features)
X_test_scaled = ss.transform(X_test_features)

pca = PCA(n_components=512, whiten=True)
pca.fit(X_train_scaled)
print(f'Explained variance: {sum(pca.explained_variance_ratio_):.2f}')

X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

Applying standardization and PCA...
Explained variance: 1.00
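The printed value is rounded to two decimals, so it says little about how many of the 512 components are actually needed. One way to inspect this is the cumulative explained-variance curve, sketched below on synthetic data. The low-rank data, the 0.95 target and the variable names are assumptions chosen for illustration; they do not reflect the soil features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features: 10 latent factors mixed into 100 dimensions plus small noise
latent = rng.normal(size=(400, 10))
mixing = rng.normal(size=(10, 100))
X = latent @ mixing + 0.05 * rng.normal(size=(400, 100))

X_scaled = StandardScaler().fit_transform(X)

# Fit with all components and accumulate the explained-variance ratios
pca_full = PCA(svd_solver="full").fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
k_95 = int(np.searchsorted(cumulative, 0.95)) + 1

# scikit-learn can make the same choice when n_components is a fraction
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

print("Components needed for 95% variance:", k_95)
print("Components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```

Because the synthetic data has only 10 underlying factors, far fewer than 100 components are needed here; on real embeddings the curve usually flattens more gradually.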


In [28]:
# One-Class SVM Training
# nu=0.05 is an upper bound on the fraction of training (soil) images treated as outliers

print("\nTraining One-Class SVM...")
oc_svm = svm.OneClassSVM(gamma='scale', kernel='rbf', nu=0.05)
oc_svm.fit(X_train_pca)


Training One-Class SVM...
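The notebook holds out soil-only validation and test splits but never scores them. Since those splits contain no negatives, the only quantity they can measure is the inlier rate, i.e. recall on soil; F1 cannot be estimated locally without non-soil images. The sketch below shows that check on synthetic data. The Gaussian arrays and the helper `to_binary` are illustrative assumptions, not the real features.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic stand-ins: training and held-out samples from the same distribution,
# plus samples that are obviously far from it
train = rng.normal(size=(1000, 4))
held_out = rng.normal(size=(200, 4))
far = rng.normal(loc=10.0, size=(50, 4))

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train)


def to_binary(raw_predictions):
    """Map +1 (inlier) / -1 (outlier) to 1 / 0."""
    return (raw_predictions == 1).astype(int)


# Share of held-out positives accepted as inliers (recall on the known class)
held_out_recall = to_binary(model.predict(held_out)).mean()
# Share of far-away samples accepted as inliers (should be none)
far_inlier_rate = to_binary(model.predict(far)).mean()
# Signed scores: positive inside the learned region, negative outside
scores = model.decision_function(held_out)

print(f"Held-out inlier rate: {held_out_recall:.2f}")
print(f"Far-sample inlier rate: {far_inlier_rate:.2f}")
print(f"Lowest held-out score: {scores.min():.3f}")
```

A low held-out inlier rate would suggest that `nu` or `gamma` is too strict for the data; the signed scores from `decision_function` can also be thresholded at a value other than zero.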


In [30]:
# Now we will load test images 

def load_competition_test_images(test_df, test_dir):
    """Load all competition test images"""
    test_images = []
    for img_id in tqdm(test_df['image_id'], desc="Loading Competition Test Images"):
        img_path = os.path.join(test_dir, img_id)
        try:
            img = image.load_img(img_path, target_size=IMAGE_SIZE)
            img_array = image.img_to_array(img)
            test_images.append(img_array)
        except Exception as e:
            # Every competition image needs a prediction, so fail loudly instead of skipping
            raise ValueError(f"Failed to load competition test image {img_id}") from e
    return np.array(test_images)

print("\nLoading competition test images...")
X_competition_test = load_competition_test_images(test_df, test_dir)


Loading competition test images...


Loading Competition Test Images: 100%|██████████| 967/967 [00:07<00:00, 125.90it/s]


In [31]:
# Preprocess and extract features for competition test

print("Preprocessing competition test images...")
X_competition_test_preprocessed = preprocess_images(X_competition_test)

print("Extracting features from competition test set...")
X_competition_test_features = extract_features(X_competition_test_preprocessed)

# Apply the same transformations

print("Transforming competition test features...")
X_competition_test_scaled = ss.transform(X_competition_test_features)
X_competition_test_pca = pca.transform(X_competition_test_scaled)

# Make predictions for competition

print("Making competition predictions...")
competition_preds = oc_svm.predict(X_competition_test_pca)
competition_preds = np.where(competition_preds == 1, 1, 0)

# Verify lengths

print(f"Number of competition test images: {len(test_df)}")
print(f"Number of competition predictions: {len(competition_preds)}")

Preprocessing competition test images...
Extracting features from competition test set...




100%|██████████| 31/31 [00:13<00:00,  2.31it/s]

Transforming competition test features...
Making competition predictions...
Number of competition test images: 967
Number of competition predictions: 967





In [32]:
# Final Submission Preparation

print("\nFinalizing competition submission...")

# 1. Verify predictions are binary (0 or 1)

assert set(competition_preds).issubset({0, 1}), "Predictions contain invalid values"

# 2. Check class distribution

print("Prediction distribution:")
print(pd.Series(competition_preds).value_counts())

# 3. Create submission DataFrame with proper ordering

submission_df = pd.DataFrame({
    'image_id': test_df['image_id'],
    'label': competition_preds
})

# 4. Verify no missing values

assert not submission_df.isnull().any().any(), "Submission contains missing values"


Finalizing competition submission...
Prediction distribution:
0    722
1    245
Name: count, dtype: int64


In [33]:
# 5. Save to CSV

submission_file = 'submission_svm.csv'
submission_df.to_csv(submission_file, index=False)
print(f"\nSubmission saved to {submission_file}")
print("First 5 predictions:")
print(submission_df.head())


Submission saved to submission_svm.csv
First 5 predictions:
                               image_id  label
0  6595f1266325552489c7d1635fafb88f.jpg      1
1  4b614841803d5448b59e2c6ca74ea664.jpg      1
2  ca30e008692a50638b43d944f46245c8.jpg      0
3  6a9046a219425f7599729be627df1c1a.jpg      1
4  97c1e0276d2d5c2f88dddbc87357611e.jpg      1
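A quick round-trip check can confirm that the saved file has the expected columns, order and label values before uploading. The sketch below uses an in-memory buffer and three made-up image IDs; both are assumptions for illustration only.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical test IDs and raw One-Class SVM outputs (+1 inlier, -1 outlier)
test_ids = pd.DataFrame({"image_id": ["a.jpg", "b.jpg", "c.jpg"]})
raw_preds = np.array([1, -1, 1])

submission = pd.DataFrame({
    "image_id": test_ids["image_id"],
    "label": np.where(raw_preds == 1, 1, 0),
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded)
```

Writing with `index=False` matters here: without it the file gains an extra unnamed index column.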


---

## 🏁 Results

After training our One-Class SVM on ResNet50 features, we generated predictions for the competition's test set and scored them on the leaderboard.

### 📊 Prediction Distribution

The distribution of predicted labels on the test set:

- **Soil (1)**: 245 images  
- **Not Soil (0)**: 722 images

The model labels only about 25% of the test images (245 of 967) as soil, so it is **conservative in labeling an image as soil**. This is consistent with how One-Class models work: anything that deviates from the learned soil distribution is rejected.

### 🏆 Leaderboard Performance

> **Final F1-Score**: `0.855`

Our model achieved a **0.855 F1-score** on the public leaderboard, which reflects a reasonable balance between precision and recall for detecting soil images.

---

The results suggest that **deep feature representations + One-Class SVM** form an effective strategy for binary soil image classification, especially when only positive-class examples are available during training.
