# 🌱 Soil Classification Challenge - Part 1
Welcome to the Soil Classification Challenge! In this notebook, we will walk through the entire machine learning workflow, from loading the dataset to submitting predictions for evaluation. This task involves classifying images of soil into different types using a Convolutional Neural Network (CNN).

## 1. 📚 Importing Required Libraries
In this section, we import all the required Python libraries for data handling, visualization, image processing, and building our machine learning model.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import cv2
import os
import numpy as np
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from PIL import Image

## 2. 📄 Loading Dataset
We begin by loading the training labels CSV which contains image IDs and corresponding soil type labels.

In [None]:
csv_path = "/content/soil_classification-2025/train_labels.csv"
train_df = pd.read_csv(csv_path)
print("✅ train_labels.csv loaded successfully.")
display(train_df.head())
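
Before preprocessing, it can help to check how many images each class has and whether any labels are missing. The following is a minimal sketch that uses a small inline CSV in place of `train_labels.csv`. The column names `image_id` and `soil_type` match those used in this notebook, but the file names and class names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for train_labels.csv; real file names and classes will differ.
csv_text = """image_id,soil_type
img_001.jpg,Clay soil
img_002.jpg,Red soil
img_003.jpg,Clay soil
img_004.jpg,Black Soil
"""
demo_df = pd.read_csv(io.StringIO(csv_text))

# Count images per class; strongly skewed counts suggest using a stratified split.
class_counts = demo_df["soil_type"].value_counts()
print(class_counts)

# Check for missing labels and duplicated image IDs before training.
n_missing = int(demo_df["soil_type"].isna().sum())
n_duplicates = int(demo_df["image_id"].duplicated().sum())
print(f"missing labels: {n_missing}, duplicate ids: {n_duplicates}")
```

The same three checks can be run on `train_df` directly once it is loaded.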

## 3. 🖼️ Viewing a Sample Image
Let's load and visualize one sample image from the dataset to get a sense of the input data.

In [None]:
image_folder = "/content/soil_classification-2025/train"
sample_row = train_df.iloc[0]
img_path = os.path.join(image_folder, sample_row['image_id'])

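# cv2.imread returns None (it does not raise) when a file is missing or unreadable
img = cv2.imread(img_path)
if img is None:
    raise FileNotFoundError(f"Could not read image: {img_path}")
# OpenCV loads images in BGR order; matplotlib expects RGB
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)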
plt.figure(figsize=(8, 6))
plt.imshow(img)
plt.title(f"Label: {sample_row['soil_type']}")
plt.axis('off')
plt.show()
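
The `cv2.COLOR_BGR2RGB` conversion above is needed because OpenCV stores channels in blue-green-red order. For a 3-channel image it amounts to reversing the channel axis, which the sketch below illustrates on a tiny hand-made array (the pixel values are made up for the example):

```python
import numpy as np

# A 1x2 "image" in OpenCV's BGR channel order: one pure-blue pixel, one pure-red pixel.
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels, giving RGB order.
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # the blue pixel, now in RGB order
print(rgb[0, 1])  # the red pixel, now in RGB order
```

Skipping this step would not cause an error, but the displayed colors would have red and blue swapped.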

## 4. 🧹 Data Preprocessing
In this step, we:
- Encode the categorical soil labels as integers
- Resize every image to 128×128 pixels and scale pixel values to the range [0, 1]
- Store the images and labels as NumPy arrays for model training

In [None]:
IMG_SIZE = 128
label_encoder = LabelEncoder()
train_df['label'] = label_encoder.fit_transform(train_df['soil_type'])

print("Label mapping:")
for i, soil_type in enumerate(label_encoder.classes_):
    print(f"{i}: {soil_type}")

X = []
y = []

for i in tqdm(range(len(train_df))):
    row = train_df.iloc[i]
    img_path = os.path.join(image_folder, row['image_id'])
    try:
        img = cv2.imread(img_path)
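        if img is None:
            # cv2.imread returns None (no exception) for missing or unreadable files
            print(f"Skipping unreadable image: {row['image_id']}")
            continue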
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        X.append(img)
        y.append(row['label'])
    except Exception as e:
        print(f"Error processing {row['image_id']}: {e}")

# Scale pixels to [0, 1]; float32 uses half the memory of the default float64
X = np.array(X, dtype=np.float32) / 255.0
y = np.array(y)

print(f"✅ Loaded {len(X)} images of shape {X[0].shape}")
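
To make the two transformations in this step concrete, here is a small NumPy-only sketch. It assumes hypothetical class names and a random stand-in for the image batch. `LabelEncoder` assigns integer codes in sorted order of the class names, and `np.unique(..., return_inverse=True)` is used here to reproduce that mapping without scikit-learn.

```python
import numpy as np

# Hypothetical labels; the real class names come from train_labels.csv.
soil_types = np.array(["Red soil", "Clay soil", "Red soil", "Alluvial soil"])

# np.unique returns the sorted class names and, with return_inverse=True,
# the integer index of each label within that sorted list.
classes, encoded = np.unique(soil_types, return_inverse=True)
print(dict(enumerate(classes)))
print(encoded)

# Stand-in for a batch of resized uint8 images: 2 images of 4x4 RGB pixels.
batch = np.random.default_rng(0).integers(0, 256, size=(2, 4, 4, 3), dtype=np.uint8)

# Dividing by 255 maps the integer range [0, 255] onto floats in [0, 1].
scaled = batch.astype(np.float32) / 255.0
print(scaled.shape, scaled.dtype)
```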

## 5. 🔀 Train-Validation Split
We hold out 20% of the images as a validation set, with a fixed random seed so the split is reproducible, to estimate how the model performs on unseen data.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"✅ Train size: {len(X_train)}, Validation size: {len(X_val)}")
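
The split above does not pass `stratify`, so the class proportions in the validation set can drift from those in the full dataset, especially when some soil types are rare. A minimal sketch of a stratified split follows, using made-up imbalanced labels and placeholder features rather than the real images:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# stratify=y_demo keeps the 80/20 class ratio in both resulting subsets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train class counts:", np.bincount(y_tr))
print("val class counts:  ", np.bincount(y_va))
```

Note that stratification requires at least two samples in every class, so it may not apply if a soil type appears only once.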

## 6. 🏗️ Building the CNN Model
We use a small CNN: two convolution + max-pooling blocks, then dropout, a 128-unit dense layer, and a softmax output layer with one unit per soil class.

In [None]:
model = Sequential([
    Conv2D(32, (3,3), activation='relu', input_shape=(IMG_SIZE, IMG_SIZE, 3)),
    MaxPooling2D(2,2),
    Conv2D(64, (3,3), activation='relu'),
    MaxPooling2D(2,2),
    Flatten(),
    Dropout(0.5),
    Dense(128, activation='relu'),
    Dense(len(label_encoder.classes_), activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
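
The shapes and parameter counts that `model.summary()` prints can be derived by hand. The sketch below does so in plain Python, assuming Keras defaults (stride-1 convolutions with `'valid'` padding, non-overlapping 2×2 pooling) and a hypothetical class count of 4. The real count is `len(label_encoder.classes_)`.

```python
def conv_out(size, kernel=3):
    """Output width/height of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output width/height of non-overlapping max pooling (floor division)."""
    return size // pool

demo_img_size = 128
demo_num_classes = 4  # assumption for illustration only

s = pool_out(conv_out(demo_img_size))  # 128 -> 126 -> 63
s = pool_out(conv_out(s))              # 63 -> 61 -> 30
flat_features = s * s * 64             # 64 channels after the second conv layer

# Parameters = weights + biases for each trainable layer.
params = {
    "conv1": 3 * 3 * 3 * 32 + 32,
    "conv2": 3 * 3 * 32 * 64 + 64,
    "dense1": flat_features * 128 + 128,
    "output": 128 * demo_num_classes + demo_num_classes,
}
total_params = sum(params.values())
print(flat_features, params, total_params)
```

Almost all of the parameters sit in the first dense layer, because `Flatten` turns the 30×30×64 feature map into 57,600 inputs.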

## 7. 🎯 Model Training
Train the model for 10 epochs and monitor validation performance.

In [None]:
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)

## 8. 📉 Training Performance
Visualize accuracy and loss over training epochs.

In [None]:
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.tight_layout()
plt.show()
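
Besides reading the curves by eye, the same `history.history` dictionary can be inspected programmatically. The sketch below uses made-up loss values shaped like that dictionary to find the epoch with the lowest validation loss and the final train/validation gap:

```python
# Hypothetical values shaped like `history.history`; real numbers will differ.
demo_history = {
    "loss": [1.20, 0.85, 0.60, 0.42, 0.30],
    "val_loss": [1.10, 0.90, 0.78, 0.81, 0.88],
}

val_loss = demo_history["val_loss"]

# Index (0-based) of the epoch with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Gap between validation and training loss at the last epoch; a gap that keeps
# growing after best_epoch is a common sign of overfitting.
final_gap = val_loss[-1] - demo_history["loss"][-1]

print(f"lowest val_loss at epoch {best_epoch + 1}: {val_loss[best_epoch]:.2f}")
print(f"final val/train loss gap: {final_gap:.2f}")
```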

## 9. 🧪 Validation Evaluation
Evaluate final performance on the validation set.

In [None]:
loss, accuracy = model.evaluate(X_val, y_val)
print(f"Validation Accuracy: {accuracy*100:.2f}%")
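
Overall accuracy can hide weak performance on individual soil types. A confusion matrix and per-class recall give a finer view. The sketch below uses made-up class indices. In this notebook the predictions would come from `np.argmax(model.predict(X_val), axis=1)`.

```python
import numpy as np

# Hypothetical true and predicted class indices for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
num_classes = 3

# confusion[i, j] counts samples of true class i that were predicted as class j.
confusion = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

# Per-class recall: correct predictions divided by the number of true samples.
per_class_recall = confusion.diagonal() / confusion.sum(axis=1)

print(confusion)
print(per_class_recall)
```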

## 10. 🧾 Test Data Prediction & Submission
Make predictions on test images and create a submission CSV file.

In [None]:
test_image_dir = "/content/soil_classification-2025/test"
test_images = []
image_ids = []

# sorted() makes the submission row order deterministic; os.listdir order is arbitrary
for img_name in tqdm(sorted(os.listdir(test_image_dir))):
    img_path = os.path.join(test_image_dir, img_name)
    try:
        image = Image.open(img_path).convert('RGB').resize((IMG_SIZE, IMG_SIZE))
        image = np.array(image) / 255.0
        test_images.append(image)
        image_ids.append(img_name)
    except Exception as e:
        print(f"Error processing image {img_name}: {e}")

test_images = np.array(test_images)
predictions = model.predict(test_images)
predicted_classes = np.argmax(predictions, axis=1)
predicted_labels = label_encoder.inverse_transform(predicted_classes)

submission_df = pd.DataFrame({
    'image_id': image_ids,
    'soil_type': predicted_labels
})
submission_df.to_csv('submission.csv', index=False)
print("✅ Submission file created successfully.")
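
The step from softmax outputs to label strings can be illustrated on a small example. The class names and probabilities below are hypothetical. Indexing a sorted class array with the argmax indices mirrors what `label_encoder.inverse_transform` does above.

```python
import numpy as np

# Hypothetical class names (sorted, as LabelEncoder stores them) and softmax outputs.
classes = np.array(["Alluvial soil", "Black Soil", "Clay soil"])
probs = np.array([
    [0.10, 0.70, 0.20],
    [0.80, 0.15, 0.05],
    [0.25, 0.30, 0.45],
])

# argmax over axis=1 picks the most probable class index for each row (image).
pred_idx = np.argmax(probs, axis=1)

# Indexing the class array with those indices recovers the label strings.
pred_labels = classes[pred_idx]
confidence = probs.max(axis=1)

for label, conf in zip(pred_labels, confidence):
    print(f"{label}: {conf:.2f}")
```

Keeping the maximum probability alongside each label makes it easy to spot low-confidence predictions, such as the third row here.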

## 11. 🖼️ Visualizing Test Predictions
Let's visualize some test images with predicted soil type.

In [None]:
plt.figure(figsize=(15, 10))
for i in range(min(9, len(test_images))):
    plt.subplot(3, 3, i+1)
    plt.imshow(test_images[i])
    plt.title(f"Pred: {predicted_labels[i]}")
    plt.axis('off')
plt.tight_layout()
plt.show()

## 12. ✅ Conclusion
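- We built and trained a small CNN (two convolutional blocks followed by dense layers) to classify soil images.
- We measured accuracy on a held-out 20% validation split (the value is printed in Section 9).
- We generated predictions for the test images and saved them to `submission.csv`.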
