Inspired by "Deep Learning for Computer Vision" CS231N 2025 (Stanford Univ.)

In image classification, a system takes an input image and assigns one or more of several predefined labels (e.g., "dog" or "cat"). If we assign one label to an image, it is single-label classification; if we assign multiple labels, it is multi-label classification. I deliberately chose the word "system" to refer to the complete computational setup, including the algorithm, its trained parameters, the runtime environment, input/output interfaces, and supporting infrastructure.  

On a computer, images are typically represented as tensors (multidimensional arrays) of integers.  

For example, a color (RGB) image with 800 rows and 600 columns of pixels is represented as a 3rd-order tensor with shape (800, 600, 3), following the common (height, width, channels) layout, where each value is an unsigned 8-bit integer in the range `[0, 255]`. The third dimension (size 3) corresponds to the RGB color channels (Red, Green, Blue). There are, of course, multiple ways to represent images (e.g., grayscale, CMYK, floating-point values normalized to `[0, 1]`, or other color spaces besides RGB).

In [22]:
# 3rd-order tensor: 2x2 RGB color image
import numpy as np

np.random.seed(42)
rgb_image = np.random.randint(0, 256, size=(2, 2, 3), dtype=np.uint8)
print("RGB image shape:", rgb_image.shape)   # (2, 2, 3)
print("RGB image dtype:", rgb_image.dtype)   # uint8
print(rgb_image)

RGB image shape: (2, 2, 3)
RGB image dtype: uint8
[[[102 220 225]
  [ 95 179  61]]

 [[234 203  92]
  [  3  98 243]]]


The example image above is a 2×2 pixel RGB image (4 pixels total). The first element of the first row in the 3rd-order tensor represents the RGB values of the top-left pixel at position (0,0): `[102, 220, 225]`, where 102 is the red channel intensity, 220 is the green channel intensity, and 225 is the blue channel intensity.
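To make the indexing concrete, here is a small sketch that rebuilds the same 2×2 array by hand, reads out one pixel and one channel, and converts to the normalized floating-point representation mentioned earlier. The variable names (`top_left`, `red_channel`, `normalized`) are my own, chosen only for illustration.

```python
import numpy as np

# Same pixel values as the 2x2 RGB example above, written out explicitly
rgb_image = np.array(
    [[[102, 220, 225], [95, 179, 61]],
     [[234, 203, 92], [3, 98, 243]]],
    dtype=np.uint8,
)

top_left = rgb_image[0, 0]        # all three channels of the pixel at row 0, column 0
red_channel = rgb_image[:, :, 0]  # 2x2 array holding only the red intensities

# One alternative representation: floats normalized to [0, 1]
normalized = rgb_image.astype(np.float32) / 255.0

print(top_left)     # [102 220 225]
print(red_channel)
print(normalized.min(), normalized.max())
```

The first two indices select the row and column, and the last index selects the channel.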

For a black-and-white (grayscale) image, only one channel is needed. It can be represented either as:
- a 2nd-order tensor of shape e.g., `(800, 600)`, or  
- a 3rd-order tensor of shape e.g., `(800, 600, 1)` (to maintain consistency with color images).

In [23]:
# 2nd-order tensor: 2x2 grayscale image
import numpy as np

grayscale_image = np.array([[100, 200], [50, 150]], dtype=np.uint8)
print("Grayscale image shape:", grayscale_image.shape)  # (2, 2)
print("Grayscale image dtype:", grayscale_image.dtype)  # uint8
print(grayscale_image)

Grayscale image shape: (2, 2)
Grayscale image dtype: uint8
[[100 200]
 [ 50 150]]
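Switching between the two grayscale layouts is a one-liner in NumPy. The sketch below shows one possible way to do it; the names `gray_2d`, `gray_3d`, and `gray_back` are illustrative.

```python
import numpy as np

gray_2d = np.array([[100, 200], [50, 150]], dtype=np.uint8)  # shape (2, 2)

# Add a trailing channel axis: (2, 2) -> (2, 2, 1)
gray_3d = gray_2d[:, :, np.newaxis]

# Drop it again: (2, 2, 1) -> (2, 2)
gray_back = np.squeeze(gray_3d, axis=-1)

print(gray_2d.shape, gray_3d.shape, gray_back.shape)  # (2, 2) (2, 2, 1) (2, 2)
```

The pixel values are untouched in both directions; only the shape changes.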


There are two main approaches to image classification:

1. Hard-coded rules (actually tried in early computer vision). Here, engineers manually defined rules to detect features like edges, corners, and textures.
2. Machine learning (ML) (data-driven method). This is the approach we focus on.

In ML-based image classification, we basically follow three steps:

1. Collect a labeled dataset (e.g., ImageNet)  
2. Train a classifier using ML algorithms  
3. Evaluate the model on new, unseen images  

Here is a simplified illustration (in pseudocode):

In [24]:
# 1. Collect a labeled dataset (e.g., ImageNet)
images = None
labels = None

# 2. Train a classifier using ML algorithms
def train(images, labels):
    # Machine learning!
    return "model"

model = train(images, labels)

# 3. Evaluate on new, unseen images
def predict(model, image):
    # Use model to predict label for the input image
    return "cat"

image = None  # The image you want to classify
label = predict(model, image)
print(label)  # Output: cat

cat
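Step 3 usually boils down to comparing predictions with the true labels of images the model never saw during training. As a rough sketch, accuracy can be computed as below. The `accuracy` helper and the five test labels are made up for illustration; they are not part of any dataset.

```python
import numpy as np

def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    return float(np.mean(predicted_labels == true_labels))

# Hypothetical labels for five held-out test images
true_labels = ["cat", "dog", "cat", "cat", "dog"]
# The placeholder predict() above always answers "cat"
predicted_labels = ["cat"] * len(true_labels)

print(accuracy(predicted_labels, true_labels))  # 0.6
```

A predictor that always answers "cat" is right only on the three cat images, hence 3/5.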


A classic and simple baseline for image classification is the nearest neighbor method, which is the k = 1 special case of k-nearest neighbors (k-NN). We will explore the "k" part later.

For this model, the train() function simply stores all training images and their labels in memory (as 2nd or 3rd-order tensors, as described before). It is important to understand that this is not learning. No patterns or features are extracted; the model simply stores the raw training examples (as tensors).  
The predict() function assigns to a new image the label of the most similar training image, where similarity is measured by a distance function that computes a numerical value between the new image and each stored training image.  
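The two functions described above can be sketched as a tiny class. This is only one possible layout, not a reference implementation: the class name `NearestNeighbor`, the `distance` argument, and the `sum_abs_diff` helper are all names I am assuming for illustration.

```python
import numpy as np

class NearestNeighbor:
    """1-nearest-neighbor classifier: memorizes the training set."""

    def __init__(self, distance):
        self.distance = distance  # callable: (image_a, image_b) -> number

    def train(self, images, labels):
        # "Training" is only storage; nothing is fitted or optimized
        self.images = list(images)
        self.labels = list(labels)

    def predict(self, image):
        # Compare the query against every stored image, keep the closest
        distances = [self.distance(stored, image) for stored in self.images]
        return self.labels[int(np.argmin(distances))]

def sum_abs_diff(a, b):
    # Illustrative distance: sum of absolute pixel-wise differences
    return np.abs(a.astype(np.int64) - b.astype(np.int64)).sum()

model = NearestNeighbor(distance=sum_abs_diff)
model.train(
    [np.array([[0, 0], [0, 0]]), np.array([[255, 255], [255, 255]])],
    ["dark", "bright"],
)
print(model.predict(np.array([[10, 5], [0, 20]])))  # dark
```

Note that `train()` runs in constant time per example, while `predict()` has to scan the whole training set.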

There are many distance metrics used in k-NN. Two of the most common for images are:

- L1 distance (Manhattan distance):  
  Sum the absolute differences $|I_1^p - I_2^p|$ over every pixel $p$ in the image.  
  $$
  d_1(I_1, I_2) = \sum_{p} |I_1^p - I_2^p|
  $$

- L2 distance (Euclidean distance):  
  Sum the squared differences $(I_1^p - I_2^p)^2$ over every pixel $p$, then take the square root.  
  $$
  d_2(I_1, I_2) = \sqrt{ \sum_{p} (I_1^p - I_2^p)^2 }
  $$

Just as a side note, many other distance and similarity measures exist (e.g., cosine similarity, Hamming distance, and the Minkowski distances, which generalize L1 and L2), but L1 and L2 remain the standard default choices in raw-pixel k-NN.
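As a quick check of the two formulas, here is a small NumPy sketch that computes both distances for a pair of 2×2 single-channel images. The arrays `I1` and `I2` are toy values.

```python
import numpy as np

I1 = np.array([[10, 20], [30, 40]], dtype=np.int64)
I2 = np.array([[12, 22], [28, 38]], dtype=np.int64)

diff = I1 - I2                   # pixel-wise differences
d1 = np.abs(diff).sum()          # L1: 2 + 2 + 2 + 2 = 8
d2 = np.sqrt((diff ** 2).sum())  # L2: sqrt(4 + 4 + 4 + 4) = 4.0

print(d1)  # 8
print(d2)  # 4.0
```

Both distances are zero only when the two images are identical pixel for pixel.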

To make this practical, suppose we have only two training images (one labeled "cat" and one labeled "dog").  

Using 1-nearest neighbors (i.e., pick the single most similar training example) with L1 distance, the prediction works as follows:

In [25]:
import numpy as np

# Training data (each entry: (image_array, label))
training_data = [
    (np.array([[10, 20], [30, 40]]), "dog"),
    (np.array([[100, 110], [120, 130]]), "cat"),
]

# Image to classify
new_image = np.array([[12, 22], [28, 38]])

best_label = None
best_distance = np.inf

for image, label in training_data:
    # L1 distance (sum of absolute pixel-wise differences)
    distance = np.abs(image - new_image).sum()
    print(f"Distance to {label}: {distance}")

    if distance < best_distance:
        best_distance = distance
        best_label = label

print(f"\nPredicted label: {best_label}")


Distance to dog: 8
Distance to cat: 360

Predicted label: dog


So this was quite straightforward, wasn't it? We took the new image (represented as a tensor), computed the L1 distance to each training image, and assigned the label of the one with the smallest distance. It's that simple.  
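One practical caveat: the arrays in the example are created without an explicit dtype, so NumPy gives them a signed integer type and the subtraction behaves as expected. Real images are often stored as `uint8` (as in the RGB example earlier), and subtracting unsigned integers wraps around instead of going negative. The sketch below, using toy values, shows the problem and one possible fix (casting to a wider signed type first).

```python
import numpy as np

a = np.array([[10, 200]], dtype=np.uint8)
b = np.array([[20, 100]], dtype=np.uint8)

# Subtracting uint8 arrays wraps around modulo 256: 10 - 20 becomes 246
wrapped = np.abs(a - b).sum()

# Casting to a wider signed type first gives the intended L1 distance
correct = np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

print(wrapped)  # 346
print(correct)  # 110
```

The wrapped result is silently wrong, so it is worth casting before computing any pixel-wise distance.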

In practice, k-nearest neighbors is usually run with k > 1 (rather than simply copying the label of the single closest neighbor). For example, with six training images (three labeled "dog" and three labeled "cat"), we can set k = 3. With two classes, we typically choose an odd k (like 3, 5, or 7) to avoid ties in voting. The algorithm computes the L1 distance from the new image to all six training examples, sorts them by distance, selects the three closest, and predicts the majority label among them.

This process is just as simple and clearly shows that k-NN does not learn. It does not extract patterns, build rules, or adjust internal parameters during training. It simply stores all training data in memory and makes predictions by comparing new inputs to these stored examples at runtime.

In [32]:
import numpy as np
from collections import Counter

# Training data: (each entry: (image_array, label))
training_data = [
    (np.array([[12, 18], [31, 45]]), "dog"),
    (np.array([[15, 22], [29, 41]]), "dog"),
    (np.array([[9, 19], [35, 43]]), "dog"),
    (np.array([[95, 108], [118, 132]]), "cat"),
    (np.array([[102, 111], [121, 138]]), "cat"),
    (np.array([[98, 105], [119, 128]]), "cat"),
]

# Image to classify
new_image = np.array([[14, 21], [32, 44]])


# k-NN settings
k = 3

# Compute the L1 distance (sum of absolute pixel-wise differences) to each
# training image and collect (distance, label) pairs
distances = [(np.abs(img - new_image).sum(), label) for img, label in training_data]

# Sort by distance (ascending)
distances.sort(key=lambda x: x[0])

# Get k nearest labels
nearest_labels = [label for _, label in distances[:k]]

# Predict via majority vote
predicted = Counter(nearest_labels).most_common(1)[0][0]

# Output
print("Distances (sorted):")
for dist, label in distances:
    print(f"  {label}: {dist}")

print(f"\n{k} nearest labels: {nearest_labels}")
print(f"Predicted: {predicted}")


Distances (sorted):
  dog: 7
  dog: 8
  dog: 11
  cat: 339
  cat: 342
  cat: 361

3 nearest labels: ['dog', 'dog', 'dog']
Predicted: dog


Perfect!
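The same procedure can also be packaged as a reusable function that computes all distances in one vectorized step. This is a sketch under my own assumptions: the function name `knn_predict` and its signature are hypothetical, and the training images are stacked into a single array so NumPy broadcasting can do the subtraction for every image at once.

```python
import numpy as np
from collections import Counter

def knn_predict(train_images, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest
    training images under L1 distance."""
    # Signed dtype so that differences of uint8 pixels cannot wrap around
    train = np.asarray(train_images, dtype=np.int64)
    query = np.asarray(query, dtype=np.int64)

    # Broadcast the subtraction over all N images, then sum over every
    # non-batch axis to get one distance per training image
    pixel_axes = tuple(range(1, train.ndim))
    distances = np.abs(train - query).sum(axis=pixel_axes)

    # Indices of the k smallest distances (stable sort keeps input order on ties)
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_images = [
    [[12, 18], [31, 45]],
    [[15, 22], [29, 41]],
    [[9, 19], [35, 43]],
    [[95, 108], [118, 132]],
    [[102, 111], [121, 138]],
    [[98, 105], [119, 128]],
]
train_labels = ["dog", "dog", "dog", "cat", "cat", "cat"]

print(knn_predict(train_images, train_labels, [[14, 21], [32, 44]], k=3))  # dog
```

It uses the same six training images and the same query as the cell above, so it should agree with that result.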

EXTRA MATERIAL:  

There are multiple challenges in image classification. For example:

- Viewpoint variation – the same object looks completely different from different angles  
- Illumination changes – lighting conditions alter pixel (RGB) values significantly (e.g., sunlight vs. indoor)  
- Background clutter – the background resembles the object, making it hard to isolate (e.g., a camouflaged insect)  
- Occlusion – parts of the object are hidden by other objects (e.g., only ears or tail visible)  
- Deformation – non-rigid objects change shape (e.g., a cat curled up, stretched, or in a weird pose)  
- Intra-class variation – objects in the same class vary greatly in appearance (e.g., different breeds or sizes)  
- Context – classification depends on the surrounding objects or scene
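To see why challenges like illumination hurt raw-pixel k-NN in particular, consider this toy sketch. It reuses the two training images from the 1-nearest-neighbor example and simulates a brighter photo of the "dog" image by adding 60 to every pixel. The offset of 60 and the helper `l1` are assumptions made purely for illustration.

```python
import numpy as np

dog = np.array([[10, 20], [30, 40]], dtype=np.int64)
cat = np.array([[100, 110], [120, 130]], dtype=np.int64)

# Hypothetical "same dog photo under brighter light": every pixel +60
bright_dog = dog + 60

def l1(a, b):
    # L1 distance: sum of absolute pixel-wise differences
    return np.abs(a - b).sum()

print(l1(bright_dog, dog))  # 240
print(l1(bright_dog, cat))  # 120
```

Under L1 distance, the brightened dog image ends up closer to the "cat" training image than to its own original, so a nearest neighbor classifier would mislabel it.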