In [2]:
import numpy as np
import pandas as pd 
import pickle
from glob import glob
import os

Summary:
Template setup for ImageNet data: The code loads a pickled metadata dictionary for a Dogs_vs_Wolves subset of ImageNet and converts it to a pandas DataFrame (df_template). Later cells use only its column names, as the schema for the Food-101 metadata.

Food-101 Dataset:

It sets up paths for the Food-101 dataset: the image directory, the class/label mapping files, and the train/validation split files. Note that val_split points to meta/test.txt, so the dataset's test split serves as the validation set here.
It uses glob() to list the per-class folders under images/ (one folder per food category). The individual image files are listed later, folder by folder.
In short, this code defines the paths and variables needed to build Food-101 metadata in the same layout as the ImageNet template (image paths, class labels, and train/validation sets).

In [3]:
# Template stuff
imagenet_root_path = '/bigstor/zsarwar/Imagenet_2012'
imagenet_subsets_path = '/bigstor/zsarwar/Imagenet_2012_subsets'
template_path = os.path.join(imagenet_subsets_path, "Dogs_vs_Wolves_metadata.pkl")
df_template = pd.read_pickle(template_path)
df_template = pd.DataFrame.from_dict(df_template, orient='index')


dataset_path = "/bigstor/common_data/food_101/food-101/images/*"
df_path = "/bigstor/common_data/food_101/DF/"
all_folders = glob(dataset_path)
#all_folders = [fold for fold in all_folders if ".txt" not in fold]
class_mapping = "/bigstor/common_data/food_101/food-101/meta/classes.txt"
label_mapping = "/bigstor/common_data/food_101/food-101/meta/labels.txt"

train_split = "/bigstor/common_data/food_101/food-101/meta/train.txt"
val_split = "/bigstor/common_data/food_101/food-101/meta/test.txt"
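
The template cell relies on pd.DataFrame.from_dict(..., orient='index'), which turns each outer dictionary key into a row label and each inner dictionary into a row. The sketch below illustrates this on a tiny hand-made dictionary. The keys and fields are hypothetical, since the actual layout of Dogs_vs_Wolves_metadata.pkl is not shown here.

```python
import pandas as pd

# Hypothetical miniature of a pickled metadata dict: one inner dict per image key
template_sample = {
    "dog/img_0.jpg": {"class": "dog", "label": 0},
    "wolf/img_1.jpg": {"class": "wolf", "label": 1},
}

# orient='index' makes the outer keys the row labels
df_template_sample = pd.DataFrame.from_dict(template_sample, orient="index")

print(list(df_template_sample.columns))
print(list(df_template_sample.index))
```

Only the column names of such a template matter for the rest of the notebook, so any DataFrame with the desired schema could play the same role.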

It reads the content of two text files:
classes.txt: Contains the class identifiers, which are also the image folder names (e.g., french_onion_soup).
labels.txt: Contains the human-readable name of each class (e.g., French onion soup), in the same order as classes.txt.

It removes the trailing newline character from each line in both files.
Note that the resulting lists are named counter-intuitively:
all_labels: The folder-style identifiers read from classes.txt (e.g., "bread_pudding", "french_onion_soup").
all_classes: The human-readable names read from labels.txt (e.g., "Bread pudding", "French onion soup").

In [4]:
# Process label to class mapping
# classes.txt: one folder-style identifier per line (e.g. french_onion_soup)
with open(class_mapping, 'r') as i_file:
    all_labels = i_file.readlines()
all_labels = [cl.replace("\n", "") for cl in all_labels]

# labels.txt: one human-readable name per line, in the same order
with open(label_mapping, 'r') as i_file:
    all_classes = i_file.readlines()
all_classes = [cl.replace("\n", "") for cl in all_classes]
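
As a possible alternative to readlines() followed by replace(), the sketch below reads a mapping file with str.splitlines() and also drops blank lines and surrounding whitespace. The helper name read_lines and the temporary sample files are illustrative assumptions, not part of the original notebook.

```python
import os
import tempfile


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path, "r") as i_file:
        return [line.strip() for line in i_file.read().splitlines() if line.strip()]


# Hypothetical stand-ins for meta/classes.txt and meta/labels.txt
tmp_dir = tempfile.mkdtemp()
classes_file = os.path.join(tmp_dir, "classes.txt")
labels_file = os.path.join(tmp_dir, "labels.txt")
with open(classes_file, "w") as f:
    f.write("bread_pudding\nfrench_onion_soup\n")
with open(labels_file, "w") as f:
    f.write("Bread pudding\nFrench onion soup\n\n")  # trailing blank line on purpose

sample_ids = read_lines(classes_file)
sample_names = read_lines(labels_file)
print(sample_ids)
print(sample_names)
```

Dropping blank lines keeps the two lists the same length even if one file ends with an extra newline, which matters because the lists are paired by position.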



Summary:
labels_classes dictionary: Maps each folder-style identifier (from all_labels) to its human-readable class name (from all_classes).

Example (excerpt):

labels_classes = {
    "bread_pudding": "Bread pudding",
    "french_onion_soup": "French onion soup"
}


labels_idx dictionary: Maps each identifier (from all_labels) to its index (position) in the all_labels list. This index serves as the numeric label.

Example (excerpt):

labels_idx = {
    "bread_pudding": 8,
    "french_onion_soup": 41
}
This makes it easy to look up the display name or the numeric label from a folder name. For instance:

labels_classes["french_onion_soup"] would return "French onion soup", the display name of the class.
labels_idx["french_onion_soup"] would return 41, the position of "french_onion_soup" in the list.


In [5]:
labels_classes = {all_labels[i]: all_classes[i] for i in range(len(all_labels))}
labels_idx = {all_labels[i]: i for i in range(len(all_labels))}
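
The two dictionaries can also be built with zip() and enumerate(). The sketch below does so on a two-entry excerpt and adds a length check, since zip() would otherwise silently truncate to the shorter list. The names sample_ids, sample_names, id_to_name and id_to_idx are hypothetical.

```python
# Hypothetical excerpt of the two lists (the full Food-101 lists hold 101 entries each)
sample_ids = ["bread_pudding", "french_onion_soup"]
sample_names = ["Bread pudding", "French onion soup"]

# Guard against a silent mismatch before pairing the lists by position
if len(sample_ids) != len(sample_names):
    raise ValueError("classes.txt and labels.txt have different lengths")

id_to_name = dict(zip(sample_ids, sample_names))
id_to_idx = {class_id: i for i, class_id in enumerate(sample_ids)}

print(id_to_name["french_onion_soup"])
print(id_to_idx["french_onion_soup"])
```

In this excerpt the index of french_onion_soup is 1. In the full list it is 41, as the df_val output further down shows.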

This code builds a list called metadata that holds one dictionary per image. The keys are taken by position from the template's columns, so the Food-101 metadata ends up with the same schema as the ImageNet template and can later be turned into a DataFrame.

Each dictionary includes:
The human-readable class name for the image (from labels_classes).
The numeric label of the class (from labels_idx).
A few static values ('food101' for two columns, None for two others).
The full file path of the image.
An 'index' entry holding the path relative to the images directory (class_folder/filename), which later becomes the DataFrame index.

Example:
If the following were true:

cols = ['image_url', 'class', 'label', 'data_type', 'dataset', 'query', 'img_path']
labels_classes = {"bread_pudding": "Bread pudding", "french_onion_soup": "French onion soup"}
labels_idx = {"bread_pudding": 8, "french_onion_soup": 41}
all_folders = ["path/to/bread_pudding", "path/to/french_onion_soup"]
all_images = ["path/to/bread_pudding/image1.jpg", "path/to/bread_pudding/image2.jpg"] for the "bread_pudding" folder.

Then, metadata will look like this:

metadata = [
    {
        'image_url': None,
        'class': "Bread pudding",  # Display name for "bread_pudding"
        'label': 8,  # Index of "bread_pudding"
        'data_type': 'food101',
        'dataset': 'food101',
        'query': None,
        'img_path': 'path/to/bread_pudding/image1.jpg',
        'index': 'bread_pudding/image1.jpg'
    },
    {
        'image_url': None,
        'class': "Bread pudding",
        'label': 8,
        'data_type': 'food101',
        'dataset': 'food101',
        'query': None,
        'img_path': 'path/to/bread_pudding/image2.jpg',
        'index': 'bread_pudding/image2.jpg'
    }
]
This continues for every folder in all_folders and every image inside it.

In [None]:
cols = df_template.columns
metadata = []
for folder in all_folders:
    # The folder name is the class identifier used in classes.txt
    label = folder.split("/")[-1]
    all_images = glob(folder + "/*")
    for image_path in all_images:
        t_dict = {}
        t_dict[cols[0]] = None
        t_dict[cols[1]] = labels_classes[label]  # human-readable class name
        t_dict[cols[2]] = labels_idx[label]      # numeric label
        t_dict[cols[3]] = 'food101'
        t_dict[cols[4]] = 'food101'
        t_dict[cols[5]] = None
        t_dict[cols[6]] = image_path
        # Relative path "class_folder/filename", used later as the DataFrame index
        t_dict['index'] = '/'.join(image_path.split("/")[-2:])
        metadata.append(t_dict)
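
The same folder walk can be sketched with pathlib, which avoids splitting paths on "/" by hand and gives a deterministic order when the results are sorted. Everything below runs on a throwaway directory tree. The file names, the reduced set of columns, and the variable names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical miniature of the images/ directory: one folder per class
root = Path(tempfile.mkdtemp()) / "images"
for rel in ["bread_pudding/1.jpg", "bread_pudding/2.jpg", "french_onion_soup/3.jpg"]:
    target = root / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()

id_to_name = {"bread_pudding": "Bread pudding", "french_onion_soup": "French onion soup"}
id_to_idx = {"bread_pudding": 0, "french_onion_soup": 1}

records = []
for folder in sorted(p for p in root.iterdir() if p.is_dir()):
    for image in sorted(folder.glob("*.jpg")):
        records.append({
            "class": id_to_name[folder.name],
            "label": id_to_idx[folder.name],
            "dataset": "food101",
            "img_path": str(image),
            # Relative key "class_folder/filename", built without relying on the OS separator
            "index": f"{folder.name}/{image.name}",
        })

print(len(records))
print(records[0]["index"])
```

Filtering on is_dir() also skips any stray files that sit next to the class folders, which a plain glob on images/* would pick up.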

Summary:
1. Create DataFrame: df_food is created from the metadata list, where each dictionary in metadata becomes a row in the DataFrame and the dictionary keys become column names.
2. Set 'index' as the row index: The 'index' column (which contains the relative file paths) is set as the index of the DataFrame.
3. Remove index name: The name of the index is cleared, so the row labels no longer carry the header 'index'.
This gives a DataFrame whose rows are indexed by the relative paths of the images, with the other metadata columns (class, label, dataset, etc.) accessible for each image.

Example:
After these lines, df_food looks like this (the first, unnamed column is the index):

                          image_url  class          label  data_type  dataset  query  img_path
bread_pudding/image1.jpg  None       Bread pudding  8      food101    food101  None   path/to/bread_pudding/image1.jpg
bread_pudding/image2.jpg  None       Bread pudding  8      food101    food101  None   path/to/bread_pudding/image2.jpg

Notice that the row labels are not named 'index'; they are simply unnamed row labels.

In [7]:
df_food = pd.DataFrame.from_dict(metadata)
df_food = df_food.set_index('index')
df_food.index.name = None
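
Because the relative paths act as row keys, it can be worth checking that they are unique before relying on them. The sketch below builds a small DataFrame from hypothetical records, sets the index in the same way, and looks up one row by its relative path.

```python
import pandas as pd

# Hypothetical records in the same shape as the metadata list (columns reduced)
records = [
    {"class": "Bread pudding", "label": 0, "index": "bread_pudding/1.jpg"},
    {"class": "French onion soup", "label": 1, "index": "french_onion_soup/3.jpg"},
]

df_sample = pd.DataFrame.from_records(records).set_index("index")
df_sample.index.name = None

# Relative paths are meant to identify images uniquely
assert df_sample.index.is_unique

print(df_sample.loc["french_onion_soup/3.jpg", "class"])
```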

Summary:
For the training images:

The code reads the entries from train_split, removes the newline character, and appends .jpg to each entry. Each entry has the form class_folder/image_id.
This results in a list of relative image paths for the training set, in the same class_folder/filename.jpg form as the index of df_food.

For the validation images:

The code does the same for the validation set by reading the entries from val_split (the dataset's test.txt), removing the newline character, and appending .jpg to each.
This results in a list of relative image paths for the validation set, like: ["french_onion_soup/2993508.jpg", "bread_pudding/562153.jpg", ...].

In [25]:
with open(train_split, "r") as i_file:
    train_images = i_file.readlines()
    train_images = [img.replace("\n", "") + ".jpg" for img in train_images]

with open(val_split, "r") as i_file:
    val_images = i_file.readlines()
    val_images = [img.replace("\n", "") + ".jpg" for img in val_images]
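
A quick sanity check after loading the two lists is to confirm that no image appears in both. The sketch below does this on two small temporary files. The helper read_split and the file contents are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-ins for meta/train.txt and meta/test.txt
tmp_dir = tempfile.mkdtemp()
train_file = os.path.join(tmp_dir, "train.txt")
test_file = os.path.join(tmp_dir, "test.txt")
with open(train_file, "w") as f:
    f.write("bread_pudding/1\nfrench_onion_soup/3\n")
with open(test_file, "w") as f:
    f.write("bread_pudding/2\n")


def read_split(path, suffix=".jpg"):
    """Read 'class_folder/image_id' entries and append the file extension."""
    with open(path, "r") as i_file:
        return [line.strip() + suffix for line in i_file if line.strip()]


sample_train = read_split(train_file)
sample_val = read_split(test_file)

# A split is only useful if no image appears on both sides
overlap = set(sample_train) & set(sample_val)
print(sample_train)
print(len(overlap))
```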

Example:
Assume df_food looks like this (the first, unnamed column is the index; some columns are omitted):

                              class              label  dataset
bread_pudding/image1.jpg      Bread pudding      8      food101
french_onion_soup/image2.jpg  French onion soup  41     food101
bread_pudding/image3.jpg      Bread pudding      8      food101

And train_images contains:

['bread_pudding/image1.jpg', 'bread_pudding/image3.jpg']
After running the code:

df_train = df_food[df_food.index.isin(train_images)]
The resulting df_train will look like this:

                              class              label  dataset
bread_pudding/image1.jpg      Bread pudding      8      food101
bread_pudding/image3.jpg      Bread pudding      8      food101

The row with french_onion_soup/image2.jpg is excluded because its index is not in the train_images list.

Conclusion:
This line filters df_food by its index to create a new DataFrame, df_train, that contains only the images listed in train_images.

In [31]:
df_train = df_food[df_food.index.isin(train_images)]

Example:
The validation set is filtered the same way. Assume df_food looks like this (the first, unnamed column is the index; some columns are omitted):

                              class              label  dataset
bread_pudding/image1.jpg      Bread pudding      8      food101
french_onion_soup/image2.jpg  French onion soup  41     food101
bread_pudding/image3.jpg      Bread pudding      8      food101

And val_images contains:

['french_onion_soup/image2.jpg']
After running the code:

df_val = df_food[df_food.index.isin(val_images)]
The resulting df_val DataFrame will look like this:

                              class              label  dataset
french_onion_soup/image2.jpg  French onion soup  41     food101

The two bread_pudding rows are excluded because their indices are not in the val_images list.

Conclusion:
This line creates a new DataFrame, df_val, which contains only the rows of df_food whose index (the relative image path) is listed in val_images. This selects the images that belong to the validation set.

In [32]:
df_val = df_food[df_food.index.isin(val_images)]
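
One caveat with isin() is that it silently ignores list entries that have no matching row, for example when an image listed in a split file is missing on disk. The sketch below shows the filter on hypothetical data and one way to surface such missing entries.

```python
import pandas as pd

# Hypothetical miniature of df_food, indexed by relative image path
df_sample = pd.DataFrame(
    {"class": ["Bread pudding", "French onion soup", "Bread pudding"], "label": [8, 41, 8]},
    index=["bread_pudding/1.jpg", "french_onion_soup/2.jpg", "bread_pudding/3.jpg"],
)
sample_train = ["bread_pudding/1.jpg", "bread_pudding/3.jpg"]
sample_val = ["french_onion_soup/2.jpg", "french_onion_soup/9.jpg"]  # 9.jpg has no row

df_train_sample = df_sample[df_sample.index.isin(sample_train)]
df_val_sample = df_sample[df_sample.index.isin(sample_val)]

# Compare the split list against the index to find entries that matched nothing
missing = sorted(set(sample_val) - set(df_sample.index))

print(len(df_train_sample), len(df_val_sample))
print(missing)
```

Comparing len(df_val) with len(val_images) on the real data would give the same kind of signal with a single number.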

In [33]:
df_val

Unnamed: 0,image_url,class,label,data_type,dataset,query,img_path
french_onion_soup/2993508.jpg,,French onion soup,41,food101,food101,,/bigstor/common_data/food_101/food-101/images/...
french_onion_soup/1184744.jpg,,French onion soup,41,food101,food101,,/bigstor/common_data/food_101/food-101/images/...
french_onion_soup/798877.jpg,,French onion soup,41,food101,food101,,/bigstor/common_data/food_101/food-101/images/...
french_onion_soup/3808030.jpg,,French onion soup,41,food101,food101,,/bigstor/common_data/food_101/food-101/images/...
french_onion_soup/1959316.jpg,,French onion soup,41,food101,food101,,/bigstor/common_data/food_101/food-101/images/...
...,...,...,...,...,...,...,...
bread_pudding/562153.jpg,,Bread pudding,8,food101,food101,,/bigstor/common_data/food_101/food-101/images/...
bread_pudding/2443597.jpg,,Bread pudding,8,food101,food101,,/bigstor/common_data/food_101/food-101/images/...
bread_pudding/3294452.jpg,,Bread pudding,8,food101,food101,,/bigstor/common_data/food_101/food-101/images/...
bread_pudding/255220.jpg,,Bread pudding,8,food101,food101,,/bigstor/common_data/food_101/food-101/images/...


Summary of What Happens:
Validation Set:

The df_val DataFrame (which contains metadata for validation images) is saved as "df_food101_val.pkl" in the df_path directory using pickle serialization.

Training Set:

The df_train DataFrame (which contains metadata for training images) is saved as "df_food101_train.pkl" in the df_path directory using pickle serialization.

These pickle files can later be loaded back into memory with pd.read_pickle() for further processing or model training.

In [34]:
out_val_path = os.path.join(df_path, "df_food101_val.pkl")
df_val.to_pickle(out_val_path)
out_train_path = os.path.join(df_path, "df_food101_train.pkl")
df_train.to_pickle(out_train_path)
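
A round trip is an easy way to confirm that a saved file can be read back intact. The sketch below writes a small hypothetical DataFrame to a temporary directory (standing in for df_path) and reloads it with pd.read_pickle().

```python
import os
import tempfile

import pandas as pd

df_sample = pd.DataFrame(
    {"class": ["Bread pudding"], "label": [8]},
    index=["bread_pudding/1.jpg"],
)

out_dir = tempfile.mkdtemp()  # hypothetical stand-in for df_path
out_path = os.path.join(out_dir, "df_sample.pkl")
df_sample.to_pickle(out_path)

# Reload and compare against the original
df_loaded = pd.read_pickle(out_path)
print(df_loaded.equals(df_sample))
```

to_pickle() does not create missing directories, so calling os.makedirs(df_path, exist_ok=True) beforehand may be needed if the output folder does not exist yet.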

In [35]:
out_train_path

'/bigstor/common_data/food_101/DF/df_food101_train.pkl'