## A detailed example demonstrating the `flow_from_dataframe` function from Keras

* docs -> https://keras.io/preprocessing/image/ & https://www.tensorflow.org/tutorials/load_data/images
* article -> https://medium.com/@vijayabhaskar96/tutorial-on-keras-flow-from-dataframe-1fd4493d237c
* dataset -> https://www.kaggle.com/c/cifar-10/data - The CIFAR-10 data consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. The official data has 50,000 training images and 10,000 test images, and the Kaggle competition preserves that train/test split. The Kaggle `test` folder nevertheless holds 300,000 files, because the competition added 290,000 junk images to discourage cheating.

Most of the image datasets that I found online come in one of 2 common formats:

1) The first format keeps all the **images within separate folders named after their respective class names**. This is by far the most common format I see online, and Keras provides the `flow_from_directory` function to easily read the images from disk and perform powerful on-the-fly image augmentation with the `ImageDataGenerator`.

2) The second most common format keeps all the **images inside a single directory, with their respective classes mapped in a CSV or JSON file**. For these we can use `flow_from_dataframe`.
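To make the two layouts concrete, here is a minimal sketch that builds a throwaway class-per-folder tree and flattens it into `(filename, label)` rows, which is the shape of table the second format stores in its CSV. The helper name `folder_layout_to_rows` and the `cat`/`dog` placeholder files are hypothetical and only serve the illustration; the files are empty, not real images.

```python
import tempfile
from pathlib import Path


def folder_layout_to_rows(root):
    """Return sorted (filename, label) pairs for a class-per-folder tree."""
    rows = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.glob("*.png")):
            # The folder name is the label; the filename stays relative to root.
            rows.append((f"{class_dir.name}/{image_path.name}", class_dir.name))
    return rows


# Build a tiny throwaway tree with empty placeholder files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    for label, names in {"cat": ["1.png", "2.png"], "dog": ["3.png"]}.items():
        (Path(tmp) / label).mkdir()
        for name in names:
            (Path(tmp) / label / name).touch()
    rows = folder_layout_to_rows(tmp)

print(rows)
```

Rows like these could be written to a CSV or loaded into a DataFrame with one column for the filename and one for the class.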

In [1]:
import tensorflow as tf
from pathlib import Path
import pandas as pd

In [2]:
tf.__version__

'2.0.0-rc2'

In [3]:
help(tf.keras.preprocessing.image.ImageDataGenerator.flow_from_dataframe)

Help on function flow_from_dataframe in module keras_preprocessing.image.image_data_generator:

flow_from_dataframe(self, dataframe, directory=None, x_col='filename', y_col='class', weight_col=None, target_size=(256, 256), color_mode='rgb', classes=None, class_mode='categorical', batch_size=32, shuffle=True, seed=None, save_to_dir=None, save_prefix='', save_format='png', subset=None, interpolation='nearest', validate_filenames=True, **kwargs)
    Takes the dataframe and the path to a directory
     and generates batches of augmented/normalized data.
    
    **A simple tutorial can be found **[here](
                                http://bit.ly/keras_flow_from_dataframe).
    
    # Arguments
        dataframe: Pandas dataframe containing the filepaths relative to
            `directory` (or absolute paths if `directory` is None) of the
            images in a string column. It should include other column/s
            depending on the `class_mode`:
            - if `class_mode` is `"

In [4]:
directory = Path("/Users/robin/Kaggle/cifar-10")

In [5]:
list(directory.glob('*.csv'))

[PosixPath('/Users/robin/Kaggle/cifar-10/sampleSubmission.csv'),
 PosixPath('/Users/robin/Kaggle/cifar-10/creditcard.csv'),
 PosixPath('/Users/robin/Kaggle/cifar-10/trainLabels.csv')]

Training files

In [6]:
train_dir = directory / 'train'
train_dir

PosixPath('/Users/robin/Kaggle/cifar-10/train')

In [7]:
train_files = list(train_dir.glob("*"))
print(len(train_files))
train_files[:5]

50000


[PosixPath('/Users/robin/Kaggle/cifar-10/train/20037.png'),
 PosixPath('/Users/robin/Kaggle/cifar-10/train/3975.png'),
 PosixPath('/Users/robin/Kaggle/cifar-10/train/49081.png'),
 PosixPath('/Users/robin/Kaggle/cifar-10/train/38678.png'),
 PosixPath('/Users/robin/Kaggle/cifar-10/train/30224.png')]

Test files

In [8]:
test_dir = directory / 'test'
test_dir

PosixPath('/Users/robin/Kaggle/cifar-10/test')

In [9]:
test_files = list(test_dir.glob("*"))
print(len(test_files))
test_files[:5]

300000


[PosixPath('/Users/robin/Kaggle/cifar-10/test/20037.png'),
 PosixPath('/Users/robin/Kaggle/cifar-10/test/124219.png'),
 PosixPath('/Users/robin/Kaggle/cifar-10/test/270196.png'),
 PosixPath('/Users/robin/Kaggle/cifar-10/test/3975.png'),
 PosixPath('/Users/robin/Kaggle/cifar-10/test/66062.png')]

## Create dataframes
`trainLabels.csv` maps the file `id` to the class label

In [10]:
train_df=pd.read_csv(str(directory / "trainLabels.csv"), dtype=str)

In [11]:
train_df.tail()

|       | id    | label      |
|-------|-------|------------|
| 49995 | 49996 | bird       |
| 49996 | 49997 | frog       |
| 49997 | 49998 | truck      |
| 49998 | 49999 | automobile |
| 49999 | 50000 | automobile |


In [12]:
train_df.shape

(50000, 2)

In [13]:
labels = list(train_df['label'].unique())
print(len(labels))
labels

10


['frog',
 'truck',
 'deer',
 'automobile',
 'bird',
 'horse',
 'ship',
 'cat',
 'dog',
 'airplane']

From the `id` we generate the file name

In [14]:
# `id` was read as a string (dtype=str), so plain concatenation works
train_df['filename'] = train_df['id'] + ".png"

In [15]:
train_df.head()

|   | id | label      | filename |
|---|----|------------|----------|
| 0 | 1  | frog       | 1.png    |
| 1 | 2  | truck      | 2.png    |
| 2 | 3  | truck      | 3.png    |
| 3 | 4  | deer       | 4.png    |
| 4 | 5  | automobile | 5.png    |


Then create `test_df` the same way from `sampleSubmission.csv`

In [16]:
test_df=pd.read_csv(str(directory / "sampleSubmission.csv"),dtype=str)

In [17]:
# `id` was read as a string (dtype=str), so plain concatenation works
test_df['filename'] = test_df['id'] + ".png"
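Since the filenames are derived from the `id` column rather than read from disk, it can be worth checking that every generated name really exists before handing the dataframe to Keras. The signature above shows `validate_filenames=True` as the default, and as far as I know the generator then skips names it cannot validate, so a typo in the extension would shrink the dataset quietly. The sketch below is a standalone illustration: `find_missing` is a hypothetical helper, and the temporary folder with empty files stands in for the real `train` directory.

```python
import tempfile
from pathlib import Path


def find_missing(filenames, image_dir):
    """Return the filenames that have no matching file inside image_dir."""
    image_dir = Path(image_dir)
    return [name for name in filenames if not (image_dir / name).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    # Only two placeholder files exist on disk (hypothetical data).
    for name in ("1.png", "2.png"):
        (Path(tmp) / name).touch()

    # Same id -> filename rule as in the notebook, applied to three ids.
    ids = ["1", "2", "3"]
    filenames = [image_id + ".png" for image_id in ids]
    missing = find_missing(filenames, tmp)

print(missing)
```

With the real data, the same check could be run as `find_missing(train_df['filename'], directory / 'train')`, and an empty list would mean every row has an image.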

## ImageDataGenerator
Notice below that I split the training set into 2 subsets, one for training and the other for validation, just by passing `validation_split=0.25` to `ImageDataGenerator`. The validation subset then gets 25% of the total images, and each generator selects its part through the `subset` argument.

In [18]:
datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255., validation_split=0.25)

In [19]:
train_generator=datagen.flow_from_dataframe(
    dataframe=train_df,
    directory=str(directory / 'train'),
    x_col="filename",
    y_col="label",
    subset="training",
    batch_size=32,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    target_size=(32,32)
)

Found 37500 validated image filenames belonging to 10 classes.


In [20]:
valid_generator=datagen.flow_from_dataframe(
    dataframe=train_df,
    directory=str(directory / 'train'),
    x_col="filename",
    y_col="label",
    subset="validation",
    batch_size=32,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    target_size=(32,32)
)

Found 12500 validated image filenames belonging to 10 classes.
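The 37500/12500 counts reported above follow directly from `validation_split=0.25` on 50,000 rows. Here is a rough sketch of that arithmetic. It assumes the split is taken by row position, with the validation rows first and the training rows after them, which is my reading of the `keras_preprocessing` implementation rather than documented behavior; `split_bounds` is a hypothetical helper, not a Keras function.

```python
def split_bounds(num_rows, validation_split, subset):
    """Return the (start, stop) row range used for a subset.

    Assumption: rows are split by position, validation first, then training.
    """
    boundary = int(validation_split * num_rows)
    if subset == "validation":
        return 0, boundary
    if subset == "training":
        return boundary, num_rows
    raise ValueError(f"unknown subset: {subset!r}")


num_rows = 50000  # rows in train_df
train_start, train_stop = split_bounds(num_rows, 0.25, "training")
valid_start, valid_stop = split_bounds(num_rows, 0.25, "validation")

train_count = train_stop - train_start
valid_count = valid_stop - valid_start
print(train_count, valid_count)
```

If the split really is positional, a CSV sorted by class would give a validation subset with only some of the classes. In that case shuffling the rows first, for example with `train_df.sample(frac=1, random_state=42)`, would be a reasonable precaution.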


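And the test generator. The test images have no labels, so `y_col` and `class_mode` are `None`, and `shuffle=False` keeps the batches in the same order as the rows of `test_df`.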

In [21]:
test_datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255.)

test_generator=test_datagen.flow_from_dataframe(
    dataframe=test_df,
    directory=str(directory / 'test'),
    x_col="filename",
    y_col=None,
    batch_size=32,
    seed=42,
    shuffle=False,
    class_mode=None,
    target_size=(32,32)
)

Found 300000 validated image filenames.
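Once a model has been trained, its predictions on the test generator come back as rows of 10 scores, and they need to be mapped back to label names. The sketch below shows that mapping in plain Python. It assumes Keras numbers the classes in sorted label order; in practice I would read the mapping from `train_generator.class_indices` instead of rebuilding it. The two score rows are made-up placeholders, not real model output.

```python
labels = ['frog', 'truck', 'deer', 'automobile', 'bird',
          'horse', 'ship', 'cat', 'dog', 'airplane']

# Assumption: class indices follow sorted label order
# (stand-in for train_generator.class_indices).
class_indices = {label: index for index, label in enumerate(sorted(labels))}
index_to_label = {index: label for label, index in class_indices.items()}


def argmax(scores):
    """Return the position of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical prediction rows: one score per class, highest at index 6 and 9.
hypothetical_scores = [
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.91, 0.01, 0.01, 0.01],
    [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.82],
]

predicted_labels = [index_to_label[argmax(row)] for row in hypothetical_scores]
print(predicted_labels)
```

Because the test generator was created with `shuffle=False`, the predicted labels line up row by row with `test_df`, so they can be assigned to a new column and written out as a submission file.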
