# Image Modelling - Pipeline Creation (Python file)
In this notebook we will cover: 
- how to prepare images for training a neural network, using shell commands instead of pandas to do so
- We'll start by preparing the data so that it can be loaded into TensorFlow, then load it and check that everything went fine. 
- Each step will be defined as a function, which we write directly into a Python file. 

In the second notebook we will import and use those functions in order to train a neural network that classifies our pictures.

In [None]:
# Remove any file that gets constructed by the notebook.
## TODO: still needs to be adjusted
!rm -f image_modeling.py ../data/train_noheader_split.csv ../data/test_noheader_split.csv ../data/train_noheader.csv

The following cell defines a cell magic (registered via `register_cell_magic`) which writes the content of a cell into a Python script automatically, while still executing the cell. Mode 'a' (set with the `-a` flag) appends to the file, while the default mode 'w' overwrites all existing lines.

In [None]:
# Let's make some dark cell magic. Why not!
from IPython.core.magic import register_cell_magic

@register_cell_magic
def write_and_run(line, cell):
    argz = line.split()
    file = argz[-1]
    mode = 'w'
    if len(argz) == 2 and argz[0] == '-a':
        mode = 'a'
        print("Appended to file ", file)
    else:
        print('Written to file:', file)
    with open(file, mode) as f:
        f.write(cell.format(**globals()))        
    get_ipython().run_cell(cell)
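
One caveat about the magic above: the cell text is passed through `str.format(**globals())` before it is written, so `{name}` placeholders are replaced by notebook globals and literal braces have to be doubled. The sketch below shows this on a plain string. The names `namespace` and `template` are made up for illustration and stand in for `globals()` and the cell text.

```python
# Illustrative only: mimic what `cell.format(**globals())` does to the cell text.
namespace = {"BATCH_SIZE": 32}  # hypothetical stand-in for globals()

# `{BATCH_SIZE}` is substituted, `{{}}` is written out as a literal `{}`.
template = "batch = {BATCH_SIZE}\nlabels = {{}}\n"
written = template.format(**namespace)
print(written)
```

A cell that contains single literal braces (for example a dict literal or an f-string) would therefore raise an error or be altered when it is written to the file.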

Import needed libraries. `%%write_and_run image_modeling.py` is the call of the register cell magic from above in 'w' mode (default). It writes the imports at the beginning of the `image_modeling.py`.

In [None]:
%%write_and_run image_modeling.py
import pathlib
import IPython.display as display
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import os
import csv
import pandas as pd

Print the tensorflow version and set the threshold for what messages will be logged. 

In [None]:
print(tf.__version__)
tf.compat.v1.logging.set_verbosity(v=tf.compat.v1.logging.INFO)

Set the path to the image folder and count all images. 

In [None]:
# Get paths as POSIX paths
#home_path = str(pathlib.Path.home())
data_dir = '../images'
data_dir = pathlib.Path(data_dir)
print(f'The total number of images is: {len(os.listdir(data_dir))}')

Let's have a look at some images

In [None]:
# Get all turtles images
#turtles = list(data_dir.glob('*'))

#for image in turtles[:2]:
#    display.display(Image.open(str(image)))

## Data preparation using shell commands

Now we will use shell commands to look at the data, clean the paths to the images and split our data into train and evaluation set.

Let's look at the first five entries. 
First we use the [head](https://linuxhint.com/bash_head_tail_command/) command to generate the first five lines of the `train.csv`. Then we redirect the output of the [head](https://linuxhint.com/bash_head_tail_command/) command to the `/tmp/input.csv` via the ['>'](https://www.cs.ait.ac.th/~on/O/oreilly/unix/upt/ch13_01.htm#UPT-ART-1023) operator. We now print the content of this file with the [cat](https://www.interserver.net/tips/kb/linux-cat-command-usage-examples/?__cf_chl_f_tk=sbsfrwcq2e.iPk93oGmvT0LSXdGVW7BuzsZsRhl85GI-1642513145-0-gaNycGzNCOU) command.

In [None]:
# Let us take a look into the training set
!head -5 ../data/train.csv > /tmp/input.csv 
!cat /tmp/input.csv
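
For comparison, the `head -5` step can also be sketched in plain Python. This is only an illustration: it writes a small throwaway csv with placeholder columns (not the real `train.csv` contents) so that it runs on its own.

```python
import itertools
import os
import tempfile

# Create a small throwaway CSV so the example is self-contained.
# Column names and values are placeholders, not the real train.csv contents.
rows = ["col_a,col_b,col_c"] + [f"a{i},b{i},c{i}" for i in range(10)]
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("\n".join(rows) + "\n")
    csv_path = tmp.name

# Equivalent of `head -5`: read only the first five lines of the file.
with open(csv_path) as f:
    first_five = [line.rstrip("\n") for line in itertools.islice(f, 5)]

print("\n".join(first_five))
os.remove(csv_path)
```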

### Change image_id to image_id_path 
To use the images later, the `image_id` within the csv file needs to be changed to the respective path of the image.
As a second step we create a csv file without the header line.

In [None]:
with open("../data/train.csv",'r') as f:
    with open("../data/train_noheader.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)

In [None]:
'''home_path = str(pathlib.Path.home())
df = pd.read_csv("../data/train.csv",header=0)
df["image_id"] = df["image_id"].apply(lambda x: home_path + '/neuefische/Capstone_Project_Turtle_Recall/images/' + x + ".JPG")
df.to_csv("../data/train_jpg.csv",index=False)

with open("../data/train_jpg.csv",'r') as f:
    with open("../data/train_jpg_noheader.csv",'w') as f1:
        next(f) # skip header line
        for line in f:
            f1.write(line)'''

Save a copy of `train_noheader.csv` as `train_noheader_split.csv` and conduct the split: 30 % of the rows are removed from the training split and saved as the test split.

In [None]:
# Get the line count for 30% of the data
with open("../data/train_noheader.csv", newline="") as file:
    reader = csv.reader(file)
    lines = round(len(list(reader)) * 0.3)
print(lines)

In [None]:
%%bash
cat ../data/train_noheader.csv > ../data/train_noheader_split.csv 
sort -R ../data/train_noheader_split.csv --random-source=random.seed | split -l $(( $(wc -l <../data/train_noheader_split.csv) - 644)) - ../data/train_noheader_split

mv ../data/train_noheader_splitaa ../data/train_noheader_split.csv
mv ../data/train_noheader_splitab ../data/test_noheader_split.csv
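
The shell cell above hard-codes the size of the test split. As a rough alternative, the same kind of reproducible 70/30 split can be sketched in plain Python. The helper `split_rows` and the toy rows are hypothetical and only illustrate the idea; the result will not match the shell split row by row.

```python
import random

def split_rows(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy rows standing in for the lines of train_noheader.csv.
toy_rows = [f"row_{i}" for i in range(10)]
train_rows, test_rows = split_rows(toy_rows)
print(len(train_rows), len(test_rows))
```

Deriving the test size from the row count avoids having to update a fixed number whenever the input file changes.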

Check if new csv files have correct number of lines.

In [None]:
%%bash
wc -l ../data/train_noheader_split.csv
wc -l ../data/test_noheader_split.csv

### Next we want to extract the labels from `train_noheader_split.csv`. 

- [awk](https://www.geeksforgeeks.org/awk-command-unixlinux-examples/) lets you, among other things, select fields of a line; by default fields are separated by whitespace.
- [uniq](https://linuxhint.com/bash_uniq_command/) removes adjacent duplicate lines from a file.

Why is the `sort` command used? --> `uniq` only removes duplicates on adjacent lines, so all duplicates have to be next to each other first.
Why are the ',' replaced with whitespace? --> By default `awk` splits fields on whitespace. (Alternatively, `awk -F,` splits on commas directly.)

Furthermore we check the number of unique labels in the train split and in the test split.

In [None]:
# Extract the labels from the train data 
!cat ../data/train_noheader_split.csv | sed 's/,/ /g' | awk '{print $3}' | sort | uniq > /tmp/labels.txt
#!cat /tmp/labels.txt
!wc -l /tmp/labels.txt

In [None]:
# Extract the labels from the test data 
!cat ../data/test_noheader_split.csv | sed 's/,/ /g' | awk '{print $3}' | sort | uniq > /tmp/labels.txt
#!cat /tmp/labels.txt
!wc -l /tmp/labels.txt

Great! With our random seed = 42 we have all unique turtle IDs in both the train and the test file.
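
The `sed | awk | sort | uniq` pipeline can also be sketched in Python, which makes it easy to compare the label sets directly. The helper `unique_labels` and the toy rows are made up for illustration; they only assume the three-column layout used above, with the turtle ID in the third column.

```python
import csv
import io

def unique_labels(csv_text, column=2):
    """Return the sorted unique values of one column of header-less CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    # A set drops duplicates, sorted() gives a stable order like `sort | uniq`.
    return sorted({row[column] for row in reader if row})

# Made-up rows in the layout: image id, image location, turtle id.
toy_csv = "img_1,left,t_b\nimg_2,top,t_a\nimg_3,right,t_b\n"
labels = unique_labels(toy_csv)
print(labels)
```

Applied to the contents of both split files, comparing the two returned lists would show whether every turtle ID occurs in both.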

## Define functions to process the data

From now on we will use Python and TensorFlow to define some variables and functions to be used in the second notebook when we train our CNN to classify images of turtles.

We set some parameters for the model and call the cell magic `write_and_run` again, this time with the `-a` flag. This makes sure that the content of the cell is appended to `image_modeling.py` and existing lines are not overwritten.

In [None]:
%%write_and_run -a image_modeling.py

# Get unique_turtle_ids from train.csv (without header)
with open("../data/train_noheader.csv", newline="") as file:
    reader = csv.reader(file)
    # The turtle ID is the third column of each row.
    turtle_ids = [row[2] for row in reader]
# dict.fromkeys removes duplicates while keeping the first-seen order.
unique_turtle_ids = list(dict.fromkeys(turtle_ids))

# We set some parameters for the model
HEIGHT = 224 #image height
WIDTH = 224 #image width
CHANNELS = 3 #image RGB channels
CLASS_NAMES = unique_turtle_ids
NCLASSES = len(CLASS_NAMES)
BATCH_SIZE = 32
SHUFFLE_BUFFER = 10 * BATCH_SIZE
AUTOTUNE = tf.data.experimental.AUTOTUNE

VALIDATION_SIZE = 370
VALIDATION_STEPS = VALIDATION_SIZE // BATCH_SIZE
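
A quick check of what the floor division in `VALIDATION_STEPS` means in practice, using the values from the cell above. It assumes that evaluation later runs for exactly `VALIDATION_STEPS` batches, which is decided in the second notebook and not here.

```python
VALIDATION_SIZE = 370
BATCH_SIZE = 32

# Floor division: number of full batches that fit into the validation set.
validation_steps = VALIDATION_SIZE // BATCH_SIZE
images_covered = validation_steps * BATCH_SIZE
images_left_over = VALIDATION_SIZE - images_covered
print(validation_steps, images_covered, images_left_over)
```

Under that assumption, the last partial batch would not be part of the evaluation.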

### Save image as float32 for Tensorflow
We define a function which takes the jpeg image and returns the image in a format which can be used by Tensorflow. We also write it to the end of `image_modeling.py`.

In [None]:
%%write_and_run -a image_modeling.py

# Define the function that decodes the images
def decode_image(image, reshape_dim):
    # JPEG is a compressed image format, so we first
    # decode it into a tensor we can compute with.
    image = tf.image.decode_jpeg(image, channels=CHANNELS)
    # 'decode_jpeg' returns a tensor of type uint8. The model
    # needs 32-bit floats, and we want them to be in
    # the [0,1] interval.
    image = tf.image.convert_image_dtype(image, tf.float32)
    # Now we can resize to the desired size.
    image = tf.image.resize(image, reshape_dim)
    
    return image
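
To make the scaling step more concrete: for `uint8` input, `tf.image.convert_image_dtype(image, tf.float32)` maps the range 0..255 onto [0, 1]. The pure-Python sketch below reproduces only that arithmetic on a few pixel values. The helper `to_unit_interval` is hypothetical and not part of TensorFlow.

```python
def to_unit_interval(pixel_values):
    """Scale 8-bit pixel values (0..255) to floats in [0, 1]."""
    return [value / 255.0 for value in pixel_values]

# Black, mid-grey and white as 8-bit values.
scaled = to_unit_interval([0, 128, 255])
print(scaled)
```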

Let us look at the result of the `decode_image` function.

In [None]:
# Let us test our decoding function.
img = tf.io.read_file('../images/ID_0014D1K8.JPG')    

# Decode the image with the function defined above
img = decode_image(img, [224,224])
img

We need a function that takes a csv row containing an image id and its class and returns the raw image bytes together with a label vector. The label vector is true at the position of the image's class within CLASS_NAMES and false everywhere else (one-hot encoding). The `decode_dataset` function does this for us. It will be used later and is written to the end of `image_modeling.py`.

In [None]:
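# Parent of the notebooks folder, without the leading "/" (it is re-added in decode_dataset).
# Note: str.strip("/notebooks") strips a set of characters, not the suffix.
Capstone_directory = os.path.dirname(os.getcwd()).lstrip("/")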

In [None]:
%%write_and_run -a image_modeling.py

# The train set actually gives only the paths to the training images.
# We want to create a dataset of training images, so we need a 
# function that can handle this for us.
def decode_dataset(data_row):
    record_defaults = ['path', 'image_location', 'turtle_id']
    filename, image_location_string, turtle_id_string = tf.io.decode_csv(data_row, record_defaults)
    image_bytes = tf.io.read_file(filename="/"+Capstone_directory+"/"+"images/"+filename+".JPG")
    turtle_id = tf.math.equal(turtle_id_string, CLASS_NAMES)
    return image_bytes, turtle_id
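
The comparison `tf.math.equal(turtle_id_string, CLASS_NAMES)` checks the label against every class name and thereby produces the one-hot vector. A minimal pure-Python sketch of the same idea, with the hypothetical helper `one_hot` and made-up class names:

```python
def one_hot(label, class_names):
    """Return a boolean list that is True only at the position of `label`."""
    return [name == label for name in class_names]

# Hypothetical class names; the notebook uses the unique turtle IDs instead.
toy_classes = ["t_a", "t_b", "t_c"]
encoded = one_hot("t_b", toy_classes)
print(encoded)
```

A label that is missing from the class list would yield a vector with no true entry at all.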

In the next cell you can see what a TensorFlow dataset looks like. TensorFlow datasets are iterable objects, and we can use `tf.io.decode_csv` to unpack each line into an image id, an image location and a turtle id.

In [None]:
dataset = tf.data.TextLineDataset('../data/train.csv')
it = iter(dataset)
record_defaults = ['path', 'image_location', 'turtle_id'] # defines dtype
# The defaults only fix the dtypes: decode_csv returns three strings here. Any three strings would give the same outcome, but not e.g. [1, 'a', 'b'].
filename, image_location_string, turtle_id_string = tf.io.decode_csv(next(it), record_defaults)
filename, image_location_string, turtle_id_string

### Augmentation (to improve)

In [None]:
%%write_and_run -a image_modeling.py

# Next we construct a function for pre-processing the images.
def read_and_preprocess(image_bytes, label, augment_randomly=False):
    if augment_randomly: 
        # Resize slightly larger than the target, then cut out a random
        # HEIGHT x WIDTH window. tf.image.random_crop draws a new offset for
        # every image; Python's random module would be evaluated only once,
        # when dataset.map traces this function.
        image = decode_image(image_bytes, [HEIGHT + 8, WIDTH + 8])
        image = tf.image.random_crop(image, [HEIGHT, WIDTH, CHANNELS])
    else:
        image = decode_image(image_bytes, [HEIGHT, WIDTH])
    return image, label

def read_and_preprocess_with_augmentation(image_bytes, label): 
    return read_and_preprocess(image_bytes, label, augment_randomly=True)
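
The augmentation idea is to resize the image a few pixels larger than the target and then cut out a randomly placed window of the target size. The pure-Python sketch below shows that cropping step on a tiny nested list. The helper `random_crop` and the 6x6 toy image are made up for illustration and are not the TensorFlow implementation.

```python
import random

def random_crop(image, out_h, out_w, rng):
    """Cut a randomly placed out_h x out_w window out of a 2-D nested list."""
    max_top = len(image) - out_h
    max_left = len(image[0]) - out_w
    top = rng.randint(0, max_top)    # inclusive on both ends
    left = rng.randint(0, max_left)
    return [row[left:left + out_w] for row in image[top:top + out_h]]

# Toy 6x6 "image" cropped to 4x4; the notebook crops 232x232 down to 224x224.
toy_image = [[r * 6 + c for c in range(6)] for r in range(6)]
crop = random_crop(toy_image, 4, 4, random.Random(0))
print(len(crop), len(crop[0]))
```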

Finally we can define a function that loads and preprocesses our data by combining the functions defined above. 
- the `load_dataset` function applies (`map`) `decode_dataset` to every element in the dataset.
- for training:
    - the data is passed through the augmentation step (`read_and_preprocess_with_augmentation`).
    - then the data is [shuffled](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle) to reduce the risk of creating batches that are not representative of the overall dataset. See the second answer in [this](https://datascience.stackexchange.com/questions/24511/why-should-the-data-be-shuffled-for-machine-learning-tasks) thread for more details.
    - we can go through the dataset infinitely often (`repeat(count=None)`).
- for evaluation:
    - the data is neither shuffled nor augmented, just read and preprocessed.
    - we only need to go through the whole dataset once, hence `repeat(count=1)`. Iteration simply stops when the end is reached.
- finally, batches of size `batch_size` are produced with each iteration step.
- with `prefetch(buffer_size=AUTOTUNE)` an optimized number of batches is prepared while prior ones are trained on.
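
The effect of `repeat` and `batch` described above can be tried out on a toy dataset before touching the images. The following is a small sketch and not the pipeline itself: it assumes TensorFlow 2.x in eager mode and uses `tf.data.Dataset.range(5)` as a stand-in for the csv rows.

```python
import tensorflow as tf

# Toy stand-in for the csv rows: the integers 0..4.
toy = tf.data.Dataset.range(5)

# Evaluation-style pipeline: one pass, no shuffling, batches of 2.
eval_ds = toy.repeat(count=1).batch(2)
eval_batches = [batch.numpy().tolist() for batch in eval_ds]
print(eval_batches)

# Training-style pipeline: shuffle, repeat forever, batches of 2.
# Because the dataset is infinite, we take a fixed number of batches.
train_ds = toy.shuffle(buffer_size=5, seed=0).repeat(count=None).batch(2)
train_batches = [batch.numpy().tolist() for batch in train_ds.take(4)]
print(train_batches)
```

The evaluation pipeline ends with a smaller final batch, while the infinitely repeated training pipeline keeps producing full batches and has to be limited by a number of steps.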

In [None]:
%%write_and_run -a image_modeling.py

# Now we can create the dataset.
def load_dataset(file_of_filenames, batch_size, training=True):
    # We create a TensorFlow Dataset from the list of files.
    # This dataset does not load the data into memory, but instead
    # pulls batches one after another.
    dataset = tf.data.TextLineDataset(filenames=file_of_filenames).\
        map(decode_dataset)
    
    if training: #Use augmentation
        dataset = dataset.map(read_and_preprocess_with_augmentation).\
            shuffle(SHUFFLE_BUFFER).\
            repeat(count=None) # Infinite iterations
    else: 
        # Evaluation or testing
        dataset = dataset.map(read_and_preprocess).\
            repeat(count=1) # One iteration
            
    # The dataset will produce batches of BATCH_SIZE and will
    # automatically prepare an optimized number of batches while the prior one is
    # trained on.
    return dataset.batch(batch_size).prefetch(buffer_size=AUTOTUNE)


#### Let us see how the `load_dataset` function works:

In [None]:
# Let us see, if data loading works as intended.
train_path = '../data/train_noheader.csv'
train_data = load_dataset(train_path, 1)
# Create an iterator that runs over the training dataset.
it = iter(train_data)

In [None]:
# Iterate and see the pictures and labels
img_batch, labels = next(it)
image = img_batch[0]
plt.imshow(image)
print(labels[0])