## 1- Subset Images

## -->Helper Code from Previous Notebooks Below

## 2- Preprocessing

Parts 2 and 3:
- use shell commands to prepare the data so that it can be loaded into TensorFlow,
- split the data into a train and a validation set (can be skipped if the data are already in separate folders),
- load the data,
- define functions that let us directly import the pictures and the corresponding class labels,
- define functions for data augmentation (Note: data augmentation artificially enlarges the training set by creating modified copies of existing data. It is done to reduce overfitting. It includes making minor changes to the dataset or using deep learning to generate new data points.),
- define a function that automatically writes the code of other cells to a Python file.
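
As a rough illustration of what "modified copies" means, the sketch below applies two common augmentations, a horizontal flip and a random crop, to a tiny image stored as a nested list. It is plain Python rather than the TensorFlow code used later in this notebook, and the helper names `flip_horizontal` and `random_crop` are hypothetical.

```python
import random

def flip_horizontal(image):
    """Mirror every row of a 2D image (list of rows)."""
    return [list(reversed(row)) for row in image]

def random_crop(image, crop_height, crop_width, rng):
    """Cut a crop_height x crop_width window at a random position."""
    top = rng.randint(0, len(image) - crop_height)
    left = rng.randint(0, len(image[0]) - crop_width)
    return [row[left:left + crop_width] for row in image[top:top + crop_height]]

# toy 4x4 "image" with pixel values 0..15
image = [[4 * r + c for c in range(4)] for r in range(4)]
rng = random.Random(0)  # fixed seed so the example is repeatable

flipped = flip_horizontal(image)
cropped = random_crop(image, 3, 3, rng)
print(flipped[0])                     # first row, mirrored
print(len(cropped), len(cropped[0]))  # crop is 3 x 3
```

Each call produces a slightly different training example from the same source picture, while the class label stays unchanged.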

Below is how the code was adapted from Notebook 1 and Notebook 2 of Image_Modeling:

- `flowers_train.csv` replaced with `buildings_train.csv`
- `flowers_eval.csv` replaced with `buildings_validation.csv`
- `flowers_test.csv` replaced with `buildings_test.csv`
- `Image_Modeling.py` replaced with `Building_Damage-Python.py`




--!Check what the following cell does

In [None]:
# Remove any file that gets constructed by the notebook.
!rm -f Building_Damage-Python.py buildings_train.csv buildings_validation.csv buildings_test.csv

In [None]:
# Below code helps us to register the consecutive cells automatically into a
# python script whenever we put %%write_and_run Building_Damage-Python.py at the 
# beginning of that cell
from IPython.core.magic import register_cell_magic

@register_cell_magic
def write_and_run(line, cell):
    """write python code into file and execute it as well"""

    argz = line.split()
    file = argz[-1]
    mode = 'w'
    if len(argz) == 2 and argz[0] == '-a':
        mode = 'a'
        print("Appended to file ", file)
    else:
        print('Written to file:', file)
    with open(file, mode) as f:
        f.write(cell.format(**globals()))        
    get_ipython().run_cell(cell)
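
Two details of `write_and_run` are easy to miss: the file is overwritten unless `-a` is given, and the cell text passes through `str.format(**globals())` before it is written. The sketch below reproduces both behaviours in plain Python; `parse_magic_line` is a hypothetical helper that only mirrors the argument handling above, and the template string is made up.

```python
def parse_magic_line(line):
    """Return (filename, mode) the way write_and_run reads its arguments."""
    args = line.split()
    filename = args[-1]
    mode = 'a' if len(args) == 2 and args[0] == '-a' else 'w'
    return filename, mode

print(parse_magic_line('Building_Damage-Python.py'))
print(parse_magic_line('-a Building_Damage-Python.py'))

# str.format fills {placeholders} and turns doubled braces into single ones
template = "HOME = '{home_path}'\nEMPTY = {{}}\n"
written = template.format(home_path='/home/user')
print(written)
```

A consequence is that a cell containing single literal braces (for example a dict literal) will most likely fail in the `format` step, so cells written with this magic are best kept free of braces.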

In [None]:
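%%write_and_run Building_Damage-Python.py
# load libraries (also written to the script, because the functions
# appended to it later rely on these imports)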
import pathlib
import IPython.display as display
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

#load more libraries here if necessary

In [None]:
# check tensorflow version and set verbosity
print(tf.__version__)
tf.compat.v1.logging.set_verbosity(v=tf.compat.v1.logging.INFO)

In [None]:
#loading the data ?

# Note: in Nt1 of Image_Modeling repo, the data was downloaded. There is 
# code for that in that nt

!Check what the cell below does

In [None]:
# Adjust the code below for this notebook

# Get paths as POSIX paths
home_path = str(pathlib.Path.home())
data_dir = home_path + '/.keras/datasets/flower_photos' #adjust here
data_dir = pathlib.Path(data_dir)

# count available images
image_count = len(list(data_dir.glob('*/*.tiff')))
print("We have", image_count, "images.")

# Get classes
CLASS_NAMES = np.array(
    [item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"] # adjust here
)
print("We have the following classes in the data: ", CLASS_NAMES)

In [None]:
# count number of images per class
for cls in CLASS_NAMES:
    image_count = len(list(data_dir.glob(cls+'/*.tiff')))
    print("We have", image_count, cls, "images.")

In [None]:
# Get all images from one label. Adjust the code below.
# We can write a for loop to check 1 image per class label
label1 = list(data_dir.glob('label1/*')) # replace label1

# display 3 images
for image in label1[:3]: # replace label1
    display.display(Image.open(str(image)))

## 3- Data Preparation Using Shell Commands

### 3.1- Reaching at the labels

--Adjust the text and the code below

`buildings_train_generic.csv` contains the paths to the images and their respective classes. Let's look at the first five entries. First we use the [head](https://linuxhint.com/bash_head_tail_command/) command to select the first five lines of the file. Then we redirect the output of `head` to `/tmp/input.csv` via the ['>'](https://www.cs.ait.ac.th/~on/O/oreilly/unix/upt/ch13_01.htm#UPT-ART-1023) operator. Because the redirect sends everything to the file, nothing is left on standard output, so piping into `cat` with '|' would print nothing. Instead we run the [cat](https://www.interserver.net/tips/kb/linux-cat-command-usage-examples/) command as a second command, chained with `&&`, to print the content of the file.

In [None]:
# Let us take a look into the training set - inspect the first 5 lines with 'head'
!head -5 labels.csv > /tmp/input.csv && cat /tmp/input.csv

# This is the same thing as:
# !head -5 labels.csv > /tmp/input.csv 
# !cat /tmp/input.csv
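
For readers less familiar with the shell, the same inspection step can be sketched in plain Python. The helper `head` and the demo file with made-up rows are assumptions for illustration only; they are not part of the notebook's pipeline.

```python
from itertools import islice
from pathlib import Path
import tempfile

def head(path, n=5):
    """Return the first n lines of a text file without reading the rest."""
    with open(path) as f:
        return [line.rstrip('\n') for line in islice(f, n)]

# demo on a throwaway file with made-up rows
demo = Path(tempfile.mkdtemp()) / 'labels_demo.csv'
demo.write_text(''.join(f'~/images/img_{i}.tiff,label{i % 2}\n' for i in range(8)))

first_rows = head(demo)
for row in first_rows:
    print(row)
```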

### 3.2- Correcting the name of the labels for tensor flow

`If the result of the above code contains '~', continue here:`

Unfortunately the TensorFlow Dataset API cannot handle '~'. So we write new files that contain the user path instead of '~', named `buildings_train.csv` and `buildings_test.csv`.

- first we send the content of `buildings_train_generic.csv` to standard output with the cat command.
- the [pipe operator](https://www.cs.ait.ac.th/~on/O/oreilly/unix/upt/ch13_01.htm#UPT-ART-1023) '|' uses the output of what is left of it as the input for what is right of it.
- so we use the output of cat as the input of the [sed](https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/?ref=lbp) command. Note: here the delimiter is '+' instead of the usual '/', because the replacement is a path that itself contains '/'.
- the 's' in `s+~+{home_path}+g` specifies the substitution operation and 'g' stands for globally, so all occurrences will be replaced.
- finally sed `s+~+{home_path}+g` replaces all '~' with the value of the home_path variable and the output is written to a new file via '>'.
- `home_path` is the variable we defined above.
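
The substitution itself is simple enough to sketch in plain Python, which may help when checking what `sed` is expected to produce. The function name `replace_tilde` and the sample row are made up for this illustration.

```python
from pathlib import Path

def replace_tilde(line, home_path):
    """Replace every '~' in a line with home_path, like sed 's+~+HOME+g'."""
    return line.replace('~', home_path)

home_path = str(Path.home())
row = '~/images/img_0.tiff,label1'  # made-up row
expanded = replace_tilde(row, home_path)
print(expanded)
```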

In [None]:
# TensorFlow Dataset API cannot work with the '~' character,
# so we replace it with the specific user path using 'sed'
# and save it in another file using '>' operator
!cat buildings_train_generic.csv | sed 's+~+{home_path}+g' > buildings_train.csv # adjust
!cat buildings_test_generic.csv | sed 's+~+{home_path}+g' > buildings_test.csv # adjust


In [None]:
#check first 5 elements in the train
!head -5 buildings_train.csv

In [None]:
#check first 5 elements in the test
!head -5 buildings_test.csv

### 3.3- Train-Validation (Holdout) Split

`if the data is already split, you can skip this step`

Now the paths in `buildings_train.csv` and `buildings_test.csv` are in the path format needed for Tensorflow.

To be able to see whether our model is overfitting, we need to split our train set again into a train set and an evaluation set. To make results comparable we want to split randomly with a seed specified by the file `random.seed`.

- first we use the `sort -R` command to shuffle the lines in `buildings_train.csv`. `--random-source=random.seed` sets the random seed.
- then we use this output as input for the [split -l](https://www.geeksforgeeks.org/split-command-in-linux-with-examples/) command via pipe ('|').
- `split -l` then takes the pseudo-randomized data and splits it into two files after a specified number of lines.
- the [wc -l](https://www.geeksforgeeks.org/wc-command-linux-examples/) command counts the number of lines in a file.
- So finally we get a train file with `x` lines and an evaluation file with `y` lines.
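
The idea behind the split can be sketched in plain Python: shuffle with a fixed seed, then cut the list at a fixed position. This is only an approximation of the shell pipeline (it does not keep duplicate lines together the way `sort -R` does), and `holdout_split` and the sample rows are hypothetical.

```python
import random

def holdout_split(lines, validation_size, seed):
    """Shuffle lines reproducibly and split off the last validation_size lines."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - validation_size
    return shuffled[:cut], shuffled[cut:]

rows = [f'img_{i}.tiff,label{i % 3}' for i in range(10)]  # made-up rows
train, validation = holdout_split(rows, validation_size=3, seed=42)
print(len(train), len(validation))
```

Running the function twice with the same seed gives the same two lists, which is what makes results comparable between runs.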

>__Exercise__: use the `wc -l` command to check the length of the two files in your terminal

Note that due to the `%%bash` on the first line of the cell we can omit the usual '!' in front of the shell commands.

In [None]:
%%bash
# adjust the code below ('buildings' is the common prefix of our csv files)
# Let us also create an evaluation set from the train set.
# We want to have a fixed seed.
# NOTE: 'sort -R' is used to shuffle the data. If there are
# duplicates it sorts them next to each other, which we 
# want to have, since we want to avoid evaluation leakage.

echo "files before splitting:"
ls buildings*
sort -R buildings_train.csv --random-source=random.seed | split -l $(( $(wc -l <buildings_train.csv) - 370)) - buildings_train

echo # print empty line
echo "files after splitting:"
ls buildings*
# split writes its output into two files automatically: the suffixes aa and ab
# mark the first part (train) and the second part (validation)
mv buildings_trainaa buildings_train.csv 
mv buildings_trainab buildings_validation.csv
echo; echo "files after renaming:"
ls buildings*

In [None]:
%%bash
# count lines in all csv files
wc -l *.csv

### 3.4- Extract the Labels from Train Set

Next we want to extract the labels from the `buildings_train.csv`. 

- [awk](https://www.geeksforgeeks.org/awk-command-unixlinux-examples/) lets you, amongst other things, select fields separated by white spaces in a file.
- [uniq](https://linuxhint.com/bash_uniq_command/) removes adjacent duplicate lines from a file.
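
Because `uniq` only removes adjacent duplicates, the labels have to be sorted first. The same extraction can be sketched in plain Python with a set, which removes duplicates regardless of order. The helper `unique_labels` and the sample rows are assumptions for illustration.

```python
import csv
import io

def unique_labels(csv_text):
    """Collect the sorted set of values in the second CSV column."""
    reader = csv.reader(io.StringIO(csv_text))
    return sorted({row[1] for row in reader if len(row) >= 2})

sample = "a.tiff,label2\nb.tiff,label1\nc.tiff,label2\n"  # made-up rows
labels = unique_labels(sample)
print(labels)
```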


In [None]:
# Extract the labels from the train data 
#########################################
# cat: read file content 
# sed: replace comma by space 
# awk: extract column 2 separated by space 
# sort: sort labels
# uniq: extract unique labels 
# > write data to file
!cat buildings_train.csv | sed 's/,/ /g' | awk '{print $2}' | sort | uniq > /tmp/labels.txt #adjust
# cat: display file content 
!cat /tmp/labels.txt

### 3.5- Define Functions to Process the Data (Decoding and Loading the Data)

We set some parameters for the model and call the registered cell magic `write_and_run` again, this time with the `-a` flag. This makes sure that the content of the cell is appended to `Building_Damage-Python.py` and existing lines are not overwritten.

`tf.data` builds a performance model of the input pipeline and runs an optimization algorithm to find a good allocation of its CPU budget across all parameters specified as `AUTOTUNE`. While the input pipeline is running, `tf.data` tracks the time spent in each operation, so that these times can be fed into the optimization algorithm.

The [OptimizationOptions](https://www.tensorflow.org/api_docs/python/tf/data/experimental/OptimizationOptions) object gives some control over how autotune will behave.

In [None]:
%%write_and_run -a Building_Damage-Python.py

# adjust the code below 

# We set some parameters for the model
# image characteristics
HEIGHT = 1024 #image height
WIDTH = 1024 #image width
CHANNELS = 3 #image RGB channels

# label characteristics
CLASS_NAMES = ['label1', 'label2', 'label3', 'label4', 'label5'] # put all the label names
NCLASSES = len(CLASS_NAMES)

# algorithmic parameters
BATCH_SIZE = 10 #adjust
SHUFFLE_BUFFER = 10 * BATCH_SIZE #adjust
AUTOTUNE = tf.data.experimental.AUTOTUNE # lets tf.data pick buffer sizes at runtime

VALIDATION_SIZE = 370 # rows in buildings_validation.csv (370 are split off in 3.3); adjust
VALIDATION_STEPS = VALIDATION_SIZE // BATCH_SIZE

In [None]:
%%write_and_run -a Building_Damage-Python.py

# Define the function that decodes the images--adjusted for tiff images
# NOTE: TensorFlow core has no TIFF decoder. This version assumes that the
# separate tensorflow-io package is installed (pip install tensorflow-io).
import tensorflow_io as tfio

def decode_image(image, reshape_dim):
    """Decode the raw bytes of a TIFF file and resize the image.

    Args:
        image (tf.Tensor): scalar string tensor, e.g. from tf.io.read_file(<filename>)
        reshape_dim (list): list with height and width

    Returns:
        float32 tensor representation of the image with values in [0, 1]
    """

    # decode_tiff returns a uint8 tensor of shape [height, width, 4] (RGBA).
    image = tfio.experimental.image.decode_tiff(image)
    # keep only the first CHANNELS channels (RGB) and drop the alpha channel
    image = image[..., :CHANNELS]
    # The model needs 32-bit floats. convert_image_dtype also rescales
    # uint8 values to the [0,1] interval, so the conversion is needed
    # for tiff as well.
    image = tf.image.convert_image_dtype(image, tf.float32)
    # Now we can resize to the desired size.
    image = tf.image.resize(image, reshape_dim)

    return image

In [None]:
# test the decoding function
img = tf.io.read_file(home_path + '/xxx.tiff')  #adjust
print(type(img), "\n")

# TODO: take the function above and decode the image
decode_image(img,[HEIGHT, WIDTH])

We need a function that takes a row containing a path and a class and returns the raw image bytes together with a label vector that is true at the position of the image's class, as defined by CLASS_NAMES above, and false otherwise (one-hot encoding). The `decode_dataset` function does this for us. It will be used later and is appended to the end of `Building_Damage-Python.py`.
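
To make the one-hot encoding concrete, here is a minimal plain-Python sketch of the comparison of one label against every class name. The class names are placeholders and `one_hot` is a hypothetical helper, not part of the script.

```python
CLASS_NAMES = ['label1', 'label2', 'label3']  # placeholder names

def one_hot(label, class_names):
    """Return a list that is True only at the position of label."""
    return [label == name for name in class_names]

encoded = one_hot('label2', CLASS_NAMES)
print(encoded)
```

Note that a label that does not appear in `CLASS_NAMES` (for example because of a typo in the csv file) yields a vector that is false everywhere, without raising an error.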

In [None]:
%%write_and_run -a Building_Damage-Python.py

# check again what this cell does

# The train set actually gives only the paths to the training images.
# We want to create a dataset of training images, so we need a 
# function that can handle this for us.
def decode_dataset(data_row):
    # extract image path and class label for a single row from csv
    record_defaults = ['path', 'class']
    # read row of csv
    filename, label_string = tf.io.decode_csv(data_row, record_defaults)
    # read image into class 'tensorflow.python.framework.ops.EagerTensor'
    image_bytes = tf.io.read_file(filename=filename)
    # read label - looks like: Tensor("Equal:0", shape=(5,), dtype=bool)
    label = tf.math.equal(label_string, CLASS_NAMES)
    return image_bytes, label

In the next cell you can see what a TensorFlow dataset looks like. TensorFlow datasets are iterable objects, and we can use `tf.io.decode_csv` to unpack each line into a path and a class label.

In [None]:
# check again what this cell does

# read training data for iteration and decode the csv to get filename and label
dataset = tf.data.TextLineDataset('buildings_train.csv')
it = iter(dataset)

# unpack tensorflow object content into file path and class label string
record_defaults = ['path', 'class'] # defines dtype
# output dtype of decode_csv will be two strings
# could have written ['chicken','egg'] with same outcome. But not e.g. [1,'class'].
filename, label_string = tf.io.decode_csv(next(it), record_defaults)
filename, label_string

### 3.6- Data Augmentation

In [None]:
%%write_and_run -a Building_Damage-Python.py

# Next we construct a function for pre-processing the images.
def read_and_preprocess(image_bytes, label, augment_randomly=False):
    """randomly transform image, if augment_randomly == True"""
    # transform image randomly
    if augment_randomly:
        # increase image size 
        image = decode_image(image_bytes, [HEIGHT + 8, WIDTH + 8])
        # TODO: Augment the image.
        # randomly crop image 
        image = tf.image.random_crop(image, size=[HEIGHT, WIDTH, CHANNELS])
    # use original image 
    else:
        image = decode_image(image_bytes, [HEIGHT, WIDTH])
    return image, label

def read_and_preprocess_with_augmentation(image_bytes, label): 
    """read images and augment randomly"""
    return read_and_preprocess(image_bytes, label, augment_randomly=True)

### 3.7- Loading the Data with the Labels

Finally we can define a function that loads and preprocesses our data by combining the functions defined above. 
- the `load_dataset` function applies (`map`) the `decode_dataset` function to every element in the dataset.
- for training:
    - the data should use your augmentation implementation (`#TODO`).
    - then the data will be [shuffled](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle) to avoid creating batches that are not representative of the overall dataset. See the second answer in [this](https://datascience.stackexchange.com/questions/24511/why-should-the-data-be-shuffled-for-machine-learning-tasks) thread for more details.
    - we can go through the dataset an unlimited number of times (`repeat(count=None)`).
- for evaluation:
    - the data is neither shuffled nor augmented, just read and preprocessed.
    - we only need to go through the whole dataset once, hence `repeat(count=1)`. Iteration simply stops when the end is reached.
- finally batches of size `batch_size` will be produced with each iteration step.
- with `prefetch(buffer_size=AUTOTUNE)` an optimized number of batches is prepared while prior ones are trained on.
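
The interplay of repeating and batching can be sketched with plain Python iterators. This is only an analogy to the `tf.data` pipeline, not its implementation: `batches` is a hypothetical helper, and `itertools.cycle` stands in for `repeat(count=None)`.

```python
from itertools import cycle, islice

def batches(items, batch_size):
    """Yield consecutive lists of up to batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# one pass over five elements: the last batch is smaller
single_pass = list(batches(range(5), 2))
print(single_pass)

# endless repetition: we have to decide ourselves when to stop
endless = batches(cycle(range(3)), 2)
first_three = list(islice(endless, 3))
print(first_three)
```

The second part shows why the number of training steps has to be given explicitly later on: a repeated dataset never signals that it is exhausted.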

In [None]:
%%write_and_run -a Building_Damage-Python.py

# Now we can create the dataset.
def load_dataset(file_of_filenames, batch_size, training=True):
    # We create a TensorFlow Dataset from the list of files.
    # This dataset does not load the data into memory, but instead
    # pulls batches one after another.
    dataset = tf.data.TextLineDataset(filenames=file_of_filenames).\
        map(decode_dataset)
    
    # important: augmentation only used during training!
    if training:
        # TODO: Use augmentation here.
        dataset = dataset.map(read_and_preprocess_with_augmentation).\
            shuffle(SHUFFLE_BUFFER).\
            repeat(count=None) # Infinite iterations

        # # previous function that got replaced by another with augmentation
        # dataset = dataset.map(read_and_preprocess).\
        #     shuffle(SHUFFLE_BUFFER).\
        #     repeat(count=None) # Infinite iterations
    
    else: 
        # Evaluation or testing
        dataset = dataset.map(read_and_preprocess).\
            repeat(count=1) # One iteration
            
    # The dataset will produce batches of BATCH_SIZE and will automatically
    # prepare an optimized number of batches while the prior one is trained on.
    return dataset.batch(batch_size).prefetch(buffer_size=AUTOTUNE)

In [None]:
# Use load_dataset function 

# check if data loading works as intended.
train_path = 'buildings_train.csv'
train_data = load_dataset(train_path, 1)
# Create an iterator that runs over the training dataset.
it = iter(train_data)

In [None]:
# Iterate and see the pictures and labels
# use 'next' to go to next image (as 'it' is an iterator that runs over the training dataset)
img_batch, labels = next(it)
# show random image
image = img_batch[0]
plt.imshow(image)
print(labels[0])

In [None]:
# show another image
img_batch, labels = next(it) 
image = img_batch[0]
plt.imshow(image)
print(labels[0])

## 4- Check Parameters from Python File before Training

In [None]:
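# Import required packages 
# import tensorflow as tf --> already imported in part2
import importlib
import tensorflow_hub as hub
import datetime

# import the Building_Damage-Python.py file created above. The hyphen in the
# file name rules out a plain `import` statement, so we use importlib and
# keep the alias image_modeling that the cells below refer to.
image_modeling = importlib.import_module('Building_Damage-Python')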

# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
# run this in case you need to clear any logs from previous runs
# !rm -rf ./logs/

In [None]:
# Import variables from Building_Damage-Python.py file
HEIGHT = image_modeling.HEIGHT
WIDTH = image_modeling.WIDTH
NCLASSES = image_modeling.NCLASSES
CLASS_NAMES = image_modeling.CLASS_NAMES
BATCH_SIZE = image_modeling.BATCH_SIZE
TRAINING_SIZE = !wc -l < buildings_train.csv
TRAINING_STEPS = int(TRAINING_SIZE[0]) // BATCH_SIZE

# set paths to data sets
TRAIN_PATH = 'buildings_train.csv'
EVAL_PATH = 'buildings_validation.csv'
TEST_PATH = 'buildings_test.csv'
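
The step counts above rely on floor division, so a final partial batch is not counted. A minimal sketch with made-up numbers (`steps_per_epoch` is a hypothetical helper):

```python
def steps_per_epoch(num_samples, batch_size):
    """Number of full batches that fit into num_samples."""
    return num_samples // batch_size

steps = steps_per_epoch(1005, 10)
leftover = 1005 - steps * 10
print(steps, leftover)
```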

In [None]:
# double check if the variables now contain the correct values.
print(HEIGHT)
print(CLASS_NAMES)
print(NCLASSES)

## 5- Load Pretrained Models (Transfer Learning)

Refer to Notebook 2 of Image_Modeling.

## 6- Train the Model/s

## 7- Metrics for Model Performance