# Project - deep learning modeling and optimization

In this project you will implement a network architecture, train it on a dataset while comparing different optimizers, and finally optimize it for inference using TensorRT.

## Implement and train the model
The table below shows the architecture of the well-known VGG-19 classifier:

|Layer Type|Feature Maps|Output Size|Kernel Size|Stride|Activation|
| :-: | :-: | :-: | :-: | :-: | :-: |
|Image|3|224×224|–|–|–|
|Convolution|	64|	224×224|	3×3|	1|	ReLU|
|Convolution|	64|	224×224|	3×3|	1|	ReLU|
|Max Pooling|	64|	112×112|	2×2|	2|	–|
|Convolution|	128|	112×112|	3×3|	1|	ReLU|
|Convolution|	128|	112×112|	3×3|	1|	ReLU|
|Max Pooling|	128|	56×56|	2×2|	2|	–|
|Convolution|	256|	56×56|	3×3|	1|	ReLU|
|Convolution|	256|	56×56|	3×3|	1|	ReLU|
|Convolution|	256|	56×56|	3×3|	1|	ReLU|
|Convolution|	256|	56×56|	3×3|	1|	ReLU|
|Max Pooling|	256|	28×28|	2×2|	2|	–|
|Convolution|	512|	28×28|	3×3|	1|	ReLU|
|Convolution|	512|	28×28|	3×3|	1|	ReLU|
|Convolution|	512|	28×28|	3×3|	1|	ReLU|
|Convolution|	512|	28×28|	3×3|	1|	ReLU|
|Max Pooling|	512|	14×14|	2×2|	2|	–|
|Convolution|	512|	14×14|	3×3|	1|	ReLU|
|Convolution|	512|	14×14|	3×3|	1|	ReLU|
|Convolution|	512|	14×14|	3×3|	1|	ReLU|
|Convolution|	512|	14×14|	3×3|	1|	ReLU|
|Max Pooling|	512|	7×7|	2×2|	2|	–|
|Fully Connected|	–|	4096|	–|	–|	ReLU|
|Fully Connected|	–|	4096|	–|	–|	ReLU|
|Fully Connected|	–|	1000|	–|	–|	Softmax|

Please implement this network architecture in TensorFlow and load pretrained weights into it.
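
As a sanity check before writing the Keras model, the table can be traced by hand: every 3×3 convolution with stride 1 and `same` padding keeps the spatial size, and every 2×2 max pooling with stride 2 halves it. The sketch below is only an illustration of that arithmetic; the names `VGG19_BLOCKS`, `FC_UNITS` and `trace_vgg19` are made up for this example, and it assumes a 3-channel RGB input.

```python
# Layer plan read off the table: (number of 3x3 convolutions, output channels) per block.
VGG19_BLOCKS = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]
FC_UNITS = [4096, 4096, 1000]


def trace_vgg19(input_size=224, input_channels=3):
    """Return (final spatial size, flattened feature length, parameter count)."""
    size, channels, params = input_size, input_channels, 0
    for n_convs, out_channels in VGG19_BLOCKS:
        for _ in range(n_convs):
            # 3x3 kernel weights plus one bias per output channel;
            # 'same' padding with stride 1 keeps the spatial size.
            params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # 2x2 max pooling with stride 2 halves height and width
    features = size * size * channels
    fan_in = features
    for units in FC_UNITS:
        # Dense layer: one weight per (input, unit) pair plus one bias per unit.
        params += fan_in * units + units
        fan_in = units
    return size, features, params


size, features, params = trace_vgg19()
print(size, features, params)  # 7 25088 143667240
```

If `model.summary()` on your implementation reports a different total parameter count, a layer is probably missing or has the wrong number of filters.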

Choose the proper metrics to evaluate model performance and perform model evaluation.
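
For a 1000-class ImageNet classifier, top-1 and top-5 accuracy are the conventional choices. The following is a minimal pure-Python sketch of what a top-k metric computes, using made-up scores; `top_k_accuracy` is a hypothetical helper written for this illustration. In Keras, `tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)` can be passed to `model.compile` for the same purpose.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example: 3 samples, 4 classes (made-up scores).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true label 1: highest score, top-1 hit
    [0.5, 0.1, 0.3, 0.1],  # true label 2: second highest, top-2 hit only
    [0.4, 0.3, 0.2, 0.1],  # true label 3: lowest score, miss
]
labels = [1, 2, 3]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
```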


### Import necessary libs

In [3]:
#!pip3 install tensorflow-datasets==4.1.0


In [4]:
import multiprocessing
import time

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds

### Load the data

In [5]:
ds, info = tfds.load('imagenet_v2', split='test', with_info=True)

tfds.as_dataframe(ds.take(4), info)

Downloading and preparing dataset imagenet_v2/matched-frequency/1.0.0 (download: 1.17 GiB, generated: 1.16 GiB, total: 2.33 GiB) to /root/tensorflow_datasets/imagenet_v2/matched-frequency/1.0.0...



NonMatchingChecksumError: Artifact https://s3-us-west-2.amazonaws.com/imagenetv2public/imagenetv2-matched-frequency.tar.gz, downloaded to /root/tensorflow_datasets/downloads/s3-us-west-2_image_image-match-frequc56VsOLVttUFrJ7Ka21jcV9uodSP_TSQV-yxfB4t3_U.tar.gz.tmp.3c6b74284b5b475f8c0d0fb4113ab11f/imagenetv2-matched-frequency.tar.gz, has wrong checksum. This might indicate:
 * The website may be down (e.g. returned a 503 status code). Please check the url.
 * For Google Drive URLs, try again later as Drive sometimes rejects downloads when too many people access the same URL. See https://github.com/tensorflow/datasets/issues/1482
 * The original datasets files may have been updated. In this case the TFDS dataset builder should be updated to use the new files and checksums. Sorry about that. Please open an issue or send us a PR with a fix.
 * If you're adding a new dataset, don't forget to register the checksums as explained in: https://www.tensorflow.org/datasets/add_dataset#2_run_download_and_prepare_locally


### Data preprocessing

In [None]:
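def resize_with_crop(example):
    # tfds.load without as_supervised=True yields dicts with 'image' and 'label' keys
    image = tf.cast(example['image'], tf.float32)
    # Center-crop (or zero-pad) to the 224x224 input size expected by VGG-19
    image = tf.image.resize_with_crop_or_pad(image, 224, 224)
    # Convert RGB to BGR and subtract the ImageNet channel means
    image = tf.keras.applications.vgg19.preprocess_input(image)
    return (image, example['label'])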

In [None]:
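# Preprocess the images and batch them, since model.evaluate expects batched input
# (the batch size of 32 is an arbitrary choice)
ds = ds.map(resize_with_crop).batch(32)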

### Implement and build model

In [None]:
from tensorflow.keras.initializers import he_normal

model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same', input_shape=(224, 224, 3)),
        tf.keras.layers.Conv2D(64, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
     
        
        tf.keras.layers.Conv2D(128, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(128, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
     
        
        tf.keras.layers.Conv2D(256, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(256, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(256, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(256, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
      

        tf.keras.layers.Conv2D(512, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(512, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(512, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(512, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),

        tf.keras.layers.Conv2D(512, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(512, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(512, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.Conv2D(512, kernel_size=(3, 3), activation='relu',kernel_initializer=he_normal(), padding='same'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
       
        tf.keras.layers.Flatten(),

        tf.keras.layers.Dense(4096, activation='relu'),
        tf.keras.layers.Dense(4096, activation='relu'),
        tf.keras.layers.Dense(1000, activation='softmax')
    
    ]) 

# compile model
lr = 1e-3  # learning rate (assumed value; tune as needed)
opt = tf.keras.optimizers.Adam(learning_rate=lr)
model.compile(optimizer=opt,
          loss=tf.keras.losses.SparseCategoricalCrossentropy(),
          metrics=['accuracy'])

### Load weights to model

In [None]:
# Loads the weights
model.load_weights("model.h5")

### Evaluate the model

In [None]:
# Evaluate the model
loss, acc = model.evaluate(ds, verbose=2)
print("Accuracy: {:5.2f}%".format(100 * acc))

### Save the model

In [None]:
model.save('./saved_model')

### Evaluate the model GPU usage

In [None]:
def evaluate_model_gpu_from_path(path):
    from tensorflow.python.saved_model import signature_constants
    from tensorflow.python.saved_model import tag_constants
    from tensorflow.python.framework import convert_to_constants

    num_of_iteration = 2000

    def get_func_from_saved_model(saved_model_dir):
      saved_model_loaded = tf.saved_model.load(
          saved_model_dir, tags=[tag_constants.SERVING])
      graph_func = saved_model_loaded.signatures[
          signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
      graph_func = convert_to_constants.convert_variables_to_constants_v2(graph_func)
      return graph_func

    def evaluate_model( ):
        mem_before = get_gpu_memory()[0]
        print("Available GPU Memory before loading: ", mem_before)

        model_func = get_func_from_saved_model(path)
        mem_after = get_gpu_memory()[0]
        print("Available GPU Memory after loading: ", mem_after)

        success = 0
        start_time = time.time()
        for i in range(num_of_iteration):
            data = tf.convert_to_tensor(np.asanyarray([validation_images[i]]))
            digit = np.argmax(model_func(data), axis=-1)[0]
            if digit == np.argmax(validation_labels[i]):
                success += 1

        print("Average FPS: ", num_of_iteration / float(time.time() - start_time))
        print("GPU Memory Usage: " + str(mem_before - mem_after) + " MiB")
        print('\nTest accuracy:', float(success) / num_of_iteration)

    p = multiprocessing.Process(target=evaluate_model)
    p.start()
    p.join()
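
The function above calls `get_gpu_memory()`, which is not defined anywhere in this notebook. One possible implementation, sketched here as an assumption rather than the intended helper, queries `nvidia-smi` for the free memory of each GPU in MiB. `parse_gpu_memory` is a hypothetical name, split out so the parsing can be checked without a GPU.

```python
import subprocess


def parse_gpu_memory(nvidia_smi_output):
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line) for line in nvidia_smi_output.strip().splitlines() if line.strip()]


def get_gpu_memory():
    """Return the free memory in MiB for each visible GPU (hypothetical helper)."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_memory(output)
```

Because the helper reports free memory, `mem_before - mem_after` in `evaluate_model` is the amount claimed while loading the model. Note that TensorFlow may reserve far more GPU memory than the model itself needs unless memory growth is enabled, so the number is best used for relative comparisons between precision modes.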

## Optimize the model using TensorRT

After training and evaluating the model, your goal is to optimize it for inference on the target machine using TensorRT (use TF-TRT in this project).

Try quantizing the model to different precisions using TensorRT's quantization features, compare the precision modes, and recommend one.

> Bonus: if you were working on a Tesla T4 GPU, which precision mode would you choose?
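
To build intuition for what is traded away at lower precision, the sketch below rounds a few made-up activation values to FP16 and to INT8 and prints the resulting error. It is a simplified illustration only: the INT8 path uses a plain symmetric max-abs scale, whereas TensorRT picks its dynamic ranges through calibration, and `to_fp16` and `to_int8` are hypothetical helpers, not TF-TRT APIs.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_int8(values):
    """Symmetric max-abs INT8 quantization followed by dequantization.

    Assumes at least one value is non-zero, otherwise the scale would be 0.
    """
    scale = max(abs(v) for v in values) / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale


# Made-up activation values, for illustration only.
activations = [0.0013, -0.42, 0.07, 1.9, -3.3, 12.5]

fp16 = [to_fp16(v) for v in activations]
int8, scale = to_int8(activations)

for v, h, q in zip(activations, fp16, int8):
    print(f"{v:>8.4f}  fp16 error={abs(v - h):.2e}  int8 error={abs(v - q):.2e}")
```

With a single scale for the whole tensor, small values next to a large outlier lose most of their resolution in INT8, which is why calibration data should be representative of real inputs.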

### Build  OptimizedModel  with precision = "FP32"

In [None]:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
import numpy as np
import random

def optimize_model( ):
    converter = trt.TrtGraphConverterV2(input_saved_model_dir='./saved_model',
                                        conversion_params = tf.experimental.tensorrt.ConversionParams(
                                            precision_mode='FP32',
                                        )
                                       )


    def my_input_fn():
        # Input for a single inference call; this network has one input tensor:
        yield (np.asanyarray([train_images[0]]),)


    converter.convert()
    converter.build(my_input_fn)
    converter.save('./optimizedFp32')

p = multiprocessing.Process(target=optimize_model)
p.start()
p.join()

In [None]:
evaluate_model_gpu_from_path('./optimizedFp32')

### Build  OptimizedModel  with precision = "FP16"

In [None]:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
import numpy as np
import random

def optimize_model( ):
    converter = trt.TrtGraphConverterV2(input_saved_model_dir='./saved_model',
                                        conversion_params = tf.experimental.tensorrt.ConversionParams(
                                            precision_mode='FP16',
                                        )
                                       )


    def my_input_fn():
        # Input for a single inference call; this network has one input tensor:
        yield (np.asanyarray([train_images[0]]),)


    converter.convert()
    converter.build(my_input_fn)
    converter.save('./optimizedFp16')

p = multiprocessing.Process(target=optimize_model)
p.start()
p.join()

In [None]:
evaluate_model_gpu_from_path('./optimizedFp16')

### Build  OptimizedModel  with precision = "INT8"

In [None]:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
import numpy as np
import random

def optimize_model( ):
    converter = trt.TrtGraphConverterV2(input_saved_model_dir='./saved_model',
                                        conversion_params = tf.experimental.tensorrt.ConversionParams(
                                            precision_mode='INT8',
                                        )
                                       )


    def my_input_fn():
        # Calibration input: each yielded tuple holds one batch for the network's single input tensor
        ############################ TBD ######################################
        yield (np.asanyarray([train_images[0]]),)


    # INT8 conversion needs representative data to calibrate activation ranges
    converter.convert(calibration_input_fn=my_input_fn)
    converter.build(my_input_fn)
    converter.save('./optimizedInt8')

p = multiprocessing.Process(target=optimize_model)
p.start()
p.join()

In [None]:
evaluate_model_gpu_from_path('./optimizedInt8')

## Create a Box Blur CUDA kernel with Numba

https://en.wikipedia.org/wiki/Box_blur

Follow the algorithm provided for box blur (3×3 kernel size) and implement it in two ways:
1. Using a plain Python loop over the image
2. Using a Numba CUDA kernel
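
As a starting point for the first variant, here is a minimal sketch of a 3×3 box blur over a single-channel image stored as a list of rows. It assumes that border pixels are averaged over the neighbours that actually exist; other border policies (padding, skipping the border) are equally valid. `box_blur_3x3` is a name chosen for this example. The same loop structure carries over to a NumPy array with `image[ny, nx]` indexing.

```python
def box_blur_3x3(image):
    """Blur a 2D grayscale image (list of rows) with a 3x3 box filter."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            # Visit the 3x3 neighbourhood, skipping positions outside the image.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        total += image[ny][nx]
                        count += 1
            blurred[y][x] = total / count
    return blurred


image = [
    [0, 0, 0],
    [0, 9, 0],
    [0, 0, 0],
]
print(box_blur_3x3(image)[1][1])  # 1.0
```

In a CUDA version, the two outer loops disappear: each thread computes one output pixel and only the inner 3×3 neighbourhood loop remains.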

In [2]:
!pip install opencv-python


In [1]:
import cv2
import requests
import matplotlib.pyplot as plt

r = requests.get('https://static01.nyt.com/images/2019/04/02/science/28SCI-ZIMMER1/28SCI-ZIMMER1-articleLarge.jpg?quality=75&auto=webp&disable=upscale', allow_redirects=True)
with open('frog.jpg', 'wb') as f:
    f.write(r.content)

# cv2.imread returns BGR; convert to RGB so matplotlib shows the right colors
test_image = cv2.cvtColor(cv2.imread('frog.jpg'), cv2.COLOR_BGR2RGB)

plt.figure()
plt.imshow(test_image)

In [None]:
threadsperblock = 32
xblocks = (test_image.shape[1] + (threadsperblock - 1)) // threadsperblock
yblocks = (test_image.shape[0] + (threadsperblock - 1)) // threadsperblock

print("Xblocks: ", xblocks)
print("Yblocks: ", yblocks)
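
The expression `(n + threads - 1) // threads` used above is a ceiling division: it yields the smallest number of blocks whose threads cover all `n` pixels along one axis. A small sketch with a hypothetical 600×400 image (the real size depends on the downloaded file) shows the effect; `blocks_needed` is a name introduced for this example.

```python
def blocks_needed(extent, threads_per_block):
    """Ceiling division: smallest block count whose threads cover `extent`."""
    return (extent + threads_per_block - 1) // threads_per_block


# Hypothetical 600x400 (width x height) image with 32x32 thread blocks.
width, height, threads = 600, 400, 32
grid = (blocks_needed(width, threads), blocks_needed(height, threads))
print(grid)  # (19, 13)
```

The last block along each axis usually extends past the image edge, which is why the kernel has to check its indices against the image shape before writing.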

In [None]:
from numba import cuda

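@cuda.jit
def cv_histogram(image, grayscale_image):
    # Despite its name, this kernel converts an RGB image to grayscale.
    # The grid's first axis spans the image width, so the first index is the column (y here).
    y,x = cuda.grid(2)
    
    # x indexes rows, y indexes columns; skip threads that fall outside the image
    if x < image.shape[0] and y < image.shape[1]:
        # Rec. 709 luma weights, assuming channel order R, G, B
        grayscale_image[x,y] = 0.2126*image[x,y,0] + 0.7152*image[x,y,1] + 0.0722*image[x,y,2]
    
    
def grayscale(image):
    # CPU reference: same weights, one pixel at a time
    grayscale_image = np.zeros(shape=(image.shape[0], image.shape[1]), dtype=np.uint8)
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            grayscale_image[x,y] = 0.2126*image[x,y,0] + 0.7152*image[x,y,1] + 0.0722*image[x,y,2]
    return grayscale_image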

In [None]:
import numpy as np
import matplotlib.pyplot as plt

gray_img = np.zeros(shape=(test_image.shape[0], test_image.shape[1]), dtype=np.uint8)
# test_image = test_image.astype(np.uint32)

blocks_per_grid = (xblocks, yblocks)

image_device = cuda.to_device(test_image)
output = cuda.to_device(gray_img)

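# Kernel launches are asynchronous; synchronize so the timing covers the actual GPU work
%timeit cv_histogram[blocks_per_grid, (threadsperblock, threadsperblock)](image_device, output); cuda.synchronize()
%timeit grayscale(test_image)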

plt.imshow(output.copy_to_host(), cmap='gray')