### 1.	What does a SavedModel contain? How do you inspect its content?

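A SavedModel is TensorFlow's serialization format for a complete model: a directory holding everything needed to store or transmit the model and later recreate it through deserialization, without the original code. It contains a `saved_model.pb` file with the computation graph(s) and their signatures, a `variables/` subdirectory with the trained variable values, and optionally an `assets/` subdirectory with extra files such as vocabularies.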

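To inspect its content, use the SavedModel command-line interface, `saved_model_cli`, which can show a SavedModel's tag sets and signatures and also execute it on sample inputs. Alternatively, load it in Python with `tf.saved_model.load()` and examine the returned object.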
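As a minimal sketch, the CLI can be driven from Python. The helper name `saved_model_show_command` and the path `my_model/0001` are hypothetical; the `show` subcommand and the `--dir`, `--all`, `--tag_set`, and `--signature_def` flags are the ones `saved_model_cli` provides.

```python
def saved_model_show_command(export_dir, tag_set=None, signature_def=None):
    """Build the argument list for `saved_model_cli show` (illustrative helper)."""
    cmd = ["saved_model_cli", "show", "--dir", export_dir]
    if tag_set is None:
        # No tag set given: ask the CLI to print every MetaGraph and signature.
        cmd.append("--all")
        return cmd
    cmd += ["--tag_set", tag_set]
    if signature_def is not None:
        # Narrow the output to the inputs and outputs of one signature.
        cmd += ["--signature_def", signature_def]
    return cmd


# "my_model/0001" is a placeholder path used only for illustration.
everything = saved_model_show_command("my_model/0001")
one_signature = saved_model_show_command("my_model/0001", "serve", "serving_default")
print(" ".join(everything))
print(" ".join(one_signature))
# With TensorFlow installed, subprocess.run(everything) would execute the command.
```

The first command lists everything in the SavedModel; the second shows only the input and output tensors of the default serving signature.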

### 2.	When should you use TF Serving? What are its main features? What are some tools you can use to deploy it?

TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. TensorFlow Serving makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. TensorFlow Serving provides out-of-the-box integration with TensorFlow models, but can be easily extended to serve other types of models and data.

TensorFlow Serving lets us choose which version of a model (a "servable") handles an inference request. Each version is exported to its own numbered subdirectory under the model's base path.
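Selecting a version shows up directly in the REST URL. The sketch below only builds the request, it does not send it. The helper name `build_predict_request`, the model name `my_model`, the host, and the port 8501 (the port conventionally used for TF Serving's REST endpoint) are assumptions for illustration.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, version=None, port=8501):
    """Build (but do not send) a TF Serving REST predict request."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        # Pin the request to one exported version instead of the latest one.
        path += f"/versions/{version}"
    url = f"http://{host}:{port}{path}:predict"
    body = json.dumps({"signature_name": "serving_default", "instances": instances})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("localhost", "my_model", [[1.0, 2.0, 3.0]], version=2)
print(request.full_url)
print(request.data.decode("utf-8"))
# Sending it needs a running server: urllib.request.urlopen(request)
```

Leaving `version` unset drops the `/versions/...` segment, in which case the server decides which version answers.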

Tools you can use to deploy it include Docker and Kubernetes, running on your own infrastructure or on a cloud platform such as Azure, GCP, or AWS.


### 3.	How do you deploy a model across multiple TF Serving instances?

### 4.	When should you use the gRPC API rather than the REST API to query a model served by TF Serving?

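gRPC is a modern, open source remote procedure call (RPC) framework that can run anywhere. It enables client and server applications to communicate transparently, and makes it easier to build connected systems. Prefer the gRPC API over the REST API when requests carry large amounts of data or when latency and bandwidth matter.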

`What are the benefits of using gRPC?`

- gRPC uses binary payloads, which are efficient to create and parse and hence light-weight.

- Bi-directional streaming is possible in gRPC, which is not the case with a REST API.

- The gRPC API is built on top of HTTP/2, supporting traditional request-response exchanges as well as client, server, and bi-directional streaming.

- Message transmission is typically faster than with a JSON-based REST API, because gRPC uses serialized Protocol Buffers over HTTP/2.

- Loose coupling between client and server makes it easy to make changes.

- gRPC allows integration of APIs programmed in different languages.

`What’s the difference between gRPC and REST API?`

- Payload Format: REST typically uses JSON for exchanging messages between client and server, whereas gRPC uses Protocol Buffers. Protocol Buffers use a compact binary encoding that is smaller than the equivalent JSON, so gRPC transmits data over networks more efficiently.

- Transfer Protocols: REST typically runs over HTTP/1.1, which is text-based, whereas gRPC is built on HTTP/2, a binary protocol that compresses headers, parses efficiently, and multiplexes requests over a single connection.

- Streaming vs. Request-Response: REST supports the request-response model available in HTTP/1.1. gRPC can also use the bi-directional streaming capabilities of HTTP/2, where the client and server send a sequence of messages to each other over a read-write stream.
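The payload-size difference can be illustrated with a rough sketch. It uses `struct` to pack values as float32, which is only a stand-in for a binary wire format and not Protocol Buffers itself; the 1,000 sample values are made up.

```python
import json
import struct

# 1,000 made-up feature values to send to a model.
values = [i / 7 for i in range(1000)]

# Text encoding, as a JSON request body would carry it.
json_payload = json.dumps({"instances": values}).encode("utf-8")

# Packed little-endian float32: 4 bytes per value, no delimiters or digits.
binary_payload = struct.pack(f"<{len(values)}f", *values)

print(len(json_payload), "bytes as JSON")
print(len(binary_payload), "bytes as packed float32")
```

The binary form is a fixed 4 bytes per value, while JSON spends one byte per printed digit plus separators. Note that packing as float32 also drops precision relative to the decimal text, which is usually acceptable for model inputs.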

### 5.	What are the different ways TFLite reduces a model’s size to make it run on a mobile or embedded device?

We can reduce the size of a TensorFlow model using the following methods:

- `Freezing:` Convert the variables stored in a checkpoint file of the SavedModel into constants stored directly in the model graph. This reduces the overall size of the model.

- `Pruning:` Strip unused nodes in the prediction path and the outputs of the graph, merging duplicate nodes, as well as cleaning other node ops like summary, identity, etc.

- `Constant folding:` Look for any sub-graphs within the model that always evaluate to constant expressions, and replace them with those constants.

- `Folding batch norms:` Fold the multiplications introduced in batch normalization into the weight multiplications of the previous layer.

- `Quantization:` Convert weights from floating point to lower precision, such as 16 or 8 bits.
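The batch-norm folding step listed above can be checked numerically. This is a pure-Python sketch for a single dense unit with made-up numbers; the function names are illustrative and are not part of TFLite.

```python
import math


def dense(x, weights, bias):
    """One dense unit: weighted sum of the inputs plus a bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def batch_norm(z, gamma, beta, mean, var, eps=1e-3):
    """Inference-time batch normalization of a single activation."""
    return gamma * (z - mean) / math.sqrt(var + eps) + beta


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-3):
    """Absorb the batch-norm scale and shift into the dense unit's parameters."""
    scale = gamma / math.sqrt(var + eps)
    folded_weights = [w * scale for w in weights]
    folded_bias = (bias - mean) * scale + beta
    return folded_weights, folded_bias


# Made-up parameters and input, for illustration only.
weights, bias = [0.5, -1.2, 2.0], 0.1
gamma, beta, mean, var = 1.5, -0.3, 0.4, 2.25
x = [1.0, 2.0, 3.0]

original = batch_norm(dense(x, weights, bias), gamma, beta, mean, var)
folded_weights, folded_bias = fold_batch_norm(weights, bias, gamma, beta, mean, var)
folded = dense(x, folded_weights, folded_bias)

print(math.isclose(original, folded, rel_tol=1e-9, abs_tol=1e-9))
```

After folding, the batch-norm operation and its four parameters disappear from the graph, and inference needs only the one dense computation.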

### 6.	What is quantization-aware training, and why would you need it?

Quantization-aware training (QAT) trains a DNN for lower-precision INT8 deployment with minimal loss of accuracy. It does so by modeling quantization errors during training, which keeps accuracy close to that of the FP16 or FP32 model.

QAT works by emulating, during training, the quantization losses that would otherwise appear only after the model is quantized. The model is therefore exposed to those losses and learns to compensate for them.

Training with QAT can increase the accuracy of the quantized network. There are a few potential reasons for this:

- Finetuning: Since we convert a previously trained model, we are in effect continuing the training of that model. This can be a reason for the increase in accuracy.
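The emulation described above is often called fake quantization: values are rounded to the integer grid and immediately converted back to floats, so the forward pass sees the rounding error. A minimal pure-Python sketch follows, assuming a simple min/max affine scheme; gradient handling and the actual TensorFlow ops are not shown.

```python
def fake_quantize(values, num_bits=8):
    """Quantize to num_bits integer codes, then dequantize back to floats."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    # Width of one quantization step; fall back to 1.0 if all values are equal.
    scale = (hi - lo) / levels if hi > lo else 1.0
    dequantized = []
    for v in values:
        code = round((v - lo) / scale)       # integer code in [0, levels]
        dequantized.append(lo + code * scale)
    return dequantized, scale


# Made-up weights, for illustration only.
weights = [-1.0, -0.25, 0.0, 0.1, 0.7, 1.5]
quantized_view, scale = fake_quantize(weights)

for original, seen in zip(weights, quantized_view):
    print(f"{original:+.4f} -> {seen:+.4f} (error {abs(original - seen):.5f})")
```

Each value moves by at most half a quantization step, and it is this perturbed value that the rest of the network trains against.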

- Regularization: Each operation in the model is now converted to its equivalent fake quantization operation. This could act like a regularization method and help improve the accuracy of the network.

### 7.	What are model parallelism and data parallelism? Why is the latter generally recommended?

`Model parallelism` is a distributed training method in which the deep learning model is partitioned across multiple devices, within or across instances.

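`Data parallelism` replicates the same model on every device and runs the same training step concurrently on different slices of the data; the resulting gradients are then aggregated to update all replicas. It is generally recommended because it is simpler to implement and requires less cross-device communication than splitting a model across devices.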
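A small pure-Python sketch of the data-parallel idea: each simulated replica computes the gradient on its own slice of the batch, and the average of those gradients matches the full-batch gradient. The linear model, the data, and the function names are made up for illustration, and the batch is assumed to divide evenly among replicas.

```python
def mse_gradient(w, b, xs, ys):
    """Gradient of mean squared error for the model y = w * x + b."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def data_parallel_gradient(w, b, xs, ys, num_replicas):
    """Split the batch across replicas, then average the per-replica gradients."""
    shard = len(xs) // num_replicas  # assumes the batch divides evenly
    grads = []
    for i in range(num_replicas):
        start, end = i * shard, (i + 1) * shard
        # Every replica holds the same (w, b) but sees a different data slice.
        grads.append(mse_gradient(w, b, xs[start:end], ys[start:end]))
    # Aggregation step (what an all-reduce would do across real devices).
    grad_w = sum(g[0] for g in grads) / num_replicas
    grad_b = sum(g[1] for g in grads) / num_replicas
    return grad_w, grad_b


xs = [float(i) for i in range(8)]
ys = [2.0 * x + 1.0 for x in xs]

full = mse_gradient(0.5, 0.0, xs, ys)
parallel = data_parallel_gradient(0.5, 0.0, xs, ys, num_replicas=4)
print(full, parallel)
```

Because the shards are equal in size, averaging the shard gradients gives the same update as a single device processing the whole batch.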

### 8.	When training a model across multiple servers, what distribution strategies can you use? How do you choose which one to use?

`Synchronous vs asynchronous training:` These are two common ways of distributing training with data parallelism. In sync training, all workers train over different slices of the input data in lockstep and aggregate gradients at each step. In async training, all workers train independently over the input data and update variables asynchronously. Sync training is typically implemented with all-reduce, and async training with a parameter server architecture.

TensorFlow's `tf.distribute` API provides the following strategies:

- MirroredStrategy
- TPUStrategy
- MultiWorkerMirroredStrategy
- ParameterServerStrategy
- CentralStorageStrategy
- Default strategy
- OneDeviceStrategy
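One way to think about choosing among these is a rule of thumb based on the hardware and on whether training is synchronous. The helper below is a hypothetical sketch: it returns strategy names as strings, does not import TensorFlow, and deliberately ignores finer points such as CentralStorageStrategy.

```python
def suggest_strategy(num_workers=1, gpus_per_worker=0, use_tpu=False, asynchronous=False):
    """Rule-of-thumb mapping from a hardware setup to a strategy name."""
    if use_tpu:
        return "TPUStrategy"
    if num_workers > 1:
        # Several machines: parameter servers for async, all-reduce for sync.
        if asynchronous:
            return "ParameterServerStrategy"
        return "MultiWorkerMirroredStrategy"
    if gpus_per_worker > 1:
        # One machine with several GPUs: synchronous mirrored replicas.
        return "MirroredStrategy"
    # Single device: nothing to distribute.
    return "OneDeviceStrategy"


print(suggest_strategy(gpus_per_worker=4))
print(suggest_strategy(num_workers=3, gpus_per_worker=2))
print(suggest_strategy(num_workers=3, asynchronous=True))
```

This mirrors the sync/async split described above: synchronous setups map to the mirrored, all-reduce-based strategies, and asynchronous multi-worker setups map to the parameter server strategy.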

