# Challenge: Let's train a QuickDraw model & deploy it as an online service
In this notebook we will create a [QuickDraw](https://quickdraw.withgoogle.com/) predictor app. The dataset is available from [GCS](https://quickdraw.withgoogle.com/data) and contains more than **50 million** labeled drawings. Deep learning is a fantastic modeling technique to apply to a visual dataset like this.

## Build a UnionML app

To train a QuickDraw model, we will use UnionML. The app is implemented in [main.py](pictionary_app/main.py).

In [1]:
%%capture
!pip install wandb

In [3]:
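# `!export` runs in a throwaway subshell, so the variable would not reach the kernel; `%env` sets it for this session.
# Substitute your own key here and keep it out of version control.
%env WANDB_API_KEY=<your-wandb-api-key>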

In [1]:
from pictionary_app import model

## Train on a Small Dataset Locally

In [4]:
num_classes = 10

model.train(
    hyperparameters={"num_classes": num_classes},
    trainer_kwargs={"num_epochs": 1, "batch_size": 512},
    data_dir="/tmp/quickdraw_data",
    max_examples_per_class=1000,
    class_limit=num_classes,
)

PyTorch: setting up devices
***** Running training *****
  Num examples = 10000
  Num Epochs = 1
  Instantaneous batch size per device = 512
  Total train batch size (w. parallel, distributed & accumulation) = 512
  Gradient Accumulation steps = 1
  Total optimization steps = 19


Training on device: cpu


Step,Training Loss


Saving model checkpoint to ./.tmp/outputs_20k_2022-07-15-152151/checkpoint-19
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to ./.tmp/outputs_20k_2022-07-15-152151
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


***** train metrics *****
  epoch                    =        1.0
  total_flos               =        0GF
  train_loss               =     2.3026
  train_runtime            = 0:00:23.99
  train_samples_per_second =    416.779
  train_steps_per_second   =      0.792


(Sequential(
   (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=same)
   (1): ReLU()
   (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
   (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=same)
   (4): ReLU()
   (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
   (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=same)
   (7): ReLU()
   (8): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
   (9): Flatten(start_dim=1, end_dim=-1)
   (10): Linear(in_features=2304, out_features=512, bias=True)
   (11): ReLU()
   (12): Linear(in_features=512, out_features=10, bias=True)
 ),
 {'train': 9.765625})
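
A quick sanity check on these numbers, assuming the model is trained with a standard cross-entropy loss (the loss function is not shown in this notebook): a classifier that spreads its probability uniformly over all classes has a loss of ln(num_classes). The sketch below computes that chance-level baseline; `chance_baseline` is a hypothetical helper for illustration, not part of UnionML.

```python
import math
from typing import Tuple


def chance_baseline(num_classes: int) -> Tuple[float, float]:
    """Return (cross-entropy loss, accuracy in percent) of a uniform random guesser."""
    loss = math.log(num_classes)  # -ln(1 / num_classes)
    accuracy_pct = 100.0 / num_classes
    return loss, accuracy_pct


loss, acc = chance_baseline(10)
print(f"loss={loss:.4f}, accuracy={acc:.1f}%")  # loss=2.3026, accuracy=10.0%
```

The reported `train_loss` of 2.3026 matches this baseline, and the `train` metric of about 9.77 is close to 10 if it is read as an accuracy in percent. That suggests a single epoch on 10,000 examples has not yet moved the model beyond chance, which is one reason to train on more data.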

## Train on a Larger Dataset on a Cluster

Let us try to train the model on more data. For this, we need a GPU. (For reference, training on 2 classes takes almost 5 minutes on a CPU and about 5 seconds on a GPU.)
But how should we do that?

This is where UnionML shines, with the help of Flyte in the backend: you can simply change the API call from `train` to `remote_train`.

In [4]:
num_classes = 345
max_examples_per_class = 20000
num_epochs = 5
batch_size = 2048

execution = model.remote_train(
    app_version="dc608b2395a11d30869e238718400626df6e2cf4",
    wait=False,
    hyperparameters={"num_classes": num_classes},
    trainer_kwargs={"num_epochs": num_epochs, "batch_size": batch_size},
    data_dir="./data",
    max_examples_per_class=max_examples_per_class,
    class_limit=num_classes,
)

{"asctime": "2022-07-15 15:26:40,114", "name": "flytekit.cli", "levelname": "ERROR", "message": "Non-auth RPC error <_InactiveRpcError of RPC that terminated with:\n\tstatus = StatusCode.INTERNAL\n\tdetails = \"Unable to read WorkflowClosure from location s3://production-playground/metadata/admin/unionml/development/quickdraw_classifier.train/dc608b2395a11d30869e238718400626df6e2cf4 : path:s3://production-playground/metadata/admin/unionml/development/quickdraw_classifier.train/dc608b2395a11d30869e238718400626df6e2cf4: Conf container:union-opencompute-open-compute2-playground != Passed Container:production-playground. Dynamic loading is disabled: not found\"\n\tdebug_error_string = \"{\"created\":\"@1657916800.113842000\",\"description\":\"Error received from peer ipv4:3.137.115.239:443\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":904,\"grpc_message\":\"Unable to read WorkflowClosure from location s3://production-playground/metadata/admin/unionml/development/quickdraw_class

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "Unable to read WorkflowClosure from location s3://production-playground/metadata/admin/unionml/development/quickdraw_classifier.train/dc608b2395a11d30869e238718400626df6e2cf4 : path:s3://production-playground/metadata/admin/unionml/development/quickdraw_classifier.train/dc608b2395a11d30869e238718400626df6e2cf4: Conf container:union-opencompute-open-compute2-playground != Passed Container:production-playground. Dynamic loading is disabled: not found"
	debug_error_string = "{"created":"@1657916800.814560000","description":"Error received from peer ipv4:3.137.115.239:443","file":"src/core/lib/surface/call.cc","file_line":904,"grpc_message":"Unable to read WorkflowClosure from location s3://production-playground/metadata/admin/unionml/development/quickdraw_classifier.train/dc608b2395a11d30869e238718400626df6e2cf4 : path:s3://production-playground/metadata/admin/unionml/development/quickdraw_classifier.train/dc608b2395a11d30869e238718400626df6e2cf4: Conf container:union-opencompute-open-compute2-playground != Passed Container:production-playground. Dynamic loading is disabled: not found","grpc_status":13}"
>

Now, wait for the execution to complete and then load the model from the remote training job. We can easily interact with the fetched model locally to generate predictions.

In [9]:
model.remote_load(execution)

Waiting for execution f7599ac9a0231493eb5d to complete...
Done.


In [10]:
model.artifact

ModelArtifact(model_object=Sequential(
  (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=same)
  (1): ReLU()
  (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=same)
  (4): ReLU()
  (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=same)
  (7): ReLU()
  (8): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (9): Flatten(start_dim=1, end_dim=-1)
  (10): Linear(in_features=2304, out_features=512, bias=True)
  (11): ReLU()
  (12): Linear(in_features=512, out_features=345, bias=True)
), hyperparameters=HyperparametersSchema(num_classes=345), metrics={'train': 76.10881042480469})
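
The `in_features=2304` of the first linear layer follows from the layers printed above, assuming 28x28 grayscale input bitmaps (the input size is an assumption; it is not shown in this notebook). Each `padding=same` convolution keeps the spatial size, and each `MaxPool2d(kernel_size=2, stride=2)` with `ceil_mode=False` halves it, rounding down. A minimal sketch of that arithmetic, with `flatten_features` as a hypothetical helper:

```python
from typing import List


def flatten_features(input_size: int, channels: List[int], pool: int = 2) -> int:
    """Size of the flattened tensor after one conv + max-pool stage per channel entry.

    Assumes 'same'-padded convolutions (spatial size unchanged) and pooling
    with kernel_size == stride == pool and ceil_mode=False (floor division).
    """
    size = input_size
    for _ in channels:
        size = size // pool  # each pooling stage floor-halves height and width
    return channels[-1] * size * size


# 28 -> 14 -> 7 -> 3, so the flattened size is 256 * 3 * 3
print(flatten_features(28, [64, 128, 256]))  # 2304
```

Only the final layer changes between the local and remote runs: its `out_features` equals `num_classes` (10 locally, 345 remotely).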

### Create a Frontend Widget for our UnionML App

Let's take the trained model fetched above and use the wonderful [gradio](https://gradio.app/) library to create an interactive widget to test it out.

**Note:** UnionML makes it simple to create a web server using the same `predict` method that you wrote as part of `model`.

**Challenge:** Draw a smiley face and see if the model understands it!

In [11]:
import gradio as gr

gr.Interface(
    fn=lambda img: img if img is None else model.predict(img),
    inputs="sketchpad",
    outputs="label",
    live=True,
    allow_flagging="never",
).launch()

Hint: Set streaming=True for Sketchpad component to use live streaming.
Running on local URL:  http://127.0.0.1:7860/

To create a public link, set `share=True` in `launch()`.


(<gradio.routes.App at 0x7fa5624de7c0>, 'http://127.0.0.1:7860/', None)

Exception in callback None(<Task finishe...> result=None>)
handle: <Handle>
Traceback (most recent call last):
  File "/Users/nielsbantilan/miniconda3/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
TypeError: 'NoneType' object is not callable
