<img src="https://fsdl.me/logo-720-dark-horizontal">

# Lab 07: Deployment

## What You Will Learn

- How to convert PyTorch models into portable TorchScript binaries
- How to use `gradio` to make a simple demo UI for your ML-powered applications
- How to split out a model service from the frontend and spin up a publicly accessible application

# Setup

In [2]:
lab_idx = 7


if "bootstrap" not in locals() or bootstrap.run:
    # path management for Python
    pythonpath, = !echo $PYTHONPATH
    if "." not in pythonpath.split(":"):
        pythonpath = ".:" + pythonpath
        %env PYTHONPATH={pythonpath}
        !echo $PYTHONPATH

    # get both Colab and local notebooks into the same state
    !wget --quiet https://fsdl.me/gist-bootstrap -O bootstrap.py
    import bootstrap

    # change into the lab directory
    bootstrap.change_to_lab_dir(lab_idx=lab_idx)

    bootstrap.run = False  # change to True to re-run setup
    
!pwd
%ls

env: PYTHONPATH=.:
.:
/home/rehanfarooq/fsdl/fsdl-text-recognizer-2022-labs/lab07
api_serverless/  app_gradio/  notebooks/  tasks/  text_recognizer/  training/


In [2]:
from IPython.display import display, HTML, IFrame

full_width = True
frame_height = 720  # adjust for your screen

if full_width:  # if we want the notebook to take up the whole width
    # add styling to the notebook's HTML directly
    display(HTML("<style>.container { width:100% !important; }</style>"))
    display(HTML("<style>.output_result { max-width:100% !important; }</style>"))

### Follow along with a video walkthrough on YouTube:

In [2]:
from IPython.display import IFrame


IFrame(src="https://fsdl.me/2022-lab-07-video-embed", width="100%", height=720)

# Making the model portable

While training the model,
we've saved checkpoints and stored them locally
and on W&B.

From these checkpoints, we can reload model weights
and even restart training if we are in or can recreate
the model development environment.

We could directly deploy these checkpoints into production,
but they're suboptimal for two reasons.

First, as the name suggests,
these "checkpoints" are designed for serializing
state at a point of time in training.

That means they can include lots of information
not relevant during inference,
e.g. optimizer states like running average gradients.
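
To make this concrete, here is a minimal sketch that counts how many values in a checkpoint are only useful for resuming training. It assumes a toy `torch.nn.Linear` model and an Adam optimizer rather than the Text Recognizer, and the dictionary is built by hand to mimic the layout of a training checkpoint (`state_dict`, `optimizer_states`).

```python
import torch

torch.manual_seed(0)

# Toy stand-ins (assumptions): a tiny model and optimizer, not the Text Recognizer.
model = torch.nn.Linear(8, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step so that Adam populates its running averages.
loss = model(torch.randn(2, 8)).sum()
loss.backward()
optimizer.step()

# Hand-built dictionary that mimics the layout of a training checkpoint.
checkpoint = {
    "state_dict": model.state_dict(),
    "optimizer_states": [optimizer.state_dict()],
}

# Values needed at inference time: just the weights.
n_weights = sum(t.numel() for t in checkpoint["state_dict"].values())

# Values only needed to resume training: Adam's per-parameter running averages.
adam_state = checkpoint["optimizer_states"][0]["state"]
n_averages = sum(
    buffers[name].numel()
    for buffers in adam_state.values()
    for name in ("exp_avg", "exp_avg_sq")
)

print(f"weights: {n_weights} values, optimizer averages: {n_averages} values")
```

With Adam, the running averages alone hold twice as many values as the weights, so a checkpoint that keeps them is roughly three times larger than the weights it wraps.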

Additionally, the model development environment
is much more heavyweight than what we need during inference.

For example, we've got Lightning for training models
and W&B for tracking training runs.

These in turn incur dependencies on lots of heavy data science libraries.

We don't need this anymore -- we just want to run the model.

These are effectively "compiler tools", which our runtime model doesn't need.

So we need a new model binary artifact for runtime
that's leaner and more independent.

For this purpose, we use TorchScript.

## Compiling models to TorchScript

Torch has two main facilities for creating
more portable model binaries:
_scripting_ and _tracing_.

Scripting produces a binary that combines
constant `Tensor` values
(like weights and positional embeddings)
with a program that describes how to use them.

The result is a program that creates a dynamic graph,
as does a normal PyTorch program,
but this program is written in a
sub-dialect of Python called
_TorchScript_.

The [TorchScript sub-dialect of Python](https://pytorch.org/docs/stable/jit_language_reference.html#language-reference)
is more performant
and can even be run without a Python interpreter.

For example, TorchScript programs can be executed in pure C++
[using LibTorch](https://pytorch.org/tutorials/advanced/cpp_export.html).

You can read more in the documentation for the primary method
for scripting models, `torch.jit.script`:

In [3]:
import torch


torch.jit.script??

Signature:
torch.jit.script(
    obj,
    optimize=None,
    _frames_up=0,
    _rcb=None,
    example_inputs: Union[List[Tuple], Dict[Callable, List[Tuple]], NoneType] = None,
)
Source:
def script(
    obj,
    optimize=None,

The primary alternative to scripting is _tracing_,
which runs the PyTorch module on a specific
set of inputs and records, or "traces",
the compute graph.

You can read more about it in the documentation for the primary method
for tracing models, `torch.jit.trace`,
or just read the quick summary and comparison below.

In [4]:
torch.jit.trace??

Signature:
torch.jit.trace(
    func,
    example_inputs=None,
    optimize=None,
    check_trace=True,
    check_inputs=None,
    check_tolerance=1e-05,
    strict=True,
    _force_outplace=False,
    _module_class=None,
    _compilation_unit=<torch.jit.CompilationUnit object

### Tracing versus Scripting for TorchScript

The traced program is generally faster than the scripted version,
for models that are compatible with both tracing and scripting.

Tracing produces a static compute graph,
which means all control flow
(`if`s or `for` loops)
is effectively inlined:
only the branches and iterations taken on the example inputs are recorded.

As written, our text recognizer has a loop with conditional breaking -- fairly typical for Transformers in autoregressive mode --
so it isn't compatible with tracing.
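
A minimal sketch of that incompatibility, using a toy module with a data-dependent `break`. The `DoubleUntilLarge` module is a made-up example, not part of the Text Recognizer. Tracing emits a `TracerWarning` for the branch, which is silenced here so the difference in outputs is easier to see.

```python
import warnings

import torch


class DoubleUntilLarge(torch.nn.Module):
    """Doubles x until its sum exceeds 100: a loop with a data-dependent break."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(10):
            if x.sum() > 100.0:
                break
            x = x * 2
        return x


module = DoubleUntilLarge()

# Scripting compiles the loop and the break into the TorchScript program.
scripted = torch.jit.script(module)

# Tracing records only the operations executed for this one example input.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the TracerWarning about the branch
    traced = torch.jit.trace(module, torch.ones(4))

# An input that should trigger the break immediately and come back unchanged.
big = torch.full((4,), 50.0)
eager_out = module(big)
scripted_out = scripted(big)
traced_out = traced(big)

print("eager:   ", eager_out)
print("scripted:", scripted_out)
print("traced:  ", traced_out)
```

The scripted module agrees with eager PyTorch, while the traced module replays the five doublings it saw during tracing regardless of the input.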

Furthermore, the static compute graph includes concrete choices of operations,
e.g. specific CUDA kernels if tracing is run on the GPU.

If you try to run the traced model on a system that doesn't support those kernels,
it will crash.
That means tracing must occur in the target deployment environment.

Scripted models are much more portable, at the cost of both slower runtimes
for a fixed hardware target and of some restrictions on how dynamic the Python code can be.

We don't find the restrictions scripting places on Python code to be too onerous,
and in our experience, the performance gains from tracing are not worth the extra effort
until the team is larger,
model serving hardware and strategy are more mature,
and model release cycles are slower.

For an alternative perspective that's more in favor of tracing
and walks through how to mix-and-match scripting
and tracing for maximum flexibility and performance, see
[this blogpost](https://ppwwyyxx.com/blog/2022/TorchScript-Tracing-vs-Scripting/)
from
[Detectron2](https://ai.facebook.com/blog/-detectron2-a-pytorch-based-modular-object-detection-library-/)
dev Yuxin Wu.

Choosing just one of scripting or tracing
means we can use a high-level method
from PyTorch Lightning,
`to_torchscript`,
to produce our scripted model binary
and we don't need to touch our model code.

In [5]:
import pytorch_lightning as pl


pl.LightningModule.to_torchscript??

Signature:
pl.LightningModule.to_torchscript(
    self,
    file_path: Union[str, pathlib.Path, NoneType] = None,
    method: Optional[str] = 'script',
    example_inputs: Optional[Any] = None,
    **kwargs,
) -> Union[torch.ScriptModule, Dict[str, torch.ScriptModule]

## Alternatives to TorchScript

Though it has some sharp edges,
TorchScript is a relatively easy-to-use tool
for compiling neural networks written in PyTorch.

If you're willing to tolerate more sharp edges,
e.g. limited support for certain ops
and a higher risk of subtle differences in behavior, the
[Open Neural Network eXchange](https://onnx.ai/)
format, ONNX, is a compilation target for
[a wide variety of DNN libraries](https://onnx.ai/supported-tools.html),
from `sklearn` and MATLAB
to PyTorch and Hugging Face.

A high-level utility for conversion to ONNX is also included
in PyTorch Lightning, `pl.LightningModule.to_onnx`.

Because it is framework agnostic,
there's more and more varied tooling around ONNX,
and it has smoother paths to
compilation targets that can run DNNs
at the highest possible speeds,
like
[NVIDIA's TensorRT](https://developer.nvidia.com/tensorrt)
or
[Apache TVM](https://tvm.apache.org/2017/08/17/tvm-release-announcement).

TensorRT is one of the model formats supported by the
[Triton Inference Server](https://github.com/triton-inference-server/server),
a sort of "Kubernetes for GPU-accelerated DNNs"
that is, as of 2022,
the state of the art in running deep networks
at maximum throughput on server-grade GPUs.


## A simple script for compiling and staging models

To recap, our model staging workflow,
which does the hand-off between training and production, looks like this:

1. Get model weights and hyperparameters
from a tracked training run in W&B's cloud storage.
2. Reload the model as a `LightningModule` using those weights and hyperparameters.
3. Call `to_torchscript` on it.
4. Save that result to W&B's cloud storage.

We provide a simple script to implement this process:

In [6]:
!ls | grep charset_normalizer.py


In [7]:
%cd ../
!ls

/home/rehanfarooq/fsdl/fsdl-text-recognizer-2022-labs/lab07
api_serverless		    gradio_test_stats_history.csv  __pycache__
app_gradio		    graph.py			   tasks
app.py			    lunar.pt			   test.jpeg
Dockerfile		    lunar_thunder.ckpt		   test_locust.py
gradio_test_exceptions.csv  model.ckpt			   text_recognizer
gradio_test_failures.csv    model.pt			   training
gradio_test_stats.csv	    notebooks


In [4]:
%tb

SystemExit: 2

In [87]:
%run training/stagelocalckpt.py --ckpt lunar_thunder.ckpt --out lunar.pt --data_class IAMOriginalAndSyntheticParagraphs --model_class ResnetTransformer

Loading checkpoint from: /home/rehanfarooq/fsdl/fsdl-text-recognizer-2022-labs/lab07/lunar_thunder.ckpt
Extracting hyperparameters...

Final parameters used to build the model:
{'data_class': 'IAMOriginalAndSyntheticParagraphs',
 'epoch': 899,
 'global_step': 38700,
 'lr_schedulers': [],
 'model_class': 'ResnetTransformer',
 'optimizer_states': [{'param_groups': [{'amsgrad': False,
                                         'betas': (0.9, 0.999),
                                         'capturable': False,
                                         'eps': 1e-08,
                                         'foreach': None,
                                         'lr': 0.00012,
                                         'maximize': False,
                                         'params': [0,
                                                    1,
                                                    2,
                                                    3,
                                        

Here in this notebook,
rather than training or scripting a model ourselves,
we'll just `--fetch`
an already trained and scripted model binary:

In [None]:
%run training/stage_model.py --fetch --entity=cfrye59 --from_project=fsdl-text-recognizer-2021-training

Note that we can use the metadata of the staged model
to find the training run that generated the model weights.
It requires two graph hops:
find the run that created the staged TorchScript model
then in that run,
find the model checkpoint artifact
and look for the run that created it.
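
The two hops can be sketched on a toy lineage graph. Everything below, including the artifact and run names, is made up for illustration; in practice this information would come from W&B's artifact metadata rather than hand-written dictionaries.

```python
# Toy lineage graph (all names hypothetical): each artifact records the run that
# created it, and each run records the artifacts it consumed.
created_by = {
    "staged-torchscript-model": "staging-run",
    "model-checkpoint": "training-run",
}
used_by_run = {
    "staging-run": ["model-checkpoint"],
    "training-run": [],
}


def find_training_run(staged_artifact: str) -> str:
    """Follows two hops: staged model -> staging run -> checkpoint -> training run."""
    staging_run = created_by[staged_artifact]  # hop 1: which run made the staged model?
    (checkpoint,) = used_by_run[staging_run]  # the single checkpoint that run consumed
    return created_by[checkpoint]  # hop 2: which run made the checkpoint?


training_run = find_training_run("staged-torchscript-model")
print(training_run)
```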

In [23]:
from IPython.display import IFrame


staged_model_url = "https://wandb.ai/cfrye59/fsdl-text-recognizer-2021-training/artifacts/prod-ready/paragraph-text-recognizer/3e07efa34aec61999c5a/overview"

IFrame(staged_model_url, width="100%", height=720)

When we're deploying our first model,
this doesn't feel that important --
it's easy enough to find the training runs
we've executed and connect them to the model in production.

But as we train and release more models,
this information will become harder to find
and automation and API access will become more important.

This will be especially true if we adopt more sophisticated rollout strategies,
like A/B testing or canarying,
as the application matures.

Our system here is not robust enough to be Enterprise Grade™️ --
marking models as "in production" is manual
and there are no access control planes built in --
but at least the information is preserved.

## Running our more portable model via a CLI

Now that our TorchScript model binary file is present,
we can spin up our text recognizer
with much less code.

We just need a compatible version of PyTorch
and methods to convert
our generic data types
(images, strings)
to and from PyTorch `Tensor`s.

We can put all this together in
a single light-weight object,
the `ParagraphTextRecognizer` class:

In [1]:
%cd ../

/home/rehanfarooq/fsdl/fsdl-text-recognizer-2022-labs/lab07


In [2]:
from text_recognizer.paragraph_text_recognizer import ParagraphTextRecognizer


ParagraphTextRecognizer??

ptr = ParagraphTextRecognizer()

Using default model path: /home/rehanfarooq/fsdl/fsdl-text-recognizer-2022-labs/lab07/text_recognizer/artifacts/paragraph-text-recognizer/lunar.pt


Init signature: ParagraphTextRecognizer(model_path=None)
Source:
class ParagraphTextRecognizer:
    """Recognizes text in a single paragraph image or in all cells of a form."""

    def __init__(self, model_path=None):
        if model_path is None:
            model_path = STAGED_MODEL_DIRNAME / MODEL_FILE
            print(f"Using default model path: {model_path}")
        self.model = torch

And from there,
we can start running on images
and inferring the text that they contain:

In [None]:
from IPython.display import Image

example_input = "text_recognizer/tests/support/paragraphs/image5.png"

print(ptr.predict_on_form(example_input))
Image(example_input)

As usual,
we write our Python code
so that it can be imported as a module
and run in a Jupyter notebook,
for documentation and experimentation,
and we make it executable as a script
for easier automation:

In [35]:
%run text_recognizer/paragraph_text_recognizer.py --help

%run text_recognizer/paragraph_text_recognizer.py {example_input}

usage: paragraph_text_recognizer.py [-h] filename

Detects a paragraph of text in an input image.

positional arguments:
  filename    Name for an image file. This can be a local path, a URL, a URI
              from AWS/GCP/Azure storage, an HDFS path, or any other resource
              locator supported by the smart_open library.

options:
  -h, --help  show this help message and exit
Using default model path: /home/rehanfarooq/fsdl/fsdl-text-recognizer-2022-labs/lab07/text_recognizer/artifacts/paragraph-text-recognizer/lunar.pt




pre*?5mencer


Notice that the `filename` here can be a local file, a URL, or even a cloud storage URI.

Rather than writing the logic for handling these different cases,
we use the
[`smart_open` library](https://pypi.org/project/smart-open/).

## Testing our model development pipeline

Creating models is _the_ critical function of our code base,
so it's important that we test it,
at the very least with "smoke tests" that let us know
if the code is completely broken.

Right now we have tests for data loading and model training,
but no tests for end-to-end model development,
which combines data loading, model training, and model compilation.

So we add a simple model development test
that trains a model for a very small number of steps
and then runs our staging script.

This model development test script returns an error code (`exit 1`) if the process of
building a model fails (`"$FAILURE" = true`).

We use
[the `||` operator](https://www.unix.com/shell-programming-and-scripting/42417-what-does-mean-double-pipe.html)
to set the `FAILURE` variable to `true` if any of the key commands in model development fail.

In [27]:
!cat training/tests/test_model_development.sh

#!/bin/bash
set -uo pipefail
set +e

FAILURE=false

CI="${CI:-false}"
if [ "$CI" = false ]; then
  export WANDB_PROJECT="fsdl-testing-2022"
else
  export WANDB_PROJECT="fsdl-testing-2022-ci"
fi

echo "training smaller version of real model class on real data"
python training/run_experiment.py --data_class=IAMParagraphs --model_class=ResnetTransformer --loss=transformer \
  --tf_dim 4 --tf_fc_dim 2 --tf_layers 2 --tf_nhead 2 --batch_size 2 --lr 0.0001 \
  --limit_train_batches 1 --limit_val_batches 1 --limit_test_batches 1 --num_sanity_val_steps 0 \
  --num_workers 1 --wandb || FAILURE=true

TRAIN_RUN=$(find ./training/logs/wandb/latest-run/* | grep -Eo "run-([[:alnum:]])+\.wandb" | sed -e "s/^run-//" -e "s/\.wandb//")

echo "staging trained model from run $TRAIN_RUN"
python training/stage_model.py --entity DEFAULT --run "$TRAIN_RUN" --staged_model_name test-dummy --ckpt_alias latest --to_project "$WANDB_PROJECT" --from_project "$WANDB_PROJECT" || FAILURE=true

echo "fetching staged mod
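
The same "keep going, but remember that something failed" pattern can be sketched in Python with `subprocess`. This is an illustration of the control flow only: the two commands are trivial stand-ins (assumptions) for the training and staging steps, and the final `exit 1` is computed rather than executed.

```python
import subprocess
import sys


def run_steps(steps):
    """Runs every step in order and reports whether any of them failed."""
    failure = False
    for name, command in steps:
        print(f"running: {name}")
        result = subprocess.run(command)
        # Equivalent of `command || FAILURE=true`: continue, but record the failure.
        if result.returncode != 0:
            failure = True
    return failure


# Stand-in commands (assumptions): Python one-liners instead of training and staging.
steps = [
    ("training", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("staging", [sys.executable, "-c", "raise SystemExit(3)"]),
]

failure = run_steps(steps)
exit_code = 1 if failure else 0  # a real script would end with sys.exit(exit_code)
print(f"exit code would be {exit_code}")
```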

As a next step to improve the coverage of this test,
we might compare the model's outputs
on the same inputs before and after compilation.
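
One way such a check might look, as a minimal sketch on a toy network (an assumption standing in for the Text Recognizer). The model is scripted, round-tripped through serialization as a staged binary would be, and compared against the original on the same inputs.

```python
import io

import torch

torch.manual_seed(0)

# Toy stand-in (assumption): a small feed-forward network, not the Text Recognizer.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 4),
).eval()

# Compile, then save and reload through an in-memory buffer.
scripted = torch.jit.script(model)
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
reloaded = torch.jit.load(buffer)

# Run both versions on identical inputs.
inputs = torch.randn(8, 16)
with torch.no_grad():
    expected = model(inputs)
    actual = reloaded(inputs)

outputs_match = torch.allclose(expected, actual, atol=1e-6)
print(f"outputs match after compilation: {outputs_match}")
```

A tolerance-based comparison is used rather than exact equality because compiled programs may reorder floating-point operations.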

### Cleaning up artifacts

The final few lines of the testing script mention
"`selecting for deletion`" some artifacts.

As we incorporate more of our code into testing
and develop more models,
the amount of information we are storing on W&B increases.

We're already uploading model checkpoints, several gigabytes per model training run,
and now we're also looking at uploading several hundred megabytes
of model data per execution of our test.

Artifact storage is free up to 100GB,
but storing more requires a paid account.

That means it literally pays to clean up after ourselves.

We use a very simple script to select certain artifacts for deletion.
 
> ⚠️ **Don't use this untested demonstration script in important environments!** ⚠️
We include options for `-v`erbose output and a `--dryrun` mode,
which are both critical for destructive actions that have access
to model weights that might cost $1000s to produce.

See the `--help` below for more on cleaning up artifacts.

In [28]:
%run training/cleanup_artifacts.py --help

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize


ValueError: API key must be 40 characters long, yours was 7
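
To illustrate why a dry-run mode matters for destructive actions, here is a minimal sketch of the pattern. It is not the implementation of `cleanup_artifacts.py`: the `select_for_deletion` helper, the artifact records, and the alias names are all hypothetical, and "deleting" just flips a flag.

```python
def select_for_deletion(artifacts, keep_aliases=("latest", "prod-ready"), dry_run=True, verbose=False):
    """Hypothetical helper: returns the names selected for deletion.

    With dry_run=True nothing is modified; the selection is only reported.
    """
    selected = []
    for artifact in artifacts:
        if any(alias in keep_aliases for alias in artifact["aliases"]):
            if verbose:
                print(f"keeping {artifact['name']}")
            continue
        selected.append(artifact["name"])
        if dry_run:
            if verbose:
                print(f"would delete {artifact['name']}")
        else:
            # Stand-in for the real, irreversible delete call.
            artifact["deleted"] = True
            if verbose:
                print(f"deleted {artifact['name']}")
    return selected


# Made-up artifact records; real ones would come from the artifact store.
artifacts = [
    {"name": "model-abc:v0", "aliases": [], "deleted": False},
    {"name": "model-abc:v1", "aliases": ["latest"], "deleted": False},
]

print(select_for_deletion(artifacts, dry_run=True, verbose=True))
```

Defaulting `dry_run` to `True` means the destructive path has to be requested explicitly.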

## Tuning inference performance on CPU and GPU

Apart from compilation to TorchScript,
the biggest difference for running the model in production
is that now all of our operations occur on the CPU.

This is a surprising feature of DNN deployment
that's worth thinking about in detail.

Why isn't it a given that deep network inference
runs on GPUs, when that's so critical for deep network training?

First,
not many web applications use GPUs,
so there aren't nearly as many good tools and techniques
for deploying GPU-backed services.

But there's another, deeper reason:
GPUs are not as easy to run efficiently
during inference as they are in training.

In training,
we use static or synthetic datasets
and our training code is in charge
of the query patterns.

In particular,
we can request exactly as many inputs
as we want to produce a batch
that makes optimal use
of our expensive GPUs.

In production, requests arrive independently,
according to the whims of our users.

This makes batching challenging,
and by far the simplest service architecture
just runs on each request as it arrives.

But that tanks GPU utilization.

GPUs are highly parallel computers,
and batch is the easiest dimension to parallelize on --
for example, we load the model weights into memory once,
use them, and then release the memory.

The cell below
compares two traces
for a GPU-accelerated
Text Recognizer model running
on a single input and on a batch.

For a simple summary,
you can compare the two profiles in TensorBoard
([batch size 1 here](https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-labs-lab05_training/runs/1vj48h6j/tensorboard?workspace=user-cfrye59),
[batch size 16 here](https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-training/runs/67j1qxws/tensorboard?workspace=user-cfrye59)).

GPU utilization,
our baseline metric for model performance,
is under 50% with batch size 1,
as compared to >90% with batch size 16,
which fills up GPU RAM.

You can also look through the traces for more details:

In [3]:
trace_comparison_url = "https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-labs-lab05_training/reports/Trace-Comparison-Batch-Size-16-vs-1--VmlldzoyNTg2MTU4"

print(trace_comparison_url)
IFrame(src=trace_comparison_url, width="100%", height=frame_height)

https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-labs-lab05_training/reports/Trace-Comparison-Batch-Size-16-vs-1--VmlldzoyNTg2MTU4


But performance during inference is not as simple 
as just "maximize GPU utilization".

In particular, throughput for the GPU with batch size 16
is over 2x better,
one example per 8 ms vs
one example per 40 ms,
but latency is much worse.

It takes 140 ms to complete the batch of size 16.
In the intervening time no examples are completed,
and all 16 users are waiting on a response.

For comparison,
running one example at a time
would get the first user's result
in just 40 ms,
but the total processing time for all 16 examples would be
640 ms.
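
The arithmetic behind this comparison, worked through with the timings quoted above. Dividing 140 ms by 16 gives 8.75 ms per example, close to the rounded figure of 8 ms; the simplifying assumption here is that all 16 requests arrive at the same moment.

```python
batch_size = 16
batch_time_ms = 140  # time to complete one batch of 16, from the text
single_time_ms = 40  # time to complete one example at batch size 1, from the text

# Throughput view: average time spent per example.
per_example_batched_ms = batch_time_ms / batch_size
per_example_single_ms = single_time_ms

# Latency view: how long each of 16 simultaneous users waits for a result.
batched_waits = [batch_time_ms] * batch_size
serial_waits = [single_time_ms * (i + 1) for i in range(batch_size)]

print(f"time per example: {per_example_batched_ms:.2f} ms batched vs {per_example_single_ms} ms single")
print(f"first result: {min(batched_waits)} ms batched vs {min(serial_waits)} ms serial")
print(f"last result: {max(batched_waits)} ms batched vs {max(serial_waits)} ms serial")
```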

For user experience, latency is critical,
but for making the most efficient use of hardware,
throughput is generally more important.

During training, we care much less about latency
and much more about computing gradients as fast as possible,
so we aim for larger batch sizes.

Because of the need for efficient use of hardware,
running on single inputs isn't always feasible.

The usual solution is to run a queue,
which collects up enough requests for a batch.

One of the easiest ways to do this as of writing in September 2022 is to use
[`cog` by Replicate](https://github.com/replicate/cog),
which both solves difficult issues with containerizing
models with GPU acceleration 
and includes, as a beta feature, a built-in Redis queue
for batching requests and responses.

But note that we can't just run a queue that waits for,
say, 16 user requests
to build up, then runs them all.
If 15 requests come in at once,
but then no requests come for an hour,
all 15 users will be waiting for an hour
for their responses --
much worse than just waiting a few hundred extra milliseconds!

We need to make sure the queue flushes after a certain amount of time,
regardless of how many requests it has received,
complicating our implementation.
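
A minimal sketch of such a flush-on-timeout queue, using only the standard library. The `collect_batch` function and its parameters are illustrative assumptions, not the mechanism used by `cog`. For simplicity, the deadline starts when collection begins rather than when the first request arrives.

```python
import queue
import time


def collect_batch(pending, max_batch_size=16, max_wait_seconds=0.1):
    """Gathers up to max_batch_size requests, never waiting longer than max_wait_seconds."""
    batch = []
    deadline = time.monotonic() + max_wait_seconds
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # flush a partial batch rather than keep users waiting
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # timed out while waiting for the next request
    return batch


# Fifteen requests arrive at once, and then nothing else does.
pending = queue.Queue()
for i in range(15):
    pending.put(f"request-{i}")

start = time.monotonic()
batch = collect_batch(pending, max_batch_size=16, max_wait_seconds=0.1)
elapsed = time.monotonic() - start
print(f"flushed {len(batch)} requests after {elapsed:.2f} s")
```

The partial batch of 15 is released once the wait budget runs out, instead of stalling until a sixteenth request shows up.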

Running single inputs on GPUs
and running a naive queue
are two different ways it's easy to accidentally tank latency
while pursuing efficiency,
at least for some fraction of cases.

So we stick with CPU inference.

# Building a simple model UI

With compilation,
we've moved from a model that can only run
in a very special environment
and with lots of support code
into something lightweight
that runs with a simple CLI.

If we want users to send data to our model
and get useful predictions out,
we need to create a UI.

But a CLI is not a UI --
it's at best the foundation out of which a UI is built.

This is not just a concern once the model is finished:
a UI is an incredible tool for model debugging.

It's hard to overstate the difference between
a static, CLI or code-writing workflow
for sending information to a model
and an interactive interface.

When your model is easily accessible on a mobile phone,
when you can copy-paste text from elsewhere on your machine or the internet,
or when you can upload arbitrary files,
the whole range of possible inputs becomes clear
in a way that's very hard to replicate with fixed data sets.

Unfortunately, creating a GUI from scratch is not easy,
especially in Python.

The best tool for GUIs is the browser,
but the lingua franca of the browser
is JavaScript
([for now](https://webassembly.org/)).

As full stack deep learning engineers,
we're already writing Python with C/C++ acceleration,
we're gluing scripts together with Bash,
and we need to know enough SQL to talk to databases.

Do we now need to learn front-end web development too?

In the long term, it's a good investment,
and we recommend
[The Odin Project](https://www.theodinproject.com/),
a free online course and community for learning web development.

Their
[Foundations course](https://www.theodinproject.com/paths/foundations/courses/foundations#html-foundations),
starting from HTML foundations and proceeding
through basic CSS
and JavaScript,
is a great way to dip your toes in
and learn enough about building websites and UIs
in the browser to be dangerous.

In the short term,
we write our frontends in Python libraries
that effectively write the frontend JavaScript/CSS/HTML
for us.

For the past few years,
[Streamlit](https://streamlit.io/)
has been a popular choice for the busy Python data scientist.

It remains a solid choice,
and tooling for building complex apps with Streamlit is more mature.

We use the
[`gradio` library](https://gradio.app/),
which includes a simple API for wrapping
a single Python function into a frontend
in addition to a less mature, lower-level API
for building apps more flexibly.



This iteration of the FSDL codebase
includes a new module,
`app_gradio`,
that makes a simple UI for the Text Recognizer
using `gradio`.

The core component is a script,
`app_gradio/app.py`,
that can be used to spin up our model and UI
from the command line:

In [9]:
%cd ../

/home/rehanfarooq/fsdl


In [3]:
%run app_gradio/app.py --help

usage: app.py [-h] [--model_url MODEL_URL] [--port PORT]

Provide an image of handwritten text and get back out a string!

options:
  -h, --help            show this help message and exit
  --model_url MODEL_URL
  --port PORT


But one very nice feature of `gradio`
is that it is designed to run as easily
from the notebook as from the command line.

Let's import the contents of `app.py`
and take a look,
then launch our UI.

In [9]:
from app_gradio import app


app.make_frontend??
frontend = app.make_frontend(ptr.predict_on_form)

  # Build the interface


Signature:
app.make_frontend(
    fn: Callable[[<module 'PIL.Image' from '/home/rehanfarooq/.conda/envs/fsdl-text-recognizer-2022/lib/python3.10/site-packages/PIL/Image.py'>, bool], str],
)
Docstring: Creates a gradio.Interface frontend for the prediction function.
Source:
    def _log_inference(self, pred, metrics):
        for key, value in metrics.items():
            logging.info

IMPORTANT: You are using gradio version 3.40.1, however version 4.44.1 is available, please upgrade.
--------


We use `gradio`'s high-level API, `gr.Interface`,
to build a UI by wrapping our `ptr.predict_on_form` function,
defining its inputs
(an `Image`)
and outputs
(a `Textbox`),
and specifying some formatting
and styling choices.



We can spin up our UI with the `.launch` method,
and now we can interact
with the model from inside the notebook.


In [None]:
frontend.launch(share=True, width="100%")

Running on local URL:  http://127.0.0.1:7862
Running on public URL: https://5b3117166cd8f1e5f6.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




--- Starting Full Form Processing ---

Step 2: Detecting table cells on a low-resolution copy...

Successfully extracted 256 cell coordinates.

Step 3: Scaling coordinates to match original image resolution...

Step 4: Running OCR on padded, high-resolution cell crops...




Prediction for Cell 0: 'sambour.
comprobishion Riddies line.'




Prediction for Cell 1: 'Compover-hase Shavied Garmusy'




Prediction for Cell 2: 'sumberson scarised-'




Prediction for Cell 3: 'over 4. 8.57... 270 -950...'




Prediction for Cell 4: 'is.'




Prediction for Cell 5: '1934.'




Prediction for Cell 6: 'corp.'




Prediction for Cell 7: 'ay.'




Prediction for Cell 8: '20.'




Prediction for Cell 9: 'and ret preas!
morld and somictee'




Prediction for Cell 10: 'Burn.'




Prediction for Cell 11: 'every.
wood.'




Prediction for Cell 12: '200.'




Prediction for Cell 13: 'the.'




Prediction for Cell 14: '#'




Prediction for Cell 15: 'of the'




Prediction for Cell 16: 'erse.'




Prediction for Cell 17: 'houndly.
womado.'




Prediction for Cell 18: 'Priasient'




Prediction for Cell 19: 'but. Kle piper
Deciastord'




Prediction for Cell 20: '#'




Prediction for Cell 21: 'the'




Prediction for Cell 22: '#'




Prediction for Cell 23: '#'




Prediction for Cell 24: 'ther.'




Prediction for Cell 25: 'as.'




Prediction for Cell 26: ''Air.'




Prediction for Cell 27: ''A.'




Prediction for Cell 28: ''they.'
of they areling ather.
'Awlesty are's depling that
'they arther?'
of they starest from -
eling the splater.
'Elthey from jerkey.
'Oher ither?'
'the Gold proting the ifter.
'Oher ither?'
gether! She warkely from -
of they arther?'
of they's ater?'




Prediction for Cell 29: '1914.'




Prediction for Cell 30: 'and ret pighe.
Promachil (G830.'




Prediction for Cell 31: 'sid." a'




Prediction for Cell 32: '"at & I'




Prediction for Cell 33: '250.'




Prediction for Cell 34: 'they.'




Prediction for Cell 35: '#'




Prediction for Cell 36: '#'




Prediction for Cell 37: 'eys'




Prediction for Cell 38: ''A.'




Prediction for Cell 39: '"ar.'




Prediction for Cell 40: ''are.'




Prediction for Cell 41: '#'




Prediction for Cell 42: '#'




Prediction for Cell 43: '#'




Prediction for Cell 44: 'oper'




Prediction for Cell 45: 'so.'




Prediction for Cell 46: 'as a'




Prediction for Cell 47: '19537.'




Prediction for Cell 48: ''I'




Prediction for Cell 49: ''they.'
of they areling ather.
'Awly are's proting.
'Arglist, from atjoker.
*?6ther?'




Prediction for Cell 50: 'in.'




Prediction for Cell 51: 'bisow armose
I friled I filed I'




Prediction for Cell 52: '# "a'




Prediction for Cell 53: 'as "a'




Prediction for Cell 54: '210'




Prediction for Cell 55: '#'




Prediction for Cell 56: '#'




Prediction for Cell 57: 'they.'




Prediction for Cell 58: 'as or'




Prediction for Cell 59: 'grees.
all ABDIR'




Prediction for Cell 60: 'eaboder.'




Prediction for Cell 61: 'biform ethange
wom Six and or'




Prediction for Cell 62: '#'




Prediction for Cell 63: '#'




Prediction for Cell 64: '#'




Prediction for Cell 65: '#'




Prediction for Cell 66: 'they.'




Prediction for Cell 67: 'a a'




Prediction for Cell 68: '135).
exister.'




Prediction for Cell 69: ''s.'




Prediction for Cell 70: 'gethey.'
of they are from -
of they.'
'Arglist a from -
'they arthey from -
gether! Whapterly are's
eling the splater.
eling the starest.
eling the starest after inglest.
'Oher goling tharest.
'Oher goling thalf, the grow?
elorgher after splower.
'Eurley 'Arglesey 'A'




Prediction for Cell 71: 'as.'




Prediction for Cell 72: 'as "eim is'




Prediction for Cell 73: 'as "a.'




Prediction for Cell 74: 'as ".'




Prediction for Cell 75: 'ast.'




Prediction for Cell 76: 'they.'




Prediction for Cell 77: 'they.'




Prediction for Cell 78: '#'




Prediction for Cell 79: 'er.'




Prediction for Cell 80: 'arond.'




Prediction for Cell 81: ''I'




Prediction for Cell 82: '"am.'




Prediction for Cell 83: 'proy.'




Prediction for Cell 84: '#'




Prediction for Cell 85: '#'




Prediction for Cell 86: '#'




Prediction for Cell 87: 'ers.'




Prediction for Cell 88: '& they are of they.
of they.'
of the splact of the
ast inglast inglest.
is # a splact.
get a # say.'
of the sigher. 'A stay.'
of the sigher.
of the sither stay.
of the sigher. 'A grow?
of the sigher.
of the sigher after if shep.
as iong arest a'




Prediction for Cell 89: 'the'




Prediction for Cell 90: 'of they.' Archait a from -
of they.'
of they arkjoling at.
'the stay.' A writher?'
of they starest from -
of they arther?'
of they arther?'
of there starest from -
of there starplest.
of there bety from -
of they starest from -
of they arther?''




Prediction for Cell 91: ''ely.'




Prediction for Cell 92: '& they are's aftering.
of they.'
'Arglist a from -
'the Goldring thar.
'the Gold proter.
'the Governmest from "by the
of ther! Shep what bely to gret.
of there stary. 'Arglish, from 'the
of there stay. 'A writher?'
gother! She prober! She'




Prediction for Cell 93: 'whip.'




Prediction for Cell 94: 'limp.'




Prediction for Cell 95: 'whips.'




Prediction for Cell 96: '#'




Prediction for Cell 97: '#'




Prediction for Cell 98: '#'




Prediction for Cell 99: '#'




Prediction for Cell 100: 'grether! Why are's aftering.
'Oher dialing thad.
'Arglesty are's defrity.
'Arglesty from 'depling to
goling the splater.
of there from -
growler defrity. 'Arglish, from -
of they arest.
'Arglish, from jerkey arest.
of they arest from -
growlerghar & they are's gret.'




Prediction for Cell 101: 'ploy.'




Prediction for Cell 102: '& they are's aftering.
of they.'
of they are's afteriat.
gret defrity are's beting the
of they.'
of they starely from -
of they.'
of they starest from 'the grow?
of there stay. 'Arglish, from 'the
gother! She way.'
of they set. A grow?'




Prediction for Cell 103: 'liam.'




Prediction for Cell 104: '#'




Prediction for Cell 105: '#'




Prediction for Cell 106: '#'




Prediction for Cell 107: '#'




Prediction for Cell 108: '#'




Prediction for Cell 109: 'as a'




Prediction for Cell 110: ''they.'
of they areling at.
'Arglesey are's defrity.
'Arglesty are's defrity.
'Oher deflority.
'They arely from 'deplaing at.
of there stay.'
of they starest after inglest.
'Oher dif they are.
'the growlest from -
'they arely from 'the grow?'




Prediction for Cell 111: 'they.'
of they are from atjoker.
of they arkely -
of they starest from -
querly from jerkey.
of there stary. 'A
of they starely from -
of ther?' Strey hard 'the grow?
of there from 'the Governmest.
of there from jerkey.
of they arther?'
of they from jeokery.'




Prediction for Cell 112: 'get.'




Prediction for Cell 113: '& they are's aftering.
of they.'
of they are's afteriat.
gret afriest a.
the from jerkey arest.
of there byck.'
of they from 'the Governmest.
gower. 'Anglesey are's after.
gother! She wark, from 'the
of there stay. 'A writher?'
of they from grow?'




Prediction for Cell 114: 'wiplen.'




Prediction for Cell 115: 'lion.'




Prediction for Cell 116: 'they.'




Prediction for Cell 117: '#'




Prediction for Cell 118: '#'




Prediction for Cell 119: '#'




Prediction for Cell 120: '#'




Prediction for Cell 121: 'oftery.'




Prediction for Cell 122: 'they.'




Prediction for Cell 123: ''them.''




Prediction for Cell 124: 'limp.'




Prediction for Cell 125: 'they.'




Prediction for Cell 126: '#'




Prediction for Cell 127: 'they.'




Prediction for Cell 128: 'erge,'




Prediction for Cell 129: 'gredy.'




Prediction for Cell 130: 'I'




Prediction for Cell 131: ''they.'
of they are from a.
'Arglesey are's defrity.
'Arglesty are's defrity.
'They arely from -
'the splact.
'You gret afterly. 'Awly 'the
of they starest.
of there stay.'
'Oher dif the splater.
'Oher gither?'
'the Golvery from jeoking at.
'they.''




Prediction for Cell 132: 'Bylo: 3.K preding ather.
of they arkely -
of they arkely -
'the Goldring to frest.
of they surget after.
of therely from jerkey.
outher! She warkjoling to
of there starplest.
of ther stay. 'A writher?'
of they starest from -
of they arther?''




Prediction for Cell 133: ''Welforety. 'A writher?'
of they are's aftering at
'the diplority -
'the Goldring to frest.
'You gret limp; the Governmest
of there of ther?'
of there stary from -
of there starplest.
'Oher if the jokery.
'Oher Gold problest from -
'You.K." I whap they?'




Prediction for Cell 134: '& they are's aftering.
of they.'
of they are's aftering -
of they starest.
gret of they set.
of thered from 'the Governmest.
of ther! Shep they are's
of there stay. 'Arglish, from -
of they starest.
goling the from -
of they starest. 'A wrigher.'




Prediction for Cell 135: 'limply.'




Prediction for Cell 136: '"go ig.'




Prediction for Cell 137: 'whips.'




Prediction for Cell 138: 'of Her.'




Prediction for Cell 139: 'ere.'




Prediction for Cell 140: '#'




Prediction for Cell 141: '#'




Prediction for Cell 142: 'they.'




Prediction for Cell 143: 'they.'




Prediction for Cell 144: ''them.' Awly Sambly a.
of they.'
'Arglesty afterly -
'the Govery from 'deplaing.
'the splact.
of they starest from 'the
of they.'
of they starest from 'the Gover.
of them?' Strey harest.
of the splact be stay.
of there from jeoking at.
of they set. I way.''




Prediction for Cell 145: 'whip.'




Prediction for Cell 146: '#'




Prediction for Cell 147: '#'




Prediction for Cell 148: '#'




Prediction for Cell 149: 'they.'




Prediction for Cell 150: 'ory.'




Prediction for Cell 151: 'a a'




Prediction for Cell 152: 'Byrou. 'Arglist a from -
of they.'
are dialty a from -
the Governmest joker.
goling the splact.
of there stay.'
of they starely from -
of there stay. 'Arglish, from -
of they starest from -
growlest beling tharest.
of they from jerkey.
of they arthey from grow.'




Prediction for Cell 153: 'here. I arghard after. 'Awly a
of they.'
arthey from jeoking athest.
of they starest from -
grether! Whapterly in the
of there stay.
of there stary from -
of there starplest.
of there from -
grether! She wark, from Edget.
of there problest.
of they from grow?'




Prediction for Cell 154: 'ably. 'Arglist a from -
'Only incly.
'Arglist a.
'A charget af they are.
of they arther?'
of they arely from 'det.
outher! She probles! he
outher! She warkjoling that.
of there stay. 'Arglish, from 'the
of they arther?'
of they arely from -'




Prediction for Cell 155: 'they.'




Prediction for Cell 156: 'if.'




Prediction for Cell 157: 'are.'




Prediction for Cell 158: 'look.'




Prediction for Cell 159: '#'




Prediction for Cell 160: '#'




Prediction for Cell 161: 'they.'




Prediction for Cell 162: '#'




Prediction for Cell 163: 'ory.'




Prediction for Cell 164: 'why.'




Prediction for Cell 165: ''them.' Awly Sambly a.
of they.'
'Arglesty afterly -
'the Govery from 'deplaing.
'they inclest from -
of they.'
of they starely from -
of they.' Stred I washer.
of there stay. 'Arglish, from 'the
of they starest.
'Oher gither?'
of they set. A grow?'




Prediction for Cell 166: 'whip.'




Prediction for Cell 167: '#'




Prediction for Cell 168: 'prety.'




Prediction for Cell 169: 'es.'




Prediction for Cell 170: 'us.'




Prediction for Cell 171: 'as.'




Prediction for Cell 172: 'a of'




Prediction for Cell 173: 'they.'




Prediction for Cell 174: 'the'




Prediction for Cell 175: 'they.'
of they are from a.
'Arglist, from -
of they arkjoling thad.
the from -
of they starest from -
elinget afrither. 'Awly the
outher! She warkjouther.
outher! She wither?'
of there from -
grether! She warkeng the arther.
of there for eftory. 'A'




Prediction for Cell 176: 'that.'




Prediction for Cell 177: 'I'




Prediction for Cell 178: 'list.'




Prediction for Cell 179: 'is.'




Prediction for Cell 180: '#'




Prediction for Cell 181: '#'




Prediction for Cell 182: ''As'




Prediction for Cell 183: 'asio.'




Prediction for Cell 184: 'it.'




Prediction for Cell 185: 'they.'




Prediction for Cell 186: 'they.''




Prediction for Cell 187: 'if'




Prediction for Cell 188: 'er.'




Prediction for Cell 189: 'the'




Prediction for Cell 190: '#'




Prediction for Cell 191: '#'




Prediction for Cell 192: ''Asy.'




Prediction for Cell 193: ''so'




Prediction for Cell 194: 'they.'
of they are from a.
'Arglist, from -
'the diplating ather.
grether! She wark, from -
goling the splact.
of there stay.'
of there stay. 'Arglish, from 'the
gother! She way.'
of there stay. 'A writher?'
gother! She problest after goling.
of they.''




Prediction for Cell 195: 'there belty are from -
of they.'
of they arkely -
'Arglist from 'deplaing.
'the Goldring tharest.
eling the from -
eling the splater.
of there stary from 'the Gover.
gether! Shep the Gold apto inflority.
of there from jerkey.
of there from -'




Prediction for Cell 196: 'they.'




Prediction for Cell 197: 'god.'




Prediction for Cell 198: 'why.'




Prediction for Cell 199: 'loy.'




Prediction for Cell 200: 'why.'




Prediction for Cell 201: 'at.'




Prediction for Cell 202: 'or.'




Prediction for Cell 203: 'get.'




Prediction for Cell 204: 'It'




Prediction for Cell 205: 'pro-'




Prediction for Cell 206: 'they.'




Prediction for Cell 207: '.'




Prediction for Cell 208: 'of'




Prediction for Cell 209: 'gresty.'




Prediction for Cell 210: 'erget'




Prediction for Cell 211: ''as.'




Prediction for Cell 212: '#'




Prediction for Cell 213: 'oby.'




Prediction for Cell 214: 'as is'




Prediction for Cell 215: 'they.'
of they are from a.
'Arglist, from -
'the diplating ather.
the from jeoking the
queling tred. 'A writher?'
of there stay. 'Arglish, from -
gourlesty. 'Arthey 'defrither.
gother! She wark, from 'the
goling the spetharget.
of there stay.'




Prediction for Cell 216: ''they.' Archait a from -
'they arkjoling ather.
'Arglist, from atjoker.
'the splact bely -
'the Goldring thart.
'Bether! She problest -
of ther?' Strey harest after inglest.
of there starplo if the
golerity from jerkey.
'Oher stilest from -
'elty. I ways from grother.'




Prediction for Cell 217: 'It'




Prediction for Cell 218: ''they.'
of they are prober.
'Arglesey are's defrity.
'the arthey from -
of they starest from -
of they starest.
of they starely from -
owlerght and the grow.
oper the Governmest.
of there stay. 'A writher?'
of they starest from -
of they set. A grow?'




Prediction for Cell 219: 'whip.'




Prediction for Cell 220: 'of.'




Prediction for Cell 221: 'story.'




Prediction for Cell 222: 'her.'




Prediction for Cell 223: '#'




Prediction for Cell 224: 'It'




Prediction for Cell 225: 'theor.'




Prediction for Cell 226: 'they.'




Prediction for Cell 227: 'of they.'
of they are prober.
'the diplority.
'Oher defarly -
'Arglesty from 'deplaing at.
of they starest.
of they starely from -
owlerght a.
'the stay.' Stred harge, from -
of they inclest.
ing the splact by infloring -
of they starest.
'Oher gither?''




Prediction for Cell 228: 'if the'




Prediction for Cell 229: 'whip.'




Prediction for Cell 230: '#'




Prediction for Cell 231: 'he.'




Prediction for Cell 232: 'the.'




Prediction for Cell 233: 'get.'




Prediction for Cell 234: 'her.'




Prediction for Cell 235: ''Ar'




Prediction for Cell 236: 'they.'
of they are from a.
'Arglist, from -
'the diplating ather.
'the Goldring tharest.
golerity from 'the Governmest.
of there stay.
of there stary from 'the grow?
of there starplest.
'Oher goling tharest.
goling the splater. 'A'




Prediction for Cell 237: 'got.'




Prediction for Cell 238: ''they.'
'Oher dialty a from -
'they arkely -
'Arglist a.
'A clorght cold. 'A writher?'
'they if they arest.
lestly inflority. 'Awly in the
'therefor the Governmest.
'elting the joker.
'Oher ither?'
gether! She wark, from grother.
'elting leting to left.'




Prediction for Cell 239: 'her.'




Prediction for Cell 240: 'why.'




Prediction for Cell 241: 'do.'




Prediction for Cell 242: 'whip.'




Prediction for Cell 243: 'are.'




Prediction for Cell 244: 'It'




Prediction for Cell 245: 'proy'




Prediction for Cell 246: 'get.'




Prediction for Cell 247: 'he.'




Prediction for Cell 248: 'they.'




Prediction for Cell 249: 'git.'




Prediction for Cell 250: 'less.'




Prediction for Cell 251: 'er.'




Prediction for Cell 252: 'asy.'




Prediction for Cell 253: '#'




Prediction for Cell 254: 'get.'




Prediction for Cell 255: ''the.'

--- Full Form Processing Complete ---


For 72 hours, we can also access the model over the public internet
using a URL provided by `gradio`:

In [16]:
print(frontend.share_url)

https://8650feb0428bd7a433.gradio.live


You can point your browser to that URL
to see what the model looks like as a full-fledged web application,
instead of a widget inside the notebook.

In addition to this UI,
`gradio` also creates a simple REST API,
so we can make requests
from outside the browser,
programmatically,
and get responses.

In [17]:
%env API_URL={frontend.share_url + "/api"}

env: API_URL=https://8650feb0428bd7a433.gradio.live/api


We can see the details of the API by clicking
"view api" at the bottom of the Gradio interface.

In particular,
we can see that the API expects image data in
[base64 format](https://developer.mozilla.org/en-US/docs/Glossary/Base64),
which encodes binary data as ASCII text
so that it can be sent over interfaces that expect ASCII text.
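
As a rough illustration of what that encoding step does,
the sketch below uses only the Python standard library
to turn a few raw bytes into a base64 data URI.
The helper name `to_data_uri` is hypothetical and is not part of the lab's codebase.

```python
import base64


def to_data_uri(raw_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode binary data as an ASCII-only data URI (hypothetical helper)."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# the eight-byte signature that starts every PNG file
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
print(uri)

# decoding the part after the comma recovers the original bytes exactly
assert base64.b64decode(uri.split(",", 1)[1]) == png_signature
```

The binary input contains bytes that are not printable,
but the resulting string is plain ASCII and can safely sit inside a JSON document.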

The line below encodes an image with the `base64` utility,
packages it into the appropriate JSON format
and uses `echo` to pipe it into a `curl` command.

`curl` can be used to make requests to web services at URLs
-- here `${API_URL}/predict` --
of specific types
-- here `POST` --
that include `-d`ata
and `-H`eaders identifying the format of the data.

The response is returned as
[string-formatted JSON](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON).
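
For readers more comfortable in Python than in the shell,
here is a sketch of how the same pieces of a request
(URL, method, data, headers) fit together,
using the standard library's `urllib.request`.
The URL and the tiny payload are placeholders, not a real endpoint or a real image,
and the request is only constructed, never sent.

```python
import json
import urllib.request

# placeholder endpoint: substitute the value of API_URL from your own session
api_url = "https://example.invalid/api"

# placeholder payload: a real request would carry a full base64-encoded image
payload = json.dumps({"data": ["data:image/png;base64,iVBORw0KGgo="]}).encode("utf-8")

request = urllib.request.Request(
    url=f"{api_url}/predict",
    data=payload,  # plays the role of curl's -d
    headers={"Content-Type": "application/json"},  # plays the role of curl's -H
    method="POST",  # plays the role of curl's -X POST
)

print(request.get_method(), request.full_url)
print(request.header_items())
# urllib.request.urlopen(request) would actually send it; that step is omitted here
```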

In [None]:
response, = ! \
  (echo -n '{ "data": ["data:image/png;base64,'$(base64 -w0 -i text_recognizer/tests/support/paragraphs/a01-077.png)'"] }') \
  | curl -s -X POST "${API_URL}/predict" -H 'Content-Type: application/json' -d @-

response


JSON, short for "JavaScript Object Notation",
is effectively the standard for representing dictionaries
when sharing information between applications
that may be written in different languages.

With the standard library's `json.loads`,
we can convert the response into a Python dictionary
and then access the response `data` within.
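
To make that concrete without depending on a live server,
the sketch below parses a made-up response string.
Only the top-level `data` key mirrors what the text describes;
the value inside it is invented for illustration.

```python
import json

# a made-up response body with a top-level "data" key
example_response = '{"data": ["some recognized text"]}'

parsed = json.loads(example_response)  # str -> dict
print(type(parsed).__name__)
print(parsed["data"][0])
```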

In [19]:
import json


print(json.loads(response)["data"][0])


Importantly, the `echo | curl` command
does not need to be run from the same machine that is running the model --
that's another big win for this UI over the CLI script we ran previously.

Try running the command from your own machine,
if you are running OS X or Linux,
and see if you can get a response.

Don't forget to define the `API_URL` environment variable on your machine
and download the image file,
`text_recognizer/tests/support/paragraphs/a01-077.png`,
changing the path if needed.

Once you're done,
turn off the Gradio interface by running the `.close` method.

In [9]:
frontend.close()

Closing server running on port: 7860


## Testing our UI

We've added a lot of new functionality here,
and some of it is critical to our application.

The surface area is too large and
the components too complex for testing in depth
to be worth the investment --
do we really want to set up a
[headless browser](https://www.browserstack.com/guide/what-is-headless-browser-testing)
or similar mock test to check whether our README is being loaded properly?

So once again, we pick the minimal test that checks whether
the core functionality is working:
we spin up our frontend and ping the API,
making sure we get back a
[`200 OK`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200)
response, indicating that at least the server thinks everything is fine.
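
The shape of such a smoke test can be sketched without Gradio at all.
Below, a tiny standard-library HTTP server stands in for the frontend
(the `OkHandler` class and `smoke_test` function are hypothetical names),
and the test simply starts it in the background, pings it, and checks the status code.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Stand-in for a real frontend: answers every GET with 200 OK."""

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):  # keep the output quiet
        pass


def smoke_test() -> int:
    """Start the server without blocking, send one GET, return the status code."""
    server = HTTPServer(("127.0.0.1", 0), OkHandler)  # port 0: let the OS choose
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        connection = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        connection.request("GET", "/")
        status = connection.getresponse().status
        connection.close()
        return status
    finally:
        server.shutdown()
        server.server_close()


status = smoke_test()
assert status == 200, status
```

A passing check like this says nothing about prediction quality;
it only confirms that the service starts and answers.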

In [36]:
!cat app_gradio/tests/test_app.py

import json
import os

import requests

from app_gradio import app
from text_recognizer import util


os.environ["CUDA_VISIBLE_DEVICES"] = ""


TEST_IMAGE = "text_recognizer/tests/support/paragraphs/a01-077.png"


def test_local_run():
    """A quick test to make sure we can build the app and ping the API locally."""
    backend = app.PredictorBackend()
    frontend = app.make_frontend(fn=backend.run)

    # run the UI without blocking
    frontend.launch(share=False, prevent_thread_lock=True)
    local_url = frontend.local_url
    get_response = requests.get(local_url)
    assert get_response.status_code == 200, get_response.content

    image_b64 = util.encode_b64_image(util.read_image_pil(TEST_IMAGE))

    local_api = f"{local_url}api/predict"
    headers = {"Content-Type": "application/json"}
    payload = json.dumps({"data": ["data:image/png;base64," + image_b64]})
    post_response = requests.post(local_api, data=payload, headers=headers)
    assert post_response.status_code == 200, post_response.content

## Start here, finish anywhere

You may be concerned:
Is `gradio` a children's toy?
Am I painting myself into a corner
by using such a high-level framework and doing web development in Python?
Shouldn't I be using Ruby on Rails/Angular/React/WhateversNext.js?

DALL-E Mini, now
[crAIyon](https://www.craiyon.com/),
began its life as
[a Gradio app](https://huggingface.co/spaces/dalle-mini/dalle-mini)
built by FSDL alumnus
[Boris Dayma](https://twitter.com/borisdayma).

Gradio and similar tools
are critical for quickly getting to an MVP
and getting useful feedback on your model.

Expend your engineering effort on data and training,
not frontend interface development,
until you're sure you've got something people want to use.

# Wrapping a model into a model service

We've got an interactive interface for our model
that we can share with friends, colleagues,
potential users, or stakeholders,
which is huge.

But we have a problem:
our model is running in the same place as our frontend.

This is simple,
but it ties too many things together.

First, it ties together execution of the two components.

If the model has a heart attack due to misformatted inputs
or some mysterious DNN bug,
the server goes down.
The same applies in reverse --
the only API for the model is provided by `gradio`,
so a frontend issue means the model is inaccessible.

Additionally, it ties together dependencies,
since our server and our model are in the same
environment.

Lastly, it ties together the hardware used to run our
server and our model.

That's bad because the server and the model scale differently.
Running the server at scale has different memory and computational requirements
than does running the model at scale.

We could just run another server --
even writing it in Gradio if we wanted! --
for the model.
This is common with GPU inference,
especially when doing queueing, caching,
and other advanced techniques for improving
model efficiency and latency.

But that's potentially expensive --
we're running two machines,
which costs twice as much.

Furthermore, this setup is harder to scale "horizontally".

We'll pretty quickly need a solution for auto-scaling
our two servers independently,
e.g. directly in a container orchestration service, like
[Kubernetes](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/),
or in a managed version of the same, like
[Elastic Kubernetes Service](https://aws.amazon.com/eks/),
or with an infrastructure automation tool, like
[Terraform](https://www.terraform.io/).

Luckily, there is an easier way, because our model service-plus-UI
combo fits into a common pattern.

We have a server that we want to be up all the time,
ready to take requests,
but we really only need
the model service to run when a request hits.

And apart from its environment (which includes the weights),
the model only needs the request in order to produce a result.

It does not need to hold onto any information in between executions --
it is _stateless_.

This pattern is common enough that all cloud providers
offer a solution that takes the pain out of scaling
the stateless component:
"serverless cloud functions",
so named because
- they are run intermittently, rather than 24/7 as a server is.
- they are run on cloud infrastructure.
- they are, as in
[purely functional programming](https://en.wikipedia.org/wiki/Purely_functional_programming)
or in mathematics, "pure" functions of their inputs,
with no concept of state.
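
The difference between stateless and stateful components
can be sketched in a few lines.
Both names below are hypothetical and exist only for this illustration.

```python
def stateless_handler(request: dict) -> dict:
    """Output depends only on the input, so any fresh copy gives the same answer."""
    return {"length": len(request["text"])}


class StatefulCounter:
    """Counter-example: the result depends on how many calls came before."""

    def __init__(self):
        self.calls = 0

    def handle(self, request: dict) -> dict:
        self.calls += 1
        return {"length": len(request["text"]), "call_number": self.calls}


request = {"text": "hello"}

# two calls, or two separate machines, give identical results
print(stateless_handler(request), stateless_handler(request))

# the same request gives different results on the first and second call
counter = StatefulCounter()
print(counter.handle(request), counter.handle(request))
```

Because the first function keeps nothing between calls,
a cloud provider is free to start, stop, and duplicate copies of it as traffic changes.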

We use AWS's serverless offering,
[AWS Lambda](https://aws.amazon.com/lambda/).

In [40]:
from api_serverless import api

api??

Type:        module
String form: <module 'api_serverless.api' from '/home/rehanfarooq/fsdl/fsdl-text-recognizer-2022-labs/lab07/api_serverless/api.py'>
File:        ~/fsdl/fsdl-text-recognizer-2022-labs/lab07/api_serverless/api.py
Source:
"""AWS Lambda function serving text_recognizer predictions."""
import json

from PIL import ImageStat

from text_recognizer.paragraph_text_recognizer import ParagraphTextRecognizer
import text_recognizer.util as util

model = ParagraphTextRecognizer()

Our main function here, `api.handler`, wraps `ParagraphTextRecognizer.predict`.

Effectively, `api.handler` maps HTTP requests (`event`s) in AWS's canonical format
to a format our `ParagraphTextRecognizer` understands,
then converts the text recognizer's output into something
that AWS understands.

Deploying models as web services is an exercise in taking
the Tensor-to-Tensor-mappings we work with in model development
and wrapping them so that they run in the
JSON-to-JSON-mapping world of web services.
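
The general shape of such a wrapper can be sketched as follows.
This is not the lab's `api.handler`:
the `StubRecognizer` class is a hypothetical stand-in for the real model,
and the event layout (a JSON string under `"body"`)
and the response layout (`"statusCode"` plus a JSON string under `"body"`)
are assumptions made for the example.

```python
import json


class StubRecognizer:
    """Hypothetical stand-in for the real model: returns a fixed string."""

    def predict(self, image_url: str) -> str:
        return f"text recognized in {image_url}"


model = StubRecognizer()  # created once at import time and reused across calls


def handler(event: dict, _context=None) -> dict:
    """JSON in, JSON out: unpack the request, call the model, pack the response."""
    body = event.get("body") or "{}"
    request = json.loads(body) if isinstance(body, str) else body
    image_url = request.get("image_url")
    if image_url is None:
        return {"statusCode": 400, "body": json.dumps({"message": "no image_url in request"})}
    pred = model.predict(image_url)
    return {"statusCode": 200, "body": json.dumps({"pred": pred})}


print(handler({"body": json.dumps({"image_url": "a01-077.png"})}))
print(handler({}))
```

Everything model-specific happens in the single `model.predict` call;
the rest is translation between formats.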

## Talking to a model service

Setting up a serverless function on AWS requires an account
(which requires putting down a credit card)
and configuration of permissions
(which is error-prone).

If you want to see how that process works,
check out our
["bonus notebook" on serverless deployment on AWS Lambda](https://github.com/full-stack-deep-learning/fsdl-text-recognizer-2022/blob/main/notebooks/lab99_serverless_aws.ipynb).
Heads up: it uses Docker,
which means it's not compatible with Google Colab.

So we'll skip that step and,
like Julia Child or Martha Stewart, check out
[one that was prepared earlier](https://tvtropes.org/pmwiki/pmwiki.php/Main/OneIPreparedEarlier).

The cell below sends a request
to a serverless cloud function running on the FSDL AWS account.

This request is
much like the one we sent to the API provided by `gradio`,
but here we construct and send it in Python,
using the `requests` library,
rather than operating from the command line.

When playing around with an API,
writing requests and parsing responses "by hand"
in the command line is helpful,
but once we're working on real use cases for the API,
we'll want to use higher-level libraries
with good code quality and nice integrations.

In [41]:
import json

from IPython.display import Image
import requests  # the preferred library for writing HTTP requests in Python

lambda_url = "https://3akxma777p53w57mmdika3sflu0fvazm.lambda-url.us-west-1.on.aws/"
image_url = "https://fsdl-public-assets.s3-us-west-2.amazonaws.com/paragraphs/a01-077.png"

headers = {"Content-Type": "application/json"} 
payload = json.dumps({"image_url": image_url})

response = requests.post(  # we POST the image to the URL, expecting a prediction as a response
    lambda_url, data=payload, headers=headers)
pred = response.json()["pred"]  # the response is also json

print(pred)

Image(url=image_url, width=512)

KeyError: 'pred'

Before deploying a service like this one,
it's important to check how well it handles different traffic volumes and patterns.
This process is known as _load-testing_.

For a quick tutorial on some basic tooling and a run-through of
results from load-testing the FSDL Text Recognizer on AWS Lambda, see
[this "bonus notebook" on load-testing](https://fsdl.me/loadtesting-colab).
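
To give a flavor of what a load test measures,
here is a minimal sketch that fires concurrent calls and summarizes their latencies.
It is not the tooling used in the bonus notebook:
`fake_service` is a hypothetical stand-in that just sleeps,
so the numbers say nothing about the real service.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_service(_request_id: int) -> str:
    """Hypothetical stand-in for an HTTP call to the model service."""
    time.sleep(0.01)  # pretend the service takes about 10 ms
    return "ok"


def timed_call(request_id: int) -> float:
    """Return the latency of a single call, in seconds."""
    start = time.perf_counter()
    fake_service(request_id)
    return time.perf_counter() - start


def load_test(n_requests: int, concurrency: int) -> dict:
    """Make n_requests calls with at most `concurrency` in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(n_requests)))
    return {
        "n": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": latencies[-1],
    }


print(load_test(n_requests=20, concurrency=5))
```

Replacing `fake_service` with a real request and varying `concurrency`
shows how latency changes as traffic grows.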

## Local in the front, serverless in the back

The primary "win" here
is that we don't need to run
the frontend UI server
and the backend model service in
the same place.

For example,
we can run a Gradio app locally
but send the images to the serverless function
for prediction.

Our `app_gradio` implementation supports this via the `PredictorBackend`.

In [None]:
serverless_backend = app.PredictorBackend(url=lambda_url)

Previously, our `PredictorBackend`
was just a wrapper around the `ParagraphTextRecognizer` class.

By passing a URL,
we switch to sending data elsewhere via an HTTP request.

This is done by the
`_predict_from_endpoint` method,
which runs effectively the same code we used
to talk to the model service in the cell above.
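
The switching logic can be sketched roughly as below.
This is a simplified illustration, not the actual `PredictorBackend`:
the class name `BackendSketch`, the `post` argument, and the payload layout
are all hypothetical, and the network call is injected as a plain function
so the example runs offline.

```python
from typing import Callable, Optional


class BackendSketch:
    """Hypothetical, simplified backend that predicts locally or via a URL."""

    def __init__(self, url: Optional[str] = None,
                 post: Optional[Callable[[str, dict], dict]] = None):
        if url is not None:
            if post is None:
                raise ValueError("a `post` function is required when `url` is given")
            self.url = url
            self._post = post
            self._predict = self._predict_from_endpoint
        else:
            self._predict = self._predict_locally

    def run(self, image: str) -> str:
        # callers such as a UI never learn which implementation was chosen
        return self._predict(image)

    def _predict_locally(self, image: str) -> str:
        return f"local prediction for {image}"

    def _predict_from_endpoint(self, image: str) -> str:
        response = self._post(self.url, {"image": image})
        return response["pred"]


def fake_post(url: str, payload: dict) -> dict:
    """Stand-in for an HTTP POST: pretends a remote service answered."""
    return {"pred": f"remote prediction for {payload['image']}"}


print(BackendSketch().run("img"))
print(BackendSketch(url="https://example.invalid/", post=fake_post).run("img"))
```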

In [None]:
serverless_backend._predict_from_endpoint??

The frontend doesn't care where the inference is getting done or how.

A `gradio.Interface`
just knows there's a Python function that it invokes and then 
waits for outputs from.

Here, that Python function
makes a request to the serverless backend,
rather than running the model.

Go ahead and try it out!

You won't notice a difference,
except that the machine you're running this notebook on
no longer runs the model.

In [None]:
frontend_serverless_backend = app.make_frontend(serverless_backend.run)

frontend_serverless_backend.launch(share=True)

# Serving a `gradio` app with `ngrok`

We've now got a model service and a web server
that we can stand up and scale independently,
but we're not quite done yet.

First, our URL is controlled by Gradio.

Once we leave the territory of a minimal demo,
we'll very quickly want that URL to be branded.

Relatedly,
you may have noticed messages indicating that the public URL
from Gradio is only good for 72 hours.

That means we'd have to reset our frontend
and share a new URL every few days.

For projects that are mostly intended as public demos,
you might follow the advice from those printed warnings
and use
[Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)
for free, permanent hosting.

This relieves you of the burden of keeping the frontend server running.

However, note that this requires you to use the Hugging Face Hub
as a remote for your `git` repository, alongside GitHub or GitLab.
This connection to the version control system can make for tricky integration,
e.g. the need to create a new repository for each new model.

By default, the demo is embedded inside Hugging Face,
limiting your control over the look and feel.

However, you can embed the demo in another website with
[Web Components or IFrames](https://gradio.app/sharing_your_app/#embedding-with-web-components).
You can also adapt the aesthetics and interactivity of the demo with
[custom CSS and JS](https://gradio.app/custom_CSS_and_JS/).

We will instead run the frontend server ourselves
and provide a public URL
without relying on Gradio's service.

Half of the work is already done for us:
the `gradio` frontend is already listening on a port and IP address
that is accessible locally
(on `127.0.0.1` or `localhost`, as printed below).

In [None]:
frontend_serverless_backend.local_url

So we can, for example, send `curl` requests locally,
i.e. on the same machine as the frontend,
and get responses.

In [None]:
# we send an improperly formatted request, because we just want to check for a response

!curl -X POST {frontend_serverless_backend.local_url}api/predict

Running the same command on another machine will result in an error --
`127.0.0.1` and `localhost` always mean "on this machine".

So fundamentally,
the goal is to take the frontend service
running on an IP and port that is only accessible locally
and make it accessible globally.

There are some tricky bits here --
for example, you'll want to communicate using encryption,
i.e. over HTTPS instead of HTTP --
that make doing this entirely on your own
a bit of a headache.

To avoid these issues,
we can once again use
[`ngrok`](https://ngrok.com/),
the service we used to provide access to our Label Studio instance
in the data annotation lab.

The free tier includes public URLs and secure communication with HTTPS.

However, the URL changes each time you relaunch your service,
e.g. after an outage or a version update.

The paid tier allows for branded domains,
simpler authentication with
[OAuth](https://oauth.net/),
and some basic scaling tools like load balancing.

This is what we use for the official FSDL text recognizer at
[fsdl-text-recognizer.ngrok.io](https://fsdl-text-recognizer.ngrok.io/).

To get started, let's
set up our `ngrok` credentials.

In [None]:
import os
import getpass

from pyngrok import ngrok

config_file = ngrok.conf.DEFAULT_NGROK_CONFIG_PATH
config_file_exists = os.path.exists(config_file)
config_file_contents = !cat {config_file}

auth_token_found = config_file_exists \
    and config_file_contents \
    and "authtoken" in config_file_contents[0] \
    and ": exit" not in config_file_contents  # state if interrupted

if not auth_token_found:
    print("Enter your ngrok auth token, which can be copied from https://dashboard.ngrok.com/auth")
    !ngrok authtoken {getpass.getpass()}

From there,
it's as simple as pointing
an `ngrok` tunnel
at the port associated with your frontend.

> For our purposes, ports are
"places you can listen for messages to your web service".
By separating ports,
which are identifiers within a machine,
from URLs/IPs,
which are identifiers across machines,
we can run multiple services on a single machine.
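
To make the idea of ports concrete,
here is a small sketch, separate from the lab's own code,
that starts two toy HTTP services on the same loopback address.
The helper names (`make_handler`, `start_service`, `fetch`) are invented for this example,
and passing port `0` asks the operating system to pick any free port.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(message: str):
    """Create a request handler class that always replies with `message`."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = message.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # silence per-request logging

    return Handler


def start_service(message: str) -> HTTPServer:
    """Serve `message` on 127.0.0.1 in a background thread, on an OS-chosen port."""
    server = HTTPServer(("127.0.0.1", 0), make_handler(message))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port: int) -> str:
    """Send a GET request to the service listening on `port` and return the body."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("GET", "/")
        return connection.getresponse().read().decode("utf-8")
    finally:
        connection.close()


# same machine, same IP address, two different ports
service_a = start_service("service A")
service_b = start_service("service B")

replies = {service.server_port: fetch(service.server_port) for service in (service_a, service_b)}
print(replies)

for service in (service_a, service_b):
    service.shutdown()
    service.server_close()
```

Both services share the address `127.0.0.1`,
and only the port number determines which one answers a given request.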

In [None]:
TEXT_RECOGNIZER_PORT = frontend_serverless_backend.server_port

https_tunnel = ngrok.connect(TEXT_RECOGNIZER_PORT, bind_tls=True)
print(https_tunnel)

Head to the printed `ngrok.io` URL from any device --
e.g. a mobile phone --
to check out your shiny new ML-powered application UI
with serverless backend.

Running a web service out of a Jupyter notebook is not recommended.

`gradio` and `ngrok`
can be run from the command line.

If you're running the lab locally,
just define the `TEXT_RECOGNIZER_PORT`
and `LAMBDA_URL` environment variables
and then run

```bash
python app_gradio/app.py --model_url $LAMBDA_URL --model_port $TEXT_RECOGNIZER_PORT
```

in one terminal
and, in a separate terminal,
run
```bash
ngrok http $TEXT_RECOGNIZER_PORT
```

and navigate to the printed URL.
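
If you script the launch instead of typing it,
it can help to check the environment variables before starting anything.
The sketch below is one way to do that, not part of the lab's code:
`build_frontend_command` is a hypothetical helper,
the flag names are copied from the frontend command above,
and the URL and port in `example_env` are placeholder values.

```python
import os
import shlex


def build_frontend_command(environ=None):
    """Build the argument list for the frontend from LAMBDA_URL and TEXT_RECOGNIZER_PORT."""
    environ = os.environ if environ is None else environ

    required = ("LAMBDA_URL", "TEXT_RECOGNIZER_PORT")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")

    port = int(environ["TEXT_RECOGNIZER_PORT"])  # raises ValueError if not an integer
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")

    return [
        "python", "app_gradio/app.py",
        "--model_url", environ["LAMBDA_URL"],
        "--model_port", str(port),
    ]


# placeholder values, for illustration only
example_env = {"LAMBDA_URL": "https://example.invalid/", "TEXT_RECOGNIZER_PORT": "7860"}
command = build_frontend_command(example_env)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run` as-is,
which avoids shell quoting problems if the URL contains special characters.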

## Launching a server on a cloud instance

We are almost, but not quite,
at the point of having a reasonably professional web service.

The last missing piece is that our server is running
either on Colab,
which has short uptimes and is not intended for serving,
or on our own personal machine,
which is also likely a few
[nines](https://en.wikipedia.org/wiki/High_availability#Percentage_calculation) short of an uptime SLA.
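
To get a feel for what a given number of nines allows,
the arithmetic can be sketched as follows.
The function name is made up for this example,
and a year is taken to be 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 99.9% up -> 0.1% down
    return MINUTES_PER_YEAR * unavailability


for nines in range(1, 6):
    availability = 100 * (1 - 10 ** -nines)
    print(
        f"{nines} nines ({availability:.3f}% uptime): "
        f"{allowed_downtime_minutes(nines):,.1f} minutes of downtime per year"
    )
```

Three nines, for instance, leaves a budget of under nine hours of downtime per year,
which a laptop that sleeps overnight uses up on the first day.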

We want to instead run this on a dedicated server,
and the simplest way to do so is to spin up a machine in a cloud provider.

[Elastic Compute Cloud](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html)
(aka EC2)
is the service for renting such machines in AWS,
our chosen cloud provider.

To get the server going on another machine,
we'll need to `git clone` our library,
`pip install` our `prod` requirements,
and then finally run `ngrok` and `app_gradio/app.py`.

We can make that process slightly easier
by incorporating it into a `Dockerfile`
and building a container image.

In [1]:
!cat app_gradio/Dockerfile

cat: app_gradio/Dockerfile: No such file or directory


We can then store the container image in a registry, like
[Docker Hub](https://hub.docker.com/)
or the container image registry built into our cloud provider, like AWS's
[Elastic Container Registry](https://aws.amazon.com/ecr/).

Then, setup just means pulling the image down onto the machine
we want to run our server from and executing a `docker run` command.