In [4]:
import os
import sys
import subprocess


# !!!!!! ATTENTION !!!!!!!!!
# Use a venv for your Jupyter notebook; otherwise these installations may be blocked
%pip install -U requests


Note: you may need to restart the kernel to use updated packages.


### <u> Now we can start our docker for the "service" </u> 
To be honest, it's not a real service. In reality I'd do this 
* in Kubernetes 
* or at least in docker-compose

But to keep compatibility issues minimal, I'll strictly follow the exercise instructions and just map a docker port to the host (port 8123->8123).
This allows us to send requests to localhost:8123 and reach the server API that's running in the docker container. 
(Again, plain docker isn't the best way to do this. But we're talking fictional here)

If the notebook doesn't allow execution, you can always paste the commands below into your shell.

#### NOTE: It's much more reliable to paste the commands below into your shell directly. Don't trust Jupyter's execution: docker often lacks the necessary permissions there


In [5]:
!docker pull sebastianfchr/appl_tfdocker:latest
!docker run -d --gpus=all -v $(pwd):/code -p 8123:8123 -w /code sebastianfchr/appl_tfdocker:latest -- uvicorn serverapi:app --host 0.0.0.0 --port 8123
# or if you use nerdctl:
# nerdctl run -d --gpus=all -v $(pwd):/code -p 8123:8123 -w /code sebastianfchr/appl_tfdocker:latest -- uvicorn serverapi:app --host 0.0.0.0 --port 8123


latest: Pulling from sebastianfchr/appl_tfdocker
Digest: sha256:d91e8857a1cc7a240a0e38f8362911f19b4f53d314303939b7510299d69ac366
Status: Image is up to date for sebastianfchr/appl_tfdocker:latest
docker.io/sebastianfchr/appl_tfdocker:latest
5aad1df881ef2b87bcd127961f9493391892c303e611a5722a64f96605898a23



### Docker image
Above, we use my custom CUDA-enabled TensorFlow docker image, which I use for such deployments. It is hosted on the Docker container registry as `docker.io/sebastianfchr/appl_tfdocker:latest`

It takes some time to download, since it's *cuda enabled*. (It runs tf2.7 and the compatible cuda/cudnn version)

Generally, I'm a fan of lightweight docker images. But in this application, where we leverage the full potential of GPU-tensorflow, we have to go with this one.



### Sending requests

All we need here is the Python package `requests`. I've created two endpoints on the server for
* single sentence prediction
* prediction of "chunks" of sentences 

### API
I made two api-endpoints:
* `server-url/predict_sentence_batch/` for the chunked version
* `server-url/predict_sentence/` for the single version
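
As a rough sketch, the request bodies for the two endpoints could be built like this. The batched format (`sentences` / `sentiments` as parallel lists) is the one used in the request cell of this notebook. The singular key names for `predict_sentence/` are only an assumption, and both helper functions are hypothetical, so check `serverapi.py` before relying on them.

```python
import json


def build_batch_payload(sentences, sentiments):
    """Body for predict_sentence_batch/: two parallel lists of equal length."""
    if len(sentences) != len(sentiments):
        raise ValueError("need exactly one sentiment per sentence")
    return {"sentences": list(sentences), "sentiments": list(sentiments)}


def build_single_payload(sentence, sentiment):
    """Body for predict_sentence/.

    ASSUMPTION: the singular key names are a guess, not taken from the server code.
    """
    return {"sentence": sentence, "sentiment": sentiment}


batch_body = build_batch_payload(
    ["what a lovely day", "this is awful"],
    ["positive", "negative"],
)
single_body = build_single_payload("what a lovely day", "positive")

print(json.dumps(batch_body, indent=2))
print(json.dumps(single_body, indent=2))
```

Either dictionary can then be passed to `requests.post(url, json=body)`.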


Since all the functionality lives in the docker container, we merely need to be able to send requests from Python. Let's predict some sentiments then.


### <center> Important: Please make sure the docker is up and running. Its API-server takes a bit to load </center>
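
One way to avoid sending requests too early is to poll the server until it answers. The helper below is a minimal sketch using only the standard library. The function name, the root URL and the timeouts are assumptions. It treats any HTTP answer, even a 404, as "the server is up", because a docker port mapping can accept TCP connections before the application inside is ready.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8123/", timeout_s=120.0, poll_interval_s=2.0):
    """Return True once `url` yields any HTTP response, False if `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            urllib.request.urlopen(url, timeout=poll_interval_s).close()
            return True
        except urllib.error.HTTPError:
            # The server answered with an error status (e.g. 404): it is running.
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, reset or timed out: not ready yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval_s)
```

Calling `wait_for_server()` before the first request and checking its return value gives a clear message instead of a `ConnectionError` traceback.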

In [6]:
import requests

# docker is reachable through this mapping
url_batch = "http://0.0.0.0:8123/predict_sentence_batch/"

sentences = [
    "going down the beautiful road, I met a horrible rabbit",
    "Sadly, the guys from HUK gave me the wrong weights, and I had to do specification training myself",
    "I really hope that despite the whole python compatibility hell you could happily execute everything until here",
    "I really hope that despite the whole python compatibility hell you could happily execute everything until here"
]
sentiments = [
    "positive",
    "negative",
    "negative",
    "positive"
]

# POST data for the batched request; `json=` serializes the body and sets the Content-Type header
data = {"sentences": sentences, "sentiments": sentiments}
response = requests.post(url_batch, json=data)

try:
    extracted_sentence_fragments = response.json()['data']
except (ValueError, KeyError):
    # ValueError: the body is not JSON; KeyError: the JSON has no 'data' field
    print("response does not seem to contain the expected json")
else:
    print("sentences and extractions: \n ")
    for orig_sentence, sentiment, extr_fragment in zip(sentences, sentiments, extracted_sentence_fragments):
        print("`{}` extract: `{}`: \n ==> {}\n".format(orig_sentence, sentiment, extr_fragment))



sentences and extractions: 
 
`going down the beautiful road, I met a horrible rabbit` extract: `positive`: 
 ==>  beautiful

`Sadly, the guys from HUK gave me the wrong weights, and I had to do specification training myself` extract: `negative`: 
 ==>  sadly,

`I really hope that despite the whole python compatibility hell you could happily execute everything until here` extract: `negative`: 
 ==>  hell

`I really hope that despite the whole python compatibility hell you could happily execute everything until here` extract: `positive`: 
 ==>  hope



### <u> What about speed?</u>

Very good question: this is where it gets interesting. There are some relevant comparisons to make, since we're dealing with a server that takes requests, but we also want efficient execution. First, some facts:

* GPUs are latency-hiding machines that want "compute-bound", parallelizable programs. ML: 👌

* GPUs have an overhead for kernel launches. So, launch each kernel on as many predictions as possible (or as many as our GPU can handle)

* Server requests have a round-trip time. That adds overhead to every request we make from the client

* The above point means that we profit from chunking predictions within requests. But: Longer requests take a bit longer to send
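
To make the trade-off concrete, here is a back-of-the-envelope cost model. It is only a sketch: the formula is a simplification, and every number in it is an assumed, illustrative value, not a measurement from this setup.

```python
import math


def total_time_s(n_sentences, chunk_size, round_trip_s, launch_overhead_s, per_sentence_s):
    """Simplified model: each request pays one round trip and one launch overhead,
    and each sentence adds a fixed cost for transfer and compute."""
    n_requests = math.ceil(n_sentences / chunk_size)
    return n_requests * (round_trip_s + launch_overhead_s) + n_sentences * per_sentence_s


# ASSUMED numbers, purely illustrative
for chunk_size in (1, 8, 64):
    t = total_time_s(
        n_sentences=512,
        chunk_size=chunk_size,
        round_trip_s=0.005,
        launch_overhead_s=0.010,
        per_sentence_s=0.0005,
    )
    print(f"chunk size {chunk_size:>3}: {t:.3f} s")
```

In this model the fixed per-request costs shrink with larger chunks, while the per-sentence term stays the same. That per-sentence term is also where the slightly longer send time of bigger requests shows up.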


### A word about timing
When we want to predict a number of sentences, 
we can investigate the time for the following modes.

We differentiate between predictions that go through the API and those that don't:

* single prediction requests via the API

* chunked prediction requests via the API

* single direct prediction on tensorflow

* chunked direct prediction on tensorflow
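
Each of these modes can be timed with the same small harness. The sketch below is not the code from `benchmark.py`; the function name is hypothetical and the workload is a stand-in. To time a real mode, pass a callable that sends one request or runs one prediction.

```python
import statistics
import time


def median_duration_s(fn, n_repeats=5):
    """Call fn() n_repeats times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload; replace with e.g. a function that posts one chunked request
modes = {
    "stand-in workload": lambda: sum(range(100_000)),
}

for name, fn in modes.items():
    print(f"{name}: {median_duration_s(fn) * 1e3:.2f} ms")
```

The median is used instead of the mean so that a single slow first call, for example while the GPU warms up, does not dominate the result.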



<!-- ![title]("./benchmarks.png") -->

<center><img src="./images/benchmarks.png" width=80% height=80% /></center>

### <u> Why this is relevant </u>

First, these numbers aren't surprising. But they tell the MLOps engineer something very interesting:
* requests are best chunked before being passed to tensorflow (and cuda). I'd love to go into depth here
* requests are best chunked before being sent to the API, to avoid unnecessary round-trip times
* regarding the above point, we see a small influence of message size even if we make chunked requests (orange-right) 

###  What the developers can do

In a real setting, each client will send small requests.

However, the engineer can make some design improvements in case our servers are under high load:
* aggregate incoming API requests from different clients (over a small timeframe)
* predict them in a chunked fashion (fast!)
* map each prediction back to its client, and send it back
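
The three steps above can be sketched as a tiny micro-batcher. This is an illustration, not part of the server code: the class, its methods and the stand-in model are all hypothetical, and a real server would call `flush()` from a timer or once enough requests have queued up.

```python
from collections import defaultdict


class MicroBatcher:
    """Collects requests from several clients and predicts them in one chunk."""

    def __init__(self, predict_batch):
        # predict_batch(sentences, sentiments) -> list of predictions, same order
        self._predict_batch = predict_batch
        self._pending = []  # entries: (client_id, sentence, sentiment)

    def submit(self, client_id, sentence, sentiment):
        self._pending.append((client_id, sentence, sentiment))

    def flush(self):
        """Predict everything queued so far and return {client_id: [predictions]}."""
        if not self._pending:
            return {}
        pending, self._pending = self._pending, []
        sentences = [sentence for _, sentence, _ in pending]
        sentiments = [sentiment for _, _, sentiment in pending]
        predictions = self._predict_batch(sentences, sentiments)
        results = defaultdict(list)
        for (client_id, _, _), prediction in zip(pending, predictions):
            results[client_id].append(prediction)
        return dict(results)


def fake_predict_batch(sentences, sentiments):
    # Stand-in for the model: returns the first word of each sentence
    return [sentence.split()[0] for sentence in sentences]


batcher = MicroBatcher(fake_predict_batch)
batcher.submit("client-a", "lovely weather today", "positive")
batcher.submit("client-b", "terrible traffic again", "negative")
batcher.submit("client-a", "great coffee", "positive")
results = batcher.flush()
print(results)
```

The model is called once for all three sentences, and the order of the pending list is what lets each prediction find its way back to the right client.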




#### Note: To run the rest, we need to install requirements:<br> since the Challenge uses `tensorflow 2.7`, we must have a Python version between 3.6 and 3.9. Newer ones don't run this tf! 

You need to have requirements_strict.txt installed to be exactly congruent with the setup demanded in the challenge (i.e. tf2.7 uses python3.6-3.9). On any other Python version, the cell below installs requirements_relaxed.txt instead. No guarantee though!


In [8]:
import os
import sys

(py_major, py_minor) = sys.version_info[0:2]

if not (py_major == 3 and (6 <= py_minor <= 9)):
    # raise Exception('Requires python 3.6-3.9, else you cannot run the code provided by HUK. '
    # 'You can comment out this exception, and see whether it works with a newer setup. \n \
    #  The specification tensorflow==2.7.0 is only compatible with 3.6-3.9. Similar things hold for transformers ')
    %pip install -r requirements/requirements_relaxed.txt

else: 
    %pip install -r requirements/requirements_strict.txt

import pandas

Note: you may need to restart the kernel to use updated packages.


### Generate your local benchmarks
You can now generate your own benchmarks. The code below will generate a `benchmark_tf_and_api_calls.png` in this folder.


### Note: Since the docker is running a tf-gpu instance AND the benchmarking script runs another one, this line will fail if your machine has less than ~4 GB of GPU VRAM (~2 GB reserved in docker, ~2 GB reserved here). 

In [3]:
import benchmark

2025-03-30 19:05:58.744984: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-30 19:05:58.753686: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743354358.763692   72815 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743354358.767071   72815 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1743354358.774991   72815 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=8123): Max retries exceeded with url: /predict_sentence_batch/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f55d6995070>: Failed to establish a new connection: [Errno 111] Connection refused'))

## Python tests


### Note: Since the docker is running a tf-gpu instance AND the tests run another one, this line will fail if your machine has less than ~4 GB of GPU VRAM (~2 GB reserved in docker, ~2 GB reserved here). 

In [7]:
# To run all my tests
# now, let's hope pytest selects the correct version that you're using in this notebook
!pytest

platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/seb/Desktop/CodingChallenge_MLE
collected 3 items

test_components.py F..                                                   [100%]

________ TestClassPredictor.test_batched_sentence_extraction_vs_manual _________

self = <test_components.TestClassPredictor object at 0x7295e1e880b0>

    def test_batched_sentence_extraction_vs_manual(self):
        """ Sanity check for RobertaPredictor.predict_sentence_batch(). Tests whether it produces
        the same sentence-fragment as manual tokenization, prediction, and decoding """

        num_elements_tested = 10

        tokenizer = tok

### There are two main things that I identified as testworthy
* correct tf-model interfacing in deployment
* correct API server behavior

#### <u>1. Check for correct tf-model-interfacing and text-prediction: <br></u> `test_batched_sentence_extraction_vs_manual`
The text goes through quite some processing before the data gets into the kernel-call. 
* Text preparation 
* Text tokenization 
* Masking
* Prediction, tokenized output
* Decoding of predicted tokens. This gives us the sentence-fragment

The functions used above,

`predict_sentence_batch(<<sentence>>, <<sentiment>>)` 

`predict_sentence(<<sentence>>, <<sentiment>>)` 

fully automate all of these stages. To test this code, I compared
* a completely manually written version 
* the automated versions above
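
The shape of such a comparison test can be sketched with a toy pipeline. Everything below is hypothetical: the lexicon lookup only stands in for the real tokenizer and model, and the function names are made up for illustration. Only the structure (run the stages by hand, then compare with the automated call) mirrors the idea.

```python
# Toy stand-ins for the real stages (hypothetical, for illustration only)
LEXICON = {
    "positive": {"beautiful", "hope"},
    "negative": {"horrible", "hell"},
}


def tokenize(sentence):
    return sentence.lower().split()


def select_tokens(tokens, sentiment):
    # Toy "model": keep the tokens listed for this sentiment
    return [token for token in tokens if token in LEXICON[sentiment]]


def decode(tokens):
    return " ".join(tokens)


def toy_predict_sentence_batch(sentences, sentiments):
    """Automated version: runs all stages for every sentence."""
    return [
        decode(select_tokens(tokenize(sentence), sentiment))
        for sentence, sentiment in zip(sentences, sentiments)
    ]


def test_batched_vs_manual():
    sentences = ["a beautiful road and a horrible rabbit", "I hope this hell ends"]
    sentiments = ["positive", "negative"]

    # Manual version: every stage written out step by step
    manual = []
    for sentence, sentiment in zip(sentences, sentiments):
        tokens = tokenize(sentence)
        kept = select_tokens(tokens, sentiment)
        manual.append(decode(kept))

    assert toy_predict_sentence_batch(sentences, sentiments) == manual


test_batched_vs_manual()
```

With the real model, the manual side would call the tokenizer and the network directly, so a mismatch points at the glue code in between.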


#### <u> 2. Check for correct API Behavior: </u> 


### <u> Train your own weights </u> 

After refactoring a lot, I only left the training loop itself intact. Check out the following program to train your own weights. 


I found that this line does not run unless you comply with requirements_strict. <br>It looks like huggingface's transformers relied on tf-calls that changed in later versions (that is, later than about tf2.7)

In [5]:
!python3 train.py

2025-03-30 19:07:41.050326: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-30 19:07:41.058923: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743354461.068529   73402 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743354461.071742   73402 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1743354461.080146   73402 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 


### <u> Software-Design and Refactoring </u>

I turned this into the 'roberta' package. The classes below let me write more understandable code and encapsulate the functionality in a meaningful way.


In [6]:
from roberta import RobertaPredictor, TokenEncoder 
