In [67]:
import os
import subprocess
import sys

# Define the virtual environment name
venv_name = ".seb_huk_venv"

# Create the virtual environment when run for the first time
if not os.path.exists(venv_name):
    print(f"Creating virtual environment: {venv_name}")
    subprocess.run([sys.executable, "-m", "venv", venv_name], check=True)

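# Locate the activation script (Windows vs. Linux/macOS).
# Note: this only prints the path. The packages below are installed by calling the venv's pip directly.
if sys.platform == "win32":
    activate_script = os.path.join(venv_name, "Scripts", "activate")
else:
    activate_script = os.path.join(venv_name, "bin", "activate")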

print(f"Activating virtual environment: {activate_script}")

# Install dependencies, so the following code can run!
pip_executable = os.path.join(venv_name, "bin", "pip") if sys.platform != "win32" else os.path.join(venv_name, "Scripts", "pip")
subprocess.run([pip_executable, "install", "requests"], check=True)



Activating virtual environment: .seb_huk_venv/bin/activate


CompletedProcess(args=['.seb_huk_venv/bin/pip', 'install', 'requests'], returncode=0)

### <u> Now we can start our docker for the "service" </u> 
To be honest, it's no real service. In reality I'd do this 
* in Kubernetes 
* or at least in docker-compose

But to keep compatibility issues minimal, I'll strictly follow the exercise instructions and just map a docker port to the host (port 8123->8123).
This allows us to send requests to localhost:8123 and reach the server API that's running in the docker container. 
(Again, plain docker isn't the best way to do this. But this is a fictional setting.)

If the notebook doesn't allow execution, you can always paste the commands below into your shell.

#### NOTE: It's much more reliable to paste the commands below into your shell directly. Don't trust Jupyter's execution (rights!)

In [17]:
!docker run --gpus=all -v $(pwd):/code -p 8123:8123 -w /code -it sebastianfchr/appl_tfdocker:latest -- uvicorn serverapi:app --host 0.0.0.0 --port 8123

# or, if you use nerdctl like me, uncomment:
# !nerdctl run --gpus=all -v $(pwd):/code -p 8123:8123 -w /code -it sebastianfchr/appl_tfdocker:latest -- uvicorn serverapi:app --host 0.0.0.0 --port 8123


docker: Error response from daemon: driver failed programming external connectivity on endpoint awesome_jepsen (37d66b61813dd5de883e11466e9c4665abe221afb50954974a88954e43c818d8): Bind for 0.0.0.0:8123 failed: port is already allocated.
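
The error above means that something on the host already holds port 8123, most likely a container from an earlier run of this cell. `docker ps` lists it, and `docker stop <container>` frees the port. As a rough pre-check before starting the container, here is a minimal sketch that tries to bind the port from Python. The helper name `port_is_free` is my own and not part of the challenge code, and the check is only a hint: depending on the docker network setup, a mapped port may not show up as a bound socket.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if host:port can be bound, i.e. nothing else is listening there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use"
            return False
    return True


if __name__ == "__main__":
    # 8123 is the port mapped in the docker command above
    print("port 8123 free:", port_is_free(8123))
```

If this prints `False`, stop the old container first and then start the new one.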



### Docker image
Above, we use my custom CUDA-enabled tensorflow image, which I use for such deployments. It is hosted on the docker container registry as `docker.io/sebastianfchr/appl_tfdocker:latest`.

It takes some time to download, since it's *cuda enabled*. (It runs tf2.7 and the compatible cuda/cudnn versions.)

Generally, I'm a fan of lightweight docker images. But in this application, where we leverage the full potential of GPU-tensorflow, we have to go with this one.



### Sending requests

All we need here is the Python package "requests". I've created two endpoints on the server for
* single sentence prediction
* prediction of "chunks" of sentences 

### API
I made two api-endpoints:
* `server-url/predict_sentence_batch/` for the chunked version
* `server-url/predict_sentence/` for the single version


Since all the functionality lives in the docker container, we merely need to be able to send requests from Python. Let's predict some sentiments then.

In [None]:
import requests

# the docker container is reachable through the port mapping set up above
url_batch = "http://localhost:8123/predict_sentence_batch/"

sentences = [
    "going down the beautiful road, I met a horrible rabbit",
    "Sadly, the guys from HUK gave me the wrong weights, and I had to do specification training myself",
    "I really hope that despite the whole python compatibility hell you could happily execute everything until here",
    "I really hope that despite the whole python compatibility hell you could happily execute everything until here"
]
sentiments = [
    "positive",
    "negative",
    "negative",
    "positive"
]

# POST data for the batched request; `json=` serializes the body and sets the Content-Type header
data = {"sentences": sentences, "sentiments": sentiments}
response = requests.post(url_batch, json=data)

try:
    extracted_sentence_fragments = response.json()['data']
    print("sentences and extractions: \n ")
    for orig_sentence, sentiment, extr_fragment in zip(sentences, sentiments, extracted_sentence_fragments):
        print("`{}` extract: `{}`: \n ==> {}\n".format(orig_sentence, sentiment, extr_fragment))

except (ValueError, KeyError):
    # ValueError: the body is not JSON; KeyError: the JSON has no 'data' field
    print("response does not seem to contain the expected json")



sentences and extractions: 
 
`going down the beautiful road, I met a horrible rabbit` extract: `positive`: 
 ==>  beautiful

`Sadly, the guys from HUK gave me the wrong weights, and I had to do specification training myself` extract: `negative`: 
 ==>  sadly,

`I really hope that despite the whole python compatibility hell you could happily execute everything until here` extract: `negative`: 
 ==>  hell

`I really hope that despite the whole python compatibility hell you could happily execute everything until here` extract: `positive`: 
 ==>  hope



### <u> What about speed?</u>

Very good question, and this is where it gets interesting. We're dealing with a server that takes requests, but we also want efficient execution, so there are some relevant comparisons to make. First, some facts:

* GPUs are latency-hiding machines that want "compute-bound", parallelizable programs. 👌

* GPUs have an overhead for kernel launches. So each kernel should be launched on as many predictions as possible (or as many as our GPU can handle).

* Server requests have a round-trip time. That adds overhead to every request we make from the client.

* The above point means that we profit from chunking predictions within requests. But: longer requests take a bit longer to send.


### A word about timing
When we want to predict a number of sentences, 
we can investigate the time for the following modes.

We differentiate between predictions that go through the API and those that don't:

* single prediction requests through the API

* chunked prediction requests through the API

* single direct predictions on tensorflow

* chunked direct predictions on tensorflow
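
To make the modes concrete, here is a minimal sketch of how such a comparison could be timed. It is not the code from `benchmark.py`: `chunked`, `time_mode` and `fake_predict` are names I made up for illustration, and `fake_predict` is only a stand-in that sleeps for a fixed overhead plus a small per-sentence cost. For a real measurement you would pass in a function that calls the API or tensorflow instead.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def time_mode(predict: Callable[[List[str]], List[str]],
              sentences: Sequence[str], chunk_size: int) -> float:
    """Return the wall-clock seconds needed to predict all sentences chunk by chunk."""
    start = time.perf_counter()
    for chunk in chunked(sentences, chunk_size):
        predict(chunk)
    return time.perf_counter() - start


def fake_predict(batch: List[str]) -> List[str]:
    """Stand-in for a model or API call: fixed overhead plus a small per-item cost."""
    time.sleep(0.002 + 0.0001 * len(batch))
    return [sentence.upper() for sentence in batch]


if __name__ == "__main__":
    data = ["some sentence"] * 64
    print("single :", time_mode(fake_predict, data, chunk_size=1))
    print("chunked:", time_mode(fake_predict, data, chunk_size=32))
```

With `chunk_size=1` the fixed overhead is paid once per sentence, with larger chunks only once per chunk. That is the effect the four modes are meant to expose.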



<!-- ![title]("./benchmarks.png") -->

<center><img src="benchmarks.png" width=80% height=80% /></center>

### <u> Why this is relevant </u>

First, these numbers aren't surprising. But they tell the MLOps engineer something very interesting:
* predictions are best chunked before being handed to tensorflow (and cuda). I'd love to go into depth here
* requests are best chunked before being sent to the API, to avoid unnecessary round-trip times
* regarding the above point, we see a small influence of message size even if we make chunked requests (orange, right) 

### What the developers can do

In a real setting, there will be small requests per client.

However, the engineer can make some design improvements in case our servers are under high load:
* aggregate incoming API requests from different clients (over a small timeframe)
* predict them in a chunked fashion (fast!)
* map each prediction back to its client, and send it back
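
The three steps above can be sketched with `asyncio`. This is a minimal illustration under my own assumptions, not code from `serverapi`: the class `MicroBatcher`, its parameters and the `fake_predict_batch` stand-in are hypothetical. A real server would also run the model call outside the event loop, which I leave out here to keep the sketch short.

```python
import asyncio
from typing import Callable, List


class MicroBatcher:
    """Collect requests for a short window, predict them as one chunk, fan the results out."""

    def __init__(self, predict_batch: Callable[[List[str]], List[str]],
                 max_batch: int = 32, max_wait_s: float = 0.01) -> None:
        # Create the batcher inside a running event loop (see main below)
        self._predict_batch = predict_batch
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, sentence: str) -> str:
        """Called once per client request; returns when its batch has been predicted."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((sentence, future))
        return await future

    async def run(self) -> None:
        """Worker loop: aggregate, predict in one chunk, map each result to its caller."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # wait for the first request
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break  # window is over, predict what we have
            try:
                results = self._predict_batch([sentence for sentence, _ in batch])
            except Exception as exc:
                for _, future in batch:
                    future.set_exception(exc)  # report the failure to every waiting caller
                continue
            for (_, future), result in zip(batch, results):
                future.set_result(result)


def fake_predict_batch(sentences: List[str]) -> List[str]:
    """Stand-in for the chunked model call."""
    return [sentence.upper() for sentence in sentences]


async def main() -> List[str]:
    batcher = MicroBatcher(fake_predict_batch, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(f"sentence {i}") for i in range(5)))
    worker.cancel()
    return list(results)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The trade-off sits in `max_wait_s`: a longer window gives larger chunks and better GPU utilization, but every client waits up to that long before its prediction even starts.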





### Generate your local benchmarks
You can also generate your own! (You need to have the packages from requirements.txt installed: just try to run the first code section.)
This will generate a `benchmark_tf_and_api_calls.png` in this folder.

#### Note: since the challenge uses `tensorflow 2.7`, we need a Python version between 3.7 and 3.9. Newer ones don't run this tf! 
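
A quick way to see whether the current interpreter fits is a small version check. This is a minimal sketch: the bounds 3.7 to 3.9 are my assumption for `tensorflow 2.7`, and `python_supported` is a made-up helper name. Since the virtual environment is created from `sys.executable`, checking the notebook's interpreter also tells us what the environment will run.

```python
import sys

# Interpreter range assumed for tensorflow 2.7 (inclusive bounds)
MIN_PY = (3, 7)
MAX_PY = (3, 9)


def python_supported(version=None) -> bool:
    """Return True if the (major, minor) version lies inside the assumed range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PY <= major_minor <= MAX_PY


if __name__ == "__main__":
    if not python_supported():
        print("Python {}.{} is outside the assumed range for tensorflow 2.7".format(*sys.version_info[:2]))
```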

In [None]:
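# Use the interpreter inside the virtual environment, not the system-wide python3
python_executable = os.path.join(venv_name, "Scripts" if sys.platform == "win32" else "bin", "python")
subprocess.run([python_executable, "-m", "pip", "install", "--upgrade", "pip"], check=True)
subprocess.run([python_executable, "-m", "pip", "install", "-r", "requirements.txt"], check=True)
subprocess.run([python_executable, "--version"], check=True)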


In [49]:
!python3 benchmark.py

Traceback (most recent call last):
  File "/home/seb/Desktop/CodingChallenge_MLE/benchmark.py", line 3, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
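
The traceback above is most likely caused by the `!` shell picking up the system-wide `python3`, which doesn't know the packages installed into the virtual environment. The environment is never really activated for these shell calls, so a more robust way is to call its interpreter directly. Here is a minimal sketch: `venv_python` and `run_in_venv` are helper names I made up, and the usage lines assume the environment and files from this notebook exist.

```python
import os
import subprocess
import sys


def venv_python(venv_dir: str) -> str:
    """Return the path of the Python interpreter inside a virtual environment."""
    if sys.platform == "win32":
        return os.path.join(venv_dir, "Scripts", "python.exe")
    return os.path.join(venv_dir, "bin", "python")


def run_in_venv(venv_dir: str, *args: str) -> subprocess.CompletedProcess:
    """Run `python <args>` with the venv's interpreter, so its site-packages are used."""
    return subprocess.run([venv_python(venv_dir), *args], check=True)


# Usage (assumes the environment and benchmark.py from this notebook exist):
# run_in_venv(".seb_huk_venv", "benchmark.py")
# run_in_venv(".seb_huk_venv", "-m", "pytest")  # only works if pytest is installed in the venv
```

Running a module with `-m` through the environment's interpreter also avoids the problem of console scripts that are not on the `PATH`.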


## Python tests

In [55]:
# To run all my tests
!pytest

/bin/bash: line 1: pytest: command not found


### There are two main things that I identified as testworthy
* correct tf-model interfacing in deployment
* correct API server behavior

#### <u>1. Check for correct tf-model-interfacing and text-prediction: <br></u> `test_batched_sentence_extraction_vs_manual`
The text goes through quite some processing before the data gets into the kernel-call. 
* Text preparation 
* Text tokenization 
* Masking
* Prediction, tokenized output
* Decoding of predicted tokens. This gives us the sentence-fragment

The functions used above,

`predict_sentence_batch(<<sentence>>, <<sentiment>>)` 

`predict_sentence(<<sentence>>, <<sentiment>>)` 

fully automate this, which means they go through all of the above stages. To test this code, I compared
* a completely manually written version 
* the automated versions above
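
The shape of such a comparison test can be sketched as follows. This is not the real test: all three functions are toy stand-ins I wrote so the example runs without tensorflow. They only borrow the names and argument order from above, and the "model" simply returns the first word of the sentence.

```python
from typing import List


def predict_sentence(sentence: str, sentiment: str) -> str:
    """Toy stand-in for the automated single prediction."""
    return sentence.split()[0].lower()


def predict_sentence_batch(sentences: List[str], sentiments: List[str]) -> List[str]:
    """Toy stand-in for the automated chunked prediction."""
    return [predict_sentence(s, t) for s, t in zip(sentences, sentiments)]


def manual_pipeline(sentence: str, sentiment: str) -> str:
    """Toy stand-in for the hand-written stages: prepare, tokenize, pick, decode."""
    prepared = sentence.strip().lower()   # text preparation
    tokens = prepared.split()             # tokenization
    return tokens[0]                      # "prediction" and decoding


def test_batched_sentence_extraction_vs_manual() -> None:
    sentences = ["Sadly it rained", "Beautiful road ahead"]
    sentiments = ["negative", "positive"]
    manual = [manual_pipeline(s, t) for s, t in zip(sentences, sentiments)]
    # Both automated entry points have to agree with the manual version
    assert predict_sentence_batch(sentences, sentiments) == manual
    assert [predict_sentence(s, t) for s, t in zip(sentences, sentiments)] == manual
```

In the real test, the manual version would walk through the stages listed above one by one, and the comparison would run on the decoded sentence fragments.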


#### <u> 2. Check for correct API Behavior: </u> 
