# Running MLServer from Python Files

While most use cases tackled with MLServer can be completed via the command line, others might 
require the flexibility that comes from having a Python file that can be run from anywhere. That's 
what we'll learn about in this tutorial. We'll first work through an example with MusicGen 
and then move on to multi-model serving with different audio models.

## Dependencies

In [None]:
!pip install mlserver torch transformers audiocraft accelerate diffusers mlserver-huggingface

## Creating A Script

In [2]:
%%writefile ./model_4_tutorial/my_kul_model.py

from mlserver.types import InferenceResponse, InferenceRequest, ResponseOutput
from mlserver import MLServer, Settings, ModelSettings, MLModel
from audiocraft.models import MusicGen
import asyncio

class MusicGenServer(MLModel):

    async def load(self):
        self.model = MusicGen.get_pretrained('small', device="cuda")

    async def predict(self, request: InferenceRequest) -> InferenceResponse:
        
        prompts = request.inputs[0].data["prompts"]
        seconds = request.inputs[0].data["duration"]
        duration = 5 if seconds < 2 else seconds

        self.model.set_generation_params(duration=duration)
        wav = self.model.generate(prompts, progress=True)
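        response_output = ResponseOutput(
            name="new_music",
            shape=list(wav[0].shape),
            datatype="FP32",  # the Open Inference Protocol name for 32-bit floats
            data=wav[0, 0].cpu().tolist(),
        )
        # self.name is the name the model was registered with ('awesome_model' below)
        return InferenceResponse(model_name=self.name, outputs=[response_output])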

async def main():
    settings = Settings(debug=True)
    my_server = MLServer(settings=settings)
    music_generator = ModelSettings(name='awesome_model', implementation=MusicGenServer)
    await my_server.start(models_settings=[music_generator])

if __name__ == '__main__':
    asyncio.run(main())

Overwriting ./model_4_tutorial/my_kul_model.py


We can run the file above from the terminal with the following command. Please note that the server 
keeps running until it is stopped, so if you start it from this notebook the cell will block and you 
will need to send the test request from the terminal, another notebook, or elsewhere.

In [29]:
# !python model_4_tutorial/my_kul_model.py

To test our first microservice, we can send a [POST request](https://rapidapi.com/blog/api-glossary/post/) 
in the following format.

In [5]:
import requests

In [3]:
inference_request = {
    "inputs": [
        {
          "name": "predict",
          "shape": [2],
          "datatype": "FP32",
          "data": {
              'prompts': ['a high speed bachata in the style of Romeo Santos'],
              'duration': 10
            }
        }
    ]
}

In [6]:
r = requests.post(
    'http://localhost:8080/v2/models/awesome_model/infer',
    json=inference_request
)

In [7]:
new_song = r.json()['outputs'][0]['data']
len(new_song), new_song[:5]

(320000,
 [-0.0814489871263504,
  -0.08753831684589386,
  -0.05938887223601341,
  -0.0547080934047699,
  -0.05944954976439476])

Finally, to listen to the tune we just created, we can use the `Audio` class from the 
[IPython.display](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html) module.

Note that we do need to specify the sampling rate alongside the array of numbers generated by our model. If 
you've never worked with audio data, this probably won't make much sense to you. An audio file is a 
digital representation of sound, and sound is a variation in air pressure over time. The sampling rate 
is the number of samples of the audio waveform taken per second to convert sound into a digital 
format. If you divide the number of values in our array (320,000) by the sampling rate of 32,000 used below, 
you get the 10 seconds we asked our model to generate.
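
The relationship between sample count, sampling rate, and duration can be checked with a couple of lines of Python. The helper name `duration_seconds` is made up for illustration; the sample counts are the ones reported by the responses in this notebook (320,000 here, and 256,000 in a later response that asked for 8 seconds).

```python
def duration_seconds(num_samples: int, sample_rate: int = 32_000) -> float:
    """Length of a mono clip in seconds, given its sample count and sampling rate."""
    return num_samples / sample_rate


ten_second_clip = duration_seconds(320_000)    # the clip requested with 'duration': 10
eight_second_clip = duration_seconds(256_000)  # the clip requested with 8 seconds

print(ten_second_clip)    # 10.0
print(eight_second_clip)  # 8.0
```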

In [39]:
from IPython.display import Audio

In [40]:
Audio(new_song, rate=32_000)

Nice! We are probably not close to a real musician yet, but we are getting there step by step.
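
If you would like to keep a clip outside the notebook, one option is to write it as a 16-bit PCM WAV file with Python's standard library. This is only a sketch under a few assumptions: the samples are mono floats in the range [-1, 1], the sampling rate is 32,000, and `save_wav` is a hypothetical helper name. A synthetic tone stands in for the model output so that the block runs on its own.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, path, rate=32_000):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    # Clip each sample to [-1, 1] and scale it to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# One second of a 440 Hz tone, used here instead of a real model response.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 32_000) for n in range(32_000)]
path = os.path.join(tempfile.gettempdir(), "tone.wav")
save_wav(tone, path)
```

With the server response from above, the call would be `save_wav(new_song, "new_song.wav")`.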

### Quick Recap of the Open Inference Protocol

If the way in which we wrote our request seemed a bit odd, that's because it follows the Open Inference 
Protocol (OIP), also known as the V2 Inference Protocol. The OIP is a standardized way for clients and 
model servers to exchange inference requests and responses, which makes it easier for different systems 
to work together.

Here is what each step of a V2 inference exchange looks like:

The client sends a POST request to the model's inference endpoint. The model name is part of the URL, and the data travels in the JSON body as a list of named inputs. The request might look like this:

POST /v2/models/my-model/infer HTTP/1.1
Content-Type: application/json
Host: example.com
Authorization: Bearer token1234567890
Accept: application/json

{"inputs": [{"name": "x", "shape": [3], "datatype": "INT32", "data": [1, 2, 3]}]}

In this example, the client asks the model "my-model" to run inference on a single input named "x" that holds the values 1, 2, and 3.

The server can also describe the model. Model metadata, which includes the names, datatypes, and shapes of the inputs and outputs, is served from a separate endpoint (GET /v2/models/my-model). An abridged response might look like this:

HTTP/1.1 200 OK
Content-Type: application/json

{"name": "my-model",
"inputs": [{"name": "x", "datatype": "INT32", "shape": [3]}],
"outputs": [{"name": "prediction", "datatype": "FP32", "shape": [1]}]}

In this example, the model has one integer input named "x" and one floating-point output named "prediction".

The server decodes the request and passes the data to the model. In MLServer, this is the step where codecs convert the JSON payload into Python types such as NumPy arrays or lists of strings.

The model runs inference and hands its output back to the server. In this example, the model returns the prediction 0.31704061458615477.

The server wraps the output in a JSON object and sends it back to the client:

{"model_name": "my-model",
"outputs": [{"name": "prediction", "shape": [1], "datatype": "FP32", "data": [0.31704061458615477]}]}

The client receives the output and can use it as needed, for example to make a decision based on the model's prediction.
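
To make the payload layout concrete, here is a small sketch that builds a V2 request body and reads a V2 response using only plain dictionaries. The helper names (`make_input`, `infer_path`, `first_output`) are hypothetical and not part of MLServer, and the response is hand-written in the same layout as the ones returned in this notebook.

```python
import json


def make_input(name, data, datatype):
    """Build one V2 input tensor from a flat Python list."""
    return {"name": name, "shape": [len(data)], "datatype": datatype, "data": list(data)}


def infer_path(model_name):
    """Path of the V2 inference endpoint for a given model name."""
    return f"/v2/models/{model_name}/infer"


def first_output(response):
    """Return (shape, data) of the first output tensor in a V2 response."""
    output = response["outputs"][0]
    return output["shape"], output["data"]


# The JSON body an HTTP client would POST to the inference endpoint.
request_body = {"inputs": [make_input("seconds", [8], "INT32")]}
payload = json.dumps(request_body)

# A hand-written response with the same fields as the real ones.
sample_response = {
    "model_name": "music_model",
    "outputs": [
        {"name": "output-0", "shape": [3, 1], "datatype": "FP32", "data": [0.1, 0.2, 0.3]}
    ],
}
shape, data = first_output(sample_response)
print(infer_path("music_model"), shape, len(data))
```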

Now that we know a bit about the OIP, let's explore how we can serve more than one model with `mlserver` and send requests to each of them.

## Let's Try Serving 2 Models Now

In [70]:
%%writefile ./model_4_tutorial/second_kul_model.py

from mlserver import MLServer, Settings, ModelSettings, MLModel
from mlserver.codecs import decode_args
from diffusers import StableDiffusionPipeline
from audiocraft.models import MusicGen
from typing import List, Optional
import numpy as np
import asyncio
import torch


class StableDifServer(MLModel):
    async def load(self):
        # device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")#.to(device)

    @decode_args
    async def predict(self, prompt: List[str]) -> np.ndarray:
        img = self.model(prompt).images[0]
        return np.asarray(img)

class MusicGenServer(MLModel):
    async def load(self):
        self.model = MusicGen.get_pretrained('small', device="cpu")

    @decode_args
    async def predict(self, prompts: List[str], seconds: Optional[np.ndarray] = 5) -> np.ndarray:
        self.model.set_generation_params(duration=seconds)
        wav = self.model.generate(prompts, progress=True)
        return wav[0, 0].cpu().numpy()

async def main():
    settings = Settings(debug=True)
    my_server = MLServer(settings=settings)
    music_generator = ModelSettings(name='music_model', implementation=MusicGenServer)
    image_generator = ModelSettings(name='image_model', implementation=StableDifServer)
    await my_server.start(models_settings=[music_generator, image_generator])

if __name__ == '__main__':
    asyncio.run(main())

Overwriting ./model_4_tutorial/second_kul_model.py


In [15]:
from mlserver.codecs import StringCodec, StringRequestCodec, NumpyCodec
import numpy as np

In [23]:
StringCodec.encode_input(name='prompts', payload=["Sean Paul style song with fast tempo and slow beat"], use_bytes=False).dict()

{'name': 'prompts',
 'shape': [1, 1],
 'datatype': 'BYTES',
 'parameters': {'content_type': 'str'},
 'data': ['Sean Paul style song with fast tempo and slow beat']}

In [114]:
import requests
reqs = [{
  "inputs": [
      StringCodec.encode_input(name='prompts', payload=["Sean Paul style song with fast tempo and slow beat"], use_bytes=False).dict(),
      NumpyCodec.encode_input(name='seconds', payload=np.array([10])).dict()
  ]}, {
      'inputs': [
          StringCodec.encode_input(name='prompt', payload=['Siberian husky having a beer at the beach.'], use_bytes=False).dict()
      ]
  }
]

uri = 'http://localhost:8080/v2/models/music_model/infer'

r = requests.post(url=uri, json=reqs[0])

In [115]:
r.json()

{'model_name': 'music_model',
 'id': '78838f86-a444-4e1e-9828-4046000f8ad6',
 'parameters': {},
 'outputs': [{'name': 'output-0',
   'shape': [320000, 1],
   'datatype': 'FP32',
   'parameters': {'content_type': 'np'},
   'data': [-0.005420586559921503,
    -0.013944793492555618,
    0.03711867332458496,
    ...]}]}

In [71]:
import httpx

uri = 'http://localhost:8080/v2/models/image_model/infer'
r2 = httpx.post(url=uri, json=reqs[1], timeout=360)

In [72]:
from PIL import Image

In [73]:
r2.json()['outputs']

[{'name': 'output-0',
  'shape': [512, 512, 3],
  'datatype': 'UINT8',
  'parameters': {'content_type': 'np'},
  'data': [168,
   153,
   137,
   ...]}]

In [78]:
shape = r2.json()['outputs'][0]['shape']
data = r2.json()['outputs'][0]['data']
image = np.array(data, dtype=np.uint8).reshape(shape)
# image

In [76]:
test = Image.fromarray(image)

In [77]:
test.show()

## Diving Into the Model Settings

In [107]:
%%writefile ./model-settings.json
{
    "name": "text_model",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2"
        }
    }
}

Overwriting ./model-settings.json


In [108]:
%%writefile ./model_4_tutorial/third_kul_model.py

from mlserver import MLServer, Settings, ModelSettings, MLModel
from mlserver.codecs import decode_args
from diffusers import StableDiffusionPipeline
from audiocraft.models import MusicGen
from typing import List, Optional
import numpy as np
import asyncio
import torch


class StableDifServer(MLModel):
    async def load(self):
        # device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")#.to(device)

    @decode_args
    async def predict(self, prompt: List[str]) -> np.ndarray:
        img = self.model(prompt).images[0]
        return np.asarray(img)

class MusicGenServer(MLModel):
    async def load(self):
        self.model = MusicGen.get_pretrained('small', device="cpu")

    @decode_args
    async def predict(self, prompts: List[str], seconds: Optional[np.ndarray] = 5) -> np.ndarray:
        self.model.set_generation_params(duration=seconds)
        wav = self.model.generate(prompts, progress=True)
        return wav[0, 0].cpu().numpy()

async def main():
    settings = Settings(debug=True)
    my_server = MLServer(settings=settings)
    text_generator = ModelSettings.parse_file('model-settings.json')
    music_generator = ModelSettings(name='music_model', implementation=MusicGenServer)
    image_generator = ModelSettings(name='image_model', implementation=StableDifServer)
    await my_server.start(models_settings=[music_generator, image_generator, text_generator])

if __name__ == '__main__':
    asyncio.run(main())

Overwriting ./model_4_tutorial/third_kul_model.py


In [111]:
tell_me_about_huskies = {
      'inputs': [
          StringCodec.encode_input(
              name='text_inputs', 
              payload=['The Siberian Husky is...'], 
              use_bytes=False).dict()
      ]
  }
requests.post("http://localhost:8080/v2/models/text_model/infer", json=tell_me_about_huskies).json()

{'model_name': 'text_model',
 'id': 'c7e8dedc-b247-4209-9f78-944f68174a2c',
 'parameters': {},
 'outputs': [{'name': 'output',
   'shape': [1, 1],
   'datatype': 'BYTES',
   'parameters': {'content_type': 'hg_jsonlist'},
   'data': ['[{"generated_text": "The Siberian Husky is...a small black bear that can be seen in the video. She looks pretty good, as far as we\'re concerned. It weighs 1.1 pounds with a body length of 2.0 inches and has a small pouch"}]']}]}

## Package Server

## Build a Website

## Build UI

In [112]:
import fastapi

In [113]:
fastapi.__version__

'0.89.1'

In [46]:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

In [47]:
def get_song(req):
    resp = requests.post(url=uri, json=req)
    return resp.json()

In [49]:
with ThreadPoolExecutor(max_workers=6) as e:
    resi = list(e.map(get_song, reqs))

In [34]:
async def get_songs():
    
    async with httpx.AsyncClient() as client:
        songs = []
        for parameters in reqs:
            resp = await client.post(url=uri, json=parameters, timeout=240)
            songs.append(resp.json())
    return songs

In [36]:
def get_songs():    
    with httpx.Client() as client:
        songs = []
        for parameters in reqs:
            resp = client.post(url=uri, json=parameters, timeout=240)
            songs.append(resp.json())
    return songs

In [37]:
s = get_songs()


In [35]:
s = asyncio.run(get_songs())

RuntimeError: asyncio.run() cannot be called from a running event loop

In [12]:
import requests

In [None]:
work_res = httpx.post(
    url=uri, json={
        "inputs": [
            {"name": "prompts", "datatype": "BYTES", "parameters": {"content_type": "str"}, "shape": [1], "data": ["Vivaldi song with fast tempo"]},
            {"name": "seconds", "datatype": "INT32", "shape": [1], "data": [8]},
        ]
    }
)

In [31]:
loop = asyncio.get_event_loop()
new_songs = loop.run_until_complete(
    get_songs()
)

RuntimeError: This event loop is already running
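
Both errors above share a cause: a Jupyter kernel already runs an event loop, so `asyncio.run` and `loop.run_until_complete` refuse to drive a second one from inside it. The sketch below reproduces the first error in a plain Python script and shows that awaiting the coroutine directly works. The function names are illustrative and not part of MLServer. In a notebook cell, the equivalent is a top-level `await` on the coroutine.

```python
import asyncio


async def fetch_value():
    """Stand-in for an async HTTP call."""
    await asyncio.sleep(0)
    return "done"


async def inside_running_loop():
    message = None
    coro = fetch_value()
    try:
        # Same mistake as calling asyncio.run() in a notebook cell:
        # a loop is already running, so this raises RuntimeError.
        asyncio.run(coro)
    except RuntimeError as err:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        message = str(err)
    # Inside a running loop, simply await the coroutine instead.
    result = await fetch_value()
    return message, result


message, result = asyncio.run(inside_running_loop())
print(message)
print(result)
```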

In [86]:
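async def run_func():
    # Send the music request (reqs[0]) three times concurrently and collect the JSON replies.
    async with httpx.AsyncClient(timeout=240) as client:
        responses = await asyncio.gather(*[client.post(uri, json=reqs[0]) for _ in range(3)])
    return [resp.json() for resp in responses]

In [87]:
# The notebook already runs an event loop, so await the coroutine directly.
si = await run_func()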

In [80]:
r2 = requests.post(
    'http://localhost:8080/v2/models/awesome_model/infer',
    json=req
)

In [65]:
r2.json()['outputs'][0]['data']

{'model_name': 'awesome_model',
 'id': '493c9552-cae5-4525-848c-584551d7777e',
 'parameters': {},
 'outputs': [{'name': 'output-0',
   'shape': [256000, 1],
   'datatype': 'FP32',
   'parameters': {'content_type': 'np'},
   'data': [0.02485641837120056,
    0.018693696707487106,
    0.025755302980542183,
    ...]}]}

In [66]:
Audio(r2.json()['outputs'][0]['data'], rate=32000)

In [57]:
from mlserver.codecs import NumpyCodec
from mlserver.codecs import NumpyRequestCodec
import numpy as np

x_0 = np.array([28.0])
# Encode a single NumPy array as one OIP input and show it as a JSON string.
NumpyCodec.encode_input(name="marriage", payload=x_0).json()


'{"name": "marriage", "shape": [1, 1], "datatype": "FP64", "parameters": {"content_type": "np"}, "data": [28.0]}'

In [54]:
len(r2.json()['outputs'][0]['data'])

KeyError: 'outputs'

In [24]:
r2.json()['outputs'][0].keys()

dict_keys(['name', 'shape', 'datatype', 'data'])

## Multi-model Serving

In [21]:
%%writefile ./model_4_tutorial/two_kul_model.py

from mlserver.types import InferenceResponse, InferenceRequest, ResponseOutput
from mlserver import MLServer, Settings, ModelSettings, MLModel
from mlserver.codecs import decode_args
from audiocraft.models import MusicGen
from transformers import pipeline
import numpy as np
import asyncio

class MusicGenServer(MLModel):

    async def load(self):
        self.model = MusicGen.get_pretrained('small', device="cuda")

    async def predict(self, request: InferenceRequest) -> InferenceResponse:
        
        prompts = request.inputs[0].data["prompts"]
        seconds = request.inputs[0].data["duration"]
        duration = 5 if seconds < 2 else seconds

        self.model.set_generation_params(duration=duration)
        wav = self.model.generate(prompts)
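        response_output = ResponseOutput(
            name="new_music",
            shape=list(wav[0].shape),
            datatype="FP32",  # the Open Inference Protocol name for 32-bit floats
            data=wav[0, 0].cpu().tolist(),
        )
        # self.name is the name the model was registered with ('music_generator' below)
        return InferenceResponse(model_name=self.name, outputs=[response_output])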

class MusicGenreClassifier(MLModel):

    async def load(self):
        self.model = pipeline("audio-classification", model="ramonpzg/wav2musicgenre")

    @decode_args
    async def predict(self, request):
        song_genre = self.model(request)
        return song_genre

async def main():
    settings = Settings(debug=True)
    my_server = MLServer(settings=settings)
    music_generator = ModelSettings(name='music_generator', implementation=MusicGenServer)
    music_classifier = ModelSettings(name='music_classifier', implementation=MusicGenreClassifier)
    await my_server.start(models_settings=[music_generator, music_classifier])

if __name__ == '__main__':
    asyncio.run(main())

Writing ./model_4_tutorial/two_kul_model.py


In [27]:
inference_request = {
    "inputs": [
        {
          "name": "predict",
          "shape": [2],
          "datatype": "FP32",
          "data": {
              'prompts': ['a slow weird worm-like song that is kind of gross but also travels in time'],
              'duration': 10
            }
        }
    ]
}

In [28]:
r = requests.post(
    'http://localhost:8080/v2/models/music_generator/infer',
    json=inference_request
)

In [29]:
another_song = r.json()['outputs'][0]['data']
another_song[:5]

[0.03370356932282448,
 0.03290902078151703,
 0.03377167508006096,
 0.03358738496899605,
 0.03169146552681923]

In [30]:
Audio(another_song, rate=32_000)

In [32]:
import numpy as np
from mlserver.codecs import NumpyRequestCodec

In [33]:
numpy_another_song = np.array(another_song)

In [34]:
inference_request = NumpyRequestCodec.encode_request(numpy_another_song)

In [None]:
inference_request.json()

In [45]:
inference_request.dict()['inputs'][0]['data'][:4]

[0.03370356932282448,
 0.03290902078151703,
 0.03377167508006096,
 0.03358738496899605]

In [39]:
r = requests.post(
    'http://localhost:8080/v2/models/music_classifier/infer',
    json=inference_request
)

TypeError: Object of type InferenceRequest is not JSON serializable

In [38]:
r.json()

{'detail': [{'loc': ['body'],
   'msg': 'value is not a valid dict',
   'type': 'type_error.dict'}]}
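
The `TypeError` above appears because the `json=` argument is serialized with a JSON encoder that only understands built-in types such as dicts, lists, strings, and numbers. A likely fix in this notebook is to pass `inference_request.dict()`, the same `.dict()` call used a few cells earlier. The sketch below illustrates the idea with a stand-in dataclass, since the point is the serialization step rather than MLServer itself; `FakeRequest` is a made-up class and not the real `InferenceRequest`.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FakeRequest:
    """Stand-in for a request object; NOT the MLServer InferenceRequest class."""
    inputs: list = field(default_factory=list)


req = FakeRequest(
    inputs=[{"name": "input-0", "shape": [2], "datatype": "FP64", "data": [0.1, 0.2]}]
)

# Passing the object itself fails, just like json=inference_request did.
try:
    json.dumps(req)
    error = None
except TypeError as err:
    error = str(err)

# Converting to a plain dict first gives the encoder something it understands.
body = json.dumps(asdict(req))

print(error)
print(body)
```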

## Synchronous Models

In [90]:
%%writefile ./model_4_tutorial/one_kul_model.py

from mlserver import MLServer, Settings, ModelSettings, MLModel
from mlserver.codecs import decode_args
from audiocraft.models import MusicGen
from typing import List, Optional
import numpy as np
import asyncio

class MusicGenServer(MLModel):

    def load(self):
        self.model = MusicGen.get_pretrained('small', device="cpu")

    @decode_args
    def predict(self, prompts: List[str], seconds: Optional[np.ndarray] = 5) -> np.ndarray:
        self.model.set_generation_params(duration=seconds)
        wav = self.model.generate(prompts, progress=True)
        return wav[0, 0].cpu().numpy()

async def main():
    settings = Settings(debug=True, parallel_workers=0)
    my_server = MLServer(settings=settings)
    music_generator = ModelSettings(name='awesome_model', implementation=MusicGenServer)
    await my_server.start(models_settings=[music_generator])

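if __name__ == '__main__':
    # main() is a coroutine function, so calling it alone only creates a coroutine
    # object; it has to be driven by an event loop to actually start the server.
    asyncio.run(main())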

Overwriting ./model_4_tutorial/one_kul_model.py
