Below, you can find examples of LLMs with multi-modal capabilities.

---
For this use case, we need larger models that support vision. You can find them in the [Ollama Library](https://ollama.com/library), marked with a `vision` label.

Here is a short list of vision-capable models:
- [llava](https://ollama.com/library/llava)
- [llava-phi3](https://ollama.com/library/llava-phi3)
- [llava-llama3](https://ollama.com/library/llava-llama3)
- [moondream](https://ollama.com/library/moondream)


In [None]:
USE_MODEL = "llava-phi3:3.8b"

In [None]:
import ollama

# If we don't have this specific model, let's pull it.
ollama.pull(USE_MODEL)

---
To retrieve images from the internet, we need an HTTP client such as [`requests`](https://docs.python-requests.org/en/latest/index.html) or [`httpx`](https://www.python-httpx.org/). This notebook uses `httpx`.

So, let's grab a random image from [`xkcd`](https://xkcd.com/).

In [None]:
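import httpx
import random
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
from io import BytesIO

# Fetch metadata for the latest comic to learn the highest comic number.
latest = httpx.get('https://xkcd.com/info.0.json')
latest.raise_for_status()
latest_num = latest.json()['num']

# Pick a random comic number and fetch its metadata.
num = random.randint(1, latest_num)
while num == 404:  # xkcd #404 does not exist (the page returns HTTP 404)
    num = random.randint(1, latest_num)
comic = httpx.get(f'https://xkcd.com/{num}/info.0.json')
comic.raise_for_status()
info = comic.json()  # parse the JSON once and reuse it

print(f'xkcd #{info["num"]}: {info["alt"]}')
print(f'link: https://xkcd.com/{num}')

# Download the image bytes; `raw.content` is reused later as model input.
raw = httpx.get(info['img'])
raw.raise_for_status()
pil_im = Image.open(BytesIO(raw.content))

im_array = np.asarray(pil_im)
plt.imshow(im_array)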

---
Now we can ask the model to explain this comic.

In [None]:
for response in ollama.generate(USE_MODEL, 'explain this comic:', images=[raw.content], stream=True):
  print(response['response'], end='', flush=True)
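
The loop above prints each streamed fragment and then discards it. If you want to keep the full answer for later use, a small helper can print and accumulate at the same time. This is only a sketch: `collect_stream` and `fake_stream` are hypothetical names, and the helper assumes each chunk exposes a `'response'` key, as in the loop above.

```python
from typing import Iterable, Mapping


def collect_stream(chunks: Iterable[Mapping[str, str]], echo: bool = True) -> str:
    """Print streamed fragments as they arrive and return the full text.

    `chunks` is assumed to be an iterable of mapping-like objects with a
    'response' key holding the next piece of text.
    """
    parts = []
    for chunk in chunks:
        fragment = chunk['response']
        if echo:
            print(fragment, end='', flush=True)
        parts.append(fragment)
    return ''.join(parts)


# Stand-in for a real stream, so the sketch runs without a model.
fake_stream = [{'response': 'A stick '}, {'response': 'figure.'}]
full_text = collect_stream(fake_stream, echo=False)
```

With the real client, the same helper could be called as `collect_stream(ollama.generate(USE_MODEL, 'explain this comic:', images=[raw.content], stream=True))`.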

---
You can try other models and compare their performance on these images.
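
One way to make such a comparison less ad hoc is to run the same question through each model and record the answer together with the wall-clock time. The sketch below is a minimal illustration, not a benchmark: `compare_models`, `ask`, and `fake_ask` are hypothetical names, and the stub stands in for a real model call so the code runs offline.

```python
import time
from typing import Callable, Dict, Iterable


def compare_models(
    models: Iterable[str],
    ask: Callable[[str], str],
) -> Dict[str, Dict[str, object]]:
    """Call `ask(model)` for each model and record the answer and wall time."""
    results = {}
    for model in models:
        start = time.perf_counter()
        answer = ask(model)
        elapsed = time.perf_counter() - start
        results[model] = {'answer': answer, 'seconds': elapsed}
    return results


# Stub so the sketch runs offline; replace it with a real call to the model.
def fake_ask(model: str) -> str:
    return f'{model} says: a comic about computers'


results = compare_models(['llava', 'moondream'], fake_ask)
for name, result in results.items():
    print(f"{name}: {result['seconds']:.3f}s, {len(result['answer'])} chars")
```

With Ollama, `ask` could be something like `lambda m: ollama.generate(m, 'explain this comic:', images=[raw.content])['response']`, assuming each model has been pulled first. Timing alone says nothing about answer quality, so read the answers side by side as well.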