In this notebook, we walk through how to process multiple chunks of input by distributing the workload across all available Intel XPUs.

#### Installation

This notebook uses Intel Extension for Transformers:

https://github.com/intel/intel-extension-for-transformers

```pip install intel-extension-for-pytorch```

For the XPU build of Intel Extension for PyTorch, follow this README:

https://github.com/intel/intel-extension-for-pytorch/tree/xpu-main

Import the necessary libraries. We'll be using Intel's edition of the transformers library (Intel Extension for Transformers) to quantise and load the model.

In [None]:
import torch
from transformers import AutoTokenizer
from accelerate import PartialState
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
# Imported for its side effect: registers the XPU backend with PyTorch
import intel_extension_for_pytorch as ipex

Define the input sentences, then load the tokenizer and the model. The model is quantised to 4-bit as it is loaded.

In [None]:
sentences = ["what's the capital of England?", "what is the tallest mountain?", "who is the president of the USA?"]

model_name = "BAAI/bge-m3"

tokenizer = AutoTokenizer.from_pretrained(model_name)
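# load_in_4bit=True quantises the weights to 4 bits on load; device_map="xpu" places the model on an Intel XPU
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="xpu", trust_remote_code=True, use_llm_runtime=False)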

Distributing the workload

Here we take a shortcut: the sentences for each XPU are hardcoded rather than split automatically with accelerate, which is described here:

https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference
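
The hardcoded assignment only works for exactly three processes. As a rough sketch of what an automatic split could look like, the helper below gives each process a contiguous slice of the sentence list. `shard_for_process` is a hypothetical name introduced for illustration and is not part of accelerate. The guide linked above covers accelerate's own `split_between_processes` context manager on `PartialState`, which serves the same purpose. Note that a contiguous split hands sentences out in list order, so which sentence lands on which XPU may differ from the hardcoded version.

```python
def shard_for_process(items, process_index, num_processes):
    """Return the contiguous slice of `items` owned by one process.

    Hypothetical helper, not part of accelerate: earlier processes receive
    one extra item when the list does not divide evenly.
    """
    if num_processes < 1 or not 0 <= process_index < num_processes:
        raise ValueError("process_index must be in [0, num_processes)")
    base, extra = divmod(len(items), num_processes)
    start = process_index * base + min(process_index, extra)
    end = start + base + (1 if process_index < extra else 0)
    return items[start:end]


sentences = [
    "what's the capital of England?",
    "what is the tallest mountain?",
    "who is the president of the USA?",
]

# Simulate three processes; in the notebook these values would come from
# distributed_state.process_index and distributed_state.num_processes.
shards = [shard_for_process(sentences, rank, 3) for rank in range(3)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

With more processes than sentences, the trailing processes get an empty list, which matches the `else: subset_sentences = []` branch used in this notebook.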

In [None]:
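# PartialState exposes this process's index when the script is launched with accelerate
distributed_state = PartialState()

# Bind each process to its own XPU (process 0 -> xpu:0, process 1 -> xpu:1, ...)
device = torch.device(f"xpu:{distributed_state.process_index}")
model.to(device)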

if distributed_state.process_index == 0:
    subset_sentences = ["what's the capital of England?"]
elif distributed_state.process_index == 1:
    subset_sentences = ["who is the president of the USA?"]
elif distributed_state.process_index == 2:
    subset_sentences = ["what is the tallest mountain?"]
else:
    subset_sentences = []

Finally, compute the embeddings on each process

In [None]:
if subset_sentences:
    subset_inputs = tokenizer(subset_sentences, return_tensors="pt", padding=True, truncation=True)
    subset_inputs = {key: tensor.to(device) for key, tensor in subset_inputs.items()}

    with torch.no_grad():
        outputs = model(**subset_inputs)
        logits = outputs.logits

    # Mean over the sequence dimension (dim=1) gives one vector per sentence
    embeddings = logits.mean(dim=1)

    print(f"Process {distributed_state.process_index} embeddings:")
    print(embeddings)
else:
    print(f"Process {distributed_state.process_index} has no sentences to process.")
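
The cell above averages over every token position. With one sentence per process nothing is padded, so this is harmless. If a process were given several sentences of different lengths, `padding=True` would add padding positions, and those would also enter the mean. One common remedy is to average only the positions where the attention mask is 1. The sketch below shows that arithmetic in plain Python on toy data. `masked_mean` is a hypothetical helper, and in the notebook the same idea would be expressed with tensor operations on `logits` and `subset_inputs["attention_mask"]`.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the rows of `token_vectors` whose mask entry is 1.

    token_vectors: list of per-token vectors (lists of floats).
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


# Toy data: two real tokens followed by one padding position.
vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(vectors, mask)                           # padding ignored
naive = [sum(col) / len(vectors) for col in zip(*vectors)]    # padding included
print(pooled, naive)
```

Here the padding row pulls the naive average far away from the two real tokens, while the masked version is unaffected by it.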

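To run this file, save it as a script and launch it with ```accelerate launch [scriptname]```. The number of processes can be set with the `--num_processes` option; this example expects three.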

Output (the processes print in whichever order they finish):

```text
Process 1 embeddings:
tensor([[-10.1953,   0.1705,   0.0363,  ...,  -5.0781,   0.6475,  -3.7539]],
       device='xpu:1', dtype=torch.float16)
Process 0 embeddings:
tensor([[-6.3789,  0.2463, -9.2734,  ..., -3.4590, -0.7021, -4.2773]],
       device='xpu:0', dtype=torch.float16)
Process 2 embeddings:
tensor([[-4.9922, -0.2871, -2.2910,  ..., -5.4102,  1.5928, -4.4609]],
       device='xpu:2', dtype=torch.float16)
```