
Unable to run Batch inference with Multiple GPUs using LLM and Ray #41728

Open
Roufa-mohammad-soulpage opened this issue Dec 8, 2023 · 4 comments


@Roufa-mohammad-soulpage

Hi,

Thank you so much for this excellent package. I can't share my exact code and objective here, but the code below is almost the same, and the images replicate the issue clearly.

I am trying to run batch inference over a PDF using the Mistral LLM and have run into an issue. As an example use case, say I want to identify the tone (or some other objective) of each page of the PDF.

I referenced the links below to implement the code:

Note:

  • Anyone can easily run this code to reproduce the issue.
  • The input is a sample PDF with text (my PDFs are more than 200 pages).
  • The PDF that triggered this issue has 231 pages; you can see this in the logs.

Code:

import os
import re
import gc

import ray
import time
import torch

import pdfplumber
import numpy as np
import pandas as pd
from tqdm import tqdm
from typing import Dict
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def load_large_model():
    model_name_or_path = "mistralai/Mistral-7B-Instruct-v0.1"

    model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                                 device_map="cuda:0",
                                                 trust_remote_code=False,
                                                 revision="main",
                                                 load_in_4bit=True,
                                                 )
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                             use_fast=True)
    return model, tokenizer

def get_data(pdf):
    all_pages_text = []

    for pg in range(len(pdf.pages)):
        page = pdf.pages[pg]
        text = page.extract_text(layout = False)
        text = text.strip("\n")

        prompt = "Some Random/Example Prompt here :{}"
        new_prompt = prompt.format(text)
        template=f'''<s>[INST] {new_prompt} '~'[/INST]'''
        
        all_pages_text.append(template)

    return all_pages_text


def document_data_style(pdf, model, tokenizer):
    all_pages_text = get_data(pdf)
    ds = ray.data.from_numpy(np.asarray(all_pages_text))

    print("total items :", ds.count())
    print("printing dataset inspection execution stats:", ds.stats())

    BATCH_SIZE = 5 # maximum batch size that can fit into memory
    class LLM_classifier:
        def __init__(self):
            self.detector = pipeline("text-generation",
                                    model=model,
                                    tokenizer=tokenizer,
                                    max_new_tokens=512,
                                    do_sample=True,
                                    temperature=0.3,
                                    top_p=0.95,
                                    top_k=40,
                                    repetition_penalty=1.1,
                                    )
            self.detector.tokenizer.pad_token_id = model.config.eos_token_id #padding for batching to same size

        def __call__(self, batch: Dict[str, np.ndarray]):
            outputs = self.detector(list(batch["data"]), batch_size = BATCH_SIZE)
            batch["result"] = [output[0]['generated_text'] for output in outputs]
            return batch # returning dictionary

    predictions = ds.map_batches(LLM_classifier,
                                compute=ray.data.ActorPoolStrategy(size=4), # Use 4 GPUs so we will have 4 actors. Change this number based on the number of GPUs in your cluster.
                                num_gpus=1,  # Specify 1 GPU per model/actor replica.
                                batch_size = BATCH_SIZE # Use the largest batch size that can fit on our GPUs
                                )

    final_predictions = []
    for prediction in predictions.take_all():
        response = prediction["result"]
        split_result = response.split('~')[-1]
        final_predictions.append(split_result)

    # remove the Ray dataset reference from memory
    del ds
    return final_predictions


def identify_tone(pdf):
    ray.init(num_gpus=torch.cuda.device_count())
    model, tokenizer  = load_large_model()

    doc_style = document_data_style(pdf, model, tokenizer)
    del model
    del tokenizer
    return doc_style


start = time.time()
pdf_path = 'sample_pdf.pdf' # I have pdfs of 300 to 500 pages.
pdf = pdfplumber.open(pdf_path)
doc_styles = identify_tone(pdf)
end = time.time()
print("time:", end-start)
print(doc_styles)


# List all PyTorch tensors and their sizes currently allocated on GPU
print(torch.cuda.memory_summary())
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

ray.shutdown()
gc.collect()
torch.cuda.empty_cache()

Logs:

Since the logs are fairly long, I have added them to a text file and shared the drive link here.

Dependencies:

  • All required packages (transformers, accelerate, deepspeed, ray, etc.) are installed at their recent stable versions as of this date.

GPU Configuration:

  • At the time of the screenshot, a small process was running on cuda:0, but please consider the GPUs as otherwise empty.
    (Screenshot from 2023-12-08 19-34-45)

Issue:

I am unable to run or parallelize the per-page PDF checking across GPUs. I am pasting the first lines of the logs here, but please check the full logs linked above.

Running: 0.0/48.0 CPU, 4.0/4.0 GPU, 0.0 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(MapWorker(MapBatches(LLM_classifier)) pid=3922935) Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Running: 0.0/48.0 CPU, 4.0/4.0 GPU, 0.0 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
Running: 0.0/48.0 CPU, 1.0/4.0 GPU, 0.47 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(MapWorker(MapBatches(LLM_classifier)) pid=3922935) Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Running: 0.0/48.0 CPU, 1.0/4.0 GPU, 0.47 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [00:40<?, ?it/s]
(MapWorker(MapBatches(LLM_classifier)) pid=3922935) Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Running: 0.0/48.0 CPU, 1.0/4.0 GPU, 0.47 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [01:20<?, ?it/s]
(MapWorker(MapBatches(LLM_classifier)) pid=3922935) Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Initially it used all 4 GPUs, but after that it kept running on only one GPU. Please check the logs for more detail, since the issue shows up toward the end of the logs.
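
To check which GPU each actor replica actually gets, I can log something like the following from inside LLM_classifier (just a diagnostic sketch; ray.get_gpu_ids() reports the GPUs Ray assigned to the worker, and Ray sets CUDA_VISIBLE_DEVICES per actor when num_gpus=1):

import os
import ray
import torch

# To be called from inside LLM_classifier.__init__ or __call__:
# each actor with num_gpus=1 should see exactly one visible device,
# and cuda:0 inside the actor should map to the GPU Ray assigned to it.
print("ray GPU ids:", ray.get_gpu_ids())
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible device count:", torch.cuda.device_count())
print("current device:", torch.cuda.current_device())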

How can I run the code on all 4 GPUs in parallel? What am I missing here, even though I followed the official docs?
Does anyone have an idea, or can someone help me understand/resolve the issue?
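
For reference, here is a minimal sketch of the pattern I suspect is intended (I have not confirmed that it fixes this): load the model inside each actor's __init__ instead of loading it once on the driver with device_map="cuda:0" and capturing it in the map_batches closure, so that each replica builds its own pipeline on the single GPU Ray assigns to it. It reuses ds from the code above.

import numpy as np
import ray
from typing import Dict
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

BATCH_SIZE = 5

class LLM_classifier:
    def __init__(self):
        # Each actor loads its own model copy. Ray sets CUDA_VISIBLE_DEVICES
        # per actor when num_gpus=1, so device_map="auto" (or "cuda:0") here
        # refers to the single GPU assigned to this actor.
        model_name_or_path = "mistralai/Mistral-7B-Instruct-v0.1"
        model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                                     device_map="auto",
                                                     load_in_4bit=True)
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
        self.detector = pipeline("text-generation",
                                 model=model,
                                 tokenizer=tokenizer,
                                 max_new_tokens=512,
                                 do_sample=True,
                                 temperature=0.3,
                                 top_p=0.95,
                                 top_k=40,
                                 repetition_penalty=1.1)
        self.detector.tokenizer.pad_token_id = model.config.eos_token_id

    def __call__(self, batch: Dict[str, np.ndarray]):
        outputs = self.detector(list(batch["data"]), batch_size=BATCH_SIZE)
        batch["result"] = [out[0]["generated_text"] for out in outputs]
        return batch

# ds is the Ray dataset built from the page prompts, as in the code above.
predictions = ds.map_batches(LLM_classifier,
                             compute=ray.data.ActorPoolStrategy(size=4),  # one actor per GPU
                             num_gpus=1,          # each actor reserves one GPU
                             batch_size=BATCH_SIZE)

With this layout nothing CUDA-related is created on the driver, so the driver process no longer pins a model to cuda:0 before the actors start. Is that the expected usage?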

Thanks in advance.

@purnasai

purnasai commented Dec 10, 2023

Hi, I have faced the same issue. Let's see if we get any help from the team here. Just tagging @richardliaw for a quick response.

@dmatrix
Contributor

dmatrix commented Dec 11, 2023

cc: @akshay-anyscale

@akshay-anyscale
Contributor

cc @c21

@purnasai

Hi, do we have any update on this? Thank you.
