
Unable to run Batch inference with Multiple GPUs using LLM and Ray #41728

Open
Roufa-mohammad-soulpage opened this issue Dec 8, 2023 · 4 comments


@Roufa-mohammad-soulpage

Hi,

Thank you so much for this excellent package. I can't share my exact code and objective here, but the code below is almost the same, and the images replicate the issue clearly.

I am trying to run batch inference over a PDF using the Mistral LLM and have run into an issue. As an example use case, say I want to identify the tone (or some other objective) of each page of the PDF.

I referenced the links below to implement the code:

Note:

  • Anyone can easily run this code to reproduce the issue.
  • The input is a sample PDF with text (my PDFs are more than 200 pages).
  • The PDF that triggered this issue has 231 pages; you can see this in the logs.

Code:

import os
import re
import gc

import ray
import time
import torch

import pdfplumber
import numpy as np
import pandas as pd
from tqdm import tqdm
from typing import Dict
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


def load_large_model():
    model_name_or_path = "mistralai/Mistral-7B-Instruct-v0.1"

    model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                                 device_map="cuda:0",
                                                 trust_remote_code=False,
                                                 revision="main",
                                                 load_in_4bit=True,
                                                 )
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path,
                                             use_fast=True)
    return model, tokenizer

def get_data(pdf):
    all_pages_text = []

    for pg in range(len(pdf.pages)):
        page = pdf.pages[pg]
        text = page.extract_text(layout = False)
        text = text.strip("\n")

        prompt = "Some Random/Example Prompt here :{}"
        new_prompt = prompt.format(text)
        template=f'''<s>[INST] {new_prompt} '~'[/INST]'''
        
        all_pages_text.append(template)

    return all_pages_text


def document_data_style(pdf, model, tokenizer):
    all_pages_text = get_data(pdf)
    ds = ray.data.from_numpy(np.asarray(all_pages_text))

    print("total items :", ds.count())
    print("printing dataset inspection execution stats:", ds.stats())

    BATCH_SIZE = 5 # maximum batch size that can fit into memory
    class LLM_classifier:
        def __init__(self):
            self.detector = pipeline("text-generation",
                                    model=model,
                                    tokenizer=tokenizer,
                                    max_new_tokens=512,
                                    do_sample=True,
                                    temperature=0.3,
                                    top_p=0.95,
                                    top_k=40,
                                    repetition_penalty=1.1,
                                    )
            self.detector.tokenizer.pad_token_id = model.config.eos_token_id #padding for batching to same size

        def __call__(self, batch: Dict[str, np.ndarray]):
            outputs = self.detector(list(batch["data"]), batch_size = BATCH_SIZE)
            batch["result"] = [output[0]['generated_text'] for output in outputs]
            return batch # returning dictionary

    predictions = ds.map_batches(LLM_classifier,
                                compute=ray.data.ActorPoolStrategy(size=4), # Use 4 GPUs so we will have 4 actors. Change this number based on the number of GPUs in your cluster.
                                num_gpus=1,  # Specify 1 GPU per model/actor replica.
                                batch_size = BATCH_SIZE # Use the largest batch size that can fit on our GPUs
                                )

    final_predictions = []
    for prediction in predictions.take_all():
        response = prediction["result"]
        split_result = response.split('~')[-1]
        final_predictions.append(split_result)

    # remove the Ray dataset reference from memory
    del ds
    return final_predictions


def identify_tone(pdf):
    ray.init(num_gpus=torch.cuda.device_count())
    model, tokenizer  = load_large_model()

    doc_style = document_data_style(pdf, model, tokenizer)
    del model
    del tokenizer
    return doc_style


start = time.time()
pdf_path = 'sample_pdf.pdf' # I have pdfs of 300 to 500 pages.
pdf = pdfplumber.open(pdf_path)
doc_styles = identify_tone(pdf)
end = time.time()
print("time:", end-start)
print(doc_styles)


# List all PyTorch tensors and their sizes currently allocated on GPU
print(torch.cuda.memory_summary())
print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())

ray.shutdown()
gc.collect()
torch.cuda.empty_cache()

Logs:

Since the logs are fairly long, I have added them to a text file and shared the drive link here.

Dependencies:

  • All required packages (transformers, accelerate, deepspeed, ray, etc.) are installed at their recent stable versions as of this date.

GPU Configuration:

  • At the time of the screenshot, a small process was running on cuda:0, but please consider the GPUs as otherwise empty.
    (Screenshot from 2023-12-08 19-34-45)

Issue:

I am unable to run or parallelize the per-page PDF checking across GPUs. I am pasting the first lines of the logs here, but please check the full logs linked above.

Running: 0.0/48.0 CPU, 4.0/4.0 GPU, 0.0 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(MapWorker(MapBatches(LLM_classifier)) pid=3922935) Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Running: 0.0/48.0 CPU, 4.0/4.0 GPU, 0.0 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
Running: 0.0/48.0 CPU, 1.0/4.0 GPU, 0.47 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(MapWorker(MapBatches(LLM_classifier)) pid=3922935) Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Running: 0.0/48.0 CPU, 1.0/4.0 GPU, 0.47 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [00:40<?, ?it/s]
(MapWorker(MapBatches(LLM_classifier)) pid=3922935) Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Running: 0.0/48.0 CPU, 1.0/4.0 GPU, 0.47 MiB/8.95 GiB object_store_memory:   0%|          | 0/1 [01:20<?, ?it/s]
(MapWorker(MapBatches(LLM_classifier)) pid=3922935) Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Initially it used all 4 GPUs, but after that it kept running on only one GPU. Please check the logs for more detail, since the issue shows up toward the end of the logs.
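
To check which GPU each actor replica actually gets, I can log something like the following from inside LLM_classifier (just a diagnostic sketch; ray.get_gpu_ids() reports the GPUs Ray assigned to the worker, and Ray sets CUDA_VISIBLE_DEVICES per actor when num_gpus=1):

import os
import ray
import torch

# To be called from inside LLM_classifier.__init__ or __call__:
# each actor with num_gpus=1 should see exactly one visible device,
# and cuda:0 inside the actor should map to the GPU Ray assigned to it.
print("ray GPU ids:", ray.get_gpu_ids())
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible device count:", torch.cuda.device_count())
print("current device:", torch.cuda.current_device())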

How can I run the code on all 4 GPUs in parallel? What am I missing here, even though I followed the official docs?
Does anyone have an idea, or can someone help me understand/resolve the issue?
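
For reference, here is a minimal sketch of the pattern I suspect is intended (I have not confirmed that it fixes this): load the model inside each actor's __init__ instead of loading it once on the driver with device_map="cuda:0" and capturing it in the map_batches closure, so that each replica builds its own pipeline on the single GPU Ray assigns to it. It reuses ds from the code above.

import numpy as np
import ray
from typing import Dict
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

BATCH_SIZE = 5

class LLM_classifier:
    def __init__(self):
        # Each actor loads its own model copy. Ray sets CUDA_VISIBLE_DEVICES
        # per actor when num_gpus=1, so device_map="auto" (or "cuda:0") here
        # refers to the single GPU assigned to this actor.
        model_name_or_path = "mistralai/Mistral-7B-Instruct-v0.1"
        model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                                     device_map="auto",
                                                     load_in_4bit=True)
        tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
        self.detector = pipeline("text-generation",
                                 model=model,
                                 tokenizer=tokenizer,
                                 max_new_tokens=512,
                                 do_sample=True,
                                 temperature=0.3,
                                 top_p=0.95,
                                 top_k=40,
                                 repetition_penalty=1.1)
        self.detector.tokenizer.pad_token_id = model.config.eos_token_id

    def __call__(self, batch: Dict[str, np.ndarray]):
        outputs = self.detector(list(batch["data"]), batch_size=BATCH_SIZE)
        batch["result"] = [out[0]["generated_text"] for out in outputs]
        return batch

# ds is the Ray dataset built from the page prompts, as in the code above.
predictions = ds.map_batches(LLM_classifier,
                             compute=ray.data.ActorPoolStrategy(size=4),  # one actor per GPU
                             num_gpus=1,          # each actor reserves one GPU
                             batch_size=BATCH_SIZE)

With this layout nothing CUDA-related is created on the driver, so the driver process no longer pins a model to cuda:0 before the actors start. Is that the expected usage?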

Thanks in advance.

@purnasai

purnasai commented Dec 10, 2023

Hi, I have faced the same issue. Let's see if we get any help from the team here. Just tagging @richardliaw for a quick response.

@dmatrix
Contributor

dmatrix commented Dec 11, 2023

cc: @akshay-anyscale

@akshay-anyscale
Contributor

cc @c21

@purnasai

Hi, do we have any update on this? Thank you.
