
Advices for inference speedup #88

Open
yishusong opened this issue May 14, 2024 · 12 comments

Comments

@yishusong

Hi team,

I'm running inference on a g5.24xlarge GPU instance. The data is currently structured in a Pandas dataframe, and I use the Pandas apply method to apply the predict_entities function row by row. When the df gets fairly large (~1.5M rows), the inference takes days to run.

I'm wondering if there is a way to increase GPU utilization? I suppose a Pandas df is not the most efficient data structure... or maybe there is a parameter I missed that can boost GPU utilization?

Any advice is much appreciated!

@Marwen-Bhj

Marwen-Bhj commented May 14, 2024

Hello @yishusong, using the Pandas apply method is slow. I suppose you want to run the model on one specific column and write the output to another, so:

  1. transform that column into a list and use model.batch_predict_entities(your_list, labels)

  2. create a dictionary from that output and join it back to the dataframe (rough sketch below)

You would probably run OOM, so make sure you run this in batches (split your data) and call torch.cuda.empty_cache() between them.
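A rough sketch of that flow (the column name "text" and the label list are placeholders, and it assumes model is already loaded; since the outputs stay row-aligned, you can also assign them back directly instead of building a dictionary):

import torch

texts = df["text"].tolist()          # placeholder column name
labels = ["person", "organization"]  # placeholder label set

batch_size = 32                      # tune to your GPU memory
all_entities = []
for start in range(0, len(texts), batch_size):
    batch = texts[start:start + batch_size]
    all_entities.extend(model.batch_predict_entities(batch, labels))
    torch.cuda.empty_cache()         # free cached memory between batches

# predictions come back in input order, so they can be joined back row by row
df["entities"] = all_entities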

As for increasing GPU utilization, I am not sure how to increase it, or even confirm that the GPU is being used during inference; I hope someone else can help with that.

@urchade
Owner

urchade commented May 14, 2024

you can create batches like this

# Sample text data
all_text = ["sample text 1", "sample text 2", "sample text n"]  # ...your full list of texts

# Define the batch size
batch_size = 10

# Function to create batches
def create_batches(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# Example usage of the generator function
all_predictions = []
for batch in create_batches(all_text, batch_size):
    predictions = model.batch_predict(batch)
    all_predictions.extend(predictions)

@yishusong
Author

Thank you very much for the replies! I'll try it out shortly.

Re: @Marwen-Bhj's comment about GPU... I haven't looked into the source code yet, but is it possible to use the model through Hugging Face? I was thinking of something like device_map='auto' to use all GPUs, or setting the data type to float16 to make the model smaller. Does the code base offer configurations like this?

If not, maybe a memory-optimized instance would perform better?
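(A hedged sketch of the float16 idea, assuming the GLiNER model behaves like a plain torch.nn.Module; not verified against the source:)

import torch

model = model.to("cuda").half()   # cast weights to float16 (assumption: nn.Module methods are exposed)

with torch.no_grad():
    predictions = model.batch_predict_entities(batch, labels)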

@urchade
Owner

urchade commented May 14, 2024

You can try the automatic mixed precision (AMP) module in PyTorch for inference. For me it helps speed up training, but I have not tried it for inference:

import torch
from torch.cuda.amp import autocast

with autocast(dtype=torch.float16):
    predictions = model.batch_predict(batch)
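A small variant that also disables autograd, which usually saves some memory and time at inference (torch.inference_mode is standard PyTorch; whether it helps much here is untested):

import torch
from torch.cuda.amp import autocast

with torch.inference_mode(), autocast(dtype=torch.float16):
    predictions = model.batch_predict(batch)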

@Marwen-Bhj

@urchade I tried AMP; it did not increase the inference speed.
Heads-up @yishusong:
surprisingly, running inference on a CPU cluster is at least 3x faster than on a GPU:
CPU cluster: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
GPU instance: Nvidia V100

@yishusong
Author

Thanks! On CPU there is also joblib, so there should be further speedup.
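A hedged sketch of the joblib idea, using the threading backend so the single loaded model is shared between workers (PyTorch releases the GIL inside its ops, but the actual gain depends on the model and on how many threads PyTorch already uses internally):

from joblib import Parallel, delayed

def predict_chunk(chunk):
    # model and labels assumed to be defined as above
    return model.batch_predict_entities(chunk, labels)

chunk_size = 64
chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]

results = Parallel(n_jobs=4, backend="threading")(
    delayed(predict_chunk)(chunk) for chunk in chunks
)
all_entities = [ent for chunk_result in results for ent in chunk_result]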

@urchade
Owner

urchade commented May 14, 2024

Ok, that's weird but ok 😅

Did you try model.to('cuda') instead of model.cuda()?
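(For completeness, a quick way to check where the model actually ended up, assuming it exposes torch.nn.Module parameters:)

import torch

model = model.to("cuda")
print(next(model.parameters()).device)   # expect: cuda:0
print(torch.cuda.is_available())         # sanity check that CUDA is visible at all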

@Marwen-Bhj

@urchade that fixed it!
Thank you :)

@yishusong
Author

Thanks a lot! This indeed speeds up inference a lot.

However, model.to('cuda') seems to only utilize one GPU. From what I found online, nn.DataParallel(model) won't extend to GLiNER's batch_predict...
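One workaround, since multi-GPU is not a built-in GLiNER feature: manual sharding, i.e. load one full model copy per GPU, split the texts, and run the shards in threads. A hedged sketch (checkpoint name and labels are placeholders, and it assumes one model copy fits on each GPU):

from concurrent.futures import ThreadPoolExecutor
import torch
from gliner import GLiNER

n_gpus = torch.cuda.device_count()

# one independent model copy per GPU (checkpoint name is a placeholder)
models = [GLiNER.from_pretrained("urchade/gliner_base").to(f"cuda:{i}")
          for i in range(n_gpus)]

# contiguous shards keep it easy to reassemble results in order
shard_size = (len(texts) + n_gpus - 1) // n_gpus
shards = [texts[i * shard_size:(i + 1) * shard_size] for i in range(n_gpus)]

def run_shard(m, shard):
    # batch the shard internally (as in the loop above) if it is large
    return m.batch_predict_entities(shard, labels)

with ThreadPoolExecutor(max_workers=n_gpus) as pool:
    shard_results = list(pool.map(run_shard, models, shards))

all_entities = [ent for shard in shard_results for ent in shard]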

@lifepillar

I'm also interested in how to boost performance using multiple GPUs.

@bartmachielsen

Hi, would it also be possible to speed up using AWS Inferentia / Optimum Neuron? (see article)

@yishusong
Author

I don't think Inferentia works, because it only supports a very limited list of HF models. It also might not be compatible with CUDA, so there could be other dependency issues.
