Understanding operation inside non_max_suppression() function #13179
Comments
@Avaneesh-S hello, thank you for your detailed question and for profiling your application with viztracer.

Understanding the Issue
The behavior you're observing at the x = x[xc[xi]] line of non_max_suppression() is typical of how PyTorch schedules work on the GPU: CUDA kernels are launched asynchronously, so the first operation in the loop that actually needs the inference results acts as a synchronization point, and the time of the preceding GPU work gets attributed to it in the profiler.

Steps to Investigate and Optimize
- Add a warm-up iteration so one-time CUDA initialization is excluded from your measurements.
- Profile with explicit synchronization (for example, call torch.cuda.synchronize() before and after the section you are timing) so each call is charged its true execution time.

Potential Optimization Strategies
- Pre-allocate tensors that are reused across iterations.
- Process predictions in batches where possible.
- Check whether an alternative NMS implementation (for example, in torchvision) fits your use case.

YOLOv8 Differences
You are correct that the implementation of non_max_suppression() in YOLOv8 differs from YOLOv5; the prediction tensors have a different layout, so the two functions are not drop-in replacements for each other.

Feel free to share your reproducible example, and we can dive deeper into this issue. Thank you for your contribution to improving the YOLOv5 community!
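To make the synchronization effect concrete, here is a minimal standalone sketch (illustrative shapes and toy GPU work, not YOLOv5 code):

```python
import time
import torch

# Minimal demonstration that boolean-mask indexing acts as a sync point on CUDA.
# Shapes are illustrative, not the exact YOLOv5 ones; on CPU all ops are synchronous.
device = "cuda" if torch.cuda.is_available() else "cpu"

pred = torch.randn(10, 25200, 85, device=device)  # stand-in for a batch of raw predictions
conf_thres = 0.25

# Queue up some GPU work; these calls return before the kernels finish.
for _ in range(50):
    pred = pred * 1.0001
xc = pred[..., 4] > conf_thres  # candidate mask, still asynchronous

t0 = time.perf_counter()
x0 = pred[0][xc[0]]  # boolean indexing needs the element count -> waits for all queued kernels
print(f"first indexing: {(time.perf_counter() - t0) * 1e3:.2f} ms")

t0 = time.perf_counter()
x1 = pred[1][xc[1]]  # the queue is now empty, so the same op is much cheaper
print(f"second indexing: {(time.perf_counter() - t0) * 1e3:.2f} ms")
```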
Hey @glenn-jocher, I tried to make a simple example to replicate the issue. These are the changes to be made in general.py (helper functions added so that the individual steps show up separately in viztracer):

def compute_1(xc,xi):
def compute_2(x,c_1):
def return_x(x,xc,xi):
def enter_loop():
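The bodies of these helpers are roughly along the following lines (an illustrative reconstruction based on the description later in this comment, not the exact code):

```python
# Hypothetical sketch of the profiling helpers; bodies are illustrative only,
# shown to make the wrapping idea concrete for viztracer.
def compute_1(xc, xi):
    # select the candidate mask for image xi
    return xc[xi]

def compute_2(x, c_1):
    # the operation under investigation: boolean-mask indexing of the predictions
    return x[c_1]

def return_x(x, xc, xi):
    # wrap the two steps so each shows up as its own call in the trace
    c_1 = compute_1(xc, xi)
    return compute_2(x, c_1)

def enter_loop():
    # marker call used to see where the per-image loop starts in the trace
    pass
```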
These are all the changes to be made. To run the code:
Once it starts running you will see the display of the video stream; let it run for a few seconds and then exit by pressing Ctrl+C. viztracer will then save the results in report.json.

Ignore the large non_max_suppression() call at the start (not sure what that is; if you know what it's for, do let me know) and view the others, which are the smaller sections on the right side. You will have to zoom in using Ctrl + mouse scroll. Zoom in until you can see the individual calls: there you can see that the compute_2() function call (which we added to separate out the x = x[xc[xi]] operation) takes a long time.

Since detect.py processes a batch size of 1, you can't tell that the slowdown does not happen on the next iterations (you can modify detect.py to process batch_size > 1 and check it), but in my application, since I am processing a batch size of 10, you can tell. Additionally, you can see a blank white space under the right side of non_max_suppression(); that is not there in my application's vizviewer output (most of my application's time is occupied by the compute_2() call).

This is the minimum reproducible example that I could make. Do go through it and let me know why exactly it's taking that long and how to optimize it (if possible).

PS: I have tried the warm-up iteration by adding the following in non_max_suppression() before the 'for' loop:

dummy_input = torch.randn_like(prediction[0], device=prediction.device)

It didn't help; compute_2() is still the one taking the most time. Is my warm-up iteration approach right? Also, even if the warm-up works, won't it just add that additional time at the warm-up step and reduce it in the first iteration? That won't speed up the function call. My aim is to speed up the entire function call.
Hello @Avaneesh-S, thank you for providing such a detailed and thorough explanation along with a minimum reproducible example. This is incredibly helpful for us to understand and investigate the issue.

Reviewing Your Example
I see that you've made modifications to non_max_suppression() in general.py to isolate the x = x[xc[xi]] operation in the viztracer output. Regarding your warm-up question: allocating a dummy tensor does not exercise the boolean indexing itself, nor does it synchronize the device, so it will not move the synchronization cost out of the first loop iteration.

Next Steps
- Try a warm-up that runs the same indexing operation once and synchronizes before the measured loop (see the sketch below).
- Re-profile with torch.cuda.synchronize() around the section you are timing, so the cost is attributed to the right call.
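As a concrete example, a warm-up along these lines (a sketch that assumes prediction and conf_thres are available at that point; the helper name is illustrative) exercises the same boolean-indexing path:

```python
import torch

def warmup_nms_indexing(prediction, conf_thres):
    """Run the candidate-mask indexing once so CUDA init and pending kernels
    are paid for before the measured loop starts."""
    xc = prediction[..., 4] > conf_thres      # candidate mask, same op as in the real loop
    _ = prediction[0][xc[0]]                  # boolean indexing forces a device sync
    if prediction.is_cuda:
        torch.cuda.synchronize()              # make sure all queued work is finished
```

Note that, as you say, this only moves the one-time cost out of the first iteration; it does not reduce the total work, it changes where the time shows up in the profile.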
Potential Optimization Strategies
- Pre-allocate memory for tensors that are reused across iterations (example below).
- Process frames in batches so the per-call overhead is amortized.
- Check whether an alternative implementation in torchvision, such as torchvision.ops.batched_nms, fits your pipeline (sketch below).
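For the torchvision route, a minimal sketch (assuming boxes are already in xyxy format and per-box class indices are available; the function and parameter names here are illustrative):

```python
import torch
import torchvision

def batched_nms_example(boxes, scores, class_ids, iou_thres=0.45):
    """Suppress overlapping boxes per class in a single call.

    boxes:     (N, 4) tensor in xyxy format
    scores:    (N,) confidence scores
    class_ids: (N,) integer class index per box
    """
    keep = torchvision.ops.batched_nms(boxes, scores, class_ids, iou_thres)
    return boxes[keep], scores[keep], class_ids[keep]
```

YOLOv5's non_max_suppression() already uses torchvision.ops.nms internally, so any gain here would come mainly from restructuring the surrounding loop rather than from the NMS call itself.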
Example Code for Pre-Allocation
Here is an example of how you might pre-allocate memory for the tensors used in non_max_suppression() (a sketch: the mask is allocated once per batch and reused for every image):

    import torch

    def non_max_suppression(prediction, conf_thres=0.25, iou_thres=0.45, classes=None, agnostic=False):
        # Your existing code here...

        # Pre-allocate a boolean mask once, shaped (num_boxes,), on the same device as the predictions
        mask = torch.zeros(prediction.shape[1], dtype=torch.bool, device=prediction.device)

        for i, x in enumerate(prediction):  # image index, image inference
            # Write the confidence comparison into the pre-allocated mask instead of creating a new tensor
            torch.gt(x[:, 4], conf_thres, out=mask)
            x = x[mask]
            # Your existing code here...

Conclusion
Thank you for your patience and for providing such a detailed example. Please try the suggestions above and let us know if you see any improvements. If the issue persists, we can continue to explore other optimization strategies. Your contributions and detailed analysis are invaluable to the YOLO community and the Ultralytics team. We appreciate your efforts in helping to improve the performance of YOLOv5.
Hey @glenn-jocher, I have tried the strategies. Pre-allocating memory does not decrease the overall processing time, I am already processing in batches in my application, and I could not find any other alternate implementation in torchvision.

I have also been profiling detect.py on a video input on CPU as well as GPU using viztracer. On CPU, although the overall processing of the video is much slower than when using GPU, the execution time of the non_max_suppression() function is much lower than on GPU. On GPU, each call to non_max_suppression() takes on average 1-2 ms (milliseconds), sometimes more, but on CPU it is only around 500-600 us (microseconds).

After noting this difference, I also tried moving the prediction tensor inside non_max_suppression() from GPU to CPU for the processing while the rest of the pipeline runs on GPU, but the overhead of moving the tensor from GPU to CPU is high, so the function execution speed ends up about the same as just keeping the tensor on GPU.

Can you let me know why it's faster on CPU and whether it's possible to change the code to take advantage of that without the overhead of moving the tensors? Any other optimization strategies you can think of would also help.
Hello @Avaneesh-S, thank you for your detailed follow-up and for sharing your profiling results. It's great to see such a thorough investigation into the performance differences between CPU and GPU executions.

Understanding the Issue
The observation that non_max_suppression() runs faster on CPU than on GPU is not unusual for this kind of workload: after the confidence filter, the per-image tensors are small, so the GPU's parallelism buys little while every operation still pays kernel-launch and synchronization overhead. The CPU avoids that overhead, which is why the same sequence of small operations finishes in a few hundred microseconds there.

Potential Reasons and Solutions
- Kernel launch and synchronization overhead dominates when the tensors involved are small.
- Boolean-mask indexing on the GPU forces a device-to-host sync to determine the output size, which adds latency per call.
- Device-to-host transfers are expensive when the whole prediction tensor is copied synchronously.

Optimization Strategies
- Move only the tensors that the post-processing actually needs to the CPU, rather than the full prediction tensor.
- Make those transfers asynchronous (pinned host memory plus non_blocking=True) so they overlap with GPU computation; see the sketch below.
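A minimal sketch of a selective, asynchronous transfer (assuming the standard YOLOv5 prediction layout of (batch, boxes, 5 + nc); the helper name is illustrative):

```python
import torch

def move_candidates_to_cpu(prediction, conf_thres=0.25):
    """Copy only the above-threshold rows of each image to pinned CPU memory, asynchronously."""
    xc = prediction[..., 4] > conf_thres                  # candidate mask, computed on the GPU
    cpu_chunks = []
    for xi in range(prediction.shape[0]):
        x = prediction[xi][xc[xi]]                        # small tensor after confidence filtering
        buf = torch.empty(x.shape, dtype=x.dtype, device="cpu", pin_memory=True)
        buf.copy_(x, non_blocking=True)                   # device-to-host copy overlaps with GPU work
        cpu_chunks.append(buf)
    if prediction.is_cuda:
        torch.cuda.synchronize()                          # ensure all copies have landed before use
    return cpu_chunks
```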
Next Steps
- Try the selective, asynchronous transfer in your application and compare the per-call non_max_suppression() time against the current GPU-only version.
- If you see an improvement, please share the modified code so we can review it.

Conclusion
Thank you for your patience and for contributing to the YOLOv5 community with your detailed analysis. Your efforts are invaluable in helping us improve the performance and efficiency of YOLOv5. If you have any further questions or need additional assistance, please feel free to reach out.
Hey @glenn-jocher, thanks for the suggestions. I have tried moving only the necessary tensors to the CPU asynchronously. Moving the entire prediction tensor also has a high overhead, so instead I move the tensors selectively: for the 1st iteration only, I do the move before the loop starts (at the start of non_max_suppression()), and for the rest I do it inside the loop. I also do this for the 'xc[i]' tensors so that I can do the 'x = x[xc[i]]' operation (same as x = x[xc[xi]] in the non_max_suppression() function). Additionally, I keep the 'output' tensor returned by non_max_suppression() on the GPU itself, so that what happens after non_max_suppression() finishes is unchanged when the script runs on GPU.

I have tested this on the detect.py script running on GPU with an input video of 504 frames, processing a batch size of 1 (since detect.py can only process batch size 1), and the modified version consistently spent less time in non_max_suppression() than the original (I ran the scripts multiple times to be sure).

The system I used for testing has an Intel i5-10300H CPU and an Nvidia GTX 1650 GPU. I also think that if the batch size being processed increases, the prediction tensor's size will increase, and therefore the original implementation's GPU memory allocation time inside non_max_suppression() might also go up a bit (let me know if I am wrong about this). But with my approach there is very little or no GPU memory allocation overhead inside non_max_suppression(), hence it is able to run faster (tested with viztracer). Additionally, if needed, I can try modifying the required scripts to process batch size > 1 to test the performance.
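In outline, the overlapping transfer looks like this (an illustrative sketch with assumed names, not the exact modified general.py code; it assumes prediction is on a CUDA device):

```python
import torch

def nms_with_prefetch(prediction, conf_thres=0.25):
    """Post-process image i on the CPU while the copy for image i+1 is still in flight."""
    bs = prediction.shape[0]
    xc = prediction[..., 4] > conf_thres                    # candidate mask on the GPU

    def start_copy(xi):
        x_gpu = prediction[xi][xc[xi]]                       # filtered rows for image xi
        buf = torch.empty(x_gpu.shape, dtype=x_gpu.dtype, device="cpu", pin_memory=True)
        buf.copy_(x_gpu, non_blocking=True)                  # async copy into pinned host memory
        done = torch.cuda.Event()
        done.record()                                        # marks when this copy has finished
        return buf, done

    output = [None] * bs
    pending = start_copy(0)                                  # first copy is issued before the loop
    for xi in range(bs):
        x_cpu, done = pending
        if xi + 1 < bs:
            pending = start_copy(xi + 1)                     # issue the next copy right away
        done.synchronize()                                   # wait only for this image's copy
        # ... per-image NMS post-processing on x_cpu goes here ...
        output[xi] = x_cpu
    return output
```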
Hello @Avaneesh-S, thank you for your detailed follow-up and for sharing your innovative approach to optimizing the non_max_suppression() function.

Reviewing Your Approach
Your approach of moving the next tensor to the CPU asynchronously while the current one is being processed is a clever way to overlap data transfer and computation. This can indeed help in reducing the overall processing time, as evidenced by your performance improvements.

Integration Considerations
- The change should behave identically on CPU-only systems and leave downstream code (which expects the output on the GPU) unchanged.
- Results with batch sizes larger than 1 are needed to confirm that the benefit scales.
Next Steps
- Test the change with batch sizes greater than 1, since that is where you expect the largest benefit.
- Share a reproducible example or a branch with your implementation so we can review and benchmark it.
Conclusion
Your contributions and detailed analysis are invaluable to the YOLOv5 community and the Ultralytics team. We appreciate your efforts in helping to improve the performance of YOLOv5. If you have any further questions or need additional assistance, please feel free to reach out. We look forward to your reproducible example and further insights! Thank you for your dedication and innovative approach! 🚀
Hey @glenn-jocher, I have not had the time to change the detect.py code to process in batches to test my changes. This is my implementation of the changes in general.py. If possible, can you look into changing the code to process in batches and check if there are any improvements? Also let me know if it is worth integrating. You can use this branch.
Hello @Avaneesh-S, thank you for sharing your implementation and for your continued efforts to optimize the non_max_suppression() function.

Next Steps
- We will review the branch you shared and test it with batch sizes greater than 1 to see whether the improvement holds as the prediction tensor grows.
- If you find time, extending the inference loop to accept a batched input would make the comparison straightforward; a rough sketch of that kind of harness is below.

Encouragement and Next Steps
- Please keep sharing your profiling results, especially at the batch size you use in your application (10).
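For example, a batching harness of roughly this shape (a sketch with assumed preprocessing and model output format; it is not the detect.py code, just an outline of feeding batch_size > 1 frames through the model and non_max_suppression together):

```python
import torch

def run_batched(model, frames, non_max_suppression, batch_size=10, conf_thres=0.25, iou_thres=0.45):
    """Group preprocessed frames into batches so non_max_suppression() sees batch_size > 1."""
    outputs = []
    for start in range(0, len(frames), batch_size):
        chunk = frames[start:start + batch_size]
        # frames are assumed to be preprocessed CHW float tensors of identical size
        batch = torch.stack(chunk).to(next(model.parameters()).device)
        with torch.no_grad():
            pred = model(batch)[0]                 # assumed: first element is the raw (batch, boxes, 5+nc) tensor
        outputs.extend(non_max_suppression(pred, conf_thres, iou_thres))
    return outputs
```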
Conclusion
Thank you once again for your dedication and innovative approach. We will take a closer look at your implementation and test it with batch processing. If you have any further questions or need additional assistance, please feel free to reach out here. We look forward to collaborating with you to enhance the performance of YOLOv5! 🚀
Question
I am processing a batch of 10 videos at the same time and running YOLOv5 on them (every batch contains one frame from each video, so frames are processed in batches of 10 at a time). While using viztracer to profile my application, I found that in non_max_suppression() in general.py, at the line

    x = x[xc[xi]]

this operation takes a long time on the first iteration of the 'for' loop (that is, for image 1 or index 0), while for all subsequent iterations it runs very fast.

Specifically, suppose v = xc[xi]; then the operation x = x[v] is the one taking the most time (not v = xc[xi]).

If non_max_suppression() executes for 100 ms, around 80 ms is taken by this operation in the first iteration of the 'for' loop, for every batch.

I want to understand why this is happening, why it happens only on the first iteration, and whether there is any way to reduce this time. I am trying to optimize my application to improve average FPS, and optimizing this operation would optimize the entire non_max_suppression().
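For reference, the relevant part of non_max_suppression() looks roughly like this (abridged; the surrounding box conversion and NMS steps are omitted):

```python
# Abridged from YOLOv5's general.py; prediction has shape (batch, boxes, 5 + num_classes)
xc = prediction[..., 4] > conf_thres          # boolean candidate mask, one row of flags per image

for xi, x in enumerate(prediction):           # xi: image index, x: (boxes, 5 + num_classes)
    x = x[xc[xi]]                             # keep only boxes above the confidence threshold
    # ... box conversion, per-class NMS, etc. ...
```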
Additional
Additionally, I have gone through the same implementation in YOLOv8 and found that it is different. Is it more optimized there? I have tried to manually replace YOLOv5's non_max_suppression() with YOLOv8's, but it didn't give the required output; I think that's because the prediction tensors are a bit different between them (am I right?).