This is great work, thank you for your contribution!
When I use the pre-trained BLIP model directly for captioning, the generated captions are very poor; sometimes the output has nothing to do with the image at all. How can I address this?
The images in my dataset are low-resolution, which is an inherent limitation of the data and cannot be changed.
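
For reference, this is roughly the kind of call I mean; a minimal sketch assuming the Hugging Face `Salesforce/blip-image-captioning-base` checkpoint, where the file path and generation parameters are placeholders for my actual setup:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint: the public BLIP captioning weights on Hugging Face.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# "example.jpg" is a placeholder for one of the low-resolution dataset images.
# The processor resizes inputs to the model's expected size, so low-resolution
# images are upscaled automatically before encoding.
image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Beam search tends to produce more faithful captions than greedy decoding.
out = model.generate(**inputs, num_beams=5, max_length=30)
print(processor.decode(out[0], skip_special_tokens=True))
```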
