First DeepSeek-OCR working implementation #7
Conversation
@sfallah Could you please update this branch with the main repository? They've added bicubic interpolation support for the CUDA/Vulkan backends in ggml-org#17022, which we need for SAM.
@bluebread |
…rol & all native resolution modes work
@sfallah I've added an argument. FYI: DeepSeek-OCR (the original model in PyTorch, not ours) doesn't seem to work with the testing picture in tools/mtmd. My guess is that DeepSeek-OCR probably wasn't trained on images like that. Perhaps you could take a look when you have some free time. BTW, should we add DeepSeek-OCR support to llama-server as well?
@bluebread |
@sfallah Thanks! I'm not sure whether the model was trained for arbitrary image sizes. Another issue is that it would be less intuitive for users to switch from native resolutions to gundam modes or auto mode selection; I'll probably leave this design decision to the maintainers. BTW, today I found that img_tool likely adopts a different bicubic interpolation algorithm from the GGML/PyTorch ones. This should explain the numerical instability and why it doesn't work as well as the original model, e.g. it cannot recognize the title and authors on the cover of the DeepSeek-OCR paper (although it can extract the abstract). I'll work on fixing this later.

@sfallah You can try replacing the prompt with "Free OCR" and testing other images. It randomly appears/disappears, which is a little puzzling. Hopefully we can wrap up this feature within a few days and then shift our attention to llama-server, which might be more necessary for most users.
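As a sketch of why two "bicubic" resizers can legitimately disagree (this is illustrative, not the actual img_tool or GGML code): the commonly used Keys cubic convolution kernel has a free parameter `a`, and libraries pick different values — PyTorch and OpenCV use a = -0.75, while PIL-style Catmull-Rom resampling uses a = -0.5. At the same sample offset, the two kernels produce different weights:

```python
def cubic_weight(x: float, a: float) -> float:
    """Keys cubic convolution kernel with free parameter a.

    Piecewise cubic, supported on |x| < 2; a controls the shape of the
    outer lobe and therefore the exact interpolated pixel values.
    """
    x = abs(x)
    if x < 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0

# Same offset, different kernel parameter -> different weights.
w_torch_style = cubic_weight(1.5, a=-0.75)  # PyTorch/OpenCV-style: -0.09375
w_pil_style = cubic_weight(1.5, a=-0.5)     # Catmull-Rom-style:    -0.0625
```

So even two correct "bicubic" implementations only agree if they also agree on the kernel parameter (and on the coordinate convention).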
@bluebread |
@sfallah Yes, this PR is ready now. We can let the maintainers review it first. I was thinking about how to replace the interpolation with the correct approach, but others might have better solutions, so we don't need to figure it all out before the review. We need to take care of the CI jobs and make sure that all CI actions pass.
@bluebread |
@sfallah Thanks! I agree that we should open another PR for llama-server.

@sfallah Working towards finishing debugging the vision model and a first functional implementation for DeepSeek-OCR this weekend. We've made a lot of progress!
Edit: I finished debugging, and everything magically works! Going to debug the other modes next. FYI: the interpolation algorithm used in llama.cpp doesn't quite align with PyTorch's version and can produce slightly different results, which can be confusing when you care about correctness.
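One common source of such misalignment (a sketch, not the llama.cpp code): a resizer must map each output pixel index back to a source coordinate, and the two standard conventions disagree on where the samples land, so every interpolated value can shift slightly even when the kernels match:

```python
def src_coord_half_pixel(dst_i: int, in_size: int, out_size: int) -> float:
    # Half-pixel-centers convention (PyTorch's align_corners=False default).
    return (dst_i + 0.5) * in_size / out_size - 0.5

def src_coord_align_corners(dst_i: int, in_size: int, out_size: int) -> float:
    # align_corners=True convention: endpoints map exactly onto endpoints.
    if out_size == 1:
        return 0.0
    return dst_i * (in_size - 1) / (out_size - 1)

# Upsampling 4 -> 8: the very first output sample already reads from
# different source coordinates under the two conventions.
c_half = src_coord_half_pixel(0, 4, 8)       # -0.25
c_corners = src_coord_align_corners(0, 4, 8)  # 0.0
```

When comparing a ggml backend against a PyTorch reference, it is worth checking that both sides use the same convention before suspecting the kernel math itself.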
Changes

- `ggml_flash_attn_ext` in the SAM encoder
- `get_rel_pos` with a bilinear interpolation hack (by reshaping the tensor to [*, 1, C])
- Debug helpers in `clip-impl.h`, e.g. `print_tensor_info`, `print_tensor_sum`, `save_tensor_to_file`

Bug Fixes

- `rel_pos_resized` not being used in `get_rel_pos`
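For context on the `get_rel_pos` bilinear hack: viewing an [L, C] relative-position table as an "image" of height 1 makes 2-D bilinear interpolation degenerate into 1-D linear resampling along the length axis. A minimal pure-Python sketch of that resampling (endpoint-preserving coordinate convention assumed here; the PR's actual code may use a different convention):

```python
def resize_rel_pos(rel_pos, new_len):
    """Linearly resample an [L, C] table (list of C-length rows) to [new_len, C].

    With a height of 1, bilinear interpolation has no vertical component,
    so this 1-D linear interpolation along L is all that remains.
    """
    old_len = len(rel_pos)
    if new_len == old_len:
        return [row[:] for row in rel_pos]
    out = []
    for i in range(new_len):
        # Endpoint-preserving mapping of new index i to old coordinate.
        coord = i * (old_len - 1) / (new_len - 1) if new_len > 1 else 0.0
        lo = int(coord)
        hi = min(lo + 1, old_len - 1)
        frac = coord - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(rel_pos[lo], rel_pos[hi])])
    return out

# A 3-row table resampled to 5 rows: values are interpolated linearly.
table = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
resized = resize_rel_pos(table, 5)  # column 0 becomes [0, 1, 2, 3, 4]
```

The bug listed above (`rel_pos_resized` computed but never used) would mean the encoder silently kept the unresized table, which is exactly the kind of error the `print_tensor_sum`-style debug helpers make easy to catch.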