
Conversation

@bluebread (Collaborator) commented Nov 29, 2025

@sfallah I'm working towards finishing debugging the vision model and a first functional DeepSeek-OCR implementation this weekend. We've made a lot of progress!!

Edit: I finished debugging, and everything magically works! Going to debug other modes next. FYI: the interpolation algorithm used in llama.cpp doesn't quite align with PyTorch's version and can produce slightly different results, which can be confusing when you care about correctness.
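As an aside for anyone chasing such mismatches: a frequent source of divergence between interpolation implementations is the sampling-coordinate convention, independent of the kernel itself. A minimal pure-Python sketch of the two usual conventions (illustrative only, not the actual code of either library):

```python
def sample_coords(out_len: int, in_len: int, align_corners: bool) -> list:
    """Source coordinates sampled for each output index.

    align_corners=True maps the first/last output samples exactly onto
    the first/last input samples; align_corners=False (half-pixel)
    shifts every sample by half a pixel. Two implementations that pick
    different conventions (or different cubic kernels) can produce
    slightly different results even when both are called 'bicubic'.
    """
    if align_corners:
        if out_len == 1:
            return [0.0]
        scale = (in_len - 1) / (out_len - 1)
        return [i * scale for i in range(out_len)]
    scale = in_len / out_len
    return [(i + 0.5) * scale - 0.5 for i in range(out_len)]
```

Comparing `sample_coords(4, 8, True)` against `sample_coords(4, 8, False)` already shows every sample landing at a different source position, which is enough to perturb low-order bits throughout the feature maps.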

Changes

  • Replaced manual multi-head attention with ggml_flash_attn_ext in the SAM encoder
  • Replaced manual linear interpolation in get_rel_pos with bilinear interpolation hack (by reshaping tensor to [*, 1, C])
  • Currently forcing all DeepseekOCR vision tensors to F32 for numerical stability during development
  • Added debugging utilities in clip-impl.h, e.g. print_tensor_info, print_tensor_sum, save_tensor_to_file
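The get_rel_pos hack above can be sanity-checked outside of ggml. Here is a minimal NumPy sketch (function name and shapes are illustrative, not the actual llama.cpp code) of why a 2-D bilinear resize of a [L, 1, C] "image" is equivalent to 1-D linear interpolation of a [L, C] table:

```python
import numpy as np

def resize_rel_pos(table: np.ndarray, new_len: int) -> np.ndarray:
    """1-D linear interpolation of a [L, C] table to [new_len, C].

    Matches what a 2-D bilinear image resize does to the tensor after
    reshaping it to [L, 1, C]: the width-1 axis contributes nothing,
    leaving only half-pixel linear interpolation along L (the
    align_corners=False convention of F.interpolate).
    """
    L, C = table.shape
    scale = L / new_len
    out = np.empty((new_len, C), dtype=table.dtype)
    for i in range(new_len):
        src = float(np.clip((i + 0.5) * scale - 0.5, 0.0, L - 1))
        lo = int(np.floor(src))
        hi = min(lo + 1, L - 1)
        w = src - lo
        out[i] = (1.0 - w) * table[lo] + w * table[hi]
    return out
```

When new_len == L the sampling positions fall exactly on the source rows and the table round-trips unchanged, which makes this convenient for A/B-testing against a PyTorch reference.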

Bug Fixes

  • Fixed incorrect residual connections in SAM
  • Fixed image preprocessing to use proper background color for padding
  • Fixed blocks output not being passed to downsampling neck
  • Fixed rel_pos_resized not being used in get_rel_pos
  • Fixed bicubic interpolation logic by adding proper permute operations before/after interpolation
  • Added F16 type assertion for flash attention masks to catch type mismatches early (debugging this cost me a day...)
  • Fixed the calculation of the expected number of vision tokens (previously hardcoded)

@bluebread bluebread changed the title Debug DeepEncoder (Vision Projector) Debug DeepEncoder (Vision Projector) and first DeepSeek-OCR implementation Nov 29, 2025
@bluebread bluebread changed the title Debug DeepEncoder (Vision Projector) and first DeepSeek-OCR implementation First DeepSeek-OCR working implementation Nov 29, 2025
@bluebread bluebread marked this pull request as ready for review November 29, 2025 16:47
@bluebread (Collaborator, PR author) commented Nov 30, 2025

@sfallah Could you please update this branch with the main repository? They've added bicubic interpolation support for cuda/vulkan backends in ggml-org#17022, which we need for SAM.

@sfallah (Owner) commented Nov 30, 2025

@bluebread
I have merged master into my branch!
I will review your PR.

@bluebread (Collaborator, PR author) commented

@sfallah I've added a --dsocr-mode argument to llama-mtmd-cli (though I'm not sure whether the maintainers will accept it) and debugged the native resolution modes. I'm gonna debug the gundam modes tomorrow, so I'm temporarily converting this PR to a draft and will reopen it once all modes pass tests. Here is my testing command:

 ./build/bin/llama-mtmd-cli -m /root/DeepSeek-OCR/DeepSeek-OCR-64x550M-F16.gguf --mmproj /root/DeepSeek-OCR/mmproj-DeepSeek-OCR-F16.gguf --image /root/tensor_files/treewisdom-1024.png -p "Free OCR" --chat-template deepseek -sm none -c 8192 --temp 0.0  --dsocr-mode tiny

FYI: DeepSeek-OCR (the original PyTorch model, not ours) doesn't seem to work with the test picture in tools/mtmd. My guess is that DeepSeek-OCR wasn't trained on images like that. Perhaps you could take a look when you have some free time. BTW, should we add DeepSeek-OCR support to llama-server as well?

@bluebread bluebread marked this pull request as draft November 30, 2025 17:26
@sfallah (Owner) commented Dec 1, 2025

@bluebread
great job!
I am just testing your branch.
With regard to --dsocr-mode, I think it would be simpler and more user-friendly to just use image sizes like 640, 1024, 1280 instead of modes.

@bluebread (Collaborator, PR author) commented

@sfallah Thanks! I'm not sure whether the model was trained for arbitrary image sizes. Another issue is that it would be less intuitive for users to switch from native resolutions to the gundam modes or auto mode selection, so I'll probably leave this design decision to the maintainers. BTW, today I found that img_tool likely uses a different bicubic interpolation algorithm from the GGML/PyTorch ones. This would explain the numerical instability and why our version doesn't work as well as the original model, e.g. it cannot recognize the title and authors on the cover of the DeepSeek-OCR paper (although it can extract the abstract). I'll work on fixing this later.

@sfallah (Owner) commented Dec 1, 2025

@bluebread

  • OK, maybe we can add the image sizes to the description of --dsocr-mode so the user knows what sizes the different modes can result in.

  • I was actually wondering why stb_image_resize2.h is not used for image pre-processing in mtmd, the way stb_image.h is used for loading the image bitmap.
    It may be worth trying to resize with stb_image_resize2.h.

@sfallah (Owner) commented Dec 1, 2025

@bluebread

> e.g. it cannot recognize the title and authors on the cover of DeepSeek-OCR paper (although it can extract the abstract). I'll work on fixing this later.

[Screenshot 2025-12-01 at 13:33:01: llama-mtmd-cli output for page 1 of the DeepSeek-OCR paper]

I don't see any problem with title and authors here?

```shell
./build/bin/llama-mtmd-cli \
  -m gguf_models/deepseek-ai/ds-ocr-lm.gguf \
  --mmproj gguf_models/deepseek-ai/mmproj-deepdeek-ocr.gguf \
  --image tmp/mtmd_test_data/Deepseek-OCR-2510.18234v1_page1.png \
  -p "<|grounding|>Convert the document to markdown." \
  --chat-template deepseek \
  --dsocr-mode base
```

@bluebread (Collaborator, PR author) commented

> I don't see any problem with title and authors here?

@sfallah You can try replacing the prompt with "Free OCR" and testing other images. The issue randomly appears and disappears, which is a little puzzling. Hopefully we can wrap up this feature within a few days and then shift our attention to llama-server, which is probably more important for most users.

@sfallah (Owner) commented Dec 2, 2025

@bluebread
Is this PR actually ready?
I ask because it is still marked as a draft.

@bluebread bluebread marked this pull request as ready for review December 2, 2025 07:35
@bluebread
Copy link
Collaborator Author

@sfallah Yes, this PR is ready now. We can let the maintainers review it first. I was thinking about how to replace the interpolation with the correct approach, but others might have better solutions, so we don't need to figure it all out before the review.

@sfallah sfallah merged commit 6b0e7cd into sfallah:sf/deepseek-ocr Dec 2, 2025
@sfallah (Owner) commented Dec 2, 2025

@bluebread

  1. I will finalise the PR tomorrow.
  2. I have merged with master.
  3. I have also uploaded F32 versions of the GGUF models to the HF hub (see PR). Other quantized versions will follow after testing.

We need to take care of the CI jobs and make sure that all CI actions are successful.
I suggest that we take care of llama-server in a new PR.

@sfallah (Owner) commented Dec 2, 2025

@bluebread
I have given you direct access to the repo.

@bluebread (Collaborator, PR author) commented

@sfallah Thanks! I agree that we should open another PR for llama-server.
