
Conversation

@bluebread (Collaborator) commented Nov 29, 2025

@sfallah I'm working towards finishing debugging the vision model and a first functional DeepSeek-OCR implementation this weekend. We've made a lot of progress!!

Edit: I finished debugging, and everything magically works! Going to debug other modes next. FYI: the interpolation algorithm used in llama.cpp doesn't quite align with PyTorch's version and can produce slightly different results, which can be confusing when you care about correctness.
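As an aside for anyone chasing such mismatches: a frequent source of divergence between interpolation implementations is the sampling-coordinate convention, independent of the kernel itself. A minimal pure-Python sketch of the two usual conventions (illustrative only, not the actual code of either library):

```python
def sample_coords(out_len: int, in_len: int, align_corners: bool) -> list:
    """Source coordinates sampled for each output index.

    align_corners=True maps the first/last output samples exactly onto
    the first/last input samples; align_corners=False (half-pixel)
    shifts every sample by half a pixel. Two implementations that pick
    different conventions (or different cubic kernels) can produce
    slightly different results even when both are called 'bicubic'.
    """
    if align_corners:
        if out_len == 1:
            return [0.0]
        scale = (in_len - 1) / (out_len - 1)
        return [i * scale for i in range(out_len)]
    scale = in_len / out_len
    return [(i + 0.5) * scale - 0.5 for i in range(out_len)]
```

Comparing `sample_coords(4, 8, True)` against `sample_coords(4, 8, False)` already shows every sample landing at a different source position, which is enough to perturb low-order bits throughout the feature maps.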

Changes

  • Replaced manual multi-head attention with ggml_flash_attn_ext in the SAM encoder
  • Replaced manual linear interpolation in get_rel_pos with bilinear interpolation hack (by reshaping tensor to [*, 1, C])
  • Currently forcing all DeepseekOCR vision tensors to F32 for numerical stability during development
  • Added debugging utilities in clip-impl.h, e.g. print_tensor_info, print_tensor_sum, save_tensor_to_file
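The get_rel_pos hack above can be sanity-checked outside of ggml. Here is a minimal NumPy sketch (function name and shapes are illustrative, not the actual llama.cpp code) of why a 2-D bilinear resize of a [L, 1, C] "image" is equivalent to 1-D linear interpolation of a [L, C] table:

```python
import numpy as np

def resize_rel_pos(table: np.ndarray, new_len: int) -> np.ndarray:
    """1-D linear interpolation of a [L, C] table to [new_len, C].

    Matches what a 2-D bilinear image resize does to the tensor after
    reshaping it to [L, 1, C]: the width-1 axis contributes nothing,
    leaving only half-pixel linear interpolation along L (the
    align_corners=False convention of F.interpolate).
    """
    L, C = table.shape
    scale = L / new_len
    out = np.empty((new_len, C), dtype=table.dtype)
    for i in range(new_len):
        src = float(np.clip((i + 0.5) * scale - 0.5, 0.0, L - 1))
        lo = int(np.floor(src))
        hi = min(lo + 1, L - 1)
        w = src - lo
        out[i] = (1.0 - w) * table[lo] + w * table[hi]
    return out
```

When new_len == L the sampling positions fall exactly on the source rows and the table round-trips unchanged, which makes this convenient for A/B-testing against a PyTorch reference.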

Bug Fixes

  • Fixed incorrect residual connections in SAM
  • Fixed image preprocessing to use proper background color for padding
  • Fixed blocks output not being passed to downsampling neck
  • Fixed rel_pos_resized not being used in get_rel_pos
  • Fixed bicubic interpolation logic by adding proper permute operations before/after interpolation
  • Added F16 type assertion for flash attention masks to catch type mismatches early (debugging this cost me a day...)
  • Fixed the calculation of the expected number of vision tokens (previously hardcoded)

@bluebread bluebread changed the title Debug DeepEncoder (Vision Projector) Debug DeepEncoder (Vision Projector) and first DeepSeek-OCR implementation Nov 29, 2025
@bluebread bluebread changed the title Debug DeepEncoder (Vision Projector) and first DeepSeek-OCR implementation First DeepSeek-OCR working implementation Nov 29, 2025
@bluebread bluebread marked this pull request as ready for review November 29, 2025 16:47
@bluebread (Collaborator, PR author) commented Nov 30, 2025

@sfallah Could you please update this branch with the main repository? They've added bicubic interpolation support for cuda/vulkan backends in ggml-org#17022, which we need for SAM.

@sfallah (Owner) commented Nov 30, 2025

@bluebread
I have merged master into my branch!
I will review your PR.

@bluebread (Collaborator, PR author) commented

@sfallah I've added a --dsocr-mode argument to llama-mtmd-cli (though I'm not sure whether the maintainers will accept it) and debugged the native resolution modes. I'm gonna debug the gundam modes tomorrow, so I'm temporarily converting this PR to a draft and will reopen it once all modes pass tests. Here is my testing command:

 ./build/bin/llama-mtmd-cli -m /root/DeepSeek-OCR/DeepSeek-OCR-64x550M-F16.gguf --mmproj /root/DeepSeek-OCR/mmproj-DeepSeek-OCR-F16.gguf --image /root/tensor_files/treewisdom-1024.png -p "Free OCR" --chat-template deepseek -sm none -c 8192 --temp 0.0  --dsocr-mode tiny

FYI: DeepSeek-OCR (the original PyTorch model, not ours) doesn't seem to work with the test picture in tools/mtmd. My guess is that DeepSeek-OCR wasn't trained on images like that. Perhaps you could take a look when you have some free time. BTW, should we add DeepSeek-OCR support to llama-server as well?

@bluebread bluebread marked this pull request as draft November 30, 2025 17:26
@sfallah (Owner) commented Dec 1, 2025

@bluebread
great job!
I am just testing your branch.
With regard to --dsocr-mode, I think it would be simpler and more user-friendly to just use image sizes like 640, 1024, 1280 instead of modes.

@bluebread (Collaborator, PR author) commented

@sfallah Thanks! I'm not sure whether the model was trained for arbitrary image sizes. Another issue is that it would be less intuitive for users to switch from native resolutions to the gundam modes or auto mode selection, so I'll probably leave this design decision to the maintainers. BTW, today I found that img_tool likely uses a different bicubic interpolation algorithm from the GGML/PyTorch ones. This would explain the numerical instability and why our version doesn't work as well as the original model, e.g. it cannot recognize the title and authors on the cover of the DeepSeek-OCR paper (although it can extract the abstract). I'll work on fixing this later.

@sfallah (Owner) commented Dec 1, 2025

@bluebread

  • OK, maybe we can add the image sizes to the description of --dsocr-mode so the user knows what sizes the different modes can result in.

  • I was actually wondering why stb_image_resize2.h is not used for image pre-processing in mtmd, the way stb_image.h is used for loading the image bitmap.
    It may be worth trying to resize with stb_image_resize2.h.

@sfallah (Owner) commented Dec 1, 2025

@bluebread

> e.g. it cannot recognize the title and authors on the cover of DeepSeek-OCR paper (although it can extract the abstract). I'll work on fixing this later.

[Screenshot 2025-12-01 at 13:33:01: llama-mtmd-cli output for page 1 of the DeepSeek-OCR paper]

I don't see any problem with title and authors here?

```shell
./build/bin/llama-mtmd-cli \
  -m gguf_models/deepseek-ai/ds-ocr-lm.gguf \
  --mmproj gguf_models/deepseek-ai/mmproj-deepdeek-ocr.gguf \
  --image tmp/mtmd_test_data/Deepseek-OCR-2510.18234v1_page1.png \
  -p "<|grounding|>Convert the document to markdown." \
  --chat-template deepseek \
  --dsocr-mode base
```

@bluebread (Collaborator, PR author) commented

> I don't see any problem with title and authors here?

@sfallah You can try replacing the prompt with "Free OCR" and testing other images. The issue randomly appears and disappears, which is a little puzzling. Hopefully we can wrap up this feature within a few days and then shift our attention to llama-server, which is probably more important for most users.

@sfallah (Owner) commented Dec 2, 2025

@bluebread
Is this PR actually ready?
I ask because it is still marked as a draft.

@bluebread bluebread marked this pull request as ready for review December 2, 2025 07:35
@bluebread
Copy link
Collaborator Author

@sfallah Yes, this PR is ready now. We can let the maintainers review it first. I was thinking about how to replace the interpolation with the correct approach, but others might have better solutions, so we don't need to figure it all out before the review.

@sfallah sfallah merged commit 6b0e7cd into sfallah:sf/deepseek-ocr Dec 2, 2025
@sfallah (Owner) commented Dec 2, 2025

@bluebread

  1. I will finalise the PR tomorrow.
  2. I have merged with master.
  3. I have also uploaded F32 versions of the GGUF models to the HF hub (see PR). Other quantized versions will follow after testing.

We need to take care of the CI jobs and make sure that all CI actions are successful.
I suggest that we take care of llama-server in a new PR.

@sfallah (Owner) commented Dec 2, 2025

@bluebread
I have given you direct access to the repo.

@bluebread (Collaborator, PR author) commented

@sfallah Thanks! I agree that we should open another PR for llama-server.
