Find words with soft hyphens #1189

Open
Kristinita opened this issue Aug 16, 2019 · 3 comments

Comments

@Kristinita

1. Related issues

  1. Forum question
  2. Issue tracker

2. Summary

It would be nice if SumatraPDF could find words that contain soft hyphens in searchable documents.

3. Data

4. Argumentation

I often need to find something in PDF files. In SumatraPDF I may fail to find the word I need because it wraps to the next line with a soft hyphen, and I don't know where the soft hyphens are in the document. Hyphens are required in Russian for wrapping words; I see them in every scanned Russian paper book. This problem is important to me; therefore, I use other tools (see section 7 of this issue) instead of SumatraPDF.

5. Additional information

A soft hyphen (also known as an «optional hyphen») is a character that marks where a word may be broken at the end of a line.
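To make the request concrete, here is a minimal sketch of a search that ignores soft hyphens. It assumes the page text has already been extracted as a plain string in which a broken word appears as the soft hyphen character (Unicode U+00AD) followed by a newline. The names `strip_soft_hyphens` and `find_word` are made up for this illustration and are not SumatraPDF or MuPDF functions.

```python
import re

SOFT_HYPHEN = "\u00ad"  # Unicode U+00AD SOFT HYPHEN

# A soft hyphen, optionally followed by the line break it produced.
_SOFT_BREAK = re.compile(SOFT_HYPHEN + r"\n?")


def strip_soft_hyphens(text: str) -> str:
    """Remove soft hyphens (and the line break right after them)."""
    return _SOFT_BREAK.sub("", text)


def find_word(page_text: str, word: str) -> int:
    """Return the index of `word` in the normalized text, or -1.

    Note: the index refers to the normalized text, not the original;
    a real viewer would also need to map it back to page coordinates.
    """
    return strip_soft_hyphens(page_text).find(word)
```

With `page = "Это пере\u00ad\nнос слова"`, a plain `page.find("перенос")` returns -1, while `find_word(page, "перенос")` finds the word.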

6. Actual behavior

SumatraPDF doesn't recognize soft hyphens:

SumatraPDF

The "-" symbol has to be typed in the search query:

Sumatra with hyphens

7. Expected behavior

Free programs that already have the requested feature:

  • Foxit Reader:

Foxit Reader

  • PDF X-Change Editor:

PDF X-Change Editor

  • Okular

Thanks.

@kjk changed the title from "feature_request(find): words with soft hyphens" to "Find words with soft hyphens" on Jun 12, 2020
@user1823

user1823 commented Dec 1, 2023

This feature was implemented in MuPDF in 2020:
https://git.ghostscript.com/?p=mupdf.git;h=2185f16814074f024800a8bcc2dcf2f68ffcb07e

So, in my opinion, all that needs to be done is to enable this in SumatraPDF.

@GitHubRulesOK
Collaborator

GitHubRulesOK commented Dec 1, 2023

There is admittedly a difference, as shown below, between MuPDF 1.20 and the current SumatraPDF. You can see that MuPDF skips over the end-of-line hyphen break, but that is not the behavior you want when searching for other forms, such as hyphenated numbers. In the second view, Sumatra finds the hyphenated glyphs. They are mixed, but at the end of a line they are generally plain text, (-)Tj (not soft, though it varies, as there are mixed types). MuPDF fails to see them, since they are classed as non-existent parts of a word.
image
image
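The trade-off described above can be sketched roughly as follows. This is an illustration only, not how MuPDF or SumatraPDF actually handle it: it assumes plain extracted text, and the helper name `join_line_end_hyphens` is hypothetical. A plain "-" is dropped only when it sits at the end of a line between two letters, so hyphenated numbers and in-line compounds stay searchable with their hyphen.

```python
import re

# A plain "-" at the end of a line, with a letter on both sides.
# [^\W\d_] matches any Unicode letter (word character that is not a
# digit or an underscore), so Cyrillic text is covered as well.
_LINE_END_HYPHEN = re.compile(r"(?<=[^\W\d_])-\n(?=[^\W\d_])")


def join_line_end_hyphens(text: str) -> str:
    """Rejoin words split by a plain hyphen at a line break."""
    return _LINE_END_HYPHEN.sub("", text)
```

A heuristic like this still cannot tell a wrapped word from a real compound that happens to break at its hyphen ("well-" / "known" would become "wellknown"), which is one reason the two programs can reasonably disagree.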

@user1823

user1823 commented Dec 1, 2023

I don't understand Russian, but Google Translate shows the same translation for the word whether or not the hyphen is included.

It may be a problem, but I don't know.

I think that the best solution for this issue would be to add a setting controlling whether to ignore such hyphens or not so that the users can decide on their own what works best for them.

I don't think that doing this would require a significant amount of effort (I am not a dev though).
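As a rough idea of what such a setting could look like, here is a small sketch. The option name `ignore_line_break_hyphens` and the `search` function are invented for this example; they are not existing SumatraPDF settings or APIs, and the page text is assumed to be a plain string.

```python
import re
from dataclasses import dataclass

# A soft hyphen (U+00AD) or a plain "-" immediately before a line break.
_BREAK_HYPHEN = re.compile("[\u00ad-]\n")


@dataclass
class SearchOptions:
    # Hypothetical user setting: treat hyphens at line breaks as absent.
    ignore_line_break_hyphens: bool = True


def search(page_text: str, query: str, options: SearchOptions) -> bool:
    """Return True if `query` occurs in the page under the given options."""
    if options.ignore_line_break_hyphens:
        page_text = _BREAK_HYPHEN.sub("", page_text)
    return query in page_text
```

With the setting on, `search("soft-\nware", "software", SearchOptions())` matches; with `SearchOptions(ignore_line_break_hyphens=False)` the user has to type the hyphen and line break as they appear in the text.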
