U+2212 MINUS SIGN breaking PDF generation #4136
Comments
Same with Є U+0404 CYRILLIC CAPITAL LETTER UKRAINIAN IE:
|
U+2212 belongs to the Symbol, Math category of the Mathematical Operators block. A LaTeX user's naive position would therefore be that it is meant to be used in math mode. The standard LaTeX answer nowadays is thus to switch to a Unicode engine, xelatex or lualatex, and use the package unicode-math. Its default configuration will arrange for suitable math fonts which hopefully have the glyph. As for U+0404, we can of course find thousands of similar examples: the utf8 option of inputenc cannot cover the full Unicode range (though it did get expanded in recent years, so an up-to-date TeX install is always advisable). Here again, the probably better answer is to switch to a Unicode engine, xelatex or lualatex. You may also try option Related: #3444 |
Relevant tests from pandoc's default.latex template:
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
  \usepackage[utf8]{inputenc} % no inputenc with xetex
\else % if luatex or xelatex
  \ifxetex
    \usepackage{mathspec}
    \usepackage{xltxtra,xunicode}
  \else
    \usepackage{fontspec} % no fontspec
  \fi
\fi
%%% hyperref
\ifxetex
  \usepackage[setpagesize=false, % page size defined by xetex
              unicode=false, % unicode breaks when used with xetex
              xetex]{hyperref}
\else
  \usepackage[unicode=true]{hyperref}
\fi
%%% (not) Babel
\ifxetex
  \usepackage{polyglossia}
  \setmainlanguage{$mainlang$}
\else
  \usepackage[$lang$]{babel}
\fi
Hope this helps
And the tricky hyperref options... |
Well, check Sphinx's latex_engine config setting. The default will use polyglossia. One can recommend adding About using As per Sphinx provides minimal set-up via latex_engine, using |
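As a minimal sketch of the engine switch mentioned in this comment, a project's conf.py could select a Unicode engine through the latex_engine setting. The value shown is only one possible choice, and it remains an assumption that the fonts picked up by that engine contain the glyphs the document needs.

```python
# conf.py (sketch): ask Sphinx to build the PDF with a Unicode-aware engine.
# 'xelatex' is one possible value; 'lualatex' is another one Sphinx accepts.
# Assumption: the fonts used with that engine actually contain the glyphs.
latex_engine = 'xelatex'
```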
@JulienPalard There is no problem with \documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[ukrainian]{babel}% or russian
\begin{document}
Є
\end{document}
So if your document uses a suitable language it will be OK, also with traditional pdflatex. For an isolated letter, you may need more. Please provide a minimal example of a Sphinx project displaying the problem. You could modify |
This: \documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[ukrainian, english]{babel}
\usepackage{newunicodechar}
\newunicodechar{Є}{\foreignlanguage{ukrainian}{\IeC {\CYRIE }}}
\begin{document}
This is a Ukrainian letter: Є
\end{document}
shows (but see note below) how you can configure your Sphinx project's LaTeX preamble: you only need to modify the 'babel' latex-elements key and add to the preamble the two lines with It will fix your Important side note: there appears to be a LaTeX bug with
\usepackage[ukrainian, english]{babel}
\usepackage{newunicodechar}
\newunicodechar{Є}{{\fontencoding{T2A}\selectfont\IeC {\CYRIE }}} or to use
|
As proof of concept I have tried this in
then I compile my project containing There is no problem and the build succeeds. Warnings appear in the LaTeX log:
which says that the (TeX) Times font (from Side note: there is a strange problem that if I use
and not
then the LaTeX compilation is done with Russian as the main document language. This seems to be a bug of LaTeX-babel as it contradicts its documentation. EDIT: this is indeed a documented LaTeX-babel bug. In situations like this:
\documentclass[letterpaper,10pt,english]{report}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[ngerman, english]{babel}
\title{FOO Documentation}
\date{Oct 10, 2017}
\author{JFB}
\begin{document}
\maketitle
\tableofcontents
Foo bar.
\end{document}
the document language turns out to be German!! If one suppresses the global class option
|
Still for cyrillic letter , there is an alternative to using
This gives more efficient code (the Thus, we use |
Wow, thanks for your extensive feedback everyone!! I'm reviewing it all, I'll probably first try to use a mix between platex and xelatex (from python/cpython#3940 (comment)):
I'm strongly against all kinds of hardcoding like |
Hum, I can't access language in my config.py, sphinx/config.py line 150 does not give it |
The following minimalist configuration:
builds english and french versions of the Python documentation. So I think we should add some information near the |
I'm opening a specific issue about the documentation subject. |
I am adding here some background info on problem of Unicode characters with PDFTeX uses so-called TeX format font encodings which have at most 256 slots for glyphs. Some documentation is described in file When
It is possible to set-up per Unicode character a default fall-back encoding, but the problem is that the document font may not be available in that encoding. For example when using Sphinx default
There is, to the best of my knowledge, no official LaTeX command to set up a given Unicode character so that it forcefully changes not only the font encoding but also the font family. Changing the font is a rather costly macro expansion in LaTeX, and it would be very inefficient to do this at each character. The LaTeX philosophy is rather that the user adds mark-up locally to select fonts for some stretch of text. But it is possible to do it in the following way. One can use this home-made command
\newcommand*\SetUnicodeCharacterWithFont[3]{%
\begingroup
\def\IeC ##1%
{\unexpanded{{\fontencoding{#2}\fontfamily{#3}\selectfont\IeC{##1}}}}
\expandafter\xdef\csname u8\string:\detokenize{#1}\endcsname
{\csname u8\string:\detokenize{#1}\endcsname}%
\endgroup
} and then issue commands in latex preamble such as \SetUnicodeCharacterWithFont{Є}{T2A}{cmr}
\SetUnicodeCharacterWithFont{Ѕ}{T2A}{cmr}
\SetUnicodeCharacterWithFont{Б}{T2A}{cmr}
In this way the The burden on the LaTeX user who has isolated exceptional Unicode characters to worry about is then "only" to know which encoding to use (here
LaTeX documents using babel may declare multiple languages. When switching languages, the font encoding is typically modified automatically. This is of course more efficient than doing it again and again for each character. Besides, babel will use core TeX facilities for associating hyphenation patterns with languages. But currently Sphinx does not support multiple languages per document. And even if it did, it would not be automatic: the rst sources would need extra mark-up to indicate the local language changes.
All of the above applies to text fonts. Math fonts are yet another matter (which I will not go into). In fact, typically utf8-encoded characters declared by the
Thus indeed the situation with Unicode and |
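The preamble commands in this comment can also be produced from conf.py rather than typed by hand. The following is a rough sketch only: `unicode_font_lines` is a hypothetical helper, the default encoding and family are simply the ones used in the comment, and it assumes the home-made `\SetUnicodeCharacterWithFont` macro has already been defined earlier in the preamble.

```python
def unicode_font_lines(chars, encoding="T2A", family="cmr"):
    """Build one preamble line per character (hypothetical helper).

    Assumes the home-made SetUnicodeCharacterWithFont macro from the
    comment above is defined before these lines in the LaTeX preamble.
    """
    template = r"\SetUnicodeCharacterWithFont{%s}{%s}{%s}"
    return "\n".join(template % (ch, encoding, family) for ch in chars)


# The three Cyrillic letters used as examples in the comment.
preamble_lines = unicode_font_lines("ЄЅБ")
print(preamble_lines)
```

The resulting string could then be appended to the 'preamble' entry of latex_elements. The user still has to know which encoding and family suit each character, which is exactly the burden described above.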
People are here to write documentation, not latex. If we want people to write documentation (and we want?) forcing them to write latex macros each time they use a new out-of-ascii character is unacceptable. However, there may be a way here to autogenerate those macros while autogenerating the .tex files, but it would require a huge database of which character is from which font? Or use xelatex by default? |
I would like to have a more focused discussion on exactly what is at stake here. Your initial report mentioning Є or U+2212 did not explain how these pop up in your project. In 90% of cases, things work because the pdfLaTeX document has a specific language; the language loads the needed font encoding and inputenc prepares the corresponding Unicode codepoints for LaTeX. In human-made LaTeX documents, people load babel with enough languages to cover the needed glyphs. They decide on the corresponding font set-up. They then add mark-up to change languages locally. It is a fact that Sphinx currently has no support for multi-lingual documents. Is this what is at stake?
Then it appeared your project is the CPython docs, and unsurprisingly in so many thousands of lines there is bound to be, sooner or later, some Unicode char. We could use extra raw LaTeX markup in the rst sources to use the language switching mechanism I hinted at. Or we need to find a way to tell LaTeX that this stray Unicode character is to use some specific font suitable for that glyph in some suitable encoding. I provided macros which achieve this with minimal effort. It is not reasonable to build a database, and who is going to invest the time into this? Moreover, there is not going to be a canonical choice of (TeX) font.
Then, why make xelatex the default when this will break the look of all current Sphinx projects, because it forces usage of new fonts and hence modifies linebreaks and pagebreaks, only to fix the fact, well known in the LaTeX world, that pdfLaTeX isn't optimal regarding Unicode support? Every user of LaTeX who has read a minimal quantity of LaTeX documentation knows very well that for ten years+ the "Unicode" engines xelatex and lualatex have been there and are much more Unicode savvy. Anyone can then use conf.py to select xelatex and need not have that done by default. Besides, you may trust too much that xelatex will solve all Unicode problems.
This is simply wrong, because the document needs suitable fonts, and if the document has too many Unicode characters, you will have to change locally to another OpenType font and we are back to Problem 1 with pdflatex. And I do not even mention the problems of math mode.
Regarding platex usage by Sphinx, here too Unicode support is lacking (for example we had a problem simply with EN DASH...). There are Unicode aware Japanese engines, but currently Sphinx is not set up to use them, contributions welcome!
There are differences in the graphicx drivers for pdfLaTeX and xelatex, and if we switch to xelatex by default we may create problems with certain types of images in documents. For example there was a bug with Rotate not being obeyed by xelatex in included pdf images, which was fixed recently upstream but will take some time to reach the TeX Linux distros.
The rendering of math mode in xelatex has issues non-existent in pdfLaTeX.
The build time is increased compared to pdfLaTeX.
We at Sphinx can not fix LaTeX. |
The 'Є' appeared in https://docs.python.org/3.7/whatsnew/3.7.html#optimizations:
I don't think asking documentation people (in general) to write raw latex is a thing, they write restructuredText, they (OK not everybody.) don't care about PDF being generated, so they don't remotely care latex is used to generate the PDF files. But can this extra latex markup be added automatically while generating latex file?
Because it looks like it works in more cases than pdflatex, and I think the goal of the default values is to provide a seamless experience.
We're speaking of Sphinx-doc users, not LaTeX users.
This is right, my experience covers only a single project, that's also why I'm just opening the discussion: to gather feedback, and to "document it" (here, in the issues threads) to the next ones having the same idea. |
reopening issue for better visibility during discussion |
To all, please note that there's specific threads about:
And let's try to keep this thread focused on having a few unicode characters from a language into a documentation in another language. |
Then I propose to close this again, because "having a few unicode characters" is solved by the advice I gave and switching to xelatex does not solve it. |
I will give a try to building the PDFs for CPython. Please note that the
at https://github.com/python/cpython/blob/db60a5bfa5d5f7a6f1538cc1fe76f0fda57b524e/Doc/conf.py#L97 is wrong. It should be I will investigate building CPython pdf's when I can. Also the two lines starting at https://github.com/python/cpython/blob/db60a5bfa5d5f7a6f1538cc1fe76f0fda57b524e/Doc/conf.py#L106 about |
I successfully built the English docs of CPython using this (hence pdflatex):
diff --git a/Doc/conf.py b/Doc/conf.py
index aaee983984..8326b1e766 100644
--- a/Doc/conf.py
+++ b/Doc/conf.py
@@ -92,9 +92,7 @@ html_split_index = True
# Get LaTeX to handle Unicode correctly
latex_elements = {
- 'inputenc': r'\usepackage[utf8x]{inputenc}',
- 'utf8extra': '',
- 'fontenc': r'\usepackage[T1,T2A]{fontenc}',
+ 'fontenc': r'\usepackage[T2A,TS1,T1]{fontenc}',
}
# Additional stuff for the LaTeX preamble.
@@ -103,8 +101,11 @@ latex_elements['preamble'] = r'''
\sphinxstrong{Python Software Foundation}\\
Email: \sphinxemail{docs@python.org}
}
-\let\Verbatim=\OriginalVerbatim
-\let\endVerbatim=\endOriginalVerbatim
+\usepackage{newunicodechar}
+\newunicodechar{ſ}{{\fontencoding{TS1}\fontfamily{lmr}\selectfont s}}
+\newunicodechar{K}{\ensuremath{\mathrm K}}
+\newunicodechar{−}{\textminus}
+\newunicodechar{Є}{{\fontencoding{T2A}\fontfamily{cmr}\selectfont\IeC{\CYRIE}}}
'''
# The paper size ('letter' or 'a4'). Steps:
Using Sphinx 1.6.3 and Python 3.6.2 from an Anaconda install on Mac OS X 10.9.5. What does "building French docs" mean? Does this simply mean setting the language to In the above, the most problematic was the long s; it is provided by the TS1 encoding but not explicitly by |
Did it feel right for you to ask documentation writers to learn that |
Please, if you want me to help out, please stop attacking me and direct your remarks to the LaTeX team. https://www.latex-project.org/about/team/ As per my comment above, I must edit it because the long s is not found in the times font. Investigating. |
edit: if using the same build directory, issue make clean in build/latex, else Babel will complain about the English to French transition. This is a well-known problem with Babel: auxiliary files must be removed when modifying the document language after compilation with another language. |
At #4136 (comment) I have provided a recipe to avoid having to know about the internal LaTeX representation |
First, sorry if it felt like aggression. You're helping and I'm grateful for this, even if I don't show it in my English, so let's be clear: thanks for the time you're spending on the issue, trying to build the CPython doc and look at our problems. We have almost no one LaTeX oriented here for this, so any help is really welcome. I tried to express surprise that you were proposing a syntax almost no documentation writer can read, let alone write, and a solution that will fail each time someone adds a new Unicode character. I clearly don't want to have a documentation build fail each time someone adds a new character. On the other hand, I heard you about xelatex "not fixing every problem". I did not expect it to, to be fair, but it looks like it works in more cases than pdflatex, as it successfully builds English and French for CPython (python/docsbuild-scripts#34). It still looks way simpler to use xelatex than to add To build the French and Japanese documentation you'll need more than language=fr: you'll need gettext_compact=0 and to give a locale_dirs. We provide a makefile in the French translation https://github.com/python/python-docs-fr to do it automatically. There's also a script to build English, Japanese, and French at https://github.com/python/docsbuild-scripts I typically run it just with --skip-cache-invalidation by creating directories myself and giving myself the rights to write to them, but you can also use the options to change the directories used by the script. Either way you'll have to clone python manually; if I remember well, the script does not clone cpython itself, though it does clone the translations. |
You are asking us at Sphinx to consider making The patch above using The |
No, I'm proposing the idea to you at Sphinx to consider making xelatex the default because my experience building the Python docs showed me that it was seamless to build with xelatex, but very hard to build with pdflatex, and I thought "default should be the easy way, not the hard way". Defaults should lead users to an "easy path", something working in most cases, and experts wanting to write specific LaTeX macros could optionally switch to non-default values. xelatex may not work in all cases, but it looked like xelatex is working in more cases than pdflatex, which was enough for me to think that it would make a better default. I still think xelatex is a better default, but that's only what I think. It would require more feedback to make a sound choice. On the Python side, we're succeeding in building CPython with external logic that switches between platex and xelatex depending on the language (python/docsbuild-scripts#34). |
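The external switching logic mentioned in this comment could look roughly like the sketch below. The function name and the exact language-to-engine mapping are assumptions made for illustration; the actual logic lives in python/docsbuild-scripts#34 and is not reproduced here.

```python
def pick_latex_engine(language):
    """Return a LaTeX engine name for a Sphinx language code (hypothetical helper).

    Assumption for this sketch: Japanese keeps platex, every other
    language is built with xelatex.
    """
    return 'platex' if language == 'ja' else 'xelatex'


# In conf.py, the engine would then follow the configured language.
language = 'fr'
latex_engine = pick_latex_engine(language)
```

Keeping the choice in a single function means the default can be changed for one language without touching the others, which is the trade-off being argued over in this thread.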
Thanks for the explanation relative to the internationalization. I have git cloned the French translated repo, manually created I got two innocuous warnings, then I cd to This one looks like a typo in French translation:
this is on line 101211 of library.tex. There is |
There is a U+200b in line
which is line 2096 of library.tex, right before |
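Stray characters such as this U+200B are easier to track down in the sources than in a 100,000-line .tex file. The sketch below lists every non-ASCII character in a piece of text together with its Unicode name; `find_non_ascii` and its `ignore` parameter are hypothetical names, and for a French translation one would presumably pass the ordinary accented letters in `ignore`.

```python
import unicodedata


def find_non_ascii(text, ignore=""):
    """Yield (line, column, codepoint, name) for each non-ASCII character.

    Characters listed in `ignore` are skipped, e.g. accented letters
    that are already known to be handled by the LaTeX set-up.
    """
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127 and ch not in ignore:
                name = unicodedata.name(ch, "UNNAMED")
                yield line_no, col, "U+%04X" % ord(ch), name


# Small sample: a zero-width space on line 1, a minus sign on line 2.
sample = "a\u200bb\nx \u2212 1"
hits = list(find_non_ascii(sample))
for hit in hits:
    print(hit)
```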
I confirm I can build all French CPython Docs with pdflatex using the above. However there are bugs in the French translation strings (at least one typo and a possible stray zero-width space) which require extra steps. I did:
diff --git a/Doc/conf.py b/Doc/conf.py
index f7073d116a..433f7f259f 100644
--- a/Doc/conf.py
+++ b/Doc/conf.py
@@ -15,6 +15,7 @@ sys.path.append(os.path.abspath('tools/extensions'))
extensions = ['sphinx.ext.coverage', 'sphinx.ext.doctest',
'pyspecific', 'c_annotations']
+
# General substitutions.
project = 'Python'
copyright = '2001-%s, Python Software Foundation' % time.strftime('%Y')
@@ -91,9 +92,7 @@ html_split_index = True
# Get LaTeX to handle Unicode correctly
latex_elements = {
- 'inputenc': r'\usepackage[utf8x]{inputenc}',
- 'utf8extra': '',
- 'fontenc': r'\usepackage[T1,T2A]{fontenc}',
+ 'fontenc': r'\usepackage[T2A,TS1,T1]{fontenc}',
}
# Additional stuff for the LaTeX preamble.
@@ -102,8 +101,13 @@ latex_elements['preamble'] = r'''
\sphinxstrong{Python Software Foundation}\\
Email: \sphinxemail{docs@python.org}
}
-\let\Verbatim=\OriginalVerbatim
-\let\endVerbatim=\endOriginalVerbatim
+\usepackage{newunicodechar}
+\newunicodechar{ſ}{{\fontencoding{TS1}\fontfamily{lmr}\selectfont s}}
+\newunicodechar{K}{\ensuremath{\mathrm K}}
+\newunicodechar{−}{\textminus}
+\newunicodechar{Є}{{\fontencoding{T2A}\fontfamily{cmr}\selectfont\IeC{\CYRIE}}}
+\usepackage{uspace}% handles Unicode spaces inclusive of zero width U+200B
+\newunicodechar{Ǹ}{N}
'''
# The paper size ('letter' or 'a4'). The package You can also simply do
to get rid of it (as it looked like a typo). As per the Sorry for the delays, but building all PDFs takes time each time. Notice that, ironically, bugs are revealed by not using a Unicode savvy engine, as input typos are then detected. I find it useful though to have spent time on this big project (library.pdf in French going to 1742 pages), as this can be used in future as a test in case of Sphinx LaTeX changes. |
I fixed the |
Yes, I take note. Also having this in Would you like to PR this on https://github.com/python/cpython (issue already opened: https://bugs.python.org/issue31589)? |
I would be glad to PR, but unfortunately putting this in But it is possible to enclose it in some LaTeX conditional branch and avoid executing it if engine is
Regarding Japanese docs, I understand you build with no LaTeX errors. But do the PDFs look fine regarding the problematic characters? (i.e. particularly the long s |
@JulienPalard I have done the PR at python/cpython#4069 |
I have found a way for Japanese. I could not solve the problem with In conf.py:
Unfortunately the extra class option Then we also need this in conf.py:
and finally we need this
edit: I think the With this set-up I can compile successfully for English, French, and Japanese: sadly however in Japanese only when
@cocoatomo do you have any comment? Do you build the Japanese CPython documentation with the standard Sphinx set-up using I couldn't make pLaTeX work with the characters above, except the U+2212 − which at least seems to give reasonable output with no extra set-up. I had some partial success but then unexpected results arose from using these characters in code-blocks or literals. Of course there might be a much better way, but as I can't read Japanese it is very hard for me to understand how LaTeX classes and engines work there. |
nosy @methane ^ |
There is a I can force Sphinx to use it by this kind of patching in
With patch like the above I can then build the whole CPython documentation for Japanese (without having to replace The @tk0miya is there easier way to get Sphinx to use Also it would be convenient to have a |
I happened to see this issue when I was searching for tickets on another topic. I just wanted to mention that building a PDF using rinohtype might make things easier when dealing with languages other than English. As long as your fonts contain the needed glyphs, things should just work... provided the script doesn't require advanced typesetting features that are currently not supported by rinohtype. You should also know that rinohtype doesn't support typesetting of maths as of yet. I would be interested to learn how rinohtype handles your document. I welcome any feedback in the rinohtype issue tracker. |
Now Sphinx uses |
@tk0miya: +1 for closing as far as I can see. |
Subject: When using U+2212 to denote a negative integer, sphinx can no longer build PDF.
Problem
Unicode character U+2212 in rst files breaks PDF generation.
Procedure to reproduce the problem
Error logs / results
Expected results
I'd just expect my minus sign to be correctly rendered in a PDF file.
Environment info
Looks like there are lots of problems with some unicode characters: https://tex.stackexchange.com/search?q=inputenc+2212
Maybe we should just add another DeclareUnicodeCharacter in utf8extra in latex.py?
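Until something like this exists in Sphinx itself, a project could try the same declaration from its own conf.py. This is a sketch under assumptions: it uses the 'preamble' key of latex_elements rather than patching 'utf8extra' in latex.py, it maps U+2212 to `\textminus` as the conf.py patch posted in the comments does, and it guards the declaration so that it is skipped when `\DeclareUnicodeCharacter` is not defined.

```python
# conf.py (sketch): declare U+2212 for the pdflatex/inputenc route.
# Assumption: \textminus is available in the document's font set-up.
latex_elements = {
    'preamble': r'''
\ifdefined\DeclareUnicodeCharacter
  \DeclareUnicodeCharacter{2212}{\textminus}
\fi
''',
}
```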