Support ebooks and pdf export #88

Open
mdinger opened this issue Dec 30, 2015 · 68 comments
Labels
A-Rendering Area: Rendering C-enhancement Category: Enhancement or feature request C-new-format Category: A new rendering format S-On-hold Status: On hold S-Wishlist Status: Wishlist

Comments

@mdinger
Copy link
Contributor

mdinger commented Dec 30, 2015

Gitbook supports export to ebooks and pdfs via calibre. This might be easy to hook into.

See also rust-lang/rust-by-example#684 for problems this implementation creates for rustbyexample.

@azerupi
Copy link
Contributor

azerupi commented Dec 30, 2015

I would like to support pdf and ebook format. I think this could already be developed out of tree if you use the Renderer trait from mdBook.

I am not sure I want to depend on a full-blown GUI tool though. There must surely be a better alternative for that.

@mdinger
Copy link
Contributor Author

mdinger commented Dec 30, 2015

Not familiar with many conversion tools like this. Pandoc also seems like a plausible option. Don't know of any others.

@azerupi
Copy link
Contributor

azerupi commented Dec 30, 2015

Yeah pandoc seems a lot better!

@asolove
Copy link
Contributor

asolove commented Jan 11, 2016

Did some exploration on this and seems doable. Here's the default epub version of the Rust book. Note the chapters out of order and links not working.

To get good output, I think we would need to:

  • parse the ToC to get the list of md files, in the right order
  • concat and transform the markdown files, replacing file links with internal links
  • match the themes with epub versions of the styles

I'm interested in working on this but will be a bit slow.

Useful info here: Pandoc commands and styling options
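
The first step in the list above (getting the md files out of the ToC, in order) could look roughly like this in Rust. This is only a sketch: `chapter_files` is a made-up helper, not part of mdBook's API, and it assumes every chapter entry in SUMMARY.md is a plain `[Title](path.md)` link on its own line.

```rust
/// Collect the Markdown file paths referenced in a SUMMARY.md-style table of
/// contents, in the order they appear.
/// Hypothetical helper for illustration; not part of mdBook's API.
fn chapter_files(summary: &str) -> Vec<String> {
    let mut files = Vec::new();
    for line in summary.lines() {
        // Look for the `](target)` part of a Markdown link on this line.
        let Some(start) = line.find("](") else { continue };
        let rest = &line[start + 2..];
        let Some(end) = rest.find(')') else { continue };
        let target = rest[..end].trim();
        // Keep only links to Markdown sources; empty targets (draft
        // chapters) and external URLs are skipped.
        if target.ends_with(".md") {
            files.push(target.to_string());
        }
    }
    files
}
```

Feeding it the contents of src/SUMMARY.md gives the chapter paths in reading order, which is the list the later concat step would walk.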

@killercup
Copy link
Member

  • parse the ToC to get the list of md files, in the right order
  • concat and transform the markdown files, replacing file links with internal links

@asolove, I have implemented this (among other transformations) in https://github.com/killercup/trpl-ebook, feel free to use my code.

@asolove
Copy link
Contributor

asolove commented Jan 11, 2016

@killercup great, thanks!

@azerupi
Copy link
Contributor

azerupi commented Jan 11, 2016

Great! Thanks for doing this :)

parse the ToC to get the list of md files, in the right order

This is already done in the Rust code, the MDBook struct can be iterated on. If you make a new Renderer you have access to that.

concat and transform the markdown files, replacing file links with internal links

Concatenating the markdown files is also not that hard, I do it for the print page.

Replacing the links could be a little trickier, what should internal links look like for pandoc? I know that pulldown-cmark lets you transform the parsed markdown events before rendering, but it's not well documented. Maybe link replacing is within its capabilities.

Static files, like images, will probably also need some special treatment to be included correctly?


I'm interested in working on this but will be a bit slow.

That is absolutely no problem, there is no rush. I will assign this issue to you so that others can see you are working on it. (can't assign you). If you need any help, feel free to ask here :)

I am also planning on doing a big refactor (#90) to clean up and create a better API. For example, I am thinking about adding a way to poll the MDBook struct for specific chapters, etc. This would make it a lot more flexible for Renderers and if I end up doing something like #93. If you have suggestions or requests that might be relevant, post them in #90 so that I / we can brainstorm and come up with a good design :)

@killercup
Copy link
Member

Replacing the links could be a little trickier, what should internal links look like for pandoc?

FYI, I'm doing some regex work to transform links relative to the doc.rust-lang.org domain and make reference link names unique for the combined markdown file.

@azerupi
Copy link
Contributor

azerupi commented Jan 11, 2016

FYI, I'm doing some regex work to transform links relative to the doc.rust-lang.org domain

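use regex::Regex;

// Inline link to another chapter: ](file.html) -> ](#sec--file)
let cross_section_link = Regex::new(r"]\((?P<file>[\w-]+)\.html\)").unwrap();
output = cross_section_link.replace_all(&output, r"](#sec--$file)").into_owned();

// Reference definition for another chapter: [id]: file.html -> [id]: #sec--file
let cross_section_ref = Regex::new(r"(?m)^\[(?P<id>.+)\]:\s(?P<file>[^:^/]+)\.html$").unwrap();
output = cross_section_ref.replace_all(&output, r"[$id]: #sec--$file").into_owned();

// Inline link to a subsection: ](file.html#sub) -> ](#sub)
let cross_subsection_link = Regex::new(r"]\((?P<file>[\w-]+)\.html#(?P<subsection>[\w-]+)\)").unwrap();
output = cross_subsection_link.replace_all(&output, r"](#$subsection)").into_owned();

// Reference definition for a subsection: [id]: file.html#sub -> [id]: #sub
let cross_subsection_ref = Regex::new(r"(?m)^\[(?P<id>.+)\]:\s(?P<file>[^:^/]+)\.html#(?P<subsection>[\w-]+)$").unwrap();
output = cross_subsection_ref.replace_all(&output, r"[$id]: #$subsection").into_owned();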

Thanks! Does pandoc auto-generate the anchors from the markdown files in those formats? like #sec--$file? Or is that also handled by your code?

@killercup
Copy link
Member

@azerupi I'm pretty sure pandoc generates those. I've had problems before because pandoc generates slugs in a different way than rustdoc.

It should be possible to add a specific id to each header, though. The syntax is # Header Name {#header-name} IIRC.
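
To make the explicit-id idea concrete, here is a rough sketch that appends `{#id}` to every header that doesn't have one yet. Big caveat: the slug rule in `slugify` is something I made up for illustration, it is neither pandoc's nor rustdoc's algorithm, and both function names are hypothetical. It also only handles `#`-style headers.

```rust
/// Illustrative slug rule (an assumption, not pandoc's or rustdoc's):
/// ASCII letters and digits are kept in lowercase, and every run of other
/// characters collapses to a single hyphen.
fn slugify(title: &str) -> String {
    let mut slug = String::new();
    for c in title.chars() {
        if c.is_ascii_alphanumeric() {
            slug.push(c.to_ascii_lowercase());
        } else if !slug.is_empty() && !slug.ends_with('-') {
            slug.push('-');
        }
    }
    slug.trim_end_matches('-').to_string()
}

/// Append an explicit `{#id}` attribute to every `#`-style header that does
/// not already end with one, leaving fenced code blocks untouched.
fn add_header_ids(markdown: &str) -> String {
    // Built at runtime so this listing contains no literal fence marker.
    let fence = "`".repeat(3);
    let mut in_fence = false;
    let mut out = Vec::new();
    for line in markdown.lines() {
        if line.trim_start().starts_with(fence.as_str()) {
            in_fence = !in_fence;
        }
        let is_header = !in_fence && line.starts_with('#');
        if is_header && !line.trim_end().ends_with('}') {
            let title = line.trim_start_matches('#').trim();
            out.push(format!("{} {{#{}}}", line.trim_end(), slugify(title)));
        } else {
            out.push(line.to_string());
        }
    }
    out.join("\n")
}
```

With ids pinned like this, the link rewriting no longer depends on which slug algorithm the converter happens to use.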

You might also want to look at adjust_header_level.rs and adjust_reference_names.rs.

@azerupi
Copy link
Contributor

azerupi commented Jan 11, 2016

Ok thanks for all the information, this will probably help @asolove a lot! :)

@cetra3
Copy link
Contributor

cetra3 commented Jan 12, 2016

Not sure if this will help you guys, but I've created a simple rust tool which will collate multiple markdown files into one, resolving internal links and turning them into anchor links

We can use this in a pipeline on the way to converting to PDF:

mdcollate book-example/src/SUMMARY.md | pulldown-cmark > test.html && wkhtmltopdf test.html test.pdf

Code can be found here:
https://github.com/cetra3/mdcollate

Happy to accept any PRs

@azerupi
Copy link
Contributor

azerupi commented Jan 12, 2016

@cetra3 That is really cool!
The plan is to make a "renderer" that does everything so that it can be used with the mdbook build command. So using a command line tool adds some complications. Have you thought about exposing the functionality as a crate?

I am not sure I would add a dependency just for that functionality, because there is always the possibility that it will not be maintained actively. But it could be considered if it offers enough useful methods that we wouldn't have to reinvent here.

@mkpankov
Copy link

I'm also sceptical about Calibre. We use it for the Russian translation of TRPL and we've run into several problems with EPUB (links are to descriptions in Russian, for reference):

@azerupi
Copy link
Contributor

azerupi commented Jan 12, 2016

Thanks for sharing your experience :)
We will see if pandoc has the same problems, but I think @killercup used it without too much / any problems?

I also vaguely remember we had to hack styles in order to get better PDF. Not sure if it's necessary or not with Pandoc

I am not sure how this is handled with Pandoc, but having a custom theme could be a good thing.

@cetra3
Copy link
Contributor

cetra3 commented Jan 13, 2016

It's probably possible to wrap up those command line tools into a combined tool or expose it as a rust library. The last component (html to pdf) would need to use FFI as wkhtmltopdf is written in C. Not sure whether this adds too much dependency on externalities though.

The complication arises in that markdown is a superset of HTML which means that you need something that can present HTML in a printable fashion. In my experience with this problem, Pandoc and Calibre will do a subset, but you won't get full parity.

@killercup
Copy link
Member

There are a few things to be aware of, but in general pandoc is really amazing at converting Markdown to LaTeX. Which is what you want, I think—it has some very nice features that you currently can't get with HTML-to-PDF converters. For example, my PDF versions of the Rust Book include cross-references like "This is a mutable variable binding (section 5, page 163)".

If you're no LaTeX wizard (I'm not), you might want to look at this template I threw together.

If you have any issues with this, just mention me.

@azerupi
Copy link
Contributor

azerupi commented Jan 13, 2016

Thanks for all your help Pascal!
I will definitely look at what you have currently running and I am pretty sure we will end up stealing a lot of your code (if that is ok with you) 😉

@gambhiro
Copy link
Contributor

gambhiro commented Aug 8, 2016

+1 for the effort, I am looking forward to using mdbook to produce ebooks.

It seems to have stalled a bit, is anyone currently working on this?

@azerupi
Copy link
Contributor

azerupi commented Aug 8, 2016

It seems to have stalled a bit, is anyone currently working on this?

Indeed, it has stalled a bit. In the last 6 months I have been overwhelmed with work at school 😕

I am (very) slowly working on the refactoring / clean-up that I wanted to do. And that work is probably going to change the way this specific feature is going to be implemented. Hopefully I will have some time in September to make significant progress on the internal rewrite so that I can work on new features again.

@gambhiro
Copy link
Contributor

gambhiro commented Aug 9, 2016

@azerupi How much space is there for discussing this feature? There are some specific things I would be looking for in a CLI ebook helper, but maybe you are already determined in which way to go.

Some time ago I wrote prophecy, a Ruby gem to automate the tasks I needed when producing ebooks. This is an example of the output. It has been very useful for me, but I believe I am the only user :)

I have been wanting to rewrite it with some of the hindsight since its early days, but when I saw this I thought maybe mdbook would be able to produce the same results.

There is an asciinema recording that shows the sort of things it does.

@azerupi
Copy link
Contributor

azerupi commented Aug 9, 2016

I'm open to all ideas :)

@d8aninja
Copy link

d8aninja commented Feb 26, 2019

One of the things that doesn't seem to be mentioned anywhere on this ticket is the ability to highlight the important bits. I have used a Chrome extension called Hypothesis to do this until recently, but a) Chrome extension, ew, b) it's pretty sloppy about whether the highlights are saved under your personal view or public view (i.e., in some cases you can see others' highlights), and c) I say recently because I'm pretty sure when the book gets updated, all my highlights and attached notes get deleted, too.

Anyway, just wanted to add my two cents and support for a PDF version to be released in parallel to the online book's update. I realize that must be a lot harder than many of us make it out to be, and you all are doing a wonderful job regardless of the format in which we are all consuming your work. Thanks!

@XVilka
Copy link

XVilka commented Jun 6, 2019

Seems that mdproof can be used for such a task

@mkurnikov
Copy link

mkurnikov commented Oct 12, 2019

Trying to print the print.html page from the browser gives me an out-of-memory error, but this command works:

google-chrome-stable --headless --print-to-pdf=rust_book.pdf https://doc.rust-lang.org/beta/book/print.html

@Binlogo
Copy link

Binlogo commented Dec 12, 2020

Nice feature request.
Printing the PDF from the browser is a workaround, and the format already looks nice. 👍
Of course, if mdbook supported an export feature, it would be convenient for Continuous Integration.

@heyakyra
Copy link

if mdbook supported an export feature, it would be convenient for Continuous Integration.

Yes, I think this can be an important bit. Is this work useful? void-linux/void-docs#416

@Huy-Ngo
Copy link

Huy-Ngo commented Apr 21, 2021

I can't install mdbook-latex because it depends on harfbuzz_rs, which currently can't be compiled because of a minor bug (harfbuzz/harfbuzz_rs#30). It's fixed but not released.

I think the maintainers would rather encourage users to rely on external plugins like mdbook-latex or mdbook-epub than add an internal implementation.

@ildar
Copy link

ildar commented May 7, 2021 via email

@Huy-Ngo
Copy link

Huy-Ngo commented May 7, 2021

As an example of how I'd like an ebook to look.
Rust by example book: https://flibusta.is/b/619885/

I see an almost blank page in Russian (and the text seems to say "The page is not found"). Which part of it do you mean you want an ebook to look like?

@ildar
Copy link

ildar commented May 10, 2021 via email

@XVilka
Copy link

XVilka commented Sep 22, 2021

Since there is zero interest in supporting that in mdBook, I recommend Quarto, a relatively new framework for creating books that is more flexible than the commonly known Bookdown. It's pandoc-based, and thus can export to basically anything. You can see their gallery for samples of how the different formats and exports look. It's quite actively developed as well.

@aplatypus
Copy link

@dustinmatlock ... I am afraid that I disagree:

Probably the easiest way: print at the top right and save as PDF. Of course, this doesn't solve the EPUB issue, but programming books are sometimes best viewed in PDF. The book's website looks great on mobile.

Print Rust PDF

A PDF file is usually complete in that a link from the Table of Contents, when clicked, would go to page 202, say. With a PDF file generated from a browser, clicking a hyperlink in the resulting PDF provides this informative explanation ...

image

Firefox can’t establish a connection to the server at 127.0.0.1:3000.

    The site could be temporarily unavailable or too busy. Try again in a few moments. ...

An export function would at least support hyperlinks for a Table of Contents, an Index and Footnotes. Sometimes it is useful for an exported file to link externally. In such cases I believe the PDF format specifies "internal" and "external" links.

Yes, a saved PDF is a usable solution for an off-line document, but it is not a solution that I can save to a tablet and use when I'm out of touch with the internet, or late at night, etc.

Have a good one ...!

@HollowMan6
Copy link

Hi all! I just created an mdBook backend named mdbook-pdf that generates PDFs using headless Chrome and the Chrome DevTools Protocol method Page.printToPDF. It depends on Google Chrome / Microsoft Edge / Chromium. The generated pages look much like the ones you get by manually printing print.html to PDF in your browser, or by the method mentioned here: #88 (comment). On top of that it adds automation and customization of the paper orientation, the scale of the webpage rendering, paper width and height, page margins, the page ranges to generate, whether to display a header and footer and how they are formatted, and more. It supports every platform where Google Chrome / Microsoft Edge / Chromium works. You can check samples of the generated PDF files in the Artifacts here.

As for the issue aplatypus mentioned above with this method (#88 (comment)): I guess the "internal" links inside the book should be handled on the mdbook side when print.html is generated (referring here), so that every link that points "internally" jumps to a location inside print.html. All the content is already on print.html, so there shouldn't be any hyperlinks that jump to other HTML files in the book. Resolved this way, links in the generated PDF would also jump internally instead of opening a browser that can't connect to anything.
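
A rough sketch of what that link handling could look like, working on the HTML as text. This is not how mdBook generates print.html: `localize_hrefs` and `local_target` are hypothetical names, and the `#page` anchor used for links to a whole chapter is an assumption that only works if print.html actually carries such ids.

```rust
/// Rewrite `href="page.html#frag"` to `href="#frag"` and `href="page.html"`
/// to `href="#page"`, leaving absolute URLs and existing fragments alone.
fn localize_hrefs(html: &str) -> String {
    const MARK: &str = "href=\"";
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(pos) = rest.find(MARK) {
        // Copy everything up to and including the opening quote.
        let value_start = pos + MARK.len();
        out.push_str(&rest[..value_start]);
        rest = &rest[value_start..];
        // Without a closing quote the remainder is copied unchanged below.
        let Some(end) = rest.find('"') else { break };
        out.push_str(&local_target(&rest[..end]));
        rest = &rest[end..];
    }
    out.push_str(rest);
    out
}

/// Map one link target to an in-page anchor where possible.
fn local_target(target: &str) -> String {
    // Absolute URLs (with a scheme) and plain fragments stay as they are.
    if target.contains("://") || target.starts_with('#') {
        return target.to_string();
    }
    match target.split_once(".html") {
        // page.html#frag -> #frag
        Some((_, frag)) if frag.starts_with('#') => frag.to_string(),
        // page.html -> #page (assumed anchor naming, see the note above)
        Some((page, "")) => format!("#{}", page.replace('/', "-")),
        _ => target.to_string(),
    }
}
```

A real implementation would more likely do this while rendering, where the chapter structure is known, rather than by scanning the finished HTML.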

@jacobmellin
Copy link

jacobmellin commented Jun 10, 2022

Hi, I wrote a quick bash script to generate a PDF from mdBook markdown using pandoc and the Eisvogel Pandoc/LaTeX template. Maybe it is of help to someone:

#!/bin/bash
# This script converts mdBook markdown output into a pdf using Pandoc/LaTeX and
# the eisvogel pandoc template (https://github.com/Wandmalfarbe/pandoc-latex-template).
# By default, it assumes that the script is put in to a direct subfolder of your
# mdBook project, next to the eisvogel.latex file and your mdBook project root
# contains the book.toml, your markdown sources at ./src and the preprocessed markdown
# will be created in ./book/markdown. Your book.toml file needs to contain the line 
# [output.markdown]
# The path of the resulting pdf file will be ./book/pdf/output.pdf

# Directory that this script is in (e. g. subfolder of PROJECT_DIR)
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

# Project directory (contains book.toml, src folder and book output folder)
# Change this if your script is not inside a subfolder (e.g. 'scripts') of the project directory.
PROJECT_DIR="$( dirname "$SCRIPT_DIR" )"

# Pandoc LaTeX template
# This script works with the eisvogel-Template
# (https://github.com/Wandmalfarbe/pandoc-latex-template).
# If you want to use this template, please put the file
# eisvogel.latex in the same directory as this script
# (e.g. $PROJECT_DIR/scripts/eisvogel.latex).
TPL="$SCRIPT_DIR/eisvogel.latex"

# Build markdown
# Ensure that your book.toml contains the line
# [output.markdown]
mdbook build "$PROJECT_DIR"

# Make output and temp folders
mkdir -p "$PROJECT_DIR/book/pdf"
mkdir -p "$PROJECT_DIR/book/markdown-temp/images"

# Copy all images to a single directory
find "$PROJECT_DIR/src" -name '*.png' -exec cp {} "$PROJECT_DIR/book/markdown-temp/images" \;

# Define output file path
OUTPUT_FILE="$PROJECT_DIR/book/markdown-temp/output.md"

# Read meta information from book.toml
CONFIG_FILE_CONTENTS=$( < "$PROJECT_DIR/book.toml" )

[[ $CONFIG_FILE_CONTENTS =~ title\ +=\ +\"(.*)\" ]] \
    && DOCUMENT_TITLE=${BASH_REMATCH[1]}

[[ $CONFIG_FILE_CONTENTS =~ language\ +=\ +\"([a-z]*)\" ]] \
    && DOCUMENT_LANGUAGE=${BASH_REMATCH[1]}

[[ $CONFIG_FILE_CONTENTS =~ authors\ +=\ +(\[[^\]]+])\ * ]] \
    && DOCUMENT_AUTHORS=${BASH_REMATCH[1]}

# Write the document title and configuration to output file
cat > "$OUTPUT_FILE" << EOF
---
title: ${DOCUMENT_TITLE}
author: ${DOCUMENT_AUTHORS}
date: "11.06.2022"
titlepage: true
fontsize: 10pt
logo: ""
logo-width: 110mm
toc: true
toc-own-page: true
keywords: [Markdown, Example]
...
EOF

# echo -e "# $DOCUMENT_TITLE\n" >> $OUTPUT_FILE

# Read SUMMARY.md, combine output titles and individual .md file contents
# into single output markdown file
while IFS= read -r line
do
    [[ $line =~ Summary ]] && continue
    # Write SUMMARY.md section titles to markdown file
    # [[ $line =~ ^\# ]] && echo -e "$line\n" >> $OUTPUT_FILE
    
    # Combine different markdown files, increasing the section level
    # for each headline
    # [[ $line =~ \((.*\.md)\) ]] \
        # && sed -e 's/^#/##/g' \
            # "$PROJECT_DIR/book/markdown/${BASH_REMATCH[1]}" \
                # >> $OUTPUT_FILE

    # Combine markdown files, leaving the section headings as they are
    [[ $line =~ \((.*\.md)\) ]] \
        && cat "$PROJECT_DIR/book/markdown/${BASH_REMATCH[1]}" \
            >> "$OUTPUT_FILE"
    # Separate chapters with blank lines
    echo -e "\n" >> "$OUTPUT_FILE"
done < "$PROJECT_DIR/src/SUMMARY.md"

# Do pandoc conversion of markdown
cd "$PROJECT_DIR/book/markdown-temp" || exit 1
pandoc -w latex --template "$TPL" -o ../pdf/output.pdf output.md --number-sections -V lang="$DOCUMENT_LANGUAGE"

@hoijui
Copy link

hoijui commented Jun 11, 2022

The main issue when creating a PDF from mdbook sources is that the Markdown sources are a tree, potentially/likely randomly interlinked (just like HTML, which is why the conversion to HTML is trivial), while a PDF is a single, linear document.
The main thing to be done there is to get from a tree to a single, linear document. This can be seen in @jacobmellin's script, for example.
I solved this in MoVeDo (a set of scripts abstracting over multiple tools that take MD sources and produce HTML and/or PDF), also with a Bash script that considers a few more of the issues (though probably not all of them either). The script doing this is called linearize. It uses Pandoc filters to shift header levels, removes the individual files' front matter, extracts the titles from there and adds them as headers, prefixes header IDs with their sanitized source-file path, and rewrites internal links (links between the MD source files) so they still work within the resulting single MD file, plus maybe one or two additional small things. It uses a file called doc.yml as the front matter for the resulting doc.md.
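
To illustrate just the header-shifting step: the sketch below works on the Markdown text line by line, which is a different (and cruder) technique than the Pandoc filters that linearize uses. `shift_headers` is a hypothetical name, and only `#`-style headers are handled.

```rust
/// Demote every `#`-style header by `by` levels (capped at level 6),
/// skipping lines inside fenced code blocks so that `#` comments in code
/// samples stay intact.
fn shift_headers(markdown: &str, by: usize) -> String {
    // Built at runtime so this listing contains no literal fence marker.
    let fence = "`".repeat(3);
    let mut in_fence = false;
    let mut out = Vec::new();
    for line in markdown.lines() {
        if line.trim_start().starts_with(fence.as_str()) {
            in_fence = !in_fence;
        }
        // Count leading '#' characters; a header needs a space after them.
        let level = line.chars().take_while(|&c| c == '#').count();
        let is_header = !in_fence && level > 0 && line[level..].starts_with(' ');
        if is_header {
            let new_level = (level + by).min(6);
            out.push(format!("{}{}", "#".repeat(new_level), &line[level..]));
        } else {
            out.push(line.to_string());
        }
    }
    out.join("\n")
}
```

Compared with a blanket `sed 's/^#/##/'`, the fence tracking is the part that matters for a programming book, where shell and Rust samples are full of lines starting with `#`.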
The script relies on other scripts and filters inside MoVeDo, but it is by far the most interesting/useful part of the whole piece of software (Maybe the only useful part at all, for anyone but myself). It should probably be extracted/made stand-alone some day.

@jacobmellin
Copy link

@hoijui Very nice project, I'll definitely check it out.

@aplatypus
Copy link

@hoijui ... You could look to the open source Okular tool to see how they load a MD document and render it as PDF.

@hoijui
Copy link

hoijui commented Dec 8, 2022

@aplatypus As I wrote before, the issue is not how to render a single MD file as PDF, that is trivial and possible with many tools and libraries. The issue is, how to convert a tree of Markdown files/documents into a single Markdown file.
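
For anyone who wants to see the tree-to-linear step in isolation, here is a minimal sketch. The `Chapter` struct is hypothetical (mdBook's real types differ), and link rewriting, front matter and header ids are all left out. It only shows the depth-first walk that turns nesting depth into header levels.

```rust
/// Hypothetical chapter tree for illustration; mdBook's real types differ.
struct Chapter {
    title: String,
    body: String,
    children: Vec<Chapter>,
}

/// Depth-first, pre-order walk that appends all chapters to one Markdown
/// document, using the nesting depth as the header level of each title.
fn linearize(chapters: &[Chapter], depth: usize, out: &mut String) {
    for chapter in chapters {
        // Markdown only has six header levels, so deeper nesting is capped.
        let level = (depth + 1).min(6);
        out.push_str(&format!("{} {}\n\n", "#".repeat(level), chapter.title));
        if !chapter.body.is_empty() {
            out.push_str(chapter.body.trim_end());
            out.push_str("\n\n");
        }
        // Children follow their parent, one level deeper.
        linearize(&chapter.children, depth + 1, out);
    }
}
```

Calling `linearize(&book, 0, &mut out)` on the top-level chapters fills `out` with the single document that a tool like pandoc can then render.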

@ourongxing
Copy link

@HollowMan6
Copy link

mdbook-pdf now supports a table of contents, see: HollowMan6/mdbook-pdf#1 (comment)


@LegNeato
Copy link

If you are looking for pdf output, check out the project I just posted in #815 (comment)

@max-heller
Copy link
Contributor

I built mdbook-pandoc, a backend powered by Pandoc. Pandoc is quite mature and supports many output formats, including PDF (I've mainly tested LaTeX) and EPUB. Sample rendered PDF books are here.
