pdf.toStream() ? Issues with larger pdfs #46

Closed
keithrz opened this issue Oct 9, 2015 · 10 comments
@keithrz

keithrz commented Oct 9, 2015

I think I'm running into memory issues when rendering larger pdfs.

Right now I'm creating 3 similar pdf files. The page on each pdf contains a table with 25 columns and anywhere from 4 to 25 rows.

2 of the 3 pdf files work fine, because they produce at most 3 pages each. But the 3rd pdf is big: it is 6 MB in size and has 28 pages. Pages 19-28 cannot be viewed in Adobe Reader (but they can be viewed in Mac Preview).

I've tested the big pdf at "http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx", PDF/A-1b, which gives me the following error:
The content stream contains an invalid operator. (10)

It says the error occurred 10 times, which makes sense since pages 19-28 cannot be viewed.

Now, if I render the same document, but only render pages 19-28, they render just fine. So there must be some resource limitation that is causing the issue.

I've tried rendering the pdf on different machines, and each machine produces the same result.

Instead of using pdf.toString() to trigger the building of the pdf, I'm wondering if creating a text stream from the pdf via a method like pdf.toStream() would prevent the resource limitation. Is this even possible to do with the code? I'm going to fork the code and try it, but if you think this is all but impossible, please let me know.

I cannot share the json data used to render the pdf, since it has client info. If you think it would really help to have this data in order to debug the issue, let me know and I'll try to obfuscate the data.

@rkusa
Owner

rkusa commented Oct 10, 2015

Hi, thanks for reporting your issue!

Regarding toStream(): I would love to have such functionality as well. However, it would not work. For example, when the total page count is printed on each page, every preceding page has to be updated whenever a new page is rendered. This is just one example of what prevents streaming :-(
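To illustrate the page count problem, here is a toy sketch (not pdfjs internals; `renderWithPageCount` and the placeholder string are made up for this example). Each footer needs the total, which is only known after the last page has been laid out, so all pages have to be buffered before any of them can be finalized:

```javascript
// Toy model: pages are plain strings, footers contain a placeholder
// that can only be filled in once every page has been produced.
function renderWithPageCount(pageBodies) {
  var PLACEHOLDER = '{{pageCount}}';

  // First pass: lay out every page; the total is still unknown here.
  var pages = pageBodies.map(function (body, i) {
    return body + '\nPage ' + (i + 1) + ' of ' + PLACEHOLDER;
  });

  // Second pass: the total is now known, so patch every buffered page.
  var total = String(pages.length);
  return pages.map(function (page) {
    return page.split(PLACEHOLDER).join(total);
  });
}

var result = renderWithPageCount(['first', 'second', 'third']);
console.log(result[0]); // "first\nPage 1 of 3"
```

Nothing can be emitted before the second pass, which is why a naive stream of finished pages does not work here.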

I'll try to trigger the same error first so that I may not need your source data. Are you using version 1.0.0-alpha.5?

@rkusa
Owner

rkusa commented Oct 10, 2015

Are you saving the pdf with encoding set to binary? e.g.

fs.writeFile(fileName, pdfString, 'binary', function () { 
  console.log('saved');
});
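As a side note on why the encoding matters: a string holding raw PDF bytes only round-trips if each character is written as exactly one byte. A minimal sketch using Node's `Buffer` (the sample string is made up, and `Buffer.from` is the modern replacement for the `new Buffer(...)` constructor that was current in 2015):

```javascript
// A "binary string": every char code is in 0..255 and stands for one byte.
var binaryString = '%PDF-1.4\n%\xFF\xFF\xFF\xFF\n';

// 'binary' (an alias for 'latin1') maps each character to a single byte.
var asBinary = Buffer.from(binaryString, 'binary');

// 'utf8' encodes every character above 0x7F as two bytes,
// which shifts all byte offsets in the file.
var asUtf8 = Buffer.from(binaryString, 'utf8');

console.log(asBinary.length); // 15
console.log(asUtf8.length);   // 19
```

A PDF written with the wrong encoding typically ends up with a broken cross-reference table, because the recorded byte offsets no longer match the file.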

Edit: I am not able to reproduce your issue. Does the validator tell you where in the PDF the error occurs? If so, you could also send me a few lines surrounding that location, copied straight out of the .pdf file.

Thanks

@rkusa rkusa mentioned this issue Oct 10, 2015
@keithrz
Author

keithrz commented Oct 10, 2015

Yes, I'm using that version, and yes, I've been saving the file as binary.

I think if you create enough data you'll be able to recreate it. Thanks for the info about streaming!

@rkusa
Owner

rkusa commented Oct 11, 2015

Are you just adding text, or also images?

@keithrz
Author

keithrz commented Oct 11, 2015

Just text.

Here is the code that I use to create each document:

var underscore = require('underscore');
var fs = require('fs');
var tmp = require('tmp');
var path = require('path');
var pdfjs = require('pdfjs');
var formatNumber = require('format-number');
var moment = require('moment-timezone');


/**
 * Creates a pdf using data from a given array.
 * @param metadata (object) : object defining how the pdf should be formatted
 * @param pages (Array) : each array item should have {pageTitle: string, data: object}
 * @param next (Function) : callback with params (err, filepath) : filepath is the path of the generated pdf.
 */
function pdfFromDataAsFile(metadata, pages, next) {
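    // convertFontName, regularFont, boldFont and conf are not defined in this excerpt;
    // they are assumed to be defined elsewhere in the module.
    var docMetadataEnsureFont = underscore.defaults(convertFontName(metadata.doc), {font: regularFont});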

    var doc = new pdfjs.Document(docMetadataEnsureFont);

    // header and footer common to all pages
    var header = doc.header();
    header.text(metadata.docTitle, {font: boldFont, fontSize: 12});

    var footer = doc.footer();
    var table, tr;
    footer.text({textAlign: 'center'}).append('Page  ').pageNumber().append('  of  ').pageCount();

    pages.forEach(function(page, index) {
        if(index > 0)
            doc.pageBreak();

        // title specific to each page
        table = doc.table(underscore.extend({widths: ['100%']}, convertFontName(metadata.pageTitle)));
        table.tr().td(page.pageTitle);

        table = doc.table(convertFontName(metadata.table));

        // header row(s)
        metadata.headerRowsData.forEach(function (rowData) {
            tr = table.tr(convertFontName(metadata.headerRows));

            metadata.rowKeys.forEach(function(rowKey) {
                tr.td(rowData[rowKey]);
            });
        });

        // data row(s)
        page.data.forEach(function(dataset) {
            var datasetMetadata = metadata.mainRows;
            var datasetMetadataFirstCell = metadata.mainRowsFirstCell;
            if(dataset.summary) {
                datasetMetadata = metadata.summaryRows;
                datasetMetadataFirstCell = metadata.summaryRowsFirstCell;
            }

            dataset.rows.forEach(function(rowData) {
                tr = table.tr(convertFontName(datasetMetadata));

                // colIndex avoids shadowing the page index of the outer forEach
                metadata.rowKeys.forEach(function(rowKey, colIndex) {
                    if(colIndex === 0)
                        tr.td(rowData[rowKey], convertFontName(datasetMetadataFirstCell));
                    else
                        tr.td(formatNumber(metadata.numberFormat[rowKey])(rowData[rowKey]));
                });
            });
        });

        table = doc.table({widths: ["100%"]});
        tr = table.tr();
        var generatedText = tr.td().text();
        generatedText.br();
        var generatedDateWithTimezone = moment.tz(moment(), conf.reportSchedule.defaultTimezone);
        var generatedDateText = generatedDateWithTimezone.format('M/D/YYYY h:mm:ss A z');
        generatedText.append("Report Generated by Reporting System at " + generatedDateText);
    });

    var pdf = doc.render();

    var tempPath = path.resolve(__dirname + '/../temp');
    var tempFileTemplate = tempPath + '/XXXXXX.pdf';
    // tmpFilePath avoids shadowing the `path` module required above
    tmp.tmpName({template: tempFileTemplate}, function(err, tmpFilePath) {
        if(err)
            return next(err);

        savePdf(pdf, tmpFilePath, next);
    });
}

exports.pdfFromDataAsFile = pdfFromDataAsFile;

function savePdf(pdf, filepath, next) {
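    // Write the rendered PDF as a binary string; fs.writeFile expects the file path as its first argument.
    //var buffer = new Buffer(pdf.toString(), 'binary');
    fs.writeFile(filepath, pdf.toString(), 'binary', function(err) {
        if(err)
            return next(err);

        next(null, filepath);
    });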
}

And here is some sample code that populates my doc metadata:

function getMetadata(docTitle, headerTitle) {
    var docMetadata = {
        width: 1008, // (14  in * 72 dpi)
        height: 612  // (8.5 in * 72 dpi)
    };

    var tableMetadata = {
        headerRows: 0, fontSize: 5,
        borderHorizontalWidth: 0.5,
        borderVerticalWidth: 0.5,
        widths: [
            '11.1%', '3.7%', '3.7%', '3.7%', '3.7%',
            '3.7%', '3.7%', '3.7%', '3.7%', '3.7%',
            '3.7%', '3.7%', '3.7%', '3.7%', '3.7%',
            '3.7%', '3.7%', '3.7%', '3.7%', '3.7%',
            '3.7%', '3.7%', '3.7%', '3.7%', '3.7%'
        ],
        padding: 1
    };

    var rowKeys = [
        'name', 'net_sales', 'net_sales_last_year', 'net_sales_ly_delta',
        'visitors', 'visitors_last_year', 'visitors_ly_delta', 'conversion', 'conversion_last_year', 'conversion_ly_delta',
        'transactions', 'transactions_last_year', 'transactions_ly_delta',
        'units', 'units_last_year', 'units_ly_delta',
        'average_unit_retail', 'average_unit_retail_last_year', 'average_unit_retail_ly_delta',
        'units_per_transaction', 'units_per_transaction_last_year', 'units_per_transaction_ly_delta',
        'avg_dollar_per_transaction', 'avg_dollar_per_transaction_last_year', 'avg_dollar_per_transaction_ly_delta'
    ];

    var pageTitleMetadata = { fontName: 'bold', fontSize: 5, paddingTop: 10, paddingBottom: 4};

    var headerRowsMetadata = { textAlign: 'center', fontName: 'bold', backgroundColor: 'd2d2d2' };

    var headerRowsData = [
        {
            name: '',
            net_sales: 'Net Sales',
            net_sales_last_year: 'Net Sales',
            net_sales_ly_delta: 'Net Sales',
            units: 'Units',
            units_last_year: 'Units',
            units_ly_delta: 'Units',
            visitors: 'Visitors',
            visitors_last_year: 'Visitors',
            visitors_ly_delta: 'Visitors',
            conversion: 'Conversion Rate',
            conversion_last_year: 'Conversion Rate',
            conversion_ly_delta: 'Conversion Rate',
            transactions: 'Transactions',
            transactions_last_year: 'Transactions',
            transactions_ly_delta: 'Transactions',
            average_unit_retail: 'AUR',
            average_unit_retail_last_year: 'AUR',
            average_unit_retail_ly_delta: 'AUR',
            units_per_transaction: 'UPT',
            units_per_transaction_last_year: 'UPT',
            units_per_transaction_ly_delta: 'UPT',
            avg_dollar_per_transaction: 'ADT',
            avg_dollar_per_transaction_last_year: 'ADT',
            avg_dollar_per_transaction_ly_delta: 'ADT'
        },{
            name: headerTitle,
            net_sales: 'Actual',
            net_sales_last_year: 'Last Year',
            net_sales_ly_delta: '% Chg to LY',
            units: 'Actual',
            units_last_year: 'Last Year',
            units_ly_delta: '% Chg to LY',
            visitors: 'Actual',
            visitors_last_year: 'Last Year',
            visitors_ly_delta: '% Chg to LY',
            conversion: 'Actual',
            conversion_last_year: 'Last Year',
            conversion_ly_delta: 'Chg to LY',
            transactions: 'Actual',
            transactions_last_year: 'Last Year',
            transactions_ly_delta: '% Chg to LY',
            average_unit_retail: 'Actual',
            average_unit_retail_last_year: 'Last Year',
            average_unit_retail_ly_delta: 'Chg to LY',
            units_per_transaction: 'Actual',
            units_per_transaction_last_year: 'Last Year',
            units_per_transaction_ly_delta: 'Chg to LY',
            avg_dollar_per_transaction: 'Actual',
            avg_dollar_per_transaction_last_year: 'Last Year',
            avg_dollar_per_transaction_ly_delta: 'Chg to LY'
        }
    ];

    var mainRowsMetadata = {textAlign: 'right'};

    var mainRowsFirstCellMetadata = {textAlign: 'left', fontName: 'bold'};

    var summaryRowsMetadata = { textAlign: 'right', fontName: 'bold', backgroundColor: 'd2d2d2'};

    var summaryRowsFirstCellMetadata = {textAlign: 'left'};

    var numberFormatMetadata = {
        net_sales: {prefix: '$', round: 0},
        net_sales_last_year: {prefix: '$', round: 0},
        net_sales_ly_delta: {suffix: '%', round: 2},
        units: {},
        units_last_year: {},
        units_ly_delta: {suffix: '%', round: 2},
        visitors: {},
        visitors_last_year: {},
        visitors_ly_delta: {suffix: '%', round: 2},
        conversion: {suffix: '%', round: 2},
        conversion_last_year: {suffix: '%', round: 2},
        conversion_ly_delta: {round: 2},
        transactions: {},
        transactions_last_year: {},
        transactions_ly_delta: {suffix: '%', round: 2},
        average_unit_retail: {prefix: '$', round: 2},
        average_unit_retail_last_year: {prefix: '$', round: 2},
        average_unit_retail_ly_delta: {prefix: '$', round: 2},
        units_per_transaction: {round: 2},
        units_per_transaction_last_year: {round: 2},
        units_per_transaction_ly_delta: {round: 2},
        avg_dollar_per_transaction: {prefix: '$', round: 2},
        avg_dollar_per_transaction_last_year: {prefix: '$', round: 2},
        avg_dollar_per_transaction_ly_delta: {prefix: '$', round: 2}
    };

    var metadata = {
        doc: docMetadata,
        docTitle: docTitle,
        numberFormat: numberFormatMetadata,
        pageTitle: pageTitleMetadata,
        table: tableMetadata,
        rowKeys: rowKeys,
        headerRows: headerRowsMetadata,
        headerRowsData: headerRowsData,
        mainRows: mainRowsMetadata,
        mainRowsFirstCell: mainRowsFirstCellMetadata,
        summaryRows: summaryRowsMetadata,
        summaryRowsFirstCell: summaryRowsFirstCellMetadata
    };

    return metadata;
}

I don't have any sample data for pages, but they would have the format:

[{pageTitle: string, data: object}]

where the data object would have the same format as headerRowsData in the metadata object. The values within the data object would be numbers.
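Judging from how pdfFromDataAsFile reads each page (`page.pageTitle`, then `page.data` as a list of datasets, each with `rows` and an optional `summary` flag), a stand-in `pages` array could be generated roughly like this. `makeSamplePages` is a hypothetical helper and all values are made up:

```javascript
// Hypothetical helper: builds a `pages` array in the shape that
// pdfFromDataAsFile appears to expect. All numbers are placeholders.
function makeSamplePages(rowKeys, pageCount, rowsPerPage) {
  var pages = [];
  for (var p = 0; p < pageCount; p++) {
    var rows = [];
    for (var r = 0; r < rowsPerPage; r++) {
      var row = {};
      rowKeys.forEach(function (key, i) {
        // First column is a label; every other column is a made-up number.
        row[key] = i === 0 ? 'Store ' + (r + 1) : (p + 1) * 1000 + r * 10 + i;
      });
      rows.push(row);
    }
    pages.push({
      pageTitle: 'Region ' + (p + 1),
      data: [
        { summary: false, rows: rows },
        // Reuse the first row as a stand-in for a summary line.
        { summary: true, rows: rows.slice(0, 1) }
      ]
    });
  }
  return pages;
}

var samplePages = makeSamplePages(['name', 'net_sales', 'units'], 28, 25);
console.log(samplePages.length); // 28
```

Passing the real `rowKeys` from getMetadata instead of the short list above should give pages wide enough to exercise all 25 columns.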

@keithrz
Author

keithrz commented Oct 12, 2015

I've created a full test gist here:
https://gist.github.com/keithrz/d8c9b6c2821bd66c36e5

rkusa added a commit that referenced this issue Oct 13, 2015
rkusa added a commit that referenced this issue Oct 13, 2015
@rkusa
Owner

rkusa commented Oct 13, 2015

Should be fixed in master. It was an issue with rounding numbers in exponential notation.
I've also cut some bytes off the resulting file size.
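For context on this kind of bug: PDF content streams accept only plain decimal numbers, while JavaScript switches to exponential notation when it stringifies very small or very large values. A PDF reader then sees something like `1e-7` as an unknown operator. The sketch below shows one way to avoid that; `formatPdfNumber` is a hypothetical helper written for illustration, not the code from the actual commit:

```javascript
// Hypothetical helper: formats a number for a PDF content stream
// without ever producing exponential notation.
function formatPdfNumber(value, decimals) {
  if (!isFinite(value) || Math.abs(value) >= 1e21) {
    // toFixed itself falls back to exponential notation from 1e21 upwards.
    throw new RangeError('cannot represent ' + value + ' as a PDF number');
  }
  var digits = decimals === undefined ? 4 : decimals;
  var fixed = value.toFixed(digits);
  if (fixed.indexOf('.') !== -1) {
    // Drop trailing zeros and a dangling decimal point to save bytes.
    fixed = fixed.replace(/\.?0+$/, '');
  }
  return fixed === '-0' ? '0' : fixed;
}

console.log(String(1e-7));               // "1e-7" (invalid in a content stream)
console.log(formatPdfNumber(1e-7));      // "0"
console.log(formatPdfNumber(12.5));      // "12.5"
console.log(formatPdfNumber(100));       // "100"
```

Trimming trailing zeros also shaves a few bytes off every coordinate, which adds up in a table with hundreds of cells.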

There are more ways to reduce the file size, e.g. drawing borders once per row instead of once per cell, and compressing the PDF content streams. I've added both to my todo list, but they are not a high priority.

For performance and memory usage: I'll definitely add a streaming API. But unfortunately, not before December.
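To sketch what a streaming API changes for memory usage: instead of building one large string and writing it at the end, chunks are handed to the consumer as soon as they are produced. The example below uses only Node's built-in `stream` module (`Readable.from` needs Node 12 or later); `renderPages` and `pagesAsStream` are hypothetical stand-ins and say nothing about how the pdfjs API looks:

```javascript
var stream = require('stream');

// Hypothetical stand-in for a renderer: yields one chunk per page
// instead of concatenating everything into a single string.
function* renderPages(pageCount) {
  for (var i = 1; i <= pageCount; i++) {
    yield 'content of page ' + i + '\n';
  }
}

// Wrap the generator in a Readable so it can be piped to any Writable;
// only the chunk currently in flight has to be held in memory.
function pagesAsStream(pageCount) {
  return stream.Readable.from(renderPages(pageCount));
}
```

A caller would then write something like `pagesAsStream(28).pipe(fs.createWriteStream(filepath))`, with backpressure handled by the stream machinery.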

Thank you very much for providing the example code!

@rkusa rkusa added the bug label Oct 13, 2015
@keithrz
Author

keithrz commented Oct 13, 2015

Awesome improvement!

Besides fixing the invalid character issue, the 28-page report shrank by almost half, from 6 MB to 3.3 MB.

The streaming API would be amazing, and yes, definitely something that will take some time.

Unless you want this issue kept open to keep track of the streaming API, I will close it.

@rkusa
Owner

rkusa commented Oct 13, 2015

I am glad it worked out!

Streaming API will be tracked in #47

@rkusa rkusa closed this as completed Oct 13, 2015
@rkusa
Owner

rkusa commented Apr 12, 2017

Version 2.0.0-alpha.1 is now completely streaming based.
