UniPDF v4 - Proposals #337

gunnsth · 2020-05-03T22:39:45Z

Introduction

The idea of this ticket is to collect all ideas that would make sense for the next major version (v4). This is a chance to consider major updates.

Ideas related to performance improvements should include benchmarks that clearly show potential advantages.

Ideas for refactoring need to clearly state the advantages and also what it means for the user. If there is a breaking change, there should be an easy way to update code.

Ideas for reducing binary size are also welcome, although it is important that the API remains easy to use.

Related issues

There are already a few relevant issues:

Change WriteString signature to a Write method (with io.Reader) #49
Core: Refactor parser to use bufferedReadSeeker type #55
Memory management for large PDFs #19 - For this one it might make sense to create a generalized temp storage interface that the user can provide and we can have default implementations (memory default/on disk secondary or hybrid). Recently did a similar thing in UniOffice.
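
A minimal sketch of what such a temp storage interface might look like, assuming a memory-backed default and an on-disk alternative. The names TempStorage, TempFile, MemoryStorage and DiskStorage are hypothetical and do not exist in UniPDF; a hybrid implementation could switch from memory to disk past a size threshold.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// TempFile and TempStorage are hypothetical names, not part of UniPDF.
type TempFile interface {
	io.ReadWriteSeeker
	io.Closer
}

type TempStorage interface {
	Create() (TempFile, error)
}

// memFile keeps all data in a growable byte slice.
type memFile struct {
	data []byte
	pos  int64
}

func (m *memFile) Write(p []byte) (int, error) {
	end := m.pos + int64(len(p))
	if end > int64(len(m.data)) {
		m.data = append(m.data, make([]byte, end-int64(len(m.data)))...)
	}
	copy(m.data[m.pos:], p)
	m.pos = end
	return len(p), nil
}

func (m *memFile) Read(p []byte) (int, error) {
	if m.pos >= int64(len(m.data)) {
		return 0, io.EOF
	}
	n := copy(p, m.data[m.pos:])
	m.pos += int64(n)
	return n, nil
}

func (m *memFile) Seek(offset int64, whence int) (int64, error) {
	base := map[int]int64{io.SeekStart: 0, io.SeekCurrent: m.pos, io.SeekEnd: int64(len(m.data))}
	b, ok := base[whence]
	if !ok || b+offset < 0 {
		return 0, errors.New("memFile: invalid seek")
	}
	m.pos = b + offset
	return m.pos, nil
}

func (m *memFile) Close() error { return nil }

// MemoryStorage is the in-memory default.
type MemoryStorage struct{}

func (MemoryStorage) Create() (TempFile, error) { return &memFile{}, nil }

// diskFile removes its backing file when closed.
type diskFile struct{ *os.File }

func (d diskFile) Close() error {
	err := d.File.Close()
	if rmErr := os.Remove(d.File.Name()); err == nil {
		err = rmErr
	}
	return err
}

// DiskStorage creates scratch files in Dir (the system temp dir if empty).
type DiskStorage struct{ Dir string }

func (s DiskStorage) Create() (TempFile, error) {
	f, err := os.CreateTemp(s.Dir, "unipdf-*")
	if err != nil {
		return nil, err
	}
	return diskFile{f}, nil
}

// roundTrip writes payload to a scratch file and reads it back.
func roundTrip(s TempStorage, payload []byte) ([]byte, error) {
	f, err := s.Create()
	if err != nil {
		return nil, err
	}
	defer f.Close()
	if _, err := f.Write(payload); err != nil {
		return nil, err
	}
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return nil, err
	}
	return io.ReadAll(f)
}

func main() {
	for _, s := range []TempStorage{MemoryStorage{}, DiskStorage{}} {
		out, err := roundTrip(s, []byte("stream data"))
		fmt.Printf("%T: %q %v\n", s, out, err)
	}
}
```

Because both implementations satisfy the same interface, calling code such as roundTrip does not need to know where the bytes live.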

gunnsth · 2020-05-07T08:50:13Z

contentstream.ContentStreamOperations should be a struct containing a slice, rather than a typed slice as it is currently:

type ContentStreamOperations []*ContentStreamOperation

Typed slices like this are not fun to work with, since iterating through an arbitrary type is kind of messy. It would be better to have accessors such as cs.Elements(), as is already done with core.PdfObjectArray. A struct also adds the flexibility to attach extra data that may be useful.
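
One possible shape for the struct-based version is sketched below. The ContentStreamOperation here is a simplified stand-in (string parameters instead of PDF objects), and the constructor and the Len/Append methods are assumptions for illustration; only Elements() is named in the proposal.

```go
package main

import "fmt"

// ContentStreamOperation is a simplified stand-in for the real type.
type ContentStreamOperation struct {
	Operand string
	Params  []string
}

// ContentStreamOperations wraps the slice instead of being a typed slice.
type ContentStreamOperations struct {
	ops []*ContentStreamOperation
	// Extra data (for example source offsets) could be added here later
	// without breaking callers.
}

// NewContentStreamOperations is a hypothetical constructor.
func NewContentStreamOperations(ops ...*ContentStreamOperation) *ContentStreamOperations {
	return &ContentStreamOperations{ops: ops}
}

// Elements returns the underlying operations for iteration.
func (c *ContentStreamOperations) Elements() []*ContentStreamOperation { return c.ops }

// Len returns the number of operations.
func (c *ContentStreamOperations) Len() int { return len(c.ops) }

// Append adds operations to the end.
func (c *ContentStreamOperations) Append(ops ...*ContentStreamOperation) {
	c.ops = append(c.ops, ops...)
}

func main() {
	cs := NewContentStreamOperations(&ContentStreamOperation{Operand: "BT"})
	cs.Append(&ContentStreamOperation{Operand: "ET"})
	for _, op := range cs.Elements() {
		fmt.Println(op.Operand)
	}
}
```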

gunnsth · 2020-05-30T11:22:53Z

With support for 1 character code <-> multiple runes (string) in CMaps, it makes sense to update our text encoder interfaces in the future. Currently we have

// TextEncoder defines the common methods that a text encoder implementation must have in UniDoc.
type TextEncoder interface {
	// String returns a string that describes the TextEncoder instance.
	String() string

	// Encode converts the Go unicode string to a PDF encoded string.
	Encode(str string) []byte

	// Decode converts PDF encoded string to a Go unicode string.
	Decode(raw []byte) string

	// RuneToCharcode returns the PDF character code corresponding to rune `r`.
	// The bool return flag is true if there was a match, and false otherwise.
	// This is usually implemented as RuneToGlyph->GlyphToCharcode
	RuneToCharcode(r rune) (CharCode, bool)

	// CharcodeToRune returns the rune corresponding to character code `code`.
	// The bool return flag is true if there was a match, and false otherwise.
	// This is usually implemented as CharcodeToGlyph->GlyphToRune
	CharcodeToRune(code CharCode) (rune, bool)

	// ToPdfObject returns a PDF Object that represents the encoding.
	ToPdfObject() core.PdfObject
}

It would make sense to have charcode -> string and string -> charcode methods, or maybe ones that process multiple codes at once instead of single ones.
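
As a rough illustration, string-based methods could look like the sketch below. StringEncoder, mapEncoder and DecodeAll are hypothetical names, and CharCode is redeclared locally as a stand-in so that the example is self-contained.

```go
package main

import (
	"fmt"
	"strings"
)

// CharCode is a local stand-in for the character code type.
type CharCode uint16

// StringEncoder is a hypothetical interface mapping codes to whole strings,
// so that one code can yield several runes (for example a ligature).
type StringEncoder interface {
	CharcodeToString(code CharCode) (string, bool)
	StringToCharcode(s string) (CharCode, bool)
}

// mapEncoder is a toy table-driven implementation.
type mapEncoder struct {
	toStr  map[CharCode]string
	toCode map[string]CharCode
}

func newMapEncoder(table map[CharCode]string) *mapEncoder {
	e := &mapEncoder{toStr: table, toCode: make(map[string]CharCode, len(table))}
	for code, s := range table {
		e.toCode[s] = code
	}
	return e
}

func (e *mapEncoder) CharcodeToString(code CharCode) (string, bool) {
	s, ok := e.toStr[code]
	return s, ok
}

func (e *mapEncoder) StringToCharcode(s string) (CharCode, bool) {
	code, ok := e.toCode[s]
	return code, ok
}

// DecodeAll processes many codes at once; unknown codes become U+FFFD.
func DecodeAll(enc StringEncoder, codes []CharCode) string {
	var sb strings.Builder
	for _, c := range codes {
		if s, ok := enc.CharcodeToString(c); ok {
			sb.WriteString(s)
		} else {
			sb.WriteRune('\uFFFD')
		}
	}
	return sb.String()
}

func main() {
	enc := newMapEncoder(map[CharCode]string{0x41: "A", 0x01: "ffi"})
	fmt.Println(DecodeAll(enc, []CharCode{0x41, 0x01}))
}
```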

peterwilliams97 · 2020-06-27T01:38:34Z

Extractor.ExtractPageText() returns two statistics that I don't think anyone uses or will ever use.
Can we replace it with a function like Extractor.Extract() (*PageText, error)?

peterwilliams97 · 2020-06-27T11:54:18Z

Text extraction is now aware of paragraph and line structure. We can therefore write a search function that returns the bounding boxes of the matching line fragments, including when the match spans multiple lines or multiple paragraphs.
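
A toy sketch of the idea follows. It assumes ASCII text, a fixed advance width per character, and lines joined by a single space; Line, Rect and Search are hypothetical names, and a real implementation would use the per-character positions that the extractor already tracks.

```go
package main

import (
	"fmt"
	"strings"
)

// Rect is an axis-aligned box (lower-left and upper-right corners).
type Rect struct{ Llx, Lly, Urx, Ury float64 }

// Line is a simplified extracted line with a fixed advance per character.
type Line struct {
	Text   string
	X, Y   float64 // lower-left corner of the first character
	CharW  float64
	Height float64
}

// Search returns one box per line fragment covered by the first match.
func Search(lines []Line, query string) []Rect {
	if query == "" {
		return nil
	}
	// Join the lines and remember where each one starts in the joined text.
	var sb strings.Builder
	starts := make([]int, len(lines))
	for i, l := range lines {
		if i > 0 {
			sb.WriteByte(' ')
		}
		starts[i] = sb.Len()
		sb.WriteString(l.Text)
	}
	begin := strings.Index(sb.String(), query)
	if begin < 0 {
		return nil
	}
	end := begin + len(query)

	// Intersect the match range with each line's range.
	var out []Rect
	for i, l := range lines {
		lo, hi := starts[i], starts[i]+len(l.Text)
		if begin > lo {
			lo = begin
		}
		if end < hi {
			hi = end
		}
		if lo >= hi {
			continue
		}
		a, b := float64(lo-starts[i]), float64(hi-starts[i])
		out = append(out, Rect{l.X + a*l.CharW, l.Y, l.X + b*l.CharW, l.Y + l.Height})
	}
	return out
}

func main() {
	lines := []Line{
		{Text: "hello big", X: 0, Y: 100, CharW: 10, Height: 12},
		{Text: "world again", X: 0, Y: 80, CharW: 10, Height: 12},
	}
	for _, r := range Search(lines, "big world") {
		fmt.Printf("%+v\n", r)
	}
}
```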

progamer71 · 2020-08-18T07:44:57Z

Support creating and managing PDF/A3 files with file attachments.

gunnsth · 2020-08-27T13:09:40Z

@progamer71 That is on our radar, but it is not what this ticket is about. This is about API compatibility and possible major changes in the upcoming v4. PDF/A3 is not part of our API yet, so it is not a concern here. It would make sense to create a new issue for that, with more details, if there is not one already.
See #11

gunnsth · 2020-08-27T13:11:22Z

NewPdfFontFromTTFFile and NewCompositePdfFontFromTTFFile are a bit confusing. Users often try NewPdfFontFromTTFFile and then use symbols that are not in the simple encoding, which then do not display.
It would be nice if NewPdfFontFromTTFFile could handle this, and the second function would not be needed.

gunnsth · 2020-10-07T10:17:56Z

In V4: We should change content stream processing.
Currently we have

func (p *PdfPage) GetAllContentStreams() (string, error) {

which returns a string. The problem with this is that content streams can get very big, and working with them as a string leads to copying, which is inefficient and memory intensive.

It may be feasible to create a new type in contentstream called ContentStream to represent the content stream, so that it can be worked with as a byte slice and copying is avoided unless absolutely necessary.
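
A minimal sketch of such a type, assuming it simply wraps the decoded bytes. Apart from the name ContentStream, every method here is a hypothetical illustration rather than an existing API.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// ContentStream wraps decoded content stream data without copying it.
type ContentStream struct {
	data []byte
}

// NewContentStream takes ownership of data; no copy is made.
func NewContentStream(data []byte) *ContentStream {
	return &ContentStream{data: data}
}

// Bytes returns the underlying slice; it shares memory with the stream.
func (cs *ContentStream) Bytes() []byte { return cs.data }

// Len returns the size in bytes.
func (cs *ContentStream) Len() int { return len(cs.data) }

// Reader exposes the data for streaming parsers without copying.
func (cs *ContentStream) Reader() io.Reader { return bytes.NewReader(cs.data) }

// String copies the data; callers opt in to the copy explicitly.
func (cs *ContentStream) String() string { return string(cs.data) }

func main() {
	raw := []byte("BT /F1 12 Tf (Hello) Tj ET")
	cs := NewContentStream(raw)

	// The slice returned by Bytes aliases the original buffer.
	fmt.Println(&cs.Bytes()[0] == &raw[0])

	// Reading through the io.Reader yields the same content.
	out, err := io.ReadAll(cs.Reader())
	fmt.Println(len(out) == cs.Len(), err)
}
```

Keeping String() as the only method that copies makes the cost visible at the call site.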

gunnsth · 2021-01-11T11:01:00Z

Deprecate creator.Paragraph in favor of creator.StyledParagraph

gunnsth · 2021-01-13T22:33:56Z

Remove model.ImageHandling or make internal. Alternatively it could be redesigned such that it would be actually usable for providing handlers for loading images. At the moment this functionality is not well maintained and would need more testing.

StreamEncoders could be designed so that they can be registered, which would allow an external handler to be plugged in (in particular for image handling). Having Decode return images as a []byte stream (data) may not be ideal, and sometimes we load an image and convert between models multiple times, which is not efficient.
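
A registry along these lines could be sketched as follows. StreamDecoder, RegisterDecoder and LookupDecoder are hypothetical names, and the ASCIIHexDecode handler is a simplified example built on the standard encoding/hex package.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"sync"
)

// StreamDecoder is a hypothetical minimal decoder interface.
type StreamDecoder interface {
	Decode(encoded []byte) ([]byte, error)
}

var (
	mu       sync.RWMutex
	decoders = map[string]StreamDecoder{}
)

// RegisterDecoder installs (or replaces) the handler for a filter name.
func RegisterDecoder(filter string, d StreamDecoder) {
	mu.Lock()
	defer mu.Unlock()
	decoders[filter] = d
}

// LookupDecoder returns the handler registered for a filter name.
func LookupDecoder(filter string) (StreamDecoder, bool) {
	mu.RLock()
	defer mu.RUnlock()
	d, ok := decoders[filter]
	return d, ok
}

// hexDecoder is a simplified ASCIIHexDecode handler: whitespace is skipped,
// '>' ends the data, and an odd trailing digit is padded with '0'.
type hexDecoder struct{}

func (hexDecoder) Decode(encoded []byte) ([]byte, error) {
	var digits []byte
	for _, b := range encoded {
		if b == '>' {
			break
		}
		switch b {
		case ' ', '\t', '\r', '\n', '\f', 0:
			continue
		}
		digits = append(digits, b)
	}
	if len(digits)%2 == 1 {
		digits = append(digits, '0')
	}
	out := make([]byte, hex.DecodedLen(len(digits)))
	n, err := hex.Decode(out, digits)
	return out[:n], err
}

func main() {
	RegisterDecoder("ASCIIHexDecode", hexDecoder{})
	if d, ok := LookupDecoder("ASCIIHexDecode"); ok {
		out, err := d.Decode([]byte("48 65 6C6C 6F>"))
		fmt.Printf("%s %v\n", out, err)
	}
}
```

An image-specific handler could return a decoded image type instead of []byte, which is the part of the proposal this sketch does not cover.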

gunnsth · 2021-01-14T11:25:36Z

Text extraction should have options. Possible options:

Raw -> Just get the plain (decoded) text from the content streams. Should be very fast, and output very consistent (independent of table detection algorithms). Good for benchmarking against.
Raw sorted -> Processed to sort (top-down, left-right).
Cells/Tabular -> Apply table detection to the text and group text together into cells. Final output is sorted (top-down, left-right) by the grouped cells, using the upper-left coordinate of each.
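
The options above could be expressed roughly as in the sketch below. Mode, Options, Fragment and Arrange are hypothetical names; table detection is left unimplemented, and a real sort would compare Y values with a tolerance rather than exactly.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Mode selects how extracted text is post-processed.
type Mode int

const (
	ModeRaw       Mode = iota // content stream order, no processing
	ModeRawSorted             // sorted top-down, left-right
	ModeCells                 // table detection (not implemented here)
)

// Options is a hypothetical options struct for text extraction.
type Options struct {
	Mode Mode
}

// Fragment is a piece of text with its position in PDF coordinates,
// where larger Y values are higher on the page.
type Fragment struct {
	Text string
	X, Y float64
}

// Arrange returns a copy of frags ordered according to opts.
func Arrange(frags []Fragment, opts Options) ([]Fragment, error) {
	out := append([]Fragment(nil), frags...)
	switch opts.Mode {
	case ModeRaw:
		return out, nil
	case ModeRawSorted:
		sort.SliceStable(out, func(i, j int) bool {
			if out[i].Y != out[j].Y {
				return out[i].Y > out[j].Y // top of the page first
			}
			return out[i].X < out[j].X // then left to right
		})
		return out, nil
	default:
		return nil, errors.New("mode not implemented in this sketch")
	}
}

func main() {
	frags := []Fragment{{"b", 50, 700}, {"c", 10, 600}, {"a", 10, 700}}
	sorted, err := Arrange(frags, Options{Mode: ModeRawSorted})
	fmt.Println(sorted, err)
}
```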

gunnsth · 2021-02-12T17:31:40Z

Unify ContentStreamProcessor based on usage in render and extractor packages. Should be able to keep track of graphics and text state there in one place.
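
To illustrate what tracking state in one place could mean, here is a much simplified sketch. Processor, State and Op are stand-in types that handle only the q, Q, w and Tf operators; this is not the existing ContentStreamProcessor API.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// State holds a small subset of the graphics state, including text state.
type State struct {
	LineWidth float64
	FontName  string
	FontSize  float64
}

// Op is a simplified content stream operation.
type Op struct {
	Operand string
	Params  []string
}

// Processor tracks the current state and the q/Q save stack.
type Processor struct {
	cur   State
	stack []State
}

// Process updates the state for each operation, then calls visit with it.
func (p *Processor) Process(ops []Op, visit func(op Op, s State)) error {
	for _, op := range ops {
		switch op.Operand {
		case "q": // save graphics state
			p.stack = append(p.stack, p.cur)
		case "Q": // restore graphics state
			if len(p.stack) == 0 {
				return errors.New("Q without matching q")
			}
			p.cur = p.stack[len(p.stack)-1]
			p.stack = p.stack[:len(p.stack)-1]
		case "w": // set line width
			if len(op.Params) != 1 {
				return errors.New("w expects one parameter")
			}
			v, err := strconv.ParseFloat(op.Params[0], 64)
			if err != nil {
				return err
			}
			p.cur.LineWidth = v
		case "Tf": // set font name and size
			if len(op.Params) != 2 {
				return errors.New("Tf expects two parameters")
			}
			v, err := strconv.ParseFloat(op.Params[1], 64)
			if err != nil {
				return err
			}
			p.cur.FontName, p.cur.FontSize = op.Params[0], v
		}
		visit(op, p.cur)
	}
	return nil
}

func main() {
	ops := []Op{{"w", []string{"2"}}, {"q", nil}, {"Tf", []string{"F1", "12"}}, {"Tj", nil}, {"Q", nil}, {"S", nil}}
	p := &Processor{}
	err := p.Process(ops, func(op Op, s State) {
		fmt.Printf("%-2s %+v\n", op.Operand, s)
	})
	fmt.Println(err)
}
```

With this shape, a renderer and a text extractor could both be written as visit callbacks and share the same state tracking.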

gunnsth changed the title ~~UniPDF v4 - proposals~~ UniPDF v4 - Proposals May 3, 2020

gunnsth mentioned this issue Jun 24, 2020

Text extraction code for columns. #366

Merged
