UniPDF v4 - Proposals #337

Open
gunnsth opened this issue May 3, 2020 · 12 comments


@gunnsth
Contributor

gunnsth commented May 3, 2020

Introduction

The idea of this ticket is to collect all ideas that would make sense for the next major version (v4). This is a chance to consider major updates.

Ideas related to performance improvements should include benchmarks that clearly show potential advantages.

Ideas for refactoring need to clearly state the advantages and also what it means for the user. If there is a breaking change, there should be an easy way to update code.

Ideas for reducing binary sizes are also welcome, although it's important that the API remains easy to use.

Related issues

There are already a few relevant issues:

@gunnsth gunnsth changed the title UniPDF v4 - proposals UniPDF v4 - Proposals May 3, 2020
@gunnsth
Contributor Author

gunnsth commented May 7, 2020

contentstream.ContentStreamOperations should be a struct containing a slice, not a typed slice as it is currently:

type ContentStreamOperations []*ContentStreamOperation

It's not fun to work with those typed slices, since iterating through an arbitrary type is kind of messy. It would be better to have something like cs.Elements(), as is already done with core.PdfObjectArray. A struct also adds the flexibility to carry extra data that can be useful.
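A minimal sketch of what the struct-based shape could look like. This is only an illustration of the idea, not the planned v4 API: the accessor names (`Elements`, `Len`, `Append`), the constructor, and the simplified string operands are all assumptions made for this example.

```go
package main

import "fmt"

// ContentStreamOperation is a simplified stand-in for a single operation.
// Operands are plain strings here to keep the example self-contained.
type ContentStreamOperation struct {
	Operand string
	Params  []string
}

// ContentStreamOperations wraps the slice in a struct (hypothetical shape),
// leaving room for extra fields later without breaking callers.
type ContentStreamOperations struct {
	ops []*ContentStreamOperation
}

// NewContentStreamOperations builds the wrapper from zero or more operations.
func NewContentStreamOperations(ops ...*ContentStreamOperation) *ContentStreamOperations {
	return &ContentStreamOperations{ops: ops}
}

// Elements returns the underlying operations for iteration.
func (c *ContentStreamOperations) Elements() []*ContentStreamOperation {
	return c.ops
}

// Len returns the number of operations.
func (c *ContentStreamOperations) Len() int {
	return len(c.ops)
}

// Append adds operations to the end of the list.
func (c *ContentStreamOperations) Append(ops ...*ContentStreamOperation) {
	c.ops = append(c.ops, ops...)
}

func main() {
	cs := NewContentStreamOperations(
		&ContentStreamOperation{Operand: "BT"},
		&ContentStreamOperation{Operand: "Tj", Params: []string{"(Hello)"}},
	)
	cs.Append(&ContentStreamOperation{Operand: "ET"})

	for _, op := range cs.Elements() {
		fmt.Println(op.Operand, op.Params)
	}
}
```

Existing code that ranges over the typed slice would need to range over `cs.Elements()` instead, which is a mechanical change.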

@gunnsth
Contributor Author

gunnsth commented May 30, 2020

Now that CMaps support mapping one character code to multiple runes (a string) and back, it makes sense to update our text encoder interfaces in the future. Currently we have:

// TextEncoder defines the common methods that a text encoder implementation must have in UniDoc.
type TextEncoder interface {
	// String returns a string that describes the TextEncoder instance.
	String() string

	// Encode converts the Go unicode string to a PDF encoded string.
	Encode(str string) []byte

	// Decode converts PDF encoded string to a Go unicode string.
	Decode(raw []byte) string

	// RuneToCharcode returns the PDF character code corresponding to rune `r`.
	// The bool return flag is true if there was a match, and false otherwise.
	// This is usually implemented as RuneToGlyph->GlyphToCharcode
	RuneToCharcode(r rune) (CharCode, bool)

	// CharcodeToRune returns the rune corresponding to character code `code`.
	// The bool return flag is true if there was a match, and false otherwise.
	// This is usually implemented as CharcodeToGlyph->GlyphToRune
	CharcodeToRune(code CharCode) (rune, bool)

	// ToPdfObject returns a PDF Object that represents the encoding.
	ToPdfObject() core.PdfObject
}

It would make sense to have charcode -> string and string -> charcode methods, or maybe ones that process multiple codes at once instead of single ones.
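One way the string-based methods could look, as a rough sketch. The interface name `StringTextEncoder`, its method names, and the map-backed implementation are hypothetical and exist only for this example; `CharCode` is redeclared locally as a stand-in for the library type.

```go
package main

import "fmt"

// CharCode is a local stand-in for the library's character code type.
type CharCode uint32

// StringTextEncoder is a hypothetical interface where one character code
// can correspond to several runes (for example a ligature).
type StringTextEncoder interface {
	// CharcodeToString returns the string for `code`, and false if unmapped.
	CharcodeToString(code CharCode) (string, bool)
	// StringToCharcode returns the code for `s`, and false if unmapped.
	StringToCharcode(s string) (CharCode, bool)
}

// mapEncoder is a toy implementation backed by two lookup maps.
type mapEncoder struct {
	toStr  map[CharCode]string
	toCode map[string]CharCode
}

// newMapEncoder builds the reverse map from the forward map.
func newMapEncoder(m map[CharCode]string) *mapEncoder {
	e := &mapEncoder{toStr: m, toCode: make(map[string]CharCode, len(m))}
	for code, s := range m {
		e.toCode[s] = code
	}
	return e
}

func (e *mapEncoder) CharcodeToString(code CharCode) (string, bool) {
	s, ok := e.toStr[code]
	return s, ok
}

func (e *mapEncoder) StringToCharcode(s string) (CharCode, bool) {
	code, ok := e.toCode[s]
	return code, ok
}

// Compile-time check that mapEncoder satisfies the interface.
var _ StringTextEncoder = (*mapEncoder)(nil)

func main() {
	enc := newMapEncoder(map[CharCode]string{
		0x41: "A",
		0x01: "fi", // one character code, two runes
	})
	s, ok := enc.CharcodeToString(0x01)
	fmt.Println(s, ok)
	code, ok := enc.StringToCharcode("fi")
	fmt.Println(code, ok)
}
```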

@peterwilliams97
Contributor

Extractor.ExtractPageText() returns two statistics that I don't think anyone uses or will ever use.
Can we replace it with a function like Extractor.Extract() (*PageText, error)?
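To illustrate the difference for callers, here is a small sketch. The `Extractor` and `PageText` types below are simplified stand-ins defined only for this example, and the two discarded integers are assumed to be the statistics mentioned above; the proposed `Extract` is shown as a thin wrapper, which is just one possible migration path.

```go
package main

import "fmt"

// PageText is a simplified stand-in for the extracted page text.
type PageText struct {
	text string
}

// Text returns the extracted text.
func (pt *PageText) Text() string {
	return pt.text
}

// Extractor is a simplified stand-in that "extracts" a fixed string.
type Extractor struct {
	contents string
}

// ExtractPageText mimics the current shape: the page text plus two
// statistics (assumed here to be a character count and a miss count).
func (e *Extractor) ExtractPageText() (*PageText, int, int, error) {
	return &PageText{text: e.contents}, len([]rune(e.contents)), 0, nil
}

// Extract is the proposed shape: the statistics are dropped.
func (e *Extractor) Extract() (*PageText, error) {
	pt, _, _, err := e.ExtractPageText()
	return pt, err
}

func main() {
	e := &Extractor{contents: "Hello, PDF"}
	pt, err := e.Extract()
	if err != nil {
		fmt.Println("extract failed:", err)
		return
	}
	fmt.Println(pt.Text())
}
```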

@peterwilliams97
Contributor

Text extraction is now aware of paragraph and line structure. We can therefore write a search function that returns the bounding boxes of the line fragments of the matching text, even when the match spans multiple lines or multiple paragraphs.

@progamer71

Support creating and managing PDF/A-3 files with file attachments.

@gunnsth
Contributor Author

gunnsth commented Aug 27, 2020

@progamer71 That is on our radar, but it is not what this ticket is about. This ticket is about API compatibility and possible major changes in the upcoming v4. PDF/A-3 is not part of our API yet, so it is not a concern here. It would make sense to create a new issue for that, with more details, if there is not one already.
See #11

@gunnsth
Contributor Author

gunnsth commented Aug 27, 2020

NewPdfFontFromTTFFile and NewCompositePdfFontFromTTFFile are a bit confusing. Users often try NewPdfFontFromTTFFile and then use symbols that are not in the simple encoding, so those symbols are not displayed.
It would be nice if NewPdfFontFromTTFFile could handle this, so that the second function would not be needed.

@gunnsth
Contributor Author

gunnsth commented Oct 7, 2020

In V4: We should change content stream processing.
Currently we have

func (p *PdfPage) GetAllContentStreams() (string, error) {

which returns a string. The problem is that content streams can get very big, and working with them as strings leads to copying, which is inefficient and memory intensive.

It may be feasible to create a new type in contentstream, called ContentStream, to represent the content stream. It could be worked with as a byte slice, avoiding copies unless absolutely necessary.
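A rough sketch of such a type, under stated assumptions: the method set (`Bytes`, `Len`, `Reader`, `String`) and the constructor are invented for this example and are not an agreed design. The point is only that most access paths can share the underlying bytes, with the copy confined to an explicit `String` call.

```go
package main

import (
	"bytes"
	"fmt"
)

// ContentStream is a hypothetical type that holds content stream data
// as a byte slice instead of a string.
type ContentStream struct {
	data []byte
}

// NewContentStream wraps `data` without copying it.
func NewContentStream(data []byte) *ContentStream {
	return &ContentStream{data: data}
}

// Bytes returns the underlying data without copying.
// Callers should treat the result as read-only.
func (cs *ContentStream) Bytes() []byte {
	return cs.data
}

// Len returns the size of the content stream in bytes.
func (cs *ContentStream) Len() int {
	return len(cs.data)
}

// Reader returns a reader over the data, again without copying.
func (cs *ContentStream) Reader() *bytes.Reader {
	return bytes.NewReader(cs.data)
}

// String converts to a string. This is the one place a copy is made,
// so it should only be used when a string is really required.
func (cs *ContentStream) String() string {
	return string(cs.data)
}

func main() {
	data := []byte("BT /F1 12 Tf (Hello) Tj ET")
	cs := NewContentStream(data)

	// Search the stream in place, without converting to a string.
	fmt.Println(cs.Len(), bytes.Contains(cs.Bytes(), []byte("Tj")))
}
```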

@gunnsth
Contributor Author

gunnsth commented Jan 11, 2021

Deprecate creator.Paragraph in favor of creator.StyledParagraph

@gunnsth
Contributor Author

gunnsth commented Jan 13, 2021

Remove model.ImageHandling or make it internal. Alternatively, it could be redesigned so that it is actually usable for providing handlers for loading images. At the moment this functionality is not well maintained and would need more testing.

StreamEncoders could be designed so that they can be registered, allowing an external handler to be plugged in (in particular for image handling). Returning decoded images as a []byte stream (data) may not be ideal, and sometimes we load an image and convert between models multiple times, which is not efficient.
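The registration idea could be sketched roughly as follows. Everything here is hypothetical: the `StreamDecoder` interface, the `RegisterStreamDecoder` and `NewStreamDecoder` functions, and the "ExampleDecode" filter name are made up for illustration and do not describe the existing encoder types.

```go
package main

import (
	"fmt"
	"sync"
)

// StreamDecoder is a hypothetical minimal decoder interface.
type StreamDecoder interface {
	Decode(encoded []byte) ([]byte, error)
}

// DecoderFactory creates a fresh decoder instance.
type DecoderFactory func() StreamDecoder

var (
	registryMu sync.RWMutex
	registry   = map[string]DecoderFactory{}
)

// RegisterStreamDecoder associates a filter name with a factory.
// A later registration for the same name replaces the earlier one.
func RegisterStreamDecoder(filter string, factory DecoderFactory) {
	registryMu.Lock()
	defer registryMu.Unlock()
	registry[filter] = factory
}

// NewStreamDecoder looks up the factory registered for `filter`.
func NewStreamDecoder(filter string) (StreamDecoder, error) {
	registryMu.RLock()
	factory, ok := registry[filter]
	registryMu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no decoder registered for filter %q", filter)
	}
	return factory(), nil
}

// passThroughDecoder returns its input unchanged (a toy external handler).
type passThroughDecoder struct{}

func (passThroughDecoder) Decode(encoded []byte) ([]byte, error) {
	return encoded, nil
}

func main() {
	RegisterStreamDecoder("ExampleDecode", func() StreamDecoder {
		return passThroughDecoder{}
	})

	dec, err := NewStreamDecoder("ExampleDecode")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	out, err := dec.Decode([]byte("raw image bytes"))
	fmt.Println(string(out), err)
}
```

An image-specific variant could return a decoded image value rather than []byte, which would address the repeated conversion concern, but that is beyond this sketch.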

@gunnsth
Contributor Author

gunnsth commented Jan 14, 2021

Text extraction should have options. Possible options:

  • Raw -> Just get the plain (decoded) text from the content streams. Should be very fast, with very consistent output (independent of table detection algorithms). Good for benchmarking against.
  • Raw sorted -> Raw text processed to sort it (top-down, left-right).
  • Cells/Tabular -> Apply table detection to the text and group text together into cells. Final output is sorted top-down, left-right by the upper-left coordinate of each grouped cell.
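As a sketch of how these options might surface in an API, the following uses an options struct with a mode field. The names (`ExtractMode`, `ExtractOptions`, `TextMark`, `ExtractText`) are assumptions for this example, the text marks are supplied directly rather than parsed from a PDF, and the tabular mode is deliberately left unimplemented. Sorting assumes PDF user space, where larger Y is higher on the page.

```go
package main

import (
	"fmt"
	"sort"
)

// ExtractMode selects how extracted text is post-processed (hypothetical).
type ExtractMode int

const (
	ModeRaw       ExtractMode = iota // content stream order, no processing
	ModeRawSorted                    // sorted top-down, left-right
	ModeTabular                      // table detection, not implemented here
)

// ExtractOptions carries the extraction settings.
type ExtractOptions struct {
	Mode ExtractMode
}

// TextMark is a piece of text with the position of its origin.
type TextMark struct {
	Text string
	X, Y float64
}

// ExtractText returns the mark texts ordered according to opts.Mode.
func ExtractText(marks []TextMark, opts ExtractOptions) ([]string, error) {
	ordered := make([]TextMark, len(marks))
	copy(ordered, marks)

	switch opts.Mode {
	case ModeRaw:
		// Keep content stream order.
	case ModeRawSorted:
		sort.SliceStable(ordered, func(i, j int) bool {
			if ordered[i].Y != ordered[j].Y {
				return ordered[i].Y > ordered[j].Y // higher on the page first
			}
			return ordered[i].X < ordered[j].X // then left to right
		})
	default:
		return nil, fmt.Errorf("extract mode %d is not supported in this sketch", opts.Mode)
	}

	texts := make([]string, 0, len(ordered))
	for _, m := range ordered {
		texts = append(texts, m.Text)
	}
	return texts, nil
}

func main() {
	marks := []TextMark{
		{Text: "b", X: 50, Y: 700},
		{Text: "c", X: 10, Y: 600},
		{Text: "a", X: 10, Y: 700},
	}
	raw, _ := ExtractText(marks, ExtractOptions{Mode: ModeRaw})
	sorted, _ := ExtractText(marks, ExtractOptions{Mode: ModeRawSorted})
	fmt.Println(raw, sorted)
}
```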

@gunnsth
Contributor Author

gunnsth commented Feb 12, 2021

Unify ContentStreamProcessor based on usage in render and extractor packages. Should be able to keep track of graphics and text state there in one place.


3 participants