Extract images with position in package extractor (v3) #330

gunnsth · 2019-01-23T10:16:49Z

Image extraction support in extractor package. Can extract image data with coordinate information (position and dimensions). Implements #317.

codecov · 2019-01-23T10:17:09Z

Codecov Report

Merging #330 into v3 will increase coverage by 6%.
The diff coverage is 72.47%.

@@            Coverage Diff            @@
##               v3     #330     +/-   ##
=========================================
+ Coverage   51.34%   57.34%     +6%     
=========================================
  Files         144      145      +1     
  Lines       25098    25208    +110     
=========================================
+ Hits        12886    14456   +1570     
+ Misses      10488    10389     -99     
+ Partials     1724      363   -1361

Impacted Files	Coverage Δ
pdf/extractor/image.go	`71.81% <72.47%> (ø)`
pdf/creator/division.go	`89.13% <0%> (+1.08%)`	⬆️
pdf/model/outline.go	`86.74% <0%> (+1.2%)`	⬆️
pdf/creator/invoice.go	`90.07% <0%> (+1.21%)`	⬆️
pdf/model/annotations.go	`22.64% <0%> (+1.56%)`	⬆️
pdf/creator/toc_line.go	`75.39% <0%> (+1.58%)`	⬆️
pdf/internal/textencoding/truetype.go	`16% <0%> (+2%)`	⬆️
pdf/creator/chapters.go	`64.94% <0%> (+2.06%)`	⬆️
pdf/model/fonts/std.go	`90% <0%> (+2.5%)`	⬆️
... and 88 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2253e29...ff44143.

adrg

The code looks great overall. I left a couple of suggested changes, although some of them may be a bit subjective. The tests pass, and after following the logic it seems correct to me.

adrg · 2019-01-29T17:06:04Z

pdf/extractor/image.go

+			return ctx.processOperand(op, gs, resources)
+		})
+
+	err = processor.Process(resources)


Can just return the result of the Process method:

return processor.Process(resources)

adrg · 2019-01-29T17:06:44Z

pdf/extractor/image.go

+func (ctx *imageExtractContext) processOperand(op *contentstream.ContentStreamOperation, gs contentstream.GraphicsState, resources *model.PdfPageResources) error {
+	if op.Operand == "BI" && len(op.Params) == 1 {
+		// BI: Inline image.
+


Remove unnecessary empty line

adrg · 2019-01-29T17:13:29Z

pdf/extractor/image.go

+		if err != nil {
+			return err
+		}
+		xDim := gs.CTM.ScalingFactorX()


Unnecessary variables. Maybe assign directly to struct:

	imgMark := ImageMark{
		Image:  &rgbImg,
		Width:  gs.CTM.ScalingFactorX(),
		Height: gs.CTM.ScalingFactorY(),
		Angle:  gs.CTM.Angle(),
	}
	imgMark.X, imgMark.Y = gs.CTM.Translation()
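
For readers unfamiliar with how position and dimensions fall out of the current transformation matrix (CTM), here is a rough, self-contained sketch. `Matrix`, `Concat`, `Mark` and `MarkFromCTM` are hypothetical names for illustration, not the unidoc types. It assumes the PDF convention that an image is drawn into the unit square, and it uses `atan2(b, a)` for the angle; the library's own `Angle()` may use a different sign convention.

```go
package main

import (
	"fmt"
	"math"
)

// Matrix is a hypothetical affine transform in PDF operand order
// [a b c d e f]: (x, y) maps to (a*x + c*y + e, b*x + d*y + f).
type Matrix [6]float64

// Concat returns the CTM after a `cm` operator with operands m is applied
// on top of the current matrix ctm (m takes effect first, then ctm).
func Concat(m, ctm Matrix) Matrix {
	return Matrix{
		m[0]*ctm[0] + m[1]*ctm[2],
		m[0]*ctm[1] + m[1]*ctm[3],
		m[2]*ctm[0] + m[3]*ctm[2],
		m[2]*ctm[1] + m[3]*ctm[3],
		m[4]*ctm[0] + m[5]*ctm[2] + ctm[4],
		m[4]*ctm[1] + m[5]*ctm[3] + ctm[5],
	}
}

// Mark is a hypothetical, simplified stand-in for the ImageMark in the review.
type Mark struct {
	X, Y, Width, Height, Angle float64
}

// MarkFromCTM derives the placement of a unit-square image drawn under ctm.
func MarkFromCTM(ctm Matrix) Mark {
	return Mark{
		X:      ctm[4],
		Y:      ctm[5],
		Width:  math.Hypot(ctm[0], ctm[1]), // length of the transformed x axis
		Height: math.Hypot(ctm[2], ctm[3]), // length of the transformed y axis
		Angle:  math.Atan2(ctm[1], ctm[0]) * 180 / math.Pi,
	}
}

func main() {
	identity := Matrix{1, 0, 0, 1, 0, 0}
	ctm := Concat(Matrix{1, 0, 0, 1, 100, 200}, identity) // outer: translate
	ctm = Concat(Matrix{50, 0, 0, 80, 0, 0}, ctm)         // inner: scale
	fmt.Printf("%+v\n", MarkFromCTM(ctm))
}
```

With the nested translate and scale in `main`, the image ends up 50 wide and 80 high with its origin at (100, 200), which is the kind of case the nested-matrix test in this PR is meant to cover.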

adrg · 2019-01-29T17:21:09Z

pdf/extractor/image.go

+
+		_, xtype := resources.GetXObjectByName(*name)
+		if xtype == model.XObjectTypeImage {
+			common.Log.Debug(" XObject Image: %s", *name)


Remove space at the beginning of the " XObject Image: %s" string

adrg · 2019-01-29T17:22:29Z

pdf/extractor/image.go

+				}
+				ctx.cacheXObjectImages[stream] = cimg
+			}
+			img := cimg.image


Unnecessary variables. Maybe use directly:

	rgbImg, err := cimg.cs.ImageToRGB(*cimg.image)
	if err != nil {
		return err
	}
	common.Log.Debug("@Do CTM: %s", gs.CTM.String())
	imgMark := ImageMark{
		Image:  &rgbImg,
		Width:  gs.CTM.ScalingFactorX(),
		Height: gs.CTM.ScalingFactorY(),
		Angle:  gs.CTM.Angle(),
	}
	imgMark.X, imgMark.Y = gs.CTM.Translation()
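
The diff context for this comment stores decoded images in `ctx.cacheXObjectImages`, keyed by stream. A minimal sketch of that caching idea follows, assuming the goal is to decode an XObject only once even when it is drawn several times. The `stream`, `decoded` and `imageCache` types are hypothetical placeholders, not the PR's actual types.

```go
package main

import "fmt"

// stream is a hypothetical stand-in for a PDF stream object.
type stream struct{ data []byte }

// decoded is a hypothetical stand-in for a decoded image.
type decoded struct{ size int }

// imageCache memoizes decoded images by stream pointer, so an XObject that
// is drawn several times on a page is decoded only once.
type imageCache struct {
	entries map[*stream]*decoded
	decodes int // number of times the decode step actually ran
}

func newImageCache() *imageCache {
	return &imageCache{entries: map[*stream]*decoded{}}
}

// get returns the cached image for s, decoding it on first use.
func (c *imageCache) get(s *stream) *decoded {
	if img, ok := c.entries[s]; ok {
		return img
	}
	c.decodes++
	img := &decoded{size: len(s.data)} // placeholder for real decoding
	c.entries[s] = img
	return img
}

func main() {
	c := newImageCache()
	s := &stream{data: []byte{1, 2, 3}}
	a, b := c.get(s), c.get(s)
	fmt.Println(a == b, c.decodes)
}
```

Keying by pointer means two distinct stream objects with identical bytes are cached separately, which is cheap to check but assumes the parser hands back the same object for repeated references.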

adrg · 2019-01-29T17:28:13Z

pdf/extractor/image.go

+		return nil, err
+	}
+
+	images := &PageImages{


Maybe return directly:

	return &PageImages{
		Images: ctx.extractedImages,
	}, nil

adrg · 2019-01-29T17:29:07Z

pdf/extractor/image.go

+	cs    model.PdfColorspace
+}
+
+func (ctx *imageExtractContext) extractImagesInContentStream(contents string, resources *model.PdfPageResources) error {


This method could be named extractContentStreamImages instead of extractImagesInContentStream.

adrg · 2019-01-29T17:43:52Z

pdf/extractor/image.go

+}
+
+// Process individual content stream operands for image extraction.
+func (ctx *imageExtractContext) processOperand(op *contentstream.ContentStreamOperation, gs contentstream.GraphicsState, resources *model.PdfPageResources) error {


This method seems too long to me. It may be better to split it into separate methods. Something like:

	func (ctx *imageExtractContext) processOperand(op *contentstream.ContentStreamOperation, gs contentstream.GraphicsState, resources *model.PdfPageResources) error {
		lenParams := len(op.Params)
		if op.Operand == "BI" && lenParams == 1 {
			iimg, ok := op.Params[0].(*contentstream.ContentStreamInlineImage)
			if !ok {
				return nil
			}

			return ctx.extractInlineImage(iimg, gs, resources)
		} else if op.Operand == "Do" && lenParams == 1 {
			name, ok := core.GetName(op.Params[0])
			if !ok {
				return errTypeCheck
			}

			_, xtype := resources.GetXObjectByName(*name)
			if xtype == model.XObjectTypeImage {
				return ctx.extractXObjectImage(name, gs, resources)
			} else if xtype == model.XObjectTypeForm {
				return ctx.extractFormImages(name, gs, resources)
			}
		}

		return nil
	}
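
To make the shape of that split concrete, here is a self-contained sketch that can be run outside the repository. Everything in it is a simplified, hypothetical stand-in (`op`, `extractor`, string parameters, handlers that only record their name); it is not the unidoc API. The point is only that `processOperand` decides which handler applies, and each handler owns its own logic.

```go
package main

import "fmt"

// op is a hypothetical, simplified content stream operation.
type op struct {
	operand string
	params  []string
}

// xobjectKind distinguishes the XObject types relevant to image extraction.
type xobjectKind int

const (
	kindUnknown xobjectKind = iota
	kindImage
	kindForm
)

// extractor records which handler ran for each operation.
type extractor struct {
	xobjects map[string]xobjectKind
	handled  []string
}

func (e *extractor) extractInlineImage(data string) error {
	e.handled = append(e.handled, "inline:"+data)
	return nil
}

func (e *extractor) extractXObjectImage(name string) error {
	e.handled = append(e.handled, "image:"+name)
	return nil
}

func (e *extractor) extractFormImages(name string) error {
	e.handled = append(e.handled, "form:"+name)
	return nil
}

// processOperand only dispatches; unknown operands and XObjects are skipped.
func (e *extractor) processOperand(o op) error {
	if len(o.params) != 1 {
		return nil
	}
	switch o.operand {
	case "BI": // inline image
		return e.extractInlineImage(o.params[0])
	case "Do": // draw a named XObject
		switch e.xobjects[o.params[0]] {
		case kindImage:
			return e.extractXObjectImage(o.params[0])
		case kindForm:
			return e.extractFormImages(o.params[0])
		}
	}
	return nil
}

func main() {
	e := &extractor{xobjects: map[string]xobjectKind{"Im0": kindImage, "Fm0": kindForm}}
	ops := []op{{"BI", []string{"raw"}}, {"Do", []string{"Im0"}}, {"Do", []string{"Fm0"}}, {"Tj", []string{"text"}}}
	for _, o := range ops {
		if err := e.processOperand(o); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println(e.handled)
}
```

The sketch uses a `switch` where the suggestion uses `if`/`else if`; either reads fine, and the choice is a matter of taste once the handlers are separate methods.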

gunnsth · 2019-01-29T22:48:36Z

@adrg Thanks. I have addressed the comments, please take a look.

adrg · 2019-01-30T17:01:14Z

@gunnsth All review points resolved. Looks good.

gunnsth added 3 commits January 23, 2019 09:32

Image extraction support in extractor package. Implements #317.

3e52ad9

Blueprint for testing image extractor

3a2b2fa

Update extractor const.go

b0cbbf2

gunnsth added this to the v3.0.0-alpha.2 milestone Jan 23, 2019

gunnsth added 5 commits January 28, 2019 22:40

Fix image caching mechanism. Check for nils.

4078900

Test case with nested transform matrices. Use testify to make tests cleaner.

6b4edc4

Merge branch 'v3' into v3-extract-images-with-position

1824e46

Add more test cases for image extraction, including inline and multi images.

77840ff

Merge branch 'v3-extract-images-with-position' of ssh://github.com/unidoc/unidoc into v3-extract-images-with-position

cd7f589

gunnsth changed the title ~~WIP: Extract images with position in package extractor (v3)~~ Extract images with position in package extractor (v3) Jan 28, 2019

gunnsth requested a review from adrg January 28, 2019 23:50

adrg reviewed Jan 29, 2019

Address PR review comments

ff44143

gunnsth merged commit 3121645 into v3 Jan 30, 2019

gunnsth deleted the v3-extract-images-with-position branch January 30, 2019 17:31
