Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract images with position in package extractor (v3) #330

Merged
merged 9 commits into from
Jan 30, 2019

Conversation

gunnsth
Copy link
Contributor

@gunnsth gunnsth commented Jan 23, 2019

Image extraction support in extractor package. Can extract image data with coordinate information (position and dimensions). Implements #317.

@gunnsth gunnsth added this to the v3.0.0-alpha.2 milestone Jan 23, 2019
@codecov
Copy link

codecov bot commented Jan 23, 2019

Codecov Report

Merging #330 into v3 will increase coverage by 6%.
The diff coverage is 72.47%.

Impacted file tree graph

@@            Coverage Diff            @@
##               v3     #330     +/-   ##
=========================================
+ Coverage   51.34%   57.34%     +6%     
=========================================
  Files         144      145      +1     
  Lines       25098    25208    +110     
=========================================
+ Hits        12886    14456   +1570     
+ Misses      10488    10389     -99     
+ Partials     1724      363   -1361
Impacted Files Coverage Δ
pdf/extractor/image.go 71.81% <72.47%> (ø)
pdf/creator/division.go 89.13% <0%> (+1.08%) ⬆️
pdf/model/outline.go 86.74% <0%> (+1.2%) ⬆️
pdf/creator/invoice.go 90.07% <0%> (+1.21%) ⬆️
pdf/model/annotations.go 22.64% <0%> (+1.56%) ⬆️
pdf/creator/toc_line.go 75.39% <0%> (+1.58%) ⬆️
pdf/internal/textencoding/truetype.go 16% <0%> (+2%) ⬆️
pdf/creator/chapters.go 64.94% <0%> (+2.06%) ⬆️
pdf/model/fonts/std.go 90% <0%> (+2.5%) ⬆️
... and 88 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2253e29...ff44143. Read the comment docs.

@gunnsth gunnsth changed the title WIP: Extract images with position in package extractor (v3) Extract images with position in package extractor (v3) Jan 28, 2019
@gunnsth gunnsth requested a review from adrg January 28, 2019 23:50
Copy link
Collaborator

@adrg adrg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks great overall. I left a couple of suggested changes, although some of them may be a bit subjective. The tests are passing and I followed the logic and it seems correct to me.

return ctx.processOperand(op, gs, resources)
})

err = processor.Process(resources)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just return the result of the Process method:

return processor.Process(resources)

func (ctx *imageExtractContext) processOperand(op *contentstream.ContentStreamOperation, gs contentstream.GraphicsState, resources *model.PdfPageResources) error {
if op.Operand == "BI" && len(op.Params) == 1 {
// BI: Inline image.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove unnecessary empty line

if err != nil {
return err
}
xDim := gs.CTM.ScalingFactorX()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary variables. Maybe assign directly to struct:

imgMark := ImageMark{
        Image:  &rgbImg,
        Width:  gs.CTM.ScalingFactorX(),
        Height: gs.CTM.ScalingFactorY(),
        Angle:  gs.CTM.Angle(),
}
imgMark.X, imgMark.Y = gs.CTM.Translation()


_, xtype := resources.GetXObjectByName(*name)
if xtype == model.XObjectTypeImage {
common.Log.Debug(" XObject Image: %s", *name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove space at the beginning of the " XObject Image: %s" string

}
ctx.cacheXObjectImages[stream] = cimg
}
img := cimg.image
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary variables. Maybe use directly:

rgbImg, err := cimg.cs.ImageToRGB(*cimg.image)
if err != nil {
        return err
}

common.Log.Debug("@Do CTM: %s", gs.CTM.String())
imgMark := ImageMark{
        Image:  &rgbImg,
        Width:  gs.CTM.ScalingFactorX(),
        Height: gs.CTM.ScalingFactorY(),
        Angle:  gs.CTM.Angle(),
}
imgMark.X, imgMark.Y = gs.CTM.Translation()

return nil, err
}

images := &PageImages{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe return directly:

return &PageImages{
        Images: ctx.extractedImages,
}, nil

cs model.PdfColorspace
}

func (ctx *imageExtractContext) extractImagesInContentStream(contents string, resources *model.PdfPageResources) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method could be named extractContentStreamImages instead of extractImagesInContentStream.

}

// Process individual content stream operands for image extraction.
func (ctx *imageExtractContext) processOperand(op *contentstream.ContentStreamOperation, gs contentstream.GraphicsState, resources *model.PdfPageResources) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems too long to me. May be better to split it in separate methods. Something like:

func (ctx *imageExtractContext) processOperand(op *contentstream.ContentStreamOperation, gs contentstream.GraphicsState, resources *model.PdfPageResources) error {
    lenParams := len(op.Params)
    if op.Operand == "BI" && lenParams == 1 {
        iimg, ok := op.Params[0].(*contentstream.ContentStreamInlineImage)
        if !ok {
            return nil
        }

        return ctx.extractInlineImage(iimg, gs, resources)
    } else if op.Operand == "Do" && lenParams == 1 {
        name, ok := core.GetName(op.Params[0])
        if !ok {
            return errTypeCheck
        }

        _, xtype := resources.GetXObjectByName(*name)
        if xtype == model.XObjectTypeImage {
            return ctx.extractXObjectImage(name, gs, resources)
        } else if xtype == model.XObjectTypeForm {
            return ctx.extractFormImages(name, gs, resources)
        }
    }

    return nil
}

@gunnsth
Copy link
Contributor Author

gunnsth commented Jan 29, 2019

@adrg Thanks. I have addressed the comments, please take a look.

@adrg
Copy link
Collaborator

adrg commented Jan 30, 2019

@gunnsth All review points resolved. Looks good.

@gunnsth gunnsth merged commit 3121645 into v3 Jan 30, 2019
@gunnsth gunnsth deleted the v3-extract-images-with-position branch January 30, 2019 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants