Extract images with position in package extractor (v3) #330
Codecov Report
@@            Coverage Diff            @@
##               v3     #330     +/-   ##
=========================================
+ Coverage   51.34%   57.34%      +6%
=========================================
  Files         144      145       +1
  Lines       25098    25208     +110
=========================================
+ Hits        12886    14456    +1570
+ Misses      10488    10389      -99
+ Partials     1724      363    -1361
…idoc/unidoc into v3-extract-images-with-position
The code looks great overall. I left a couple of suggested changes, although some of them may be a bit subjective. The tests pass, and I followed the logic, which seems correct to me.
pdf/extractor/image.go
Outdated
	return ctx.processOperand(op, gs, resources)
})

err = processor.Process(resources)
Can just return the result of the Process method:
return processor.Process(resources)
pdf/extractor/image.go
Outdated
func (ctx *imageExtractContext) processOperand(op *contentstream.ContentStreamOperation, gs contentstream.GraphicsState, resources *model.PdfPageResources) error {
	if op.Operand == "BI" && len(op.Params) == 1 {
		// BI: Inline image.

Remove unnecessary empty line
pdf/extractor/image.go
Outdated
if err != nil {
	return err
}
xDim := gs.CTM.ScalingFactorX()
Unnecessary variables. Maybe assign directly to struct:
imgMark := ImageMark{
	Image:  &rgbImg,
	Width:  gs.CTM.ScalingFactorX(),
	Height: gs.CTM.ScalingFactorY(),
	Angle:  gs.CTM.Angle(),
}
imgMark.X, imgMark.Y = gs.CTM.Translation()
pdf/extractor/image.go
Outdated
_, xtype := resources.GetXObjectByName(*name)
if xtype == model.XObjectTypeImage {
	common.Log.Debug(" XObject Image: %s", *name)
Remove the space at the beginning of the " XObject Image: %s" string.
pdf/extractor/image.go
Outdated
	}
	ctx.cacheXObjectImages[stream] = cimg
}
img := cimg.image
Unnecessary variables. Maybe use directly:
rgbImg, err := cimg.cs.ImageToRGB(*cimg.image)
if err != nil {
	return err
}
common.Log.Debug("@Do CTM: %s", gs.CTM.String())
imgMark := ImageMark{
	Image:  &rgbImg,
	Width:  gs.CTM.ScalingFactorX(),
	Height: gs.CTM.ScalingFactorY(),
	Angle:  gs.CTM.Angle(),
}
imgMark.X, imgMark.Y = gs.CTM.Translation()
pdf/extractor/image.go
Outdated
	return nil, err
}

images := &PageImages{
Maybe return directly:
return &PageImages{
	Images: ctx.extractedImages,
}, nil
pdf/extractor/image.go
Outdated
	cs model.PdfColorspace
}

func (ctx *imageExtractContext) extractImagesInContentStream(contents string, resources *model.PdfPageResources) error {
This method could be named extractContentStreamImages instead of extractImagesInContentStream.
}

// Process individual content stream operands for image extraction.
func (ctx *imageExtractContext) processOperand(op *contentstream.ContentStreamOperation, gs contentstream.GraphicsState, resources *model.PdfPageResources) error {
This method seems too long to me. It may be better to split it into separate methods, something like:
func (ctx *imageExtractContext) processOperand(op *contentstream.ContentStreamOperation, gs contentstream.GraphicsState, resources *model.PdfPageResources) error {
	lenParams := len(op.Params)
	if op.Operand == "BI" && lenParams == 1 {
		iimg, ok := op.Params[0].(*contentstream.ContentStreamInlineImage)
		if !ok {
			return nil
		}
		return ctx.extractInlineImage(iimg, gs, resources)
	} else if op.Operand == "Do" && lenParams == 1 {
		name, ok := core.GetName(op.Params[0])
		if !ok {
			return errTypeCheck
		}
		_, xtype := resources.GetXObjectByName(*name)
		if xtype == model.XObjectTypeImage {
			return ctx.extractXObjectImage(name, gs, resources)
		} else if xtype == model.XObjectTypeForm {
			return ctx.extractFormImages(name, gs, resources)
		}
	}
	return nil
}
@adrg Thanks. I have addressed the comments, please take a look.
@gunnsth All review points resolved. Looks good.
Adds image extraction support to the extractor package. It can extract image data together with coordinate information (position and dimensions). Implements #317.