page.text only returns the copyright #82

Closed
bryanrasmussen opened this Issue Feb 5, 2013 · 3 comments

@bryanrasmussen

The PDF is generated from XSL-FO, so it contains real text (not scanned images). My code is the following:

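require "pdf-reader"

def pdf_text(pdflocation)
  # In the local environment, always read the bundled example file.
  pdflocation = "sinatra_resources/example.pdf" if $env == "local"
  reader = PDF::Reader.new(pdflocation)
  # Concatenate the text of every page.
  text = ""
  reader.pages.each do |page|
    text += page.text
  end
  text
end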

I call it with
document = pdf_text(pdflocation)
which returns only the last bit of text in the document, a copyright notice in the footer.

My env is local, so I'm sure I'm not somehow picking up the wrong PDF.
If I run pdftotext on the command line, I get the whole document out.
There is only one page in the test document. If I call reader.page(1).text I get the same thing: only the copyright in the page footer.

I suppose I have misunderstood something, but what? I would assume page.text should return all the text on the page.
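
For anyone trying to reproduce this, a per-page variant of the helper above can make it easier to see which page comes back short. This is only a sketch: `page_texts` and `pdf_text_from` are hypothetical names, not part of pdf-reader, and unlike the original code the pages are joined with a newline. The methods accept any object with `pages` whose elements respond to `text`, which is the shape PDF::Reader provides.

```ruby
# Sketch: return [page_number, text] pairs so a short page is easy to spot.
# `reader` is anything with #pages whose elements respond to #text.
def page_texts(reader)
  reader.pages.each_with_index.map { |page, index| [index + 1, page.text] }
end

# Join the text of all pages, separated by newlines (an assumption; the
# original helper concatenates without a separator).
def pdf_text_from(reader)
  page_texts(reader).map { |_number, text| text }.join("\n")
end

# Usage with the real library (assumes the pdf-reader gem is installed):
#   require "pdf-reader"
#   reader = PDF::Reader.new("sinatra_resources/example.pdf")
#   page_texts(reader).each { |n, t| puts "page #{n}: #{t.length} chars" }
```

Printing the character count per page alongside the pdftotext output should show quickly whether the body text is missing from page.text entirely or only partly.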

yob Feb 5, 2013

Owner

You've understood the API correctly; this is a bug of some kind.

Are you able to send me a copy of the PDF via email (I can keep it private)?

james@yob.id.au
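
If the file can't be shared, one possible way to narrow things down locally is to compare what page.text returns with what the page actually contains. The sketch below is not a confirmed diagnosis of this issue. `page_summary` is a hypothetical helper, and it assumes the page object responds to `text`, `raw_content` and `xobjects`, as PDF::Reader::Page does. The count of text-showing operators is a rough regex match over the raw content stream, so it can overcount if "Tj" appears inside a string literal.

```ruby
# Sketch: summarise what a page exposes, to compare against page.text.
# `page` must respond to #text, #raw_content and #xobjects.
def page_summary(page)
  # Treat the content stream as binary so the regex scan never fails
  # on bytes that are not valid in the string's declared encoding.
  raw = page.raw_content.to_s.b
  {
    text_chars: page.text.length,
    raw_bytes: raw.bytesize,
    # "Tj" and "TJ" are the PDF operators that show text (rough count).
    text_show_ops: raw.scan(/\b(?:Tj|TJ)\b/).length,
    # Names of XObjects referenced by the page's resources.
    xobject_names: page.xobjects.keys
  }
end

# Usage with the real library (assumes the pdf-reader gem is installed):
#   require "pdf-reader"
#   page = PDF::Reader.new("sinatra_resources/example.pdf").page(1)
#   p page_summary(page)
```

A small text_chars value next to many text-showing operators, or a non-empty list of XObject names, would suggest where the body text lives and would be useful detail to add to the report.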


bryanrasmussen Feb 5, 2013

Hi,

The one thing that I thought might be an issue is that it has a table of contents (TOC).

Thanks,
Bryan Rasmussen


yob Feb 14, 2017

Owner

Thanks for taking the time to report this. I'm going to close it due to inactivity, but feel free to re-open if required.


@yob yob closed this Feb 14, 2017
