Add table extraction capabilities #231

Closed
bonsonsm opened this issue Oct 14, 2015 · 9 comments
Labels
is-feature: A feature request
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

@bonsonsm

bonsonsm commented Oct 14, 2015

Hello,
I am converting a PDF file into a text file. I can extract the text of the table as is, but the extracted text gives no indication of where the table starts and ends. I want to know those boundaries so that I can do some post-processing on the table.

Below is my code:

import PyPDF2

def extractTextFromPDF(strDownloadDirectory, fileName, txtFilePath):
    filePathName = strDownloadDirectory + fileName
    # Drop the 4-character extension (e.g. ".pdf") and build the output path.
    txtFilePath = txtFilePath + fileName[:-4] + '.txt'
    with open(filePathName, 'rb') as pdfFileObj, \
            open(txtFilePath, 'w', encoding='utf-8') as target_file:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        intPages = pdfReader.getNumPages()
        print(intPages)
        print(fileName)
        for i in range(intPages):
            objPDFObj = pdfReader.getPage(i)
            strText = objPDFObj.extractText().rstrip()
            # Replace non-breaking spaces and collapse all whitespace runs.
            strText = " ".join(strText.replace(u"\xa0", " ").split())
            print(strText)
            # Write each page inside the loop so that no page is lost.
            target_file.write(strText + "\n")

Kindly suggest.

@mstamy2
Collaborator

mstamy2 commented Jan 4, 2016

PyPDF2's text extraction capabilities are somewhat primitive at the moment, though I want to make enhancing them a priority.

Consider PDFMiner or PDFBox for your immediate needs; they offer much more sophisticated text extraction.

It's not simple to recognize a table structure in PDF, but these libraries might be able to help.

@mstamy2 added the workflow-text-extraction label May 19, 2016
@sils

sils commented Nov 7, 2016

@mstamy2 any progress on this? I guess getting all text and its position might already be sufficient to cover most use cases.
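
As a rough illustration of the text-plus-position idea: assuming some extractor yields `(x, y, text)` fragments in PDF user space (origin at the bottom left, so larger `y` is higher on the page), table rows could be rebuilt by grouping fragments with similar `y` values. The function name, the tuple shape, and the tolerance value are hypothetical and are not part of PyPDF2.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, top to bottom, left to right.

    Hypothetical helper: `fragments` is assumed to come from any extractor
    that reports a position for each piece of text.
    """
    rows = []  # each entry is (y of the first fragment in the row, [(x, text), ...])
    # Visit fragments from the top of the page downwards.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            # Close enough vertically: same row as the previous fragment.
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Order the cells of each row by x position and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


fragments = [
    (72, 700, "Name"), (200, 700.5, "Qty"),
    (72, 680, "Apple"), (200, 680, "3"),
]
rows = group_into_rows(fragments)
```

Rows that share the same set of x positions over several consecutive lines would then be candidates for a table; that last step is left out of this sketch.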

@sils

sils commented Nov 7, 2016

CC @sims1253

@emanuelevivoli

emanuelevivoli commented Oct 12, 2021

@mstamy2 hello, I'm interested in the table side too. As the issue is still open, I suppose no progress has been made in this direction. Is that right?
Please let me know :)
Thanks

@MartinThoma
Member

Table extraction is super hard. There are libraries that attempt to do just that (I think "Tabula" and "Excalibur" were the names).

@MartinThoma added the is-feature label Jun 10, 2022
@MartinThoma
Member

I've just noticed "TABLE 10.29 Standard layout attributes" with

[image]

@MartinThoma changed the title from "Unable to identify the tables" to "Add table extraction capabilities" Jun 17, 2022
@MartinThoma
Member

I'm closing this for the moment as it distracts from other topics that seem more important at the moment.

If anybody has an idea how to approach this in a reasonable way: I'm open for discussions and I can re-open :-)

For people looking for solutions: The best I've got for you is https://pypi.org/project/camelot-py/ or developing something on your own, e.g. using a layout-preserving extraction (e.g. pdftotext -layout) + some heuristics.
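
A minimal sketch of what "layout-preserving extraction + some heuristics" could look like, assuming the text was already produced by something like `pdftotext -layout` and that columns are separated by runs of two or more spaces. The function names and thresholds are hypothetical, and the heuristic will misfire on justified prose or on tables with single-space gaps.

```python
import re

# Assumption: in layout-preserved text, two or more whitespace characters
# between words mark a column gap.
COLUMN_GAP = re.compile(r"\s{2,}")


def split_columns(line):
    """Split one layout-preserved line into its non-empty column cells."""
    return [cell for cell in COLUMN_GAP.split(line.strip()) if cell]


def find_table_regions(lines, min_columns=2, min_rows=2):
    """Return (start, end) line index pairs, end exclusive, for table-like runs.

    A line counts as a table row if it splits into at least `min_columns`
    cells; a run of at least `min_rows` such lines is reported as a region.
    """
    regions = []
    start = None
    for i, line in enumerate(lines):
        is_row = len(split_columns(line)) >= min_columns
        if is_row and start is None:
            start = i  # a candidate table begins here
        elif not is_row and start is not None:
            if i - start >= min_rows:
                regions.append((start, i))
            start = None
    # Close a run that reaches the end of the text.
    if start is not None and len(lines) - start >= min_rows:
        regions.append((start, len(lines)))
    return regions


lines = [
    "Quarterly report",
    "",
    "Item        Qty    Price",
    "Apple       3      1.20",
    "Pear        10     0.80",
    "",
    "Total is shown above.",
]
regions = find_table_regions(lines)
table_rows = [split_columns(line) for s, e in regions for line in lines[s:e]]
```

The returned index pairs give the start and end of each suspected table, which is the information the original question asked for; the cells themselves come from `split_columns`.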

@pubpub-zz
Collaborator

@MartinThoma
Perhaps you should open a discussion listing all the features that have been set aside for the moment, so that they are easy to find.

@MartinThoma
Member

Good idea @pubpub-zz 👍 See #1181
