autobot - PDF parsing and extraction utility using Apache Tika

Autobot parses PDF files with Apache Tika and extracts the title, authorString and content of IEEE Xplore PDFs.

Download the utility jar from: https://github.com/sidmishraw/autobot/blob/master/build/libs/autobot-1.0.0.jar

Description:

It requires two inputs:

1> Absolute file path of a file named “conf.txt”

This file lists the file paths of the input PDF documents, one path per line.

For example:

path-to-pdfs\04403110.pdf
path-to-pdfs\04403128.pdf
path-to-pdfs\04403127.pdf

2> Absolute file path of the output directory.
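The conf.txt format described above (one PDF path per line) can be read with a few lines of standard Java. The following is only a minimal sketch: the class name ConfReader and the method readPdfPaths are hypothetical and are not taken from autobot's source, and skipping blank lines is an assumption about how such a file would reasonably be handled.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of autobot's published code.
class ConfReader {

    // Reads one PDF path per line, trimming whitespace and skipping blank lines.
    static List<Path> readPdfPaths(Path confFile) throws IOException {
        List<Path> pdfPaths = new ArrayList<>();
        for (String line : Files.readAllLines(confFile, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                pdfPaths.add(Paths.get(trimmed));
            }
        }
        return pdfPaths;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the absolute path to conf.txt.
        for (Path pdf : readPdfPaths(Paths.get(args[0]))) {
            System.out.println(pdf);
        }
    }
}
```

Running the sketch with the path to conf.txt as its only argument prints each listed PDF path on its own line, which is a quick way to check the file before passing it to the jar.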

Usage:

java -jar autobot-1.0.0.jar "path-to-conf.txt" "path-to-output-directory"

For example:

java -jar autobot-1.0.0.jar "/Users/sidmishraw/Downloads/conf.txt" "/Users/sidmishraw/Downloads/outpdfs"

Caveats:

• It cannot isolate the exact author names. Instead, it extracts the text from the author-name area of the PDF and groups it into a list named “authorString”.

{
  "title": "Incompleteness Errors in Ontology",
  "authorString": [
    "1 Muhammad Abdul Qadir, 2Muhammad Fahad, 3Syed Adnan Hussain Shah Muhammad Ali Jinnah University, Islamabad, Pakistan",
    "1aqadir@jinnah.edu.pk, 2mhd.fahad@gmail.com, 3syedadnan@gmail.com"
  ],
  "content": "Abstract\nOntology ev…"
}

As the example above shows, when numbered markers appear in front of the names, they are difficult to remove and remain in the output.
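One possible post-processing step for such markers is sketched below. It is a heuristic and not something autobot itself does: the class AuthorStringCleaner and the method stripMarkers are hypothetical names, and the sketch assumes that markers are plain digits at the start of each comma-separated segment. It does not separate names from affiliations, and it would also strip digits from an e-mail address that legitimately begins with a number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of autobot.
class AuthorStringCleaner {

    // Digits (and any following spaces) at the very start of a segment.
    private static final Pattern LEADING_MARKER = Pattern.compile("^\\d+\\s*");

    // Splits one authorString entry on commas and drops the leading
    // numeric marker from each segment.
    static List<String> stripMarkers(String authorLine) {
        List<String> cleaned = new ArrayList<>();
        for (String segment : authorLine.split(",")) {
            String stripped = LEADING_MARKER.matcher(segment.trim()).replaceFirst("");
            if (!stripped.isEmpty()) {
                cleaned.add(stripped);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        // Input taken from the e-mail line of the example above.
        String line = "1aqadir@jinnah.edu.pk, 2mhd.fahad@gmail.com, 3syedadnan@gmail.com";
        for (String entry : stripMarkers(line)) {
            System.out.println(entry);
        }
    }
}
```

Applied to the first authorString entry of the example, the same method would remove the "1", "2" and "3" prefixes but leave "Syed Adnan Hussain Shah Muhammad Ali Jinnah University" fused, since no comma separates the last name from the affiliation.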

Some PDF documents turn out well:

{
  "title": "Privacy Preserving Collaborative Filtering using Data Obfuscation",
  "authorString": [
    "Rupa Parameswaran Georgia Institute of Technology",
    "School of Electrical and Computer Engineering Atlanta, GA",
    "rupa@ece.gatech.edu",
    "Douglas M Blough Georgia Institute of Technology",
    "School of Electrical and Computer Engineering Atlanta, GA",
    "doug.blough@ece.gatech.edu"
  ],
  "content": "Abstract\n…"
}
