Autobot parses IEEE Xplore PDF files using Apache Tika and extracts the title, authorString and content of each document.
Please download the utility jar from: https://github.com/sidmishraw/autobot/blob/master/build/libs/autobot-1.0.0.jar
Description:
It requires 2 inputs:
1> Absolute file-path of a file named “conf.txt”
This file lists the file-paths of all input PDF documents, one per line.
For example:
path-to-pdfs\04403110.pdf
path-to-pdfs\04403128.pdf
path-to-pdfs\04403127.pdf
2> Absolute file-path of the output directory.
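A conf.txt in the format described above can be generated with a short script instead of being typed by hand. This is a minimal sketch, not part of Autobot: write_conf is a hypothetical helper, and writing absolute paths is an assumption made to be safe, since only the conf.txt location itself is stated to be absolute.

```python
from pathlib import Path


def write_conf(pdf_dir, conf_path):
    """Write one PDF path per line to conf_path and return the paths written.

    Hypothetical helper: it only looks at files directly inside pdf_dir
    (no recursion) and matches the .pdf extension case-insensitively.
    """
    pdfs = sorted(
        p.resolve()
        for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # One path per line, matching the conf.txt layout shown above.
    Path(conf_path).write_text(
        "".join(f"{p}\n" for p in pdfs), encoding="utf-8"
    )
    return pdfs
```

Calling write_conf("path-to-pdfs", "conf.txt") would then produce the file to pass as the first argument.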
Usage:
java -jar autobot-1.0.0.jar "path-to-conf.txt" "path-to-output-directory"
For example:
java -jar autobot-1.0.0.jar "/Users/sidmishraw/Downloads/conf.txt" "/Users/sidmishraw/Downloads/outpdfs"
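For scripted or repeated runs, the same invocation can be wrapped in Python. This is a minimal sketch under a few assumptions: java is available on the PATH, build_command and run_autobot are hypothetical helper names, and creating the output directory beforehand is only a precaution, since it is not stated whether Autobot creates it.

```python
import os
import subprocess


def build_command(jar_path, conf_path, output_dir):
    """Return the argument list for the documented Autobot invocation."""
    return [
        "java", "-jar", jar_path,
        os.path.abspath(conf_path),   # first input: absolute path of conf.txt
        os.path.abspath(output_dir),  # second input: absolute output directory
    ]


def run_autobot(jar_path, conf_path, output_dir):
    """Run Autobot once and raise CalledProcessError on a non-zero exit code."""
    # Precaution (assumption): make sure the output directory exists.
    os.makedirs(output_dir, exist_ok=True)
    subprocess.run(build_command(jar_path, conf_path, output_dir), check=True)
```

Passing the arguments as a list rather than a single shell string means paths containing spaces need no extra quoting.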
Caveats:
• It cannot get the exact author names, but I’ve made it extract the text of the author-name area and group it together in a field named “authorString”. For example:
{
  "title": "Incompleteness Errors in Ontology",
  "authorString": [
    "1 Muhammad Abdul Qadir, 2Muhammad Fahad, 3Syed Adnan Hussain Shah Muhammad Ali Jinnah University, Islamabad, Pakistan",
    "1aqadir@jinnah.edu.pk, 2mhd.fahad@gmail.com, 3syedadnan@gmail.com"
  ],
  "content": "Abstract\nOntology ev…"
}
As you can see from the example, numbered markers in front of the names and email addresses (the 1, 2, 3 above) are still difficult to remove.
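Such markers can often be stripped in a post-processing step on the output. This is a minimal sketch, not something Autobot does: strip_numeric_markers is a hypothetical helper, and the regular expression is a heuristic that removes any run of digits sitting directly in front of a letter at the start of a word, so it would also damage legitimate text such as "3M" or an email address that really begins with a digit.

```python
import re

# Digits that start a "word" and are followed (after optional spaces) by a
# letter, e.g. "1 Muhammad", "2Muhammad", "3syedadnan@...".
_MARKER = re.compile(r"(?<![A-Za-z0-9])\d+\s*(?=[A-Za-z])")


def strip_numeric_markers(author_lines):
    """Return the authorString lines with leading numeric markers removed."""
    return [_MARKER.sub("", line).strip() for line in author_lines]
```

Applied to the authorString above, this yields "Muhammad Abdul Qadir, Muhammad Fahad, Syed Adnan Hussain Shah Muhammad Ali Jinnah University, Islamabad, Pakistan" and "aqadir@jinnah.edu.pk, mhd.fahad@gmail.com, syedadnan@gmail.com".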
Some PDF documents turn out well:
{
  "title": "Privacy Preserving Collaborative Filtering using Data Obfuscation",
  "authorString": [
    "Rupa Parameswaran Georgia Institute of Technology",
    "School of Electrical and Computer Engineering Atlanta, GA",
    "rupa@ece.gatech.edu",
    "Douglas M Blough Georgia Institute of Technology",
    "School of Electrical and Computer Engineering Atlanta, GA",
    "doug.blough@ece.gatech.edu"
  ],
  "content": "Abstract\n…"
}
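When a record comes out this clean, the email addresses can be separated from the name and affiliation lines with a few lines of Python. This is a minimal sketch: split_author_string is a hypothetical helper, and the pattern is a simplified email matcher, not a full validator.

```python
import re

# Simplified email pattern (assumption: good enough for typical author emails).
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def split_author_string(author_lines):
    """Split authorString lines into (emails, other_lines).

    Lines containing at least one email address contribute their addresses
    to the first list; all remaining lines are returned unchanged.
    """
    emails, other_lines = [], []
    for line in author_lines:
        found = _EMAIL.findall(line)
        if found:
            emails.extend(found)
        else:
            other_lines.append(line)
    return emails, other_lines
```

For the record above, the first list holds the two gatech.edu addresses and the second holds the four name and affiliation lines.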