This notebook shows some examples of using Apache Tika from within Python.

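The following cell imports the Python `tika` package. The package works with a Tika server that runs as a Java process: the Python code sends parsing requests to that server and gets the results back. In recent versions of the package, `tika.initVM()` appears to be kept only for backward compatibility, and the server is started when the first parsing request is made.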

In [1]:
from __future__ import print_function
import os
import sys
import xml.etree.ElementTree as ET

import tika
tika.initVM()
from tika import parser

The example files we have are in the `data` sub-directory and are:
* example.xlsx
* mydoc.pdf
* sun-flower-1536088_640.jpg

# PDF example

## Parse from file

In [3]:
pdf_plain = parser.from_file('data/mydoc.pdf', xmlContent=False)

In [4]:
pdf_plain

{'content': '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMy Document\n\n\nHello\tworld!\t\n\n\n',
 'metadata': {'AAPL:Keywords': '',
  'Author': ' Scott Hajek',
  'Content-Type': 'application/pdf',
  'Creation-Date': '2017-05-05T21:42:16Z',
  'Keywords': '',
  'Last-Modified': '2017-05-05T21:42:16Z',
  'Last-Save-Date': '2017-05-05T21:42:16Z',
  'X-Parsed-By': ['org.apache.tika.parser.DefaultParser',
   'org.apache.tika.parser.pdf.PDFParser'],
  'X-TIKA:parse_time_millis': '11',
  'access_permission:assemble_document': 'true',
  'access_permission:can_modify': 'true',
  'access_permission:can_print': 'true',
  'access_permission:can_print_degraded': 'true',
  'access_permission:extract_content': 'true',
  'access_permission:extract_for_accessibility': 'true',
  'access_permission:fill_in_form': 'true',
  'access_permission:modify_annotations': 'true',
  'cp:subject': 'test PDF doc',
  'created': 'Fri May 05 17:42:16 EDT 2017',
  '

Note that with the parameter `xmlContent=False`, the value under the `'content'` key is plain text. We can also get the content as XML (XHTML), which repeats some of the metadata as `<meta>` tags inside the content itself.

In [5]:
pdf_xml = parser.from_file('data/mydoc.pdf', xmlContent=True)

In [7]:
pdf_xml['content']

'<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta name="date" content="2017-05-05T21:42:16Z" />\n<meta name="pdf:docinfo:custom:AAPL:Keywords" content="" />\n<meta name="pdf:PDFVersion" content="1.3" />\n<meta name="pdf:docinfo:title" content="My Document" />\n<meta name="xmp:CreatorTool" content="Word" />\n<meta name="Keywords" content="" />\n<meta name="access_permission:modify_annotations" content="true" />\n<meta name="access_permission:can_print_degraded" content="true" />\n<meta name="subject" content="test PDF doc" />\n<meta name="AAPL:Keywords" content="" />\n<meta name="dc:creator" content=" Scott Hajek" />\n<meta name="dcterms:created" content="2017-05-05T21:42:16Z" />\n<meta name="Last-Modified" content="2017-05-05T21:42:16Z" />\n<meta name="dcterms:modified" content="2017-05-05T21:42:16Z" />\n<meta name="dc:format" content="application/pdf; version=1.3" />\n<meta name="Last-Save-Date" content="2017-05-05T21:42:16Z" />\n<meta name="pdf:docinfo:creator_tool" content
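
Since the XML content carries metadata as `<meta>` tags, one way to read it back is with `xml.etree.ElementTree`. The sketch below is only an illustration: `meta_from_xhtml` is a hypothetical helper, and `sample_xhtml` is a shortened sample built from the first few tags in the output above, with the closing tags and `<body>` assumed because the real output is truncated.

```python
import xml.etree.ElementTree as ET

# Tika's XHTML output declares this default namespace on <html>.
XHTML_NS = '{http://www.w3.org/1999/xhtml}'

# Shortened sample built from a few <meta> tags in the output above.
# The closing tags and the <body> are assumptions (the output is truncated).
sample_xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n'
    '<meta name="date" content="2017-05-05T21:42:16Z" />\n'
    '<meta name="pdf:PDFVersion" content="1.3" />\n'
    '<meta name="pdf:docinfo:title" content="My Document" />\n'
    '</head>\n<body><p>Hello world!</p></body></html>'
)


def meta_from_xhtml(xhtml):
    """Collect <meta name=... content=...> pairs into a dict of lists (hypothetical helper)."""
    root = ET.fromstring(xhtml)
    meta = {}
    for tag in root.iter(XHTML_NS + 'meta'):
        name = tag.get('name')
        if name is not None:
            # A name can repeat (e.g. X-Parsed-By), so values are kept in a list.
            meta.setdefault(name, []).append(tag.get('content'))
    return meta


meta = meta_from_xhtml(sample_xhtml)
print(meta['pdf:docinfo:title'])
```

With a real result, the same helper would be called as `meta_from_xhtml(pdf_xml['content'])`. The namespace prefix matters: a search for a bare `meta` tag finds nothing in a namespaced document.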

## Parse from string/bytes

In [9]:
with open('data/mydoc.pdf', 'rb') as f:
    pdf_bytes = f.read()

In [12]:
pdf_bytes[:50]

b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4\xc6\n4 0 obj\n<< /Length 5 0 R /Fi'
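
The first bytes shown above are the PDF header, which includes the same version that Tika reports as `pdf:PDFVersion` in the XML output. As a small sanity check before sending a buffer to Tika, the header can be read directly. This is a minimal sketch: `pdf_version` is a hypothetical helper, and `pdf_head` is copied from the output above.

```python
import re

# First 50 bytes of data/mydoc.pdf, copied from the output above.
pdf_head = (b'%PDF-1.3\n%\xc4\xe5\xf2\xe5\xeb\xa7\xf3\xa0\xd0\xc4\xc6\n'
            b'4 0 obj\n<< /Length 5 0 R /Fi')


def pdf_version(data):
    """Return the version from a header such as b'%PDF-1.3', or None (hypothetical helper)."""
    match = re.match(br'%PDF-(\d+\.\d+)', data)
    return match.group(1).decode('ascii') if match else None


print(pdf_version(pdf_head))
```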

In [13]:
pdf_from_bytes = parser.from_buffer(pdf_bytes, xmlContent=False)

In [14]:
pdf_from_bytes['content']

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nMy Document\n\n\nHello\tworld!\t\n\n\n'
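
The plain-text content starts with a long run of newlines and keeps tabs between words. One possible clean-up, sketched below with a hypothetical helper `tidy_text`, drops empty lines and collapses tabs and repeated spaces. The sample string is the content shown above with the leading newlines shortened.

```python
import re

# Content from the output above, with the leading run of newlines shortened.
raw_content = '\n\n\n\n\n\nMy Document\n\n\nHello\tworld!\t\n\n\n'


def tidy_text(content):
    """Drop empty lines and collapse tabs/spaces in plain-text content (hypothetical helper)."""
    if content is None:
        # 'content' can be None, as in the image example in this notebook.
        return ''
    lines = []
    for line in content.splitlines():
        # Replace runs of tabs and spaces with one space, then trim the ends.
        line = re.sub(r'[ \t]+', ' ', line).strip()
        if line:
            lines.append(line)
    return '\n'.join(lines)


print(tidy_text(raw_content))
```

Whether collapsing tabs is appropriate depends on the document: for spreadsheet output the tabs separate cells and are worth keeping.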

# Excel example

In [16]:
xls_plain = parser.from_file('data/example.xlsx', xmlContent=False)

In [21]:
xls_plain['content']

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSheet1\n\tTitle\tAuthor\n\tTom Sawyer\tMark Twain\n\tOutliers\tMalcolm Gladwell\n\n\n\n\n\n\n\n\n\n\n\n/docProps/thumbnail.jpeg\n\n \n\nMM MOM GM\n\n;zszT\n\n \n\n \n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n/docProps/thumbnail.jpeg\n\n'
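
In the plain-text output above, each spreadsheet row appears as a line that starts with a tab, with cells separated by tabs. Under that assumption, the rows can be recovered with plain string handling. The helper name `rows_from_plain` is hypothetical, and `sheet_text` is an excerpt of the output above with the blank runs shortened.

```python
# Excerpt of xls_plain['content'] from the output above (blank runs shortened).
sheet_text = ('\n\nSheet1\n\tTitle\tAuthor\n\tTom Sawyer\tMark Twain\n'
              '\tOutliers\tMalcolm Gladwell\n\n\n/docProps/thumbnail.jpeg\n\n')


def rows_from_plain(content):
    """Split tab-indented lines into lists of cell values (hypothetical helper)."""
    rows = []
    for line in content.splitlines():
        # Assumption: rows are the lines that begin with a tab, as in the
        # output above; the sheet name and embedded-file paths do not.
        if line.startswith('\t'):
            rows.append(line[1:].split('\t'))
    return rows


rows = rows_from_plain(sheet_text)
header, records = rows[0], rows[1:]
print([dict(zip(header, record)) for record in records])
```

This relies on the layout seen in this one file; cells that themselves contain tabs or newlines would need the XML output instead.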

In [22]:
xls_plain['metadata']

{'Apple Multi-language Profile Name': '31 skSK(Všeobecný RGB profil) daDK(Generel RGB-beskrivelse) caES(Perfil RGB genèric) viVN(Cấu hình RGB Chung) ptBR(Perfil RGB Genérico) ukUA(Загальний профайл RGB) frFU(Profil générique RVB) huHU(Általános RGB profil) zhTW(通用 RGB 色彩描述) nbNO(Generisk RGB-profil) csCZ(Obecný RGB profil) heIL(פרופיל RGB כללי) itIT(Profilo RGB generico) roRO(Profil RGB generic) deDE(Allgemeines RGB-Profil) koKR(일반 RGB 프로파일) svSE(Generisk RGB-profil) zhCN(普通 RGB 描述文件) jaJP(一般 RGB プロファイル) elGR(Γενικό προφίλ RGB) ptPO(Perfil RGB genérico) nlNL(Algemeen RGB-profiel) esES(Perfil RGB genérico) thTH(โปรไฟล์ RGB ทั่วไป) trTR(Genel RGB Profili) fiFI(Yleinen RGB-profiili) hrHR(Generički RGB profil) plPL(Uniwersalny profil RGB) ruRU(Общий профиль RGB) arEG(ملف تعريف RGB العام) enUS(Generic RGB Profile)',
 'Application-Name': 'Microsoft Macintosh Excel',
 'Application-Version': '14.0300',
 'Author': 'Scott Hajek',
 'Blue Colorant': '(0.1566, 0.0845, 0.7196)',
 'Blue TRC': '0.0070

In [23]:
xls_xml = parser.from_file('data/example.xlsx', xmlContent=True)

In [27]:
xls_xml['content']

'<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta name="date" content="2017-05-05T21:55:47Z" />\n<meta name="extended-properties:AppVersion" content="14.0300" />\n<meta name="dc:creator" content="Scott Hajek" />\n<meta name="extended-properties:Company" content="Pivotal" />\n<meta name="dcterms:created" content="2017-05-05T21:53:22Z" />\n<meta name="Last-Modified" content="2017-05-05T21:55:47Z" />\n<meta name="dcterms:modified" content="2017-05-05T21:55:47Z" />\n<meta name="Last-Save-Date" content="2017-05-05T21:55:47Z" />\n<meta name="protected" content="false" />\n<meta name="meta:save-date" content="2017-05-05T21:55:47Z" />\n<meta name="Application-Name" content="Microsoft Macintosh Excel" />\n<meta name="modified" content="2017-05-05T21:55:47Z" />\n<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" />\n<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />\n<meta name="X-Parsed-By" content="org.apache.

## Work in progress
Parse the XML for the `<table>` tag

In [30]:
root = ET.fromstring('<root>{}</root>'.format(xls_xml['content']))

In [37]:
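# getchildren() was removed in Python 3.9, and Tika's XHTML elements are namespaced
root.find('.//{http://www.w3.org/1999/xhtml}table')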
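
One way this table lookup could be completed is sketched below. The elements in the XHTML belong to the `http://www.w3.org/1999/xhtml` namespace, so searches need a namespace mapping. The `sample` document is hypothetical: the real `xls_xml['content']` is truncated above, so the `<body>` layout here is an assumption, and `table_rows` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Prefix 'x' is arbitrary; it maps to the default namespace of the XHTML.
NS = {'x': 'http://www.w3.org/1999/xhtml'}

# Hypothetical XHTML shaped like a spreadsheet parse. The real body is not
# shown in this notebook, so this structure is an assumption.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body>'
    '<div><h1>Sheet1</h1><table><tbody>'
    '<tr><td>Title</td><td>Author</td></tr>'
    '<tr><td>Tom Sawyer</td><td>Mark Twain</td></tr>'
    '<tr><td>Outliers</td><td>Malcolm Gladwell</td></tr>'
    '</tbody></table></div></body></html>'
)


def table_rows(xhtml):
    """Return the cells of the first <table> as a list of rows (hypothetical helper)."""
    root = ET.fromstring(xhtml)
    table = root.find('.//x:table', NS)
    if table is None:
        return []
    return [
        [(td.text or '').strip() for td in tr.findall('x:td', NS)]
        for tr in table.findall('.//x:tr', NS)
    ]


for row in table_rows(sample):
    print(row)
```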

# Image example


In [44]:
jpg_plain = parser.from_file('data/sun-flower-1536088_640.jpg', xmlContent=False)

In [45]:
jpg_plain

{'content': None,
 'metadata': {'Component 1': 'Y component: Quantization table 0, Sampling factors 2 horiz/2 vert',
  'Component 2': 'Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert',
  'Component 3': 'Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert',
  'Compression Type': 'Baseline',
  'Content-Type': 'image/jpeg',
  'Creation-Date': '2016-07-04T10:21:49',
  'Data Precision': '8 bits',
  'Date/Time Original': '2016:07:04 14:21:49',
  'Exposure Time': '1/4000 sec',
  'F-Number': 'f/4.0',
  'File Modified Date': 'Fri May 05 18:24:55 -04:00 2017',
  'File Name': 'apache-tika-8636408895051569645.tmp',
  'File Size': '59358 bytes',
  'Focal Length': '35 mm',
  'Image Height': '419 pixels',
  'Image Width': '640 pixels',
  'Lens Make': 'FUJIFILM',
  'Lens Model': 'XF35mmF1.4 R',
  'Make': 'FUJIFILM',
  'Model': 'X-T10',
  'Number of Components': '3',
  'Resolution Units': 'inch',
  'Thumbnail Height Pixels': '0',
  'Thumbnail Width Pixels': '0',
  'X
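
For the image, `content` is `None` and all of the information is in `metadata`, where every value is a string such as `'640 pixels'`. A small sketch of turning those into numbers follows. `leading_int` is a hypothetical helper, and `jpg_meta` holds a few entries copied from the output above.

```python
import re

# A few entries copied from jpg_plain['metadata'] in the output above.
jpg_meta = {
    'File Size': '59358 bytes',
    'Image Height': '419 pixels',
    'Image Width': '640 pixels',
    'Make': 'FUJIFILM',
}


def leading_int(value):
    """Return the integer at the start of a string like '640 pixels', else None (hypothetical helper)."""
    match = re.match(r'\s*(\d+)\b', value)
    return int(match.group(1)) if match else None


width = leading_int(jpg_meta['Image Width'])
height = leading_int(jpg_meta['Image Height'])
print(width, height)
```

This only suits values that start with a whole number; fields such as `'Exposure Time'` (`'1/4000 sec'`) need their own parsing.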