
# Indexing different types of documents

Have a collection of documents? Word documents, HTML files, PDFs, image-based PDFs, and more? Don't worry: Apache [Tika](https://tika.apache.org) has you covered.

# Extracting Text from Documents

## Installation
Ideally, the OCR service would run on a dedicated server. Here we use Colab itself as the server. First we install Tesseract OCR together with its Portuguese language data, and then the Python wrapper for Tika.

In [1]:
!apt-get update
!apt-get install tesseract-ocr
!apt-get install tesseract-ocr-por

Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:5 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:6 http://archive.ubuntu.com/ubuntu bionic-ba

In [2]:
!pip install tika

Collecting tika
  Downloading tika-1.24.tar.gz (28 kB)
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... done
  Created wheel for tika: filename=tika-1.24-py3-none-any.whl size=32893 sha256=4dba1bbaf30b708d3420e46cb85e77a48f395478e40e9d1dd72af18f95c71a5f
  Stored in directory: /root/.cache/pip/wheels/ec/2b/38/58ff05467a742e32f67f5d0de048fa046e764e2fbb25ac93f3
Successfully built tika
Installing collected packages: tika
Successfully installed tika-1.24


## Downloading some documents

* pdf: https://data.ct.gov/download/fxjv-82m6/application/pdf
* docx: https://pasteur.epa.gov/uploads/10.23719/1500001/LDPE_nanoclay_Highlights_.docx
* jpg for OCR: https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg


In [3]:
!curl -O https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  822k  100  822k    0     0  10.2M      0 --:--:-- --:--:-- --:--:-- 10.2M


![](https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg)

## Testing Tesseract from the command line

In [4]:
!tesseract Dr._Jekyll_and_Mr._Hyde_Text.jpg stdout

at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“I understood, a drawer...”

But here I took pity on my visitor’s suspense, and some perhaps
on my own growing curiosity.

“There it is, s
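
The same command can also be driven from Python. The sketch below is only an illustration, assuming the `tesseract` binary is on the PATH; the helper names `build_tesseract_command` and `ocr_image` are made up for this example. The `-l` flag selects the language model, which becomes relevant for the Portuguese image further down.

```python
import subprocess

def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list that runs Tesseract and prints the text to stdout."""
    # Using "stdout" as the output base makes Tesseract write to standard output
    return ["tesseract", image_path, "stdout", "-l", lang]

def ocr_image(image_path, lang="eng"):
    """Run Tesseract on an image and return the recognized text (hypothetical helper)."""
    completed = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True, text=True, check=True,
    )
    return completed.stdout

# Example (needs the tesseract binary and the image downloaded above):
# print(ocr_image("Dr._Jekyll_and_Mr._Hyde_Text.jpg")[:200])
```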

## Using Tika
### Starting the service

In [5]:
import tika
import requests
from tika import parser

# initVM() is kept for backward compatibility; the Tika server itself
# is started on the first parser call
tika.initVM()

## Parsing!
There are two ways to do this!

**Straight from the web**

```python
response = requests.get(...)
results = parser.from_buffer(response.content)
```

**From a local file**

```python
results = parser.from_file(filename)
```
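
Both entry points can be wrapped in one function that picks the right call from the source string. This is a minimal sketch rather than part of the tika package: `is_remote` and `extract` are hypothetical names, and the 30-second timeout is an arbitrary assumption.

```python
from urllib.parse import urlparse

def is_remote(source):
    """Return True when the source looks like an http(s) URL rather than a local path."""
    return urlparse(source).scheme in ("http", "https")

def extract(source, headers=None):
    """Parse a URL or a local file with Tika and return the result dict (hypothetical helper)."""
    # Imported lazily so that is_remote() can be used without tika installed
    import requests
    from tika import parser

    if is_remote(source):
        response = requests.get(source, timeout=30)
        response.raise_for_status()
        # from_buffer expects the raw bytes of the document
        return parser.from_buffer(response.content, headers=headers)
    return parser.from_file(source, headers=headers)

print(is_remote("https://data.ct.gov/download/fxjv-82m6/application/pdf"))  # True
print(is_remote("Dr._Jekyll_and_Mr._Hyde_Text.jpg"))                        # False
```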

Note that OCR in a language other than English needs one extra step: pass the `X-Tika-OCRLanguage` header with the Tesseract language code. The example below sets Portuguese (`por`). Run `tesseract --list-langs` to see which languages your Tesseract installation supports.

```python
headers = {
    "X-Tika-OCRLanguage": "por"
}
results = parser.from_buffer(response.content, headers=headers)
```

In [6]:
!tesseract --list-langs

List of available languages (3):
eng
por
osd
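
That listing can be used to fail early when a language pack is missing, instead of silently getting poor OCR. The sketch below assumes the output format shown above; `parse_available_languages` and `ocr_headers` are hypothetical helpers, not part of tika or Tesseract.

```python
def parse_available_languages(list_langs_output):
    """Extract the language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    # Skip the "List of available languages ..." header and any blank lines
    return [line for line in lines if line and not line.startswith("List of")]

def ocr_headers(lang, available):
    """Build the Tika OCR language header, raising if the language is not installed."""
    if lang not in available:
        raise ValueError(f"Tesseract language {lang!r} is not installed; found {available}")
    return {"X-Tika-OCRLanguage": lang}

sample_output = """List of available languages (3):
eng
por
osd
"""
available = parse_available_languages(sample_output)
print(available)                      # ['eng', 'por', 'osd']
print(ocr_headers("por", available))  # {'X-Tika-OCRLanguage': 'por'}
```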


## Example: PDF

In [7]:
response = requests.get('https://data.ct.gov/download/fxjv-82m6/application/pdf')
# pass the downloaded bytes, not the Response object
results = parser.from_buffer(response.content)

2022-04-25 12:26:46,413 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2022-04-25 12:26:46,928 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2022-04-25 12:26:47,347 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


In [8]:
results.keys()

dict_keys(['metadata', 'content', 'status'])

In [9]:
results['status']

200
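
Before using the text, it can help to check the status and guard against missing content. The sketch below works on the result dict shown above; `get_text` is a hypothetical helper, and it assumes that `content` may be `None` when Tika finds no extractable text.

```python
def get_text(results):
    """Return the stripped text of a Tika result dict, or raise if parsing failed."""
    status = results.get("status")
    if status != 200:
        raise RuntimeError(f"Tika returned status {status}")
    # Assumption: 'content' can be None when no text was extracted
    content = results.get("content")
    return content.strip() if content else ""

print(get_text({"status": 200, "content": "\n\nConnecticut \n", "metadata": {}}))  # Connecticut
```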

In [10]:
# Only showing the first 1000 chars
results['content'][:1000]

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n  \n\n  \n\n \n\n \n\nConnecticut \n\nOpen Data \n\nPolicy \nEffective April 22, 2015 \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nPromulgated in accordance with and \n\nunder the authority of Executive \n\nOrder 39 of Governor Dannel P. \n\nMalloy \n\n \n\n  \n\n  \n\n \n\n  \n \n\n  \n\n\n\n \n\n \n\nContents \n\n \n\n \n1.0 Definitions .......................................................................................................................... 3 \n\n2.0  Introduction...................................................................................................................... 5 \n\n2.1  Intent ............................................................................................................................ 5 \n\n2.2  Scope ............................................................................................................................ 5 \n\n2.3  Legal 

In [11]:
# print the text so line breaks are rendered, stripping leading and trailing whitespace
print(results['content'][:1000].strip())

Connecticut 

Open Data 

Policy 
Effective April 22, 2015 

 

 

 

 

 

 

 

Promulgated in accordance with and 

under the authority of Executive 

Order 39 of Governor Dannel P. 

Malloy 

 

  

  

 

  
 

  



 

 

Contents 

 

 
1.0 Definitions .......................................................................................................................... 3 

2.0  Introduction...................................................................................................................... 5 

2.1  Intent ............................................................................................................................ 5 

2.2  Scope ............................................................................................................................ 5 

2.3  Legal Considerations .......................................................................................................
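
The extracted text still carries many whitespace-only lines from the PDF layout. One possible cleanup before indexing is sketched below; `collapse_blank_lines` is a hypothetical helper, and the sample string is a shortened version of the output above.

```python
import re

def collapse_blank_lines(text):
    """Strip trailing spaces on each line and squeeze runs of empty lines into one."""
    lines = [line.rstrip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Three or more consecutive newlines mean two or more blank lines: keep one
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()

sample = "Connecticut \n\nOpen Data \n\n \n\n \n\nPolicy \nEffective April 22, 2015 \n"
print(collapse_blank_lines(sample))
```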


## Example: DOCX

In [12]:
response = requests.get('https://pasteur.epa.gov/uploads/10.23719/1500001/LDPE_nanoclay_Highlights_.docx')
results = parser.from_buffer(response.content)
print(results['content'].strip())

Highlights 

Evaluating Weathering of Food Packaging Polyethylene-Nano-clay Composites: Release of Nanoparticles and their Impacts

Changseok Han1, Amy Zhao1, and Eunice Varughese2, E. Sahle-Demessie*1




1. UV or O3 degradation food packaging composites released nanoclay particles. 
2. Properties of nanocomposites changed during accelerated weathering.
3. Nanoclay release was proportional to weathering time.
4. Toxicity of released nanoclay at test concentrations were not significant.


## Example: Image (OCR)

In [13]:
response = requests.get('https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg')
results = parser.from_buffer(response.content)
results['status']

200

In [14]:
print(results['content'].strip())

at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“I understood, a drawer...”

But here I took pity on my visitor’s suspense, and some perhaps
on my own growing curiosity.

“There it is, s
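
OCR keeps the line-end hyphenation of the printed page (`or-` / `dinary`, `under-` / `stood`), which splits words that an index should treat as one. A rough way to rejoin them is sketched below; `join_hyphenated_lines` is a hypothetical helper. It only looks for lowercase ASCII continuations, and it will also merge genuine compounds that happen to break at a hyphen, so treat it as a heuristic.

```python
import re

def join_hyphenated_lines(text):
    """Rejoin words that were split with a hyphen at a line break."""
    # A word character, a hyphen, a newline, then a lowercase letter on the next line
    return re.sub(r"(\w)-\n([a-z])", r"\1\2", text)

sample = "imitation of my or-\ndinary manner; and I under-\nstood..."
print(join_hyphenated_lines(sample))
# imitation of my ordinary manner; and I understood...
```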

## Image in Portuguese
![Page 45 of Livro de uma sogra](https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Livro_de_uma_sogra.djvu/page45-552px-Livro_de_uma_sogra.djvu.jpg)

In [20]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get('https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Livro_de_uma_sogra.djvu/page45-552px-Livro_de_uma_sogra.djvu.jpg',
                        headers=headers)

results = parser.from_buffer(response.content)
print(results['content'].strip())

Livao DE UMA socRa a

 

 

quanto ella, mas que eu s6 avaliava por con-
jecturas, e cujo perfume de cabello ou cheiro
de corpo nunca me tinham sido revelados na
intimidade da posse, impunha-se despotica-
mente aos meus culposos sentidos, accordan-
do-me amores fogosos ¢ energicos, como os jé
no accordava a minha bonita companheira.
Oh I que me perdées, Olympia, as vezes
que em ti matei desejos que vinham de outras
mulheres !
<E, emconsciencia, no sera isto j4o
adulterio ? A idéa do toque amoroso com outra,
que nao seja a propria esposa, no seré uma
traigio conjugal ? Castusest qui amorem amo-
re, ignemque igne excludit, diz Santo Ago:
tinho. Se assim ¢, hade ser difficil descobrie
‘um casal que se no adultere de parte a parte,
_pois estou bem convencido de que com minha
mulher, por excellencia virtuosa, devia succe-
der outro tanto; assim como estou ampla~
mente convencido de que tudo, tudo que em
mim observei, se verificou tambem com ella. »

 

 

 

Ahi termina 0 trecho das notas d

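**What went wrong?**

Tesseract ran with its default English model, so accented characters came out garbled (`s6` for `só`, `jé` for `já`, `¢` for `e` or `é`). The fix is to set the Tesseract language to Portuguese.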

In [21]:
response = requests.get('https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Livro_de_uma_sogra.djvu/page45-552px-Livro_de_uma_sogra.djvu.jpg',
                        headers=headers)
results = parser.from_buffer(response.content, headers={
    "X-Tika-OCRLanguage": "por"
})
print(results['content'].strip())

LAVRO DE UMA SOGRA q

 

 

quanto ella, mas que eu só avaliava por con-
jecturas, e cujo perfume de cabello ou cheiro
de corpo nunca me tinham sido revelados na
intimidade da posse, impunha-se despotica-
mente aos meus culposos sentidos, accordan-
do-me amores fogosos e energicos, como os já
não accordava a minha bonita companheira.

«Oh! que me perdões, Olympia, as vezes
que em ti matei desejos que vinham de outras
mulheres!

«E, em consciencia, não será isto já o
adulterio ? A idéa do toque amoroso com outra
que não seja a propria esposa, não será uma
traição conjugal? Castus est qui amorem amo-
rey ignemque igne excludit, diz Santo Agos
tinho. Se assim é, hade ser difficil descobrir
um casal que se não adultere de parte a parte,
- pois estou bem convencido de que com minha
mulher, por excellencia virtuosa, devia succe-
der outro tanto; assim como estou ampla-
mente convencido de que tudo, tudo que em
mim observei, se verificou tambem com ella. »

 

 

 

Ahi termina o trecho das n
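
To see exactly which lines the language setting changed, the two OCR outputs can be compared line by line. The sketch below uses `difflib` from the standard library; `changed_lines` is a hypothetical helper, and the two sample strings are copied from the outputs above.

```python
import difflib

def changed_lines(before, after):
    """Return only the removed ("- ") and added ("+ ") lines between two texts."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff also emits "  " (unchanged) and "? " (hint) lines, which are dropped here
    return [line for line in diff if line.startswith(("- ", "+ "))]

english_model = "quanto ella, mas que eu s6 avaliava por con-\njecturas, e cujo perfume de cabello ou cheiro"
portuguese_model = "quanto ella, mas que eu só avaliava por con-\njecturas, e cujo perfume de cabello ou cheiro"

for line in changed_lines(english_model, portuguese_model):
    print(line)
```

In the notebook, the same function could be called with the stripped `results['content']` of the two runs.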