
# Converting all kinds of documents into text

Have a collection of documents? Word docs, HTML files, PDFs, _image-based_ PDFs, and anything else? Don't worry, Apache Tika has you covered. 


## Installation

These installation instructions only work on OS X, but it's possible to get the same software running on Windows.

### Tesseract

[Tesseract](https://github.com/tesseract-ocr/tesseract) is a piece of software that performs OCR, converting images of text into actual text. If we need to perform OCR on more languages than just English, we'll also need to install `tesseract-lang` to add more languages to the mix.
    
```
brew install tesseract tesseract-lang
```

### Tika

[Tika](https://tika.apache.org/) is an incredible piece of software that converts just about any kind of document to text. It requires Java. I installed Java from https://www.java.com/en/download/ and it didn't work, so use the install commands below instead.

```
brew cask install adoptopenjdk
brew install tika 
```

Tika will automatically know about tesseract.

### Python bindings for Tika

Tika is a piece of software that exists _outside of Python_. If we want Python to be able to use Tika, we'll need to install the **Python bindings** for Tika.

```
pip install tika
```

If you'd like to just run this all from the notebook, uncomment and run the cell below. **You'll need to type in your password for the `adoptopenjdk` one, so be sure to pay attention to when it asks you.**

# My modifications
With these modifications, it worked on the Colab machine.

In [2]:
!sudo apt install tesseract-ocr

!pip install pytesseract

!pip install tika

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 42 not upgraded.
Need to get 4,795 kB of archives.
After this operation, 15.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-eng all 4.00~git24-0e00fe6-1.2 [1,588 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr-osd all 4.00~git24-0e00fe6-1.2 [2,989 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tesseract-ocr amd64 4.00~git2288-10f4998a-2 [218 kB]
Fetched 4,795 kB in 1s (5,281 kB/s)
debconf: unable to initi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tika
  Downloading tika-1.24.tar.gz (28 kB)
Building wheels for collected packages: tika
  Building wheel for tika (setup.py) ... done
  Created wheel for tika: filename=tika-1.24-py3-none-any.whl size=32893 sha256=079f5d1d90de9e602b54ef5e131b7d712fd0d1427c43ce32703061528fc99f37
  Stored in directory: /root/.cache/pip/wheels/ec/2b/38/58ff05467a742e32f67f5d0de048fa046e764e2fbb25ac93f3
Successfully built tika
Installing collected packages: tika
Successfully installed tika-1.24
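
Once the installs finish, a quick check can confirm that the pieces are visible from Python. This is a minimal sketch rather than part of the original notebook: `check_environment` is a hypothetical helper name, and it only tests whether the `tesseract` and `java` commands are on the `PATH` and whether the `tika` module can be imported, not whether they actually work.

```python
import importlib.util
import shutil


def check_environment(tools=("tesseract", "java"), modules=("tika",)):
    """Report which command-line tools and Python modules can be found."""
    report = {}
    for tool in tools:
        # shutil.which returns the full path, or None when the tool is not on PATH
        report[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec returns None when a top-level module is not installed
        report[module] = importlib.util.find_spec(module) is not None
    return report


for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Anything reported as `MISSING` is worth fixing before moving on, since Tika needs Java to run and Tesseract to do OCR.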


## Documents we'll be using

* `pdf`: https://data.ct.gov/download/fxjv-82m6/application/pdf
* `doc`: https://pasteur.epa.gov/uploads/10.23719/1500001/LDPE_nanoclay_Highlights_.docx
* `jpg` for OCR: https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg
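
If you would rather keep local copies of all three documents, something like the sketch below could work. It is an illustration under assumptions: `DOCUMENTS`, `local_name`, and `download_all` are names made up here, and the fallback filename `document.<kind>` is an arbitrary choice for URLs (like the PDF one) that do not end in a filename with an extension.

```python
from pathlib import Path
from urllib.parse import urlparse

# The three documents listed above, keyed by a short label (labels are arbitrary)
DOCUMENTS = {
    "pdf": "https://data.ct.gov/download/fxjv-82m6/application/pdf",
    "docx": "https://pasteur.epa.gov/uploads/10.23719/1500001/LDPE_nanoclay_Highlights_.docx",
    "jpg": "https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg",
}


def local_name(kind, url):
    """Use the last URL segment as the filename, or document.<kind> if it has no extension."""
    name = Path(urlparse(url).path).name
    return name if Path(name).suffix else f"document.{kind}"


def download_all(documents=DOCUMENTS, folder="."):
    """Download every document into folder (needs network access)."""
    import requests  # imported here so local_name can be used without requests

    for kind, url in documents.items():
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        target = Path(folder) / local_name(kind, url)
        target.write_bytes(response.content)
        print(f"saved {target}")
```

Calling `download_all()` would save the files next to the notebook, ready for `parser.from_file`.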

## Confirm tesseract works

In [3]:
# Download the image
!curl -O https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100  822k  100  822k    0     0  14.6M      0 --:--:-- --:--:-- --:--:-- 14.6M


![](https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg)

In [4]:
!tesseract Dr._Jekyll_and_Mr._Hyde_Text.jpg stdout

at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“I understood, a drawer...”

But here I took pity on my visitor’s suspense, and some perhaps
on my own growing curiosity.

“There it is, s

## Using Tika

### Starting it up

In [5]:
import tika
import requests
from tika import parser

# Start running the tika service
tika.initVM()

## Examples

### PDF example

The first call will be very slow, most likely because the Python bindings download the Tika server and start it in the background before parsing anything.

In [8]:
response = requests.get('https://data.ct.gov/download/fxjv-82m6/application/pdf')
# Pass the raw bytes of the download, not the Response object itself
results = parser.from_buffer(response.content)

In [9]:
results.keys()

dict_keys(['metadata', 'content', 'status'])

In [10]:
results['status']

200
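
Since every result has the same three keys, it can be handy to wrap the "check the status, then grab the content" step in one place. The helper below is a hypothetical sketch (the name `get_text` is not from the Tika bindings), and it guards against `content` being `None`, which is an assumption about what comes back when no text can be extracted.

```python
def get_text(results):
    """Return the stripped text of a Tika result, or '' when nothing was extracted."""
    status = results.get("status")
    if status != 200:
        raise RuntimeError(f"Tika returned status {status}")
    # 'content' may be None when no text was found, so fall back to an empty string
    return (results.get("content") or "").strip()
```

With this in place, `get_text(results)` can stand in for `results['content'].strip()` in the cells that follow.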

In [11]:
# Only showing the first 1000 chars because there are SO MANY
results['content'][:1000]

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n  \n\n  \n\n \n\n \n\nConnecticut \n\nOpen Data \n\nPolicy \nEffective April 22, 2015 \n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nPromulgated in accordance with and \n\nunder the authority of Executive \n\nOrder 39 of Governor Dannel P. \n\nMalloy \n\n \n\n  \n\n  \n\n \n\n  \n \n\n  \n\n\n\n \n\n \n\nContents \n\n \n\n \n1.0 Definitions .......................................................................................................................... 3 \n\n2.0  Introduction...................................................................................................................... 5 \n\n2.1  Intent ............................................................................................................................ 5 \n\n2.2  Scope ............................................................................................................................ 5 \n\n2.3  Legal 

In [12]:
# Only showing the first 10000 chars
print(results['content'][:10000].strip())

Connecticut 

Open Data 

Policy 
Effective April 22, 2015 

 

 

 

 

 

 

 

Promulgated in accordance with and 

under the authority of Executive 

Order 39 of Governor Dannel P. 

Malloy 

 

  

  

 

  
 

  



 

 

Contents 

 

 
1.0 Definitions .......................................................................................................................... 3 

2.0  Introduction...................................................................................................................... 5 

2.1  Intent ............................................................................................................................ 5 

2.2  Scope ............................................................................................................................ 5 

2.3  Legal Considerations ....................................................................................................... 5 

3.0  Open Data Policy Requirements .......................
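
The extracted text is mostly blank lines and table-of-contents dot leaders. A small cleanup pass can make it easier to read. This is one possible sketch, not something Tika provides: `tidy_text` is a made-up name, and the thresholds (three or more newlines, four or more dots) are arbitrary choices.

```python
import re


def tidy_text(raw):
    """Collapse blank-line runs and dot leaders in text extracted by Tika."""
    # Drop trailing spaces so whitespace-only lines become truly empty
    text = "\n".join(line.rstrip() for line in raw.splitlines())
    # Three or more newlines in a row become a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Long runs of dots (table-of-contents leaders) shrink to " ... "
    text = re.sub(r"[ \t]*\.{4,}[ \t]*", " ... ", text)
    return text.strip()


sample = "\n\n\nConnecticut \n\nOpen Data \n\n \n\n \n\nPolicy \n1.0 Definitions ........ 3 \n"
print(tidy_text(sample))
```

On the real document you would call `tidy_text(results['content'])` instead of using the small `sample` string.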

### Word doc example

In [13]:
response = requests.get('https://pasteur.epa.gov/uploads/10.23719/1500001/LDPE_nanoclay_Highlights_.docx')
results = parser.from_buffer(response.content)
print(results['content'].strip())

Highlights 

Evaluating Weathering of Food Packaging Polyethylene-Nano-clay Composites: Release of Nanoparticles and their Impacts

Changseok Han1, Amy Zhao1, and Eunice Varughese2, E. Sahle-Demessie*1




1. UV or O3 degradation food packaging composites released nanoclay particles. 
2. Properties of nanocomposites changed during accelerated weathering.
3. Nanoclay release was proportional to weathering time.
4. Toxicity of released nanoclay at test concentrations were not significant.


### OCR image example

It will work the same with a PDF instead of an image.

In [16]:
response = requests.get('https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg')
print(response)
results = parser.from_buffer(response.content)
print(results)
results['status']

<Response [200]>
{'metadata': {'Blue Colorant': '(0.1431, 0.0606, 0.7141)', 'Blue TRC': '0.0, 0.0000763, 0.0001526, 0.0002289, 0.0003052, 0.0003815, 0.0004578, 0.0005341, 0.0006104, 0.0006867, 0.000763, 0.0008392, 0.0009003, 0.0009766, 0.0010529, 0.0011292, 0.0012055, 0.0012818, 0.0013581, 0.0014343, 0.0015106, 0.0015869, 0.0016632, 0.0017395, 0.0018158, 0.0018921, 0.0019684, 0.0020447, 0.002121, 0.0021973, 0.0022736, 0.0023499, 0.0024262, 0.0025025, 0.0025788, 0.0026551, 0.0027161, 0.0027924, 0.0028687, 0.002945, 0.0030213, 0.0030976, 0.0031739, 0.0032502, 0.0033417, 0.003418, 0.0034943, 0.0035859, 0.0036622, 0.0037537, 0.00383, 0.0039216, 0.0040131, 0.0041047, 0.0041962, 0.0042878, 0.0043793, 0.0044709, 0.0045624, 0.0046693, 0.0047608, 0.0048524, 0.0049592, 0.005066, 0.0051575, 0.0052644, 0.0053712, 0.005478, 0.0055848, 0.0056916, 0.0057984, 0.0059052, 0.0060273, 0.0061341, 0.0062562, 0.006363, 0.0064851, 0.0066072, 0.0067292, 0.0068513, 0.0069734, 0.0070954, 0.0072175, 0.0073396, 0.

200

In [17]:
print(results['content'].strip())

at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment; and I under-
stood...” He paused and put his hand to his throat, and I could see,
in spite of his collected manner, that he was wrestling against the
approaches of the hysteria—“I understood, a drawer...”

But here I took pity on my visitor’s suspense, and some perhaps
on my own growing curiosity.

“There it is, s
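
Notice that the OCR output keeps the page's line-break hyphenation, such as `or-` / `dinary` and `under-` / `stood`. If you plan to search or count words, you may want to stitch those back together. The function below is a rough sketch with a made-up name, `join_hyphenated`; it assumes that a lowercase letter, a hyphen, a line break, and another lowercase letter always mean a split word, which will occasionally be wrong for genuine compounds that happen to break at the hyphen.

```python
import re


def join_hyphenated(text):
    """Rejoin words that were split across lines with a trailing hyphen."""
    # lowercase letter + "-" + newline + lowercase letter: remove the hyphen and the break
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


print(join_hyphenated("as fair an imitation of my or-\ndinary manner to a patient"))
```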

### Using local files

In [18]:
# Save the file locally
!curl -O https://upload.wikimedia.org/wikipedia/commons/5/5f/Dr._Jekyll_and_Mr._Hyde_Text.jpg

results = parser.from_file('Dr._Jekyll_and_Mr._Hyde_Text.jpg')
print(results['content'].strip())

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100  822k  100  822k    0     0  14.3M      0 --:--:-- --:--:-- --:--:-- 14.3M
at his touch ofa certain icy pang along my blood. “Come, sir,’ said I.
“You forget that I have not yet the pleasure of your acquaintance. Be
seated, if you please.” And I showed him an example, and sat down
myself in my customary seat and with as fair an imitation of my or-
dinary manner to a patient, as the lateness of the hour, the nature of
my preoccupations, and the horror I had of my visitor, would suffer
me to muster.

“I beg your pardon, Dr. Lanyon,” he replied civilly enough. “What
you say is very well founded; and my impatience has shown its heels
to my politeness. I come here at the instance of your colleague, Dr.
Henry Jekyll, on a piece of business of some moment

# PDF from Drive
**Reading a file from Drive and converting it with Tika**

In [23]:
import os
from google.colab import drive
drive.mount('/content/gdrive')
path = os.path.abspath('/content/gdrive/MyDrive/Colab Notebooks/pdfs/')
# Work from the PDF folder so the files can be opened by name
os.chdir(path)
!ls

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
'1909.11573 deep fake.pdf'
'Sommerville Ian 2011 Ingeniera de software 9th.pdf'


In [21]:
def read_pdf(filename):
    # Returns Tika's dict with the 'metadata', 'content' and 'status' keys
    return parser.from_file(filename)

In [24]:
#text = read_pdf(file)
text = read_pdf('Sommerville Ian 2011 Ingeniera de software 9th.pdf')

In [29]:
print(text)
print(text.keys())

{'metadata': {'Author': 'efrenhuerta', 'Content-Type': 'application/pdf', 'Creation-Date': '2012-04-03T21:48:15Z', 'EBX_PUBLISHER': 'COSName{Pearson Educacin de Mexico, SA de CV}', 'GTS_PDFXConformance': 'PDF/X-1a:2001', 'GTS_PDFXVersion': 'PDF/X-1:2001', 'Last-Modified': '2012-05-29T19:41:00Z', 'Last-Save-Date': '2012-05-29T19:41:00Z', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '10008', 'access_permission:assemble_document': 'true', 'access_permission:can_modify': 'true', 'access_permission:can_print': 'true', 'access_permission:can_print_degraded': 'true', 'access_permission:extract_content': 'true', 'access_permission:extract_for_accessibility': 'true', 'access_permission:fill_in_form': 'true', 'access_permission:modify_annotations': 'true', 'created': '2012-04-03T21:48:15Z', 'creator': 'efrenhuerta', 'date': '2012-05-29T19
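
The metadata dictionary is long, so it can help to pull out just the fields you care about. The sketch below uses a hypothetical helper, `summarize_metadata`; the default keys are simply ones that appear in this notebook, and a key that a given file lacks comes back as `None`.

```python
def summarize_metadata(metadata, keys=("Author", "Content-Type", "Creation-Date", "xmpTPg:NPages")):
    """Pick a few fields out of Tika's metadata dict, flattening list values."""
    summary = {}
    for key in keys:
        value = metadata.get(key)
        # Some fields hold a list of values, like 'X-Parsed-By' in the output above
        if isinstance(value, list):
            value = "; ".join(str(item) for item in value)
        summary[key] = value
    return summary


example = {
    "Author": "efrenhuerta",
    "Content-Type": "application/pdf",
    "X-Parsed-By": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
}
print(summarize_metadata(example, keys=("Author", "X-Parsed-By")))
```

For the book above, `summarize_metadata(text['metadata'])` would give a much shorter overview than printing the whole dictionary.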

In [30]:
print(text['content'][:10000].strip())

Ingeniera de software


INGENIERÍA DE SOFTWARE, en su novena edición, se dirige 
principalmente a estudiantes universitarios que están inscritos en cursos 
tanto introductorios como avanzados de ingeniería de software y sistemas. 
Asimismo, los ingenieros de software que trabajan en la industria encon-
trarán el libro útil como lectura general y para actualizar sus conocimientos 
acerca de temas como reutilización de software, diseño arquitectónico, 
confiabilidad, seguridad y mejora de procesos.

La presente obra se actualizó al incorporar los siguientes cambios:

>  Nuevos capítulos sobre software ágil y sistemas embebidos.

>  Nuevo material sobre ingeniería dirigida por modelos, desarrollo de 
fuente abierta y desarrollo dirigido por pruebas, modelo de queso suizo 
de Reason, arquitecturas de sistemas confiables, análisis estático y com-
probación de modelos, reutilización COTS, software como servicio y 
planeación ágil.

>  Un nuevo estudio de caso de amplio alcance que detalla un

### Doing your parsing
**Other examples**

There are two ways to do it!

**Right from the web**

```python
response = requests.get(...)
results = parser.from_buffer(response.content)
```

**From a downloaded file**

```python
results = parser.from_file(filename)
```
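
If your list of sources mixes URLs and local paths, a small dispatcher can pick between the two approaches. This is only a sketch: `is_url` and `parse_any` are hypothetical names, and treating anything that starts with `http://` or `https://` as a web document is an assumption made here.

```python
from urllib.parse import urlparse


def is_url(source):
    """True when source looks like an http(s) URL rather than a local path."""
    parts = urlparse(source)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def parse_any(source):
    """Parse a URL with from_buffer or a local file with from_file."""
    # Imported here so is_url can be used even where Tika is not installed
    import requests
    from tika import parser

    if is_url(source):
        response = requests.get(source)
        response.raise_for_status()
        return parser.from_buffer(response.content)
    return parser.from_file(source)
```

For example, `parse_any('Dr._Jekyll_and_Mr._Hyde_Text.jpg')` would go through `from_file`, while the Connecticut PDF URL would be downloaded first.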

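Note that if you want to do **non-English OCR**, you need to change things up a bit by sending Tika a language header. The example below uses `grc`, Tesseract's code for Ancient Greek (modern Greek is `ell`). See what your Tesseract supports with `tesseract --list-langs`.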

```python
headers = {
    "X-Tika-OCRLanguage": "grc"
}

results = parser.from_buffer(response.content, headers=headers)
```
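
Before asking for a language, it is worth checking that Tesseract actually has it installed. The sketch below parses the text printed by `tesseract --list-langs`; `parse_langs` and `installed_langs` are made-up helper names, and reading both output streams is a precaution in case a given Tesseract version writes the listing to stderr.

```python
import subprocess


def parse_langs(listing):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # Skip the header line, e.g. "List of available languages (2):"
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs():
    """Run tesseract and return the language codes it reports."""
    done = subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True)
    return parse_langs(done.stdout + "\n" + done.stderr)


print(parse_langs("List of available languages (2):\neng\nosd\n"))
```

A check such as `"grc" in installed_langs()` could then decide whether to send the `X-Tika-OCRLanguage` header at all.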

**From Google Drive**
```python
import os
from google.colab import drive
drive.mount('/content/gdrive')
path = os.path.abspath('/content/gdrive/MyDrive/Colab Notebooks/pdfs/')
path
```
List all the PDF files in the folder:
```python
import os
from os.path import isfile, join

all_files = [f for f in os.listdir(path) if isfile(join(path, f)) and f.endswith(".pdf")]
all_files

# Sizes in bytes, in the same order as all_files
file_sizes = [os.path.getsize(join(path, f)) for f in all_files]
file_sizes
```
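
The same listing can also be written with `pathlib`, which keeps each name and size together. This is an alternative sketch, not the notebook's original approach; `list_pdfs` is a hypothetical name, and the temporary folder at the bottom exists only to demonstrate it without needing Drive.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return (name, size in bytes) for each PDF directly inside folder, sorted by name."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Small demonstration in a throwaway folder
with tempfile.TemporaryDirectory() as demo:
    (Path(demo) / "report.pdf").write_bytes(b"%PDF")
    (Path(demo) / "notes.txt").write_text("not a pdf")
    demo_listing = list_pdfs(demo)

print(demo_listing)
```

On Colab you would pass the Drive folder instead, for example `list_pdfs(path)`.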

Function for reading a PDF file:
```python
def read_pdf(filename):
    # Returns Tika's dict with the 'metadata', 'content' and 'status' keys
    return parser.from_file(filename)
```

**Read some PDFs from Drive**
```python
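pdf_text_list = []
pagenumbers = []
path_and_files = [path + "/" + f for f in all_files]

def cleaning_raw_text(raw):
    # Placeholder (assumption): the original cleaning helper is not shown in this notebook
    return raw.strip()

def read_pdf_files(path=None):
    if path is None:
        return
    for i, file in enumerate(path):
        try:
            text = read_pdf(file)
            pagenumbers.append(text['metadata']['xmpTPg:NPages'])
            # Fall back to '' in case Tika extracted no text
            text_strings = str(text['content'] or '')
        except Exception as error:
            print(f"Something is wrong with reading PDF file #{i}: {error}")
            continue
        pdf_text_list.append(cleaning_raw_text(text_strings))


# all_files holds bare file names, so this relies on the earlier os.chdir
# into the PDF folder; pass path_and_files when running from somewhere else
%time read_pdf_files(path=all_files)

pdf_text_list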
```
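
A variation on the loop above returns its results instead of appending to global lists, and remembers which files failed and why. This is a sketch under assumptions: `parse_many` is a made-up name, and the parsing function is passed in as an argument (in the notebook that would be `parser.from_file`), which also makes it possible to try the logic out with a stand-in parser.

```python
def parse_many(paths, parse_fn):
    """Parse each path with parse_fn and return (texts, failures) as two dicts."""
    texts, failures = {}, {}
    for path in paths:
        try:
            result = parse_fn(path)
            # Fall back to '' in case no text was extracted
            texts[path] = (result.get("content") or "").strip()
        except Exception as error:
            # Keep going, but remember what went wrong for this file
            failures[path] = repr(error)
    return texts, failures


def fake_parse(path):
    """Stand-in for parser.from_file, used only to demonstrate parse_many."""
    if path == "bad.pdf":
        raise ValueError("cannot read")
    return {"content": f"  text of {path}\n", "status": 200}


texts, failures = parse_many(["a.pdf", "bad.pdf"], fake_parse)
print(texts)
print(failures)
```

With Tika running, the call would look like `texts, failures = parse_many(path_and_files, parser.from_file)`.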


In [31]:
!tesseract --list-langs

List of available languages (2):
eng
osd
