# PDF text extraction

Below we illustrate the usage of `PdfToTextConverter`. Note that extracted text still needs manual curation if readability is a requirement. If automated extraction is consistently poor for a specific language, you may want to look into how to *train tesseract*; that topic is not covered here.

**Note:** this tutorial assumes that [tesseract](https://github.com/tesseract-ocr/tesseract), [poppler](https://github.com/oschwartz10612/poppler-windows), and [ImageMagick](https://imagemagick.org/) are available in the system path. Under Windows, getting them all to work together can be difficult; see Majordome's Kompanion for automatic installation.

Install dependencies on Ubuntu 22.04:

```bash
sudo apt install tesseract-ocr imagemagick poppler-utils
```

On Rocky Linux 9:

```bash
sudo dnf install tesseract tesseract-langpack-eng ImageMagick poppler-utils
```
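
Before creating a converter, it may help to confirm that the tools are actually visible in the path. The following is a minimal sketch using only the standard library. The executable names (`tesseract`, `pdftotext` or `pdftoppm` for poppler, `magick` or `convert` for ImageMagick) are assumptions about a typical installation and are not taken from Majordome itself; `find_tools` is a hypothetical helper.

```python
import shutil

# Assumed executable names (not taken from Majordome); a tool counts as
# available if any one of its candidate executables is found in the path.
CANDIDATES = {
    "tesseract": ("tesseract",),
    "poppler": ("pdftotext", "pdftoppm"),
    "ImageMagick": ("magick", "convert"),
}


def find_tools(candidates=CANDIDATES):
    """Map each tool to the first candidate executable found, or None."""
    found = {}
    for tool, names in candidates.items():
        paths = (shutil.which(name) for name in names)
        found[tool] = next((p for p in paths if p is not None), None)
    return found


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool:12s} {path or 'NOT FOUND'}")
```

On Windows, a hit for `convert` may point to the unrelated system utility of the same name, so the reported path is worth checking by eye.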

In [1]:
from majordome import PdfToTextConverter

Assuming the dependencies are found in the path, it is simply a matter of creating a converter:

In [2]:
converter = PdfToTextConverter()

For digitally generated PDFs (not scanned documents), it is much faster to avoid OCR; below we show the metadata extracted from a paper:

In [3]:
data = converter("data/sample-pdf/paper.pdf", use_ocr=False)
data.meta

{'/Author': "W. Dal'Maz Silva",
 '/CreationDate': "D:20170403224009+05'30'",
 '/Creator': 'Elsevier',
 '/CrossMarkDomains[1]': 'elsevier.com',
 '/CrossMarkDomains[2]': 'sciencedirect.com',
 '/CrossmarkDomainExclusive': 'true',
 '/CrossmarkMajorVersionDate': '2010-04-23',
 '/ElsevierWebPDFSpecifications': '6.5',
 '/Keywords': 'Hardness measurement; Martensite; Low-alloy steel; Precipitation',
 '/ModDate': "D:20170403224009+05'30'",
 '/Subject': 'Materials Science & Engineering A, 693 (2017) 225-232. doi:10.1016/j.msea.2017.03.077',
 '/Title': 'Carbonitriding of low alloy steels_ Mechanical and metallurgical responses',
 '/doi': '10.1016/j.msea.2017.03.077',
 '/robots': 'noindex'}
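
The `/CreationDate` and `/ModDate` entries above use the PDF date string format (`D:YYYYMMDDHHmmSS` followed by a UTC offset). As a minimal sketch, they can be turned into `datetime` objects with the standard library. The helper `parse_pdf_date` is hypothetical and not part of Majordome; it handles only the full form shown above, with an optional `Z` or `+HH'mm'` offset, and rejects anything else.

```python
import re
from datetime import datetime, timedelta, timezone

# Full-form PDF date: D:YYYYMMDDHHmmSS, optionally followed by Z or +HH'mm'.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'(\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Parse a full-form PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"unsupported PDF date string: {value!r}")

    groups = match.groups()
    year, month, day, hour, minute, second = map(int, groups[:6])
    utc, sign, off_h, off_m = groups[6:]

    tz = None  # no offset given: return a naive datetime
    if utc:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m))
        tz = timezone(-offset if sign == "-" else offset)

    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20170403224009+05'30'")
print(created.isoformat())  # 2017-04-03T22:40:09+05:30
```

Assuming `data.meta` behaves like the dictionary printed above, the same call would be `parse_pdf_date(data.meta["/CreationDate"])`.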

For scanned documents, OCR is used by default as a fallback for text extraction, even when it has not been explicitly enabled:

In [4]:
data = converter("data/sample-pdf/scanned.pdf", last_page=1)
data.content[:500]

'549\n\n5.. Uber die von der molekularkinetischen Theorie\nder Wdirme geforderte Bewegung von in ruhenden\nFlissigkeiten suspendierten Teilchen;\nvon A, Einstein.\n\nIn dieser Arbeit soll gezeigt werden, daB nach der molekular-\nkinetischen Theorie der Warme in Flissigkeiten suspendierte\nKérper von mikroskopisch sichtbarer GréBe infolge der Mole-\nkularbewegung der Warme Bewegungen von solcher GréBe\nausfiihren miissen, daB diese Bewegungen leicht mit dem\nMikroskop nachgewiesen werden kénnen. Es ist méglic'
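
The output above shows why curation is still needed: besides misrecognized characters, words are split across lines by typesetting hyphens (for example `molekular-\nkinetischen`). As one possible first curation step, the sketch below joins such words with a simple heuristic. The helper `join_hyphenated` is hypothetical and not part of Majordome, and the heuristic will also merge genuine hyphenated compounds that happen to break at a line end before a lowercase letter.

```python
import re

# A hyphen directly after a word character, followed by a line break and
# another word character, is a candidate for a word split by typesetting.
_LINE_HYPHEN = re.compile(r"(?<=\w)-\n(\w)")


def join_hyphenated(text):
    """Join words hyphenated across line breaks (heuristic, hypothetical helper)."""

    def _join(match):
        continuation = match.group(1)
        # Only join when the continuation starts in lowercase; an uppercase
        # letter or a digit more likely indicates a real compound or a range.
        if continuation.islower():
            return continuation
        return match.group(0)

    return _LINE_HYPHEN.sub(_join, text)


sample = "nach der molekular-\nkinetischen Theorie"
print(join_hyphenated(sample))  # nach der molekularkinetischen Theorie
```

Character-level OCR errors such as `Wdirme` or `GréBe` are not addressed by this step and still require manual review or a better-suited OCR language model.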