In [1]:
!git clone https://github.com/Fuenfgeld/LLM-Utility-Cookbook.git

Cloning into 'LLM-Utility-Cookbook'...
remote: Enumerating objects: 57, done.
remote: Counting objects: 100% (57/57), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 57 (delta 25), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (57/57), 1.07 MiB | 11.57 MiB/s, done.
Resolving deltas: 100% (25/25), done.


**Install Required Libraries**

In this section, we install the system packages needed for OCR: Tesseract, the OCR engine itself, and poppler-utils, which the pdf2image library uses to render PDF pages as images.

In [2]:
!sudo apt-get install -y tesseract-ocr
!sudo apt-get install -y poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 18 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 0s (12.0 MB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debco

In [3]:
!pip install pytesseract
!pip install pdf2image

Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.10
Collecting pdf2image
  Downloading pdf2image-1.16.3-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.16.3
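
Before running OCR, it can help to confirm that the command-line tools installed above are actually on the `PATH`, since pytesseract and pdf2image both call external binaries. The following is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical, and `pdftoppm` is assumed to be the poppler-utils binary used for page rendering.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required command-line tools that are not on PATH."""
    # shutil.which returns None when the executable cannot be found.
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found.")
```

An empty result suggests the installation cells ran successfully; any name listed points to the package that should be reinstalled.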


In [4]:
import pytesseract
from PIL import Image

In [5]:
imagePath = '/content/LLM-Utility-Cookbook/data/DocImage.png'

In [6]:
extractedText = pytesseract.image_to_string(Image.open(imagePath))

In [7]:
extractedText

"LLM-Utility-Cookbook\n\n \n\nHello, future LLM enthusiasts! Welcome to the LLM-Utility-Cookbook, a place where we'll explore, understand, and\nplay with a myriad of tools and techniques related to Large Language Models (LLMs). This repository serves as an\nextension of our lectures, bridging theory and practice in the most interactive way possible.\n\nOur Learning Menu ™\n\n \n\nHere's what we'll be exploring together:\n\n. Voice to Text: We'll unravel the magic behind turning spoken words into written text.\n\n. Text to Voice: A dive into how we can transform static text into expressive audible speech.\n. Document Scan to Text: Learn how to breathe digital life into your physical documents.\n\n. Prompts: Together, we'll optimize and manage prompts to extract the most from our LLMs.\nMemory: Get hands-on with persisting states between calls in a chain or agent.\n\n. Indexes: We'll tinker with loading, querying, and updating external data.\n\nChains: Discover the art of crafting struct

# **PDF to Image to Text**
For PDFs, the process is a bit different. Since OCR engines typically work on images, we first convert the PDF to images. Each page of the PDF is converted into a separate image. Then, we apply the OCR engine to each image to extract the text.
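
The two-step process described above can be sketched as a single function that converts the pages in memory and runs OCR on each one, without writing intermediate files. This is only an illustration: the names `ocr_pdf`, `to_images`, and `to_text` are hypothetical, the two steps are passed in as parameters so they can be swapped out, and the `dpi=200` value is an assumption that may need tuning for a given scan.

```python
from typing import Callable, List, Optional


def ocr_pdf(pdf_path: str,
            to_images: Optional[Callable] = None,
            to_text: Optional[Callable] = None,
            dpi: int = 200) -> List[str]:
    """Return the OCR text of each PDF page, in page order."""
    if to_images is None:
        # Imported lazily so the function can be defined without pdf2image installed.
        from pdf2image import convert_from_path
        to_images = lambda path: convert_from_path(path, dpi=dpi)
    if to_text is None:
        import pytesseract
        to_text = pytesseract.image_to_string
    # Step 1: render pages as images. Step 2: run OCR on each image.
    return [to_text(page) for page in to_images(pdf_path)]
```

Calling `ocr_pdf(pdfPath)` with the defaults would use pdf2image and pytesseract; passing custom callables makes it easy to try a different renderer or OCR engine.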

In [8]:
from pdf2image import convert_from_path
import os

# **Convert PDF to Images**
We convert each page of the PDF into a separate image using the pdf2image library. These images are saved in a specified output directory.



In [9]:
pdfPath = '/content/LLM-Utility-Cookbook/data/ScanPDF.pdf'
outputDirPath = '/content/pdfImages'
os.makedirs(outputDirPath, exist_ok=True)

# Render every PDF page as a PIL image, then save each page as a JPEG file.
images = convert_from_path(pdfPath)
for i, image in enumerate(images):
  image.save(os.path.join(outputDirPath, 'output' + str(i) + '.jpg'), 'JPEG')

# **Extract Text from Images**
We iterate through each image that resulted from the PDF conversion and extract text using Tesseract. The text is saved in a dictionary with the image's filename as the key for easy lookup.

In [10]:
# os.listdir returns names in arbitrary order, so sort them for a stable result.
imagesToProcess = sorted(os.listdir(outputDirPath))
extractedTextPages = {}

for tempFileName in imagesToProcess:
  tempPath = os.path.join(outputDirPath, tempFileName)
  extractedTextPages[tempFileName] = pytesseract.image_to_string(Image.open(tempPath))
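
Because the dictionary is keyed by filename, reassembling the full document requires ordering the keys by page number. A plain string sort would place `output10.jpg` before `output2.jpg`, so the sketch below sorts on the number embedded in each name. The helpers `page_number` and `join_pages` are hypothetical names, and the sketch assumes filenames of the form `output<N>.jpg` as produced above.

```python
import re


def page_number(file_name):
    """Extract the first integer in a filename, or -1 if there is none."""
    match = re.search(r'\d+', file_name)
    return int(match.group()) if match else -1


def join_pages(pages):
    """Concatenate per-page OCR text in numeric page order."""
    ordered_names = sorted(pages, key=page_number)
    # Strip stray whitespace around each page and separate pages with a blank line.
    return '\n\n'.join(pages[name].strip() for name in ordered_names)
```

For example, `join_pages(extractedTextPages)` would yield one string containing the text of every page in reading order.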

In [11]:
extractedTextPages

{'output0.jpg': " \n\nLLM-Utility-Cookbook\n\nHello, future LLM enthusiasts! Welcome to the LLM-Utility-Cookbook, a place where\nwe'll explore, understand, and play with a myriad of tools and techniques related to\nLarge Language Models (LLMs). This repository serves as an extension of our\nlectures, bridging theory and practice in the most interactive way possible.\n\nOur Learning Menu &\nHere's what we'll be exploring together:\n\n1. Voice to Text: We'll unravel the magic behind turning spoken words into\nwritten text.\n\n2. Text to Voice: A dive into how we can transform static text into expressive\naudible speech.\n\n3. Document Scan to Text: Learn how to breathe digital life into your physical\ndocuments.\n\n4. Prompts: Together, we'll optimize and manage prompts to extract the most\nfrom our LLMs.\n\n5. Memory: Get hands-on with persisting states between calls in a chain or\nagent.\n\n6. Indexes: We'll tinker with loading, querying, and updating external data.\n\n7. Chains: Disco