<a href="https://colab.research.google.com/github/rahiakela/general-utility-notebooks/blob/main/arabic_to_eng_with_gemini.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

**Reference**:

https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemini-api/docs/vision.ipynb

https://learnopencv.com/optical-character-recognition-using-paddleocr/

https://stackoverflow.com/questions/76728440/not-able-to-import-paddleocr-library-on-google-colab

https://stackoverflow.com/questions/46184239/python-extract-a-pdf-page-as-a-jpeg

In [1]:
!pip install -q -U google-generativeai

In [None]:
%%shell

pip install pillow
pip install pdf2image

In [None]:
!sudo apt-get install -y build-essential libpoppler-cpp-dev pkg-config python3-dev
!sudo apt-get install -y tesseract-ocr
!sudo apt-get install -y poppler-utils

In [4]:
import os
import tempfile
from pdf2image import convert_from_path
from PIL import Image
import base64
from IPython.display import Markdown

import google.generativeai as genai

In [None]:
!wget "https://github.com/rahiakela/genai-research-and-practice/raw/main/gemini-projects/dataset.zip?raw=true" -O dataset.zip
!unzip dataset.zip

In [7]:
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

genai.configure(api_key=GOOGLE_API_KEY)

## PDF to Image

In [8]:
def convert_pdf(file_path, output_path):
    """Render every page of a PDF and stack the pages vertically into one image."""
    # pdf2image writes its intermediate files to temp_dir; they are deleted on exit
    with tempfile.TemporaryDirectory() as temp_dir:
        pages = convert_from_path(file_path, output_folder=temp_dir)

        # load each page fully into memory (as RGB) before the temp files disappear
        imgs = [page.convert("RGB") for page in pages]

    # the canvas is as wide as the widest page and as tall as all pages combined
    max_img_width = max(img.width for img in imgs)
    total_height = sum(img.height for img in imgs)

    # white background, so narrower pages are not padded with black
    merged_image = Image.new("RGB", (max_img_width, total_height), "white")

    # paste the pages one below the other
    y = 0
    for img in imgs:
        merged_image.paste(img, (0, y))
        y += img.height

    # save merged image
    merged_image.save(output_path)

    return output_path
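
The stacking step in `convert_pdf` is simple arithmetic: the canvas is as wide as the widest page and as tall as all pages combined, and each page is pasted at the running height. The sketch below works that layout out on plain `(width, height)` tuples. `plan_vertical_stack` and `JPEG_MAX_DIM` are names introduced here for illustration only. The size check reflects the JPEG format's limit of 65,535 pixels per dimension, which a long PDF merged into one tall image could exceed.

```python
JPEG_MAX_DIM = 65535  # JPEG stores each image dimension in 16 bits


def plan_vertical_stack(sizes):
    """Return (canvas_width, canvas_height, y_offsets) for pages stacked top to bottom.

    sizes: list of (width, height) tuples, one per page.
    """
    if not sizes:
        raise ValueError("at least one page is required")

    canvas_width = max(width for width, _ in sizes)

    # Each page starts where the previous one ended
    y_offsets = []
    y = 0
    for _, height in sizes:
        y_offsets.append(y)
        y += height

    if canvas_width > JPEG_MAX_DIM or y > JPEG_MAX_DIM:
        raise ValueError(f"{canvas_width}x{y} exceeds the JPEG size limit")

    return canvas_width, y, y_offsets


# Illustrative page sizes, not taken from the dataset
print(plan_vertical_stack([(1654, 2339), (1654, 2339), (1700, 2200)]))
```

If the check fails for a real document, saving one image per page is the simpler option.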

In [9]:
!mkdir img_output

In [11]:
output_path = convert_pdf("dataset/Input- arabic.pdf", "img_output/input_arabic.jpg")

## Image bytes

In [12]:
# Open the merged image produced by convert_pdf
file_path = 'img_output/input_arabic.jpg'
image = Image.open(file_path)

In [None]:
display(image)

In [13]:
# Re-encode the image as JPEG bytes in memory
import io
buffered = io.BytesIO()
image.save(buffered, format="JPEG")
img_bytes = buffered.getvalue()
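
Because `convert_pdf` already wrote a JPEG to disk, the bytes can also be read straight from the file. That skips the second round of lossy JPEG encoding that saving through Pillow performs. A small sketch, where `read_image_bytes` is a hypothetical helper and the check on the first two bytes (the JPEG start-of-image marker `FF D8`) is only a sanity test:

```python
from pathlib import Path


def read_image_bytes(path):
    """Return the raw bytes of a JPEG file without re-encoding it."""
    data = Path(path).read_bytes()

    # Every JPEG file starts with the start-of-image marker FF D8
    if data[:2] != b"\xff\xd8":
        raise ValueError(f"{path} does not look like a JPEG file")

    return data
```

In this notebook the call would be `img_bytes = read_image_bytes('img_output/input_arabic.jpg')`.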

## Translation

In [14]:
# Choose a Gemini model
model = genai.GenerativeModel(model_name="gemini-1.5-pro")

# Create a prompt
prompt = """You are an expert Arabic-to-English translator.
            Translate the Arabic text in the provided image into English.
            Do NOT make any assumptions.
          """
response = model.generate_content(
    [
        {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(img_bytes).decode("utf-8"),
        },
        prompt,
    ]
)

In [15]:
Markdown(">" + response.text)

>**Cover Page:**

Deloitte.

*(Inside the box)*
Report on Agreed-Upon Procedures
Regarding the Financial Statements of
The Social Development Bank
For the Year Ended December 31, 2021

*(Bottom of the page)*
www.deloitte.com/middleeast

**Page 2:**

Deloitte.

*(Top Right)*
Social Development Bank
Agreed-Upon Procedures
Financial Statements as of 31/12/2021

*(Below Deloitte logo)*
Report on Agreed-Upon Procedures

We have performed the agreed-upon procedures listed in the attached annex regarding the accompanying financial statements of the Social Development Bank (the "Bank") for the year ended December 31, 2021.  The Bank's management is responsible for the financial statements. This agreed-upon procedures engagement was conducted in accordance with International Standard on Related Services 4400 (Revised), Engagements to Perform Agreed-Upon Procedures Regarding Financial Information.  We make no representation regarding the sufficiency of the procedures described in the annex either for the purpose for which this report has been requested or for any other purpose.

*(Signature and Stamp)*
Jeddah, Kingdom of Saudi Arabia
March 8, 2022

*(Bottom Right)*
Scanned by CamScanner

*(The following pages contain financial tables and Arabic text describing the procedures performed and the findings. Due to the image quality, precise translation of the numerical data and the lengthy Arabic descriptions is not reliably possible.  If sharper images of these sections are provided, a more accurate translation can be furnished.)*


The subsequent pages show tables with numerical data, and descriptive text in Arabic.  Key terms recurring include references to specific line items in the financial statements,  "current year," "previous year," and phrases like "agreed-upon procedures," and "bank management."  "Scanned by CamScanner" appears at the bottom right of each page.

*(Final Page with partial black triangle)*

This page appears to contain a continuation of the Arabic text and potentially more financial data, but it is difficult to decipher due to the image quality and the partial obscuration.


To provide a more complete and accurate translation, please provide higher-resolution images of each page. This will allow for proper reading of the financial figures and the accompanying Arabic descriptions.


```log
This document appears to be a financial report or audit statement prepared by Deloitte, likely for a client in a region where Arabic is used. Because the image is blurry and fragmented, providing a completely accurate translation is impossible. However, I can give you a general idea of what some sections likely contain:

Cover Page: This shows the Deloitte logo and likely includes information like the report title, client name (redacted in this case), and date. The Arabic phrase likely translates to something similar to "Independent Auditor's Report."

Subsequent Pages: These pages contain financial data presented in tables. Typical elements that can be inferred, though the numbers are unreadable:

Amounts in Arabic numerals: These are financial figures, likely in the local currency.
Column Headings: Likely represent periods (e.g., "Current Year," "Prior Year," possibly quarters or months). Other columns might indicate "Description" or "Account Name."
Row Labels (Arabic text): These would be the names of accounts (e.g., "Cash and Cash Equivalents," "Accounts Receivable," "Revenue," "Expenses," "Net Income," etc.). Due to blurriness, providing specific translations is impossible.
Footnotes (Arabic text at page bottoms): These provide further explanations or details regarding the figures presented in the tables. They often explain accounting policies or significant events.
"Scanned by CamScanner": Indicates the document was digitally scanned.
Narrative Sections (Arabic Text): These sections, also too blurry to read, would contain explanations and analysis of the financial data. They'd likely cover topics like:

Basis of Presentation: Explains the accounting standards followed (e.g., IFRS).
Key Performance Indicators: Discussion of important financial metrics.
Risk Factors: Potential issues that could affect the company's financial performance.
Auditor's Opinion: Deloitte's formal statement on the fairness and accuracy of the financial statements.
To provide a more useful translation, you would need to provide a clearer image of the document. If you can provide a sharper image of specific sections you are most interested in, I can attempt a more precise translation.
```
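
Both responses above say the merged image is too blurry and ask for sharper, per-page images. One option, not tried in this notebook, is to send each page as its own image and join the results. The sketch below shows only the loop. `translate_pages` is a hypothetical helper, and `translate_page` stands in for whatever function wraps `model.generate_content` for a single image; here it is replaced by a stub so the example runs without an API key. Whether this actually improves the translation is an assumption that would need testing.

```python
def translate_pages(page_images, translate_page):
    """Translate each page separately and join the results under page headers.

    page_images: list of JPEG bytes, one entry per page.
    translate_page: callable taking image bytes and returning translated text
                    (hypothetical wrapper around model.generate_content).
    """
    sections = []
    for number, image_bytes in enumerate(page_images, start=1):
        try:
            text = translate_page(image_bytes)
        except Exception as exc:
            # Keep going if a single page fails, and record why
            text = f"[translation failed: {exc}]"
        sections.append(f"**Page {number}:**\n\n{text.strip()}")
    return "\n\n".join(sections)


# Stub translator used only to demonstrate the control flow
def fake_translate(image_bytes):
    return f"translated {len(image_bytes)} bytes"


print(translate_pages([b"page-one", b"page-two!"], fake_translate))
```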

In [16]:
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

In [17]:
# Create a prompt
prompt = """You are an expert Arabic-to-English translator.
            Translate the Arabic text in the provided image into English.
            Do NOT make any assumptions.
          """
response = model.generate_content(
    [
        {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(img_bytes).decode("utf-8"),
        },
        prompt,
    ]
)

In [18]:
Markdown(">" + response.text)

>Certainly! I can translate the Arabic text in the image for you.  However, due to the image quality and the handwriting style,  some parts may be difficult to decipher with complete accuracy.  I'll do my best to provide a faithful translation, noting any ambiguities.

Please provide the image.  I need the image to be able to complete the translation.


## OCR

In [None]:
!pip install paddleocr
!pip install paddlepaddle

In [None]:
!wget http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.0g-2ubuntu4_amd64.deb
!sudo dpkg -i libssl1.1_1.1.0g-2ubuntu4_amd64.deb

In [None]:
!git clone https://github.com/PaddlePaddle/PaddleOCR

In [1]:
# Import the functions needed for inference and visualization.
from paddleocr import PaddleOCR, draw_ocr
import os
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

In [10]:
from pdf2image import convert_from_path
pages = convert_from_path('dataset/Input- arabic.pdf', 500)

In [11]:
for count, page in enumerate(pages):
    page.save(f'out{count}.jpg', 'JPEG')

In [None]:
# The document is in Arabic, so load the Arabic recognition model rather than "en"
ocr = PaddleOCR(lang="ar")

In [6]:
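font = '/content/PaddleOCR/StyleText/fonts/simfang.ttf'

def save_ocr(img_path, out_path, result, font):
  # Build e.g. img_output/out0_output.jpg; cv2.imwrite needs a real image extension
  stem, ext = os.path.splitext(os.path.basename(img_path))
  save_path = os.path.join(out_path, f'{stem}_output{ext}')

  image = cv2.imread(img_path)

  # Each line is expected to look like [box, (text, score)]
  boxes = [line[0] for line in result]
  txts = [line[1][0] for line in result]
  scores = [line[1][1] for line in result]

  # Draw the boxes and recognized text on the page
  im_show = draw_ocr(image, boxes, txts, scores, font_path=font)

  cv2.imwrite(save_path, im_show)

  # Show the annotated image inline
  img = cv2.cvtColor(im_show, cv2.COLOR_BGR2RGB)
  plt.imshow(img)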

In [None]:
out_path = "img_output"
file_path = 'out0.jpg'
result = ocr.ocr(file_path)

In [13]:
out_path = "img_output"
save_ocr(file_path, out_path, result, font)

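TypeError: '<' not supported between instances of 'tuple' and 'float'

This error most likely comes from the shape of `result`. Recent PaddleOCR releases return one list per input image, so `save_ocr` ends up reading a `(text, score)` tuple where it expects a float score. Passing `result[0]` instead of `result` should avoid it.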
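
A minimal sketch of unpacking the `ocr.ocr` output, assuming the layout that recent PaddleOCR releases return: an outer list with one entry per input image, where each entry is a list of `[box, (text, score)]` lines. `unpack_ocr_result` is a hypothetical helper and the sample data is hand-written for illustration, not real OCR output.

```python
def unpack_ocr_result(result, image_index=0):
    """Split nested PaddleOCR output into parallel lists of boxes, texts and scores.

    Assumes result[image_index] is a list of [box, (text, score)] lines.
    An entry of None (assumed to mean no text was found) is treated as empty.
    """
    lines = result[image_index] or []
    boxes = [line[0] for line in lines]
    texts = [line[1][0] for line in lines]
    scores = [line[1][1] for line in lines]
    return boxes, texts, scores


# Hand-written sample in the assumed nested layout (one image, two lines)
sample = [[
    [[[10, 10], [90, 10], [90, 30], [10, 30]], ("Deloitte", 0.98)],
    [[[10, 40], [200, 40], [200, 60], [10, 60]], ("2021", 0.91)],
]]

boxes, texts, scores = unpack_ocr_result(sample)
print(texts, scores)
```

With the lists separated this way, every score passed on to `draw_ocr` is a plain float, which is what its threshold comparison needs.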