<a href="https://colab.research.google.com/github/rosemarythomas994/Ai/blob/main/Copy_of_Llava_demo_4bit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running Llava: a large multi-modal model on Google Colab

Run the Llava model on Google Colab!

Llava is a multi-modal model that takes an image and text as input and produces text; it can be seen as an "open source version of GPT4". It yields very good results, as we will see in this Google Colab demo.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/62441d1d9fdefb55a0b7d12c/FPshq08TKYD0e-qwPLDVO.png)

The architecture is a decoder-only text model that takes vision hidden states concatenated with text hidden states as input.

We will leverage the QLoRA quantization method and use `pipeline` to run our model.
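As a rough illustration of the setup described above, the sketch below loads an image-to-text `pipeline` with 4-bit weights through `BitsAndBytesConfig`. The checkpoint name `llava-hf/llava-1.5-7b-hf` and the helper names `format_llava_prompt` and `build_llava_pipeline` are assumptions made for this example and do not come from the notebook; loading the model also requires a CUDA GPU and the pinned `transformers`, `bitsandbytes` and `accelerate` packages.

```python
def format_llava_prompt(question: str) -> str:
    """Build a LLaVA-1.5 style prompt; <image> marks where the image is inserted."""
    return f"USER: <image>\n{question}\nASSISTANT:"


def build_llava_pipeline(model_id: str = "llava-hf/llava-1.5-7b-hf"):
    """Load an image-to-text pipeline with 4-bit quantized weights.

    The default model_id is an assumed checkpoint name, not taken from the notebook.
    """
    # Imported lazily so the prompt helper above works without these packages.
    import torch
    from transformers import BitsAndBytesConfig, pipeline

    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # store weights in 4 bits
        bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
    )
    return pipeline(
        "image-to-text",
        model=model_id,
        model_kwargs={"quantization_config": quantization_config},
    )
```

With a PIL image in hand, a call would look something like `build_llava_pipeline()(image, prompt=format_llava_prompt("What is shown here?"), generate_kwargs={"max_new_tokens": 200})`; the exact keyword arguments may differ between `transformers` versions.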

In [2]:
# !pip install -q -U transformers==4.37.2
# !pip install -q bitsandbytes==0.41.3 accelerate==0.25.0
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.4-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.4


In [3]:
!pip install pytesseract

Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13
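`pytesseract` is only a Python wrapper: the OCR itself is done by the separate `tesseract` executable, which `pip` does not install. The check below is a minimal sketch, assuming the binary is expected on `PATH`; the helper names `find_tesseract` and `tesseract_version` are made up for this example.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(path: str) -> str:
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tesseract builds print the version to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


binary = find_tesseract()
if binary is None:
    print("tesseract not found; on Colab it can usually be installed "
          "with: apt-get install -y tesseract-ocr")
else:
    print(tesseract_version(binary))
```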


In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import re
import json

# Path to your PDF
pdf_path = "/test.pdf"

def ocr_pdf_to_text(pdf_path):
    """Convert PDF pages to OCR text using Tesseract"""
    doc = fitz.open(pdf_path)
    results = []
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=300)  # render at 300 dpi for accuracy
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        text = pytesseract.image_to_string(img, lang="eng")
        results.append(text)
    return results

def parse_export_license(text_pages):
    """Extract structured fields into JSON"""
    data = {}

    # Combine all pages into one text block
    full_text = "\n".join(text_pages)

    # ---------------- Contact Information ----------------
    def first_group(pattern):
        """Return the stripped first capture group of `pattern` in full_text, or None."""
        match = re.search(pattern, full_text)
        return match.group(1).strip() if match else None

    data["ContactInformation"] = {
        "ReferenceNumber": first_group(r"Reference Number\s+(\S+)"),
        "ContactPerson": first_group(r"1\. Contact Person.*\n(.*)"),
        "Telephone": first_group(r"Telephone Number.*\n(\d+)"),
        "Email": first_group(r"Email\s*\n([^\s]+@[^\s]+)"),
        "CreationDate": first_group(r"Creation Date\s*\n(\d{2}/\d{2}/\d{4})"),
        "ApplicationType": first_group(r"Type of Application\s*\n(.+)"),
    }

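    # ---------------- Applicant Information ----------------
    # NOTE: on the sample PDF the "Name" group over-captures (see the printed JSON):
    # with re.DOTALL the lazy group runs on to a later "\nAddress".
    applicant_match = re.search(r"CIN \(Applicant ID\)\s*([\w\d]+)\s*(.*?)\nAddress", full_text, re.DOTALL)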
    if applicant_match:
        data["ApplicantInformation"] = {
            "CIN": applicant_match.group(1),
            "Name": applicant_match.group(2).strip()
        }

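    # ---------------- Purchaser Information ----------------
    # NOTE: on the sample PDF group 2 captures the next label ("Address 2")
    # rather than an address value (see the printed JSON).
    purchaser_match = re.search(r"Purchaser\s*\n(.*?)\n\nAddress 1\s*(.*?)\n", full_text, re.DOTALL)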
    if purchaser_match:
        data["PurchaserInformation"] = {
            "Name": purchaser_match.group(1).strip(),
            "Address": purchaser_match.group(2).strip()
        }

    # ---------------- Intermediate Consignee ----------------
    consignee_match = re.search(r"Intermediate Consignee\s*\n(.*?)\n", full_text, re.DOTALL)
    if consignee_match:
        data["IntermediateConsignee"] = consignee_match.group(1).strip()

    # ---------------- Document Checklist ----------------
    checklist_items = []
    checklist_section = re.search(r"Document Checklist(.*?)(Applicant Information|License Information)", full_text, re.DOTALL)
    if checklist_section:
        lines = checklist_section.group(1).splitlines()
        for line in lines:
            line = line.strip()
            if not line:
                continue
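            # detect checkbox markers (OCR may output _, CJ, ✔, etc.);
            # "CJ" is matched as a whole token so ordinary words starting with C or J are not flagged
            checked = bool(re.match(r"^(?:CJ\b|[_\[\(✔])", line))
            # clean up item text
            item = re.sub(r"^(?:CJ\b|[_\[\(✔\)])+", "", line).strip(" -")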
            checklist_items.append({"item": item, "selected": checked})
    if checklist_items:
        data["DocumentChecklist"] = checklist_items

    return data

if __name__ == "__main__":
    # Step 1: OCR all pages
    pages_text = ocr_pdf_to_text(pdf_path)

    # Step 2: Parse into JSON
    extracted_data = parse_export_license(pages_text)

    # Step 3: Print JSON
    print(json.dumps(extracted_data, indent=4))


{
    "ContactInformation": {
        "ReferenceNumber": "SLV0530",
        "ContactPerson": "Shelley Vybiral",
        "Telephone": "6302003543",
        "Email": "shelley.vybiral@cmcelectronics.us",
        "CreationDate": "05/30/2025",
        "ApplicationType": "Export License Application"
    },
    "ApplicantInformation": {
        "CIN": "C702375",
        "Name": "Address 1\n84 N. Dugan Road\n\nCity\nSugar Grove\n\nState/Province\nIllinois\n\n11. Replacement License Number\n\nImport Certificate Number\n\nApplicant\nCMC Electronics Aurora, LLC"
    },
    "PurchaserInformation": {
        "Name": "PILATUS AIRCRAFT LIMITED",
        "Address": "Address 2"
    },
    "IntermediateConsignee": "Hellmann Worldwide Logistics AG",
    "DocumentChecklist": [
        {
            "item": "6. Documents submitted with application",
            "selected": false
        },
        {
            "item": "Export Items (BIS-748P-A)",
            "selected": false
        },
        {
        

In [16]:
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

# Path to your PDF
pdf_path = "/content/test.pdf"

def ocr_pdf_to_text(pdf_path, output_txt="output.txt"):
    """Extract text from all pages of PDF using OCR and save to a file
    """
    doc = fitz.open(pdf_path)
    all_text = []

    for i, page in enumerate(doc):
        # Convert each page to image
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

        # OCR using Tesseract
        text = pytesseract.image_to_string(img, lang="eng")

        # Save per-page text
        page_header = f"\n\n===== PAGE {i+1} =====\n\n"
        all_text.append(page_header + text.strip())

    # Combine all pages into one text string
    full_text = "\n".join(all_text)

    # Save to file
    with open(output_txt, "w", encoding="utf-8") as f:
        f.write(full_text)

    return full_text

if __name__ == "__main__":
    extracted_text = ocr_pdf_to_text(pdf_path)
    print(extracted_text[:2000])  # print first 2000 characters as a preview




===== PAGE 1 =====

= An official website of the United States government Here's how you know v

Bureau of Industry and Security

U.S. Department of Commerce

 

Export License Application _ status (¢ompzetes=xpPROVEDW/CONDITIONS)

Contact Information

Reference Number
SLV0530

1. Contact Person (First Name, Last Name)
Shelley Vybiral

2. Telephone Number 3. Fax Number
6302003543 -

Email
shelley.vybiral@cmcelectronics.us

4. Creation Date
05/30/2025

5. Type of Application
Export License Application

Document Checklist

6. Documents submitted with application 7. Documents on file with applicant
Export Items (BIS-748P-A) (_) Bis-711

CJ End Users (BIS-748P-B) CJ Letter of Assurance

CJ BIS-711 CJ Import/End-User Certificate
Import/End-User Certificate CJ Nuclear Certification

Technical Specification
C) P Other
CJ Letter of Explanation -

(_) Foreign Availability

Other


===== PAGE 2 =====

purchase order

License Information

9. Special Purpose

10. Resubmission ACN

13. Import Cer
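
In the OCR text above, most fields appear as a label line followed by the value on the next non-empty line. A line-based lookup is one possible alternative to multi-line regular expressions for that layout. The sketch below assumes this label-then-value layout holds; `value_after_label` is a hypothetical helper name, and the sample string is copied from the page 1 output.

```python
from typing import Optional


def value_after_label(text: str, label: str) -> Optional[str]:
    """Return the first non-empty line that follows a line containing `label`."""
    lines = [line.strip() for line in text.splitlines()]
    for index, line in enumerate(lines):
        if label in line:
            # Skip blank lines between the label and its value.
            for candidate in lines[index + 1:]:
                if candidate:
                    return candidate
            return None
    return None


sample = """Reference Number
SLV0530

1. Contact Person (First Name, Last Name)
Shelley Vybiral

2. Telephone Number 3. Fax Number
6302003543 -

Email
shelley.vybiral@cmcelectronics.us
"""

print(value_after_label(sample, "Reference Number"))  # SLV0530
print(value_after_label(sample, "Contact Person"))    # Shelley Vybiral
print(value_after_label(sample, "Fax Number"))        # 6302003543 -
```

When two labels share a line, as with the telephone and fax numbers, the helper returns the whole value line and the caller still has to split it.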

In [7]:
print(extracted_text)



===== PAGE 1 =====

= An official website of the United States government Here's how you know v

Bureau of Industry and Security

U.S. Department of Commerce

 

Export License Application _ status (¢ompzetes=xpPROVEDW/CONDITIONS)

Contact Information

Reference Number
SLV0530

1. Contact Person (First Name, Last Name)
Shelley Vybiral

2. Telephone Number 3. Fax Number
6302003543 -

Email
shelley.vybiral@cmcelectronics.us

4. Creation Date
05/30/2025

5. Type of Application
Export License Application

Document Checklist

6. Documents submitted with application 7. Documents on file with applicant
Export Items (BIS-748P-A) (_) Bis-711

CJ End Users (BIS-748P-B) CJ Letter of Assurance

CJ BIS-711 CJ Import/End-User Certificate
Import/End-User Certificate CJ Nuclear Certification

Technical Specification
C) P Other
CJ Letter of Explanation -

(_) Foreign Availability

Other


===== PAGE 2 =====

purchase order

License Information

9. Special Purpose

10. Resubmission ACN

13. Import Cer
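
The Document Checklist in this output is a two-column table that OCR flattens into single lines, with checkbox glyphs rendered as tokens such as `CJ` or `(_)`. The sketch below splits such a line into `(marker, item)` pairs. It only records the raw marker: the output alone does not show which rendering corresponds to a ticked box, so that mapping is left to the caller. `split_checklist_line` and `MARKER` are hypothetical names for this example.

```python
import re
from typing import List, Optional, Tuple

# Checkbox renderings seen in the OCR output: the token "CJ" and "(_)".
MARKER = re.compile(r"(\bCJ\b|\(_\))")


def split_checklist_line(line: str) -> List[Tuple[Optional[str], str]]:
    """Split one OCR line into (marker, item) pairs; marker is None if absent."""
    # With one capture group, re.split returns [text, marker, text, marker, text, ...].
    parts = MARKER.split(line)
    items: List[Tuple[Optional[str], str]] = []
    leading = parts[0].strip()
    if leading:
        items.append((None, leading))
    for marker, text in zip(parts[1::2], parts[2::2]):
        text = text.strip(" -")
        if text:
            items.append((marker, text))
    return items


for ocr_line in [
    "Export Items (BIS-748P-A) (_) Bis-711",
    "CJ End Users (BIS-748P-B) CJ Letter of Assurance",
]:
    print(split_checklist_line(ocr_line))
```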

In [15]:
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

# Path to your PDF
pdf_path = "/content/test.pdf"

# Prompt template for an LLM-based parsing step. Fill it with
# PROMPT_TEMPLATE.replace("{full_text}", extracted_text); it is not used by
# ocr_pdf_to_text below.
PROMPT_TEMPLATE = """You are a document parser. Your task is to convert the provided PDF text and form data
into structured JSON.

Rules:

1. Include all fields in the output JSON, even if their value is empty or null.
   Understand the file structure.
2. The input may contain checkbox fields; understand how they are represented.
   For checkboxes or radio buttons:
   - Include only the items that are marked as selected, indicated by a leading '✔', 'n', or similar mark.
   - Ignore items without any selection mark.
   - If no items in a checkbox/radio group or checklist are selected, set that field to null.
   - Do NOT output boolean true/false flags for items.
   - If a checkbox is under a different column in a table, keep it under that column name. You have to follow the structure.
3. Split full names into 'FirstName' and 'LastName' if possible.
4. Group logically related fields together.
5. Output valid JSON only. No explanation, no extra text.
6. Extract data even if it is split into two columns and/or numbered columns.

Here are the filled form fields (checkboxes, radios, etc.):

{full_text}

Here is the extracted text from the PDF:

{full_text}"""


def ocr_pdf_to_text(pdf_path, output_txt="output.txt"):
    """Extract text from all pages of PDF using OCR and save to a file"""
    doc = fitz.open(pdf_path)
    all_text = []

    for i, page in enumerate(doc):
        # Convert each page to image
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

        # OCR using Tesseract
        text = pytesseract.image_to_string(img, lang="eng")

        # Save per-page text
        page_header = f"\n\n===== PAGE {i+1} =====\n\n"
        all_text.append(page_header + text.strip())

    # Combine all pages into one text string
    full_text = "\n".join(all_text)

    # Save to file
    with open(output_txt, "w", encoding="utf-8") as f:
        f.write(full_text)

    return full_text

if __name__ == "__main__":
    extracted_text = ocr_pdf_to_text(pdf_path)
    print(extracted_text[:2000])  # print first 2000 characters as a preview




===== PAGE 1 =====

= An official website of the United States government Here's how you know v

Bureau of Industry and Security

U.S. Department of Commerce

 

Export License Application _ status (¢ompzetes=xpPROVEDW/CONDITIONS)

Contact Information

Reference Number
SLV0530

1. Contact Person (First Name, Last Name)
Shelley Vybiral

2. Telephone Number 3. Fax Number
6302003543 -

Email
shelley.vybiral@cmcelectronics.us

4. Creation Date
05/30/2025

5. Type of Application
Export License Application

Document Checklist

6. Documents submitted with application 7. Documents on file with applicant
Export Items (BIS-748P-A) (_) Bis-711

CJ End Users (BIS-748P-B) CJ Letter of Assurance

CJ BIS-711 CJ Import/End-User Certificate
Import/End-User Certificate CJ Nuclear Certification

Technical Specification
C) P Other
CJ Letter of Explanation -

(_) Foreign Availability

Other


===== PAGE 2 =====

purchase order

License Information

9. Special Purpose

10. Resubmission ACN

13. Import Cer
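
The prompt text in the cell above contains `{full_text}` placeholders and asks the model to return JSON only. The notebook does not show which model receives the prompt, so the sketch below covers only the two model-independent steps: filling the placeholders with the OCR text and parsing the reply defensively. `fill_prompt` and `parse_json_reply` are hypothetical helper names.

```python
import json
from typing import Any, Optional


def fill_prompt(template: str, full_text: str) -> str:
    """Substitute the OCR text for every {full_text} placeholder."""
    # str.replace is used instead of str.format so that literal braces in the
    # template (for example a JSON sample) do not raise an error.
    return template.replace("{full_text}", full_text)


def parse_json_reply(reply: str) -> Optional[Any]:
    """Parse a model reply as JSON, tolerating extra text around the object."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass
    # Fall back to the span between the first "{" and the last "}".
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None


prompt = fill_prompt("Text:\n{full_text}\nAgain:\n{full_text}", "Reference Number\nSLV0530")
print(prompt)
print(parse_json_reply('Here you go:\n{"ReferenceNumber": "SLV0530"}'))
```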

In [8]:
pip install pytesseract pdf2image PyPDF2 pdfplumber Pillow

Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting pdfplumber
  Downloading pdfplumber-0.11.7-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20250506 (from pdfplumber)
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)