# extractGemini

> Use the Google GenAI SDK (Gemini) to transcribe financial-statement images into Markdown

In [None]:
#| default_exp extractGemini

In [42]:
#| export 
from fastcore.utils import *
from fastcore.xtras import *
from finfox.core import *
import pandas as pd

## Gemini API

In [39]:
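#| export
#| hide
import os
from dotenv import load_dotenv
from google import genai

load_dotenv()  # read GEMINI_KEY1 from a local .env file, if present
gemini_key = os.environ["GEMINI_KEY1"]
client = genai.Client(api_key=gemini_key)  # shared client used by the helpers below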
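
Indexing `os.environ` directly raises a bare `KeyError` when the key is missing. A minimal sketch of a friendlier lookup, assuming a hypothetical helper `require_env` that is not part of this notebook:

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, failing with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Raised instead of a bare KeyError so the missing setting is obvious.
        raise RuntimeError(f"Set {name} in the environment or in a .env file")
    return value

# Intended use (left commented out so the sketch runs without a real key):
# gemini_key = require_env("GEMINI_KEY1")
```

An empty string is treated as missing here, which is a design choice rather than a requirement of the SDK.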

In [40]:
#| export
def promt_helper(s):
    "Ask Gemini to draft an efficient LLM prompt for the task described in `s`."
    response = client.models.generate_content(
        model="gemini-1.5-flash",
        contents=f"Write an efficient prompt for an LLM for the task: {s}",
    )
    return response.text

In [65]:
p = promt_helper("""
As an AI image extractor, given a financial statement as an image, extract all the information in Markdown.
Transcribe the tables, figures and graphs as reported. Do not return any images.
""")
print(p)

Okay, here's an efficient prompt for an LLM to extract information from a financial statement image and output it in Markdown:

```prompt
You are an expert AI financial statement data extractor. Your task is to analyze images of financial statements and extract all relevant information into a well-structured Markdown document.

**Input:** An image of a financial statement (e.g., Balance Sheet, Income Statement, Cash Flow Statement, or Notes to Financial Statements).

**Instructions:**

1.  **Thorough Extraction:** Extract all textual data from the image, including headings, labels, values, footnotes, and narrative text. Ensure accuracy in transcribing numbers and dates.

2.  **Markdown Formatting:** Organize the extracted information using Markdown syntax for clear presentation.
    *   Use headings (H1-H6) to structure the document according to the statement's layout (e.g., Balance Sheet, Assets, Liabilities & Equity).
    *   Format tables using Markdown table syntax (e.g., `| Header

## Image extraction

In [172]:
promt = """
You are an expert AI finance engine trained to extract data from an input image. Double-check before generation.
Your task is to analyze the provided image and transcribe all relevant information into a structured Markdown document.
Do not miss anything; everything is mission critical.
Wait and think, and be rigorous in following the instructions:
1. **Tables:** Replicate all tables exactly as presented in the image. Use Markdown table syntax for formatting. Preserve headings, row labels, and numerical data with accurate values and formatting (e.g., commas, decimals).
2. **Figures (Numerical Data):** If numerical data is presented outside of tables, transcribe this data with clear labels indicating what the numbers represent. Use bullet points or numbered lists for better organization.
3. **Graphs (Reported Data):** Summarize the information conveyed by any graphs. Identify the key trends, data points, and conclusions that are visually communicated in the graph. Describe these insights in concise sentences or bullet points.  Do not attempt to recreate the graph visually; focus on extracting the reported data.
4. **General instructions:** Do not return any images.
5. **No Interpretation or Analysis:**  Do not provide any analysis, interpretation, or commentary on the financial data. Your sole purpose is transcription and structured representation.
6. **Accuracy:** Ensure 100% accuracy when transcribing numerical data.
7. **Order:** Present the information in the same order as it appears in the image, starting from the top left and proceeding sequentially.
8. **Return all other OCR text as present in the image. Do not summarize or omit any information.**
9. Do not send redundant blank spaces, and do not repeat any non-numerical character more than 3 times (do not send "I am ....."; send "I am ." instead).
"""

In [None]:
promt = """You are an expert AI Finance Engine trained to extract data from input images with exceptional accuracy. 
Your task is to analyze the provided image and transcribe all relevant information into a structured Markdown document.
Every detail is mission critical; do not omit any information.
Please follow these instructions precisely:

1. **Tables:**
   - Replicate all tables exactly as they appear in the image.
   - Format tables using Markdown table syntax.
   - Preserve table headings, row labels, and numerical data with exact formatting (including commas, decimals, etc.).

2. **Figures (Numerical Data):**
   - Transcribe any numerical data presented outside of tables, with clear labels indicating what each number represents.
   - Organize this information using bullet points or numbered lists.

3. **Graphs (Reported Data):**
   - Summarize the data from any graphs by identifying key trends, data points, and conclusions.
   - Describe these insights in concise sentences or bullet points.
   - Do not attempt to recreate graphs visually; focus only on extracting and presenting the reported data.

4. **General Guidelines:**
   - Do not return any images.
   - Do not provide interpretation, analysis, or commentary on the financial data; simply transcribe the information.
   - Ensure 100% accuracy when transcribing numerical data and other text.

5. **Order and Completeness:**
   - Present the information in the same sequence as it appears in the image, starting from the top left and proceeding sequentially.
   - Return all OCR-extracted text exactly as present in the image; do not summarize or omit any content.

6. **Formatting Specifics:**
   - Avoid redundant blank spaces.
   - Do not use sequences of more than three identical non-numerical characters. For example, instead of "I am ....." use "I am ."
   - Limit the generation of chars like

Before generating your final output, double-check all details for completeness and accuracy.
Wait and think, be rigorous."""


In [43]:
df = pd.read_csv("imgs.csv")
df.head()

Unnamed: 0,fn
0,/mnt/d/finfox/ASHOKLEY/Financial Year 2010/ind...
1,/mnt/d/finfox/ASHOKLEY/Financial Year 2010/ind...
2,/mnt/d/finfox/ASHOKLEY/Financial Year 2010/ind...
3,/mnt/d/finfox/ASHOKLEY/Financial Year 2010/ind...
4,/mnt/d/finfox/ASHOKLEY/Financial Year 2010/ind...


In [45]:
df.iloc[0]['fn']

'/mnt/d/finfox/ASHOKLEY/Financial Year 2010/index_img/0.png'

In [49]:
config = genai.types.GenerateContentConfig(
    temperature=0.5,
    max_output_tokens=2_048,  # maximum number of tokens to generate
)

In [62]:
#| export 
from PIL import Image
def im_open(fn:Path):
    "Open the image file at `fn` and return it as a PIL image."
    return Image.open(fn)

In [65]:
fn = Path(df.iloc[10]['fn'])
im = im_open(fn)

In [97]:
fn.parent.parent/"Index_md",  fn.name.split('.')[0]

(Path('/mnt/d/finfox/ASHOKLEY/Financial Year 2010/Index_md'), '19')

In [107]:
#| export 
import re
def md(text: str) -> list:
    """
    Extract the contents of every block wrapped in triple backticks,
    with or without a `markdown` language tag.
    Returns a list of the extracted strings, stripped of surrounding whitespace.
    """
    pattern = r'```(?:markdown)?(.*?)```'  # optional `markdown` tag, then a lazy match up to the closing fence
    return [match.strip() for match in re.findall(pattern, text, re.DOTALL)]

In [108]:
md("""Some content outside
```markdown
This is the markdown content.
```
""")

['This is the markdown content.']
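
The pattern in `md` only matches when the closing fence is present, so a reply that is cut off before its closing triple backticks yields an empty list. A minimal sketch of a more tolerant variant, assuming a hypothetical helper `strip_fence` (not part of this notebook) that returns the body of the first block whether or not the fence is closed:

```python
import re

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable

def strip_fence(text: str) -> str:
    """Return the body of the first fenced block, tolerating a missing closing fence."""
    # Opening fence with an optional 'markdown' tag, then the body (lazy),
    # ending at either a closing fence or the end of the string.
    pattern = FENCE + r"(?:markdown)?\s*(.*?)(?:" + FENCE + r"|\Z)"
    match = re.search(pattern, text, re.DOTALL)
    # No fence at all: fall back to the whole reply.
    return match.group(1).strip() if match else text.strip()

# An unclosed reply still yields its text.
print(strip_fence(FENCE + "markdown\nANNEXURE-B TO DIRECTORS' REPORT"))
```

Unlike `md`, this returns a single string rather than a list, which suits replies that contain one Markdown block.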

In [58]:
client = genai.Client(api_key=gemini_key)
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[promt, im], 
    config=config)

print(response.text)

```markdown
ANNEXURE-B TO DIRECTORS' REPORT - REPORT ON CORPORATE GOVERNANCE
Non-mandatory requirements
1. Non-Executive Chairman

The Company maintains the office of the Non-Executive
Chairman and reimburses expenses incurred in the
performance of his duties.

2. Remuneration Committee

The Company has constituted a Remuneration Committee;
full details are furnished under Item 4 of this Annexure.

3. Shareholder Rights

The statements of quarterly and half-yearly results are being
published in the Press. The Company has been mailing half-
yearly reports to shareholders since October 2001, along with
a letter from the Managing Director highlighting significant
events.

4. Postal Ballot

The Company has had no occasion to use the postal ballot
during the year.

5. Whistle Blower Policy

The Company does not have a Whistle Blower Policy, but has
an independent Ombudsman, who is not an employee of the
Company.

SEBI Guidelines on Corporate Governance

The Company is fully compliant with t

In [55]:
print(response.usage_metadata)

cached_content_token_count=None candidates_token_count=679 prompt_token_count=1602 total_token_count=2281
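
These counts can be compared with the `max_output_tokens=2_048` cap set in `config`. A minimal sketch, assuming a hypothetical helper `hit_token_cap` and assuming that a candidate count at or above the cap indicates a reply that was cut short:

```python
from types import SimpleNamespace

def hit_token_cap(usage, max_output_tokens: int = 2_048) -> bool:
    """Return True when the generated token count reached the configured cap."""
    # candidates_token_count may be None, so treat a missing value as 0.
    generated = getattr(usage, "candidates_token_count", None) or 0
    return generated >= max_output_tokens

# Stand-in for response.usage_metadata, using the counts printed above.
usage = SimpleNamespace(candidates_token_count=679,
                        prompt_token_count=1602,
                        total_token_count=2281)
print(hit_token_cap(usage))  # False: 679 is below the 2,048 cap
```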


In [56]:
response.json()

'{"candidates":[{"content":{"parts":[{"video_metadata":null,"thought":null,"code_execution_result":null,"executable_code":null,"file_data":null,"function_call":null,"function_response":null,"inline_data":null,"text":"```markdown\\nANNEXURE-B TO DIRECTORS\' REPORT - REPORT ON CORPORATE GOVERNANCE\\nNon-mandatory requirements\\n1. Non-Executive Chairman\\n\\nThe Company maintains the office of the Non-Executive\\nChairman and reimburses expenses incurred in the\\nperformance of his duties.\\n2. Remuneration Committee\\n\\nThe Company has constituted a Remuneration Committee;\\nfull details are furnished under Item 4 of this Annexure.\\n3. Shareholder Rights\\n\\nThe statements of quarterly and half-yearly results are being\\npublished in the Press. The Company has been mailing half-\\nyearly reports to shareholders since October 2001, along with\\na letter from the Managing Director highlighting significant\\nevents.\\n4. Postal Ballot\\n\\nThe Company has had no occasion to use the post

In [99]:
df.shape[0]

13720
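
With 13,720 images, a rough token budget helps when planning against quota. A back-of-the-envelope sketch, assuming every page costs about as much as the single sample call reported earlier (2,281 total tokens); this is a loud simplification, since page density varies:

```python
n_images = 13_720          # df.shape[0]
tokens_per_call = 2_281    # total_token_count from one sample call (assumed typical)

estimated_total = n_images * tokens_per_call
print(f"{estimated_total:,} tokens")  # 31,295,320 tokens
```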

In [198]:
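def get_md(fn:Path, dir_name:str="Index_md"):
    "Transcribe the image at `fn` with Gemini and save the reply as Markdown in a sibling `dir_name` folder."
    try:
        # Output folder sits next to the image folder, e.g. .../Index_md
        out_dir = fn.parent.parent/dir_name
        out_dir.mkdir(exist_ok=True)
        im = im_open(fn)
        fn_ = fn.name.split('.')[0]  # file name without its extension

        res = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[promt, im],
            config=config)

        md_fn = out_dir/f"{fn_}.md"
        # res.text can be empty or None, so fall back to an empty string
        txt = res.text if res.text else ""
        md_fn.write_text(txt)
        print(f"{md_fn=} {len(txt)=}")

    except Exception as e:
        # Report and continue so one bad page does not stop a long run
        print(f"error @ {fn=}  {e}")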
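
Since a long run can be interrupted, it may help to skip pages whose Markdown file already exists. A minimal sketch, assuming a hypothetical helper `pending` (not part of this notebook) and the same output layout that `get_md` writes to; it uses `Path.stem`, which agrees with `fn.name.split('.')[0]` for single-dot names such as `19.png`:

```python
from pathlib import Path

def pending(paths, dir_name: str = "Index_md"):
    """Yield image paths that do not yet have a non-empty Markdown file."""
    for fn in map(Path, paths):
        md_fn = fn.parent.parent / dir_name / f"{fn.stem}.md"
        # Treat empty files as missing so failed or empty replies are retried.
        if not (md_fn.exists() and md_fn.stat().st_size > 0):
            yield fn
```

A run could then iterate with `for fn in pending(df['fn']): get_md(fn)` instead of tracking a start index by hand.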



In [112]:
p = Path('/mnt/d/finfox/ASHOKLEY/Financial Year 2010/Index_md/19.md')
p.write_text()

4

In [191]:
p = Path('/mnt/d/finfox/ASHOKLEY/Financial Year 2013/Index_img/43.png')
get_md(p)

md_fn=Path('/mnt/d/finfox/ASHOKLEY/Financial Year 2013/Index_md/43.md') len(txt)=1731


In [175]:
191+113

304

In [202]:
from time import sleep
import random

# Process rows 2836 to 4999; the random pause spaces out requests
for i in range(2836, 5000):
    fn = Path(df.iloc[i]['fn'])
    get_md(fn)
    sleep(random.uniform(0, 5))

error @ fn=Path('/mnt/d/finfox/ASHOKLEY/Financial Year 2024/index_img/129.png')  429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource has been exhausted (e.g. check quota).', 'status': 'RESOURCE_EXHAUSTED'}}
error @ fn=Path('/mnt/d/finfox/ASHOKLEY/Financial Year 2024/index_img/13.png')  429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource has been exhausted (e.g. check quota).', 'status': 'RESOURCE_EXHAUSTED'}}
error @ fn=Path('/mnt/d/finfox/ASHOKLEY/Financial Year 2024/index_img/130.png')  429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource has been exhausted (e.g. check quota).', 'status': 'RESOURCE_EXHAUSTED'}}
error @ fn=Path('/mnt/d/finfox/ASHOKLEY/Financial Year 2024/index_img/131.png')  429 RESOURCE_EXHAUSTED. {'error': {'code': 429, 'message': 'Resource has been exhausted (e.g. check quota).', 'status': 'RESOURCE_EXHAUSTED'}}
error @ fn=Path('/mnt/d/finfox/ASHOKLEY/Financial Year 2024/index_img/132.png')  429 RESOURCE_EXHAUSTED. 

KeyboardInterrupt: 
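
The 429 `RESOURCE_EXHAUSTED` errors above suggest that a random pause of 0 to 5 seconds is not enough once quota runs out. A minimal sketch of retry with exponential backoff, assuming a hypothetical wrapper `with_backoff` around a callable that raises on failure, and assuming that finding the text `429` in the error message is good enough to identify rate limiting:

```python
import random
import time

def with_backoff(call, max_tries: int = 5, base: float = 2.0, sleep=time.sleep):
    """Call `call()` and retry on rate-limit errors with exponential backoff."""
    for attempt in range(max_tries):
        try:
            return call()
        except Exception as e:
            # Re-raise anything that is not a rate limit, and the final failure.
            if "429" not in str(e) or attempt == max_tries - 1:
                raise
            # Wait base**attempt seconds plus up to 1s of jitter: ~1s, 2s, 4s, ...
            sleep(base ** attempt + random.uniform(0, 1))
```

For this to apply to `get_md`, the function would need to let the exception propagate instead of printing it, because the wrapper only sees errors that are raised.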