In [34]:
import camelot

pdf_path = "content.pdf"

# flavor="lattice" detects tables from the ruling lines drawn between cells.
tables = camelot.read_pdf(pdf_path, pages="all", flavor="lattice")
print(f"Tổng cộng: {tables.n} bảng được tìm thấy.")

table_strings = []

for i, table in enumerate(tables):
    table_str = table.df.to_markdown(index=False)  # index=False drops the row index, which is easier to read
    table_strings.append(table_str)
    print(f"\n--- Bảng {i + 1} ---")
    print(table_str)



Tổng cộng: 10 bảng được tìm thấy.

--- Bảng 1 ---
| 0   | 1                                                 |
|:----|:--------------------------------------------------|
|     | Elected to the Hall of Fame on this ballot (named |
|     | in bold italics).                                 |
|     | Elected subsequently, as of 2025 (named in plain  |
|     | italics).                                         |
|     | Renominated for the 2019 BBWAA election by        |
|     | adequate performance on this ballot and has not   |
|     | subsequently been eliminated.                     |
|     | Eliminated from annual BBWAA consideration by     |
|     | poor performance or expiration on subsequent      |

--- Bảng 2 ---
| 0   | 1                                                 |
|:----|:--------------------------------------------------|
|     | ballots.                                          |
|     | Eliminated from annual BBWAA consideration by     |
|     | poor performance or expira
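
The output above shows fragment 2 continuing fragment 1 with the same two-column layout. Before handing the fragments to a model, a cheap structural check is to group them by column count, since fragments of different widths cannot belong to the same table. The following is a minimal sketch using only the standard library; `column_count` and `group_by_width` are hypothetical helpers, not part of camelot, and the sketch assumes non-empty pipe-delimited tables whose cells contain no escaped `|` characters.

```python
def column_count(markdown_table: str) -> int:
    """Return the number of cells in the first row of a pipe-delimited table."""
    first_row = markdown_table.strip().splitlines()[0]
    # Drop the outer pipes, then count the cells between the inner ones.
    return len(first_row.strip().strip("|").split("|"))


def group_by_width(fragments: list[str]) -> dict[int, list[int]]:
    """Map each column count to the 1-based numbers of the fragments that have it."""
    groups: dict[int, list[int]] = {}
    for number, fragment in enumerate(fragments, start=1):
        groups.setdefault(column_count(fragment), []).append(number)
    return groups


# Small hand-written fragments standing in for the real extraction output.
fragments = [
    "| 0   | 1        |\n|:----|:---------|\n|     | ballots. |",
    "| 0   | 1        |\n|:----|:---------|\n|     | poor     |",
    "| 0 | 1 | 2 |\n|:--|:--|:--|\n| a | b | c |",
]
print(group_by_width(fragments))  # {2: [1, 2], 3: [3]}
```

In the notebook, `group_by_width(table_strings)` would give the same grouping for the real fragments. Equal width is only a necessary condition for merging, so the result narrows the candidates rather than deciding the merge.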

In [35]:
import requests
import json

def generate_response(prompt):
    """
    Send a prompt to Ollama and return the complete response.

    Args:
        prompt (str): The question or request to send to the model.

    Returns:
        str: The complete response from the model.
    """
    full_response = ""
    context = f"""
    You are given a list of table fragments extracted from a PDF document. Some tables were split across multiple pages and therefore appear as separate fragments in the input.

    Your task is to:
    1. Carefully analyze all table fragments based on:
    - Column count and alignment
    - Content similarity and continuity
    - Semantic meaning and context
    - Position of headers and repeated patterns
    2. Identify which table fragments logically belong to the same original table, regardless of their order in the input.
    - For example: Table 1 may need to be merged with Table 6 even if they are not adjacent.
    3. Merge those fragments into complete tables while preserving:
    - Correct column alignment
    - Logical row ordering (based on data flow)
    - Data integrity — no duplication or omission of any rows
    4. Output only:
    - Any input table fragments that were not merged, exactly as provided
    - The final merged tables constructed **only from the provided input**
    5. Format all output in clean markdown tables.
    6. DO NOT add explanations, summaries, interpretations, or generate any new text outside of the tables themselves.
    7. DO NOT infer missing rows, extrapolate values, or fill in gaps using context or assumptions.
    8. If two tables do not logically belong together, leave them as separate tables.

    Important rules:
    - Only use data that appears in the input — do NOT generate new content.
    - Do NOT assume that Table 1 must merge with Table 2 — use structural and semantic analysis instead.
    - Be careful with empty rows, partial headers, or formatting issues.
    - Maintain exact values from the original fragments without modification.
    - Clearly separate each final merged table.
    ### ✅ Example Input

    --- Bảng A ---
    | Player       | Votes | Percent |
    |--------------|-------|---------|
    | Chipper Jones| 410   | 97.2%   |

    --- Bảng B ---
    | Player          | Votes | Percent |
    |------------------|-------|---------|
    | Vladimir Guerrero| 392   | 92.9%   |

    --- Bảng C ---
    | Candidate     | Category | Ref  |
    |---------------|----------|------|
    | Jack Morris   | Player   | [12] |

    ### ✅ Expected Output

    ### Merged Table: A + B
    | Player            | Votes | Percent |
    |-------------------|-------|---------|
    | Chipper Jones     | 410   | 97.2%   |
    | Vladimir Guerrero | 392   | 92.9%   |

    ### Original Table C
    | Candidate     | Category | Ref  |
    |---------------|----------|------|
    | Jack Morris   | Player   | [12] |

    This is just an example. The actual tables may be different.
    Only output the final merged tables and the other original table fragments.
    Here is the full list of table fragments from the input:
    {prompt}
    """
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": "qwen3:latest",
        "prompt": context,
        "stream": True
    }

    # Stream the reply; Ollama sends one JSON object per line.
    with requests.post(url, json=payload, stream=True) as response:
        response.raise_for_status()  # fail early on HTTP errors
        for line in response.iter_lines():
            if not line:
                continue
            try:
                data = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that are not valid JSON
            full_response += data.get("response", "")
            if data.get("done", False):
                break

    return full_response.strip()
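
The streaming loop in `generate_response` can be checked without a running Ollama server if the line handling is factored into a pure function. The sketch below is one way to do that; `collect_stream` is a hypothetical helper and the sample lines are hand-written to mimic the `response` and `done` fields the loop reads, not captured from a real run.

```python
import json


def collect_stream(lines):
    """Join the 'response' chunks from an iterable of JSON lines (bytes or str)."""
    chunks = []
    for line in lines:
        if not line:
            continue  # skip empty lines
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(data, dict):
            continue  # ignore JSON values that are not objects
        chunks.append(data.get("response", ""))
        if data.get("done", False):
            break  # stop at the final message
    return "".join(chunks).strip()


# Hand-written sample lines, including an empty line and a malformed one.
sample = [
    b'{"response": "| a ", "done": false}',
    b"",
    b"not json",
    b'{"response": "| b |", "done": true}',
    b'{"response": "ignored", "done": false}',
]
print(collect_stream(sample))  # | a | b |
```

With this split, `generate_response` would only need to pass `response.iter_lines()` to the helper, and the parsing logic can be tested on its own.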

In [36]:
# Concatenate every fragment under a numbered header to form the model input.
all_tables_combined = ""

for i, table in enumerate(tables):
    table_str = table.df.to_markdown(index=False)
    all_tables_combined += f"\n\n--- Bảng {i + 1} ---\n"
    all_tables_combined += table_str

In [37]:
print("\n--- Toàn bộ bảng đã gộp ---")
print(all_tables_combined)


--- Toàn bộ bảng đã gộp ---


--- Bảng 1 ---
| 0   | 1                                                 |
|:----|:--------------------------------------------------|
|     | Elected to the Hall of Fame on this ballot (named |
|     | in bold italics).                                 |
|     | Elected subsequently, as of 2025 (named in plain  |
|     | italics).                                         |
|     | Renominated for the 2019 BBWAA election by        |
|     | adequate performance on this ballot and has not   |
|     | subsequently been eliminated.                     |
|     | Eliminated from annual BBWAA consideration by     |
|     | poor performance or expiration on subsequent      |

--- Bảng 2 ---
| 0   | 1                                                 |
|:----|:--------------------------------------------------|
|     | ballots.                                          |
|     | Eliminated from annual BBWAA consideration by     |
|     | poor performance or expiration

In [38]:
if all_tables_combined.strip():
    prompt = all_tables_combined
    print(generate_response(prompt))

<think>
Okay, let's tackle this problem step by step. First, I need to understand what's being asked. The user provided several table fragments from a PDF, and my task is to merge them into complete tables if they logically belong together. The key points are to analyze column counts, alignment, content continuity, semantic meaning, and header positions. I need to make sure not to add any new data or assumptions, just work with what's given.

Looking at the tables provided, let's start by examining each one individually to see their structure and content. 

Starting with Bảng 1: It has two columns, labeled 0 and 1. The rows under column 0 seem to be empty, while column 1 has text about being elected to the Hall of Fame, eliminated, etc. This looks like a description or notes section, possibly explaining the status of candidates.

Bảng 2: Also has two columns, 0 and 1. The content here continues the text from Bảng 1, mentioning ballots and elimination reasons. The symbols like † and * m
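
The response above opens with a `<think>` block, so the printed text mixes the model's reasoning with the tables that were requested. One option is to strip that block before printing or saving the result. This is a minimal sketch, assuming the reasoning is always wrapped in literal `<think>` and `</think>` tags as seen above; `strip_think` is a hypothetical helper, not part of Ollama or `requests`.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks; drop an unclosed block through to the end."""
    without_closed = THINK_BLOCK.sub("", text)
    # If a block was opened but never closed (e.g. truncated output), cut it off.
    unclosed_at = without_closed.find("<think>")
    if unclosed_at != -1:
        without_closed = without_closed[:unclosed_at]
    return without_closed.strip()


# Hand-written stand-in for a model reply.
raw = "<think>\nCompare fragment 1 and 2...\n</think>\n\n| 0 | 1 |\n|:--|:--|\n|   | ballots. |"
print(strip_think(raw))
```

In the notebook this would wrap the existing call, for example `print(strip_think(generate_response(prompt)))`.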