# No.1 accuracy in multiform table extraction 
- Convert documents to maximize RAG performance 
- LangChain provides powerful tools for text splitting and vectorization


![Layout Analyzer](./figures/la.png)

In [1]:
! pip3 install -qU  markdownify  langchain-upstage  requests

In [2]:

%load_ext dotenv
%dotenv
# Expects UPSTAGE_API_KEY to be defined in the .env file

In [3]:
import warnings

warnings.filterwarnings("ignore")

![Layout Analyzer](./figures/solar_sample.png)

In [6]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("pdfs/OSHA.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()
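
The comment above mentions `lazy_load`. As a minimal sketch, assuming the loader exposes LangChain's standard `lazy_load()` iterator, the helper below pulls only the first few documents instead of materializing everything at once. `first_n_pages` is a hypothetical name used for illustration and is not part of `langchain_upstage`. How many documents the iterator yields depends on the loader's own settings.

```python
from itertools import islice
from typing import Any, List


def first_n_pages(loader: Any, n: int) -> List[Any]:
    """Return at most n documents from loader.lazy_load().

    islice stops consuming the iterator after n items, so documents
    beyond the first n are never requested from the loader.
    """
    return list(islice(loader.lazy_load(), n))
```

With the loader from this cell, `first_n_pages(layzer, 1)` would return a list holding only the first yielded document.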

In [7]:
from IPython.display import display, HTML

display(HTML(docs[0].page_content[:5000]))

In [4]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most correct answer based on the following context.
    Think step by step and look at the HTML tags and table values carefully to provide the most correct answer.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [5]:
chain.invoke({"question": "Explain Table 2?", "Context": docs})

'Table 2 is not present in the given context.'

In [8]:
chain.invoke({"question": "What is MMLU scores of SOLAR 10.7B?", "Context": docs})

'The MMLU scores of SOLAR 10.7B is 65.48.'

In [9]:
chain.invoke({"question": "What is MMLU scores of Mistral 7B-Instruct-v0.2?", "Context": docs})

'MMLU scores of Mistral 7B-Instruct-v0.2 is 60.78.'

# Exercise
Sometimes, even if we provide a table in Markdown or HTML format, the Large Language Model (LLM) may not extract the information correctly. How can you fix this issue?

Hint: Consider chain-of-thought (CoT) prompting, few-shot examples, or a divide-and-conquer strategy.
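
As one possible starting point for the divide-and-conquer idea, the sketch below splits the HTML into individual `<table>` blocks and asks the question against one table at a time, so the model never has to search the whole document. The names `split_tables`, `answer_per_table`, and the `ask` callable are hypothetical and introduced only for illustration. The regular expression assumes tables are not nested.

```python
import re
from typing import Callable, List

# Same fallback sentence the prompt above tells the model to use.
NOT_FOUND = "The information is not present in the context."


def split_tables(html: str) -> List[str]:
    """Return every <table>...</table> block as a separate chunk.

    Assumption: tables are not nested inside one another.
    """
    return re.findall(r"<table\b.*?</table>", html, flags=re.DOTALL | re.IGNORECASE)


def answer_per_table(
    question: str, html: str, ask: Callable[[str, str], str]
) -> List[str]:
    """Ask the question once per table and keep only the informative replies.

    `ask(question, context)` is any function that returns the model's answer
    as a string.
    """
    answers = []
    for table in split_tables(html):
        reply = ask(question, table).strip()
        if reply and NOT_FOUND not in reply:
            answers.append(reply)
    return answers
```

To try this with the chain defined earlier, `ask` could be `lambda q, c: chain.invoke({"question": q, "Context": c})` and `html` could be `docs[0].page_content`. Whether it actually improves accuracy on a given document is something to verify as part of the exercise.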
