<a href="https://colab.research.google.com/github/shaoyinguo-portfolio/CorpGenie-exp/blob/main/Text2RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Overview:

This notebook demonstrates ingesting a text document into a FAISS vector store and then retrieving information from it through a RAG agent built with `LangChain`.

## Further work:

1. Cite the source and the images referenced in the original text.
2. Design an integration mechanism for document access control.
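
As a starting point for item 1, the source text already tags passages with markers such as `[ImageName: 90.00, 1882.00]`. The sketch below pulls those identifiers out of a chunk so they could later be stored alongside it. The function name `extract_image_refs` is hypothetical, and how an identifier maps to a file under `key_frames` is not specified in this notebook, so that step is left out.

```python
import re

# Matches markers like "[ImageName: 1.00]" or "[ImageName: 90.00, 1882.00]".
IMAGE_MARKER = re.compile(r"\[ImageName:\s*([^\]]+)\]")


def extract_image_refs(text: str) -> list[str]:
    """Return every image identifier referenced in `text`, in order of appearance."""
    refs: list[str] = []
    for match in IMAGE_MARKER.finditer(text):
        # One marker may carry several comma-separated identifiers.
        refs.extend(part.strip() for part in match.group(1).split(","))
    return refs


sample = (
    "The KPIs include cycle time and cost [ImageName: 90.00, 1882.00]. "
    "The talk focuses on CoWoS [ImageName: 1.00]."
)
image_refs = extract_image_refs(sample)
print(image_refs)
```

One possible use is to write the result into each chunk's `metadata` (for example under an assumed key such as `image_refs`) before indexing, so the retrieval tool can return it with the chunk.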

In [1]:
!pip install -q langchain langchain-core langchain-community langchain-openai langchain-text-splitters faiss-cpu pypdf langchain-huggingface --upgrade --no-deps requests


In [2]:
import os
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_huggingface import HuggingFaceEmbeddings

from google.colab import drive, userdata
from pathlib import Path

from langchain.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser

In [3]:
try:
    drive.mount('/content/drive')
    data_path = Path('/content/drive/MyDrive/Colab Notebooks/data')
    print('Mounted Google Drive')
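except Exception:
    # Outside Colab (or if mounting fails), fall back to a local folder
    data_path = Path('./data')
    print('Using local data directory')

# Create the data folder, including missing parents, if it does not exist yet
data_path.mkdir(parents=True, exist_ok=True)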

TEXT_PATH = f'{data_path}/output_round2_10.txt'

KEYFRAME_PATH = f'{data_path}/key_frames'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Mounted Google Drive


In [4]:
os.environ["OPENAI_API_KEY"] = userdata.get('OPENROUTER_API_KEY')

In [5]:
loader = TextLoader(TEXT_PATH)
docs = loader.load()
docs

[Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/data/output_round2_10.txt'}, page_content='```latex\nThe presentation focuses on TSMC and Intel\'s packaging process technologies, specifically CoWoS, EMIB, Foveros, and chiplets [ImageName: 1.00]. TSMC is recognized as a leader in semiconductor manufacturing, with strengths in both high yield of advanced nanometer nodes and in packaging technology. Among TSMC, Samsung, and Intel, TSMC\'s packaging process, including CoWoS and 3D fabrics, stands out. The presentation will primarily focus on CoWoS.\n\nThe Key Performance Indicators (KPIs) for packaging technology, often referred to as P3C2, include performance, power, packaging profile, cycle time, and cost [ImageName: 90.00, 1882.00]. Performance encompasses bandwidth (BW), Fmax, and functionality ("Hi, Function"). Power considers efficiency and thermal properties (Tj). The packaging profile involves footprint and thickness. Cycle time refers to the speed to market, 

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
splits

[Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/data/output_round2_10.txt'}, page_content="```latex\nThe presentation focuses on TSMC and Intel's packaging process technologies, specifically CoWoS, EMIB, Foveros, and chiplets [ImageName: 1.00]. TSMC is recognized as a leader in semiconductor manufacturing, with strengths in both high yield of advanced nanometer nodes and in packaging technology. Among TSMC, Samsung, and Intel, TSMC's packaging process, including CoWoS and 3D fabrics, stands out. The presentation will primarily focus on CoWoS."),
 Document(metadata={'source': '/content/drive/MyDrive/Colab Notebooks/data/output_round2_10.txt'}, page_content='The Key Performance Indicators (KPIs) for packaging technology, often referred to as P3C2, include performance, power, packaging profile, cycle time, and cost [ImageName: 90.00, 1882.00]. Performance encompasses bandwidth (BW), Fmax, and functionality ("Hi, Function"). Power considers efficiency and thermal prop
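
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph boundaries, which is why the first chunk above stops at the end of the first paragraph instead of running to 1000 characters. The function `sliding_window_chunks` is a hypothetical helper for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

The overlap means a sentence that straddles a boundary appears in both neighbouring chunks, at the cost of indexing some text twice.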

In [7]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

embedding_dim = len(embeddings.embed_query("hello world"))
index = faiss.IndexFlatL2(embedding_dim)

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
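
`faiss.IndexFlatL2` performs an exact, brute-force search: every stored vector is compared with the query and ranked by squared Euclidean distance. The pure-Python sketch below shows that idea on toy 2-D vectors. It is a conceptual illustration with hypothetical helper names, not the optimized routine FAISS actually runs.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors: list[list[float]], query: list[float], k: int) -> list[tuple[float, int]]:
    """Return the k nearest stored vectors as (distance, index) pairs, closest first."""
    scored = [(squared_l2(vec, query), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # smaller distance means more similar
    return scored[:k]


stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
nearest = flat_l2_search(stored, query=[1.0, 1.0], k=2)
print(nearest)
```

Because every vector is scanned, search cost grows linearly with the number of chunks, which is fine for a single document like the one used here.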


In [8]:
vector_store.add_documents(splits)

['07b416b8-95c9-4f19-95f4-992c61ee6ea5',
 '045aac5d-3a6e-456f-9191-9f4c2fb8d294',
 '3409fe41-1e00-4fa5-beab-88862d1ac497',
 'df83710e-a4b6-4d1a-8397-358c6aa2a51b',
 'f6b38b49-dddd-468f-8be1-e80a40946434',
 '1ebd4575-8a69-43a5-902a-3e762792ec0f',
 'e1c1d52c-adb5-41f4-a29c-d02483cb76b6',
 '849a3461-e79e-4f08-ad34-2ad57d5d6b09',
 '2d67fd9c-76b8-4394-9ca6-10dbfce6b361',
 '18a3bc7b-7773-4791-91da-91e122dd1f57',
 'ab8e26a4-2ec4-48bb-a9b1-87fa3370745b',
 '0bfec751-4f79-4277-96dd-46747e800f84',
 'dcec3898-415a-49e1-b433-e037d5e81955',
 'de5ae844-9e90-4fa6-8bc3-49ca1d187e70',
 '244003fa-c194-48fe-8525-fae618d5fdec',
 '97020123-8e4c-4061-b67d-2e4beefb989c',
 'a148fe9e-53ee-4976-a0bb-4b9eacccdeb1',
 '51a58c19-93a6-41b7-a4a7-828aa85cdf1b',
 '4bdf9793-1158-4f93-8f6e-2fd9a4924027',
 '3a29a822-6bbc-40c9-b02b-c109b7bfc387',
 '1d3a1d7d-29d5-481e-8d75-47100cc0cd01',
 '9c09fba5-a6d6-41b0-88cb-401a65660a6b',
 'bddb467c-cbef-4f14-a8fd-cf46bfc93a52',
 'b8fb1029-92e0-4808-8ab5-6e2a1a3566bd',
 '320f9872-5068-

In [9]:
# Persist the FAISS index and its docstore to disk
vector_store.save_local(data_path)

In [10]:
# Reload the saved store; the docstore is pickled, so only load files you created or trust
loaded_vectorstore = FAISS.load_local(
    data_path,
    embeddings,
    allow_dangerous_deserialization=True
)

In [11]:
from langchain.tools import tool

@tool(response_format="content_and_artifact")
def retrieve_context(query: str):
    """Retrieve information to help answer a query."""
    retrieved_docs = vector_store.similarity_search(query, k=5)
    serialized = "\n\n".join(
        (f"Source: {doc.metadata}\nContent: {doc.page_content}")
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs
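
For the access-control item under "Further work", one option is to filter retrieved chunks by metadata before they are serialized for the model. The sketch below assumes each chunk carries a hypothetical `allowed_groups` list in its metadata, and uses a small `Doc` dataclass as a stand-in for a LangChain `Document` so the example runs on its own. Chunks without the key are rejected, so the filter fails closed.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_access(docs: list[Doc], user_groups: list[str]) -> list[Doc]:
    """Keep only documents whose `allowed_groups` overlap the user's groups."""
    groups = set(user_groups)
    return [
        doc for doc in docs
        if groups & set(doc.metadata.get("allowed_groups", []))  # missing key -> denied
    ]


retrieved = [
    Doc("CoWoS overview", {"allowed_groups": ["packaging", "public"]}),
    Doc("Yield figures", {"allowed_groups": ["finance"]}),
    Doc("Untitled note"),  # no access metadata at all
]
visible = filter_by_access(retrieved, user_groups=["public"])
print([doc.page_content for doc in visible])
```

Filtering after retrieval can return fewer than `k` chunks. The LangChain FAISS wrapper's `similarity_search` also accepts a `filter` argument, which may be a better place to apply the same rule.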

In [12]:
from langchain.agents import create_agent

model = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
    model="google/gemini-2.0-flash-001" # "google/gemini-2.0-flash-exp:free"
)

tools = [retrieve_context]
# If desired, specify custom instructions
prompt = (
    "You have access to a tool that retrieves context from a tech document. "
    "Use the tool to help answer user queries."
)
agent = create_agent(model, tools, system_prompt=prompt)

In [13]:
query = (
    "Why is TSMC CoWoS the most popular packaging technology, as opposed to Intel EMIB and Foveros?\n\n"
    "Once you get the answer, write an organized, detailed and logical paragraph."
)

for event in agent.stream(
    {"messages": [{"role": "user", "content": query}]},
    stream_mode="values",
):
    event["messages"][-1].pretty_print()


Why is TSMC CoWoS the most popular packaging technology, as opposed to Intel EMIB and Foveros?

Once you get the answer, write an organized, detailed and logical paragraph.
Tool Calls:
  retrieve_context (tool_0_retrieve_context_hHJ2TAwFnbq3LuLGzFHN)
 Call ID: tool_0_retrieve_context_hHJ2TAwFnbq3LuLGzFHN
  Args:
    query: popularity of TSMC CoWoS vs Intel EMIB and Foveros
Name: retrieve_context

Source: {'source': '/content/drive/MyDrive/Colab Notebooks/data/output_round2_10.txt'}
Content: The presenter states that the limitation of the interposer size for CoWoS is no longer valid with TSMC's advancements in reticle sizes. Additionally, despite CoWoS's challenges with stress, EMIB faces thermal problems due to the use of polymers. The presenter concludes that high yield and fast throughput are critical, and introducing different proposals to the factory to make everything work is a challenging task.

Intel is also working on other new technologies like Foveros [ImageName: 2588.00], whi

## Sample Answer

TSMC's CoWoS (Chip-on-Wafer-on-Substrate) packaging technology has gained popularity due to several factors. CoWoS benefits from dense, fine-pitch interconnects and coefficient of thermal expansion (CTE) matching, which reduces stress on the die's backend and lowers the likelihood of failures. Also, it makes the chip attachment easier due to better CTE matching and lower costs than TSV-based Si interposers. While CoWoS previously had limitations regarding interposer size, TSMC's advancements in reticle sizes have mitigated this concern. Intel's EMIB (Embedded Multi-die Interconnect Bridge) faces thermal challenges due to the use of polymers, although it benefits from dense fine pitch interconnects and localized high-density wiring. Intel's Foveros is a die-to-die stacking technology with direct copper bonding that eliminates the need for TSVs; it utilizes a silicon base die with active circuitry. TSMC's strong customer focus also plays a role, as their business model has historically been more geared towards meeting diverse customer needs compared to Samsung and Intel. High yield and fast throughput are critical in the packaging world.