<a href="https://colab.research.google.com/github/shaoyinguo-portfolio/CorpGenie-exp/blob/main/Meeting2TechDoc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates generating a technical document from key frames and transcripts, using multimodal models hosted on OpenRouter.

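Access tokens are handled by the Colab Secrets manager. Sign up for OpenRouter, create your own API key (named `CorpGenie` here), and store it as a Colab secret called `OPENROUTER_API_KEY`, which is the name the code below reads.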

## Key Steps:

1. Load key frames and transcript lines, then combine and sort them by timestamp.
2. Break the sorted events into chunks of bounded payload size.
3. Feed the chunks to the LLM one at a time, together with the previously generated text.

In [2]:
!pip install -U -q langchain
!pip install -U -q "langchain[openai]"

In [3]:
from google.colab import drive, userdata
from matplotlib import pyplot as plt
import numpy as np
from pathlib import Path
from time import time
from PIL import Image
import base64
import io
import glob
import os
import json



try:
    import gdown
except ImportError:
    # Install gdown only when it is missing from the runtime
    !pip install gdown
    import gdown

from langchain_openai import ChatOpenAI
from langchain.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser

In [4]:
drive.mount('/content/drive')
data_path = Path('/content/drive/MyDrive/Colab Notebooks/data')

TRANSCRIPT_PATH = f'{data_path}/transcripts.txt'
KEYFRAME_PATH = f'{data_path}/key_frames'

os.environ["OPENAI_API_KEY"] = userdata.get('OPENROUTER_API_KEY')

Mounted at /content/drive


In [5]:
# Parse transcripts text file by timestamps:

all_lines = []

with open(TRANSCRIPT_PATH, 'r') as f:
    for line in f:
        # Split only at the first ']' so brackets inside the text are kept
        splits = line.split(']', 1)
        if len(splits) < 2:
            continue  # skip lines without a timestamp prefix
        ts = float(splits[0].strip().replace('[', ''))
        text = splits[1].strip()
        all_lines.append((ts, text))

print(f'Found {len(all_lines)} lines')

Found 731 lines
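
The parsing cell above implies that each transcript line looks like `[<seconds>] <text>`. As a minimal sketch under that assumption, the same parsing can be written with a regular expression that also tolerates brackets inside the spoken text. The function name `parse_transcript_line` and the sample lines are illustrative only and are not taken from the real transcript.

```python
import re

# Assumed line format, inferred from the parsing cell: "[<seconds>] <text>"
LINE_RE = re.compile(r'^\s*\[\s*(\d+(?:\.\d+)?)\s*\]\s*(.*)$')

def parse_transcript_line(line):
    """Return (timestamp_seconds, text), or None if the line has no timestamp prefix."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).strip()

# Hypothetical sample lines, for illustration only
sample_lines = [
    "[12.50] Welcome everyone.",
    "[13.75] See slide [2] for details.",
    "no timestamp here",
]
parsed = [p for p in map(parse_transcript_line, sample_lines) if p is not None]
print(parsed)
```

Lines without a leading timestamp are skipped, which mirrors the `continue` branch in the cell above.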


In [23]:
# Load key frame file paths:

all_images = []
for p in glob.glob(f'{KEYFRAME_PATH}/*.jpg'):
    # File names are timestamps in seconds, e.g. "849.00.jpg"
    all_images.append((float(Path(p).stem), p))

print(f'Found {len(all_images)} images')
# all_images.sort(key=lambda x: x[0])

Found 104 images


In [24]:
# Combine and sort by timestamps

all_events = []
for ts, img in all_images:
    # Each image event stores the key frame's file path; it is Base64-encoded later
    all_events.append({'timestamp': ts, 'type': 'image', 'data': img})

for ts, text in all_lines:
    all_events.append({'timestamp': ts, 'type': 'transcript', 'data': text})

# Sort events chronologically
all_events.sort(key=lambda x: x['timestamp'])

In [25]:
def encode_image_to_base64(image_path: str, format: str = 'PNG') -> str:
    """
    Reads an image file using PIL, saves it into an in-memory buffer,
    and encodes the buffer contents into a Base64 string.

    Args:
        image_path: The file path to the image.
        format: The format to use for the buffer (PNG is recommended for slides).
                Must be a format PIL supports.

    Returns:
        A Base64 encoded string of the image data.
    """
    try:
        # 1. Read the image file using PIL
        with Image.open(image_path) as img:
            # Ensure image is in RGB format if it's grayscale or otherwise different
            if img.mode != 'RGB':
                img = img.convert('RGB')

            # 2. Save the image data to an in-memory buffer (BytesIO)
            # This avoids writing a temporary file
            buffered = io.BytesIO()
            img.save(buffered, format=format)

            # 3. Encode the bytes from the buffer to Base64
            img_bytes = buffered.getvalue()
            img_base64 = base64.b64encode(img_bytes)

            # 4. Convert bytes to a string for the API payload
            return img_base64.decode('utf-8')

    except FileNotFoundError:
        print(f"Error: The file was not found at {image_path}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

In [26]:
def check_content_blocks_size(content_blocks):
    json_payload_str = json.dumps(content_blocks)
    return len(json_payload_str.encode('utf-8'))

In [27]:
def yield_messages(events, max_size=1e6):
    content_blocks = []

    for item in events:
        time_str = f"{item['timestamp']:.2f}" # use seconds for correlation with image names

        if item['type'] == 'image':
            # only break at images for coherence
            if check_content_blocks_size(content_blocks) > max_size:
                yield content_blocks
                content_blocks = []
            base64_image = encode_image_to_base64(item['data'])
            content_blocks.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{base64_image}"
                }
            })
            content_blocks.append({
                "type": "text",
                "text": f"[Timestamp {time_str} s]: The image above is a slide [ImageName: {time_str}] related to the discussion that follows. Quote it if it adds significant value; skip it if it is not meaningful."
            })
        elif item['type'] == 'transcript':
            content_blocks.append({
                "type": "text",
                "text": f"[Timestamp {time_str}]: Transcript excerpt: '{item['data']}'"
            })
    yield content_blocks
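
To see how the "break only at images" rule behaves, here is a small self-contained sketch that applies the same idea to toy data. The names `payload_size` and `chunk_at_images` and the toy events are assumptions made for illustration; short strings stand in for the Base64 image payloads, and nothing is read from disk.

```python
import json

def payload_size(blocks):
    """Size in bytes of the JSON-serialized blocks (same idea as check_content_blocks_size)."""
    return len(json.dumps(blocks).encode('utf-8'))

def chunk_at_images(events, max_size):
    """Group events into chunks, starting a new chunk only when an image arrives
    and the current chunk already exceeds max_size bytes."""
    blocks = []
    for event in events:
        if event['type'] == 'image' and blocks and payload_size(blocks) > max_size:
            yield blocks
            blocks = []
        blocks.append({'type': event['type'], 'data': event['data']})
    if blocks:
        yield blocks

# Toy events: strings stand in for image data and transcript text
toy_events = [
    {'type': 'image', 'data': 'A' * 50},
    {'type': 'transcript', 'data': 'x' * 200},
    {'type': 'transcript', 'data': 'y' * 200},
    {'type': 'image', 'data': 'B' * 50},
    {'type': 'transcript', 'data': 'z' * 10},
]
chunks = list(chunk_at_images(toy_events, max_size=300))
print([len(c) for c in chunks])  # → [3, 2]
```

Because the size check runs only when an image arrives, a chunk can grow past `max_size` until the next key frame, so the limit acts as a soft threshold rather than a hard cap.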


In [34]:
model = ChatOpenAI(
    openai_api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
    model="google/gemini-2.0-flash-001" # "google/gemini-2.0-flash-exp:free"
)

chain = model | StrOutputParser()

system_message = SystemMessage(content="""
You are a rigorous technical writer assembling a technical document from streamed, timestamped chunks of slides/key frames and transcript captured from a meeting or presentation.
Follow these steps and rules:
1) First correct any miscaptured words in the transcripts by strictly referring to the slides, especially for terminology and acronyms.
2) Write an accurate, detailed, and comprehensive professional technical document, based on the information parsed from the meeting key frames and corrected transcripts, together with the text generated in earlier sections as the context (if provided).
3) Always cross-check all the information, especially terminology, acronyms, and numbers, between key frames, transcripts, and previous context.
4) Write the new content as a smooth continuation of the previous context (if any) but do not repeat or rewrite the previous context. Skip anything that was already mentioned in the previous context.
5) Focus on accurate metrics, decisions made, and key action items. Record conflicts explicitly if any.
6) Never invent metrics, owners, or dates. If unknown, write “TBD”. Do not guess.
7) Explicitly quote the name of the key frame in the format `[ImageName: XXX.XX]` when rewriting transcript passages that discuss that key frame, so that readers know which frame the ongoing discussion is about.
8) Do not quote timestamps.
9) Use proper LaTeX syntax for inline or block formulas.
""")

all_generations = ''
total_events = len(all_events)
chunk_seperations = []
processed_events = 0
for i, content_blocks in enumerate(yield_messages(all_events, max_size=5e6)):
    processed_events += len(content_blocks)
    ts = content_blocks[-1]['text'].split(']')[0] + ']'
    if i > 2:  # demo only: stop after the first three chunks
        break
    # Prepend the text generated so far so the model can continue from it
    context_block = {"type": "text", "text": f"---------Previous Context--------\n{all_generations}\n---------End of Previous Context--------\n"}
    # Send the context and the new chunk in a single message (previously the context message was overwritten)
    human_message = HumanMessage(content=[context_block] + content_blocks)
    all_generations += chain.invoke([system_message, human_message])
    print(f'\nProcessed Chunk #{i+1}, {processed_events} / {total_events} content blocks @ {ts}, Generated {len(all_generations)} Characters')
    chunk_seperations.append(len(all_generations))




Processed Chunk #1, 210 / 835 content blocks @ [Timestamp 847.34], Generated 4232 Characters

Processed Chunk #2, 282 / 835 content blocks @ [Timestamp 1178.70], Generated 5584 Characters

Processed Chunk #3, 306 / 835 content blocks @ [Timestamp 1248.28], Generated 7814 Characters


In [None]:
for i, chunk_seperation in enumerate(chunk_seperations):
    if i == 0:
        print(all_generations[0:chunk_seperation])
    else:
        print(all_generations[chunk_seperations[i-1]:chunk_seperation])
    print(f'\n\n--------- Chunk Seperation {i} ---------\n\n')

## Findings:

1. `google/gemini-2.0-flash-001` follows the instructions best among similarly priced models (e.g. `openai/gpt-4.1-nano`).
2. It helps to iteratively evaluate the output quality and refine the instructions.
3. Removing repeated key frames, which appear when the presenter goes back and forth between slides, makes the generated text more coherent.
4. TODOs:
    - Add a redundant key frame removal algorithm to the extraction process, based on `n_frames_to_last_repeat` and/or `time_to_last_repeat` criteria
    - Consider adding overlapping content between chunks to enhance continuity
    - Try out a final refining step over all the concatenated text
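
One possible reading of the first TODO is sketched below. It assumes each key frame has already been reduced to a comparable signature (for example a perceptual hash, which this notebook does not compute) and drops a frame when the same signature was last seen within a given number of frames or seconds. The function name, the parameter names, and the sample frames are hypothetical.

```python
def drop_repeated_frames(frames, max_frames_to_last_repeat=None, max_time_to_last_repeat=None):
    """Drop frames whose signature was seen recently.

    frames: list of (timestamp_seconds, signature), sorted by timestamp.
    A frame counts as a repeat when the previous occurrence of its signature is
    within max_frames_to_last_repeat frames OR max_time_to_last_repeat seconds.
    """
    last_seen = {}  # signature -> (index, timestamp) of the most recent occurrence
    kept = []
    for index, (ts, sig) in enumerate(frames):
        is_repeat = False
        if sig in last_seen:
            prev_index, prev_ts = last_seen[sig]
            within_frames = (max_frames_to_last_repeat is not None
                             and index - prev_index <= max_frames_to_last_repeat)
            within_time = (max_time_to_last_repeat is not None
                           and ts - prev_ts <= max_time_to_last_repeat)
            is_repeat = within_frames or within_time
        last_seen[sig] = (index, ts)
        if not is_repeat:
            kept.append((ts, sig))
    return kept

# Hypothetical frames: the presenter briefly jumps back to the intro slide
frames = [(1.0, 'intro'), (90.0, 'kpi'), (120.0, 'intro'), (209.0, 'fabric'), (900.0, 'kpi')]
kept = drop_repeated_frames(frames, max_frames_to_last_repeat=2)
print(kept)
```

With these inputs the second `intro` frame is dropped because it reappears two frames after its first occurrence, while the late `kpi` frame is kept because its previous occurrence is three frames back.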

## Sample Output:

The presentation focuses on packaging process technologies, particularly those of TSMC and Intel, with an emphasis on CoWoS (Chip-on-Wafer-on-Substrate), EMIB (Embedded Multi-die Interconnect Bridge), Foveros, and chiplets [ImageName: 1.00].  TSMC is recognized as a leader in semiconductor manufacturing, not only for its high yield in advanced nodes like 3nm and 2nm, but also for its packaging technology.  The discussion will primarily focus on CoWoS while acknowledging TSMC's broader 3D fabric capabilities.  Intel's EMIB and Foveros technologies will also be touched upon, along with the challenges and advantages of utilizing chiplets. Before diving into specifics, Key Performance Indicators (KPIs) for packaging technology, often referred to as "P3C2," needs to be clarified.

According to Douglas of TSMC, the P3C2 acronym represents Performance, Power, Package Profile, Cycle Time, and Cost [ImageName: 90.00]. Performance considers bandwidth (BW), maximum frequency (Fmax), and overall function. Power relates to efficiency and thermal properties (Tj).  Package profile includes footprint and thickness. Cycle time refers to the speed of bringing the product to market, while cost represents the financial burden on customers.  However, the speaker notes the absence of "yield and reliability" in the P3C2 definition, emphasizing their critical importance from an engineering perspective, despite them being often overlooked by customers. Heterogeneous system integration requires continuous pitch scaling, aiming to reduce dimensions from 36 microns to a few microns to pack more components into a single package.

TSMC's 3D Fabric technology includes both 3D silicon stacking and advanced packaging solutions [ImageName: 209.00].  Silicon stacking options include SoIC (System on Integrated Chips)-P (Bumped), TSMC-SoIC®, and SoIC-X (Bumpless).  Advanced packaging solutions encompass CoWoS® (including CoWoS-S with a silicon interposer and CoWoS-L/R with an RDL interposer), InFO-PoP, InFO-2.5D, and InFO-3D. The primary focus will be on CoWoS due to its use of a silicon interposer.

TSMC's CoWoS updates for 2023 are primarily aimed at High-Performance Computing (HPC) applications that require the integration of advanced logic and High Bandwidth Memory (HBM) [ImageName: 266.00].  TSMC supports over 140 CoWoS products for more than 25 customers. They are developing a 6x reticle-size (approximately 5,000 mm²) RDL interposer capable of accommodating 12 stacks of HBM memory. The AI server market is estimated to reach $1 trillion in the US alone by 2030. High-end customers like Nvidia, AMD, and Qualcomm utilize these advanced packaging solutions. The CoWoS product count is around 140, significantly less compared to Cisco products, which number in the thousands. The 6x reticle size RDL interposers have redistribution layers inside the silicon, which allows signal transmission within the silicon interposer. The previous generation in 2021 used 3x reticle size (~2,500 mm²), and the current generation uses 4x reticle size.

The term 'reticle size' refers to the maximum size of a die that a lithography tool can print [ImageName: 553.00]. A standard reticle field is 26mm x 33mm, which equals 858 mm². Most lithography tools, including ASML tools, use this size.  While the lens of the tool might be huge, the mask size used is 4x, where light with a pattern shines through it onto the wafer with a 4x reduction of the 26mm x 33mm size. This is a universal standard, which makes it very difficult to have die sizes larger than this. For high-end applications with extreme ultraviolet (EUV) lithography, printing is done in a specific manner, with the mask orientation at 26x33 at 4x and 8x in the Y-axis. This is because of the difficulty of printing larger dies with only this area. Therefore, to create a larger die, it is printed twice. Intel had the problem with printing twice, so made a larger mask.  The current mask size is 6 inches, and 26x33 for 4x can be amplified using a 6-inch mask.  However, for high-end EUV lithography where printing is done twice, a 9-inch mask is needed, which no one wants to commit to. TSMC is developing a 6x reticle size, which translates to 5,148 mm² (26x33 x 6).



--------- Chunk Seperation 0 ---------


The NVIDIA H100 is a large GPU with six memory chips [ImageName: 849.00]. The GPU, GH100, has 80 billion transistors. The die size is 814 $mm^2$.
The size of the die is limited by manufacturing constraints. NVIDIA uses a TSMC N4 process. The yield is affected by the large die size.

The TSMC yield is 80% for Apple A13, which has die size about 1 cm x 1 cm, which is 100 $mm^2$ [ImageName: 936.00]. The yield of TSMC for GH100 (die size > 8 $cm^2$) should drop to around 30-40% for such big GPUs and will stay low. The larger the die, the lower the yield due to Poisson statistics. The yield reduction is a function of area. The presenter generated a formula showing the yield reduction as a function of area:
$$y(A) = (1 + \frac{A * 0.223}{2})^{-2}$$
where $y(A)$ is the yield as a function of area $A$.

The presenter explained in a video that particle landing on a wafer is like a bomb landing in London in World War II, which follows a Poisson distribution [ImageName: 1001.00]. In the video, the presenter explained that R. Clark divided the London area by the number of grids and calculated probabilities of bombs dropping in each grid.

Contaminants and particles landing on a wafer are like German bombs on London from the eyes of statistics [ImageName: 1144.00]. The TSMC 5nm Nvidia yield is 80%. There are many different types of particles.



--------- Chunk Seperation 1 ---------


The discussion focuses on yield models and surface particles [ImageName: 1182.00], noting that killer-defect density value for Surface Prep is calculated based on Poisson's model for yield = 99%. Values changed from 2004 are small, due to die size changes, 97 (in 2004 ITRS) to 94.2 (in 2005 ITRS) for 2005 critical particle count per wafer. Currently, different models are used for Yield Enhancement and Starting Materials Surface Prep. Final particle density values are approximately the same for all models. YE and FE will have a workshop to determine the appropriate Starting Materials' defect levels and establish specific call outs for critical cleans (specifically pre-gate). Concern is ability to measure defects at critical particle diameter that contributes to yield.

It's explained that particles landing on a wafer is also Poisson [ImageName: 1185.00]. In the semiconductor process world, it assumes that "particles/defects arriving on the wafer" are like bombs landing in London, a random event with Poisson distribution as follows:

$P(k | \mu) = e^{-\mu} \frac{\mu^k}{k!}$

where a simple Yield model is developed assuming no defects before time t

$P(k = 0) = e^{-\mu} = e^{-DA}$

where D = defect density, A = critical area.

The discussion refers to a classic yield model modified by ITRS [ImageName: 1198.00] with the formula:

$Y = f(A, D)$

where D = defect density, which has a distribution with respect to the defect size and A = critical area (in which a defect has a high probability of resulting in a fault, which also has a distribution with respect to the defect size).

$\Upsilon = \int_0^{\infty} e^{-DA} f(D) dD$
$f(D) = [\Gamma(\alpha) \beta^{\alpha}]^{-1} D^{\alpha - 1} e^{-D/ \beta}$

Negative binomial:
$Y_r = (1 + \frac{A D}{\alpha})^{-\alpha}$

An example of TSMC Yield vs Die sizes [ImageName: 1211.00] is discussed. TSMC reported that their yield rate is 80% for Apple A13 (die size 0.44cm * 0.44cm). The question is what happens to yield rate if the die size changes? The answer is that they need to estimate $D_0$ first.

$Y(0) \sim e^{-DA}$
$0.8 = exp(-0.44cm * 0.44cm * D_0)$
$D_0 =  -ln(0.8) / 0.44^2 = 1.15 / cm^2$

ITRS suggests $\alpha = 2$
$\Upsilon(A) = (1 + \frac{A*1.15}{2})^{-2}$



--------- Chunk Seperation 2 ---------
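
The yield arithmetic quoted in the last chunk of the sample output can be checked numerically. The sketch below takes the figures as they appear in that chunk (80% yield for a 0.44 cm x 0.44 cm die, $\alpha = 2$), estimates $D_0$ from the Poisson model, and then evaluates the negative binomial model for a few die areas. The function names and the list of areas are illustrative assumptions, not values from the presentation.

```python
import math

def poisson_yield(area_cm2, defect_density):
    """Simple Poisson yield model: Y = exp(-D * A)."""
    return math.exp(-defect_density * area_cm2)

def negative_binomial_yield(area_cm2, defect_density, alpha=2.0):
    """Negative binomial yield model: Y = (1 + A * D / alpha) ** (-alpha)."""
    return (1.0 + area_cm2 * defect_density / alpha) ** (-alpha)

# Figures quoted in the sample output above
reference_yield = 0.8
reference_area_cm2 = 0.44 * 0.44

# Solve 0.8 = exp(-A * D0) for D0
d0 = -math.log(reference_yield) / reference_area_cm2
print(f"D0 = {d0:.2f} defects/cm^2")

# Hypothetical die areas (cm^2), chosen only to show the trend
for area in (0.5, 1.0, 2.0, 4.0):
    print(f"A = {area:.1f} cm^2 -> yield = {negative_binomial_yield(area, d0):.3f}")
```

The estimate depends entirely on the quoted reference die size, so a different reference area (the earlier chunk quotes about 1 cm x 1 cm) gives a different $D_0$ and a different yield curve.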


