# Chunking, Semantic Embedding, and FAISS Index Construction

### Notebook Overview

This notebook prepares the RAG (Retrieval-Augmented Generation) corpus by transforming a collection of scientific papers into a fast, scalable semantic retrieval system.  
It completes three key phases of the RAG pipeline:

---

### Chunking the Parsed Corpus

- Each scientific paper is **parsed** into clean text.
- Text is split into overlapping chunks of **up to 256 tokens** using a **token-based sliding window**.
- Consecutive chunks share a **50-token overlap** (the `stride` parameter in the code), so the window advances 206 tokens at a time and context is preserved across chunk boundaries.
- This approach ensures efficient context segmentation while handling large and irregular documents.

Final output: 8724 chunks across 75 papers.

---

### Semantic Embedding of Chunks

- Each text chunk is **encoded** into a **768-dimensional semantic vector** using the `BAAI/bge-base-en-v1.5` SentenceTransformer.
- Embeddings are **L2-normalized** so that inner product equals cosine similarity.
- Batching and GPU acceleration are used for efficient processing.

Outputs:
- `chunk_embeddings.npy` — Dense embedding matrix.
- `chunk_metadata.json` — Metadata linking each vector to its original text chunk.

---

### FAISS Semantic Index Construction

- A **FAISS `IndexFlatIP`** index is built to enable **exact semantic search** over the embeddings.
- Inner product (dot product) search is equivalent to **cosine similarity** because the embeddings are L2-normalized.
- All 8724 vectors are added and the index is saved to disk.

Outputs:
- `faiss_index.bin` — A production-ready FAISS vector database.

---

### Key Design Choices and Rationale

| Aspect | Decision | Reason |
|--------|----------|--------|
| Chunking Method | Token-based sliding window | Fast, linear-time, and robust to irregular scientific text |
| Embedding Model | `bge-base-en-v1.5` | Good tradeoff between retrieval quality and speed |
| Index Type | FAISS `IndexFlatIP` | Simple, exact search; inner product equals cosine similarity on normalized vectors |

---

### Outcome

At the end of this notebook, we have:
- A **token-chunked corpus** of overlapping windows
- A **dense semantic embedding space**
- A **ready-to-query FAISS semantic index**

This lays the foundation for building a **full Retrieval-Augmented Generation (RAG)** system—allowing large language models to answer questions with **grounded, document-based knowledge**.

## Step 1: Mounting Google Drive

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to the repo folder
%cd /content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer

# List repo contents
!ls

Mounted at /content/drive
/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer
data				LICENSE		 qa_pairs   wandb
deployment			models		 README.md
eval_predictions_baseline.json	notebooks	 results
gpt4o_judgments_baseline.json	project_plan.md  scripts


## Step 2: Installing Dependencies and Importing Libraries

In [None]:
!pip install -q pymupdf faiss-cpu sentence-transformers


In [None]:
# Standard library
import json
import os

# Third-party
import faiss
import fitz  # PyMuPDF; installed by the `pymupdf` package
import nltk
import numpy as np
import torch
from huggingface_hub import login
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

# NLTK sentence-tokenizer data (not needed by the token-based chunker below)
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:

login()


## Step 3: Verifying GPU and Environment

In [None]:
# Check for GPU availability and set device
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    device = torch.device("cuda")
    print(f"GPU detected: {device_name}")
else:
    device = torch.device("cpu")
    print("GPU not detected. Using CPU instead.")

print(f"Running on device: {device}")

GPU detected: Tesla T4
Running on device: cuda


## Step 4: Load Metadata and Verify PDF Paths

We begin by loading the `metadata.json` file created during corpus curation. For each paper listed, we:

- Construct its expected PDF path using the `arxiv_id`
- Check if the file exists in the `rag_corpus/` folder
- Log missing files and keep only verified entries

This ensures our chunking and embedding pipeline operates only on valid documents.

In [None]:
# Define paths
metadata_path = "./data/rag_corpus/metadata.json"
pdf_dir = "./data/rag_corpus/"

In [None]:
# Load metadata
with open(metadata_path, "r") as f:
    metadata = json.load(f)

print(f"[INFO] Loaded metadata for {len(metadata)} papers.")

[INFO] Loaded metadata for 75 papers.


For each entry in the metadata, we construct the expected PDF file path using its `arxiv_id` and check whether the corresponding file exists in the `rag_corpus/` directory.

- If the file exists:
  - We append its full path (`pdf_path`) to the paper's dictionary
  - We add it to the `verified_papers` list for downstream processing

- If the file is missing:
  - We log its `arxiv_id` in the `missing_pdfs` list

This step ensures that all subsequent parsing and chunking operations are performed **only on papers with valid, accessible PDF files**—minimizing runtime errors and maintaining pipeline robustness.

In [None]:
# Verify existence of each PDF
verified_papers = []
missing_pdfs = []

for paper in metadata:
    arxiv_id = paper.get("arxiv_id")
    pdf_path = os.path.join(pdf_dir, f"{arxiv_id}.pdf")

    if os.path.exists(pdf_path):
        paper["pdf_path"] = pdf_path  # attach for downstream use
        verified_papers.append(paper)
    else:
        missing_pdfs.append(arxiv_id)

In [None]:
# Summary
print(f"Verified PDFs: {len(verified_papers)}")
print(f"Missing PDFs: {len(missing_pdfs)}")

if missing_pdfs:
    print("\nMissing PDF arXiv IDs:")
    for mid in missing_pdfs:
        print(f" - {mid}")

Verified PDFs: 75
Missing PDFs: 0


## Step 5: Extract Raw Text from Verified PDFs

We now parse each PDF using `PyMuPDF` (`fitz`) to extract raw text. For each verified paper:

- We open the PDF
- Read and concatenate text from each page
- Normalize spacing and encoding issues
- Store the extracted text in a new `text` field

This text will be chunked in the next step before being embedded. The extraction keeps all page text, including running headers, footers, and page numbers; only whitespace and encoding are normalized, so some layout noise carries through into the chunks.

### PDF Text Extraction Function (`extract_text_from_pdf`)

This helper function takes the file path to a PDF and returns the cleaned, concatenated raw text.

Here’s what happens internally:

- **PDF Opening with `fitz` (PyMuPDF)**  
  The PDF is opened in memory using `fitz.open(pdf_path)`.  
  Each page is parsed using `page.get_text()` to extract its visible text content.

- **Page Concatenation**  
  All pages are joined into one long string using newline characters.

- **Encoding Normalization**  
  The raw text is encoded to UTF-8 with `errors="ignore"` and decoded again, which drops any code points that cannot be encoded (such as lone surrogates). Valid characters, including ligatures and math symbols, pass through unchanged.

- **Whitespace Cleanup**  
  We compress all extra whitespace (multiple spaces, newlines, tabs) into single spaces using:
  ```python
  " ".join(full_text.split())
  ```

In [None]:
def extract_text_from_pdf(pdf_path):
    """Extract and normalize text from a single PDF file."""
    try:
        with fitz.open(pdf_path) as doc:
            pages = [page.get_text() for page in doc]
    except Exception as e:
        print(f"[ERROR] Failed to read {pdf_path}: {e}")
        return ""

    # Join pages, drop un-encodable code points, and collapse whitespace
    full_text = "\n".join(pages)
    full_text = full_text.encode("utf-8", errors="ignore").decode("utf-8")
    full_text = " ".join(full_text.split())

    return full_text.strip()
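
To make the two normalization steps concrete, here is a small standalone sketch that applies them to a made-up string. The helper name `normalize_text` and the sample text are illustrative assumptions, not part of the notebook. The final NFKC step is an optional extra that the notebook does not perform; it is shown only to illustrate how ligatures could be folded if desired.

```python
import unicodedata

def normalize_text(raw: str) -> str:
    """Mirror the notebook's two cleanup steps on an arbitrary string."""
    # Drop code points that cannot be encoded as UTF-8 (e.g. lone surrogates).
    text = raw.encode("utf-8", errors="ignore").decode("utf-8")
    # Collapse every run of spaces, tabs, and newlines into one space.
    return " ".join(text.split())

# Hypothetical sample: mixed whitespace, an "fi" ligature, and a lone surrogate.
sample = "fine-\ntuning  the\tmodel\n\nwith \ufb01xed data \ud800"
cleaned = normalize_text(sample)
print(ascii(cleaned))  # the ligature code point \ufb01 is still present

# Optional extra step (not in the notebook): NFKC folds the ligature to "fi".
folded = unicodedata.normalize("NFKC", cleaned)
print(ascii(folded))
```

Note that the hyphen from the line-broken word stays in place: whitespace collapsing does not re-join hyphenated words.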

### Parsing and Filtering Extracted Text from All Verified PDFs

We now iterate through each verified paper and apply our `extract_text_from_pdf()` function to extract raw text from the corresponding PDF.

For each paper:
- We call the parser on `pdf_path`
- If the extracted text is sufficiently long (more than 500 characters), we:
  - Attach the cleaned text to the `text` field of the paper
  - Append the paper to the `parsed_papers` list

- If the extraction is too short or empty, we issue a warning and skip the paper

> This filtering step ensures that only papers with **substantial textual content** proceed to the chunking phase. It prevents noisy, corrupted, or image-heavy PDFs from contaminating the semantic index.

In [None]:
# Parse all verified papers
parsed_papers = []

for paper in verified_papers:
    pdf_path = paper["pdf_path"]
    extracted_text = extract_text_from_pdf(pdf_path)

    # Attach extracted text to paper
    if len(extracted_text) > 500:  # Filter out tiny extractions
        paper["text"] = extracted_text
        parsed_papers.append(paper)
    else:
        print(f"[WARN] Skipping {paper['arxiv_id']} — text too short or empty.")

print(f"Successfully parsed {len(parsed_papers)} papers.")

Successfully parsed 75 papers.


## Step 6: Chunking Parsed Text into Semantic Units

To enable fine-grained semantic retrieval, we break each paper’s full text into overlapping chunks of up to 256 tokens.

- We tokenize each document once with the generator model's tokenizer
- We slide a fixed-size window over the token sequence and decode each window back to text
- Consecutive chunks overlap by 50 tokens to preserve contextual continuity
- Each chunk is labeled with metadata including `arxiv_id`, `chunk_id`, and token boundaries

These chunks will later be embedded and stored in a FAISS index for efficient retrieval.

In [None]:
# Load the tokenizer used during LoRA fine-tuning
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

### Semantic Chunking via Token-Based Sliding Window

Due to the extreme length and irregular structure of scientific PDFs, we adopt a **token-based chunking strategy** to prepare the corpus for retrieval-augmented generation (RAG).

Instead of splitting by sentence (which is slow and unreliable for large documents), we:

- Tokenize the **entire document once** using the HuggingFace tokenizer (aligned with the fine-tuned model).
- Split the token sequence into **overlapping windows** of at most `max_tokens = 256` tokens.
- Overlap adjacent windows by **50 tokens** (`stride = 50`), so each window starts `max_tokens - stride = 206` tokens after the previous one, ensuring **context continuity** across chunks.
- Decode each token window back into human-readable text for downstream usage.

#### Key Parameters:
- `max_tokens = 256` : Maximum size of each chunk (aligned with RAG retrieval budgets)
- `stride = 50` : Number of tokens shared by consecutive chunks (the overlap, not the step size), preserving cross-sentence context
- `tokenizer` : HuggingFace tokenizer (`mistralai/Mistral-7B-Instruct-v0.3`) used for accurate token counting and decoding
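
As a quick check on how these parameters interact, the sketch below computes only the window boundaries, using the same loop structure as the chunking function in this notebook. The helper name `window_bounds` and its `overlap` argument are illustrative assumptions; `overlap` plays the role of the notebook's `stride`.

```python
def window_bounds(n_tokens: int, max_tokens: int = 256, overlap: int = 50):
    """Return (start, end) token offsets for each sliding window."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap  # 206 with the notebook's settings
    bounds = []
    start = 0
    while start < n_tokens:
        # The last window is clipped to the end of the document.
        bounds.append((start, min(start + max_tokens, n_tokens)))
        start += step
    return bounds

print(window_bounds(500))
# [(0, 256), (206, 462), (412, 500)]
```

With this loop, a document of exactly 256 tokens yields two windows, `(0, 256)` and `(206, 256)`, and every token of the second already appears in the first. Such short trailing chunks are harmless for retrieval but slightly inflate the chunk count.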

#### Core Advantages:
- **Scalability**: Handles massive documents efficiently in linear time `O(N)`.
- **Robustness**: Avoids issues caused by scientific notation, figures, equations, or long paragraphs.
- **Alignment**: Keeps each chunk small enough that several retrieved chunks fit comfortably in the generator's context window.
- **Simplicity**: Reduces engineering complexity while preserving high retrieval quality.

#### Output Structure:
Each chunk is stored as a dictionary with:
- `arxiv_id`: Unique ID of the source paper
- `chunk_id`: Chunk index within the paper
- `title`: Paper title
- `text`: Decoded chunk text
- `start_token` and `end_token`: Token positions within the original document

All chunks are appended into a global list: **`chunked_corpus`**  
The final corpus contains 8724 chunks generated from 75 papers.

---

This efficient chunking pipeline lays the foundation for **fast semantic embedding** and **high-quality retrieval**—critical components for building scalable, production-grade RAG systems.

In [None]:
# Parameters
max_tokens = 256
stride = 50

chunked_corpus = []

def chunk_text_by_tokens_fast(text, arxiv_id, title):
    """Chunk full text into fixed-size token windows."""
    # Tokenize the full document once
    tokens = tokenizer.tokenize(text)
    n_tokens = len(tokens)

    if n_tokens == 0:
        print(f"[WARN] Skipping empty or unparsable paper {arxiv_id}")
        return

    # Generate sliding windows
    start_idx = 0
    chunk_idx = 0

    while start_idx < n_tokens:
        end_idx = min(start_idx + max_tokens, n_tokens)
        token_slice = tokens[start_idx:end_idx]

        # Decode back into text
        input_ids = tokenizer.convert_tokens_to_ids(token_slice)
        chunk_text = tokenizer.decode(input_ids, skip_special_tokens=True)

        chunked_corpus.append({
            "arxiv_id": arxiv_id,
            "chunk_id": f"{arxiv_id}_{str(chunk_idx).zfill(2)}",
            "title": title,
            "text": chunk_text,
            "start_token": start_idx,
            "end_token": end_idx
        })

        start_idx += max_tokens - stride
        chunk_idx += 1

In [None]:
# Run chunking with logging
for idx, paper in enumerate(parsed_papers):
    chunk_text_by_tokens_fast(paper["text"], paper["arxiv_id"], paper["title"])

    if (idx + 1) % 5 == 0:
        print(f"[INFO] Processed {idx + 1}/{len(parsed_papers)} papers")

print(f"Finished chunking {len(parsed_papers)} papers → {len(chunked_corpus)} chunks total.")

[INFO] Processed 5/75 papers
[INFO] Processed 10/75 papers
[INFO] Processed 15/75 papers
[INFO] Processed 20/75 papers
[INFO] Processed 25/75 papers
[INFO] Processed 30/75 papers
[INFO] Processed 35/75 papers
[INFO] Processed 40/75 papers
[INFO] Processed 45/75 papers
[INFO] Processed 50/75 papers
[INFO] Processed 55/75 papers
[INFO] Processed 60/75 papers
[INFO] Processed 65/75 papers
[INFO] Processed 70/75 papers
[INFO] Processed 75/75 papers
Finished chunking 75 papers → 8724 chunks total.


In [None]:
chunked_corpus[5]

{'arxiv_id': '2410.05248v2',
 'chunk_id': '2410.05248v2_05',
 'title': 'SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe',
 'text': 'that data with varying confidence levels should play distinct roles in instruction tuning. Hence, we first derive an LLM’s confidence from its training dy- namics (Swayamdipta et al., 2020) and divide the SFT dataset into confident and unconfident subsets accordingly. We then linearly interpolate between these subsets and introduce a Mixup-based regu- 1 arXiv:2410.05248v2 [cs.CL] 16 Feb 2025 larization to support learning on these additional, interpolated examples. By propagating supervision signals across confidence regions (Bengio et al., 2009; Chapelle et al., 2009; Sohn et al., 2020) and encouraging linear behavior between them (Zhang et al., 2018; Verma et al., 2019), our recipe miti- gates overfitting in confident examples while en- hancing generalization in unconfident ones during LLM instruction tuning. We demonstrate the effe

## Step 7: Semantic Embedding of the Chunked Corpus

After splitting each scientific paper into chunks of up to 256 tokens, we embed these chunks into a dense semantic vector space using a **sentence embedding model**.

This allows for **fast semantic search** and **retrieval-augmented generation (RAG)** based on meaning rather than keyword overlap.

---

#### Embedding Process Details

- **Model**: `BAAI/bge-base-en-v1.5`
  - Chosen for its excellent balance between **retrieval performance**, **speed**, and **lightweight inference**.
  - Optimized for **semantic similarity search**.
- **Device**: GPU used if available (`cuda`), otherwise CPU fallback.
- **Batch Size**: 64 chunks per batch to balance memory efficiency and speed.
- **Normalization**: `normalize_embeddings=True` ensures all output vectors lie on the unit hypersphere (important for cosine similarity retrieval).

Each chunk of text is:
1. **Tokenized** and **encoded** into a 768-dimensional dense vector.
2. **Normalized** to unit length, so that inner product equals cosine similarity.
3. **Stored** in a `numpy` array and saved to disk for index construction.
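
The claim that inner product equals cosine similarity after normalization can be checked numerically. The sketch below uses two random 768-dimensional vectors as stand-ins for real embeddings (an assumption for illustration only) and compares the cosine of the raw vectors with the dot product of their unit-length versions.

```python
import numpy as np

# Random stand-ins for two chunk embeddings (not real model outputs).
rng = np.random.default_rng(0)
a = rng.normal(size=768)
b = rng.normal(size=768)

# Cosine similarity computed directly from the unnormalized vectors.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product after scaling each vector to unit L2 norm.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = float(a_unit @ b_unit)

print(np.isclose(cosine, inner))
```

This is why an inner-product index can be used for cosine retrieval as long as both the stored vectors and the query are normalized.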

---

#### Saved Outputs

| File | Description |
|------|-------------|
| `chunk_embeddings.npy` | Dense 768-dimensional embedding vectors for each chunk. Shape: `(8724, 768)` |
| `chunk_metadata.json` | Chunk metadata (`arxiv_id`, `chunk_id`, `title`, `text`, `start_token`, `end_token`) for retrieval reference. |

Location:

./data/rag_corpus/

In [None]:
# Load the embedding model
model_name = "BAAI/bge-base-en-v1.5"
embed_model = SentenceTransformer(model_name, device=device)

print(f"Loaded embedding model: {model_name}")

Loaded embedding model: BAAI/bge-base-en-v1.5


In [None]:
# Prepare corpus for embedding
texts_to_embed = [chunk["text"] for chunk in chunked_corpus]

print(f"Ready to embed {len(texts_to_embed)} chunks.")

Ready to embed 8724 chunks.


In [None]:
# Batch embedding
batch_size = 64  # Adjust if memory is tight
all_embeddings = []

for start_idx in range(0, len(texts_to_embed), batch_size):
    batch_texts = texts_to_embed[start_idx:start_idx + batch_size]
    batch_embeddings = embed_model.encode(
        batch_texts,
        batch_size=len(batch_texts),
        show_progress_bar=False,
        normalize_embeddings=True
    )
    all_embeddings.append(batch_embeddings)

In [None]:
# Concatenate all batches into a single array
all_embeddings = np.vstack(all_embeddings)

print(f"Embedding completed -- shape: {all_embeddings.shape}")

Embedding completed -- shape: (8724, 768)


In [None]:
# Save embeddings and metadata
embedding_dir = "./data/rag_corpus/"
os.makedirs(embedding_dir, exist_ok=True)

np.save(os.path.join(embedding_dir, "chunk_embeddings.npy"), all_embeddings)

with open(os.path.join(embedding_dir, "chunk_metadata.json"), "w") as f:
    json.dump(chunked_corpus, f, indent=2)

print(f"Saved embeddings and metadata → {embedding_dir}")

Saved embeddings and metadata → ./data/rag_corpus/
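
Because retrieval later relies on row `i` of the embedding matrix describing entry `i` of the metadata list, it can be worth reloading both files and confirming they still line up. The following is a minimal sketch of such a check; the function name `check_saved_artifacts` and the tolerance are assumptions, not part of the notebook.

```python
import json

import numpy as np

def check_saved_artifacts(embeddings_path: str, metadata_path: str) -> tuple:
    """Reload the saved files and confirm they line up row by row."""
    embeddings = np.load(embeddings_path)
    with open(metadata_path, "r") as f:
        metadata = json.load(f)

    # Row i of the matrix must describe metadata[i].
    if embeddings.shape[0] != len(metadata):
        raise ValueError(
            f"{embeddings.shape[0]} vectors but {len(metadata)} metadata entries"
        )

    # Vectors encoded with normalize_embeddings=True should have unit length.
    norms = np.linalg.norm(embeddings, axis=1)
    if not np.allclose(norms, 1.0, atol=1e-3):
        raise ValueError("embeddings are not L2-normalized")

    return embeddings.shape
```

In this notebook it would be called with the two paths under `./data/rag_corpus/`, and should return `(8724, 768)` if the files match the outputs shown above.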


## Step 8: Building the FAISS Semantic Index

After embedding all the corpus chunks into 768-dimensional dense vectors,  
we construct a **FAISS (Facebook AI Similarity Search)** index to enable **fast and efficient semantic retrieval**.

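`IndexFlatIP` performs an exact, exhaustive search: each query is compared against every stored vector, so query time grows linearly with corpus size (`O(N·d)` for `N` vectors of dimension `d`).  
For 8724 vectors this takes on the order of milliseconds thanks to FAISS's optimized routines, which is fast enough for real-time retrieval-augmented generation (RAG). Approximate indexes such as IVF or HNSW trade some recall for sub-linear query time and only become worthwhile for much larger corpora.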

---

#### FAISS Index Construction Details

- **Index Type**: `IndexFlatIP`
  - *Inner Product* (dot product) is used for measuring similarity between vectors.
  - **Important**: Since we normalized embeddings during encoding (`normalize_embeddings=True`),  
    **inner product effectively behaves as cosine similarity**.
- **Embedding Size**: 768 dimensions (from the `bge-base-en-v1.5` model).
- **Corpus Size**: 8724 semantic chunks.

Each chunk is thus represented as a point in a 768-dimensional semantic space, where:

- Vectors close together = semantically similar
- Vectors far apart = semantically dissimilar
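
For intuition about what a flat inner-product index computes, the sketch below performs the same kind of exhaustive scan in plain NumPy on a toy corpus. The function name `flat_ip_search` and the four 3-dimensional vectors are illustrative assumptions; FAISS implements this far more efficiently, but the result of an exact search is the same ranking.

```python
import numpy as np

def flat_ip_search(vectors: np.ndarray, query: np.ndarray, k: int):
    """Exhaustive inner-product search: score every row, keep the top k."""
    scores = vectors @ query        # one dot product per stored vector
    top = np.argsort(-scores)[:k]   # indices of the k highest scores
    return scores[top], top

# Toy corpus of four unit vectors in 3 dimensions (illustrative values only).
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
], dtype=np.float32)
query = np.array([0.8, 0.6, 0.0], dtype=np.float32)

scores, ids = flat_ip_search(corpus, query, k=2)
print(ids, scores)
```

Row 2 ranks first (score 0.96) because it points in nearly the same direction as the query, followed by row 0 (score 0.8).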

---

#### What Files Are Created

| File | Description |
|------|-------------|
| `faiss_index.bin` | Binary FAISS index storing all 8724 vectors as raw `float32` values (about 27 MB) for fast retrieval. |
| `chunk_embeddings.npy` (created earlier) | Original dense embedding matrix (if needed for retraining or future index rebuilds). |
| `chunk_metadata.json` (created earlier) | Human-readable metadata linking each vector back to its original paper, chunk ID, and text. |

---

#### Why FAISS and This Setup Were Chosen

| Reason | Explanation |
|--------|-------------|
| **Speed** | FAISS can search thousands of vectors in milliseconds. |
| **Accuracy** | Inner product preserves semantic similarity due to normalized vectors. |
| **Simplicity** | `IndexFlatIP` is simple, robust, and requires no training phase. |
| **Portability** | The `.bin` file can be easily loaded in future notebooks or deployments without recomputing embeddings. |
| **Recruiter Visibility** | Shows experience with real-world scalable information retrieval systems. |

---

#### Retrieval Workflow After This Step

Later, when a user submits a question:
1. The question will be **embedded** into a 768-dimensional vector.
2. The FAISS index will perform a **top-k semantic search**.
3. The most relevant chunks (context passages) will be retrieved.
4. These retrieved chunks will be fed into the fine-tuned LLM for grounded answer generation.

Thus, this FAISS index forms the **core retrieval engine** of the final RAG system.
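
The workflow above can be sketched as a single function. This is a hedged outline rather than the project's actual retrieval code: the name `retrieve` and its signature are assumptions. It expects an embedding model with a SentenceTransformer-style `encode` method, a FAISS-style index exposing `search(queries, k)`, and the list loaded from `chunk_metadata.json`.

```python
import numpy as np

def retrieve(question, embed_model, index, metadata, k=5):
    """Return the top-k chunks for a question, best match first."""
    # Encode the question the same way as the chunks: one row, unit length.
    query = embed_model.encode([question], normalize_embeddings=True)
    query = np.asarray(query, dtype=np.float32)

    # FAISS returns (scores, ids), each of shape (n_queries, k).
    scores, ids = index.search(query, k)

    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than k vectors exist
            continue
        chunk = metadata[int(idx)]
        results.append({"score": float(score), **chunk})
    return results
```

In practice `index` would come from `faiss.read_index(index_path)` and `embed_model` would be the same `BAAI/bge-base-en-v1.5` model used for the chunks. The BGE model card also describes an optional instruction prefix for short queries, which may be worth testing.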

In [None]:
# Load the embeddings
embedding_path = os.path.join(embedding_dir, "chunk_embeddings.npy")
embeddings = np.load(embedding_path)
print(f"Loaded embeddings → shape: {embeddings.shape}")

Loaded embeddings → shape: (8724, 768)


In [None]:
# Build FAISS index (cosine similarity via inner product on L2-normalized vectors)
dimension = embeddings.shape[1]  # 768
index = faiss.IndexFlatIP(dimension)  # Inner product = Cosine similarity if normalized
print(f"Initialized FAISS index with dimension {dimension}")

Initialized FAISS index with dimension 768


In [None]:
# Add embeddings to index
index.add(embeddings)
print(f"Added {index.ntotal} vectors to FAISS index.")

Added 8724 vectors to FAISS index.


In [None]:
# Save FAISS index to disk
index_path = os.path.join(embedding_dir, "faiss_index.bin")
faiss.write_index(index, index_path)
print(f"FAISS index saved → {index_path}")

FAISS index saved → ./data/rag_corpus/faiss_index.bin


## Step 9: Fixing Metadata


In [1]:
pip install nbformat --quiet

In [2]:
from google.colab import drive, files
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [3]:
import os
import nbformat

In [4]:
# Confirm path to your notebook
notebook_path = "/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks/13_chunk_and_embed.ipynb"

# Load the notebook
with open(notebook_path, "r") as f:
    nb = nbformat.read(f, as_version=4)

# Check and fix metadata
if "widgets" in nb.metadata:
    # If 'state' key is missing, add an empty 'state'
    if "state" not in nb.metadata["widgets"]:
        nb.metadata["widgets"]["state"] = {}

# Save the fixed notebook
with open(notebook_path, "w") as f:
    nbformat.write(nb, f)

print("Notebook metadata fixed and saved successfully!")

Notebook metadata fixed and saved successfully!


In [None]:
# Downloading the notebook

files.download('/content/drive/MyDrive/llm-finetuning-project/llm-finetuning-summarizer/notebooks/13_chunk_and_embed.ipynb')
