## Visualise each chunk and its size

### 1. Create the nodes (chunks) as usual

In [1]:
from src.file_reader import FileReader, process_md, LOGGER
from src.settings import MD_DIR_PATH, PROCESSED_DIR_PATH, CHUNK_SIZE
from pathlib import Path


# Only process markdown files in the directory
md_paths = sorted(Path(MD_DIR_PATH).glob("*.md"))
if md_paths:
    try:
        for filepath in md_paths:
            LOGGER.info(filepath)
            md = filepath.read_text(encoding="utf-8")
            processed_md = process_md(md)
            filename = filepath.name
            LOGGER.info(filename)
            # Save the processed file under the same name in the processed directory
            processed_md_path = Path(PROCESSED_DIR_PATH) / filename
            processed_md_path.write_text(processed_md, encoding="utf-8")

    except Exception:
        LOGGER.exception("Markdown document parsing failed")
        raise

    LOGGER.info("Parsing and saving MD files completed successfully")
else:
    LOGGER.info("No MD files found")

chunks = FileReader(input_dir=PROCESSED_DIR_PATH).load_data()

2025-02-21 12:01:50 - src - INFO - /Users/rajeevwarrier/Documents/md_folder/toyota.md
2025-02-21 12:01:50 - src - INFO - toyota.md
2025-02-21 12:01:50 - src - INFO - Parsing and saving MD files completed successfully
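
Before printing every chunk in full, a compact overview of chunk sizes can help spot outliers. The sketch below is not part of the original notebook: `size_overview` is a hypothetical helper, the token counts are made-up illustration values, and `limit` is assumed to be whatever size threshold you care about (for example `CHUNK_SIZE`).

```python
from typing import List, Sequence


def size_overview(token_lengths: Sequence[int], limit: int, width: int = 40) -> List[str]:
    """Build one text bar per chunk, scaled to the largest of the lengths and the limit."""
    scale = max(max(token_lengths, default=0), limit, 1)
    lines = []
    for index, length in enumerate(token_lengths):
        bar = "#" * round(width * length / scale)
        # Flag chunks whose token count exceeds the given limit
        flag = "" if length <= limit else "  <-- over limit"
        lines.append(f"chunk {index:>3} | {bar:<{width}} | {length}{flag}")
    return lines


# Hypothetical token counts, not taken from the notebook output
for line in size_overview([260, 120, 310], limit=256):
    print(line)
```

In the notebook, the token counts could be computed per chunk with the `tiktoken` tokenizer used in step 2, for instance as `len(tokenizer.encode(chunk.text))`.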


### 2. Print each chunk with its metadata and token count, and check whether it is within the size limit
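
The size check in the cell below can also be read as a small standalone function. This is only a sketch of the same comparison: `within_limit` and `whitespace_encode` are hypothetical names, and the whitespace tokenizer is a stand-in assumption so the example runs without `tiktoken`.

```python
from typing import Callable, List, Mapping


def within_limit(
    text: str,
    metadata: Mapping[str, object],
    chunk_size: int,
    encode: Callable[[str], List],
    tolerance: float = 1.4,
) -> bool:
    """Return True if the text's token count is below the tolerated budget."""
    metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    # Budget left for the text once the metadata tokens are subtracted
    budget = chunk_size - len(encode(metadata_str))
    return len(encode(text)) < budget * tolerance


def whitespace_encode(s: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace."""
    return s.split()


example_metadata = {"doc_name": "toyota", "doc_type": ".md"}
print(within_limit("word " * 10, example_metadata, chunk_size=20, encode=whitespace_encode))
```

With the real tokenizer, `encode` would be `tokenizer.encode` and `chunk_size` would be `CHUNK_SIZE`. The factor 1.4 mirrors the notebook: a chunk may exceed the remaining budget by up to 40% before it is reported as outside the limit.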

In [2]:
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")
prev_metadata = None
for chunk in chunks:
    metadata_str = "\n".join(
        f"{key}: {value}" for key, value in chunk.metadata.items()
    )
    token_length = len(tokenizer.encode(chunk.text))
    # Budget left after the metadata tokens, with a 1.4x tolerance factor
    token_budget = (CHUNK_SIZE - len(tokenizer.encode(metadata_str))) * 1.4
    inside_limit = token_length < token_budget

    # Print a double separator whenever the metadata changes between chunks
    if prev_metadata is not None and prev_metadata != chunk.metadata:
        print(100 * "=")
        print(100 * "=")
    print(chunk.metadata)
    print(f"Length of tokens: {token_length}")
    print(f"Is chunk inside token limit?: {inside_limit}")
    print(chunk.text)
    print(100 * "-")
    prev_metadata = chunk.metadata

{'doc_name': 'toyota', 'doc_type': '.md', 'chunk_size': 256, 'chunk_overlap': 1.4}
Length of tokens: 260
Is chunk inside token limit?: True
# toyota
# Model Name: Camry 2025

Tagline for Camry 2025: Greater Handling & Styling. Category for Camry 2025: Hybrid. Test Drive Availability for Camry 2025: Not Available. Vehicle Link for Camry 2025: /vehicles/camry

# toyota
# Model Name: Camry 2025

# Model Name: Camry 2025
## Technical Specifications of Model Trims
### 2.5L 4 Cylinder Hybrid Limited:
#### Dimensions & Weight

* Overall length (mm): 4920
* Overall width (mm): 1840
* Overall height (mm): 1445
* Wheelbase (mm): 2825
* Tread Front (mm): 1600
* Tread Rear (mm): 1600
* Ground clearance (mm): 145
* Curb weight (kg): 1640 - 1660
* Gross vehicle weight (kg): 2100

# toyota
# Model Name: Camry 2025

# Model Name: Camry 2025
## Technical Specifications of Model Trims
### 2.5L 4 Cylinder Hybrid Limited:
#### Chassis
-----------------------------------------------------------------------