# Getting Knowledge-Base

In [1]:
import fitz # it's pymupdf library
from tqdm.auto import tqdm


pdf_path = "/home/ai/TAC2-lbz/knowledge_base.pdf"

# this function cleans the text so it is easier for the LLM to process
def text_formatter(text:str) -> str:
    cleaned_text = text.replace("\n", " ").strip()
    return cleaned_text

# open the PDF and collect the text and basic statistics for each page
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4, # 1 token is about 4 characters; needed later to respect model input limits
                                "text": text })
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
# pages_and_texts[100]

128it [00:00, 213.45it/s]


In [2]:
# create dataFrames from our pages and texts
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,0,2143,372,8,535.75,– Advanced L2 functions – Stacking support – M...
1,1,653,116,1,163.25,www.eltex-co.ru 2 MES2124MB 220V PC 12V Batter...
2,2,2718,471,1,679.5,Interfaces functions – Head-of-line blocking (...
3,3,3002,546,1,750.5,ОАМ – IEEE 802.3ah Ethernet OAM – Dying Gasp ...
4,4,2374,352,3,593.5,Name Description Image MES1124M AC Ethernet sw...


In [3]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,128.0,128.0,128.0,128.0,128.0
mean,63.5,2153.27,364.02,3.11,538.32
std,37.09,830.73,145.39,2.5,207.68
min,0.0,484.0,65.0,1.0,121.0
25%,31.75,1565.5,260.0,1.0,391.38
50%,63.5,2075.5,355.0,2.0,518.88
75%,95.25,2687.0,455.75,3.25,671.75
max,127.0,4862.0,809.0,12.0,1215.5


# Data PreProcess

Token counts matter because both embedding models and LLMs accept only a limited number of tokens per input. To choose a suitable embedding model and LLM, we need to know how many tokens each page contains.
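A minimal sketch of how the character-based estimate can be checked against a model's input limit. The helper names `estimate_tokens` and `fits_model` are hypothetical, and the 384-token limit is an assumption taken from the model card of all-mpnet-base-v2 (the embedding model used later in this notebook). The 4-characters-per-token ratio is only a rule of thumb.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate: about 4 characters per token for English text."""
    return len(text) / chars_per_token


def fits_model(text: str, max_tokens: int = 384) -> bool:
    """Return True if the estimated token count is within the model limit.

    max_tokens=384 is an assumed limit, not something measured in this notebook.
    """
    return estimate_tokens(text) <= max_tokens


short_page = "x" * 1200   # stand-in for a short page
long_page = "x" * 4862    # same length as the longest page in the statistics above

print(estimate_tokens(short_page), fits_model(short_page))
print(estimate_tokens(long_page), fits_model(long_page))
```

Pages that fail this check would be truncated by the embedding model, which is one reason to split pages into smaller chunks.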

In [4]:
from spacy.lang.en import English

nlp = English()

nlp.add_pipe("sentencizer")

for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    item["page_sentence_count_spacy"] = len(item["sentences"])

100%|██████████| 128/128 [00:00<00:00, 646.45it/s]


In [5]:
import random 
random.sample(pages_and_texts, k=1)

[{'page_number': 22,
  'page_char_count': 1411,
  'page_word_count': 255,
  'page_sentence_count_raw': 3,
  'page_token_count': 352.75,
  'text': '2 MES2424B/ MES2424FB/ MES2448B/ MES2448E 220V PC 12V 2 Battery capacity,  Ah Battery life, h Battery charge  time, h MES2424B 12 ≈6 ≈9 17 ≈10 ≈13 20 ≈13 ≈15 MES2424FB 12 ≈5 ≈13 17 ≈7 ≈18 20 ≈10 ≈22 MES2448B MES2448E 12 ≈2,5 ≈13 17 ≈5 ≈18 20 ≈6,5 ≈22 *  Technical features of redundancy power supply 3 * Note: — Parameters are given for environment temperature +25 °C; — For MES2424B the use of a rechargeable battery with a capacity  of at least 12 Ah; — For MES2424FB, MES2448B, MES2448E the use of a  rechargeable battery with a capacity of at least 9 Ah is  recommended.  — MES2448E is under development. Technical features (continued) MES2424 AC MES2424 DC MES2424B MES2424FB   MES2448 DC MES2448B MES2448E L3 Multicast groups (IGMP  proxy) 512 512 512 512 2048 2048 2048 SQinQ rules 384 (ingress)/512 (egress) 768 (ingress)/1024 (egress) MAC ACL r

In [6]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,128.0,128.0,128.0,128.0,128.0,128.0
mean,63.5,2153.27,364.02,3.11,538.32,3.19
std,37.09,830.73,145.39,2.5,207.68,2.77
min,0.0,484.0,65.0,1.0,121.0,1.0
25%,31.75,1565.5,260.0,1.0,391.38,1.0
50%,63.5,2075.5,355.0,2.0,518.88,2.0
75%,95.25,2687.0,455.75,3.25,671.75,4.0
max,127.0,4862.0,809.0,12.0,1215.5,13.0


Let's split each page's sentences into smaller chunks, in groups of 10 sentences.

This is called text splitting, and libraries like **LangChain** can do it for us.

The goal is to make the text easier to filter and easier for the embedding model to handle.

In [7]:
# Defining the group size
num_sentences_chunk_size = 10

def split_list(input_list:list[str], slice_size:int=num_sentences_chunk_size) -> list[list[str]]:
    return [input_list[i: i + slice_size] for i in range(0, len(input_list), slice_size)]
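To see what `split_list` does, here is a small self-contained run on made-up sentences (the sample data is hypothetical; the function is repeated so the snippet runs on its own):

```python
def split_list(input_list: list[str], slice_size: int = 10) -> list[list[str]]:
    # Step through the list in strides of slice_size and take one slice per stride
    return [input_list[i: i + slice_size] for i in range(0, len(input_list), slice_size)]


# 25 dummy sentences: expect two full chunks and one shorter remainder chunk
sentences = [f"Sentence {n}." for n in range(1, 26)]
chunks = split_list(sentences)

print([len(chunk) for chunk in chunks])
print(chunks[-1])
```

The last chunk simply holds whatever is left over, so a page with fewer than 10 sentences ends up as a single chunk.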

In [8]:
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(item["sentences"])
    item["number_of_chunks"] = len(item["sentence_chunks"])

100%|██████████| 128/128 [00:00<00:00, 972592.23it/s]


In [9]:
# lets see what we are doing :D

random.sample(pages_and_texts, k=1)

[{'page_number': 34,
  'page_char_count': 3196,
  'page_word_count': 546,
  'page_sentence_count_raw': 2,
  'page_token_count': 799.0,
  'text': '3 www.eltex-co.com 3 IРv6 support — IPv6 Host — Dual-stack IPv4, IРv6 Security functions — DHCP Snooping — DHCP Option 82 — IP Source Guard — Dynamic ARP Inspection (Protection) — MAC-based authentication, Port Security, static MAC  addresses — IEEE 802.1x based authentication per ports — Guest VLAN — DoS attacks prevention — Traffic segmentation — DHCP clients filtering — BPDU attacks prevention — PPPoE Intermediate Agent — DHCPv6 Snooping — IPv6 Source Guard — IPv6 ND Inspection — IPv6 RA Guard Access control lists (ACL) — L2-L3-L4 ACL (Access Control List) — IPv6 ACL — ACL based on: — Switch port — IEEE 802.1p — VLAN ID — EtherType — DSCP — IP protocol type — TCP/UDP port number — User Defined Bytes Quality of Service (QoS) and rate limiting — Shaping, policing — Support for IEEE 802.1p Class of Service — Strict Priority/Weighted Round Rob

In [10]:
df= pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,number_of_chunks
count,128.0,128.0,128.0,128.0,128.0,128.0,128.0
mean,63.5,2153.27,364.02,3.11,538.32,3.19,1.02
std,37.09,830.73,145.39,2.5,207.68,2.77,0.12
min,0.0,484.0,65.0,1.0,121.0,1.0,1.0
25%,31.75,1565.5,260.0,1.0,391.38,1.0,1.0
50%,63.5,2075.5,355.0,2.0,518.88,2.0,1.0
75%,95.25,2687.0,455.75,3.25,671.75,4.0,1.0
max,127.0,4862.0,809.0,12.0,1215.5,13.0,2.0


Now we want each chunk to be its own dictionary item, rather than a list of sentence chunks stored per page.

In [11]:
import re # re is a python library and stands for regex. regex also stands for regular expression XD

pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        # join sentences together into a paragraph
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # '.A' -> '. A'
        
        chunk_dict["sentence_chunk"] = joined_sentence_chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token is about 4 characters
        
        pages_and_chunks.append(chunk_dict)
        
len(pages_and_chunks) # to see how many pages and chunks we have 

100%|██████████| 128/128 [00:00<00:00, 35507.34it/s]


130
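A small illustration of the join-and-repair step used above, on made-up sentences. Joining with an empty string glues sentences together whenever the sentence strings carry no trailing space, and the regex restores a space only before an uppercase letter:

```python
import re

# Hypothetical sentences, written without trailing whitespace
sentence_chunk = [
    "The switch supports stacking.",
    "Surge protection is included.",
    "see page 3.",
]

joined = "".join(sentence_chunk).replace("  ", " ").strip()
fixed = re.sub(r'\.([A-Z])', r'. \1', joined)  # '.S' -> '. S'

print(joined)
print(fixed)
```

Note that a sentence starting with a lowercase letter or a digit stays glued to the previous one. Joining with `" ".join(...)` would be one way around that, at the cost of changing the character counts reported here.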

In [12]:
random.sample(pages_and_chunks, 1)

[{'page_number': 37,
  'sentence_chunk': 'Name Description MES2408CP Ethernet switch MES2408CP, 8 ports of 10/100/1000BASE-T (PoE/PoE+), 2 Combo ports of 10/100/1000BASE-T/100BASE-FX/1000BASE-X, L2, 110–250 V AC MES2408P AC Ethernet switch MES2408P AC, 8 ports of 10/100/1000BASE-T (PoE/PoE+), 2 ports of 100BASE-FX/ 1000BASE-X, L2, V AC 176–250  MES2408P DC Ethernet switch MES2408P DC, 8 ports of 10/100/1000BASE-T (PoE/PoE+), 2 ports of 100BASE-FX/ 1000BASE-X, L2, 36–72 V DC MES2408PL Ethernet switch MES2408PL, 8 ports of 10/100/1000BASE-T (PoE/PoE+), 2 ports of 100BASE-FX/ 1000BASE-X, L2, 110–250 V AC  MES2428P AC Ethernet switch MES2428P AC, 24 ports of 10/100/1000BASE-T (PoE/PoE+), 4 Combo ports of 10/100/1000BASE-T/100BASE-FX/1000BASE-X, L2, 17 –264 V AC 6  MES2428P DC Ethernet switch MES2428P DC, 24 ports of 10/100/1000BASE-T (PoE/PoE+), 4 Combo ports of 10/100/1000BASE-T/100BASE-FX/1000BASE-X, L2, 36 72 V DC – MES2424P Ethernet switch MES2424P, 24 ports of 10/100/1000BASE-T (PoE/P

In [13]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,130.0,130.0,130.0,130.0
mean,63.3,2106.02,344.32,526.51
std,36.85,857.83,147.48,214.46
min,0.0,59.0,6.0,14.75
25%,32.25,1448.5,238.25,362.12
50%,62.5,2061.0,340.5,515.25
75%,94.75,2659.25,435.75,664.81
max,127.0,4847.0,794.0,1211.75


Let's filter out chunks with 30 tokens or fewer, because such short chunks carry little useful information.

In [14]:
min_token_length = 30
pages_and_chunks_over_min_token_length = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_length[:2]

[{'page_number': 0,
  'sentence_chunk': '– Advanced L2 functions – Stacking support – Multicast support (IGMP Snooping, MVR) – Advanced security functions (L2-L4 ACL, IP Source Guard, Dynamic ARP Inspection, etc.)1 – Uninterruptible power supply from battery – Surge protection The switches provide end users connection to the networks of large enterprises, small and medium-sized businesses and service provider networks using Fast and Gigabit Ethernet interfaces. The access switches support physical stacking, VLANs and multicast groups, as well as advanced security features. Surge protection MES switches are equipped with efficient protection technology against voltage surges (up to 6 kV) caused by lightning discharges. Data sheet Ethernet Access Switches MES www.eltex-co.ru 1 MES1124M MES1124MB MES1124M  rev. B MES2124M MES2124MB MES2124P MES2124F Common  parameters 10/100BASE-T (RJ-45) 24 24 24 – – – – 10/100/1000BASE-T (RJ-45) – – – 24 24 – – 10/100/1000BASE-T (RJ-45) PoE/PoE+ – – – –
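The same filter can be sketched without a DataFrame, which makes the boundary behaviour easy to see. The sample chunks below are invented for illustration:

```python
min_token_length = 30

# Hypothetical chunks with precomputed token estimates
chunks = [
    {"page_number": 0, "sentence_chunk": "Data sheet", "chunk_token_count": 2.5},
    {"page_number": 1, "sentence_chunk": "x" * 120, "chunk_token_count": 30.0},
    {"page_number": 2, "sentence_chunk": "x" * 400, "chunk_token_count": 100.0},
]

# Strictly greater than the threshold, matching the DataFrame filter above
kept = [chunk for chunk in chunks if chunk["chunk_token_count"] > min_token_length]

print([chunk["page_number"] for chunk in kept])
```

Because the comparison is strict, a chunk with exactly 30 estimated tokens is dropped as well.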

In [15]:
df = pd.DataFrame(pages_and_chunks_over_min_token_length)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,129.0,129.0,129.0,129.0
mean,63.34,2121.89,346.94,530.47
std,36.99,841.81,144.98,210.45
min,0.0,347.0,47.0,86.75
25%,32.0,1465.0,242.0,366.25
50%,63.0,2067.0,341.0,516.75
75%,95.0,2667.0,436.0,666.75
max,127.0,4847.0,794.0,1211.75


# Embedding our text chunks

For background on what embeddings are and why we use them, see: https://vickiboykis.com/what_are_embeddings/

all-mpnet-base-v2 model : https://huggingface.co/sentence-transformers/all-mpnet-base-v2

In [16]:
from sentence_transformers import SentenceTransformer

# this would be our embedding model:

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cpu")



In [17]:
embedding_model.to("cuda") # using the GPU for faster embedding

for item in tqdm(pages_and_chunks_over_min_token_length):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

100%|██████████| 129/129 [00:01<00:00, 84.21it/s] 


In [18]:
random.sample(pages_and_chunks_over_min_token_length, k=1) # see what we get

[{'page_number': 89,
  'sentence_chunk': '4 Use case Access switches PC N×1G N×1G/10G N×1G/10G OSPF, MSTP, ERPS ... GPON network To network core/higher-level equipment Thin client VoIP router RG Aggregation switches MES3300-48, MES3300-48F Gateway TAU-72. IP Physical parameters MES3300-48 MES3300-48F Power supply 100–240 V AC, 50–60 Hz 36–72 V DC Power supply options: џ 1 AC/DC power supply џ 2 hot-swappable AC/DC power supplies Input current 0.3–0.5 A for AC 0.5–1.0 A for DC 0.3–1.0 A for AC 1.0–2.2 A for DC Maximum power consumption 45 W 89 W Heat dissipation 45 W 89 W Dying Gasp support no Operating temperature from -10 to +45 °С Storage temperature from -50 to +70 °С Operating humidity no more than 80 % Cooling Front-to-Back, 4 fans Form factor 19”, 1U Dimensions (W × H × D) 440 × 44 × 330 mm 440 × 44 × 330 mm Weight 5.67 kg 5.68 kg Data sheet MES3300-48, MES3300-48F Ethernet Switches Integrated Networking Solutions www.eltex-co.com',
  'chunk_char_count': 908,
  'chunk_word_count'

In [19]:
import numpy as np
np.shape(pages_and_chunks_over_min_token_length[100]["embedding"])

(768,)

In [20]:
# saving embeddings to a file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_length)
embedding_df_path= "/home/ai/TAC2-lbz/text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embedding_df_path, index=False)

In [21]:
# read the CSV back in to check what was saved
text_chunks_and_embeddings_df_load = pd.read_csv(embedding_df_path)
text_chunks_and_embeddings_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,0,– Advanced L2 functions – Stacking support – M...,2118,347,529.5,[-4.12980244e-02 -4.51941825e-02 -8.23202636e-...
1,1,www.eltex-co.ru 2 MES2124MB 220V PC 12V Batter...,647,110,161.75,[ 5.26167592e-03 -8.32894593e-02 2.22160108e-...
2,2,Interfaces functions – Head-of-line blocking (...,2689,442,672.25,[ 1.27964774e-02 -6.11136481e-02 1.77732715e-...
3,3,ОАМ – IEEE 802.3ah Ethernet OAM – Dying Gasp –...,2983,527,745.75,[-1.39940754e-02 -2.82517243e-02 1.43435132e-...
4,4,Name Description Image MES1124M AC Ethernet sw...,2353,331,588.25,[-1.04786130e-02 -1.19528892e-02 -3.23442444e-...
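As the table shows, the `embedding` column comes back from the CSV as a string such as `[-4.12e-02 -4.51e-02 ...]`, not as an array. A minimal sketch of turning such a string back into numbers, using only the standard library (the helper name `parse_embedding` and the sample values are hypothetical):

```python
def parse_embedding(cell: str) -> list[float]:
    """Convert a string like '[ 0.1 -0.2 3e-02]' back into a list of floats."""
    # strip the surrounding brackets, then split on any whitespace (spaces or newlines)
    return [float(value) for value in cell.strip("[]").split()]


cell = "[ 1.5e-02 -2.0e-01\n  3.25e+00]"
vector = parse_embedding(cell)

print(vector)
```

Storing embeddings as text is convenient for a small knowledge base like this one; for larger collections a binary format or a vector database is usually a better fit.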


In [22]:
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [23]:
import torch
from sentence_transformers import util

def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, 
                                convert_to_tensor=True) 

    # Get dot product scores on embeddings
    
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    

    if print_time:
        # note: nothing is timed here; the message only reports how many embeddings were scored
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings.")

    scores, indices = torch.topk(input=dot_scores, 
                                k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
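                                embeddings: torch.Tensor,
                                # must be the same filtered list the embeddings were computed from, so indices line up
                                pages_and_chunks: list[dict]=pages_and_chunks_over_min_token_length,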
                                n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                embeddings=embeddings,
                                                n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through the zipped scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can look up the source PDF and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

# RAG: Retrieval-Augmented Generation

In [24]:
import numpy as np
import pandas as pd
import torch

device = "cuda" if torch.cuda.is_available() else "cpu" # use the GPU if one is available, otherwise fall back to the CPU

# importing text and embeddings
text_chunks_and_embeddings_df = pd.read_csv("/home/ai/TAC2-lbz/text_chunks_and_embeddings_df.csv")

# now converting embedding column to a np.array
text_chunks_and_embeddings_df["embedding"] = text_chunks_and_embeddings_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# converting embedding into a torch.tensor
embeddings = torch.tensor(np.stack(text_chunks_and_embeddings_df["embedding"].tolist(), axis=0), dtype=torch.float32).to(device=device)

# converting text and embeddings to the list of dictionaries
pages_and_chunks = text_chunks_and_embeddings_df.to_dict(orient="records")

text_chunks_and_embeddings_df # to see what i just created

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,0,– Advanced L2 functions – Stacking support – M...,2118,347,529.50,"[-0.0412980244, -0.0451941825, -0.00823202636,..."
1,1,www.eltex-co.ru 2 MES2124MB 220V PC 12V Batter...,647,110,161.75,"[0.00526167592, -0.0832894593, 0.0222160108, 0..."
2,2,Interfaces functions – Head-of-line blocking (...,2689,442,672.25,"[0.0127964774, -0.0611136481, 0.00177732715, -..."
3,3,ОАМ – IEEE 802.3ah Ethernet OAM – Dying Gasp –...,2983,527,745.75,"[-0.0139940754, -0.0282517243, 0.00143435132, ..."
4,4,Name Description Image MES1124M AC Ethernet sw...,2353,331,588.25,"[-0.010478613, -0.0119528892, -0.0323442444, 0..."
...,...,...,...,...,...,...
124,123,Data sheet 2 Interface features – Head-of-line...,2067,344,516.75,"[0.0256994795, -0.0870650187, 0.000192645501, ..."
125,124,Data sheet 3 Security functions – DHCP snoopin...,3458,539,864.50,"[0.00733969174, -0.0421830937, 0.00279009691, ..."
126,125,Data sheet 4 MIB/IETF – IEEE 802.3 10BASE-T – ...,4847,794,1211.75,"[-0.000385717838, -0.0952754468, 0.0141440779,..."
127,126,– RFC 854 Telnet – RFC 855 Telnet Option Speci...,3394,559,848.50,"[-0.00582209928, -0.0445040613, 0.0292956363, ..."


In [25]:
embeddings.shape # just seeing what i have created :)

torch.Size([129, 768])

In [26]:
# create model

from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device=device) # embedding our query with same model as we embedded our knowledge-base




In [27]:
# Defining the query
query = "What is the firmware version synchronized with Version 4.0 of the MES5448 and MES7048 operation manual?"
print(f"query: {query}")

# embed the query
query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

# similarity scores
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]

# Getting top-k results
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

query: What is the firmware version synchronized with Version 4.0 of the MES5448 and MES7048 operation manual?


torch.return_types.topk(
values=tensor([0.4207, 0.4173, 0.4084, 0.4052, 0.3941], device='cuda:0'),
indices=tensor([  5,  16,  21, 105, 103], device='cuda:0'))
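What `util.dot_score` followed by `torch.topk` computes can be sketched in plain Python on tiny made-up vectors. The helper names below are hypothetical. The sketch also normalizes the vectors, on the assumption that the embeddings are unit length, in which case the dot product equals cosine similarity:

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_k(query: list[float], embeddings: list[list[float]], k: int = 2):
    """Score every embedding against the query and return the k best (index, score) pairs."""
    scores = [dot(query, emb) for emb in embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(i, scores[i]) for i in ranked]


query = normalize([1.0, 0.0])
embeddings = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]

results = top_k(query, embeddings, k=2)
print(results)
```

The returned indices point back into the list of chunks, which is why the chunk list and the embedding matrix have to stay in the same order.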

# LLM 

In [29]:
!nvidia-smi # to check how much gpu memory is available for choosing the model

Sun Aug 18 10:00:50 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8              23W / 370W |   7369MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


The LLM I chose: Gemma-7b-it https://huggingface.co/google/gemma-7b-it
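As a rough check of whether a model fits in the free GPU memory reported by `nvidia-smi` above, the weight size can be estimated from the parameter count and the bytes per parameter. This is a back-of-the-envelope sketch: the parameter count is an assumed approximate figure, and activations and the KV cache need additional memory on top of the weights.

```python
total_mib = 24576   # total GPU memory from the nvidia-smi output above
used_mib = 7369     # memory already in use before loading the LLM
free_gib = (total_mib - used_mib) / 1024

n_params = 8.5e9    # ASSUMPTION: approximate parameter count of gemma-7b-it

bytes_per_param = {"float32": 4.0, "float16": 2.0, "4-bit": 0.5}

# Estimated size of the weights alone, in GiB, for each precision
estimates = {name: n_params * nbytes / 1024**3 for name, nbytes in bytes_per_param.items()}

for name, weights_gib in estimates.items():
    verdict = "fits" if weights_gib < free_gib else "does not fit"
    print(f"{name}: ~{weights_gib:.1f} GiB of weights, {verdict} in {free_gib:.1f} GiB free")
```

Under these assumptions full precision is out of reach on this card, which is the motivation for the 4-bit quantization config defined in the next cell.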

In [30]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available 

# quantization config 
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                        bnb_4bit_compute_dtype=torch.float16)


if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"
print(f"[INFO] Using attention implementation: {attn_implementation}")


model_id = "google/gemma-7b-it" 

#tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

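# model
# note: quantization_config and attn_implementation are defined above but not passed here,
# so the model loads unquantized on the CPU (GPU memory usage is unchanged in the nvidia-smi output below)
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id)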


[INFO] Using attention implementation: sdpa


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.17s/it]


In [31]:
!nvidia-smi

Sun Aug 18 10:01:05 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8              22W / 370W |   7369MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [32]:
llm_model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 3072, padding_idx=0)
    (layers): ModuleList(
      (0-27): 28 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (k_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (v_proj): Linear(in_features=3072, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=3072, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (up_proj): Linear(in_features=3072, out_features=24576, bias=False)
          (down_proj): Linear(in_features=24576, out_features=3072, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm((3072,), eps=1e-06)
        (post_attention_layernorm): GemmaRMSNorm((3072,), eps=1

In [33]:
input_text = "give me a description about MES5448 data-center switches"

# prompt

template = [
    {
    "role": "user",
    "content": input_text}
]

prompt= tokenizer.apply_chat_template(conversation=template, tokenize=False, add_generation_prompt=True)
print(f"prompt: \n{prompt}")

prompt: 
<bos><start_of_turn>user
give me a description about MES5448 data-center switches<end_of_turn>
<start_of_turn>model



In [34]:
input_ids = tokenizer(input_text, return_tensors="pt").to("cpu") # inputs stay on the CPU, where the model was loaded; moving them to the GPU failed

output = llm_model.generate(**input_ids, max_new_tokens=256)



In [35]:
text_output = tokenizer.decode(output[0])

text_output

'<bos>give me a description about MES5448 data-center switches\n\n**MES5448 Data-Center Switches**\n\nThe Cisco MES5448 is a family of high-performance, scalable, and secure data-center switches designed to meet the demanding requirements of modern data centers. With their industry-leading performance, capacity, and security features, the MES5448 switches are well-suited for a wide range of data-center applications, including:\n\n**Key Features:**\n\n* **Scalable and Flexible:** Supports up to 48 ports in a 1U chassis, with the ability to expand to multiple chassis for even greater capacity.\n* **High Performance:** Delivers industry-leading performance with low latency and high bandwidth.\n* **Secure:** Features advanced security features such as Cisco TrustAnchorTM security modules and Cisco IOS XE software.\n* **Energy-Efficient:** Consumes less power and heat than traditional data-center switches.\n* **Simple to Manage:** Offers simplified management through Cisco Prime Infrastruct

In [36]:
score, indices = retrieve_relevant_resources(query, embeddings)
score, indices

[INFO] Time taken to get scores on 129 embeddings.


(tensor([0.4207, 0.4173, 0.4084, 0.4052, 0.3941], device='cuda:0'),
 tensor([  5,  16,  21, 105, 103], device='cuda:0'))

In [42]:
def prompt_formatter(query: str, context_items: list[dict]) -> str:
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])
    base_prompt = """Based on the following context items, please answer the query.
    Give yourself room to think by extracting relevant passages from the context before answering the query.
    Don't return the thinking, only return the answer.
    Make sure your answers are as explanatory as possible.
    Use the following examples as reference for the ideal answer style.
    \nExample 1:
    Query: What port does the 5324 switch support?
    Answer: The device’s ports support operation at rates of 1 Gbps (SFP), 10 Gbps (SFP+) and 40 Gbps (QSFP) that provides flexible using and ability of smooth transition to higher data rates..
    \nExample 2:
    Query:How does the Switch 5324 support AC/DC fans and power supply
    Answer: "The redundant and hot-swappable fans and AC/DC power supplies together with advanced hardware hardware monitoring functions provide high network reliability and uninterrupted services."
    \nExample 3:
    Query:What is the quality of the service of this switch?
    Answer: "QoS statistics//IEEE 802.1p Class of Service (CoS)//Storm Control for different types of traffic (broadcast, multicast, unknown unicast)"
    \nNow use the following context items to answer the user query:
    {context}
    \nRelevant passages: <extract relevant passages from the context here>
    User query: {query}
    Answer:"""

    base_prompt = base_prompt.format(context=context, query=query)
    template = [
        {"role": "user",
        "content": base_prompt}
    ]
    prompt = tokenizer.apply_chat_template(conversation=template,
                                        tokenize=False,
                                        add_generation_prompt=True)
    return prompt
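The core of `prompt_formatter` is plain string handling, which can be tried without loading a tokenizer. The sketch below uses a shortened, hypothetical template and invented context items; only the bullet-list join and the `str.format` call mirror the function above:

```python
# Invented context items in the same shape as pages_and_chunks entries
context_items = [
    {"page_number": 5, "sentence_chunk": "Interface statistics are supported."},
    {"page_number": 16, "sentence_chunk": "Temperature monitoring is available."},
]
query = "What are the monitoring functions?"

# One bullet per retrieved chunk
context = "- " + "\n- ".join(item["sentence_chunk"] for item in context_items)

# Shortened stand-in for the full base prompt
base_prompt = "Context:\n{context}\nUser query: {query}\nAnswer:"
filled = base_prompt.format(context=context, query=query)

print(filled)
```

One thing to keep in mind with `str.format`: any literal `{` or `}` added to the template itself has to be doubled, while braces inside the retrieved chunks are inserted as-is.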


In [43]:
query = "What are the Monitoring Functions of 5324?"
print(f"Query: {query}")

# Get relevant resources
scores, indices = retrieve_relevant_resources(query=query,
                                            embeddings=embeddings)
    
# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                        context_items=context_items)
print(prompt)

Query: What are the Monitoring Functions of 5324?
[INFO] Time taken to get scores on 129 embeddings.
<bos><start_of_turn>user
Based on the following context items, please answer the query.
    Give yourself room to think by extracting relevant passages from the context before answering the query.
    Don't return the thinking, only return the answer.
    Make sure your answers are as explanatory as possible.
    Use the following examples as reference for the ideal answer style.
    
Example 1:
    Query: What port does the 5324 switch support?
    Answer: The device’s ports support operation at rates of 1 Gbps (SFP), 10 Gbps (SFP+) and 40 Gbps (QSFP) that provides flexible using and ability of smooth transition to higher data rates..
    
Example 2:
    Query:How does the Switch 5324 support AC/DC fans and power supply
    Answer: "The redundant and hot-swappable fans and AC/DC power supplies together with advanced hardware hardware monitoring functions provide high network reliabilit

In [44]:
input_ids = tokenizer(prompt, return_tensors="pt").to("cpu")

# Generate an output of tokens
outputs = llm_model.generate(**input_ids,
                            temperature=0.7, # lower temperature = more deterministic outputs, higher temperature = more creative outputs
                            do_sample=True, 
                            max_new_tokens=256) 

# Turn the output tokens into text
output_text = tokenizer.decode(outputs[0])

print(f"Query: {query}")
print(f"RAG answer:\n{output_text.replace(prompt, '')}")

Query: What are the Monitoring Functions of 5324?
RAG answer:
<bos>Sure, here are the extracted relevant passages from the context that answer the user query:

**Monitoring Functions of Switch 5324:**

"The Switch 5324 supports a wide range of monitoring functions, including:

- Interface statistics
- Remote monitoring RMON/SMON
- Task- and traffic type-based CPU utilization monitoring
- Temperature monitoring
- TCAM monitoring
- IPFIX Quality of Service (QoS) and rate limiting
- QoS statistics
- Shaping, Policing
- IEEE 802.1p Class of Service (CoS)
- Broadcast Storm Control
- Bandwidth management
- Strict Priority/Weighted Round Robin (WRR) scheduling algorithms
- Three marking colors
- ACL-based CoS/DSCP assignment
- ACL-based VLAN assignment
- Setting the IEEE 802.1p priority for management VLAN
- DSCP to CoS, CoS to DSCP remarking
- 802.1p DSCP mark assignment for IGMP"

Therefore, the answer to the user query is: The Switch 5324 supports a wide range of monitoring functions, incl
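The stray `<bos>` at the start of the RAG answer is likely a side effect of removing the prompt with `str.replace`. The sketch below reproduces the effect with plain strings. It assumes that the tokenizer prepends its own `<bos>` to a prompt that already starts with one; the prompt and answer text are invented:

```python
prompt = "<bos><start_of_turn>user\nQuestion<end_of_turn>\n<start_of_turn>model\n"
answer = "Some answer<eos>"

# ASSUMPTION: encoding adds a second <bos> in front of the templated prompt,
# so the decoded text starts with two <bos> markers
decoded = "<bos>" + prompt + answer

# Removing the prompt by string replacement leaves the extra marker behind
naive = decoded.replace(prompt, "")

# Stripping the special-token markers gives the bare answer
cleaned = naive.replace("<bos>", "").replace("<eos>", "").strip()

print(repr(naive))
print(repr(cleaned))
```

An alternative that avoids string matching altogether is to decode only the newly generated token ids, that is, the part of the output that comes after the input length.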
