# Telemetry Pipeline 
## Model Quantization Workflow for Running Inference on an Edge Switch
#### This notebook takes the distilled_pruned_gpt2_telemetry model and compresses it using 4-bit GPTQ quantization.

                    GNU AFFERO GENERAL PUBLIC LICENSE
                       Version 3, 19 November 2007

Copyright (C) 2024 Shaji R. Nathan  
IP Infusion Inc.  
Email: shaji.nathan@ipinfusion.com  

This program is free software: you can redistribute it and/or modify  
it under the terms of the GNU Affero General Public License as  
published by the Free Software Foundation, either version 3 of the  
License, or (at your option) any later version.  

This program is distributed in the hope that it will be useful,  
but WITHOUT ANY WARRANTY; without even the implied warranty of  
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the  
GNU Affero General Public License for more details.  

You should have received a copy of the GNU Affero General Public License  
along with this program. If not, see <https://www.gnu.org/licenses/>.  

As per AGPLv3, if you modify this software and make it available over a  
network, you must provide the source code of your modifications under the  
same license.  

For inquiries, please contact:  
Shaji R. Nathan  
IP Infusion Inc.  
Email: shaji.nathan@ipinfusion.com  


## Check Distilled and Pruned Model for corruption 

In [4]:
import torch, torchvision
from transformers import AutoModelForCausalLM, AutoTokenizer
torchvision.disable_beta_transforms_warning()

In [None]:

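# Report the environment first; the PyTorch version is useful even without CUDA
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")

try:
    # Load on CPU first so that file or weight problems surface separately from GPU problems
    teacher = AutoModelForCausalLM.from_pretrained("./distilled_pruned_gpt2_telemetry", device_map=None)
    print("✅ Model loaded successfully to CPU.")

    # Only attempt the GPU move when a CUDA device is actually present
    if torch.cuda.is_available():
        teacher = teacher.to("cuda")
        print("✅ Model moved to GPU successfully.")
except Exception as e:
    print(f"❌ Error during model load/move: {e}")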
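
Loading the model catches unreadable or structurally broken checkpoints, but it does not show whether the files changed after they were written. A complementary check, sketched below under that assumption, records a SHA-256 digest per file and compares it later (for example after copying the directory to the switch). The helper names `sha256_of_file`, `checksum_manifest`, and `changed_files` are hypothetical and not part of this notebook or of any library used here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(model_dir):
    """Map each file name directly under model_dir to its SHA-256 digest."""
    return {
        entry.name: sha256_of_file(entry)
        for entry in sorted(Path(model_dir).iterdir())
        if entry.is_file()
    }


def changed_files(expected, actual):
    """List file names whose digest is missing or differs from the expected manifest."""
    return sorted(name for name, digest in expected.items() if actual.get(name) != digest)
```

A possible use is to call `checksum_manifest("./distilled_pruned_gpt2_telemetry")` right after training, store the result, and pass it as `expected` to `changed_files` before quantizing. An empty list means no recorded file has changed.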

# Quantization of the model
### Load and Quantize GPT-2 Model in 4-bit

In [5]:
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

In [2]:
# Step 1: Load Tokenizer
print("🔹 Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load the base tokenizer

🔹 Loading tokenizer...


In [3]:
# GPT-2 has no padding token by default, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token


In [6]:
# Step 2: Define the 4-bit Quantization Configuration
print("🔹 Defining 4-bit quantization configuration...")
quantize_config = BaseQuantizeConfig(
    bits=4,  # Store each quantized weight in 4 bits
    group_size=128,  # Each group of 128 weights shares its quantization parameters
    desc_act=False  # Disable act-order (quantizing columns by decreasing activation size)
)


🔹 Defining 4-bit quantization configuration...
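
To make `bits` and `group_size` concrete, here is a minimal sketch of group-wise round-to-nearest quantization in plain Python. It is not the GPTQ algorithm that `auto_gptq` runs, which also compensates for the error each rounding step introduces. It only illustrates what it means for every group of weights to share one scale and one offset. The function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group; returns (codes, scale, offset)."""
    levels = (1 << bits) - 1                 # 15 distinct steps for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Each code is an integer in [0, levels]; the clamp guards against float rounding
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate float weights from integer codes."""
    return [offset + code * scale for code in codes]


def quantize_row(row, bits=4, group_size=128):
    """Split a row of weights into groups and quantize each group independently."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


# Toy row of 256 weights: two groups of 128, each with its own scale and offset
toy_row = [((i * 37) % 101) / 50.0 - 1.0 for i in range(256)]
toy_groups = quantize_row(toy_row, bits=4, group_size=128)
```

In this scheme a smaller `group_size` tracks local weight ranges more closely, at the cost of storing more scales and offsets.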


In [7]:
# Step 3: Load the Pruned and Distilled GPT-2 Model
print("🔹 Loading the model...")
model = AutoGPTQForCausalLM.from_pretrained(
    "./distilled_pruned_gpt2_telemetry",  # Path to the pruned and distilled model
    quantize_config=quantize_config  # Settings applied later by model.quantize()
)

🔹 Loading the model...


Some weights of the model checkpoint at ./distilled_pruned_gpt2_telemetry were not used when initializing GPT2LMHeadModel: ['lm_head.weight_mask']
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
# Step 4: Prepare Example Sentences (DO NOT Tokenize Here)
print("🔹 Preparing example sentences for quantization...")
example_sentences = [
    "Hello, how is the temperature?",
    "What is the laser bias ?",
    "Show the channels of the optical modules ",
    "What is the Rx and Tx temperatures ?"
]

🔹 Preparing example sentences for quantization...


In [11]:
# Step 5: Tokenize Sentences and Convert Format 
print("🔹 Tokenizing and formatting input for quantization...")

# Tokenize sentences and extract `input_ids` and `attention_mask`
tokenized_examples = tokenizer(
    example_sentences,
    padding=True,  # Ensure padding
    truncation=True,  # Truncate long inputs
    return_tensors="pt"  # Return PyTorch tensors
)

# Convert tokenized output into a list of dicts (Required Format)
examples_for_quantization = [
    {"input_ids": input_id, "attention_mask": attention_mask}
    for input_id, attention_mask in zip(tokenized_examples["input_ids"], tokenized_examples["attention_mask"])
]

🔹 Tokenizing and formatting input for quantization...
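
The conversion above turns one padded batch into one dict per sentence. The sketch below reproduces that reshaping with plain lists so the resulting structure is easy to inspect. The token ids are made up for illustration, right-side padding is assumed, and `pad_batch` and `to_example_list` are hypothetical helpers. The only value taken from this notebook is 50256, the GPT-2 EOS id that is reused for padding.

```python
PAD_ID = 50256  # GPT-2 EOS id, reused as the padding id in this notebook


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def to_example_list(batch):
    """Turn a batch (dict of lists) into a list of per-example dicts."""
    return [
        {"input_ids": ids, "attention_mask": mask}
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]


# Made-up token ids standing in for three tokenized sentences of different lengths
toy_sequences = [[11, 22, 33], [44, 55], [66]]
toy_examples = to_example_list(pad_batch(toy_sequences))
```

Each entry of `toy_examples` has the same length, and the attention mask marks which positions hold real tokens rather than padding.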


In [12]:
# Step 6: Quantize the Model Using Tokenized Examples
print("🔹 Quantizing the model...")
model.quantize(examples_for_quantization) 

🔹 Quantizing the model...


INFO - Start quantizing layer 1/6
INFO - Quantizing attn.c_attn in layer 1/6...
INFO - Quantizing attn.c_proj in layer 1/6...
INFO - Quantizing mlp.c_fc in layer 1/6...
INFO - Quantizing mlp.c_proj in layer 1/6...
INFO - Start quantizing layer 2/6
INFO - Quantizing attn.c_attn in layer 2/6...
INFO - Quantizing attn.c_proj in layer 2/6...
INFO - Quantizing mlp.c_fc in layer 2/6...
INFO - Quantizing mlp.c_proj in layer 2/6...
INFO - Start quantizing layer 3/6
INFO - Quantizing attn.c_attn in layer 3/6...
INFO - Quantizing attn.c_proj in layer 3/6...
INFO - Quantizing mlp.c_fc in layer 3/6...
INFO - Quantizing mlp.c_proj in layer 3/6...
INFO - Start quantizing layer 4/6
INFO - Quantizing attn.c_attn in layer 4/6...
INFO - Quantizing attn.c_proj in layer 4/6...
INFO - Quantizing mlp.c_fc in layer 4/6...
INFO - Quantizing mlp.c_proj in layer 4/6...
INFO - Start quantizing layer 5/6
INFO - Quantizing attn.c_attn in layer 5/6...
INFO - Quantizing attn.c_proj in layer 5/6...
INFO - Quantizing 

In [13]:
# Step 7: Move Model to GPU if Available, Otherwise Keep It on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"✅ Model successfully quantized to 4-bit and running on {device}!")

✅ Model successfully quantized to 4-bit and running on cuda!
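
"4-bit" describes the weight codes only. Each group also stores its quantization parameters, so the effective cost per weight is slightly higher. The sketch below works through that arithmetic under an assumed storage layout: one 16-bit scale and one zero point of `bits` bits per group. The layer shape 768 x 3072 is only an illustrative GPT-2-style size, not a value read from this model, and both function names are hypothetical.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=None):
    """Bits per weight once per-group scale and zero point are amortized over the group."""
    if zero_bits is None:
        zero_bits = bits  # assumption: zero points are packed at the same width as weights
    return bits + (scale_bits + zero_bits) / group_size


def quantized_layer_bytes(in_features, out_features, **config):
    """Approximate storage for one quantized linear layer, ignoring biases."""
    return in_features * out_features * effective_bits_per_weight(**config) / 8


# With the configuration used in this notebook: 4 + (16 + 4) / 128 bits per weight
bits_per_weight = effective_bits_per_weight(bits=4, group_size=128)

# Illustrative layer shape (assumed, not measured from the model)
layer_bytes = quantized_layer_bytes(768, 3072, bits=4, group_size=128)
```

The whole checkpoint shrinks by less than this per-layer figure suggests, because layers that are not quantized keep their original precision.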


In [14]:
# Step 8: Save the Quantized Model Locally
save_directory = "./quantized_distilled_pruned_gpt2_telemetry"
print(f"🔹 Saving the quantized model to {save_directory}...")
model.save_quantized(save_directory)  
tokenizer.save_pretrained(save_directory)  # Save tokenizer
print(f"✅ Quantized model and tokenizer saved successfully in {save_directory}!")

🔹 Saving the quantized model to ./quantized_distilled_pruned_gpt2_telemetry...
✅ Quantized model and tokenizer saved successfully in ./quantized_distilled_pruned_gpt2_telemetry!


In [15]:
# Step 9: Verify Saved Files
import os
print("\n📁 Saved files:")
print(os.listdir(save_directory))


📁 Saved files:
['config.json', 'gptq_model-4bit-128g.safetensors', 'merges.txt', 'quantize_config.json', 'special_tokens_map.json', 'tokenizer.json', 'tokenizer_config.json', 'vocab.json']
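
Listing the files confirms that the save succeeded, but not how much space quantization saved. A small sketch for comparing on-disk sizes follows. `directory_size_bytes` and `compression_ratio` are hypothetical helpers, and the comparison is only approximate because the quantized directory also contains the tokenizer files.

```python
import os


def directory_size_bytes(directory):
    """Sum the sizes of all files under a directory, including subdirectories."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compression_ratio(original_dir, quantized_dir):
    """Return original size divided by quantized size (larger means more compression)."""
    quantized = directory_size_bytes(quantized_dir)
    if quantized == 0:
        raise ValueError(f"No files found in {quantized_dir}")
    return directory_size_bytes(original_dir) / quantized
```

For this notebook the call would be `compression_ratio("./distilled_pruned_gpt2_telemetry", "./quantized_distilled_pruned_gpt2_telemetry")`.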


In [16]:
# Step 10: Reload the Saved Quantized Model
print("\n🔹 Reloading the quantized model for verification...")
reloaded_model = AutoGPTQForCausalLM.from_quantized(save_directory)
reloaded_model.to(device)
print("✅ Quantized model reloaded successfully!")

1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
INFO - The layer lm_head is not quantized.



🔹 Reloading the quantized model for verification...
✅ Quantized model reloaded successfully!


# Inference Testing
## Quantized LLM Application: AI Operations (AiOPS) Assistant for the NetAdmin

### AiOPS persona: The model acts as an AI-powered network operations assistant.
### NetAdmin user role: The user provides network-related queries (e.g., troubleshooting, diagnostics).
### Prompt Engineering: Uses a structured input format so the model understands the networking context.
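
A minimal sketch of such a structured input format, using the same template string that `aiops_assist` uses in this notebook. The helper name `build_aiops_prompt`, the whitespace normalization, and the empty-input check are additions for illustration.

```python
def build_aiops_prompt(netadmin_input):
    """Wrap a NetAdmin query in the structured AiOPS prompt template."""
    # Collapse runs of whitespace so stray spaces do not end up in the prompt
    request = " ".join(netadmin_input.split())
    if not request:
        raise ValueError("NetAdmin request is empty")
    return f"[AiOPS Assistant] NetAdmin request: {request}\n[AiOPS Response]:"


example_prompt = build_aiops_prompt("  Optical Module not working ")
```

Ending the prompt with the `[AiOPS Response]:` marker cues the model to continue with an answer rather than with more of the request.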

In [17]:
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM


In [19]:
# Load the tokenizer
save_directory = "./quantized_distilled_pruned_gpt2_telemetry"
print("🔹 Loading AiOPS Assistant for network troubleshooting...")
tokenizer = AutoTokenizer.from_pretrained(save_directory)

🔹 Loading AiOPS Assistant for network troubleshooting...


In [21]:
# Load the quantized model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoGPTQForCausalLM.from_quantized(save_directory)
model.to(device)
print(f"✅ AiOPS agent is online and running on {device}!")

1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
INFO - The layer lm_head is not quantized.


✅ AiOPS agent is online and running on cuda!


## AiOPS Network Assistant (Runs on the cell site backhaul router in the cell tower)

In [27]:

# **Function for AiOPS Network Assistant**
def aiops_assist(netadmin_input):
    """
    AiOPS Network Assistant: Analyzes network issues, suggests optimizations, and provides troubleshooting tips.

    Args:
    - netadmin_input (str): The network-related query or issue from the NetAdmin.

    Returns:
    - str: AiOPS response with troubleshooting steps or recommendations.
    """
    aiops_prompt = f"[AiOPS Assistant] NetAdmin request: {netadmin_input}\n[AiOPS Response]:"
    
    inputs = tokenizer(aiops_prompt, return_tensors="pt", padding=True, truncation=True).to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=len(netadmin_input) * 3,  # Cap total tokens (prompt + response) at 3x the query's character count
            do_sample=True,  # Enable diverse response generation
            temperature=0.7,  # Control randomness
            top_p=0.9,  # Nucleus sampling
            repetition_penalty=1.2,  # Reduce repetition
            eos_token_id=tokenizer.eos_token_id  # Stop generation when reaching EOS token
        )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# **Loop for Continuous Interaction**
print("\n💬 NetAdmin, type your network issue. Type 'exit' to quit.")

while True:
    netadmin_input = input("👉 Describe the problem (or type 'exit' to quit): ")
    
    # **Exit condition**
    if netadmin_input.lower() in ["exit", "quit", "q"]:
        print("\n🔻 Exiting AiOPS Assistant. Have a great day, NetAdmin! 🚀")
        break  # Stop the loop
    
    # Generate and Display AI-powered Troubleshooting Response
    aiops_response = aiops_assist(netadmin_input)
    print("\n🛠️ AiOPS Assistant Response:")
    print(aiops_response)
    print("\n" + "-" * 80)  # Separator for readability



💬 NetAdmin, type your network issue. Type 'exit' to quit.


👉 Describe the problem (or type 'exit' to quit):  Optical Module not working


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



🛠️ AiOPS Assistant Response:
[AiOPS Assistant] NetAdmin request: Optical Module not working
[AiOPS Response]: Anomalous - Channel 1 Input Power Over-Coupling Timestamp 0.21 dBm (SFP)
response: Normal, QPI+2 Output Voltage 2.19V
Response Time 3.62dBM (-1.74 dB m²), Max B

--------------------------------------------------------------------------------


👉 Describe the problem (or type 'exit' to quit):  exit



🔻 Exiting AiOPS Assistant. Have a great day, NetAdmin! 🚀
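
The response printed above begins with the prompt itself, because the full generated sequence (prompt plus continuation) is decoded. One way to show only the model's answer is to split the decoded text on the response marker. This is a sketch; `extract_response` is a hypothetical helper, not part of the notebook's code.

```python
RESPONSE_MARKER = "[AiOPS Response]:"


def extract_response(decoded_text, marker=RESPONSE_MARKER):
    """Return the text after the first response marker, or all of it if the marker is absent."""
    _prompt, found, reply = decoded_text.partition(marker)
    return reply.strip() if found else decoded_text.strip()


decoded = (
    "[AiOPS Assistant] NetAdmin request: Optical Module not working\n"
    "[AiOPS Response]: Anomalous - Channel 1 Input Power Over-Coupling"
)
answer_only = extract_response(decoded)
```

In the interactive loop, this would mean printing `extract_response(aiops_response)` instead of `aiops_response`.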
