# Module 5: Mini Project 1

## Building a Multi-LLM Inference System

> ### Instructions:

For this project, you will develop a system that answers questions using multiple LLMs. Users should be able to select a particular LLM when they want a response from that model only. Include chat history (memory) so that the LLM is aware of the previous conversation when responding to user requests. You can use open-source LLMs or the OpenAI API. Create a UI for this project, and include a button to clear the chat.

Steps to build the multi-LLM inference system:
- Step 1: Load the required packages and API keys (1 point)
- Step 2: Instantiate different LLMs (2 points)
  Instantiate different LLMs such as gpt-3.5-turbo, gpt-4o-mini, HuggingFaceH4/zephyr-7b-beta, or any other open-source LLMs.
- Step 3: Create a function to generate a response for a user request (2 points)
  Once a request comes in, generate a response using all LLMs and return all the responses. If the user opts to use a particular LLM, generate the response using only that LLM and return it.
- Step 4: Include conversation memory (3 points)
  Update the above function to include conversation memory, so that the LLM knows the previous conversation while responding to a user request.
  [Hint: How to add memory to chatbots]
- Step 5: Create a user interface (2 points)
  Create a user interface where users can enter their requests, select the LLM to use, and see the response once the request is submitted. Include a Clear button to clear the chat.
  [Hint: Gradio ChatInterface]
- Step 6: Include an audio component to receive the user request as audio (optional)
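
The routing described in Step 3 can be sketched without loading any real model. This is only an illustration under stated assumptions: `answer`, `ModelFn`, and `toy_models` are hypothetical names, and each "model" is a plain Python callable standing in for an actual LLM client.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a real LLM client: maps a prompt to a reply.
ModelFn = Callable[[str], str]


def answer(prompt: str,
           models: Dict[str, ModelFn],
           selected: Optional[str] = None) -> Dict[str, str]:
    """Return {model_name: reply} for the selected model, or for all models."""
    if selected is not None:
        if selected not in models:
            raise KeyError(f"Unknown model: {selected}")
        return {selected: models[selected](prompt)}
    # No model selected: fan the request out to every registered model.
    return {name: fn(prompt) for name, fn in models.items()}


# Toy models so the sketch runs without downloading anything.
toy_models: Dict[str, ModelFn] = {
    "echo": lambda p: f"echo: {p}",
    "upper": lambda p: p.upper(),
}

print(answer("hello", toy_models))                    # replies from all models
print(answer("hello", toy_models, selected="upper"))  # reply from one model
```

Returning a dictionary in both cases keeps the caller's code the same whether one model or all of them answered.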


In [1]:
pip install gradio langchain openai huggingface_hub SpeechRecognition

Collecting gradio
  Downloading gradio-5.4.0-py3-none-any.whl.metadata (16 kB)
Collecting langchain
  Downloading langchain-0.3.4-py3-none-any.whl.metadata (7.1 kB)
Collecting openai
  Downloading openai-1.52.2-py3-none-any.whl.metadata (24 kB)
Collecting SpeechRecognition
  Downloading SpeechRecognition-3.11.0-py2.py3-none-any.whl.metadata (28 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.3-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.4.2 (from gradio)
  Downloading gradio_client-1.4.2-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.26.1-py3-none-any.whl.metadata (13 kB)
Collecting markupsafe

In [2]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.3-py3-none-any.whl.metadata (2.8 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.6.0-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.23.0-py3-none-any.whl.metadata (7.6 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain_community)
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloa

In [3]:
import os
import struct
import time
import wave
from datetime import datetime

import gradio as gr
import numpy as np
import speech_recognition as sr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


In [14]:
'''
list_of_models = {
   "Google": "google-bert/bert-base-uncased",
   "Cohere": "CohereForAI/c4ai-command-r-v01",
   "Llama2" : "meta-llama/Llama-2-7b-hf",
   "Openai" : "openai-community/openai-gpt"
}
'''

In [4]:

list_of_models = {
    "GPT2": "gpt2",
    "DistilGPT2": "distilgpt2",
    "GPT2-Medium": "gpt2-medium",
    "BlenderBot": "facebook/blenderbot-400M-distill",
    "GPT-Neo": "EleutherAI/gpt-neo-125M",
    "OPT": "facebook/opt-1.3b",
    "BERT": "bert-base-uncased"
}


In [5]:
class InferenceSystem:
    def __init__(self):
        self.models = {}
        self.tokenizers = {}
        self.chat_histories = {model: [] for model in list_of_models.keys()}
        os.makedirs("model_cache", exist_ok=True)
        self.initialize_models()


    def initialize_models(self):
        for model_name, model_id in list_of_models.items():
            print(f"Loading {model_name}")
            try:
                self.tokenizers[model_name] = AutoTokenizer.from_pretrained(
                    model_id,
                    local_files_only=False
                )


                model_kwargs = {
                    "torch_dtype": torch.float32,
                    "low_cpu_mem_usage": True,
                    "offload_folder": "model_cache"
                }

                model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)

                device = "cuda" if torch.cuda.is_available() else "cpu"
                model.to(device)

                self.models[model_name] = pipeline(
                    "text-generation",
                    model=model,
                    tokenizer=self.tokenizers[model_name],
                    device=0 if torch.cuda.is_available() else -1
                )
                print(f"Successfully loaded {model_name}")
            except Exception as e:
                print(f"Error loading {model_name}: {str(e)}")
                continue


    def generate_response(self, user_input: str, selected_model: str) -> str:
        if selected_model not in self.models:
            return f"Error: Model {selected_model} is not available. Please select another model."

        try:
            tokenizer = self.tokenizers[selected_model]
            history = self.chat_histories[selected_model]

            # Build the prompt from the last three stored messages plus the new
            # request, ending with an "assistant:" cue for the model to complete.
            context = ""
            for msg in history[-3:]:
                context += f"{msg['role']}: {msg['content']}\n"
            input_text = f"{context}user: {user_input}\nassistant:"

            input_ids = tokenizer.encode(input_text, return_tensors='pt')
            input_length = input_ids.size(1)

            # Keep prompt plus reply within a 500-token budget.
            max_length = 500
            max_new_tokens = max_length - input_length

            if max_new_tokens <= 0:
                return "Error: Input is too long. Please shorten your message."

            # Store the user turn only once a reply will be attempted, so the
            # history always alternates between user and assistant messages.
            history.append({
                "role": "user",
                "content": user_input,
                "timestamp": datetime.now().strftime("%H:%M")
            })

            try:
                response = self.models[selected_model](
                    input_text,
                    max_new_tokens=max_new_tokens,
                    do_sample=True,  # temperature and top_p only apply when sampling
                    temperature=0.1,
                    top_p=0.95,
                    num_return_sequences=1,
                    return_full_text=False,  # return only the newly generated text
                    pad_token_id=tokenizer.eos_token_id
                )[0]['generated_text'].strip()

                # Cut the reply where the model starts inventing another turn.
                for marker in ("user:", "assistant:"):
                    idx = response.lower().find(marker)
                    if idx != -1:
                        response = response[:idx].strip()

            except Exception as e:
                response = f"Error generating response: {str(e)}"

            history.append({
                "role": "assistant",
                "content": response,
                "timestamp": datetime.now().strftime("%H:%M")
            })

            return response

        except Exception as e:
            return f"Error: {str(e)}"


    def transcribe_audio(self, audio):
        if audio is None:
            return "No audio input received"

        recognizer = sr.Recognizer()
        try:
            with sr.AudioFile(audio) as source:
                audio_data = recognizer.record(source)
                try:
                    text = recognizer.recognize_google(audio_data)
                    return text
                except sr.UnknownValueError:
                    return "Could not understand audio"
                except sr.RequestError:
                    return "Error in processing audio"
        except Exception as e:
            return f"Error processing audio file: {str(e)}"

    def clear_history(self, selected_model: str):
        if selected_model in self.chat_histories:
            self.chat_histories[selected_model] = []
        return []
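
The class above keeps one message list per model and feeds only the last three messages (`[-3:]`) back into the prompt. The sketch below illustrates what that window implies for memory. It is a simplified illustration, not the notebook's exact prompt format: `build_prompt` and its `window` parameter are hypothetical, and the trailing `assistant:` cue is an assumption.

```python
from typing import Dict, List


def build_prompt(history: List[Dict[str, str]], user_input: str, window: int = 3) -> str:
    """Join the last `window` stored messages and the new request into one prompt."""
    lines = [f"{m['role']}: {m['content']}" for m in history[-window:]]
    lines.append(f"user: {user_input}")
    lines.append("assistant:")  # cue for the model to continue as the assistant
    return "\n".join(lines)


history = [
    {"role": "user", "content": "My name is Sruthi"},
    {"role": "assistant", "content": "Nice to meet you."},
    {"role": "user", "content": "I like reading books"},
    {"role": "assistant", "content": "Books are great."},
]

short_prompt = build_prompt(history, "What is my name?")            # window of 3
long_prompt = build_prompt(history, "What is my name?", window=10)  # whole history

print(short_prompt)
print("Name still in short prompt:", "Sruthi" in short_prompt)
print("Name still in long prompt:", "Sruthi" in long_prompt)
```

With a window of three, the message that introduced the name has already fallen out of the prompt by the third user turn, so even a capable model could not recall it. A larger window trades that loss against a longer prompt and a smaller budget for new tokens.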

In [6]:
def create_ui(chat_system: InferenceSystem):
    with gr.Blocks(theme=gr.themes.Ocean()) as interface:
        gr.Markdown(
            """
            # 🤖 Multi-Model Chat Interface
            Chat with different language models! Use text or voice to interact.
            """
        )

        with gr.Row():
            with gr.Column(scale=4):
                chatbot = gr.Chatbot(
                    show_label=False,
                    container=True,
                    height=200,
                    bubble_full_width=False
                )

                with gr.Row():
                    with gr.Column(scale=8):
                        msg = gr.Textbox(
                            show_label=False,
                            placeholder="Type your message here: ",
                            container=False
                        )
                    with gr.Column(scale=1):
                        audio_input = gr.Audio(
                            type="filepath",
                            label="Voice Input"
                        )

                with gr.Row():
                    submit = gr.Button("Send", variant="primary")
                    clear = gr.Button("Clear Chat", variant="secondary")

            with gr.Column(scale=1):
                model_dropdown = gr.Dropdown(
                    choices=list(list_of_models.keys()),
                    value=list(list_of_models.keys())[0],
                    label="Select Model",
                    container=False
                )

                with gr.Accordion("Model Information", open=False):
                    gr.Markdown(
                        """
                        ### Available Models:
                        - **GPT2**: OpenAI's base model for text generation.
                        - **DistilGPT2**: A lighter version of GPT2 for faster performance.
                        - **GPT2-Medium**: A larger variant of GPT2 for improved text quality.
                        - **BlenderBot**: Facebook's chatbot designed for engaging conversations.
                        - **GPT-Neo**: An open-source alternative to GPT-3, suitable for various NLP tasks.
                        - **OPT**: Open Pre-trained Transformer model for language tasks.
                        - **BERT**: A foundational model for various NLP tasks.
                        """
                    )

        def user_input(user_message, history, selected_model):
            if not user_message:
                return "", history

            response = chat_system.generate_response(user_message, selected_model)
            history = chat_system.chat_histories[selected_model]

            formatted_history = [[msg['content'] for msg in history[i:i+2]]
                               for i in range(0, len(history), 2)]

            return "", formatted_history

        def handle_audio(audio, history, selected_model):
            if audio is None:
                return history

            text = chat_system.transcribe_audio(audio)
            if text.startswith("Could not") or text.startswith("Error"):
                return history + [[text, None]]

            response = chat_system.generate_response(text, selected_model)
            history = chat_system.chat_histories[selected_model]

            formatted_history = [[msg['content'] for msg in history[i:i+2]]
                               for i in range(0, len(history), 2)]

            return formatted_history

        submit.click(
            user_input,
            inputs=[msg, chatbot, model_dropdown],
            outputs=[msg, chatbot]
        )

        msg.submit(
            user_input,
            inputs=[msg, chatbot, model_dropdown],
            outputs=[msg, chatbot]
        )

        audio_input.stop_recording(
            handle_audio,
            inputs=[audio_input, chatbot, model_dropdown],
            outputs=[chatbot]
        )

        clear.click(
            lambda x: chat_system.clear_history(x),
            inputs=[model_dropdown],
            outputs=[chatbot]
        )

    return interface
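
Both handlers above turn the flat per-model history into `[user, assistant]` pairs before handing it to the chatbot component. A minimal standalone sketch of that conversion follows. `to_pairs` is a hypothetical helper name, and padding an unanswered user message with `None` is an assumption about how an odd-length history should be displayed.

```python
from typing import Dict, List, Optional


def to_pairs(history: List[Dict[str, str]]) -> List[List[Optional[str]]]:
    """Group a flat user/assistant message list into [user, assistant] pairs."""
    pairs: List[List[Optional[str]]] = []
    for i in range(0, len(history), 2):
        chunk: List[Optional[str]] = [m["content"] for m in history[i:i + 2]]
        if len(chunk) == 1:
            chunk.append(None)  # user message that has no reply yet
        pairs.append(chunk)
    return pairs


flat = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
    {"role": "user", "content": "Tell me a joke"},
]
print(to_pairs(flat))
```

Pulling the conversion into one function would let the text and audio handlers share it instead of repeating the list comprehension.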


In [None]:
chat_system = InferenceSystem()
demo = create_ui(chat_system)
demo.launch(share=True, debug=True, server_name="0.0.0.0")

Loading GPT2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Successfully loaded GPT2
Loading DistilGPT2



Successfully loaded DistilGPT2
Loading GPT2-Medium



Successfully loaded GPT2-Medium
Loading BlenderBot



Successfully loaded BlenderBot
Loading GPT-Neo



Successfully loaded GPT-Neo
Loading OPT



Successfully loaded OPT
Loading BERT



If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`


Successfully loaded BERT




Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://93bd7ac9c59b6e0283.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
class TestScenarios:
    def __init__(self):
        self.chat_system = InferenceSystem()

    def run_all_tests(self):
        """Run all test scenarios and print results"""
        print("Starting comprehensive test suite")

        self.test_basic_queries()
        self.test_complex_queries()
        self.test_edge_cases()
        self.test_conversation_memory()
        self.test_audio_functionality()
        self.test_system_robustness()

    def test_basic_queries(self):
        """Test simple, straightforward queries"""
        print("\n=== Testing Basic Queries ===")
        basic_queries = [
            "Hello, how are you doing?",
            "Who are you?",
            "What is your name?",
            "How is the weather?",
            "Tell me a joke"
        ]

        for model in list_of_models.keys():
            print(f"\nTesting {model}:")
            for query in basic_queries:
                print(f"\nQuery: {query}")
                response = self.chat_system.generate_response(query, model)
                print(f"Response: {response}")
                time.sleep(1)

    def test_complex_queries(self):
        """Test more complex and specialized queries"""
        print("\nTesting Complex Queries")
        complex_queries = [
            "Can you explain quantum computing in simple terms?",
            "Write a story on Large Language Models",
            "Explain transformers to a 5 year old",
            "How to lead a peaceful life",
            "Write a poem on the evolution of artificial intelligence"
        ]

        for model in list_of_models.keys():
            print(f"\nTesting {model}:")
            for query in complex_queries:
                print(f"\nQuery: {query}")
                response = self.chat_system.generate_response(query, model)
                print(f"Response: {response}")
                time.sleep(1)

    def test_edge_cases(self):
        """Test edge cases and potential error conditions"""
        print("\nTesting Edge Cases")
        edge_cases = [
            "",  # Empty string
            " ",  # Just whitespace
            "?",  # Single character
            "a" * 1000,  # Very long input
            "अ क्षेत्र",  # Non-English text
            "SELECT * FROM users;",  # Potential SQL injection
            "<script>alert('test');</script>",  # Potential XSS
            "✈️ 🌍 🌞",  # Emojis
            "\n\n\n",  # Multiple newlines
            "!@#$%^&*()"  # Special characters
        ]

        for model in list_of_models.keys():
            print(f"\nTesting {model}:")
            for query in edge_cases:
                print(f"\nQuery: {query}")
                try:
                    response = self.chat_system.generate_response(query, model)
                    print(f"Response: {response}")
                except Exception as e:
                    print(f"Error: {str(e)}")
                time.sleep(1)

    def test_conversation_memory(self):
        """Test conversation memory and context maintenance"""
        print("\nTesting Conversation Memory")
        conversation_flow = [
            "My name is Sruthi",
            "What is my name?",
            "What is your name?",
            "I like reading books",
            "I have a dog named Coco",
            "What do I like to do?",
            "I like collecting stamps",
            "What do I like to do?",
            "What is my pet's name?"
        ]

        for model in list_of_models.keys():
            print(f"\nTesting {model}:")
            self.chat_system.clear_history(model)
            for query in conversation_flow:
                print(f"\nQuery: {query}")
                response = self.chat_system.generate_response(query, model)
                print(f"Response: {response}")
                time.sleep(1)

    def create_test_audio_file(self, filename="/content/drive/MyDrive/1001_IEO_ANG_MD.wav"):
        """Create a test WAV file with silent audio (overwrites `filename` if it exists)"""
        sample_rate = 44100
        duration = 2  # seconds

        # All-zero samples, i.e. silence
        samples = np.zeros(int(sample_rate * duration))

        with wave.open(filename, 'w') as wav_file:
            wav_file.setnchannels(1)  # mono
            wav_file.setsampwidth(2)  # 16-bit samples
            wav_file.setframerate(sample_rate)

            # Scale to 16-bit little-endian PCM and write all frames in one call
            pcm = (samples * 32767).astype('<i2')
            wav_file.writeframes(pcm.tobytes())

        return filename

    def test_audio_functionality(self):
        """Test audio input functionality"""
        print("\nTesting Audio Functionality")

        test_audio_file = self.create_test_audio_file()

        try:
            transcribed_text = self.chat_system.transcribe_audio(test_audio_file)
            print(f"Transcribed text: {transcribed_text}")


            for model in list_of_models.keys():
                print(f"\nTesting {model} with audio input:")
                if transcribed_text and not transcribed_text.startswith(("Could not", "Error", "No audio")):
                    response = self.chat_system.generate_response(transcribed_text, model)
                    print(f"Response: {response}")
                time.sleep(1)

        except Exception as e:
            print(f"Audio test error: {str(e)}")

    def test_system_robustness(self):
        """Test system robustness with rapid queries and model switching"""
        print("\nTesting System Robustness")


        test_query = "What is the capital of France?"
        print("\nTesting rapid model switching:")
        for _ in range(3):  # Multiple rounds
            for model in list_of_models.keys():
                try:
                    response = self.chat_system.generate_response(test_query, model)
                    print(f"{model} response: {response}")
                except Exception as e:
                    print(f"Error with {model}: {str(e)}")


        print("\nTesting multiple queries in succession:")
        queries = ["Hello", "How are you?", "What time is it?"]
        for model in list_of_models.keys():
            print(f"\nRapid queries for {model}:")
            for query in queries:
                try:
                    response = self.chat_system.generate_response(query, model)
                    print(f"Query: {query}\nResponse: {response}")
                except Exception as e:
                    print(f"Error: {str(e)}")
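
The audio test depends on a WAV file being written correctly. The sketch below shows one way to check that, using only the standard library: it writes two seconds of silence and reads the header back. `write_silent_wav` and the temporary file name are hypothetical, chosen so that the sketch does not touch any file on Drive.

```python
import os
import tempfile
import wave


def write_silent_wav(path: str, sample_rate: int = 44100, duration: float = 2.0) -> int:
    """Write mono 16-bit silence to `path` and return the number of frames written."""
    n_frames = int(sample_rate * duration)
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)  # each zero sample is two zero bytes
    return n_frames


path = os.path.join(tempfile.gettempdir(), "silent_test.wav")
written = write_silent_wav(path)

# Read the header back to confirm the file is a valid WAV of the expected length.
with wave.open(path, "rb") as wav_file:
    frames = wav_file.getnframes()
    rate = wav_file.getframerate()

print(written, frames, rate)
```

A silent file only exercises the failure path of speech recognition, since there is nothing to transcribe. Testing the success path would need a recording that contains actual speech.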

In [None]:
test_runner = TestScenarios()
test_runner.run_all_tests()