# 💬 LLaMA Chatbot Interface

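Chat with Meta's LLaMA 2 large language model through a simple Gradio interface. This notebook targets Colab's free GPU runtime and needs no paid API key. The default `meta-llama/Llama-2-7b-chat-hf` checkpoint is gated, however: you must accept Meta's license on Hugging Face and authenticate with an access token before the weights will download.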

## Model Information
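This notebook uses the LLaMA 2 7B chat model by default. Community-quantized alternatives offer different performance profiles:

- **TheBloke/Llama-2-7B-Chat-GGML**: Optimized for CPU inference; GGML files target llama.cpp-based runtimes and cannot be loaded by the `transformers` code in this notebook
- **TheBloke/Llama-2-7B-Chat-GPTQ**: 4-bit quantized for GPU efficiency
- **TheBloke/Llama-2-7B-Chat-AWQ**: Alternative 4-bit quantization method; typically needs the `autoawq` package in addition to the ones installed below

To use the GPTQ or AWQ variants, change the `model_name` variable below.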

## Features
- Interactive chat interface
- Memory of conversation history
- Adjustable generation parameters
- Multiple model options

## Setup
First, let's install the required packages:

In [None]:
!pip install -q transformers accelerate gradio torch sentencepiece auto-gptq optimum bitsandbytes
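
After installation, it can help to confirm that the packages are actually importable before loading a multi-gigabyte model. The following is a minimal sketch; the `REQUIRED` list and the `missing_packages` helper are illustrative names, not part of the original notebook, and `bitsandbytes` is listed on the assumption that the 4-bit loading path will be used.

```python
import importlib.util

# Packages the later cells import directly or rely on indirectly.
# "bitsandbytes" is an assumption: it is only needed for 4-bit loading.
REQUIRED = ["torch", "transformers", "accelerate", "gradio", "sentencepiece", "bitsandbytes"]

def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```

If anything is reported missing, re-running the install cell (and restarting the runtime if Colab asks for it) is usually enough.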

## Import Dependencies

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import gradio as gr

## Load Model and Tokenizer

In [None]:
def load_model(model_name, quantization="none"):
    """Load the tokenizer and model; pass quantization="4bit" for 4-bit loading."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    if quantization == "4bit":
        # Quantize the weights to 4 bits on load (requires the bitsandbytes package)
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config,
            device_map="auto"
        )
    else:
        # Load the weights in half precision
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
    
    return tokenizer, model

# Load default model
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer, model = load_model(model_name)

print(f"Model loaded on: {'GPU' if torch.cuda.is_available() else 'CPU'}")
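
Whether the default half-precision load or the `"4bit"` option is the better fit depends mostly on how much GPU memory the weights need. The arithmetic can be sketched as below; `weight_memory_gib` is a hypothetical helper, 7e9 is the nominal parameter count of a "7B" model rather than the exact figure, and the estimate covers weights only (activations and the KV cache need additional memory).

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 7e9  # nominal size of a "7B" model; the exact count differs slightly

for label, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

In float16 the weights alone come to roughly 13 GiB, which leaves little headroom on a 16 GB GPU, so calling `load_model(model_name, quantization="4bit")` may be the safer choice on smaller runtimes.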

## Create Chat Function

In [None]:
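def format_chat_prompt(message, chat_history):
    """Build a Llama 2 chat prompt from (user, assistant) pairs plus the new message."""
    # No leading "<s>" here: the tokenizer adds the BOS token itself,
    # so writing it explicitly would produce a duplicate
    prompt = "[INST] "
    
    # Add chat history
    for user_msg, assistant_msg in chat_history:
        prompt += f"{user_msg} [/INST] {assistant_msg} </s><s>[INST] "
    
    # Add current message
    prompt += f"{message} [/INST]"
    return prompt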

def chat(message, chat_history, temperature=0.7, max_length=2048, top_p=0.9, repetition_penalty=1.1):
    # Format prompt with chat history (assumes pair-style history entries)
    prompt = format_chat_prompt(message, chat_history)
    
    # Tokenize input
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generate response; max_new_tokens limits only the reply, whereas
    # max_length would also count the prompt tokens
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=int(max_length),
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    
    # Decode only the newly generated tokens, skipping the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    
    # gr.ChatInterface tracks the history itself and expects only the reply text
    return response
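
The Llama 2 chat format used by `format_chat_prompt` also allows a system prompt, wrapped in `<<SYS>>` tags inside the first `[INST]` block. The variant below is a sketch rather than part of the original notebook: `format_chat_prompt_with_system` and its `system_prompt` parameter are hypothetical names, and the leading `<s>` is left out on the assumption that the tokenizer adds the BOS token itself.

```python
def format_chat_prompt_with_system(message, chat_history, system_prompt=""):
    """Build a Llama 2 chat prompt, optionally starting with a system prompt."""
    # The system prompt, if any, sits inside the first [INST] block
    system_block = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n" if system_prompt else ""
    prompt = f"[INST] {system_block}"

    # Each completed turn closes with </s> and opens a new <s>[INST] block
    for user_msg, assistant_msg in chat_history:
        prompt += f"{user_msg} [/INST] {assistant_msg} </s><s>[INST] "

    # The current message is left open for the model to answer
    prompt += f"{message} [/INST]"
    return prompt

print(format_chat_prompt_with_system("How are you?", [("Hi", "Hello!")], "Be brief."))
```

With an empty `system_prompt` the function reduces to the plain history-plus-message layout, so it could stand in for the original helper if a system prompt is wanted later.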

## Create Gradio Interface

In [None]:
interface = gr.ChatInterface(
    fn=chat,
    additional_inputs=[
        gr.Slider(
            minimum=0.1,
            maximum=2.0,
            value=0.7,
            step=0.1,
            label="Temperature",
            info="Higher values make output more random"
        ),
        gr.Slider(
            minimum=256,
            maximum=4096,
            value=2048,
            step=256,
            label="Max Length",
            info="Maximum length of generated response"
        ),
        gr.Slider(
            minimum=0.1,
            maximum=1.0,
            value=0.9,
            step=0.1,
            label="Top P",
            info="Nucleus sampling threshold"
        ),
        gr.Slider(
            minimum=1.0,
            maximum=2.0,
            value=1.1,
            step=0.1,
            label="Repetition Penalty",
            info="Higher values reduce repetition"
        )
    ],
    title="LLaMA Chatbot",
    description="Chat with Meta's LLaMA language model. Earlier turns of the conversation are included in each prompt for context.",
    examples=[
        ["Tell me about the history of artificial intelligence."],
        ["What are some good practices for writing clean code?"],
        ["Explain quantum computing in simple terms."]
    ]
)

interface.launch(share=True)
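
Depending on the Gradio version and the `type` setting of `gr.ChatInterface`, the history passed to the chat function arrives either as user/assistant pairs or as a list of role/content dictionaries. Since `format_chat_prompt` iterates over pairs, a small adapter can make the chat function work with both. This is a sketch under that assumption; `history_to_pairs` is a hypothetical helper, and it only handles plain-text message content.

```python
def history_to_pairs(chat_history):
    """Convert Gradio chat history into a list of (user, assistant) tuples.

    Accepts pair-style history ([[user, assistant], ...]) as well as
    message-style history ([{"role": ..., "content": ...}, ...]).
    """
    pairs = []
    pending_user = None
    for item in chat_history:
        if isinstance(item, dict):
            role = item.get("role")
            if role == "user":
                # Remember the user turn until its reply shows up
                pending_user = item.get("content", "")
            elif role == "assistant" and pending_user is not None:
                pairs.append((pending_user, item.get("content", "")))
                pending_user = None
        else:
            # Pair-style entry: already (user, assistant)
            user_msg, assistant_msg = item
            pairs.append((user_msg, assistant_msg))
    return pairs
```

Calling `history_to_pairs(chat_history)` at the top of the chat function, before building the prompt, would keep the rest of the code unchanged.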