# Mixtral in Colab

Welcome! In this notebook you can run [Mixtral-8x7B-Instruct](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) with decent generation speed **right in Google Colab or on a consumer-grade GPU**. This is made possible by quantizing the original model in mixed precision and implementing a MoE-specific offloading strategy. This build also incorporates NVIDIA's TensorRT-LLM code (called LLM-RT in this notebook), so the GPU can optimize the model's runtime further on top of the features above.

Note: Quantization trades a small amount of the model's accuracy for speed; the goal of this build is to run as quickly as possible with the least loss in output quality. This code was intended to run on an NVIDIA A100 GPU and may have unforeseen bugs or other problems on other hardware.

For the original, unmodified code, please see the resources below. I am not affiliated with the teams behind them; I used their open-source resources for the Mixtral optimization and added some code changes of my own. Most of the code comes from the original sources below, with the exception of the NVIDIA TensorRT-LLM code.
To learn more, read the [tech report](https://arxiv.org/abs/2312.17238) by Artyom Eliseev and Denis Mazur or check out the [repo](https://github.com/dvmazur/mixtral-offloading) on GitHub.

The NVIDIA TensorRT-LLM code was altered to run on a cloud service such as Google Colab, where I created this notebook. Credit for the original code and the base engine creation goes to the NVIDIA team and their repository: https://github.com/NVIDIA/TensorRT-LLM

My own code built for Google Colab (if you want to test it yourself) is located here: https://github.com/viasky657/GoogleCollabFiles

In [None]:

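# Archive the downloaded model folder. The folder must already exist (it is
# downloaded in a later cell); otherwise zip has nothing to archive.
!zip -r logMixtral-8x7B-Instruct-v0.1-offloading-demo.zip Mixtral-8x7B-Instruct-v0.1-offloading-demo
# Mount Google Drive so files can be copied to and from it
from google.colab import drive
drive.mount('/content/drive')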




zip error: Nothing to do! (try: zip -r logMixtral-8x7B-Instruct-v0.1-offloading-demo.zip/ . -i Mixtral-8x7B-Instruct-v0.1-offloading-demo)
Mounted at /content/drive


One will need approximately 16 GB of VRAM and 11 GB of RAM to run this notebook and generate somewhat long texts.


<details>

<summary>How to balance between RAM and GPU VRAM usage</summary>

You can balance between RAM and GPU VRAM usage by changing <code>offload_per_layer</code> variable in the <a href="#scrollTo=_mIpePTMFyRY&line=10&uniqifier=1">Initialize model</a> section. Increasing <code>offload_per_layer</code> will decrease GPU VRAM usage, increase RAM usage and decrease generation speed. Decreasing <code>offload_per_layer</code> will have the opposite effect.

Note that this notebook should run normally in Google Colab with <code>offload_per_layer = 4</code>, but may crash with other values. However, if you run this somewhere else, you're free to play with this variable.
</details>
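
To make the trade-off concrete, here is a minimal sketch of how `offload_per_layer` splits the experts between GPU and RAM. It reuses the `main_size` and `offload_size` formulas from the Initialize model section; the function name `expert_placement` is hypothetical, and the 32-layer, 8-expert shape is the assumed Mixtral-8x7B configuration.

```python
def expert_placement(num_hidden_layers, num_experts, offload_per_layer):
    """Count experts kept on the GPU vs. offloaded to RAM (hypothetical helper)."""
    if not 0 <= offload_per_layer <= num_experts:
        raise ValueError("offload_per_layer must be between 0 and num_experts")
    on_gpu = num_hidden_layers * (num_experts - offload_per_layer)  # main_size
    offloaded = num_hidden_layers * offload_per_layer               # offload_size
    return on_gpu, offloaded


# Assumed Mixtral-8x7B shape: 32 layers with 8 experts each.
for k in (4, 5):
    on_gpu, offloaded = expert_placement(32, 8, k)
    print(f"offload_per_layer={k}: {on_gpu} experts on GPU, {offloaded} in RAM")
```

Raising the value from 4 to 5 moves one more expert per layer (32 in total) from the GPU to RAM, which is why VRAM usage drops while generation slows down.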


# LLM-RT Setup
**LLM-RT (TensorRT-LLM) Setup for Optimized GPU Speed Performance**

Step 1: Install the NVIDIA Container Toolkit from here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html



In [None]:
import os

!curl -fsSL https://get.docker.com -o get-docker.sh
!sh get-docker.sh

!distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
!sudo apt-get update
!sudo apt-get install -y nvidia-docker2
!sudo systemctl restart docker

!docker run --gpus all llm-rt-image

# Executing docker install script, commit: e5543d473431b782227f8908005543bb4389b8de
+ sh -c apt-get update -qq >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null
+ sh -c install -m 0755 -d /etc/apt/keyrings
+ sh -c curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" | gpg --dearmor --yes -o /etc/apt/keyrings/docker.gpg
+ sh -c chmod a+r /etc/apt/keyrings/docker.gpg
+ sh -c echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu jammy stable" > /etc/apt/sources.list.d/docker.list
+ sh -c apt-get update -qq >/dev/null
+ sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq docker-ce docker-ce-cli containerd.io docker-compose-plugin docker-ce-rootless-extras docker-buildx-plugin >/dev/null


To run Docker as a non-privileged user, consider setting up the
Docker daemon in rootless mode for your user:

    dockerd-rootless-setuptool.sh install

Visit https://d

**Next Step: Install TensorRT-LLM (for x86_64 users) for Mixtral and for the Distilled Whisper build.**

In [None]:


#MIXTRAL AI Install
# Install dependencies
!apt-get update
!apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

# Install the latest preview version (corresponding to the main branch) of TensorRT-LLM.
# If you want to install the stable version (corresponding to the release branch), please
# remove the `--pre` option.
!pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com

# Check installation
!python3 -c "import tensorrt_llm"

0% [Working]            Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1,581 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:7 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Hit:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [1,786 kB]
Get:12 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [713 kB]
Get:13 http://

In [None]:
# Needed to monitor system memory for TensorRT-LLM.
# The quotes stop the shell from treating ">=11.5.0" as an output redirect.
!pip install "pynvml>=11.5.0"

****

## Install and import libraries for Mixtral AI and for All AI Files

In [None]:
# fix numpy in colab
import numpy
from IPython.display import clear_output
!pip install PyAudio
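# fix triton in colab
# `!export` only affects a temporary subshell, so set the variables through os.environ instead
import os
os.environ["LC_ALL"] = "en_US.UTF-8"
os.environ["LD_LIBRARY_PATH"] = "/usr/lib64-nvidia"
os.environ["LIBRARY_PATH"] = "/usr/local/cuda/lib64/stubs"
!ldconfig /usr/lib64-nvidia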

# Original upstream repository: https://github.com/dvmazur/mixtral-offloading.git
!git clone https://github.com/viasky657/GoogleCollabFiles.git
!cd GoogleCollabFiles/mixtral-offloading && pip install -q -r requirements.txt
!huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo

clear_output()

In [None]:
import sys

sys.path.append("GoogleCollabFiles/mixtral-offloading")
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging

sys.path.append("GoogleCollabFiles/mixtral-offloading/src")
from src.build_model import OffloadConfig, QuantConfig, build_model                        #src.build_model

ModuleNotFoundError: No module named 'hqq'

## Initialize model

In [None]:
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"
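# A GitHub /tree/<commit> URL cannot be cloned; clone the repository, then check out the commit
!git clone https://github.com/NVIDIA/TensorRT-LLM.git
!git -C TensorRT-LLM checkout 0ab9d17a59c284d2de36889832fe9fc7c8697604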

config = AutoConfig.from_pretrained(quantized_model_name)

device = torch.device("cuda:0")

##### Change this to 5 if you have only 12 GB of GPU VRAM #####
offload_per_layer = 4
# offload_per_layer = 5
###############################################################

num_experts = config.num_local_experts

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)


attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256


ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)
model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)

# These commands keep the Mistral 7B example naming (note the `mistral/7B` paths)
# but point at the Mixtral checkpoint. Each step reads the previous step's output.

# Convert the Mixtral checkpoint
!python convert_checkpoint.py --model_dir ./mistralai/Mixtral-8x7B-v0.1 \
                              --output_dir ./tllm_checkpoint_1gpu_mistral \
                              --dtype float16

# Build the TensorRT engine with max input length 32256
!trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_mistral \
               --output_dir ./tmp/mistral/7B/trt_engines/fp16/1-gpu/ \
               --gemm_plugin float16 \
               --max_input_len 32256

# Run fp16 inference with sliding window/cache size 4096
!python3 run.py --max_output_len=50 \
                --tokenizer_dir ./mistralai/Mixtral-8x7B-v0.1 \
                --engine_dir=./tmp/mistral/7B/trt_engines/fp16/1-gpu/ \
                --max_attention_window_size=4096

# The commands above write the checkpoint and engine to disk; they do not create
# a Python variable. `model`, `quant_config`, and `offload_config` from earlier
# in this cell remain available unchanged.
additional_state_path = './tllm_checkpoint_1gpu_mistral'  # Path to the converted checkpoint




Cloning into 'tensorrt_llm'...
fatal: repository 'https://github.com/NVIDIA/TensorRT-LLM/tree/0ab9d17a59c284d2de36889832fe9fc7c8697604/tensorrt_llm/' not found


NameError: name 'AutoConfig' is not defined

# Distilled Whisper, Gradio Stable Diffusion Cascade, and Voice AI for Ada including Model Running and Token saving (Short Term Memory)(Complete)
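
The short-term memory named in the heading above can also be approximated without model-internal caches, by keeping a bounded list of recent chat turns and re-sending them with each prompt. The sketch below is one possible way to do that, not the notebook's own method: `ChatHistory` and its methods are hypothetical names, and the identity text is folded into the first user message on the assumption that the chat template requires strictly alternating user/assistant roles.

```python
from collections import deque


class ChatHistory:
    """Hypothetical helper: remembers only the most recent chat turns."""

    def __init__(self, max_turns=4):
        # Each entry is a (user_text, assistant_text) pair; the oldest pair
        # is dropped automatically once max_turns is exceeded.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user_text, assistant_text):
        self.turns.append((user_text, assistant_text))

    def build_messages(self, identity_text, user_text):
        """Return alternating user/assistant messages ending with the new input."""
        messages = []
        for past_user, past_assistant in self.turns:
            messages.append({"role": "user", "content": past_user})
            messages.append({"role": "assistant", "content": past_assistant})
        messages.append({"role": "user", "content": user_text})
        # Prepend the identity to the first user message so roles keep alternating.
        first = messages[0]["content"]
        messages[0] = {"role": "user", "content": f"{identity_text}\n\n{first}"}
        return messages
```

The returned list has the same `{"role": ..., "content": ...}` shape that `getIdentity` produces, so it could be passed to a tokenizer's chat template in place of a single user message.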

In [None]:
# Install the packages imported below before importing them
# (sounddevice additionally needs the PortAudio system library; the module-level
# elevenlabs calls used in this cell come from the pre-1.0 releases)
!pip install sounddevice openai-whisper "elevenlabs<1.0" gradio_client
# Whisper imports for user audio recognition
import sounddevice as sd
import numpy as np
import torch
from whisper import load_model, transcribe
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
# ElevenLabs audio imports
import getpass
import elevenlabs
# Stable Diffusion Cascade imports
from gradio_client import Client
# Stable Diffusion Cascade image saving
from PIL import Image
import os
from IPython.display import display

#Mixtral Model
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # Replace with your actual model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load the original (unoptimized) OpenAI Whisper "large" model under its own name
# so that it does not overwrite the Mixtral `model` above.
# load_model downloads the weights itself.
whisper_model = load_model("large", device="cuda" if torch.cuda.is_available() else "cpu")

WHISPER_SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio


def getIdentity(identityPath):
    with open(identityPath, "r", encoding="utf-8") as f:
        identityContext = f.read()
    return {"role": "user", "content": identityContext}

def user_change_identity():
    change = input("Change identity? Type 'yes' to change and 'no' for default (no change): ").strip().lower()
    return change == "yes"

def update_identity(identityPath):
    print("Enter the new identity text (this will replace the current identity):")
    new_identity = input()
    with open(identityPath, "w", encoding="utf-8") as f:
        f.write(new_identity)
    print("Identity updated.")

def get_user_input_preference():
    while True:
        preference = input("Would you like to use the microphone or type your input? (mic/type): ").strip().lower()
        if preference in ["mic", "type"]:
            return preference
        else:
            print("Invalid input. Please type 'mic' for microphone or 'type' for typing your input.")

def user_decision_to_record():
    """Ask whether to start recording. Assumed behaviour: a simple yes/no prompt."""
    return input("Start recording? (yes/no): ").strip().lower() == "yes"

def record_audio(duration=5, samplerate=WHISPER_SAMPLE_RATE, channels=1):
    """Record audio from the microphone for a fixed duration (in seconds)."""
    print("Recording...")
    recording = sd.rec(int(duration * samplerate), samplerate=samplerate, channels=channels, dtype='float32')
    sd.wait()  # Wait until recording is finished
    print("Recording stopped.")
    return recording

def transcribe_audio(audio_data):
    """Transcribe recorded audio to text using Whisper."""
    # Flatten the (frames, 1) recording to the 1-D float32 tensor Whisper expects
    audio_data = torch.tensor(audio_data, dtype=torch.float32).squeeze()
    pred_out = transcribe(whisper_model, audio=audio_data)
    return pred_out["text"]

def ai_conversation(user_input, model_identity, past_key_values=None):
    """Generate a response from the AI model based on user input.

    Args:
        user_input (str): The text input from the user.
        model_identity (dict): Chat message holding the identity text (see getIdentity).
        past_key_values (optional): Cache from a previous call. Kept for interface
            compatibility; it is not passed to generate() because each call
            re-encodes the full prompt.

    Returns:
        str: The model's text response.
        Updated past key values, or None if the model did not return any.
    """
    # The chat template expects alternating user/assistant turns, so the identity
    # text and the user's input are merged into a single user message.
    prompt = [{"role": "user", "content": f"{model_identity['content']}\n\n{user_input}"}]
    input_ids = tokenizer.apply_chat_template(prompt, return_tensors="pt").to(device)

    # Every prompt token is real (no padding), so the attention mask is all ones
    attention_mask = torch.ones_like(input_ids)

    # Generate model response
    result = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=0.9,
        top_p=0.9,
        max_new_tokens=512,  # Limits only the generated part, not the prompt
        return_dict_in_generate=True,
    )

    # Decode only the newly generated tokens, not the echoed prompt
    response_text = tokenizer.decode(result.sequences[0, input_ids.shape[1]:], skip_special_tokens=True)

    # Hand back the cache if the model returned one
    new_past_key_values = getattr(result, "past_key_values", None)

    return response_text, new_past_key_values

def main():
    identityPath = "characterConfig/Pina/identity.txt"
    if user_change_identity():
        update_identity(identityPath)
    model_identity = getIdentity(identityPath)

    past_key_values = None  # Initialize past_key_values

    while True:
        preference = get_user_input_preference()

        if preference == "mic":
            if not user_decision_to_record():
                print("Recording aborted by the user.")
                continue

            print("Please start speaking. Recording lasts 5 seconds.")
            audio_data = record_audio()

            print("Transcribing...")
            transcription = transcribe_audio(audio_data)
        else:  # User prefers typing
            print("User: ", end="")
            transcription = input()

        print(f"User said: {transcription}")  # User's input (typed or transcribed) is printed here

        # Ensure you pass model_identity to ai_conversation
        model_response, past_key_values =  ai_conversation(transcription, model_identity, past_key_values)
        print(f"AI responded: {model_response}")  # AI's response is printed here

        # Process AI response for image generation and voice output
        process_ai_response(model_response)
        handle_voice_output(model_response)

def process_ai_response(model_response):
    # Check for "draw start" phrase in model_response for image generation
    if "draw start" in model_response.lower():
        print("Starting image generation...")
        generate_image(model_response)  # Call generate_image with the current model_response
    else:
        print("No image generation triggered.")


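def user_decision_for_voice_output():
    """Ask whether to speak the reply. Assumed behaviour: a simple yes/no prompt."""
    return input("Play the AI response as audio? (yes/no): ").strip().lower() == "yes"

def handle_voice_output(model_response):
    if user_decision_for_voice_output():
        print("Generating AI voice output...")
        elevenlabs_tts(model_response)  # model_response is the AI-generated text you wish to convert to speech
    else:
        print("AI voice output disabled by the user.")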

api_key = None
voice_id = None

def get_elevenlabs_credentials():
    global api_key, voice_id
    if api_key is None:
        api_key = getpass.getpass('Enter your Elevenlabs API key: ')
    if voice_id is None:
        voice_id = getpass.getpass('Enter your Elevenlabs Voice ID: ')
    elevenlabs.set_api_key(api_key)
    return voice_id

def elevenlabs_tts(text, model='eleven_monolingual_v1'):
    voice_id = get_elevenlabs_credentials()
    audio = elevenlabs.generate(
        text=text,
        voice=voice_id,
        model=model
    )
    elevenlabs.play(audio)


#Stable Diffusion Cascade

def generate_image(model_response):
    # Append quality keywords to the model's text to form the positive prompt
    PosPrompt = model_response + ", beautiful, insanely detailed, 8k"
    client = Client("multimodalart/stable-cascade")
    result = client.predict(
        PosPrompt,  # str  in 'Prompt' Textbox component
        "nudity, NSFW, bad hands, bad anatomy, ((((ugly)))), (((duplicate))), ((morbid)), ((mutilated)), out of frame, extra fingers, mutated hands, ((poorly drawn hands)), ((poorly drawn face)), (((mutation))), (((deformed))), ((ugly)), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), out of frame, ugly, extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck)))",  # str  in 'Negative prompt' Textbox component
        0,  # float (numeric value between 0 and 2147483647) in 'Seed' Slider component
        1024,  # float (numeric value between 1024 and 1536) in 'Width' Slider component
        1024,  # float (numeric value between 1024 and 1536) in 'Height' Slider component
        10,  # float (numeric value between 10 and 30) in 'Prior Inference Steps' Slider component
        7.5,  # float (numeric value between 0 and 20) in 'Prior Guidance Scale' Slider component
        4,  # float (numeric value between 4 and 12) in 'Decoder Inference Steps' Slider component
        0,  # float (numeric value between 0 and 0) in 'Decoder Guidance Scale' Slider component
        1,  # float (numeric value between 1 and 2) in 'Number of Images' Slider component
        api_name="/run"
    )
    print(result)
    # Assumption: the Space returns the file path of the generated image
    display_and_save_image(result, output_dir="saved_images")


def display_and_save_image(image_file_path, output_dir="saved_images"):
    """
    Loads, displays, and optionally saves an image to a specified directory.

    Parameters:
    - image_file_path: str, the path to the image file.
    - output_dir: str, the directory where the image will be saved (optional).
    """
    # Ensure the file exists
    if os.path.exists(image_file_path):
        # Load and display the image
        image = Image.open(image_file_path)
        display(image)

        # Optionally, save the image to a new location
        os.makedirs(output_dir, exist_ok=True)  # Ensure the output directory exists
        output_path = os.path.join(output_dir, os.path.basename(image_file_path))
        image.save(output_path)

        print(f"Image has been saved to: {output_path}")
    else:
        print(f"Image not found at: {image_file_path}")

# Example usage:
# result = "path/to/your/image.jpg"  # Define the path to the generated image
# display_and_save_image(result)



if __name__ == "__main__":
      main()

ModuleNotFoundError: No module named 'sounddevice'

# Distilled Whisper, Gradio Stable Diffusion Cascade, and Voice AI for Ada, Including Model Running and Token Saving (Short-Term Memory, and Long-Term Memory with Zep)

This model runs best on an A100 GPU. Other GPUs should work as well, but code execution may be slower.

In [None]:
# Install the packages imported below before importing them
# (sounddevice additionally needs the PortAudio system library; the module-level
# elevenlabs calls used in this cell come from the pre-1.0 releases)
!pip install sounddevice openai-whisper "elevenlabs<1.0" gradio_client
# Whisper imports for user audio recognition
import sounddevice as sd
import numpy as np
import torch
from whisper import load_model, transcribe
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
# ElevenLabs audio imports
import getpass
import elevenlabs
# Stable Diffusion Cascade imports
from gradio_client import Client
# Stable Diffusion Cascade image saving
from PIL import Image
import os
from IPython.display import display

#Mixtral Model
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # Replace with your actual model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Load the original (unoptimized) OpenAI Whisper "large" model under its own name
# so that it does not overwrite the Mixtral `model` above.
# load_model downloads the weights itself.
whisper_model = load_model("large", device="cuda" if torch.cuda.is_available() else "cpu")

WHISPER_SAMPLE_RATE = 16000  # Whisper expects 16 kHz mono audio

# Zep imports

from uuid import uuid4

from langchain.agents import AgentType, Tool, load_tools, initialize_agent
from langchain.memory import ZepMemory
from langchain.retrievers import ZepRetriever
from langchain.schema import AIMessage, HumanMessage
from langchain_community.utilities import WikipediaAPIWrapper
from langchain_openai import OpenAI

# Set this to your Zep server URL
ZEP_API_URL = "http://localhost:8000"

session_id = str(uuid4())  # This is a unique identifier for the user

# Provide your Zep API key. Note that this is optional. See https://docs.getzep.com/deployment/auth
zep_api_key = None  # or: getpass.getpass()

# Set up Zep chat history
memory = ZepMemory(
    session_id=session_id,
    url=ZEP_API_URL,
    api_key=zep_api_key,
    memory_key="chat_history",
)

# Initialize the model first, because the tools and the agent both need `llm`.
# NOTE: initialize_mixtral_model is not defined in this notebook; it is assumed to
# wrap the steps of the "Initialize model" section and return (model, device).
model, device = initialize_mixtral_model(
    quantized_model_name="lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo",
    state_path="Mixtral-8x7B-Instruct-v0.1-offloading-demo",
    offload_per_layer=4  # Adjust based on your GPU VRAM
)

# Assign only the model to llm for usage in the agent chain.
# NOTE: initialize_agent expects a LangChain language-model object, so a raw
# Transformers model has to be wrapped first (for example with LangChain's
# HuggingFacePipeline).
llm = model

## Set up and initialize the Zep agent ##
tools = load_tools(["google-serper"], llm=llm)

# Original LLM: OpenAI(temperature=0, openai_api_key=openai_key)
agent_chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    verbose=True,
    memory=memory,
)


def getIdentity(identityPath):
    with open(identityPath, "r", encoding="utf-8") as f:
        identityContext = f.read()
    return {"role": "user", "content": identityContext}

def user_change_identity():
    change = input("Change identity? Type 'yes' to change and 'no' for default (no change): ").strip().lower()
    return change == "yes"

def update_identity(identityPath):
    print("Enter the new identity text (this will replace the current identity):")
    new_identity = input()
    with open(identityPath, "w", encoding="utf-8") as f:
        f.write(new_identity)
    print("Identity updated.")

def get_user_input_preference():
    while True:
        preference = input("Would you like to use the microphone or type your input? (mic/type): ").strip().lower()
        if preference in ["mic", "type"]:
            return preference
        else:
            print("Invalid input. Please type 'mic' for microphone or 'type' for typing your input.")

def user_decision_to_record():
    """Ask whether to start recording. Assumed behaviour: a simple yes/no prompt."""
    return input("Start recording? (yes/no): ").strip().lower() == "yes"

def record_audio(duration=5, samplerate=WHISPER_SAMPLE_RATE, channels=1):
    """Record audio from the microphone for a fixed duration (in seconds)."""
    print("Recording...")
    recording = sd.rec(int(duration * samplerate), samplerate=samplerate, channels=channels, dtype='float32')
    sd.wait()  # Wait until recording is finished
    print("Recording stopped.")
    return recording

def transcribe_audio(audio_data):
    """Transcribe recorded audio to text using Whisper."""
    # Flatten the (frames, 1) recording to the 1-D float32 tensor Whisper expects
    audio_data = torch.tensor(audio_data, dtype=torch.float32).squeeze()
    pred_out = transcribe(whisper_model, audio=audio_data)
    return pred_out["text"]

def ai_conversation(user_input, model_identity, past_key_values=None):
    """Generate a response from the AI model based on user input.

    Args:
        user_input (str): The text input from the user.
        model_identity (dict): Chat message holding the identity text (see getIdentity).
        past_key_values (optional): Cache from a previous call. Kept for interface
            compatibility; it is not passed to generate() because each call
            re-encodes the full prompt.

    Returns:
        str: The model's text response.
        Updated past key values, or None if the model did not return any.
    """
    # The chat template expects alternating user/assistant turns, so the identity
    # text and the user's input are merged into a single user message.
    prompt = [{"role": "user", "content": f"{model_identity['content']}\n\n{user_input}"}]
    input_ids = tokenizer.apply_chat_template(prompt, return_tensors="pt").to(device)

    # Every prompt token is real (no padding), so the attention mask is all ones
    attention_mask = torch.ones_like(input_ids)

    # Generate model response
    result = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=0.9,
        top_p=0.9,
        max_new_tokens=512,  # Limits only the generated part, not the prompt
        return_dict_in_generate=True,
    )

    # Decode only the newly generated tokens, not the echoed prompt
    response_text = tokenizer.decode(result.sequences[0, input_ids.shape[1]:], skip_special_tokens=True)

    # Hand back the cache if the model returned one
    new_past_key_values = getattr(result, "past_key_values", None)

    return response_text, new_past_key_values

def main():
    identityPath = "characterConfig/Pina/identity.txt"
    if user_change_identity():
        update_identity(identityPath)
    model_identity = getIdentity(identityPath)

    past_key_values = None  # Initialize past_key_values

    while True:
        preference = get_user_input_preference()

        if preference == "mic":
            if not user_decision_to_record():
                print("Recording aborted by the user.")
                continue

            print("Please start speaking. Recording lasts 5 seconds.")
            audio_data = record_audio()

            print("Transcribing...")
            transcription = transcribe_audio(audio_data)
        else:  # User prefers typing
            print("User: ", end="")
            transcription = input()

        print(f"User said: {transcription}")  # User's input (typed or transcribed) is printed here

        # Run the agent on the user's input; ZepMemory records the exchange
        model_response = agent_chain.run(input=transcription)
        print(f"AI responded: {model_response}")  # AI's response is printed here

        # Process AI response for image generation and voice output
        process_ai_response(model_response)
        handle_voice_output(model_response)


        # After response, provide options to user
        user_action = input("Enter 'print' to display chat history, 'delete' to clear memory, or 'continue' to keep chatting: ").strip().lower()
        if user_action == 'print':
            # Print each stored message with its type (human or ai)
            for message in memory.chat_memory.messages:
                print(f"{message.type}: {message.content}")
        elif user_action == 'delete':
            memory.clear()
            print("Memory cleared.")
        # Any other answer, including 'continue', keeps the chat going

def process_ai_response(model_response):
    # Check for "draw start" phrase in model_response for image generation
    if "draw start" in model_response.lower():
        print("Starting image generation...")
        generate_image(model_response)  # Call generate_image with the current model_response
    else:
        print("No image generation triggered.")


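def user_decision_for_voice_output():
    """Ask whether to speak the reply. Assumed behaviour: a simple yes/no prompt."""
    return input("Play the AI response as audio? (yes/no): ").strip().lower() == "yes"

def handle_voice_output(model_response):
    if user_decision_for_voice_output():
        print("Generating AI voice output...")
        elevenlabs_tts(model_response)  # model_response is the AI-generated text you wish to convert to speech
    else:
        print("AI voice output disabled by the user.")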

api_key = None
voice_id = None

def get_elevenlabs_credentials():
    global api_key, voice_id
    if api_key is None:
        api_key = getpass.getpass('Enter your Elevenlabs API key: ')
    if voice_id is None:
        voice_id = getpass.getpass('Enter your Elevenlabs Voice ID: ')
    elevenlabs.set_api_key(api_key)
    return voice_id

def elevenlabs_tts(text, model='eleven_monolingual_v1'):
    voice_id = get_elevenlabs_credentials()
    audio = elevenlabs.generate(
        text=text,
        voice=voice_id,
        model=model
    )
    elevenlabs.play(audio)


#Stable Diffusion Cascade

def generate_image(model_response):
    # Append quality keywords to the model's text to form the positive prompt
    PosPrompt = model_response + ", beautiful, insanely detailed, 8k"
    client = Client("multimodalart/stable-cascade")
    result = client.predict(
        PosPrompt,  # str  in 'Prompt' Textbox component
        "nudity, NSFW, bad hands, bad anatomy, ((((ugly)))), (((duplicate))), ((morbid)), ((mutilated)), out of frame, extra fingers, mutated hands, ((poorly drawn hands)), ((poorly drawn face)), (((mutation))), (((deformed))), ((ugly)), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), out of frame, ugly, extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck)))",  # str  in 'Negative prompt' Textbox component
        0,  # float (numeric value between 0 and 2147483647) in 'Seed' Slider component
        1024,  # float (numeric value between 1024 and 1536) in 'Width' Slider component
        1024,  # float (numeric value between 1024 and 1536) in 'Height' Slider component
        10,  # float (numeric value between 10 and 30) in 'Prior Inference Steps' Slider component
        7.5,  # float (numeric value between 0 and 20) in 'Prior Guidance Scale' Slider component
        4,  # float (numeric value between 4 and 12) in 'Decoder Inference Steps' Slider component
        0,  # float (numeric value between 0 and 0) in 'Decoder Guidance Scale' Slider component
        1,  # float (numeric value between 1 and 2) in 'Number of Images' Slider component
        api_name="/run"
    )
    print(result)
    # Assumption: the Space returns the file path of the generated image
    display_and_save_image(result, output_dir="saved_images")


def display_and_save_image(image_file_path, output_dir="saved_images"):
    """
    Loads, displays, and optionally saves an image to a specified directory.

    Parameters:
    - image_file_path: str, the path to the image file.
    - output_dir: str, the directory where the image will be saved (optional).
    """
    # Ensure the file exists
    if os.path.exists(image_file_path):
        # Load and display the image
        image = Image.open(image_file_path)
        display(image)

        # Optionally, save the image to a new location
        os.makedirs(output_dir, exist_ok=True)  # Ensure the output directory exists
        output_path = os.path.join(output_dir, os.path.basename(image_file_path))
        image.save(output_path)

        print(f"Image has been saved to: {output_path}")
    else:
        print(f"Image not found at: {image_file_path}")

# Example usage:
# result = "path/to/your/image.jpg"  # Define the path to the generated image
# display_and_save_image(result)



if __name__ == "__main__":
      main()

IndentationError: unexpected indent (<ipython-input-7-d08dc7d53c79>, line 63)

# Zep Memory and Document Reading Storage for Ada AI (Mixtral) (Unused)
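
The Zep cell in this section replays earlier conversation turns into memory by mapping each entry's role to a human or AI message. A minimal, library-free sketch of that mapping follows; the `Message` dataclass and `sample_history` are hypothetical stand-ins for illustration and are not part of LangChain or Zep.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical stand-in for LangChain's HumanMessage / AIMessage."""
    type: str  # "human" or "ai"
    content: str
    metadata: dict = field(default_factory=dict)


def to_messages(history):
    """Convert {"role", "content"} dicts into typed messages."""
    messages = []
    for entry in history:
        # Anything that is not a human turn is treated as an AI turn
        msg_type = "human" if entry["role"] == "human" else "ai"
        messages.append(Message(msg_type, entry["content"], entry.get("metadata", {})))
    return messages


# Made-up history used only to exercise the helper
sample_history = [
    {"role": "human", "content": "Hello"},
    {"role": "ai", "content": "Hi, how can I help?", "metadata": {"source": "demo"}},
]

messages = to_messages(sample_history)
for m in messages:
    print(m.type, ":", m.content)
```

Entries without a `metadata` key fall back to an empty dict, which mirrors the `msg.get("metadata", {})` call used when the history is written to Zep.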

In [None]:
#Zep Code Properly organized

from uuid import uuid4

from langchain.agents import AgentType, Tool, load_tools, initialize_agent
from langchain.memory import ZepMemory
from langchain.retrievers import ZepRetriever
from langchain.schema import AIMessage, HumanMessage
from langchain_community.utilities import WikipediaAPIWrapper
from langchain_openai import OpenAI

# Set this to your Zep server URL
ZEP_API_URL = "http://localhost:8000"

session_id = str(uuid4())  # This is a unique identifier for the user

# Provide your Zep API key. Note that this is optional. See https://docs.getzep.com/deployment/auth
# To prompt for a key instead: import getpass; zep_api_key = getpass.getpass()
zep_api_key = None

# Load the Mixtral model first so the tools and the agent can use it
model, device = initialize_mixtral_model(
    quantized_model_name="lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo",
    state_path="Mixtral-8x7B-Instruct-v0.1-offloading-demo",
    offload_per_layer=4,  # Adjust based on your GPU VRAM
)

# Assign only the model to llm for usage in the agent chain.
# Note: initialize_agent expects a LangChain-compatible LLM, so the raw model
# may need a wrapper before the agent can call it.
# Original LLM: OpenAI(temperature=0, openai_api_key=openai_key)
llm = model

## Set up and initialize the Zep agent ##
# The "google-serper" tool reads its key from the SERPER_API_KEY environment variable
tools = load_tools(["google-serper"], llm=llm)

# Set up Zep chat history
memory = ZepMemory(
    session_id=session_id,
    url=ZEP_API_URL,
    api_key=zep_api_key,
    memory_key="chat_history",
)

# Initialize the agent
agent_chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
    verbose=True,
    memory=memory,
)


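# Preload earlier conversation turns into Zep memory.
# Placeholder: fill this list with dicts of the form
# {"role": "human" or "ai", "content": "..."}; an optional "metadata" dict is also read.
test_history = []

for msg in test_history:
    memory.chat_memory.add_message(
        (
            HumanMessage(content=msg["content"])
            if msg["role"] == "human"
            else AIMessage(content=msg["content"])
        ),
        metadata=msg.get("metadata", {}),
    )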

def ai_conversation(user_input, model_identity, past_key_values=None):
    """Generate a response from the AI model based on user input.

    Args:
        user_input (str): The text input from the user.
        model_identity (dict): Chat message describing the model's identity.
        past_key_values (optional): Cached key/value states used to maintain context. Defaults to None.

    Returns:
        str: The model's text response.
        Updated past key values, or None if the model did not return them.
    """
    # Build the prompt: the model's identity first, then the user's input.
    # Note: only user_input is tokenized below; `prompt` is built but not yet
    # passed to the model.
    prompt = [model_identity]
    prompt.append({"role": "user", "content": user_input})
    input_ids = tokenizer(user_input, return_tensors="pt").input_ids.to(device)

    # Create attention mask based on input_ids
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=device)

    # Generate model response (max_new_tokens alone bounds the output length)
    result = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        past_key_values=past_key_values,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=0.9,
        top_p=0.9,
        max_new_tokens=512,
        return_dict_in_generate=True,
        output_hidden_states=False,  # Adjust as needed
    )

    # Decode only the newly generated tokens, not the echoed input
    generated_ids = result.sequences[0][input_ids.shape[1]:]
    response_text = tokenizer.decode(generated_ids, skip_special_tokens=True)

    # Update past_key_values if needed for continuity in conversation
    new_past_key_values = getattr(result, "past_key_values", None)

    return response_text, new_past_key_values


# Run the AI agent on one user message
user_input = input("User: ")
agent_chain.run(input=user_input)

#Print a Summary of the AI conversation
def print_messages(messages):
    for m in messages:
        print(m.type, ":\n", m.dict())


print(memory.chat_memory.zep_summary)
print("\n")
print_messages(memory.chat_memory.messages)

#Search the Zep Memory for a specific memory
retriever = ZepRetriever(
    session_id=session_id,
    url=ZEP_API_URL,
    api_key=zep_api_key,
)

search_results = memory.chat_memory.search("who are some famous women sci-fi authors?")
for r in search_results:
    if r.dist > 0.8:  # Only print results with similarity above 0.8
        print(r.message, r.dist)

# Delete all session memories.
# Note that Zep is long-term storage for memory, and clearing it is not advised
# unless you have specific data retention requirements.
# Uncomment the next line to clear this session's memory from Zep:
# memory.chat_memory.clear()

In [None]:
#Zep Vector Memory Storage which creates a document collection that can be added to the model's long term memory for context or searches.

from llama_index.vector_stores import ZepVectorStore
zep_api_url = "http://localhost:8000"
zep_api_key = "<optional_jwt_token>"
collection_name = "document" # The name of a new or existing collection
embedding_dimensions = 1536 # the dimensions of the embedding model you intend to use
vector_store = ZepVectorStore(
  api_url=zep_api_url,
  api_key=zep_api_key,
  collection_name=collection_name,
  embedding_dimensions=embedding_dimensions
)

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.storage.storage_context import StorageContext

documents = SimpleDirectoryReader("./document_calculating_engine/").load_data()

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

query = "Roleplaying Games" #AI Prompt

query_engine = index.as_query_engine()
response = query_engine.query(query)


print(str(response))

from llama_index.schema import TextNode
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

nodes = [
   TextNode(
       text=(
           "PART 1: Creating a Character for Dungeons and Dragons. "
           "Record your level on your character sheet. If you're starting at a higher level, "
           "record the additional elements your class gives you for your levels past 1st. "
           "Also record your experience points. A 1st-level character has 0 XP. "
           "A higher-level character typically begins with the minimum amount of XP required "
           "to reach that level (see 'Beyond 1st Level' later in this chapter). "
           "HIT POINTS AND HIT DICE. Your character's hit points define how tough your character is "
           "in combat and other dangerous situations. Your hit points are determined by your "
           "Hit Dice (short for Hit Point Dice). "
           "ABILITY SCORE SUMMARY. "
           "Strength. Measures: Natural athleticism, bodily power. Important for: Barbarian, fighter, paladin. "
           "Racial Increases: Mountain dwarf (+2), Half-orc (+2), Dragonborn (+2), Human (+1). "
           "Dexterity. Measures: Physical agility, reflexes, balance, poise. Important for: Monk, ranger, rogue. "
           "Racial Increases: Elf (+2), Forest gnome (+1), Halfling (+2), Human (+1). "
           "Constitution. Measures: Health, stamina, vital force. Important for: Everyone. "
           "Racial Increases: Dwarf (+2), Half-orc (+1), Stout halfling (+1), Human (+1), Rock gnome (+1)."
       ),
       metadata={
           "topic": "Dungeons and Dragons Roleplaying Game Character Creation",
           "entities": "Game",
       },
   ),
   #TextNode( Another Placeholder Example of TextNode#
       #text="Within the limits of the lunar orbit there are not less than one thousand stars, which are so situated as to be in the moon's path, and therefore to exhibit, at some period or other, those desirable occultations.",
       #metadata={
           #"topic": "astronomy",
           #"entities": "moon",
       #},
   #),
]


storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)


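# The filter value must exactly match the "topic" metadata of an indexed node;
# the only active node above uses the Dungeons and Dragons topic.
filters = MetadataFilters(
    filters=[
        ExactMatchFilter(
            key="topic",
            value="Dungeons and Dragons Roleplaying Game Character Creation",
        )
    ]
)


retriever = index.as_retriever(filters=filters)
result = retriever.retrieve("How are a character's hit points determined?")  # Result from TextNode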


for r in result:
   print("\n", r.node.text, r.score)
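
An exact-match metadata filter keeps only the nodes whose metadata value for a key equals the requested value, so a filter value that no node carries yields no results. The sketch below mimics that behavior with plain dictionaries; `exact_match_filter` and the sample nodes are hypothetical and do not use the llama_index API.

```python
def exact_match_filter(nodes, key, value):
    """Return the nodes whose metadata[key] equals value (hypothetical helper)."""
    return [node for node in nodes if node["metadata"].get(key) == value]


# Made-up nodes used only to exercise the helper
sample_nodes = [
    {"text": "Character creation notes", "metadata": {"topic": "roleplaying", "entities": "Game"}},
    {"text": "Lunar occultation notes", "metadata": {"topic": "astronomy", "entities": "moon"}},
]

matches = exact_match_filter(sample_nodes, "topic", "astronomy")
no_matches = exact_match_filter(sample_nodes, "topic", "geology")

print(len(matches), len(no_matches))
```

Because the comparison is strict equality, a near-miss such as different capitalization or a shortened topic name would also return nothing.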

## Run the base Ada model without user voice and AI voice features (Optional: the code above is the complete version, and I recommend using that build over this one.)
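
The generation call in this section samples with `temperature=0.9` and `top_p=0.9`. As a rough illustration of what these two settings do, the sketch below applies temperature scaling and nucleus (top-p) filtering to a small list of made-up scores. It is a simplified pure-Python version written for this notebook, not the code that `transformers` runs internally, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four hypothetical tokens
probs = softmax_with_temperature(logits, temperature=0.9)
filtered = top_p_filter(probs, top_p=0.9)

print([round(p, 3) for p in filtered])
```

Tokens outside the kept set receive probability zero, so sampling can never pick them; raising `top_p` toward 1.0 lets more of the unlikely tokens back in.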

In [None]:
from transformers import AutoTokenizer, TextStreamer


tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
past_key_values = None
sequence = None

seq_len = 0
while True:
  print("User: ", end="")
  user_input = input()
  print("\n")

  user_entry = dict(role="user", content=user_input)
  input_ids = tokenizer.apply_chat_template([user_entry], return_tensors="pt").to(device)

  if past_key_values is None:
    attention_mask = torch.ones_like(input_ids)
  else:
    seq_len = input_ids.size(1) + past_key_values[0][0][0].size(1)
    attention_mask = torch.ones([1, seq_len - 1], dtype=torch.int, device=device)

  print("Mixtral: ", end="")
  result = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    past_key_values=past_key_values,
    streamer=streamer,
    do_sample=True,
    temperature=0.9,
    top_p=0.9,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True,
    output_hidden_states=True,
  )
  print("\n")

  sequence = result["sequences"]
  past_key_values = result["past_key_values"]

User: What is your favorite food?


Mixtral: I don't have a favorite food, as I don't have personal experiences or tastes. However, I can tell you that many deep learning models, like the one that powers me, are often trained on data that includes a lot of text from recipes and food blogs, so they tend to generate a lot of fun and interesting responses when it comes to food! For example, I can tell you that according to a recipe I generated, my favorite food is "Crispy and creamy ice cream lasagna". So, that's something to look forward to in the hypothetical world where I have the ability to eat.


User: If you could go anywhere (even in a fantastical fictional world) where would you go?


Mixtral: I would love to visit the "Otherside" as described in the song "Castle on the Hill" by British singer-songwriter Ed Sheeran. The Otherside is a metaphysical place that exists beyond the physical world, where people go after they die. It is a place of peace, love, and happiness where all our 

KeyboardInterrupt: 