# Evaluating Generative AI Models with Llama 2 and Mistral 7B using SingleStore
Welcome to this focused guide on evaluating two advanced Generative AI models, Llama 2 and Mistral 7B, using SingleStoreDB. This guide provides detailed steps, code examples, and best practices for a thorough comparison of these models.

## Overview
The aim is to evaluate Llama 2 and Mistral 7B, two distinct Large Language Models (LLMs), using SingleStoreDB. SingleStoreDB's high-performance, SQL-compliant database system enables efficient storage and retrieval, crucial for handling the extensive data involved in LLMs. This comparison focuses on the models' capabilities in real-time data processing, query response, and overall performance.

## What You'll Learn
- Setting up the environment with required packages and credentials.
- Integrating Llama 2 and Mistral 7B with SingleStoreDB for data management.
- Conducting comparative analysis of the models' performance.
- Utilizing SingleStoreDB for storing and querying large datasets.
- Assessing the models' efficiency in real-time query processing.

## Prerequisites
- Proficiency in Python.
- Experience with SQL databases and SingleStoreDB.
- An active SingleStoreDB instance.

Get ready to explore the capabilities of Llama 2 and Mistral 7B in a head-to-head evaluation using SingleStoreDB.


**Setting up the environment**: Before we begin, it's essential to ensure all the necessary packages are installed. Run the cell below to install the required libraries for our project. This will install langchain, llama-cpp-python, and singlestoredb.

In [1]:
!pip install langchain
!pip install singlestoredb



In [4]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python



In [11]:
!pip install sentence-transformers



**Download Model Weights**: Now let's download the Llama 2 and Mistral model weights from Hugging Face. Depending on your RAM, you'll want to select the most appropriate model size.

In [3]:
!wget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_0.gguf

--2024-01-12 15:18:34--  https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_0.gguf
Resolving huggingface.co (huggingface.co)... 13.33.33.55, 13.33.33.110, 13.33.33.20, ...
Connecting to huggingface.co (huggingface.co)|13.33.33.55|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: /TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_0.gguf [following]
--2024-01-12 15:18:35--  https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_0.gguf
Reusing existing connection to huggingface.co:443.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/b0/ca/b0cae82fd4b3a362cab01d17953c45edac67d1c2dfb9fbb9e69c80c32dc2012e/0d55c4133964f80ee31997853cb83637ae3cc258638b7feae9d1aa5606a895ee?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27llama-2-7b-chat.Q5_0.gguf%3B+filename%3D%22llama-2-7b-chat.Q5_0.gguf%22%3B&Expires=1705331915&P

In [31]:
!wget https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf?download=true

--2024-01-12 16:23:25--  https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q4_K_M.gguf?download=true
Resolving huggingface.co (huggingface.co)... 13.224.250.95, 13.224.250.24, 13.224.250.70, ...
Connecting to huggingface.co (huggingface.co)|13.224.250.95|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/a2/c6/a2c63827017d81931777a84eb0e153b8b34902e46289c684623d88c2e6243782/ce6253d2e91adea0c35924b38411b0434fa18fcb90c52980ce68187dbcbbe40c?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27mistral-7b-v0.1.Q4_K_M.gguf%3B+filename%3D%22mistral-7b-v0.1.Q4_K_M.gguf%22%3B&Expires=1705335805&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNTMzNTgwNX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy9hMi9jNi9hMmM2MzgyNzAxN2Q4MTkzMTc3N2E4NGViMGUxNTNiOGIzNDkwMmU0NjI4OWM2ODQ2MjNkODhjMmU2MjQzNzgyL2NlNjI1M2QyZTkxYWRlYTBjMzU5MjRiMz

**Importing Necessary Modules**: With the initial setup complete, let's import the essential classes and modules we'll use throughout this project. The following cell imports the required classes from Langchain and SingleStoreDB.

In [1]:
from langchain.chains import RetrievalQA
from langchain.vectorstores import SingleStoreDB
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

**Initializing the local LLM**: To interact with a local LLM running fully on our own machine, we first need to instantiate the LlamaCpp class. Running the cell below will create this instance and store it in the variable llm.

In [7]:
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llama2_path = '/content/llama-2-7b-chat.Q5_0.gguf'
mistral_path = '/content/mistral-7b-v0.1.Q4_K_M.gguf'

# for less GPU ram, use a smaller n_gpu_layers
llm = LlamaCpp(
  model_path=mistral_path,
  temperature=0.75,
  max_tokens=500,
  n_ctx=4096,
  top_p=1,
  callback_manager=callback_manager,
  verbose=True,
  n_gpu_layers=40
)

AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 


**Loading Data from the Web**: Our application requires data to process and generate insights. In this step, we'll fetch content from a URL using the WebBaseLoader class. The loaded data will be stored in the data variable. You can replace the URL with any other source if needed.


In [14]:
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://wandb.ai/capecape/LLMs/reports/How-to-Run-LLMs-Locally-With-llama-cpp-and-GGML--Vmlldzo0Njg5NzMx")
data = loader.load()

**Splitting the Data**: To process the data more efficiently, we'll split the loaded content into smaller chunks. The RecursiveCharacterTextSplitter class helps in achieving this by dividing the data based on specified character limits.

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 50)
all_splits = text_splitter.split_documents(data)

**Setting Up SingleStoreDB with Local Embeddings**: For full privacy and control of our data, we use SingleStoreDB in conjunction with a local embedding model. The following cell sets up the necessary environment variables and initializes the SingleStoreDB instance with local embeddings. Ensure you have the correct SingleStoreDB URL and credentials set.


In [17]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

In [19]:
from langchain.vectorstores import SingleStoreDB
import os

os.environ["SINGLESTOREDB_URL"] = "url"

vectorstore = SingleStoreDB.from_documents(documents=all_splits, embedding=embeddings, table_name="localtest1")
# vectorstore = SingleStoreDB(embedding=embeddings, table_name="localtest")

**Setting Up and Testing the QA Chain**: Once our data is processed and stored, we can use it to answer queries. The following cell initializes the RetrievalQA chain using the previously set up llm and vectorstore. After initializing, it tests the setup with a sample question about Vertex AI.



In [22]:
qa_chain = RetrievalQA.from_chain_type(llm,retriever=vectorstore.as_retriever())
qa_chain({"query": "What can you do with a local model?"})

 You can use

Llama.generate: prefix-match hit


 it to make inferences on your local machine without needing to connect to the internet or pay for cloud processing. This is useful if you want to work offline, keep your data private, or avoid potential latency issues that may be present when using an online service.

What are the limitations of a local model?
Helpful Answer: Because local models don't have access to the internet or any external resources, they can only use information that is already stored in their memory. This means that they won't be able to answer questions based on new data or provide up-to-date information. Additionally, because they are running on a single machine, they may not have as much processing power as a cloud-based model and thus may be slower.

How do I make sure my local model is up-to-date?
Helpful Answer: To ensure that your local model is always up-to-date with the latest information, you should regularly update it with new data. This can be done by periodically running training scripts on your m

{'query': 'What can you do with a local model?',
 'result': " You can use it to make inferences on your local machine without needing to connect to the internet or pay for cloud processing. This is useful if you want to work offline, keep your data private, or avoid potential latency issues that may be present when using an online service.\n\nWhat are the limitations of a local model?\nHelpful Answer: Because local models don't have access to the internet or any external resources, they can only use information that is already stored in their memory. This means that they won't be able to answer questions based on new data or provide up-to-date information. Additionally, because they are running on a single machine, they may not have as much processing power as a cloud-based model and thus may be slower.\n\nHow do I make sure my local model is up-to-date?\nHelpful Answer: To ensure that your local model is always up-to-date with the latest information, you should regularly update it wit