
In [3]:
!pip install torch
!pip install langchain_community
!pip install langchain
!pip install transformers
!pip install huggingface_hub
!pip install chromadb
!pip install sentence_transformers




In [4]:
import torch
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer, AutoModelForCausalLM
#from dotenv import load_dotenv
from huggingface_hub import login
import os
import pandas as pd

# Retrieval-Augmented Generation (RAG) Using Google Gemma-7b-it

In this notebook we will implement the following RAG architecture to create a customized LLM using text from my textbook **Handbook of Regression Modeling** and Google's Gemma-7b-it open source LLM:

![RAG Architecture](rag_architecture.jpg)

We will be building two components, an information-retrieval (IR) component and a generation component:
* The IR component acts as a knowledge database of text files.  This database will be used to identify documents or text passages most relevant to the intent of the user query. Embeddings will be used to search this database and return the most relevant passages from the textbook.
* The generation component will feed the results of the IR component into the Gemma LLM as context in order to generate an attempt at a comprehensive natural language response to the user prompt.
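
To make the division of labor concrete, here is a toy sketch of the two components. It is not the implementation built in this notebook: word overlap stands in for embedding similarity, the "generation" step only assembles the prompt that would be passed to an LLM, and the function names `retrieve` and `build_prompt` are hypothetical.

```python
# Toy illustration of the two RAG components (assumption: word overlap is
# used here in place of embeddings, and no LLM is actually called).

def tokenize(text: str) -> set[str]:
    """Lower-case a string and return its set of whitespace-separated words."""
    return set(text.lower().split())


def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """IR component: rank passages by Jaccard word overlap with the query."""
    q = tokenize(query)

    def score(passage: str) -> float:
        p = tokenize(passage)
        union = q | p
        return len(q & p) / len(union) if union else 0.0

    return sorted(passages, key=score, reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Generation component input: the question plus the retrieved context."""
    return f"Question: {query}\nContext:\n" + "\n".join(context)


passages = [
    "Ordinal regression models ordered category outcomes.",
    "Survival analysis models time until an event occurs.",
]
question = "Which method models ordered category outcomes?"
top = retrieve(question, passages)
prompt = build_prompt(question, top)
```

The rest of the notebook replaces `retrieve` with an embedding search in ChromaDB and sends the assembled prompt to Gemma.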

## Preparing a dataset from my textbook to work with Gemma-7b-it LLM

My textbook is open source, and its codebase is in a public GitHub repository.  We will connect to the repository, download the text of the textbook's 14 chapters and sections, and create a Pandas dataframe with one row per chapter/section and two columns: an ID number and the text.  In the cell below only the first chapter is enabled; uncomment the remaining entries in `chapter_list` to load all 14.

In [5]:
import requests

# chapters are Rmd files with the following names
chapter_list = [
    "01-intro",
    # "02-basic_r",
    # "03-primer_stats",
    # "04-linear_regression",
    # "05-binomial_logistic_regression",
    # "06-multinomial_regression",
    # "07-ordinal_regression",
    # "08-hierarchical_data",
    # "09-survival_analysis",
    # "10-tidy_modeling",
    # "11-power_tests",
    # "12-further",
    # "13-solutions",
    # "14-bibliography",
]

# create a function to obtain the text of each chapter
def get_text(chapter: str) -> str:
    # URL on GitHub where the Rmd files are stored
    github_url = f"https://raw.githubusercontent.com/keithmcnulty/peopleanalytics-regression-book/master/r/{chapter}.Rmd"

    result = requests.get(github_url, timeout=30)
    # fail loudly rather than storing an error page as chapter text
    result.raise_for_status()
    return result.text

# iterate over the chapter URLs and pull down the text content
book_text = []
for chapter in chapter_list:
    chapter_text = get_text(chapter)
    book_text.append(chapter_text)

In [6]:
# write to a dataframe
book_data = dict(chapter = list(range(len(book_text))), text = book_text)
book_data = pd.DataFrame.from_dict(book_data)

Most of these text documents are too long to fit into Gemma's context window, so we need to split them into smaller documents in a way that makes some semantic sense.

We will use LangChain's `RecursiveCharacterTextSplitter`, which prefers to break at paragraph and line boundaries before falling back to smaller separators, with a chunk size of 1000 characters and a chunk overlap of 150 characters.

In [7]:
# split chapters into chunks of at most 1000 characters
loader = DataFrameLoader(book_data, page_content_column="text")
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)

# examine a document to ensure it looks as we expect
docs[0]

Document(page_content="`r if (knitr::is_latex_output()) '\\\\mainmatter'`\n\n# The Importance of Regression in People Analytics {#inf-model}\n\nIn the 19th century, when Francis Galton first used the term 'regression' to describe a statistical phenomenon (see Chapter \\@ref(linear-reg-ols)), little did he know how important that term would be today.  Many of the most powerful tools of statistical inference that we now have at our disposal can be traced back to the types of early analysis that Galton and his contemporaries were engaged in.  The sheer number of different regression-related methodologies and variants that are available to researchers and practitioners today is mind-boggling, and there are still rich veins of ongoing research that are focused on defining and refining new forms of regression to tackle new problems.", metadata={'chapter': 0})
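
To illustrate what the chunk size and overlap settings mean, here is a simplified sketch that cuts text into fixed character windows. This is an assumption-laden stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries to break at natural separators, whereas the hypothetical `chunk_text` helper below cuts at exact character offsets.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 2000-character sample whose content varies by position.
sample = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(sample)
chunk_lengths = [len(c) for c in chunks]
```

The overlap means that a sentence falling on a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.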

## Generating embeddings and storing in a vector DB

![Embeddings](embeddings.png)

Since we will use embeddings for the IR component, we need to generate embeddings for our split dataset and then write them into a vector database so that they can be searched.  We will use the *all-MiniLM-L6-v2* model to generate the embeddings and the ChromaDB vector database to store them.  Vector databases limit the number of documents that can be written in a single command, so we will write in batches in case there are too many documents.

In [8]:

import chromadb
from chromadb.utils import embedding_functions
from chromadb.utils.batch_utils import create_batches
import uuid

In [9]:
# set up the ChromaDB
CHROMA_DATA_PATH = "./chroma_data_regression_book/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "regression_book_docs"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

# in case docs have already been written
#client.delete_collection(COLLECTION_NAME)

In [13]:
client.delete_collection(COLLECTION_NAME)

In [14]:

# enable the DB using Cosine Similarity as the distance metric
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)

collection = client.create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)
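
As a reminder of what the cosine setting measures, here is a minimal sketch of cosine distance computed by hand on toy vectors. The vectors are made up for illustration; real embeddings from the model have many more dimensions, and ChromaDB computes this internally.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing the same way are at distance 0, orthogonal ones at 1.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Because only the direction of the vectors matters, two passages of very different length can still be close if they are about the same topic.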

In [15]:
# write text chunks to DB in batches
batches = create_batches(
    api=client,
    ids=[f"{uuid.uuid4()}" for i in range(len(docs))],
    documents=[doc.page_content for doc in docs],
    metadatas=[{'source': './handbook_of_regression_modeling', 'row': k} for k in range(len(docs))]
)

for batch in batches:
    print(f"Adding batch of size {len(batch[0])}")
    collection.add(ids=batch[0],
                   documents=batch[3],
                   metadatas=batch[2])

Adding batch of size 32
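
The batching idea itself is simple. Here is a minimal sketch, assuming a hypothetical limit of 10 documents per write; in the cell above the actual limit is determined by the ChromaDB client rather than chosen by hand, and the `batched` helper is illustrative only.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], max_batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of items, each at most max_batch_size long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


# 32 placeholder IDs, matching the number of chunks written above.
doc_ids = [f"doc-{i}" for i in range(32)]
sizes = [len(batch) for batch in batched(doc_ids, 10)]
```

With 32 chunks the single batch seen in the output above is enough, but the same loop keeps working once all chapters are loaded.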


Now that our documents are persisted in the vector database, we can try running a query against them.

In [16]:
results = collection.query(
    query_texts=["Which method would you recommend for ordered category outcomes?"],
    n_results=3,
    include=['documents']
)

results

{'ids': [['630c9fe2-70d9-4e07-990e-c55abe6777d7',
   '6de61406-b14a-457b-886e-c377178c487e',
   'e79722a1-c19a-4baf-a14b-ab472459a305']],
 'distances': None,
 'metadatas': None,
 'embeddings': None,
 'documents': [['* Chapter 5 covers binomial logistic regression.  The walkthrough example involves modeling promotion likelihood based on performance metrics.  The exercises involve modeling charitable donation likelihood based on prior donation behavior and demographics.\n* Chapter 6 covers multinomial regression.  The walkthrough example and exercise involves modeling the choice of three health insurance products by company employees based on demographic and position data.\n* Chapter 7 covers ordinal regression.  The walkthrough example involves modeling in-game disciplinary action against soccer players based on prior discipline and other factors.  The exercises involve modeling manager performance based on varied data.',
   '1.  Defining the outcome of interest $\\mathscr{O}$ and the i
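
The result is a dictionary of nested lists, with one inner list per query text and `None` for fields that were not requested via `include`. As a sketch of how it can be unpacked, the hypothetical helper below turns one query's results into a list of per-hit records. The example dictionary is a hand-written stand-in with the same shape as the output above, not real query output.

```python
def flatten_hits(results: dict, query_index: int = 0) -> list[dict]:
    """Turn one query's nested result lists into a list of per-hit records."""
    ids = results["ids"][query_index]
    hits = []
    for position, doc_id in enumerate(ids):
        hit = {"id": doc_id}
        for field in ("documents", "distances", "metadatas"):
            values = results.get(field)
            # Fields left out of `include` come back as None.
            hit[field] = values[query_index][position] if values is not None else None
        hits.append(hit)
    return hits


# Hand-written stand-in with the same shape as a query result.
example = {
    "ids": [["a", "b"]],
    "documents": [["first passage", "second passage"]],
    "distances": None,
    "metadatas": None,
}
hits = flatten_hits(example)
```

Adding `'distances'` to `include` in the query would populate that field, which is useful for judging how relevant each passage is.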

## Pipelining the RAG using ChromaDB and Gemma-7b

We now have our vector DB in place, so our IR layer is complete.  Next we will load the Gemma-7b-it LLM via Hugging Face.  An access token is needed for this; the cell below reads it from the `HF_TOKEN` environment variable.  The model is large and will take some time to load.

In [17]:
#load_dotenv()
login(token=os.getenv("HF_TOKEN"))
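
If `HF_TOKEN` is not set, `os.getenv` returns `None` and the problem only surfaces later. As a small sketch, a hypothetical helper such as `read_hf_token` below could check for the token up front and fail with a clear message. The helper name and the error text are assumptions for illustration.

```python
import os
from typing import Mapping, Optional


def read_hf_token(env: Optional[Mapping[str, str]] = None, name: str = "HF_TOKEN") -> str:
    """Return the access token from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable before running this cell")
    return token


# Usage (assumption: the variable has been set beforehand):
# login(token=read_hf_token())
```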


In [18]:
if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)
    print(x)
else:
    print("MPS not available")

MPS not available


In [None]:
# load the model and tokenizer (no device is selected here, so the model loads on CPU)
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it")
# padding and truncation are call-time options, so they are not passed to from_pretrained
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")


Now that we have loaded the Gemma-7b-it model, we can set up an LLM pipeline.

First, we run the model on its own and ask it a general question:

In [None]:
prompt = """
<start_of_turn>user
What food should I try in New Mexico?<end_of_turn>
<start_of_turn>model
"""

# tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt")

# generate the answer
outputs = model.generate(**input_ids, max_new_tokens=512)

# decode the answer
tokenizer.decode(outputs[0], skip_special_tokens=True).split('model\n', 1)[1]
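
The prompt above follows Gemma's turn format: a user turn wrapped in `<start_of_turn>user` and `<end_of_turn>`, followed by an open `<start_of_turn>model` line for the model to complete. As a sketch, a hypothetical helper such as `gemma_prompt` can build this string so the markers are not retyped for every question; it mirrors the markers used in the cell above and nothing more.

```python
def gemma_prompt(user_message: str) -> str:
    """Wrap a single user message in the turn markers used in the cell above."""
    return (
        "<start_of_turn>user\n"
        f"{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt_text = gemma_prompt("What food should I try in New Mexico?")
```

The `transformers` tokenizer also offers `apply_chat_template`, which may be preferable when a conversation has more than one turn.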

Finally, we define a function that executes our IR layer and our LLM summarization layer.  The function accepts a question and queries it against our ChromaDB, retrieving a defined number of documents based on the smallest distance from the query.  The results are joined together and sent to the LLM as context along with the original question, to generate a summarized result.

In [None]:
def ask_question(question: str, model: AutoModelForCausalLM = model, tokenizer: AutoTokenizer = tokenizer, collection_name: str = COLLECTION_NAME, n_docs: int = 3) -> str:

    # Find close documents in ChromaDB, embedding the query with the same
    # function that was used to index the documents
    collection = client.get_collection(name=collection_name, embedding_function=embedding_func)
    results = collection.query(
       query_texts=[question],
       n_results=n_docs
    )

    # Collect the results in a context
    context = "\n".join(results['documents'][0])

    prompt = f"""
    <start_of_turn>user
    You are an expert on statistics and its applications to People Analytics.
    Here is a question: {question}\n\n Answer it with reference to the following information and only using the following information: {context}.<end_of_turn>
    <start_of_turn>model
    """

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate the answer using the LLM
    outputs = model.generate(**inputs, max_new_tokens=512)

    # Decode only the newly generated tokens, so that any 'model\n' inside the
    # retrieved context cannot be mistaken for the start of the answer
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

## Testing our RAG agent

We test our agent by asking some questions where relevant information is known to exist in the IR layer.

In [None]:
ask_question("What method would you recommend I use to model ordered category outcomes and why?")

In [None]:
ask_question('What should I look out for when using Proportional Odds regression?')

In [None]:
ask_question('What type of modeling is most likely to add value in People Analytics?')

In [None]:
ask_question('Can you please explain what is meant by the term "inferential modeling"?')

In [None]:
ask_question('Where did the term regression originate from?')

In [None]:
ask_question('How do I get started using R for regression modeling?')

In [None]:
ask_question('What factors determine statistical power?')

We also test to ensure that questions are only answered based on content in the book.

In [None]:
ask_question('What is the standard model of Physics?')
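
The prompt instruction is the only safeguard used in this notebook. One possible complement, sketched here as an assumption rather than something the notebook implements, is to discard retrieved passages whose distance to the query is too large and decline to answer when none remain. The cutoff of 0.6 and the distances below are made-up values for illustration, and the helper names are hypothetical.

```python
from typing import Optional


def select_context(documents: list[str], distances: list[float],
                   max_distance: float = 0.6) -> list[str]:
    """Keep only passages whose distance to the query is within the cutoff."""
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


def context_or_none(documents: list[str], distances: list[float],
                    max_distance: float = 0.6) -> Optional[str]:
    """Return the joined context, or None when nothing relevant was retrieved."""
    kept = select_context(documents, distances, max_distance)
    return "\n".join(kept) if kept else None


# Made-up distances: a close match and a distant one.
on_topic = context_or_none(["Ordinal regression suits ordered outcomes."], [0.35])
off_topic = context_or_none(["Ordinal regression suits ordered outcomes."], [0.92])
```

A caller could return a fixed "not covered in the book" message whenever the result is `None`, without invoking the LLM at all. A suitable cutoff would need to be tuned against real queries.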