# Local LLM (Alpaca7B) with TruLens

In this example, we'll load Alpaca7B from Hugging Face and run inference locally, using LangChain as the framework that holds the different parts of our application (conversation memory, the LLM, prompt templates, etc.). We'll use a prompt template to prime the model to be a gardening expert, then ask gardening questions that rely on earlier turns in the conversation.

We will also track the quality of this model using TruLens. As we get further in the conversation, we may run into issues which we can identify and debug.

In [None]:
'''
!pip install torch
!pip install -q git+https://github.com/huggingface/transformers # need to install from github
!pip install -q datasets loralib sentencepiece
!pip install -q bitsandbytes accelerate
!pip install -q langchain
!pip install xformers
!pip install trulens-eval
'''

In [None]:
# Hugging Face model and pipeline utilities
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
import torch

# LangChain building blocks
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

# OpenAI client and TruLens tracking/feedback utilities
import openai
from trulens_eval.tru_db import Query
from trulens_eval.tru import Tru
from trulens_eval import tru_chain
from trulens_eval.tru_feedback import Feedback
from trulens_eval.tru_feedback import OpenAI as Feedback_OpenAI

# Tru is the entry point for wrapping chains and launching the dashboard
tru = Tru()

## Create Feedback Function

The first thing we should do is define which qualities of our model we care about. In this case, we primarily care whether the statement returned by the LLM is relevant to the user's query. We'll use OpenAI to set up a feedback function for query-statement relevance. Make sure to add your own OpenAI API key!

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "..."
feedback_openai = Feedback_OpenAI()
qs_relevance = Feedback(feedback_openai.qs_relevance).on(
    question=Query.RecordInput, statement=Query.RecordOutput
)
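
To make the shape of a feedback function concrete, here is a minimal sketch: it takes the question and the statement and returns a score between 0 and 1. The function `toy_relevance` is hypothetical and uses simple word overlap as a stand-in; it is not how `qs_relevance` scores relevance (that relies on OpenAI), and it only illustrates the interface.

```python
import re


def toy_relevance(question: str, statement: str) -> float:
    """Hypothetical stand-in for a relevance feedback function.

    Returns the fraction of distinct question words that also appear
    in the statement, which is always a score in [0, 1].
    """
    def tokenize(text: str) -> set:
        # Lowercase and keep alphabetic words only
        return set(re.findall(r"[a-z']+", text.lower()))

    question_words = tokenize(question)
    statement_words = tokenize(statement)
    if not question_words:
        return 0.0
    return len(question_words & statement_words) / len(question_words)


question = "What can I plant in my garden?"
on_topic = toy_relevance(question, "You can plant kale in your garden.")
off_topic = toy_relevance(question, "Blue Jays eat peanuts.")
print(on_topic, off_topic)
```

A real relevance function would need to handle paraphrases and synonyms, which is why the notebook delegates the scoring to an LLM.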

## Loading Alpaca7B

Here we load Alpaca7B using HuggingFacePipeline's from_model_id. The Stanford Alpaca authors report that it behaves qualitatively similarly to OpenAI's text-davinci-003, yet it can be run locally on your own machine.

In [None]:
from langchain import HuggingFacePipeline

local_llm = HuggingFacePipeline.from_model_id(model_id="chavinlo/alpaca-native",
    task="text-generation",
    model_kwargs={"temperature":0.6, "top_p":0.95, "max_length":256})

## Setting up a Chat with memory

It's also important for our AI assistant to have memory of the things we tell it. That way it can give information that is most relevant to our location, conditions, etc. and feels more like we are talking to a human.

First we'll set up our AI assistant to remember up to 4 turns in our conversation using ConversationBufferWindowMemory.

Then we'll update our prompt template to prime it as a gardening expert.

Finally, we'll wrap it with TruChain. You'll notice that this produces our first logs: the chain itself along with the feedback definition.

In [None]:
from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory

# set the window memory to go back 4 turns
window_memory = ConversationBufferWindowMemory(k=4)

# create the conversation chain with the given window memory
conversation = ConversationChain(
    llm=local_llm, 
    verbose=True, 
    memory=window_memory
)

# update the conversation prompt template to prime it as a gardening expert
conversation.prompt.template = '''The following is a friendly conversation between a human and an AI gardening expert. The AI is an expert on gardening and gives recommendations specific to location and conditions. If the AI does not know the answer to a question, it truthfully says it does not know. 

Current conversation:
{history}
Human: {input}
AI:'''

# wrap with truchain to instrument it
tc_conversation = tru.Chain(conversation, chain_id='GardeningAIwithMemory_v1', feedbacks=[qs_relevance])
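
As a rough picture of what the window memory and the prompt template do together, the sketch below keeps only the last `k` exchanges and substitutes them into `{history}`. The helper `render_prompt`, the shortened template, and the placeholder turns are all hypothetical; LangChain's actual formatting may differ in its details.

```python
# Shortened version of the gardening template, for illustration only
TEMPLATE = """The following is a friendly conversation between a human and an AI gardening expert.

Current conversation:
{history}
Human: {input}
AI:"""


def render_prompt(turns, user_input, k=4):
    """Build the prompt from the k most recent (human, ai) exchanges.

    Assumes k >= 1; older exchanges are simply dropped.
    """
    window = turns[-k:]
    history = "\n".join(f"Human: {human}\nAI: {ai}" for human, ai in window)
    return TEMPLATE.format(history=history, input=user_input)


# Six made-up earlier exchanges; only the last four should survive
turns = [(f"question {i}", f"answer {i}") for i in range(1, 7)]
prompt = render_prompt(turns, "What kind of birds am I most likely to see?")
print(prompt)
```

Note that the window limits the number of turns, not their length, so the rendered prompt can still grow large when individual replies are long.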

Now that we've set up our chain, we can make the first call and ask our AI gardening assistant a question!

While this takes a bit of time to run on our local machine, it's nonetheless pretty impressive that we can run such a high-quality LLM locally.

In [None]:
# make the first call to our AI gardening assistant!
response, record = tc_conversation.call_with_record("I live in the pacific northwest, what can I plant in my outside garden?")
display(response)

In [None]:
# continue the conversation!
response, record = tc_conversation.call_with_record("What kind of birds am I most likely to see?")
display(response)

In [None]:
# keep it up!
response, record = tc_conversation.call_with_record("Thanks! Blue Jays would be awesome, what kind of bird feeder should I get to attract them?")
display(response)

Oh, looks like something is going wrong: our LLM has stopped responding usefully. Let's run the TruLens dashboard to explore what the issue might be.

In [None]:
tru.run_dashboard(force=True)

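Exploring the dashboard, we found that quality degraded on the third call to the LLM. Our hypothesis is that the model's token limit conflicts with the 4-turn window memory: `max_length` is set to 256, and in `transformers` that limit covers the prompt as well as the generated text, so once the prompt carries the template plus several earlier turns, there is little or no room left for a reply.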
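
The arithmetic behind this hypothesis can be sketched as follows. All token counts here are made-up round numbers chosen for illustration, not measurements from the model; only `MAX_LENGTH` and the window size come from the setup above.

```python
MAX_LENGTH = 256  # max_length passed to the model above
WINDOW_TURNS = 4  # k used for ConversationBufferWindowMemory

# Assumed sizes, in tokens (illustrative only)
TEMPLATE_TOKENS = 60  # fixed instructions in the prompt template
QUESTION_TOKENS = 15  # the new human question
TURN_TOKENS = 95      # one earlier question plus its reply


def remaining_budget(prior_turns: int) -> int:
    """Tokens left for the reply after the prompt is assembled."""
    kept_turns = min(prior_turns, WINDOW_TURNS)
    prompt_tokens = TEMPLATE_TOKENS + kept_turns * TURN_TOKENS + QUESTION_TOKENS
    return MAX_LENGTH - prompt_tokens


budgets = [remaining_budget(n) for n in range(6)]
print(budgets)  # [181, 86, -9, -104, -199, -199]
```

Under these assumptions the budget goes negative on the third call, and the 4-turn cap only stops the growth well after the limit has been exceeded.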

In [None]:
from langchain import HuggingFacePipeline

local_llm = HuggingFacePipeline.from_model_id(model_id="chavinlo/alpaca-native",
    task="text-generation",
    model_kwargs={"temperature":0.6, "top_p":0.95, "max_length":400})

from langchain.memory import ConversationTokenBufferMemory

# Instead of window memory, let's use token memory to match the model token limit
token_memory = ConversationTokenBufferMemory(llm=local_llm, max_token_limit=400)

conversation = ConversationChain(
    llm=local_llm, 
    verbose=True, 
    memory=token_memory
)

# wrap with truchain to instrument your chain
tc_conversation = tru.Chain(conversation, chain_id='GardeningAIwithMemory_v2', feedbacks=[qs_relevance])
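
The idea behind a token-based buffer can be sketched in a few lines: drop the oldest messages until the total fits under the limit. This is a simplified illustration, not LangChain's implementation; `prune_to_token_limit` and `count_words` are hypothetical names, and counting whitespace-separated words is only a crude proxy for real tokenization.

```python
def count_words(message: str) -> int:
    # Crude stand-in for a tokenizer: one token per word
    return len(message.split())


def prune_to_token_limit(messages, max_token_limit, count_tokens=count_words):
    """Return the most recent messages whose total size fits the limit."""
    pruned = list(messages)  # copy so the caller's list is untouched
    while pruned and sum(count_tokens(m) for m in pruned) > max_token_limit:
        pruned.pop(0)  # discard the oldest message first
    return pruned


messages = ["a b c", "d e", "f g h i"]
kept = prune_to_token_limit(messages, max_token_limit=6)
print(kept)  # ['d e', 'f g h i']
```

Because the memory limit and the model's `max_length` are separate budgets, a memory limit equal to `max_length` may still leave little headroom for the template and the reply once the buffer is full; a smaller memory limit could be worth trying if replies get truncated.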

In [None]:
response, record = tc_conversation.call_with_record("What kind of pests should I worry about?")
display(response)

In [None]:
response, record = tc_conversation.call_with_record("What kind of flowers will grow best in the northeast US?")
display(response)

In [None]:
response, record = tc_conversation.call_with_record("What is the typical soil make-up in gardens in my area?")
display(response)

In [None]:
response, record = tc_conversation.call_with_record("I'd like to grow a large tree in my backyard. Any recommendations that work well with the soil?")
display(response)

In [None]:
response, record = tc_conversation.call_with_record("What other garden improvements should I make to complement these tree recommendations?")
display(response)

Our AI assistant no longer runs out of tokens in memory. Wahoo!