In [1]:
# How does GPT4All work?
# It is trained on top of Facebook’s LLaMA model, whose weights were released under a non-commercial license.
# Still, running that architecture at full precision on a typical local PC is impractical because of its large number of parameters (7 billion).
# The authors incorporated two tricks to make fine-tuning and inference efficient.
# We will focus on inference since the fine-tuning process is out of the scope of this course.

# The main contribution of GPT4All models is the ability to run them on a CPU. Testing these models is practically free
# because most recent PCs have capable Central Processing Units. The underlying technique that makes this possible is called quantization.
# It converts the pre-trained model weights to 4-bit precision and stores them in the GGML format.
# As a result, the model uses fewer bits to represent each number. There are two main advantages to using this technique:

# Reducing Memory Usage: Smaller weights make the models easier to deploy on low-resource devices.
# Faster Inference: Generation is faster since there is less data to read and process for each token.
# Admittedly, this approach sacrifices a small amount of output quality. However,
# it is a trade-off between having no access at all and having access to a slightly underpowered model!
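
# A minimal sketch of the idea behind 4-bit quantization, plus the memory arithmetic for a 7-billion-parameter model.
# Assumptions: this is a simplified symmetric scheme with one shared scale per block; it is not the exact GGML layout,
# and the names quantize_block / dequantize_block are made up for illustration.

```python
def quantize_block(weights):
    """Map a block of floats to integer codes in [-8, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the 4-bit codes."""
    return [scale * c for c in codes]


# Example block of weights (arbitrary illustrative values).
block = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.02, 0.64]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

# Rough weight-only memory estimate; per-block scales and runtime buffers are ignored.
n_params = 7_000_000_000
gb_fp32 = n_params * 32 / 8 / 1e9  # 32 bits per weight
gb_4bit = n_params * 4 / 8 / 1e9   # 4 bits per weight

print(codes)
print(f"max round-trip error: {max_error:.3f}")
print(f"fp32: {gb_fp32:.1f} GB, 4-bit: {gb_4bit:.1f} GB")
```

# The round-trip error is the quality cost mentioned above; the size estimate shows why the 4-bit weights fit in ordinary RAM.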

In [2]:
from langchain.llms import GPT4All
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [3]:
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])
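
# To see roughly what text the model receives, the placeholder can be filled by hand.
# Assumption: plain str.format is used here as a stand-in for the template's own formatting,
# which should give the same result for a single-variable template like this one.

```python
template = """Question: {question}

Answer: Let's think step by step."""

# Fill the single {question} placeholder with an example question.
filled = template.format(question="What happens when it rains somewhere?")
print(filled)
```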

In [4]:
# The downloaded gpt4all-lora-quantized-ggml.bin model needs to be converted into the latest format:
# git clone https://github.com/ggerganov/llama.cpp.git
# cd llama.cpp && git checkout 2b26469
# python3 llama.cpp/convert.py ./models/gpt4all-lora-quantized-ggml.bin

In [5]:
# The following code runs with these versions of langchain and pyllamacpp:
# pip install -q langchain==0.0.152 pyllamacpp==1.0.7

In [6]:
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = GPT4All(model="./ggml-model-q4_0.bin", callback_manager=callback_manager, verbose=True)
llm_chain = LLMChain(prompt=prompt, llm=llm)

In [7]:
question = "What happens when it rains somewhere?"
llm_chain.run(question)

 Question: What happens when it rains somewhere?

Answer: Let's think step by step. When there is rain, water vapors are released into the atmosphere from various sources such as evaporated sweat and respiratory exhalations of humans or animals; these forms can lead to cloud formation which then proceed with precipitation. As a result, raindrops form on clouds that appear in different sizes depending upon their altitude and temperature above them (this is called the wet-bulb thermometer); eventually this rain falls back into earth's surface either through evaporation or wind circulation from weather systems; it then flows down to lakes, rivers, oceans etc. where they form streams & waves which can create floods depending upon their strength and amount of rainfall occurring in a particular area.

" Question: What happens when it rains somewhere?\n\nAnswer: Let's think step by step. When there is rain, water vapors are released into the atmosphere from various sources such as evaporated sweat and respiratory exhalations of humans or animals; these forms can lead to cloud formation which then proceed with precipitation. As a result, raindrops form on clouds that appear in different sizes depending upon their altitude and temperature above them (this is called the wet-bulb thermometer); eventually this rain falls back into earth's surface either through evaporation or wind circulation from weather systems; it then flows down to lakes, rivers, oceans etc. where they form streams & waves which can create floods depending upon their strength and amount of rainfall occurring in a particular area."

In [8]:
# By default, the output is printed only after the model has finished its inference process.
# However, a single prompt could take more than an hour to answer (depending on your hardware)
# because of the large number of parameters in the model. We can use the StreamingStdOutCallbackHandler()
# callback to show each token as soon as it is generated. This way, we can confirm that the generation process is running and
# that the model shows the expected behavior. If it does not, we can stop the inference early and adjust the prompt.
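
# The streaming idea can be sketched without any model at all.
# Assumptions: fake_token_stream and PrintTokenHandler are hypothetical stand-ins written for this example;
# they are not part of langchain and only imitate the "print each token as it arrives" behavior.

```python
import sys
import time


def fake_token_stream(text, delay=0.0):
    """Hypothetical stand-in for a model that yields one token at a time."""
    for token in text.split(" "):
        time.sleep(delay)  # pretend each token takes a while to generate
        yield token + " "


class PrintTokenHandler:
    """Hypothetical handler that prints each token as soon as it arrives."""

    def __init__(self):
        self.tokens = []

    def on_new_token(self, token):
        self.tokens.append(token)
        sys.stdout.write(token)
        sys.stdout.flush()  # make the token visible immediately


handler = PrintTokenHandler()
for token in fake_token_stream("Rain forms when water vapor condenses."):
    handler.on_new_token(token)

# The full text is still available once generation has finished.
full_text = "".join(handler.tokens).strip()
```

# Raising the delay argument makes the difference visible: tokens appear one by one instead of all at the end.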

# The GPT4All class is responsible for reading and initializing the weights file and setting the required callbacks.
# Then, we can tie the language model and the prompt together using the LLMChain class. It enables us to ask the model questions using the run() method.
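
# Conceptually, a chain fills the prompt template and passes the result to the model.
# Assumptions: CannedLLM and SimpleChain are hypothetical classes written for this sketch;
# they only imitate the flow and are not the langchain implementation.

```python
class CannedLLM:
    """Hypothetical stand-in for a language model: returns a fixed completion."""

    def __init__(self):
        self.last_prompt = None

    def __call__(self, prompt_text):
        self.last_prompt = prompt_text  # remember what the model was asked
        return " Water falls from clouds."


class SimpleChain:
    """Hypothetical chain: fill the template, call the model, return its text."""

    def __init__(self, template, llm):
        self.template = template
        self.llm = llm

    def run(self, question):
        prompt_text = self.template.format(question=question)
        return self.llm(prompt_text)


llm = CannedLLM()
chain = SimpleChain("Question: {question}\n\nAnswer: Let's think step by step.", llm)
answer = chain.run("What happens when it rains somewhere?")
print(answer)
```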