
<center> <h1> Using Local Open Source LLMs</h1> </center>

<p style="margin-bottom:1cm;"></p>

_____

In this notebook we will learn how to download and run LLMs locally using Google Colab. You can also run this notebook on your own server, as long as it has a GPU with enough memory to run these models!

The model we will be trying here is:

__[Google RecurrentGemma 2B IT](https://huggingface.co/google/recurrentgemma-2b-it)__, a 2B-parameter LLM built by Google. It is the instruction-tuned version of [Google RecurrentGemma 2B](https://huggingface.co/google/recurrentgemma-2b).

RecurrentGemma is a family of open language models built on a novel recurrent architecture developed at Google. Both pre-trained and instruction-tuned versions are available in English.

Like Gemma, RecurrentGemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Because of its novel architecture, RecurrentGemma requires less memory than Gemma and achieves faster inference when generating long sequences.

__You will need at least 5GB of GPU memory to run inference smoothly with RecurrentGemma 2B IT.__
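
As a rough sanity check on that figure, the memory needed for the weights alone can be estimated as parameter count times bytes per parameter. The sketch below assumes a nominal 2 billion parameters and 2 bytes per parameter (bfloat16, the dtype used later in this notebook). The function name is illustrative, and activations, the model's internal state and CUDA overhead all come on top of this estimate.

```python
def estimate_weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed just to hold the model weights, in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)


# Assumption: a nominal 2B parameters; the exact count for the checkpoint may differ.
nominal_params = 2e9

print(f"bfloat16 (2 bytes): {estimate_weight_memory_gib(nominal_params, 2):.2f} GiB")
print(f"float32  (4 bytes): {estimate_weight_memory_gib(nominal_params, 4):.2f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is one reason the model is loaded in bfloat16 rather than float32.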


When using Google Colab, remember to change the runtime type as shown below and select an available GPU so that the LLM runs faster.

![](https://i.imgur.com/a26Qmdw.png)

## Check Your Available GPU Memory

In [None]:
# !nvidia-smi  # run only if you have connected to a GPU runtime
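
If you prefer to read the same information from Python (for example, to decide whether to load the model at all), one option is to call `nvidia-smi` with its CSV query flags and parse the result. This is a minimal sketch: the helper names `parse_gpu_rows` and `query_gpus` are made up for illustration, and it assumes memory values are reported as plain integers in MiB.

```python
import shutil
import subprocess


def parse_gpu_rows(csv_text: str) -> list[dict]:
    """Parse 'name, total, used' CSV rows (memory in MiB) into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name does not break parsing
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        rows.append({
            "name": name,
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
        })
    return rows


def query_gpus() -> list[dict]:
    """Return one entry per visible NVIDIA GPU, or an empty list if none is found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


for gpu in query_gpus():
    print(f"{gpu['name']}: {gpu['free_mib']} MiB free of {gpu['total_mib']} MiB")
```

On a machine without an NVIDIA driver the loop simply prints nothing, which is itself a hint that no GPU runtime is connected.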

## Install Necessary Dependencies

In [None]:
# !pip install transformers accelerate

__Restart the runtime from the Runtime menu above to make sure the installed libraries are ready to be used in Colab__

## Log in to Hugging Face Using Your Token

Get your token [here](https://huggingface.co/settings/tokens) and log in using the following code:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## Load the LLM Locally Using Hugging Face

In [None]:
# import locale
# locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "google/recurrentgemma-2b-it"
dtype = torch.bfloat16
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

### Try out a basic prompt

In [None]:
chat = [
    { "role": "user", "content": "Explain what is AI in 3 bullet points" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

In [None]:
print(prompt)
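
The printed prompt shows the control tokens that the chat template wraps around the user message. As an illustration only, the sketch below rebuilds a single-turn prompt by hand, assuming the Gemma-style turn format (`<bos>`, `<start_of_turn>`, `<end_of_turn>`). The function name is hypothetical, and `tokenizer.apply_chat_template` should remain the source of truth, since the real template ships with the tokenizer.

```python
def manual_gemma_prompt(user_text: str) -> str:
    """Hand-built single-turn prompt; assumes the Gemma-style turn markers."""
    return (
        "<bos><start_of_turn>user\n"
        f"{user_text}<end_of_turn>\n"
        "<start_of_turn>model\n"  # generation prompt: cues the model to answer
    )


print(manual_gemma_prompt("Explain what is AI in 3 bullet points"))
```

If the template does begin with `<bos>` as assumed here, that would explain why the encoding step in this notebook passes `add_special_tokens=False`: the beginning-of-sequence token is already part of the string and should not be added twice.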

In [None]:
model.device

In [None]:
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device),
                         max_new_tokens=150)
print(tokenizer.decode(outputs[0]))

Remember to always refer to the [__documentation__](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate), which describes all the arguments of the `generate` method in detail. Most notably:

- **max_length:** The maximum total length of the sequence, counting the prompt tokens plus the generated tokens
- **max_new_tokens:** The maximum number of tokens to generate, ignoring the number of tokens in the prompt. Use either max_new_tokens or max_length, not both, since they serve the same purpose
- **do_sample:** Whether or not to use sampling. False means greedy decoding, i.e. always picking the most likely next token (similar in effect to a temperature close to 0)
- **temperature:** A positive value used to modulate the next-token probabilities when sampling. It is typically set between 0 and 1; a higher temperature means the results vary more and can be more creative
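
To make the effect of temperature concrete, here is a small standalone sketch of temperature-scaled softmax on made-up scores for three candidate tokens. It uses plain Python rather than the library's internals, so treat it as an illustration of the idea, not of how `generate` is implemented.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores for three candidate next tokens (made-up numbers).
toy_logits = [2.0, 1.0, 0.5]

for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])

# Greedy decoding (do_sample=False) simply takes the highest-scoring token.
greedy_choice = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
print("greedy picks index", greedy_choice)
```

At a low temperature almost all of the probability mass sits on the top token, so sampling behaves much like greedy decoding. As the temperature rises, the alternatives get a real chance of being picked.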

In [None]:
outputs = model.generate(input_ids=inputs.to(model.device),
                         max_new_tokens=150,
                         do_sample=True,
                         temperature=0.5
                         )
print(tokenizer.decode(outputs[0]))

### Pipelines make it easier to send prompts

You don't need to encode and decode your inputs and outputs every time.

In [None]:
gemma_pipe = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cuda",
)

In [None]:
prompt

In [None]:
response = gemma_pipe(prompt,
                      max_new_tokens=150,
                      do_sample=True,
                      temperature=0.5,
                      return_full_text=False)  # don't return the input prompt, only show the response

In [None]:
response

In [None]:
print(response[0]['generated_text'])

In [None]:
from IPython.display import display, Markdown

display(Markdown(response[0]['generated_text']))

## Check how much GPU Memory the LLM Uses

Remember that RecurrentGemma 2B IT uses more than 5GB of GPU memory.

In [None]:
# !nvidia-smi

In [None]:
gemma_pipe

## Prompting with an Open-Source LLM

Now we will use our locally loaded LLM and try some tasks with prompting.

### 1. Basic Q & A

In [None]:
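def create_gemma_prompt(prompt_text):
    """Wrap a user message in the model's chat template and return the prompt string."""
    chat = [
        {"role": "user", "content": prompt_text},
    ]
    # add_generation_prompt=True appends the marker that cues the model to start its reply
    prompt = tokenizer.apply_chat_template(chat, tokenize=False,
                                           add_generation_prompt=True)
    return prompt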

In [None]:
prompt_txt = "Can you explain what is mortgage?"
prompt = create_gemma_prompt(prompt_txt)
print(prompt)

In [None]:
response = gemma_pipe(prompt,
                      max_new_tokens=1000,
                      do_sample=True,
                      temperature=0.5,
                      return_full_text=False)
print(response[0]['generated_text'])

In [None]:
display(Markdown(response[0]['generated_text']))
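
The build-prompt, generate, extract-text sequence above repeats for every task in this notebook, so it can be convenient to wrap it in one helper. The sketch below is one possible shape for such a helper. `ask` is a made-up name, and the two `fake_*` functions are stand-ins so the example runs without a model. In the notebook you would pass `gemma_pipe` and `create_gemma_prompt` instead.

```python
def ask(pipe, build_prompt, prompt_text, **generation_kwargs):
    """Format the prompt, run the pipeline, and return only the generated text."""
    prompt = build_prompt(prompt_text)
    response = pipe(prompt, return_full_text=False, **generation_kwargs)
    return response[0]["generated_text"]


# Stand-in for a text-generation pipeline: returns the same list-of-dicts shape.
def fake_pipe(prompt, **kwargs):
    return [{"generated_text": f"(reply to {len(prompt)} prompt characters)"}]


# Stand-in for create_gemma_prompt; the real one applies the chat template.
def fake_prompt_builder(text):
    return f"USER: {text}\nMODEL:"


print(ask(fake_pipe, fake_prompt_builder, "Can you explain what is mortgage?",
          max_new_tokens=1000, do_sample=True, temperature=0.5))
```

Passing the pipeline and the prompt builder as arguments keeps the helper independent of any one model, so the same function would work if you later swap in a different LLM.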

### 2. Report Summarization

In [None]:
report = """
Generative AI is a type of artificial intelligence technology that can produce various types of content, including text, imagery, audio and synthetic data. The recent buzz around generative AI has been driven by the simplicity of new user interfaces for creating high-quality text, graphics and videos in a matter of seconds.
The technology, it should be noted, is not brand-new. Generative AI was introduced in the 1960s in chatbots. But it was not until 2014, with the introduction of generative adversarial networks, or GANs -- a type of machine learning algorithm -- that generative AI could create convincingly authentic images, videos and audio of real people.
On the one hand, this newfound capability has opened up opportunities that include better movie dubbing and rich educational content. It also unlocked concerns about deepfakes -- digitally forged images or videos -- and harmful cybersecurity attacks on businesses, including nefarious requests that realistically mimic an employee's boss.
Two additional recent advances that will be discussed in more detail below have played a critical part in generative AI going mainstream: transformers and the breakthrough language models they enabled. Transformers are a type of machine learning that made it possible for researchers to train ever-larger models without having to label all of the data in advance. New models could thus be trained on billions of pages of text, resulting in answers with more depth. In addition, transformers unlocked a new notion called attention that enabled models to track the connections between words across pages, chapters and books rather than just in individual sentences. And not just words: Transformers could also use their ability to track connections to analyze code, proteins, chemicals and DNA.
The rapid advances in so-called large language models (LLMs) -- i.e., models with billions or even trillions of parameters -- have opened a new era in which generative AI models can write engaging text, paint photorealistic images and even create somewhat entertaining sitcoms on the fly. Moreover, innovations in multimodal AI enable teams to generate content across multiple types of media, including text, graphics and video. This is the basis for tools like Dall-E that automatically create images from a text description or generate text captions from images.
These breakthroughs notwithstanding, we are still in the early days of using generative AI to create readable text and photorealistic stylized graphics. Early implementations have had issues with accuracy and bias, as well as being prone to hallucinations and spitting back weird answers. Still, progress thus far indicates that the inherent capabilities of this generative AI could fundamentally change enterprise technology how businesses operate. Going forward, this technology could help write code, design new drugs, develop products, redesign business processes and transform supply chains.
"""

prompt_txt = f"""
Summarize the following report delimited by triple backticks on Generative AI in max 5 lines

Report:
```{report}```
"""

prompt = create_gemma_prompt(prompt_txt)

llm_response = gemma_pipe(prompt,
                          max_new_tokens=500,
                          do_sample=False,
                          return_full_text=False)

In [None]:
print(llm_response[0]['generated_text'])

In [None]:
display(Markdown(llm_response[0]['generated_text']))
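
The prompt asks for at most 5 lines, but a model is not guaranteed to follow a length instruction, so it can be worth checking the output before using it. The following is a minimal sketch of such a check. The helper names are illustrative, and the sample string is made-up text standing in for `llm_response[0]['generated_text']`.

```python
def count_content_lines(text: str) -> int:
    """Count non-empty lines in a piece of generated text."""
    return sum(1 for line in text.splitlines() if line.strip())


def within_line_limit(text: str, max_lines: int = 5) -> bool:
    """True if the text has at most max_lines non-empty lines."""
    return count_content_lines(text) <= max_lines


# Made-up sample text, not actual model output.
sample_summary = "Line one.\n\nLine two.\nLine three.\n"
print(count_content_lines(sample_summary), within_line_limit(sample_summary))
```

If the check fails, you could re-prompt with a stricter instruction or truncate the summary, depending on what the downstream use tolerates.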

### 3. Basic Sentiment Analysis

In [None]:
review = """I recently worked with this real estate company to purchase my first home,
    and the experience was outstanding. The agent was knowledgeable, patient, and incredibly responsive.
    They guided me through every step of the process, making what could have been a stressful
    experience very smooth and enjoyable.
    """

In [None]:
prompt_txt = f"""
Act as a customer review analyst, given the following customer review text,
do the following tasks:
- Find the sentiment (positive, negative or neutral)
- Extract max 5 key topics or phrases of the good or bad in the review

Review Text:
{review}
"""
prompt = create_gemma_prompt(prompt_txt)
llm_response = gemma_pipe(prompt,
                          max_new_tokens=150,
                          do_sample=False,
                          return_full_text=False)
response = llm_response[0]['generated_text']

In [None]:
print(response)

In [None]:
display(Markdown(response))
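
The response comes back as free text, so using the sentiment in downstream code means pulling the label out of it. Below is a minimal sketch that searches for the first of the three labels named in the prompt. The function name and the sample reply are made up for illustration, and the real output format may differ from run to run.

```python
import re


def extract_sentiment(response_text: str):
    """Return 'positive', 'negative' or 'neutral' if the reply names one, else None."""
    match = re.search(r"\b(positive|negative|neutral)\b", response_text, re.IGNORECASE)
    return match.group(1).lower() if match else None


# Made-up reply text, not actual model output.
sample_reply = "**Sentiment:** Positive\n\n**Key topics:**\n- Knowledgeable agent"
print(extract_sentiment(sample_reply))
```

This takes the first label it finds, so a reply such as "not negative" would be misread. Asking the model for a fixed output format in the prompt would make the parsing more reliable.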