# **Run the Google Gemma Model on Free Google Colab**
- Check Out AI Tools [Sohipm](https://www.sohipm.com)
- Follow me on [YouTube](https://www.youtube.com/channel/UCrWd_NG2P2i7gvIIY-9KwJw)
- Follow me on [LinkedIn](https://www.linkedin.com/in/rakesh-sahni/)
- Follow me on Twitter [@rakeshsahni_](https://twitter.com/rakeshsahni_)


# Setup

In [None]:
#@markdown **Library Installation**

from IPython.display import clear_output  # To keep the output clean

# Core library for working with large language models
!pip install git+https://github.com/huggingface/transformers.git

# Handles device placement (required for device_map="auto") and multi-GPU/TPU training
!pip install git+https://github.com/huggingface/accelerate.git

# 8-bit and 4-bit quantization to reduce model memory usage
!pip install bitsandbytes

# Accessing and preparing datasets
!pip install datasets

# Efficient fine-tuning techniques (LoRA, QLoRA)
!pip install peft

# Creating quick demos and interfaces for your models
!pip install gradio

clear_output()
print("\nDone")



Done
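
Because `transformers` and `accelerate` are installed from their Git main branches, the exact versions change from run to run. A minimal sketch for recording what was actually installed, using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed by the setup cell above
PACKAGES = ["transformers", "accelerate", "bitsandbytes", "datasets", "peft", "gradio"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Keeping this printout with the notebook makes it easier to reproduce a run later if a newer development build behaves differently.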


In [None]:
#@markdown **Environment Setup**
from google.colab import userdata
# userdata.get('<secretName>')

import numpy as np
import pandas as pd
import os
import json

import gradio as gr

import torch
import torch.nn as nn
import torch.nn.functional as F

from datasets import load_dataset

from peft import LoraConfig, get_peft_model

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments


clear_output()
print("\nDone")


Done


In [None]:
#@markdown **Create BitsAndBytesConfig**
# https://medium.com/@rakeshrajpurohit/model-quantization-with-hugging-face-transformers-and-bitsandbytes-integration-b4c9983e8996


quantization_config = BitsAndBytesConfig(
    # Load the model weights in 4-bit precision
    load_in_4bit=True,
    # Nested quantization: also quantize the quantization constants to save a little more memory
    bnb_4bit_use_double_quant=True,
    # Use the "nf4" (4-bit NormalFloat) quantization type, designed for normally distributed weights
    bnb_4bit_quant_type="nf4",
    # Compute dtype for matrix multiplications; torch.float16 is the alternative on GPUs without native bfloat16 support
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Clear the output of the notebook cell
clear_output()

print("\nDone")


Done
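
To see why the 4-bit configuration above matters on a free Colab GPU, here is a rough back-of-the-envelope estimate of weight memory at different precisions. The parameter count of about 2.5 billion for `gemma-2b` is an assumption, and `weight_memory_gib` is a hypothetical helper. The figures cover weights only and should be read as a lower bound: quantization typically applies to the linear layers, so embeddings stay at higher precision, and activations and the KV cache need extra memory.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 2.5e9  # assumption: approximate parameter count for gemma-2b

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:18s} ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```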


In [None]:
#@markdown **Load Tokenizer and Model**
model_id = "google/gemma-2b-it" #@param ["google/gemma-2b-it"]

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    # Do not append <eos> to prompts: for generation the prompt must end at "<start_of_turn>model\n"
    add_eos_token=False,
    token=userdata.get('HF_TOKEN'),
    trust_remote_code=True
)


model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config = quantization_config,
    # use_cache=False,
    # attn_implementation="flash_attention_2",
    token=userdata.get('HF_TOKEN'),
    trust_remote_code=True
    )

clear_output()
print("\nDone")


Done


In [None]:
#@markdown **Create the predict function that returns the model response**
responsePrompt = lambda x: f"<start_of_turn>user\nBelow is an instruction that describes a task. Write a response that appropriately completes the request.\nInstruction: {x}<end_of_turn>\n\n<start_of_turn>model\n"

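def predict(message, history):
    # `history` is passed in by gr.ChatInterface but is not used: each message is answered on its own
    text = responsePrompt(message)
    # Move the inputs to whichever device device_map="auto" placed the model on
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=1000)
    # generate() returns prompt + completion; keep only the newly generated tokens
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True)

    return response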
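
The `predict` function above receives `history` but builds its prompt from the latest message only, so the model has no memory of earlier turns. A minimal sketch of how earlier turns could be folded into the prompt with the same `<start_of_turn>` / `<end_of_turn>` markers. The helper name `build_chat_prompt` is hypothetical, and the sketch assumes `history` is a list of `(user_message, model_reply)` pairs; newer Gradio versions may pass a list of role/content dictionaries instead. It also omits the instruction preamble used in `responsePrompt`.

```python
def build_chat_prompt(message, history):
    """Format earlier turns plus the new message using Gemma's turn markers.

    Assumes `history` is a list of (user_message, model_reply) pairs.
    """
    parts = []
    for user_msg, model_msg in history:
        parts.append(f"<start_of_turn>user\n{user_msg}<end_of_turn>\n")
        parts.append(f"<start_of_turn>model\n{model_msg}<end_of_turn>\n")
    parts.append(f"<start_of_turn>user\n{message}<end_of_turn>\n")
    # End with an open model turn so generation continues as the model
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(build_chat_prompt("And the largest one?", [("Name a planet.", "Mars.")]))
```

Longer conversations produce longer prompts, so in practice the history may need to be truncated to stay within the model's context window and the GPU's memory.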

In [None]:
print(responsePrompt("Hi there!"))

<start_of_turn>user
Below is an instruction that describes a task. Write a response that appropriately completes the request.
Instruction: Hi there!<end_of_turn>

<start_of_turn>model



In [None]:
#@markdown **Run Gradio ChatUI**
gr.ChatInterface(
    predict,
    examples = [
        "Write a quiz question with four options about the solar system, and also show the correct answer."
    ]
).launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://00b33909bd0cf9a850.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


