In [1]:
# %pip install google-cloud-aiplatform==1.25.0
# %pip install google-api-core==1.33.1

In [2]:
import vertexai
from vertexai.preview.language_models import TextGenerationModel

In [3]:
def predict_large_language_model_sample(
    project_id: str,
    model_name: str,
    temperature: float,
    max_decode_steps: int,
    top_p: float,
    top_k: int,
    content: str,
    location: str = "us-central1",
    tuned_model_name: str = "",
) -> None:
    """Send `content` to a Vertex AI text model and print the response."""
    vertexai.init(project=project_id, location=location)
    model = TextGenerationModel.from_pretrained(model_name)
    # Optionally switch to a tuned variant of the base model.
    if tuned_model_name:
        model = model.get_tuned_model(tuned_model_name)
    response = model.predict(
        content,
        temperature=temperature,
        # max_decode_steps is passed to the SDK as max_output_tokens.
        max_output_tokens=max_decode_steps,
        top_k=top_k,
        top_p=top_p,
    )
    print(f"Response from Model: {response.text}")
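
The `temperature`, `top_k`, and `top_p` arguments control how the next token is sampled. The sketch below is a toy illustration of that idea in plain Python, not Vertex AI's implementation: the function name `filter_next_token_probs`, the toy logits, and the order of the steps (temperature scaling, then top-k, then top-p) are all assumptions made for the example.

```python
import math


def filter_next_token_probs(logits, temperature=0.2, top_k=40, top_p=0.8):
    """Toy illustration: return renormalized probabilities of the tokens
    that survive temperature scaling, top-k, and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax, shifted by the maximum for numerical stability.
    max_logit = max(scaled.values())
    exps = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: v / total for tok, v in exps.items()}
    # top-k: keep at most the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Hypothetical logits for four candidate tokens.
toy_logits = {"cat": 2.0, "dog": 1.5, "fish": 0.5, "rock": -1.0}
print(filter_next_token_probs(toy_logits, temperature=0.2, top_p=0.8))
print(filter_next_token_probs(toy_logits, temperature=1.0, top_p=0.95))
```

With these toy numbers, the low temperature concentrates almost all probability on one token, while the higher temperature and larger `top_p` leave several candidates in play.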

In [4]:
text = '''An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2'''

In [5]:
print(text)

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput

In [6]:
prompt = '''Rewrite the following article so it can be understood by a 5-year-old:'''

In [7]:
predict_large_language_model_sample(
    project_id="msca-31013-big-data",
    model_name="text-bison@001",
    temperature=0.2,
    max_decode_steps=768,
    top_p=0.8,
    top_k=40,
    location="us-central1",
    content=prompt + ' ' + text,
)

Response from Model: . Computers are getting better at understanding language. They can do this by learning from a lot of data. But when they learn from a lot of data, they can become too big and slow.

Scientists have found a way to make computers understand language better without making them too big or slow. They do this by using a technique called "low-rank adaptation." This technique allows computers to learn from a lot of data without having to store all of the data in their memory.

This is important because it means that computers will be able to understand language better and faster. This could lead to new ways for computers to help us, such as by writing better emails, translating languages, or helping us with our homework.
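
The notebook repeats the same prompt for several target ages, so the prompt construction could be factored into a helper. This is a minimal sketch under stated assumptions: `build_prompt`, `rewrite_for_ages`, and `fake_generate` are hypothetical names, the template wording is modeled on the notebook's prompt, and the generator is a local stand-in rather than a call to Vertex AI.

```python
from typing import Callable, Dict, Iterable

# Template modeled on the notebook's prompt; the exact wording is an assumption.
PROMPT_TEMPLATE = (
    "Rewrite the following article so it can be understood by a {age}-year-old:"
)


def build_prompt(age: int, article: str) -> str:
    """Join the age-specific instruction and the article with one space."""
    return PROMPT_TEMPLATE.format(age=age) + " " + article


def rewrite_for_ages(
    article: str,
    ages: Iterable[int],
    generate: Callable[[str], str],
) -> Dict[int, str]:
    """Call `generate` once per age and collect the results by age."""
    return {age: generate(build_prompt(age, article)) for age in ages}


def fake_generate(prompt: str) -> str:
    """Stand-in for a model call so the sketch runs without credentials."""
    return f"[model output for a {len(prompt)}-character prompt]"


results = rewrite_for_ages("LoRA freezes the pre-trained weights.", [5, 10, 15], fake_generate)
for age, output in results.items():
    print(age, output)
```

In the notebook itself, `generate` would wrap a model call that returns the response text instead of printing it.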


In [8]:
prompt = '''Rewrite the following article so it can be understood by a 10-year-old:'''

In [9]:
predict_large_language_model_sample(
    project_id="msca-31013-big-data",
    model_name="text-bison@001",
    temperature=0.2,
    max_decode_steps=256,
    top_p=0.95,
    top_k=40,
    location="us-central1",
    content=prompt + ' ' + text,
)

Response from Model: . Computers are getting better at understanding human language. One way they do this is by learning from a lot of text. This is called "pre-training". Once a computer has been pre-trained, it can be "fine-tuned" to do a specific task, like answering your questions or writing different kinds of text.

But pre-training and fine-tuning computers can be very expensive. That's why we're working on a new way to do it that's much cheaper. Our method is called "Low-Rank Adaptation". It works by freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. This reduces the number of trainable parameters for downstream tasks, which makes it much cheaper to fine-tune the model.

We've tested our method on several different language models, and it works just as well as fine-tuning, even though it's much cheaper. We've also released a package that makes it easy to use LoRA with PyTorch models.


In [10]:
prompt = '''Rewrite the following article so it can be understood by a 15-year-old:'''

In [11]:
predict_large_language_model_sample(
    project_id="msca-31013-big-data",
    model_name="text-bison@001",
    temperature=0.2,
    max_decode_steps=256,
    top_p=0.95,
    top_k=40,
    location="us-central1",
    content=prompt + ' ' + text,
)

Response from Model: . One way to make computers understand human language is to teach them a lot of words and phrases. We do this by feeding them a large amount of text data, and then we make them guess what the next word in a sentence should be. This process is called "pre-training".

Once a computer model has been pre-trained, we can then "fine-tune" it to perform a specific task. For example, we can fine-tune a pre-trained model to write different kinds of creative text, like poems or code.

However, fine-tuning a large pre-trained model can be very computationally expensive. This is because the model has a lot of parameters, and we need to train all of them.

To make fine-tuning more efficient, we can use a technique called "low-rank adaptation". This technique reduces the number of parameters in the model, without reducing the model's accuracy.

We have developed a new low-rank adaptation method that works very well. Our method is called "LoRA". LoRA can reduce the number of para
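
The response above stops mid-word ("para"). That is consistent with the output hitting the `max_decode_steps = 256` limit, although the notebook does not confirm the cause. The sketch below is a rough heuristic for flagging such responses and for trimming the leading ". " seen in the recorded outputs; `looks_truncated` and `tidy_response` are hypothetical helpers, not part of the Vertex AI SDK.

```python
# Characters that plausibly end a complete sentence (a heuristic assumption).
SENTENCE_ENDINGS = ".!?\"')"


def looks_truncated(response_text: str) -> bool:
    """Return True when the text does not end with sentence-final punctuation."""
    stripped = response_text.rstrip()
    return bool(stripped) and stripped[-1] not in SENTENCE_ENDINGS


def tidy_response(response_text: str) -> str:
    """Drop leading periods and whitespace, plus trailing whitespace."""
    return response_text.lstrip(". \n").rstrip()


sample = '. Our method is called "LoRA". LoRA can reduce the number of para'
print(tidy_response(sample))
print(looks_truncated(sample))
```

A response flagged this way could be regenerated with a larger `max_decode_steps`, as the first call in this notebook does with 768.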

In [12]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Thu, 11 May 2023 20:36:45'