<a href="https://colab.research.google.com/github/yashambkr/Petals_Llama2/blob/main/Petals_Getting_started_with_LLaMA2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div align="center">
<img src="https://camo.githubusercontent.com/473dd9f992924d27457650251786464f72e54121ac6e9210add0f483ca849277/68747470733a2f2f692e696d6775722e636f6d2f3765523750616e2e706e67" width="40%">  
</div>

# Getting started with Petals

This notebook will guide you through the basics of Petals, a system for inference and fine-tuning of 100B+ language models without the need for high-end GPUs. With Petals, you can pool compute resources with other people over the Internet and run large language models such as LLaMA, Guanaco, or BLOOM right from your desktop computer or Google Colab.

💬 If you run into any issues while running this notebook, let us know in the **[#running-a-client](https://discord.gg/J29mCBNBvm)** channel of our Discord!

So, let's get started! First, let's install [the Petals package](https://github.com/bigscience-workshop/petals):

In [21]:
%pip install petals



🦙 **Want to run LLaMA 2?**

1. Request access to its weights &mdash; first on the [Meta AI website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/), then at 🤗 [Model Hub](https://huggingface.co/meta-llama/Llama-2-70b-hf) (make sure to use the same email).
2. Create an access token [here](https://huggingface.co/settings/tokens).
3. Run this command before calling `AutoTokenizer.from_pretrained(...)`:

    `!huggingface-cli login --token YOUR_TOKEN_HERE`

📋 **Friendly reminder.** This Colab is provided for demo purposes. If you want to use these models in your own projects, make sure you follow their terms of use (see [LLaMA](https://bit.ly/llama-license) and [LLaMA 2](https://bit.ly/llama2-license) licenses) and have approved access to their weights.

In [22]:
!huggingface-cli login --token YOUR_TOKEN_HERE

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Loading the distributed model 🚀

Let's start with the easiest task &mdash; creating a distributed model and using it for generating text. This machine will download a small part of the model weights and rely on other computers in the network to run the rest of the model.

The Petals interface is similar to the 🤗 [Transformers](https://github.com/huggingface/transformers) library: it feels like you're working with a local model even though parts of it are hosted remotely. We suggest starting with the "classic" [LLaMA-65B](https://github.com/facebookresearch/llama/blob/llama_v1/MODEL_CARD.md), but you can also use [LLaMA 2 (70B)](https://huggingface.co/meta-llama/Llama-2-70b-hf) if you have access to it (see the instructions above). The cell below uses the chat variant of LLaMA 2 (70B).

In [23]:
import torch
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "meta-llama/Llama-2-70b-chat-hf"
# You could also use "meta-llama/Llama-2-70b-hf", "enoch/llama-65b-hf", or
# "bigscience/bloom" - basically, any Hugging Face Hub repo with a supported model architecture

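# Load the tokenizer from the Hugging Face Hub (gated repos need the login step above)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, add_bos_token=False)
# Create the distributed model: only a small part of the weights is downloaded here,
# the rest of the model runs on other machines in the swarm
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
# Move the locally held part of the model to the GPU
model = model.cuda()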

Jul 21 18:23:23.816 [INFO] Make sure you follow the LLaMA's terms of use: https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1
Jul 21 18:23:23.821 [INFO] Using DHT prefix: Llama-2-70b-chat-hf



### ✍️ How to generate text?

Let's try to generate something by calling the **`model.generate()`** method.

The first call to this method takes a few seconds to connect to the Petals swarm. After that, you should expect a generation speed of up to **5-6 tokens/sec**. If you don't have enough GPU memory to host the entire model, this is much faster than what you get with other methods, such as offloading or running the model on CPU.
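To get a feel for what 5-6 tokens/sec means in practice, here is a small back-of-the-envelope sketch. The helper `estimate_generation_seconds` is a hypothetical name introduced for illustration, not part of Petals. Since the quoted speed is an upper bound ("up to"), the result is a best-case time, and it does not include the initial connection to the swarm.

```python
def estimate_generation_seconds(new_tokens: int, tokens_per_sec: float) -> float:
    """Best-case generation time: number of new tokens divided by throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return new_tokens / tokens_per_sec


# 100 new tokens (the max_new_tokens value used in this notebook)
# at the two ends of the quoted 5-6 tokens/sec range.
for rate in (5.0, 6.0):
    seconds = estimate_generation_seconds(100, rate)
    print(f"{rate:.0f} tokens/sec -> about {seconds:.1f} s for 100 new tokens")
```

So a 100-token completion should take roughly 17 to 20 seconds at best, plus the connection time on the first call.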

In [24]:
inputs = tokenizer('generate a poem on indian food', return_tensors="pt")["input_ids"].cuda()
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Jul 21 18:23:50.594 [INFO] Route found: 0:32 via …UpunUJ => 32:33 via …dU8vhr => 33:35 via …UpunUJ => 35:40 via …dU8vhr => 40:80 via …pmFgLT


generate a poem on indian food.

Indian food, a symphony of flavors,
A feast for the senses, a culinary delight.
From spicy curries to creamy kormas,
A journey of taste, a treat for the palate.

The aroma of basmati, a fragrant delight,
The flavors of cardamom, a sweet and savory blend.
The richness of ghee, a buttery


The `model.generate()` method runs **greedy** generation by default. However, you can use other generation methods like **top-p/top-k sampling** or **beam search** (you'll see an example in a bit), or even implement your own.
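To make the difference between greedy decoding and top-p (nucleus) sampling concrete, here is a minimal, library-free sketch of a single decoding step. It is not how Petals or Transformers implement sampling internally. All function names (`softmax`, `greedy_pick`, `top_p_filter`, `sample_top_p`) are hypothetical helpers, and the logits are made-up numbers standing in for one row of model output.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


def sample_top_p(logits, top_p=0.9, rng=random):
    """Top-p sampling: draw one token id from the filtered distribution."""
    nucleus = top_p_filter(softmax(logits), top_p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Made-up scores for a 4-token vocabulary.
example_logits = [2.0, 1.5, 0.2, -1.0]
print("greedy:", greedy_pick(example_logits))
print("top-p :", sample_top_p(example_logits, top_p=0.9))
```

Greedy decoding is deterministic, while top-p sampling can return any token inside the nucleus, which is why sampled text varies between runs.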

🔏 **Note:** Your data is processed by other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.
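To see how many peers actually handled your request, you can inspect the `Route found` log line printed during generation. The sketch below parses the route from this notebook's output, assuming each `start:end via peer` segment denotes a half-open range of transformer blocks served by that peer (an interpretation of the log format, not an official Petals API). The helpers `parse_route` and `blocks_per_peer` are hypothetical names, and the peer IDs are kept truncated exactly as they appear in the log.

```python
import re
from collections import defaultdict

# Matches segments such as "0:32 via …UpunUJ".
ROUTE_SEGMENT = re.compile(r"(\d+):(\d+) via (\S+)")


def parse_route(route):
    """Return a list of (start_block, end_block, peer) tuples."""
    return [(int(s), int(e), peer) for s, e, peer in ROUTE_SEGMENT.findall(route)]


def blocks_per_peer(segments):
    """Count how many blocks each peer serves along the route."""
    totals = defaultdict(int)
    for start, end, peer in segments:
        totals[peer] += end - start
    return dict(totals)


# Route copied from the log output above.
route = (
    "0:32 via …UpunUJ => 32:33 via …dU8vhr => 33:35 via …UpunUJ "
    "=> 35:40 via …dU8vhr => 40:80 via …pmFgLT"
)

segments = parse_route(route)
for peer, count in blocks_per_peer(segments).items():
    print(f"{peer}: {count} blocks")
```

Under that assumption, the request in this notebook passed through three distinct peers covering 80 blocks in total.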