# Set Up and Query the LLaMA-2 Model
This notebook will guide you through installing required libraries, setting up the LLaMA-2 model, and querying it using natural language.

## Install Required Libraries
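We install the libraries this notebook depends on: bitsandbytes (for 8-bit quantization), accelerate, and Transformers. We also install Gradio for the user interface. PyTorch is imported later but is not installed here, so the environment is assumed to provide it already.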

In [1]:

!pip install bitsandbytes -q --upgrade
!pip install accelerate transformers gradio

Collecting gradio
  Downloading gradio-4.41.0-py3-none-any.whl.metadata (15 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.112.0-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.3.0 (from gradio)
  Downloading gradio_client-1.3.0-py3-none-any.whl.metadata (7.1 kB)
Collecting httpx>=0.24.1 (from gradio)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.10.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)
Collecting pydub (from 
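
After the install cell finishes, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The four packages installed by the cell above
for name, ver in installed_versions(
    ["bitsandbytes", "accelerate", "transformers", "gradio"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

A `None` entry means the package is not visible to the running kernel, which usually calls for re-running the install cell or restarting the session.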

## Import Required Libraries
Next, we'll import the necessary libraries for tokenization, model setup, and text generation.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch
import gradio as gr


## Define Model and Tokenizer
We'll define the model name and the authentication token required to access the LLaMA-2 model from Hugging Face.

In [3]:
model_name = "meta-llama/Llama-2-7b-chat-hf"
# Read the Hugging Face access token from the first line of a local file
with open("HF_TOKEN.txt") as token_file:
    auth_token = token_file.readline().strip()

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir='./model/', token=auth_token)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]
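
The warning above mentions an `HF_TOKEN` secret, while the cell reads the token from `HF_TOKEN.txt`. One way to support both is a small helper that checks an environment variable first and falls back to the file. This is a sketch, not the notebook's method: the helper name `load_hf_token` and the choice of an environment variable named `HF_TOKEN` are assumptions made for illustration.

```python
import os
from pathlib import Path


def load_hf_token(env_var="HF_TOKEN", token_path="HF_TOKEN.txt"):
    """Return the token from the environment, else from the first line of a file."""
    token = os.environ.get(env_var, "").strip()
    if token:
        return token

    path = Path(token_path)
    if path.is_file():
        lines = path.read_text().splitlines()
        first_line = lines[0].strip() if lines else ""
        if first_line:
            return first_line

    # No token found; the caller decides how to handle this
    return None


auth_token = load_hf_token()
print("token found" if auth_token else "no token found")
```

Returning `None` instead of raising keeps the helper usable for public models, where authentication is optional.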

In [4]:
print(torch.cuda.is_available())

True


## Load the Model
Now, we'll load the LLaMA-2 model using the previously defined name and authentication token. We'll also set some model parameters.

In [5]:
# Check if a GPU is available and set the device accordingly
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

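# Load the model in 8-bit, using float16 for the non-quantized parts.
# Note: `load_in_8bit=True` is deprecated (see the warning in the output below);
# newer Transformers versions expect a `BitsAndBytesConfig` passed as `quantization_config`.
model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir='model/',
                                             token=auth_token,
                                             torch_dtype=torch.float16,
                                             # Dynamic RoPE scaling; the factor is written as a float
                                             rope_scaling={"type": "dynamic", "factor": 2.0},
                                             load_in_8bit=True)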




config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
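
The deprecation warning above asks for a `BitsAndBytesConfig` object passed through `quantization_config`. Below is a sketch of the same load written that way. It assumes `model_name` and `auth_token` are already defined by the earlier cells and that the installed Transformers version provides `BitsAndBytesConfig`; it is a configuration fragment that has not been run against this notebook's environment.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same 8-bit setting as before, expressed as a config object
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_name,                      # defined in an earlier cell
    cache_dir="model/",
    token=auth_token,                # defined in an earlier cell
    torch_dtype=torch.float16,
    rope_scaling={"type": "dynamic", "factor": 2.0},
    quantization_config=quant_config,
)
```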


model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]
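
The two shard sizes in the log (9.98G and 3.50G) add up to about 13.5 GB, which is in line with storing roughly 7 billion parameters at 2 bytes each. A back-of-the-envelope estimate of the weight memory at different precisions can be sketched as follows. The parameter count of 7e9 is an assumption taken from the "7b" in the model name, and the estimate covers weights only, not activations or the generation cache.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 7e9  # assumption: nominal size implied by "7b" in the model name

fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # float16: 2 bytes per parameter
int8_gb = weight_memory_gb(NUM_PARAMS, 1)  # 8-bit: 1 byte per parameter
downloaded_gb = 9.98 + 3.50                # shard sizes shown in the log above

print(f"fp16 weights: ~{fp16_gb:.1f} GB")
print(f"int8 weights: ~{int8_gb:.1f} GB")
print(f"downloaded shards: {downloaded_gb:.2f} GB")
```

Under these assumptions, 8-bit loading roughly halves the weight memory compared with float16, which is the usual reason for quantizing on a single GPU.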

In [7]:
# print(device)
# print(type(model))

## Setup a Prompt
We'll create a prompt to query the model with.
This cell tests the pre-prompted model on a natural language processing (NLP) question.


In [None]:
# prompt = "### User:What is the fastest car in the world and how much does it cost? ### Assistant:"
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
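
The prompt above follows a fixed `### User:... ### Assistant:` pattern, so it can be produced by a small helper instead of being written out by hand. A minimal sketch, where `build_prompt` is a hypothetical name and the whitespace stripping is an addition not present in the original prompt string:

```python
def build_prompt(user_input):
    """Format a question the same way the notebook's prompt string does."""
    return f"### User:{user_input.strip()} ### Assistant:"


prompt = build_prompt("What is the fastest car in the world and how much does it cost?")
print(prompt)
```

The resulting string can be passed to the tokenizer exactly as the commented-out line above does.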


## Gradio
We use Gradio to collect user input and display the LLM's output, without any loaded documents or preset prompts.

In [8]:
def generate_response(user_input):
  # Wrap the user's text in the "### User: ... ### Assistant:" prompt format
  prompt = f"### User:{user_input} ### Assistant:"
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  # TextStreamer prints tokens to stdout as they are generated (visible in the cell output)
  streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
  output = model.generate(**inputs, streamer=streamer, use_cache=True, max_new_tokens=500)
  # The decoded sequence contains the prompt followed by the generated tokens
  output_text = tokenizer.decode(output[0], skip_special_tokens=True)
  # Split the output and return only the assistant's response
  assistant_response = output_text.split("### Assistant:")[-1].strip()
  return assistant_response
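
The last step of `generate_response` relies on the decoded text containing the prompt marker. The sketch below isolates that step so its behavior can be checked without loading the model. The helper name `extract_assistant_response` and the sample strings are made up for illustration.

```python
MARKER = "### Assistant:"


def extract_assistant_response(output_text, marker=MARKER):
    """Return the text after the last occurrence of the marker, stripped."""
    return output_text.split(marker)[-1].strip()


# A made-up decoded string: the prompt followed by the generated answer
decoded = "### User:What is a school? ### Assistant: A school is an educational institution."
print(extract_assistant_response(decoded))
```

Two edge cases are worth knowing. If the marker is absent, the whole text is returned. If the model itself emits the marker again, only the text after the last marker is kept. An alternative that avoids string matching is to decode only the newly generated tokens, for example by slicing `output[0]` from the prompt length onward.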

Create a Gradio interface:

In [9]:
demo = gr.Interface(
    fn=generate_response,
    inputs=gr.Textbox(lines=2, label="Enter your question:"),
    outputs=gr.Textbox(label="Model Response"),
)

demo.launch(debug=True, share=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://9327abff303fd9f55d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


 A school is an educational institution where children and young adults go to learn and develop their knowledge, skills, and abilities under the supervision of trained teachers and staff. The primary purpose of a school is to provide a structured and supportive environment for students to acquire the necessary skills and knowledge to succeed in their future academic and professional pursuits. Schools typically offer a wide range of subjects, including mathematics, science, language arts, social studies, and elective courses, as well as extracurricular activities and programs to help students develop their interests and talents. Additionally, schools often provide support services such as counseling, tutoring, and special education to help students succeed and reach their full potential.
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://9327abff303fd9f55d.gradio.live
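
The answer appears in the cell log above because `TextStreamer` prints tokens to stdout, while the Gradio textbox only receives the full text once `generate_response` returns. Gradio can also display partial results when the handler is a generator that yields progressively longer strings. The sketch below shows only that accumulation pattern, using a plain list in place of a real token stream. The name `accumulate_stream` is hypothetical, and connecting it to the model (for example through an iterable streamer such as Transformers' `TextIteratorStreamer`) is left as an assumption rather than shown.

```python
from typing import Iterable, Iterator


def accumulate_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the text received so far after each new chunk arrives."""
    partial = ""
    for chunk in chunks:
        partial += chunk
        yield partial


# Stand-in for text chunks arriving from a model
for text_so_far in accumulate_stream(["A ", "school ", "is ", "..."]):
    print(text_so_far)
```

Each yielded value replaces the previous one in the output component, so the handler yields the full text so far rather than just the newest chunk.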




In [10]:
!pip list

Package                          Version
-------------------------------- ---------------------
absl-py                          1.4.0
accelerate                       0.32.1
aiofiles                         23.2.1
aiohappyeyeballs                 2.3.4
aiohttp                          3.10.1
aiosignal                        1.3.1
alabaster                        0.7.16
albucore                         0.0.13
albumentations                   1.4.13
altair                           4.2.2
annotated-types                  0.7.0
anyio                            3.7.1
argon2-cffi                      23.1.0
argon2-cffi-bindings             21.2.0
array_record                     0.5.1
arviz                            0.18.0
asn1crypto                       1.5.1
astropy                          6.1.2
astropy-iers-data                0.2024.8.5.0.32.23
astunparse                       1.6.3
async-timeout                    4.0.3
atpublic                         4.1.0
attrs                   