<a href="https://colab.research.google.com/github/wtergan/ML_notebooks/blob/main/Nous_Hermes_13B_GPTQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### __Nous-Hermes-13B-GPTQ in HuggingFace__

This notebook shows how to use the open-source Nous-Hermes-13B model with the Hugging Face libraries. The checkpoint used here is the 4-bit GPTQ-quantized version of the original model.

Developed by Nous Research, the model is the result of fine-tuning LLaMA 13B on over 300K instructions. It is claimed to rival OpenAI's GPT-3.5-turbo across a wide range of tasks.

Fine-tuning was performed with a sequence length of 2000 on an 8x A100 80GB DGX machine for about 50 hours.

The model was fine-tuned almost entirely on synthetic GPT-4 outputs. Data sources include GPTeacher, code instruct datasets, CodeAlpaca, Evol_Instruct Uncensored, and more, along with math, biology, and physics datasets.
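
To give a feel for why the 4-bit GPTQ checkpoint is convenient on a single GPU, the sketch below estimates the weight footprint before and after quantization. It is a back-of-the-envelope calculation, not a measurement: the nominal 13e9 parameter count, the 16-bit scale and 4-bit zero point per group, and the helper name `quantized_weight_gb` are all assumptions made for illustration. The group size of 128 is read off the `128g` in the model file name used later in this notebook. Layers that stay unquantized (such as embeddings) and activation memory are ignored.

```python
def quantized_weight_gb(n_params, bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Approximate size of group-quantized weights in gigabytes (1 GB = 1e9 bytes)."""
    # Assumption: each group of `group_size` weights shares one scale and one zero point.
    overhead_bits_per_weight = (scale_bits + zero_bits) / group_size
    total_bits = n_params * (bits + overhead_bits_per_weight)
    return total_bits / 8 / 1e9


N_PARAMS = 13e9  # assumed nominal parameter count for a "13B" model

fp16_gb = N_PARAMS * 16 / 8 / 1e9        # 2 bytes per weight
gptq_gb = quantized_weight_gb(N_PARAMS)  # roughly 4.16 bits per weight

print(f"fp16 weights:       ~{fp16_gb:.1f} GB")
print(f"4-bit GPTQ weights: ~{gptq_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 26 GB in fp16 to a little under 7 GB, which is what makes a 13B model practical on a single consumer or Colab GPU.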

In [None]:
!pip install -qU transformers accelerate einops langchain auto-gptq sentencepiece


In [None]:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM

# Hugging Face Hub repository and the base name of the quantized weights file.
model_name_or_path = "TheBloke/Nous-Hermes-13B-GPTQ"
model_basename = "nous-hermes-13b-GPTQ-4bit-128g.no-act.order"

# Use the CUDA kernels rather than the Triton backend.
use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=False)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
                                           model_basename=model_basename,
                                           use_safetensors=True,
                                           trust_remote_code=True,
                                           device="cuda:0",
                                           use_triton=use_triton,
                                           quantize_config=None)

print("\n\n*** Generate:")

prompt = "Tell me about AI"
prompt_template = f"""### Human: {prompt} ### Assistant:"""

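# Tokenize the prompt and move the token IDs to the GPU.
input_ids = tokenizer(prompt_template, return_tensors="pt").input_ids.cuda()
# temperature only matters when sampling; without do_sample=True, generate() decodes greedily.
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
# Decode the full sequence, including the prompt and special tokens such as <s> and </s>.
print(tokenizer.decode(output[0]))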

# Suppress the spurious transformers error logged when an AutoGPTQ model is used with pipeline().
logging.set_verbosity(logging.CRITICAL)

*** Generate:
<s> ### Human: Tell me about AI ### Assistant: Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. These machines are designed to perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI technology is used in a wide range of applications, including natural language processing, robotics, autonomous vehicles, and medical diagnosis. The goal of AI is to create machines that can function intelligently and independently, without human intervention.</s>
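
The decoded string above contains the prompt and the `<s>` / `</s>` special tokens as well as the answer. A minimal sketch of pulling out just the assistant's reply follows. The helper `extract_reply` is hypothetical (it is not part of transformers or AutoGPTQ), and the sample string is a shortened stand-in for the real output.

```python
def extract_reply(decoded, marker="### Assistant:", special_tokens=("<s>", "</s>")):
    """Return only the text after the last `marker`, with special tokens removed."""
    # rpartition returns the whole string as the tail when the marker is absent.
    reply = decoded.rpartition(marker)[2]
    for token in special_tokens:
        reply = reply.replace(token, "")
    return reply.strip()


# Shortened stand-in for the decoded model output.
decoded = "<s> ### Human: Tell me about AI ### Assistant: Artificial Intelligence (AI) refers to ...</s>"
print(extract_reply(decoded))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` should remove `<s>` and `</s>` at decode time, leaving only the prompt to strip.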


In [None]:
# Use the transformers pipeline API to generate from the same prompt.
prompt = "Tell me about AI"
prompt_template = f"""### Human: {prompt} ###Assistant:"""

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]["generated_text"])

*** Pipeline:
### Human: Tell me about AI ###Assistant: Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It involves machine learning, natural language processing, computer vision, and other technologies that enable computers to perform tasks that would normally require human intervention. AI has a wide range of applications across various industries such as healthcare, finance, transportation, manufacturing, and more.
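
The `temperature` and `top_p` arguments passed to the pipeline control how the next token is drawn when sampling is enabled. The sketch below illustrates the idea in plain Python on a toy vocabulary. It is a simplified illustration, not the transformers implementation: the function name `sample_top_p` and the toy logits are made up, and `repetition_penalty` is not modelled. Note also that in transformers these two settings generally take effect only when `do_sample=True` is set; otherwise decoding is greedy.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the kept tokens; random.choices normalises the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -3.0]  # made-up scores for a 4-token vocabulary
random.seed(0)
print([sample_top_p(toy_logits) for _ in range(10)])
```

With these toy logits the last token falls outside the 0.95 nucleus, so it is never sampled, while lowering `top_p` far enough leaves only the most likely token.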


In [None]:
print(pipe(prompt_template))

[{'generated_text': '### Human: Tell me about AI ###Assistant: Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. It involves machine learning, natural language processing, computer vision, and other technologies that enable computers to perform tasks that would normally require human intervention. AI has a wide range of applications across various industries such as healthcare, finance, transportation, manufacturing, and more.'}]
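
As the raw result shows, the pipeline returns a list of dictionaries, and each `generated_text` begins with the prompt. A small sketch of unpacking that structure and dropping the echoed prompt is given below. The helper `completions_only` is hypothetical, and the `results` list is a shortened stand-in for the real pipeline output.

```python
def completions_only(results, prompt):
    """Return each generated text with the echoed prompt removed."""
    texts = []
    for item in results:
        text = item["generated_text"]
        # The pipeline echoes the prompt by default, so drop it when present.
        if text.startswith(prompt):
            text = text[len(prompt):]
        texts.append(text.strip())
    return texts


# Shortened stand-in for the pipeline output shown above.
prompt_template = "### Human: Tell me about AI ###Assistant:"
results = [{"generated_text": prompt_template + " Artificial Intelligence (AI) refers to ..."}]
print(completions_only(results, prompt_template))
```

The text-generation pipeline also accepts `return_full_text=False`, which should return only the newly generated text and make this post-processing unnecessary.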
