# Setup

In [3]:
!nvidia-smi

Sun Apr 30 00:59:51 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -U sentencepiece accelerate -q
!pip install bitsandbytes -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done

In [2]:
from transformers.utils.logging import set_verbosity

set_verbosity(40)  # 40 == logging.ERROR: only show errors from transformers

import warnings
# ignore hf pipeline complaints
warnings.filterwarnings("ignore", category=UserWarning, module='transformers')
warnings.filterwarnings("ignore", category=FutureWarning, module='transformers')

In [4]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

from datasets import load_dataset

# Summarizer

We use an 8-bit quantized version of the pszemraj/long-t5-tglobal-xl-16384-book-summary model. The model has been compressed with bitsandbytes, so it can be loaded with a much smaller memory footprint than the full-precision weights.

[Details](https://huggingface.co/pszemraj/long-t5-tglobal-xl-16384-book-summary-8bit)
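
To give a feel for what "8-bit quantized" means, here is a minimal sketch of plain absmax quantization in pure Python. It is an illustration only, not how bitsandbytes is implemented: the function names `absmax_quantize` and `dequantize` and the sample weights are made up for this example.

```python
def absmax_quantize(values):
    """Map floats to integer codes in [-127, 127] using one shared absmax scale."""
    absmax = max(abs(v) for v in values)
    scale = 127.0 / absmax if absmax > 0 else 1.0
    codes = [round(v * scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c / scale for c in codes]


# Hypothetical weights, chosen only for illustration
weights = [0.12, -0.47, 0.33, 1.0]
codes, scale = absmax_quantize(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.4f} -> {c:+4d} -> {r:+.4f}")
```

Each value is stored as one small integer plus a shared scale, so the reconstruction is close to the original but not exact. The rounding error per value is at most half a quantization step.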


In [10]:
class Summarizer():
  def __init__(self):
    hf_tag = "pszemraj/long-t5-tglobal-xl-16384-book-summary-8bit"
    self.tokenizer = AutoTokenizer.from_pretrained(hf_tag)
    self.model = AutoModelForSeq2SeqLM.from_pretrained(hf_tag, load_in_8bit=True, device_map="auto")
    # parameters for text generation out of model
    self.params = {
        "max_length": 256,
        "min_length": 8,
        "no_repeat_ngram_size": 3,
        "early_stopping": True,
        "repetition_penalty": 3.5,
        "length_penalty": 0.4,
        "encoder_no_repeat_ngram_size": 3,
        "num_beams": 4,
    }
  
  def get_mem_footprint(self):
    fp = self.model.get_memory_footprint() * (10 ** -9)
    return f"memory footprint is approx {round(fp, 2)} GB"

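  def summarize(self, long_text):
    # tokenize and move the input ids to the device the model was loaded on
    input_ids = self.tokenizer(long_text, return_tensors="pt").input_ids.to(self.model.device)
    # generation parameters live on the instance, so they must be read via self
    output = self.model.generate(input_ids, **self.params)
    # batch_decode returns a list with one summary string per input sequence
    summary = self.tokenizer.batch_decode(output, skip_special_tokens=True)
    return summary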

In [11]:
summarizer = Summarizer()

In [12]:
summarizer.get_mem_footprint()

'memory footprint is approx 3.18 GB'
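
As a rough sanity check on the figure above, weight storage can be estimated as parameter count times bytes per parameter. The sketch below assumes a round 3 billion parameters, which is an assumption for illustration and not a value measured from the model; `estimate_weights_gb` is a hypothetical helper, and the estimate ignores activations, buffers, and any layers kept in higher precision.

```python
def estimate_weights_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring any overhead."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


# Assumption: a round parameter count, not read from the loaded model
ASSUMED_NUM_PARAMS = 3_000_000_000

for bits in (32, 16, 8):
    gb = estimate_weights_gb(ASSUMED_NUM_PARAMS, bits)
    print(f"{bits}-bit weights: ~{gb:.1f} GB")
```

Under that assumption, 8-bit weights come to about 3 GB, in the same range as the reported footprint and comfortably within the 15 GB of the Tesla T4 shown in the setup output.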

In [13]:
long_text = """
A value that is outside the range of some numbers' global distribution is generally referred to as an outlier. Outlier detection has been widely used and covered in the current literature, and having prior knowledge of the distribution of your features helps with the task of outlier detection. More specifically, we have observed that classic quantization at scale fails for transformer-based models >6B parameters. While large outlier features are also present in smaller models, we observe that a certain threshold these outliers from highly systematic patterns across transformers which are present in every layer of the transformer. For more details on these phenomena see the LLM.int8() paper and emergent features blog post.

As mentioned earlier, 8-bit precision is extremely constrained, therefore quantizing a vector with several big values can produce wildly erroneous results. Additionally, because of a built-in characteristic of the transformer-based architecture that links all the elements together, these errors tend to compound as they get propagated across multiple layers. Therefore, mixed-precision decomposition has been developed to facilitate efficient quantization with such extreme outliers. It is discussed next.
"""

In [14]:
summarizer.summarize(long_text)

['Outliers are values that fall outside of the normal distribution. These are usually large values, such as the x and y axes.']