<a href="https://colab.research.google.com/github/rayaneghilene/OpenELM-tests/blob/main/OpenELM_3B_chat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Apple OpenELM 3B text generation (4-bit quantization)

## Install the latest version of transformers from GitHub

In [23]:
!pip -q install git+https://github.com/huggingface/transformers --progress-bar off
!pip install -q datasets loralib sentencepiece --progress-bar off
!pip -q install bitsandbytes accelerate xformers einops --progress-bar off

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [22]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).

## Tokenizer
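The OpenELM model family uses the Llama-2-7b tokenizer, so we load it from the gated `meta-llama/Llama-2-7b-hf` repository. This is why the Hugging Face login above is needed.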

In [24]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

In [38]:
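# `token=True` reuses the token saved by `huggingface-cli login`
# (it replaces the deprecated `use_auth_token` argument).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf",
                                          token=True)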

tokenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

## Model

We use BitsAndBytes to load a 4-bit quantized version of the model. This lets us run it on machines with little GPU memory, such as Colab notebooks.

In [39]:
from transformers import BitsAndBytesConfig

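compute_dtype = torch.float16  # dtype used for computation at run time

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # store the weights in 4-bit precision
    bnb_4bit_use_double_quant=False,   # do not also quantize the quantization constants
    bnb_4bit_quant_type="nf4",         # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=compute_dtype,
)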

model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-3B-Instruct",
                                             device_map='auto',
                                             torch_dtype=torch.float16,
                                             token=True,
                                             trust_remote_code=True,
                                             quantization_config=bnb_config
                                             )



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
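
To see why 4-bit loading matters on small GPUs, here is a back-of-the-envelope estimate of the memory taken by the weights alone. This is only a sketch: the parameter count of 3e9 is a nominal assumption for a "3B" model, and the estimate ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
# Rough weight-memory estimate; N_PARAMS is a nominal assumption, not a measured value.
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Return the approximate weight storage in GiB at a given precision."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 3e9  # assumed nominal size of a "3B" model

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # 2 bytes per parameter
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # half a byte per parameter

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:   ~{nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the float16 footprint, which is what makes the model fit comfortably on a free Colab GPU.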

## Text Generation Pipeline

In [40]:
from transformers import pipeline

pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                torch_dtype=torch.float16,
                device_map="auto",
                do_sample=True,
                top_k=30,
                num_return_sequences=1,
                eos_token_id=tokenizer.eos_token_id,
                trust_remote_code=True
                )
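
The `do_sample=True` and `top_k=30` settings above ask for top-k sampling: at each step only the 30 most likely tokens are kept and one of them is drawn at random. The idea can be sketched in plain Python as follows. This is a simplified illustration with made-up scores, not the transformers implementation, and `top_k_sample` and `toy_logits` are hypothetical names.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample one token index from the k highest-scoring logits."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the survivors (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to its renormalized probability.
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # made-up scores for a 5-token vocabulary
samples = {top_k_sample(toy_logits, k=2) for _ in range(200)}
print(samples)  # with k=2, only indices 0 and 3 can appear
```

A smaller `k` makes the output more predictable, and `k=1` reduces to greedy decoding.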

In [42]:
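# Note: this re-creates the pipeline with default settings,
# replacing the sampling configuration from the previous cell.
pipe = pipeline("text-generation", tokenizer=tokenizer, model=model)

prompt = "generate random text"
pipe(prompt)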

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[{'generated_text': 'generate random text'}]

### The prompts & response

In [33]:
import json
import textwrap

B_INST, E_INST = "[INST]", "[/INST]"

def get_prompt(instruction):
    prompt_template =  B_INST + instruction + E_INST
    return prompt_template

def cut_off_text(text, prompt):
    cutoff_phrase = prompt
    index = text.find(cutoff_phrase)
    if index != -1:
        return text[:index]
    else:
        return text

def remove_substring(string, substring):
    return string.replace(substring, "")



def generate(text):
    prompt = get_prompt(text)
    with torch.autocast('cuda', dtype=torch.float16):
        inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
        outputs = model.generate(**inputs,
                                 max_new_tokens=512,
                                 eos_token_id=tokenizer.eos_token_id,
                                 pad_token_id=tokenizer.eos_token_id,
                                 )
        final_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        final_outputs = cut_off_text(final_outputs, '</s>')
        final_outputs = remove_substring(final_outputs, prompt)

    return final_outputs

def parse_text(text):
    # Wrap long lines at 100 characters so the output is readable in the notebook.
    wrapped_text = textwrap.fill(text, width=100)
    print(wrapped_text + '\n\n')
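
To make the post-processing in `generate` concrete, the sketch below runs the same string helpers on a hand-written string instead of real model output. The instruction and the `decoded` text are made up for illustration, and the helpers are restated so the snippet runs on its own.

```python
B_INST, E_INST = "[INST]", "[/INST]"

def get_prompt(instruction):
    # Wrap the instruction in the [INST] ... [/INST] markers.
    return B_INST + instruction + E_INST

def cut_off_text(text, phrase):
    # Drop everything from the first occurrence of `phrase` onward.
    index = text.find(phrase)
    return text[:index] if index != -1 else text

def remove_substring(string, substring):
    return string.replace(substring, "")

prompt = get_prompt("Name one camelid.")
# Hypothetical decoded output: the prompt, a made-up answer, then an end-of-sequence marker.
decoded = prompt + " A llama.</s>trailing tokens"

answer = remove_substring(cut_off_text(decoded, "</s>"), prompt)
print(repr(prompt))
print(repr(answer))
```

Because the decoded sequence starts with the prompt itself, the prompt has to be stripped before the answer is displayed.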


## Test the model on a custom prompt

In [34]:
%%time
prompt = 'What are the differences between alpacas, vicunas and llamas?'
generated_text = generate(prompt)
parse_text(generated_text)




CPU times: user 37.2 s, sys: 78.9 ms, total: 37.3 s
Wall time: 37.8 s


## Conclusion

Despite its 3B parameters, the model fell short of expectations in these tests: the pipeline call above returned only the prompt, with no new text. Further refinement is needed before it can be relied on across tasks and contexts.

Contact me at rayane.ghilene@ensea.fr if you have any questions.