<a href="https://colab.research.google.com/github/wtergan/MPT-7B/blob/main/MPT_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## __MPT-7B in HuggingFace and LangChain__

This notebook is derived from the notebook in the pinecone-io/examples [GitHub repository](https://github.com/pinecone-io/examples/blob/master/generation/llm-field-guide/mpt-7b/mpt-7b-huggingface-langchain.ipynb).

The aforementioned notebook has been rewritten and revised for educational purposes and brevity.

It demonstrates usage of the open-source MPT-7B model in both [HuggingFace](https://huggingface.co/) and [LangChain](https://docs.langchain.com/docs/).

[MPT-7B](https://huggingface.co/mosaicml/mpt-7b) is one of several models in the MosaicPretrainedTransformer (MPT) family. This specific 7B variant is a decoder-style transformer model that was pretrained by [MosaicML](https://www.mosaicml.com/) on 1T tokens.

These models utilize Attention with Linear Biases [(ALiBi)](https://arxiv.org/pdf/2108.12409.pdf) as a replacement for traditional positional embeddings (such as sinusoidal functions), which allows the model to be trained with high throughput efficiency and stable convergence. The ALiBi paper addresses a fundamental problem inherent in transformer models: how can a model handle input sequences that are longer than the ones it saw during training (*extrapolation*)? ALiBi does not add positional embeddings to the input word embeddings. Instead, it biases the query-key attention scores with a penalty that is proportional to the distance between the query and the key. This allows the model to extrapolate, while achieving perplexity similar to the typical sinusoidal method on larger models (1 billion+ parameters) and better perplexity than the sinusoidal method on smaller models (fewer than 1 billion parameters).
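The distance penalty described above can be sketched in a few lines of plain Python. This is a minimal illustration, not MPT's actual implementation: the function names are made up for this example, the slope schedule is the geometric sequence the ALiBi paper suggests for a power-of-two number of heads, and folding the causal mask into the same matrix is a simplification chosen here.

```python
def alibi_slopes(n_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8 / n_heads).

    Assumes n_heads is a power of two (an assumption of this sketch).
    """
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (head + 1) for head in range(n_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to the query-key attention scores of a single head.

    Entry [i][j] is slope * (j - i), i.e. a penalty that grows linearly with
    the distance between query i and an earlier key j. Future keys (j > i)
    get -inf, which acts as the causal mask in this illustration.
    """
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
# Print the bias matrix of the first head for a 4-token sequence.
for row in alibi_bias(4, slopes[0]):
    print(row)
```

Because the bias depends only on the distance `i - j`, the same rule applies unchanged to sequences longer than those seen during training, which is what makes extrapolation possible.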

MPT-7B was trained on an enormous amount of data (1T tokens). The dataset is similar in volume to the 1T-token dataset used to train [LLaMA](https://arxiv.org/pdf/2302.13971.pdf).

It also utilizes Fast and Memory-Efficient Exact Attention [(FlashAttention)](https://arxiv.org/pdf/2205.14135.pdf), which is designed to speed up the transformer by reordering the attention computation and using tiling to reduce the number of memory reads and writes between different levels of GPU memory. It splits the original *Q, K, V* matrices *(N × embed)* into multiple blocks of size *B*. It then loads the *K, V* blocks from high-bandwidth memory (HBM) into the fast on-chip SRAM. For each *K, V* block, FlashAttention loops over blocks of the *Q* matrix, loads them into SRAM, and writes the output of the attention computation back to HBM. Compared to baseline implementations, this substantially speeds up the training of various transformers.
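The blockwise idea can be illustrated on the CPU with plain Python. The sketch below is not the FlashAttention kernel (which is a fused GPU kernel that also tiles *Q* and manages SRAM explicitly); it only shows, for a single query vector, how attention can be computed one *K, V* block at a time with a running softmax and still match the result of the ordinary computation. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(query, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) V over all keys at once."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def tiled_attention(query, keys, values, block_size):
    """Same result, but keys/values are visited one block at a time."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = float("-inf")   # largest score seen so far
    running_sum = 0.0             # softmax denominator so far
    acc = [0.0] * len(values[0])  # unnormalized output so far
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(query, k) * scale for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [
            a * rescale + sum(w * v[d] for w, v in zip(weights, v_block))
            for d, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.2]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, K, V))
print(tiled_attention(q, K, V, block_size=2))
```

The two printed vectors agree up to floating point rounding, even though the tiled version never holds the full row of attention scores in memory at once.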

A good video by one of the authors of the FlashAttention technique, __Tri Dao__, explains it in more detail: [YouTube](https://www.youtube.com/watch?v=FThvfkXWqtE)

This notebook must be run on a GPU, as running this model on a CPU is too slow for practical use. The runtime on Google Colab can be changed under __Runtime > Change runtime type > Hardware accelerator > GPU > GPU Type > T4/V100/A100/etc.__

__Note:__ it is highly recommended to change the runtime to CPU (or, if possible, to disconnect the runtime) when not actively using the GPU, so that compute units are not wasted (when using Google Colab, especially Colab Pro). The GPU used when writing and testing the code in this notebook is the T4.

***

*__Let's start by installing the libraries this notebook requires.__*

Brief description of each library:

`transformers`
* Library that provides SOTA models and tools for various tasks such as text classification, question answering, summarization, translation, etc.

`accelerate`
* Library that simplifies running and distributed training of transformers models on any type of setup, such as multiple GPUs on one machine or multiple GPUs across several machines. Integrates with PyTorch.

`einops`
* Library that offers a unified syntax for tensor manipulation operations, such as reshaping, rearranging, repeating, reducing, etc.

`langchain`
* Framework that facilitates the creation of applications using large language models. Integrates with various systems and data sources, such as cloud storage, web scraping, APIs, databases, etc. Supports OpenAI, Anthropic, HuggingFace, etc.

`wikipedia`
* Provides a simple API for accessing and parsing data from Wikipedia. Can perform tasks such as searching for articles, getting summaries, getting images, etc.

`xformers`
* Library of customizable and efficient building blocks for transformer models. Contains various components and extensions for transformers, such as attention mechanisms, position embeddings, activation functions, etc.

***

In [None]:
!pip install -qU transformers accelerate einops langchain wikipedia xformers

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m88.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m219.1/219.1 kB[0m [31m27.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.2/42.2 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m939.3/939.3 kB[0m [31m71.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m109.1/109.1 MB[0m [31m10.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m31.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m129.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m84.1 MB/s[0m 

#### __*HuggingFace Implementation*__

This section initializes a `text-generation` pipeline with [HuggingFace transformers](https://huggingface.co/docs/transformers/index).

The pipeline consists of these components:

* A large language model of your choice. In this case we will be using `mosaicml/mpt-7b-instruct`.

* The respective tokenizer for the model.

* A stopping criteria object.

Let's download and initialize the model and move it to our CUDA-enabled GPU. The process takes around 10 minutes to complete.

***
A brief explanation of the imported libraries follows.

`torch`
* Provides tensor computation and deep learning functionality for Python. Supports CPUs, GPUs, and TPUs.

`cuda`
* Subpackage of `torch` that provides CUDA-specific functions and classes for working with tensors on NVIDIA GPUs. Includes wrappers for CUDA libraries such as cuDNN, cuBLAS, etc.

`bfloat16`
* Data type that represents a 16-bit floating point number with 1 sign bit, 8 exponent bits, and 7 mantissa bits. Similar to float16, but with a wider dynamic range and less precision. Supported by TPUs and some GPUs (for example via the cuDNN v8 API). Used for mixed precision training or inference to reduce memory usage and improve speed.
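To make the bit layout concrete, the sketch below converts a value to bfloat16 by keeping only the top 16 bits of its float32 representation. It uses only the standard library, the helper names are invented for this example, and it truncates rather than rounding to nearest, so real hardware and `torch.bfloat16` may differ in the last bit.

```python
import struct


def to_bfloat16_bits(value):
    """Return the 16-bit pattern obtained by truncating a float32 to bfloat16."""
    float32_bits = struct.unpack(">I", struct.pack(">f", value))[0]
    # Keep the sign bit, the 8 exponent bits, and the top 7 mantissa bits.
    return float32_bits >> 16


def from_bfloat16_bits(bits):
    """Expand a bfloat16 bit pattern back into a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]


for value in (1.0, 3.14159, 1e38):
    bits = to_bfloat16_bits(value)
    print(f"{value!r} -> 0x{bits:04X} -> {from_bfloat16_bits(bits)!r}")
```

The value `3.14159` comes back as `3.140625` (only about two to three significant decimal digits survive), while `1e38` is still representable because bfloat16 keeps the full 8-bit exponent of float32. float16, with its 5-bit exponent, would overflow on that value.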



In [None]:
from torch import cuda, bfloat16
import transformers

# Use the current CUDA device if one is available, otherwise fall back to the CPU.
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Create an instance of a model class for causal language modeling. The name/path of
# the pretrained model passed to .from_pretrained is used to automatically retrieve
# the relevant model.
model = transformers.AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b-instruct',
                                                           trust_remote_code=True,
                                                           torch_dtype=bfloat16,
                                                           max_seq_len=2048)

# Set the model to evaluation mode and move it to the specified device.
model.eval()
model.to(device)
print(f"Model loaded on {device}")


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.23k [00:00<?, ?B/s]

Downloading (…)configuration_mpt.py:   0%|          | 0.00/9.08k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/mosaicml/mpt-7b-instruct:
- configuration_mpt.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



Model loaded on cuda:0


The pipeline/model needs a [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) that converts the text input into its equivalent token IDs. It must be the same tokenizer that was used to train the model. MPT-7B used the `EleutherAI/gpt-neox-20b` tokenizer, so that is the one that must be used here.
***

`AutoTokenizer`
* Generic tokenizer class that is instantiated as one of the library's tokenizer classes when created with the `.from_pretrained()` class method. It returns the correct tokenizer class instance based on the `model_type` property of the config object.
***

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")


The final component is the _stopping criteria_ of the model, which specifies when the model should stop generating text. If no stopping criteria are provided, the model goes off on a tangent after answering the initial question.
***
`StoppingCriteria`
* Abstract base class for all stopping criteria that can be applied during text generation. A stopping criterion is a rule/condition that determines when to stop the generation loop. For example, it can be based on the max length of the output sequence, the max time elapsed, etc. It is used to stop generation on conditions other than the EOS token.

`StoppingCriteriaList`
* A class that inherits from list and contains instances of subclasses of `StoppingCriteria`. It allows users to pass multiple stopping criteria to the generate method of the model and applies them in a logical OR fashion. For example, a `StoppingCriteriaList` can contain a `MaxLengthCriteria` and a `MaxTimeCriteria`, which means the generation loop stops when either the max length or the max time is reached.
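The logical OR behaviour can be illustrated without `transformers` at all. The toy loop below is only a sketch of the idea: the criteria are plain functions rather than `StoppingCriteria` subclasses, and the "model" is a fixed list of tokens, both of which are assumptions made for this example.

```python
def max_length_reached(tokens, limit=8):
    """Stop once the sequence holds `limit` tokens."""
    return len(tokens) >= limit


def stop_token_generated(tokens, stop_token="<|endoftext|>"):
    """Stop once the most recent token is the stop token."""
    return bool(tokens) and tokens[-1] == stop_token


def generate(token_stream, criteria):
    """Consume tokens until ANY criterion returns True (logical OR)."""
    tokens = []
    for token in token_stream:
        tokens.append(token)
        if any(criterion(tokens) for criterion in criteria):
            break
    return tokens


# Stand-in for a model that would keep talking after its answer.
fake_model_output = ["Fusion", "joins", "nuclei", "<|endoftext|>", "and", "so", "on"]
print(generate(fake_model_output, [max_length_reached, stop_token_generated]))
```

Here the stop-token criterion fires first, so the tokens after `<|endoftext|>` are never produced. With a longer answer, the length criterion would end the loop instead.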
***

In [None]:
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# MPT-7B is trained to add "<|endoftext|>" at the end of generations. Let's get
# that token's id.
stop_token_ids = tokenizer.convert_tokens_to_ids(["<|endoftext|>"])

class StopOnTokens(StoppingCriteria):
  """Custom stopping criterion that ends the generation loop as soon as any of
     the stop token ids is generated. It is wrapped in a StoppingCriteriaList
     below so that it can be passed to the generate method of the model.
  """
  def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
    # Only the most recently generated token of the first sequence is checked.
    for stop_id in stop_token_ids:
      if input_ids[0][-1] == stop_id:
        return True
    return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])


Now that all of the components for the HuggingFace pipeline are initialized (the model of choice, the tokenizer, and the stopping criteria), the pipeline in its entirety can be initialized.

The following code snippet creates a pipeline object with `transformers.pipeline`, which allows for easy application of a model to a specific task. The task in this case is text generation. Let's go into detail about what each of the parameters means. Note that in `transformers` the sampling parameters (`temperature`, `top_p`, `top_k`) only take effect when sampling is enabled with `do_sample=True`; otherwise decoding is greedy.

`model`
* Model used for text generation.

`tokenizer`
* Tokenizer used for text generation, in tandem with the model.

`return_full_text`
* Boolean that indicates whether to return the full text or only the generated text. If true, returns the input followed by the generated text.

`task`
* Name of the task to perform. In this case, it is `text-generation`, but it can also be a task such as `sentiment-analysis`, `question-answering`, etc.

`device`
* The device to use for text generation.

`stopping_criteria`
* List of the stopping criteria objects that determine when to stop the generation loop. They can be either predefined criteria classes or custom ones (as in this case).

`temperature`
* Float that controls the randomness of the generated text by rescaling the logits before sampling. Values close to 0 make the output nearly deterministic, 1.0 leaves the model's distribution unchanged, and values above 1.0 make the output more random.

`top_p`
* Float that controls the cumulative probability mass of the tokens to sample from. 1.0 keeps every token, while lower values restrict sampling to the most likely tokens. 0.15 means that only the most likely tokens whose cumulative probability adds up to 15% are sampled from.

`top_k`
* Integer that controls the number of tokens to sample from. 0 means no restriction, so filtering relies on `top_p`. Any positive value means that only the `top_k` most likely tokens are kept.

`max_new_tokens`
* Integer that caps the number of tokens generated in the output sequence. Lower values mean shorter output sequences. 64 means that up to 64 new tokens can be generated; generation can stop earlier if a stopping criterion is met.

`repetition_penalty`
* Float that controls the penalty applied to repeated tokens in the output sequence. 1.0 means no penalty, and values above 1.0 discourage repetition more strongly. 1.1 in this case means a slight penalty is applied.
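The effect of `temperature`, `top_p`, and `repetition_penalty` on a next-token distribution can be sketched in plain Python. This is an illustration under stated assumptions rather than the `transformers` implementation: the function names are hypothetical, the logits are made up, and the penalty rule (divide positive logits, multiply negative ones) follows the commonly described CTRL-style formulation.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, previous_ids, penalty):
    """Make already-generated token ids less likely (penalty > 1.0)."""
    out = list(logits)
    for token_id in set(previous_ids):
        out[token_id] = out[token_id] / penalty if out[token_id] > 0 else out[token_id] * penalty
    return out


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized candidates


def next_token_distribution(logits, previous_ids, temperature, top_p, repetition_penalty):
    logits = apply_repetition_penalty(logits, previous_ids, repetition_penalty)
    logits = [x / temperature for x in logits]  # low temperature sharpens the peak
    return top_p_filter(softmax(logits), top_p)


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
print(next_token_distribution(toy_logits, [], temperature=1.0, top_p=0.9, repetition_penalty=1.0))
print(next_token_distribution(toy_logits, [], temperature=0.1, top_p=0.15, repetition_penalty=1.1))
```

With the notebook's settings (`temperature=0.1`, `top_p=0.15`) only the single most likely token survives the filter in this toy case, which is why such settings produce nearly deterministic text.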
***

In [None]:
# Creation of an instance of the HF transformers pipeline class.
generate_text = transformers.pipeline(model=model, tokenizer=tokenizer,
                                      return_full_text=True, #langchain expects full text.
                                      task='text-generation',
                                      device=device,
                                      stopping_criteria=stopping_criteria,
                                      temperature=0.1, 
                                      top_p=0.15,
                                      top_k=0,
                                      max_new_tokens=64,
                                      repetition_penalty=1.1)

The model 'MPTForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormFor

A sanity check to determine whether `generate_text` is working.

In [None]:
result = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(result[0]["generated_text"])

Explain to me the difference between nuclear fission and fusion.
Nuclear Fission is a process that splits heavy atoms into smaller, lighter ones by releasing energy in the form of heat or radiation. Nuclear Fusion occurs when two light atomic nuclei are combined together to create one heavier nucleus which releases more energy than what was used for its creation (fusion reaction). The most common example of


#### __*LangChain Implementation*__

This section uses LangChain to run the model.

Recall that [LangChain](https://docs.langchain.com/docs/) is a framework for developing applications powered by LLMs. The framework provides two major features:

* *Components*: Modular abstractions for the components necessary to work with language models, along with collections of implementations for all of these abstractions.

* *Use-Case Specific Chains*: Assemblies of these components arranged in particular ways to best accomplish a particular use case. Chains are intended to be a higher-level interface through which people can easily get started with a specific use case.

The docs must be read in depth for a complete understanding of this framework. This is just a simple overview.
***

The following classes are used:

`PromptTemplate`
* Class that allows for the creation and management of prompts for LLMs. A prompt is a piece of text that guides the LLM to perform a specific task. The template takes in a set of parameters from the end user and generates a prompt. It may contain
    * instructions to the language model.
    * a set of few-shot examples to help the language model generate a better response.
    * a question to the language model.

`LLMChain`
* Adds some functionality around language models. A chain is a core abstraction that represents a sequence of calls to an LLM or a different utility. For example, a chain can fetch data from an external source, pass it to an LLM for processing, then return the result. `LLMChain` is the chain used to run queries against LLMs.

`HuggingFacePipeline`
* Class that integrates with the HuggingFace library and allows for use of any LLM available on their platform. Simply specify the model name, task name, and any additional parameters.
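Conceptually, a prompt template plus a chain amounts to "fill in the template, then call the model". The toy classes below sketch that flow in plain Python. They are hypothetical stand-ins written for this explanation, not LangChain's classes, and `echo_llm` is a placeholder for a real model call.

```python
class ToyPromptTemplate:
    """Holds a template string and the variable names it expects."""

    def __init__(self, input_variables, template):
        self.input_variables = input_variables
        self.template = template

    def format(self, **kwargs):
        missing = [name for name in self.input_variables if name not in kwargs]
        if missing:
            raise KeyError(f"missing prompt variables: {missing}")
        return self.template.format(**kwargs)


class ToyChain:
    """Formats the prompt, passes it to the LLM callable, returns the text."""

    def __init__(self, llm, prompt):
        self.llm = llm
        self.prompt = prompt

    def predict(self, **kwargs):
        return self.llm(self.prompt.format(**kwargs))


def echo_llm(text):
    # Placeholder for a real model: it just reports the prompt it received.
    return f"[model output for: {text}]"


prompt = ToyPromptTemplate(input_variables=["instruction"], template="{instruction}")
chain = ToyChain(llm=echo_llm, prompt=prompt)
print(chain.predict(instruction="Explain nuclear fission."))
```

Swapping `echo_llm` for a function that calls a text-generation pipeline gives the same shape as the LangChain setup in this notebook, where `HuggingFacePipeline` plays the role of the LLM callable.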
***

In [None]:
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

# Template for an instruction with no input.
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="{instruction}"
)
llm = HuggingFacePipeline(pipeline=generate_text)
llm_chain = LLMChain(llm=llm, prompt=prompt)

In [None]:
print(llm_chain.predict(instruction="Explain to me the difference between nuclear fission and fusion.").lstrip())

Nuclear Fission is a process that splits heavy atoms into smaller, lighter ones by releasing energy in the form of heat or radiation. Nuclear Fusion occurs when two light atomic nuclei are combined together to create one heavier nucleus which releases more energy than what was used for its creation (fusion reaction). The most common example of
