## Install

In [1]:
! pip install -q -U bitsandbytes
! pip install -q -U transformers
! pip install -q -U peft
! pip install -q -U accelerate
! pip install -q -U datasets
! pip install -q -U trl


## Import Libraries

In [2]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

## Get Model

Let's explain the following code:

1. **Model Name Specification**:
   ```python
   model_name = "mistralai/Mistral-7B-v0.1"
   ```
   This line sets the variable `model_name` to `"mistralai/Mistral-7B-v0.1"`, the identifier of a pre-trained model hosted on the Hugging Face Model Hub. `Mistral-7B-v0.1` is a 7-billion-parameter base model released by Mistral AI. The model card is [here](https://huggingface.co/mistralai/Mistral-7B-v0.1).

2. **Tokenizer Initialization**:
   ```python
   tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
   ```
   Here, the tokenizer for the specified model is loaded using the `AutoTokenizer.from_pretrained` method. `AutoTokenizer` is a class that automatically detects the correct tokenizer class to use based on the name of the pre-trained model. The `use_fast=True` argument requests the use of the fast tokenizer implementation if available, which is optimized for speed and memory efficiency.

3. **Setting Padding Token**:
   ```python
   tokenizer.pad_token = tokenizer.unk_token
   ```
   The padding token is set to the tokenizer's unknown (`unk`) token. Padding makes all sequences in a batch the same length, and the tokenizer needs a designated token for it. The unknown token normally represents text that cannot be mapped to the tokenizer's vocabulary; reusing it as the padding token gives the tokenizer a pad token without adding a new entry to the vocabulary.

4. **Setting Padding Token ID**:
   ```python
   tokenizer.pad_token_id = tokenizer.unk_token_id
   ```
   This line sets the padding token's ID to the ID of the unknown token. Assigning `pad_token` in the previous step already updates `pad_token_id`, so this line is redundant, but it makes the intent explicit.

5. **Setting Padding Side**:
   ```python
   tokenizer.padding_side = 'left'
   ```
   Finally, this line sets the side on which padding should be applied to 'left'. This means that if padding is required to make sequences the same length, the padding tokens will be added to the beginning (left side) of the sequences. The choice between left and right padding can be important for certain models and affects how the model processes the sequence.

Overall, this code snippet is configuring a tokenizer for use with a specific pre-trained model, specifically adjusting how the tokenizer handles padding, which is an important aspect when preparing data for input into most NLP models.
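To make the effect of `padding_side` concrete, here is a minimal pure-Python sketch of how a batch gets padded. The `pad_batch` helper and the token IDs are hypothetical and only for illustration; they are not part of the `transformers` API, and the real tokenizer does this work (including the attention mask) for you.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token IDs to a common length and build attention masks.

    Hypothetical helper for illustration only; a Hugging Face tokenizer
    handles this internally when padding is requested.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if side == "left":
            padded.append([pad_id] * n_pad + seq)
            masks.append([0] * n_pad + [1] * len(seq))
        else:
            padded.append(seq + [pad_id] * n_pad)
            masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


# Made-up token IDs; 0 plays the role of the pad (= unk) token ID.
batch = [[5, 6, 7], [8, 9]]
left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, side="right")

print(left_ids)    # [[5, 6, 7], [0, 8, 9]]
print(left_mask)   # [[1, 1, 1], [0, 1, 1]]
print(right_ids)   # [[5, 6, 7], [8, 9, 0]]
```

With left padding, the last position of every row holds a real token, which is convenient for batched generation because a decoder-only model continues from the final position.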

In [3]:
# Model Name
model_name = "mistralai/Mistral-7B-v0.1"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id =  tokenizer.unk_token_id
tokenizer.padding_side = 'left'

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Load Data and Configuration

The following code is performing several operations related to loading a dataset, setting up a quantized neural network model for natural language processing, and configuring it for training. Let's break down the code step by step:

1. **Loading the Dataset**:
   ```python
   dataset = load_dataset("timdettmers/openassistant-guanaco")
   ```
   This line loads a dataset from the Hugging Face Hub using the `datasets` library. `"timdettmers/openassistant-guanaco"` is a subset of the OpenAssistant conversations dataset prepared by Tim Dettmers. The dataset card is [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco).

2. **Quantization Configuration**:
   ```python
   compute_dtype = getattr(torch, "float16")
   bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
   )
   ```
   These lines configure 4-bit quantization through the `bitsandbytes` library. The configuration specifies the following:
   - `load_in_4bit`: Loads the model weights with 4-bit quantization.
   - `bnb_4bit_quant_type`: The 4-bit data type to use, here `"nf4"` (4-bit NormalFloat, the data type introduced with QLoRA).
   - `bnb_4bit_compute_dtype`: The data type used for computation, set to 16-bit floating point (`float16`). The weights are stored in 4 bits and dequantized to this type for the matrix multiplications.
   - `bnb_4bit_use_double_quant`: Enables double quantization, which also quantizes the quantization constants and saves a little more memory.

3. **Loading and Quantizing the Model**:
   ```python
   model = AutoModelForCausalLM.from_pretrained(
       model_name, quantization_config=bnb_config, device_map={"": 0}
   )
   ```
   This line loads a pre-trained causal language model (suitable for tasks like text generation) using the model identifier `model_name`. It applies the quantization configuration `bnb_config` to the model. The `device_map={"": 0}` argument specifies that the model should be loaded onto the device with ID 0 (typically the first GPU).

4. **Casting Some Model Modules to fp32**:
   ```python
   model = prepare_model_for_kbit_training(model)
   ```
   `prepare_model_for_kbit_training` comes from the `peft` library (it was imported above). It prepares a quantized model for training: it freezes the base model weights, casts the remaining non-quantized parameters (such as the layer norms) to 32-bit floating point (`fp32`) for numerical stability, and enables gradient checkpointing by default.

5. **Configuring the Pad Token and Disabling Cache**:
   ```python
   model.config.pad_token_id = tokenizer.pad_token_id
   model.config.use_cache = False
   ```
   - The first line sets the pad token ID in the model configuration to match the pad token ID of the tokenizer. This is important for the model to handle padding correctly during training and inference.
   - The second line disables the key/value cache in the model configuration. The cache only speeds up generation, and it is not compatible with gradient checkpointing, which is enabled during training to reduce memory usage. The cache is switched back on before inference.

6. **Timing the Execution**:
   The `%%time` magic command at the beginning is a Jupyter/IPython magic command that measures the execution time of the cell's contents.

In summary, this code is for setting up a quantized neural network model (specifically, a causal language model) for efficient training, potentially on large text datasets. The use of quantization and specific configuration settings indicates a focus on optimizing for memory and computational efficiency, likely due to the large size of the model or dataset.
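As a rough illustration of why 4-bit loading matters, the sketch below estimates the memory taken by the weights alone. The numbers rest on assumptions: about 7.24 billion parameters, every weight quantized (in practice some modules stay in higher precision), a quantization block size of 64 with one `fp32` constant per block, and, for double quantization, 8-bit constants with one `fp32` constant per 256 blocks. Treat the result as an order-of-magnitude estimate, not a measurement.

```python
# Back-of-the-envelope weight memory for a ~7B model (assumed figures).
n_params = 7.24e9  # assumption: approximate parameter count of Mistral-7B


def weight_memory_gb(num_params, bits_per_param):
    """Memory in GB (10^9 bytes) for num_params stored at bits_per_param."""
    return num_params * bits_per_param / 8 / 1e9


# Overhead of the quantization constants, in bits per parameter.
# Assumption: block size 64, one fp32 constant per block.
overhead_single = 32 / 64
# Assumption for double quantization: 8-bit constants per block,
# plus one fp32 constant for every 256 blocks.
overhead_double = 8 / 64 + 32 / (64 * 256)

fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4 + overhead_single)
nf4_dq_gb = weight_memory_gb(n_params, 4 + overhead_double)

print(f"16-bit weights:          {fp16_gb:.2f} GB")
print(f"4-bit weights:           {nf4_gb:.2f} GB")
print(f"4-bit + double quant:    {nf4_dq_gb:.2f} GB")
```

Activations, LoRA weights, gradients, and optimizer states come on top of this, so actual GPU usage during training is higher.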

The full cell, which loads the dataset and the quantized model:

In [4]:
%%time

# Load the dataset
dataset = load_dataset("timdettmers/openassistant-guanaco")

# Quantization configuration
compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

# Load the model and quantize it on the fly
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}
)

# Cast some modules of the model to fp32
model = prepare_model_for_kbit_training(model)

# Configure the pad token in the model
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False # Gradient checkpointing is used by default but not compatible with caching


CPU times: user 38.9 s, sys: 33.1 s, total: 1min 12s
Wall time: 2min 5s


### What does the data look like?

The dataset loaded above has two splits, `train` and `test`.

In [26]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 9846
    })
    test: Dataset({
        features: ['text'],
        num_rows: 518
    })
})

Let's take a look at the `train` split. It has a single feature, `text`, and `num_rows` shows 9,846 samples.

In [28]:
dataset['train']

Dataset({
    features: ['text'],
    num_rows: 9846
})

Let's confirm the number of training samples.

In [30]:
len(dataset['train']['text'])

9846

We can take a look at one sample, i.e. use `[0]` to grab the first entry.

In [31]:
dataset['train']['text'][0]

'### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, lead

Each sample, as shown above, is a string in the format `### Human: ...` followed by `### Assistant: ...`.

Hence, we expect the data (or any other dataset used with this notebook) to look like:

```md
"
    ### Human: ...
    ### Assistant: ...
"
```
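If you want to check that your own data follows this template, a small parser can help. The sketch below splits a sample into `(role, content)` pairs with a regular expression. The `split_turns` helper and the sample string are made up for illustration, and the code assumes that a sample may contain more than one exchange.

```python
import re

# Matches the role markers used in the dataset's text field.
TURN_PATTERN = re.compile(r"### (Human|Assistant): ")


def split_turns(text):
    """Split a '### Human: ...### Assistant: ...' string into (role, content) pairs."""
    parts = TURN_PATTERN.split(text)
    # re.split with a capturing group returns:
    # [prefix, role1, content1, role2, content2, ...]
    roles, contents = parts[1::2], parts[2::2]
    return [(role, content.strip()) for role, content in zip(roles, contents)]


# Made-up sample in the same format as the dataset.
sample = (
    "### Human: What is 2 + 2?### Assistant: 4."
    "### Human: Thanks!### Assistant: You're welcome."
)
for role, content in split_turns(sample):
    print(f"{role}: {content}")
```

A sample that yields an empty list, or whose roles do not alternate, does not match the expected format.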

## Define LoRA Configuration

Next, we define the configuration of LoRA.

The code below sets up LoRA (Low-Rank Adaptation) for the model using the `LoraConfig` class from the `peft` library. Let's break down the components of this configuration:

1. **LoRA (Low-Rank Adaptation)**: LoRA is a technique used to adapt large pre-trained models like those in the BERT or GPT families. Instead of fine-tuning all parameters of the model, which can be computationally expensive and time-consuming, LoRA fine-tunes only a small set of additional parameters while keeping the original model parameters frozen. This approach can lead to significant reductions in the computational resources required for training while still achieving high performance.

2. **LoraConfig**: This is a configuration class for setting up LoRA. The parameters within this class define how LoRA is applied to the model.

3. **Parameters of LoraConfig**:
   - `lora_alpha=16`: The scaling factor (alpha) for the LoRA update. The low-rank update is multiplied by `lora_alpha / r` before it is added to the output of the frozen weights, so alpha controls how strongly the adapter influences the model. It does not change the number of trainable parameters. With `lora_alpha=16` and `r=16`, the scale is 1.
   - `lora_dropout=0.05`: The dropout rate used in the LoRA layers. Dropout is a regularization technique that randomly sets a fraction of the inputs to 0 during training, which helps prevent overfitting.
   - `r=16`: The rank of the low-rank matrices in LoRA. The rank determines the number of trainable parameters added to the model and is the key factor in the balance between expressiveness and computational cost.
   - `bias="none"`: No bias parameters are trained; only the LoRA matrices are updated.
   - `task_type="CAUSAL_LM"`: The type of task for which the model is being adapted. `"CAUSAL_LM"` is causal language modeling, i.e. next-token prediction, the objective used by GPT-style text generation models.
   - `target_modules`: Lists the specific modules (or layers) of the neural network to which LoRA will be applied. In this case, `['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]` are the target modules, which are parts of the attention mechanism and feed-forward network in transformer models.

In summary, this code is configuring LoRA for a neural network model, specifying the parameters for how LoRA will modify certain layers of the model to adapt it for a specific task (in this case, causal language modeling). This approach is particularly useful for fine-tuning large pre-trained models with a relatively small increase in the number of parameters.
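To see how `r` and `target_modules` translate into trainable parameters, the sketch below counts them by hand. Each adapted linear layer with `d_in` inputs and `d_out` outputs gets two matrices, `A` (`r × d_in`) and `B` (`d_out × r`). The layer shapes are taken from the Mistral-7B configuration printed in the training log further down; deriving the key/value projection width as `num_key_value_heads × head_dim` is my reading of that configuration.

```python
# Shapes from the Mistral-7B config shown in the training log.
hidden_size = 4096
intermediate_size = 14336
num_hidden_layers = 32
num_attention_heads = 32
num_key_value_heads = 8

head_dim = hidden_size // num_attention_heads   # 128
kv_dim = num_key_value_heads * head_dim         # 1024 (grouped-query attention)

r = 16
lora_alpha = 16

# (in_features, out_features) of every targeted linear layer in one block.
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}


def lora_param_count(d_in, d_out, rank):
    """Parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_param_count(d_in, d_out, r) for d_in, d_out in target_shapes.values())
total = per_layer * num_hidden_layers
scaling = lora_alpha / r

print(f"Trainable LoRA parameters: {total:,}")  # 41,943,040
print(f"LoRA scaling factor: {scaling}")        # 1.0
```

The total matches the `Number of trainable parameters` reported by the trainer, which is well under 1% of the base model's roughly 7 billion weights.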

In [5]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"],
)

For training, I chose the following hyperparameters:

The code below configures the training run with the `TrainingArguments` class from the Hugging Face Transformers library, which defines the parameters and settings that control the training process. Let's go through each argument to understand its purpose:

1. **output_dir**:
   ```python
   output_dir="./eagle0504-mistral-7B-v0.4"
   ```
   This specifies the directory where training outputs (like model checkpoints) will be saved.

2. **optim**:
   ```python
   optim="paged_adamw_8bit"
   ```
   Sets the optimizer to be used for training. `"paged_adamw_8bit"` is a `bitsandbytes` variant of AdamW that keeps the optimizer states in 8-bit precision and pages them between GPU and CPU memory when the GPU runs short, which reduces memory usage.

3. **per_device_train_batch_size**:
   ```python
   per_device_train_batch_size=4
   ```
   This is the batch size for training per device (e.g., per GPU). A smaller batch size can be beneficial for memory management but might slow down training.

4. **gradient_accumulation_steps**:
   ```python
   gradient_accumulation_steps=4
   ```
   This sets the number of batches whose gradients are accumulated before each optimizer step (weight update). It simulates a larger batch size than fits in memory at once: here the effective batch size is 4 × 4 = 16.

5. **per_device_eval_batch_size**:
   ```python
   per_device_eval_batch_size=4
   ```
   This is the batch size for evaluation per device.

6. **log_level**:
   ```python
   log_level="debug"
   ```
   Sets the verbosity level of logs during training. "debug" will provide detailed logging information, which is useful for development and debugging.

7. **save_steps**:
   ```python
   save_steps=20
   ```
   Determines after how many training steps the model checkpoint should be saved.

8. **logging_steps**:
   ```python
   logging_steps=10
   ```
   Specifies how often to log training information. In this case, logs will be generated every 10 training steps.

9. **learning_rate**:
   ```python
   learning_rate=4e-4
   ```
   Sets the initial learning rate for training.

10. **max_steps**:
    ```python
    max_steps=100
    ```
    This limits the total number of training steps. Training will stop when this number is reached.

11. **warmup_steps**:
    ```python
    warmup_steps=100
    ```
    The number of steps for the learning rate warmup phase. During this phase, the learning rate gradually increases up to the specified `learning_rate`.

12. **lr_scheduler_type**:
    ```python
    lr_scheduler_type="linear"
    ```
    Defines the type of learning rate scheduler. With `"linear"`, the learning rate decreases linearly from `learning_rate` to zero once the warmup phase is over.
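The interplay between `warmup_steps`, `max_steps`, and the linear scheduler is easier to see with numbers. The function below is a pure-Python sketch of a linear-warmup, linear-decay schedule; it is my own re-implementation for illustration, not the library's code, and the exact values used by the trainer may differ in off-by-one details. The 10-step warmup in the second loop is a hypothetical alternative for comparison.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0.

    Illustrative re-implementation (an assumption about the schedule's shape).
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Settings from this notebook: warmup_steps == max_steps == 100.
for step in (0, 10, 50, 99):
    lr = linear_schedule_lr(step, base_lr=4e-4, warmup_steps=100, total_steps=100)
    print(f"step {step:3d}: lr = {lr:.2e}")

# Hypothetical alternative: warm up for 10 steps, then decay.
for step in (10, 55, 100):
    lr = linear_schedule_lr(step, base_lr=4e-4, warmup_steps=10, total_steps=100)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

Because `warmup_steps` equals `max_steps` in this notebook, the learning rate is still ramping up when training stops, and the decay phase is never reached. A shorter warmup would let the run spend most of its steps near the target learning rate.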

The commented-out arguments (`evaluation_strategy`, `do_eval`, `eval_steps`) show that this run trains without regular evaluation steps. `num_train_epochs=1` is also set in the cell below, but `max_steps` takes precedence over it, as the trainer log confirms.

In summary, these training arguments are set up for efficient and debuggable training, with specific choices made for optimizer type, batch sizes, logging, learning rate, and scheduler. The use of gradient accumulation and an 8-bit optimizer suggests an effort to manage memory effectively, which is particularly important when working with large models or limited computational resources.

In [10]:
training_arguments = TrainingArguments(
        output_dir="./eagle0504-mistral-7B-v0.4",
        #evaluation_strategy="steps",
        #do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=4,
        log_level="debug",
        save_steps=20,
        logging_steps=10,
        learning_rate=4e-4,
        #eval_steps=200,
        num_train_epochs=1,
        max_steps=100,
        warmup_steps=100,
        lr_scheduler_type="linear",
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


The code snippet below sets up and runs training with the `SFTTrainer` class, the supervised fine-tuning trainer from Hugging Face's TRL library (see [here](https://huggingface.co/docs/trl/sft_trainer)). Let's break down what each part of the code is doing:

1. **Timing the Cell Execution**:
   ```python
   %%time
   ```
   This is a magic command in Jupyter Notebooks. It times the execution duration of the cell's content.

2. **Initializing the Trainer**:
   ```python
   trainer = SFTTrainer(
       model=model,
       train_dataset=dataset['train'],
       #eval_dataset=dataset['test'],
       peft_config=peft_config,
       dataset_text_field="text",
       max_seq_length=512,
       tokenizer=tokenizer,
       args=training_arguments,
   )
   ```
   This code creates an instance of `SFTTrainer`. SFT stands for supervised fine-tuning: the model is trained on example texts with the usual next-token prediction loss.

   - `model=model`: Specifies the model that will be trained.
   - `train_dataset=dataset['train']`: Defines the training dataset. It's taking the 'train' subset of a larger dataset.
   - `#eval_dataset=dataset['test']`: This line is commented out, but if included, it would specify the evaluation dataset as the 'test' subset.
   - `peft_config=peft_config`: The LoRA configuration defined earlier. With it, the trainer wraps the model in a PEFT (Parameter-Efficient Fine-Tuning) adapter so that only the LoRA weights are trained.
   - `dataset_text_field="text"`: Indicates which field in the dataset contains the text data for training.
   - `max_seq_length=512`: Sets the maximum sequence length for the text inputs. Longer sequences will be truncated, and shorter ones will be padded.
   - `tokenizer=tokenizer`: The tokenizer to be used for converting text into tokens or numerical representations.
   - `args=training_arguments`: The `TrainingArguments` defined above (batch size, learning rate, and so on).

3. **Training the Model**:
   ```python
   trainer.train()
   ```
   This line starts the training process using the `trainer` instance. The model will train on the provided dataset according to the specified configurations and training arguments.

In summary, this script is focused on training an NLP model using a specific trainer class (`SFTTrainer`). The training involves a dataset split into training (and optionally evaluation) parts, a tokenizer for text preprocessing, and a set of training configurations and arguments. The use of `%%time` indicates an interest in measuring the duration of the training process.

You will need a Hugging Face access token with write permission to push the model to the Hub later. Please see [here](https://huggingface.co/docs/transformers.js/guides/private).

Initialize the trainer and start training using the following code:

In [11]:
%%time

# Initialize the trainer
trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        #eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

PyTorch: setting up devices
max_steps is given, it will override any value given in num_train_epochs


CPU times: user 3.83 s, sys: 0 ns, total: 3.83 s
Wall time: 998 ms


In [12]:
%%time

# Training
trainer.train()

Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 41,943,040


Step,Training Loss
10,1.3863
20,1.2412
30,1.1478
40,1.1937
50,1.1081
60,1.1176
70,1.1187
80,1.2105
90,1.1171
100,1.1633


Saving model checkpoint to ./eagle0504-mistral-7B-v0.4/tmp-checkpoint-20
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/config.json
Model config MistralConfig {
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.37.2",
  "use_cache": true,
  "vocab_size": 32000
}

tokenizer config file saved in ./eagle0504-mistral-7B-v0.4/tmp-checkpoint-20/tokenizer_config.json
Special tokens file saved 

CPU times: user 1h 20min 25s, sys: 1h 32min 44s, total: 2h 53min 9s
Wall time: 2h 52min 23s


TrainOutput(global_step=100, training_loss=1.1804366302490235, metrics={'train_runtime': 10343.1262, 'train_samples_per_second': 0.155, 'train_steps_per_second': 0.01, 'total_flos': 3.353031630589133e+16, 'train_loss': 1.1804366302490235, 'epoch': 0.16})
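The figures in the training log and in `TrainOutput` can be reproduced from the training arguments with a few lines of arithmetic. The sketch below assumes a single GPU (as implied by `device_map={"": 0}`); all other numbers are copied from the log above.

```python
# Values copied from the training arguments and the log above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_devices = 1            # assumption: a single GPU
max_steps = 100
num_train_examples = 9846
train_runtime_s = 10343.1262

# Samples that contribute to one optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Samples processed over the whole run.
samples_seen = effective_batch * max_steps
# Fraction of the training split that was actually used.
epoch_fraction = samples_seen / num_train_examples
samples_per_second = samples_seen / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(f"effective batch size: {effective_batch}")       # 16
print(f"samples seen:         {samples_seen}")          # 1600
print(f"epoch fraction:       {epoch_fraction:.2f}")    # 0.16
print(f"samples per second:   {samples_per_second:.3f}")  # 0.155
print(f"steps per second:     {steps_per_second:.2f}")  # 0.01
```

In other words, this run saw only about 16% of the training split, so the reported loss reflects a short demonstration run rather than a full epoch of fine-tuning.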

The following code pushes the trained adapter and the tokenizer files to the Hugging Face Hub. The repository name is taken from the `output_dir` set in the training arguments:

```py
TrainingArguments(
    output_dir="./eagle0504-mistral-7B-v0.4",
)
```

In [13]:
trainer.push_to_hub()

Saving model checkpoint to ./eagle0504-mistral-7B-v0.4

tokenizer config file saved in ./eagle0504-mistral-7B-v0.4/tokenizer_config.json
Special tokens file saved in ./eagle0504-mistral-7B-v0.4/speci

Upload 4 LFS files:   0%|          | 0/4 [00:00<?, ?it/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

events.out.tfevents.1706653376.c349e9d5281b.2569.0:   0%|          | 0.00/6.68k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.73k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/eagle0504/eagle0504-mistral-7B-v0.4/commit/f18398f6cbd5ee92877ac3e71615b4f51cc8f10f', commit_message='End of training', commit_description='', oid='f18398f6cbd5ee92877ac3e71615b4f51cc8f10f', pr_url=None, pr_revision=None, pr_num=None)

## Inference

Then, once training is done, test the trained adapter by running:

In [14]:
from transformers import GenerationConfig
model.config.use_cache = True
model = PeftModel.from_pretrained(model, "./eagle0504-mistral-7B-v0.4/checkpoint-100/")

The following code defines a function named `generate` that produces a text response from the fine-tuned model for a given prompt. Let's dissect this function step by step:

1. **Function Definition**:
   ```python
   def generate(instruction):
   ```
   This line defines a Python function named `generate`, which takes one argument, `instruction`. The `instruction` is expected to be a string that represents the input or prompt for the model.

2. **Creating the Prompt**:
   ```python
   prompt = "### Human: "+instruction+"### Assistant: "
   ```
   The function builds the prompt by wrapping the `instruction` in the same template as the training data: `"### Human: "` before it and `"### Assistant: "` after it. The model is then expected to continue the text with the assistant's answer.

3. **Tokenizing the Input**:
   ```python
   inputs = tokenizer(prompt, return_tensors="pt")
   input_ids = inputs["input_ids"].cuda()
   ```
   The prompt is tokenized using a pre-defined `tokenizer`. The `return_tensors="pt"` argument indicates that the tokenizer should return PyTorch tensors. The tokenized input is then moved to the GPU with `.cuda()`, assuming that CUDA is available and a GPU is being used for inference.

4. **Generating the Response**:
   ```python
   generation_output = model.generate(
       input_ids=input_ids,
       generation_config=GenerationConfig(
           pad_token_id=tokenizer.pad_token_id,
           temperature=1.0,
           top_p=1.0,
           top_k=50,
           num_beams=1
       ),
       return_dict_in_generate=True,
       output_scores=True,
       max_new_tokens=256
   )
   ```
   This section uses the `model.generate` method to produce a response. The method is passed several parameters:
   - `input_ids`: The tokenized input.
   - `generation_config`: A configuration for the generation process. It sets `temperature`, `top_p`, `top_k`, and `num_beams`. Note that `temperature`, `top_p`, and `top_k` only take effect when sampling is enabled with `do_sample=True`; since that is not set here and `num_beams=1`, the model uses greedy decoding.
   - `return_dict_in_generate`: Indicates that the output should be a dictionary containing additional information.
   - `output_scores`: If true, the scores of the generated tokens are returned.
   - `max_new_tokens`: Limits the number of new tokens generated.

5. **Decoding and Printing the Output**:
   ```python
   for seq in generation_output.sequences:
       output = tokenizer.decode(seq)
       print(output.split("### Assistant: ")[1].strip())
   ```
   For each generated sequence in the output, the code:
   - Decodes the sequence back into text using `tokenizer.decode`.
   - Splits the decoded text at `"### Assistant: "` and takes the second part (which should be the model's response).
   - Strips any leading or trailing whitespace and prints the response.

In summary, the `generate` function takes an instruction or input prompt, formats it for a conversational AI context, tokenizes and feeds it to a language model, and then processes the model's output to extract and print the generated response. This function is typical in chatbot applications where a model generates responses based on user input.

In [15]:
def generate(instruction):
    prompt = "### Human: "+instruction+"### Assistant: "
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    generation_output = model.generate(
            input_ids=input_ids,
            generation_config=GenerationConfig(pad_token_id=tokenizer.pad_token_id, temperature=1.0, top_p=1.0, top_k=50, num_beams=1),
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256
    )
    for seq in generation_output.sequences:
        output = tokenizer.decode(seq)
        print(output.split("### Assistant: ")[1].strip())

We can now ask a question: `generate` formats it as a prompt, runs the model, and prints the answer. Note in the output below that the model does not stop after its answer and begins a new `### Human:` turn.

In [16]:
%%time

query = "What is the capital of France?"
generate(query)

The capital of France is Paris.### Human: What is the capital of France?
CPU times: user 34.6 s, sys: 260 ms, total: 34.8 s
Wall time: 35.3 s
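The output above contains the answer followed by the start of a new `### Human:` turn, because `generate` only splits on the assistant marker. One way to clean this up is to cut the decoded text at the next human marker. The `extract_reply` helper below is a hypothetical post-processing sketch, and the `decoded` string is a made-up example modelled on the output above, not actual model output.

```python
HUMAN_TAG = "### Human:"
ASSISTANT_TAG = "### Assistant:"


def extract_reply(decoded):
    """Return only the first assistant turn from a decoded generation."""
    # Keep everything after the first assistant marker ...
    _, _, after_assistant = decoded.partition(ASSISTANT_TAG)
    # ... and drop anything from the next human marker onward.
    reply, _, _ = after_assistant.partition(HUMAN_TAG)
    return reply.strip()


# Made-up decoded text, shaped like the output above.
decoded = (
    "### Human: What is the capital of France?"
    "### Assistant: The capital of France is Paris."
    "### Human: What is the capital of France?"
)
print(extract_reply(decoded))  # The capital of France is Paris.
```

This only trims the text after the fact; the model still spends time generating the extra tokens up to `max_new_tokens`.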


## Create Pipeline

In [17]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("eagle0504/eagle0504-mistral-7B-v0.4")

adapter_config.json:   0%|          | 0.00/678 [00:00<?, ?B/s]

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/config.json
Model config MistralConfig {
  "_name_or_path": "mistralai/Mistral-7B-v0.1",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.37.2",
  "use_cache": true,
  "vocab_size": 32000
}

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing MistralForCausalLM.

All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-v0.1.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-v0.1/snapshots/26bca36bde8333b5d7f72e9ed20ccda6a618af24/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}



adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

In [18]:
from transformers import pipeline

nlp = pipeline("text-generation", model=model, tokenizer=tokenizer)

Detected PIL version 9.4.0


In [24]:
%%time

# inference
nlp("What is the capital of Italy?")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


CPU times: user 54.4 s, sys: 26.8 ms, total: 54.4 s
Wall time: 13.6 s


[{'generated_text': 'What is the capital of Italy?\n\nThe Italian government has announced that it will be closing all'}]
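The pipeline call above passes the raw question, without the `### Human:` / `### Assistant:` template the adapter was trained on, and the generated text does not answer it. A plausible (but unverified) explanation is the missing template, so it is worth trying the same prompt format as in `generate`. The `format_prompt` helper below is a hypothetical convenience function, and the commented-out pipeline call was not run for this notebook.

```python
def format_prompt(instruction):
    """Wrap a question in the template used during fine-tuning."""
    return "### Human: " + instruction + "### Assistant: "


prompt = format_prompt("What is the capital of Italy?")
print(prompt)

# Hypothetical usage with the pipeline defined above (not run here):
# result = nlp(prompt, max_new_tokens=64)
# print(result[0]["generated_text"])
```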

## Deploy on HuggingFace

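The following code pushes the model configuration (`config.json`) to the same repository on the Hugging Face Hub. As the log shows, only `config.json` and `README.md` are uploaded; the adapter weights were already pushed by `trainer.push_to_hub()`.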

In [20]:
model.config.push_to_hub("eagle0504-mistral-7B-v0.4")

README.md:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

Configuration saved in eagle0504-mistral-7B-v0.4/config.json
Uploading the following files to eagle0504/eagle0504-mistral-7B-v0.4: config.json,README.md


CommitInfo(commit_url='https://huggingface.co/eagle0504/eagle0504-mistral-7B-v0.4/commit/a9f4e4f5df2694dbfeec762257996075eba1a11a', commit_message='Upload config', commit_description='', oid='a9f4e4f5df2694dbfeec762257996075eba1a11a', pr_url=None, pr_revision=None, pr_num=None)