# Fine-tune the typeof/mistral-3.3B model from Hugging Face: End-to-End


### A Translation Use Case
In this notebook, we build an end-to-end workflow for fine-tuning a model from Hugging Face that is not one of the foundation models. We choose Mistral 3.3B (https://huggingface.co/typeof/mistral-3.3B), customize it through fine-tuning, and then test the fine-tuned model by invoking it.

> *This notebook should work well with the **`Data Science 3.0`**, **`Python 3`**, and **`ml.c5.2xlarge`** kernel in SageMaker Studio*

## Prerequisites

 - Make sure you have executed the `Preparation_dataset.ipynb` notebook.


## Setup
Install and import all the needed libraries and dependencies to complete this notebook.

Please ignore error messages related to pip's dependency resolver.

In [2]:
!pip install s3fs
!pip install torch
!pip install transformers
!pip install datasets
!pip install accelerate -U


In [3]:
# restart kernel for packages to take effect
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [4]:
## Fetching variables from the `00_setup.ipynb` notebook.
#%store -r role_arn
#%store -r s3_train_uri
#%store -r s3_validation_uri
#%store -r s3_test_uri
#%store -r bucket_name
#import pprint
#pprint.pp(role_arn)
#pprint.pp(s3_train_uri)
#pprint.pp(s3_validation_uri)
#pprint.pp(s3_test_uri)
#pprint.pp(bucket_name)

In [5]:
import os
import pandas as pd
import boto3
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Seq2SeqTrainer, Seq2SeqTrainingArguments

## Create the Fine-Tuning Job
<div class="alert alert-block alert-info">
<b>Note:</b> The fine-tuning job takes around 60 minutes to complete with 5K records.</div>

Meta Llama 2 customization hyperparameters:
- `epochs`: The number of passes through the entire training dataset. Accepts any integer from 1 to 10; the default is 2.
- `batchSize`: The number of samples processed before the model parameters are updated. Accepts any integer from 1 to 64; the default is 1.
- `learningRate`: The rate at which model parameters are updated after each batch. Accepts a float between 0.0 and 1.0; the default is 1.00E-5.
- `learningRateWarmupSteps`: The number of iterations over which the learning rate is gradually increased to the specified rate. Accepts any integer from 0 to 250; the default is 5.

For guidance on setting hyperparameters, refer to the guidelines provided [here](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-guidelines.html).
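
As a minimal sketch, the ranges listed above can be encoded as a small validation helper that is run before a job is configured. The `validate_hyperparameters` function and the `RANGES` table are hypothetical names introduced here for illustration; only the ranges and defaults come from the list above.

```python
# Hypothetical helper: (minimum, maximum, default) values are copied from the list above.
RANGES = {
    "epochs": (1, 10, 2),
    "batchSize": (1, 64, 1),
    "learningRate": (0.0, 1.0, 1.00e-5),
    "learningRateWarmupSteps": (0, 250, 5),
}


def validate_hyperparameters(overrides=None):
    """Return a complete hyperparameter dict, raising ValueError on bad input."""
    overrides = overrides or {}
    unknown = set(overrides) - set(RANGES)
    if unknown:
        raise ValueError(f"Unknown hyperparameters: {sorted(unknown)}")

    resolved = {}
    for name, (low, high, default) in RANGES.items():
        # Fall back to the documented default when no override is given.
        value = overrides.get(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        resolved[name] = value
    return resolved


params = validate_hyperparameters({"epochs": 1, "batchSize": 4})
print(params)
```

Unspecified values fall back to the documented defaults, so `validate_hyperparameters()` with no arguments returns the default configuration.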

In [6]:
import os
import pandas as pd
from datasets import Dataset

# Define the paths to your local dataset files
dataset_folder = "fine-tuning-datasets"
train_file_path = os.path.join(dataset_folder, "train.csv")
validation_file_path = os.path.join(dataset_folder, "validation.csv")
test_file_path = os.path.join(dataset_folder, "test.csv")

# Load the datasets using pandas
train_df = pd.read_csv(train_file_path)
validation_df = pd.read_csv(validation_file_path)
test_df = pd.read_csv(test_file_path)

# Convert the pandas DataFrames into Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
validation_dataset = Dataset.from_pandas(validation_df)
test_dataset = Dataset.from_pandas(test_df)

# Optionally, you can inspect the datasets to ensure they loaded correctly
print(f"Training set size: {len(train_dataset)}")
print(f"Validation set size: {len(validation_dataset)}")
print(f"Test set size: {len(test_dataset)}")

Training set size: 293216
Validation set size: 32580
Test set size: 81450
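
Before loading large CSV files, it can help to confirm that they expose the columns the later cells rely on. The sketch below reads only the header row. The `missing_columns` helper is a hypothetical name, and the column names `translation_input` and `translation_target` are taken from the preprocessing code in the training cell.

```python
import csv
import io

REQUIRED_COLUMNS = ("translation_input", "translation_target")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# Demonstration on an in-memory file; for real data, pass open(train_file_path, newline="").
sample = io.StringIO("translation_input,translation_target\nHello,Bonjour\n")
print(missing_columns(sample))  # → []
```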


In [7]:
from transformers import AutoTokenizer, AutoModelForCausalLM, Seq2SeqTrainer, Seq2SeqTrainingArguments

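# Load the Mistral 3.3B model and tokenizer
model_name = "typeof/mistral-3.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# padding="max_length" below needs a pad token; fall back to EOS if none is defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token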

def preprocess_function(examples):
    inputs = tokenizer(examples['translation_input'], max_length=128, truncation=True, padding="max_length")
    targets = tokenizer(examples['translation_target'], max_length=128, truncation=True, padding="max_length")

    inputs['labels'] = targets['input_ids']
    return inputs

# Sample a small subset of the dataset for quick training
train_dataset = train_dataset.shuffle(seed=42).select(range(10000))  # 10,000 shuffled training examples
validation_dataset = validation_dataset.shuffle(seed=42).select(range(2000))  # 2,000 shuffled validation examples
test_dataset = test_dataset.shuffle(seed=42).select(range(2000))  # 2,000 shuffled test examples

train_dataset = train_dataset.map(preprocess_function, batched=True)
validation_dataset = validation_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

# Remove unused columns
train_dataset = train_dataset.remove_columns(["translation_input", "translation_target"])
validation_dataset = validation_dataset.remove_columns(["translation_input", "translation_target"])
test_dataset = test_dataset.remove_columns(["translation_input", "translation_target"])

import torch  # used below to enable fp16 only when a CUDA GPU is available

# Set up the training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # Evaluate at the end of each epoch
    learning_rate=5e-5,
    per_device_train_batch_size=8,  # Adjust based on your hardware
    num_train_epochs=2,
    logging_dir='./logs',
    save_total_limit=2,
    fp16=torch.cuda.is_available(),  # Mixed precision requires a CUDA GPU
)

# Lists that collect the loss values for plotting
train_loss_values = []
eval_loss_values = []

# Customize the trainer to record loss values
class CustomTrainer(Seq2SeqTrainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        result = super().compute_loss(model, inputs, return_outputs=return_outputs, **kwargs)
        # With return_outputs=True the parent returns (loss, outputs)
        loss = result[0] if return_outputs else result
        if model.training:  # skip the forward passes run during evaluation
            train_loss_values.append(loss.detach().item())
        return result

    def evaluate(self, *args, **kwargs):
        metrics = super().evaluate(*args, **kwargs)
        if "eval_loss" in metrics:
            eval_loss_values.append(metrics["eval_loss"])
        return metrics

# Initialize the trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
)
# Start training
trainer.train()

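# Plot the loss values (training loss is recorded per step, validation loss per evaluation)
import matplotlib.pyplot as plt

plt.plot(train_loss_values, label="Training Loss")
plt.plot(eval_loss_values, label="Validation Loss")
plt.xlabel("Steps")
plt.ylabel("Loss")
plt.legend()
plt.show()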

bucket_name = "bucketforsolutionsfrontendweiyi"  # Replace with the name of your own S3 bucket

# Save the model and tokenizer
model.save_pretrained("./mistral-3.3B-finetuned")
tokenizer.save_pretrained("./mistral-3.3B-finetuned")

# Optionally upload the fine-tuned model back to S3
model_dir = "./mistral-3.3B-finetuned"
s3_client = boto3.client('s3')

for file_name in os.listdir(model_dir):
    s3_client.upload_file(os.path.join(model_dir, file_name), bucket_name, f"fine-tuned-model/{file_name}")

print(f"Fine-tuned model uploaded to s3://{bucket_name}/fine-tuned-model/")

# For reference: with 32 vCPUs and 128 GiB of memory, fine-tuning on the whole dataset would take 500+ hours,
# while the 10,000/2,000/2,000 subset used here takes about 6 hours.
# This is why we should consider fine-tuning only certain layers while keeping the remaining layers frozen.
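
The `preprocess_function` above tokenizes the source and target sentences separately and uses the padded target IDs as labels. A commonly used alternative for causal language models is to concatenate prompt and target into one sequence and set the labels of the prompt and padding positions to `-100`, the index ignored by the cross-entropy loss in PyTorch and Transformers. The sketch below shows that layout on plain lists of token IDs. `build_causal_example` and the example IDs are hypothetical; this illustrates the technique and is not the method used in this notebook.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def build_causal_example(prompt_ids, target_ids, max_length, pad_id):
    """Build input_ids, attention_mask, and labels for one prompt/target pair."""
    input_ids = (list(prompt_ids) + list(target_ids))[:max_length]
    # Only target positions contribute to the loss; prompt positions are masked.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(target_ids))[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad every field to max_length; padded positions are masked as well.
    pad_count = max_length - len(input_ids)
    input_ids += [pad_id] * pad_count
    attention_mask += [0] * pad_count
    labels += [IGNORE_INDEX] * pad_count
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = build_causal_example(prompt_ids=[5, 6, 7], target_ids=[8, 9], max_length=8, pad_id=0)
print(example["labels"])  # → [-100, -100, -100, 8, 9, -100, -100, -100]
```

In a real pipeline, `prompt_ids` and `target_ids` would come from the tokenizer, and `pad_id` from `tokenizer.pad_token_id`.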

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]



Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 
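
The closing comment of the training cell suggests freezing part of the network to shorten training. A minimal sketch of that idea follows: it disables gradients for every parameter except those in the last few decoder blocks and the output head. The helper `freeze_all_but_last_layers` is hypothetical, and the parameter-name prefixes `model.layers.` and `lm_head.` are an assumption based on the usual Hugging Face Mistral layout; inspect `model.named_parameters()` to confirm them for this checkpoint.

```python
import re


def freeze_all_but_last_layers(model, num_trainable_layers,
                               layer_prefix="model.layers.", head_prefix="lm_head."):
    """Leave only the last decoder blocks and the head trainable.

    Works with any object exposing named_parameters(), such as a torch.nn.Module.
    Returns the names of the parameters that remain trainable.
    """
    pattern = re.compile(re.escape(layer_prefix) + r"(\d+)\.")
    params = list(model.named_parameters())

    # Collect the decoder block indices that appear in the parameter names.
    layer_ids = set()
    for name, _ in params:
        match = pattern.match(name)
        if match:
            layer_ids.add(int(match.group(1)))
    keep = set(sorted(layer_ids)[-num_trainable_layers:]) if num_trainable_layers > 0 else set()

    trainable = []
    for name, param in params:
        match = pattern.match(name)
        in_kept_layer = bool(match) and int(match.group(1)) in keep
        param.requires_grad = in_kept_layer or name.startswith(head_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Toy stand-ins so the sketch runs without downloading a model.
class ToyParam:
    requires_grad = True


class ToyModel:
    def __init__(self, num_layers=4):
        names = ["model.embed_tokens.weight"]
        names += [f"model.layers.{i}.mlp.weight" for i in range(num_layers)]
        names += ["lm_head.weight"]
        self._params = [(name, ToyParam()) for name in names]

    def named_parameters(self):
        return iter(self._params)


print(freeze_all_but_last_layers(ToyModel(), num_trainable_layers=1))
# → ['model.layers.3.mlp.weight', 'lm_head.weight']
```

With the real model, the call would be made once before the trainer is constructed, for example `freeze_all_but_last_layers(model, num_trainable_layers=2)`; the value 2 is only an illustration.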

## Example
<div class="alert alert-block alert-warning">
<b>Warning:</b> Please make sure to delete the provisioned throughput with the following code, as costs are incurred while it is left running, even if you are not using it.
</div>

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model from the local directory
tokenizer = AutoTokenizer.from_pretrained("./mistral-3.3B-finetuned")
model = AutoModelForCausalLM.from_pretrained("./mistral-3.3B-finetuned")

# Example prompt
prompt = "Translate this sentence to French: How are you today?"

# Tokenize the input
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate the output
outputs = model.generate(input_ids, max_length=50)

# Decode the generated tokens into text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)
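
Because a causal model returns the prompt tokens followed by the continuation, the decoded text above repeats the prompt. One way to isolate the translation is to slice off the prompt length before decoding, as sketched here on plain lists; with tensors the equivalent slice would be `outputs[0][input_ids.shape[-1]:]`. The helper name `new_tokens_only` and the token IDs are hypothetical.

```python
def new_tokens_only(output_ids, prompt_length):
    """Return the generated token IDs that follow the prompt."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the number of output tokens")
    return list(output_ids[prompt_length:])


prompt_ids = [1, 20, 21, 22]           # stand-in for the tokenized prompt
output_ids = prompt_ids + [30, 31, 2]  # stand-in for model.generate(...)[0]
print(new_tokens_only(output_ids, len(prompt_ids)))  # → [30, 31, 2]
```

Note also that `max_length=50` counts the prompt tokens, whereas `max_new_tokens` limits only the continuation.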

<div class="alert alert-block alert-info">
<b>Note:</b> Please finish the cleanup by running `04_cleanup.ipynb` to clean up the other resources. </div>

## Clean up
<div class="alert alert-block alert-warning">
<b>Warning:</b> Please make sure to delete the running kernel after the project, as costs are incurred while it is left running, even if you are not using it.
</div>

## Final words
Since the client enterprise is looking for better adaptation to computer science terminology, I recommend using domain-oriented datasets, for example one built from content such as the glossary at the following URL: (https://www.proz.com/glossary-translations/english-to-french-translations/it-information-technology/page1)
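
As an illustration of how such a glossary could feed the same pipeline, the sketch below turns term pairs into CSV rows with the `translation_input` and `translation_target` columns used earlier. The term pairs and the helper `glossary_to_rows` are made up for the example; they are not taken from the linked glossary.

```python
import csv
import io


def glossary_to_rows(term_pairs):
    """Map (english, french) term pairs to rows matching the training CSV columns."""
    return [
        {"translation_input": english.strip(), "translation_target": french.strip()}
        for english, french in term_pairs
        if english.strip() and french.strip()  # drop incomplete entries
    ]


# Illustrative pairs only; the last one is incomplete and gets filtered out.
pairs = [("firewall", "pare-feu"), ("operating system", "système d'exploitation"), ("", "vide")]
rows = glossary_to_rows(pairs)

# Write to an in-memory buffer; a real run would open a file in fine-tuning-datasets/.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["translation_input", "translation_target"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Rows produced this way could be appended to `train.csv` so that domain terminology is represented during fine-tuning.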