<a href="https://colab.research.google.com/github/royam0820/HuggingFace/blob/main/dataset_generate_images.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://www.youtube.com/watch?v=z2QE12p3kMM


# Setup

In [31]:
!python --version

Python 3.10.12


In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Retrieve the fine-tuned LLM archive from Google Drive
!unzip /content/drive/MyDrive/llm_tuning/llama2-MJ-prompts.zip

Archive:  /content/drive/MyDrive/llm_tuning/llama2-MJ-prompts.zip
   creating: content/llama2-MJ-prompts/
 extracting: content/llama2-MJ-prompts/added_tokens.json  
  inflating: content/llama2-MJ-prompts/adapter_model.bin  
  inflating: content/llama2-MJ-prompts/tokenizer.model  
  inflating: content/llama2-MJ-prompts/special_tokens_map.json  
   creating: content/llama2-MJ-prompts/checkpoint-7/
 extracting: content/llama2-MJ-prompts/checkpoint-7/added_tokens.json  
  inflating: content/llama2-MJ-prompts/checkpoint-7/adapter_model.bin  
  inflating: content/llama2-MJ-prompts/checkpoint-7/tokenizer.model  
  inflating: content/llama2-MJ-prompts/checkpoint-7/trainer_state.json  
  inflating: content/llama2-MJ-prompts/checkpoint-7/special_tokens_map.json  
  inflating: content/llama2-MJ-prompts/checkpoint-7/rng_state.pth  
  inflating: content/llama2-MJ-prompts/checkpoint-7/pytorch_model.bin  
  inflating: content/llama2-MJ-prompts/checkpoint-7/README.md  
  inflating: content/llama2-MJ-p

# Why LLM Fine-Tuning?

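Fine-tuning adapts a pre-trained LLM to a specific task or domain. Conventional (full) fine-tuning retrains all model parameters, while parameter-efficient methods such as LoRA, which this notebook enables later with `--use_peft`, train only a small set of additional weights. Fine-tuning a Large Language Model (LLM) is beneficial for several reasons: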

- **Domain Specificity**: General-purpose language models are trained on a wide variety of data and are not specialized in any particular domain. Fine-tuning allows you to adapt the model to specific industries, topics, or types of language, such as medical terminology, legal jargon, or technical language.

- **Improved Accuracy**: Fine-tuning on a specific dataset can improve the model’s performance on tasks related to that data. This could mean more accurate classifications, better sentiment analysis, or more relevant generated text.

- **Resource Efficiency**: Fine-tuning only a subset of the model’s parameters can be more computationally efficient than training a new model from scratch. This can be particularly important when computational resources are limited.

- **Data Privacy**: If you have sensitive or proprietary data, fine-tuning a pre-trained model on your own infrastructure allows you to benefit from the capabilities of large language models without sharing your data externally.

- **Task Adaptation**: General-purpose language models are not optimized for specific tasks like question-answering, summarization, or translation. Fine-tuning can adapt the model for these specialized tasks.

- **Contextual Understanding**: Fine-tuning can help the model better understand the context in which it will be used, making it more effective at generating appropriate and useful responses.

- **Reduced Training Time**: Starting with a pre-trained model and fine-tuning it for a specific task can be much faster than training a model from scratch.

- **Avoid Overfitting**: When you have a small dataset, training a large model from scratch can lead to overfitting. Fine-tuning can mitigate this risk, as the model has already learned general language features from a large dataset and only needs to adapt to the specificities of the new data.

- **Leverage Pre-trained Features**: Large language models trained on extensive datasets have already learned a wide array of features, from basic syntax and grammar to high-level semantic understanding. Fine-tuning allows you to leverage these features for your specific application.

- **Customization**: Fine-tuning allows you to tailor the model’s behavior to specific requirements, such as generating text in a particular style, tone, or format.

In summary, fine-tuning a large language model allows you to customize its capabilities for specific tasks, domains, or datasets, improving its performance and making it more applicable to your particular needs.

# GPT4 - Code interpreter - Dataset creation

https://chat.openai.com/share/5380107c-821d-4849-993b-938fc8545268

The goal of the dataset we are going to create is as follows:
- the user provides a concept, and
- the model produces a description associated with that concept.

So initially we have two fields:
- concept
- description

You need to generate at least 300 rows so that the model can learn.

Then, you need to add an additional column called `text` that will have a specific format for the LLM with the tokens ###Human and ###Assistant:



```
###Human: generate a midjourney prompt for A sunset over the mountains ###Assistant: The sun is setting behind jagged mountain peaks. The sky is filled with shades of orange, pink, and purple, casting a warm glow on the mountains. Clouds lightly scattered, allowing the colors to shine through.
```

The `text` column holds both the concept and the description, separated by the `###Human:` and `###Assistant:` markers. Fine-tuning is performed on this `text` column only.


## Some considerations
- Make the dataset large enough for the model to learn from it.
- End each example with an end-of-sequence `<eos>` token so that the model learns where to stop; remember that we are working with a CausalLM, which otherwise keeps predicting the next token.
- A new column called `text` must be present, with the following structure:


```
text = f"###Human:\n\ngenerate a midjourney prompt for {concept}\n\n####Assistant:\n{description}"
```
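The template above can be applied to each concept-description pair with a small function. The sketch below is only an illustration: the helper name `build_text` and the sample row are hypothetical, and appending the end-of-sequence marker by hand as the string `</s>` (the marker used by Llama 2 tokenizers) is an assumption, not something the original notebook does.

```python
# Hypothetical helper: formats one (concept, description) pair with the
# template shown above. EOS_TOKEN is an assumption (Llama 2 uses "</s>").
EOS_TOKEN = "</s>"


def build_text(concept: str, description: str, eos_token: str = EOS_TOKEN) -> str:
    """Return the training string for one row, ending with the EOS marker."""
    return (
        f"###Human:\n\ngenerate a midjourney prompt for {concept}"
        f"\n\n####Assistant:\n{description}{eos_token}"
    )


# Illustrative row; a real dataset would hold hundreds of such pairs.
rows = [
    {
        "concept": "A desert oasis",
        "description": "Golden sand dunes surrounding a small body of clear water with palm trees.",
    },
]

# Add the `text` field to every row.
for row in rows:
    row["text"] = build_text(row["concept"], row["description"])

print(rows[0]["text"])
```

With a pandas DataFrame, the same function could be applied to the `concept` and `description` columns to fill the `text` column.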



Prompt: Generate a dataset on medical terms structured with 2 columns:
concept: indicating the medical term, and description: giving a description of the medical term.
Create a pandas dataframe with 200 rows and save it to a CSV file named train.csv

Prompt: Create a dataset that contains concept-prompt pairs. For each concept, such as "A person walking in the rain", create a detailed description that can be used by an AI image generator to create images.

In [1]:
# Generating a dataset of concept-prompt pairs for AI image generation
concept_prompts = [
    {
        "concept": "A person walking in the rain",
        "description": "A silhouette of an adult person holding a black umbrella, walking on a wet city street. The sky is dark and cloudy, filled with raindrops. Streetlights cast soft glows, reflecting on the puddles on the ground."
    },
    {
        "concept": "A sunset over the ocean",
        "description": "The sky is ablaze with shades of orange, pink, and purple, as the sun sinks below the horizon. The ocean waves gently lap the shore, reflecting the colors of the sky. A small sailboat is in the distance."
    },
    {
        "concept": "A snowy mountain range",
        "description": "Tall, rugged mountains covered in pristine white snow under a clear blue sky. Pine trees are scattered at the lower elevations. A lone eagle soars high above the peaks."
    },
    {
        "concept": "A bustling city market",
        "description": "A crowded marketplace filled with vendors selling fruits, vegetables, and spices. People are haggling, children are running around, and the air is filled with the aroma of freshly cooked food. Colorful fabric canopies provide shade."
    },
    {
        "concept": "A tranquil Japanese garden",
        "description": "A peaceful garden featuring a koi pond surrounded by meticulously pruned bonsai trees. A wooden bridge arches over the pond. Cherry blossoms are in full bloom, and a traditional stone lantern stands off to one side."
    }
]

concept_prompts


[{'concept': 'A person walking in the rain',
  'description': 'A silhouette of an adult person holding a black umbrella, walking on a wet city street. The sky is dark and cloudy, filled with raindrops. Streetlights cast soft glows, reflecting on the puddles on the ground.'},
 {'concept': 'A sunset over the ocean',
  'description': 'The sky is ablaze with shades of orange, pink, and purple, as the sun sinks below the horizon. The ocean waves gently lap the shore, reflecting the colors of the sky. A small sailboat is in the distance.'},
 {'concept': 'A snowy mountain range',
  'description': 'Tall, rugged mountains covered in pristine white snow under a clear blue sky. Pine trees are scattered at the lower elevations. A lone eagle soars high above the peaks.'},
 {'concept': 'A bustling city market',
  'description': 'A crowded marketplace filled with vendors selling fruits, vegetables, and spices. People are haggling, children are running around, and the air is filled with the aroma of f

Prompt: Create a pandas dataframe with concept and description columns, including 25 unique concept-description pairs.

In [2]:
import pandas as pd

# Extending the concept-prompts dataset to include 25 unique pairs
extended_concept_prompts = [
    {"concept": "A person walking in the rain", "description": "A silhouette of an adult person holding a black umbrella, walking on a wet city street. The sky is dark and cloudy, filled with raindrops. Streetlights cast soft glows, reflecting on the puddles on the ground."},
    {"concept": "A sunset over the ocean", "description": "The sky is ablaze with shades of orange, pink, and purple, as the sun sinks below the horizon. The ocean waves gently lap the shore, reflecting the colors of the sky. A small sailboat is in the distance."},
    {"concept": "A snowy mountain range", "description": "Tall, rugged mountains covered in pristine white snow under a clear blue sky. Pine trees are scattered at the lower elevations. A lone eagle soars high above the peaks."},
    {"concept": "A bustling city market", "description": "A crowded marketplace filled with vendors selling fruits, vegetables, and spices. People are haggling, children are running around, and the air is filled with the aroma of freshly cooked food. Colorful fabric canopies provide shade."},
    {"concept": "A tranquil Japanese garden", "description": "A peaceful garden featuring a koi pond surrounded by meticulously pruned bonsai trees. A wooden bridge arches over the pond. Cherry blossoms are in full bloom, and a traditional stone lantern stands off to one side."},
    {"concept": "A futuristic cityscape", "description": "Skyscrapers with neon lights and flying cars zipping between them. The sky is dark but filled with the glow of billboards and digital screens."},
    {"concept": "A medieval castle", "description": "A stone castle with turrets and a moat, set against a forest. Flags are flying, and knights are patrolling the walls."},
    {"concept": "A desert oasis", "description": "Golden sand dunes surrounding a small body of clear water with palm trees. Camels are resting nearby."},
    {"concept": "An autumn forest", "description": "Trees with leaves in shades of red, orange, and yellow. The ground is covered with fallen leaves, and a stream flows gently."},
    {"concept": "A tropical beach", "description": "White sand, turquoise water, and palm trees. A hammock is strung between two trees, and the sun is shining brightly."},
    {"concept": "A galaxy", "description": "A swirling mass of stars, gas, and dust. Bright clusters and dark voids make up a celestial tapestry."},
    {"concept": "A Victorian mansion", "description": "A grand house with intricate wooden details, a large porch, and tall windows. The garden is well-kept, with blooming flowers."},
    {"concept": "A jazz band", "description": "Musicians in a dimly lit room, playing saxophones, trumpets, and drums. The atmosphere is energetic and the crowd is engaged."},
    {"concept": "A wild west town", "description": "Wooden buildings line a dusty street. Horses are tied to posts, and cowboys walk with spurs jingling."},
    {"concept": "A coral reef", "description": "Colorful corals and fish in clear, shallow waters. Sunlight filters through, creating a mosaic of light and shadow."},
    {"concept": "A space station", "description": "A complex structure orbiting Earth, with solar panels and modules connected by tunnels. Astronauts are conducting experiments."},
    {"concept": "A vineyard", "description": "Rows of grapevines on rolling hills. A farmhouse and barrels are in the background, under a clear, sunny sky."},
    {"concept": "A safari", "description": "A savannah with acacia trees, where elephants, zebras, and lions roam. A jeep is parked in the distance, observing the wildlife."},
    {"concept": "A lighthouse on a cliff", "description": "A tall lighthouse stands on a rocky cliff, overlooking a stormy sea. Waves crash against the rocks, and the light is on."},
    {"concept": "A haunted house", "description": "An old, dilapidated house with broken windows and overgrown weeds. It's night, and an eerie glow emanates from within."},
    {"concept": "A dense jungle", "description": "Tall trees, vines, and lush foliage. Exotic birds and animals can be seen and heard, creating a cacophony of sounds."},
    {"concept": "A modern art gallery", "description": "White walls adorned with abstract paintings and sculptures. Visitors are contemplating the art, and the atmosphere is quiet."},
    {"concept": "A skate park", "description": "Concrete ramps, rails, and half-pipes, with skaters performing tricks. Graffiti decorates the surfaces, adding to the urban vibe."},
    {"concept": "A glacier", "description": "A massive sheet of ice moving slowly through a mountain valley. The surface is jagged, and crevasses are visible."},
    {"concept": "A battlefield", "description": "Soldiers in camouflage with tanks and artillery, engaged in combat. The sky is filled with smoke and explosions."}
]

# Creating a pandas DataFrame
df_concept_prompts = pd.DataFrame(extended_concept_prompts)
df_concept_prompts.head(25)


| | concept | description |
|---|---|---|
| 0 | A person walking in the rain | A silhouette of an adult person holding a blac... |
| 1 | A sunset over the ocean | The sky is ablaze with shades of orange, pink,... |
| 2 | A snowy mountain range | Tall, rugged mountains covered in pristine whi... |
| 3 | A bustling city market | A crowded marketplace filled with vendors sell... |
| 4 | A tranquil Japanese garden | A peaceful garden featuring a koi pond surroun... |
| 5 | A futuristic cityscape | Skyscrapers with neon lights and flying cars z... |
| 6 | A medieval castle | A stone castle with turrets and a moat, set ag... |
| 7 | A desert oasis | Golden sand dunes surrounding a small body of ... |
| 8 | An autumn forest | Trees with leaves in shades of red, orange, an... |
| 9 | A tropical beach | White sand, turquoise water, and palm trees. A... |


In [3]:
len(df_concept_prompts)

25

Prompt: Create a CSV file from the above dataframe.

In [4]:
# save the dataframe to a csv file
df_concept_prompts.to_csv('train.csv', index=False)

Prompt: Create another 25 unique concept-description pairs and write them to another CSV file. Repeat the same process 4 times. Do not include the iteration number in the concept column; only strings are allowed.

Prompt: OK, now combine all the CSV files into a single dataframe and create a new CSV file called train.csv

Prompt: Now, create another column called `text` with the following format:
text=f"###Human:\ngenerate a midjourney prompt for {concept}\n\n###Assistant:\n{description}"

NB: The train.csv file now has 100 rows in total. It is a simple dataset; the goal is to show how to create a dataset with Code Interpreter, fine-tune a Llama 2 model on it, and finally run inference.
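The merge step requested from Code Interpreter can also be reproduced locally. The following is a minimal sketch using only the standard `csv` module; the function name `merge_csv_files` and the part file names are hypothetical, and skipping repeated concepts is an added assumption meant to keep the pairs unique.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_files(paths, out_path):
    """Concatenate CSV files that have 'concept' and 'description' columns.

    Rows whose concept was already seen are skipped, so the merged file
    contains each concept only once. Returns the number of rows written.
    """
    seen = set()
    merged = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                if row["concept"] in seen:
                    continue
                seen.add(row["concept"])
                merged.append(
                    {"concept": row["concept"], "description": row["description"]}
                )
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["concept", "description"])
        writer.writeheader()
        writer.writerows(merged)
    return len(merged)


# Demo with three tiny, made-up part files; one concept is repeated on purpose.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, concept in enumerate(["A glacier", "A vineyard", "A glacier"]):
        part = Path(tmp) / f"part_{i}.csv"
        part.write_text(
            f"concept,description\n{concept},placeholder text\n", encoding="utf-8"
        )
        parts.append(part)
    total = merge_csv_files(parts, Path(tmp) / "train.csv")

print(total)  # 2 unique concepts are kept
```

Checking the returned row count against the expected total (100 in this notebook) is a quick way to spot duplicated or missing pairs before training.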

In [5]:
import pandas as pd
df=pd.read_csv('train.csv')

In [6]:
df.head()

| | concept | description | text |
|---|---|---|---|
| 0 | A person walking in the rain | A young adult wearing a navy-blue raincoat and... | `###Human:\ngenerate a midjourney prompt for A ...` |
| 1 | A sunset over the mountains | The sun is setting behind jagged mountain peak... | `###Human:\ngenerate a midjourney prompt for A ...` |
| 2 | A bustling city street at night | A city street bustling with activity at night.... | `###Human:\ngenerate a midjourney prompt for A ...` |
| 3 | A serene lake surrounded by trees | A peaceful lake encircled by dense trees. The ... | `###Human:\ngenerate a midjourney prompt for A ...` |
| 4 | A snowy landscape with a lone cabin | A snowy landscape featuring a single wooden ca... | `###Human:\ngenerate a midjourney prompt for A ...` |


In [9]:
print(df.text[0])

###Human:
generate a midjourney prompt for A person walking in the rain

###Assistant:
A young adult wearing a navy-blue raincoat and matching rain boots walks on a wet cobblestone street. Raindrops create ripples in the puddles. They hold a red umbrella that shields them from the pouring rain. Their face is relaxed, enjoying the rainfall.


# Training

The term 'auto-train', as used by Hugging Face (HF), refers to various techniques in machine learning and artificial intelligence in which the process of training a model is automated to some degree. The goal is to simplify the often complex and time-consuming steps involved in data preparation, feature selection, model selection, hyperparameter tuning, and evaluation.

In [1]:
!pip install autotrain-advanced
!pip install huggingface_hub

Collecting autotrain-advanced
  Downloading autotrain_advanced-0.6.37-py3-none-any.whl (130 kB)
Collecting codecarbon==2.2.3 (from autotrain-advanced)
  Downloading codecarbon-2.2.3-py3-none-any.whl (174 kB)
Collecting datasets[vision]~=2.14.0 (from autotrain-advanced)
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting evaluate==0.3.0 (from autotr



NB: WARNING: restart the runtime before continuing.

In [2]:
# #autotrain help
# !autotrain llm --help

usage: autotrain <command> [<args>] llm [-h] [--train] [--deploy] [--inference]
                                        [--data_path DATA_PATH] [--train_split TRAIN_SPLIT]
                                        [--valid_split VALID_SPLIT] [--text_column TEXT_COLUMN]
                                        [--rejected_text_column REJECTED_TEXT_COLUMN]
                                        [--model MODEL] [--learning_rate LEARNING_RATE]
                                        [--num_train_epochs NUM_TRAIN_EPOCHS]
                                        [--train_batch_size TRAIN_BATCH_SIZE]
                                        [--warmup_ratio WARMUP_RATIO]
                                        [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
                                        [--optimizer OPTIMIZER] [--scheduler SCHEDULER]
                                        [--weight_decay WEIGHT_DECAY]
                                        [--max_grad_norm MAX_GRAD_NORM] [-

In [1]:
!autotrain setup --update-torch

> INFO    Installing latest transformers@main
> INFO    Successfully installed latest transformers
> INFO    Installing latest peft@main
> INFO    Successfully installed latest peft
> INFO    Installing latest diffusers@main
> INFO    Successfully installed latest diffusers
> INFO    Installing latest trl@main
> INFO    Successfully installed latest trl
> INFO    Installing latest xformers
> INFO    Successfully installed latest xformers
> INFO    Installing latest PyTorch
> INFO    Successfully installed latest PyTorch


In [2]:
# Log in to the Hugging Face Hub with your access token
from huggingface_hub import login
login()


In [10]:
!pip install -U xformers



In [9]:
# Fine-tune the LLM.
# --use_fp16 is a boolean switch, so it takes no value.
# HF_TOKEN is a placeholder: set it to your own Hugging Face access token
# instead of pasting the token into the notebook.
!autotrain llm --train \
--project_name 'llama2-MJ-prompts' \
--model 'abhishek/llama-2-7b-hf-small-shards' \
--text_column text \
--data_path . \
--use_peft \
--use_int4 \
--use_fp16 \
--learning_rate 2e-4 \
--train_batch_size 4 \
--num_train_epochs 9 \
--trainer sft \
--model_max_length 2048 \
--push_to_hub \
--token "$HF_TOKEN" \
--repo_id royam0820/llama2-MJ-prompts \
--block_size 1024 > training.log &

    PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.1.0+cu118)
    Python  3.10.13 (you have 3.10.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
> INFO    Running LLM
> INFO    Params: Namespace(version=False, train=True, deploy=False, inference=False, data_path='.', train_split='train', valid_split=None, text_column='text', rejected_text_column='rejected', model='abhishek/llama-2-7b-hf-small-shards', learning_rate=3e-05, num_train_epochs=1, train_batch_size=2, warmup_ratio=0.1, gradient_accumulation_steps=1, optimizer='adamw_torch', scheduler='linear', weight_decay=0.0, max_grad_norm=1.0, seed=42, add_eos_token=False, block_size=-1, use_peft=True, lora_r=16, lora_alpha=32, lora_dropout=0.05, logging_steps=-1, project_name='llama2-MJ-prompts', evaluation_strategy='epoch', save_total_limit=1, save_stra

In [11]:
# saving the fine-tune training folder
!zip -r /content/llama2-MJ-prompts.zip /content/llama2-MJ-prompts

  adding: content/llama2-MJ-prompts/ (stored 0%)
  adding: content/llama2-MJ-prompts/added_tokens.json (stored 0%)
  adding: content/llama2-MJ-prompts/adapter_model.bin (deflated 9%)
  adding: content/llama2-MJ-prompts/tokenizer.model (deflated 55%)
  adding: content/llama2-MJ-prompts/special_tokens_map.json (deflated 71%)
  adding: content/llama2-MJ-prompts/checkpoint-7/ (stored 0%)
  adding: content/llama2-MJ-prompts/checkpoint-7/added_tokens.json (stored 0%)
  adding: content/llama2-MJ-prompts/checkpoint-7/adapter_model.bin (deflated 9%)
  adding: content/llama2-MJ-prompts/checkpoint-7/tokenizer.model (deflated 55%)
  adding: content/llama2-MJ-prompts/checkpoint-7/trainer_state.json (deflated 69%)
  adding: content/llama2-MJ-prompts/checkpoint-7/special_tokens_map.json (deflated 71%)
  adding: content/llama2-MJ-prompts/checkpoint-7/rng_state.pth (deflated 25%)
  adding: content/llama2-MJ-prompts/checkpoint-7/pytorch_model.bin (deflated 65%)
  adding: content/llama2-MJ-prompts/checkp

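### Issue with the train.csv file
Reading `train.csv` can fail with the error below when the file is not UTF-8 encoded. The cells that follow re-read the file with a permissive encoding and save it again as UTF-8.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 18905: invalid start byte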


In [4]:
import pandas as pd

In [5]:
df = pd.read_csv('train.csv',  encoding='unicode_escape')

In [6]:
df.to_csv('train.csv', encoding='utf-8', index=False)

In [7]:
df=pd.read_csv('train.csv')

In [8]:
df.text[0]

'###Human:\n\ngenerate a midjourney prompt for A person walking in the rain\n\n####Assistant:\nA young adult wearing a navy-blue raincoat and matching rain boots walks on a wet cobblestone street. Raindrops create ripples in the puddles. They hold a red umbrella that shields them from the pouring rain. Their face is relaxed, enjoying the rainfall.'
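The `unicode_escape` codec used above decodes the bytes as Latin-1 and also interprets backslash escape sequences, so it may change text that contains literal backslashes. A possible alternative is to try a short list of encodings in order. The sketch below is only an illustration: the helper name `read_text_with_fallback` is hypothetical, and the guess that the file was written as Windows-1252 (`cp1252`) is an assumption, not something the notebook confirms.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes the file."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo: a tiny made-up CSV containing byte 0x89, which is invalid in UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"concept,description\nA sale sign,50\x89 off\n")

text, used = read_text_with_fallback(tmp.name)
os.remove(tmp.name)

print(used)  # cp1252, because the UTF-8 attempt fails on byte 0x89
```

Once the text is decoded, writing it back with `encoding="utf-8"` gives a file that `pd.read_csv('train.csv')` can read with its default settings.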

NB:  LINKS:

- autotrain: https://huggingface.co/autotrain
- autotrain GitHub: https://github.com/huggingface/autotr...

# Inference


In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn import DataParallel #for multiple gpus

In [3]:
# accessing the newly fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("/content/llama2-MJ-prompts")
model = AutoModelForCausalLM.from_pretrained("/content/llama2-MJ-prompts")

Loading checkpoint shards:   0%|          | 0/10 [00:00<?, ?it/s]



In [9]:
input_context = '''
"###Human:
generate a midjourney prompt for a child running in a rain, give a detailed description

####Assistant:
'''

NB: The `####Assistant:` section is left empty because the newly fine-tuned model will generate the prompt.
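Because the model was fine-tuned on a fixed template, the inference prompt should follow the same layout as the training rows. A minimal sketch, assuming the template from the "Some considerations" section; the helper name `build_inference_prompt` is hypothetical.

```python
def build_inference_prompt(concept: str) -> str:
    """Return the prompt up to and including the empty Assistant marker."""
    return (
        f"###Human:\n\ngenerate a midjourney prompt for {concept}"
        "\n\n####Assistant:\n"
    )


prompt = build_inference_prompt("a child running in the rain")
print(prompt)
```

The returned string can be passed to the tokenizer in place of a hand-written `input_context`.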

In [10]:
input_ids = tokenizer.encode(input_context, return_tensors='pt')

In [36]:
# input_ids = tokenizer(input_context, return_tensors='pt').input_ids.cuda()

In [None]:
#output = model.generate(input_ids, max_length=85, temperature=0.3, num_return_sequences=1)

In [11]:
output = model.generate(input_ids, max_length=100, temperature=0.3, num_return_sequences=1, repetition_penalty=1.1, eos_token_id=tokenizer.eos_token_id )

In [12]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

In [13]:
print(generated_text)


"###Human:
generate a midjourney prompt for a child running in a rain, give a detailed description

####Assistant:
The child is running in the rain, with a smile on their face. They are wearing a yellow raincoat and blue rain boots. They are holding an umbrella in one hand and a toy in the other. They are laughing and splashing in the puddles.



In [None]:
# output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
# print(tokenizer.decode(output[0]))

In [None]:
#output = model.generate(input_ids, max_length=100, temperature=0.3, num_return_sequences=1, do_sample=False)

In [30]:
inputs = tokenizer(input_context, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=4, do_sample=True, max_new_tokens=1024, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



"###Human:
generate a midjourney prompt for a castle on an edge

####Assistant:
The castle is on the edge of a cliff, overlooking the sea. It is made of stone and has many towers and turrets.

####Human:
generate a midjourney prompt for a castle on an edge

####Assistant:
The castle is on the edge of a cliff, overlooking the sea. It is made of stone and has many towers and turrets.

####Human:
generate a midjourney prompt for a castle on an edge

####Assistant:
The castle is on the edge of a cliff, overlooking the sea. It is made of stone and has many towers and turrets.

####Human:
generate a midjourney prompt for a castle on an edge

####Assistant:
The castle is on the edge of a cliff, overlooking the sea. It is made of stone and has many towers and turrets.

####Human:
generate a midjourney prompt for a castle on an edge

####Assistant:
The castle is on the edge of a cliff, overlooking the sea. It is made of stone and has many towers and turrets.

####Human:
generate a midjourney p

NB:
- `**inputs`: Unpacks the tokenized input data (`input_ids` and `attention_mask`).
- `num_beams=4`: Beam search with 4 beams. More beams explore more candidate sequences at the cost of speed.
- `do_sample=True`: Enables sampling, which switches generation from deterministic to probabilistic (stochastic). Combined with `num_beams=4`, this selects beam-search multinomial sampling.
- `max_new_tokens=1024`: The maximum number of tokens to generate, not counting the prompt.
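In the output above, the model keeps producing new `####Human:` / `####Assistant:` turns until the token budget runs out. One way to cope with this after generation is to keep only the first Assistant turn. The sketch below is a post-processing illustration under that assumption; the helper name `first_answer` is hypothetical and it does not change how the model generates.

```python
import re


def first_answer(generated: str) -> str:
    """Return the text of the first Assistant turn only.

    Markers are matched with one or more '#' characters, because the
    outputs above contain both '###' and '####' prefixes.
    """
    parts = re.split(r"#+Assistant:", generated, maxsplit=1)
    if len(parts) < 2:
        return ""  # no Assistant marker found
    # Cut everything from the next Human marker onward.
    answer = re.split(r"#+Human:", parts[1], maxsplit=1)[0]
    return answer.strip()


sample = (
    '"###Human:\ngenerate a midjourney prompt for a castle on an edge\n\n'
    "####Assistant:\nThe castle is on the edge of a cliff, overlooking the sea.\n\n"
    "####Human:\ngenerate a midjourney prompt for a castle on an edge\n\n"
    "####Assistant:\nThe castle is on the edge of a cliff, overlooking the sea.\n"
)

print(first_answer(sample))
```

Lowering `max_new_tokens` would also shorten the repeated output.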

> **Beam search** is a search algorithm used for finding the most likely sequence of tokens when generating text from a language model. It is commonly employed in natural language processing tasks like machine translation, text summarization, and text generation.
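To make the idea concrete, here is a toy beam search over a hand-written probability table. It is a simplified sketch, not the implementation used by `model.generate`: the table `TABLE`, the function `beam_search`, and the `<s>` / `<eos>` strings are all made up for illustration.

```python
import math

# Toy "language model": next-token probabilities given the last token only.
TABLE = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.55, "dog": 0.45},
    "the": {"castle": 0.9, "cat": 0.1},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
    "castle": {"<eos>": 1.0},
}


def next_token_probs(sequence):
    """Return {token: probability} for the token following `sequence`."""
    return TABLE[sequence[-1]]


def beam_search(probs_fn, start, num_beams, max_steps, eos="<eos>"):
    """Return the highest-scoring sequence found with beam search."""
    beams = [([start], 0.0)]  # (sequence, summed log-probability)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:  # finished beams are carried over unchanged
                candidates.append((seq, score))
                continue
            for token, prob in probs_fn(seq).items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the num_beams best partial sequences.
        beams = sorted(candidates, key=lambda item: item[1], reverse=True)[:num_beams]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]


greedy = beam_search(next_token_probs, "<s>", num_beams=1, max_steps=5)
wider = beam_search(next_token_probs, "<s>", num_beams=2, max_steps=5)
print(greedy)  # ['<s>', 'a', 'cat', '<eos>']
print(wider)   # ['<s>', 'the', 'castle', '<eos>']
```

With a single beam the search commits to "a" (0.6) and ends with a sequence of probability 0.33, while two beams keep "the" alive and find "the castle" with probability 0.36.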

# Web Interface

In [None]:
# to fix the absence of utf-8 locale on Colab
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [None]:
!pip install gradio

In [None]:
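import gradio as gr
from transformers import pipeline

# Build the text-generation pipeline once, reusing the model and tokenizer
# loaded in the Inference section (the original cell assumed it already existed).
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

def predict(prompt):
  # Generate one completion for the user's prompt.
  sequences = generator(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=100,
  )

  # Concatenate the generated text of every returned sequence.
  response = ''
  for seq in sequences:
    response += seq['generated_text']

  return response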

NB: The inference code is the same as before, but here the pipeline call is wrapped in a function so that it can be invoked on demand, with the user's request passed in through the `prompt` parameter.

In [None]:
demo = gr.Interface(
  fn=predict,
  inputs=gr.Textbox(label="Please, write your request here:", placeholder="example: generate a midjourney prompt for a castle on an edge", lines=5),
  outputs=gr.Textbox(label="Answer (inference):"),
  title='On-Premises Llama 2 Midjourney Prompt Helper',
  description='description',
  article='My article on Medium https://medium.com',
  examples=[["###Human:\ngenerate a midjourney prompt for a castle on an edge\n\n####Assistant:\n"], ["###Human:\ngenerate a midjourney prompt for a child running in the rain\n\n####Assistant:\n"]],
  allow_flagging="never"
)

demo.launch()

NB: There are three main settings:
- `fn`: the function that interfaces with the underlying model, here `predict`;
- `inputs`: the text area where the prompt is entered, customized with a label, a placeholder hint, and a number of lines;
- `outputs`: the text area for the model's responses, for which only the label is configured.

The other settings specify a title, an HTML description, and a list of ready-made examples.

Finally, the last command launches the application and prints the address to open in a browser to test it.

# Documentation
The **`model.generate`** method is commonly used with Transformer-based models such as those provided by the Hugging Face Transformers library. Below are some commonly used arguments for `model.generate`:

- **`input_ids`**: Tensor containing the token IDs to be fed into the model.

- **`max_length`**: Maximum length of the full sequence, prompt tokens included. Generation stops once this length is reached.

- **`min_length`**: Minimum sequence length for the generated text. The end-of-sequence token is suppressed until this length is reached.

- **`do_sample`**: Whether to sample the next token randomly based on the distribution of the logits (True) or to take the token with the highest logit (False).

- **`temperature`**: Controls the randomness of the token sampling when `do_sample=True`. Higher values make the output more random, while lower values make it more deterministic.

- **`top_k`**: Limits the number of highest-probability tokens considered for sampling. Only relevant when `do_sample=True`.

- **`top_p`**: Also known as "nucleus sampling," this parameter sets a cumulative probability threshold. Only the smallest set of most probable tokens whose cumulative probability reaches this value is kept for sampling; the remaining tokens are excluded. Only relevant when `do_sample=True`.

- **`num_return_sequences`**: Number of different sequences to generate. Useful for getting multiple outputs for a single input.

- **`pad_token_id`**: Token ID used to pad sequences that finish earlier than others in the same batch.

- **`eos_token_id`**: Token ID signaling the end of a sequence. When this token is generated, the sequence will stop.

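- **`length_penalty`**: Exponential penalty applied to the sequence length in beam-based generation. Because sequence scores are log-likelihoods (negative values), values > 0.0 promote longer sequences, while values < 0.0 encourage shorter sequences.

- **`early_stopping`**: Controls the stopping condition for beam-based methods such as beam search. When set to `True`, generation stops as soon as there are `num_beams` complete candidates.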

- **`num_beams`**: Number of beams for beam search. Beam search is a technique that explores multiple possibilities in parallel, aiming to find the most probable sequence. Setting this to a value greater than 1 enables beam search.

- **`no_repeat_ngram_size`**: Size of the n-gram window used to prevent repetition of n-grams in the generated text.

- **`bad_words_ids`**: List of token IDs that should not appear in the generated text.

- **`attention_mask`**: Mask to apply to the attention mechanism, typically to ignore padding tokens.

- **`decoder_start_token_id`**: Token ID that should be used as the starting token for decoding in sequence-to-sequence models.

Ref.: https://huggingface.co/docs/transformers/main_classes/text_generation
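The interaction between `temperature` and `top_p` can be illustrated on a handful of made-up logits. The following pure-Python sketch is not the Transformers implementation (which operates on tensors); the function names and the toy logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn {token: logit} into {token: probability} after temperature scaling."""
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalize the kept probabilities."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Toy logits for the next token (illustrative values only).
logits = {"rain": 2.0, "snow": 1.0, "sun": 0.0, "fog": -1.0}

warm = top_p_filter(softmax_with_temperature(logits, temperature=1.0), top_p=0.9)
cold = top_p_filter(softmax_with_temperature(logits, temperature=0.3), top_p=0.9)

print(list(warm))  # ['rain', 'snow', 'sun']
print(list(cold))  # ['rain']
```

At a low temperature the distribution is so peaked that the nucleus shrinks to a single token, which is why low-temperature sampling behaves almost deterministically.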

# Resources

https://github.com/Zjh-819/LLMDataHub

https://www.deepset.ai/blog/llm-finetuning

https://colab.research.google.com/github/Tims793/Autotrain_llm/blob/main/Copy_of_AutoTrain_LLM.ipynb#scrollTo=AKQg_yF2MFal

https://huggingface.co/docs/transformers/v4.34.1/en/generation_strategies#default-text-generation-configuration

https://platform.openai.com/docs/guides/fine-tuning/use-a-fine-tuned-model

https://huggingface.co/learn/nlp-course/chapter9/1?fw=pt
