# 1. Prerequisites and Installation

To begin, you need to **install the necessary Python libraries**.

We will use the **Hugging Face libraries**, which greatly simplify the training of state-of-the-art models.

* **transformers**: To access pre-trained models like **Whisper**.
* **datasets**: To manage and preprocess your audio-text database.
* **accelerate**: For easy management of GPU training.
* **soundfile** and **librosa**: For processing audio files.
* **jiwer**: To evaluate the **Word Error Rate (WER)**.
* **gradio**: To create the graphical interface.
* **torch**: The machine learning framework (PyTorch) used by all the code in this guide.

The recommended approach is to take a model that was pre-trained on a large multilingual corpus, such as **OpenAI's Whisper**, and **fine-tune it** on our own Fulfulde database. Here is a complete guide to achieving this in a Jupyter Notebook.

In [None]:
!pip install transformers datasets accelerate soundfile librosa jiwer gradio torch
!pip install torchaudio torchcodec
!pip install tensorboard

In [None]:
# To be placed in the VERY FIRST cell of our notebook
!pip install --upgrade transformers accelerate
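
After the installation cells have run, it can be useful to confirm which versions ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is our own, and the package list simply mirrors the `pip install` command above.

```python
from importlib import metadata

# Distribution names as they appear in the pip command above.
REQUIRED = ["transformers", "datasets", "accelerate", "soundfile",
            "librosa", "jiwer", "gradio", "torch"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before moving on to the next sections.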

# 2. Database Structure
For fine-tuning, our database must have a specific structure. We need audio-text pairs.
- Audio files: the loading code in this guide looks for files in .mp3 format. (Example: ful1.mp3; ful2.mp3; ful3.mp3; ful4.mp3; ful5.mp3)
- Text files: each transcript must be a plain .txt file with the same base name as its audio file. (Example: ful1.txt; ful2.txt; ful3.txt; ful4.txt; ful5.txt)
- Our system scans a local directory, pairs each .mp3 audio file with the .txt file of the same name, and creates the DataFrame used for training.
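
Before training, it may help to check that the folder really follows this naming convention. The sketch below is an illustration only: the helper `find_unpaired` is a hypothetical name, and the demonstration runs on a temporary folder filled with empty placeholder files rather than on real recordings.

```python
from pathlib import Path
import tempfile

def find_unpaired(data_dir):
    """Return (mp3 stems without a .txt, txt stems without an .mp3), both sorted."""
    folder = Path(data_dir)
    audio = {p.stem for p in folder.glob("*.mp3")}
    text = {p.stem for p in folder.glob("*.txt")}
    return sorted(audio - text), sorted(text - audio)

# Demonstration on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ful1.mp3", "ful1.txt", "ful2.mp3", "ful3.txt"]:
        (Path(tmp) / name).touch()
    missing_text, missing_audio = find_unpaired(tmp)
    print("Audio without transcript:", missing_text)
    print("Transcript without audio:", missing_audio)
```

Pointing `find_unpaired` at the real data folder lists the files that would be skipped or left unused during loading.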

# 3. Python code for fine-tuning in a Jupyter Notebook
This code covers data loading, fine-tuning the Whisper model, and evaluating its performance.

### A. Loading data from a local directory
1. Defining the directory: We must first define the path to the folder where our audio and text files are located.
2. Browsing audio files: The glob.glob() function is used to find all files with the .mp3 extension in the specified directory.
3. Pairing: For each audio file found, the code builds the expected transcript file name by replacing the .mp3 extension with .txt, using os.path.splitext().
4. Checking for existence: It then checks whether this text file exists. If it does not, a warning message is displayed and the audio file is ignored, so that no sample ends up without a transcription.
5. Reading transcripts: If the .txt file is found, its contents are read and stored as the transcript. UTF-8 encoding is specified to correctly handle Fulfulde special characters.
6. Creating the DataFrame: Finally, the paths to the audio files and transcripts are combined into a dictionary, which is used to create the DataFrame df needed for the next step of fine-tuning.

This approach is flexible because the DataFrame is generated automatically from whatever files we place in our working directory, which simplifies database management.

In [None]:
import pandas as pd
import os
import glob
from datasets import Dataset, Audio

data_dir = 'C:/Users/CCI-CNDT/tpt/transcription/data'
if not os.path.isdir(data_dir):
    raise FileNotFoundError(f"Error: The directory '{data_dir}' does not exist. Please create it and place your files there.")

audio_paths = []
transcriptions = []

# Browse all files in the directory to find audio-text pairs
for filepath in glob.glob(os.path.join(data_dir, '*.mp3')):
    transcript_path = os.path.splitext(filepath)[0] + '.txt'
    
    if os.path.exists(transcript_path):
        try:
            with open(transcript_path, 'r', encoding='utf-8') as f:
                transcript_text = f.read().strip()
            
            audio_paths.append(filepath)
            transcriptions.append(transcript_text)
            
        except Exception as e:
            print(f"Error reading file {transcript_path} : {e}")
    else:
        print(f"Warning: Missing transcript file for {filepath}. This file will be ignored.")

if not audio_paths:
    raise ValueError("No valid audio/text pairs (mp3 + txt) were found in the directory. Ensure that each .mp3 file has a corresponding .txt file with the same name.")

df = pd.DataFrame({
    "path": audio_paths,
    "transcription": transcriptions
})

print(f"\nDataFrame successfully created. {len(df)} audio-text pairs found. Here are the first 5 rows:\n")
print(df.head())

dataset = Dataset.from_pandas(df)

dataset = dataset.cast_column("path", Audio(sampling_rate=16000))

### B. Loading and Preprocessing with Hugging Face 

In [None]:
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", task="transcribe")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", task="transcribe")

def prepare_dataset(batch):
    """
    Pre-processing function for the database.
    """
    audio = batch["path"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["transcription"]).input_ids
    return batch

tokenized_dataset = dataset.map(prepare_dataset, remove_columns=dataset.column_names)

# Whisper's decoder accepts at most 448 target positions, so longer label sequences are dropped.
filtered_dataset = tokenized_dataset.filter(
    lambda example: len(example["labels"]) <= 448,
    num_proc=1
)

print("\n--- Data filtering ---")
print(f"Number of samples before filtering: {len(tokenized_dataset)}")
print(f"Number of samples after filtering: {len(filtered_dataset)}")
print(f"Number of samples deleted: {len(tokenized_dataset) - len(filtered_dataset)}")

### C. Model Fine-Tuning
This cell defines the data collator, the training arguments, and the WER metric, then trains the model and saves it.
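
The data collator in the next cell pads the label sequences of a batch to a common length and replaces the padded positions with -100, which is the default `ignore_index` of PyTorch's cross-entropy loss, so padding does not contribute to the loss. The following pure-Python sketch illustrates that idea only; `pad_labels` is a hypothetical helper, the token ids are made up, and the real collator works on tensors through the tokenizer's `pad` method.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def pad_labels(label_sequences, ignore_index=IGNORE_INDEX):
    """Right-pad every label list to the longest length in the batch."""
    longest = max(len(seq) for seq in label_sequences)
    return [list(seq) + [ignore_index] * (longest - len(seq)) for seq in label_sequences]

# Made-up token ids, for illustration only.
batch = pad_labels([[11, 12, 13], [21], [31, 32]])
for row in batch:
    print(row)
```

Every row now has the same length, and the positions filled with -100 are skipped when the loss is computed.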

In [None]:
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
from dataclasses import dataclass
from typing import Any, Dict, List, Union
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().item():
            labels = labels[:, 1:]
        
        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-fulfulde-model",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  
    learning_rate=1e-5,
    warmup_steps=50,
    max_steps=400,
    gradient_checkpointing=False,  
    fp16=torch.cuda.is_available(),  # mixed precision requires a CUDA GPU
    
    eval_strategy="steps",  
    
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=100,
    eval_steps=100,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
)


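def compute_metrics(pred):
    """Compute the Word Error Rate on the evaluation predictions."""
    from jiwer import wer

    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # Restore the pad token where the collator wrote -100, so the labels can be decoded.
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    # jiwer expects the references first, then the hypotheses.
    wer_score = wer(label_str, pred_str)
    return {"wer": wer_score}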

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=filtered_dataset,
    eval_dataset=filtered_dataset,  # same data as training, so the reported WER is optimistic
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

trainer.train()

trainer.save_model("./final-model")
processor.save_pretrained("./final-model")
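
The trainer above evaluates on the same samples it trains on, so the WER it reports is likely to be optimistic. A common remedy is to hold out part of the data for evaluation. The sketch below shows the idea with the standard library only; `split_pairs`, the 10% evaluation fraction, the seed, and the file names are all assumptions made for illustration.

```python
import random

def split_pairs(pairs, eval_fraction=0.1, seed=42):
    """Shuffle (audio_path, transcript_path) pairs and split them into train and eval lists."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical file names following the ful<N>.mp3 / ful<N>.txt pattern.
pairs = [(f"ful{i}.mp3", f"ful{i}.txt") for i in range(1, 21)]
train_pairs, eval_pairs = split_pairs(pairs, eval_fraction=0.1)
print(len(train_pairs), "training pairs,", len(eval_pairs), "evaluation pairs")
```

With the `datasets` library, `filtered_dataset.train_test_split(test_size=0.1, seed=42)` produces a comparable split as a `DatasetDict` with `train` and `test` entries, which could then be passed to the trainer as `train_dataset` and `eval_dataset`.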

# 4. Transcription and Graphical Interface
Once the model has been trained, we can use it for transcription via a simple interface.

In [None]:
import os
from transformers import pipeline
import torch
import gradio as gr

print("Loading the ASR model...")
asr_pipe = pipeline(
    "automatic-speech-recognition",
    model="./final-model",
    device=0 if torch.cuda.is_available() else -1,
)
print("Model loaded successfully.")

def transcribe_audio_file(audio_file):
    """
    Function to transcribe an audio file
    """
    if audio_file is None:
        return "Error: Please provide an audio file or record from the microphone."
    
    try:
        print(f"Transcribing file: {audio_file}")
        transcription = asr_pipe(audio_file)
        print("Transcription completed.")
        return transcription["text"]
    except Exception as e:
        print(f"An error has occurred: {e}")
        return f"Transcription error: {e}"

def clear_all():
    return [None, None]

app_css = """
.gradio-container {
    border: 3px solid #1E90FF !important; /* Blue color (dodgerblue) */
    border-radius: 15px !important;      /* Rounded corners */
    padding: 15px !important;            /* Space between the border and the content */
}
"""

with gr.Blocks(css=app_css, theme=gr.themes.Default()) as iface:
    gr.Markdown("# Fulfulde Transcription Service")
    gr.Markdown("Upload an audio file in Fulfulde or record your voice to obtain the transcription.")
    
    with gr.Row():
        audio_input = gr.Audio(type="filepath", label="Fulfulde audio file")
        
        text_output = gr.Textbox(label="Transcription", lines=7) 

    with gr.Row():
        submit_button = gr.Button("Transcribe", variant="primary")
        clear_button = gr.Button("Clear")

    submit_button.click(
        fn=transcribe_audio_file,
        inputs=audio_input,
        outputs=text_output
    )
    clear_button.click(
        fn=clear_all,
        inputs=None,
        outputs=[audio_input, text_output]
    )

print("Launch of the Gradio interface...")
iface.launch(share=True)

# 5. Final Evaluation
To evaluate the quality of the transcription, the Word Error Rate (WER) is the standard metric.
It measures the performance of our trained model by comparing the generated transcriptions with our reference transcriptions and reporting the proportion of word-level errors.
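
For reference, WER is defined as (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words needed to turn the reference into the hypothesis, and N is the number of words in the reference. The sketch below computes it with a word-level edit distance, using only the standard library; it is meant to make the definition concrete and is not the implementation used by `jiwer`. The hypothesis sentence is made up for the example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# One of the four reference words is missing from the hypothesis: WER = 1 / 4.
score = word_error_rate("mi yeyi yeeygo ndabbawaaji", "mi yeyi ndabbawaaji")
print(f"WER: {score * 100:.2f}%")  # prints "WER: 25.00%"
```

Because N counts only the reference words, a hypothesis with many inserted words can yield a WER above 100%.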

In [None]:
import jiwer

def calculate_wer(reference_text, hypothesis_text):
    """
    Calculate the Word Error Rate (WER) as a percentage.
    """
    # jiwer splits sentences into words itself, so pass whole strings, not word lists.
    reference = reference_text.lower()
    hypothesis = hypothesis_text.lower()

    error = jiwer.wer(reference, hypothesis)
    return error * 100

# Example
reference = "Mi yeyi yeeygo ndabbawaaji." 
hypothesis = "mi yeyi yeeygo ndabbawaaji."

wer_score = calculate_wer(reference, hypothesis)

print(f"Reference: '{reference}'")
print(f"Model transcription: '{hypothesis}'")
print(f"WER score: {wer_score:.2f}%")

if wer_score < 10.0:
    print("The model performs very well. 💪")
elif wer_score < 30.0:
    print("The model performs acceptably, but could be improved. 📈")
else:
    print("The model has a high WER; more data or better training is required. 📉")

In [None]:
import os
from transformers import pipeline
import torch
import gradio as gr

asr_pipe = pipeline(
    "automatic-speech-recognition",
    model="./final-model",
    device=0 if torch.cuda.is_available() else -1,
)

def transcribe_audio_file(audio_file):
    """
    Function to transcribe an audio file
    """
    if audio_file is None:
        return "Error: Please provide an audio file or record from the microphone."
    
    try:
        transcription = asr_pipe(audio_file)
        return transcription["text"]
    except Exception as e:
        return f"Transcription error: {e}"

iface = gr.Interface(
    fn=transcribe_audio_file,
    inputs=gr.Audio(type="filepath", label="Fulfulde audio file"),
    outputs=gr.Textbox(label="Transcription"),
    title="Fulfulde Transcription Service",
    description="Upload an audio file in Fulfulde to obtain its transcription."
)

iface.launch(share=True)