# GENERAL DESCRIPTION

This code is part of the thesis conducted by **Vemmos Anastasios** for the **Department of Management Science and Technology** at the **Athens University of Economics and Business**. The thesis, titled _"Data-to-Text Transformation with Machine Learning: Literature Review and Empirical Research on Greek Dataset Creation,"_ aims to showcase the performance of four models based on a Greek table-to-text dataset.

For each dataset, well-established metrics such as **BLEU**, **METEOR**, and **ROUGE** were used to automatically evaluate each model's performance.




# Installation of all necessary libraries

In [7]:
!pip install transformers[torch] accelerate -U
!pip install datasets
!pip install rouge_score
!pip install datasets nltk parent
!pip install transformers datasets torch
!pip install transformers datasets evaluate nltk rouge-score
!pip install openai

[31mERROR: Ignored the following versions that require a different python version: 23.1001 Requires-Python >=3.11,<4.0; 23.1002 Requires-Python >=3.11,<4.0; 23.1003 Requires-Python >=3.11,<4.0; 23.1004 Requires-Python >=3.11,<4.0[0m[31m
[0m[31mERROR: Could not find a version that satisfies the requirement parent (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for parent[0m[31m


##  Mounting to Google Drive

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Importing the dataset into dataframe format

In [3]:
import pandas as pd

# Correct path to your dataset in Google Drive
file_path = '/content/drive/My Drive/Greek_D2T_dataset.xlsx'

# Load the dataset using read_excel
df = pd.read_excel(file_path)
df


Unnamed: 0,Όνομα,Πεδίο,Τόπος Γέννησης,Ημερομηνία Γέννησης,Ημερομηνία Θανάτου,Περιγραφή
0,Πέτρος Α΄ της Βραζιλίας,"μουσική, συνθέτης",Κελούζ,12-10-1798,24-09-1834,Πέτρος Α΄ της Βραζιλίας ασχολήθηκε με τα πεδία...
1,Πάπας Ιωάννης Παύλος Β΄,"ποίηση, δράμα",Βαντοβίτσε,18-05-1920,02-04-2005,Πάπας Ιωάννης Παύλος Β΄ ασχολήθηκε με τα πεδία...
2,Μαχάτμα Γκάντι,"φιλοσοφία, Πασιφισμός, νομική",Πορμπαντάρ,02-10-1869,30-01-1948,Μαχάτμα Γκάντι ασχολήθηκε με τα πεδία φιλοσοφί...
3,Δάντης Αλιγκιέρι,"ποίηση, γλωσσολογία, λογοτεχνία, πολιτική φιλο...",Φλωρεντία,01-01-1265,22-09-1321,Δάντης Αλιγκιέρι ασχολήθηκε με τα πεδία ποίηση...
4,Αλεσσάντρο Μαντσόνι,"ποίηση, δράμα",Μιλάνο,07-03-1785,22-05-1873,Αλεσσάντρο Μαντσόνι ασχολήθηκε με τα πεδία ποί...
...,...,...,...,...,...,...
757,Παναγιώτης Μπαχράμης,Πλάγιος μέσος,Καλαμάτα,11-03-1976,12-08-2010,"Ο Παναγιώτης Μπαχράμης, ο οποίος γεννήθηκε στη..."
758,Παναγιώτης Κατσούρης,Πλάγιος μέσος,Αθήνα,27-10-1976,08-02-1998,"Ο Παναγιώτης Κατσούρης, ο οποίος γεννήθηκε στη..."
759,Γιάννης Κοσκινιάτης,Πλάγιος μέσος,Ελλάδα,28-08-1983,28-10-2008,"Ο Γιάννης Κοσκινιάτης, ο οποίος γεννήθηκε στην..."
760,Θανάσης Τριμπόνιας,αμυντικός,Αθήνα,01-05-1984,23-07-2012,"Ο Θανάσης Τριμπόνιας, ο οποίος γεννήθηκε στην ..."


## Tokenizing both input and output so that they can be used in the model

In [10]:
from transformers import BartTokenizer

# Initialize the BART tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')

# Concatenate input columns into a single string for each row
input_texts = df.apply(
    lambda row: f"Όνομα: {row['Όνομα']}, Πεδίο {row['Πεδίο']}, Τόπος Γέννησης: {row['Τόπος Γέννησης']}, Ημερομηνία Γέννησης: {row['Ημερομηνία Γέννησης']}, Ημερομηνία Θανάτου: {row['Ημερομηνία Θανάτου']}",
    axis=1
).tolist()

# Extract the target texts
output_texts = [str(text) for text in df['Περιγραφή'].tolist()]

if isinstance(output_texts, list) and all(isinstance(item, str) for item in output_texts):
    # Tokenize the input and output texts
    input_encodings = tokenizer(input_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
    output_encodings = tokenizer(output_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")

    print(input_encodings)
    print(output_encodings)
else:
    print("Error: output_texts should be a list of strings.")

{'input_ids': tensor([[    0, 41335, 14285,  ...,     1,     1,     1],
        [    0, 41335, 14285,  ...,     1,     1,     1],
        [    0, 41335, 14285,  ...,     1,     1,     1],
        ...,
        [    0, 41335, 14285,  ...,     1,     1,     1],
        [    0, 41335, 14285,  ...,     1,     1,     1],
        [    0, 41335, 14285,  ...,     1,     1,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
{'input_ids': tensor([[    0, 41335, 21402,  ...,     1,     1,     1],
        [    0, 41335, 21402,  ...,     1,     1,     1],
        [    0, 41335,    48,  ...,     1,     1,     1],
        ...,
        [    0, 41335,  4333,  ...,     1,     1,     1],
        [    0, 41335,  4333,  ...,     1,     1,     1],
        [    0, 41335,  4333,  ...,     1,     1,     1]]), 'attentio

## Creating the expected dataset for training

In [12]:
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, input_encodings, output_encodings):
        self.input_encodings = input_encodings
        self.output_encodings = output_encodings

    def __len__(self):
        return len(self.input_encodings['input_ids'])

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.input_encodings.items()}
        item['labels'] = self.output_encodings['input_ids'][idx].clone().detach()
        return item

# Create dataset
dataset = TextDataset(input_encodings, output_encodings)



## Defining the training structure of bart model


In [13]:
from transformers import BartForConditionalGeneration, Trainer, TrainingArguments

# Load the BART model
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')

# Define training arguments
training_args = TrainingArguments(
    output_dir='/content/drive/My Drive/results',  # Save results to Google Drive
    num_train_epochs=3,
    per_device_train_batch_size=2,  # Adjust based on your GPU memory
    per_device_eval_batch_size=2,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='/content/drive/My Drive/logs',  # Save logs to Google Drive
    logging_steps=10,
    evaluation_strategy="epoch",  # Evaluate every epoch
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,  # For simplicity, using the same dataset for evaluation
)

pytorch_model.bin:   0%|          | 0.00/1.02G [00:00<?, ?B/s]



## Fine-tuning the model

In [14]:
# Fine-tune the model
trainer.train()

# Save the model
model.save_pretrained('/content/drive/My Drive/bart_model')
tokenizer.save_pretrained('/content/drive/My Drive/bart_tokenizer')


Epoch,Training Loss,Validation Loss
1,0.0488,0.034643
2,0.0202,0.008443
3,0.0039,0.006443


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}
Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


('/content/drive/My Drive/bart_tokenizer/tokenizer_config.json',
 '/content/drive/My Drive/bart_tokenizer/special_tokens_map.json',
 '/content/drive/My Drive/bart_tokenizer/vocab.json',
 '/content/drive/My Drive/bart_tokenizer/merges.txt',
 '/content/drive/My Drive/bart_tokenizer/added_tokens.json')

## Loading of the fine-tuned model

In [15]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Path to your fine-tuned model and tokenizer
model_path = '/content/drive/My Drive/bart_model'
tokenizer_path = '/content/drive/My Drive/bart_tokenizer'

# Load the model and tokenizer
model = BartForConditionalGeneration.from_pretrained(model_path)
tokenizer = BartTokenizer.from_pretrained(tokenizer_path)


## Generating Model's Predictions

In [16]:
def generate_output(input_text):
    # Tokenize the input text
    inputs = tokenizer(input_text, return_tensors='pt', truncation=True, padding=True, max_length=512)

    # Generate output using the model
    outputs = model.generate(
        inputs['input_ids'],
        max_length=512,
        num_beams=4,
        early_stopping=True,
        no_repeat_ngram_size=3,
        forced_bos_token_id=0,
        forced_eos_token_id=2
    )

    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Test the function with a sample input
sample_input = "Όνομα: Θανάσης Αθανασιάδης, Θέση: Αμυντικός, Ύψος: 180cm, Τόπος Γέννησης: Αθήνα, Ημερομηνία Γέννησης: 1990-01-01, Ημερομηνία Θανάτου: 2024-01-01"
output_text = generate_output(sample_input)

# Display the output
print("Input:")
print(sample_input)
print("\nGenerated Output:")
print(output_text)

Input:
Όνομα: Θανάσης Αθανασιάδης, Θέση: Αμυντικός, Ύψος: 180cm, Τόπος Γέννησης: Αθήνα, Ημερομηνία Γέννησης: 1990-01-01, Ημερομηνία Θανάτου: 2024-01-01

Generated Output:
Ο Θανάσης Αθανασιάδ- ς, ο οποίος γεννήθηκε στην Ο Αμυντικός, πείχε ύψος 1.72m, έπαιζε ως Βμύχας 0.97m. Πέθβανε και είθενε ψς 2024-01-20.


## Calculating the BLEU, ROUGE and METEOR metrics

In [17]:
from datasets import load_metric
import numpy as np
import torch

# Load BLEU, ROUGE, and METEOR metrics
bleu_metric = load_metric('bleu', trust_remote_code=True)
rouge_metric = load_metric('rouge', trust_remote_code=True)
meteor_metric = load_metric('meteor', trust_remote_code=True)

def compute_metrics(preds, labels):
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Prepare the data for BLEU, ROUGE and METEOR metrics
    decoded_preds = [pred.split() for pred in decoded_preds]
    decoded_labels = [[label.split()] for label in decoded_labels]  # each label needs to be a list of lists

    # BLEU
    bleu = bleu_metric.compute(predictions=decoded_preds, references=decoded_labels)

    # ROUGE
    rouge = rouge_metric.compute(predictions=decoded_preds, references=decoded_labels, rouge_types=["rouge1", "rouge2", "rougeL"])

    # METEOR
    meteor = meteor_metric.compute(predictions=decoded_preds, references=decoded_labels)

    return {
        'bleu': bleu['bleu'],
        'rouge1': rouge['rouge1'].mid.fmeasure,
        'rouge2': rouge['rouge2'].mid.fmeasure,
        'rougeL': rouge['rougeL'].mid.fmeasure,
        'meteor': meteor['meteor']
    }

# Create a DataLoader for the evaluation dataset
eval_dataloader = torch.utils.data.DataLoader(dataset, batch_size=2)

def evaluate_model(model, eval_dataloader):
    model.eval()
    all_preds = []
    all_labels = []

    for batch in eval_dataloader:
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)
        labels = batch['labels'].to(model.device)

        with torch.no_grad():
            outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=512)

        all_preds.extend(outputs.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

    # Convert predictions and labels to lists of sequences
    all_preds = [list(pred) for pred in all_preds]
    all_labels = [list(label) for label in all_labels]

    metrics = compute_metrics(all_preds, all_labels)
    return metrics

# Evaluate the model
metrics = evaluate_model(model, eval_dataloader)
print(metrics)


  bleu_metric = load_metric('bleu', trust_remote_code=True)


Downloading builder script:   0%|          | 0.00/2.48k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


{'bleu': 0.2519140713286194, 'rouge1': 0.8542945607906234, 'rouge2': 0.506941333826954, 'rougeL': 0.6468332152039871, 'meteor': 0.6386021928150123}


## Model Meltemi

Meltemi is the first Greek Large Language Model (LLM), trained by the Institute for Language and Speech Processing of Athena Research & Innovation Center. It is built on top Mistral-7b with the vision to create an LLM with enhanced reasoning capabilities in Greek.


## Import of all the necessary libraries

In [16]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import evaluate
import nltk

# Ensure nltk packages are downloaded
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## Model's download

In [4]:
# Load the Meltemi model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("ilsp/Meltemi-7B-v1")
model = AutoModelForCausalLM.from_pretrained("ilsp/Meltemi-7B-v1").to('cuda')

# Set the padding token
tokenizer.pad_token = tokenizer.eos_token

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/966 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/1.18M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.97M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/644 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.60G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/504M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

## Dataset transformation to appropriate input-output pair

In [13]:
# Load your dataset
file_path = '/content/drive/My Drive/Greek_D2T_dataset.xlsx'
df = pd.read_excel(file_path)

# Concatenate input columns into a single string for each row
df['input_text'] = df.apply(
    lambda row: f"Όνομα: {row['Όνομα']}, Πεδίο: {row['Πεδίο']}, Τόπος Γέννησης: {row['Τόπος Γέννησης']}, Ημερομηνία Γέννησης: {row['Ημερομηνία Γέννησης']}, Ημερομηνία Θανάτου: {row['Ημερομηνία Θανάτου']}",
    axis=1
)

# Reset the index to avoid potential issues
df.reset_index(drop=True, inplace=True)

# Split the dataset into training and evaluation sets
train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)


## Model's generated predictions and evaluation metrics

In [19]:
# Convert eval_df to a Dataset
eval_dataset = Dataset.from_pandas(eval_df)

# Tokenize the evaluation dataset
def tokenize_function(examples):
    return tokenizer(examples['input_text'], padding="max_length", truncation=True, max_length=128)

tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True)

# Define a function to generate model predictions
def generate_predictions(batch):
    inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=128).to('cuda')
    inputs = {k: v.to('cuda') for k, v in inputs.items()}
    outputs = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)
    predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return predictions

# Generate predictions for the evaluation set
eval_df['predictions'] = generate_predictions(eval_df['input_text'].tolist())

# Load evaluation metrics
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

# Compute BLEU
bleu_result = bleu.compute(predictions=eval_df['predictions'].tolist(), references=eval_df['Περιγραφή'].tolist())
print(f"BLEU: {bleu_result['bleu']}")

# Compute METEOR
meteor_result = meteor.compute(predictions=eval_df['predictions'].tolist(), references=eval_df['Περιγραφή'].tolist())
print(f"METEOR: {meteor_result['meteor']}")

# Compute ROUGE
rouge_result = rouge.compute(predictions=eval_df['predictions'].tolist(), references=eval_df['Περιγραφή'].tolist())
print(f"ROUGE-L: {rouge_result['rougeL']}")



Map:   0%|          | 0/153 [00:00<?, ? examples/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


BLEU: 0.18938819787566924
METEOR: 0.2517021684128649
ROUGE-L: 0.43072493954846935


## GPT- DAVINCI
Davinci-002 is a model produced by OpenAI, that has enhanced text completion capabilities. This model was used since the generation task and it is the most affordable one from the OpenAI's ecosystem.

## Importing of all the necessary libraries


In [11]:
!pip install openai==0.28



In [8]:
import openai
import pandas as pd
import json
import time

## Preparation of the dataset into the expected model's format (json)

In [9]:
# Load your OpenAI API key
openai.api_key = 'openai.api_key'  # Replace with your actual OpenAI API key

# Load your dataset
file_path = '/content/drive/My Drive/Greek_D2T_dataset.xlsx'
df = pd.read_excel(file_path)

# Prepare the dataset
df['input_text'] = df.apply(
    lambda row: f"Όνομα: {row['Όνομα']}, Πεδίο: {row['Πεδίο']}, Τόπος Γέννησης: {row['Τόπος Γέννησης']}, Ημερομηνία Γέννησης: {row['Ημερομηνία Γέννησης']}, Ημερομηνία Θανάτου: {row['Ημερομηνία Θανάτου']}",
    axis=1
)

# Convert to JSONL format
train_data = []
for index, row in df.iterrows():
    train_data.append({"prompt": row['input_text'], "completion": row['Περιγραφή']})

# Save to JSONL file
train_file_path = 'train_data.jsonl'
with open(train_file_path, 'w', encoding='utf-8') as f:
    for item in train_data:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")



##  Uploading of the dataset into OpenAI's storage space

In [12]:
from openai import OpenAI

client = OpenAI(api_key='openai.api_key')


client.files.create(
  file=open(train_file_path, "rb"),
  purpose="fine-tune"
)

## Definition of fine-tuning job

In [26]:
from openai import OpenAI
client = OpenAI(api_key='api_key')

client.fine_tuning.jobs.create(
  training_file="file-MvhXsTr7oeCTeDMng3ujia8B",
  model="davinci-002"
)

FineTuningJob(id='ftjob-ha8h3vyLxqzgd8xdupCLc1Ua', created_at=1720398680, error=Error(code=None, message=None, param=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='davinci-002', object='fine_tuning.job', organization_id='org-XD6qOXycfEhzMPYtWa6ETzFq', result_files=[], seed=1326021275, status='validating_files', trained_tokens=None, training_file='file-MvhXsTr7oeCTeDMng3ujia8B', validation_file=None, estimated_finish=None, integrations=[], user_provided_suffix=None)

## Model's generated predictions

In [16]:
# Function to generate descriptions with error handling
def generate_summary(prompt, model_id, retries=3):
    for _ in range(retries):
        try:
            response = openai.Completion.create(
                model=model_id,
                prompt=prompt,
                max_tokens=512,
                temperature=0.5,  # Reduced temperature for more stable output
                top_p=1,
                n=1,
                stop=None
            )
            return response.choices[0].text.strip()
        except openai.error.APIError as e:
            print(f"APIError: {e}")
            time.sleep(1)  # Wait a moment before retrying
        except Exception as e:
            print(f"Error: {e}")
            return None
    return None

# Generate descriptions using the fine-tuned model
model_id = 'ft:davinci-002:personal::9iWphVEK'  # Replace with your actual fine-tuned model ID
df['generated_summary'] = df['input_text'].apply(lambda x: generate_summary(x, model_id))

# Save the generated summaries to a new Excel file for reference
df.to_excel('generated_summaries.xlsx', index=False)

print(df[['input_text', 'generated_summary']])


Error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.
Error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.
Error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.
Error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.
Error: You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.
     

## Evaluation of the model's output by using the same metrics

In [17]:
import evaluate

# Initialize evaluation metrics
bleu = evaluate.load('bleu')
meteor = evaluate.load('meteor')
rouge = evaluate.load('rouge')

# Prepare the references and predictions
references = df['Περιγραφή'].tolist()
predictions = df['generated_summary'].tolist()

# Remove None values in predictions and corresponding references
filtered_predictions_references = [(pred, ref) for pred, ref in zip(predictions, references) if pred is not None]
filtered_predictions, filtered_references = zip(*filtered_predictions_references)

# Compute BLEU
bleu_result = bleu.compute(predictions=list(filtered_predictions), references=[[ref] for ref in filtered_references])
print(f"BLEU: {bleu_result['bleu']}")

# Compute METEOR
meteor_result = meteor.compute(predictions=list(filtered_predictions), references=list(filtered_references))
print(f"METEOR: {meteor_result['meteor']}")

# Compute ROUGE
rouge_result = rouge.compute(predictions=list(filtered_predictions), references=list(filtered_references))
print(f"ROUGE-L: {rouge_result['rougeL']}")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


BLEU: 0.7436937778253331
METEOR: 0.9115310357474636
ROUGE-L: 0.8893971866491569


## MISTRAL 7-B


## Import of all necessary libraries

In [25]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import torch
from huggingface_hub import login

## Transforming dataset into input-output pairs

In [22]:
# Load your dataset
file_path = '/content/drive/My Drive/Greek_D2T_dataset.xlsx'
df = pd.read_excel(file_path)

# Prepare the dataset
df['input_text'] = df.apply(
    lambda row: f"Όνομα: {row['Όνομα']}, Πεδίο: {row['Πεδίο']}, Τόπος Γέννησης: {row['Τόπος Γέννησης']}, Ημερομηνία Γέννησης: {row['Ημερομηνία Γέννησης']}, Ημερομηνία Θανάτου: {row['Ημερομηνία Θανάτου']}",
    axis=1
)

## Model Loading

In [28]:
# Log in to Hugging Face
huggingface_token = 'huggingface_token'  # Replace with your actual Hugging Face token
login(huggingface_token)

# Load the Mistral-7B model and tokenizer
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=True).to('cuda')

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful




tokenizer_config.json:   0%|          | 0.00/996 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

## Generating model's predictions

In [31]:
# Set the padding token
tokenizer.pad_token = tokenizer.eos_token

# Function to generate summaries
def generate_summary(prompt, max_length=512, temperature=0.5, top_p=1, num_return_sequences=1, do_sample=True):
    inputs = tokenizer(prompt, return_tensors='pt', padding=True).to('cuda')
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],  # Provide attention mask
        max_length=max_length,
        temperature=temperature,
        top_p=top_p,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id,  # Set pad_token_id to eos_token_id
        do_sample=do_sample  # Set do_sample to True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

# Generate summaries using the Mistral-7B model
df['generated_summary'] = df['input_text'].apply(lambda x: generate_summary(x))

# Save the generated summaries to a new Excel file for reference
output_file_path = '/content/drive/My Drive/generated_summaries_Mistral.xlsx'
df.to_excel(output_file_path, index=False)

print(df[['input_text', 'generated_summary']])


                                            input_text  \
0    Όνομα: Πέτρος Α΄ της Βραζιλίας, Πεδίο: μουσική...   
1    Όνομα: Πάπας Ιωάννης Παύλος Β΄, Πεδίο: ποίηση,...   
2    Όνομα: Μαχάτμα Γκάντι, Πεδίο: φιλοσοφία, Πασιφ...   
3    Όνομα: Δάντης Αλιγκιέρι, Πεδίο: ποίηση, γλωσσο...   
4    Όνομα: Αλεσσάντρο Μαντσόνι, Πεδίο: ποίηση, δρά...   
..                                                 ...   
757  Όνομα: Παναγιώτης Μπαχράμης, Πεδίο: Πλάγιος μέ...   
758  Όνομα: Παναγιώτης Κατσούρης, Πεδίο: Πλάγιος μέ...   
759  Όνομα: Γιάννης Κοσκινιάτης, Πεδίο: Πλάγιος μέσ...   
760  Όνομα: Θανάσης Τριμπόνιας, Πεδίο: αμυντικός, Τ...   
761  Όνομα: Κώστας Ανδριόπουλος, Πεδίο: Τερματοφύλα...   

                                     generated_summary  
0    Όνομα: Πέτρος Α΄ της Βραζιλίας, Πεδίο: μουσική...  
1    Όνομα: Πάπας Ιωάννης Παύλος Β΄, Πεδίο: ποίηση,...  
2    Όνομα: Μαχάτμα Γκάντι, Πεδίο: φιλοσοφία, Πασιφ...  
3    Όνομα: Δάντης Αλιγκιέρι, Πεδίο: ποίηση, γλωσσο...  
4    Όνομα: Αλεσσά

## Evaluation of the model using the metrics described

In [33]:
import evaluate

# Assuming your DataFrame is named df and contains 'generated_summary' and 'Περιγραφή' columns

# Load evaluation metrics
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

# Compute BLEU
bleu_result = bleu.compute(predictions=df['generated_summary'].tolist(), references=[[ref] for ref in df['Περιγραφή'].tolist()])
print(f"BLEU: {bleu_result['bleu']}")

# Compute METEOR
meteor_result = meteor.compute(predictions=df['generated_summary'].tolist(), references=df['Περιγραφή'].tolist())
print(f"METEOR: {meteor_result['meteor']}")

# Compute ROUGE
rouge_result = rouge.compute(predictions=df['generated_summary'].tolist(), references=df['Περιγραφή'].tolist())
print(f"ROUGE-L: {rouge_result['rougeL']}")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


BLEU: 0.02889457987877729
METEOR: 0.17127043478781478
ROUGE-L: 0.06476659944378454
