%pip install datasets
%pip install transformers
%pip install sentencepiece
%pip install diffusers --upgrade
%pip install invisible_watermark accelerate safetensors
%pip install optuna scikit-learn torchvision
%pip install jiwer
%pip install evaluate --upgrade
%pip install wandb --upgrade

In [None]:
import pandas as pd
import datasets
from PIL import Image, ImageFile
# Let Pillow load image files that are truncated; the flag lives on ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
import torch
import torch.nn as nn
from transformers import AutoProcessor
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer
import json

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Data Loading

1. **Loading Data:**
    - The code loads the tab-separated file 'described_dataset_label.csv', located one directory above the current directory, with the Pandas read_csv() function (sep='\t', encoding='utf-8').
    - It shuffles the rows with .sample(frac=1), resets the index, and keeps the first 20,000 rows of the shuffled data, which yields a random subset of 20,000 rows.
    - The columns are renamed with the .rename() method ('FILE' to 'image', 'AUTHOR' to 'author', 'TITLE' to 'title', 'TECHNIQUE' to 'style').
    - Finally, it keeps only the columns 'image', 'author', 'title' and 'style'.

2. **Data Preprocessing:**
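    - It prefixes the 'image' column values with a dot ('.') using a list comprehension and assigns the result back to the 'image' column.
    - It lowercases the 'author' values, and for each 'style' value it keeps only the lowercased text before the first comma.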

3. **Label Encoding:**
    - The `labels_author` and `labels_sty` dictionaries are loaded from the JSON files '../label_author.json' and '../label_style.json', which store the encoding as ID-to-label pairs.
    - It then creates dictionaries `label2id_auth` and `id2label_auth` to map authors to unique IDs and vice versa.
    - Similarly, it creates dictionaries `label2id_sty` and `id2label_sty` to map styles to unique IDs and vice versa.

## Reasoning:
- **Loading Data:** 
    - The code loads a dataset from a CSV file, presumably containing information about images such as their authors, titles, and styles.
    - It selects only a random subset of the dataset (20,000 rows) for faster processing or due to memory constraints.
    - Renaming and selecting specific columns ensure consistency and relevance to the task.

- **Data Preprocessing:**
    - The prefixing of the 'image' column values with a dot seems to be a formatting step, possibly to ensure compatibility with file paths.

- **Label Encoding:**
    - Encoding labels into numerical IDs is a common preprocessing step in machine learning tasks. It converts categorical data into a format that machine learning algorithms can understand.
    - These mappings (`label2id_auth`, `id2label_auth`, `label2id_sty`, `id2label_sty`) facilitate the conversion between labels and their corresponding numerical IDs during training and prediction stages.
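
The label-encoding step can be sketched as follows. This is a minimal illustration that assumes the JSON files map stringified integer IDs to label names; the three entries below are made up and are not taken from the real files.

```python
import json

# Hypothetical file content. JSON object keys are always strings,
# so the IDs must be converted back to int after loading.
raw = '{"0": "rembrandt", "1": "goya", "2": "vermeer"}'
labels = json.loads(raw)

# Build both directions of the mapping.
id2label = {int(i): name for i, name in labels.items()}
label2id = {name: i for i, name in id2label.items()}

print(id2label[1])          # goya
print(label2id["vermeer"])  # 2
```

Keeping both dictionaries means a label can be encoded for training and a predicted ID can be decoded back to a readable name.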


In [None]:
data = pd.read_csv('../described_dataset_label.csv',sep='\t',encoding='utf-8')
data = data.sample(frac=1).reset_index(drop=True)
data = data[:20000]
data = data.rename(columns={'FILE':'image','AUTHOR':'author', 'TITLE': 'title', 'TECHNIQUE':'style'})
data = data[['image','author','title','style']]
data['image'] = [f'.{x}' for x in data['image']]
data['author'] = [x.lower() for x in data['author']]
data['style'] = [x.split(',')[0].lower() for x in data['style']]

with open('../label_author.json', 'r') as f:
    labels_author = json.load(f)
label2id_auth, id2label_auth = dict(), dict()
for i, label in labels_author.items():
    i= int(i)
    id2label_auth[i]=label
    label2id_auth[label]=i

with open('../label_style.json', 'r') as f:
    labels_sty = json.load(f)
label2id_sty, id2label_sty = dict(), dict()
for i, label in labels_sty.items():
    i=int(i)
    label2id_sty[label]=i
    id2label_sty[i]=label

In [None]:
print(label2id_auth)
print(labels_author)
print(len(labels_author))

1. **Label Mapping:**
    - It utilizes the `map()` function in Pandas to replace each unique label in the 'author' and 'style' columns with their corresponding numerical IDs stored in the `label2id_auth` and `label2id_sty` dictionaries, respectively.
    - This transformation effectively converts categorical labels into numerical representations, facilitating machine learning model training.

Mapping categorical labels to numerical IDs is essential for many machine learning algorithms, as they typically require numerical input. By replacing categorical labels with numerical IDs, the data becomes suitable for training predictive models.
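
When a dictionary is passed to the Pandas `map()` function, any value that has no entry in the dictionary becomes NaN, so it can be worth checking for unmapped labels before training. The sketch below shows such a check in plain Python; `find_unmapped` is a hypothetical helper and the label names are illustrative.

```python
def find_unmapped(values, mapping):
    """Return the distinct values that have no entry in `mapping`, sorted."""
    return sorted({v for v in values if v not in mapping})

# Illustrative data, not taken from the real dataset.
label2id = {"oil on canvas": 0, "fresco": 1}
styles = ["oil on canvas", "fresco", "tempera", "fresco"]

missing = find_unmapped(styles, label2id)
print(missing)  # ['tempera']

# Keep only the rows whose label is known, then encode them.
encoded = [label2id[s] for s in styles if s in label2id]
print(encoded)  # [0, 1, 1]
```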

In [None]:
data['author'] = data['author'].map(label2id_auth)
data['style'] = data['style'].map(label2id_sty)
print(data.columns)

Finally, we create a Dataset from the processed data and cast the 'image' column to the `datasets.Image` feature, so that each file path is decoded into a PIL image when it is accessed.

In [None]:
dataset = datasets.Dataset.from_pandas(data).cast_column('image',datasets.Image())
print(dataset)

In [None]:
sample = dataset[500]

image = sample['image']
width, height = image.size  # PIL returns (width, height)
display(image.resize((int(0.3 * width), int(0.3 * height))))
author = id2label_auth[sample['author']]
technique = id2label_sty[sample['style']]
print(f'Author: {author}')
print(f'Technique: {technique}')

1. **Checkpoint Initialization (`checkpoint_clas`):**
    - The variable `checkpoint_clas` is assigned the string 'google/vit-base-patch16-224-in21k'.
    - This string identifies the pre-trained model checkpoint on the Hugging Face Hub.
    - The model, 'vit-base-patch16-224', is a base-size Vision Transformer with a patch size of 16x16 and an input image size of 224x224. The 'in21k' suffix indicates that it was pre-trained on ImageNet-21k, which contains roughly 21,000 classes.

2. **Processor Initialization (`processor_clas`):**
    - The `AutoProcessor.from_pretrained()` method is used to initialize a processor for the specified pre-trained model.
    - The `AutoProcessor` class automatically selects the appropriate processor based on the provided checkpoint string; for an image-only model such as ViT this is an image processor.
    - The processor holds the preprocessing settings the model expects, such as the target image size and the normalization mean and standard deviation. No tokenization is involved, since the model takes only images as input.


# Model Selection

For our classification task, we utilize a Vision Transformer (ViT) model. The ViT model, designed by Google, has demonstrated excellent performance in image classification tasks. It divides each image into patches and applies a transformer encoder to learn patterns and features from these patches.
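
To make the patch splitting concrete, the short calculation below derives how many patches a 224x224 image yields with 16x16 patches, and the resulting sequence length once the classification token is prepended. `vit_sequence_length` is a hypothetical helper written for this illustration.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (number of patches, sequence length including the [CLS] token)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches, num_patches + 1  # +1 for the classification token

num_patches, seq_len = vit_sequence_length(image_size=224, patch_size=16)
print(num_patches, seq_len)  # 196 197
```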

## Loading the Model

We load a pre-trained ViT model and its associated processor for our classification task. Using a pre-trained model allows us to leverage learned features from large datasets, improving the performance of our model on our specific task with less training data and time.

In [None]:
checkpoint_clas = 'google/vit-base-patch16-224-in21k'
processor_clas = AutoProcessor.from_pretrained(checkpoint_clas)

1. **Normalization:**
    - The code initializes a normalization transform (`normalize`) using the `Normalize` class from `torchvision.transforms`. 
    - The mean and standard deviation for normalization are obtained from the `processor_clas` object, which stores the preprocessing parameters of the pre-trained checkpoint.
    - `processor_clas.image_mean` and `processor_clas.image_std` are used to set the mean and standard deviation, respectively, for normalizing the input images.

2. **Image Size Determination:**
    - The variable `size` is calculated based on the dimensions specified in the `processor_clas` object.
    - If the `processor_clas.size` dictionary contains a key named "shortest_edge", the `size` is set to the value corresponding to this key. Otherwise, the `size` is set to a tuple containing the height and width specified in the `processor_clas.size` dictionary.

3. **Compose Transformations:**
    - A sequence of transformations is defined using the `Compose` class from `torchvision.transforms`. 
    - The defined transformations include:
        - `RandomResizedCrop`: This transformation performs a random crop of the input image and resizes it to the specified size. The `size` parameter is set to the determined `size`.
        - `ToTensor`: This transformation converts the image data into a PyTorch tensor.
        - `normalize`: This transformation normalizes the tensor values using the specified mean and standard deviation.
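
The size determination in step 2 can be sketched as a small function. `resolve_crop_size` is a hypothetical name, and the two dictionaries are illustrative stand-ins for `processor_clas.size`.

```python
def resolve_crop_size(size_cfg: dict):
    """Mirror the notebook's logic for choosing the crop size.

    Returns an int when only the shortest edge is specified,
    otherwise a (height, width) tuple.
    """
    if "shortest_edge" in size_cfg:
        return size_cfg["shortest_edge"]
    return (size_cfg["height"], size_cfg["width"])

# Illustrative configurations; the real values come from processor_clas.size.
print(resolve_crop_size({"height": 224, "width": 224}))  # (224, 224)
print(resolve_crop_size({"shortest_edge": 256}))          # 256
```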

## Reasoning:
- **Normalization:**
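    - `ToTensor` first scales the pixel values to the range 0 to 1; subtracting the mean and dividing by the standard deviation then centers them around zero (with a mean and standard deviation of 0.5 per channel, for example, the result lies between -1 and 1).
    - Using the same mean and standard deviation that the checkpoint was pre-trained with keeps the inputs in the distribution the model expects, which helps stabilize and speed up fine-tuning.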

- **Compose Transformations:**
    - Composing the transformations into a pipeline allows for efficient preprocessing of input images before feeding them into the neural network.
    - Random cropping followed by resizing and conversion to tensor are common preprocessing steps used in image classification tasks.
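
As a worked example of the normalization arithmetic, the sketch below applies `(value - mean) / std` to a few pixel values. The mean and standard deviation of 0.5 are assumptions chosen for illustration; the real values should be read from `processor_clas.image_mean` and `processor_clas.image_std`.

```python
def normalize_pixel(value: float, mean: float, std: float) -> float:
    """Apply the per-channel normalization (value - mean) / std."""
    return (value - mean) / std

# Assumed statistics for illustration only.
mean, std = 0.5, 0.5

for raw in (0, 128, 255):
    scaled = raw / 255  # ToTensor maps uint8 [0, 255] to float [0, 1]
    print(raw, round(normalize_pixel(scaled, mean, std), 3))
# 0 -1.0
# 128 0.004
# 255 1.0
```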

In [None]:
from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

normalize = Normalize(mean=processor_clas.image_mean, std=processor_clas.image_std)

size = (
    processor_clas.size["shortest_edge"]
    if "shortest_edge" in processor_clas.size
    else (processor_clas.size["height"], processor_clas.size["width"])
)

_transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])

In [None]:
def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples


# Final Data Preparation
For both Author and Style, we create a separate dataset by removing the unwanted columns and renaming the target column to 'label', and then split each dataset into train and test sets. We chose a test size of 0.3 because a slightly larger test set worked better for us.
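
With 20,000 rows, a test size of 0.3 leaves 14,000 examples for training and 6,000 for testing. The sketch below illustrates the idea of a shuffled split in plain Python; it is not how the `datasets` library implements `train_test_split`, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_rows: int, test_size: float, seed: int = 0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(n_rows=20_000, test_size=0.3)
print(len(train_idx), len(test_idx))  # 14000 6000
```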

In [None]:
#Author
auth_data = dataset.remove_columns(['style','title']).rename_column('author','label')
auth_dataset = auth_data.train_test_split(test_size=0.3)
auth_dataset = auth_dataset.with_transform(transforms)

In [None]:
#Style
sty_data = dataset.remove_columns(['author','title']).rename_column('style','label')
sty_dataset = sty_data.train_test_split(test_size=0.3)
sty_dataset = sty_dataset.with_transform(transforms)

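To train the models through the `Trainer` class of the transformers library, we need a metrics function that evaluates the model while it is training. The function described below computes accuracy, precision, recall and F1 score. The code cell that follows keeps this four-metric version commented out and activates an accuracy-only function, which matches the `metric_for_best_model="accuracy"` setting used for both models.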

1. **Function Definition (`compute_metrics`):**
    - This function takes a `pred` parameter, which holds the model's raw predictions (logits) and the true labels for the evaluation set.
    - It reads the true labels (`labels`) from `pred.label_ids` and derives the predicted labels (`preds`) by taking the argmax of `pred.predictions`.
    
2. **Metrics Calculation:**
    - **Accuracy:** 
        - The accuracy is calculated using the `accuracy_score` function from scikit-learn, comparing true labels (`labels`) and predicted labels (`preds`).
    - **Precision, Recall, and F1-score:** 
        - Precision, recall, and F1-score are calculated using `precision_score`, `recall_score`, and `f1_score` functions, respectively, from scikit-learn.
        - These metrics are computed with the 'weighted' averaging strategy, which averages the per-class scores weighted by each class's number of true samples. The `zero_division` parameter is set to 0 so that a class with no predicted samples (precision) or no true samples (recall) scores 0 instead of triggering a warning.
        
3. **Return Statement:**
    - The function returns a dictionary containing the computed evaluation metrics (`accuracy`, `precision`, `recall`, `f1`).

## Reasoning:
- **Metric Selection:**
    - Accuracy, precision, recall, and F1-score are commonly used metrics for evaluating classification models.
    - Accuracy provides an overall assessment of the model's correctness.
    - Precision measures the model's ability to correctly identify positive instances out of all instances predicted as positive.
    - Recall measures the model's ability to correctly identify positive instances out of all actual positive instances.
    - F1-score is the harmonic mean of precision and recall, providing a balance between the two metrics.

- **Weighted Averaging:**
    - Using weighted averaging for precision, recall, and F1-score is suitable for handling class imbalance scenarios where some classes may have significantly more samples than others.
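
The weighted averaging described above can be worked through by hand. The following plain-Python sketch computes support-weighted precision, recall and F1 for a toy example; it illustrates the formula and is not a replacement for the scikit-learn functions. `weighted_scores` is a hypothetical helper.

```python
from collections import Counter

def weighted_scores(labels, preds):
    """Support-weighted precision, recall and F1, with 0 used for 0/0 cases."""
    support = Counter(labels)
    total = len(labels)
    precision = recall = f1 = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        p_cls = tp / n_pred if n_pred else 0.0
        r_cls = tp / n_true
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = n_true / total  # share of true samples belonging to this class
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return precision, recall, f1

# Toy example: class 0 is three times as frequent as class 1.
labels = [0, 0, 0, 1]
preds = [0, 0, 1, 1]
precision, recall, f1 = weighted_scores(labels, preds)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.875 0.75 0.767
```

In this example the weighted recall (0.75) equals the accuracy, because three of the four predictions are correct.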


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Four-metric variant described above, kept for reference (not active).
# def compute_metrics(pred):
#     labels = pred.label_ids
#     preds = pred.predictions.argmax(-1)
#
#     # Calculate accuracy
#     accuracy = accuracy_score(labels, preds)
#
#     # Calculate precision, recall, and F1-score
#     precision = precision_score(labels, preds, average='weighted', zero_division=0)
#     recall = recall_score(labels, preds, average='weighted', zero_division=0)
#     f1 = f1_score(labels, preds, average='weighted', zero_division=0)
#
#     return {
#         'accuracy': accuracy,
#         'precision': precision,
#         'recall': recall,
#         'f1': f1
#     }

import numpy as np
import evaluate

# datasets.load_metric is no longer available in recent releases of datasets;
# the evaluate library (installed above) provides the replacement.
metric = evaluate.load("accuracy")

def compute_metrics(p):
    # Accuracy only: returns a dict of the form {"accuracy": value}
    return metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)

The following code initializes and trains a classification model using the Hugging Face `Trainer` interface. It also defines training arguments (`auth_training_args`) and sets up the training process.
This setup is used for both Author and Style labels.

## Code Explanation:
1. **Training Arguments Initialization (`auth_training_args`):**
    - The code initializes a `TrainingArguments` object named `auth_training_args` with various parameters required for training the classification model.
    - Key parameters include:
        - `output_dir`: Specifies the directory where model checkpoints and other outputs will be saved.
        - `evaluation_strategy`: Sets the evaluation strategy to perform evaluation at the end of each epoch.
        - `learning_rate`: Sets the initial learning rate for the optimizer.
        - `per_device_train_batch_size`: Specifies the batch size per GPU for training data.
        - `gradient_accumulation_steps`: Accumulates gradients over multiple steps to effectively increase the batch size.
        - `num_train_epochs`: Specifies the total number of training epochs.
        - `metric_for_best_model`: Specifies the metric to monitor for determining the best model during training.
        - `push_to_hub`: Specifies whether to push the trained model to the Hugging Face model hub after training.

2. **Trainer Initialization (`auth_trainer`):**
    - The code initializes a `Trainer` object named `auth_trainer` for training the classification model.
    - Key arguments passed to the `Trainer` object include:
        - `model` / `model_init`: Specifies the classification model. During the hyperparameter search, a `model_init` function (`model_init_auth`) creates a fresh model for every trial; for the final training, the model (`model_clas_auth`) is passed directly.
        - `args`: Specifies the training arguments (`auth_training_args`) defined earlier.
        - `data_collator`: Specifies the data collator object for batch processing.
        - `train_dataset`: Specifies the training dataset (`auth_dataset['train']`).
        - `eval_dataset`: Specifies the evaluation dataset (`auth_dataset['test']`).
        - `tokenizer`: Receives the image processor (`processor_clas`) so that it is saved alongside the model checkpoints; no text is tokenized.
        - `compute_metrics`: Specifies the function for computing evaluation metrics during training.

3. **Training Process (`auth_trainer.train()`):**
    - Initiates the training process using the `train()` method of the `Trainer` object (`auth_trainer`).
    - The model is trained for the specified number of epochs (`num_train_epochs`).
    - Training progress, evaluation metrics, and checkpoints are saved based on the parameters specified in the training arguments.

4. **Memory Management (`torch.cuda.empty_cache()`):**
    - Clears the unused memory caches on the GPU after training to free up memory resources.

## Reasoning:
- **Training Configuration:**
    - The training arguments (`auth_training_args`) define various parameters crucial for training, such as batch size, learning rate, and evaluation strategy.
    - These parameters are set based on empirical observations, best practices, and the specific requirements of the training task.

- **Trainer Setup:**
    - The `Trainer` object (`auth_trainer`) encapsulates the training process, including data loading, model training, evaluation, and checkpointing.
    - By utilizing the Hugging Face `Trainer` interface, the code simplifies the training pipeline and provides convenient access to various training functionalities.


In [None]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

In [None]:
def model_init_auth(trial):
    return AutoModelForImageClassification.from_pretrained(
        checkpoint_clas,
        num_labels = len(labels_author),
        id2label = id2label_auth,
        label2id = label2id_auth
    ).to(device)

def model_init_sty(trial):
    return AutoModelForImageClassification.from_pretrained(
        checkpoint_clas,
        num_labels = len(labels_sty),
        id2label = id2label_sty,
        label2id = label2id_sty
    ).to(device)


# Training Configuration

We define training arguments to configure various parameters for the training process, including batch size, learning rate, and evaluation strategy. These parameters are crucial for optimizing the training process and ensuring effective model learning.

## Training Arguments

The training arguments specify key parameters:

- `output_dir`: Directory to save the model checkpoints.
- `evaluation_strategy`: Frequency of evaluation during training (here once per epoch).
- `per_device_train_batch_size`: Batch size for training.
- `per_device_eval_batch_size`: Batch size for evaluation.
- `num_train_epochs`: Number of epochs to train the model.
- `save_steps`: Number of steps between each checkpoint save (not set here, because `save_strategy="epoch"` saves once per epoch).
- `eval_steps`: Number of steps between each evaluation (not set here, because evaluation runs once per epoch).
- `logging_dir`: Directory to save training logs (left at its default here).
- `learning_rate`: Learning rate for the optimizer.

In [None]:
auth_training_args = TrainingArguments(
    output_dir="model_checkpoints/auth",
    remove_unused_columns=False,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    save_total_limit=2,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)



auth_trainer = Trainer(
    # model = model_clas_auth,
    model_init= model_init_auth,
    args = auth_training_args,
    data_collator=data_collator,
    train_dataset = auth_dataset['train'],
    eval_dataset = auth_dataset['test'],
    tokenizer = processor_clas,
    compute_metrics = compute_metrics,
)

In [None]:
#HyperParameter Search

def optuna_hp_space(trial):
    return {
        # Optuna does not accept step together with log=True, so only the log scale is kept
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [4, 8, 16]),
        "gradient_accumulation_steps": trial.suggest_int("gradient_accumulation_steps", 1, 4, step=1),
        "per_device_eval_batch_size": trial.suggest_categorical("per_device_eval_batch_size", [4, 8, 16]),
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.1, 0.3, step=0.1),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 4, step=1),
    }

In [None]:
best_trial_auth = auth_trainer.hyperparameter_search(n_trials=100, 
                                                 backend="optuna",
                                                 hp_space=optuna_hp_space, 
                                                 direction="maximize",)

In [None]:
torch.cuda.empty_cache()
sty_training_args = TrainingArguments(
    output_dir="model_checkpoints/sty",
    remove_unused_columns=False,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    save_total_limit=2,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_ratio=0.1,
    metric_for_best_model="accuracy",
    push_to_hub=False,
)




sty_trainer = Trainer(
    # model = vit_model_sty,
    model_init= model_init_sty,
    args = sty_training_args,
    data_collator=data_collator,
    train_dataset = sty_dataset['train'],
    eval_dataset = sty_dataset['test'],
    tokenizer = processor_clas,
    compute_metrics = compute_metrics,
)


We then run the same hyperparameter search for the style model.

In [None]:
best_trials_sty = sty_trainer.hyperparameter_search(n_trials=100, 
                                                 backend="optuna",
                                                 hp_space=optuna_hp_space, 
                                                 direction="maximize",)

# Model Training Using Best Hyperparameters

After the hyperparameter search has identified the best trial for each model, we retrain each model from the pre-trained checkpoint with that trial's hyperparameters (available as `best_trial_auth.hyperparameters` and `best_trials_sty.hyperparameters`).
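
One compact way to combine the fixed settings with the tuned ones is to overlay two dictionaries, as sketched below. This is an alternative to spelling out each argument, not the approach the cells below take; `build_training_kwargs` is a hypothetical helper and the search result shown is made up.

```python
def build_training_kwargs(base: dict, best: dict) -> dict:
    """Overlay tuned hyperparameters on top of the fixed settings."""
    return {**base, **best}

# Fixed settings shared by every run (values taken from the notebook).
base_kwargs = {
    "output_dir": "model_checkpoints/auth",
    "save_strategy": "epoch",
    "learning_rate": 5e-5,
    "num_train_epochs": 3,
}

# Hypothetical search result; in the notebook this dict would come from
# best_trial_auth.hyperparameters.
best_hyperparameters = {"learning_rate": 3e-5, "num_train_epochs": 4}

kwargs = build_training_kwargs(base_kwargs, best_hyperparameters)
print(kwargs["learning_rate"], kwargs["num_train_epochs"])  # 3e-05 4
# The merged dict could then be passed on as TrainingArguments(**kwargs).
```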

In [None]:
best_hyperparameters_auth = best_trial_auth.hyperparameters
auth_training_args = TrainingArguments(
    output_dir="model_checkpoints/auth",
    remove_unused_columns=False,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    save_total_limit=2,
    learning_rate=best_hyperparameters_auth["learning_rate"],
    per_device_train_batch_size=best_hyperparameters_auth["per_device_train_batch_size"],
    gradient_accumulation_steps=best_hyperparameters_auth["gradient_accumulation_steps"],
    per_device_eval_batch_size=best_hyperparameters_auth["per_device_eval_batch_size"],
    num_train_epochs=best_hyperparameters_auth["num_train_epochs"],
    warmup_ratio=best_hyperparameters_auth["warmup_ratio"],
    metric_for_best_model="accuracy",
    push_to_hub=False,
    load_best_model_at_end=True,
)

model_clas_auth = AutoModelForImageClassification.from_pretrained(
    checkpoint_clas,
    num_labels = len(labels_author),
    id2label = id2label_auth,
    label2id = label2id_auth
).to(device)

auth_trainer = Trainer(
    model = model_clas_auth,
    # model_init= model_init_auth,
    args = auth_training_args,
    data_collator=data_collator,
    train_dataset = auth_dataset['train'],
    eval_dataset = auth_dataset['test'],
    tokenizer = processor_clas,
    compute_metrics = compute_metrics,
)
torch.cuda.empty_cache()
auth_trainer.train()

In [None]:
torch.cuda.empty_cache()
best_hyperparameters_sty = best_trials_sty.hyperparameters
sty_training_args = TrainingArguments(
    output_dir="model_checkpoints/sty",
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    learning_rate=best_hyperparameters_sty["learning_rate"],
    per_device_train_batch_size=best_hyperparameters_sty["per_device_train_batch_size"],
    gradient_accumulation_steps=best_hyperparameters_sty["gradient_accumulation_steps"],
    per_device_eval_batch_size=best_hyperparameters_sty["per_device_eval_batch_size"],
    num_train_epochs=best_hyperparameters_sty["num_train_epochs"],
    warmup_ratio=best_hyperparameters_sty["warmup_ratio"],
    metric_for_best_model="accuracy",
    push_to_hub=False,
    load_best_model_at_end=True,
)

model_clas_sty = AutoModelForImageClassification.from_pretrained(
    checkpoint_clas,
    num_labels = len(labels_sty),
    id2label = id2label_sty,
    label2id = label2id_sty
).to(device)

sty_trainer = Trainer(
    model = model_clas_sty,
    # model_init= model_init_sty,
    args = sty_training_args,
    data_collator=data_collator,
    train_dataset = sty_dataset['train'],
    eval_dataset = sty_dataset['test'],
    tokenizer = processor_clas,
    compute_metrics = compute_metrics,
)
sty_trainer.train()

In [None]:
sample = dataset[100]
image = sample['image']
width, height = image.size  # PIL returns (width, height)
display(image.resize((int(0.3 * width), int(0.3 * height))))
author = sample['author']
style = sample['style']
print(f'Author: {id2label_auth[author]}')
print(f'Style:  {id2label_sty[style]}')

In [None]:
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(checkpoint_clas)

inputs = image_processor(image, return_tensors="pt").to(device)

with torch.no_grad():
    logits_auth = model_clas_auth(**inputs).logits

In [None]:
# Reuse image_processor and inputs from the previous cell
with torch.no_grad():
    logits_sty = model_clas_sty(**inputs).logits

In [None]:
predicted_label_auth = logits_auth.argmax(-1).item()
predicted_label_sty = logits_sty.argmax(-1).item()
print(model_clas_auth.config.id2label[predicted_label_auth])
print(model_clas_sty.config.id2label[predicted_label_sty])

In [None]:
model_clas_auth.push_to_hub("Art_huggingface_auth")
model_clas_sty.push_to_hub("Art_huggingface_sty")