<a href="https://colab.research.google.com/github/solovastru/01CompensationExercise/blob/master/MachineLearning_Clickbait.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
!pip install transformers[torch] datasets evaluate rouge_score


In [2]:
import os
import numpy as np
import transformers
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

# libs for evaluation
import evaluate
import rouge_score
rouge = evaluate.load("rouge")




# Loading the dataset

In [18]:
from datasets import load_dataset, DatasetDict

In [19]:
cnn_dailymail = load_dataset("abisee/cnn_dailymail", "3.0.0")
cnn_dailymail

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [20]:
small_cnn_dailymail = DatasetDict(
    train = cnn_dailymail["train"].shuffle(seed=24).select(range(800)),
    validation= cnn_dailymail["validation"].shuffle(seed=24).select(range(300)),
    test = cnn_dailymail["test"].shuffle(seed=24).select(range(150))
)

small_cnn_dailymail

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 800
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 300
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 150
    })
})
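
The subsets above are taken by shuffling with a fixed seed and keeping the first rows, which makes the selection repeatable across runs. A minimal sketch of that idea in plain Python follows; `shuffled_subset` is a hypothetical helper, and it will not reproduce the exact rows chosen by `datasets`, which uses its own random generator.

```python
import random
from typing import List


def shuffled_subset(num_rows: int, size: int, seed: int) -> List[int]:
    """Return `size` row indices taken from a seeded shuffle of range(num_rows)."""
    indices = list(range(num_rows))
    # A seeded generator yields the same permutation on every run
    random.Random(seed).shuffle(indices)
    return indices[:size]


# Same split sizes as the notebook: 800 training rows out of 287113
train_idx = shuffled_subset(287113, 800, seed=24)
print(len(train_idx), len(set(train_idx)))
```

Because the seed is fixed, rerunning the cell yields the same indices, so metrics from different runs are computed on the same examples.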

# Pre-processing the dataset

In [3]:
checkpoint_small = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(checkpoint_small)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [5]:
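def preprocess_function(item):
  """Tokenize articles as model inputs and highlights as target labels."""
  # Targets: reference summaries, truncated to 56 tokens
  labels = tokenizer(text=item["highlights"], max_length=56, truncation=True)
  # Inputs: full articles, truncated to 400 tokens
  inputs = tokenizer(text=item["article"], max_length=400, truncation=True)

  model_inputs = {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}
  model_inputs["labels"] = labels["input_ids"]
  return model_inputs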



In [21]:
tokenized_small_cnn_dailymail = small_cnn_dailymail.map(preprocess_function, batched=True)
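
With `batched=True`, the mapped function receives a dictionary whose values are lists (one entry per example) rather than a single example. The sketch below illustrates that calling convention with a toy stand-in tokenizer; `toy_tokenize` and `toy_preprocess` are hypothetical names, and the "token ids" are just word lengths, not real T5 ids.

```python
from typing import Dict, List


def toy_tokenize(texts: List[str], max_length: int) -> Dict[str, List[List[int]]]:
    """Hypothetical tokenizer: one id per whitespace-separated word, truncated."""
    input_ids = [[len(word) for word in text.split()][:max_length] for text in texts]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def toy_preprocess(batch: Dict[str, List[str]]) -> Dict[str, List[List[int]]]:
    # Each column arrives as a list, so the whole batch is tokenized in one call
    inputs = toy_tokenize(batch["article"], max_length=5)
    labels = toy_tokenize(batch["highlights"], max_length=3)
    return {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
        "labels": labels["input_ids"],
    }


batch = {
    "article": ["one two three four five six seven", "short text"],
    "highlights": ["a b c d", "x"],
}
features = toy_preprocess(batch)
print(features)
```

Note that truncation shortens long sequences but nothing is padded at this stage, so the sequences in `features` still have different lengths.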

## Data collator

In [13]:
# `model` expects a model instance, not a checkpoint name; it is optional, so it is omitted here
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)
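
The collator pads every batch to the length of its longest sequence. Inputs are padded with the tokenizer's pad token and labels with `-100`, a value the loss function ignores. The following is a rough pure-Python sketch of that behavior, not the library's implementation; `pad_batch` is a hypothetical helper, and the pad id of 0 is T5's pad token id.

```python
from typing import Dict, List

PAD_TOKEN_ID = 0     # T5's pad token id
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss


def pad_batch(features: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input - len(f["input_ids"])
        label_pad = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * input_pad)
        # The mask is 1 for real tokens and 0 for padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * input_pad)
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * label_pad)
    return batch


padded = pad_batch([
    {"input_ids": [11, 12, 13], "labels": [21, 1]},
    {"input_ids": [14], "labels": [22, 23, 1]},
])
print(padded)
```

Padding per batch rather than to a global maximum keeps short batches small, which is why the tokenization step can leave sequences unpadded.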

## Evaluation metrics

In [14]:
def compute_metrics(eval_pred):
   """Decode predictions and labels, then score them with ROUGE."""
   predictions, labels = eval_pred
   decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
   # Label padding is -100 (ignored by the loss); swap in the pad token so the labels can be decoded
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
   result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
   # Average generated length, counted in non-padding tokens
   prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
   result["gen_len"] = np.mean(prediction_lens)
   return {k: round(v, 4) for k, v in result.items()}
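
To make the reported scores easier to interpret, here is a simplified sketch of what ROUGE-1 measures: the F1 score over unigrams shared by a prediction and its reference. It assumes lowercase whitespace tokenization and no stemming, so its numbers will not match the `rouge` metric exactly; `rouge1_f1` is a hypothetical helper.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over overlapping unigrams (no stemming)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Counter intersection keeps, per word, the smaller of the two counts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score, 4))  # precision 1.0, recall 0.5 -> 0.6667
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram overlap.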

# Loading the generated dataset

In [8]:
import pandas as pd

In [9]:
file_path = "/content/Title_generation_dataset - Лист1 (1).csv"
df_generated_dataset = pd.read_csv(file_path)

df_generated_dataset

,article,highlights,title,clickbate_title
0,The fashion industry is shifting towards susta...,The fashion industry is embracing sustainabili...,The Future of Sustainable Fashion,Revolutionizing Fashion: How These Game-Changi...
1,Electric vehicles (EVs) are revolutionizing th...,Electric vehicles are transforming the automot...,The Rise of Electric Vehicles: Driving Towards...,Shocking Secrets Revealed: How Electric Vehicl...
2,Embracing a plant-based diet has gained popula...,A plant-based diet is becoming popular for its...,The Power of Plant-Based Eating: How Going Veg...,The Plant-Based Revolution: How Going Vegan Ca...
3,"In today's fast-paced world, mindfulness is em...",Mindfulness is essential for managing stress i...,Unveiling the Healing Power of Mindfulness,Mindfulness: The Secret Weapon for Conquering ...
4,Urban gardening is transforming cityscapes wor...,"Urban gardening is revitalizing cities, provid...",The Rise of Urban Gardening: A Green Revolutio...,Transform Your City Life: The Surprising Benef...
5,"In today's fast-paced world, finding time to c...","Nature's benefits include reduced stress, bett...",The Unexpected Health Benefits of Spending Tim...,Discover the Shocking Health Benefits of Natur...
6,Plant-based diets are gaining popularity world...,A plant-based diet is becoming popular for its...,The Rise of Plant-Based Diets,Shocking! Plant-Based Diets: The Surprising Ke...
7,Exercise is essential for physical and mental ...,Outdoor exercise boosts mental and physical he...,The Benefits of Outdoor Exercise: A Breath of ...,Outdoor Exercise Unveiled: The Secret Weapon f...
8,Staying hydrated is crucial for overall health...,"Drinking water is vital for health, maintainin...",The Importance of Drinking Water,Hydration Hacks: Unlocking the Secrets to Opti...
9,Reading books isn't just a hobby; it's a gatew...,"Reading enhances knowledge, reduces stress, an...",The Benefits of Reading Books,Reading Revolution: The Astonishing Benefits o...


In [10]:
new_df_generated_dataset = df_generated_dataset[["article", "highlights"]]
new_df_generated_dataset

,article,highlights
0,The fashion industry is shifting towards susta...,The fashion industry is embracing sustainabili...
1,Electric vehicles (EVs) are revolutionizing th...,Electric vehicles are transforming the automot...
2,Embracing a plant-based diet has gained popula...,A plant-based diet is becoming popular for its...
3,"In today's fast-paced world, mindfulness is em...",Mindfulness is essential for managing stress i...
4,Urban gardening is transforming cityscapes wor...,"Urban gardening is revitalizing cities, provid..."
5,"In today's fast-paced world, finding time to c...","Nature's benefits include reduced stress, bett..."
6,Plant-based diets are gaining popularity world...,A plant-based diet is becoming popular for its...
7,Exercise is essential for physical and mental ...,Outdoor exercise boosts mental and physical he...
8,Staying hydrated is crucial for overall health...,"Drinking water is vital for health, maintainin..."
9,Reading books isn't just a hobby; it's a gatew...,"Reading enhances knowledge, reduces stress, an..."


In [11]:
tokenized_generated_dataset = new_df_generated_dataset.apply(preprocess_function, axis=1)
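
Applying the function row by row with `axis=1` produces one dictionary per row, whereas the tokenized CNN/DailyMail splits are stored column-wise (one list per feature). A minimal sketch of converting between the two layouts is shown below; `rows_to_columns` is a hypothetical helper and the token ids are made up.

```python
from typing import Dict, List


def rows_to_columns(rows: List[Dict[str, List[int]]]) -> Dict[str, List[List[int]]]:
    """Gather a list of per-row dicts into one dict of per-feature lists."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


# Two rows in the shape preprocess_function returns for a single example
row_outputs = [
    {"input_ids": [5, 6, 1], "attention_mask": [1, 1, 1], "labels": [7, 1]},
    {"input_ids": [8, 1], "attention_mask": [1, 1], "labels": [9, 10, 1]},
]
columns = rows_to_columns(row_outputs)
print(columns["labels"])
```

The column-wise layout is the one a data collator and trainer can batch directly.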

# Training the model



In [27]:
output_dir = "/content/drive/MyDrive/t5_small_spoilers"

In [29]:
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_small)


training_args = Seq2SeqTrainingArguments(
   output_dir= output_dir,
   evaluation_strategy="epoch",
   learning_rate=1.6e-5,
   per_device_train_batch_size=6,
   per_device_eval_batch_size=6,
   weight_decay=0.01,
   save_total_limit=3,
   num_train_epochs=3,
   predict_with_generate=True,
   fp16=False,  # Disable mixed precision training
   generation_max_length=50,
)

In [30]:
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_small_cnn_dailymail["train"],
   eval_dataset=tokenized_small_cnn_dailymail["validation"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)



In [31]:
trainer.train()

trainer.save_model(output_dir)

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,No log,2.021363,0.3831,0.1774,0.2856,0.2851,47.78
2,No log,1.980838,0.3862,0.1769,0.2864,0.2866,47.5633
3,No log,1.972191,0.3822,0.1746,0.2829,0.2831,47.5933
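
The "No log" entries in the Training Loss column are most likely a consequence of the run being shorter than the logging interval. The arithmetic below is a sketch under stated assumptions: a single device, no gradient accumulation, and the `logging_steps` default of 500 (none of which is set explicitly in the training arguments above).

```python
import math

NUM_TRAIN_EXAMPLES = 800
TRAIN_BATCH_SIZE = 6
NUM_EPOCHS = 3
ASSUMED_LOGGING_STEPS = 500  # assumption: library default, not set in this notebook

# The last, smaller batch of each epoch still counts as a step
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS

print(steps_per_epoch, total_steps)  # 134 402
print(total_steps < ASSUMED_LOGGING_STEPS)  # True: training ends before the first log
```

Under the same assumptions, a smaller `logging_steps` value (for example 50) would make the training loss appear in the table. A step-based checkpoint interval of 500 would likewise never be reached in a 402-step run.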


# Evaluation

In [33]:
results = trainer.predict(tokenized_small_cnn_dailymail["test"])
print(results)

PredictionOutput(predictions=array([[    0, 19039,  3853, ...,     5,  6697,   228],
       [    0, 10352,  5007, ...,  3062,   141, 18639],
       [    0, 15961,  4625, ...,    11,  9357,  1699],
       ...,
       [    0,  5659, 21954, ...,     0,     0,     0],
       [    0, 11340,    15, ...,  4261,   203,     3],
       [    0, 26133,  2409, ...,     0,     0,     0]]), label_ids=array([[19039,  3853,  4999, ...,    56,    43,     1],
       [10352,  5007,     9, ...,     5,  2599,     1],
       [15961,  4625,    29, ...,  1687,     6,     1],
       ...,
       [12961, 12530,  3853, ...,  2331,    91,     1],
       [11340,    15,    15, ...,   141,   306,     1],
       [24998,    16,  6394, ...,    95,   274,     1]]), metrics={'test_loss': 2.003023624420166, 'test_rouge1': 0.3812, 'test_rouge2': 0.1793, 'test_rougeL': 0.2815, 'test_rougeLsum': 0.2822, 'test_gen_len': 47.2733, 'test_runtime': 264.4289, 'test_samples_per_second': 0.567, 'test_steps_per_second': 0.095})
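
The throughput figures in `metrics` can be cross-checked against the test set size and runtime. This is a small sanity-check sketch that takes its numbers from the output above and assumes a single device, so that 150 examples at batch size 6 give 25 evaluation steps.

```python
import math

TEST_SAMPLES = 150
EVAL_BATCH_SIZE = 6
TEST_RUNTIME_SECONDS = 264.4289  # 'test_runtime' from the output above

eval_steps = math.ceil(TEST_SAMPLES / EVAL_BATCH_SIZE)
samples_per_second = round(TEST_SAMPLES / TEST_RUNTIME_SECONDS, 3)
steps_per_second = round(eval_steps / TEST_RUNTIME_SECONDS, 3)

print(eval_steps, samples_per_second, steps_per_second)  # 25 0.567 0.095
```

Both values agree with the reported `test_samples_per_second` and `test_steps_per_second`. At roughly 1.8 seconds per example, most of the time is probably spent generating summaries rather than computing ROUGE.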


In [None]:
from transformers import pipeline

# Load the final model written by trainer.save_model(output_dir)
summarizer = pipeline("summarization", model=output_dir)

In [None]:
test_text = small_cnn_dailymail["test"]["article"][:1]
summarizer(test_text)

In [None]:
text = " Green tea is more than just a soothing beverage; it's packed with health benefits that make it a smart choice for daily consumption. Firstly, green tea is rich in antioxidants, particularly catechins, which help to fight inflammation and protect cells from damage. This can lower the risk of chronic diseases like heart disease and certain cancers. Moreover, green tea contains compounds that may boost metabolism and promote fat loss, making it a valuable tool for weight management when combined with a healthy diet and exercise. Additionally, green tea has been linked to improved brain function and a reduced risk of cognitive decline with aging. The combination of caffeine and L-theanine in green tea can enhance alertness and focus without the jittery side effects associated with coffee. Lastly, green tea may support overall longevity and promote healthy aging thanks to its protective effects on various aspects of health. Incorporating green tea into your daily routine can be a simple yet effective way to support your health and well-being."

In [None]:
summarizer(text)