<a href="https://colab.research.google.com/github/wangyeye66/projects/blob/main/NLP_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Tweet Summarization

In [1]:
from transformers import pipeline

In [2]:
# Initialize a summarization pipeline using the T5-small model
# max_length and min_length are counted in tokens; max_length=10 is very short
# and can cut the summary mid-sentence, so raise it (e.g., to 50) for complete output
summarization_pipeline = pipeline("summarization", model="t5-small", max_length=10, min_length=5)



In [3]:
# tweet to summarize
long_tweet = "This is a long tweet that includes detailed information which we want to summarize to get the gist without reading the whole text."

In [4]:
# Perform summarization
summary_result = summarization_pipeline(long_tweet)
print(summary_result)


[{'summary_text': 'this is a long tweet that includes detailed'}]
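The summary above stops mid-sentence because `max_length=10` caps the output at ten tokens. One possible alternative is to scale the bounds to the input size. The helper below is a minimal sketch: `summary_length_bounds`, its `ratio` default, and the use of whitespace-separated words as a stand-in for T5 tokens are all assumptions for illustration, not part of the notebook.

```python
def summary_length_bounds(text, ratio=0.75, min_length=5):
    """Return generation length bounds scaled to the input size.

    Whitespace-separated words are only a rough proxy for T5 tokens,
    so treat the result as a starting point, not an exact budget.
    """
    n_words = len(text.split())
    # Keep max_length strictly above min_length even for very short inputs
    max_length = max(min_length + 1, int(n_words * ratio))
    return {"max_length": max_length, "min_length": min_length}


long_tweet = (
    "This is a long tweet that includes detailed information which we want "
    "to summarize to get the gist without reading the whole text."
)
bounds = summary_length_bounds(long_tweet)
print(bounds)

# Hypothetical usage with the pipeline defined above:
# summary_result = summarization_pipeline(long_tweet, **bounds)
```

Passing the bounds at call time overrides the values given when the pipeline was created, which is the same mechanism the inference loop later in this notebook relies on.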


### Train a summarization model

In [2]:
%%capture
!pip install datasets
!pip install accelerate -U
!pip install transformers[torch] -U

In [5]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, TrainingArguments, Trainer
from datasets import Dataset
import pandas as pd

# Check if GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [6]:
# Sample tweets with reference summaries
# Replace with a larger dataset for real training
data = {
    'tweet': [
        "Just tried the new cafe in town. Amazing coffee, great ambiance, and friendly staff. Highly recommend!",
        "Attended the tech conference yesterday. Great talks, learned a lot, and networked with peers. Worth the time!",
        "Had dinner at Julie's. The steak was perfectly cooked, and the wine selection was top-notch.",
        "Went for a run in the park. Beautiful weather, clear skies, and a lot of people enjoying the day.",
        "Finished reading 'The Great Gatsby'. Such a profound story, beautifully written, and thought-provoking."
    ],
    'summary': [
        "Great experience at the new cafe with excellent coffee and ambiance.",
        "The tech conference was informative and valuable for networking.",
        "Enjoyed a perfect steak dinner and wine at Julie's.",
        "Had a pleasant run in the park with nice weather.",
        "Loved 'The Great Gatsby' for its deep story and beautiful writing."
    ]
}


In [7]:
# Convert to DataFrame
tweets_df = pd.DataFrame(data)

# Split the data into train and validation sets (though it's not very meaningful with such a small dataset)
train_df, val_df = tweets_df.iloc[:4], tweets_df.iloc[4:]

# Convert the DataFrames to Hugging Face Dataset objects
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

# Initialize the tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')




In [8]:
# Tokenization function to preprocess the text and summary
def tokenize_function(examples):
    model_inputs = tokenizer(examples['tweet'], padding="max_length", truncation=True, max_length=512)
    # Tokenize the targets (summaries); `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=examples['summary'], padding="max_length", truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
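Because the targets are padded to `max_length=128`, most label positions in these short summaries are padding. A common practice with Hugging Face sequence-to-sequence models is to set those positions to `-100`, the value PyTorch's cross-entropy loss ignores, so that padding does not dominate the loss. The sketch below shows the idea on a toy batch; `mask_padding` and the token ids are made up for illustration and are not part of the notebook.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id):
    """Return a copy of a batch of label sequences with padding ids masked."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Toy batch: 0 stands in for tokenizer.pad_token_id, the other ids are invented
toy_labels = [[71, 1712, 1, 0, 0], [37, 1, 0, 0, 0]]
masked = mask_padding(toy_labels, pad_token_id=0)
print(masked)
```

Inside `tokenize_function`, this would be applied as `model_inputs["labels"] = mask_padding(labels["input_ids"], tokenizer.pad_token_id)`. Expect the reported loss values to change once padding is excluded.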


In [9]:
# Tokenize the train and validation datasets
train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)



In [10]:
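# Define model
# Note: this loads 't5-base' while the tokenizer above comes from 't5-small'.
# T5 checkpoints share one vocabulary, so the pair works, but using a single
# checkpoint name for both keeps the notebook consistent.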
model = T5ForConditionalGeneration.from_pretrained('t5-base').to(device)

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,  # far more than the 3 optimizer steps this run takes
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()


Epoch,Training Loss,Validation Loss
1,No log,16.518744
2,No log,16.513927
3,No log,16.504278


TrainOutput(global_step=3, training_loss=18.26825714111328, metrics={'train_runtime': 5.1648, 'train_samples_per_second': 2.323, 'train_steps_per_second': 0.581, 'total_flos': 7307494686720.0, 'train_loss': 18.26825714111328, 'epoch': 3.0})
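The validation loss barely moves across the three epochs. One plausible contributor is the learning-rate schedule: with `warmup_steps=500` and only 3 optimizer steps (`global_step=3`), training never leaves the warmup phase. The sketch below reproduces a linear warmup followed by linear decay, which is how the Trainer's default schedule is understood to behave. `linear_warmup_lr` is a hypothetical helper, and the base rate of `5e-5` is assumed to be the `TrainingArguments` default since the notebook does not set `learning_rate`.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` scheduler updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 5e-5   # assumed default learning_rate
TOTAL_STEPS = 3  # 4 examples / batch size 4, times 3 epochs

for step in range(1, TOTAL_STEPS + 1):
    lr = linear_warmup_lr(step, BASE_LR, warmup_steps=500, total_steps=TOTAL_STEPS)
    print(step, lr)
```

Under these assumptions the rate stays on the order of 1e-7 for the whole run, so the weights hardly change. Setting `warmup_steps` well below the total step count, or training on far more data, would let the schedule reach the base rate.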

In [3]:
# Save the model (run this in the same session as training, so that
# `model` and `tokenizer` are still defined)
model.save_pretrained("./summarization_model")
tokenizer.save_pretrained("./summarization_model")

NameError: name 'model' is not defined

In [4]:
# Run inference on new tweets
from transformers import pipeline
# Load the trained model and tokenizer using the pipeline
summarization_model_path = "./summarization_model"
summarization_pipeline = pipeline("summarization", model=summarization_model_path, tokenizer=summarization_model_path)



In [6]:
import pandas as pd

# sample of new tweets
new_tweets = [
    "Exploring the world one step at a time 🌍 #travel #adventure",
    "Nothing beats home-cooked meals #foodie #homecooking",
    "Just finished a 10k run and feeling great! #fitness #running",
    "Diving into the world of coding. It's challenging but rewarding. #coding #technology",
    "Coffee and books, the perfect morning. #relax #reading",
    "Sustainability should be our priority. Let's make a difference. #environment #sustainability",
    "Art is the expression of the soul. Visited an amazing gallery today. #art #culture",
    "Family time is the best time. Cherishing these moments. #family #love",
    "Backyard gardening is my new hobby. Nature is incredible. #gardening #nature",
    "Exploring local markets is a great way to understand a culture. #travel #local"
]


# Perform summarization on new tweets
summaries = []
for tweet in new_tweets:
    summary = summarization_pipeline(tweet, max_length=15, min_length=5)
    summaries.append(summary[0]['summary_text'])

df_tweets = pd.DataFrame({
    "Original Tweet": new_tweets,
    "Summarized Tweet": summaries
})

In [7]:
df_tweets

,Original Tweet,Summarized Tweet
0,Exploring the world one step at a time 🌍 #trav...,#travel #adventure . #
1,Nothing beats home-cooked meals #foodie #homec...,Nothing beats home-cooked meals #foodie #homeco
2,Just finished a 10k run and feeling great! #fi...,just finished a 10k run and feeling great! #fi...
3,Diving into the world of coding. It's challeng...,Diving into the world of coding. it's challeng...
4,"Coffee and books, the perfect morning. #relax ...","coffee and books, the perfect morning. #relax ..."
5,Sustainability should be our priority. Let's m...,sustainability should be our priority. Let's m...
6,Art is the expression of the soul. Visited an ...,visit an amazing gallery today. #art #culture .
7,Family time is the best time. Cherishing these...,family time is the best time. Cherishing these...
8,Backyard gardening is my new hobby. Nature is ...,backyard gardening is my new hobby. Nature is ...
9,Exploring local markets is a great way to unde...,#travel #local is a great way to understand a ...
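Several rows in the table look like the start of the original tweet with different casing, which suggests the model is copying rather than summarizing. A quick way to flag such rows is a case-insensitive prefix check. The sketch below uses only the three rows whose summaries are shown in full in the table; `is_prefix_copy` is a hypothetical helper, not part of the notebook.

```python
def is_prefix_copy(original, summary):
    """True when the summary is just the start of the original, ignoring case."""
    return original.lower().startswith(summary.lower().strip())


# (original tweet, summarized tweet) pairs for rows 0, 1 and 6 of the table
pairs = [
    ("Exploring the world one step at a time 🌍 #travel #adventure",
     "#travel #adventure . #"),
    ("Nothing beats home-cooked meals #foodie #homecooking",
     "Nothing beats home-cooked meals #foodie #homeco"),
    ("Art is the expression of the soul. Visited an amazing gallery today. #art #culture",
     "visit an amazing gallery today. #art #culture ."),
]

flags = [is_prefix_copy(original, summary) for original, summary in pairs]
print(flags)
```

On the full DataFrame the same check could be applied row by row with `df_tweets.apply(lambda r: is_prefix_copy(r["Original Tweet"], r["Summarized Tweet"]), axis=1)`; a high share of flagged rows would indicate the fine-tuning run had little effect.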
