# Training a T5-base Model with How2/WikiHow/CNN Data

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks, each of which is converted into a text-to-text format. T5 works well on a variety of tasks out of the box: a task-specific prefix prepended to the input tells the model which task to perform, e.g., "summarize: " for summarization.
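
As a minimal sketch of the prefix convention described above: the helper `add_task_prefix` and the `TASK_PREFIXES` table are hypothetical names introduced for illustration, not part of SimpleT5. The "summarize: " prefix is the one used later in this notebook; the translation prefix is included only as a second example of the same pattern.

```python
# Hypothetical helper for illustration; not part of SimpleT5 or this notebook.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_de": "translate English to German: ",
}


def add_task_prefix(text: str, task: str) -> str:
    """Prepend the task prefix to text, unless it is already there."""
    prefix = TASK_PREFIXES[task]
    if text.startswith(prefix):
        return text
    return prefix + text


example = add_task_prefix("The quick brown fox jumped over the lazy dog.", "summarization")
print(example)
```

Checking for an existing prefix keeps the helper safe to call twice on the same text, which is easy to do by accident when notebook cells are re-run.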



In [1]:
from sklearn.model_selection import train_test_split
from simplet5 import SimpleT5
import pandas as pd
import logging
import os

Global seed set to 42


In [2]:
########
# DATA #
########


# Load the datasets from the CSV files created by the EDA notebooks

data_dir = os.path.join(os.getcwd(), "data")
cnn_dailymail_df = pd.read_csv(os.path.join(data_dir, "cnn_dm_df.csv"), encoding="utf-8")
wikihow_df = pd.read_csv(os.path.join(data_dir, "wikihow_df.csv"), encoding="utf-8")
how2_df = pd.read_csv(os.path.join(data_dir, "how2_df.csv"), encoding="utf-8")

# Keep only articles with more than 200 words so there is enough text to summarize

how2_df = how2_df[how2_df['num_words_article_pp1'] > 200]
wikihow_df = wikihow_df[wikihow_df['num_words_article_pp1'] > 200]
cnn_dailymail_df = cnn_dailymail_df[cnn_dailymail_df['num_words_article_pp1'] > 200]


In [3]:
# We experimented with restricting the number of rows taken from each dataset to see
# whether it makes a difference. Tweaking the order and number of rows had minimal impact.
# We also tried combining the data from the various datasets to see if it makes a difference.

# Note: We have reduced the number of rows used for training here so that the notebook runs in minutes, not hours.
# Training our full model took about 12 hours.
# We configured the notebook to run faster in case the Instructor/TA runs the code.

how2_df = how2_df.sample(n=50).reset_index(drop=True)
wikihow_df = wikihow_df.sample(n=50).reset_index(drop=True)
cnn_dailymail_df = cnn_dailymail_df.sample(n=50).reset_index(drop=True)

how2_df = how2_df[['article','summary']]
cnn_dailymail_df = cnn_dailymail_df[['article','summary']]
wikihow_df = wikihow_df[['article','summary']]

merged_df = pd.concat([how2_df, wikihow_df, cnn_dailymail_df], axis=0)
merged_df.columns = ['source_text', 'target_text']

# Character length of each target summary, sorted in ascending order
lens = merged_df['target_text'].astype(str).map(len)
lens.sort_values()


28      81
26      92
10     101
18     101
48     108
      ... 
19     799
10     813
4      827
17    1156
6     1560
Name: target_text, Length: 150, dtype: int64
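
The lengths above are character counts, while the training call later in this notebook limits targets by token count (`target_max_token_len`). The sketch below is a rough way to relate the two, assuming whitespace-separated words as a stand-in for tokens. The function name `count_over_word_budget` and the sample summaries are made up for illustration. The T5 tokenizer splits text into subword pieces, so a summary usually has at least as many tokens as words. The count is therefore only a conservative estimate of how many targets would be truncated; an exact figure would need the actual tokenizer.

```python
from typing import Iterable


def count_over_word_budget(texts: Iterable[str], max_words: int) -> int:
    """Count texts whose whitespace-separated word count exceeds max_words.

    Words are only a proxy for tokens (an assumption of this sketch).
    """
    return sum(1 for text in texts if len(str(text).split()) > max_words)


# Illustrative, made-up summaries (not taken from the datasets above).
sample_summaries = [
    "learn how to prune a rose bush",
    "a short tip",
    "focus on technique rather than raw strength when you practise punching",
]
print(count_over_word_budget(sample_summaries, max_words=5))
```

With the merged data, the same function could be applied to `merged_df['target_text']` using the token limit as the budget.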

In [4]:
# T5 expects a task-specific prefix. Since this is a summarization task, we prepend "summarize: " to each source text.
merged_df['source_text'] = "summarize: " + merged_df['source_text']
print(merged_df)

                                          source_text  \
0   summarize: hi , my name is andrea orta mashbur...   
1   summarize: as you can see i 'm , i 'm here wit...   
2   summarize: hi everybody , i 'm john graden fro...   
3   summarize: hi , i 'm scott reil and on behalf ...   
4   summarize: all right folks , in this clip i 'm...   
..                                                ...   
45  summarize: President Barack Obama, in Hawaii f...   
46  summarize: What does the eldest of four daught...   
47  summarize: In an episode of animated series Fu...   
48  summarize: (CNN) -- U.S. Secretary of State Jo...   
49  summarize: Before she fell in love with a 10-y...   

                                          target_text  
0   playing games with children while babysitting ...  
1   the wheels of a basketball wheelchair slant ou...  
2   when determining the qualifications of a marti...  
3   learn why the right spot for each plant is ess...  
4   when drawing an angel , try to 

In [5]:
train_df, test_df = train_test_split(merged_df, test_size=0.2)
train_df.shape, test_df.shape

((120, 2), (30, 2))

In [6]:
train_df

Unnamed: 0,source_text,target_text
22,"summarize: i 'm beau timken , master sake somm...",some popular daiginjo sake brands include onig...
3,"summarize: hi , i 'm scott reil and on behalf ...",learn why the right spot for each plant is ess...
39,summarize: Think about your message:\n\n\nCon...,Prepare your communication before you deliver ...
1,summarize: By . Daily Mail Reporter . PUBLISHE...,"Lorraine Sullivan crashed her car in 2010, kil..."
43,summarize: By . Cherie Blair . PUBLISHED: . 20...,Cherie launches programme to help third-world ...
...,...,...
46,summarize: What does the eldest of four daught...,In a hilarious photo shared on the Huffington ...
18,"summarize: on behalf of expertvillage.com , my...",learn how to structure your board of directors...
2,summarize: Good punches are efficient punches...,Focus on your punching technique to improve po...
49,summarize: hi my name is madeline franco and i...,more than 300 different species of parrots com...


In [7]:
# Train the T5 model with CNN/How2/Wikihow data

model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-base")


model.train(train_df=train_df,  # pandas DataFrame with 2 columns: source_text & target_text
            eval_df=test_df,  # pandas DataFrame with 2 columns: source_text & target_text
            source_max_token_len=512,
            target_max_token_len=128,
            batch_size=5,
            max_epochs=5,
            use_gpu=False)




GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 42


Training: -1it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]
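
Training writes checkpoint directories whose names encode the epoch and training loss, as in the `outputs/simplet5-epoch-7-train-loss-0.9977` path loaded in the next cell. The sketch below shows one possible way to pick a checkpoint by parsing such names. The name pattern is inferred from that single example, and the helper names are hypothetical, so treat it as an assumption rather than a documented SimpleT5 format. The lowest training loss also does not guarantee the best summaries, so this is only a starting point.

```python
import re
from pathlib import Path
from typing import Iterable, Optional, Tuple

# Pattern assumed from the directory name seen in this notebook, e.g.
# "simplet5-epoch-7-train-loss-0.9977"; anything after the loss is ignored.
CKPT_PATTERN = re.compile(r"^simplet5-epoch-(\d+)-train-loss-(\d+(?:\.\d+)?)")


def parse_checkpoint_name(name: str) -> Optional[Tuple[int, float]]:
    """Return (epoch, train_loss) for a matching name, else None."""
    match = CKPT_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def lowest_train_loss_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the name with the lowest training loss, or None if none match."""
    candidates = []
    for name in names:
        parsed = parse_checkpoint_name(name)
        if parsed is not None:
            epoch, loss = parsed
            candidates.append((loss, epoch, name))
    if not candidates:
        return None
    return min(candidates)[2]


# Only scan the directory if it exists, so the sketch runs before any training.
outputs_dir = Path("outputs")
if outputs_dir.is_dir():
    print(lowest_train_loss_checkpoint(p.name for p in outputs_dir.iterdir()))
```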

In [8]:
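# Quick sanity check: load a saved checkpoint and summarize a sample passage.
# Note: with max_epochs = 5, the training cell above does not produce an epoch-7 checkpoint;
# this path presumably comes from our longer training run. Adjust it to a checkpoint directory under outputs/.

model.load_model("t5","outputs/simplet5-epoch-7-train-loss-0.9977", use_gpu=False)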

text_to_summarize="""Contemporary climate change includes both the global warming caused by humans, and its impacts on Earth's weather patterns. There have been previous periods of climate change, but the current changes are more rapid than any known events in Earth's history. The main cause is the emission of greenhouse gases, mostly carbon dioxide (CO2) and methane. Burning fossil fuels for energy use creates most of these emissions. Agriculture, steel making, cement production, and forest loss are additional sources.[3] Temperature rise is also affected by climate feedbacks such as the loss of sunlight-reflecting snow cover, and the release of carbon dioxide from drought-stricken forests. Collectively, these amplify global warming.
CO2 concentrations over the last 800,000 years as measured from ice cores (blue/green) and directly (black)
On land, temperatures have risen about twice as fast as the global average. Deserts are expanding, while heat waves and wildfires are becoming more common. Increased warming in the Arctic has contributed to melting permafrost, glacial retreat and sea ice loss. Higher temperatures are also causing more intense storms and other weather extremes. In places such as coral reefs, mountains, and the Arctic, many species are forced to relocate or become extinct, as their environment changes. Climate change threatens people with food and water scarcity, increased flooding, extreme heat, more disease, and economic loss. It can also drive human migration.  The World Health Organization calls climate change the greatest threat to global health in the 21st century. Even if efforts to minimise future warming are successful, some effects will continue for centuries. These include sea level rise, and warmer, more acidic oceans.
"""
model.predict(text_to_summarize)

["Temperatures have risen more rapidly than the global average. Increased rainfall and heat waves are also causing more severe storms and wildfires.[2]Human activity contributes to climate change.[4]Weather patterns have changed dramatically over the last 800,000 years.Climate change is driven by human activities.Earth's population is expected to double in the next few decades.More people live in remote areas due to climate change.Alternative effects of climate change on ecosystems"]
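
The prediction above contains leftover citation markers such as `[2]` and sentences that run together without a space. A minimal post-processing sketch, assuming simple regular-expression cleanup is acceptable, is shown below. `tidy_summary` is a hypothetical helper, not part of SimpleT5, and the spacing rule is a heuristic that would also split abbreviations such as "U.S.".

```python
import re


def tidy_summary(text: str) -> str:
    """Light cleanup of a generated summary (heuristic, for illustration)."""
    # Drop bracketed numeric citation markers such as "[2]".
    text = re.sub(r"\[\d+\]", "", text)
    # Add a space after ., ! or ? when an uppercase letter follows directly.
    # Caveat: this also splits abbreviations like "U.S." into "U. S.".
    text = re.sub(r"([.!?])(?=[A-Z])", r"\1 ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


raw = "more severe storms and wildfires.[2]Human activity contributes to climate change.[4]Weather patterns have changed."
print(tidy_summary(raw))
```

Since `model.predict` returns a list of strings, the helper could be applied to each element, e.g. `[tidy_summary(s) for s in model.predict(text_to_summarize)]`.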