<a href="https://colab.research.google.com/github/thezsanett/textsummarization/blob/main/textsum.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data loading

## CNN/DailyMail dataset

Source: the MatchSum repository, https://github.com/maszhongming/MatchSum

In [2]:
!gdown 1FG4oiQ6rknIeL2WLtXD0GWyh6pBH9-hX
!unzip ACL2020_data.zip

Downloading...
From: https://drive.google.com/uc?id=1FG4oiQ6rknIeL2WLtXD0GWyh6pBH9-hX
To: /content/ACL2020_data.zip
100% 1.95G/1.95G [00:25<00:00, 76.4MB/s]
Archive:  ACL2020_data.zip
  inflating: test_CNNDM_bert.jsonl   
  inflating: test_CNNDM_roberta.jsonl  
  inflating: train_CNNDM_bert.jsonl  
  inflating: train_CNNDM_roberta.jsonl  
  inflating: val_CNNDM_bert.jsonl    
  inflating: val_CNNDM_roberta.jsonl  


## Other datasets
- WikiHow dataset
- PubMed dataset
- XSum dataset
- MultiNews dataset
- Reddit dataset


In [3]:
!gdown 1y23-pRrn1D31zfs1FjhpLA5bK0YBwQco
!unzip ACL2020_other_datasets.zip

Downloading...
From: https://drive.google.com/uc?id=1y23-pRrn1D31zfs1FjhpLA5bK0YBwQco
To: /content/ACL2020_other_datasets.zip
100% 1.44G/1.44G [00:26<00:00, 54.5MB/s]
Archive:  ACL2020_other_datasets.zip
  inflating: test_multinews.jsonl    
  inflating: test_pubmed.jsonl       
  inflating: test_reddit.jsonl       
  inflating: test_wikihow.jsonl      
  inflating: test_xsum.jsonl         
  inflating: train_multinews.jsonl   
  inflating: train_pubmed.jsonl      
  inflating: train_reddit.jsonl      
  inflating: train_wikihow.jsonl     
  inflating: train_xsum.jsonl        
  inflating: val_multinews.jsonl     
  inflating: val_pubmed.jsonl        
  inflating: val_reddit.jsonl        
  inflating: val_wikihow.jsonl       
  inflating: val_xsum.jsonl          


# Dataset generation

In [1]:
import pandas as pd

Only the smaller validation files are loaded for now, so that I do not run out of RAM. The larger test and train files will be added later.

In [2]:
used_files = [
  # 'test_CNNDM_bert.jsonl',
  # 'test_multinews.jsonl',
  # 'test_pubmed.jsonl',
  # 'test_reddit.jsonl',
  # 'test_wikihow.jsonl',
  # 'test_xsum.jsonl',

  # 'train_CNNDM_bert.jsonl',
  # 'train_multinews.jsonl',
  # 'train_pubmed.jsonl',
  # 'train_reddit.jsonl',
  # 'train_wikihow.jsonl',
  # 'train_xsum.jsonl',
  
  'val_CNNDM_bert.jsonl',
  'val_multinews.jsonl',
  'val_pubmed.jsonl',
  'val_reddit.jsonl',
  'val_wikihow.jsonl',
  'val_xsum.jsonl',
]

In [3]:
frames = []

for filename in used_files:
  print(filename + ' is being loaded...')
  tmp = pd.read_json(filename, lines=True)

  # 'text' and 'summary' are stored as lists of sentences; join each list into one string
  tmp['text'] = tmp['text'].apply(' '.join)
  tmp['summary'] = tmp['summary'].apply(' '.join)

  frames.append(tmp)
  print(filename + ' was loaded into the dataframe.')

# Concatenate once at the end instead of growing the dataframe inside the loop
df = pd.concat(frames)
del frames

val_CNNDM_bert.jsonl is being loaded...
val_CNNDM_bert.jsonl was loaded into the dataframe.
val_multinews.jsonl is being loaded...
val_multinews.jsonl was loaded into the dataframe.
val_pubmed.jsonl is being loaded...
val_pubmed.jsonl was loaded into the dataframe.
val_reddit.jsonl is being loaded...
val_reddit.jsonl was loaded into the dataframe.
val_wikihow.jsonl is being loaded...
val_wikihow.jsonl was loaded into the dataframe.
val_xsum.jsonl is being loaded...
val_xsum.jsonl was loaded into the dataframe.
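
The larger train and test files may not fit into memory in one go. As a minimal sketch, assuming they use the same `text` and `summary` list-of-sentences layout as the validation files, they could be read in chunks via the `chunksize` option of `pd.read_json`. The helper name `load_jsonl_in_chunks` and the default chunk size are illustrative choices, not part of the original notebook.

```python
import pandas as pd

def load_jsonl_in_chunks(filename, chunksize=1000):
    """Hypothetical helper: read a JSON Lines file chunk by chunk."""
    frames = []
    # With lines=True and chunksize set, read_json yields dataframes
    # of at most `chunksize` rows instead of parsing the whole file at once
    for chunk in pd.read_json(filename, lines=True, chunksize=chunksize):
        chunk['text'] = chunk['text'].apply(' '.join)
        chunk['summary'] = chunk['summary'].apply(' '.join)
        # Keep only the two columns used later on
        frames.append(chunk[['text', 'summary']])
    return pd.concat(frames)
```

Whether this is enough to stay within the RAM limit depends on the file. It only avoids holding the raw sentence lists and the unused columns for the whole file at the same time.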


In [4]:
df = df[['text', 'summary']].copy()
df

Unnamed: 0,text,summary
0,jorge pereira won the open savoury amateur pri...,jorge pereira won the amateur prize at the wor...
1,manuel neuer accepted responsibility as bayern...,bayern munich were beat 2-0 at home by borussi...
2,the financial crash has left the poor even poo...,financial wealth of the richest 20 per cent ha...
3,"years into the future , if historians look bac...",love affair with wipes has grown recently - fr...
4,forget the bragging rights - victory in sunday...,sunday 's clash will go a long way to deciding...
...,...,...
11268,the wreck of the 3rd century trading ship aste...,timbers from a gallo-roman wreck found off gue...
11269,"david rees , who chairs the assembly 's health...",plans to ban the use of e-cigarettes in enclos...
11270,both locals and immigrants joined the protest ...,"about 30,000 people have taken part in a march..."
11271,the championship club had been in talks with t...,nottingham forest 's proposed takeover by a un...


# Data preprocessing

In [5]:
import string
import spacy
import nltk

nlp = spacy.load('en_core_web_sm')
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Only 500 data points are sampled so that the notebook runs faster. The remaining data will be added once enough GPU capacity is available.

In [6]:
#######################################
df = df.sample(n=500, replace=False, random_state=1)
df
#######################################

Unnamed: 0,text,summary
5389,police scotland said the recovery was made aft...,a man has appeared in court after cannabis wit...
5718,the beauty of baking or cooking is that the ac...,bake or cook . glam yourself up . watch a movi...
13311,"( cnn ) my name is mark goodacre , and i am a ...",religion professor mark goodacre appears in ea...
3416,the showdown took place at the eastern new yor...,"– next up , a debate with the parole board ? t..."
4828,earls was dismissed for a dangerous tackle on ...,ireland wing keith earls will miss the test ag...
...,...,...
5867,"before and after each drag , take a sip . this...",drink while you smoke . take smaller drags . e...
3162,current adult dental health surveys indicate a...,objective : this study aimed to use a utility ...
9965,the company said it was looking into the matte...,nintendo is facing more complaints about its n...
9720,"kante , 24 , and benalouane , 28 , have both p...",leicester city have signed yohan benalouane fr...
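
To illustrate what the sampling cell does: `DataFrame.sample` with a fixed `random_state` draws the same rows on every run and keeps their original index labels, which is why the index in the output above looks shuffled. A small sketch on a toy dataframe (the toy data is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'text': [f'document {i}' for i in range(10)]})

first = toy.sample(n=3, replace=False, random_state=1)
second = toy.sample(n=3, replace=False, random_state=1)

# Same seed, same rows; the original index labels are preserved
print(first.equals(second))
print(first.index.tolist())

# reset_index(drop=True) renumbers the sample from 0 if the old labels are not needed
renumbered = first.reset_index(drop=True)
print(renumbered.index.tolist())
```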


## Sentence splitting

In [7]:
df['sentences'] = df['text'].apply(nltk.sent_tokenize)
df['len_sentences'] = df['sentences'].apply(len)

## Removing punctuation

In [8]:
# Translation table that deletes every ASCII punctuation character
translator = str.maketrans('', '', string.punctuation)

df['sentences'] = df['sentences'].apply(lambda array: [sent.translate(translator) for sent in array])
del translator

df

Unnamed: 0,text,summary,sentences,len_sentences
5389,police scotland said the recovery was made aft...,a man has appeared in court after cannabis wit...,[police scotland said the recovery was made af...,3
5718,the beauty of baking or cooking is that the ac...,bake or cook . glam yourself up . watch a movi...,[the beauty of baking or cooking is that the a...,32
13311,"( cnn ) my name is mark goodacre , and i am a ...",religion professor mark goodacre appears in ea...,[ cnn my name is mark goodacre and i am a pr...,43
3416,the showdown took place at the eastern new yor...,"– next up , a debate with the parole board ? t...",[the showdown took place at the eastern new yo...,19
4828,earls was dismissed for a dangerous tackle on ...,ireland wing keith earls will miss the test ag...,[earls was dismissed for a dangerous tackle on...,9
...,...,...,...,...
5867,"before and after each drag , take a sip . this...",drink while you smoke . take smaller drags . e...,"[before and after each drag take a sip , this...",17
3162,current adult dental health surveys indicate a...,objective : this study aimed to use a utility ...,[current adult dental health surveys indicate ...,36
9965,the company said it was looking into the matte...,nintendo is facing more complaints about its n...,[the company said it was looking into the matt...,16
9720,"kante , 24 , and benalouane , 28 , have both p...",leicester city have signed yohan benalouane fr...,[kante 24 and benalouane 28 have both penn...,7
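
One side effect is visible in the table above: the source text already has spaces around punctuation, so deleting the punctuation characters leaves double spaces (for example `kante  24  and`), which later surface as `SPACE` tokens. Hyphenated words are also fused, so `four-year` becomes `fouryear`. A minimal sketch of both effects, with an optional whitespace normalization step that is an assumption and not part of the original pipeline:

```python
import string

translator = str.maketrans('', '', string.punctuation)

sentence = 'kante , 24 , and benalouane , 28'
stripped = sentence.translate(translator)
print(repr(stripped))    # 'kante  24  and benalouane  28'

# Hyphens count as punctuation, so hyphenated words are fused
fused = 'four-year'.translate(translator)
print(fused)             # fouryear

# Optional: collapse repeated whitespace and trim the ends
normalized = ' '.join(stripped.split())
print(repr(normalized))  # 'kante 24 and benalouane 28'
```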


## Tokenization



In [9]:
def tokenize(df_text_column):
    # Run the spaCy pipeline over all texts in batches
    documents = list(nlp.pipe(df_text_column, batch_size=64))

    print(len(documents))

    # Represent each document as a list of spaCy Token objects
    return [list(doc) for doc in documents]

In [10]:
df['tokenized'] = tokenize(df['sentences'].apply(lambda array: ' '.join(array)))
df

500


Unnamed: 0,text,summary,sentences,len_sentences,tokenized
5389,police scotland said the recovery was made aft...,a man has appeared in court after cannabis wit...,[police scotland said the recovery was made af...,3,"[police, scotland, said, the, recovery, was, m..."
5718,the beauty of baking or cooking is that the ac...,bake or cook . glam yourself up . watch a movi...,[the beauty of baking or cooking is that the a...,32,"[the, beauty, of, baking, or, cooking, is, tha..."
13311,"( cnn ) my name is mark goodacre , and i am a ...",religion professor mark goodacre appears in ea...,[ cnn my name is mark goodacre and i am a pr...,43,"[ , cnn, , my, name, is, mark, goodacre, , a..."
3416,the showdown took place at the eastern new yor...,"– next up , a debate with the parole board ? t...",[the showdown took place at the eastern new yo...,19,"[the, showdown, took, place, at, the, eastern,..."
4828,earls was dismissed for a dangerous tackle on ...,ireland wing keith earls will miss the test ag...,[earls was dismissed for a dangerous tackle on...,9,"[earls, was, dismissed, for, a, dangerous, tac..."
...,...,...,...,...,...
5867,"before and after each drag , take a sip . this...",drink while you smoke . take smaller drags . e...,"[before and after each drag take a sip , this...",17,"[before, and, after, each, drag, , take, a, s..."
3162,current adult dental health surveys indicate a...,objective : this study aimed to use a utility ...,[current adult dental health surveys indicate ...,36,"[current, adult, dental, health, surveys, indi..."
9965,the company said it was looking into the matte...,nintendo is facing more complaints about its n...,[the company said it was looking into the matt...,16,"[the, company, said, it, was, looking, into, t..."
9720,"kante , 24 , and benalouane , 28 , have both p...",leicester city have signed yohan benalouane fr...,[kante 24 and benalouane 28 have both penn...,7,"[kante, , 24, , and, benalouane, , 28, , h..."


## Removing stopwords


In [14]:
def stopword_filter(doc):
    filtered = []

    for token in doc:
        if not token.is_stop and not token.is_space:
            filtered.append(token)

    return filtered

## Lemmatization

In [15]:
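def lemmatize(doc):
    lemmas = []

    for token in doc:
        # Skip punctuation and tokens starting with '@'
        if token.pos_ != 'PUNCT' and token.lower_[0] != '@':
            # spaCy v2 lemmatizes pronouns to '-PRON-'; keep the lowercased word instead
            if token.lemma_ == '-PRON-':
                lemmas.append(token.lower_)
            else:
                lemmas.append(token.lemma_)

    return lemmas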

In [16]:
df['stopword_filtered_lemmas'] = df['tokenized'].apply(stopword_filter).apply(lemmatize)
df['len_lemmas'] = df['stopword_filtered_lemmas'].apply(lambda array: len(array))
df

Unnamed: 0,text,summary,sentences,len_sentences,tokenized,stopword_filtered_lemmas,len_lemmas
5389,police scotland said the recovery was made aft...,a man has appeared in court after cannabis wit...,[police scotland said the recovery was made af...,3,"[police, scotland, said, the, recovery, was, m...","[police, scotland, say, recovery, intelligence...",25
5718,the beauty of baking or cooking is that the ac...,bake or cook . glam yourself up . watch a movi...,[the beauty of baking or cooking is that the a...,32,"[the, beauty, of, baking, or, cooking, is, tha...","[beauty, bake, cooking, act, help, pass, time,...",213
13311,"( cnn ) my name is mark goodacre , and i am a ...",religion professor mark goodacre appears in ea...,[ cnn my name is mark goodacre and i am a pr...,43,"[ , cnn, , my, name, is, mark, goodacre, , a...","[cnn, mark, goodacre, professor, new, testamen...",428
3416,the showdown took place at the eastern new yor...,"– next up , a debate with the parole board ? t...",[the showdown took place at the eastern new yo...,19,"[the, showdown, took, place, at, the, eastern,...","[showdown, take, place, eastern, new, york, co...",233
4828,earls was dismissed for a dangerous tackle on ...,ireland wing keith earls will miss the test ag...,[earls was dismissed for a dangerous tackle on...,9,"[earls, was, dismissed, for, a, dangerous, tac...","[earl, dismiss, dangerous, tackle, fraser, bro...",124
...,...,...,...,...,...,...,...
5867,"before and after each drag , take a sip . this...",drink while you smoke . take smaller drags . e...,"[before and after each drag take a sip , this...",17,"[before, and, after, each, drag, , take, a, s...","[drag, sip, throat, cool, tar, stick, ideally,...",111
3162,current adult dental health surveys indicate a...,objective : this study aimed to use a utility ...,[current adult dental health surveys indicate ...,36,"[current, adult, dental, health, surveys, indi...","[current, adult, dental, health, survey, indic...",485
9965,the company said it was looking into the matte...,nintendo is facing more complaints about its n...,[the company said it was looking into the matt...,16,"[the, company, said, it, was, looking, into, t...","[company, say, look, matter, cause, wireless, ...",199
9720,"kante , 24 , and benalouane , 28 , have both p...",leicester city have signed yohan benalouane fr...,[kante 24 and benalouane 28 have both penn...,7,"[kante, , 24, , and, benalouane, , 28, , h...","[kante, 24, benalouane, 28, pen, fouryear, dea...",71


## POS-tagging

In [None]:
def pos_tag(doc):
    pos_list = []

    for token in doc:
        pos_list.append(token.pos_)

    return pos_list

In [None]:
df['pos_tagged'] = df['tokenized'].apply(pos_tag)
df

Unnamed: 0,text,summary,sentences,length,tokenized,stopword_filtered_lemmas,pos_tagged
4071,you 'll need to enter your email address and p...,open the gmail website . click the search bar ...,[you ll need to enter your email address and p...,9,"[you, ll, need, to, enter, your, email, addres...","[ll, need, enter, email, address, password, no...","[PRON, AUX, VERB, PART, VERB, PRON, NOUN, NOUN..."
4594,"magicians , like other performers , tend to st...",choose a persona . perfect your `` patter . ``...,[magicians like other performers tend to sti...,27,"[magicians, , like, other, performers, , ten...","[magician, like, performer, tend, stick, consi...","[NOUN, SPACE, ADP, ADJ, NOUN, SPACE, VERB, PAR..."
5245,"; , one way to do this is to get about 20 cind...",set aside some space in which to work and an a...,[ one way to do this is to get about 20 cinde...,25,"[ , one, way, to, do, this, is, to, get, abou...","[way, 20, cinder, block, arrange, square, laye...","[SPACE, NUM, NOUN, PART, VERB, PRON, AUX, PART..."
3665,tweet with a location you can add location inf...,– a manhattan bar that previously made waves f...,[tweet with a location you can add location in...,15,"[tweet, with, a, location, you, can, add, loca...","[tweet, location, add, location, information, ...","[VERB, ADP, DET, NOUN, PRON, AUX, VERB, NOUN, ..."
431,loose or inaccurately placed brackets that req...,objectivethe aim of this study was to evaluate...,[loose or inaccurately placed brackets that re...,16,"[loose, or, inaccurately, placed, brackets, th...","[loose, inaccurately, place, bracket, require,...","[ADJ, CCONJ, ADV, VERB, NOUN, PRON, VERB, VERB..."
...,...,...,...,...,...,...,...
6045,scotland international geoff cross has complet...,geoff cross had his beard shaved off at the lo...,[scotland international geoff cross has comple...,7,"[scotland, international, geoff, cross, has, c...","[scotland, international, geoff, cross, comple...","[PROPN, PROPN, PROPN, PROPN, AUX, VERB, PRON, ..."
35,"likely you will be working in a zoo , a safari...",be in charge of a tigress . wait for the tigre...,[likely you will be working in a zoo a safari...,16,"[likely, you, will, be, working, in, a, zoo, ...","[likely, work, zoo, safari, park, work, volunt...","[ADV, PRON, AUX, AUX, VERB, ADP, DET, NOUN, SP..."
42,nearly 300 amtrak passengers were left strande...,"acela train # 2164 lost power near mystic , co...",[nearly 300 amtrak passengers were left strand...,33,"[nearly, 300, amtrak, passengers, were, left, ...","[nearly, 300, amtrak, passenger, leave, strand...","[ADV, NUM, NOUN, NOUN, AUX, VERB, VERB, ADP, D..."
3046,"; , just right click on the archive file to op...",locate the file by opening the folder which co...,[ just right click on the archive file to ope...,8,"[ , just, right, click, on, the, archive, fil...","[right, click, archive, file, open, context, m...","[SPACE, ADV, ADV, NOUN, ADP, DET, NOUN, NOUN, ..."


# Embeddings



## Sentence-BERT

In [None]:
# TODO

## ELMo embedding

In [None]:
# TODO

# Train and validation split

In [1]:
from sklearn.model_selection import train_test_split

In [None]:
X = df[['']] # TODO
y = df['']  # TODO

In [None]:
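# 70% training; the remaining 30% is split evenly into test and validation (15% each)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
del X_tmp, y_tmp

# len() returns an int, so format it into the string instead of concatenating
print(f'Training samples: {len(y_train)}')
print(f'Validation samples: {len(y_val)}')
print(f'Testing samples: {len(y_test)}')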
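
Since `X` and `y` are still placeholders, the two-step split can be checked on toy data in the meantime. A minimal sketch with 100 made-up samples, which should come out as a 70/15/15 split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real features and targets (made up for illustration)
X = pd.DataFrame({'feature': range(100)})
y = pd.Series(range(100))

# First cut off 30% of the data, then halve that part into test and validation
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(y_train), len(y_val), len(y_test))
```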

# Modeling

In [None]:
# TODO

# Evaluation

In [None]:
# TODO