In [4]:
# pip install rouge_score
from operator import itemgetter
import pandas as pd
from rouge_score import rouge_scorer

In [8]:
# Read the validation data
validation_df = pd.read_csv('data/validation.csv')

rouge_scores = []
# Initialize Rouge Scorer                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
scorer = rouge_scorer.RougeScorer(['rougeL'])

# Function that generates summaries using LEAD-N
def lead_summary(text: pd.core.series.Series, titles: pd.core.series.Series, scorer: rouge_scorer.RougeScorer):
    summaries = []
    for idx, row in text.items():
        sentences = row.split(".")
        summaries.append([idx, sentences[0] + "."])
    return summaries

# Function that generates summaries using EXT-ORACLE
def ext_oracle_summary(text: pd.core.series.Series, titles: pd.core.series.Series, scorer: rouge_scorer.RougeScorer):
    summaries = []
    for idx, row in text.items():
        sentences = row.split(".")
        reference = titles.iloc[idx]
        rs = [scorer.score(sentence, reference)['rougeL'][2] for sentence in sentences]
        index, element = max(enumerate(rs), key=itemgetter(1))
        summaries.append([idx, sentences[index]])  
    return summaries
    

In [9]:
lead_summaries = lead_summary(validation_df['text'], validation_df['titles'], scorer)
ext_oracle_summaries = ext_oracle_summary(validation_df['text'], validation_df['titles'], scorer)

lead_rouge = []
ext_oracle_rouge = []
# Calculate the rouge-l score for each of the generated summaries compared to the original titles
for idx, title in validation_df['titles'].items():
    lead_rouge.append(scorer.score(lead_summaries[idx][1], title)['rougeL'][2])
    ext_oracle_rouge.append(scorer.score(ext_oracle_summaries[idx][1], title)['rougeL'][2])

avg_rouge_score_lead = sum(lead_rouge) / len(lead_rouge)
avg_rouge_score_ext_oracle = sum(ext_oracle_rouge) / len(ext_oracle_rouge)

print("Average Rouge-L F-Score with LEAD-1: ", avg_rouge_score_lead)
print("Average Rouge-L F-Score with EXT-ORACLE:", avg_rouge_score_ext_oracle)

# Store the generated summaries in the Kaggle-accepted format
lead_submission_df = pd.DataFrame(lead_summaries, columns=['ID', 'titles'])
ext_oracle_submission_df = pd.DataFrame(ext_oracle_summaries, columns=['ID', 'titles'])
lead_submission_df.to_csv('lead_submission.csv', index=False)
ext_oracle_submission_df.to_csv('ext_oracle_submission.csv', index=False)

Average Rouge-L F-Score with LEAD-1:  0.1535873817959459
Average Rouge-L F-Score with EXT-ORACLE: 0.31354067145919595


For summarization in NLP, especially focusing on title generation, various models and techniques can be employed. Here are some of the prominent ones, along with brief explanations of their use cases:

### 1. Sequence-to-Sequence (Seq2Seq) Models
Seq2Seq models are fundamental in NLP for tasks involving text generation. They work by encoding a source text into a fixed-dimensional context vector and then decoding this vector to produce the output text. For title generation, the input would be the content of an article or document, and the output would be the generated title.

### 2. Attention Mechanisms
Attention mechanisms improve Seq2Seq models by allowing the decoder to focus on different parts of the input sequence during the generation process, improving the relevance of the generated titles to the content.

### 3. Transformer Models
Introduced in the paper "Attention is All You Need" by Vaswani et al., transformers have become the backbone of modern NLP, surpassing Seq2Seq models in performance. They are based entirely on attention mechanisms and are highly effective in generating summaries and titles due to their ability to capture long-range dependencies in text.

### 4. Pre-trained Language Models
Pre-trained models such as GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and their variants (e.g., RoBERTa, T5, BART) can be fine-tuned for the specific task of title generation. These models have been trained on vast amounts of text and have an excellent understanding of language, which makes them highly effective for generating coherent and contextually relevant titles.

#### Fine-tuning Approach:
- **GPT-3 for Direct Generation:** Given a prompt that includes the content, GPT-3 or similar models can directly generate a title based on the given context.
- **T5/BART for Text-to-Text Tasks:** T5 and BART are designed explicitly for text-to-text tasks, such as translation, summarization, and, by extension, title generation. They can be fine-tuned by framing the title generation as a summarization problem where the "summary" is the title of the input text.

### 5. Extractive Summarization Techniques
While not directly designed for title generation, extractive summarization models can identify key phrases or sentences within a text. These key elements can inspire or be directly used in generating titles, especially for academic papers or articles where titles are often descriptive and concise.

When choosing a model for title generation, consider the specific requirements of your task, such as the desired level of creativity, the importance of context preservation, and the available computational resources. Pre-trained models like GPT-3 or T5, due to their versatility and state-of-the-art performance, are often the go-to choice for tasks requiring high-quality text generation, including title generation.

In [3]:
import pandas as pd

data_path = 'data/train.csv'
data_val = 'data/validation.csv'
data_text = 'data/test_text.csv'

# Load the training data
data = pd.read_csv(data_path)

# Load the validation data
data_val = pd.read_csv(data_val)

# Load the test data
data_text = pd.read_csv(data_text)

# Display the first 5 rows of the dataframe
print("training data:......................................................")
print(data.head())

print("validation data:...................................................................")
print(data_val.head())

print("test data:................................................................................................")
print(data_text.head())

training data:......................................................
                                                text  \
0  Thierry Mariani sur la liste du Rassemblement ...   
1  C'est désormais officiel : Alain Juppé n'est p...   
2  La mesure est décriée par les avocats et les m...   
3  Dans une interview accordée au Figaro mercredi...   
4  Le préjudice est estimé à 2 millions d'euros. ...   

                                              titles  
0  L'information n'a pas été confirmée par l'inté...  
1  Le maire de Bordeaux ne fait plus partie des R...  
2  En 2020, les tribunaux d'instance fusionnent a...  
3  Les médecins jugés "gros prescripteurs d'arrêt...  
4  Il aura fallu mobiliser 90 gendarmes pour cett...  
validation data:...................................................................
                                                text  \
0  Sur les réseaux sociaux, les images sont impre...   
1  La vidéo est devenue virale. Elle montre un po...   
2  Depuis la

# Ecoder and decoder with LSTM layers 

In [22]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from sklearn.model_selection import train_test_split
import numpy as np
from tqdm import tqdm

In [46]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
device =torch.device('cpu')

cuda


In [23]:
class TextDataset(Dataset):
    def __init__(self, texts, titles, tokenizer, max_length):
        self.texts = [torch.tensor(tokenizer.encode(text)[:max_length]) for text in texts]
        self.titles = [torch.tensor(tokenizer.encode(title)[:max_length]) for title in titles]

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.titles[idx]

# Custom function to pad sequences and combine them into a batch
def collate_fn(batch):
    texts, titles = zip(*batch)
    texts_padded = pad_sequence(texts, batch_first=True, padding_value=0)
    titles_padded = pad_sequence(titles, batch_first=True, padding_value=0)
    return texts_padded, titles_padded

# Define the Seq2Seq Model
class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Seq2Seq, self).__init__()
        # Define encoder
        self.encoder = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # Define decoder
        self.decoder = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        # Shared embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Fully connected layer to get output tokens
        self.fc = nn.Linear(hidden_dim, vocab_size)
        
    def forward(self, src, trg):
        embedded_src = self.embedding(src)
        encoder_outputs, (hidden, cell) = self.encoder(embedded_src)
        
        embedded_trg = self.embedding(trg)
        decoder_outputs, _ = self.decoder(embedded_trg, (hidden, cell))
        
        # Prediction
        output = self.fc(decoder_outputs)
        return output

In [24]:
#!pip install torchtext

In [25]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from pandas import concat

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data_iter):
    for text_tuple in data_iter:
        text = text_tuple[0]  # Assuming the text is the first element
        yield tokenizer(text)

combined_texts = concat([data['text'], data_val['text']])

# Now, use combined_texts in the iterator
vocab = build_vocab_from_iterator(yield_tokens(combined_texts), specials=["<unk>", "<pad>", "<bos>", "<eos>"])
vocab.set_default_index(vocab["<unk>"])


In [26]:
vocab_size = len(vocab)


In [27]:
from torch.utils.data import DataLoader, Dataset

class CustomDataset(Dataset):
    def __init__(self, texts, titles, vocab, tokenizer, max_length):
        self.texts = [torch.tensor([vocab[token] for token in tokenizer(text)], dtype=torch.long) for text in texts]
        self.titles = [torch.tensor([vocab[token] for token in tokenizer(title)], dtype=torch.long) for title in titles]
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx][:self.max_length]
        title = self.titles[idx][:self.max_length]
        return text, title

# Assuming a maximum sequence length for padding/truncation
max_length = 100
train_dataset = CustomDataset(data['text'], data['titles'], vocab, tokenizer, max_length)
val_dataset = CustomDataset(data_val['text'], data_val['titles'], vocab, tokenizer, max_length)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)


In [28]:
num_epochs = 10  # Define the number of epochs for training
vocab_size = len(vocab)  # Ensure vocab_size is correctly defined as the length of the vocabulary

# Initialize the Seq2Seq model with the correct vocabulary size
model = Seq2Seq(vocab_size=vocab_size, embedding_dim=100, hidden_dim=256)

# Training setup with optimizer and loss function
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=vocab['<pad>'])  # Assuming '<pad>' is your padding token and is in the vocabulary

for epoch in range(num_epochs):
    model.train()
    for texts, titles in tqdm(train_loader):
        optimizer.zero_grad()

        # Ensure trg input to decoder excludes the last token, and trg for calculating loss starts from the second token
        output = model(texts, titles[:, :-1])  # Decoder input excludes the last token
        
        # Reshape output for loss calculation
        output_dim = output.shape[-1]
        output = output.contiguous().view(-1, output_dim)  # Reshape for cross-entropy loss
        titles = titles[:, 1:].contiguous().view(-1)  # Target starts from the second token
        
        loss = criterion(output, titles)
        
        loss.backward()
        optimizer.step()
        
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')


  0%|          | 0/669 [00:00<?, ?it/s]

100%|██████████| 669/669 [00:56<00:00, 11.90it/s]


Epoch [1/10], Loss: 0.4463


100%|██████████| 669/669 [00:55<00:00, 12.07it/s]


Epoch [2/10], Loss: 0.4849


100%|██████████| 669/669 [00:57<00:00, 11.71it/s]


Epoch [3/10], Loss: 0.4827


100%|██████████| 669/669 [00:55<00:00, 11.98it/s]


Epoch [4/10], Loss: 0.4105


100%|██████████| 669/669 [00:56<00:00, 11.78it/s]


Epoch [5/10], Loss: 0.3696


100%|██████████| 669/669 [00:57<00:00, 11.63it/s]


Epoch [6/10], Loss: 0.4930


100%|██████████| 669/669 [00:55<00:00, 12.02it/s]


Epoch [7/10], Loss: 0.4703


100%|██████████| 669/669 [00:55<00:00, 11.98it/s]


Epoch [8/10], Loss: 0.3694


100%|██████████| 669/669 [00:55<00:00, 12.02it/s]


Epoch [9/10], Loss: 0.3308


100%|██████████| 669/669 [00:55<00:00, 12.06it/s]

Epoch [10/10], Loss: 0.2655





In [31]:
# saving the model
torch.save(model.state_dict(), 'models/seq2seq_model.pth')

# loading the model

In [47]:
def token_to_index(token, vocab):
    # Placeholder for converting a token to its corresponding index
    # Adjust this to match how your vocabulary object works
    return vocab.get_index(token, default=vocab.get_index("<unk>"))

def index_to_token(index, vocab):
    # Placeholder for converting an index back to its corresponding token
    # Adjust this to match how your vocabulary object works
    return vocab.get_token(index, default="<unk>")

In [50]:
# load the model
model = Seq2Seq(vocab_size=vocab_size, embedding_dim=100, hidden_dim=256)   
model.load_state_dict(torch.load('models/seq2seq_model.pth'))
model.eval()

# Function to generate summaries using the trained Seq2Seq model
# tensor transformation to vocab 
print(type(vocab))
print(vocab)

stoi = vocab.get_stoi()  # Get string-to-index mapping
itos = vocab.get_itos()

def generate_summary(texts, model, vocab, tokenizer, max_length, device):
    summaries = []
    model.to(device)  # Ensure the model is on the correct device
    for text in tqdm(texts):
        tokens = tokenizer(text)
        indices = [stoi[token] if token in stoi else stoi["<unk>"] for token in tokens]
        sequence = torch.tensor(indices, dtype=torch.long).unsqueeze(0).to(device)
        summary_indices = [stoi['<bos>']]
        for _ in range(max_length):
            input_tensor = torch.tensor(summary_indices, dtype=torch.long).unsqueeze(0).to(device)
            with torch.no_grad():
                output = model(sequence, input_tensor)
            prediction = output.argmax(2)[:,-1].item()
            summary_indices.append(prediction)
            if prediction == stoi['<eos>']:
                break
        summary_tokens = [itos[idx] for idx in summary_indices]  # Convert indices back to tokens
        summary = ' '.join(summary_tokens).replace('<bos>', '').replace('<eos>', '')
        summaries.append(summary)
    return summaries


# Generate summaries for the validation data
val_summaries = generate_summary(data_val['text'], model, vocab, tokenizer, max_length, device)

# Calculate the rouge-l score for each of the generated summaries compared to the original titles
rouge_scores = []
for summary, title in zip(val_summaries, data_val['titles']):
    scores = scorer.score(summary, title)
    rouge_scores.append(scores['rougeL'][2])
    
avg_rouge_score = sum(rouge_scores) / len(rouge_scores)
print("Average Rouge-L F-Score:", avg_rouge_score)



<class 'torchtext.vocab.vocab.Vocab'>
Vocab()


  0%|          | 0/1500 [00:00<?, ?it/s]

  2%|▏         | 28/1500 [00:23<20:59,  1.17it/s]


KeyboardInterrupt: 

# Abstraction-based summarization

In [51]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ines\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ines\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [64]:
#importing libraries
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import bs4 as BeautifulSoup
import urllib.request  

#fetching the content from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

article_read = fetched_data.read()

#parsing the URL content and storing in a variable
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

#returning <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

#looping through the paragraphs and adding them to the variable
for p in paragraphs:  
    article_content += p.text


def _create_dictionary_table(text_string) -> dict:
   
    #removing stop words
    stop_words = set(stopwords.words("english"))
    
    words = word_tokenize(text_string)
    
    #reducing words to their root form
    stem = PorterStemmer()
    
    #creating dictionary for the word frequency table
    frequency_table = dict()
    for wd in words:
        wd = stem.stem(wd)
        if wd in stop_words:
            continue
        if wd in frequency_table:
            frequency_table[wd] += 1
        else:
            frequency_table[wd] = 1

    return frequency_table


def _calculate_sentence_scores(sentences, frequency_table) -> dict:   

    #algorithm for scoring a sentence by its words
    sentence_weight = dict()

    for sentence in sentences:
        sentence_wordcount = (len(word_tokenize(sentence)))
        sentence_wordcount_without_stop_words = 0
        for word_weight in frequency_table:
            if word_weight in sentence.lower():
                sentence_wordcount_without_stop_words += 1
                if sentence[:7] in sentence_weight:
                    sentence_weight[sentence[:7]] += frequency_table[word_weight]
                else:
                    sentence_weight[sentence[:7]] = frequency_table[word_weight]

        sentence_weight[sentence[:7]] = sentence_weight[sentence[:7]] / sentence_wordcount_without_stop_words

       

    return sentence_weight

def _calculate_average_score(sentence_weight) -> int:
   
    #calculating the average score for the sentences
    sum_values = 0
    for entry in sentence_weight:
        sum_values += sentence_weight[entry]

    #getting sentence average value from source text
    average_score = (sum_values / len(sentence_weight))

    return average_score

def _get_article_summary(sentences, sentence_weight, threshold):
    sentence_counter = 0
    article_summary = ''

    for sentence in sentences:
        if sentence[:7] in sentence_weight and sentence_weight[sentence[:7]] >= (threshold):
            article_summary += " " + sentence
            sentence_counter += 1

    return article_summary

def _run_article_summary(article):
    
    #creating a dictionary for the word frequency table
    frequency_table = _create_dictionary_table(article)
    #print (frequency_table) as pandas dataframe
    print(pd.DataFrame(frequency_table.items(), columns=['Word', 'Frequency']))

    #tokenizing the sentences
    print("sentences")
    sentences = sent_tokenize(article)
    print(sentences)

    #algorithm for scoring a sentence by its words
    sentence_scores = _calculate_sentence_scores(sentences, frequency_table)
    print("printing the sentence scores"    )
    print(pd.DataFrame(sentence_scores.items(), columns=['Sentence', 'Score']))
    
    #getting the threshold
    threshold = _calculate_average_score(sentence_scores)
    print("printing the threshold")
    print(threshold)
    
    #producing the summary
    article_summary = _get_article_summary(sentences, sentence_scores, 1.5 * threshold)

    return article_summary

if __name__ == '__main__':
    summary_results = _run_article_summary(article_content)
    print(summary_results)

         Word  Frequency
0        20th         14
1     centuri         31
2       began          3
3           1          2
4     januari          1
..        ...        ...
789     trait          1
790   practic          1
791  electron          1
792    travel          1
793   medicin          1

[794 rows x 2 columns]
sentences
['The 20th century began on  1 January 1901 (MCMI), and ended on 31 December 2000 (MM).', '[1][2]  It was the 10th and last century of the 2nd millennium and was marked by new models of scientific understanding, unprecedented scopes of warfare, new modes of communication that would operate at nearly instant speeds, and new forms of art and entertainment.', 'Population growth was also unprecedented,[3] as the century started with around 1.6 billion people, and ended with around 6.2 billion.', '[4]\nThe 20th century was dominated by significant geopolitical events that reshaped the political and social structure of the globe: World War I, the Spanish flu pande

In [65]:
data = pd.read_csv('data/validation.csv')

# Assuming each row in the CSV contains an article or paragraph in a column named 'text'
# Let's summarize the first article/paragraph for demonstration
article_content = data.iloc[0]['text']
print(article_content)

# Now we can apply the summarization code to `article_content`
summary_results = _run_article_summary(article_content)
print(summary_results)
print(len(summary_results))

Sur les réseaux sociaux, les images sont impressionnantes. Dimanche matin à Venise, l'équipage du MSC Opéra a perdu le contrôle du paquebot, à son arrivée dans le port de la cité des Doges. Le navire, qui peut contenir plus de 2.600 passagers, est venu heurter le quai auquel il voulait s'arrimer. Le paquebot a raclé le quai sur plusieurs mètres, suscitant la panique des personnes à terre, avant de percuter un autre bateau touristique, le Michelangelo, stoppant ainsi sa course. Des témoins ont filmé la scène. Les vidéos montrent des touristes courant pour tenter de fuir le paquebot, qui ne semble pas vouloir s'arrêter. Quatre personnes ont été blessées dans cet accident : deux légèrement, tandis que les deux autres ont été transportées à l'hôpital pour des examens. L'incident s'est produit à San Basilio-Zaterre, dans le canal de la Giudecca, où de nombreux navires de croisière s'arrêtent pour permettre à leurs passagers de visiter Venise.Selon le quotidien italien Corriere della Serra, 