# Project: Abstractive Text Summarization using Transformers

Dataset: CNN/DailyMail Dataset for training summarization task

Reason for selecting project: The high rise of data volume and information in the internet, there's a rising need for effective concise summarization tools, to help understand information quicker.
- **Industry Value:** Transformers like BART, T5, and GPT-3.5 are used extensively by tech giants and startups because they can efficiently handle large-scale datasets. Learning how to use these models can prepare you for industry-level challeneges where processing vast amount of data efficiently is crucial.
- **Real-world Applications:**
  - **News & Media Outlets:** Summarizing long articles into concise summaries for readers who want quick insights.
  - **Research & Academia:** Summarizing research papers, and making it easier to understand findings is very useful to researchers.
  - **Business Reports:** Summarizing lengthy quarterly reports, meeting notes or internal documents. Time is money, and people at the top are very busy, summarizing reports and only explaining concepts, ideas and information at high level is crucial.
- **Adaptability:** Once we've built the summarizer for news articles, we can fine-tune another model for other fields, like medical reports, or legal documents, and any other type of industry that uses data (which is a huge portion of companies, if not all modern companies).
- **Revelance to current Industry Trends:** The Transformer architecture is at the forefront of many NLP tasks, impressive ML and AI tools like ChatGPT uses transformers. There's also a push towards automation and AI-driven content creation tools in the industry.
- **Continous Learning & Model Updating:** Industries value models that can be updated with newer data or fine-tuned for specific tasks, this project will help you gain experience in transfer learning and model fine-tuning, which are very valuable skills. And working by working with Transformers, we'll be working with cutting-edge technology that's directly applicable to various industries.

Final Goals: Develop an abstractive text summarization model using transformers that can create concise and coherent summaries for longer articles from the CNN/DailyMail dataset. Or any other news articles. We're also thinking of book chapters summarization.

Dataset selection: Below we show the actual data which is a bunch of articles that we retrieved using an API. The structure of the data is in JSON style, but we'll only be using the 'article' key to train our model because that contains the actual content of each article. We'll also be using the value inside of the 'highlight' key because that's the summary reference for that article, and that's what we'll use to compare how well our model does. In our API we specify to only retrieve the first 100 articles. The code and it's output is below so you can see this for yourself. The only preprocessing we did was the specification of getting the row that contains the 'article' key and the 'highlights' key.

Literature Review:
- **Attention is All You Neeed**, by Vaswani et al. Introduction of the Transformer architecture
- **BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding**, by Devlin et al Understanding of Transformer-based embeddings.

Exisiting Methods:
- Seq2Seq models and pointer-generator networks

Novetly:
- While traditional models like Seq2Seq and pointer-generator networks have been used for summarization, there will be a focus on fine-tuning advanced transformers like GPT-3.5, BART, and T5 for summarization tasks.

Experiment/Simulation plans:
- Baseline model: Establish a baseline with advanced Seq2Seq models (BART and T5).
- Transformer-based model: Fine-tune a transformer model (GPT-3.5) and compare its performance against the baseline.
- Evaulation Methods:
  - ROGUE scores (recall, precision, F1 score for generated summaries against the actual summary).

Reference:

[1] Derek Miller et al., "Leveraging BERT for Extractive
Text Summarization on Lectures," arXiv preprint
arXiv:1906.04165, 2019. [Online]. Available:
https://arxiv.org/abs/1906.04165

[2] Yang Liu et al., "Fine-tune BERT for Extractive
Summarization," arXiv preprint arXiv:1903.10318, 2019.
[Online]. Available: https://arxiv.org/pdf/1903.10318.pdf

[3] Nandan Thakur, Nils Reimers, Andreas Rücklé∗,
Abhishek Srivastava, Iryna Gurevych et al., "BEIR: A
Heterogeneous Benchmark for Zero-shot Evaluation of
Information Retrieval Models," arXiv preprint
arXiv:2104.08663, 2021. [Online]. Available:
https://arxiv.org/pdf/2104.08663.pdf



# Fetching all CNN/DailyMail Dataset

In [None]:
import requests

url = "https://datasets-server.huggingface.co/rows?dataset=cnn_dailymail&config=1.0.0&split=train&offset=0&length=100"
response = requests.get(url)

if response.status_code == 200:
  data = response.json()

  for entry in data:
    print(entry)
else:
  print(f"failed to fetch data, status code: {response.status_code}, error {response.text}")

features
rows
num_rows_total
num_rows_per_page
partial


Now let's check what our data looks like:

In [None]:
data

{'features': [{'feature_idx': 0,
   'name': 'article',
   'type': {'dtype': 'string', '_type': 'Value'}},
  {'feature_idx': 1,
   'name': 'highlights',
   'type': {'dtype': 'string', '_type': 'Value'}},
  {'feature_idx': 2,
   'name': 'id',
   'type': {'dtype': 'string', '_type': 'Value'}}],
 'rows': [{'row_idx': 0,
   'row': {'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I 

## Requirements:

Installations required:

In [1]:
%%shell
pip install transformers
pip install SentencePiece
pip install torch
pip install accelerate -U
pip install selenium
pip install rouge_score
pip install openai

Collecting SentencePiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: SentencePiece
Successfully installed SentencePiece-0.1.99
Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m265.7/265.7 kB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: accelerate
Successfully installed accelerate-0.25.0
Collecting selenium
  Downloading selenium-4.16.0-py3-none-any.whl (10.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.0/10.0 MB[0m [31m22.2 MB/s[0m eta [36m0:00:00[0m
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.23.2-py3-none-any.whl (461 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m461.6/461.6 kB[0m [3



Fetching Clean Data:

In [2]:
import requests

# Fetch the data
url = # hugging face api
response = requests.get(url)
articles = []
summaries = []

if response.status_code == 200:
    data = response.json()
    if isinstance(data, dict) and 'rows' in data:
        entries = data['rows']
        for entry in entries:
            articles.append(entry['row']['article'])
            summaries.append(entry['row']['highlights'])
    elif isinstance(data, list):
        for entry in data:
            articles.append(entry['row']['article'])
            summaries.append(entry['row']['highlights'])
    else:
        print("Unexpected data structure received from the API.")
else:
    print(f"Failed to fetch data. Status code: {response.status_code}. Error: {response.text}")
    exit()

In [3]:
summaries

["Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday . Young actor says he has no plans to fritter his cash away . Radcliffe's earnings from first five Potter films have been held in trust fund .",
 'Mentally ill inmates in Miami are housed on the "forgotten floor" Judge Steven Leifman says most are there as a result of "avoidable felonies" While CNN tours facility, patient shouts: "I am the son of the president" Leifman says the system is unjust and he\'s fighting for change .',
 'NEW: "I thought I was going to die," driver says . Man says pickup truck was folded in half; he just has cut on face . Driver: "I probably had a 30-, 35-foot free fall" Minnesota bridge collapsed during rush hour Wednesday .',
 'Five small polyps found during procedure; "none worrisome," spokesman says . President reclaims powers transferred to vice president . Bush undergoes routine colonoscopy at Camp David .',
 "NEW: NFL chief, Atlanta Falcons owner critical of Michael Vick's condu

Mounting Drive:

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


# Ealuation of T5 Transformer, BART, and GPT-3.5

T5 Transformer Evaulation:

In [9]:
import requests
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
import torch
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

# Initialize the T5 model and tokenizer
model_name = "t5-base"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Tokenize the data
max_length = 1024  # Max length for the articles
summary_length = 300  # Max length for the summaries

train_encodings = tokenizer(articles, truncation=True, padding='max_length', max_length=max_length, return_tensors="pt")
train_labels = tokenizer(summaries, truncation=True, padding='max_length', max_length=summary_length, return_tensors="pt")

# Prepare the dataset format for Trainer
class SummarizationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels['input_ids'][idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = SummarizationDataset(train_encodings, train_labels)

# Function to generate summaries for evaluation
def generate_summaries_for_evaluation(dataset, tokenizer, model, max_length, num_summaries=10, chunk_size=900):
    summaries = []
    for article in dataset[:num_summaries]:  # Process only the first num_summaries articles
        article_chunks = [article[i:i+chunk_size] for i in range(0, len(article), chunk_size)]
        article_summary = []

        for chunk in article_chunks:
            input_ids = tokenizer.encode("summarize: " + chunk, return_tensors="pt", max_length=max_length, truncation=True)
            summary_ids = model.generate(input_ids, num_beams=4, min_length=30, max_length=300, early_stopping=True)
            article_summary.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

        summaries.append(' '.join(article_summary))
    return summaries

# Generate summaries for evaluation (only for 10 articles)
T5_generated_summaries = generate_summaries_for_evaluation(articles, tokenizer, model, max_length)

# Function to calculate ROUGE scores (same as before)
def calculate_rouge_scores(references, candidates):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = []

    for reference, candidate in zip(references, candidates):
        scores.append(scorer.score(reference, candidate))

    return scores

# Calculate ROUGE scores (only for the summaries of the 10 articles)
rouge_scores = calculate_rouge_scores(summaries[:10], T5_generated_summaries)

# Print out the ROUGE scores for the summaries
for score in rouge_scores:
    print(f"rouge scores for T5: {score}")

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('wordnet')

def calculate_bleu_score(references, candidates):
    scores = []
    for reference, candidate in zip(references, candidates):
        reference_tokenized = [word_tokenize(reference.lower())]
        candidate_tokenized = word_tokenize(candidate.lower())
        score = sentence_bleu(reference_tokenized, candidate_tokenized)
        scores.append(score)
    return scores

def calculate_meteor_score(references, candidates):
    scores = []
    for reference, candidate in zip(references, candidates):
        reference_tokenized = word_tokenize(reference.lower())
        candidate_tokenized = word_tokenize(candidate.lower())
        score = meteor_score([reference_tokenized], candidate_tokenized)
        scores.append(score)
    return scores

bleu_scores = calculate_bleu_score(summaries, T5_generated_summaries)
meteor_scores = calculate_meteor_score(summaries, T5_generated_summaries)

T5_cosine_scores = []
for original, generated in zip(summaries, T5_generated_summaries):
    score = calculate_cosine_similarity(original, generated)
    T5_cosine_scores.append(score)

# Print the cosine similarity scores
print("Cosine Similarity Scores:", T5_cosine_scores)

# Print scores
for score in bleu_scores:
    print("BLEU Score:", score)
for score in meteor_scores:
    print("METEOR Score:", score)


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


rouge scores for T5: {'rouge1': Score(precision=0.09090909090909091, recall=0.35135135135135137, fmeasure=0.14444444444444446), 'rouge2': Score(precision=0.028169014084507043, recall=0.1111111111111111, fmeasure=0.04494382022471911), 'rougeL': Score(precision=0.055944055944055944, recall=0.21621621621621623, fmeasure=0.08888888888888889)}
rouge scores for T5: {'rouge1': Score(precision=0.10454545454545454, recall=0.6216216216216216, fmeasure=0.17898832684824903), 'rouge2': Score(precision=0.0273972602739726, recall=0.16666666666666666, fmeasure=0.047058823529411764), 'rougeL': Score(precision=0.06363636363636363, recall=0.3783783783783784, fmeasure=0.10894941634241244)}
rouge scores for T5: {'rouge1': Score(precision=0.12755102040816327, recall=0.6097560975609756, fmeasure=0.2109704641350211), 'rouge2': Score(precision=0.07692307692307693, recall=0.375, fmeasure=0.12765957446808512), 'rougeL': Score(precision=0.061224489795918366, recall=0.2926829268292683, fmeasure=0.10126582278481013

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Cosine Similarity Scores: [0.16188470996588256, 0.32486723735253, 0.2103749165886538, 0.20397867755419669, 0.1670133549848236, 0.16655185454591648, 0.3151626970404623, 0.20672677432738706, 0.28537539463396167, 0.17422596218009942]
BLEU Score: 7.576263790947843e-79
BLEU Score: 6.619719816443538e-79
BLEU Score: 0.07049918802222783
BLEU Score: 9.536975164800799e-79
BLEU Score: 0.011705582380646744
BLEU Score: 1.947545109576649e-155
BLEU Score: 0.01729461145453025
BLEU Score: 0.056747040290454735
BLEU Score: 8.394101205137021e-79
BLEU Score: 7.663972691889274e-79
METEOR Score: 0.20567416617611167
METEOR Score: 0.2616479511426321
METEOR Score: 0.3590982286634461
METEOR Score: 0.3453804347826087
METEOR Score: 0.2164215386099885
METEOR Score: 0.1763967047987292
METEOR Score: 0.2298589477327796
METEOR Score: 0.3348929138184845
METEOR Score: 0.3296207438507633
METEOR Score: 0.3010021690231362


In [10]:
T5_generated_summaries[0]

'he will be able to gamble in a casino buy a drink in a pub or see the horror film Hostel Part II currently six places below. he will be able to see the horror film Hostel Part II currently six places below. despite his growing fame and riches the actor says he is keeping his feet firmly on the ground. his latest outing as the boy wizard in Harry Potter and the Order of the Phoenix is breaking records on both sides of the Atlantic. he has filmed a tv movie called My Boy Jack about author Rudyard Kipling and his son due for release later this year He will also appear in December Boys an Australian film about four boys who escape an orphanage. he made his stage debut playing a tortured teenager in Peter Shaffer s Equus earlier this year.'

BART Transformer Evaluation:

In [11]:
import requests
from transformers import BartTokenizer, BartForConditionalGeneration
import torch
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

# Initialize the BART model and tokenizer
model_name = "facebook/bart-large-cnn"  # A popular BART model for summarization
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Tokenize the data
max_length = 1024  # Max length for the articles
summary_length = 300  # Max length for the summaries

train_encodings = tokenizer(articles, truncation=True, padding='max_length', max_length=max_length, return_tensors="pt")
train_labels = tokenizer(summaries, truncation=True, padding='max_length', max_length=summary_length, return_tensors="pt")

# Prepare the dataset format for Trainer
class SummarizationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels['input_ids'][idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = SummarizationDataset(train_encodings, train_labels)

# Function to generate summaries for evaluation
def generate_summaries_for_evaluation(dataset, tokenizer, model, max_length, num_summaries=10, chunk_size=900):
    summaries = []
    for article in dataset[:num_summaries]:
        article_chunks = [article[i:i+chunk_size] for i in range(0, len(article), chunk_size)]
        article_summary = []

        for chunk in article_chunks:
            input_ids = tokenizer.encode(chunk, return_tensors="pt", max_length=max_length, truncation=True)
            summary_ids = model.generate(input_ids, num_beams=4, min_length=30, max_length=300, early_stopping=True)
            article_summary.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

        summaries.append(' '.join(article_summary))
    return summaries

# Generate summaries for evaluation (only for 10 articles)
BART_generated_summaries = generate_summaries_for_evaluation(articles, tokenizer, model, max_length)

# Function to calculate ROUGE scores (same as before)
def calculate_rouge_scores(references, candidates):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = []

    for reference, candidate in zip(references, candidates):
        scores.append(scorer.score(reference, candidate))

    return scores

# Calculate ROUGE scores (only for the summaries of the 10 articles)
rouge_scores = calculate_rouge_scores(summaries[:10], BART_generated_summaries)

# Print out the ROUGE scores for the summaries
for score in rouge_scores:
    print(f"rouge scores for BART: {score}")

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('wordnet')

def calculate_bleu_score(references, candidates):
    scores = []
    for reference, candidate in zip(references, candidates):
        reference_tokenized = [word_tokenize(reference.lower())]
        candidate_tokenized = word_tokenize(candidate.lower())
        score = sentence_bleu(reference_tokenized, candidate_tokenized)
        scores.append(score)
    return scores

def calculate_meteor_score(references, candidates):
    scores = []
    for reference, candidate in zip(references, candidates):
        reference_tokenized = word_tokenize(reference.lower())
        candidate_tokenized = word_tokenize(candidate.lower())
        score = meteor_score([reference_tokenized], candidate_tokenized)
        scores.append(score)
    return scores

bleu_scores = calculate_bleu_score(summaries, BART_generated_summaries)
meteor_scores = calculate_meteor_score(summaries, BART_generated_summaries)

BART_cosine_scores = []
for original, generated in zip(summaries, BART_generated_summaries):
    score = calculate_cosine_similarity(original, generated)
    BART_cosine_scores.append(score)

# Print the cosine similarity scores
print("Cosine Similarity Scores:", BART_cosine_scores)

# Print scores
for score in bleu_scores:
    print("BLEU Score:", score)
for score in meteor_scores:
    print("METEOR Score:", score)


rouge scores for BART: {'rouge1': Score(precision=0.2578125, recall=0.8918918918918919, fmeasure=0.39999999999999997), 'rouge2': Score(precision=0.2125984251968504, recall=0.75, fmeasure=0.3312883435582822), 'rougeL': Score(precision=0.2578125, recall=0.8918918918918919, fmeasure=0.39999999999999997)}
rouge scores for BART: {'rouge1': Score(precision=0.10377358490566038, recall=0.5945945945945946, fmeasure=0.17670682730923695), 'rouge2': Score(precision=0.02843601895734597, recall=0.16666666666666666, fmeasure=0.048582995951417005), 'rougeL': Score(precision=0.05660377358490566, recall=0.32432432432432434, fmeasure=0.0963855421686747)}
rouge scores for BART: {'rouge1': Score(precision=0.11165048543689321, recall=0.5609756097560976, fmeasure=0.18623481781376522), 'rouge2': Score(precision=0.04390243902439024, recall=0.225, fmeasure=0.07346938775510203), 'rougeL': Score(precision=0.05339805825242718, recall=0.2682926829268293, fmeasure=0.08906882591093117)}
rouge scores for BART: {'rouge

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [12]:
BART_generated_summaries[0]

'Harry Potter star Daniel Radcliffe gains access to a reported 20 million 41 1 million fortune as he turns 18 on Monday. Radcliffe says he has no plans to fritter his cash away on fast cars drink and celebrity parties. Radcliffe s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Details of how he ll mark his landmark birthday are under wraps. The Londoner has filmed a TV movie called My Boy Jack about author Rudyard Kipling and his son. He will also appear in December Boys an Australian film about four boys who escape an orphanage. Earlier this year he made his stage debut playing a tortured teenager in Peter Shaffer s Equus'

GPT-3.5 Model Evaluation:

In [13]:
import requests
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

# Set your Open AI API key


# Function to generate summaries using GPT-3
def generate_summaries_with_gpt3(articles, num_summaries=10):
    GPT_generated_summaries = []
    for article in articles[:num_summaries]:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": 'You are a helpful assistant that provides concise article summaries.'},
                {"role": "user", "content": article}
            ]
        )
        generated_summary = response.choices[0].message.content.strip()  # Corrected line
        GPT_generated_summaries.append(generated_summary)
    return GPT_generated_summaries

# Generate summaries using GPT-3
GPT_generated_summaries = generate_summaries_with_gpt3(articles)

# Function to calculate ROUGE scores (same as before)
def calculate_rouge_scores(references, candidates):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = []
    for reference, candidate in zip(references, candidates):
        scores.append(scorer.score(reference, candidate))
    return scores

# Calculate and print ROUGE scores
rouge_scores = calculate_rouge_scores(summaries[:10], GPT_generated_summaries)
for score in rouge_scores:
    print(f"rouge scores for GPT-3.5: {score}")

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('wordnet')

def calculate_bleu_score(references, candidates):
    scores = []
    for reference, candidate in zip(references, candidates):
        reference_tokenized = [word_tokenize(reference.lower())]
        candidate_tokenized = word_tokenize(candidate.lower())
        score = sentence_bleu(reference_tokenized, candidate_tokenized)
        scores.append(score)
    return scores

def calculate_meteor_score(references, candidates):
    scores = []
    for reference, candidate in zip(references, candidates):
        reference_tokenized = word_tokenize(reference.lower())
        candidate_tokenized = word_tokenize(candidate.lower())
        score = meteor_score([reference_tokenized], candidate_tokenized)
        scores.append(score)
    return scores

bleu_scores = calculate_bleu_score(summaries, GPT_generated_summaries)
meteor_scores = calculate_meteor_score(summaries, GPT_generated_summaries)
GPT_cosine_scores = []
for original, generated in zip(summaries, GPT_generated_summaries):
    score = calculate_cosine_similarity(original, generated)
    GPT_cosine_scores.append(score)

# Print the cosine similarity scores
print("Cosine Similarity Scores:", GPT_cosine_scores)

# Print scores
for score in bleu_scores:
    print("BLEU Score:", score)
for score in meteor_scores:
    print("METEOR Score:", score)


rouge scores for GPT-3.5: {'rouge1': Score(precision=0.18181818181818182, recall=0.8108108108108109, fmeasure=0.29702970297029707), 'rouge2': Score(precision=0.12804878048780488, recall=0.5833333333333334, fmeasure=0.21000000000000002), 'rougeL': Score(precision=0.16363636363636364, recall=0.7297297297297297, fmeasure=0.2673267326732673)}
rouge scores for GPT-3.5: {'rouge1': Score(precision=0.15286624203821655, recall=0.6486486486486487, fmeasure=0.24742268041237114), 'rouge2': Score(precision=0.057692307692307696, recall=0.25, fmeasure=0.09375), 'rougeL': Score(precision=0.10191082802547771, recall=0.43243243243243246, fmeasure=0.16494845360824745)}
rouge scores for GPT-3.5: {'rouge1': Score(precision=0.11940298507462686, recall=0.5853658536585366, fmeasure=0.1983471074380165), 'rouge2': Score(precision=0.05, recall=0.25, fmeasure=0.08333333333333334), 'rougeL': Score(precision=0.05970149253731343, recall=0.2926829268292683, fmeasure=0.09917355371900825)}
rouge scores for GPT-3.5: {'r

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Cosine Similarity Scores: [0.37079380082076974, 0.43897553601352374, 0.14956267598651377, 0.18141969046446405, 0.1724608987685712, 0.15837852859910242, 0.3033784801622043, 0.18461765088278037, 0.32299432379444737, 0.16747640123528768]
BLEU Score: 0.08786251237424847
BLEU Score: 0.03039015465884554
BLEU Score: 0.020933955638580953
BLEU Score: 9.831755279445954e-79
BLEU Score: 2.830368122157524e-155
BLEU Score: 8.119773304635117e-79
BLEU Score: 8.162512610543604e-79
BLEU Score: 1.2054456770285295e-78
BLEU Score: 1.1171676721903704e-78
BLEU Score: 3.4328698008148447e-155
METEOR Score: 0.45783638837749135
METEOR Score: 0.29279651675485013
METEOR Score: 0.29110855556456194
METEOR Score: 0.23648949430199437
METEOR Score: 0.2529564860052053
METEOR Score: 0.23489238410596025
METEOR Score: 0.23809523809523808
METEOR Score: 0.3534754879855699
METEOR Score: 0.4196745801033593
METEOR Score: 0.24334516415261756


In [14]:
GPT_generated_summaries[0]

"Harry Potter star Daniel Radcliffe, who turned 18 on Monday, gains access to a reported £20 million ($41.1 million) fortune but insists that he won't be extravagant with his newfound wealth. Radcliffe says he has no plans to spend his money on fast cars, drink, or celebrity parties. He prefers buying things that cost around £10 like books, CDs, and DVDs. Radcliffe's earnings from the first five Potter films have been held in a trust fund, which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground and trying to avoid going off the rails like other child stars. Radcliffe's latest film, Harry Potter and the Order of the Phoenix, is breaking records in both the UK and the US. He also has other projects lined up, including a TV movie and an Australian film. With his newfound adult status, Radcliffe expects even closer media scrutiny."

# ROUGE Scores Comparison between T5, BART and GPT-3.5 prior to fine-tuning:

## GPT-3.5 Model
- Average ROUGE-1 Precision: 0.180
- Average ROUGE-1 Recall: 0.576
- Average ROUGE-1 Fmeasure: 0.279
- Average ROUGE-2 Precision: 0.061
- Average ROUGE-2 Recall: 0.189
- Average ROUGE-2 Fmeasure: 0.095
- Average ROUGE-L Precision: 0.110
- Average ROUGE-L Recall: 0.369
- Average ROUGE-L Fmeasure: 0.185
- Average BLEU Score: 0.031
- Average METEOR Score: 0.342

## BART Model
- Average ROUGE-1 Precision: 0.183
- Average ROUGE-1 Recall: 0.649
- Average ROUGE-1 Fmeasure: 0.303
- Average ROUGE-2 Precision: 0.082
- Average ROUGE-2 Recall: 0.286
- Average ROUGE-2 Fmeasure: 0.136
- Average ROUGE-L Precision: 0.114
- Average ROUGE-L Recall: 0.402
- Average ROUGE-L Fmeasure: 0.180
- Average BLEU Score: 0.059
- Average METEOR Score: 0.398

## T5 Model
- Average ROUGE-1 Precision: 0.162
- Average ROUGE-1 Recall: 0.578
- Average ROUGE-1 Fmeasure: 0.256
- Average ROUGE-2 Precision: 0.069
- Average ROUGE-2 Recall: 0.265
- Average ROUGE-2 Fmeasure: 0.118
- Average ROUGE-L Precision: 0.113
- Average ROUGE-L Recall: 0.403
- Average ROUGE-L Fmeasure: 0.195
- Average BLEU Score: 0.059
- Average METEOR Score: 0.351

**BART Model:** Demonstrates the strongest performance, particularly in ROUGE-1 and ROUGE-2 scores, indicating a high degree of overlap with reference summaries. Its BLEU and METEOR scores also suggest effective summary generation capabilities.

**T5 Model:** Shows competitive performance, with ROUGE scores close to BART's, though slightly lower. Its BLEU and METEOR scores are on par with BART, indicating a strong summarization ability.

**GPT-3.5 Model:** While trailing behind BART and T5 in most metrics, GPT-3.5 still shows notable summarization capabilities, particularly in ROUGE-1 Recall and Fmeasure. Its lower BLEU score indicates potential areas for fine-tuning.

## Conclusion
- **Best Overall Performance**: BART, showing strength across all ROUGE metrics.
- **Competitive Alternative**: T5, with slightly lower but comparable performance to BART.
- **Potential for Improvement**: GPT-3.5, lagging behind BART and T5 in this setup but has room for fine-tuning and optimization.

# Fine-tuning GPT-3.5:

In [15]:
import requests
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

# Set your Open AI API key

# Function to generate summaries using GPT-3
def generate_summaries_with_gpt3(articles, num_summaries=10):
    GPT_FineTuned_generated_summaries = []
    for article in articles[:num_summaries]:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": 'Summarize the following article in less than 100 words, focusing on key points. Aim for clarity and brevity.'},
                {"role": "user", "content": article}
            ]
        )
        generated_summary = response.choices[0].message.content.strip()  # Corrected line
        GPT_FineTuned_generated_summaries.append(generated_summary)
    return GPT_FineTuned_generated_summaries

# Generate summaries using GPT-3
GPT_FineTuned_generated_summaries = generate_summaries_with_gpt3(articles)

# Function to calculate ROUGE scores (same as before)
def calculate_rouge_scores(references, candidates):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = []
    for reference, candidate in zip(references, candidates):
        scores.append(scorer.score(reference, candidate))
    return scores

# Calculate and print ROUGE scores
rouge_scores = calculate_rouge_scores(summaries[:10], GPT_FineTuned_generated_summaries)
for score in rouge_scores:
    print(f"rouge scores for fine-tuned GPT-3.5: {score}")

from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
nltk.download('wordnet')

def calculate_bleu_score(references, candidates):
    scores = []
    for reference, candidate in zip(references, candidates):
        reference_tokenized = [word_tokenize(reference.lower())]
        candidate_tokenized = word_tokenize(candidate.lower())
        score = sentence_bleu(reference_tokenized, candidate_tokenized)
        scores.append(score)
    return scores

def calculate_meteor_score(references, candidates):
    scores = []
    for reference, candidate in zip(references, candidates):
        reference_tokenized = word_tokenize(reference.lower())
        candidate_tokenized = word_tokenize(candidate.lower())
        score = meteor_score([reference_tokenized], candidate_tokenized)
        scores.append(score)
    return scores

bleu_scores = calculate_bleu_score(summaries, GPT_FineTuned_generated_summaries)
meteor_scores = calculate_meteor_score(summaries, GPT_FineTuned_generated_summaries)
GPT_FineTuned_cosine_scores = []
for original, generated in zip(summaries, GPT_FineTuned_generated_summaries):
    score = calculate_cosine_similarity(original, generated)
    GPT_FineTuned_cosine_scores.append(score)

# Print the cosine similarity scores
print("Cosine Similarity Scores:", GPT_FineTuned_cosine_scores)

# Print scores
for score in bleu_scores:
    print("BLEU Score:", score)
for score in meteor_scores:
    print("METEOR Score:", score)



rouge scores for fine-tuned GPT-3.5: {'rouge1': Score(precision=0.2727272727272727, recall=0.6486486486486487, fmeasure=0.38399999999999995), 'rouge2': Score(precision=0.14942528735632185, recall=0.3611111111111111, fmeasure=0.21138211382113825), 'rougeL': Score(precision=0.20454545454545456, recall=0.4864864864864865, fmeasure=0.28800000000000003)}
rouge scores for fine-tuned GPT-3.5: {'rouge1': Score(precision=0.1958762886597938, recall=0.5135135135135135, fmeasure=0.2835820895522388), 'rouge2': Score(precision=0.07291666666666667, recall=0.19444444444444445, fmeasure=0.10606060606060606), 'rougeL': Score(precision=0.1134020618556701, recall=0.2972972972972973, fmeasure=0.16417910447761194)}
rouge scores for fine-tuned GPT-3.5: {'rouge1': Score(precision=0.1568627450980392, recall=0.3902439024390244, fmeasure=0.22377622377622378), 'rouge2': Score(precision=0.04950495049504951, recall=0.125, fmeasure=0.07092198581560284), 'rougeL': Score(precision=0.09803921568627451, recall=0.2439024

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [16]:
GPT_FineTuned_generated_summaries[0]

"Daniel Radcliffe, the star of the Harry Potter films, has turned 18 and gained access to his £20 million fortune. Despite his new wealth, Radcliffe insists that he won't become extravagant and plans to continue his modest spending habits on books and DVDs. He also stated that he will have a party to celebrate his birthday, but details of the event are under wraps. Radcliffe’s earnings from the Harry Potter films have been held in a trust fund that he has not been able to touch."

Providing examples for the GPT model to learn better:

In [None]:
import requests
from rouge_score import rouge_scorer
import random
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')

# Set your Open AI API key


def generate_summaries_with_gpt3(articles, summaries, num_summaries=10):
    GPT_finer_generated_summaries = []
    for i in range(num_summaries):
        # Randomly select an example article and its summary for the prompt
        example_idx = random.randint(0, len(articles) - 1)
        prompt = (
            f"Article: {articles[example_idx]}\nSummary: {summaries[example_idx]}\n\n"
            f"Article: {articles[i]}\nSummary:"
        )

        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "system", "content": prompt}]
        )

        generated_summary = response.choices[0].message.content.strip()
        GPT_finer_generated_summaries.append(generated_summary)

    return GPT_finer_generated_summaries

# Generate summaries using the updated approach
GPT_finer_generated_summaries = generate_summaries_with_gpt3(articles, summaries)

# Calculate ROUGE, BLEU, and METEOR scores
def calculate_rouge_scores(references, candidates):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    return [scorer.score(ref, cand) for ref, cand in zip(references, candidates)]

def calculate_bleu_score(references, candidates):
    return [
        sentence_bleu([word_tokenize(ref.lower())], word_tokenize(cand.lower()))
        for ref, cand in zip(references, candidates)
    ]

def calculate_meteor_score(references, candidates):
    return [
        meteor_score([word_tokenize(ref.lower())], word_tokenize(cand.lower()))
        for ref, cand in zip(references, candidates)
    ]

# Print the evaluation scores
rouge_scores = calculate_rouge_scores(summaries[:10], GPT_finer_generated_summaries)
bleu_scores = calculate_bleu_score(summaries[:10], GPT_finer_generated_summaries)
meteor_scores = calculate_meteor_score(summaries[:10], GPT_finer_generated_summaries)
GPT_finer_cosine_scores = []
for original, generated in zip(summaries, GPT_finer_generated_summaries):
    score = calculate_cosine_similarity(original, generated)
    GPT_finer_cosine_scores.append(score)

# Print the cosine similarity scores
print("Cosine Similarity Scores:", GPT_finer_cosine_scores)

for i in range(10):
    print(f"Article {i+1} - ROUGE: {rouge_scores[i]}, BLEU: {bleu_scores[i]}, METEOR: {meteor_scores[i]}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Cosine Similarity Scores: [0.25862404170994413, 0.381221055525755, 0.11673076769587434, 0.21012068493286917, 0.25441777118044046, 0.07291077325131615, 0.3147868250354779, 0.33800201768337435, 0.25497337681718185, 0.19654759661851784]
Article 1 - ROUGE: {'rouge1': Score(precision=0.2571428571428571, recall=0.46153846153846156, fmeasure=0.33027522935779813), 'rouge2': Score(precision=0.11594202898550725, recall=0.21052631578947367, fmeasure=0.14953271028037382), 'rougeL': Score(precision=0.2, recall=0.358974358974359, fmeasure=0.25688073394495414)}, BLEU: 0.09928901010854298, METEOR: 0.4052170574660843
Article 2 - ROUGE: {'rouge1': Score(precision=0.1459227467811159, recall=0.6938775510204082, fmeasure=0.24113475177304966), 'rouge2': Score(precision=0.05603448275862069, recall=0.2708333333333333, fmeasure=0.09285714285714285), 'rougeL': Score(precision=0.0944206008583691, recall=0.4489795918367347, fmeasure=0.15602836879432627)}, BLEU: 0.04939611332738405, METEOR: 0.35626606683804635
Art

The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


# Fine-tuning T5 Model:

In [4]:
from transformers import T5Tokenizer, T5ForConditionalGeneration, AdamW
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import LambdaLR
import torch
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

# Initialize the T5 model and tokenizer
model_name = "t5-base"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Function to clean text
def clean_text(text):
    # Basic cleaning
    text = re.sub(r"\s+", " ", text)  # Remove extra spaces
    text = re.sub(r"\W+", " ", text)  # Remove non-word characters
    return text

# Function to truncate text
def truncate_text(text, max_length=200):
    return text[:max_length]

# Preprocess articles and summaries
articles = [clean_text(article) for article in articles]

# Tokenize the data
max_length = 1024  # Max length for the articles
summary_length = 100  # Max length for the summaries

train_encodings = tokenizer(articles, truncation=True, padding='max_length', max_length=max_length, return_tensors="pt")
train_labels = tokenizer(summaries, truncation=True, padding='max_length', max_length=summary_length, return_tensors="pt")

# Prepare the dataset
class SummarizationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels['input_ids'][idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = SummarizationDataset(train_encodings, train_labels)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Setup optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=5e-5)

def lr_lambda(current_step: int):
    warmup_steps = 500
    if current_step < warmup_steps:
        return float(current_step) / float(max(1, warmup_steps))
    return 0.95 ** (current_step - warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda)

# Training loop
num_epochs = 1  # You can adjust this
model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        # Forward pass
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)
        labels = batch['labels'].to(model.device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Backward pass
        loss.backward()

        # Update parameters and learning rate
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        print(f"Epoch {epoch}, Loss: {loss.item()}")

def generate_summaries(dataset, model, tokenizer, max_length=1024, num_summaries=10):
    model.eval()
    T5_FineTuned_generated_summaries = []
    for i, article in enumerate(dataset):
        if i >= num_summaries:
            break
        inputs = tokenizer.encode("summarize: " + article, return_tensors="pt", max_length=max_length, truncation=True).to(model.device)
        outputs = model.generate(inputs, max_length=200)
        summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
        T5_FineTuned_generated_summaries.append(summary)
    return T5_FineTuned_generated_summaries

T5_FineTuned_generated_summaries = generate_summaries(articles, model, tokenizer)
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
import nltk
nltk.download('wordnet')
nltk.download('punkt')

def calculate_rouge_scores(references, candidates):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = []
    for ref, cand in zip(references, candidates):
        scores.append(scorer.score(ref, cand))
    return scores

def calculate_bleu_score(references, candidates):
    scores = []
    for ref, cand in zip(references, candidates):
        ref_tokenized = [word_tokenize(ref.lower())]
        cand_tokenized = word_tokenize(cand.lower())
        scores.append(sentence_bleu(ref_tokenized, cand_tokenized))
    return scores

def calculate_meteor_score(references, candidates):
    scores = []
    for ref, cand in zip(references, candidates):
        ref_tokenized = word_tokenize(ref.lower())
        cand_tokenized = word_tokenize(cand.lower())
        scores.append(meteor_score([ref_tokenized], cand_tokenized))
    return scores

# Calculate the scores
rouge_scores = calculate_rouge_scores(summaries[:10], T5_FineTuned_generated_summaries)
bleu_scores = calculate_bleu_score(summaries[:10], T5_FineTuned_generated_summaries)
meteor_scores = calculate_meteor_score(summaries[:10], T5_FineTuned_generated_summaries)
T5_FineTuned_cosine_scores = []
for original, generated in zip(summaries, T5_FineTuned_generated_summaries):
    score = calculate_cosine_similarity(original, generated)
    T5_FineTuned_cosine_scores.append(score)

# Print the cosine similarity scores
print("Cosine Similarity Scores:", T5_FineTuned_cosine_scores)

# Print scores
print("ROUGE Scores:", rouge_scores)
print("BLEU Scores:", bleu_scores)
print("METEOR Scores:", meteor_scores)



config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Epoch 0, Loss: 7.666695594787598


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Cosine Similarity Scores: [0.1289985907367188, 0.36992666978265637, 0.31576903178862953, 0.09482766850567939, 0.1713745597291317, 0.20134697774366345, 0.3422064513706536, 0.34211469363773894, 0.04701610374659439, 0.19342170935126668]
ROUGE Scores: [{'rouge1': Score(precision=0.16071428571428573, recall=0.23076923076923078, fmeasure=0.18947368421052632), 'rouge2': Score(precision=0.05454545454545454, recall=0.07894736842105263, fmeasure=0.06451612903225806), 'rougeL': Score(precision=0.10714285714285714, recall=0.15384615384615385, fmeasure=0.12631578947368421)}, {'rouge1': Score(precision=0.4358974358974359, recall=0.3469387755102041, fmeasure=0.3863636363636364), 'rouge2': Score(precision=0.10526315789473684, recall=0.08333333333333333, fmeasure=0.0930232558139535), 'rougeL': Score(precision=0.23076923076923078, recall=0.1836734693877551, fmeasure=0.20454545454545456)}, {'rouge1': Score(precision=0.2857142857142857, recall=0.43902439024390244, fmeasure=0.34615384615384615), 'rouge2': 

In [1]:
# summaries

In [8]:
T5_FineTuned_generated_summaries[0]

'he will be able to gamble in a casino buy a drink in a pub or see the horror film Hostel Part II currently six places below his number one movie on the UK box office chart. despite his growing fame and riches the young actor says he is keeping his feet firmly on the ground.'

# Fine-Tuning BART

In [8]:
from transformers import BartTokenizer, BartForConditionalGeneration, AdamW
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import LambdaLR
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
import torch
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

nltk.download('punkt')
nltk.download('wordnet')

# Initialize the BART model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Functions for text processing
def clean_text(text):
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\W+", " ", text)
    return text

def truncate_text(text, max_length=200):
    return text[:max_length]

# Preprocess articles and summaries
articles = [clean_text(article) for article in articles]

# Tokenization
max_length = 1024
summary_length = 300

train_encodings = tokenizer(articles, truncation=True, padding='max_length', max_length=max_length, return_tensors="pt")
train_labels = tokenizer(summaries, truncation=True, padding='max_length', max_length=summary_length, return_tensors="pt")

# Dataset
class SummarizationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels['input_ids'][idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = SummarizationDataset(train_encodings, train_labels)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Optimizer and Scheduler
optimizer = AdamW(model.parameters(), lr=5e-5)

def lr_lambda(current_step: int):
    warmup_steps = 500
    if current_step < warmup_steps:
        return float(current_step) / float(max(1, warmup_steps))
    return 0.95 ** (current_step - warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda)

# Training Loop
num_epochs = 1  # Adjust as needed
model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)
        labels = batch['labels'].to(model.device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        print(f"Epoch {epoch}, Loss: {loss.item()}")

# Summarization Function
def generate_summaries(dataset, model, tokenizer, max_length=1024, num_summaries=10):
    model.eval()
    generated_summaries = []
    for i, article in enumerate(dataset):
        if i >= num_summaries:
            break
        inputs = tokenizer.encode("summarize: " + article, return_tensors="pt", max_length=max_length, truncation=True).to(model.device)
        outputs = model.generate(inputs, max_length=300)
        summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
        generated_summaries.append(summary)
    return generated_summaries

generated_summaries = generate_summaries(articles, model, tokenizer)

# Evaluation Functions
def calculate_rouge_scores(references, candidates):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = []
    for ref, cand in zip(references, candidates):
        scores.append(scorer.score(ref, cand))
    return scores

def calculate_bleu_score(references, candidates):
    scores = []
    smoothie = SmoothingFunction().method4
    for ref, cand in zip(references, candidates):
        ref_tokenized = [word_tokenize(ref.lower())]
        cand_tokenized = word_tokenize(cand.lower())
        scores.append(sentence_bleu(ref_tokenized, cand_tokenized, smoothing_function=smoothie))
    return scores

def calculate_meteor_score(references, candidates):
    scores = []
    for ref, cand in zip(references, candidates):
        ref_tokenized = word_tokenize(ref.lower())
        cand_tokenized = word_tokenize(cand.lower())
        scores.append(meteor_score([ref_tokenized], cand_tokenized))
    return scores

# Calculate Scores
rouge_scores = calculate_rouge_scores(summaries[:10], generated_summaries)
bleu_scores = calculate_bleu_score(summaries[:10], generated_summaries)
meteor_scores = calculate_meteor_score(summaries[:10], generated_summaries)
cosine_scores = []
for original, generated in zip(summaries, generated_summaries):
    score = calculate_cosine_similarity(original, generated)
    cosine_scores.append(score)

# Print the cosine similarity scores
print("Cosine Similarity Scores:", cosine_scores)

# Print Scores
print("ROUGE Scores:", rouge_scores)
print("BLEU Scores:", bleu_scores)
print("METEOR Scores:", meteor_scores)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Epoch 0, Loss: 9.901117324829102
Cosine Similarity Scores: [0.6444823937847926, 0.4090179912362602, 0.1608096947628205, 0.194775963358456, 0.3260743091463692, 0.1460970800186441, 0.3379048414209187, 0.28691396739851954, 0.4246289206642457, 0.139475820918422]
ROUGE Scores: [{'rouge1': Score(precision=0.5538461538461539, recall=0.9230769230769231, fmeasure=0.6923076923076924), 'rouge2': Score(precision=0.453125, recall=0.7631578947368421, fmeasure=0.5686274509803922), 'rougeL': Score(precision=0.5538461538461539, recall=0.9230769230769231, fmeasure=0.6923076923076924)}, {'rouge1': Score(precision=0.3958333333333333, recall=0.3877551020408163, fmeasure=0.3917525773195876), 'rouge2': Score(precision=0.0851063829787234, recall=0.08333333333333333, fmeasure=0.08421052631578947), 'rougeL': Score(precision=0.16666666666666666, recall=0.16326530612244897, fmeasure=0.16494845360824742)}, {'rouge1': Score(precision=0.30357142857142855, recall=0.4146341463414634, fmeasure=0.35051546391752575), 'ro

In [4]:
generated_summaries[0]

'Harry Potter star Daniel Radcliffe gains access to a reported 20 million 41 1 million fortune as he turns 18 on Monday. The actor says he has no plans to fritter his cash away on fast cars drink and celebrity parties. Radcliffe s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch.'

In [7]:
summaries

["Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday . Young actor says he has no plans to fritter his cash away . Radcliffe's earnings from first five Potter films have been held in trust fund .",
 'Mentally ill inmates in Miami are housed on the "forgotten floor" Judge Steven Leifman says most are there as a result of "avoidable felonies" While CNN tours facility, patient shouts: "I am the son of the president" Leifman says the system is unjust and he\'s fighting for change .',
 'NEW: "I thought I was going to die," driver says . Man says pickup truck was folded in half; he just has cut on face . Driver: "I probably had a 30-, 35-foot free fall" Minnesota bridge collapsed during rush hour Wednesday .',
 'Five small polyps found during procedure; "none worrisome," spokesman says . President reclaims powers transferred to vice president . Bush undergoes routine colonoscopy at Camp David .',
 "NEW: NFL chief, Atlanta Falcons owner critical of Michael Vick's condu

In [7]:
from transformers import BartTokenizer, BartForConditionalGeneration, AdamW
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import LambdaLR
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from nltk.tokenize import word_tokenize
import torch
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

nltk.download('punkt')
nltk.download('wordnet')

# Initialize the BART model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Functions for text processing
def clean_text(text):
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\W+", " ", text)
    return text

def truncate_text(text, max_length=200):
    return text[:max_length]

# Preprocess articles and summaries
articles = [clean_text(article) for article in articles]
summaries = [clean_text(summary) for summary in summaries]
summaries = [truncate_text(summary) for summary in summaries]

# Tokenization
max_length = 1024
summary_length = 40

train_encodings = tokenizer(articles, truncation=True, padding='max_length', max_length=max_length, return_tensors="pt")
train_labels = tokenizer(summaries, truncation=True, padding='max_length', max_length=summary_length, return_tensors="pt")

# Dataset
class SummarizationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels['input_ids'][idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = SummarizationDataset(train_encodings, train_labels)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Optimizer and Scheduler
optimizer = AdamW(model.parameters(), lr=5e-5)

def lr_lambda(current_step: int):
    warmup_steps = 500
    if current_step < warmup_steps:
        return float(current_step) / float(max(1, warmup_steps))
    return 0.95 ** (current_step - warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda)

# Training Loop
num_epochs = 1  # Adjust as needed
model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)
        labels = batch['labels'].to(model.device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        print(f"Epoch {epoch}, Loss: {loss.item()}")

# Summarization Function
def generate_summaries(dataset, model, tokenizer, max_length=1024, num_summaries=10):
    model.eval()
    BART_FineTuned_generated_summaries = []
    for i, article in enumerate(dataset):
        if i >= num_summaries:
            break
        inputs = tokenizer.encode("summarize: " + article, return_tensors="pt", max_length=max_length, truncation=True).to(model.device)
        outputs = model.generate(inputs, max_length=40)
        summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
        BART_FineTuned_generated_summaries.append(summary)
    return BART_FineTuned_generated_summaries

BART_FineTuned_generated_summaries = generate_summaries(articles, model, tokenizer)

# Evaluation Functions
def calculate_rouge_scores(references, candidates):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = []
    for ref, cand in zip(references, candidates):
        scores.append(scorer.score(ref, cand))
    return scores

def calculate_bleu_score(references, candidates):
    scores = []
    smoothie = SmoothingFunction().method4
    for ref, cand in zip(references, candidates):
        ref_tokenized = [word_tokenize(ref.lower())]
        cand_tokenized = word_tokenize(cand.lower())
        scores.append(sentence_bleu(ref_tokenized, cand_tokenized, smoothing_function=smoothie))
    return scores

def calculate_meteor_score(references, candidates):
    scores = []
    for ref, cand in zip(references, candidates):
        ref_tokenized = word_tokenize(ref.lower())
        cand_tokenized = word_tokenize(cand.lower())
        scores.append(meteor_score([ref_tokenized], cand_tokenized))
    return scores

# Calculate Scores
rouge_scores = calculate_rouge_scores(summaries[:10], BART_FineTuned_generated_summaries)
bleu_scores = calculate_bleu_score(summaries[:10], BART_FineTuned_generated_summaries)
meteor_scores = calculate_meteor_score(summaries[:10], BART_FineTuned_generated_summaries)
BART_FineTuned_cosine_scores = []
for original, generated in zip(summaries, BART_FineTuned_generated_summaries):
    score = calculate_cosine_similarity(original, generated)
    BART_FineTuned_cosine_scores.append(score)

# Print the cosine similarity scores
print("Cosine Similarity Scores:", BART_FineTuned_cosine_scores)

# Print Scores
print("ROUGE Scores:", rouge_scores)
print("BLEU Scores:", bleu_scores)
print("METEOR Scores:", meteor_scores)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Epoch 0, Loss: 2.302992105484009




Cosine Similarity Scores: [0.5089465909384226, 0.42600288790348806, 0.18578341441251509, 0.20985878895376944, 0.286795932954523, 0.030242567519783332, 0.1532279983592761, 0.1571927663103977, 0.3579533505210165, 0.08429796603862379]
ROUGE Scores: [{'rouge1': Score(precision=0.6363636363636364, recall=0.5675675675675675, fmeasure=0.6000000000000001), 'rouge2': Score(precision=0.53125, recall=0.4722222222222222, fmeasure=0.4999999999999999), 'rougeL': Score(precision=0.6363636363636364, recall=0.5675675675675675, fmeasure=0.6000000000000001)}, {'rouge1': Score(precision=0.45161290322580644, recall=0.3783783783783784, fmeasure=0.411764705882353), 'rouge2': Score(precision=0.13333333333333333, recall=0.1111111111111111, fmeasure=0.1212121212121212), 'rougeL': Score(precision=0.22580645161290322, recall=0.1891891891891892, fmeasure=0.2058823529411765)}, {'rouge1': Score(precision=0.41935483870967744, recall=0.3170731707317073, fmeasure=0.3611111111111111), 'rouge2': Score(precision=0.2333333

# Mounting Drive (If you have not done so already):

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# import requests
# from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
# import torch
# import os

# # Fetch the data
# url = "https://datasets-server.huggingface.co/rows?dataset=cnn_dailymail&config=1.0.0&split=train&offset=0&length=100"
# response = requests.get(url)
# articles = []
# summaries = []

# if response.status_code == 200:
#     data = response.json()
#     if isinstance(data, dict) and 'rows' in data:
#         entries = data['rows']
#         for entry in entries:
#             articles.append(entry['row']['article'])
#             summaries.append(entry['row']['highlights'])
#     elif isinstance(data, list):
#         for entry in data:
#             articles.append(entry['row']['article'])
#             summaries.append(entry['row']['highlights'])
#     else:
#         print("Unexpected data structure received from the API.")
# else:
#     print(f"Failed to fetch data. Status code: {response.status_code}. Error: {response.text}")
#     exit()

# # Initialize the T5 model and tokenizer
# model_name = "t5-base"
# model = T5ForConditionalGeneration.from_pretrained(model_name)
# tokenizer = T5Tokenizer.from_pretrained(model_name)

# # Tokenize the data
# max_length = 1024  # Max length for the articles
# summary_length = 300  # Max length for the summaries

# train_encodings = tokenizer(articles, truncation=True, padding='max_length', max_length=max_length, return_tensors="pt")
# train_labels = tokenizer(summaries, truncation=True, padding='max_length', max_length=summary_length, return_tensors="pt")

# # Prepare the dataset format for Trainer
# class SummarizationDataset(torch.utils.data.Dataset):
#     def __init__(self, encodings, labels):
#         self.encodings = encodings
#         self.labels = labels

#     def __getitem__(self, idx):
#         item = {key: val[idx] for key, val in self.encodings.items()}
#         item['labels'] = self.labels['input_ids'][idx]
#         return item

#     def __len__(self):
#         return len(self.labels)

# train_dataset = SummarizationDataset(train_encodings, train_labels)

# # Training the model
# training_args = TrainingArguments(
#     output_dir='/content/drive/My Drive/ML_Algorithms_FA_23/ML_Project/training_results',
#     per_device_train_batch_size=4,
#     num_train_epochs=1,
# )

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     tokenizer=tokenizer
# )

# trainer.train()

# testing_samples_dir = '/content/drive/My Drive/ML_Algorithms_FA_23/ML_Project/testing_samples'

# # List all .txt files in the directory
# txt_files = [f for f in os.listdir(testing_samples_dir) if f.endswith('.txt')]

# # Read the content of the first .txt file into 'article'
# with open(os.path.join(testing_samples_dir, txt_files[0]), 'r', encoding='utf-8') as file:
#     article = file.read()

# # If you want to use the first file:
# def generate_summary(text, tokenizer, model):
#     # Split the article into chunks of approx. 900 tokens
#     chunk_size = 900
#     article_chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
#     summaries = []

#     for chunk in article_chunks:
#         input_ids = tokenizer.encode("summarize: " + chunk, return_tensors="pt", max_length=max_length, truncation=True)
#         summary_ids = model.generate(input_ids, num_beams=4, min_length=30, max_length=300, early_stopping=True)
#         summaries.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

#     return ' '.join(summaries)

# article_summary = generate_summary(article, tokenizer, model)
# print(article_summary)


In [None]:
# import requests
# from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
# import torch
# from rouge_score import rouge_scorer

# # Initialize the BART model and tokenizer
# model_name = "facebook/bart-large-cnn"  # A popular BART model for summarization
# model = BartForConditionalGeneration.from_pretrained(model_name)
# tokenizer = BartTokenizer.from_pretrained(model_name)

# # Tokenize the data
# max_length = 1024  # Max length for the articles
# summary_length = 300  # Max length for the summaries

# train_encodings = tokenizer(articles, truncation=True, padding='max_length', max_length=max_length, return_tensors="pt")
# train_labels = tokenizer(summaries, truncation=True, padding='max_length', max_length=summary_length, return_tensors="pt")

# # Prepare the dataset format for Trainer
# class SummarizationDataset(torch.utils.data.Dataset):
#     def __init__(self, encodings, labels):
#         self.encodings = encodings
#         self.labels = labels

#     def __getitem__(self, idx):
#         item = {key: val[idx] for key, val in self.encodings.items()}
#         item['labels'] = self.labels['input_ids'][idx]
#         return item

#     def __len__(self):
#         return len(self.labels)

# train_dataset = SummarizationDataset(train_encodings, train_labels)

# # Training the model
# training_args = TrainingArguments(
#     output_dir='/content/drive/My Drive/ML_Algorithms_FA_23/ML_Project/training_results',
#     per_device_train_batch_size=8,
#     num_train_epochs=1,
# )

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     tokenizer=tokenizer
# )

# trainer.train()

# testing_samples_dir = '/content/drive/My Drive/ML_Algorithms_FA_23/ML_Project/testing_samples'

# # List all .txt files in the directory
# txt_files = [f for f in os.listdir(testing_samples_dir) if f.endswith('.txt')]

# # Read the content of the first .txt file into 'article'
# with open(os.path.join(testing_samples_dir, txt_files[0]), 'r', encoding='utf-8') as file:
#     article = file.read()

# # If you want to use the first file:
# def generate_summary(text, tokenizer, model):
#     # Split the article into chunks of approx. 900 tokens
#     chunk_size = 900
#     article_chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
#     summaries = []

#     for chunk in article_chunks:
#         input_ids = tokenizer.encode("summarize: " + chunk, return_tensors="pt", max_length=max_length, truncation=True)
#         summary_ids = model.generate(input_ids, num_beams=4, min_length=30, max_length=300, early_stopping=True)
#         summaries.append(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

#     return ' '.join(summaries)

# article_summary = generate_summary(article, tokenizer, model)
# print(article_summary)

In [None]:
# from selenium import webdriver
# from selenium.webdriver.chrome.options import Options
# from bs4 import BeautifulSoup
# import os

# def scrape_msnbc_article():
#     # Set up Selenium to work with headless Chrome (no GUI)
#     chrome_options = Options()
#     chrome_options.add_argument("--headless")
#     chrome_options.add_argument("--no-sandbox")
#     chrome_options.add_argument("--disable-dev-shm-usage")

#     # Initialize the webdriver
#     with webdriver.Chrome(options=chrome_options) as driver:
#         # URL of the MSNBC article page
#         url = 'https://www.msnbc.com/rachel-maddow-show/maddowblog/dhs-texts-become-latest-jan-6-materials-go-missing-rcna40692'

#         # Use Selenium to fetch the webpage content
#         driver.get(url)

#         # Get the page source using Selenium and parse it with BeautifulSoup
#         soup = BeautifulSoup(driver.page_source, 'html.parser')
#         article_content = soup.find_all('p')

#         # Begin the article text collection
#         article_text = ''

#         # Loop through each paragraph
#         for p in article_content:
#             # If the start of the cookie message is found, break out of the loop
#             if "Like many companies, we use cookies" in p.get_text() or "2023 NBC UNIVERSAL  This Cookie Notice" in p.get_text():
#                 break
#             # Otherwise, continue adding paragraphs to the article text
#             article_text += p.get_text() + ' '

#     # Define the output directory
#     output_dir = '/content/drive/My Drive/ML_Algorithms_FA_23/ML_Project/testing_samples'

#     # Create the directory if it doesn't exist
#     os.makedirs(output_dir, exist_ok=True)

#     # Save the article to a .txt file inside testing_samples
#     output_path = os.path.join(output_dir, 'msnbc_article.txt')
#     with open(output_path, 'w', encoding='utf-8') as file:
#         file.write(article_text.strip())  # strip() removes trailing spaces

#     print(f"Article saved to {output_path}")

# # Call the function to scrape and save the article
# scrape_msnbc_article()


# Using BART to Summarize A Chapter of Their Eyes Were Watching God

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Path to the text file in Google Drive
file_path = '/content/drive/My Drive/ML_Algorithms_FA_23/ML_Project/testing_samples/chapter1_TEWWG.txt'

# Reading the chapter text
with open(file_path, 'r') as file:
    chapter_text = file.read()

# Preprocess and tokenize the chapter text
chapter_encodings = tokenizer(chapter_text, truncation=True, padding='max_length', max_length=max_length, return_tensors="pt")

# Generate a summary
model.eval()
inputs = chapter_encodings['input_ids'].to(model.device)
attention_mask = chapter_encodings['attention_mask'].to(model.device)
summary_ids = model.generate(inputs, max_length=200)
chapter_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Generated Summary:", chapter_summary)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Generated Summary: Women forget all those things they don't want to remember, and remember everything they don’t want to forget. The dream is the truth. Then they act and do things accordingly. So they chewed up the back parts of their minds and swallowed with relish. They made burning statements with questions, and killing tools out of laughs.


# Using GPT-3.5 to Sumarize Chapter

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# set up open ai key here:

# Function to read the chapter text from a file
def read_chapter(file_path):
    with open(file_path, 'r') as file:
        return file.read()

# Function to generate a summary using GPT-3
def generate_summaries_with_gpt3(articles, num_summaries=10):
    for article in articles[:num_summaries]:
        response = open ai chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Summarize the following chapter"},
                {"role": "user", "content": "Summarize this chapter in less than 100 words: " + chapter_text}
            ]
        )
        generated_sum = response.choices[0].message.content.strip()  # Corrected line
    return generated_sum

# File path to your chapter text file in Google Drive
file_path = '/content/drive/My Drive/ML_Algorithms_FA_23/ML_Project/testing_samples/chapter1_TEWWG.txt'

# Read the chapter text
chapter_text = read_chapter(file_path)

# Generate the summary
gen_sum = generate_summaries_with_gpt3(chapter_text)

# Print the generated summary
print("Generated Summary:", gen_sum)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Generated Summary: In this chapter, the narrator discusses the lives of men and women. Men are represented by ships at a distance, with their wishes on board. Some ships come in with the tide, while others sail forever on the horizon. Women, on the other hand, forget things they don't want to remember and remember everything they don't want to forget, living their lives according to their dreams.

The chapter then focuses on a woman named Janie who has returned from burying the dead. The people in the community gossip about her and speculate about her life. They comment on her appearance and question her choices. Janie ignores their comments and continues on her way.

Janie's friend, Pheoby, defends her against the gossipers and goes to visit her. They share a meal and talk about Janie's life. Janie reveals that her husband, Tea Cake, has left her and she has

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set your Open AI API key

# Function to read the chapter text from a file
def read_chapter(file_path):
    with open(file_path, 'r') as file:
        return file.read()

# Function to generate a summary using GPT-3
def generate_summaries_with_gpt3(articles, num_summaries=10):
    for article in articles[:num_summaries]:
        response = open ai chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Summarize the following chapter"},
                {"role": "user", "content": "Summarize this chapter in 300 words: " + chapter_text}
            ]
        )
        generated_sum = response.choices[0].message.content.strip()  # Corrected line
    return generated_sum

# File path to your chapter text file in Google Drive
file_path = '/content/drive/My Drive/ML_Algorithms_FA_23/ML_Project/testing_samples/chapter1_TEWWG.txt'

# Read the chapter text
chapter_text = read_chapter(file_path)

# Generate the summary
gen_sum = generate_summaries_with_gpt3(chapter_text)

# Print the generated summary
print("Generated Summary:", gen_sum)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Generated Summary: In this chapter, the narrator introduces the theme of ships at a distance carrying each person's wishes. Some people's desires come with the tide, while others remain out of reach on the horizon. The narrator also contrasts the way men and women remember things, with women choosing to forget certain memories and remember what they wish. 

The chapter focuses on a woman who has returned from burying the dead. The townspeople see her as she arrives at sundown and start gossiping about her. They comment on her appearance and question her decisions, expressing envy and cruelty. The woman ignores them and continues walking to her gate without engaging in conversation.

After she enters her home, her friend Pheoby Watson arrives with food. They discuss the town's gossip and the woman's recent experiences. The woman explains that she had been livi

# Comparing CliffNotes Summary:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

# Function to calculate ROUGE scores
def calculate_rouge_scores(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    return scorer.score(reference, candidate)

# Function to calculate BLEU score
def calculate_bleu_score(reference, candidate):
    reference_tokenized = [word_tokenize(reference.lower())]
    candidate_tokenized = word_tokenize(candidate.lower())
    smoothie = SmoothingFunction().method4
    return sentence_bleu(reference_tokenized, candidate_tokenized, smoothing_function=smoothie)


CliffNotes = "The porch sitters are spread out on the front porch of Pheoby and Sam Watson's home, happy to be free of the responsibilities of their long day's labor. They are astonished to see a bedraggled and weary-looking Janie Starks trudging into town, then turning her face in their direction. The women see her as a disaster, but the men see her as still possessing physical attraction. Janie speaks, acknowledges them, and goes on, and their indignation is great. How could she have the nerve not to stop and explain why she went off a year and a half ago in a blue satin dress and now she returns in dirty overalls? Surely her husband — they assume she married the man, the guitar-playing, roving Tea Cake — took her money and probably went off with a younger woman. After all, Tea Cake was nearly ten years younger than Janie. They believe that Janie should have stopped and talked to them. The inherent jealousy of the women is quite apparent. Janie's friend Pheoby defends her to the porch sitters. Pheoby believes that Janie does not have to share any of her personal business with them. Assuming that Janie is hungry, Pheoby volunteers to take Janie a pot of mulatto rice, and soon she finds her way through the darkness to Janie's back steps. Pheoby's motive is not completely unselfish. She is quietly certain that Janie will talk to her and explain what happened during the past year and a half. Janie welcomes her friend and the gift of food. She informs Pheoby that Tea Cake did not run off with the money that Joe left her. She reveals that the money is safe in the bank, but Tea Cake is dead. After Janie has rested for a while, cleaned and soothed her tired feet, and enjoyed the rice, she tells Pheoby about her months with Tea Cake."

Generated = "In this chapter, the author introduces the concept of ships at a distance, symbolizing people's dreams and desires. The chapter then shifts to focus on a woman who has returned home after burying the dead, specifically victims of a tragic event. As she walks through her town, the people on their porches begin to gossip and speculate about her past, criticizing her appearance and actions. The woman, named Janie, approaches her friend Pheoby's porch, where Pheoby defends Janie against the gossip and judgments of others. Inside the porch, Janie enjoys a meal Pheoby brought her and they talk about the rumors and whispers surrounding Janie's life. Pheoby expresses her curiosity, but Janie dismisses the idea of explaining herself, stating that people don't understand her experiences. Janie reveals that her husband, Tea Cake, has left her, and that is why she has returned home. She explains that she no longer has anything to keep her in her previous location. Janie emphasizes that there is more to her story than what people assume, but she doesn't feel the need to share it unless someone truly understands her perspective. Pheoby agrees to listen and understand, and they continue their conversation. The chapter ends with Janie stating that she has depended on Pheoby as a friend for a long time, and she trusts her to have an open mind. Time passes as they talk, and the darkness outside becomes old and weathered, while Janie's words carry wisdom and depth."

similarity_score = calculate_cosine_similarity(CliffNotes, Generated)
print(f"Cosine Similarity Score: {similarity_score}")
rouge_scores = calculate_rouge_scores(CliffNotes, Generated)
print("ROUGE Scores:", rouge_scores)

# Calculate BLEU score
bleu_score = calculate_bleu_score(CliffNotes, Generated)
print("BLEU Score:", bleu_score)

Cosine Similarity Score: 0.7432449075279601
ROUGE Scores: {'rouge1': Score(precision=0.5617529880478087, recall=0.4392523364485981, fmeasure=0.493006993006993), 'rouge2': Score(precision=0.084, recall=0.065625, fmeasure=0.0736842105263158), 'rougeL': Score(precision=0.2151394422310757, recall=0.16822429906542055, fmeasure=0.1888111888111888)}
BLEU Score: 0.021093114057218643


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


BART:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

def calculate_cosine_similarity(text1, text2):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return similarity[0][0]

# Function to calculate ROUGE scores
def calculate_rouge_scores(reference, candidate):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    return scorer.score(reference, candidate)

# Function to calculate BLEU score
def calculate_bleu_score(reference, candidate):
    reference_tokenized = [word_tokenize(reference.lower())]
    candidate_tokenized = word_tokenize(candidate.lower())
    smoothie = SmoothingFunction().method4
    return sentence_bleu(reference_tokenized, candidate_tokenized, smoothing_function=smoothie)


CliffNotes = "The porch sitters are spread out on the front porch of Pheoby and Sam Watson's home, happy to be free of the responsibilities of their long day's labor. They are astonished to see a bedraggled and weary-looking Janie Starks trudging into town, then turning her face in their direction. The women see her as a disaster, but the men see her as still possessing physical attraction. Janie speaks, acknowledges them, and goes on, and their indignation is great. How could she have the nerve not to stop and explain why she went off a year and a half ago in a blue satin dress and now she returns in dirty overalls? Surely her husband — they assume she married the man, the guitar-playing, roving Tea Cake — took her money and probably went off with a younger woman. After all, Tea Cake was nearly ten years younger than Janie. They believe that Janie should have stopped and talked to them. The inherent jealousy of the women is quite apparent. Janie's friend Pheoby defends her to the porch sitters. Pheoby believes that Janie does not have to share any of her personal business with them. Assuming that Janie is hungry, Pheoby volunteers to take Janie a pot of mulatto rice, and soon she finds her way through the darkness to Janie's back steps. Pheoby's motive is not completely unselfish. She is quietly certain that Janie will talk to her and explain what happened during the past year and a half. Janie welcomes her friend and the gift of food. She informs Pheoby that Tea Cake did not run off with the money that Joe left her. She reveals that the money is safe in the bank, but Tea Cake is dead. After Janie has rested for a while, cleaned and soothed her tired feet, and enjoyed the rice, she tells Pheoby about her months with Tea Cake."

Generated = "Women forget all those things they don't want to remember, and remember everything they don’t want to forget. The dream is the truth. Then they act and do things accordingly. So they chewed up the back parts of their minds and swallowed with relish. They made burning statements with questions, and killing tools out of laughs."

similarity_score = calculate_cosine_similarity(CliffNotes, Generated)
print(f"Cosine Similarity Score: {similarity_score}")
rouge_scores = calculate_rouge_scores(CliffNotes, Generated)
print("ROUGE Scores:", rouge_scores)

# Calculate BLEU score
bleu_score = calculate_bleu_score(CliffNotes, Generated)
print("BLEU Score:", bleu_score)

Cosine Similarity Score: 0.3109377645792762
ROUGE Scores: {'rouge1': Score(precision=0.39655172413793105, recall=0.07165109034267912, fmeasure=0.12137203166226912), 'rouge2': Score(precision=0.017543859649122806, recall=0.003125, fmeasure=0.005305039787798409), 'rougeL': Score(precision=0.25862068965517243, recall=0.04672897196261682, fmeasure=0.07915567282321899)}
BLEU Score: 0.000376059691895402


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
