### Importing Libraries:

We start by importing the two classes we need from the transformers package: BartForConditionalGeneration (the model) and BartTokenizer (its tokenizer).

### Loading the Pre-trained Model and Tokenizer:

We load the pre-trained BART model (BartForConditionalGeneration) and the corresponding tokenizer (BartTokenizer) from the Hugging Face model hub, using the facebook/bart-large-cnn checkpoint, a BART model fine-tuned for news summarization. Together they are used to generate summaries from the input text.

In [1]:
from transformers import BartForConditionalGeneration, BartTokenizer

# Step 1: Load the pre-trained BART model and tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")






### Loading the News Article from a Text File:

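We specify the path to a text file (news_article_sample.txt in this example) containing the news article we want to summarize. The example path points to a Google Drive folder in Colab, so change it to wherever your file is stored. Then we use Python's open() function to read the contents of the file and store them in a variable called news_article.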

In [2]:
# Step 2: Read the news article from a text file
file_path = "/content/drive/MyDrive/news_summarization/news_article_sample.txt"  # Update this with the path to your text file
with open(file_path, "r", encoding="utf-8") as file:
    news_article = file.read()

news_article

"In a historic victory for Indian cricket, the national team has emerged triumphant in a thrilling Test match against Australia. The match, which took place at the iconic Melbourne Cricket Ground (MCG), saw India secure a remarkable win by a margin of 8 wickets on the final day of play.\n\nAfter winning the toss and electing to bowl first, the Indian bowlers delivered a stellar performance, restricting the Australian team to a modest total of 220 runs in their first innings. Jasprit Bumrah led the bowling attack with a superb display, claiming 4 wickets to dismantle the Australian batting lineup.\n\nIn response, the Indian batsmen showcased resilience and determination as they piled on the runs to seize control of the match. Opener Shubman Gill played a magnificent innings, scoring a brilliant century to anchor India's innings. Contributions from Cheteshwar Pujara, Ajinkya Rahane, and Ravindra Jadeja further bolstered India's position as they posted a commanding total of 445 runs, taki

### Tokenizing the Input Text:

We tokenize the input news article using the tokenizer (BartTokenizer). Tokenization is the process of breaking the text into smaller units called tokens and mapping each one to a numerical ID that the model can process. Here, max_length=1024 sets the maximum number of input tokens, truncation=True cuts off anything beyond that limit, and return_tensors="pt" returns the result as PyTorch tensors, the format the model expects.

In [4]:
# Step 3: Tokenize the input text
inputs = tokenizer([news_article], max_length=1024, return_tensors="pt", truncation=True)



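Note that max_length means different things in the two calls used in this notebook. In the tokenizer call it caps the number of input tokens (1024 here, which matches the maximum input length of this BART checkpoint), so a longer article is truncated before it reaches the model. In the generate() call in the next step, max_length instead caps the length of the generated summary, which keeps the output concise.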
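To make the effect of an input token limit concrete, here is a minimal sketch in plain Python. The helper truncate_with_specials and the BOS and EOS values are hypothetical placeholders chosen for illustration; the real tokenizer has its own truncation logic, which this does not reproduce exactly.

```python
# Toy illustration only: BOS and EOS stand in for the tokenizer's start and
# end-of-sequence IDs. The values are placeholders, not read from BartTokenizer.
BOS, EOS = 0, 2


def truncate_with_specials(content_ids, max_length):
    """Keep at most max_length IDs in total, including the two special tokens."""
    room = max_length - 2  # reserve space for BOS and EOS
    return [BOS] + content_ids[:room] + [EOS]


long_article = list(range(100, 1600))   # 1,500 made-up content IDs
short_article = list(range(100, 130))   # 30 made-up content IDs

print(len(truncate_with_specials(long_article, 1024)))   # 1024
print(len(truncate_with_specials(short_article, 1024)))  # 32
```

The long input loses its tail, while the short one passes through unchanged apart from the two added special tokens.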

### Generating the Summary:

We use the pre-trained BART model (BartForConditionalGeneration) to generate a summary of the input news article. We pass the tokenized input text to the model's generate() method, along with parameters such as the maximum and minimum length of the summary, the length penalty, and the number of beams for beam search. This generates a summary of the input text.

**inputs["input_ids"]:**

This parameter specifies the input token IDs obtained from tokenizing the input text. It represents the input sequence encoded as token IDs, which are numerical representations of the tokens according to the model's vocabulary.

**length_penalty=2.0:**

This parameter adjusts how beam search scores candidate sequences of different lengths. In the transformers library it is used as an exponent on the sequence length, and the candidate's summed log-probability is divided by the result. Because log-probabilities are negative, values above 0 favor longer sequences and values below 0 favor shorter ones. In this case, a length penalty of 2.0 is applied, which nudges the model toward longer summaries within the max_length limit.
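The arithmetic behind length_penalty can be checked with a few lines of Python. This is a sketch under one assumption: beam candidates are ranked by their summed log-probability divided by length raised to length_penalty, which is how the transformers documentation describes it. The function name beam_score and the two candidates are made up for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score: summed log-probability / (length ** penalty)."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up candidates with the same average log-probability per token (-0.5).
short_cand = {"sum_logprobs": -10.0, "length": 20}
long_cand = {"sum_logprobs": -20.0, "length": 40}

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(short_cand["sum_logprobs"], short_cand["length"], penalty)
    l = beam_score(long_cand["sum_logprobs"], long_cand["length"], penalty)
    winner = "long" if l > s else "short" if s > l else "tie"
    print(f"length_penalty={penalty}: short={s:.4f} long={l:.4f} -> {winner}")
```

With a penalty of 0.0 the shorter candidate wins, at 1.0 the two tie, and at 2.0 the longer candidate scores higher. Under this scoring rule, a larger positive penalty favors longer output.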

**num_beams=4:**

This parameter controls the number of beams used in beam search decoding. Instead of keeping only the single most likely next token at each step (as greedy decoding does), beam search keeps the num_beams highest-scoring partial sequences and extends each of them. A higher number of beams explores more candidates and often finds a higher-scoring summary, but decoding is slower and uses more memory; a lower number is faster but may miss better candidates. In this case, 4 beams are used.
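The idea behind beam search can be sketched with a toy example. Everything here is hypothetical: the NEXT table is a hand-written stand-in for a model's next-token probabilities, and the beam_search function is a simplified illustration with no length penalty, not the transformers implementation.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT = {
    "<s>": {"India": 0.6, "Australia": 0.4},
    "India": {"won": 0.5, "lost": 0.1, "</s>": 0.4},
    "Australia": {"lost": 0.9, "</s>": 0.1},
    "won": {"</s>": 1.0},
    "lost": {"</s>": 1.0},
}


def beam_search(num_beams, max_steps=5):
    """Return the best (tokens, log_probability) pair found with num_beams beams."""
    beams = [(["<s>"], 0.0)]  # partial sequences that are still being extended
    finished = []             # sequences that have reached "</s>"
    for _ in range(max_steps):
        # Extend every live beam by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])


for n in (1, 2):
    tokens, score = beam_search(n)
    print(n, " ".join(tokens), f"{math.exp(score):.2f}")
```

With one beam (greedy decoding) the search commits to "India" early and ends with a sequence of probability 0.30. With two beams it keeps "Australia" alive long enough to find a sequence of probability 0.36.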

**early_stopping=True:**

This parameter controls when beam search stops. With early_stopping=True, generation stops as soon as num_beams complete candidates have been found, that is, sequences that have reached the end-of-sequence token, instead of continuing to look for better ones. This avoids unnecessary computation, at a small risk of missing a higher-scoring candidate.

In [5]:
# Step 4: Generate the summary
summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)




### Decoding and Printing the Generated Summary:

Finally, we decode the generated summary from its tokenized form into human-readable text using the tokenizer (BartTokenizer). We remove any special tokens (such as padding tokens) from the summary, and then print the generated summary to the console.

In [6]:
# Step 5: Decode and print the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Generated Summary:")
print(summary)

Generated Summary:
India beat Australia by 8 wickets in the fourth Test at the Melbourne Cricket Ground. The victory is only the second time in history that India has won a Test match at the MCG. India has leveled the four-match Test series against Australia.
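For reuse, the tokenize, generate, and decode steps can be wrapped in one function. This is a minimal sketch, not part of the original notebook: the name summarize and the max_input_tokens parameter are hypothetical, and the defaults simply repeat the values used above. It also passes the attention mask returned by the tokenizer to generate(), which the cell above omits.

```python
def summarize(text, model, tokenizer, max_input_tokens=1024, **generate_kwargs):
    """Tokenize, generate, and decode in one call (hypothetical helper)."""
    # Tokenize the text, truncating inputs longer than max_input_tokens.
    inputs = tokenizer(
        [text],
        max_length=max_input_tokens,
        return_tensors="pt",
        truncation=True,
    )

    # Defaults copied from the notebook; callers can override any of them.
    settings = dict(
        max_length=150,
        min_length=40,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True,
    )
    settings.update(generate_kwargs)

    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        **settings,
    )

    # Turn the generated token IDs back into text without special tokens.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

With the model and tokenizer loaded as in the first cell, a call would look like summarize(news_article, model, tokenizer). Keyword arguments such as num_beams=2 override the defaults.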
