<a href="https://colab.research.google.com/github/uumair327/natural_language_processing/blob/main/NLP_Summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install necessary libraries
!pip install transformers torch



In [2]:
# Import libraries for summarization
from transformers import pipeline

In [3]:
# Create a function to set up the summarization model
def setup_summarizer():
    """
    Initializes the Hugging Face summarization pipeline
    """
    try:
        # Load pre-trained summarization model
        summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
        print("Model loaded successfully.")
        return summarizer
    except Exception as e:
        print(f"Error loading model: {str(e)}")
        return None
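
By default, `pipeline` places the model on the CPU unless a `device` argument is passed. A minimal sketch of choosing the device at load time follows. `pick_device` is a hypothetical helper, not part of the notebook or of `transformers`, and it assumes the PyTorch backend installed in the first cell.

```python
def pick_device():
    """Return 0 (first CUDA GPU) if PyTorch can see one, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU to pick; fall back to the CPU index
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(f"Using device index: {device}")

# Passing the index to the pipeline would look like this (not run here,
# because it downloads the full model):
# summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=device)
```

On a Colab runtime with a GPU attached, this should select index 0; on a CPU-only runtime it falls back to -1.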

In [4]:
# Function to summarize text, with length control
def summarize_large_text(summarizer, text, max_chunk_size=1024):
    """
    Summarizes large texts by splitting them into smaller chunks.

    Args:
    - summarizer: The Hugging Face summarization pipeline
    - text: The text to be summarized
    - max_chunk_size: The maximum size of each chunk, in characters (not tokens)

    Returns:
    - The summarized text
    """
    def summarize_chunk(chunk_sentences):
        # Rejoin the sentences and make sure the chunk ends with exactly one period
        chunk_text = '. '.join(chunk_sentences)
        if not chunk_text.endswith('.'):
            chunk_text += '.'
        # truncation=True guards against a chunk that exceeds the model's token limit
        result = summarizer(chunk_text, max_length=150, min_length=40,
                            do_sample=False, truncation=True)
        return result[0]['summary_text']

    # Split the text into sentences, dropping empty fragments
    sentences = [s.strip() for s in text.split('. ') if s.strip()]
    current_chunk = []
    summarized_chunks = []

    for sentence in sentences:
        # Check if adding the next sentence would exceed the chunk size
        if len('. '.join(current_chunk + [sentence])) <= max_chunk_size:
            current_chunk.append(sentence)
        else:
            # Summarize the current chunk (if it is not empty) and start a new one
            if current_chunk:
                summarized_chunks.append(summarize_chunk(current_chunk))
            current_chunk = [sentence]

    # Summarize any remaining sentences in the current chunk
    if current_chunk:
        summarized_chunks.append(summarize_chunk(current_chunk))

    # Join all summaries together
    final_summary = ' '.join(summarized_chunks)
    return final_summary
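
The greedy sentence packing used by `summarize_large_text` can be tried in isolation, without loading a model. The sketch below is illustrative only: `split_into_chunks` is a hypothetical helper, and, like `max_chunk_size` above, it measures length in characters rather than model tokens.

```python
def split_into_chunks(text, max_chunk_size=1024):
    """Greedily pack sentences into chunks of at most max_chunk_size characters.

    A single sentence longer than the limit becomes its own oversized chunk.
    """
    sentences = [s.strip() for s in text.split('. ') if s.strip()]
    chunks = []
    current = []

    for sentence in sentences:
        candidate = '. '.join(current + [sentence])
        if len(candidate) <= max_chunk_size or not current:
            current.append(sentence)
        else:
            chunks.append('. '.join(current))
            current = [sentence]

    if current:
        chunks.append('. '.join(current))

    # split('. ') removed each sentence's period; '. '.join restores the inner
    # ones, so only a missing trailing period has to be added back
    return [c if c.endswith('.') else c + '.' for c in chunks]


sample = "First sentence here. Second sentence here. Third one."
for chunk in split_into_chunks(sample, max_chunk_size=40):
    print(repr(chunk))
```

With a 40-character limit, the first two sentences do not fit together, so the sample is split into two chunks.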


In [5]:
# Main function to orchestrate the flow
def main():
    """
    Main function to handle the flow of text summarization.
    It prompts the user for input, summarizes the input text, and displays the result.
    """
    # Initialize the summarizer model
    summarizer = setup_summarizer()

    # Check if the summarizer loaded properly
    if not summarizer:
        print("Unable to load the model. Exiting.")
        return

    # Get user input using the input() function
    long_text = input("Please enter the text to summarize: ")

    # Check if input is valid
    if len(long_text.strip()) == 0:
        print("No input provided. Exiting.")
        return

    # Summarize the user-provided text
    summary = summarize_large_text(summarizer, long_text)

    # Display the result
    print("\nOriginal Text:\n", long_text)
    print("\nSummary:\n", summary)

# Run the main function
main()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Model loaded successfully.
Please enter the text to summarize: UP NEET UG Counselling 2024 Round 3: The Directorate of Medical Education and Training, Lucknow has opened the registration process for the third round of the Uttar Pradesh National Eligibility Entrance Test Undergraduate (UP NEET UG) counselling 2024. The registration link is now active on the official UP NEET website at upneet.gov.in. Candidates have until October 9 to complete their registration for round 3 of the counselling.


Your max_length is set to 150, but your input_length is only 94. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=47)



Original Text:
 UP NEET UG Counselling 2024 Round 3: The Directorate of Medical Education and Training, Lucknow has opened the registration process for the third round of the Uttar Pradesh National Eligibility Entrance Test Undergraduate (UP NEET UG) counselling 2024. The registration link is now active on the official UP NEET website at upneet.gov.in. Candidates have until October 9 to complete their registration for round 3 of the counselling.

Summary:
 The registration process for the third round of the Uttar Pradesh National Eligibility Entrance Test Undergraduate (UP NEET UG) counselling 2024 has opened. The registration link is now active on the official UP NEET website at upneet.gov.in.
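
The `max_length` warning in the output above appears because the fixed `max_length=150` exceeds the 94-token input. One way to avoid it is to derive the generation lengths from the input length. The sketch below assumes the halving heuristic that the warning itself suggests (`max_length=47` for 94 tokens); `pick_summary_lengths` and its default caps are hypothetical, not part of `transformers`.

```python
def pick_summary_lengths(input_length, max_cap=150, min_floor=40):
    """Choose (max_length, min_length) for a summary of an input_length-token text.

    max_length is half the input, capped at max_cap; min_length never
    exceeds max_length.
    """
    max_length = max(1, min(max_cap, input_length // 2))
    min_length = min(min_floor, max_length)
    return max_length, min_length


print(pick_summary_lengths(94))    # → (47, 40), the short input from the run above
print(pick_summary_lengths(1000))  # → (150, 40), a long input hits the cap
```

The input length in tokens could be obtained from the pipeline's own tokenizer, for example `len(summarizer.tokenizer(chunk_text)["input_ids"])`, and the resulting pair passed as `max_length` and `min_length` in the summarizer call.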


News Article Summarizer

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [7]:
!apt-get install python3-lxml libxml2-dev libxslt-dev
!pip install newspaper3k

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'libxslt1-dev' instead of 'libxslt-dev'
libxml2-dev is already the newest version (2.9.13+dfsg-1ubuntu0.4).
The following additional packages will be installed:
  python3-bs4 python3-chardet python3-html5lib python3-soupsieve python3-webencodings
Suggested packages:
  python3-genshi python-lxml-doc
The following NEW packages will be installed:
  libxslt1-dev python3-bs4 python3-chardet python3-html5lib python3-lxml python3-soupsieve
  python3-webencodings
0 upgraded, 7 newly installed, 0 to remove and 49 not upgraded.
Need to get 1,678 kB of archives.
After this operation, 8,468 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 libxslt1-dev amd64 1.1.34-4ubuntu0.22.04.1 [219 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 python3-soupsieve all 2.3.1-1 [33.0 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/ma

In [8]:
!pip install newspaper3k==0.2.8



In [9]:
!pip install beautifulsoup4 requests




In [10]:
# Install necessary packages
!pip install newspaper3k googletrans==4.0.0-rc1

from newspaper import Article
from googletrans import Translator
import ipywidgets as widgets
from IPython.display import display, clear_output, Markdown

# Variables
article = ''
art_title = ''
article_sum = ''
art_content = ''

# Translating and Summary Function
def translate_news(b):
    global article, art_title, article_sum, art_content

    url = news_url.value.strip()
    lang = lang_selection.value

    with output:
        clear_output()
        try:
            # Get Article
            article = Article(url)
            article.download()
            article.parse()
            article.nlp()

            # Translate Article
            translator = Translator()
            art_title = translator.translate(article.title, dest=lang).text
            article_sum = translator.translate(article.summary, dest=lang).text
            art_content = translator.translate(article.text, dest=lang).text
        except Exception as e:
            # Download, parsing, and translation can all fail on bad URLs or network errors
            print(f"Could not process the article: {e}")
            return

        # Display results; Markdown() renders the bold markers instead of printing them literally
        display(Markdown(f"**Article Title:** {art_title}"))
        display(Markdown(f"**Article Summary:** {article_sum}"))
        display(Markdown(f"**Publish Date:** {article.publish_date}"))
        display(Markdown(f"**Top Image Link:** {article.top_image}"))

# Downloading article Function
def download_article(b):
    with open("News_summary_file.txt", "w", encoding="utf-8") as file1:
        file1.write(f"Title:\n{art_title}\n\n")
        file1.write(f"Article Summary:\n{article_sum}\n")

# URL Entry Field
news_url = widgets.Text(
    value='',
    placeholder='Enter the URL of the news article',
    description='URL:',
    disabled=False
)

# Language Selection Dropdown
lang_selection = widgets.Dropdown(
    options=[('English', 'en'), ('Hindi', 'hi')],
    value='en',
    description='Language:'
)

# Buttons for actions
translate_button = widgets.Button(
    description="Translate and Summarize",
    button_style='primary'
)
translate_button.on_click(translate_news)

download_button = widgets.Button(
    description="Download Article",
    button_style='success'
)
download_button.on_click(download_article)

# Output area for displaying results
output = widgets.Output()

# Display widgets
display(news_url, lang_selection, translate_button, download_button, output)


Collecting googletrans==4.0.0-rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Downloading httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading hstspreload-2024.10.1-py3-none-any.whl.metadata (2.1 kB)
Collecting chardet==3.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading chardet-3.0.4-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting idna==2.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading idna-2.10-py2.py3-none-any.whl.metadata (9.1 kB)
Collecting rfc3986<2,>=1.3 (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading rfc3986-1.5.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting httpcore==0.9.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading httpcore-0.9.1-py3-none-any.whl.metadata (4.6 kB)
Collecting h11<0.10,>=0.8 (from httpcore==0.9.*->httpx==0.13.3->goog

Text(value='', description='URL:', placeholder='Enter the URL of the news article')

Dropdown(description='Language:', options=(('English', 'en'), ('Hindi', 'hi')), value='en')

Button(button_style='primary', description='Translate and Summarize', style=ButtonStyle())

Button(button_style='success', description='Download Article', style=ButtonStyle())

Output()
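
The URL field accepts any string, and `Article.download()` is called on it directly. A light check before downloading can surface obvious mistakes early. The sketch below uses only the standard library; `is_probable_article_url` is a hypothetical helper that checks the URL's shape, not whether the page exists.

```python
from urllib.parse import urlparse


def is_probable_article_url(url):
    """Return True if url is an absolute http(s) URL with a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for candidate in ["https://example.com/news/story", "example.com/story", ""]:
    print(repr(candidate), is_probable_article_url(candidate))
```

In `translate_news`, such a check could run on `news_url.value` before the `Article` object is created, with a short message printed to the output area when it fails.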

Website Using an ngrok Authtoken

In [18]:
!ngrok authtoken YOUR_NGROK_AUTHTOKEN  # replace the placeholder with your ngrok token

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
# Step 1: Install necessary packages
!pip install flask newspaper3k googletrans==4.0.0-rc1 pyngrok

# Step 2: Import required libraries
from flask import Flask, request
from newspaper import Article
from googletrans import Translator
from pyngrok import ngrok

# Step 3: Create Flask app
app = Flask(__name__)

# Step 4: Define the home route
@app.route('/')
def home():
    return '''
    <!doctype html>
    <html lang="en">
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <title>News Article Summarizer</title>
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
    </head>
    <body>
        <div class="container mt-5">
            <h1>News Article Summarizer</h1>
            <form action="/summarize" method="post">
                <div class="form-group">
                    <label for="url">Article URL:</label>
                    <input type="text" class="form-control" id="url" name="url" required>
                </div>
                <div class="form-group">
                    <label for="language">Select Language:</label>
                    <select class="form-control" id="language" name="language">
                        <option value="en">English</option>
                        <option value="hi">Hindi</option>
                    </select>
                </div>
                <button type="submit" class="btn btn-primary">Summarize and Translate</button>
            </form>
        </div>
    </body>
    </html>
    '''

# Step 5: Define the summarize and translate route
from html import escape  # escape untrusted article text before embedding it in HTML

@app.route('/summarize', methods=['POST'])
def summarize():
    url = request.form['url']
    lang = request.form['language']

    try:
        # Get Article
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()

        # Translate Article
        translator = Translator()
        art_title = translator.translate(article.title, dest=lang).text
        article_sum = translator.translate(article.summary, dest=lang).text
        art_content = translator.translate(article.text, dest=lang).text
    except Exception as e:
        # Download, parsing, and translation can all fail on bad URLs or network errors
        return f'<p>Could not process the article: {escape(str(e))}</p><a href="/">Back</a>', 500

    # Every interpolated value is escaped so article text cannot inject markup
    return f'''
    <!doctype html>
    <html lang="en">
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <title>Summary Result</title>
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css">
    </head>
    <body>
        <div class="container mt-5">
            <h1>{escape(art_title)}</h1>
            <p><strong>Summary:</strong></p>
            <p>{escape(article_sum)}</p>
            <p><strong>Full Content:</strong></p>
            <p>{escape(art_content)}</p>
            <p><strong>Publish Date:</strong> {escape(str(article.publish_date))}</p>
            <img src="{escape(article.top_image)}" alt="Article Image" class="img-fluid">
            <br>
            <a href="/" class="btn btn-secondary mt-3">Back</a>
        </div>
    </body>
    </html>
    '''

# Step 6: Start the Flask app
if __name__ == '__main__':
    # Create a tunnel to the localhost port 5000
    tunnel = ngrok.connect(5000)
    # Print the tunnel's public URL rather than the NgrokTunnel object itself
    print(f" * ngrok tunnel \"{tunnel.public_url}\" -> \"http://127.0.0.1:5000\"")
    app.run(port=5000)


ERROR:root:Unexpected exception finding object shape
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/google/colab/_debugpy_repr.py", line 54, in get_shape
    shape = getattr(obj, 'shape', None)
  File "/usr/local/lib/python3.10/dist-packages/werkzeug/local.py", line 318, in __get__
    obj = instance._get_current_object()
  File "/usr/local/lib/python3.10/dist-packages/werkzeug/local.py", line 519, in _get_current_object
    raise RuntimeError(unbound_message) from None
RuntimeError: Working outside of request context.

This typically means that you attempted to use functionality that needed
an active HTTP request. Consult the documentation on testing for
information about how to avoid this problem.


 * ngrok tunnel "NgrokTunnel: "https://4086-34-125-246-164.ngrok-free.app" -> "http://localhost:5000"" -> "http://127.0.0.1:5000"
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit