
**Step 1: Import the Required Libraries**

*Why these libraries?*

* trafilatura → Extracts main readable text from a web page (filters out ads, menus, scripts, etc.).

* reportlab → A powerful library for creating PDF documents from text and styled paragraphs.

* os → Used to extract the main part of the URL to generate dynamic filenames.

* re (Regular Expressions) → Used for safely splitting the text into paragraphs.

In [17]:
#Run this cell only once
#!pip install trafilatura

In [18]:
#Run this cell only once
#!pip install reportlab

In [5]:
import os
import re
import trafilatura
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch

**Step 2: Define the Webpage URL**

This is the webpage we want to extract text from.

The URL is read from a prompt, so any other page can be entered instead.

In [6]:
url = input("Enter the URL: ")

Enter the URL: https://www.youtube.com/watch?v=HNCypVfeTdw


**Step 3: Extract the “Main Part” of the URL Dynamically**

*What this does:*

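os.path.basename(url) returns everything after the last / in the URL. If the URL has a query string (the part starting with ?), that is included too, as the output below shows.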

In [7]:
base_name = os.path.basename(url)
base_name

'watch?v=HNCypVfeTdw'
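
The base name above contains `?` and `=`, which some operating systems do not allow in file names. A minimal sketch of one way to derive a safer name, using only the standard library; the helper name `safe_name_from_url` and the fallback `"page"` are assumptions made for illustration and are not part of the original notebook:

```python
import re
from urllib.parse import urlparse


def safe_name_from_url(url: str, default: str = "page") -> str:
    """Build a filesystem-friendly name from the last path segment and query."""
    parsed = urlparse(url)
    # Last non-empty path segment, e.g. "watch" for ".../watch?v=..."
    last_segment = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    # Keep the query string so different videos or pages get different names
    raw = "_".join(part for part in (last_segment, parsed.query) if part)
    # Replace every run of characters outside a conservative whitelist
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return cleaned or default


print(safe_name_from_url("https://www.youtube.com/watch?v=HNCypVfeTdw"))
# watch_v_HNCypVfeTdw
```

A URL that ends in `/` has an empty base name, so the fallback keeps the later PDF file name from being just `.pdf`.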

**Step 5: Download the Webpage Content**

*What’s happening:*

trafilatura.fetch_url() downloads the raw HTML of the page.

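We check whether the download succeeded. If it did not (fetch_url() returns None), we stop with a message.

In [8]:
print("Downloading webpage content...")
downloaded = trafilatura.fetch_url(url)
if not downloaded:
    # SystemExit stops this cell with a message; in a notebook, exit() can shut down the kernel
    raise SystemExit("Could not download the webpage. Exiting.")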

Downloading webpage content...


**Step 6: Extract Clean Text from the Downloaded HTML**

*Explanation:*

trafilatura.extract() automatically identifies and extracts main textual content.

We specify:

* output_format="txt" → We want plain text (not HTML or XML).

* include_comments=False → Exclude comment sections.

* include_tables=True → Include tables where possible.

This gives us a plain-text version of the page's main content, with ads, menus, and similar boilerplate filtered out where trafilatura can detect them.

In [9]:
print("Extracting text and structure...")

text = trafilatura.extract(
    downloaded,
    output_format="txt",
    include_comments=False,
    include_tables=True
)
if not text:
    # SystemExit stops this cell with a message; in a notebook, exit() can shut down the kernel
    raise SystemExit("No text was extracted. Exiting.")


Extracting text and structure...


**Step 7: Split the Text into Paragraphs**

*Why this matters:*

* Web text often contains multiple line breaks or random spacing.

* This regex safely splits the text into paragraphs using any sequence of blank lines.

* We also strip leading/trailing spaces and ignore empty lines.

Result → A clean list of paragraph strings.

In [10]:
import re

# Split on any run of blank lines, then drop empty or whitespace-only pieces
paragraph_list = []
for p in re.split(r'\n\s*\n', text):
    stripped_p = p.strip()
    if stripped_p:
        paragraph_list.append(stripped_p)
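
The splitting rule described in this step (any sequence of blank lines ends a paragraph) can be tried on a small string without downloading anything. This is a sketch only; the function name `split_paragraphs` and the sample text are assumptions made for illustration:

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text on runs of blank lines and drop empty pieces."""
    # \n\s*\n matches two newlines with optional whitespace between them,
    # so "\n\n", "\n\n\n" and "\n \n" all count as one paragraph break.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


sample = "First paragraph.\n\n\nSecond paragraph,\nstill second.\n \nThird."
print(split_paragraphs(sample))
# ['First paragraph.', 'Second paragraph,\nstill second.', 'Third.']
```

Single newlines inside a paragraph are kept, which is why Step 9 still has to convert them into line breaks.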

**Step 8: Set Up the PDF Document and Styles**

*Explanation:*

* SimpleDocTemplate sets up page size and margins.

* getSampleStyleSheet() loads default ReportLab text styles.

* We define a custom paragraph style with extra spacing before/after paragraphs for readability.

In [11]:
pdf_filename = f"{base_name}.pdf"
doc = SimpleDocTemplate(
    pdf_filename,
    pagesize=letter,
    rightMargin=inch,
    leftMargin=inch,
    topMargin=inch,
    bottomMargin=inch,
)

styles = getSampleStyleSheet()

In [12]:
paragraph_style = ParagraphStyle(
    'SpacedParagraph',
    parent=styles['Normal'],
    spaceBefore=12,  # Space before the paragraph in points
    spaceAfter=6,   # Space after the paragraph in points
)

**Step 9: Build the PDF Content (“Story”)**

* We start with a title at the top (h1 style).

* Add a small space after the title.

* Then we loop through all paragraphs and append each one to the story with the custom style.

In [13]:
story = []

In [14]:
title_text = base_name.replace('_', ' ')
story.append(Paragraph(title_text, styles['h1']))
story.append(Spacer(1, 12))

In [15]:
from xml.sax.saxutils import escape

for para_text in paragraph_list:
    # `ReportLab`'s `Paragraph` parses its text as XML-like markup, so we
    # escape &, < and > first. Each single newline then becomes two
    # `<br />` tags, which leaves a blank line between the lines.
    formatted_para_text = escape(para_text).replace('\n', '<br /> <br />')
    story.append(Paragraph(formatted_para_text, paragraph_style))


**Step 10: Generate and Save the PDF**

* This compiles the list of paragraphs (story) into a finished, styled PDF.

* The final file name is automatically based on the URL.

In [16]:
print(f"Building PDF: {pdf_filename}...")
doc.build(story)
print(f"PDF created successfully as '{pdf_filename}'")

Building PDF: watch?v=HNCypVfeTdw.pdf...
PDF created successfully as 'watch?v=HNCypVfeTdw.pdf'
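
The success message only shows that `doc.build()` returned without an error. As an optional sanity check, the sketch below confirms that a non-empty file exists and starts with the `%PDF-` signature that PDF files begin with. The helper name `looks_like_pdf` is an assumption made for illustration, and the check does not validate the rest of the file:

```python
import os


def looks_like_pdf(path: str) -> bool:
    """Return True if path is a non-empty file that starts with the PDF signature."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as fh:
        # Read only the first five bytes; PDF files begin with "%PDF-"
        return fh.read(5) == b"%PDF-"


# File name taken from the output above; prints False if the file is missing
print(looks_like_pdf("watch?v=HNCypVfeTdw.pdf"))
```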
