# Web Scraper Tool for Data Collection and Brochure Generation


## Overview
This repository contains a Python web scraper that turns a web page into a brochure. Given a URL, it extracts the page title and paragraph text, lays them out in a PDF, and stores them in an SQLite database for later use.

## Features

- **Web Scraping**: Scrapes website content, including titles and paragraphs.
- **Brochure Generation**: Creates a PDF brochure using the scraped content.
- **Database Integration**: Stores the scraped website data in an SQLite database for future reference.
- **Customizable**: Works with any URL you pass in, provided the content is present in the HTML the server returns.

## Technologies Used

- **Python**: The main programming language used for the project.
- **BeautifulSoup**: For parsing and scraping HTML data from websites.
- **Requests**: For making HTTP requests to fetch web pages.
- **FPDF**: For creating the PDF brochure from the scraped data.
- **SQLite**: For storing scraped data in a database.

## Installation

### 1. Clone the repository

```bash
git clone https://github.com/yourusername/web-scraper-brochure.git
cd web-scraper-brochure
```

### 2. Install dependencies

Make sure you have Python 3 installed. Then, install the required libraries:

```bash
pip install requests beautifulsoup4 fpdf
```
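
To confirm the libraries installed correctly, a quick check like the following can help. This is a minimal sketch: `missing_packages` is a hypothetical helper, not part of this repository, and the module names are the import names used by the scraper code (`bs4` is the import name for `beautifulsoup4`).

```python
import importlib.util

def missing_packages(modules=('requests', 'bs4', 'fpdf')):
    """Return the names of modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = missing_packages()
print('Missing:', ', '.join(missing) if missing else 'none')
```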

## Usage

### Scrape Data and Generate Brochure

To scrape a page and generate a brochure, call `generate_brochure_from_website` from Python. The import below assumes the code is saved as `scraper.py`:

```python
from scraper import generate_brochure_from_website

# Provide the URL of the website you want to scrape
generate_brochure_from_website('https://example.com', filename='brochure.pdf')
```

This will:
1. Fetch the page and extract its title and paragraph text.
2. Generate a PDF brochure named `brochure.pdf` from the paragraphs longer than 50 characters.
3. Store the title and paragraph text in an SQLite database for future use.
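
To build brochures for several sites in one run, each call needs its own output file. A minimal sketch, assuming a hypothetical helper `brochure_filename` (not part of this repository) that derives a filename from the URL's host; the second URL is only an illustration:

```python
import re
from urllib.parse import urlparse

def brochure_filename(url):
    """Derive a PDF filename from the host part of a URL (hypothetical helper)."""
    host = urlparse(url).netloc or 'site'
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r'[^A-Za-z0-9]+', '-', host).strip('-').lower()
    return f'{slug}-brochure.pdf'

urls = ['https://example.com', 'https://www.python.org/about/']
for url in urls:
    print(url, '->', brochure_filename(url))
    # With the scraper importable, each brochure could then be built with:
    # generate_brochure_from_website(url, filename=brochure_filename(url))
```

Note that two URLs on the same host map to the same filename, so the later brochure would overwrite the earlier one.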

## Database

The scraped data is stored in an SQLite database, named `scraped_data.db` by default (pass `db_name` to change it). The database contains the following table:

- **webpages**:
  - `id`: Integer primary key
  - `title`: Title of the webpage
  - `content`: Paragraph text scraped from the webpage, joined into a single string
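
Reading the stored rows back only needs the standard `sqlite3` module. The sketch below assumes the `webpages` schema described above; `fetch_pages` is a hypothetical helper, and the demo uses an in-memory database with a made-up row so it runs without a prior scrape.

```python
import sqlite3

def fetch_pages(conn, title_contains=None):
    """Return (id, title, content) rows, optionally filtered by a title substring."""
    query = 'SELECT id, title, content FROM webpages'
    params = ()
    if title_contains is not None:
        query += ' WHERE title LIKE ?'
        params = (f'%{title_contains}%',)
    query += ' ORDER BY id'
    return conn.execute(query, params).fetchall()

# Demo: an in-memory database that mirrors the documented schema.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE webpages (id INTEGER PRIMARY KEY, title TEXT, content TEXT)')
conn.execute(
    'INSERT INTO webpages (title, content) VALUES (?, ?)',
    ('Example Domain', 'This domain is for use in illustrative examples.'),
)
conn.commit()

rows = fetch_pages(conn, title_contains='Example')
for page_id, title, content in rows:
    print(page_id, title, len(content))
```

Against real data, replace the in-memory connection with `sqlite3.connect('scraped_data.db')`.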

## Contributing

1. Fork the repository.
2. Create your feature branch (`git checkout -b feature-name`).
3. Commit your changes (`git commit -am 'Add feature'`).
4. Push to the branch (`git push origin feature-name`).
5. Create a new Pull Request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)
- [FPDF](http://www.fpdf.org/)
- [SQLite](https://www.sqlite.org/)


In [None]:

import requests
from bs4 import BeautifulSoup
from fpdf import FPDF
import sqlite3

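def scrape_website(url):
    """Fetch a page and return its title and the text of every <p> element."""
    # A timeout keeps the call from hanging; raise_for_status surfaces HTTP errors.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # soup.title.string is None when <title> is empty or contains nested tags.
    title = soup.title.string.strip() if soup.title and soup.title.string else 'No Title Found'
    paragraphs = [p.get_text() for p in soup.find_all('p')]
    return title, paragraphs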

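def process_data(title, paragraphs):
    # Keep only paragraphs longer than 50 characters after trimming whitespace.
    stripped = (para.strip() for para in paragraphs)
    processed_paragraphs = [para for para in stripped if len(para) > 50]
    return {'title': title, 'paragraphs': processed_paragraphs}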

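def create_brochure(data, filename='brochure.pdf'):
    def latin1(text):
        # FPDF's built-in fonts only cover Latin-1; replace other characters with '?'.
        return text.encode('latin-1', 'replace').decode('latin-1')

    pdf = FPDF()
    pdf.set_auto_page_break(auto=True, margin=15)
    pdf.add_page()
    # Title: bold, centered across the printable width (w=0 runs to the right margin).
    pdf.set_font('Arial', 'B', 16)
    pdf.cell(0, 10, txt=latin1(data['title']), ln=True, align='C')
    pdf.ln(10)
    # Body: one wrapped block per paragraph.
    pdf.set_font('Arial', size=12)
    for para in data['paragraphs']:
        pdf.multi_cell(0, 10, txt=latin1(para))
    pdf.output(filename)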

def initialize_db(db_name='scraped_data.db'):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    cursor.execute('''CREATE TABLE IF NOT EXISTS webpages (
                        id INTEGER PRIMARY KEY,
                        title TEXT,
                        content TEXT
                    )''')
    conn.commit()
    conn.close()

def store_data(title, paragraphs, db_name='scraped_data.db'):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    content = ' '.join(paragraphs)
    cursor.execute('INSERT INTO webpages (title, content) VALUES (?, ?)', (title, content))
    conn.commit()
    conn.close()

def generate_brochure_from_website(url, filename='brochure.pdf', db_name='scraped_data.db'):
    title, paragraphs = scrape_website(url)
    processed_data = process_data(title, paragraphs)
    create_brochure(processed_data, filename)
    # Create the webpages table on first use; without this the INSERT fails.
    initialize_db(db_name)
    store_data(title, paragraphs, db_name)
    print(f'Brochure created and data stored for: {url}')
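
The 50-character filter in `process_data` decides what reaches the PDF, so its effect is worth seeing in isolation. The sketch below repeats the function so the block runs on its own; the sample strings are made up for illustration.

```python
def process_data(title, paragraphs):
    # Same filter as the scraper: keep stripped paragraphs longer than 50 characters.
    processed = [p.strip() for p in paragraphs if len(p.strip()) > 50]
    return {'title': title, 'paragraphs': processed}

sample = [
    '  Home | About | Contact  ',  # short navigation text, dropped
    'Our workshop has repaired vintage bicycles for local riders since it first opened.',
    '',  # empty paragraph, dropped
]
result = process_data('Demo Page', sample)
print(result['paragraphs'])
```

Only the long sentence survives, which is why short navigation links and empty `<p>` tags do not appear in the brochure.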
