
HPScraper

HPScraper is a high-performance, event-driven utility for fetching and processing web content asynchronously. It handles many requests concurrently while remaining efficient and robust.


🌟 Features

  • Flexible DOM Handling: Offers a wrapper over lexbor, providing a JS-like API for the HTML DOM.
  • Highly Asynchronous: Designed for non-blocking operations.
  • Protocol Support: Works with HTTP/1, HTTP/1.1, and HTTP/2.
  • Advanced Network Features: Supports proxies, authentication, and POST requests.
  • Optimized: Benefits from a preallocated connection pool.
  • User-Friendly: Clean and straightforward API.
  • Platform Compatibility: Cross-platform support.
  • Highly Efficient: Uses epoll, kqueue, Windows IOCP, Solaris event ports, and Linux io_uring via libuv.
  • Extensible: Provides wrappers for curl, libuv, and lexbor for independent use.
  • Modern C++: Harnesses modern C++ capabilities for optimized memory management and a more intuitive API.

🚀 Remaining Tasks

Here's our prioritized roadmap:

  1. ⚙️ Error Management: Better error insights and debugging.
  2. 🌍 Headless Chrome: Access JS-rendered pages efficiently.
  3. 📚 Expand Documentation: Cover all features and use-cases.
  4. 🧪 CI/CD Improvements: Streamline updates.
  5. 🏁 Performance Benchmarks: Compare against competitors.

🛠 Prerequisites

Before diving in, ensure you have installed the following libraries:

  • libcurl
  • libuv
  • liblexbor

🚀 Getting Started

1. Compilation:

$ g++ -std=c++17 your_source_file.cpp -o scraper -lcurl -luv -llexbor -Ofast

2. Initialization:

Initialize the Async instance:

constexpr int concurrent_connections = 200, max_host_connections = 10;
std::unique_ptr<Async> scraper = std::make_unique<Async>(concurrent_connections, max_host_connections);

3. Configuration:

Customize as needed:

scraper->setUserAgent("Scraper/ 1.1");
scraper->setMultiplexing(true);
scraper->setHttpVersion(HTTP::HTTP2);

For additional settings:

scraper->setVerbose(true);
scraper->setProxy("188.87.102.128", 3128);

4. Seed URL:

Start your scraping journey:

scraper->seed("https://www.google.com/");

5. Event Management:

Incorporate custom event handlers:

scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
    // Process the response...
});
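
The full example below also registers handlers for failures, exceptions, and idle periods. Their signatures are taken from that example; the bodies here are only a minimal sketch, and using the Async& parameter to seed further URLs is an assumption rather than documented behaviour:

scraper->onFailure([](const CurlEasyHandle::Response& response, Async& instance) {
    // Inspect the failed response here, e.g. log response.url.
});

scraper->onException([](const std::exception& e, Async& instance) {
    std::cerr << "Exception encountered: " << e.what() << '\n';
});

scraper->onIdle([](long pending, Async& instance) {
    // 'pending' is presumably the number of outstanding transfers;
    // this is presumably a good place to enqueue more URLs via instance.
});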

6. Execution:

Get the scraper running:

scraper->run();

⚙️ Advanced Options

HPScraper offers a myriad of options to fine-tune your scraping experience:

  • setMultiplexing(bool): Enable or disable HTTP/2 multiplexing.
  • setHttpVersion(HTTP): Opt for your preferred HTTP version.
  • More options available in our detailed documentation.
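
Connection-pool sizing is another tuning lever: the two constructor arguments from the Getting Started step bound the total number of concurrent transfers and the number of connections per host. A minimal sketch of a more conservative setup (the values are illustrative, not recommendations):

constexpr int concurrent_connections = 50, max_host_connections = 2;
auto scraper = std::make_unique<Async>(concurrent_connections, max_host_connections);
scraper->setMultiplexing(true);        // HTTP/2 multiplexing, as described above
scraper->setHttpVersion(HTTP::HTTP2); // the only version enumerator shown in this README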

📌 Example Usage

Using Scraper

int main(){

    constexpr int concurrent_connections = 200, max_host_connections = 10;
    std::unique_ptr<Async> scraper = std::make_unique<Async>(concurrent_connections, max_host_connections);
    scraper->setUserAgent("Scraper/ 1.1");
    scraper->setMultiplexing(true);
    scraper->setHttpVersion(HTTP::HTTP2);
    //scraper->setVerbose(true);
    //scraper->setProxy("188.87.102.128",3128);
    scraper->seed("https://www.google.com/");

    scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
        std::cout << "URL: " << response.url << '\n';
        std::cout << "Received: " << response.bytesRecieved << " bytes\n";
        std::cout << "Content Type: " << response.contentType << '\n';
        std::cout << "Total Time: " << response.totalTime << '\n';
        std::cout << "HTTP Version: " << response.httpVersion << '\n';
        std::cout << "HTTP Method: " << response.httpMethod << '\n';
        std::cout << "Download speed: " << response.bytesPerSecondR << " bytes/sec\n";
        std::cout << "Header Size: " << response.headerSize << " bytes\n";

        auto body = page.rootElement();
        auto div = body->getElementsByTagName("div")->item(0);
        auto links = div->getLinksMatching("");

        for (auto i : *links.get()) std::cout << i << '\n';
    });

    scraper->onIdle([](long pending, Async& instance) {

    });

    scraper->onException([](const std::exception& e, Async& instance) {
        std::cerr << "Exception encountered: " << e.what() << std::endl;
    });

    scraper->onFailure([](const CurlEasyHandle::Response& response, Async& instance) {

    });

    scraper->run();
}

Using Parser

int main() {
    std::string htmlContent = R"(
        <!DOCTYPE html>
        <html>
        <head>
            <title>Test Page</title>
        </head>
        <body>
            <div class="col-md">
                <div>Text 1 inside col-md</div>
                <a href="http://example.com">Example Link</a>
                <div data-custom="value">Text 2 inside col-md</div>
            </div>
            <div class="col-md">
                <div>Text 3 inside col-md</div>
            </div>
        </body>
        </html>
    )";

    Parser parser;
    Document doc = parser.createDOM(htmlContent);

    auto root = doc.rootElement();
    auto colMdElements = root->getElementsByClassName("col-md");

    for (std::size_t i = 0; i < colMdElements->length(); ++i) {
        auto colMdNode = colMdElements->item(i);
        auto divElements = colMdNode->getElementsByTagName("div");

        for (std::size_t j = 0; j < divElements->length(); ++j) {
            auto divNode = divElements->item(j);
            std::cout << divNode->text() << '\n';

            if(divNode->hasAttributes()) {
                auto attributes = divNode->getAttributes();
                for(const auto& [attr, value] : *attributes) {
                    std::cout << "Attribute: " << attr << ", Value: " << value << std::endl;
                }
            }

            if(divNode->hasAttribute("data-custom")) {
                std::cout << "Data-custom attribute: " << divNode->getAttribute("data-custom") << '\n';
            }
        }

        if (colMdNode->hasChildElements()) {
            auto firstChild = colMdNode->firstChild();
            auto lastChild = colMdNode->lastChild();

            std::cout << "First child's text content: " << firstChild->text() << '\n';
            std::cout << "Last child's text content: " << lastChild->text() << '\n';
        }

        auto links = colMdNode->getLinksMatching("http://example.com");
        for(const auto& link : *links) {
            std::cout << "Matching Link: " << link << '\n';
        }
    }

    return 0;
}

Using Eventloop

int main() {
   
    EventLoop eventLoop;

    TimerWrapper timer(eventLoop);
    timer.on<TimerEvent, TimerWrapper>([](const TimerEvent&, TimerWrapper&) {
        std::cout << "Timer triggered after 2 seconds!" << std::endl;
    });
    timer.start(2000, 0);  // Timeout of 2 seconds, no repeat

   
    IdleWrapper idle(eventLoop);
    idle.on<IdleEvent, IdleWrapper>([](const IdleEvent&, IdleWrapper&) {
        std::cout << "Idle handler running..." << std::endl;
    });
    idle.start();

   
    TimerWrapper stopTimer(eventLoop);
    stopTimer.on<TimerEvent, TimerWrapper>([&eventLoop](const TimerEvent&, TimerWrapper&) {
        std::cout << "Stopping event loop after 5 seconds..." << std::endl;
        eventLoop.stop();
    });
    stopTimer.start(5000, 0);  // Timeout of 5 seconds, no repeat


    eventLoop.run();

    return 0;
}

Using Fetcher

int main() {

    int buffer_size = 1024;
    long timeout = 5000;
    CurlEasyHandle curlHandle(buffer_size, timeout); 

    curlHandle.setUrl("www.google.com");
    curlHandle.fetch([](CurlEasyHandle::Response* response){
        std::cout << response->message;
    });
    return 0;
}
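
The callback receives the same CurlEasyHandle::Response type used by the Async onSuccess handler, so, assuming the single-handle path fills in the same fields, you can inspect more than the message body:

curlHandle.fetch([](CurlEasyHandle::Response* response) {
    std::cout << "URL: " << response->url << '\n';
    std::cout << "Content Type: " << response->contentType << '\n';
    std::cout << response->message;
});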

See the examples directory for more complete samples.

🀝 Contributing

We appreciate contributions! If you're considering significant modifications, kindly initiate a discussion by opening an issue first.

📄 License

HPScraper is licensed under the MIT License.

Acknowledgements

This software uses the following libraries:

  • libcurl: Licensed under the MIT License.
  • libuv: Licensed under the MIT License.
  • liblexbor: Licensed under the Apache License, Version 2.0.

When using HPScraper, please ensure you comply with the requirements and conditions of all included licenses.
