HPScraper is a high-performance, event-driven utility for asynchronously fetching and processing web content. It manages many requests concurrently, making it both efficient and robust.
- Flexible DOM Handling: Offers a wrapper over `lexbor`, providing a JS-like API for the HTML DOM.
- Highly Asynchronous: Designed for non-blocking operations.
- Protocol Support: Works with HTTP/1, HTTP/1.1, and HTTP/2.
- Advanced Network Features: Supports proxies, authentication, and POST requests.
- Optimized: Benefits from a preallocated connection pool.
- User-Friendly: Clean and straightforward API.
- Platform Compatibility: Cross-platform support.
- Highly Efficient: Uses epoll, kqueue, Windows IOCP, Solaris event ports, and Linux io_uring via `libuv`.
- Extensible: Provides wrappers for `curl`, `libuv`, and `lexbor` for independent use.
- Modern C++: Harnesses modern C++ capabilities for optimized memory management and a more intuitive API.
Here's our prioritized roadmap:
- ⚙️ Error Management: Better error insights and debugging.
- 🌍 Headless Chrome: Access JS-rendered pages efficiently.
- 📚 Expand Documentation: Cover all features and use-cases.
- 🧪 CI/CD Improvements: Streamline updates.
- 🏁 Performance Benchmarks: Compare against competitors.
Before diving in, ensure you have installed the following libraries:
- `libcurl`
- `libuv`
- `liblexbor`
$ g++ your_source_file.cpp -o scraper -lcurl -luv -llexbor -Ofast
Initialize the `Async` instance:
constexpr int concurrent_connections = 200, max_host_connections = 10;
std::unique_ptr<Async> scraper = std::make_unique<Async>(concurrent_connections, max_host_connections);
Customize as needed:
scraper->setUserAgent("Scraper/1.1");
scraper->setMultiplexing(true);
scraper->setHttpVersion(HTTP::HTTP2);
For additional settings:
scraper->setVerbose(true);
scraper->setProxy("188.87.102.128", 3128);
Start your scraping journey:
scraper->seed("https://www.google.com/");
Incorporate custom event handlers:
scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
// Process the response...
});
Get the scraper running:
scraper->run();
HPScraper offers a myriad of options to fine-tune your scraping experience:
- `setMultiplexing(bool)`: Enable or disable HTTP/2 multiplexing.
- `setHttpVersion(HTTP)`: Opt for your preferred HTTP version.
- More options available in our detailed documentation.
int main() {
    constexpr int concurrent_connections = 200, max_host_connections = 10;
    std::unique_ptr<Async> scraper = std::make_unique<Async>(concurrent_connections, max_host_connections);

    scraper->setUserAgent("Scraper/1.1");
    scraper->setMultiplexing(true);
    scraper->setHttpVersion(HTTP::HTTP2);
    //scraper->setVerbose(true);
    //scraper->setProxy("188.87.102.128", 3128);

    scraper->seed("https://www.google.com/");

    scraper->onSuccess([](const CurlEasyHandle::Response& response, Async& instance, Document& page) {
        std::cout << "URL: " << response.url << '\n';
        std::cout << "Received: " << response.bytesRecieved << " bytes\n";
        std::cout << "Content Type: " << response.contentType << '\n';
        std::cout << "Total Time: " << response.totalTime << '\n';
        std::cout << "HTTP Version: " << response.httpVersion << '\n';
        std::cout << "HTTP Method: " << response.httpMethod << '\n';
        std::cout << "Download speed: " << response.bytesPerSecondR << " bytes/sec\n";
        std::cout << "Header Size: " << response.headerSize << " bytes\n";

        auto body = page.rootElement();
        auto div = body->getElementsByTagName("div")->item(0);
        auto links = div->getLinksMatching("");
        for (const auto& link : *links) std::cout << link << '\n';
    });

    scraper->onIdle([](long pending, Async& instance) {
        // Called when the queue drains; seed more URLs here if needed.
    });

    scraper->onException([](const std::exception& e, Async& instance) {
        std::cerr << "Exception encountered: " << e.what() << '\n';
    });

    scraper->onFailure([](const CurlEasyHandle::Response& response, Async& instance) {
        // Inspect the failed response here, e.g. to log or retry the URL.
    });

    scraper->run();
}
int main() {
    std::string htmlContent = R"(
<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
</head>
<body>
<div class="col-md">
<div>Text 1 inside col-md</div>
<a href="http://example.com">Example Link</a>
<div data-custom="value">Text 2 inside col-md</div>
</div>
<div class="col-md">
<div>Text 3 inside col-md</div>
</div>
</body>
</html>
)";

    Parser parser;
    Document doc = parser.createDOM(htmlContent);
    auto root = doc.rootElement();
    auto colMdElements = root->getElementsByClassName("col-md");

    for (std::size_t i = 0; i < colMdElements->length(); ++i) {
        auto colMdNode = colMdElements->item(i);
        auto divElements = colMdNode->getElementsByTagName("div");

        for (std::size_t j = 0; j < divElements->length(); ++j) {
            auto divNode = divElements->item(j);
            std::cout << divNode->text() << '\n';

            if (divNode->hasAttributes()) {
                auto attributes = divNode->getAttributes();
                for (const auto& [attr, value] : *attributes) {
                    std::cout << "Attribute: " << attr << ", Value: " << value << '\n';
                }
            }

            if (divNode->hasAttribute("data-custom")) {
                std::cout << "Data-custom attribute: " << divNode->getAttribute("data-custom") << '\n';
            }
        }

        if (colMdNode->hasChildElements()) {
            auto firstChild = colMdNode->firstChild();
            auto lastChild = colMdNode->lastChild();
            std::cout << "First child's text content: " << firstChild->text() << '\n';
            std::cout << "Last child's text content: " << lastChild->text() << '\n';
        }

        auto links = colMdNode->getLinksMatching("http://example.com");
        for (const auto& link : *links) {
            std::cout << "Matching Link: " << link << '\n';
        }
    }

    return 0;
}
int main() {
    EventLoop eventLoop;

    TimerWrapper timer(eventLoop);
    timer.on<TimerEvent, TimerWrapper>([](const TimerEvent&, TimerWrapper&) {
        std::cout << "Timer triggered after 2 seconds!" << std::endl;
    });
    timer.start(2000, 0); // Timeout of 2 seconds, no repeat

    IdleWrapper idle(eventLoop);
    idle.on<IdleEvent, IdleWrapper>([](const IdleEvent&, IdleWrapper&) {
        std::cout << "Idle handler running..." << std::endl;
    });
    idle.start();

    TimerWrapper stopTimer(eventLoop);
    stopTimer.on<TimerEvent, TimerWrapper>([&eventLoop](const TimerEvent&, TimerWrapper&) {
        std::cout << "Stopping event loop after 5 seconds..." << std::endl;
        eventLoop.stop();
    });
    stopTimer.start(5000, 0);

    eventLoop.run();
    return 0;
}
int main() {
    int buffer_size = 1024;
    long timeout = 5000;

    CurlEasyHandle curlHandle(buffer_size, timeout);
    curlHandle.setUrl("www.google.com");
    curlHandle.fetch([](CurlEasyHandle::Response* response) {
        std::cout << response->message;
    });
    return 0;
}
See the examples directory for more complete samples.
We appreciate contributions! If you're considering significant modifications, kindly initiate a discussion by opening an issue first.
HPScraper is licensed under the MIT License.
Acknowledgements

This software uses the following libraries:
- `libcurl`: Licensed under the MIT License.
- `libuv`: Licensed under the MIT License.
- `liblexbor`: Licensed under the Apache License, Version 2.0.
When using HPScraper, please ensure you comply with the requirements and conditions of all included licenses.