# Spiderman

Spiderman is a lightweight C++ framework for web scraping. It allows you to fetch and parse HTML content from a given URL using libcurl. The parsed content is tokenized into TAG and TEXT tokens, providing a foundation for further processing.
## Features

- Fetch HTML content from a URL.
- Tokenize HTML into `TAG` and `TEXT` components.
- Simple and easy-to-use API.
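As an illustration of the token model (the exact output depends on the parser implementation), a fragment such as `<p>Hello</p>` would be split into alternating tag and text tokens:

```
<p>Hello</p>  →  TAG("<p>"), TEXT("Hello"), TAG("</p>")
```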
## Requirements

- C++17 or later
- libcurl
## Installation

1. Clone the repository:

   ```sh
   git clone https://github.com/zoroxide/spiderman.git
   cd spiderman
   ```

2. Install libcurl (if not already installed):

   ```sh
   # For Debian/Ubuntu
   sudo apt update && sudo apt install libcurl4-openssl-dev
   ```

3. Compile the code using a C++ compiler (note that `-lcurl` goes after the source files, since some linkers resolve libraries in order):

   ```sh
   g++ -std=c++17 -o spiderman main.cpp spiderman.cpp -lcurl
   ```
## Usage

```cpp
#include <iostream>
#include <string>
#include <vector>

#include "spiderman.hpp"

int main() {
    std::string url = "https://zoroxide.pages.dev";

    spiderman scraper(url);
    std::string content = scraper.fetch();
    std::vector<Token> tokens = scraper.parse();

    if (!content.empty()) {
        for (const auto& token : tokens) {
            std::cout << "Type: " << token.type << ", Value: " << token.value << std::endl;
        }
    } else {
        std::cerr << "Failed to fetch content from " << url << std::endl;
    }

    return 0;
}
```

## Token Structure

The `Token` structure represents the components of the parsed HTML:
```cpp
struct Token {
    std::string type;  // "TAG" or "TEXT"
    std::string value; // The HTML tag or text content
};
```

## Project Structure

```
├── examples        # Examples
├── spiderman.hpp   # Header file
├── spiderman.cpp   # Implementation file
```
## Examples

To compile an example:

```sh
g++ examples/{example}.cpp spiderman.cpp -o example -lcurl
sudo ./example
```

## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Contributing

Contributions are welcome! Please feel free to open issues or submit pull requests.