This project is a keyword-based web scraper that extracts the text of a web page and classifies its paragraphs into predefined themes. A graphical interface lets users enter a URL and browse the organized results in an interactive table.

The project is built with the following technologies:
- Python: Main programming language.
- Requests: For making HTTP requests.
- BeautifulSoup: For parsing HTML.
- Pandas: For data manipulation and organization.
- ttkbootstrap: For creating a modern graphical interface.
- Tkinter: For message dialogs and image management in the interface.
- PIL (Pillow): For loading and displaying images in the graphical interface.

For a given URL, the scraper:
- Makes a GET request to the URL.
- Parses the HTML content to extract paragraphs.
- Classifies the paragraphs into themes based on predefined keywords.
- Returns a DataFrame with the paragraphs organized by theme.
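
The steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names (`classify_paragraph`, `paragraphs_to_frame`, `scrape`), the `THEME_KEYWORDS` map, the column names, and the `"other"` fallback theme are all assumptions made for the example.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical theme-to-keyword map; the project's real keywords may differ.
THEME_KEYWORDS = {
    "technology": ["software", "computer", "internet"],
    "health": ["medicine", "doctor", "disease"],
}


def classify_paragraph(text, theme_keywords=THEME_KEYWORDS):
    """Return the first theme whose keyword occurs in the text, else "other".

    Matching is a case-insensitive substring check, so "doctors" also
    matches the keyword "doctor".
    """
    lowered = text.lower()
    for theme, keywords in theme_keywords.items():
        if any(keyword in lowered for keyword in keywords):
            return theme
    return "other"


def paragraphs_to_frame(html):
    """Extract non-empty <p> texts from HTML and label each with a theme."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.find_all("p"):
        text = tag.get_text(" ", strip=True)
        if text:  # skip paragraphs that contain only whitespace
            rows.append({"theme": classify_paragraph(text), "paragraph": text})
    return pd.DataFrame(rows, columns=["theme", "paragraph"])


def scrape(url):
    """Fetch a page with a GET request and return its classified paragraphs."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx responses
    return paragraphs_to_frame(response.text)
```

Keeping the parsing step separate from the network request makes it possible to try the classification on a saved HTML string without fetching anything.
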

The application provides a simple, modern graphical interface:
- URL Entry: Allows the user to input the URL of the web page to scrape.
- Scraping Button: Starts the scraping process.
- Results Table: Displays the extracted and classified paragraphs in an interactive table.

A typical run proceeds as follows:
- URL Input: The user enters the URL in the graphical interface.
- Scraping: The application makes a request to the URL, parses the HTML, and extracts paragraphs.
- Classification: The paragraphs are classified into themes using predefined keywords.
- Visualization: The results are displayed in a table within the graphical interface.

Possible use cases include:
- AI Training: Facilitates the creation of datasets for training natural language processing models.
- Content Analysis: Useful for extracting and analyzing content from news sites, blogs, and other web pages.
- Market Research: Allows for the collection of data on products, trends, and consumer opinions.

Contributions are welcome. Please open an issue to discuss any changes you would like to make.