# Streamlit Cloud Scraper

This Streamlit application automates taking screenshots of web pages, extracting contact information, and downloading the results. It uses Selenium WebDriver and offers optional proxy support to bypass geo-restrictions.

## Features
- Screenshot Capture: Automatically takes a screenshot of the specified web page.
- Contact Information Extraction: Extracts emails and phone numbers from the page's content.
- Text Content Extraction: Extracts all visible text from the web page.
- Proxy Support: Optional proxy configuration to bypass geo-blocking, supporting SOCKS4 and SOCKS5 proxies.
- Download Options: Allows users to download the screenshot and extracted text content.
- Version Information: Displays version information for Python, Streamlit, Selenium, Chromedriver, and Chromium.
- Logging: Captures and displays Selenium logs for debugging.
## Requirements

- Python 3.6+
- Streamlit
- Selenium
- BeautifulSoup (`beautifulsoup4`)
- Chromedriver (make sure `chromedriver` is installed and accessible)
## Installation

- **Clone the repository**

  ```bash
  git clone https://github.com/your-repo/streamlit-cloud-scraper.git
  cd streamlit-cloud-scraper
  ```

- **Create a virtual environment**

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- **Install dependencies**

  ```bash
  pip install -r requirements.txt
  ```

- **Run the application**

  ```bash
  streamlit run streamlit_app.py
  ```
## Usage

Enter the URL of the web page you want to scrape in the provided text input field.
- Enable Proxy: Toggle to enable proxy support.
- Select Proxy Type: Choose between SOCKS4 and SOCKS5.
- Refresh Proxy List: If proxies are enabled, click to refresh the list of available proxies.
- Select Country: Choose the country for your proxy, if applicable.
- Select Proxy: Choose a specific proxy from the available list.
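Under the hood, a SOCKS proxy is typically handed to Chrome as a `--proxy-server` command-line argument. A minimal sketch of that wiring (the helper name is hypothetical; the exact setup in `streamlit_app.py` may differ):

```python
# Sketch: build Chrome's --proxy-server argument for a SOCKS proxy.
# The helper name and example address are assumptions, not the app's actual code.

def proxy_argument(proxy_type: str, host: str, port: int) -> str:
    """Return the Chrome command-line flag that routes traffic through a SOCKS proxy."""
    if proxy_type not in ("socks4", "socks5"):
        raise ValueError("proxy_type must be 'socks4' or 'socks5'")
    return f"--proxy-server={proxy_type}://{host}:{port}"

# In the Selenium setup this would be added to the Chrome options, e.g.:
#   options = webdriver.ChromeOptions()
#   options.add_argument(proxy_argument("socks5", "203.0.113.7", 1080))
```

Because Chrome handles the proxy itself, no extra Python-side networking code is needed once the flag is set.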
Click the "Start Selenium run and take screenshot" button to start the scraping process. The application will:
- Navigate to the specified URL using Selenium.
- Take a screenshot of the webpage.
- Extract contact information (emails and phone numbers).
- Extract all visible text content from the webpage.
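The contact-extraction step above can be sketched with regular expressions over the page's visible text. The patterns below are illustrative; the app's actual regexes may differ:

```python
import re

# Hypothetical patterns -- broad enough for common formats, not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contacts(text: str) -> dict:
    """Return de-duplicated, sorted emails and phone numbers found in page text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted({p.strip() for p in PHONE_RE.findall(text)}),
    }
```

For example, `extract_contacts("Reach us at info@example.com or call +1 (555) 123-4567.")` finds one email and one phone number.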
## Output

- Screenshot: View the screenshot of the webpage and download it as a PNG file.
- Contact Information: View the extracted emails and phone numbers.
- Text Content: View the extracted text content and download it as a TXT file.
- Logs: View the Selenium logs to debug any issues.
## Project Structure

```
streamlit-cloud-scraper/
├── logs/                # Log files generated by Selenium
├── screenshots/         # Screenshots taken by Selenium
├── streamlit_app.py     # Main Streamlit application script
├── requirements.txt     # Python dependencies
└── README.md            # Documentation file
```
## Troubleshooting

- Chromedriver Issues: Ensure that `chromedriver` is installed and properly set up in your PATH. You can download it from here.
- Proxy Errors: Make sure the proxy settings are correct and that the proxy is functional.
- Permissions: Ensure the application has the necessary permissions to create directories and write files in the working directory.