Streamlit project to test Selenium running on Streamlit Cloud
Ralphdapythondev/Streamlit-Selenium

 
 


Streamlit Cloud Scraper 🕸️

This Streamlit application automates taking screenshots of web pages, extracting contact information, and downloading the results. It uses Selenium WebDriver and offers optional SOCKS proxy support to bypass geo-restrictions.

Features

  • Screenshot Capture: Automatically takes a screenshot of the specified web page.
  • Contact Information Extraction: Extracts emails and phone numbers from the page's content.
  • Text Content Extraction: Extracts all visible text from the web page.
  • Proxy Support: Optional proxy configuration to bypass geo-blocking, supporting SOCKS4 and SOCKS5 proxies.
  • Download Options: Allows users to download the screenshot and extracted text content.
  • Version Information: Displays version information for Python, Streamlit, Selenium, Chromedriver, and Chromium.
  • Logging: Captures and displays Selenium logs for debugging.

Requirements

  • Python 3.6+
  • Streamlit
  • Selenium
  • BeautifulSoup (beautifulsoup4)
  • Chromedriver (Make sure chromedriver is installed and accessible)
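For reference, a minimal requirements.txt covering the Python dependencies above could look like this (unpinned names only; the repo's actual file may pin versions):

```
streamlit
selenium
beautifulsoup4
```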

Installation

  1. Clone the Repository

    git clone https://github.com/Ralphdapythondev/Streamlit-Selenium.git
    cd Streamlit-Selenium
  2. Create a Virtual Environment

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  3. Install Dependencies

    pip install -r requirements.txt
  4. Run the Application

    streamlit run streamlit_app.py

How to Use

1. Input URL

Enter the URL of the webpage you want to scrape in the provided text input field.

2. Proxy Configuration (Optional)

  • Enable Proxy: Toggle to enable proxy support.
  • Select Proxy Type: Choose between SOCKS4 and SOCKS5.
  • Refresh Proxy List: If proxies are enabled, click to refresh the list of available proxies.
  • Select Country: Choose the country for your proxy, if applicable.
  • Select Proxy: Choose a specific proxy from the available list.
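Under the hood, a SOCKS proxy is typically handed to Chrome as a `--proxy-server` command-line flag. A minimal sketch of how the selections above could be turned into Chrome arguments (the function name and defaults are illustrative, not the app's actual API):

```python
def build_chrome_args(proxy_type=None, host=None, port=None):
    """Assemble Chrome flags; each string would be passed to
    selenium.webdriver.chrome.options.Options.add_argument()."""
    # Headless + no-sandbox are typical for containerized hosts like Streamlit Cloud.
    args = ["--headless=new", "--no-sandbox"]
    if proxy_type and host and port:
        # Chrome accepts socks4:// and socks5:// schemes in --proxy-server.
        args.append(f"--proxy-server={proxy_type.lower()}://{host}:{port}")
    return args
```

For example, `build_chrome_args("SOCKS5", "127.0.0.1", 1080)` yields the flags for routing traffic through a local SOCKS5 proxy, while calling it with no arguments produces a plain headless configuration.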

3. Start the Scraping Process

Click the "Start Selenium run and take screenshot" button to start the scraping process. The application will:

  • Navigate to the specified URL using Selenium.
  • Take a screenshot of the webpage.
  • Extract contact information (emails and phone numbers).
  • Extract all visible text content from the webpage.
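The contact-extraction step can be sketched with regular expressions over the page's visible text (these patterns are simplified illustrations, not necessarily the app's exact ones):

```python
import re

# Deliberately loose patterns -- robust email/phone matching is messier in practice.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contacts(text):
    """Return sorted, de-duplicated emails and phone-like strings."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    phones = sorted({m.strip() for m in PHONE_RE.findall(text)})
    return emails, phones
```

Running `extract_contacts("Mail info@example.com or call +1 555-123-4567 today.")` returns the email and phone number as two separate lists.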

4. View and Download Results

  • Screenshot: View the screenshot of the webpage and download it as a PNG file.
  • Contact Information: View the extracted emails and phone numbers.
  • Text Content: View the extracted text content and download it as a TXT file.
  • Logs: View the Selenium logs to debug any issues.

Project Structure

Streamlit-Selenium/
├── logs/                 # Log files generated by Selenium
├── screenshots/          # Screenshots taken by Selenium
├── streamlit_app.py      # Main Streamlit application script
├── requirements.txt      # Python dependencies
└── README.md             # Documentation file

Troubleshooting

  • Chromedriver Issues: Ensure that chromedriver is installed and available on your PATH. You can download it from the official ChromeDriver downloads page.
  • Proxy Errors: Make sure the proxy settings are correct and that the proxy is functional.
  • Permissions: Ensure the application has the necessary permissions to create directories and write files in the working directory.
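On Streamlit Cloud specifically, system packages such as the browser and its driver are usually declared in a packages.txt file of apt package names alongside the repo. That file is not shown in the project structure above, so treat these names as an assumption:

```
chromium
chromium-driver
```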
