- 2024-05-13: Added Medium Article Link
- 2024-05-14: Added Tor Proxy to the Dockerfile
- 2024-05-15: Added ProtonVPN Proxy to the Dockerfile
- 2024-05-16: Added VPN Gateway bypass for local networks
Disclaimer: This repository can be cloned as a template repository and modified to suit your needs. It is for educational purposes only.
This repository contains a Dockerfile with a Python app that runs a Selenium scraper.
You can include your own scraper logic in the app/scraper.py file.
By default, the Docker container runs the scraper logic every 15 minutes.
After the scraper finishes, your data can be sent to a webhook, or you can save it yourself.
- Clone this repository
- Modify line 18 of the Dockerfile with your cronjob time
- Modify the app/scraper.py file with your own scraper logic
- (Optional) Modify the app/notifier.py file with your own notification logic
- Build the Docker container with sh build.sh
- Run the Docker container with sh run.sh
You can watch your scraper at work in the browser.
URL: http://localhost:4444
Password: secret
...
# modify the cronjob time here
RUN (crontab -l ; echo "*/15 * * * * /opt/scheduler.sh >> /var/log/cron/cron-scheduled.log 2>&1") | crontab
..
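The schedule in the Dockerfile line above is a standard five-field cron expression (minute, hour, day of month, month, day of week). As a minimal sketch, the helper below builds the same line for a different interval. The function name cron_entry and its parameters are hypothetical and are not part of this repository; only the command and log path are taken from the Dockerfile line.

```python
def cron_entry(every_minutes: int,
               command: str = "/opt/scheduler.sh",
               log_file: str = "/var/log/cron/cron-scheduled.log") -> str:
    """Return a crontab line that runs `command` every `every_minutes` minutes.

    Hypothetical helper for illustration; defaults mirror the Dockerfile line.
    """
    if not 1 <= every_minutes <= 59:
        raise ValueError("every_minutes must be between 1 and 59")
    # Field order: minute, hour, day of month, month, day of week
    schedule = f"*/{every_minutes} * * * *"
    # Append stdout and stderr to the log file, as the Dockerfile line does
    return f"{schedule} {command} >> {log_file} 2>&1"


if __name__ == "__main__":
    print(cron_entry(15))
```

Note that a step such as */25 restarts at the top of every hour, so intervals that do not divide 60 evenly produce an irregular gap at the hour boundary.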
To send the results to a webhook, modify the app/notifier.py file with your own logic.
logic = LogicController(notification=WebhookNotification("http://your-webhook-url.com/api/v1/scraper", only_finished=True))
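The internals of WebhookNotification are not shown here, so the following is only a standalone sketch of what posting a result to a webhook could look like using the Python standard library. The function names build_webhook_request and send_to_webhook and the payload shape are assumptions for illustration, not the repository's actual notifier code.

```python
import json
import urllib.request


def build_webhook_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the given webhook URL (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_to_webhook(url: str, payload: dict, timeout: float = 10.0) -> int:
    """Send the payload and return the HTTP status code of the response."""
    request = build_webhook_request(url, payload)
    # urlopen raises urllib.error.URLError / HTTPError on failure
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder URL from the example above; replace it with your own endpoint
    req = build_webhook_request(
        "http://your-webhook-url.com/api/v1/scraper",
        {"status": "finished", "items": []},
    )
    print(req.get_method(), req.full_url)
```

Splitting request construction from sending keeps the payload logic easy to check without a network connection.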
You can find the log file at /var/log/cron/cron-scheduled.log.
Run the run_tor.sh file to use the Tor Proxy.
sh run_tor.sh
After that, you can use the Tor proxy in your scraper logic. An example is included.
### If you want to use tor proxy, uncomment the following lines
ff_options.set_preference('network.proxy.type', 1)
ff_options.set_preference('network.proxy.socks', "localhost")
ff_options.set_preference('network.proxy.socks_port', 9050)
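The three preferences above can also be wrapped in a small helper, sketched below. The name apply_tor_proxy is hypothetical; the function accepts any object with a set_preference(name, value) method, which is assumed here to be a Selenium FirefoxOptions instance like ff_options. The extra network.proxy.socks_remote_dns preference is not in the original snippet; it is added on the assumption that DNS lookups should also go through the proxy.

```python
def apply_tor_proxy(options, host: str = "localhost", port: int = 9050):
    """Point a Firefox options object at a SOCKS proxy (hypothetical helper).

    `options` is assumed to expose set_preference(name, value), as
    Selenium's FirefoxOptions does.
    """
    # 1 selects manual proxy configuration in Firefox
    options.set_preference("network.proxy.type", 1)
    options.set_preference("network.proxy.socks", host)
    options.set_preference("network.proxy.socks_port", port)
    # Assumption, not in the original snippet: resolve DNS through the proxy too
    options.set_preference("network.proxy.socks_remote_dns", True)
    return options


if __name__ == "__main__":
    class RecordingOptions:
        """Stand-in for FirefoxOptions so the sketch runs without Selenium."""

        def __init__(self):
            self.prefs = {}

        def set_preference(self, name, value):
            self.prefs[name] = value

    print(apply_tor_proxy(RecordingOptions()).prefs)
```

In the scraper itself, the call would presumably be apply_tor_proxy(ff_options) before the driver is created.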
If you want to use a VPN for your scraper, you can use the ProtonVPN proxy. You can create a free account on ProtonVPN and use the free servers.
Copy your ProtonVPN credentials from your account and run the run_protonvpn.sh file.
sh run_protonvpn.sh
# When you are asked for the username and password, enter your ProtonVPN credentials
# default country code: nl
If you want local network traffic to bypass the VPN, uncomment the lines in the change_vpn.sh file.
# ignore local traffic, for example for Kubernetes
# sudo bash -c "echo 'route 10.233.0.0 255.255.0.0 net_gateway' >> /etc/openvpn/client.conf"
# and add dns resolver if needed
# dhcp-option DNS 10.233.0.3
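The OpenVPN route directive above takes a network address and a dotted netmask, while local networks are often written in CIDR notation. As a minimal sketch, the hypothetical helper below (not part of this repository) converts a CIDR range into the matching route line, using 10.233.0.0/16 from the example.

```python
import ipaddress


def openvpn_bypass_route(cidr: str) -> str:
    """Return an OpenVPN route line that sends `cidr` via the normal gateway.

    Hypothetical helper for illustration; IPv4 only.
    """
    # strict=True rejects input with host bits set, such as 10.233.0.5/16
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4:
        raise ValueError("only IPv4 networks are supported in this sketch")
    # net_gateway tells OpenVPN to use the pre-VPN default gateway
    return f"route {network.network_address} {network.netmask} net_gateway"


if __name__ == "__main__":
    print(openvpn_bypass_route("10.233.0.0/16"))
```

The returned line can be appended to /etc/openvpn/client.conf in the same way as the commented command in change_vpn.sh.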
Have fun with your scraper!