GitHub - spiderpig36/nmbe-crawler: Crawls websites and adds them to archive

Instructions

Step 1. Install the required Python packages.

pip3 install -r requirements.txt

Step 2. Run the website crawler on the intended URL, for instance https://insekten-evb.ch/

python3 crawl_urls.py https://insekten-evb.ch/
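The core of a crawl like this is collecting the links on each page that stay on the same site. The sketch below illustrates that one step using only the standard library. It is not taken from crawl_urls.py: the names `LinkCollector` and `same_site_links` are hypothetical, and the real script may use different libraries and rules.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url, html):
    """Return sorted, de-duplicated absolute URLs on the same host as page_url.

    Assumption for this sketch: "same site" means an identical host name,
    and fragments (#...) are dropped so one page is not listed twice.
    """
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL, then strip the fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            found.add(absolute)
    return sorted(found)
```

A full crawler would fetch each returned URL in turn and repeat the extraction until no new URLs appear.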

Step 3. The crawl takes a few minutes and saves every URL found on the website to a file called crawl_insekten-evb.ch.txt. Duplicate this file and name the copy progress_insekten-evb.ch.txt. Then run the archiver on the copy.

python3 archive.py progress_insekten-evb.ch.txt
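The duplication in step 3 has to happen before the archiver is run, and it can be scripted. The sketch below assumes the file names follow the pattern of the example above (crawl_<host>.txt and progress_<host>.txt); that pattern is inferred from the single example, and `prepare_progress_file` is a hypothetical helper, not part of the repository.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse


def prepare_progress_file(site_url, directory="."):
    """Copy crawl_<host>.txt to progress_<host>.txt and return the copy's path.

    Assumption: the file names are derived from the URL's host, as in
    crawl_insekten-evb.ch.txt for https://insekten-evb.ch/.
    """
    host = urlparse(site_url).netloc
    crawl_file = Path(directory) / f"crawl_{host}.txt"
    progress_file = Path(directory) / f"progress_{host}.txt"
    # Do not overwrite an existing progress file, so that a partially
    # completed archiving run is not reset by accident.
    if not progress_file.exists():
        shutil.copyfile(crawl_file, progress_file)
    return progress_file
```

Calling `prepare_progress_file("https://insekten-evb.ch/")` in the directory that holds the crawl output would produce the file passed to archive.py above.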

Step 4. Archiving takes a long time, because the archive needs a long time to take each snapshot of the website. The program therefore saves its progress in the progress file, so it can be stopped and resumed later if needed.
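How archive.py records its progress is not described here. One common way to make a long job resumable is to remove each URL from the file once it has been handled, so that a restart continues with whatever is left. The sketch below shows that idea under this assumption; `process_with_progress` and `handle_url` are hypothetical names.

```python
from pathlib import Path


def process_with_progress(progress_path, handle_url):
    """Handle each URL in the file, rewriting the file after every success.

    Assumption: the file holds one URL per line. If handle_url raises or the
    program is stopped, the unfinished URLs are still listed in the file.
    """
    path = Path(progress_path)
    lines = path.read_text(encoding="utf-8").splitlines()
    pending = [line.strip() for line in lines if line.strip()]
    while pending:
        handle_url(pending[0])  # e.g. request a snapshot of this URL
        pending = pending[1:]
        # Persist the remaining work before moving on to the next URL.
        path.write_text("".join(f"{url}\n" for url in pending), encoding="utf-8")
```

With this scheme an empty progress file means that every URL has been handled.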

Step 5. Go back to step 2 and repeat with another URL.

