This tool retrieves all of the GCP documentation, giving you a local copy that you can archive, search, and diff for security research. All credit and glory for this project goes to Jonathan Walker, who created the AWS docs version.
- Retrieves all sitemap.xml files
- Recursively retrieves all links within them
- Ignores all URLs included in the sitemaps that do not include `cloud.google.com`
- Ignores all non-HTTPS links
- Avoids most non-product documentation, such as docs for `java`, `ruby`, and `architecture` specific pages
- Supports both WARC and HTML output formats
- Saves all files under `gcp_warcs/` or `gcp_html/`, organized by date and URL path, e.g. `YYYY/MM/DD/cloud.google.com/docs/compute/index.warc`
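
The URL filtering rules in the list above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function name `should_fetch`, the `SKIPPED_SEGMENTS` set, and the choice to match `java`, `ruby`, and `architecture` as path segments are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical skip list. The README names java, ruby and architecture as
# examples; treating them as URL path segments is an assumption.
SKIPPED_SEGMENTS = {"java", "ruby", "architecture"}


def should_fetch(url: str) -> bool:
    """Return True if a sitemap URL passes the filters listed above."""
    parts = urlsplit(url)
    # Ignore all non-HTTPS links.
    if parts.scheme != "https":
        return False
    # Ignore URLs whose host does not include cloud.google.com.
    if "cloud.google.com" not in parts.netloc:
        return False
    # Skip pages whose path contains one of the non-product segments.
    segments = {s for s in parts.path.lower().split("/") if s}
    return not (segments & SKIPPED_SEGMENTS)
```

For example, `should_fetch("http://cloud.google.com/compute/docs")` is rejected because of the scheme, while the same URL over `https` is accepted.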
The following command retrieves all the documentation into `gcp_warcs/YYYY/MM/DD`.
gcpdocs --rate-limit --workers 15 -logfile=gcpdocs.log
ripgrep massively reduces the time it takes to search through all the files recursively. In one comparison, grep took 36.78s and ripgrep took 0.67s for the exact same search, so I strongly advise getting familiar with ripgrep to speed up your searches.
To search for a specific string and retrieve all GCP documentation URLs containing that string, you can combine ripgrep and xargs.
$ cd 2024/09/26/cloud.google.com
$ rg "gs://google-cloud-" . -l | xargs -I {} rg "Warc-Target-Uri" {} | awk '{print $2}' | sort | uniq
$ rg "gs://google-cloud-" .
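
If ripgrep is not available, the same lookup can be approximated in Python. The sketch below assumes the WARC files are stored uncompressed (as the ripgrep pipeline implies) and that each record carries a `Warc-Target-Uri` header line; the function name `urls_containing` is hypothetical and not part of gcpdocs.

```python
import re
from pathlib import Path

# Matches the target-URI header line of a WARC record, case-insensitively,
# and captures the URL that follows it.
TARGET_URI = re.compile(rb"^WARC-Target-URI:\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def urls_containing(root: str, needle: str) -> list[str]:
    """Return the sorted, de-duplicated target URIs of files containing needle."""
    needle_bytes = needle.encode()
    found = set()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Equivalent of `rg -l`: keep only files that contain the string.
        if needle_bytes not in data:
            continue
        # Equivalent of the second `rg` plus `awk '{print $2}'`.
        for match in TARGET_URI.finditer(data):
            found.add(match.group(1).decode("utf-8", "replace"))
    return sorted(found)
```

A call such as `urls_containing("2024/09/26/cloud.google.com", "gs://google-cloud-")` mirrors the first ripgrep pipeline above. It reads each file fully into memory, so it will be noticeably slower than ripgrep on a full archive.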
- Exclude non-English documentation pages
This project is a fork of awsdocs by Jonathan Walker.