get-artifacts.sh: Don't reject index.html
Recursive wget on directories creates index.html files that are not
present on the remote. Such files seem to be unavoidable: they result
from how recursive wget works against endpoints that send directory
listings. Each index.html contains the list of files in that subdir
that are to be retrieved.

Rejecting `index.html*` keeps such generated files out of the locally
downloaded content. However, if the remote target directory contains a
real index.html, it is rejected too. This commit changes the wget
`--reject` argument to `index.html?*`, so that index.html files are
kept and only files whose names continue past `index.html` by at least
one character are rejected.
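
The difference between the old and new patterns can be checked without running wget at all. The sketch below is only an approximation: it assumes wget's `--reject` matching behaves like ordinary shell globbing, and it uses the shell's own `case` matching as a stand-in. The `matches` helper and the sample name `index.html?C=N;O=D` (a sort link of the kind some directory listings emit) are illustrative and not part of get-artifacts.sh.

```shell
#!/bin/sh
# Hypothetical helper, not part of get-artifacts.sh.
# $1 = glob pattern, $2 = file name.
# Prints "rejected" if the name matches the pattern, otherwise "kept".
# Shell glob matching is used here as a stand-in for wget's --reject matching.
matches () {
    case "$2" in
        $1) echo "rejected" ;;   # unquoted $1 is interpreted as a glob
        *)  echo "kept" ;;
    esac
}

# Compare the old pattern (index.html*) with the new one (index.html?*).
for name in 'index.html' 'index.html?C=N;O=D' 'index.htm'; do
    printf '%-22s old=%s new=%s\n' "$name" \
        "$(matches 'index.html*' "$name")" \
        "$(matches 'index.html?*' "$name")"
done
```

Under that assumption, a plain `index.html` is rejected by the old pattern but kept by the new one, because `?` requires one more character after `index.html`. A name that continues past `index.html` is rejected by both.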

The downside is that each local subfolder will include an index.html
listing the contents of that specific subdir, even though the remote
does not have such files.

Signed-off-by: Henri Rosten <henri.rosten@unikie.com>
henrirosten committed Sep 10, 2024
1 parent eb8130e commit 7e2e63f
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion scripts/get-artifacts.sh
@@ -98,7 +98,7 @@ get_recursively () {
--level=inf \
--timestamping \
--execute robots=off \
- --reject 'index.html*' \
+ --reject 'index.html?*' \
--user-agent=Mozilla/5.0 \
--accept '*' \
--random-wait \