
Feature/harvesting metadata from a provided repository URL via cloning #473

Open
Aidajafarbigloo wants to merge 11 commits into softwarepub:develop from
Aidajafarbigloo:feature/harvesting-metadata-from-a-provided-repository-URL-via-cloning

Conversation

@Aidajafarbigloo

This feature branch introduces functionality for harvesting metadata from a provided repository URL via cloning.

Changes:

  1. Accept a repository URL as a parameter in the hermes harvest command
  2. Accept a token (for GitHub/GitLab) as a parameter in the hermes harvest command (for the plugin githublab)
  3. Clone the repository locally
  4. Harvest metadata from the cloned repository (CFF and CodeMeta)
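
The four steps above could be sketched roughly as follows. This is an illustrative outline, not the actual HERMES API; the helper names (`clone_repository`, `find_metadata_files`) and the shallow-clone choice are assumptions:

```python
# Illustrative sketch of the clone-then-harvest flow; names are hypothetical.
import subprocess
import tempfile
from pathlib import Path


def clone_repository(url: str, dest: Path) -> Path:
    """Shallow-clone the repository into dest and return the checkout path."""
    target = dest / "repo"
    subprocess.run(
        ["git", "clone", "--depth", "1", url, str(target)],
        check=True, capture_output=True, text=True,
    )
    return target


def find_metadata_files(repo: Path) -> list[Path]:
    """Collect the metadata sources the harvesters understand (CFF, CodeMeta)."""
    candidates = ["CITATION.cff", "codemeta.json"]
    return [repo / name for name in candidates if (repo / name).exists()]
```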

Added clone utility functions for repository cloning with error handling and cleanup.
Added new command-line arguments for URL and token in harvest command.
Added temporary directory handling for cloning repositories and updated token management.
Added an explicit logging shutdown step before clearing the HERMES caches.
Without shutting down logging first, the `clean` command fails on Windows with:
`An error occurred during execution of clean (Find details in './hermes.log')`
`Original exception was: [WinError 32] The process cannot access the file because it is being used by another process: '.hermes\\audit.log'`
@Aidajafarbigloo
Author

@sferenz
Could you please take a look at this pull request and share your feedback?


import argparse
import shutil
import logging
Member


@Aidajafarbigloo Is it necessary for your changes to include logging? If not, please remove it everywhere.

Author


@sferenz Thank you for the comments.
When using the `hermes clean` command on Windows, I get this error:
Run subcommand clean
Removing HERMES caches...
An error occurred during execution of clean (Find details in './hermes.log')

Error in the "hermes.log":
Original exception was: [WinError 32] The process cannot access the file because it is being used by another process: '.hermes\\audit.log'.

This happens because Windows does not allow deleting a file that is still open in the current process. The audit.log file inside .hermes is held open by a logging file handler, so when shutil.rmtree() attempts to remove the directory, it fails on the open file handle. I'm using logging.shutdown() to ensure all logging handlers are closed before the directory is deleted; it does not introduce new logging behavior.

Contributor


This was fixed before but somehow the fix got lost... Weird 🤔

#226

Member


Well, not only once. But it seems more important not to have a log file in the working directory than to have a properly cleaned .hermes cache.

You should be able to configure the path to the logfile, though.


# ---------------- utilities ----------------

def _normalize_clone_url(url: str) -> str:
Member


Please provide a general comment for each function.

Author


Docstrings were added.

@@ -0,0 +1,249 @@
# SPDX-FileCopyrightText: 2026 OFFIS e.V.
Member


Please put UOL here.

Author


Sure.

import toml


def _load_config(config_path: str) -> dict:
Member


Please ensure that every function has a comment

Author


Docstrings were added.

@Aidajafarbigloo
Author

@zyzzyxdonta @led02 Could you please review the PR and let me know if something needs to be changed?

Contributor

@zyzzyxdonta left a comment


I left some comments on the cloning procedure and the tests. I haven't looked at how it is integrated into the application yet.

Contributor


The tests are currently a standalone script but they should use the Python unittest library or Pytest just like all the other tests do. You rebuilt a lot of things that already exist in these libraries, e.g. output of messages, test fixtures, ...
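
For illustration, a standalone check like the script's could be restructured on the stdlib unittest framework roughly like this. `normalize_clone_url` is a stand-in for the PR's URL helper, not its real name:

```python
# Sketch of moving a standalone test script onto unittest (hypothetical names).
import unittest


def normalize_clone_url(url: str) -> str:
    """Stand-in for the PR's URL normalization helper."""
    url = url.strip()
    return url if url.endswith(".git") else url + ".git"


class NormalizeCloneUrlTest(unittest.TestCase):
    def test_appends_git_suffix(self):
        self.assertEqual(
            normalize_clone_url("https://github.com/softwarepub/hermes"),
            "https://github.com/softwarepub/hermes.git",
        )

    def test_keeps_existing_suffix(self):
        self.assertEqual(
            normalize_clone_url(" git@github.com:softwarepub/hermes.git "),
            "git@github.com:softwarepub/hermes.git",
        )
```

The framework then provides test discovery, result reporting, and fixtures (`setUp`/`tearDown`) instead of hand-rolled message output.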

"""

import subprocess
import pandas as pd
Contributor


Pandas is not part of the dependencies and I would prefer not to add it.

# JSON file that stores previous yes/no answers per URL to detect regressions across runs.
state_file = "hermes_bulk_test_state.json"

# Load previous state (if any).
Contributor


I don't think comparing against previous results makes sense here. Either the code does the right thing (tests succeed) or it doesn't (tests fail). There is no performance degradation over time.

# Step 2: Run harvest for the repository.
# Capture stdout/stderr to include HERMES output as an error message when needed.
proc = subprocess.run(
["hermes", "harvest", "--url", url, "--token", token],
Contributor


I don't think it's a good idea to run the whole program like this. Especially, we shouldn't actually clone any real repositories. Doing this can easily lead to our tests degrading, e.g. because repositories change over time or are deleted.

Here are some things that could be tested instead:

  • Git repo URLs are constructed correctly
  • git command lines (that are passed to subprocess.run) are constructed correctly

I'm not averse to the idea of having end-to-end tests for the whole metadata extraction workflow. But I think it should be based on a dedicated test repo that should ideally be part of the tests/ directory and not cloned. It should be safe to assume that a git clone works if the command parameters are correct.
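
One way to test the command-line construction without touching the network is to inject a fake runner and assert on the argv it receives. The helper names below (`build_clone_command`, `clone`) are illustrative, not part of the PR:

```python
# Sketch: verify the git argv instead of cloning a real repository.
from unittest import mock
import subprocess


def build_clone_command(url: str, dest: str) -> list[str]:
    # Constructing the argv in one place makes it directly testable.
    return ["git", "clone", "--depth", "1", url, dest]


def clone(url: str, dest: str, run=subprocess.run) -> None:
    # The runner is injectable so tests can pass a mock instead of
    # actually spawning git.
    run(build_clone_command(url, dest), check=True)


def test_clone_invokes_git_correctly():
    fake_run = mock.Mock()
    clone("https://example.org/repo.git", "/tmp/repo", run=fake_run)
    fake_run.assert_called_once_with(
        ["git", "clone", "--depth", "1",
         "https://example.org/repo.git", "/tmp/repo"],
        check=True,
    )
```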

Normalization rules:
- For SSH and HTTPS, append ".git" when missing (common, but not required by all hosts).
"""
s = str(url).strip()
Contributor


More expressive variable names would be nice. stripped_url? Or just overwrite the url parameter with the new value so the unstripped value isn't accidentally used again.

print(f"error: failed to remove temp dir {path!s} after {retries} attempts. "
f"Please remove it manually. (Often caused by antivirus or open handles.)")

def _move_or_copy(src: Path, dst: Path):
Contributor


Why make things this complicated? We don't need atomic moves here.

return

# Optimized clone succeeded
if dest_path.exists():
Contributor


There is no reason to run into this problem at all. Just create a new temporary directory for every clone attempt.

*,
root_only: bool = False,
include_files: Sequence[str] | None = None,
verbose: bool = False,
Contributor


Instead of this verbose flag and print() calls, the logging features should be used with appropriate log levels (info, warning, error, ...)

def rmtree_with_retries(path: Path, retries: int = 6, initial_wait: float = 0.1):
"""
Recursive directory deletion with retries and read-only handling, for environments where temporary directories may be locked
or marked read-only (e.g., Windows, CI systems, antivirus interference).
Contributor


I wouldn't call this "retries". git clone always writes the .git/ directory as read-only so the issue that something can not be removed will always occur. I think shutil.rmtree with the _clear_readonly handler should suffice 🤔

return
except Exception as e:
print(f"warn: rmtree attempt {attempt} failed for {path!s}: {e!r}")
time.sleep(wait)
Contributor


Something is seriously wrong if a sleep is required.
