
Feature/harvesting metadata from a provided repository URL via cloning #473

Open
Aidajafarbigloo wants to merge 11 commits into softwarepub:develop from
Aidajafarbigloo:feature/harvesting-metadata-from-a-provided-repository-URL-via-cloning

Conversation

@Aidajafarbigloo

This feature branch introduces functionality for harvesting metadata from a provided repository URL via cloning.

Changes:

  1. Accept a repository URL as a parameter in the hermes harvest command
  2. Accept a token (for GitHub/GitLab) as a parameter in the hermes harvest command (for the plugin githublab)
  3. Clone the repository locally
  4. Harvest metadata from the cloned repository (CFF and CodeMeta)
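
The four steps above could be sketched roughly as follows. This is an illustrative outline, not the actual HERMES API; the helper names (`clone_repository`, `find_metadata_files`) and the shallow-clone choice are assumptions:

```python
# Illustrative sketch of the clone-then-harvest flow; names are hypothetical.
import subprocess
import tempfile
from pathlib import Path


def clone_repository(url: str, dest: Path) -> Path:
    """Shallow-clone the repository into dest and return the checkout path."""
    target = dest / "repo"
    subprocess.run(
        ["git", "clone", "--depth", "1", url, str(target)],
        check=True, capture_output=True, text=True,
    )
    return target


def find_metadata_files(repo: Path) -> list[Path]:
    """Collect the metadata sources the harvesters understand (CFF, CodeMeta)."""
    candidates = ["CITATION.cff", "codemeta.json"]
    return [repo / name for name in candidates if (repo / name).exists()]
```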

Added clone utility functions for repository cloning with error handling and cleanup.
Added new command-line arguments for URL and token in harvest command.
Added temporary directory handling for cloning repositories and updated token management.
Added an explicit logging shutdown step before clearing the HERMES caches.
Without shutting down logging first, the `clean` command fails on Windows with:
`An error occurred during execution of clean (Find details in './hermes.log')`
`Original exception was: [WinError 32] The process cannot access the file because it is being used by another process: '.hermes\\audit.log'`
@Aidajafarbigloo
Author

@sferenz
Could you please take a look at this pull request and share your feedback?


import argparse
import shutil
import logging
Member


@Aidajafarbigloo Is it necessary for your changes to include logging? If not, please remove it everywhere.

Author


@sferenz Thank you for the comments.
When using the `hermes clean` command on Windows, I get this error:
Run subcommand clean
Removing HERMES caches...
An error occurred during execution of clean (Find details in './hermes.log')

Error in the "hermes.log":
Original exception was: [WinError 32] The process cannot access the file because it is being used by another process: '.hermes\\audit.log'.

This happens because Windows does not allow deleting a file that is still open in the current process. The audit.log file inside .hermes is held open by a logging file handler, so when shutil.rmtree() attempts to remove the directory, it fails on the open file handle. I'm using logging.shutdown() to ensure all logging handlers are closed before the directory is deleted; it does not introduce new logging behavior.

Contributor


This was fixed before but somehow the fix got lost... Weird 🤔

#226

Member


Well, not only once. But it seems more important not to have a log file in the working directory than to have a properly cleaned .hermes cache.

You should be able to configure the path to the logfile, though.


# ---------------- utilities ----------------

def _normalize_clone_url(url: str) -> str:
Member


Please provide a general comment for each function.

Author


Docstrings were added.

@@ -0,0 +1,249 @@
# SPDX-FileCopyrightText: 2026 OFFIS e.V.
Member


Please put UOL here.

Author


Sure.

import toml


def _load_config(config_path: str) -> dict:
Member


Please ensure that every function has a comment

Author


Docstrings were added.

@Aidajafarbigloo
Author

@zyzzyxdonta @led02 Could you please review the PR and let me know if something needs to be changed?

Contributor

@zyzzyxdonta left a comment


I left some comments on the cloning procedure and the tests. I haven't looked at how it is integrated into the application yet.

Contributor


The tests are currently a standalone script but they should use the Python unittest library or Pytest just like all the other tests do. You rebuilt a lot of things that already exist in these libraries, e.g. output of messages, test fixtures, ...
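
For illustration, a standalone check like the script's could be restructured on the stdlib unittest framework roughly like this. `normalize_clone_url` is a stand-in for the PR's URL helper, not its real name:

```python
# Sketch of moving a standalone test script onto unittest (hypothetical names).
import unittest


def normalize_clone_url(url: str) -> str:
    """Stand-in for the PR's URL normalization helper."""
    url = url.strip()
    return url if url.endswith(".git") else url + ".git"


class NormalizeCloneUrlTest(unittest.TestCase):
    def test_appends_git_suffix(self):
        self.assertEqual(
            normalize_clone_url("https://github.com/softwarepub/hermes"),
            "https://github.com/softwarepub/hermes.git",
        )

    def test_keeps_existing_suffix(self):
        self.assertEqual(
            normalize_clone_url(" git@github.com:softwarepub/hermes.git "),
            "git@github.com:softwarepub/hermes.git",
        )
```

The framework then provides test discovery, result reporting, and fixtures (`setUp`/`tearDown`) instead of hand-rolled message output.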

"""

import subprocess
import pandas as pd
Contributor


Pandas is not part of the dependencies and I would prefer not to add it.

# JSON file that stores previous yes/no answers per URL to detect regressions across runs.
state_file = "hermes_bulk_test_state.json"

# Load previous state (if any).
Contributor


I don't think comparing against previous results makes sense here. Either the code does the right thing (tests succeed) or it doesn't (tests fail). There is no performance degradation over time.

# Step 2: Run harvest for the repository.
# Capture stdout/stderr to include HERMES output as an error message when needed.
proc = subprocess.run(
["hermes", "harvest", "--url", url, "--token", token],
Contributor


I don't think it's a good idea to run the whole program like this. Especially, we shouldn't actually clone any real repositories. Doing this can easily lead to our tests degrading, e.g. because repositories change over time or are deleted.

Here are some things that could be tested instead:

  • Git repo URLs are constructed correctly
  • git command lines (that are passed to subprocess.run) are constructed correctly

I'm not averse to the idea of having end-to-end tests for the whole metadata extraction workflow. But I think it should be based on a dedicated test repo that should ideally be part of the tests/ directory and not cloned. It should be safe to assume that a git clone works if the command parameters are correct.
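
One way to test the command-line construction without touching the network is to inject a fake runner and assert on the argv it receives. The helper names below (`build_clone_command`, `clone`) are illustrative, not part of the PR:

```python
# Sketch: verify the git argv instead of cloning a real repository.
from unittest import mock
import subprocess


def build_clone_command(url: str, dest: str) -> list[str]:
    # Constructing the argv in one place makes it directly testable.
    return ["git", "clone", "--depth", "1", url, dest]


def clone(url: str, dest: str, run=subprocess.run) -> None:
    # The runner is injectable so tests can pass a mock instead of
    # actually spawning git.
    run(build_clone_command(url, dest), check=True)


def test_clone_invokes_git_correctly():
    fake_run = mock.Mock()
    clone("https://example.org/repo.git", "/tmp/repo", run=fake_run)
    fake_run.assert_called_once_with(
        ["git", "clone", "--depth", "1",
         "https://example.org/repo.git", "/tmp/repo"],
        check=True,
    )
```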

Normalization rules:
- For SSH and HTTPS, append ".git" when missing (common, but not required by all hosts).
"""
s = str(url).strip()
Contributor


More expressive variable names would be nice. stripped_url? Or just overwrite the url parameter with the new value so the unstripped value isn't accidentally used again.

print(f"error: failed to remove temp dir {path!s} after {retries} attempts. "
f"Please remove it manually. (Often caused by antivirus or open handles.)")

def _move_or_copy(src: Path, dst: Path):
Contributor


Why make things this complicated? We don't need atomic moves here.

return

# Optimized clone succeeded
if dest_path.exists():
Contributor


There is no reason to run into this problem at all. Just create a new temporary directory for every clone attempt.

*,
root_only: bool = False,
include_files: Sequence[str] | None = None,
verbose: bool = False,
Contributor


Instead of this verbose flag and print() calls, the logging features should be used with appropriate log levels (info, warning, error, ...)

def rmtree_with_retries(path: Path, retries: int = 6, initial_wait: float = 0.1):
"""
Recursive directory deletion with retries and read-only handling, for environments where temporary directories may be locked
or marked read-only (e.g., Windows, CI systems, antivirus interference).
Contributor


I wouldn't call this "retries". git clone always writes the .git/ directory as read-only so the issue that something can not be removed will always occur. I think shutil.rmtree with the _clear_readonly handler should suffice 🤔

return
except Exception as e:
print(f"warn: rmtree attempt {attempt} failed for {path!s}: {e!r}")
time.sleep(wait)
Contributor


Something is seriously wrong if a sleep is required.
