feat: add parsers support #97

Merged
merged 46 commits into from
Nov 7, 2023

cmdoret
Member

@cmdoret cmdoret commented Oct 24, 2023

This PR adds a new "Parser" interface that can parse file contents into triples. It also migrates the existing license detection code into a parser.

This required renaming throughout the code: "source" is replaced with "git provider" to avoid confusion.

The PR does the following:

  • Move gimie.source -> gimie.extractors
  • Define Parser interface
  • Add LicenseParser
  • Simplify logic in Project and make it compatible with Parsers
  • Update all tests and docstrings
  • Update CLI to:
    • Allow the user to include / exclude specific parsers in gimie data
    • Add a command to list available parsers.

The important changes are in gimie/project.py, gimie/parsers and gimie/cli.py.

Implementing a new Parser is as simple as defining a subclass in gimie.parsers (and adding it to PARSERS):

from gimie.graph.namespaces import SDO
from gimie.io import Resource  # assumed location of the Resource class
from gimie.parsers import Parser
from rdflib import Graph, Literal, URIRef

class ExampleParser(Parser):
    """A dummy parser for demonstration.
    It will parse files named 'example' and generate a
    graph with a single triple:
    <URI> <schema:description> <file contents>
    """

    def __init__(self, uri: URIRef):
        super().__init__(uri)

    def can_parse(self, resource: Resource) -> bool:
        # Match based on filename (or content)
        return resource.name == "example"

    def _parse(self, resource: Resource) -> Graph:
        # Parsing logic: read the file and attach its contents to the URI
        contents = resource.open().read()
        graph = Graph()
        graph.add((self.uri, SDO.description, Literal(contents)))
        return graph
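
The contract above can also be tried out without installing gimie or rdflib. The following is a minimal stand-alone sketch, not gimie's actual code: the `Parser` base class, the `Resource` dataclass, the public `parse` wrapper and the string-based triples are all simplified stand-ins introduced here for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Tuple

# Stand-in for an RDF triple (subject, predicate, object); not an rdflib type.
Triple = Tuple[str, str, str]


@dataclass
class Resource:
    """Hypothetical stand-in for a file found in a repository."""

    name: str
    contents: str


class Parser(ABC):
    """Simplified stand-in for the Parser interface described in this PR."""

    def __init__(self, uri: str):
        self.uri = uri

    @abstractmethod
    def can_parse(self, resource: Resource) -> bool:
        """Tell whether this parser recognises the resource."""

    @abstractmethod
    def _parse(self, resource: Resource) -> List[Triple]:
        """Turn a recognised resource into triples."""

    def parse(self, resource: Resource) -> List[Triple]:
        # Assumption: resources the parser does not recognise yield nothing.
        if not self.can_parse(resource):
            return []
        return self._parse(resource)


class ExampleParser(Parser):
    def can_parse(self, resource: Resource) -> bool:
        return resource.name == "example"

    def _parse(self, resource: Resource) -> List[Triple]:
        return [(self.uri, "schema:description", resource.contents)]


parser = ExampleParser("https://example.org/repo")
matched = parser.parse(Resource("example", "hello"))
skipped = parser.parse(Resource("README.md", "ignored"))
```

Here `matched` holds one triple for the file named `example`, while `skipped` is empty because `can_parse` rejects the other file.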

Example usage of parsers on the command line:

# Use all parsers (default)
➜ gimie data <url>
# Only use license parser
➜ gimie data --include-parser license <url>
# Use all parsers except license
➜ gimie data --exclude-parser license <url>
# The include/exclude flags have short forms and can be specified multiple times
# Enable only cff and pyproject parsers (NOTE: these don't exist yet)
➜ gimie data -I cff -I pyproject <url> 

# List available parsers and their functionalities
➜ gimie parsers --verbose
license: Parse LICENSE file(s) into schema:license.
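
One way the include / exclude flags could be resolved into a set of parser names is sketched below. This is an assumption about the semantics, not the code in gimie/cli.py: the `select_parsers` function and the `PARSERS` dictionary are hypothetical, and the `cff` and `pyproject` entries are placeholders for parsers that do not exist yet.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical registry mapping parser names to descriptions.
PARSERS: Dict[str, str] = {
    "license": "Parse LICENSE file(s) into schema:license.",
    "cff": "Placeholder (this parser does not exist yet).",
    "pyproject": "Placeholder (this parser does not exist yet).",
}


def select_parsers(
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Set[str]:
    """Resolve repeated -I / --exclude-parser style options into parser names."""
    requested = set(include or []) | set(exclude or [])
    unknown = requested - set(PARSERS)
    if unknown:
        raise ValueError(f"Unknown parser(s): {sorted(unknown)}")
    # No include option means "all parsers" (the default behaviour).
    selected = set(include) if include else set(PARSERS)
    # Exclusions are applied last.
    return selected - set(exclude or [])


all_parsers = select_parsers()
only_license = select_parsers(include=["license"])
without_license = select_parsers(exclude=["license"])
```

Validating names up front means a typo in a flag fails loudly instead of silently running no parser.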

@cmdoret cmdoret changed the base branch from refactor/simpler-extractor to main October 24, 2023 11:09
@cmdoret cmdoret linked an issue Oct 24, 2023 that may be closed by this pull request
@cmdoret cmdoret marked this pull request as ready for review October 24, 2023 14:50
@cmdoret cmdoret self-assigned this Oct 24, 2023
@cmdoret cmdoret added the enhancement (New feature or request) and refactor (improving code without user-facing changes) labels Oct 24, 2023
@cmdoret cmdoret requested a review from vancauwe October 31, 2023 09:17
Contributor

@vancauwe vancauwe left a comment

A very nice change of architecture towards having extractors and parsers. Clear naming and flow.

My suggestions would be:

  • to decouple the parser from a file finder
  • decouple the graph making from the parsing (as we decoupled the extraction from the graph making)
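
The two decoupling suggestions can be illustrated with a small standard-library sketch. All names here (`find_files`, `parse_description`, `to_triples`, the `Property` alias) are hypothetical and chosen for illustration; they are not taken from gimie.

```python
from typing import Dict, Iterable, List, Set, Tuple

Property = Tuple[str, str]     # (predicate, object), independent of any subject
Triple = Tuple[str, str, str]  # (subject, predicate, object)


def find_files(names: Iterable[str], wanted: str) -> List[str]:
    """File finder: knows about file names, nothing about parsing."""
    return [name for name in names if name.lower() == wanted]


def parse_description(data: bytes) -> Set[Property]:
    """Parser: knows about file contents, nothing about files or graphs."""
    return {("schema:description", data.decode().strip())}


def to_triples(subject: str, properties: Set[Property]) -> Set[Triple]:
    """Graph making: attaches parsed properties to a subject."""
    return {(subject, pred, obj) for pred, obj in properties}


# Hypothetical repository contents, keyed by file name.
files: Dict[str, bytes] = {"example": b"hello\n", "README.md": b"# Title\n"}

matched = find_files(files, "example")
triples: Set[Triple] = set()
for name in matched:
    triples |= to_triples("https://example.org/repo", parse_description(files[name]))
```

Because each function has a single concern, the parser can be tested on raw bytes without a repository, and the graph conversion can be swapped without touching any parser.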

@cmdoret cmdoret requested a review from vancauwe November 3, 2023 18:04
@cmdoret
Member Author

cmdoret commented Nov 3, 2023

Here is a summary of new changes:

  • CLI: print parser name in bold green 🍀
  • More flexible PARSERS using NamedTuple
  • Helper functions to list and use parsers (thus decoupling the file matching from parsers)
  • Parsers now output Property values instead of a Graph, with a helper function in gimie.graph to make the conversion
  • Moved code related to parsers and extractors out of project.py into their respective modules
  • Added tests for gimie.parsers
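
A NamedTuple-based registry with listing helpers might look roughly like the sketch below. The field names (`default`, `type`), the helper names and the empty `LicenseParser` class are assumptions made for illustration, not a copy of gimie.parsers.

```python
from typing import Dict, List, NamedTuple, Type


class LicenseParser:
    """Parse LICENSE file(s) into schema:license."""


class ParserInfo(NamedTuple):
    """Hypothetical registry entry: metadata plus the parser class."""

    default: bool  # whether the parser runs when no flag is given
    type: Type     # the parser class itself


PARSERS: Dict[str, ParserInfo] = {
    "license": ParserInfo(default=True, type=LicenseParser),
}


def list_default_parsers() -> List[str]:
    """Names of parsers that are enabled by default."""
    return sorted(name for name, info in PARSERS.items() if info.default)


def describe_parsers() -> List[str]:
    """One 'name: description' line per parser, using the class docstring."""
    return [
        f"{name}: {(info.type.__doc__ or '').strip()}"
        for name, info in sorted(PARSERS.items())
    ]


defaults = list_default_parsers()
descriptions = describe_parsers()
```

Keeping metadata in a NamedTuple lets new per-parser fields be added later without changing how the registry is looked up by name.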

Contributor

@vancauwe vancauwe left a comment

Nothing more to say: I really like how you chose to implement my suggestions, and I find it makes the whole thing much more modular.
Really nice! ⭐

@cmdoret cmdoret merged commit dc6f149 into main Nov 7, 2023
5 checks passed
@cmdoret cmdoret deleted the feat/parsers branch November 10, 2023 21:27
Development

Successfully merging this pull request may close these issues.

Add Parser concept
2 participants