Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

openbsd updater needs adapting for model changes in sqlports ? #1349

Closed
landryb opened this issue Sep 9, 2023 · 21 comments
Closed

openbsd updater needs adapting for model changes in sqlports ? #1349

landryb opened this issue Sep 9, 2023 · 21 comments

Comments

@landryb
Copy link

landryb commented Sep 9, 2023

some recent changes made it to sqlports (cf https://github.com/openbsd/ports/commits/master/databases/sqlports), which might have broken the openbsd updater in repology according to @ajacoutot. @marcespie might know better, i'm only the messenger and havent looked at the actual changes nor potential breakage.

right now sqlports doesnt have a "stable api" but https://marc.info/?l=openbsd-ports&m=169419062712772&w=2 hints that there might be one someday.

@marcespie
Copy link

marcespie commented Sep 9, 2023

I've had a quick look at the parser, the DISTFILES and MASTER_SITES* fields you are using are indeed gone.

We've changed the layout in bsd.port.mk, so that we can have arbitrary DISTFILES.sufx variables, and SITES.sufx that match (also taken the opportunity to shorten the name). Accordingly, there's no longer a fixed list of MASTER_SITES names, thus it's gone from the main table.

I don't know what you use DISTFILES and MASTER_SITES for, but at a guess, you might want to use the "simpler" ROACH_URL and ROACH_SITES variables instead.

On complicated ports with multiple origins, these now should contain the main url that we fetch for the package being built, and the corresponding originating sites (ROACH_URL has all the "fun" parts of OpenBSD-style MASTER_SITES already parsed).

That should actually make the data you report more accurate, hopefully.

If you really need to, you can instead find the full data in subsidiary tables:
_sites contain all the sites for each package path, N is actually the suffix used.
_distfiles contains all the various files that get fetched (but the path{url}sufx transformation was not applied), where type is 0/1/2 depending whether it's a DISTFILES, PATCHFILES, or SUPDISTFILES.

@AMDmi3
Copy link
Member

AMDmi3 commented Sep 11, 2023

Thanks for the heads up. I'll try to update the parser this week, until then I have to disable openbsd repo.

AMDmi3 added a commit that referenced this issue Sep 11, 2023
@didickman
Copy link

I use repology for openbsd very regularly so losing it on repology isn’t ideal. Instead of taking down the entire thing, would it be possible to just show the main master site as a quick stop gap for now? I think that will show the same as before for about 93% of ports as most ports don’t need more than one master site. Example diff here: repology_diff

@AMDmi3
Copy link
Member

AMDmi3 commented Sep 19, 2023

I'm working on this. For now I came up with this query to return all site+distfile pairs for all ports, which can then be converted to real URLs. Does it look right? I'm particularly puzzled with _distfiles' SUFX vs N but it seems to work with ports using either.

SELECT
    _distfiles.FullPkgPath AS FullPkgPath,
    _sites.Value AS Sites,
    _fetchfiles.Value AS Files
FROM _distfiles
    JOIN _fetchfiles
        ON KeyRef=_Distfiles.Value
    JOIN _sites
        ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (coalesce(_distfiles.N, _distfiles.SUFX) is not distinct from _sites.N)

@didickman
Copy link

What does repology need exactly? The OpenBSD sqlports package includes a man page which has some light documentation. I'm not sure if you've seen it, but if not I've uploaded it here in case it is helpful: sqlports man page

The database has a table called Meta which might be useful to check for freshness and schema versioning.

sqlite> select * from Meta;
SchemaVersion  Hash                                          CreationDate
-------------  --------------------------------------------  ------------
7.48           e9fy133wnJkyctyUUC3P8MtkwdSShLQHLX+Plp6DZqo=  2023-09-17

Once that's known, the raw tables are the ones starting with underscores. So for example I see repology is using the raw tables today. Is that what you want to do?

Then there are some convenience views which merge the data in the raw tables. Those convenience views do not start with underscores.

The "Ports" view may be very useful to use. But since it's a view it's slow, so the results from the dynamic view are also stored as static data in "PortsQ" which breaks the pattern and is actually a table.

I'm not an expert on sqlports, but if you can share how you need the data for repology I can try to work it out with you. @marcespie is the real expert here. Maybe the PortsQ table is what Marc was suggesting for you to use when he mentioned using the ROACH_SITES and ROACH_URL variables? Here is an example:

sqlite> select fullpkgpath, roach_sites, ROACH_URL from portsq limit 1;
FullPkgPath        ROACH_SITES                                                  ROACH_URL
-----------------  -----------------------------------------------------------  --------------------
archivers/ancient  https://github.com/temisu/ancient/archive/refs/tags/v2.1.1/  ancient-2.1.1.tar.gz

Nevertheless if you can describe what repology needs it might be helpful for those of us on the OpenBSD side to try to help.

@marcespie
Copy link

marcespie commented Sep 19, 2023 via email

@AMDmi3
Copy link
Member

AMDmi3 commented Sep 19, 2023

What does repology need exactly?

https://repology.org/docs/requirements
TLDR: identifiers/names (portname, pkgname, origin), version, summary, maintainer, categories, licenses, cpe info, homepage and download urls. In the future, depends.

The OpenBSD sqlports package includes a man page which has some light documentation. I'm not sure if you've seen it, but if not I've uploaded it here in case it is helpful: sqlports man page

Thank you, but it contains only the basic info which is already clear from the database structure. It doesn't answer my question regarding SUFX vs N, neither does basic description you've given.

@AMDmi3
Copy link
Member

AMDmi3 commented Sep 19, 2023

N has to do with the distfiles table containing DISTFILES, SUPDISTFILES
and PATCHFILES.

As I understand, that is _distfiles.Type, not N.

Back to N and SUFX, if my guess is right, this is an example of distfiles-sites linkage via SUFX:

sqlite> select * from _distfiles where fullpkgpath=4448;
FullPkgPath|Value|N|SUFX|Type
4448|18545||.github|0
4448|18546||.github|0
sqlite> select * from _sites where fullpkgpath=4448;
FullPkgPath|N|Value
4448|.github|https://github.com/
4448|.gitlab|https://gitlab.com/
4448|.srht|https://git.sr.ht/

And this is and example of linkage via N:

sqlite> select * from _distfiles where fullpkgpath=13146;
FullPkgPath|Value|N|SUFX|Type
13146|131315|||0
13146|131316|0||0
13146|131317|1||0
13146|131318|2||0
sqlite> select * from _sites where fullpkgpath=13146;
FullPkgPath|N|Value
13146||https://github.com/stumpwm/stumpwm/archive/
13146|0|https://github.com/edicl/cl-ppcre/archive/
13146|1|https://beta.quicklisp.org/archive/clx/2022-07-07/
13146|2|https://beta.quicklisp.org/archive/alexandria/2022-07-07/

coalesce as in the query above should handle both cases, am I right?

As I mentioned MUCH EARLIER in the thread, unless you really want
the full list of all files which are grabbed by the port, in the
new scheme of things, THE MAIN DISTFILE for a given port will be
located at ROACH_URL, grabbable from ROACH_SITE.

I would have already implemented ROACH_* support much earlier if that was suitable, wouldn't I?
Yes, I need the full list of all files.

@marcespie
Copy link

marcespie commented Sep 19, 2023 via email

@marcespie
Copy link

As forecasted, I just finished removing MASTER_SITES0..9

So you don't need to worry about N, it's gone in sqlports 7.49

@didickman
Copy link

Fantastic! Thank you so much for the quick adaptation!

Following the final bit of work Marc just did, the tweak mentioned in the comment is now required.

diff --git a/repology/parsers/parsers/openbsd.py b/repology/parsers/parsers/openbsd.py
index 076ad7ac..5962b880 100644
--- a/repology/parsers/parsers/openbsd.py
+++ b/repology/parsers/parsers/openbsd.py
@@ -95,7 +95,6 @@ def _iter_sqlports(path: str) -> Iterator[Port]:
         yield Port(**row_dict, distfiles_cursor=distfiles_cursor)


-# TODO: drop _distfiles.N for sqlports >= 7.49
 _DISTFILES_QUERY = """
 SELECT
     _sites.Value AS sites,
@@ -104,7 +103,7 @@ FROM _distfiles
     JOIN _fetchfiles
         ON KeyRef=_Distfiles.Value
     LEFT JOIN _sites
-        ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (coalesce(_distfiles.N, _distfiles.SUFX) is not distinct from _sites.N)
+        ON _sites.FullPkgPath=_Distfiles.FullPkgPath AND (_distfiles.SUFX is not distinct from _sites.N)
 WHERE
     _distfiles.FullPkgPath = ?
 """

@marcespie
Copy link

Following-up (somewhat): in the more generic case of all package distributions, we all have different rules about splitting up things and how "unique" things get.

I think I might get some students (!!!!) looking at it this year.

For starters, the number of packages varies widely depending on distributions, but it doesn't mean the number of supported software is that different (depending on supported versions, available options, various splits in packages).

Speaking for OpenBSD, I know the whole pkgpath/fullpkgpath/flavors/subpackages enchilada can be mighty confusing when coming from outside.

I have zero idea what more info could be provided in views from sqlports to make this less confusing... nor do I know where other distros get their info.

AMDmi3 added a commit that referenced this issue Sep 26, 2023
@AMDmi3
Copy link
Member

AMDmi3 commented Sep 26, 2023

Fantastic! Thank you so much for the quick adaptation!

I won't count that as quick, but you're welcome)

Following the final bit of work Marc just did, the tweak mentioned in the comment is now required.

Yesterday sqlports was still at 7.48 so couldn't drop it right away.

@marcespie
Copy link

marcespie commented Sep 26, 2023 via email

@AMDmi3
Copy link
Member

AMDmi3 commented Sep 26, 2023

For starters, the number of packages varies widely depending on distributions, but it doesn't mean the number of supported software is that different (depending on supported versions, available options, various splits in packages).

Repology primarily counts projects, which is specifically "number of supported software", so apart from a few exceptions subpackages and splits does not count. For instance, despite libxml2 is packaged

CentOS 6:

  pname  | version 
---------+---------
 libxml2 | 2.7.6

ConanCenter:

  pname  | version 
---------+---------
 libxml2 | 2.10.3
 libxml2 | 2.10.4
 libxml2 | 2.11.3
 libxml2 | 2.11.4
 libxml2 | 2.9.10
 libxml2 | 2.9.12
 libxml2 | 2.9.13
 libxml2 | 2.9.14
 libxml2 | 2.9.9

Cygwin:

          pname          | version 
------------------------+---------
  libxml2                | 2.10.4
  libxml2-debuginfo      | 2.10.4
  libxml2-devel          | 2.10.4
  libxml2-doc            | 2.10.4
  python27-libxml2       | 2.10.4
  python36-libxml2       | 2.10.4
  python37-libxml2       | 2.10.4
  python38-libxml2       | 2.10.4
  python39-libxml2       | 2.10.4
  mingw64-i686-libxml2   | 2.9.4
  mingw64-x86_64-libxml2 | 2.9.4

it's counted once for each repo. Of course there are discrepancies, and you can estimate the error rate by percentage of "unique" projects (not matched with any other repository), which is usually below 15%, or by percentage of unique projects which have "related" ones (with matches by url, which mean they can potentialy be merged with something) which is usually below 10%.

Speaking for OpenBSD, I know the whole pkgpath/fullpkgpath/flavors/subpackages enchilada can be mighty confusing when coming from outside.

I don't have a problem with it. There's minor problem with a lot of flavors which Repology cannot deduplicate yet (thus showing a bunch of similar entries on e.g. package versions page), but that does not affect statistics and statuses and will be fixed on Repology side someday.

I have zero idea what more info could be provided in views from sqlports to make this less confusing... nor do I know where other distros get their info.

As said, I don't feel much need to. From what I can see, most unmatched projects in OpenBSD are truly unique or have ambiguous names and have unmatched counterparts in other repos, so there's no OpenBSD specific problems here. Still, it can be improved somewhat by submitting rules to repology or renaming packages in OpenBSD where that makes sense. For instance, I see some potentially misnamed python modules, like devel/py-test-expect may be better called devel/py-pytest-expect to match PyPI.

Depending on our snaps, you'll soon get 7.49. after all, it still takes a day to run a full bulk...

Yes, it's already there and I've deployed updated parser.

@marcespie
Copy link

Renaming packages is a bitch. We got a mechanism for that, but it is fairly expensive on the updater-side.

@AMDmi3
Copy link
Member

AMDmi3 commented Sep 26, 2023

Well, just suggesting, it's not fatal as it is.

@didickman
Copy link

didickman commented Sep 27, 2023

Looking at some false positives on OpenBSD, I noticed repology says security/ghidra is out of date.

However that port is marked BROKEN on all archs right now. So it is not built.

Therefore does the below change make sense? It filters out ports that are broken on ALL archs.

This would stop showing what I feel is a false positive for security/ghidra being vulnarable on OpenBSD. (in fact the security issues were noted as part of the reason it was disabled. see: marc.info

This diff purposely does not filter out ports that are only broken on SOME archs.

diff --git a/repology/parsers/parsers/openbsd.py b/repology/parsers/parsers/openbsd.py
index e88fcc33..6357d297 100644
--- a/repology/parsers/parsers/openbsd.py
+++ b/repology/parsers/parsers/openbsd.py
@@ -58,6 +58,8 @@ FROM _Ports
         ON Categories_ordered.FullPkgpath=_Ports.FullPkgpath
     JOIN _Email
         ON _Email.KeyRef=MAINTAINER
+WHERE
+    _Ports.FullPkgPath NOT IN (SELECT DISTINCT FullPkgPath FROM _Broken where Arch IS NULL)
 """

@didickman
Copy link

didickman commented Sep 27, 2023

Another possible false positive.

I can't figure out why repology says Python 3.10.13 in our tree has a vulnerability.

Blindly guessing I see that we have CVE-2023-40217 which states that some affected versions include "[3.10.0, 3.10.13)". To my knowledge a square bracket on the right would mean version <= 3.10.13, while a parenthesis means version < 3.10.13. i.e. strictly less than version 3.10.13.

Is it possible repology isn't marking the affected versions correctly? Or did I miss a vulnerability that's in Python 3.10.13 on my side?

@AMDmi3
Copy link
Member

AMDmi3 commented Sep 27, 2023

This would stop showing what I feel is a false positive for security/ghidra being vulnarable on OpenBSD

This makes sense, but I don't feel good about hiding information. Repology's target audience is not only maintainers but also upstream, and visibility of broken packages is especially important for these. Also there are maintainers from other repositories who may find even disabled package useful. Showing it as outdated is also a valid call for action. The point that its vulnerability does not in fact affect users because the package is not available is valid though, and it would be fair to at least exclude it from count of vulnerable packages. This requires a bit of development though (#1352).

I can't figure out why repology says Python 3.10.13 in our tree has a vulnerability

You can click an exclamation mark to get a page which highlights CVEs matched with a given version: https://repology.org/project/python/cves?version=3.10.13

@marcespie
Copy link

marcespie commented Sep 27, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants