Solidago for ICML24 #1879
Conversation
…L24 submission. Additionally, I would like to include a synthetic data generation process and an evaluation of the pipeline on synthetic data. I am also considering generating attack data to test the resilience of the pipeline.
The pretrusts and the vouches do not seem to be exported in the public data. I would like to have the vouches as a CSV with the columns "voucher_id", "vouchee_id" and "vouch". The pretrusts could be just a list of user_id.
Adding a "vouchers.csv" file sounds reasonable, as we already describe vouching as public information on the website. I am not so sure about "pretrusts". We could add a column "is_pretrusted" in the file "users.csv", but is the fact that a user registered with a trusted address supposed to be public? (Although that can be guessed easily from the current "trust_score" values.)
I also included the public export in /solidago/tests/data to easily test implementations. I re-implemented LipschiTrust, which is now linear in the number of vouches, as opposed to quadratic in the number of users. However, I do not get the same trust scores (I assumed that if a user has a trust score >= 0.8, then they were pretrusted).
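For context, here is a rough sketch of how a vouch-based trust propagation can be made linear in the number of vouches, using the column names proposed above ("user_id"/"is_pretrusted" and "voucher_id"/"vouchee_id"/"vouch"). This is only an illustration with a made-up decay factor and update rule, not the actual LipschiTrust implementation:

```python
import pandas as pd
from collections import defaultdict

def propagate_trust(users: pd.DataFrame, vouches: pd.DataFrame,
                    decay: float = 0.8, n_iter: int = 10) -> dict[int, float]:
    """Toy trust propagation over the vouch graph.

    Each iteration scans every vouch edge exactly once, so the cost is
    O(n_iter * |vouches|) rather than O(|users|^2).
    """
    # Pretrusted users start with full trust (toy assumption).
    pretrust = {u: 1.0 for u, p in zip(users["user_id"], users["is_pretrusted"]) if p}
    trust = dict(pretrust)
    edges = list(zip(vouches["voucher_id"], vouches["vouchee_id"], vouches["vouch"]))
    for _ in range(n_iter):
        received = defaultdict(float)
        for voucher, vouchee, weight in edges:
            received[vouchee] += decay * weight * trust.get(voucher, 0.0)
        # Trust never drops below the pretrust level and is capped at 1.
        trust = {u: min(1.0, pretrust.get(u, 0.0) + received[u])
                 for u in set(pretrust) | set(received)}
    return trust
```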
WIP Constructing a generative model pipeline
Also cleaned LipschiTrust code
…due to always having a<b
There is still more room for acceleration for Mehestan
… active user comparisons.
backend/ml/mehestan/parameters.py
Outdated
from solidago.pipeline import PipelineParameters

from tournesol.utils.constants import COMPARISON_MAX, MEHESTAN_MAX_SCALED_SCORE


class MehestanParameters(PipelineParameters):
    r_max = COMPARISON_MAX
    max_squashed_score = MEHESTAN_MAX_SCALED_SCORE
(1) As this is extending the dataclass PipelineParameters, I think it should include the type annotations:
r_max: float = COMPARISON_MAX
max_squashed_score: float = MEHESTAN_MAX_SCALED_SCORE
(2) Shouldn't we define this type in Solidago and only use it in the Tournesol backend?
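To illustrate the first point, here is a minimal standalone example with made-up default values (not the actual Tournesol constants): in a dataclass subclass, an assignment without a type annotation creates a plain class attribute rather than a dataclass field.

```python
from dataclasses import dataclass, fields

@dataclass
class PipelineParameters:
    r_max: float = 10.0  # made-up default for the example

@dataclass
class MehestanParameters(PipelineParameters):
    # With the annotation, this overrides the inherited field's default;
    # without it, the line would only set a class attribute and would not
    # be picked up by fields() or the generated __init__.
    r_max: float = 20.0

print([(f.name, f.default) for f in fields(MehestanParameters)])
# [('r_max', 20.0)]
```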
import pandas as pd

class PipelineOutput(ABC):
This PipelineOutput interface only makes sense if this object is passed to the pipeline and gets called along the way. I see that this is how it is used inside Tournesol, which does not yet use the pipeline as defined by Solidago. Do I understand this correctly?
If yes, do we want to adapt solidago's Pipeline to use solidago's PipelineOutput?
Pipeline.__call__() now accepts an optional output to customize the output and write scores during the pipeline execution:
tournesol/solidago/src/solidago/pipeline/pipeline.py
Lines 122 to 131 in 631257d
def __call__(
    self,
    users: pd.DataFrame,
    vouches: pd.DataFrame,
    entities: pd.DataFrame,
    privacy: PrivacySettings,
    judgments: Judgments,
    init_user_models: Optional[dict[int, ScoringModel]] = None,
    output: Optional[PipelineOutput] = None,
) -> tuple[pd.DataFrame, VotingRights, dict[int, ScoringModel], ScoringModel]:
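Based on that signature, a call with a custom output could look roughly like this. This is only a sketch: it assumes a default-constructed Pipeline, all input variables are placeholders prepared by the caller, and the names of the returned values are guesses from the annotated return type:

```python
pipeline = Pipeline()  # assumption: default construction; actual setup may differ

users_df, voting_rights, user_models, global_model = pipeline(
    users=users,
    vouches=vouches,
    entities=entities,
    privacy=privacy,
    judgments=judgments,
    output=my_output,  # optional PipelineOutput; scores are written through it during the run
)
```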
solidago/experiments/engagement.py
Outdated
@lenhoanglnh On this file I created a flame graph to investigate how we can improve the compute time of the pipeline. Maybe it is not the right place?
Install flameprof
$ source venv/bin/activate
$ pip install flameprof
Create a profiling stats file, then use flameprof to turn it into a nice SVG for visualization
$ python -m cProfile -o stats.prof experiments/engagement.py
$ python -m flameprof stats.prof > output.svg
(Open the SVG in a new tab; it gives a hover overlay that helps readability.)
Analyzing the graph:
- Nothing takes >80% of the time, so we don't have an obvious optimization issue (although this could change if we run the profiling with a larger dataset to better show cubic or quadratic complexity in some parts of the pipeline)
- The multiple scaling steps take about 50%. I don't really understand what all the substeps are, but once again, no single substep takes the majority of the time. After confirming what the profile looks like on a bigger dataset, that's where we should spend time investigating performance improvements.
- The second most time-consuming part is the computation of individual scores, which spends 90% of its time running brentq. So we could look closely at how brentq is implemented and whether it can be sped up (see the sketch below).
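For reference, a minimal standalone example of what brentq does: scipy.optimize.brentq finds a root of a scalar function within a bracketing interval. The function below is a toy stand-in, not solidago's actual loss derivative; one cheap experiment would be loosening the tolerances and checking whether the resulting scores still match:

```python
from scipy.optimize import brentq

# Toy stand-in for a per-user loss derivative whose root gives the score.
def loss_derivative(theta: float) -> float:
    return theta**3 + theta - 1.0

# Default tolerances.
score = brentq(loss_derivative, -10.0, 10.0)

# Looser tolerance and fewer iterations: an accuracy/speed trade-off to validate.
fast_score = brentq(loss_derivative, -10.0, 10.0, xtol=1e-4, maxiter=50)
```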
Fixed doc typos in pipeline.
…on as in the ICML paper
…mentation to legacy2023 (#1915)
I merged #1915 into this branch, to more clearly separate the previous implementation from the new pipeline modules, deduplicate algorithms where possible, and apply various optimizations. Solidago version is now v0.1.0.
Anything else to check before merging? @lenhoanglnh A discussion is still open about how to apply the post_process function in "scoring_model".
I think it is ready to merge, right?
Related issues: #1781 (also contributes to solidago)
Description
The goal of this PR is to make solidago run the full pipeline for the ICML24 submission.
Additionally, I would like to include a synthetic data generation process and an evaluation of the pipeline on synthetic data.
I am also considering generating attack data to test the resilience of the pipeline.
Checklist