Solidago for ICML24 #1879

Merged
merged 136 commits into from
Mar 28, 2024

lenhoanglnh
Contributor

@lenhoanglnh lenhoanglnh commented Jan 13, 2024

related issues #1781 (also contributes to solidago)


Description

The goal of this PR is to make solidago run the full pipeline, for the ICML24 submission.

Additionally, I would like to include a synthetic data generation process, and an evaluation of the pipeline on synthetic data.
I am also considering generating attack data, to test the resilience of the pipeline.

Checklist

  • I added the related issue(s) id in the related issues section (if any)
    • if not, delete the related issues section
  • I described my changes and my decisions in the PR description
  • I read the development guidelines of the CONTRIBUTING.md
  • The tests pass and have been updated if relevant
  • The code quality checks pass

lenhoanglnh and others added 2 commits January 13, 2024 17:22
@lenhoanglnh
Contributor Author

Pretrusts and vouches do not seem to be exported in the public data.
Could we add them?

I would like to have the vouches as a CSV with the columns "voucher_id", "vouchee_id" and "vouch".
For now, "vouch" would simply equal 1 for all vouches.
(This is not essential, but I wanted to leave open the possibility of non-binary vouches.)

The pretrusts could simply be a list of user_id.
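
For illustration, here is a minimal sketch of what such an export could look like, using only the standard library. Only the three column names come from the proposal above; the user ids, the helper names `write_vouches` and `read_vouches`, and the in-memory buffer are assumptions made for the example.

```python
import csv
import io

# Column names from the proposal; everything else here is illustrative.
VOUCH_COLUMNS = ["voucher_id", "vouchee_id", "vouch"]


def write_vouches(rows, stream):
    """Write vouch rows (dicts keyed by VOUCH_COLUMNS) as CSV with a header."""
    writer = csv.DictWriter(stream, fieldnames=VOUCH_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_vouches(stream):
    """Read the CSV back, parsing ids as int and the vouch weight as float."""
    return [
        {
            "voucher_id": int(row["voucher_id"]),
            "vouchee_id": int(row["vouchee_id"]),
            "vouch": float(row["vouch"]),
        }
        for row in csv.DictReader(stream)
    ]


# Made-up vouches: "vouch" is 1 everywhere for now, but a float column
# leaves room for non-binary vouches later.
rows = [
    {"voucher_id": 1, "vouchee_id": 2, "vouch": 1.0},
    {"voucher_id": 1, "vouchee_id": 3, "vouch": 1.0},
    {"voucher_id": 2, "vouchee_id": 3, "vouch": 1.0},
]

buffer = io.StringIO()  # a real export would write to a file instead
write_vouches(rows, buffer)
buffer.seek(0)
loaded = read_vouches(buffer)

# Pretrusts as a plain list of user ids (made-up value).
pretrusted_user_ids = [1]
```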

@amatissart
Member

I would like to have the vouches as a CSV with the columns "voucher_id", "vouchee_id" and "vouch". For now, "vouch" would simply equal 1 for all vouches. (This is not essential, but I wanted to leave open the possibility of non-binary vouches.)

The pretrusts could simply be a list of user_id.

It sounds reasonable to add a "vouchers.csv" file, as we already describe vouching as public information on the website.

I am not so sure about "pretrusts". We could add a column "is_pretrusted" in the file "users.csv". But is the fact that a user registered with a trusted address supposed to be public? (Although that can easily be guessed from the current "trust_score" values.)

amatissart and others added 6 commits January 13, 2024 21:58
I also included the public export to /solidago/tests/data to easily test implementations.

I re-implemented lipschitrust, which is now linear in the number of vouches,
as opposed to quadratic in the number of users.
However, I do not get the same trust scores
(I assumed that a user with a trust score >= 0.8 was pretrusted).
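
To illustrate why iterating over the vouch list gives a cost linear in the number of vouches, here is a simplified propagation sketch. It is not the LipschiTrust implementation from this PR: the update rule, the function name `propagate_trust`, and the `decay`, `sink_vouch` and `n_iterations` values are all assumptions chosen for the example.

```python
def propagate_trust(pretrusted, vouches, n_users, decay=0.8, sink_vouch=5.0, n_iterations=20):
    """Simplified trust propagation over an edge list of vouches.

    vouches is a list of (voucher_id, vouchee_id, weight) tuples.
    Each iteration makes a single pass over the vouches, so its cost is
    O(number of vouches) rather than O(number of users squared).
    """
    # Total outgoing weight per voucher, plus a constant "sink" term so that
    # a user giving a single vouch does not pass on all of their trust.
    out_weight = [sink_vouch] * n_users
    for voucher, _vouchee, weight in vouches:
        out_weight[voucher] += weight

    pretrust = [1.0 if user in pretrusted else 0.0 for user in range(n_users)]
    trust = list(pretrust)
    for _ in range(n_iterations):
        new_trust = list(pretrust)
        for voucher, vouchee, weight in vouches:  # one pass over the edges
            new_trust[vouchee] += decay * trust[voucher] * weight / out_weight[voucher]
        trust = [min(1.0, value) for value in new_trust]  # cap trust at 1
    return trust


# Made-up example: user 0 is pretrusted, 0 vouches for 1, 1 vouches for 2,
# and user 3 receives no vouch.
trust = propagate_trust(
    pretrusted={0},
    vouches=[(0, 1, 1.0), (1, 2, 1.0)],
    n_users=4,
)
```

A dense formulation would store a users-by-users vouch matrix and multiply by it at every iteration, which is where the quadratic cost comes from.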
Comment on lines 1 to 8
from solidago.pipeline import PipelineParameters

from tournesol.utils.constants import COMPARISON_MAX, MEHESTAN_MAX_SCALED_SCORE


class MehestanParameters(PipelineParameters):
    r_max = COMPARISON_MAX
    max_squashed_score = MEHESTAN_MAX_SCALED_SCORE
Member

(1) As this extends the dataclass PipelineParameters, I think it should include type annotations.

    r_max: float = COMPARISON_MAX
    max_squashed_score: float = MEHESTAN_MAX_SCALED_SCORE

(2) Shouldn't we define this type in Solidago and only use it in the Tournesol backend?
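
To make point (1) concrete: in a dataclass, an attribute without a type annotation is a plain class attribute, not a field, so it is not accepted by the generated `__init__`. The sketch below assumes the subclass is itself decorated with `@dataclass`; `BaseParameters`, `alpha` and the value of `COMPARISON_MAX` are placeholders, not the actual Solidago or Tournesol definitions.

```python
from dataclasses import dataclass, fields

COMPARISON_MAX = 10.0  # placeholder value for illustration


@dataclass
class BaseParameters:
    alpha: float = 0.02  # hypothetical base field


@dataclass
class WithoutAnnotation(BaseParameters):
    r_max = COMPARISON_MAX  # no annotation: plain class attribute, not a field


@dataclass
class WithAnnotation(BaseParameters):
    r_max: float = COMPARISON_MAX  # annotated: becomes a dataclass field


field_names_without = [f.name for f in fields(WithoutAnnotation)]
field_names_with = [f.name for f in fields(WithAnnotation)]

# The annotated version can be overridden per instance through __init__.
custom = WithAnnotation(r_max=5.0)
```

With these definitions, `WithoutAnnotation(r_max=5.0)` raises a `TypeError`, because `r_max` is not part of the generated constructor.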


import pandas as pd

class PipelineOutput(ABC):
Member

This PipelineOutput interface only makes sense if the object is passed to the pipeline and gets called along the way. I see that this is how it is used inside Tournesol, which does not yet use the pipeline as defined by Solidago. Do I understand this correctly?

If yes, do we want to adapt solidago's Pipeline to use solidago's PipelineOutput?

Member

Pipeline.__call__() now accepts an optional output to customize the output and write scores during the pipeline execution:

def __call__(
    self,
    users: pd.DataFrame,
    vouches: pd.DataFrame,
    entities: pd.DataFrame,
    privacy: PrivacySettings,
    judgments: Judgments,
    init_user_models: Optional[dict[int, ScoringModel]] = None,
    output: Optional[PipelineOutput] = None,
) -> tuple[pd.DataFrame, VotingRights, dict[int, ScoringModel], ScoringModel]:
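
As a rough sketch of this pattern (an optional output object that the pipeline calls as it goes), under the assumption of a single hypothetical hook: the method name `save_step`, the class `InMemoryOutput` and the two-step `run_toy_pipeline` are invented for the example and do not reflect the actual PipelineOutput methods, which are not shown in this thread.

```python
from abc import ABC, abstractmethod
from typing import Optional


class PipelineOutput(ABC):
    """Receives intermediate results while the pipeline runs."""

    @abstractmethod
    def save_step(self, step_name: str, result: object) -> None:
        """Hypothetical hook, called once per pipeline step."""


class InMemoryOutput(PipelineOutput):
    """Toy implementation that keeps every step result in a dict."""

    def __init__(self) -> None:
        self.saved: dict[str, object] = {}

    def save_step(self, step_name: str, result: object) -> None:
        self.saved[step_name] = result


def run_toy_pipeline(
    raw_scores: list[float],
    output: Optional[PipelineOutput] = None,
) -> float:
    """Two made-up steps; each one reports to `output` when it is provided."""
    scaled = [2.0 * score for score in raw_scores]
    if output is not None:
        output.save_step("scaled", scaled)

    aggregated = sum(scaled)
    if output is not None:
        output.save_step("aggregated", aggregated)
    return aggregated


output = InMemoryOutput()
result = run_toy_pipeline([1.0, 2.0, 3.0], output=output)
result_without_output = run_toy_pipeline([1.0, 2.0, 3.0])
```

Keeping `output` optional means callers that only need the return values are unaffected, while a backend can pass an implementation that writes scores to its database during the run.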

Member

@lenhoanglnh On this file I created a flame graph to investigate how we can improve the compute time of the pipeline. Maybe this is not the right place?

Install flameprof

$ source venv/bin/activate
$ pip install flameprof

Create a profiling stats file, then use flameprof to turn it into an SVG for visualization:

$ python -m cProfile -o stats.prof experiments/engagement.py
$ python -m flameprof stats.prof > output.svg

(Open the SVG in a new tab; it provides a hover overlay that helps readability.)
[Image: flame graph (output)]

Analyzing the graph:

  • Nothing takes >80% of the time, so we don't have an obvious optimization issue (although this could change if we run the profiling with a larger dataset, which would better expose cubic or quadratic complexity in some parts of the pipeline).
  • The multiple scaling steps take about 50% of the time. I don't really understand what all the substeps are, but once again, no single substep takes the majority of the time. Once we have confirmed what the profile looks like on a bigger dataset, that's where we should spend time investigating performance improvements.
  • The second most time-consuming part is the computation of individual scores, which spends 90% of its time running brentq. So we could look closely at how brentq is implemented and whether it can be sped up.
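
As a complement to the flame graph, the same kind of profile can be inspected as text with the standard library's `pstats`. The sketch below profiles a made-up workload (`slow_sum` and `workload` are placeholders, not pipeline code) and prints the functions with the highest cumulative time.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Placeholder for an expensive pipeline step."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    """Placeholder for the script being profiled."""
    return [slow_sum(20_000) for _ in range(20)]


# Collect profiling data for the workload only.
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and keep the top entries in a text report.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
report_text = report.getvalue()
print(report_text)
```

To inspect the file produced by the `cProfile` command above instead, `pstats.Stats("stats.prof")` loads it directly.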

@amatissart
Member

amatissart commented Mar 14, 2024

I merged #1915 into this branch, to separate more clearly the previous implementation from the new pipeline modules, deduplicate algorithms where possible, and apply various optimizations.

Solidago version is now v0.1.0:

  • the new modular pipeline is available as solidago.pipeline.Pipeline
  • torch can be installed as an optional dependency with solidago[torch] in order to use LBFGSGeneralizedBradleyTerry based on torch.optim.LBFGS (not used in Tournesol).
  • Tournesol does not use the new pipeline yet. Some details described in TODO and FIXME need attention, more performance testing will be necessary, and some models will need to be updated in the database.
  • The code specific to the current Tournesol pipeline has been moved to solidago.pipeline.legacy2023.
  • Older implementations of some algorithms, that were reimplemented in the new pipeline, have been removed:
    • solidago.resilient_primitives ➡️ solidago.primitives (with asymmetric uncertainties)
    • vouch.trust_algo is now based on solidago.trust_propagation.LipschiTrust
    • vouch.voting_rights ➡️ solidago.voting_rights.compute_voting_rights based on solidago.voting_rights.AffineOvertrust

Anything else to check before merging?

@lenhoanglnh A discussion is still open about how to apply the post_process function in "scoring_model".

@lenhoanglnh
Contributor Author

I think it is ready to merge, right?

@amatissart amatissart merged commit 234c326 into main Mar 28, 2024
9 checks passed
@amatissart amatissart deleted the icml24 branch March 28, 2024 17:43
@GresilleSiffle GresilleSiffle added Backend Back-end code of Tournesol Solidago Tournesol algorithms library labels Mar 28, 2024