Localization of scikit-learn website content. #28547

Open
steppi opened this issue Feb 27, 2024 · 13 comments
Labels
Needs Decision Requires decision

Comments

@steppi

steppi commented Feb 27, 2024

Hi,

I work for Quansight Labs and am helping set things up for translation of main project website content for core projects in the Scientific Python ecosystem. This work is supported by the Scientific Python Community & Communications Infrastructure grant from CZI. I created a GitHub discussion for this here: #28105, but @glemaitre pointed out that an issue would be more appropriate.

A deliverable for this grant is to have the brochure websites of at least 8 of the 10 Scientific Python core projects translated into at least 3 commonly used languages. You may have seen the language drop-down selector at https://numpy.org/. The goal is for a cross-functional team of Quansight staff and volunteers to handle the bulk of the work, taking the burden off project maintainers.

The localization management platform Crowdin is offering a free supported enterprise organization for Scientific Python translations. This is the platform we used for numpy.org. At this moment you would not need to decide what if anything you would do with translated content. However, to get things started, I'd like to ask permission to fill out the Crowdin Open Source Project Setup Request form on your behalf, to allow me to add a project to the Scientific Python Crowdin organization for Scikit-Learn.

For those interested, here are the specifics of what I'd do with this.

  • Create a GitHub repository which mirrors the content from the Scikit-Learn brochure website.
  • Set up a cron GitHub Action which polls for updates to the Scikit-Learn website content and helps keep the mirror up to date.
  • Sync this repository to the Scikit-Learn Crowdin project. Translators can then translate the content, and Crowdin will automatically push commits to a PR against the repository mirroring the website content.
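
The polling step in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: the helper names `file_digests` and `plan_sync` are hypothetical, and the sketch assumes both the website source and the mirror are available as local checkouts that a scheduled job can compare.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 digest of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def plan_sync(source: dict[str, str], mirror: dict[str, str]) -> dict[str, list[str]]:
    """Compare digests and list what the mirror would need to copy or delete."""
    return {
        # New files, or files whose content differs from the mirror's copy.
        "copy": sorted(p for p, digest in source.items() if mirror.get(p) != digest),
        # Files that no longer exist in the source.
        "delete": sorted(p for p in mirror if p not in source),
    }
```

A scheduled job could call `file_digests` on each checkout and only commit to the mirror when `plan_sync` returns a non-empty plan, so that Crowdin sees new source strings only when the website content has actually changed.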

The end product will be a repository with parallel versions of the website content in a variety of languages which is kept up to date. If you want to host the translations on your webpage like NumPy does, myself and/or a colleague from Quansight would be supported by the grant to help make that happen.

Please let me know if you have any questions.

@github-actions github-actions bot added the Needs Triage Issue requires triage label Feb 27, 2024
@Charlie-XIAO
Contributor

This numpy wiki page and NEP 28 might provide some more context.

@adrinjalali
Member

@Charlie-XIAO what would be the workload / process like on our side to make this happen?

Note that this is how things have worked in the past:
https://scikit-learn.org/stable/related_projects.html#translations-of-scikit-learn-documentation

We haven't maintained them in any way, and they're completely independent.

@betatim
Member

betatim commented Feb 28, 2024

This sounds like an interesting idea and even nicer that it is supported financially.

One thing I don't understand from the original comment is which part of the website you plan to translate (API docs, examples, user guide, ..?) and how the translations would appear in the scikit-learn website (how would a user discover that translations exist)? What would the process be to keep things up to date across releases but also within a release?

@Charlie-XIAO
Contributor

@adrinjalali I also have just started to learn about these, @steppi would definitely know better. He mentioned in #28105 about this sphinx internationalization workflow, essentially this graph:

[Image: diagram of the Sphinx internationalization workflow]

My understanding is that sphinx-intl will help us generate the .pot files to be translated, then Crowdin would bring us to the translated .po files, and finally sphinx takes the .po files, combined with our original documentation, to generate the translated documentation.
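
For reference, the Sphinx side of this workflow is driven by a few settings in `conf.py`. The fragment below is a sketch, not scikit-learn's actual configuration: the `locale/` directory and the `pt` language code are placeholder assumptions. With settings like these, `sphinx-build -b gettext` produces the `.pot` templates, and a later build with the `language` option set picks up the translated catalogs.

```python
# conf.py (fragment): settings Sphinx reads when building translated docs.
# The directory name and language code below are illustrative assumptions.

# Directories, relative to conf.py, searched for
# <language>/LC_MESSAGES/ message catalogs.
locale_dirs = ["locale/"]

# Emit one .pot template per source document rather than grouping
# documents by their top-level directory.
gettext_compact = False

# Target language for this build; commonly overridden per build with
# `-D language=...` on the sphinx-build command line.
language = "pt"
```

Strings without a translation in the catalog are rendered in the source language, so a partially translated catalog still builds.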

One thing I don't understand from the original comment is which part of the website you plan to translate (API docs, examples, user guide, ..?)

In #28105 @steppi mentioned that "technical documentation like the User Guide, Examples, and the API reference are out-of-scope though, at least for now" while many other pages (that are not updated so frequently) are worth translating.

how the translations would appear in the scikit-learn website (how would a user discover that translations exist)?

Also in #28105 @steppi mentioned adding a language selector drop-down to Sphinx documentation. This should more or less look like the version switcher in pydata-sphinx-theme (and we can simply vendor its implementation to make a component).

What would the process be to keep things up to date across releases but also within a release?

Documentation of past releases is hardly touched (I think?) and there will be no problem as long as translations are also per-version.

But I'm also concerned about the second part because I'm not sure how the Crowdin part of the workflow works. If it cannot be fully automated (e.g. need people to review) then the update likely cannot be done at the granularity of commits. We would also need native speakers of the language (or at least people who are familiar with the language) to maintain that part.

Another major concern of mine is that, I think (guess), the most viewed part of the scikit-learn documentation is exactly its technical part (user guide, examples, API), and those are the largest parts. How much value is left if we leave all those behind, or how much effort would we need to put in to take those parts into account?

@steppi
Author

steppi commented Feb 28, 2024

Thanks for providing context @Charlie-XIAO. I knew I would miss something. Like you pointed out, the rationale for limiting the scope of translations to only core website content is laid out in NEP 28.

We start with an assertion: maintaining translations of all documentation, or even the whole user guide, as part of the NumPy project is not feasible. One simply has to look at the volume of our documentation and the frequency with which we change it to realize that that’s the case. Perhaps it will be feasible though to translate just the top-level pages of the website. Those do not change very often, and it will be a limited amount of content (order of magnitude 5-10 pages of text).

I think up-to-date translations of API documentation, tutorials, examples etc. would be incredibly valuable, but also very difficult, and it's outside the scope of the deliverables for the grant I'm working under.

To answer more questions:

One thing I don't understand from the original comment is which part of the website you plan to translate (API docs, examples, user guide, ..?) and how the translations would appear in the scikit-learn website (how would a user discover that translations exist)? What would the process be to keep things up to date across releases but also within a release?

For the question of how the translations would appear: At https://numpy.org, there is a dropdown in the top right corner for selecting the language. For other projects like scikit-learn, I think the decision on how translations appear can be deferred until later. What I'm trying to do now is get all of the infrastructure set up for translation management for the projects. The plan is that in around one month this project will be taken over by someone at Quansight with more relevant experience, and my job is to have all of the groundwork in place for whoever that ends up being when they get started.

@Charlie-XIAO has already helped answer the question about the scope of content to translate. Having a limited scope will simplify the process of keeping things up to date, since core website content shouldn't change as much. For NumPy, so far changes have been small when they are made, and we just let stuff remain out of date until the new translations come in.

But I'm also concerned about the second part because I'm not sure how the Crowdin part of the workflow works. If it cannot be fully automated (e.g. need people to review) then the update likely cannot be done at the granularity of commits. We would also need native speakers of the language (or at least people who are familiar with the language) to maintain that part.

The plan is to have a common community of volunteer and perhaps sometimes paid translators across different projects in the Scientific Python ecosystem. Maintainers from individual projects wouldn't need to find or work directly with translators. For the Crowdin workflow: Crowdin creates a branch in the synced repo and pushes an individual commit to it for each update made to the translations in Crowdin. The plan is that these commits would go to a repo that mirrors (and stays up to date with) the content from the Scikit-Learn website. The goal would be to maintain parallel versions of the content in this mirror repo in different languages. I think how these are used in Scikit-Learn's website can be settled later, but we hope to minimize the burden on maintainers of the core projects.

Another major concern of mine is that, I think (guess), the most viewed part of the scikit-learn documentation is exactly its technical part (user guide, examples, API), and those are the largest parts. How much value is left if we leave all those behind, or how much effort would we need to put in to take those parts into account?

For now, this grant-supported project only covers core project websites, but this could potentially be just the first step. Perhaps if a strong enough community of translators is organized, it may become more feasible to start maintaining translations of technical documentation. I agree that translated technical documentation would be more valuable, but if we're going to get there, small steps need to be taken first.

For now, what I'm asking for is permission to set up a repo for mirroring content from the Scikit-Learn website and to fill out the Crowdin Open Source Request form I linked to above on your behalf, so I can create a Crowdin project for Scikit-Learn within the Scientific Python Crowdin organization and sync it to this repo. Where things go from there can be settled later, I hope. Even in the case where you ultimately decide not to host the translations on your web page, I think it would still be valuable for them to be created and maintained.

@adrinjalali
Member

I think setting up the account on Crowdin makes sense then. The danger here is that we're gonna have translated docs, which people will read and then complain about, and we have no idea what those translations say. I wonder how much the translations from Crowdin are going to be better than an auto Google Translate button on our website.

WDYT @scikit-learn/core-devs

@betatim
Member

betatim commented Feb 29, 2024

I wonder how much the translations from Crowdin are going to be better than an auto Google Translate button on our website.

I think this is a good question to ask. For lolz I asked my browser (Firefox) to translate scikit-learn.org to German for me.

[Screenshot: scikit-learn.org translated to German by Firefox]

You can tell it is machine translated and as a result somewhat amusing to read if you are a native speaker of both languages.

I think it makes sense to ask some questions and discuss some of the details of how to get the translations back into the website before setting out on the translation effort. This is not just "community projects being slow and slowing things down with endless discussions". For me it is a counter balance to my first impression of this effort which is "someone somewhere applied for some grant to have someone else do something for someone on the internet", which makes me worry that we are driving full throttle in some random direction.

Having a first/rough idea of how the changes are going to (continuously) get integrated back into https://github.com/scikit-learn/scikit-learn.github.io/tree/main and what structure we will have there and how much knock on work that will cause seems reasonable before using up people's time translating things.

Creating high quality translations of highly refined wording (as found on landing pages and such) is seriously hard work. In my experience it requires someone who is a native speaker of the target language and nearly a native speaker in the source language. Otherwise you end up with poor/comical translations of idioms and phrases.

Especially given that the zero effort alternative of "let people use AI to translate it to their language" is not all that terrible (it creates comical translations for phrases/idioms but that is kind of expected)


I'm not against creating an account, but I think it is worth having a bit more of a coherent plan instead of salami slicing the asks so that each one is easy to agree to.

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 29, 2024 via email

@steppi
Author

steppi commented Feb 29, 2024

I think setting up the account on Crowdin makes sense then. The danger here is that we're gonna have translated docs, which people will read and then complain about, and we have no idea what those translations say. I wonder how much the translations from Crowdin are going to be better than an auto Google Translate button on our website.

These are good concerns @adrinjalali. These topics came up in discussion with the Pandas core developers too, pandas-dev/pandas#56301 (comment). To summarize, based on translators' experience working on https://numpy.org, machine translation is best used as a starting point, and does not appear advanced enough to be trusted to accurately translate the technical and jargon-filled language of scientific computing. It's valuable to have humans in the loop who understand the subject matter and can review and fix up the machine translations. I think @betatim observed this in his example too.

As for the danger of having translated docs that people read and complain about: I think it would need to be very clear that if someone wants to complain, core project maintainers are not the people they should complain to. When someone tries to create an issue, there should be an option for issues with translations which links to the new issue page on the associated repo for the translated content. There should be clearly visible notes in the translated content pointing out where issues should be submitted.

I think it makes sense to ask some questions and discuss some of the details of how to get the translations back into the website before setting out on the translation effort. This is not just "community projects being slow and slowing things down with endless discussions". For me it is a counter balance to my first impression of this effort which is "someone somewhere applied for some grant to have someone else do something for someone on the internet", which makes me worry that we are driving full throttle in some random direction.

I think that's reasonable. At the moment I just want to get as many of the projects as possible set up with repos mirroring the content synced to Crowdin, to lay the groundwork and basically just tick off a box. I was explicitly told not to try to reach out to translators and push for translating content. I can tell the existing translators who've worked on numpy.org not to work on Scikit-Learn translations yet. Basically, if I budget time for setting projects up, due to overhead it doesn't make a big difference in total effort whether I set up 1, 2, 3 or 10 projects. But if I need to context switch between holding discussions with maintainers, setting up projects, working on interactive documentation for SciPy, and my primary focus as a mathematical/statistical software developer, it's going to make it difficult for me to keep up with everything. It may well be that the exact deliverables specified in the grant are misguided, and if so, this will come out in discussions with project maintainers, and I think the funders will only care that a good faith effort was made.

Worried... What I would want to know is: what's the mechanism for updating (hum, there is a business opportunity here to provide a system for automated translation, with fixes, that learns from the fixes and tries to apply them to new versions). I would really want to have a written plan for updating the translation, and one that have a realistic approach.

When it comes time to move forward with more than the groundwork, the idea is to work on a SPEC which contains such a written plan. This is the route things first went with when discussing things with Pandas maintainers. You might want to look at the PDEP I wrote and the discussion around it pandas-dev/pandas#57204. The earliest draft in the commit history was much more detailed than what was there when the issue was closed. A core goal of this project is to try to create a community of translators who can help keep the content up to date. In around one month someone skilled in these kinds of community building efforts will take over this project at Quansight, and my role will be to help maintain the infrastructure.

@steppi
Author

steppi commented Feb 29, 2024

I'm not against creating an account, but I think it is worth having a bit more of a coherent plan instead of salami slicing the asks so that each one is easy to agree to.

I missed this. Sorry if I gave that impression. I don't plan to salami slice this. My hope is that there will be one small salami slice, so I can get everything set up, which will be easier to do if I can set every project up at once. After that it will just be discussion of the rest of the salami. I think milestones in the grant offer some flexibility for what is actually accomplished and the plan can be adapted to each project's needs.

I can try to give an overview of the plan as I understand it.

First, the scope for content is limited to only information on the core project website which does not change too frequently. The scope for languages is limited to widely spoken languages with a relatively low percentage of speakers proficient in English. The idea is that the amount of content to translate for one project into one language should only be a few days' worth of work in total.

To summarize what's happened so far: as a proof of concept, we set up translation infrastructure for https://numpy.org (managed by PIs on the grant), had content translated, and published translations. This is done. Portuguese and Japanese translations are available now. Korean translations will be published within the next few weeks.

After this I started looking into how things could be set up for other projects. When I started reaching out to other projects, I didn't have a clear plan yet, or even a good idea of what I was doing. I had a good discussion with the Pandas developers, from which I realized how important it really is to keep the work needed from core project maintainers to a minimum.

Now I'm trying to get the translation infrastructure set up for the different projects. I have a technical solution for this, which I described above: having repositories that mirror the content and are synced to Crowdin, which will allow for maintaining parallel versions of the content translated into different languages. The idea here is that there would be a GitHub action in the mirror repo which polls for changes in the content in the source repo and pulls them in when needed. Crowdin will then be aware of these updates, and will push notifications to the translators that there is fresh content that needs to be translated. When translations are made, Crowdin will push commits to a branch on the repo it's synced with, adding the updates to the translations. I have a workflow for submitting PRs to update translations, and we have admins for each language who can give a final sign-off.

For numpy.org, Crowdin is synced directly with the associated GitHub repo, https://github.com/numpy/numpy.org, so when such a PR is merged, the translations get updated online automatically. If the translated content appears in a separate repo, extra steps will be needed to actually publish the content on the Scikit-Learn website. The Pandas developers maintain their own static site generator, and my understanding is their plan would involve pulling down content from the mirror repo dynamically when the website is built, and putting parallel versions at places like pandas.pydata.org/pt. It seems every core project website is a bit different, so how the translations actually appear on the webpage would differ between projects.
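
To make the "parallel versions at places like pandas.pydata.org/pt" idea concrete, here is a minimal sketch of how a site build might decide where each translated page goes. It is purely illustrative: `output_path` and `DEFAULT_LANGUAGE` are hypothetical names, and the assumption that English stays at the site root while other languages get a language-code prefix is mine, not a description of how Pandas or any other project actually builds its site.

```python
from pathlib import PurePosixPath

# Assumption: the source language is served from the site root.
DEFAULT_LANGUAGE = "en"


def output_path(page: str, language: str) -> str:
    """Return the site-relative path where `page` would be published for `language`."""
    if language == DEFAULT_LANGUAGE:
        return PurePosixPath(page).as_posix()
    # Translated pages live under a prefix named after the language code.
    return (PurePosixPath(language) / page).as_posix()
```

For example, `output_path("about/team.html", "pt")` gives `pt/about/team.html`, while the English page keeps its original path, so existing links to the untranslated site would not change.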

For keeping things up to date in the long term, like I mentioned, the goal is to build a community of translators who can help with this. The same group of translators could work on projects across the Scientific Python ecosystem, and having all of the projects in the same Crowdin organization means there would be no friction between someone trying to translate for one project and then moving to another. After the groundwork is laid, in about one month, someone with experience in this kind of work is going to take over. I'm not really qualified to judge the feasibility of accomplishing this, but it seems reasonable to me based on my experience watching over how things have gone with the NumPy translations.

@jeremiedbb jeremiedbb added Needs Decision Requires decision and removed Needs Triage Issue requires triage labels Mar 3, 2024
@steppi
Author

steppi commented Mar 7, 2024

I'm working on an FAQ that will hopefully contain most of the information project maintainers will want to see before deciding if they want to participate, and plan to put that together before continuing discussions with more projects. After the FAQ, I plan to work on a draft SPEC that gives the kind of detailed plan @GaelVaroquaux would like to see.

@steppi
Author

steppi commented May 23, 2024

Hi everyone. The FAQ I'd promised is now available here: https://scientific-python-translations.github.io/faq/. Feel free to take a look and let me know if you'd be interested in participating or if you have any other questions. We're going to start moving forward only with the more enthusiastic projects, so if you're still unsure, you can wait to see how things work out for other project websites before making a final decision.

@adrinjalali
Member

I personally would like to see what this looks like. Since we have now moved to the pydata-theme, I suspect the theme would also support this language switch, which should make it easy for us to adopt it.

When that happens, and the translation infrastructure is set up nicely, then I don't see much of a risk in proceeding.
