Skip to content

GH-46833:[Python][Azure] Adding ConfigureClientSecretCredential to AzureFileSystem #46837

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

KirillTsyganov
Copy link

@KirillTsyganov KirillTsyganov commented Jun 17, 2025

Rationale for this change

ClientSecretCredential is another method to authenticate to Azure resources using Service Principle (SP). We work on AzureML, where there are two main types of compute ComputeInstance (personal VM) and ComputeCluster. In context of AzureML you can work interactively on ComputeInstance, where pyarrow defaults toDefaultAzureCredential when reading/writing data over abfss:// protocol. However when running AzureML "jobs" which is run non-interactively execution, the context (network) is different and DefaultAzureCredential doesn't work. An alternative method for non-interactive job is to use Service Principle i.e ClientSecretCredential. This way we can create AzureFileSystem object and give it pyarrow dataset

What changes are included in this PR?

Adding ConfigureClientSecretCredential to AzureFilesystem, which is already implemented at the C++, but hasn't been propagated to python library. This pull request just propagates C++ method to pyarrow

Are these changes tested?

Yes, test have been included in the relevant file

Are there any user-facing changes?

Docs of AzureFileSystem have been updated in the code. It would be amazing for those changes to surface on the website, which I'm happy to try to help with if needed

Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pickle failures are related to missing to add them to the __reduce__ method, something like:

diff --git a/python/pyarrow/_azurefs.pyx b/python/pyarrow/_azurefs.pyx
index 554687ccb6..573eb0ecf1 100644
--- a/python/pyarrow/_azurefs.pyx
+++ b/python/pyarrow/_azurefs.pyx
@@ -163,5 +163,8 @@ cdef class AzureFileSystem(FileSystem):
                 dfs_storage_authority=frombytes(opts.dfs_storage_authority),
                 blob_storage_scheme=frombytes(opts.blob_storage_scheme),
                 dfs_storage_scheme=frombytes(opts.dfs_storage_scheme),
-                sas_token=frombytes(self.sas_token)
+                sas_token=frombytes(self.sas_token),
+                tenant_id=frombytes(self.tenant_id),
+                client_id=frombytes(self.client_id),
+                client_secret=frombytes(self.client_secret),
             ),))

should fix those failures

@KirillTsyganov KirillTsyganov force-pushed the feature/azure_sp_credential branch from 5fbb903 to 5abfa69 Compare June 18, 2025 07:12
@KirillTsyganov
Copy link
Author

@raulcd, that was it ! thanks for helping

What should I do next?

@raulcd
Copy link
Member

raulcd commented Jun 18, 2025

Thanks! great work!

What should I do next?

Can you update the title of the PR as suggested on this comment:

#46837 (comment)

(basically add the original GH-46833: issue)

And if you could update the description to match the template:

Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?

I'll proceed to ping some other committers to review once this is done and CI has finished successfully.

@KirillTsyganov
Copy link
Author

will do, quick question. I've noticed that linting is failing the test. which linter should I use, nothing seems to be setup in the pyproject.toml. Just cython-lint ?

thanks

@raulcd
Copy link
Member

raulcd commented Jun 18, 2025

We use pre-commit.
As the error suggests you can try:

If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
 diff --git a/python/pyarrow/_azurefs.pyx b/python/pyarrow/_azurefs.pyx
index d5d0660f7f..2131824b85 100644
--- a/python/pyarrow/_azurefs.pyx
+++ b/python/pyarrow/_azurefs.pyx
@@ -122,7 +122,8 @@ cdef class AzureFileSystem(FileSystem):
 
         if (tenant_id or client_id or client_secret):
             if not (tenant_id and client_id and client_secret):
-                raise ValueError("All of tenant_id, client_id, and client_secret must be provided for ClientSecretCredential.")
+                raise ValueError(
+                    "All of tenant_id, client_id, and client_secret must be provided for ClientSecretCredential.")
             options.ConfigureClientSecretCredential(
                 tobytes(tenant_id), tobytes(client_id), tobytes(client_secret)
             )

There seems to be some line length issues too:

 Python Format............................................................Failed
- hook id: autopep8
- files were modified by this hook
Python Lint..............................................................Failed
- hook id: flake8
- exit code: 1

python/pyarrow/tests/test_fs.py:1488:89: E501 line too long (108 > 88 characters)
python/pyarrow/tests/test_fs.py:1495:89: E501 line too long (108 > 88 characters)
python/pyarrow/tests/test_fs.py:1502:89: E501 line too long (108 > 88 characters)

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Jun 18, 2025
@KirillTsyganov KirillTsyganov force-pushed the feature/azure_sp_credential branch from fa6b2c3 to 38c4e23 Compare June 18, 2025 08:18
@KirillTsyganov KirillTsyganov changed the title [Python] Adding ConfigureClientSecretCredential to AzureFileSystem [GH-46833:][Python] Adding ConfigureClientSecretCredential to AzureFileSystem Jun 18, 2025
@KirillTsyganov KirillTsyganov changed the title [GH-46833:][Python] Adding ConfigureClientSecretCredential to AzureFileSystem GH-46833:[Python][Azure] Adding ConfigureClientSecretCredential to AzureFileSystem Jun 18, 2025
Copy link

⚠️ GitHub issue #46833 has been automatically assigned in GitHub to PR creator.

hasn't been surfaced in python library.
This patch propogates this to a python library

Also adding unit tests for AzureFileSystem options

Update python/pyarrow/tests/test_fs.py with

Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
@KirillTsyganov KirillTsyganov force-pushed the feature/azure_sp_credential branch from 38c4e23 to 5a0e725 Compare June 18, 2025 08:37
@KirillTsyganov
Copy link
Author

@raulcd I've updated PR description, let me know if more information is required. thanks

@KirillTsyganov KirillTsyganov marked this pull request as ready for review June 18, 2025 11:34
Copy link
Member

@raulcd raulcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks in general looks good to me. I am not an expert on the Azure SDK but I think we should also expose ConfigureManagedIdentityCredential and match the C++ implementation

Comment on lines +72 to +74
client_id : str, default None
Client ID for Azure Active Directory authentication. Must be provided together with
`tenant_id` and `client_secret` to use ClientSecretCredential.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems client_id could be used in isolation without tenant_id and client_secret:

if (client_id.empty()) {
return Status::Invalid("client_id must be specified");
}
if (tenant_id.empty() && client_secret.empty()) {
RETURN_NOT_OK(ConfigureManagedIdentityCredential(client_id));
} else if (!tenant_id.empty() && !client_secret.empty()) {
RETURN_NOT_OK(
ConfigureClientSecretCredential(tenant_id, client_id, client_secret));
} else {
return Status::Invalid("Both of tenant_id and client_secret must be specified");
}

To use ConfigureManagedIdentityCredential instead of ConfigureClientSecretCredential, we probably should cover that case on the PR too, @pitrou @kou any thoughts?

Copy link
Author

@KirillTsyganov KirillTsyganov Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

I can definitely do that, this would future proof pyarrow, but for my immediate need, that's not required because ManagedIdentityCredential requires client_id of user assigned identity, which we can't make, but this is just our internal policy.

General notes, just fyi

Azure has a few different ways to authentic to it's resources. I'm primarily concerned about AzureML (AML) service, which includes things like ACR, VMs, keyvault and storage account as core services. Note that VM's are managed by AzureML service and are not visible outside of AzureML. Also note that AzureML "buckets" things into workspace, meaning two different workspace will have different VM's and storage accounts at least.

Just in case, but I find this diagram for credential flow useful.

Our internal setup is a little convoluted, but as far as I understand when working interactively i.e on ComputeInstance (AzureML compute type)

from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

will resolve to managed identity (system assigned for us) of the workspace, which has access to storage account. But when I'm running the same script non-iteractively i.e as a AML job DefaultAzureCredential will resolve to managed identity of the VM (ComputeCluster), which for us has no access to storage account.

We have a separate Service Principle (SP) whose identity we allowed access to storage account, hence why adding ClientSecretCredential which is how you create credential for SP.

as an aside, more than happy to try and stay involved with pyarrow and Azure related things

Copy link
Author

@KirillTsyganov KirillTsyganov Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if you want me to push ConfigureManagedIdentityCredential through, thanks

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants