-
Notifications
You must be signed in to change notification settings - Fork 3.7k
GH-46833:[Python][Azure] Adding ConfigureClientSecretCredential to AzureFileSystem #46837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
GH-46833:[Python][Azure] Adding ConfigureClientSecretCredential to AzureFileSystem #46837
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
See also: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pickle failures are related to missing to add them to the __reduce__
method, something like:
diff --git a/python/pyarrow/_azurefs.pyx b/python/pyarrow/_azurefs.pyx
index 554687ccb6..573eb0ecf1 100644
--- a/python/pyarrow/_azurefs.pyx
+++ b/python/pyarrow/_azurefs.pyx
@@ -163,5 +163,8 @@ cdef class AzureFileSystem(FileSystem):
dfs_storage_authority=frombytes(opts.dfs_storage_authority),
blob_storage_scheme=frombytes(opts.blob_storage_scheme),
dfs_storage_scheme=frombytes(opts.dfs_storage_scheme),
- sas_token=frombytes(self.sas_token)
+ sas_token=frombytes(self.sas_token),
+ tenant_id=frombytes(self.tenant_id),
+ client_id=frombytes(self.client_id),
+ client_secret=frombytes(self.client_secret),
),))
should fix those failures
5fbb903
to
5abfa69
Compare
@raulcd, that was it ! thanks for helping What should I do next? |
Thanks! great work!
Can you update the title of the PR as suggested on this comment: (basically add the original And if you could update the description to match the template:
I'll proceed to ping some other committers to review once this is done and CI has finished successfully. |
will do, quick question. I've noticed that linting is failing the test. which linter should I use, nothing seems to be setup in the thanks |
We use pre-commit.
There seems to be some line length issues too:
|
fa6b2c3
to
38c4e23
Compare
|
hasn't been surfaced in python library. This patch propogates this to a python library Also adding unit tests for AzureFileSystem options Update python/pyarrow/tests/test_fs.py with Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
38c4e23
to
5a0e725
Compare
@raulcd I've updated PR description, let me know if more information is required. thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks in general looks good to me. I am not an expert on the Azure SDK but I think we should also expose ConfigureManagedIdentityCredential
and match the C++ implementation
client_id : str, default None | ||
Client ID for Azure Active Directory authentication. Must be provided together with | ||
`tenant_id` and `client_secret` to use ClientSecretCredential. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems client_id
could be used in isolation without tenant_id
and client_secret
:
arrow/cpp/src/arrow/filesystem/azurefs.cc
Lines 219 to 229 in 28cf7a4
if (client_id.empty()) { | |
return Status::Invalid("client_id must be specified"); | |
} | |
if (tenant_id.empty() && client_secret.empty()) { | |
RETURN_NOT_OK(ConfigureManagedIdentityCredential(client_id)); | |
} else if (!tenant_id.empty() && !client_secret.empty()) { | |
RETURN_NOT_OK( | |
ConfigureClientSecretCredential(tenant_id, client_id, client_secret)); | |
} else { | |
return Status::Invalid("Both of tenant_id and client_secret must be specified"); | |
} |
To use
ConfigureManagedIdentityCredential
instead of ConfigureClientSecretCredential
, we probably should cover that case on the PR too, @pitrou @kou any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
I can definitely do that, this would future proof pyarrow
, but for my immediate need, that's not required because ManagedIdentityCredential
requires client_id
of user assigned identity, which we can't make, but this is just our internal policy.
General notes, just fyi
Azure has a few different ways to authentic to it's resources. I'm primarily concerned about AzureML (AML) service, which includes things like ACR, VMs, keyvault and storage account as core services. Note that VM's are managed by AzureML service and are not visible outside of AzureML. Also note that AzureML "buckets" things into workspace, meaning two different workspace will have different VM's and storage accounts at least.
Just in case, but I find this diagram for credential flow useful.
Our internal setup is a little convoluted, but as far as I understand when working interactively i.e on ComputeInstance (AzureML compute type)
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
will resolve to managed identity (system assigned for us) of the workspace, which has access to storage account. But when I'm running the same script non-iteractively i.e as a AML job DefaultAzureCredential
will resolve to managed identity of the VM (ComputeCluster), which for us has no access to storage account.
We have a separate Service Principle (SP) whose identity we allowed access to storage account, hence why adding ClientSecretCredential
which is how you create credential for SP.
as an aside, more than happy to try and stay involved with pyarrow and Azure related things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you want me to push ConfigureManagedIdentityCredential
through, thanks
Rationale for this change
ClientSecretCredential
is another method to authenticate to Azure resources using Service Principle (SP). We work on AzureML, where there are two main types of compute ComputeInstance (personal VM) and ComputeCluster. In context of AzureML you can work interactively on ComputeInstance, where pyarrow defaults toDefaultAzureCredential
when reading/writing data overabfss://
protocol. However when running AzureML "jobs" which is run non-interactively execution, the context (network) is different andDefaultAzureCredential
doesn't work. An alternative method for non-interactive job is to use Service Principle i.eClientSecretCredential
. This way we can createAzureFileSystem
object and give it pyarrowdataset
What changes are included in this PR?
Adding ConfigureClientSecretCredential to AzureFilesystem, which is already implemented at the C++, but hasn't been propagated to python library. This pull request just propagates C++ method to pyarrow
Are these changes tested?
Yes, test have been included in the relevant file
Are there any user-facing changes?
Docs of
AzureFileSystem
have been updated in the code. It would be amazing for those changes to surface on the website, which I'm happy to try to help with if needed