# Activity · zzcclp/spark

`zzcclp/spark` is a public fork of Apache Spark owned by zzcclp (Zhichao Zhang). It was created on 2015-03-11 and its default branch is `master`. The feed below lists the most recent pushes to the fork's branches, newest first; each entry gives the push date, the branch, the commit range and count, and the head commit of the push.

## 2024-07-19 · master · 12 commits (f26eeb0 → a90cb0a)

**[SPARK-48495][DOCS][FOLLOW-UP] Fix Table Markdown in Shredding.md**

Minor change that shouldn't require a JIRA: fixes the unbalanced row in the example table in Shredding.md. Closes #47407 from RussellSpitzer/patch-1. Authored-by: Russell Spitzer. Signed-off-by: Hyukjin Kwon.

## 2024-07-19 · branch-3.5 · 5 commits (44f8766 → 66dce6d)

**[SPARK-48934][SS] Python datetime types converted incorrectly for setting timeout in applyInPandasWithState**

In `applyInPandasWithState()`, `state.setTimeoutTimestamp()` does not behave as expected when it is passed a `datetime.datetime` value; this commit fixes that, and also fixes a bug where `VALUE_NOT_POSITIVE` was reported when the converted value is 0. Unit test coverage was added for this scenario. Closes #47398 from siying/state_set_timeout. Authored-by: Siying Dong. Signed-off-by: Jungtaek Lim. Cherry-picked from commit a54daa1.
## 2024-07-18 · master · 12 commits (c0f6db8 → f26eeb0)

**[SPARK-48926][SQL][TESTS] Use `checkError` method to optimize exception check logic related to `UNRESOLVED_COLUMN` error classes**

Unifies the exception checks for the `UNRESOLVED_COLUMN` error classes on the `checkError` method. No user-facing change; verified by the related test cases. Closes #47389 from wayneguow/op_un_col. Authored-by: Wei Guo. Signed-off-by: Hyukjin Kwon.

## 2024-07-18 · branch-3.5 · 4 commits (f1f5bb6 → 44f8766)

**[SPARK-48930][CORE] Redact `awsAccessKeyId` by including `accesskey` pattern**

Redacts `awsAccessKeyId` by adding an `accesskey` pattern to the redaction rules. In Apache Spark 4.0.0-preview1 there is little point in redacting only `fs.s3a.access.key`, because the same value is exposed via `fs.s3.awsAccessKeyId` (for example when launching `AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell`), so all of these need to be redacted. Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` has been propagated into the Hadoop configuration (see #450 and `SparkHadoopUtil.scala#L481-L486`), but the resulting settings were not redacted consistently. Users may now see more redactions on configurations whose names contain `accesskey` case-insensitively; such configurations are highly likely to be credentials. Tested with the CIs and newly added test cases. Closes #47392 from dongjoon-hyun/SPARK-48930. Authored-by: Dongjoon Hyun. Signed-off-by: Dongjoon Hyun. Cherry-picked from commit 1e17c39.
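A configuration-redaction scheme of the kind described in the SPARK-48930 entry above can be sketched in a few lines: match configuration keys against a case-insensitive pattern and mask the value of any key that matches. This is a hypothetical illustration of the technique, not Spark's implementation (Spark drives its redaction from `spark.redaction.regex`); the pattern and the `redact` helper below are assumptions made for the sketch.

```python
import re

# Hypothetical redaction pattern: secret-looking keys, matched case-insensitively.
# The pattern is an assumption that mirrors the idea of adding "accesskey" to an
# existing secret/password/token set.
REDACTION_PATTERN = re.compile(r"(?i)secret|password|token|access[.]?key|accesskey")

REDACTED = "*********(redacted)"

def redact(conf: dict[str, str]) -> dict[str, str]:
    """Return a copy of the config map with the values of matching keys masked."""
    return {
        key: (REDACTED if REDACTION_PATTERN.search(key) else value)
        for key, value in conf.items()
    }

if __name__ == "__main__":
    conf = {
        "spark.hadoop.fs.s3a.access.key": "AKIA-EXAMPLE",
        "spark.hadoop.fs.s3.awsAccessKeyId": "AKIA-EXAMPLE",
        "spark.executor.memory": "4g",
    }
    for key, value in redact(conf).items():
        print(key, "=", value)
```

With an `accesskey` term in the pattern, both `fs.s3a.access.key` and `fs.s3.awsAccessKeyId` are masked, which is the consistency the commit is after.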
## 2024-07-18 · branch-3.4 · 2 commits (2211521 → 674d4db)

**[SPARK-48930][CORE] Redact `awsAccessKeyId` by including `accesskey` pattern**

The same redaction change as in the branch-3.5 entry above, picked into branch-3.4 (cherry-picked from commit 1e17c39). Closes #47392.

## 2024-07-17 · master · 11 commits (23080ac → c0f6db8)

**[SPARK-48883][ML][R] Replace RDD read / write API invocation with Dataframe read / write API**

A retry of #47328: the SparkR metadata write now goes through a Dataset instead of an RDD, and the `repartition(1)` call is removed. The repartition is unnecessary because a single-row input already yields a single partition in `LocalTableScanExec`; with `repartition(1)` the physical plan contains an `Exchange SinglePartition, REPARTITION_BY_NUM` above the `LocalTableScan`, while without it the plan is just the `LocalTableScan`, so dropping it avoids a needless shuffle. The broader motivation is to leverage the Catalyst optimizer and SQL engine (for example UTF-8 encoding instead of plain JDK serialization for strings), in line with earlier changes such as #29063, #15813, #17255 and SPARK-19918. No user-facing change; verified by CI. Closes #47341 from HyukjinKwon/SPARK-48883-followup. Authored-by: Hyukjin Kwon. Signed-off-by: Hyukjin Kwon.
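The plan difference called out in the SPARK-48883 entry above is easy to reproduce locally. Below is a minimal PySpark sketch, assuming a local `SparkSession`; the column name and data are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("repartition-demo").getOrCreate()

# A single-row local DataFrame, similar in shape to the metadata write described above.
df = spark.createDataFrame([("metadata-json",)], ["_1"])

# With repartition(1): the plan gains an Exchange (a shuffle) above the LocalTableScan.
df.repartition(1).explain()

# Without it: a single-row local relation is already one partition, so the plan is
# just the LocalTableScan and no shuffle is introduced.
df.explain()

spark.stop()
```

The point of dropping `repartition(1)` is exactly the missing `Exchange`: the shuffle buys nothing for a single-partition input.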
## 2024-07-17 · branch-3.5 · 2 commits (56dec39 → f1f5bb6)

**[SPARK-47307][SQL][3.5] Add a config to optionally chunk base64 strings**

Backports #47303 to 3.5. In #35110 it was incorrectly asserted that "ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt": the previous code called `public static byte[] encodeBase64(byte[] binaryData)`, which is documented to encode "using the base64 algorithm but does not chunk the output", whereas the RFC 2045 (MIME) base64 encoder does chunk by default. As a result, base64 strings encoded by Spark could no longer be decoded by decoders that do not implement RFC 2045, even though the docs state RFC 4648. This change adds a config to make chunking optional. No user-facing change; covered by the existing test suite. Closes #47325 from wForget/SPARK-47307_3.5. Lead-authored-by: wforget <643348094@qq.com>. Co-authored-by: Ted Jenks. Signed-off-by: Kent Yao.
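The chunked-versus-unchunked distinction behind SPARK-47307 can be seen with a standard library alone. The sketch below uses Python's `base64` module as an analogy for the Java encoders the commit discusses; it is an illustration of the two output formats, not Spark's code path.

```python
import base64

data = b"A" * 100  # long enough to exceed one 76-character MIME line

# RFC 4648-style encoding: one continuous string with no line breaks.
unchunked = base64.b64encode(data).decode("ascii")

# MIME (RFC 2045)-style encoding: the output is chunked into 76-character lines.
chunked = base64.encodebytes(data).decode("ascii")

print(repr(unchunked))
print(repr(chunked))
```

A decoder that only accepts plain RFC 4648 input may reject the embedded newlines, which is why making chunking an explicit, configurable choice matters.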
## 2024-07-17 · branch-3.4 · 1 commit (05ad146 → 2211521)

**[SPARK-47172][DOCS][FOLLOWUP] Fix `spark.network.crypto.cipher` since-version field on security page**

SPARK-47172 was an improvement but was merged into 3.4/3.5, so the "since" version shown on the security page is corrected to avoid misunderstandings. Documentation-only fix, verified by a doc build. Closes #47353 from yaooqinn/SPARK-47172. Authored-by: Kent Yao. Signed-off-by: Kent Yao. Cherry-picked from commit d8820a0.

## 2024-07-16 · master · 11 commits (206cc1a → 23080ac)

**[SPARK-45155][CONNECT] Add API Docs for Spark Connect JVM/Scala Client**

Based on #42911. Enables Scala and Java Unidoc generation for the `connectClient` project and moves the generated docs into `docs/api/connect`, increasing the documentation coverage of the Spark Connect JVM/Scala client. Some documentation in the connect directory had to be adjusted to remove references that would otherwise fail javadoc generation; references to these API docs from the main index page and the global floating header will be added in a later PR. Manually tested. Closes #47332 from xupefei/connnect-doc-web. Authored-by: Paddy Xu. Signed-off-by: Hyukjin Kwon.
## 2024-07-15 · master · 29 commits (06bebb8 → 206cc1a)

**[SPARK-48613][SQL] SPJ: Support auto-shuffle one side + less join keys than partition keys**

The final planned storage-partitioned join (SPJ) scenario: automatically shuffling one side while the join keys are a subset of the partition keys. Background: auto-shuffle works by creating a ShuffleExchange for the non-partitioned side with a clone of the partitioned side's `KeyGroupedPartitioning`; "fewer join keys than partition keys" works by projecting the partition values onto the join keys (keeping only partition columns that are also join columns), building a target `KeyGroupedShuffleSpec` from the projected values and pushing it down to `BatchScanExec`, which then groups the projected partition values (except in the skew case, which is a separate story).

The combination is hard because SPJ planning is spread across several places. Given a non-partitioned side, a partitioned side, and join keys that are only a subset of the partition keys: (1) `EnsureRequirements` creates the target `KeyGroupedShuffleSpec` from the join's required distribution, using only the join keys; (2) it copies this to the non-partitioned side's `KeyGroupedPartitioning` for the auto-shuffle case; (3) `BatchScanExec` groups the partitions on the partitioned side, including by join keys when they differ from the partition keys. Take partition columns `(id, name)` and partition values `(1, "bob")`, `(2, "alice")`, `(2, "sam")`: projection leaves `(1, 2, 2)` and the final grouped partition values are `(1, 2)`. The two sides of the join therefore do not always match: after steps 1 and 2 the partitioned side holds the projected values `(1, 2, 2)` and the non-partitioned side creates a matching `KeyGroupedPartitioning` over `(1, 2, 2)` for its ShuffleExchange, but in step 3 the partitioned side's `BatchScanExec` groups its partitions down to `(1, 2)` while the non-partitioned side still has three partitions. The join then fails with `java.lang.IllegalArgumentException: requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions`, raised from `ShuffledJoin.outputPartitioning` while `EnsureRequirements` runs.

The fix is to de-duplicate in the first pass: push the join keys down to `BatchScanExec` so the partitioned side returns a de-duplicated `outputPartitioning`, and create the non-partitioned side's `KeyGroupedPartitioning` with de-duplicated partition keys. Tested by updating existing unit tests in `KeyGroupedPartitionSuite`. Closes #47064 from szehon-ho/spj_less_join_key_auto_shuffle. Authored-by: Szehon Ho. Signed-off-by: Chao Sun.
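The partition-value bookkeeping in the SPARK-48613 entry can be illustrated without Spark at all. The sketch below reuses the entry's own example data; the variable names are invented, and this is only a model of the projection and grouping mismatch, not the planner code.

```python
# Partition columns: (id, name); join key: id only.
partition_values = [(1, "bob"), (2, "alice"), (2, "sam")]

# Step 1: project each partition value onto the join keys, giving (1, 2, 2).
projected = [(pid,) for pid, _name in partition_values]

# Step 2: the partitioned side then groups the projected values, giving (1, 2).
grouped = sorted(set(projected))

print("projected:", projected)  # [(1,), (2,), (2,)] i.e. 3 partitions
print("grouped:  ", grouped)    # [(1,), (2,)]       i.e. 2 partitions

# If the shuffled (non-partitioned) side is built from `projected` while the scan
# side reports `grouped`, the two sides disagree on the partition count (3 vs 2),
# which is the mismatch behind the "same numPartitions" requirement failure.
# De-duplicating in the first pass, as the fix does, keeps both sides at 2.
```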
## 2024-07-15 · branch-3.5 · 3 commits (d517a63 → 56dec39)

**[SPARK-48666][SQL] Do not push down filter if it contains PythonUDFs**

Prevents filters that contain Python UDFs from being pushed down. The change takes the same approach as #47033 (whose author is credited as a co-author) but simplifies it. The root cause is rule ordering in `SparkOptimizer`: extracting filters to push down (the early scan-pushdown rules, including `PruneFileSourcePartitions`; see the `SparkOptimizer.scala` and `Optimizer.scala` links in the PR description) happens before Python UDFs are extracted, so catalog-side partition pruning ends up trying to evaluate an unevaluable `PythonUDF` expression and fails with `org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: pyUDF(cast(input[0, bigint, true] as string)) SQLSTATE: XX000`. The full stack trace, reproduced in the PR description, runs from `QueryExecutionErrors.cannotEvaluateExpressionError` through `ExternalCatalogUtils.prunePartitionsByFilter`, `CatalogFileIndex.filterPartitions` and `PruneFileSourcePartitions` up to `QueryExecution.optimizedPlan`, as hit by a test in `PythonUDFSuite`. The change lets end users apply Python UDFs to partitioned columns; it is a bug fix, verified by an added unit test. Closes #47033 and #47313 from HyukjinKwon/SPARK-48666. Lead-authored-by: Hyukjin Kwon. Co-authored-by: Wei Zheng. Signed-off-by: Hyukjin Kwon. Cherry-picked from commit d747853.
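The scenario the SPARK-48666 fix targets, a Python UDF used in a filter on a partitioned column, looks roughly like the sketch below. The table name, UDF and data are made up for illustration; before the fix, planning such a query could fail with the internal "Cannot evaluate expression" error during partition pruning.

```python
from pyspark.sql import SparkSession, functions as sf
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").appName("pyudf-pushdown-demo").getOrCreate()

# A small table partitioned by `part` (hypothetical table used only for this sketch).
(spark.range(10)
     .withColumn("part", sf.col("id") % 2)
     .write.mode("overwrite")
     .partitionBy("part")
     .saveAsTable("pyudf_demo"))

# A Python UDF applied to the partition column inside a filter. Such a predicate
# cannot be evaluated during catalog-side partition pruning, so it must not be
# included in the pushed-down filters; that is what the fix enforces.
as_str = sf.udf(lambda v: str(v), StringType())

spark.table("pyudf_demo").where(as_str(sf.col("part")) == "1").show()

spark.stop()
```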
## 2024-07-12 · master · 20 commits (4d13c22 → 06bebb8)

**[SPARK-48852][CONNECT] Fix string trim function in connect**

Changes the order of the arguments passed by the Connect client's `trim` call to match `sql/core/src/main/scala/org/apache/spark/sql/functions.scala`. This fixes a correctness bug in Spark Connect where a query meant to trim characters `s` from a column was instead replaced by a substring of `s`. Tested by updating the golden files for `PlanGenerationTestSuite` and adding a test that verifies correctness. Closes #47277 from biruktesf-db/fix-trim-connect. Authored-by: Biruk Tesfaye. Signed-off-by: Hyukjin Kwon.

## 2024-07-12 · branch-3.5 · 3 commits (1e15e3f → d517a63)

**[MINOR][SQL][TESTS] Remove a duplicate test case in `CSVExprUtilsSuite`**

Removes a duplicate test case from `CSVExprUtilsSuite` to clean up duplicate code. No user-facing change; passes GA. Closes #47298 from wayneguow/csv_suite. Authored-by: Wei Guo. Signed-off-by: Dongjoon Hyun. Cherry-picked from commit 297a9d2.

## 2024-07-12 · branch-3.4 · 1 commit (fa7a6ab → 05ad146)

**[MINOR][SQL][TESTS] Remove a duplicate test case in `CSVExprUtilsSuite`**

The same cleanup, picked into branch-3.4 (cherry-picked from commit 297a9d2). Closes #47298.
## 2024-07-11 · master · 16 commits (f738843 → 4d13c22)

**[SPARK-48763][CONNECT][BUILD] Move connect server and common to builtin module**

Moves the Spark Connect server and common modules from `connector/connect/server` and `connector/connect/common` to `connect/server` and `connect/common`, so that end users no longer have to specify `--packages` when starting the Spark Connect server; the Spark Connect client remains a separate module. This was also pointed out in the review of #39928. Verified by CI and by manually running the Maven and SBT builds, the basic Scala client commands (`bin/spark-connect` and `bin/spark-connect-scala-client` from `connector/connect`), `bin/pyspark --remote local`, and connecting to a server launched by `./sbin/start-connect-server.sh` with `bin/pyspark --remote "sc://localhost"`. Closes #47157 from HyukjinKwon/move-connect-server-builtin. Authored-by: Hyukjin Kwon. Signed-off-by: Hyukjin Kwon.

## 2024-07-11 · branch-3.5 · 1 commit (67047cd → 1e15e3f)

**[SPARK-48843] Prevent infinite loop with BindParameters**

To resolve named parameters in a subtree, `BindParameters` recurses into the subtrees and matches the pattern against the named parameters. When the current level contains no named parameter the rule intends to return the plan unchanged, but instead of returning the current plan node it always returned the captured root plan node, leading to infinite recursion when named parameters are combined with a global limit. Unit tests added. Closes #47271 from nemanja-boric-databricks/fix-bind. Lead-authored-by: Nemanja Boric. Co-authored-by: Wenchen Fan. Signed-off-by: Wenchen Fan. Cherry-picked from commit a39f70d.
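The bug pattern in the SPARK-48843 entry, a tree-rewrite rule that returns a previously captured root instead of the node it is currently rewriting, can be shown on a toy tree. This is not Catalyst code; the node type and rule below are invented purely to illustrate why returning the wrong node keeps the rewrite from ever reaching a fixed point.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    child: Optional["Node"] = None

def bind(root: Node) -> Node:
    """Toy rewrite rule: replace 'param' nodes, leave everything else unchanged."""
    def visit(node: Node) -> Node:
        if node.name == "param":
            return Node("bound", visit(node.child) if node.child else None)
        # Correct: return the *current* node with its rewritten child. Returning
        # the captured `root` here instead would splice the whole tree back in
        # under itself, so repeated application of the rule never converges,
        # which is the infinite-loop shape the commit fixes.
        return Node(node.name, visit(node.child) if node.child else None)
    return visit(root)

print(bind(Node("globalLimit", Node("param"))))  # prints the rewritten tree
```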
## 2024-07-10 · master · 17 commits (dab708e → f738843)

**[SPARK-44728][PYTHON][DOCS] Fix the incorrect naming and missing params in func docs in `builtin.py`**

Fixes incorrectly named and missing parameters in the function docs in `builtin.py`; some parameter names in the PySpark docs were simply wrong (an example screenshot is attached to the PR). Documentation-only; passed GA. Closes #47269 from wayneguow/py_docs. Authored-by: Wei Guo. Signed-off-by: Hyukjin Kwon.

## 2024-07-09 · master · 12 commits (f1eca90 → dab708e)

**[SPARK-48307][SQL][FOLLOWUP] Eliminate the use of mutable.ArrayBuffer**

Eliminates the use of `mutable.ArrayBuffer` by using `flatMap`, as a code simplification and optimization. Covered by existing unit tests. Closes #47185 from amaliujia/followup_cte. Lead-authored-by: Rui Wang. Co-authored-by: Kent Yao. Signed-off-by: Wenchen Fan.

## 2024-07-09 · branch-3.5 · 1 commit (1cc0043 → 67047cd)

**[SPARK-48719][SQL][3.5] Fix the calculation bug of RegrSlope & RegrIntercept when the first parameter is null**

Fixes the calculation of `RegrSlope` and `RegrIntercept` when the first parameter is null: a tuple must be filtered out whenever either the first parameter (y) or the second parameter (x) is null. This changes results for tuples whose first value is null, but the new results are the correct ones. Tested with GA and `build/sbt "~sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z linear-regression.sql"`. Closes #47230 from wayneguow/SPARK-48719_3_5. Authored-by: Wei Guo. Signed-off-by: Wenchen Fan.
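The null-handling rule stated in the SPARK-48719 entry above (a pair is skipped whenever either y or x is NULL) can be checked with a short query. This is an illustrative query assuming a local `SparkSession`; it is not taken from the patch's test suite.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("regr-null-demo").getOrCreate()

# The (NULL, 3.0) pair must be ignored entirely: with the fix, only the two
# complete pairs (1.0, 1.0) and (2.0, 2.0) contribute, giving slope 1.0 and
# intercept 0.0.
spark.sql("""
    SELECT regr_slope(y, x) AS slope, regr_intercept(y, x) AS intercept
    FROM VALUES (1.0, 1.0), (2.0, 2.0), (CAST(NULL AS DOUBLE), 3.0) AS t(y, x)
""").show()

spark.stop()
```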
## 2024-07-06 · master · 5 commits (310f8ea → f1eca90)

**[SPARK-48719][SQL] Fix the calculation bug of `RegrSlope` & `RegrIntercept` when the first parameter is null**

The original master version of the fix described in the branch-3.5 entry above. Closes #47105 from wayneguow/SPARK-48719. Authored-by: Wei Guo. Signed-off-by: Wenchen Fan.

## 2024-07-05 · master · 6 commits (8ace648 → 310f8ea)

**[SPARK-48806][SQL] Pass actual exception when url_decode fails**

Passes the actual exception along when `url_decode` fails (follow-up to SPARK-40156). Previously `url_decode` discarded the underlying exception, which is exactly the information needed to locate the problem quickly: for `select url_decode('https%3A%2F%2spark.apache.org');` the user only saw the `[CANNOT_DECODE_URL]` `SparkIllegalArgumentException` ("The provided URL cannot be decoded ... Please ensure that the URL is properly formatted and try again"), while the useful cause, `java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 1 in: "2s"`, was dropped. After this change the `CANNOT_DECODE_URL` error carries the `URLDecoder` exception as its cause, visible as a `Caused by:` entry in the stack trace. No user-facing change otherwise; covered by a unit test. Closes #47211 from wForget/SPARK-48806. Lead-authored-by: wforget <643348094@qq.com>. Co-authored-by: Kent Yao. Signed-off-by: Kent Yao.
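The pattern applied by the SPARK-48806 change, wrapping the low-level decoding failure in a friendlier error while keeping the original exception attached as its cause, has a direct Python analogue via `raise ... from ...`. The decoder below is a deliberately minimal stand-in written for this sketch, not Spark's `UrlCodec`.

```python
class CannotDecodeUrl(ValueError):
    """Friendly, user-facing error for undecodable URLs."""

def strict_percent_decode(url: str) -> str:
    """Minimal strict decoder: every '%' must be followed by two hex digits."""
    out = bytearray()
    i = 0
    while i < len(url):
        if url[i] == "%":
            out += bytes.fromhex(url[i + 1:i + 3])  # raises ValueError on e.g. "%2s"
            i += 3
        else:
            out += url[i].encode("utf-8")
            i += 1
    return out.decode("utf-8")

def url_decode(url: str) -> str:
    try:
        return strict_percent_decode(url)
    except ValueError as err:
        # Chain the low-level failure so the root cause stays visible,
        # mirroring what the Spark change does for CANNOT_DECODE_URL.
        raise CannotDecodeUrl(f"The provided URL cannot be decoded: {url}") from err

try:
    url_decode("https%3A%2F%2spark.apache.org")
except CannotDecodeUrl as e:
    print(e)                                 # friendly message
    print("caused by:", repr(e.__cause__))   # original ValueError is preserved
```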
## 2024-07-05 · branch-3.5 · 1 commit (44eba46 → 1cc0043)

**[SPARK-48806][SQL] Pass actual exception when url_decode fails**

The same change as the master entry above, cherry-picked into branch-3.5 from commit 310f8ea. Closes #47211. Signed-off-by: Kent Yao.
## 2024-07-04 · master · 8 commits (15f2167 → 8ace648)

**[SPARK-47046][BUILD][TESTS] Upgrade `mysql-connector-j` to 9.0.0**

Upgrades `mysql-connector-j` from 8.4.0 to 9.0.0, a new GA release that is recommended for use. Full release notes: https://dev.mysql.com/doc/relnotes/connector-j/en/news-9-0-0.html. No user-facing change; passes GA. Closes #47200 from wayneguow/upgrade_mysql_connector. Authored-by: Wei Guo. Signed-off-by: Hyukjin Kwon.
## 2024-07-03 · master · 9 commits (5ac7c9b → 15f2167)

**[SPARK-48785][DOCS] Add a simple Python data source example in the user guide**

Adds a self-contained, simple example implementation of a Python data source to the user guide to help users get started more quickly. Documentation improvement only; covered by existing tests. Closes #47187 from allisonwang-db/spark-48785-pyds-user-guide. Authored-by: allisonwang-db. Signed-off-by: Hyukjin Kwon.

## 2024-07-03 · branch-3.5 · 2 commits (df70cc1 → 44eba46)

**[SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2)**

Adds a `numpy<2` constraint to the PySpark package. PySpark references code that was removed in NumPy 2.0, so executing PySpark may fail when `numpy>=2` is installed; #47083 makes the `master` branch compatible with NumPy 2, and this change adds an upper bound for the older release lines where that work does not apply. As a result, NumPy is limited to `numpy<2` when installing `pyspark` with the `ml`, `mllib`, `sql`, `pandas_on_spark` or `connect` extras. Verified via the existing CI jobs. Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound. Authored-by: Patrick Marx <6949483+codesorcery@users.noreply.github.com>. Signed-off-by: Hyukjin Kwon.
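For reference, an extras-scoped version bound like the one in the SPARK-48710 entry above is expressed in packaging metadata roughly as below. This is an invented `setup.py` fragment, not PySpark's actual setup script; the package name is made up, and the `>=1.15,<2` pin simply mirrors the range in the entry title.

```python
from setuptools import setup

# The core package installs without NumPy; the optional extras that need it pin
# NumPy to the supported range so that `numpy>=2` is never pulled in.
numpy_requirement = "numpy>=1.15,<2"

setup(
    name="example-pyspark-like-package",  # hypothetical package name
    version="3.5.1",
    py_modules=["example_module"],        # placeholder module for the sketch
    extras_require={
        extra: [numpy_requirement]
        for extra in ("ml", "mllib", "sql", "pandas_on_spark", "connect")
    },
)
```

Installing, say, `pip install "example-pyspark-like-package[sql]"` would then resolve NumPy to a 1.x release.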
## 2024-07-03 · branch-3.4 · 2 commits (1bfc9c3 → fa7a6ab)

**[SPARK-48710][PYTHON][3.5] Limit NumPy version to supported range (>=1.15,<2)**

The same NumPy bound, picked into branch-3.4 (cherry-picked from commit 44eba46). Closes #47175.

## 2024-07-02 · master · 10 commits (f49418b → 5ac7c9b)

**[SPARK-48766][PYTHON] Document the behavior difference of `extraction` between `element_at` and `try_element_at`**

When `try_element_at` was introduced in 3.5, its handling of the `extraction` argument was unintentionally made inconsistent with `element_at`, which causes confusion; this commit documents the difference, since fixing it would be a breaking change. With `df = spark.createDataFrame([({"a": 1.0, "b": 2.0}, "a")], ['data', 'b'])`, `df.select(sf.try_element_at(df.data, 'b'))` returns 1.0 while `df.select(sf.element_at(df.data, 'b'))` returns 2.0, i.e. the string is resolved against the column `b` in one case and as the literal key `"b"` in the other. Documentation changes only; verified by CI with added doctests. Closes #47161 from zhengruifeng/doc_element_at_extraction. Authored-by: Ruifeng Zheng. Signed-off-by: Ruifeng Zheng.
## 2024-07-02 · branch-3.5 · 1 commit (686f59c → df70cc1)

**[SPARK-48292][CORE][3.5] Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status**

A backport of #46696. Reverts #36564, following the discussion at https://github.com/apache/spark/pull/36564#discussion_r1607575927. When Spark commits a task, it commits to the committed-task path `${outputpath}/_temporary//${appAttempId}/${taskId}`. In the case #36564 addressed, before #38980 each task attempt's job ID carried a different date, so when a task wrote its data successfully but failed to send the TaskSuccess RPC back, the re-run attempt committed to a different committed-task path and the data was duplicated. After #38980 the TaskId is the same across attempts of the same task, so a re-run commits to the same committed-task path and the Hadoop commit protocol handles the case without duplicating data (the task attempt path still differs, since it contains the taskAttemptId). The reverted safeguard is therefore no longer needed. No user-facing change; covered by existing unit tests. Closes #47166 from dongjoon-hyun/SPARK-48292. Authored-by: Angerszhuuuu. Signed-off-by: Dongjoon Hyun.

## 2024-07-02 · branch-3.4 · 1 commit (5180694 → 1bfc9c3)

**[SPARK-48292][CORE][3.4] Revert [SPARK-39195][SQL] Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status**

The same revert applied to branch-3.4. Closes #47168 from dongjoon-hyun/SPARK-48292-3.4. Authored-by: Angerszhuuuu. Signed-off-by: Dongjoon Hyun.
## 2024-07-01 · master · 1 commit (6bfeb09 → f49418b)

**[SPARK-48751][INFRA][PYTHON][TESTS] Re-balance `pyspark-pandas-connect` tests on GA**

Re-balances the `pyspark-pandas-connect` test groups on GitHub Actions so that the execution time of `pyspark-pandas-connect-part[0-3]` is roughly even, avoiding long tails that push up the overall GA wall-clock time. In the runs observed (the checks for #47135 and a run on the author's fork), most parts finished in about one hour while part2 took 1h 47m to 1h 49m and part3 took 2h 16m to 2h 20m. No user-facing change; verified by manually observing the part runtimes. Closes #47137 from panbingkun/split_pyspark_tests_to_5. Authored-by: panbingkun. Signed-off-by: Hyukjin Kwon.

Older pushes continue on the next page of the activity feed (30 entries are shown per page).