{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":52826553,"defaultBranch":"master","name":"spark","ownerLogin":"wgtmac","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2016-02-29T21:36:47.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/4684607?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1710653400.0","currentOid":""},"activityList":{"items":[{"before":"fd69f32ad0d10d5a20c3e00ee0db4b731a469db2","after":"da8e4cf7bfccd7a390b15d293327c14e9e426e5e","ref":"refs/heads/release_orc_1.9.3_rc0","pushedAt":"2024-03-19T01:03:25.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"Test Apache ORC 1.9.3 RC0","shortMessageHtmlLink":"Test Apache ORC 1.9.3 RC0"}},{"before":"d1c249a7a07134fcebc338446c74b0d00001c133","after":"98754657e7855b86845dfc1950220e2ee6777030","ref":"refs/heads/release_orc_1.9.3","pushedAt":"2024-03-19T01:03:03.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"Test Apache ORC 1.9.3-SNAPSHOT","shortMessageHtmlLink":"Test Apache ORC 1.9.3-SNAPSHOT"}},{"before":null,"after":"fd69f32ad0d10d5a20c3e00ee0db4b731a469db2","ref":"refs/heads/release_orc_1.9.3_rc0","pushedAt":"2024-03-17T05:30:00.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"Test Apache ORC 1.9.3 RC0","shortMessageHtmlLink":"Test Apache ORC 1.9.3 RC0"}},{"before":null,"after":"d1c249a7a07134fcebc338446c74b0d00001c133","ref":"refs/heads/release_orc_1.9.3","pushedAt":"2024-03-17T03:39:06.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"Test Apache ORC 1.9.3-SNAPSHOT","shortMessageHtmlLink":"Test Apache ORC 1.9.3-SNAPSHOT"}},{"before":null,"after":"8c6eeb8ab0180368cc60de8b2dbae7457bee5794","ref":"refs/heads/branch-3.5","pushedAt":"2024-03-17T03:31:33.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"[SPARK-45587][INFRA] Skip UNIDOC and MIMA in `build` GitHub Action job\n\n### What changes were proposed in this pull request?\n\nThis PR aims to skip `Unidoc` and `MIMA` phases in many general test pipelines. `mima` test is moved to `lint` job.\n\n### Why are the changes needed?\n\nBy having an independent document generation and mima checking GitHub Action job, we can skip them in the following many jobs.\n\nhttps://github.com/apache/spark/blob/73f9f5296e36541db78ab10c4c01a56fbc17cca8/.github/workflows/build_and_test.yml#L142-L190\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nManually check the GitHub action logs.\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #43422 from dongjoon-hyun/SPARK-45587.\n\nAuthored-by: Dongjoon Hyun \nSigned-off-by: Dongjoon Hyun ","shortMessageHtmlLink":"[SPARK-45587][INFRA] Skip UNIDOC and MIMA in build GitHub Action job"}},{"before":null,"after":"8c192b678aadb7ea4fbf87dc7bfc12f9c086fb48","ref":"refs/heads/ORC-1.8.5-RC0","pushedAt":"2023-09-05T03:08:27.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"Test Apache ORC 1.8.5 RC0","shortMessageHtmlLink":"Test Apache ORC 1.8.5 RC0"}},{"before":"20048372d6aa7546504266ffa0567c418de660f1","after":"daf481d950564efc01fb99628dded08ad1f51ff2","ref":"refs/heads/branch-3.4","pushedAt":"2023-09-05T03:05:46.000Z","pushType":"push","commitsCount":3,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"[SPARK-44940][SQL][3.4] Improve performance of JSON parsing when \"spark.sql.json.enablePartialResults\" is enabled\n\n### What changes were proposed in this pull request?\n\nBackport of https://github.com/apache/spark/pull/42667 to branch-3.4.\n\nThe PR improves JSON parsing when `spark.sql.json.enablePartialResults` is enabled:\n- Fixes the issue when using nested arrays `ClassCastException: org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow`\n- Improves parsing of the nested struct fields, e.g. `{\"a1\": \"AAA\", \"a2\": [{\"f1\": \"\", \"f2\": \"\"}], \"a3\": \"id1\", \"a4\": \"XXX\"}` used to be parsed as `|AAA|NULL |NULL|NULL|` and now is parsed as `|AAA|[{NULL, }]|id1|XXX|`.\n- Improves performance of nested JSON parsing. The initial implementation would throw too many exceptions when multiple nested fields failed to parse. When the config is disabled, it is not a problem because the entire record is marked as NULL.\n\nThe internal benchmarks show the performance improvement from slowdown of over 160% to an improvement of 7-8% compared to the master branch when the flag is enabled. I will create a follow-up ticket to add a benchmark for this regression.\n\n### Why are the changes needed?\n\nFixes some corner cases in JSON parsing and improves performance when `spark.sql.json.enablePartialResults` is enabled.\n\n### Does this PR introduce _any_ user-facing change?\n\nNo.\n\n### How was this patch tested?\n\nI added tests to verify nested structs, maps, and arrays can be parsed without affecting the subsequent fields in the JSON. I also updated the existing tests when `spark.sql.json.enablePartialResults` is enabled because we parse more data now.\n\nI added a benchmark to check performance.\n\nBefore the change (master, https://github.com/apache/spark/commit/a45a3a3d60cb97b107a177ad16bfe36372bc3e9b):\n```\n[info] OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws\n[info] Intel(R) Xeon(R) Platinum 8375C CPU 2.90GHz\n[info] Partial JSON results: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative\n[info] ------------------------------------------------------------------------------------------------------------------------\n[info] parse invalid JSON 9537 9820 452 0.0 953651.6 1.0X\n```\n\nAfter the change (this PR):\n```\nOpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10 on Linux 5.4.0-1045-aws\nIntel(R) Xeon(R) Platinum 8375C CPU 2.90GHz\nPartial JSON results: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative\n------------------------------------------------------------------------------------------------------------------------\nparse invalid JSON 3100 3106 6 0.0 309967.6 1.0X\n```\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo.\n\nCloses #42792 from sadikovi/SPARK-44940-3.4.\n\nAuthored-by: Ivan Sadikov \nSigned-off-by: Dongjoon Hyun ","shortMessageHtmlLink":"[SPARK-44940][SQL][3.4] Improve performance of JSON parsing when \"spa…"}},{"before":null,"after":"20048372d6aa7546504266ffa0567c418de660f1","ref":"refs/heads/branch-3.4","pushedAt":"2023-09-01T01:32:40.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"[SPARK-44990][SQL] Reduce the frequency of get `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv`\n\n### What changes were proposed in this pull request?\nThis PR move get config `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv` to lazy val of `UnivocityGenerator`. To reduce the frequency of get it. As report, it will affect performance.\n\n### Why are the changes needed?\nReduce the frequency of get `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv`\n\n### Does this PR introduce _any_ user-facing change?\nNo\n\n### How was this patch tested?\nexist test\n\n### Was this patch authored or co-authored using generative AI tooling?\n\nNo\n\nCloses #42738 from Hisoka-X/SPARK-44990_csv_null_value_config.\n\nAuthored-by: Jia Fan \nSigned-off-by: Dongjoon Hyun \n(cherry picked from commit dac750b855c35a88420b6ba1b943bf0b6f0dded1)\nSigned-off-by: Dongjoon Hyun ","shortMessageHtmlLink":"[SPARK-44990][SQL] Reduce the frequency of get `spark.sql.legacy.null…"}},{"before":null,"after":"65f15bb27c946fe98eb221c1a3eef0078d2160c7","ref":"refs/heads/ORC-1.8.5-SNAPSHOT","pushedAt":"2023-09-01T01:32:20.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"Test Apache ORC 1.8.5-SNAPSHOT","shortMessageHtmlLink":"Test Apache ORC 1.8.5-SNAPSHOT"}},{"before":null,"after":"37c923687f1efe2d3902873db51afa682d76ab9d","ref":"refs/heads/ORC-1.7.9-RC1","pushedAt":"2023-05-04T02:33:50.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"Test ORC-1.7.9-RC1","shortMessageHtmlLink":"Test ORC-1.7.9-RC1"}},{"before":"058dcbf3fb0b17a4295f6e0b516f5c955cfa2d59","after":"e9aab411ca804fed1da9fae0ccfd590fc9c4c61e","ref":"refs/heads/branch-3.3","pushedAt":"2023-05-04T02:30:18.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"[SPARK-43293][SQL] `__qualified_access_only` should be ignored in normal columns\n\nThis is a followup of https://github.com/apache/spark/pull/39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only to metadata columns.\n\nIt's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS and the table schema contains the special flag.\n\nNo\n\nnew analysis test\n\nCloses #40961 from cloud-fan/col.\n\nAuthored-by: Wenchen Fan \nSigned-off-by: Wenchen Fan \n(cherry picked from commit 021f02e02fb88bbbccd810ae000e14e0c854e2e6)\nSigned-off-by: Wenchen Fan ","shortMessageHtmlLink":"[SPARK-43293][SQL] __qualified_access_only should be ignored in nor…"}},{"before":"22d24bb29c3541e6950c949e6b352bac17d6a290","after":"679b714e8a2d8a53a4412b97c872b4f8749fa635","ref":"refs/heads/ORC-1.7.9-SNAPSHOT","pushedAt":"2023-04-27T01:57:49.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"Test Apache ORC 1.7.9-SNAPSHOT","shortMessageHtmlLink":"Test Apache ORC 1.7.9-SNAPSHOT"}},{"before":null,"after":"058dcbf3fb0b17a4295f6e0b516f5c955cfa2d59","ref":"refs/heads/branch-3.3","pushedAt":"2023-04-27T01:54:23.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"[SPARK-43240][SQL][3.3] Fix the wrong result issue when calling df.describe() method\n\n### What changes were proposed in this pull request?\nThe df.describe() method will cached the RDD. And if the cached RDD is RDD[Unsaferow], which may be released after the row is used, then the result will be wong. Here we need to copy the RDD before caching as the [TakeOrderedAndProjectExec ](https://github.com/apache/spark/blob/d68d46c9e2cec04541e2457f4778117b570d8cdb/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala#L204)operator does.\n\n### Why are the changes needed?\nbug fix\n\n### Does this PR introduce _any_ user-facing change?\nno\n\n### How was this patch tested?\n\nCloses #40914 from JkSelf/describe.\n\nAuthored-by: Jia Ke \nSigned-off-by: Wenchen Fan ","shortMessageHtmlLink":"[SPARK-43240][SQL][3.3] Fix the wrong result issue when calling df.de…"}},{"before":null,"after":"22d24bb29c3541e6950c949e6b352bac17d6a290","ref":"refs/heads/ORC-1.7.9-SNAPSHOT","pushedAt":"2023-04-18T01:39:28.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"}},{"before":"cb87b3ced9453b5717fa8e8637b97a2f3f25fdd7","after":"cbe94a172ca2e361fba38318298cb349389eb8a2","ref":"refs/heads/master","pushedAt":"2023-04-18T01:35:11.000Z","pushType":"push","commitsCount":10000,"pusher":{"login":"wgtmac","name":"Gang Wu","path":"/wgtmac","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/4684607?s=80&v=4"},"commit":{"message":"[SPARK-43084][SS] Add applyInPandasWithState support for spark connect\n\n### What changes were proposed in this pull request?\n\nThis change adds applyInPandasWithState support for Spark connect.\nExample (try with local mode `./bin/pyspark --remote \"local[*]\"`):\n\n```\n>>> from pyspark.sql.streaming.state import GroupStateTimeout, GroupState\n>>> from pyspark.sql.types import (\n... LongType,\n... StringType,\n... StructType,\n... StructField,\n... Row,\n... )\n>>> import pandas as pd\n>>> output_type = StructType(\n... [StructField(\"key\", StringType()), StructField(\"countAsString\", StringType())]\n... )\n>>> state_type = StructType([StructField(\"c\", LongType())])\n>>> def func(key, pdf_iter, state):\n... total_len = 0\n... for pdf in pdf_iter:\n... total_len += len(pdf)\n... state.update((total_len,))\n... yield pd.DataFrame({\"key\": [key[0]], \"countAsString\": [str(total_len)]})\n...\n>>>\n>>> input_path = \"/Users/peng.zhong/tmp/applyInPandasWithState\"\n>>> df = spark.readStream.format(\"text\").load(input_path)\n>>> q = (\n... df.groupBy(df[\"value\"])\n... .applyInPandasWithState(\n... func, output_type, state_type, \"Update\", GroupStateTimeout.NoTimeout\n... )\n... .writeStream.queryName(\"this_query\")\n... .format(\"memory\")\n... .outputMode(\"update\")\n... .start()\n... )\n>>>\n>>> q.status\n{'message': 'Processing new data', 'isDataAvailable': True, 'isTriggerActive': True}\n>>>\n>>> spark.sql(\"select * from this_query\").show()\n+-----+-------------+\n| key|countAsString|\n+-----+-------------+\n|hello| 1|\n| this| 1|\n+-----+-------------+\n```\n\n### Why are the changes needed?\n\nThis change adds an API support for spark connect.\n\n### Does this PR introduce _any_ user-facing change?\n\nThis change adds an API support for spark connect.\n\n### How was this patch tested?\n\nManually tested.\n\nCloses #40736 from pengzhon-db/connect_applyInPandasWithState.\n\nAuthored-by: pengzhon-db \nSigned-off-by: Hyukjin Kwon ","shortMessageHtmlLink":"[SPARK-43084][SS] Add applyInPandasWithState support for spark connect"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEGRu-xAA","startCursor":null,"endCursor":null}},"title":"Activity · wgtmac/spark"}