<a href="https://colab.research.google.com/github/rahulrajpr/prepare-anytime/blob/main/spark/functions/9_spark_sql_collection_functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Spark Collection Functions**
-------------------------------------
https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#collection-functions

#### Collections
-------------------------------------

In **Apache Spark**, **collections** are **complex column types** that allow you to store **multiple values or structured data** inside a **single DataFrame column**.  
Collections provide a **flexible way** to manage, query, and process **nested or semi-structured data** efficiently.

-------------------------------------

#### Common Collection Types

-------------------------------------

| Collection Type | Description | Example Use Case |
|-----------------|------------|-----------------|
| **ArrayType**   | Ordered list of elements of the **same type** | Tags for a product, list of skills, numerical arrays |
| **MapType**     | **Key-value pairs** where each key maps to a value | User properties, configuration settings, JSON-like data |

-------------------------------------

#### Relationship Between Collections

-------------------------------------

- **Maps** can store **arrays or structs** as values.  
  - Example: `map<string, struct<city:string, zip:int>>`
- **Arrays** can contain **structs** or **maps** as elements.  
  - Example: `array<struct<key:string, value:int>>`

These combinations enable Spark to efficiently handle **nested, hierarchical, or semi-structured datasets** like JSON, logs, and complex business data.

-------------------------------------


In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark-functions').getOrCreate()

In [None]:
sql = '''
create or replace temp view input_view as
(
select
  named_struct('name','rahul','age',28,'profession','dataengineering') as struct1,
  sequence(1,15,1) as array1,
  array('apple','orange','kiwi') as array2,
  map('name','rahul','age',28,'profession','dataengineering') as map1,
  map('product','laptop','price',999,'category','electronics','stock',15,'rating','4.5') as map2
)
'''
spark.sql(sql)

##----

sql = '''select * from input_view'''

spark.sql(sql).show(truncate = False)

+----------------------------+---------------------------------------------------+---------------------+---------------------------------------------------------+--------------------------------------------------------------------------------------+
|struct1                     |array1                                             |array2               |map1                                                     |map2                                                                                  |
+----------------------------+---------------------------------------------------+---------------------+---------------------------------------------------------+--------------------------------------------------------------------------------------+
|{rahul, 28, dataengineering}|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[apple, orange, kiwi]|{name -> rahul, age -> 28, profession -> dataengineering}|{product -> laptop, price -> 999, category -> electronics, stock -> 15, rating -> 4.5}|


In [None]:
# size

sql = '''
select
  size(array1) as arraySize,
  size(map1) as mapSize
from input_view
'''
spark.sql(sql).show(truncate = False)

+---------+-------+
|arraySize|mapSize|
+---------+-------+
|15       |3      |
+---------+-------+



In [None]:
# cardinality : for array and map

sql = '''
SELECT
  cardinality(array1) as arrayCardinality,
  cardinality(map1) as mapCardinality
FROM input_view
'''
spark.sql(sql).show(truncate=False)

+----------------+--------------+
|arrayCardinality|mapCardinality|
+----------------+--------------+
|15              |3             |
+----------------+--------------+



## size vs cardinality functions

✅ **Functionally Identical** - Both return the same results for arrays and maps  
✅ **Interchangeable** - Can be used interchangeably without any difference in output  
✅ **Personal Preference** - Choose based on readability context:  
   - Use `size()` for general programming contexts  
   - Use `cardinality()` for SQL-standard compliance

In [None]:
# reverse

sql = '''
  select array1,
  reverse(array1) as reversedArray
from input_view
'''
spark.sql(sql).show(truncate = False)

+---------------------------------------------------+---------------------------------------------------+
|array1                                             |reversedArray                                      |
+---------------------------------------------------+---------------------------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]|
+---------------------------------------------------+---------------------------------------------------+



In [None]:
# sort_array

# - ascending order by default

sql = '''
select
  array1,
  sort_array(array1) as array_sortOut
from input_view
'''
spark.sql(sql).show(truncate = False)

# - descending order

sql = '''
select
  array1,
  sort_array(array1,False) as array_sortOut
from input_view
'''
spark.sql(sql).show(truncate = False)

+---------------------------------------------------+---------------------------------------------------+
|array1                                             |array_sortOut                                      |
+---------------------------------------------------+---------------------------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|
+---------------------------------------------------+---------------------------------------------------+

+---------------------------------------------------+---------------------------------------------------+
|array1                                             |array_sortOut                                      |
+---------------------------------------------------+---------------------------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]|
+---------------------------------------------------+---------------------------------------------------+

In [None]:
# array_sort : `left` and `right` are just lambda parameter names, not keywords
# Note : array_sort and sort_array are different functions; see the comparison table below
# - ascending order by default

sql = '''
select array1,array_sort(array1) as array_sortOut
from input_view
'''
spark.sql(sql).show(truncate = False)

##-- descending order : the comparator returns a negative, zero, or positive integer

sql = '''
select array1,array_sort(array1,(left,right)-> right  - left) as array_sortOut
from input_view
'''
spark.sql(sql).show(truncate = False)

##-- custom comparator 1 -- ascending by element length

sql = '''
select array2,array_sort(array2,(left,right)-> length(left)  - length(right)) as array_sortOut
from input_view
'''
spark.sql(sql).show(truncate = False)

##-- custom comparator 2 -- descending by element length

sql = '''
select array2,array_sort(array2,(left,right)-> length(right)  - length(left)) as array_sortOut
from input_view
'''
spark.sql(sql).show(truncate = False)



+---------------------------------------------------+---------------------------------------------------+
|array1                                             |array_sortOut                                      |
+---------------------------------------------------+---------------------------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|
+---------------------------------------------------+---------------------------------------------------+

+---------------------------------------------------+---------------------------------------------------+
|array1                                             |array_sortOut                                      |
+---------------------------------------------------+---------------------------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]|
+---------------------------------------------------+---------------------------------------------------+

-----------------------------------
#### `reverse` vs `sort_array` vs `array_sort`
-----------------------------------

| Aspect | reverse | sort_array | array_sort |
|--------|-----------|--------------|--------------|
| **Purpose** | Reverses element order | Sorts with explicit order control | Sorts with custom comparator |
| **Parameters** | `(array)` | `(array, [asc/desc])` | `(array, [comparator])` |
| **Order Control** | None - always reverses | Boolean flag (true=asc, false=desc) | Lambda function for custom logic |
| **Custom Sorting** | No | No | Yes |
| **Return Type** | Array | Array | Array |
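| **Default Behavior** | Reverses current order | Ascending sort | Ascending sort |
| **Null Placement** | Unchanged (only the order is reversed) | First when ascending, last when descending | Last (with the default comparator) |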
-----------------------------------
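The comparator contract used by `array_sort` above (return a negative, zero, or positive integer) can be sketched in plain Python. This is only a model of the semantics, not Spark code: the `array_sort` function below is a hypothetical stand-in built on `functools.cmp_to_key`, and it ignores null handling.

```python
from functools import cmp_to_key


def array_sort(values, comparator=None):
    """Plain-Python stand-in for Spark's array_sort (hypothetical helper)."""
    if comparator is None:
        # Default behavior: ascending order.
        return sorted(values)
    # The comparator returns <0, 0, or >0, like the SQL lambda (left, right) -> ...
    return sorted(values, key=cmp_to_key(comparator))


numbers = [3, 1, 2]
fruits = ["apple", "orange", "kiwi"]

ascending = array_sort(numbers)
descending = array_sort(numbers, lambda left, right: right - left)
by_length = array_sort(fruits, lambda left, right: len(left) - len(right))

print(ascending)   # [1, 2, 3]
print(descending)  # [3, 2, 1]
print(by_length)   # ['kiwi', 'apple', 'orange']
```

Swapping the operands in the comparator (`right - left` instead of `left - right`) is all it takes to flip the order, which is why `array_sort` needs no separate ascending/descending flag.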

In [None]:
## aggregate : aggregate(expr, start, merge, finish)

## sum

sql = '''
select aggregate(array1,0, (res,x) -> res+x) as sumTotal
from input_view
'''
spark.sql(sql).show(truncate = False)

+--------+
|sumTotal|
+--------+
|120     |
+--------+



In [None]:
## aggregate : aggregate(expr, start, merge, finish)

## max (a start value of 0 works here only because every element is positive)

sql = '''
select aggregate(array1,0, (res,x) -> case when res < x then x else res end) as maxVal
from input_view
'''
spark.sql(sql).show(truncate = False)

##- min : start from the first element; a start value of 0 would itself be returned as the minimum

sql = '''
select aggregate(array1,element_at(array1,1), (res,x) -> case when res > x then x else res end) as minVal
from input_view
'''
spark.sql(sql).show(truncate = False)


+------+
|maxVal|
+------+
|15    |
+------+

+------+
|minVal|
+------+
|1     |
+------+



In [None]:
## aggregate : aggregate(expr, start, merge, finish)

## average

sql = '''
SELECT array1,
       aggregate(array1,
                struct(0 as sm, 0 as cnt),
                (res, x) -> struct(res.sm + x as sm, res.cnt + 1 as cnt),
                res -> res.sm / NULLIF(res.cnt, 0)
                ) as AvCal
FROM input_view
'''
spark.sql(sql).show(truncate=False)

+---------------------------------------------------+-----+
|array1                                             |AvCal|
+---------------------------------------------------+-----+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|8.0  |
+---------------------------------------------------+-----+



In [None]:
## aggregate : aggregate(expr, start, merge, finish)

## count the elements greater than a threshold of 10

sql = '''
SELECT array1,
       aggregate(array1,
                0,
                (res, x) -> res + (case when x > 10 then 1 else 0 end),
                res -> res
                ) as countAbove10
FROM input_view
'''
spark.sql(sql).show(truncate=False)

+---------------------------------------------------+------------+
|array1                                             |countAbove10|
+---------------------------------------------------+------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|5           |
+---------------------------------------------------+------------+



In [None]:
## aggregate : aggregate(expr, start, merge, finish)

## string concatenation

sql = '''
SELECT array2,
       aggregate(array2,
                '',
                (res, x) -> case when res = '' then x else res ||'-'|| x end,
                res -> res
                ) as concatString
FROM input_view
'''
spark.sql(sql).show(truncate=False)

+---------------------+-----------------+
|array2               |concatString     |
+---------------------+-----------------+
|[apple, orange, kiwi]|apple-orange-kiwi|
+---------------------+-----------------+



In [None]:
# reduce : alias of aggregate (added in Spark 3.4); works on arrays

# sum of elements

sql = '''
select
  array1,
  reduce(array1,
         0,
         (res,x) -> res + x,
         res->res) as sumOfElements
from input_view
'''
spark.sql(sql).show(truncate = False)


+---------------------------------------------------+-------------+
|array1                                             |sumOfElements|
+---------------------------------------------------+-------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|120          |
+---------------------------------------------------+-------------+



In [None]:
# reduce : similar to aggregate function , works with arrays

# min : start from the first element; a start value of 0 would itself be returned as the minimum

sql = '''
select
  array1,
  reduce(array1,
        element_at(array1,1),
        (res,x) -> case when x < res then x else res end) as minValue
from input_view
'''
spark.sql(sql).show(truncate = False)


+---------------------------------------------------+--------+
|array1                                             |minValue|
+---------------------------------------------------+--------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|1       |
+---------------------------------------------------+--------+



In [None]:
# reduce : similar to aggregate function , works with arrays

# average

sql = '''
select
  array1,
  reduce(array1,
         struct(0 as sm, 0 as cnt),
         (res,x) -> struct(res.sm + x as sm, res.cnt +1 as cnt),
         res -> res.sm/nullif(res.cnt,0)) as ArrayAvg
from input_view
'''
spark.sql(sql).show(truncate = False)


+---------------------------------------------------+--------+
|array1                                             |ArrayAvg|
+---------------------------------------------------+--------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|8.0     |
+---------------------------------------------------+--------+



In [None]:
# reduce : similar to aggregate function , works with arrays

# string concatenation

sql = '''
select
  array2,
  reduce(array2,
         '',
         (res,x) -> case when res <> '' then res ||'-'|| x else x end,
         res -> res) as concatString
from input_view
'''
spark.sql(sql).show(truncate = False)

+---------------------+-----------------+
|array2               |concatString     |
+---------------------+-----------------+
|[apple, orange, kiwi]|apple-orange-kiwi|
+---------------------+-----------------+



---------------------------------------
#### aggregate vs reduce Comparison
---------------------------------------

| Aspect | `aggregate` | `reduce` |
|--------|-------------|----------|
| **Purpose** | Applies binary operator to array with initial state, converts final state using finish function | Applies binary operator to array with initial state, converts final state using finish function |
| **Parameters** | `(expr, start, merge, finish)` | `(expr, start, merge, finish)` |
| **Functionality** | **Identical** to reduce | **Identical** to aggregate |
| **Return Type** | Same as reduce | Same as aggregate |
| **Usage** | Original name, available since Spark 2.4 | Alias of `aggregate`, added in Spark 3.4 |

---------------------------------------
##### Conclusion:
---------------------------------------
✅ **Functionally Identical** - Both perform exactly the same operation  
✅ **Interchangeable** - Can be used interchangeably  
✅ **Syntax Identical** - Same parameters and behavior

---------------------------------------
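The `(expr, start, merge, finish)` contract shared by `aggregate` and `reduce` is a left fold followed by an optional final conversion. The sketch below models it in plain Python with `functools.reduce`; the `aggregate` helper and the variable names are hypothetical stand-ins, not Spark APIs, and a tuple plays the role of the `struct(sm, cnt)` accumulator.

```python
from functools import reduce


def aggregate(values, start, merge, finish=lambda acc: acc):
    """Plain-Python stand-in for aggregate(expr, start, merge, finish)."""
    # Fold the values into the accumulator, then convert the final state.
    return finish(reduce(merge, values, start))


array1 = list(range(1, 16))  # same values as sequence(1, 15, 1)

total = aggregate(array1, 0, lambda res, x: res + x)

# (sum, count) accumulator, converted to an average by the finish function.
average = aggregate(
    array1,
    (0, 0),
    lambda res, x: (res[0] + x, res[1] + 1),
    lambda res: res[0] / res[1] if res[1] else None,
)

# The start value takes part in the fold like any other element.
min_from_zero = aggregate(array1, 0, lambda res, x: x if x < res else res)
min_from_first = aggregate(array1, array1[0], lambda res, x: x if x < res else res)

print(total, average, min_from_zero, min_from_first)  # 120 8.0 0 1
```

The last two results show why the start value matters: for an all-positive array, a start of 0 wins every comparison in a minimum, so seeding the fold with the first element is the safer choice.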

In [None]:
# concat : works on strings and arrays, not on maps or structs (use map_concat for maps)

sql = '''select concat('rahul','+','lathika','->','skylr') as concatOut '''
spark.sql(sql).show(truncate = False)

#---

sql = '''
select concat(array1, array2) as concatOut
from input_view
'''
spark.sql(sql).show(truncate = False)

+--------------------+
|concatOut           |
+--------------------+
|rahul+lathika->skylr|
+--------------------+

+------------------------------------------------------------------------+
|concatOut                                                               |
+------------------------------------------------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, apple, orange, kiwi]|
+------------------------------------------------------------------------+



In [None]:
# element_at : access an element by position (arrays, 1-based) or by key (maps)

##- for array

sql = '''
select array2, element_at(array2,2) as element_atOut
from input_view'''
spark.sql(sql).show(truncate = False)

##- for map

sql = '''
select map1, element_at(map1,'name') as element_atOut
from input_view'''
spark.sql(sql).show(truncate = False)


+---------------------+-------------+
|array2               |element_atOut|
+---------------------+-------------+
|[apple, orange, kiwi]|orange       |
+---------------------+-------------+

+---------------------------------------------------------+-------------+
|map1                                                     |element_atOut|
+---------------------------------------------------------+-------------+
|{name -> rahul, age -> 28, profession -> dataengineering}|rahul        |
+---------------------------------------------------------+-------------+



In [None]:
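# try_element_at : like element_at, but returns NULL instead of raising an error when the array position is out of range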

##- for array

sql = '''
select array2, try_element_at(array2,2) as element_atOut
from input_view'''
spark.sql(sql).show(truncate = False)

##- for map

sql = '''
select map1, try_element_at(map1,'name') as element_atOut
from input_view'''
spark.sql(sql).show(truncate = False)

+---------------------+-------------+
|array2               |element_atOut|
+---------------------+-------------+
|[apple, orange, kiwi]|orange       |
+---------------------+-------------+

+---------------------------------------------------------+-------------+
|map1                                                     |element_atOut|
+---------------------------------------------------------+-------------+
|{name -> rahul, age -> 28, profession -> dataengineering}|rahul        |
+---------------------------------------------------------+-------------+
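The difference between `element_at` and `try_element_at` only shows up on lookups that miss, which the examples above do not exercise. The plain-Python sketch below models the array behavior as I understand it from the Spark documentation: positions are 1-based, negative positions count from the end, position 0 is always an error, and an out-of-range position raises an error only when ANSI mode is enabled. The function names and the `ansi` flag are hypothetical stand-ins, not Spark APIs.

```python
def element_at(values, index, ansi=False):
    """Plain-Python model of element_at on an array (hypothetical helper)."""
    if index == 0:
        # SQL array positions start at 1, so 0 is never valid.
        raise ValueError("SQL array indices start at 1")
    if abs(index) > len(values):
        if ansi:
            raise IndexError(f"index {index} is out of bounds")
        return None  # models SQL NULL
    # Positive positions are 1-based; negative positions count from the end.
    return values[index - 1] if index > 0 else values[index]


def try_element_at(values, index):
    """Model of try_element_at: out-of-range positions always give None."""
    return element_at(values, index, ansi=False)


fruits = ["apple", "orange", "kiwi"]

second = element_at(fruits, 2)
last = element_at(fruits, -1)
missing = try_element_at(fruits, 5)

print(second, last, missing)  # orange kiwi None
```

For maps, a missing key is modeled well enough by `dict.get`, which returns `None` rather than raising.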



In [None]:
# exists :

# (arrays only) takes the array and a predicate function as parameters

sql = '''
select array1, exists(array1,x->x=15) as existsOut
from input_view'''
spark.sql(sql).show(truncate = False)

##-- check whether the array contains any even value

sql = '''
select array1, exists(array1,x->x%2=0) as existsOut
from input_view'''
spark.sql(sql).show(truncate = False)

##-- check that the array contains no nulls

sql = '''
select array1, NOT exists(array1,x->x is null) as existsOut
from input_view'''
spark.sql(sql).show(truncate = False)

+---------------------------------------------------+---------+
|array1                                             |existsOut|
+---------------------------------------------------+---------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|true     |
+---------------------------------------------------+---------+

+---------------------------------------------------+---------+
|array1                                             |existsOut|
+---------------------------------------------------+---------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|true     |
+---------------------------------------------------+---------+

+---------------------------------------------------+---------+
|array1                                             |existsOut|
+---------------------------------------------------+---------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|true     |
+---------------------------------------------------+---------+



In [None]:
# forall : checks whether a predicate holds for all the elements in an array

sql = '''
SELECT array1,
       forall(array1, x -> try_cast(x AS INT) IS NOT NULL) as all_numeric
FROM input_view
'''
spark.sql(sql).show(truncate=False)

+---------------------------------------------------+-----------+
|array1                                             |all_numeric|
+---------------------------------------------------+-----------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|true       |
+---------------------------------------------------+-----------+



#### exists vs forall
---------------------------------

| Aspect | `exists` | `forall` |
|--------|----------|----------|
| **Purpose** | Checks if **AT LEAST ONE** element satisfies condition | Checks if **ALL** elements satisfy condition |
| **Return Type** | Boolean | Boolean |
| **Returns `true` when** | Any element matches the condition | All elements match the condition |
---------------------------------
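The table above maps directly onto Python's `any` and `all`, which also makes the empty-array edge case easy to see. This is a plain-Python model only: the `exists` and `forall` helpers are hypothetical stand-ins, and the sketch deliberately ignores SQL's three-valued logic for null elements.

```python
def exists(values, predicate):
    """Model of exists: true if at least one element satisfies the predicate."""
    return any(predicate(x) for x in values)


def forall(values, predicate):
    """Model of forall: true if every element satisfies the predicate."""
    return all(predicate(x) for x in values)


array1 = list(range(1, 16))

has_even = exists(array1, lambda x: x % 2 == 0)
all_even = forall(array1, lambda x: x % 2 == 0)

# Edge case: an empty array has no matching element, and no failing element.
empty_exists = exists([], lambda x: x > 0)
empty_forall = forall([], lambda x: x > 0)

print(has_even, all_even, empty_exists, empty_forall)  # True False False True
```

Both `any` and `all` stop iterating as soon as the answer is known, which mirrors the short-circuit behavior of the two functions.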

In [None]:
# filter

# only works with arrays

sql = '''
select array1, filter(array1,x->x%2=0) as filterOut
from input_view'''
spark.sql(sql).show(truncate = False)

#--- string predicate on the text array

sql = '''
select array2, filter(array2,x-> lower(x) like '%a%') as filterOut
from input_view'''
spark.sql(sql).show(truncate = False)

+---------------------------------------------------+------------------------+
|array1                                             |filterOut               |
+---------------------------------------------------+------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[2, 4, 6, 8, 10, 12, 14]|
+---------------------------------------------------+------------------------+

+---------------------+---------------+
|array2               |filterOut      |
+---------------------+---------------+
|[apple, orange, kiwi]|[apple, orange]|
+---------------------+---------------+



In [None]:
# map_filter : filters map entries using a predicate on the key, the value, or both

# only works for maps

sql = '''
select
  map2,
  map_filter(map2,(k,v)-> try_cast(v as float) is not null) as resultMap
from input_view
'''
spark.sql(sql).show(truncate = False)

##--

sql = '''
select
  map2,
  map_filter(map2,(k,v)-> startswith(k,'p')) as resultMap
from input_view
'''
spark.sql(sql).show(truncate = False)


+--------------------------------------------------------------------------------------+------------------------------------------+
|map2                                                                                  |resultMap                                 |
+--------------------------------------------------------------------------------------+------------------------------------------+
|{product -> laptop, price -> 999, category -> electronics, stock -> 15, rating -> 4.5}|{price -> 999, stock -> 15, rating -> 4.5}|
+--------------------------------------------------------------------------------------+------------------------------------------+

+--------------------------------------------------------------------------------------+---------------------------------+
|map2                                                                                  |resultMap                        |
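+--------------------------------------------------------------------------------------+---------------------------------+
|{product -> laptop, price -> 999, category -> electronics, stock -> 15, rating -> 4.5}|{product -> laptop, price -> 999}|
+--------------------------------------------------------------------------------------+---------------------------------+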

-------------------------------------
### `filter` vs `map_filter` Comparison
-------------------------------------

| Aspect | filter | map_filter |
|--------|----------|--------------|
| **Purpose** | Filters **ARRAY** elements based on condition | Filters **MAP** entries based on condition |
| **Input Type** | Array | Map |
| **Lambda Parameters** | Single parameter (array element) | Two parameters (key, value) |
| **Return Type** | Array (filtered elements) | Map (filtered key-value pairs) |
-------------------------------------
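The one-parameter versus two-parameter lambda distinction in the table can be sketched with a list comprehension and a dict comprehension. This is a plain-Python model, not Spark code: `array_filter`, `map_filter`, and `is_number` are hypothetical helpers, with `is_number` standing in for the `try_cast(v as float) is not null` check used earlier.

```python
def array_filter(values, predicate):
    """Model of filter: keep the array elements that satisfy predicate(x)."""
    return [x for x in values if predicate(x)]


def map_filter(mapping, predicate):
    """Model of map_filter: keep the entries that satisfy predicate(k, v)."""
    return {k: v for k, v in mapping.items() if predicate(k, v)}


def is_number(text):
    """Rough stand-in for try_cast(v as float) is not null."""
    try:
        float(text)
        return True
    except ValueError:
        return False


array1 = list(range(1, 16))
# Values are strings, matching how map2 is displayed in the notebook.
map2 = {"product": "laptop", "price": "999", "category": "electronics",
        "stock": "15", "rating": "4.5"}

evens = array_filter(array1, lambda x: x % 2 == 0)
numeric_entries = map_filter(map2, lambda k, v: is_number(v))
p_entries = map_filter(map2, lambda k, v: k.startswith("p"))

print(evens)            # [2, 4, 6, 8, 10, 12, 14]
print(numeric_entries)  # {'price': '999', 'stock': '15', 'rating': '4.5'}
print(p_entries)        # {'product': 'laptop', 'price': '999'}
```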

In [None]:
# transform

## multiply

sql = '''
select array1,
transform(array1,x->x*2) as twoTimesX
from input_view'''
spark.sql(sql).show(truncate = False)

+---------------------------------------------------+--------------------------------------------------------+
|array1                                             |twoTimesX                                               |
+---------------------------------------------------+--------------------------------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]|
+---------------------------------------------------+--------------------------------------------------------+



In [None]:
# transform

## cleaning text

sql = '''
select array2,
transform(array2,x->trim(upper(x))) as cleanedText
from input_view'''
spark.sql(sql).show(truncate = False)

+---------------------+---------------------+
|array2               |cleanedText          |
+---------------------+---------------------+
|[apple, orange, kiwi]|[APPLE, ORANGE, KIWI]|
+---------------------+---------------------+



In [None]:
# transform

## with a condition

sql = '''
select array1,
transform(array1,x-> case when x > 10 then x*2 else x end) as conditionOut
from input_view'''
spark.sql(sql).show(truncate = False)

+---------------------------------------------------+---------------------------------------------------+
|array1                                             |conditionOut                                       |
+---------------------------------------------------+---------------------------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 22, 24, 26, 28, 30]|
+---------------------------------------------------+---------------------------------------------------+



In [None]:
# transform

## with the index of an array

sql = '''
select
  array1,
  transform(array1,(x,i)->i) as indexVal,
  transform(array1,(x,i)-> x+i) as IndexedOut
from input_view'''
spark.sql(sql).show(truncate = False)

+---------------------------------------------------+--------------------------------------------------+-------------------------------------------------------+
|array1                                             |indexVal                                          |IndexedOut                                             |
+---------------------------------------------------+--------------------------------------------------+-------------------------------------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]|[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]|[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]|
+---------------------------------------------------+--------------------------------------------------+-------------------------------------------------------+



In [None]:
# transform_keys

sql = '''
select
  map1,
  transform_keys(map1, (k,v)-> upper(k))
from input_view'''
spark.sql(sql).show(truncate = False)

#--

sql = '''
select
  map1,
  transform_keys(map1, (k,v)-> 'Key_'||upper(k))
from input_view'''
spark.sql(sql).show(truncate = False)

+---------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
|map1                                                     |transform_keys(map1, lambdafunction(upper(namedlambdavariable()), namedlambdavariable(), namedlambdavariable()))|
+---------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
|{name -> rahul, age -> 28, profession -> dataengineering}|{NAME -> rahul, AGE -> 28, PROFESSION -> dataengineering}                                                       |
+---------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+


In [None]:
# transform_values

sql = '''
select
  map1,
  transform_values(map1, (k,v)-> upper(v))
from input_view'''
spark.sql(sql).show(truncate = False)

#--

sql = '''
select
  map1,
  transform_values(map1, (k,v)-> 'Value_'||upper(v))
from input_view'''
spark.sql(sql).show(truncate = False)

#--

sql = '''
select
  map1,
  transform_values(map1, (k,v)-> k||'-'||v)
from input_view'''
spark.sql(sql).show(truncate = False)

+---------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
|map1                                                     |transform_values(map1, lambdafunction(upper(namedlambdavariable()), namedlambdavariable(), namedlambdavariable()))|
+---------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
|{name -> rahul, age -> 28, profession -> dataengineering}|{name -> RAHUL, age -> 28, profession -> DATAENGINEERING}                                                         |
+---------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+


--------------------------
#### `transform` vs `transform_keys` vs `transform_values`
--------------------------

| Aspect | `transform` | `transform_keys` | `transform_values` |
|--------|-------------|------------------|-------------------|
| **Purpose** | Transforms **ARRAY** elements | Transforms **MAP** keys | Transforms **MAP** values |
| **Input Type** | Array | Map | Map |
| **Lambda Parameters** | `(element)` or `(element, index)` | `(key, value)` | `(key, value)` |
| **What Changes** | Array elements | Map keys | Map values |
| **Return Type** | Transformed array | Map with transformed keys | Map with transformed values |
| **Original Structure** | Array size unchanged | Map size unchanged | Map size unchanged |
--------------------------
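What each of the three functions changes, and what it leaves alone, can be modeled in a few lines of plain Python. The helpers below are hypothetical stand-ins rather than Spark APIs; `transform_indexed` models the two-parameter `(element, index)` lambda form with a 0-based index, as in the output shown earlier.

```python
def transform(values, func):
    """Model of transform with a one-parameter lambda."""
    return [func(x) for x in values]


def transform_indexed(values, func):
    """Model of transform with an (element, index) lambda; index is 0-based."""
    return [func(x, i) for i, x in enumerate(values)]


def transform_keys(mapping, func):
    """Model of transform_keys: new keys from func(k, v), values untouched."""
    return {func(k, v): v for k, v in mapping.items()}


def transform_values(mapping, func):
    """Model of transform_values: new values from func(k, v), keys untouched."""
    return {k: func(k, v) for k, v in mapping.items()}


array1 = list(range(1, 16))
map1 = {"name": "rahul", "age": "28", "profession": "dataengineering"}

doubled = transform(array1, lambda x: x * 2)
indexed = transform_indexed(array1, lambda x, i: x + i)
upper_keys = transform_keys(map1, lambda k, v: k.upper())
tagged_values = transform_values(map1, lambda k, v: k + "-" + v)

print(indexed)     # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]
print(upper_keys)  # {'NAME': 'rahul', 'AGE': '28', 'PROFESSION': 'dataengineering'}
```

One caveat the dict model hides: if a key-transforming function maps two keys to the same result, a Python dict silently keeps the last one, so it is worth checking how Spark treats duplicate keys before relying on that case.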

In [None]:
# arrays_zip

sql = '''
with cte as
(
select
  sequence(1,5,1) as array1,
  transform(sequence(1,5,1), x-> x*x) as arrayTrans
from input_view
)
select
  array1,
  arrayTrans,
  arrays_zip(array1,arrayTrans) as zippedArray,
  TypeOf(arrays_zip(array1,arrayTrans)) as TypeOut
from cte
'''
spark.sql(sql).show(truncate = False)

+---------------+-----------------+------------------------------------------+----------------------------------------+
|array1         |arrayTrans       |zippedArray                               |TypeOut                                 |
+---------------+-----------------+------------------------------------------+----------------------------------------+
|[1, 2, 3, 4, 5]|[1, 4, 9, 16, 25]|[{1, 1}, {2, 4}, {3, 9}, {4, 16}, {5, 25}]|array<struct<array1:int,arrayTrans:int>>|
+---------------+-----------------+------------------------------------------+----------------------------------------+



In [None]:
# zip_with : merges two arrays element by element using a lambda function

sql = '''
with cte as
(
select
  sequence(1,5,1) as array1,
  transform(sequence(1,5,1), x-> x*x) as arrayTrans
from input_view
)
select
  array1,
  arrayTrans,
  zip_with(array1,arrayTrans,(a1,a2)-> a1 = sqrt(a2)) as zippedWithArray
from cte
'''
spark.sql(sql).show(truncate = False)

#--

sql = '''
with cte as
(
select
  sequence(1,5,1) as array1,
  transform(sequence(1,5,1), x-> x*x) as arrayTrans
from input_view
)
select
  array1,
  arrayTrans,
  zip_with(array1,arrayTrans,(a1,a2)-> case when a1%2 = 0 then a1 else a2 end) as zippedWithArray
from cte
'''
spark.sql(sql).show(truncate = False)

+---------------+-----------------+------------------------------+
|array1         |arrayTrans       |zippedWithArray               |
+---------------+-----------------+------------------------------+
|[1, 2, 3, 4, 5]|[1, 4, 9, 16, 25]|[true, true, true, true, true]|
+---------------+-----------------+------------------------------+

+---------------+-----------------+----------------+
|array1         |arrayTrans       |zippedWithArray |
+---------------+-----------------+----------------+
|[1, 2, 3, 4, 5]|[1, 4, 9, 16, 25]|[1, 2, 9, 4, 25]|
+---------------+-----------------+----------------+



--------------------------------------
#### `arrays_zip` vs `zip_with`
--------------------------------------

| Aspect | `arrays_zip` | `zip_with` |
|--------|--------------|------------|
| **Purpose** | Zips multiple arrays into array of structs | Zips two arrays with custom transformation |
| **Number of Arrays** | Multiple arrays (2+) | Exactly two arrays |
| **Return Type** | Array of structs | Array (custom type based on function) |
| **Output Structure** | Struct fields named after the input columns, e.g. `[{array1: 1, arrayTrans: 1}, ...]`; unnamed expressions get ordinal names `0`, `1`, ... | Custom based on lambda function |
| **Lambda Function** | No lambda needed | Requires lambda function |
| **Customization** | Fixed struct output | Fully customizable output |
| **Null Handling** | Pads the shorter arrays with nulls | Pads the shorter array with nulls before applying the lambda |
--------------------------------------
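The padding behavior in the last row of the table is not visible in the examples above, since both arrays there have the same length. A plain-Python model using `itertools.zip_longest` makes it concrete. The `arrays_zip` and `zip_with` helpers are hypothetical stand-ins, with dicts playing the role of structs and `None` playing the role of SQL null.

```python
from itertools import zip_longest


def arrays_zip(**named_arrays):
    """Model of arrays_zip: one struct (dict) per position, padded with None."""
    names = list(named_arrays)
    return [dict(zip(names, row)) for row in zip_longest(*named_arrays.values())]


def zip_with(left, right, func):
    """Model of zip_with: pad the shorter array with None, then apply func."""
    return [func(a, b) for a, b in zip_longest(left, right)]


array1 = [1, 2, 3, 4, 5]
array_trans = [x * x for x in array1]

zipped = arrays_zip(array1=array1, arrayTrans=array_trans)
picked = zip_with(array1, array_trans, lambda a, b: a if a % 2 == 0 else b)

# Unequal lengths: the shorter input is padded with None.
padded = arrays_zip(a=[1, 2, 3], b=["x"])

print(zipped[:2])  # [{'array1': 1, 'arrayTrans': 1}, {'array1': 2, 'arrayTrans': 4}]
print(picked)      # [1, 2, 9, 4, 25]
print(padded)      # [{'a': 1, 'b': 'x'}, {'a': 2, 'b': None}, {'a': 3, 'b': None}]
```

Because the lambda in `zip_with` also receives the padded values, it has to tolerate nulls whenever the two inputs can differ in length.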

# Spark Collection Functions Reference

| Function | Description | Parameters | Returns | Related Functions | Collections Supported | Remarks |
|----------|-------------|------------|---------|-------------------|---------------------|---------|
| **size** | Returns number of elements | `(collection)` | Integer | `cardinality` | Arrays, Maps | More commonly used alias |
| **cardinality** | Returns number of elements | `(collection)` | Integer | `size` | Arrays, Maps | SQL standard name |
| **reverse** | Reverses element order | `(array)` | Array | `sort_array`, `array_sort` | Arrays | No sorting, just reversal |
| **sort_array** | Sorts array with order control | `(array, [asc/desc])` | Array | `array_sort`, `reverse` | Arrays | Simple ascending/descending |
| **array_sort** | Sorts with custom comparator | `(array, [comparator])` | Array | `sort_array`, `reverse` | Arrays | Custom sorting logic |
| **aggregate** | Reduces array with custom logic | `(array, start, merge, finish)` | Any | `reduce` | Arrays | Original name (Spark 2.4+) |
| **reduce** | Reduces array with custom logic | `(array, start, merge, finish)` | Any | `aggregate` | Arrays | Alias of `aggregate` (Spark 3.4+) |
| **concat** | Concatenates strings or arrays | `(expr1, expr2, ...)` | String or Array | `arrays_zip`, `zip_with` | Arrays (also strings) | Any number of inputs; not for maps or structs |
| **element_at** | Gets element at position or key | `(collection, index/key)` | Element | `try_element_at` | Arrays, Maps | Array positions are 1-based; an out-of-range position raises an error only in ANSI mode, otherwise returns null |
| **try_element_at** | Safe element access | `(collection, index/key)` | Element | `element_at` | Arrays, Maps | Returns null for an out-of-range position or a missing key |
| **exists** | Checks if any element matches | `(array, condition)` | Boolean | `forall`, `filter` | Arrays | Short-circuits on first match |
| **forall** | Checks if all elements match | `(array, condition)` | Boolean | `exists`, `filter` | Arrays | Short-circuits on first failure |
| **filter** | Filters array elements | `(array, condition)` | Array | `exists`, `forall` | Arrays | Returns new filtered array |
| **map_filter** | Filters map entries | `(map, condition)` | Map | `filter` | Maps | Condition on (key, value) pairs |
| **transform** | Transforms array elements | `(array, function)` | Array | `transform_keys`, `transform_values` | Arrays | Element-wise transformation |
| **transform_keys** | Transforms map keys | `(map, function)` | Map | `transform`, `transform_values` | Maps | Keys change, values same |
| **transform_values** | Transforms map values | `(map, function)` | Map | `transform`, `transform_keys` | Maps | Values change, keys same |
| **arrays_zip** | Zips arrays into structs | `(array1, array2, ...)` | Array[Struct] | `zip_with` | Arrays | Multiple arrays, fixed output |
| **zip_with** | Zips with transformation | `(array1, array2, function)` | Array | `arrays_zip` | Arrays | Custom output, exactly 2 arrays |

## Key Notes:
- **Structs are NOT collections** - none of these functions work on structs
- **Arrays and Maps** are the only true collections in Spark
- **Functionally identical pairs**: `size`/`cardinality`, `aggregate`/`reduce`
- **Short-circuiting**: `exists` stops at first match, `forall` stops at first failure