
## PySpark SQL Functions


### PySpark – Aggregate Functions


#### Aggregate Function:

* **approx_count_distinct()**: returns the approximate number of distinct items in a group. It uses HyperLogLog (HLL) to estimate the count, so the result may differ slightly from the exact count; the allowed error is controlled by the optional rsd argument (maximum relative standard deviation, 0.05 by default).
* **avg()**: returns the average of the values in a column.
* **collect_list()**: returns all values from a column as a list, duplicates included.
* **collect_set()**: returns all values from a column with duplicate values eliminated.
* **countDistinct()**: returns the exact number of distinct elements in a column (or combination of columns). It is more accurate than approx_count_distinct() but can be slower, especially on large datasets or columns with high cardinality. PySpark 3.2 and later also expose it as count_distinct().
* **count()**: returns the number of non-null elements in a column.
* **first()**: returns the first element in a column. When ignoreNulls is set to true, it returns the first non-null element.
* **kurtosis()**: returns the kurtosis of the values in a column.
* **last()**: returns the last element in a column. When ignoreNulls is set to true, it returns the last non-null element.
* **max()**: returns the maximum value in a column.
* **min()**: returns the minimum value in a column.
* **mean()**: returns the average of the values in a column. Alias for avg().
* **skewness()**: returns the skewness of the values in a column.
* **stddev()**: alias for stddev_samp().
* **stddev_samp()**: returns the sample standard deviation of the values in a column.
* **stddev_pop()**: returns the population standard deviation of the values in a column.
* **sum()**: returns the sum of all values in a column.
* **sumDistinct()**: returns the sum of all distinct values in a column. PySpark 3.2 and later also expose it as sum_distinct().
* **variance()**: alias for var_samp().
* **var_samp()**: returns the sample (unbiased) variance of the values in a column.
* **var_pop()**: returns the population variance of the values in a column.

In [0]:

from pyspark.sql import SparkSession
# Note: sum, max and min imported below shadow Python's builtins of the same name.
from pyspark.sql.functions import approx_count_distinct, collect_list
from pyspark.sql.functions import collect_set, sum, avg, max, countDistinct, count
from pyspark.sql.functions import first, last, kurtosis, min, mean, skewness
from pyspark.sql.functions import stddev, stddev_samp, stddev_pop, sumDistinct
from pyspark.sql.functions import variance, var_samp, var_pop

# Reuse the notebook's active session if there is one; otherwise create it.
spark = SparkSession.builder.getOrCreate()

simpleData = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
schema = ["employee_name", "department", "salary"]
  
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)

print("approx_count_distinct: " + \
      str(df.select(approx_count_distinct("salary")).collect()[0][0]))

print("avg: " + str(df.select(avg("salary")).collect()[0][0]))

df.select(collect_list("salary")).show(truncate=False)

df.select(collect_set("salary")).show(truncate=False)

df2 = df.select(countDistinct("department", "salary"))
df2.show(truncate=False)
print("Distinct Count of Department & Salary: "+str(df2.collect()[0][0]))

print("count: "+str(df.select(count("salary")).collect()[0][0]))
df.select(first("salary")).show(truncate=False)
df.select(last("salary")).show(truncate=False)
df.select(kurtosis("salary")).show(truncate=False)
df.select(max("salary")).show(truncate=False)
df.select(min("salary")).show(truncate=False)
df.select(mean("salary")).show(truncate=False)
df.select(skewness("salary")).show(truncate=False)
df.select(stddev("salary"), stddev_samp("salary"), \
    stddev_pop("salary")).show(truncate=False)
df.select(sum("salary")).show(truncate=False)
df.select(sumDistinct("salary")).show(truncate=False)
df.select(variance("salary"),var_samp("salary"),var_pop("salary")) \
  .show(truncate=False)


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4100  |
|Maria        |Finance   |3000  |
|James        |Sales     |3000  |
|Scott        |Finance   |3300  |
|Jen          |Finance   |3900  |
|Jeff         |Marketing |3000  |
|Kumar        |Marketing |2000  |
|Saif         |Sales     |4100  |
+-------------+----------+------+

approx_count_distinct: 6
avg: 3400.0
+------------------------------------------------------------+
|collect_list(salary)                                        |
+------------------------------------------------------------+
|[3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]|
+------------------------------------------------------------+

+------------------------------------+
|c



+--------------------+
|sum(DISTINCT salary)|
+--------------------+
|20900               |
+--------------------+
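
As a cross-check on the distinct-value figures above, the same numbers can be reproduced in plain Python without a Spark session. This is only a sketch of the semantics, not of how Spark computes them, and the variable names are illustrative.

```python
# Salaries from the sample DataFrame above (plain Python, no Spark needed).
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

distinct_salaries = set(salaries)              # the values collect_set() keeps
exact_distinct_count = len(distinct_salaries)  # exact counterpart of approx_count_distinct()
sum_of_distinct = sum(distinct_salaries)       # counterpart of sumDistinct()
average = sum(salaries) / len(salaries)        # counterpart of avg() / mean()

print(exact_distinct_count, sum_of_distinct, average)  # 6 20900 3400.0
```

Run it in a fresh Python session: inside the notebook, the name sum refers to the PySpark column function imported earlier, not the builtin.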

+-----------------+-----------------+---------------+
|var_samp(salary) |var_samp(salary) |var_pop(salary)|
+-----------------+-----------------+---------------+
|586666.6666666666|586666.6666666666|528000.0       |
+-----------------+-----------------+---------------+
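
The difference between the sample and population variants comes down to the divisor: n - 1 for var_samp()/stddev_samp() and n for var_pop()/stddev_pop(). A minimal sketch with Python's statistics module, assuming the same ten salaries as above, reproduces the variance figures in the output:

```python
import math
import statistics

salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

var_samp = statistics.variance(salaries)   # squared deviations divided by n - 1
var_pop = statistics.pvariance(salaries)   # squared deviations divided by n
stddev_samp = statistics.stdev(salaries)   # square root of var_samp
stddev_pop = statistics.pstdev(salaries)   # square root of var_pop

print(var_samp, var_pop)
print(math.isclose(stddev_samp ** 2, var_samp))
```

With only ten rows the two variances differ by about 10%; on large columns the gap becomes negligible.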




### PySpark Window Functions



In [0]:

simpleData = (("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  )
 
columns= ["employee_name", "department", "salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
df.show(truncate=False)


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4100  |
|Maria        |Finance   |3000  |
|James        |Sales     |3000  |
|Scott        |Finance   |3300  |
|Jen          |Finance   |3900  |
|Jeff         |Marketing |3000  |
|Kumar        |Marketing |2000  |
|Saif         |Sales     |4100  |
+-------------+----------+------+




#### row_number Window Function

row_number() assigns a sequential row number, starting from 1, to each row within a window partition.




In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy('department').orderBy('salary')

df.withColumn('row_number', row_number().over(windowSpec)).show(truncate=False)

+-------------+----------+------+----------+
|employee_name|department|salary|row_number|
+-------------+----------+------+----------+
|Maria        |Finance   |3000  |1         |
|Scott        |Finance   |3300  |2         |
|Jen          |Finance   |3900  |3         |
|Kumar        |Marketing |2000  |1         |
|Jeff         |Marketing |3000  |2         |
|James        |Sales     |3000  |1         |
|James        |Sales     |3000  |2         |
|Robert       |Sales     |4100  |3         |
|Saif         |Sales     |4100  |4         |
|Michael      |Sales     |4600  |5         |
+-------------+----------+------+----------+



#### rank Window Function
rank() window function is used to provide a rank to the result within a window partition. This function leaves gaps in rank when there are ties.




In [0]:
from pyspark.sql.functions import rank

windowSpec=Window.partitionBy('department').orderBy('salary')

df.withColumn("rank_num", rank().over(windowSpec)).show(truncate=False)

+-------------+----------+------+--------+
|employee_name|department|salary|rank_num|
+-------------+----------+------+--------+
|Maria        |Finance   |3000  |1       |
|Scott        |Finance   |3300  |2       |
|Jen          |Finance   |3900  |3       |
|Kumar        |Marketing |2000  |1       |
|Jeff         |Marketing |3000  |2       |
|James        |Sales     |3000  |1       |
|James        |Sales     |3000  |1       |
|Robert       |Sales     |4100  |3       |
|Saif         |Sales     |4100  |3       |
|Michael      |Sales     |4600  |5       |
+-------------+----------+------+--------+




#### dense_rank Window Function
dense_rank() ranks the rows within a window partition without leaving gaps. It is similar to rank(), the difference being that rank() leaves gaps in the ranking when there are ties.

In [0]:
from pyspark.sql.functions import dense_rank
df.withColumn('dense_rank', dense_rank().over(windowSpec)).show(truncate=False)

+-------------+----------+------+----------+
|employee_name|department|salary|dense_rank|
+-------------+----------+------+----------+
|Maria        |Finance   |3000  |1         |
|Scott        |Finance   |3300  |2         |
|Jen          |Finance   |3900  |3         |
|Kumar        |Marketing |2000  |1         |
|Jeff         |Marketing |3000  |2         |
|James        |Sales     |3000  |1         |
|James        |Sales     |3000  |1         |
|Robert       |Sales     |4100  |2         |
|Saif         |Sales     |4100  |2         |
|Michael      |Sales     |4600  |3         |
+-------------+----------+------+----------+
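
The three ranking outputs above differ only in how they treat ties. The following plain-Python sketch (the helper name ranking_columns is hypothetical, and it assumes the partition is already sorted) computes all three for the Sales partition:

```python
def ranking_columns(sorted_values):
    """Return (row_number, rank, dense_rank) lists for one ordered partition."""
    row_numbers, ranks, dense_ranks = [], [], []
    for position, value in enumerate(sorted_values, start=1):
        row_numbers.append(position)  # always increases by 1, ties or not
        if position > 1 and value == sorted_values[position - 2]:
            # Tie with the previous row: both rank flavours repeat its value.
            ranks.append(ranks[-1])
            dense_ranks.append(dense_ranks[-1])
        else:
            ranks.append(position)  # jumps to the row position, leaving gaps
            dense_ranks.append(dense_ranks[-1] + 1 if dense_ranks else 1)
    return row_numbers, ranks, dense_ranks


sales = [3000, 3000, 4100, 4100, 4600]  # Sales salaries, ordered as in the output
row_numbers, ranks, dense_ranks = ranking_columns(sales)
print(row_numbers)  # [1, 2, 3, 4, 5]
print(ranks)        # [1, 1, 3, 3, 5]
print(dense_ranks)  # [1, 1, 2, 2, 3]
```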



#### percent_rank Window Function
percent_rank() returns the relative rank of each row within its ordered partition as a value between 0 and 1, computed as (rank - 1) / (rows in partition - 1). The first row (lowest value) gets 0, and rows with the same values in the ORDER BY columns receive the same percent_rank value.


In [0]:
from pyspark.sql.functions import percent_rank

df.withColumn('percent_rank', percent_rank().over(windowSpec)).show(truncate=False)

+-------------+----------+------+------------+
|employee_name|department|salary|percent_rank|
+-------------+----------+------+------------+
|Maria        |Finance   |3000  |0.0         |
|Scott        |Finance   |3300  |0.5         |
|Jen          |Finance   |3900  |1.0         |
|Kumar        |Marketing |2000  |0.0         |
|Jeff         |Marketing |3000  |1.0         |
|James        |Sales     |3000  |0.0         |
|James        |Sales     |3000  |0.0         |
|Robert       |Sales     |4100  |0.5         |
|Saif         |Sales     |4100  |0.5         |
|Michael      |Sales     |4600  |1.0         |
+-------------+----------+------+------------+
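
The values in the output above follow from (rank - 1) / (n - 1), where rank is the rank() of the row and n is the number of rows in the partition. A small plain-Python sketch, assuming an already sorted partition and a hypothetical helper name:

```python
def percent_ranks(sorted_values):
    """percent_rank for each row of one ordered partition."""
    n = len(sorted_values)
    result = []
    for value in sorted_values:
        rank = sorted_values.index(value) + 1  # first position of the value equals rank()
        # A single-row partition has no spread, so its percent rank is 0.0.
        result.append((rank - 1) / (n - 1) if n > 1 else 0.0)
    return result


print(percent_ranks([3000, 3000, 4100, 4100, 4600]))  # Sales: [0.0, 0.0, 0.5, 0.5, 1.0]
print(percent_ranks([3000, 3300, 3900]))              # Finance: [0.0, 0.5, 1.0]
```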



#### ntile Window Function
ntile(n) distributes the ordered rows of each window partition into n buckets, numbered 1 to n, as evenly as possible. The example below passes 2 as the argument, so every row is assigned to bucket 1 or bucket 2.

In [0]:

from pyspark.sql.functions import ntile
df.withColumn("ntile",ntile(2).over(windowSpec)) \
    .show()


+-------------+----------+------+-----+
|employee_name|department|salary|ntile|
+-------------+----------+------+-----+
|        Maria|   Finance|  3000|    1|
|        Scott|   Finance|  3300|    1|
|          Jen|   Finance|  3900|    2|
|        Kumar| Marketing|  2000|    1|
|         Jeff| Marketing|  3000|    2|
|        James|     Sales|  3000|    1|
|        James|     Sales|  3000|    1|
|       Robert|     Sales|  4100|    1|
|         Saif|     Sales|  4100|    2|
|      Michael|     Sales|  4600|    2|
+-------------+----------+------+-----+
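
When the row count is not divisible by the number of buckets, the earlier buckets receive the extra rows, which is why Sales (5 rows) splits 3 and 2 above. A plain-Python sketch of that assignment, with a hypothetical helper name:

```python
def ntile_buckets(n, row_count):
    """Bucket number for each of row_count ordered rows split into n buckets."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets hold one row more than the rest.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


print(ntile_buckets(2, 5))  # Sales: [1, 1, 1, 2, 2]
print(ntile_buckets(2, 3))  # Finance: [1, 1, 2]
print(ntile_buckets(2, 2))  # Marketing: [1, 2]
```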




#### PySpark Window Analytic functions



#### cume_dist Window Function
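cume_dist() window function is used to get the cumulative distribution of values within a window partition: for each row, the fraction of rows in the partition whose ordering value is less than or equal to that of the current row.

This is the same as the CUME_DIST function in SQL.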

In [0]:

""" cume_dist """
from pyspark.sql.functions import cume_dist    
df.withColumn("cume_dist",cume_dist().over(windowSpec)) \
   .show()


+-------------+----------+------+------------------+
|employee_name|department|salary|         cume_dist|
+-------------+----------+------+------------------+
|        Maria|   Finance|  3000|0.3333333333333333|
|        Scott|   Finance|  3300|0.6666666666666666|
|          Jen|   Finance|  3900|               1.0|
|        Kumar| Marketing|  2000|               0.5|
|         Jeff| Marketing|  3000|               1.0|
|        James|     Sales|  3000|               0.4|
|        James|     Sales|  3000|               0.4|
|       Robert|     Sales|  4100|               0.8|
|         Saif|     Sales|  4100|               0.8|
|      Michael|     Sales|  4600|               1.0|
+-------------+----------+------+------------------+
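
Each value in the output above is the number of rows in the partition with a salary less than or equal to the current one, divided by the partition size. A minimal plain-Python sketch (hypothetical helper name, one partition at a time):

```python
def cumulative_distribution(sorted_values):
    """cume_dist for each row of one ordered partition."""
    n = len(sorted_values)
    return [
        len([v for v in sorted_values if v <= value]) / n  # rows at or below this value
        for value in sorted_values
    ]


print(cumulative_distribution([3000, 3000, 4100, 4100, 4600]))  # Sales: [0.4, 0.4, 0.8, 0.8, 1.0]
print(cumulative_distribution([2000, 3000]))                    # Marketing: [0.5, 1.0]
```

Unlike percent_rank(), the first row never gets 0 here, because the current row is always counted.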




#### lag Window Function
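The lag() function is a window function that allows you to access the value of a previous row within the result set. It is often used for calculations that compare the current row with a previous row in a specified order. The example below uses an offset of 2, so each row shows the salary two rows earlier in its partition; rows without such a predecessor get null.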

In [0]:
from pyspark.sql.functions import lag
df.withColumn('lag', lag('salary', 2).over(windowSpec)).show()

+-------------+----------+------+----+
|employee_name|department|salary| lag|
+-------------+----------+------+----+
|        Maria|   Finance|  3000|null|
|        Scott|   Finance|  3300|null|
|          Jen|   Finance|  3900|3000|
|        Kumar| Marketing|  2000|null|
|         Jeff| Marketing|  3000|null|
|        James|     Sales|  3000|null|
|        James|     Sales|  3000|null|
|       Robert|     Sales|  4100|3000|
|         Saif|     Sales|  4100|3000|
|      Michael|     Sales|  4600|4100|
+-------------+----------+------+----+




#### lead Window Function
The lead() function is a window function that allows you to access the value of a subsequent row within the result set. It is often used for calculations that compare the current row with a following row in a specified order.

In [0]:
from pyspark.sql.functions import lead
df.withColumn('lead', lead('salary',2).over(windowSpec)).show()

+-------------+----------+------+----+
|employee_name|department|salary|lead|
+-------------+----------+------+----+
|        Maria|   Finance|  3000|3900|
|        Scott|   Finance|  3300|null|
|          Jen|   Finance|  3900|null|
|        Kumar| Marketing|  2000|null|
|         Jeff| Marketing|  3000|null|
|        James|     Sales|  3000|4100|
|        James|     Sales|  3000|4100|
|       Robert|     Sales|  4100|4600|
|         Saif|     Sales|  4100|null|
|      Michael|     Sales|  4600|null|
+-------------+----------+------+----+
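
Both lag() and lead() amount to shifting the ordered partition by a fixed offset and filling the uncovered positions with null. A plain-Python sketch, with hypothetical helper names and None standing in for null:

```python
def lag_values(values, offset=1, default=None):
    """Value `offset` rows before each row, or `default` if there is none."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead_values(values, offset=1, default=None):
    """Value `offset` rows after each row, or `default` if there is none."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


sales = [3000, 3000, 4100, 4100, 4600]  # Sales partition ordered by salary
print(lag_values(sales, 2))   # [None, None, 3000, 3000, 4100]
print(lead_values(sales, 2))  # [4100, 4100, 4600, None, None]
```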




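#### PySpark Window Aggregate Functions

Aggregate functions such as avg(), sum(), min() and max() can also be evaluated over a window. Because windowSpecAgg below has no orderBy, each aggregate covers the whole department partition and is repeated on every row of it.

In [0]:
from pyspark.sql.functions import col, avg, sum, min, max, row_number

windowSpecAgg = Window.partitionBy('department')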

df.withColumn('row', row_number().over(windowSpec)) \
  .withColumn('avg', avg(col('salary')).over(windowSpecAgg)) \
  .withColumn('sum', sum(col('salary')).over(windowSpecAgg)) \
  .withColumn('min', min(col('salary')).over(windowSpecAgg)) \
  .withColumn('max', max(col('salary')).over(windowSpecAgg)) \
  .show(truncate=False)  

+-------------+----------+------+---+------+-----+----+----+
|employee_name|department|salary|row|avg   |sum  |min |max |
+-------------+----------+------+---+------+-----+----+----+
|Maria        |Finance   |3000  |1  |3400.0|10200|3000|3900|
|Scott        |Finance   |3300  |2  |3400.0|10200|3000|3900|
|Jen          |Finance   |3900  |3  |3400.0|10200|3000|3900|
|Kumar        |Marketing |2000  |1  |2500.0|5000 |2000|3000|
|Jeff         |Marketing |3000  |2  |2500.0|5000 |2000|3000|
|James        |Sales     |3000  |1  |3760.0|18800|3000|4600|
|James        |Sales     |3000  |2  |3760.0|18800|3000|4600|
|Robert       |Sales     |4100  |3  |3760.0|18800|3000|4600|
|Saif         |Sales     |4100  |4  |3760.0|18800|3000|4600|
|Michael      |Sales     |4600  |5  |3760.0|18800|3000|4600|
+-------------+----------+------+---+------+-----+----+----+
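
The output above keeps all ten rows and repeats the department-level aggregates on each of them, whereas a groupBy would collapse each department to one row. A plain-Python sketch of that two-step logic (names are illustrative, not part of any Spark API):

```python
rows = [
    ("James", "Sales", 3000), ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100), ("Maria", "Finance", 3000),
    ("James", "Sales", 3000), ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900), ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000), ("Saif", "Sales", 4100),
]

# Step 1: gather the salaries of each partition (department).
salaries_by_department = {}
for _, department, salary in rows:
    salaries_by_department.setdefault(department, []).append(salary)

# Step 2: compute one set of aggregates per partition.
stats = {
    department: {
        "avg": sum(values) / len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }
    for department, values in salaries_by_department.items()
}

# Step 3: attach the partition aggregates to every original row.
with_stats = [(name, dept, salary, stats[dept]) for name, dept, salary in rows]
print(stats["Sales"])
```

Run it in a fresh Python session, since the notebook cell above rebinds sum, min and max to the PySpark column functions.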




### PySpark SQL Date and Timestamp Functions


In [0]:
from pyspark.sql.functions import *

In [0]:
data=[["1","2020-02-01"],["2","2019-03-01"],["3","2021-03-01"]]
df=spark.createDataFrame(data,["id","input"])
df.show()

+---+----------+
| id|     input|
+---+----------+
|  1|2020-02-01|
|  2|2019-03-01|
|  3|2021-03-01|
+---+----------+




#### current_date()

In [0]:
df.withColumn('Current_date', current_date()).show(truncate=False)

df.select(current_date().alias('current_date')).show(truncate=False)

+---+----------+------------+
|id |input     |Current_date|
+---+----------+------------+
|1  |2020-02-01|2023-09-11  |
|2  |2019-03-01|2023-09-11  |
|3  |2021-03-01|2023-09-11  |
+---+----------+------------+

+------------+
|current_date|
+------------+
|2023-09-11  |
|2023-09-11  |
|2023-09-11  |
+------------+




#### date_format()

In [0]:
df.withColumn('Date_format', date_format(df.input,'MM-dd-yyyy')).show()

df.select(col('input'), date_format(col('input'), "MM-dd-yyyy").alias('Date_format')).show()

+---+----------+-----------+
| id|     input|Date_format|
+---+----------+-----------+
|  1|2020-02-01| 02-01-2020|
|  2|2019-03-01| 03-01-2019|
|  3|2021-03-01| 03-01-2021|
+---+----------+-----------+

+----------+-----------+
|     input|Date_format|
+----------+-----------+
|2020-02-01| 02-01-2020|
|2019-03-01| 03-01-2019|
|2021-03-01| 03-01-2021|
+----------+-----------+




#### to_date()
The example below parses strings in yyyy-MM-dd format into DateType values using to_date(). The second argument describes the format of the input string; a DateType column itself has no format, so use date_format() to render it in a different pattern.

In [0]:
df.select(col('input'), to_date(col('input'), 'yyyy-MM-dd').alias('to_date')).show()

+----------+----------+
|     input|   to_date|
+----------+----------+
|2020-02-01|2020-02-01|
|2019-03-01|2019-03-01|
|2021-03-01|2021-03-01|
+----------+----------+




#### datediff()


In [0]:
df.select(col('input'), datediff(current_date(), col('input')).alias('date_diff')).show(truncate=False)

+----------+---------+
|input     |date_diff|
+----------+---------+
|2020-02-01|1318     |
|2019-03-01|1655     |
|2021-03-01|924      |
+----------+---------+
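
datediff(end, start) returns the number of days from start to end. The figures above can be checked with Python's datetime module, assuming the run date shown in the earlier current_date() output (2023-09-11):

```python
from datetime import date

reference = date(2023, 9, 11)  # current_date() at the time the notebook was run
inputs = [date(2020, 2, 1), date(2019, 3, 1), date(2021, 3, 1)]

# Subtracting two dates gives a timedelta; .days is the whole-day difference.
date_diffs = [(reference - d).days for d in inputs]
print(date_diffs)  # [1318, 1655, 924]
```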




#### months_between()


In [0]:
df.select(col('input'), months_between(current_date(), col('input')).alias('months_between')).show(truncate=False)

+----------+--------------+
|input     |months_between|
+----------+--------------+
|2020-02-01|43.32258065   |
|2019-03-01|54.32258065   |
|2021-03-01|30.32258065   |
+----------+--------------+
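
The fractional part in the output above comes from months_between() treating every month as 31 days when the two days of the month differ, and rounding to 8 decimal places by default. The sketch below is a simplified, date-only model of that rule (it ignores time of day; the function and parameter names are hypothetical):

```python
import calendar
from datetime import date


def months_between_dates(end, start, round_off=True):
    """Simplified date-only model of months_between(end, start)."""
    whole_months = (end.year - start.year) * 12 + (end.month - start.month)
    end_is_last = end.day == calendar.monthrange(end.year, end.month)[1]
    start_is_last = start.day == calendar.monthrange(start.year, start.month)[1]
    # Same day of month, or both month ends: the result is a whole number.
    if end.day == start.day or (end_is_last and start_is_last):
        return float(whole_months)
    # Otherwise the leftover days count as a fraction of a 31-day month.
    result = whole_months + (end.day - start.day) / 31
    return round(result, 8) if round_off else result


print(months_between_dates(date(2023, 9, 11), date(2020, 2, 1)))  # 43.32258065
```

Here 43 whole months separate 2020-02 and 2023-09, and the 10 remaining days contribute 10 / 31 of a month.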



#### trunc()



In [0]:
df.select(col("input"), 
    trunc(col("input"),"Month").alias("Month_Trunc"), 
    trunc(col("input"),"Year").alias("Year_Trunc")
   ).show()

+----------+-----------+----------+
|     input|Month_Trunc|Year_Trunc|
+----------+-----------+----------+
|2020-02-01| 2020-02-01|2020-01-01|
|2019-03-01| 2019-03-01|2019-01-01|
|2021-03-01| 2021-03-01|2021-01-01|
+----------+-----------+----------+
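
Because every input date above is already the first of a month, the month truncation looks like a no-op. A plain-Python sketch with a mid-month date (the helper name truncate_date is hypothetical) makes the effect visible:

```python
from datetime import date


def truncate_date(d, unit):
    """Reset the parts of a date below the given unit, like trunc(col, unit)."""
    unit = unit.lower()
    if unit == "month":
        return d.replace(day=1)            # first day of the same month
    if unit == "year":
        return d.replace(month=1, day=1)   # first day of the same year
    raise ValueError(f"unsupported unit: {unit}")


sample = date(2020, 2, 17)
print(truncate_date(sample, "Month"))  # 2020-02-01
print(truncate_date(sample, "Year"))   # 2020-01-01
```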




#### add_months(), date_add(), date_sub()


In [0]:
df.select(col("input"), 
    add_months(col("input"),3).alias("add_months"), 
    add_months(col("input"),-3).alias("sub_months"), 
    date_add(col("input"),4).alias("date_add"), 
    date_sub(col("input"),4).alias("date_sub") 
  ).show()

+----------+----------+----------+----------+----------+
|     input|add_months|sub_months|  date_add|  date_sub|
+----------+----------+----------+----------+----------+
|2020-02-01|2020-05-01|2019-11-01|2020-02-05|2020-01-28|
|2019-03-01|2019-06-01|2018-12-01|2019-03-05|2019-02-25|
|2021-03-01|2021-06-01|2020-12-01|2021-03-05|2021-02-25|
+----------+----------+----------+----------+----------+
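
The month and day arithmetic above can be mirrored with the standard library. This is a sketch under the assumption that a day which does not exist in the target month is clamped to that month's last day; the helper name shift_months is hypothetical:

```python
import calendar
from datetime import date, timedelta


def shift_months(d, months):
    """Move a date by a number of calendar months (negative moves back)."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(month_index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    # Clamp the day when the target month is shorter (e.g. Jan 31 -> Feb 28).
    day = d.day if d.day <= last_day else last_day
    return date(year, month0 + 1, day)


start = date(2020, 2, 1)
print(shift_months(start, 3))       # 2020-05-01, like add_months(col, 3)
print(shift_months(start, -3))      # 2019-11-01, like add_months(col, -3)
print(start + timedelta(days=4))    # 2020-02-05, like date_add(col, 4)
print(start - timedelta(days=4))    # 2020-01-28, like date_sub(col, 4)
```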




#### year(), month(), next_day(), weekofyear()


In [0]:
df.select(col('input'),
          year(col('input')).alias('year'),
          month(col('input')).alias('month'),
          next_day(col('input'),'Sunday').alias('next_day'),
          weekofyear(col('input')).alias('weekofyear')).show(truncate=False)

+----------+----+-----+----------+----------+
|input     |year|month|next_day  |weekofyear|
+----------+----+-----+----------+----------+
|2020-02-01|2020|2    |2020-02-02|5         |
|2019-03-01|2019|3    |2019-03-03|9         |
|2021-03-01|2021|3    |2021-03-07|9         |
+----------+----+-----+----------+----------+



#### dayofweek(), dayofmonth(), dayofyear()


In [0]:
df.select(col("input"),  
     dayofweek(col("input")).alias("dayofweek"), 
     dayofmonth(col("input")).alias("dayofmonth"), 
     dayofyear(col("input")).alias("dayofyear"), 
  ).show()

+----------+---------+----------+---------+
|     input|dayofweek|dayofmonth|dayofyear|
+----------+---------+----------+---------+
|2020-02-01|        7|         1|       32|
|2019-03-01|        6|         1|       60|
|2021-03-01|        2|         1|       60|
+----------+---------+----------+---------+
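
Note that dayofweek() numbers the days from 1 = Sunday to 7 = Saturday, while weekofyear() from the previous cell follows ISO 8601 week numbering. The plain-Python sketch below reproduces the values shown for the three input dates; the helper name is hypothetical:

```python
from datetime import date


def spark_style_parts(d):
    """Date parts using the same conventions as the Spark functions above."""
    return {
        # isoweekday(): Monday=1 .. Sunday=7; remap to Sunday=1 .. Saturday=7.
        "dayofweek": d.isoweekday() % 7 + 1,
        "dayofmonth": d.day,
        "dayofyear": d.timetuple().tm_yday,
        "weekofyear": d.isocalendar()[1],  # ISO 8601 week number
    }


for d in (date(2020, 2, 1), date(2019, 3, 1), date(2021, 3, 1)):
    print(d, spark_style_parts(d))
```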




#### current_timestamp()


In [0]:
df.select(current_timestamp().alias('current_timestamp')).show(1, truncate=False)

+-----------------------+
|current_timestamp      |
+-----------------------+
|2023-09-11 11:21:32.619|
+-----------------------+
only showing top 1 row



#### to_timestamp()


In [0]:
data=[["1","02-01-2020 11 01 19 06"],["2","03-01-2019 12 01 19 406"],["3","03-01-2021 12 01 19 406"]]
df2=spark.createDataFrame(data,["id","input"])
df2.show(truncate=False)

df2.select(col('input'), 
           to_timestamp(col('input'), "MM-dd-yyyy HH mm ss SSS").alias('To-Timestamp')).show(truncate=False)

+---+-----------------------+
|id |input                  |
+---+-----------------------+
|1  |02-01-2020 11 01 19 06 |
|2  |03-01-2019 12 01 19 406|
|3  |03-01-2021 12 01 19 406|
+---+-----------------------+

+-----------------------+-----------------------+
|input                  |To-Timestamp           |
+-----------------------+-----------------------+
|02-01-2020 11 01 19 06 |2020-02-01 11:01:19.06 |
|03-01-2019 12 01 19 406|2019-03-01 12:01:19.406|
|03-01-2021 12 01 19 406|2021-03-01 12:01:19.406|
+-----------------------+-----------------------+




#### hour(), minute() and second()


In [0]:
data=[["1","2020-02-01 11:01:19.06"],["2","2019-03-01 12:01:19.406"],["3","2021-03-01 12:01:19.406"]]
df3=spark.createDataFrame(data,["id","input"])

df3.select(col('input'),
           hour(col('input')).alias('Hour'),
           minute(col('input')).alias('minute'),
           second(col('input')).alias('second')).show(truncate=False)

+-----------------------+----+------+------+
|input                  |Hour|minute|second|
+-----------------------+----+------+------+
|2020-02-01 11:01:19.06 |11  |1     |19    |
|2019-03-01 12:01:19.406|12  |1     |19    |
|2021-03-01 12:01:19.406|12  |1     |19    |
+-----------------------+----+------+------+



### PySpark JSON Functions

In [0]:
jsonString = """{"zipcode":704, "ZipCodeType":"STANDARD", "City": "PARC PARQUE", "State": "PR"}"""
df=spark.createDataFrame([(1, jsonString)],['id','value'])
df.show(truncate=False)

+---+-------------------------------------------------------------------------------+
|id |value                                                                          |
+---+-------------------------------------------------------------------------------+
|1  |{"zipcode":704, "ZipCodeType":"STANDARD", "City": "PARC PARQUE", "State": "PR"}|
+---+-------------------------------------------------------------------------------+




#### from_json()

In [0]:
# Convert JSON string column to Map type
from pyspark.sql.types import MapType, StringType
from pyspark.sql.functions import from_json

df2=df.withColumn('value', from_json(df.value, MapType(StringType(),StringType())))
df2.printSchema()
df2.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+---+---------------------------------------------------------------------------+
|id |value                                                                      |
+---+---------------------------------------------------------------------------+
|1  |{zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR}|
+---+---------------------------------------------------------------------------+




#### to_json()
to_json() converts a MapType or StructType column to a JSON string.

In [0]:
from pyspark.sql.functions import to_json, col
df2.withColumn('value',to_json(col('value'))).show(truncate=False)

+---+----------------------------------------------------------------------------+
|id |value                                                                       |
+---+----------------------------------------------------------------------------+
|1  |{"zipcode":"704","ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+----------------------------------------------------------------------------+
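
Note that the round trip above turns the number 704 into the string "704", because the map was declared as MapType(StringType(), StringType()). A sketch with Python's json module, valid only for a flat object like this one, shows the same effect:

```python
import json

json_string = '{"zipcode":704, "ZipCodeType":"STANDARD", "City": "PARC PARQUE", "State": "PR"}'

# Rough analogue of from_json(..., MapType(StringType(), StringType())):
# parse the text, then coerce every value to a string.
as_map = {key: str(value) for key, value in json.loads(json_string).items()}

# Rough analogue of to_json(): serialise without spaces after separators.
back_to_json = json.dumps(as_map, separators=(",", ":"))

print(as_map["zipcode"])  # 704 (now a string)
print(back_to_json)
```

To keep zipcode numeric in Spark, a StructType schema with a numeric field would be needed instead of a string-to-string map.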




#### json_tuple()

json_tuple() extracts the named fields from a JSON string column and returns each of them as a new column.


In [0]:
from pyspark.sql.functions import json_tuple
df.select(col('id'), json_tuple(col('value'),'ZipCodeType','City', 'State')) \
    .toDF('id', 'ZipCodeType', 'City','State') \
    .show(truncate=False)

+---+-----------+-----------+-----+
|id |ZipCodeType|City       |State|
+---+-----------+-----------+-----+
|1  |STANDARD   |PARC PARQUE|PR   |
+---+-----------+-----------+-----+



#### get_json_object()
get_json_object() extracts a value from a JSON string column based on a JSON path.

In [0]:
from pyspark.sql.functions import get_json_object

df.select(col('id'), get_json_object(col('value'), "$.ZipCodeType").alias('ZipCodeType')).show(truncate=False)

+---+-----------+
|id |ZipCodeType|
+---+-----------+
|1  |STANDARD   |
+---+-----------+
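
Conceptually, get_json_object() performs a single lookup by path, while json_tuple() performs several lookups by field name at once. A plain-Python sketch of both on the same JSON string (illustrative only; in Spark a missing field yields null, represented here by None):

```python
import json

json_string = '{"zipcode":704, "ZipCodeType":"STANDARD", "City": "PARC PARQUE", "State": "PR"}'
record = json.loads(json_string)

# Like get_json_object(value, "$.ZipCodeType"): one value addressed by path.
zip_code_type = record.get("ZipCodeType")

# Like json_tuple(value, 'ZipCodeType', 'City', 'State'): several fields in one pass.
fields = tuple(record.get(name) for name in ("ZipCodeType", "City", "State"))

# A field that is absent from the JSON comes back empty rather than raising an error.
missing = record.get("Country")

print(zip_code_type)  # STANDARD
print(fields)         # ('STANDARD', 'PARC PARQUE', 'PR')
print(missing)        # None
```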




#### schema_of_json()
schema_of_json() infers a schema from a JSON string literal and returns it as a DDL-formatted string.


In [0]:
from pyspark.sql.functions import schema_of_json,lit
schemaStr=spark.range(1) \
          .select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))) \
           .collect()[0][0]

print(schemaStr)

STRUCT<City: STRING, State: STRING, ZipCodeType: STRING, Zipcode: BIGINT>
