## Problem:
Create a new column of key-value pairs, similar to a Python dictionary, from existing columns of a DataFrame.

## Solution:
In Spark 2.0 and later, the built-in PySpark SQL function **`create_map`** converts selected DataFrame columns to **MapType**. `create_map()` takes a list of columns that are read as alternating key-value pairs: key1, value1, key2, value2, and so on.

**API Reference:**
https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.create_map
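
To make the pairing rule concrete, here is a plain-Python model of how an alternating key, value, key, value argument list is grouped into one dictionary per row. This is only an illustration of the argument layout, not Spark's implementation, and the helper name `pair_up` is hypothetical.

```python
def pair_up(*args):
    """Group alternating key/value arguments into a dict.

    Hypothetical helper that mirrors the argument layout create_map expects.
    """
    if len(args) % 2 != 0:
        # An odd count would leave the last key without a value.
        raise ValueError("expected key, value, key, value, ...")
    keys = args[0::2]    # positions 0, 2, 4, ...
    values = args[1::2]  # positions 1, 3, 5, ...
    return dict(zip(keys, values))


# One row of the sample data, written as a plain dict.
row = {"year": "2022", "course": "Spark", "fee": 15000}
details = pair_up("course", row["course"], "fee", row["fee"])
print(details)  # {'course': 'Spark', 'fee': 15000}
```

Unlike a Python dict, a MapType column has a single value type for all keys, which is why the schema printed further down reports the map values as string.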

## Implementation

In [1]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrame Functions").getOrCreate()
spark

#### Create a DataFrame with Sample Data

In [4]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('year', StringType(), True),
    StructField('course', StringType(), True),
    StructField('fee', IntegerType(), True),
])

# year is written as an int below; the schema declares it as a string, so it is stored as one
data = [
       (2022,'Spark',15000),(2022,'BigData',10000),(2022,'Scala',10000),(2022,'Python',10000),(2022,'Java',10000), (2022,'DevOps',15000), (2022,'AWS',20000),(2022,'ML',35000),
       (2021,'Spark',15000),(2021,'BigData',10000),(2021,'Scala',10000),(2021,'Python',10000),(2021,'Java',10000), (2021,'DevOps',15000), (2021,'AWS',20000),(2021,'ML',35000),
       (2020,'Spark',15000),(2020,'BigData',10000),(2020,'Scala',10000),(2020,'Python',10000),(2020,'Java',10000), (2020,'DevOps',15000), (2020,'AWS',20000),(2020,'ML',30000),
       ]

courses_df = spark.createDataFrame(data,schema)
courses_df.printSchema()

root
 |-- year: string (nullable = true)
 |-- course: string (nullable = true)
 |-- fee: integer (nullable = true)



In [5]:
courses_df.show()

+----+-------+-----+
|year| course|  fee|
+----+-------+-----+
|2022|  Spark|15000|
|2022|BigData|10000|
|2022|  Scala|10000|
|2022| Python|10000|
|2022|   Java|10000|
|2022| DevOps|15000|
|2022|    AWS|20000|
|2022|     ML|35000|
|2021|  Spark|15000|
|2021|BigData|10000|
|2021|  Scala|10000|
|2021| Python|10000|
|2021|   Java|10000|
|2021| DevOps|15000|
|2021|    AWS|20000|
|2021|     ML|35000|
|2020|  Spark|15000|
|2020|BigData|10000|
|2020|  Scala|10000|
|2020| Python|10000|
+----+-------+-----+
only showing top 20 rows



#### Convert DataFrame columns to MapType

In [6]:
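from pyspark.sql.functions import col, lit, create_map

# Arguments alternate key, value: lit() supplies the literal key, col() the column value
course_details = (
    courses_df.withColumn(
        "course_details",
        create_map(
            lit("course"), col("course"),
            lit("fee"), col("fee"),
        ),
    ).drop("course", "fee")  # keep only year and the new map column
)
course_details.printSchema()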

root
 |-- year: string (nullable = true)
 |-- course_details: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



In [7]:
course_details.show(truncate=False)

+----+---------------------------------+
|year|course_details                   |
+----+---------------------------------+
|2022|{course -> Spark, fee -> 15000}  |
|2022|{course -> BigData, fee -> 10000}|
|2022|{course -> Scala, fee -> 10000}  |
|2022|{course -> Python, fee -> 10000} |
|2022|{course -> Java, fee -> 10000}   |
|2022|{course -> DevOps, fee -> 15000} |
|2022|{course -> AWS, fee -> 20000}    |
|2022|{course -> ML, fee -> 35000}     |
|2021|{course -> Spark, fee -> 15000}  |
|2021|{course -> BigData, fee -> 10000}|
|2021|{course -> Scala, fee -> 10000}  |
|2021|{course -> Python, fee -> 10000} |
|2021|{course -> Java, fee -> 10000}   |
|2021|{course -> DevOps, fee -> 15000} |
|2021|{course -> AWS, fee -> 20000}    |
|2021|{course -> ML, fee -> 35000}     |
|2020|{course -> Spark, fee -> 15000}  |
|2020|{course -> BigData, fee -> 10000}|
|2020|{course -> Scala, fee -> 10000}  |
|2020|{course -> Python, fee -> 10000} |
+----+---------------------------------+
only showing top