# Creating UDFs

### Creating a DataFrame with sample data and schema

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F

In [2]:
myspark = SparkSession.builder.appName("UDFTest").enableHiveSupport().getOrCreate()

In [3]:
temp = [[8.7,0.6,"Delhi"],[9.7,0.9,"Gurgaon"],[11.0,1.6,"Noida"],[11.2,-0.5,"Jaipur"],[9.0,-1.0,"Hisar"]]
df = myspark.createDataFrame(temp,["avgHigh","avgLow","city"])

In [4]:
df.printSchema()

root
 |-- avgHigh: double (nullable = true)
 |-- avgLow: double (nullable = true)
 |-- city: string (nullable = true)



In [5]:
df.show()

+-------+------+-------+
|avgHigh|avgLow|   city|
+-------+------+-------+
|    8.7|   0.6|  Delhi|
|    9.7|   0.9|Gurgaon|
|   11.0|   1.6|  Noida|
|   11.2|  -0.5| Jaipur|
|    9.0|  -1.0|  Hisar|
+-------+------+-------+



In [6]:
df.createOrReplaceTempView("mytable")
myspark.sql("SELECT * FROM mytable").show()

+-------+------+-------+
|avgHigh|avgLow|   city|
+-------+------+-------+
|    8.7|   0.6|  Delhi|
|    9.7|   0.9|Gurgaon|
|   11.0|   1.6|  Noida|
|   11.2|  -0.5| Jaipur|
|    9.0|  -1.0|  Hisar|
+-------+------+-------+



## 1. UDF for a DataFrame

Creating and registering a UDF using pyspark.sql.functions.udf

This takes 2 arguments: the Python function and the return type (default is StringType).



Creating the function to convert Celsius to Fahrenheit

In [7]:
def changeToF(degreesCelsius):
    Far = (degreesCelsius * 9.0 / 5.0) + 32.0
    return Far
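
Spark passes SQL NULL values to a Python UDF as None, and the arithmetic in changeToF would raise a TypeError for such a value. The sample data here has no nulls, so the following is only a sketch of a null-safe variant. The name `change_to_f_safe` is hypothetical and is not part of the original notebook.

```python
import math

def change_to_f_safe(degrees_celsius):
    """Convert Celsius to Fahrenheit, passing None through unchanged."""
    if degrees_celsius is None:
        return None  # a NULL input stays NULL instead of raising TypeError
    return (degrees_celsius * 9.0 / 5.0) + 32.0

# Quick plain-Python checks before wrapping the function as a UDF
assert math.isclose(change_to_f_safe(-1.0), 30.2)
assert change_to_f_safe(None) is None
```

Because this is an ordinary Python function, it can be tested without a SparkSession and then wrapped as a UDF in the same way as changeToF.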

Registering the newly created function as a Spark UDF (using pyspark.sql.functions.udf)

In [8]:
# changeToF returns a float, so declare DoubleType rather than StringType
df_udf = F.udf(changeToF, T.DoubleType())

Or you can use a lambda directly (instead of defining the function separately)

In [9]:
df_udf = F.udf(lambda degreesCelsius: (degreesCelsius * 9.0 / 5.0) + 32.0, T.DoubleType())

Now you can use df_udf by passing any column to it.

In [10]:
df.select("avgHigh", "avgLow", "city", df_udf("avgLow").alias("avgLow_in_F")).show()

+-------+------+-------+-----------+
|avgHigh|avgLow|   city|avgLow_in_F|
+-------+------+-------+-----------+
|    8.7|   0.6|  Delhi|      33.08|
|    9.7|   0.9|Gurgaon|      33.62|
|   11.0|   1.6|  Noida|      34.88|
|   11.2|  -0.5| Jaipur|       31.1|
|    9.0|  -1.0|  Hisar|       30.2|
+-------+------+-------+-----------+



#### Note: A UDF created this way can't be called by name in a SQL query on a table

    myspark.sql("SELECT myudf(avgLow) from mytable")

    ERROR: AnalysisException: u"Undefined function: 'myudf'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7"

## 2. UDF for a Table/View

Creating and registering a UDF using pyspark.sql.SparkSession.udf (it is equivalent to sqlContext.udf.register)

    Registers a Python function (including a lambda function) as a UDF so it can be used in SQL statements.

This takes 3 parameters:

    (i) name – name of the UDF, (ii) f – Python function, (iii) returnType – a pyspark.sql.types.DataType object

In [11]:
myspark.udf.register("table_udf", lambda degreesCelsius: (degreesCelsius * 9.0 / 5.0) + 32 , T.DoubleType())

In [12]:
myspark.sql("SELECT avgHigh, avgLow, city , table_udf(avgLow) as avgLow_in_F from mytable").show(5,False)

+-------+------+-------+-----------+
|avgHigh|avgLow|city   |avgLow_in_F|
+-------+------+-------+-----------+
|8.7    |0.6   |Delhi  |33.08      |
|9.7    |0.9   |Gurgaon|33.62      |
|11.0   |1.6   |Noida  |34.88      |
|11.2   |-0.5  |Jaipur |31.1       |
|9.0    |-1.0  |Hisar  |30.2       |
+-------+------+-------+-----------+



### The same thing can be achieved with sqlContext.udf.register or sqlContext.registerFunction

#### (i) Using "sqlContext.udf.register"

In [13]:
sqlContext.udf.register("table_udf", lambda degreesCelsius: (degreesCelsius * 9.0 / 5.0) + 32 , T.DoubleType())

In [14]:
myspark.sql("SELECT avgHigh, avgLow, city , table_udf(avgLow) as avgLow_in_F from mytable").show(5,False)

+-------+------+-------+-----------+
|avgHigh|avgLow|city   |avgLow_in_F|
+-------+------+-------+-----------+
|8.7    |0.6   |Delhi  |33.08      |
|9.7    |0.9   |Gurgaon|33.62      |
|11.0   |1.6   |Noida  |34.88      |
|11.2   |-0.5  |Jaipur |31.1       |
|9.0    |-1.0  |Hisar  |30.2       |
+-------+------+-------+-----------+



#### (ii) Using "sqlContext.registerFunction"

In [15]:
sqlContext.registerFunction("table_udf", lambda degreesCelsius: (degreesCelsius * 9.0 / 5.0) + 32 , T.DoubleType())

In [16]:
myspark.sql("SELECT * , table_udf(avgLow) AS avgLow_in_F from mytable").show(5,False)

+-------+------+-------+-----------+
|avgHigh|avgLow|city   |avgLow_in_F|
+-------+------+-------+-----------+
|8.7    |0.6   |Delhi  |33.08      |
|9.7    |0.9   |Gurgaon|33.62      |
|11.0   |1.6   |Noida  |34.88      |
|11.2   |-0.5  |Jaipur |31.1       |
|9.0    |-1.0  |Hisar  |30.2       |
+-------+------+-------+-----------+



#### Note: table_udf is a SQL function name, not a Python variable, so it can't be called directly on a DataFrame

    df.select(table_udf(df.avgLow)).show()

    NameError: name 'table_udf' is not defined


# What's Next

1) To download this single notebook, open this file in my GitHub account, copy the URL, and paste it into http://nbviewer.jupyter.org/. The download button is in the top right corner.

2) Open your Jupyter Notebook home page and upload the file using the "Upload" button.

3) Continue learning with the next notebook, Spark_06_ETL_Operations_Exercise.ipynb