# Operations Between Columns Using PySpark

The purpose of this notebook is to give you a basic understanding of how to perform operations between columns using PySpark.

We will start by importing the PySpark modules we need and a helper function that generates a dummy table. We will use this table to create new columns from the existing ones.

In [1]:
# Spark related machinery (only the imports this notebook actually uses)
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

# Start (or reuse) a Spark session with Hive support enabled
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

In [2]:
from pyspark_functions import create_sp_table2

Now we will create the table using the function ```create_sp_table2``` and print it to the screen.

In [3]:
#Create table
players_info = create_sp_table2()

#Print table
players_info.show()

+--------------+-------+---------+---+------------+------------+
|participant_id|   name|last_name|age|score_game_1|score_game_2|
+--------------+-------+---------+---+------------+------------+
|             1| Sophia|     Lara| 27|          89|          84|
|             2|   Liam|    Smith| 23|          50|          60|
|             3| Olivia|   Wilson| 25|          78|          70|
|             4|Jackson|   Garcia| 24|          98|          90|
|             5|    Ava|    Moore| 26|         100|          89|
|             6| Oliver|     Leon| 30|          65|          70|
|             7|  Lucas|    Brown| 24|          78|          75|
|             8|    Mia|      Lee| 31|          85|          79|
|             9|   Aria| Robinson| 28|          80|          89|
|            10| Amelia|   Walker| 29|          93|          99|
+--------------+-------+---------+---+------------+------------+



## Simple Operations Between Columns
We will start by adding a new column **score_game_1_plus_10** where we add 10 points to the column **score_game_1**

In [4]:
#Create new column
players_info = players_info.withColumn("score_game_1_plus_10", F.col("score_game_1") + 10)

#Print table
players_info.show()

+--------------+-------+---------+---+------------+------------+--------------------+
|participant_id|   name|last_name|age|score_game_1|score_game_2|score_game_1_plus_10|
+--------------+-------+---------+---+------------+------------+--------------------+
|             1| Sophia|     Lara| 27|          89|          84|                  99|
|             2|   Liam|    Smith| 23|          50|          60|                  60|
|             3| Olivia|   Wilson| 25|          78|          70|                  88|
|             4|Jackson|   Garcia| 24|          98|          90|                 108|
|             5|    Ava|    Moore| 26|         100|          89|                 110|
|             6| Oliver|     Leon| 30|          65|          70|                  75|
|             7|  Lucas|    Brown| 24|          78|          75|                  88|
|             8|    Mia|      Lee| 31|          85|          79|                  95|
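|             9|   Aria| Robinson| 28|          80|          89|                  90|
|            10| Amelia|   Walker| 29|          93|          99|                 103|
+--------------+-------+---------+---+------------+------------+--------------------+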

The new column is created with the method ```withColumn()```. Its first argument is the name of the new column you want to create. Its second argument is the column expression that computes the values of the new column, in our case ```F.col("score_game_1") + 10```.

Now we will add the columns **score_game_1** and **score_game_2** to generate a new column named **total_score**:

In [5]:
#Create new column
players_info = players_info.withColumn("total_score", F.col("score_game_1") + F.col("score_game_2"))

#print to screen only selected columns
players_info.select("participant_id", "score_game_1", "score_game_2", "total_score").show()

+--------------+------------+------------+-----------+
|participant_id|score_game_1|score_game_2|total_score|
+--------------+------------+------------+-----------+
|             1|          89|          84|        173|
|             2|          50|          60|        110|
|             3|          78|          70|        148|
|             4|          98|          90|        188|
|             5|         100|          89|        189|
|             6|          65|          70|        135|
|             7|          78|          75|        153|
|             8|          85|          79|        164|
|             9|          80|          89|        169|
|            10|          93|          99|        192|
+--------------+------------+------------+-----------+



Note that in the previous example we print only four columns (**participant_id**, **score_game_1**, **score_game_2**, and **total_score**). This is done using the method ```select()``` and passing the names of the columns separated by commas.

## String Concatenation
If we want to concatenate two columns, in this case string columns, we can use the function ```concat()```. In the example below we will create a new column named **full_name** by concatenating the columns **name** and **last_name**:

In [6]:
#Concatenate two columns (name and last_name), adding a white space in between
players_info = players_info.withColumn("full_name", F.concat("name", F.lit(" "), "last_name"))

#Select 3 columns and print them
players_info.select("name", "last_name", "full_name").show()

+-------+---------+--------------+
|   name|last_name|     full_name|
+-------+---------+--------------+
| Sophia|     Lara|   Sophia Lara|
|   Liam|    Smith|    Liam Smith|
| Olivia|   Wilson| Olivia Wilson|
|Jackson|   Garcia|Jackson Garcia|
|    Ava|    Moore|     Ava Moore|
| Oliver|     Leon|   Oliver Leon|
|  Lucas|    Brown|   Lucas Brown|
|    Mia|      Lee|       Mia Lee|
|   Aria| Robinson| Aria Robinson|
| Amelia|   Walker| Amelia Walker|
+-------+---------+--------------+



## User Defined Functions

Sometimes you need to do operations between columns that require more complex logic. In those cases you can create a user-defined function (UDF) using the function ```udf```.

For example, let's say that we want to add 10 points to the **score_game_1** column only for players aged 26 or younger, leaving all other players with their original score, and that we want to put these new numbers in a column called **score_game_1_new**.

In order to do that, we will start by creating a Python function called ```young_get_extra_ten()``` and then wrap it in a UDF called ```udf_young_get_extra_ten```, see below:

In [7]:
#New function
def young_get_extra_ten(age, score):
    #create the new score
    if age <= 26:
        new_score = score + 10
    if age > 26:
        new_score = score
    return new_score

#User-Defined-Function
udf_young_get_extra_ten = F.udf(young_get_extra_ten, IntegerType())

OK, let's break down the ```young_get_extra_ten``` function line by line:

* ```def young_get_extra_ten(age, score):```: The function definition. The function takes two arguments, ```age``` and ```score```, which receive the values of the two columns passed to the UDF, one row at a time.

* ```if age <= 26:```: If the variable ```age``` is less than or equal to 26, do the following:
    * ```new_score = score + 10```: Set the variable ```new_score``` to the value of ```score``` (the column **score_game_1**) plus ten points.

* ```if age > 26:```: If the variable ```age``` is larger than 26, do the following:
    * ```new_score = score```: Set the variable ```new_score``` to the value of ```score``` (the column **score_game_1**), unchanged.

* ```return new_score```: Return the variable ```new_score```.

It is important to notice that the PySpark function that we will use (```udf_young_get_extra_ten```) is created using the ```udf``` function. It takes two arguments: the Python function to wrap and the type of the output, in this case an integer (```IntegerType()```).

In the block of code below, we will use our UDF ```udf_young_get_extra_ten``` to create the new column:

In [8]:
#Create new column using the UDF
players_info = players_info.withColumn("score_game_1_new", udf_young_get_extra_ten(F.col("age"), F.col("score_game_1")))

#Select columns and print table
players_info.select("age", "score_game_1", "score_game_1_new").show()

+---+------------+----------------+
|age|score_game_1|score_game_1_new|
+---+------------+----------------+
| 27|          89|              89|
| 23|          50|              60|
| 25|          78|              88|
| 24|          98|             108|
| 26|         100|             110|
| 30|          65|              65|
| 24|          78|              88|
| 31|          85|              85|
| 28|          80|              80|
| 29|          93|              93|
+---+------------+----------------+



Note that the new column **score_game_1_new** is created using the ```withColumn``` method, to which we pass the name of the new column and a call to the UDF we defined earlier. Note also that the two arguments passed to the UDF are the columns **age** and **score_game_1**.

## Method ```when```

While UDFs are very useful in implementing complicated logic, most of the time you can accomplish the same goal using the ```when``` method. In order to demonstrate how this method works, in the following block of code we will create a new column, **score_game_2_new**, where we will subtract 10 points from the column **score_game_2** for players at the age of 26 or younger.

In [9]:
#Condition
condition = F.col("age") <= 26

#Action
action = F.col("score_game_2") - 10

#Create new column
players_info = players_info.withColumn("score_game_2_new", F.when(condition, action)\
                                                            .otherwise(F.col("score_game_2")))

#Select 3 columns and print table
players_info.select("age", "score_game_2", "score_game_2_new").show()

+---+------------+----------------+
|age|score_game_2|score_game_2_new|
+---+------------+----------------+
| 27|          84|              84|
| 23|          60|              50|
| 25|          70|              60|
| 24|          90|              80|
| 26|          89|              79|
| 30|          70|              70|
| 24|          75|              65|
| 31|          79|              79|
| 28|          89|              89|
| 29|          99|              99|
+---+------------+----------------+



As you can see, the block of code above starts by declaring a condition (the age of the player is 26 or less) and an action that is taken if the condition is met (subtract 10 points from the column **score_game_2**). Then, we create our new column (**score_game_2_new**) using the ```withColumn``` method. As always, the first argument that we pass is the name of the new column, while the second argument is the expression built with ```when```.

It is important to notice two things: 1) The method ```when``` is taking the condition and action that we previously defined. 2) The method ```otherwise``` indicates the action taken when the initial condition is not met, in this case the action is to return the original value in the column **score_game_2**.

## Final Words

We have covered the basic concepts behind performing operations between columns in PySpark. Now it is time for you to start coding: modify this notebook by creating other columns and changing the existing code.
