# Three helpful column functions

In this section, we will look at the `pyspark` renditions of the R functions `ifelse`, `coalesce` and `case_when`.

In [1]:
from pyspark.sql import SparkSession
from more_pyspark import get_spark_types, to_pandas

spark = SparkSession.builder.appName('Ops').getOrCreate()


## Data set

We will be using two of the data sets provided by the Museum of Modern Art (MoMA) in this lecture.  Make sure that you have downloaded each repository.  [Download Instructions](./get_MOMA_data.ipynb)

Recall that this data set uses an unusual character encoding.

#### MoMA Artwork

In [2]:
artwork = spark.read.csv("./data/Artworks.csv", header=True, inferSchema=True)

artwork.take(2) >> to_pandas

                                                                                



Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,


## Branching with `when` and `otherwise`

* `when` lets us pick between two options when building a new column (the `pyspark` analogue of using `ifelse` inside a `mutate`).
* Syntax: `when(cond, then_expr).otherwise(else_expr)`
* Returns `then_expr` for rows where `cond` is true.
* Returns `else_expr` for all other rows, including rows where `cond` is null.

In [3]:
from pyspark.sql.functions import when, col

(artwork
.withColumn('Post WW2', when(col('Date') >= 1946, 1).otherwise(0))
.take(2)
) >> to_pandas

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.),Post WW2
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,,,,48.6,,,168.9,,,0
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,,,,40.6401,,,29.8451,,,1


### `when` builds a column expression

In [4]:
when(col('Date') >= 1946, 1).otherwise(0)

Column<'CASE WHEN (Date >= 1946) THEN 1 ELSE 0 END'>

### Creating `ifelse`

Python functions make it easy to recreate the R function `ifelse`; a version is included in `more_pyspark`.

In [5]:
from more_pyspark import ifelse

ifelse(col('Date') >= 1946, 1, 0)

Column<'CASE WHEN (Date >= 1946) THEN 1 ELSE 0 END'>

## CASE WHEN using a chain of multiple `when`s

We can also use `when` to replicate the functionality of `case_when` by chaining one `when` into the next.  As before, we can get an else condition using `otherwise`.

* Syntax: `when(cond1, expr1).when(cond2,expr2). ... .otherwise(else_expr)`

In [6]:
from pyspark.sql.functions import when, col

(artwork
.withColumn('Period', (when(col('Date') >= 1946, "Post WW2")
                      .when(col('Date') >= 1939, "WW2")
                      .when(col('Date') > 1918, "Interwar")
                      .when(col('Date') >= 1914, "WW1")
                      .otherwise('Pre-WW1')
                      )
           )
.take(2)
) >> to_pandas

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.),Period
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,,,,48.6,,,168.9,,,Pre-WW1
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,,,,40.6401,,,29.8451,,,Post WW2


### Again, we are creating a lazy column expression

In [7]:
(when(col('Date') >= 1946, "Post WW2")
.when(col('Date') >= 1939, "WW2")
.when(col('Date') > 1918, "Interwar")
.when(col('Date') >= 1914, "WW1")
.otherwise('Pre-WW1')
)

Column<'CASE WHEN (Date >= 1946) THEN Post WW2 WHEN (Date >= 1939) THEN WW2 WHEN (Date > 1918) THEN Interwar WHEN (Date >= 1914) THEN WW1 ELSE Pre-WW1 END'>
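The branches of the generated `CASE WHEN` are tried from top to bottom, and the first condition that holds decides the result, so the order of the `when` calls matters. The plain-Python model below illustrates that rule without Spark; `first_match` and the sample years are hypothetical names and values chosen for illustration.

```python
def first_match(value, branches, default):
    """Plain-Python model of CASE WHEN: the first passing test wins."""
    for test, label in branches:
        if test(value):
            return label
    return default

# Same conditions, in the same order, as the chained when calls.
periods = [
    (lambda year: year >= 1946, "Post WW2"),
    (lambda year: year >= 1939, "WW2"),
    (lambda year: year > 1918, "Interwar"),
    (lambda year: year >= 1914, "WW1"),
]

years = (1896, 1916, 1925, 1942, 1987)
labels = {year: first_match(year, periods, "Pre-WW1") for year in years}

# Reversing the order puts the loosest test first, so it captures
# every year from 1914 onward.
reversed_labels = {year: first_match(year, periods[::-1], "Pre-WW1")
                   for year in years}

print(labels)
print(reversed_labels)
```

With the original order each year lands in its own period; with the reversed order every year from 1914 on is labelled `"WW1"`, which is why the strictest cutoff should come first.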

### Working with `case_when`

The module `more_pyspark` contains a `case_when` function that wraps the chain of `when` and `otherwise` calls shown above.  The main differences are that each branch is passed as a `(condition, value)` pair and that the default result is given by the optional keyword `else_`.

In [8]:
from more_pyspark import case_when

case_when((col('Date') >= 1946, "Post WW2"),
          (col('Date') >= 1939, "WW2"),
          (col('Date') > 1918, "Interwar"),
          (col('Date') >= 1914, "WW1"),
          else_ = 'Pre-WW1'
         )

Column<'CASE WHEN (Date >= 1946) THEN Post WW2 WHEN (Date >= 1939) THEN WW2 WHEN (Date > 1918) THEN Interwar WHEN (Date >= 1914) THEN WW1 ELSE Pre-WW1 END'>

In [9]:
from pyspark.sql.functions import when, col

(artwork
.withColumn('Period', case_when((col('Date') >= 1946, "Post WW2"),
                                (col('Date') >= 1939, "WW2"),
                                (col('Date') > 1918, "Interwar"),
                                (col('Date') >= 1914, "WW1"),
                                else_ = 'Pre-WW1')
           )
.take(2)
) >> to_pandas

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.),Period
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,,,,48.6,,,168.9,,,Pre-WW1
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,,,,40.6401,,,29.8451,,,Post WW2


## A note on `coalesce` in `pyspark`

* SQL: `coalesce(col1, col2, ...)` is used to fill in missing values by returning the first non-null argument.
* pyspark: `df.coalesce(n)` is a DataFrame method that reduces the number of partitions (more on this later); it does not fill in missing values.

## <font color="red"> Exercise 6.3.1 </font>

Consider the `Nationality` column of the `exhibitions` table.  We would like to make a new column that reclassifies each nationality as `"North American"`, `"European"`, or `"Other"`.  
1. Use a chain of `when`s and `otherwise` to accomplish this task. 
2. Copy your solution to part 1 and convert it to use `more_pyspark.case_when`.

In [10]:
# Reminder: this file uses an unusual character encoding
# Read with the default encoding (UTF-8)
exhibitions = spark.read.csv('./data/MoMAExhibitions1929to1989.csv', 
                             header=True, 
                             inferSchema=True)
exhibitions.take(2) >> to_pandas # Notice the "bad" symbols



Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557,1,"C�zanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Curator,Director,...,,American,1902,1981,"American, 1902�1981",Male,109252853,Q711362,500241556,moma.org/artists/9168
1,2557,1,"C�zanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1839,1906,"French, 1839�1906",Male,39374836,Q35548,500004793,moma.org/artists/1053


In [11]:
# Read with correct encoding
exhibitions = spark.read.csv('./data/MoMAExhibitions1929to1989.csv', 
                             header=True, 
                             inferSchema=True,
                             encoding="ISO-8859-1")
exhibitions.take(2) >> to_pandas # No more "bad" symbols

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,...,Institution,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL
0,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Curator,Director,...,,American,1902,1981,"American, 19021981",Male,109252853,Q711362,500241556,moma.org/artists/9168
1,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Artist,Artist,...,,French,1839,1906,"French, 18391906",Male,39374836,Q35548,500004793,moma.org/artists/1053


In [12]:
# Your code here
# Nationalities to classify as "North American"
americans = ['American', 'Mexican', 'Canadian', 'Canadian Inuit', 'American and Mexican', 'Native American',
             'Moroccan and American']

# Nationalities to classify as "European"
europeans = ['French', 'Dutch', 'Italian', 'Spanish', 'German', 'Austrian', 'Finnish', 'Swedish', 'Swiss', 'British',
             'Czech', 'Belgian', 'Russian', 'Russian-Lithuanian', 'English', 'Greek', 'Norwegian', 'Georgian', 'Latvian',
             'Polish', 'Milanese', 'Danish', 'Netherlandish', 'Romanian', 'Flemish', 'Scottish', 'Hungarian', 'Yugoslav',
             'Ukrainian', 'Catalan', 'Florentine', 'Venetian', 'Irish', 'Icelandic', 'Slovene', 'Bosnian',
             'Croatian', 'Luxembourgish']

In [13]:
(exhibitions
 .withColumn('Region', case_when((exhibitions.Nationality.isin(americans) ,"North American"),
                                 (exhibitions.Nationality.isin(europeans) , "European"),
                                 else_ = 'Other')
           )
.take(2)
) >> to_pandas

Unnamed: 0,ExhibitionID,ExhibitionNumber,ExhibitionTitle,ExhibitionCitationDate,ExhibitionBeginDate,ExhibitionEndDate,ExhibitionSortOrder,ExhibitionURL,ExhibitionRole,ExhibitionRoleinPressRelease,...,Nationality,ConstituentBeginDate,ConstituentEndDate,ArtistBio,Gender,VIAFID,WikidataID,ULANID,ConstituentURL,Region
0,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Curator,Director,...,American,1902,1981,"American, 19021981",Male,109252853,Q711362,500241556,moma.org/artists/9168,North American
1,2557,1,"Cézanne, Gauguin, Seurat, Van Gogh","[MoMA Exh. #1, November 7-December 7, 1929]",11/7/1929,12/7/1929,1,moma.org/calendar/exhibitions/1767,Artist,Artist,...,French,1839,1906,"French, 18391906",Male,39374836,Q35548,500004793,moma.org/artists/1053,European
