## Colab Prep

Execute the following code cells whenever you open or restart the notebook in Google Colab.

In [None]:
!pip install "polars[all]"

In [None]:
!wget https://github.com/WSU-DataScience/dsci_325_module6_basic_data_management_in_python/raw/main/sample_data.zip

In [None]:
!unzip ./sample_data.zip

# Conditional Expressions

In [1]:
import polars as pl
pl.Config.with_columns_kwargs = True

## Data sets

We will be using two of the data sets provided by the Museum of Modern Art (MoMA) in this lecture.  Make sure that you have downloaded both data sets.  [Download Instructions](./get_MOMA_data.ipynb)

#### MoMA Artists

In [2]:
artists = pl.read_csv("./sample_data/Artists.csv")
artists.head(2)

ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
i64,str,str,str,str,i64,i64,str,i64
1,"""Robert Arneson""","""American, 1930–1992""","""American""","""Male""",1930,1992,,
2,"""Doroteo Arnaiz""","""Spanish, born 1936""","""Spanish""","""Male""",1936,0,,


#### MoMA Artwork

In [3]:
artwork = pl.read_csv("./sample_data/Artworks.csv")
artwork.head(2)

Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,Dimensions,CreditLine,AccessionNumber,Classification,Department,DateAcquired,Cataloged,ObjectID,URL,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,f64,str,str,f64,str,str
"""Ferdinandsbrücke Project, Vien…","""Otto Wagner""","""6210""","""(Austrian, 1841–1918)""","""(Austrian)""","""(1841)""","""(1918)""","""(Male)""","""1896""","""Ink and cut-and-pasted painted…","""19 1/8 x 66 1/2"" (48.6 x 168.9…","""Fractional and promised gift o…","""885.1996""","""Architecture""","""Architecture & Design""","""1996-04-09""","""Y""",2,"""http://www.moma.org/collection…","""http://www.moma.org/media/W1si…",,,,48.6,,,168.9,,
"""City of Music, National Superi…","""Christian de Portzamparc""","""7470""","""(French, born 1944)""","""(French)""","""(1944)""","""(0)""","""(Male)""","""1987""","""Paint and colored pencil on pr…","""16 x 11 3/4"" (40.6 x 29.8 cm)""","""Gift of the architect in honor…","""1.1995""","""Architecture""","""Architecture & Design""","""1995-01-17""","""Y""",3,"""http://www.moma.org/collection…","""http://www.moma.org/media/W1si…",,,,40.6401,,,29.8451,,


## Review - CASE WHEN

Recall that the `CASE WHEN` expression lets us condition results on one or more boolean conditions.

```{SQL}
SELECT CASE 
            WHEN Nationality = 'American'
            THEN 'Yes'
            ELSE 'No'
       END AS American
FROM Artists
```

## Conditional expressions in `polars`

To perform a `CASE WHEN` in `polars`, build a single dot-chain:
* Start with `pl.when(...).then(...)`.
* Add any number of additional `.when(...).then(...)` pairs to the dot-chain.
* Add an `.otherwise(...)` to catch all remaining cases.

### Example

In [4]:
df = pl.DataFrame({'cat':['a','b','b','c','c'],
                   'val':[ 1,  1,  2,  1, 2]})
df

cat,val
str,i64
"""a""",1
"""b""",1
"""b""",2
"""c""",1
"""c""",2


#### `when`/`then` with one predicate pair

Unmatched values are `null`.

In [5]:
(df
 .with_columns(new = pl.when(pl.col('cat') == 'a')
                       .then(pl.col('val') + 1)
              )
)

cat,val,new
str,i64,i64
"""a""",1,2
"""b""",1,
"""b""",2,
"""c""",1,
"""c""",2,


#### Multiple WHEN/THEN clauses

Note that predicates are checked in order and the first matching predicate is applied.

In [6]:
(df
 .with_columns(new = pl.when(pl.col('cat') == 'a')
                       .then(pl.col('val') + 1)
                       .when(pl.col('cat') == 'b')
                       .then(pl.col('val') + 10)
               
                       .when(pl.col('cat') == 'c')
                       .then(pl.lit(100))
              )
)

cat,val,new
str,i64,i64
"""a""",1,2
"""b""",1,11
"""b""",2,12
"""c""",1,100
"""c""",2,100


#### Adding an else with `otherwise`

In [7]:
(df
 .with_columns(new = pl.when(pl.col('cat') == 'a')
                       .then(pl.col('val') + 1)
                       .when(pl.col('cat') == 'b')
                       .then(pl.col('val') + 10)
                       .otherwise(pl.col('val'))
              )
)

cat,val,new
str,i64,i64
"""a""",1,2
"""b""",1,11
"""b""",2,12
"""c""",1,1
"""c""",2,2


### Including literal values

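Note that
* `polars` is implemented in Rust, so Python constants must be converted into `polars` expressions before they can be used in a dot-chain.
* Wrapping literal/constant values in `pl.lit` makes that conversion explicit.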

In [8]:
0 # Python integer

0

In [9]:
pl.lit(0) # Gets converted to Rust/Apache Arrow

In [10]:
pl.lit(0, pl.Int8) # Cast to a specific int type

#### `when`/`then` with a literal value in `otherwise`

In [11]:
(df
 .with_columns(new = pl.when(pl.col('cat') == 'a')
                       .then(pl.col('val') + 1)
                       .when(pl.col('cat') == 'b')
                       .then(pl.col('val') + 10)
                       .otherwise(pl.lit(0))
              )
)

cat,val,new
str,i64,i64
"""a""",1,2
"""b""",1,11
"""b""",2,12
"""c""",1,0
"""c""",2,0


## <font color="red"> Exercise 6.7.1 </font>

Consider the `Nationality` column of the `artists` data.  We would like to create an *indicator column* for the American nationality; that is, make a new column that contains `1` if the artist is of American descent and `0` otherwise. 

In [12]:
# Your code here

## <font color="red"> Exercise 6.7.2 </font>

Consider the `Nationality` column of the `artists` table.  We would like to make a new column that reclassifies this column as `"North American"`, `"European"`, or `"Other"`.  Use a `pl.when(...).then(...)` chain to accomplish this task.

**Hint.** Lists of relevant nationalities are provided; consider using `is_in` with these lists in your predicates.

In [13]:
all_countries = artists['Nationality'].unique().to_list() 
all_countries # This was used to make the following lists

['Cambodian',
 'Portuguese',
 'Estonian',
 'Japanese',
 'Russian',
 'Beninese',
 'Indian',
 'Pakistani',
 'Native American',
 'Peruvian',
 'Bosnian',
 'Canadian Inuit',
 'Guatemalan',
 'Azerbaijani',
 'Costa Rican',
 'Ghanaian',
 'New Zealander',
 'Ukrainian',
 None,
 'Kuwaiti',
 'Bolivian',
 'German',
 'Czech',
 'Malian',
 'Georgian',
 'Brazilian',
 'Nationality unknown',
 'Australian',
 'Macedonian',
 'Panamanian',
 'Bulgarian',
 'Finnish',
 'Iranian',
 'American',
 'Serbian',
 'French',
 'Salvadoran',
 'Czechoslovakian',
 'Vietnamese',
 'Mexican',
 'Austrian',
 'Sahrawi',
 'Cuban',
 'Ethiopian',
 'South African',
 'Egyptian',
 'Belgian',
 'Ecuadorian',
 'Palestinian',
 'Slovak',
 'Sierra Leonean',
 'Singaporean',
 'Irish',
 'Dutch',
 'Cameroonian',
 'Tanzanian',
 'Norwegian',
 'Afghan',
 'Nicaraguan',
 'Coptic',
 'Korean',
 'Taiwanese',
 'Chinese',
 'Chilean',
 'Filipino',
 'Argentine',
 'Persian',
 'Puerto Rican',
 'Sudanese',
 'Ugandan',
 'Namibian',
 'Burkinabe',
 'Latvian',
 'Iv

In [14]:
north_american = ['American',
                  'Canadian',
                  'Moroccan and American',
                  'Canadian Inuit',
                  'Native American',
                  'American and Mexican']

european = ['French', 'Dutch', 'Italian', 'Spanish', 'German',
            'Austrian', 'Finnish', 'Swedish', 'Swiss',
            'British', 'Czech', 'Belgian', 'Russian-Lithuanian', 
            'English', 'Greek', 'Norwegian', 'Latvian', 'Polish', 
            'Milanese', 'Danish', 'Netherlandish', 'Flemish',
            'Scottish', 'Hungarian', 'Yugoslav', 'Catalan', 
            'Florentine', 'Venetian', 'Irish', 'Icelandic', 
            'Slovene', 'Bosnian', 'Croatian', 'Luxembourgish']

In [16]:
# Your code here