## Selecting columns 5: Transforming and adding multiple columns
By the end of this lesson you will be able to:
- transform multiple columns in-place
- add multiple columns
- transform and add multiple columns in less verbose ways

In [None]:
import polars as pl
import polars.selectors as cs

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

## Transforming existing columns

We can transform multiple existing columns by passing either a `list` of expressions or comma-separated expressions to `with_columns`.

Here we pass comma-separated expressions to round the float columns to 0 decimal places.

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Age').round(0),
        pl.col('Fare').round(0),
    )
    .head(3)
)

We can make this less verbose, however.

As we are applying the same transformation to the `Age` and `Fare` columns, we can pass both to a single `pl.col` as comma-separated column names

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Age','Fare').round(0),
    )
    .head(5)
)

In this example `Age` and `Fare` are the only float columns. This means that we can instead pass their dtype to `pl.col` to apply the `round` expression to all float columns

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col(pl.Float64).round(0),
    )
    .head(3)
)

Or we can use selectors to select the columns that we want to round

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        cs.float().round(0),
    )
    .head(3)
)

## Adding new columns from existing columns
Above we overwrite the existing `Age` and `Fare` columns in the `with_columns` statements

We can instead create new columns from existing columns with `alias`. 

In this example we add the rounded `Age` and `Fare` as new columns

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col('Age').round(0).alias('Age_round'),
        pl.col('Fare').round(0).alias('Fare_round')
    )
    .select(
        'Age','Age_round','Fare','Fare_round',
    )
    .head(3)
)

As an alternative to `alias` we can use comma-separated keyword assignments

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        Age_round = pl.col('Age').round(0),
        Fare_round = pl.col('Fare').round(0),
    )
    .select(
        'Age','Age_round','Fare','Fare_round',
    )
    .head(3)
)

Note that if you mix the `alias` and keyword assignment approach in the same `with_columns` the keyword assignments must come after the `alias` expressions.

When should you use `alias` and when should you use the keyword approach?
- There is no performance difference between the `alias` and keyword approach
- You might find the keyword approach more readable in some cases
- You can pass a Python variable as the column name to `alias`, whereas the name in a keyword assignment has to be written out literally

## Creating new columns when working with multiple expressions
We can still use the less verbose multi-expression approaches we saw above when we want to create new columns.

In this example we add rounded versions of the float columns as new columns by appending the `_round` suffix to their names with `name.suffix`

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col(pl.Float64).round(0).name.suffix("_round"),
    )
    .select(
        'Age','Age_round','Fare','Fare_round',
    )
    .head(3)
)

Using `name.suffix` (or `name.prefix`) is particularly useful when doing aggregations on lots of columns in a `group_by.agg`, as we see later in the course.

## Exercises

In the exercises you will develop your understanding of:
- overwriting existing columns
- adding multiple columns
- transforming multiple columns based on dtype

### Exercise 1
Convert the 64-bit integer and float columns to their 32-bit equivalents

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head()
)

Continue by adding 
- a `family_size` column as the sum of the siblings, parents and the passenger
- a Boolean `over_thirty` column showing if a passenger is aged 30 or over

Add these columns using keyword assignment

### Exercise 2
We have the following fictitious dataset with sales figures of bikes in different countries.

In [None]:
dfb = pl.read_parquet("../data/bike_sales.parquet")
dfb.head()

The monetary values are in the local currency but we want to compare them in US dollars. 

In order to do this we join the following `DataFrame` with the foreign-exchange rates to US dollars

In [None]:
fx_df = (
    pl.DataFrame(
        {
            "country":['Germany', 'Canada', 'Australia', 'United States', 'United Kingdom', 'France'],
            "fx_rate":[1.25,2.0,2.5,1.0,1.5,1.25]
        }
    )
)

dfb = (
    dfb
    .join(fx_df,on="country",how="left")
)
dfb.head()

We now have a column called `fx_rate`.

We learn more about `joins` later in the course.

Convert the monetary columns to a float dtype.

Note that some column names contain whitespace (I recommend printing them out)

Do this conversion to float dtype in a single expression

In [None]:
(
    dfb
    .with_columns(
        <blank>
    )
    .head()
)

Continue by adding a new `with_columns` statement where, for each monetary column, we add a column that holds the US dollar equivalent amount. We do this conversion by multiplying the monetary columns by `fx_rate`.

- Select the monetary columns using `cs.matches`
- Add `"_usd"` to the new column name
- Ensure you enclose the conversion in `()` before renaming the expressions

## Solutions

### Solution to Exercise 1
Convert the 64-bit integer and float columns to their 32-bit equivalents

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col(pl.Float64).cast(pl.Float32),
        pl.col(pl.Int64).cast(pl.Int32),
    )
    .head()
)

Continue by adding
- a `family_size` column as the sum of the siblings, parents and the passenger
- a Boolean `over_thirty` column showing if a passenger is aged 30 or over

Do this using keyword assignment

In [None]:
(
    pl.read_csv(csv_file)
    .with_columns(
        pl.col(pl.Float64).cast(pl.Float32),
        pl.col(pl.Int64).cast(pl.Int32),
    )
    .with_columns(
        family_size = pl.col("SibSp")+pl.col("Parch")+1,
        over_thirty = pl.col("Age")>=30
    )
    .head()
)

### Solution to Exercise 2
We have the following fictitious dataset with sales figures of bikes in different countries.

In [None]:
dfb = pl.read_parquet("../data/bike_sales.parquet")
dfb.head()

The monetary values are in the local currency but we want to compare them in US dollars. 

In order to do this we join the following `DataFrame` with the foreign-exchange rates to US dollars

In [None]:
fx_df = (
    pl.DataFrame(
        {
            "country":['Germany', 'Canada', 'Australia', 'United States', 'United Kingdom', 'France'],
            "fx_rate":[1.25,2.0,2.5,1.0,1.5,1.25]
        }
    )
)

dfb = (
    dfb
    .join(fx_df,on="country",how="left")
)
dfb.head()

We now have a column called `fx_rate`.

We learn more about `joins` later in the course.

Convert the monetary columns to a float dtype.

Note that some column names contain whitespace (I recommend printing them out)

In [None]:
dfb.columns

Do this conversion to float dtype in a single expression

In [None]:
(
    dfb
    .with_columns(
        pl.col('unit cost','unit price','cost','revenue').cast(pl.Float64)
    )
    .head()
)

Continue by adding a new `with_columns` statement where, for each monetary column, we add a column that holds the US dollar equivalent amount. We do this conversion by multiplying the monetary columns by `fx_rate`.

- Select the monetary columns using `cs.matches`
- Add `"_usd"` to the new column name
- Ensure you enclose the conversion in `()` before renaming the expressions

In [None]:
(
    dfb
    .with_columns(
        pl.col('unit cost','unit price','cost','revenue').cast(pl.Float64)
    )
    .with_columns(
        (cs.matches("cost|price|revenue")*pl.col("fx_rate")).name.suffix("_usd")
    )
    .head()
)