## Transforming a `DataFrame`
In this lecture you will learn how to:
- rename, drop and re-order the columns of a `DataFrame`
- change the dtypes of columns
- transform a `DataFrame` in a function using `pipe`

In [None]:
import polars as pl
import polars.selectors as cs
# Set the number of rows to be printed to 6
pl.Config.set_tbl_rows(6)

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)
df.head(2)

## Renaming columns
We can rename columns by passing a `dict` that maps old names to new names.

In [None]:
(
    df
    .rename({"PassengerId":"ID"})
    .head(2)
)

## Dropping columns

We can drop columns by passing a `list` of column names.

In [None]:
(
    df
    .drop(["PassengerId","Pclass"])
    .head(2)
)

Or we can pass the column names as separate, comma-separated arguments.

In [None]:
(
    df
    .drop("PassengerId","Pclass")
    .head(2)
)

## Re-ordering columns
We can re-order columns with a `list` in `select`.

In this example we re-order the columns in alphabetical order

In [None]:
(
    df
    .select(sorted(df.columns))
    .head(2)
)

## Changing dtypes
We can change dtypes within an expression using `pl.col(...).cast()` but we can also call `cast` with a `dict` argument on a DataFrame.

In this example we cast the `Survived` column from integer to string

In [None]:
(
    df
    .cast(
        {
            "Survived":pl.Utf8
        }
    )
    .head(2)
)

We can also cast an entire `DataFrame`

In [None]:
(
    df
    .cast(pl.Utf8)
    .head(2)
)

Or we can use selectors to cast a group of columns, in this case all of the numeric columns

In [None]:
(
    df
    .cast(
        {
            cs.numeric():pl.Utf8
        }
    )
    .head(2)
)

## Transforming `DataFrames` in a function

We may want to capture some `DataFrame` transformations in a function. This can be to:
- re-use the same transformations multiple times
- make code easier to read or
- make the transformations testable

If our function:
- takes a `DataFrame` (and optionally some other arguments) as an input and
- outputs a `DataFrame`

then we can use the `pipe` method.

In this example we define a function that makes all string columns uppercase

In [None]:
def uppercase_all_strings(df):
    return (
        df
        .with_columns(
            pl.col(pl.Utf8).str.to_uppercase()
        )
    )

We can pipe the `DataFrame` to this function as follows

In [None]:
(
    df
    .pipe(uppercase_all_strings)
)

One advantage of the `pipe` method is that it lets us access `DataFrame` metadata, such as the column names, even when we are using method chaining and have not assigned the `DataFrame` to a variable.

In the following example the query starts by scanning a CSV file in lazy mode, so we have a `LazyFrame` that is not assigned to a variable. We want to re-order the columns alphabetically without breaking the method chain.

We can do this with `pipe`, which passes the `LazyFrame` to a function where we can refer to it by a temporary name.

In this example we sort the columns alphabetically inside a `lambda` function using `pipe`

In [None]:
(
    pl.scan_csv(csv_file)
    .pipe(
        lambda temp_df: temp_df.select(sorted(temp_df.columns))
    )
    .columns
)

The transformations in `pipe` are passed to the query optimiser in lazy mode.

In this example we only use the first three columns in the `select`

In [None]:
print(
    pl.scan_csv(csv_file)
    .pipe(
        lambda temp_df: temp_df.select(sorted(temp_df.columns[:3]))
    )
    .explain()
)

The query plan shows that the optimiser has worked out that only 3 columns are required, so only those columns need to be read from the CSV file.

### Function arguments using `pipe`
The key points about `pipe` are that:
- a `DataFrame` is the first argument and
- only a `DataFrame` is output

We can pass additional arguments to the function through `pipe`, either by position or by keyword

In [None]:
def _multiply_floats(df: pl.DataFrame, multiplication_factor: int) -> pl.DataFrame:
    # Select the 64-bit float columns and multiply them inside the expression
    return df.select(pl.col(pl.Float64) * multiplication_factor)

(
    df
    .pipe(
        _multiply_floats,
        multiplication_factor=3,
    )
    .head(3)
)


## Exercises
In the exercises you will develop your understanding of:
- renaming columns
- dropping columns
- transformations using `pipe`

### Exercise 1
Drop the `Age` and `Fare` columns from the `DataFrame`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

Cast all of the integer columns to 16-bit integers

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

### Exercise 2
Rename the `Age` column to `age`

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

Rename all column names to lower case. Expand the cell below if you would like a hint

In [None]:
#Hint: do the renaming inside .pipe
#Hint: use the Python method .lower() on column name strings

In [None]:
(
    pl.read_csv(csv_file)
    <blank>
    .head(3)
)

## Solutions

### Solution to exercise 1
Drop the `Age` and `Fare` columns from the `DataFrame`

In [None]:
(
    pl.read_csv(csv_file)
    .drop(["Age","Fare"])
    .head(3)
)

Cast all of the integer columns to 16-bit integers

In [None]:
(
    pl.read_csv(csv_file)
    .cast(
        {
            cs.integer():pl.Int16
        }
    )
    .head(3)
)

### Solution to exercise 2

Rename the `Age` column to `age`

In [None]:
(
    pl.read_csv(csv_file)
    .rename({"Age":"age"})
    .head(3)
)

Rename all column names to lower case.

In [None]:
(
    pl.read_csv(csv_file)
    .pipe(
        lambda df: df.rename({col: col.lower() for col in df.columns})
    )
    .head(3)
)