# Selecting columns 3: selecting multiple columns
By the end of this lecture you will be able to:
- select columns based on a regex
- select columns based on dtype
- use selectors

Polars has two ways of selecting multiple columns:
- the expression API with `pl.col` or `pl.all`
- the selectors API with polars selectors such as `cs.contains`

We see both of these in this lecture.

To use the selectors API we typically import it as `cs` alongside Polars:

In [None]:
import polars as pl
import polars.selectors as cs

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

### Selecting all columns from a `DataFrame`

We can select all columns by replacing `pl.col` with `pl.all`

In [None]:
(
    df
    .select(
        pl.all()
    )
    .head(3)
)

We can select all but a subset of columns with the `exclude` expression

In [None]:
(
    df
    .select(
        pl.exclude('PassengerId','Survived','Pclass')
    )
    .head(3)
)

This is a shorthand for `pl.all().exclude(...)`

### Selecting columns with a regex
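We can select columns with a regex by passing a string that starts with `^` and ends with `$`. Polars only treats the string as a regex when both anchors are present, otherwise it looks for a column with exactly that name. Note that we meet an easier approach to doing this with selectors below.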

The following regex looks for columns starting with `P` and uses the regex *wildcard* `.*` to show `P` can be followed by any characters.

In [None]:
(
    df
    .select(
        "^P.*$"
    )
    .head(3)
)

We can pass this regex to `pl.col` to apply transformations to these columns. In this example we take the `max` of each column

In [None]:
(
    df
    .select(
        pl.col("^P.*$").max()
    )
    .head(3)
)

### Selecting columns based on dtype
We can select all of the columns that have a particular dtype by passing the dtype to `pl.col`. I use this approach **a lot** in my Polars pipelines.

Here we select all the string columns with `pl.Utf8` - the string dtype object

In [None]:
(
    df
    .select(
        pl.col(pl.Utf8)
    )
    .head(3)
)

We can also pass a list of dtypes to `pl.col`. In this case we select both 64-bit integer and float columns

In [None]:
(
    df
    .select(
        pl.col([pl.Int64,pl.Float64])
    )
    .head(3)
)

Polars also gives us shortcuts such as `pl.NUMERIC_DTYPES` to select all numeric dtypes

In [None]:
(
    df
    .select(
        pl.col(pl.NUMERIC_DTYPES)
    )
    .head(3)
)

`pl.NUMERIC_DTYPES` is really just a collection of the underlying dtypes - we can see this if we print it

In [None]:
pl.NUMERIC_DTYPES

There are a number of other generic dtype objects - we find these from the `pl` namespace below

In [None]:
[el for el in dir(pl) if "_DTYPES" in el]

## Using the selectors API
The selectors API aims to make selecting multiple columns less verbose. 

For simple cases it mirrors the expression API. For example to select all columns we use `cs.all`

In [None]:
(
    df
    .select(
        cs.all()
    )
    .head(3)
)

We can also do selection by position with `first` or `last`

In [None]:
(
    df
    .select(
        cs.first()
    )
    .head(3)
)

The output of a selector is a standard Polars expression so we can follow it up with standard expression chaining

In [None]:
(
    df
    .select(
        cs.all().max()
    )
)

The selectors API works well in lazy mode and for streaming queries just as expressions do.

We can select columns by groups of dtype - including a group of all integer and floating point dtypes with `cs.numeric`

In [None]:
(
    df
    .select(
        cs.numeric()
    )
    .head(3)
)

We can select by name - in this example with a `~` operator to exclude the names listed

In [None]:
(
    df
    .select(
        ~cs.by_name("Pclass","Age")
    )
    .head(3)
)

As a simpler alternative to the regex example we saw earlier we can use string methods such as:
- `contains`
- `starts_with`
- `ends_with`
- `matches`

In this example we select all columns beginning with P

In [None]:
(
    df
    .select(
        cs.starts_with("P")
    )
    .head(3)
)

We can apply an OR condition by passing multiple strings

In [None]:
(
    df
    .select(
        cs.starts_with("P","A")
    )
    .head(3)
).columns

With the `matches` method we can pass a regex without the `^` and `$` we need for the expression API

In [None]:
(
    df
    .select(
        cs.matches("Age|Fare")
    )
    .head(3)
)

The difference between `cs.contains` and `cs.matches` is:
- `cs.contains` looks for all column names that contain the literal substring
- `cs.matches` looks for all column names that match the regex

### Intersection of selectors

To do an intersection of selector conditions we use the `&` operator to say both conditions must be fulfilled.

In this example we look for **numeric** columns that **contain A** in the column name

In [None]:
(
    df
    .select(
        cs.numeric() & cs.contains("A") 
    )
    .head(3)
)

### Union of selectors
To do a union operation we use the `|` operator to say at least one of the conditions must be satisfied

In [None]:
(
    df
    .select(
        cs.string() | cs.contains("P") 
    )
    .head(3)
)

### Difference of selectors
To do a difference operation we use a minus operator `-`.

In this example we select all string columns other than any column beginning with T

In [None]:
(
    df
    .select(
        cs.string() - cs.starts_with("T") 
    )
    .head(3)
)

# Exercises

In the exercises you will develop your understanding of:
- selecting all columns from a `DataFrame`
- excluding columns from a selection
- selecting columns with a dtype

### Exercise 1
We create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(30)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Select the `title` and `artist` columns using the expression API (and not selectors)

In [None]:
(
    spotify_df
    <blank>
    .head(3)
)

Select all string and date columns from the spotify `DataFrame` using the expression API

Select all string and date columns from the spotify `DataFrame` - except the `url` column using the expression API (and not selectors)

Select all string and date columns except the `url` column again, but use the selectors API

Select all the columns that start with `t` or `a`

Select all columns except the integer columns (using the `~` operator)

### Exercise 2
We create a `DataFrame` with temperature and rainfall data from some weather stations

In [None]:
df_weather = pl.DataFrame(
    [
        {
            "Month": "Jan",
            "Station_A (°C)": 20.5,
            "Station_B (°C)": 18.0,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
        {
            "Month": "Feb",
            "Station_A (°C)": 21.0,
            "Station_B (°C)": 18.5,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
    ]
)
df_weather

Select all the columns with `Station` in the column name using `cs.contains`

In [None]:
(
    df_weather
    .select(
        <blank>
    )
)

Use `cs.matches` to select all the columns with `Station` and `°C` in the column name

In [None]:
(
    df_weather
    .select(
        <blank>
    )
)

### Exercise 3
Convert the following Pandas code (that I've seen in the wild!) to Polars code using the expression API.

The Pandas code loops over the columns. Looping over columns in Polars is to be avoided at all costs.

In the loop we create a dictionary `maxDict` with the column names and maximum values

In [None]:
import pandas as pd
import numpy as np
df = pl.read_csv(csv_file)
dfPandas = df.to_pandas()

# Convert this code below to Polars in the following cell
maxDict = {}
for col in dfPandas.columns:
    if dfPandas[col].dtype == np.float64:
        maxDict[col] = [dfPandas[col].max()]
pd.DataFrame(maxDict)

In [None]:
(
    pl.read_csv(csv_file)
     <blank>
)

## Solutions

### Solution to Exercise 1
We create a `DataFrame` from the Spotify data

In [None]:
pl.Config.set_fmt_str_lengths(30)
spotify_csv = "../data/spotify-charts-2017-2021-global-top200.csv.gz"
spotify_df = pl.read_csv(spotify_csv,try_parse_dates=True)
spotify_df.head(3)

Select the `title` and `artist` columns using the expression API (and not selectors)

In [None]:
(
    spotify_df
    .select(
        pl.col("title","artist")
    )
    .head(3)
)

Select all string and date columns from the spotify `DataFrame` using the expression API

In [None]:
(
    spotify_df
    .select(
        pl.col(pl.Utf8,pl.Date)
    )
    .head(3)
)

Select all string and date columns from the spotify `DataFrame` - except the `url` column using the expression API (and not selectors)

In [None]:
(
    spotify_df
    .select(
        pl.col(pl.Utf8,pl.Date).exclude("url")
    )
    .head(3)
)

Select all string and date columns except the `url` column again, but use the selectors API

In [None]:
(
    spotify_df
    .select(
        cs.by_dtype(pl.Utf8,pl.Date) - cs.by_name("url")
    )
    .head(3)
)

Select all the columns that start with `t` or `a`

In [None]:
(
    spotify_df
    .select(
        cs.starts_with("t","a")
    )
    .head(3)
)

Select all columns except the integer columns (using the `~` operator)

In [None]:
(
    spotify_df
    .select(
        ~cs.integer()
    )
    .head(3)
)

### Solution to Exercise 2
We create a `DataFrame` with temperature and rainfall data from some weather stations

In [None]:
df_weather = pl.DataFrame(
    [
        {
            "Month": "Jan",
            "Station_A (°C)": 20.5,
            "Station_B (°C)": 18.0,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
        {
            "Month": "Feb",
            "Station_A (°C)": 21.0,
            "Station_B (°C)": 18.5,
            "Station_A (mm)": 12.0,
            "Station_B (mm)": 13.5,
        },
    ]
)
df_weather

Select all the columns with `Station` in the column name using `cs.contains`

In [None]:
(
    df_weather
    .select(
        cs.contains("Station")
    )
)

Use `cs.matches` to select all the columns with `Station` and `°C` in the column name

In [None]:
(
    df_weather
    .select(
        cs.matches("Station.*°C")
    )
)

### Solution to Exercise 3
Convert the following Pandas code to Polars
```python
import pandas as pd
import numpy as np
df = pl.read_csv(csv_file)
dfPandas = df.to_pandas()

# Convert this code below to Polars in the following cell
maxDict = {}
for col in dfPandas.columns:
    if dfPandas[col].dtype == np.float64:
        maxDict[col] = [dfPandas[col].max()]
pd.DataFrame(maxDict)
```

In [None]:
(
    pl.read_csv(csv_file)
    .select(
        pl.col(pl.Float64).max()
    )
)

Note that there is a better way to do this in Pandas (I just don't see this so often in the wild!)

In [None]:
df_pandas = df.to_pandas()
df_pandas.select_dtypes("float").max()