Column selection is often more complex than passing a list of column names to a
transformer: it may be necessary to select all columns that have a specific data
type, or to select them based on some other characteristic (presence of nulls,
column cardinality, etc.).

The skrub `selectors` implement several selection strategies, which can be
combined to build complex filtering conditions and then passed to
`ApplyToCols`, `ApplyToFrame`, `SelectCols` and `DropCols`.

## Skrub selectors
Selectors are available from the `skrub.selectors` namespace:

In [None]:
import skrub.selectors as s

We will use this example dataframe to test some of the selectors: 

In [None]:
import pandas as pd
import datetime

data = {
    "int": [15, 56, 63, 12, 44],
    "float": [5.2, 2.4, 6.2, 10.45, 9.0],
    "str1": ["public", "private", None, "private", "public"],
    "str2": ["officer", "manager", "lawyer", "chef", "teacher"],
    "bool": [True, False, True, False, True],
    "cat1": pd.Categorical(["yes", "yes", None, "yes", "no"]),
    "cat2": pd.Categorical(["20K+", "40K+", "60K+", "30K+", "50K+"]),
    "datetime-col": [
        datetime.datetime.fromisoformat(dt)
        for dt in [
            "2020-02-03T12:30:05",
            "2021-03-15T00:37:15",
            "2022-02-13T17:03:25",
            "2023-05-22T08:45:55",
        ]
    ]
    + [None],
}
df = pd.DataFrame(data)
df

Selectors should be used in conjunction with the transformers described in the 
previous chapter: `ApplyToCols`, `ApplyToFrame`, `SelectCols` and `DropCols`. 

Selectors can filter columns by data type:

- `.float`: floating-point columns
- `.integer`: integer columns
- `.any_date`: date or datetime columns
- `.boolean`: boolean columns
- `.string`: columns with a String data type
- `.categorical`: columns with a Categorical data type
- `.numeric`: numeric (either integer or float) columns

In [None]:
from skrub import SelectCols
string_selector = s.string()

SelectCols(cols=string_selector).fit_transform(df)

Additional conditions include:

- `.all`: select all columns
- `.cardinality_below`: select all columns with a number of unique values lower
than the given `threshold`
- `.has_nulls`: select all columns that include at least one null value

In [None]:
SelectCols(cols=s.has_nulls()).fit_transform(df)

Several selectors choose columns based on their name:

- `.cols`: select the columns with the provided name (or names)
    - note that transformers that accept selectors can also take a column name
    or a list of column names directly
- `.glob`: use Unix shell style `glob` to select column names
- `.regex`: select columns using regular expressions

In [None]:
SelectCols(cols=s.glob("cat*")).fit_transform(df)

## Combining selectors

Selectors can be inverted using `.inv` or the logical operator `~` to
select all _other_ columns, and they can be combined using the `&` and `|`
logical operators. Columns can also be removed from a selection with `-`.

For example, to select all datetime columns, together with all string columns
that do not contain nulls, we can do:

In [None]:
SelectCols(cols=s.any_date() | (s.string() & ~s.has_nulls())).fit_transform(df)

## Extracting selected columns
Selectors provide the `expand` and `expand_index` methods, which take a dataframe
and return the names and the positions, respectively, of the columns that the
selector matches:

In [None]:
has_nulls = s.has_nulls()
has_nulls.expand(df)

This can be used, for example, to pass a list of column names to a dataframe library:

In [None]:
df.drop(columns=has_nulls.expand(df))

## Designing custom filters
Finally, it is possible to define function-based selectors using `.filter` and 
`.filter_names`. 

`.filter` selects the columns for which a user-defined `predicate` function
returns `True`. Additional arguments given to `.filter` are passed on to the
predicate.
For example, it is possible to select the columns whose fraction of nulls exceeds
a threshold by defining a function like the following:

In [None]:
import pandas as pd
import skrub.selectors as s
from skrub import DropCols

df = pd.DataFrame({"a": [None, None, None, 1], "b": [1, 2, 3, 4]})

def more_nulls_than(col, threshold=0.5):
    # Fraction of null values in the column, compared with the threshold.
    return col.isnull().sum() / len(col) > threshold

DropCols(cols=s.filter(more_nulls_than, threshold=0.5)).fit_transform(df)

`.filter_names` is similar to `.filter` in that it takes a predicate function,
but in this case the function is evaluated on each column name rather than on
the column itself.

If we define this example dataframe:

In [None]:
from skrub import selectors as s
import pandas as pd
df = pd.DataFrame(
    {
        "height_mm": [297.0, 420.0],
        "width_mm": [210.0, 297.0],
        "kind": ["A4", "A3"],
        "ID": [4, 3],
    }
)
df

We can select all the columns that end with `"_mm"` as follows: 

In [None]:
selector = s.filter_names(lambda name: name.endswith('_mm'))
s.select(df, selector)

## Exercise: using selectors together with `ApplyToCols`
Consider this example dataframe:

In [None]:
import pandas as pd

df = pd.DataFrame(
    {
        "metric_1": [10.5, 20.3, 30.1, 40.2],
        "metric_2": [5.1, 15.6, None, 35.8],
        "metric_3": [1.1, 3.3, 2.6, .8],
        "num_id": [101, 102, 103, 104],
        "str_id": ["A101", "A102", "A103", "A104"],
        "description": ["apple", None, "cherry", "date"],
        "name": ["Alice", "Bob", "Charlie", "David"],
    }
)
df

Using the skrub selectors and `ApplyToCols`:

- Apply the `StandardScaler` to numeric columns, except `"num_id"`. 
- Apply a `OneHotEncoder` with `sparse_output=False` on all string columns except
`"str_id"`. 

In [None]:
import skrub.selectors as s
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from skrub import ApplyToCols
from sklearn.pipeline import make_pipeline

# Write your solution here
# 
# 
# 
# 
# 
# 
# 
# 
# 

In [None]:
import skrub.selectors as s
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from skrub import ApplyToCols
from sklearn.pipeline import make_pipeline

scaler = ApplyToCols(StandardScaler(), cols=s.numeric() - "num_id")
one_hot = ApplyToCols(OneHotEncoder(sparse_output=False), cols=s.string() - "str_id")

transformer = make_pipeline(scaler, one_hot)

transformer.fit_transform(df)

Given the same dataframe and using selectors, drop only string columns that contain
nulls. 

In [None]:
from skrub import DropCols

# Write your solution here
# 
# 
# 
# 
# 
# 
# 

In [None]:
from skrub import DropCols

DropCols(cols=s.has_nulls() & s.string()).fit_transform(df)

Now write a custom function that selects columns where all values are lower than
`10.0`. 

In [None]:
from skrub import SelectCols

# Write your solution here
# 
# 
# 
# 
# 
# 
# 

In [None]:
from skrub import SelectCols

def lower_than(col):
    # True only if every value in the column is below 10.0.
    return (col < 10.0).all()

SelectCols(cols=s.numeric() & s.filter(lower_than)).fit_transform(df)