## Missing values
By the end of this lecture you will be able to:
- identify missing values in a `DataFrame`
- count the number of missing values in a column
- find and drop `null` or non-`null` values

In [None]:
import polars as pl
import polars.selectors as cs

### Missing values in Polars
Polars represents missing values with `null` for every dtype. We can create them manually with Python's `None`.

We create a simple `DataFrame` where the rows have:
- all `null` values
- some `null` values
- one `null` value

In [None]:
df = pl.DataFrame(
    {
        "col1":[None,2,3,4],
        "col2":[None,None,5,6],
        "col3":[None,None,None,7]
    }
)
df

> In Pandas a missing value can be represented with a `null`, `NaN` or `None` value depending on the dtype of the column. Polars also allows `NaN` values in floating point columns to represent non-numeric values (e.g. the result of dividing zero by zero). This use of `NaN` is distinct from missing values.

### Metadata on `null` values
Polars stores metadata about `null` values for each column in a `DataFrame`.

#### Null count
Polars stores a count of how many `null` values there are in each column. We can access this with the `null_count` method on a single column or on all the columns of a `DataFrame`.

In [None]:
df.null_count()

Polars keeps track of the `null_count` at all times so this is a cheap operation regardless of the size of the column.

### Finding `null` values

We use the `is_null` expression to check whether each value is `null`, and `is_not_null` for the converse.

In [None]:
(
    df
    .select(
        [
            pl.col("col1"),
            pl.col("col1").is_null().alias("is_null"),
            pl.col("col1").is_not_null().alias("is_not_null")
        ]
    )
)

### Filtering by `null` values

#### Filtering on a single column
We can use these methods to filter by `null` or non-`null` values on a single column.

In this example we keep all rows where the value in `col1` is not `null`.

In [None]:
(
    df
    .filter(
        pl.col("col1").is_not_null(),
    )
)

#### Filtering by `null` values in multiple columns

In this example we want to remove rows where **all** values are `null`. We can do this using:
- `pl.all().is_not_null()` to give `True` values where we get non-`null` values
- `pl.any_horizontal` to find if there is at least one `True` value in a row

In [None]:
df

In [None]:
(
    df
    .filter(
        pl.any_horizontal(pl.all().is_not_null())
    )
)

In this example we want to keep only the rows where there are no `null` values.

In [None]:
(
    df
    .filter(
        pl.all_horizontal(pl.all().is_not_null())
    )
)

### Using the `drop_nulls` method

Polars has a convenience `drop_nulls` method for dropping rows where any value is `null`.

In [None]:
(
    df
    .drop_nulls()
)

We can also specify a subset of columns to apply the condition to.

In [None]:
(
    df
    .drop_nulls(subset=["col1","col2"])
)

## Exercises
In the exercises you will develop your understanding of:
- counting the `null` values
- filtering by `null` values

### Exercise 1
Count the number of `null` values in each column of the Titanic data

In [None]:
csv_file = "../data/titanic.csv"
(
    pl.read_csv(csv_file)
    <blank>
)

Remove the rows where `Cabin` is `null` and count the `null` values for all columns again

### Exercise 2
Find all the rows for which the `Age` is `null`

In [None]:
csv_file = "../data/titanic.csv"
(
    pl.read_csv(csv_file)
    <blank>
)

Find all the rows for which either the `Age` or the `Cabin` is `null`

Use the Selectors API (imported above as `cs`) to select the columns

## Solutions
### Solution to Exercise 1
Count the number of `null` values in each column of the Titanic data

In [None]:
csv_file = "../data/titanic.csv"
(
    pl.read_csv(csv_file)
    .null_count()
)

Remove the rows where `Cabin` is `null` and count the `null` values for all columns again

In [None]:
csv_file = "../data/titanic.csv"
(
    pl.read_csv(csv_file)
    .filter(pl.col("Cabin").is_not_null())
    .null_count()
)

### Solution to Exercise 2
Find all the rows for which the `Age` is `null`

In [None]:
csv_file = "../data/titanic.csv"
(
    pl.read_csv(csv_file)
    .filter(pl.col("Age").is_null())
    .head()
)

Find all the rows for which either the `Age` or the `Cabin` is `null`

Use the Selectors API (imported above as `cs`) to select the columns

In [None]:
csv_file = "../data/titanic.csv"
(
    pl.read_csv(csv_file)
    .filter(
        pl.any_horizontal(cs.matches("Age|Cabin").is_null())
    )
    .select(
        cs.matches("Age|Cabin")
    )
)