# DataFrames II

## The fill_null Method
- Polars uses `null` to represent a missing value.

- The `fill_null` method replaces missing values with a replacement that we specify.
- We can provide a constant value, a fill strategy, or a Polars expression.

- The `strategy` parameter selects a fill strategy. For example, `"forward"` replaces a missing value with the previous present value and `"backward"` replaces it with the next present value.

- Let's select the first 5 rows, then add a `title` column to the `DataFrame`.
- We can use the title as the best match for a missing `department` column value.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/missing-data/#filling-missing-data
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.fill_null.html

## Interpolation
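- The `interpolate` method replaces missing values with estimates computed from the surrounding present values. The default technique is linear interpolation.
- Linear interpolation draws a straight line between the two nearest present values and fills in the gaps along that line.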

### Further Reading
- https://docs.pola.rs/user-guide/expressions/missing-data/#fill-with-interpolation
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.interpolate.html

## Dropping Missing Data
- The `drop_nulls` method on an expression removes `null` values from the target column.
- If there are missing values, the new column will be shorter than the original one.
- The `with_columns` method attaches new columns to the end of the `DataFrame` and expects columns of equal length.

- The `drop_nulls` method on a `DataFrame` removes rows that have one or more `null` values.
- The `subset` parameter limits the columns that Polars uses to identify `null` values.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.drop_nulls.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.drop_nulls.html

## Sorting by Single Column
- Sorting changes the order of rows based on one or more columns' values.
- The number of rows in the `DataFrame` remains the same in the sorted result.

- The default sort order is ascending (smallest to largest, alphabetical, earliest to latest).
- Pandas has an `ascending` parameter. Polars has a `descending` parameter which defaults to `False`.

- An ascending sort orders dates from earliest to latest.

- Pass `True` to the `descending` parameter to sort in descending order.

- The `nulls_last` parameter can force `null` (missing) values to the end of the sort.

- Polars sorts uppercase letters before lowercase ones.

- We can also sort an individual column in an expression.
- Say we are assigning every employee a different existing salary.
- The original connection between a row and its salary will be lost.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sort.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.sort.html

## Sorting by Multiple Columns I
- Pass multiple strings to the `sort` method to sort by multiple columns.
- Polars will apply a uniform ascending sort to each column by default.

- Polars will place `null` values first, then sort by department, then sort by each name within each department.

## Sorting by Multiple Columns II
- Polars sorts each column in ascending order by default.
- Pass the `descending` parameter a list of Booleans to customize sort order by column.
- The number of entries in the list must match the number of columns.

- Provide both `True` and `False` to sort in different order per column.
- `[True, False]` sorts `department` in descending order and `name` in ascending order within each department.
- `[False, True]` sorts `department` in ascending order and `name` in descending order within each department.

## Characters vs Bytes

- Earlier, we introduced the `name` attribute/namespace for methods that deal with column names.
- As always, the methods return a new expression that we can apply in a specific Polars context.

- The `str` namespace contains methods for string manipulations.
- The `to_lowercase` and `to_uppercase` methods convert the values in a column to lowercase or uppercase.

- The `str.len_chars` method returns a count of characters in a string.
- The complementary `str.len_bytes` method counts the bytes in a string.
- One English alphabetic character occupies one byte in memory.
- The 1-to-1 relationship is not always true. An emoji like 🍕 has 1 character but occupies 4 bytes.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/strings/#the-string-namespace
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.to_lowercase.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.to_uppercase.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.len_bytes.html
- https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.len_chars.html

## Sorting with Expressions
- We can pass an expression to the `sort` method to customize the sort.

- We can use the expression as the basis of a custom sort.
- Let's sort the rows based on the lengths of the employee's names.
- Custom expressions can be combined with plain column expressions.

## The top_k and bottom_k Methods
- The `top_k` method extracts a specified number of the greatest/maximum values.
- The `bottom_k` method extracts a specified number of the smallest/minimum values.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.top_k.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.bottom_k.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.top_k.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.bottom_k.html

## The rank Method
- The `rank` method assigns each value a rank based on its position in the sorted order.
- By default, the smallest value has a rank of 1.

- Let's rank the employees by salary. The employee with the greatest salary should be 1.
- Pass the `descending` parameter a value of `True` to rank from largest to smallest.
- Multiple occurrences of the same value will share a rank.
- The ranking will then pick up from the next logical value.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.rank.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cast.html

## The shuffle Method
- The `shuffle` method randomizes the order of elements in a column.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.shuffle.html

## Counting and Extracting Unique Values
- The `n_unique` method counts the exact number of unique values in a column.
- Polars includes a `null` (missing) value in the count.
- The `approx_n_unique` method returns an approximate count of unique values.

- For small datasets, the `approx_n_unique` method will likely return a perfect value.
- For larger datasets, the `approx_n_unique` method may be slightly off but will run faster.
- The more unique values that a column holds, the greater chance that `approx_n_unique` will be off.
- The `approx_n_unique` method does not support datetime columns.

- The `unique` method returns the unique values from the specified column.
- Each unique value is listed once.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/basic-operations/#counting-unique-values
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.unique.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.n_unique.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.approx_n_unique.html

## The value_counts Method
- The `value_counts` method counts the number of occurrences of each unique value.
- The `value_counts` method returns a column of structs.
- A struct is a data structure that is comparable to a Python dictionary. It consists of key-value pairs.
- Each row's struct holds two pieces of data, the `department` name and its count.

- The `unnest` method on a `DataFrame` extracts each row's struct's values into separate columns.
- The struct's key-value pairs are "nested" within the struct -- this "unnests" them.

- Pass `True` to the `sort` parameter to list the most frequently occurring value first.

- Set `normalize` to `True` to see the relative proportion of each unique value.

- Notice the column name is `proportion` instead of `count`.
- The column name comes directly from the key inside the struct.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.value_counts.html