# DataFrames I

## Intro to DataFrames
- A `DataFrame` is a 2-dimensional table consisting of rows and columns.
- A `DataFrame` is a collection of `Series` glued together.
- The columns in a `DataFrame` can be of different data types but must be of the same length.
- The data within any single `Series` must be homogeneous (all values share one data type).

![DataFrame data structure](images/DataFrame.png)

## Create a DataFrame from Scratch
- The `pl.DataFrame` class constructor accepts a variety of inputs.
- One option is a dictionary that maps column names to column values.
- Pass a list for each column's values. The lengths of the lists must be equal.

- The output includes the shape/dimensions of the `DataFrame` (height x width).
- Polars also prints the data type of each column.

### Further Reading
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#dataframe
- https://docs.pola.rs/api/python/stable/reference/dataframe/index.html

## Read a DataFrame from CSV
- The `pl.read_csv` function imports a CSV file as a `DataFrame`.
- The function's parameters can customize details like which columns to target, how many rows to include, what values constitute a missing value, and more.
- Polars will output 10 `DataFrame` rows by default: the first 5 and the last 5 separated with a gap in between.
- Polars uses ellipses (`...`) to mark a gap in the data. The data is still present; it's just not printed.

- A **method** is a command, while an attribute is a piece of information.
- The `head` method returns rows from the start of the `DataFrame`.
- The `limit` method is an alias for `head`.
- The `tail` method returns rows from the end of the `DataFrame`.

- An **attribute** is a piece of data that lives on the object.
- Methods require parentheses. Attributes need a dot and the attribute name.
- The `columns` attribute provides a list of the column names.
- The `dtypes` attribute provides a list of the columns' data types.

- A **schema** is a "representation of a plan in an outline or model".
- The `schema` attribute returns a `Schema` object that connects each column to its type.
- Printing the `Schema` instance displays the relationships between columns and data types.
- The display format of data types may differ between the schema and the printed `DataFrame`.

- The `DataFrame` outputs its dimensions (height x width) when printed.
- The height is the number of rows, and the width is the number of columns.
- The `height` and `width` attributes provide the same shape information.
- The `shape` attribute is a tuple with the dimensions.

### Further Reading
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#head
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#tail
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#schema
- https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.columns.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.dtypes.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.schema.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.height.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.width.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.shape.html

## No Index, No Problem
- Unlike Pandas, Polars DataFrames do not create an ascending numeric index starting from 0.
- The `with_row_index` method will add an `index` column at the start of the `DataFrame`.

- The `offset` parameter sets the starting number.
- Use 1 for an index that starts from 1.

- The `name` parameter sets the index column name.

- The `read_csv` function also supports `row_index_name` and `row_index_offset` parameters for the same result.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.with_row_index.html

## Intro to Expressions
- An **expression** is an instruction for how to target or transform data.
- An **expression** is a future computation. It's an evaluation that only has significance when combined with a function/method.

- The `pl.col` function creates an expression that targets a column.
- The expression has no association with a specific `DataFrame`.
- We use the terminology "lazy" because the expression is a building block for a future operation.
- The expression's current logic is: Locate a `years_at_company` column.

- Expressions have methods. Each method returns a new expression with an expanded set of instructions.
- Let's create an expression that targets a column named `years_at_company` and calculates the average of its values.
- The expression is lazy. It has no knowledge of a specific `DataFrame`. It also doesn't know the data type of its column.
- The expression's current logic is: Locate the `years_at_company` column, take the average of its values.

### Further Reading
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expressions
- https://docs.pola.rs/api/python/stable/reference/expressions/col.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mean.html

## The select Method I
- An expression is a lazy transformation that does not execute until it is used in a specific _context_.
- A **context** is a situation that forces the computation.
- Most Polars work consists of combining expressions with methods.

- The `select` method executes one or more expressions.
- Polars gathers the results of the expression executions in a new `DataFrame`.
- Each `select` method expression creates a new `DataFrame` column.
- The simplest example of an expression is selecting a column. There are no additional transformations but the _selection_ of a column constitutes a step.
- If the expression targets an existing column, Polars will keep the column's name in the new `DataFrame`.

- The `mean` method transforms an expression to calculate the average of a column.
- In this scenario, the `select` method returns a `DataFrame` with a single row holding the average.
- The `years_at_company` column stores integers, but the average calculation produces a float.

### Further Reading
- https://docs.pola.rs/user-guide/getting-started/#select
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#contexts
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.select.html

## Renaming Columns
- Invoke the `alias` method on an expression to rename the targeted column.
- The `alias` method is another example of a method that returns a new expression from an existing expression.
- The expression's logic becomes: Locate a `years_at_company` column, calculate the average of its values, rename the column to `years`.

- Alternatively, pass the `select` method a keyword argument to set the name of the column.
- With either syntax option, Polars returns a new `DataFrame`.

- Polars will raise an error if multiple columns have the same name.

- The `select` method is non-mutative.
- The method returns a new `DataFrame`; it does _not_ modify the existing `DataFrame`.
- The inconsistency between copies versus in-place mutations was a pain point in Pandas.
- To keep the changes, assign the resulting `DataFrame` to a variable.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mean.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.sum.html

## The select Method II
- The `select` method supports a variety of syntax options.
- Polars recommends targeting every column with an individual `pl.col` expression.
- The separation of columns makes it easier to build different expressions for each column.

- The `select` method also accepts sequential string arguments or a list of the columns' names.
- Expression expansion is a shorthand notation that applies the same transformation to multiple columns.
- Polars will utilize whatever optimization enables calculations to execute in parallel.
- For example, Polars will convert sequential string arguments to `select` into expressions.

- The `pl.col` function creates an expression, a building block for future transformations.
- The `pl.col` function also supports sequential arguments as well as a list input.
- The `pl.col([])` syntax works well when applying consistent operations to the list's columns.

- Polars can perform aggregate operations on every targeted column in an expression.
- The following example calculates the largest value within both the `name` and `salary` columns.
- To be clear, this is _not_ Zachary Woods's salary. This is the highest salary in the dataset.

- Polars supports multiple expression arguments to `select`.
- This syntax is the most flexible because code can apply different logic to different columns.

- The next example extracts the greatest name (last alphabetical value) and the smallest salary.
- Polars applies different operations to different columns.

- Polars also supports a list of expressions as an argument to `select`.
- This syntax can be helpful when constructing dynamic lists.

### Further Reading
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#expression-expansion
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#function-col
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.select.html

## The select Method III: Targeting by Data Type
- The `pl.col` function also accepts Polars data types.
- All Polars data types are available at the top-level `pl` namespace.
- The `select` method will extract all columns of that data type.

- The `col` function supports multiple types.
- Data types must be exact. `pl.Int32` will not match a `pl.Int64` column.

## Expressions as Building Blocks
- Expressions are lazy steps to apply to a future computation.
- Expressions are reusable building blocks that exist independently of a specific `DataFrame`.
- Let's assign an expression to a variable and then reuse it across different `DataFrame`s.
- The `todos.csv` dataset has a `start_date` column.

- An expression does not care about the column's data type.
- The `read_csv` function supports a `try_parse_dates` parameter to convert columns to datetimes.
- Polars will read the `start_date` column as a `date` column rather than a string column.
- The `start_date` expression continues to work. 

## Expressions that Count Values
- Expressions support methods that create new expressions with expanded instructions.
- The `len` method counts all values in the column.
- The `count` method counts only present values.
- The `null_count` method counts null values.
- The methods return a `u32` (unsigned integer) because the count must be 0 or positive.

- The `describe` method returns a `DataFrame` of statistics for columns.
- We can invoke it on `employees` but it's not particularly helpful for non-numeric columns.
- Let's use `select` to target the two numeric columns, then invoke `describe` on the new `DataFrame`.

- Another way to solve this problem is targeting the columns by their shared `pl.Int64` data type.

### Further Reading
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#describe
- https://docs.pola.rs/user-guide/transformations/time-series/parsing/#parsing-dates-from-a-file
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.len.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.count.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.null_count.html

## Extracting One or More Rows
- Unlike Pandas, Polars does not maintain an index (a unique identifier for each row).
- Polars stores its data in columnar format. It is optimized for column operations.

- We can still extract a row's values by its index position.
- Polars will aggregate a row's values across all of its columns.
- The `row` method returns a tuple with a single row's values.
- A tuple is an immutable, ordered collection of values.

- The `slice` method extracts multiple rows by their index positions.
- The first argument is the starting row index.
- The second argument is the number of rows to extract.
- The next example starts at index 1 (row 2) and extracts 4 rows.
- The rows have index positions 1, 2, 3, and 4 in `employees`.

- The first argument supports negative values.
- A negative value starts relative to the end of the `DataFrame`.
- Example: `(-8, 3)` starts 8 rows from the end of the `DataFrame` and extracts 3 rows.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.row.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.slice.html

## List Slicing Syntax
- Polars supports the list slicing syntax from Python.
- This approach is generally discouraged by the Polars team. Prefer `slice` and method-based approaches.

- A single value targets a row by index position.

- The list slicing syntax targets multiple rows by index position.
- The ending position is exclusive; the row at that index will be excluded.
- Subtract the first index from the second index to calculate the number of rows.

- Omit the value before the colon to extract from the start of the `DataFrame`.

- Polars supports negative values in either or both positions.

- Omit the value after the colon to extract to the end of the `DataFrame`.

## Expressions that Target Row Values

- Invoke the `get` method on an expression to target a value in a specific column.

- This is one example where multiple expressions create more verbose code.
- We can pass multiple values to `pl.col` to have the expression target multiple columns.

- Passing `pl.col` a complete list of columns will quickly become verbose.
- The `pl.all` helper function returns an expression that targets all columns.

- Passing an asterisk (`*`) to `pl.col` targets all columns as well.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#selecting-all-columns
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.get.html

## Extracting a Single Value from DataFrame

- The `item` method extracts a single cell value by its row and column index position.
- The first argument is the row index, the second argument is the column index.
- Both the row and column index start counting from 0.

- Be careful. Null/missing values will not have a printed representation.
- Wrap the value in Python's `print` function to see the visual `None`.
- Python's `None` data type represents an absent/missing value.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.item.html

## The gather and gather_every Methods
- The `gather` method extracts multiple rows by index position.
- Pass the method a list with the index positions.

- The next example uses `pl.all` to target all columns, then extracts the rows at index positions 0, 100, and 200.

- The `gather_every` method extracts rows using a gap/interval.
- The next example targets every second row (index positions 0, 2, 4, etc).

- The `gather_every` method is also available on the `DataFrame`.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.gather.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.gather_every.html
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.gather_every.html

## Extracting a Random Set of Values

- The `sample` method extracts a number of random rows from the `DataFrame`.
- The new `DataFrame` may contain the rows in a different order than they appear in the original `DataFrame`.

- The `fraction` parameter extracts a portion/percentage of the original rows.
- The next example extracts 3.5% of the original rows.

### Further Reading
- https://docs.pola.rs/user-guide/concepts/data-types-and-structures/#sample
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.sample.html

## Casting Columns to Different Types
- The `cast` method converts a column's values into another data type.
- Pass the `cast` method a built-in Polars data type. For example, `pl.Float64` will attempt to convert to 64-bit floats.
- Polars is strict by default; it will raise an error if a single value cannot be converted.

- When applying an expression, Polars will keep the original column name by default.
- Polars will raise an error if multiple columns in a new `DataFrame` have the same name.
- The `select` method thus requires each expression to produce a column with a unique name.

- The `Int8` type supports the range of values from -128 to 127. Nobody will work 127 years for the company!
- The `UInt8` type supports the range of values from 0 to 255.


- We can cast multiple columns to new types in a single `select` call.

- Recall that `pl.col` can create an expression targeting multiple columns.
- Pass the `pl.col` function the data type to target.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/casting/#basic-example
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cast.html

## Customizing the DataFrame Schema
- The `schema_overrides` parameter overwrites Polars' inferred data type for specified columns.
- The parameter accepts a dictionary that maps column names to desired data types.

- The `schema` parameter is also available but it requires the explicit provision of every column.
- Polars will raise a `ComputeError` if a column is missing.

- The `schema` attribute prints out the current schema of the `DataFrame`.

- An alternative to specifying `pl.Date` or `pl.Datetime` types is using the `try_parse_dates` parameter.
- The `try_parse_dates` parameter attempts to parse strings as dates/datetimes (if Polars can figure out the format).
- The parameter accepts a Boolean.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/datatypes.html

## Renaming Columns
- The `select` method can rename columns in the new `DataFrame` but we have to specify all columns we want to include.

- The `rename` method accepts a dictionary where each key is an existing column name and each value is its new name.
- It renames the specified columns and leaves the other columns unchanged.

- The `read_csv` function includes a `new_columns` parameter that accepts a partial or complete list of column names in order.
- The next example overrides the first two column names.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.rename.html

## The name Attribute
- Polars nests additional expression methods under attributes/namespaces.
- The `name` attribute/namespace holds methods for adjusting column names.
- The `pl.all` function creates an expression that targets all columns.
- The `name.to_uppercase` method converts all column names to uppercase.
- The `name.to_lowercase` method converts all column names to lowercase.

- The `name.prefix` method concatenates a consistent piece of text before each column name.
- The `name.suffix` method concatenates a consistent piece of text after each column name.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.to_uppercase.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.to_lowercase.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.prefix.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.name.suffix.html

## Dropping Columns
- The `drop` method removes one or more columns from a `DataFrame`.
- Using `select` to target only the columns you'd like to keep is another valid strategy.

- A nonexistent column will trigger an error.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.drop.html

## Replacing Values
- The `replace` method swaps one value with another.
- The method accepts a list of values to replace and a second list of replacement values.
- Polars matches the values based on shared index positions.
- The method also accepts a dictionary of mappings.

- The code below replaces `HR` with `Human Resources` and `Engineering` with `Tech`.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.replace.html

## Mathematical Operations I
- Polars supports common mathematical operations on `DataFrames` and `Series`.
- Most mathematical symbols have complementary method equivalents.

- Operations on `null` (missing) values will produce `null` values.
- If using the `alias` method, wrap the full expression in parentheses.

- Polars provides method equivalents for the mathematical operators.

- The `equals` method compares the equality of two `DataFrames`.
- All row values and column names must be equal.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/basic-operations/#basic-arithmetic
- https://docs.pola.rs/user-guide/expressions/basic-operations/#comparisons
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.add.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.sub.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mul.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.truediv.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.floordiv.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.pow.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.mod.html

## Mathematical Operations II
- An expression can reference multiple columns.
- The expression below multiplies the `years_at_company` column values by the `salary` column values.

- Polars will gracefully handle type conversion in arithmetic operations if necessary.
- For example, multiplying an integer by a float will produce a float.

## Cumulative Mathematical Operations
- Polars includes cumulative methods that compute a running result row by row, plus `pct_change` for the change from one row to the next.

- The `cum_sum` method returns the cumulative sum up to and including that row.
- The `cum_count` method counts the number of present values (excluding `null`) up to and including each row.
- The `cum_max` method returns the largest value encountered so far.
- The `cum_min` method returns the smallest value encountered so far.
- The `cum_prod` method returns the cumulative product up to and including that row.
- The `pct_change` method returns the percent change between the current row and the previous one.

### Further Reading
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_sum.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_count.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_max.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_min.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.cum_prod.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.pct_change.html

## The with_columns Method
- The `with_columns` method creates a new `DataFrame` with all existing columns and the new columns from the expressions.
- The `with_columns` method adds the columns to the right end of the existing `DataFrame`.

- The `with_columns` method overwrites the original column if a new column has the same name.

- Polars prefers to execute expressions independently in parallel.
- An expression thus cannot depend on a column from another expression in the same call.
- Chain multiple `with_columns` methods to keep all existing columns, then create new columns based on generated ones.
- Technically, Python's walrus operator (`:=`) enables this behavior but `with_columns` is easier to reason about.

### Further Reading
- https://docs.pola.rs/user-guide/getting-started/#with_columns
- https://docs.pola.rs/user-guide/concepts/expressions-and-contexts/#with_columns
- https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.with_columns.html

## The all and exclude Functions
- The `pl.all` function returns an expression that targets all columns.
- The `with_columns` method is equivalent to `df.select(pl.all(), new_column_expressions)` behind the scenes.

- The `pl.exclude` function returns an expression that targets all columns _except_ for the specified ones.
- The next example targets all columns except for `start_date`.

- We can target columns of a specific data type in a `select` call.
- We can also exclude columns of a data type with the `pl.exclude` function.

### Further Reading
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#selecting-all-columns
- https://docs.pola.rs/user-guide/expressions/expression-expansion/#excluding-columns
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.all.html
- https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.exclude.html