# Altair Intro: Transform

In the grammar of graphics, a transform is the step that generates new data from existing data. For example, when plotting a histogram, the count associated with each bin is computed from the data according to the bin settings. Such new data can be generated either _before_ the chart creation process (e.g. using pandas), or _during_ the chart creation process (e.g. with Altair). In general, we recommend transforming the data _before_ creating a chart. However, as we will see in this session, it is sometimes preferable to transform the data as we generate the chart.

Altair supports many different types of data transforms. See a complete list [here](https://altair-viz.github.io/user_guide/transform/index.html). We will talk about a few of the most commonly used transforms.

## Setup

We will be using the cars dataset from `vega_datasets` as an example.  Here is a quick setup to create a base chart.

In [1]:
import altair as alt
import pandas as pd
from vega_datasets import data

base = alt.Chart(data.cars.url)

## Aggregation

The most common type of data transform is aggregation. Aggregation computes a summary statistic (such as a count, mean, or sum) of a selected field, usually within groups of observations. For example, we can count the number of observations associated with a key.

In [2]:
base.mark_bar()\
    .encode(y="Year:O", x="count()", color="Origin:N")

In the example above, `x="count()"` is an aggregation transform of the data. It counts the observations with the same `Year` and `Origin` value.

Here is a more interesting example:

In [3]:
base.mark_bar()\
    .encode(x="mean(Acceleration):Q", y="Cylinders:O")

Instead of counting, `x="mean(Acceleration):Q"` computes the mean of the `Acceleration` column over the rows that share the same `Cylinders` value, and uses it as the `x` coordinate. In fact, `x="mean(Acceleration):Q"` is a shorthand. Its full form is the following:

In [4]:
base.mark_bar()\
    .encode(x=alt.X("Acceleration:Q", aggregate="mean"), y="Cylinders:O")
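To illustrate how a shorthand string breaks down into the pieces used by the full form, here is a toy parser. It is only a sketch: the name `parse_shorthand_sketch` and the regular expression are made up for this illustration, it handles only the `Q`, `O`, `N` and `T` type codes, and it is not how Altair itself parses shorthand.

```python
import re

# Type codes handled by this toy parser.
TYPE_CODES = {"Q": "quantitative", "O": "ordinal", "N": "nominal", "T": "temporal"}

# Toy pattern: either "aggregate(field)" or a bare field name,
# optionally followed by ":X" where X is a type code.
SHORTHAND = re.compile(
    r"^(?:(?P<aggregate>\w+)\((?P<inner>\w*)\)|(?P<field>\w+))(?::(?P<type>[QONT]))?$"
)

def parse_shorthand_sketch(shorthand):
    """Split a shorthand string into aggregate, field and type parts."""
    match = SHORTHAND.match(shorthand)
    if match is None:
        raise ValueError(f"unrecognized shorthand: {shorthand!r}")
    parts = {}
    if match["aggregate"]:
        parts["aggregate"] = match["aggregate"]
    # The field is either inside the parentheses or the bare name.
    field = match["inner"] or match["field"]
    if field:
        parts["field"] = field
    if match["type"]:
        parts["type"] = TYPE_CODES[match["type"]]
    return parts

print(parse_shorthand_sketch("mean(Acceleration):Q"))
print(parse_shorthand_sketch("Cylinders:O"))
print(parse_shorthand_sketch("count()"))
```

The first result contains an aggregate, a field and a type, which correspond to the `aggregate="mean"` argument and the `"Acceleration:Q"` string in the full form above.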

Sometimes, it is useful to store the aggregated data with a name so that it can be reused in other parts of the chart. To achieve this, we need to use the `.transform_aggregate` method.

In [5]:
base.mark_bar()\
    .transform_aggregate(mean_acc="mean(Acceleration)", groupby=["Cylinders"])\
    .encode(x="mean_acc:Q", y="Cylinders:O")

In the above example, we have transformed the dataframe by creating a new column named `mean_acc`, which is the mean acceleration for each group defined using the `Cylinders` column. We can now reuse the column `mean_acc` in other parts of the chart such as the `x` encoding.  Note that the aggregated data frame only contains the aggregate column `mean_acc` and columns listed in the `groupby` argument.  All other columns are dropped.
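To make the effect of the aggregate transform concrete, here is a minimal plain-Python sketch of a grouped mean. The rows and the helper name `aggregate_mean` are made up for illustration; this is not Altair's implementation, only the same kind of computation on a tiny hand-written table.

```python
from collections import defaultdict

# Illustrative rows only; these are not actual values from the cars dataset.
rows = [
    {"Name": "car a", "Cylinders": 4, "Acceleration": 16.0},
    {"Name": "car b", "Cylinders": 4, "Acceleration": 18.0},
    {"Name": "car c", "Cylinders": 8, "Acceleration": 11.0},
    {"Name": "car d", "Cylinders": 8, "Acceleration": 12.0},
]

def aggregate_mean(rows, field, groupby, output):
    """Group rows by `groupby` and store the mean of `field` under `output`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row[field])
    # Each output row keeps only the group key and the aggregate;
    # other columns (such as Name) are not carried over.
    return [
        {groupby: key, output: sum(values) / len(values)}
        for key, values in groups.items()
    ]

aggregated = aggregate_mean(rows, "Acceleration", "Cylinders", "mean_acc")
print(aggregated)
```

The result has one row per `Cylinders` value, and the `Name` column no longer appears in it.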

## Binning

Another common data transform is binning. Binning groups continuous data into different bins. It is often used in conjunction with aggregation (i.e. once data is grouped into separate bins, we can compute some aggregate quantity within each bin).

In [6]:
base.mark_rect()\
    .encode(x=alt.X("Acceleration:O", bin=True),
            y=alt.Y("Weight_in_lbs:O", bin=True),
            color="count()")

In the example above, we created a rectangular heat map of the binned `Acceleration` column vs the binned `Weight_in_lbs` column. Altair chose the number of bins automatically, creating 9 bins for each variable here. This can be changed by replacing `bin=True` with, for example, `bin=alt.Bin(maxbins=20)`, which sets an upper limit on the number of bins.
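The combination of binning and counting can be sketched in plain Python as follows. The observations, the helper `bin_start`, and the bin widths (2 and 500) are assumptions chosen for illustration; Altair picks its own bin boundaries, so this shows the idea rather than the exact bins it would produce.

```python
import math
from collections import Counter

def bin_start(value, step):
    """Return the lower edge of the fixed-width bin that contains `value`."""
    return math.floor(value / step) * step

# Illustrative (acceleration, weight) pairs; not actual dataset values.
observations = [(12.0, 3504), (11.5, 3693), (15.5, 2372), (15.0, 2130), (19.0, 1835)]

# Assumed bin widths: 2 for acceleration, 500 for weight.
# Each observation is mapped to a pair of bins, and the pairs are counted.
counts = Counter(
    (bin_start(acc, 2), bin_start(weight, 500)) for acc, weight in observations
)

for (acc_bin, weight_bin), n in sorted(counts.items()):
    print(f"acceleration >= {acc_bin}, weight >= {weight_bin}: {n}")
```

Each key of `counts` corresponds to one rectangle of the heat map, and the count determines its color.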

Similar to aggregation, we can also store the binned data in a named variable in the data frame via the `transform_bin` method.

In [7]:
base.mark_rect()\
    .transform_bin("binned_acc", field="Acceleration")\
    .transform_bin("binned_weight", field="Weight_in_lbs")\
    .encode(x="binned_acc:O", y="binned_weight:O", color="count()")

In the example above, the binned `Acceleration` field is stored in the `binned_acc` column, and the binned `Weight_in_lbs` field is stored in the `binned_weight` column. Unlike `transform_aggregate`, `transform_bin` adds these columns to the data and keeps the existing ones. We are able to reproduce the above heat map using the `binned_acc` and `binned_weight` columns. Note that the axis tick labels are different. When using the shorthand form, Altair takes care of generating meaningful labels. When we use the `transform_bin` method, Altair assumes we want full control, and it is up to the user to generate meaningful labels.

## Filtering

Sometimes we only want to visualize a subset of the dataset. This is when a filtering transform is necessary. Usually, it is recommended to use pandas to filter the data frame before plotting. However, filtering as part of the chart creation process can be useful in particular when creating interactive visualizations.

In [8]:
base.mark_point()\
    .transform_filter(alt.datum.Origin=="USA")\
    .encode(x="Weight_in_lbs:Q", y="Miles_per_Gallon:Q")

In the example above, we used the condition `alt.datum.Origin=="USA"` to filter the dataset so that only observations of US cars remain. The object `alt.datum` is a special object representing the current observation (row in the data frame). Multiple filtering conditions can be combined using the `&` (and) and `|` (or) operators, with each condition wrapped in parentheses, e.g. `(alt.datum.Origin=="USA") & (alt.datum.Cylinders<=6)`. Python's `and` and `or` keywords do not work here, because they cannot be overloaded to build a filter expression.
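Conceptually, a filter transform evaluates a condition once per row and keeps the rows for which it holds. The following plain-Python sketch shows that idea with made-up rows; the function `keep` is a hypothetical stand-in for a condition written with `alt.datum`, not part of Altair.

```python
# Illustrative rows only; not actual values from the cars dataset.
rows = [
    {"Origin": "USA", "Cylinders": 8},
    {"Origin": "USA", "Cylinders": 4},
    {"Origin": "Japan", "Cylinders": 4},
    {"Origin": "Europe", "Cylinders": 6},
]

def keep(datum):
    """Condition evaluated on one row: a US car with at most 6 cylinders."""
    # These are ordinary Python booleans, so the `and` keyword is fine here.
    return datum["Origin"] == "USA" and datum["Cylinders"] <= 6

filtered = [row for row in rows if keep(row)]
print(filtered)
```

Only the second row satisfies both conditions, so it is the only one that remains after filtering.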

## Regression

Regression refers to the process of fitting a curve to the data. The following example fits a line to a scatter plot.

In [9]:
scatter_plot = base.mark_point(opacity=0.2)\
    .encode(x="Weight_in_lbs:Q", y="Miles_per_Gallon:Q")
scatter_plot + scatter_plot.transform_regression(
    "Weight_in_lbs", "Miles_per_Gallon", method="linear").mark_line()

The first two arguments of the `.transform_regression` method provide the input x and y variables. The `method` argument defines the type of curve to fit to the data. Besides `linear`, the options include `log`, `exp`, `pow`, `quad` and `poly`. Here is the same plot with power regression.

In [10]:
scatter_plot + scatter_plot.transform_regression(
    "Weight_in_lbs", "Miles_per_Gallon", method="pow").mark_line()
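To give a sense of what linear and power fits compute, here is a minimal plain-Python sketch. The function names `fit_line` and `fit_power` and the synthetic data are assumptions made for illustration. The power fit uses one common approach, fitting a line to the logarithms of x and y, which is not necessarily what Altair does internally and requires positive values.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def fit_power(xs, ys):
    """Fit y = a * x**b by fitting a line to (log x, log y)."""
    log_a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(log_a), b

# Synthetic data that follows y = 3 * x**-0.5 exactly (not from the cars dataset).
xs = [1.0, 2.0, 4.0, 8.0]
ys = [3 * x ** -0.5 for x in xs]

a, b = fit_power(xs, ys)
print(f"y = {a:.3f} * x ** {b:.3f}")
```

Because the synthetic data follows a power law exactly, the fit recovers the coefficient and exponent up to floating-point error. With real data such as weight vs. miles per gallon, the fitted curve only approximates the points.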

## Summary

In summary, transforms are ways of generating new data in the dataset. While it is generally recommended to conduct data transforms as a pre-process before creating the chart, it is sometimes useful to compute the transform during the chart generation process. We have studied several popular data transforms, and we provide links to their documentation for more detail.
* [Aggregation](https://altair-viz.github.io/user_guide/transform/aggregate.html)
* [Binning](https://altair-viz.github.io/user_guide/transform/bin.html)
* [Filtering](https://altair-viz.github.io/user_guide/transform/filter.html)
* [Regression](https://altair-viz.github.io/user_guide/transform/regression.html)