# Part 3 - Exercises - Data Transformation and Interaction

In [None]:
import altair as alt
import pandas as pd
from vega_datasets import data
print("The installed Vega-Altair version is " + alt.__version__)

## Diamonds Exercise
Let's explore the diamonds dataset from the Seaborn sample datasets.

In [None]:
diamonds_url = "../resources/datasets/diamonds.csv"
diamonds = pd.read_csv(diamonds_url)
diamonds

Here are descriptions of the dataset columns:

|Variable|Description|Values|
|--- |--- |--- |
|carat|weight of the diamond|0.2-5.01|
|cut|quality of the cut|Fair, Good, Very Good, Premium, Ideal|
|color|diamond color|J (worst) to D (best)|
|clarity|measurement of how clear the diamond is|I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)|
|depth|total depth percentage|43-79|
|table|width of top of diamond relative to widest point|43-95|
|price|price in US dollars|\$326-\$18,823|
|x|length in mm|0-10.74|
|y|width in mm|0-58.9|
|z|depth in mm|0-31.8|

### Simple scatter

Create a simple scatterplot with `price` on the y-axis and `carat` on the x-axis. Use a [circle mark](https://altair-viz.github.io/user_guide/marks/circle.html) and lower the `size` and `opacity` properties to reduce overplotting.


> **Note:** Because this dataset has more than 5,000 rows, Vega-Altair raises a `MaxRowsError` by default. Follow the instructions in the error message to enable the "vegafusion" data transformer. VegaFusion optimizes the generated Vega specification by removing unused columns and evaluating data transformations in the Python kernel. It also raises the row limit to 100,000, and that limit is enforced after any data aggregations have been applied.

<details>
  <summary>(Show Image)</summary>
  <img src="../resources/images/part2/diamonds1.png">
</details>

<details>
  <summary>(Show Answer)</summary>

  ```python
alt.data_transformers.enable("vegafusion")
alt.Chart(diamonds).mark_circle(opacity=0.2, size=10).encode(
    alt.X("carat"),
    alt.Y("price"),
)
  ```
</details>

### Simple histogram
Create a histogram of the `carat` column. This will require enabling binning on the `x` encoding, and using `count()` as the y encoding.

Click "Open in Vega Editor" from the chart's dropdown menu and notice how the dataset included in the spec is already binned and aggregated (by VegaFusion).

Use `chart.transformed_data()` to extract the binned and aggregated data as a pandas DataFrame.
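
To build intuition for what "binned and aggregated" means, the following pandas sketch performs a comparable computation by hand. The `sample` values and the 0.5-carat bin width are assumptions chosen for illustration; Vega-Lite picks its own bin boundaries, so the real output of `chart.transformed_data()` will likely use different edges and column names.

```python
import pandas as pd

# Hypothetical stand-in for the `carat` column of the diamonds data
sample = pd.DataFrame(
    {"carat": [0.23, 0.31, 0.45, 0.70, 0.90, 1.01, 1.20, 1.52, 2.01]}
)

step = 0.5  # assumed bin width, not necessarily what Vega-Lite would choose

# Assign each row to the lower edge of its bin
sample["bin_start"] = (sample["carat"] // step) * step

# Count the rows in each bin: one output row per bar of the histogram
binned = sample.groupby("bin_start").size().reset_index(name="count")
binned["bin_end"] = binned["bin_start"] + step
print(binned)
```

The result has one row per bin rather than one row per diamond, which is why the aggregated dataset embedded in the chart spec is so much smaller than the original.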

<details>
  <summary>(Show Image)</summary>
  <img src="../resources/images/part2/diamonds2.png">
</details>

<details>
  <summary>(Show Answer)</summary>

  ```python
alt.data_transformers.enable("vegafusion")
chart = alt.Chart(diamonds).mark_bar().encode(
    alt.X("carat").bin(),
    alt.Y("count()"),
)
chart.show()
chart.transformed_data()
  ```
</details>


What happens if you only use `count()` as the `y` encoding without binning `x`?

<details>
  <summary>(Show Answer)</summary>
  The bars are centered on each unique value of `carat`. The bars have a fixed width, so they may overlap each other. When binning is enabled, Vega-Altair automatically sets the bar width to match the bin intervals.
</details>

### Heatmap

Create a heatmap with `cut` as the x encoding, `color` as the y encoding, and `average(price)` as the heatmap color.

Then configure the scales so that the best cut and color are in the lower right corner, and the worst cut and color are in the upper left corner.

<details>
  <summary>(Show Image)</summary>
  <img src="../resources/images/part2/diamonds4.png">
</details>

<details>
  <summary>(Show Answer)</summary>

  ```python
alt.Chart(diamonds).mark_rect().encode(
    alt.X('cut').scale(domain=["Fair", "Good", "Very Good", "Premium", "Ideal"]),
    alt.Y('color').sort("descending"),
    alt.Color('average(price)')
)
  ```
</details>

### Filtered Heatmap
Repeat the heatmap from the previous exercise, but this time use a filter transform to keep only diamonds larger than 1.5 carats. What do you notice about the price distribution of these diamonds?

For more information on the filter transform, see the [Vega-Altair documentation](https://altair-viz.github.io/user_guide/transform/filter.html).

<details>
  <summary>(Show Image)</summary>
  <img src="../resources/images/part2/diamonds5.png">
</details>

<details>
  <summary>(Show Answer)</summary>

  ```python
alt.Chart(diamonds).mark_rect().transform_filter("datum.carat > 1.5").encode(
    alt.X('cut').scale(domain=["Fair", "Good", "Very Good", "Premium", "Ideal"]),
    alt.Y('color').sort("descending"),
    alt.Color('average(price)')
)
  ```
</details>

## Cars Exercise - Scatterplot + Bar Chart Interaction

Let's create an interactive visualization that combines a scatterplot and bar chart. We'll reverse the typical interaction pattern so that clicking on a bar highlights corresponding points.

**Goal**: Create a scatter plot of horsepower vs miles per gallon, and a bar chart of car origins. Clicking on a bar should highlight the corresponding points in the scatter plot.

Here is a static snapshot of the desired chart after clicking on the "Japan" bar:

![](../resources/images/part3/cars1.png)


In [None]:
# Load the cars dataset
cars = data.cars()
cars.head()


### Step 1: Highlighting Points
Create a point selection that will select all cars from the same origin when you click on a bar. Use conditional encoding to highlight the selected points in the scatter plot, while non-selected points appear in light gray.

<details>
  <summary>(Show Answer)</summary>

  ```python
# Create selection parameter
selection = alt.selection_point(fields=["Origin"])

# Scatter plot with conditional coloring
scatter = alt.Chart(cars).mark_circle(size=100).encode(
    alt.X('Horsepower'),
    alt.Y('Miles_per_Gallon'),
    color=alt.condition(selection, 'Origin', alt.value('lightgray'))
)

# Bar chart with selection parameter
bars = alt.Chart(cars).mark_bar().encode(
    alt.X('count(Origin)').scale(domain=[0,260]),
    alt.Y('Origin').scale(domain=["Europe", "Japan", "USA"]),
    alt.Color('Origin'),
).add_params(
    selection
)

# Combine charts
scatter & bars
  ```
</details>


### Step 2: Filtering Points
Now modify your solution to filter the points in the scatter plot rather than just highlighting them. This means only the selected points should be visible.

Here is a static snapshot after clicking on the "USA" bar:

![](../resources/images/part3/cars2.png)

<details>
  <summary>(Show Answer)</summary>

  ```python
# Create selection parameter
selection = alt.selection_point(fields=["Origin"])

# Scatter plot with filtering
scatter = alt.Chart(cars).mark_circle(size=100).encode(
    alt.X('Horsepower'),
    alt.Y('Miles_per_Gallon'),
    alt.Color('Origin')
).transform_filter(
    selection
)

# Bar chart with conditional coloring
bars = alt.Chart(cars).mark_bar().encode(
    alt.X('count(Origin)').scale(domain=[0,260]),
    alt.Y('Origin').scale(domain=["Europe", "Japan", "USA"]),
    color=alt.condition(selection, 'Origin', alt.value('lightgray'))
).add_params(
    selection
)

# Combine charts
scatter & bars
  ```
</details>



### Keep exploring!
Check out all of the transformation types that Vega-Altair supports in [the documentation](https://altair-viz.github.io/user_guide/transform/index.html). Pick one we haven't discussed yet and apply it to the Spotify dataset. For example, use the density transform on song tempo.
