# Tutorial 5. EDA I

In previous notebooks we learned how to use marks and visual encodings to represent individual data records. Here we will explore methods for *transforming* data, including the use of aggregates to summarize multiple records. Data transformation is an integral part of visualization: choosing the  variables to show and their level of detail is just as important as choosing appropriate visual encodings. After all, it doesn't matter how well chosen your visual encodings are if you are showing the wrong information!

You have already been exposed to some of Vega-Lite's data transforms, including
- `transform_aggregate()`: used to create a new data column by aggregating an existing column ([aggregation options](https://vega.github.io/vega-lite/docs/aggregate.html#ops) including `sum`, `min`)
- `transform_timeunit()`: used to discretize/group a date by a time unit (e.g., day, month, year, etc.)

In this notebook we will work with **Bin Transforms**, **Density Transforms**, and some other **Aggregate Transform** options.

As you work through this module, we recommend that you open the [Altair Data Transformations documentation](https://altair-viz.github.io/user_guide/transform/index.html) in another tab. It will be a useful resource if at any point you'd like more details or want to see what other transformations are available.

## Learning Goals
Those who actively work through this notebook will be able to:
- Use the bin transform to create histograms
- Use the density transform to create kernel density estimate plots
- Explore how to visualize the distributions of quantitative attributes when they are grouped by nominal attributes

<div style="border-left: 5px solid #2E7D32; padding: 1em; background-color: #4CAF50; color: white;">
<h3><b>Important: Learning Focus Disclaimer</b></h3>
<p>This tutorial combines visualization techniques with data filtering examples in Altair. <strong>You are only responsible for learning the visualization creation skills.</strong></p>
    
<p><strong>You are NOT responsible for learning Altair's filtering (i.e. <code>transform_filter</code>) syntax.</strong> We focus on pandas filtering because these skills transfer to other data science contexts, while Altair's filtering syntax is library-specific.</p>
    
<p>As you work through this tutorial, remember: <em>concentrate on understanding how to create the visualizations themselves.</em> <br>The filtering examples are provided for context, but your learning objective is <strong>mastering the chart creation techniques.</strong></p>
</div>

## Dataset and Environment Setup

### Movies Dataset

We will be working with a table of data about motion pictures, taken from the [vega-datasets](https://vega.github.io/vega-datasets/) collection. The data includes variables such as the film name, director, genre, release date, ratings, and gross revenues. However, _be careful when working with this data_: the films are from unevenly sampled years, using data combined from multiple sources. If you dig in you will find issues with missing values and even some subtle errors! Nevertheless, the data should prove interesting to explore.

Let's read the JSON data file from its URL into a Pandas data frame so that we can inspect its contents. We then keep only the movies with an MPAA rating of G, PG, PG-13, or R.


| Column Name | Data Type | Description |
|-------------|-----------|-------------|
| Title  | Text | Movie title |
| US_Gross | Quantitative | USA box office revenue in USD|
| Worldwide_Gross | Quantitative | Global box office revenue in USD|
| US_DVD_Sales | Quantitative | DVD/Physical sales in USD |
| Production_Budget | Quantitative | Movie production costs in USD |
| Release_Date | Date | Theatrical release date |
| MPAA_Rating | Ordinal | Movie rating (G, PG, PG-13, R, etc.) |
| Major_Genre | Nominal | Primary movie genre |
| Running_Time_min | Quantitative | Movie length (minutes) |
| Distributor | Nominal | Distribution company |
| IMDB_Rating | Quantitative | IMDB user ratings (1-10 scale) |
| Rotten_Tomatoes_Rating | Quantitative | Rotten Tomatoes critic ratings (percentage of positive reviews, 0-100 scale) |
| Director | Nominal | Movie director name |

In [None]:
import pandas as pd
import altair as alt


In [None]:
movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
movies = pd.read_json(movies_url)
movies = movies.query("MPAA_Rating in ['G', 'PG', 'PG-13', 'R']")


## Histogram

We'll start our transformation tour by _binning_ data into discrete groups and _counting_ records to summarize those groups. The resulting plots are known as [_histograms_](https://en.wikipedia.org/wiki/Histogram).
Histograms are used to represent how often values fall into given ranges. In other words, they are used to represent the distribution of a dataset.

<div style="border-left: 5px solid #007BFF; padding: 2em; background-color: #F0F8FF;">

<h3><b>  Viz Task: Distribution of Movie Ratings on Rotten Tomatoes</b></h3>

<ul>
<li>Use the <code>circle</code> mark to show the distribution of ratings.</li>
<li>Encode:
<ul>
<li><code>Rotten_Tomatoes_Rating</code> on the <b>x channel</b> as quantitative or ordinal, representing the Rotten Tomatoes score.</li>
<li><code>count()</code> on the <b>y channel</b> as an aggregate of records, showing how many movies received each rating.</li>
</ul>
</li>
<li>Optionally, add <code>color</code> to encode a categorical variable like <code>Major_Genre</code> for additional insight.</li>
<li>Add a descriptive title: <i>“Distribution of Movie Ratings on Rotten Tomatoes”</i>.</li>
</ul>
</div>


In [None]:
alt.Chart(movies).mark_circle(size=60).encode(
   ...
)

The plot above shows the number of records for each rating: <br>
2 records were rated 99%, <br>
17 records were rated 90%,<br>
24 records were rated 50%.<br>
If there is at least one record for each rating (1-100), the plot contains 100 data points.
What does this tell us?
What insights can be gained at this level of detail?
Very little.
Instead, let us aggregate the data.

 To summarize this data, we can *bin* a data field to group numeric values into discrete groups. Here we bin along the x-axis by adding `bin=True` to the `x` encoding channel. The result is a set of ten bins of equal step size, each corresponding to a span of ten ratings points. The `y` encoding channel shows an aggregate `count` of records, so that the vertical position of each point indicates the number of movies per Rotten Tomatoes rating bin. As the `count` aggregate counts the number of total records in each bin regardless of the field values, we do not need to include a field name in the `y` encoding.
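To make the bin-then-count idea concrete, here is a minimal pandas sketch of roughly what `bin=True` combined with `count()` computes. The `ratings` values are made up for illustration, and the fixed ten-point bin edges are an assumption: Altair picks its own "nice" edges based on the data.

```python
import pandas as pd

# Hypothetical ratings on a 0-100 scale (not taken from the movies data).
ratings = pd.Series([5, 12, 18, 33, 37, 41, 58, 62, 64, 69, 71, 88, 93, 97])

# Ten equal-width bins, each spanning ten rating points: [0, 10), [10, 20), ...
edges = list(range(0, 101, 10))
binned = pd.cut(ratings, bins=edges, right=False)

# Count the records in each bin (empty bins are kept with a count of 0).
# These counts are the values that end up on the y channel.
counts = ratings.groupby(binned, observed=False).size()
print(counts)
```

Note that the count never looks at the rating values inside a bin, only at how many records landed there, which is why `count()` needs no field name.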

In [None]:
alt.Chart(movies).mark_circle().encode(
   ...
)

Setting `bin=True` uses default binning settings, but we can exercise more control if desired. Let's instead set the maximum bin count (`maxbins`) to 20, which has the effect of doubling the number of bins. Now each bin corresponds to a span of five ratings points.


In [None]:
alt.Chart(movies).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count()')
)

While the scatter plot looks very similar to the previous one, notice that there are more data points and that the y-axis values have changed.
To arrive at a standard histogram, let's change the mark type from `circle` to `bar`:

In [None]:
alt.Chart(movies).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('count()')
)

We can now examine the distribution of ratings more clearly: we can see fewer movies on the low end, and a few more movies on the high end, but a generally uniform distribution overall. Rotten Tomatoes ratings are determined by taking "thumbs up" and "thumbs down" judgments from film critics and calculating the percentage of positive reviews. It appears this approach does a good job of utilizing the full range of rating values.
<br>
<br>
---


<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>  Viz Task: Distribution of IMDB Ratings</b></h3>

<ul>
<li>Use the <code>bar</code> mark to show the distribution of IMDB ratings.</li>
<li>Encode:
<ul>
<li><code>IMDB_Rating</code> on the <b>x channel</b> as quantitative, binned into 20 bins.</li>
<li><code>count()</code> on the <b>y channel</b> as an aggregate of records, showing how many movies fall into each rating bin.</li>
</ul>
</li>
<li>Optionally, add <code>color</code> to encode a categorical variable like <code>Major_Genre</code> for more insight.</li>
<li>Add a descriptive title: <i>“Distribution of IMDB Ratings”</i>.</li>
</ul>
</div>



In [None]:
alt.Chart(movies).mark_bar().encode(
   ...
).properties(title='Distribution of IMDB Ratings')

_In contrast to the more uniform distribution we saw before, IMDB ratings exhibit a bell-shaped (though [negatively skewed](https://en.wikipedia.org/wiki/Skewness)) distribution. IMDB ratings are formed by averaging scores (ranging from 1 to 10) provided by the site's users. We can see that this form of measurement leads to a different shape than the Rotten Tomatoes ratings. We can also see that the mode of the distribution is between 6.5 and 7: people generally enjoy watching movies, potentially explaining the positive bias!_



**🔍 What This Tells Us:** 
- **Shape**: Is the distribution normal, skewed, or bimodal?
- **Range**: What's the actual spread of ratings?
- **Outliers**: Are there unusual ratings that need investigation?
- **Gaps**: Are there rating values with no movies?



In [None]:
movies['IMDB_Rating'].describe()

**💡 Key Insight:** Notice how the visual immediately shows you patterns that are hard to read off the summary from `movies['IMDB_Rating'].describe()`, like the slight left skew and the concentration around the 6-7 rating range.


## Layered Histograms
In addition to quantitative attributes, we may wish to visualize how the distribution changes for specific values of nominal data. The layered histogram allows us to compare distributions across different categories within the same visualization, making it easy to spot patterns and differences between groups.
However, you should use them sparingly because once you have more than 3-4 categories, it becomes difficult to discern individual distributions due to overlapping bars and visual clutter.


<br>
<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>Viz Task: Distribution of IMDB Ratings by MPAA Rating</b></h3>

<ul>
<li>Use the <code>bar</code> mark with <code>opacity=0.6</code> and <code>binSpacing=0</code> to create a binned histogram with overlapping bars.</li>
<li>Encode:
<ul>
<li><code>IMDB_Rating</code> on the <b>x channel</b> as quantitative, binned into 20 bins (<code>bin = alt.BinParams(maxbins=20)</code>).</li>
<li><code>count()</code> on the <b>y channel</b> to show the number of movies in each bin.</li>
<li><code>MPAA_Rating</code> on the <b>color channel</b> as nominal, to differentiate between G, PG, and PG-13 movies, using the <code>accent</code> color scheme.</li>
</ul>
</li>
<li>Data transformation:
<ul>
<li>Use <code>transform_filter()</code> with <code>alt.FieldOneOfPredicate</code> to include only movies with MPAA ratings: <code>['G', 'PG', 'PG-13']</code>.</li>
</ul>
</li>
</ul>
</div>



In [None]:
alt.Chart(movies).transform_filter(
    alt.FieldOneOfPredicate(field='MPAA_Rating', oneOf=['G', 'PG', 'PG-13'])
).mark_bar(  
    opacity=0.6,
    binSpacing=0
).encode(
   ...
).properties(title='Distribution of IMDB Ratings by MPAA Rating')

Note that the visualization shown above is misleading. By default, it stacks each rating on top of the others, making it difficult to assess the true distribution of the G and PG ratings.

To show the actual distribution, we need to ensure that the counts for each bin are not combined into a single stacked column. By disabling stacking (setting `stack=None` on the `y` encoding), we can see a more accurate representation of how the count of records varies across the different MPAA rating categories.



In [None]:
alt.Chart(movies).transform_filter(
    alt.FieldOneOfPredicate(field='MPAA_Rating', oneOf=['G', 'PG', 'PG-13'])
).mark_bar(  
    opacity=0.6,
    binSpacing=0
).encode(
   ...
).properties(title='Distribution of IMDB Ratings by MPAA Rating')

Now we can see that for each MPAA rating, the count starts from zero rather than being stacked on top of the others. However, the blending of colors creates its own problems. When the distributions of two MPAA ratings overlap in a bin, their colors blend into a new color that is not in the legend, which makes the chart hard to interpret. This is not ideal for data analysis. We almost never use layered histograms for user-focused tasks because the overlapping colors obscure rather than clarify the data patterns.

## Density Plot


### Simple Density Plot
Density plots are also known as kernel density plots or density trace graphs.
A density plot represents the distribution of a numeric variable. It uses a kernel density estimate to approximate the probability density function of the variable, so it is essentially a smoothed version of the histogram.
To create a density plot, we first need to introduce `transform_density`.
The density transform performs one-dimensional kernel density estimation over the input data and generates a new table of sampled values together with their estimated densities.
We need to specify the field on which the density estimate will be calculated. <br>
The `as_` parameter names the two output fields. You could technically call the fields whatever you want (e.g., x, y, cat, bat, cow), but it helps to stick to a naming convention that mirrors the data being produced (e.g., value, density).
Once the estimate has been calculated, we can specify how the output should be encoded.
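To give a feel for what a kernel density estimate is, here is a minimal NumPy sketch. It is not how `transform_density` is implemented; the Gaussian kernel, the hand-picked bandwidth of 0.6, the helper name `gaussian_kde`, and the made-up `ratings` are all assumptions for illustration. The two output arrays play the role of the two fields named in `as_`.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Average a Gaussian bump centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    # Scaled distances: one row per grid point, one column per sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / bandwidth

# Hypothetical ratings on a 1-10 scale (not taken from the movies data).
ratings = [4.8, 5.5, 6.1, 6.4, 6.6, 6.9, 7.0, 7.3, 7.8, 8.2]

value = np.linspace(0, 12, 241)                        # x positions to sample
density = gaussian_kde(ratings, value, bandwidth=0.6)  # estimated density

# A probability density integrates to (approximately) 1.
step = value[1] - value[0]
area = density.sum() * step
print(round(area, 3))
```

A larger bandwidth widens each bump and gives a smoother curve; a smaller one follows the individual samples more closely.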

<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>Viz Task: Density of IMDB Ratings</b></h3>

<ul>
<li>Use the <code>area</code> mark to show the density distribution of IMDB ratings.</li>
<li>Encode:
<ul>
<li><code>IMDB Rating</code> (the output field named in <code>as_</code>) on the <b>x channel</b> as quantitative.</li>
<li><code>density</code> on the <b>y channel</b> as quantitative, representing the estimated density of ratings.</li>
</ul>
</li>
<li>Use <code>transform_density()</code> to compute a smoothed density estimate:
<ul>
<li>Specify the <code>field</code> as <code>'IMDB_Rating'</code>, which is the variable to calculate density for.</li>
<li>Use the <code>as_</code> parameter to store the results: the first element is the name for the x-axis values (<code>'IMDB Rating'</code>) and the second element is the name for the density values (<code>'density'</code>).</li>
</ul>
</li>
</ul>
</div>


In [None]:
alt.Chart(movies).mark_area().transform_density(
    'IMDB_Rating',
    as_=['IMDB Rating', 'density'],
).encode(
    x="IMDB Rating:Q",
    y='density:Q',
)

Note that the density plot looks very similar to the IMDB ratings histogram we created earlier. However, there are a few key differences:
  - the density plot is a smoothed version of the histogram,
  - the `y` channel shows the estimated probability density (scaled so that the area under the curve is 1), not the count of records,
  - `density` is a calculated value, not an attribute of the dataset; we used `transform_density` to create it.

There are six common distribution shapes.

<img title="Common Distribution Types" src="https://www.data-to-viz.com/graph/density_files/figure-html/unnamed-chunk-2-1.png" style="max-width: 400px;"><br/>

**Image source: <a href="https://www.data-to-viz.com/"><em>from Data to Viz</em></a> website**

The shape of the plot helps inform how an attribute is distributed in the dataset and may serve as the basis on which future exploration and analysis is performed.


### Cumulative Density Plot
Another common distribution plot is the cumulative density plot.
The simple density plot is based on the probability density function, which describes how likely values near a given outcome are relative to others.
The cumulative density plot is based on the cumulative distribution function, which returns the probability of observing a value less than or equal to a given outcome.

To create a cumulative density plot, we just need to set the `cumulative` parameter to `True` (its default is `False`).

In [None]:
alt.Chart(movies).transform_density(
    'IMDB_Rating',
    as_=['IMDB Rating', 'density'],
    cumulative = True,
).mark_area().encode(
    x="IMDB Rating:Q",
    y='density:Q',
)

As expected, the y-axis ranges between 0 and 1, because we are looking at the probability not of a single value but of any value up to and including the current value.
While the Cumulative Density Plot is less widely used, it is still an important visualization for exploratory data analysis.
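The relationship between the two plots can be sketched numerically: accumulating a density from left to right yields the cumulative curve. The example below is an illustration under assumptions (a Gaussian kernel, a hand-picked bandwidth, and made-up `ratings`), not a description of what `transform_density` does internally.

```python
import numpy as np

# Hypothetical ratings on a 1-10 scale (not taken from the movies data).
ratings = np.array([4.8, 5.5, 6.1, 6.4, 6.6, 6.9, 7.0, 7.3, 7.8, 8.2])
bandwidth = 0.6  # hand-picked for illustration

# Gaussian kernel density estimate sampled on an evenly spaced grid.
value = np.linspace(0, 12, 241)
z = (value[:, None] - ratings[None, :]) / bandwidth
density = (np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)).mean(axis=1) / bandwidth

# Accumulating density * step approximates P(rating <= value).
step = value[1] - value[0]
cumulative = np.cumsum(density) * step

print(round(float(cumulative[0]), 3), round(float(cumulative[-1]), 3))
```

The cumulative curve never decreases, starts near 0, and ends near 1, which matches the y-axis range seen in the chart.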

### Density Plot Matrix
We now have a sense of the distribution of the IMDB ratings and also the Rotten Tomatoes ratings.
Most movies have a genre listed, so what if we wanted to get a sense of the distribution of ratings grouped by movie genre?

The density can also be computed on a per-group basis by specifying the `groupby` argument. Here we split the above density computation across movie genres.
We have to make three main changes:
- Not all movies have their genre specified, so the first thing we will do is filter out records without a valid `Major_Genre` value.
- Use the `groupby` parameter to specify the grouping.
- Use the facet channel to specify the attribute on which the charts are split and how they are arranged.
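The first two changes can be sketched outside Altair to show what "per-group" means here: each genre gets its own density curve, estimated only from that genre's records. This is a minimal illustration under assumptions (a Gaussian kernel, a fixed bandwidth, the hypothetical helper `kde`, and made-up `toy` data); the third change, faceting, only affects how the resulting curves are laid out.

```python
import numpy as np
import pandas as pd

def kde(samples, grid, bandwidth=0.6):
    """Gaussian kernel density estimate of samples, evaluated on grid."""
    z = (grid[:, None] - np.asarray(samples, dtype=float)[None, :]) / bandwidth
    return (np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)).mean(axis=1) / bandwidth

# Hypothetical toy data; one record has no genre, as happens in the real table.
toy = pd.DataFrame({
    'Major_Genre': ['Drama', 'Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', None],
    'IMDB_Rating': [6.8, 7.4, 7.9, 8.1, 5.2, 6.0, 6.5],
})

grid = np.linspace(0, 12, 241)
step = grid[1] - grid[0]

# Step 1: drop records without a genre. Step 2: estimate one curve per genre.
valid = toy.dropna(subset=['Major_Genre'])
curves = {
    genre: kde(group['IMDB_Rating'], grid)
    for genre, group in valid.groupby('Major_Genre')
}

# Each curve is normalised on its own, so its area is about 1 no matter how
# many movies the genre contains.
areas = {genre: round(float(curve.sum() * step), 3) for genre, curve in curves.items()}
print(areas)
```

One consequence worth keeping in mind when reading per-group density plots: the curves describe the shape of each group's distribution, not how many records each group holds.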

<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>Viz Task: Faceted Density of IMDB Ratings by Genre</b></h3>

<ul>
<li>Use the <code>area</code> mark to show the density distribution of IMDB ratings for each movie genre.</li>
<li>Encode:
<ul>
<li><code>IMDB Rating</code> on the <b>x channel</b> as quantitative.</li>
<li><code>density</code> on the <b>y channel</b> as quantitative, representing the estimated density of ratings.</li>
</ul>
</li>
<li>Data transformation:
<ul>
<li>Use <code>transform_filter()</code> to include only valid genres (<code>isValid(datum.Major_Genre)</code>).</li>
<li>Use <code>transform_density()</code> to calculate density per genre:
<ul>
<li><code>field='IMDB_Rating'</code> — the variable to calculate density for.</li>
<li><code>groupby=['Major_Genre']</code> — compute separate density curves for each genre.</li>
<li><code>as_=['IMDB Rating', 'density']</code> — store the x-axis values as <code>IMDB Rating</code> and density values as <code>density</code>.</li>
<li><code>extent=[1, 10]</code> — restrict the range of IMDB ratings considered for the density calculation.</li>
</ul>
</li>
</ul>
</li>
<li>Facet the chart by <code>Major_Genre</code> into multiple small multiples, arranged in 4 columns.</li>
</ul>
</div>


In [None]:
alt.Chart(movies).transform_filter(
    'isValid(datum.Major_Genre)'
).transform_density(
    'IMDB_Rating',
    groupby=['Major_Genre'],
    as_=['IMDB Rating', 'density'],
    extent=[1, 10],
).mark_area().encode(
  ...
).facet(
    'Major_Genre:N',
    columns=4
)

Let's do a bit of housekeeping and clean up our plot:
- Make each density plot smaller by setting `width` and `height`. For a facet chart, the width and height must be specified inline as part of the Chart's properties.
- Change the title of the facet to a meaningful name.

The next changes are not important but are added to demonstrate customization options:
- Make the plot background black.
- Change all the text to white and the grid to grey.
- Increase the font size of the title.

In [None]:
alt.Chart(movies
).transform_filter(
    'isValid(datum.Major_Genre)'
).transform_density(
    'IMDB_Rating',
    groupby=['Major_Genre'],
    as_=['IMDB Rating', 'density'],
    extent=[1, 10],
).mark_area().encode(
    alt.X("IMDB Rating:Q"),
    alt.Y('density:Q').title('Density'),
    facet=alt.Facet('Major_Genre:N',
                    columns=4,
                    title="Density Plot Matrix for IMDB Rating")
).properties(
    width = 120,
    height = 80
).configure(
    background='black',
).configure_axis(
    labelColor='white',
    titleColor='white',
    gridColor='grey'
).configure_header(
    titleColor='white',
    titleFontSize=20,
    labelColor='white',
    labelFontSize=12)


How would you describe each distribution?
Create the density plot matrix for the other quantitative attributes and describe the distributions.
So far we have explored the distribution for one quantitative attribute at a time and using facets and groupby we have broken down the distribution by genre.

## Layered Density Plots
We can also create a layered density plot. It has also been described as a stacked density estimate, but this name is a misnomer as the estimates are not stacked but layered.


<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>Viz Task: Density of IMDB Ratings by Genre (Colored)</b></h3>

<ul>
<li>Use the <code>area</code> mark with <code>opacity=0.6</code> to show overlapping density curves for different genres.</li>
<li>Encode:
<ul>
<li><code>IMDB_Rating</code> on the <b>x channel</b> as quantitative.</li>
<li><code>density</code> on the <b>y channel</b> as quantitative, with <code>stack=None</code> to prevent stacking.</li>
<li><code>Major_Genre</code> on the <b>color channel</b> as nominal, to differentiate genres.</li>
</ul>
</li>
<li>Data transformation:
<ul>
<li>Use <code>transform_filter()</code> to include only valid genres: <code>isValid(datum.Major_Genre)</code>.</li>
<li>Use <code>transform_density()</code> to compute the density per genre:
<ul>
<li><code>field='IMDB_Rating'</code> — the numeric variable to calculate density for.</li>
<li><code>groupby=['Major_Genre']</code> — compute a separate density curve for each genre.</li>
<li><code>as_=['IMDB_Rating', 'density']</code> — store the x-axis values as <code>IMDB_Rating</code> and the density values as <code>density</code>.</li>
</ul>
</li>
</ul>
</li>
</ul>
</div>


In [None]:
alt.Chart(movies).transform_filter(
    'isValid(datum.Major_Genre)'
).transform_density(
   ...
).mark_area(opacity=0.6).encode(
   ...
)


This is the point where you feel proud that you can create hills and valleys, but it is also the point where you ask yourself: what can I do with this viz?
In which situations would this visualization be a good idea?
Think about each of the channels being used and what data they represent. Bring your answers to class.

## Summary
In this notebook, you have been exposed to various visualization techniques used in exploratory data analysis.
While this tutorial is far from exhaustive, it does provide you with common visualizations that can help you make sense of the data before proceeding to more complex or targeted data analysis tasks.

Interested in learning more about this topic?
- Skim through Chapters 7-9 and 16 in Claus O. Wilke's book [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/)
- Explore visualizations by Function in the [Data Visualization Catalogue](https://datavizcatalogue.com/)