# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109A Introduction to Data Science 

## Lab 2: Visualization

**Harvard University**<br/>
**Fall 2024**<br/>
**Instructors**: Pavlos Protopapas and Natesh Pillai<br/>
<hr style='height:2px'>

In [0]:
#If not in Ed, you can RUN THIS CELL 
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

### Remember:
* "Reset to Scaffold"
* Download all + run on local
* **Labs are not submitted**
### Contents:
* A Visual Guessing Game 
* Colors for Categories 
* Colors for Quantities 
* "It Doesn't Mean You Should Just Because You Can" 
* Visualization Best Practices 
* Exercise: Exploring the Tips Dataset 
* Visualizing Distributions

In this lab we'll discuss some general visualization principles and best practices.\
We'll then move on to examine some common issues students have been having with a special focus on the visualization components.

In [0]:
from IPython.display import clear_output
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

In [1]:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

## Eyes to Brain 👀

<div>
    <img src='attachment:7170e3e3-a837-4558-92ed-6c1640375eee.png' width=250>
</div>

Humans are visual creatures. But in a strong sense _we see with our brains, not with our eyes_.\
Over 50% of the cortex is implicated in the processing of visual information[[1]](https://www.rochester.edu/pr/Review/V74N4/0402_brainscience.html).\
Insights from [cognitive science](https://en.wikipedia.org/wiki/Cognitive_science) and [psychophysics](https://en.wikipedia.org/wiki/Psychophysics) into how visual stimuli are processed by the brain and turned into perceptions should be used to inform our visualization decisions.

Let's play a little guessing game to test and verify some of these insights.

## Visual Guessing Game 🎲

In [0]:
from helper import *

We'll look at two quantitative values in a single viz, where the second is some multiple of the first.\
The possible multiples are: [2, 2.5, 3, 3.5, 4, 4.5, 5]\
The rub (the difficulty) is that **these quantities will be encoded by different visual properties**.

The quantities will be randomly generated. 

**Be sure you are voting on the images being presented to the entire class! Any ones you create by running the cells in your own notebook will be different.**

Cast your vote for each section on the [Lab 2 Ed survey](https://edstem.org/us/courses/59912/lessons/113483/slides/622326).


**Length**

In [0]:
scaler1 = guess_length()

In [0]:
# choices are 2, 2.5, 3, 3.5, 4, 4.5, or 5
ANSWER_1 = ...

In [0]:
test_answer(ANSWER_1, scaler1)

**Slope / Angle**

In [0]:
scaler2 = guess_slope()

In [0]:
ANSWER_2 = ...

In [0]:
test_answer(ANSWER_2, scaler2)

**Area**

In [0]:
scaler3 = guess_area()

In [0]:
ANSWER_3 = ...

In [0]:
test_answer(ANSWER_3, scaler3)

**Shading**

In [0]:
scaler4 = guess_darkness()

In [0]:
ANSWER_4 = ...

In [0]:
test_answer(ANSWER_4, scaler4)

You may have seen a dramatic example of this kind of difficulty in images like this one:

<div>
    <img src='attachment:d4ee29ec-fb0f-454b-849b-69ae766985ea.png' width=350>
</div>

How much darker is A than B here? If you have any doubts, you can use a [color picker](https://chrome.google.com/webstore/detail/colorpick-eyedropper/ohcpnigalekghcmgcdcenkpelffpdolg) to check.

**Color**

In [0]:
show_color_gradient()

In [0]:
scaler5 = guess_hue()

In [0]:
ANSWER_5 = ...

In [0]:
test_answer(ANSWER_5, scaler5)

**Take Away**\
Encode your information using those visual properties most efficiently processed and compared by the human visual system.

<div>
    <img src='attachment:d35c798b-4130-41f1-bbdd-367a9464776c.png' width=450>
    </div>
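
One practical consequence for area encodings: if a circle's radius (rather than its area) is scaled by the value, differences are exaggerated quadratically. Below is a minimal sketch of sizing circles so that area is proportional to the value. The helper `radius_for_value` is hypothetical and is not part of `helper.py`.

```python
import math

def radius_for_value(value, base_radius=1.0):
    """Radius of a circle whose AREA is `value` times that of a circle
    with radius `base_radius` (hypothetical helper for illustration)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    # area grows with the radius squared, so the radius grows with the square root
    return base_radius * math.sqrt(value)

# the multiples used in the guessing game
for multiple in [2, 2.5, 3, 3.5, 4, 4.5, 5]:
    print(multiple, round(radius_for_value(multiple), 3))
```

In `matplotlib`, the `s` argument of `plt.scatter` is already an area (in points squared), so passing values proportional to the data encodes them by area.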

## Colors for Categories
As a general rule, do not use more than 5 colors at once.\
Combining each color marker with a _unique_ shape can help.

<div>
    <img src='attachment:434c6328-fbb2-4199-8f44-e1f4be5eed48.png' width=400>
</div>
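
As a rough sketch of pairing each color with a unique shape, the hypothetical helper `category_styles` below assigns one (color, marker) pair per category and refuses to hand out more pairs than it was given. The color and marker names are standard `matplotlib` identifiers; the helper and the choice of five states are assumptions made for illustration.

```python
def category_styles(categories, colors, markers):
    """Pair each category with its own (color, marker) combination."""
    categories = list(categories)
    n_pairs = min(len(colors), len(markers))
    if len(categories) > n_pairs:
        raise ValueError(
            f"{len(categories)} categories but only {n_pairs} color/marker pairs"
        )
    return {cat: (colors[i], markers[i]) for i, cat in enumerate(categories)}

# five colors at most, each tied to a distinct marker shape
colors = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple"]
markers = ["o", "s", "^", "D", "X"]

styles = category_styles(["MA", "NH", "ME", "CT", "VT"], colors, markers)
for state, (color, marker) in styles.items():
    print(state, color, marker)
```

Each pair can then be passed to a call such as `plt.scatter(x, y, color=color, marker=marker, label=state)`, one category at a time.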

To explore this topic of colors for categories, we'll use the `car_crashes` dataset in `seaborn` because it has a categorical variable that takes on a large number of values: the state abbreviation (`abbrev`).

This dataset was used by [FiveThirtyEight](https://projects.fivethirtyeight.com/polls/) for their article, [Dear Mona, Which State Has The Worst Drivers?](https://fivethirtyeight.com/features/which-state-has-the-worst-drivers/) You can find more info on the various features in the dataset on [Kaggle](https://www.kaggle.com/fivethirtyeight/fivethirtyeight-bad-drivers-dataset).

Note: You can check out [this GitHub repo](https://github.com/mwaskom/seaborn-data) to see what other example datasets are accessible via Seaborn's `load_dataset` function.

In [0]:
## will not work on Ed
car_crashes = sns.load_dataset('car_crashes')
car_crashes.head()

First, a quick plot of the total number of drivers involved in fatal car crashes per billion miles driven by state.\
(We'll keep it sorted by state name so it's easier to find a specific state)

In [0]:
# set the figure dimensions
plt.figure(figsize=(15,3))
# create plot within a style context
with sns.axes_style('whitegrid'):
    sns.barplot(data=car_crashes, x='abbrev', y='total');

Why the rainbow colors?🌈\
This is a rather annoying default behaviour in `seaborn`. There are far too many colors to be able to identify a state from the color alone. And besides, we are already encoding state with position on the $x$-axis! The rainbow colors may be pleasing to the eye, but they are simply a distraction.

This is an important principle of visualization: **be parsimonious.** If you are adding something to a plot that doesn't directly contribute to communicating your ideas then you are blunting the effect of your visualization.

In [0]:
plt.figure(figsize=(15,3))
with sns.axes_style('whitegrid'):
    sns.barplot(data=car_crashes, x='abbrev', y='total', color='orange');

That's a little less distracting, wouldn't you say?

**But what about max and min values?**

In [0]:
car_crashes_sorted = car_crashes.sort_values(by='total', ascending=False)

plt.figure(figsize=(18,3))
with sns.axes_style('darkgrid'):
    sns.barplot(data=car_crashes_sorted, x='abbrev', y='total', color='orange')
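
Sorting puts the extremes at the two ends of the axis. If the goal is to call them out explicitly, one option is to reserve color for just those bars. The sketch below uses a made-up toy table in place of `car_crashes`, and `highlight_colors` is a hypothetical helper.

```python
import pandas as pd

# toy stand-in for the real data (values are made up for illustration)
df = pd.DataFrame({
    "abbrev": ["AA", "BB", "CC", "DD"],
    "total": [12.0, 23.9, 5.9, 18.1],
})
df_sorted = df.sort_values(by="total", ascending=False)

def highlight_colors(values, base="lightgray", high="firebrick", low="steelblue"):
    """Return one color per value, emphasizing only the max and the min."""
    values = list(values)
    hi, lo = max(values), min(values)
    return [high if v == hi else low if v == lo else base for v in values]

colors = highlight_colors(df_sorted["total"])
print(list(zip(df_sorted["abbrev"], colors)))
```

The resulting list can be passed to `plt.bar(df_sorted['abbrev'], df_sorted['total'], color=colors)`, so that color encodes exactly one thing: which bars are the extremes.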

In the rainbow bar plot above, color isn't doing any work for us as **position** is already encoding the state. Let's look at some **scatter plots** below where we explore the relationship between:
* % drivers involved in fatal car crashes who were `speeding` 
* % drivers involved in fatal car crashes who were under the influence of `alcohol` 
* the state in which the crash occurred 

These scatter plots will give us a chance to put color (and shape) to work for us encoding state because the $x$ and $y$ position will be used to encode `speeding` and `alcohol` respectively.\
But with so many states, we should probably limit our scope of investigation. So we'll look at a dozen of the most densely populated states (counting Washington D.C.).

In [0]:
dozen_most_dense = ['DC','NJ','MA','CT','MD','DE','NY','FL','PA','OH','CA','IL']
mask = car_crashes.abbrev.isin(dozen_most_dense)

fig, axs = plt.subplots(1,3, figsize=(18,5))
# notice the use of the `hue` and `style` parameters
sns.scatterplot(data=car_crashes[mask], x='speeding', y='alcohol', hue='abbrev', s=250, ax=axs[0]).set_title('Color')
sns.scatterplot(data=car_crashes[mask], x='speeding', y='alcohol', style='abbrev', s=250, ax=axs[1]).set_title('Shape')
sns.scatterplot(data=car_crashes[mask], x='speeding', y='alcohol', hue='abbrev', style='abbrev', s=250, ax=axs[2]).set_title('Color & Shape')
for ax in axs:
    ax.legend(markerscale=2)
plt.suptitle('Car Crashes')
plt.tight_layout();

We can bend our rule about each property of the image encoding one and only one type of information a little here. When we have so many categories, it is hard to distinguish on color alone (notice how similar IL, MD, and MA are). Shape is always an option for encoding categories, but it doesn't "jump out" at us like color does. By combining the two we are able to make a noticeable improvement in the readability of our scatter plot.

But 12 categories is still really pushing it. You have to do a lot of back and forth consulting of the legend. If we restrict our investigation to New England we see our result is much easier to take in at a glance.

In [0]:
new_england = ['MA','NH','ME','CT','VT','RI']
mask = car_crashes.abbrev.isin(new_england)

sns.scatterplot(data=car_crashes[mask], x='speeding', y='alcohol', hue='abbrev', style='abbrev', s=250);
plt.legend(markerscale=2);

Here there is very little ambiguity as to which category is which, even with 6 different values.

**Q:** But do we *really* need to be using color or shape here at all? What else might we do to help the viewer identify the state corresponding to each data point?

In [0]:
ax = sns.scatterplot(data=car_crashes[mask], x='speeding', y='alcohol', s=100, color = "orange") 

for abbrev in new_england:
    pos = car_crashes.loc[car_crashes.abbrev == abbrev, ["speeding", "alcohol"]].values[0]
    ax.annotate(abbrev, pos)

## Colors for Quantities

Color can also be used to encode numerical values, continuous or discrete.\
This works best when the viewer doesn't need to compute the exact differences between values, but just get a sense of roughly where they stand relative to one another.

The [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html) is one of the most popular ways of using color to encode continuous (and sometimes discrete) quantities.

Here we use the **flights** dataset as our example. ✈️

In [0]:
## will not work on Ed
flights = sns.load_dataset("flights")
flights.head()

In [0]:
flights.dtypes

We'll look at how the number of passengers changes over time using different `matplotlib` **colormaps**.

In [0]:
flights_table = flights.pivot_table(values="passengers", index="year", columns="month")
flights_table
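
To make the reshaping step concrete, here is a small sketch of what `pivot_table` does: it turns long data (one row per year and month) into the wide year-by-month grid a heatmap expects. The four rows below are a toy stand-in, not a claim about the full dataset.

```python
import pandas as pd

# toy long-format table: one row per (year, month) pair
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# rows become years, columns become months, cells hold the passenger counts
wide = long_df.pivot_table(values="passengers", index="year", columns="month")

# plain string columns come back sorted alphabetically, so restore calendar order
wide = wide[["Jan", "Feb"]]
print(wide)
```

When the column being pivoted is a categorical, `pandas` keeps the category order instead of sorting alphabetically, which is why checking `flights.dtypes` first is worthwhile.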

In [0]:
fig, axs = plt.subplots(2,3, figsize=(10, 5), sharex=True, sharey=True)

all_cmaps = list(matplotlib.cm.__dict__['datad'].keys())

FEELING_LUCKY = False
if FEELING_LUCKY:
    cmaps = np.random.choice(all_cmaps, size=np.prod(axs.shape), replace=False)
else:
    cmaps = ['gray', 'afmhot', 'gist_earth', 'rainbow', 'Accent', 'prism']

for ax, cmap in zip(axs.ravel(), cmaps):
    sns.heatmap(flights_table, cmap=cmap, ax=ax).set(title=cmap)
plt.tight_layout()

We can see the number of passengers has been increasing each year, with the bulk centered around July and August.\
Of course, not all the colormaps above are suitable for continuous values. You can never go wrong with **gray** scale. The color gradients of **afmhot** and **gist_earth** are interpretable, with the latter perhaps veering into aesthetics-over-information territory.

**Rainbow** really should never be used for this purpose. **Accent** may be useful for discrete values, but we lose a lot of information here on our continuous data. And the strange **prism** colormap, with its repeating pattern, is not appropriate here.
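
One rough way to screen a colormap for continuous data is to check whether its lightness changes in a single direction from one end to the other. The sketch below is an approximation under stated assumptions: it applies the Rec. 709 luminance weights directly to the RGB values as a crude proxy for perceived lightness, and both helper functions are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def approx_luminance(cmap_name, n=256):
    """Crude lightness proxy: weighted sum of the R, G, B channels."""
    rgba = plt.get_cmap(cmap_name)(np.linspace(0, 1, n))
    r, g, b = rgba[:, 0], rgba[:, 1], rgba[:, 2]
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def has_monotonic_lightness(cmap_name, tol=1e-12):
    """True if the proxy never reverses direction along the colormap."""
    diffs = np.diff(approx_luminance(cmap_name))
    return bool(np.all(diffs >= -tol) or np.all(diffs <= tol))

for name in ["gray", "afmhot", "gist_earth", "rainbow", "Accent", "prism"]:
    print(name, has_monotonic_lightness(name))
```

A colormap that fails this check is not automatically bad (diverging colormaps reverse on purpose), but for a sequential quantity it is a warning sign that equal steps in the data will not look like equal steps in the plot.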

You can see a random set of colormaps by setting `FEELING_LUCKY` to `True` and re-running the cell.

`matplotlib`'s documentation has a [colormap reference](https://matplotlib.org/stable/gallery/color/colormap_reference.html) and a helpful page on [choosing colormaps](https://matplotlib.org/stable/tutorials/colors/colormaps.html), but exploring them ourselves within a Jupyter notebook is very easy. You can simply use tab autocomplete after `matplotlib.cm` to see all of your options.

Calling `display()` on a colormap object will show us a nice color gradient bar!\
In fact, we don't even have to explicitly call `display()` on the object. The return value of the last line of any cell is displayed by default.\
_(This is also why you sometimes get pesky text above your plots. You can suppress this output by appending `;` to the last line of your cell)_

In [0]:
matplotlib.cm.gnuplot2_r

You can also do something similar with `seaborn`, though it requires you to know the name of the colormap (no tab autocomplete for exploring).

In [0]:
sns.color_palette("rocket", as_cmap=True)

The same principles regarding colormaps for quantitative values hold when plotting density functions with something like [kdeplot](https://seaborn.pydata.org/generated/seaborn.kdeplot.html).

In [0]:
# generate some data
x = np.random.uniform(size=100)
y = np.random.uniform(size=100)

In [0]:
fig, axs = plt.subplots(2,3, figsize=(10, 5), sharex=True, sharey=True)

FEELING_LUCKY = False
if FEELING_LUCKY:
    cmaps = np.random.choice(all_cmaps, size=np.prod(axs.shape), replace=False)
else:
    cmaps = ['Blues', 'viridis', 'afmhot', 'gist_earth', 'PiYG', 'hsv']

for ax, cmap in zip(axs.ravel(), cmaps):
    sns.kdeplot(x=x, y=y, ax=ax,
                cmap=cmap,
                fill=True,
                thresh=0.025,
                antialiased=True,
                alpha=0.9,
                levels=40).set(title=cmap)
    sns.despine(left=True, bottom=True)
    hide_ticks(ax)
plt.tight_layout()

**Color-blindness**\
[According to the National Institutes of Health](https://ghr.nlm.nih.gov/condition/color-vision-deficiency#statistics), around 1 in 12 males and 1 in 200 females have some form of color vision deficiency.
* Don't use red and green as contrasting colors (this is the most common form of color-blindness)
* Seek out colormaps and palettes designed to be color-blind friendly
* View your visualizations under color-blindness simulations

Resources:
* [Color-blind Friendly Diagrams](https://yoshke.org/blog/essays/2020/07/colorblind-friendly-diagrams/) (short article)
* [CBcolors.py](https://gist.github.com/thriveth/8560036#file-cbcolors-py) (just one color-blind friendly palette for plotting)
* [Coblis - Color Blindness Simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/) (upload your own images and simulate various color vision deficiencies)
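
As a small sketch of the second bullet, `matplotlib` ships a style sheet named `tableau-colorblind10` that swaps the default color cycle for one designed with color vision deficiency in mind. The toy lines below are only there to show the colors being applied.

```python
import matplotlib.pyplot as plt

# apply the style only inside this block so other plots keep their defaults
with plt.style.context("tableau-colorblind10"):
    colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
    fig, ax = plt.subplots(figsize=(6, 3))
    for i in range(5):
        # each new line picks up the next color in the cycle
        ax.plot([0, 1], [i, i + 1], label=f"series {i + 1}")
    ax.legend()

print(colors)
```

`seaborn` offers a similar option through `sns.color_palette("colorblind")`. Either way, it is still worth running the final image through a simulator such as Coblis.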

## It Doesn't Mean You Should Just Because You Can 🐧❗

Often, our first impulse is to represent _as much information as possible_ in our visualization. We spend a lot of time coming up with clever ways to encode each feature in the dataset into a single plot.

But this is counterproductive. We want to draw attention to some interesting aspect of the data: a strong relationship, an unexpected pattern, etc. These important insights can be overlooked if there is too much going on in a visualization.

Let's look at an example using `seaborn`'s penguins dataset (a worthy successor to the canonical iris dataset).

In [0]:
## will not work on Ed
penguins = sns.load_dataset('penguins').dropna()
penguins.head()

We have 3 species of penguins living across 3 different islands. There are measurements of bill length, bill depth, flipper length, and body mass. We also have a categorical variable for each penguin's sex, giving us a total of 7 features.

Imagine your colleague were to present you with the plot below.

In [0]:
sns.relplot(data=penguins, x='flipper_length_mm', y='bill_length_mm',
            hue='species', style='sex', size='body_mass_g', height=6);
plt.title('Penguins!', fontdict={'color': 'teal', 'size': 20, 'weight': 'bold', 'family': 'serif'});

Here they've managed to encode `species`, `bill_length_mm`, `flipper_length_mm`, `body_mass_g`, and `sex` all in a single `seaborn` call. That's 5 of our original 7 predictors. And they assure you they're working on a way to encode `bill_depth_mm` with a marker's alpha value and encode `island` by giving each marker a different border color.

But was it really worth it? We need to ask ourselves, "what am I trying to communicate with this plot?"\
If we don't have a clear answer then that is a sign to re-evaluate our current approach.

How about this next example using the **exercise** dataset? 🏃🚶🧘

In [0]:
## will not work on Ed
exercise = sns.load_dataset('exercise').drop('Unnamed: 0', axis=1)
exercise.head()

Here we have the pulse measurements from individuals on either a 'low fat' or 'no fat' diet as they rest, walk, or run for different periods of time. Once again, your ambitious colleague boasts they have captured "all the information in a single visualization," and presents you with the plot below.

In [0]:
with sns.axes_style('whitegrid'):
    sns.lineplot(data=exercise, x='time', y='pulse', hue='diet', style='kind', estimator='mean')\
    .set_title('Exercise')

Yikes! 🤯\
While it's true we've managed to cram pretty much all the information from the dataframe into our plot, the result is a mess. And the grid lines here only make things more cluttered.

One thing we could try is splitting up the information into multiple plots so as not to overwhelm the viewer.

In [0]:
fig, axs = plt.subplots(1,3, figsize=(14,3), sharey=True)
for i, kind in enumerate(exercise.kind.unique()):
    mask = exercise.kind == kind
    sns.barplot(data=exercise[mask], x='time', y='pulse', hue='diet', ax=axs[i]).set_title(kind)
axs[0].legend(loc='upper left')
axs[1].get_legend().remove()
axs[2].get_legend().remove()
plt.tight_layout()

Ok, we've split up the different types of exercises. But what is the viewer really supposed to take away from this set of plots? We are still trying to do everything at once. We'd do well to focus in on where we think the real story is.

There is very little difference between the diet groups when resting or walking. Let's instead focus on running.

In [0]:
mask = exercise.kind == 'running'
with sns.axes_style('whitegrid'):
    ax = sns.violinplot(data=exercise[mask], x='time', y='pulse', hue='diet')
    ax.legend(loc='upper left')
    ax.set_title('Running Exercise')
    # hide all four spines (top and right are removed by default)
    sns.despine(ax=ax, left=True, bottom=True)

By zooming in, not only have we reduced the cognitive load on the viewer by reducing the number of components they need to process, but we are also able to see more granular information about the segment of the data we are interested in. Using the violin plots here allows us to consider the distributions more easily than with the bar plots, which just showed us the means and tiny error bars (95% confidence intervals of the mean, by default).

Here's one curious fact that reveals itself in this new visualization: at 15 minutes, the 'low fat' group still has individuals with pulses as high as those in the 'no fat' group (some even higher), topping out at just over 150. _But_ after 30 minutes, there are _no_ 'low fat' runners with a pulse much greater than about 130. By contrast, the 'no fat' group's distribution reaches new heights at the 30 minute mark. Perhaps the 'low fat' runners are better able to acclimate to the strain of running after a sufficient amount of time and their cardiovascular system doesn't have to work quite as hard.
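
To see why a bar of means can hide this kind of story, consider the toy sketch below. The two made-up groups share exactly the same mean pulse, so a bar plot of means would draw them at identical heights, yet their spreads are very different.

```python
import pandas as pd

# made-up pulse values: same mean, very different spread
toy = pd.DataFrame({
    "group": ["steady"] * 4 + ["spread"] * 4,
    "pulse": [120, 120, 120, 120, 90, 110, 130, 150],
})

# the mean alone cannot tell these groups apart; std, min, and max can
summary = toy.groupby("group")["pulse"].agg(["mean", "std", "min", "max"])
print(summary)
```

A violin plot, box plot, or histogram of the same data would make the difference visible at a glance.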

## Visualization Best Practices 🥇

* Reduce 'ink'-to-information ratio
* Follow the hierarchy of visual efficiency (MOST position -> length -> angle -> area -> intensity -> color LEAST)
* Ask, "what information is this property of the plot encoding?"
* Don't sacrifice your ability to communicate insights for eye-candy!
* Ask, "what insight am I trying to communicate with this plot?"
* Use accessible colormaps and palettes
* Don't try to say everything in a single plot
* Often less really is more
* Just say no to rainbow colormaps and pie charts 🚫🌈 🚫🥧

**When do I use `seaborn` rather than `matplotlib`?**

`seaborn` is most useful when your data is already in a DataFrame. This allows variables to be selected using their column names. Simply pass the df as the `data` argument to the `seaborn` plotting method and pass column names for the other arguments like `x`, `y`, `hue`, etc. The method calls are very readable and you get axis labels and legends 'for free.'

`matplotlib` is best when plotting `numpy` ndarrays. Here there are no existing column names for `seaborn` to take advantage of. This is why you often see `matplotlib` used in PCA examples: PCA's `transform` method and `explained_variance_ratio_` attribute return `numpy` ndarrays, not DataFrames.
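
The contrast can be sketched with a random array standing in for model output (no PCA is actually run here, and the axis names are placeholders). With `matplotlib` we index the array directly and label the axes by hand; wrapping the array in a DataFrame is what supplies the column names `seaborn` relies on.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
arr = rng.normal(size=(50, 2))  # stand-in for an ndarray returned by a model

# matplotlib: slice the columns ourselves and set the labels manually
fig, ax = plt.subplots()
ax.scatter(arr[:, 0], arr[:, 1])
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")

# alternative: give the columns names so they can be referenced as strings
df = pd.DataFrame(arr, columns=["component 1", "component 2"])
print(df.head())
```

Once the data is in `df`, a call like `sns.scatterplot(data=df, x='component 1', y='component 2')` would label the axes automatically.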

Resources:
* [Seaborn Aesthetics](https://seaborn.pydata.org/tutorial/aesthetics.html)

<div class="alert alert-success">
    <strong>🏋🏻‍♂️ CLASS ACTIVITY:</strong> Explore the Tips Dataset 💸  Share your Viz! 🎁</div>  

Here is yet another interesting dataset that comes with `seaborn`.\
It contains information about tips, the setting in which they were left, and the individuals who left them.

In [0]:
## will not work on Ed
tips = sns.load_dataset('tips')
tips.head()

Using the principles and methods covered above, create a visualization or set of visualizations to:
1. **Investigate a _specific_ question about the dataset**
2. **Communicate a _specific_ insight about the dataset**

If you were inspired by any of the datasets shown earlier (`car_crashes`, `flights`, `penguins`, or `exercise`) you may choose to use one of them instead.

Finally, please **share your work with the class** (disasters are also welcome!)\
Post your visualization on this [Ed post](https://edstem.org/us/courses/59912/discussion/5264489).

You can either:
1. take a screenshot of your plot(s) [Screenshots on Mac](https://support.apple.com/en-us/HT201361)
2. use `plt.savefig('my_cool_viz.png')` in the cell with the plot(s) to save it to disk
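
For the second option, a minimal sketch of saving a figure is shown below. The plot itself is a placeholder, and writing to the system temp directory is just an assumption to keep the example self-contained; any writable path works.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# placeholder plot standing in for your own visualization
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("My cool viz")

# dpi controls the resolution; bbox_inches='tight' trims excess whitespace
out_path = os.path.join(tempfile.gettempdir(), "my_cool_viz.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
print(out_path, os.path.exists(out_path))
```

Calling `savefig` on the figure object (rather than `plt.savefig`) makes it explicit which figure is written when a notebook has several open.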

Then upload your image to the post including:\
**(1)** The question you were asking of the data and/or **(2)** the insight you are trying to communicate with your plot.

You can always make your posts anonymously if you like 🥷

## Reviewing Common Visualization Issues

## Wine Dataset 🍷

For the next example we'll use the **wines** dataset. Each row contains information about the objective chemical properties of the wine as well as a consensus quality rating.

In [0]:
wines = pd.read_csv('data/wines.csv', index_col=0)
wines.head()

## Comparing Distributions

**Suppose you ask someone, "investigate the distribution of alcohol content in wines of each quality rating," and they produce this plot.**

In [0]:
ax = wines.groupby('quality')\
    .agg({'alcohol': 'mean'})\
    .plot.bar(legend=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
ax.set_ylabel('avg alcohol content');

**🤔 Q: List at least _3 problems_ with this plot.** 

`seaborn` allows us to produce a very similar plot with error bars (95% confidence intervals by default) and colors (whether this is an improvement is debatable). But many of the same issues remain.

In [0]:
ax = sns.barplot(data=wines, x='quality', y='alcohol', estimator=np.mean)
ax.set_title('Wines');

Ideally, we want to display more information about these distributions than just their mean.\
There are several options. Histograms (which `seaborn` calls [histplots](https://seaborn.pydata.org/generated/seaborn.histplot.html)) make a very good default/first choice for EDA.

But trying to plot too many distributions on top of one another will result in an incomprehensible mess. Grouping categories or plotting only a subset can make all the difference between an informative visualization and a useless one.

In [0]:
fig, axs = plt.subplots(1,3, figsize=(16,4))
sns.histplot(data=wines, x='alcohol', hue='quality', alpha=0.5, ax=axs[0]).set_title('All Qualities')
wines_binary = wines.copy()
wines_binary['quality_binary'] = np.where(wines_binary.quality > 5, 'good', 'bad')
sns.histplot(data=wines_binary, x='alcohol', hue='quality_binary', alpha=0.5, ax=axs[1]).set_title('Good/Bad Binary')
sns.histplot(data=wines_binary, x='alcohol', hue='quality_binary',
             kde=True, alpha=0.5, fill=False, ax=axs[2]).set_title('KDE w/ No Fill')
plt.suptitle('Wines')
plt.tight_layout();

If you are intent on comparing more than 2 or 3 distributions then [boxplot](https://seaborn.pydata.org/generated/seaborn.boxplot.html) would probably be the next thing to try. Another benefit box plots have over histograms is the ability to clearly display outliers (or 'fliers'). But displaying outliers can squash the distributions, so it is important to know how to toggle them off when we need to.

In [0]:
# create an extreme outlier as a demonstration
new_obs = wines.iloc[-1].copy()
new_obs['alcohol'] += 10
wines_outlier = pd.concat([wines.copy(), new_obs.to_frame().T])

# `quality` became a float above (a row Series holds a single dtype, so ints are upcast); cast it back
wines_outlier['quality'] = wines_outlier.quality.astype(int)

In [0]:
fig, axs = plt.subplots(1,3, figsize=(16,4))
sns.boxplot(data=wines, x='quality', y='alcohol', ax=axs[0]).set_title('Original')
sns.boxplot(data=wines_outlier, x='quality', y='alcohol', ax=axs[1]).set_title('w/ Extreme Outlier')
sns.boxplot(data=wines_outlier, x='quality', y='alcohol', showfliers=False, ax=axs[2]).set_title('No Outliers')
plt.suptitle('Wines (Boxplots)');
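
Which points get drawn as fliers? With the default setting (`whis=1.5`), a box plot marks anything more than 1.5 times the interquartile range beyond the quartiles. The sketch below reproduces that rule on made-up numbers; `find_fliers` is a hypothetical helper, and the exact quartile interpolation used by a plotting library may differ slightly.

```python
import numpy as np

def find_fliers(values, whis=1.5):
    """Return the values lying outside the whis * IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# made-up alcohol values; the last one mimics the extreme outlier above
alcohol = [9.4, 9.8, 10.0, 10.5, 11.2, 21.0]
print(find_fliers(alcohol))
```

Setting `showfliers=False` only hides these points from the plot; they remain in the data, so the quartiles and whiskers are unchanged.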

A related issue to the distribution squashing effect of the extreme outlier above: be careful about plotting on a **log scale**. It is not always warranted and can also flatten what might otherwise be obvious differences between distributions.

The [boxenplot](https://seaborn.pydata.org/generated/seaborn.boxenplot.html) is a nice alternative to the box plot that provides a bit more information about the other quantiles. The [swarmplot](https://seaborn.pydata.org/generated/seaborn.swarmplot.html) is like a mix between a scatter plot and a violin plot.

In [0]:
fig, axs = plt.subplots(1,2, figsize=(14,4))
sns.boxenplot(data=wines, x='quality', y='alcohol', ax=axs[0]);
sns.swarmplot(data=wines.sample(frac=.5), x='quality', y='alcohol', ax=axs[1], s=2.5);

`swarmplot` can be finicky. It will throw warnings and only plot a subset of the data if there isn't enough space. I've taken a sample here and decreased the marker size to avoid the warning.

**Q:** 🤔 If we wanted to be careful, what should we ensure about the sample we are plotting?

[stripplot](https://seaborn.pydata.org/generated/seaborn.stripplot.html) is a nice alternative to `swarmplot` that won't complain about the number of points. [catplot](https://seaborn.pydata.org/generated/seaborn.catplot.html) is similar but has the quirk of being a 'figure-level' function, and so does not accept a target axes as an argument.

The violinplot is another popular choice.

In our example, the numerical value of `quality` is easily interpreted by the viewer. But for many non-ordinal categorical variables that have been numerically encoded this is not the case (think of the dietary restrictions or cities in HW2). In these cases we always want to change the labels for the categories into something that the viewer can interpret directly. To demonstrate this we've made some string labels for each category and added them to the plot using `ax.set_xticklabels()`.

In [0]:
quality_names = ['Blech!', 'Eww', 'Meh', 'Hmm..', 'Yum', 'Wow!'] 
fig, axs = plt.subplots(1,2, figsize=(14,4))
sns.stripplot(data=wines, x='quality', y='alcohol', alpha=0.5, ax=axs[0]).set_title('Strip Plot')
ax = sns.violinplot(data=wines, x='quality', y='alcohol', alpha=0.5, ax=axs[1])
ax.set_xticklabels(labels=quality_names)
ax.set_title('Violin Plot')
plt.suptitle('Wines');
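
An alternative to relabeling the ticks after the fact is to translate the codes in the data itself, so the plot picks up readable labels on its own. The sketch below assumes a hypothetical numeric encoding of dietary categories; the codes and names are invented for illustration.

```python
import pandas as pd

# hypothetical numeric encoding of a non-ordinal category
diet_codes = pd.Series([0, 2, 1, 0, 2], name="diet")
code_to_label = {0: "omnivore", 1: "vegetarian", 2: "vegan"}

# map each code to its human-readable label
diet_labels = diet_codes.map(code_to_label)
print(diet_labels.tolist())
```

Mapping by key is less fragile than passing a list to `set_xticklabels`, which silently depends on the labels being in the same order as the ticks.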

<div class="alert alert-success">
    <strong>🏋🏻‍♂️ BONUS CLASS ACTIVITY (time permitting):</strong> Explore the CS109A class survey data!</div>  

A cleaned-up version of the survey data can be found in `data/2024_cs1090a_survey.csv`.

---



In [0]:
# your code here

**🌈 The End**