# Lecture 6: August 18th, 2023

__Updates:__ 
* New token-earning assignment, _50 Years of Data Science_, is up on Canvas. It is due next Friday at midnight. As always, this is an optional assignment.

## A few thoughts on last lecture's material

We'll start today's class with a short discussion.

__Question:__ 
Is it always better to include more data in a chart?

__Brainstorming:__ 

* Benefits of including more data:
    * We could be more accurate in our hypotheses/get a better idea of what's going on
    * "Easier to identify trends"

* Drawbacks of including more data:
    * Might be more difficult to see the relationship between certain variables.
    * Maybe slower to render?
    * When creating charts, it's important to think of your audience. If you include a lot of different variables, it might take the reader a long time to understand what's going on.

__Upshot:__ There is nuance to creating charts! It's not always better to include more data, but there are also plenty of times where more data can be great!

__Question:__ 
What are some ways we can make our charts more accessible? (e.g. colorblindness)

__Brainstorming:__ 
* Text font and size, and maybe also alt text descriptions
* Line styles (thickness, dashed, etc.)
* Colorblindness (find colorblind-friendly color schemes online, change the shapes of markers -- we'll see this today!)

In [None]:
import altair as alt 
import seaborn as sns

In [None]:
# check that Altair is updated to version 5.0.0
alt.__version__

'5.0.0rc1'

In [None]:
df = sns.load_dataset("mpg")

As a reminder, here is a chart that we made on Wednesday:

In [None]:
alt.Chart(df).mark_circle().encode(
    x="weight",
    y="mpg",
    color="origin",
    tooltip = ["weight","mpg","name","origin"]
)

Let's update this chart a little bit to be more accessible.

Notice that we can encode the same variable to multiple channels. Here, "origin" is encoded to both color and shape.

In [None]:
alt.Chart(df).mark_point().encode(
    x="weight",
    y="mpg",
    color="origin",
    shape="origin",
    tooltip=["weight","mpg","name","origin"]
)

What if we swapped the color channel with the x-axis? What kinds of charts do we get by swapping the encodings? What do you think will happen?

Start thinking about the types of decisions Altair has made for us:
* for example, the x-axis has "distinct bins"

Does this chart look a little weird? Is it different than you expected?

In [None]:
alt.Chart(df).mark_point().encode(
    x="origin",
    y="mpg",
    color="weight",
    shape="origin",
    tooltip=["weight","mpg","name","origin"]
)

What if we now swapped x and y?

In [None]:
alt.Chart(df).mark_circle().encode(
    x="mpg",
    y="origin",
    color="weight",
    #shape="origin",
    tooltip=["weight","mpg","name","origin"]
)

## Encoding data types

Notice that the charts above represent several different kinds of data. For instance, the place of origin is represented as a string. How does Altair know what to do with this?

__Key Point:__ 
Quantitative versus Categorical data. 
* This distinction will be very important for when we get to ML next week. 
* Altair picks different default values based on the type of data being encoded. Just because Altair picks a default value, this does not always mean it's the best value! We will often need to specify the data encoding ourselves.

[Click here for the source!](https://altair-viz.github.io/altair-viz-v4/user_guide/encoding.html#encoding-data-types)

![](altair_data_encodings.png)

These are the 5 main data types recognized by Altair. Here is a brief description of each!

* __Quantitative data:__ Ordinary numerical data types, like floats.
    * `mpg` from the cars dataset might be a good candidate. My one comment is that `mpg` values are typically integers, so we have another option for this encoding...
    * Another example would be the `fare` column from the taxis dataset.

* Ordinal and Nominal are examples of __Categorical data:__ Values are represented as distinct categories or classes.
    * Ordinal data comes with a natural _ordering_ (Jaedan's example: ordering in a race.)
    * Nominal data does not come with a natural ordering (e.g. "origin" in the mpg dataset.)

* __Temporal:__ This deals with datetime-like objects. We might see this during our ML portion of the course.

* The last one, __geojson__, deals with geographic data (things like maps). I don't think this has ever been done in Math 10 before, but it might be cool to do something with it for your final project.

* Notice I have a `requirements.txt` file and an `Init` notebook. We'll use these to update Altair to a more recent version. Import Altair and check the version number using the dunder attribute `__version__`.

We already did this at the start of lecture today.

* Load the “mpg” dataset (`sns.load_dataset`) from Seaborn and name the DataFrame `df`.

We also already did this above.

*** 

* Find the sub-DataFrame for which the name of the car contains the substring “skylark”. Name the sub-DataFrame `df_sub`. (Reminder. Use `str` and `contains`.)

In [None]:
df.head(4)

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |


Here's a way that will throw an error:

Recall that `contains` is a pandas string method, not a method of the Series itself. We need to use the `str` accessor to get access to string methods.

In [None]:
df["name"].contains("skylark")

AttributeError: 'Series' object has no attribute 'contains'

In [None]:
df["name"].str.contains("skylark")

0      False
1       True
2      False
3      False
4      False
       ...  
393    False
394    False
395    False
396    False
397    False
Name: name, Length: 398, dtype: bool

In [None]:
df_sub = df[df["name"].str.contains("skylark")]
df_sub

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 226 | 20.5 | 6 | 231.0 | 105.0 | 3425 | 16.9 | 77 | usa | buick skylark |
| 305 | 28.4 | 4 | 151.0 | 90.0 | 2670 | 16.0 | 79 | usa | buick skylark limited |
| 339 | 26.6 | 4 | 151.0 | 84.0 | 2635 | 16.4 | 81 | usa | buick skylark |


* Make a scatter plot in Altair from this sub-DataFrame using the “model_year” for both the x-coordinate and the color, and using “mpg” for the y-coordinate. (We can increase the size of the points, and remove zero from the x-axis, to make it easier to see.)

First, I'll make the basic chart. It won't look very nice at first, but we'll fix it after.

In [None]:
alt.Chart(df_sub).mark_circle().encode(
    x="model_year",
    y="mpg",
    color="model_year"
)

Let's start by making the circles bigger.

In [None]:
alt.Chart(df_sub).mark_circle(size=150).encode(
    x="model_year",
    y="mpg",
    color="model_year"
)

By default, Altair starts the x-axis at 0. In this case, it leaves us with a lot of blank space. So let's tell Altair it doesn't need to do that.

If you haven't updated to Altair 5.0.0, the syntax below will not work. Here are the old and new ways.

In [None]:
# How you could install "by hand"...Deepnote doesn't like this though
# !pip install altair==5.0.0rc1

In [None]:
alt.__version__

'5.0.0rc1'

Old way:
`x = alt.X("model_year", scale=alt.Scale(zero=False))`

New way:
`x = alt.X("model_year").scale(zero=False)`

In [None]:
alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year").scale(zero=False),
    y="mpg",
    color="model_year"
)

This chart is looking better, but still not great. Let's keep making improvements.

* What changes if you specify different encoding types for “model_year”? (The difference in color between quantitative and ordinal will be more clear if you use a different color scheme: [options](https://vega.github.io/vega/docs/schemes/).)

I'll start by encoding "model_year" as ordinal. Notice that the x-axis values are now evenly spaced: the numeric gaps between the years get ignored.

In [None]:
alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year:O").scale(zero=False),
    y="mpg",
    color="model_year:O"
)

Now, let's change the color scheme. The chat has spoken! We will use "purples".

Warning: Changing the colors below uses the new syntax. If you weren't able to update Altair, follow the old-syntax example above, or read the documentation for Altair version 4.

In [None]:
alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year:O").scale(zero=False),
    y="mpg",
    color=alt.Color("model_year:Q").scale(scheme="purples")
    # "Q" is the default here: if we don't specify a type, Altair picks quantitative for this numeric column
)

Observations: When color is encoded with "Q", notice that the values 77, 79, and 81 are kind of "bunched" together (their colors look pretty similar). The color for 70 looks very different.

Now let's try with an ordinal encoding and see what happens.

In [None]:
alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year:O").scale(zero=False),
    y="mpg",
    color=alt.Color("model_year:O").scale(scheme="purples")
)

Notice that when we switched the encoding to be ordinal, the colors are now all evenly spaced.
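The difference between the two color encodings can be sketched with a little arithmetic. This is a simplified model, not Altair's actual implementation: it assumes a quantitative scale places each value proportionally between the minimum and maximum, while an ordinal scale spaces the sorted categories evenly.

```python
# The four model_year values in df_sub
years = [70, 77, 79, 81]

# Quantitative: position along the color ramp is proportional to the value
lo, hi = min(years), max(years)
quantitative_pos = [round((y - lo) / (hi - lo), 2) for y in years]

# Ordinal: only the rank matters, so positions are evenly spaced
n = len(years)
ordinal_pos = [round(i / (n - 1), 2) for i in range(n)]

print(quantitative_pos)  # [0.0, 0.64, 0.82, 1.0]
print(ordinal_pos)       # [0.0, 0.33, 0.67, 1.0]
```

Under this model, 77, 79, and 81 all land in the upper third of the quantitative ramp, which matches the "bunched" colors we saw, while the ordinal positions are spread out evenly.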

Lastly, let's try switching the color encoding to "Nominal". In theory, it is Ordinal, but nothing is stopping us from telling Altair to treat it as Nominal. Since Nominal assumes there's no ordering, if we don't specify which colors to use, Altair will make them as distinct as possible.

In [None]:
alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year:O").scale(zero=False),
    y="mpg",
    color=alt.Color("model_year:N")
)

Notice that even without specifying the color, Altair makes Nominal data as distinct as possible.

In [None]:
alt.Chart(df_sub).mark_circle(size=150).encode(
    x=alt.X("model_year:O").scale(zero=False),
    y="mpg",
    color=alt.Color("model_year:N").scale(scheme="pastel2")
)

## Other types of charts in Altair

Here we switch back to the full DataFrame, `df`. There are many types of charts in Altair (browse the [example gallery](https://altair-viz.github.io/gallery/index.html) to see some of the possibilities).

* Make a bar chart using “cylinders” for the x-coordinate, using the median of the mpg values for the y-coordinate.

First, I will make a skeleton of the chart without the median. Then we will include it.

In [None]:
alt.Chart(df).mark_bar().encode(
    x="cylinders",
    y="mpg"
)

In [None]:
alt.Chart(df).mark_bar().encode(
    x="cylinders:O",
    y="mpg"
)

In the plot above, notice the scale on the y-axis! This is very strange. Also notice the space between the blue sections of the bars. What's happening here is that individual points are getting plotted on top of each other. The issue is that bar charts need some kind of aggregated data. For us, this means taking the median across cylinder groups.

Now, let's get the median of `mpg`. Notice that "median(mpg)" is not a column name in `df`. Instead, Altair knows this means to take the median of the "mpg" column. Let's also specify an encoding for the number of cylinders.

In [None]:
alt.Chart(df).mark_bar().encode(
    x="cylinders:O",
    y="median(mpg)"
)

* Add a tooltip so we can find the precise median values.

Next, let's add a tooltip.

In [None]:
alt.Chart(df).mark_bar().encode(
    x="cylinders:O",
    y="median(mpg)",
    tooltip=["median(mpg)","cylinders"]
)

* Can you find these same median values using `df.groupby`? Deepnote hides the warning, but use the keyword argument `numeric_only` when computing the median to avoid a pandas warning (in pandas 2.0 and later, omitting it raises an error instead, because of the string columns).

In [None]:
df.head()

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


In [None]:
# compute the median only for the numeric columns
df.groupby("cylinders").median(numeric_only=True)

| cylinders | mpg | displacement | horsepower | weight | acceleration | model_year |
|---|---|---|---|---|---|---|
| 3 | 20.25 | 70.0 | 98.5 | 2375.0 | 13.5 | 75.0 |
| 4 | 28.25 | 105.0 | 78.0 | 2232.0 | 16.2 | 78.0 |
| 5 | 25.4 | 131.0 | 77.0 | 2950.0 | 19.9 | 79.0 |
| 6 | 19.0 | 228.0 | 100.0 | 3201.5 | 16.1 | 76.0 |
| 8 | 14.0 | 350.0 | 150.0 | 4140.0 | 13.0 | 73.0 |


In [None]:
df.groupby("cylinders").median(numeric_only=True)["mpg"]

cylinders
3    20.25
4    28.25
5    25.40
6    19.00
8    14.00
Name: mpg, dtype: float64
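An alternative worth knowing: if we only want the `mpg` medians, we can select the column before aggregating, and then `numeric_only` is not needed. The sketch below uses a made-up DataFrame (`toy` is an assumption) that includes a string column, like `name` in the real data.

```python
import pandas as pd

# Hypothetical toy data with a string column that has no median
toy = pd.DataFrame({
    "cylinders": [4, 4, 4, 8, 8],
    "mpg": [30.0, 28.0, 26.0, 14.0, 16.0],
    "name": ["a", "b", "c", "d", "e"],
})

# Select "mpg" first, so only that column is aggregated
medians = toy.groupby("cylinders")["mpg"].median()
print(medians)
```

Here the 4-cylinder median is 28.0 and the 8-cylinder median is 15.0, and the string column is never touched.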

* Make a “rectangle chart” using `mark_rect` with “model_year” along the x-axis, with “cylinders” along the y-axis, and with the rectangles colored by `"count()"`.

In [None]:
alt.Chart(df).mark_rect().encode(
    x="model_year:O",
    y="cylinders:O",
    color="count()",
    tooltip=["count()"]
)

From reading the chart, we can see that in 1982, 28 cars had 4 cylinders.

Now, I'll save this chart as `c1` so that we can access it later.

In [None]:
c1 = alt.Chart(df).mark_rect().encode(
    x="model_year:O",
    y="cylinders:O",
    color="count()",
    tooltip=["count()"]
)

c1

* Make a “text chart” using `mark_text` with the same parameters as above, but remove the color encoding, and add a text encoding based on `"count()"`.

In [None]:
c2 = alt.Chart(df).mark_text().encode(
    x="model_year:O",
    y="cylinders:O",
    text="count()"
)

c2

This chart doesn't look too great right now, but let's combine it with `c1`!

* Layer these last two charts together, either using `+` or using `alt.layer`

`+` is just a shortcut for layering charts.

In [None]:
c1+c2

In [None]:
alt.layer(c1,c2)

That's all we got to today! We'll pick up with the following material on Monday.

## Multi-view plots in Altair

* Make a facet chart using “horsepower” for the x-coordinate, “mpg” for the y-coordinate, “cylinders” for the color with the Nominal data encoding type, and dividing the data according to the number of cylinders. Put each chart in its own row.

## Introduction to the penguins dataset

* In the penguins dataset, change the column named “island” so it is named “location” and change the column named “body_mass_g” so it is named “weight”. Use the pandas DataFrame method rename, and input a Python dictionary.

* Apply `any` with a suitable `axis` keyword argument to determine which rows have any missing data.

* Now use Boolean indexing like usual. You might need to take a negation, using tilde ~.

* Be sure to save the resulting DataFrame with the same name `df`. It should now have 333 rows.
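Here is a minimal sketch of these steps on a made-up three-row DataFrame (`penguins_toy` is an assumption, not the real penguins data, so the row counts differ from the 333 mentioned above).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the penguins dataset, with one missing value
penguins_toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe"],
    "body_mass_g": [3750.0, np.nan, 5000.0],
})

# Rename columns by passing a dictionary of old name -> new name
df = penguins_toy.rename(columns={"island": "location", "body_mass_g": "weight"})

# axis=1 checks across the columns, giving one True/False value per row
has_missing = df.isna().any(axis=1)

# Keep only the rows with no missing data, using ~ to negate the Boolean Series
df = df[~has_missing]
print(df)
```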
