# Pandas for Altair and Altair for Pandas
Eytan Adar, University of Michigan

This is a "recipe" book of equivalent commands between Pandas and Altair. There are often multiple ways to achieve the same thing in both Pandas and Altair. We won't cover them all here, but you'll hopefully get a sense of the various mappings.

There are also better "shortcuts" for the code snippets we're presenting. We've occasionally broken things into multiple lines of code so we can better document things. Once you understand how things work, you'll likely be able to fit many of these into one or two lines.

This recipe book is also intended for people who already know Pandas *or* Altair well. If you need an explanation of what is happening internally, we've created an associated video with all the details.

In [1]:
import pandas as pd
import altair as alt
import numpy as np

In [2]:
# let's create a data frame, this will be our "DF"

def getDF():
    return pd.DataFrame([('AV','A', 9,5),
                    ('SD','B',10,4),
                    ('ES','A',2,3),
                    ('MB','A',7,5),
                    ('RR','B',8,6),
                    ('YY','B',9,7),
                    ('LA','A',9,8)],
                    columns=('Student','Class', 'T1Grade','T2Grade'))


In [3]:
# getDF() is a utility function; we'll call it whenever we want to reset to the original dataframe

df = getDF()

In [4]:
# let's look inside the DF
df

,Student,Class,T1Grade,T2Grade
0,AV,A,9,5
1,SD,B,10,4
2,ES,A,2,3
3,MB,A,7,5
4,RR,B,8,6
5,YY,B,9,7
6,LA,A,9,8


What we have is a grade sheet for all students (their initials are in "Student") across 2 classes (A and B, see Class). They've taken two tests and we've recorded their scores into T1Grade and T2Grade.

# A Basic Chart (Pandas and Altair)

Our goal is to get a figure like this:
    
![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example1.png)

This is basically a picture of each student's T1 grade.

We can think of this in Grammar of Graphics terms:

**Mark**: rectangle 

Data (2 variables):
* **Student**: Nominal   
* **T1Grade**: Quantitative

Encoding (2--one per variable):
* **Student**: x-axis
* **T1Grade**: y-axis (bar length)

In [5]:
alt.Chart(df).mark_bar().encode(
    x='Student',
    y='T1Grade'
)

# Filtering Data

Let's try a simple filtering experiment: we're going to get rid of people who got 7 or less on their test (the filter below keeps only grades strictly greater than 7).

![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example2.png)

## The Pandas way

In [6]:
# get the data
df = getDF()

In [7]:
# filter the grades using pandas 

df = df[df.T1Grade > 7]

# the end result before we render:
df

,Student,Class,T1Grade,T2Grade
0,AV,A,9,5
1,SD,B,10,4
4,RR,B,8,6
5,YY,B,9,7
6,LA,A,9,8


In [8]:
# and then use the filtered df

alt.Chart(df).mark_bar().encode(
    x='Student',
    y='T1Grade'
)

You could, of course, do it all in one expression, but it will get messy fast:

```
alt.Chart(df[df.T1Grade > 7]).mark_bar().encode(
    x='Student',
    y='T1Grade'
)
```
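
Another pandas spelling of the same filter is `DataFrame.query`, which takes the condition as a string and reads a little closer to the Altair version. The sketch below is pandas-only and rebuilds the grade sheet inline (under the illustrative name `grades`) so it runs on its own.

```python
import pandas as pd

# Same grade sheet as getDF(), rebuilt here so the snippet is self-contained.
grades = pd.DataFrame(
    [('AV', 'A', 9, 5), ('SD', 'B', 10, 4), ('ES', 'A', 2, 3), ('MB', 'A', 7, 5),
     ('RR', 'B', 8, 6), ('YY', 'B', 9, 7), ('LA', 'A', 9, 8)],
    columns=('Student', 'Class', 'T1Grade', 'T2Grade'))

# Boolean-mask form (as in the recipe) and the equivalent query-string form.
by_mask = grades[grades.T1Grade > 7]
by_query = grades.query('T1Grade > 7')

# Both keep the same rows; MB (exactly 7) is dropped because the test is strict.
print(by_query['Student'].tolist())
```

If you want to keep students who scored exactly 7, change `>` to `>=` in either form.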


## The Altair Way

In [9]:
# get the data
df = getDF()

In [10]:
alt.Chart(df).transform_filter(
    alt.datum.T1Grade > 7
).mark_bar().encode(
    x='Student',
    y='T1Grade'
)

# Aggregation (groupby/agg vs transform_aggregate)

Our goal is to get a figure like this:
    
![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example3.png)

We're going to calculate a new value based on some grouping. In this example, we'll find the minimum grade in each class. This requires grouping (in this case by Class) and calculating some new value (in this case the min, but it can be anything... mean, max, etc.)
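
To make "group, then aggregate" concrete before the side-by-side recipes, here is a minimal pandas-only sketch on the same grade sheet. It uses the compact `groupby(...)[column].min()` form rather than the `NamedAgg` form used in the recipe; the name `grades` is just for illustration.

```python
import pandas as pd

# Same grade sheet as getDF(), rebuilt here so the snippet is self-contained.
grades = pd.DataFrame(
    [('AV', 'A', 9, 5), ('SD', 'B', 10, 4), ('ES', 'A', 2, 3), ('MB', 'A', 7, 5),
     ('RR', 'B', 8, 6), ('YY', 'B', 9, 7), ('LA', 'A', 9, 8)],
    columns=('Student', 'Class', 'T1Grade', 'T2Grade'))

# Split the rows by Class, then reduce each group's T1Grade to its minimum.
class_min = grades.groupby('Class')['T1Grade'].min()

# The result is a Series indexed by Class, one value per group.
print(class_min.to_dict())
```

Swapping `.min()` for `.mean()` or `.max()` changes the aggregate without changing the grouping.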

## The Pandas way

In [11]:
# get the data
df = getDF()

In [12]:
# we're going to first group by 'Class'
# Each group will be aggregated into a new column called classmin by using
# NamedAgg (a named aggregate function) on the T1Grade column

df = df.groupby('Class').agg(classmin=pd.NamedAgg(column='T1Grade',aggfunc='min'))

df = df.reset_index()  # we don't want "Class" to be the index (so we'll reset)

# the end result before we render is:
df

,Class,classmin
0,A,2
1,B,8


In [13]:
# wrap up the rendering with Altair

alt.Chart(df).mark_bar().encode(
    x='Class:N',
    y='classmin:Q'
)

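# notice that we specified that Class was 'Nominal' and classmin was 'Quantitative'.
# Altair can usually infer these types when the columns exist in a pandas DataFrame,
# as they do here, but being explicit is a good habit: when a field is created by a
# transform (as in the Altair way below), you'll see an exception if you leave out
# ":N" or ":Q"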

## The Altair way

In [15]:
# get the data
df = getDF()

In [16]:
alt.Chart(df).transform_aggregate(
    groupby=['Class'],                          # Groupby class
    classmin='min(T1Grade)'                     # For each class, calculate the min T1Grade and put in classmin
).mark_bar().encode(
    x='Class:N',
    y='classmin:Q'
)

## Alternative Altair way

In [17]:
# get the data
df = getDF()

In [18]:
alt.Chart(df).mark_bar().encode(
    x='Class:N',
    y='min(T1Grade)'
)

# This will create a new column, but you don't get to control the name this way.
# It will be something like min_T1Grade. This is faster for simple aggregation
# but you have more control with the original (e.g., aggregating on multiple variables)

# Calculated Field (Pandas) and transform_calculate (Altair)

Our goal is to get a figure like this:
    
![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example4.png)

In this example, we are missing a field that we want to calculate for each row. Specifically, we want to know the change in grade between test 1 and test 2. We're going to modify our data to add this extra column to support this.

## The Pandas way

In [19]:
# get the data
df = getDF()

In [20]:
# for each row, subtract T2Grade from T1Grade and put it into a new column, testDifference
df['testDifference'] = df['T1Grade'] - df['T2Grade']

# the end result before we render is:
df

,Student,Class,T1Grade,T2Grade,testDifference
0,AV,A,9,5,4
1,SD,B,10,4,6
2,ES,A,2,3,-1
3,MB,A,7,5,2
4,RR,B,8,6,2
5,YY,B,9,7,2
6,LA,A,9,8,1


In [21]:
alt.Chart(df).mark_bar().encode(
    x='Student',
    y='testDifference'
)
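
A variation on the pandas step above: `DataFrame.assign` returns a new frame with the extra column instead of modifying `df` in place, which can be convenient when chaining. This is a pandas-only sketch; `grades` and `with_diff` are illustrative names.

```python
import pandas as pd

# Same grade sheet as getDF(), rebuilt here so the snippet is self-contained.
grades = pd.DataFrame(
    [('AV', 'A', 9, 5), ('SD', 'B', 10, 4), ('ES', 'A', 2, 3), ('MB', 'A', 7, 5),
     ('RR', 'B', 8, 6), ('YY', 'B', 9, 7), ('LA', 'A', 9, 8)],
    columns=('Student', 'Class', 'T1Grade', 'T2Grade'))

# assign() builds a copy with the new column; the original frame is untouched.
with_diff = grades.assign(testDifference=grades['T1Grade'] - grades['T2Grade'])

print(with_diff[['Student', 'testDifference']])
```

Because `grades` itself is unchanged, there is no need to call a reset function before the next experiment.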

## The Altair Way

In [22]:
# get the data
df = getDF()

In [23]:
alt.Chart(df).transform_calculate(
    testDifference = alt.datum.T1Grade - alt.datum.T2Grade
).mark_bar().encode(
    x='Student',
    y='testDifference:Q'                                        # notice that we need to explicitly say :Q
)

# Aggregating and calculating (transform_aggregate + transform_calculate)

Our goal is to get a figure like this:
    
![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example5.png)

This combines a few different things. First, we're going to group and calculate some aggregates (in this case we'll use the min and max grades in each class). Second, we're going to augment our table as above to hold the additional field we care about (the difference).

## The Pandas way

In [24]:
# get the data
df = getDF()

In [25]:
# create a new dataframe with a row for each class
# and then calculate the min/max for each class and put those 
# into a new variable

df = df.groupby('Class').agg(
    classmin=pd.NamedAgg(column='T1Grade',aggfunc='min'),
    classmax=pd.NamedAgg(column='T1Grade',aggfunc='max')
)

# The dataframe is now focused on the class:
df

Class,classmin,classmax
A,2,9
B,8,10


In [26]:
# for each row calculate the difference between them

df['difference'] = df['classmax'] - df['classmin']

# reset the index (the groupby makes "Class" the index... we don't want that)
df = df.reset_index()

# the end result before we render it is:
df

,Class,classmin,classmax,difference
0,A,2,9,7
1,B,8,10,2


In [27]:
# now render it
alt.Chart(df).mark_bar().encode(
    x='Class:N',
    y='difference:Q'
)
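
Once the individual steps make sense, the pandas side can be written as one chained expression, which mirrors the shape of the Altair pipeline. The sketch below is one possible way to do it, using the tuple form of named aggregation and `assign` with a lambda; `grades` and `summary` are illustrative names.

```python
import pandas as pd

# Same grade sheet as getDF(), rebuilt here so the snippet is self-contained.
grades = pd.DataFrame(
    [('AV', 'A', 9, 5), ('SD', 'B', 10, 4), ('ES', 'A', 2, 3), ('MB', 'A', 7, 5),
     ('RR', 'B', 8, 6), ('YY', 'B', 9, 7), ('LA', 'A', 9, 8)],
    columns=('Student', 'Class', 'T1Grade', 'T2Grade'))

summary = (
    grades.groupby('Class')
    # (column, function) tuples are shorthand for pd.NamedAgg
    .agg(classmin=('T1Grade', 'min'), classmax=('T1Grade', 'max'))
    # the lambda receives the aggregated frame, so classmax/classmin exist here
    .assign(difference=lambda d: d['classmax'] - d['classmin'])
    # move Class from the index back into a regular column
    .reset_index()
)

print(summary)
```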

## The Altair way

In [28]:
# get the data
df = getDF()

In [29]:
alt.Chart(df).transform_aggregate(               # make the new columns
    groupby = ['Class'],                                     # For each class...
    classmax = 'max(T1Grade)',                               # ...find the min and max
    classmin = 'min(T1Grade)'
).transform_calculate(                                      # calculate the difference for each row
    difference = alt.datum.classmax - alt.datum.classmin
).mark_bar().encode(
    x='Class',
    y='difference:Q'
)

# Join (Pandas) vs. transform_joinaggregate (Altair)

Our goal is to get a figure like this:
    
![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example6.png)

This one is a bit tricky. We need to first calculate some property of the group and then put that back into our original table. In this case, we need to use the grouping trick we did above to create the summary table (in this case the maximum T1 grade per class), and then annotate our original table. We'll do this in two steps: first finding the aggregate and then joining it back in. This is an example where Altair has a single function (joinaggregate) to support this.

## The Pandas way

In [30]:
# get the data
df = getDF()

In [31]:
df1 = df.groupby('Class').agg(                # group by class
    classmax=pd.NamedAgg(column='T1Grade',    # For each class, create a named aggregate on T1Grade
                        aggfunc='max')        # and find the max
)



In [32]:
# take a peek inside
df1

Class,classmax
A,9
B,10


In [33]:
df = df.join(df1, on='Class')

# now df looks like:
df

,Student,Class,T1Grade,T2Grade,classmax
0,AV,A,9,5,9
1,SD,B,10,4,10
2,ES,A,2,3,9
3,MB,A,7,5,9
4,RR,B,8,6,10
5,YY,B,9,7,10
6,LA,A,9,8,9


In [34]:
# render it:
alt.Chart(df).mark_bar().encode(
    x='Student:N',
    y='T1Grade:Q',
    color='classmax:O'
)
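
Pandas also has a one-step counterpart to the group-then-join pattern above: `groupby(...).transform(...)` computes the aggregate per group and broadcasts it back to every original row, which is roughly what `transform_joinaggregate` does on the Altair side. A pandas-only sketch (`grades` is an illustrative name):

```python
import pandas as pd

# Same grade sheet as getDF(), rebuilt here so the snippet is self-contained.
grades = pd.DataFrame(
    [('AV', 'A', 9, 5), ('SD', 'B', 10, 4), ('ES', 'A', 2, 3), ('MB', 'A', 7, 5),
     ('RR', 'B', 8, 6), ('YY', 'B', 9, 7), ('LA', 'A', 9, 8)],
    columns=('Student', 'Class', 'T1Grade', 'T2Grade'))

# transform('max') returns one value per original row (not one per group),
# so it can be assigned straight back as a new column with no join needed.
grades['classmax'] = grades.groupby('Class')['T1Grade'].transform('max')

print(grades[['Student', 'Class', 'T1Grade', 'classmax']])
```

The explicit `agg` plus `join` version is still useful when you want to inspect the summary table on its own.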

## The Altair way

In [35]:
# get the data
df = getDF()

In [36]:
alt.Chart(df).transform_joinaggregate(
    groupby=['Class'],                                 # for each Class...
    classmax='max(T1Grade)',                          # ...find the max T1Grade
).mark_bar().encode(
    x='Student:N',
    y='T1Grade:Q',
    color='classmax:O'                                # use the value calculated above
)

# Joins and Aggregates 2 (transform_joinaggregate + transform_calculate)

Our goal is to get a figure like this:
    
![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example7.png)

We'll expand on the example above. We are now going to calculate an additional property: how each student's grade compares to the class max grade (T1Grade minus the class max, so 0 for a top scorer and negative for everyone else). As before, we're going to calculate the max, join it back in, and then use a calculation to find the difference. In this case, we don't have a single Altair command but will do it in two steps.

## The Pandas way

In [37]:
# get the data
df = getDF()

In [38]:
# group and calculate the max per group
df1 = df.groupby('Class').agg(
    classmax=pd.NamedAgg(column='T1Grade',
                        aggfunc='max')
)

# let's look in df1:
df1

Class,classmax
A,9
B,10


In [39]:
# join back into original table
df = df.join(df1,on='Class')

# calculate the difference per student
df['difference'] = df.T1Grade - df.classmax

# see what it looks like inside:
df

,Student,Class,T1Grade,T2Grade,classmax,difference
0,AV,A,9,5,9,0
1,SD,B,10,4,10,0
2,ES,A,2,3,9,-7
3,MB,A,7,5,9,-2
4,RR,B,8,6,10,-2
5,YY,B,9,7,10,-1
6,LA,A,9,8,9,0


In [40]:
alt.Chart(df).mark_bar().encode(
    x='Student:N',
    y='difference:Q'
)

## The Altair way

In [41]:
# get the data
df = getDF()

In [42]:
alt.Chart(df).transform_joinaggregate(        # first step, calculate the max per Class
    classmax='max(T1Grade)',
    groupby=['Class']
).transform_calculate(                                   # second step, calculate the difference for each Student
    difference = alt.datum.T1Grade - alt.datum.classmax
).mark_bar().encode(
    x='Student:N',
    y='difference:Q'                                     # third step (yes, not in order) plot the difference
)

# Ranks and transform_window

Our goal is to get a figure like this:
    
![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example8.png)

In this example we need to calculate something that depends on all the data. Specifically, we are looking at the rank of students based on their grades.

## The Pandas way

In [43]:
# get the data
df = getDF()

In [44]:
# we generate a new column that has the rank associated with each grade
# and put this in a new Rank column in the data frame

df['Rank'] = df['T1Grade'].rank(ascending=True)

# it looks like:
df

,Student,Class,T1Grade,T2Grade,Rank
0,AV,A,9,5,5.0
1,SD,B,10,4,7.0
2,ES,A,2,3,1.0
3,MB,A,7,5,2.0
4,RR,B,8,6,3.0
5,YY,B,9,7,5.0
6,LA,A,9,8,5.0


In [45]:
# plot the values

alt.Chart(df).mark_bar().encode(
    x='Student:N',
    y='Rank:Q'
)

## The Altair way

In [46]:
# get the data
df = getDF()

In [47]:
# similar to above

alt.Chart(df).transform_window(
    sort=[{'field' : 'T1Grade'}],   # sort by T1Grade
    Rank = 'rank(*)'                # use the rank(..) operator to calculate the rank 
).mark_bar().encode(
    x='Student:N',
    y='Rank:Q'
)
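
One difference worth knowing about: pandas `rank` defaults to `method='average'`, which is why the three students tied at 9 all show 5.0 in the table above. As far as we know, the `rank` window operation on the Altair side gives tied values the lowest rank in the tie instead, so the two charts may not match exactly for those students. The pandas-only sketch below compares the two tie-handling methods; treat the claim about Altair as an assumption to check against your installed version.

```python
import pandas as pd

# Same grade sheet as getDF(), rebuilt here so the snippet is self-contained.
grades = pd.DataFrame(
    [('AV', 'A', 9, 5), ('SD', 'B', 10, 4), ('ES', 'A', 2, 3), ('MB', 'A', 7, 5),
     ('RR', 'B', 8, 6), ('YY', 'B', 9, 7), ('LA', 'A', 9, 8)],
    columns=('Student', 'Class', 'T1Grade', 'T2Grade'))

# Default tie handling: tied rows share the average of the ranks they span.
avg_rank = grades['T1Grade'].rank(method='average')

# 'min' tie handling: tied rows all get the lowest rank in the tie.
min_rank = grades['T1Grade'].rank(method='min')

print(pd.DataFrame({'Student': grades['Student'],
                    'average': avg_rank,
                    'min': min_rank}))
```

Passing `method='min'` in the pandas recipe is one way to make the tie handling explicit on both sides.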

# Ranks and transform_window 2

Our goal is to get a figure like this:
    
![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example9.png)

This is a slightly more sophisticated version of the above example. Rather than learning the rank for each student overall, we want to calculate their rank in the class. So we first need to group by classes and then sort/calculate ranks. This applies to both Pandas and Altair.

## The Pandas way

In [48]:
# get the data
df = getDF()

In [49]:
# first we groupby class, then extract the T1Grade for each group
# this extracted version is then sorted and a rank value determined.
# Ultimately, this is placed back in the data frame.

df['GradeRank'] = df.groupby('Class')['T1Grade'].rank(ascending=True,
                                                      method='min')

# let's look inside:
df

,Student,Class,T1Grade,T2Grade,GradeRank
0,AV,A,9,5,3.0
1,SD,B,10,4,3.0
2,ES,A,2,3,1.0
3,MB,A,7,5,2.0
4,RR,B,8,6,1.0
5,YY,B,9,7,2.0
6,LA,A,9,8,3.0


In [50]:
# plot the data
alt.Chart(df).mark_bar().encode(
    x='Student',
    y='GradeRank:Q',
    color='Class:N'
)

## The Altair way

In [51]:
# get the data
df = getDF()

In [52]:
alt.Chart(df).transform_window(
    groupby=['Class'],            # group by class
    sort=[{'field':'T1Grade'}],   # sort by the T1 Grade 
    GradeRank='rank(*)',          # determine the rank of that row
).mark_bar().encode(              # plot
    x='Student:N',
    y='GradeRank:Q',
    color='Class:N'
)

# melt (Pandas) and transform_fold (Altair)

Our goal is to get a figure like this:
    
![objective](https://raw.githubusercontent.com/eytanadar/si649public/master/lab4/assets/pandasaltair/example10.png)

This requires a "pivot" on the data. We make the distinction between long-form and wide-form data. The original version of our data has a column for each test and a row for each student. This is fine if we want one mark per student or class (for student/class visualizations the table is already in the appropriate "long form"). However, for working with test grades it is considered "wide" form, because the grades are spread across two columns (T1Grade and T2Grade) instead of sitting in one column with a second column naming the test. We need to make the conversion. One way to do this in Pandas is "melt." In Altair, we would use transform_fold.
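
To see the wide/long relationship in both directions, the pandas-only sketch below melts the two grade columns into long form and then pivots them back to wide form. The names `grades`, `long_form`, `wide_again`, and the `TestScore` column are illustrative choices, not part of the recipe.

```python
import pandas as pd

# Same grade sheet as getDF(), rebuilt here so the snippet is self-contained.
grades = pd.DataFrame(
    [('AV', 'A', 9, 5), ('SD', 'B', 10, 4), ('ES', 'A', 2, 3), ('MB', 'A', 7, 5),
     ('RR', 'B', 8, 6), ('YY', 'B', 9, 7), ('LA', 'A', 9, 8)],
    columns=('Student', 'Class', 'T1Grade', 'T2Grade'))

# Wide -> long: one row per (Student, Test) pair, so 7 students x 2 tests = 14 rows.
long_form = grades[['Student', 'T1Grade', 'T2Grade']].melt(
    id_vars='Student', var_name='Test', value_name='TestScore')

# Long -> wide: one row per Student again, one column per Test.
wide_again = long_form.pivot(index='Student', columns='Test', values='TestScore')

print(long_form.shape, wide_again.shape)
```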

## The Pandas way

In [53]:
# get the data
df = getDF()

In [54]:
# first, just pull out the columns we care about to make this easier
df = df[['Student','T1Grade','T2Grade']]

# we have:
df

,Student,T1Grade,T2Grade
0,AV,9,5
1,SD,10,4
2,ES,2,3
3,MB,7,5
4,RR,8,6
5,YY,9,7
6,LA,9,8


In [55]:
# next, "melt" to create the new version. We're going to indicate that
# we should keep "Student" stable, but for each additional column
# (T1Grade and T2Grade) we'll make a new row. Because we want to know
# which test the score came from, we'll create a "Test" column

df = df.melt('Student',var_name='Test')

# the value of each test will end up in a column named 'value'.
# If you wanted to override this, you could add the argument to melt:
# value_name='TestScore'

# now we have some long form data:
df

,Student,Test,value
0,AV,T1Grade,9
1,SD,T1Grade,10
2,ES,T1Grade,2
3,MB,T1Grade,7
4,RR,T1Grade,8
5,YY,T1Grade,9
6,LA,T1Grade,9
7,AV,T2Grade,5
8,SD,T2Grade,4
9,ES,T2Grade,3


In [56]:
# Now that we have the data in long form, we can group by Test (T1/T2) and calculate 
# the mean for each group

df = df.groupby('Test').agg(ClassMean=pd.NamedAgg(column='value',aggfunc='mean'))

# reset the index (we don't want "Test" to be the index)
df = df.reset_index()

# this looks like:
df

,Test,ClassMean
0,T1Grade,7.714286
1,T2Grade,5.428571


In [57]:
# plot the data

alt.Chart(df).mark_bar().encode(
    x='Test:N',
    y='ClassMean:Q'
)


## The Altair way

In [58]:
# get the data
df = getDF()

In [59]:
alt.Chart(df).transform_fold(
    ['T1Grade','T2Grade'],      # the columns we want to "fold" into 1
    as_  = ['Test','grade']     # the name of the column will get pulled into test and the value into grade
).mark_bar().encode(
    x = 'Test:N',
    y = 'mean(grade):Q'         # we're using a shortcut here to calculate the mean grade per Test
)
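
As a quick cross-check on the numbers behind this chart, the per-test means can also be computed straight from the wide-form table without reshaping. This is a pandas-only sanity check rather than a replacement for the melt/fold recipe, since the long form is what lets a chart treat "Test" as a field; `grades` and `means` are illustrative names.

```python
import pandas as pd

# Same grade sheet as getDF(), rebuilt here so the snippet is self-contained.
grades = pd.DataFrame(
    [('AV', 'A', 9, 5), ('SD', 'B', 10, 4), ('ES', 'A', 2, 3), ('MB', 'A', 7, 5),
     ('RR', 'B', 8, 6), ('YY', 'B', 9, 7), ('LA', 'A', 9, 8)],
    columns=('Student', 'Class', 'T1Grade', 'T2Grade'))

# Column-wise mean over the two test columns; the result is a Series
# indexed by column name (T1Grade, T2Grade).
means = grades[['T1Grade', 'T2Grade']].mean()

print(means.round(6))
```

The values should agree with the ClassMean column from the pandas recipe above.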