# Information Visualization I 
## School of Information, University of Michigan

## Week 2: 
- Expressiveness and Effectiveness
- Grammar of Graphics

## Assignment Overview
### Our objectives for this week:

- Review, reflect, and apply the concepts of encoding. Given a visualization, recreate the data that was encoded.
- Review, reflect, and apply the concepts of Expressiveness and Effectiveness. Given a visualization, evaluate alternatives with the same expressiveness.

!["Drawing"](assets/data-table-resized.png)


<p style="text-align: center;">Two visualizations, same expressiveness </p>

- Review and evaluate an implementation of Grammar of Graphics using [Altair](https://altair-viz.github.io/) 

### The total score of this assignment will be 100 points consisting of:
- Case study reflection: Next Bechdel Test (30 points)
- Altair programming exercise (70 points)
- Bonus (5 points)


### Resources:
- This article by [FiveThirtyEight](https://fivethirtyeight.com) available [online](https://projects.fivethirtyeight.com/next-bechdel/) (Hickey, Koeze, Dottle, Wezerek 2017)  
- Datasets from FiveThirtyEight. We have downloaded a subset of these datasets into the [./assets](./assets) folder for your use.
    - The original dataset can be found on [FiveThirtyEight Next Bechdel Dataset](https://github.com/fivethirtyeight/data/tree/master/next-bechdel)
    
    
### Important notes:
1) Grading for this assignment is entirely done by a human grader. They will be running tests on the functions we ask you to create. This means there is no autograding (submitting through the autograder will result in an error). You are expected to test and validate your own code. 

2) You should guide your answer by the look of our examples. It doesn't need to be pixel perfect (e.g., you may not always know what our example is scaled by), but it should be pretty close.

3) Keep your notebooks clean and readable. If your code is highly messy or inefficient you will get a deduction.

4) Pay attention to the return types of your functions.

5) Follow the instructions for submission on Coursera. You will be providing us a generated link to a read-only version of your notebook and a PDF. When turning in your PDF, please use the File -> Print -> Save as PDF option ***from your browser***. Do ***not*** use the File->Download as->PDF option. Complete instructions for this are under Resources in the Coursera page for this class. If you're having trouble with printing, take a look at [this video](https://youtu.be/PiO-K7AoWjk).

## Part 1. Expressiveness and Effectiveness (30 points)
Read the following article [*Creating the next Bechdel Test*](https://projects.fivethirtyeight.com/next-bechdel/) and answer the following questions:


### 1.1 Recreate the table (by hand or excel) needed to create the following visualization (7 points)

You *should* consider the interactive parts of the visualization in your answer. We do not want you to recreate the visualization, only the table that was used to make it. A table with the right columns and a couple of example rows is sufficient. Take a picture or screenshot of your table and add it to the answer below.

!["Drawing"](assets/article_1.png)


An easy way to upload images is to jump into the [./assets](./assets) directory (or use the Coursera notebook explorer and navigate to it) and then use the upload button to save your image:

![upload](assets/upload.png)

Once you have the image, you can link to it using the markdown command: `![answer1.2](assets/my_image_1.2.png)`

![answer1.1](table.png)

### 1.2 Sketch an alternative visualization with the same expressiveness (7 points)
By hand is fine, but you can also use a tool. This is a sketch; the data need not be perfectly accurate or to scale. Again, upload a picture or screenshot below. Make sure there is enough annotation so it's clear why your picture has the same expressiveness.

![answer1.2](percent_women.png)

### 1.3 Sketch an alternative visualization with the same expressiveness of the following visualization (10 points)

!["article"](assets/article_2_resized.png)


Same deal as last question: by hand or with a tool is fine. The data need not be perfectly accurate or to scale. Make sure there is enough annotation so it's clear why your picture has the same expressiveness. Again, upload a picture or screenshot below. 

![answer 2.1](grouping.png)

### 1.4 Reflect on which visualization you think is more *effective* and why (6 points)
You are comparing the original figure in 1.3 and the one you created.

My visualization is more effective because it groups the categories of tests, and the color in the stacked bar graph provides a visual basis for comparison. You can more easily see how many tests were passed overall and how many were passed in each category.

## Part 2. Altair programming exercise (70 points)
We have provided some code to create visualizations based on the following datasets:

1. [all_tests](assets/nextBechdel_allTests.csv) is a collection of different Bechdel test results for the 50 top-grossing films [at the domestic box office in 2016](https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/released-in-2016).
2. [cast_gender](assets/nextBechdel_castGender.csv) is the gender of every cast member of each movie in the Bechdel rankings.
3. [top_2016](assets/top_2016.csv) is the date, box office, and theater count for each top 2016 movie.

Complete each assignment function and run each cell to generate the final visualizations.
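
The provided loading code joins these three tables on the movie title, which yields one row per cast member carrying that movie's rank and test results. The following is a minimal plain-Python sketch of that idea, not the notebook's pandas code. All rows and values are made up for illustration, and `join_on_title` is a hypothetical helper; only the column names `Movie`, `MOVIE`, `movie`, `Rank`, `ACTOR`, `GENDER`, `bechdel`, and `movieName` come from the notebook.

```python
# Toy stand-ins for the three CSV files; every value here is made up.
top_2016 = [
    {"Movie": "Film A", "Rank": 1},
    {"Movie": "Film B", "Rank": 2},
]
cast_gender = [
    {"MOVIE": "Film A", "ACTOR": "Actor 1", "GENDER": "Female"},
    {"MOVIE": "Film A", "ACTOR": "Actor 2", "GENDER": "Male"},
    {"MOVIE": "Film B", "ACTOR": "Actor 3", "GENDER": "Male"},
]
all_tests = [
    {"movie": "Film A", "bechdel": 0},
    {"movie": "Film B", "bechdel": 1},
]


def join_on_title(top, cast, tests):
    """Return one row per cast member with the movie's rank and test results."""
    tests_by_title = {row["movie"]: row for row in tests}
    joined = []
    for film in top:
        title = film["Movie"]
        if title not in tests_by_title:
            continue  # simplification: skip movies with no test results
        for member in cast:
            if member["MOVIE"] != title:
                continue
            row = {"movieName": title, "Rank": film["Rank"]}
            # copy the cast columns, then the test columns, dropping the join keys
            row.update({k: v for k, v in member.items() if k != "MOVIE"})
            row.update({k: v for k, v in tests_by_title[title].items() if k != "movie"})
            joined.append(row)
    return joined


rows = join_on_title(top_2016, cast_gender, all_tests)
print(len(rows))  # one row per cast member
```

Because the joined table has one row per cast member, counting rows per movie later on is the same as counting cast members.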


In [1]:
import pandas as pd
import numpy as np
import altair as alt

In [2]:
# enable correct rendering
alt.renderers.enable('default')

RendererRegistry.enable('default')

In [3]:
# uses intermediate json files to speed things up
alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')

In [4]:
def load_bechdel_data(alldata='assets/nextBechdel_allTests.csv',
                     castgenderdata='assets/nextBechdel_castGender.csv',
                     top2016data='assets/top_2016.csv'):
    # read all the tables
    all_tests_df = pd.read_csv(alldata)
    cast_gender = pd.read_csv(castgenderdata)
    top_2016 = pd.read_csv(top2016data)

    # set up the tables for use
    act_movies = top_2016.set_index('Movie').join(cast_gender.set_index('MOVIE')).join(all_tests_df.set_index('movie')).reset_index().dropna()
    mov_order = top_2016.sort_values(by=['Rank'])['Movie'].tolist()
    act_movies = act_movies.rename(columns={'index': 'movieName'})
    return(act_movies,mov_order)


In [5]:
actors_movies,movies_order = load_bechdel_data()

actors_movies.columns

Index(['movieName', 'Rank', 'Domestic Box Office', 'Opening Theater Count',
       'Opening Weekend Box Office', 'Max Theater Count', 'ACTOR',
       'CHARACTER_NAME', 'TYPE', 'BILLING', 'GENDER', 'bechdel', 'peirce',
       'landau', 'feldman', 'villareal', 'hagen', 'ko', 'villarobos', 'waithe',
       'koeze_dottle', 'uphold', 'white', 'rees-davies'],
      dtype='object')

### 2.1 Variables encoded (5 points)
Warmup: how many variables are encoded in the following visualization?  

This will be a number. Hint: think of how you would write a caption for the image. How many variables do you need to describe?

We also suggest you look closely at this code and make sure you understand what we're doing. Most of the other problems below will follow this structure.
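
The structure in question is: filter out unknown rows once (the "base" chart), filter again for the specific chart, then count rows per movie. As a rough plain-Python analogy of what those Altair transforms compute, here is a sketch with made-up rows. `female_cast_counts` is a hypothetical helper, and the `TYPE` values other than `Unknown` are placeholders rather than values taken from the dataset.

```python
from collections import Counter

# Made-up rows shaped like actors_movies (only the columns this sketch needs).
rows = [
    {"movieName": "Film A", "GENDER": "Female", "TYPE": "Leading"},
    {"movieName": "Film A", "GENDER": "Female", "TYPE": "Supporting"},
    {"movieName": "Film A", "GENDER": "Male", "TYPE": "Leading"},
    {"movieName": "Film B", "GENDER": "Female", "TYPE": "Unknown"},
    {"movieName": "Film B", "GENDER": "Female", "TYPE": "Cameo"},
]


def female_cast_counts(rows):
    # Step 1: the shared "base" filter drops rows with unknown TYPE or GENDER.
    known = [
        r for r in rows
        if r["TYPE"] != "Unknown" and r["GENDER"] not in ("Unknown", "null")
    ]
    # Step 2: the chart-specific filter keeps only female cast members.
    females = [r for r in known if r["GENDER"] == "Female"]
    # Step 3: count(movieName) grouped by movie gives one bar length per movie.
    return Counter(r["movieName"] for r in females)


counts = female_cast_counts(rows)
print(counts)
```

Doing step 1 once and reusing it is the reason for the "base" chart: each later chart only adds its own filter and encoding.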

In [6]:
def get_base_vis(indf):
    #input: indf, the base chart dataset (i.e., actors_movies)

    # this is a "base" chart -- on its own, it will not display anything because we haven't defined a mark
    # but we can do operations that we'll want to do multiple times, in this case filtering some data
    return alt.Chart(indf).transform_filter(
        (alt.datum.TYPE != 'Unknown') & (alt.datum.GENDER != 'Unknown') & (alt.datum.GENDER != 'null')
    )
    
base_vis = get_base_vis(actors_movies)

In [7]:
# this is where we build the actual chart
def gen_f_bar_vis(base, indf, morder = movies_order):
    """
    input: base -- 'base' chart as defined above
    input: indf -- a dataframe like actors_movies -- we *could* use this, but it's easier to 
                   use the 'base' chart so we're not repeating work
    morder: the order of the movies, defaults to movies_order as created at the top of the file
    
    we will "add" to the base chart with an additional transform filter
    """
    encoding = base.transform_filter(
        alt.datum.GENDER == 'Female'
    ).encode(
        # adding all the encoding stuff
        y= alt.Y(
            'movieName:N',
            sort= morder
        ),
        x=alt.X('count(movieName):Q',
                title='cast count'),
    ).mark_bar( # use a "bar" symbol
    ).properties(
        # set the title
        title='Female'
    )
    
    
    # The less efficient way (where we don't use "base" but repeat work):
    
    # encoding = alt.Chart(indf).mark_bar().transform_filter(
    #     (alt.datum.TYPE != 'Unknown') & (alt.datum.GENDER != 'Unknown') & (alt.datum.GENDER != 'null')
    # ).transform_filter(
    #     alt.datum.GENDER == 'Female'
    # ).encode(
    #     y= alt.Y(
    #         'movieName:N',
    #         sort= morder
    #     ),
    #     x=alt.X('count(movieName):Q',
    #            title='cast count'),
    # ).properties(title='Female')
    # 
    
    return encoding

female_bar = gen_f_bar_vis(base_vis,actors_movies)

In [8]:
female_bar

How many variables are encoded in the visualization above? Add your answer below.
YOUR ANSWER HERE

In [None]:
Answer: 2. Movie title, cast count

### 2.2 Alternative encoding (5 points)
Complete the `gen_f_circle_vis` function. Change the encoding used in the previous example from a bar to a circle. Your visualization should look like 

!["Drawing"](assets/problem_22_crop.png)

click [here](assets/problem_22_full.png) to see the full-sized image.

In [9]:
def gen_f_circle_vis(base, indf, morder = movies_order):
    """
    input: base -- 'base' chart as defined above
    input: indf -- a dataframe like actors_movies (you can use this or 'base' to achieve the desired work)
    morder: the order of the movies, defaults to movies_order as created at the top of the file
    
    return the call to altair function that uses the circle mark for the variables encoded in the previous example
    """
    encoding = base.transform_filter(
        alt.datum.GENDER == 'Female'
    ).encode(
        y= alt.Y(
            'movieName:N',
            sort= morder
        ),
        x=alt.X('count(movieName):Q',
                title='cast count'),
    ).mark_circle( 
    ).properties(
        title='Female'
    )
    
    return encoding

In [10]:
# generate and display the vis
circle = gen_f_circle_vis(base_vis, actors_movies)
circle

### 2.3 Increase variables encoded (5 points)
Complete the `gen_f_stacked_vis` function. Modify the first bar chart so that the color of the bar encodes the type of the actor. Your visualization should look like the following (note that we don't have labels yet):
!["Drawing"](assets/problem_23_crop.png)

click [here](assets/problem_23_full.png) to see the full-sized image.

- _Partial credit can be granted for each visualization (up to 2 points) if you provide a description of what the missing piece of the function is supposed to do without need for an Altair working version_


In [11]:
def gen_f_stacked_vis(base, indf, morder = movies_order):
    """
    input: base -- 'base' chart as defined above
    input: indf -- a dataframe like actors_movies (you can use this or 'base' to achieve the desired work)
    morder: the order of the movies, defaults to movies_order as created at the top of the file
 
    return the call to Altair function that uses the bar mark for the variables and the color for the TYPE 
    """

    
    # this is the original chart, you should replace this code
    encoding = base.transform_filter(
        alt.datum.GENDER == 'Female'
    ).encode(
        y= alt.Y(
            'movieName:N',
            sort= morder,
            axis=None ## remove the Y axis labels and tick marks
        ),
        x=alt.X('count(movieName):Q',
                title='cast count'),
        color='TYPE',
        order=alt.Order('TYPE',sort='descending')
    )

    return encoding.mark_bar().properties(title='Female')

In [12]:
female = gen_f_stacked_vis(base_vis, actors_movies)
female

### 2.4 Change filter transform (5 points)
Complete the `gen_m_stacked_vis` function. Modify the previous visualization so that the actors visualized have Male gender. Use the Altair transform function for this, not Pandas.


!["Drawing"](assets/problem_24_crop.png)

click [here](assets/problem_24_full.png) to see the full-sized image.

- _Partial credit can be granted for each visualization (up to 2 points) if you provide a description of what the missing piece of the function is supposed to do without need for an altair working version_

In [13]:
def gen_m_stacked_vis(base, indf, morder = movies_order):
    """
    input: base -- 'base' chart as defined above
    input: indf -- a dataframe like actors_movies (you can use this or 'base' to achieve the desired work)
    morder: the order of the movies, defaults to movies_order as created at the top of the file
   
    return the call to Altair function that uses the bar mark for the variables and the color for the TYPE 
    """
    #add filter transform
    
    # This is the starting point. Again, modify or replace this code to get the encoding we describe above
    encoding = base.transform_filter(
        alt.datum.GENDER == 'Male'
    ).encode(
        y=alt.Y(
            'movieName:N',
            sort=morder
        ),
        # sort='descending' reverses the x axis so the bars grow to the left
        x=alt.X('count(movieName):Q',
                sort='descending',
                title='cast count'),
        color='TYPE',
        order=alt.Order('TYPE', sort='descending')
    )

    # apply the mark and the title once
    return encoding.mark_bar().properties(title='Male')

In [14]:
male = gen_m_stacked_vis(base_vis, actors_movies)
male

### 2.5 Variables encoded 2 (5 points)
Execute the following cell and determine how many variables are being encoded in the combined visualization. If you have been able to complete the previous examples, the plot should look like this

!["Drawing"](assets/problem_25_crop.png)

click [here](assets/problem_25_full.png) to see the full-sized image.

In [15]:
def get_middle_vis(base, indf, morder = movies_order, rank_order = movies_order):
    """
    input: base -- 'base' chart as defined above
    input: indf -- a dataframe like actors_movies (you can use this or 'base' to achieve the desired work)
    input: rank_order -- the original rank order of the movies
    input: morder -- the desired order of the movies, defaults to movies_order as created at the top of the file 
    
    return the "middle" column -- a text visualization
    """
    
    # sort to match morder (if needed), using each movie's position in rank_order
    rankord = [rank_order.index(movie)+1 for movie in morder]
   
    middle = base.encode(
        y=alt.Y('rnk:O', axis=None, sort = rankord),
        text=alt.Text('rnk:O'),
        color=alt.Color('bechdel:N')
    ).transform_aggregate(
        rnk='mean(Rank)',
        groupby=['movieName','bechdel']
    ).mark_text().properties(width=20)
    return(middle)
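
The rank list built at the top of `get_middle_vis` turns a display order of movie names into the 1-based ranks that the `rnk` axis is sorted by. A small standalone sketch of that lookup, with made-up movie names and a hypothetical helper name `rank_positions`:

```python
def rank_positions(rank_order, display_order):
    """For each movie in display_order, return its 1-based position in rank_order."""
    # A dict lookup avoids calling list.index once per movie.
    position = {movie: i + 1 for i, movie in enumerate(rank_order)}
    return [position[movie] for movie in display_order]


rank_order = ["Film A", "Film B", "Film C"]  # made-up titles, best rank first

# Same order in and out: ranks come back as 1, 2, 3.
print(rank_positions(rank_order, rank_order))

# A different display order yields the ranks in that order.
print(rank_positions(rank_order, ["Film C", "Film A", "Film B"]))
```

With the default arguments the display order equals the rank order, so the sort list is simply 1 through the number of movies.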


In [16]:
# merge together the three charts, male, middle, female
middle = get_middle_vis(base_vis, actors_movies)

male | middle | female

How many variables are encoded in the visualization above? (this should be an integer)
YOUR ANSWER HERE

### 2.6 Alternative encoding 1 (20 Points)
Create a new visualization within the `gen_bubble_vis` function with the following encoding:
- Use circles as the mark
- Use the size of the circles to encode the number of actors in each category
- Use the y position of the circle to encode the movie
- Use the x position of the circle to encode the type of actor
- Use the color of the circle to encode the gender of the actor
- Match the styling of the example (it doesn't need to be pixel perfect, but should be close)
- You don't need to generate the leftmost column (it's the same as "middle" above and we'll add it at the end).


!["Drawing"](assets/problem_26_crop.png)

click [here](assets/problem_26_full.png) to see the full-sized image.

- _Partial credit can be granted for each visualization (up to 5 points) if you provide a description of what the missing piece of the function is supposed to do_
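
As a way to reason about this encoding before writing any Altair, note that each circle corresponds to one (movie, type, gender) group and its size reflects the number of rows in that group. The plain-Python sketch below uses made-up rows and hypothetical helpers `bubble_groups` and `circle_area`. The linear mapping from counts to an area between 0 and 2000 is an assumption made for illustration (a domain running from 0 to the largest count), not a statement about Altair's default scale behavior.

```python
from collections import Counter

# Made-up rows shaped like actors_movies; TYPE values are placeholders.
rows = [
    {"movieName": "Film A", "TYPE": "Leading", "GENDER": "Female"},
    {"movieName": "Film A", "TYPE": "Leading", "GENDER": "Male"},
    {"movieName": "Film A", "TYPE": "Leading", "GENDER": "Male"},
    {"movieName": "Film A", "TYPE": "Supporting", "GENDER": "Male"},
    {"movieName": "Film B", "TYPE": "Supporting", "GENDER": "Female"},
]


def bubble_groups(rows):
    """Count rows per (movie, type, gender); each key would become one circle."""
    return Counter((r["movieName"], r["TYPE"], r["GENDER"]) for r in rows)


def circle_area(count, max_count, max_area=2000):
    """Assumed linear map from [0, max_count] to [0, max_area]."""
    return max_area * count / max_count


groups = bubble_groups(rows)
largest = max(groups.values())
for key, n in sorted(groups.items()):
    print(key, n, circle_area(n, largest))
```

Two groups that share a movie and a type but differ in gender land on the same x and y position, which is one reason a partially transparent mark helps in this chart.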

In [18]:
def gen_bubble_vis(base, indf, morder = movies_order):
    """
    input: base -- 'base' chart as defined above
    input: indf -- a dataframe like actors_movies (you can use this or 'base' to achieve the desired work)
    input: morder -- the order of the movies, defaults to movies_order as created at the top of the file  
    
    return an altair chart per the specification above
    """
    #change this
    plot = base.mark_circle(
       opacity=0.8,
       stroke='black',
       strokeWidth=1
    ).encode(
        alt.Y('movieName:N', sort= morder),
        alt.X('TYPE:N'),
        alt.Size('count(movieName):Q',scale=alt.Scale(range=[0, 2000]), legend=alt.Legend(title='Count of actors')),
        color='GENDER'
    ).properties(
        width=350,
        height=880
    )
 
    return plot

In [19]:
# let's create the bubble chart 
al_enc_one = gen_bubble_vis(base_vis, actors_movies)

# add middle to the left edge and display (if you modified the height for the chart in al_enc_one
# you should use the same value for "middle" below)

(middle.properties(height=880) | al_enc_one)

### 2.7 Alternative encoding 2 (25 Points)
We will be completing the "heat map" style vis as specified below, but it will be easier to create a male and female subchart in two separate functions (the code will get long otherwise) and then put them together. Complete `gen_f_hm_vis()` and `gen_m_hm_vis()` functions to create a new visualization with the following encoding:
- The left and right plots should filter female and male actors respectively
- Use rectangles as the mark
- Use the text inside each rectangle to encode the count of actors in each category (gender, type and movie)
- Use the y position of the rectangle to encode the movie
- Use the x position of the rectangle to encode the type of actor
- Use the color of the rectangle to encode whether that movie passes the Bechdel test or not (bechdel variable)
- Note that only the female vis has labels on the left, the male vis does not

The top of the female plot would look like this (click [here](assets/problem_27a_full.png) to see the full-sized image):

!["Drawing"](assets/problem_27a_crop.png)

If you've done everything correctly, your final visualization should look like the one below (click [here](assets/problem_27b_full.png) for the full plot).

!["Drawing"](assets/problem_27b_crop.png)

- _Partial credit can be granted for each visualization (up to 4 points for each function) if you provide a description of what the missing piece of the function is supposed to do without need for an Altair working version_

In [41]:
def gen_f_hm_vis(base, indf, morder = movies_order):
    """
    input: base -- 'base' chart as defined above
    input: indf -- a dataframe like actors_movies (you can use this or 'base' to achieve the desired work)
    input: morder -- the order of the movies, defaults to movies_order as created at the top of the file
    
    return an altair chart per the specification above
    """
    # modify to add filter transform
    plot = base.transform_filter(
        alt.datum.GENDER == 'Female'
    ).mark_rect().encode(
       x= alt.X('TYPE:N'),
       y= alt.Y('movieName:N',sort= morder),
        color='bechdel:N'
    )
    #modify to add filter transform
    text = base.transform_filter(
        alt.datum.GENDER == 'Female'
    ).mark_text(baseline='middle').encode(
        x=alt.X('TYPE:N'),
        y=alt.Y('movieName:N', sort= morder),
        text='count()'
    )
    return plot+text

In [42]:
# create the female side
f_a_1 = gen_f_hm_vis(base_vis, actors_movies)

# display it to check
f_a_1

In [48]:
def gen_m_hm_vis(base, indf, morder = movies_order):
    """
    input: base -- 'base' chart as defined above
    input: indf -- a dataframe like actors_movies (you can use this or 'base' to achieve the desired work)
    input: morder -- the order of the movies, defaults to movies_order as created at the top of the file
    
    return an altair chart per the specification above
    """
    #add filter transform
    plot = base.transform_filter(
        alt.datum.GENDER == 'Male'
    ).mark_rect().encode(
       x= alt.X('TYPE:N'),
       y= alt.Y('movieName:N',sort= morder,axis=None),
        color='bechdel:N'
    )
    #modify to add filter transform
    text = base.transform_filter(
        alt.datum.GENDER == 'Male'
    ).mark_text(baseline='middle').encode(
        x=alt.X('TYPE:N'),
        y=alt.Y('movieName:N', sort= morder,axis=None),
        text='count()'
    )
    return plot+text

In [49]:
m_a_1 = gen_m_hm_vis(base_vis, actors_movies)
m_a_1

In [50]:
# create the visualization
(f_a_1 | middle | m_a_1)

### 2.8 (Bonus) Compare expressiveness and effectiveness (5 points)
Look at the visualization for question 2.7. How does this visualization compare in terms of expressiveness and effectiveness to the visualizations in questions 2.5 and 2.6?

_2.8 Answer_