# Introduction to Data Visualizations in Altair


We're going to learn how to create a simple interactive visualization using [Altair](https://altair-viz.github.io/), a Python library built on top of Vega and Vega-Lite, two visualization grammars implemented in JavaScript.


## Install Altair

In [None]:
# If you're working on your own local machine (and not the cloud), uncomment the line below and run this cell
#!pip install altair

## Import Altair

In [1]:
import altair as alt

## Import our data

We're going to go back to a dataset we used in previous classes: Hannah Anderson and Matt Daniels's Film Scripts dataset.

I've created two CSV files from the dataframes we created as part of [Week 7's Exploratory Data Analysis Lesson](https://github.com/sceckert/IntroDHSpring2021/blob/main/_week7/exploratory-data-analysis-with-pandas.ipynb).

If you want to refresh yourself on how we created these datasets, look back at Week 7.

In [2]:
import pandas as pd

In [3]:
women_character_film_data_df = pd.read_csv('../_datasets/women_character_film_data.csv', encoding='utf-8')
men_character_film_data_df = pd.read_csv('../_datasets/men_character_film_data.csv', encoding='utf-8')

In [4]:
women_character_film_data_df.head()

Unnamed: 0,script_id,imdb_character_name,words,gender,age,total_dialogue,imdb_id,title,year,gross (inflation-adjusted),link,proportion_of_dialogue
0,280,betty,311,f,35.0,6394,tt0112579,The Bridges of Madison County,1995,142.0,http://www.awesomefilm.com/script/bomc.txt,0.048639
1,280,carolyn johnson,873,f,,6394,tt0112579,The Bridges of Madison County,1995,142.0,http://www.awesomefilm.com/script/bomc.txt,0.136534
2,280,eleanor,138,f,,6394,tt0112579,The Bridges of Madison County,1995,142.0,http://www.awesomefilm.com/script/bomc.txt,0.021583
3,280,francesca johns,2251,f,46.0,6394,tt0112579,The Bridges of Madison County,1995,142.0,http://www.awesomefilm.com/script/bomc.txt,0.352049
4,280,madge,190,f,46.0,6394,tt0112579,The Bridges of Madison County,1995,142.0,http://www.awesomefilm.com/script/bomc.txt,0.029715
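
A side note on the `Unnamed: 0` column in the output above: it is most likely the old dataframe index that was written out when the CSV was saved (an assumption about how the files were made, not something the lesson states). The minimal sketch below uses a small inline CSV built from the first two rows above, rather than the real file, to show how `index_col=0` asks pandas to treat that first column as the index.

```python
import io

import pandas as pd

# A tiny inline CSV standing in for the real file (first two rows, fewer columns).
# The empty first header cell is what pandas labels "Unnamed: 0".
csv_text = (
    ",script_id,imdb_character_name,words\n"
    "0,280,betty,311\n"
    "1,280,carolyn johnson,873\n"
)

# Default behavior: the unlabeled first column becomes a regular column
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Either version works for the plots in this lesson, since we never use that column.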


## Let's visualize our data

Let's say we want to visualize the proportion of **men's** dialogue in films, by year.

First, we have to isolate the data that we want to plot, since we don't want all of our dataframe.

To do that, we're going to create a new dataframe (`men_dialogue_df`) with just the data we want to plot. We use the `.groupby()` method to group by film titles and years, and then we sum up the proportion of dialogue within each group.

In [6]:
men_dialogue_df = men_character_film_data_df.groupby(['title', 'year'])[['proportion_of_dialogue']].sum()\
.sort_values(by='proportion_of_dialogue', ascending=False).reset_index()
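
To see what this chain of methods does, here is a minimal sketch on a made-up dataframe. The name `toy_df`, the film titles, and the numbers are all hypothetical and only there to make the grouping easy to follow.

```python
import pandas as pd

# Hypothetical character-level data: two characters in "Film A", one in "Film B"
toy_df = pd.DataFrame({
    'title': ['Film A', 'Film A', 'Film B'],
    'year': [1990, 1990, 2000],
    'proportion_of_dialogue': [0.25, 0.5, 0.125],
})

# Same steps as above: group by film, sum the proportions, sort, flatten the index
toy_dialogue_df = (
    toy_df.groupby(['title', 'year'])[['proportion_of_dialogue']]
    .sum()
    .sort_values(by='proportion_of_dialogue', ascending=False)
    .reset_index()
)

print(toy_dialogue_df)
```

The result has one row per film: the two "Film A" characters are added together (0.25 + 0.5 = 0.75), and `.reset_index()` turns `title` and `year` back into ordinary columns so that Altair can refer to them by name.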

### Plot our visualization

In [16]:
# Create a static Altair chart.
chart = alt.Chart(men_dialogue_df, title='How Much Do Men Speak in Hollywood Films?').mark_circle(size=70).encode(
    alt.X('year:Q', axis=alt.Axis(format='', title='Release Year'),
          scale=alt.Scale(domain=(1920, 2020))),
    alt.Y('proportion_of_dialogue', axis=alt.Axis(format='%', title='Amount of Dialogue Spoken by Men'))
).properties(width=800, height=500)
chart
# To plot our chart with a simple linear regression plotted on top in red, uncomment the line below
# chart + chart.transform_regression('year', 'proportion_of_dialogue').mark_line(color="red")

# To plot a simple polynomial regression, uncomment the line below
# chart + chart.transform_regression('year', 'proportion_of_dialogue', method='poly').mark_line(color="red")

What did we just do? 

- We create a new `Chart` object from our dataframe.
- We then specify the type of mark we are using (in this case, a `circle`). In Altair, we can use many types of marks to represent our data: https://altair-viz.github.io/user_guide/marks.html
- We call `encode` to specify which variables we want to represent on the x and y axes. Here we also set the domain for the x-axis (from 1920 to 2020), say how we want the data to be formatted (i.e., making sure that proportions are displayed as percentages), and add titles to our x and y axes.
- Finally, we call `properties` to tell Altair how large, in pixels, to make our visualization.

Altair has many other encoding channels: https://altair-viz.github.io/user_guide/encoding.html

## Let's make an interactive visualization of our data

We're going to follow the same steps as above: specifying a chart, the type of mark, the encoding, and the properties.

But we're also going to add two new steps:

- We're going to specify a `tooltip`, which will display additional information about our datapoints when we hover over them.
- We're going to call `interactive` to specify that we want our visualization to be interactive (so that we can zoom and pan).

Altair has built-in support for interactivity.

Altair also has a method for saving your chart (see below).

In [8]:
alt.Chart(men_dialogue_df, title='How Much Do Men Speak in Hollywood Films?').mark_circle(size=70).encode(
    alt.X('year:Q', axis=alt.Axis(format='', title='Release Year'),
          scale=alt.Scale(domain=(1920, 2020))),
    alt.Y('proportion_of_dialogue', axis=alt.Axis(format='%', title='Amount of Dialogue Spoken by Men')),
    tooltip=['title', 'year', alt.Tooltip('proportion_of_dialogue:Q', format='.1%')]
).interactive().properties(width=800, height=500)

What do you notice in the tooltip bar? Why do you think the dialogue proportion field is encoded slightly differently?
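
As a hint for the second question: Altair's `format` strings follow D3's number formatting, which is modeled on Python's own format specification. That means we can get a feel for `'.1%'` in plain Python, without drawing a chart. The sketch below is only an analogy, using three proportions from the dataframe preview earlier in this lesson.

```python
# Proportions taken from the first rows of the women's dataframe shown earlier
proportions = [0.048639, 0.136534, 0.352049]

# '.1%' multiplies by 100, keeps one decimal place, and appends a percent sign
formatted = [format(p, '.1%') for p in proportions]
print(formatted)  # ['4.9%', '13.7%', '35.2%']
```

Without a format string, the tooltip would show the raw proportion rather than a percentage.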

### Saving a chart in Altair as an HTML file
If you right-click at the top of one of the above charts, you'll be able to download the chart as a static image. But what if we wanted to save an *interactive* version? To save a chart made in Altair as an HTML file, use the `.save('your-name-for-your-chart.html')` method. For instance, we could add `.save('chart1.html')` at the very end of our line of code, after the `properties()` specification.

Click and run the cell below. Look in the left-hand panel of JupyterLab. Do you see your HTML file?

In [9]:
alt.Chart(men_dialogue_df, title='How Much Do Men Speak in Hollywood Films?').mark_circle(size=70).encode(
    alt.X('year:Q', axis=alt.Axis(format='', title='Release Year'),
          scale=alt.Scale(domain=(1920, 2020))),
    alt.Y('proportion_of_dialogue', axis=alt.Axis(format='%', title='Amount of Dialogue Spoken by Men')),
    tooltip=['title', 'year', alt.Tooltip('proportion_of_dialogue:Q', format='.1%')]
).interactive().properties(width=800, height=500).save('chart-men-dialogue.html')

## Your turn! 

1. Go through the steps of creating a dataframe called `women_dialogue_df` with just the years, titles, and proportions of dialogue (see our above example with `men_character_film_data_df`).
2. Create a static visualization of the proportion of dialogue spoken by women over time.
3. Create an interactive visualization of the same data and save it as an HTML file.

In [None]:
## Your Code here for creating a dataframe called `women_dialogue_df`

In [None]:
## Your Code here for creating a static visualization

In [None]:
## Your code here for creating an interactive visualization