# Marks for Temporal Data

## Learning Goals
Those who actively work through this notebook will be able to:
 - Perform various data wrangling tasks to format datasets
 - Use graphical marks to create common visualizations
 - Create visualizations for sequential temporal tasks (e.g., line, area, and stacked area charts)
 - Create visualizations for cyclic temporal tasks (e.g., heatmaps)

## Data Wrangling: Currency Exchange Data
We will be using a Kaggle dataset of daily exchange rates per euro from 1999 to 2023. For more information about the data, visit https://www.kaggle.com/datasets/lsind18/euro-exchange-daily-rates-19992020


In [354]:
import pandas as pd
import altair as alt


### Load the data from the csv file
Remember to parse the 'Date' column appropriately.

In [355]:
data = ...
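
As a minimal sketch of date parsing at load time, the `parse_dates` argument tells `read_csv` which columns to convert. The in-memory CSV below is a made-up stand-in, since the real file path is not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt standing in for the real CSV file.
csv_text = """Date,Canadian dollar,US dollar
2022-12-30,1.444,1.0666
2022-12-29,1.4475,1.0649
"""

# parse_dates converts the 'Date' strings to datetime64 values while loading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(sample.dtypes)
```

With the real file, the same call would take the file path in place of the `StringIO` object.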

### Basic Data Wrangling Task
1. What is the size of the dataset?
2. What are the column names?
3. Is the data in an appropriate form for us to encode it with Altair?


In [356]:
...

(6209, 7)

In [357]:
...

Unnamed: 0,Date,Canadian dollar,Chinese yuan renminbi,UK pound sterling,Indian rupee,Mexican peso,US dollar
0,2022-12-30,1.444,7.3582,0.88693,88.171,20.856,1.0666
1,2022-12-29,1.4475,7.4151,0.88549,88.2295,20.651,1.0649
2,2022-12-28,1.4361,7.4224,0.88058,88.0943,20.6856,1.064
3,2022-12-27,1.4384,7.3994,0.88333,88.0808,20.5515,1.0624
4,2022-12-23,1.4433,7.4198,0.8803,87.958,20.7115,1.0622


In [358]:
...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6209 entries, 0 to 6208
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Date                    6209 non-null   datetime64[ns]
 1   Canadian dollar         6209 non-null   object        
 2   Chinese yuan renminbi   5941 non-null   object        
 3   UK pound sterling       6209 non-null   object        
 4   Indian rupee            5941 non-null   object        
 5   Mexican peso            6209 non-null   object        
 6   US dollar               6209 non-null   object        
dtypes: datetime64[ns](1), object(6)
memory usage: 339.7+ KB


In [359]:
...

Index(['Date', 'Canadian dollar ', 'Chinese yuan renminbi ',
       'UK pound sterling ', 'Indian rupee ', 'Mexican peso ', 'US dollar '],
      dtype='object')

### Data Wrangling - Making It Tidy
Switch the data from a wide format to a long format in which each row depicts a single item.
Use `melt` to do this; set `var_name` to 'currency' and `value_name` to 'rate'.

    

In [361]:
...

Unnamed: 0,variable,value
0,Date,2022-12-30 00:00:00
1,Date,2022-12-29 00:00:00
2,Date,2022-12-28 00:00:00
3,Date,2022-12-27 00:00:00
4,Date,2022-12-23 00:00:00
...,...,...
43458,US dollar,1.1659
43459,US dollar,1.1632
43460,US dollar,1.1743
43461,US dollar,1.179


In [362]:
data = ...
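
The difference between melting with and without an identifier column can be sketched on a tiny made-up frame. Passing `id_vars` keeps the date on every row instead of melting it into the `variable` column. Stripping the trailing spaces from the column names is an optional extra step assumed here, not something the exercise requires.

```python
import pandas as pd

# Made-up wide-format excerpt; note the trailing spaces in the column names.
wide = pd.DataFrame({
    "Date": pd.to_datetime(["2022-12-30", "2022-12-29"]),
    "Canadian dollar ": ["1.444", "1.4475"],
    "US dollar ": ["1.0666", "1.0649"],
})

# Optional cleanup (an assumption): remove whitespace around column names.
wide.columns = wide.columns.str.strip()

# 'Date' stays as an identifier; every other column becomes a (currency, rate) pair.
tidy = wide.melt(id_vars="Date", var_name="currency", value_name="rate")
print(tidy)
```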

### Basic Data Wrangling Task
1. What is the size of the dataset?
2. What are the column names?
3. Are there any items with NaN values? If so, remove them.
4. Is the data in an appropriate form for us to encode it with Altair?
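
One possible way to handle missing and non-numeric rates is sketched below. The `'-'` placeholder is a hypothetical example of a non-numeric entry; check what the real `rate` column contains before relying on this.

```python
import pandas as pd

# Made-up long-format rows; '-' is a hypothetical non-numeric placeholder.
tidy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-12-30", "2022-12-29", "2022-12-28"]),
    "currency": ["US dollar"] * 3,
    "rate": ["1.0666", "-", None],
})

# Convert strings to floats; anything that cannot be parsed becomes NaN.
tidy["rate"] = pd.to_numeric(tidy["rate"], errors="coerce")
print(tidy["rate"].isna().sum())

# Drop rows without a usable rate and rebuild the index.
tidy = tidy.dropna(subset=["rate"]).reset_index(drop=True)
print(tidy.shape)
```

Converting `rate` to a numeric dtype matters for Altair: an `object` column of strings would otherwise be treated as nominal rather than quantitative.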


In [363]:
data

Unnamed: 0,Date,currency,rate
0,2022-12-30,Canadian dollar,1.444
1,2022-12-29,Canadian dollar,1.4475
2,2022-12-28,Canadian dollar,1.4361
3,2022-12-27,Canadian dollar,1.4384
4,2022-12-23,Canadian dollar,1.4433
...,...,...,...
37249,1999-01-08,US dollar,1.1659
37250,1999-01-07,US dollar,1.1632
37251,1999-01-06,US dollar,1.1743
37252,1999-01-05,US dollar,1.179


In [364]:
...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37254 entries, 0 to 37253
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      37254 non-null  datetime64[ns]
 1   currency  37254 non-null  object        
 2   rate      36718 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 873.3+ KB


In [365]:
...

Index(['Date', 'currency', 'rate'], dtype='object')

In [369]:
...

(37254, 3)

### Additional Wrangling Steps
1. Make sure all column names start with lowercase letters.
2. Create a new column called `country` that holds the first word of each currency name,
e.g., 'Canadian dollar' should be split so that the `country` column contains only 'Canadian'.
3. Add a `year` column that holds the year for each item. Hint: see Datetime in Pandas (e.g., `pd.DatetimeIndex(df['date']).year`).
4. Only keep data that was collected on the 15th of each month (filter out all other dates).
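
The four steps above can be sketched on a small made-up frame. The rates are invented, and the `.dt` accessor is used as an equivalent alternative to the `pd.DatetimeIndex` hint.

```python
import pandas as pd

# Made-up rows standing in for the tidy exchange-rate data.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-12-15", "2022-12-14", "2021-06-15"]),
    "currency": ["Canadian dollar", "Canadian dollar", "US dollar"],
    "rate": [1.4452, 1.4438, 1.2123],
})

# 1. Lowercase every column name.
df.columns = df.columns.str.lower()

# 2. The first word of the currency name becomes the country.
df["country"] = df["currency"].str.split().str[0]

# 3. Extract the year from the datetime column.
df["year"] = df["date"].dt.year

# 4. Keep only rows dated on the 15th of a month.
df = df[df["date"].dt.day == 15].reset_index(drop=True)
print(df)
```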

In [370]:
data = ...

In [371]:
...

In [372]:
...

In [373]:
...

Unnamed: 0,date,currency,rate,country
0,2022-12-30,Canadian dollar,1.444,Canadian
1,2022-12-29,Canadian dollar,1.4475,Canadian
2,2022-12-28,Canadian dollar,1.4361,Canadian
3,2022-12-27,Canadian dollar,1.4384,Canadian
4,2022-12-23,Canadian dollar,1.4433,Canadian
...,...,...,...,...
37249,1999-01-08,US dollar,1.1659,US
37250,1999-01-07,US dollar,1.1632,US
37251,1999-01-06,US dollar,1.1743,US
37252,1999-01-05,US dollar,1.179,US


In [374]:
...

In [375]:
...

### Canada Data
Create a new dataframe that only keeps the Canadian items.

In [376]:
data_canada = ...

In [377]:
data_canada.shape

(206, 5)

In [1]:
data_canada.head()


# Visualization Tasks 

## Line Chart

The line chart (also called a line graph) was created by William Playfair. It encodes data as a series of data points connected by straight line segments.

Let's start by creating a line chart for Canada.
Use the `x` channel to encode the temporal field/attribute and the `y` channel to encode the rate.

In [379]:
alt.Chart(data_canada)

Altair allows you to include a point for each value by adding the `point` property to `mark_line`.
Set the `point` property to `True`.
    

In [380]:
...

Alternatively, you can customize the appearance of the points with `OverlayMarkDef`, e.g.,
`point=alt.OverlayMarkDef(color="black")`.

In [381]:
...

Altair allows you to change the way marks are connected. Before we do that, let's use Altair to filter the data so that it only shows what was happening in 2002 and 2004.
Instead of using a Vega expression (with `datum`), use a field predicate.
See https://altair-viz.github.io/user_guide/transform/filter.html

In [382]:
...

Let's look at changing the interpolation.
The default interpolation for `mark_line` and `mark_area` is **linear**.
The [API](https://altair-viz.github.io/user_guide/generated/core/altair.Interpolate.html) has a full listing of interpolate options. Explore the options before continuing to the next section.


In [383]:
...

### Axis Formatting
Let's format our x-axis so that it doesn't include unnecessary detail.
Because we only kept the data collected on the 15th of each month, we don't need to show the day or the time.
See https://altair-viz.github.io/user_guide/transform/timeunit.html#user-guide-timeunit-transform for information on time unit transforms.

In [384]:
...

We can also customize the labels on the axis. 
See https://altair-viz.github.io/user_guide/generated/core/altair.Axis.html for some examples

In [386]:
...

## Multi-Line Chart
Let's go back to our original dataset stored in `data` and show the exchange rates for the countries in the dataset.
With `mark_line` you can encode temporal data for multiple countries at the same time.
We will encode country with the color channel.



In [387]:
...

It is hard to see what is happening for Canada, the US, and the UK because of the higher rates for India. Let's use a filter so that we only show Canada, the US, and the UK.

In [388]:
...

## Area Marks

The `area` mark type combines aspects of `line` and `bar` marks: it visualizes connections (slopes) among data points, but also shows a filled region, with one edge defaulting to a zero-valued baseline.

### Area Chart
The area chart is similar in function to the line chart.

In [389]:
...

## Stacked Area Chart
We can create a stacked area chart by using the original dataset and encoding country with color.

In [394]:
...

## Normalized Stacked Area Chart
Does this even make sense for this data? Why or why not?

In [398]:
...

Many of the properties we customized for `mark_line` exist for `mark_area` as well.

## Rect Marks
The last mark that we will use is `mark_rect`.
It is typically used to create heatmaps.
The term heatmap was assigned to this visual representation in the early 1990s and was widely used in the financial industry to depict cyclic time-varying data.
A heatmap is basically a matrix or table in which each cell uses color to encode a numerical value.

### Energy Data
We will be visualizing a subset of Mike Bostock's energy consumption data for 2019.
To get a sense of the data, please skim the [visualization](https://observablehq.com/@mbostock/electric-usage-2019) he created.

In [399]:
path = 'data/energy_usage.csv'
data = pd.read_csv(path)
data.head(10)

Unnamed: 0,date,usage
0,2019-01-01T08:00Z,1.88
1,2019-01-01T09:00Z,2.69
2,2019-01-01T10:00Z,1.73
3,2019-01-01T11:00Z,1.6
4,2019-01-01T12:00Z,3.24
5,2019-01-01T13:00Z,2.0
6,2019-01-01T14:00Z,3.33
7,2019-01-01T15:00Z,3.79
8,2019-01-01T16:00Z,1.55
9,2019-01-01T17:00Z,-0.85


### TimeUnit Transforms

Here are excerpts from the API about [Times and Dates](https://altair-viz.github.io/user_guide/times_and_dates.html?highlight=time)

> Altair is designed to work best with Pandas timeseries. A standard timezone-agnostic date/time column in a Pandas dataframe will be both interpreted and displayed as local user time.
For date-time inputs like these, it can sometimes be useful to extract particular time units (e.g. hours of the day, dates of the month, etc.).
In Altair, this can be done with a time unit transform, discussed in detail in [TimeUnit Transform](https://altair-viz.github.io/user_guide/transform/timeunit.html#user-guide-timeunit-transform).

We will provide some examples, but strongly recommend that you consult the API.

For example, we might decide we want a heatmap with hour of the day on the x-axis and day of the month on the y-axis.
Let's start off by encoding the month for each data item with the `x` channel.

In [400]:
alt.Chart(data).mark_rect().encode(
    alt.X('month(date)')
)

It is very hard to see each individual rectangle. All we can surmise from this visual representation is that the dataset includes energy data for the first 7 months of the year.
Let's use color to encode the energy usage for each month. Note that you should use the `sum` aggregate to compute the total usage for each month.

In [401]:
...

Now we can see individual rectangles. Because Mike has solar panels, the energy consumption decreases as we proceed through the year.
In this visualization we are addressing a sequential task. Let's transition to cyclic time-varying tasks.

### Heatmap
Let us answer the question _What does the energy consumption look like for each day of the week?_
To create a heatmap, let's use the `y` channel to encode the day of the week.

In [402]:
...

Notice how both the `x` and `y` channels use the same attribute/field of our dataset, **date**.
The TimeUnit transform extracts the relevant aspects from the datum.
What if we wanted to ask the question, _what time of the day has the highest or lowest energy usage?_
Let's use `y` to encode the month and `x` to encode the time of the day.


In [403]:
...

It is worth mentioning that the color channel is encoding an aggregation.
It is not encoding the individual energy usage for a specific day.
Change the aggregation from `sum` to `average`. What differences do you observe?
Remove the aggregation. What is being depicted?
We can ask the question again, _what time of day has the highest or lowest energy usage?_, but this time let us aggregate the data by day of the week as opposed to month.

In [404]:
...

Let's visualize the entire dataset.
Use the `y` channel to encode the date and the `x` channel to encode the time of day.


In [405]:
...

Do you notice the date that has no data? Go to [Mike's post](https://observablehq.com/@mbostock/electric-usage-2019) to find out why.
This is a big plot, so let's switch the data being encoded on the `x` and `y` channels and make the chart smaller.
Let's rename the axis titles as well and add a title for the chart.

In [331]:
...

What do you observe?
Note that you can customize each rectangle and set its size. If you go that route, you have to play around with resizing the chart to make sure that there is no blank space.
