# Marks for Temporal Data

## Learning Goals
Those who actively work through this notebook will be able to:
 - Perform various data wrangling tasks to format datasets
 - Use graphical marks to create common visualizations
 - Create visualizations for sequential temporal tasks (e.g., line, area, and stacked area charts)
 - Create visualizations for cyclic temporal tasks (e.g., heatmaps)

## Data Wrangling: Currency Exchange Data
We will be using a Kaggle dataset of daily exchange rates per euro from 1999 to 2023. For more information about the data, visit https://www.kaggle.com/datasets/lsind18/euro-exchange-daily-rates-19992020


In [1]:
import pandas as pd
import altair as alt


### Load the data from the csv file
Remember to parse the 'Date' column as a datetime.

In [2]:
data = pd.read_csv('euro-daily-hist_1999_2022.csv', parse_dates=['Date'])

### Basic Data Wrangling Task
1. What is the size of the dataset?
2. What are the column names?
3. Is the data in an appropriate form for us to encode it with Altair?


In [3]:
data.shape

(6209, 7)

In [4]:
data.head()

Unnamed: 0,Date,Canadian dollar,Chinese yuan renminbi,UK pound sterling,Indian rupee,Mexican peso,US dollar
0,2022-12-30,1.444,7.3582,0.88693,88.171,20.856,1.0666
1,2022-12-29,1.4475,7.4151,0.88549,88.2295,20.651,1.0649
2,2022-12-28,1.4361,7.4224,0.88058,88.0943,20.6856,1.064
3,2022-12-27,1.4384,7.3994,0.88333,88.0808,20.5515,1.0624
4,2022-12-23,1.4433,7.4198,0.8803,87.958,20.7115,1.0622


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6209 entries, 0 to 6208
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Date                    6209 non-null   datetime64[ns]
 1   Canadian dollar         6209 non-null   object        
 2   Chinese yuan renminbi   5941 non-null   object        
 3   UK pound sterling       6209 non-null   object        
 4   Indian rupee            5941 non-null   object        
 5   Mexican peso            6209 non-null   object        
 6   US dollar               6209 non-null   object        
dtypes: datetime64[ns](1), object(6)
memory usage: 339.7+ KB


In [6]:
data.columns

Index(['Date', 'Canadian dollar ', 'Chinese yuan renminbi ',
       'UK pound sterling ', 'Indian rupee ', 'Mexican peso ', 'US dollar '],
      dtype='object')
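
The `data.columns` output above shows trailing spaces in the currency names (for example `'Canadian dollar '`). If you want clean names, one option is to strip the whitespace. This is a minimal sketch on a small hypothetical frame that reuses two rows from the `head()` output; it is not a step the notebook itself performs.

```python
import pandas as pd

# Hypothetical two-row frame; column names copied from the output above,
# including their trailing spaces
df = pd.DataFrame({
    'Date': pd.to_datetime(['2022-12-30', '2022-12-29']),
    'Canadian dollar ': [1.444, 1.4475],
    'US dollar ': [1.0666, 1.0649],
})

# Remove leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()
print(list(df.columns))
```

The later steps in this notebook still work without this cleanup, because they only use the first word of each currency name.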

### Data Wrangling - Making It Tidy
Switch the data from a wide format to a long format, where each row depicts a single item.
Use `melt` to do this, setting `var_name` to 'currency' and `value_name` to 'rate'.

    

In [7]:
data.melt()

Unnamed: 0,variable,value
0,Date,2022-12-30 00:00:00
1,Date,2022-12-29 00:00:00
2,Date,2022-12-28 00:00:00
3,Date,2022-12-27 00:00:00
4,Date,2022-12-23 00:00:00
...,...,...
43458,US dollar,1.1659
43459,US dollar,1.1632
43460,US dollar,1.1743
43461,US dollar,1.179


In [8]:
data = data.melt(id_vars="Date", var_name='currency', value_name='rate')
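
The effect of `id_vars` is easier to see on a small frame. The sketch below uses a hypothetical two-row, two-currency frame (values copied from the `head()` output above) and compares `melt()` with and without `id_vars`.

```python
import pandas as pd

# Hypothetical wide-format sample: one row per date, one column per currency
wide = pd.DataFrame({
    'Date': pd.to_datetime(['2022-12-30', '2022-12-29']),
    'Canadian dollar': [1.444, 1.4475],
    'US dollar': [1.0666, 1.0649],
})

# Without id_vars, every column (including Date) is stacked into 'value'
stacked_everything = wide.melt()

# With id_vars, Date is kept as an identifier and only the currencies are stacked
long = wide.melt(id_vars='Date', var_name='currency', value_name='rate')

print(stacked_everything.shape)
print(long)
```

Each row of `long` is now one (date, currency, rate) observation, which is the shape Altair encodings expect.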

### Basic Data Wrangling Task
1. What is the size of the dataset?
2. What are the column names?
3. Are there any items with NaN values? If so, remove them.
4. Is the data in an appropriate form for us to encode it with Altair?


In [9]:
data

Unnamed: 0,Date,currency,rate
0,2022-12-30,Canadian dollar,1.444
1,2022-12-29,Canadian dollar,1.4475
2,2022-12-28,Canadian dollar,1.4361
3,2022-12-27,Canadian dollar,1.4384
4,2022-12-23,Canadian dollar,1.4433
...,...,...,...
37249,1999-01-08,US dollar,1.1659
37250,1999-01-07,US dollar,1.1632
37251,1999-01-06,US dollar,1.1743
37252,1999-01-05,US dollar,1.179


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37254 entries, 0 to 37253
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Date      37254 non-null  datetime64[ns]
 1   currency  37254 non-null  object        
 2   rate      36718 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 873.3+ KB


In [11]:
data.columns

Index(['Date', 'currency', 'rate'], dtype='object')

In [12]:
data.shape

(37254, 3)
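
The `info()` output above shows that `rate` has fewer non-null values than there are rows, and that its dtype is `object` rather than a number. Here is a minimal sketch of one way to handle both, on a hypothetical sample. The `'-'` placeholder is an assumption about how non-numeric entries might appear in the file; check your own data before relying on it.

```python
import pandas as pd

# Hypothetical long-format sample; '-' and None stand in for missing rates (assumption)
sample = pd.DataFrame({
    'Date': pd.to_datetime(['2022-12-30', '2022-12-29', '2022-12-28']),
    'currency': ['US dollar', 'US dollar', 'Indian rupee'],
    'rate': ['1.0666', '-', None],
})

# Convert to numbers; anything that cannot be parsed becomes NaN
sample['rate'] = pd.to_numeric(sample['rate'], errors='coerce')

# Drop the rows whose rate is missing
sample = sample.dropna(subset=['rate']).reset_index(drop=True)

print(sample)
print(sample['rate'].dtype)
```

Applied to the notebook's frame, the same two calls would be used on `data['rate']`.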

### Additional Wrangling Steps
1. Make sure all column names start with lowercase letters
2. Create a new column called country that holds the first word of each currency name,
e.g., 'Canadian dollar' should be split so that the country column contains only 'Canadian'
3. Add a year column that has the year for each item. Hint: see Datetime in Pandas (e.g., pd.DatetimeIndex(df['date']).year)
4. Only keep data that was collected on the 15th of each month (filter out all other dates)

In [13]:
data = data.rename(columns={'Date':'date'})

In [14]:
new = data['currency'].str.split(" ", n = 1, expand = True)

In [15]:
data['country'] = new[0]

In [16]:
data

Unnamed: 0,date,currency,rate,country
0,2022-12-30,Canadian dollar,1.444,Canadian
1,2022-12-29,Canadian dollar,1.4475,Canadian
2,2022-12-28,Canadian dollar,1.4361,Canadian
3,2022-12-27,Canadian dollar,1.4384,Canadian
4,2022-12-23,Canadian dollar,1.4433,Canadian
...,...,...,...,...
37249,1999-01-08,US dollar,1.1659,US
37250,1999-01-07,US dollar,1.1632,US
37251,1999-01-06,US dollar,1.1743,US
37252,1999-01-05,US dollar,1.179,US


In [17]:
data['year'] = pd.DatetimeIndex(data['date']).year

In [18]:
data = data[data['date'].dt.day == 15]

### Canada Data
Create a new dataframe that only keeps the Canadian items

In [19]:
data_canada = data.loc[(data['country'] == 'Canadian')]

In [20]:
data_canada.shape

(206, 5)

In [21]:
data_canada.head()

Unnamed: 0,date,currency,rate,country,year
10,2022-12-15,Canadian dollar,1.4443,Canadian,2022
32,2022-11-15,Canadian dollar,1.3816,Canadian,2022
75,2022-09-15,Canadian dollar,1.3172,Canadian,2022
98,2022-08-15,Canadian dollar,1.3167,Canadian,2022
119,2022-07-15,Canadian dollar,1.3147,Canadian,2022
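
Notice that the `head()` output above has no row for 2022-10-15. A quick check suggests why: the 15th of that month fell on a weekend, and the source has no rate for that day. This is a small sketch using only pandas.

```python
import pandas as pd

# The date that is absent from the data_canada.head() output
missing_day = pd.Timestamp('2022-10-15')

# day_name() returns the English weekday name by default
print(missing_day.day_name())
```

This is also why `data_canada` has 206 rows rather than one row for every month in the period.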


# Visualization Tasks 

## Line Chart

The line chart (also called a line graph) was created by William Playfair. It encodes data as a series of data points connected by straight line segments.

Let's start by creating a line chart for Canada.
Use the `x` channel to encode the temporal field/attribute and the `y` channel to encode the rate.

In [22]:
alt.Chart(data_canada).mark_line().encode(
    alt.X('date:T'),    # specifying the type is optional here; try ordinal (:O) or quantitative (:Q) instead
    alt.Y('rate:Q'),    # notice what happens when you hardcode the data type
)

 Altair allows you to include points for each value by adding the `point` property for mark_line.
 Set the `point` property to `True`
    

In [23]:
alt.Chart(data_canada).mark_line(point=True).encode(
    alt.X('date'),
    alt.Y('rate:Q'),
)

Alternatively, you can customize the appearance of the points with `OverlayMarkDef`.
`point=alt.OverlayMarkDef(color="black")`

In [24]:
alt.Chart(data_canada).mark_line(
    point=alt.OverlayMarkDef(color="black")
).encode(
    alt.X('date'),
    alt.Y('rate:Q'),
)

Altair allows you to change the way marks are connected. Before we do that, let's use Altair to filter the data so that it only shows what was happening from 2002 to 2004.
Instead of using a Vega expression (with `datum`), use the field predicates.
https://altair-viz.github.io/user_guide/transform/filter.html 

In [25]:
alt.Chart(data_canada).mark_line(point=True).encode(
    alt.X('date:N'),
    alt.Y('rate:Q'),
).transform_filter(
    # Alternatives to try:
    #   alt.FieldEqualPredicate(field='year', equal=2002)
    #   alt.FieldOneOfPredicate(field='year', oneOf=[2002, 2003, 2004])
    #   (alt.datum.year == 2002) | (alt.datum.year == 2003)   # combine expressions with |, not `or`
    alt.FieldRangePredicate(field='year', range=[2002, 2004])
)

Let's look at changing the interpolation.
The default interpolation for `mark_line` and `mark_area` is **linear**.
The [API](https://altair-viz.github.io/user_guide/generated/core/altair.Interpolate.html) has a full listing of interpolate options. Explore the options before continuing to the next section.


In [26]:
alt.Chart(data_canada).mark_line(
    point= True,
    interpolate='step-before',  # other options include basis, basis-open, catmull-rom, step, natural, cardinal-open; we will discuss these later
).encode(
    alt.X('date:N'),
    alt.Y('rate:Q'),
).transform_filter(
    alt.FieldRangePredicate(field='year', range=[2002, 2004])
)

### Axis Formatting
Let's format our x-axis so that it doesn't include unnecessary detail.
Since we only kept the values recorded on the 15th of each month, we don't need to show the day or time.
See https://altair-viz.github.io/user_guide/transform/timeunit.html#user-guide-timeunit-transform for information on time unit transformations.

In [27]:
alt.Chart(data_canada).mark_line(
    point= True,
    interpolate='linear',
).encode(
    alt.X('yearmonth(date):T'),
    alt.Y('rate:Q'),
).transform_filter(
    alt.FieldRangePredicate(field='year', range=[2002, 2004])
).properties(
    width=600
)
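
To get a feel for what `yearmonth(date)` does, here is a rough pandas analogue: keep the year and month and reset the day to the start of the month. This is only a sketch of the idea, not how Altair or Vega-Lite implements time units. The dates are taken from the `data_canada.head()` output.

```python
import pandas as pd

# Three dates copied from the data_canada.head() output
dates = pd.Series(pd.to_datetime(['2022-12-15', '2022-11-15', '2022-09-15']))

# Truncate each date to its month: convert to a monthly period, then back to
# a timestamp at the start of that period
year_month = dates.dt.to_period('M').dt.to_timestamp()

print(year_month.dt.strftime('%Y-%m-%d').tolist())
```

Because every value in a month collapses to the same timestamp, the axis only needs year and month labels.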

We can also customize the labels on the axis. 
See https://altair-viz.github.io/user_guide/generated/core/altair.Axis.html for some examples

In [28]:
alt.Chart(data_canada).mark_line(
    point= True,
    interpolate='linear',
).encode(
    alt.X('yearmonth(date):T',
         axis = alt.Axis(labelAngle = 45, title = 'Date (Year and Month)')),
        
    alt.Y('rate:Q', axis = alt.Axis(title = "Rate of Exchange Canadian Dollars to Euros")),
).transform_filter(
    alt.FieldRangePredicate(field='year', range=[2002, 2004])
).properties(
    width=600
)

## Multi-Line Chart
Let's go back to our original dataset stored in `data` and show the exchange rate for the countries in the dataset.
With `mark_line` you can encode temporal data for multiple countries at the same time.
We will encode country with the color channel.



In [29]:
alt.Chart(data).mark_line().encode(
    alt.X('date'),
    alt.Y('rate:Q'),
    alt.Color('country:N')
)

It is hard to see what is happening for Canada, the US, and the UK because of the higher rates for India. Let's use a filter so that we only show Canada, the US, and the UK.

In [30]:
alt.Chart(data).mark_line().encode(
    alt.X('date:T'),
    alt.Y('rate:Q'),
    alt.Color('country:N')
).transform_filter(
   alt.FieldOneOfPredicate(field='country', oneOf = ['Canadian', 'UK', 'US'])
)

## Area Marks

The `area` mark type combines aspects of `line` and `bar` marks: it visualizes connections (slopes) among data points, but also shows a filled region, with one edge defaulting to a zero-valued baseline.

### Area Chart
The area chart is similar in function to the line chart.

In [31]:
alt.Chart(data_canada).mark_area().encode(
    alt.X('date'),
    alt.Y('rate:Q')
)

## Stacked Area Chart
We can create a stacked area chart by using the original dataset and encoding country with color.

In [32]:
alt.Chart(data).mark_area().encode(
    alt.X('date:T'),
    alt.Y('rate:Q'),
    alt.Color('country:N')   # try changing to Ordinal (:O): does this make sense? Why or why not?
).transform_filter(
   alt.FieldOneOfPredicate(field='country', oneOf = ['Canadian', 'UK', 'US'])
)

## Normalized Stacked Area Chart
Does this even make sense for this data? Why or why not?

In [33]:
alt.Chart(data).mark_area().encode(
    alt.X('date:T'),
    alt.Y('rate:Q', stack='normalize'),
    alt.Color('country:N')   # try changing to Ordinal (:O): does this make sense? Why or why not?
).transform_filter(
   alt.FieldOneOfPredicate(field='country', oneOf = ['Canadian', 'UK', 'US'])
)
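
To make concrete what `stack='normalize'` computes, here is a minimal pandas sketch for a single date: each rate is divided by the sum of the rates on that date, so the stacked values add up to 1. The three rates are copied from the 2022-12-30 row of the `data.head()` output. The `share` column name is introduced here for illustration.

```python
import pandas as pd

# Rates for one date (2022-12-30), copied from the data.head() output
day = pd.DataFrame({
    'country': ['Canadian', 'UK', 'US'],
    'rate': [1.444, 0.88693, 1.0666],
})

# Normalizing: each value becomes its fraction of the total for that date
day['share'] = day['rate'] / day['rate'].sum()

print(day)
print(day['share'].sum())
```

Looking at what these shares represent (a fraction of a sum of three unrelated exchange rates) may help in answering the question posed for this chart.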

Many of the properties we customized for `mark_line` exist for `mark_area` as well.

The Heatmap sections can be found in Tooling 6 on the course website 
https://pages.github.ubc.ca/kemiola/DSCI320-22W2/lectures/6_Marks_4_Temporal.html#rect-marks 