Pandas Visualization
------------------------------

We will often be working with datasets that are far larger than only a few lines.  Having more data is very useful, but
it can be very hard for us to find patterns by looking at the raw content.  In this notebook I will demonstrate how pandas provides
some built-in and very powerful visualization tools (wrapping matplotlib), as well as the extra support that it provides for time
series data.  

To start, as always, we are going to load our modules. 

In [1]:
import pandas as pd
import numpy as np

from ml_course.util.downloader import download_data
%matplotlib inline

For this notebook we are going to use the data that we cleaned in the last notebook. 
Luckily, we already know the location of the file.  

Remember that we added a few other columns, namely `ns` and `ew`.  

In [2]:
import os
filename = os.path.join('..', '.data', 'austin_cleaned.csv')

Alright, now that we know where our data lives, let's go ahead and load it into a dataframe so that we
can inspect the first few rows.  

In [3]:
df = pd.read_csv(filename, index_col='date', parse_dates=True)
df.head(5)

FileNotFoundError: File b'../.data/austin_cleaned.csv' does not exist

This time, when loading our data, I have set the index column to be the `date` column from our dataset
and asked pandas to parse it as dates.  This means that we now have a timeseries that we can use
to plot things.  
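
To make the effect of `index_col` and `parse_dates` concrete, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file.  The rows and values below are invented for illustration, and only a few of our columns are included.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned CSV; every value here is made up.
csv_text = """date,location,nb,sb,total
2017-01-01,A,10,12,22
2017-01-02,A,11,9,20
2017-02-01,B,7,8,15
"""

demo_df = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# The index is now a DatetimeIndex rather than a column of plain strings.
print(type(demo_df.index).__name__)

# A DatetimeIndex supports partial-string selection, e.g. all rows in January 2017.
january = demo_df.loc["2017-01"]
print(january)
```

It is this date-aware index that the resampling and rolling operations later in the notebook rely on.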

Now there are some columns that we don't want to use at this time, namely the `location`, `ns` and `ew` columns,
so we will start by creating our `plot_df` that contains only the columns we care about. 

In [None]:
plot_df = df.loc[:, 'nb':'total']
plot_df.head(5)
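
One detail worth knowing about the slice above: label-based slices with `.loc` include both endpoints.  The sketch below uses a made-up frame; the column order (and the `sb` and `eb` names) are my assumption about what the cleaned file looks like, not something confirmed here.

```python
import pandas as pd

# Hypothetical frame whose column order mimics the dataset described above.
toy = pd.DataFrame(
    {
        "location": ["A", "B"],
        "nb": [1, 2],
        "sb": [3, 4],
        "eb": [5, 6],
        "wb": [7, 8],
        "total": [16, 20],
        "ns": [4, 6],
        "ew": [12, 14],
    }
)

# Label slices in .loc include BOTH endpoints, unlike positional slices.
subset = toy.loc[:, "nb":"total"]
print(list(subset.columns))
```

Because the slice follows the column order, anything stored before `nb` or after `total` is left out.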

To plot data in pandas, we can use the built-in `plot` function that is provided by each dataframe.  By default it
creates a `line` plot that uses the index as the x axis and draws one line for each of the columns that are supplied.  

In [None]:
plot_df.plot()
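
As a small sketch on invented data, we can confirm that `plot()` hands back a matplotlib `Axes` with one line per column.  The `Agg` backend line is only there so the snippet also runs outside of a notebook; inside this notebook `%matplotlib inline` already handles rendering.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, only needed outside a notebook

import numpy as np
import pandas as pd

# Thirty days of made-up counts indexed by date.
idx = pd.date_range("2017-01-01", periods=30, freq="D")
toy = pd.DataFrame(
    {"nb": np.arange(30), "sb": np.arange(30) * 2, "total": np.arange(30) * 3},
    index=idx,
)

ax = toy.plot()  # default kind is a line plot, x axis is the index

# One line is drawn per column, labelled with the column name.
line_labels = [line.get_label() for line in ax.get_lines()]
print(line_labels)

# The returned Axes can be customised with regular matplotlib calls.
ax.set_ylabel("vehicles (made-up units)")
```

Since the return value is a plain matplotlib object, anything you already know about matplotlib still applies after the pandas call.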

Now this line chart is a bit noisy, so instead we can resample the data based on the date value. Since we decided to use our `date` column
as the index for the rows, this lets us tap into the power of the timeseries.  

To resample a dataframe, we just need to use the `.resample(rule)` function.  This function works like a
group-by on the date: it collects the rows into fixed time bins, and we then pick how to aggregate each bin (here with `.sum()`).
For example we could resample to a month boundary 'M' or a year 'Y' or even a week 'W'.  Note that recent pandas
releases (2.2 and later) prefer the spellings 'ME' and 'YE' and warn when 'M' or 'Y' is used.  

So let's run the plot again, except this time we will resample on a yearly basis.  

In [None]:
plot_df.resample('Y').sum().plot()
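
To show that resampling really is a date-based group-by, here is a minimal sketch on made-up daily data.  I use the `'YS'` (year start) alias rather than `'Y'` because, as far as I know, it is accepted by both older and newer pandas releases; the only difference from `'Y'` is that each bin is labelled with the first day of the year instead of the last.

```python
import numpy as np
import pandas as pd

# Two years of invented daily counts: exactly one vehicle per day.
idx = pd.date_range("2016-01-01", "2017-12-31", freq="D")
toy = pd.DataFrame({"nb": np.ones(len(idx), dtype=int)}, index=idx)

# Bin the rows by calendar year and add up each bin.
yearly = toy.resample("YS").sum()
print(yearly)

# The same totals from an explicit group-by on the year of the index.
by_year = toy.groupby(toy.index.year).sum()
print(by_year)
```

The two results hold the same numbers; `resample` just keeps a proper date index, which is what lets the plot label the x axis with dates.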

**WARNING:** Before we go much further we have to consider whether we are looking at this data correctly.  If you remember, our dataset
also included a location column.  This location column defined the area that the measurements were taken in.  If we disregard
that column then we are looking at metrics that could be taken from anywhere in the county, which would not be an adequate dataset to
use to really determine if traffic is changing.  

However, I am going to continue to demonstrate how to use the plotting tools using this dataset rather than introduce a new
dataset to everyone at this time.  I just want to make sure that it is understood we are demonstrating the tooling, not actually
trying to use the data to find trends.  

Above we created a line chart.  We can also create a bar chart very easily by using the `bar` command on the `plot` function.  A bar
plot is really nice, especially stacked.  However, the total represents the sum of all the other values, so stacking it on top of them
would count everything twice.  At this point let's remove it from our dataframe and render the bar graph stacked.  

In [None]:
plot_df = plot_df.loc[:, 'nb':'wb']
plot_df.resample('Y').sum().plot.bar(stacked=True, figsize=(10, 5))
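
The reason for dropping `total` can be checked with plain arithmetic.  In a stacked bar chart the height of each bar is the sum of the plotted columns, so the sketch below (with invented yearly numbers) compares that height with and without the `total` column.

```python
import pandas as pd

# Made-up yearly sums; total is constructed as nb + sb.
yearly = pd.DataFrame(
    {"nb": [10, 20], "sb": [30, 40], "total": [40, 60]},
    index=[2016, 2017],
)

# Height of each stacked bar = sum across the plotted columns.
height_with_total = yearly.sum(axis=1)
height_without_total = yearly.drop(columns="total").sum(axis=1)

print(height_with_total.tolist())
print(height_without_total.tolist())
```

With `total` left in, every bar comes out twice as tall as the traffic it is supposed to represent.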

If you are a person that really likes pie charts, we can do that as well, based on the sums of the different directions as they compare
to the total sum.  

In [None]:
plot_df.sum().plot.pie()
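
For anyone who wants the numbers behind the slices: `plot_df.sum()` collapses the dataframe to one value per column, and the pie draws each value as a share of their combined sum.  A minimal sketch with invented counts (the `sb` and `eb` column names are assumed):

```python
import pandas as pd

# Made-up counts for four directions.
toy = pd.DataFrame({"nb": [10, 30], "sb": [20, 20], "eb": [5, 5], "wb": [5, 5]})

column_sums = toy.sum()  # a Series with one entry per column
shares = column_sums / column_sums.sum()  # fraction of the pie per direction

print(column_sums.to_dict())
print(shares.round(2).to_dict())
```

Passing `autopct='%1.0f%%'` through to `plot.pie()` should print these percentages directly on the slices, since pandas forwards extra keyword arguments to matplotlib.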

These are some of the common plotting tools that are available to us.  There are a lot more, including:

- `plot.area()`
- `plot.barh()`
- `plot.hist()`
- `plot.scatter()`
- `plot.density()`

Now let's get back to working with our timeseries data, to see what changes we can make there.  

First let's adjust our timeseries to be sampled on a monthly basis, but with a rolling window of 12 months (one year).  

In [None]:
plot_df.resample('M').sum().rolling(12).sum().plot(figsize=(10, 5), alpha=0.9)

Rolling is an interesting plot in that each point is the sum of the 12 months ending at that point (the current month plus the
previous 11), so every value is a trailing yearly total.  This can be useful in seeing growing or declining trends in the data.  
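
Here is a small sketch of what `rolling(12).sum()` computes, using an invented monthly series that simply counts 1, 2, 3 and so on.  One thing it makes visible is that the first 11 points are empty, because a full 12-month window is not available yet.

```python
import pandas as pd

# Fifteen months of made-up data: 1, 2, ..., 15.
idx = pd.date_range("2016-01-01", periods=15, freq="MS")
monthly = pd.Series(range(1, 16), index=idx, name="total")

rolling_sum = monthly.rolling(12).sum()

# The first 11 entries are NaN: the window is not full yet.
n_missing = int(rolling_sum.isna().sum())

# 12th entry: 1 + 2 + ... + 12
first_full = rolling_sum.iloc[11]
# 13th entry: the window slides forward, so 2 + 3 + ... + 13
second_full = rolling_sum.iloc[12]

print(n_missing, first_full, second_full)
```

This is why the rolling plot starts about a year later than the raw data does.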

Finally, for our last plot we are going to pivot the data, with one row per calendar month and one column per year, to see how the
totals work out on a monthly basis over the years.  This can help us to see if there might be more traffic occurring in summer months
vs winter months or holiday seasons.   

In [None]:
# Rows are calendar months, columns are years, values come from the total column.
monthly_by_year = df[['total']].pivot_table('total', index=df.index.month, columns=df.index.year)
monthly_by_year.plot.bar(stacked=True, alpha=0.5, figsize=(10, 5), legend=False)
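
To see the shape of the table that feeds this plot, here is a minimal sketch on made-up daily data (one vehicle per day).  It also shows something that is easy to miss: `pivot_table` averages the values in each cell by default, so getting actual per-month totals requires passing `aggfunc="sum"`.

```python
import numpy as np
import pandas as pd

# Two years of invented daily data, exactly 1.0 per day.
idx = pd.date_range("2016-01-01", "2017-12-31", freq="D")
toy = pd.DataFrame({"total": np.ones(len(idx))}, index=idx)

# Default aggregation is the mean of each (month, year) cell.
mean_table = toy.pivot_table("total", index=toy.index.month, columns=toy.index.year)

# Ask for sums explicitly to get per-month totals instead.
sum_table = toy.pivot_table(
    "total", index=toy.index.month, columns=toy.index.year, aggfunc="sum"
)

print(mean_table.shape)  # 12 months by 2 years
print(sum_table.loc[2])  # February: 29 days in 2016, 28 in 2017
```

Which aggregation is appropriate depends on the question: the mean is not affected by months with missing days, while the sum answers "how much traffic in total".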

## References

There is a wonderful article and short YouTube video series by Jake VanderPlas that can be found 
[here](http://jakevdp.github.io/blog/2017/03/03/reproducible-data-analysis-in-jupyter/).  It is a great
dive into some of these details with a dataset that better fits this representation.  