## Parallel Dataframes

When an application fits a "big data" collection well,
parallel computing looks a lot like ordinary sequential computing.

### Objectives

*  Observe that parallel collections are useful when the problem fits them well

### Requirements

*  Dask.dataframe

In [None]:
import dask.dataframe as dd
df = dd.read_csv('../data/minute/aig/2012-*.csv', parse_dates=['timestamp'])
df

We can inspect the dataframe with standard methods such as `head`, which computes immediately:

In [None]:
df.head()

Other operations do not evaluate until `.compute()` is called:

In [None]:
df.describe().compute()

In [None]:
%matplotlib inline

We can chain these operations with properties and methods that give new DataFrames or Series:

In [None]:
(df.groupby(df.timestamp.dt.date)
   .close
   .mean()
   .compute()
   .plot(figsize=(16, 4))
)

### How Big Data Collections Differ

Collections like Dask.dataframe, Dask.array, or Spark.dataframe often copy their interfaces from previously established sequential libraries like Pandas and NumPy, or from SQL databases.  However, they are rarely comprehensive and often lack features that users expect from the sequential libraries.  This is natural: parallel computing is harder than sequential programming, and operations like `sort` and individual data insertion are hard to do in parallel or massively distributed contexts.

### Exercises

Use previous Pandas knowledge (or ask a neighbor) to compute the following:

1.  What was the maximum value of the stock `aig` in 2012?
2.  On what date did this occur?
3.  On what day did `aig` reach its maximum value over all of the data?
4.  Set the index of the `aig` 2012 dataset to the timestamp.  Inspect the head and tail to ensure that this worked.  Use `loc` to pull out time slices.
5.  Having a datetime index enables more datetime functionality.  Resample the dataset by hour, then try resampling it by week.  Did something go wrong?  If so, what?

In [None]:
# your code here

In [None]:
%load solutions/dataframes-1.py