### Working with distributed data

Let's see how to run some data analysis on a distributed dataset.

In [None]:
from dask.distributed import Client, progress
client = Client(n_workers=2, threads_per_worker=2, memory_limit='2GB')
client

Let's create a random timeseries dataset with the following attributes:

+ It stores one record per second (the default `freq`) between 2019-03-01 and 2019-04-30

+ It splits that range by day (the default `partition_freq`), keeping every day as a separate Pandas dataframe

+ Along with a datetime index it has columns for names, ids, and numeric values

This is a small dataset of about 480 MB. Increase the number of days or shorten the interval between records (the `freq` argument) to practice with a larger dataset.
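
To get a feel for how the date range and the sampling interval drive the dataset size, here is a rough sketch that only counts timestamps. It does not call Dask; `estimate_rows` is a hypothetical helper, and the exact row count Dask produces may differ slightly depending on how it treats the end date.

```python
from datetime import datetime, timedelta

def estimate_rows(start, end, interval):
    """Count how many `interval`-sized steps fit between two ISO dates."""
    span = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return span // interval

# Same date range as this notebook, at two illustrative sampling intervals
rows_1s = estimate_rows("2019-03-01", "2019-04-30", timedelta(seconds=1))
rows_10s = estimate_rows("2019-03-01", "2019-04-30", timedelta(seconds=10))

print(rows_1s)   # 5184000
print(rows_10s)  # 518400
```

A ten times shorter interval gives ten times as many rows, so memory use grows roughly in the same proportion.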

In [None]:
import dask
import dask.dataframe as dd
df = dask.datasets.timeseries(start='2019-03-01', end='2019-04-30')

In [None]:
df

In [None]:
df.dtypes

In [None]:
import pandas as pd
pd.options.display.precision = 2
pd.options.display.max_rows = 10

In [None]:
df.head(3)

In [None]:
df2 = df[df.y > 0]
df3 = df2.groupby('name').x.std()
df3

In [None]:
computed_df = df3.compute()
type(computed_df)

In [None]:
computed_df

### Persist data in memory
If your dataset fits in the available RAM, you can persist it in memory.

This lets later computations reuse the data already held in memory instead of recomputing it, which makes them much faster.

In [None]:
df = df.persist()

### Time Series Operations
Because we have a datetime index, time-series operations work efficiently.

In [None]:
%matplotlib inline

df[['x', 'y']].resample('1h').mean().head()

In [None]:
df[['x', 'y']].resample('24h').mean().compute().plot()

In [None]:
df[['x', 'y']].rolling(window='24h').mean().head()

In [None]:
df.loc['2019-04-05']

In [None]:
%time df.loc['2019-04-05'].compute()

### Set Index
Data is sorted by the index column.
This allows for faster access, joins, groupby-apply operations, etc. However, sorting data can be costly to do in parallel, so setting the index is important but should be done only infrequently.

In [None]:
df = df.set_index('name')
df

Again, because computing this dataset is expensive and we can fit it in our available RAM, we persist the dataset to memory.

In [None]:
df = df.persist()

Dask now knows where all data lives, indexed cleanly by name.
As a result, operations like random access are cheap and efficient.

In [None]:
%time df.loc['Alice'].compute()
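
One way to picture why the lookup is cheap: a sorted Dask dataframe tracks the boundary values between partitions (its divisions), so a key can be mapped to a single partition without scanning the others. The sketch below is a simplified model of that idea in plain Python, not Dask's actual implementation; `partition_for` and the division values are made up for illustration.

```python
from bisect import bisect_right

def partition_for(divisions, key):
    """Return the index of the partition whose range contains `key`.

    `divisions` has one more entry than there are partitions: partition i
    covers divisions[i] <= key < divisions[i + 1], and the last partition
    also includes its upper bound.
    """
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    n_partitions = len(divisions) - 1
    return min(bisect_right(divisions, key) - 1, n_partitions - 1)

# Hypothetical boundaries for three partitions sorted by name
divisions = ["Alice", "Frank", "Norbert", "Zelda"]

print(partition_for(divisions, "Alice"))   # 0
print(partition_for(divisions, "Ingrid"))  # 1
print(partition_for(divisions, "Zelda"))   # 2
```

The binary search touches only the list of boundaries, so its cost does not depend on how many rows each partition holds.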

### Groupby-Apply: a simple way to work on large datasets
Now that our data is sorted by name, we can easily do operations like random access on name, or groupby-apply with custom functions.

Here we train a different Scikit-Learn linear regression model on each name.

In [None]:
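from sklearn.linear_model import LinearRegression

def train(partition):
    # Fit y as a linear function of x for the rows of a single name
    est = LinearRegression()
    est.fit(partition[['x']].values, partition.y.values)
    return est

# meta=object tells Dask that each group yields an arbitrary Python object
df.groupby('name').apply(train, meta=object).compute()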
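
To check that a per-group fit behaves as expected, it can help to try the same pattern on a tiny pandas dataframe where the answer is known. The sketch below swaps in `np.polyfit` for Scikit-Learn's `LinearRegression` to keep dependencies small; the data, `toy` and `fit_slope` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: group "a" follows y = 2x + 1, group "b" follows y = -x
toy = pd.DataFrame({
    "name": ["a", "a", "a", "b", "b", "b"],
    "x": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "y": [1.0, 3.0, 5.0, 0.0, -1.0, -2.0],
})

def fit_slope(group):
    # np.polyfit with degree 1 returns (slope, intercept)
    slope, _intercept = np.polyfit(group["x"].to_numpy(), group["y"].to_numpy(), 1)
    return slope

# One fitted slope per name, returned as a Series indexed by name
slopes = toy.groupby("name")[["x", "y"]].apply(fit_slope)
print(slopes.round(6))
```

Each group recovers the slope it was built with (2 for "a", -1 for "b"), which is the behavior to expect from the per-name models above when the relationship is linear.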