# Data Analysis

Before you begin, make sure you have completed the steps in the [setup-external-connection.ipynb](setup-external-connection.ipynb) notebook.

## Load Data

At this point, we can load our dataset. In my case it is a set of just over 60 CSV files in an S3 bucket; together they hold more than 12 million rows of purchase records across 23 columns. The cell below reads a single file from that set.

I am loading directly from S3 into Dask. If you have a flat file in a local directory instead, you can load it with `pandas` here and then convert it to a Dask dataframe.

In [None]:
%%time
import os
import pandas as pd
import dask.dataframe as dd

In [None]:
import s3fs
s3 = s3fs.S3FileSystem(anon=True)
s3fpath = 's3://saturn-public-data/ia_data/ia_10.csv'

iowa = dd.read_csv(
    s3fpath,
    parse_dates = ['Date'],
    engine = 'python',
    dtype={'Zip Code': 'object'},
    on_bad_lines = 'skip',  # pandas >= 1.3; replaces error_bad_lines / warn_bad_lines
    storage_options={'anon': True},
    assume_missing=True
)

# Comment out below if using multiple files
iowa = iowa.repartition(npartitions = 4)

In [None]:
%%time
from dask.distributed import wait

iowa = iowa.persist()
_ = wait(iowa)
iowa.columns

## Run Analyses

To demonstrate work on the cluster, I'll run a couple of analyses that you might need for a business.

The first step toward efficient aggregations on a Dask dataframe is to **set the index of the dataframe**. This lets Dask organize the data that is partitioned across the cluster, while still keeping it distributed.

> This is sometimes a slow task, but it only needs to be done once.

In [None]:
%%time

iowa = iowa.set_index("Date")
iowa = iowa.persist()
_ = wait(iowa)


### Create a Rolling Sum

From here, we can treat the dataframe very much like a pandas dataframe, but it remains distributed.
We'll calculate a new series, which is the 30-day rolling sum of items sold (bottles), then shape it into a dataframe.

In [None]:
%%time

bottles_sold_roll = iowa['Bottles Sold'].rolling('30D').sum()
bottles_sold_roll = bottles_sold_roll.to_frame(name="bottles_sold_roll")
bottles_sold_roll = bottles_sold_roll.persist()

In [None]:
%%time

bottles_sold_roll.head()
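
To make the `'30D'` window concrete, here is a small pandas-only sketch on made-up dates. Each output row sums every row whose timestamp falls within the 30 days up to and including that row's own timestamp.

```python
import pandas as pd

# Made-up sales, indexed by date (the index must be sorted datetimes).
toy_sales = pd.DataFrame(
    {"Bottles Sold": [10, 5, 20, 1]},
    index=pd.to_datetime(["2020-01-01", "2020-01-15", "2020-02-05", "2020-03-20"]),
)

# Time-based window: for each row, sum the rows in the preceding 30 days.
toy_roll = toy_sales["Bottles Sold"].rolling("30D").sum()

print(toy_roll.tolist())  # [10.0, 15.0, 25.0, 1.0]
```

The third value is 25 rather than 35 because 2020-01-01 is more than 30 days before 2020-02-05, so it has dropped out of the window.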

### Group and Summarize

For a second example of calculations over the dataset on the cluster, I'll group by store and date, and calculate the store-level daily sales in dollars.

In [None]:
%%time

iowa['Sale (Dollars)'] = iowa['Sale (Dollars)'].str.lstrip('$').astype('float')

In [None]:
%%time

sum_store_sales = iowa.groupby(['Date', "Store Number"])["Sale (Dollars)"].sum()
sum_store_sales = sum_store_sales.to_frame(name="sum_store_sales")
sum_store_sales = sum_store_sales.persist()

In [None]:
%%time

sum_store_sales.head()
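
The two steps above (cleaning the dollar strings, then grouping) can be sketched on a few made-up rows in plain pandas. The store numbers and amounts here are illustrative only.

```python
import pandas as pd

# Made-up transactions with dollar amounts stored as strings.
toy_tx = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01"]),
    "Store Number": [2649, 2649, 2649, 3000],
    "Sale (Dollars)": ["$10.50", "$4.50", "$7.00", "$2.00"],
})

# Strip the leading "$" and convert to float so the values can be summed.
toy_tx["Sale (Dollars)"] = toy_tx["Sale (Dollars)"].str.lstrip("$").astype("float")

# One row per (date, store) pair, holding that store's total for the day.
toy_daily = (
    toy_tx.groupby(["Date", "Store Number"])["Sale (Dollars)"]
    .sum()
    .to_frame(name="sum_store_sales")
)

print(toy_daily)
```

Store 2649 has two sales on 2020-01-01, so its row for that day holds their combined total of 15.0.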

## Combine Dataframes

From here, you can optionally join those new columns back onto your existing data using the indices.

In [None]:
%%time

iowa_new = dd.concat([iowa, bottles_sold_roll], axis=1)
iowa_new = iowa_new.persist()
_ = wait(iowa_new)

In [None]:
%%time

iowa_final = iowa_new.merge(sum_store_sales, how="left",
                            on=['Date', "Store Number"])
iowa_final = iowa_final.persist()
_ = wait(iowa_final)
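
As a small illustration of what the left merge does, here is a pandas-only sketch on made-up rows. Unlike the cells above, it keeps `Date` as an ordinary column rather than the index, purely to keep the example short.

```python
import pandas as pd

# Made-up transactions.
toy_rows = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    "Store Number": [2649, 2649, 3000],
    "Sale (Dollars)": [10.5, 4.5, 7.0],
})

# Daily total per store, flattened back to ordinary columns.
toy_totals = (
    toy_rows.groupby(["Date", "Store Number"])["Sale (Dollars)"]
    .sum()
    .to_frame(name="sum_store_sales")
    .reset_index()
)

# A left merge keeps every transaction row and attaches the matching
# daily total, so the total repeats for rows sharing a date and store.
toy_joined = toy_rows.merge(toy_totals, how="left", on=["Date", "Store Number"])

print(toy_joined)
```

The row count stays the same as the original transactions, which is a useful sanity check after a merge like this.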

## View Data

If you examine this object, you see the structure of the dataframe (column names, dtypes, and partition count) but not its contents. That is because the data lives in partitions across the cluster rather than in the notebook's memory.

In [None]:
%%time

iowa_final

However, if we check the head of this object, we can see the actual values. This may take time, because part of the dataframe must be computed to show the values.

In [None]:
%%time
iowa_final.head()

In [None]:
%%time

iowa_final[iowa_final['Store Number'] == 2649].head()

In [None]:
len(iowa_final)

## Return to pandas

At this point, you can use this dataset for whatever next steps you have - that might include passing it to a machine learning workflow, for example.

If you need to use the data in a way that is not Dask compatible, and the data is small enough, you can return it to a pandas dataframe with this command. This can be slow, because every pending computation runs and the full result is consolidated into the Client environment.

> NOTE: If you loaded all 12 million rows of data, don't run this chunk! It will crash your kernel.

In [None]:
%%time

#iowa_pd = iowa_final.compute()
#type(iowa_pd)

## Housekeeping

Because we are not working inside the UI, we want to make sure that we shut down any resources when we are done. Otherwise, you can incur unwanted costs.

To shut down the cluster entirely:

In [None]:
client.close()
cluster.close()