# Dask: Introduction to the Framework

[Dask](https://dask.org/) is a framework for parallel computing in Python. It works with existing packages such as pandas, scikit-learn, and NumPy, and scales them from a single multi-core machine to distributed clusters.
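
As a small illustration of how Dask scales a NumPy-style workload, the sketch below builds a chunked array and sums it. The array shape and chunk sizes are arbitrary values chosen for the example, not recommendations.

```python
import dask.array as da

# A 1000 x 1000 array of ones, split into 250 x 250 chunks
# (shape and chunk size are illustrative assumptions)
x = da.ones((1000, 1000), chunks=(250, 250))

# Building the expression is lazy: no chunk has been computed yet
lazy_total = x.sum()

# compute() executes the task graph and returns a concrete value
total = lazy_total.compute()
print(x.numblocks, total)
```

Each chunk is an ordinary NumPy array, so the per-chunk sums can run in parallel before being combined.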

## Import Dask and access the Dask dashboard

In [None]:
import os
import re
import json
import dask
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, progress

You can change `n_workers` and `memory_limit` to tune performance, but keep them within the resource limits of your server or cluster; `memory_limit` applies to each worker separately. Running the next cell prints a URL for the Dask dashboard:

In [None]:
client = Client(
    n_workers=2,
    threads_per_worker=1,
    memory_limit='4GB'
)
print(
    'Dask dashboard available at:',
    'https://jhas01.gsom.spbu.ru{}proxy/{}/status'.format(
        os.environ['JUPYTERHUB_SERVICE_PREFIX'],
        client.scheduler_info()['services']['dashboard']
    )
)
client
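
The same kind of local setup can also be written with an explicit `LocalCluster`, which makes it easier to shut everything down when the work is done. The following is a minimal sketch, not the configuration used above: `processes=False` (thread-based workers inside the current process) and the name `demo_client` are assumptions made so the example stays lightweight.

```python
from dask.distributed import Client, LocalCluster

# processes=False runs the workers as threads in this process
# (an assumption for a lightweight demo; the cell above uses the defaults)
with LocalCluster(n_workers=2, threads_per_worker=1, processes=False) as cluster:
    with Client(cluster) as demo_client:
        # Number of workers registered with the scheduler
        n_workers = len(demo_client.scheduler_info()['workers'])
        # Submit a single task to the cluster and wait for its result
        total = demo_client.submit(sum, [1, 2, 3]).result()

print(n_workers, total)
```

Leaving the `with` blocks closes the client and the cluster, which frees the worker resources.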

## Read data

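Read a CSV file from local disk. `dd.read_csv` is lazy: it records how to load the data, and `head()` reads only the first rows: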

In [None]:
ddf = dd.read_csv('./data/telecom_churn.csv', sep=',')
ddf.head()

In [None]:
# describe() is lazy; compute() runs it and returns a pandas DataFrame
ddf.describe().compute()

In [None]:
print('all columns:', ddf.columns)

## Basic operations examples

Group by the `State` column and count the rows in each group:

In [None]:
ddf.groupby('State').State.count().compute().head(5)

Display the unique values of the `Churn` column:

In [None]:
ddf.Churn.unique().compute()

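Apply a function to a dataframe column. The `meta` argument tells Dask the name and dtype of the result so that it does not have to infer them: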

In [None]:
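def one_hot(value):
    # Encode the churn flag as an integer: 0 for False, 1 for True, 2 otherwise.
    # str() makes the comparison work whether Churn was parsed as bool or kept as text.
    text = str(value)
    if text == 'False':
        return 0
    elif text == 'True':
        return 1
    else:
        return 2

# meta gives the output series name and its integer dtype
ddf = ddf.assign(Churn_OH=ddf.Churn.map(one_hot, meta=('Churn_OH', 'int64')))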

In [None]:
ddf.compute()