# Exploratory Data Analysis on Huge Data with Dask

A first EDA of the Default Prediction competition data. I hope this notebook will be useful for you ;)

**Warning:** In this notebook I explore all of the train and test data with the Dask library. Because the dataset is very large, some cells run very slowly. You have been warned ;)

# Table of contents

1. [Import libraries](#import-libraries)
1. [Load data](#load-data)
1. [First look at data](#first-look-at-data)
1. [Examine shape of given data](#examine-shape-of-given-data)
    1. [Shape of train data](#shape-of-train-data)
    1. [Shape of test data](#shape-of-test-data)
1. [Examine columns in data](#examine-columns-in-data)
1. [Check types of each column](#check-types-of-each-column)
1. [Discover NaNs](#discover-nans)
1. [Build distplot for all numeric features](#build-distplot)

## Import libraries <a class="anchor" id="import-libraries"></a>

In [None]:
import os

import numpy as np
import pandas as pd
import dask.dataframe as dd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

The `display.float_format` option is set here to disable scientific notation in the displayed tables.
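
As a small illustration of what this formatter does, here is a sketch on a made-up two-value Series (the values are not from the competition data). It also shows a side effect worth keeping in mind: with five fixed decimals, anything smaller than about 1e-5 in magnitude is displayed as zero.

```python
import pandas as pd

# Same formatter as in the cell above: five fixed decimals, no exponent.
fmt = lambda x: '%.5f' % x

# Made-up values, used only to show the effect on display.
values = pd.Series([1.2e-05, 123456.789])

# option_context applies the setting temporarily, only inside the `with` block.
with pd.option_context('display.float_format', fmt):
    fixed_text = values.to_string()

print(fixed_text)

# Python's default repr switches to scientific notation for tiny numbers,
# while the fixed-point formatter rounds them to zero.
tiny_default = repr(1e-07)
tiny_fixed = fmt(1e-07)
print(tiny_default, tiny_fixed)
```

The option only changes how numbers are printed; the stored values keep their full precision.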

In [None]:
PATH_TO_DATA = os.path.join('/kaggle', 'input', 'amex-default-prediction')

## Load data <a class="anchor" id="load-data"></a>

[Dask](https://docs.dask.org/en/latest/10-minutes-to-dask.html) allows us to work with data without loading it fully into RAM. Its API is very similar to Pandas.

In [None]:
train_data = dd.read_csv(os.path.join(PATH_TO_DATA, 'train_data.csv'))
test_data = dd.read_csv(os.path.join(PATH_TO_DATA, 'test_data.csv'))
train_labels = dd.read_csv(os.path.join(PATH_TO_DATA, 'train_labels.csv'))
sample_submission = dd.read_csv(os.path.join(PATH_TO_DATA, 'sample_submission.csv'))

## First look at data  <a class="anchor" id="first-look-at-data"></a>

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
train_labels.head()

In [None]:
train_data.describe().compute()

In [None]:
test_data.describe().compute()

In [None]:
train_labels.describe().compute()

## Examine shape of given data <a class="anchor" id="examine-shape-of-given-data"></a>

### Shape of train data <a class="anchor" id="shape-of-train-data"></a>

In [None]:
# `size` would return rows * columns, so count the rows with len() instead.
train_rows = len(train_data)
train_columns = train_data.shape[1]

print(f'Shape of train data: {train_rows} {train_columns}')

### Shape of test data  <a class="anchor" id="shape-of-test-data"></a>

In [None]:
# `size` would return rows * columns, so count the rows with len() instead.
test_rows = len(test_data)
test_columns = test_data.shape[1]

print(f'Shape of test data: {test_rows} {test_columns}')

Looks pretty huge :)
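
When reading these numbers, it helps to keep in mind that `size` and the row count are different things. The sketch below uses a small pandas frame with made-up columns; to my knowledge a Dask DataFrame follows the same semantics, except that its `size` is lazy and needs `.compute()`.

```python
import pandas as pd

# Toy frame: 3 rows, 2 columns (hypothetical names and values).
toy = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

n_rows = len(toy)            # number of rows
n_columns = toy.shape[1]     # number of columns
n_elements = toy.size        # rows * columns, not the row count

print(f'rows={n_rows} columns={n_columns} elements={n_elements}')
```

So a value taken from `size` overstates the number of rows by a factor equal to the number of columns.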

## Examine columns in data  <a class="anchor" id="examine-columns-in-data"></a>

### Check whether the columns in the train data are the same as in the test data

In [None]:
train_data.columns.tolist() == test_data.columns.tolist()

### List of column names

In [None]:
train_data.columns.tolist()

## Check types of each column   <a class="anchor" id="check-types-of-each-column"></a>

In [None]:
train_data.dtypes

As we see, nearly all columns are numeric, except `customer_ID` and `S_2`; the latter represents a date.
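
Since `S_2` holds dates but is read as text, one option is to ask the CSV reader to parse it. The sketch below does this with pandas on a tiny inline CSV; the rows and the `feature` column are made up. As far as I know, `dd.read_csv` forwards extra keyword arguments such as `parse_dates` to `pandas.read_csv`, but I have not checked that on the competition files.

```python
import io
import pandas as pd

# Tiny stand-in for train_data.csv (made-up rows, hypothetical `feature` column).
csv_text = (
    'customer_ID,S_2,feature\n'
    'abc,2017-03-09,0.93\n'
    'abc,2017-04-07,0.91\n'
    'def,2017-03-11,0.12\n'
)

# parse_dates converts the listed columns to datetime while reading.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['S_2'])

print(toy.dtypes)

# With a real datetime column, date arithmetic becomes available.
span_days = (toy['S_2'].max() - toy['S_2'].min()).days
print(span_days)
```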

## Discover NaNs  <a class="anchor" id="discover-nans"></a>

As the counts below show, some columns contain a large number of NaNs.

In [None]:
train_data.isnull().sum(axis = 0).compute()

In [None]:
test_data.isnull().sum(axis = 0).compute()
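
Raw counts are hard to compare across columns, so it can help to look at the share of missing values instead. Here is a minimal sketch on a toy pandas frame; the columns, values and the 0.5 threshold are made up. The same `isnull().mean()` chain should work on a Dask DataFrame when followed by `.compute()`.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column (made-up values).
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': [1.0, 2.0, np.nan, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# The mean of a boolean mask is the fraction of True values, i.e. the NaN share.
nan_share = toy.isnull().mean().sort_values(ascending=False)

# Columns where more than half of the values are missing (threshold is arbitrary).
mostly_missing = nan_share[nan_share > 0.5].index.tolist()

print(nan_share)
print(mostly_missing)
```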

## Build distplot for all numeric features  <a class="anchor" id="build-distplot"></a>

In [None]:
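%matplotlib inline
# sns.set_theme() applies the seaborn look; the 'seaborn' matplotlib style name
# was removed in recent matplotlib versions.
sns.set_theme()
# Sort the names so the plots appear in a stable order.
columns_to_draw = sorted(set(train_data.columns) - {'customer_ID', 'S_2'})
for current_column in columns_to_draw:
    if pd.api.types.is_numeric_dtype(train_data[current_column].dtype):
        fig, ax = plt.subplots(figsize=(10, 4))
        # .compute() loads the whole column into memory, which is the slow step.
        # histplot replaces the deprecated distplot.
        sns.histplot(train_data[current_column].compute(), kde=False, ax=ax)
        ax.set_title(current_column)
        plt.show()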
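
The loop above pulls every full column into memory before plotting. A possible alternative, sketched here with NumPy only, is to fix the bin edges up front and add up the histogram counts chunk by chunk. The random data, the chunk count and the bin range are all assumptions for illustration; Dask offers `dask.array.histogram` for a similar purpose, which I have not tried on this dataset.

```python
import numpy as np

# Stand-in for one numeric column (random values, not competition data).
rng = np.random.default_rng(0)
values = rng.normal(size=10_000)

# Fixed edges shared by all chunks, so per-chunk counts can simply be added.
bin_edges = np.linspace(-5.0, 5.0, 51)
counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)

# Process the column in 10 pieces, as if each piece were one partition.
for chunk in np.array_split(values, 10):
    chunk = chunk[~np.isnan(chunk)]          # drop NaNs before binning
    chunk_counts, _ = np.histogram(chunk, bins=bin_edges)
    counts += chunk_counts

# Reference result computed on the whole column at once.
full_counts, _ = np.histogram(values, bins=bin_edges)
print(np.array_equal(counts, full_counts))
```

The accumulated `counts` can then be drawn with something like `plt.bar(bin_edges[:-1], counts, width=np.diff(bin_edges), align='edge')`, so only the bin counts ever need to fit in memory.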

**TO BE CONTINUED...**