# Overview
The goal of this session is to get familiar with core data mining tasks and the data mining process without diving into the details of how algorithms work. The exercises can be done in groups of 4-5 people.

# Exploratory data analysis
In this session, we use a number of tools from Python's data analysis stack. `Pandas` provides the `DataFrame` data structure, which facilitates manipulating tabular data. After an optional cell that installs `seaborn`, the snippet below loads a CSV file (treating `?` as a missing value) and prints the number of records together with the attribute names and types. You can look at the first few rows of the dataset using the `head` function:

In [None]:
# Install a missing package
# Please, restart the kernel afterwards: menu -> Kernel -> Restart & Clear Output
# This cell can be deleted afterwards

!pip install seaborn --user

In [None]:
import numpy as np
import pandas as pd

# Treat '?' as a missing value
eda_data = pd.read_csv('eda.csv', na_values=['?'])

print(f'{eda_data.shape[0]} records')
print(f'{eda_data.shape[1]} attributes:')

eda_types = eda_data.dtypes
print(eda_types)

eda_data.head()
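
To see what `na_values=['?']` does without depending on `eda.csv`, here is a minimal sketch that parses a tiny in-memory CSV. The column names and values are made up for illustration and are not taken from the actual dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; '?' marks a missing value
csv_text = "age,hospital\n34,A\n?,B\n51,?\n"

sample = pd.read_csv(io.StringIO(csv_text), na_values=['?'])

# 'age' is parsed as float64 because NaN cannot be stored in an int64 column
print(sample.dtypes)

# Number of missing values per attribute
print(sample.isnull().sum())
```

Without `na_values`, the `?` entries would be read as ordinary strings, and `age` would end up as a text column rather than a numeric one.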

## Visualisation
Visualisation is the primary way to get a high-level understanding of the data. We use `Matplotlib` as the plotting engine, while `Seaborn` provides convenient shortcuts for the most common plotting tasks. The following snippet initialises these packages.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
sns.set(color_codes=True)

### Individual attributes
The basic tool for visualising **categorical** attributes is a histogram (bar chart), which shows the frequency of each value of an attribute. The following snippet uses `countplot` to plot a histogram of the first categorical attribute in the data; widen the slice `categorical_attrs[0:1]` to plot the others.

In [None]:
# Attributes stored as strings are treated as categorical
categorical_attrs = list(eda_types[eda_types == 'object'].index)
# Only the first attribute is plotted; widen the slice to plot more
for attr_name in categorical_attrs[0:1]:
    plt.figure()

    attr_data = eda_data[attr_name]
    missing_count = attr_data.isnull().sum()

    plot = sns.countplot(x=attr_name, data=eda_data)
    plot.set_title(attr_name + '\nMissing: ' + str(missing_count))
    plot.set_xlabel(' ')
    plot.set_ylabel('Count')
    # Rotate the label of the vertical axis
    # so that it's easy to read
    plot.yaxis.label.set_rotation(0)
plt.draw()
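
The numbers behind such a plot can also be inspected directly. The sketch below uses `value_counts` on a made-up attribute (the values are hypothetical) to compute the same frequencies that `countplot` draws as bars.

```python
import pandas as pd

# Hypothetical categorical attribute with one missing value
diagnosis = pd.Series(['flu', 'cold', 'flu', None, 'flu'], name='diagnosis')

# Frequency of each value; dropna=False also counts the missing entries
counts = diagnosis.value_counts(dropna=False)
print(counts)

# Relative frequencies of the non-missing values
shares = diagnosis.value_counts(normalize=True)
print(shares)
```

By default `value_counts` ignores missing values, which is why the relative frequencies are computed over four records rather than five.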

The distribution of an individual **numeric** attribute can be discretised and visualised with a histogram using `distplot` (deprecated since seaborn 0.11 in favour of `histplot` and `displot`). Alternatively, a `boxplot` visualises a five-number summary (min, 25th percentile, median, 75th percentile, and max) along with outliers.

In [None]:
from matplotlib.ticker import MaxNLocator

numeric_attrs = list(eda_types[(eda_types == 'int64') | (eda_types=='float64')].index)
for attr_name in numeric_attrs[3:4]:
    # Create a figure with two subfigures that share an X axis
    f, (ax_hist, ax_box) = plt.subplots(2, sharex=True, 
                                           gridspec_kw={"height_ratios": (.9, .1)})

    attr_data = eda_data[attr_name]
    # Compute basic attribute summaries
    # (named so as not to shadow the built-ins `min` and `max`)
    attr_min = attr_data.min()
    attr_mean = attr_data.mean()
    attr_median = attr_data.median()
    attr_max = attr_data.max()
    std_dev = attr_data.std()
    missing_count = np.count_nonzero(attr_data.isnull().values)

    # `dropna()` removes missing values from consideration
    # Note: `distplot` is deprecated since seaborn 0.11; `histplot` is its replacement
    distplot = sns.distplot(attr_data.dropna(), kde=True, rug=False, axlabel=False, ax=ax_hist)

    # - Put the attribute name and stats in the title
    # - Keep only integer ticks
    distplot.set_title(attr_name + '\n' +
                   'Min: '         + str(attr_min)             + '   ' +
                   'Avg: '         + str(round(attr_mean, 2))  + '   ' +
                   'Std.dev: '     + str(round(std_dev, 2))    + '   ' +
                   'Median: '      + str(attr_median)          + '   ' +
                   'Max: '         + str(attr_max)             + '   ' +
                   'Missing: '     + str(missing_count))
    distplot.xaxis.set_major_locator(MaxNLocator(integer=True))

    # Pass the data as `x` so that the box is horizontal
    # and shares the X axis with the histogram
    boxplot = sns.boxplot(x=attr_data, ax=ax_box)
    boxplot.set_xlabel(' ')
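
The quantities a box plot draws can be computed by hand. The following sketch works on made-up values and assumes the 1.5 × IQR whisker rule, which is the default in `boxplot`; points beyond the whiskers are the ones drawn as outliers.

```python
import pandas as pd

# Hypothetical numeric attribute with one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50], dtype=float)

# Quartiles and the interquartile range (the width of the box)
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points further than 1.5 * IQR from the box are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print('Five-number summary:', values.min(), q1, median, q3, values.max())
print('Outliers:', outliers.tolist())
```

Note that when outliers are present, the whiskers stop at the most extreme points inside the `[lower, upper]` range rather than at the true minimum and maximum.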

### Pairs of attributes

Pairwise attribute relationships can be visualised with variations of a scatter plot. Furthermore, a third variable can be brought into the mix by colouring the data points. Explore options for visualising a pair of categorical attributes, a pair of numeric attributes, and a mixed pair.

In [None]:
# `factorplot` was renamed to `catplot` in seaborn 0.9
grid = sns.catplot(data=eda_data, y='diagnosis', col='hospital', kind='count')

grid.axes[0,0].yaxis.label.set_rotation(0)
grid.axes[0,0].yaxis.labelpad = 25

plt.subplots_adjust(top=0.87)
grid.fig.suptitle('Pair of categorical attributes')

In [None]:
# The `size` parameter was renamed to `height` in seaborn 0.9
grid = sns.FacetGrid(data=eda_data, height=7, hue=None)
grid.map(plt.scatter, "height", "bmi")

plt.subplots_adjust(top=0.95)
grid.fig.suptitle('Pair of numeric attributes')

grid.ax.yaxis.label.set_rotation(0)

In [None]:
plot = sns.stripplot(data=eda_data, x="age", y="hospital", jitter=True)
plot.set_title('Mixed categorical/numeric plot')
plot.yaxis.label.set_rotation(0)
plot.yaxis.labelpad = 25
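
As noted at the start of this subsection, a third variable can be brought in by colouring the data points. Here is a minimal sketch using the `hue` parameter of `scatterplot`; the data frame is made up for illustration, with column names borrowed from the snippets above.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data; only the column names mirror the dataset used above
toy = pd.DataFrame({
    'height':   [150, 160, 170, 180, 155, 165, 175, 185],
    'bmi':      [22.0, 24.5, 23.1, 27.8, 30.2, 21.4, 26.0, 25.3],
    'hospital': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
})

# `hue` colours each point by the value of a third (here categorical) attribute
ax = sns.scatterplot(data=toy, x='height', y='bmi', hue='hospital')
ax.set_title('Pair of numeric attributes coloured by a third')
```

The same idea should carry over to the `FacetGrid` cell above by replacing `hue=None` with an attribute name and calling `grid.add_legend()` after `grid.map`.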