# Data Mining

> Goal: To give an overview of the Machine Learning and Deep Learning fields.

Recommended readings:

* <cite data-cite="granger2013">Deep Learning, Ian Goodfellow and Yoshua Bengio and Aaron Courville, 2016</cite>
* <cite data-cite="granger2013">Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, 2005</cite>

See also: 

* [KDnuggets.com](https://www.kdnuggets.com)
* [SIGKDD](http://www.kdd.org)



## TOC:
* [Classification](#classification)
* [Clustering](#clustering)
* [Association Rule Discovery](#association-rule-discovery)
* [Sequential Pattern Discovery](#seq-pattern-discovery)

## Classification <a class="anchor" id="classification"></a>  

In [71]:
import pandas as pd # data structures and data analysis tools for the Python programming language.

> **Classification**: the task of learning a target function $f$ that maps each attribute set $X$ to one of the predefined class labels $y$. The target function is also known informally as a _classification model_.

In [91]:
## Load Iris dataset from sklearn library 
## ref: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
data = load_iris()

In [76]:
## Convert the dataset to Pandas dataframe
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target class'] = data.target_names[data.target]  # map integer codes to class names

In [77]:
## Randomly choose 10 objects from the dataset and display their attributes and class
## the first column corresponds to the index of the object.
df.sample(n=10)

| index | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target class |
|---|---|---|---|---|---|
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
| 27 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
| 36 | 5.5 | 3.5 | 1.3 | 0.2 | setosa |
| 129 | 7.2 | 3.0 | 5.8 | 1.6 | virginica |
| 79 | 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
| 97 | 6.2 | 2.9 | 4.3 | 1.3 | versicolor |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 111 | 6.4 | 2.7 | 5.3 | 1.9 | virginica |
| 139 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
| 22 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |


The class label must be a discrete attribute. This is the key characteristic that distinguishes classification from regression, a predictive modeling task in which the target is a continuous attribute.
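
To make the definition of a classification model concrete, here is a minimal sketch of a hand-written target function $f$ applied to the ten rows sampled above. The two thresholds (2.5 cm and 1.75 cm) and the name `classify` are illustrative assumptions, not values learned from the data; a real classifier would infer such rules from the training set.

```python
# Ten rows copied from the sample above:
# (petal length in cm, petal width in cm, true class)
SAMPLE = [
    (6.1, 2.3, "virginica"),
    (1.5, 0.2, "setosa"),
    (1.3, 0.2, "setosa"),
    (5.8, 1.6, "virginica"),
    (3.5, 1.0, "versicolor"),
    (4.3, 1.3, "versicolor"),
    (1.5, 0.2, "setosa"),
    (5.3, 1.9, "virginica"),
    (5.4, 2.1, "virginica"),
    (1.0, 0.2, "setosa"),
]


def classify(petal_length, petal_width):
    """Hand-written target function f: attribute set -> one discrete class label."""
    if petal_length < 2.5:   # hypothetical threshold, chosen by eye
        return "setosa"
    if petal_width < 1.75:   # hypothetical threshold, chosen by eye
        return "versicolor"
    return "virginica"


predictions = [classify(pl, pw) for pl, pw, _ in SAMPLE]
n_correct = sum(pred == true for pred, (_, _, true) in zip(predictions, SAMPLE))
print(f"{n_correct}/{len(SAMPLE)} rows labelled correctly")  # prints "9/10 rows labelled correctly"
```

The output is always one of three discrete labels, which is what makes this a classification model rather than a regression model. Row 129 (petal width 1.6 cm) is labelled versicolor although it is virginica, a reminder that a model is only an approximation of the true mapping.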

In [92]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

In [95]:
print('X_train: {}'.format(X_train.shape))
print('X_test: {}'.format(X_test.shape))
print('y_train: {}'.format(y_train.shape))
print('y_test: {}'.format(y_test.shape))

X_train: (100, 4)
X_test: (50, 4)
y_train: (100,)
y_test: (50,)


### Decision Trees

### Instance Based Classifiers

## Clustering <a class="anchor" id="clustering"></a>

## Association Rule Discovery <a class="anchor" id="association-rule-discovery"></a>

## Sequential Pattern Discovery <a class="anchor" id="seq-pattern-discovery"></a>  

## Model Evaluation

### Introduction

The simplest method is the *hold-out*: you split the dataset into two subsets, one for training (usually 2/3 of the records) and one for testing (usually 1/3).

> A single hold-out is fragile, because some kinds of records may be missing from the training (or test) set, so the model never learns to recognize them. Imagine a dataset of photos of cars and planes, where the task is to classify new photos into one of the two categories. If the split happens to put only cars in the training set and only planes in the test set, how well do you expect the model to perform?
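
The cars-and-planes scenario can be reproduced on Iris, whose rows are stored sorted by class (50 of each). The sketch below assumes scikit-learn's `train_test_split`; the variable names and the `random_state` value are arbitrary. It contrasts an unshuffled hold-out, which leaves one class entirely out of training, with a stratified hold-out, which keeps class proportions roughly equal in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target  # rows are sorted by class: 50 x 0, 50 x 1, 50 x 2

# Hold-out WITHOUT shuffling: the first 100 rows train, the last 50 rows test.
_, _, y_train_bad, y_test_bad = train_test_split(
    X, y, test_size=50, shuffle=False)
classes_train_bad = set(np.unique(y_train_bad).tolist())
classes_test_bad = set(np.unique(y_test_bad).tolist())
print("unshuffled  train:", classes_train_bad, " test:", classes_test_bad)

# Stratified hold-out: each class is split in (approximately) the same ratio.
_, _, y_train_ok, y_test_ok = train_test_split(
    X, y, test_size=50, stratify=y, random_state=0)
classes_train_ok = set(np.unique(y_train_ok).tolist())
classes_test_ok = set(np.unique(y_test_ok).tolist())
print("stratified  train:", classes_train_ok, " test:", classes_test_ok)
```

In the unshuffled split the model is trained on classes 0 and 1 only and then tested exclusively on class 2, so it cannot label a single test record correctly.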

A variant of the previous method is called *random subsampling*: you repeat the hold-out several times with different random splits. As before, there is no guarantee that each record appears at least once in both the training and the test sets, so the model may still fail to recognize some records.
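
Random subsampling can be sketched from scratch with NumPy alone. The sizes below mirror the Iris split used earlier (150 records, 50 held out); the number of repetitions and the seed are arbitrary assumptions, and the model fitting step is left as a comment. The sketch counts how often each record lands in a test set, which shows the lack of guarantee directly.

```python
import numpy as np

n_records, n_test, n_repeats = 150, 50, 5  # n_repeats is an arbitrary choice
rng = np.random.default_rng(42)            # arbitrary seed, for reproducibility

test_counts = np.zeros(n_records, dtype=int)  # times each record was held out
for _ in range(n_repeats):
    shuffled = rng.permutation(n_records)
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    # ... fit a model on train_idx, score it on test_idx, store the score ...
    test_counts[test_idx] += 1

never_tested = int((test_counts == 0).sum())
always_tested = int((test_counts == n_repeats).sum())  # never used for training
print(f"never in a test set: {never_tested}, in every test set: {always_tested}")
```

With a 1/3 test fraction, a given record misses all five test sets with probability $(2/3)^5 \approx 0.13$, so on average roughly one record in eight is never used for evaluation. scikit-learn provides the same scheme as `ShuffleSplit`.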

### Cross-validation

#### k-fold

#### leave-one-out (LOO)

### Choosing a metric 