# Data Loading

Get some data to play with

In [1]:
from sklearn.datasets import fetch_openml
# as_frame=False returns NumPy arrays, matching the array output shown below
blood = fetch_openml('blood-transfusion-service-center', as_frame=False)
print(blood.DESCR)

**Author**: Prof. I-Cheng Yeh  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center)  
**Please cite**: Yeh, I-Cheng, Yang, King-Jang, and Ting, Tao-Ming, "Knowledge discovery on RFM model using Bernoulli sequence", Expert Systems with Applications, 2008.   

**Blood Transfusion Service Center Data Set**  
Data taken from the Blood Transfusion Service Center in Hsin-Chu City in Taiwan -- this is a classification problem.

To demonstrate the RFMTC marketing model (a modified version of RFM), this study adopted the donor database of Blood Transfusion Service Center in Hsin-Chu City in Taiwan. The center passes their blood transfusion service bus to one university in Hsin-Chu City to gather blood donated about every three months. To build an FRMTC model, we selected 748 donors at random from the donor database. 

### Attribute Information  
* V1: Recency - months since last donation
* V2: Frequency - total number of donation
* V3: Monetary - total bl

In [2]:
blood.data.shape

(748, 4)

In [3]:
blood.data

array([[2.00e+00, 5.00e+01, 1.25e+04, 9.80e+01],
       [0.00e+00, 1.30e+01, 3.25e+03, 2.80e+01],
       [1.00e+00, 1.60e+01, 4.00e+03, 3.50e+01],
       ...,
       [2.30e+01, 3.00e+00, 7.50e+02, 6.20e+01],
       [3.90e+01, 1.00e+00, 2.50e+02, 3.90e+01],
       [7.20e+01, 1.00e+00, 2.50e+02, 7.20e+01]])

In [4]:
import pandas as pd
# The loaded dataset object is named `blood`, not `data`
X = pd.DataFrame(blood.data, columns=['recency', 'frequency', 'total_amount', 'since_first'])

In [None]:
blood.target.shape

In [None]:
blood.target

In [None]:
y = pd.Series(blood.target)
y.value_counts()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
pd.plotting.scatter_matrix(X, c=y=='2', cmap='Paired', figsize=(10, 10));

**Data is always a numpy array (or sparse matrix) of shape (n_samples, n_features)**
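A minimal sketch of that shape convention, using the first three rows of ``blood.data`` as printed above. The names ``X_small`` and ``y_small`` and the three label values are hypothetical and serve only as illustration.

```python
import numpy as np
import pandas as pd

# First three rows of blood.data, copied from the array output above.
rows = np.array([
    [2.0, 50.0, 12500.0, 98.0],
    [0.0, 13.0, 3250.0, 28.0],
    [1.0, 16.0, 4000.0, 35.0],
])
X_small = pd.DataFrame(
    rows, columns=['recency', 'frequency', 'total_amount', 'since_first']
)

# Hypothetical labels, one per row (not taken from the real dataset).
y_small = pd.Series(['2', '2', '1'])

# Rows are samples, columns are features.
n_samples, n_features = X_small.shape
print(n_samples, n_features)

# The target is one-dimensional, with one entry per sample.
print(y_small.shape)
```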

Split the data to get going

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
blood.data.shape

In [None]:
X_train.shape

In [None]:
X_test.shape
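By default ``train_test_split`` holds out 25% of the samples for the test set and shuffles randomly, so the result changes from run to run. The sketch below uses synthetic stand-in data with the same shape as ``blood.data`` to show a reproducible, stratified split. The placeholder arrays, the class counts, and the seed value are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as blood.data: (748, 4).
X_demo = np.zeros((748, 4))

# Hypothetical imbalanced labels (counts are invented for this sketch).
y_demo = np.array(['1'] * 600 + ['2'] * 148)

# random_state fixes the shuffle; stratify keeps the class ratio
# roughly equal in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```

With 748 samples, the default 25% test fraction yields 187 test rows and 561 training rows.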

# Exercises

## Exercise 1

Load the iris dataset from the ``sklearn.datasets`` module using the ``load_iris`` function.
The function returns a dictionary-like object that has the same attributes as ``blood``.

What is the number of classes, features and data points in this dataset?
Use a scatterplot to visualize the dataset.

You can look at ``DESCR`` attribute to learn more about the dataset.
``print(iris.DESCR)``

Split the data into training and test sets.

## Exercise 2

Usually data doesn't come in that nice a format. You can find the csv file that contains the iris dataset at the following path:

```python
import sklearn.datasets
import os
iris_path = os.path.join(sklearn.datasets.__path__[0], 'data', 'iris.csv')
```
Load the data from there using the pandas ``pd.read_csv`` function and bring it into the same format as before, with the data in a variable ``X`` and the labels in a variable ``y``. The first few lines of the ``iris.csv`` file look like this:

```
150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
4.9,3.0,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
```

http://github.com/amueller/ml-workshop-1-of-4

In [None]:
# %load solutions/load_iris.py