# Introduction to Scikit-Learn (sklearn)

## `sklearn` dataset API
This notebook demonstrates the `sklearn` dataset API.
It provides three kinds of functions:
1. Loaders (`load_*`) load small standard datasets bundled with `sklearn`.
2. Fetchers (`fetch_*`) download larger datasets from the internet and load them into memory.
3. Generators (`make_*`) generate controlled synthetic datasets.

Loaders and fetchers return a `Bunch` object, while generators return a tuple of feature matrix and label vector (or matrix).
Loaders and fetchers can also return a tuple of feature matrix and label vector if we set the argument `return_X_y=True`.
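
The difference in return types can be checked directly. The following is a minimal sketch that uses `load_iris` and `make_blobs` as a representative loader and generator; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris, make_blobs

# A loader returns a Bunch by default ...
bunch = load_iris()
print(type(bunch).__name__)          # Bunch

# ... and a (feature matrix, label vector) tuple with return_X_y=True.
X_iris, y_iris = load_iris(return_X_y=True)
print(X_iris.shape, y_iris.shape)    # (150, 4) (150,)

# A generator returns a tuple of arrays.
X_syn, y_syn = make_blobs(n_samples=20, n_features=2, random_state=0)
print(X_syn.shape, y_syn.shape)      # (20, 2) (20,)
```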

## Loaders
### Loading iris dataset

In [1]:
from sklearn.datasets import load_iris
data = load_iris()

This returns a `Bunch` object `data`, a dictionary-like object with the following attributes:
* `data`, which has the feature matrix.
* `target`, which has the label vector.
* `feature_names` contains the names of the features.
* `target_names` contains the names of the classes.
* `DESCR` has the full description of the dataset.
* `filename` has the path to the location of the data.
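
Because a `Bunch` is dictionary-like, each attribute listed above should also be reachable as a key. A small sketch of this, assuming nothing beyond `load_iris` itself (the exact set of keys may vary between scikit-learn versions):

```python
from sklearn.datasets import load_iris

data = load_iris()

# The same array is reachable by key or by attribute.
same_object = data["data"] is data.data
print(same_object)                   # True

# keys() lists what this particular Bunch contains.
available_keys = sorted(data.keys())
print(available_keys)
```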

In [2]:
type(data)

sklearn.utils.Bunch

We can access them one by one and examine their contents:

In [3]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [4]:
data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [6]:
# Let's look at the first five examples in feature matrix
data.data[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

We can observe that each example has 4 features.

In [7]:
data.data.shape

(150, 4)

In [8]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [9]:
data.target.shape

(150,)
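
The integer labels in `target` index into `target_names`. As a minimal sketch (the variable names `class_counts` and `label_names` are illustrative), we can count the examples per class and translate labels into readable class names:

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# Number of examples for each class label (0, 1, 2).
class_counts = np.bincount(data.target)
print(class_counts)                  # [50 50 50]

# Indexing target_names with the integer labels gives readable names.
label_names = data.target_names[data.target]
print(label_names[:3])               # ['setosa' 'setosa' 'setosa']
```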

We can read additional documentation about `load_iris` in the following manner:

In [12]:
data.DESCR



In [10]:
?load_iris

Signature: load_iris(*, return_X_y=False, as_frame=False)
Docstring:
Load and return the iris dataset (classification).

The iris dataset is a classic and very easy multi-class classification
dataset.

Classes                          3
Samples per class               50
Samples total                  150
Dimensionality                   4
Features            real, positive

Read more in the :ref:`User Guide <iris_dataset>`.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object. See
    below for more information about the `data` and `target` object.

    .. versionadded:: 0.18

as_frame : bool, default=False
    If True, the data is a pandas DataFrame including columns with
    appropriate dtypes (numeric). The target is
    a pandas DataFrame or Se

We can obtain the feature matrix and the label vector from `load_iris`, and from loaders in general, by setting the `return_X_y` argument to `True`.

In [11]:
feature_matrix, label_vector = load_iris(return_X_y=True)
print('Shape of feature matrix: ', feature_matrix.shape)
print('Shape of label vector: ', label_vector.shape)

Shape of feature matrix:  (150, 4)
Shape of label vector:  (150,)


***In this way, we can load and examine different datasets.***  
e.g.:  
> `load_diabetes`, `load_digits`, `load_wine`, `load_breast_cancer`, `load_linnerud`, etc.
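
As a sketch of that workflow, the loop below calls a few of these loaders with `return_X_y=True` and records the shapes they return. The `shapes` dictionary is only a helper for this illustration.

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_wine

shapes = {}
for loader in (load_wine, load_digits, load_breast_cancer):
    X, y = loader(return_X_y=True)
    shapes[loader.__name__] = (X.shape, y.shape)
    print(f"{loader.__name__}: X {X.shape}, y {y.shape}")

# load_wine: X (178, 13), y (178,)
# load_digits: X (1797, 64), y (1797,)
# load_breast_cancer: X (569, 30), y (569,)
```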

## Fetchers
### `fetch_california_housing`

**Step 1**: Import the library and access the documentation.

In [13]:
from sklearn.datasets import fetch_california_housing
?fetch_california_housing

Signature:
fetch_california_housing(
    *,
    data_home=None,
    download_if_missing=True,
    return_X_y=False,
    as_frame=False,
)
Docstring:
Load the California housing dataset (regression).

Samples total             20640
Dimensionality                8
Features                   real
Target           real 0.15 - 5.

Read more in the :ref:`User Guide <california_housing_dataset>`.

Parameters
----------
data_home : str, default=None
    Specify another download and cache folder for the datasets. By default
    all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

download

**Step 2**: Call the fetcher and obtain the `Bunch` object.

In [15]:
housing_data = fetch_california_housing()
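
The docstring above notes that fetched data is cached in a data home folder. The sketch below uses `sklearn.datasets.get_data_home` to resolve such a folder without downloading anything. The `sk_data` folder name is hypothetical, and a temporary directory is used so the real cache is left untouched.

```python
import os
import tempfile

from sklearn.datasets import get_data_home

# Hypothetical cache folder, placed inside a temporary directory
# so that this sketch does not touch the real cache.
with tempfile.TemporaryDirectory() as tmp_dir:
    custom_home = get_data_home(data_home=os.path.join(tmp_dir, "sk_data"))
    home_exists = os.path.isdir(custom_home)   # created if it was missing
    print(custom_home, home_exists)
```

Passing the same path as `data_home=` to a fetcher should make it store its download there, and calling `get_data_home()` with no argument returns the default location.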

**Step 3**: Examine the `Bunch` object.

In [16]:
housing_data.DESCR

'.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block\n        - HouseAge      median house age in block\n        - AveRooms      average number of rooms\n        - AveBedrms     average number of bedrooms\n        - Population    block population\n        - AveOccup      average house occupancy\n        - Latitude      house block latitude\n        - Longitude     house block longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttp://lib.stat.cmu.edu/datasets/\n\nThe target variable is the median house value for California districts.\n\nThis dataset was derived from the 1990 U.S. census, using one row per census\nblock group. A block group is the smallest geographical unit

In [18]:
housing_data.data.shape

(20640, 8)

In [19]:
housing_data.data[:5]

array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00,
         1.02380952e+00,  3.22000000e+02,  2.55555556e+00,
         3.78800000e+01, -1.22230000e+02],
       [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00,
         9.71880492e-01,  2.40100000e+03,  2.10984183e+00,
         3.78600000e+01, -1.22220000e+02],
       [ 7.25740000e+00,  5.20000000e+01,  8.28813559e+00,
         1.07344633e+00,  4.96000000e+02,  2.80225989e+00,
         3.78500000e+01, -1.22240000e+02],
       [ 5.64310000e+00,  5.20000000e+01,  5.81735160e+00,
         1.07305936e+00,  5.58000000e+02,  2.54794521e+00,
         3.78500000e+01, -1.22250000e+02],
       [ 3.84620000e+00,  5.20000000e+01,  6.28185328e+00,
         1.08108108e+00,  5.65000000e+02,  2.18146718e+00,
         3.78500000e+01, -1.22250000e+02]])

In [21]:
housing_data.target.shape

(20640,)

In [22]:
housing_data.target[:5]

array([4.526, 3.585, 3.521, 3.413, 3.422])

*Note that the labels are real numbers, so this is a regression dataset*.

In [23]:
housing_data.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [24]:
housing_data.target_names

['MedHouseVal']

### `fetch_openml`
[openml.org](https://www.openml.org "OpenML") is a public repository for machine learning data and experiments that allows anyone to upload open datasets.

In [25]:
from sklearn.datasets import fetch_openml
?fetch_openml

Signature:
fetch_openml(
    name: Optional[str] = None,
    *,
    version: Union[str, int] = 'active',
    data_id: Optional[int] = None,
    data_home: Optional[str] = None,
    target_column: Union[str, List, NoneType] = 'default-target',
    cache: bool

*Note that this is an experimental API and is likely to change in future releases.*  

Let's use this API to load the MNIST dataset.  
>MNIST is a large database of handwritten digits that is commonly used for training various image processing systems.

In [None]:
# fetching as feature matrix and label vector
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
print('Feature matrix shape: ', X.shape)
print('Label shape: ', y.shape)

## Generators
### `make_regression`

In [28]:
from sklearn.datasets import make_regression
?make_regression

Signature:
make_regression(
    n_samples=100,
    n_features=100,
    *,
    n_informative=10,
    n_targets=1,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=0.0,
    shuffle=True,
    coef=False,
    random_state=None

#### Example 1
Let's generate 100 samples with 5 features for a single-target regression problem.

In [29]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=1, shuffle=True, random_state=42)

In [30]:
X.shape

(100, 5)

In [31]:
y.shape

(100,)

#### Example 2
Let's generate 100 samples with 5 features for a multi-output regression problem with 5 targets.

In [32]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=5, shuffle=True, random_state=42)

In [33]:
X.shape

(100, 5)

In [34]:
y.shape

(100, 5)
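
The signature shown earlier also lists `coef`, `bias` and `noise`. As a hedged sketch of how they relate, the block below requests the underlying coefficients with `coef=True` and checks that, when `noise=0.0`, the targets are a linear function of the features. The bias value of 2.0 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(
    n_samples=100, n_features=5, n_targets=1,
    bias=2.0, noise=0.0, coef=True, random_state=42,
)
print(coef.shape)                               # (5,)

# Without noise, the targets are an exact linear function of X.
is_linear = np.allclose(X @ coef + 2.0, y)
print(is_linear)                                # True
```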

### `make_classification`
`make_classification` generates a random n-class classification problem.

In [35]:
from sklearn.datasets import make_classification
?make_classification

Signature:
make_classification(
    n_samples=100,
    n_features=20,
    *,
    n_informative=2,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=

Let's generate a binary classification problem with 10 features and 100 samples.

In [36]:
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, n_clusters_per_class=1, random_state=42)

In [37]:
X.shape

(100, 10)

In [38]:
y.shape

(100,)

In [39]:
X[:5]

array([[ 0.11422765, -1.71016839, -0.06822216, -0.14928517,  0.30780177,
         0.15030176, -0.05694562, -0.22595246, -0.36361221, -0.13818757],
       [ 0.70775194, -1.57022472, -0.23503183, -0.63604713,  0.62180996,
        -0.56246678,  0.97255445, -0.77719676,  0.63240774, -0.47809669],
       [ 0.63859246,  0.04739867,  0.33273433,  1.1046981 , -0.65183611,
        -1.66152006, -1.2110162 ,  1.09821151, -0.0660798 ,  0.68024225],
       [-0.23894805, -0.97755524,  0.0379061 ,  0.19896733,  0.50091719,
        -0.90756366,  0.75539123,  0.12437227, -0.57677133,  0.07871283],
       [-0.59239392, -0.05023811,  0.17573204, -1.43949185,  0.27045683,
        -0.86399077, -0.83095012,  0.60046915,  0.04852163,  0.32557953]])

In [40]:
y[:5]

array([1, 1, 1, 1, 0])

For a multiclass classification setup:

In [41]:
X, y = make_classification(n_samples=100, n_features=10, n_classes=3, n_clusters_per_class=1, random_state=42)

In [42]:
y.shape

(100,)

In [43]:
y[:5]

array([2, 0, 1, 0, 0])
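
The calls above pass `n_clusters_per_class=1`. One likely reason, sketched below, is that `make_classification` requires `n_classes * n_clusters_per_class` to be at most `2**n_informative`, which the defaults violate for three classes. The sketch also counts the examples per class; the flag `default_clusters_ok` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification

# With the default n_informative=2, three classes with the default two
# clusters each (3 * 2 = 6 clusters) exceed the 2**2 = 4 limit.
try:
    make_classification(n_samples=100, n_features=10, n_classes=3,
                        random_state=42)
    default_clusters_ok = True
except ValueError as err:
    default_clusters_ok = False
    print("ValueError:", err)

# One cluster per class satisfies the constraint (3 * 1 <= 4).
X, y = make_classification(n_samples=100, n_features=10, n_classes=3,
                           n_clusters_per_class=1, random_state=42)
class_counts = np.bincount(y)
print(class_counts)   # roughly balanced; flip_y can reassign a few labels
```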

### `make_multilabel_classification`

In [45]:
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(n_samples=100, n_features=20, n_classes=5, n_labels=2)

In [47]:
X.shape

(100, 20)

In [48]:
y.shape

(100, 5)

In [49]:
X[:3]

array([[2., 0., 3., 1., 3., 3., 0., 1., 0., 1., 1., 4., 1., 4., 1., 8.,
        0., 2., 0., 2.],
       [3., 3., 2., 0., 5., 0., 4., 0., 2., 0., 5., 0., 3., 5., 2., 2.,
        4., 3., 4., 2.],
       [3., 0., 1., 1., 2., 2., 1., 2., 0., 0., 2., 1., 5., 3., 1., 2.,
        0., 4., 2., 1.]])

In [50]:
y[:3]

array([[1, 1, 1, 0, 1],
       [0, 0, 1, 0, 0],
       [1, 1, 0, 1, 1]])
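
Each row of `y` is a binary indicator vector over the 5 classes, so summing a row gives the number of labels assigned to that sample. A minimal sketch follows; `random_state=42` is added here only for reproducibility and is not part of the call above.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

# random_state is an addition for this sketch, to make it reproducible.
X, y = make_multilabel_classification(n_samples=100, n_features=20,
                                      n_classes=5, n_labels=2,
                                      random_state=42)

# How many classes each sample belongs to.
labels_per_sample = y.sum(axis=1)
print(labels_per_sample[:5])

# n_labels sets the average number of labels, not an exact count per sample.
print(labels_per_sample.mean())
```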

### `make_blobs`
`make_blobs` enables us to generate random data for clustering.  

Let's generate a random dataset of 10 samples with 2 features each for clustering.

In [51]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10, n_features=2, centers=3, random_state=42)
print('Feature matrix shape: ', X.shape)
print('Label shape: ', y.shape)

Feature matrix shape:  (10, 2)
Label shape:  (10,)


We can find the cluster membership of each point in `y`.

In [52]:
y

array([2, 2, 1, 2, 0, 0, 0, 1, 1, 0])
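
To summarise the membership, we can count the points per cluster and also ask `make_blobs` for the generated cluster centers. This sketch assumes a scikit-learn version that supports the `return_centers` argument; `cluster_sizes` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y, centers = make_blobs(n_samples=10, n_features=2, centers=3,
                           random_state=42, return_centers=True)

# One row per cluster center, one column per feature.
print(centers.shape)            # (3, 2)

# Number of points assigned to clusters 0, 1 and 2.
cluster_sizes = np.bincount(y)
print(cluster_sizes)            # [4 3 3]
```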