In [6]:
import numpy as np

# Introduction

The objective of this notebook is to demonstrate the `sklearn` dataset API.

Recall that it provides three kinds of functions:
1. **Loaders** (`load_*`) load small standard datasets bundled with `sklearn` (these need not be downloaded separately).
2. **Fetchers** (`fetch_*`) fetch larger datasets from the internet and load them into memory.
3. **Generators** (`make_*`) generate controlled synthetic datasets (as required by us).

*Loaders* and *fetchers* return a `Bunch` object by default, and *generators* return a tuple: (feature matrix, label vector (or matrix)).
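As a quick illustration of the two return conventions, the sketch below calls one loader (`load_iris`) and one generator (`make_blobs`) and inspects what comes back. The variable names are chosen here for illustration only.

```python
from sklearn.datasets import load_iris, make_blobs
from sklearn.utils import Bunch

# A loader returns a dictionary-like Bunch by default.
bunch = load_iris()
print(isinstance(bunch, Bunch))

# A generator returns a plain tuple: (feature matrix, labels).
result = make_blobs(n_samples=10, random_state=0)
print(type(result), len(result))
```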

# Loaders

## Loading Iris dataset

In [3]:
from sklearn.datasets import load_iris
iris_data = load_iris()

This returns a `Bunch` object, `iris_data`, which is a dictionary-like object with the following attributes:

- `data`, which has the feature matrix
- `target`, which is a label vector
- `feature_names`, contains the names of the features
- `target_names`, contains the names of the classes
- `DESCR` has the full description of the dataset
- `filename` has the path to the location of data
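Because a `Bunch` is dictionary-like, the same fields can be reached either as attributes or as keys. A minimal sketch (the exact set of keys may differ between `sklearn` versions):

```python
from sklearn.datasets import load_iris

iris_data = load_iris()

# Keys available on the Bunch (the exact set may vary across sklearn versions).
print(list(iris_data.keys()))

# Attribute access and key access refer to the same underlying object.
same_object = iris_data["data"] is iris_data.data
print(same_object)
```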

In [4]:
type(iris_data)

sklearn.utils.Bunch

We can access them one by one and examine their contents. For example, we can access `feature_names` as follows:

In [5]:
iris_data.feature_names
# We can see the names of the features in this dataset.

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

Let's examine the names of the labels.

In [6]:
iris_data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

The feature matrix can be accessed using `iris_data.data`. Let's look at the first five examples in the feature matrix.

In [10]:
iris_data.data[:5]
# We can observe 4 features per example

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

Let us examine the shape of the feature matrix.

In [11]:
iris_data.data.shape
# We can see that there are 150 examples with 4 features each.

(150, 4)

Finally, we examine the label vector and its shape.

In [13]:
iris_data.target
# There are 50 examples each from 3 classes: 0, 1 and 2.

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
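Rather than counting the labels by eye, the class balance can be checked programmatically. The following sketch uses `np.bincount`, which is one of several ways to do this:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_data = load_iris()

# np.bincount counts how often each non-negative integer label occurs.
class_counts = np.bincount(iris_data.target)
for name, count in zip(iris_data.target_names, class_counts):
    print(name, count)
```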

We can read additional documentation about `load_iris` in the following manner:

In [17]:
help(load_iris)

Help on function load_iris in module sklearn.datasets._base:

load_iris(*, return_X_y=False, as_frame=False)
    Load and return the iris dataset (classification).
    
    The iris dataset is a classic and very easy multi-class classification
    dataset.
    
    Classes                          3
    Samples per class               50
    Samples total                  150
    Dimensionality                   4
    Features            real, positive
    
    Read more in the :ref:`User Guide <iris_dataset>`.
    
    Parameters
    ----------
    return_X_y : bool, default=False
        If True, returns ``(data, target)`` instead of a Bunch object. See
        below for more information about the `data` and `target` object.
    
        .. versionadded:: 0.18
    
    as_frame : bool, default=False
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric). The target is
        a pandas DataFrame or Series depending on the number of 

In [108]:
print(iris_data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In this way, we can load and examine different datasets.

We can obtain the feature matrix and the label vector as a tuple from `load_iris` (and from other loaders in general) by setting the `return_X_y` argument to `True`. It is `False` by default, in which case a `Bunch` object is returned.

In [20]:
iris_feature_matrix, iris_label_vector = load_iris(return_X_y= True)

print('Shape of feature matrix:', iris_feature_matrix.shape)
print('Shape of label vector:', iris_label_vector.shape)

Shape of feature matrix: (150, 4)
Shape of label vector: (150,)


To load the Iris dataset as a pandas DataFrame, we set the `as_frame` argument to `True`. Combined with `return_X_y=True`, this returns an `(X, y)` tuple of pandas objects.

In [89]:
X,y = load_iris(as_frame=True, return_X_y=True)

In [90]:
type(X)

pandas.core.frame.DataFrame

In [92]:
X.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
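When `as_frame=True` is used without `return_X_y`, the returned `Bunch` also carries a `frame` attribute that holds the features and the target in a single DataFrame. A small sketch, assuming pandas is installed:

```python
from sklearn.datasets import load_iris

iris_bunch = load_iris(as_frame=True)

# `frame` combines the feature columns with a `target` column.
iris_frame = iris_bunch.frame
print(iris_frame.shape)
print(iris_frame.columns.tolist())
```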


## Loading Diabetes dataset

In [1]:
from sklearn.datasets import load_diabetes
diabetes_data = load_diabetes()

Additional details about this loader can be accessed from the documentation:

In [22]:
help(load_diabetes)

Help on function load_diabetes in module sklearn.datasets._base:

load_diabetes(*, return_X_y=False, as_frame=False)
    Load and return the diabetes dataset (regression).
    
    Samples total    442
    Dimensionality   10
    Features         real, -.2 < x < .2
    Targets          integer 25 - 346
    
    Read more in the :ref:`User Guide <diabetes_dataset>`.
    
    Parameters
    ----------
    return_X_y : bool, default=False.
        If True, returns ``(data, target)`` instead of a Bunch object.
        See below for more information about the `data` and `target` object.
    
        .. versionadded:: 0.18
    
    as_frame : bool, default=False
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric). The target is
        a pandas DataFrame or Series depending on the number of target columns.
        If `return_X_y` is True, then (`data`, `target`) will be pandas
        DataFrames or Series as described below.
    
      

In [109]:
print(diabetes_data.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature va

In [25]:
diabetes_data.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [27]:
# Shape of the feature matrix
diabetes_data.data.shape

(442, 10)

In [29]:
# Shape of label vector
diabetes_data.target.shape

(442,)

In [30]:
# Looking at the first 5 examples from the feature matrix
diabetes_data.data[:5]

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665645,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02269202, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187235,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03199144, -0.04664087]])

In [31]:
diabetes_data.target[:5]

array([151.,  75., 141., 206., 135.])
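The small magnitudes in the feature matrix above are expected: according to the scikit-learn dataset description, each feature column has been mean-centered and scaled so that its sum of squares is 1. The target is left on its original scale. A minimal check of that property, assuming the default loader settings:

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Each column should have (numerically) zero mean ...
column_means = X.mean(axis=0)
# ... and a sum of squares equal to 1.
column_sum_squares = (X ** 2).sum(axis=0)

print(np.allclose(column_means, 0.0))
print(np.allclose(column_sum_squares, 1.0))
```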

## Loading Digits dataset

In [33]:
from sklearn.datasets import load_digits
help(load_digits)

Help on function load_digits in module sklearn.datasets._base:

load_digits(*, n_class=10, return_X_y=False, as_frame=False)
    Load and return the digits dataset (classification).
    
    Each datapoint is a 8x8 image of a digit.
    
    Classes                         10
    Samples per class             ~180
    Samples total                 1797
    Dimensionality                  64
    Features             integers 0-16
    
    Read more in the :ref:`User Guide <digits_dataset>`.
    
    Parameters
    ----------
    n_class : int, default=10
        The number of classes to return. Between 0 and 10.
    
    return_X_y : bool, default=False
        If True, returns ``(data, target)`` instead of a Bunch object.
        See below for more information about the `data` and `target` object.
    
        .. versionadded:: 0.18
    
    as_frame : bool, default=False
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric). The ta

In [34]:
digits_data = load_digits()

In [110]:
print(digits_data.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 1797
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each blo

In [37]:
digits_data.data.shape
# 1797 examples with 64 features each

(1797, 64)

In [38]:
digits_data.data[:5]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.],
       [ 0.,  0.,  0., 12., 13.,  5.,  0.,  0.,  0.,  0.,  0., 11., 16.,
         9.,  0.,  0.,  0.,  0.,  3., 15., 16.,  6.,  0.,  0.,  0.,  7.,
        15., 16., 16.,  2.,  0.,  0.,  0.,  0.,  1., 16., 16.,  3.,  0.,
         0.,  0.,  0.,  1., 16., 16.,  6.,  0.,  0.,  0.,  0.,  1., 16.,
        16.,  6.,  0.,  0.,  0.,  0.,  0., 11., 16., 10.,  0.,  0.],
       [ 0.,  0.,  0.,  4., 15., 12.,  0.,  0.,  0.,  0.,  3., 16., 15.,
        14.,  0.,  0.,  0.,  0.,  8., 13.,  8., 16.,  0.,  0.,  0.,  0.,
         1.,  6., 15., 11.,  0.,  0.,  0.,  1.,  8., 13., 15.,  1.,  0.,
         0.,  0.,  9., 16., 16.,  5.,  0.,  0.,  0.,  0.,  

In [42]:
digits_data.feature_names

['pixel_0_0',
 'pixel_0_1',
 'pixel_0_2',
 'pixel_0_3',
 'pixel_0_4',
 'pixel_0_5',
 'pixel_0_6',
 'pixel_0_7',
 'pixel_1_0',
 'pixel_1_1',
 'pixel_1_2',
 'pixel_1_3',
 'pixel_1_4',
 'pixel_1_5',
 'pixel_1_6',
 'pixel_1_7',
 'pixel_2_0',
 'pixel_2_1',
 'pixel_2_2',
 'pixel_2_3',
 'pixel_2_4',
 'pixel_2_5',
 'pixel_2_6',
 'pixel_2_7',
 'pixel_3_0',
 'pixel_3_1',
 'pixel_3_2',
 'pixel_3_3',
 'pixel_3_4',
 'pixel_3_5',
 'pixel_3_6',
 'pixel_3_7',
 'pixel_4_0',
 'pixel_4_1',
 'pixel_4_2',
 'pixel_4_3',
 'pixel_4_4',
 'pixel_4_5',
 'pixel_4_6',
 'pixel_4_7',
 'pixel_5_0',
 'pixel_5_1',
 'pixel_5_2',
 'pixel_5_3',
 'pixel_5_4',
 'pixel_5_5',
 'pixel_5_6',
 'pixel_5_7',
 'pixel_6_0',
 'pixel_6_1',
 'pixel_6_2',
 'pixel_6_3',
 'pixel_6_4',
 'pixel_6_5',
 'pixel_6_6',
 'pixel_6_7',
 'pixel_7_0',
 'pixel_7_1',
 'pixel_7_2',
 'pixel_7_3',
 'pixel_7_4',
 'pixel_7_5',
 'pixel_7_6',
 'pixel_7_7']

In [39]:
digits_data.target.shape

(1797,)

In [40]:
digits_data.target[:5]

array([0, 1, 2, 3, 4])

In [41]:
digits_data.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
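Each 64-dimensional row of `digits_data.data` is a flattened 8x8 image. The sketch below reshapes the first row back into a grid and compares it with the `images` attribute of the `Bunch`, which stores the same pixels already shaped as 8x8 arrays:

```python
import numpy as np
from sklearn.datasets import load_digits

digits_data = load_digits()

# Reshape the first 64-dimensional feature vector into an 8x8 grid.
first_image = digits_data.data[0].reshape(8, 8)

# The Bunch also exposes the same pixels pre-shaped as `images`.
print(digits_data.images.shape)
print(np.array_equal(first_image, digits_data.images[0]))
```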

Experiment with `load_wine`, `load_breast_cancer` and `load_linnerud` later.

## Loading Breast Cancer dataset

In [2]:
from sklearn.datasets import load_breast_cancer
help(load_breast_cancer)

Help on function load_breast_cancer in module sklearn.datasets._base:

load_breast_cancer(*, return_X_y=False, as_frame=False)
    Load and return the breast cancer wisconsin dataset (classification).
    
    The breast cancer dataset is a classic and very easy binary classification
    dataset.
    
    Classes                          2
    Samples per class    212(M),357(B)
    Samples total                  569
    Dimensionality                  30
    Features            real, positive
    
    Read more in the :ref:`User Guide <breast_cancer_dataset>`.
    
    Parameters
    ----------
    return_X_y : bool, default=False
        If True, returns ``(data, target)`` instead of a Bunch object.
        See below for more information about the `data` and `target` object.
    
        .. versionadded:: 0.18
    
    as_frame : bool, default=False
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric). The target is
        a pand

In [4]:
X,y = load_breast_cancer(as_frame=True, return_X_y=True)

In [5]:
y.value_counts()

1    357
0    212
Name: target, dtype: int64

In [7]:
uniques,counts = np.unique(y, return_counts=True)
print(list(zip(uniques,counts)))

[(0, 212), (1, 357)]


In [106]:
bcancer_data = load_breast_cancer()

In [107]:
bcancer_data.target_names

array(['malignant', 'benign'], dtype='<U9')

So, `'malignant'` corresponds to `0` and `'benign'` corresponds to `1`.
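This mapping can be applied to the whole label vector by indexing `target_names` with the integer labels. A short sketch (the names `label_names` and `name_counts` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bcancer_data = load_breast_cancer()

# Index the array of class names with the integer labels.
label_names = bcancer_data.target_names[bcancer_data.target]

names, counts = np.unique(label_names, return_counts=True)
name_counts = {str(n): int(c) for n, c in zip(names, counts)}
print(name_counts)
```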

In [111]:
print(bcancer_data.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

# Fetchers

## Fetching California Housing dataset

**Step 1:** Import the library and access the documentation

In [43]:
from sklearn.datasets import fetch_california_housing
help(fetch_california_housing)

Help on function fetch_california_housing in module sklearn.datasets._california_housing:

fetch_california_housing(*, data_home=None, download_if_missing=True, return_X_y=False, as_frame=False)
    Load the California housing dataset (regression).
    
    Samples total             20640
    Dimensionality                8
    Features                   real
    Target           real 0.15 - 5.
    
    Read more in the :ref:`User Guide <california_housing_dataset>`.
    
    Parameters
    ----------
    data_home : str, default=None
        Specify another download and cache folder for the datasets. By default
        all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
    
    download_if_missing : bool, default=True
        If False, raise a IOError if the data is not locally available
        instead of trying to download the data from the source site.
    
    
    return_X_y : bool, default=False.
        If True, returns ``(data.data, data.target)`` instead of 

Note that the `fetch_*` functions also return a `Bunch` object by default.

**Step 2:** Load the dataset and obtain a `Bunch` object.

In [45]:
# Call the fetcher and obtain the `Bunch` object.
housing_data = fetch_california_housing()
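The first call downloads the data and caches it on disk, so later calls do not need the network. The `download_if_missing` parameter shown in the help text can be used to insist on the cached copy. The helper below is a hypothetical wrapper written for this illustration, not part of `sklearn`:

```python
from sklearn.datasets import fetch_california_housing


def load_cached_housing():
    """Return the cached dataset, or None if it has not been downloaded yet."""
    try:
        # With download_if_missing=False no download is attempted.
        return fetch_california_housing(download_if_missing=False)
    except OSError:
        return None


cached = load_cached_housing()
print("cached copy found" if cached is not None else "no cached copy")
```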

**Step 3:** Examine the bunch object.

In [112]:
# Description of the dataset.
print(housing_data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [47]:
# Shape of the feature matrix
housing_data.data.shape

(20640, 8)

In [48]:
# Look at the first 5 examples from the feature matrix
housing_data.data[:5]

array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00,
         1.02380952e+00,  3.22000000e+02,  2.55555556e+00,
         3.78800000e+01, -1.22230000e+02],
       [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00,
         9.71880492e-01,  2.40100000e+03,  2.10984183e+00,
         3.78600000e+01, -1.22220000e+02],
       [ 7.25740000e+00,  5.20000000e+01,  8.28813559e+00,
         1.07344633e+00,  4.96000000e+02,  2.80225989e+00,
         3.78500000e+01, -1.22240000e+02],
       [ 5.64310000e+00,  5.20000000e+01,  5.81735160e+00,
         1.07305936e+00,  5.58000000e+02,  2.54794521e+00,
         3.78500000e+01, -1.22250000e+02],
       [ 3.84620000e+00,  5.20000000e+01,  6.28185328e+00,
         1.08108108e+00,  5.65000000e+02,  2.18146718e+00,
         3.78500000e+01, -1.22250000e+02]])

In [50]:
# Name of the features
housing_data.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

In [49]:
# Shape of the label vector
housing_data.target.shape

(20640,)

In [53]:
housing_data.target[:5]

array([4.526, 3.585, 3.521, 3.413, 3.422])

In [51]:
# Name of the label
housing_data.target_names

['MedHouseVal']

## Fetching from `openml`

[openml.org](openml.org) is a public repository for machine learning data and experiments that allows anyone to upload open datasets.

(Many popular datasets, such as the MNIST digits dataset, can be fetched from here.)

Import the library and access the documentation.

In [54]:
from sklearn.datasets import fetch_openml
help(fetch_openml)

Help on function fetch_openml in module sklearn.datasets._openml:

fetch_openml(name: Union[str, NoneType] = None, *, version: Union[str, int] = 'active', data_id: Union[int, NoneType] = None, data_home: Union[str, NoneType] = None, target_column: Union[str, List, NoneType] = 'default-target', cache: bool = True, return_X_y: bool = False, as_frame: Union[str, bool] = 'auto')
    Fetch dataset from openml by name or dataset id.
    
    Datasets are uniquely identified by either an integer ID or by a
    combination of name and version (i.e. there might be multiple
    versions of the 'iris' dataset). Please give either name or data_id
    (not both). In case a name is given, a version can also be
    provided.
    
    Read more in the :ref:`User Guide <openml>`.
    
    .. versionadded:: 0.20
    
    .. note:: EXPERIMENTAL
    
        The API is experimental (particularly the return value structure),
        and might have small backward-incompatible changes without notice
    
   

Note that this API is marked experimental and may change in future releases.

> We use this API to load the MNIST dataset.

In [55]:
X,y = fetch_openml('mnist_784', version=1, return_X_y=True)
print("Feature matrix shape:", X.shape)
print("Label vector shape:", y.shape)

Feature matrix shape: (70000, 784)
Label vector shape: (70000,)


In [57]:
print(type(X))
print(type(y))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [59]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Columns: 784 entries, pixel1 to pixel784
dtypes: float64(784)
memory usage: 418.7 MB


In [60]:
# So, each sample has 784 features, each of which is a pixel
X.columns

Index(['pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5', 'pixel6', 'pixel7',
       'pixel8', 'pixel9', 'pixel10',
       ...
       'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779', 'pixel780',
       'pixel781', 'pixel782', 'pixel783', 'pixel784'],
      dtype='object', length=784)

As an exercise, try `fetch_20newsgroups` and `fetch_kddcup99`.

# Generators

## `make_regression`

This helps us generate datasets for regression.

In [61]:
from sklearn.datasets import make_regression
help(make_regression)

Help on function make_regression in module sklearn.datasets._samples_generator:

make_regression(n_samples=100, n_features=100, *, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)
    Generate a random regression problem.
    
    The input set can either be well conditioned (by default) or have a low
    rank-fat tail singular profile. See :func:`make_low_rank_matrix` for
    more details.
    
    The output is generated by applying a (potentially biased) random linear
    regression model with `n_informative` nonzero regressors to the previously
    generated input and some gaussian centered noise with some adjustable
    scale.
    
    Read more in the :ref:`User Guide <sample_generators>`.
    
    Parameters
    ----------
    n_samples : int, default=100
        The number of samples.
    
    n_features : int, default=100
        The number of features.
    
    n_informative : int, default

### Example 1

Generating 100 samples with 5 features for a single-target regression problem.

In [62]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=1, shuffle=True, random_state=42)

# It is good practice to set `random_state` so that the experiment is reproducible.

Let's look at the shapes of the feature matrix and the label vector.

In [63]:
print(X.shape)
print(y.shape)

(100, 5)
(100,)


In [64]:
X[:5]

array([[-0.93782504,  0.51504769,  0.51503527,  3.85273149,  0.51378595],
       [ 1.0889506 , -0.71530371,  0.06428002,  0.67959775, -1.07774478],
       [-0.60170661, -1.05771093,  1.85227818,  0.82254491, -0.01349722],
       [ 0.8219025 ,  0.09176078,  0.08704707, -1.98756891, -0.29900735],
       [ 1.54993441,  0.81351722, -0.78325329, -1.23086432, -0.32206152]])

In [65]:
y[:5]

array([271.31612081,   6.2305406 ,  11.86102446, -63.94057571,
        49.63008485])
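To see how the labels relate to the features, `make_regression` can also return the coefficients of the linear model it used via `coef=True`. With the default `noise=0.0` and `bias=0.0`, the labels are exactly the linear combination of the features, which the sketch below checks:

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the coefficients of the underlying linear model.
X, y, coef = make_regression(
    n_samples=100, n_features=5, n_targets=1, coef=True, random_state=42
)

print(coef.shape)
# With the defaults noise=0.0 and bias=0.0, y is exactly X @ coef.
print(np.allclose(X @ coef, y))
```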

### Example 2

Generating 100 samples with 5 features for a multi-output regression problem with 5 targets (each label is 5-dimensional).

In [66]:
X,y = make_regression(n_samples=100, n_features=5, n_targets=5, shuffle=True, random_state=42)

In [70]:
print(X.shape)
print(y.shape)
# Since we generated a multi-output target with 5 outputs, `y` has shape `(100, 5)`.

(100, 5)
(100, 5)


In [68]:
X[:5]

array([[ 0.88163976, -1.42225371,  1.68714164, -0.64657288, -1.081548  ],
       [ 1.30547881, -0.2176812 ,  0.81350964,  1.09877685,  0.82541635],
       [ 0.13074058, -1.24778318, -0.44004449,  1.6324113 , -1.43014138],
       [ 0.58685709,  0.79103195, -1.40185106, -0.90938745,  1.40279431],
       [ 0.35701549, -0.20812225,  0.8496021 , -0.49300093, -0.58936476]])

In [69]:
y[:5]

array([[  46.58248846,   26.95176072,  158.49687583, -211.55965424,
         -23.03343135],
       [ 196.63689114,  168.66737136,  209.93926636,  122.35430764,
          65.76629146],
       [ -78.95894162, -102.12190675,  -76.82170915, -212.75380369,
         -76.10553622],
       [  39.05790167,   69.14878192,  -51.17607712,  180.89618908,
          -1.18306497],
       [  43.04031257,   32.06778878,   87.46866782,  -66.78313107,
          14.64502385]])

## `make_classification`

Generate a random *n-class* classification problem, with a single label output.

In [71]:
from sklearn.datasets import make_classification
help(make_classification)

Help on function make_classification in module sklearn.datasets._samples_generator:

make_classification(n_samples=100, n_features=20, *, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
    Generate a random n-class classification problem.
    
    This initially creates clusters of points normally distributed (std=1)
    about vertices of an ``n_informative``-dimensional hypercube with sides of
    length ``2*class_sep`` and assigns an equal number of clusters to each
    class. It introduces interdependence between these features and adds
    various types of further noise to the data.
    
    Without shuffling, ``X`` horizontally stacks features in the following
    order: the primary ``n_informative`` features, followed by ``n_redundant``
    linear combinations of the informative features, followed by ``n_repeated``
    duplicates, dr

### Example 1

Generate 100 samples for a binary classification problem with 10 features.

In [72]:
X,y = make_classification(n_samples=100, n_features=10, n_classes=2, n_clusters_per_class=1, random_state=42)

In [73]:
# Shapes of the feature matrix and the label vector.
print(X.shape)
print(y.shape)

(100, 10)
(100,)


In [74]:
X[:5]

array([[ 0.11422765, -1.71016839, -0.06822216, -0.14928517,  0.30780177,
         0.15030176, -0.05694562, -0.22595246, -0.36361221, -0.13818757],
       [ 0.70775194, -1.57022472, -0.23503183, -0.63604713,  0.62180996,
        -0.56246678,  0.97255445, -0.77719676,  0.63240774, -0.47809669],
       [ 0.63859246,  0.04739867,  0.33273433,  1.1046981 , -0.65183611,
        -1.66152006, -1.2110162 ,  1.09821151, -0.0660798 ,  0.68024225],
       [-0.23894805, -0.97755524,  0.0379061 ,  0.19896733,  0.50091719,
        -0.90756366,  0.75539123,  0.12437227, -0.57677133,  0.07871283],
       [-0.59239392, -0.05023811,  0.17573204, -1.43949185,  0.27045683,
        -0.86399077, -0.83095012,  0.60046915,  0.04852163,  0.32557953]])

In [75]:
# Since this is a binary classification problem, the labels take only two values: 0 and 1.
y[:5]

array([1, 1, 1, 1, 0])

### Example 2

Generate a 3-class classification problem with 100 samples and 10 features.

In [76]:
X,y = make_classification(n_samples=100, n_features=10, n_classes=3, n_clusters_per_class=1, random_state=42)

In [77]:
# Shapes of the feature matrix and the label vector.
print(X.shape)
print(y.shape)

(100, 10)
(100,)


In [78]:
X[:5]

array([[-0.58351628, -1.73833907, -1.37298251, -1.77311485,  0.45918008,
         0.83392215, -1.66096093,  0.20768769, -0.07016571,  0.42961822],
       [-1.0044394 , -1.43862044,  0.47335819, -0.21188291,  0.0125924 ,
         0.22409248, -0.77300978,  0.49799829,  0.0976761 ,  0.02451017],
       [ 0.07740833,  0.19896733,  0.12437227,  0.17738132, -0.97755524,
         0.50091719,  0.75138712,  0.54336019,  0.09933231, -1.66940528],
       [-0.91759569, -0.9609536 ,  1.07746664,  0.4522739 , -0.32138584,
        -0.8254972 , -0.56372455,  0.24368721,  0.41293145, -0.8222204 ],
       [-0.96222828, -0.96090774,  1.21530116,  0.55980482, -1.24778318,
        -0.25256815, -1.43014138,  0.13074058,  1.6324113 , -0.44004449]])

In [79]:
# Since this is a 3-class classification problem, the labels take three values: 0, 1 and 2.
y[:5]

array([2, 0, 1, 0, 0])
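The generator can also produce imbalanced classes through the `weights` parameter listed in the signature above. The sketch below requests roughly a 90/10 split and sets `flip_y=0` so that no labels are randomly reassigned. The specific proportions are an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100,
    n_features=10,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # roughly 90% class 0, 10% class 1
    flip_y=0,            # disable random label flipping
    random_state=42,
)

# Number of samples generated for each class.
class_counts = np.bincount(y)
print(class_counts)
```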

## `make_multilabel_classification`

This function helps us in generating a random multi-label classification problem.

In [80]:
from sklearn.datasets import make_multilabel_classification
help(make_multilabel_classification)

Help on function make_multilabel_classification in module sklearn.datasets._samples_generator:

make_multilabel_classification(n_samples=100, n_features=20, *, n_classes=5, n_labels=2, length=50, allow_unlabeled=True, sparse=False, return_indicator='dense', return_distributions=False, random_state=None)
    Generate a random multilabel classification problem.
    
    For each sample, the generative process is:
        - pick the number of labels: n ~ Poisson(n_labels)
        - n times, choose a class c: c ~ Multinomial(theta)
        - pick the document length: k ~ Poisson(length)
        - k times, choose a word: w ~ Multinomial(theta_c)
    
    In the above process, rejection sampling is used to make sure that
    n is never zero or more than `n_classes`, and that the document length
    is never zero. Likewise, we reject classes which have already been chosen.
    
    Read more in the :ref:`User Guide <sample_generators>`.
    
    Parameters
    ----------
    n_samples : int, 

### Example 1

Generate 100 multilabel classification samples with 10 features, 5 classes and, on average, 2 labels per example.

In [81]:
# `n_labels=2` sets the average number of labels per example: each example can
# belong to several of the 5 classes at once, about 2 of them on average.

X,y = make_multilabel_classification(n_samples=100, n_features=10, n_classes=5, n_labels=2, random_state=42)

In [82]:
# Shapes of the feature matrix and the label vector.
print(X.shape)
print(y.shape)

(100, 10)
(100, 5)


In [83]:
X[:5]

array([[ 1.,  8.,  5.,  2.,  0.,  1.,  1.,  3.,  6.,  6.],
       [ 4.,  5.,  8.,  4.,  3.,  5.,  2.,  0.,  2., 13.],
       [ 7.,  5.,  5.,  1.,  1.,  9.,  3.,  2.,  3.,  8.],
       [10.,  4.,  9.,  5.,  7.,  7.,  2.,  4.,  4.,  0.],
       [ 9., 14.,  5.,  0.,  5.,  0.,  2.,  0.,  2., 10.]])

In [84]:
y[:5]

array([[0, 1, 0, 0, 0],
       [1, 1, 1, 0, 0],
       [0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0]])

What do these labels mean?
Each row is an indicator vector over the 5 classes. For example, `y[0] = array([0, 1, 0, 0, 0])` means that class 0 is absent, class 1 is present, and classes 2, 3 and 4 are absent.
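The indicator rows can be turned into explicit lists of class indices, and the realized average number of labels per example can be computed directly. Since the number of labels is drawn at random for each sample, the realized average is only expected to be near `n_labels`, not equal to it. A minimal sketch:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(
    n_samples=100, n_features=10, n_classes=5, n_labels=2, random_state=42
)

# Column indices of the 1s in each row are the classes present in that sample.
present_classes = [np.flatnonzero(row).tolist() for row in y]
print(present_classes[:5])

# Average number of labels per sample actually realized in this draw.
print(y.sum(axis=1).mean())
```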

## `make_blobs`

This enables us to generate random data for clustering (for unsupervised learning).

In [85]:
from sklearn.datasets import make_blobs
help(make_blobs)

Help on function make_blobs in module sklearn.datasets._samples_generator:

make_blobs(n_samples=100, n_features=2, *, centers=None, cluster_std=1.0, center_box=(-10.0, 10.0), shuffle=True, random_state=None, return_centers=False)
    Generate isotropic Gaussian blobs for clustering.
    
    Read more in the :ref:`User Guide <sample_generators>`.
    
    Parameters
    ----------
    n_samples : int or array-like, default=100
        If int, it is the total number of points equally divided among
        clusters.
        If array-like, each element of the sequence indicates
        the number of samples per cluster.
    
        .. versionchanged:: v0.20
            one can now pass an array-like to the ``n_samples`` parameter
    
    n_features : int, default=2
        The number of features for each sample.
    
    centers : int or ndarray of shape (n_centers, n_features), default=None
        The number of centers to generate, or the fixed center locations.
        If n_samples 

### Example 1

Generate a random dataset of 10 samples with 2 features each for clustering.

In [86]:
X, y = make_blobs(n_samples=10, centers=3, n_features=2, random_state=42)
# Here, `centers` sets the number of clusters from which the samples are drawn.

In [87]:
print(X.shape)
print(y.shape)

(10, 2)
(10,)


In [88]:
# `y` holds the cluster membership of each point. Since `centers=3`, there are 3 clusters: 0, 1 and 2.
y

array([2, 2, 1, 2, 0, 0, 0, 1, 1, 0])
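The `return_centers` parameter shown in the signature above makes `make_blobs` return the cluster centers as a third value, which is handy when comparing against the output of a clustering algorithm. A short sketch with the same settings as the example:

```python
import numpy as np
from sklearn.datasets import make_blobs

# return_centers=True additionally returns the coordinates of the cluster centers.
X, y, centers = make_blobs(
    n_samples=10, centers=3, n_features=2, random_state=42, return_centers=True
)

print(centers.shape)   # one row per cluster, one column per feature
print(np.bincount(y))  # number of points assigned to each cluster
```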

For further examples, refer to this notebook:

[MLP_SWI_Week_1.ipynb](file:///Users/sampadk04/Desktop/Programming/VSCode-Projects/Python/IITM/IITM-MLP/SWI/MLP_SWI_Week_1.ipynb)