# Introduction

The objective of this colab is to demonstrate the sklearn dataset API.

Recall that it provides three kinds of functions:

1. Loaders (load_*) load small datasets bundled with sklearn.
2. Fetchers (fetch_*) fetch large datasets from the internet and load them into memory.
3. Generators (make_*) generate controlled synthetic datasets.

Loaders and fetchers return a Bunch object, while generators return a tuple of feature matrix and label vector (or matrix).
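The difference in return types can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; the variable names and the small sample sizes are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris, make_regression

# A loader returns a Bunch (a dictionary-like container).
bunch = load_iris()
print(type(bunch).__name__)

# A generator returns a plain tuple: (feature matrix, label vector).
generated = make_regression(n_samples=10, n_features=3, random_state=0)
X, y = generated
print(type(generated).__name__, X.shape, y.shape)
```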

# Loaders

## Loading iris dataset

In [None]:
from sklearn.datasets import load_iris
data = load_iris()

This returns a Bunch object, data, which is a dictionary-like object with the following attributes:

* data, which holds the feature matrix.
* target, which holds the label vector.
* feature_names, which contains the names of the features.
* target_names, which contains the names of the classes.
* DESCR, which holds the full description of the dataset.
* filename, which holds the path to the location of the data.
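Because a Bunch is dictionary-like, each attribute can also be read as a key. The sketch below illustrates this under the assumption that scikit-learn is installed; the exact set of keys may vary between versions, so it is printed rather than assumed.

```python
from sklearn.datasets import load_iris

data = load_iris()

# The available keys depend on the installed scikit-learn version.
print(sorted(data.keys()))

# Attribute access and key access return the same underlying object.
same_object = data["target"] is data.target
print(same_object)
```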

In [None]:
type(data)

sklearn.utils.Bunch

We can access them one by one and examine their contents. For example, we can access feature_names as follows:

In [None]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

We can see the names of the features in this dataset.

Let's examine the names of the labels.

In [None]:
data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

The feature matrix can be accessed as data.data. Let's look at the first five examples in the feature matrix.

In [None]:
data.data[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

We can observe 4 features per example.

Let's examine the shape of the feature matrix.

In [None]:
data.data.shape

(150, 4)

There are 150 examples and each example has 4 features.

Finally, we will examine the label vector and its shape.

In [None]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

There are 50 examples each from three classes: 0, 1 and 2.
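Rather than counting by eye, the class counts can be computed. This is a small sketch using NumPy's np.unique with return_counts=True; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# Count how many examples carry each integer label.
class_ids, counts = np.unique(data.target, return_counts=True)
for class_id, count in zip(class_ids, counts):
    # target_names maps the integer label back to the class name.
    print(class_id, data.target_names[class_id], count)
```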

We can read additional documentation about load_iris in the following manner:

In [None]:
?load_iris
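The ?name syntax is specific to IPython-based environments such as Colab and Jupyter. In a plain Python interpreter, the same docstring is available through the built-in help function or the __doc__ attribute. A small sketch, printing only the first few lines:

```python
from sklearn.datasets import load_iris

# help(load_iris) would print the full docstring in an interactive session;
# here we only show its first few lines.
doc_lines = load_iris.__doc__.strip().splitlines()
print("\n".join(doc_lines[:3]))
```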

In this way, we can load and examine different datasets.

We can obtain the feature matrix and the label vector directly from load_iris, and from other loaders in general, by setting the return_X_y argument to True.

In [None]:
feature_matrix, label_vector = load_iris(return_X_y=True)
print('Shape of feature matrix:', feature_matrix.shape)
print('Shape of label vector:', label_vector.shape)

Shape of feature matrix: (150, 4)
Shape of label vector: (150,)


## Loading diabetes dataset

In [None]:
from sklearn.datasets import load_diabetes
data = load_diabetes()

Additional details about this loader can be accessed from the documentation:

In [None]:
?load_diabetes

Examine the bunch object.

Look at the description of the dataset.

In [None]:
data.DESCR

'.. _diabetes_dataset:\n\nDiabetes dataset\n----------------\n\nTen baseline variables, age, sex, body mass index, average blood\npressure, and six blood serum measurements were obtained for each of n =\n442 diabetes patients, as well as the response of interest, a\nquantitative measure of disease progression one year after baseline.\n\n**Data Set Characteristics:**\n\n  :Number of Instances: 442\n\n  :Number of Attributes: First 10 columns are numeric predictive values\n\n  :Target: Column 11 is a quantitative measure of disease progression one year after baseline\n\n  :Attribute Information:\n      - age     age in years\n      - sex\n      - bmi     body mass index\n      - bp      average blood pressure\n      - s1      tc, total serum cholesterol\n      - s2      ldl, low-density lipoproteins\n      - s3      hdl, high-density lipoproteins\n      - s4      tch, total cholesterol / HDL\n      - s5      ltg, possibly log of serum triglycerides level\n      - s6      glu, blood sugar

Find the shape of the feature matrix.

In [None]:
data.data.shape

(442, 10)

Look at the first five examples from the feature matrix.

In [None]:
data.data[:5]

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665645,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02269202, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187235,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03199144, -0.04664087]])

Find the shape of the label vector.

In [None]:
data.target.shape

(442,)

Find the labels of the first five examples.

In [None]:
data.target[:5]

array([151.,  75., 141., 206., 135.])

Find out the names of the features.

In [None]:
data.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

Find names of the class labels.

The target here is a continuous value rather than a categorical one (this is a regression dataset), so there are no distinct class labels to list.
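That the diabetes target is continuous can be checked by looking at its dtype and at how many distinct values it takes. The sketch below also tests whether the Bunch carries a target_names key; the result of that test is printed rather than assumed, since the set of keys can differ between scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()

# A regression target is a float array with many distinct values.
n_unique = len(np.unique(data.target))
print("dtype:", data.target.dtype)
print("distinct target values:", n_unique, "out of", data.target.shape[0])

# Bunch is dictionary-like, so membership tests work on its keys.
print("has target_names:", "target_names" in data)
```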

## Loading digits dataset

In [None]:
from sklearn.datasets import load_digits
data = load_digits()

Additional details about this loader can be accessed from the documentation:

In [None]:
?load_digits

Examine the bunch object.

Look at the description of the dataset.

In [None]:
data.DESCR

".. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n--------------------------------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 1797\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttps://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixel

Find the shape of the feature matrix.

In [None]:
data.data.shape

(1797, 64)

Look at the first five examples from the feature matrix.

In [None]:
data.data[:5]

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
        15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
        12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
         0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
        10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.],
       [ 0.,  0.,  0., 12., 13.,  5.,  0.,  0.,  0.,  0.,  0., 11., 16.,
         9.,  0.,  0.,  0.,  0.,  3., 15., 16.,  6.,  0.,  0.,  0.,  7.,
        15., 16., 16.,  2.,  0.,  0.,  0.,  0.,  1., 16., 16.,  3.,  0.,
         0.,  0.,  0.,  1., 16., 16.,  6.,  0.,  0.,  0.,  0.,  1., 16.,
        16.,  6.,  0.,  0.,  0.,  0.,  0., 11., 16., 10.,  0.,  0.],
       [ 0.,  0.,  0.,  4., 15., 12.,  0.,  0.,  0.,  0.,  3., 16., 15.,
        14.,  0.,  0.,  0.,  0.,  8., 13.,  8., 16.,  0.,  0.,  0.,  0.,
         1.,  6., 15., 11.,  0.,  0.,  0.,  1.,  8., 13., 15.,  1.,  0.,
         0.,  0.,  9., 16., 16.,  5.,  0.,  0.,  0.,  0.,  

Find the shape of the label vector.

In [None]:
data.target.shape

(1797,)

Find the labels of the first five examples.

In [None]:
data.target[:5]

array([0, 1, 2, 3, 4])

Find out the names of the features.

In [None]:
data.feature_names

['pixel_0_0',
 'pixel_0_1',
 'pixel_0_2',
 'pixel_0_3',
 'pixel_0_4',
 'pixel_0_5',
 'pixel_0_6',
 'pixel_0_7',
 'pixel_1_0',
 'pixel_1_1',
 'pixel_1_2',
 'pixel_1_3',
 'pixel_1_4',
 'pixel_1_5',
 'pixel_1_6',
 'pixel_1_7',
 'pixel_2_0',
 'pixel_2_1',
 'pixel_2_2',
 'pixel_2_3',
 'pixel_2_4',
 'pixel_2_5',
 'pixel_2_6',
 'pixel_2_7',
 'pixel_3_0',
 'pixel_3_1',
 'pixel_3_2',
 'pixel_3_3',
 'pixel_3_4',
 'pixel_3_5',
 'pixel_3_6',
 'pixel_3_7',
 'pixel_4_0',
 'pixel_4_1',
 'pixel_4_2',
 'pixel_4_3',
 'pixel_4_4',
 'pixel_4_5',
 'pixel_4_6',
 'pixel_4_7',
 'pixel_5_0',
 'pixel_5_1',
 'pixel_5_2',
 'pixel_5_3',
 'pixel_5_4',
 'pixel_5_5',
 'pixel_5_6',
 'pixel_5_7',
 'pixel_6_0',
 'pixel_6_1',
 'pixel_6_2',
 'pixel_6_3',
 'pixel_6_4',
 'pixel_6_5',
 'pixel_6_6',
 'pixel_6_7',
 'pixel_7_0',
 'pixel_7_1',
 'pixel_7_2',
 'pixel_7_3',
 'pixel_7_4',
 'pixel_7_5',
 'pixel_7_6',
 'pixel_7_7']

Find names of the class labels.

In [None]:
data.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
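The description above says each example is an 8x8 image, and the feature names (pixel_0_0 to pixel_7_7) follow that row and column layout. As a sketch, a row of 64 features can be reshaped back into its 8x8 grid and compared with the images attribute that load_digits also returns.

```python
import numpy as np
from sklearn.datasets import load_digits

data = load_digits()

# Each row of the feature matrix is a flattened 8x8 image.
first_image = data.data[0].reshape(8, 8)
print(first_image)
print("label:", data.target[0])

# The Bunch also exposes the unflattened images directly.
matches = np.array_equal(first_image, data.images[0])
print("matches data.images[0]:", matches)
```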

## Exercise Loaders

load_wine

load_breast_cancer

load_linnerud

For each of the loaders above:

1. Import the loader.
2. Access the documentation.
3. Load the dataset and obtain a Bunch object.
4. Examine the Bunch object (look at the description of the dataset).
5. Find the shape of the feature matrix.
6. Look at the first five examples of the feature matrix.
7. Find the shape of the label vector (or matrix).
8. Look at the first five labels.
9. Find the names of the features.
10. Find the names of the class labels.
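Several of these steps are identical across loaders, so they can be collected in a small function. The sketch below is one possible way to do that; summarize is a hypothetical helper introduced only for illustration and is not part of sklearn.

```python
from sklearn.datasets import load_wine


def summarize(loader):
    """Hypothetical helper: load a bundled dataset and collect basic facts."""
    bunch = loader()
    return {
        "feature_shape": bunch.data.shape,
        "label_shape": bunch.target.shape,
        "feature_names": list(bunch.feature_names),
        # Not every dataset defines target_names, so fall back to an empty list.
        "target_names": list(bunch.get("target_names", [])),
    }


summary = summarize(load_wine)
for key, value in summary.items():
    print(key, value)
```

The same call can be repeated with the other loaders in the exercise to compare their shapes and names.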

# Fetchers

## fetch_california_housing

Import the library and access the documentation

In [None]:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()

Additional details about this fetcher can be accessed from the documentation:

In [None]:
?fetch_california_housing

Examine the bunch object.

Look at the description of the dataset.

In [None]:
data.DESCR

'.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block group\n        - HouseAge      median house age in block group\n        - AveRooms      average number of rooms per household\n        - AveBedrms     average number of bedrooms per household\n        - Population    block group population\n        - AveOccup      average number of household members\n        - Latitude      block group latitude\n        - Longitude     block group longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttps://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n\nThe target variable is the median house value for California districts,\nexpressed in hundreds of thousands of dollars ($100,000

Find the shape of the feature matrix.

In [None]:
data.data.shape

(20640, 8)

Look at the first five examples from the feature matrix.

In [None]:
data.data[:5]

array([[ 8.32520000e+00,  4.10000000e+01,  6.98412698e+00,
         1.02380952e+00,  3.22000000e+02,  2.55555556e+00,
         3.78800000e+01, -1.22230000e+02],
       [ 8.30140000e+00,  2.10000000e+01,  6.23813708e+00,
         9.71880492e-01,  2.40100000e+03,  2.10984183e+00,
         3.78600000e+01, -1.22220000e+02],
       [ 7.25740000e+00,  5.20000000e+01,  8.28813559e+00,
         1.07344633e+00,  4.96000000e+02,  2.80225989e+00,
         3.78500000e+01, -1.22240000e+02],
       [ 5.64310000e+00,  5.20000000e+01,  5.81735160e+00,
         1.07305936e+00,  5.58000000e+02,  2.54794521e+00,
         3.78500000e+01, -1.22250000e+02],
       [ 3.84620000e+00,  5.20000000e+01,  6.28185328e+00,
         1.08108108e+00,  5.65000000e+02,  2.18146718e+00,
         3.78500000e+01, -1.22250000e+02]])

Find the shape of the label vector.

In [None]:
data.target.shape

(20640,)

Find the labels of the first five examples.

In [None]:
data.target[:5]

array([4.526, 3.585, 3.521, 3.413, 3.422])

Find out the names of the features.

In [None]:
data.feature_names

['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']

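Find the name of the target. This is a regression dataset, so target_names holds the name of the target variable rather than class labels.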

In [None]:
data.target_names

['MedHouseVal']

## fetch_openml

Import the library and access the documentation

In [1]:
from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=41214)

Additional details about this fetcher can be accessed from the documentation:

In [None]:
?fetch_openml

Examine the bunch object.

Look at the description of the dataset.

In [2]:
data.DESCR

'The dataset freMTPL2freq contains risk features for 677,991 motor third-part liability policies (observed mostly on one year).\n\nDownloaded from openml.org.'

Find the shape of the feature matrix.

In [3]:
data.data.shape

(678013, 12)

Look at the first five examples from the feature matrix.

In [4]:
data.data[:5]

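| | IDpol | ClaimNb | Exposure | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.1 | D | 5.0 | 0.0 | 55.0 | 50.0 | B12 | Regular | 1217.0 | R82 |
| 1 | 3.0 | 1.0 | 0.77 | D | 5.0 | 0.0 | 55.0 | 50.0 | B12 | Regular | 1217.0 | R82 |
| 2 | 5.0 | 1.0 | 0.75 | B | 6.0 | 2.0 | 52.0 | 50.0 | B12 | Diesel | 54.0 | R22 |
| 3 | 10.0 | 1.0 | 0.09 | B | 7.0 | 0.0 | 46.0 | 50.0 | B12 | Diesel | 76.0 | R72 |
| 4 | 11.0 | 1.0 | 0.84 | B | 7.0 | 0.0 | 46.0 | 50.0 | B12 | Diesel | 76.0 | R72 |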


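Examine the label vector. The cell below shows no output, which suggests that data.target is None, i.e. that no default target is defined for this dataset.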

In [11]:
data.target

Find out the names of the features.

In [13]:
data.feature_names

['IDpol',
 'ClaimNb',
 'Exposure',
 'Area',
 'VehPower',
 'VehAge',
 'DrivAge',
 'BonusMalus',
 'VehBrand',
 'VehGas',
 'Density',
 'Region']

## Exercise Fetchers

fetch_20newsgroups

fetch_kddcup99

# Generators

## make_regression

In [14]:
from sklearn.datasets import make_regression
#?make_regression

Example 1

Let's generate 100 samples with 5 features for a single-target regression problem.

In [15]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=1, shuffle=True, random_state=42)

It is good practice to set the seed (random_state) to ensure reproducibility.
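The effect of the seed can be verified with a short experiment. In this sketch, two calls with the same random_state are compared with a third call that uses a different one; the seed value 0 and the variable names are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression

# Two calls with the same seed, one call with a different seed.
X1, y1 = make_regression(n_samples=100, n_features=5, random_state=42)
X2, y2 = make_regression(n_samples=100, n_features=5, random_state=42)
X3, y3 = make_regression(n_samples=100, n_features=5, random_state=0)

same_seed = np.array_equal(X1, X2) and np.array_equal(y1, y2)
different_seed = not np.array_equal(X1, X3)
print("same seed gives identical data:", same_seed)
print("different seed gives different data:", different_seed)
```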

Let's look at the shape of feature matrix and label vector.

In [16]:
X.shape

(100, 5)

In [17]:
y.shape

(100,)

Example 2

Let's generate 100 samples with 5 features for a multi-output regression problem with 5 targets.

In [18]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=5, shuffle=True, random_state=42)

Let's look at the shape of the feature matrix and the label matrix.

In [19]:
X.shape

(100, 5)

In [20]:
y.shape

(100, 5)
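Because the data is synthetic, the linear model that produced it can also be returned. The following sketch uses the coef, noise and bias parameters of make_regression to check that, without noise, the labels are exactly a linear combination of the features. It is an illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the coefficients of the underlying linear model.
X, y, coef = make_regression(
    n_samples=100, n_features=5, n_targets=1,
    noise=0.0, bias=0.0, coef=True, random_state=42,
)
print(coef.shape)

# With zero noise and zero bias, y is reproduced by X @ coef.
reconstructs = np.allclose(y, X @ coef)
print("y == X @ coef:", reconstructs)
```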

## make_classification

Generate a random n-class classification problem.

In [21]:
from sklearn.datasets import make_classification
#?make_classification

Let's generate a binary classification problem with 100 samples. Since n_features is not specified, the default of 20 features is used.

In [30]:
X, y = make_classification(n_samples=100, n_classes=2, n_clusters_per_class=1, random_state=42)

In [31]:
X.shape

(100, 20)

In [32]:
y.shape

(100,)
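The call above does not pass n_features, which is why the feature matrix has 20 columns. A brief sketch comparing the default with an explicit n_features=10 (an illustrative value):

```python
from sklearn.datasets import make_classification

# Without n_features, the default number of features is used.
X_default, _ = make_classification(
    n_samples=100, n_classes=2, n_clusters_per_class=1, random_state=42
)

# Passing n_features explicitly controls the number of columns.
X_ten, y_ten = make_classification(
    n_samples=100, n_features=10, n_classes=2,
    n_clusters_per_class=1, random_state=42,
)
print(X_default.shape, X_ten.shape)
```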

Look at a few examples and their labels.

In [33]:
X[:5]

array([[ 1.53273891, -0.35904111,  0.0976761 ,  0.50481307,  0.49799829,
         0.51934651, -0.82723094,  0.0125924 , -0.10876015,  0.99815117,
         0.40171172, -0.28865864,  1.68674524,  0.32271856, -0.77300978,
         0.02451017, -0.40122047,  0.22409248,  0.69014399, -0.01851314],
       [ 0.64719594,  1.1968407 ,  1.68392769,  0.37371252, -0.03850847,
         0.39922312, -1.66858407,  1.0470983 , -0.48318646,  0.76328692,
         1.57398676, -1.02279257,  1.23390668, -0.25737654, -0.45888426,
         1.07868083, -1.46437488,  0.22445182, -1.22576566,  0.61351797],
       [-0.65160035, -1.02646717, -0.79252074,  0.29736414,  0.86575519,
         0.04557184,  0.21645859,  0.85243333,  2.14394409,  0.57439745,
         0.63391902,  0.67959775,  1.00183089, -0.73036663, -0.11473644,
         0.50498728,  0.18645431, -0.66178646, -2.02514259, -0.71530371],
       [ 1.10870358, -0.73168763, -0.16711808,  0.36819047, -0.81693567,
         0.59065483, -0.35929209,  0.64870989,  

In [34]:
y[:5]

array([0, 1, 0, 0, 0])

Let's generate a three-class classification problem with 100 samples and 10 features.

In [35]:
X,y = make_classification(n_samples=100, n_features=10, n_classes=3, n_clusters_per_class=1, random_state=42)

In [36]:
X.shape

(100, 10)

In [37]:
y.shape

(100,)

Look at a few examples and their labels.

In [38]:
X[:5]

array([[-0.58351628, -1.73833907, -1.37298251, -1.77311485,  0.45918008,
         0.83392215, -1.66096093,  0.20768769, -0.07016571,  0.42961822],
       [-1.0044394 , -1.43862044,  0.47335819, -0.21188291,  0.0125924 ,
         0.22409248, -0.77300978,  0.49799829,  0.0976761 ,  0.02451017],
       [ 0.07740833,  0.19896733,  0.12437227,  0.17738132, -0.97755524,
         0.50091719,  0.75138712,  0.54336019,  0.09933231, -1.66940528],
       [-0.91759569, -0.9609536 ,  1.07746664,  0.4522739 , -0.32138584,
        -0.8254972 , -0.56372455,  0.24368721,  0.41293145, -0.8222204 ],
       [-0.96222828, -0.96090774,  1.21530116,  0.55980482, -1.24778318,
        -0.25256815, -1.43014138,  0.13074058,  1.6324113 , -0.44004449]])

In [39]:
y[:5]

array([2, 0, 1, 0, 0])
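The first five labels say little about the overall class distribution. As a sketch, np.bincount can be used to count how many of the 100 samples fall into each of the three classes; the exact counts are printed rather than stated here.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=100, n_features=10, n_classes=3,
    n_clusters_per_class=1, random_state=42,
)

# Number of samples per class label (0, 1, 2).
class_counts = np.bincount(y, minlength=3)
print(class_counts)
```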

## make_multilabel_classification

Generate a random multilabel classification problem.

In [40]:
from sklearn.datasets import make_multilabel_classification
#?make_multilabel_classification

Let's generate a multilabel classification problem with 20 features, 100 samples and 5 labels, with 2 labels per example on average.

In [41]:
X, y = make_multilabel_classification(n_samples=100, n_features=20, n_classes=5, n_labels=2)

In [42]:
X.shape

(100, 20)

In [43]:
y.shape

(100, 5)

Look at a few examples and their labels.

In [44]:
X[:5]

array([[4., 1., 2., 1., 4., 3., 2., 0., 1., 3., 1., 1., 4., 2., 1., 2.,
        3., 2., 2., 2.],
       [0., 3., 2., 2., 2., 1., 2., 1., 2., 2., 2., 6., 2., 1., 2., 2.,
        1., 4., 1., 0.],
       [0., 2., 2., 1., 2., 4., 0., 3., 2., 0., 5., 2., 5., 2., 3., 1.,
        2., 4., 4., 2.],
       [2., 2., 5., 2., 1., 1., 2., 0., 2., 5., 1., 3., 1., 4., 2., 5.,
        3., 4., 1., 2.],
       [3., 3., 4., 5., 7., 0., 3., 1., 3., 1., 3., 5., 1., 0., 2., 4.,
        5., 1., 2., 1.]])

In [45]:
y[:5]

array([[0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 0, 0, 0]])
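Each row of y is a binary indicator vector, so summing along a row gives the number of labels assigned to that example. The sketch below computes this; random_state=0 is added here only to make the illustration repeatable and is not part of the original call.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(
    n_samples=100, n_features=20, n_classes=5, n_labels=2, random_state=0
)

# Row sums of the indicator matrix give the label count per example.
labels_per_example = y.sum(axis=1)
print("mean labels per example:", labels_per_example.mean())
print("examples with no label:", int((labels_per_example == 0).sum()))
```

n_labels sets only the average, so individual examples can have more or fewer labels, including none, as in the rows of zeros in the output above.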

## make_blobs

Generate random data for clustering.

In [46]:
from sklearn.datasets import make_blobs
#?make_blobs

Let's generate a random dataset of 10 samples with 2 features each, drawn around 3 cluster centers.

In [47]:
X, y = make_blobs(n_samples=10, centers=3, n_features=2, random_state=42)

In [48]:
X.shape

(10, 2)

In [49]:
y.shape

(10,)

We can find the cluster membership of each point in y.

In [50]:
y

array([2, 2, 1, 2, 0, 0, 0, 1, 1, 0])
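The cluster sizes and the centers used to generate the points can also be inspected. This sketch repeats the call with return_centers=True, a parameter of make_blobs that additionally returns the center coordinates, and counts the members of each cluster.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y, centers = make_blobs(
    n_samples=10, centers=3, n_features=2,
    random_state=42, return_centers=True,
)

# Number of points assigned to each of the 3 clusters.
cluster_sizes = np.bincount(y, minlength=3)
print(cluster_sizes)

# One row of coordinates per cluster center.
print(centers.shape)
```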