https://scikit-learn.org/stable/datasets/toy_dataset.html

# 7.1. Toy datasets

scikit-learn comes with a few small standard datasets that do not require downloading any file from an external website.

They can be loaded using the following functions:

`load_boston(*[, return_X_y])`
DEPRECATED: load_boston is deprecated in 1.0 and will be removed in 1.2.

`load_iris(*[, return_X_y, as_frame])`
Load and return the iris dataset (classification).

`load_diabetes(*[, return_X_y, as_frame])`
Load and return the diabetes dataset (regression).

`load_digits(*[, n_class, return_X_y, as_frame])`
Load and return the digits dataset (classification).

`load_linnerud(*[, return_X_y, as_frame])`
Load and return the physical exercise Linnerud dataset.

`load_wine(*[, return_X_y, as_frame])`
Load and return the wine dataset (classification).

`load_breast_cancer(*[, return_X_y, as_frame])`
Load and return the breast cancer wisconsin dataset (classification).

These datasets are useful for quickly illustrating the behavior of the various algorithms implemented in scikit-learn. However, they are often too small to be representative of real-world machine learning tasks.
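The `return_X_y` and `as_frame` parameters in the signatures above change what a loader returns. A minimal sketch using `load_iris` (the variable names are illustrative, and `as_frame` assumes a scikit-learn version that supports it):

```python
from sklearn.datasets import load_iris

# Default call: a dict-like Bunch holding data, target and metadata.
iris_bunch = load_iris()

# return_X_y=True returns only the (data, target) pair as NumPy arrays.
X, y = load_iris(return_X_y=True)

# as_frame=True exposes pandas objects; .frame combines features and target.
iris_frame = load_iris(as_frame=True).frame

print(type(iris_bunch).__name__)
print(X.shape, y.shape)
print(iris_frame.shape)
```

With `as_frame=True` the column names come from the dataset's own `feature_names`, so they need not be typed by hand.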

## Import

In [4]:
# import modules
import pandas as pd
import sklearn.datasets

In [2]:
# import data
# note: load_boston is deprecated in 1.0 and removed in 1.2, so this call needs scikit-learn < 1.2
boston = sklearn.datasets.load_boston()
iris = sklearn.datasets.load_iris()
diabetes = sklearn.datasets.load_diabetes()
digits = sklearn.datasets.load_digits()
linnerud = sklearn.datasets.load_linnerud()
wine = sklearn.datasets.load_wine()
breast_cancer = sklearn.datasets.load_breast_cancer()
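
Each loader above returns a dict-like `Bunch` object. Before transforming it, it can help to check what it contains. A short sketch using the wine dataset (chosen here only as an example; the other loaders expose similar attributes):

```python
import sklearn.datasets

wine = sklearn.datasets.load_wine()

# A Bunch behaves like a dict whose keys are also available as attributes.
print(sorted(wine.keys()))

# Feature matrix, feature names and class labels.
print(wine.data.shape)
print(wine.feature_names[:3])
print(wine.target_names)

# DESCR holds the plain-text dataset description.
print(wine.DESCR[:200])
```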

## Transform

### 'boston'

Data Set Characteristics: 

Number of Instances: 506

Number of Attributes:

- 13 numeric/categorical predictive

- Median Value (attribute 14) is usually the target

Attribute Information (in order)

- CRIM per capita crime rate by town

- ZN proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS proportion of non-retail business acres per town

- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- NOX nitric oxides concentration (parts per 10 million)

- RM average number of rooms per dwelling

- AGE proportion of owner-occupied units built prior to 1940

- DIS weighted distances to five Boston employment centres

- RAD index of accessibility to radial highways

- TAX full-value property-tax rate per $10,000

- PTRATIO pupil-teacher ratio by town

- B 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town

- LSTAT % lower status of the population

- MEDV Median value of owner-occupied homes in $1000’s

In [10]:
boston_df = pd.DataFrame(boston.data)
boston_df.columns = [
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'
]
boston_df['MEDV'] = boston.target
boston_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB


In [11]:
boston_df.to_csv('boston_skl.csv', index=False)
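
The same transform pattern (features into a DataFrame, target as a final column) can be written once and reused. The sketch below uses a hypothetical helper, `bunch_to_frame`, which is not part of scikit-learn, and takes the column names from the Bunch's `feature_names` instead of typing them by hand. It is shown on the breast cancer dataset because `load_boston` is not available in recent scikit-learn versions.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer


def bunch_to_frame(bunch, target_name="target"):
    """Hypothetical helper: build a DataFrame from a Bunch's data and target."""
    df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    df[target_name] = bunch.target
    return df


breast_cancer_df = bunch_to_frame(load_breast_cancer())
print(breast_cancer_df.shape)
print(breast_cancer_df.columns[:3].tolist())
```

The resulting frame can be written out with `to_csv(..., index=False)` in the same way as `boston_df` above.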

### 'iris'

Data Set Characteristics:

Number of Instances: 150 (50 in each of three classes)

Number of Attributes: 4 numeric, predictive attributes and the class

Attribute Information

- `sepal length` in cm

- `sepal width` in cm

- `petal length` in cm

- `petal width` in cm

- `class`:
    + Iris-Setosa
    + Iris-Versicolour
    + Iris-Virginica

In [12]:
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_df['class'] = iris.target
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


In [13]:
iris_df.to_csv('iris_skl.csv', index=False)
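
The `class` column above stores integer codes rather than the species names. If readable labels are wanted in the exported file, one possible approach is to index the Bunch's `target_names` with the integer codes. The `class_name` column is an illustrative addition, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

iris_df = pd.DataFrame(
    iris.data,
    columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
)
iris_df['class'] = iris.target

# target_names is a NumPy array, so indexing it with the integer codes
# yields one label per row.
iris_df['class_name'] = iris.target_names[iris.target]

print(iris_df['class_name'].value_counts())
```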

### 'diabetes'

Data Set Characteristics:

Number of Instances: 442

Number of Attributes: First 10 columns are numeric predictive values

Target: Column 11 is a quantitative measure of disease progression one year after baseline

Attribute Information

- age: age in years

- sex

- bmi: body mass index

- bp: average blood pressure

- s1: tc, total serum cholesterol

- s2: ldl, low-density lipoproteins

- s3: hdl, high-density lipoproteins

- s4: tch, total cholesterol / HDL

- s5: ltg, possibly log of serum triglycerides level

- s6: glu, blood sugar level
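
As a sketch of how the attributes listed above appear in practice, the diabetes data can be loaded directly as a DataFrame with `as_frame=True`. The dataset description shipped with scikit-learn (not quoted above) states that the ten features are mean-centered and scaled so that each column's sum of squares is 1. The check below assumes the loader's default settings.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# .frame holds the ten feature columns plus a final 'target' column.
diabetes_df = load_diabetes(as_frame=True).frame
print(diabetes_df.columns.tolist())

# Check the documented scaling: column means near 0, sums of squares near 1.
feature_cols = diabetes_df.columns.drop('target')
col_means = diabetes_df[feature_cols].mean()
col_sq_sums = (diabetes_df[feature_cols] ** 2).sum()

print(np.allclose(col_means, 0.0, atol=1e-3))
print(np.allclose(col_sq_sums, 1.0, atol=1e-2))
```

Because of this scaling, values such as `age` or `bmi` in the frame are not in their original units.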