# Getting Started

A walkthrough for practising the main features of `scikit-learn`.

## Fitting and predicting: estimator basics

Estimators are scikit-learn's built-in machine learning algorithms and models; each is fitted to data with its `fit` method.

In [1]:
# simple example of using RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)

X = [[1, 2, 3], [11, 12, 13]]  # sample matrix (n_samples, n_features)
y = [0, 1]  # target values
clf.fit(X, y)

RandomForestClassifier(random_state=0)

In [2]:
# using the fitted estimator to predict target values of training data
clf.predict(X)

array([0, 1])

In [3]:
# predict target values of new data
clf.predict([[4, 5, 6], [14, 15, 16]])

array([0, 1])
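
The same `fit` and `predict` calls work across estimators. As a minimal sketch (the choice of `LogisticRegression`, the NumPy arrays, and the variable names are assumptions made here for illustration, not part of the example above), a different classifier can be fitted on the same toy data and then inspected through its learned attributes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# same toy data as above, stored as NumPy arrays
X_toy = np.array([[1, 2, 3], [11, 12, 13]])  # shape (n_samples, n_features)
y_toy = np.array([0, 1])                     # one target value per sample

# assumption: LogisticRegression stands in for any other classifier
log_reg = LogisticRegression().fit(X_toy, y_toy)

predictions = log_reg.predict(X_toy)          # one predicted label per sample
probabilities = log_reg.predict_proba(X_toy)  # shape (n_samples, n_classes)

# attributes ending in "_" are learned during fit
print(log_reg.classes_, predictions.shape, probabilities.shape)
```

Attributes with a trailing underscore, such as `classes_`, only exist after `fit` has been called.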

## Transformers and pre-processors

A typical machine learning pipeline has two stages:
1. preprocessing: transforms or imputes the data
2. predicting: predicts the target value

Pre-processors, transformers, and estimators all inherit from the `BaseEstimator` class:
- pre-processors and transformers have no `predict` method; they expose a `transform` method instead
- `ColumnTransformer` applies different transformations to different features

In [None]:
# An example of using StandardScaler
from sklearn.preprocessing import StandardScaler

X = [[0, 15], [1, -10]]
StandardScaler().fit(X).transform(X)
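
To illustrate `ColumnTransformer`, here is a small sketch that standardises one column and min-max scales another. The toy array, the step names, and the choice of `MinMaxScaler` for the second column are assumptions made here for illustration:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# hypothetical toy data with two numeric features
X_demo = np.array([[0.0, 15.0], [1.0, -10.0], [2.0, 5.0]])

ct = ColumnTransformer(
    [
        ("standardize", StandardScaler(), [0]),  # column 0: zero mean, unit variance
        ("minmax", MinMaxScaler(), [1]),         # column 1: rescaled to [0, 1]
    ]
)

# output columns follow the order of the transformers listed above
X_out = ct.fit_transform(X_demo)
print(X_out)
```

Each entry is a `(name, transformer, columns)` triple; columns not listed in any entry are dropped by default.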

## Pipelines: chaining pre-processors and estimators
- a pipeline offers the same API (e.g. `fit` and `predict`) as a regular estimator
- using a pipeline helps prevent data leakage, i.e. information from the test data leaking into the training data

In [None]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

In [2]:
from sklearn.preprocessing import StandardScaler

# create a pipeline object
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# load the iris dataset and split into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit the whole pipeline
pipe.fit(X_train, y_train)

# predict on the test set and compute the accuracy score (y_true first, then y_pred)
accuracy_score(y_test, pipe.predict(X_test))

0.9736842105263158
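
One way to see why a pipeline guards against data leakage is to inspect the fitted scaler inside it. The sketch below rebuilds the same pipeline and checks that the scaler's statistics were computed from the training split only. The variable name `scaler` is introduced here for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
scaler = pipe.named_steps["standardscaler"]

# the scaler's per-feature means come from the training split only
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
```

When `pipe.predict(X_test)` is called, the test data is transformed with these training-set statistics, so nothing about the test set influences the fitted model.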

## Model evaluation

A model needs to be evaluated to see whether it predicts well on unseen data.
- cross-validation is one tool for model evaluation
- scikit-learn provides a `cross_validate` helper, which performs 5-fold cross-validation by default
- it is also possible to iterate over the folds manually, use different data splitting strategies, and use custom scoring functions

In [3]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, random_state=0)
lr = LinearRegression()

result = cross_validate(lr, X, y)
result

{'fit_time': array([0.02423596, 0.02805591, 0.02402782, 0.0249002 , 0.02508903]),
 'score_time': array([0.00064802, 0.00030494, 0.0005331 , 0.00036502, 0.0002718 ]),
 'test_score': array([1., 1., 1., 1., 1.])}
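
The other options mentioned above can be sketched as follows: a different splitting strategy is passed through `cv`, and several metrics through `scoring`. The added noise, the 3-fold shuffled split, and the chosen metrics are assumptions made here for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# assumption: noise=10.0 is added so the scores are not trivially perfect
X_cv, y_cv = make_regression(n_samples=1000, noise=10.0, random_state=0)

# 3 shuffled folds instead of the default 5 unshuffled folds
cv = KFold(n_splits=3, shuffle=True, random_state=0)

cv_result = cross_validate(
    LinearRegression(),
    X_cv,
    y_cv,
    cv=cv,
    scoring=["r2", "neg_mean_absolute_error"],
)

# with several metrics, the result holds one "test_<metric>" array per metric
print(sorted(cv_result))
print(cv_result["test_r2"])
```

Error metrics are exposed in negated form (`neg_mean_absolute_error`) so that a higher score is always better.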