<img src="../../../images/banners/sklearn.png" width="500"/>

<a class="anchor" id="getting_started"></a>
# <img src="../../../images/logos/sklearn.png" width="40"/> Getting Started

## <img src="../../../images/logos/toc.png" width="20"/> Table of Contents 
* [Getting Started](#getting_started)
    * [Fitting and predicting: estimator basics](#fitting_and_predicting:_estimator_basics)
    * [Transformers and pre-processors](#transformers_and_pre-processors)
    * [Pipelines: chaining pre-processors and estimators](#pipelines:_chaining_pre-processors_and_estimators)
    * [Model evaluation](#model_evaluation)
    * [Automatic parameter searches](#automatic_parameter_searches)
    * [Next steps](#next_steps)

---

The purpose of this guide is to illustrate some of the main features that
`scikit-learn` provides. It assumes a very basic working knowledge of
machine learning practices (model fitting, predicting, cross-validation,
etc.).

<a class="anchor" id="fitting_and_predicting:_estimator_basics"></a>
## Fitting and predicting: estimator basics

`Scikit-learn` provides dozens of built-in machine learning algorithms and
models, called [estimators](https://scikit-learn.org/stable/glossary.html#term-estimators). Each estimator can be fitted to some data
using its [fit](https://scikit-learn.org/stable/glossary.html#term-fit) method.

Here is a simple example where we fit a
[`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier "sklearn.ensemble.RandomForestClassifier") to some very basic data:

In [1]:
from sklearn.ensemble import RandomForestClassifier

In [2]:
clf = RandomForestClassifier(random_state=0)

In [3]:
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]

In [4]:
y = [0, 1]  # classes of each sample

In [5]:
clf.fit(X, y)

RandomForestClassifier(random_state=0)

The [fit](https://scikit-learn.org/stable/glossary.html#term-fit) method generally accepts 2 inputs:

- The samples matrix (or design matrix) [X](https://scikit-learn.org/stable/glossary.html#term-X). The shape of `X` is typically `(n_samples, n_features)`, which means that samples are represented as rows and features are represented as columns.

- The target values [y](https://scikit-learn.org/stable/glossary.html#term-y), which are real numbers for regression tasks, or integers for classification (or any other discrete set of values). For unsupervised learning tasks, `y` does not need to be specified. `y` is usually a 1d array where the `i`-th entry corresponds to the target of the `i`-th sample (row) of `X`.

Both `X` and `y` are usually expected to be numpy arrays or equivalent
[array-like](https://scikit-learn.org/stable/glossary.html#term-array-like) data types, though some estimators work with other
formats such as sparse matrices.
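
As a minimal sketch of these input conventions, the snippet below builds `X` and `y` as numpy arrays, reads `n_samples` and `n_features` off the shape of `X`, and fits the same kind of classifier on a SciPy sparse copy of the data. The choice of `csr_matrix` is an assumption made for illustration; which sparse formats an estimator accepts depends on the estimator.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# X has one row per sample and one column per feature
X = np.array([[1, 2, 3],
              [11, 12, 13]])
y = np.array([0, 1])  # one target per row of X

n_samples, n_features = X.shape
print(n_samples, n_features)

# some estimators, such as random forests, also accept sparse input
X_sparse = sparse.csr_matrix(X)
clf = RandomForestClassifier(random_state=0).fit(X_sparse, y)
sparse_pred = clf.predict(X_sparse)
```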

Once the estimator is fitted, it can be used for predicting target values of
new data. You don’t need to re-train the estimator:

In [6]:
clf.predict(X)  # predict classes of the training data

array([0, 1])

In [7]:
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])

<a class="anchor" id="transformers_and_pre-processors"></a>
## Transformers and pre-processors

Machine learning workflows are often composed of different parts. A typical
pipeline consists of a pre-processing step that transforms or imputes the
data, and a final predictor that predicts target values.

In `scikit-learn`, pre-processors and transformers follow the same API as
the estimator objects (they actually all inherit from the same
`BaseEstimator` class). The transformer objects don’t have a
[predict](https://scikit-learn.org/stable/glossary.html#term-predict) method but rather a [transform](https://scikit-learn.org/stable/glossary.html#term-transform) method that outputs a
newly transformed sample matrix `X`:

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
X = [[0, 15],
     [1, -10]]

In [10]:
# scale data according to computed scaling values
StandardScaler().fit(X).transform(X)

array([[-1.,  1.],
       [ 1., -1.]])
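
To make the "computed scaling values" concrete, here is a small sketch that fits a `StandardScaler` on the same toy data, inspects the per-feature mean and standard deviation it learned, and reuses them on a new sample. The new sample `X_new` is a made-up value chosen to sit exactly at the learned means.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 15],
              [1, -10]])

# fit() learns one mean and one standard deviation per feature (column)
scaler = StandardScaler().fit(X)
print(scaler.mean_)   # per-feature means
print(scaler.scale_)  # per-feature standard deviations

# transform() reuses the learned statistics; it does not refit on new data
X_new = np.array([[0.5, 2.5]])  # hypothetical new sample
X_new_scaled = scaler.transform(X_new)
```

Because the statistics are stored on the fitted object, new data is scaled with the values learned from the original data rather than with its own mean and variance.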

Sometimes, you want to apply different transformations to different features:
the [ColumnTransformer](https://scikit-learn.org/stable/modules/compose.html#column-transformer) is designed for these
use cases.
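
The following is a minimal sketch of that idea, not a recommended pre-processing recipe: it standardizes the first column and min-max scales the second. The step names `"standard"` and `"minmax"` and the choice of `MinMaxScaler` are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[0, 15],
              [1, -10]])

# each entry is (name, transformer, columns the transformer applies to)
ct = ColumnTransformer([
    ("standard", StandardScaler(), [0]),  # column 0: zero mean, unit variance
    ("minmax", MinMaxScaler(), [1]),      # column 1: rescaled to [0, 1]
])

# output columns follow the order of the transformers listed above
X_t = ct.fit_transform(X)
```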

<a class="anchor" id="pipelines:_chaining_pre-processors_and_estimators"></a>
## Pipelines: chaining pre-processors and estimators

Transformers and estimators (predictors) can be combined together into a
single unifying object: a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline "sklearn.pipeline.Pipeline"). The pipeline
offers the same API as a regular estimator: it can be fitted and used for
prediction with `fit` and `predict`. As we will see later, using a
pipeline also protects you from data leakage, i.e. disclosing some
testing data in your training data.

In the following example, we [load the Iris dataset](https://scikit-learn.org/stable/datasets.html#datasets), split it
into train and test sets, and compute the accuracy score of a pipeline on
the test data:

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [12]:
# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)

In [13]:
# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [15]:
# fit the whole pipeline
pipe.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])

In [16]:
# we can now use it like any other estimator
accuracy_score(y_test, pipe.predict(X_test))  # signature is (y_true, y_pred)

0.9736842105263158
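
To illustrate what the pipeline does internally, the sketch below repeats the same steps by hand: the scaler is fitted on the training data only, and the test data is transformed with those training statistics before prediction. The variable names `manual_pred` and `pipe_pred` are introduced here for the comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# pipeline version: scaling and classification in one object
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# manual version: the scaler only ever sees the training data when fitting
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

print(np.array_equal(pipe_pred, manual_pred))
```

The manual version is easy to get wrong (for example by fitting the scaler on all of `X`), which is one reason to prefer the pipeline.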

<a class="anchor" id="model_evaluation"></a>
## Model evaluation

Fitting a model to some data does not entail that it will predict well on
unseen data. This needs to be directly evaluated. We have just seen the
[`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split "sklearn.model_selection.train_test_split") helper that splits a
dataset into train and test sets, but `scikit-learn` provides many other
tools for model evaluation, in particular for [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

Here we briefly show how to perform 5-fold cross-validation
using the [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate "sklearn.model_selection.cross_validate") helper. Note that
it is also possible to manually iterate over the folds, use different
data splitting strategies, and use custom scoring functions. Please refer to
our [User Guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) for more details:

In [17]:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

In [18]:
X, y = make_regression(n_samples=1000, random_state=0)

In [19]:
lr = LinearRegression()

In [20]:
result = cross_validate(lr, X, y)  # defaults to 5-fold CV

In [21]:
result['test_score']  # r_squared score is high because dataset is easy

array([1., 1., 1., 1., 1.])
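
The alternatives mentioned above can be sketched as follows, under a few assumptions: a shuffled `KFold` stands in for "a different splitting strategy", two built-in scorer names (`"r2"` and `"neg_mean_absolute_error"`) stand in for alternative scoring, and the loop at the end iterates over the folds by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=1000, random_state=0)

# an explicit splitter instead of the default 5-fold strategy
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# several metrics at once; results are stored under 'test_<scorer name>'
result = cross_validate(
    LinearRegression(), X, y, cv=cv,
    scoring=["r2", "neg_mean_absolute_error"],
)

# the same procedure, iterating over the folds manually
manual_scores = []
for train_idx, test_idx in cv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 on the held-out fold
```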

<a class="anchor" id="automatic_parameter_searches"></a>
## Automatic parameter searches

All estimators have parameters (often called hyper-parameters in the
literature) that can be tuned. The generalization power of an estimator
often critically depends on a few parameters. For example, a
[`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor "sklearn.ensemble.RandomForestRegressor") has an `n_estimators`
parameter that determines the number of trees in the forest, and a
`max_depth` parameter that determines the maximum depth of each tree.
Quite often, it is not clear what the exact values of these parameters
should be since they depend on the data at hand.
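
As a small illustration, parameters are passed to the constructor and can be read back or changed later with `get_params` and `set_params`. The specific values used here (50 trees, depths 5 and 8) are arbitrary and chosen only for the example.

```python
from sklearn.ensemble import RandomForestRegressor

# parameters are set when the estimator is created
reg = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)

# get_params() returns the current parameters as a dict
params = reg.get_params()
print(params["n_estimators"], params["max_depth"])

# set_params() changes parameters on an existing estimator
reg.set_params(max_depth=8)
```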

`Scikit-learn` provides tools to automatically find the best parameter
combinations (via cross-validation). In the following example, we randomly
search over the parameter space of a random forest with a
[`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV "sklearn.model_selection.RandomizedSearchCV") object. When the search
is over, the [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV "sklearn.model_selection.RandomizedSearchCV") behaves as
a [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor "sklearn.ensemble.RandomForestRegressor") that has been fitted with
the best set of parameters. Read more in the [User Guide](https://scikit-learn.org/stable/modules/grid_search.html#grid-search):

In [22]:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint

In [23]:
X, y = fetch_california_housing(return_X_y=True)

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [25]:
# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1, 5), 'max_depth': randint(5, 10)}

In [26]:
# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    n_iter=5,
    param_distributions=param_distributions,
    random_state=0
)

In [27]:
search.fit(X_train, y_train)

RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
                   param_distributions={'max_depth': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f8310222cd0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f83500f2be0>},
                   random_state=0)

In [28]:
search.best_params_

{'max_depth': 9, 'n_estimators': 4}

In [29]:
# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
search.score(X_test, y_test)

0.735363411343253
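
After fitting, the search object also exposes the refitted model and the per-candidate results. The sketch below shows this on a small synthetic dataset from `make_regression`, swapped in here (an assumption, not the data used above) so that the example runs without downloading the California housing data.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in data, not the California housing dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": randint(1, 5), "max_depth": randint(5, 10)},
    n_iter=5,
    random_state=0,
)
search.fit(X, y)

best = search.best_estimator_                        # forest refitted with the best parameters
mean_scores = search.cv_results_["mean_test_score"]  # one mean CV score per sampled candidate
print(search.best_params_, search.best_score_)
```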

> **Note:**
In practice, you almost always want to [search over a pipeline](https://scikit-learn.org/stable/modules/grid_search.html#composite-grid-search), instead of a single estimator. One of the main
reasons is that if you apply a pre-processing step to the whole dataset
without using a pipeline, and then perform any kind of cross-validation,
you would be breaking the fundamental assumption of independence between
training and testing data. Indeed, since you pre-processed the data
using the whole dataset, some information about the test sets is
available to the train sets. This will lead to over-estimating the
generalization power of the estimator (you can read more in this [Kaggle
post](https://www.kaggle.com/alexisbcook/data-leakage)).

Using a pipeline for cross-validation and searching will largely protect
you from this common pitfall.
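
A minimal sketch of searching over a pipeline is shown below. It uses `GridSearchCV` rather than the randomized search above, and the grid of `C` values and `max_iter=1000` are assumptions made for illustration. Parameters of a pipeline step are addressed as `<step name>__<parameter>`, where `make_pipeline` names each step after its lowercased class name.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# '<step name>__<parameter>' targets a parameter of one pipeline step
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

# within each CV split, the scaler is refitted on that split's training part only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```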

<a class="anchor" id="next_steps"></a>
## Next steps

We have briefly covered estimator fitting and predicting, pre-processing
steps, pipelines, cross-validation tools and automatic hyper-parameter
searches. This guide should give you an overview of some of the main
features of the library, but there is much more to `scikit-learn`!

Please refer to the [User Guide](https://scikit-learn.org/stable/user_guide.html#user-guide) for details on all the tools that we
provide. You can also find an exhaustive list of the public API in the
[API Reference](https://scikit-learn.org/stable/modules/classes.html#api-ref).

You can also look at our numerous [examples](https://scikit-learn.org/stable/auto_examples/index.html#general-examples) that
illustrate the use of `scikit-learn` in many different contexts.

The [tutorials](https://scikit-learn.org/stable/tutorial/index.html#tutorial-menu) also contain additional learning
resources.