# The Core Idea of Scikit

Remember that our modeling pipeline will consist of three major areas before we get to the creation of the actual model:

- Data Preprocessing
- Data Visualization (let's not care about this for now, as it usually is not part of the core pipeline)
- Preparing Data for Machine Learning (so-called Feature Engineering)

Within *Data Preprocessing*, things are simple. Once we define that all whitespace should be stripped from a column "city_name", we can easily replicate this operation whenever new data on which we should predict comes into our pipeline. We simply...you might have guessed it...strip all of the whitespace from that column. **Within preprocessing, we usually do not need to save any state to replicate the operation.**
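As a minimal sketch of such a stateless step, the snippet below applies the same function to training data and to new data. It assumes the data lives in a pandas DataFrame and that "stripping" means removing leading and trailing whitespace; the helper name `strip_city_name` and the sample city names are made up for illustration.

```python
import pandas as pd


def strip_city_name(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with leading/trailing whitespace removed from city_name."""
    out = df.copy()
    out["city_name"] = out["city_name"].str.strip()
    return out


# Hypothetical training data and new data arriving later
train = pd.DataFrame({"city_name": [" Prague", "Brno  "]})
new = pd.DataFrame({"city_name": ["  Ostrava "]})

# The very same function works on both; nothing learned from `train` is needed
train_clean = strip_city_name(train)
new_clean = strip_city_name(new)

print(train_clean["city_name"].tolist())
print(new_clean["city_name"].tolist())
```

Nothing about `train` has to be remembered in order to clean `new`, which is why no state needs to be saved here.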

Things are a bit trickier within *Preparing Data for ML*. The operations here are usually **not replicable unless we save certain state of the operation**. For example, if we have decided to impute the median in place of missing values in our training data, we want to impute the same value for all new data that comes in. We cannot calculate the median again on the new data; we need to reuse the one computed from the training data. Each of these operations has a different state to be saved. Luckily for us, we do not need to implement the saving of these states ourselves. We just use scikit-learn, which really simplifies this job.
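The median example could look roughly like this with scikit-learn's `SimpleImputer`. This is only a sketch on made-up numbers: the arrays `X_train` and `X_new` are assumptions for illustration, not data from this course.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training column with one missing value
X_train = np.array([[1.0], [3.0], [np.nan], [10.0]])
# Hypothetical new data that arrives after training
X_new = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)  # learns the median of the non-missing training values

# The saved state: one median per column
print(imputer.statistics_)

# The training median is reused; it is not recomputed from X_new
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```

The missing value in `X_new` is filled with the median learned from `X_train`, regardless of what the values in `X_new` look like.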

# Fitting and predicting: estimator basics
If you think about fitting a Machine Learning model, it is a form of transformation, right? We are transforming all independent features into a single target feature. In order to replicate this operation, we need to save the state, which in this case is the model itself. This is where the cool idea of saving the state originated...

Here is a simple example where we fit a RandomForestClassifier to some very basic data. In the cell below we only fit the model. Hence, clf contains nothing but the model, or in other words the desired state.

In [1]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator. Do not get confused by the fact that we are using X again: we did not alter this data before, we only fitted the model (saved the state).

In [3]:
clf.predict(X)  # predict classes of the training data

clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])
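The saved state can also be inspected directly. As a small sketch, scikit-learn stores what it learns during `fit` in attributes whose names end with an underscore; the snippet below refits the same toy classifier and looks at two of them.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[1, 2, 3],      # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]           # classes of each sample

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)

# Learned state lives in attributes with a trailing underscore
print(clf.classes_)          # class labels seen during fit
print(len(clf.estimators_))  # number of fitted trees, equal to clf.n_estimators

# Predicting only reads this state; it does not refit anything
predictions = clf.predict([[4, 5, 6], [14, 15, 16]])
print(predictions)
```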

# Transformers and pre-processors

The same idea of saving the state is also implemented for other functionality that scikit-learn offers. Do not get confused here: scikit-learn refers to "Preparing Data for ML" as "Preprocessing".

Take a look below. We decided to transform our data using a standard scaler. Again, at first we only fit it.

In [5]:
from sklearn.preprocessing import StandardScaler
X = [[0, 15],
     [1, -10]]
fitted_scaler = StandardScaler().fit(X)
fitted_scaler

StandardScaler()

Now we can use the transformer (saved state) to replicate the operation.

In [6]:
fitted_scaler.transform(X)

array([[-1.,  1.],
       [ 1., -1.]])
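To see why the saved state matters, the sketch below applies the same fitted scaler to data it has never seen. The row in `X_new` is a made-up example; `mean_` and `scale_` are the attributes in which `StandardScaler` keeps the per-column mean and standard deviation learned during `fit`.

```python
from sklearn.preprocessing import StandardScaler

X = [[0, 15],
     [1, -10]]
fitted_scaler = StandardScaler().fit(X)

# The saved state: per-column mean and standard deviation of the training data
print(fitted_scaler.mean_)
print(fitted_scaler.scale_)

# Hypothetical new sample, scaled with the training statistics
X_new = [[2, 40]]
X_new_scaled = fitted_scaler.transform(X_new)
print(X_new_scaled)
```

Each new value is transformed as `(value - mean_) / scale_` using the training statistics, so new data ends up on the same scale the model was trained on.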

https://scikit-learn.org/stable/getting_started.html