# [SKlearn](https://scikit-learn.org)

In [2]:
from goldilox.datasets import load_iris

df, features, target = load_iris()
df.head(2)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0


## Option 1
We build a sklearn pipeline, use *from_sklearn*, and run fit.

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import sklearn.pipeline
from goldilox import Pipeline

sklearn_pipeline = sklearn.pipeline.Pipeline([('standar', StandardScaler()),
                                              ('classifier', LogisticRegression())])
pipeline = Pipeline.from_sklearn(sklearn_pipeline).fit(df[features], df['target'])

pipeline.inference(pipeline.raw)



Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,prediction
0,5.1,3.5,1.4,0.2,0


This pipeline does not handle missing values. Let's fix that.
* The goldilox Imputer is just a wrapper around ColumnTransformer and SimpleImputer(strategy='mean').

In [6]:
from goldilox.sklearn.transformers import Imputer

sklearn_pipeline = sklearn.pipeline.Pipeline([('imputer', Imputer(features=features)),
                                              ('standar', StandardScaler()),
                                              ('classifier', LogisticRegression())])
pipeline = Pipeline.from_sklearn(sklearn_pipeline).fit(df[features], df['target'])
pipeline.inference(pipeline.raw)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,prediction
0,5.1,3.5,1.4,0.2,0
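
To make the missing-value point concrete, here is a minimal sketch in plain scikit-learn. It assumes the goldilox Imputer behaves like the ColumnTransformer plus SimpleImputer(strategy='mean') combination described above; that equivalence is taken from the description, not from the goldilox source. It also uses scikit-learn's own `load_iris`, whose column names differ from the goldilox dataset, and `max_iter=1000` is added here only to avoid convergence warnings.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SklearnPipeline
from sklearn.preprocessing import StandardScaler

# Load iris as a DataFrame (scikit-learn's copy, not the goldilox one).
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
feature_names = list(X.columns)

# One row with a missing value in the first feature.
row = X.iloc[[0]].copy()
row.iloc[0, 0] = np.nan

# Without an imputer, the classifier rejects NaN at prediction time.
no_imputer = SklearnPipeline([
    ("standard", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X, y)
try:
    no_imputer.predict(row)
    failed_without_imputer = False
except ValueError:
    failed_without_imputer = True

# Assumed stand-in for Imputer(features=...): mean-impute the listed columns.
imputer = ColumnTransformer(
    [("imputer", SimpleImputer(strategy="mean"), feature_names)]
)
with_imputer = SklearnPipeline([
    ("imputer", imputer),
    ("standard", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# The missing value is replaced by the training mean before scaling.
prediction = with_imputer.predict(row)
```

The imputer goes first so that every later step only ever sees complete rows.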


## Option 2
We build a sklearn pipeline, train it, and then call *from_sklearn*, providing a raw example.

In [7]:
sklearn_pipeline = sklearn_pipeline.fit(df[features], df['target'])
raw = Pipeline.to_raw(df[features])
pipeline = Pipeline.from_sklearn(sklearn_pipeline, raw=raw)
pipeline.inference(pipeline.raw)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,prediction
0,5.1,3.5,1.4,0.2,0


## Numpy transformations
How do we deal with transformers that return numpy arrays?

By default, when a transformer returns the same number of columns as the input features, the output columns are renamed back to the feature names.

In [14]:
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline.from_sklearn(StandardScaler(),
        features=features,
        output_columns=features
    ).fit(df[features])
pipeline.inference(pipeline.raw)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,-0.900681,1.019004,-1.340227,-1.315444


This doesn't always make sense, as in the case of PCA:

In [15]:
from sklearn.decomposition import PCA

pipeline = Pipeline.from_sklearn(PCA()).fit(df[features])
pipeline.inference(pipeline.raw)



Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,-2.684126,0.319397,-0.027915,-0.002262


Instead, we can use *output_columns* to fix that.

In [18]:
pipeline = Pipeline.from_sklearn(PCA(), output_columns=[f"pca{i}" for i in range(4)]).fit(df[features])
pipeline.inference(pipeline.raw)



Unnamed: 0,pca0,pca1,pca2,pca3
0,-2.684126,0.319397,-0.027915,-0.002262
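
The same naming can be sketched without goldilox. The example below is a hedged illustration in plain scikit-learn and pandas; it derives the number of names from the fitted `n_components_` attribute instead of hard-coding 4, which is a choice made here and not something the original cell does.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris(as_frame=True).data

# Principal components are mixtures of all features,
# so the original feature names no longer describe them.
pca = PCA().fit(X)
components = pca.transform(X)

# One name per fitted component.
columns = [f"pca{i}" for i in range(pca.n_components_)]
components_df = pd.DataFrame(components, columns=columns, index=X.index)
```

Deriving the names from the fitted model keeps them in sync if `n_components` is later changed.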


In real life, in most cases you can just pass an entire *sklearn.pipeline.Pipeline(steps)*, which will take care of most of your problems. This is just in case.

# [Serve](https://docs.goldilox.io/reference/api-reference/cli/serve)

In [None]:
print(f"Saved to: {pipeline.save('pipeline.pkl')}")
print("Check out the docs: http://127.0.0.1:8000/docs\n")
!glx serve pipeline.pkl