# Re-training
You will often want to retrain your model when new data arrives. Here is how.

# Sklearn
With sklearn, the *fit* function simply re-runs the default sklearn.pipeline.Pipeline.fit on the new data.
* It accepts either a Vaex dataframe or a Pandas dataframe as input.


In [5]:
from goldilox.datasets import load_iris

df, features, target = load_iris()
df.head(2)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0


## Naturally
Here we provide the data as X and y.
* The pipeline assumes all columns of X are features and y is the target.
* The pipeline takes the first row as the raw example.

In [17]:
import sklearn.pipeline
from goldilox.sklearn.transformers import Imputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from goldilox import Pipeline



sklearn_pipeline = sklearn.pipeline.Pipeline([('imputer', Imputer(features=features)),
                                              ('standar', StandardScaler()),
                                              ('classifier', LogisticRegression())])

pipeline = Pipeline.from_sklearn(sklearn_pipeline)
X = df[features]
y = df[target]

pipeline.fit(X, y)

<goldilox.sklearn.pipeline.SklearnPipeline at 0x15a212bd0>

## DataFrame
Data very often comes as a single dataframe. In that case, pass the feature and target column names, and the X, y split is handled for you.

In [18]:
sklearn_pipeline = sklearn.pipeline.Pipeline([('imputer', Imputer(features=features)),
                                              ('standar', StandardScaler()),
                                              ('classifier', LogisticRegression())])


pipeline = Pipeline.from_sklearn(sklearn_pipeline, 
                                 features=features, 
                                 target=target).fit(df)

<goldilox.sklearn.pipeline.SklearnPipeline at 0x15b1ed250>

In both cases, the pipeline is trained in place and also returns itself, which makes for prettier code.
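
The two calling styles and the return-itself behaviour can be illustrated with a toy estimator. This is only a minimal sketch in plain Python, not goldilox's implementation: `MeanClassifier` and its `features`/`target` arguments are hypothetical names chosen for illustration, and rows are plain dictionaries rather than a dataframe.

```python
class MeanClassifier:
    """Toy model (hypothetical): predicts the class with the nearest feature mean."""

    def __init__(self, features=None, target=None):
        self.features = features
        self.target = target
        self.centroids = {}

    def fit(self, X, y=None):
        if y is None:
            # Single-table style: split rows into features and target by name.
            y = [row[self.target] for row in X]
            X = [{f: row[f] for f in self.features} for row in X]
        elif self.features is None:
            # X, y style: assume every column of X is a feature.
            self.features = list(X[0])
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(self.features))
            for i, f in enumerate(self.features):
                acc[i] += row[f]
            counts[label] = counts.get(label, 0) + 1
        # Refitting replaces the previously learned state in place.
        self.centroids = {k: [s / counts[k] for s in v] for k, v in sums.items()}
        return self  # returning self allows Model(...).fit(data) chaining

    def predict(self, row):
        def dist(c):
            return sum((row[f] - c[i]) ** 2 for i, f in enumerate(self.features))
        return min(self.centroids, key=lambda k: dist(self.centroids[k]))


rows = [
    {"sepal_length": 5.1, "petal_length": 1.4, "target": 0},
    {"sepal_length": 4.9, "petal_length": 1.4, "target": 0},
    {"sepal_length": 6.3, "petal_length": 4.9, "target": 1},
    {"sepal_length": 6.5, "petal_length": 4.6, "target": 1},
]
features = ["sepal_length", "petal_length"]

# Single-table style, chained because fit returns the model itself.
model = MeanClassifier(features=features, target="target").fit(rows)

# X, y style on the same data ends up with the same fitted state.
X = [{f: r[f] for f in features} for r in rows]
y = [r["target"] for r in rows]
other = MeanClassifier().fit(X, y)
```

Calling `model.fit(new_rows)` again with fresh data would overwrite `centroids` and hand back the same object, which is the behaviour described for the pipeline.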

# Vaex

For Vaex, we need to define the fit function ourselves, as there is no trivial way to know how to fit.   
This is a very flexible way to do practically anything.

The fit function should receive a dataframe and return a dataframe, which *from_vaex* will then run on.

* If you want to save a variable to the pipeline during fit, add it to the dataframe.
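
Conceptually, the pipeline keeps a reference to your fit function and calls it again whenever new data arrives. The sketch below illustrates that contract in plain Python. `TinyPipeline` and `TinyFrame` are hypothetical stand-ins invented for this example; they are not how goldilox or Vaex are implemented.

```python
class TinyFrame:
    """Hypothetical stand-in for a dataframe: rows plus a dict of saved variables."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.variables = {}


class TinyPipeline:
    """Hypothetical pipeline that re-runs a user-supplied fit function."""

    def __init__(self, fit):
        self._fit = fit
        self.variables = {}

    def fit(self, frame):
        # The fit function receives a frame and returns a frame;
        # the pipeline rebuilds its state from the returned frame.
        result = self._fit(frame)
        self.variables = dict(result.variables)
        return self


def fit(frame):
    # Any training logic can go here; this toy version stores a mean.
    values = [row["sepal_length"] for row in frame.rows]
    frame.variables["mean_sepal_length"] = sum(values) / len(values)
    return frame


pipeline = TinyPipeline(fit=fit)
pipeline.fit(TinyFrame([{"sepal_length": 5.0}, {"sepal_length": 7.0}]))
first = pipeline.variables["mean_sepal_length"]

# Re-training on new data just calls the same function again.
pipeline.fit(TinyFrame([{"sepal_length": 4.0}, {"sepal_length": 4.0}]))
second = pipeline.variables["mean_sepal_length"]
```

Because the fit function is ordinary Python, whatever it attaches to the returned frame travels with the pipeline after each re-fit.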

In [19]:
import vaex

df = vaex.from_pandas(df)
df.head(2)

#,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0


In [34]:
def fit(df):
    from vaex.ml.sklearn import Predictor
    from xgboost.sklearn import XGBClassifier
    from sklearn.metrics import accuracy_score

    model = Predictor(model=XGBClassifier(use_label_encoder=False,eval_metric="mlogloss"), 
                      features=features, 
                      target=target)
    train, test = df.ml.train_test_split()
    model.fit(train)    

    # save model evaluation as a variable
    accuracy = accuracy_score(model.predict(test), test[target].values)
    
    # train on the entire data for the best model in production
    model.fit(df)
    model.transform(df)

    df.variables['xgb_accuracy'] = accuracy
    
    # return df -> Pipeline.from_vaex(df) on the results
    return df

pipeline = Pipeline.from_vaex(df, fit=fit)

pipeline.fit(df)
pipeline.inference(pipeline.raw)



#,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0


There isn't much you can't do this way.    
Although goldilox is aimed at productionizing pipelines, this makes re-fitting on new data a non-issue in most cases.