# How to Save and Reuse Data Preparation Objects in Scikit-Learn

Author: Jason Brownlee

Article from [machinelearningmastery](https://machinelearningmastery.com/how-to-save-and-load-models-and-data-preparation-in-scikit-learn-for-later-use/).

> Note: This notebook works through the article linked above. The code may differ slightly from the original.

# Libraries

In [17]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from pickle import dump
from pickle import load

# How to save and later use a data preparation object

## 1 - Define a dataset

### Create a synthetic two-class dataset

In [2]:
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)

### Split data into train and test sets

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

### Summarize the scale of each input variable

In [4]:
for i in range(X_test.shape[1]):
    print('>%d, train: min=%.3f, max=%.3f, test: min=%.3f, max=%.3f' % (i, X_train[:, i].min(), X_train[:, i].max(), X_test[:, i].min(), X_test[:, i].max()))

>0, train: min=-11.856, max=0.526, test: min=-11.270, max=0.085
>1, train: min=-6.388, max=6.507, test: min=-5.581, max=5.926


## 2 - Scale the dataset

### Define scaler

In [7]:
scaler = MinMaxScaler()

### Fit scaler on the training dataset

In [8]:
scaler.fit(X_train)

### Transform both datasets

In [9]:
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

### Summarize the scale of each input variable

In [11]:
for i in range(X_test.shape[1]):
    print('>%d, train: min=%.3f, max=%.3f, test: min=%.3f, max=%.3f' % (i, X_train_scaled[:, i].min(), X_train_scaled[:, i].max(), X_test_scaled[:, i].min(), X_test_scaled[:, i].max()))

>0, train: min=0.000, max=1.000, test: min=0.047, max=0.964
>1, train: min=0.000, max=1.000, test: min=0.063, max=0.955
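
The test-set values above do not reach exactly 0 and 1 because the scaler reuses the minimum and maximum learned from the training set. As a minimal sketch of that behaviour, the block below recomputes the scaling by hand and compares it with `scaler.transform`. It regenerates the same data so it runs on its own; the names `train_min`, `train_max`, and `manual_test_scaled` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Recreate the dataset and split used in this notebook
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Fit the scaler on the training data only
scaler = MinMaxScaler()
scaler.fit(X_train)

# Per-feature statistics taken from the training set (illustrative names)
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)

# Min-max scaling by hand with the default feature range of (0, 1)
manual_test_scaled = (X_test - train_min) / (train_max - train_min)

# The manual result should agree with the fitted scaler
matches = np.allclose(manual_test_scaled, scaler.transform(X_test))
print('Manual scaling matches scaler.transform:', matches)
```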


## 3 - Save model and data scaler

### Define and fit the model

In [14]:
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_scaled, y_train)

### Save the model

In [15]:
with open('pickle-objects/how_to_save_and_reuse_data_preparation_objects_in_scikit-learn-model.pkl', 'wb') as f:
    dump(model, f)

### Save the scaler

In [16]:
with open('pickle-objects/how_to_save_and_reuse_data_preparation_objects_in_scikit-learn-scaler.pkl', 'wb') as f:
    dump(scaler, f)
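
Saving the model and the scaler as two files means they have to be kept in sync. One possible alternative, which is not the approach taken in the article, is to wrap both steps in a scikit-learn `Pipeline` and pickle that single object. The sketch below assumes this variation and serializes in memory with `dumps`/`loads` so that no file path is needed; the step names `'scaler'` and `'model'` are arbitrary.

```python
from pickle import dumps, loads

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Recreate the dataset and split used in this notebook
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Chain the scaler and the model; fit() fits the scaler on X_train only
pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression(solver='lbfgs')),
])
pipeline.fit(X_train, y_train)

# Serialize the whole pipeline to bytes and restore it
payload = dumps(pipeline)
restored = loads(payload)

# The restored pipeline scales and predicts on raw inputs in one call
same_predictions = bool((restored.predict(X_test) == pipeline.predict(X_test)).all())
print('Restored pipeline gives the same predictions:', same_predictions)
```

With this layout the raw test data is passed straight to `predict`, so there is no separate scaler to forget to load.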

## 4 - Load model and data scaler

### Load the model

In [18]:
with open('pickle-objects/how_to_save_and_reuse_data_preparation_objects_in_scikit-learn-model.pkl', 'rb') as f:
    model = load(f)

### Load the scaler

In [19]:
with open('pickle-objects/how_to_save_and_reuse_data_preparation_objects_in_scikit-learn-scaler.pkl', 'rb') as f:
    scaler = load(f)
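
A quick way to confirm that a loaded scaler carries the same fitted state is to compare its `data_min_` and `data_max_` attributes with those of the original. The sketch below does a full save and load round trip in a temporary directory, so it does not depend on the `pickle-objects` folder existing; the file name `scaler.pkl` and the variable `loaded_scaler` are assumptions made for the example.

```python
import os
import tempfile
from pickle import dump, load

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Recreate the dataset and split used in this notebook
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

scaler = MinMaxScaler().fit(X_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output folder first; open() does not create missing folders
    out_dir = os.path.join(tmp_dir, 'pickle-objects')
    os.makedirs(out_dir, exist_ok=True)
    scaler_path = os.path.join(out_dir, 'scaler.pkl')  # hypothetical file name

    # Save, then load back from disk
    with open(scaler_path, 'wb') as f:
        dump(scaler, f)
    with open(scaler_path, 'rb') as f:
        loaded_scaler = load(f)

# The fitted statistics and the transformed output should be unchanged
same_min = np.allclose(loaded_scaler.data_min_, scaler.data_min_)
same_max = np.allclose(loaded_scaler.data_max_, scaler.data_max_)
same_output = np.allclose(loaded_scaler.transform(X_test), scaler.transform(X_test))
print('Same min:', same_min, '| same max:', same_max, '| same output:', same_output)
```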

### Check scale of the test set before scaling

In [20]:
print('Raw test set range')
for i in range(X_test.shape[1]):
    print('>%d, min=%.3f, max=%.3f' % (i, X_test[:, i].min(), X_test[:, i].max()))

Raw test set range
>0, min=-11.270, max=0.085
>1, min=-5.581, max=5.926


### Transform the test dataset

In [21]:
X_test_scaled = scaler.transform(X_test)
print('Scaled test set range')
for i in range(X_test_scaled.shape[1]):
    print('>%d, min=%.3f, max=%.3f' % (i, X_test_scaled[:, i].min(), X_test_scaled[:, i].max()))

Scaled test set range
>0, min=0.047, max=0.964
>1, min=0.063, max=0.955
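
The scaled ranges above match the ones obtained before saving, which is the point of reusing the fitted scaler. For contrast, the sketch below shows what would happen if a new scaler were fitted on the test data instead: the test set would be stretched to exactly [0, 1] and would no longer be on the scale the model was trained on. Here `saved_scaler` stands in for the scaler loaded from disk and `refit_scaler` is a hypothetical, incorrect alternative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Recreate the dataset and split used in this notebook
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Stand-in for the scaler loaded from disk: fitted on the training data
saved_scaler = MinMaxScaler().fit(X_train)

# What NOT to do: fit a fresh scaler on the data being predicted
refit_scaler = MinMaxScaler().fit(X_test)

scaled_with_saved = saved_scaler.transform(X_test)
scaled_with_refit = refit_scaler.transform(X_test)

# The two scalings disagree because they use different min/max values
differs = not np.allclose(scaled_with_saved, scaled_with_refit)
print('Refit scaling differs from saved scaling:', differs)
for i in range(X_test.shape[1]):
    print('>%d, saved: min=%.3f, max=%.3f, refit: min=%.3f, max=%.3f' % (
        i,
        scaled_with_saved[:, i].min(), scaled_with_saved[:, i].max(),
        scaled_with_refit[:, i].min(), scaled_with_refit[:, i].max()))
```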


### Make predictions on the test set

In [22]:
yhat = model.predict(X_test_scaled)

### Evaluate accuracy

In [23]:
acc = accuracy_score(y_test, yhat)
print('Test Accuracy:', acc)

Test Accuracy: 1.0
