<a href="https://colab.research.google.com/github/tincorpai/tincorpai-Data-Preparation-and-Cleaning-in-Machine-Learning-/blob/master/Data_Preparation_With_Train_and_Test_Sets_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Preparation With Train and Test Sets

Evaluate a logistic regression model using train and test sets on a synthetic binary classification dataset whose input variables have been normalized.

> The first step is to define the synthetic dataset. Use the make_classification() function to create a dataset with 1000 rows of data and 20 numerical input features.

In [None]:
# test classification dataset
from sklearn.datasets import make_classification

#define dataset
X, y = make_classification(n_samples = 1000, n_features = 20, n_informative = 15, n_redundant=5, random_state=7)

In [None]:
#Summarize the dataset
print(X.shape, y.shape)

(1000, 20) (1000,)


We now evaluate the model on a scaled dataset, first using the naive and incorrect approach.

## Train-Test Evaluation With Naive Data Preparation

This approach involves applying the data preparation technique to the entire dataset, before it is split into training and test sets. This results in data leakage, because the statistics used by the transform are computed from rows that later end up in the test set.

> We can scale all features in the dataset to the range 0-1 with MinMaxScaler, using the fit_transform() function to fit the transform on the dataset and apply it in a single step. The result is a normalized version of the input variables, where each column in the array is normalized separately.
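
As a minimal sketch of what fit_transform() does here, the toy array below (hypothetical values, not part of the tutorial dataset) is rescaled column by column so that each column's minimum maps to 0 and its maximum maps to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# small illustrative array (hypothetical values): 3 rows, 2 columns
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 40.0]])

# fit the scaler and apply it in a single step
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)

# each column is rescaled independently as (x - min) / (max - min)
print(scaled)
print(scaled.min(axis=0), scaled.max(axis=0))
```

The middle row shows why the columns are treated separately: 2.0 sits halfway through the first column's range, while 20.0 sits only a third of the way through the second column's range.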

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
# define the scaler and normalize the entire dataset (naive approach)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

Next, split the already-scaled dataset into train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state = 1)

In [None]:
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

We can then make a prediction using the input data from the test set, and we can compare the predictions to the expected values and calculate a classification accuracy score.

In [None]:
#evaluate the model
y_hat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % (accuracy*100))

Accuracy: 84.848


Now let's look at how to avoid data leakage.

The correct approach to performing data preparation with the train-test split evaluation is to fit the data preparation on the training set, then apply the transform to the train and test sets. This requires that we first split the dataset into train and test sets.

In [None]:
X, y = make_classification(n_samples = 1000, n_features = 20, n_informative = 15, n_redundant=5, random_state=7)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

We can then define the MinMaxScaler and call the fit() function on the training set. Thereafter, we apply the transform() function to the train and test sets to create a normalized version of each dataset.

In [None]:
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)

This is an example of fitting the scaler on the training set and applying it to both the train and test sets. This avoids data leakage because the minimum and maximum values for each input variable are calculated using the training set instead of the entire dataset.
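
One way to see the difference between the two approaches is to compare the statistics each scaler learns. The sketch below is an illustration rather than part of the original walkthrough: the names `naive_scaler` and `train_scaler` are introduced here, and `data_min_` and `data_max_` are the per-column minimum and maximum stored by a fitted MinMaxScaler.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# same dataset and split as in the tutorial
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# naive: statistics computed from every row, including future test rows
naive_scaler = MinMaxScaler().fit(X)
# correct: statistics computed from the training rows only
train_scaler = MinMaxScaler().fit(X_train)

# count the columns whose learned range changes when test rows are included
differs = (naive_scaler.data_min_ != train_scaler.data_min_) | \
          (naive_scaler.data_max_ != train_scaler.data_max_)
print('Columns whose range depends on test rows: %d of %d' % (differs.sum(), X.shape[1]))

# test values can fall outside 0-1 when the scaler has only seen training rows
X_test_scaled = train_scaler.transform(X_test)
print('Scaled test set range: %.3f to %.3f' % (X_test_scaled.min(), X_test_scaled.max()))
```

Scaled test values slightly below 0 or above 1 are not an error. They simply reflect that the test rows were not seen when the range was learned, which is also the situation for new data at prediction time.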

In [None]:
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))


Accuracy: 85.455


We can see that fitting the scaler on the training set only gives an estimated accuracy of 85.455 percent, compared with 84.848 percent for the naive approach. We expect data leakage to result in an incorrect estimate of model performance, so the estimate obtained without leakage is the one to report.

## Data Preparation With k-fold Cross-Validation

The k-fold cross-validation procedure involves splitting a dataset into k non-overlapping groups (folds). The model is trained on k-1 folds and evaluated on the held-out fold. This process is repeated so that each fold is given a chance to be used as the holdout test set. The average performance across all evaluations is reported.
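
The splitting step can be sketched with scikit-learn's KFold class on a tiny placeholder dataset. The 10-row array and the choice of k = 5 are assumptions made only to keep the printout short.

```python
import numpy as np
from sklearn.model_selection import KFold

# placeholder data: 10 rows, 1 column (the values do not matter for splitting)
rows = np.zeros((10, 1))
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

# track how many times each row lands in a held-out fold
test_counts = np.zeros(len(rows), dtype=int)
for fold, (train_ix, test_ix) in enumerate(kfold.split(rows)):
    test_counts[test_ix] += 1
    print('fold %d: train=%s test=%s' % (fold, train_ix, test_ix))

# every row is held out exactly once across the k folds
print(test_counts)
```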

The k-fold cross-validation procedure gives a more reliable estimate of model performance than a train-test split, although it is more computationally expensive because of the repeated fitting and evaluation of models.

## Cross-Validation Evaluation With Naive Data Preparation

The naive data preparation with cross-validation involves applying the data transforms first, then using the cross-validation procedure.

In [None]:
X, y = make_classification(n_samples = 1000, n_features = 20, n_informative = 15, n_redundant=5, random_state=7)

In [None]:
# normalize the entire dataset (naive approach)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

The k-fold cross-validation procedure must first be defined. We will use repeated stratified
10-fold cross-validation, which is a best practice for classification. Repeated means that the
whole cross-validation procedure is repeated multiple times, three in this case. Stratified means
that each group of rows will have the same relative composition of examples from each class as
the whole dataset. We will use k = 10, or 10-fold cross-validation. This can be achieved using
the RepeatedStratifiedKFold class, which can be configured to three repeats and 10 folds, and
then using the cross_val_score() function to perform the procedure, passing in the defined
model, cross-validation object, and metric to calculate, in this case, accuracy.
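
To make "repeated" and "stratified" concrete, the sketch below enumerates the splits produced by RepeatedStratifiedKFold on the tutorial dataset and compares the class balance of each held-out fold with that of the whole dataset. The variable names `splits` and `fold_fractions` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

# same dataset as in the tutorial
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
splits = list(cv.split(X, y))

# 10 folds repeated 3 times gives 30 train/test evaluations
print('Number of evaluations: %d' % len(splits))

# fraction of class 1 in the whole dataset and in each held-out fold
overall = y.mean()
fold_fractions = np.array([y[test_ix].mean() for _, test_ix in splits])
print('Class 1 fraction overall: %.3f' % overall)
print('Class 1 fraction per fold: min %.3f, max %.3f' % (fold_fractions.min(), fold_fractions.max()))
```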

In [None]:
# naive data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
random_state=7)
# normalize the entire dataset before cross-validation (naive approach)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# define the model
model = LogisticRegression()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

Accuracy: 85.300 (3.607)


In this case, we can see that the model achieved an estimated accuracy of about 85.300 percent, which we know is incorrect given the data leakage allowed by the data preparation procedure.

Now, let's evaluate the model with cross-validation and avoid data leakage.

## Cross-Validation Evaluation With Correct Data Preparation

The evaluation procedure changes from simply and incorrectly evaluating just the model
to correctly evaluating the entire pipeline of data preparation and model together as a single
atomic unit. This can be achieved using the Pipeline class. This class takes a list of steps
that define the pipeline. Each step in the list is a tuple with two elements. The first element is
the name of the step (a string) and the second is the configured object of the step, such as a
transform or a model. The model is only supported as the final step, although we can have as
many transforms as we like in the sequence.

In [None]:
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5,
random_state=7)
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))


Accuracy: 85.433 (3.471)


Running the example normalizes the data correctly within the cross-validation folds of the evaluation procedure to avoid data leakage.
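
What the pipeline does inside each fold can be approximated by hand. The loop below is a rough sketch of the equivalent manual procedure, not a description of how cross_val_score() is implemented internally: the scaler is fit on the training rows of the fold only, then applied to both the training and held-out rows.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# same dataset and evaluation procedure as in the tutorial
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=7)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

scores = []
for train_ix, test_ix in cv.split(X, y):
    # fit the scaler on the training rows of this fold only
    scaler = MinMaxScaler().fit(X[train_ix])
    # fit a fresh model on the scaled training rows
    model = LogisticRegression()
    model.fit(scaler.transform(X[train_ix]), y[train_ix])
    # scale the held-out rows with the training statistics, then evaluate
    yhat = model.predict(scaler.transform(X[test_ix]))
    scores.append(accuracy_score(y[test_ix], yhat))

print('Accuracy: %.3f' % (mean(scores) * 100))
```

Using Pipeline is usually preferable to writing this loop by hand, because it keeps the fit and transform steps together and makes it harder to apply a transform to the wrong rows by accident.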

We have seen here that the estimated accuracy changes from 85.300 percent with the naive approach to about 85.433 percent with the pipeline. This example demonstrates that data leakage may affect the estimate of model performance, and how to avoid data leakage by correctly performing data preparation after the data is split.