# Common pitfalls and Best Practices

## Inconsistent preprocessing

If data transformations are used when training a model, they must also be applied to subsequent datasets, such as test data or data in a production system. Otherwise the feature space changes and the model cannot perform as expected.

In [17]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

random_state = 42
X, y = make_regression(random_state=random_state, n_features=1, noise=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=random_state)

### Wrong

The training dataset is scaled but the test dataset is not, so the model performs worse on the test dataset than expected.

In [18]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_transformed = scaler.fit_transform(X_train)

model = LinearRegression()

model.fit(X_train_transformed, y_train)

# Mistake: X_test is passed to predict without applying the fitted scaler
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)


In [21]:
print(f"Mean squared error of the linear regression model is {mse:.3f}")

Mean squared error of the linear regression model is 48.774


### Right

Instead of passing the untransformed `X_test` to `predict`, we should transform the test data the same way we transformed the training data.

In [22]:
X_test_transformed = scaler.transform(X_test)

y_pred = model.predict(X_test_transformed)

mse = mean_squared_error(y_test, y_pred)

print(f"Mean squared error of the linear regression model is {mse:.3f}")

Mean squared error of the linear regression model is 1.042


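Alternatively, a pipeline makes it easier to chain transformations with estimators and reduces the chance of forgetting a transformation. Pipelines also help avoid another common pitfall: leaking the test data into the training data.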

In [23]:
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LinearRegression())

model.fit(X_train, y_train)

mean_squared_error(y_test, model.predict(X_test))

np.float64(1.0420222653186997)
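To make the equivalence concrete, the following is a minimal sketch (not part of the original notebook run) that checks whether the pipeline produces the same predictions as scaling by hand. The names `manual_model`, `manual_pred`, `pipe` and `pipe_pred` are illustrative assumptions; the data setup mirrors the cells above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data and split as in the cells above
X, y = make_regression(random_state=42, n_features=1, noise=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Manual route: fit the scaler on the training data only,
# then reuse the fitted scaler on the test data
scaler = StandardScaler().fit(X_train)
manual_model = LinearRegression().fit(scaler.transform(X_train), y_train)
manual_pred = manual_model.predict(scaler.transform(X_test))

# Pipeline route: the same two steps, applied inside fit and predict
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# True when both routes give (numerically) identical predictions
same = np.allclose(manual_pred, pipe_pred)
print(same)
```

The design point is that the pipeline removes the step a person can forget: `predict` always runs the fitted scaler before the regressor.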

## Data Leakage

Data leakage occurs when information that would not be available at prediction time is used when building the model. This results in overly optimistic performance estimates.

The general rule is to never call `fit` on the test data.

## How to avoid data leakage

Steps to avoid data leakage:

1. Always split the data into train and test subsets first, particularly before any preprocessing steps.
2. Never include test data when using the `fit` and `fit_transform` methods. Fitting on all the data, e.g. `fit(X)`, can result in overly optimistic scores.
3. The `transform` method should be used on both the train and test subsets, as the same preprocessing should be applied to all the data. This can be achieved by using `fit_transform` on the train subset and `transform` on the test subset.
4. The scikit-learn pipeline is a great way to prevent data leakage, as it ensures that the appropriate method is performed on the correct data subset.
5. The pipeline is ideal for use in cross-validation and hyper-parameter tuning functions.
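As an illustration of the last step, here is a hedged sketch of hyper-parameter tuning with a pipeline, so that feature selection is refitted on the training part of every CV fold. The random data shape, the use of `LogisticRegression`, and the candidate values for `k` are assumptions chosen to keep the example small; they are not taken from the original notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

# Assumed toy data: random features and independent random labels
rng = np.random.RandomState(42)
X = rng.standard_normal((200, 500))
y = rng.choice(2, 200)

# Step 1: split before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Steps 2-4: the pipeline fits the selector only on the data it is given
pipe = make_pipeline(SelectKBest(), LogisticRegression())

# Step 5: tune k inside cross-validation; make_pipeline names each step
# after its lowercased class name, hence the "selectkbest__k" key
param_grid = {"selectkbest__k": [5, 25, 50]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["selectkbest__k"]
test_score = search.score(X_test, y_test)
print(best_k, round(test_score, 2))
```

Because the labels here are independent of the features, a test score close to chance is the expected, leak-free outcome.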

## Data Leakage during pre-processing

In [24]:
import numpy as np

n_samples, n_features, n_classes = 200, 10000, 2

rng = np.random.RandomState(42)

X = rng.standard_normal((n_samples, n_features))
y = rng.choice(n_classes, n_samples)

### Wrong way of doing

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

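This is incorrect preprocessing because the feature selection is fitted on the entire dataset, including the samples that later become the test set. Since `X` and `y` are generated independently, any accuracy well above 0.5 is an artifact of this leakage.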

In [30]:
# Leak: the selector sees every sample, including the future test set
X_selected = SelectKBest(k=25).fit_transform(X, y)

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, random_state=42)

gbc = GradientBoostingClassifier(random_state=1)

gbc.fit(X_train, y_train)

In [33]:
y_pred = gbc.predict(X_test)

acc = accuracy_score(y_test, y_pred)

print(f"accuracy of GBC model is {acc:.3g}")

accuracy of GBC model is 0.76


### Right way of doing

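To prevent data leakage, it is good practice to split the data into train and test subsets **first**. Feature selection can then be fitted using only the train subset: `fit` and `fit_transform` are called on the train subset, and only `transform` on the test subset.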

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

select = SelectKBest(k=25)

X_train_selected = select.fit_transform(X_train, y_train)

gbc = GradientBoostingClassifier(random_state=1)

gbc.fit(X_train_selected, y_train)

In [35]:
# Transform the test data with the selector fitted on the training data
X_test_selected = select.transform(X_test)

# Then predict on the transformed test data
y_pred = gbc.predict(X_test_selected)

acc = accuracy_score(y_test, y_pred)

print(f"accuracy of gbc is {acc}")

accuracy of gbc is 0.46


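Another way to prevent data leakage is to use a pipeline, which chains the feature selection and the model together so that `fit` only ever sees the training data.

In [38]:
from sklearn.pipeline import make_pipeline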

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)

pipeline = make_pipeline(SelectKBest(k=25), GradientBoostingClassifier(random_state=1))

pipeline.fit(X_train, y_train)

In [39]:
y_pred = pipeline.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"accuracy of gbc is {acc}")

accuracy of gbc is 0.46


The pipeline can also be passed to a cross-validation function such as `cross_val_score`. The pipeline ensures that the feature selection is refitted on the training part of each fold.

In [40]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X, y)

print(f"Mean accuracy is {scores.mean():.2f} +/- {scores.std():.2f}")

Mean accuracy is 0.46 +/- 0.07
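For contrast, the sketch below runs the same cross-validation twice on the synthetic data used in this section: once with features selected on the full dataset beforehand (leaky), and once with selection inside the pipeline. The variable names `X_leaky`, `leaky_scores` and `honest_scores` are illustrative assumptions; the exact numbers depend on the installed scikit-learn version, so none are quoted here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Random features and independent random labels: chance level is 0.5
rng = np.random.RandomState(42)
X = rng.standard_normal((200, 10000))
y = rng.choice(2, 200)

# Leaky: features are selected using all samples before cross-validation,
# so every validation fold has already influenced the selection
X_leaky = SelectKBest(k=25).fit_transform(X, y)
leaky_scores = cross_val_score(
    GradientBoostingClassifier(random_state=1), X_leaky, y)

# Leak-free: selection is refitted on the training part of each fold
pipe = make_pipeline(SelectKBest(k=25),
                     GradientBoostingClassifier(random_state=1))
honest_scores = cross_val_score(pipe, X, y)

print(f"leaky:     {leaky_scores.mean():.2f}")
print(f"leak-free: {honest_scores.mean():.2f}")
```

The leaky estimate should come out noticeably higher, even though there is nothing to learn from this data.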


# Controlling Randomness

Some scikit-learn objects are inherently random. These are usually estimators (e.g. **RandomForestClassifier**) and cross-validation splitters (e.g. **KFold**). Their randomness is controlled through the `random_state` parameter.

For optimal robustness of cross-validation (CV) results, pass `RandomState` instances when creating estimators, or leave `random_state` set to `None`.

Passing integers to CV splitters is usually the safest option and is preferable. Passing `RandomState` instances to splitters may sometimes be useful for very specific use cases.

For both estimators and splitters, passing an integer versus passing an instance (or `None`) leads to subtle but significant differences, especially for CV procedures.

For reproducible results across executions, remove any use of `random_state=None`.

## Using `None` or `RandomState` instances, repeated calls to `fit` and `split`

The `random_state` parameter determines whether repeated calls to `fit` (for estimators) or to `split` (for CV splitters) will produce the same results, according to the following rules:

1. If an integer is passed, calling `fit` or `split` multiple times always yields the same results.
2. If `None` or a `RandomState` instance is passed, `fit` and `split` will yield different results each time they are called.
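The first rule can be checked directly. The sketch below is a minimal illustration, assuming the integer seed `0` and a small synthetic dataset; the helper names `coef_first`, `coef_second`, `first_pass` and `second_pass` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_features=5, random_state=0)

# Estimator with an integer seed: every call to fit restarts from the same RNG
sgd = SGDClassifier(random_state=0)
coef_first = sgd.fit(X, y).coef_.copy()
coef_second = sgd.fit(X, y).coef_.copy()
same_coef = np.array_equal(coef_first, coef_second)

# Splitter with an integer seed: every call to split yields the same folds
cv = KFold(n_splits=2, shuffle=True, random_state=0)
first_pass = [test.tolist() for _, test in cv.split(X)]
second_pass = [test.tolist() for _, test in cv.split(X)]
same_splits = first_pass == second_pass

print(same_coef, same_splits)
```

Both flags are expected to be `True` with an integer seed; the cells that follow show the second rule, where a `RandomState` instance gives different results on each call.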

### Estimators

Passing instances means that calling `fit` multiple times will not yield the same results, even if the estimator is fitted on the same data with the same hyper-parameters.

In [53]:
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
import numpy as np


In [56]:
rng = np.random.RandomState(0)

X, y = make_classification(n_features=5, random_state=rng)
sgd = SGDClassifier(random_state=rng)

sgd.fit(X, y)

In [57]:
sgd.coef_

array([[ 8.85418642,  4.79084103, -3.13077794,  8.11915045, -0.56479934]])

In [58]:
sgd.fit(X,y)

In [59]:
sgd.coef_

array([[ 6.70814003,  5.25291366, -7.55212743,  5.18197458,  1.37845099]])

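### CV Splitters

CV splitters that have been passed a `RandomState` instance likewise yield different splits each time `split` is called.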


In [60]:
from sklearn.model_selection import KFold

import numpy as np

X = y = np.arange(10)

rng = np.random.RandomState(0)

cv = KFold(n_splits=2, shuffle=True, random_state=rng)

for train,test in cv.split(X,y):
    print(train,test)
for train,test in cv.split(X,y):
    print(train,test)

[0 3 5 6 7] [1 2 4 8 9]
[1 2 4 8 9] [0 3 5 6 7]
[0 4 6 7 8] [1 2 3 5 9]
[1 2 3 5 9] [0 4 6 7 8]


In [61]:
for train,test in cv.split(X,y):
    print(train,test)

[0 1 6 7 9] [2 3 4 5 8]
[2 3 4 5 8] [0 1 6 7 9]


## Common pitfalls and subtleties

While the rules that govern the `random_state` parameter are seemingly simple, they have some subtle implications. In some cases, these can even lead to wrong conclusions.

### Estimators

**Different `random_state` types lead to different cross-validation procedures**


In [62]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

In [63]:
X,y = make_classification(random_state=0)

rf_123 = RandomForestClassifier(random_state=123)
cross_val_score(rf_123, X, y)

array([0.85, 0.95, 0.95, 0.9 , 0.9 ])

Since `rf_123` was passed an integer, every call to `fit` uses the same RNG. This means that all random characteristics of the random forest estimator will be the same for each fold of the CV procedure.

In [64]:
rf_inst = RandomForestClassifier(random_state=np.random.RandomState(0))
cross_val_score(rf_inst, X, y)

array([0.9 , 0.95, 0.95, 0.9 , 0.9 ])

Since `rf_inst` was passed a `RandomState` instance, each call to `fit` starts from a different RNG. As a result, the random subset of features will be different for each fold.

In [65]:
# Cloning: clone(a) returns a new, unfitted estimator with the same parameters as a

from sklearn import clone
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

a = RandomForestClassifier(random_state=rng)
b = clone(a)

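### CV splitters

When a `RandomState` instance is passed to a CV splitter, `split` yields different splits each time it is called, which matters when several estimators are evaluated with the same splitter: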

In [66]:
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
import numpy as np

rng = np.random.RandomState(0)
X, y = make_classification(random_state=rng)
cv = KFold(shuffle=True, random_state=rng)
lda = LinearDiscriminantAnalysis()
nb = GaussianNB()

for est in (lda, nb):
    print(cross_val_score(est, X, y, cv=cv))

[0.8  0.75 0.75 0.7  0.85]
[0.85 0.95 0.95 0.85 0.95]
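In the cell above, `cv` holds a `RandomState` instance, so each `cross_val_score` call triggers a fresh `cv.split` and the two estimators are scored on different folds. Comparing their scores fold by fold would therefore be misleading. The following is a minimal sketch of one remedy, assuming the integer seed `0`; `folds_a`, `folds_b` and `results` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(random_state=0)

# Integer seed: split returns the same folds on every call
cv = KFold(shuffle=True, random_state=0)

# Check that two successive calls to split agree
folds_a = [test.tolist() for _, test in cv.split(X, y)]
folds_b = [test.tolist() for _, test in cv.split(X, y)]
same_folds = folds_a == folds_b

# Both estimators are now evaluated on identical folds
results = {}
for name, est in (("lda", LinearDiscriminantAnalysis()), ("nb", GaussianNB())):
    results[name] = cross_val_score(est, X, y, cv=cv)
    print(name, results[name])
```

With identical folds, the i-th score of one estimator can be compared directly with the i-th score of the other.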


## General Recommendations

### Getting reproducible results across multiple executions

In order to obtain reproducible results across multiple program executions, we need to remove all uses of `random_state=None`, which is the default.

The recommended way is to declare a `rng` variable at the top of the program and pass it down to any object that accepts a `random_state` parameter.

In [67]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

rng = np.random.RandomState(0)

X, y = make_classification(random_state=rng)

rf = RandomForestClassifier(random_state=rng)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=rng)
rf.fit(X_train, y_train).score(X_test, y_test)

0.84
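One way to convince yourself that this pattern is reproducible is to wrap it in a function and run it twice with the same seed, which stands in for two separate program executions. This is a sketch under that assumption; `run_experiment` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def run_experiment(seed):
    """Hypothetical helper: one full run driven by a single seeded RNG."""
    rng = np.random.RandomState(seed)
    X, y = make_classification(random_state=rng)
    rf = RandomForestClassifier(random_state=rng)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=rng)
    return rf.fit(X_train, y_train).score(X_test, y_test)


# Two runs that start from the same seed consume the RNG in the same order
first = run_experiment(0)
second = run_experiment(0)
reproducible = first == second
print(first, second, reproducible)
```

Note that reproducibility here relies on every random object drawing from the one `rng` in the same order on each run.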

### Robustness of cross-validation results

When we evaluate a randomized estimator performance by cross-validation, we want to make sure that the estimator can yield accurate predictions for new data, but we also want to make sure that the estimator is robust w.r.t. its random initialization. For example, we would like the random weights initialization of a SGDClassifier to be consistently good across all folds: otherwise, when we train that estimator on new data, we might get unlucky and the random initialization may lead to bad performance. Similarly, we want a random forest to be robust w.r.t the set of randomly selected features that each tree will be using.

For these reasons, it is preferable to evaluate the cross-validation performance by letting the estimator use a different RNG on each fold. This is done by passing a RandomState instance (or None) to the estimator initialization.

When we pass an integer, the estimator will use the same RNG on each fold: if the estimator performs well (or badly), as evaluated by CV, it might just be because we got lucky (or unlucky) with that specific seed. Passing instances leads to more robust CV results, and makes the comparison between various algorithms fairer. It also helps limit the temptation to treat the estimator’s RNG as a hyper-parameter that can be tuned.
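To get a feel for how much a single integer seed can matter, the sketch below repeats the same cross-validation with a handful of seeds and reports the spread of the mean score. The seed range and the reduced `n_estimators=25` are assumptions made to keep the example quick; how large the spread turns out depends on the data and the estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=0)

# Mean CV accuracy for several integer seeds of the same estimator
mean_by_seed = {}
for seed in range(5):
    rf = RandomForestClassifier(n_estimators=25, random_state=seed)
    mean_by_seed[seed] = cross_val_score(rf, X, y).mean()

for seed, mean in mean_by_seed.items():
    print(f"seed={seed}: mean CV accuracy {mean:.3f}")

# Difference between the luckiest and unluckiest seed
spread = max(mean_by_seed.values()) - min(mean_by_seed.values())
print(f"spread across seeds: {spread:.3f}")
```

Picking the best seed from such a loop would amount to tuning the RNG as a hyper-parameter, which is exactly the temptation described above.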

Whether we pass `RandomState` instances or integers to CV splitters has no impact on robustness, as long as `split` is only called once. When `split` is called multiple times, fold-to-fold comparison isn’t possible anymore. As a result, passing integers to CV splitters is usually safer and covers most use cases.

## Rough

In [25]:
X[5]

array([ 0.09820615, -0.06410823,  0.95179076, ...,  0.14101014,
       -2.18197284, -0.00639759])

In [26]:
X.size

2000000

In [27]:
X.shape

(200, 10000)

In [41]:
scores?

Type:        ndarray
String form: [0.475 0.45  0.35  0.45  0.575]
Length:      5

In [42]:
from sklearn.datasets import make_classification

In [44]:
import pandas as pd

In [48]:
df = pd.DataFrame()

df = make_classification(random_state=0)
df

(array([[-0.03926799,  0.13191176, -0.21120598, ...,  1.97698901,
          1.02122474, -0.46931074],
        [ 0.77416061,  0.10490717, -0.33281176, ...,  1.2678044 ,
          0.62251914, -1.49026539],
        [-0.0148577 ,  0.67057045, -0.21416666, ..., -0.10486202,
         -0.10169727, -0.45130304],
        ...,
        [ 0.29673317, -0.49610233, -0.86404499, ..., -1.10453952,
          2.01406015,  0.69042902],
        [ 0.08617684,  0.9836362 ,  0.17124355, ...,  2.11564734,
          0.11273794,  1.20985013],
        [-1.58249448, -1.42279491, -0.56430103, ...,  1.26661394,
         -1.31771734,  1.61805427]]),
 array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
        0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
        0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
        0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0]))

In [49]:
dir(df)

['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'count',
 'index']