Course "50 scikit-learn tips" by **Kevin Markham**

Link: https://courses.dataschool.io/courses/scikit-learn-tips/801794-introduction/2376425-welcome-to-the-course

* Updated by Tien LE
* Updated date: 2021-10-01


In [1]:
import sklearn

In [2]:
sklearn.__version__

'0.24.2'

In [3]:
# Upgrade scikit-learn to 1.0
# ! pip install --upgrade scikit-learn

# Use ColumnTransformer to apply different preprocessing to different columns

Use ColumnTransformer to apply different preprocessing to different columns:
+ select from DataFrame columns by name
+ passthrough or drop unspecified columns

Requires scikit-learn 0.20+

Additional links: [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html), [make_column_transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html?highlight=make_column_transformer)

In [4]:
import numpy as np
import pandas as pd

In [5]:
import os 

file_input_path = "titanic_train.csv"
if not os.path.exists(file_input_path):
    df = pd.read_csv("http://bit.ly/kaggletrain")
    df.to_csv(file_input_path, header=True, index=False, sep="\t")
else:
    df = pd.read_csv(file_input_path, header=0, sep="\t")

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
df.head(6)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [8]:
cols = ["Fare", "Embarked", "Sex", "Age"]
X = df[cols]
X = X.head(6)
X

Unnamed: 0,Fare,Embarked,Sex,Age
0,7.25,S,male,22.0
1,71.2833,C,female,38.0
2,7.925,S,female,26.0
3,53.1,S,female,35.0
4,8.05,S,male,35.0
5,8.4583,Q,male,


In [9]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

In [10]:
ohe = OneHotEncoder()
simple_imputer = SimpleImputer()  # default strategy="mean": fill missing values with the column mean

In [11]:
col_transformer = make_column_transformer(
    (ohe, ["Embarked", "Sex"]),
    (simple_imputer, ["Age"]),
    remainder="passthrough"
)

#https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html?highlight=make_column_transformer
# remainder: {'drop', 'passthrough'} or estimator, default='drop'
#   By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped (default of 'drop').
#   By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers.
#   By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform.

In [12]:
col_transformer.fit_transform(X)

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    , 22.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    , 38.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    , 26.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    , 35.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    , 35.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    , 31.2   ,  8.4583]])

# Seven ways to select columns using ColumnTransformer

There are SEVEN ways to select columns using [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html):
+ column name
+ integer position
+ slice
+ boolean mask
+ regex pattern
+ dtypes to include
+ dtypes to exclude

Additional links: [make_column_selector](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html)

In [13]:
import os 

file_input_path = "titanic_train.csv"
if not os.path.exists(file_input_path):
    df = pd.read_csv("http://bit.ly/kaggletrain")
    df.to_csv(file_input_path, header=True, index=False, sep="\t")
else:
    df = pd.read_csv(file_input_path, header=0, sep="\t")

In [14]:
cols = ["Fare", "Embarked", "Sex", "Age"]
X = df[cols]
X = X.head(6)
X

Unnamed: 0,Fare,Embarked,Sex,Age
0,7.25,S,male,22.0
1,71.2833,C,female,38.0
2,7.925,S,female,26.0
3,53.1,S,female,35.0
4,8.05,S,male,35.0
5,8.4583,Q,male,


In [15]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector

In [16]:
ohe = OneHotEncoder()

In [17]:
X.columns

Index(['Fare', 'Embarked', 'Sex', 'Age'], dtype='object')

In [18]:
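# all SEVEN of these produce the same results
# Note: remainder="drop" by default, so columns that are not selected are dropped.
# Each assignment below overwrites col_transformer; only the last one is fitted in the next cell.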

# column name
col_transformer = make_column_transformer((ohe, ['Embarked', 'Sex']))

# integer position
col_transformer = make_column_transformer((ohe, [1,2]))  # column index 0,1,2,...

# slice
col_transformer = make_column_transformer((ohe, slice(1,3)))  # positions 1 and 2 (the stop index 3 is excluded)

# boolean mask
col_transformer = make_column_transformer((ohe, [False, True, True, False]))

# regex pattern
col_transformer = make_column_transformer((ohe, make_column_selector(pattern="E|S")))  # column names containing "E" or "S" (case-sensitive)
# https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html
# pattern: str, default=None
#   Name of columns containing this regex pattern will be included. If None, column selection will not be selected based on pattern.

# dtypes to include
col_transformer = make_column_transformer((ohe, make_column_selector(dtype_include=object)))

# dtypes to exclude
col_transformer = make_column_transformer((ohe, make_column_selector(dtype_exclude=np.number)))

In [19]:
col_transformer.fit_transform(X)

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.]])
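
Because each assignment above overwrites `col_transformer`, only the last selector is actually fitted. The sketch below is one way to check that all seven selectors really give the same result. `X_demo` is a hand-typed stand-in for the six rows shown above, and the `encode` helper is a hypothetical name introduced only for this check.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hand-typed stand-in for the six Titanic rows used above (assumption)
X_demo = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
})
# Force object dtype so dtype_include=object matches on any pandas version
X_demo = X_demo.astype({"Embarked": object, "Sex": object})

selectors = {
    "column name": ["Embarked", "Sex"],
    "integer position": [1, 2],
    "slice": slice(1, 3),
    "boolean mask": [False, True, True, False],
    "regex pattern": make_column_selector(pattern="E|S"),
    "dtypes to include": make_column_selector(dtype_include=object),
    "dtypes to exclude": make_column_selector(dtype_exclude=np.number),
}

def encode(selector):
    """One-hot encode the selected columns and return a dense array."""
    ct = make_column_transformer((OneHotEncoder(), selector))
    out = ct.fit_transform(X_demo)
    return out.toarray() if hasattr(out, "toarray") else np.asarray(out)

results = {name: encode(selector) for name, selector in selectors.items()}
reference = results["column name"]
all_equal = all(np.array_equal(reference, r) for r in results.values())
print(all_equal)
```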

## Simple Case

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html?highlight=standardscaler#sklearn.preprocessing.StandardScaler

Standardize features by removing the mean and scaling to unit variance.

In [20]:
X

Unnamed: 0,Fare,Embarked,Sex,Age
0,7.25,S,male,22.0
1,71.2833,C,female,38.0
2,7.925,S,female,26.0
3,53.1,S,female,35.0
4,8.05,S,male,35.0
5,8.4583,Q,male,


In [21]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder

col_transformer = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),  
    (OneHotEncoder(), make_column_selector(dtype_include=object))
)

col_transformer.fit_transform(X)

array([[-0.71829711, -1.50516598,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ],
       [ 1.7333147 ,  1.11251398,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ],
       [-0.69245371, -0.85074599,  0.        ,  0.        ,  1.        ,
         1.        ,  0.        ],
       [ 1.03713954,  0.62169899,  0.        ,  0.        ,  1.        ,
         1.        ,  0.        ],
       [-0.6876679 ,  0.62169899,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ],
       [-0.67203551,         nan,  0.        ,  1.        ,  0.        ,
         0.        ,  1.        ]])

## Flexible Case

Link: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py

In [22]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

numeric_features = ["Fare", "Age"]
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]
)

categorical_features = ["Embarked", "Sex"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_transformer, numeric_features),
        ("categorical", categorical_transformer, categorical_features)
    ]
)

preprocessor.fit_transform(X)

array([[-0.71829711, -1.70817275,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ],
       [ 1.7333147 ,  1.07122698,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ],
       [-0.69245371, -1.01332282,  0.        ,  0.        ,  1.        ,
         1.        ,  0.        ],
       [ 1.03713954,  0.55008953,  0.        ,  0.        ,  1.        ,
         1.        ,  0.        ],
       [-0.6876679 ,  0.55008953,  0.        ,  0.        ,  1.        ,
         0.        ,  1.        ],
       [-0.67203551,  0.55008953,  0.        ,  1.        ,  0.        ,
         0.        ,  1.        ]])

# What is the difference between "fit" and "transform"?

Q: What is the difference between the "fit" and "transform" methods?

+ "[fit](https://scikit-learn.org/stable/glossary.html#term-fit)": transformer learns something about the data
+ "[transform](https://scikit-learn.org/stable/glossary.html#term-transform)": it uses what it learned to do the data transformation

For example:
+ CountVectorizer
    - fit: learns the vocabulary
    - transform: creates a document-term matrix using the vocabulary
    
+ SimpleImputer
    - fit: learns the value to impute
    - transform: fills in missing entries using the imputation value
    
+ StandardScaler
    - fit: learns the mean and scale of each feature
    - transform: standardizes the features using the mean and scale
    
+ HashingVectorizer
    - fit: is not used, and thus it is known as a "stateless" transformer
    - transform: creates the document-term matrix using a hash of each token
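
The split between the two methods can be inspected directly, since fitted transformers store what they learned in attributes ending with an underscore. The following is a minimal sketch using the six `Age` values shown earlier; the array `ages` is typed in by hand for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# The six Age values from the examples above, one of them missing
ages = np.array([[22.0], [38.0], [26.0], [35.0], [35.0], [np.nan]])

# fit: the imputer learns the value to impute (the mean, by default)
imputer = SimpleImputer()
imputer.fit(ages)
learned_mean = imputer.statistics_[0]

# transform: the missing entry is filled with the learned value
filled = imputer.transform(ages)

# fit: the scaler learns the mean and scale of each feature
scaler = StandardScaler()
scaler.fit(filled)
print(scaler.mean_, scaler.scale_)

# transform: the features are standardized using the learned mean and scale
scaled = scaler.transform(filled)
```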

# Use "fit_transform" on training data, but "transform" (only) on testing/new data

Use "[fit_transform](https://scikit-learn.org/stable/glossary.html#term-fit_transform)" on training data, but "[transform](https://scikit-learn.org/stable/glossary.html#term-transform)" (only) on testing/new data.

This applies the transformations learned from the training data to both sets, which keeps the output columns consistent and prevents [data leakage](https://scikit-learn.org/stable/common_pitfalls.html#data-leakage)!
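
A minimal sketch of this rule, assuming tiny made-up arrays in place of a real train/test split (all variable names here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up numeric feature for training and testing
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and scale from training data only
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and scale

# Made-up categorical feature; "X" never appears in the training data
train_ports = np.array([["S"], ["C"], ["Q"]], dtype=object)
test_ports = np.array([["S"], ["X"]], dtype=object)

ohe = OneHotEncoder(handle_unknown="ignore")
train_encoded = ohe.fit_transform(train_ports).toarray()
test_encoded = ohe.transform(test_ports).toarray()  # same three columns as the training output

print(train_encoded.shape, test_encoded.shape)
```

Calling `fit_transform` on the test data instead would relearn the mean, scale, and categories from the test set, so the two outputs would no longer be comparable.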

# Get the feature names output by a ColumnTransformer

Need to get the feature names output by a [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)?

Use get_feature_names(), which now works with "passthrough" columns (new in version 0.23)!

Note: Beginning in scikit-learn 1.0, the get_feature_names method has been deprecated in favor of [get_feature_names_out](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer.get_feature_names_out); it was removed in version 1.2.


In [23]:
X

Unnamed: 0,Fare,Embarked,Sex,Age
0,7.25,S,male,22.0
1,71.2833,C,female,38.0
2,7.925,S,female,26.0
3,53.1,S,female,35.0
4,8.05,S,male,35.0
5,8.4583,Q,male,


In [24]:
X.columns

Index(['Fare', 'Embarked', 'Sex', 'Age'], dtype='object')

In [25]:
col_transformer = make_column_transformer(
    (OneHotEncoder(), ["Embarked", "Sex"]),
    remainder="passthrough"
)

In [26]:
ft = col_transformer.fit_transform(X)
ft

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  7.25  , 22.    ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    , 71.2833, 38.    ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  7.925 , 26.    ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    , 53.1   , 35.    ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  8.05  , 35.    ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  8.4583,     nan]])

In [27]:
ft.shape

(6, 7)

In [28]:
col_transformer.get_feature_names()

['onehotencoder__x0_C',
 'onehotencoder__x0_Q',
 'onehotencoder__x0_S',
 'onehotencoder__x1_female',
 'onehotencoder__x1_male',
 'Fare',
 'Age']
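
On scikit-learn 1.0 or later, the same information is available from `get_feature_names_out`. The sketch below falls back to the older method when the new one is missing, so it should run on either side of the change. `X_demo` is a hand-typed stand-in for the six rows above, and the exact name prefixes differ between versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hand-typed stand-in for the six Titanic rows used above (assumption)
X_demo = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583],
    "Embarked": ["S", "C", "S", "S", "S", "Q"],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
})

ct = make_column_transformer(
    (OneHotEncoder(), ["Embarked", "Sex"]),
    remainder="passthrough",
)
ct.fit(X_demo)

# Prefer the newer method; fall back to the older one on scikit-learn < 1.0
if hasattr(ct, "get_feature_names_out"):
    names = list(ct.get_feature_names_out())
else:
    names = list(ct.get_feature_names())

print(names)
```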

# Passthrough some columns and drop others in a ColumnTransformer

In a [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html), you can use the strings 'passthrough' and 'drop' in place of a transformer. 

Useful if you need to passthrough some columns and drop others!

In [29]:
X

Unnamed: 0,Fare,Embarked,Sex,Age
0,7.25,S,male,22.0
1,71.2833,C,female,38.0
2,7.925,S,female,26.0
3,53.1,S,female,35.0
4,8.05,S,male,35.0
5,8.4583,Q,male,


## Method 1: "passthrough" with some columns

In [30]:
col_transformer = make_column_transformer(
    (SimpleImputer(), ["Age"]),
    ("passthrough", ["Fare"]),
    remainder="drop"
)
col_transformer.fit_transform(X)

array([[22.    ,  7.25  ],
       [38.    , 71.2833],
       [26.    ,  7.925 ],
       [35.    , 53.1   ],
       [35.    ,  8.05  ],
       [31.2   ,  8.4583]])

## Method 2: "drop" with some columns

In [31]:
col_transformer = make_column_transformer(
    (SimpleImputer(), ["Age"]),
    ("drop", ["Embarked", "Sex"]),
    remainder="passthrough"
)

col_transformer.fit_transform(X)

array([[22.    ,  7.25  ],
       [38.    , 71.2833],
       [26.    ,  7.925 ],
       [35.    , 53.1   ],
       [35.    ,  8.05  ],
       [31.2   ,  8.4583]])

# Four reasons to use scikit-learn (not pandas) for ML preprocessing

Reasons to use scikit-learn (not pandas) for ML [preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html):
+ You can cross-validate the entire workflow
+ You can [grid search](https://scikit-learn.org/stable/modules/grid_search.html) model & preprocessing hyperparameters
+ Avoids adding new columns to the source DataFrame
+ pandas lacks separate fit/transform steps to prevent [data leakage](https://scikit-learn.org/stable/common_pitfalls.html#data-leakage)
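
The first two points can be illustrated with a small sketch: preprocessing and model are chained in one pipeline, which is then cross-validated and grid searched as a single estimator. The data below is synthetic and the column names are only borrowed from the Titanic examples, so the scores themselves mean nothing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data (assumption): 60 rows, some missing ages
rng = np.random.RandomState(0)
n = 60
X_demo = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=n),
    "Sex": np.tile(["female", "male"], n // 2),
})
X_demo.loc[::10, "Age"] = np.nan
y_demo = (X_demo["Sex"] == "female").astype(int)

preprocessor = make_column_transformer(
    (SimpleImputer(), ["Age"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

# Cross-validate the entire workflow: preprocessing is refit inside each fold
scores = cross_val_score(pipe, X_demo, y_demo, cv=3)

# Grid search over a preprocessing hyperparameter and a model hyperparameter
param_grid = {
    "columntransformer__simpleimputer__strategy": ["mean", "median"],
    "logisticregression__C": [0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_demo, y_demo)
print(scores, grid.best_params_)
```

The step names in `param_grid` are the lowercase class names that `make_pipeline` and `make_column_transformer` assign automatically.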

# Don't use .values when passing a pandas object to scikit-learn

There's no need to use ".values" when passing a DataFrame or Series to scikit-learn... it knows how to access the underlying NumPy array!


In [32]:
df.head(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C


In [33]:
from sklearn.linear_model import LogisticRegression

In [34]:
clf = LogisticRegression()

In [35]:
X = df[["Pclass", "Fare"]]
y = df["Survived"]

In [36]:
type(X)

pandas.core.frame.DataFrame

In [37]:
type(y)

pandas.core.series.Series

In [38]:
clf.fit(X, y)

LogisticRegression()

# Load a toy dataset into a DataFrame

New in version 0.23: Need to load a [toy dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html) into a DataFrame, including column names? Set as_frame=True.

Want features and target as separate objects? Also set return_X_y=True.

Additional links: [load_iris](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html)

In [39]:
from sklearn.datasets import load_iris

## Return DataFrame with features and target

In [40]:
df_toy = load_iris(as_frame=True)

In [41]:
type(df_toy)

sklearn.utils.Bunch

In [42]:
print(dir(df_toy))

['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']


In [43]:
df_toy_frame = df_toy["frame"]

In [44]:
df_toy_frame.head(3)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0


## Return Feature DataFrame and Target Series 

In [45]:
X_toy, y_toy = load_iris(as_frame=True, return_X_y=True)

In [46]:
X_toy.head(3)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2


In [47]:
y_toy.head(3)

0    0
1    0
2    0
Name: target, dtype: int32

# Encode categorical features using OneHotEncoder or OrdinalEncoder

Two common ways to encode categorical features:
+ [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for unordered (nominal) data
+ [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) for ordered (ordinal) data

P.S. [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) is for labels, not features!
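
For contrast, a minimal sketch of LabelEncoder applied to a target, using made-up class labels:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target labels (one-dimensional, unlike a feature matrix)
y_labels = ["cat", "dog", "cat", "bird"]

le = LabelEncoder()
y_encoded = le.fit_transform(y_labels)      # classes are sorted: bird=0, cat=1, dog=2
decoded = le.inverse_transform(y_encoded)   # maps the integers back to the labels

print(le.classes_, y_encoded)
```

LabelEncoder takes a one-dimensional input and offers no way to specify a category order, which is why the two encoders above are the better fit for feature columns.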

In [48]:
# "Shape" is unordered, "Class" and "Size" are ordered
X = pd.DataFrame({'Shape':['square', 'square', 'oval', 'circle'],
                  'Class': ['third', 'first', 'second', 'third'],
                  'Size': ['S', 'S', 'L', 'XL']})

In [49]:
X

Unnamed: 0,Shape,Class,Size
0,square,third,S
1,square,first,S
2,oval,second,L
3,circle,third,XL


In [50]:
X["Shape"].shape

(4,)

In [51]:
X[["Shape"]].shape

(4, 1)

## left-to-right column order is alphabetical (circle, oval, square)

In [52]:
ohe = OneHotEncoder(sparse=False)  # return a dense array; this parameter is named sparse_output from scikit-learn 1.2 onward
ohe.fit_transform(X[["Shape"]])

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

## category ordering (within each feature) 

In [53]:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[
    ["first", "second", "third"],  # order of Class
    ["S", "M", "L", "XL"]          # order of Size  
])

oe.fit_transform(X[["Class", "Size"]])

array([[2., 0.],
       [0., 0.],
       [1., 2.],
       [2., 3.]])