### Pipelines for pre-processing and model fitting

In [None]:
numeric_features = X_train.select_dtypes(exclude='object').columns.tolist()

# Creating a pipeline to preprocess numerical features.
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("poly_int", PolynomialFeatures(degree=2, include_bias=False)),
        ("scaler", StandardScaler())
    ]
).set_output(transform='pandas')

print(numeric_features)
numeric_transformer

['mileage', 'reg_code', 'standard_colour', 'standard_make', 'standard_model', 'vehicle_condition', 'body_type', 'crossover_car_and_van', 'fuel_type_diesel', 'fuel_type_electric', 'fuel_type_petrol', 'fuel_type_petrol_hybrid', 'fuel_type_petrol_plug_in_hybrid', 'age']


In [None]:
numeric_transformer.fit_transform(X_train[numeric_features]).sample()

public_reference,mileage,reg_code,standard_colour,standard_make,standard_model,vehicle_condition,body_type,crossover_car_and_van,fuel_type_diesel,fuel_type_electric,...,fuel_type_petrol^2,fuel_type_petrol fuel_type_petrol_hybrid,fuel_type_petrol fuel_type_petrol_plug_in_hybrid,fuel_type_petrol age,fuel_type_petrol_hybrid^2,fuel_type_petrol_hybrid fuel_type_petrol_plug_in_hybrid,fuel_type_petrol_hybrid age,fuel_type_petrol_plug_in_hybrid^2,fuel_type_petrol_plug_in_hybrid age,age^2
202010155051544,-0.614961,-0.919895,1.289227,-1.553368,0.75771,-0.257121,1.196605,-0.053575,1.24398,-0.100777,...,-1.104874,0.0,0.0,-0.68394,-0.18744,0.0,-0.135193,-0.107614,-0.079834,0.807549


Here I define a Pipeline called numeric_transformer to preprocess the numerical features. It has three steps.

SimpleImputer (from the impute module of scikit-learn) fills missing values in each numerical feature with that feature's median, as set by strategy="median".

PolynomialFeatures (from the preprocessing module of scikit-learn) generates all polynomial terms up to the specified degree. With degree=2 it adds the square of every feature and the pairwise interaction terms between features. Setting include_bias=False omits the constant column of ones.

StandardScaler (also from the preprocessing module) standardises each resulting feature by subtracting its mean and scaling to unit variance. This puts the features on comparable scales, which benefits scale-sensitive techniques such as regularised linear models and distance-based methods.

Finally, set_output(transform='pandas') makes sure that the output of the pipeline is a pandas DataFrame with named columns rather than a NumPy array.

In [None]:
categorical_features = X_train.select_dtypes(include='object').columns.tolist()

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop='if_binary')),
    ]
).set_output(transform='pandas')

print(categorical_features)
categorical_transformer

[]


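Here I define another Pipeline called categorical_transformer to preprocess the categorical features.

SimpleImputer with strategy="most_frequent" fills missing values with the most common category in each column. OneHotEncoder then transforms the categorical variables into a numerical representation that can be used as input for machine learning algorithms. With handle_unknown="ignore", a category that was not seen during fitting is encoded as all zeros instead of raising an error, and drop='if_binary' keeps a single column for features with only two categories. Note that the printed list of categorical features is empty, so every column in X_train is already numeric and this pipeline receives no columns here.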

In [None]:
transform_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder='passthrough',
    verbose_feature_names_out=False
).set_output(transform="pandas")

transform_preprocessor.fit_transform(X_train).sample()

public_reference,mileage,reg_code,standard_colour,standard_make,standard_model,vehicle_condition,body_type,crossover_car_and_van,fuel_type_diesel,fuel_type_electric,...,fuel_type_petrol^2,fuel_type_petrol fuel_type_petrol_hybrid,fuel_type_petrol fuel_type_petrol_plug_in_hybrid,fuel_type_petrol age,fuel_type_petrol_hybrid^2,fuel_type_petrol_hybrid fuel_type_petrol_plug_in_hybrid,fuel_type_petrol_hybrid age,fuel_type_petrol_plug_in_hybrid^2,fuel_type_petrol_plug_in_hybrid age,age^2
202009224072750,0.155922,0.930538,1.289227,0.385733,-1.367135,-0.257121,-0.644944,-0.053575,-0.803872,-0.100777,...,0.905081,0.0,0.0,-0.393329,-0.18744,0.0,-0.135193,-0.107614,-0.079834,-0.92262


The ColumnTransformer is a powerful tool in scikit-learn that allows you to apply different transformations to different columns of your dataset. It is especially useful when you have a dataset with a mixture of numeric and categorical features and you want to apply specific preprocessing steps to each type of feature.

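Here the ColumnTransformer combines two transformers: numeric_transformer, applied to the columns in numeric_features, and categorical_transformer, applied to the columns in categorical_features. remainder='passthrough' keeps any column that appears in neither list unchanged, and verbose_feature_names_out=False keeps the output column names free of the num__ and cat__ prefixes.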