# Using XGBoost in pipelines

* `Pipeline` takes a list of named 2-tuples as input: (name, pipeline step)
* each tuple can contain any scikit-learn compatible estimator or transformer
* the pipeline implements `fit`, `predict`, `score` and other estimator methods
* it can be used as the input estimator for `GridSearchCV` and other scikit-learn functions
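
As a minimal sketch of the last bullet, a pipeline can be handed to `GridSearchCV` like any other estimator, with step parameters addressed as `<step name>__<parameter name>`. The synthetic data from `make_regression` and the `Ridge` model are stand-ins chosen for illustration and are not part of this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative stand-in for a real data set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each step is a (name, estimator) tuple; the last step is the model
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Ridge()),
])

# Parameters of a step are addressed as '<step name>__<parameter name>'
param_grid = {'regressor__alpha': [0.1, 1.0, 10.0]}

search = GridSearchCV(demo_pipeline, param_grid, scoring='neg_mean_squared_error', cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

Because the scaler sits inside the pipeline, it is refit on the training part of every fold, so no statistics from the held-out fold reach the model.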

In [1]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Load data
california = fetch_california_housing()
X, y = california.data, california.target

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor())
])

# Evaluate pipeline
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_squared_error', cv=5)

# Output scores
print("Cross-validated scores (neg_mean_squared_error):", scores)

# Print the average RMSE: take the square root of each fold's absolute (negated) MSE, then average over the folds
print("Average RMSE:", np.mean(np.sqrt(np.abs(scores))))



Cross-validated scores (neg_mean_squared_error): [-0.51763375 -0.34609122 -0.37323673 -0.44590967 -0.47167993]
Average RMSE: 0.6546496345930508
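
The conversion from `neg_mean_squared_error` to RMSE can be done in two slightly different ways. The sketch below reuses the five fold scores printed above (rounded as shown) and compares averaging the per-fold RMSE values, as the cell does, with taking a single square root of the averaged MSE. The variable names are illustrative only.

```python
import numpy as np

# Fold scores copied from the output above (negative MSE per fold)
scores = np.array([-0.51763375, -0.34609122, -0.37323673, -0.44590967, -0.47167993])

# Option 1 (used above): RMSE per fold, then average
mean_of_fold_rmse = np.mean(np.sqrt(np.abs(scores)))

# Option 2: average the MSE first, then take one square root
rmse_of_mean_mse = np.sqrt(np.mean(np.abs(scores)))

print(mean_of_fold_rmse)
print(rmse_of_mean_mse)
```

Option 2 is never smaller than option 1, because the square root is a concave function. Either is reasonable as long as the same one is used when comparing models.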


## preprocessing

### preprocessing 1: LabelEncoder and OneHotEncoder
* LabelEncoder: converts string labels to integers
* OneHotEncoder: converts integer labels to one-hot vectors

-> this two-step approach cannot be done within a pipeline, because `LabelEncoder` handles only one column at a time
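
A minimal sketch of the two steps on a single hypothetical column (the values are borrowed from the Ames `HouseStyle` column, the array itself is made up):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column
house_style = np.array(['2Story', '1Story', '2Story', '1.5Fin'])

# Step 1: strings -> integer codes (classes are sorted alphabetically)
le = LabelEncoder()
codes = le.fit_transform(house_style)
print(le.classes_)  # ['1.5Fin' '1Story' '2Story']
print(codes)        # [2 1 2 0]

# Step 2: integer codes -> one indicator column per category
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(codes.reshape(-1, 1)).toarray()
print(one_hot)
```

Note that `LabelEncoder` accepts a single 1-D array, so each column has to be encoded separately before the one-hot step.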

### Preprocessing 2: DictVectorizer

* DictVectorizer: converts lists of feature mappings (dictionaries) to vectors
* requires converting the DataFrame into a list of dictionaries first, e.g. with `df.to_dict(orient='records')`
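
A minimal sketch of this route on a hypothetical three-row frame (column names borrowed from the Ames data, values made up for illustration):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical mini data set: one string column, one numeric column
toy = pd.DataFrame({
    'BldgType': ['1Fam', 'Duplex', '1Fam'],
    'LotArea': [8450, 9600, 11250],
})

# One dictionary per row
records = toy.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
encoded = dv.fit_transform(records)

# String values become one column per category; numeric values pass through
print(dv.feature_names_)  # ['BldgType=1Fam', 'BldgType=Duplex', 'LotArea']
print(encoded)
```

String-valued entries are one-hot encoded automatically, which is why a single `DictVectorizer` step can stand in for the `LabelEncoder` plus `OneHotEncoder` combination.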

## Ames data set



In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:
# import data
ames = pd.read_csv('ames_unprocessed_data.csv')

In [4]:
# print the first few rows of the data
print(ames.head())

# print info 
print(ames.info())

# print summary statistics
print(ames.describe())


   MSSubClass MSZoning  LotFrontage  LotArea Neighborhood BldgType HouseStyle  \
0          60       RL         65.0     8450      CollgCr     1Fam     2Story   
1          20       RL         80.0     9600      Veenker     1Fam     1Story   
2          60       RL         68.0    11250      CollgCr     1Fam     2Story   
3          70       RL         60.0     9550      Crawfor     1Fam     2Story   
4          60       RL         84.0    14260      NoRidge     1Fam     2Story   

   OverallQual  OverallCond  YearBuilt  ...  GrLivArea  BsmtFullBath  \
0            7            5       2003  ...       1710             1   
1            6            8       1976  ...       1262             0   
2            7            5       2001  ...       1786             1   
3            7            5       1915  ...       1717             1   
4            8            5       2000  ...       2198             1   

   BsmtHalfBath  FullBath  HalfBath  BedroomAbvGr  Fireplaces  GarageArea  \
0  

In [5]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Fill missing values with 0
ames['LotFrontage'] = ames['LotFrontage'].fillna(0)

# Create a boolean mask for categorical columns
categorical_mask = (ames.dtypes == object)

# Get list of categorical column names
categorical_columns = ames.columns[categorical_mask].tolist()

# Print the head of the categorical columns
print(ames[categorical_columns].head())

# Create LabelEncoder object: le
le = LabelEncoder()

# Apply LabelEncoder to categorical columns
ames[categorical_columns] = ames[categorical_columns].apply(lambda x: le.fit_transform(x))

# Print the head of the LabelEncoded categorical columns
print(ames[categorical_columns].head())


  MSZoning Neighborhood BldgType HouseStyle PavedDrive
0       RL      CollgCr     1Fam     2Story          Y
1       RL      Veenker     1Fam     1Story          Y
2       RL      CollgCr     1Fam     2Story          Y
3       RL      Crawfor     1Fam     2Story          Y
4       RL      NoRidge     1Fam     2Story          Y
   MSZoning  Neighborhood  BldgType  HouseStyle  PavedDrive
0         3             5         0           5           2
1         3            24         0           2           2
2         3             5         0           5           2
3         3             6         0           5           2
4         3            15         0           5           2


The categorical variables need to be one-hot encoded before they can be used in the model. If this is not done, the model treats the integer codes as continuous values and assumes an ordering between the categories that does not exist, so it cannot use them effectively.

In [7]:
# import one hot encoder
from sklearn.preprocessing import OneHotEncoder

# Create OneHotEncoder: ohe
ohe = OneHotEncoder(sparse_output=False)

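# Apply OneHotEncoder to the whole DataFrame - output is no longer a dataframe: df_encoded
# Note: this encodes every column, numeric ones included, which is why the
# encoded array has thousands of columns
df_encoded = ohe.fit_transform(ames)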

# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])

# Print the shape of the original DataFrame
print(ames.shape)

# Print the shape of the transformed array
print(df_encoded.shape)

[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(1460, 21)
(1460, 3369)
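
The jump from 21 to 3369 columns happens because the cell above passes the whole DataFrame to `OneHotEncoder`, so every distinct numeric value also gets its own column. One way to restrict the encoding to the categorical columns is `ColumnTransformer`; the sketch below uses a hypothetical four-row frame rather than the Ames data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini data set with the same kind of columns as the Ames data
toy = pd.DataFrame({
    'MSZoning': ['RL', 'RM', 'RL', 'FV'],
    'BldgType': ['1Fam', '1Fam', 'Duplex', '1Fam'],
    'LotArea': [8450, 9600, 11250, 9550],
})
categorical = ['MSZoning', 'BldgType']

ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), categorical)],
    remainder='passthrough',  # numeric columns are kept unchanged
    sparse_threshold=0,       # always return a dense array
)
encoded = ct.fit_transform(toy)
print(encoded.shape)  # (4, 6): 3 + 2 indicator columns plus LotArea
```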


## simplify LabelEncoder and OneHotEncoder with DictVectorizer

In [9]:
# import DictVectorizer
from sklearn.feature_extraction import DictVectorizer

# convert ames to a dictionary
ames_dict = ames.to_dict(orient='records')

# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)

# Apply dv on ames_dict
df_encoded = dv.fit_transform(ames_dict)

# Print the resulting first five rows
print(df_encoded[:5, :])

# Print the vocabulary
print(dv.vocabulary_)


[[3.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 2.000e+00 5.480e+02
  1.710e+03 1.000e+00 5.000e+00 8.450e+03 6.500e+01 6.000e+01 3.000e+00
  5.000e+00 5.000e+00 7.000e+00 2.000e+00 0.000e+00 2.085e+05 2.003e+03]
 [3.000e+00 0.000e+00 0.000e+00 1.000e+00 1.000e+00 2.000e+00 4.600e+02
  1.262e+03 0.000e+00 2.000e+00 9.600e+03 8.000e+01 2.000e+01 3.000e+00
  2.400e+01 8.000e+00 6.000e+00 2.000e+00 0.000e+00 1.815e+05 1.976e+03]
 [3.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 2.000e+00 6.080e+02
  1.786e+03 1.000e+00 5.000e+00 1.125e+04 6.800e+01 6.000e+01 3.000e+00
  5.000e+00 5.000e+00 7.000e+00 2.000e+00 1.000e+00 2.235e+05 2.001e+03]
 [3.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 1.000e+00 6.420e+02
  1.717e+03 0.000e+00 5.000e+00 9.550e+03 6.000e+01 7.000e+01 3.000e+00
  6.000e+00 5.000e+00 7.000e+00 2.000e+00 1.000e+00 1.400e+05 1.915e+03]
 [4.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 2.000e+00 8.360e+02
  2.198e+03 1.000e+00 5.000e+00 1.426e+04 8.400e+01 6.000e+0

### Preprocessing within a pipeline

In [4]:
# import DictVectorizer and Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

# Fill LotFrontage missing values with 0
ames['LotFrontage'] = ames['LotFrontage'].fillna(0)

# set up pipeline steps
steps = [
    ('ohe_onestep', DictVectorizer(sparse=False)),
    ('xgb_model', XGBRegressor())
]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

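# fit the pipeline on the features only (SalePrice is the target, so it is dropped from the inputs)
xgb_pipeline.fit(ames.drop(columns='SalePrice').to_dict(orient='records'), ames['SalePrice'])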

### xgboost in pipeline

* sklearn_pandas.DataFrameMapper: applies sklearn-compatible transformers to selected columns of a pandas DataFrame
* sklearn.impute.SimpleImputer: imputes missing values
* sklearn.pipeline.FeatureUnion: combines the outputs of multiple transformers or pipelines into a single feature matrix
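
As a small illustration of the `SimpleImputer` bullet, the sketch below fills a numeric column with its median and a categorical column with its most frequent value. The two-column frame is hypothetical, loosely modelled on the kidney disease data used later.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical columns with gaps
toy = pd.DataFrame({
    'age': [48.0, np.nan, 62.0, 51.0],
    'appet': pd.Series(['good', 'poor', np.nan, 'good'], dtype=object),
})

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

# SimpleImputer expects 2-D input, hence the double brackets
age_filled = num_imputer.fit_transform(toy[['age']])
appet_filled = cat_imputer.fit_transform(toy[['appet']])

print(age_filled.ravel())    # the gap is filled with 51.0, the median of 48, 51, 62
print(appet_filled.ravel())  # the gap is filled with 'good'
```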



In [5]:
# import DictVectorizer, Pipeline, cross_val_score and XGBRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Fill LotFrontage missing values with 0
ames['LotFrontage'] = ames['LotFrontage'].fillna(0)

# set up pipeline steps
steps = [
    ('ohe_onestep', DictVectorizer(sparse=False)),
    ('xgb_model', XGBRegressor(max_depth=2, objective='reg:squarederror'))
]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

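# Cross-validate the model
# Caution: the feature dictionaries are built from the full frame, so SalePrice (the target)
# is also among the inputs and the resulting RMSE is optimistic; use
# ames.drop(columns='SalePrice').to_dict(orient='records') for a leak-free estimate
cross_val_scores = cross_val_score(xgb_pipeline, ames.to_dict(orient='records'), ames['SalePrice'], scoring='neg_mean_squared_error', cv=10)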

# Print the 10-fold RMSE
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))

10-fold RMSE:  3349.964400032365
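
The cell above builds the feature dictionaries from the full `ames` frame, so `SalePrice` is present among the features as well as being the target. The following sketch shows the leak-free pattern of separating features and target before calling `to_dict`. The frame is synthetic, and `GradientBoostingRegressor` is used as a stand-in so the example runs without xgboost; `XGBRegressor` could be dropped into the same step.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the Ames frame: two features plus the target
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    'BldgType': rng.choice(['1Fam', 'Duplex'], size=n),
    'GrLivArea': rng.integers(800, 2500, size=n),
})
toy['SalePrice'] = 100 * toy['GrLivArea'] + rng.normal(0, 5000, size=n)

# Separate features and target BEFORE building the dictionaries
features = toy.drop(columns='SalePrice')
target = toy['SalePrice']

demo_pipeline = Pipeline([
    ('ohe_onestep', DictVectorizer(sparse=False)),
    ('model', GradientBoostingRegressor(random_state=0)),  # stand-in for XGBRegressor
])

scores = cross_val_score(demo_pipeline, features.to_dict(orient='records'), target,
                         scoring='neg_mean_squared_error', cv=5)
print("5-fold RMSE:", np.mean(np.sqrt(np.abs(scores))))
```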


## chronic kidney disease data set

In [50]:
# import data
ckd = pd.read_csv('ChronicKidneyDisease.csv')

# trim whitespace (including tab characters, \t) from the ckd['classification'] column
ckd['classification'] = ckd['classification'].str.strip()

print(ckd.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              400 non-null    int64  
 1   age             391 non-null    float64
 2   bp              388 non-null    float64
 3   sg              353 non-null    float64
 4   al              354 non-null    float64
 5   su              351 non-null    float64
 6   rbc             248 non-null    object 
 7   pc              335 non-null    object 
 8   pcc             396 non-null    object 
 9   ba              396 non-null    object 
 10  bgr             356 non-null    float64
 11  bu              381 non-null    float64
 12  sc              383 non-null    float64
 13  sod             313 non-null    float64
 14  pot             312 non-null    float64
 15  hemo            348 non-null    float64
 16  pcv             330 non-null    object 
 17  wc              295 non-null    obj

In [51]:
# import dataframemapper, simpleimputer
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer

X, y = ckd.drop(['id','classification'], axis=1), ckd['classification']

# print unique values of ckd
#print(y.unique())


# change ckd to 0/1
y = y.map({'ckd': 1, 'notckd': 0})

# check nulls for X
print(X.isnull().sum())

print(y.isnull().sum())


age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       70
wc       105
rc       130
htn        2
dm         2
cad        2
appet      1
pe         1
ane        1
dtype: int64
0


In [52]:
# print the value counts of classification in ckd
print(ckd['classification'].value_counts())


ckd       250
notckd    150
Name: classification, dtype: int64


In [60]:

from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer

# Create a boolean mask for categorical columns
categorical_mask = (X.dtypes == object)

# Get list of categorical column names
categorical_columns = X.columns[categorical_mask].tolist()

# get list of non-categorical columns
non_categorical_columns = X.columns[~categorical_mask].tolist()

# apply numeric imputer
# (each column is selected as a one-element list so the imputer receives the 2-D input it expects)
numeric_imputation_mapper = DataFrameMapper(
                                            [([numeric_feature], SimpleImputer(strategy='median')) for numeric_feature in non_categorical_columns],
                                            input_df=True,
                                            df_out=True
                                           )

# apply categorical imputer
categorical_imputation_mapper = DataFrameMapper(
                                                [([category_feature], SimpleImputer(strategy='most_frequent')) for category_feature in categorical_columns],
                                                input_df=True,
                                                df_out=True
                                               )


### combine numerical and categorical preprocessing

In [61]:
# import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
                                          ('num_mapper', numeric_imputation_mapper),
                                          ('cat_mapper', categorical_imputation_mapper)
                                         ])
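
`FeatureUnion` fits each of its transformers on the same input and concatenates their outputs column-wise. The sketch below shows this on a small hypothetical array; the `take_columns` helper and the step names are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical numeric matrix with missing values
X_demo = np.array([[1.0, np.nan],
                   [2.0, 10.0],
                   [np.nan, 30.0]])

def take_columns(X, cols):
    """Keep only the given column indices."""
    return X[:, cols]

union = FeatureUnion([
    ('first_median', Pipeline([
        ('select', FunctionTransformer(take_columns, kw_args={'cols': [0]})),
        ('impute', SimpleImputer(strategy='median')),
    ])),
    ('second_mean', Pipeline([
        ('select', FunctionTransformer(take_columns, kw_args={'cols': [1]})),
        ('impute', SimpleImputer(strategy='mean')),
    ])),
])

# Each branch produces one column; the union stacks them side by side
combined = union.fit_transform(X_demo)
print(combined)
```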

### full pipeline

In [56]:
print(X.shape)

(400, 24)



In [95]:
print(non_categorical_columns)
print(categorical_columns)

['age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo']
['rbc', 'pc', 'pcc', 'ba', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']


In [131]:
import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
import numpy as np

# Custom transformer to reshape columns
class ReshapeTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.values.reshape(-1, 1)

# Ensure X is a DataFrame
if not isinstance(X, pd.DataFrame):
    X = pd.DataFrame(X)

# Define numeric and categorical columns (both lists were created in an earlier cell)
numeric_columns = non_categorical_columns

# Apply numeric imputer with reshaping
numeric_imputation_mapper = DataFrameMapper(
    [(numeric_feature, [ReshapeTransformer(), SimpleImputer(strategy='median')]) for numeric_feature in numeric_columns],
    input_df=True,
    df_out=True,
    default=None
)

# Apply categorical imputer with reshaping
categorical_imputation_mapper = DataFrameMapper(
    [(category_feature, [ReshapeTransformer(), SimpleImputer(strategy='most_frequent')]) for category_feature in categorical_columns],
    input_df=True,
    df_out=True,
    default=None
)

# Combine the numeric and categorical transformations using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_imputation_mapper, numeric_columns),
        ('cat', categorical_imputation_mapper, categorical_columns)
    ]
)

# Create a transformer to convert the output of ColumnTransformer back to DataFrame
class DataFrameTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return pd.DataFrame(X, columns=self.columns)

# Create dictifier
class Dictifier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            return X.to_dict(orient='records')
        else:
            return X

# Create full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('to_dataframe', DataFrameTransformer(columns=numeric_columns + categorical_columns)),
    ('dictifier', Dictifier()),
    ('vectorizer', DictVectorizer(sort=False)),
    ('clf', XGBClassifier(max_depth=3))
])

# Check the output of each step
# Step 1: Combined numeric and categorical transformations
combined_transformed = preprocessor.fit_transform(X)

# Manually specify the column names for the combined DataFrame
combined_columns = numeric_columns + categorical_columns
combined_transformed_df = pd.DataFrame(combined_transformed, columns=combined_columns)
print("Combined transformed:\n", combined_transformed_df.head())

# Step 2: Full pipeline
pipeline.fit(X, y)

# Cross-validate the model
cross_val_scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=3, error_score='raise')

# Print the 3-fold AUC scores
print("3-fold AUC: ", np.mean(cross_val_scores))


# Optional: Hyperparameter tuning using GridSearchCV (already imported above)

param_grid = {
    'clf__max_depth': [3, 5, 7],
    'clf__n_estimators': [100, 200],
    'clf__learning_rate': [0.01, 0.1, 0.2]
}

grid_search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=3)
grid_search.fit(X, y)

print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation AUC: ", grid_search.best_score_)

Combined transformed:
     age    bp     sg   al   su    bgr    bu   sc    sod  pot  ...          ba  \
0  48.0  80.0   1.02  1.0  0.0  121.0  36.0  1.2  138.0  4.4  ...  notpresent   
1   7.0  50.0   1.02  4.0  0.0  121.0  18.0  0.8  138.0  4.4  ...  notpresent   
2  62.0  80.0   1.01  2.0  3.0  423.0  53.0  1.8  138.0  4.4  ...  notpresent   
3  48.0  70.0  1.005  4.0  0.0  117.0  56.0  3.8  111.0  2.5  ...  notpresent   
4  51.0  80.0   1.01  2.0  0.0  106.0  26.0  1.4  138.0  4.4  ...  notpresent   

  pcv    wc   rc  htn   dm cad appet   pe  ane  
0  44  7800  5.2  yes  yes  no  good   no   no  
1  38  6000  5.2   no   no  no  good   no   no  
2  31  7500  5.2   no  yes  no  poor   no  yes  
3  32  6700  3.9  yes   no  no  poor  yes  yes  
4  35  7300  4.6   no   no  no  good   no   no  

[5 rows x 24 columns]
3-fold AUC:  0.9987177280550773
Best parameters found:  {'clf__learning_rate': 0.1, 'clf__max_depth': 3, 'clf__n_estimators': 100}
Best cross-validation AUC:  0.999119334480



### Explanation

1. **Numeric and Categorical Columns**: Ensure `numeric_columns` and `categorical_columns` are defined.
2. **Numeric Imputer**: Apply the numeric imputation only to `numeric_columns`.
3. **Categorical Imputer**: Apply the categorical imputation only to `categorical_columns`.
4. **Combined Transformations**: Use `ColumnTransformer` to combine numeric and categorical transformations.
5. **DataFrameTransformer**: Convert the output of `ColumnTransformer` back to a DataFrame.
6. **Pipeline**: Create a pipeline that includes the combined transformations, conversion to DataFrame, dictifier, vectorizer, and classifier.
7. **Check Output**: Print the combined transformed DataFrame to verify the transformations.
8. **Fit Pipeline**: Fit the pipeline to the data.
9. **Cross-Validation**: Perform cross-validation on the pipeline and print the mean 3-fold AUC.
10. **Grid Search**: Optionally tune `max_depth`, `n_estimators` and `learning_rate` with `GridSearchCV`.

Together, these steps define, fit, and evaluate the pipeline using cross-validation.
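
For comparison, the same idea (impute numeric and categorical columns separately, encode the categoricals, then classify) can be sketched with `ColumnTransformer` alone, without `sklearn_pandas` or custom transformers. This is an alternative under stated assumptions, not the approach used in the cell above: the data are synthetic, the column names only mimic the kidney data, and `GradientBoostingClassifier` stands in for `XGBClassifier` so the example runs without xgboost.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with gaps in both a numeric and a categorical column
rng = np.random.default_rng(0)
n = 120
hemo = rng.normal(12.0, 2.0, size=n)
label = (hemo < 12.0).astype(int)
toy = pd.DataFrame({
    'hemo': hemo,
    'appet': pd.Series(np.where(label == 1, 'poor', 'good'), dtype=object),
})
toy.iloc[::10, 0] = np.nan   # knock out some numeric values
toy.iloc[5::10, 1] = np.nan  # knock out some categorical values

numeric_columns = ['hemo']
categorical_columns = ['appet']

preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), numeric_columns),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_columns),
])

clf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', GradientBoostingClassifier(random_state=0)),  # stand-in for XGBClassifier
])

auc = cross_val_score(clf_pipeline, toy, label, scoring='roc_auc', cv=3)
print("3-fold AUC:", auc.mean())
```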

### RandomizedSearchCV

The cell below repeats the pipeline from the previous section and tunes its hyperparameters with `RandomizedSearchCV` instead of `GridSearchCV`:



In [132]:
import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
import numpy as np

# Custom transformer to reshape columns
class ReshapeTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.values.reshape(-1, 1)

# Ensure X is a DataFrame
if not isinstance(X, pd.DataFrame):
    X = pd.DataFrame(X)

# Define numeric and categorical columns (both lists were created in an earlier cell)
numeric_columns = non_categorical_columns

# Apply numeric imputer with reshaping
numeric_imputation_mapper = DataFrameMapper(
    [(numeric_feature, [ReshapeTransformer(), SimpleImputer(strategy='median')]) for numeric_feature in numeric_columns],
    input_df=True,
    df_out=True,
    default=None
)

# Apply categorical imputer with reshaping
categorical_imputation_mapper = DataFrameMapper(
    [(category_feature, [ReshapeTransformer(), SimpleImputer(strategy='most_frequent')]) for category_feature in categorical_columns],
    input_df=True,
    df_out=True,
    default=None
)

# Combine the numeric and categorical transformations using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_imputation_mapper, numeric_columns),
        ('cat', categorical_imputation_mapper, categorical_columns)
    ]
)

# Create a transformer to convert the output of ColumnTransformer back to DataFrame
class DataFrameTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return pd.DataFrame(X, columns=self.columns)

# Create dictifier
class Dictifier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            return X.to_dict(orient='records')
        else:
            return X

# Create full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('to_dataframe', DataFrameTransformer(columns=numeric_columns + categorical_columns)),
    ('dictifier', Dictifier()),
    ('vectorizer', DictVectorizer(sort=False)),
    ('clf', XGBClassifier(max_depth=3))
])

# Check the output of each step
# Step 1: Combined numeric and categorical transformations
combined_transformed = preprocessor.fit_transform(X)

# Manually specify the column names for the combined DataFrame
combined_columns = numeric_columns + categorical_columns
combined_transformed_df = pd.DataFrame(combined_transformed, columns=combined_columns)
print("Combined transformed:\n", combined_transformed_df.head())

# Step 2: Full pipeline
pipeline.fit(X, y)

# Cross-validate the model
cross_val_scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=3, error_score='raise')

# Print the 3-fold AUC scores
print("3-fold AUC: ", np.mean(cross_val_scores))

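# Hyperparameter tuning using RandomizedSearchCV
# Note: np.arange(0.5, 1.0, 0.5) contains only 0.5, so subsample is effectively fixed;
# a finer step (for example 0.05) would be needed to actually search over it
param_distributions = {
    'clf__subsample': np.arange(0.5, 1.0, 0.5),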
    'clf__max_depth': np.arange(3, 20, 1),
    'clf__colsample_bytree': np.arange(0.1, 1.05, 0.05)
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    scoring='roc_auc',
    cv=4,
    n_iter=10,
    random_state=42,
    error_score='raise'
)

random_search.fit(X, y)

print("Best parameters found: ", random_search.best_params_)
print("Best cross-validation AUC: ", random_search.best_score_)

Combined transformed:
     age    bp     sg   al   su    bgr    bu   sc    sod  pot  ...          ba  \
0  48.0  80.0   1.02  1.0  0.0  121.0  36.0  1.2  138.0  4.4  ...  notpresent   
1   7.0  50.0   1.02  4.0  0.0  121.0  18.0  0.8  138.0  4.4  ...  notpresent   
2  62.0  80.0   1.01  2.0  3.0  423.0  53.0  1.8  138.0  4.4  ...  notpresent   
3  48.0  70.0  1.005  4.0  0.0  117.0  56.0  3.8  111.0  2.5  ...  notpresent   
4  51.0  80.0   1.01  2.0  0.0  106.0  26.0  1.4  138.0  4.4  ...  notpresent   

  pcv    wc   rc  htn   dm cad appet   pe  ane  
0  44  7800  5.2  yes  yes  no  good   no   no  
1  38  6000  5.2   no   no  no  good   no   no  
2  31  7500  5.2   no  yes  no  poor   no  yes  
3  32  6700  3.9  yes   no  no  poor  yes  yes  
4  35  7300  4.6   no   no  no  good   no   no  

[5 rows x 24 columns]
3-fold AUC:  0.9987177280550773
Best parameters found:  {'clf__subsample': 0.5, 'clf__max_depth': 12, 'clf__colsample_bytree': 0.1}
Best cross-validation AUC:  0.99978549978

In [133]:
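# cross_val_scores holds ROC AUC values, not squared errors, so an RMSE is not
# meaningful here; print the mean AUC instead
print("3-fold AUC: ", np.mean(cross_val_scores))

# print the best model
print(random_search.best_estimator_)



3-fold AUC:  0.9987177280550773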
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  DataFrameMapper(default=None,
                                                                  df_out=True,
                                                                  drop_cols=[],
                                                                  features=[('age',
                                                                             [ReshapeTransformer(),
                                                                              SimpleImputer(strategy='median')]),
                                                                            ('bp',
                                                                             [ReshapeTransformer(),
                                                                              SimpleImputer(strategy='median')]),
                                                  



### Explanation

1. **Numeric and Categorical Columns**: Ensure `numeric_columns` and `categorical_columns` are defined.
2. **Numeric Imputer**: Apply the median imputation only to `numeric_columns`.
3. **Categorical Imputer**: Apply the most-frequent imputation only to `categorical_columns`.
4. **Combined Transformations**: Use `ColumnTransformer` to combine numeric and categorical transformations.
5. **Pipeline**: Create a pipeline that includes the combined transformations, conversion to DataFrame, dictifier, vectorizer, and classifier.
6. **Parameter Distributions**: Define the candidate values for `subsample`, `max_depth` and `colsample_bytree`.
7. **RandomizedSearchCV**: Create and fit `RandomizedSearchCV` with the pipeline and parameter distributions (10 sampled candidates, 4-fold cross-validation).
8. **Print Results**: Print the best parameters and best cross-validation AUC score.

This version of the pipeline keeps the same preprocessing as before and uses `RandomizedSearchCV` to find the best hyperparameters for the `XGBClassifier`. No scaling step is included; tree-based models such as XGBoost generally do not require scaled features.
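
When defining the search space, it helps to check what each candidate array actually contains. The sketch below inspects the `subsample` array used above and shows, as a hypothetical alternative rather than the notebook's setup, how `scipy.stats` distributions can be sampled instead of fixed arrays. `ParameterSampler` is used only to preview candidates of the kind `RandomizedSearchCV` would draw.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# np.arange(start, stop, step) excludes stop, so this array has a single value
print(np.arange(0.5, 1.0, 0.5))  # [0.5]

# Hypothetical alternative: continuous / integer distributions instead of arrays
param_distributions = {
    'clf__subsample': uniform(loc=0.5, scale=0.5),         # values in [0.5, 1.0]
    'clf__max_depth': randint(3, 20),                      # integers 3..19
    'clf__colsample_bytree': uniform(loc=0.1, scale=0.9),  # values in [0.1, 1.0]
}

# Preview five sampled parameter combinations
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))
for params in candidates:
    print(params)
```

The same dictionary could be passed as `param_distributions` to `RandomizedSearchCV`, which samples `n_iter` combinations from it.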