# Exploratory data analysis

Before diving into the nitty-gritty of pipelines and preprocessing, let's do some exploratory analysis of the original, unprocessed Ames housing dataset. When you worked with this data in previous chapters, it had already been preprocessed so you could focus on the core XGBoost concepts. In this chapter, you'll do the preprocessing yourself!

In [31]:
import pandas as pd
import xgboost as xgb
df = pd.read_csv("dataset/ames_unprocessed_data.csv")
df = df.drop("YearBuilt", axis=1)
X = df.drop("SalePrice", axis=1)
y = df["SalePrice"]
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MSSubClass    1460 non-null   int64  
 1   MSZoning      1460 non-null   object 
 2   LotFrontage   1201 non-null   float64
 3   LotArea       1460 non-null   int64  
 4   Neighborhood  1460 non-null   object 
 5   BldgType      1460 non-null   object 
 6   HouseStyle    1460 non-null   object 
 7   OverallQual   1460 non-null   int64  
 8   OverallCond   1460 non-null   int64  
 9   Remodeled     1460 non-null   int64  
 10  GrLivArea     1460 non-null   int64  
 11  BsmtFullBath  1460 non-null   int64  
 12  BsmtHalfBath  1460 non-null   int64  
 13  FullBath      1460 non-null   int64  
 14  HalfBath      1460 non-null   int64  
 15  BedroomAbvGr  1460 non-null   int64  
 16  Fireplaces    1460 non-null   int64  
 17  GarageArea    1460 non-null   int64  
 18  PavedDrive    1460 non-null 
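
The same information can be pulled out of a DataFrame programmatically instead of being read off the `info()` table. The sketch below uses a tiny made-up frame (`toy` and its values are assumptions for illustration, not the Ames data) to list the columns with missing values and the non-numeric columns.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for the housing data (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor"],
})

# Count missing values per column and keep only the columns that have any
missing_counts = toy.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Non-numeric columns are the ones that will need encoding
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(columns_with_missing)
print(categorical_columns)
```

On the real data, this kind of check would flag LotFrontage for imputation and the object-typed columns for encoding.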

# Encoding categorical columns I: LabelEncoder

Now that you've seen what will need to be done to get the housing data ready for XGBoost, let's go through the process step-by-step.

First, you will need to fill in missing values: as you saw previously, the column LotFrontage has many missing values (only 1201 of its 1460 entries are non-null). Then, you will need to encode the categorical columns in the dataset so that they are represented numerically. This section takes the first step with LabelEncoder, which maps each category to an integer; one-hot encoding follows in the next section. The Supervised Learning with scikit-learn course has a refresher on the idea.

In [9]:


# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)

# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == "object")

# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()

# Print the head of the categorical columns
print(df[categorical_columns].head())

# Create LabelEncoder object: le
le = LabelEncoder()

# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))

# Print the head of the LabelEncoded categorical columns
print(df[categorical_columns].head())

  MSZoning Neighborhood BldgType HouseStyle PavedDrive
0       RL      CollgCr     1Fam     2Story          Y
1       RL      Veenker     1Fam     1Story          Y
2       RL      CollgCr     1Fam     2Story          Y
3       RL      Crawfor     1Fam     2Story          Y
4       RL      NoRidge     1Fam     2Story          Y
   MSZoning  Neighborhood  BldgType  HouseStyle  PavedDrive
0         3             5         0           5           2
1         3            24         0           2           2
2         3             5         0           5           2
3         3             6         0           5           2
4         3            15         0           5           2
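
To see where the integer codes come from, it can help to fit a LabelEncoder on a single small column and inspect its `classes_` attribute. This is a minimal sketch on a hand-written list (the list is an assumption for illustration); the codes are assigned in sorted order of the labels, so they differ from the codes in the full dataset.

```python
from sklearn.preprocessing import LabelEncoder

# A handful of neighborhood labels, written out by hand for illustration
neighborhoods = ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge"]

le = LabelEncoder()
codes = le.fit_transform(neighborhoods)

# classes_ holds the labels in sorted order; a label's position is its code
mapping = {label: code for code, label in enumerate(le.classes_)}

# inverse_transform recovers the original labels from the codes
recovered = le.inverse_transform(codes)

print(codes)
print(mapping)
```

Note that calling `fit_transform` again on another column refits the same encoder, so after the `apply` call above `le` only remembers the classes of the last column it saw.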


# Encoding categorical columns II: OneHotEncoder

So you have your categorical columns encoded numerically. Can you now move on to using pipelines and XGBoost? Not yet! In the categorical columns of this dataset, there is no natural ordering between the entries. As an example: using LabelEncoder, the CollgCr Neighborhood was encoded as 5, while the Veenker Neighborhood was encoded as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No, and allowing the model to assume such an ordering may result in poor performance.

As a result, there is another step needed: You have to apply a one-hot encoding to create binary, or "dummy" variables.

In [11]:
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

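# Create OneHotEncoder: ohe
# (scikit-learn 1.2 renamed this parameter to sparse_output; sparse was removed in 1.4)
ohe = OneHotEncoder(sparse=False)

# Apply OneHotEncoder to the whole DataFrame - every column, numeric ones included,
# is one-hot encoded, and the output is no longer a dataframe: df_encoded
df_encoded = ohe.fit_transform(df)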

# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])

# Print the shape of the original DataFrame
print(df.shape)

# Print the shape of the transformed array
print(df_encoded.shape)

[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(1460, 20)
(1460, 3257)
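
The 3257 output columns arise because `ohe.fit_transform(df)` treats every column of `df`, including numeric ones such as LotArea, as categorical. A minimal sketch of restricting the encoder to the categorical columns is shown below on a made-up three-row frame (`toy` is an assumption for illustration). It keeps the encoder's default sparse output and converts it with `.toarray()`, which sidesteps the `sparse` / `sparse_output` parameter rename.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the housing data
toy = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor"],
    "PavedDrive": ["Y", "Y", "N"],
    "LotArea": [8450, 9600, 9550],
})
categorical_columns = ["Neighborhood", "PavedDrive"]

# Encode only the categorical columns: 3 neighborhoods + 2 PavedDrive values
ohe = OneHotEncoder()
encoded = ohe.fit_transform(toy[categorical_columns]).toarray()

# Encoding the whole frame also expands LotArea into one column per distinct value
all_columns_width = OneHotEncoder().fit_transform(toy).shape[1]

print(encoded.shape)
print(all_columns_width)
```

The numeric columns would then be concatenated back alongside `encoded`, for example with `numpy.hstack`.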




# Encoding categorical columns III: DictVectorizer

One final trick before you dive into pipelines. The two-step process you just went through - LabelEncoder followed by OneHotEncoder - can be simplified by using a DictVectorizer.

Using a DictVectorizer on a DataFrame that has been converted to a dictionary allows you to get label encoding as well as one-hot encoding in one go.
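
As a small illustration of that behavior, the sketch below runs a DictVectorizer on two hand-written records (the records are assumptions for illustration). String values become one-hot columns named `feature=value`, while numeric values pass through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hand-written records with one string feature and one numeric feature
records = [
    {"Neighborhood": "CollgCr", "LotArea": 8450},
    {"Neighborhood": "Veenker", "LotArea": 9600},
]

dv = DictVectorizer(sparse=False)
encoded = dv.fit_transform(records)

# feature_names_ lists the output columns in order (sorted by name by default)
print(dv.feature_names_)
print(encoded)
```

This only one-hot encodes features whose values are strings, so the input must still contain the original category labels rather than integer codes.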

In [13]:
# Import DictVectorizer
from sklearn.feature_extraction import DictVectorizer

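# Convert df into a dictionary: df_dict
# (df was label-encoded in place above, so its categorical columns already hold integers
# and DictVectorizer passes them through unchanged rather than one-hot encoding them)
df_dict = df.to_dict("records")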

# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse = False)

# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)

# Print the resulting first five rows
print(df_encoded[:5,:])

# Print the vocabulary (how the features are mapped to columns in the resulting matrix.)
print(dv.vocabulary_)

[[3.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 2.000e+00 5.480e+02
  1.710e+03 1.000e+00 5.000e+00 8.450e+03 6.500e+01 6.000e+01 3.000e+00
  5.000e+00 5.000e+00 7.000e+00 2.000e+00 0.000e+00 2.085e+05]
 [3.000e+00 0.000e+00 0.000e+00 1.000e+00 1.000e+00 2.000e+00 4.600e+02
  1.262e+03 0.000e+00 2.000e+00 9.600e+03 8.000e+01 2.000e+01 3.000e+00
  2.400e+01 8.000e+00 6.000e+00 2.000e+00 0.000e+00 1.815e+05]
 [3.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 2.000e+00 6.080e+02
  1.786e+03 1.000e+00 5.000e+00 1.125e+04 6.800e+01 6.000e+01 3.000e+00
  5.000e+00 5.000e+00 7.000e+00 2.000e+00 1.000e+00 2.235e+05]
 [3.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 1.000e+00 6.420e+02
  1.717e+03 0.000e+00 5.000e+00 9.550e+03 6.000e+01 7.000e+01 3.000e+00
  6.000e+00 5.000e+00 7.000e+00 2.000e+00 1.000e+00 1.400e+05]
 [4.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 2.000e+00 8.360e+02
  2.198e+03 1.000e+00 5.000e+00 1.426e+04 8.400e+01 6.000e+01 3.000e+00
  1.500e+01 5.000e+00 8.000e

# Preprocessing within a pipeline

Now that you've seen what steps need to be taken individually to properly process the Ames housing data, let's use the much cleaner and more succinct `DictVectorizer` approach and put it alongside an `XGBRegressor` inside of a scikit-learn pipeline.

In [14]:
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor())]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

# Fit the pipeline
xgb_pipeline.fit(X.to_dict("records"), y)
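
Once a pipeline like this is fitted, new data goes through the same vectorizer automatically at prediction time. The sketch below shows the pattern on three hand-written records. To keep it dependent on scikit-learn only, it swaps in a `DecisionTreeRegressor` as a stand-in for `xgb.XGBRegressor`; the records and the step name `model` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Three hand-written training records and their prices
train_records = [
    {"Neighborhood": "CollgCr", "GrLivArea": 1710},
    {"Neighborhood": "Veenker", "GrLivArea": 1262},
    {"Neighborhood": "Crawfor", "GrLivArea": 1717},
]
train_prices = [208500, 181500, 140000]

# Stand-in model: DecisionTreeRegressor instead of xgb.XGBRegressor
pipeline = Pipeline([
    ("ohe_onestep", DictVectorizer(sparse=False)),
    ("model", DecisionTreeRegressor(random_state=0)),
])
pipeline.fit(train_records, train_prices)

# New records are passed as dictionaries, exactly like the training data
new_records = [{"Neighborhood": "Veenker", "GrLivArea": 1300}]
predictions = pipeline.predict(new_records)
print(predictions)
```

The point of the pattern is that the vectorizer fitted on the training records is reused for prediction, so training and new data are always encoded the same way.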

# Cross-validating your XGBoost model

In this exercise, you'll go one step further by using the pipeline you've created to preprocess and cross-validate your model.

In [15]:
import numpy as np
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Setup the pipeline steps: steps
# (reg:squarederror is the current name of the deprecated reg:linear objective)
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:squarederror"))]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

# Cross-validate the model
cross_val_scores = cross_val_score(X=X.to_dict("records"), y=y, estimator=xgb_pipeline, scoring="neg_mean_squared_error", cv = 10)

# Print the 10-fold RMSE
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))



10-fold RMSE:  30539.07162935346
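
The conversion from `neg_mean_squared_error` scores to an RMSE can be checked by hand. The sketch below uses three made-up fold scores (an assumption for illustration) and follows the same steps as the cell above: take absolute values, take square roots, then average. It also shows that this mean of per-fold RMSEs is not the same number as the square root of the mean MSE.

```python
import numpy as np

# Hypothetical negated MSE values, one per fold, as cross_val_score would return
neg_mse_scores = np.array([-4.0, -9.0, -16.0])

# Per-fold RMSE, then the average across folds (what the cell above prints)
per_fold_rmse = np.sqrt(np.abs(neg_mse_scores))
mean_of_fold_rmse = per_fold_rmse.mean()

# Alternative summary: square root of the average MSE
rmse_of_mean_mse = np.sqrt(np.abs(neg_mse_scores).mean())

print(mean_of_fold_rmse)
print(rmse_of_mean_mse)
```

Either summary is reasonable, as long as the same one is used when comparing models.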




# Kidney disease case study I: Categorical Imputer

You'll now continue your exploration of using pipelines with a dataset that requires significantly more wrangling. The chronic kidney disease dataset contains both categorical and numeric features, but contains lots of missing values. The goal here is to predict who has chronic kidney disease given various blood indicators as features.

As Sergey mentioned in the video, you'll be introduced to a new library, sklearn_pandas, that allows you to chain many more processing steps inside of a pipeline than are currently supported in scikit-learn. Specifically, you'll be able to use the DataFrameMapper() class to apply any arbitrary sklearn-compatible transformer on DataFrame columns, where the resulting output can be either a NumPy array or DataFrame.

In [94]:
import pandas as pd
import xgboost as xgb
df = pd.read_csv("dataset/chronic_kidney_disease.csv", header=None)

# Columns to convert to nullable integers and to floats; "?" marks a missing value
int_columns = [0, 1, 3, 4, 9, 15, 16]
float_columns = [2, 10, 11, 12, 13, 14]
for col in int_columns:
    df[col] = pd.to_numeric(df[col].replace('?', '', regex=False), errors='coerce').astype('Int64')
for col in float_columns:
    df[col] = pd.to_numeric(df[col].replace('?', '', regex=False), errors='coerce').astype('float')

print(df.head())
print(df.info())

# Note: head() shows the ckd label in column 24, while column 23 is a yes/no column.
# This cell uses column 23 as the target, so X still contains column 24.
X = df.drop(23, axis=1)
y = df[23].apply(lambda val: 1 if val == "yes" else 0)


   0   1      2   3   4       5         6           7           8     9   ...  \
0  48  80  1.020   1   0       ?    normal  notpresent  notpresent   121  ...   
1   7  50  1.020   4   0       ?    normal  notpresent  notpresent  <NA>  ...   
2  62  80  1.010   2   3  normal    normal  notpresent  notpresent   423  ...   
3  48  70  1.005   4   0  normal  abnormal     present  notpresent   117  ...   
4  51  80  1.010   2   0  normal    normal  notpresent  notpresent   106  ...   

   15    16   17   18   19  20    21   22   23   24  
0  44  7800  5.2  yes  yes  no  good   no   no  ckd  
1  38  6000    ?   no   no  no  good   no   no  ckd  
2  31  7500    ?   no  yes  no  poor   no  yes  ckd  
3  32  6700  3.9  yes   no  no  poor  yes  yes  ckd  
4  35  7300  4.6   no   no  no  good   no   no  ckd  

[5 rows x 25 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  -----------

In [98]:
# Import necessary modules
from sklearn_pandas import DataFrameMapper
from sklearn.impute import SimpleImputer

# Check number of nulls in each feature column
nulls_per_column = X.isnull().sum()
print(nulls_per_column)

# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object

# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()

# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()

# Apply numeric imputer
numeric_imputation_mapper = DataFrameMapper(
                                            [([numeric_feature], SimpleImputer(strategy="median")) for numeric_feature in non_categorical_columns],
                                            input_df=True,
                                            df_out=True
                                           )

# Apply categorical imputer
# (the column selector is a list so that SimpleImputer receives 2-D input)
categorical_imputation_mapper = DataFrameMapper(
                                                [([category_feature], SimpleImputer(strategy="most_frequent")) for category_feature in categorical_columns],
                                                input_df=True,
                                                df_out=True
                                               )

0       9
1      12
2      47
3      46
4      49
5       0
6       0
7       0
8       0
9      44
10     19
11     17
12     87
13     88
14     52
15     71
16    106
17      0
18      0
19      0
20      0
21      0
22      0
24      0
dtype: int64
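
The categorical columns report zero nulls even though `head()` shows `?` entries in them, because `?` is an ordinary string as far as pandas is concerned. The sketch below, on a made-up four-row frame (`toy` and its column names are assumptions for illustration), replaces the placeholder with NaN and then applies the two imputation strategies used above directly with `SimpleImputer`, without `DataFrameMapper`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature of the kidney data: one numeric and one categorical column
toy = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0],
    "rbc": pd.Series(["normal", "?", "normal", "abnormal"], dtype=object),
})

# Turn the "?" placeholder into a real missing value so it is counted and imputed
toy = toy.replace("?", np.nan)
nulls_per_column = toy.isnull().sum()

# Median for the numeric column, most frequent value for the categorical one
numeric_imputed = SimpleImputer(strategy="median").fit_transform(toy[["age"]])
categorical_imputed = SimpleImputer(strategy="most_frequent").fit_transform(toy[["rbc"]])

print(nulls_per_column)
print(numeric_imputed.ravel())
print(categorical_imputed.ravel())
```

Both calls receive a one-column DataFrame (double brackets), since `SimpleImputer` expects 2-D input.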


# Kidney disease case study II: Feature Union

Having separately imputed numeric as well as categorical columns, your task is now to use scikit-learn's FeatureUnion to concatenate their results, which are contained in two separate transformer objects - `numeric_imputation_mapper`, and `categorical_imputation_mapper`, respectively.

In [99]:
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion

# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
                                          ("num_mapper", numeric_imputation_mapper),
                                          ("cat_mapper", categorical_imputation_mapper)
                                         ])
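
FeatureUnion runs each of its transformers on the same input and stacks their outputs side by side. The sketch below shows this on a small array with two simple `FunctionTransformer` steps (the array and the step names are assumptions for illustration, standing in for the two imputation mappers).

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Two transformers applied to the same input: identity and element-wise square
union = FeatureUnion([
    ("identity", FunctionTransformer()),
    ("squared", FunctionTransformer(np.square)),
])

# Outputs are concatenated column-wise: 2 original columns + 2 squared columns
combined = union.fit_transform(X_toy)
print(combined)
```

The number of rows is unchanged; only the column count grows, in the order the transformers are listed.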

# Kidney disease case study III: Full pipeline

It's time to piece together all of the transforms along with an XGBClassifier to build the full pipeline!

The pipeline below chains the numeric_categorical_union that you created in the previous exercise with an XGBClassifier.

In [103]:
# import xgboost as xgb
# # Create full pipeline
# pipeline = Pipeline([
#                      ("featureunion", numeric_categorical_union),
#                     #  ("vectorizer", DictVectorizer(sort=False)),
#                      ("clf", xgb.XGBClassifier())
#                     ])
# # Perform cross-validation
# cross_val_scores = cross_val_score(estimator=pipeline, X=X, y=y, scoring="roc_auc", cv=3)

# # Print avg. AUC
# print("3-fold AUC: ", np.mean(cross_val_scores))

# Bringing it all together

In this final exercise of the course, you will combine your work from the previous exercises into one end-to-end XGBoost pipeline to really cement your understanding of preprocessing and pipelines in XGBoost.

In [104]:
# import numpy as np
# from sklearn.model_selection import RandomizedSearchCV
# # Create the parameter grid
# gbm_param_grid = {
#     'clf__learning_rate': np.arange(0.05, 1, 0.05),
#     'clf__max_depth': np.arange(3, 10, 1),
#     'clf__n_estimators': np.arange(50, 200, 50)
# }

# # Perform RandomizedSearchCV
# randomized_roc_auc = RandomizedSearchCV(estimator=pipeline, param_distributions=gbm_param_grid,scoring="roc_auc", n_iter=2, verbose=1, cv=2)

# # Fit the estimator
# randomized_roc_auc.fit(X, y)

# # Compute metrics
# print(randomized_roc_auc.best_estimator_)
# print(randomized_roc_auc.best_score_)
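
The `clf__` prefix in the parameter grid is how scikit-learn routes a hyperparameter to a named pipeline step: the step name, two underscores, then the parameter name. The sketch below shows a randomized search over the same grid on synthetic data. To stay within scikit-learn, it swaps in `GradientBoostingClassifier` as a stand-in for `xgb.XGBClassifier`; the synthetic data and the `scale` step are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data
X_toy, y_toy = make_classification(n_samples=80, n_features=5, random_state=0)

# Stand-in classifier: GradientBoostingClassifier instead of xgb.XGBClassifier
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", GradientBoostingClassifier(random_state=0)),
])

# "clf__<name>" targets the parameter <name> of the step called "clf"
param_distributions = {
    "clf__learning_rate": np.arange(0.05, 1, 0.05),
    "clf__max_depth": np.arange(3, 10, 1),
    "clf__n_estimators": np.arange(50, 200, 50),
}

search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    scoring="roc_auc",
    n_iter=2,
    cv=2,
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

With `n_iter=2` only two random combinations are tried, which keeps the run fast but explores very little of the grid.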