**Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.**

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

**Cleaner Code:** Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.

**Fewer Bugs:** There are fewer opportunities to misapply a step or forget a preprocessing step.

**Easier to Productionize:** It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.

**More Options for Model Validation:** You will see an example in the next tutorial, which covers cross-validation.
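As a rough preview of that last point, the sketch below passes a whole pipeline to `cross_val_score`, so the imputer is refit on the training part of every fold. The toy data, column names, and variable names are invented for illustration and are not taken from the Melbourne dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data (not the Melbourne data)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "rooms": rng.integers(1, 6, size=60).astype(float),
    "distance": rng.uniform(1, 20, size=60),
})
toy_y = 100_000 * toy_X["rooms"] - 5_000 * toy_X["distance"]
toy_X.iloc[::7, 1] = np.nan  # knock out some "distance" values

toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# scikit-learn scorers are "higher is better", so MAE comes back negated
scores = -cross_val_score(toy_pipeline, toy_X, toy_y, cv=3,
                          scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
```

Because the preprocessing lives inside the pipeline, no statistics from a validation fold leak into the imputer.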

In [29]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [30]:
data = pd.read_csv('melb_data.csv')

# Separate target from predictors
Y = data.Price
X = data.drop('Price', axis=1)

# Divide data into training and validation subsets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train.columns
                    if X_train[cname].nunique() < 10 and X_train[cname].dtype == 'object']

# Select numerical columns
numerical_cols = [cname for cname in X_train.columns
                  if X_train[cname].dtype in ['int64', 'float64']]



# Keep selected columns only
my_cols = categorical_cols + numerical_cols

X_train_ready = X_train[my_cols].copy()
X_test_ready  = X_test[my_cols].copy()


In [31]:
X_train_ready.head()  # note that some values are NaN

| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | 1940.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 1.0 | 193.0 | NaN | NaN | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 1.0 | 555.0 | NaN | NaN | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 1.0 | 265.0 | NaN | 1995.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 2.0 | 673.0 | 673.0 | 1970.0 | -37.7623 | 144.8272 | 4217.0 |


**Step 1: Define Preprocessing Steps**

Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:

- imputes missing values in numerical data, and

- imputes missing values and applies a one-hot encoding to categorical data.
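To make the categorical branch concrete, here is a minimal sketch on an invented single-column frame (the frames `train` and `new` and the name `cat_steps` are hypothetical). It also shows what `handle_unknown='ignore'` does: a category that was not seen during fitting is encoded as all zeros instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column; the values are made up for illustration
train = pd.DataFrame({"Type": ["h", "u", np.nan, "h"]})
new = pd.DataFrame({"Type": ["u", "t"]})  # "t" never appears in train

cat_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN becomes "h"
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

cat_steps.fit(train)
# OneHotEncoder returns a sparse matrix by default, so convert for printing
encoded = cat_steps.transform(new).toarray()
print(encoded)  # columns correspond to the learned categories "h" and "u"
```

The row for "t" contains only zeros, because the encoder learned just the categories "h" and "u".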

**ColumnTransformer-** applies different preprocessing steps to different subsets of columns. This is particularly useful when a dataset mixes numerical and categorical features, or when specific feature groups need their own transformations.


In [32]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Preprocessing for numerical data
# (strategy='constant' fills missing values with fill_value, which defaults to 0 for numerical data)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('oneHot', OneHotEncoder(handle_unknown='ignore'))
])


# Bundle preprocessing for numerical and categorical data
# Combine transformers using ColumnTransformer
perprocessor = ColumnTransformer(transformers=[
                    ('numerical', numerical_transformer, numerical_cols),
                    ('categorical', categorical_transformer, categorical_cols)
                     ])
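
One detail worth knowing: columns that are not listed in any transformer are dropped by default (`remainder='drop'`), so a ColumnTransformer can be fit on a frame that still contains unselected columns. The sketch below illustrates this on an invented frame; `toy`, `ct`, and the "Address" values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; "Address" stands in for a column we did not select
toy = pd.DataFrame({
    "Rooms": [2.0, 3.0, None],
    "Landsize": [150.0, None, 400.0],
    "Address": ["1 A St", "2 B St", "3 C St"],
})

ct = ColumnTransformer(transformers=[
    ("numerical", SimpleImputer(strategy="constant", fill_value=0),
     ["Rooms", "Landsize"]),
])

out = ct.fit_transform(toy)
print(out.shape)  # only the two listed columns survive
print(out)
```

Passing `remainder='passthrough'` instead would keep the unlisted columns unchanged.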



**Step 2: Define the Model**

Next, we define a random forest model with the familiar RandomForestRegressor class.

In [33]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

**Step 3: Create and Evaluate the Pipeline**

Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:

- With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)

- With the pipeline, we supply the unprocessed features in X_test to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)
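
The contrast described above can be sketched as follows. To keep it short, this example uses only numerical columns and invented data; all names here (`X_tr`, `X_va`, `pipe`, and so on) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data (numerical only, to keep the sketch short)
X_tr = pd.DataFrame({"Rooms": [2, 3, np.nan, 4, 3, 2],
                     "Car": [1, np.nan, 2, 2, 1, 1]})
y_tr = pd.Series([500, 700, 650, 900, 720, 480])
X_va = pd.DataFrame({"Rooms": [3, np.nan], "Car": [np.nan, 2]})

# Manual route: every step must be repeated, in order, on the validation data
imputer = SimpleImputer(strategy="median")
X_tr_imp = imputer.fit_transform(X_tr)
X_va_imp = imputer.transform(X_va)  # easy to forget, or to refit by mistake
manual_model = RandomForestRegressor(n_estimators=10, random_state=0)
manual_model.fit(X_tr_imp, y_tr)
manual_pred = manual_model.predict(X_va_imp)

# Pipeline route: the same steps, one fit and one predict
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_va)

print(np.allclose(manual_pred, pipe_pred))
```

Both routes apply the same operations; the pipeline simply removes the bookkeeping.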

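**SimpleImputer-** can be used for both numerical (continuous) and categorical (discrete) data, but the choice of imputation strategy depends on the type of data: 'mean' and 'median' apply only to numerical columns, while 'most_frequent' and 'constant' work for either type.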
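
A small sketch of how the strategies behave, using invented columns (`nums` and `cats` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns
nums = pd.DataFrame({"Landsize": [100.0, np.nan, 300.0, 1000.0]})
cats = pd.DataFrame({"Type": ["h", "u", np.nan, "h"]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(nums)
median_filled = SimpleImputer(strategy="median").fit_transform(nums)
constant_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(nums)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(cats)

# Show the value each strategy put in place of the missing entry
print("mean:", mean_filled[1, 0])
print("median:", median_filled[1, 0])
print("constant:", constant_filled[1, 0])
print("most_frequent:", mode_filled[2, 0])
```

With a skewed column like this one, the mean and median fill values differ noticeably, which is one reason the strategy is worth choosing deliberately.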

**Pipeline-** You can easily extend and customize the pipeline to include additional preprocessing steps and different models as needed for your specific machine learning task.
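
As one possible illustration of that flexibility, the sketch below tunes a step, swaps the model, and adds a scaling step. The pipeline `base`, the choice of `GradientBoostingRegressor`, and the added `StandardScaler` are assumptions made for the example, not part of the tutorial's pipeline.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

base = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
])

# Tune a step's parameter using the "<step name>__<parameter>" convention
base.set_params(model__n_estimators=50)

# Replace a whole step by name
base.set_params(model=GradientBoostingRegressor(random_state=0))

# Build an extended pipeline with an extra preprocessing step before the model
extended = Pipeline(steps=base.steps[:1]
                    + [("scaler", StandardScaler())]
                    + base.steps[1:])

step_names = [name for name, _ in extended.steps]
print(step_names)
```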



In [34]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('perprocessor', perprocessor),
                               ('model', model)])


# Preprocess the training data and fit the model
my_pipeline.fit(X_train, Y_train)

# Preprocess the held-out data and get predictions
y_pred = my_pipeline.predict(X_test)

# Evaluate the model (scikit-learn's argument order is y_true, y_pred)
MAE = mean_absolute_error(Y_test, y_pred)
print('MAE', MAE)
# For a regressor, score() returns R^2 on the given data
print(my_pipeline.score(X_test, Y_test))




MAE 160679.18917034855
0.8251065157443225
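
The second number is the coefficient of determination (R^2): a pipeline's `score` method delegates to its final estimator, and scikit-learn regressors report R^2. The sketch below checks this against `r2_score` on invented data; `X_toy`, `y_toy`, and `pipe` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline

# Hypothetical toy regression problem
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(80, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(0, 0.5, size=80)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(X_toy[:60], y_toy[:60])

# score() on the pipeline versus r2_score() on its predictions
via_score = pipe.score(X_toy[60:], y_toy[60:])
via_r2 = r2_score(y_toy[60:], pipe.predict(X_toy[60:]))
print(np.isclose(via_score, via_r2))
```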


In [35]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

In [36]:
# Sample data with numerical and categorical features
data = {
    'age': [25, 30, 35, None, 28],
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'income': [50000, 60000, None, 55000, 48000]
}

In [37]:
df = pd.DataFrame(data)
df.head()

| | age | gender | income |
|---|---|---|---|
| 0 | 25.0 | Male | 50000.0 |
| 1 | 30.0 | Female | 60000.0 |
| 2 | 35.0 | Male | NaN |
| 3 | NaN | Female | 55000.0 |
| 4 | 28.0 | Male | 48000.0 |


In [38]:
numerical_col = ['age', 'income']
categorical_col = ['gender']

In [39]:
numerical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                          ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                                          ('oneHot', OneHotEncoder(handle_unknown='ignore'))])



In [40]:
# Combine transformers using ColumnTransformer
perprocessor = ColumnTransformer(transformers=[('numerical_columns', numerical_transformer, numerical_col),
                                                ('categorical_columns', categorical_transformer, categorical_col)])




In [41]:
# Define the final pipeline (preprocessing only; no model is trained here)
pipeline = Pipeline(steps=[('perprocessor', perprocessor)])

# Fit the transformers and return the cleaned feature matrix
X_transformed = pipeline.fit_transform(df)


In [42]:
X_transformed  # cleaned data: scaled age, scaled income, one-hot gender (Female, Male)

array([[-1.34890655, -0.74231537,  0.        ,  1.        ],
       [ 0.1839418 ,  1.65225034,  1.        ,  0.        ],
       [ 1.71679015, -0.14367394,  0.        ,  1.        ],
       [-0.12262787,  0.45496749,  1.        ,  0.        ],
       [-0.42919754, -1.22122851,  0.        ,  1.        ]])