**Pipelines**

**Pipelines** are a simple way to keep your data preprocessing and modeling code organized.
Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Pipelines have some important benefits. Those include:
1. **Cleaner Code**
2. **Fewer Bugs**
3. **Easier to Productionize**
4. **More Options for Model Validation**
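
To illustrate the last benefit, here is a minimal sketch of cross-validating a pipeline. It assumes synthetic toy data rather than the housing data used in this tutorial, and the names `toy_X`, `toy_y`, and `toy_pipeline` are hypothetical. Because the imputer sits inside the pipeline, it is refit on the training part of each fold, so no information from the held-out part is used during preprocessing.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 100 rows, 3 numerical features, roughly 10% missing
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 3))
toy_y = 3.0 * toy_X[:, 0] + rng.normal(scale=0.1, size=100)
toy_X[rng.random(toy_X.shape) < 0.1] = np.nan

# Imputation and model bundled together as a single estimator
toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn scorers follow "higher is better", so MAE comes back negated
neg_mae = cross_val_score(toy_pipeline, toy_X, toy_y, cv=5,
                          scoring="neg_mean_absolute_error")
cv_mae = -neg_mae
print("MAE per fold:", cv_mae)
print("Mean MAE:", cv_mae.mean())
```

Without a pipeline, you would have to repeat the imputation by hand inside every fold to get the same guarantee.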

First, imagine you are at a point where you already have the training and validation data in `X_train`, `X_valid`, `y_train`, and `y_valid`.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
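
To make the column selection above concrete, here is a small sketch on a made-up DataFrame. The frame `toy`, its values, and the threshold `max_cardinality` are assumptions for illustration only. Note that it uses `select_dtypes` instead of comparing dtype strings, which is a swapped-in way of separating numerical from non-numerical columns, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical toy frame: two text columns and two numerical columns
toy = pd.DataFrame({
    "Type": ["h", "u", "h", "t"],                          # 3 unique values
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St"],   # 4 unique values
    "Rooms": [2, 3, 3, 4],
    "Landsize": [150.0, None, 320.0, 90.0],
})
max_cardinality = 4  # hypothetical threshold, chosen only for this example

# nunique() counts the distinct non-missing values in each column
print(toy.nunique())

# Keep non-numerical columns whose cardinality is below the threshold
non_numeric = toy.select_dtypes(exclude="number").columns
low_card_cols = [c for c in non_numeric if toy[c].nunique() < max_cardinality]

# Keep all numerical columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print("Low-cardinality categorical columns:", low_card_cols)
print("Numerical columns:", numeric_cols)
```

Here `Address` is dropped because every row has a different value, so one-hot encoding it would add one column per row.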

We take a peek at the training data with the `head()` method below. Notice that the data contains both categorical data and columns with missing values. With a pipeline, it's easy to deal with both!

In [None]:
X_train.head()

**We construct the full pipeline in three steps**

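**Step 1: Define Preprocessing Steps**

We use the `ColumnTransformer` class to bundle together different preprocessing steps. The code below:

- imputes missing values in **numerical** data, and
- imputes missing values and applies a one-hot encoding to **categorical** data.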

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: impute the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
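
To see what such a preprocessor produces, the sketch below applies the same kind of `ColumnTransformer` to a four-row toy frame. The frame `toy` and its column names are assumptions made up for this example; the transformer settings mirror the ones described in Step 1.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "Rooms": [2.0, np.nan, 4.0, 3.0],
    "Type": pd.Series(["h", np.nan, "h", "u"], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Missing numbers are replaced by the default constant, 0
    ("num", SimpleImputer(strategy="constant"), ["Rooms"]),
    # Missing categories are replaced by the most frequent one, then encoded
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Type"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result can be a sparse matrix for wide, mostly-zero outputs
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# Columns: Rooms, Type == "h", Type == "u"
print(transformed)
```

The second row shows both imputations at work: the missing `Rooms` value becomes 0, and the missing `Type` is treated as `"h"`, the most frequent category.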


**Step 2: Define the Model**

Next, we define a random forest model with the familiar `RandomForestRegressor` class.

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)


**Step 3: Create and Evaluate the Pipeline**

In [None]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                              ])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
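
For comparison, here is a rough sketch of the bookkeeping that calls like `fit` and `predict` on a pipeline take care of. It assumes synthetic numerical data and a smaller forest to keep it quick; the names `X_tr`, `X_va`, `y_tr`, and `pipe` are hypothetical. The preprocessing is fit on the training data only, and the validation data is transformed with what was learned there.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data with a few missing values
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(40, 2))
y_tr = 2.0 * X_tr[:, 0] + 1.0
X_tr[::5, 1] = np.nan
X_va = rng.normal(size=(10, 2))
X_va[0, 1] = np.nan

imputer = SimpleImputer(strategy="mean")
forest = RandomForestRegressor(n_estimators=10, random_state=0)

# Manual version: each step is fit and applied by hand
manual_imputer = clone(imputer)
manual_forest = clone(forest)
X_tr_imputed = manual_imputer.fit_transform(X_tr)   # learn means from training data
manual_forest.fit(X_tr_imputed, y_tr)
X_va_imputed = manual_imputer.transform(X_va)       # reuse the training means
manual_preds = manual_forest.predict(X_va_imputed)

# Pipeline version: the same steps behind a single fit/predict interface
pipe = Pipeline(steps=[("imputer", clone(imputer)), ("model", clone(forest))])
pipe.fit(X_tr, y_tr)
pipe_preds = pipe.predict(X_va)

print("Same predictions:", np.allclose(manual_preds, pipe_preds))
```

The manual version needs separate objects and intermediate arrays for every split of the data, which is where mistakes such as forgetting to transform the validation set tend to creep in.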

**Step 4: Generate Test Predictions**

In [None]:
# Preprocessing of test data, get predictions
# (assumes X_test was loaded separately and has the same columns as X_train)
preds_test = my_pipeline.predict(X_test)

In [None]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)