In [5]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
warnings.filterwarnings('ignore')

In [4]:
# Forward slash avoids the invalid '\h' escape sequence and works on every OS
path='datasets/housing.csv'
df=pd.read_csv(path)
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [6]:
df_num=df.drop(columns=['ocean_proximity'])

In [7]:
# Numeric preprocessing: fill missing values with the column median, then standardize
new_pipeline=Pipeline([
    ('imputer',SimpleImputer(strategy='median')),
    ('scaler',StandardScaler())
])
df_num_transformed=new_pipeline.fit_transform(df_num)

In [8]:
df_num_transformed

array([[-1.32783522,  1.05254828,  0.98214266, ..., -0.97703285,
         2.34476576,  2.12963148],
       [-1.32284391,  1.04318455, -0.60701891, ...,  1.66996103,
         2.33223796,  1.31415614],
       [-1.33282653,  1.03850269,  1.85618152, ..., -0.84363692,
         1.7826994 ,  1.25869341],
       ...,
       [-0.8237132 ,  1.77823747, -0.92485123, ..., -0.17404163,
        -1.14259331, -0.99274649],
       [-0.87362627,  1.77823747, -0.84539315, ..., -0.39375258,
        -1.05458292, -1.05860847],
       [-0.83369581,  1.75014627, -1.00430931, ...,  0.07967221,
        -0.78012947, -1.01787803]], shape=(20640, 9))

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of
steps. All but the last estimator must be transformers (i.e., they must have a
fit_transform() method). The names can be anything you like (as long as they are
unique and don’t contain double underscores, __); they will come in handy later for
hyperparameter tuning.

When you call the pipeline’s fit() method, it calls fit_transform() sequentially on
all transformers, passing the output of each call as the parameter to the next call until
it reaches the final estimator, for which it calls the fit() method.

The pipeline exposes the same methods as the final estimator. In this example, the last
estimator is a StandardScaler, which is a transformer, so the pipeline has a
transform() method that applies all the transforms to the data in sequence (and of course
also a fit_transform() method, which is the one we used).
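
To make the chaining concrete, here is a minimal sketch on a tiny made-up array (not the housing data). It assumes a two-step pipeline like the one above, with step names chosen for the illustration, and checks that the pipeline gives the same result as running the two steps by hand. It also shows how a step name is used to reach a fitted step and to set one of its parameters.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration) with one missing value
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
via_pipeline = pipe.fit_transform(X)

# The same two steps chained by hand: each output feeds the next step
imputed = SimpleImputer(strategy="median").fit_transform(X)
by_hand = StandardScaler().fit_transform(imputed)
same_result = np.allclose(via_pipeline, by_hand)

# The pipeline exposes the methods of its final step:
# a StandardScaler can transform but cannot predict
has_transform = hasattr(pipe, "transform")
has_predict = hasattr(pipe, "predict")

# Step names give access to the fitted steps...
learned_medians = pipe.named_steps["imputer"].statistics_

# ...and to their parameters, using "<step name>__<parameter>"
pipe.set_params(imputer__strategy="mean")
```

The double-underscore syntax in the last line is why step names themselves must not contain `__`.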

So far, we have handled the categorical columns and the numerical columns
separately. It would be more convenient to have a single transformer able to handle all
columns, applying the appropriate transformations to each column. In version 0.20,
Scikit-Learn introduced the ColumnTransformer for this purpose, and the good news
is that it works great with pandas DataFrames. Let’s use it to apply all the
transformations to the housing data:

In [10]:
num_attr=list(df_num)
cat_attr=['ocean_proximity']
total_pipeline=ColumnTransformer([
    ('num_pipeline',new_pipeline,num_attr),
    ('categorical_pipeline',OneHotEncoder(),cat_attr)
])
df_total=total_pipeline.fit_transform(df)

In [11]:
df_total

array([[-1.32783522,  1.05254828,  0.98214266, ...,  0.        ,
         1.        ,  0.        ],
       [-1.32284391,  1.04318455, -0.60701891, ...,  0.        ,
         1.        ,  0.        ],
       [-1.33282653,  1.03850269,  1.85618152, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.8237132 ,  1.77823747, -0.92485123, ...,  0.        ,
         0.        ,  0.        ],
       [-0.87362627,  1.77823747, -0.84539315, ...,  0.        ,
         0.        ,  0.        ],
       [-0.83369581,  1.75014627, -1.00430931, ...,  0.        ,
         0.        ,  0.        ]], shape=(20640, 14))

First we import the ColumnTransformer class, next we get the list of numerical
column names and the list of categorical column names, and then we construct a
ColumnTransformer. The constructor requires a list of tuples, where each tuple contains a
name, a transformer, and a list of names (or indices) of columns that the
transformer should be applied to. In this example, we specify that the numerical columns
should be transformed using the new_pipeline that we defined earlier, and the
categorical columns should be transformed using a OneHotEncoder. Finally, we apply this
ColumnTransformer to the housing data: it applies each transformer to the
appropriate columns and concatenates the outputs along the second axis (the transformers
must return the same number of rows).
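
The column-wise behavior is easier to see on a small example. The sketch below uses a made-up three-row DataFrame (the column names and values are hypothetical, not taken from the housing data) and a plain StandardScaler in place of the full numerical pipeline. It shows that the output columns follow the order in which the transformers are listed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: two numerical columns and one categorical column
toy = pd.DataFrame({
    "rooms": [2.0, 4.0, 6.0],
    "income": [1.0, 2.0, 3.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND"],
})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["rooms", "income"]),
    ("cat", OneHotEncoder(), ["proximity"]),
])
out = ct.fit_transform(toy)

# Outputs are concatenated side by side: the 2 scaled numerical columns
# come first, followed by 1 one-hot column per category
numeric_part = out[:, :2]
one_hot_part = out[:, 2:]

# The fitted encoder is reachable through the name given in the tuple
categories = list(ct.named_transformers_["cat"].categories_[0])
```

Here `out` has 3 rows and 4 columns: the row count is unchanged, and the column count is the sum of what each transformer produced.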

Note that the OneHotEncoder returns a sparse matrix, while the new_pipeline returns
a dense matrix. When there is such a mix of sparse and dense matrices, the
ColumnTransformer estimates the density of the final matrix (i.e., the ratio of nonzero
cells), and it returns a sparse matrix if the density is lower than a given threshold (by
default, sparse_threshold=0.3). In this example, it returns a dense matrix. And
that’s it! We have a preprocessing pipeline that takes the full housing data and applies
the appropriate transformations to each column.
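
The effect of sparse_threshold can be checked with a rough sketch like the one below. The data and the `build` helper are hypothetical and exist only for this illustration: one numerical column is combined with a categorical column that has either 2 or 10 distinct values, so the share of filled cells in the combined output changes.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build(sparse_threshold=0.3):
    """Hypothetical helper: one scaled column plus one one-hot encoded column."""
    return ColumnTransformer(
        [("num", StandardScaler(), ["value"]),
         ("cat", OneHotEncoder(), ["label"])],
        sparse_threshold=sparse_threshold,
    )


values = [float(i) for i in range(10)]
few = pd.DataFrame({"value": values, "label": ["a", "b"] * 5})
many = pd.DataFrame({"value": values, "label": list("abcdefghij")})

# 1 + 2 = 3 output columns, 2 filled cells per row: density well above 0.3
few_out = build().fit_transform(few)

# 1 + 10 = 11 output columns, 2 filled cells per row: density below 0.3
many_out = build().fit_transform(many)

# sparse_threshold=0 asks for a dense result regardless of density
forced_dense = build(sparse_threshold=0).fit_transform(many)

few_is_sparse = sparse.issparse(few_out)
many_is_sparse = sparse.issparse(many_out)
forced_is_sparse = sparse.issparse(forced_dense)
```

With the housing data, the 9 dense numerical columns dominate the 5 one-hot columns, which is consistent with the dense array shown above.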