## Custom Estimators and Machine Learning Workflows

The built-in California housing dataset includes features for latitude and longitude. Create a custom transformer that returns the Euclidean distance from a given set of coordinates. Use this transformer to create features for the distance from Los Angeles and from San Francisco, then combine them with the original features to fit a linear regression model.
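As a quick illustration of the distance feature described above, here is a minimal sketch of the arithmetic on two hand-picked points. The helper name `euclid_degrees` and the sample arrays are hypothetical; the city coordinates are the rounded values used later in this notebook. Note that the result is a distance in degree space, not in kilometres.

```python
import numpy as np

def euclid_degrees(lat, lon, coord):
    """Euclidean distance in degree space from (lat, lon) points to coord."""
    return np.sqrt((lat - coord[0]) ** 2 + (lon - coord[1]) ** 2)

coord_LA = (34, -118)  # rounded (latitude, longitude) for Los Angeles

# Two illustrative points: one at the LA coordinates, one at the rounded SF coordinates
lat = np.array([34.0, 37.0])
lon = np.array([-118.0, -122.0])

dist_to_LA = euclid_degrees(lat, lon, coord_LA)
print(dist_to_LA)  # first point is at LA itself; second is sqrt(3**2 + 4**2) degrees away
```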

In [13]:
from sklearn.datasets import fetch_california_housing
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

In [2]:
import numpy as np

In [3]:
data=fetch_california_housing()

In [4]:
print(data['DESCR'])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [5]:
X=data['data']
y=data['target']

In [6]:
class DistFromCity(BaseEstimator, TransformerMixin):
    """Transform latitude and longitude columns into the Euclidean distance
    (in degrees) from a given (latitude, longitude) coordinate pair."""

    def __init__(self, coord):
        self.coord = coord

    def fit(self, X, y=None):
        # Stateless transformer: there is nothing to learn from the data
        return self

    def transform(self, X):
        # Assumes latitude and longitude are in columns 0 and 1 respectively
        lat = X[:, 0]
        lon = X[:, 1]

        dist = np.sqrt((lat - self.coord[0])**2 + (lon - self.coord[1])**2)
        # Return a 2-D column so the result can be stacked with other features
        return dist.reshape(-1, 1)

In [8]:
class KeepColumns(BaseEstimator, TransformerMixin):
    """Keep only the columns whose indices are given on initialization."""

    def __init__(self, ind_cols):
        self.ind_cols = ind_cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[:, self.ind_cols]

In [14]:
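# Approximate (latitude, longitude) of each city, in degrees
coord_LA = (34, -118)
coord_SF = (37, -122)
dist_LA = DistFromCity(coord_LA)
dist_SF = DistFromCity(coord_SF)
# Columns 0-5 are the non-coordinate features; columns 6 and 7 are Latitude and Longitude
keepothers = KeepColumns([0, 1, 2, 3, 4, 5])
keepcoord = KeepColumns([6, 7])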

pipe_LA=Pipeline([
    ('keepcoord',keepcoord),
    ('dist_la',dist_LA)
])
pipe_SF=Pipeline([
    ('keepcoord',keepcoord),
    ('dist_sf',dist_SF)
])
# Stack the two distance features side by side with the six kept features
union=FeatureUnion([
    ('LA',pipe_LA),
    ('SF',pipe_SF),
    ('keep',keepothers)
])
pipe=Pipeline([
    ('union', union),
    ('scaler', StandardScaler()),
    ('lin_reg',LinearRegression())
])
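
The same feature layout can also be assembled from scikit-learn's built-in `ColumnTransformer` and `FunctionTransformer` rather than the custom `KeepColumns` and `DistFromCity` classes. The following is only a sketch of that alternative, not the approach used in this notebook; it assumes the same column order (Latitude and Longitude in columns 6 and 7), and the helper `degree_distance` and the toy array `X_demo` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

def degree_distance(coords, city):
    """Euclidean distance in degree space from each (lat, lon) row to city."""
    diff = coords - np.asarray(city)
    return np.sqrt((diff ** 2).sum(axis=1)).reshape(-1, 1)

features = ColumnTransformer([
    # Each entry is (name, transformer, column indices it receives)
    ("LA", FunctionTransformer(degree_distance, kw_args={"city": (34, -118)}), [6, 7]),
    ("SF", FunctionTransformer(degree_distance, kw_args={"city": (37, -122)}), [6, 7]),
    ("keep", "passthrough", [0, 1, 2, 3, 4, 5]),
])

# Toy input (assumption): two rows with 8 columns, placed at the LA and SF coordinates
X_demo = np.zeros((2, 8))
X_demo[:, 6] = [34.0, 37.0]
X_demo[:, 7] = [-118.0, -122.0]

out = features.fit_transform(X_demo)
print(out.shape)   # two distance columns followed by the six kept columns
print(out[:, :2])  # distances to LA and SF for each row
```

One trade-off of this variant is that the column selection lives in a single place, at the cost of the distance logic no longer being a reusable estimator class.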

In [15]:
pipe.fit(X,y)

Pipeline(steps=[('union',
                 FeatureUnion(transformer_list=[('LA',
                                                 Pipeline(steps=[('keepcoord',
                                                                  KeepColumns(ind_cols=[6,
                                                                                        7])),
                                                                 ('dist_la',
                                                                  DistFromCity(coord=(34,
                                                                                      -118)))])),
                                                ('SF',
                                                 Pipeline(steps=[('keepcoord',
                                                                  KeepColumns(ind_cols=[6,
                                                                                        7])),
                                                                 ('dist

In [16]:
# R^2 on the training data (the same rows used to fit the pipeline)
pipe.score(X,y)

0.549363502829012
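
The score above is the R^2 on the same rows the pipeline was fitted on, so it may be optimistic. A held-out estimate could be obtained with `train_test_split`. The sketch below shows the pattern on synthetic data (an assumption, so that it runs without downloading the dataset); `X_demo`, `y_demo`, and `model` are hypothetical stand-ins for `X`, `y`, and `pipe`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data: 8 features, linear target plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 8))
y_demo = X_demo @ np.arange(1.0, 9.0) + rng.normal(scale=0.5, size=500)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # R^2 on rows seen during fitting
test_r2 = model.score(X_test, y_test)     # R^2 on rows the model has not seen
print(train_r2, test_r2)
```

With the real data, the same split would be applied to `X` and `y` before calling `pipe.fit`, and `pipe.score` would then be evaluated on the held-out part.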