# California Housing Market: Feature engineering and feature selection
In the previous exercise, we concluded that it was worth including more variables in the model. But is this set of variables **the best** we could have chosen? In this exercise, we'll go further by applying two canonical methods:
* Feature engineering consists of creating new variables from the original dataset
* Feature selection consists of choosing the best subset of features among all the available variables

## The dataset
1. Load the California Housing dataset again and remove the outliers:

In [37]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

In [38]:
from sklearn import datasets
data = datasets.fetch_california_housing(data_home=None, download_if_missing=True, return_X_y=False)
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [39]:
dataset = pd.DataFrame(columns=data["feature_names"], data=data["data"])
dataset.loc[:,'Price'] = data["target"]

mask = (dataset['AveRooms'] < 10) & (dataset['AveBedrms'] < 10) & (dataset['Population'] < 15000) & \
    (dataset['AveOccup'] < 10) & (dataset['Price'] < 5)

dataset = dataset.loc[mask,:]

dataset.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [40]:
# Basic stats
print("Number of rows : {}".format(dataset.shape[0]))
print()

print("Display of dataset: ")
display(dataset.head())
print()

print("Basics statistics: ")
data_desc = dataset.describe(include="all")
display(data_desc)
print()

print("Percentage of missing values: ")
display(100 * dataset.isnull().sum() / dataset.shape[0])

Number of rows : 19398

Display of dataset: 


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422



Basics statistics: 


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Price
count,19398.0,19398.0,19398.0,19398.0,19398.0,19398.0,19398.0,19398.0,19398.0
mean,3.674497,28.496907,5.210648,1.066038,1442.17208,2.94464,35.637872,-119.567484,1.924128
std,1.563397,12.477953,1.168098,0.128846,1077.498768,0.766194,2.14296,2.004793,0.971784
min,0.4999,1.0,0.846154,0.333333,3.0,0.75,32.54,-124.35,0.14999
25%,2.5259,18.0,4.407329,1.005413,805.0,2.450413,33.93,-121.77,1.167
50%,3.4478,29.0,5.170038,1.047619,1185.5,2.842105,34.26,-118.49,1.741
75%,4.583175,37.0,5.944617,1.096884,1752.0,3.308127,37.72,-118.0,2.485
max,15.0001,52.0,9.979167,3.411111,13251.0,9.954545,41.95,-114.55,4.991



Percentage of missing values: 


MedInc        0.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
Latitude      0.0
Longitude     0.0
Price         0.0
dtype: float64

2. Separate the target from the features

In [41]:
target_name = "Price"

X = dataset.drop(target_name, axis=1) 
y = dataset[target_name]

display(X.head())
display(y.head())

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: Price, dtype: float64

## From linear to non-linear regression
An easy way of implementing a non-linear regression is to create by hand more columns containing non-linear functions of the features.

3. For each explanatory variable, create 5 new columns in $X$ containing the following functions:
* $\textrm{X}^2$
* $\textrm{X}^3$
* $\textrm{X}^4$
* $\frac{1}{\textrm{X}}$
* $\frac{1}{\textrm{X}^2}$

In [42]:
features_list = X.columns
for c in features_list:
    X.loc[:, c + '_2'] = X[c]**2
    X.loc[:, c + '_3'] = X[c]**3
    X.loc[:, c + '_4'] = X[c]**4
    X.loc[:, c + '_inverse'] = 1/X[c]
    X.loc[:, c + '_inverse2'] = 1/(X[c]**2)
X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedInc_2,MedInc_3,...,Latitude_2,Latitude_3,Latitude_4,Latitude_inverse,Latitude_inverse2,Longitude_2,Longitude_3,Longitude_4,Longitude_inverse,Longitude_inverse2
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,69.308955,577.010912,...,1434.8944,54353.799872,2058922.0,0.026399,0.000697,14940.1729,-1826137.0,223208800.0,-0.008181,6.7e-05
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,68.913242,572.076387,...,1433.3796,54267.751656,2054577.0,0.026413,0.000698,14937.7284,-1825689.0,223135700.0,-0.008182,6.7e-05
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,52.669855,382.246204,...,1432.6225,54224.761625,2052407.0,0.02642,0.000698,14942.6176,-1826586.0,223281800.0,-0.008181,6.7e-05
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,31.844578,179.702136,...,1432.6225,54224.761625,2052407.0,0.02642,0.000698,14945.0625,-1827034.0,223354900.0,-0.00818,6.7e-05
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,14.793254,56.897815,...,1432.6225,54224.761625,2052407.0,0.02642,0.000698,14945.0625,-1827034.0,223354900.0,-0.00818,6.7e-05
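
The loop above can also be wrapped in a small reusable helper. The sketch below is one possible way to do it, not the reference solution: `add_nonlinear_features` is a hypothetical name, and the toy DataFrame only stands in for `X`. It builds all the new columns first and concatenates them in one go, which avoids inserting columns one by one.

```python
import pandas as pd

def add_nonlinear_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with powers and inverses of every column appended."""
    new_columns = {}
    for c in df.columns:
        new_columns[c + "_2"] = df[c] ** 2
        new_columns[c + "_3"] = df[c] ** 3
        new_columns[c + "_4"] = df[c] ** 4
        new_columns[c + "_inverse"] = 1 / df[c]
        new_columns[c + "_inverse2"] = 1 / df[c] ** 2
    # Concatenate once instead of adding the columns one at a time
    return pd.concat([df, pd.DataFrame(new_columns, index=df.index)], axis=1)

# Toy data standing in for X (hypothetical values)
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [4.0, 5.0]})
expanded = add_nonlinear_features(toy)
print(expanded.shape)  # 2 original columns + 2 * 5 derived ones
```

Keep in mind that the inverse columns are only safe when a feature never takes the value 0, which is the case for the eight variables used here.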


4. Split your dataset into train (80%) and test (20%)

In [43]:
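# test_size=0.2 gives the 80% / 20% split asked for (the default would hold out 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)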
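
When `test_size` is not given, `train_test_split` holds out 25% of the rows. It also shuffles the rows differently on each run unless `random_state` is set. The sketch below, on toy arrays that are not part of the exercise, shows how both arguments control the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features (hypothetical values)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# test_size sets the share of rows held out, random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` is optional, but it makes the scores printed in the following cells identical from one run to the next.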

5. Apply the same preprocessing as in the previous exercise

In [44]:
scaler = StandardScaler()

display(X_train.head())
X_train = scaler.fit_transform(X_train)
display(X_train[0:5])

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedInc_2,MedInc_3,...,Latitude_2,Latitude_3,Latitude_4,Latitude_inverse,Latitude_inverse2,Longitude_2,Longitude_3,Longitude_4,Longitude_inverse,Longitude_inverse2
13175,3.2292,52.0,7.075314,1.259414,618.0,2.585774,36.47,-120.95,10.427733,33.673234,...,1330.0609,48507.321023,1769062.0,0.02742,0.000752,14628.9025,-1769366.0,214004800.0,-0.008268,6.8e-05
19133,2.4117,16.0,4.933657,0.985437,1675.0,2.710356,38.35,-122.72,5.816297,14.027163,...,1470.7225,56402.207875,2163025.0,0.026076,0.00068,15060.1984,-1848188.0,226809600.0,-0.008149,6.6e-05
20550,2.8864,23.0,5.607029,1.025559,1061.0,3.389776,38.69,-121.79,8.331305,24.047479,...,1496.9161,57915.683909,2240758.0,0.025846,0.000668,14832.8041,-1806487.0,220012100.0,-0.008211,6.7e-05
15085,5.0118,34.0,6.258865,1.007092,772.0,2.737589,32.82,-116.92,25.118139,125.88709,...,1077.1524,35352.141768,1160257.0,0.030469,0.000928,13670.2864,-1598330.0,186876700.0,-0.008553,7.3e-05
11163,2.2401,24.0,4.873346,1.096408,1217.0,2.300567,33.83,-118.0,5.018048,11.240929,...,1144.4689,38717.382887,1309809.0,0.02956,0.000874,13924.0,-1643032.0,193877800.0,-0.008475,7.2e-05


array([[-0.28483057,  1.87303539,  1.59610023,  1.47731745, -0.77272521,
        -0.46102784,  0.38326465, -0.6849955 , -0.38908888, -0.38518565,
        -0.30963103, -0.1231946 , -0.21214676,  2.33026947,  2.68236883,
         2.95052755, -0.6205339 , -0.2018613 ,  1.69060582,  1.69583867,
         1.61639121, -1.13621771, -0.73344425,  1.17562863,  0.78697612,
         0.43276361, -1.62799169, -1.41807394, -0.40138762, -0.19127333,
        -0.1066343 ,  0.05582162, -0.0221363 , -0.46205705, -0.3856547 ,
        -0.26809221,  0.27187578,  0.1358049 ,  0.3525979 ,  0.32137889,
         0.28973798, -0.44246807, -0.47079196,  0.67887182, -0.67264928,
         0.66633   ,  0.69693785, -0.70275234],
       [-0.80959425, -0.99963967, -0.23787306, -0.61801865,  0.23020549,
        -0.29797013,  1.25765397, -1.56529245, -0.71594174, -0.54985088,
        -0.37711937,  0.49913541,  0.174177  , -0.96115132, -0.84384989,
        -0.72557024,  0.27146325, -0.04023771, -0.32825699, -0.38402006,
   

In [45]:
display(X_test.head())
X_test = scaler.transform(X_test)
display(X_test[0:5])

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedInc_2,MedInc_3,...,Latitude_2,Latitude_3,Latitude_4,Latitude_inverse,Latitude_inverse2,Longitude_2,Longitude_3,Longitude_4,Longitude_inverse,Longitude_inverse2
886,3.1326,17.0,3.833458,1.026886,3386.0,2.528753,37.54,-121.98,9.813183,30.740776,...,1409.2516,52903.305064,1985990.0,0.026638,0.00071,14879.1204,-1814955.0,221388200.0,-0.008198,6.7e-05
5888,4.2679,44.0,4.388704,1.006645,471.0,1.564784,34.15,-118.33,18.21497,77.739672,...,1166.2225,39826.498375,1360075.0,0.029283,0.000857,14001.9889,-1656855.0,196055700.0,-0.008451,7.1e-05
7884,3.3304,20.0,4.425791,1.055961,2326.0,2.829684,33.87,-118.13,11.091564,36.939345,...,1147.1769,38854.881603,1316015.0,0.029525,0.000872,13954.6969,-1648468.0,194733600.0,-0.008465,7.2e-05
16128,3.5221,52.0,4.834008,1.026316,1078.0,2.182186,37.78,-122.47,12.405188,43.692314,...,1427.3284,53924.466952,2037266.0,0.026469,0.000701,14998.9009,-1836915.0,224967000.0,-0.008165,6.7e-05
18604,5.208,46.0,6.094801,1.235474,850.0,2.599388,37.11,-122.11,27.123264,141.257959,...,1377.1521,51106.114431,1896548.0,0.026947,0.000726,14910.8521,-1820764.0,222333500.0,-0.008189,6.7e-05


array([[-0.34683934, -0.91984314, -1.18001086, -0.30102433,  1.85368177,
        -0.5356595 ,  0.8809224 , -1.1972587 , -0.43264741, -0.4097643 ,
        -0.32083795, -0.06658   , -0.18163206, -0.91678168, -0.8227461 ,
        -0.71644605,  0.19567264, -0.06062297, -1.08540211, -0.95045618,
        -0.80270353,  1.07090527,  0.72884945, -0.27064122, -0.21220899,
        -0.14638345,  0.26613727,  0.20154669,  1.1985269 ,  0.49121316,
         0.13542112, -0.21441097, -0.02455642, -0.5153304 , -0.41684169,
        -0.28266274,  0.36777943,  0.22625002,  0.86209401,  0.84164436,
         0.81958374, -0.91372547, -0.92772591,  1.19797182, -1.19858059,
         1.19908487,  1.19551994, -1.19449455],
       [ 0.38192422,  1.23466316, -0.70453446, -0.45582602, -0.91220564,
        -1.79733958, -0.69576897,  0.61804292,  0.16286089, -0.01583902,
        -0.10867722, -0.57001414, -0.42045997,  1.29766687,  1.25072072,
         1.14267447, -0.54845332, -0.19515645, -0.72703924, -0.69995485,
   

6. Train a model including all these features. Do you get better performance than before?

In [46]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [47]:
print(f"R2 score on training set : {lr.score(X_train, y_train)}")
print(f"R2 score on test set : {lr.score(X_test, y_test)}")

R2 score on training set : 0.6828533091049148
R2 score on test set : 0.6854129001458649
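
As a reminder, the score returned by `lr.score` is the coefficient of determination $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The short sketch below recomputes it by hand on made-up values and compares the result with `r2_score` from sklearn.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 2.0, 4.0, 5.0])
y_pred = np.array([2.5, 2.0, 4.5, 5.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and the score becomes negative when the model does worse than that.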


## Forward selection
This feature engineering trick improved the model's score significantly! But the model is now a lot more complex, as it uses 48 input features (the 8 original ones plus 5 derived columns for each of them). Do we really need all these features? Let's implement the forward selection method described in this morning's lecture.

Fortunately, the latest versions of sklearn provide a class that implements forward selection, such that we don't need to code the algorithm by hand 🥳

7. Have a look at the documentation of [SequentialFeatureSelector](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) and try to understand the following lines of code:

In [48]:
from sklearn.feature_selection import SequentialFeatureSelector
# direction="forward" is the default: start from no feature and add them one at a time
feature_selector = SequentialFeatureSelector(lr, n_features_to_select=20)
feature_selector.fit(X_train, y_train)
features_list = X.columns
# support_ is a boolean mask over the columns, True for the selected features
best_features = features_list[feature_selector.support_]
print("According to the forward selection algorithm, the following features should be kept: ")
print(best_features.to_list())

According to the forward selection algorithm, the following features should be kept: 
['MedInc', 'HouseAge', 'Population', 'Latitude', 'Longitude', 'MedInc_3', 'MedInc_4', 'MedInc_inverse', 'AveRooms_inverse', 'AveBedrms_inverse', 'Population_inverse2', 'AveOccup_3', 'AveOccup_inverse', 'AveOccup_inverse2', 'Latitude_3', 'Latitude_4', 'Latitude_inverse', 'Latitude_inverse2', 'Longitude_3', 'Longitude_4']
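
For intuition, the greedy procedure that `SequentialFeatureSelector` automates can be sketched by hand. The function below is a simplified illustration, not sklearn's actual implementation: `forward_selection` is a hypothetical helper, and the synthetic data is built so that only columns 1 and 4 carry information about the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(estimator, X, y, n_features_to_select, cv=5):
    """Greedily add the feature that most improves the cross-validated score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features_to_select:
        scores = {}
        for j in remaining:
            candidate = selected + [j]
            # Mean cross-validated R2 of the model restricted to the candidate columns
            scores[j] = cross_val_score(estimator, X[:, candidate], y, cv=cv).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: 6 features, only columns 1 and 4 drive the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 6))
y_toy = 3.0 * X_toy[:, 1] - 2.0 * X_toy[:, 4] + rng.normal(scale=0.1, size=300)

chosen = forward_selection(LinearRegression(), X_toy, y_toy, n_features_to_select=2)
print(sorted(chosen))
```

Each round trains one model per remaining feature, so the cost grows quickly with the number of columns; this is why the selection step above takes noticeably longer than a single fit.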


8. Create a DataFrame X_best containing only the best set of features, train a model with only these features and evaluate its performance

In [49]:
X_best = X.loc[:, best_features.to_list()]

X_train, X_test, y_train, y_test = train_test_split(X_best, y)

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

lr.fit(X_train, y_train)

print(f"R2 score on training set : {lr.score(X_train, y_train)}")
print(f"R2 score on test set : {lr.score(X_test, y_test)}")

R2 score on training set : 0.6723166249259322
R2 score on test set : 0.6551957224046474
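
Note that the cell above draws a new train/test split after the features were chosen, so some rows that helped select the features may end up in the new test set. One possible alternative, sketched below on synthetic data, is to keep the original split and let the fitted selector reduce both sets with `transform`. The toy arrays and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: only columns 1 and 4 carry signal (hypothetical values)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 6))
y_toy = 3.0 * X_toy[:, 1] - 2.0 * X_toy[:, 4] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Select the features using the training rows only
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=2)
selector.fit(X_tr, y_tr)

# Apply the same column mask to both sets, without re-splitting
X_tr_best = selector.transform(X_tr)
X_te_best = selector.transform(X_te)

model = LinearRegression().fit(X_tr_best, y_tr)
test_r2 = model.score(X_te_best, y_te)
print(f"R2 score on test set : {test_r2:.3f}")
```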


## Advanced feature engineering
Let's do some more advanced feature engineering. Until now, we've included the latitude and longitude as such in the models. However, GPS coordinates are usually not used in raw form; instead, we derive some geographical information from them. Let's use an API that allows us to retrieve the name of the city from the latitude and longitude.

💡 As the calls to the API may be time-consuming, we'll work on a sample of the dataset.

9. Take a sample of your dataset X (the one that contains all the features and not only the best set, because we need the values of Latitude and Longitude). Keep only 150 rows.

In [61]:
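# .loc slices by index label and includes the end label; labels that were dropped with
# the outliers are skipped, so the sample ends up with about 150 rows
X_sample = X.loc[:150]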
X_sample.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedInc_2,MedInc_3,...,Latitude_2,Latitude_3,Latitude_4,Latitude_inverse,Latitude_inverse2,Longitude_2,Longitude_3,Longitude_4,Longitude_inverse,Longitude_inverse2
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,69.308955,577.010912,...,1434.8944,54353.799872,2058922.0,0.026399,0.000697,14940.1729,-1826137.0,223208800.0,-0.008181,6.7e-05
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,68.913242,572.076387,...,1433.3796,54267.751656,2054577.0,0.026413,0.000698,14937.7284,-1825689.0,223135700.0,-0.008182,6.7e-05
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,52.669855,382.246204,...,1432.6225,54224.761625,2052407.0,0.02642,0.000698,14942.6176,-1826586.0,223281800.0,-0.008181,6.7e-05
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,31.844578,179.702136,...,1432.6225,54224.761625,2052407.0,0.02642,0.000698,14945.0625,-1827034.0,223354900.0,-0.00818,6.7e-05
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,14.793254,56.897815,...,1432.6225,54224.761625,2052407.0,0.02642,0.000698,14945.0625,-1827034.0,223354900.0,-0.00818,6.7e-05


10. Create a y_sample variable containing the target values corresponding to the rows that were kept in X_sample

In [63]:
y_sample = dataset.loc[:150, target_name]
y_sample.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: Price, dtype: float64

11. Use the [geopy](https://pypi.org/project/geopy) library to convert the latitude and longitude of each observation into the name of the corresponding city:

In [55]:
!pip install geopy

Collecting geopy
  Downloading geopy-2.4.1-py3-none-any.whl.metadata (6.8 kB)
Collecting geographiclib<3,>=1.52 (from geopy)
  Downloading geographiclib-2.0-py3-none-any.whl.metadata (1.4 kB)
Downloading geopy-2.4.1-py3-none-any.whl (125 kB)
Downloading geographiclib-2.0-py3-none-any.whl (40 kB)
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-2.0 geopy-2.4.1


In [56]:
# Example of how to get the address from a given pair of latitude/longitude coordinates
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="yet_another_app")
location = geolocator.reverse("52.509669, 13.376294")
loc_dict = dict(location.raw)
loc_dict["address"]

{'tourism': 'Potsdamer Platz',
 'road': 'Potsdamer Platz',
 'suburb': 'Tiergarten',
 'borough': 'Mitte',
 'city': 'Berlin',
 'ISO3166-2-lvl4': 'DE-BE',
 'postcode': '10785',
 'country': 'Deutschland',
 'country_code': 'de'}
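
Depending on the place, the address dictionary may not contain a `"city"` key: smaller localities can be reported under `"town"` or `"village"` instead. A small helper can hide this fallback logic. The sketch below works on plain dictionaries, so it runs without calling the API; `extract_city` is a hypothetical name and the second and third dictionaries are made-up examples.

```python
def extract_city(address, default="Unknown"):
    """Return the locality name from a Nominatim-style address dictionary."""
    # Try the keys from the largest kind of locality to the smallest
    for key in ("city", "town", "village"):
        if key in address:
            return address[key]
    return default

print(extract_city({"city": "Berlin", "country_code": "de"}))
print(extract_city({"village": "Some Village", "country_code": "us"}))  # made-up entry
print(extract_city({"country_code": "us"}))  # no locality key at all
```

When looping over many rows, remember that the public Nominatim service asks clients to stay at or below one request per second; geopy ships a `RateLimiter` class in `geopy.extra.rate_limiter` that can wrap `geolocator.reverse` for this purpose.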

In [64]:
# Use geopy to extract the city of each row in the sample dataset
X_sample = X_sample.copy()  # explicit copy, so the assignments below do not write to a slice of X
X_sample["City"] = "Unknown"  # default value, kept when no city is found
geolocator = Nominatim(user_agent="yet_another_app_2")  # create the client once, outside the loop
for i, row in X_sample.iterrows():
    location = geolocator.reverse("{}, {}".format(row["Latitude"], row["Longitude"]),
                                  timeout=None)
    if location is None:
        continue
    address = location.raw.get("address", {})
    # The locality is reported under "city", "town" or "village"
    for key in ("city", "town", "village"):
        if key in address:
            X_sample.loc[i, "City"] = address[key]
            break


In [65]:
X_sample.describe(include='all')

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedInc_2,MedInc_3,...,Latitude_3,Latitude_4,Latitude_inverse,Latitude_inverse2,Longitude_2,Longitude_3,Longitude_4,Longitude_inverse,Longitude_inverse2,City
count,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,...,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149.0,149
unique,,,,,,,,,,,...,,,,,,,,,,5
top,,,,,,,,,,,...,,,,,,,,,,Oakland
freq,,,,,,,,,,,...,,,,,,,,,,137
mean,3.218595,43.0,4.970076,1.081219,900.33557,2.379286,37.825503,-122.255705,15.106081,91.351659,...,54119.577559,2047101.0,0.026437,0.0006989254,14946.458117,-1827290.0,223396700.0,-0.00818,6.69055e-05,
std,2.186047,11.585172,1.382896,0.122885,573.607392,0.549293,0.016621,0.028123,20.474284,185.439747,...,71.352277,3598.806,1.2e-05,6.141275e-07,6.875713,1260.782,205499.1,2e-06,3.078887e-08,
min,0.4999,2.0,1.714286,0.571429,18.0,1.437141,37.79,-122.3,0.2499,0.124925,...,53967.298139,2039424.0,0.026399,0.0006969154,14927.9524,-1829277.0,222843800.0,-0.008185,6.685703e-05,
25%,1.6875,36.0,3.980237,1.02381,534.0,2.101083,37.81,-122.28,2.847656,4.80542,...,54053.028541,2043745.0,0.026427,0.0006983896,14942.6176,-1828379.0,223281800.0,-0.008181,6.68789e-05,
50%,2.6,49.0,4.79798,1.068,756.0,2.346154,37.82,-122.26,6.76,17.576,...,54095.927768,2045908.0,0.026441,0.0006991284,14947.5076,-1827482.0,223428000.0,-0.008179,6.690079e-05,
75%,3.9643,52.0,6.047244,1.114943,1129.0,2.60688,37.84,-122.24,15.715674,62.301648,...,54181.794304,2050239.0,0.026448,0.0006994983,14952.3984,-1826586.0,223574200.0,-0.008178,6.692268e-05,


12. Split X_sample and y_sample into train and test sets

In [66]:

X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample)


13. Which preprocessing steps are necessary now? The cells below implement them; read them carefully and check what is done

In [67]:
categorical_features = ['City']
numeric_features = [c for c in X_sample.columns if c != 'City']

In [68]:
# Create transformer for numeric features
numeric_transformer = StandardScaler()

In [69]:
# Create transformer for categorical features
categorical_transformer = OneHotEncoder(drop='first', handle_unknown = 'ignore') # ignore if unknown categories are found in test set

In [70]:
# Use ColumnTransformer to make a preprocessor object that describes all the treatments to be done
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

In [71]:
# Preprocessings on train set
print("Performing preprocessings on train set...")
X_train = preprocessor.fit_transform(X_train)
print('...Done.')

# Preprocessings on test set
print("Performing preprocessings on test set...")
X_test = preprocessor.transform(X_test) # Don't fit again!! The test set is used to validate decisions
# made on the training set, so we can only apply transformations whose parameters were fitted on the training set.
# Otherwise, information leaks from the test set and biases all your results.
print('...Done.')

Performing preprocessings on train set...
...Done.
Performing preprocessings on test set...
...Done.
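
To see what the preprocessor produces, here is a small sketch on a toy DataFrame. The values are made up, `toy_preprocessor` and `to_dense` are illustrative names, and `drop='first'` is left out on purpose so that every city gets its own column. The test row contains a city that was never seen during `fit`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values)
train = pd.DataFrame({"MedInc": [1.0, 2.0, 3.0, 4.0],
                      "City": ["Oakland", "Berkeley", "Oakland", "Berkeley"]})
test = pd.DataFrame({"MedInc": [2.5], "City": ["Piedmont"]})  # city absent from train

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["MedInc"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
])

def to_dense(a):
    # ColumnTransformer may return a sparse matrix, convert it for display
    return a.toarray() if hasattr(a, "toarray") else a

train_encoded = to_dense(toy_preprocessor.fit_transform(train))
test_encoded = to_dense(toy_preprocessor.transform(test))

print(train_encoded.shape)  # 1 scaled column + 1 column per city seen in train
print(test_encoded)         # the unseen city is encoded with zeros only
```

With `handle_unknown='ignore'`, an unseen category becomes a row of zeros in the one-hot columns. When `drop='first'` is also used, the dropped category is encoded as all zeros too, so the model cannot tell the two cases apart.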


14. Train a regression model and evaluate its performance. Are you satisfied?

In [74]:
lr.fit(X_train, y_train)

print(f"R2 score on training set : {lr.score(X_train, y_train)}")
print(f"R2 score on test set : {lr.score(X_test, y_test)}")

R2 score on training set : 0.9729836462300647
R2 score on test set : -1.4978501773513417
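
A very high training score combined with a negative test score is the signature of overfitting. One plausible reason here is that the sample has only about 150 rows for roughly 50 input columns. The sketch below illustrates the effect under simplified assumptions: the features and the target are pure random noise, so any score above zero on the training set comes from memorising the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Pure noise: the target is unrelated to the features (synthetic data)
rng = np.random.default_rng(0)
n_rows, n_features = 150, 50
X_noise = rng.normal(size=(n_rows, n_features))
y_noise = rng.normal(size=n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(X_noise, y_noise, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)

print(f"R2 score on training set : {train_r2:.3f}")
print(f"R2 score on test set : {test_r2:.3f}")
```

With so few rows per feature, the training score is flattering and says little about how the model behaves on new data. Using more rows, fewer features, or a regularised model are the usual ways to narrow the gap.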
