## 2. Feature Engineering

In [None]:
# Add a new feature, age, by subtracting the year of registration from the current year (2023).
sample_df['age'] = 2023 - sample_df['year_of_registration']
# Deleting feature year of registration
sample_df = sample_df.drop(columns=['year_of_registration'])
# Display a sample to see new feature age
sample_df['age'].sample()

public_reference
202009033252341    62
Name: age, dtype: int64

Here, I've defined a new attribute called age, computed by subtracting the vehicle's year of registration from the current year (2023).

In [None]:
X = sample_df.drop(columns='price')
y = sample_df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train.shape, y_train.shape

((27253, 14), (27253,))

Because I added a new feature, I split the dataset into predictors and target again so that age is included in the subsequent steps.

### Polynomial/bias function and interaction features

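A polynomial transformer is fitted to the data and used to transform it. In general, the PolynomialFeatures transformer produces new features by combining the existing ones raised to various powers. Here it is configured with interaction_only=True and include_bias=False, so it keeps the original columns and adds only the pairwise products of distinct features, with no squared terms and no constant bias column. This widens the feature space so the model can learn more intricate patterns, which may improve its predictive power. Each added column represents the interaction between two features.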



In [None]:
preprocessing_pipe = Pipeline(
    steps=[
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(interaction_only=True, include_bias=False))
    ]
).set_output(transform='pandas')

preprocessing_pipe.fit_transform(X_train).sample()

public_reference,mileage,reg_code,standard_colour,standard_make,standard_model,vehicle_condition,body_type,crossover_car_and_van,fuel_type_diesel,fuel_type_electric,...,fuel_type_electric fuel_type_petrol,fuel_type_electric fuel_type_petrol_hybrid,fuel_type_electric fuel_type_petrol_plug_in_hybrid,fuel_type_electric age,fuel_type_petrol fuel_type_petrol_hybrid,fuel_type_petrol fuel_type_petrol_plug_in_hybrid,fuel_type_petrol age,fuel_type_petrol_hybrid fuel_type_petrol_plug_in_hybrid,fuel_type_petrol_hybrid age,fuel_type_petrol_plug_in_hybrid age
202010155030829,-1.210036,1.127393,-1.176887,0.228509,0.198831,3.88922,1.196605,-0.053575,-0.803872,-0.100777,...,-0.091212,0.01889,0.010845,0.113251,-0.169648,-0.0974,-1.017111,0.020171,0.210641,0.120935


Using scikit-learn's Pipeline class, I am building a preprocessing pipeline. The pipeline has two steps, applied in order: standard scaling, then polynomial (interaction) feature creation.
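As a rough illustration of what the pipeline does, the sketch below checks that running it gives the same result as applying the two transformers one after the other by hand. The data in X_demo are synthetic random numbers, used only as a stand-in for X_train.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: 50 rows, 3 features (not the car dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))

demo_pipe = Pipeline(
    steps=[
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(interaction_only=True, include_bias=False)),
    ]
)
via_pipeline = demo_pipe.fit_transform(X_demo)

# The same two steps applied manually, in the same order.
scaled = StandardScaler().fit_transform(X_demo)
by_hand = PolynomialFeatures(interaction_only=True, include_bias=False).fit_transform(scaled)

same_result = np.allclose(via_pipeline, by_hand)
print(via_pipeline.shape, same_result)
```

Because scaling comes first, the interaction columns are products of the scaled values, not of the raw ones.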

In [None]:
X_train_pp = preprocessing_pipe.fit_transform(X_train)

X_train_pp.columns

Index(['mileage', 'reg_code', 'standard_colour', 'standard_make',
       'standard_model', 'vehicle_condition', 'body_type',
       'crossover_car_and_van', 'fuel_type_diesel', 'fuel_type_electric',
       ...
       'fuel_type_electric fuel_type_petrol',
       'fuel_type_electric fuel_type_petrol_hybrid',
       'fuel_type_electric fuel_type_petrol_plug_in_hybrid',
       'fuel_type_electric age', 'fuel_type_petrol fuel_type_petrol_hybrid',
       'fuel_type_petrol fuel_type_petrol_plug_in_hybrid',
       'fuel_type_petrol age',
       'fuel_type_petrol_hybrid fuel_type_petrol_plug_in_hybrid',
       'fuel_type_petrol_hybrid age', 'fuel_type_petrol_plug_in_hybrid age'],
      dtype='object', length=105)

The code above fits the preprocessing pipeline 'preprocessing_pipe' on the training predictors X_train, transforms them, and assigns the result to X_train_pp.
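The length=105 reported for the columns can be checked with a little arithmetic: the 14 predictors (from X_train.shape) are kept, and one product column is added for each unordered pair of distinct predictors. A small sketch of that count:

```python
from math import comb

n_features = 14  # number of predictor columns, taken from X_train.shape above

# interaction_only=True with degree 2 and no bias column:
# the original columns plus one product per pair of distinct columns.
n_pairs = comb(n_features, 2)
n_interaction_only = n_features + n_pairs

# For comparison: the full degree-2 expansion would also add one squared term per column.
n_full_degree_2 = n_features + n_features + n_pairs

print(n_pairs, n_interaction_only, n_full_degree_2)
```

This gives 91 pairwise products and 105 columns in total, which matches the output above.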

In [None]:
X_train['mileage'].mean(), X_train['mileage'].std()

(37206.332635416285, 30744.542341025463)

In [None]:
X_train_pp['mileage'].mean(), X_train_pp['mileage'].std()

(-3.584912712986232e-17, 1.0000183471089559)

The difference in the mean and standard deviation values between the original and preprocessed data comes from the pipeline's standard scaling step. Standard scaling subtracts each column's mean and divides by its standard deviation, so every column ends up with a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution, so the scaled data are not necessarily normally distributed.

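The 'mileage' column in X_train_pp has a mean that is effectively zero and a standard deviation of about 1, which shows that the data were standardised correctly by the preprocessing pipeline. The standard deviation is very slightly above 1 because pandas' std() uses the sample estimate (ddof=1), whereas StandardScaler divides by the population standard deviation (ddof=0).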
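The tiny excess over 1 in the reported standard deviation can be reproduced with NumPy alone. The sketch below uses synthetic values as a stand-in for the mileage column (an assumption, the real data are not needed) and the 27253 training rows reported earlier. It standardises with the population standard deviation, as StandardScaler does, and then measures the result with the sample standard deviation, which is the pandas default.

```python
import numpy as np

n_rows = 27253  # number of training rows, from X_train.shape above

# Synthetic stand-in for a numeric column; the exact values do not matter.
rng = np.random.default_rng(0)
values = rng.normal(loc=37000.0, scale=30000.0, size=n_rows)

# Standardise using the population standard deviation (ddof=0).
scaled_values = (values - values.mean()) / values.std(ddof=0)

population_std = scaled_values.std(ddof=0)  # 1 up to floating-point error
sample_std = scaled_values.std(ddof=1)      # what pandas' std() reports by default
expected_ratio = np.sqrt(n_rows / (n_rows - 1))

print(population_std, sample_std, expected_ratio)
```

The sample standard deviation equals sqrt(n / (n - 1)), which for n = 27253 is about 1.0000183, in line with the value shown for X_train_pp['mileage'].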