# Vehicle Sales Price Predictions Workshop - Part 2 of 3

## Training Pipeline

In order to make a machine learning system from this dataset, we have structured the service into 3 pipelines:

1. feature engineering pipeline notebook (see Part 1)
2. training pipeline notebook (i.e., this Part 2)
3. inferencing pipeline notebook (see Part 3)

This notebook covers the second step, i.e., the training pipeline.


### 5. PREPARING THE DATA 

A machine learning model is, at its core, a mathematical function that only accepts numbers as input. Your categorical data must therefore be transformed (encoded) into numerical data at this stage. If you encode the data, you must also save the encoder, so that the same encoding can be applied to new inputs at prediction time and encoded values can be decoded back to their original categories.
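
To make the idea concrete, here is a minimal sketch of label encoding written in plain Python. It mirrors the principle behind scikit-learn's `LabelEncoder` (sorted distinct categories mapped to integer codes), but the helper names (`fit_label_mapping`, `encode`, `decode`), the sample column, and the file name are hypothetical and chosen only for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def fit_label_mapping(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace each category by its integer code."""
    return [mapping[value] for value in values]


def decode(codes, mapping):
    """Invert the mapping to recover the original categories."""
    inverse = {index: category for category, index in mapping.items()}
    return [inverse[code] for code in codes]


# Hypothetical column of car makes, for illustration only
makes = ["Kia", "BMW", "Volvo", "BMW"]
mapping = fit_label_mapping(makes)
codes = encode(makes, mapping)

# Persist the mapping so the same codes can be reused at prediction time
mapping_path = Path(tempfile.mkdtemp()) / "make_mapping.pkl"
mapping_path.write_bytes(pickle.dumps(mapping))
restored = pickle.loads(mapping_path.read_bytes())

print(codes)                    # [1, 0, 2, 0]
print(decode(codes, restored))  # ['Kia', 'BMW', 'Volvo', 'BMW']
```

The important point is that the mapping is fitted once and then reused unchanged: re-fitting it on new data could assign different codes to the same category and silently corrupt the model's inputs.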

In [None]:
# Install the Hopsworks client library
!pip install --quiet hopsworks

In [None]:
# Connect to the Hopsworks Feature store and get the feature group
import hopsworks
proj = hopsworks.login()
fs = proj.get_feature_store()
fg = fs.get_feature_group(name="car_prices_pytorch", version=1)

In [None]:
# Create a feature view for the training
feature_view = fs.get_or_create_feature_view(name="car_prices_pytorch",
                                             version=1,
                                             query= fg.select_except(["seller", "saledate"]),
                                             labels=["sellingprice"]
                                             )

In [None]:
# Now we can load the training data into a dataframe
import pandas as pd

features_df, labels_df = feature_view.training_data()
labels_df
features_df

In [None]:
# Now we will encode the dataset

from sklearn.preprocessing import LabelEncoder
import joblib

def encode_categorical_data(dataset, label_encoders):
    # Iterate over the columns of the DataFrame
    for column in dataset.columns:
        # Check if the column is of type 'object' (categorical)
        if dataset[column].dtype == 'object':
            # Create a LabelEncoder instance
            label_encoder = LabelEncoder()

            # Fit the encoder on the column and replace its values with integer codes
            dataset[column] = label_encoder.fit_transform(dataset[column])

            # Store the fitted encoder in the dictionary, keyed by column name
            label_encoders[column] = label_encoder
    return dataset

# Create a dictionary to store label encoders
clf = {}
df_encoded = encode_categorical_data(features_df, clf)
df_encoded

Transform the categorical values from the dataset 'dataset_cleaned.csv' into numeric values and save the encoder to a file for later use during prediction.

Split the encoded dataset into two parts, train and test. Keep 1000 rows in the test dataset.

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_encoded, labels_df, test_size=1000, random_state=42)

# Show training and test set sizes
print("Size of the training dataset :", len(X_train))
print("Size of the test dataset :", len(X_test))

### 6. TRAINING OF THE MODEL

Train a regression model using a deep neural network architecture. The column to predict is "sellingprice". Use the mean absolute error loss function and the PyTorch library.
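
As a quick illustration of the loss used here, the sketch below computes the mean absolute error by hand in plain Python. With its default `reduction='mean'`, `nn.L1Loss` computes the same quantity on tensors. The function name and the sample prices are hypothetical and only serve the example.

```python
def mean_absolute_error(predictions, targets):
    """Average of |prediction - target| over all examples."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical selling prices in dollars, for illustration only
predicted = [10500.0, 7000.0, 22000.0]
actual = [10000.0, 8000.0, 21000.0]

# Absolute errors are 500, 1000 and 1000, so the mean is 2500 / 3
mae = mean_absolute_error(predicted, actual)
print(f"MAE: {mae:.2f}")
```

Because the error is expressed in the same unit as the target, a loss value can be read directly as the average number of dollars by which a predicted price is off.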

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

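# Converting data to PyTorch tensors
# The labels are flattened to shape (N,) so that they match the squeezed model
# outputs; a (N, 1) target would be silently broadcast inside the loss
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1)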

# Creation of datasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# Model definition
class DeepRegressor(nn.Module):
    def __init__(self, input_size):
        super(DeepRegressor, self).__init__()
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialisation of the model
model = DeepRegressor(input_size=X_train_tensor.shape[1])

# Definition of loss function and optimizer
criterion = nn.L1Loss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training of the model
num_epochs = 50
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

losses = []
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs.squeeze(1), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * inputs.size(0)
    epoch_loss = running_loss / len(train_loader.dataset)
    losses.append(epoch_loss)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}")

# Evaluation of the model
model.eval()
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
test_loss = 0.0

with torch.no_grad():
    for inputs, targets in test_loader:
        outputs = model(inputs)
        loss = criterion(outputs.squeeze(1), targets)
        test_loss += loss.item() * inputs.size(0)

test_loss /= len(test_loader.dataset)
print(f"Test Loss: {test_loss:.4f}")


In [None]:
# Graphical representation of the model training results
import matplotlib.pyplot as plt
import numpy as np

plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Mean absolute error')

plt.plot(np.array(losses), 'r')
plt.show()

To reduce the error, you can:

1. Test different neural network architectures (for example, more layers or more neurons per layer)
2. Change the hyperparameters (more epochs, a different learning rate, etc.)
3. Normalize your data before training, during the data preparation phase. If you normalize the target, remember to rescale (denormalize) the model outputs before reporting predictions.
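
The third suggestion can be sketched as follows, assuming standardization (subtract the mean, divide by the standard deviation) as the normalization method; other schemes such as min-max scaling would work in the same way. The helper names and the sample prices are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Compute the statistics on the training data only."""
    return mean(values), pstdev(values)


def normalize(values, center, scale):
    """Rescale values to zero mean and unit standard deviation."""
    return [(value - center) / scale for value in values]


def denormalize(values, center, scale):
    """Map normalized values back to the original unit."""
    return [value * scale + center for value in values]


# Hypothetical training prices in dollars, for illustration only
train_prices = [8000.0, 12000.0, 16000.0]
center, scale = fit_scaler(train_prices)

scaled = normalize(train_prices, center, scale)
restored = denormalize(scaled, center, scale)

print(scaled)
print(restored)
```

The statistics should be computed on the training set only and saved alongside the model, in the same way as the label encoders, so that the inference pipeline can apply exactly the same transformation.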

### 7. SAVE THE MODEL TO A FILE FOR LATER USE

Save the entire trained model to a file so that it can be used later and put into production.

In [None]:
# This step will upload the model to the Hopsworks Model Registry

import os
from hsml.schema import Schema
from hsml.model_schema import ModelSchema
import joblib

input_schema = Schema(features_df)
output_schema = Schema(labels_df)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)

model_name = "car_prices_model_pytorch"

os.makedirs(model_name + "/images", exist_ok=True)

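# Redraw the loss curve before saving: the figure displayed earlier is no
# longer active once it has been shown, so saving it directly gives a blank image
plt.figure()
plt.plot(losses, 'r')
plt.title('Training Loss')
plt.savefig(model_name + "/images/training_losses.png")
plt.close()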

# Saving the model
torch.save(model, model_name + '/regression_model.pth')
joblib.dump(clf, model_name + '/label_encoders.pkl')

mr = proj.get_model_registry()

car_prices_pytorch_model = mr.torch.create_model(
    model_name,
    model_schema=model_schema,
    metrics = {'test_loss' : test_loss}
)

# Upload the contents of the "car_prices_model_pytorch" directory to the Model Registry
car_prices_pytorch_model.save(model_name)

Now we can proceed to the inference pipeline (Part 3) of the workshop.