# **Build End-to-End ML Pipeline for Truck Delay Classification Part 2**


The project addresses a critical challenge faced by the logistics industry. Delayed truck shipments not only result in increased operational costs but also impact customer satisfaction. Timely delivery of goods is essential to meet customer expectations and maintain the competitiveness of logistics companies.
By accurately predicting truck delays, logistics companies can:
* Improve operational efficiency by allocating resources more effectively
* Enhance customer satisfaction by providing more reliable delivery schedules
* Optimize route planning to reduce delays caused by traffic or adverse weather conditions
* Reduce costs associated with delayed shipments, such as penalties or compensation to customers

In the first phase of our three-part series, [Learn to Build an End-to-End Machine Learning Pipeline - Part 1](https://www.projectpro.io/project-use-case/build-an-end-to-end-machine-learning-pipeline-for-a-classification-model), we laid the groundwork by utilizing PostgreSQL and MySQL in AWS RDS for data storage, setting up an AWS SageMaker notebook, performing data retrieval, conducting exploratory data analysis, and creating feature groups with Hopsworks.

In Part 2, we delve deeper into the machine-learning pipeline. We retrieve data from the feature store, create a train-validation-test split, apply one-hot encoding, scale numerical features, and use Weights and Biases to track experiments with logistic regression, random forest, and XGBoost models. We then explore hyperparameter tuning with sweeps, discuss grid and random search, and finally deploy a Streamlit application on AWS.


**Note:  AWS Usage Charges**
This project leverages the AWS cloud platform to build the end-to-end machine learning pipeline. While using AWS services, it's important to note that certain activities may incur charges. We recommend exploring the AWS Free Tier, which provides limited access to a wide range of AWS services for 12 months. Please refer to the AWS Free Tier page for detailed information, including eligible services and usage limitations.






![image.png](https://images.pexels.com/photos/2199293/pexels-photo-2199293.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=1)


## **Approach**

* Data Retrieval from Hopsworks:
    * Connecting Hopsworks with Python.
    * Retrieving data directly from the feature store.


* Train-Validation-Test Split

* One-Hot Encoding

* Scaling Numerical Features

* Model Experimentation and Tracking:
    * Weights and Biases Introduction
    * Setting up a new project and connecting it to Python


* Model Building
    * Logistic Regression
    * Random Forest
    * XGBoost


* Hyperparameter Tuning with Sweeps


* Streamlit Application Development and Fetching the Best Model


* Deployment on AWS EC2 Instance


## **System Requirements**

* python version : 3.10.12

## **Library Requirements**

* hopsworks==3.4.3
* streamlit==1.29.0
* pandas==1.5.3
* joblib==1.3.2
* wandb==0.16.1
* xgboost==2.0.2
* scikit_learn==1.2.2

## **Key Takeaways**



* Connect Python with Hopsworks and fetch data.
* Understand the significance of the train-validation-test split.
* Implement one-hot encoding for categorical variables.
* Distinguish between fit-transform and transform, and store fitted transformers for future use.
* Implement normalization techniques in Python.
* Understand the significance of experiment tracking.
* Connect with Weights and Biases for model experimentation.
* Implement and track logistic regression, random forest, and XGBoost models.
* Explore model evaluation metrics and their business implications.
* Utilize hyperparameter sweeps in Weights and Biases for tuning.
* Learn to fetch the best model from Weights and Biases.
* Develop a Streamlit application and deploy it on an AWS EC2 instance.


### **Data Retrieval from Hopsworks**

In [2]:
# import libraries
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import hopsworks

In [3]:
# Login to hopsworks by entering api key value
project = hopsworks.login(api_key_value='<enter_your_api_key>')

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/105624


In [4]:
# Get the feature store
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


In [5]:
# Retrieve final
final_data = fs.get_feature_group('final_data', version=1)

In [6]:
# Select the query
query = final_data.select_all()

In [7]:
# Read all data
final_merge = query.read(read_options={"use_hive": True})



Finished: Reading data from Hopsworks, using Hive (3.70s) 


In [9]:
# First five rows
final_merge.head()

Unnamed: 0,unique_id,truck_id,route_id,departure_date,estimated_arrival,delay,route_avg_temp,route_avg_wind_speed,route_avg_precip,route_avg_humidity,...,driver_id,name,gender,age,experience,driving_style,ratings,vehicle_no,average_speed_mph,is_midnight
0,637,27585963,R-41fb82a2,2019-02-12 07:00:00,2019-02-12 13:59:24,0,54.333333,5.666667,0.0,54.0,...,5001910e-8,Carlos Torres,male,38,9,proactive,2,27585963,63.31,0
1,4088,18855810,R-c061582f,2019-02-06 07:00:00,2019-02-06 19:22:48,0,22.0,10.0,0.0,73.5,...,728ab9fc-f,John Vasquez,male,47,4,conservative,4,18855810,48.03,0
2,6758,31562809,R-a61d33ae,2019-01-10 07:00:00,2019-01-10 14:12:00,1,66.333333,8.666667,0.033333,84.333333,...,49fe8aed-8,Kurt Smith,male,42,13,proactive,8,31562809,63.74,0
3,9783,61984883,R-d87e53cd,2019-02-09 07:00:00,2019-02-09 20:38:24,0,80.5,10.0,0.0,63.5,...,42aa7479-5,Jerry Powers,male,41,7,conservative,2,61984883,56.94,0
4,8042,17692634,R-ab28b5f1,2019-02-06 07:00:00,2019-02-06 16:25:48,0,67.333333,10.666667,0.033333,90.0,...,458adb7d-5,John Coleman,male,50,13,conservative,6,17692634,45.07,0


### **Data Processing**

In [35]:
# Basic Information on the dataframe
final_merge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12308 entries, 0 to 12307
Data columns (total 49 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   unique_id                       12308 non-null  int64         
 1   truck_id                        12308 non-null  int64         
 2   route_id                        12308 non-null  object        
 3   departure_date                  12308 non-null  datetime64[ns]
 4   estimated_arrival               12308 non-null  datetime64[ns]
 5   delay                           12308 non-null  int64         
 6   route_avg_temp                  12308 non-null  float64       
 7   route_avg_wind_speed            12308 non-null  float64       
 8   route_avg_precip                12308 non-null  float64       
 9   route_avg_humidity              12308 non-null  float64       
 10  route_avg_visibility            12308 non-null  float64       
 11  ro

In [36]:
# Number of null values
final_merge.isna().sum()

unique_id                           0
truck_id                            0
route_id                            0
departure_date                      0
estimated_arrival                   0
delay                               0
route_avg_temp                      0
route_avg_wind_speed                0
route_avg_precip                    0
route_avg_humidity                  0
route_avg_visibility                0
route_avg_pressure                  0
route_description                   0
estimated_arrival_nearest_hour      0
departure_date_nearest_hour         0
origin_id                           0
destination_id                      0
distance                            0
average_hours                       0
origin_temp                         4
origin_wind_speed                   4
origin_description                  0
origin_precip                       4
origin_humidity                     4
origin_visibility                   4
origin_pressure                     4
destination_

In [37]:
# Let's check the rows where origin temp is null
final_merge[final_merge['origin_temp'].isnull()]

Unnamed: 0,unique_id,truck_id,route_id,departure_date,estimated_arrival,delay,route_avg_temp,route_avg_wind_speed,route_avg_precip,route_avg_humidity,...,driver_id,name,gender,age,experience,driving_style,ratings,vehicle_no,average_speed_mph,is_midnight
3086,7662,18091756,R-112b790b,2019-01-25 07:00:00,2019-01-27 02:40:48,1,66.555556,6.888889,0.0,90.888889,...,e975a383-c,Neil Herring,male,45,7,proactive,3,18091756,58.02,1
9692,11359,22916520,R-78ee1f97,2019-01-25 07:00:00,2019-01-28 10:08:24,0,57.5,10.142857,0.0,78.214286,...,ffedbf74-a,Thomas Ochoa,male,57,19,proactive,6,22916520,63.64,1
9768,8165,24746768,R-b5f9418a,2019-01-25 07:00:00,2019-01-27 14:35:24,0,47.454545,9.090909,0.0,70.636364,...,3d91387f-2,William Anderson III,male,50,0,conservative,4,24746768,40.69,1
12063,7721,24654257,R-21472caf,2019-01-25 07:00:00,2019-01-27 16:50:24,0,69.0,12.363636,0.018182,79.181818,...,f110642c-1,Marc Walters,male,47,5,proactive,3,24654257,61.93,1


In [38]:
# Let's check the rows where origin humidity is null
# Looks like we have null values in the same rows, let's find out which origin city is this
final_merge[final_merge['origin_humidity'].isnull()]

Unnamed: 0,unique_id,truck_id,route_id,departure_date,estimated_arrival,delay,route_avg_temp,route_avg_wind_speed,route_avg_precip,route_avg_humidity,...,driver_id,name,gender,age,experience,driving_style,ratings,vehicle_no,average_speed_mph,is_midnight
3086,7662,18091756,R-112b790b,2019-01-25 07:00:00,2019-01-27 02:40:48,1,66.555556,6.888889,0.0,90.888889,...,e975a383-c,Neil Herring,male,45,7,proactive,3,18091756,58.02,1
9692,11359,22916520,R-78ee1f97,2019-01-25 07:00:00,2019-01-28 10:08:24,0,57.5,10.142857,0.0,78.214286,...,ffedbf74-a,Thomas Ochoa,male,57,19,proactive,6,22916520,63.64,1
9768,8165,24746768,R-b5f9418a,2019-01-25 07:00:00,2019-01-27 14:35:24,0,47.454545,9.090909,0.0,70.636364,...,3d91387f-2,William Anderson III,male,50,0,conservative,4,24746768,40.69,1
12063,7721,24654257,R-21472caf,2019-01-25 07:00:00,2019-01-27 16:50:24,0,69.0,12.363636,0.018182,79.181818,...,f110642c-1,Marc Walters,male,47,5,proactive,3,24654257,61.93,1


In [39]:
# Fetch the routes data
routes_data = fs.get_feature_group('routes_details_fg', version=1)

routes_data_query = routes_data.select_all()

routes_df = routes_data_query.read(read_options={"use_hive": True})



Finished: Reading data from Hopsworks, using Hive (1.20s) 


In [40]:
# Find the routes whose origin city has no weather information on 25 Jan
# All four routes share the same origin city
routes_df[routes_df.route_id.isin(['R-112b790b', 'R-78ee1f97','R-b5f9418a', 'R-21472caf'])]

Unnamed: 0,route_id,origin_id,destination_id,distance,average_hours,event_time
6,R-b5f9418a,C-f8f01604,C-4fe0fa24,2779.33,55.59,2023-08-23
438,R-21472caf,C-f8f01604,C-2e349ccd,2892.14,57.84,2023-08-23
702,R-112b790b,C-f8f01604,C-d3bb431c,2183.94,43.68,2023-08-23
1290,R-78ee1f97,C-f8f01604,C-f5ed4c15,3757.02,75.14,2023-08-23


In [41]:
# Let's check if we have any information on this city
# Fetching the weather data
weather_data = fs.get_feature_group('city_weather_details_fg', version=1)

weather_query = weather_data.select_all()

weather_df = weather_query.read(read_options={"use_hive": True})



Finished: Reading data from Hopsworks, using Hive (3.05s) 


In [42]:
# Filter the weather data with city and date
# We don't have any information on this, we will remove these rows
# It is important to check with the business regarding the information though 
weather_df[(weather_df.city_id=='C-f8f01604')&(weather_df.date==pd.to_datetime('2019-01-25'))]

Unnamed: 0,city_id,date,hour,temp,wind_speed,description,precip,humidity,visibility,pressure,chanceofrain,chanceoffog,chanceofsnow,chanceofthunder


In [49]:
# Drop the rows

final_merge=final_merge.dropna(subset =  ['origin_temp', 'origin_wind_speed', 'origin_precip',
                                'origin_humidity', 'origin_visibility', 'origin_pressure' ] ).reset_index(drop=True)

In [50]:
# Let's verify the dropped null values
final_merge.isna().sum()

unique_id                           0
truck_id                            0
route_id                            0
departure_date                      0
estimated_arrival                   0
delay                               0
route_avg_temp                      0
route_avg_wind_speed                0
route_avg_precip                    0
route_avg_humidity                  0
route_avg_visibility                0
route_avg_pressure                  0
route_description                   0
estimated_arrival_nearest_hour      0
departure_date_nearest_hour         0
origin_id                           0
destination_id                      0
distance                            0
average_hours                       0
origin_temp                         0
origin_wind_speed                   0
origin_description                  0
origin_precip                       0
origin_humidity                     0
origin_visibility                   0
origin_pressure                     0
destination_

In [51]:
final_merge

Unnamed: 0,unique_id,truck_id,route_id,departure_date,estimated_arrival,delay,route_avg_temp,route_avg_wind_speed,route_avg_precip,route_avg_humidity,...,driver_id,name,gender,age,experience,driving_style,ratings,vehicle_no,average_speed_mph,is_midnight
0,637,27585963,R-41fb82a2,2019-02-12 07:00:00,2019-02-12 13:59:24,0,54.333333,5.666667,0.000000,54.000000,...,5001910e-8,Carlos Torres,male,38,9,proactive,2,27585963,63.31,0
1,4088,18855810,R-c061582f,2019-02-06 07:00:00,2019-02-06 19:22:48,0,22.000000,10.000000,0.000000,73.500000,...,728ab9fc-f,John Vasquez,male,47,4,conservative,4,18855810,48.03,0
2,6758,31562809,R-a61d33ae,2019-01-10 07:00:00,2019-01-10 14:12:00,1,66.333333,8.666667,0.033333,84.333333,...,49fe8aed-8,Kurt Smith,male,42,13,proactive,8,31562809,63.74,0
3,9783,61984883,R-d87e53cd,2019-02-09 07:00:00,2019-02-09 20:38:24,0,80.500000,10.000000,0.000000,63.500000,...,42aa7479-5,Jerry Powers,male,41,7,conservative,2,61984883,56.94,0
4,8042,17692634,R-ab28b5f1,2019-02-06 07:00:00,2019-02-06 16:25:48,0,67.333333,10.666667,0.033333,90.000000,...,458adb7d-5,John Coleman,male,50,13,conservative,6,17692634,45.07,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12299,1846,66683794,R-a51bd34b,2019-01-06 07:00:00,2019-01-07 09:28:48,0,59.000000,9.333333,0.000000,89.333333,...,138dd3a4-5,Bradley Ramirez,male,47,9,proactive,4,66683794,63.14,1
12300,10298,20252337,R-2b5aba88,2019-02-10 07:00:00,2019-02-11 13:02:24,1,58.571429,10.428571,0.000000,57.428571,...,3bb7faad-d,Renee Cuevas,female,43,12,proactive,4,20252337,62.82,1
12301,9490,21949967,R-cf631f68,2019-01-25 07:00:00,2019-01-25 22:06:00,0,45.250000,9.750000,0.000000,71.250000,...,2363c29b-1,Zachary Hardy,male,41,4,conservative,8,21949967,41.56,0
12302,1716,96362807,R-83e7feed,2019-01-07 07:00:00,2019-01-07 21:33:00,1,71.500000,7.000000,0.000000,56.500000,...,1ed64257-3,Steven Walton,male,43,10,proactive,3,96362807,54.95,0


### **Train - Validation - Test Split**

In a train-test split, the data points are divided into two datasets (train and test); in a train-validation-test split, they are divided into three. The train data is used to fit the model, and the model then predicts on the test data to show how it performs on unseen data and whether it is overfitting or underfitting.


The validation set is a separate set of data, distinct from the training set, that is used to evaluate the model during development. It provides the feedback needed to fine-tune the model's hyperparameters and configurations.

Once the model has been optimized with the help of the validation set, it is evaluated on the unseen test data.



In [53]:
#selecting necessary columns and removing id columns

cts_cols=['route_avg_temp', 'route_avg_wind_speed',
       'route_avg_precip', 'route_avg_humidity', 'route_avg_visibility',
       'route_avg_pressure', 'distance', 'average_hours',
       'origin_temp', 'origin_wind_speed', 'origin_precip', 'origin_humidity',
       'origin_visibility', 'origin_pressure',
       'destination_temp','destination_wind_speed','destination_precip',
       'destination_humidity', 'destination_visibility','destination_pressure',
        'avg_no_of_vehicles', 'truck_age','load_capacity_pounds', 'mileage_mpg',
        'age', 'experience','average_speed_mph']


cat_cols=['route_description',
       'origin_description', 'destination_description',
        'accident', 'fuel_type',
       'gender', 'driving_style', 'ratings','is_midnight']


target=['delay']



In [54]:
# Checking the date range
final_merge['estimated_arrival'].min(), final_merge['estimated_arrival'].max()

(Timestamp('2019-01-01 07:04:48'), Timestamp('2019-02-14 16:06:00'))

In [61]:
# Splitting the data into training, validation, and test sets based on date

train_df = final_merge[final_merge['estimated_arrival'] <= pd.to_datetime('2019-01-30')]

validation_df = final_merge[(final_merge['estimated_arrival'] > pd.to_datetime('2019-01-30')) &

                            (final_merge['estimated_arrival'] <= pd.to_datetime('2019-02-07'))]

test_df = final_merge[final_merge['estimated_arrival'] > pd.to_datetime('2019-02-07')]

In [62]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are assigned later
X_train = train_df[cts_cols + cat_cols].copy()

y_train = train_df['delay']



In [63]:
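# .copy() avoids pandas' SettingWithCopyWarning when columns are assigned later
X_valid = validation_df[cts_cols + cat_cols].copy()

y_valid = validation_df['delay']

X_test = test_df[cts_cols + cat_cols].copy()

y_test = test_df['delay']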
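
The date-based split can be sanity-checked on a small synthetic frame. The sketch below is illustrative only: the helper `split_by_date` and the toy data are assumptions, not part of the original notebook. The cutoff logic mirrors the `<=` and `>` comparisons used in the split above.

```python
import pandas as pd


def split_by_date(df, date_col, train_end, valid_end):
    """Hypothetical helper: chronological three-way split on `date_col`."""
    train_end = pd.to_datetime(train_end)
    valid_end = pd.to_datetime(valid_end)
    train = df[df[date_col] <= train_end]
    valid = df[(df[date_col] > train_end) & (df[date_col] <= valid_end)]
    test = df[df[date_col] > valid_end]
    return train, valid, test


# Toy data: one row per day from 2019-01-01 through 2019-02-14
toy = pd.DataFrame({
    "estimated_arrival": pd.date_range("2019-01-01", "2019-02-14", freq="D"),
})
toy["delay"] = (toy.index % 3 == 0).astype(int)

train_toy, valid_toy, test_toy = split_by_date(
    toy, "estimated_arrival", "2019-01-30", "2019-02-07"
)

# Every row lands in exactly one split, and the splits are in time order
assert len(train_toy) + len(valid_toy) + len(test_toy) == len(toy)
assert train_toy["estimated_arrival"].max() < valid_toy["estimated_arrival"].min()
assert valid_toy["estimated_arrival"].max() < test_toy["estimated_arrival"].min()

print(len(train_toy), len(valid_toy), len(test_toy))  # 30 8 7
```

Splitting on time rather than at random keeps later shipments out of the training data, which is closer to how the model would be used on future trips.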

### **Data Preprocessing and Leakage**

Data leakage is a situation where information from the test or prediction data is inadvertently used during the training process of a machine learning model. The model then appears to perform better during evaluation than it will on genuinely unseen data.

Data leakage often occurs during the preprocessing phase, when statistics computed from the test or prediction data are used to preprocess the training data.

For example, consider a scenario where the preprocessing step involves imputing missing values in the dataset. If the missing values are imputed using the mean or median values of the entire dataset, including the test and prediction data, then the imputed values in the training data are influenced by the values in the test and prediction data. This is data leakage: the model indirectly sees patterns from the test and prediction data during training, which leads to overly optimistic evaluation scores and poor generalization performance.


To avoid data leakage, fit every preprocessing step (imputation values, encoders, scalers) on the training data only, and then apply the fitted step, unchanged, to the validation, test, and prediction data. This ensures that those datasets remain unseen by the model during the training process and keeps the evaluation honest.

In the context of this problem, the rows with missing origin weather were dropped before the split for the sake of simplicity. The statistics used in the steps that follow (the imputation mode, the encoder categories, and the scaler parameters) are computed from the training split only and then applied to the validation and test splits. In real-world scenarios, treat prediction data the same way: reuse the fitted preprocessing objects rather than refitting them.
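
As a minimal illustration of leakage-free imputation, the sketch below fits scikit-learn's `SimpleImputer` on a toy training frame and reuses it on a toy test frame. The data values are made up for the example, and `SimpleImputer` is an alternative to the manual `mode()` and `fillna` approach used in this notebook, not the notebook's own method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): the test split has a different most-frequent value
train = pd.DataFrame({"load_capacity_pounds": [3000.0, 3000.0, 10000.0, np.nan]})
test = pd.DataFrame({"load_capacity_pounds": [np.nan, 20000.0, 20000.0, 20000.0]})

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train)  # the fill value is learned from the training rows only

train_filled = imputer.transform(train)
test_filled = imputer.transform(test)  # reuses the training mode, never refits

print(imputer.statistics_)  # [3000.]
print(test_filled[0, 0])    # 3000.0, not the test split's own mode of 20000.0
```

Had the imputer been fit on the combined data, the fill value would have been 20000.0 and would carry information about the test split into training.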

In [67]:
load_capacity_mode = X_train['load_capacity_pounds'].mode()

load_capacity_mode

0    3000.0
Name: load_capacity_pounds, dtype: float64

In [68]:
X_train['load_capacity_pounds']=X_train['load_capacity_pounds'].fillna(load_capacity_mode.iloc[0])
X_valid['load_capacity_pounds']=X_valid['load_capacity_pounds'].fillna(load_capacity_mode.iloc[0])
X_test['load_capacity_pounds']=X_test['load_capacity_pounds'].fillna(load_capacity_mode.iloc[0])

In [69]:
X_train.isna().sum()

route_avg_temp             0
route_avg_wind_speed       0
route_avg_precip           0
route_avg_humidity         0
route_avg_visibility       0
route_avg_pressure         0
distance                   0
average_hours              0
origin_temp                0
origin_wind_speed          0
origin_precip              0
origin_humidity            0
origin_visibility          0
origin_pressure            0
destination_temp           0
destination_wind_speed     0
destination_precip         0
destination_humidity       0
destination_visibility     0
destination_pressure       0
avg_no_of_vehicles         0
truck_age                  0
load_capacity_pounds       0
mileage_mpg                0
age                        0
experience                 0
average_speed_mph          0
route_description          0
origin_description         0
destination_description    0
accident                   0
fuel_type                  0
gender                     0
driving_style              0
ratings       

In [70]:
X_valid.isna().sum()

route_avg_temp             0
route_avg_wind_speed       0
route_avg_precip           0
route_avg_humidity         0
route_avg_visibility       0
route_avg_pressure         0
distance                   0
average_hours              0
origin_temp                0
origin_wind_speed          0
origin_precip              0
origin_humidity            0
origin_visibility          0
origin_pressure            0
destination_temp           0
destination_wind_speed     0
destination_precip         0
destination_humidity       0
destination_visibility     0
destination_pressure       0
avg_no_of_vehicles         0
truck_age                  0
load_capacity_pounds       0
mileage_mpg                0
age                        0
experience                 0
average_speed_mph          0
route_description          0
origin_description         0
destination_description    0
accident                   0
fuel_type                  0
gender                     0
driving_style              0
ratings       

In [71]:
X_test.isna().sum()


route_avg_temp             0
route_avg_wind_speed       0
route_avg_precip           0
route_avg_humidity         0
route_avg_visibility       0
route_avg_pressure         0
distance                   0
average_hours              0
origin_temp                0
origin_wind_speed          0
origin_precip              0
origin_humidity            0
origin_visibility          0
origin_pressure            0
destination_temp           0
destination_wind_speed     0
destination_precip         0
destination_humidity       0
destination_visibility     0
destination_pressure       0
avg_no_of_vehicles         0
truck_age                  0
load_capacity_pounds       0
mileage_mpg                0
age                        0
experience                 0
average_speed_mph          0
route_description          0
origin_description         0
destination_description    0
accident                   0
fuel_type                  0
gender                     0
driving_style              0
ratings       

### **Encoding**


Encoding is a process in machine learning that involves converting categorical data, which consists of non-numeric labels, into a numerical format. This transformation is necessary because many machine learning algorithms operate on numerical data, and categorical variables need to be represented in a way that the algorithms can understand and process effectively.

**Why Encoding is Needed:**

Most machine learning algorithms require numerical input. Categorical data, which includes labels like 'red' or 'blue,' must be encoded into numeric values for these algorithms to work properly.

Categorical variables often have no inherent order or numerical meaning. Encoding allows us to represent them numerically without implying any ordinal relationships.

Types of Encoding:

* One-Hot Encoding: It creates a binary column for each category and indicates the presence of the category with a 1 and its absence with a 0.
Example: If we have colors as categories ('Red', 'Blue', 'Green'), one-hot encoding would create three binary columns, each representing one color.


* Label Encoding: It assigns a unique numerical label, usually an integer, to each category. It is mostly used when the categories have an inherent order or ranking, because the integers imply one.
Example: If we have sizes ('Small', 'Medium', 'Large'), label encoding might assign 1 to 'Small', 2 to 'Medium', and 3 to 'Large'.


* Target Encoding (Mean Encoding): It replaces a categorical value with the mean of the target variable for that category.
Example: If we have a binary target variable (0 or 1) and a categorical feature like 'Country', target encoding would replace each country with the mean of the target variable for that country.
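
The three encodings described above can be compared side by side on a toy frame. This is a sketch with made-up data and column names, using plain pandas rather than the scikit-learn encoder used later in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): three categorical features and a binary target
df = pd.DataFrame({
    "color": ["Red", "Blue", "Green", "Blue"],
    "size": ["Small", "Large", "Medium", "Small"],
    "country": ["US", "US", "IN", "IN"],
    "target": [1, 0, 1, 1],
})

# One-hot encoding: one 0/1 column per color
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Label (ordinal) encoding: an explicit mapping keeps the intended order
size_order = {"Small": 1, "Medium": 2, "Large": 3}
size_encoded = df["size"].map(size_order)

# Target encoding: replace each country with the mean target for that country
country_means = df.groupby("country")["target"].mean()
country_encoded = df["country"].map(country_means)

print(one_hot.columns.tolist())   # ['color_Blue', 'color_Green', 'color_Red']
print(size_encoded.tolist())      # [1, 3, 2, 1]
print(country_encoded.tolist())   # [0.5, 0.5, 1.0, 1.0]
```

In practice the target-encoding means should be computed on the training split only, since they are derived from the label and would otherwise leak it.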



In [19]:
# Importing the One-Hot Encoder and pickle's dump function
from sklearn.preprocessing import OneHotEncoder
from pickle import dump


In [20]:
# Creating the One-Hot Encoder
# `sparse_output` replaces the deprecated `sparse` argument (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [21]:
# Specifying columns to be encoded
encode_columns = ['route_description', 'origin_description', 'destination_description', 'fuel_type', 'gender', 'driving_style']

In [22]:
# Fitting the encoder on the training data
encoder.fit(X_train[encode_columns])

In [23]:
# Generating names for the new one-hot encoded features
encoded_features = list(encoder.get_feature_names_out(encode_columns))

In [24]:
encoded_features

['route_description_Blizzard',
 'route_description_Blowing snow',
 'route_description_Clear',
 'route_description_Cloudy',
 'route_description_Fog',
 'route_description_Freezing drizzle',
 'route_description_Freezing fog',
 'route_description_Heavy rain',
 'route_description_Heavy rain at times',
 'route_description_Heavy snow',
 'route_description_Light drizzle',
 'route_description_Light freezing rain',
 'route_description_Light rain',
 'route_description_Light rain shower',
 'route_description_Light sleet',
 'route_description_Light sleet showers',
 'route_description_Light snow',
 'route_description_Mist',
 'route_description_Moderate or heavy freezing rain',
 'route_description_Moderate or heavy rain shower',
 'route_description_Moderate or heavy rain with thunder',
 'route_description_Moderate or heavy sleet',
 'route_description_Moderate or heavy sleet showers',
 'route_description_Moderate or heavy snow showers',
 'route_description_Moderate or heavy snow with thunder',
 'route

In [25]:
# Transforming the training, validation, and test sets

X_train[encoded_features] = encoder.transform(X_train[encode_columns])

X_valid[encoded_features] = encoder.transform(X_valid[encode_columns])

X_test[encoded_features] = encoder.transform(X_test[encode_columns])

In [26]:
# Dumping the encoder for future use
with open('truck_data_encoder.pkl', 'wb') as f:
    dump(encoder, f)

In [27]:
# Dropping the original categorical features

X_train = X_train.drop(encode_columns, axis=1)

X_valid = X_valid.drop(encode_columns, axis=1)

X_test = X_test.drop(encode_columns, axis=1)

### **Scaling Numerical Features**

Feature scaling is a crucial preprocessing step in machine learning that involves transforming the range of features (variables) in a dataset to a common scale. This is necessary because many machine learning algorithms are sensitive to the scale of the input features. Feature scaling ensures that all features contribute equally to the model training process and prevents some features from dominating others solely due to their scale.

Types of Feature Scaling:

* **Min-Max Scaling (Normalization):**

    Formula: $$ X_{\text{normalized}} = \frac{X - \text{min}(X)}{\text{max}(X) - \text{min}(X)} $$

    Min-Max scaling transforms each feature to a range between 0 and 1. It subtracts the minimum value of the feature from each data point and then divides by the range (the difference between the maximum and minimum values). This ensures that the transformed data is in the desired range. Suitable for algorithms that rely on distances between data points, such as k-nearest neighbors and support vector machines.
    Sensitive to outliers.


* **Standardization (Z-score):**

    Formula: $$X_{\text{standardized}} = \frac{X - \text{mean}(X)}{\text{std}(X)}$$

    Standardization transforms the data to have a mean of 0 and a standard deviation of 1. It subtracts the mean of the feature from each data point and then divides by the standard deviation. This centers and rescales the feature but does not change the shape of its distribution. Suitable for algorithms that benefit from centered features on comparable scales, such as regularized linear models and neural networks.
    Less sensitive to outliers compared to Min-Max scaling.
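
Both formulas can be checked numerically. The sketch below applies them by hand to a toy column (made-up values) and compares the results with scikit-learn's `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature column (hypothetical values) with one large value
x = np.array([[10.0], [20.0], [30.0], [100.0]])

# Min-Max scaling by hand: (X - min) / (max - min)
manual_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization by hand: (X - mean) / std
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses as well
manual_z = (x - x.mean()) / x.std()

sk_minmax = MinMaxScaler().fit_transform(x)
sk_z = StandardScaler().fit_transform(x)

assert np.allclose(manual_minmax, sk_minmax)
assert np.allclose(manual_z, sk_z)

print(manual_minmax.ravel())  # values between 0 and 1
print(manual_z.ravel())       # mean 0, standard deviation 1
```

Note how the single large value pushes the other Min-Max scaled points toward 0, which is the outlier sensitivity mentioned above.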





In [28]:
# Import Scaler
from sklearn.preprocessing import StandardScaler

In [29]:
scaler = StandardScaler()

In [30]:
# Scale Separate Columns

# train

X_train[cts_cols] = scaler.fit_transform(X_train[cts_cols])

In [31]:
# valid

X_valid[cts_cols] = scaler.transform(X_valid[cts_cols])


# test

X_test[cts_cols] = scaler.transform(X_test[cts_cols])

In [32]:
# Dump the scaler to use in transforming test data

with open('truck_data_scaler.pkl', 'wb') as f:
    dump(scaler, f)
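
A saved scaler is only useful if it can be reloaded and applied without refitting. The following sketch, which uses toy data and a temporary file rather than the notebook's `truck_data_scaler.pkl`, shows the round trip one might use at inference time.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values)
train = np.array([[1.0], [2.0], [3.0]])
new_rows = np.array([[4.0]])

scaler_demo = StandardScaler().fit(train)  # fit on training data only

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler_demo.pkl")
    with open(path, "wb") as f:
        pickle.dump(scaler_demo, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored scaler carries the training mean and scale, so call
# transform (not fit_transform) on new data
scaled_new = restored.transform(new_rows)

assert np.allclose(scaled_new, scaler_demo.transform(new_rows))
print(restored.mean_)  # [2.]
```

Pickled scikit-learn objects should be loaded with the same library version that wrote them, which is one reason the library versions are pinned in the requirements.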

## **Model Building and Experimentation**

Experiment tracking involves systematically recording and managing details related to each model experiment, including hyperparameters, metrics, and data versions.
Benefits: Ensures reproducibility, supports comparison of different models, facilitates collaboration, and aids in informed decision-making.




A Model Registry is a centralized system for managing, versioning, and tracking machine learning models throughout their lifecycle.
Benefits: Enables version control, streamlines collaboration, provides deployment features, maintains metadata, and enhances reproducibility.

### **Connecting to WANDB**

Weights and Biases (W&B) is a collaborative platform designed to help machine learning practitioners track, visualize, and analyze their machine learning experiments. It provides tools for experiment management, visualization of model performance, and collaboration among team members. Here's an overview of key aspects:

* Experiment Tracking:
    W&B allows users to log various parameters, metrics, and artifacts associated with their machine learning experiments. This includes hyperparameters, model architecture details, training and evaluation metrics, and even visualizations.

* Visualization:
    Users can leverage W&B to create visualizations of their experiment results. This includes charts, graphs, and plots that make it easy to understand how different parameters impact model performance over time.

* Hyperparameter Sweeps:
    W&B facilitates hyperparameter tuning through sweeps. Users can define a range of hyperparameter values, and W&B will automatically run multiple experiments with different combinations, helping to find the optimal set of hyperparameters for a given task.

* Collaboration and Reproducibility:
    Teams can use W&B to collaborate on machine learning projects. Experiment results are easily shareable and reproducible, ensuring that team members can understand and replicate each other's work.

* Model Registry:
    W&B includes a model registry that enables users to version and organize their trained models. This ensures that models can be tracked, compared, and deployed consistently.

* Integrations:
    W&B integrates seamlessly with various machine learning frameworks and libraries, including TensorFlow, PyTorch, Scikit-learn, and more. This makes it versatile and adaptable to different workflows.


For more information, refer to documentation: https://docs.wandb.ai/guides

In [34]:
# Import Libraries
import wandb
import joblib
import os

In [35]:
!wandb login <enter_your_api_key>

In [36]:
# wandb.login()

In [37]:
# constants for interacting with W&B

USER_NAME = "enter_your_username"

PROJECT_NAME = "enter_your_project_name"

### **Classification Evaluation Metrics**

Classification evaluation metrics are used to evaluate the performance of a machine learning model that is trained for classification tasks. Some of the commonly used classification evaluation metrics are F1 score, recall, precision, the confusion matrix, and the ROC AUC score. Here's an overview of each of these metrics:

**F1 score**: The F1 score combines the precision and recall of a model into a single value between 0 and 1, where 1 indicates perfect precision and recall. It is calculated as the harmonic mean of precision and recall:
$$ F1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}} $$
where precision is the number of true positives divided by the sum of true positives and false positives, and recall is the number of true positives divided by the sum of true positives and false negatives.

**Recall**: Recall (also known as sensitivity) is the number of true positives divided by the sum of true positives and false negatives, i.e., the share of actual positive cases that the model identifies. It is given by the following formula:
$$ Recall = \frac{TP}{TP + FN} $$
Recall is the metric to watch when the cost of false negatives (i.e., missing instances of a class) is high. For example, in a medical diagnosis problem, the cost of missing a positive case may be high, so recall would be a more appropriate metric.

**Precision**: Precision is another important classification evaluation metric, which is defined as the ratio of true positives to the total predicted positives. It measures the accuracy of positive predictions made by the classifier, i.e., the proportion of positive identifications that were actually correct.
The formula for precision is:
$$ precision = \frac{true\ positive}{true\ positive + false\ positive} $$
where true positive refers to the cases where the model correctly predicted the positive class, and false positive refers to the cases where the model incorrectly predicted the positive class.
Precision is useful when the cost of false positives is high, such as in medical diagnosis or fraud detection, where a false positive can have serious consequences. In such cases, a higher precision indicates that the model is better at identifying true positives and minimizing false positives.

**Confusion Matrix**:
A confusion matrix is a table that is often used to describe the performance of a classification model. It compares the predicted labels with the true labels and counts the number of true positives, false positives, true negatives, and false negatives. Here is an example of a confusion matrix:

|          | Actual Positive | Actual Negative |
|----------|----------------|----------------|
| Predicted Positive | True Positive (TP) | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN) |


**ROC AUC Score**:
ROC AUC (Receiver Operating Characteristic Area Under the Curve) score is a measure of how well a classifier is able to distinguish between positive and negative classes. It is calculated as the area under the ROC curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. TPR is the number of true positives divided by the sum of true positives and false negatives, and FPR is the number of false positives divided by the sum of false positives and true negatives.
$$ ROC\ AUC\ Score = \int_0^1 TPR(FPR^{-1}(t)) dt $$
where $FPR^{-1}$ is the inverse of the FPR function.
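
The following sketch ties the formulas above together on a tiny, made-up set of labels and predicted probabilities (the arrays are assumptions for illustration only). Note that scikit-learn's `confusion_matrix` places the actual classes on the rows and the predicted classes on the columns, which is the transpose of the table shown above. The sketch also shows that `roc_auc_score` gives a different value when it receives hard class labels instead of probabilities.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Toy data, invented for illustration
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.05])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

# scikit-learn layout: rows = actual class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean, as in the formula above

# AUC from probabilities uses the full ranking of the scores;
# AUC from hard labels collapses the ROC curve to a single threshold.
auc_from_probs = roc_auc_score(y_true, y_prob)
auc_from_labels = roc_auc_score(y_true, y_pred)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"AUC (probabilities)={auc_from_probs:.3f}  AUC (hard labels)={auc_from_labels:.3f}")
```

When a model exposes probabilities, passing them to `roc_auc_score` usually gives the more informative number.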

**When to use which**:

The choice of evaluation metric depends on the specific requirements of the business problem. Here are some general guidelines:

* F1 score: Use the F1 score when the class distribution is imbalanced, and when both precision and recall are equally important.

* Recall score: Use the recall score when the cost of false negatives (i.e., missing instances of a class) is high. For example, in a medical diagnosis problem, the cost of missing a positive case may be high, so recall would be a more appropriate metric.

* Precision: Use precision when the cost of false positives is high, such as in fraud detection, where a false positive can have serious consequences.

* Confusion matrix: The confusion matrix is a versatile tool that can be used to visualize the performance of a model across different classes. It can be useful for identifying specific areas of the model that need improvement.

* ROC AUC score: Use the ROC AUC score when the ability to distinguish between positive and negative classes is important. For example, in a credit scoring problem, the ability to distinguish between good and bad credit risks is crucial.

**Importance with respect to the business problem**:

The importance of each evaluation metric varies depending on the business problem. For example, in a spam detection problem, precision may be more important than recall, since false positives (i.e., classifying a non-spam email as spam) may annoy users, while false negatives (i.e., missing a spam email) may not be as harmful. On the other hand, in a disease diagnosis problem, recall may be more important than precision, since missing a positive case (i.e., a false negative) could have serious consequences. Therefore, it is important to choose the evaluation metric that is most relevant to the specific business problem at hand.



In [38]:
# Importing training libraries and evaluation metrics

from sklearn.metrics import f1_score, recall_score, confusion_matrix, roc_auc_score

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV

In [39]:
# Evaluation function
# Columns needed to compare metrics
comparison_columns = ['Model_Name', 'Train_F1score', 'Train_Recall', 'Valid_F1score', 'Valid_Recall', 'Test_F1score', 'Test_Recall']

comparison_df = pd.DataFrame()



def evaluate_models(model_name, model_defined_var, X_train, y_train, X_valid, y_valid, X_test, y_test):
  ''' This function predicts and evaluates various models for classification'''

  # train predictions
  y_train_pred = model_defined_var.predict(X_train)
  # train performance
  train_f1_score = f1_score(y_train, y_train_pred)
  train_recall = recall_score(y_train, y_train_pred)

  # validation predictions
  y_valid_pred = model_defined_var.predict(X_valid)
  # validation performance
  valid_f1_score = f1_score(y_valid, y_valid_pred)
  valid_recall = recall_score(y_valid, y_valid_pred)

  # test predictions
  y_pred = model_defined_var.predict(X_test)
  # test performance
  test_f1_score = f1_score(y_test, y_pred)
  test_recall = recall_score(y_test, y_pred)

  # Printing performance
  print("Train Results")
  print(f'F1 Score: {train_f1_score}')
  print(f'Recall Score: {train_recall}')
  print(f'Confusion Matrix: \n{confusion_matrix(y_train, y_train_pred)}')
  # Note: AUC is computed from hard class predictions here (and below), not from probabilities
  print(f'Area Under Curve: {roc_auc_score(y_train, y_train_pred)}')

  print(" ")

  print("Validation Results")
  print(f'F1 Score: {valid_f1_score}')
  print(f'Recall Score: {valid_recall}')
  print(f'Confusion Matrix: \n{confusion_matrix(y_valid, y_valid_pred)}')
  print(f'Area Under Curve: {roc_auc_score(y_valid, y_valid_pred)}')

  print(" ")

  print("Test Results")
  print(f'F1 Score: {test_f1_score}')
  print(f'Recall Score: {test_recall}')
  print(f'Confusion Matrix: \n{confusion_matrix(y_test, y_pred)}')
  print(f'Area Under Curve: {roc_auc_score(y_test, y_pred)}')

  # Saving our results
  global comparison_columns
  metric_scores = [model_name, train_f1_score, train_recall, valid_f1_score, valid_recall, test_f1_score, test_recall]
  final_dict = dict(zip(comparison_columns, metric_scores))
  return final_dict


final_list = []
def add_dic_to_final_df(final_dict):
  global final_list
  final_list.append(final_dict)
  global comparison_df
  comparison_df = pd.DataFrame(final_list, columns=comparison_columns)


### **Logistic Regression**



Logistic regression is a type of machine learning algorithm used for classification problems where we need to predict if something belongs to one category or another. For example, we can use it to predict if a customer will churn or not.

The algorithm works by analyzing the relationship between the input variables (such as customer demographics and usage patterns) and the binary output variable (such as churn or no churn). It then estimates the probability of the output variable using a logistic function, which outputs a value between 0 and 1.

Logistic regression is actually a type of classification algorithm, but it is called "logistic regression" because it uses a logistic function to model the probability of the binary output variable.

The term "regression" comes from the fact that the logistic regression model is based on a linear combination of the input variables and their associated weights, which is similar to linear regression. However, in linear regression, we predict a continuous output variable, while in logistic regression, we predict a probability of belonging to a particular class.

In other words, logistic regression is a regression algorithm that is used for classification problems. The logistic function transforms the output of the regression equation into a probability value between 0 and 1, which can then be used to classify the input variable into one of two categories.

Let's see how!!

The logistic regression model is based on the logistic function, which maps any real-valued input to a value between 0 and 1. The logistic function is defined as follows:

\begin{equation}
sigmoid(z) = \frac{1}{1 + e^{-z}}
\end{equation}

where $z$ is a linear combination of the input variables and their associated weights. In other words, we calculate $z$ as follows:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

where $\beta_0$ is the intercept term and $\beta_1$, $\beta_2$, and $\beta_3$ are the weights associated with the input variables $x_1$, $x_2$, and $x_3$, respectively.

The logistic regression model then predicts the probability of the binary outcome (in our example, whether a customer will churn or not) as follows:

\begin{equation}
P(y=1|x) = sigmoid(z)
\end{equation}

where $y$ is the binary outcome, $x$ is the input variable vector, and $sigmoid(z)$ is the logistic function.

To train the logistic regression model, we use a dataset of labeled examples. Each example includes a set of input variables and the corresponding binary outcome. The model is trained by minimizing the cross-entropy loss function, which is defined as follows:

\begin{equation}
L(y,\hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
\end{equation}

where $y$ is the binary outcome, $\hat{y}$ is the predicted probability, $N$ is the number of examples, and $\log$ is the natural logarithm.

To minimize the cross-entropy loss function, we use an optimization algorithm such as gradient descent. The gradient of the loss function with respect to each weight is computed using the chain rule of differentiation:

\begin{equation}
\frac{\partial L}{\partial \beta_j} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) x_{ij}
\end{equation}

where $x_{ij}$ is the $j$th input variable of the $i$th example.
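
The sigmoid, the cross-entropy loss, and its gradient can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the solver scikit-learn uses. The data, the "true" weights, the learning rate, and the number of iterations are all assumptions chosen for the example. A column of ones is prepended so that the intercept is handled like any other weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat):
    # Clip to avoid log(0) for extreme probabilities
    y_hat = np.clip(y_hat, 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient(X, y, beta):
    # (1/N) * sum_i (y_hat_i - y_i) * x_ij, computed for every j at once
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

# Synthetic data (assumption): 200 examples, 3 features
rng = np.random.default_rng(0)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 3))])  # first column = intercept
true_beta = np.array([-0.5, 1.0, -2.0, 0.5])
y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)

# Plain batch gradient descent starting from all-zero weights
beta = np.zeros(X.shape[1])
learning_rate = 0.5
loss_history = []
for _ in range(500):
    loss_history.append(cross_entropy(y, sigmoid(X @ beta)))
    beta -= learning_rate * gradient(X, y, beta)

print(f"initial loss: {loss_history[0]:.4f}, final loss: {loss_history[-1]:.4f}")
print("estimated weights:", np.round(beta, 2))
```

With all-zero weights every predicted probability is 0.5, so the starting loss equals log 2; each update then moves the weights against the gradient.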

#### **Example**
Suppose we have a dataset with three input variables (customer age, monthly bill amount, and number of customer service calls) and a binary output variable (1 for churn and 0 for no churn).


We can use logistic regression to build a model that predicts the probability of churn based on these input variables. The logistic function that we use is:

$$P(y=1|x) = \frac{1}{1+e^{-z}}$$

where $y$ is the output variable (churn), $x$ is the input variable (age, monthly bill amount, and number of customer service calls), and $z$ is the linear combination of the input variables and their associated weights:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$$

where $\beta_0$ is the intercept term and $\beta_1$, $\beta_2$, and $\beta_3$ are the weights associated with the input variables $x_1$, $x_2$, and $x_3$, respectively.

To train the model, we start with some initial values for the weights and use a training algorithm to adjust the weights iteratively until we minimize the error between the predicted probability and the actual output. The training algorithm typically uses a gradient descent approach to update the weights in the direction that minimizes the loss function.

Once the model is trained, we can use it to predict the probability of churn for a new customer. For example, suppose we want to predict the probability of churn for a customer who is 40 years old, has a monthly bill amount of 180, and has made 2 customer service calls. Using the logistic function and the weights learned during training, we can calculate the probability as:

$$P(y=1|x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}}$$

Let's say the weights learned during training are $\beta_0 = -2$, $\beta_1 = 0.05$, $\beta_2 = 0.01$, and $\beta_3 = 0.8$. Then we can plug in the values for the new customer and get:

$$P(y=1|x) = \frac{1}{1+e^{-(-2 + 0.05\times 40 + 0.01\times 180 + 0.8\times 2)}} = \frac{1}{1+e^{-3.4}} \approx 0.97$$

So, the model predicts that there is a 97% probability that this customer will churn. We can use this probability to classify the customer as churn or no churn, depending on a threshold that we set (e.g., if the probability is above 0.5, we classify the customer as churn).

This is a simple example, but it illustrates how logistic regression uses the logistic function and linear combination of input variables to predict the probability of a binary output variable.
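
The arithmetic in this example is easy to check by hand or in code. The short sketch below plugs the weights and customer values from the example into the logistic function using only the standard library. The variable names are chosen for illustration.

```python
import math

# Weights from the example: intercept, age, monthly bill, customer service calls
beta = [-2.0, 0.05, 0.01, 0.8]
# New customer: 40 years old, monthly bill of 180, 2 customer service calls
x = [40, 180, 2]

# Linear combination: z = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))

# Logistic function turns z into a probability
p = 1 / (1 + math.exp(-z))

# Classify with a 0.5 threshold
predicted_class = int(p >= 0.5)

print(f"z = {z:.2f}, P(churn) = {p:.4f}, predicted class = {predicted_class}")
```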

## **Class weights**

To handle class imbalances, logistic regression allows you to assign weights to the classes. These weights influence the model during training, and they are used to give more importance to the minority class. The goal is to ensure that the model doesn't overly favor the majority class and can still make accurate predictions for the minority class.


$$w_j = \frac{\text{n\_samples}}{\text{n\_classes} \times \text{n\_samples}_j}$$
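
As a quick sanity check of this formula, the sketch below computes the weights by hand for a made-up label vector (70 negatives and 30 positives, an assumption for illustration) and compares them with scikit-learn's `compute_class_weight` in `"balanced"` mode, which applies the same rule.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels (assumption): 70 of class 0, 30 of class 1
y = np.array([0] * 70 + [1] * 30)
classes = np.unique(y)
n_samples, n_classes = len(y), len(classes)

# Manual formula: n_samples / (n_classes * n_samples_j)
manual = {int(c): n_samples / (n_classes * np.sum(y == c)) for c in classes}

# scikit-learn's helper in "balanced" mode
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print("manual:  ", manual)
print("sklearn: ", dict(zip(classes.tolist(), balanced.tolist())))
```

The minority class receives the larger weight, so errors on it contribute more to the loss. Passing `class_weight="balanced"` to `LogisticRegression` is an alternative to building the dictionary by hand.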


In [None]:
y_train.value_counts().to_dict()[1]

In [41]:
weights = len(X_train)/(2*(y_train.value_counts().to_dict()[0])), len(X_train)/(2*(y_train.value_counts().to_dict()[1]))
weights

In [None]:
# Define model
log_reg = LogisticRegression(random_state=13, class_weight={0:weights[0], 1:weights[1]})
# fit it
log_reg.fit(X_train,y_train)

In [None]:
logistic_results = evaluate_models("Logistic Regression", log_reg, X_train, y_train, X_valid, y_valid, X_test, y_test)

add_dic_to_final_df(logistic_results)

#### **Logistic Regression - Experiment Tracking**

In [71]:
import joblib

w = {0: weights[0], 1: weights[1]}

def train_logistic_model(X_train=X_train, y_train=y_train, X_valid=X_valid, y_valid=y_valid, X_test=X_test, y_test=y_test):
    features = X_train.columns

    with wandb.init(project=PROJECT_NAME) as run:
        config = wandb.config
        params= {"random_state":13,
    "class_weight":w}

        model = LogisticRegression(**params)

        model.fit(X_train, y_train)
        
        # train predictions
        y_train_pred = model.predict(X_train)
        # train performance
        train_f1_score = f1_score(y_train, y_train_pred)


        # validation predictions
        y_valid_pred = model.predict(X_valid)
        # validation performance
        valid_f1_score = f1_score(y_valid, y_valid_pred)

        
        # test predictions
        y_preds = model.predict(X_test)
        y_probas = model.predict_proba(X_test)

        score = f1_score(y_test, y_preds)
        print(f"F1_score Train: {round(train_f1_score, 4)}")
        print(f"F1_score Valid: {round(valid_f1_score, 4)}")
        print(f"F1_score Test: {round(score, 4)}")


        wandb.log({"f1_score_train": train_f1_score})
        wandb.log({"f1_score_valid": valid_f1_score})
        wandb.log({"f1_score": score})

        wandb.sklearn.plot_classifier(model, X_train, X_test, y_train, y_test,
                                            y_preds, y_probas, labels= None, model_name='LogisticRegression', feature_names=features)

        model_artifact = wandb.Artifact(
                    "LogisticRegression", type="model",metadata=dict(config))

        joblib.dump(model, "log-truck-model.pkl")
        model_artifact.add_file("log-truck-model.pkl")
        wandb.save("log-truck-model.pkl")
        run.log_artifact(model_artifact)

In [None]:
train_logistic_model(X_train, y_train,X_valid, y_valid, X_test, y_test)

## **Decision Trees**

**Decision Trees in Classification**

Decision trees are a type of supervised learning algorithm that can be used for classification as well as regression problems. They are widely used in machine learning because they are easy to understand and interpret, and can handle both categorical and numerical data. The idea behind decision trees is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

The decision tree starts with a single node, called the root node, which represents the entire dataset. The root node is then split into several child nodes based on the value of a chosen feature. The process of selecting the best feature and splitting the nodes is repeated recursively for each child node until a stopping criterion is reached. This results in a tree-like structure that represents the decision rules learned from the data.

Each node in the decision tree represents a decision or a test of a feature value, and each branch represents the possible outcomes of that decision. The leaves of the tree represent the final decision or the class label assigned to the input data.

**Splitting Criteria**

To build a decision tree, we need a measure that determines how to split the data at each node. The splitting criterion is chosen based on the type of data and the nature of the problem. The most common splitting criteria are:

* Gini index: measures the impurity of a set of labels. It calculates the probability of misclassifying a randomly chosen element from the set, and is used to minimize misclassification errors.
* Information gain: measures the reduction in entropy (uncertainty) after a split. It is used to maximize the information gain in each split.
* Chi-square: measures the difference between observed and expected frequencies of the classes. It is used to minimize the deviation between the observed and expected class distribution.
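
To make the first two criteria concrete, here is a minimal NumPy sketch that computes the Gini index, the entropy, and the information gain of one candidate split. The label arrays are invented for illustration, and the helper names are not from any library.

```python
import numpy as np

def gini(labels):
    # 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # -sum p * log2(p) over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Parent entropy minus the size-weighted entropy of the children
    n = len(parent)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - children

# A balanced parent node split into two partially pure children (toy data)
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left = np.array([1, 1, 1, 0])
right = np.array([1, 0, 0, 0])

print(f"Gini(parent) = {gini(parent):.3f}, Gini(left) = {gini(left):.3f}")
print(f"Entropy(parent) = {entropy(parent):.3f}")
print(f"Information gain of the split = {information_gain(parent, left, right):.4f}")
```

A tree-building algorithm evaluates many candidate splits in this way and keeps the one with the lowest impurity or highest gain.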

**Overfitting in Decision Trees**

One of the main challenges in building decision trees is overfitting. Overfitting occurs when the tree is too complex and fits the training data too well, resulting in poor performance on new and unseen data. This can be addressed by pruning the tree or limiting its depth, or by using ensemble methods such as bagging and boosting.

**Ensemble Methods**

Ensemble methods are techniques that combine multiple models to improve performance and reduce overfitting. The two most common ensemble methods used with decision trees are:

* Bagging (Bootstrap Aggregating): involves training multiple decision trees on different subsets of the training data and then combining their predictions by averaging or voting. This reduces the variance and improves the stability of the model.
* Boosting: involves training multiple decision trees sequentially, where each subsequent tree focuses on the misclassified examples of the previous tree. This reduces the bias and improves the accuracy of the model.

Decision trees are powerful tools for classification problems that provide a clear and interpretable representation of the decision rules learned from the data. The choice of splitting criterion, stopping criterion, and ensemble method can have a significant impact on the performance and generalization of the model.


### **Bagging**



Bagging is an ensemble learning technique that aims to decrease the variance of a single estimator by combining the predictions from multiple learners. The basic idea behind bagging is to generate multiple versions of the training dataset through random sampling with replacement, and then train a separate classifier for each sampled dataset. The predictions from these individual classifiers are then combined using averaging or voting to obtain a final prediction.

**Algorithm:**

Suppose we have a training set D of size n, and we want to train a classifier using bagging. Here are the steps involved:

* Create k different bootstrap samples from D, each of size n.
* Train a classifier on each bootstrap sample.
* When making predictions on a new data point, take the average or majority vote of the predictions from each of the k classifiers.


**Mathematical Explanation:**

Suppose we have a binary classification problem with classes -1 and 1. Let's also assume that we have a training set D of size n, and we want to train a decision tree classifier using bagging.

**Bootstrap Sample**: For each of the k classifiers, we create a bootstrap sample of size n by sampling with replacement from D. This means that each bootstrap sample may contain duplicates of some instances and may also miss some instances from the original dataset. Let's denote the i-th bootstrap sample as D_i.

**Train a Classifier**: We train a decision tree classifier T_i on each bootstrap sample D_i. This gives us k classifiers T_1, T_2, ..., T_k.

**Combine Predictions**: To make a prediction on a new data point x, we take the majority vote of the predictions from each of the k classifiers.

The idea behind bagging is that the variance of the prediction error decreases as k increases. This is because each classifier has a chance to explore a different part of the feature space due to the random sampling with replacement, and the final prediction is a combination of these diverse classifiers.
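
The three steps can be sketched by hand with scikit-learn decision trees. This is a minimal illustration on synthetic data generated with `make_classification`. The dataset, the number of trees `k`, and the seeds are assumptions, and the classes are coded 0 and 1 rather than -1 and 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption for illustration)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
k = 25  # number of bootstrap samples / trees (odd, so majority votes cannot tie)
trees = []
for _ in range(k):
    # Step 1: bootstrap sample of size n, drawn with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: train one classifier per bootstrap sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Step 3: majority vote over the k classifiers
all_preds = np.array([tree.predict(X_te) for tree in trees])  # shape (k, n_test)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

single_acc = (trees[0].predict(X_te) == y_te).mean()
bagged_acc = (bagged_pred == y_te).mean()
print(f"single tree accuracy: {single_acc:.3f}, bagged accuracy: {bagged_acc:.3f}")
```

In practice, scikit-learn's `BaggingClassifier` wraps the same procedure, so a hand-written loop like this is mainly useful for understanding what happens under the hood.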






### **Random Forest**



Random Forest is an ensemble learning algorithm that builds a large number of decision trees and combines them to make a final prediction. It is a type of bagging method, where multiple decision trees are trained on random subsets of the training data and features. The algorithm then averages the predictions of these individual trees to produce a final prediction. Random Forest is particularly useful for handling high-dimensional data and for avoiding overfitting.

**Algorithm of Random Forest**

The algorithm of Random Forest can be summarized in the following steps:

* Start by randomly selecting a subset of the training data, with replacement. This subset is called the bootstrap sample.

* Next, randomly select a subset of features from the full feature set. In the standard formulation (and in scikit-learn's `RandomForestClassifier`), a fresh random subset is drawn at every split rather than once per tree.

* Build a decision tree using the bootstrap sample. At each node of the tree, select the best feature among the candidate subset and split the data based on the selected feature.

* Repeat the three steps above to build multiple trees.

* Finally, combine the predictions of all trees to make a final prediction. For classification, this is usually done by taking a majority vote of the predicted classes. For regression, this is usually done by taking the average of the predicted values.


**Mathematics Behind Random Forest**

The mathematics behind Random Forest involves the use of decision trees and the bootstrap sampling technique. Decision trees are constructed using a recursive binary partitioning algorithm that splits the data based on the values of the selected features. At each node, the algorithm chooses the feature and the split point that maximizes the information gain. Information gain measures the reduction in entropy or impurity of the target variable after the split. The goal is to minimize the impurity of the subsets after each split.

Bootstrap sampling is a statistical technique that involves randomly sampling the data with replacement to create multiple subsets. These subsets are used to train individual decision trees. By using bootstrap samples, the algorithm can generate multiple versions of the same dataset with slightly different distributions. This introduces randomness into the training process, which helps to reduce overfitting.



**Difference between Bagging and Random Forest**

Bagging and Random Forest are both ensemble learning algorithms that involve training multiple models on random subsets of the data. The main difference between the two is the way the individual models are trained.

Bagging involves training multiple models using the bootstrap sampling technique, but each model uses the same set of features. Because every tree can pick from the same features, the trees tend to make correlated predictions, which limits how much averaging can reduce the variance. Bagging also does not necessarily reduce the bias of the model.

Random Forest, on the other hand, involves training multiple models using the bootstrap sampling technique, but each model uses a randomly selected subset of features. This introduces additional randomness into the model and helps to reduce the correlation between individual predictions. Random Forest can achieve better performance than Bagging, especially when dealing with high-dimensional data or noisy features. In simpler terms it uses subsets of observations as well as features.
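
The difference can be seen side by side in scikit-learn. The sketch below fits a bagged ensemble of full decision trees and a random forest on the same synthetic data and reports cross-validated accuracy. The dataset and settings are assumptions for illustration, and which of the two scores higher will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with many uninformative features (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Bagging: bootstrap samples, every split may use all 20 features
bagging = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                            n_estimators=50, random_state=0)

# Random forest: bootstrap samples plus a random feature subset at each split
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)

bagging_score = cross_val_score(bagging, X, y, cv=5).mean()
forest_score = cross_val_score(forest, X, y, cv=5).mean()

print(f"bagging accuracy:       {bagging_score:.3f}")
print(f"random forest accuracy: {forest_score:.3f}")
```

The `max_features` parameter is the knob that controls how many features each split is allowed to consider.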








In [None]:
# define model
w = {0: weights[0], 1: weights[1]}
random_f = RandomForestClassifier(n_estimators=20, class_weight=w, random_state=7)
random_f.fit(X_train, y_train)

randomf_results = evaluate_models("Random Forest", random_f,X_train, y_train, X_valid, y_valid, X_test, y_test)
add_dic_to_final_df(randomf_results)

In [75]:


def train_random_forest(X_train=X_train, y_train=y_train, X_valid=X_valid, y_valid=y_valid, X_test=X_test, y_test=y_test):
  features = X_train.columns
  labels=["delay"]

  with wandb.init(project=PROJECT_NAME) as run:
      config = wandb.config

      model = RandomForestClassifier(n_estimators=20, class_weight=w, random_state=7)

      model.fit(X_train, y_train)
      # train predictions
      y_train_pred = model.predict(X_train)
      # train performance
      train_f1_score = f1_score(y_train, y_train_pred)


      # validation predictions
      y_valid_pred = model.predict(X_valid)
      # validation performance
      valid_f1_score = f1_score(y_valid, y_valid_pred)


      # test predictions
      y_preds = model.predict(X_test)
      y_probas = model.predict_proba(X_test)

      score = f1_score(y_test, y_preds)
      print(f"F1_score Train: {round(train_f1_score, 4)}")
      print(f"F1_score Valid: {round(valid_f1_score, 4)}")
      print(f"F1_score Test: {round(score, 4)}")


      wandb.log({"f1_score_train": train_f1_score})
      wandb.log({"f1_score_valid": valid_f1_score})
      wandb.log({"f1_score": score})



      wandb.sklearn.plot_classifier(model, X_train, X_test, y_train, y_test, y_preds, y_probas, labels=None,
                                                          model_name='RandomForestClassifier', feature_names=features)

      model_artifact = wandb.Artifact(
                  "RandomForestClassifier", type="model",metadata=dict(config))

      joblib.dump(model, "randomf-truck-model.pkl")
      model_artifact.add_file("randomf-truck-model.pkl")
      wandb.save("randomf-truck-model.pkl")
      run.log_artifact(model_artifact)

In [None]:
train_random_forest(X_train, y_train,X_valid, y_valid, X_test, y_test)

### **Gradient Boosting**

The primary idea behind this technique is to develop models sequentially, with each model attempting to reduce the mistakes of the previous one. The additive model, the loss function, and a weak learner are the three fundamental components of Gradient Boosting.

The method provides a direct interpretation of boosting in terms of numerical optimization of the loss function using gradient descent. We employ a Gradient Boosting Regressor when the target column is continuous, and a Gradient Boosting Classifier when the task is a classification problem. The loss function is the only difference between the two. The goal is to use gradient descent to reduce this loss function by adding weak learners. For regression problems, mean squared error (MSE) is typically used, and for classification problems, log loss (the negative log-likelihood).

### **XG Boost**


XGBoost is a variant of gradient boosting, which is a popular ensemble learning technique that works by iteratively adding new models to an ensemble, each model attempting to correct the errors made by the previous models. In each iteration, the algorithm calculates the negative gradient of the loss function with respect to the current prediction, and fits a new model to the residual errors. The new model is then added to the ensemble, and the algorithm repeats this process until the desired number of models is reached.

In XGBoost, the objective function is used to measure the difference between the predicted values and the true labels. The objective function is a sum of the loss function and the regularization term, where the latter prevents overfitting and encourages the model to be simple.



Suppose we have a dataset with three features, x1, x2, and x3, and we want to predict a binary outcome, y. We decide to use decision trees as our weak learners. We start by training a decision tree on the entire dataset. However, this decision tree may not be able to capture the complex relationships between the features and the outcome, and it may be overfitting the training data.

To improve upon the first decision tree, we can use XGBoost. Here's how:

* Initialize the model: We start by initializing the XGBoost model with default hyperparameters. At this point the model is just a constant initial prediction (the base score) for every row; no tree has been fitted yet.

* Make predictions: We use this model to make predictions on the training data. We compare these predictions to the true labels and calculate the residuals, which are the differences between the predicted values and the true labels.

* Fit a new tree: We then fit a new decision tree to the residuals. This tree will be a weak learner, as it is only modeling the errors of the previous model.

* Combine the models: We add the new tree to the previous model to create a new ensemble. This new ensemble consists of the previous model plus the new tree.

* Repeat: We repeat the previous three steps (make predictions, fit a new tree, combine the models) for a specified number of iterations, adding a new tree to the ensemble each time.

* Predictions: To make predictions on new data, we combine the predictions of all the trees in the ensemble.

The key idea behind XGBoost is that it improves upon the predictions of the weak learners by focusing on the errors the current ensemble still makes. By fitting a new tree to the residuals, XGBoost can correct the errors of the previous model and improve its overall accuracy. Additionally, XGBoost uses regularization to prevent overfitting and to improve generalization performance.
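
The loop described in the steps above can be sketched with scikit-learn regression trees. This is plain gradient boosting with log loss, written for illustration; it is not XGBoost's actual algorithm, which additionally uses second-order gradient information and a regularized objective. The synthetic data, learning rate, tree depth, and number of rounds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic binary data (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Initialize: constant prediction equal to the log-odds of the positive class
base_score = np.log(y.mean() / (1 - y.mean()))
raw = np.full(len(y), base_score)  # raw scores before the sigmoid

learning_rate = 0.3
trees, losses = [], []
for _ in range(20):
    p = sigmoid(raw)
    losses.append(log_loss(y, p))
    residuals = y - p  # negative gradient of log loss w.r.t. the raw score
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    raw += learning_rate * tree.predict(X)  # add the new weak learner to the ensemble
    trees.append(tree)

def predict_proba(X_new):
    # Combine the base score with the scaled contributions of all trees
    score = base_score + learning_rate * sum(t.predict(X_new) for t in trees)
    return sigmoid(score)

train_acc = ((predict_proba(X) >= 0.5).astype(int) == y).mean()
print(f"log loss: {losses[0]:.4f} -> {log_loss(y, sigmoid(raw)):.4f}, train accuracy: {train_acc:.3f}")
```

Each round fits a shallow tree to what the current ensemble still gets wrong, and the learning rate scales how much of that correction is applied.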

### **XG Boost Training**

XGBoost is a popular gradient boosting library used for building supervised machine learning models. It is designed to be efficient, scalable, and flexible. XGBoost provides an efficient implementation of gradient boosting algorithms and is widely used in various machine learning competitions.

In XGBoost, the `DMatrix` is a data structure that stores the input data and provides efficient access to it during model training. The `DMatrix` is essentially a wrapper around the input data, which is typically stored as a two-dimensional NumPy array or a Pandas DataFrame. The `DMatrix` adds some additional features to the input data that make it easier to use with XGBoost.

The `DMatrix` provides a few benefits over using the raw input data directly. First, it allows for efficient access to the input data during model training, which is critical for large datasets. Second, it provides some additional functionality, such as the ability to handle missing values and to attach labels and per-row weights to the data. Finally, it simplifies the process of passing data to the XGBoost model during training.

In [45]:
# import xgboost
import xgboost as xgb

# Convert training, validation, and test sets to DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)
dtest = xgb.DMatrix(X_test, label=y_test)

# Train initial model
# 'multi:softmax' with num_class=2 makes predict() return class labels (0 or 1) directly
params = {'objective': 'multi:softmax', 'num_class': 2, 'seed': 7}
num_rounds = 30
# Stop early if the validation mlogloss has not improved for 10 consecutive rounds
xgbmodel = xgb.train(params, dtrain, num_rounds, evals=[(dvalid, 'validation')], early_stopping_rounds=10)

xgb_results = evaluate_models("XGB", xgbmodel, dtrain, y_train, dvalid, y_valid, dtest, y_test)
add_dic_to_final_df(xgb_results)

[0]	validation-mlogloss:0.61153
[1]	validation-mlogloss:0.56986
[2]	validation-mlogloss:0.54929
[3]	validation-mlogloss:0.53773
[4]	validation-mlogloss:0.53350
[5]	validation-mlogloss:0.52825
[6]	validation-mlogloss:0.52923
[7]	validation-mlogloss:0.52953
[8]	validation-mlogloss:0.52742
[9]	validation-mlogloss:0.52832
[10]	validation-mlogloss:0.52875
[11]	validation-mlogloss:0.52921
[12]	validation-mlogloss:0.53262
[13]	validation-mlogloss:0.53315
[14]	validation-mlogloss:0.53715
[15]	validation-mlogloss:0.53819
[16]	validation-mlogloss:0.53824
[17]	validation-mlogloss:0.53847
[18]	validation-mlogloss:0.53955
Train Results
F1 Score: 0.7202122527737578
Recall Score: 0.5912871287128713
Confusion Matrix: 
[[5145  128]
 [1032 1493]]
Area Under Curve: 0.7835062611135
 
Validation Results
F1 Score: 0.593440122044241
Recall Score: 0.4832298136645963
Confusion Matrix: 
[[1302  117]
 [ 416  389]]
Area Under Curve: 0.7003886911874778
 
Test Results
F1 Score: 0.6998468606431854
Recall Score: 0.59
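
The log above ends at round 18 even though `num_rounds` is 30, because `early_stopping_rounds=10` was passed to `xgb.train`. The following sketch replays the logged validation values through a simplified version of that stopping rule. It is a hypothetical re-implementation for illustration (the function `early_stopping_point` is not part of XGBoost), assuming the rule is "stop once the metric has failed to improve for `patience` consecutive rounds".

```python
# Validation mlogloss values copied from the training log above (rounds 0-18).
validation_mlogloss = [
    0.61153, 0.56986, 0.54929, 0.53773, 0.53350, 0.52825, 0.52923,
    0.52953, 0.52742, 0.52832, 0.52875, 0.52921, 0.53262, 0.53315,
    0.53715, 0.53819, 0.53824, 0.53847, 0.53955,
]


def early_stopping_point(scores, patience):
    """Return (best_round, stop_round) for a metric that should decrease."""
    best_round = 0
    for current_round, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = current_round          # new best: reset the counter
        elif current_round - best_round >= patience:
            return best_round, current_round    # no improvement for `patience` rounds
    return best_round, len(scores) - 1          # never triggered


best_round, stop_round = early_stopping_point(validation_mlogloss, patience=10)
print(f"best round: {best_round}, stopped at round: {stop_round}")
```

The lowest validation loss (0.52742) occurs at round 8, and ten rounds later, at round 18, training stops. When early stopping is triggered, the trained booster also exposes the best round as `xgbmodel.best_iteration`.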

In [46]:
import joblib
import xgboost as xgb

def train_xgb_model(X_train=X_train, y_train=y_train, X_valid=X_valid, y_valid=y_valid, X_test=X_test, y_test=y_test):
  features = X_train.columns
  labels=["delay"]

  with wandb.init(project=PROJECT_NAME) as run:
      config = wandb.config


      # Convert training, validation, and test sets to DMatrix
      dtrain = xgb.DMatrix(X_train, label=y_train)
      dvalid = xgb.DMatrix(X_valid, label=y_valid)
      dtest = xgb.DMatrix(X_test, label=y_test)

      # Train initial model
      params = {'objective': 'multi:softmax', 'num_class': 2}
      num_rounds = 30
      xgbmodel = xgb.train(params, dtrain, num_rounds, evals=[(dvalid, 'validation')], early_stopping_rounds=10)
        
      # train predictions
      y_train_pred = xgbmodel.predict(dtrain)
      # train performance
      train_f1_score = f1_score(y_train, y_train_pred)


      # validation predictions
      y_valid_pred = xgbmodel.predict(dvalid)
      # validation performance
      valid_f1_score = f1_score(y_valid, y_valid_pred)


      # test predictions
      y_preds = xgbmodel.predict(dtest)
      score = f1_score(y_test, y_preds)
      print(f"F1_score Train: {round(train_f1_score, 4)}")
      print(f"F1_score Valid: {round(valid_f1_score, 4)}")
      print(f"F1_score Test: {round(score, 4)}")


      wandb.log({"f1_score_train": train_f1_score})
      wandb.log({"f1_score_valid": valid_f1_score})
      wandb.log({"f1_score": score})


      model_artifact = wandb.Artifact(
                  "XGBoost", type="model",metadata=dict(config))

      joblib.dump(xgbmodel, "xgb-truck-model.pkl")
      model_artifact.add_file("xgb-truck-model.pkl")
      wandb.save("xgb-truck-model.pkl")
      run.log_artifact(model_artifact)


In [None]:
train_xgb_model(X_train, y_train,X_valid, y_valid, X_test, y_test)

In [None]:
comparison_df

### **Model Building Summary**

Logistic Regression:

* The F1 score and recall on the training set are relatively low, indicating that the model might not fit the training data well.
* The F1 score and recall on the validation and test sets are high, but we should prefer consistent results across the three splits over a high score on any single one.

Random Forest:

* The Random Forest model has very high F1 scores and recall on the training set, indicating a potential overfitting issue.
* On the validation set, the F1 score is lower than on the training set, which is expected, but it's still relatively high.
* The model performs well on the test set with a better F1 score and recall.
* To address potential overfitting, we can experiment with reducing the complexity of the Random Forest model, for example by limiting the depth of the trees.

XGBoost:

* XGBoost shows good F1 and recall on the training set (F1 about 0.72, recall about 0.59), indicating a reasonable fit to the training data.
* On the validation set, the F1 score drops to about 0.59 (recall about 0.48), which is still reasonable and suggests decent generalization.
* The model performs well on the test set (F1 about 0.70, recall 0.59), close to its training scores, indicating good generalization to unseen data.
* XGBoost seems to be the most promising model, so we can use hyperparameter tuning to see if we can improve its performance further.
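
The F1 and recall figures quoted for XGBoost can be recomputed from the confusion matrices printed in the training output, which is a useful sanity check when reading such results. The sketch below assumes the matrices follow scikit-learn's layout (rows are true classes, columns are predicted classes, with class 1 meaning "delayed"); the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(confusion):
    """Compute metrics for the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)   # share of predicted delays that were real delays
    recall = tp / (tp + fn)      # share of real delays that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion matrices copied from the XGBoost training output.
train_cm = [[5145, 128], [1032, 1493]]
valid_cm = [[1302, 117], [416, 389]]

train_precision, train_recall, train_f1 = precision_recall_f1(train_cm)
valid_precision, valid_recall, valid_f1 = precision_recall_f1(valid_cm)

for name, (p, r, f) in {
    "train": (train_precision, train_recall, train_f1),
    "valid": (valid_precision, valid_recall, valid_f1),
}.items():
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

The recomputed recall and F1 match the logged values. The precision, which the log does not print, comes out at roughly 0.92 on the training set and 0.77 on the validation set, so the model raises few false alarms but misses a sizeable share of the actual delays.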

## **Hyperparameter Tuning**

### Hyperparameter Sweeps

In [41]:
import joblib
w = {0: weights[0], 1: weights[1]}
def train_rf_model(X_train=X_train, y_train=y_train, X_valid=X_valid,y_valid=y_valid, X_test=X_test, y_test=y_test):
    features = X_train.columns

    with wandb.init(
        project=PROJECT_NAME ) as run:
        config = wandb.config

        model = RandomForestClassifier(
            n_estimators=config["n_estimators"],
            max_depth=config["max_depth"],
            min_samples_split=config["min_samples_split"],
            random_state=7,
            class_weight=w
        )
        model.fit(X_train, y_train)
        
        # train predictions
        y_train_pred = model.predict(X_train)
        # train performance
        train_f1_score = f1_score(y_train, y_train_pred)
        

        # validation predictions
        y_valid_pred = model.predict(X_valid)
        # validation performance
        valid_f1_score = f1_score(y_valid, y_valid_pred)
      

        y_preds = model.predict(X_test)
        y_probas = model.predict_proba(X_test)

        score = f1_score(y_test, y_preds)
        print(f"F1_score Train: {round(train_f1_score, 4)}")
        print(f"F1_score Valid: {round(valid_f1_score, 4)}")
        print(f"F1_score Test: {round(score, 4)}")


        wandb.log({"f1_score_train": train_f1_score})
        wandb.log({"f1_score_valid": valid_f1_score})
        wandb.log({"f1_score": score})

        wandb.sklearn.plot_classifier(model, X_train, X_test, y_train, y_test, y_preds, y_probas, labels=None,
                                                          model_name='RandomForestClassifier', feature_names=features)

        model_artifact = wandb.Artifact(
            "RandomForestClassifier", type="model",metadata=dict(config))
        joblib.dump(model, "random_f_tuned.pkl")
        model_artifact.add_file("random_f_tuned.pkl")
        wandb.save("random_f_tuned.pkl")
        run.log_artifact(model_artifact)

In [None]:
random_f.get_params()

In [None]:
sweep_configs = {
    "method": "grid",
    "metric": {
        "name": "f1_score",
        "goal": "maximize"
    },
    "parameters": {
        "n_estimators": {
            "values": [8, 12, 16,20]
        },
        "max_depth": {
            "values": [None, 5, 10, 15, 20]
        },
        "min_samples_split": {
            "values": [2, 4, 8, 12]
        }
    }
}
# Then we initialize the sweep and run the sweep agent.

sweep_id = wandb.sweep(
    sweep=sweep_configs,
    project=PROJECT_NAME
)



wandb.agent(
    project=PROJECT_NAME,
    sweep_id=sweep_id,
    function=train_rf_model
)
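
With `"method": "grid"`, the sweep tries every combination of the listed values, so it helps to count the combinations before starting the agent. The sketch below enumerates the same grid locally using only the standard library; it does not call Weights and Biases, and the variable names are illustrative.

```python
from itertools import product

# Same value lists as in sweep_configs["parameters"] above.
parameters = {
    "n_estimators": [8, 12, 16, 20],
    "max_depth": [None, 5, 10, 15, 20],
    "min_samples_split": [2, 4, 8, 12],
}

# Cartesian product of all value lists: one dict per run the grid will launch.
grid = [dict(zip(parameters, values)) for values in product(*parameters.values())]

print(f"number of runs: {len(grid)}")
print(f"first configuration: {grid[0]}")
```

This grid contains 4 x 5 x 4 = 80 configurations, so the agent trains 80 random forests. Note also that the metric the sweep maximizes, `f1_score`, is the value `train_rf_model` logs for the test set; pointing the sweep at `f1_score_valid` instead would keep the test set untouched during tuning.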

### **Try it out-I: Hyperparameter tuning with XGBoost**

Perform hyperparameter tuning on the XGBoost model using a grid search approach. You can use either the GridSearchCV class from the sklearn.model_selection module or a Weights and Biases sweep to search over a range of hyperparameters.

Some hyperparameters you may want to consider tuning include:

* max_depth: the maximum depth of each tree in the ensemble
* learning_rate: the learning rate for gradient boosting
* n_estimators: the number of trees in the ensemble
* min_child_weight: the minimum weight required in a child node to continue splitting
* subsample: the fraction of samples used for each tree
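* colsample_bytree: the fraction of features used for each tree

You can define a dictionary of hyperparameter values to search over, and then pass it to the param_grid parameter of GridSearchCV. GridSearchCV expects a scikit-learn style estimator, so use xgboost.XGBClassifier rather than xgb.train for this approach.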

**Evaluate the tuned model**
Evaluate the performance of the tuned model on the testing data using the sklearn.metrics module.

**Bonus (optional)**
Try using a different hyperparameter optimization technique, such as random search or Bayesian optimization, to see if you can improve the performance of the XGBoost model even further.
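
To make the difference between grid search and random search concrete, the sketch below generates candidate configurations both ways. It is a standalone illustration: the search space values are hypothetical examples rather than recommended settings, and the helper functions are not part of scikit-learn or Weights and Biases. No models are trained here; only the candidate lists are built.

```python
import random
from itertools import product

# Hypothetical search space for illustration only.
search_space = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}


def grid_candidates(space):
    """Grid search: every combination of the listed values."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]


def random_candidates(space, n_iter, seed=7):
    """Random search: n_iter configurations, each value drawn independently."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]


grid = grid_candidates(search_space)
sampled = random_candidates(search_space, n_iter=5)

print(f"grid search evaluates {len(grid)} configurations")
print(f"random search evaluates {len(sampled)} configurations")
```

Grid search cost grows multiplicatively with every added hyperparameter, while random search keeps the budget fixed at `n_iter` and simply covers the space less densely.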


### **Try It Out-II: Explore Feature Selection for Interpretability**

Dive into feature selection techniques to improve the interpretability and explainability of your machine learning model. Start by identifying all relevant features in your dataset and categorize them based on their importance. Research and implement feature importance techniques such as permutation importance or SHAP values. Visualize the results and understand how each feature contributes to the model's predictions. Engage with business stakeholders to discuss the importance of interpretability and present your findings, highlighting the impact of each feature on decision-making. Evaluate the trade-offs between model accuracy and interpretability, and gather feedback on stakeholders' comfort levels. The goal is to ensure that the chosen features align with business priorities and empower stakeholders to make informed decisions based on the model's insights.

## **Conclusion**



In this part, we went deeper into solving a key challenge faced by the logistics industry: delayed truck shipments. We focused on improving our ability to predict these delays accurately. Here's a simple breakdown of what we did:


* Used the Hopsworks feature store to efficiently retrieve data.

* Split data into training, validation, and test sets so that we could check how well our models perform on new data. Handled missing values to keep our predictions accurate.

* Built and experimented with logistic regression, random forest, and XGBoost models to find the best performers.

* Used hyperparameter tuning techniques to fine-tune our models for better accuracy.

* Created a Streamlit application for easy interaction with our models, making insights accessible.

As we wrap up this part, we've covered a lot! The next step, in Part 3, will be automating our processes, building a CI/CD pipeline, and making our models ready for real-world use. We hope you're as excited as we are for the next part.

## **Interview Questions**




* How did you handle null values in data during data preprocessing?
* Explain the concept of train-validation-test split and its importance.
* Why is feature scaling necessary in machine learning, and what are the different techniques you used?

* Walk through the process of building logistic regression, random forest, and XGBoost models.
* What metrics would you consider for evaluating model performance, and why?
* Explain the importance of experiment tracking in machine learning projects.
* How did you set up a new project and connect it to Python using Weights and Biases?

* What is the purpose of hyperparameter tuning, and how does it improve model performance?
* Describe the difference between grid search and random search for hyperparameter tuning.

* How did you develop the Streamlit application, and what role does it play in this project?
* Discuss the steps involved in deploying the Streamlit application on AWS EC2.
