# Project notebook 1

The following notebook is an excerpt and re-written example from a _real_ production model.

The overall purpose of the ML algorithm is to identify website users who are potential new customers. This is done by collecting behaviour data from the users as input, and the target is whether they converted/turned into customers -- essentially a classification problem.

This notebook only focuses on the data processing part. As you know, there are multiple steps in an ML pipeline, and they are not always as neatly separated as they are here. For the exam project, they will not be, and that is part of the challenge for you. Production code should also not live in Python notebooks since, as you may well see, notebooks are difficult to work with and collaborate on in an automated way.

There is a lot of "fluff" in such a notebook. This ranges from comments and markdown cells to commented out code and random print statements. That is not necessary in a properly managed project where you can use git to check the version history and such. 

What is important for you is to identify the entry points into the code and segment them out into easily understandable chunks. Additionally, you might want to follow some basic code standards, such as:

- Import libraries only at the beginning of the files
- Define functions at the top of the scripts or, if they are used in multiple places, move them into a util.py script or similar
- Remove unused/commented out code
- Follow the [PEP 8](https://peps.python.org/pep-0008/) style guide (and others)
  
Another thing to note is that comments can be misleading. Even if a markdown cell or inline comment says the code does _X_, don't be surprised if it actually does _Y_. Additional text can be a blessing, but it can also be a curse. Remember, though, that your task is to make sure the code runs as before after refactoring the notebook into other files, not to update or improve the model or flow to reflect what the comments might say.

***

# DATA PROCESSING

In this section, we will perform Exploratory Data Analysis (EDA) to better understand the dataset before proceeding with more advanced analysis. EDA helps us get a sense of the data’s structure, identify patterns, and spot any potential issues like missing values or outliers. By doing so, we can gain a clearer understanding of the data's key characteristics.

We will start with summary statistics to review basic measures like mean, median, and variance, providing an initial overview of the data distribution. Then, we’ll create visualizations such as histograms, box plots, and scatter plots to explore relationships between variables, check for any skewness, and highlight outliers.

The purpose of this EDA is to ensure that the dataset is clean and well-structured for further analysis. This step also helps us identify any necessary data transformations and informs decisions on which features might be most relevant for modeling in later stages.

# Create artifact directory
We create a directory for storing all artifacts in the current working directory; any existing `artifacts` directory is deleted first. The artifacts can be loaded later by data cleaning pipelines and for inference.

In [1]:
# dbutils.widgets.text("Training data max date", "2024-01-31")
# dbutils.widgets.text("Training data min date", "2024-01-01")
# max_date = dbutils.widgets.get("Training data max date")
# min_date = dbutils.widgets.get("Training data min date")

# testing
max_date = "2024-01-31"
min_date = "2024-01-01"

In [2]:
import os
import shutil
from pprint import pprint

shutil.rmtree("./artifacts", ignore_errors=True)
os.makedirs("artifacts", exist_ok=True)
print("Created artifacts directory")

Created artifacts directory


# Pandas dataframe print options

In [3]:
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.float_format',lambda x: "%.3f" % x)

# Helper functions

* **describe_numeric_col**: Calculates various descriptive stats for a numeric column in a dataframe.
* **impute_missing_values**: Imputes the mean/median for numeric columns or the mode for other types.

In [4]:
def describe_numeric_col(x):
    """
    Parameters:
        x (pd.Series): Pandas col to describe.
    Output:
        y (pd.Series): Pandas series with descriptive stats. 
    """
    return pd.Series(
        [x.count(), x.isnull().count(), x.mean(), x.min(), x.max()],
        index=["Count", "Missing", "Mean", "Min", "Max"]
    )

def impute_missing_values(x, method="mean"):
    """
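    Parameters:
        x (pd.Series): Pandas col to impute.
        method (str): Imputation for numeric columns, "mean" (default) or "median";
            any other value falls back to the median.
    Output:
        x (pd.Series): The column with missing values filled in.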
    """
    if (x.dtype == "float64") | (x.dtype == "int64"):
        x = x.fillna(x.mean()) if method=="mean" else x.fillna(x.median())
    else:
        x = x.fillna(x.mode()[0])
    return x
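
One detail of `describe_numeric_col` is worth checking when refactoring: `x.isnull().count()` counts every entry of the boolean mask, so the `Missing` statistic equals the series length, whereas `x.isnull().sum()` counts only the missing entries. The sketch below shows the difference on a small made-up series (`s` is illustrative and not part of the notebook).

```python
import numpy as np
import pandas as pd

# Illustrative series with two missing values (not data from the notebook).
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

mask_length = s.isnull().count()  # entries in the boolean mask, i.e. the series length
n_missing = s.isnull().sum()      # True values in the mask, i.e. the missing entries
n_present = s.count()             # non-missing entries

print(mask_length, n_missing, n_present)
```

This appears consistent with the `Missing` column showing 234.0 for every variable in the outlier summary further down, which matches the row count rather than a count of missing values.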

# Read data

We read the latest data from our data lake source. Here we load it locally after having pulled it from DVC.

In [5]:
print("Loading training data")

# data_path = "./data/raw"
# most_recent_date = [x.name[:-1] for x in dbutils.fs.ls(f"{data_path}")][-1]
# most_recent_file = [x.name for x in dbutils.fs.ls(f"{data_path}/{most_recent_date}")][-1]
# data = pd.read_csv(f"{data_path}/{most_recent_date}/{most_recent_file}")

# Testing
from data_generating_process import generate_data
data = generate_data(n_rows=1_000)
most_recent_date = "2024-01-31"
# Done testing 

print("Most recent date:", most_recent_date)
print("Total rows:", data.count())
display(data)

Loading training data
Most recent date: 2024-01-31
Total rows: lead_id                              1000
lead_indicator                       1000
date_part                            1000
is_active                            1000
marketing_consent                    1000
first_booking                        1000
existing_customer                    1000
last_seen                            1000
source                               1000
domain                               1000
country                              1000
visited_learn_more_before_booking    1000
visited_faq                          1000
purchases                            1000
time_spent                           1000
customer_group                       1000
onboarding                           1000
customer_code                        1000
n_visits                             1000
dtype: int64


Unnamed: 0,lead_id,lead_indicator,date_part,is_active,marketing_consent,first_booking,existing_customer,last_seen,source,domain,country,visited_learn_more_before_booking,visited_faq,purchases,time_spent,customer_group,onboarding,customer_code,n_visits
0,0,0,2024-01-15,1,false,2024-01-5,true,2024-01-19,li,.cn,DK,16,2,3,112.305,7,True,,1
1,1,,2024-01-12,0,true,2024-01-25,false,2024-01-10,signup,.com,DK,0,0,3,86.617,3,True,AKVIREJOJP,2
2,2,1,2024-01-30,0,true,2024-01-28,true,2024-01-30,li,.com,CN,6,8,2,115.328,7,False,DDHFWKYMSK,14
3,3,1,2024-01-16,1,false,2024-01-14,false,2024-01-31,organic,.cn,US,4,0,9,103.647,4,True,BCOXMYJEXK,13
4,4,0,2024-01-24,0,true,2024-01-5,true,2024-01-3,organic,.com,US,4,6,6,102.833,6,True,WNYTOMZCIH,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,0,2024-01-13,0,true,2024-01-29,true,2024-01-8,fb,.com,US,3,6,2,102.224,6,True,LPCBZGFAOL,3
996,996,1,2024-01-21,0,true,2024-01-30,false,2024-01-9,organic,.com,CN,1,7,5,101.606,3,True,MHIAEGRVHX,10
997,997,0,2024-01-23,1,true,2024-01-24,true,2024-01-13,li,.dk,US,9,5,7,99.818,5,False,ZJHLPLVVCR,2
998,998,,2024-01-11,1,true,2024-01-28,true,2024-01-9,signup,.cn,US,0,5,9,129.382,5,False,ENYTUPVFPT,1


In [6]:
import pandas as pd
import datetime
import json

if not max_date:
    max_date = pd.to_datetime(datetime.datetime.now().date()).date()
else:
    max_date = pd.to_datetime(max_date).date()

min_date = pd.to_datetime(min_date).date()

# Time limit data
data["date_part"] = pd.to_datetime(data["date_part"]).dt.date
data = data[(data["date_part"] >= min_date) & (data["date_part"] <= max_date)]

min_date = data["date_part"].min()
max_date = data["date_part"].max()
date_limits = {"min_date": str(min_date), "max_date": str(max_date)}
with open("./artifacts/date_limits.json", "w") as f:
    json.dump(date_limits, f)

In [7]:
import numpy as np
np.random.randint(1, 10, 1000) * data["lead_indicator"].replace({"": 0}).astype(int)


data["n_visits"] = (
    np.random.negative_binomial(1, 0.2, size=1000) 
    + np.random.randint(1, 10, 1000) * data["lead_indicator"].replace({"": 0}).astype(int)
)

# Feature selection

Not all columns are relevant for modelling.

In [8]:
data = data.drop(
    [
        "is_active", "marketing_consent", "first_booking", "existing_customer", "last_seen"
    ],
    axis=1
)

In [9]:
# Removing columns that will be added back after the EDA
data = data.drop(
    ["domain", "country", "visited_learn_more_before_booking", "visited_faq"],
    axis=1
)

# Data cleaning
* Remove rows with empty target variable
* Remove rows with other invalid column data

In [10]:
import numpy as np

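# Assign the result back instead of using inplace=True on a selected column,
# which may silently act on a temporary copy in recent pandas versions.
data["lead_indicator"] = data["lead_indicator"].replace("", np.nan)
data["lead_id"] = data["lead_id"].replace("", np.nan)
data["customer_code"] = data["customer_code"].replace("", np.nan)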

data = data.dropna(axis=0, subset=["lead_indicator"])
data = data.dropna(axis=0, subset=["lead_id"])

data = data[data.source == "signup"]
result=data.lead_indicator.value_counts(normalize = True)

print("Target value counter")
for val, n in zip(result.index, result):
    print(val, ": ", n)
data

Target value counter
0 :  0.5128205128205128
1 :  0.48717948717948717


Unnamed: 0,lead_id,lead_indicator,date_part,source,purchases,time_spent,customer_group,onboarding,customer_code,n_visits
5,5,0,2024-01-19,signup,6,110.066,5,True,OWABIUQIAH,0
7,7,1,2024-01-31,signup,5,109.542,6,False,SMVGZNZHCD,18
9,9,1,2024-01-17,signup,4,85.250,3,True,HVVITIQYAW,8
15,15,0,2024-01-08,signup,6,113.334,9,True,POMRFTXIRG,1
16,16,0,2024-01-25,signup,2,101.248,4,True,WXEKMCADSR,4
...,...,...,...,...,...,...,...,...,...,...
976,976,1,2024-01-25,signup,6,92.676,6,True,BGHMPVUJDY,6
980,980,0,2024-01-16,signup,4,101.000,9,True,,9
985,985,0,2024-01-15,signup,4,100.419,7,False,QFBTMOEXSF,0
989,989,1,2024-01-06,signup,3,108.524,6,True,VJJIZBNGQX,11


# Create categorical data columns

In [11]:
# Named cat_col_names to avoid shadowing the built-in vars()
cat_col_names = [
    "lead_id", "lead_indicator", "customer_group", "onboarding", "source", "customer_code"
]

for col in cat_col_names:
    data[col] = data[col].astype("object")
    print(f"Changed {col} to object type")

Changed lead_id to object type
Changed lead_indicator to object type
Changed customer_group to object type
Changed onboarding to object type
Changed source to object type
Changed customer_code to object type


# Separate categorical and continuous columns

In [12]:
cont_vars = data.loc[:, ((data.dtypes=="float64")|(data.dtypes=="int64"))]
cat_vars = data.loc[:, (data.dtypes=="object")]

print("\nContinuous columns: \n")
pprint(list(cont_vars.columns), indent=4)
print("\n Categorical columns: \n")
pprint(list(cat_vars.columns), indent=4)


Continuous columns: 

['purchases', 'time_spent', 'n_visits']

 Categorical columns: 

[   'lead_id',
    'lead_indicator',
    'date_part',
    'source',
    'customer_group',
    'onboarding',
    'customer_code']


# Outliers

Outliers are data points that differ significantly from the majority of observations in a dataset and can distort statistical analysis or model performance. One common way to identify them is the Z-score, which measures how many standard deviations a data point is from the mean. Data points whose absolute Z-score exceeds 2 (or sometimes 3) are typically considered outliers. In this notebook, values further than two standard deviations from the mean are clipped to that boundary rather than dropped, so the number of rows stays the same while the influence of extreme values is reduced.
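
The difference between dropping outlying rows and clipping them, as `clip` does in the cell that follows, can be sketched on a made-up series (`values` and the threshold `k` are illustrative, not taken from the notebook):

```python
import pandas as pd

# Illustrative data: nine ordinary values and one extreme value.
values = pd.Series([10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 11.5, 8.5, 10.0, 50.0])

k = 2  # number of standard deviations used as the threshold
lower = values.mean() - k * values.std()
upper = values.mean() + k * values.std()

# Clipping keeps every row and caps values at the boundaries.
clipped = values.clip(lower=lower, upper=upper)

# Filtering drops the rows that fall outside the boundaries.
filtered = values[(values >= lower) & (values <= upper)]

print(len(clipped), len(filtered))
```

Clipping preserves the row count, which matters here because the continuous and categorical columns are later recombined by position.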

In [13]:
cont_vars = cont_vars.apply(lambda x: x.clip(lower = (x.mean()-2*x.std()),
                                             upper = (x.mean()+2*x.std())))
outlier_summary = cont_vars.apply(describe_numeric_col).T
outlier_summary.to_csv('./artifacts/outlier_summary.csv')
outlier_summary

Unnamed: 0,Count,Missing,Mean,Min,Max
purchases,234.0,234.0,4.844,0.307,9.471
time_spent,234.0,234.0,99.206,79.767,118.665
n_visits,234.0,234.0,6.228,0.0,16.717


# Impute data

In real-world datasets, missing data is a common occurrence due to various factors such as human error, incomplete data collection processes, or system failures. These gaps in the data can hinder analysis and lead to biased results if not properly addressed. Since many analytical and machine learning algorithms require complete data, handling missing values is an essential step in the data preprocessing phase.

In the next code block, we will handle missing data by performing imputation. For numerical columns, we will replace missing values with the mean or median of the entire column, which provides a reasonable estimate based on the existing data. For categorical columns (object type), we will use the mode, or most frequent value, to fill in missing entries. This approach helps us maintain a complete dataset while ensuring that the imputed values align with the general distribution of each column.
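
As a rough illustration of the imputation rule described above, the sketch below fills a numeric column with its mean and an object column with its mode. The frame `toy` and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column.
toy = pd.DataFrame({
    "time_spent": [100.0, np.nan, 110.0, 90.0],
    "source": ["signup", "fb", None, "signup"],
})

# Numeric column: replace missing values with the column mean.
toy["time_spent"] = toy["time_spent"].fillna(toy["time_spent"].mean())

# Categorical column: replace missing values with the most frequent value.
toy["source"] = toy["source"].fillna(toy["source"].mode()[0])

print(toy)
```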

In [14]:
cat_missing_impute = cat_vars.mode(numeric_only=False, dropna=True)
cat_missing_impute.to_csv("./artifacts/cat_missing_impute.csv")
cat_missing_impute

Unnamed: 0,lead_id,lead_indicator,date_part,source,customer_group,onboarding,customer_code
0,5,0,2024-01-15,signup,3,False,ADQRSDFHYO
1,7,,,,,,ADRNXKZIMZ
2,9,,,,,,AFVQPMACXF
3,15,,,,,,ALDOYPOHSO
4,16,,,,,,APRMLRJNTY
...,...,...,...,...,...,...,...
229,976,,,,,,
230,980,,,,,,
231,985,,,,,,
232,989,,,,,,


In [15]:
# Continuous variables missing values
cont_vars = cont_vars.apply(impute_missing_values)
cont_vars.apply(describe_numeric_col).T

Unnamed: 0,Count,Missing,Mean,Min,Max
purchases,234.0,234.0,4.844,0.307,9.471
time_spent,234.0,234.0,99.206,79.767,118.665
n_visits,234.0,234.0,6.228,0.0,16.717


In [16]:
cat_vars.loc[cat_vars['customer_code'].isna(),'customer_code'] = 'None'
cat_vars = cat_vars.apply(impute_missing_values)
cat_vars.apply(lambda x: pd.Series([x.count(), x.isnull().sum()], index = ['Count', 'Missing'])).T
cat_vars

Unnamed: 0,lead_id,lead_indicator,date_part,source,customer_group,onboarding,customer_code
5,5,0,2024-01-19,signup,5,True,OWABIUQIAH
7,7,1,2024-01-31,signup,6,False,SMVGZNZHCD
9,9,1,2024-01-17,signup,3,True,HVVITIQYAW
15,15,0,2024-01-08,signup,9,True,POMRFTXIRG
16,16,0,2024-01-25,signup,4,True,WXEKMCADSR
...,...,...,...,...,...,...,...
976,976,1,2024-01-25,signup,6,True,BGHMPVUJDY
980,980,0,2024-01-16,signup,9,True,
985,985,0,2024-01-15,signup,7,False,QFBTMOEXSF
989,989,1,2024-01-06,signup,6,True,VJJIZBNGQX


# Data standardisation

Standardization, or scaling, becomes necessary when continuous independent variables are measured on different scales, as this can lead to unequal contributions to the analysis. The objective is to rescale these variables so they have comparable ranges and/or variances, ensuring a more balanced influence in the model.
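
The cell that follows uses scikit-learn's `MinMaxScaler`, which rescales each column to the range [0, 1] via (x - min) / (max - min) rather than standardizing to zero mean and unit variance. A small sketch of the same arithmetic in plain pandas, on made-up numbers:

```python
import pandas as pd

# Illustrative columns on very different scales.
toy = pd.DataFrame({
    "purchases": [0.0, 5.0, 10.0],
    "time_spent": [80.0, 100.0, 120.0],
})

# Min-max scaling: subtract the column minimum, divide by the column range.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

Fitting the scaler once and saving it, as the notebook does with `joblib.dump`, means the same minimum and range can be reapplied to new data later.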

In [17]:
from sklearn.preprocessing import MinMaxScaler
import joblib

scaler_path = "./artifacts/scaler.pkl"

scaler = MinMaxScaler()
scaler.fit(cont_vars)

joblib.dump(value=scaler, filename=scaler_path)
print("Saved scaler in artifacts")

cont_vars = pd.DataFrame(scaler.transform(cont_vars), columns=cont_vars.columns)
cont_vars

Saved scaler in artifacts


Unnamed: 0,purchases,time_spent,n_visits
0,0.621,0.779,0.000
1,0.512,0.765,1.000
2,0.403,0.141,0.479
3,0.621,0.863,0.060
4,0.185,0.552,0.239
...,...,...,...
229,0.621,0.332,0.359
230,0.403,0.546,0.538
231,0.403,0.531,0.000
232,0.294,0.739,0.658


# Combine data

In [18]:
cont_vars = cont_vars.reset_index(drop=True)
cat_vars = cat_vars.reset_index(drop=True)
data = pd.concat([cat_vars, cont_vars], axis=1)
print(f"Data cleansed and combined.\nRows: {len(data)}")
data

Data cleansed and combined.
Rows: 234


Unnamed: 0,lead_id,lead_indicator,date_part,source,customer_group,onboarding,customer_code,purchases,time_spent,n_visits
0,5,0,2024-01-19,signup,5,True,OWABIUQIAH,0.621,0.779,0.000
1,7,1,2024-01-31,signup,6,False,SMVGZNZHCD,0.512,0.765,1.000
2,9,1,2024-01-17,signup,3,True,HVVITIQYAW,0.403,0.141,0.479
3,15,0,2024-01-08,signup,9,True,POMRFTXIRG,0.621,0.863,0.060
4,16,0,2024-01-25,signup,4,True,WXEKMCADSR,0.185,0.552,0.239
...,...,...,...,...,...,...,...,...,...,...
229,976,1,2024-01-25,signup,6,True,BGHMPVUJDY,0.621,0.332,0.359
230,980,0,2024-01-16,signup,9,True,,0.403,0.546,0.538
231,985,0,2024-01-15,signup,7,False,QFBTMOEXSF,0.403,0.531,0.000
232,989,1,2024-01-06,signup,6,True,VJJIZBNGQX,0.294,0.739,0.658


# Data drift artifact

In [19]:
import json

data_columns = list(data.columns)
with open('./artifacts/columns_drift.json','w+') as f:           
    json.dump(data_columns,f)
    
data.to_csv('./artifacts/training_data.csv', index=False)

# Binning object columns

In [20]:
data.columns

Index(['lead_id', 'lead_indicator', 'date_part', 'source', 'customer_group',
       'onboarding', 'customer_code', 'purchases', 'time_spent', 'n_visits'],
      dtype='object')

In [21]:
data['bin_source'] = data['source']
values_list = ['li', 'organic','signup','fb']
data.loc[~data['source'].isin(values_list),'bin_source'] = 'Others'
mapping = {'li' : 'socials', 
           'fb' : 'socials', 
           'organic': 'group1', 
           'signup': 'group1'
           }

data['bin_source'] = data['source'].map(mapping)
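
A point to be aware of in the cell above: the final line maps the original `source` column, so the `'Others'` label assigned two lines earlier is overwritten, and any value missing from `mapping` becomes `NaN`. The sketch below reproduces this on a made-up series. The `fillna("Others")` variant is only an assumption about the intent, not what the notebook does.

```python
import pandas as pd

# Illustrative series containing one value that is not in the mapping.
source = pd.Series(["li", "fb", "organic", "signup", "newsletter"])
mapping = {
    "li": "socials",
    "fb": "socials",
    "organic": "group1",
    "signup": "group1",
}

# Values without a key in the mapping become NaN.
mapped = source.map(mapping)

# Hypothetical alternative: keep an explicit fallback bucket.
mapped_with_fallback = source.map(mapping).fillna("Others")

print(mapped.tolist())
print(mapped_with_fallback.tolist())
```

In the notebook's run the data has already been filtered to `source == "signup"`, so every row ends up in `group1` either way.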

# Save gold medallion dataset

In [22]:
#spark.sql(f"drop table if exists train_gold")


In [23]:
# data_gold = spark.createDataFrame(data)
# data_gold.write.saveAsTable('train_gold')
# dbutils.notebook.exit(('training_golden_data',most_recent_date))

data.to_csv('./artifacts/train_data_gold.csv', index=False)

# MODEL TRAINING

Model training fits an ML algorithm to a training dataset, which contains known target values together with the input features that influence them.

In [24]:
import datetime

# Constants used:
current_date = datetime.datetime.now().strftime("%Y_%B_%d")
data_gold_path = "./artifacts/train_data_gold.csv"
data_version = "00000"
experiment_name = current_date

# Create paths

The artifacts directory may not have been created during data cleaning, so we make sure it exists here.

In [25]:
import os
import shutil

os.makedirs("artifacts", exist_ok=True)
os.makedirs("mlruns", exist_ok=True)
os.makedirs("mlruns/.trash", exist_ok=True)

In [26]:
import mlflow

mlflow.set_experiment(experiment_name)

2024/11/04 19:01:20 INFO mlflow.tracking.fluent: Experiment with name '2024_November_04' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///Users/jeppe.kristensen/Projects/itu-sdse-project/notebooks/mlruns/332768886302457088', creation_time=1730743280162, experiment_id='332768886302457088', last_update_time=1730743280162, lifecycle_stage='active', name='2024_November_04', tags={}>

# Helper functions

* **create_dummy_cols**: Creates one-hot encoded columns for a categorical column in the data and drops the original column.

In [27]:
def create_dummy_cols(df, col):
    df_dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
    new_df = pd.concat([df, df_dummies], axis=1)
    new_df = new_df.drop(col, axis=1)
    return new_df
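
To illustrate what `create_dummy_cols` produces, the sketch below applies the same `pd.get_dummies(..., drop_first=True)` call to a made-up frame (`toy` is illustrative). With `drop_first=True`, a column with k distinct levels yields k - 1 indicator columns, and a column with a single level yields none.

```python
import pandas as pd

# Illustrative frame: one two-level column and one single-level column.
toy = pd.DataFrame({
    "onboarding": [True, False, True],
    "source": ["signup", "signup", "signup"],
})

onboarding_dummies = pd.get_dummies(toy["onboarding"], prefix="onboarding", drop_first=True)
source_dummies = pd.get_dummies(toy["source"], prefix="source", drop_first=True)

print(list(onboarding_dummies.columns))
print(list(source_dummies.columns))
```

This matches the float-conversion output further down, which lists `onboarding_True` and `customer_group_*` columns but nothing for `source` or `bin_source`: after filtering to `source == "signup"`, each of those has a single level.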

# Load training data
We use the training data we cleaned earlier.

In [28]:
data = pd.read_csv(data_gold_path)
print(f"Training data length: {len(data)}")
data.head(5)

Training data length: 234


Unnamed: 0,lead_id,lead_indicator,date_part,source,customer_group,onboarding,customer_code,purchases,time_spent,n_visits,bin_source
0,5,0,2024-01-19,signup,5,True,OWABIUQIAH,0.621,0.779,0.0,group1
1,7,1,2024-01-31,signup,6,False,SMVGZNZHCD,0.512,0.765,1.0,group1
2,9,1,2024-01-17,signup,3,True,HVVITIQYAW,0.403,0.141,0.479,group1
3,15,0,2024-01-08,signup,9,True,POMRFTXIRG,0.621,0.863,0.06,group1
4,16,0,2024-01-25,signup,4,True,WXEKMCADSR,0.185,0.552,0.239,group1


# Data type split

In [29]:
data = data.drop(["lead_id", "customer_code", "date_part"], axis=1)

cat_cols = ["customer_group", "onboarding", "bin_source", "source"]
cat_vars = data[cat_cols]

other_vars = data.drop(cat_cols, axis=1)

# Dummy variable for categorical vars

1. Create one-hot encoded cols for cat vars
2. Change to floats

In [30]:
import pandas as pd

for col in cat_vars:
    cat_vars[col] = cat_vars[col].astype("category")
    cat_vars = create_dummy_cols(cat_vars, col)

data = pd.concat([other_vars, cat_vars], axis=1)

for col in data:
    data[col] = data[col].astype("float64")
    print(f"Changed column {col} to float")

Changed column lead_indicator to float
Changed column purchases to float
Changed column time_spent to float
Changed column n_visits to float
Changed column customer_group_2 to float
Changed column customer_group_3 to float
Changed column customer_group_4 to float
Changed column customer_group_5 to float
Changed column customer_group_6 to float
Changed column customer_group_7 to float
Changed column customer_group_8 to float
Changed column customer_group_9 to float
Changed column onboarding_True to float


# Splitting data

In [31]:
y = data["lead_indicator"]
X = data.drop(["lead_indicator"], axis=1)

In [32]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.15, stratify=y
)
y_train

18    1.000
159   0.000
57    1.000
208   0.000
96    1.000
       ... 
34    1.000
186   1.000
82    0.000
88    1.000
156   1.000
Name: lead_indicator, Length: 198, dtype: float64
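
The split above passes `stratify=y`, which keeps the class proportions of the target roughly equal in the train and test sets. A minimal sketch on made-up arrays (`X_toy` and `y_toy` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 20 rows with a perfectly balanced binary target.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits keep the 50/50 class balance of y_toy.
print(np.bincount(y_tr), np.bincount(y_te))
```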

# Model training

This stage fits the ML algorithm to the training data, which is where the learning takes place. Each model below is tuned with a randomized hyperparameter search using cross-validation on the training set, and the held-out test set is used only to check accuracy afterwards.

# XGBoost

In [33]:
from xgboost import XGBRFClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from scipy.stats import randint

model = XGBRFClassifier(random_state=42)
params = {
    "learning_rate": uniform(1e-2, 3e-1),
    "min_split_loss": uniform(0, 10),
    "max_depth": randint(3, 10),
    "subsample": uniform(0, 1),
    "objective": ["reg:squarederror", "binary:logistic", "reg:logistic"],
    "eval_metric": ["aucpr", "error"]
}

model_grid = RandomizedSearchCV(model, param_distributions=params, n_jobs=-1, verbose=3, n_iter=10, cv=10)

model_grid.fit(X_train, y_train)

Fitting 10 folds for each of 10 candidates, totalling 100 fits
[CV 1/10] END eval_metric=aucpr, learning_rate=0.023546743632280098, max_depth=4, min_split_loss=7.424695985083369, objective=reg:squarederror, subsample=0.1812757020731649;, score=0.500 total time=   0.0s
[CV 1/10] END eval_metric=error, learning_rate=0.21651079758012448, max_depth=7, min_split_loss=4.854042768107481, objective=reg:squarederror, subsample=0.8813627473571538;, score=0.700 total time=   0.0s
[CV 2/10] END eval_metric=error, learning_rate=0.21651079758012448, max_depth=7, min_split_loss=4.854042768107481, objective=reg:squarederror, subsample=0.8813627473571538;, score=0.650 total time=   0.0s
[CV 2/10] END eval_metric=aucpr, learning_rate=0.023546743632280098, max_depth=4, min_split_loss=7.424695985083369, objective=reg:squarederror, subsample=0.1812757020731649;, score=0.500 total time=   0.0s
[CV 3/10] END eval_metric=error, learning_rate=0.21651079758012448, max_depth=7, min_split_loss=4.854042768107481, 
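
A note on how the search space in `params` is read: in `scipy.stats`, `uniform(loc, scale)` is uniform on the interval from `loc` to `loc + scale`, and `randint(low, high)` draws integers from `low` up to but excluding `high`. The sketch below samples from two of the distributions used above to show the resulting ranges; the variable names and the seed are illustrative.

```python
from scipy.stats import randint, uniform

# Same distributions as in the search space above.
learning_rate_dist = uniform(1e-2, 3e-1)  # values between 0.01 and 0.31
max_depth_dist = randint(3, 10)           # integers 3 through 9

# Fixed seed only so that the sketch is repeatable.
lr_samples = learning_rate_dist.rvs(size=1000, random_state=0)
depth_samples = max_depth_dist.rvs(size=1000, random_state=0)

print(lr_samples.min(), lr_samples.max())
print(depth_samples.min(), depth_samples.max())
```

By the same reading, `uniform(0, 1)` for `subsample` can draw values very close to zero, meaning some candidates train on only a small fraction of the rows.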

# Model test accuracy

In [34]:
from sklearn.metrics import accuracy_score

best_model_xgboost_params = model_grid.best_params_
print("Best xgboost params")
pprint(best_model_xgboost_params)

y_pred_train = model_grid.predict(X_train)
y_pred_test = model_grid.predict(X_test)
print("Accuracy train", accuracy_score(y_train, y_pred_train))
print("Accuracy test", accuracy_score(y_test, y_pred_test))


Best xgboost params
{'eval_metric': 'error',
 'learning_rate': 0.3083014985393212,
 'max_depth': 7,
 'min_split_loss': 3.379729664212708,
 'objective': 'binary:logistic',
 'subsample': 0.9432023529011201}
Accuracy train 0.7828282828282829
Accuracy test 0.7222222222222222


# XGBoost performance overview
* Confusion matrix
* Classification report

In [35]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

conf_matrix = confusion_matrix(y_test, y_pred_test)
print("Test actual/predicted\n")
print(pd.crosstab(y_test, y_pred_test, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_test, y_pred_test),'\n')

conf_matrix = confusion_matrix(y_train, y_pred_train)
print("Train actual/predicted\n")
print(pd.crosstab(y_train, y_pred_train, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_train, y_pred_train),'\n')

Test actual/predicted

Predicted   0   1  All
Actual                
0.0        14   4   18
1.0         6  12   18
All        20  16   36 

Classification report

              precision    recall  f1-score   support

         0.0       0.70      0.78      0.74        18
         1.0       0.75      0.67      0.71        18

    accuracy                           0.72        36
   macro avg       0.72      0.72      0.72        36
weighted avg       0.73      0.72      0.72        36
 

Train actual/predicted

Predicted   0    1  All
Actual                 
0.0        76   26  102
1.0        17   79   96
All        93  105  198 

Classification report

              precision    recall  f1-score   support

         0.0       0.82      0.75      0.78       102
         1.0       0.75      0.82      0.79        96

    accuracy                           0.78       198
   macro avg       0.78      0.78      0.78       198
weighted avg       0.79      0.78      0.78       198
 



# Save best XGBoost model

In [36]:
xgboost_model = model_grid.best_estimator_
xgboost_model_path = "./artifacts/lead_model_xgboost.json"
xgboost_model.save_model(xgboost_model_path)

model_results = {
    xgboost_model_path: classification_report(y_train, y_pred_train, output_dict=True)
}

# SKLearn logistic regression

In [37]:
import mlflow.pyfunc

from sklearn.linear_model import LogisticRegression
import os
from sklearn.metrics import cohen_kappa_score, f1_score
import matplotlib.pyplot as plt
import joblib

class lr_wrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model
    
    def predict(self, context, model_input):
        return self.model.predict_proba(model_input)[:, 1]


mlflow.sklearn.autolog(log_input_examples=True, log_models=False)
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id

with mlflow.start_run(experiment_id=experiment_id) as run:
    model = LogisticRegression()
    lr_model_path = "./artifacts/lead_model_lr.pkl"

    params = {
              'solver': ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
              'penalty':  ["none", "l1", "l2", "elasticnet"],
              'C' : [100, 10, 1.0, 0.1, 0.01]
    }
    model_grid = RandomizedSearchCV(model, param_distributions= params, verbose=3, n_iter=10, cv=3)
    model_grid.fit(X_train, y_train)

    best_model = model_grid.best_estimator_

    y_pred_train = model_grid.predict(X_train)
    y_pred_test = model_grid.predict(X_test)


    # log artifacts
    mlflow.log_metric('f1_score', f1_score(y_test, y_pred_test))
    mlflow.log_artifacts("artifacts", artifact_path="model")
    mlflow.log_param("data_version", "00000")
    
    # store the fitted best estimator for model interpretability
    # (`model` itself stays unfitted because the search fits clones of it)
    joblib.dump(value=best_model, filename=lr_model_path)

    # Custom python model for predicting probability
    mlflow.pyfunc.log_model('model', python_model=lr_wrapper(best_model))


model_classification_report = classification_report(y_test, y_pred_test, output_dict=True)

best_model_lr_params = model_grid.best_params_

print("Best lr params")
pprint(best_model_lr_params)

print("Accuracy train:", accuracy_score(y_train, y_pred_train))
print("Accuracy test:", accuracy_score(y_test, y_pred_test))

conf_matrix = confusion_matrix(y_test, y_pred_test)
print("Test actual/predicted\n")
print(pd.crosstab(y_test, y_pred_test, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_test, y_pred_test),'\n')

conf_matrix = confusion_matrix(y_train, y_pred_train)
print("Train actual/predicted\n")
print(pd.crosstab(y_train, y_pred_train, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_train, y_pred_train),'\n')

model_results[lr_model_path] = model_classification_report
print(model_classification_report["weighted avg"]["f1-score"])

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[CV 1/3] END C=10, penalty=elasticnet, solver=newton-cg;, score=nan total time=   0.0s
[CV 2/3] END C=10, penalty=elasticnet, solver=newton-cg;, score=nan total time=   0.0s
[CV 3/3] END C=10, penalty=elasticnet, solver=newton-cg;, score=nan total time=   0.0s
[CV 1/3] END .....C=100, penalty=none, solver=sag;, score=nan total time=   0.0s
[CV 2/3] END .....C=100, penalty=none, solver=sag;, score=nan total time=   0.0s
[CV 3/3] END .....C=100, penalty=none, solver=sag;, score=nan total time=   0.0s
[CV 1/3] END .....C=1.0, penalty=none, solver=sag;, score=nan total time=   0.0s
[CV 2/3] END .....C=1.0, penalty=none, solver=sag;, score=nan total time=   0.0s
[CV 3/3] END .....C=1.0, penalty=none, solver=sag;, score=nan total time=   0.0s
[CV 1/3] END .....C=0.1, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 2/3] END .....C=0.1, penalty=l1, solver=lbfgs;, score=nan total time=   0.0s
[CV 3/3] END .....C=0.1, penal

2024/11/04 19:01:27 INFO mlflow.sklearn.utils: Logging the 5 best runs, 5 runs will be omitted.


Best lr params
{'C': 100, 'penalty': 'l2', 'solver': 'sag'}
Accuracy train: 0.7121212121212122
Accuracy test: 0.75
Test actual/predicted

Predicted  0.0  1.0  All
Actual                  
0.0         14    4   18
1.0          5   13   18
All         19   17   36 

Classification report

              precision    recall  f1-score   support

         0.0       0.74      0.78      0.76        18
         1.0       0.76      0.72      0.74        18

    accuracy                           0.75        36
   macro avg       0.75      0.75      0.75        36
weighted avg       0.75      0.75      0.75        36
 

Train actual/predicted

Predicted  0.0  1.0  All
Actual                  
0.0         75   27  102
1.0         30   66   96
All        105   93  198 

Classification report

              precision    recall  f1-score   support

         0.0       0.71      0.74      0.72       102
         1.0       0.71      0.69      0.70        96

    accuracy                           0.71  

In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score

import joblib

model = LogisticRegression()
params = {
            'solver': ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
            'penalty':  ["none", "l1", "l2", "elasticnet"],
            'C': [100, 10, 1.0, 0.1, 0.01]
}

model_grid = RandomizedSearchCV(model, param_distributions=params, n_jobs=-1, verbose=3, n_iter=10, cv=10)

model_grid.fit(X_train, y_train)

model = model_grid.best_estimator_

best_model_lr_params = model_grid.best_params_
print("Best lr params")
pprint(best_model_lr_params)

y_pred_train = model_grid.predict(X_train)
y_pred_test = model_grid.predict(X_test)
print("Accuracy train:", accuracy_score(y_train, y_pred_train))
print("Accuracy test:", accuracy_score(y_test, y_pred_test))

conf_matrix = confusion_matrix(y_test, y_pred_test)
print("Test actual/predicted\n")
print(pd.crosstab(y_test, y_pred_test, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_test, y_pred_test),'\n')

conf_matrix = confusion_matrix(y_train, y_pred_train)
print("Train actual/predicted\n")
print(pd.crosstab(y_train, y_pred_train, rownames=['Actual'], colnames=['Predicted'], margins=True),'\n')
print("Classification report\n")
print(classification_report(y_train, y_pred_train),'\n')

lr_model = model_grid.best_estimator_
lr_model_path = "./artifacts/lead_model_lr.pkl"
joblib.dump(value=lr_model, filename=lr_model_path)

model_results[lr_model_path] = classification_report(y_train, y_pred_train, output_dict=True)

2024/11/04 19:01:29 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '802301c3c238403b95b17bd13abdbc2f', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Fitting 10 folds for each of 10 candidates, totalling 100 fits
[CV 2/10] END C=10, penalty=elasticnet, solver=lbfgs;, score=nan total time=   0.0s
[CV 1/10] END C=10, penalty=elasticnet, solver=lbfgs;, score=nan total time=   0.0s
[CV 3/10] END C=10, penalty=elasticnet, solver=lbfgs;, score=nan total time=   0.0s
[CV 1/10] END ..C=100, penalty=l2, solver=lbfgs;, score=0.700 total time=   0.0s
[CV 5/10] END C=10, penalty=elasticnet, solver=lbfgs;, score=nan total time=   0.0s
[CV 6/10] END C=10, penalty=elasticnet, solver=lbfgs;, score=nan total time=   0.0s
[CV 2/10] END ..C=100, penalty=l2, solver=lbfgs;, score=0.800 total time=   0.0s
[CV 7/10] END C=10, penalty=elasticnet, solver=lbfgs;, score=nan total time=   0.0s
[CV 3/10] END ..C=100, penalty=l2, solver=lbfgs;, score=0.700 total time=   0.0s
[CV 4/10] END C=10, penalty=elasticnet, solver=lbfgs;, score=nan total time=   0.0s
[CV 4/10] END ..C=100, penalty=l2, solver=lbfgs;, score=0.450 total time=   0.0s
[CV 6/10] END ..C=100, pe

2024/11/04 19:01:30 INFO mlflow.sklearn.utils: Logging the 5 best runs, 5 runs will be omitted.


Best lr params
{'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
Accuracy train: 0.7121212121212122
Accuracy test: 0.75
Test actual/predicted

Predicted  0.0  1.0  All
Actual                  
0.0         14    4   18
1.0          5   13   18
All         19   17   36 

Classification report

              precision    recall  f1-score   support

         0.0       0.74      0.78      0.76        18
         1.0       0.76      0.72      0.74        18

    accuracy                           0.75        36
   macro avg       0.75      0.75      0.75        36
weighted avg       0.75      0.75      0.75        36
 

Train actual/predicted

Predicted  0.0  1.0  All
Actual                  
0.0         75   27  102
1.0         30   66   96
All        105   93  198 

Classification report

              precision    recall  f1-score   support

         0.0       0.71      0.74      0.72       102
         1.0       0.71      0.69      0.70        96

    accuracy                           0.71

# Save columns and model results

In [39]:
column_list_path = './artifacts/columns_list.json'
with open(column_list_path, 'w+') as columns_file:
    columns = {'column_names': list(X_train.columns)}
    pprint(columns)
    json.dump(columns, columns_file)

print('Saved column list to ', column_list_path)

model_results_path = "./artifacts/model_results.json"
with open(model_results_path, 'w+') as results_file:
    json.dump(model_results, results_file)

{'column_names': ['purchases',
                  'time_spent',
                  'n_visits',
                  'customer_group_2',
                  'customer_group_3',
                  'customer_group_4',
                  'customer_group_5',
                  'customer_group_6',
                  'customer_group_7',
                  'customer_group_8',
                  'customer_group_9',
                  'onboarding_True']}
Saved column list to  ./artifacts/columns_list.json
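
A saved column list like this is typically what lets a later scoring step rebuild the same feature layout. The following is a minimal sketch, assuming (this is not shown in the notebook) that new data arrives as plain dictionaries and that missing dummy columns should default to 0. `align_features` is a hypothetical helper, and the shortened column list and row values are made up for illustration.

```python
import json
import os
import tempfile


def align_features(row, column_names, fill_value=0):
    """Return the row's values in training-column order (hypothetical helper).

    Columns missing from the row get fill_value; extra keys are dropped.
    """
    return [row.get(name, fill_value) for name in column_names]


# Round-trip a column list through JSON, mirroring the file layout written above.
columns = {"column_names": ["purchases", "time_spent", "n_visits", "onboarding_True"]}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "columns_list.json")
    with open(path, "w") as columns_file:
        json.dump(columns, columns_file)
    with open(path) as columns_file:
        loaded_columns = json.load(columns_file)["column_names"]

# An incoming row with one unknown key and one missing column (illustrative values).
new_row = {"n_visits": 3, "purchases": 1, "time_spent": 42.5, "unknown_col": 9}
aligned_row = align_features(new_row, loaded_columns)
print(aligned_row)
```

Keeping the order in a file rather than relying on the order of the incoming data guards against silently feeding features to the model in the wrong positions.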


# MODEL SELECTION

Model selection means choosing the most suitable statistical model from a set of candidates. In the simplest case, the candidates are compared on a dataset that has already been collected. When candidates offer comparable predictive or explanatory power, the simplest one is generally preferred.
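
The "prefer the simplest of comparable models" rule can be sketched as follows. This is only an illustration of the principle, not what the cells below do (they pick the run with the highest F1 score). The `tolerance` parameter, the `complexity` ranking, and all candidate names and scores are made-up assumptions.

```python
def select_model(candidates, tolerance=0.01):
    """Pick the simplest candidate whose score is within `tolerance` of the best.

    candidates: list of dicts with 'name', 'score' (higher is better) and
    'complexity' (lower is simpler). All three keys are illustrative.
    """
    best_score = max(c["score"] for c in candidates)
    comparable = [c for c in candidates if best_score - c["score"] <= tolerance]
    # Among comparable candidates, prefer low complexity, then high score.
    return min(comparable, key=lambda c: (c["complexity"], -c["score"]))


# Hypothetical candidates with made-up scores.
candidates = [
    {"name": "complex_model", "score": 0.752, "complexity": 3},
    {"name": "simple_model", "score": 0.750, "complexity": 1},
    {"name": "weak_baseline", "score": 0.600, "complexity": 0},
]
chosen = select_model(candidates)
print(chosen["name"])
```

With a tolerance of 0, the function falls back to picking the single highest-scoring candidate.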

In [40]:
# Constants used:
current_date = datetime.datetime.now().strftime("%Y_%B_%d")
artifact_path = "model"
model_name = "lead_model"
experiment_name = current_date

# Helper functions

In [52]:
import time
from mlflow.tracking.client import MlflowClient
from mlflow.entities.model_registry.model_version_status import ModelVersionStatus

def wait_until_ready(model_name, model_version):
    """Poll the registry (up to 10 times, 1 s apart) until the version is READY."""
    client = MlflowClient()
    for _ in range(10):
        model_version_details = client.get_model_version(
            name=model_name,
            version=model_version,
        )
        status = ModelVersionStatus.from_string(model_version_details.status)
        print(f"Model status: {ModelVersionStatus.to_string(status)}")
        if status == ModelVersionStatus.READY:
            break
        time.sleep(1)
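
Note that the helper above returns nothing, so a caller cannot tell whether the version became ready or the ten attempts simply ran out. One possible variant, sketched here with a hypothetical `poll_until` function and a simulated status source instead of a real registry call, reports the outcome explicitly:

```python
import time


def poll_until(check, expected, attempts=10, delay_seconds=1.0, sleep=time.sleep):
    """Call check() until it returns `expected` or the attempts run out.

    Returns True if the expected value was seen, False otherwise.
    """
    for attempt in range(attempts):
        if check() == expected:
            return True
        # Only wait when another attempt will follow.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


# Simulated status sequence standing in for a registry lookup.
statuses = iter(["PENDING_REGISTRATION", "PENDING_REGISTRATION", "READY"])
became_ready = poll_until(lambda: next(statuses), "READY", sleep=lambda _: None)
print(became_ready)
```

Passing `sleep` as a parameter is a design choice that keeps the helper testable without real waiting.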


# Getting experiment model results

In [41]:
experiment_ids = [mlflow.get_experiment_by_name(experiment_name).experiment_id]
experiment_ids

['332768886302457088']

In [45]:
experiment_best = mlflow.search_runs(
    experiment_ids=experiment_ids,
    order_by=["metrics.f1_score DESC"],
    max_results=1
).iloc[0]
experiment_best

run_id                                               1d146caef00949918ca253e714468c89
experiment_id                                                      332768886302457088
status                                                                       FINISHED
artifact_uri                        file:///Users/jeppe.kristensen/Projects/itu-sd...
start_time                                           2024-11-04 18:01:27.111000+00:00
end_time                                             2024-11-04 18:01:29.790000+00:00
metrics.training_roc_auc                                                        0.775
metrics.training_f1_score                                                       0.712
metrics.training_accuracy_score                                                 0.712
metrics.best_cv_score                                                           0.662
metrics.training_score                                                          0.712
metrics.training_recall_score                         

In [None]:
import json

with open("./artifacts/model_results.json", "r") as f:
    model_results = json.load(f)
results_df = pd.DataFrame({model: val["weighted avg"] for model, val in model_results.items()}).T
results_df

In [None]:
best_model = results_df.sort_values("f1-score", ascending=False).iloc[0].name
print(f"Best model: {best_model}")

# Get production model

In [49]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Keep only the versions of this model that are currently in the Production stage
prod_model = [
    model for model in client.search_model_versions(f"name='{model_name}'")
    if model.current_stage == 'Production'
]
prod_model_exists = len(prod_model) > 0

if prod_model_exists:
    prod_model_version = prod_model[0].version
    prod_model_run_id = prod_model[0].run_id
    
    print('Production model name: ', model_name)
    print('Production model version:', prod_model_version)
    print('Production model run id:', prod_model_run_id)
    
else:
    print('No model in production')


No model in production


# Compare prod and best trained model

In [51]:
train_model_score = experiment_best["metrics.f1_score"]
model_details = {}
model_status = {}
run_id = None

if prod_model_exists:
    # mlflow.get_run returns a Run object; logged metrics live in run.data.metrics
    prod_run = mlflow.get_run(prod_model_run_id)
    prod_model_score = prod_run.data.metrics["f1_score"]

    model_status["current"] = train_model_score
    model_status["prod"] = prod_model_score

    if train_model_score>prod_model_score:
        print("Registering new model")
        run_id = experiment_best["run_id"]
else:
    print("No model in production")
    run_id = experiment_best["run_id"]

print(f"Registered model: {run_id}")

No model in production
Registered model: 1d146caef00949918ca253e714468c89
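
The decision rule in the cell above can be isolated into a small pure function, which makes it easy to test without an MLflow server. This is a sketch only: `choose_run_to_register` is a hypothetical name, and the run ids and scores in the example calls are placeholders.

```python
def choose_run_to_register(candidate_run_id, candidate_score, prod_score=None):
    """Return the run id to register, or None to keep the current production model.

    prod_score=None means that no production model exists yet.
    """
    if prod_score is None:
        # Nothing in production: the best trained run is registered.
        return candidate_run_id
    if candidate_score > prod_score:
        # The candidate must be strictly better than production.
        return candidate_run_id
    return None


# Placeholder run id and scores, for illustration only.
no_prod = choose_run_to_register("run_a", 0.71)
better = choose_run_to_register("run_a", 0.71, prod_score=0.65)
not_better = choose_run_to_register("run_a", 0.71, prod_score=0.71)
print(no_prod, better, not_better)
```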


# Register best model

In [54]:
if run_id is not None:
    print(f'Best model found: {run_id}')

    model_uri = f"runs:/{run_id}/{artifact_path}"
    model_details = mlflow.register_model(model_uri=model_uri, name=model_name)
    wait_until_ready(model_details.name, model_details.version)
    model_details = dict(model_details)
    print(model_details)

Best model found: 1d146caef00949918ca253e714468c89
Model status: READY
{'aliases': [], 'creation_timestamp': 1730747081176, 'current_stage': 'None', 'description': None, 'last_updated_timestamp': 1730747081176, 'name': 'lead_model', 'run_id': '1d146caef00949918ca253e714468c89', 'run_link': None, 'source': 'file:///Users/jeppe.kristensen/Projects/itu-sdse-project/notebooks/mlruns/332768886302457088/1d146caef00949918ca253e714468c89/artifacts/model', 'status': 'READY', 'status_message': None, 'tags': {}, 'user_id': None, 'version': 1}


Successfully registered model 'lead_model'.
Created version '1' of model 'lead_model'.


# DEPLOY

A model version can be assigned to one or more stages. MLflow provides predefined stages for common use cases: None, Staging, Production, and Archived. With the necessary permissions, you can transition a model version between stages or request a transition to a different stage.

In [58]:
model_version = 1

# Transition to staging

In [61]:
from mlflow.tracking import MlflowClient

client = MlflowClient()


def wait_for_deployment(model_name, model_version, stage='Staging'):
    """Poll the registry until the model version reaches `stage`, then return True.

    Note: there is no timeout, so this loops for as long as the stage differs.
    """
    while True:
        model_version_details = dict(
            client.get_model_version(name=model_name, version=model_version)
        )
        if model_version_details['current_stage'] == stage:
            print(f'Transition completed to {stage}')
            return True
        # Not there yet: wait before polling again
        time.sleep(2)

model_version_details = dict(client.get_model_version(name=model_name,version=model_version))
model_status = True
if model_version_details['current_stage'] != 'Staging':
    client.transition_model_version_stage(
        name=model_name,
        version=model_version,
        stage="Staging",
        archive_existing_versions=True
    )
    model_status = wait_for_deployment(model_name, model_version, 'Staging')
else:
    print('Model version already in staging')

Model version already in staging
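
The transition call above passes `archive_existing_versions=True`, which asks the registry to move versions already in the target stage to Archived. The toy model below illustrates that effect on a plain dictionary. It is a simplified sketch, not MLflow's implementation: `transition_stage` and the example registry contents are hypothetical.

```python
def transition_stage(stages, version, new_stage, archive_existing_versions=False):
    """Return a new {version: stage} mapping after moving `version` to `new_stage`.

    Toy model only: with archive_existing_versions=True, other versions that
    currently hold `new_stage` are moved to "Archived".
    """
    updated = dict(stages)
    if archive_existing_versions:
        for other_version, stage in stages.items():
            if other_version != version and stage == new_stage:
                updated[other_version] = "Archived"
    updated[version] = new_stage
    return updated


# Hypothetical registry state: version 1 is in Staging, version 2 is unassigned.
registry = {1: "Staging", 2: "None"}
archived = transition_stage(registry, 2, "Staging", archive_existing_versions=True)
kept = transition_stage(registry, 2, "Staging")
print(archived)
print(kept)
```

Without the flag, both versions end up in Staging at the same time, which is usually not what a deployment step wants.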
