# Developing a Model

First, install the required Python libraries if not done already. See
[Installing Required Python Libraries](../00_Installing_Required_Python_Libraries.md).

If you're new to Python, you might be interested in [Introduction to Python Lists and Dictionaries for Data Science](../01_Introduction_to_Python_Data_Types.md).

This notebook is a continuation meant to be viewed after the Data Exploration and Data Pre-Processing notebooks. In this notebook we will develop and assess various ML models and address some of the issues we've identified in our data exploration for ML model development. We will develop and assess ML models to predict customer churn, starting with a pipeline to pre-process our data. 

### Imports

In the next section we will import the necessary packages and modules that will be used throughout this project.

In [1]:
# Imports necessary packages and modules

import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis
from sklearn import set_config
from sklearn.base import is_classifier
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score, precision_score, recall_score, precision_recall_curve
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Imports the dataset

churn_df = pd.read_csv("../../data/output/customer_churn_abt.csv", header="infer")

### Initial-Processing with Pandas

Perform initial data pre-processing with pandas based on the exploratory analysis results.

In [3]:
# Columns that need to be dropped from the process

drop_cols = ["ID", "birthDate", "avgDiscountValue12", "intAdExposureCountAll"]

In [4]:
# Defines a function to remove unnecessary columns

def col_removal(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """
    Removes specified columns given a list.

    Parameters
    ----------
    df: 
        Initial Pandas DataFrame
    cols: 
        List of column names to be dropped. Default = None.
    """

    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"Expected a pandas DataFrame. Received type: {type(df)}")
    
    if not isinstance(cols, list):
        raise TypeError(f"Expected a list of column names. Received type: {type(cols)}")

    # Avoids overwriting original dataframe 

    clean_df = df.copy()

    clean_df.drop(labels = cols, axis = 1, inplace = True)
    
    return clean_df  

In [5]:
# Defines a function to remove columns with missing values based on a threshold

def na_removal(df: pd.DataFrame, na_thresh: float = 0.5) -> pd.DataFrame:
    """
    Removes columns given a threshold for how many missing values are acceptable in a column. 
    By default only removes columns that have 50% or more of their values missing.

    Parameters
    ----------
    df: 
        Initial Pandas DataFrame
    na_thresh: 
        Value between 0.0 and 1.0 specifying threshold of missing values. If proportion of 
        missing values exceeds the threshold then the column is dropped.
    """

    # Ensures threshold value within proper range 

    if not (0 <= na_thresh <= 1):
        raise ValueError("The na_thresh parameter must be a value between 0 and 1.")

    # Creates a copy of the dataframe and modifies it based on the threshold
    
    clean_df = df.copy()

    threshold = na_thresh * clean_df.shape[0]
    clean_df.dropna(thresh = threshold, axis = 1, inplace = True)
    
    return clean_df  

In [6]:
# Performs the pre-processing from the previous notebook via a Pandas pipeline

churn_df = (churn_df.pipe(col_removal, drop_cols)
            .pipe(na_removal, 0.5)
            )

### Column Names

In the next few code cells new variables will be created to quickly reference columns belonging to a given group based on type, missingness, etc.

In [7]:
# Creates lists representing the input variables and target

target = "LostCustomer"
inputs = [_input for _input in churn_df.columns if _input != target]
numerics = [_input for _input in churn_df.select_dtypes(["int", "float"]).columns if _input != target]
categoricals = [_input for _input in churn_df.select_dtypes("object").columns]

### Data Partitioning

Partitions data into stratified training and testing partitions using an 80/20 split.

In [8]:
# Partitions data into training and validation

X_train, X_test, y_train, y_test = train_test_split(churn_df[inputs], churn_df[target], test_size = 0.2, stratify = churn_df[target], random_state = 42)

In [9]:
# Displays the shapes of the partitions

print(f"The training partition inputs have dimensions: {X_train.shape} and the targets have dimension {y_train.shape}")
print(f"The validation partition inputs have dimensions: {X_test.shape} and the targets have dimension {y_test.shape}")

The training partition inputs have dimensions: (4000, 24) and the targets have dimension (4000,)
The validation partition inputs have dimensions: (1000, 24) and the targets have dimension (1000,)


### Data-Preprocessing via Pipelines

In the previous notebook we tested various pre-processing steps and their effects on the data. In this section we will re-apply some of the tested pre-processing techniques using sklearn pipelines.

In [10]:
# Defines a function to identify columns with high skewness and kurtosis

def skewed_cols(df: pd.DataFrame, threshold: tuple = (-3, 3)) -> list:
    """
    Returns the columns that have a high skewnewess and kurtosis.
    
    Parameters
    ----------
    df: 
        Initial Pandas DataFrame.
    threshold: 
        Tuple specifying lower and upper thresholds for skewness and kurtosis.
    """ 

    # Type checking

    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"df parameter must be a pandas DataFrame object, received {type(df)}")

    if not isinstance(threshold, tuple):
        raise TypeError(f"threshold parameter must be a tuple, received {type(threshold)}")
        
    low, high = threshold

    if not isinstance(low, int) or not isinstance(high, int):
        raise ValueError(f"threshold tuple values must be integers, received {type(low)} and {type(high)}")

    # Computing skewness and kurtosis

    skewness = df.select_dtypes(["int", "float"]).skew()
    kurtosis = df.select_dtypes(["int", "float"]).kurtosis()

    skew_df = skewness[(skewness > high) | (skewness < low)]
    kurt_df = skewness[(kurtosis > high) | (kurtosis < low)]

    # Selects columns with skewness and kurtosis outside the thresholds

    cols = [col for col in skew_df.index if col in kurt_df]

    return cols

In [11]:
# Defines function that applies a log transformation to the specified DataFrame

def log_transform(df: pd.DataFrame) -> pd.DataFrame:
    """
    Applies a log transformation to all columns in the input DataFrame.
    Function also ensures that np.log != 0.
    
    Parameters
    ----------
    df: 
        Initial Pandas DataFrame.
    """ 
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"Expected a pd.DataFrame type object. Received a {type(df)} type object.")
    
    # Shifts features distributions by one to prevent log(0)

    index = df.index
    columns = df.columns
    df = df + 1

    if np.any(df <= 0):
        raise ValueError(f"np.log received a value <= 0. Must receive a value >= 0.")

    # Applies log transformation using np.log. This yields an np.ndarray so we transform back into a pandas dataframe.

    df = np.log(df)    
    df = pd.DataFrame(df, columns = columns, index = index)
        
    return df

In [12]:
# Sets sklearn configuration for pipelines

set_config(transform_output = "pandas")

# Creates the list of columns that need to be log transformed

log_cols = skewed_cols(X_train)

In [13]:
# Creates a ColumnTransformer to perform the log_transform on columns with skewed distributions

log_transformer = ColumnTransformer(
    transformers = [
        ("log_transform", FunctionTransformer(log_transform), log_cols)
        ], 
    remainder = "passthrough"
    )

# Creates a pipeline to pre-process numeric and categorical variables separately

num_pipeline = Pipeline([
    ("num_imputer", SimpleImputer().set_output(transform = "pandas")),
    ("log_transformer", log_transformer),
    ("standard_scaler", StandardScaler())
])

cat_pipeline = Pipeline([
    ("cat_imputer", SimpleImputer()),
    ("ohe_encoder", OneHotEncoder()),
])

In [14]:
# Displays the numeric pipeline's parameters 

num_pipeline.get_params()

{'memory': None,
 'steps': [('num_imputer', SimpleImputer()),
  ('log_transformer',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('log_transform',
                                    FunctionTransformer(func=<function log_transform at 0x7f0cbd0f3060>),
                                    ['LastPurchaseAmount', 'AvgPurchaseAmount12',
                                     'AvgPurchaseAmountTotal', 'customersales',
                                     'AvgPurchasePerAd'])])),
  ('standard_scaler', StandardScaler())],
 'verbose': False,
 'num_imputer': SimpleImputer(),
 'log_transformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('log_transform',
                                  FunctionTransformer(func=<function log_transform at 0x7f0cbd0f3060>),
                                  ['LastPurchaseAmount', 'AvgPurchaseAmount12',
                                   'AvgPurchaseAmountTotal', 'customersales',
                 

In [15]:
# Specifies the parameters for both pipelines

num_params = {
    "num_imputer__strategy": "median"
}

cat_params = {
    "cat_imputer__strategy": "most_frequent",
    "ohe_encoder__drop": "if_binary",
    "ohe_encoder__sparse_output": False
}

# Applies parameters to the pipelines

num_pipeline.set_params(**num_params)
cat_pipeline.set_params(**cat_params)

In [16]:
# Combines the pipelines into one and adds variable selection

pipeline = Pipeline(
    [("transformer", ColumnTransformer(
        [
            ("num_pipeline", num_pipeline, numerics), 
            ("cat_pipeline", cat_pipeline, categoricals)
        ])
     ),
     ("selector", VarianceThreshold())
    ]
)

In [17]:
# Sets necessary parameters

pipeline_params = {"selector__threshold": 0.1}

pipeline.set_params(**pipeline_params)

In [18]:
# Creates a transformed sample of the training partition 

sample = pipeline.fit_transform(X_train)
sample.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4000 entries, 2195 to 4557
Data columns (total 26 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   num_pipeline__log_transform__LastPurchaseAmount      4000 non-null   float64
 1   num_pipeline__log_transform__AvgPurchaseAmount12     4000 non-null   float64
 2   num_pipeline__log_transform__AvgPurchaseAmountTotal  4000 non-null   float64
 3   num_pipeline__log_transform__customersales           4000 non-null   float64
 4   num_pipeline__log_transform__AvgPurchasePerAd        4000 non-null   float64
 5   num_pipeline__remainder__regionPctCustomers          4000 non-null   float64
 6   num_pipeline__remainder__numOfTotalReturns           4000 non-null   float64
 7   num_pipeline__remainder__wksSinceLastPurch           4000 non-null   float64
 8   num_pipeline__remainder__basktPurchCount12Month      4000 non-null   f

In [19]:
# Displays a sample of the dataset

sample.head()

Unnamed: 0,num_pipeline__log_transform__LastPurchaseAmount,num_pipeline__log_transform__AvgPurchaseAmount12,num_pipeline__log_transform__AvgPurchaseAmountTotal,num_pipeline__log_transform__customersales,num_pipeline__log_transform__AvgPurchasePerAd,num_pipeline__remainder__regionPctCustomers,num_pipeline__remainder__numOfTotalReturns,num_pipeline__remainder__wksSinceLastPurch,num_pipeline__remainder__basktPurchCount12Month,num_pipeline__remainder__intAdExposureCount12,...,num_pipeline__remainder__wksSinceFirstPurch,num_pipeline__remainder__EstimatedIncome,num_pipeline__remainder__regionMedHomeVal,num_pipeline__remainder__techSupportEval,num_pipeline__remainder__customerAge,cat_pipeline__customerGender_F,cat_pipeline__customerGender_M,cat_pipeline__customerSubscrStat_Gold,cat_pipeline__customerSubscrStat_Platinum,cat_pipeline__demHomeOwner_Unknown
2195,-5.972375,-4.68957,-0.616379,-1.767718,-3.169382,-2.297029,-1.008847,-1.729669,3.75124,1.67899,...,0.821938,0.45971,-0.941428,0.27711,1.005265,1.0,0.0,0.0,1.0,1.0
2822,1.07097,0.737284,1.648125,1.2541,0.370915,3.022086,1.709501,-0.215594,-0.817697,0.623464,...,-0.88576,-1.273206,-0.284103,0.27711,-0.215791,0.0,1.0,0.0,1.0,1.0
835,0.386382,0.455866,0.443336,0.232426,0.636375,0.144532,-0.32926,1.298482,-0.24658,-0.432062,...,-0.343633,-0.575226,-0.57154,1.043665,1.737899,0.0,1.0,0.0,1.0,0.0
4268,-0.00545,0.1388,-0.275544,0.005875,-0.020926,-2.733022,0.350327,-0.215594,-0.24658,-0.08022,...,-0.018358,0.074618,0.413303,-0.489446,-0.582108,1.0,0.0,0.0,1.0,0.0
2227,0.6911,0.70244,1.168836,1.249162,0.32518,1.103716,-1.008847,1.731075,0.895654,0.623464,...,0.577981,-1.056592,-0.564669,1.043665,-0.215791,1.0,0.0,0.0,1.0,1.0


## Machine Learning Model Development

The pre-processing pipelines are used to create pipelines to train ML models. The models that will be constructed are a logistic regression, decision tree, random forest, and gradient boosting model.

### Creating Model Pipelines

In [20]:
# Creates a logistic regression model pipeline

logit_pipeline = Pipeline([
    ("data_prep", pipeline),
    ("logit", LogisticRegression())
])

In [21]:
# Creates a decision tree model pipeline

dtree_pipeline = Pipeline([
    ("data_prep", pipeline),
    ("dtree", DecisionTreeClassifier())
])

In [22]:
# Creates a random forest model pipeline

rf_pipeline = Pipeline([
    ("data_prep", pipeline),
    ("rf", RandomForestClassifier())
])

In [23]:
# Creates a gradient boosting model pipeline

gb_pipeline = Pipeline([
    ("data_prep", pipeline),
    ("gb", GradientBoostingClassifier())
])

### Fitting Model Pipelines

In [24]:
# Fits the logistic regression pipeline

logit_pipeline.fit(X_train, y_train)

In [25]:
# Fits the decision tree pipeline

dtree_pipeline.fit(X_train, y_train)

In [26]:
# Fits the random forest pipeline

rf_pipeline.fit(X_train, y_train)

In [27]:
# Fits the gradient boosting pipeline

gb_pipeline.fit(X_train, y_train)

## Model Assessment

In this section we will assess the two most promising models and select a champion.

In [28]:
# Defines a function to compute various assessment metrics at once

def model_assessor(pipelines: list, X_test: pd.DataFrame, y_true: np.ndarray):
    """
    Uses the specified pipelines to generate probabilities then computes precision, recall, and F1-Score.
    
    Parameters
    ----------
    pipelines: 
        Takes a list of sklearn pipelines that will be used to generate predictions.
    X_test: 
        Pandas dataframe containing the testing partition.
    y_true: 
        Numpy array containing true labels.
    """
    
    # Stores predictions and models used to generated them 
    # These are used to create a DataFrame of the assessments metrics for each model
    
    models_info = {}
    
    for pipeline in pipelines:

        # Type Checking

        if not isinstance(pipeline, (Pipeline, GridSearchCV)):
            raise TypeError(f"Expected an sklearn estimator, received {type(pipeline)} instead.")
        
        # Accesses pipeline steps, first condition is for Pipeline classes, else condition is for GridSearchCV classes 

        if isinstance(pipeline, Pipeline):    
            model = pipeline.steps[-1][0]
        else:
            model = pipeline.best_estimator_.steps[-1][0]
        
        # Generates probabilities as a numpy array

        pred = np.array(pipeline.predict_proba(X_test))[:, 1] 
    
        # Identifies optimal threshold based on F1, Recall, and Precision
            
        precision, recall, thresholds = precision_recall_curve(y_test, pred)

        # Calculate F1 scores for all thresholds, adds small epsilon to avoid zero division

        f1_scores = 2 * (precision * recall) / (precision + recall + 1e-6)
        
         # Find the threshold that maximizes F1 Score

        optimal_idx = np.argmax(f1_scores)
        optimal_threshold = thresholds[optimal_idx]
        optimal_f1 = f1_scores[optimal_idx]
        precision = precision[optimal_idx]
        recall = recall[optimal_idx]
        
        # Makes decision based on threshold value
        
        pred = np.where(pred < optimal_threshold, 0, 1)
        models_info.update({model: {"pred": pred, "threshold": optimal_threshold, "f1_score": optimal_f1, "recall": recall, "precision": precision}})
    
    # Computes assessment metrics
    
    assessments = []
    
    for model, info in models_info.items():
        balanced_accuracy = balanced_accuracy_score(y_true, info["pred"])
        assessments.append([model, float(info["f1_score"]), float(balanced_accuracy), float(info["precision"]), float(info["recall"])])
    
    # Creates dictionary containing assessments
    
    df = pd.DataFrame(assessments, columns = ["Model", "F1-Score", "Balanced Accuracy","Precision", "Recall"])
    df = df.sort_values(by = "F1-Score", ascending = False).reset_index(drop = True)
    
    return df, models_info

In [29]:
# Defines a list of the pipelines that will be assessed and calls model_assessor

model_list = [logit_pipeline, dtree_pipeline, rf_pipeline, gb_pipeline]

model_assessment, models_info = model_assessor(model_list, X_test = X_test, y_true = y_test)
print(model_assessment)

   Model  F1-Score  Balanced Accuracy  Precision    Recall
0     gb  0.633204           0.824096   0.565517  0.719298
1     rf  0.520661           0.739634   0.492188  0.552632
2  dtree  0.429824           0.678231   0.429825  0.429825
3  logit  0.407079           0.664508   0.410714  0.403509


### Hyperparameter Tuning

In this section a grid search is performed to try to find better hyper parameters for the top two performing models based on their F1-Score. 

In [30]:
# Displays settings for the Random Forest model

rf_params = rf_pipeline.get_params()
{param: setting for param, setting in rf_params.items() if "rf__" in param}

{'rf__bootstrap': True,
 'rf__ccp_alpha': 0.0,
 'rf__class_weight': None,
 'rf__criterion': 'gini',
 'rf__max_depth': None,
 'rf__max_features': 'sqrt',
 'rf__max_leaf_nodes': None,
 'rf__max_samples': None,
 'rf__min_impurity_decrease': 0.0,
 'rf__min_samples_leaf': 1,
 'rf__min_samples_split': 2,
 'rf__min_weight_fraction_leaf': 0.0,
 'rf__monotonic_cst': None,
 'rf__n_estimators': 100,
 'rf__n_jobs': None,
 'rf__oob_score': False,
 'rf__random_state': None,
 'rf__verbose': 0,
 'rf__warm_start': False}

In [31]:
# Displays settings for the Gradient Boosting model

gb_params = gb_pipeline.get_params()
{param: setting for param, setting in gb_params.items() if "gb__" in param}

{'gb__ccp_alpha': 0.0,
 'gb__criterion': 'friedman_mse',
 'gb__init': None,
 'gb__learning_rate': 0.1,
 'gb__loss': 'log_loss',
 'gb__max_depth': 3,
 'gb__max_features': None,
 'gb__max_leaf_nodes': None,
 'gb__min_impurity_decrease': 0.0,
 'gb__min_samples_leaf': 1,
 'gb__min_samples_split': 2,
 'gb__min_weight_fraction_leaf': 0.0,
 'gb__n_estimators': 100,
 'gb__n_iter_no_change': None,
 'gb__random_state': None,
 'gb__subsample': 1.0,
 'gb__tol': 0.0001,
 'gb__validation_fraction': 0.1,
 'gb__verbose': 0,
 'gb__warm_start': False}

In [32]:
# Creates param ranges for the random forest and gradient boosting models

n_rows = X_train.shape[0]
n_cols = X_train.shape[1]

rf_param_grid = {"rf__criterion": ["gini", "entropy", "log_loss"],
                 "rf__max_features": [round(0.25 * n_cols), round(0.75 * n_cols)],
                 "rf__max_depth": [7, 17],
                 "rf__min_samples_leaf": [5, 15],
                 "rf__max_samples": [round(0.5 * n_rows), round(0.75 * n_rows)],
                 "rf__n_estimators": [100, 200]
                 }

gb_param_grid = {"gb__learning_rate": [0.01, 0.2],
                 "gb__max_depth": [3, 5],
                 "gb__min_samples_leaf": [5, 15],
                 "gb__n_estimators": [100, 200],
                 "gb__subsample": [0.75, 1.0],
                 }

In [33]:
# Defines new pipelines for both models

rf_tuned_pipe = GridSearchCV(estimator = rf_pipeline, param_grid = rf_param_grid, scoring = "f1")
gb_tuned_pipe = GridSearchCV(estimator = gb_pipeline, param_grid = gb_param_grid, scoring = "f1")

***NB: for next cell, be aware that autotuning can take some time to run, please be patient!***

In [34]:
# Tunes the random forest pipeline 

rf_tuned_pipe.fit(X_train, y_train)

In [35]:
# Displays the best parameters for the random forest

rf_best_params = rf_tuned_pipe.best_params_
rf_best_params

{'rf__criterion': 'entropy',
 'rf__max_depth': 17,
 'rf__max_features': 18,
 'rf__max_samples': 3000,
 'rf__min_samples_leaf': 5,
 'rf__n_estimators': 100}

***NB: for next cell, be aware that autotuning can take some time to run, please be patient!***

In [36]:
# Tunes the gradient boosting pipeline 

gb_tuned_pipe.fit(X_train, y_train)

In [37]:
# Displays the best parameters for the random forest

gb_best_params = gb_tuned_pipe.best_params_
gb_best_params

{'gb__learning_rate': 0.2,
 'gb__max_depth': 5,
 'gb__min_samples_leaf': 5,
 'gb__n_estimators': 200,
 'gb__subsample': 1.0}

In [38]:
# Assess the tuned models

tuned_assess, models_info = model_assessor([rf_tuned_pipe, gb_tuned_pipe], X_test, y_test)
print(tuned_assess)

  Model  F1-Score  Balanced Accuracy  Precision    Recall
0    gb  0.705882           0.845848   0.677419  0.736842
1    rf  0.611111           0.769158   0.647059  0.578947


## Viya ML

In this section the best performing pipeline will be modified to introduce the Viya ML package. The Viya ML package provides you with access to Viya's multithreaded algorithms.

In [39]:
# Imports Viya ML packages

from sasviya.ml.linear_model import ElasticNet, Lasso, LinearRegression, LogisticRegression, Ridge
from sasviya.ml.svm import SVC, SVR
from sasviya.ml.tree import DecisionTreeClassifier, ForestClassifier, GradientBoostingClassifier

In [40]:
# Inspects the type of the selected algorithm

type(GradientBoostingClassifier())

sasviya.ml.tree.gradboost.GradientBoostingClassifier

In [41]:
# Creates a Viya Gradient Boosting Pipeline

viya_gb_pipeline = Pipeline([
    ("data_prep", pipeline),
    ("gb", GradientBoostingClassifier())
])

In [42]:
# Displays the parameters of the ViyaML Gradient Boosting model

viya_gb_pipeline.get_params()

{'memory': None,
 'steps': [('data_prep', Pipeline(steps=[('transformer',
                    ColumnTransformer(transformers=[('num_pipeline',
                                                     Pipeline(steps=[('num_imputer',
                                                                      SimpleImputer(strategy='median')),
                                                                     ('log_transformer',
                                                                      ColumnTransformer(remainder='passthrough',
                                                                                        transformers=[('log_transform',
                                                                                                       FunctionTransformer(func=<function log_transform at 0x7f0cbd0f3060>),
                                                                                                       ['LastPurchaseAmount',
                                              

In [43]:
# Extracts parameters from tuned gradient boosting pipeline

gb_tuned_params = gb_tuned_pipe.get_params()
gb_tuned_params = {param.split("estimator__")[1]: setting for param, setting in gb_tuned_params.items() if "estimator__gb" in param}
gb_tuned_params.update(gb_best_params)
gb_tuned_params

{'gb': GradientBoostingClassifier(),
 'gb__ccp_alpha': 0.0,
 'gb__criterion': 'friedman_mse',
 'gb__init': None,
 'gb__learning_rate': 0.2,
 'gb__loss': 'log_loss',
 'gb__max_depth': 5,
 'gb__max_features': None,
 'gb__max_leaf_nodes': None,
 'gb__min_impurity_decrease': 0.0,
 'gb__min_samples_leaf': 5,
 'gb__min_samples_split': 2,
 'gb__min_weight_fraction_leaf': 0.0,
 'gb__n_estimators': 200,
 'gb__n_iter_no_change': None,
 'gb__random_state': None,
 'gb__subsample': 1.0,
 'gb__tol': 0.0001,
 'gb__validation_fraction': 0.1,
 'gb__verbose': 0,
 'gb__warm_start': False}

In [44]:
# Applies the parameter settings that were found with GridSearchCV

viya_gb_pipeline.set_params(**gb_tuned_params)

In [45]:
# Fits the new pipeline

viya_gb_pipeline.fit(X_train, y_train)

In [46]:
# Assesses the Viya Gradient Boosting Pipeline

viya_assess, models_info = model_assessor([viya_gb_pipeline], X_test, y_test)
viya_assess

Unnamed: 0,Model,F1-Score,Balanced Accuracy,Precision,Recall
0,gb,0.706422,0.822482,0.740385,0.675439


## Deployment Preparation

In this final section we will save the champion model for deployment.

In [47]:

# Imports the pickle module to save the final tuned pipeline

import pickle

In [48]:
# Saves the pkl file in the working directory

with open("../../models/viya_gb_pipeline.pkl", "wb") as f:
    pickle.dump(viya_gb_pipeline, f)

In [49]:
# Loads the pickle file to test it

loaded_gb_pipe = pickle.load(open("../../models/viya_gb_pipeline.pkl", "rb"))

In [50]:
# Runs the assessment function on the un-pickled object

viya_assess, models_info = model_assessor([loaded_gb_pipe], X_test, y_test)
viya_assess

Unnamed: 0,Model,F1-Score,Balanced Accuracy,Precision,Recall
0,gb,0.706422,0.822482,0.740385,0.675439
