<a href="https://colab.research.google.com/github/winterForestStump/bank_marketing/blob/main/ml_sgdClassifier_bank_marketing_experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SGD Classifier

SGDClassifier implements regularised linear models with Stochastic Gradient Descent.

With its default settings, the SGD classifier does not perform as well as logistic regression, so it needs some hyperparameter tuning.

## Importing Libraries

In [27]:
%%capture
!pip install skops
!pip install gradio
!pip install scikit-optimize

In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import time
import skops.io as sio
import gradio as gr
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import roc_auc_score
from skopt import BayesSearchCV
import warnings

We will use the larger dataset, `bank-full.csv`:

In [96]:
bm_df = pd.read_csv("https://raw.githubusercontent.com/winterForestStump/bank_marketing/main/data_/bank-full.csv",header=0, delimiter=';')
bm_df = bm_df.sample(frac = 1)
bm_df.head(3)

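| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 43722 | 75 | retired | married | tertiary | no | 6027 | no | no | cellular | 14 | may | 809 | 1 | 179 | 4 | success | yes |
| 23298 | 44 | technician | single | secondary | no | -34 | no | yes | cellular | 27 | aug | 14 | 13 | -1 | 0 | unknown | no |
| 28578 | 47 | blue-collar | married | primary | no | 613 | yes | no | cellular | 29 | jan | 49 | 3 | 255 | 2 | other | no |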


We will not use the features that carry information about previous contacts, because they are hard to interpret and to use in practice. We also separate the target column 'y' from the dataset and store it in a `target` variable:

In [97]:
bm_df = bm_df.drop(columns=['day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome'])

# Encode the target as 1 ('yes') / 0 ('no')
target = bm_df["y"].map({'yes': 1, 'no': 0})

data = bm_df.drop("y", axis=1)

Column transformer for numerical and categorical data:

In [98]:
ct = ColumnTransformer(transformers=[('num', MinMaxScaler(), ['age','balance']),
 ('cat', OneHotEncoder(handle_unknown='ignore'),['job','marital','education','default','housing','loan','contact'])])

In [99]:
data_trans = ct.fit_transform(data)

Split data into training and test sets:

In [100]:
X_train, X_test, y_train, y_test = train_test_split(data_trans, target, test_size=0.3, shuffle=True, random_state=42)

To search for the best hyperparameters, we will use Bayesian search with cross-validation (`BayesSearchCV`) from the `skopt` library:

In [101]:
opt = BayesSearchCV(
    SGDClassifier(),
    {
        'loss': ['log_loss'],
        'penalty': ['elasticnet'],
        'alpha': (0.0001, 1),
        'l1_ratio': (0.1, 1),
        'tol':[None],
        'class_weight': ['balanced'],
        'shuffle': [True]
    },
    n_iter=32,
    cv=10,
    random_state=42
)

Train and evaluate the model:

In [102]:
# Suppress warnings
warnings.filterwarnings("ignore")

opt.fit(X_train, y_train)

print("val. score: %s" % opt.best_score_)
print("test score: %s" % opt.score(X_test, y_test))

# Restore warnings
warnings.resetwarnings()

val. score: 0.6631591582235377
test score: 0.6568121498083161


Find best parameters:

In [103]:
opt.best_estimator_


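Create the production pipeline. The column indices refer to positions in `data` (0 and 5 are 'age' and 'balance'; the others are the categorical columns):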

In [120]:
pipe = Pipeline(steps=[('preprocessor',
                        ColumnTransformer(transformers=[('num', MinMaxScaler(), [0,5]),
                                                        ('cat', OneHotEncoder(handle_unknown='ignore'),[1,2,3,4,6,7,8])])),
                      ("model", SGDClassifier(alpha=0.20768, class_weight='balanced', l1_ratio=0.1, loss='log_loss',
                                              penalty='elasticnet', tol=None)),
    ])

Train the model on the raw (untransformed) data and evaluate it on the test set:

In [121]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, shuffle=True, random_state=42)

pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.93      0.66      0.77     11986
           1       0.19      0.60      0.29      1578

    accuracy                           0.66     13564
   macro avg       0.56      0.63      0.53     13564
weighted avg       0.84      0.66      0.72     13564



In [123]:
pipe

In [124]:
sio.dump(pipe, "sgd_bank_marketing_pipe.skops")

## Next steps

* Building a web application: use Gradio to build a simple classification user interface.

* Deploying the machine learning model: create a Space on Hugging Face and add our model and the app file.

