<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Framework%20Workflows/CatBoost/CatBoost%20In%20Notebook.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FFramework%2520Workflows%2FCatBoost%2FCatBoost%2520In%2520Notebook.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Framework%20Workflows/CatBoost/CatBoost%20In%20Notebook.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Framework%20Workflows/CatBoost/CatBoost%20In%20Notebook.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# CatBoost In Notebook

---
## Colab Setup

To run this notebook in Colab, run the cells in this section. Otherwise, skip this section.

The cells below set the project ID and authenticate to GCP (follow the prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs

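The list `packages` contains tuples of package import names, install names, and an optional minimum version. If the import name is not found, the install name is used to install the package quietly for the current user. If a minimum version is given and the installed version is older, the package is upgraded.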
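
The check described above can be tried without installing anything. The sketch below is a dry run under stated assumptions: `missing_packages` is a hypothetical helper (not part of this notebook), and the package names in `example` are made up so that they are absent. It only reports which install names would be passed to `pip`.

```python
import importlib.util


def missing_packages(packages):
    """Return the install names whose import name cannot be found (hypothetical helper)."""
    missing = []
    for import_name, install_name, *_ in packages:
        try:
            found = importlib.util.find_spec(import_name) is not None
        except ModuleNotFoundError:
            # raised for dotted names such as 'a.b' when the parent package is absent
            found = False
        if not found:
            missing.append(install_name)
    return missing


example = [
    ('json', 'json'),                                  # standard library, always present
    ('not_a_real_module_xyz', 'not-a-real-dist-xyz'),  # made-up name, assumed absent
    ('not_a_real_pkg_xyz.sub', 'not-a-real-dist-2'),   # made-up dotted name, assumed absent
]
print(missing_packages(example))
```

The import name and the install name are kept separate because they often differ, as with `sklearn` and `scikit-learn` in the list below.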

In [63]:
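# tuples of (import name, install name, optional minimum version)
packages = [
    ('catboost', 'catboost'),
    ('bigframes', 'bigframes'),
    ('sklearn', 'scikit-learn'),
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.66.0'),
]

import importlib.util
import importlib.metadata

def version_tuple(version):
    # simple numeric parse, assumes plain X.Y.Z versions;
    # comparing the raw strings would rank '1.9.0' above '1.66.0'
    return tuple(int(part) for part in version.split('.')[:3] if part.isdigit())

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # installed versions are looked up by install (distribution) name, not import name
        if version_tuple(importlib.metadata.version(package[1])) < version_tuple(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user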

### API Enablement

In [64]:
!gcloud services enable aiplatform.googleapis.com

### Restart Kernel (If Installs Occurred)

After the kernel restarts, resume running cells from the one that follows this cell.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

inputs:

In [8]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [9]:
REGION = 'us-central1'
SERIES = 'frameworks-catboost'
EXPERIMENT = 'notebook'

packages:

In [66]:
import bigframes.pandas as bpd

import catboost #from catboost import CatBoostClassifier, Pool, metrics, cv
import sklearn.metrics #import accuracy_score

from google.cloud import aiplatform

clients:

In [68]:
# BigFrames
bpd.options.bigquery.project = PROJECT_ID

# vertex ai clients
aiplatform.init(project = PROJECT_ID, location = REGION, experiment = f"{SERIES}-{EXPERIMENT}")

---
## Data Source

**The Data**

The BigQuery source table is `bigquery-public-data.ml_datasets.ulb_fraud_detection`.  This is a table of credit card transactions that are classified as fraudulent, `Class = 1`, or normal, `Class = 0`.
- The data can be researched further at this [Kaggle link](https://www.kaggle.com/mlg-ulb/creditcardfraud).
- Read more about BigQuery public datasets [here](https://cloud.google.com/bigquery/public-data)

**Description of the Data**

This is a table of 284,807 credit card transactions classified as fraudulent or normal in the column `Class`.  In order to protect confidentiality, the original features have been transformed using [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) into 28 features named `V1, V2, ... V28` (float).  Two descriptive features are provided without transformation by PCA:
- `Time` (integer) is the seconds elapsed between the transaction and the earliest transaction in the table
- `Amount` (float) is the value of the transaction
>**Quick Note on PCA**<p>PCA is an unsupervised learning technique: there is no target variable.  PCA is commonly used as a variable/feature reduction technique.  If you have 100 features, you could reduce them to a smaller number p (say 10) of projected features.  The choice of p balances how much of the variance of the full feature space is explained against how many features are kept.  Each projected feature is orthogonal to every other projected feature, meaning the new projected features are uncorrelated.</p>
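
The PCA note above can be illustrated with a minimal NumPy sketch. The data here is synthetic and every name (`latent`, `mixing`, `p`) is an assumption for illustration. It is not the transformation applied to the fraud data, whose original features are not available.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: 500 rows, 5 correlated features built from 2 latent factors
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 5))

# PCA via singular value decomposition of the centered data
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

p = 2                               # number of projected features to keep
projected = X_centered @ Vt[:p].T   # shape (500, p)

# share of the total variance explained by each component
explained_ratio = S**2 / np.sum(S**2)
print(explained_ratio[:p].sum())

# the projected features are uncorrelated: off-diagonal entries are ~0
print(np.corrcoef(projected, rowvar=False).round(6))
```

Choosing `p` is the trade-off described in the note: a larger `p` explains more variance but keeps more features.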

**Preparation of the Data**

One column is added to the source data:
- `splits` (string) divides the transactions into sets for `TRAIN` (80%), `VALIDATE` (10%), and `TEST` (10%)

In [70]:
fraud_ds = bpd.read_gbq('bigquery-public-data.ml_datasets.ulb_fraud_detection', use_cache=False)

In [71]:
fraud_ds.head()

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72890.0 | -1.22258 | -0.017622 | 2.317581 | -1.547722 | -0.958068 | -0.370571 | -0.583838 | 0.384328 | -0.72238 | ... | 0.430025 | 1.217131 | -0.463494 | 0.456253 | 0.385304 | -0.104713 | -0.303068 | -0.300302 | 5.9 | 0 |
| 1 | 131206.0 | 1.967597 | -1.009301 | -1.970656 | -0.406056 | 1.614598 | 3.92548 | -1.209586 | 0.952736 | -0.429297 | ... | -0.288566 | -0.420307 | 0.258054 | 0.632264 | -0.148758 | -0.656398 | 0.077885 | -0.027551 | 59.0 | 0 |
| 2 | 122831.0 | 2.290614 | -1.288035 | -1.091499 | -1.591945 | -0.983697 | -0.58711 | -0.952236 | -0.272064 | -1.392405 | ... | -0.161871 | 0.04143 | 0.225622 | 0.672485 | -0.105101 | -0.194599 | 0.007488 | -0.040725 | 30.0 | 0 |
| 3 | 68397.0 | 1.258859 | 0.440981 | 0.331167 | 0.681581 | -0.267935 | -1.046229 | 0.163925 | -0.269223 | -0.142249 | ... | -0.27486 | -0.734847 | 0.116306 | 0.376938 | 0.25547 | 0.090629 | -0.015355 | 0.033149 | 0.89 | 0 |
| 4 | 152137.0 | 2.023988 | -0.351874 | -0.494781 | 0.36047 | -0.400929 | -0.202362 | -0.544039 | -0.078031 | 1.364484 | ... | 0.160192 | 0.774027 | 0.021697 | -0.601828 | 0.029147 | -0.175735 | 0.04743 | -0.041086 | 9.99 | 0 |


In [72]:
fraud_ds = fraud_ds.to_pandas()

In [73]:
shuffle = fraud_ds.sample(frac = 1, random_state = 42)
train_pct, val_pct = .8, .1
train_end = int(train_pct * len(shuffle))
val_end = int((train_pct + val_pct) * len(shuffle))

fraud_ds['splits'] = 'None'
fraud_ds.loc[shuffle[:train_end].index, 'splits'] = 'TRAIN'
fraud_ds.loc[shuffle[train_end:val_end].index, 'splits'] = 'VALIDATE'
fraud_ds.loc[shuffle[val_end:].index, 'splits'] = 'TEST'
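
A quick way to sanity-check a shuffle-then-slice split like the one above is to count rows per split and look at the positive-class rate in each. The sketch below repeats the same logic on a toy stand-in for `fraud_ds`. The frame `toy` and its values are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd

# toy stand-in for fraud_ds: 1,000 rows with a rare positive class
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Amount': rng.uniform(0, 100, size=1000),
    'Class': (rng.uniform(size=1000) < 0.05).astype(int),
})

# same shuffle-then-slice logic as the notebook cell
shuffle = toy.sample(frac=1, random_state=42)
train_pct, val_pct = .8, .1
train_end = int(train_pct * len(shuffle))
val_end = int((train_pct + val_pct) * len(shuffle))

toy['splits'] = 'None'
toy.loc[shuffle[:train_end].index, 'splits'] = 'TRAIN'
toy.loc[shuffle[train_end:val_end].index, 'splits'] = 'VALIDATE'
toy.loc[shuffle[val_end:].index, 'splits'] = 'TEST'

# row count and positive-class rate per split
summary = toy.groupby('splits')['Class'].agg(['count', 'mean'])
print(summary)
```

Because the split is random rather than stratified, the positive-class rate can differ somewhat between splits, which matters more when the positive class is rare.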

In [74]:
fraud_ds.head()

|  | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | splits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 72890.0 | -1.22258 | -0.017622 | 2.317581 | -1.547722 | -0.958068 | -0.370571 | -0.583838 | 0.384328 | -0.72238 | ... | 1.217131 | -0.463494 | 0.456253 | 0.385304 | -0.104713 | -0.303068 | -0.300302 | 5.9 | 0 | TRAIN |
| 1 | 131206.0 | 1.967597 | -1.009301 | -1.970656 | -0.406056 | 1.614598 | 3.92548 | -1.209586 | 0.952736 | -0.429297 | ... | -0.420307 | 0.258054 | 0.632264 | -0.148758 | -0.656398 | 0.077885 | -0.027551 | 59.0 | 0 | TRAIN |
| 2 | 122831.0 | 2.290614 | -1.288035 | -1.091499 | -1.591945 | -0.983697 | -0.58711 | -0.952236 | -0.272064 | -1.392405 | ... | 0.04143 | 0.225622 | 0.672485 | -0.105101 | -0.194599 | 0.007488 | -0.040725 | 30.0 | 0 | TRAIN |
| 3 | 68397.0 | 1.258859 | 0.440981 | 0.331167 | 0.681581 | -0.267935 | -1.046229 | 0.163925 | -0.269223 | -0.142249 | ... | -0.734847 | 0.116306 | 0.376938 | 0.25547 | 0.090629 | -0.015355 | 0.033149 | 0.89 | 0 | TRAIN |
| 4 | 152137.0 | 2.023988 | -0.351874 | -0.494781 | 0.36047 | -0.400929 | -0.202362 | -0.544039 | -0.078031 | 1.364484 | ... | 0.774027 | 0.021697 | -0.601828 | 0.029147 | -0.175735 | 0.04743 | -0.041086 | 9.99 | 0 | TRAIN |


In [75]:
X = fraud_ds.drop(['Class', 'splits'], axis = 1)
y = fraud_ds.Class
splits = fraud_ds.splits

In [97]:
train = catboost.Pool(
    data = X.loc[splits[splits == 'TRAIN'].index],
    label = y.loc[splits[splits == 'TRAIN'].index]
)
validate = catboost.Pool(
    data = X.loc[splits[splits == 'VALIDATE'].index],
    label = y.loc[splits[splits == 'VALIDATE'].index]
)
test = catboost.Pool(
    data = X.loc[splits[splits == 'TEST'].index],
    label = y.loc[splits[splits == 'TEST'].index]
)
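
The row selection used to build each `Pool` goes through the index of the filtered `splits` series. Assuming `X`, `y`, and `splits` share the same index (they do here, since all three come from `fraud_ds`), a boolean mask selects the same rows. A small sketch on toy data, with values chosen only for illustration:

```python
import pandas as pd

# toy frames that share a default integer index
X = pd.DataFrame({'Amount': [5.9, 59.0, 30.0, 0.89, 9.99]})
y = pd.Series([0, 0, 1, 0, 0], name='Class')
splits = pd.Series(['TRAIN', 'TEST', 'TRAIN', 'VALIDATE', 'TRAIN'])

# index-based selection, as in the notebook cell
by_index = X.loc[splits[splits == 'TRAIN'].index]

# boolean-mask selection (relies on the shared index)
mask = splits == 'TRAIN'
by_mask = X.loc[mask]

print(by_index.equals(by_mask))
print(y.loc[mask].tolist())
```

Either form works. The mask version is shorter, while the index version makes the dependence on row labels explicit.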

In [106]:
model = catboost.CatBoostClassifier(
    custom_loss = [catboost.metrics.Accuracy()],
    random_seed = 42,
    iterations = 200,
    verbose = False
)

In [107]:
model.fit(
    train,
    eval_set = validate
)

<catboost.core.CatBoostClassifier at 0x7f4f6ae2e7d0>

In [108]:
model.get_best_score()

{'learn': {'Accuracy': 0.9999034431301982, 'Logloss': 0.00043716791558089595},
 'validation': {'Accuracy': 0.9995786664794073,
  'Logloss': 0.002149552674352494}}

In [109]:
model.get_best_iteration()

69
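
The best iteration is the boosting round at which the evaluation metric on the validation set was at its best, which for `Logloss` means lowest. The sketch below shows the idea with a made-up loss curve. The numbers are hypothetical and are not taken from this model. It also assumes that, when `use_best_model` is enabled, trees added after the best iteration are discarded.

```python
# hypothetical validation Logloss per boosting iteration (0-indexed)
validation_logloss = [0.0210, 0.0094, 0.0051, 0.0032, 0.0025,
                      0.0022, 0.0023, 0.0024, 0.0026]

# best iteration = position of the lowest validation loss
best_iteration = min(range(len(validation_logloss)),
                     key=validation_logloss.__getitem__)
best_score = validation_logloss[best_iteration]

# under the assumption above, only trees up to and including
# the best iteration are kept in the final model
trees_kept = best_iteration + 1

print(best_iteration, best_score, trees_kept)
```

In this notebook the model was allowed 200 iterations and the best one was 69, which suggests the validation loss stopped improving well before the iteration limit.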

In [110]:
model.get_params()

{'iterations': 200,
 'random_seed': 42,
 'verbose': False,
 'custom_loss': ['Accuracy']}

In [111]:
model.get_all_params()

{'nan_mode': 'Min',
 'eval_metric': 'Logloss',
 'iterations': 200,
 'sampling_frequency': 'PerTree',
 'leaf_estimation_method': 'Newton',
 'random_score_type': 'NormalWithModelSizeDecrease',
 'grow_policy': 'SymmetricTree',
 'penalties_coefficient': 1,
 'boosting_type': 'Plain',
 'model_shrink_mode': 'Constant',
 'feature_border_type': 'GreedyLogSum',
 'bayesian_matrix_reg': 0.10000000149011612,
 'eval_fraction': 0,
 'force_unit_auto_pair_weights': False,
 'l2_leaf_reg': 3,
 'random_strength': 1,
 'rsm': 1,
 'boost_from_average': False,
 'model_size_reg': 0.5,
 'pool_metainfo_options': {'tags': {}},
 'subsample': 0.800000011920929,
 'use_best_model': True,
 'class_names': [0, 1],
 'random_seed': 42,
 'depth': 6,
 'posterior_sampling': False,
 'border_count': 254,
 'classes_count': 0,
 'auto_class_weights': 'None',
 'sparse_features_conflict_fraction': 0,
 'custom_metric': ['Accuracy'],
 'leaf_estimation_backtracking': 'AnyImprovement',
 'best_model_min_trees': 1,
 'model_shrink_rate': 

In [124]:
predictions = model.predict(test.get_features())
predictions_probs = model.predict_proba(test.get_features())

In [125]:
predictions[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [126]:
predictions_probs[0:10]

array([[9.99921674e-01, 7.83259909e-05],
       [9.89528053e-01, 1.04719474e-02],
       [9.99981821e-01, 1.81792276e-05],
       [9.99993107e-01, 6.89255259e-06],
       [9.99600287e-01, 3.99712633e-04],
       [9.99870655e-01, 1.29344732e-04],
       [9.99973999e-01, 2.60012740e-05],
       [9.99858036e-01, 1.41963930e-04],
       [9.99982863e-01, 1.71372976e-05],
       [9.99960022e-01, 3.99781716e-05]])
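
Each row of `predictions_probs` holds the probability of `Class = 0` followed by the probability of `Class = 1`. The sketch below uses the first five rows printed above to show how class labels follow from the probabilities, assuming the default decision rule of picking the more probable class. The `threshold` value is a hypothetical choice for illustration.

```python
import numpy as np

# first five rows of predictions_probs from the output above
probs = np.array([
    [9.99921674e-01, 7.83259909e-05],
    [9.89528053e-01, 1.04719474e-02],
    [9.99981821e-01, 1.81792276e-05],
    [9.99993107e-01, 6.89255259e-06],
    [9.99600287e-01, 3.99712633e-04],
])

# each row sums to 1 (up to rounding)
print(probs.sum(axis=1))

# class label = column with the larger probability, i.e. P(Class=1) > 0.5
labels = probs.argmax(axis=1)
print(labels)

# a lower, hypothetical threshold flags more transactions for review
threshold = 0.01
flagged = (probs[:, 1] > threshold).astype(int)
print(flagged)
```

Working from the probabilities rather than the hard labels makes it possible to trade missed fraud against false alarms by moving the threshold.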

In [129]:
sklearn.metrics.accuracy_score(
    test.get_label(),
    model.predict(test.get_features())
)

0.9994733330992591
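
Accuracy is the share of predictions that match the label. When one class is rare, as fraud typically is, a model that always predicts the majority class already scores very high, so it helps to read accuracy next to that baseline. The sketch below uses toy labels (assumed data, not the notebook's test set) to show the calculation by hand.

```python
# toy labels and predictions: 2 fraud cases among 1,000 transactions
y_true = [0] * 998 + [1] * 2
y_pred = [0] * 998 + [1, 0]   # one fraud case caught, one missed

# accuracy = share of predictions that match the label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# baseline: always predict the majority class (0)
baseline = sum(t == 0 for t in y_true) / len(y_true)

# recall on the positive class = caught fraud / all fraud
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(accuracy, baseline, recall)
```

In this toy case the accuracy is only slightly above the baseline even though half of the fraud cases are missed, which is why recall (or a similar class-specific metric) is a useful companion to accuracy.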