<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/08%20-%20R/R%20-%20Notebook%20Based%20Workflow.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2F08%2520-%2520R%2FR%2520-%2520Notebook%2520Based%2520Workflow.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/08%20-%20R/R%20-%20Notebook%20Based%20Workflow.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/08%20-%20R/R%20-%20Notebook%20Based%20Workflow.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# R - Notebook Based Workflow

---
Part of the series of [**R**](https://github.com/statmike/vertex-ai-mlops/blob/main/08%20-%20R/readme.md) workflows:

A series of workflows focused on using **R** in Vertex AI as well as other Google Cloud services to run R code, train models with R, and serve predictions with R.

---

**The Data**

The source data is first exported to Google Cloud Storage in CSV format below.  The BigQuery source table is `bigquery-public-data.ml_datasets.ulb_fraud_detection`.  This is a table of credit card transactions that are classified as fraudulent, `Class = 1`, or normal, `Class = 0`.
- The data can be researched further at this [Kaggle link](https://www.kaggle.com/mlg-ulb/creditcardfraud).
- Read more about BigQuery public datasets [here](https://cloud.google.com/bigquery/public-data)

**Description of the Data**

This is a table of 284,807 credit card transactions classified as fraudulent or normal in the column `Class`.  In order to protect confidentiality, the original features have been transformed using [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) into 28 features named `V1, V2, ... V28` (float).  Two descriptive features are provided without transformation by PCA:
- `Time` (integer) is the seconds elapsed between the transaction and the earliest transaction in the table
- `Amount` (float) is the value of the transaction
>**Quick Note on PCA**<p>PCA is an unsupervised learning technique: there is not a target variable.  PCA is commonly used as a variable/feature reduction technique.  If you have 100 features then you could reduce it to a number p (say 10) projected features.  The choice of this number is a balance of how well it can explain the variance of the full feature space and reducing the number of features.  Each projected feature is orthogonal to each other feature, meaning there is no correlation between these new projected features.</p>
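
As a small illustration of the note above, the following sketch runs PCA with base R's `prcomp` on simulated data. The data, seed, and variable names are made up for illustration and are not taken from the fraud table. It checks that the projected features are uncorrelated and prints the cumulative share of variance explained, which is one quantity commonly used when choosing p.

```r
# Simulated data: 200 rows, 5 features, two pairs of which are strongly correlated
set.seed(42)
n <- 200
base <- matrix(rnorm(n * 2), ncol = 2)
x <- cbind(
    base[, 1],
    base[, 1] + rnorm(n, sd = 0.1),
    base[, 2],
    base[, 2] + rnorm(n, sd = 0.1),
    rnorm(n)
)
colnames(x) <- paste0("F", 1:5)

# PCA on centered and scaled features
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Projected features (scores) are pairwise uncorrelated
score_cor <- cor(pca$x)
max_off_diag <- max(abs(score_cor[upper.tri(score_cor)]))
print(max_off_diag)

# Cumulative proportion of variance explained by the first p components
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(round(var_explained, 3))
```

Because two pairs of the simulated features nearly duplicate each other, a small number of components should account for most of the variance here; real data will rarely be this clean.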

**Preparation of the Data**

This notebook adds two columns to the source data and stores the result in a new BigQuery table (named by `bq_table` in the setup below):
- `transaction_id` (string) a unique id for the row/transaction
- `splits` (string) divides the transactions into sets for `TRAIN` (80%), `VALIDATE` (10%), and `TEST` (10%)

---

**Prerequisites:**

- This notebook is running in a Vertex AI Workbench Instance as described in the series [readme](./readme.md)

---
## Setup

inputs:

In [1]:
project_id <- system('gcloud config get-value project', intern = TRUE)
project_id

In [2]:
region <- 'us-central1'
experiment <- 'bigquery-data'
series <- 'r'

# BigQuery Parameters
bq_project <- project_id
bq_dataset <- series
bq_table <- experiment
bq_region <- substr(region, 1, 2)
bq_source <- 'bigquery-public-data.ml_datasets.ulb_fraud_detection'

# GCS Parameters: Give bucket name
gcs_bucket <- project_id

# key columns in the data:
var_target <- 'Class'
var_omit <- list('transaction_id', 'splits')

packages:

In [3]:
library(bigrquery)
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




---
## Prepare Data For Models

While **R** is an excellent tool for preparing data for machine learning, preparation done in R can make serving the resulting models challenging, because the same processing must be repeated on future data.  This section shows how to use **R** to orchestrate some preliminary data preparation steps in BigQuery and then load the results into **R** using one of the methods presented in the comprehensive workflow: [R - Working With BigQuery](./R%20-%20Working%20With%20BigQuery.ipynb).

### BigQuery Dataset

In BigQuery, tables are arranged in groups called datasets, which are resources within Google Cloud projects.  This three-level organization (project, dataset, table) makes it easy to refer to a table or view by a fully qualified name.
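
A minimal sketch of how the three levels combine into one name, using base R only. The project ID below is a hypothetical placeholder; the dataset and table values mirror the Setup cells.

```r
# Illustrative values; in this notebook they come from the Setup cells
bq_project <- "my-project"      # hypothetical project ID
bq_dataset <- "r"
bq_table <- "bigquery-data"

# Fully qualified name: project.dataset.table
table_id <- sprintf("%s.%s.%s", bq_project, bq_dataset, bq_table)

# Quoting the full path with backticks, as the queries in this notebook do,
# keeps names that contain dashes valid in SQL
table_ref <- sprintf("`%s`", table_id)
print(table_ref)

# Splitting on "." recovers the three levels
parts <- strsplit(table_id, ".", fixed = TRUE)[[1]]
names(parts) <- c("project", "dataset", "table")
print(parts)
```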

Create a dataset within the current project to hold a prepared version of the data.  Start by checking to see if it already exists.

- Reference: [BigQuery datasets with R/bq-datasets.R](https://bigrquery.r-dbi.org/reference/api-dataset.html)

Create BigQuery dataset object:

In [4]:
bq_ds <- bq_dataset(bq_project, bq_dataset)

Check for existence of the dataset and create it if needed:

In [5]:
if (bq_dataset_exists(bq_ds)) {
    print('Dataset already exists')
} else {
    print('Creating dataset')
    bq_dataset_create(bq_ds, location = bq_region)
}

[1] "Dataset already exists"


### Create Table

Create a copy of the source table in the new dataset, add row-level IDs (`transaction_id`), and assign each row to a split (`splits`) for `TRAIN`, `VALIDATE`, or `TEST`.

Define the query that creates the table:

In [6]:
query <- sprintf('
CREATE TABLE IF NOT EXISTS `%s.%s.%s` AS
WITH add_id AS(SELECT *, GENERATE_UUID() transaction_id FROM `%s`)
SELECT *,
    CASE 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 8 THEN "TRAIN" 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 9 THEN "VALIDATE"
        ELSE "TEST"
    END AS splits
FROM add_id
', bq_project, bq_dataset, bq_table, bq_source)
cat(query)


CREATE TABLE IF NOT EXISTS `statmike-mlops-349915.r.bigquery-data` AS
WITH add_id AS(SELECT *, GENERATE_UUID() transaction_id FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`)
SELECT *,
    CASE 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 8 THEN "TRAIN" 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 9 THEN "VALIDATE"
        ELSE "TEST"
    END AS splits
FROM add_id
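
The `CASE` expression above places each row in one of ten buckets (the hash of `transaction_id` modulo 10) and maps buckets to splits. The sketch below mirrors only that mapping in base R; it does not reproduce `FARM_FINGERPRINT`, and the helper name `assign_split` is made up for illustration. It assumes the hash spreads rows roughly evenly across the ten buckets.

```r
# Map a hash bucket (0-9) to a split, mirroring the CASE expression
assign_split <- function(bucket) {
    ifelse(bucket < 8, "TRAIN",
        ifelse(bucket < 9, "VALIDATE", "TEST"))
}

# Buckets 0-7 go to TRAIN, bucket 8 to VALIDATE, bucket 9 to TEST
buckets <- 0:9
splits <- assign_split(buckets)
print(data.frame(bucket = buckets, split = splits))

# Share of buckets per split, which gives the approximate 80/10/10 proportions
print(table(splits) / length(buckets))
```

Because the assignment depends only on the row's id, a given `transaction_id` always maps to the same split.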


Run the query using `bigrquery`:

In [7]:
create <- bq_perform_query(query, billing = bq_project)

Wait on the create job to complete:

In [8]:
bq_job_wait(create)

### Retrieve Table

Using the `bigrquery` method, retrieve the full table to a dataframe.

For comprehensive review of this method and others, check out this workflow: [R - Working With BigQuery](./R%20-%20Working%20With%20BigQuery.ipynb).

Define the query that reads the table.  Take advantage of BigQuery's columnar storage by excluding columns that are not needed, listed in `var_omit` in the inputs above, while also using a `WHERE` clause to filter to the rows allocated to the requested split.

In [9]:
get_data <- function(s){
    
    # query for table
    query <- sprintf('
        SELECT * EXCEPT(%s)
        FROM `%s.%s.%s`
        WHERE splits = "%s"
    ', paste(unlist(var_omit), collapse = ','),
    bq_project, bq_dataset, bq_table, s)
    
    # connect to table
    table <- bq_project_query(bq_project, query)
    
    # load table to dataframe
    return(bq_table_download(table, n_max = Inf))

}
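
To inspect the SQL that `get_data` sends without contacting BigQuery, the string-building step can be run on its own. This is a sketch: `build_query` is a hypothetical helper name and the project ID is a placeholder, while the other values mirror the Setup cells.

```r
# Values mirroring the Setup cells; the project ID is a placeholder
bq_project <- "my-project"   # hypothetical
bq_dataset <- "r"
bq_table <- "bigquery-data"
var_omit <- list("transaction_id", "splits")

# Same string-building step as get_data(), without running the query
build_query <- function(s) {
    sprintf('
        SELECT * EXCEPT(%s)
        FROM `%s.%s.%s`
        WHERE splits = "%s"
    ', paste(unlist(var_omit), collapse = ','),
    bq_project, bq_dataset, bq_table, s)
}

query_train <- build_query("TRAIN")
cat(query_train)
```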

Retrieve Training and Test dataframes:

In [10]:
train <- get_data("TRAIN")
test <- get_data("TEST")

Review the size and preview the records:

In [11]:
dim(train)

In [12]:
head(train, 2)

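| Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ⋯ | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` |
| 5043 | -0.6103529 | 0.8762678 | 3.134572 | 2.26016851 | 0.001184993 | 0.2684391 | 0.127094 | -0.008680134 | 0.9528023 | ⋯ | -0.2022804 | -0.1228932 | -0.183132 | 0.2959793 | -0.1599888 | -0.1301962 | -0.076139183 | -0.109075941 | 0 | 0 |
| 43968 | 1.1032424 | -0.4789847 | 1.136295 | -0.05461861 | -0.823168828 | 0.7920736 | -0.9883738 | 0.492957197 | 0.8836801 | ⋯ | -0.0200895 | 0.1700578 | 0.1209046 | -0.2092878 | -0.07528029 | 1.0339329 | -0.005642291 | -0.002844234 | 0 | 0 |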


---
## Train Model

Using `glm` to fit logistic regression:

In [13]:
model_exp <- paste0(var_target, "~ .")

model <- glm(
    as.formula(model_exp),
    data = train,
    family = binomial)

In [14]:
summary(model)


Call:
glm(formula = as.formula(model_exp), family = binomial, data = train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.543e+00  2.933e-01 -29.132  < 2e-16 ***
Time        -2.796e-06  2.614e-06  -1.070  0.28478    
V1           1.000e-01  5.001e-02   1.999  0.04557 *  
V2          -2.005e-02  7.018e-02  -0.286  0.77514    
V3           5.399e-03  6.235e-02   0.087  0.93100    
V4           7.051e-01  8.678e-02   8.126 4.45e-16 ***
V5           1.041e-01  7.837e-02   1.328  0.18421    
V6          -6.528e-02  8.379e-02  -0.779  0.43595    
V7          -1.045e-01  8.266e-02  -1.265  0.20601    
V8          -1.593e-01  3.577e-02  -4.452 8.50e-06 ***
V9          -2.970e-01  1.302e-01  -2.281  0.02256 *  
V10         -7.969e-01  1.153e-01  -6.913 4.75e-12 ***
V11         -2.946e-02  9.560e-02  -0.308  0.75795    
V12          1.385e-01  1.059e-01   1.308  0.19077    
V13         -4.065e-01  9.692e-02  -4.194 2.74e-05 ***
V14         -6.051e-01  7.53

---
## Evaluate Model

Use the test data to evaluate the model:

In [15]:
dim(test)

### Get predictions

In [16]:
preds <- predict(model, test, type = "response")
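
The `type = "response"` argument matters here: by default `predict` on a `glm` returns values on the scale of the linear predictor (log-odds for a binomial model), while `"response"` returns probabilities. The sketch below shows the relationship on simulated data; the data frame, seed, and coefficients are made up for illustration.

```r
# Simulated binary outcome (illustrative data, not the fraud table)
set.seed(1)
n <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * df$x1 - df$x2))

fit <- glm(y ~ ., data = df, family = binomial)

# Default scale is the linear predictor (log-odds)
log_odds <- predict(fit, df)

# type = "response" applies the inverse logit, giving probabilities
probs <- predict(fit, df, type = "response")

# The two agree once the inverse logit (plogis) is applied to the log-odds
print(all.equal(unname(plogis(log_odds)), unname(probs)))
print(range(probs))
```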

In [17]:
preds[dim(test)[1]]

In [18]:
test[dim(test)[1], var_target]

Class
<int>
0


### Compare Predictions to Actual (Confusion Matrix):

In [19]:
actual <- test[, var_target]
names(actual) <- 'actual'

In [20]:
actual[1:5,]

actual
<int>
0
0
0
0
0


In [21]:
results <- cbind(actual, tibble(round(preds)))
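
The `round(preds)` call above converts probabilities to 0/1 labels, which amounts to a cutoff at 0.5. A minimal sketch of the same rule written with an explicit cutoff, using made-up probabilities; the names `p`, `cutoff`, and `labels_cut` are illustrative only.

```r
# Illustrative probabilities (made up)
p <- c(0.02, 0.49, 0.50, 0.51, 0.97)

# round() gives 0/1 labels; R rounds a value of exactly 0.5 to the even digit, 0
labels_round <- round(p)

# An explicit, named cutoff states the decision rule directly
cutoff <- 0.5
labels_cut <- as.integer(p > cutoff)

print(rbind(p, labels_round, labels_cut))
```

Writing the cutoff explicitly makes it possible to try other values, which may be worth exploring when one class is much rarer than the other.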

In [22]:
results[1:5,]

|   | actual | round(preds) |
|---|---|---|
|   | `<int>` | `<dbl>` |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 0 | 0 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |


In [23]:
table(results)

      round(preds)
actual     0     1
     0 28436     5
     1    19    33
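
Summary metrics can be read off the confusion matrix above. The sketch below copies the four counts by hand rather than recomputing them from `results`, and the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
cm <- matrix(
    c(28436, 5,
      19, 33),
    nrow = 2, byrow = TRUE,
    dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

tn <- cm["0", "0"]; fp <- cm["0", "1"]
fn <- cm["1", "0"]; tp <- cm["1", "1"]

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)   # share of predicted fraud that is actually fraud
recall    <- tp / (tp + fn)   # share of actual fraud that the model catches

print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
# accuracy 0.999, precision 0.868, recall 0.635
```

Accuracy is close to 1 largely because only 52 of the 28,493 test transactions are fraudulent, so precision and recall are the more informative numbers for this data.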