# Bank Transaction Dataset

- The data is a table of credit card transactions.
- Each transaction is labeled as fraudulent (`Class = 1`) or normal (`Class = 0`).
- The data comes from the public BigQuery table `bigquery-public-data.ml_datasets.ulb_fraud_detection`, which corresponds to this [Kaggle dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud).
- The table contains 284,807 transactions.
- Each transaction is described by 28 features, `V1, V2, ... V28`, which are a PCA projection of the original data (the original features are not released for privacy reasons). In addition to the `Class` label, there are two descriptive features that are not PCA-transformed:
    - `Time`: Seconds elapsed between this transaction and the first transaction in the table.
    - `Amount`: Transaction amount.
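
As a quick sketch of the schema described above, the expected column names can be generated programmatically. The names `PCA_FEATURES` and `EXPECTED_COLUMNS` are illustrative and do not appear in the original notebook; the column order is an assumption based on the table preview at the end of this notebook.

```python
# Illustrative names; not defined in the original notebook.
PCA_FEATURES = [f"V{i}" for i in range(1, 29)]  # V1 ... V28

# Assumed order: Time, the 28 PCA features, Amount, then the Class label.
EXPECTED_COLUMNS = ["Time", *PCA_FEATURES, "Amount", "Class"]

print(len(EXPECTED_COLUMNS))  # 31
```

A list like this could be compared against the columns of a loaded DataFrame to catch schema drift early.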

## Setup

Define constants:

In [2]:
project = !gcloud config get-value project
PROJECT_ID = project[0]  # Project ID

REGION = 'us-central1'  # Location of the servers to use
EXPERIMENT = '01'
SERIES = '01'

# BigQuery parameters
BQ_PROJECT = PROJECT_ID  # BigQuery project
BQ_DATASET = 'fraud'  # Dataset name within BigQuery
BQ_TABLE = 'fraud'  # Table name within the dataset

BQ_SOURCE = 'bigquery-public-data.ml_datasets.ulb_fraud_detection'  # Source (public BigQuery table) for the bank fraud data

BUCKET = PROJECT_ID  # Google Cloud Storage bucket

Imports:

In [3]:
from google.cloud import bigquery
from google.cloud import storage

Create the BigQuery and Google Cloud Storage clients:

In [4]:
bq = bigquery.Client(project = PROJECT_ID)
gcs = storage.Client(project = PROJECT_ID)

---
## Save data to a GCS bucket
Check whether the data already exists. If it does not, export the table to the bucket as a .csv file.

In [6]:
file = f"{SERIES}/{EXPERIMENT}/data/{BQ_TABLE}.csv"

In [7]:
bucketDef = gcs.bucket(BUCKET)
if storage.Blob(bucket = bucketDef, name = file).exists(gcs):
    print(f'The file has already been created at: gs://{bucketDef.name}/{file}')
else:
    source = bigquery.TableReference.from_string(BQ_SOURCE)
    extract = bq.extract_table(source = source, destination_uris = [f'gs://{bucketDef.name}/{file}'])
    extract.result()
    print(f'Table exported to: gs://{bucketDef.name}/{file}')

The file has already been created at: gs://statmike-mlops-349915/01/01/data/fraud.csv
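
The object path used above follows the pattern `<series>/<experiment>/data/<table>.csv` inside the bucket. A minimal sketch of that path construction, using a hypothetical helper `gcs_uri` and a placeholder bucket name `my-project` (neither is part of the original notebook):

```python
def gcs_uri(bucket: str, series: str, experiment: str, table: str) -> str:
    """Build gs://<bucket>/<series>/<experiment>/data/<table>.csv."""
    return f"gs://{bucket}/{series}/{experiment}/data/{table}.csv"


# "my-project" is a placeholder bucket name, not a real resource.
print(gcs_uri("my-project", "01", "01", "fraud"))
# gs://my-project/01/01/data/fraud.csv
```

Keeping the series and experiment in the path means later experiments can write their own exports without overwriting this one.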


List the files in the bucket:

In [8]:
list(bucketDef.list_blobs(prefix = f'{SERIES}/{EXPERIMENT}'))

[<Blob: statmike-mlops-349915, 01/01/data/fraud.csv, 1664756345743350>]

---
## Create a dataset in BigQuery

List the datasets in the BigQuery project:

In [10]:
datasets = list(bq.list_datasets())
for d in datasets:
    print(d.dataset_id)

forecasting_8_tournament
fraud
model_deployment_monitoring_1961322035766362112


Create the dataset if it does not exist:

In [11]:
ds = bigquery.Dataset(f"{BQ_PROJECT}.{BQ_DATASET}")
ds.location = REGION
ds.labels = {'experiment': f'{EXPERIMENT}'}
ds = bq.create_dataset(dataset = ds, exists_ok = True)

List the datasets in the BigQuery project again:

In [12]:
datasets = list(bq.list_datasets())
for d in datasets:
    print(d.dataset_id)

forecasting_8_tournament
fraud
model_deployment_monitoring_1961322035766362112


---
## Create a table in BigQuery

If the table does not exist, create it from the .csv file in the bucket.

In [14]:
from google.cloud.exceptions import NotFound

try:
    # get_table raises NotFound when the table does not exist
    table = bq.get_table(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}')
    print(f'The table already exists: {BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}')
except NotFound:
    print('Creating Table ...')
    destination = bigquery.TableReference.from_string(f"{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}")
    job_config = bigquery.LoadJobConfig(
        write_disposition = 'WRITE_TRUNCATE',
        source_format = bigquery.SourceFormat.CSV,
        autodetect = True,  # infer the schema from the CSV file
        labels = {'experiment': f'{EXPERIMENT}'}
    )
    job = bq.load_table_from_uri(f"gs://{bucketDef.name}/{file}", destination, job_config = job_config)
    job.result()  # wait for the load job to finish
    print(f'Finished creating table: {BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}')

Creating Table ...
Finished creating table: statmike-mlops-349915.fraud.fraud


### Review the data in the created table

In [17]:
query = f"""
SELECT Class
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
"""
df = bq.query(query = query).to_dataframe()

In [18]:
df['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

In [19]:
df['Class'].value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64

---
## Create a copy of the table in BigQuery
The copy adds two important columns:
- transaction_id: Unique identifier for each row (bank transaction).
- splits: Assigns each row to the training, validation, or test set.

In [20]:
query = f"""
CREATE TABLE IF NOT EXISTS `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_prepped` AS
WITH add_id AS(SELECT *, GENERATE_UUID() transaction_id FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`)
SELECT *,
    CASE 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 8 THEN "TRAIN" 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 9 THEN "VALIDATE"
        ELSE "TEST"
    END AS splits
FROM add_id
"""
job = bq.query(query = query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f8321ae70d0>
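
The query assigns splits by hashing `transaction_id` and bucketing the result modulo 10: buckets 0 to 7 go to `TRAIN`, bucket 8 to `VALIDATE`, and bucket 9 to `TEST`. The sketch below mirrors that logic in plain Python. It is only an illustration: `FARM_FINGERPRINT` is not available in the standard library, so SHA-256 is swapped in as the hash, which means the buckets will not match the ones BigQuery computes. The function name `assign_split` and the sample IDs are hypothetical.

```python
import hashlib
from collections import Counter


def assign_split(transaction_id: str) -> str:
    """Map an ID to TRAIN / VALIDATE / TEST in roughly 80/10/10 proportions."""
    # SHA-256 stands in for FARM_FINGERPRINT (assumption: any stable,
    # roughly uniform hash is enough to illustrate the bucketing).
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10  # value in 0..9
    if bucket < 8:
        return "TRAIN"
    if bucket < 9:
        return "VALIDATE"
    return "TEST"


# Hypothetical IDs, used only to show the approximate proportions.
ids = [f"id-{i}" for i in range(10_000)]
print(Counter(assign_split(t) for t in ids))
```

Hashing makes the assignment a pure function of the ID, so a given row always lands in the same split. In the notebook the IDs come from `GENERATE_UUID()`, which produces new values on every run, so the split only stays fixed because `CREATE TABLE IF NOT EXISTS` keeps the table once it has been materialized.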

### Review the data in the created table

Check how the data was divided into splits:

In [23]:
query = f"""
SELECT splits, count(*) as Count, 100*count(*) / (sum(count(*)) OVER()) as Percentage
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_prepped`
GROUP BY splits
"""
bq.query(query = query).to_dataframe()

| | splits | Count | Percentage |
|---|---|---|---|
| 0 | VALIDATE | 28244 | 9.916891 |
| 1 | TRAIN | 228061 | 80.07563 |
| 2 | TEST | 28502 | 10.007479 |
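
As a sanity check on the result above, the three split counts should add up to the 284,807 rows of the source table, and the percentages should sit close to the intended 80/10/10 ratio. A short sketch using the counts reported by the query (variable names are illustrative):

```python
# Counts copied from the query result above.
split_counts = {"TRAIN": 228061, "VALIDATE": 28244, "TEST": 28502}

total = sum(split_counts.values())
print(total)  # 284807

for name, n in split_counts.items():
    print(f"{name}: {100 * n / total:.2f}%")
# TRAIN: 80.08%
# VALIDATE: 9.92%
# TEST: 10.01%
```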


Preview the new table:

In [24]:
query = f"""
SELECT * 
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_prepped`
LIMIT 5
"""
data = bq.query(query = query).to_dataframe()

In [25]:
data.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | transaction_id | splits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35337 | 1.092844 | -0.01323 | 1.359829 | 2.731537 | -0.707357 | 0.873837 | -0.79613 | 0.437707 | 0.39677 | ... | -0.167647 | 0.027557 | 0.592115 | 0.219695 | 0.03697 | 0.010984 | 0.0 | 0 | a1b10547-d270-48c0-b902-7a0f735dadc7 | TEST |
| 1 | 60481 | 1.238973 | 0.035226 | 0.063003 | 0.641406 | -0.260893 | -0.580097 | 0.049938 | -0.034733 | 0.405932 | ... | -0.057718 | 0.104983 | 0.537987 | 0.589563 | -0.046207 | -0.006212 | 0.0 | 0 | 814c62c8-ade4-47d5-bf83-313b0aafdee5 | TEST |
| 2 | 139587 | 1.870539 | 0.211079 | 0.224457 | 3.889486 | -0.380177 | 0.249799 | -0.577133 | 0.179189 | -0.120462 | ... | 0.180776 | -0.060226 | -0.228979 | 0.080827 | 0.009868 | -0.036997 | 0.0 | 0 | d08a1bfa-85c5-4f1b-9537-1c5a93e6afd0 | TEST |
| 3 | 162908 | -3.368339 | -1.980442 | 0.153645 | -0.159795 | 3.847169 | -3.516873 | -1.209398 | -0.292122 | 0.760543 | ... | -1.171627 | 0.214333 | -0.159652 | -0.060883 | 1.294977 | 0.120503 | 0.0 | 0 | 802f3307-8e5a-4475-b795-5d5d8d7d0120 | TEST |
| 4 | 165236 | 2.180149 | 0.218732 | -2.637726 | 0.348776 | 1.063546 | -1.249197 | 0.942021 | -0.547652 | -0.087823 | ... | -0.176957 | 0.563779 | 0.730183 | 0.707494 | -0.131066 | -0.090428 | 0.0 | 0 | c8a5b93a-1598-4689-80be-4f9f5df0b8ce | TEST |
