# AAMR Data preparation

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.10.13

## Overview

Identify the data requirements for model training and prepare the required data.

### Objective

Review the data elements and generate synthetic data as required for training.

### Dataset

{TODO: Include a paragraph with Dataset information and where to obtain it.} 

{TODO: Make sure the dataset is accessible to the public. **Googlers**: Add your dataset to the [public samples bucket](http://goto/cloudsamples#sample-storage-bucket) within gs://cloud-samples-data/vertex-ai, if it doesn't already exist there.}

### Costs 

{TODO: Update the list of billable products that your tutorial uses.}

This tutorial uses billable components of Google Cloud:

* Vertex AI
* {TODO: BigQuery}
* Cloud Storage

{TODO: Include links to pricing documentation for each product you listed above.
 NOTE: If you use BigQuery or Dataflow, you need to add this to the pricing.
}

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing),
{ TODO: [BigQuery pricing](https://cloud.google.com/bigquery/pricing), }
and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), 
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook. 


In [None]:
! pip3 install --upgrade --quiet google-cloud-aiplatform

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "aamr-432116"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}
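A typo in `PROJECT_ID` usually surfaces only later as a confusing permission or "not found" error. The sketch below is a minimal format check, assuming the documented project ID rules (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen). The helper name `looks_like_project_id` is hypothetical and not part of any Google library, and the check does not confirm that the project exists or that you can access it.

```python
import re

# Documented project ID format: 6-30 characters, lowercase letters, digits,
# and hyphens; must start with a letter and must not end with a hyphen.
_PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def looks_like_project_id(project_id: str) -> bool:
    """Return True if the string matches the project ID format.

    Hypothetical helper: this only checks the format, not whether the
    project exists or is accessible with the current credentials.
    """
    return _PROJECT_ID_PATTERN.fullmatch(project_id) is not None


if __name__ == "__main__":
    for candidate in ("aamr-432116", "AAMR-432116", "my-project-"):
        print(candidate, looks_like_project_id(candidate))
```

In the notebook, the check could sit right after the `PROJECT_ID` assignment, for example `assert looks_like_project_id(PROJECT_ID)`.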

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [1]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

The Cloud SDK, code and other libraries currently run as the service account identity of the Workbench Instance running this notebook.

**- Authenticate the Cloud SDK with your credentials:**

In [2]:
# ! gcloud auth login

**- Authenticate code and libraries with your credentials:**

In [None]:
# ! gcloud auth application-default login

**- Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

- *{Note to notebook author: For any user-provided strings that need to be unique (like bucket names or model ID's), append "-unique" to the end so proper testing can occur}*

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}
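Because `BUCKET_URI` is built from a user-supplied string, `gsutil mb` can fail on an invalid name. The sketch below extracts the bucket name from a `gs://` URI and checks it against a conservative subset of the Cloud Storage naming rules: 3 to 63 characters, lowercase letters, digits, hyphens, and underscores, starting and ending with a letter or digit, and not starting with `goog`. Names containing dots, which the service allows under additional rules, are deliberately rejected here. Both helper names are hypothetical.

```python
import re

# Conservative subset of the bucket naming rules: 3-63 characters,
# lowercase letters, digits, hyphens, and underscores, with a letter or
# digit at both ends. Dotted names are valid in Cloud Storage but are
# intentionally not accepted by this simplified check.
_BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def bucket_name_from_uri(bucket_uri: str) -> str:
    """Return the bucket name from a URI such as gs://name or gs://name/path."""
    prefix = "gs://"
    if not bucket_uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}: {bucket_uri!r}")
    return bucket_uri[len(prefix):].split("/", 1)[0]


def is_simple_bucket_name(name: str) -> bool:
    """Return True if the name passes the simplified naming check."""
    if name.startswith("goog"):
        return False  # Names beginning with "goog" are not allowed.
    return _BUCKET_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    uri = "gs://your-bucket-name-aamr-432116-unique"
    name = bucket_name_from_uri(uri)
    print(name, is_simple_bucket_name(name))
```

Passing this check does not guarantee that the name is available, since bucket names share a single global namespace.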

### Import libraries

In [None]:
from google.cloud import aiplatform

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

{TODO: Include commands to delete individual resources below}

In [1]:
import os

# Delete endpoint resource
# e.g. `endpoint.delete()`

# Delete model resource
# e.g. `model.delete()`

# Delete Cloud Storage objects that were created
delete_bucket = False
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI

## Data Preparation


In [1]:
%%capture
! pip install ydata-synthetic

Collecting ydata-synthetic
  Downloading ydata_synthetic-1.4.0-py2.py3-none-any.whl (87 kB)
Collecting tensorflow==2.15.*
  Downloading tensorflow-2.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.2 MB)
Collecting typeguard==4.2.*
  Downloading typeguard-4.2.1-py3-none-any.whl (34 kB)
Collecting pytest==7.4.*
  Downloading pytest-7.4.4-py3-none-any.whl (325 kB)
Collecting easydict==1.10
  Downloading easydict-1.10.tar.gz (6.4 kB)
Collecting requests<2.31,>=2.28
  Downloading requests-2.30.0-py3-none-any.whl (62 kB)
Collecting tensorflow-probability[tf]
  Downloading tensorflow_probability-0.24.0-py2.py3-none-any.whl (6.9 MB)

In [None]:
! pip install "ydata-synthetic[streamlit]"

Collecting ydata-synthetic[streamlit]
  Using cached ydata_synthetic-1.4.0-py2.py3-none-any.whl (87 kB)
Collecting tensorflow==2.15.*
  Using cached tensorflow-2.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.2 MB)
Collecting easydict==1.10
  Using cached easydict-1.10-py3-none-any.whl
Collecting typeguard==4.2.*
  Using cached typeguard-4.2.1-py3-none-any.whl (34 kB)
Collecting pmlb==1.0.*
  Using cached pmlb-1.0.1.post3-py3-none-any.whl (19 kB)
Collecting pytest==7.4.*
  Using cached pytest-7.4.4-py3-none-any.whl (325 kB)
Collecting tensorflow-probability[tf]
  Using cached tensorflow_probability-0.24.0-py2.py3-none-any.whl (6.9 MB)
Collecting streamlit-pandas-profiling==0.1.3
  Downloading streamlit_pandas_profiling-0.1.3-py3-none-any.whl (259 kB)
Collecting streamlit==1.29.0
  Downloading streamlit-1.29.0-py2.py3-none-any.whl (8.4 MB)

In [None]:
from ydata_synthetic import streamlit_app

streamlit_app.run()

In [None]:
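# Runs the module named streamlit_app found on the Python path,
# e.g. a streamlit_app.py file in the current working directory.
! python -m streamlit_app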