# Getting started with AutoML Tables

To use this Colab notebook, copy it to your own Google Drive and open it with [Colaboratory](https://colab.research.google.com/) (or Colab). To run a cell, hold the Shift key and press the Enter key (or Return key). Colab automatically displays the return value of the last line in each cell. Refer to [this page](https://colab.research.google.com/notebooks/welcome.ipynb) for more information on Colab.

You can run a Colab notebook on a hosted runtime in the Cloud. The hosted VM times out after 90 minutes of inactivity, and you will lose all data stored in memory, including your authentication data. If your session is disconnected (for example, because you closed your laptop) for less than the 90-minute inactivity limit, press 'RECONNECT' in the top right corner of your notebook to resume the session. After a Colab timeout, you'll need to:

1.   Re-run the initialization and authentication.
2.   Continue from where you left off. You may need to copy-paste the value of some variables such as the `dataset_name` from the printed output of the previous cells.

Alternatively you can connect your Colab notebook to a [local runtime](https://research.google.com/colaboratory/local-runtimes.html).



## 1. Project set up





Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to:
* Create a Google Cloud Platform (GCP) project.
* Enable billing.
* Apply to whitelist your project.
* Enable AutoML API.
* Enable AutoML Tables API.
* Create a service account, grant required permissions, and download the service account private key.

You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source:
* Create a GCS bucket.
* Upload the training and batch prediction files.


**Warning:** Private keys must be kept secret. If you expose your private key, revoke it immediately from the Google Cloud Console.



---



## 2. Initialize and authenticate
This section runs initialization and authentication. It creates an authenticated session, which is required for running any of the following sections.

### Install the client library
Run the following cell.

In [0]:
#@title Install AutoML Tables client library { vertical-output: true }
!pip install google-cloud-automl

### Authenticate using service account key
Run the following cell. Click the 'Choose Files' button and select the service account private key file. If your service account key file or folder is hidden, you can reveal it on a Mac by pressing the **Command + Shift + .** key combination.

In [0]:
#@title Authenticate and create a client. { vertical-output: true }

from google.cloud import automl_v1beta1
from google.colab import files  # Provides files.upload() used below.

# Upload service account key
keyfile_upload = files.upload()
keyfile_name = list(keyfile_upload.keys())[0]
# Authenticate and create an AutoML client.
client = automl_v1beta1.AutoMlClient.from_service_account_file(keyfile_name)
# Authenticate and create a prediction service client.
prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(keyfile_name)
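
Before creating the clients, it can help to confirm that the uploaded file really looks like a service account key. The following is a minimal sketch of such a sanity check, not part of the AutoML client library: `check_service_account_key` and `REQUIRED_KEY_FIELDS` are hypothetical names, and the check only inspects JSON fields that service account key files normally contain.

```python
import json

# Fields that a service account key file is expected to contain (assumption:
# this sketch checks only a subset of the fields in a real key file).
REQUIRED_KEY_FIELDS = ('type', 'project_id', 'private_key', 'client_email')


def check_service_account_key(key_text):
    """Return the project ID stored in a service account key.

    Hypothetical helper: raises ValueError if the text is not a JSON object
    that looks like a service account key.
    """
    try:
        key = json.loads(key_text)
    except ValueError as err:
        raise ValueError('Key file is not valid JSON: {}'.format(err))
    if not isinstance(key, dict):
        raise ValueError('Key file must contain a JSON object.')
    missing = [field for field in REQUIRED_KEY_FIELDS if field not in key]
    if missing:
        raise ValueError('Key file is missing fields: {}'.format(missing))
    if key['type'] != 'service_account':
        raise ValueError(
            'Expected type "service_account", got {!r}.'.format(key['type']))
    return key['project_id']
```

In the notebook, one possible call is `check_service_account_key(keyfile_upload[keyfile_name].decode('utf-8'))`, assuming `files.upload()` returned the file contents as bytes.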

### Test

Enter your GCP project ID.

In [0]:
#@title GCP project ID and location

project_id = 'my-project-trial5' #@param {type:'string'}
location = 'us-central1'
location_path = client.location_path(project_id, location)
location_path
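
The value displayed by the cell above is a plain resource path string. As a rough illustration of its shape, the sketch below rebuilds the same string by hand; `build_location_path` is a hypothetical stand-in for `client.location_path` and is shown only to make the format visible.

```python
def build_location_path(project_id, location):
    """Build a 'projects/<project>/locations/<location>' resource path."""
    return 'projects/{}/locations/{}'.format(project_id, location)


print(build_location_path('my-project-trial5', 'us-central1'))
# prints: projects/my-project-trial5/locations/us-central1
```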

To test whether your project set up and authentication steps were successful, run the following cell to list your datasets.

In [0]:
#@title List datasets. { vertical-output: true }

list_datasets_response = client.list_datasets(location_path)
datasets = {
    dataset.display_name: dataset.name for dataset in list_datasets_response}
datasets

You can also print the list of your models by running the following cell.

In [0]:
#@title List models. { vertical-output: true }

list_models_response = client.list_models(location_path)
models = {model.display_name: model.name for model in list_models_response}
models
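
The dictionaries built in the two listing cells keep only one resource per display name, because a later entry with the same key overwrites an earlier one. If display names may repeat in your project (an assumption worth checking), a grouping like the following sketch keeps every match. `group_by_display_name` is a hypothetical helper that works on plain `(display_name, name)` pairs.

```python
from collections import defaultdict


def group_by_display_name(resources):
    """Map each display name to the list of resource names that use it.

    `resources` is any iterable of (display_name, name) pairs.
    """
    grouped = defaultdict(list)
    for display_name, name in resources:
        grouped[display_name].append(name)
    return dict(grouped)
```

In the notebook, this could be called as `group_by_display_name((d.display_name, d.name) for d in client.list_datasets(location_path))`, which is also one way to recover `dataset_name` after a Colab timeout.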



---



## 3. Import training data

### Create dataset

Select a dataset display name and pass your table source information to create a new dataset.

In [0]:
#@title Create dataset { vertical-output: true, output-height: 200 }

dataset_display_name = 'test_deployment' #@param {type: 'string'}

create_dataset_response = client.create_dataset(
    location_path,
    {'display_name': dataset_display_name, 'tables_dataset_metadata': {}})
dataset_name = create_dataset_response.name
create_dataset_response

### Import data

You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the [census_income dataset](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv) 
as your training data. You can create a GCS bucket and upload the data into your bucket. The URI for your file is `gs://BUCKET_NAME/FOLDER_NAME1/FOLDER_NAME2/.../FILE_NAME`. Alternatively, you can create a BigQuery table and upload the data into the table. The URI for your table is `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.
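
The two URI formats described above can be assembled from their parts. The following is a small sketch under the assumption that the parts contain no characters that need escaping; `gcs_uri` and `bq_uri` are hypothetical helper names.

```python
def gcs_uri(bucket_name, *path_parts):
    """Build a gs:// URI from a bucket name and zero or more path segments."""
    return 'gs://' + '/'.join((bucket_name,) + path_parts)


def bq_uri(project_id, dataset_id, table_id):
    """Build a bq:// URI from a project, dataset, and table ID."""
    return 'bq://{}.{}.{}'.format(project_id, dataset_id, table_id)


print(gcs_uri('cloud-ml-data', 'automl-tables', 'notebooks',
              'census_income.csv'))
print(bq_uri('my-project-trial5', 'census_income', 'income_census'))
```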

Importing data may take a few minutes or hours depending on the size of your data. If your Colab times out, run the following command to retrieve your dataset. Replace `dataset_name` with its actual value obtained in the preceding cells.

    dataset = client.get_dataset(dataset_name)

In [0]:
#@title ... if data source is GCS { vertical-output: true }

dataset_gcs_input_uris = ['gs://cloud-ml-data/automl-tables/notebooks/census_income.csv',] #@param
# Define input configuration.
input_config = {
    'gcs_source': {
        'input_uris': dataset_gcs_input_uris
    }
}

In [0]:
#@title ... if data source is BigQuery { vertical-output: true }

dataset_bq_input_uri = 'bq://my-project-trial5.census_income.income_census' #@param {type: 'string'}
# Define input configuration.
input_config = {
    'bigquery_source': {
        'input_uri': dataset_bq_input_uri
    }
}
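
Both cells above assign `input_config`, so whichever one runs last determines the data source. One way to avoid running the wrong cell by accident is to derive the configuration from the URI scheme. The sketch below assumes the same dictionary layout as the two cells above; `make_input_config` is a hypothetical helper.

```python
def make_input_config(uris):
    """Build an input_config dict from gs:// URIs or a single bq:// URI."""
    if isinstance(uris, str):
        uris = [uris]
    uris = list(uris)
    if not uris:
        raise ValueError('At least one URI is required.')
    # GCS sources accept a list of files.
    if all(uri.startswith('gs://') for uri in uris):
        return {'gcs_source': {'input_uris': uris}}
    # BigQuery sources take exactly one table URI.
    if len(uris) == 1 and uris[0].startswith('bq://'):
        return {'bigquery_source': {'input_uri': uris[0]}}
    raise ValueError('Expected gs:// URIs or one bq:// URI: {}'.format(uris))
```

For example, `input_config = make_input_config(dataset_gcs_input_uris)` would reproduce the GCS cell.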

In [0]:
#@title Import data { vertical-output: true }

import_data_response = client.import_data(dataset_name, input_config)
print('Dataset import operation: {}'.format(import_data_response.operation))
# Wait until import is done.
import_data_result = import_data_response.result()
import_data_result

### Review the specs

Run the following command to see table specs such as row count.

In [0]:
#@title Table schema { vertical-output: true }

import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types
import matplotlib.pyplot as plt

# List table specs
list_table_specs_response = client.list_table_specs(dataset_name)
table_specs = [s for s in list_table_specs_response]
# List column specs
table_spec_name = table_specs[0].name
list_column_specs_response = client.list_column_specs(table_spec_name)
column_specs = {s.display_name: s for s in list_column_specs_response}
# Table schema pie chart.
type_counts = {}
for column_spec in column_specs.values():
  type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)
  type_counts[type_name] = type_counts.get(type_name, 0) + 1

plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')
plt.axis('equal')
plt.show()


Run the following command to see column specs such as the inferred schema.
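
A plain-text listing of the inferred schema can complement the pie chart. The following is a minimal sketch that formats `(column_name, type_name)` pairs as an aligned table; `format_schema` is a hypothetical helper and does not call the AutoML API itself.

```python
def format_schema(columns):
    """Return an aligned text table from (column_name, type_name) pairs."""
    columns = list(columns)
    # Width of the first column: the longest name, or the header if longer.
    width = max([len(name) for name, _ in columns] + [len('column')])
    lines = ['{:<{w}}  {}'.format('column', 'type', w=width)]
    for name, type_name in sorted(columns):
        lines.append('{:<{w}}  {}'.format(name, type_name, w=width))
    return '\n'.join(lines)


print(format_schema([('income', 'CATEGORY'), ('age', 'FLOAT64')]))
```

In the notebook, the pairs could come from the `column_specs` dictionary, for example `(name, data_types.TypeCode.Name(spec.data_type.type_code)) for name, spec in column_specs.items()`.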

___

## 4. Update dataset: assign a label column and enable nullable columns

AutoML Tables automatically detects the type of each data column. For example, for the [census_income](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv) dataset, it detects `income` as categorical (because its value is either over or under 50k) and `age` as numerical. Depending on the type of your label column, AutoML Tables runs either a classification or a regression model. If your label column contains only numerical values that represent categories, change your label column type to categorical by updating your schema.

### Update a column: set to nullable

In [0]:
#@title Update a column { vertical-output: true }

update_column_spec_dict = {
    'name': column_specs['income'].name,
    'data_type': {
        'type_code': 'CATEGORY',
        'nullable': False
    }
}
update_column_response = client.update_column_spec(update_column_spec_dict)
update_column_response

**Tip:** You can use `'type_code': 'CATEGORY'` in the preceding `update_column_spec_dict` to convert the column data type from `FLOAT64` to `CATEGORY`.

### Update dataset: assign a label

In [0]:
#@title Update dataset { vertical-output: true }

label_column_name = 'income' #@param {type: 'string'}
label_column_spec = column_specs[label_column_name]
label_column_id = label_column_spec.name.rsplit('/', 1)[-1]
print('Label column ID: {}'.format(label_column_id))
# Define the values of the fields to be updated.
update_dataset_dict = {
    'name': dataset_name,
    'tables_dataset_metadata': {
        'target_column_spec_id': label_column_id
    }
}
update_dataset_response = client.update_dataset(update_dataset_dict)
update_dataset_response
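
The cell above takes the label column ID from the last segment of the column spec's resource name. As a rough illustration, the sketch below applies the same `rsplit` to a made-up resource name; both the `resource_id` helper and the example IDs are hypothetical.

```python
def resource_id(resource_name):
    """Return the last path segment of a slash-separated resource name."""
    return resource_name.rstrip('/').rsplit('/', 1)[-1]


# Made-up resource name, for illustration only.
example_name = ('projects/123/locations/us-central1/datasets/TBL456/'
                'tableSpecs/789/columnSpecs/42')
print(resource_id(example_name))
# prints: 42
```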

___

## 5. Creating a model

### Train a model
Specify the duration of the training. For example, `'train_budget_milli_node_hours': 1000` runs the training for one hour. If your Colab times out, use `client.list_models(location_path)` to check whether your model has been created, then use the model name to continue to the next steps. Run the following command to retrieve your model. Replace `model_name` with its actual value.

    model = client.get_model(model_name)

In [0]:
#@title Create model { vertical-output: true }

model_display_name = 'census_income_model' #@param {type:'string'}

model_dict = {
    'display_name': model_display_name,
    'dataset_id': dataset_name.rsplit('/', 1)[-1],
    'tables_model_metadata': {'train_budget_milli_node_hours': 1000}
}
create_model_response = client.create_model(location_path, model_dict)
print('Create model operation: {}'.format(create_model_response.operation))
# Wait until model training is done.
create_model_result = create_model_response.result()
model_name = create_model_result.name
create_model_result
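
The training budget in the cell above is expressed in thousandths of a node hour, so `1000` corresponds to one node hour. The following sketch converts from node hours; `node_hours_to_milli` is a hypothetical helper, and any limits that the service places on the budget are not checked here.

```python
def node_hours_to_milli(node_hours):
    """Convert a training budget in node hours to milli node hours."""
    if node_hours <= 0:
        raise ValueError('The training budget must be positive.')
    return int(round(node_hours * 1000))


print(node_hours_to_milli(1))
# prints: 1000
```

For example, `{'train_budget_milli_node_hours': node_hours_to_milli(1.5)}` would request a budget of one and a half node hours.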

___

## 6. Make a prediction

There are two different prediction modes: online and batch. The following cells show you how to make an online prediction.

Run the following cell, and then choose the desired test values for your online prediction.

In [0]:
#@title Make an online prediction: set the categorical variables { vertical-output: true }
from IPython.display import display  # Used below to render the dropdowns.
import ipywidgets as widgets

workclass_ids = ['Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']
education_ids = ['Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']
marital_status_ids = ['Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse']
occupation_ids = ['Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces']
relationship_ids = ['Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried']
race_ids = ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']
sex_ids = ['Female', 'Male']
native_country_ids = ['United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands']
workclass = widgets.Dropdown(options=workclass_ids, value=workclass_ids[0],
                           description='workclass:')

education = widgets.Dropdown(options=education_ids, value=education_ids[0],
                           description='education:', width='500px')

marital_status = widgets.Dropdown(options=marital_status_ids, value=marital_status_ids[0],
                           description='marital status:', width='500px')

occupation = widgets.Dropdown(options=occupation_ids, value=occupation_ids[0],
                           description='occupation:', width='500px')

relationship = widgets.Dropdown(options=relationship_ids, value=relationship_ids[0],
                           description='relationship:', width='500px')

race = widgets.Dropdown(options=race_ids, value=race_ids[0],
                           description='race:', width='500px')

sex = widgets.Dropdown(options=sex_ids, value=sex_ids[0],
                           description='sex:', width='500px')

native_country = widgets.Dropdown(options=native_country_ids, value=native_country_ids[0],
                           description='native_country:', width='500px')

display(workclass)
display(education)
display(marital_status)
display(occupation)
display(relationship)
display(race)
display(sex)
display(native_country)


Adjust the sliders on the right to the desired test values for your online prediction.

In [0]:
#@title Make an online prediction: set the numeric variables { vertical-output: true }

age = 34 #@param {type:'slider', min:1, max:100, step:1}
capital_gain = 40000 #@param {type:'slider', min:0, max:100000, step:10000}
capital_loss = 3.8 #@param {type:'slider', min:0, max:4000, step:0.1}
fnlwgt = 150000 #@param {type:'slider', min:0, max:1000000, step:50000}
education_num = 9 #@param {type:'slider', min:1, max:16, step:1}
hours_per_week = 40 #@param {type:'slider', min:1, max:100, step:1}


**IMPORTANT:** Deploy the model, then wait until the model finishes deployment.
Check the [UI](https://console.cloud.google.com/automl-tables?_ga=2.255483016.-1079099924.1550856636) and navigate to the predict tab of your model, and then to the online prediction portion, to see when online deployment finishes before running the prediction cell.

In [0]:
response = client.deploy_model(model_name)
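
Instead of watching the UI, you could poll for readiness from the notebook. The sketch below is a generic polling loop, not an AutoML API: `wait_until` is a hypothetical helper, and the `predicate` you pass in is where a deployment check would go. One possibility is to inspect the model returned by `client.get_model(model_name)`; the `deployment_state` field is an assumption about where that status lives, so check the client library reference first.

```python
import time


def wait_until(predicate, timeout_seconds=1800.0, poll_seconds=30.0):
    """Call `predicate` repeatedly until it returns True or time runs out.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

The defaults of 30 minutes and 30 seconds are arbitrary choices for illustration; adjust them to how long deployment takes in your project.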


Run the prediction only after the model finishes deployment.

In [0]:
payload = {
    'row': {       
        'values': [
            {'number_value': age},
            {'string_value': workclass.value},
            {'number_value': fnlwgt},
            {'string_value': education.value},
            {'number_value': education_num},
            {'string_value': marital_status.value},
            {'string_value': occupation.value},
            {'string_value': relationship.value},
            {'string_value': race.value},
            {'string_value': sex.value},
            {'number_value': capital_gain},
            {'number_value': capital_loss},
            {'number_value': hours_per_week},
            {'string_value': native_country.value}
          ]
    }
}
prediction_client.predict(model_name, payload)
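
The payload above is positional: each entry in `values` lines up with one input column, in the order used by the cell (age, workclass, fnlwgt, and so on). The following sketch builds the same structure from a flat list, choosing `number_value` or `string_value` by Python type. `make_row_payload` is a hypothetical helper, and it assumes that only numbers and strings are needed.

```python
import numbers


def make_row_payload(values):
    """Build a prediction payload from an ordered list of cell values."""
    cells = []
    for value in values:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool):
            raise TypeError('Boolean values are not supported: {!r}'.format(value))
        if isinstance(value, numbers.Number):
            cells.append({'number_value': value})
        elif isinstance(value, str):
            cells.append({'string_value': value})
        else:
            raise TypeError('Unsupported value type: {!r}'.format(value))
    return {'row': {'values': cells}}


print(make_row_payload([34, 'Private', 150000]))
```

With the widgets from the preceding cells, the list would start with `[age, workclass.value, fnlwgt, education.value, ...]` and continue in the same column order as the payload above.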

Undeploy the model

In [0]:
response2 = client.undeploy_model(model_name)

## 7. Batch prediction

### Initialize prediction

Your data source for batch prediction can be GCS or BigQuery. For this tutorial, you can use [census_income_batch_prediction_input.csv](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv) as the input source. Create a GCS bucket and upload the file into your bucket. Some lines in the batch prediction input file intentionally have missing values. AutoML Tables logs the resulting errors in the `errors.csv` file.
Also, open the UI and create the bucket into which your predictions will be written. The bucket's default name here is automl-tables-pred.

**NOTE:** The client library has a bug. If the following cell returns a `TypeError: Could not convert Any to BatchPredictResult` error, ignore it. The batch prediction output file(s) will still be written to the GCS bucket that you set in `batch_predict_gcs_output_uri_prefix`.

In [0]:
#@title Start batch prediction { vertical-output: true, output-height: 200 }

batch_predict_gcs_input_uris = ['gs://cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv',] #@param
batch_predict_gcs_output_uri_prefix = 'gs://automl-tables-pred1' #@param {type:'string'}
#gs://automl-tables-pred
# Define input source.
batch_prediction_input_source = {
  'gcs_source': {
    'input_uris': batch_predict_gcs_input_uris
  }
}
# Define output target.
batch_prediction_output_target = {
    'gcs_destination': {
      'output_uri_prefix': batch_predict_gcs_output_uri_prefix
    }
}
batch_predict_response = prediction_client.batch_predict(
    model_name, batch_prediction_input_source, batch_prediction_output_target)
print('Batch prediction operation: {}'.format(batch_predict_response.operation))
# Wait until batch prediction is done.
batch_predict_result = batch_predict_response.result()
batch_predict_response.metadata
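
If the `TypeError` described in the note above interrupts the cell, one option is to wait for the operation inside a small wrapper that tolerates that error. This is a sketch of a workaround, not an official fix: `result_ignoring_type_error` is a hypothetical helper, and it assumes only that the operation object has a `result()` method, as used in the cell above.

```python
def result_ignoring_type_error(operation):
    """Wait for a long-running operation, tolerating a TypeError on result.

    Returns the operation result, or None if converting the result raised
    a TypeError.
    """
    try:
        return operation.result()
    except TypeError as err:
        print('Ignoring result conversion error: {}'.format(err))
        return None
```

For example, `batch_predict_result = result_ignoring_type_error(batch_predict_response)` would let the rest of the cell run. Other exception types still propagate, so real failures are not hidden.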