#### Set your project ID

**If you don't know your project ID**, you may be able to retrieve it using `gcloud`.

In [None]:
import os

PROJECT_ID = ""

# Get your Google Cloud project ID from gcloud
if not os.getenv("IS_TESTING"):
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)
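If no default project is configured, `gcloud` may return no output, and indexing `shell_output[0]` would then fail. The following is a minimal sketch of a more defensive lookup. The helper name `pick_project_id` is hypothetical, and falling back to the `GOOGLE_CLOUD_PROJECT` environment variable is an assumption about how your environment is set up.

```python
def pick_project_id(shell_output, env):
    """Return the first non-empty line of gcloud output, else an env fallback."""
    for line in shell_output:
        candidate = line.strip()
        if candidate:
            return candidate
    # Assumption: the environment may expose the project ID under this name.
    return env.get("GOOGLE_CLOUD_PROJECT", "")
```

In the cell above, this could be called as `PROJECT_ID = pick_project_id(shell_output, os.environ)`; an empty result means the project ID still has to be set by hand.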

### Create a Cloud Storage bucket


In [None]:
BUCKET_NAME = "gs://[your-bucket-name]"  # @param {type:"string"}
REGION = "[your-region]"  # @param {type:"string"}

In [None]:
if BUCKET_NAME in ("", None, "gs://[your-bucket-name]"):
    # TIMESTAMP is assumed to be defined in an earlier cell of this notebook.
    BUCKET_NAME = "gs://" + PROJECT_ID + "-aip-" + TIMESTAMP
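Before creating the bucket, it can help to check that the generated name looks valid. The sketch below applies a simplified version of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit; no `goog` prefix). The helper `is_valid_bucket_uri` is hypothetical and does not cover every rule, such as the longer limit for dotted names.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters, digits, dashes,
# underscores, and dots, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_valid_bucket_uri(uri):
    """Return True if `uri` is a gs:// URI with a plausible bucket name."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        return False
    name = uri[len(prefix):]
    if name.startswith("goog"):
        return False
    return _BUCKET_NAME_RE.fullmatch(name) is not None
```

A check such as `is_valid_bucket_uri(BUCKET_NAME)` would reject the unedited placeholder `gs://[your-bucket-name]` before `gsutil mb` is run.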

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_NAME

**Finally**, validate access to your Cloud Storage bucket by examining its contents:

In [None]:
! gsutil ls -al $BUCKET_NAME

## Before you begin

Training an ALS model is compute-intensive and can take a long time in a regular notebook environment, so this tutorial uses a Dataproc cluster with a PySpark environment.

### Create a Dataproc cluster with component gateway enabled and JupyterLab extension
<a name="section-5"></a>

Create the cluster using the following `gcloud` command.

In [None]:
CLUSTER_NAME = "[your-cluster-name]"
CLUSTER_REGION = "[your-cluster-region]"
CLUSTER_ZONE = "[your-cluster-zone]"
MACHINE_TYPE = "[your-machine-type]"

In [None]:
! gcloud dataproc clusters create $CLUSTER_NAME \
--enable-component-gateway \
--region $CLUSTER_REGION \
--zone $CLUSTER_ZONE \
--single-node \
--master-machine-type $MACHINE_TYPE \
--master-boot-disk-size 100 \
--image-version 2.0-debian10 \
--optional-components JUPYTER \
--project $PROJECT_ID
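The `gcloud` command fails with a fairly opaque error if any of the variables still holds its placeholder value. As a rough sketch, a small check like the one below could be run first. The helper `unset_placeholders` is hypothetical, and it assumes that placeholders always have the form `[your...]`.

```python
import re

# Matches template values such as "[your-cluster-name]".
_PLACEHOLDER_RE = re.compile(r"\[your[^\]]*\]")


def unset_placeholders(settings):
    """Return the sorted names of settings that are empty or still placeholders."""
    return sorted(
        name
        for name, value in settings.items()
        if not value or _PLACEHOLDER_RE.search(value)
    )
```

For example, passing `{"CLUSTER_NAME": CLUSTER_NAME, "CLUSTER_REGION": CLUSTER_REGION, "CLUSTER_ZONE": CLUSTER_ZONE, "MACHINE_TYPE": MACHINE_TYPE}` returns the names that still need a real value; an empty list means the command is ready to run.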

## Connect to the cluster from the notebook
<a name="section-6"></a>

When the new Dataproc cluster is running, its name appears in the list of kernels that can be selected for this notebook. In the top right corner of this notebook file, click the current kernel name, **Python (local)**, and then select the Python 3 kernel that is running on your Dataproc cluster.

<img src="images/cluster_kernel_selection.png"></img>

Note the following:

- Your Dataproc kernel might take a few minutes to show up in the list of kernels.
- PySpark code in this tutorial can be run on either a PySpark or Python 3 kernel on the Dataproc cluster, but to run the optional code that saves recommendations to a BigQuery table, the Python 3 kernel is recommended.
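One quick way to confirm which kernel is active is to check whether PySpark can be imported. This is only a sketch: the helper `kernel_has_module` is hypothetical, and the check assumes that your local kernel does not have `pyspark` installed while the Dataproc kernel does.

```python
import importlib.util


def kernel_has_module(module_name):
    """Return True if a top-level module can be found in the current kernel."""
    return importlib.util.find_spec(module_name) is not None


if kernel_has_module("pyspark"):
    print("PySpark is available; this looks like the Dataproc kernel.")
else:
    print("PySpark not found; you may still be on the local kernel.")
```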



In [None]:
# The import and the client creation below only need to run once.
# Comment them out on later runs for a speed-up.
from google.cloud.bigquery import Client

client = Client()

# Join order items to inventory items to get one (user, product, status) row per order item.
query = """
WITH user_prod_table AS (
    SELECT a.USER_ID, b.PRODUCT_ID, a.STATUS
    FROM `looker-private-demo.retail.order_items` AS a
    JOIN (
        SELECT ID, PRODUCT_ID
        FROM `looker-private-demo.retail.inventory_items`
    ) AS b
    ON a.inventory_item_id = b.ID
)
SELECT USER_ID, PRODUCT_ID, STATUS FROM user_prod_table
"""
job = client.query(query)
df = job.to_dataframe()
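To see what the query returns without touching BigQuery, the same join can be reproduced on toy data with the standard-library `sqlite3` module. This is an illustrative sketch only: the rows below are made up, the tables contain just the columns the query uses, and SQLite stands in for BigQuery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical toy rows; the real tables live in BigQuery and have more columns.
conn.executescript(
    """
    CREATE TABLE order_items (user_id INTEGER, inventory_item_id INTEGER, status TEXT);
    CREATE TABLE inventory_items (id INTEGER, product_id INTEGER);
    INSERT INTO order_items VALUES
        (1, 10, 'Complete'), (1, 11, 'Returned'), (2, 10, 'Shipped');
    INSERT INTO inventory_items VALUES (10, 100), (11, 101);
    """
)

# Same shape as the tutorial query: map each order item to its product.
toy_query = """
WITH user_prod_table AS (
    SELECT a.user_id, b.product_id, a.status
    FROM order_items AS a
    JOIN (SELECT id, product_id FROM inventory_items) AS b
    ON a.inventory_item_id = b.id
)
SELECT user_id, product_id, status
FROM user_prod_table
ORDER BY user_id, product_id
"""
rows = conn.execute(toy_query).fetchall()
for row in rows:
    print(row)
```

Each output row pairs a user with a product and the status of that order item, which is the shape of the data that `df` holds after the BigQuery job completes.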