<a href="https://colab.research.google.com/github/trevorbehnke/Google-Cloud-AI-Platform-XGBoost-BigQuery/blob/master/Using_XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Make sure XGBoost is installed




In [0]:
!pip3 install xgboost



Import Packages

In [0]:
import pandas as pd
import xgboost as xgb
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from google.cloud import bigquery

Set the Google Cloud Project

In [0]:
!gcloud config set project mlxgboost

Updated property [core/project].


Authenticate Account

In [0]:
!gcloud auth login

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?code_challenge=nFtCcPlLPvHAoJqQKVjYm4UqYqtVxNoTNiUgR2VoXeg&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=32555940559.apps.googleusercontent.com&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth


Enter verification code: 4/xgEDpnD4-sAZ61icXokBiTTkoijisO1D0IB1ZnqW_BrLJ-DWnt1gkM0

You are now logged in as [me@trevorbehnke.com].
Your current project is [mlxgboost].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


Create Service Account

In [0]:
!gcloud iam service-accounts create mlxgboost

Add IAM Policy to Service Account

In [0]:
!gcloud projects add-iam-policy-binding mlxgboost --member "serviceAccount:mlxgboost@mlxgboost.iam.gserviceaccount.com" --role "roles/owner"

Updated IAM policy for project [mlxgboost].
bindings:
- members:
  - serviceAccount:service-454613473835@compute-system.iam.gserviceaccount.com
  role: roles/compute.serviceAgent
- members:
  - serviceAccount:454613473835-compute@developer.gserviceaccount.com
  - serviceAccount:454613473835@cloudservices.gserviceaccount.com
  role: roles/editor
- members:
  - serviceAccount:service-454613473835@cloud-ml.google.com.iam.gserviceaccount.com
  role: roles/ml.serviceAgent
- members:
  - serviceAccount:mlxgboost@mlxgboost.iam.gserviceaccount.com
  - user:me@trevorbehnke.com
  role: roles/owner
etag: BwWgX6Kd9mY=
version: 1


Create Service Account Credentials

In [0]:
!gcloud iam service-accounts keys create mlxgboost.json --iam-account mlxgboost@mlxgboost.iam.gserviceaccount.com

created key [3ba674f249de5bf8859f1edd52e8b273b4fa9d33] of type [json] as [mlxgboost.json] for [mlxgboost@mlxgboost.iam.gserviceaccount.com]


Create Environment Variables for Service Account Credentials

In [0]:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/mlxgboost.json"

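Export the Credentials Path in the Shell (Has No Lasting Effect)

In [0]:
# Each `!` command runs in its own subshell, so this export does not persist.
# The os.environ assignment above is what the BigQuery client actually reads.
!export GOOGLE_APPLICATION_CREDENTIALS="/content/mlxgboost.json"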
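
An environment variable set inside a child process never propagates back to the process that launched it, which is why only the `os.environ` assignment affects the notebook kernel. The sketch below illustrates this with a child Python process standing in for the shell; the variable name `DEMO_CREDENTIALS_PATH` is hypothetical and exists only for this demonstration.

```python
import os
import subprocess
import sys

VAR = "DEMO_CREDENTIALS_PATH"  # hypothetical name, used only for this demo
os.environ.pop(VAR, None)      # start from a clean state

# Set the variable inside a child process, analogous to `!export ...`.
child_code = f"import os; os.environ[{VAR!r}] = '/content/mlxgboost.json'"
subprocess.run([sys.executable, "-c", child_code], check=True)
visible_after_child = VAR in os.environ  # the child's change is not inherited

# Set the variable in the current process, as the os.environ cell does.
os.environ[VAR] = "/content/mlxgboost.json"
visible_after_parent = VAR in os.environ

print(visible_after_child, visible_after_parent)
```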

Ingest Data using BigQuery

In [0]:
query="""
SELECT
  age,
  workclass,
  hours_per_week,
  race,
  sex
FROM
  bigquery-public-data.ml_datasets.census_adult_income
"""
df = bigquery.Client().query(query).to_dataframe()
df.head()

Unnamed: 0,age,workclass,hours_per_week,race,sex
0,72,Private,48,Asian-Pac-Islander,Female
1,70,Private,40,White,Female
2,77,Private,10,Black,Female
3,81,Self-emp-inc,28,White,Female
4,69,Private,40,White,Female


Summary Statistics for the Numeric Columns

In [0]:
df.describe()

Unnamed: 0,age,hours_per_week
count,32561.0,32561.0
mean,38.581647,40.437456
std,13.640433,12.347429
min,17.0,1.0
25%,28.0,40.0
50%,37.0,40.0
75%,48.0,45.0
max,90.0,99.0


Drop Rows With Null Values and Shuffle the Data

In [0]:
df = df.dropna()
df = shuffle(df, random_state=2)

Extract the Label Column Into a Separate Variable and Create a DataFrame With Only the Features (AKA What Do I Want to Predict?)

In [0]:
labels = df['hours_per_week']
data = df.drop(columns=['hours_per_week'])

Preview the New Dataset

In [0]:
data.head()

Unnamed: 0,age,workclass,race,sex
16054,35,Private,White,Female
32382,64,Private,White,Female
10749,29,Federal-gov,White,Female
15377,35,Private,White,Male
29660,56,Self-emp-not-inc,White,Male


Convert Categorical Values to Binary (One-Hot) Columns

In [0]:
dummy = pd.get_dummies(data=data, columns=['workclass', 'race', 'sex'])
dummy.head()

Unnamed: 0,age,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,race_ Amer-Indian-Eskimo,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Female,sex_ Male
16054,35,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0
32382,64,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0
10749,29,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0
15377,35,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1
29660,56,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1
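
A record sent for prediction later has to carry the same columns, in the same order, as the encoded training frame. One way to guarantee that, sketched here on a toy frame (the three rows and the reduced column set are illustrative, not the full census table), is to encode the new record and then reindex it against the training columns.

```python
import pandas as pd

# Toy training frame; a reduced, illustrative stand-in for the census data.
train = pd.DataFrame({
    "age": [35, 29, 56],
    "workclass": ["Private", "Federal-gov", "Self-emp-not-inc"],
    "sex": ["Female", "Female", "Male"],
})
train_encoded = pd.get_dummies(train, columns=["workclass", "sex"], dtype=int)

# A single new record to score later.
new = pd.DataFrame({"age": [29], "workclass": ["Federal-gov"], "sex": ["Female"]})
new_encoded = pd.get_dummies(new, columns=["workclass", "sex"], dtype=int)

# Reindex to the training columns: categories absent from `new` become 0,
# and the column order matches what the model was trained on.
aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
instance = aligned.iloc[0].tolist()

print(list(train_encoded.columns))
print(instance)
```

Encoding the new record on its own would yield only three columns, so the reindex step is what restores the missing categories as zeros.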


Split Data Into Train and Test Sets Using scikit-learn's *train_test_split*

In [0]:
x,y = dummy,labels
x_train,x_test,y_train,y_test = train_test_split(x,y)
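
Called without arguments, `train_test_split` draws a different random split on every run, so results are not reproducible. A minimal sketch on synthetic arrays (the data, the seed `42`, and the explicit `test_size` are assumptions for illustration) shows how fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 samples with 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two calls with the same seed produce the same partition.
a_train, a_test, ya_train, ya_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
b_train, b_test, yb_train, yb_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, b_test))
```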

Create the Model

In [0]:
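model = xgb.XGBRegressor(
    # Squared-error regression. Newer XGBoost releases deprecate the name
    # 'reg:linear' in favor of 'reg:squarederror'.
    objective='reg:linear'
)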

Train the Model

In [0]:
model.fit(x_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

Generate Predictions On the Test Data

In [0]:
y_pred = model.predict(x_test)

View Model Performance on the Test Set

In [0]:
for i in range(20):
    print('Predicted Hours Per Week: ', y_pred[i])
    print('Actual Hours Per Week: ', y_test.iloc[i])
    print()

Predicted Hours Per Week:  38.34334
Actual Hours Per Week:  40

Predicted Hours Per Week:  32.38878
Actual Hours Per Week:  25

Predicted Hours Per Week:  42.821472
Actual Hours Per Week:  40

Predicted Hours Per Week:  31.183512
Actual Hours Per Week:  30

Predicted Hours Per Week:  45.024593
Actual Hours Per Week:  40

Predicted Hours Per Week:  44.731068
Actual Hours Per Week:  55

Predicted Hours Per Week:  28.690216
Actual Hours Per Week:  30

Predicted Hours Per Week:  44.464085
Actual Hours Per Week:  40

Predicted Hours Per Week:  48.645706
Actual Hours Per Week:  60

Predicted Hours Per Week:  42.83525
Actual Hours Per Week:  40

Predicted Hours Per Week:  38.334297
Actual Hours Per Week:  40

Predicted Hours Per Week:  39.351933
Actual Hours Per Week:  40

Predicted Hours Per Week:  48.90317
Actual Hours Per Week:  40

Predicted Hours Per Week:  44.884422
Actual Hours Per Week:  35

Predicted Hours Per Week:  35.457832
Actual Hours Per Week:  25

Predicted Hours Per Week:  38
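
Reading pairs by eye gives a feel for the model, but a single error figure is easier to compare across runs. As a rough sketch, the mean absolute error and root mean squared error can be computed over the first five pairs printed above; in the notebook the same arithmetic would apply to all of `y_pred` and `y_test`.

```python
import math

# First five (predicted, actual) pairs copied from the printed output.
predicted = [38.34334, 32.38878, 42.821472, 31.183512, 45.024593]
actual = [40, 25, 40, 30, 40]

errors = [p - a for p, a in zip(predicted, actual)]
mae = sum(abs(e) for e in errors) / len(errors)             # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean squared error

print(f"MAE:  {mae:.3f} hours")
print(f"RMSE: {rmse:.3f} hours")
```

Five rows are far too few to judge the model; the point is only the calculation.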

Save Model to Local Folder

In [0]:
model.save_model('model.bst')

Define Variables for the Google Cloud Storage Bucket and Model

In [0]:
GCP_PROJECT = 'mlxgboost'
MODEL_BUCKET = 'gs://mlxgboost'
VERSION_NAME = 'v1'
MODEL_NAME = 'hours_per_week'

Create a Google Cloud Storage Bucket for the Model

In [0]:
!gsutil mb $MODEL_BUCKET

Creating gs://mlxgboost/...
ServiceException: 409 Bucket mlxgboost already exists.


Copy the Model to the Google Cloud Storage Bucket

In [0]:
!gsutil cp ./model.bst $MODEL_BUCKET

Copying file://./model.bst [Content-Type=application/octet-stream]...
-
Operation completed over 1 objects/67.4 KiB.                                     


Create the Model on Google Cloud AI Platform

In [0]:
!gcloud ai-platform models create $MODEL_NAME

Created ml engine model [projects/mlxgboost/models/hours_per_week].


Deploy the Model

In [0]:
!gcloud ai-platform versions create $VERSION_NAME \
--model=$MODEL_NAME \
--framework='XGBOOST' \
--runtime-version=1.15 \
--origin=$MODEL_BUCKET \
--python-version=3.7 \
--project=$GCP_PROJECT

Test the Deployed Model by Inputting Some Data
(In this case: A 29 Year Old White Female Who Works For the Federal Government)

In [0]:
%%writefile predictions.json
[29, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

Writing predictions.json
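
The file passed to `--json-instances` holds one JSON instance per line, so several records can be scored in one call. The sketch below writes such a file with a length check against the 17 encoded columns; the helper `write_instances`, the file name `predictions_batch.json`, and the temp-directory location are assumptions for illustration. The second row is taken from the encoded preview table (the 35-year-old male in the private sector).

```python
import json
import os
import tempfile

EXPECTED_FEATURES = 17  # age + 9 workclass + 5 race + 2 sex columns

def write_instances(path, rows):
    """Write one JSON array per line, rejecting rows of the wrong length."""
    for row in rows:
        if len(row) != EXPECTED_FEATURES:
            raise ValueError(
                f"expected {EXPECTED_FEATURES} features, got {len(row)}"
            )
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

instances = [
    [29, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    [35, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
]

# Hypothetical output location, chosen so the demo does not touch the notebook folder.
out_path = os.path.join(tempfile.gettempdir(), "predictions_batch.json")
write_instances(out_path, instances)
print(open(out_path).read())
```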


Print Out Your Model Predictions!

In [0]:
prediction = !gcloud ai-platform predict --model=$MODEL_NAME --json-instances=predictions.json --version=$VERSION_NAME
print(prediction.s)

[40.282257080078125]
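
The printed result is a string, so it needs parsing before it can be used as a number. A small sketch, assuming the output has exactly the JSON-list form shown above (other models or gcloud versions may format it differently):

```python
import json

raw_output = "[40.282257080078125]"  # the string printed above
predictions = json.loads(raw_output)  # parse into a Python list of floats
hours = round(predictions[0], 1)      # round for display

print(f"Predicted hours per week: {hours}")
```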
