# Lab3: Peptide Prediction with BigQuery AutoML (Based on Lab2)

### Supporting research: Covid19 and beyond for vaccine candidates 

BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.
In this tutorial, you use the sample Covid19 dataset for BigQuery.
Comments & Feedback @jigmehta
## Objectives
In this tutorial, you will use BigQuery to explore immunological data and AutoML to automatically generate an ML model for peptide binding. You will also use BQML to explore various ML models and perform feature engineering:
+ The `CREATE MODEL` statement to create a classification model
+ The `ML.EVALUATE` function to evaluate the ML model
+ The `ML.TRANSFORM` feature engineering functions to improve model performance
+ The `ML.PREDICT` function to make predictions using the ML model

# AutoML to explore classification model
Having explored a classification model, we may want to see which other models could make better predictions.
GCP provides the AutoML service, which lets scientists submit their data and explore various ML models.
+ [AutoML models](https://pantheon.corp.google.com/automl-tables/locations/us-central1/datasets/TBL8658818995480166400;modelId=TBL2508683113329065984;task=basic/schemav2?project=covid-19-271622) can be explored through GCP console
+ You can also call [AutoML from BQML](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-automl)


In [89]:
# Read GCP project id from env.
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
GCP_PROJECT_ID=shell_output[0]
print("GCP project ID:" + GCP_PROJECT_ID)

GCP project ID:covid-19-271622



# Leverage AutoML to build the model

Building ML models with BigQuery AutoML is as simple as writing SQL statements, which makes ML modeling accessible even to SQL developers and analysts. We will create a model that predicts whether a given peptide has a strong binding affinity with a certain HLA allele.

The following statement creates a classification model with `model_type='automl_classifier'`, so AutoML searches for a suitable model on our behalf. The feature columns are the allele name (`Allele_Name`), the peptide sequence (`Description`), and `Quantitative_measurement`. The label is `Qualitative_Measure`, which classifies whether a peptide is a good candidate for vaccine testing.

    Note: the query keeps only peptides with a length of 9 or 10 (9-mers and 10-mers). Also, rand() < 0.8 keeps a random sample of roughly 80% of the matching rows for training, so each run trains on a different sample.



In [None]:
%%bigquery --project $GCP_PROJECT_ID
CREATE OR REPLACE MODEL `corona.Classification_model_automl`
OPTIONS
(
model_type='automl_classifier',
input_label_cols=['Qualitative_Measure'],
budget_hours=1.0
)
AS
SELECT
 Qualitative_Measure, Description, Allele_Name, Quantitative_measurement
 FROM
  `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
 WHERE length(Description) IN (9,10)
 AND organism_name like '%coronavirus%'
 AND rand() < 0.8

Executing query with job ID: 24aedef0-30ae-4afe-ae1f-3e0a37f5211f
Query executing: 130.35s
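
Because `rand() < 0.8` draws a fresh sample each time, the training rows are not reproducible between runs. A minimal sketch of one common alternative is shown here in Python: hash a row key into buckets so the same rows always land in the training split. The helpers `is_9_or_10_mer` and `in_training_split` and the sample peptides are hypothetical illustrations, not part of the lab. In BigQuery the same idea is often expressed with `FARM_FINGERPRINT`, which would be a change to the lab's query rather than what it currently does.

```python
import hashlib


def is_9_or_10_mer(peptide: str) -> bool:
    """Mirror the lab's WHERE length(Description) IN (9,10) filter."""
    return len(peptide) in (9, 10)


def in_training_split(key: str, train_fraction: float = 0.8) -> bool:
    """Deterministically assign a row to the training split by hashing its key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < round(train_fraction * 100)


# Hypothetical peptide strings, for illustration only.
peptides = ["ACDEFGHIK", "LMNPQRSTVW", "ACDEFGH", "KLMNPQRSTVWY"]

candidates = [p for p in peptides if is_9_or_10_mer(p)]
train = [p for p in candidates if in_training_split(p)]

print("9/10-mers:", candidates)
print("training split:", train)
```

Unlike a random draw, calling `in_training_split` twice on the same peptide always gives the same answer, so a held-out set stays held out across retraining.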

### Training takes about an hour to run (`budget_hours=1.0`)


# Evaluate your model

After creating your model, you evaluate the performance of the classifier using the ML.EVALUATE function. You can also use the ML.ROC_CURVE function for specific metrics.

A classifier assigns each input one of a set of enumerated target values for a label. For example, in this tutorial you are using a classification model that assigns each peptide one of the qualitative classes for peptide binding.

To run the ML.EVALUATE query that evaluates the model:


In [97]:
%%bigquery --project $GCP_PROJECT_ID
SELECT
  *
FROM ML.EVALUATE(MODEL `corona.Classification_model_automl`)

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.918101,0.814754,0.990113,0.825492,0.082136,0.987235
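
The reported `f1_score` is not the harmonic mean of the reported `precision` and `recall`. That is what you would expect if the metrics are averaged per class across the three binding classes. The sketch below assumes macro-averaging (an assumption about how the aggregate is formed, not something this lab states) and uses hypothetical (actual, predicted) pairs to show how the two quantities diverge.

```python
from typing import Dict, List, Tuple


def macro_metrics(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Macro-averaged precision, recall and F1 from (actual, predicted) pairs."""
    labels = sorted({a for a, _ in pairs} | {p for _, p in pairs})
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1_score": sum(f1s) / n,
        "accuracy": sum(1 for a, p in pairs if a == p) / len(pairs),
    }


# Hypothetical (actual, predicted) pairs, for illustration only.
pairs = (
    [("Negative", "Negative")] * 3
    + [("Negative", "Positive-High")]
    + [("Positive-High", "Positive-High")] * 2
    + [("Positive-Intermediate", "Positive-High")]
    + [("Positive-Intermediate", "Positive-Intermediate")]
)

metrics = macro_metrics(pairs)
harmonic = (
    2 * metrics["precision"] * metrics["recall"]
    / (metrics["precision"] + metrics["recall"])
)
print(metrics)
print("harmonic mean of macro precision and recall:", harmonic)
```

On this toy data the macro F1 is lower than the harmonic mean of the macro precision and recall, the same direction as in the `ML.EVALUATE` output.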


In [98]:
%%bigquery --project $GCP_PROJECT_ID
SELECT roc_auc,
       CASE WHEN roc_auc > .8 THEN 'good'
            WHEN roc_auc > .7 THEN 'fair'
            WHEN roc_auc > .5 THEN 'not great'
            ELSE 'poor' END AS model_quality
FROM ML.EVALUATE(MODEL `corona.Classification_model_automl`)

Unnamed: 0,roc_auc,model_quality
0,0.987235,good
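
The `CASE` expression above is a simple threshold ladder. As a sketch, the same rule can be written as a Python function for use on results pulled into a notebook. The function name `model_quality` is hypothetical, and the cut-offs are the lab's own rule of thumb rather than a standard.

```python
def model_quality(roc_auc: float) -> str:
    """Bucket a ROC AUC value using the same thresholds as the SQL CASE expression."""
    if roc_auc > 0.8:
        return "good"
    if roc_auc > 0.7:
        return "fair"
    if roc_auc > 0.5:
        return "not great"
    return "poor"


# The value reported by ML.EVALUATE in this lab run.
print(model_quality(0.987235))
```

As in SQL, the comparisons are strict and evaluated top to bottom, so a value of exactly 0.8 falls into `fair`.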


# Check the various models explored by AutoML
+ AutoML keeps a [log](https://pantheon.corp.google.com/automl-tables/locations/us-central1/datasets/TBL8658818995480166400;modelId=TBL2508683113329065984;task=basic/train?project=covid-19-271622) of all model configurations it explored
+ [Models](https://pantheon.corp.google.com/logs/query;query=resource.type%3D%22cloudml_job%22%20resource.labels.job_id%3D%22TBL2508683113329065984%22%20resource.labels.project_id%3D%22covid-19-271622%22%20labels.log_type%3D%22automl_tables%22%20jsonPayload.%22@type%22%3D%22type.googleapis.com%2Fgoogle.cloud.automl.master.TablesModelStructure%22?project=covid-19-271622)
+ [Trials](https://pantheon.corp.google.com/logs/query;query=resource.type%3D%22cloudml_job%22%20resource.labels.job_id%3D%22TBL2508683113329065984%22%20resource.labels.project_id%3D%22covid-19-271622%22%20labels.log_type%3D%22automl_tables%22%20jsonPayload.%22@type%22%3D%22type.googleapis.com%2Fgoogle.cloud.automl.master.TuningTrial%22?project=covid-19-271622)


# Run Prediction on BQML Model

Now that you have evaluated your model, the next step is to use it to predict outcomes.

The following example uses the BigQuery ML model to predict the binding class for a small random sample of peptides. Optionally, you can export the model and publish it to Google AI Platform to serve predictions.


In [99]:
%%bigquery --project $GCP_PROJECT_ID
SELECT
  predicted_Qualitative_Measure, predicted_Qualitative_Measure_probs, Qualitative_Measure as original_result
FROM ML.PREDICT(MODEL `corona.Classification_model_automl`, (
  SELECT Qualitative_Measure, Description, Allele_Name, Quantitative_measurement
  FROM `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
  WHERE length(Description) IN (9,10)
  AND organism_name like '%coronavirus%'
  AND rand() < 0.0009))

Unnamed: 0,predicted_Qualitative_Measure,predicted_Qualitative_Measure_probs,original_result
0,Positive-High,"[{'label': 'Negative', 'prob': 0.0002618547005...",Positive-High
1,Positive-Intermediate,"[{'label': 'Negative', 'prob': 6.1814935179427...",Positive-Intermediate
2,Positive-Intermediate,"[{'label': 'Negative', 'prob': 0.0001175400393...",Positive-Intermediate
3,Positive-High,"[{'label': 'Negative', 'prob': 0.0009793316712...",Positive-High
4,Negative,"[{'label': 'Negative', 'prob': 0.5235264301300...",Negative
5,Positive-Intermediate,"[{'label': 'Negative', 'prob': 0.0002878189261...",Positive-Intermediate
6,Positive-High,"[{'label': 'Negative', 'prob': 0.0004914508899...",Positive-High
7,Negative,"[{'label': 'Negative', 'prob': 0.7932363152503...",Negative


The result shows the predicted qualitative class along with its per-class probabilities, which you can compare with the original result. The next step is to operationalize the ML pipeline so that you can efficiently perform data updates and model updates. Check out the AI Pipeline example for peptide prediction to learn more!
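
Each row of `predicted_Qualitative_Measure_probs` is a list of `label`/`prob` entries. A minimal sketch of how such rows could be summarized once loaded into Python follows: pick the most probable label, then measure agreement with the original result. The helper `top_prediction` and all probability values are hypothetical; they are not taken from the truncated output above.

```python
from typing import Any, Dict, List, Tuple


def top_prediction(probs: List[Dict[str, Any]]) -> Tuple[str, float]:
    """Return the (label, probability) pair with the highest probability."""
    best = max(probs, key=lambda entry: entry["prob"])
    return best["label"], best["prob"]


# Hypothetical rows shaped like the ML.PREDICT output, for illustration only.
rows = [
    {"original": "Positive-High", "probs": [
        {"label": "Negative", "prob": 0.05},
        {"label": "Positive-High", "prob": 0.90},
        {"label": "Positive-Intermediate", "prob": 0.05}]},
    {"original": "Negative", "probs": [
        {"label": "Negative", "prob": 0.55},
        {"label": "Positive-High", "prob": 0.15},
        {"label": "Positive-Intermediate", "prob": 0.30}]},
    {"original": "Positive-Intermediate", "probs": [
        {"label": "Negative", "prob": 0.40},
        {"label": "Positive-High", "prob": 0.25},
        {"label": "Positive-Intermediate", "prob": 0.35}]},
]

matches = 0
for row in rows:
    label, confidence = top_prediction(row["probs"])
    matches += label == row["original"]
    print(f"predicted={label} confidence={confidence:.2f} original={row['original']}")

agreement = matches / len(rows)
print(f"agreement with original result: {agreement:.2f}")
```

Looking at the confidence next to the label helps separate clear calls from borderline ones, such as the third hypothetical row where no class reaches a majority.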

### This is the end of Lab3! Congratulations!