### Week 6:

Train a multi-class classification model on AutoML.   

https://towardsdatascience.com/the-best-of-both-worlds-calling-auto-ml-from-bigquery-9dfd433a45d6    
https://cloud.google.com/blog/products/ai-machine-learning/use-automl-tables-from-a-jupyter-notebook 


Using the NOAA dataset again:

* The target column should be the "element" column, filtered to the weather-type codes (i.e. WT**)
* The feature columns should include at least id, date, and the columns from `bigquery-public-data.ghcn_d.ghcnd_stations`

The other tables need to be investigated to see whether they hold any relevant features:
`bigquery-public-data.ghcn_d.ghcnd_countries`
`bigquery-public-data.ghcn_d.ghcnd_inventory`
`bigquery-public-data.ghcn_d.ghcnd_states`
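
One way to start that investigation is to list each table's columns and types. The sketch below only builds the query text. `build_schema_query` and `CANDIDATE_TABLES` are hypothetical names introduced here, and it assumes the dataset's `INFORMATION_SCHEMA.COLUMNS` view is readable from your project.

```python
# Hypothetical helper (not part of the original notebook): builds a query that
# lists the columns of the candidate tables so they can be reviewed as features.
CANDIDATE_TABLES = ["ghcnd_countries", "ghcnd_inventory", "ghcnd_states"]


def build_schema_query(tables, project="bigquery-public-data", dataset="ghcn_d"):
    """Return SQL listing column names and types for the given tables."""
    quoted = ", ".join(f"'{table}'" for table in tables)
    return (
        "SELECT table_name, column_name, data_type\n"
        f"FROM `{project}`.{dataset}.INFORMATION_SCHEMA.COLUMNS\n"
        f"WHERE table_name IN ({quoted})\n"
        "ORDER BY table_name, ordinal_position"
    )


schema_query = build_schema_query(CANDIDATE_TABLES)
print(schema_query)
```

The resulting string can be passed to any BigQuery client, for example `client.query(schema_query).to_dataframe()`.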


This time we are not filtering for a single city (e.g. Chicago), because we want to know whether there are patterns by location.

In [1]:
from app_creds import set_env
set_env()

#from google.cloud import automl
from google.cloud import automl_v1beta1

from google.cloud import bigquery
# Construct a BigQuery client object.
bq_client = bigquery.Client()

## Get the weather-type categorical variables from the weather dataset

https://docs.opendata.aws/noaa-ghcn-pds/readme.html

__WT** = Weather Type where ** has one of the following values:__    

01 = Fog, ice fog, or freezing fog (may include heavy fog)    
02 = Heavy fog or heavy freezing fog (not always distinguished from fog)    
03 = Thunder    
04 = Ice pellets, sleet, snow pellets, or small hail    
05 = Hail (may include small hail)    
06 = Glaze or rime    
07 = Dust, volcanic ash, blowing dust, blowing sand, or blowing obstruction    
08 = Smoke or haze    
09 = Blowing or drifting snow    
10 = Tornado, waterspout, or funnel cloud    
11 = High or damaging winds    
12 = Blowing spray    
13 = Mist    
14 = Drizzle    
15 = Freezing drizzle    
16 = Rain (may include freezing rain, drizzle, and freezing drizzle)    
17 = Freezing rain    
18 = Snow, snow pellets, snow grains, or ice crystals    
19 = Unknown source of precipitation    
21 = Ground fog    
22 = Ice fog or freezing fog   
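
To make the `element` codes readable in later output, the list above can be kept as a lookup table. This is a minimal sketch: `WEATHER_TYPES` and `describe_element` are names made up for illustration, and the descriptions are copied from the list above.

```python
# Lookup table transcribed from the WT** list above (there is no WT20 entry).
WEATHER_TYPES = {
    "WT01": "Fog, ice fog, or freezing fog (may include heavy fog)",
    "WT02": "Heavy fog or heavy freezing fog (not always distinguished from fog)",
    "WT03": "Thunder",
    "WT04": "Ice pellets, sleet, snow pellets, or small hail",
    "WT05": "Hail (may include small hail)",
    "WT06": "Glaze or rime",
    "WT07": "Dust, volcanic ash, blowing dust, blowing sand, or blowing obstruction",
    "WT08": "Smoke or haze",
    "WT09": "Blowing or drifting snow",
    "WT10": "Tornado, waterspout, or funnel cloud",
    "WT11": "High or damaging winds",
    "WT12": "Blowing spray",
    "WT13": "Mist",
    "WT14": "Drizzle",
    "WT15": "Freezing drizzle",
    "WT16": "Rain (may include freezing rain, drizzle, and freezing drizzle)",
    "WT17": "Freezing rain",
    "WT18": "Snow, snow pellets, snow grains, or ice crystals",
    "WT19": "Unknown source of precipitation",
    "WT21": "Ground fog",
    "WT22": "Ice fog or freezing fog",
}


def describe_element(code):
    """Return the description for a WT code, or a placeholder if it is not listed."""
    return WEATHER_TYPES.get(code, f"Unlisted weather type ({code})")


print(describe_element("WT03"))
```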

In [56]:
get_data_query = """
WITH get_counts AS (
SELECT
  yd.id, COUNT(*) num_rows
FROM
  `bigquery-public-data.ghcn_d.ghcnd_2022` as yd
  JOIN `bigquery-public-data.ghcn_d.ghcnd_stations` sd
  ON yd.id = sd.id
WHERE yd.qflag IS NULL
AND yd.element LIKE 'WT%%'
GROUP BY yd.id
HAVING num_rows > 2)
SELECT
  yd.id, yd.date, yd.element,
  sd.name, sd.state, sd.latitude, sd.longitude
FROM
  `bigquery-public-data.ghcn_d.ghcnd_2022` as yd
  JOIN `bigquery-public-data.ghcn_d.ghcnd_stations` sd
  ON yd.id = sd.id
  JOIN get_counts gc ON yd.id = gc.id
WHERE yd.qflag IS NULL
AND yd.element LIKE 'WT%%'
AND yd.element <> 'WT10'
"""
_2022_query_job = bq_client.query(get_data_query)

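# Note: this job is a regular query, not a dry run, so total_bytes_processed is still None
# until the job finishes. For an up-front estimate, submit the query with
# bigquery.QueryJobConfig(dry_run=True) instead.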
print("This query will process {} bytes.".format(_2022_query_job.total_bytes_processed))

weather_and_state_dataframe = _2022_query_job.to_dataframe()

This query will process None bytes.
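
The `None` above appears because the job was submitted as a normal query and had not finished when the value was read. A dry run could look roughly like the sketch below. `estimate_bytes` and `format_bytes` are hypothetical helpers, and the BigQuery import is done lazily so the formatting helper can be used on its own.

```python
def format_bytes(num_bytes):
    """Render a byte count for display; None means the value is not available."""
    if num_bytes is None:
        return "unknown"
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1000 or unit == "TB":
            return f"{size:.2f} {unit}"
        size /= 1000


def estimate_bytes(client, sql):
    """Submit `sql` as a dry run and return the number of bytes it would process."""
    # Imported here so format_bytes works even without the client library installed.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_run_job = client.query(sql, job_config=config)  # returns without running the query
    return dry_run_job.total_bytes_processed


# Example usage, assuming the bq_client and get_data_query defined in this notebook:
# print("This query will process", format_bytes(estimate_bytes(bq_client, get_data_query)))
```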


In [57]:
## Categorical requirements for the label: https://cloud.google.com/automl-tables/docs/prepare
# If it is Categorical, it must have at least 2 and no more than 500 distinct values.
print(weather_and_state_dataframe['element'].unique())
print(weather_and_state_dataframe['element'].value_counts())

['WT11' 'WT01' 'WT08' 'WT02' 'WT09' 'WT06' 'WT07' 'WT03' 'WT05' 'WT04']
WT01    92018
WT03    49498
WT08    28407
WT02    12243
WT06     3445
WT11     2926
WT04     2607
WT05     1571
WT09      788
WT07      118
Name: element, dtype: int64
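
The same requirement can be checked programmatically rather than by eye. The following is a sketch only: `check_label_column` is a hypothetical helper, the 2 to 500 limits come from the comment above, and the minimum of 3 rows per label is an assumption based on the three-way data split.

```python
from collections import Counter


def check_label_column(values, min_classes=2, max_classes=500, min_rows_per_class=3):
    """Return a list of human-readable problems found in a categorical label column."""
    counts = Counter(values)
    problems = []
    if not min_classes <= len(counts) <= max_classes:
        problems.append(
            f"{len(counts)} distinct labels; expected {min_classes} to {max_classes}"
        )
    # Labels too rare to appear in every split (assumed threshold, see above).
    rare = sorted(label for label, n in counts.items() if n < min_rows_per_class)
    if rare:
        problems.append(f"labels with fewer than {min_rows_per_class} rows: {rare}")
    return problems


sample = ["WT01"] * 5 + ["WT03"] * 4 + ["WT07"] * 2
print(check_label_column(sample))
```

With the notebook's data this would be called as `check_label_column(weather_and_state_dataframe['element'])`; an empty list means no problems were found.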


In [58]:
## Make sure there are at least 3 rows for each value - the data gets split into training, validation and test sets

for col in weather_and_state_dataframe.columns:
    if col != 'element':
        print(weather_and_state_dataframe[col].value_counts())

USW00024286    530
USW00027502    483
USW00023273    424
USW00025713    398
USW00023191    385
              ... 
USC00290426      3
USC00427714      3
USC00143665      3
USC00238805      3
USC00205073      3
Name: id, Length: 2803, dtype: int64
2022-01-01    1301
2022-08-21    1290
2022-02-24    1212
2022-08-22    1192
2022-05-21    1158
              ... 
2022-03-01     250
2022-03-27     246
2022-10-20     237
2022-01-11     236
2022-10-21     180
Name: date, Length: 294, dtype: int64
CRESCENT CITY MCNAMARA AP    530
BARROW POST ROGERS AP        483
SANTA MARIA PUBLIC AP        424
ST PAUL ISLAND AP            398
AVALON CATALINA AP           385
                            ... 
FOREST CITY 2 NNE              3
GROVELAND 2                    3
LOGAN 5 SW EXP FARM            3
LEE VINING                     3
HIGH PT                        3
Name: name, Length: 2743, dtype: int64
TX    12165
CA    11641
FL    10299
PA     7119
MI     6672
AK     6241
OH     5833
WI     5797
IL     56

In [59]:
target_column = 'element'

In [70]:
create_model_query = f"""
CREATE OR REPLACE MODEL `msds-434-robords-oct.weather_prediction.automl_weather_classes`
OPTIONS(MODEL_TYPE = 'automl_classifier', budget_hours=2.0, INPUT_LABEL_COLS=['{target_column}'])
AS 
{get_data_query}
"""

print(create_model_query)


CREATE OR REPLACE MODEL `msds-434-robords-oct.weather_prediction.automl_weather_classes`
OPTIONS(MODEL_TYPE = 'automl_classifier', budget_hours=2.0, INPUT_LABEL_COLS=['element'])
AS 

WITH get_counts AS (
SELECT
  yd.id, COUNT(*) num_rows
FROM
  `bigquery-public-data.ghcn_d.ghcnd_2022` as yd
  JOIN `bigquery-public-data.ghcn_d.ghcnd_stations` sd
  ON yd.id = sd.id
WHERE yd.qflag IS NULL
AND yd.element LIKE 'WT%%'
GROUP BY yd.id
HAVING num_rows > 2)
SELECT
  yd.id, yd.date, yd.element,
  sd.name, sd.state, sd.latitude, sd.longitude
FROM
  `bigquery-public-data.ghcn_d.ghcnd_2022` as yd
  JOIN `bigquery-public-data.ghcn_d.ghcnd_stations` sd
  ON yd.id = sd.id
  JOIN get_counts gc ON yd.id = gc.id
WHERE yd.qflag IS NULL
AND yd.element LIKE 'WT%%'
AND yd.element <> 'WT10'
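
Building the statement through a small function makes it easier to catch a bad option before a job is submitted. This is a sketch under assumptions: `build_create_model_sql` is a hypothetical helper, and the 1.0 to 72.0 range for `budget_hours` is my recollection of the BigQuery ML documentation for AutoML models, so check it against the current docs.

```python
def build_create_model_sql(model_path, label_column, select_sql, budget_hours=1.0):
    """Return a CREATE OR REPLACE MODEL statement for an AutoML classifier."""
    # Assumed documented range for AutoML training budgets; verify before relying on it.
    if not 1.0 <= budget_hours <= 72.0:
        raise ValueError(
            f"budget_hours must be between 1.0 and 72.0, got {budget_hours}"
        )
    return (
        f"CREATE OR REPLACE MODEL `{model_path}`\n"
        "OPTIONS(MODEL_TYPE = 'automl_classifier', "
        f"budget_hours={budget_hours}, INPUT_LABEL_COLS=['{label_column}'])\n"
        "AS\n"
        f"{select_sql.strip()}\n"
    )


example_sql = build_create_model_sql(
    "my-project.my_dataset.my_model",  # placeholder path, not the notebook's model
    "element",
    "SELECT 1 AS feature, 'WT01' AS element",
    budget_hours=2.0,
)
print(example_sql)
```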




In [71]:
create_model = bq_client.query(create_model_query)
create_model

QueryJob<project=msds-434-robords-oct, location=US, id=386c70d9-830c-48a1-9312-a0574c808797>

In [96]:
# Get the current state of the job. We could write a "while ..." loop here, but that would mean
# continually querying BigQuery and we might get charged for it. So, run this cell as needed.

get_query_status = f"""
SELECT
job_type, state, start_time, end_time, query, total_bytes_billed/1000000000 as gigabytes_billed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  job_id = '{create_model.job_id}'
"""

query_state = bq_client.query(get_query_status)
query_state.to_dataframe()

| | job_type | state | start_time | end_time | query | gigabytes_billed |
|---|---|---|---|---|---|---|
| 0 | QUERY | RUNNING | 2022-10-28 21:51:31.527000+00:00 | NaT | \nCREATE OR REPLACE MODEL `msds-434-robords-oc... | |
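
If waiting in a loop is preferred, one option is to poll the job's metadata with an increasing delay instead of re-running a SQL query each time. The sketch below is generic: `wait_for_state` is a hypothetical helper, and the state source is passed in as a function, so nothing here talks to BigQuery by itself.

```python
import time


def wait_for_state(get_state, done_states=("DONE",), initial_delay=30.0,
                   max_delay=600.0, timeout=3 * 3600, sleep=time.sleep):
    """Call get_state() with exponential backoff until it returns a done state."""
    waited = 0.0
    delay = initial_delay
    while True:
        state = get_state()
        if state in done_states:
            return state
        if waited >= timeout:
            raise TimeoutError(f"still in state {state!r} after {waited} seconds")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # back off, capped at max_delay


# Example usage, assuming the bq_client and create_model job from this notebook.
# get_job reads job metadata rather than running a new SQL query:
# wait_for_state(lambda: bq_client.get_job(create_model.job_id, location="US").state)
```

Whether metadata lookups are billed differently from queries depends on current BigQuery pricing, so treat that as something to verify.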


In [69]:
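# Block until the training job finishes; result() raises if the job failed.
# The 404 below names dataset `weather_predictions`, while the CREATE MODEL statement
# above uses `weather_prediction`; the name in the statement must match an existing dataset.
results = create_model.result()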
results

NotFound: 404 Not found: Dataset msds-434-robords-oct:weather_predictions was not found in location US

Location: US
Job ID: 33d2e8f2-0f16-4fa5-959a-e3266b8621e9


## Get Training Info

In [None]:
model_training_info = """
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `msds-434-robords-oct.weather_prediction.automl_weather_classes`)
"""

automl_model_training = bq_client.query(model_training_info)
train_info = automl_model_training.to_dataframe()

In [None]:
train_info

## Make Predictions

Pass two manual prediction rows to BigQuery.

In [None]:
predictions_query = """
SELECT * FROM ML.PREDICT(MODEL `msds-434-robords-oct.weather_prediction.automl_weather_classes`,(
  -- Literal types should match the training columns: FLOAT64 latitude/longitude and a DATE date
  SELECT 
  'USW00027502' AS id, 
  71.2833 AS latitude,
  -156.7814 AS longitude,
  'BARROW POST ROGERS AP' AS name,
  'AK' AS state,
  DATE '2022-10-21' AS date
  UNION ALL
  SELECT 
  'USW00027502' AS id, 
  71.2833 AS latitude,
  -156.7814 AS longitude,
  'BARROW POST ROGERS AP' AS name,
  'AK' AS state,
  DATE '2022-11-01' AS date
))
"""

automl_model_predictions = bq_client.query(predictions_query)
predictions_info = automl_model_predictions.to_dataframe()

In [None]:
predictions_info
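
When more than a couple of rows are needed, the inner `SELECT ... UNION ALL` block can be generated from Python values. This is a rough sketch: `sql_literal` and `build_rows_sql` are hypothetical helpers that handle only strings, numbers, and dates, and for untrusted input query parameters would be the safer choice.

```python
from datetime import date


def sql_literal(value):
    """Render a Python value as a BigQuery SQL literal (str, int, float, date only)."""
    if isinstance(value, date):
        return f"DATE '{value.isoformat()}'"
    if isinstance(value, bool):
        raise TypeError("bool is not supported in this sketch")
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def build_rows_sql(rows):
    """Turn a list of dicts into 'SELECT ... UNION ALL SELECT ...' text."""
    selects = [
        "  SELECT " + ", ".join(f"{sql_literal(v)} AS {k}" for k, v in row.items())
        for row in rows
    ]
    return "\n  UNION ALL\n".join(selects)


rows = [
    {"id": "USW00027502", "latitude": 71.2833, "longitude": -156.7814,
     "name": "BARROW POST ROGERS AP", "state": "AK", "date": date(2022, 10, 21)},
    {"id": "USW00027502", "latitude": 71.2833, "longitude": -156.7814,
     "name": "BARROW POST ROGERS AP", "state": "AK", "date": date(2022, 11, 1)},
]
print(build_rows_sql(rows))
```

The generated text would replace the hand-written rows inside the `ML.PREDICT(MODEL ..., ( ... ))` call.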