# California Housing Model Inferencing Endpoint
This notebook walks through creating a simple machine learning model using the California Housing dataset, registering it with 
the Model Registry, and creating a publicly accessible model inferencing endpoint. It then
demonstrates how to get programmatic access to the model inferencing endpoint from outside
of Snowflake using a Programmatic Access Token (PAT).

## 1. Setup Snowflake
Before we proceed, go to the "Packages" pull-down, enter `scikit-learn` in the "Find Packages" textbox, and select `scikit-learn`. Do the same with `snowflake-ml-python`. Lastly, do the same with `numpy` and make sure to choose the latest version _before_ 2.0 (e.g., `1.26.4`). Then click "Save", which will also restart the Notebook.

First we create a database, schema, and role for use in this example.

In [None]:
USE ROLE accountadmin;
CREATE ROLE IF NOT EXISTS ml_role;
GRANT ROLE ml_role TO ROLE ACCOUNTADMIN;
CREATE DATABASE IF NOT EXISTS api;
CREATE SCHEMA IF NOT EXISTS api.ml;
GRANT ALL ON DATABASE api TO ROLE ml_role;
GRANT ALL ON SCHEMA api.ml TO ROLE ml_role;

Next, let's create a compute pool for our service, and grant usage permissions to our `ML_ROLE` role. 
We also grant the `ML_ROLE` role the permission to create services with public endpoints.

In [None]:
USE ROLE accountadmin;
CREATE COMPUTE POOL IF NOT EXISTS pool_api 
 MIN_NODES = 1
 MAX_NODES = 1
 INSTANCE_FAMILY = CPU_X64_XS;
GRANT ALL ON COMPUTE POOL pool_api TO ROLE ml_role;
GRANT BIND SERVICE ENDPOINT ON ACCOUNT TO ROLE ml_role;

Since we are going to be using Snowpark Container Services to host the inferencing endpoint, we
will need an `IMAGE REPOSITORY` to store the model image. We create that using the `ML_ROLE` role.

In [None]:
USE ROLE ml_role;
CREATE IMAGE REPOSITORY IF NOT EXISTS api.ml.repo_ml;

## 2. Create the Model

Now we turn our attention to the actual machine learning model. 

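For illustrative purposes, we create a multiple linear regression model based on the California Housing
dataset included in scikit-learn. This dataset contains information collected from the 1990 California census
and is used to predict median house values in California districts. The cell below assumes the data has already
been loaded into a Snowflake table named `CALIFORNIA_HOUSING_RAW` whose target column is `MEDHOUSEVALUE`.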

In [None]:
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from snowflake.snowpark.context import get_active_session

# Get the Snowpark session that the Notebook is already running in
session = get_active_session()

# Query the table
df_snowflake = session.table("CALIFORNIA_HOUSING_RAW")

# Convert to Pandas DataFrame
pandas_df = df_snowflake.to_pandas()

# Display first few rows
print(pandas_df.head())



# Assuming pandas_df is your full DataFrame loaded from Snowflake
housing_X = pandas_df.drop(columns=['MEDHOUSEVALUE'])
housing_y = pandas_df['MEDHOUSEVALUE']

# Optional: Display shapes to verify
print(f"Features shape: {housing_X.shape}")
print(f"Target shape: {housing_y.shape}")


# Split into training (80%) and test (20%) sets
housing_X_train, housing_X_test, housing_y_train, housing_y_test = train_test_split(
    housing_X, housing_y, test_size=0.2, random_state=42)

# Scale the features: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
housing_X_train_scaled = scaler.fit_transform(housing_X_train)
housing_X_test_scaled = scaler.transform(housing_X_test)


# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(housing_X_train_scaled, housing_y_train)

# Print model performance
train_score = regr.score(housing_X_train_scaled, housing_y_train)
test_score = regr.score(housing_X_test_scaled, housing_y_test)
print(f"\nModel Performance:")
print(f"Training R² Score: {train_score:.4f}")
print(f"Testing R² Score: {test_score:.4f}")
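
The `score()` method of a scikit-learn regressor reports the coefficient of determination, R². As a rough illustration of what that number measures, here is a minimal pure-Python sketch. The helper name `r2_score_manual` and the sample values are made up for illustration and are not part of the notebook.

```python
def r2_score_manual(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: how far predictions are from the actual values
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the actual values are from their own mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined here when y_true is constant")
    return 1.0 - ss_res / ss_tot

# Hypothetical values, not taken from the housing table
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(f"R² = {r2_score_manual(actual, predicted):.4f}")
```

A score of 1.0 means the predictions match the actual values exactly, while a model that always predicts the mean of the actual values scores 0.0.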

Now that we have created our `regr` linear regression model, let's test it by calling the `predict()` function directly with some sample data from our test set.

In [None]:
# Test the model with the first 3 samples from the test set
sample_predictions = regr.predict(housing_X_test_scaled[:3])

# Actual target values for the same 3 rows
actual_values = housing_y_test.iloc[:3].to_numpy()
print(sample_predictions)
print(actual_values)
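
Two printed arrays can be hard to compare by eye. The following sketch pairs each prediction with its actual value and reports the absolute error per sample, plus the mean. The function name and the numbers are hypothetical stand-ins for `sample_predictions` and `actual_values`.

```python
def absolute_errors(predictions, actuals):
    """Pair each prediction with its actual value and return |prediction - actual|."""
    if len(predictions) != len(actuals):
        raise ValueError("predictions and actuals must be the same length")
    return [abs(p - a) for p, a in zip(predictions, actuals)]

# Hypothetical stand-ins for sample_predictions and actual_values
preds = [0.75, 1.5, 2.5]
actuals = [0.5, 1.0, 5.0]

errors = absolute_errors(preds, actuals)
mae = sum(errors) / len(errors)
for p, a, e in zip(preds, actuals, errors):
    print(f"predicted={p:.2f}  actual={a:.2f}  abs_error={e:.2f}")
print(f"mean absolute error: {mae:.2f}")
```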

## 3. Register the Model

Now we can turn our attention to registering our scikit-learn model in the Snowflake Model Registry.

First, we get the active Snowpark session.

In [None]:
from snowflake.snowpark.context import get_active_session
session = get_active_session()

Next, we create a Snowflake ML Registry object using the Snowpark session. We provide the database (`API`) and the schema (`ML`).

In [None]:
from snowflake.ml.registry import Registry

session.use_schema('API.ML')
session.use_role('ML_ROLE')
reg = Registry(session=session, database_name="API", schema_name="ML")

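Next, we register the model with the Model Registry. Because `regr` was trained on scaled features, we wrap the scaler and the regression in a scikit-learn `Pipeline`, fit it on the unscaled training data, and register the pipeline rather than the bare `regr` model. We provide a name for the model (`linreg_housing`), a version name (`v1`), and an optional comment. We need to list the Anaconda dependencies for this model. We also provide some sample input data so that the schema of the data can be inferred. Lastly, we provide some options to limit warnings.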

We then show the models in the Model Registry.

In [None]:
# Note: We need to create a model that includes the scaler as well
# For simplicity, we'll create a pipeline that includes both scaling and regression
from sklearn.pipeline import Pipeline

# Create a pipeline with scaling and regression
housing_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', linear_model.LinearRegression())
])

# Fit the pipeline on the original (unscaled) training data
housing_pipeline.fit(housing_X_train, housing_y_train)

# Register the pipeline model
mv = reg.log_model(housing_pipeline,
                   model_name="linreg_housing",
                   version_name="v1",
                   conda_dependencies=["scikit-learn", "numpy"],
                   comment="California Housing Linear Regression with Scaling",
                   options={"relax_version": True},
                   sample_input_data=housing_X_test[:10])

reg.show_models()
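
Registering the pipeline means the scaling statistics learned from the training data travel with the model, so callers can send raw feature values. As a rough illustration of what the scaling stage does, here is a pure-Python sketch. `ColumnStandardizer` is a hypothetical toy class, not scikit-learn's implementation, and the rows are made-up data.

```python
import math

class ColumnStandardizer:
    """Toy scaling step: learn per-column mean and std, then reuse them on new rows."""

    def fit(self, rows):
        n = len(rows)
        columns = list(zip(*rows))
        self.means = [sum(col) / n for col in columns]
        # Population standard deviation (divide by n); fall back to 1.0 for constant columns
        self.stds = [
            math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, self.means)
        ]
        return self

    def transform(self, rows):
        # Apply the statistics learned in fit(); nothing is re-estimated here
        return [
            [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]

train_rows = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]  # hypothetical training data
new_rows = [[3.0, 10.0]]                              # hypothetical row seen at inference time

toy_scaler = ColumnStandardizer().fit(train_rows)
print(toy_scaler.transform(new_rows))
```

The new row is scaled with the training means and standard deviations, which is the same division of labor as `fit_transform` on the training set and `transform` on the test set.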

Now that we have the model registered, we will create a `SERVICE` in Snowpark Container Services (SPCS) to host the inferencing endpoint. We provide a service name (`linreg_housing_svc`), a compute pool to use (`pool_api`, which we created earlier), and an image repository to hold the image (`repo_ml`, which we created earlier). Lastly, we indicate that the service should expose the model inferencing endpoint publicly.

In [None]:
# Deploy the model to SPCS
mv.create_service(
    service_name="linreg_housing_svc",
    service_compute_pool="pool_api",
    image_repo="API.ML.REPO_ML",
    ingress_enabled=True)


## Accessing the Model Inferencing Endpoint

We want to set up a separate user and role to access the model inferencing endpoint, as opposed to using the role that created the service.

First, we create a new `ML_SCORING_ROLE` role and grant it access to the `API` database and `ML` schema.

In [None]:
USE ROLE ACCOUNTADMIN;
CREATE ROLE IF NOT EXISTS ml_scoring_role;
GRANT ROLE ml_scoring_role TO ROLE accountadmin;
GRANT USAGE ON DATABASE api TO ROLE ml_scoring_role;
GRANT USAGE ON SCHEMA api.ml TO ROLE ml_scoring_role;

Next, we create a user that we can use externally to access the endpoint. This user (`ML_SCORING_USER`) is granted the `ML_SCORING_ROLE` role.

In [None]:
USE ROLE ACCOUNTADMIN;
CREATE USER IF NOT EXISTS ml_scoring_user PASSWORD='User123' DEFAULT_ROLE = ml_scoring_role
 DEFAULT_SECONDARY_ROLES = ('ALL') MUST_CHANGE_PASSWORD = FALSE;
GRANT ROLE ml_scoring_role TO USER ml_scoring_user;

Next, we create a Programmatic Access Token (PAT) that we can use to programmatically access the model inferencing endpoint from outside of Snowflake. 

In order to create a PAT, the user must have a network policy, so we create a network policy that allows access from any source IP address. In practice, this network policy should be set as narrowly as possible. Then, we assign that network policy to our user.

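Then, we create a PAT for the `ML_SCORING_USER` user. We will need this token to access the endpoint from outside Snowflake. The token secret is returned in the `token_secret` column of this cell's output, which the example below reads.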

In [None]:
USE ROLE ACCOUNTADMIN;
CREATE NETWORK POLICY IF NOT EXISTS api_np ALLOWED_IP_LIST = ('0.0.0.0/0');
ALTER USER ml_scoring_user SET NETWORK_POLICY = api_np;
ALTER USER IF EXISTS ml_scoring_user ADD PROGRAMMATIC ACCESS TOKEN ml_scoring_token_house;
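
The `0.0.0.0/0` entry matches every IPv4 address, which is why it should be narrowed in practice. The sketch below uses Python's standard `ipaddress` module to preview which client addresses a candidate `ALLOWED_IP_LIST` would admit. The function name and the sample addresses are hypothetical, and the code only mimics CIDR matching; it does not reproduce how Snowflake evaluates network policies.

```python
import ipaddress

def is_allowed(client_ip, allowed_ip_list):
    """Return True if client_ip falls inside any CIDR block in allowed_ip_list."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in allowed_ip_list
    )

open_policy = ["0.0.0.0/0"]         # what the cell above uses
narrow_policy = ["203.0.113.0/24"]  # hypothetical range from a documentation block

for ip in ["203.0.113.25", "198.51.100.7"]:
    print(ip, is_allowed(ip, open_policy), is_allowed(ip, narrow_policy))
```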

We now grant access to the public endpoint to the `ML_SCORING_ROLE`.

In [None]:
GRANT SERVICE ROLE api.ml.linreg_housing_svc!all_endpoints_usage TO ROLE ml_scoring_role;

Lastly, we need the actual hostname for the endpoint, which appears in the `ingress_url` column of the output.

In [None]:
SHOW ENDPOINTS IN SERVICE api.ml.linreg_housing_svc;
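
The `ingress_url` value is a bare hostname, so the caller has to add the scheme and the method path. The sketch below assembles the request URL and the authorization header in the same form this notebook's example uses. The helper names, the placeholder hostname, and the `SNOWFLAKE_PAT` environment variable are assumptions for illustration.

```python
import os

def build_scoring_url(ingress_url, method="predict"):
    """Turn the bare ingress hostname into the HTTPS URL for a model method."""
    host = ingress_url.strip().rstrip("/")
    return f"https://{host}/{method}"

def build_headers(pat):
    """Authorization header in the form the notebook's example uses for PAT access."""
    if not pat:
        raise ValueError("a programmatic access token is required")
    return {"Authorization": f'Snowflake Token="{pat}"'}

# Placeholder hostname; SNOWFLAKE_PAT is a hypothetical environment variable name
url = build_scoring_url("abc123-myorg-myaccount.snowflakecomputing.app")
headers = build_headers(os.environ.get("SNOWFLAKE_PAT", "<your-token>"))
print(url)
```

Reading the token from the environment keeps the secret out of source files; the header itself is deliberately not printed.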

## Test the Endpoint

Let's test our deployed model by making a prediction on some sample data from our test set.

## Example

Now we show an example of calling the endpoint from Python with the `requests` library so you can incorporate it into your code.

Run the following cell to see the code you can use to access the endpoint programmatically in Python. It reads the PAT and the ingress URL from the output of the earlier SQL cells (referenced as `cell12` and `cell13`) and fills them into the sample code below.

In [None]:
import json

import streamlit as st
from snowflake.snowpark.context import get_active_session
session = get_active_session()

# cell12 and cell13 are the outputs of the PAT and SHOW ENDPOINTS cells above
pat = cell12.to_pandas().iloc[0].to_dict()['token_secret']
scoring_endpoint = f"https://{cell13.to_pandas().iloc[0].to_dict()['ingress_url']}/predict"

# Get sample data for the example
sample_data = housing_X_test.iloc[0].tolist()  # First test sample
sample_json = json.dumps({"data": [[0] + sample_data]}, indent=2)

st.markdown(f"""
```python
import requests
import json
pat = "{pat}"
scoring_endpoint = "{scoring_endpoint}"

# Set up the headers to use
headers = {{'Authorization': f'Snowflake Token="{pat}"'}}

# Sample California Housing data (8 features):
# [MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude]
sample_payload = {sample_json}

# Each call to the endpoint looks like this:
resp = requests.post(url=scoring_endpoint, 
                     headers=headers, 
                     json=sample_payload)

# Do something with the scores (house price predictions).
# Assumes the target is in units of $100,000, as in the scikit-learn dataset.
scores = resp.json()
print(f"Predicted house price: ${{scores['data'][0][1]['output_feature_0'] * 100_000:,.0f}}")
```
""")

st.markdown("### California Housing Dataset Features:")
st.markdown("""
- **MedInc**: median income in block group
- **HouseAge**: median house age in block group
- **AveRooms**: average number of rooms per household
- **AveBedrms**: average number of bedrooms per household
- **Population**: block group population
- **AveOccup**: average number of household members
- **Latitude**: block group latitude
- **Longitude**: block group longitude
""")

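The request body wraps each input row in a list whose first element is a row index, and the response in the example pairs that index with a dictionary of outputs. The sketch below shows helpers for building a multi-row payload and reading the predictions back. The helper names are hypothetical, the feature rows are made up, and the response is hand-written in the shape the example indexes (`scores['data'][0][1]['output_feature_0']`), not real model output.

```python
import json

def build_payload(rows):
    """Prefix each feature row with its row index, as in {"data": [[0, f1, ..., f8]]}."""
    return {"data": [[i] + list(row) for i, row in enumerate(rows)]}

def extract_predictions(response_json, key="output_feature_0"):
    """Return {row_index: prediction} from a response shaped like the example's."""
    return {row[0]: row[1][key] for row in response_json["data"]}

# Hypothetical feature rows, in the feature order listed above
rows = [
    [8.3, 41.0, 7.0, 1.0, 322.0, 2.6, 37.88, -122.23],
    [3.6, 20.0, 5.1, 1.1, 1500.0, 3.0, 34.05, -118.24],
]
print(json.dumps(build_payload(rows)))

# Hand-written response used only to exercise the parser
fake_response = {"data": [[0, {"output_feature_0": 4.1}], [1, {"output_feature_0": 2.0}]]}
print(extract_predictions(fake_response))
```

Keying the predictions by row index makes it straightforward to join them back to the rows that were sent.
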
## Cleanup
If you are finished with this example, you can now delete the scoring service, the model, the user, and the scoring role.

In [None]:
USE ROLE accountadmin;
ALTER SERVICE api.ml.linreg_housing_svc SUSPEND;
DROP SERVICE api.ml.linreg_housing_svc;
DROP USER ml_scoring_user;
DROP ROLE ml_scoring_role;
DROP MODEL api.ml.linreg_housing;

You can drop the following resources as well. If any of them are used for other purposes (e.g., other services run on the compute pool we created), comment out (or delete) the corresponding lines.

In [None]:
USE ROLE accountadmin;
DROP IMAGE REPOSITORY api.ml.repo_ml;
DROP COMPUTE POOL pool_api;
DROP ROLE ml_role;
DROP SCHEMA api.ml;