# AI/ML for Drug Discovery: Snowflake ML Hands-On Lab


In this notebook, we will go through the basics of using Notebooks Container Runtime for data analysis and machine learning. We will install OSS packages, explore the ChEMBL dataset, create Streamlit visualizations, and train a simple model that predicts the activity of compounds against a disease of interest.

Features Highlighted:\
:bulb: Snowflake Notebooks \
:bulb: Streamlit in Notebooks\
:bulb: SnowflakeML APIs\
:bulb: Snowflake Pandas API\
:bulb: Feature Store\
:bulb: Model Registry and Serving

### Python Packages

The Container Runtime for Snowflake Notebooks comes with common packages pre-installed, including Snowpark ML and other OSS packages.

In [None]:
!pip freeze

Notebooks Container Runtime, together with External Access Integrations, gives us the flexibility to pip install packages from anywhere, including popular package repositories such as PyPI. You can install whatever packages you need by running `!pip install <package_name>` directly in the notebook.

We have configured this notebook to allow PyPI URLs with an External Access Integration.
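
Because a freshly installed package may not be importable until the kernel restarts, it can help to check which packages are visible before running the import cell. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the names passed in are the import names this notebook uses later (Biopython is imported as `Bio`).

```python
from importlib.util import find_spec

def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]

# Use import names, not pip names: biopython is imported as "Bio".
print(missing_packages(["rdkit", "Bio", "modin"]))
```

An empty list suggests the installs are visible to the current kernel; any name that is printed likely needs an install or a kernel restart.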

In [None]:
!pip install rdkit
!pip install biopython
!pip install "snowflake-snowpark-python[modin]"
# You will need to restart the kernel after installing

In [None]:
import warnings
warnings.filterwarnings("ignore")

#snowpark packages 
import snowflake.snowpark.types as T
import snowflake.snowpark.functions as F

#Scikit learn 
import sklearn as sl
#rdkit
import rdkit
#biopython
import Bio

# Data Science Libs
import numpy as np
import pandas as pd
# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()
# Add a query tag to the session. This helps with debugging and performance monitoring.
session.query_tag = {"origin":"sf_DEFAULT_DATABASE", "name":"aiml_notebooks_container_runtime", "version":{"major":1, "minor":0}, "attributes":{"is_hol":1, "source":"notebook"}}
# Set session context 
session.use_role("attendee_role") 

# Print the current role, warehouse, and database/schema
print(f"role: {session.get_current_role()} | WH: {session.get_current_warehouse()} | DB.SCHEMA: {session.get_fully_qualified_current_schema()}")
     


## Part 1: Explore Chembl29 using SQL & Python

Let's see how to work seamlessly in both SQL and Python in a single notebook and leverage the RDKit and Biopython libraries we installed.

In [None]:
select * from DEFAULT_DATABASE.CHEMBL29.target_dictionary limit 20;

In [None]:
---Retrieve target ChEMBL_ID, target_name, target_type, protein accessions and sequences for all protein targets:
SELECT t.chembl_id AS target_chembl_id,
t.pref_name        AS target_name,
t.organism,
t.target_type,
c.accession        AS protein_id,
c.sequence         AS protein_sequence
FROM  DEFAULT_DATABASE.CHEMBL29.target_dictionary t
  JOIN  DEFAULT_DATABASE.CHEMBL29.target_type tt ON t.target_type = tt.target_type
  JOIN  DEFAULT_DATABASE.CHEMBL29.target_components tc ON t.tid = tc.tid
  JOIN  DEFAULT_DATABASE.CHEMBL29.component_sequences c ON tc.component_id = c.component_id
WHERE tt.parent_type  = 'PROTEIN';

Now let's say we want a reusable function that calculates properties from the amino acid sequences we see in the above dataframe. This is where we can leverage Snowpark Python UDFs.

Although Biopython is available via Anaconda, we want our UDF to use the latest Biopython package, the same one we installed in this environment.

We can now do this with Snowflake's default Artifact Repository:
https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-packages

Access to this repository typically needs to be granted by ACCOUNTADMIN.
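
To make the idea concrete before bringing in Biopython, here is a minimal pure-Python sketch of one property such a function can return: the percentage of each amino acid in a sequence, with basic validation. The function name `amino_acid_percentages` and the short test sequence are hypothetical and only illustrate the calculation.

```python
from collections import Counter

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acid letters

def amino_acid_percentages(sequence):
    """Return {residue: percent} for a protein sequence, or None if empty or invalid."""
    if not sequence:
        return None
    seq = sequence.upper()
    if not set(seq) <= VALID_AA:
        return None
    counts = Counter(seq)
    return {aa: counts.get(aa, 0) / len(seq) * 100 for aa in sorted(VALID_AA)}

result = amino_acid_percentages("MKKA")  # made-up four-residue sequence
print(result["K"])  # 50.0
```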

In [None]:
# Leverage the Biopython library to calculate some protein properties
# based on its amino acid sequence

import snowflake.snowpark.types as T
import snowflake.snowpark.functions as F

@F.udf(input_types=[T.StringType()],return_type= T.VariantType(), stage_location="@DEFAULT_DATABASE.NOTEBOOKS.NOTEBOOK_1",is_permanent=True,
       name="protein_analysis_udf",replace=True,artifact_repository="snowflake.snowpark.pypi_shared_repository", artifact_repository_packages=["biopython"])
def udf(sequence):
    from Bio.Seq import Seq
    from Bio.SeqUtils.ProtParam import ProteinAnalysis
    from Bio.SeqUtils import molecular_weight
    from Bio.SeqUtils import IsoelectricPoint
    from Bio.SeqRecord import SeqRecord

    valid_protein_letters = set("ACDEFGHIKLMNPQRSTVWY")

    if not sequence:
        # NULL or empty input: nothing to analyze (also avoids dividing by a zero length)
        return None
    elif not all(char.upper() in valid_protein_letters for char in sequence):
        return {"sequence": sequence, "error": "Invalid protein sequence"}
    else:
        seq_record = SeqRecord(Seq(sequence), id="seq", annotations={"molecule_type": "protein"})
        # ProteinAnalysis takes the sequence string; its optional second argument is a monoisotopic flag
        protein = ProteinAnalysis(str(seq_record.seq).upper())
        aa_count = protein.count_amino_acids()
        mw = molecular_weight(seq_record.seq, seq_record.annotations["molecule_type"])
        pI = IsoelectricPoint.IsoelectricPoint(seq_record.seq).pi()
        aa_percent = {aa: count/len(seq_record.seq)*100 for aa, count in aa_count.items()}
        return {"sequence": sequence, "length": len(sequence), "molecular_weight": mw, "PI": pI, "amino_acid_perc": aa_percent}

In [None]:
import streamlit as st
import snowflake.snowpark.types as T
import snowflake.snowpark.functions as F

# Now let's run this Python UDF against the results of the SQL cell above by referencing that cell by name (retrieve_target)
seq_df = retrieve_target.to_df()
seq_result_df=seq_df.select("target_name","target_chembl_id","protein_sequence","organism",
                            F.call_udf("DEFAULT_DATABASE.NOTEBOOKS.PROTEIN_ANALYSIS_UDF",seq_df["PROTEIN_SEQUENCE"]).alias("udf_result")).collect()

seq_result_df[:20]                           

## Streamlit in Notebooks
Let's combine this function with some Streamlit-based visualizations directly in the notebook.

In [None]:
import json  # needed for json.loads on the UDF result below
import pandas as pd
import altair as alt

# Create a dictionary to map display names (TARGET_NAME - TARGET_CHEMBL_ID) to the full Row object
protein_options = {f"{item.TARGET_NAME} - {item.TARGET_CHEMBL_ID}": item for item in seq_result_df}

# Get the list of display names for the selectbox
protein_display_names = list(protein_options.keys())

# Create a Streamlit selectbox
selected_display_name = st.selectbox("Select a Protein", protein_display_names)

# Retrieve the full Row object based on the selected display name
selected_data = protein_options.get(selected_display_name)

if selected_data:
    # Parse the UDF_RESULT JSON
    udf_result = json.loads(selected_data.UDF_RESULT)

    # Extract amino acid percentages
    amino_acid_percentages = udf_result.get("amino_acid_perc", {})

    if amino_acid_percentages:
        # Convert the dictionary to a Pandas DataFrame for easier charting
        aa_df = pd.DataFrame(list(amino_acid_percentages.items()), columns=['Amino Acid', 'Percentage'])

        # Create the Altair bar chart
        chart = alt.Chart(aa_df).mark_bar().encode(
            x=alt.X('Amino Acid', sort=None),  # sort=None keeps the order of the input data
            y='Percentage:Q',
            tooltip=['Amino Acid', 'Percentage']
        ).properties(
            title=f"Amino Acid Percentage for {selected_data.TARGET_NAME} ({selected_data.TARGET_CHEMBL_ID})"
        )

        # Display the chart in Streamlit
        st.altair_chart(chart, use_container_width=True)
    else:
        st.warning("Amino acid percentage data not found in the UDF_RESULT.")
else:
    st.warning("Selected protein data not found.")


In [None]:
-- Query compounds that have been tested and have activity data.
SELECT distinct 
    m.chembl_id AS compound_chembl_id,   
    s.canonical_smiles,   
    psa,full_mwt,
    t.pref_name AS target_name,
    t.chembl_id AS target_chembl_id
    FROM DEFAULT_DATABASE.CHEMBL29.compound_structures s left join 
    DEFAULT_DATABASE.CHEMBL29.molecule_dictionary m on s.molregno = m.molregno
    join DEFAULT_DATABASE.CHEMBL29.compound_records r on m.molregno  = r.molregno 
    join DEFAULT_DATABASE.CHEMBL29.docs d on  r.doc_id  = d.doc_id  
    join DEFAULT_DATABASE.CHEMBL29.activities act on r.record_id   = act.record_id 
    join DEFAULT_DATABASE.CHEMBL29.assays a on act.assay_id     = a.assay_id   
    join DEFAULT_DATABASE.CHEMBL29.target_dictionary t on a.tid            = t.tid 
    join DEFAULT_DATABASE.CHEMBL29.compound_properties c on c.molregno=act.molregno
    WHERE  standard_relation = '=' AND
    standard_type = 'IC50' AND
    standard_units = 'nM' AND
    psa IS NOT NULL AND psa> 0 and
    full_mwt IS NOT NULL and full_mwt>0 and standard_value >0 
    order by target_name
    limit 200



Now let's use the RDKit library to calculate the Lipinski values (molecular descriptors) of a compound when the user provides a SMILES string. To learn more about the Lipinski values, check out this resource: http://dev.drugbank.com/guides/terms/lipinski-s-rule-of-five
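
The rule itself is a set of four threshold checks. As a minimal sketch, assuming the descriptor values have already been computed, the violation flags can be derived as follows. The function name `rule_of_five_violations` and the example values are hypothetical and not taken from ChEMBL.

```python
def rule_of_five_violations(mol_wt, logp, h_donors, h_acceptors):
    """Flag each Rule of Five criterion that a compound violates."""
    return {
        "MW_violation": mol_wt > 500,        # molecular weight <= 500
        "logP_violation": logp > 5,          # logP <= 5
        "HBD_violation": h_donors > 5,       # H-bond donors <= 5
        "HBA_violation": h_acceptors > 10,   # H-bond acceptors <= 10
    }

# Illustrative descriptor values for a made-up compound.
flags = rule_of_five_violations(mol_wt=350.4, logp=2.1, h_donors=2, h_acceptors=6)
print(sum(flags.values()))  # 0
```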

In [None]:
import snowflake.snowpark as snowpark
from snowflake.snowpark.types import StructField, StructType, StringType, FloatType, VariantType
import snowflake.snowpark.functions as F

@F.udf(input_types=[StringType()],return_type=VariantType(),stage_location="@NOTEBOOK_1",is_permanent=True,
       name="lipinski_udf", replace=True,artifact_repository="snowflake.snowpark.pypi_shared_repository", artifact_repository_packages=["rdkit"])
def lipinski_udf(smiles:str) -> dict :
    from rdkit import Chem
    from rdkit.Chem import Descriptors, Lipinski
    # Calculates Lipinski descriptors based on the "Rule of 5" given a SMILES string input. 
    # Molecular Weight <= 500
    # LogP <= 5
    # H-Bond Donor Count <= 5
    # H-Bond Acceptor Count <= 10
    # Parameters:
    # smiles (str): A SMILES string representing a molecule.
        
    # Returns:
    # dict: A dictionary containing the Lipinski descriptors calculated for the molecule.  

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None when it cannot parse the SMILES string
        return None
    num_h_donors = Chem.rdMolDescriptors.CalcNumHBD(mol)
    num_h_acceptors = Chem.rdMolDescriptors.CalcNumHBA(mol)
    mol_wt = Chem.rdMolDescriptors.CalcExactMolWt(mol)
    
    lipinski_desc = {
        'MW': mol_wt,
        'HBD': num_h_donors,
        'HBA': num_h_acceptors,
        'logP': Chem.Crippen.MolLogP(mol)
    }
    
    # Calculate Lipinski's Rule of 5 violations
    lipinski_violations = {
        'MW': mol_wt > 500,
        'HBD': num_h_donors > 5,
        'HBA': num_h_acceptors > 10,
        'logP': Chem.Crippen.MolLogP(mol) > 5
    }
    
    # Add Lipinski's Rule of 5 violation flags to descriptor dictionary
    for desc, value in lipinski_violations.items():
        lipinski_desc[desc + '_violation'] = value
    
    return lipinski_desc



In [None]:
# Take one of the SMILES strings from the output of the compound query above and enter it here; it is used as a variable when we call the UDF below.
smiles_input = st.text_input('Enter a SMILES string:')

In [None]:
SELECT '{{smiles_input}}' as smiles, lipinski:MW, lipinski:HBD, lipinski:HBA, lipinski:logP,
               lipinski:MW_violation, lipinski:HBD_violation, lipinski:HBA_violation, lipinski:logP_violation
        FROM (
            SELECT DEFAULT_DATABASE.NOTEBOOKS.LIPINSKI_UDF('{{smiles_input}}') AS lipinski
        ) as lipinski_desc
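
One caveat when templating a value into a quoted SQL literal: some SMILES strings contain a backslash (used for double-bond stereochemistry), and Snowflake treats a backslash as an escape character inside single-quoted string constants. The following is a minimal sketch of escaping a value before building such a literal, assuming standard Snowflake single-quoted string rules. The helper name `sql_string_literal` is hypothetical, and bind parameters are generally a safer alternative where they are available.

```python
def sql_string_literal(value):
    """Wrap a Python string as a single-quoted SQL literal, escaping \\ and '."""
    escaped = value.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"

# A SMILES string with a stereo bond written using a backslash.
print(sql_string_literal(r"F/C=C\F"))  # 'F/C=C\\F'
```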


In [None]:

# Include rdkit
# Use the smiles input variable to calculate the lipinski values and draw the structure using rdkit rdMolDraw2D
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

from snowflake.snowpark.context import get_active_session
session = get_active_session()

try:
    mol = Chem.MolFromSmiles(smiles_input)
    if not smiles_input or mol is None:
        st.warning('Invalid SMILES')
    else:
        st.success('Valid SMILES')
except Exception as e:
    st.warning('Error occurred: {}'.format(e))

col1,col2= st.columns(2)
with col1:
    query4 = """SELECT lipinski:MW as Molecular_Weight, lipinski:HBD as HBD, lipinski:HBA as HBA, lipinski:logP as LogP,
       lipinski:MW_violation , lipinski:HBD_violation, lipinski:HBA_violation, lipinski:logP_violation
FROM (
    SELECT DEFAULT_DATABASE.NOTEBOOKS.lipinski_udf('{}') AS lipinski
) as lipinski_desc""".format(smiles_input)
    results = session.sql(query4).collect()
    results_dict=results[0].as_dict()
    def replace_bool_with_emoji(value):
        return '✅' if value.lower() == 'false' else '⛔'
    def display_dict_as_table(dictionary):
        lipinski_keys = [key for key in dictionary if key.startswith('LIPINSKI:')]
        lipinski_violations = {key: dictionary.pop(key) for key in lipinski_keys}

        data = {
        'Lipinski Feature': list(dictionary.keys()),
        'Value': list(dictionary.values()),
        "Lipinski's Rule of Five": [replace_bool_with_emoji(str(lipinski_violations.get('LIPINSKI:MW_VIOLATION', ''))),
                               replace_bool_with_emoji(str(lipinski_violations.get('LIPINSKI:HBD_VIOLATION', ''))),
                               replace_bool_with_emoji(str(lipinski_violations.get('LIPINSKI:HBA_VIOLATION', ''))),
                               replace_bool_with_emoji(str(lipinski_violations.get('LIPINSKI:LOGP_VIOLATION', '')))
                               ]
        }

        df = pd.DataFrame(data)
        return df

    st.dataframe(display_dict_as_table(results_dict), use_container_width=True)
    st.info("To learn more about the Lipinski's Rule of Five: https://revive.gardp.org/resource/lipinskis-rule-of-5/?cf=encyclopaedia", icon="ℹ️")
with col2:
    # Create a drawing window with RDKit
    drawer = rdMolDraw2D.MolDraw2DCairo(400, 400)
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    
    # Display the drawing in Streamlit
    img = drawer.GetDrawingText()
    st.image(img, output_format='PNG')


## Part 2: Snowpark ML with CHEMBL29

This portion of the lab walks through key Snowflake ML features. The goal is to train a model that helps researchers and chemists classify whether a drug compound, based on its structure (SMILES), will be effective against a specific disease. For example, we can train a model specific to heart disease, Alzheimer's disease, prostate cancer, or arthritis.

In [None]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as mp
import seaborn as sb
import math

#import shap
from datetime import datetime
import streamlit as st
import rdkit

# Snowpark ML
from snowflake.ml.registry import Registry
from entities import search_algorithm
#Snowflake feature store
from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, CreationMode

# Snowpark session
from snowflake.snowpark import DataFrame
from snowflake.snowpark.types import IntegerType
import snowflake.snowpark.functions as F
from snowflake.snowpark import Window

#setup snowpark session
from snowflake.snowpark.context import get_active_session
session = get_active_session()
session

Let's start by gathering the relevant training data from the ChEMBL data. We want chemical compounds with their SMILES structures and the associated targets they have been tested against. A target can be a protein or enzyme that has been identified as having a significant impact on the progression of a disease.

In [None]:
-- Create a view that brings together each compound, its related properties, and its associated target information from the ChEMBL tables
create or replace view DEFAULT_DATABASE.CHEMBL29.compound_target_features as ( 
SELECT DISTINCT
    md.chembl_id AS compound_chembl_id,
    cs.canonical_smiles as smiles,
    td.tid as targetid,
    td.pref_name AS target_name,
    a.pchembl_value as pchembl_value,
    tt.target_type,
    td.chembl_id as target_chembl_id,
    cp.full_mwt as mwt,
    cp.hba_lipinski as hba,
    cp.hbd_lipinski as hbd,
    cp.alogp as logp,
    cp.num_lipinski_ro5_violations as num_lipinski_violations
    
FROM
    DEFAULT_DATABASE.CHEMBL29.molecule_dictionary AS md
    JOIN DEFAULT_DATABASE.CHEMBL29.compound_structures cs ON md.molregno = cs.molregno
    JOIN DEFAULT_DATABASE.CHEMBL29.activities AS a ON md.molregno = a.molregno
    JOIN DEFAULT_DATABASE.CHEMBL29.assays AS asy ON a.assay_id = asy.assay_id
    JOIN DEFAULT_DATABASE.CHEMBL29.target_dictionary AS td ON asy.tid = td.tid
    JOIN DEFAULT_DATABASE.CHEMBL29.target_type AS tt ON td.target_type = tt.target_type
    JOIN DEFAULT_DATABASE.CHEMBL29.compound_properties as cp on cp.molregno=md.molregno
WHERE
    a.standard_type IN ('IC50', 'EC50', 'Ki', 'Kd', 'Potency')
    AND a.standard_relation IN ('=', '<', '>', '<=', '>=', '~')
    AND a.standard_value IS NOT NULL );

select * from DEFAULT_DATABASE.CHEMBL29.compound_target_features limit 5;

## Feature Engineering with Snowpark APIs

In [None]:
# Define the targets associated with a disease of interest
# Let's select compounds for the disease-associated targets in ChEMBL
disease='heart_disease'
target_id = ['CHEMBL3311','CHEMBL1942','CHEMBL1916', 'CHEMBL1867','CHEMBL233'] 

# To register a new disease model, select from the options below.
# Refer to the Streamlit app to view the current list of models.
# ['CHEMBL2487','CHEMBL220','CHEMBL1914','CHEMBL2015','CHEMBL1904','CHEMBL2094124', 'CHEMBL4036','CHEMBL1972'] alzheimers_disease
# ['CHEMBL2364155','CHEMBL3553','CHEMBL2085','CHEMBL230','CHEMBL1825','CHEMBL6111'] rheumatoid_arthritis
# ['CHEMBL3311','CHEMBL1942','CHEMBL1916', 'CHEMBL1867','CHEMBL233'] heart_disease
# ['CHEMBL1871','CHEMBL2527','CHEMBL2034','CHEMBL2597','CHEMBL2052032','CHEMBL1855'] prostate_cancer
# ['CHEMBL6152','CHEMBL2179','CHEMBL1075104','CHEMBL217','CHEMBL2039','CHEMBL234'] parkinson_disease

# get the bioactivity data for the target, remove any nulls 
chembl_features=session.table('DEFAULT_DATABASE.CHEMBL29.compound_target_features').select('SMILES','PCHEMBL_VALUE', 
 'HBA', 'HBD', 'LOGP', 'NUM_LIPINSKI_VIOLATIONS', 'MWT').filter(F.col('TARGET_CHEMBL_ID').isin(target_id)).na.drop().dropDuplicates()


print(f"{chembl_features.count()} compounds associated with these targets.")

In [None]:
# Create the feature store client. We can pass in an existing schema name, or a new schema will be created on initialization
fs = FeatureStore(
session=session,
database="DEFAULT_DATABASE",
name="NOTEBOOKS",
default_warehouse="DEFAULT_WH",
creation_mode=CreationMode.CREATE_IF_NOT_EXIST,
)

In [None]:
#Create entity and register to feature store
smiles_entity = Entity(name="SMILES", join_keys=["SMILES"])

fs.register_entity(smiles_entity)

fs.list_entities().show()

In [None]:
#register morgan fingerprint udf 
import snowflake.snowpark.types as T
import snowflake.snowpark.functions as F

@F.udf(input_types=[T.StringType(), T.IntegerType()],return_type= T.ArrayType(), 
       stage_location="@DEFAULT_DATABASE.NOTEBOOKS.NOTEBOOK_1",is_permanent=True,name="MORGAN_FGP_BIT_UDF",
       replace=True, artifact_repository="snowflake.snowpark.pypi_shared_repository", artifact_repository_packages=["rdkit", "numpy"])
def udf(smiles, bit):
    from rdkit import Chem
    from rdkit.Chem import AllChem
    import numpy as np
    
    mol = Chem.MolFromSmiles(smiles)
    # GetMorganFingerprintAsBitVect() creates the fingerprint as a bit vector,
    # meaning the resulting vector is composed of 0s and 1s.
    # A 1 represents the presence of a certain molecular substructure, while a 0
    # represents its absence.
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=bit)
    # Convert to a plain Python list so it maps cleanly to a Snowflake ARRAY
    return np.array(fp).tolist()
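
To illustrate only the idea of a fixed-length bit vector in which a 1 marks the presence of a hashed feature, here is a toy sketch in pure Python. It is not RDKit's Morgan algorithm: the feature labels are hypothetical stand-ins for atom environments, and SHA-256 is used purely as a convenient deterministic hash.

```python
import hashlib

def fold_features_to_bits(features, n_bits=32):
    """Hash each feature string to one position in a fixed-length 0/1 vector."""
    bits = [0] * n_bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % n_bits
        bits[position] = 1  # collisions simply leave the bit set
    return bits

# Hypothetical substructure labels standing in for atom environments.
vector = fold_features_to_bits(["C-N", "C=O", "c1ccccc1", "S(=O)(=O)N"], n_bits=32)
print(len(vector), sum(vector))
```

A smaller `n_bits` makes collisions between different features more likely, which is one reason the bit length is worth recording alongside a trained model.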

In [None]:
#test the function was registered and can be called
bit=32
df = session.sql("select morgan_fgp_bit_udf('CC(=O)Nc1ccc(cc1)S(=O)(=O)N', {});".format(bit))
print(df.collect())

In [None]:
#call the UDF on dataframe
#drop duplicates, cast data types, and rename columns
#result should be 9445 for heart disease
fingerprint=chembl_features.select(F.call_udf("morgan_fgp_bit_udf", F.col("smiles"), bit).alias("fingerprint"), "smiles", 
                                   chembl_features.col('HBA').cast("Float").alias("HBA"), 
                                   chembl_features.col('HBD').cast("Float").alias("HBD"), 
                                   chembl_features.col("LOGP").cast("Float").alias("LOGP"), 
                                   chembl_features.col('NUM_LIPINSKI_VIOLATIONS').cast("Float").alias("NUM_LIPINSKI_VIOLATIONS"), 
                                   chembl_features.col("MWT").cast("Float").alias("MWT"), "PCHEMBL_VALUE" ).dropDuplicates()
print(fingerprint.count())

In [None]:
#use flatten function to extract the fingerprint value at each index
flattened=fingerprint.select("smiles", 'HBA', 'HBD',"LOGP", "NUM_LIPINSKI_VIOLATIONS", "MWT" ,"PCHEMBL_VALUE", 
                             F.flatten(fingerprint["fingerprint"], outer=True))
array_vals=flattened.select( "smiles",'HBA', 'HBD',"LOGP", "NUM_LIPINSKI_VIOLATIONS", "MWT", "PCHEMBL_VALUE",
                            flattened["value"].as_("fingerprint_number"), F.concat(F.lit('INDEX_'),
                            flattened["index"]).as_("index"))
st.dataframe(array_vals.limit(20))
#array_vals.groupBy(["index", "smiles"]).agg(F.count(array_vals['fingerprint_number'])).show()

In [None]:
# Pivot each index into its own column, named by the index position
fdf=array_vals.pivot("index", ["INDEX_{}".format(i) for i in range(bit)]).sum("fingerprint_number").sort(array_vals["smiles"])
cols_dict={fdf["'INDEX_{}'".format(idx)]: "index_{}".format(idx) for idx in range(bit)}
finaldf=fdf.rename(cols_dict)
st.dataframe(finaldf.sort(F.col("mwt"), ascending=False).limit(10))

# Number of compounds related to the disease; confirm it is still 9445 after restructuring the dataframe (no duplicate rows created)
print(finaldf.count())
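
The flatten and pivot steps above reshape the data from one row per (compound, bit position) into one row per compound with one column per bit. A minimal pure-Python sketch of the same reshaping follows. It is simplified to key on SMILES only, whereas the notebook's pivot also carries the descriptor columns. The function name `pivot_bits` and the 4-bit values are hypothetical.

```python
from collections import defaultdict

def pivot_bits(long_rows, n_bits):
    """Turn (smiles, index, bit) rows into one wide row per SMILES."""
    wide = defaultdict(lambda: {f"index_{i}": 0 for i in range(n_bits)})
    for smiles, index, bit in long_rows:
        wide[smiles][f"index_{index}"] += bit  # mirrors pivot(...).sum(...)
    return dict(wide)

# Long format, as produced by flattening a made-up 4-bit fingerprint for two compounds.
long_rows = [
    ("CCO", 0, 1), ("CCO", 1, 0), ("CCO", 2, 0), ("CCO", 3, 1),
    ("CCN", 0, 0), ("CCN", 1, 1), ("CCN", 2, 0), ("CCN", 3, 0),
]
wide = pivot_bits(long_rows, n_bits=4)
print(len(wide))    # 2
print(wide["CCO"])  # {'index_0': 1, 'index_1': 0, 'index_2': 0, 'index_3': 1}
```

The row count after pivoting equals the number of distinct compounds, which is why the notebook checks that the count is unchanged.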


Now we can use OSS libraries on Snowflake data through the Snowpark pandas API: https://docs.snowflake.com/en/developer-guide/snowpark/python/pandas-on-snowflake

In [None]:
from sklearn.preprocessing import Binarizer
from sklearn.ensemble import RandomForestClassifier
import modin.pandas as mpd
import snowflake.snowpark.modin.plugin

# Create a binary classification label for each compound based on an activity threshold.
# The pChEMBL field in the ChEMBL database represents the potency or affinity of compounds on a negative logarithmic scale. It lets you compare different measurements (IC50, XC50, EC50, AC50, Ki, Kd, or Potency) in a standardized way.
# Snowpark ML also provides a Binarizer (https://docs.snowflake.com/en/developer-guide/snowpark-ml/reference/latest/api/modeling/snowflake.ml.modeling.preprocessing.Binarizer); here we use scikit-learn's Binarizer on a Snowpark pandas dataframe.
chembl_final_df = finaldf.to_snowpark_pandas()

# Use scikit-learn's Binarizer: values above the threshold become 1, all others 0
binarizer = Binarizer(threshold=6.0)

# Reshape the 'PCHEMBL_VALUE' column as scikit-learn transformers expect a 2D array
pchembl_values = chembl_final_df[['PCHEMBL_VALUE']]

# Apply the binarizer and add the result as a new column 'ACTIVE'
chembl_final_df['ACTIVE'] = binarizer.fit_transform(pchembl_values)

# Display the first 10 rows of the updated Pandas DataFrame
st.markdown("#### DataFrame after binarization:")
st.write(chembl_final_df.head(10))

# print the number of active compounds based on threshold
st.markdown(f"Number of active compounds (pchembl_value > 6.0): {chembl_final_df['ACTIVE'].sum()}")
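
For intuition about the 6.0 threshold: a pChEMBL value is the negative base-10 logarithm of the molar potency, so 6.0 corresponds to 1 µM (1000 nM). A minimal sketch of that conversion and of the same strictly-greater-than rule that a binarizer with threshold 6.0 applies is given here. The helper names and the example potencies are hypothetical, and the input is assumed to be in nM.

```python
import math

def pchembl_from_nm(value_nm):
    """Convert a potency in nM (for example an IC50) to -log10 of its molar value."""
    # value in M = value_nm * 1e-9, so -log10(M) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)

def is_active(pchembl_value, threshold=6.0):
    """Label a compound 1 only when its pChEMBL value is strictly above the threshold."""
    return 1 if pchembl_value > threshold else 0

for ic50_nm in (10, 1000, 50000):
    p = pchembl_from_nm(ic50_nm)
    print(ic50_nm, round(p, 2), is_active(p))
```

Lower potency values in nM give higher pChEMBL values, so more potent compounds end up labeled active.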


Let's try creating a Feature View with this final dataframe.

In [None]:
# Define the feature view

# FeatureView needs a Snowpark dataframe, so write the Snowpark pandas dataframe to a table and read it back
chembl_final_df.to_snowflake( "chembl_features", table_type= "transient",if_exists='replace')
chembl_df = session.table("chembl_features")


my_fv = FeatureView(
name=f"{disease}_{bit}_fv",
entities=[smiles_entity],
feature_df=chembl_df,
refresh_freq=None,
desc="heart disease feature view"
#Optional param timestamp_col="TS",
#optional param refresh_freq="1 minute",
)

#register
my_fv = fs.register_feature_view(
feature_view=my_fv,
version="V1",
overwrite=True
)

In [None]:
# Discover feature views

fs.list_feature_views()

In [None]:
# Create a link to the Feature Store UI to inspect the newly created feature view
org_name = session.sql('SELECT CURRENT_ORGANIZATION_NAME()').collect()[0][0]
account_name = session.sql('SELECT CURRENT_ACCOUNT_NAME()').collect()[0][0]

st.write(f'https://app.snowflake.com/{org_name}/{account_name}/#/features/database/DEFAULT_DATABASE/store/NOTEBOOKS')

In [None]:
# Split into training and testing sets
# First convert the Snowpark pandas dataframe back to a Snowpark dataframe
chembl_df=mpd.to_snowpark(chembl_final_df)
features_train_df, features_test_df = chembl_df.random_split(weights=[0.80, 0.20], seed=0)


#Training and Test row count
print(features_train_df.count(),features_test_df.count())

In [None]:
# Prepare our X and y values for the random forest classifier
features_train_x=features_train_df.drop("ACTIVE", "PCHEMBL_VALUE", "SMILES")
features_train_y=features_train_df.select(["ACTIVE"])

features_train_pd_x=features_train_x.to_snowpark_pandas()
features_train_pd_y=np.array(features_train_y.to_snowpark_pandas()).ravel()

st.markdown('#### X - Features')
st.write(features_train_pd_x)
st.markdown('#### Y - Target')
st.write(features_train_pd_y)


In [None]:
from sklearn import datasets, ensemble

clf = ensemble.RandomForestClassifier(random_state=42)


In [None]:
clf.fit(features_train_pd_x, features_train_pd_y)

In [None]:
# Run Prediction on the test data

features_test_pd_x=features_test_df.drop("ACTIVE", "PCHEMBL_VALUE", "SMILES").to_snowpark_pandas()
features_test_pd_y=np.array(features_test_df.select(["ACTIVE"]).to_snowpark_pandas()).ravel()

prediction_results=clf.predict(features_test_pd_x)

prediction_results

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score

accuracy= accuracy_score(features_test_pd_y, prediction_results)
print("Accuracy:", accuracy)
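
The cell above imports precision, recall, and F1 scorers but reports only accuracy, which can look good even when one class dominates. As a minimal sketch of what those metrics measure for binary 0/1 labels, here they are computed by hand in pure Python. The function name `classification_metrics` and the label arrays are hypothetical, and a zero denominator is treated as a score of 0.0.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = active, 0 = inactive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In the notebook itself, the same quantities could be obtained by passing `features_test_pd_y` and `prediction_results` to the already imported scikit-learn scorers.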

Let's Register this model in the Snowflake Model Registry:
https://docs.snowflake.com/developer-guide/snowflake-ml/model-registry/overview 

In [None]:
from snowflake.ml.registry import Registry
registry = Registry(session=session, database_name="DEFAULT_DATABASE", schema_name= "NOTEBOOKS")

In [None]:
#Deploy the base model to the model registry
model_name="predict_activity_hd"
base_version_name = "v1"

try:
    mv_base = registry.get_model(model_name).version(base_version_name)
    print("Found existing model version!")
except Exception:
    print("Logging new model version...")
    mv_base = registry.log_model(
        model_name=model_name,
        model=clf, 
        version_name=base_version_name,
        conda_dependencies=["scikit-learn", "rdkit"],
        sample_input_data = features_train_x.limit(100), #using snowpark df to maintain lineage
        target_platforms=["SNOWPARK_CONTAINER_SERVICES"],  # serve on SPCS; "WAREHOUSE" is the other supported platform
        comment = """ML model for predicting compounds targeting heart disease
                    """,
        options={'relax_version': False, "enable_explainability": True}
    )
    

In [None]:
# Metrics can be added when logging the model or afterwards, as below
mv_base.set_metric(metric_name="Accuracy Score", value=accuracy)
mv_base.set_metric(metric_name="stage", value="testing")
mv_base.set_metric(metric_name="disease",value=disease)
mv_base.set_metric(metric_name="targets", value=target_id) 
mv_base.set_metric(metric_name="classifier_type",value="random_forest_classifier")
mv_base.set_metric(metric_name="fingerprint_bit", value=bit)

In [None]:
#get reference to model and version
m = registry.get_model(model_name)
mv= m.version(base_version_name)
mv

In [None]:
c1,c2,c3 = st.columns(3)

with c1:
    st.metric('Accuracy Score:',mv.get_metric("Accuracy Score"))
with c2:
    st.metric('Stage:',mv.get_metric("stage"))
with c3:
    st.metric('Disease:',mv.get_metric("disease"))

with c1:
    st.write('Targets:',mv.get_metric("targets"))
with c2:
    st.metric('Classifier Type:',mv.get_metric("classifier_type"))
with c3:
    st.metric('Fingerprint Bit:',mv.get_metric("fingerprint_bit"))

In [None]:
#Get informational DataFrame 
m.show_versions()

In [None]:
mv.show_functions()

## Call the registered model's methods using run()

https://docs.snowflake.com/developer-guide/snowflake-ml/model-registry/overview#calling-model-methods

We can invoke the functions in one of two runtimes: a warehouse or Snowpark Container Services (SPCS).
https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/warehouse
https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/container


![How it works](https://docs.snowflake.com/en/_images/model-registry-spcs-deployment.png)

## Run Model on SPCS


In [None]:
CREATE IMAGE REPOSITORY IF NOT EXISTS my_inference_images;

In [None]:
# mv is a snowflake.ml.model.ModelVersion object

mv.create_service(service_name="myservice",
                  service_compute_pool="CPU_X64_S_1_3",
                  image_repo="DEFAULT_DATABASE.NOTEBOOKS.MY_INFERENCE_IMAGES",
                  ingress_enabled=True,
                  gpu_requests=None)


In [None]:
#call the deployed model’s predict function
#reg_preds = mv.run(features_test_df, function_name="predict", service_name="myservice").rename(F.col('"output_feature_0"'), "ACTIVITY_PREDICTION")

mv.run(
    features_test_df,
    function_name="predict",
    service_name="DEFAULT_DATABASE.NOTEBOOKS.MYSERVICE")

In [None]:
CALL SYSTEM$GET_SERVICE_LOGS('DEFAULT_DATABASE.NOTEBOOKS.MYSERVICE', '0', 'model-inference')


In [None]:
DROP SERVICE DEFAULT_DATABASE.NOTEBOOKS.MYSERVICE;