# ⚽ **HOL: Eredivisie Prediction** 🥇
### Notebook - Model Training - 3/4

---


### What We'll Do:
1. **Data Ingestion**: Fetch Eredivisie data from the GitHub repository.
2. **Data Transformation**: Utilize Snowpark DataFrames for data preparation and analysis.
3. -> **Model Training**: Train model and store it in the Snowflake Model Registry
4. **Prediction**: Predict who is going to win Eredivisie 2024/2025

![image](https://i.makeagif.com/media/2-26-2017/iTVOpv.gif)



## Step 3: Model Training and Evaluation
---

In this notebook we'll perform the following activities:

- Hyperparameter Tuning
- Model Training
- Model Validation
- Model Registry

### Setup

Before using this notebook, make sure the following packages are added: click the "Packages" button at the top right, select them, and then restart the notebook:

- `snowflake-snowpark-python` (Latest)
- `snowflake-ml-python` (Latest)
- `fastparquet` (Latest)


In [None]:
import snowflake.snowpark
#import pandas as pd
#import numpy as np
import streamlit as st
from snowflake.snowpark.session import Session
from snowflake.snowpark import Window
from snowflake.snowpark import functions as F   
from snowflake.snowpark.functions import udf, udtf
from snowflake.snowpark.types import IntegerType, FloatType, StringType, StructField, StructType, DateType

import warnings
warnings.filterwarnings('ignore')

In [None]:
from snowflake.snowpark.context import get_active_session
session = get_active_session()

In [None]:
# Helper that works out the next model version name, so that each
# logged model is automatically given the next version number

import ast

def get_next_version(reg, model_name) -> str:
    """
    Returns the next version of a model based on the existing versions in the registry.

    Args:
        reg: The registry object that provides access to the models.
        model_name: The name of the model.

    Returns:
        str: The next version of the model in the format "V_<version_number>".

    Raises:
        ValueError: If the version list for the model is empty or if the version format is invalid.
    """
    models = reg.show_models()
    if models.empty:
        return "V_1"
    elif model_name not in models["name"].to_list():
        return "V_1"
    max_version_number = max(
        [
            int(version.split("_")[-1])
            for version in ast.literal_eval(
                models.loc[models["name"] == model_name, "versions"].values[0]
            )
        ]
    )
    return f"V_{max_version_number + 1}"
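
The version-bump logic in `get_next_version` can be tried without a registry. The sketch below is a minimal illustration, assuming the `versions` column holds a stringified list such as `'["V_1", "V_2"]'`, as the `ast.literal_eval` call above implies. The helper name `next_version_from_string` is hypothetical and not part of the notebook.

```python
import ast

def next_version_from_string(versions_repr: str) -> str:
    """Return the next "V_<n>" label given a stringified list of version names."""
    # Assumption: the registry reports versions as a string like '["V_1", "V_2"]'.
    versions = ast.literal_eval(versions_repr)
    if not versions:
        return "V_1"
    # Compare the numeric suffixes so that "V_10" ranks above "V_9".
    highest = max(int(v.split("_")[-1]) for v in versions)
    return f"V_{highest + 1}"

print(next_version_from_string('["V_1", "V_2", "V_10"]'))  # prints V_11
```

Comparing the integer suffix rather than the raw string matters once there are ten or more versions, because `"V_9"` sorts after `"V_10"` as text.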

In [None]:
from snowflake.snowpark import functions as F
from snowflake.snowpark.functions import col

# Load the training features from the Snowflake table, keeping matches after 2010-08-01.
# The class distribution is checked at the end of this cell.
df_training = session.table('eredivisie_features').filter(
    col("DATE") > '2010-08-01')

# Filter the rows of the class to downsample (GAME_OUTCOME == 1)
df_training_draw_matches = df_training.filter(col('GAME_OUTCOME') == 1)

# Randomly select 500 of those rows
rows_to_drop = df_training_draw_matches.sample(n=500)

# Drop the selected rows from the original DataFrame
df_training_balanced = df_training.join(
    rows_to_drop,
    on=list(rows_to_drop.columns),
    how='left_anti'  # This will keep only the rows not in rows_to_drop
)

df_training_balanced.group_by('GAME_OUTCOME').agg(F.count('*')).sort(F.col('GAME_OUTCOME'))
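
The balancing step above removes a random subset of one class so that the outcomes are more evenly represented. The sketch below shows the same idea on a toy list in plain Python; the data and the helper name `downsample_class` are made up for illustration, and it drops rows by position rather than with an anti-join.

```python
import random
from collections import Counter

def downsample_class(rows, label_key, target_label, n_to_drop, seed=0):
    """Drop n_to_drop randomly chosen rows whose label equals target_label."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    candidate_idx = [i for i, row in enumerate(rows) if row[label_key] == target_label]
    drop_idx = set(rng.sample(candidate_idx, min(n_to_drop, len(candidate_idx))))
    return [row for i, row in enumerate(rows) if i not in drop_idx]

# Toy data (hypothetical): 6 rows of class 1, 3 rows each of classes 0 and 2.
rows = [{"GAME_OUTCOME": 1}] * 6 + [{"GAME_OUTCOME": 0}] * 3 + [{"GAME_OUTCOME": 2}] * 3
balanced = downsample_class(rows, "GAME_OUTCOME", target_label=1, n_to_drop=3)

# After dropping 3 rows of class 1, every class has 3 rows.
print(Counter(row["GAME_OUTCOME"] for row in balanced))
```

One difference worth keeping in mind: an anti-join on all columns removes every row that matches a sampled row, so exact duplicates of a sampled row would be dropped as well, whereas dropping by position removes exactly the sampled rows.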

In [None]:
st.dataframe(df_training_balanced.limit(20))

In [None]:
-- To speed up hyperparameter tuning, let's scale up our warehouse. The scale-up is temporary; we'll scale it back down before training.
-- Be aware of the costs associated with running this warehouse size.
-- Wait until the warehouse has actually scaled up before running the next cell.

alter warehouse EREDIVISIE_PREDICTION_WH set warehouse_size = xxlarge

In [None]:
# To uncomment this cell: Select All -> Cmd + / (on Mac)
# Hyperparameter tuning finds the best parameters for training our model.
# This step should take about 4 minutes on a 2XL warehouse.
# Be aware of the costs associated with executing this cell.
# If you scaled up the warehouse, you'll scale it back down in the next step.

from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.model_selection.grid_search_cv import GridSearchCV

FEATURE_COLS = ["HOME_WINS_LAST68","HOME_WIN_PERCENTAGE_LAST68", "H2H_HOME_WINS_LAST10" , "H2H_HOME_LOSSES_LAST10", "HOME_GOALS_AGAINST_LAST34"]
LABEL_COLS = ["GAME_OUTCOME"]

# Select only the required feature and label columns for training
df_training_filtered = df_training_balanced.select(FEATURE_COLS + LABEL_COLS)

hyperparam_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 4, 5]
}

pipeline = Pipeline(
    steps = [
        (
            "scaler", 
            StandardScaler(
                input_cols=FEATURE_COLS, 
                output_cols=FEATURE_COLS
            )
        ),
        (
            "GridSearchCV",
            GridSearchCV(
                estimator=XGBClassifier(random_state=42),
                param_grid=hyperparam_grid,
                scoring='accuracy', 
                label_cols=LABEL_COLS,
                input_cols=FEATURE_COLS
            )   
        )
    ]
)

pipeline.fit(df_training_filtered)

sklearn_hp = pipeline.to_sklearn()
optimal_params = sklearn_hp.steps[-1][1].best_params_
score_dict = {"best_accuracy": sklearn_hp.steps[-1][1].best_score_}

st.write(score_dict)
st.write(optimal_params)
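
To get a feel for why the tuning cell benefits from a larger warehouse, it helps to count how many models the grid search fits. The sketch below enumerates the same grid in plain Python. The fold count is an assumption: the cell does not set `cv`, so five folds (the scikit-learn default) is assumed here.

```python
from itertools import product

hyperparam_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [3, 4, 5],
}

# Every combination of the three lists is one candidate configuration.
candidates = [
    dict(zip(hyperparam_grid, values))
    for values in product(*hyperparam_grid.values())
]

ASSUMED_CV_FOLDS = 5  # assumption: default number of cross-validation folds

print(len(candidates))                     # 27 candidate configurations
print(len(candidates) * ASSUMED_CV_FOLDS)  # 135 cross-validation fits
```

Adding one more value to any of the three lists raises the candidate count from 27 to 36, so the grid grows quickly as it is widened.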

In [None]:
-- Now we can scale the warehouse back down, which takes a matter of seconds

alter warehouse eredivisie_prediction_wh set warehouse_size = xsmall

In [None]:
# taking our optimal parameters we're going to build our model

from snowflake.ml.modeling.preprocessing import StandardScaler
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.xgboost import XGBClassifier
from snowflake.ml.modeling.metrics import accuracy_score


FEATURE_COLS = ["HOME_WINS_LAST34","HOME_WIN_PERCENTAGE_LAST34", "H2H_HOME_WINS_LAST10" , "H2H_HOME_LOSSES_LAST10", "HOME_GOALS_AGAINST_LAST34"]
LABEL_COLS = ["GAME_OUTCOME"]

# Select only the required feature and label columns for training
df_training_filtered = df_training_balanced.select(FEATURE_COLS + LABEL_COLS)

# Split the filtered dataframe into training and test datasets
train_data, test_data = df_training_filtered.random_split(weights=[0.8, 0.2], seed=0)

# Build the pipeline with the optimal parameters found by the grid search above
pipeline = Pipeline(
    steps = [
        (
            "scaler", 
            StandardScaler(
                input_cols=FEATURE_COLS, 
                output_cols=FEATURE_COLS
            )
        ),
        (
            "model", 
            XGBClassifier(
                input_cols=FEATURE_COLS, 
                label_cols=LABEL_COLS,
                max_depth= optimal_params['max_depth'],
                n_estimators = optimal_params['n_estimators'],
                learning_rate = optimal_params['learning_rate']
            )
        )
    ]
)

pipeline.fit(train_data)

# Get the model accuracy
predict_on_training_data = pipeline.predict(train_data)
training_accuracy = accuracy_score(df=predict_on_training_data, y_true_col_names=["GAME_OUTCOME"], y_pred_col_names=["OUTPUT_GAME_OUTCOME"])

predict_on_test_data = pipeline.predict(test_data)
eval_accuracy = accuracy_score(df=predict_on_test_data, y_true_col_names=["GAME_OUTCOME"], y_pred_col_names=["OUTPUT_GAME_OUTCOME"])

st.write("Model Training Completed")

In [None]:
# Homework - You can plot some additional statistics!
st.write(f"Training accuracy: {training_accuracy} \nEval accuracy: {eval_accuracy}")
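
As a starting point for the homework, the sketch below shows what the accuracy metric measures: the share of predictions that match the true label. It uses made-up labels in plain Python and a hypothetical helper named `accuracy`, not the Snowflake `accuracy_score` function used above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for eight matches; two of the eight predictions are wrong.
y_true = [0, 1, 2, 1, 0, 2, 1, 1]
y_pred = [0, 1, 1, 1, 0, 2, 2, 1]

print(accuracy(y_true, y_pred))  # 0.75
```

When comparing the two numbers printed above, a training accuracy that is much higher than the eval accuracy usually suggests the model is overfitting the training data.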

## Model Registry
---

- Once the model is ready, we'll use it to predict the results of the Eredivisie 2024/2025 season.
- Save the model using the Snowflake Model Registry.

In [None]:
from snowflake.ml.registry import Registry

reg = Registry(session=session)

model_name = "EREDIVISIE_PREDICT"
model_version = get_next_version(reg, model_name)

reg.log_model(
    model_name=model_name,
    version_name=model_version,
    model=pipeline,
    metrics={'training_accuracy':training_accuracy, 'eval_accuracy':eval_accuracy},
    options={'relax_version': False}
)

m = reg.get_model(model_name)
m.default = model_version

In [None]:
# Let's see the model versions we have in our registry

reg.get_model(model_name).show_versions()

# Summary

We now have a model in our registry that can be called from either Snowpark or SQL. We'll use it in the predictions notebook.
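
Calling the registered model from Snowpark might look like the sketch below. It is a minimal illustration rather than the code of the predictions notebook: the helper name `predict_with_default_version` and the DataFrame `df_new_matches` in the usage comment are hypothetical, and the features passed in are assumed to contain the same feature columns the pipeline was trained on.

```python
def predict_with_default_version(reg, model_name, features_df):
    """Run the default version of a registered model on a DataFrame of features."""
    # Look up the model and take the version that was marked as default.
    model_version = reg.get_model(model_name).default
    # Call the model's "predict" function on the supplied features.
    return model_version.run(features_df, function_name="predict")

# Usage inside the notebook (df_new_matches is a hypothetical Snowpark DataFrame):
# predictions = predict_with_default_version(reg, "EREDIVISIE_PREDICT", df_new_matches)
```

Resolving the default version at call time means downstream code keeps working when a newer version is logged and promoted to default.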