# Train Sentiment Model

This notebook trains a sentiment prediction model.

## Notebook Summary

**It takes as inputs...**

1. A range of standard packages for modelling and evaluation
2. A custom py script for preparing the data
3. A subset of survey responses that has been labelled for training

**It applies the following process...**

1. Data preparation
2. Model evaluation
3. Model selection (confirmed by the user)
4. Export model to be used for sentiment prediction

**Its outputs are...**

1. Multinomial Naive Bayes sentiment prediction model


## Further Notes
**During the project multiple models were evaluated. It was found that:**

1. Naive Bayes models were the quickest to train and performed well relative to other options.
2. Of the Naive Bayes models, the multinomial variant performed best despite the class imbalance.
3. Extensive text preprocessing, as is done for topic modelling, degraded performance.
4. A small range of hyperparameters consistently returned the best performance.


**Future Improvements:**

1. Spell check on the text inputs to improve the BOW representation
2. Large scale SVM model
3. Ensemble of methods combining lexicon approaches with this custom model

# 1 Package Imports and Constants

## 1.0 Package Imports

Snowflake offers an ML library that wraps scikit-learn and adds distributed compute benefits. However, as of June 2024 that library, snowflake.ml, does not support all features critical to this project, so sklearn is still used.

The missing features are:
- utils for resampling
- CountVectorizer
- BalancedAccuracyScore metric (although this could be derived directly)
- ClassificationReport metric (although this could be derived directly)
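
As a rough illustration of how the balanced accuracy metric could be derived directly, the sketch below computes it as the mean of per-class recall using only the standard library. The function name `balanced_accuracy` and the toy labels are assumptions made for this example; they are not part of the notebook or of snowflake.ml.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed from paired label lists."""
    support = Counter(y_true)  # number of true examples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    recalls = [hits[label] / count for label, count in support.items()]
    return sum(recalls) / len(recalls)

# Toy imbalanced example (hypothetical labels): 4 negative, 2 positive
y_true = ["neg", "neg", "neg", "neg", "pos", "pos"]
y_pred = ["neg", "neg", "neg", "neg", "neg", "pos"]

score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, plain_accuracy)
```

On imbalanced data, plain accuracy is dominated by the majority class, whereas the per-class average gives each class equal weight. This is why balanced accuracy is used for scoring throughout this notebook.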

NB: If a wider grid search is desired and the run time becomes too long, we could use the snowflake.ml.modeling grid search for that step. Alternatively, use a Snowpark-optimized warehouse: https://docs.snowflake.com/en/user-guide/warehouses-snowpark-optimized

TODO: review the above content once the notebook is complete.
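
To get a feel for how quickly a grid search grows, the sketch below counts the model fits implied by the candidate ranges listed in section 2.1 and the 7-split, 3-repeat inner cross validation. The values are copied from this notebook; the variable names `n_candidates` and `total_fits` are introduced here for illustration only.

```python
from itertools import product

# Candidate values mirroring the ranges listed in section 2.1
n_grams = [(1, 3)]
alphas = [0.005, 0.01, 0.015]
min_dfs = [10, 15, 50]
max_dfs = [0.5, 0.65, 0.8]  # the three values produced by np.linspace(0.5, 0.8, 3)

# Every combination of hyperparameters is one candidate model
n_candidates = len(list(product(n_grams, alphas, min_dfs, max_dfs)))

# Each candidate is fitted once per inner fold: n_splits * n_repeats
n_inner_splits = 7 * 3

total_fits = n_candidates * n_inner_splits
print(n_candidates, n_inner_splits, total_fits)
```

Adding one more value to any range multiplies the number of fits, so even a modestly wider grid can push the run time up sharply.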

In [None]:
from snowflake.snowpark.context import get_active_session

# Snowpark ML
from snowflake.ml.registry import Registry
from snowflake.ml._internal.utils import identifier
#from snowflake.ml.modeling.model_selection import GridSearchCV # TODO: reinstate if the terminated worker error is resolved

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ast

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import resample

# models
from sklearn.naive_bayes import MultinomialNB

# evaluation
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, train_test_split, GridSearchCV
from sklearn.metrics import balanced_accuracy_score , confusion_matrix

## 1.1 Notebook Constants

In [None]:
# set seed for reproducible results
seed = 1234
np.random.seed(seed)
random_state = seed # for most actions
random_state2 = seed-1 # to ensure different, but reproducible, inner folds in nested cv

# nested cv config. Use stratified for imbalance
outer_cv = RepeatedStratifiedKFold(n_splits=7, n_repeats=3, random_state=random_state)
inner_cv = RepeatedStratifiedKFold(n_splits=7, n_repeats=3, random_state=random_state2) # random state2 so inner folds differ to outer folds

# get session
session = get_active_session()

# for debugging notebook
debug = False

## 1.2 Import Data

**Import pre-labeled data for training**

In [None]:
# NB: These labels could be pushed to a table and streamlit put over the top to enable more manual labelling
labelled_df = pd.read_csv('schl_aa_master_sentiment_labels.csv', encoding='utf-8', index_col=False)

In [None]:
# demonstrate imbalance
labelled_df['sentiment'].value_counts()

# 2 Process

## 2.0 Data Preparation

In [None]:
# takes a prepared dataframe, separates the features (X) from the target (y) and creates an 80/20 split stratified by target
def get_splits(dataset, seed, random=False):
    if random:
        random_state = None
    else:
        random_state = seed

    X = dataset['comment_response']
    
    y = dataset['sentiment']   
    
    # stratify as the classes are imbalanced and we want that imbalance to be consistent across splits
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state, stratify=y)

    return X_train, X_test, y_train, y_test
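
To illustrate what stratifying the split achieves, here is a minimal standard-library sketch that holds out roughly 20% of each class separately. It is not sklearn's implementation of `train_test_split`; the helper `stratified_split` and the class names and counts are hypothetical and chosen only for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=1234):
    """Return (train_idx, test_idx) with roughly test_size of each class held out."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced label column
labels = ["positive"] * 70 + ["neutral"] * 20 + ["negative"] * 10
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Because each class is sampled on its own, the held-out set keeps the same class proportions as the full data. A purely random split gives no such guarantee for a small minority class.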

## 2.1 Model Preparation

**Processing Pipelines**

In [None]:
# model pipeline
mnb_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

# parameter ranges. TODO: revisit once the issue with a vectorizer pipeline + GridSearchCV is resolved
# the ranges stay defined so that vect_params below does not raise a NameError

n_grams = [(1,3)] # (1,3) consistently the best
alphas = [0.005, 0.01, 0.015] # narrow range. Best model is actually at 0.0075 but that will change with new data.
min_dfs = [10,15,50] # remove words that appear in fewer than X reviews. 10-15 was returning the best results.
max_dfs = np.linspace(0.5, 0.8, 3) # remove words that appear in more than X% of reviews. Decrease the last number to reduce the fit time.

vect_params = {
    'vect__min_df': min_dfs,
    'vect__max_df': max_dfs,
    'vect__ngram_range': n_grams,
    'clf__alpha': alphas,
}
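
The effect of the `min_df` and `max_df` settings can be sketched without sklearn. The example below filters a vocabulary by document frequency, treating `min_df` as an absolute count and `max_df` as a proportion, which matches how the values above are passed to `CountVectorizer`. The helper `filter_vocabulary`, the whitespace tokenizer and the sample comments are simplifying assumptions for illustration only.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens found in at least min_df documents and at most a max_df share of documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document (document frequency, not term frequency)
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        token for token, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

# Hypothetical survey comments
docs = [
    "the staff were friendly",
    "the wait was long",
    "the staff were rude",
    "the food was cold",
]

vocab = filter_vocabulary(docs, min_df=2, max_df=0.75)
print(vocab)
```

Here "the" is dropped for appearing in every comment, and words seen only once are dropped as too rare, leaving the mid-frequency terms that carry most of the signal.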

**Attempt 1: snowflake ml GridSearchCV within the original sklearn GridSearchCV code.**

In [None]:
def get_model_results(X_train, X_test, y_train, y_test, pipe, params):

    # format for snowflake ml grid search which expects train data and labels in a single dataframe vs sklearn fit which expects two arrays
    train_df = pd.concat([X_train, y_train], axis=1)
    
    # hyperparameter tuning
    ## NB it appears SF notebooks do not allow jobs to run in parallel. n_jobs was changed from -2 to None
    #grid = GridSearchCV(pipe, params, cv=inner_cv, scoring='balanced_accuracy', n_jobs=1,verbose=1) # use this once the range of hyperparameters is refined by HalvingGridSearchCV
    #grid_fit = grid.fit(X_train, y_train)
    
    # try the snowflake ml grid search as the sklearn version was triggering memory/thread failures. TODO: review for final version
    grid = GridSearchCV(estimator = pipe,
                        param_grid = params, 
                        cv = inner_cv, 
                        scoring = 'balanced_accuracy',
                        input_cols = 'comment_response',
                        label_cols = 'sentiment',
                        output_cols = 'predicted_sentiment')

    print('grid search created')
    

    #test_df = pd.concat([X_test, y_test], axis=1)
    
    grid_fit = grid.fit(train_df)
    print('grid fit done')
    
    # results
    best_model = grid_fit.best_estimator_
    tunning_cv_score = grid_fit.best_score_
    
    # evaluate best estimator
    eval_cv = cross_val_score(best_model, X_train, y_train, cv=outer_cv,n_jobs=-1,scoring='balanced_accuracy')
    
    # results
    eval_cv_score = eval_cv.mean()
    nested_cv_error = tunning_cv_score - eval_cv_score
    
    # evaluate on hold out
    predicted = best_model.predict(X_test)
    test_score = balanced_accuracy_score(y_test, predicted)
    approx_error = test_score - eval_cv_score

    return best_model, tunning_cv_score, eval_cv, eval_cv_score, nested_cv_error,test_score, approx_error

In [None]:
if debug:

    experiment_name = ['tuned']
    X_train, X_test, y_train, y_test = get_splits(dataset=labelled_df[['sentiment','comment_response']],seed=seed) # get_splits has no upsample argument
    
    best_model,tunning_cv_score,eval_cv,eval_cv_score,nested_cv_error,test_score,approx_error = get_model_results(X_train, X_test, y_train, y_test, mnb_clf, vect_params)
    
    print('Nested cross validation score: {}'.format(eval_cv_score))
    print('Test score: {}'.format(test_score))
    print('Approximation error: {}'.format(approx_error)) # the difference between the test score and the cross validation score. Small means lower risk of over/under fitting

**Attempt 2: snowflake ml GridSearchCV outside the custom multi-model training function.**

In [None]:
from snowflake.ml.modeling.model_selection import GridSearchCV
if debug:
    X_train, X_test, y_train, y_test = get_splits(dataset=labelled_df[['sentiment','comment_response']],seed=seed)
    
    train_df = pd.concat([X_train, y_train], axis=1)
    
    # hyperparameter tuning
    ## NB it appears SF notebooks do not allow jobs to run in parallel. n_jobs was changed from -2 to None
    #grid = GridSearchCV(pipe, params, cv=inner_cv, scoring='balanced_accuracy', n_jobs=1,verbose=1) # use this once the range of hyperparameters is refined by HalvingGridSearchCV
    #grid_fit = grid.fit(X_train, y_train)
    
    # try the snowflake ml grid search as the sklearn version was triggering memory/thread failures. TODO: review for final version
    grid = GridSearchCV(estimator = mnb_clf,
                        param_grid = vect_params, 
                        #cv = inner_cv, 
                        scoring = 'balanced_accuracy',
                        input_cols = 'comment_response',
                        label_cols = 'sentiment',
                        output_cols = 'predicted_sentiment')

    grid.fit(train_df)

**Attempt 3: drop the GridSearchCV and hardcode the pipeline hyperparameters**

This successfully fits the pipeline (but relies on the user knowing the optimal hyperparameters)

In [None]:
# abandon the grid search and test if the model registry works for this pipeline. 
# start by fitting the pipeline with hard coded values

X_train, X_test, y_train, y_test = get_splits(dataset=labelled_df[['sentiment','comment_response']],seed=seed)

# model pipeline
mnb_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1,3),min_df=10,max_df=0.75)),
    ('clf', MultinomialNB(alpha=0.0075)),
])

mnb_clf.fit(X_train, y_train)

In [None]:
accuracy_score = mnb_clf.score(X_test, y_test)
print('Accuracy score: {}'.format(accuracy_score))

In [None]:
y_pred = mnb_clf.predict(X_test)
bal_accuracy_score = balanced_accuracy_score(y_test, y_pred)
print('Balanced accuracy score: {}'.format(bal_accuracy_score))

## Save Model to Snowflake Model Registry

In [None]:
# TODO: update with the final location of the registry

db = identifier._get_unescaped_name(session.get_current_database())
schema = 'ML_MODEL_REGISTRY'

reg = Registry(session=session, database_name=db, schema_name=schema)

In [None]:
# function to find the current model version and return the next version to use
def get_next_model_version(reg, model_name):
    models = reg.show_models()
    # first version if the registry is empty or this model has not been logged before
    if models.empty or model_name not in models["name"].to_list():
        return "V_1"
    versions = ast.literal_eval(models.loc[models["name"] == model_name, "versions"].values[0])
    # compare on the numeric suffix; a string max would rank 'V_9' above 'V_10'
    max_version = max(int(v.split('_')[-1]) for v in versions)
    return 'V_{}'.format(max_version + 1)
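
Version names of the form `V_<n>` compare lexicographically when treated as strings, so finding the latest version is safer on the numeric suffix. The sketch below shows the difference without touching the registry; `next_version` is a hypothetical helper and the version list is made up for illustration.

```python
def next_version(existing):
    """Return the next 'V_<n>' name given a list of existing version names."""
    if not existing:
        return "V_1"
    # Compare on the integer suffix rather than the raw string
    latest = max(int(name.split("_")[-1]) for name in existing)
    return "V_{}".format(latest + 1)

versions = ["V_1", "V_2", "V_9", "V_10"]

string_max = max(versions)          # string comparison ranks 'V_9' highest
following = next_version(versions)  # numeric comparison continues after V_10
print(string_max, following)
```

This only matters once a model reaches ten or more versions, but at that point a string-based maximum would start reusing an existing version name.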

In [None]:
# Log the model
# the name + version must be unique and both are strings. The name could be a UUID but a short description of the use case is more practical.
# the model can be logged with metrics, to track history, and the features

model_name = "CEMPLICITY_SENTIMENT_CLASSIFIER"

logged_model = reg.log_model(
        model_name = model_name,
        version_name = get_next_model_version(reg, model_name), # assumes versioning is incremental
        model = mnb_clf,
        metrics = {'accuracy': accuracy_score,
                   'balanced_accuracy': bal_accuracy_score},
        sample_input_data = pd.DataFrame(X_train[:100]))

# get around bug with log_model method not saving comments
logged_model.comment = 'Test model registry for hosting. This is the sentiment classifier trained on the Cemplicity survey data.'

In [None]:
# check the model has been saved
reg.get_model(model_name).show_versions()