# Train a domain classifier on the [Semantic Scholar dataset](https://api.semanticscholar.org/corpus)

> Part 2: train a model

![position of this step in the lifecycle](diagrams/scope-train.svg)
> The blue boxes show the steps implemented in this notebook.

In [Part 1](data.ipynb), we cleaned and transformed our training data. We can now access this data using `great_ai.LargeFile`. Locally, it gives us the cached version; otherwise, the latest version is downloaded from S3.

In this part, we optimise the hyperparameters of a simple Naive Bayes classifier, train it, and then export it for deployment using `great_ai.save_model`.

In [1]:
MODEL_KEY = "small-domain-prediction"

## Load the data extracted in [Part 1](data.ipynb)

In [3]:
from great_ai import query_ground_truth

data = query_ground_truth("train")
# Each record can carry several domain labels in `feedback`,
# so we emit one (input, label) pair per label.
X = [d.input for d in data for domain in d.feedback]
y = [domain for d in data for domain in d.feedback]

2022-06-19 15:08:22,339 |     INFO | Options: configured ✅


In [4]:
import pandas as pd
from collections import Counter
import plotly.express as px

df = pd.DataFrame(Counter(y).most_common(), columns=["domain", "count"])
px.bar(x=df["domain"], y=df["count"], width=1200, height=400).show()
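
The two list comprehensions in the loading cell repeat each input once per label, which is also why the bar chart counts labels rather than documents. Below is a minimal sketch of that flattening step. The `Record` class and the example titles are hypothetical stand-ins for what `query_ground_truth` returns; only the `input` and `feedback` attribute names are taken from the cell above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical stand-in for one ground-truth entry."""

    input: str
    feedback: List[str]


records = [
    Record(
        input="Graph neural networks for molecules",
        feedback=["Computer Science", "Chemistry"],
    ),
    Record(input="A survey of gut microbiota", feedback=["Biology"]),
]

# One (text, label) pair per label: a record with two domains appears twice.
X_demo = [r.input for r in records for _ in r.feedback]
y_demo = [domain for r in records for domain in r.feedback]

for text, label in zip(X_demo, y_demo):
    print(f"{label}: {text}")
```

This turns a multi-label problem into a single-label one, at the cost of presenting the same text to the classifier with different targets.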

## Optimise and train a Multinomial Naive Bayes classifier

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer


def create_pipeline() -> Pipeline:
    return Pipeline(
        steps=[
            ("vectorizer", TfidfVectorizer(sublinear_tf=True)),
            ("classifier", MultinomialNB()),
        ]
    )
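
To get a feel for what this pipeline does, here is a small sketch that fits the same two steps on a toy corpus. The four texts and their labels are made up for illustration and are not part of the Semantic Scholar data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus, for illustration only.
toy_texts = [
    "neural network training with gradient descent",
    "deep learning model for image recognition",
    "protein folding in the cell membrane",
    "gene expression in the cell nucleus",
]
toy_labels = ["Computer Science", "Computer Science", "Biology", "Biology"]

toy_pipeline = Pipeline(
    steps=[
        ("vectorizer", TfidfVectorizer(sublinear_tf=True)),
        ("classifier", MultinomialNB()),
    ]
)
toy_pipeline.fit(toy_texts, toy_labels)

query = ["gene regulation in the cell"]
prediction = toy_pipeline.predict(query)[0]
probabilities = toy_pipeline.predict_proba(query)[0]
classes = toy_pipeline.named_steps["classifier"].classes_

print("prediction:", prediction)
for label, p in zip(classes, probabilities):
    print(f"{label}: {p:.3f}")
```

The vectorizer turns raw text into TF-IDF weights, and the classifier treats those weights as (fractional) word counts. Words that never occurred during training, such as "regulation" here, are simply ignored.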

In [6]:
from sklearn.model_selection import GridSearchCV

optimisation_pipeline = GridSearchCV(
    create_pipeline(),
    {
        "vectorizer__min_df": [5, 20, 100],
        "vectorizer__max_df": [0.05, 0.1],
        "classifier__alpha": [0.5, 1],
        "classifier__fit_prior": [True, False],
    },
    scoring="f1_macro",
    cv=3,
    n_jobs=-1,
    verbose=1,
)
optimisation_pipeline.fit(X, y)

results = pd.DataFrame(optimisation_pipeline.cv_results_)
results.sort_values("rank_test_score")

Fitting 3 folds for each of 24 candidates, totalling 72 fits
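
The numbers in this log line follow directly from the grid: 3 × 2 × 2 × 2 = 24 parameter combinations, each evaluated on 3 folds. As a quick check, scikit-learn's `ParameterGrid` can enumerate the same combinations without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = {
    "vectorizer__min_df": [5, 20, 100],
    "vectorizer__max_df": [0.05, 0.1],
    "classifier__alpha": [0.5, 1],
    "classifier__fit_prior": [True, False],
}
n_folds = 3

candidates = ParameterGrid(param_grid)
total_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates, {total_fits} fits")
```

Every value added to the grid multiplies the number of fits, so it pays to keep the lists short.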


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classifier__alpha | param_classifier__fit_prior | param_vectorizer__max_df | param_vectorizer__min_df | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 7.796924 | 0.321314 | 3.756043 | 0.02786 | 0.5 | False | 0.05 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.508013 | 0.509086 | 0.514455 | 0.510518 | 0.002818 | 1 |
| 10 | 8.055664 | 0.206984 | 3.748517 | 0.088012 | 0.5 | False | 0.1 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.503729 | 0.506417 | 0.511895 | 0.507347 | 0.003398 | 2 |
| 11 | 7.74836 | 0.484361 | 3.863216 | 0.072048 | 0.5 | False | 0.1 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.502211 | 0.498949 | 0.503744 | 0.501635 | 0.002 | 3 |
| 8 | 7.400649 | 0.08732 | 3.658442 | 0.011735 | 0.5 | False | 0.05 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.501432 | 0.49397 | 0.501386 | 0.498929 | 0.003507 | 4 |
| 19 | 8.147969 | 0.40198 | 3.977119 | 0.284028 | 1.0 | False | 0.05 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.48641 | 0.491891 | 0.492515 | 0.490272 | 0.002743 | 5 |
| 20 | 7.472414 | 0.13032 | 3.771136 | 0.146406 | 1.0 | False | 0.05 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.486868 | 0.489142 | 0.492665 | 0.489558 | 0.002385 | 6 |
| 23 | 7.395585 | 0.326162 | 2.332031 | 0.254146 | 1.0 | False | 0.1 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.489489 | 0.489987 | 0.488543 | 0.48934 | 0.000599 | 7 |
| 22 | 7.45206 | 0.162072 | 2.937473 | 0.116443 | 1.0 | False | 0.1 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.478748 | 0.485174 | 0.484685 | 0.482869 | 0.002921 | 8 |
| 6 | 7.83638 | 0.374669 | 4.007429 | 0.251199 | 0.5 | False | 0.05 | 5 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.472793 | 0.47646 | 0.479583 | 0.476279 | 0.002775 | 9 |
| 2 | 7.839444 | 0.174964 | 3.914105 | 0.379735 | 0.5 | True | 0.05 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.469224 | 0.472179 | 0.476758 | 0.47272 | 0.0031 | 10 |
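
The candidates are ranked by `scoring="f1_macro"`, the unweighted mean of the per-class F1 scores. Unlike accuracy, it penalises a model that ignores rare classes. This may help explain why the nine best-ranked rows above all use `fit_prior=False`, although the notebook does not test that directly. The sketch below uses made-up labels to show the difference between the two metrics.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up, imbalanced labels: class "B" is rare and never predicted.
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro_f1:.3f}")
```

Class "A" reaches an F1 of 8/9 and class "B" scores 0, so the macro average is 4/9 even though 80% of the predictions are correct.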


In [7]:
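from sklearn import set_config

# Render estimators as an HTML diagram in the notebook output.
set_config(display="diagram")

# Train a fresh pipeline on all of the training data,
# using the best hyperparameters found by the grid search.
classifier = create_pipeline()
classifier.set_params(**optimisation_pipeline.best_params_)
classifier.fit(X, y)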
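
Refitting a fresh pipeline with `best_params_` should give the same model that `GridSearchCV` already stores in `best_estimator_`, because `refit=True` is its default. The following sketch checks this on a made-up toy corpus with a reduced grid; `make_toy_pipeline` is a hypothetical helper that mirrors `create_pipeline`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def make_toy_pipeline() -> Pipeline:
    """Hypothetical helper mirroring create_pipeline()."""
    return Pipeline(
        steps=[
            ("vectorizer", TfidfVectorizer(sublinear_tf=True)),
            ("classifier", MultinomialNB()),
        ]
    )


# Made-up toy corpus, for illustration only.
texts = [
    "neural network training",
    "deep learning model",
    "gradient descent optimisation",
    "image recognition network",
    "protein folding cell",
    "gene expression cell",
    "cell membrane protein",
    "dna gene sequencing",
]
labels = ["cs"] * 4 + ["bio"] * 4

search = GridSearchCV(
    make_toy_pipeline(),
    {"classifier__alpha": [0.5, 1]},
    scoring="f1_macro",
    cv=2,
)
search.fit(texts, labels)

# Manual refit, as done in the cell above.
manual = make_toy_pipeline()
manual.set_params(**search.best_params_)
manual.fit(texts, labels)

same_probabilities = np.allclose(
    manual.predict_proba(texts),
    search.best_estimator_.predict_proba(texts),
)
print("identical predictions:", same_probabilities)
```

Creating a new pipeline has the advantage that the exported object is a plain `Pipeline`, without the surrounding search object and its cross-validation results.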

## Export the model using GreatAI

In [8]:
from great_ai import save_model


save_model(classifier, key=MODEL_KEY, keep_last_n=5)

2022-06-19 15:12:58,312 |     INFO | Fetching cached versions of small-domain-prediction
2022-06-19 15:12:59,027 |     INFO | Copying file for small-domain-prediction-12
2022-06-19 15:12:59,039 |     INFO | Compressing small-domain-prediction-12
2022-06-19 15:12:59,842 |     INFO | Model small-domain-prediction uploaded with version 12


'small-domain-prediction:12'