### Importing Packages

All imports are collected here, so we don't need to search for them across the notebook.

In [1]:
import os
from multiprocessing import cpu_count

import cloudpickle
import pandas as pd
import scipy.sparse as sp
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

### Data Extraction

We'll load the dataset from the path given in the `DATASET_PATH` environment variable, reading only the columns we are interested in, as a good practice. It doesn't make much difference in this case, but if our dataset were huge it could save us a lot of memory.

For this exercise, we'll use only the text columns `title` and `concatenated_tags`, since during an early experimentation phase these two columns were enough for a very good result. Working with additional numerical features like `price` and `weight` could help improve the model even further, but that would require more work with sparse matrix conversions and would be more complicated than what the exercise requires.

In [2]:
df = pd.read_csv(
    os.environ["DATASET_PATH"],
    usecols=["title", "concatenated_tags", "category"],
)
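As a rough illustration of the memory point, the sketch below builds a small in-memory CSV and compares the footprint of a full load with a `usecols` load. The rows and the `price` and `weight` values are invented for the example and are not taken from the real dataset.

```python
import io

import pandas as pd

# Toy CSV standing in for the real dataset; all values are invented.
csv_text = (
    "title,concatenated_tags,category,price,weight\n"
    "Caneca personalizada,caneca presente,Lembrancinhas,19.9,300\n"
    "Manta de bebê,manta enxoval,Bebê,59.9,450\n"
)

full = pd.read_csv(io.StringIO(csv_text))
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["title", "concatenated_tags", "category"],
)

# deep=True also counts the string data held by the text columns.
full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = subset.memory_usage(deep=True).sum()
print(list(subset.columns), subset_bytes < full_bytes)
```

On two rows the saving is tiny, but it grows with the number of rows and skipped columns.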

### Data Formatting

Here we start by dropping any rows with NA values, because there are only two of them (about 0.005% of the roughly 38,000 rows). Then we split the dataset into training and test sets using scikit-learn's default 75/25 split, setting a random seed for reproducibility.

In [3]:
df = df.dropna()
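Before dropping rows, it can help to check how many are affected. The following is a minimal sketch on a made-up frame (the `toy` data is hypothetical), not a cell from the original notebook.

```python
import pandas as pd

# Toy frame with one incomplete row; values are invented.
toy = pd.DataFrame(
    {
        "title": ["Caneca", None, "Manta", "Quadro"],
        "concatenated_tags": ["caneca", "tag", "manta", "quadro"],
        "category": ["Lembrancinhas", "Outros", "Bebê", "Decoração"],
    }
)

# Count rows that contain at least one missing value.
na_rows = toy.isna().any(axis=1).sum()
na_share = na_rows / len(toy)
print(f"{na_rows} row(s) with NA ({na_share:.2%})")

# dropna() removes every row that has a missing value in any column.
cleaned = toy.dropna()
```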

In [4]:
X, y = df[["title", "concatenated_tags"]], df["category"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
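The sketch below shows what the default split does on toy data: with no `test_size` given, `train_test_split` holds out 25% of the samples. It also shows the optional `stratify` argument, which the notebook does not use but which may be worth considering when classes are imbalanced. The toy samples and labels are invented.

```python
from sklearn.model_selection import train_test_split

# 100 toy samples with an imbalanced label: 80 "a" and 20 "b".
X_toy = [[i] for i in range(100)]
y_toy = ["a"] * 80 + ["b"] * 20

# Default test_size is 0.25, so 100 samples split into 75 / 25.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)
print(len(X_tr), len(X_te))

# Optional: stratify keeps the 80/20 label ratio in both parts.
_, _, _, y_te_strat = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)
print(y_te_strat.count("b"))
```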

### Modeling

We are going to vectorize both text columns independently and use them to fit a `MultinomialNB` classifier, which is usually good at handling discrete features such as word counts for text classification.

It's very important that when we transform the validation folds inside cross-validation, or the test set after training, we do so with vectorizers fitted only on the training data. To handle that, and also to streamline the process and help with reproducibility, we will use a `Pipeline` with a custom transformer, `TitleTagsVectorizer`.
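To make the point about fitting on training data concrete, here is a small sketch with invented titles. The vocabulary is learned from the training texts only, so a word that first appears in the test text does not get a column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented example texts, not rows from the dataset.
train_titles = ["caneca personalizada", "manta de bebê"]
test_titles = ["caneca de porcelana"]

cv = CountVectorizer()
cv.fit(train_titles)  # vocabulary comes from the training texts only

# "porcelana" was never seen during fit, so it is silently ignored.
test_vect = cv.transform(test_titles)
print(sorted(cv.vocabulary_))
print(test_vect.toarray())
```

Fitting the vectorizer on the test texts as well would leak information about the test set into the features.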


Note: The multinomial naive Bayes model was chosen because, during an early experimentation phase, it showed good results with fast training times, which lets us try more parameter combinations in the grid search without it taking too much time for this exercise.

In [5]:
class TitleTagsVectorizer(BaseEstimator, TransformerMixin):
    """Custom transformer to vectorize `title` and `concatenated_tags` columns"""

    def __init__(self, title_ngram_range=(1, 1), tags_ngram_range=(1, 1)):
        self.title_ngram_range = title_ngram_range
        self.tags_ngram_range = tags_ngram_range

    def fit(self, X, y=None):
        # Each column gets its own vectorizer, so the two vocabularies stay separate.
        self.title_cv = CountVectorizer(ngram_range=self.title_ngram_range)
        self.title_cv.fit(X["title"])

        self.tags_cv = CountVectorizer(ngram_range=self.tags_ngram_range)
        self.tags_cv.fit(X["concatenated_tags"])

        return self

    def transform(self, X):
        # Reuse the fitted vocabularies; words unseen during fit are ignored.
        title_vect = self.title_cv.transform(X["title"])
        tags_vect = self.tags_cv.transform(X["concatenated_tags"])
        # Stack the two count matrices side by side into one feature matrix.
        return sp.hstack([title_vect, tags_vect], format="csr")
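The column-wise stacking done in `transform` can be sketched in isolation. The example below uses invented titles and tags and two plain `CountVectorizer` instances; it only illustrates how the shapes combine and is not a replacement for the transformer above.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Invented example texts, not rows from the dataset.
titles = ["caneca personalizada", "manta de bebê"]
tags = ["caneca presente", "manta enxoval bebê"]

# One vocabulary per column, so "caneca" in a title and "caneca" in
# the tags end up as two separate features.
title_vect = CountVectorizer().fit_transform(titles)
tags_vect = CountVectorizer().fit_transform(tags)

# hstack keeps the row count and adds up the column counts.
combined = sp.hstack([title_vect, tags_vect])
print(title_vect.shape, tags_vect.shape, combined.shape)
```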

In [6]:
pipe = make_pipeline(TitleTagsVectorizer(), MultinomialNB())

### Model Validation

For hyperparameter optimization and validation we will use scikit-learn's `GridSearchCV`, which performs a cross-validated grid search over the parameter space. By default it uses 5-fold CV (stratified for classifiers).

Note: The parameter grid used here might not contain the best values, but it works well as a demo and achieves a good result.
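The grid keys follow a naming rule that is easy to miss: `make_pipeline` names each step after its lowercased class name, and a step parameter is addressed as `<step name>__<parameter>`. The sketch below shows this on a stand-in pipeline (`toy_pipe` and `toy_grid` are hypothetical names) and counts the candidates with `ParameterGrid`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# make_pipeline names each step after its lowercased class name.
toy_pipe = make_pipeline(CountVectorizer(), MultinomialNB())
step_names = [name for name, _ in toy_pipe.steps]
print(step_names)

# Step parameters are addressed as "<step name>__<parameter>".
toy_grid = {
    "countvectorizer__ngram_range": [(1, 2), (1, 3)],
    "multinomialnb__alpha": [1.0e-5, 1.0e-2, 1],
}
n_candidates = len(ParameterGrid(toy_grid))
print(n_candidates, "candidates,", n_candidates * 5, "fits with 5-fold CV")
```

By the same arithmetic, a grid with 2 × 2 × 3 values has 12 candidates, which means 60 fits under 5-fold CV plus one final refit on the full training set.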

In [7]:
param_grid = {
    "titletagsvectorizer__title_ngram_range": [(1, 2), (1, 3)],
    "titletagsvectorizer__tags_ngram_range": [(1, 2), (1, 3)],
    "multinomialnb__alpha": [1.0e-5, 1.0e-2, 1]
}

# Leave one core free, but never pass n_jobs=0 (invalid on single-core machines).
classifier = GridSearchCV(pipe, param_grid=param_grid, n_jobs=max(1, cpu_count() - 1))
classifier.fit(X_train, y_train)


report = classification_report(y_test, classifier.predict(X_test))
print(report)

with open(os.environ["METRICS_PATH"], "w") as metrics_file:
    metrics_file.write(report)

                    precision    recall  f1-score   support

              Bebê       0.93      0.91      0.92      1780
Bijuterias e Jóias       0.97      0.95      0.96       229
         Decoração       0.91      0.92      0.91      2145
     Lembrancinhas       0.93      0.94      0.94      4381
            Outros       0.83      0.72      0.77       280
       Papel e Cia       0.83      0.81      0.82       685

          accuracy                           0.91      9500
         macro avg       0.90      0.88      0.89      9500
      weighted avg       0.91      0.91      0.91      9500



### Model Export

Here we use cloudpickle instead of pickle or joblib because neither of them handled the serialization of the custom transformer as well as cloudpickle does. Everything is saved into a single .pkl file.

In [8]:
with open(os.environ["MODEL_PATH"], "wb") as model_file:
    cloudpickle.dump(classifier, model_file)
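For completeness, here is a minimal sketch of the round trip, assuming a tiny stand-in model and a temporary file rather than the real `MODEL_PATH`. Files written by cloudpickle can be read back with the standard `pickle` module, provided the libraries the model depends on (here scikit-learn) are installed where it is loaded.

```python
import os
import pickle
import tempfile

import cloudpickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in model; the texts and labels are invented.
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(["caneca presente", "manta bebê"], ["Lembrancinhas", "Bebê"])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # hypothetical path
    with open(model_path, "wb") as model_file:
        cloudpickle.dump(toy_model, model_file)

    # Read the file back with the standard pickle module.
    with open(model_path, "rb") as model_file:
        restored = pickle.load(model_file)

prediction = restored.predict(["manta para bebê"])
print(prediction)
```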