# Training

The training pipeline consists of the following steps:

- [0. Setup](#0.-Setup)
- [1. Data extraction](#1.-Data-extraction)
- [2. Data formatting](#2.-Data-formatting)
- [3. Modeling](#3.-Modeling)
  - [3.1. Bag-of-words feature extraction](#3.1.-Bag-of-words-feature-extraction)
  - [3.2. Choosing the classifier](#3.2.-Choosing-the-classifier)
- [4. Model validation](#4.-Model-validation)
- [5. Model exportation](#5.-Model-exportation)

## 0. Setup

In [1]:
import os
import pandas as pd
import pickle

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

from sklearn.linear_model import SGDClassifier


print('Setup complete!')

Setup complete!


## 1. Data extraction

Loads the training dataset with product data from the path given in the environment variable DATASET_PATH, and the test set from the path given in TEST_PATH.
Only a subset of the features is kept, to reduce the memory needed.

In [2]:
# Read training set
products_train = pd.read_csv(os.environ['DATASET_PATH'])

# Read test set
products_test = pd.read_csv(os.environ['TEST_PATH'])

print("Training data:")
print("%d products" % len(products_train))
print("%d categories" % len(products_train['category'].value_counts()))
print()
print("Test data:")
print("%d products" % len(products_test))
print("%d categories" % len(products_test['category'].value_counts()))
print()
print('Data extraction complete!')

Training data:
38000 products
6 categories

Test data:
500 products
6 categories

Data extraction complete!
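
The feature subset can also be selected while the file is being read, so that unused columns never reach memory. The sketch below is only an illustration: the in-memory CSV and its `product_id` column are hypothetical stand-ins for the real file, and the notebook itself selects the columns after loading.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the file at DATASET_PATH.
csv_text = io.StringIO(
    "product_id,title,price,category\n"
    "1,Caneca azul,10.0,Lembrancinhas\n"
    "2,Quadro,55.5,Decoração\n"
)

# usecols makes pandas parse only the listed columns.
wanted = ["title", "price", "category"]
sample = pd.read_csv(csv_text, usecols=wanted)

print(list(sample.columns))
print(len(sample))
```

With a real file, the same `usecols` list could be passed to the `pd.read_csv` call used above.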


## 2. Data formatting
Processes the dataset to use it for training and validation.

During the experiments we used the GridSearchCV class with 5-fold cross-validation to search for the best set of features. Only the best feature composition is presented here; the exploration path is documented in the experiments.ipynb file.

To reduce memory requirements and simplify processing, we kept only the columns used during training. Then, to avoid pipeline execution problems, we discarded the 60 rows that had missing values.
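
As a rough illustration of the cleaning step, the sketch below counts the rows that contain at least one missing value before dropping them. The three-row DataFrame is made up for the example and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two of the three rows have a missing value.
toy = pd.DataFrame({
    "title": ["Caneca", None, "Quadro"],
    "price": [10.0, 20.0, np.nan],
    "category": ["Lembrancinhas", "Bebê", "Decoração"],
})

# Count the rows that would be discarded, then discard them.
rows_with_missing = int(toy.isna().any(axis=1).sum())
clean = toy.dropna().reset_index(drop=True)

print(rows_with_missing, len(clean))
```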

During the experiments, we noticed that combining the text columns improved the model's performance, so we combined the 'title', 'concatenated_tags' and 'query' columns into a single text feature.

In addition, to integrate the information from the three numeric columns ('price', 'weight', 'minimum_quantity'), we used the k-means clustering algorithm to generate a new text feature ('kmeansPriceWeightMinimumQuantity') that is combined with the others. Each product receives a token such as 'grupo7' that names its cluster. The best result was obtained with 23 clusters.
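
To make the idea concrete, here is a minimal sketch of turning numeric columns into cluster tokens. It assumes toy data with two obvious groups and `n_clusters=2` (the notebook uses 23 clusters on the real data); the values are invented for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): three cheap/light products and three expensive/heavy ones.
toy = pd.DataFrame({
    "price": [10.0, 12.0, 11.0, 500.0, 520.0, 510.0],
    "weight": [100.0, 110.0, 105.0, 9000.0, 9500.0, 9200.0],
    "minimum_quantity": [1, 1, 2, 50, 60, 55],
})

# Scale first so that no single column dominates the distance computation.
cluster_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
labels = cluster_pipeline.fit_predict(toy)

# Turn each cluster id into a word that a bag-of-words model can count.
tokens = "grupo" + pd.Series(labels, index=toy.index).astype(str)
print(tokens.tolist())
```

Products in the same cluster share the same token, so the classifier can treat "similar price, weight and minimum quantity" as just another word.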

It is worth mentioning that the dataset is clearly unbalanced, but balancing it did not improve the performance of the classifier. We think the reason is that the test set is unbalanced with the same class distribution as the training set. Perhaps the teacher did this to make the work easier.
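
One way to check whether two sets share the same class distribution is to compare their normalized category counts. The sketch below uses made-up labels `A`, `B` and `C` rather than the real categories, so it only illustrates the comparison itself.

```python
import pandas as pd

# Hypothetical labels: both sets are unbalanced in the same 60/30/10 proportion.
train_labels = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1)
test_labels = pd.Series(["A"] * 12 + ["B"] * 6 + ["C"] * 2)

# Relative frequency of each class, aligned by class name.
comparison = pd.DataFrame({
    "train": train_labels.value_counts(normalize=True),
    "test": test_labels.value_counts(normalize=True),
}).fillna(0.0)

# Largest difference in class share between the two sets.
max_gap = (comparison["train"] - comparison["test"]).abs().max()
print(comparison)
print(max_gap)
```

On the real data, the same comparison could be run on `products_train['category']` and `products_test['category']`.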

In [3]:
# 2.1 Select the subset of columns used for training
features = ['title', 'concatenated_tags', 'query', 'price', 'weight', 'minimum_quantity', 'category']
products_train = products_train[features]

# 2.2 Drop rows with null/NaN values (only 60 of the 38000 rows are removed)
products_train = products_train.dropna()
products_train = products_train.reset_index(drop=True)

# 2.3 Create a k-means model that groups the three float columns
def create_kmeans(products_train, ncluster=23):
    float_columns = ['price', 'weight', 'minimum_quantity']
    kMeansPipeline = Pipeline(
        [
            ("scaler", StandardScaler()),
            ("kmeans", KMeans(n_clusters=ncluster, random_state=0)),
        ]
    )

    kMeansPipeline.fit(products_train[float_columns].dropna())

    # Save the fitted pipeline so that later runs reuse the same clusters
    file_name = os.environ['KMEANS_PATH']
    with open(file_name, "wb") as open_file:
        pickle.dump(kMeansPipeline, open_file)

    return kMeansPipeline

# 2.4 Auxiliary function: load a saved k-means pipeline; if none is found, create and save one for future use
def load_kmeans():
    file_name = os.environ['KMEANS_PATH']
    try:
        with open(file_name, 'rb') as open_file:
            return pickle.load(open_file)
    except FileNotFoundError:
        return create_kmeans(products_train)

# 2.5 All data formatting steps together
def data_format(X):
    kmeans = load_kmeans()
    kmeansArray = kmeans.predict(X[['price', 'weight', 'minimum_quantity']])
    # Align the cluster labels with X's own index before building the text token
    kmeansSeries = pd.Series(kmeansArray, index=X.index, name="kmeans")
    # 'grupo' ("group") plus the cluster id becomes one extra word per product
    kmeansToken = 'grupo' + kmeansSeries.astype(str)
    return X['title'] + ' ' + X['concatenated_tags'] + ' ' + X['query'] + ' ' + kmeansToken


X_train = data_format(products_train)
y_train = products_train['category']

X_test  = data_format(products_test)
y_test  = products_test['category']

print()
print('Data formatting complete!')


Data formatting complete!


## 3. Modeling
Specifies a model to handle the categorization problem.

During the experiments we used the GridSearchCV class with 5-fold cross-validation to search for the best set of features, pipelines and hyperparameters. Only the best result of the exploration is presented here; the exploration path is documented in the experiments.ipynb file.
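
For readers unfamiliar with this kind of search, the following is a minimal sketch of a 5-fold grid search over a text pipeline. The toy texts, the two-class labels and the small parameter grid are assumptions made for the example; the grid actually explored is the one in experiments.ipynb.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): six samples per class so that 5 folds are possible.
texts = [
    "caneca azul festa", "caneca vermelha festa", "caneca verde brinde",
    "caneca branca brinde", "caneca rosa festa", "caneca preta brinde",
    "quadro sala parede", "quadro quarto parede", "quadro cozinha moldura",
    "quadro madeira moldura", "quadro vidro parede", "quadro metal moldura",
]
labels = ["Lembrancinhas"] * 6 + ["Decoração"] * 6

search_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# "<step>__<parameter>" addresses a parameter of a pipeline step.
param_grid = {
    "vect__binary": [True, False],
    "vect__ngram_range": [(1, 1), (1, 2)],
}

# cv=5 evaluates every parameter combination with 5-fold cross-validation.
search = GridSearchCV(search_pipeline, param_grid, cv=5)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```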


### 3.1. Bag-of-words feature extraction
Most classifiers do not work directly with **text** data. For this reason we used the CountVectorizer class, which implements feature extraction for text columns: it converts the text data into a matrix of token counts (bag-of-words). We also used some extra parameters to improve the performance of the model. The table below lists the best combination of parameters found during the parameter exploration, each followed by a description and a possible explanation of **why** it worked.

| Parameter | Description | Reason why (we think) it worked |
|---|---|---|
| binary=True | If True, all non-zero counts are set to 1. | Our resulting text feature is a combination of multiple text columns, which probably produces some duplicate words. This parameter prevents CountVectorizer from treating duplicated words as more valuable words. |
| max_features=None | Sets no maximum number of features in the bag-of-words. | Using all words created more data. More data, better classifier. |
| max_df=0.5 | Ignores terms that have a document frequency strictly higher than 0.5. | It filters out corpus-specific stop words. It worked well together with the stop word list found in the word clouds. |
| stop_words=stop_portuquese_fromWordclouds | Defines a list of stop words. | This list came from visually inspecting the word clouds of the dataset text columns. It is a small list, but it proved to be enough to give the model its best scores, and it worked well together with the automatic stop word detection of the "max_df" parameter. |
| ngram_range=(1, 2) | Unigrams and bigrams are extracted. | Extracting unigrams and bigrams created more data. More data, better classifier. |
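
The effect of some of these parameters can be seen on a tiny corpus. The sketch below uses invented documents and compares plain counts with `binary=True`, lists the unigrams and bigrams extracted with `ngram_range=(1, 2)`, and shows `max_df=0.5` dropping a word that appears in more than half of the documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical); "caneca" is repeated in the first one.
docs = ["caneca caneca azul", "caneca vermelha"]

counts = CountVectorizer(ngram_range=(1, 2))
binary = CountVectorizer(ngram_range=(1, 2), binary=True)
count_matrix = counts.fit_transform(docs).toarray()
binary_matrix = binary.fit_transform(docs).toarray()

# Unigrams and bigrams found in the two documents.
vocab = sorted(binary.vocabulary_)
print(vocab)

# The repeated word counts 2 with plain counts, but only 1 with binary=True.
caneca_col = binary.vocabulary_["caneca"]
print(count_matrix[0, caneca_col], binary_matrix[0, caneca_col])

# max_df=0.5 drops terms present in more than 50% of the documents:
# "caneca" is in 3 of 4 documents (dropped), "azul" in exactly 2 of 4 (kept).
corpus = ["caneca azul", "caneca vermelha", "caneca verde", "copo azul"]
pruned = CountVectorizer(max_df=0.5).fit(corpus)
pruned_vocab = sorted(pruned.vocabulary_)
print(pruned_vocab)
```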


### 3.2. Choosing the classifier
As beginners, we started exploring possible classifiers from an example found in the scikit-learn documentation: [Sample pipeline for text feature extraction and evaluation](https://scikit-learn.org/0.15/auto_examples/grid_search_text_feature_extraction.html). That example uses an SGDClassifier.

**SGDClassifier** is a generic linear classifier trained with stochastic gradient descent (SGD). The **loss function** parameter defines the type of classifier. The default is the **hinge loss function**, with which the classifier fits a linear **support vector machine (SVM)**. We also explored other loss functions and hyperparameters but, in the end, the defaults provided the best results.

Beyond this classifier, we also tried some ensemble methods, such as RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier and XGBClassifier. Unfortunately, our lack of knowledge, combined with the substantially longer training time of these models, meant that we were not able to make much progress.
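
As a small illustration of how the loss function selects the kind of linear model, the sketch below fits the same pipeline with the default `hinge` loss and with `modified_huber`. The four training texts are invented, and `fit_with_loss` is a hypothetical helper written for this example, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical).
texts = ["caneca azul", "caneca festa", "quadro parede", "quadro moldura"]
labels = ["Lembrancinhas", "Lembrancinhas", "Decoração", "Decoração"]


def fit_with_loss(loss):
    """Fit a bag-of-words + SGDClassifier pipeline with the given loss."""
    model = Pipeline([
        ("vect", CountVectorizer()),
        ("clf", SGDClassifier(loss=loss, random_state=0)),
    ])
    return model.fit(texts, labels)


# "hinge" gives a linear SVM; "modified_huber" is one of the alternative losses.
svm_like = fit_with_loss("hinge")
alternative = fit_with_loss("modified_huber")

# The default loss of SGDClassifier is the hinge loss.
default_loss = SGDClassifier().loss
print(default_loss)

prediction = svm_like.predict(["caneca rosa"])[0]
print(prediction)
```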


In [4]:
# Portuguese stop words chosen by inspecting the word clouds
stop_portuquese_fromWordclouds = ['de', 'do', 'dos', 'com', 'em', 'o', 'e', 'para']

pipeline = Pipeline(
    [
        ("vect", CountVectorizer(
            binary=True,
            max_df=0.5,
            max_features=None,
            ngram_range=(1, 2),
            strip_accents=None,
            stop_words=stop_portuquese_fromWordclouds,
        )),
        ("clf", SGDClassifier(random_state=0)),
    ]
)

classifier = pipeline.fit(X_train, y_train)

print()
print('Modeling complete!')


Modeling complete!


## 4. Model validation
Generates metrics about the model accuracy (precision, recall, F1, etc.) for each category and exports them to a specified path available in the environment variable METRICS_PATH.

We used three metrics:
- **accuracy**: the fraction of predictions that match the target labels. It reaches its best value at 1 and its worst at 0.
- **f1_score by category**: the F1 score can be interpreted as the harmonic mean of precision and recall, which contribute equally to it. It reaches its best value at 1 and its worst at 0. **Precision** is intuitively the ability of the classifier not to label as positive a sample that is negative. **Recall** is intuitively the ability of the classifier to find all the positive samples.
The formula for the F1 score is:

```
F1_score = 2 * (precision * recall) / (precision + recall)

precision = true positives / (true positives + false positives)

recall =  true positives / (true positives + false negatives)
```
- **f1_score micro averaged**: calculates the F1 score globally by counting the total true positives, false negatives and false positives across all categories.
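
The three metrics can be checked by hand on a small example. In the sketch below the labels `a`, `b` and `c` are made up: class `a` has 2 true positives and 1 false negative, class `b` has 2 true positives and 1 false positive, and class `c` is predicted perfectly.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (hypothetical): one "a" sample is misclassified as "b".
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
labels = ["a", "b", "c"]

# 5 of 6 predictions are correct.
accuracy = accuracy_score(y_true, y_pred)

# Class "a": precision 1, recall 2/3, so F1 = 2 * (2/3) / (5/3) = 0.8.
# Class "b": precision 2/3, recall 1, so F1 = 0.8. Class "c": F1 = 1.
f1_per_class = f1_score(y_true, y_pred, average=None, labels=labels)

# Globally: 5 true positives, 1 false positive, 1 false negative, so F1 = 5/6.
f1_micro = f1_score(y_true, y_pred, average="micro", labels=labels)

print(round(accuracy, 3), f1_per_class.round(3), round(f1_micro, 3))
```

When every sample has exactly one label and all labels are included, the micro-averaged F1 score equals the accuracy, which is why both values coincide here.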


In [5]:
def evaluate_test(y_true, y_predicted):
    categories = ['Lembrancinhas', 'Decoração', 'Bebê', 'Papel e Cia', 'Outros', 'Bijuterias e Jóias']
    accuracy = accuracy_score(y_true, y_predicted)
    # Use y_true (not the global products_test) so the function works for any label set
    f1_score_microAveraged = f1_score(y_true, y_predicted, average='micro', labels=categories)
    f1_score_byCategories = f1_score(y_true, y_predicted, average=None, labels=categories)

    # Build the report once, then print it and write it to METRICS_PATH
    lines = ['accuracy: %0.3f' % accuracy, '', 'f1_score micro averaged: %0.3f' % f1_score_microAveraged, '']
    lines += ['f1_score[%s]: %0.3f' % (c, s) for c, s in zip(categories, f1_score_byCategories)]
    report = '\n'.join(lines)
    print(report)

    file_name = os.environ['METRICS_PATH']
    # Category names contain non-ASCII characters, so write the file as UTF-8
    with open(file_name, 'w', encoding='utf-8') as f:
        f.write(report + '\n')

        
y_predicted = classifier.predict(X_test)
evaluate_test(y_test, y_predicted)
    
print()
print('Model validation complete!')

accuracy: 0.962

f1_score micro averaged: 0.962

f1_score[Lembrancinhas]: 0.974
f1_score[Decoração]: 0.968
f1_score[Bebê]: 0.951
f1_score[Papel e Cia]: 0.880
f1_score[Outros]: 0.929
f1_score[Bijuterias e Jóias]: 0.952

Model validation complete!


## 5. Model exportation
Exports a candidate model to a specified path available in the environment variable MODEL_PATH.

In [6]:
file_name = os.environ['MODEL_PATH']
with open(file_name, "wb") as open_file:
    pickle.dump(classifier, open_file)
    
print('Model exportation complete!')    

Model exportation complete!
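
To use the exported model later, it can be loaded back with `pickle.load`. The sketch below is a self-contained round trip under stated assumptions: it trains a toy pipeline on invented texts and writes it to a temporary file instead of MODEL_PATH, so it does not touch the real exported model.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy model (hypothetical) standing in for the trained classifier.
texts = ["caneca azul", "caneca festa", "quadro parede", "quadro moldura"]
labels = ["Lembrancinhas", "Lembrancinhas", "Decoração", "Decoração"]
toy_model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
]).fit(texts, labels)

# Export to a temporary file (instead of MODEL_PATH) and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    with open(model_path, "wb") as open_file:
        pickle.dump(toy_model, open_file)
    with open(model_path, "rb") as open_file:
        restored = pickle.load(open_file)

# The restored pipeline predicts exactly like the original one.
new_texts = ["caneca rosa", "quadro grande"]
same = list(restored.predict(new_texts)) == list(toy_model.predict(new_texts))
print(same)
```

Note that pickle files should only be loaded from trusted sources, and the restored pipeline expects text formatted the same way as in step 2.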
