# 1. Introduction 

This tutorial presents a walkthrough of the Consensus Categorization (C²) method.

The C² method is a supervised and transductive learning method based on the Consensus Clustering method that I investigated during my [doctorate at ICMC-USP](http://www.teses.usp.br/teses/disponiveis/55/55134/tde-05082015-094733/en.php).

Here, the C² method has been adapted to handle the dataset provided by [Meli Data Challenge 2019](https://ml-challenge.mercadolibre.com/).

In short, the C² method has the following steps:

* Preprocess product titles by removing stopwords (English, Portuguese, and Spanish), numbers, and special characters. Source: [meli/preprocess.py](meli/preprocess.py).
* Learn a textual representation for product titles by using [fasttext word embeddings](https://fasttext.cc/). Such a representation is useful for initializing classification models.
* Get different dataset samples, both by sampling instances and features. Source: [meli/sampling.py](meli/sampling.py)
* Get different classification models for each sampling. It is important that there is diversity in classification model solutions. Source: [meli/models.py](meli/models.py)
* Build a [heterogeneous network](https://ieeexplore.ieee.org/abstract/document/7536145/) with the following node types: product, terms, and classification models. Some network nodes are labeled considering the training set and the categories predicted by the classification models. The heterogeneous network is regularized through a [consensus function](https://dl.acm.org/citation.cfm?id=2983730) that will return the final categorization. Source: [meli/consensus.py](meli/consensus.py)

The C² method ranked fourth (private leaderboard) in the Meli Data Challenge 2019. It can be improved by either adding more classification models or tuning the consensus function.


In [None]:
import pickle
import random

import numpy as np
import pandas as pd
import keras
from gensim.models import KeyedVectors
from keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# code for the Meli Data Challenge
from meli import preprocess as pretext
from meli import sampling
from meli import models
from meli import consensus

# 2. Dataset Preprocessing

* Let's create a directory to store the dataset and then download it from the Meli 2019 repository.

In [None]:
!mkdir dataset
!wget -c -P ./dataset/ 'https://meli-data-challenge.s3.amazonaws.com/train.csv.gz'
!wget -c -P ./dataset/ 'https://meli-data-challenge.s3.amazonaws.com/test.csv'
!cd ./dataset/; gunzip train.csv.gz #To unzip the dataset

* Viewing the training set and the test set using pandas.

In [33]:
df_train = pd.read_csv('dataset/train.csv')
df_train

Unnamed: 0,title,label_quality,language,category
0,Hidrolavadora Lavor One 120 Bar 1700w Bomba A...,unreliable,spanish,ELECTRIC_PRESSURE_WASHERS
1,Placa De Sonido - Behringer Umc22,unreliable,spanish,SOUND_CARDS
2,Maquina De Lavar Electrolux 12 Kilos,unreliable,portuguese,WASHING_MACHINES
3,Par Disco De Freio Diant Vent Gol 8v 08/ Frema...,unreliable,portuguese,VEHICLE_BRAKE_DISCS
4,Flashes Led Pestañas Luminoso Falso Pestañas P...,unreliable,spanish,FALSE_EYELASHES
...,...,...,...,...
19999995,Brochas De Maquillaje Kylie Set De 12 Unidades,unreliable,spanish,MAKEUP_BRUSHES
19999996,Trimmer Detailer Wahl + Kit Tijeras Stylecut,reliable,spanish,HAIR_CLIPPERS
19999997,Bateria Portátil 3300 Mah Power Bank Usb Max...,unreliable,portuguese,PORTABLE_CELLPHONE_CHARGERS
19999998,"Palo De Hockey Grays Nano 7 37,5''",unreliable,spanish,FIELD_HOCKEY_STICKS


In [34]:
df_test = pd.read_csv('dataset/test.csv')
df_test

Unnamed: 0,id,title,language
0,0,Kit Maternidade Bolsa-mala Baby/bebe Vinho Men...,portuguese
1,1,Trocador De Fraldas Fisher Price Feminino Rosa...,portuguese
2,2,Motor Ventoinha - Fiat Idea / Palio 1.8 - A 04...,portuguese
3,3,Amortecedor Mola Batente D Dir New Civic 14 - ...,portuguese
4,4,Cadeirinha De Carro Bebê Princesa Princess 9 A...,portuguese
...,...,...,...
246950,246950,Disco Freno Delantero Ford Escort 88/94 Nuevo,spanish
246951,246951,Radio Comunicador Walk Talk Baofeng 777s Profi...,portuguese
246952,246952,Calculadora De Escritorio Grande 150$,spanish
246953,246953,Conj Mesa P/ Sala De Jantar C/ 06 Cadeiras Ams...,portuguese


* Next, apply a simple text preprocessing to the product titles: remove stopwords (English, Portuguese, and Spanish), numbers, and special characters using the `clean_text()` function available in [meli/preprocess.py](meli/preprocess.py). This operation may take a while.

In [None]:
tqdm.pandas() 

df_train['title_clean'] = df_train.progress_apply(lambda x: pretext.clean_text(x['title'], x['language']), axis=1)
df_test['title_clean'] = df_test.progress_apply(lambda x: pretext.clean_text(x['title'], x['language']), axis=1)
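
To make the preprocessing step more concrete, here is a minimal sketch of what a function like `clean_text()` could do. It is not the code in [meli/preprocess.py](meli/preprocess.py): the name `clean_text_sketch`, the tiny stopword lists, and the exact rules are assumptions for illustration, and it does not reproduce every behavior seen in the real output (for example, merging `120 Bar` into `120bar`).

```python
import re
import unicodedata

# Tiny illustrative stopword lists (assumption); real lists are much larger.
STOPWORDS = {
    "spanish": {"de", "la", "el", "y", "para", "con"},
    "portuguese": {"de", "da", "do", "e", "para", "com"},
    "english": {"the", "of", "and", "for", "with"},
}

def clean_text_sketch(title, language):
    """Lowercase, strip accents, drop special characters, numbers, and stopwords."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", title.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or space with a space.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    stop = STOPWORDS.get(language, set()) | STOPWORDS["english"]
    # Drop stopwords and tokens made only of digits.
    tokens = [t for t in text.split() if t not in stop and not t.isdigit()]
    return " ".join(tokens)

print(clean_text_sketch("Placa De Sonido - Behringer Umc22", "spanish"))
print(clean_text_sketch("Maquina De Lavar Electrolux 12 Kilos", "portuguese"))
```

On these two titles the sketch happens to give the same result as the `title_clean` column shown later in this section.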


* I used the pickle library to save a binary version of the preprocessed dataset.

In [None]:
with open('./dataset/df_train.pd', 'wb') as f:
    pickle.dump(df_train, f)
with open('./dataset/df_test.pd', 'wb') as f:
    pickle.dump(df_test, f)

In [38]:
# Reload here if the dataset has already been preprocessed.
#df_train = pickle.load( open( "./dataset/df_train.pd", "rb" ) )
#df_test = pickle.load( open( "./dataset/df_test.pd", "rb" ) )

In [39]:
df_train[['title','title_clean']]

Unnamed: 0,title,title_clean
0,Hidrolavadora Lavor One 120 Bar 1700w Bomba A...,hidrolavadora lavor one 120bar 1700w bomba alu...
1,Placa De Sonido - Behringer Umc22,placa sonido behringer umc22
2,Maquina De Lavar Electrolux 12 Kilos,maquina lavar electrolux kilos
3,Par Disco De Freio Diant Vent Gol 8v 08/ Frema...,par disco freio diant vent gol 8v fremax bd5298
4,Flashes Led Pestañas Luminoso Falso Pestañas P...,flashes led pestanas luminoso falso pestanas p...
...,...,...
19999995,Brochas De Maquillaje Kylie Set De 12 Unidades,brochas maquillaje kylie set unidades
19999996,Trimmer Detailer Wahl + Kit Tijeras Stylecut,trimmer detailer wahl kit tijeras stylecut
19999997,Bateria Portátil 3300 Mah Power Bank Usb Max...,bateria portatil 3300mah power bank usb maxprint
19999998,"Palo De Hockey Grays Nano 7 37,5''",palo hockey grays nano


# 3. Representation Learning

Some classification models may benefit from using word embeddings. I used fasttext to create word vectors from product titles.

* Download and compile fasttext (compilation environment required).

In [None]:
!wget https://github.com/facebookresearch/fastText/archive/v0.9.1.zip
!unzip v0.9.1.zip
!cd fastText-0.9.1; make # You must have a compatible compiler (e.g., the build-essential package on Debian or Ubuntu)

* Train word representations with skip-gram on the product titles.
* The word embeddings file (product_titles.vec) will be saved in the ./dataset directory.

In [None]:
print('Saving product titles...')
df_product_titles = pd.concat([ df_train[['title_clean']] , df_test[['title_clean']] ])
df_product_titles.to_csv('./dataset/product_titles.txt',index=False,header=None)

In [None]:
print('Learning word embeddings')
!cd fastText-0.9.1; ./fasttext skipgram -dim 300 -minCount 2 -wordNgrams 2 -minn 0 -maxn 0 -input ../dataset/product_titles.txt -output ../dataset/product_titles
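
As a rough illustration of what skip-gram trains on, the sketch below enumerates (center, context) word pairs from one cleaned title. The function name `skipgram_pairs` and the window size of 1 are assumptions chosen for readability; fastText builds its training examples internally and uses a larger default window.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("placa sonido behringer umc22".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each word vector is adjusted so that it helps predict the context words it is paired with, which is why words that appear in similar titles end up close together.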

* Let's check our word embeddings:

In [31]:
print('Loading word embeddings model...')
title_embedding = KeyedVectors.load_word2vec_format('./dataset/product_titles.vec', binary=False)
print('Loading word embeddings model... OK')

word = 'samsung'
print('Words that are similar to ' + word + '.')
title_embedding.most_similar(word)

Loading word embeddings model...
Loading word embeddings model... OK
Words that are similar to samsung.


[('sansung', 0.8280077576637268),
 ('samsumg', 0.8134722709655762),
 ('sansumg', 0.7443801164627075),
 ('samung', 0.7332087755203247),
 ('samgung', 0.7327459454536438),
 ('samsug', 0.723293662071228),
 ('samsun', 0.7043582797050476),
 ('samnsung', 0.6975691914558411),
 ('smasung', 0.6821693181991577),
 ('s5367', 0.6568795442581177)]

In [32]:
print('Words that are similar to smartphone.')
title_embedding.most_similar('smartphone')

Words that are similar to smartphone.


[('celular', 0.8157174587249756),
 ('smartphones', 0.6840193271636963),
 ('smartpho', 0.6817572712898254),
 ('smarthphone', 0.6655287742614746),
 ('smartfones', 0.6596484184265137),
 ('celulares', 0.6565076112747192),
 ('smartphon', 0.6550799012184143),
 ('ceular', 0.6521972417831421),
 ('leeremi', 0.6448410749435425),
 ('jmxl7', 0.644024133682251)]
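
The scores in these lists are cosine similarities between word vectors. The sketch below reproduces that ranking on made-up 3-dimensional vectors; `most_similar_sketch` and the toy vectors are assumptions for illustration, not part of gensim.

```python
import numpy as np

def most_similar_sketch(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue
        scores.append((other, float(query @ (vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors (made up): misspellings point in almost the same direction.
toy_vectors = {
    "samsung": np.array([1.0, 0.0, 0.0]),
    "sansung": np.array([0.9, 0.1, 0.0]),
    "celular": np.array([0.5, 0.5, 0.0]),
    "mesa": np.array([0.0, 0.0, 1.0]),
}
for other, score in most_similar_sketch("samsung", toy_vectors):
    print(other, round(score, 3))
```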

* Save the word embedding index in a binary format.

In [None]:
embeddings_index = {}
with open('./dataset/product_titles.vec', encoding='utf-8') as f:
    next(f)  # skip the header line (vocabulary size and vector dimension)
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs

with open('./dataset/product_titles.index.vec', 'wb') as f:
    pickle.dump(embeddings_index, f)

# embeddings_index = pickle.load(open('./dataset/product_titles.index.vec', 'rb'))  # Use this if the embeddings have already been generated.
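
The embedding index is later used to initialize the embedding layer of the neural networks. The sketch below shows one common way to turn such an index into an embedding matrix; it is an assumption about what a helper like `models.data_input` might do, and `build_embedding_matrix` and the toy data are hypothetical names.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, dim):
    """Row i holds the vector of the word whose tokenizer index is i."""
    # Index 0 is reserved for padding, hence the +1.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector  # words without a vector stay all-zero
    return matrix

# Toy data (made up): 2-dimensional vectors and a 3-word vocabulary.
toy_index = {"samsung": np.array([0.1, 0.2]), "celular": np.array([0.3, 0.4])}
toy_word_index = {"samsung": 1, "celular": 2, "desconocido": 3}
matrix = build_embedding_matrix(toy_word_index, toy_index, dim=2)
print(matrix.shape)
```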

# 4. Learning the Classification Models from Sampling

Sampling was performed in two ways: balanced and unbalanced. Balanced sampling draws the same number of instances per class (when possible) and gives higher priority to instances marked as "reliable". Unbalanced sampling draws instances at random while guaranteeing at least one instance per class.
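
The two strategies can be sketched with pandas as follows. These are not the functions in [meli/sampling.py](meli/sampling.py): the names `balanced_sampling_sketch` and `unbalanced_sampling_sketch`, their parameters, and the toy data are assumptions for illustration.

```python
import pandas as pd

def balanced_sampling_sketch(df, per_class, seed=0):
    """Up to `per_class` rows per category, taking 'reliable' rows first."""
    shuffled = df.sample(frac=1, random_state=seed)
    # False sorts before True, so reliable rows come first; mergesort is
    # stable and keeps the random order within each priority group.
    priority = shuffled["label_quality"].ne("reliable")
    ordered = shuffled.assign(_priority=priority).sort_values("_priority", kind="mergesort")
    return ordered.groupby("category").head(per_class).drop(columns="_priority")

def unbalanced_sampling_sketch(df, n, seed=0):
    """Random rows, plus one guaranteed row for every category."""
    one_per_class = df.groupby("category").sample(n=1, random_state=seed)
    rest = df.drop(one_per_class.index)
    n_extra = max(0, min(n - len(one_per_class), len(rest)))
    extra = rest.sample(n=n_extra, random_state=seed)
    return pd.concat([one_per_class, extra])

toy = pd.DataFrame({
    "title_clean": ["a", "b", "c", "d", "e", "f"],
    "label_quality": ["unreliable", "reliable", "unreliable",
                      "unreliable", "unreliable", "reliable"],
    "category": ["X", "X", "X", "X", "Y", "Y"],
})
balanced = balanced_sampling_sketch(toy, per_class=2)
unbalanced = unbalanced_sampling_sketch(toy, n=4)
print(balanced["category"].value_counts().to_dict())
print(sorted(unbalanced["category"].unique()))
```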

For each sampling, three different neural network models are trained: LSTM, GRU, and CNN. The parameters of each model are set randomly to increase the diversity of the solutions.

Each model is first trained on a balanced sampling. A second model is obtained by continuing that training on an unbalanced sampling. This process is repeated over k different samplings. I used k = 20 in the Meli Data Challenge 2019, generating 20 × 3 × 2 = 120 models in total.
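
The model count follows from enumerating every (sampling, architecture, training stage) combination, as in this small sketch. The stage names `v1` and `v2` mirror the file suffixes used when the models are saved; the variable names are otherwise illustrative.

```python
from itertools import product

k = 20
architectures = ["LSTM", "GRU", "CNN"]
stages = ["v1", "v2"]  # v1: balanced sampling, v2: continued on unbalanced sampling

schedule = list(product(range(k), architectures, stages))
print(len(schedule))  # 20 * 3 * 2 = 120
```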

This step can be adapted to run in parallel if you have multiple GPUs.

In [None]:
k=20  # Tip: Use a low value of k for testing purposes.

!mkdir models # dir to save classification models and prediction data.

epochs = [5,7,10,15]
batch_size = [64,128,256,512] # Configure according to your hardware capacity.

for num_iter in range(0,k): # number of balanced samplings 
    
    print('Iteration '+str((num_iter+1))+'/'+str(k))
    
    # get balanced sampling
    print('Getting balanced sampling...')
    df_train_balanced_sampling = sampling.balanced_sampling(df_train)
    #df_train_balanced_sampling = pickle.load( open( "./dataset/df_train_balanced_sampling.pd", "rb" ) ) 
    
    # data tokenization for neural network
    print('Data tokenization...')
    tokenizer, embedding_matrix, MAX_NB_WORDS, MAX_SEQUENCE_LENGTH, nb_words = models.data_input(df_train_balanced_sampling, embeddings_index)
    
    # shuffle data
    df_train_balanced_sampling = df_train_balanced_sampling.sample(frac=1)

    # get labels
    labels = pd.get_dummies(df_train_balanced_sampling['category'])
    Y = labels.values
    number_of_classes = Y.shape[1]

    # tokenize X training data
    X = tokenizer.texts_to_sequences(df_train_balanced_sampling['title_clean'].apply(str))
    X = keras.preprocessing.sequence.pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)

    X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10)
    print('X_train and Y_train shapes:')
    print(X_train.shape,Y_train.shape)
    print('X_test and Y_test shapes:')
    print(X_test.shape,Y_test.shape)
    
    # codes for categories: column position in the one-hot labels -> category name
    print('Generating category codes...')
    category_index = dict(enumerate(labels.columns))

    # tokenize Z test data
    Z = tokenizer.texts_to_sequences(df_test['title_clean'].apply(str))
    Z = keras.preprocessing.sequence.pad_sequences(Z, maxlen=MAX_SEQUENCE_LENGTH)
    
    # generate models (neural networks) using random parameters
    model_dic = {}
    print('GRU Model..')
    model_dic['GRU'] = models.TextGRU(nb_words, MAX_SEQUENCE_LENGTH, 300, embedding_matrix, number_of_classes)
    print('LSTM Model..')
    model_dic['LSTM'] = models.TextLSTM(nb_words, MAX_SEQUENCE_LENGTH, 300, embedding_matrix, number_of_classes)
    print('CNN Model..')
    model_dic['CNN'] = models.TextCNN(nb_words, MAX_SEQUENCE_LENGTH, 300, embedding_matrix, number_of_classes)
    
    
    # training neural network from balanced dataset
    for model_name in model_dic:
        r = random.randrange(100000, 999999)
        model_file = 'models/'+model_name+'_'+str(r)+".v1.model"
        print('Learning classification model '+model_file)
        model = model_dic[model_name]
        model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=random.choice(epochs), batch_size=random.choice(batch_size), callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
        model.save(model_file)
        pickle.dump(tokenizer, open(model_file+'.tokenizer', 'wb'))

        preds = model.predict(Z)

        probs = []
        labels_preds = []
        confidences = []
        counter = 0
        for index,row in df_test.iterrows():
          pred = preds[counter]
          confidence = np.max(pred)
          label_pred = category_index[np.argmax(pred)]
          probs.append(pred)
          labels_preds.append(label_pred)
          confidences.append(confidence)
          counter+=1

        df_test['probs']=probs
        df_test['confidence']=confidences
        df_test['label']=labels_preds

        df_test.to_csv(model_file+'.test.csv') # save model predictions
        
    
    # get unbalanced sampling
    df_train_unbalanced_sampling = sampling.unbalanced_sampling(df_train)

    labels = pd.get_dummies(df_train_unbalanced_sampling['category'])
    Y = labels.values
    number_of_classes = Y.shape[1]

    X = tokenizer.texts_to_sequences(df_train_unbalanced_sampling['title_clean'].apply(str))
    X = keras.preprocessing.sequence.pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)

    X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.10)
    print(X_train.shape,Y_train.shape)
    print(X_test.shape,Y_test.shape)
    
    # training neural network from unbalanced dataset 
    for model_name in model_dic:
        r = random.randrange(100000, 999999)
        model_file = 'models/'+model_name+'_'+str(r)+".v2.model"
        print('Learning classification model '+model_file)
        model = model_dic[model_name]
        model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=random.choice(epochs), batch_size=random.choice(batch_size), callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
        model.save(model_file)
        pickle.dump(tokenizer, open(model_file+'.tokenizer', 'wb'))


        preds = model.predict(Z)

        probs = []
        labels_preds = []
        confidences = []
        counter = 0
        for index,row in df_test.iterrows():
          pred = preds[counter]
          confidence = np.max(pred)
          label_pred = category_index[np.argmax(pred)]
          probs.append(pred)
          labels_preds.append(label_pred)
          confidences.append(confidence)
          counter+=1

        df_test['probs']=probs
        df_test['confidence']=confidences
        df_test['label']=labels_preds

        df_test.to_csv(model_file+'.test.csv')# save model predictions
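
The per-row loops above decode each probability vector into a category and a confidence. On a large test set the same decoding can be done in one vectorized step, as in this sketch; `decode_predictions` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np

def decode_predictions(preds, category_index):
    """Map each row of class probabilities to (category name, confidence)."""
    best = preds.argmax(axis=1)        # most likely class per row
    confidences = preds.max(axis=1)    # probability of that class
    labels = [category_index[int(i)] for i in best]
    return labels, confidences

toy_preds = np.array([[0.1, 0.9], [0.7, 0.3]])
toy_categories = {0: "SOUND_CARDS", 1: "WASHING_MACHINES"}
labels, confidences = decode_predictions(toy_preds, toy_categories)
print(labels, confidences)
```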

# 5. Consensus Function using Transductive Learning

Here, training data and model predictions are combined into a heterogeneous network. The consensus function is a network regularization through label propagation, where the target nodes are the test instances.

* If many models agree on the category of a test instance, then there is a good chance that this category will be maintained during regularization.

* If most models disagree, the consensus function tends to identify weak models and reduce their importance. Moreover, the consensus function relies more on training data to identify the final product category.

* Here, the predictions of each model are summarized. In the Meli Data Challenge, predictions with confidence greater than or equal to 0.9 had higher priority. It was also ensured that each test instance had at least 5 predictions from different models.

In [None]:
prediction_data = consensus.get_prediction_data()
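
One way to read the selection rule described above is sketched below: keep every prediction at or above the confidence threshold, and fall back to the most confident ones when fewer than the minimum remain. This is an assumption about the rule, not the code behind `consensus.get_prediction_data()`; `select_predictions` and the toy predictions are hypothetical.

```python
def select_predictions(predictions, threshold=0.9, min_predictions=5):
    """predictions: list of (model_name, label, confidence) for one test instance."""
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    confident = [p for p in ranked if p[2] >= threshold]
    if len(confident) >= min_predictions:
        return confident
    # Not enough confident predictions: top up with the next most confident ones.
    return ranked[:min_predictions]

toy_predictions = [
    ("GRU_1", "SOUND_CARDS", 0.95),
    ("CNN_1", "SOUND_CARDS", 0.92),
    ("LSTM_1", "SOUND_CARDS", 0.60),
    ("GRU_2", "HAIR_CLIPPERS", 0.50),
    ("CNN_2", "SOUND_CARDS", 0.40),
    ("LSTM_2", "MAKEUP_BRUSHES", 0.30),
]
selected = select_predictions(toy_predictions)
print([name for name, _, _ in selected])
```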

* Generate the heterogeneous network.

In [None]:
df_train_balanced_sampling = sampling.balanced_sampling(df_train)
G, label_to_code, code_to_label = consensus.generate_network(df_train_balanced_sampling, df_test, prediction_data)
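
To give an idea of the network structure, the sketch below builds a tiny product/term/model network with plain dictionaries. It is not `consensus.generate_network`: the function name, the node naming (apart from the `:doc_test` suffix, which appears in this tutorial), and the toy data are assumptions for illustration.

```python
from collections import defaultdict

def build_network_sketch(train_rows, test_rows, predictions):
    """Return (adjacency, labels) for a tiny product/term/model network."""
    adjacency = defaultdict(set)
    labels = {}

    def connect(a, b):
        adjacency[a].add(b)
        adjacency[b].add(a)

    for i, (title, category) in enumerate(train_rows):
        node = f"{i}:doc_train"
        labels[node] = category  # labeled by the training set
        for term in title.split():
            connect(node, f"{term}:term")

    for i, title in enumerate(test_rows):
        node = f"{i}:doc_test"   # unlabeled target node
        for term in title.split():
            connect(node, f"{term}:term")
        for model_name, category in predictions.get(i, []):
            model_node = f"{model_name}:{category}:model"
            labels[model_node] = category  # labeled by the model's prediction
            connect(node, model_node)

    return adjacency, labels

train_rows = [("placa sonido behringer", "SOUND_CARDS"),
              ("maquina lavar electrolux", "WASHING_MACHINES")]
test_rows = ["placa sonido usb"]
predictions = {0: [("GRU_1", "SOUND_CARDS"), ("CNN_1", "SOUND_CARDS")]}
adjacency, labels = build_network_sketch(train_rows, test_rows, predictions)
print(sorted(adjacency["0:doc_test"]))
```

In this toy network the test product is linked to the labeled training product through the shared terms `placa` and `sonido`, and to two model nodes that both carry the label `SOUND_CARDS`.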

* The regularization process to identify a consensus is carried out below. This operation may take a while.

In [None]:
G = consensus.regularization(G, code_to_label)
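
The idea behind regularization through label propagation can be sketched in a few lines of NumPy: every node repeatedly takes the average label vector of its neighbors, while labeled nodes are reset to their known labels. This is a generic label propagation sketch under simplifying assumptions, not the implementation of `consensus.regularization` (which, as described above, also reduces the importance of weak models); `propagate_labels` and the toy graph are hypothetical.

```python
import numpy as np

def propagate_labels(W, Y, labeled, iterations=50):
    """W: (n, n) symmetric weights; Y: (n, c) one-hot rows for labeled nodes."""
    degree = W.sum(axis=1, keepdims=True)
    degree[degree == 0] = 1.0  # isolated nodes keep a zero vector
    F = Y.astype(float).copy()
    for _ in range(iterations):
        F = (W @ F) / degree      # each node averages its neighbors
        F[labeled] = Y[labeled]   # labeled nodes are clamped to their labels
    return F

# Toy graph: nodes 0-2 are model nodes, node 3 is a test product linked to all three.
W = np.array([[0, 0, 0, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 1],
              [1, 1, 1, 0]], dtype=float)
# Two models vote for class 0, one for class 1; the test node starts unlabeled.
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])
F = propagate_labels(W, Y, labeled=[0, 1, 2])
print(F[3], int(np.argmax(F[3])))
```

The test node ends up with the class supported by most of its neighbors, which matches the intuition that agreement among models tends to be preserved during regularization.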

* After regularization, save the final consensus prediction (categories) for submission in Meli 2019 (file: meli_submission.csv).

In [41]:
import numpy as np
labels_consensus = []
for index,row in df_test.iterrows():
    f = G.nodes[str(index)+':doc_test']['f']
    label = code_to_label[np.argmax(f)]
    labels_consensus.append(label)

df_test['category'] = labels_consensus

df_test[['id','category']].to_csv('meli_submission.csv',index=False)