# Text classification example

In this notebook we are going to train a recurrent neural network that classifies texts from the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/).

The main purpose of this example is to illustrate how to use the pylat library to solve a complete text classification problem.

In [None]:
%run init.py

## Loading the data
We are going to use the sklearn.datasets module to download the texts and cache them in the 'data' directory.

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['rec.autos', 'rec.sport.baseball', 'rec.sport.hockey']
newsgroups_train = fetch_20newsgroups(data_home='./data', subset='train', categories=categories)

texts = newsgroups_train.data
labels = newsgroups_train.target

The following cell shows an example text from the dataset. It contains some metadata ('from', 'subject', 'nntp-posting-host', 'organization'...) that could be separated from the main text and processed to potentially improve the performance of our final classifier. However, for the purpose of this example we will just work with the complete piece of text.

In [None]:
texts[0]

## Preparing the data
In this section we are going to preprocess the texts before feeding them to our recurrent neural network. The following steps will be explained:
* Preprocessing of the text: This includes tokenization, removal of stop words and lemmatization.
* Training a Word2Vec model that maps tokens to a vector representation.
* Using the trained Word2Vec model to convert each token in the texts to vectors that can be fed to the neural network.

### Text preprocessing
Pylat provides a TextPreprocessor class that takes care of tokenization, stop word removal and lemmatization. The constructor receives the following parameters:
* remove_stop_words: boolean indicating if stop words should be removed or not.
* lemmatize: boolean indicating if the words should be lemmatized.
* spacy_model_id: language to be used internally for tokenization of the text. Supported languages right now are 'en' for English and 'es' for Spanish. An English tokenizer is used by default.
* additional_pipes: Iterable of callables that can be provided by the user to perform additional preprocessing steps.

In the following cell we are going to create a TextPreprocessor object to tokenize and remove the stop words of our texts:

In [None]:
from pylat.wrapper.transformer.text_preprocessor import TextPreprocessor

preprocessor = TextPreprocessor(remove_stop_words=True, lemmatize=False)
preprocessed_texts = preprocessor.fit_transform(texts)
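
To make these preprocessing steps more concrete, the following is a minimal pure-Python sketch of tokenization and stop word removal. It is not pylat's implementation (which, according to the parameter list above, relies on a spaCy model); the `simple_preprocess` function, the regular expression and the tiny `STOP_WORDS` set are assumptions made only for illustration, and lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stop word list (real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}


def simple_preprocess(text, remove_stop_words=True):
    """Lowercase the text, split it into word tokens and optionally drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if remove_stop_words:
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    return tokens


example_tokens = simple_preprocess("The goalie is the best player in the league.")
print(example_tokens)
```

Each text becomes a list of tokens, which is the kind of input that the Word2Vec training step in the next section expects.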

### Word2Vec

Now that we have our texts preprocessed, we can feed them to Word2Vec to train the word embedding model:

In [None]:
from gensim.models.word2vec import Word2Vec

w2v_model = Word2Vec(preprocessed_texts, size=50, alpha=0.025, window=5, min_count=3,
                     max_vocab_size=None, sample=0.001, seed=42, workers=3, iter=100, min_alpha=0.0001)

With our model ready, pylat provides a class to transform the tokens to their vector representation. It also provides a SentencePadder transformer to make sure that all of our texts have the same size after being preprocessed and converted to a list of vectors.

In [None]:
from pylat.neuralnet.embeddings import Word2VecEmbedding
from pylat.wrapper.transformer import SentencePadder, WordEmbeddingsTransformer

w2v_embedding = Word2VecEmbedding(model=w2v_model)
w2v_transformer = WordEmbeddingsTransformer(embeddings=w2v_embedding, to_id=True)
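
Conceptually, these two steps map every token to an integer id and then pad or truncate each text to a common length. The sketch below illustrates the idea in plain Python. The helpers `build_vocab`, `to_ids` and `pad_sequences`, the reserved ids 0 and 1, and the choice of padding on the right are all assumptions and may differ from what pylat does internally.

```python
from collections import Counter

PAD_ID, UNK_ID = 0, 1  # hypothetical reserved ids for padding and unknown tokens


def build_vocab(tokenized_texts, min_count=1):
    """Assign an integer id to every token seen at least min_count times."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    kept = sorted(tok for tok, count in counts.items() if count >= min_count)
    return {tok: idx + 2 for idx, tok in enumerate(kept)}


def to_ids(tokens, vocab):
    """Replace each token by its id, using UNK_ID for tokens outside the vocabulary."""
    return [vocab.get(tok, UNK_ID) for tok in tokens]


def pad_sequences(sequences, max_len=None):
    """Truncate or right-pad every sequence so that all have length max_len."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    return [seq[:max_len] + [PAD_ID] * (max_len - len(seq)) for seq in sequences]


docs = [["fast", "car"], ["hockey", "game", "tonight"], ["car"]]
vocab = build_vocab(docs)
padded = pad_sequences([to_ids(doc, vocab) for doc in docs])
print(padded)
```

After padding, all texts have the same length, so they can be stacked into a single rectangular array and processed in batches.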

In the following cell we put everything together to create our final data pipeline. This pipeline can transform any text from our dataset to a vector representation that can be directly fed to our recurrent neural network:

In [None]:
from sklearn.pipeline import Pipeline

w2v_data_pipeline = Pipeline(steps=[('preprocessing', preprocessor), 
                                    ('w2v', w2v_transformer), 
                                    ('padder', SentencePadder())])
X_train_w2v = w2v_data_pipeline.fit_transform(texts)

In [None]:
y_train = labels

## Neural network creation

After preprocessing the texts, we can move on to the creation of the neural network.

### Creating a neural network with a specific architecture
Pylat provides several classes to build and customize the architecture of our neural network. The package pylat.neuralnet.rnn contains different implementations of network layers and cells that we can use in our classifier. In this example we are going to build a neural network with a recurrent layer and a dense layer.

If we want to add more layers to the network, we just have to add additional RecurrentLayer or BidirectionalRecurrentLayer objects to the 'rnn_layers' list:

In [None]:
from pylat.neuralnet import DenseLayer
from pylat.neuralnet.rnn import BidirectionalRecurrentLayer, RecurrentLayer, \
    LSTMCellFactory, GRUCellFactory
from pylat.wrapper.predictor import RNNWrapper

rnn_layers = [RecurrentLayer(50, dropout_rate=0.35,
                             cell_factory=GRUCellFactory(),
                             cell_dropout=0.45)]
fc_layers = [DenseLayer(20, activation='relu', dropout_rate=0.3)]

rnn_w2v = RNNWrapper(embeddings=w2v_embedding, 
                     rnn_layers=rnn_layers, fc_layers=fc_layers,
                     batch_size=50, early_stopping=False,
                     learning_rate=1e-3, num_epochs=12,
                     save_dir='results/rnn')
rnn_w2v.fit(X_train_w2v, y_train)

### Fine-tuning parameters

If we want to optimize the parameters, we can make use of the RandomizedSearchCV class provided by scikit-learn. We will first define a dict with the combination of parameters that we want to try, and later on we will pass our recurrent neural network to the RandomizedSearchCV constructor to find the best parameters:

In [None]:
rnn_params = {
    "batch_size": [25, 50, 75, 100],
    "num_epochs": [10, 15, 20, 25, 30],
    "rnn_layers": [(BidirectionalRecurrentLayer(50, dropout_rate=0.45, cell_factory=LSTMCellFactory(),
                                          cell_dropout=0.55),)],
    "fc_layers": [(DenseLayer(20, activation='relu', dropout_rate=0.3),)],
    "early_stopping": [True, False],
    "learning_rate": [0.001, 0.003, 0.01, 0.03]
}

In [None]:
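from sklearn.model_selection import RandomizedSearchCV

# Search on a fresh estimator so the network trained above (rnn_w2v) is kept.
# RANDOM_SEED is assumed to be defined by init.py.
rnn_w2v_base = RNNWrapper(embeddings=w2v_embedding)
rnn_w2v_grid = RandomizedSearchCV(rnn_w2v_base, rnn_params, n_iter=5, cv=4, scoring='f1_macro',
                                  return_train_score=True, random_state=RANDOM_SEED)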
rnn_w2v_grid.fit(X_train_w2v, y_train)
rnn_w2v_grid.best_score_
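
Unlike an exhaustive grid search, a randomized search only evaluates n_iter combinations drawn at random from the parameter dict. The sketch below shows that sampling step in plain Python. It is not scikit-learn's implementation, and `sample_params` is a hypothetical helper written only for this illustration.

```python
import itertools
import random


def sample_params(param_grid, n_iter, seed=0):
    """Draw n_iter distinct parameter combinations from a dict of candidate lists."""
    keys = sorted(param_grid)
    # Enumerate every possible combination, then pick a few without replacement.
    all_combos = list(itertools.product(*(param_grid[key] for key in keys)))
    rng = random.Random(seed)
    chosen = rng.sample(all_combos, min(n_iter, len(all_combos)))
    return [dict(zip(keys, combo)) for combo in chosen]


grid = {"batch_size": [25, 50, 75, 100],
        "learning_rate": [0.001, 0.003, 0.01, 0.03]}
candidates = sample_params(grid, n_iter=5, seed=42)
for params in candidates:
    print(params)
```

The parameter dict defined above has 4 x 5 x 2 x 4 = 160 possible combinations. With n_iter=5 and cv=4, the search trains 20 networks (plus one final refit on the whole training set with scikit-learn's default settings) instead of the 640 that a full grid search would need.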

## Evaluation

After training our models, we can evaluate their performance on the test dataset. First of all, we will load this set using the sklearn library:

In [None]:
newsgroups_test = fetch_20newsgroups(data_home='./data', subset='test', categories=categories)

# Only transform the test set: the pipeline was already fitted on the training data.
X_test = w2v_data_pipeline.transform(newsgroups_test.data)
y_test = newsgroups_test.target

We can make use of the predict function provided by the neural network to obtain the predictions for our test set. After obtaining the predictions we can make use of the sklearn.metrics module to compute common evaluation metrics such as the accuracy or f1_score.

Pylat also provides additional functions that we can use to evaluate our models. In this case, we are going to compute the PPV (positive predictive value), the NPV (negative predictive value) and the Wilson score interval:

In [None]:
from pylat.evaluation import positive_predicted_value, \
                             negative_predicted_value, wilson_score_interval
from sklearn.metrics import accuracy_score, f1_score

def measure_performance(model, X, y):
    """This method shows a summary of the performance of a model."""
    y_pred = model.predict(X)
    print(y_pred.shape)
    print(y.shape)
    f1 = f1_score(y, y_pred, average="macro")
    acc = accuracy_score(y, y_pred)
    acc_interval = wilson_score_interval(1 - acc, len(y), 90)
    ppv = positive_predicted_value(y, y_pred)
    npv = negative_predicted_value(y, y_pred)
    print('F1: {:.3f}, Accuracy: {:.3f} ± {:.3f}, PPV: {:.3f}, NPV: {:.3f}'.format(
          f1, acc, acc_interval, ppv, npv))
    
measure_performance(rnn_w2v, X_test, y_test)
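
For reference, these three quantities can be written out in a few lines of plain Python. The sketch assumes the textbook definitions for a binary problem, PPV = TP / (TP + FP) and NPV = TN / (TN + FN), and the standard Wilson score interval for a proportion. How pylat extends PPV and NPV to our three-class problem, and what exactly its wilson_score_interval function returns, is not shown in this notebook, so the helpers `ppv_npv` and `wilson_interval` below are illustrative assumptions rather than pylat's code.

```python
from math import sqrt
from statistics import NormalDist


def ppv_npv(y_true, y_pred, positive=1):
    """Return (PPV, NPV), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    return ppv, npv


def wilson_interval(p, n, confidence=0.90):
    """Return the (low, high) Wilson score interval for a proportion p over n samples."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
ppv, npv = ppv_npv(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
low, high = wilson_interval(accuracy, len(y_true))
print('PPV: {:.3f}, NPV: {:.3f}, accuracy in [{:.3f}, {:.3f}]'.format(ppv, npv, low, high))
```

Note that the Wilson interval is not symmetric around the observed proportion unless that proportion is 0.5, which is why this sketch returns both bounds instead of a single margin.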

## Saving the model

Finally, we can save our model to a file. This file could be loaded by other programs to obtain predictions from the network with new data:

In [None]:
import os
import pickle
import shutil

def remove_dir(directory):
    if os.path.exists(directory):
        shutil.rmtree(directory)


def overwrite_dir(directory):
    remove_dir(directory)
    os.mkdir(directory)

def save_neural_net(pipeline, neural_net, path, save_name):
    """Saves a recurrent neural network into a file.
    Parameters
    ----------
    pipeline : sklearn.Pipeline
        Data processing pipeline to be saved.
    neural_net : :obj:`BaseNeuralNetwork`
        Neural network that will be saved.
    path : str
        Directory where the model will be saved.
    save_name : str
        Name of the saved file.
    """
    save_path = os.path.join(path, save_name)
    overwrite_dir(save_path)
    with open(os.path.join(save_path, 'pipeline.pk1'), 'wb') as f:
        pickle.dump(pipeline, f)
    model_dir = os.path.join(save_path, 'model')
    remove_dir(model_dir)
    neural_net.model.save(model_dir)

save_neural_net(w2v_data_pipeline, rnn_w2v, 'classifiers', 'rnn_w2v')

If we want to load the model in another project, we can use the following code:

In [None]:
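import os
import pickle

from pylat.neuralnet import DenseLayer
from pylat.neuralnet.rnn import RecurrentLayer
from pylat.wrapper.predictor import RNNWrapper

def load_neural_net(save_path):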
    """Loads a neural network model.
    This method should be used to load neural networks that use TensorFlow as
    the backend.
    Parameters
    ----------
    save_path : str
        Directory where the neural network has been saved.
    Returns
    -------
    sklearn.Pipeline
        Loaded scikit-learn pipeline. The last step of the pipeline corresponds
        to the neural network, which has its weights restored from the save
        file.
    """
    pipeline_path = os.path.join(save_path, 'pipeline.pk1')
    with open(pipeline_path, 'rb') as f:
        pipe = pickle.load(f)
    model_path = os.path.join(save_path, 'model')
    embeddings = pipe.steps[1][1].embeddings
    # Placeholder layers: recurrent layers go in rnn_layers, dense layers in fc_layers.
    rnn = RNNWrapper(rnn_layers=[RecurrentLayer(10)],
                     fc_layers=[DenseLayer(10)], embeddings=embeddings)
    rnn.model.restore(model_path)
    pipe.steps.append(('rnn', rnn))
    return pipe