## Ideas
* Create new features -> they will be used as input after the LSTM layers

      - how many times the word appears in the sentence -> int (? one new feature per word? matrix (m, n_total_words))
      - how many times the word appears in the whole dataset (? one new feature per word? matrix (m, n_total_words))
      - check whether the word was in the pretrained embedding -> [0,1]

            ALL OF THE FEATURES ABOVE HAVE THE SAME PROBLEM.
            A POSSIBLE SOLUTION would be a matrix (m, n_total_words, n_features), where n_features = number of features with shape (m, n_total_words)

      - 1 column - how many words are in the sentence?
      - 1 column - how many distinct words?
      - 1 column - how many times does the most repeated word appear?
      - 1 column - is there any all-uppercase word?
      - 1 column - how many all-uppercase words?
      - 1 column - repetition rate
      (m, 6)

Predictions require a `predict` function that receives the data and preprocesses it, creating the features described above.
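
The stacked matrix of shape (m, n_total_words, n_features) proposed above could be built roughly as follows. This is only a sketch on a made-up two-sentence corpus; `sentences`, `known_words` and every other name here are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Toy corpus; `known_words` stands in for the pretrained embedding vocabulary.
sentences = ["you are you", "are we done"]
tokenized = [s.split() for s in sentences]
vocab = sorted({w for sent in tokenized for w in sent})
col = {w: j for j, w in enumerate(vocab)}
known_words = {"you", "are"}

m, n_total_words, n_features = len(tokenized), len(vocab), 3
features = np.zeros((m, n_total_words, n_features))

# Dataset-wide count of every word.
dataset_counts = {w: sum(sent.count(w) for sent in tokenized) for w in vocab}

for i, sent in enumerate(tokenized):
    for w in set(sent):
        j = col[w]
        features[i, j, 0] = sent.count(w)            # count within the sentence
        features[i, j, 1] = dataset_counts[w]        # count within the dataset
        features[i, j, 2] = float(w in known_words)  # present in the embedding?

print(features.shape)
```

With a real vocabulary this tensor would be very large and mostly zeros, which is probably why the six sentence-level columns are the cheaper option.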


* The character inputs of a sentence must be limited to the same extent as the word inputs of that sentence. For example, in the sentence:  "`Paglia is somewhat of intellectual enigma, a conservative academic feminist, who revels in`" the character sequence for this sentence would end at the last `"n"`, while the word sequence would end at `"in"`.
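
A minimal sketch of that constraint, assuming whitespace tokenization and a hypothetical helper `truncate_chars_to_words` (neither is taken from the notebook): the character sequence is cut where the last kept word ends.

```python
def truncate_chars_to_words(sentence, max_words):
    """Return the characters spanning at most the first `max_words` words.

    Assumes words are separated by whitespace; runs of whitespace are
    collapsed to a single space by the split/join round trip.
    """
    words = sentence.split()
    return list(" ".join(words[:max_words]))

# Toy sentence: keep 4 words, so the characters stop at the "x" of "fox".
chars = truncate_chars_to_words("the quick brown fox jumps over the lazy dog", 4)
print("".join(chars))
```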

* Try tokenization with the NLTK library, which seems to split words better.
E.g.: " **queen's** " would be split into " **queen** " and " **'s** ", which does not happen with the Keras Tokenizer, which keeps these pieces together as a single word "**queen's**"

* **English Stemmer NLTK**

* Try the **Stanford Tokenizer**

* Add an embedding for characters and create an LSTM that processes an input of character sequences; this will be concatenated with the other inputs in an inner layer of the model.

* Train **Character Embeddings** separately and load them in the notebook, or use **Word Embeddings** to [generate the new ones.](http://minimaxir.com/2017/04/char-embeddings)
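
One simple way to derive character vectors from word vectors is to average the vectors of all words in which a character occurs. This is an assumption made for illustration and not necessarily the exact recipe of the linked post. The toy vectors below are made up and stand in for GloVe.

```python
import numpy as np

# Made-up 2-dimensional word vectors.
word_vectors = {
    "cat": np.array([1.0, 0.0]),
    "act": np.array([0.0, 1.0]),
    "to": np.array([2.0, 2.0]),
}

def char_embeddings_from_words(word_vectors):
    """Average the word vectors over every occurrence of each character."""
    sums, counts = {}, {}
    for word, vec in word_vectors.items():
        for ch in word:
            sums[ch] = sums.get(ch, 0) + vec
            counts[ch] = counts.get(ch, 0) + 1
    return {ch: sums[ch] / counts[ch] for ch in sums}

char_vectors = char_embeddings_from_words(word_vectors)
print(char_vectors["t"])
```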

# Improved LSTM baseline

This kernel is a somewhat improved version of [Keras - Bidirectional LSTM baseline](https://www.kaggle.com/CVxTz/keras-bidirectional-lstm-baseline-lb-0-051) along with some additional documentation of the steps. (NB: this notebook has been re-run on the new test set.)

In [None]:
import sys, os, re, csv, codecs, operator, numpy as np, pandas as pd
from collections import defaultdict, OrderedDict

from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer

from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, GRU, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D, Concatenate, BatchNormalization
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import Callback

import logging
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

import warnings
warnings.filterwarnings('ignore')

np.set_printoptions(suppress=True)

We include the GloVe word vectors in our input files. To include these in your kernel, simply click 'input files' at the top of the notebook and search for 'glove' in the 'datasets' section.

In [None]:
path = '../input/'
comp = 'jigsaw-toxic-comment-classification-challenge/'
EMBEDDING_FILE=f'{path}glove6b50d/glove.6B.50d.txt'
#EMBEDDING_FILE="../input/glove-global-vectors-for-word-representation/glove.6B.200d.txt"
TRAIN_DATA_FILE=f'{path}{comp}train.csv'
TEST_DATA_FILE=f'{path}{comp}test.csv'


Set some basic config parameters:

In [None]:
embed_size = 50 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 100 # max number of words in a comment to use

In [None]:
window_sizes = [40,80,maxlen]
step_size = 20

def augment_toxic(list_sequences, y, window_sizes=[100],step_size=20, leakage=0, 
                  maxlen=None, truncated_only=False, non_repeat=False):
    """
    This function is intended to augment the amount of toxic comments on the training data
    using the provided dataset.
    Given a list `window_sizes` of integer values, each window slides through the sentence with 
    a step of `step_size` until the end. Each window becomes a new sentence appended to the 
    training data.
    This avoids losing the data that truncation would discard when values in `window_sizes`
    are lower than `maxlen`.
    
    arguments: 
    `list_sequences` -> 2D-array like: sequence of sentences
    `y` -> array of shape (n_sentences, n_classes) with the labels
    `window_sizes` -> list of integers
    `step_size` -> integer
    `leakage` -> float in [0, 1]: maximum fraction of words in the new sentence to be randomly removed
    
    ||Maybe better to be another function
    ||'''Augment only truncated'''
    ||`truncated_only` -> boolean: if True, `step_size` will be ignored and only the words 
    ||    after `maxlen` index will be augmented.
    ||`maxlen` -> max length of the tokenized comment. Should be equal to the global `maxlen`.
    ||`non_repeat` -> boolean: if True, a comment already augmented with one window size
    ||    is skipped for the larger ones.
    
    
    return: 
    `list_sequences` -> the padded original sequences plus the new windows, shuffled
    `y` -> the labels in the same shuffled order
    """
    new_comments = []
    new_ys = []
    if isinstance(list_sequences,(list,tuple)):
        # Sentences have different lengths, so store them in an object array
        list_sequences = np.array(list_sequences, dtype=object)
    # If `non_repeat` is True, create a list to store the already processed comments
    if non_repeat:
        done = []
    
    # True if any of the 6 classes == 1, False otherwise.
    toxic_filter = y.sum(axis=1) > 0
    # Indices of toxic comments.
    toxic_indices = np.flatnonzero(toxic_filter)
    toxic_comments = list(list_sequences[toxic_indices])

    window_sizes = sorted(window_sizes)
    
    for window in window_sizes:
        if truncated_only:
            step_size = window
        for index, comment in zip(toxic_indices,toxic_comments):
            # Check if this comment has already been augmented
            if non_repeat and index in done:
                continue
            # mode
            if truncated_only:
                splitted = comment[maxlen:]                
            else:
                splitted = comment
            
            # Check if it is possible to take small pieces of size `window` from the sentence.
            # Go to the next comment if it is not.
            if len(splitted) <= window:
                continue
                
            # Get how many times it is possible to apply the window with the given ´step_size´.
            n_steps = 0
            window_end = window        
            while window_end <= len(splitted):
                window_end += step_size
                n_steps += 1

            # Perform the possible steps
            for step in range(n_steps):
                # Get the window.                
                new_sequence = np.array(splitted[step*step_size:window+step*step_size])
                
                # Leakage control
                if leakage < 0 or leakage > 1:
                    raise ValueError(f'leakage must be between 0 and 1. It was passed leakage={leakage}')
                elif leakage > 0:
                    ## total number of words that can be removed
                    possible_leaks = int(len(new_sequence)*leakage) 
                    ## random number between 0 and `possible_leaks` (inclusive);
                    ## the `+ 1` also avoids a ValueError when `possible_leaks` is 0
                    n_leaks = np.random.randint(possible_leaks + 1) 
                    ## indices of words to be removed
                    leaky_indices = set(np.random.randint(0,len(new_sequence),n_leaks)) 
                    leak_filter = [i not in leaky_indices for i in range(len(new_sequence))] 
                    new_sequence = new_sequence[leak_filter]
                
                # Append new sentence to a list        
                new_comments.append(new_sequence)
                # Append the labels of the original sentence for the new sentence
                new_ys.append(y[index,:])
                # Append index to ´done´ list if ´non_repeat´ is True
                if non_repeat:
                    done.append(index)
    # pad or truncate sequences to have length of `maxlen`
    list_sequences = pad_sequences(list_sequences, maxlen=maxlen, truncating='post')
    
    # concatenate new comments with the old ones (skipped when nothing was augmented)
    if new_comments:
        new_comments = pad_sequences(new_comments, maxlen=maxlen, truncating='post')
        list_sequences = np.concatenate((list_sequences, new_comments), axis=0)
        y = np.concatenate((y, new_ys),axis=0)
    # shuffle data
    shuffle_index = np.random.permutation(len(y))
    list_sequences, y = list_sequences[shuffle_index], y[shuffle_index]
                
    return list_sequences, y

def get_helper(list_sentences):
    """ 
    Generates new features:
    
    Lists of length `len(list_sentences)`
    `sentence_length` -> number of words in each sentence after sequencing.
    `n_distinct_words` -> number of distinct words for each sentence.
    `n_uppercase` -> number of uppercase words in each sentence.
    `any_uppercase` -> True if there is any uppercase word in the sentence, False otherwise.
    `n_most_occurring_word` -> number of times the most frequent word of each sentence appears.
    `repetitive` -> Fraction of repeated words, `1 - (n_distinct_words / sentence_length)`
    
    Return:
        `helper` -> An array of shape (len(list_sentences), 6), where 6 is the number of
            features and each column is a feature.
    """
    sentence_length = []        #(int) how many words in each sentence?
    n_distinct_words = []       #(int) how many distinct words in each sentence?
    n_uppercase = []            #(int) how many words are uppercase in each sentence?
    any_uppercase = []          #(bool) is there any uppercase word in each sentence?
    n_most_occurring_word = []  #(int) how many times the most frequent word appears in each sentence?
    
    repetitive = []             #|(float) Fraction of repeated words
                                #| `1 - (n_distinct_words / sentence_length)`
                                #| large values mean a highly repetitive sentence
    
    for sentence in list_sentences:
        word_sequence = text_to_word_sequence(sentence)
        
        sentence_length.append(len(word_sequence))
        n_distinct_words.append(len(set(word_sequence)))
        try:
            repetitive.append(1 - (n_distinct_words[-1]/sentence_length[-1]))
        except ZeroDivisionError:
            repetitive.append(0)
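        try:
            # count of the most frequent word (`max` raises ValueError on an empty sentence)
            n_most_occurring_word.append(max(word_sequence.count(w) for w in set(word_sequence)))
        except ValueError:
            n_most_occurring_word.append(0)
        n_upper = 0
        # split on whitespace: `text_to_word_sequence` lowercases, and iterating over
        # `sentence` directly would visit characters instead of words
        for word in sentence.split():
            n_upper += word.isupper()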
                        
        n_uppercase.append(n_upper)
        any_uppercase.append(n_upper > 0)
            
    features = [sentence_length, n_distinct_words, n_uppercase, 
                any_uppercase, n_most_occurring_word, repetitive]
    helper = np.zeros((len(list_sentences), len(features)))    
    for i, l in enumerate(features):
        helper[:,i] = l        
    return helper

def get_sequence_maxlen(list_sentences):
    """ Return the maxlen value. This function looks for the sentence with the most words """    
    maxlen = 0
    for sentence in list_sentences:
        sequence = text_to_word_sequence(sentence)
        length = len(sequence)
        maxlen = length if length > maxlen else maxlen        
    return maxlen

def get_num_words(list_sentences):
    """ Returns the number of distinct words in the sentences """
    words = set()
    for sentence in list_sentences:
        sequence = text_to_word_sequence(sentence)
        for word in sequence:
            words.add(word)            
    return len(words)

def get_maxlen_char(list_sentences):
    """ Return the maxlen value. This function looks for the sentence with the most characters """    
    maxlen = 0
    for sentence in list_sentences:
        length = len(sentence)
        maxlen = length if length > maxlen else maxlen        
    return maxlen

def get_num_char(list_sentences):
    """ Returns the number of distinct characters in the sentences """
    chars = set()
    for sentence in list_sentences:
        for char in sentence:
            chars.add(char)            
    return len(chars)

def get_word_length(list_sentences, mode='max'):
    """ 
    Returns the biggest word length for `mode='max'` or the average length 
    between all words for `mode='avg'`. Words with 50 or more characters are
    ignored in `'max'` mode.
    """
    length = 0
    # compare strings with `==`; `is` checks identity and is not reliable here
    if mode == 'avg':
        n_words = 0
        for sentence in list_sentences:
            sequence = text_to_word_sequence(sentence)
            for word in sequence:
                length += len(word)
                n_words += 1
        # avoid a ZeroDivisionError when there are no words at all
        length = length//n_words if n_words else 0
            
    elif mode == 'max':
        for sentence in list_sentences:
            sequence = text_to_word_sequence(sentence)
            for word in sequence:
                t_len = len(word)
                if length < t_len and t_len < 50: 
                    length = t_len
    
    return length
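
The windowing performed by `augment_toxic` can be illustrated on a toy sequence. The sketch below uses a hypothetical helper, `sliding_windows`, that is not part of the notebook. Unlike `augment_toxic`, it also returns one window when the sequence is exactly `window` items long.

```python
def sliding_windows(sequence, window, step_size):
    """Return every full window of length `window`, moving `step_size` items at a time."""
    return [sequence[start:start + window]
            for start in range(0, len(sequence) - window + 1, step_size)]

tokens = list(range(10))  # stand-in for the word indices of one comment
windows = sliding_windows(tokens, window=4, step_size=3)
for w in windows:
    print(w)
```

Each window would inherit the labels of the comment it was cut from.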

### Read GloVe
Read the GloVe word vectors (space-delimited strings) into a dictionary mapping word -> vector.

In [None]:
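def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
# Open with a context manager (and explicit encoding) so the file is closed after parsing.
with open(EMBEDDING_FILE, encoding='utf8') as f:
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in f)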
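
To make the file format concrete, here is a small sketch that parses two made-up lines from an in-memory buffer instead of the real GloVe file. `parse_glove` and `toy_index` are illustrative names only.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text format: a word followed by its components.
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")

def parse_glove(lines):
    """Map each word to a float32 vector built from the rest of its line."""
    index = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        index[word] = np.asarray(values, dtype="float32")
    return index

toy_index = parse_glove(sample)
print(toy_index["cat"])
```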

In [None]:
def text_to_sequence(list_sentences, lower=True, char_level=False):
    if lower:
        if char_level:
            sequences = [list(sentence.lower()) for sentence in list_sentences]
        else:
            sequences = [word_tokenize(sentence.lower()) for sentence in list_sentences]
    else:
        if char_level:
            sequences = [list(sentence) for sentence in list_sentences]
        else:
            sequences = [word_tokenize(sentence) for sentence in list_sentences]   
    return sequences

def get_word_index(sequences, return_counts=False):
    
    word_counts = defaultdict(int)
    word_index = dict()
    # Count words frequency
    for sequence in sequences:
        for word in sequence:
            word_counts[word] += 1
            
    # Sort from most to less frequent
    word_counts = OrderedDict(sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True))
    
    # Associate numeric indices to each word starting at 1
    for i, word in enumerate(word_counts.keys(),1):
        word_index[word] = i
        
    if return_counts:
        return word_index, word_counts
    return word_index

def apply_indices(sequences, num_words, word_index):
    if not isinstance(word_index, dict):
        raise TypeError('word_index must be a dict')
    id_sequences = []
    # Apply indices
    for sequence in sequences:
        id_sequence = []
        for word in sequence:
            # ignore unknown words and words with index >= the max number of words (`num_words`)
            index = word_index.get(word)
            if index is not None and index < num_words:
                id_sequence.append(index)
        id_sequences.append(id_sequence)
    return id_sequences
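
The indexing scheme above (most frequent word gets index 1, rare words are dropped) can be checked on a tiny example. This is a compact sketch using `collections.Counter` rather than the notebook's own functions; the names are illustrative.

```python
from collections import Counter

tokenized = [["you", "are", "you"], ["are", "you", "ok"]]
counts = Counter(w for sent in tokenized for w in sent)

# Most frequent word gets index 1; index 0 stays free for padding.
toy_word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), 1)}

num_words = 3  # keep only indices below 3, i.e. the two most frequent words
encoded = [[toy_word_index[w] for w in sent if toy_word_index[w] < num_words]
           for sent in tokenized]
print(toy_word_index, encoded)
```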

### RocAucEvaluation with scores for the Training and Validation sets

In [None]:
class RocAucEvaluation(Callback):
    
    def __init__(self, validation_data=(), train_data=(), interval=1):
        
        super().__init__()        
        assert not train_data or (isinstance(train_data,(tuple,list)) and len(train_data) == 2), \
            "train_data must be a list or tuple with 2 elements: `(X_train, y_train)`"        
        
        self.interval = interval        
        self.X_val, self.y_val = validation_data
        
        # Only score the training set when `train_data` is provided
        self.train_score = bool(train_data)
        if self.train_score:
            self.X_train, self.y_train = train_data
            
    def on_epoch_end(self, epoch, logs=None):
        
        if epoch % self.interval == 0:            
            y_val_pred = self.model.predict(self.X_val, verbose=0)
            score_val = roc_auc_score(self.y_val, y_val_pred)

            if self.train_score:
                y_train_pred = self.model.predict(self.X_train, verbose=0)
                score_train = roc_auc_score(self.y_train, y_train_pred)
                print(f"\n ROC-AUC - epoch: {epoch} - score_train: "
                      f"{score_train:.6f} | score_val: {score_val:.6f}")
            else:
                print(f"\n ROC-AUC - epoch: {epoch} - score: {score_val:.6f}")
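
For multi-label targets, `roc_auc_score` with its default macro averaging returns the mean of the per-class AUCs. The pure-Python sketch below reproduces that idea on made-up data; `binary_auc` and `macro_auc` are illustrative helpers, not scikit-learn functions.

```python
def binary_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(Y_true, Y_score):
    """Unweighted mean of the AUC of each label column."""
    n_classes = len(Y_true[0])
    per_class = [binary_auc([row[c] for row in Y_true],
                            [row[c] for row in Y_score])
                 for c in range(n_classes)]
    return sum(per_class) / n_classes

# Made-up labels and scores for 4 samples and 2 classes.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.35, 0.6]]
print(macro_auc(Y_true, Y_score))
```

A gap between the training and validation scores printed by the callback is a quick overfitting signal.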

### Read in our data and replace missing values:

In [None]:
train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)

list_sentences_train = train["comment_text"].fillna("_na_").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("_na_").values

y.sum(axis=0)

In [None]:
sequences_train = text_to_sequence(list_sentences_train)
word_index = get_word_index(sequences_train)
sequences_train = apply_indices(sequences_train, max_features, word_index)
X_w_t, y = augment_toxic(sequences_train, y, window_sizes=window_sizes, maxlen=maxlen, truncated_only=True, non_repeat=True)

sequences_test = text_to_sequence(list_sentences_test)
sequences_test = apply_indices(sequences_test, max_features, word_index)
X_w_te = pad_sequences(sequences_test, maxlen=maxlen, truncating='post')
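
What `pad_sequences(..., truncating='post')` does to a single sequence can be sketched in plain Python, assuming the Keras default of padding on the left with zeros. `pad_pre_truncate_post` is a hypothetical stand-in, not the Keras implementation.

```python
def pad_pre_truncate_post(seq, maxlen, value=0):
    """Keep the first `maxlen` items, then left-pad with `value` up to `maxlen`."""
    seq = list(seq)[:maxlen]
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre_truncate_post([1, 2, 3], 5))
print(pad_pre_truncate_post([1, 2, 3, 4, 5, 6], 4))
```

Truncating at the end is what discards the tail of long comments, which is the part `augment_toxic` tries to recover.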

# In this version, `sequences_train` is already padded in `augment_toxic()`
#X_w_t = pad_sequences(sequences_train, maxlen=maxlen, truncating='post')



### Augment, or recover the part the Tokenizer would truncate

In [None]:
## augment data
#new_comments, new_ys = augment_toxic(list_sentences_train, window_sizes, step_size, 
#                                 maxlen=maxlen, truncated_only=True, non_repeat=True)
#list_sentences_train = np.concatenate((list_sentences_train, new_comments),axis=0)
#y = np.concatenate((y, new_ys),axis=0)
## shuffle data
#shuffle_index = np.random.permutation(len(y))
#list_sentences_train, y = list_sentences_train[shuffle_index], y[shuffle_index]

#y.sum(axis=0)

### Relation between classes

In [None]:
[y[y[:,i]==1].sum(axis=0) for i in range(6)]
pd.DataFrame(data=[y[y[:,i]==1].sum(axis=0) for i in range(6)], columns=list_classes,index=list_classes)
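
For binary labels, the table above is the label co-occurrence matrix, which can also be written as a single matrix product. A small check on made-up labels (`y_toy` is illustrative):

```python
import numpy as np

# 4 samples, 3 classes, made-up binary labels.
y_toy = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1],
                  [1, 1, 1]])

# Row i: class totals among the samples that belong to class i.
by_filter = np.array([y_toy[y_toy[:, i] == 1].sum(axis=0)
                      for i in range(y_toy.shape[1])])
# Same numbers as one matrix product.
by_matmul = y_toy.T @ y_toy

print(by_matmul)
```

The diagonal holds the number of samples per class and entry (i, j) counts the samples tagged with both class i and class j.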

## Word embedding matrix
### Clean and get word sequences from sentences 
Standard keras preprocessing, to turn each comment into a list of word indexes of equal length (with truncation or padding as needed).

### Create words embedding matrix
Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. The random values are drawn with the same mean and standard deviation as the GloVe embeddings.

In [None]:
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
#emb_mean,emb_std

In [None]:
# Index 0 is reserved for padding and word indices start at 1, hence the `+ 1`
nb_words = min(max_features, len(word_index) + 1)
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
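
Since words missing from GloVe keep their random initialization, it can be useful to know how much of the kept vocabulary is actually covered. A minimal sketch with a hypothetical helper `embedding_coverage` and made-up toy dictionaries:

```python
def embedding_coverage(word_index, embeddings_index, max_features):
    """Fraction of the words with index below `max_features` found in the embeddings."""
    kept = [w for w, i in word_index.items() if i < max_features]
    found = [w for w in kept if w in embeddings_index]
    return len(found) / len(kept) if kept else 0.0

# Toy data: "zzxq" has no pretrained vector, "rare" falls outside `max_features`.
toy_word_index = {"the": 1, "cat": 2, "zzxq": 3, "rare": 4}
toy_embeddings = {"the": [0.1], "cat": [0.2], "rare": [0.3]}
coverage = embedding_coverage(toy_word_index, toy_embeddings, max_features=4)
print(f"{coverage:.2f}")
```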

We use `max_features = 20000` and `maxlen = 100` (or values close to these) because the training set has a very large vocabulary and some very long comments.

In [None]:
#original_maxlen = get_sequence_maxlen(list_sentences_train)
#original_max_features = get_num_words(list_sentences_train)
#print(f'The original number of features and maximum length in the sequences '
#      f'are: {original_max_features} features and {original_maxlen} words, respectively')

## Char embedding matrix

In [None]:
#embed_char_size = 50
#original_num_char = get_num_char(list_sentences_train) # how many unique chars to use
#original_maxlen_char = get_maxlen_char(list_sentences_train) # maximum 
#num_char = 500
#maxlen_char = 800
#print(f'original_num_char = {original_num_char}, original_maxlen_char = {original_maxlen_char}')

In [None]:
#tokenizer_char = Tokenizer(num_words=num_char, char_level=True, lower=False) # Initialize the tokenizer
#tokenizer_char.fit_on_texts(list_sentences_train) # associate each "word" or "character" to an indice
#list_char_tokenized_train = tokenizer_char.texts_to_sequences(list_sentences_train) # get sequence of indices for the training data
#list_char_tokenized_test = tokenizer_char.texts_to_sequences(list_sentences_test) # get sequences of indices for the test data
#X_c_t = pad_sequences(list_char_tokenized_train, maxlen=maxlen_char, truncating='post') # pad with zeros
#X_c_te = pad_sequences(list_char_tokenized_test, maxlen=maxlen_char, truncating='post') # pad with zeros

##########
#sequences_char_train = text_to_sequence(list_sentences_train, lower=False, char_level=True)
#char_index = get_word_index(sequences_char_train)
#sequences_char_train = apply_indices(sequences_char_train, max_features, char_index)
#X_c_t, _ = augment_toxic(sequences_char_train, y, window_sizes=window_sizes, maxlen=maxlen, truncated_only=True, non_repeat=True)

#sequences_char_test = text_to_sequence(list_sentences_test, lower=False, char_level=True)
#sequences_char_test = apply_indices(sequences_char_test, max_features, char_index)
#X_c_te = pad_sequences(sequences_char_test, maxlen=maxlen, truncating='post')

In [None]:
#indx = 9999
#for i, j in tokenizer.word_index.items():
#    if j == indx:
#        print(i, tokenizer.word_index[i])

In [None]:
#embedding_char_matrix = np.random.normal(size=(num_char, embed_char_size))

### Train/Val split

In [None]:
#y = y_

In [None]:
[X_word_train, X_word_val, y_train, y_val] = train_test_split(X_w_t, y, train_size=0.95)
#[X_char_train, X_char_val, _, _] = train_test_split(X_c_t, y, train_size=0.95)

### Creating new features

In [None]:
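m_train = X_word_train.shape[0]

# NOTE: these slices assume the rows of `X_word_train`/`X_word_val` follow the order of
# `list_sentences_train`; `augment_toxic` and `train_test_split` both shuffle the rows.
X_train_helper = get_helper(list_sentences_train[:m_train])
X_val_helper = get_helper(list_sentences_train[m_train:])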

X_h_te = get_helper(list_sentences_test)

n_helper = X_train_helper.shape[1]
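
The helper features have to stay row-aligned with the word sequences fed to the model. One way to guarantee that is to split every input with a single shared index permutation. This is a sketch on toy arrays with hypothetical names, not the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
# Toy inputs built so that row i is recognisable in every array.
X_seq = np.arange(n * 3).reshape(n, 3)         # row i starts with 3*i
X_help = np.arange(n * 2).reshape(n, 2) * 10   # row i starts with 20*i
labels = np.arange(n)                          # label i for row i

# One permutation, reused for all arrays, keeps the rows aligned.
perm = rng.permutation(n)
n_train = int(0.8 * n)
train_idx, val_idx = perm[:n_train], perm[n_train:]

X_seq_train, X_seq_val = X_seq[train_idx], X_seq[val_idx]
X_help_train, X_help_val = X_help[train_idx], X_help[val_idx]
lab_train, lab_val = labels[train_idx], labels[val_idx]

print(X_seq_train.shape, X_help_train.shape, lab_train.shape)
```

`train_test_split` also accepts several arrays in one call and applies the same shuffle to all of them, which achieves the same alignment.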

### Joining input data in lists

In [None]:
#input_train = [X_word_train, X_char_train, X_train_helper]
#input_val = [X_word_val, X_char_val, X_val_helper]
#input_test = [X_w_te, X_c_te, X_h_te]

input_train = [X_word_train, X_train_helper]
input_val = [X_word_val, X_val_helper]
input_test = [X_w_te, X_h_te]

In [None]:
#input_test[0].shape, input_test[1].shape, input_test[2].shape, input_val[0].shape, input_val[1].shape, input_val[2].shape, input_train[0].shape, input_train[1].shape, input_train[2].shape

### Prepare RocAuc callback

In [None]:
ra_val = RocAucEvaluation(
    validation_data=(input_val, y_val),
    train_data=(input_train, y_train), 
    interval=1
)

### Sample Weights
Weight the samples so that comments belonging to any of the toxic classes get higher values and clean comments get lower values; the weights are passed to the `model.fit` method.
This makes the loss function pay more attention to the minority classes (samples in any of the toxic classes).

In [None]:
#cw = compute_class_weight('balanced', , y)
sw = compute_sample_weight('balanced', y_train)
sw = 1/(1+np.exp(-sw))
#sw[sw == sw.min()] *=.2
#sw2 = np.array([sw[i]*sw[i] if (sw[i] < 0.8) else sw[i] for i in range(sw.shape[0])])
sw = sw**3
#np.unique(sw),np.unique(sw2)
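
The sigmoid followed by a cube squeezes the balanced weights into the interval (0, 1) while keeping their order. A small sketch with made-up raw weights (`squash` is an illustrative name for the two steps applied to `sw`):

```python
import math

def squash(weight):
    """Sigmoid followed by a cube, the transform applied to `sw` above."""
    return (1.0 / (1.0 + math.exp(-weight))) ** 3

raw_weights = [0.5, 1.0, 5.0, 20.0]  # made-up balanced weights
squashed = [squash(w) for w in raw_weights]
for raw, new in zip(raw_weights, squashed):
    print(f"{raw:>5} -> {new:.4f}")
```

Large raw weights saturate close to 1, so the rarest classes cannot dominate the loss by orders of magnitude.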

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [None]:
# LSTM for the Word Sequences
input_word = Input(shape=(maxlen,), name="input_word_sequences")
X_word = Embedding(max_features, embed_size, weights=[embedding_matrix])(input_word)

X_word = Bidirectional(LSTM(50, return_sequences=True, recurrent_dropout=0.1))(X_word)
X_word = Dropout(0.3)(X_word)
#X_word = BatchNormalization()(X_word)

###X_word = LSTM(10, return_sequences=True, recurrent_dropout=0.1, unroll=False)(X_word)
X_word = Bidirectional(LSTM(50, return_sequences=True, recurrent_dropout=0.1))(X_word)
X_word = Dropout(0.3)(X_word)
X_word = GlobalMaxPool1D()(X_word)
X_word = BatchNormalization()(X_word)

####{
## LSTM for the Character Sequences
#input_char = Input(shape=(maxlen_char,), name="input_char_sequences")
#X_char = Embedding(num_char, embed_char_size, weights=[embedding_char_matrix])(input_char)

##X_char = Bidirectional(LSTM(20, return_sequences=True, recurrent_dropout=0.1, unroll=True))(X_char)
##X_char = Dropout(0.3)(X_char)
##X_char = BatchNormalization()(X_char)

####X_char = LSTM(10, return_sequences=True, recurrent_dropout=0.1, unroll=True)(X_char)
#X_char = Bidirectional(LSTM(20, return_sequences=True, recurrent_dropout=0.1, unroll=True))(X_char)
#X_char = Dropout(0.3)(X_char)
#X_char = GlobalMaxPool1D()(X_char)
#X_char = BatchNormalization()(X_char)

# Helper Hand Engineered Features
input_helper = Input(shape=(n_helper,), name="input_helper")
X_helper = BatchNormalization()(input_helper)

X_helper = Dense(40, activation="relu")(X_helper)
X_helper = Dropout(0.3)(X_helper)
#X_helper = BatchNormalization()(X_helper)

X_helper = Dense(40, activation="relu")(X_helper)
X_helper = Dropout(0.3)(X_helper)
X_helper = BatchNormalization()(X_helper)

## Merge all networks
concat = Concatenate()([X_word, X_helper])
#concat = Concatenate()([X_word, X_char, X_helper])
####}
#X = Dense(40, activation="relu")(X_word)
X = Dense(40, activation="relu")(concat)
X = Dropout(0.2)(X)
X = BatchNormalization()(X)
X = Dense(30, activation="relu")(X)
X = Dropout(0.2)(X)
X = BatchNormalization()(X)

X = Dense(6, activation="sigmoid")(X)

#model = Model(inputs=[input_word, input_char, input_helper], outputs=X)
model = Model(inputs=[input_word, input_helper], outputs=X)
opt = optimizers.Adam(
    lr=0.01,
    decay=0.0000009
)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

In [None]:
model.summary()

Now we're ready to fit our model! Use `validation_split` when not submitting.

In [None]:
# Find the decay you want for the batch size you're using
mtrain = X_word_train.shape[0]
batches = [16,32,64,128,256,512]
batch_steps_decay = []
decay = 0.0000002
for i,b_size in enumerate(batches):
    steps = mtrain//b_size
    total_decay = steps * decay
    batch_steps_decay.append([b_size, steps, total_decay])
print(batch_steps_decay)
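
The `steps * decay` product printed above is the term that shrinks the learning rate, assuming the time-based schedule of the legacy Keras optimizers, `lr / (1 + decay * iterations)`. A sketch with hypothetical numbers (`effective_lr` and `steps_per_epoch` are illustrative):

```python
def effective_lr(lr, decay, iterations):
    """Learning rate after `iterations` updates under time-based decay."""
    return lr / (1.0 + decay * iterations)

steps_per_epoch = 2000  # hypothetical number of batches per epoch
for epoch in range(1, 4):
    lr_now = effective_lr(lr=0.01, decay=0.0000009, iterations=epoch * steps_per_epoch)
    print(f"epoch {epoch}: lr = {lr_now:.6f}")
```

Smaller batch sizes mean more updates per epoch, so the same `decay` value lowers the learning rate faster.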

In [None]:
model.fit(
    input_train,
    y_train,
    batch_size=64,
    epochs=2,
    validation_data=(input_val, y_val),
    sample_weight=sw,
    callbacks=[ra_val])

And finally, get predictions for the test set and prepare a submission CSV:

In [None]:
y_test = model.predict(input_test, batch_size=1024, verbose=1)
sample_submission = pd.read_csv(f'{path}{comp}sample_submission.csv')
sample_submission[list_classes] = y_test
sample_submission.to_csv('submission.csv', index=False)

In [None]:
sample_submission.head(10)

In [None]:
list_sentences_test[0]