Сверточные сети относительно хорошо масшабируются на длинные последовательности. Поэтому их часто применяют к отдельным символам, а не токенам. В домашке вам нужно будет обучить большую сверточную модель на символах (отличаться по сути будет только токенизация).

При обучении используйте колбек для отслеживания лучшей модели. Ориентируйтесь на ф1 меру.

Конкретнее задание такое: Обучите модель с минимум 15 слоями, где у каждого слоя разные параметры (Dropout, Conv1d и Pooling считаются слоями, остальное нет). Как минимум 4 слоя должны быть наложены друг на друга. Должен быть хотя бы один слой каждого типа.

Советы: Начните с небольших сетей и постепенно добавляйте, не пытайтесь сразу собрать все слои. Сделайте размер эмбединга сильно меньше. Попробуйте паддинг поменьше. Символьная модель может обучаться намного дольше. Иногда кернел может крашиться просто так или из-за слишком больших матриц.

Бонусный балл можно получить за изучение влияния предобработки (нужно ли приводить к нижнему регистру, нужно ли выкидывать не алфавитные символы, помогает ли замена цифр на определенный токен).

In [0]:
%tensorflow_version 2.x

In [2]:
cd 'drive/My Drive/colab_data'

/content/drive/My Drive/colab_data


In [0]:
import tensorflow as tf
from tensorflow.keras import backend as K
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
from collections import Counter

import re

from string import ascii_lowercase, digits, ascii_letters, punctuation
short_punctuation = ',.!?;'

In [0]:
quora = pd.read_csv('quora.csv')

In [0]:
def f1(y_true, y_pred):
  
    def recall(y_true, y_pred):
        """Recall metric.

        Only computes a batch-wise average of recall.

        Computes the recall, a metric for multi-label classification of
        how many relevant items are selected.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        """Precision metric.

        Only computes a batch-wise average of precision.

        Computes the precision, a metric for multi-label classification of
        how many selected items are relevant.
        """
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision
    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

In [0]:
chars_to_remove = r"[\ufeff|\u202c|\x7f|\u200b|\xad|\u2060|\u200c|\uf02d|\x10|\u200e|\u2061]"
whitespace = r"[\t|\n|\r|\x0b|\x0c]"

In [0]:
# def tokenize(text: str) -> list:
#   tokens = text.lower().split()
#   return [token.strip(punctuation) for token in tokens]

def clean_text(text):
  text = re.sub(chars_to_remove, ' ', text) # funky punctuation symbols
  text = re.sub(whitespace, ' ', text) # different symbols for whitespace
  text = re.sub(r" {2,}", ' ', text) # multiple whitespaces
  return text

def tokenize(text: str) -> list:
  result = []
  for ch in text.lower():
    if ch in ascii_lowercase or ch == ' ':
      result.append(ch)
    elif ch in digits:
      result.append('DIG')
    elif ch in short_punctuation:
      result.append('PNC')
    else:
      result.append('UNK')
  return result

def filter_dict(d: dict, func: callable) -> dict:
    new_d = dict()
    for key, value in d.items():
        if func((key, value)):
            new_d[key] = value
    return new_d

In [0]:
def build_vocab(texts: list, min_count: int) -> set:
  
  vocab = Counter()

  for text in texts:
    vocab.update(text)

  return set(filter_dict(vocab, lambda x: x[1] > min_count))

In [0]:
def index_chars(chars: set) -> dict:
  d = {'PNC': 3, 'DIG': 2, 'UNK': 1, 'PAD': 0}
  for ch in chars :
    if ch not in d.keys():
      d[ch] = len(d)
  return d

In [0]:
def index_text(text: list) -> list:
  return [char2id.get(token, 1) for token in text]

In [0]:
quora['cleaned'] = quora.question_text.apply(clean_text)

In [0]:
quora['tokenized'] = quora.cleaned.apply(tokenize)

In [0]:
vocab = build_vocab(quora.tokenized.values, 1)

In [0]:
char2id = index_chars(vocab)

In [0]:
id2char = {i: ch for ch, i in char2id.items()}

In [0]:
quora['indexed'] = quora.tokenized.apply(index_text)

In [0]:
def limit_max_len(len_list, n_sigma):
  mean = np.mean(len_list)
  std = np.std(len_list)
  return int(np.ceil(mean + (n_sigma * std)))

In [0]:
len_limit = limit_max_len(quora.indexed.str.len(), 1)

In [19]:
len_limit

110

In [0]:
X = tf.keras.preprocessing.sequence.pad_sequences(quora.indexed.values, maxlen=len_limit)

In [0]:
y = quora.target.values

In [0]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      test_size=0.05,
                                                      random_state=1,
                                                      stratify=y)

In [23]:
X.shape

(1306122, 110)

# Попытка 1

In [0]:
inputs = tf.keras.layers.Input(shape=(len_limit,))

embeddings = tf.keras.layers.Embedding(input_dim=len(char2id), output_dim=50)(inputs)

conv_1 = tf.keras.layers.Conv1D(kernel_size=3, filters=16, strides=1)(embeddings)
conv_2 = tf.keras.layers.Conv1D(kernel_size=3, filters=16, strides=1)(conv_1)
conv_3 = tf.keras.layers.Conv1D(kernel_size=3, filters=16, strides=1)(conv_2)
pool_1 = tf.keras.layers.AveragePooling1D()(conv_3)
drop_1 = tf.keras.layers.Dropout(0.1)(pool_1)

conv_4 = tf.keras.layers.Conv1D(kernel_size=3, filters=16, strides=1)(drop_1)
conv_5 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1)(conv_4)
conv_6 = tf.keras.layers.Conv1D(kernel_size=3, filters=64, strides=1, activation='relu')(conv_5)
pool_2 = tf.keras.layers.AveragePooling1D()(conv_6)
drop_2 = tf.keras.layers.Dropout(0.1)(pool_2)

conv_7 = tf.keras.layers.Conv1D(kernel_size=4, filters=16, strides=1, activation='relu')(drop_2)
conv_8 = tf.keras.layers.Conv1D(kernel_size=4, filters=32, strides=1, activation='relu')(conv_7)
conv_9 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='relu')(conv_8)
pool_3 = tf.keras.layers.MaxPooling1D()(conv_9)
drop_3 = tf.keras.layers.Dropout(0.1)(pool_3)

concat = tf.keras.layers.Flatten()(drop_3)
dense_1 = tf.keras.layers.Dense(64, activation='relu')(concat)
dense_2 = tf.keras.layers.Dense(64, activation='relu')(dense_1)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dense_2)

model = tf.keras.Model(inputs=inputs, outputs=outputs)



In [0]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

In [0]:
checkpoint = tf.keras.callbacks.ModelCheckpoint('model.weights', 
                                                monitor='val_f1',
                                                verbose=1,
                                                save_weights_only=True,
                                                save_best_only=True,
                                                mode='max',
                                                save_freq='epoch')

In [0]:
model.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=[f1])

In [162]:
model.fit(X_train, y_train, 
          validation_data=(X_valid, y_valid),
          batch_size=4000,
          epochs=15,
          callbacks=[checkpoint])

Epoch 1/15
Epoch 00001: val_f1 improved from -inf to 0.00000, saving model to model.weights
Epoch 2/15
Epoch 00002: val_f1 improved from 0.00000 to 0.00317, saving model to model.weights
Epoch 3/15
Epoch 00003: val_f1 improved from 0.00317 to 0.16109, saving model to model.weights
Epoch 4/15
Epoch 00004: val_f1 improved from 0.16109 to 0.23139, saving model to model.weights
Epoch 5/15
Epoch 00005: val_f1 did not improve from 0.23139
Epoch 6/15
Epoch 00006: val_f1 did not improve from 0.23139
Epoch 7/15
Epoch 00007: val_f1 improved from 0.23139 to 0.35665, saving model to model.weights
Epoch 8/15
Epoch 00008: val_f1 did not improve from 0.35665
Epoch 9/15
Epoch 00009: val_f1 did not improve from 0.35665
Epoch 10/15
Epoch 00010: val_f1 did not improve from 0.35665
Epoch 11/15
Epoch 00011: val_f1 did not improve from 0.35665
Epoch 12/15
Epoch 00012: val_f1 improved from 0.35665 to 0.40001, saving model to model.weights
Epoch 13/15
Epoch 00013: val_f1 did not improve from 0.40001
Epoch 14/

<tensorflow.python.keras.callbacks.History at 0x7f3b3a00c8d0>

In [165]:
model.load_weights('model.weights')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f3b33baa2b0>

In [0]:
preds = model.predict(X_valid).reshape(-1)

In [167]:
print(classification_report(y_valid, (preds > 0.5).astype(int)))

              precision    recall  f1-score   support

           0       0.96      0.99      0.97     61266
           1       0.61      0.29      0.40      4041

    accuracy                           0.94     65307
   macro avg       0.78      0.64      0.68     65307
weighted avg       0.93      0.94      0.94     65307



# Попытка 2 (основная)

In [0]:
inputs = tf.keras.layers.Input(shape=(len_limit,))

embeddings = tf.keras.layers.Embedding(input_dim=len(char2id), output_dim=50)(inputs)

conv = []

conv_1 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(embeddings)
conv_2 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(conv_1)
conv_3 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(conv_2)
pool_1 = tf.keras.layers.GlobalMaxPooling1D()(conv_3)
drop_1 = tf.keras.layers.AlphaDropout(0.1)(pool_1)
conv.append(drop_1)

conv_4 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(embeddings)
conv_5 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(conv_4)
conv_6 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(conv_5)
pool_2 = tf.keras.layers.GlobalMaxPooling1D()(conv_6)
drop_2 = tf.keras.layers.AlphaDropout(0.1)(pool_2)
conv.append(drop_2)

conv_7 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(embeddings)
conv_8 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(conv_7)
conv_9 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(conv_8)
pool_3 = tf.keras.layers.GlobalMaxPooling1D()(conv_9)
drop_3 = tf.keras.layers.AlphaDropout(0.1)(pool_3)
conv.append(drop_3)

conv_10 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(embeddings)
conv_11 = tf.keras.layers.Conv1D(kernel_size=2, filters=32, strides=1, activation='tanh')(conv_10)
conv_12 = tf.keras.layers.Conv1D(kernel_size=2, filters=64, strides=1, activation='tanh')(conv_11)
conv_13 = tf.keras.layers.Conv1D(kernel_size=2, filters=128, strides=1, activation='tanh')(conv_12)
pool_4 = tf.keras.layers.GlobalMaxPooling1D()(conv_13)
drop_4 = tf.keras.layers.AlphaDropout(0.1)(pool_4)
conv.append(drop_4)

concat = tf.keras.layers.Concatenate()(conv)
dense_1 = tf.keras.layers.Dense(128, activation='selu')(concat)
dense_2 = tf.keras.layers.Dense(64, activation='selu')(dense_1)
dense_3 = tf.keras.layers.Dense(32, activation='selu')(dense_2)
drop_5 = tf.keras.layers.AlphaDropout(0.1)(dense_3)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(drop_5)

model_2 = tf.keras.Model(inputs=inputs, outputs=outputs)


In [0]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

In [0]:
checkpoint = tf.keras.callbacks.ModelCheckpoint('model.weights', 
                                                monitor='val_f1',
                                                verbose=1,
                                                save_weights_only=True,
                                                save_best_only=True,
                                                mode='max',
                                                save_freq='epoch')

In [0]:
model_2.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=[f1])

In [53]:
model_2.fit(X_train, y_train, 
          validation_data=(X_valid, y_valid),
          batch_size=4000,
          epochs=30,
          callbacks=[checkpoint])

Epoch 1/30
Epoch 00001: val_f1 improved from -inf to 0.00000, saving model to model.weights
Epoch 2/30
Epoch 00002: val_f1 improved from 0.00000 to 0.28512, saving model to model.weights
Epoch 3/30
Epoch 00003: val_f1 improved from 0.28512 to 0.29826, saving model to model.weights
Epoch 4/30
Epoch 00004: val_f1 improved from 0.29826 to 0.44185, saving model to model.weights
Epoch 5/30
Epoch 00005: val_f1 improved from 0.44185 to 0.45842, saving model to model.weights
Epoch 6/30
Epoch 00006: val_f1 did not improve from 0.45842
Epoch 7/30
Epoch 00007: val_f1 did not improve from 0.45842
Epoch 8/30
Epoch 00008: val_f1 improved from 0.45842 to 0.46119, saving model to model.weights
Epoch 9/30
Epoch 00009: val_f1 did not improve from 0.46119
Epoch 10/30
Epoch 00010: val_f1 did not improve from 0.46119
Epoch 11/30
Epoch 00011: val_f1 did not improve from 0.46119
Epoch 12/30
Epoch 00012: val_f1 improved from 0.46119 to 0.47053, saving model to model.weights
Epoch 13/30
Epoch 00013: val_f1 imp

<tensorflow.python.keras.callbacks.History at 0x7fad14c6a7b8>

In [55]:
model_2.load_weights('model.weights')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fad99294ef0>

In [0]:
preds = model_2.predict(X_valid).reshape(-1)

In [57]:
print(classification_report(y_valid, (preds > 0.5).astype(int)))

              precision    recall  f1-score   support

           0       0.97      0.98      0.97     61266
           1       0.65      0.47      0.55      4041

    accuracy                           0.95     65307
   macro avg       0.81      0.73      0.76     65307
weighted avg       0.95      0.95      0.95     65307



# Попытка 3

Попробуем учитывать регистр

In [0]:
def tokenize_cased(text: str) -> list:
  result = []
  for ch in text:
    if ch in ascii_letters or ch == ' ':
      result.append(ch)
    elif ch in digits:
      result.append('DIG')
    elif ch in short_punctuation:
      result.append('PNC')
    else:
      result.append('UNK')
  return result

In [0]:
quora['tokenized_cased'] = quora.cleaned.apply(tokenize_cased)

vocab = build_vocab(quora.tokenized_cased.values, 1)

char2id = index_chars(vocab)
id2char = {i: ch for ch, i in char2id.items()}

quora['indexed_cased'] = quora.tokenized_cased.apply(index_text)

len_limit = limit_max_len(quora.indexed_cased.str.len(), 1)

X = tf.keras.preprocessing.sequence.pad_sequences(quora.indexed_cased.values, maxlen=len_limit)

X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      test_size=0.05,
                                                      random_state=1,
                                                      stratify=y)

In [63]:
inputs = tf.keras.layers.Input(shape=(len_limit,))

embeddings = tf.keras.layers.Embedding(input_dim=len(char2id), output_dim=50)(inputs)

conv = []

conv_1 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(embeddings)
conv_2 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(conv_1)
conv_3 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(conv_2)
pool_1 = tf.keras.layers.GlobalMaxPooling1D()(conv_3)
drop_1 = tf.keras.layers.AlphaDropout(0.1)(pool_1)
conv.append(drop_1)

conv_4 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(embeddings)
conv_5 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(conv_4)
conv_6 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(conv_5)
pool_2 = tf.keras.layers.GlobalMaxPooling1D()(conv_6)
drop_2 = tf.keras.layers.AlphaDropout(0.1)(pool_2)
conv.append(drop_2)

conv_7 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(embeddings)
conv_8 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(conv_7)
conv_9 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(conv_8)
pool_3 = tf.keras.layers.GlobalMaxPooling1D()(conv_9)
drop_3 = tf.keras.layers.AlphaDropout(0.1)(pool_3)
conv.append(drop_3)

conv_10 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(embeddings)
conv_11 = tf.keras.layers.Conv1D(kernel_size=2, filters=32, strides=1, activation='tanh')(conv_10)
conv_12 = tf.keras.layers.Conv1D(kernel_size=2, filters=64, strides=1, activation='tanh')(conv_11)
conv_13 = tf.keras.layers.Conv1D(kernel_size=2, filters=128, strides=1, activation='tanh')(conv_12)
pool_4 = tf.keras.layers.GlobalMaxPooling1D()(conv_13)
drop_4 = tf.keras.layers.AlphaDropout(0.1)(pool_4)
conv.append(drop_4)

concat = tf.keras.layers.Concatenate()(conv)
dense_1 = tf.keras.layers.Dense(128, activation='selu')(concat)
dense_2 = tf.keras.layers.Dense(64, activation='selu')(dense_1)
dense_3 = tf.keras.layers.Dense(32, activation='selu')(dense_2)
drop_5 = tf.keras.layers.AlphaDropout(0.1)(dense_3)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(drop_5)

model_3 = tf.keras.Model(inputs=inputs, outputs=outputs)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

checkpoint = tf.keras.callbacks.ModelCheckpoint('model.weights', 
                                                monitor='val_f1',
                                                verbose=1,
                                                save_weights_only=True,
                                                save_best_only=True,
                                                mode='max',
                                                save_freq='epoch')

model_3.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=[f1])

model_3.fit(X_train, y_train, 
          validation_data=(X_valid, y_valid),
          batch_size=4000,
          epochs=30,
          callbacks=[checkpoint])

Epoch 1/30
Epoch 00001: val_f1 improved from -inf to 0.00000, saving model to model.weights
Epoch 2/30
Epoch 00002: val_f1 improved from 0.00000 to 0.31356, saving model to model.weights
Epoch 3/30
Epoch 00003: val_f1 improved from 0.31356 to 0.33218, saving model to model.weights
Epoch 4/30
Epoch 00004: val_f1 improved from 0.33218 to 0.36174, saving model to model.weights
Epoch 5/30
Epoch 00005: val_f1 improved from 0.36174 to 0.45097, saving model to model.weights
Epoch 6/30
Epoch 00006: val_f1 did not improve from 0.45097
Epoch 7/30
Epoch 00007: val_f1 did not improve from 0.45097
Epoch 8/30
Epoch 00008: val_f1 did not improve from 0.45097
Epoch 9/30
Epoch 00009: val_f1 did not improve from 0.45097
Epoch 10/30
Epoch 00010: val_f1 improved from 0.45097 to 0.45151, saving model to model.weights
Epoch 11/30
Epoch 00011: val_f1 improved from 0.45151 to 0.50761, saving model to model.weights
Epoch 12/30
Epoch 00012: val_f1 did not improve from 0.50761
Epoch 13/30
Epoch 00013: val_f1 did

<tensorflow.python.keras.callbacks.History at 0x7fae523f3400>

In [64]:
model_3.load_weights('model.weights')
preds = model_3.predict(X_valid).reshape(-1)
print(classification_report(y_valid, (preds > 0.5).astype(int)))

              precision    recall  f1-score   support

           0       0.97      0.98      0.97     61266
           1       0.64      0.48      0.55      4041

    accuracy                           0.95     65307
   macro avg       0.80      0.73      0.76     65307
weighted avg       0.95      0.95      0.95     65307



Здесь разница только в том, что пик ф-меры был достигнут на на несколько эпох раньше.

# Попытка 4

Оставляем регистр, добавляем в словарь "значимую" пунктуацию. Остальную храним как один символ.

In [0]:
long_punctuation = set(punctuation) - set(short_punctuation)

In [0]:
def tokenize_cased_punct(text: str) -> list:
  result = []
  for ch in text:
    if ch in ascii_letters or ch in short_punctuation or ch == ' ':
      result.append(ch)
    elif ch in digits:
      result.append('DIG')
    elif ch in long_punctuation:
      result.append('PNC')
    else:
      result.append('UNK')
  return result

In [0]:
quora['tokenized_cased_punct'] = quora.cleaned.apply(tokenize_cased)

vocab = build_vocab(quora.tokenized_cased_punct.values, 1)

char2id = index_chars(vocab)
id2char = {i: ch for ch, i in char2id.items()}

quora['indexed_cased_punct'] = quora.tokenized_cased_punct.apply(index_text)

len_limit = limit_max_len(quora.indexed_cased_punct.str.len(), 1)

X = tf.keras.preprocessing.sequence.pad_sequences(quora.indexed_cased_punct.values, maxlen=len_limit)

X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      test_size=0.05,
                                                      random_state=1,
                                                      stratify=y)

In [71]:
inputs = tf.keras.layers.Input(shape=(len_limit,))

embeddings = tf.keras.layers.Embedding(input_dim=len(char2id), output_dim=50)(inputs)

conv = []

conv_1 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(embeddings)
conv_2 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(conv_1)
conv_3 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(conv_2)
pool_1 = tf.keras.layers.GlobalMaxPooling1D()(conv_3)
drop_1 = tf.keras.layers.AlphaDropout(0.1)(pool_1)
conv.append(drop_1)

conv_4 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(embeddings)
conv_5 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(conv_4)
conv_6 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(conv_5)
pool_2 = tf.keras.layers.GlobalMaxPooling1D()(conv_6)
drop_2 = tf.keras.layers.AlphaDropout(0.1)(pool_2)
conv.append(drop_2)

conv_7 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(embeddings)
conv_8 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(conv_7)
conv_9 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(conv_8)
pool_3 = tf.keras.layers.GlobalMaxPooling1D()(conv_9)
drop_3 = tf.keras.layers.AlphaDropout(0.1)(pool_3)
conv.append(drop_3)

conv_10 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(embeddings)
conv_11 = tf.keras.layers.Conv1D(kernel_size=2, filters=32, strides=1, activation='tanh')(conv_10)
conv_12 = tf.keras.layers.Conv1D(kernel_size=2, filters=64, strides=1, activation='tanh')(conv_11)
conv_13 = tf.keras.layers.Conv1D(kernel_size=2, filters=128, strides=1, activation='tanh')(conv_12)
pool_4 = tf.keras.layers.GlobalMaxPooling1D()(conv_13)
drop_4 = tf.keras.layers.AlphaDropout(0.1)(pool_4)
conv.append(drop_4)

concat = tf.keras.layers.Concatenate()(conv)
dense_1 = tf.keras.layers.Dense(128, activation='selu')(concat)
dense_2 = tf.keras.layers.Dense(64, activation='selu')(dense_1)
dense_3 = tf.keras.layers.Dense(32, activation='selu')(dense_2)
drop_5 = tf.keras.layers.AlphaDropout(0.1)(dense_3)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(drop_5)

model_4 = tf.keras.Model(inputs=inputs, outputs=outputs)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

checkpoint = tf.keras.callbacks.ModelCheckpoint('model.weights', 
                                                monitor='val_f1',
                                                verbose=1,
                                                save_weights_only=True,
                                                save_best_only=True,
                                                mode='max',
                                                save_freq='epoch')

model_4.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=[f1])

model_4.fit(X_train, y_train, 
          validation_data=(X_valid, y_valid),
          batch_size=4000,
          epochs=30,
          callbacks=[checkpoint])

Epoch 1/30
Epoch 00001: val_f1 improved from -inf to 0.00000, saving model to model.weights
Epoch 2/30
Epoch 00002: val_f1 improved from 0.00000 to 0.28070, saving model to model.weights
Epoch 3/30
Epoch 00003: val_f1 improved from 0.28070 to 0.31928, saving model to model.weights
Epoch 4/30
Epoch 00004: val_f1 improved from 0.31928 to 0.37249, saving model to model.weights
Epoch 5/30
Epoch 00005: val_f1 improved from 0.37249 to 0.40427, saving model to model.weights
Epoch 6/30
Epoch 00006: val_f1 improved from 0.40427 to 0.40608, saving model to model.weights
Epoch 7/30
Epoch 00007: val_f1 improved from 0.40608 to 0.40639, saving model to model.weights
Epoch 8/30
Epoch 00008: val_f1 improved from 0.40639 to 0.48436, saving model to model.weights
Epoch 9/30
Epoch 00009: val_f1 improved from 0.48436 to 0.51356, saving model to model.weights
Epoch 10/30
Epoch 00010: val_f1 did not improve from 0.51356
Epoch 11/30
Epoch 00011: val_f1 did not improve from 0.51356
Epoch 12/30
Epoch 00012: v

<tensorflow.python.keras.callbacks.History at 0x7fabc94c8b00>

In [72]:
model_4.load_weights('model.weights')
preds = model_4.predict(X_valid).reshape(-1)
print(classification_report(y_valid, (preds > 0.5).astype(int)))

              precision    recall  f1-score   support

           0       0.97      0.98      0.97     61266
           1       0.64      0.49      0.56      4041

    accuracy                           0.95     65307
   macro avg       0.81      0.74      0.77     65307
weighted avg       0.95      0.95      0.95     65307



Как видим, качество несколько улучшилось. Вероятно в пунктуации действительно содержится полезная информация для нашего классификатора.

# Попытка 5

Оставляем только латинские буквы нижнего регистра

In [0]:
def tokenize_ascii(text: str) -> list:
  result = []
  for ch in text.lower():
    if ch in ascii_lower:
      result.append(ch)
    else:
      result.append('UNK')
  return result

def index_chars(chars: set) -> dict:
  d = {'UNK': 1, 'PAD': 0}
  for ch in chars :
    if ch not in d.keys():
      d[ch] = len(d)
  return d


In [0]:
quora['tokenized_ascii'] = quora.cleaned.apply(tokenize_cased)

vocab = build_vocab(quora.tokenized_ascii.values, 1)

char2id = index_chars(vocab)
id2char = {i: ch for ch, i in char2id.items()}

quora['indexed_ascii'] = quora.tokenized_ascii.apply(index_text)

len_limit = limit_max_len(quora.indexed_ascii.str.len(), 1)

X = tf.keras.preprocessing.sequence.pad_sequences(quora.indexed_ascii.values, maxlen=len_limit)

X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      test_size=0.05,
                                                      random_state=1,
                                                      stratify=y)

In [75]:
inputs = tf.keras.layers.Input(shape=(len_limit,))

embeddings = tf.keras.layers.Embedding(input_dim=len(char2id), output_dim=50)(inputs)

conv = []

conv_1 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(embeddings)
conv_2 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(conv_1)
conv_3 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(conv_2)
pool_1 = tf.keras.layers.GlobalMaxPooling1D()(conv_3)
drop_1 = tf.keras.layers.AlphaDropout(0.1)(pool_1)
conv.append(drop_1)

conv_4 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(embeddings)
conv_5 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(conv_4)
conv_6 = tf.keras.layers.Conv1D(kernel_size=3, filters=32, strides=1, activation='tanh')(conv_5)
pool_2 = tf.keras.layers.GlobalMaxPooling1D()(conv_6)
drop_2 = tf.keras.layers.AlphaDropout(0.1)(pool_2)
conv.append(drop_2)

conv_7 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(embeddings)
conv_8 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(conv_7)
conv_9 = tf.keras.layers.Conv1D(kernel_size=4, filters=64, strides=1, activation='tanh')(conv_8)
pool_3 = tf.keras.layers.GlobalMaxPooling1D()(conv_9)
drop_3 = tf.keras.layers.AlphaDropout(0.1)(pool_3)
conv.append(drop_3)

conv_10 = tf.keras.layers.Conv1D(kernel_size=2, filters=16, strides=1, activation='tanh')(embeddings)
conv_11 = tf.keras.layers.Conv1D(kernel_size=2, filters=32, strides=1, activation='tanh')(conv_10)
conv_12 = tf.keras.layers.Conv1D(kernel_size=2, filters=64, strides=1, activation='tanh')(conv_11)
conv_13 = tf.keras.layers.Conv1D(kernel_size=2, filters=128, strides=1, activation='tanh')(conv_12)
pool_4 = tf.keras.layers.GlobalMaxPooling1D()(conv_13)
drop_4 = tf.keras.layers.AlphaDropout(0.1)(pool_4)
conv.append(drop_4)

concat = tf.keras.layers.Concatenate()(conv)
dense_1 = tf.keras.layers.Dense(128, activation='selu')(concat)
dense_2 = tf.keras.layers.Dense(64, activation='selu')(dense_1)
dense_3 = tf.keras.layers.Dense(32, activation='selu')(dense_2)
drop_5 = tf.keras.layers.AlphaDropout(0.1)(dense_3)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(drop_5)

model_5 = tf.keras.Model(inputs=inputs, outputs=outputs)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

checkpoint = tf.keras.callbacks.ModelCheckpoint('model.weights', 
                                                monitor='val_f1',
                                                verbose=1,
                                                save_weights_only=True,
                                                save_best_only=True,
                                                mode='max',
                                                save_freq='epoch')

model_5.compile(optimizer=optimizer,
              loss='binary_crossentropy',
              metrics=[f1])

model_5.fit(X_train, y_train, 
          validation_data=(X_valid, y_valid),
          batch_size=4000,
          epochs=30,
          callbacks=[checkpoint])

Epoch 1/30
Epoch 00001: val_f1 improved from -inf to 0.00000, saving model to model.weights
Epoch 2/30
Epoch 00002: val_f1 improved from 0.00000 to 0.34488, saving model to model.weights
Epoch 3/30
Epoch 00003: val_f1 did not improve from 0.34488
Epoch 4/30
Epoch 00004: val_f1 improved from 0.34488 to 0.44221, saving model to model.weights
Epoch 5/30
Epoch 00005: val_f1 improved from 0.44221 to 0.44640, saving model to model.weights
Epoch 6/30
Epoch 00006: val_f1 improved from 0.44640 to 0.45315, saving model to model.weights
Epoch 7/30
Epoch 00007: val_f1 improved from 0.45315 to 0.46922, saving model to model.weights
Epoch 8/30
Epoch 00008: val_f1 improved from 0.46922 to 0.47121, saving model to model.weights
Epoch 9/30
Epoch 00009: val_f1 improved from 0.47121 to 0.50605, saving model to model.weights
Epoch 10/30
Epoch 00010: val_f1 did not improve from 0.50605
Epoch 11/30
Epoch 00011: val_f1 did not improve from 0.50605
Epoch 12/30
Epoch 00012: val_f1 did not improve from 0.50605


<tensorflow.python.keras.callbacks.History at 0x7fab7cc813c8>

In [76]:
model_5.load_weights('model.weights')
preds = model_5.predict(X_valid).reshape(-1)
print(classification_report(y_valid, (preds > 0.5).astype(int)))

              precision    recall  f1-score   support

           0       0.97      0.98      0.97     61266
           1       0.62      0.47      0.54      4041

    accuracy                           0.95     65307
   macro avg       0.80      0.73      0.76     65307
weighted avg       0.94      0.95      0.95     65307



Качество ухудшилось, но не так уж сильно.