### Description

Yandex.Algorithm ML

Каждый из файлов субтитров в датасете OpenSubtitles [2], который мы использовали в качестве источника реплик и разговоров, содержит упорядоченный набор реплик. В большинстве случаев, каждая реплика – это ответ на предыдущую, в разговоре между двумя персонажами фильма. Мы случайно выбрали эпизоды этих разговоров в качестве наших тренировочных и тестовых примеров.

Каждый эпизод состоит из двух частей – контекста (Context) и финальной реплики (Reply). Например,

- context_2: Персонаж A говорит реплику 

- context_1: Персонаж B отвечает на нее 

- context_0: Персонаж А произносит вторую реплику 

reply: Персонаж B отвечает на вторую реплику 
Контекстная часть может состоять из трех реплик (как в примере) – в 50% случаев, двух – в 25%, и одного – в оставшихся 25% случаев. Финальная реплика (Reply) всегда завершает любой эпизод, то есть следует за контекстом (Context). Задача участников – найти наиболее подходящую и интересную реплику для данного контекста среди предложенных кандидатов (числом до 6), случайно выбранных из топа кандидатов, возвращенных бейзлайном высокого качества, натренированным командой Алисы (который, в свою очередь, отобрал кандидатов среди всех возможных реплик OpenSubtitles).

Все реплики-кандидаты размечены асессорами на сервисе Яндекс.Толока с помощью следующей инструкции для разметки:

- Good (2): реплика уместна (имеет смысл для данного контекста) и интересна (нетривиальна, специфична именно для данного контекста, мотивирует продолжать разговор)

- Neutral (1): реплика уместна (имеет смысл для данного контекста), но не интересна (тривиальна, не специфична для данного контекста и скорее подталкивает пользователя закончить разговор)

- Bad (0): реплика не имеет никакого смысла в данном контексте

Каждая метка в тренировочной части датасета (и только в ней), сопровождается также уверенностью (confidence) – числом в интервале от 0 до 1 – которое показывает насколько уверенными в своей разметке были асессоры с Толоки, совместно предложившие данную метку. Мы хотим обратить особое внимание участников на эту информацию, она может быть очень полезна при обучении их моделей.

Мы хотим особо отметить, что все участники имеют право скачать датасет OpenSubtitles [2], который использовался для подготовки датасета и применять его для тренировки своих моделей по своему усмотрению.

In [3]:
import pandas as pd, numpy as np

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib

import warnings
warnings.filterwarnings('ignore')


In [None]:
!ls data

- context_id – идентификатор эпизода
- context_2,context_1,context_0 – текст реплик, предшествующих финальной (может состоять из трех частей)
- reply_id – идентификатор реплики-кандидата
- reply – текст реплики-кандидата
- label – метка реплики-кандидата (good, neutral или bad)
- confidence - уверенность в метке реплики-кандидата (число от 0 до 1)

### Load

#### Load dataset

In [4]:
df = pd.read_csv('data/train.tsv', sep='\t', quotechar=' ', header = None)
df.columns = ['context_id', 'context_2', 'context_1', 'context_0', 'reply_id', 'reply', 'label', 'confidence']
test = pd.read_csv('data/public.tsv', sep='\t', quotechar = ' ', header = None)
test.columns = ['context_id', 'context_2', 'context_1', 'context_0', 'reply_id', 'reply']

In [None]:
df.head(6)

In [None]:
test.head(8)

y - label, and prob

### Prep

#### Label

In [5]:
def label_enc(x ,reverse = False):
    if reverse == False:
        if x == 'bad':
            return 0
        elif x == 'neutral':
            return 1
        else:
            return 2
    else:
        if x == 0:
            return 'bad'
        elif x == 1:
            return 'neutral'
        else:
            return 'good'

In [6]:
df['label'] = df['label'].apply(lambda x: label_enc(x))

#### Scorer

In [7]:
from sklearn.metrics import make_scorer

In [8]:
def DCG(label): return sum([float(label[i]/np.log2(i+2)) for i in range(len(label))])

def nDCG(label, best_label):
    label, best_label = DCG(label), DCG(best_label)
    if label != 0 and best_label != 0:
        return label/best_label
    else:
        return 1

scorer = make_scorer(nDCG)

#### Nan clearing
Wanted more clever way to dell and fill nan, but dropna or fillna, will work good.

In [9]:
df.fillna('-', inplace=True)
test.fillna('-', inplace=True)

#### Vectorizer

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.corpus import stopwords
import re

import scipy.sparse as sps

In [11]:
def pre(s):
    return re.sub(r'[^\w]', ' ', s)

In [12]:
def Vect(df, test, use_idf=True, min_df=1, max_df=1.0, ngram_range = (1,8)):
    v = []
    if use_idf == True:
        tfidf = TfidfVectorizer(stop_words=None, preprocessor=pre,
                               ngram_range=ngram_range, strip_accents='unicode', analyzer='word',
                               min_df = min_df, max_df=max_df)
    
        context_0 = tfidf.fit_transform(df['context_0'])
        context_0_t = tfidf.transform(test['context_0'])
        v.append(tfidf)
    
        context_1 = tfidf.fit_transform(df['context_1'])
        context_1_t = tfidf.transform(test['context_1'])
        v.append(tfidf)
        
        context_2 = tfidf.fit_transform(df['context_2'])
        context_2_t = tfidf.transform(test['context_1'])
        v.append(tfidf)
        
        reply = tfidf.fit_transform(df['reply'])
        reply_t = tfidf.transform(test['reply'])
        v.append(tfidf)
        
        return sps.hstack((context_0, context_1, context_2, reply)), \
               sps.hstack((context_0_t, context_1_t, context_2_t, reply_t)), v
    else:
        tf = CountVectorizer(stop_words=None, preprocessor=pre,
                             ngram_range=ngram_range, strip_accents='unicode', analyzer='word',
                             min_df = min_df, max_df=max_df)

        context_0 = tf.fit_transform(df['context_0'])
        context_0_t = tf.transform(test['context_0'])
        v.append(tf)
    
        context_1 = tf.fit_transform(df['context_1'])
        context_1_t = tf.transform(test['context_1'])
        v.append(tf)
        
        context_2 = tf.fit_transform(df['context_2'])
        context_2_t = tf.transform(test['context_1'])
        v.append(tf)
        
        reply = tf.fit_transform(df['reply'])
        reply_t = tf.transform(test['reply'])
        v.append(tf)
    
        return sps.hstack((context_0, context_1, context_2, reply)), \
               sps.hstack((context_0_t, context_1_t, context_2_t, reply_t)), v

X_train_tf, X_test_tf, tf_m = Vect(df, test, use_idf=False, max_df=0.80, min_df=8)
X_train_tfidf, X_test_tfidf, tf_m = Vect(df, test, max_df=0.80, min_df=8)

In [None]:
tf_m

#### Kmeans

In [None]:
from sklearn.decomposition import LatentDirichletAllocation,TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

In [None]:
n_topics = 10

def get_lda(data, test, topics):
    lda = LatentDirichletAllocation(n_topics=topics, n_jobs=-1, learning_method='batch',
                                   verbose = True).fit(data)
    train = lda.transform(data)
    test = lda.transform(test)
    
    return train, test, lda

def get_kmeans(data, test, k, scale=True):
    if scale == True:
        scaler = MinMaxScaler().fit(data)
        train = scaler.transform(data)
        test = scale.transform(test)        
    
    kmean = KMeans(n_clusters=k).fit(data)
    
    train = kmean.predict(data)
    test = kmean.predict(test)      
    
    return train, test, kmean    

X_train_lda, X_train_lda, lda_m = get_lda(X_train_tf, X_test_tf, n_topics)
X_train_m, X_test_m, kmean_m = get_kmeans(X_train_tfidf, X_test_tfidf, n_topics, scale=False)

#### decomposion

In [None]:
from sklearn.decomposition import TruncatedSVD

In [None]:
def SVD(X_train, X_test):
    svd = TruncatedSVD(n_components=4, n_iter = 50)
    svd = TruncatedSVD().fit(X_train)
    
    X_train = sps.hstack((svd.transform(X_train), X_train))
    X_test = sps.hstack((svd.transform(X_test), X_test))
    
    return X_train, X_test

X_train, X_test = SVD(X_train, X_test)

### CV

In [13]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier, AdaBoostClassifier, ExtraTreesClassifier, RandomForestClassifier, BaggingClassifier

from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from xgboost import XGBClassifier

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer

import datetime
from tqdm import tqdm_notebook

In [None]:
def calculate_cv(X, y):
    cv = StratifiedKFold(n_splits=6)
    results = {
        'lr': [],
        'dtc': [],
        'nb': [],
        'xgb': [],
        'adb': [],
        'etr': [],
        'kn': [],
        'rf': [],
        'bag': [],
        'sgd': [],
        'combined': []
    }
    
    lm = LogisticRegression()
    dtc = DecisionTreeClassifier()
    nb = MultinomialNB()
    xgb = XGBClassifier()
    adb = AdaBoostClassifier()
    etr = ExtraTreesClassifier()
    kn = KNeighborsClassifier()
    rf = RandomForestClassifier()
    bag = BaggingClassifier()
    sgd = SGDClassifier()
    vc = VotingClassifier([('lm', lm), ('dtc', dtc), ('nb', nb), 
                           ('xgb', xgb), ('adb', adb), ('etr', etr),
                           ('kn', kn), ('rf', rf), ('bag', bag),
                           ('sgd', sgd)])
    
    for c in tqdm_notebook([0,1,2]):
        y_adj = np.array(y==c)
        results['lr'].append((cross_val_score(lm, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['dtc'].append((cross_val_score(dtc, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['nb'].append((cross_val_score(nb, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['xgb'].append((cross_val_score(xgb, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['adb'].append((cross_val_score(adb, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['etr'].append((cross_val_score(etr, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['kn'].append((cross_val_score(kn, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['rf'].append((cross_val_score(rf, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['bag'].append((cross_val_score(bag, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['sgd'].append((cross_val_score(sgd, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
        results['combined'].append((cross_val_score(vc, X, y_adj, cv=cv, scoring='accuracy', n_jobs=-1).mean(), c))
    
    print("Model accuracy predictions\n")
    for m,s in list(results.items()):
        for ss in s:
            print(("{M} model ({R} rating): {S:.1%}".format(M=m.upper(), R=ss[1], S=ss[0])))
            print()
    return results

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_train_tfidf, df['label'], test_size=0.3)
r1 = calculate_cv(X_test, y_test)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_train_tf, df['label'], test_size=0.3)
r2 = calculate_cv(X_test, y_test)

In [14]:
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
import tensorflow as tf

In [18]:
maxlen = 100
batch_size = 32

X_train, X_test, y_train, y_test = train_test_split(X_train_tf, df['label'], test_size=0.3)

X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(x_test, maxlen=maxlen)


with tf.device('/gpu:0'):
    model = Sequential()
    model.add(Embedding(10000, 128, input_length=maxlen))
    model.add(Bidirectional(LSTM(64)))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    
    # try using different optimizers and different optimizer configs
    model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

    print('Train...')
    model.fit(X_train, y_train,
              batch_size=batch_size,
              epochs=4,
              validation_data=[X_test, y_test])

Train...
Train on 68273 samples, validate on 29260 samples
Epoch 1/4


ResourceExhaustedError: OOM when allocating tensor with shape[32,33926,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: embedding_4/GatherV2 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@training_2/Adam/gradients/embedding_4/GatherV2_grad/Reshape"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_4/embeddings/read, embedding_4/Cast, bidirectional_4/TensorArrayUnstack_1/range/start)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: loss_3/mul/_415 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3201_loss_3/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'embedding_4/GatherV2', defined at:
  File "/home/denis/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/denis/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/denis/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/home/denis/anaconda3/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 486, in start
    self.io_loop.start()
  File "/home/denis/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 112, in start
    self.asyncio_loop.run_forever()
  File "/home/denis/anaconda3/lib/python3.6/asyncio/base_events.py", line 422, in run_forever
    self._run_once()
  File "/home/denis/anaconda3/lib/python3.6/asyncio/base_events.py", line 1432, in _run_once
    handle._run()
  File "/home/denis/anaconda3/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 102, in _handle_events
    handler_func(fileobj, events)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/home/denis/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2903, in run_ast_nodes
    if self.run_code(code, result):
  File "/home/denis/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-18-55fdb6ad56ec>", line 8, in <module>
    model.add(Embedding(10000, 128, input_length=maxlen))
  File "/home/denis/anaconda3/lib/python3.6/site-packages/keras/models.py", line 467, in add
    layer(x)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/keras/engine/topology.py", line 619, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/keras/layers/embeddings.py", line 138, in call
    out = K.gather(self.embeddings, inputs)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1211, in gather
    return tf.gather(reference, indices)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2736, in gather
    return gen_array_ops.gather_v2(params, indices, axis, name=name)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3065, in gather_v2
    "GatherV2", params=params, indices=indices, axis=axis, name=name)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/home/denis/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[32,33926,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: embedding_4/GatherV2 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@training_2/Adam/gradients/embedding_4/GatherV2_grad/Reshape"], _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_4/embeddings/read, embedding_4/Cast, bidirectional_4/TensorArrayUnstack_1/range/start)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: loss_3/mul/_415 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3201_loss_3/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.



### Predict

#### model

In [12]:
def decis(prob0, prob1, prob2):
    if prob0 > prob1 and prob0 > prob2:
        return [prob0, 0]
    elif prob0 < prob1 and prob1 > prob2:
        return [prob1, 1]
    else:
        return [prob2, 2]

In [13]:
def predict(X_train, y ,X_test):
    lm = LogisticRegression(class_weight='balanced')
    dtc = DecisionTreeClassifier(class_weight='balanced')
    nb = MultinomialNB()
    xgb = XGBClassifier()
    adb = AdaBoostClassifier()
    etr = ExtraTreesClassifier(class_weight='balanced')
    kn = KNeighborsClassifier(n_neighbors=10)
    rf = RandomForestClassifier(class_weight='balanced')
    bag = BaggingClassifier()
    sgd = SGDClassifier(class_weight='balanced', loss='log', n_jobs=-1)
    vc = VotingClassifier([('lm', lm), ('dtc', dtc), ('nb', nb), 
                           ('xgb', xgb), ('adb', adb), ('etr', etr),
                           ('kn', kn), ('rf', rf), ('bag', bag),
                           ('sgd', sgd)], n_jobs=-1, voting = 'soft')
    
    result = pd.DataFrame()
    for c in tqdm_notebook([0,1,2]):
        y_adj = np.array(y==c)
        vc.fit(X_train, y_adj)
        result['pred'+str(c)] = vc.predict(X_test)
        result['proba'+str(c)] = vc.predict_proba(X_test)[:,1]
    
    result['label'] = result.apply(lambda row: decis(row['proba0'], row['proba1'], row['proba2'])[1] ,axis=1)
    result['confidence'] = result.apply(lambda row: decis(row['proba0'], row['proba1'], row['proba2'])[0] ,axis=1)
    
    return result[['label', 'confidence']]

In [14]:
result = predict(X_train_tfidf, df['label'], X_test_tfidf)

HBox(children=(IntProgress(value=0, max=3), HTML(value='')))

  if diff:
  if diff:
  if diff:





#### Save subm

In [15]:
test['confidence'] = result['confidence']
test['label']  = result['label']

In [None]:
test.sort_values(by=['context_id', 'confidence'])[['context_id', 'reply_id']].to_csv('subm.csv', encoding='utf-8', sep=' ', index=False)