### Moving on to practice

In [11]:
import numpy as np
import pandas as pd

Our news articles

In [12]:
news = pd.read_csv("articles.csv")
print(news.shape)
news.head(3)

(27000, 2)


Unnamed: 0,doc_id,title
0,6,Заместитель председателяnправительства РФnСерг...
1,4896,Матч 1/16 финала Кубка России по футболу был п...
2,4897,Форвард «Авангарда» Томаш Заборский прокоммент...


Let's load the users and the lists of articles each of them read most recently

In [13]:
users = pd.read_csv("users_articles.csv")
users.head(3)

Unnamed: 0,uid,articles
0,u105138,"[293672, 293328, 293001, 293622, 293126, 1852]"
1,u108690,"[3405, 1739, 2972, 1158, 1599, 322665]"
2,u108339,"[1845, 2009, 2356, 1424, 2939, 323389]"


So, we need to obtain vector representations of users, based on the news articles they have read and on the articles themselves.

### 1. Getting vector representations of the news articles

In [14]:
#from gensim.test.utils import common_texts
# import sys
# !{sys.executable} -m pip install gensim
from gensim.corpora.dictionary import Dictionary



In [15]:
# text preprocessing
import re
from nltk.corpus import stopwords

# If needed, install the dependencies first:
# import sys
# !{sys.executable} -m pip install pymorphy2 razdel
from razdel import tokenize  # https://github.com/natasha/razdel
import pymorphy2

In [16]:
stopword_ru = stopwords.words('russian')
len(stopword_ru)

morph = pymorphy2.MorphAnalyzer()

In [17]:
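with open('stopwords.txt', encoding='utf-8') as f:
    # keep non-empty lines only (a bare `if w` is always true, because every line still ends with '\n')
    additional_stopwords = [w.strip() for w in f if w.strip()]
stopword_ru += additional_stopwords
len(stopword_ru)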

776

In [18]:
def clean_text(text):
    '''
    Clean the text.

    Returns the cleaned text.
    '''
    if not isinstance(text, str):
        text = str(text)

    text = text.lower()
    text = text.strip('\n').strip('\r').strip('\t')
    # raw strings avoid "invalid escape sequence" warnings
    text = re.sub(r"-\s\r\n\|-\s\r\n|\r\n", '', text)

    # drop digits and punctuation
    text = re.sub(r"[-—.,:;_%©«»?*!@#№$^•·&()+=\[\]/0-9]", '', text)
    text = re.sub(r"\r\n\t|\n|\\s|\r\t|\\n", ' ', text)
    # replace soft hyphens and collapse runs of whitespace into a single space
    text = re.sub(r'\xad|\s+', ' ', text.strip())

    return text

cache = {}

def lemmatization(text):
    '''
    Lemmatization:
        [0] if the input is not a `str`, convert it to `str`
        [1] tokenize the text with razdel
        [2] strip a leading '-' from the token
        [3] skip tokens that are one character long
        [4] check whether the word is already in the cache
        [5] lemmatize the word
        [6] filter out stop words

    Returns a list of lemmatized tokens.
    '''

    # [0]
    if not isinstance(text, str):
        text = str(text)

    # [1]
    tokens = list(tokenize(text))
    words = [_.text for _ in tokens]

    words_lem = []
    for w in words:
        if w[0] == '-': # [2]
            w = w[1:]
        if len(w) > 1: # [3]
            if w in cache: # [4]
                words_lem.append(cache[w])
            else: # [5]
                lemma = cache[w] = morph.parse(w)[0].normal_form
                words_lem.append(lemma)

    words_lem_without_stopwords = [i for i in words_lem if i not in stopword_ru] # [6]

    return words_lem_without_stopwords
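
The manual `cache` dictionary can also be expressed with `functools.lru_cache`, and the stop-word check is faster against a `set` than against a `list`. Here is a minimal sketch of that idea. It assumes a stand-in `normalize` function in place of `morph.parse(w)[0].normal_form` (so it runs without pymorphy2) and a tiny hypothetical stop-word set; `lemmatize_tokens` is not part of the notebook.

```python
from functools import lru_cache

# Hypothetical stand-ins: in the notebook these would be
# morph.parse(w)[0].normal_form and set(stopword_ru).
STOPWORDS = {"и", "в", "на"}


@lru_cache(maxsize=None)
def normalize(word):
    """Placeholder normalizer: lower-cases the word (NOT real lemmatization)."""
    return word.lower()


def lemmatize_tokens(tokens):
    """Strip a leading '-', drop one-character tokens, normalize, drop stop words."""
    lemmas = []
    for w in tokens:
        if w.startswith("-"):
            w = w[1:]
        if len(w) > 1:
            lemmas.append(normalize(w))
    return [lemma for lemma in lemmas if lemma not in STOPWORDS]


result = lemmatize_tokens(["Матч", "-Кубка", "и", "на", "Матч", "я"])
print(result)
print(normalize.cache_info())  # shows how many lookups were served from the cache
```

The second occurrence of "Матч" is served from the cache, which is the same effect the `cache` dictionary gives across the whole corpus.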

In [19]:
%%time
# Run the text cleaning. This will take a while...
news['title'] = news['title'].apply(clean_text)


CPU times: user 38.2 s, sys: 695 ms, total: 38.9 s
Wall time: 47.2 s


In [20]:
%%time
# Run the lemmatization. This will take a long time...
news['title'] = news['title'].apply(lemmatization)

CPU times: user 4min 19s, sys: 2.29 s, total: 4min 21s
Wall time: 4min 53s


And now let's train our model in three lines

In [21]:
# build the list of texts (each title is already a list of lemmatized tokens, so no extra splitting is needed)
texts = [t for t in news['title'].values]

# Create a corpus from a list of texts
common_dictionary = Dictionary(texts)
common_corpus = [common_dictionary.doc2bow(text) for text in texts]

What is common_dictionary and what does it look like?

In [22]:
common_dictionary[10]

'ватутин'

It's simple: it is the dictionary of our words, mapping an integer id to each word.
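
To make the role of the dictionary concrete, here is a rough pure-Python sketch of the two steps used above: assigning an integer id to every distinct token, and turning a tokenized document into (token id, count) pairs. It imitates the idea behind `Dictionary` and `doc2bow`, not gensim's actual implementation. `build_vocab` and `to_bow` are hypothetical names, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocab(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; tokens missing from the vocabulary are ignored."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


texts = [["матч", "кубок", "матч"], ["кубок", "россия"]]
token2id = build_vocab(texts)
bow = to_bow(["матч", "матч", "россия", "хоккей"], token2id)
print(token2id)
print(bow)
```

The word "хоккей" never appeared in `texts`, so it is dropped from the bag-of-words vector. This is why a topic model can only use words it saw while the dictionary was built.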

Start the training

In [23]:
from gensim.models import LdaModel
# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=25, id2word=common_dictionary)#, passes=10)

In [24]:
from gensim.test.utils import datapath
# Save model to disk.
temp_file = datapath("model.lda")
lda.save(temp_file)

# Load a potentially pretrained model from disk.
lda = LdaModel.load(temp_file)

The model is trained. Now two questions:

1. what do our topics look like
2. how to get a vector of values for a document (the probabilities of belonging to each topic)

In [25]:
# Build a small corpus from the first four articles (they were part of the training data, so they are not truly unseen)
other_texts = [t for t in news['title'].iloc[:4]]
other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]

unseen_doc = other_corpus[2]
print(other_texts[2])
lda[unseen_doc] 

['форвард', 'авангард', 'томаш', 'заборский', 'прокомментировать', 'игра', 'команда', 'матч', 'чемпионат', 'кхл', 'против', 'атланта', 'nnnn', 'плохой', 'матч', 'нижний', 'новгород', 'против', 'торпедо', 'настраиваться', 'первый', 'минута', 'включиться', 'заборский', 'получиться', 'забросить', 'быстрый', 'гол', 'задать', 'хороший', 'темп', 'поединок', 'играть', 'хороший', 'сторона', 'пять', 'очко', 'выезд', 'девять', 'хороший']


[(4, 0.031471472),
 (5, 0.13983467),
 (8, 0.07237738),
 (18, 0.6991602),
 (21, 0.035977226)]

In [26]:
x=lda.show_topics(num_topics=25, num_words=7,formatted=False)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]

#Below Code Prints Only Words 
for topic,words in topics_words:
    print("topic_{}: ".format(topic)+" ".join(words))

topic_0: фонд научный граница сша система долг американский
topic_1: дыра бизнесмен греческий си корзина козак катастрофический
topic_2: автор земля конкурс японский первый миллиард небо
topic_3: новый эксперт всё журнал украина снижение обнаружить
topic_4: путин россия фонд nn владимир российский пресссекретарить
topic_5: газ ракета район новый nn часть восток
topic_6: рубль тыс статья мозг размер правительство доллар
topic_7: проект развитие технология государственный доход строительство федеральный
topic_8: исследование всё власть экономика ребёнок жизнь министерство
topic_9: россия сша российский военный млрд научный данные
topic_10: банк завод дональд вскоре последствие фсб азия
topic_11: поверхность знаменитый девочка этаж гражданство надёжный вклад
topic_12: турция форум отряд ск транспорт грузия предупредить
topic_13: иран украина опрос спасти орден век золото
topic_14: млн рост цена составить рынок показатель вырасти
topic_15: погибнуть станция космос супруг производить оплата

Not bad at all: for most topics it is fairly easy to say what they are about.

Let's write a function that returns the vector representation of a news article.

In [27]:
#text = news['title'].iloc[0]

def get_lda_vector(text):
    unseen_doc = common_dictionary.doc2bow(text)
    lda_tuple = lda[unseen_doc]
    not_null_topics = dict(zip([i[0] for i in lda_tuple], [i[1] for i in lda_tuple]))

    output_vector = []
    for i in range(25):
        if i not in not_null_topics:
            output_vector.append(0)
        else:
            output_vector.append(not_null_topics[i])
    return np.array(output_vector)
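
The loop in `get_lda_vector` turns the sparse list of (topic id, probability) pairs into a fixed-length vector. The same conversion can be sketched with NumPy indexing. `sparse_to_dense` is a hypothetical helper, and the pairs below are copied from the output shown earlier for the third article.

```python
import numpy as np


def sparse_to_dense(topic_pairs, num_topics):
    """Place each (topic_id, probability) pair into a zero-filled vector."""
    vector = np.zeros(num_topics)
    for topic_id, prob in topic_pairs:
        vector[topic_id] = prob
    return vector


pairs = [(4, 0.031471472), (5, 0.13983467), (8, 0.07237738),
         (18, 0.6991602), (21, 0.035977226)]
dense = sparse_to_dense(pairs, num_topics=25)
print(dense.shape)
print(dense.sum())
```

The sparse output omits topics whose probability falls below the model's `minimum_probability` setting, so the dense vector does not have to sum exactly to 1.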

In [28]:
topic_matrix = pd.DataFrame([get_lda_vector(text) for text in news['title'].values])
topic_matrix.columns = ['topic_{}'.format(i) for i in range(25)]
topic_matrix['doc_id'] = news['doc_id'].values
topic_matrix = topic_matrix[['doc_id']+['topic_{}'.format(i) for i in range(25)]]
topic_matrix.head(5)

Unnamed: 0,doc_id,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_15,topic_16,topic_17,topic_18,topic_19,topic_20,topic_21,topic_22,topic_23,topic_24
0,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.44082,0.0,...,0.0,0.0,0.0,0.0,0.0,0.05059,0.0,0.0,0.0,0.0
1,4896,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.232115,0.103889,0.053673,0.259095,0.0,0.0,0.330162,0.0,0.0,0.0
2,4897,0.0,0.0,0.0,0.0,0.031467,0.139947,0.0,0.0,0.072374,...,0.0,0.0,0.0,0.699058,0.0,0.0,0.035974,0.0,0.0,0.0
3,4898,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.412581,0.0,0.521553,0.0,0.0,0.0,0.0,0.0,0.0
4,4899,0.0,0.0,0.0,0.063458,0.0,0.0,0.0,0.517509,0.0,...,0.0,0.0,0.0,0.0,0.393114,0.0,0.0,0.0,0.0,0.0


Great, we now have vectors for our news articles, and we can even interpret the resulting topics.

We can move on.

### Next step: vector representations of users

In [29]:
users.head(3)

Unnamed: 0,uid,articles
0,u105138,"[293672, 293328, 293001, 293622, 293126, 1852]"
1,u108690,"[3405, 1739, 2972, 1158, 1599, 322665]"
2,u108339,"[1845, 2009, 2356, 1424, 2939, 323389]"


In [30]:
doc_dict = dict(zip(topic_matrix['doc_id'].values, topic_matrix[['topic_{}'.format(i) for i in range(25)]].values))

In [31]:
doc_dict[293622]

array([0.0429141 , 0.        , 0.        , 0.18659358, 0.18337567,
       0.22912204, 0.        , 0.        , 0.        , 0.03675398,
       0.02910567, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.27988622, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ])

In [32]:
import ast

user_articles_list = users['articles'].iloc[33]

def get_user_embedding_mean(user_articles_list):
    # the article list is stored as a string such as '[323329, 321961]';
    # ast.literal_eval parses it without executing arbitrary code (unlike eval)
    user_articles_list = ast.literal_eval(user_articles_list)
    user_vector = np.array([doc_dict[doc_id] for doc_id in user_articles_list])
    user_vector = np.mean(user_vector, 0)
    return user_vector

def get_user_embedding_median(user_articles_list):
    user_articles_list = ast.literal_eval(user_articles_list)
    user_vector = np.array([doc_dict[doc_id] for doc_id in user_articles_list])
    user_vector = np.median(user_vector, 0)
    return user_vector

def get_user_embedding_max(user_articles_list):
    user_articles_list = ast.literal_eval(user_articles_list)
    user_vector = np.array([doc_dict[doc_id] for doc_id in user_articles_list])
    user_vector = np.max(user_vector, 0)
    return user_vector
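
The three functions differ only in the aggregation applied along the article axis, so they can be collapsed into one function that takes the aggregator as an argument. This is a sketch under assumptions: `get_user_embedding` with an `agg` parameter and the toy `toy_doc_dict` are hypothetical, and `ast.literal_eval` parses the stringified list of article ids.

```python
import ast

import numpy as np


def get_user_embedding(user_articles, doc_vectors, agg=np.mean):
    """Aggregate the topic vectors of the articles a user has read."""
    article_ids = ast.literal_eval(user_articles)
    vectors = np.array([doc_vectors[doc_id] for doc_id in article_ids])
    return agg(vectors, axis=0)


# Toy data: three articles, three topics (made-up numbers).
toy_doc_dict = {
    1: np.array([0.0, 0.6, 0.4]),
    2: np.array([0.3, 0.0, 0.7]),
    3: np.array([0.0, 0.0, 1.0]),
}
user = "[1, 2, 3]"
mean_vec = get_user_embedding(user, toy_doc_dict, np.mean)
median_vec = get_user_embedding(user, toy_doc_dict, np.median)
max_vec = get_user_embedding(user, toy_doc_dict, np.max)
print(mean_vec, median_vec, max_vec)
```

Even on this toy input the three aggregations give different vectors, so in general they should also lead to different downstream metrics.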

In [33]:
get_user_embedding_mean(user_articles_list)

array([0.01769576, 0.        , 0.00442432, 0.08253028, 0.10612228,
       0.        , 0.02866339, 0.06515276, 0.17764636, 0.17429365,
       0.0089097 , 0.00468868, 0.        , 0.01597304, 0.        ,
       0.        , 0.08442609, 0.10570284, 0.00435285, 0.03932005,
       0.0382451 , 0.00683009, 0.        , 0.        , 0.0202493 ])

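Judging by the vector above, this user mostly read articles from topic_8 and topic_9, followed by topic_4 and topic_17 (roughly politics and the state). Topic numbering changes between training runs, so check it against the topic list printed by your own run.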

In [34]:
users['articles'].iloc[33]

'[323329, 321961, 324743, 323186, 324632, 474690]'

In [35]:
" ".join(news[news['doc_id']==323186]['title'].iloc[0])

'глава российский мид сергей лавров опровергнуть появиться сми информация якобы готовиться обмен декларация россия сша сотрудничество сфера риа новость nn читать сообщение разговаривать автор сообщение откуда автор источник какихлибо основание подобный род репортаж откуда информация появиться журналист итог встреча госсекретарь сша джон керри nn позиция изложить декларация напринимать достаточно рамка обсе рамка совет россия нато высокий уровень продекларировать всё обеспечивать неделимость безопасность никто обеспечивать безопасность счёт безопасность продолжить министр лавров москва система нато создавать проблема безопасность поэтому декларация недостаточно договариваться совместный система россия предлагать начинать путин посещать сша нужно вести речь очередной декларация гарантия проверять объективный военнотехнический критерий гарантия ненаправленность система против российский ядерный потенциал подчеркнуть глава мид газета коммерсантъ ссылаться дипломатический источник написать 

Now let's get the embeddings for all users and check their quality on a concrete downstream task.

In [36]:
user_embeddings_mean = pd.DataFrame([i for i in users['articles'].apply(lambda x: get_user_embedding_mean(x), 1)])
user_embeddings_mean.columns = ['topic_{}'.format(i) for i in range(25)]
user_embeddings_mean['uid'] = users['uid'].values
user_embeddings_mean = user_embeddings_mean[['uid']+['topic_{}'.format(i) for i in range(25)]]
user_embeddings_mean.head(3)

Unnamed: 0,uid,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_15,topic_16,topic_17,topic_18,topic_19,topic_20,topic_21,topic_22,topic_23,topic_24
0,u105138,0.036966,0.0,0.037285,0.134376,0.052677,0.094946,0.035764,0.064206,0.01627,...,0.021337,0.209977,0.0,0.006903,0.022963,0.055371,0.005091,0.005239,0.0,0.060865
1,u108690,0.022578,0.0,0.022921,0.119199,0.037422,0.042954,0.056197,0.058062,0.178506,...,0.002257,0.075997,0.008143,0.016363,0.040566,0.022152,0.001787,0.0,0.0,0.019234
2,u108339,0.023764,0.0,0.009639,0.054207,0.002717,0.07722,0.046653,0.035548,0.110014,...,0.018454,0.146564,0.036651,0.045467,0.045178,0.03858,0.006568,0.011466,0.0,0.0392


In [37]:
user_embeddings_median = pd.DataFrame([i for i in users['articles'].apply(lambda x: get_user_embedding_median(x), 1)])
user_embeddings_median.columns = ['topic_{}'.format(i) for i in range(25)]
user_embeddings_median['uid'] = users['uid'].values
user_embeddings_median = user_embeddings_median[['uid']+['topic_{}'.format(i) for i in range(25)]]

In [38]:
user_embeddings_max = pd.DataFrame([i for i in users['articles'].apply(lambda x: get_user_embedding_max(x), 1)])
user_embeddings_max.columns = ['topic_{}'.format(i) for i in range(25)]
user_embeddings_max['uid'] = users['uid'].values
user_embeddings_max = user_embeddings_max[['uid']+['topic_{}'.format(i) for i in range(25)]]

The dataset is ready, so we can try training a model. Let's load our labels.

In [39]:
target = pd.read_csv("users_churn.csv")
target.head(3)

Unnamed: 0,uid,churn
0,u107120,0
1,u102277,0
2,u102444,0


In [40]:
X_median = pd.merge(user_embeddings_median, target, 'left')
X_median.head(3)

Unnamed: 0,uid,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_16,topic_17,topic_18,topic_19,topic_20,topic_21,topic_22,topic_23,topic_24,churn
0,u105138,0.021457,0.0,0.0,0.119199,0.015252,0.035178,0.0,0.067462,0.0,...,0.182492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033757,0
1,u108690,0.01869,0.0,0.005612,0.128507,0.031967,0.026708,0.048484,0.05655,0.188167,...,0.051091,0.0,0.0,0.023188,0.0,0.0,0.0,0.0,0.0,1
2,u108339,0.005744,0.0,0.0,0.029749,0.0,0.083529,0.028469,0.029961,0.086625,...,0.144626,0.029903,0.023289,0.055243,0.022665,0.0,0.0,0.0,0.0,1


In [41]:
X_mean = pd.merge(user_embeddings_mean, target, 'left')
X_mean.head(3)

Unnamed: 0,uid,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_16,topic_17,topic_18,topic_19,topic_20,topic_21,topic_22,topic_23,topic_24,churn
0,u105138,0.036966,0.0,0.037285,0.134376,0.052677,0.094946,0.035764,0.064206,0.01627,...,0.209977,0.0,0.006903,0.022963,0.055371,0.005091,0.005239,0.0,0.060865,0
1,u108690,0.022578,0.0,0.022921,0.119199,0.037422,0.042954,0.056197,0.058062,0.178506,...,0.075997,0.008143,0.016363,0.040566,0.022152,0.001787,0.0,0.0,0.019234,1
2,u108339,0.023764,0.0,0.009639,0.054207,0.002717,0.07722,0.046653,0.035548,0.110014,...,0.146564,0.036651,0.045467,0.045178,0.03858,0.006568,0.011466,0.0,0.0392,1


In [42]:
X_max = pd.merge(user_embeddings_max, target, 'left')
X_max.head(3)

Unnamed: 0,uid,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_16,topic_17,topic_18,topic_19,topic_20,topic_21,topic_22,topic_23,topic_24,churn
0,u105138,0.134018,0.0,0.160092,0.309205,0.183376,0.270196,0.214583,0.128669,0.097619,...,0.496187,0.0,0.041417,0.096075,0.297697,0.030548,0.031431,0.0,0.209472,0
1,u108690,0.076726,0.0,0.113003,0.210235,0.080632,0.108216,0.17011,0.112524,0.319377,...,0.185105,0.032736,0.098179,0.155262,0.104055,0.01072,0.0,0.0,0.079525,1
2,u108339,0.104833,0.0,0.034968,0.129668,0.016303,0.149815,0.133756,0.085778,0.309233,...,0.263548,0.079354,0.122227,0.096692,0.102265,0.025842,0.043059,0.0,0.152535,1


In [43]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import itertools

import matplotlib.pyplot as plt

%matplotlib inline

In [44]:
# split the data into train/test
X_train, X_test, y_train, y_test = train_test_split(X_mean[['topic_{}'.format(i) for i in range(25)]], 
                                                    X_mean['churn'], random_state=0)

In [45]:
logreg = LogisticRegression(solver='lbfgs')
# fit the model
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [46]:
# our predictions for the test set
preds = logreg.predict_proba(X_test)[:, 1]
preds[:10]

array([0.12603403, 0.04867823, 0.2582371 , 0.09830045, 0.10335489,
       0.07797125, 0.21517718, 0.09828997, 0.09262448, 0.1935983 ])

In [47]:
from sklearn.metrics import f1_score, roc_auc_score, precision_score, classification_report, precision_recall_curve, confusion_matrix

### Computing Precision, Recall, F_score

In [64]:
precision, recall, thresholds = precision_recall_curve(y_test, preds)
fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
mean_result = [roc_auc_score(y_test, preds), thresholds[ix], fscore[ix], precision[ix], recall[ix]]


In [58]:
X_train, X_test, y_train, y_test = train_test_split(X_median[['topic_{}'.format(i) for i in range(25)]], 
                                                    X_median['churn'], random_state=0)
logreg.fit(X_train, y_train)
preds = logreg.predict_proba(X_test)[:, 1]
preds[:10]

array([0.24735099, 0.02012299, 0.36782298, 0.19294469, 0.04905664,
       0.08541669, 0.16962504, 0.09425571, 0.07263817, 0.12107159])

In [59]:
precision, recall, thresholds = precision_recall_curve(y_test, preds)
fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
median_result = [roc_auc_score(y_test, preds), thresholds[ix], fscore[ix], precision[ix], recall[ix]]
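
The threshold search repeated in these cells can be sketched as a reusable helper. It takes the three arrays returned by `precision_recall_curve`, treats F1 as 0 where precision + recall is 0 (the plain formula would produce NaN there), and never indexes past the end of `thresholds`. `best_f_score` is a hypothetical name and the arrays below are made-up values, not results from this dataset.

```python
import numpy as np


def best_f_score(precision, recall, thresholds):
    """Return (threshold, f_score, precision, recall) at the maximum F1."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    denom = precision + recall
    # Define F1 as 0 where precision + recall == 0 instead of producing NaN.
    fscore = np.divide(2 * precision * recall, denom,
                       out=np.zeros_like(denom), where=denom > 0)
    # precision_recall_curve returns one more precision/recall point than
    # thresholds, so the final point (which has no threshold) is excluded.
    ix = int(np.argmax(fscore[:len(thresholds)]))
    return thresholds[ix], fscore[ix], precision[ix], recall[ix]


# Toy arrays in the shape returned by precision_recall_curve (made-up values).
precision = np.array([0.5, 0.67, 1.0, 0.0, 1.0])
recall = np.array([1.0, 1.0, 0.5, 0.0, 0.0])
thresholds = np.array([0.1, 0.3, 0.6, 0.9])
best_thr, best_f, best_p, best_r = best_f_score(precision, recall, thresholds)
print(best_thr, best_f, best_p, best_r)
```

Calling one function with explicit arguments also makes it harder to reuse `preds` from a different model by accident when cells are run out of order.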

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X_max[['topic_{}'.format(i) for i in range(25)]], 
                                                    X_max['churn'], random_state=0)
logreg.fit(X_train, y_train)
preds = logreg.predict_proba(X_test)[:, 1]
preds[:10]

array([0.13039935, 0.00220364, 0.57479928, 0.08352207, 0.00127034,
       0.01458581, 0.09840019, 0.12390224, 0.01949426, 0.52980848])

In [81]:
precision, recall, thresholds = precision_recall_curve(y_test, preds)
fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
max_result = [roc_auc_score(y_test, preds), thresholds[ix], fscore[ix], precision[ix], recall[ix]]

# one row per aggregation method, one column per metric
metric_names = ['rocauc', 'thresholds', 'f_score', 'prec', 'rec']
method_names = ['mean', 'median', 'max']
res = pd.DataFrame([mean_result, median_result, max_result],
                   index=method_names, columns=metric_names)

In [83]:
res

Unnamed: 0,rocauc,thresholds,f_score,prec,rec
mean,0.97615,0.346224,0.801619,0.795181,0.808163
median,0.965244,0.265719,0.741697,0.676768,0.820408
max,0.97615,0.346224,0.801619,0.795181,0.808163


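The mean and max rows are identical in every metric, which is almost certainly not a real result, since mean and max aggregation produce different user vectors. The likely cause is cell execution order: the cell that fills `mean_result` is numbered In [64], so it ran after the max model was fitted in In [60] and reused that model's `y_test` and `preds`. To get the actual metrics for the mean embeddings, re-run the mean split, fit and metric cells in order before building `res`. The median row differs because it was computed right after the median model was fitted.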

In [86]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [89]:
# we already found the "optimal" threshold earlier, when we maximized f_score
# font = {'size' : 15}

# plt.rc('font', **font)

# cnf_matrix = confusion_matrix(y_test, preds>thresholds[ix])
# plt.figure(figsize=(10, 8))
# plot_confusion_matrix(cnf_matrix, classes=['Non-Churn', 'churn'],
#                       title='Confusion matrix')
# plt.savefig("conf_matrix.png")
# plt.show()

In [90]:
# roc_auc_score(y_test, preds)

Overall, we can see that the resulting vector representations carry some signal and let us solve our applied task.

### Homework

1. Work out on your own what tf-idf is (documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html and also https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
2. Modify the code of the get_user_embedding function so that it computes the median instead of the mean (np.mean in the example). Apply this transformation to the data, train a churn prediction model, then compute and save the quality metrics: roc auc, precision/recall/f_score (for the last three, pick the optimal threshold with precision_recall_curve, as was done in the lesson)
3. Repeat item 2, but using max instead of the median
4. (optional, if you really want to) Using what you learned in item 1, repeat item 2, but this time weighting the articles by tf-idf (hint: you need a weight coefficient for each document. Not all documents are equally informative or carry a useful signal). Hint 2: it is idf specifically that should be used as the weight.
5. Build a single output table comparing the quality of the different ways of obtaining user embeddings (mean, median, max, idf_mean) on the metrics roc_auc, precision, recall, f_score
6. Draw your own conclusions and hypotheses about why one method or another turned out to be more effective than the rest
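
For item 4, one possible reading of "idf as the weight" is to weight each article by the inverse of how many users have read it, so that articles read by almost everyone contribute less to a user's vector. The sketch below follows that reading only. It is an assumption, not the course's reference solution. `article_idf` and `get_user_embedding_idf_mean` are hypothetical names, and the data is a toy example.

```python
import ast
import math
from collections import Counter

import numpy as np


def article_idf(users_articles):
    """idf(doc) = log(number of users / number of users who read doc)."""
    n_users = len(users_articles)
    readers = Counter()
    for articles in users_articles:
        readers.update(set(ast.literal_eval(articles)))
    return {doc_id: math.log(n_users / count) for doc_id, count in readers.items()}


def get_user_embedding_idf_mean(user_articles, doc_vectors, idf):
    """idf-weighted mean of article vectors; plain mean if all weights are 0."""
    article_ids = ast.literal_eval(user_articles)
    vectors = np.array([doc_vectors[d] for d in article_ids])
    weights = np.array([idf[d] for d in article_ids])
    if weights.sum() == 0:
        return vectors.mean(axis=0)
    return np.average(vectors, axis=0, weights=weights)


# Toy data: article 1 is read by every user, so its idf is 0.
toy_users = ["[1, 2]", "[1, 3]", "[1, 2]", "[1]"]
toy_doc_dict = {
    1: np.array([1.0, 0.0]),
    2: np.array([0.0, 1.0]),
    3: np.array([0.5, 0.5]),
}
idf = article_idf(toy_users)
vec = get_user_embedding_idf_mean("[1, 2]", toy_doc_dict, idf)
print(idf)
print(vec)
```

In the toy data the article everyone has read gets zero weight, so the user's vector is determined entirely by the rarer article.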

#### Links

1. http://www.machinelearning.ru/wiki/images/d/d5/Voron17survey-artm.pdf
2. https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation