# Preparing the dataset
At first I decided that bringing out the heavy artillery for binary classification would be overkill, so I would try to solve the task with simple algorithms. The single library listed in _requirements.txt_, _sklearn_, was a subtle hint in the same direction.

Most of the work is done with logistic regression, because I know it well. I also try naive Bayes, because the classic "spam / not spam" binary classification task is very similar to ours.

In [3]:
from pandas import read_csv, concat
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from scipy.sparse import hstack
from sklearn.metrics import roc_auc_score
from datetime import datetime
from stop_words import get_stop_words
from tqdm.notebook import tqdm
from multiprocessing import cpu_count
from json import load as json_load
from numpy import int64, array, logspace, zeros
from joblib import Parallel, delayed
from tabulate import tabulate
from pickle import load as pickle_load, dump as pickle_dump
from timeit import default_timer as timer
from transliterate import translit

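Here I tried to silence _sklearn_'s convergence warnings, but starting from some version, including the one used here, this no longer works (presumably because the filter set in the notebook is not inherited by the worker processes that run the fits in parallel).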

In [4]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

## Import, punctuation removal, and lowercasing

In [5]:
%%time
df_train = read_csv("../data/train.csv")
df_val = read_csv("../data/val.csv")

df_train['description'] = df_train['description'].replace(r'[\W_]+', ' ', regex = True).str.lower()
df_val['description'] = df_val['description'].replace(r'[\W_]+', ' ', regex = True).str.lower()

CPU times: user 28.2 s, sys: 1.3 s, total: 29.5 s
Wall time: 29.5 s
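
To make the effect of the cleaning pattern concrete, here is a minimal sketch that applies the same regular expression with the standard `re` module instead of pandas. The sample string and the `clean_text` helper are made up for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Collapse every run of non-word characters and underscores into one space, then lowercase."""
    return re.sub(r"[\W_]+", " ", text).lower()

# Invented sample, not taken from the dataset.
sample = "Звоните: 8-900_123!! Пишите в WhatsApp."
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that digits that were separated by punctuation end up separated by single spaces, and a trailing punctuation mark leaves a trailing space.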


## Converting the text to TF-IDF...
...in the hope that the set of phrases that signal contact details is quite limited and can be captured conveniently with a relatively small vocabulary. If that works out, cluster analysis and the other tricks of vector models can follow later.

In [6]:
stop_words_ru = get_stop_words("ru")
text_transformer = TfidfVectorizer(
    stop_words=stop_words_ru, ngram_range=(1, 2), min_df=100, use_idf=True, smooth_idf=True,
)
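
The two parameters that shape the vocabulary here are `ngram_range=(1, 2)` and `min_df`. The sketch below imitates their effect in plain Python on three invented documents. It is only an illustration of the idea: it skips stop words and the TF-IDF weighting, and `ngrams` and `build_vocabulary` are hypothetical helpers, not sklearn functions.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Yield all word n-grams with n_min <= n <= n_max as space-joined strings."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def build_vocabulary(docs, min_df):
    """Keep the n-grams that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(): a term counts once per document, however often it repeats
        doc_freq.update(set(ngrams(doc.split())))
    return sorted(term for term, count in doc_freq.items() if count >= min_df)

# Invented documents, not taken from the dataset.
docs = [
    "звоните по телефону",
    "пишите или звоните по телефону",
    "продам диван",
]
vocab = build_vocabulary(docs, min_df=2)
print(vocab)
```

With `min_df=2` only the unigrams and bigrams shared by the first two documents survive, which is the same pruning that `min_df=100` performs on the full corpus.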

In [7]:
%%time
X_train_text = text_transformer.fit_transform(df_train['description'])
X_val_text = text_transformer.transform(df_val['description'])

CPU times: user 1min 10s, sys: 1.79 s, total: 1min 12s
Wall time: 1min 12s


In [8]:
df_train.shape

(984487, 9)

In [9]:
len(df_train[df_train["category"] == "Транспорт"])

209221

In [10]:
X_train_text[df_train["category"] == "Транспорт"].shape

(209221, 72856)

In [11]:
%%time
# save the fitted vectorizer
with open("../lib/text_transformer.pickle", "wb") as f:
    pickle_dump(text_transformer, f)

CPU times: user 4.92 s, sys: 366 ms, total: 5.28 s
Wall time: 5.39 s


## Vectorizing the categories
In principle two encoders could be used here, but why complicate things when it has all been worked out for us?

In [12]:
%%time
categories = ['subcategory', 'category' , 'region', 'city']

vectorizer = DictVectorizer()
df_transformed = concat([df_train[categories], df_val[categories]])
df_transformed = vectorizer.fit_transform(df_transformed.to_dict('records'))
X_train_categ, X_val_categ = df_transformed[:X_train_text.shape[0]], df_transformed[X_train_text.shape[0]:]

CPU times: user 3.97 s, sys: 59.5 ms, total: 4.03 s
Wall time: 4.03 s
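
For string values, `DictVectorizer` creates one 0/1 column per `column=value` pair. The plain-Python sketch below mimics that behaviour on invented records and shows why fitting on the concatenation of train and validation rows gives both parts the same columns. `fit_one_hot` and `transform_one_hot` are hypothetical helpers, and the output is dense rather than sparse.

```python
def fit_one_hot(records):
    """Map every observed 'column=value' pair to a column index (sorted for stability)."""
    names = sorted({f"{key}={value}" for rec in records for key, value in rec.items()})
    return {name: idx for idx, name in enumerate(names)}

def transform_one_hot(records, index):
    """Encode each record as a dense 0/1 row; pairs missing from `index` are skipped."""
    rows = []
    for rec in records:
        row = [0] * len(index)
        for key, value in rec.items():
            col = index.get(f"{key}={value}")
            if col is not None:
                row[col] = 1
        rows.append(row)
    return rows

# Invented records, not taken from the dataset.
train = [{"category": "Транспорт", "city": "Москва"}, {"category": "Работа", "city": "Пермь"}]
val = [{"category": "Работа", "city": "Киров"}]

index = fit_one_hot(train + val)          # fit on both parts at once
encoded = transform_one_hot(train + val, index)
X_train_demo, X_val_demo = encoded[:len(train)], encoded[len(train):]
print(list(index))
```

Because the index is built from both parts, a city that appears only in the validation rows still gets its own column, and the two slices have identical width.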


In [13]:
X_train_categ[df_train["category"] == "Транспорт"].shape

(209221, 3456)

In [14]:
with open("../lib/cat_transformer.pickle", "wb") as f:
    pickle_dump(vectorizer, f)

In [15]:
del df_transformed

In [16]:
!free -mh

              total        used        free      shared  buff/cache   available
Mem:           31Gi        10Gi        15Gi       708Mi       5,2Gi        19Gi
Swap:         2,0Gi       2,0Gi        10Mi


## Regular expressions. They boost the model quite nicely!
The full list lives in the file. It includes some interesting regexes for phone numbers written like 8п9а0к5о5р7в0д9г6а4т7ь (digits interleaved with letters) and plenty of other things.\
I did not come up with them for nothing...
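
As an illustration of the kind of pattern meant here, the sketch below catches an 11-digit number whose digits are interleaved with other characters. The pattern is hypothetical and is not taken from `regexp.json`; the real expressions may look quite different.

```python
import re

# Hypothetical pattern, NOT from regexp.json: a leading 7 or 8 followed by
# ten more digits, each preceded by at most two non-digit characters.
OBFUSCATED_PHONE = re.compile(r"[78](?:\D{0,2}\d){10}")

def has_obfuscated_phone(text: str) -> int:
    """Return 1 if the text contains an 11-digit number hidden between other characters."""
    return int(OBFUSCATED_PHONE.search(text) is not None)

print(has_obfuscated_phone("8п9а0к5о5р7в0д9г6а4т7ь"))            # 1
print(has_obfuscated_phone("пробег 9 в идеале торг небольшой"))  # 0
```

A pattern this loose would also fire on some long numeric strings that are not phone numbers, so in practice it works better as one binary feature among many than as a rule on its own.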

In [17]:
%%time
with open('../lib/logreg_models/regexps/regexp.json') as json_file:
    regexps = json_load(json_file)

for regexp_name in regexps.keys():
    df_train[regexp_name] = df_train['description'].str.contains(regexps[regexp_name]).astype(int)
    df_val[regexp_name] = df_val['description'].str.contains(regexps[regexp_name]).astype(int)


CPU times: user 5min 24s, sys: 69.2 ms, total: 5min 24s
Wall time: 5min 24s


In [18]:
df_train.sample(5)

Unnamed: 0,title,description,subcategory,category,price,region,city,datetime_submitted,is_bad,phone_normal,...,viber,instagram,youtube,home_phone,site,digits,numbers,phone_operators,impulse,punctuation
360667,Дверь входная уличная,дверь входная м 46 2 дверь для загородного дом...,Ремонт и строительство,Для дома и дачи,40300.0,Санкт-Петербург,Пушкин,2019-07-19 15:00:52.161937,0,1,...,0,0,0,1,0,1,0,0,1,0
360502,Ответственный дежурный,описание работодателя частная охранная организ...,Вакансии,Работа,17000.0,Пермский край,Пермь,2019-07-19 14:38:22.885579,1,0,...,0,0,0,1,0,0,1,0,0,0
891722,Ресивер С крышкой (завод) Приора Гранта. 35кл,пробег 9 в идеале торг небольшой,Запчасти и аксессуары,Транспорт,2200.0,Дагестан,Хасавюрт,2019-09-28 13:10:54.887917,0,0,...,0,0,0,0,0,0,0,0,0,0
472973,Угловой стол,угловой стол с тумбочкой три выдвижных ящика д...,Мебель и интерьер,Для дома и дачи,4500.0,Кировская область,Киров,2019-08-05 15:29:58.521798,0,0,...,0,0,0,0,0,1,1,0,0,0
647928,008R12134,008r56541 145s00948 008р56541 145с00948 фьюзе...,Оргтехника и расходники,Бытовая электроника,27000.0,Россия,Москва,2019-08-27 14:11:09.083943,0,1,...,0,0,0,1,0,0,0,0,0,0


In [19]:
X_train_regexp = df_train[regexps.keys()]
X_val_regexp = df_val[regexps.keys()]

In [20]:
X_train_regexp[df_train["category"] == "Транспорт"].shape

(209221, 17)

## Stacking the features
The _csr_ sparse format is used so that the matrix can be sliced by rows and iterated over (_coo_, for example, does not support that).

In [21]:
X_train = hstack([X_train_text, X_train_categ, X_train_regexp], format="csr")
y_train = df_train['is_bad']
X_val = hstack([X_val_text, X_val_categ, X_val_regexp], format="csr")
y_val = df_val['is_bad']

In [22]:
type(X_train)

scipy.sparse.csr.csr_matrix

In [23]:
!free -mh

              total        used        free      shared  buff/cache   available
Mem:           31Gi        11Gi        15Gi       708Mi       3,8Gi        18Gi
Swap:         2,0Gi       2,0Gi        10Mi


In [23]:
X_train[df_train["category"] == "Транспорт"].shape

(209221, 76329)

In [24]:
del X_train_categ, X_val_categ, X_train_text, X_val_text, X_train_regexp, X_val_regexp

In [25]:
!free -mh

              total        used        free      shared  buff/cache   available
Mem:           31Gi        10Gi        15Gi       708Mi       5,2Gi        19Gi
Swap:         2,0Gi       2,0Gi        10Mi


# Training the models
I fit a separate logistic regression for each category. Here is what came out:

In [50]:
def grid_search_model(df_train, Model, params_grid, X_train, y_train):
    models = {}
    for category in df_train["category"].unique():
        start = timer()
        print("Now working on '{0}' category.".format(category))
        lsearch = Model
        grid_cv_logreg = GridSearchCV(lsearch, params_grid, cv=3, scoring='roc_auc', verbose=2, n_jobs=cpu_count())
        grid_cv_logreg.fit(X_train[df_train["category"] == category], y_train[df_train["category"] == category])
        models[category] = grid_cv_logreg.best_params_
        print("Category {0} finished, elapsed time: {1:.1f}s.\n".format(category, timer() - start))
    return models

In [27]:
df_train["category"].unique()

array(['Для дома и дачи', 'Транспорт', 'Услуги', 'Бытовая электроника',
       'Хобби и отдых', 'Личные вещи', 'Недвижимость', 'Животные',
       'Для бизнеса', 'Работа'], dtype=object)

In [None]:
%%time
logreg_grid = {"C": logspace(0, 0.8, 40)}
LogReg = LogisticRegression(solver='lbfgs', multi_class='multinomial', random_state=17, max_iter=300)
# keep the returned best parameters: the next cell reads them from `logregs`
logregs = grid_search_model(df_train, LogReg, logreg_grid, X_train, y_train)

Now working on 'Для дома и дачи' category.
Fitting 3 folds for each of 40 candidates, totalling 120 fits
Category Для дома и дачи finished, elapsed time: 616.4s.

Now working on 'Транспорт' category.
Fitting 3 folds for each of 40 candidates, totalling 120 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Category Транспорт finished, elapsed time: 910.1s.

Now working on 'Услуги' category.
Fitting 3 folds for each of 40 candidates, totalling 120 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Category Услуги finished, elapsed time: 449.3s.

Now working on 'Бытовая электроника' category.
Fitting 3 folds for each of 40 candidates, totalling 120 fits
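
To see which values of `C` the search above actually tries, the grid `logspace(0, 0.8, 40)` can be reproduced without numpy. This is a small sketch; the local `logspace` function is a stand-in written for this example and covers only the base-10 case with the endpoint included.

```python
def logspace(start, stop, num):
    """Pure-Python stand-in for numpy.logspace (base 10, endpoint included)."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

c_grid = logspace(0, 0.8, 40)  # the 40 candidate values of C

print(f"C ranges from {c_grid[0]:.3f} to {c_grid[-1]:.3f}")
print(f"candidate #9:  {c_grid[9]:.6f}")
print(f"candidate #23: {c_grid[23]:.6f}")
```

The grid runs from 1 to roughly 6.31 in multiplicative steps. Candidates #9 and #23 correspond to the `C` values shown for 'Для дома и дачи' and 'Транспорт' in the loaded models printed later in the notebook.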


In [None]:
logreg_params = logregs

for key, params in logreg_params.items():
    params = {
        "C": params["C"],
        "multi_class": 'multinomial',
        "random_state": 17,
        "max_iter": 300,
        "n_jobs": cpu_count(),
    }
    logreg_params[key] = params

In [53]:
def category_fitter(df_train, Model, params, X_train, y_train):
    models = {}
    for category in df_train["category"].unique():
        start = timer()
        print("Now working on '{0}' category.".format(category))
        if params:
            model = Model(**params[category])
        else:
            model = Model()
        model.fit(X_train[df_train["category"] == category], y_train[df_train["category"] == category])
        models[category] = model
        print("Category {0} finished, elapsed time: {1:.1f}s.\n".format(category, timer() - start))
    return models
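
The pattern behind `category_fitter` is "one model per category, routed by a key". The sketch below shows the same idea in plain Python with a toy model, so the routing is visible without sklearn. `MeanRateModel`, `fit_per_category` and `predict_per_category` are hypothetical names used only for this illustration.

```python
from collections import defaultdict

class MeanRateModel:
    """Toy stand-in for an estimator: predicts the positive rate seen during fit."""
    def fit(self, X, y):
        self.rate_ = sum(y) / len(y)
        return self

    def predict_proba_one(self, x):
        return self.rate_

def fit_per_category(categories, X, y, make_model):
    """Group rows by category and fit an independent model on each group."""
    groups = defaultdict(lambda: ([], []))
    for cat, x, label in zip(categories, X, y):
        groups[cat][0].append(x)
        groups[cat][1].append(label)
    return {cat: make_model().fit(xs, ys) for cat, (xs, ys) in groups.items()}

def predict_per_category(models, categories, X):
    """Send each row to the model of its own category, keeping the original row order."""
    return [models[cat].predict_proba_one(x) for cat, x in zip(categories, X)]

# Invented toy data.
categories = ["Транспорт", "Работа", "Транспорт", "Работа"]
X = [[0], [1], [2], [3]]
y = [0, 1, 0, 0]

models = fit_per_category(categories, X, y, MeanRateModel)
preds = predict_per_category(models, categories, X)
print(preds)
```

In the notebook the same routing is done with boolean masks over the sparse matrix, which avoids copying rows one by one.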

In [None]:
logregs = category_fitter(df_train, LogisticRegression, logreg_params, X_train, y_train)

Here I once again accidentally lost all the outputs, so I load the saved models to show the metrics.

Sorry about this _comprehension_ :(

In [57]:
cat_names = [
    (category, translit(category.lower().replace(" ", "_"), 'ru', reversed=True))
    for category in df_train["category"].unique()
]
logregs = {}
for category, name in cat_names:
    # the `with` block closes the file on exit, so no explicit check or close() is needed
    with open("../lib/logreg_models/{0}.pickle".format(name), 'rb') as file:
        logregs[category] = pickle_load(file)

In [42]:
logregs

{'Для дома и дачи': LogisticRegression(C=1.5297321160913588, max_iter=300,
                    multi_class='multinomial', n_jobs=16, random_state=17),
 'Транспорт': LogisticRegression(C=2.963431355763227, max_iter=300, multi_class='multinomial',
                    n_jobs=16, random_state=17),
 'Услуги': LogisticRegression(C=1.6812837894983077, max_iter=300,
                    multi_class='multinomial', n_jobs=16, random_state=17),
 'Бытовая электроника': LogisticRegression(C=1.6037187437513303, max_iter=300,
                    multi_class='multinomial', n_jobs=16, random_state=17),
 'Хобби и отдых': LogisticRegression(C=1.6812837894983077, max_iter=300,
                    multi_class='multinomial', n_jobs=16, random_state=17),
 'Личные вещи': LogisticRegression(C=1.7626003261754577, max_iter=300,
                    multi_class='multinomial', n_jobs=16, random_state=17),
 'Недвижимость': LogisticRegression(C=2.82671518082979, max_iter=300, multi_class='multinomial',
               

In [46]:
def metrics_printer(X_train, X_val, y_train, y_val, models):
    overall_table = []
    cats_table = []
    for dataset, ds_type, labels, df in zip([X_train, X_val], ["train", "val"], [y_train, y_val], [df_train, df_val]):
        labels_pred = zeros(labels.shape)
        for category in df["category"].unique():
            cat_name = translit(category.lower().replace(" ", "_"), 'ru', reversed=True)
            model = models[category]
            labels_pred[df["category"] == category] = model.predict_proba(dataset[df["category"] == category])[:, 1]
            rocauc_category = roc_auc_score(labels[df["category"] == category], labels_pred[df["category"] == category])
            cats_table.append([cat_name, ds_type, rocauc_category])
        rocauc = roc_auc_score(labels, labels_pred)
        overall_table.append([ds_type, rocauc])
    print("Categories table:")
    print(tabulate(cats_table, headers=["Category", "Type", "AUC"], tablefmt='orgtbl'))
    print(" _____________________________________________")
    print(".____________________________________________.>")
    print("Overall table:")
    print(tabulate(overall_table, headers=['Dataset', 'AUC'], tablefmt='orgtbl'))
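
The metric reported by `metrics_printer` is ROC AUC. For intuition, it can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition; it is quadratic in the number of examples and is meant only as an illustration, not as a replacement for `roc_auc_score`.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive is scored higher; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Invented toy scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Since AUC depends only on the ordering of scores, the per-category figures are unaffected by how each model scales its probabilities. The overall figure pools scores from ten separately fitted models, so it additionally depends on how comparable those scores are across categories.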

In [47]:
metrics_printer(X_train, X_val, y_train, y_val, logregs)

Categories table:
| Category             | Type   |      AUC |
|----------------------+--------+----------|
| dlja_doma_i_dachi    | train  | 0.981892 |
| transport            | train  | 0.992807 |
| uslugi               | train  | 0.986237 |
| bytovaja_elektronika | train  | 0.97988  |
| hobbi_i_otdyh        | train  | 0.985773 |
| lichnye_veschi       | train  | 0.989317 |
| nedvizhimost'        | train  | 0.994326 |
| zhivotnye            | train  | 0.997275 |
| dlja_biznesa         | train  | 0.998275 |
| rabota               | train  | 0.984784 |
| transport            | val    | 0.981164 |
| dlja_biznesa         | val    | 0.922197 |
| dlja_doma_i_dachi    | val    | 0.915825 |
| lichnye_veschi       | val    | 0.813546 |
| uslugi               | val    | 0.848407 |
| bytovaja_elektronika | val    | 0.929878 |
| nedvizhimost'        | val    | 0.954078 |
| hobbi_i_otdyh        | val    | 0.87889  |
| rabota               | val    | 0.879812 |
| zhivotnye            | val    | 0.8

In [48]:
def multiple_models_dumper(models):
    for category, model in models.items():
        cat_name = translit(category.lower().replace(" ", "_"), 'ru', reversed=True)
        print(cat_name)
        with open("../lib/logreg_models/{0}_logreg.pickle".format(cat_name), "wb") as f:
            pickle_dump(model, f)

In [None]:
multiple_models_dumper(logregs)

In [51]:
%%time
parameters_nb = {'alpha': logspace(0, 0.01, 30) - .9999}
NaiveBayes = MultinomialNB()
grid_cv_nb = grid_search_model(df_train, NaiveBayes, parameters_nb, X_train, y_train)

Now working on 'Для дома и дачи' category.
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Category Для дома и дачи finished, elapsed time: 4.1s.

Now working on 'Транспорт' category.
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Category Транспорт finished, elapsed time: 4.6s.

Now working on 'Услуги' category.
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Category Услуги finished, elapsed time: 1.2s.

Now working on 'Бытовая электроника' category.
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Category Бытовая электроника finished, elapsed time: 3.6s.

Now working on 'Хобби и отдых' category.
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Category Хобби и отдых finished, elapsed time: 1.2s.

Now working on 'Личные вещи' category.
Fitting 3 folds for each of 30 candidates, totalling 90 fits
Category Личные вещи finished, elapsed time: 2.4s.

Now working on 'Недвижимость' category.
Fitting 3 folds for each of 30 can

In [54]:
naive_bayeses = category_fitter(df_train, MultinomialNB, grid_cv_nb, X_train, y_train)

Now working on 'Для дома и дачи' category.
Category Для дома и дачи finished, elapsed time: 0.1s.

Now working on 'Транспорт' category.
Category Транспорт finished, elapsed time: 0.2s.

Now working on 'Услуги' category.
Category Услуги finished, elapsed time: 0.1s.

Now working on 'Бытовая электроника' category.
Category Бытовая электроника finished, elapsed time: 0.1s.

Now working on 'Хобби и отдых' category.
Category Хобби и отдых finished, elapsed time: 0.1s.

Now working on 'Личные вещи' category.
Category Личные вещи finished, elapsed time: 0.1s.

Now working on 'Недвижимость' category.
Category Недвижимость finished, elapsed time: 0.1s.

Now working on 'Животные' category.
Category Животные finished, elapsed time: 0.1s.

Now working on 'Для бизнеса' category.
Category Для бизнеса finished, elapsed time: 0.1s.

Now working on 'Работа' category.
Category Работа finished, elapsed time: 0.1s.



In [55]:
print("NAIVE BAYES BY CATEGORY")
metrics_printer(X_train, X_val, y_train, y_val, naive_bayeses)

NAIVE BAYES BY CATEGORY
Categories table:
| Category             | Type   |      AUC |
|----------------------+--------+----------|
| dlja_doma_i_dachi    | train  | 0.910569 |
| transport            | train  | 0.938853 |
| uslugi               | train  | 0.931692 |
| bytovaja_elektronika | train  | 0.911815 |
| hobbi_i_otdyh        | train  | 0.944674 |
| lichnye_veschi       | train  | 0.947389 |
| nedvizhimost'        | train  | 0.937181 |
| zhivotnye            | train  | 0.977605 |
| dlja_biznesa         | train  | 0.963456 |
| rabota               | train  | 0.922833 |
| transport            | val    | 0.955261 |
| dlja_biznesa         | val    | 0.876103 |
| dlja_doma_i_dachi    | val    | 0.90007  |
| lichnye_veschi       | val    | 0.784975 |
| uslugi               | val    | 0.755391 |
| bytovaja_elektronika | val    | 0.890499 |
| nedvizhimost'        | val    | 0.814048 |
| hobbi_i_otdyh        | val    | 0.906934 |
| rabota               | val    | 0.840247 |
| zhivotnye  

In [56]:
print("LOGISTIC REGRESSION BY CATEGORY")
metrics_printer(X_train, X_val, y_train, y_val, logregs)

LOGISTIC REGRESSION BY CATEGORY
Categories table:
| Category             | Type   |      AUC |
|----------------------+--------+----------|
| dlja_doma_i_dachi    | train  | 0.981892 |
| transport            | train  | 0.992807 |
| uslugi               | train  | 0.986237 |
| bytovaja_elektronika | train  | 0.97988  |
| hobbi_i_otdyh        | train  | 0.985773 |
| lichnye_veschi       | train  | 0.989317 |
| nedvizhimost'        | train  | 0.994326 |
| zhivotnye            | train  | 0.997275 |
| dlja_biznesa         | train  | 0.998275 |
| rabota               | train  | 0.984784 |
| transport            | val    | 0.981164 |
| dlja_biznesa         | val    | 0.922197 |
| dlja_doma_i_dachi    | val    | 0.915825 |
| lichnye_veschi       | val    | 0.813546 |
| uslugi               | val    | 0.848407 |
| bytovaja_elektronika | val    | 0.929878 |
| nedvizhimost'        | val    | 0.954078 |
| hobbi_i_otdyh        | val    | 0.87889  |
| rabota               | val    | 0.879812 |
| zhi