# API usage example

* [1. Importing modules](#1)
* [2. Pipeline settings](#2)
* [3. Validation settings](#3)
* [4. Training the Pipeline and getting results](#4)
* [5. Model ensemble](#5)
* [6. Fine-tuning](#6)
* [7. Interpreting the results](#7)

In [1]:
%load_ext autoreload
%autoreload 2

import os

In [2]:
SEED = 1

In [3]:
# Move up to the project root if the notebook was started from a subfolder
if 'notebooks' not in os.listdir():
    os.chdir('..')
    print(os.getcwd())

F:\study\recs


<a id='1'></a>
## 1. Importing modules

In [4]:
from recs_searcher import (
    dataset,  # sample datasets
    preprocessing,  # text preprocessing
    embeddings,  # converting text to embeddings
    similarity_search,  # fast search engines over the embedding space
    augmentation,  # text augmentation for pipeline validation
    explain,  # interpreting the similarity of two texts
)
from recs_searcher import api  # the Pipeline

<a id='2'></a>
## 2. Pipeline settings

In [5]:
dataset_phones = dataset.load_mobile_phones()
dataset_games = dataset.load_video_games()

SPACY_MODEL_NAME = 'en_core_web_md'
preprocessing_list = [
    preprocessing.TextLower(),
    preprocessing.RemovePunct(),
    # preprocessing.RemoveNumber(),
    preprocessing.RemoveWhitespace(),
    # preprocessing.RemoveHTML(),
    # preprocessing.RemoveURL(),
    # preprocessing.RemoveEmoji(),

    # preprocessing.RemoveStopwordsSpacy(spacy_model_name=SPACY_MODEL_NAME),
    # preprocessing.LemmatizeSpacy(spacy_model_name=SPACY_MODEL_NAME),
]

model_count_vectorizer_word = embeddings.CountVectorizerWrapperEmbedding(
    analyzer='word',
    ngram_range=(1, 1),
)
model_count_vectorizer_char = embeddings.CountVectorizerWrapperEmbedding(
    analyzer='char',
    ngram_range=(1, 2),
)

explainer = explain.DistanceExplain

searcher_faiss = similarity_search.FaissSearch
searcher_chroma = similarity_search.ChromaDBSearch
searcher_knn = similarity_search.NearestNeighborsSearch
searcher_fuzz = similarity_search.TheFuzzSearch

<a id='3'></a>
## 3. Validation settings

In [6]:
validate_augmentation_transforms = [
    augmentation.CharAugmentation(
        action='insert',
        unit_prob=1.0,
        min_aug=1,
        max_aug=2,
        mult_num=2,
        seed=SEED,
    ),
    augmentation.CharAugmentation(
        action='delete',
        unit_prob=1.0,
        min_aug=1,
        max_aug=2,
        mult_num=2,
        seed=SEED,
    ),
]
accuracy_top = [1, 5, 10]

In [7]:
validate_augmentation_transforms[0].transform(['Motorola Moto G9 Power'])

['MotorPola Moto G9 Polwer']
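
The output above shows what an `insert` transform does to one string: a few extra characters appear inside the words. As a rough illustration of the idea, here is a minimal pure-Python sketch. The helper `insert_random_chars` and its parameters are hypothetical and are not the `recs_searcher` implementation, whose sampling rules (which words are touched, how many characters are added) may differ.

```python
import random
import string


def insert_random_chars(text, n_insert=1, seed=None):
    """Insert n_insert random lowercase letters into every word (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the noise reproducible
    words = []
    for word in text.split(' '):
        chars = list(word)
        for _ in range(n_insert):
            pos = rng.randint(0, len(chars))  # any position, including both ends
            chars.insert(pos, rng.choice(string.ascii_lowercase))
        words.append(''.join(chars))
    return ' '.join(words)


original = 'Motorola Moto G9 Power'
noisy = insert_random_chars(original, n_insert=1, seed=1)
print(noisy)
```

With a fixed seed the same noisy string is produced on every call, which is useful when validation runs must be comparable.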

<a id='4'></a>
## 4. Training the Pipeline and getting results

### CountVectorizer-Faiss

**Character-level analysis:**

In [161]:
pipeline1 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_char,
    explainer=explainer,
    searcher=searcher_faiss,
    verbose=True,
)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


In [156]:
%%time
pipeline1.search('phone', 5, ascending=True)

CPU times: total: 15.6 ms
Wall time: 55.5 ms


Unnamed: 0,text,similarity
0,Lenovo Legion Phone Pro,0.343264
1,Fairphone 3,0.345346
2,Fairphone 4,0.345346
3,Apple iPhone 6,0.359487
4,Apple iPhone X,0.359487


In [157]:
pipeline1.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

100%|██████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 320.17it/s]

Top 1Acc = 0.9103139013452914
Top 5Acc = 1.0
Top 10Acc = 1.0





{1: 0.9103139013452914, 5: 1.0, 10: 1.0}
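
The returned dictionary maps each `k` from `accuracy_top` to a top-k accuracy. Presumably each augmented string is used as a query and counts as a hit when its original item appears among the first `k` results. The sketch below shows that metric in isolation; `top_k_accuracy` and its toy inputs are hypothetical and not part of `recs_searcher`.

```python
def top_k_accuracy(ranked_results, true_ids, ks=(1, 5, 10)):
    """ranked_results[i] lists the item ids returned for query i, best first."""
    scores = {}
    for k in ks:
        hits = sum(
            true_id in ranked[:k]
            for ranked, true_id in zip(ranked_results, true_ids)
        )
        scores[k] = hits / len(true_ids)
    return scores


# Three toy queries: the correct item is ranked 1st, 3rd and 2nd respectively.
ranked = [[3, 1, 2], [0, 2, 1], [1, 0, 3]]
truth = [3, 1, 0]
acc = top_k_accuracy(ranked, truth, ks=(1, 2, 3))
print(acc)
```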

In [1]:
# pipeline1.save(path_folder_save='pipelines', filename='tmp.pkl')

In [2]:
# pipeline1 = pipeline1.load(path_to_filename='./pipelines/tmp.pkl')
# pipeline1

**Changing the Faiss search parameters:**

In [13]:
pipeline1 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_char,
    searcher=searcher_faiss,
    verbose=True,
    count_voronoi_cells=2,
    type_optimization_searcher='IVF',
)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


In [14]:
pipeline1.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

100%|███████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 661.07it/s]

Top 1Acc = 0.9237668161434978
Top 5Acc = 0.9955156950672646
Top 10Acc = 1.0





{1: 0.9237668161434978, 5: 0.9955156950672646, 10: 1.0}
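
The extra arguments `count_voronoi_cells=2` and `type_optimization_searcher='IVF'` presumably select a Faiss inverted-file (IVF) index. Such an index partitions the vectors into Voronoi cells around centroids and searches only the cells closest to the query, which is faster but approximate. The pure-Python sketch below illustrates the trade-off with fixed centroids and a single probed cell; all function names are hypothetical, and Faiss itself learns the centroids with k-means.

```python
import math


def nearest(point, candidates):
    """Index of the candidate closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def build_cells(vectors, centroids):
    """Group vector ids by their nearest centroid (one list per Voronoi cell)."""
    cells = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        cells[nearest(vec, centroids)].append(idx)
    return cells


def ivf_search(query, vectors, centroids, cells):
    """Search only the cell whose centroid is closest to the query."""
    cell = cells[nearest(query, centroids)]
    return min(cell, key=lambda idx: math.dist(query, vectors[idx]))


vectors = [(0.0, 0.0), (1.0, 0.0), (4.9, 0.0), (9.0, 0.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]  # fixed for the sketch
cells = build_cells(vectors, centroids)

query = (5.2, 0.0)
approx = ivf_search(query, vectors, centroids, cells)  # looks at one cell only
exact = nearest(query, vectors)                        # brute-force answer
print(cells, approx, exact)
```

In this toy case the query falls just across the cell boundary from its true nearest neighbour, so the single-cell search returns a different item than brute force. That is the kind of miss an IVF index accepts in exchange for speed.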

**Word-level analysis:**

In [15]:
pipeline2 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_word,
    searcher=searcher_faiss,
    verbose=True,
)

pipeline2.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


100%|███████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 668.13it/s]

Top 1Acc = 0.18834080717488788
Top 5Acc = 0.4080717488789238
Top 10Acc = 0.5336322869955157





{1: 0.18834080717488788, 5: 0.4080717488789238, 10: 0.5336322869955157}
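
Word-level counts score much lower here than character-level counts. A likely reason is that a single inserted or deleted character turns a word into a different token, while most of its character n-grams stay the same. The sketch below compares the two representations on the augmented string from the validation settings; it uses plain `Counter` objects as a simplified stand-in for `CountVectorizer`, and the helper names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=2):
    """Counts of all character n-grams with n_min <= n <= n_max."""
    return Counter(
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    )


def word_tokens(text):
    """Counts of whitespace-separated tokens."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


clean = 'motorola moto g9 power'
noisy = 'motorpola moto g9 polwer'  # lowercased augmented string

word_sim = cosine(word_tokens(clean), word_tokens(noisy))
char_sim = cosine(char_ngrams(clean), char_ngrams(noisy))
print(round(word_sim, 3), round(char_sim, 3))
```

Only two of the four word tokens survive the noise, so the word-level similarity is 0.5, while the character-level similarity stays noticeably higher.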

**Swapping the searcher in an already trained `Pipeline`:**

In [16]:
pipeline2.change_searcher(searcher_knn, metric='l2')
pipeline2.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

100%|███████████████████████████████████████████████████████████████| 223/223 [00:01<00:00, 158.87it/s]

Top 1Acc = 0.19282511210762332
Top 5Acc = 0.3811659192825112
Top 10Acc = 0.5246636771300448





{1: 0.19282511210762332, 5: 0.3811659192825112, 10: 0.5246636771300448}

### TheFuzzSearch

In [17]:
pipeline3 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    searcher=searcher_fuzz,
    verbose=True,
)

Data preparation for training has begun...
Pipeline ready!


In [18]:
pipeline3.search('apple', 5, ascending=True)

Unnamed: 0,text,similarity
0,Apple iPhone 13 Pro Max,90
1,Apple iPhone 13 Pro,90
2,Apple iPhone 13,90
3,Apple iPhone 13 mini,90
4,Apple iPhone 12 Pro Max,90


In [19]:
pipeline3.validate(validate_augmentation_transforms, accuracy_top, ascending=False)

100%|███████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 362.46it/s]

Top 1Acc = 0.8968609865470852
Top 5Acc = 0.968609865470852
Top 10Acc = 0.968609865470852





{1: 0.8968609865470852, 5: 0.968609865470852, 10: 0.968609865470852}
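
This pipeline has no embedding model: strings are compared directly, and the score is a similarity where higher means closer, which is why validation is called with `ascending=False`. The sketch below shows the same idea with `difflib.SequenceMatcher` from the standard library as a stand-in scorer. It does not reproduce the scores of `thefuzz`, and `fuzzy_score` and `fuzzy_search` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Similarity in [0, 100]; difflib is only a stand-in for thefuzz scorers."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def fuzzy_search(query, items, k=3):
    """Return the k items with the highest similarity to the query."""
    scored = [(item, fuzzy_score(query, item)) for item in items]
    # Higher score means more similar, so sort in descending order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


items = ['Apple iPhone 13', 'Fairphone 4', 'Sony Xperia 1']
results = fuzzy_search('aple iphone 13', items, k=2)
print(results)
```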

### SentenceTransformer-Faiss

In [21]:
augmentation_transforms_seed_none = [
    augmentation.CharAugmentation(
        action='insert',
        unit_prob=1.0,
        min_aug=1,
        max_aug=2,
        mult_num=2,
        seed=None,
    ),
    augmentation.CharAugmentation(
        action='delete',
        unit_prob=1.0,
        min_aug=1,
        max_aug=2,
        mult_num=2,
        seed=None,
    ),
]

model_sentence_transformer = embeddings.SentenceTransformerWrapperEmbedding(
    augmentation_transform=augmentation_transforms_seed_none,
    batch_size=32,
    epochs=6,
    optimizer_params={'lr': 2e-2},
)

In [22]:
pipeline4 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_sentence_transformer,
    searcher=searcher_faiss,
    verbose=True,
)

Data preparation for training has begun...
The training of the model has begun...


Pipeline ready!


In [23]:
pipeline4.search('apple', 5)

Unnamed: 0,text,similarity
0,Apple iPhone X,18.708843
1,Apple iPhone XR,23.733101
2,Sony Xperia 1,26.959209
3,Apple iPhone 8 Plus,27.090759
4,Apple iPhone 11,27.754007


In [24]:
pipeline4.validate(validate_augmentation_transforms, accuracy_top)

100%|████████████████████████████████████████████████████████████████| 223/223 [00:08<00:00, 25.18it/s]

Top 1Acc = 0.8161434977578476
Top 5Acc = 0.9551569506726457
Top 10Acc = 0.9775784753363229





{1: 0.8161434977578476, 5: 0.9551569506726457, 10: 0.9775784753363229}

<a id='5'></a>
## 5. Model ensemble

In [25]:
model_count_vectorizer_ensemble = embeddings.EnsembleWrapperEmbedding(
    models=[model_count_vectorizer_word, model_count_vectorizer_char],
    weights=[1, 1],
)

In [26]:
pipeline5 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_ensemble,
    searcher=searcher_faiss,
    verbose=True,
)

pipeline5.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


100%|███████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 562.92it/s]

Top 1Acc = 0.9147982062780269
Top 5Acc = 0.9955156950672646
Top 10Acc = 1.0





{1: 0.9147982062780269, 5: 0.9955156950672646, 10: 1.0}
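
One common way to combine several embedding models is to normalize each model's vector, scale it by its weight, and concatenate the results into one longer vector. The sketch below shows that approach on toy vectors. It is an assumption made for illustration: the source does not state how `EnsembleWrapperEmbedding` merges its models, and `ensemble_embedding` is a hypothetical helper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def ensemble_embedding(vectors, weights):
    """Concatenate L2-normalized vectors, each scaled by its weight."""
    combined = []
    for vec, weight in zip(vectors, weights):
        combined.extend(weight * x for x in l2_normalize(vec))
    return combined


word_vec = [3.0, 4.0]        # stand-in for a word-level embedding
char_vec = [1.0, 0.0, 0.0]   # stand-in for a char-level embedding
emb = ensemble_embedding([word_vec, char_vec], weights=[1, 1])
print(emb)
```

Normalizing first keeps a model with large raw counts from dominating the distance, so the weights alone control each model's influence.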

<a id='6'></a>
## 6. Fine-tuning

In [27]:
dataset_games.head(3)

Unnamed: 0,target
0,Wii Sports
1,Super Mario Bros.
2,Mario Kart Wii


In [28]:
pipeline5 = pipeline5.fine_tuning(dataset_games.target.values)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


In [29]:
pipeline5.search('mario cart', 5)

Unnamed: 0,text,similarity
0,Mario Party,10.0
1,Mario Kart 7,12.0
2,Mario Kart 8,12.0
3,Mario Party 8,14.0
4,Dr. Mario,14.0


<a id='7'></a>
## 7. Interpreting the results

In [83]:
%%time
df, indices_n_grams = pipeline1.explain(
    compared_text='Donald Trump bought an Apple iPhone 13 Pro Max and called his colleague Vladimir Putin',
    original_text='Apple iPhone 13 Pro Max',
    n_grams=(1, 5),
    analyzer='word',
    sep=' ',
    k=10,
    ascending=True,
)
print(indices_n_grams)

df

[(23, 46), (23, 42), (20, 42), (29, 46), (23, 38), (20, 38), (29, 50), (29, 42), (13, 38), (20, 35)]
CPU times: total: 31.2 ms
Wall time: 14 ms


Unnamed: 0,text,similarity
0,apple iphone 13 pro max,2.220446e-16
1,apple iphone 13 pro,0.05157482
2,an apple iphone 13 pro,0.05880343
3,iphone 13 pro max,0.07892228
4,apple iphone 13,0.1101174
5,an apple iphone 13,0.11604
6,iphone 13 pro max and,0.1172652
7,iphone 13 pro,0.1608151
8,bought an apple iphone 13,0.1633808
9,an apple iphone,0.1735286
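
The printed pairs read as `(start, end)` character offsets into `compared_text`: the first one, `(23, 46)`, covers exactly "Apple iPhone 13 Pro Max". The sketch below reproduces that kind of output by enumerating word n-gram spans and ranking them by distance to the original text. It is a simplified illustration: `word_ngram_spans` and `explain_sketch` are hypothetical, and `difflib` replaces the embedding distance that the pipeline presumably uses.

```python
import re
from difflib import SequenceMatcher


def word_ngram_spans(text, n_min=1, n_max=5):
    """Yield (start, end) character spans of every word n-gram in text."""
    words = [(m.start(), m.end()) for m in re.finditer(r'\S+', text)]
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            yield words[i][0], words[i + n - 1][1]


def explain_sketch(compared_text, original_text, k=3):
    """Rank n-gram spans of compared_text by distance to original_text."""
    def distance(span):
        fragment = compared_text[span[0]:span[1]].lower()
        return 1.0 - SequenceMatcher(None, fragment, original_text.lower()).ratio()
    return sorted(word_ngram_spans(compared_text), key=distance)[:k]


compared = 'Donald Trump bought an Apple iPhone 13 Pro Max'
spans = explain_sketch(compared, 'Apple iPhone 13 Pro Max', k=3)
print(spans, compared[spans[0][0]:spans[0][1]])
```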


In [82]:
%%time
df, indices_n_grams = pipeline1.explain(
    compared_text='mario cart',
    original_text='Mario Party',
    n_grams=(1, 5),
    analyzer='char',
    sep=' ',
    k=10,
    ascending=True,
)
print(indices_n_grams)

df

[(0, 5), (1, 5), (0, 4), (0, 3), (7, 10), (1, 4), (1, 3), (7, 9), (6, 10), (2, 5)]
CPU times: total: 15.6 ms
Wall time: 8 ms


Unnamed: 0,text,similarity
0,mario,0.2
1,ario,0.244071
2,mari,0.244071
3,mar,0.284458
4,art,0.284458
5,ari,0.284458
6,ar,0.30718
7,ar,0.30718
8,cart,0.395257
9,rio,0.463344
