# API usage example

* [1. Importing modules](#1)
* [2. Pipeline settings](#2)
* [3. Validation settings](#3)
* [4. Training the Pipeline and getting results](#4)
* [5. Model ensemble](#5)
* [6. Fine-tuning](#6)
* [7. Interpreting the results](#7)

In [1]:
%load_ext autoreload
%autoreload 2

import os

In [2]:
SEED = 1

In [3]:
# Work from the project root (the folder that contains `notebooks`)
if 'notebooks' not in os.listdir():
    os.chdir('..')
    print(os.getcwd())

F:\study\recs


<a id='1'></a>
## 1. Importing modules

In [4]:
from recs_searcher import (
    dataset,  # sample datasets
    preprocessing,  # text preprocessing
    embeddings,  # text-to-embedding models
    similarity_search,  # fast search engines over the embedding space
    augmentation,  # text augmentation for validating pipelines
    explain,  # interpretation of the similarity between two texts
)
from recs_searcher import api  # the Pipeline

<a id='2'></a>
## 2. Pipeline settings

In [5]:
dataset_phones = dataset.load_mobile_phones()
dataset_games = dataset.load_video_games()

SPACY_MODEL_NAME = 'en_core_web_md'
preprocessing_list = [
    preprocessing.TextLower(),
    preprocessing.RemovePunct(),
    # preprocessing.RemoveNumber(),
    preprocessing.RemoveWhitespace(),
    # preprocessing.RemoveHTML(),
    # preprocessing.RemoveURL(),
    # preprocessing.RemoveEmoji(),

    # preprocessing.RemoveStopwordsSpacy(spacy_model_name=SPACY_MODEL_NAME),
    # preprocessing.LemmatizeSpacy(spacy_model_name=SPACY_MODEL_NAME),
]

model_count_vectorizer_word = embeddings.CountVectorizerWrapperEmbedding(
    analyzer='word',
    ngram_range=(1, 1),
)
model_count_vectorizer_char = embeddings.CountVectorizerWrapperEmbedding(
    analyzer='char',
    ngram_range=(1, 2),
)

searcher_faiss = similarity_search.FaissSearch
searcher_chroma = similarity_search.ChromaDBSearch
searcher_knn = similarity_search.NearestNeighborsSearch
searcher_fuzz = similarity_search.TheFuzzSearch

<a id='3'></a>
## 3. Validation settings

In [6]:
validate_augmentation_transforms = [
    augmentation.AugmentexCharWrapperAugmentation(
        action='insert',
        unit_prob=1.0,
        min_aug=1,
        max_aug=2,
        mult_num=2,
        seed=SEED,
        lang="eng",
        platform="pc",
    ),
    augmentation.AugmentexCharWrapperAugmentation(
        action='delete',
        unit_prob=1.0,
        min_aug=1,
        max_aug=2,
        mult_num=2,
        seed=SEED,
        lang="eng",
        platform="pc",
    ),
]
accuracy_top = [1, 5, 10]

In [7]:
validate_augmentation_transforms[0].transform(['Motorola Moto G9 Power'])

['Motorfola Moto G9 Polwer']
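
The transform above adds character-level noise, here two inserted letters. A minimal standard-library sketch of the same idea follows. It is not the Augmentex implementation, and the helper names `insert_chars` and `delete_chars` are hypothetical.

```python
import random
import string


def insert_chars(text: str, n: int, rng: random.Random) -> str:
    """Insert n random lowercase letters at random positions."""
    chars = list(text)
    for _ in range(n):
        pos = rng.randint(0, len(chars))  # any gap, including both ends
        chars.insert(pos, rng.choice(string.ascii_lowercase))
    return "".join(chars)


def delete_chars(text: str, n: int, rng: random.Random) -> str:
    """Delete n characters at random positions, always keeping at least one."""
    chars = list(text)
    for _ in range(min(n, len(chars) - 1)):
        del chars[rng.randrange(len(chars))]
    return "".join(chars)


rng = random.Random(1)  # a fixed seed makes the noise reproducible
original = "Motorola Moto G9 Power"
noisy_insert = insert_chars(original, 2, rng)
noisy_delete = delete_chars(original, 2, rng)
print(noisy_insert)
print(noisy_delete)
```

Passing a seeded generator plays the same role as `seed=SEED` in the cell above: the same noisy strings are produced on every run, so validation scores stay comparable.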

<a id='4'></a>
## 4. Training the Pipeline and getting results

### CountVectorizer-Faiss

**Analyzing characters:**

In [8]:
pipeline1 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_char,
    searcher=searcher_faiss,
    verbose=True,
)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


In [9]:
%%time
pipeline1.search('phone', 5, ascending=True)

CPU times: total: 0 ns
Wall time: 995 µs


|   | text | similarity |
|---|---|---|
| 0 | Fairphone 3 | 12.0 |
| 1 | Fairphone 4 | 12.0 |
| 2 | OnePlus 9 | 14.0 |
| 3 | OnePlus 6 | 14.0 |
| 4 | OnePlus 8 | 14.0 |
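
The values in the `similarity` column are consistent with a squared Euclidean distance between character 1-gram and 2-gram count vectors, which is why lower is better and `ascending=True` is passed. The sketch below recomputes two of them with the standard library only. The helpers `char_ngram_counts` and `squared_l2` are hypothetical, and the sketch assumes that every n-gram of the query is present in the fitted vocabulary.

```python
from collections import Counter


def char_ngram_counts(text: str, n_min: int = 1, n_max: int = 2) -> Counter:
    """Count all character n-grams of length n_min..n_max (spaces included)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def squared_l2(a: Counter, b: Counter) -> int:
    """Squared Euclidean distance between two sparse count vectors."""
    return sum((a[key] - b[key]) ** 2 for key in set(a) | set(b))


query = char_ngram_counts("phone")
for title in ["fairphone 3", "oneplus 9"]:  # lowercased, as after TextLower
    print(title, squared_l2(query, char_ngram_counts(title)))
```

This prints 12 for `fairphone 3` and 14 for `oneplus 9`, the same numbers as in the search output.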


In [10]:
pipeline1.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

100%|█████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 1225.34it/s]

Top 1Acc = 0.905829596412556
Top 5Acc = 0.9955156950672646
Top 10Acc = 1.0





{1: 0.905829596412556, 5: 0.9955156950672646, 10: 1.0}
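
The returned dictionary maps each `k` from `accuracy_top` to a top-k accuracy. A minimal sketch of that metric, assuming it is the share of noisy queries whose original string appears among the first `k` search results; the function `top_k_accuracy` and the toy data are hypothetical, not the library's code.

```python
from typing import Dict, Sequence


def top_k_accuracy(
    rankings: Sequence[Sequence[str]],
    targets: Sequence[str],
    ks: Sequence[int],
) -> Dict[int, float]:
    """Share of queries whose true item appears among the first k results."""
    hits = {k: 0 for k in ks}
    for ranked, target in zip(rankings, targets):
        for k in ks:
            if target in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(targets) for k in ks}


# Toy data: each row is the ranked search output for one noisy query.
rankings = [
    ["fairphone 3", "fairphone 4", "oneplus 9"],
    ["oneplus 6", "oneplus 9", "oneplus 8"],
]
targets = ["fairphone 3", "oneplus 9"]
scores = top_k_accuracy(rankings, targets, ks=[1, 2])
print(scores)
```

Under this definition the scores can only grow with `k`, which matches the outputs in this notebook.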

In [11]:
# pipeline1.save(path_folder_save='pipelines', filename='tmp.pkl')

In [12]:
# pipeline1 = api.load_pipeline(path_to_filename='./pipelines/tmp.pkl')
# pipeline1

**Changing the Faiss search parameters:**

In [13]:
pipeline1 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_char,
    searcher=searcher_faiss,
    verbose=True,
    count_voronoi_cells=2,
    type_optimization_searcher='IVF',
)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!
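
An inverted-file (IVF) index splits the vectors into Voronoi cells around centroids and scans only the cells closest to the query. The sketch below illustrates that idea in plain Python. It assumes `count_voronoi_cells` corresponds to the number of cells; the centroids are fixed by hand here, whereas Faiss learns them with k-means, and the names `build_ivf` and `ivf_search` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors: List[Vector], centroids: List[Vector]) -> List[List[int]]:
    """Assign every vector id to the cell of its nearest centroid."""
    cells: List[List[int]] = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        cells[nearest].append(idx)
    return cells


def ivf_search(query, vectors, centroids, cells, k=1, nprobe=1) -> List[Tuple[float, int]]:
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    return sorted((sq_dist(query, vectors[i]), i) for i in candidates)[:k]


vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
centroids = [(0.0, 0.5), (5.5, 5.0)]  # two cells
cells = build_ivf(vectors, centroids)
print(cells)
print(ivf_search((5.0, 4.0), vectors, centroids, cells, k=1, nprobe=1))
```

Scanning fewer cells is faster but approximate: a true neighbour that sits in an unvisited cell is missed. Probing every cell gives the same answer as an exhaustive search.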


In [14]:
pipeline1.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

100%|█████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 1037.22it/s]

Top 1Acc = 0.905829596412556
Top 5Acc = 0.9955156950672646
Top 10Acc = 1.0





{1: 0.905829596412556, 5: 0.9955156950672646, 10: 1.0}

**Analyzing words:**

In [15]:
pipeline2 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_word,
    searcher=searcher_faiss,
    verbose=True,
)

pipeline2.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


100%|█████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 1186.27it/s]

Top 1Acc = 0.17040358744394618
Top 5Acc = 0.3901345291479821
Top 10Acc = 0.5112107623318386





{1: 0.17040358744394618, 5: 0.3901345291479821, 10: 0.5112107623318386}
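
One plausible reading of the much lower word-level scores: a single typo turns a word into a different token, while most of its character n-grams survive. The sketch below compares both kinds of overlap on the augmented example from section 3. It uses only the standard library, and Jaccard overlap is an illustration chosen here, not the metric the pipeline uses.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


def char_ngrams(text: str, n_max: int = 2) -> set:
    """Set of character n-grams of length 1..n_max."""
    return {text[i:i + n] for n in range(1, n_max + 1) for i in range(len(text) - n + 1)}


clean = "motorola moto g9 power"
noisy = "motorfola moto g9 polwer"  # augmented string from section 3, lowercased

word_overlap = jaccard(set(clean.split()), set(noisy.split()))
char_overlap = jaccard(char_ngrams(clean), char_ngrams(noisy))
print(round(word_overlap, 2), round(char_overlap, 2))
```

Only two of the six distinct words are shared, so the word overlap is one third, while the character-level overlap stays far higher.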

**Swapping the searcher in an already trained `Pipeline`:**

In [16]:
pipeline2.change_searcher(searcher_knn, metric='l2')
pipeline2.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

100%|██████████████████████████████████████████████████████████████████████| 223/223 [00:01<00:00, 143.16it/s]

Top 1Acc = 0.17488789237668162
Top 5Acc = 0.36771300448430494
Top 10Acc = 0.5067264573991032





{1: 0.17488789237668162, 5: 0.36771300448430494, 10: 0.5067264573991032}

### TheFuzzSearch

In [17]:
pipeline3 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    searcher=searcher_fuzz,
    verbose=True,
)

Data preparation for training has begun...
Pipeline ready!


In [18]:
pipeline3.search('apple', 5, ascending=True)

|   | text | similarity |
|---|---|---|
| 0 | Apple iPhone 13 Pro Max | 90 |
| 1 | Apple iPhone 13 Pro | 90 |
| 2 | Apple iPhone 13 | 90 |
| 3 | Apple iPhone 13 mini | 90 |
| 4 | Apple iPhone 12 Pro Max | 90 |
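
Every result scores exactly 90. That would be expected if the searcher relies on a weighted ratio in the style of TheFuzz's `WRatio`, which multiplies a perfect partial match by 0.9 when one string is 1.5 to 8 times longer than the other. Whether `TheFuzzSearch` uses that scorer is an assumption. The sketch below is a simplified stand-in built on `difflib`; it omits the token-based terms of the real scorer, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Similarity of two strings on a 0-100 scale."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def partial_ratio(short: str, long: str) -> float:
    """Best ratio of `short` against any equally long window of `long`."""
    width = len(short)
    windows = (long[i:i + width] for i in range(len(long) - width + 1))
    return max(ratio(short, window) for window in windows)


def weighted_score(query: str, candidate: str) -> int:
    """Simplified weighted score: full ratio vs. down-weighted partial ratio."""
    short, long = sorted((query, candidate), key=len)
    len_ratio = len(long) / len(short)
    if len_ratio < 1.5:  # similar lengths: compare the whole strings only
        return round(ratio(short, long))
    scale = 0.6 if len_ratio > 8 else 0.9  # partial matches count for less
    return round(max(ratio(short, long), scale * partial_ratio(short, long)))


print(weighted_score("apple", "apple iphone 13 pro max"))
print(weighted_score("apple", "apple iphone 13"))
```

Both calls print 90. Unlike the distance-based searchers, a higher score means a closer match here, which is why the validation call below passes `ascending=False`.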


In [19]:
pipeline3.validate(validate_augmentation_transforms, accuracy_top, ascending=False)

100%|██████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 379.25it/s]

Top 1Acc = 0.9013452914798207
Top 5Acc = 0.968609865470852
Top 10Acc = 0.968609865470852





{1: 0.9013452914798207, 5: 0.968609865470852, 10: 0.968609865470852}

### SentenceTransformer-Faiss

In [20]:
augmentation_transforms_seed_none = [
    augmentation.AugmentexCharWrapperAugmentation(
        action='insert',
        unit_prob=1.0,
        min_aug=1,
        max_aug=2,
        mult_num=2,
        seed=None,
        lang="eng",
        platform="pc",
    ),
    augmentation.AugmentexCharWrapperAugmentation(
        action='delete',
        unit_prob=1.0,
        min_aug=1,
        max_aug=2,
        mult_num=2,
        seed=None,
        lang="eng",
        platform="pc",
    ),
]

model_sentence_transformer = embeddings.SentenceTransformerWrapperEmbedding(
    augmentation_transform=augmentation_transforms_seed_none,
    batch_size=32,
    epochs=6,
    optimizer_params={'lr': 2e-2},
)

In [21]:
pipeline4 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_sentence_transformer,
    searcher=searcher_faiss,
    verbose=True,
)

Data preparation for training has begun...
The training of the model has begun...


Epoch:   0%|          | 0/6 [00:00<?, ?it/s]

Iteration:   0%|          | 0/7 [00:00<?, ?it/s]

Pipeline ready!


In [22]:
pipeline4.search('apple', 5)

|   | text | similarity |
|---|---|---|
| 0 | Apple iPhone X | 13.169312 |
| 1 | Apple iPhone 8 Plus | 21.750168 |
| 2 | Apple iPhone XR | 21.924805 |
| 3 | Apple iPhone 11 | 22.218494 |
| 4 | Apple iPhone 6 | 22.738293 |


In [23]:
pipeline4.validate(validate_augmentation_transforms, accuracy_top)

100%|███████████████████████████████████████████████████████████████████████| 223/223 [00:08<00:00, 24.90it/s]

Top 1Acc = 0.7219730941704036
Top 5Acc = 0.9372197309417041
Top 10Acc = 0.9775784753363229





{1: 0.7219730941704036, 5: 0.9372197309417041, 10: 0.9775784753363229}

<a id='5'></a>
## 5. Model ensemble

In [24]:
model_count_vectorizer_ensemble = embeddings.EnsembleWrapperEmbedding(
    models=[model_count_vectorizer_word, model_count_vectorizer_char],
    weights=[1, 1],
)
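
A sketch of what such an ensemble may compute, assuming it scales each model's vector by its weight and concatenates the results. That combination rule is an assumption, although it is consistent with the distances reported in section 6. The helper names are hypothetical, and a plain whitespace split stands in for the word tokenizer.

```python
from collections import Counter


def word_counts(text: str) -> Counter:
    """Counts of whitespace-separated words."""
    return Counter(text.split())


def char_counts(text: str) -> Counter:
    """Counts of character 1- and 2-grams."""
    return Counter(text[i:i + n] for n in (1, 2) for i in range(len(text) - n + 1))


def ensemble(text, models, weights) -> Counter:
    """Weighted concatenation: keys carry the model index to keep blocks apart."""
    combined = Counter()
    for idx, (model, weight) in enumerate(zip(models, weights)):
        for key, value in model(text).items():
            combined[(idx, key)] = weight * value
    return combined


def squared_l2(a: Counter, b: Counter) -> float:
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a[k] - b[k]) ** 2 for k in set(a) | set(b))


models, weights = [word_counts, char_counts], [1, 1]
a = ensemble("mario cart", models, weights)
b = ensemble("mario party", models, weights)
print(squared_l2(a, b))
```

This prints 10 (2 from the word block plus 8 from the character block), the same value that section 6 reports for `Mario Party` when searching for `mario cart`.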

In [25]:
pipeline5 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_ensemble,
    searcher=searcher_faiss,
    verbose=True,
)

pipeline5.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


100%|██████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 861.02it/s]

Top 1Acc = 0.8834080717488789
Top 5Acc = 0.9955156950672646
Top 10Acc = 1.0





{1: 0.8834080717488789, 5: 0.9955156950672646, 10: 1.0}

<a id='6'></a>
## 6. Fine-tuning

In [26]:
dataset_games.head(3)

|   | target |
|---|---|
| 0 | Wii Sports |
| 1 | Super Mario Bros. |
| 2 | Mario Kart Wii |


In [27]:
pipeline5 = pipeline5.fine_tuning(dataset_games.target.values)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


In [28]:
pipeline5.search('mario cart', 5)

|   | text | similarity |
|---|---|---|
| 0 | Mario Party | 10.0 |
| 1 | Mario Kart 7 | 12.0 |
| 2 | Mario Kart 8 | 12.0 |
| 3 | Mario Party 8 | 14.0 |
| 4 | Dr. Mario | 14.0 |


<a id='7'></a>
## 7. Interpreting the results

In [29]:
dist_explain = explain.DistanceExplain(
    pipeline5.get_model(),
    preprocessing=pipeline5.get_preprocessing(),
    distance='euclidean'
)

In [30]:
dist_explain.explain(
    compared_text='Donald Trump bought an Apple iPhone 13 Pro Max and called his colleague Vladimir Putin',
    original_text='Apple iPhone 13 Pro Max',
    n_grams=(1, 3),
)

|   | text | similarity |
|---|---|---|
| 0 | apple iphone 13 | 4.472136 |
| 1 | iphone 13 pro | 5.291503 |
| 2 | an apple iphone | 5.477226 |
| 3 | apple iphone | 5.567764 |
| 4 | 13 pro max | 6.324555 |
| 5 | iphone 13 | 6.708204 |
| 6 | pro max and | 7.071068 |
| 7 | pro max | 7.141428 |
| 8 | an apple | 7.28011 |
| 9 | 13 pro | 7.416198 |


In [31]:
dist_explain.explain(
    compared_text='mario cart',
    original_text='Mario Party',
    n_grams=1,
)

|   | text | similarity |
|---|---|---|
| 0 | mario | 3.605551 |
| 1 | cart | 4.582576 |
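
The output ranks fragments of `compared_text` by how close each one is to `original_text`. A minimal sketch of that procedure follows, assuming the explainer slides word n-grams over the compared text, embeds each fragment with the pipeline model, and sorts by Euclidean distance. The `embed` function is a hypothetical stand-in for the word plus character count ensemble, not the library's code.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Word counts plus character 1- and 2-gram counts in one sparse vector."""
    vec = Counter(("w", w) for w in text.split())
    vec.update(("c", text[i:i + n]) for n in (1, 2) for i in range(len(text) - n + 1))
    return vec


def euclidean(a: Counter, b: Counter) -> float:
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in set(a) | set(b)))


def explain(compared_text: str, original_text: str, n_min: int = 1, n_max: int = 1):
    """Rank word n-grams of compared_text by distance to original_text."""
    words = compared_text.lower().split()
    target = embed(original_text.lower())
    rows = []
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            fragment = " ".join(words[i:i + n])
            rows.append((fragment, round(euclidean(embed(fragment), target), 6)))
    return sorted(rows, key=lambda row: row[1])


print(explain("mario cart", "Mario Party"))
```

For this pair the sketch returns `mario` at 3.605551 and `cart` at 4.582576, the square roots of 13 and 21, matching the table above.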
