# API Usage Example

* [1. Importing modules](#1)
* [2. Pipeline settings](#2)
* [3. Validation settings](#3)
* [4. Training the Pipeline and getting results](#4)
* [5. Fine-tuning](#5)
* [6. Interpreting results](#6)

In [1]:
%load_ext autoreload
%autoreload 2

import os

In [2]:
SEED = 1

In [3]:
# Move up to the project root unless the 'notebooks' folder is already visible here
if 'notebooks' not in os.listdir():
    os.chdir('..')
    print(os.getcwd())

F:\study\recs


<a id='1'></a>
## 1. Importing modules

In [4]:
from recs_searcher import (
    dataset,  # sample datasets
    preprocessing,  # text preprocessing
    embeddings,  # converting text to embeddings
    similarity_search,  # fast search engines over the embedding space
    augmentation,  # text augmentation for validating pipelines
    explain,  # interpreting the similarity of two texts
)
from recs_searcher import api  # Pipeline

<a id='2'></a>
## 2. Pipeline settings

In [5]:
dataset_phones = dataset.load_mobile_phones()
dataset_games = dataset.load_video_games()

SPACY_MODEL_NAME = 'en_core_web_md'
preprocessing_list = [
    preprocessing.TextLower(),
    preprocessing.RemovePunct(),
    # preprocessing.RemoveNumber(),
    preprocessing.RemoveWhitespace(),
    # preprocessing.RemoveHTML(),
    # preprocessing.RemoveURL(),
    # preprocessing.RemoveEmoji(),

    # preprocessing.RemoveStopwordsSpacy(spacy_model_name=SPACY_MODEL_NAME),
    # preprocessing.LemmatizeSpacy(spacy_model_name=SPACY_MODEL_NAME),
]

model_count_vectorizer_word = embeddings.CountVectorizerWrapperEmbedding(
    analyzer='word',
    ngram_range=(1, 1),
)
model_count_vectorizer_char = embeddings.CountVectorizerWrapperEmbedding(
    analyzer='char',
    ngram_range=(1, 1),
)

model_count_vectorizer_ensemble = embeddings.EnsembleWrapperEmbedding(
    models=[model_count_vectorizer_word, model_count_vectorizer_char],
    weights=[1, 1],
)

searcher_faiss = similarity_search.FaissSearch
searcher_chroma = similarity_search.ChromaDBSearch
searcher_knn = similarity_search.NearestNeighborsSearch
searcher_fuzz = similarity_search.TheFuzzSearch

<a id='3'></a>
## 3. Validation settings

In [6]:
LANGUAGE = 'english'
validate_augmentation_transforms = [
    augmentation.ChangeSyms(p=0.013, language=LANGUAGE, change_only_alpha=True, seed=SEED),
    augmentation.DeleteSyms(p=0.013, delete_only_alpha=True, seed=SEED),
    augmentation.AddSyms(p=0.013, language=LANGUAGE, seed=SEED),
    augmentation.MultiplySyms(p=0.013, count_multiply=2, multiply_only_alpha=True, seed=SEED),
    augmentation.SwapSyms(p=0.013, seed=SEED),
    augmentation.ChangeSyms(p=0.013, language=LANGUAGE, change_only_alpha=True, seed=SEED),
    augmentation.ChangeSyms(p=0.013, language=LANGUAGE, change_only_alpha=True, seed=SEED),
]
accuracy_top = [1, 5, 10]
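
The transforms above inject character-level noise with probability `p` and a fixed `seed`. As a rough illustration of that idea (not the library's implementation), here is a minimal sketch of a seeded deletion transform. The function name `delete_chars` and its exact behaviour are assumptions made for this example; `augmentation.DeleteSyms` may work differently.

```python
import random
from typing import Optional


def delete_chars(text: str, p: float, seed: Optional[int] = None) -> str:
    """Hypothetical helper: drop each alphabetic character independently with probability p."""
    rng = random.Random(seed)  # a fixed seed makes the noise reproducible
    return "".join(
        ch for ch in text
        # Non-alphabetic characters (digits, spaces) are always kept.
        if not (ch.isalpha() and rng.random() < p)
    )


# With the same seed, repeated calls produce the same noisy string.
noisy_a = delete_chars("Apple iPhone 13 Pro Max", p=0.3, seed=1)
noisy_b = delete_chars("Apple iPhone 13 Pro Max", p=0.3, seed=1)
print(noisy_a == noisy_b)
```

A fixed seed keeps validation runs comparable across pipelines, while `seed=None` gives fresh noise on every call.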

<a id='4'></a>
## 4. Training the Pipeline and getting results

### CountVectorizer-Faiss

**Analyzing characters:**

In [7]:
pipeline1 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_char,
    searcher=searcher_faiss,
    verbose=True,
)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


In [8]:
%%time
pipeline1.search('phone', 5, ascending=True)

CPU times: total: 0 ns
Wall time: 2 ms


|   | text | similarity |
|---|------|------------|
| 0 | OnePlus 9 | 6.0 |
| 1 | OnePlus 6 | 6.0 |
| 2 | Fairphone 3 | 6.0 |
| 3 | OnePlus 8 | 6.0 |
| 4 | Fairphone 4 | 6.0 |
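
The `similarity` value of 6.0 behaves like a distance, which is why the search uses `ascending=True`. The sketch below rests on two assumptions that the library does not confirm here: that the searcher reports squared Euclidean distance over per-character counts, and that preprocessing lowercases the text while keeping single spaces. Under those assumptions the numbers above are reproduced. `char_counts` and `squared_l2` are hypothetical helper names.

```python
from collections import Counter


def char_counts(text: str) -> Counter:
    # Lowercase and count every character, spaces and digits included.
    return Counter(text.lower())


def squared_l2(a: Counter, b: Counter) -> int:
    # Sum of squared count differences over all characters seen in either text.
    return sum((a[ch] - b[ch]) ** 2 for ch in set(a) | set(b))


query = char_counts("phone")
for name in ["OnePlus 9", "Fairphone 3"]:
    print(name, squared_l2(query, char_counts(name)))
# OnePlus 9 6
# Fairphone 3 6
```

With single-character counts, letter order is ignored, so "OnePlus 9" and "Fairphone 3" end up equally far from "phone".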


In [9]:
pipeline1.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

100%|█████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 1115.01it/s]

Top 1Acc = 0.9551569506726457
Top 5Acc = 1.0
Top 10Acc = 1.0





{1: 0.9551569506726457, 5: 1.0, 10: 1.0}
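
The returned dictionary maps each `k` from `accuracy_top` to a top-k accuracy. A minimal sketch of that metric follows, assuming validation searches with a noisy copy of each item and counts a hit when the original item appears among the first `k` results. The function `top_k_accuracy` and the toy data are illustrative only and are not part of `recs_searcher`.

```python
from typing import Dict, List, Sequence


def top_k_accuracy(
    ranked_results: Sequence[List[str]],
    expected: Sequence[str],
    ks: Sequence[int],
) -> Dict[int, float]:
    """ranked_results[i] is the ranked result list for query i; expected[i] is its correct item."""
    return {
        k: sum(exp in ranked[:k] for ranked, exp in zip(ranked_results, expected)) / len(expected)
        for k in ks
    }


# Toy example: the correct item "a" is ranked 1st, 2nd and 3rd respectively.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
scores = top_k_accuracy(ranked, expected=["a", "a", "a"], ks=[1, 2, 3])
print(scores)
```

Accuracy can only stay the same or grow as `k` increases, which matches the pattern in the output above.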

In [10]:
# pipeline1.save(path_folder_save='pipelines', filename='tmp.pkl')

In [11]:
# pipeline1 = api.load_pipeline(path_to_filename='./pipelines/tmp.pkl')
# pipeline1

**Analyzing words:**

In [12]:
pipeline2 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_word,
    searcher=searcher_faiss,
    verbose=True,
)

pipeline2.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


100%|█████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 1262.89it/s]

Top 1Acc = 0.6143497757847534
Top 5Acc = 0.8026905829596412
Top 10Acc = 0.8699551569506726





{1: 0.6143497757847534, 5: 0.8026905829596412, 10: 0.8699551569506726}
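
One plausible reason the word-level model scores lower under character noise: a misspelled word falls outside the fitted vocabulary and is dropped, whereas character counts change only slightly. The sketch below illustrates this on a toy two-item corpus that is not taken from the dataset. All helper names are hypothetical, and squared Euclidean distance is assumed as the ranking measure.

```python
from collections import Counter

corpus = ["apple watch", "samsung watch"]  # toy data, not from the dataset
query = "aple watch"  # "apple" with one character deleted

# Vocabulary fitted on the corpus only, as a bag-of-words model would do.
word_vocab = {w for text in corpus for w in text.split()}


def word_counts(text: str) -> Counter:
    # Words outside the fitted vocabulary are ignored.
    return Counter(w for w in text.split() if w in word_vocab)


def char_counts(text: str) -> Counter:
    return Counter(text)


def squared_l2(a: Counter, b: Counter) -> int:
    return sum((a[k] - b[k]) ** 2 for k in set(a) | set(b))


word_dists = [squared_l2(word_counts(query), word_counts(t)) for t in corpus]
char_dists = [squared_l2(char_counts(query), char_counts(t)) for t in corpus]
print(word_dists)  # [1, 1]: a tie, the typo erased the distinguishing word
print(char_dists)  # [1, 11]: "apple watch" is clearly the nearest item
```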

**Analyzing both words and characters:**

In [13]:
pipeline3 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_count_vectorizer_ensemble,
    searcher=searcher_faiss,
    verbose=True,
)

pipeline3.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

Data preparation for training has begun...
The training of the model has begun...
Pipeline ready!


100%|██████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 957.08it/s]

Top 1Acc = 0.9641255605381166
Top 5Acc = 1.0
Top 10Acc = 1.0





{1: 0.9641255605381166, 5: 1.0, 10: 1.0}

**Changing the searcher in an already trained `Pipeline`:**

In [14]:
pipeline3.change_searcher(searcher_knn, metric='l2')
pipeline3.validate(validate_augmentation_transforms, accuracy_top, ascending=True)

Found Intel OpenMP ('libiomp') and LLVM OpenMP ('libomp') loaded at
the same time. Both libraries are known to be incompatible and this
can cause random crashes or deadlocks on Linux when loaded in the
same Python program.
Using threadpoolctl may cause crashes or deadlocks. For more
information and possible workarounds, please see
    https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md

100%|██████████████████████████████████████████████████████████████████████| 223/223 [00:01<00:00, 145.89it/s]

Top 1Acc = 0.9730941704035875
Top 5Acc = 1.0
Top 10Acc = 1.0





{1: 0.9730941704035875, 5: 1.0, 10: 1.0}

### TheFuzzSearch

In [15]:
pipeline4 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    searcher=searcher_fuzz,
    verbose=True,
)

Data preparation for training has begun...
Pipeline ready!


In [16]:
pipeline4.search('apple', 5, ascending=True)

|   | text | similarity |
|---|------|------------|
| 0 | Apple iPhone 13 Pro Max | 90 |
| 1 | Apple iPhone 13 Pro | 90 |
| 2 | Apple iPhone 13 | 90 |
| 3 | Apple iPhone 13 mini | 90 |
| 4 | Apple iPhone 12 Pro Max | 90 |


In [17]:
pipeline4.validate(validate_augmentation_transforms, accuracy_top, ascending=False)

100%|██████████████████████████████████████████████████████████████████████| 223/223 [00:00<00:00, 452.34it/s]

Top 1Acc = 0.9237668161434978
Top 5Acc = 0.9551569506726457
Top 10Acc = 0.968609865470852





{1: 0.9237668161434978, 5: 0.9551569506726457, 10: 0.968609865470852}

### SentenceTransformer-Faiss

In [18]:
augmentation_transforms_seed_none = [
    augmentation.ChangeSyms(p=0.013, language=LANGUAGE, change_only_alpha=True, seed=None),
    augmentation.DeleteSyms(p=0.013, delete_only_alpha=True, seed=None),
    augmentation.AddSyms(p=0.013, language=LANGUAGE, seed=None),
    augmentation.MultiplySyms(p=0.013, count_multiply=2, multiply_only_alpha=True, seed=None),
    augmentation.SwapSyms(p=0.013, seed=None),
    augmentation.ChangeSyms(p=0.013, language=LANGUAGE, change_only_alpha=True, seed=None),
    augmentation.ChangeSyms(p=0.013, language=LANGUAGE, change_only_alpha=True, seed=None),
]

model_sentence_transformer = embeddings.SentenceTransformerWrapperEmbedding(
    augmentation_transform=augmentation_transforms_seed_none,
    batch_size=32,
    epochs=1,
    optimizer_params={'lr': 2e-2},
)

In [19]:
pipeline5 = api.Pipeline(
    dataset=dataset_phones.target.values,
    preprocessing=preprocessing_list,
    model=model_sentence_transformer,
    searcher=searcher_faiss,
    verbose=True,
)

Data preparation for training has begun...
The training of the model has begun...


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/7 [00:00<?, ?it/s]

Pipeline ready!


In [20]:
pipeline5.search('apple', 5)

|   | text | similarity |
|---|------|------------|
| 0 | Apple iPhone X | 14.970808 |
| 1 | Apple iPhone 11 | 19.092113 |
| 2 | Apple iPhone XR | 19.112835 |
| 3 | Apple iPhone 12 | 19.560452 |
| 4 | Apple iPhone 13 | 20.332455 |


In [21]:
pipeline5.validate(validate_augmentation_transforms, accuracy_top)

100%|███████████████████████████████████████████████████████████████████████| 223/223 [00:09<00:00, 24.62it/s]

Top 1Acc = 0.6816143497757847
Top 5Acc = 0.8116591928251121
Top 10Acc = 0.852017937219731





{1: 0.6816143497757847, 5: 0.8116591928251121, 10: 0.852017937219731}

<a id='5'></a>
## 5. Fine-tuning

In [22]:
dataset_games.head(3)

|   | target |
|---|--------|
| 0 | Wii Sports |
| 1 | Super Mario Bros. |
| 2 | Mario Kart Wii |


In [23]:
pipeline4 = pipeline4.fine_tuning(dataset_games.target.values)

Data preparation for training has begun...
Pipeline ready!


In [24]:
pipeline4.search('mario', 5)

|   | text | similarity |
|---|------|------------|
| 0 | Super Mario Bros. | 90 |
| 1 | Mario Kart Wii | 90 |
| 2 | New Super Mario Bros. | 90 |
| 3 | New Super Mario Bros. Wii | 90 |
| 4 | Mario Kart DS | 90 |


<a id='6'></a>
## 6. Interpreting results

In [25]:
dist_explain = explain.DistanceExplain(pipeline5._model, distance='euclidean')

In [26]:
dist_explain.explain(
    compared_text='Donald Trump bought an Apple iPhone 13 Pro Max and called his colleague Vladimir Putin',
    original_text='Apple iPhone 13 Pro Max',
    n_grams=2,
)

|   | text | similarity |
|---|------|------------|
| 0 | iPhone 13 | 3.747535 |
| 1 | Pro Max | 4.74533 |
| 2 | 13 Pro | 4.813129 |
| 3 | Apple iPhone | 4.874532 |
| 4 | Max and | 5.26999 |
| 5 | an Apple | 6.075725 |
| 6 | bought an | 6.697048 |
| 7 | and called | 6.907519 |
| 8 | called his | 7.042148 |
| 9 | his colleague | 7.194981 |
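
The rows above are two-word fragments of `compared_text`, which suggests that the text is split into sliding word n-grams before each fragment is compared with `original_text`. The internals of `DistanceExplain` are not shown here, so the following is only a sketch of the splitting step under that assumption; `word_ngrams` is a hypothetical helper.

```python
from typing import List


def word_ngrams(text: str, n: int) -> List[str]:
    """Return all runs of n consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


compared_text = (
    "Donald Trump bought an Apple iPhone 13 Pro Max "
    "and called his colleague Vladimir Putin"
)
bigrams = word_ngrams(compared_text, n=2)
print(len(bigrams))  # 14 bigrams from 15 words
print(bigrams[4:8])  # ['Apple iPhone', 'iPhone 13', '13 Pro', 'Pro Max']
```

Each fragment would then be embedded and ranked by its distance to the original text, so the fragments closest to "Apple iPhone 13 Pro Max" appear first.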
