Commit 1691006 (1 parent: 39e336a)
Showing 20 changed files with 561 additions and 18 deletions.
@@ -0,0 +1,53 @@
name: docs_pages_workflow

# Run this workflow automatically on every push to main or dev
on:
  push:
    branches: [ main, dev ]

jobs:

  build_docs_job:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v3
        with:
          python-version: 3.11.5

      - name: Install dependencies
        run: |
          python -m pip install -U sphinx
          python -m pip install sphinx-rtd-theme
          # python -m pip install sphinxcontrib-apidoc
          python -m pip install sphinx-autoapi
      - name: make the sphinx docs
        run: |
          make -C docs clean
          # sphinx-apidoc -f -o docs/source . -H Test -e -t docs/source/_templates
          make -C docs html
      - name: Init new repo in dist folder and commit generated files
        run: |
          cd docs/build/html/
          git init
          touch .nojekyll
          git add -A
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git commit -m 'deploy'
      - name: Force push to destination branch
        uses: ad-m/github-push-action@v0.5.0
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          branch: gh-pages
          force: true
          directory: ./docs/build/html
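
The deploy step in this workflow creates an empty `.nojekyll` file before committing the generated HTML. GitHub Pages runs Jekyll by default, and Jekyll skips files and directories whose names start with an underscore, which would hide Sphinx output directories such as `_static`. The sketch below reproduces that one step in Python for local experiments. The helper name `prepare_pages_dir` is hypothetical and is not part of this commit.

```python
from pathlib import Path


def prepare_pages_dir(html_dir: Path) -> list[str]:
    """Create .nojekyll in html_dir and list the entries Jekyll would skip.

    Hypothetical helper: it mirrors the `touch .nojekyll` step of the
    workflow and reports the underscore-prefixed entries that depend on it.
    """
    html_dir = Path(html_dir)
    # An empty marker file is enough to turn Jekyll processing off.
    (html_dir / ".nojekyll").touch()
    # Entries starting with "_" are the ones Jekyll would otherwise ignore.
    return sorted(p.name for p in html_dir.iterdir() if p.name.startswith("_"))
```

Calling `prepare_pages_dir(Path("docs/build/html"))` after a local build would return the underscore-prefixed entries in that directory, assuming the build output lives there as it does in the workflow.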
@@ -4,6 +4,7 @@ venv/
.idea
__pycache__
dist/
_build
build
recs_searcher.egg-info/
weights/
@@ -1,12 +1,83 @@
# recs-searcher (Registry error correction system - Searcher)
<div align="center">

pip install recs-searcher
[![GitHub Workflow Status (branch)](https://img.shields.io/github/actions/workflow/status/sheriff1max/recs-searcher/.github/workflows/python-app.yml)](https://github.com/sheriff1max/recs-searcher/actions/workflows/python-app.yaml)

The system can solve the following tasks:
1. Correcting registry and spelling errors in user input by comparing it against a database;
2. Searching the database for text records similar to the user input.
[![PyPI](https://img.shields.io/pypi/v/recs-searcher?color=blue&style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/recs-searcher/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/recs-searcher?style=for-the-badge&color=blue)](https://pepy.tech/project/recs-searcher)
<br>
</div>

**In development...**
# recs-searcher: a library for finding similar texts
The library finds texts in a dataset that are similar to the user's input.

## Usage examples
Quick-start example: [API example](https://github.com/sheriff1max/recs-searcher/blob/master/notebooks/tutorial_rus.ipynb)
### Contents
1. [Problem statement](#problems)
2. [Library features](#features)
3. [Installation](#install)
4. [Usage examples](#examples)

### Problem statement <a name="problems"></a>
User input may contain both spelling errors and registry errors.

Consider the most common errors:
- abbreviations are used in place of full word forms, or vice versa: `«Литературный институт имени А.М. Горького»` || `«Литературный институт им. А.М. Горького»`;
- words are omitted or added: `«Литературный институт имени А.М. Горького»` || `«Институт имени А.М.Горького»`;
- characters are omitted or added: `«Сибирский федеральный университет»` || `«Сибрский федерааальный универ»`;
- words appear in a different order: `Большой и красивый мотоцикл` || `Мотоцикл большой и красивый`.

The `recs-searcher (registry error correction system - searcher)` module, built on well-known NLP algorithms, helps solve these problems.
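
To see why character-level features tolerate these kinds of errors, here is a minimal standard-library sketch. It is not the library's implementation: the functions `char_ngrams`, `cosine` and `rank` are hypothetical names, and the choice of 1- and 2-character n-grams with cosine similarity is an assumption made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n_min: int = 1, n_max: int = 2) -> Counter:
    """Count all character n-grams of length n_min..n_max in the lowercased text."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank(query: str, dataset: list[str]) -> list[tuple[str, float]]:
    """Return dataset entries sorted from most to least similar to the query."""
    q = char_ngrams(query)
    scored = [(text, cosine(q, char_ngrams(text))) for text in dataset]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


dataset = ['Красноярск', 'Москва', 'Владивосток']
results = rank('Красный ярск', dataset)
print(results[0][0])  # the misspelled query still ranks 'Красноярск' first
```

Reordering words changes only the few n-grams that span word boundaries, so `Большой и красивый мотоцикл` and `Мотоцикл большой и красивый` remain close under this measure.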

### Library features: <a name="features"></a>
- the module works with any dataset;
- provides an API for using the library;
- includes many algorithm submodules from which a pipeline is assembled (text preprocessing, embedding models, algorithms for efficient embedding comparison, text augmentation for evaluating a fitted pipeline);
- the results of fitted pipelines can be interpreted;
- the library can be extended through the abstract classes it provides.
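
The idea of using text augmentation to evaluate a fitted pipeline can be sketched as follows: corrupt each dataset entry, search for the corrupted text, and count how often the original entry comes back first. This is a standard-library illustration under assumptions, not the `augmentation` module itself. All function names are hypothetical, and deleting one random character stands in for the augmentations the library provides.

```python
import random
from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Bag of character 1- and 2-grams over the lowercased text."""
    text = text.lower()
    grams = Counter(text)
    grams.update(text[i:i + 2] for i in range(len(text) - 1))
    return grams


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, dataset: list[str]) -> str:
    """Return the dataset entry most similar to the query."""
    q = embed(query)
    return max(dataset, key=lambda text: cosine(q, embed(text)))


def drop_random_char(text: str, rng: random.Random) -> str:
    """Augmentation: delete one randomly chosen character."""
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def top1_accuracy(dataset: list[str], n_variants: int = 20, seed: int = 0) -> float:
    """Share of corrupted queries whose best match is the original entry."""
    rng = random.Random(seed)
    hits = total = 0
    for original in dataset:
        for _ in range(n_variants):
            noisy = drop_random_char(original, rng)
            hits += search(noisy, dataset) == original
            total += 1
    return hits / total if total else 0.0


accuracy = top1_accuracy(['Красноярск', 'Москва', 'Владивосток'])
print(f"top-1 accuracy on corrupted queries: {accuracy:.2f}")
```

Fixing the seed keeps the score reproducible, which makes it possible to compare two pipeline configurations on the same set of corrupted queries.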

### Installation <a name="install"></a>

```commandline
pip install recs-searcher
```

### Usage examples <a name="examples"></a>

1. Build a pipeline:
```python
from recs_searcher import (
    dataset,  # sample datasets
    preprocessing,  # text preprocessing
    embeddings,  # converting text to embeddings
    similarity_search,  # fast search in embedding space
    augmentation,  # text augmentation for validating pipelines
    explain,  # interpreting the similarity of two texts
    api,  # the pipeline
)

model_embedding = embeddings.CountVectorizerWrapperEmbedding(
    analyzer='char',
    ngram_range=(1, 2),
)

pipeline = api.Pipeline(
    dataset=['Красноярск', 'Москва', 'Владивосток'],
    preprocessing=[preprocessing.TextLower()],
    model=model_embedding,
    searcher=similarity_search.FaissSearch,
    verbose=True,
)
# Pipeline ready!
```

2. Find the 3 texts in the database most similar to the user input "Красный ярск":
```python
pipeline.search('Красный ярск', 3, ascending=True)
# return: pandas.DataFrame
```

A more detailed [API](https://github.com/sheriff1max/recs-searcher/blob/master/notebooks/tutorial_rus.ipynb) example.

An example [web interface](https://github.com/sheriff1max/web-recs-searcher) that integrates this library.

### Author
- [Кобелев Максим](https://github.com/sheriff1max): author and sole developer.
@@ -0,0 +1,9 @@
# Generating the documentation

sphinx-apidoc -P -o docs ./recs_searcher

Add `modules` to index.rst

cd docs

.\make.bat html
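
The same steps can be scripted so they run identically on any platform. The sketch below is an assumption-laden illustration rather than part of this commit: the function names are hypothetical, it calls the `sphinx-apidoc` and `sphinx-build` executables directly instead of `make.bat`, and it assumes the `docs` source directory and `_build` output directory that `make.bat` uses. Adding `modules` to `index.rst` remains a manual step.

```python
import subprocess
from pathlib import Path


def doc_build_commands(package_dir: str = "recs_searcher",
                       docs_dir: str = "docs") -> list[list[str]]:
    """Return the commands that regenerate the API stubs and build the HTML."""
    build_dir = str(Path(docs_dir) / "_build")
    return [
        # -P includes private modules, -o sets the output directory for the stubs.
        ["sphinx-apidoc", "-P", "-o", docs_dir, package_dir],
        # Equivalent of `make.bat html`: make-mode build of the html target.
        ["sphinx-build", "-M", "html", docs_dir, build_dir],
    ]


def build_docs(package_dir: str = "recs_searcher", docs_dir: str = "docs") -> None:
    """Run each command in order, stopping at the first failure."""
    for command in doc_build_commands(package_dir, docs_dir):
        subprocess.run(command, check=True)
```

Running `build_docs()` from the repository root requires Sphinx to be installed. Keeping the command list in its own function makes it possible to inspect or log the commands without executing them.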
@@ -0,0 +1,33 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

import os
import sys

sys.path.insert(0, os.path.abspath('..'))

project = 'recs-searcher'
copyright = '2024, sheriff1max'
author = 'sheriff1max'
release = '0.1.0'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = ['sphinx.ext.todo', 'sphinx.ext.viewcode', 'sphinx.ext.autodoc']

templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

language = 'ru'

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
@@ -0,0 +1,20 @@
.. recs-searcher documentation master file, created by
   sphinx-quickstart on Tue May 7 23:14:04 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to recs-searcher's documentation!
=========================================

.. toctree::
   :maxdepth: 4
   :caption: Contents:

   modules

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.https://www.sphinx-doc.org/
	exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
@@ -0,0 +1,7 @@
recs_searcher
=============

.. toctree::
   :maxdepth: 4

   recs_searcher
@@ -0,0 +1,23 @@
recs\_searcher.api package
==========================

Submodules
----------

recs\_searcher.api.api module
-----------------------------

.. automodule:: recs_searcher.api.api
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

Module contents
---------------

.. automodule:: recs_searcher.api
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:
@@ -0,0 +1,50 @@
recs\_searcher.augmentation package
===================================

Submodules
----------

recs\_searcher.augmentation.\_actions module
--------------------------------------------

.. automodule:: recs_searcher.augmentation._actions
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

recs\_searcher.augmentation.\_base module
-----------------------------------------

.. automodule:: recs_searcher.augmentation._base
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

recs\_searcher.augmentation.\_char\_aug module
----------------------------------------------

.. automodule:: recs_searcher.augmentation._char_aug
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

recs\_searcher.augmentation.\_word\_aug module
----------------------------------------------

.. automodule:: recs_searcher.augmentation._word_aug
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

Module contents
---------------

.. automodule:: recs_searcher.augmentation
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:
@@ -0,0 +1,11 @@
recs\_searcher.dataset.data package
===================================

Module contents
---------------

.. automodule:: recs_searcher.dataset.data
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:
@@ -0,0 +1,40 @@
recs\_searcher.dataset package
==============================

Subpackages
-----------

.. toctree::
   :maxdepth: 4

   recs_searcher.dataset.data

Submodules
----------

recs\_searcher.dataset.\_base module
------------------------------------

.. automodule:: recs_searcher.dataset._base
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

recs\_searcher.dataset.\_dataframes module
------------------------------------------

.. automodule:: recs_searcher.dataset._dataframes
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

Module contents
---------------

.. automodule:: recs_searcher.dataset
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members: