update readme, add docs
sheriff1max committed May 8, 2024
1 parent 39e336a commit 1691006
Showing 20 changed files with 561 additions and 18 deletions.
53 changes: 53 additions & 0 deletions .github/workflows/docs_pages.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
name: docs_pages_workflow

# execute this workflow automatically when we push to main or dev
on:
  push:
    branches: [ main, dev ]

jobs:

  build_docs_job:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v3
        with:
          python-version: 3.11.5

      - name: Install dependencies
        run: |
          python -m pip install -U sphinx
          python -m pip install sphinx-rtd-theme
          # python -m pip install sphinxcontrib-apidoc
          python -m pip install sphinx-autoapi
      - name: make the sphinx docs
        run: |
          make -C docs clean
          # sphinx-apidoc -f -o docs/source . -H Test -e -t docs/source/_templates
          make -C docs html
      - name: Init new repo in dist folder and commit generated files
        run: |
          cd docs/build/html/
          git init
          touch .nojekyll
          git add -A
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git commit -m 'deploy'
      - name: Force push to destination branch
        uses: ad-m/github-push-action@v0.5.0
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          branch: gh-pages
          force: true
          directory: ./docs/build/html
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ venv/
.idea
__pycache__
dist/
_build
build
recs_searcher.egg-info/
weights/
87 changes: 79 additions & 8 deletions README.md
@@ -1,12 +1,83 @@
# recs-searcher (Registry error correction system - Searcher)
<div align="center">

pip install recs-searcher
[![GitHub Workflow Status (branch)](https://img.shields.io/github/actions/workflow/status/sheriff1max/recs-searcher/.github/workflows/python-app.yml)](https://github.com/sheriff1max/recs-searcher/actions/workflows/python-app.yaml)

The system can solve the following tasks:
1. Correcting registry and spelling errors in user input when comparing it against a database;
2. Searching the database for text records similar to the user input.
[![PyPI](https://img.shields.io/pypi/v/recs-searcher?color=blue&style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/recs-searcher/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/recs-searcher?style=for-the-badge&color=blue)](https://pepy.tech/project/recs-searcher)
<br>
</div>

**In development...**
# recs-searcher: a library for finding similar texts
The library finds texts in a dataset that are similar to the user input.

## Usage examples
Quick-start example: [API example](https://github.com/sheriff1max/recs-searcher/blob/master/notebooks/tutorial_rus.ipynb)
### Contents
1. [Problem statement](#problems)
2. [Library features](#features)
3. [Installation](#install)
4. [Usage examples](#examples)

### Problem statement <a name="problems"></a>
User input may contain both spelling errors and registry errors.

The most common errors are:
- abbreviated or full forms of a word are used: `«Литературный институт имени А.М. Горького»` || `«Литературный институт им. А.М. Горького»`;
- words are omitted or added: `«Литературный институт имени А.М. Горького»` || `«Институт имени А.М.Горького»`;
- characters are omitted or extra characters are added: `«Сибирский федеральный университет»` || `«Сибрский федерааальный универ»`;
- words may be in the wrong order: `Большой и красивый мотоцикл` || `Мотоцикл большой и красивый`.

The `recs-searcher (registry error correction system - searcher)` module, built on well-known NLP algorithms, helps solve these problems.
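The error types listed above can be reproduced synthetically, which is useful for checking how a search pipeline copes with them. The following is only a minimal sketch in plain Python: the function names `drop_word`, `shuffle_words`, and `repeat_char` are hypothetical and are not part of the `recs_searcher.augmentation` API.

```python
import random


def drop_word(text, rng):
    """Remove one randomly chosen word (simulates an omitted word)."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


def shuffle_words(text, rng):
    """Reorder the words at random (simulates wrong word order)."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def repeat_char(text, rng):
    """Duplicate one randomly chosen character (simulates an extra symbol)."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] + text[i:]


# A seeded generator keeps the illustration reproducible.
rng = random.Random(0)
original = "Сибирский федеральный университет"
variants = {
    "drop_word": drop_word(original, rng),
    "shuffle_words": shuffle_words(original, rng),
    "repeat_char": repeat_char(original, rng),
}
```

Each variant keeps most of the original characters, which is why character-level similarity measures tend to remain robust to these distortions.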

### Library features <a name="features"></a>
- the module works with any dataset;
- it provides an API for using the library;
- it contains many algorithm submodules from which a pipeline is built (text preprocessing, models for creating embeddings, algorithms for efficient comparison of embeddings, text augmentation for evaluating a trained pipeline);
- the results of trained pipelines can be interpreted;
- the library can be extended through the abstract classes it provides.

### Installation <a name="install"></a>

```commandline
pip install recs-searcher
```

### Usage examples <a name="examples"></a>

1. Build a pipeline:
```python
from recs_searcher import (
    dataset,  # sample datasets
    preprocessing,  # text preprocessing
    embeddings,  # converting text to embeddings
    similarity_search,  # fast search over the embedding space
    augmentation,  # text augmentation for validating pipelines
    explain,  # interpreting the similarity of two texts
    api,  # the Pipeline
)

model_embedding = embeddings.CountVectorizerWrapperEmbedding(
    analyzer='char',
    ngram_range=(1, 2),
)

pipeline = api.Pipeline(
    dataset=['Красноярск', 'Москва', 'Владивосток'],
    preprocessing=[preprocessing.TextLower()],
    model=model_embedding,
    searcher=similarity_search.FaissSearch,
    verbose=True,
)
# Pipeline ready!
```

2. Find the 3 texts in the database most similar to the user input "Красный ярск":
```python
pipeline.search('Красный ярск', 3, ascending=True)
# return: pandas.DataFrame
```
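
To give an intuition for why a character-level model with `analyzer='char'` and `ngram_range=(1, 2)` can match "Красный ярск" to "Красноярск", here is a minimal standard-library sketch of the idea: count character 1-grams and 2-grams, then rank candidates by cosine similarity. It is an assumption-laden illustration, not the implementation used by `recs-searcher`; the helper names `char_ngrams`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=2):
    """Count character n-grams of the lowercased text."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, dataset, k=3):
    """Return the k dataset entries most similar to the query."""
    q = char_ngrams(query)
    scored = [(text, cosine(q, char_ngrams(text))) for text in dataset]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


results = rank('Красный ярск', ['Красноярск', 'Москва', 'Владивосток'])
```

In this sketch `results[0]` is the "Красноярск" entry, because it shares far more character n-grams with the query than the other two cities do.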

A more detailed [API](https://github.com/sheriff1max/recs-searcher/blob/master/notebooks/tutorial_rus.ipynb) example.

An example [web interface](https://github.com/sheriff1max/web-recs-searcher) that integrates this library.

### Author
- [Кобелев Максим](https://github.com/sheriff1max), author and sole developer.
9 changes: 9 additions & 0 deletions docs/README.md
@@ -0,0 +1,9 @@
# Generating the documentation

sphinx-apidoc -P -o docs ./recs_searcher

Add `modules` to index.rst

cd docs

.\make.bat html
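
The steps above rely on `make.bat`, so they are Windows-specific. As a hedged, cross-platform sketch, the same two commands can be driven from Python. It assumes `sphinx-apidoc` and `sphinx-build` are on `PATH` and mirrors the `SOURCEDIR=.` and `BUILDDIR=_build` settings from `docs/make.bat`; the function names `docs_commands` and `build_docs` are hypothetical.

```python
import subprocess
from pathlib import Path


def docs_commands(repo_root="."):
    """Return (command, working directory) pairs for the documentation build."""
    root = Path(repo_root)
    docs = root / "docs"
    return [
        # Regenerate the .rst stubs, including private modules (-P).
        (["sphinx-apidoc", "-P", "-o", "docs", "./recs_searcher"], root),
        # Equivalent of `make.bat html` run from inside docs/.
        (["sphinx-build", "-M", "html", ".", "_build"], docs),
    ]


def build_docs(repo_root="."):
    """Run each command in order and stop at the first failure."""
    for cmd, cwd in docs_commands(repo_root):
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `build_docs()` from the repository root would then regenerate the stubs and build the HTML output into `docs/_build/html`.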
33 changes: 33 additions & 0 deletions docs/conf.py
@@ -0,0 +1,33 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

import os
import sys

sys.path.insert(0, os.path.abspath('..'))

project = 'recs-searcher'
copyright = '2024, sheriff1max'
author = 'sheriff1max'
release = '0.1.0'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = ['sphinx.ext.todo', 'sphinx.ext.viewcode', 'sphinx.ext.autodoc']

templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

language = 'ru'

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']
20 changes: 20 additions & 0 deletions docs/index.rst
@@ -0,0 +1,20 @@
.. recs-searcher documentation master file, created by
   sphinx-quickstart on Tue May 7 23:14:04 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to recs-searcher's documentation!
=========================================

.. toctree::
   :maxdepth: 4
   :caption: Contents:

   modules

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
35 changes: 35 additions & 0 deletions docs/make.bat
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
7 changes: 7 additions & 0 deletions docs/modules.rst
@@ -0,0 +1,7 @@
recs_searcher
=============

.. toctree::
   :maxdepth: 4

   recs_searcher
23 changes: 23 additions & 0 deletions docs/recs_searcher.api.rst
@@ -0,0 +1,23 @@
recs\_searcher.api package
==========================

Submodules
----------

recs\_searcher.api.api module
-----------------------------

.. automodule:: recs_searcher.api.api
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

Module contents
---------------

.. automodule:: recs_searcher.api
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:
50 changes: 50 additions & 0 deletions docs/recs_searcher.augmentation.rst
@@ -0,0 +1,50 @@
recs\_searcher.augmentation package
===================================

Submodules
----------

recs\_searcher.augmentation.\_actions module
--------------------------------------------

.. automodule:: recs_searcher.augmentation._actions
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

recs\_searcher.augmentation.\_base module
-----------------------------------------

.. automodule:: recs_searcher.augmentation._base
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

recs\_searcher.augmentation.\_char\_aug module
----------------------------------------------

.. automodule:: recs_searcher.augmentation._char_aug
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

recs\_searcher.augmentation.\_word\_aug module
----------------------------------------------

.. automodule:: recs_searcher.augmentation._word_aug
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

Module contents
---------------

.. automodule:: recs_searcher.augmentation
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:
11 changes: 11 additions & 0 deletions docs/recs_searcher.dataset.data.rst
@@ -0,0 +1,11 @@
recs\_searcher.dataset.data package
===================================

Module contents
---------------

.. automodule:: recs_searcher.dataset.data
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:
40 changes: 40 additions & 0 deletions docs/recs_searcher.dataset.rst
@@ -0,0 +1,40 @@
recs\_searcher.dataset package
==============================

Subpackages
-----------

.. toctree::
   :maxdepth: 4

   recs_searcher.dataset.data

Submodules
----------

recs\_searcher.dataset.\_base module
------------------------------------

.. automodule:: recs_searcher.dataset._base
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

recs\_searcher.dataset.\_dataframes module
------------------------------------------

.. automodule:: recs_searcher.dataset._dataframes
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

Module contents
---------------

.. automodule:: recs_searcher.dataset
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members: