SimDocSin

SimDocSin is a cross-lingual document similarity checking tool for Sinhala and English.

This system finds documents, or parts of documents, in Sinhala (English) that are similar to a given document in English (Sinhala). The system consists of two parts.

Full Matching

Here a user can submit a source document to retrieve any complete matching document in the target language that exists in the system database. The user has to set 3 input fields.

Input language - Language of the source document. It can be Sinhala or English.
Similarity level - Indicates how much similarity you expect from a similar pair output by the system. The user can give a value in the range 1 to 5. A low value means a higher chance of getting a result, though its similarity may be relatively low; a high value means any document returned has relatively high similarity, but the chance of getting a result is lower.
Source file - Source document given as the input. It can be submitted as either a file or text.
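
As an illustration of the three inputs above, the following is a minimal sketch of how a request could be validated before matching. The class name FullMatchRequest and its field names are hypothetical and do not come from the SimDocSin code; only the allowed languages and the 1 to 5 range are taken from the description above.

```python
from dataclasses import dataclass

# Languages named in this README; the exact strings are an assumption.
VALID_LANGUAGES = {"Sinhala", "English"}


@dataclass
class FullMatchRequest:
    """Hypothetical container for the three Full Matching inputs."""

    input_language: str    # language of the source document
    similarity_level: int  # 1 (lenient) to 5 (strict)
    source_text: str       # content of the source document

    def __post_init__(self):
        if self.input_language not in VALID_LANGUAGES:
            raise ValueError("input_language must be Sinhala or English")
        if not 1 <= self.similarity_level <= 5:
            raise ValueError("similarity_level must be between 1 and 5")
        if not self.source_text.strip():
            raise ValueError("source_text must not be empty")

    @property
    def target_language(self):
        # The system searches in the other language of the pair.
        return "English" if self.input_language == "Sinhala" else "Sinhala"
```

Constructing FullMatchRequest("Sinhala", 3, "some text") succeeds and reports English as the target language, while a similarity level outside 1 to 5 raises ValueError.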

Partial Matching

Here a user can submit a source document to retrieve any matching parts of documents in the target language that exist in the system database. The user again has to set the 3 input fields described in the previous section, plus one additional input field called Min Length.

Min Length - Minimum number of sentences the user expects a matching document part to contain. It can be 1, greater than 1, greater than 2, greater than 5 or greater than 10.

How to Deploy SimDocSin

1. Install Dependencies

You will need Python 3.x. You can build the environment as follows.
pip install -r requirements.txt
Run the command below to install the LASER models needed for embeddings.
python -m laserembeddings download-models

2. Create Data Container Folders

Run build.sh or build.bat. It will create the following folders within the SimDocSin directory.
db - To contain embedded files preprocessed for indexing
index - To contain index files
inputs - To contain documents inputted by users to the system
outputs_full_match - To contain documents outputted by the system
outputs_partial_match - To contain documents outputted by the system
embeddings - To contain embedding JSON files. This folder contains 3 subfolders.
|--sinhala - To contain embeddings of Sinhala documents
|--english - To contain embeddings of English documents
|--parallel - To contain embeddings of parallel documents
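
For reference, the folder layout listed above can also be created from Python. This is a rough sketch, not the contents of build.sh or build.bat; the function name create_data_folders is hypothetical, and only the folder names are taken from the list above.

```python
from pathlib import Path

# Folder names as listed in this README, relative to the SimDocSin directory.
FOLDERS = [
    "db",
    "index",
    "inputs",
    "outputs_full_match",
    "outputs_partial_match",
    "embeddings/sinhala",
    "embeddings/english",
    "embeddings/parallel",
]


def create_data_folders(root="."):
    """Create every data folder under root and return the created paths."""
    root_path = Path(root)
    created = []
    for relative in FOLDERS:
        folder = root_path / relative
        # exist_ok makes the call safe to repeat on an existing checkout.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling create_data_folders() from inside the SimDocSin directory would give the same layout as the list above, assuming the build scripts do nothing beyond creating these folders.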

3. Embed Documents

Main option

You can find already embedded documents here (the link has been sent with the email).
Download those files and put them into the corresponding subfolder within the embeddings folder.

OR

Alternative option

If you want, you can embed documents yourself (but the main option above is preferred).
Run the command below to embed a JSON list of documents using embedding_creator.py inside the embedder folder.

python embedding_creator.py path/to/input_file.json file_type output_file_name

Here file_type is si for Sinhala documents, en for English documents and pa for parallel document pairs.

The input_file.json should be in the following format.
For English documents

[
  {"content_en": "english document content"},
  ...
]

For Sinhala documents

[
  {"content_si": "sinhala document content"},
  ...
]

For parallel documents

[
  {
   "content_en": "english document content",
   "content_si": "sinhala document content"
  },
  ...
]
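
The three formats above can be produced with a small helper. This is a sketch under assumptions: the function names build_records and write_input_file are hypothetical and not part of SimDocSin; only the keys content_en and content_si and the file_type codes en, si and pa come from this README.

```python
import json
from pathlib import Path

# JSON keys expected for each file_type, as described above.
KEYS_BY_FILE_TYPE = {
    "en": ("content_en",),
    "si": ("content_si",),
    "pa": ("content_en", "content_si"),
}


def build_records(file_type, documents):
    """Turn raw documents into the list-of-objects layout shown above.

    For "en" and "si", each document is a string.
    For "pa", each document is an (english, sinhala) pair.
    """
    keys = KEYS_BY_FILE_TYPE[file_type]
    records = []
    for document in documents:
        values = (document,) if isinstance(document, str) else tuple(document)
        if len(values) != len(keys):
            raise ValueError(f"expected {len(keys)} text(s) per document")
        records.append(dict(zip(keys, values)))
    return records


def write_input_file(path, file_type, documents):
    """Write the records as UTF-8 JSON so Sinhala text stays readable."""
    records = build_records(file_type, documents)
    text = json.dumps(records, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")
    return records
```

For example, write_input_file("input_file.json", "pa", [("english document content", "sinhala document content")]) writes a one-element list in the parallel format.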

4. Preprocess Embedding Database

Run both of the following commands to preprocess the embedding database for indexing.

python db_split.py en
python db_split.py si

5. Build Index Files

Run both of the following commands to build the index files.
(Use the same filename.py for both the previous step and this step.)

python indexing.py en
python indexing.py si

After this step, the embeddings folder is no longer needed.
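
Steps 4 and 5 can be chained in a small script. The sketch below assumes that db_split.py and indexing.py can both be run from the same working directory, which this README does not state; adjust the cwd argument to wherever the scripts live in your checkout. The function names are hypothetical.

```python
import subprocess
import sys

# Language codes used by db_split.py and indexing.py in steps 4 and 5.
LANGUAGES = ("en", "si")


def pipeline_commands(python=sys.executable):
    """Return the preprocessing commands followed by the indexing commands."""
    commands = [[python, "db_split.py", lang] for lang in LANGUAGES]
    commands += [[python, "indexing.py", lang] for lang in LANGUAGES]
    return commands


def run_pipeline(cwd="."):
    """Run each command in order and stop at the first failure."""
    for command in pipeline_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, cwd=cwd, check=True)
```

Running the preprocessing for both languages before any indexing mirrors the order of the steps above; run_pipeline() stops as soon as one script fails, so a broken preprocessing run is not indexed.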

6. Run SimDocSin

Execute flask run within the webapp folder

Contributors

Udhan Isuranga (udhanisuranga.16@cse.mrt.ac.lk)
Janaka Sandaruwan (janakasadaruwan.16@cse.mrt.ac.lk)
Udesh Athukorala (udeshathukorala.16@cse.mrt.ac.lk)

Publications

https://www.researchgate.net/publication/348328250_Improved_Cross-Lingual_Document_Similarity_Measurement

