udhanMti/SimDocSin

SimDocSin

SimDocSin is a cross-lingual document similarity checking tool for Sinhala and English.

Given a document in one language (English or Sinhala), this system finds similar documents, or parts of documents, in the other language. The system consists of two parts.

Full Matching

Here a user submits a source document to retrieve any complete matching document in the target language that exists in the system database. The user has to set 3 input fields.

  • Input language - Language of the source document. It can be Sinhala or English.
  • Similarity level - How similar a pair returned by the system is expected to be, given as a value from 1 to 5. A low value makes it more likely that a similar document is found, but the similarity may be relatively low. A high value restricts the output to documents with relatively high similarity, so the chance of getting a result is lower.
  • Source file - Source document given as the input. It can be submitted as either a file or a text.
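
The trade-off behind the similarity level can be pictured as a threshold on a similarity score. The sketch below is only an illustration: it assumes cosine similarity between document embeddings and a made-up mapping from levels 1 to 5 to thresholds (LEVEL_THRESHOLDS). SimDocSin's actual scoring and cut-offs are not described here and may differ.

```python
import math

# Hypothetical mapping from similarity level (1-5) to a minimum score.
# These numbers are illustrative assumptions, not SimDocSin's real values.
LEVEL_THRESHOLDS = {1: 0.50, 2: 0.60, 3: 0.70, 4: 0.80, 5: 0.90}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_at_level(source_vec, candidates, level):
    """Return (name, score) pairs that meet the level's threshold, best first."""
    threshold = LEVEL_THRESHOLDS[level]
    scored = [(name, cosine_similarity(source_vec, vec))
              for name, vec in candidates.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real document embeddings.
source = [1.0, 0.0, 1.0]
candidates = {
    "doc_a": [1.0, 0.1, 0.9],  # close to the source
    "doc_b": [1.0, 1.0, 0.5],  # only loosely related
}

print([name for name, _ in matches_at_level(source, candidates, 1)])
print([name for name, _ in matches_at_level(source, candidates, 5)])
```

With these toy vectors, level 1 returns both candidates while level 5 keeps only the closest one, which mirrors the behaviour described for the Similarity level field.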

Partial Matching

Here a user submits a source document to retrieve any matching parts of documents that exist in the system database in the target language. The user sets the same 3 input fields described in the previous section, plus one additional field called Min Length.

  • Min Length - Minimum number of sentences the user expects a matching document part to have. It can be 1, greater than 1, greater than 2, greater than 5 or greater than 10.
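
As a rough illustration of how the Min Length options could be applied, the sketch below filters candidate document parts by sentence count. The option labels, the MIN_LENGTH_OPTIONS mapping and the filter_partials helper are hypothetical names chosen for this example; how the web app actually encodes the options is an assumption.

```python
# Each option is translated into the smallest acceptable sentence count.
# "1" is read here as "at least one sentence" (an assumption).
MIN_LENGTH_OPTIONS = {
    "1": 1,
    ">1": 2,
    ">2": 3,
    ">5": 6,
    ">10": 11,
}


def filter_partials(partials, option):
    """Keep the partial matches (lists of sentences) that satisfy the option."""
    minimum = MIN_LENGTH_OPTIONS[option]
    return [part for part in partials if len(part) >= minimum]


# Three toy partial matches with 1, 3 and 7 sentences.
partials = [
    ["s1"],
    ["s1", "s2", "s3"],
    ["s%d" % i for i in range(7)],
]

print([len(p) for p in filter_partials(partials, ">2")])  # [3, 7]
```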

How to Deploy SimDocSin

1. Install Dependencies

You will need Python 3.x. You can build the environment as follows.
pip install -r requirements.txt
Run the command below to install the LASER models needed for embeddings.
python -m laserembeddings download-models

2. Create Data Container Folders

Run build.sh or build.bat. It will create the following folders within the SimDocSin directory.
db - To contain embedded files preprocessed for indexing
index - To contain index files
inputs - To contain documents inputted by users to the system
outputs_full_match - To contain documents output by full matching
outputs_partial_match - To contain documents output by partial matching
embeddings - To contain embedding JSON files. This folder contains 3 subfolders.
 |--sinhala- To contain embeddings of Sinhala documents
 |--english- To contain embeddings of English documents
 |--parallel- To contain embeddings of parallel documents
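
If neither script can be run on your platform, the same layout can be created from Python. This is a minimal sketch based only on the folder names listed above; create_data_folders is a hypothetical helper, and build.sh / build.bat remain the reference in case they do anything beyond creating folders.

```python
from pathlib import Path

# Folder names are taken from the list above.
TOP_LEVEL = ["db", "index", "inputs", "outputs_full_match", "outputs_partial_match"]
EMBEDDING_SUBFOLDERS = ["sinhala", "english", "parallel"]


def create_data_folders(root):
    """Create the data container folders under root; existing folders are kept."""
    root = Path(root)
    created = []
    for name in TOP_LEVEL:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    for name in EMBEDDING_SUBFOLDERS:
        path = root / "embeddings" / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_data_folders(".") from within the SimDocSin directory would create the eight folders there. The call is safe to repeat because existing folders are left untouched.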

3. Embed Documents

Main option

You can find already embedded documents here (the link has been sent with the email).
Download those files and put them in the corresponding subfolders within the embeddings folder.


OR

Alternative option

If you want, you can embed the documents yourself, although the main option above is preferred.
Run the command below to embed a JSON list of documents using embedding_creator.py inside the embedder folder.

python embedding_creator.py path/to/input_file.json file_type output_file_name

Here file_type is si for Sinhala documents, en for English documents and pa for parallel document pairs.

The input_file.json should be in the following format.
For English documents

[
  {"content_en": "english document content"},
  ...
]

For Sinhala documents

[
  {"content_si": "sinhala document content"},
  ...
]

For parallel documents

[
  {
   "content_en": "english document content",
   "content_si": "sinhala document content"
  },
  ...
]
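
Before running embedding_creator.py on a large file, it can help to check that the JSON matches the format above. The following is a sketch, not part of SimDocSin: validate_documents and REQUIRED_KEYS are hypothetical names, and the rule that each content value must be a string is an assumption drawn from the examples.

```python
import json

# Keys each item must carry, per file_type, as described above.
REQUIRED_KEYS = {
    "en": {"content_en"},
    "si": {"content_si"},
    "pa": {"content_en", "content_si"},
}


def validate_documents(documents, file_type):
    """Return a list of problems; an empty list means the input looks well-formed."""
    required = REQUIRED_KEYS[file_type]
    if not isinstance(documents, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for i, doc in enumerate(documents):
        if not isinstance(doc, dict):
            problems.append(f"item {i}: expected an object")
            continue
        present = set(doc)
        for key in sorted(required - present):
            problems.append(f"item {i}: missing key {key!r}")
        for key in sorted(required & present):
            if not isinstance(doc[key], str):
                problems.append(f"item {i}: {key!r} must be a string")
    return problems


# A parallel-document list whose second item lacks the Sinhala side.
raw = """[
  {"content_en": "english document content",
   "content_si": "sinhala document content"},
  {"content_en": "only the English side"}
]"""
documents = json.loads(raw)

print(validate_documents(documents, "pa"))
# ["item 1: missing key 'content_si'"]
```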

After this step, the folder structure of the embeddings folder is as follows.
.
|--embeddings
         |--parallel
                  |-- parallel_json_file_name1.json
                  |-- parallel_json_file_name2.json
                  |-- ...
         |--sinhala
                  |-- sinhala_json_file_name1.json
                  |-- sinhala_json_file_name2.json
                  |-- ...
         |--english
                  |-- english_json_file_name1.json
                  |-- english_json_file_name2.json
                  |-- ...

4. Preprocess Embedding Database

Run both of the following commands to preprocess the embedding database for indexing.

python db_split.py en
python db_split.py si

5. Build Index Files

Run both of the following commands to build the index files.
(You should use the same filename.py for both the previous step and this step.)

python indexing.py en
python indexing.py si

After this step, the embeddings folder is no longer needed.

6. Run SimDocSin

Execute flask run within the webapp folder

Contributors

Udhan Isuranga (udhanisuranga.16@cse.mrt.ac.lk)
Janaka Sandaruwan (janakasadaruwan.16@cse.mrt.ac.lk)
Udesh Athukorala (udeshathukorala.16@cse.mrt.ac.lk)

Publications

https://www.researchgate.net/publication/348328250_Improved_Cross-Lingual_Document_Similarity_Measurement

About

FYP 16th Batch - Team MOJO
