LexicalChain_Builder

Works with Python 3.x, WordNet (bundled with nltk), and synset embeddings (a word-embedding model trained on synsets instead of words, e.g. synset2vec.vector).
Another synset2vec model can be used, but the code that retrieves key vectors from the model has to be changed accordingly.

Transforms consecutive, semantically related synsets into LexicalChains. Synsets are placed in the same chain when they share one or more of the following attributes/relationships defined in WordNet:

  NYMS = ['hypernyms','instance_hypernyms' , 'hyponyms', 'instance_hyponyms', 
  'member_holonyms', 'substance_holonyms', 'part_holonyms', 
  'member_meronyms', 'substance_meronyms', 'part_meronyms',
  'attributes','entailments','causes', 'also_sees', 'verb_groups', 'similar_tos'] + [evaluated synset]

[evaluated synset] is the synset token currently being evaluated.
These chains grow as long as consecutive synsets have semantically related synsets in common.
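
As a rough illustration of the idea above, the sketch below collects the synsets reachable through the listed relations and groups consecutive synsets while they keep a related synset in common. It is not the repository's implementation: the function names `related_synsets` and `build_flexible_chains`, and the rule of intersecting the related sets, are assumptions made for this example. Any object exposing the WordNet relation methods (such as an nltk Synset) should work.

```python
NYMS = ['hypernyms', 'instance_hypernyms', 'hyponyms', 'instance_hyponyms',
        'member_holonyms', 'substance_holonyms', 'part_holonyms',
        'member_meronyms', 'substance_meronyms', 'part_meronyms',
        'attributes', 'entailments', 'causes', 'also_sees',
        'verb_groups', 'similar_tos']


def related_synsets(synset, relations=NYMS):
    """Return the synset itself plus everything reachable via `relations`."""
    related = {synset}
    for name in relations:
        method = getattr(synset, name, None)
        if callable(method):
            related.update(method())
    return related


def build_flexible_chains(synsets, relations=NYMS):
    """Group consecutive synsets while they share a related synset (assumed rule)."""
    chains = []
    current, shared = [], set()
    for synset in synsets:
        related = related_synsets(synset, relations)
        if current and shared & related:
            current.append(synset)
            shared &= related  # keep only what the whole chain has in common
        else:
            if current:
                chains.append(current)
            current, shared = [synset], related
    if current:
        chains.append(current)
    return chains
```

With nltk installed and the WordNet corpus downloaded, the input list could be built with `from nltk.corpus import wordnet as wn` and calls such as `wn.synset('dog.n.01')`.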
In read_write.py, under def fname_splitter(docslist): when running on UNIX use split('/'); when running on Windows with a hard-coded input/output path use split('\\').
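
One way to avoid switching the separator by hand is to split on either slash. The sketch below is a hypothetical alternative, not the code in read_write.py; `file_name` and `file_names` are names invented for this example.

```python
import re


def file_name(path):
    """Return the last path component, accepting forward or back slashes."""
    return re.split(r'[\\/]', path.rstrip('\\/'))[-1]


def file_names(docslist):
    """Apply file_name to every path in the list."""
    return [file_name(path) for path in docslist]
```

Note that a backslash is a legal character in UNIX file names, so this shortcut assumes input files do not use it.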
Takes a directory of .txt files containing synsets, one per line, in the following format:

word \t synset \t offset \t token4 \n

Example: gray Synset('gray.n.09') 11012474 n
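
A minimal sketch of reading one such line is shown below. It assumes the fields are separated by tabs and that the fourth token is the part-of-speech tag, as in the example; `SynsetToken` and `parse_line` are names made up for this illustration.

```python
import re
from collections import namedtuple

SynsetToken = namedtuple('SynsetToken', 'word synset offset pos')

# Matches the textual form Synset('gray.n.09') and captures the synset name.
SYNSET_NAME = re.compile(r"Synset\('(.+)'\)")


def parse_line(line):
    """Split one tab-separated input line into its four fields."""
    word, synset, offset, pos = line.rstrip('\n').split('\t')
    match = SYNSET_NAME.fullmatch(synset.strip())
    name = match.group(1) if match else synset.strip()
    return SynsetToken(word.strip(), name, offset.strip(), pos.strip())
```

The offset is kept as a string so that it can be reused unchanged when building lookup keys.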

Produces as many files as the input, in the same format. Each entry now represents a LexicalChain.
POS_W = {'n':1.0, 'v':1.0, 'r':1.0, 'a':1.0, 's':1.0} in lc_management.py sets the weight of each POS when deciding which element is closest to the average of that chain. The weights can be adjusted to the distribution of POS in the training corpus. Currently all are 1.0.
'a' and 's' should have the same value, since both represent adjectives in WordNet.
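
To make the role of the weights concrete, here is a small sketch that picks the chain element closest to the chain average. How lc_management.py actually combines POS_W with the distance is not described here, so multiplying the cosine similarity by the POS weight is purely an assumption, as are the names `mean_vector`, `cosine`, and `pick_representative`.

```python
import math

POS_W = {'n': 1.0, 'v': 1.0, 'r': 1.0, 'a': 1.0, 's': 1.0}


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, or 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_representative(chain, pos_weights=POS_W):
    """chain: list of (key, pos, vector). Return the key with the best weighted score."""
    centre = mean_vector([vector for _, _, vector in chain])

    def score(item):
        _, pos, vector = item
        # Assumed combination: similarity to the average, scaled by the POS weight.
        return pos_weights.get(pos, 1.0) * cosine(vector, centre)

    return max(chain, key=score)[0]
```

With all weights at 1.0 this reduces to choosing the vector most similar to the average; raising one weight biases the choice toward that part of speech.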

COMMAND LINE

python3 lc_builder.py  --input <input_folder> --chain <chain_type> [--size <size>] --output <output_folder> --model <model_file>

<input_folder> : Input folder with .txt files, or with subfolders containing .txt files
<chain_type> : 'flex' for Flexible Lexical Chains (FLLC); 'fixed' for Fixed Lexical Chains (FXLC)
<size> : [OPTIONAL] Size of the chunk for fixed chains (default = CHUNK_SIZE in lc_management.py)
<output_folder> : Output folder where the LexicalChain representatives are saved
<model_file> : Synset-embedding model to use. It is expected in .vector format, but this can be changed to binary. What matters is that the embeddings were trained on synsets in the canonical format word#offset#pos, since these are the keys used to look up the embeddings.
The input, output, and model folders must be at the same level as ../lc_builder.py (one level above the executed script).
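
The key format can be illustrated with a short sketch. The helper names `synset_key` and `lookup` are hypothetical, and a plain dict stands in for the loaded model; any object that supports `in` and indexing by key should behave the same way.

```python
def synset_key(word, offset, pos):
    """Build a lookup key in the word#offset#pos format."""
    return '#'.join((word, str(offset), pos))


def lookup(model, word, offset, pos, default=None):
    """Return the vector stored under the synset key, or `default` if absent."""
    key = synset_key(word, offset, pos)
    return model[key] if key in model else default
```

For the example line shown earlier (gray, 11012474, n), the key would be gray#11012474#n.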

Models and Corpora:

All datasets, training corpora, and generated models for the paper "Enhanced word embeddings using multi-semantic representation through lexical chains" can be found in the DeepBlue repository.

UPDATES

[2019-05-15]

Public domain for datasets/vectors/models generated.

[2019-03-07]

Moving project from personal repository

[2019-01-12]

Bug fix - non-ASCII characters are now handled when reading

[2018-11-29]:

General refactoring of status printing (reduced I/O)
Handled differences between Python <= 3.4 and >= 3.5 when merging dictionaries
Discard documents that are empty and/or cannot be parsed into chains

[2018-11-15]:

Flex and Fixed LC implemented, IDE and command line - milestone
Small refactoring to validate input/parameters
General refactoring in the code

[2018-11-14]

Flexible Lexical Chains (FLLC) - Prototype working
Fixed Lexical Chains (FXLC) - Prototype working - milestone

[2018-11-08]

Refactoring - work with document structure better
Refactoring - generating key and model handling
Refactoring - normal distribution between LOW-HIGH in case key does not exist in vector model
Initialize FixedLexical Chains
Making code for representing Fixed and Flex chains more common so they can share unit-simple functions.

[2018-10-11]

If a key token does not exist in the token-embedding model, a random vector is generated from a uniform distribution over [-5.0, 5.0]. A random part-of-speech weight is also selected from the weight list 1a. This should not happen, since the model used here is based on the synset corpus used to build the chains
General refactoring for optimization
In doc_multifolder: file_uri = file_uri.replace("\\","/") #if running on Windows
Included new related-synset methods ('topic_domains', 'region_domains', 'usage_domains')

[2018-06-12]

Deleted package for read-write. Everything will be under lexicon package

[2018-03-28]

Flexible Lexical Chain Algorithm (FLC) implemented and validated.
