Implementation of the algorithm proposed in the paper *An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection* by Anurag Roy, Shalmoli Ghosh, Kripabandhu Ghosh, and Saptarshi Ghosh. The proposed unsupervised algorithm finds noisy variants of words in a corpus by forming clusters of their morphological variants. The flow-chart below gives a schematic overview of the UnsupClean algorithm:
UnsupClean is a novel unsupervised and language-independent text normalization algorithm. It relies on the lexical and contextual similarities of words/tokens in a corpus to form clusters of their morphological variants. For contextual similarity, the algorithm uses word vectors and co-occurrence counts. We used word2vec embeddings in our experiments, but other word vectors can also be used. The implementation expects the word vectors and co-occurrence counts as input files (more details in Hyperparameters and Options). UnsupClean also depends on `alpha`, a threshold based on which it determines lexical similarity. The implementation also provides optional stopword removal, in case anyone wants to exclude stopwords from the clusters. The output is stored in the file named by `output_fname`. The output format is `<word><||><space separated list of words in cluster>`. An example file entry looks like `blessed<||>bless blessed`.
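The output format above is easy to consume downstream. As a sketch, a hypothetical helper (not part of the repository) that parses such a file into a Python dictionary might look like this:

```python
import os
import tempfile

def load_clusters(path):
    """Parse UnsupClean's output file into a dict mapping each word
    to its list of cluster members.  Expected line format:
    <word><||><space separated list of words in cluster>"""
    clusters = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            word, _, members = line.partition("<||>")
            clusters[word] = members.split()
    return clusters

# Demo on a one-line file matching the example entry above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("blessed<||>bless blessed\n")
clusters = load_clusters(tmp.name)
os.remove(tmp.name)
print(clusters)  # {'blessed': ['bless', 'blessed']}
```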
Python version: 2.7

Required packages:

- networkx
- gensim
- python_louvain
- community
- scikit_learn

To install the dependencies, run `pip install -r requirements.txt`
Hyperparameters and options in `unsupclean.py`:

- `wordvec_file`: Word-vectors file containing a vector representation for each word in the lexicon, in Google word2vec text format
- `cooccur_dict`: Pickle file of a Python dictionary with word tuples as keys and their co-occurrence counts as values
- `alpha`: The alpha value used in the algorithm, in [0, 1]
- `output_fname`: Name of the output file where the word clusters will be stored
- `nprocs`: Number of parallel processes (the algorithm can be run in parallel)
- `stopword_list`: File containing stopwords, one stopword per line (optional)
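To prepare your own `cooccur_dict` input, you need a pickled dictionary keyed by word pairs. A minimal sketch, using only the standard library, is shown below; the sliding-window size and the ordered-tuple key convention are assumptions for illustration, not requirements taken from the paper (the `wordvec_file` can be produced separately, e.g. by training gensim word2vec and saving in word2vec text format):

```python
import pickle
from collections import Counter

def build_cooccur_dict(sentences, window=5):
    """Count co-occurrences of word pairs within a sliding window.
    Keys are ordered (left_word, right_word) tuples -- an assumed
    convention for this sketch."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                counts[(w, v)] += 1
    return dict(counts)

# Tiny demo corpus (tokenized sentences).
corpus = [["god", "bless", "you"], ["blessed", "be", "god"]]
cooccur = build_cooccur_dict(corpus)

# Serialize in the pickle format the script expects.
with open("cooccurData_demo.pkl", "wb") as f:
    pickle.dump(cooccur, f)
```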
For demonstration purposes, we have provided the word vectors and co-occurrence counts of the SemEval Atheism dataset. Stopword removal is not required for this dataset. Following are the data files:
To generate the clusters, run the following command:
`python2 unsupclean.py -wordvec_file w2v_SemEvalAT_new_100dim_iter100.txt -alpha 0.56 -output_fname word_clusters.txt -nprocs 4 -cooccur_dict cooccurData.pkl`
The word clusters will be generated in the `word_clusters.txt` file.
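Once the clusters are generated, a common use is normalizing noisy text by replacing each variant with a canonical member of its cluster. This is a hedged sketch, not part of the repository: choosing the shortest member as the canonical form is a simplifying assumption for illustration.

```python
def build_normalizer(clusters):
    """Map every cluster member to a canonical representative.
    Here the shortest member is picked as canonical -- a
    simplifying assumption, not the paper's prescription."""
    norm = {}
    for members in clusters.values():
        canon = min(members, key=len)
        for m in members:
            norm.setdefault(m, canon)
    return norm

# Demo with a cluster shaped like the example output entry.
clusters = {"blessed": ["bless", "blessed"]}
norm = build_normalizer(clusters)
tokens = ["god", "blessed", "me"]
normalized = [norm.get(t, t) for t in tokens]
print(normalized)  # ['god', 'bless', 'me']
```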