KCharEmb

Tutorial for character-level embeddings in Korean sentence classification

Requirements

fasttext==0.8.3 (else gensim==3.6.0), hgtk==0.1.3, Keras==2.1.2,
numpy==1.14.3, scikit-learn==0.19.1, tensorflow-gpu==1.4.1
Currently available for python 3.5 and upper version is in implementation

HowTo

Git clone and proceed with val_tutorial.py, line by line.

Datasets

NMSC

Naver Sentiment Movie Corpus
Train:Test ratio is 3:1.
Train set is again split into train:validation set in ratio 9:1.

3i4K

Intonation-aided Intention Identification for Korean
Train:Test ratio is 9:1.
Train set is again split into train:validation set in ratio 9:1.

Character-level representations

Word vector for Cho2018a-Dense

Pretrained 100dim fastText vector

Download this and unzip THE .BIN FILE in the NEW FOLDER named 'vectors'
This can be replaced with whatever model the user employs, but it requires an additional training.

Result

The analysis can be found in the paper!

(The arXiv version will be updated in early July)

DISCLAIMER

We added NSMC files to our repo since it is easier for cloning and replication, and most of all the data is open to the public domain. The files will be removed if any problem comes up.

ACKNOWLEDGEMENT

The authors appreciate Yong Gyu Park for informing us the points that require improvement in the previous experiment.

Citation

For the utilization of the dataset 3i4K, cite the following:

@article{cho2018speech,
	title={Speech Intention Understanding in a Head-final Language: 
	A Disambiguation Utilizing Intonation-dependency},
	author={Cho, Won Ik and Lee, Hyeon Seung and Yoon, Ji Won and Kim, Seok Min and Kim, Nam Soo},
	journal={arXiv preprint arXiv:1811.04231},
	year={2018}
}

For the utilization of the word vector dictionary, cite the following:

@article{cho2018real,
	title={Real-time Automatic Word Segmentation for User-generated Text},
	author={Cho, Won Ik and Cheon, Sung Jun and Kang, Woo Hyun and Kim, Ji Won and Kim, Nam Soo},
	journal={arXiv preprint arXiv:1810.13113},
	year={2018}
}

For the utilization of the result and the code, cite the following:

@article{cho2019investigating,
	title={Investigating an Effective Character-level Embedding in Korean Sentence Classification},
	author={Cho, Won Ik and Kim, Seok Min and Kim, Nam Soo},
	journal={arXiv preprint arXiv:1905.13656},
	year={2019}
}

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
data		data
images		images
model		model
LICENSE		LICENSE
README.md		README.md
Requirements.txt		Requirements.txt
han2one.py		han2one.py
kor_char.npy		kor_char.npy
val_tutorial.py		val_tutorial.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KCharEmb

Requirements

HowTo

Datasets

NMSC

3i4K

Character-level representations

Word vector for Cho2018a-Dense

Result

The analysis can be found in the paper!

DISCLAIMER

ACKNOWLEDGEMENT

Citation

For the utilization of the dataset 3i4K, cite the following:

For the utilization of the word vector dictionary, cite the following:

For the utilization of the result and the code, cite the following:

About

Releases

Packages

Languages

License

warnikchow/kcharemb

Folders and files

Latest commit

History

Repository files navigation

KCharEmb

Requirements

HowTo

Datasets

NMSC

3i4K

Character-level representations

Word vector for Cho2018a-Dense

Result

The analysis can be found in the paper!

DISCLAIMER

ACKNOWLEDGEMENT

Citation

For the utilization of the dataset 3i4K, cite the following:

For the utilization of the word vector dictionary, cite the following:

For the utilization of the result and the code, cite the following:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages