CNN for Sentence Classification

Implementation of Convolutional Neural Networks for Sentence Classification using PyTorch. And using Bidirectional Encoder Representations from Transformers instead of word2vec for embedding words.

Yoon Kim, Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746 - 1751, 2014.

Requirements

Python: 3.8.0 or higher
PyTorch: 1.6.0 or higher
Optuna: 2.0.0 or higher
Transformers from huggingface 3.0.2 or higher

If you installed Anaconda, you can create a virtual enviroment from env.yml.

$ conda env create -f env.yml

Datasets

If you use sample dataset, download RCV1-ids and put the folder that unzipped the id.zip into data/.

This is RCV1 dataset that raw texts converted to ids by BERT tokenizer. Raw texts means it didn't normalize. Embedding's Max length is 512 by BERT-base. So, used only 512 words from the beggining of each texts. This dataset splitted 23,149 training sample and 781,265 testing sample according to RCV1: A New Benchmark Collection for Text Categorization Research.

If you want to use original or another dataset, you should convert dataset to ids by BERT tokenizer and you should change format.

- one document per line
- line must has 'tokenized ids' and 'label'
- label must be represented in one hot vector

example

[id1, id2, id3, ...]<TAB>[000100...]

And you should fix some parameters in run.py such as classes and embedding_dim.

BERT

huggingface publish some pre-trained models including BERT. This program uses base-uncased model that one of pre-trained models. So, embeds a word into a 768-dimensional vector by BERT. And this program turn off the fine-tuning of BERT.

Evalution Metrics

This program uses Precision@k, MicroF1 and MacroF1.

How to run

normal training and testing

$ python run.py

parameter search

$ python run.py --tuning

Force to use cpu

$ python run.py --no_cuda

Acknowledgment

This program is based on the following repositories. Thank you very much for their accomplishments.

siddsax/XML-CNN (MIT license)
yu54ku/xml-cnn (MIT license)

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
data		data
CNN.py		CNN.py
DataHelper.py		DataHelper.py
LICENSE.txt		LICENSE.txt
README.md		README.md
env.yml		env.yml
run.py		run.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

CNN.py

CNN.py

DataHelper.py

DataHelper.py

LICENSE.txt

LICENSE.txt

README.md

README.md

env.yml

env.yml

run.py

run.py

train.py

train.py

Repository files navigation

CNN for Sentence Classification

Requirements

Datasets

BERT

Evalution Metrics

How to run

normal training and testing

parameter search

Force to use cpu

Acknowledgment

About

Releases

Packages

Languages

License

viiviilxx/TextClassification

Folders and files

Latest commit

History

Repository files navigation

CNN for Sentence Classification

Requirements

Datasets

BERT

Evalution Metrics

How to run

normal training and testing

parameter search

Force to use cpu

Acknowledgment

About

Topics

Resources

License

Stars

Watchers

Forks

Languages