Feature Embedding

Author: Yi Yang

contact: yangyiycc@gmail.com

Basic Description

Python code for the following papers:

- NAACL 2015 paper: Unsupervised Multi-Domain Adaptation with Feature Embeddings
- ICLR 2015 paper: Unsupervised Domain Adaptation with Feature Embeddings

Requirements

Install gensim by running
- pip install --upgrade gensim
For a faster version of this tool, you can also
- install Cython by running
  - pip install cython
- compile the code by running
  - python setup.py build_ext --inplace
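
To check whether the optional compiled extension is in place, a quick test like the one below may help. It is only a sketch: it assumes the build step produces an importable module named feat2vec_inner (after the feat2vec_inner.pyx file in this repository), and the helper name has_compiled_extension is hypothetical.

```python
import importlib.util


def has_compiled_extension(module_name="feat2vec_inner"):
    """Return True if a module with this name can be found on the import path.

    The default name is an assumption based on the feat2vec_inner.pyx file;
    adjust it if the build produces a different module name.
    """
    return importlib.util.find_spec(module_name) is not None


if has_compiled_extension():
    print("Compiled extension found.")
else:
    print("Compiled extension not found; the tool may run in its slower mode.")
```

Run it from the repository root so that the freshly built module is on the import path.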

Demo

A demo for saving feature embeddings to a txt/bin file is available (python save_embeddings.py -h).

Given a feature file (data/twitter_feat.txt) in which each line corresponds to features of one instance, save feature embeddings to a txt file (data/twitter_embeddings.txt):

If the features use a bag-of-words (BoW) representation (no feature templates involved):

python save_embeddings.py --bow 1 --dim 25 data/twitter_feat.txt data/twitter_embeddings.txt
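
As an illustration of the "one instance per line" input, the sketch below writes and reads a tiny BoW feature file. The whitespace-separated layout, the toy content, and the helper name read_feature_file are assumptions made for illustration; check data/twitter_feat.txt for the actual format.

```python
import os
import tempfile


def read_feature_file(path):
    """Read one instance per line, splitting features on whitespace (assumed layout)."""
    instances = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            features = line.split()
            if features:  # skip blank lines
                instances.append(features)
    return instances


# Hypothetical two-instance file, written to a temporary location.
with tempfile.TemporaryDirectory() as tmp_dir:
    feat_path = os.path.join(tmp_dir, "toy_feat.txt")
    with open(feat_path, "w", encoding="utf-8") as handle:
        handle.write("so tired today\n")
        handle.write("new phone arrived\n")
    toy_instances = read_feature_file(feat_path)

print(len(toy_instances), "instances; first:", toy_instances[0])
```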

If the features use a structured representation (features extracted with feature templates) and a feature-template mapping file is available (data/twitter_feat_template.txt):

python save_embeddings.py --feature_template_file data/twitter_feat_template.txt --dim 25 data/twitter_feat.txt data/twitter_embeddings.txt

If the features use a structured representation (features extracted with feature templates) and a template prefix file is available (data/twitter_template_prefix.txt):

python save_embeddings.py --template_prefix_file data/twitter_template_prefix.txt --dim 25 data/twitter_feat.txt data/twitter_embeddings.txt
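
One way to picture the template-prefix option: if every feature string begins with the name of the template that produced it, a list of prefixes is enough to recover the feature-template mapping. The sketch below illustrates that idea only. The prefixes, the feature strings, and the longest-prefix rule are assumptions, not a description of how save_embeddings.py parses data/twitter_template_prefix.txt.

```python
def template_of(feature, prefixes):
    """Return the longest prefix that the feature starts with, or None."""
    matches = [p for p in prefixes if feature.startswith(p)]
    return max(matches, key=len) if matches else None


# Hypothetical template prefixes and features.
prefixes = ["word", "word_suffix"]
features = ["word=lol", "word_suffix=ol", "shape=xxx"]

# "word_suffix=ol" matches both prefixes; the longest one wins.
feature_to_template = {f: template_of(f, prefixes) for f in features}
print(feature_to_template)
```

Picking the longest match keeps the mapping unambiguous when one prefix is itself the beginning of another.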

See the save_features method of twproc.py for how to generate the data/twitter_feat.txt and data/twitter_feat_template.txt files from files in CoNLL POS format.
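
To inspect the saved vectors, the txt output can be parsed with a few lines of Python. The sketch below assumes the file follows the word2vec text layout (an optional header line with the vocabulary size and dimension, then one feature per line followed by its vector values). That layout is an assumption, since the exact output format is not documented here. The helper name load_text_embeddings and the toy file are hypothetical.

```python
import os
import tempfile


def load_text_embeddings(path):
    """Parse 'feature v1 v2 ...' lines into a dict of float lists.

    A first line holding exactly two integers is treated as a
    'count dimension' header and skipped (assumed word2vec text layout).
    """
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            parts = line.split()
            if not parts:
                continue
            if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                continue  # header line
            embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


# Hypothetical embeddings file with two 3-dimensional vectors.
with tempfile.TemporaryDirectory() as tmp_dir:
    emb_path = os.path.join(tmp_dir, "toy_embeddings.txt")
    with open(emb_path, "w", encoding="utf-8") as handle:
        handle.write("2 3\n")
        handle.write("w=lol 0.1 0.2 0.3\n")
        handle.write("w=the -0.5 0.0 0.25\n")
    toy_embeddings = load_text_embeddings(emb_path)

print(sorted(toy_embeddings))
```

For the real output, point load_text_embeddings at data/twitter_embeddings.txt; with --dim 25, each vector would be expected to hold 25 values.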

Domain Adaptation for Twitter POS tagging

A light demo for part-of-speech tagging of tweets is also provided, using data from the CMU Twitter NLP project.

The oct27 dataset serves as the source data, and the daily547 dataset serves as the target data. We also randomly sample some unlabeled tweets (see the data/twitter folder).

Run the demo:

1. Prepare the data (extract features, select pivots, etc.) by running

python twproc.py

2. Obtain the baseline (no adaptation) SVM tagging results by running

python twpos.py none

3. Obtain the marginalized denoising autoencoders adaptation results by running

python twpos.py mldae

4. Obtain the feature embedding adaptation results by running

python twpos.py feat2vec
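
The commands above can also be chained from a small driver script. This is a convenience sketch, not part of the repository: the function name run_demo and its runner parameter are hypothetical, and it assumes the commands are launched from the repository root with the same Python interpreter that runs the script.

```python
import subprocess
import sys

# The four demo commands, in order.
DEMO_COMMANDS = [
    ["twproc.py"],
    ["twpos.py", "none"],
    ["twpos.py", "mldae"],
    ["twpos.py", "feat2vec"],
]


def run_demo(runner=subprocess.run):
    """Run each step in order, stopping at the first failure."""
    for args in DEMO_COMMANDS:
        command = [sys.executable, *args]
        print("Running:", " ".join(command))
        # With the default runner, check=True raises CalledProcessError
        # when a step exits with a non-zero status.
        runner(command, check=True)
```

Calling run_demo() executes the steps back to back. Data preparation comes first because the tagging runs rely on the file it creates.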

The first step creates the file data/dataset_twitter.pkl. I obtained results of 0.8839, 0.8889 and 0.8924 for steps 2, 3 and 4, respectively. The feat2vec results may vary a little because of the negative sampling technique. You should obtain even better results with feat2vec by using more unlabeled data.
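
To put the three numbers in perspective, the following sketch computes the relative error reduction of each adaptation method over the baseline. It assumes the reported results are tagging accuracies, so that the error is one minus the result. The variable names are illustrative.

```python
# Results reported above for the baseline, mldae and feat2vec runs.
results = {"none": 0.8839, "mldae": 0.8889, "feat2vec": 0.8924}

baseline_error = 1.0 - results["none"]

reductions = {}
for method in ("mldae", "feat2vec"):
    error = 1.0 - results[method]
    # Fraction of the baseline's errors that the method removes.
    reductions[method] = (baseline_error - error) / baseline_error
    print(f"{method}: relative error reduction {reductions[method]:.1%}")
```

Under that assumption, this comes to roughly 4.3% for mldae and 7.3% for feat2vec.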

