Larger unlabeled dataset #3

Closed
don-tpanic opened this issue Jul 4, 2018 · 2 comments

Comments

@don-tpanic

Hi it's me again...

I noticed a small inconsistency between your implementation and the paper. The unlabeled data you used (from 6001 for book to 34742 for dvd) seems to be far more than what Blitzer used (from 3685 to 5945). I am not sure whether you used all of it in your actual experiments, so I would like to confirm this with you.

I also wonder whether you can recall exactly how you collected the data. According to this paper, there are two Amazon datasets (a big one and a small one):
http://www.icml-2011.org/papers/342_icmlpaper.pdf
Blitzer clearly used the small one in the SCL paper. The labeled data in your implementation has the same size as Blitzer's (2000 positive, 2000 negative for each domain), so I would like to know where the unlabeled data came from, since the small Amazon dataset does not seem to contain that much data.

Thanks as always

@yftah89
Owner

yftah89 commented Jul 4, 2018

We wrote about it in the appendix (B Experimental Choices) as follows: "Variants of the Product Review Data. There are two releases of the datasets of the Blitzer et al. (2007) cross-domain product review task. We use the one from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html where the data is imbalanced, consisting of more positive than negative reviews. We believe that our setup is more realistic as when collecting unlabeled data, it is hard to get a balanced set. Note that Blitzer et al. (2007) used the other release where the unlabeled data consists of the same number of positive and negative reviews."

@yftah89
Owner

yftah89 commented Jul 4, 2018

I know that Bollegala et al. (2015) also used this variant: Danushka Bollegala, Takanori Maehara, and Ken-ichi Kawarabayashi. 2015. Unsupervised cross-domain word representation learning. In Proc. of ACL.
