Running with Custom Dataset #3

Open
aflah02 opened this issue Aug 16, 2022 · 4 comments

@aflah02

aflah02 commented Aug 16, 2022

Hey!
Great Paper!!
Can you share some instructions for formatting a custom dataset as well?

@hzhwcmhf
Member

You can download the preprocessed Yelp dataset and format your custom dataset following the instructions in the READMEs.

Please feel free to ask if you have any further questions.

@aflah02
Author

aflah02 commented Aug 18, 2022

@hzhwcmhf Thanks for the instructions!

@aflah02
Author

aflah02 commented Aug 18, 2022

@hzhwcmhf Can the test files be run without multiple human references as well? I see the paper mentions Luo et al. (2019) for the Yelp dataset, as they provided multiple references, but for GYAFC there is no such mention. I don't have multiple human references, so I would like to know whether the code already handles single references automatically or whether I would need to make the changes manually.

@hzhwcmhf
Member

Hi, @aflah02

First, we also use multiple human references for GYAFC. You can find the references here. Multiple references are recommended when evaluating style transfer models, since they cover more of the possible transferred phrases and lead to more reliable results.

Second, it should be OK if your test files contain only one reference per sample. For example, the test file can be

ever since joes has changed hands it 's just gotten worse and worse .
ever since joes has changed hands it 's gotten better and better .

there is definitely not enough room in that part of the venue .
there is so much room in that part of the venue

......
(NOTE: THE BLANK LINE IS REQUIRED)

(If it does not work, please tell me. I will figure out the problem.)
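As a rough illustration of the layout above, here is a minimal sketch that writes a test file with one source sentence, one reference, and the required blank line per sample. The helper name `write_single_reference_file` is hypothetical and not part of the repository; it only assumes that you already have (source, reference) pairs in memory.

```python
def write_single_reference_file(pairs, path):
    """Write (source, reference) pairs in the layout described above.

    Each sample is: one source line, one reference line, one blank line.
    This is a hypothetical helper, not code from the repository.
    """
    with open(path, "w", encoding="utf-8") as f:
        for source, reference in pairs:
            for text in (source, reference):
                # An empty line or an embedded newline would be read as a
                # sample boundary, so reject it early.
                if not text.strip() or "\n" in text:
                    raise ValueError(f"invalid sentence: {text!r}")
            f.write(f"{source}\n{reference}\n\n")
```

For example, calling it with `[("the food was bad .", "the food was great .")]` and an output path of your choice should produce a three-line file ending in a blank line.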

Moreover, you can change the format of the input files here:

data_arg.fields = {
    "train_0": OrderedDict([("sent", "SentenceDefault")]),
    "train_1": OrderedDict([("sent", "SentenceDefault")]),
    "dev_0": OrderedDict([("sent", "SentenceDefault")]),
    "dev_1": OrderedDict([("sent", "SentenceDefault")]),
    "test_0": OrderedDict([("sent", "SentenceDefault"), ("ref", "SessionDefault")]),
    "test_1": OrderedDict([("sent", "SentenceDefault"), ("ref", "SessionDefault")]),
}

where SentenceDefault indicates a single line, and SessionDefault indicates multiple lines terminated by an empty line.
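To make the two field types concrete, here is a small sketch of how a test file could be read under that description: one line for `sent`, then every following line up to the next empty line for `ref`. This is not the repository's actual loader; `parse_test_lines` is a hypothetical function written only to illustrate the format, and it may differ from the real implementation in edge cases.

```python
def parse_test_lines(lines):
    """Parse lines laid out as: one sentence line, then reference lines
    ending with an empty line. Hypothetical illustration only."""
    samples = []
    it = iter(lines)
    for line in it:
        sent = line.rstrip("\n")
        if not sent.strip():
            continue  # ignore stray blank lines between samples
        refs = []
        # Consume reference lines from the same iterator until a blank line.
        for ref_line in it:
            ref = ref_line.rstrip("\n")
            if not ref.strip():
                break
            refs.append(ref)
        samples.append({"sent": sent, "ref": refs})
    return samples
```

With this reading, a file with a single reference per sample simply yields a `ref` list of length one, which is why the single-reference layout fits the same field definition.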
