
What to change in secondary_code argument for mode --sentiment if I want to generate negative sentiment text? #1

Closed
yugaljain1999 opened this issue Sep 25, 2020 · 6 comments


@yugaljain1999

@akhileshgotmare I was trying to generate negative-sentiment text instead of the default one. How can I do that?

Thanks!

@yugaljain1999

One more thing I want to ask: what is the difference between gen_type --gpt2 (CC-LM) and gen_type --gedi? Both look similar, as both are conditioned on secondary_code and a mode like sentiment, detoxify, or topic.

Thanks!

@yugaljain1999

My last question: if I want to train GeDi on my own data, do I have to train the whole network, or is training just the last layer enough to learn the embeddings of the additional tokens?
Thanks!

@benkrause

Hi! To answer your questions:

  1. Sentiment doesn't use a secondary code; it only uses --code_desired and --code_undesired, and these are set automatically if you run the shell script.

To get negative sentiment, run run_generation.sh with --mode sentiment. You'll be prompted with the option to switch to negative sentiment. When this happens, type n and press enter, and then you can give the model your prompt.

  2. --gpt2 and --cclm were baselines for --gedi. --gpt2 just generates from OpenAI's GPT-2 language model using greedy decoding and a repetition penalty (its output will be the same regardless of attribute codes). --cclm generates directly from a language model conditioned on an attribute variable. --gedi is the method described in our paper, where we guide generation from GPT-2 using a language model that conditions on an attribute variable. Both --gedi and --cclm can control generation to an extent, but --gedi tends to give much more interesting and diverse responses for different prompts.

  3. If you want to train your own GeDi, it's advisable to train the whole network. Last-layer-only training would not work as well, and would require some modification to the codebase.
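To make the --cclm vs. --gedi distinction above concrete, here is a toy numerical sketch (not the repo's actual code): a CC-LM samples directly from class-conditional logits, while GeDi reweights the base GPT-2 next-token distribution by the attribute posterior computed from two class-conditional LMs via Bayes rule. All names, the three-token vocabulary, and the logit values below are illustrative.

```python
import math

def normalize(logits):
    """Convert a list of logits to probabilities via softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def gedi_step(base_logits, desired_logits, undesired_logits, omega=1.0):
    """One GeDi-style decoding step: reweight base-LM token probabilities
    by p(desired | token), computed from two class-conditional LMs
    (Bayes rule with a uniform class prior). omega scales the guidance."""
    posterior = [
        # p(desired | token) = exp(d) / (exp(d) + exp(u))
        1.0 / (1.0 + math.exp(u - d))
        for d, u in zip(desired_logits, undesired_logits)
    ]
    weighted = [b + omega * math.log(p)
                for b, p in zip(base_logits, posterior)]
    return normalize(weighted)

# Toy vocabulary of 3 tokens
base = [2.0, 1.0, 0.5]       # base GPT-2 logits (prefers token 0)
desired = [0.0, 2.0, 0.0]    # desired-class CC-LM logits
undesired = [2.0, 0.0, 0.0]  # undesired-class CC-LM logits

probs = gedi_step(base, desired, undesired)
# token 1 becomes the most likely token: the desired-class LM boosts it,
# even though the base LM on its own prefers token 0
```

The point of the sketch: with --cclm you would sample from `normalize(desired)` directly, whereas the GeDi step keeps the base model's fluency signal and only tilts it toward the desired attribute.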

@yugaljain1999

@benkrause Thanks for your valuable responses. I want to ask one more thing following my last question: how should I build my labelled dataset? In your default AG News dataset there are four topics, with one topic assigned to each sentence. Do I also have to make a dataset with four topics, or can I have more or fewer? If the number of topics can change, which Python file or script should I update?

Another thing I want to ask: what is the purpose of the second column in the AG News train and test files? Its entries are only 4 to 5 words long, and I don't understand why they are necessary.

[screenshot: sample rows from the AG News train/test CSV]

One last question: how can I label each sentence with a specific topic, given that I only have a preprocessed text file of sentences? So far I have applied LDA to classify the sentences, but instead of broad topics like politics, crime, or sports, I get a set of topics for each sentence.

Thanks!

@benkrause

The second column of AG news is just the article titles; we don't actually use these. Our scripts only process the first and third columns: they assume the topic labels are in the first column (and start at 1), and the text is in the third column.

If you want to train on your own topic dataset with minimal changes, first set up new csv files in the same format as the AG news train and test csv files: topic label IDs in the first column, the second column blank (since we ignore it anyway), and the text in the third column.

  • If you want to avoid having to specify paths in the processing and training scripts, you could save your csv files with the same names in the same directory that we download to (data/AG-news/train.csv and data/AG-news/test.csv for the train and test splits). This would be the simplest, but would overwrite AG-news.

  • Alternatively, you could save them in a new directory if you replace the paths in proc_data.py and specify the directory in the --data_dir argument in scripts/run_training.sh.

Once you have replaced the AG-news train and test csv files with your own, you can process them into a dataset suitable for GeDi with python proc_data.py. To change the topics, you will have to change the list on line 16 in proc_data.py, which currently specifies the topic names used for AG-news. Make sure the list corresponds to the topic labels you saved in the csv files, so the first topic in the list should correspond to the label "1", second topic should correspond to the label "2", etc. You can potentially have as many topics as you want, as long as you have data and numbered labels for these topics in your csv file.
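A minimal sketch of writing a custom-topic csv in the layout described above (label ID in column 1, starting at 1; column 2 unused; text in column 3). The topic names and example rows here are made up for illustration; the topic list itself goes in proc_data.py, not in the csv.

```python
import csv

# Hypothetical topics; label IDs 1..N must line up with the topic list
# you put in proc_data.py (first topic <-> label 1, second <-> label 2, ...)
topics = ["politics", "crime", "sports"]

rows = [
    (1, "", "The senate passed the new budget bill today."),
    (2, "", "Police reported a string of burglaries downtown."),
    (3, "", "The home team clinched the title in overtime."),
]

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for label, title, text in rows:
        # column 1: topic label ID, column 2: ignored, column 3: text
        writer.writerow([label, title, text])
```

The same layout applies to test.csv; after saving both, running proc_data.py as described above should pick them up.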

As for your last question on how to use unlabeled data: that is something we haven't explored yet; all our experiments so far have used labeled datasets. I will mention that GeDi can often generate to topics it hasn't seen during training. For instance, if you run our topic GeDi trained on AG-news (which was trained on "world", "sports", "business" and "science") and give it a secondary code of "crime", depending on the prompt it should sometimes be able to generate text relating to crime.

Hope this helps!

@yugaljain1999

@benkrause Thanks for your valuable response, it really helped me a lot.
