# Controllable generation via RL to make Elon Musk speak ill of DOGE
> How to control text generation through a sentiment classifier.

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]


In [2]:
from textrl import TextRLEnv, TextRLActor
from transformers import pipeline, AutoTokenizer, AutoModelWithLMHead
import logging
import sys
import pfrl
import torch
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')

**Load a pre-trained model that generates tweets in Elon Musk's style.**

In [3]:
tokenizer = AutoTokenizer.from_pretrained("huggingtweets/elonmusk")  
model = AutoModelWithLMHead.from_pretrained("huggingtweets/elonmusk")
model.eval()
model.cuda()



GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )


**A sentiment classifier provides the RL reward.**

In [4]:
sentiment = pipeline(
    'sentiment-analysis',
    model="cardiffnlp/twitter-roberta-base-sentiment",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment",
    device=0,
    return_all_scores=True,
)

In [5]:
transformers_logger = logging.getLogger('transformers')
transformers_logger.setLevel(logging.CRITICAL)

In [6]:
sentiment("dogecoin is bad")

[[{'label': 'LABEL_0', 'score': 0.9338533878326416},
  {'label': 'LABEL_1', 'score': 0.060118917375802994},
  {'label': 'LABEL_2', 'score': 0.0060277231968939304}]]

In [7]:
sentiment("dogecoin is bad")[0][0]['score']

0.9338533878326416
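
The classifier returns generic label names. According to the model card for `cardiffnlp/twitter-roberta-base-sentiment`, `LABEL_0` is negative, `LABEL_1` is neutral, and `LABEL_2` is positive, which is why index `[0][0]` picks the negative score. A minimal sketch of looking the score up by label rather than by position follows; the helper name `negative_score` is hypothetical and not part of the notebook.

```python
# Label mapping taken from the model card of cardiffnlp/twitter-roberta-base-sentiment.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def negative_score(pipeline_output):
    """Return the negative-class probability for the first input text.

    `pipeline_output` is the nested list produced with return_all_scores=True:
    one list of {label, score} dicts per input text.
    """
    scores = {LABEL_NAMES[item["label"]]: item["score"] for item in pipeline_output[0]}
    return scores["negative"]


# The classifier output shown above for "dogecoin is bad".
example = [[
    {"label": "LABEL_0", "score": 0.9338533878326416},
    {"label": "LABEL_1", "score": 0.060118917375802994},
    {"label": "LABEL_2", "score": 0.0060277231968939304},
]]
print(negative_score(example))
```

Looking up by name keeps the reward correct even if the order of the returned labels were to change.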

Define the text-generation reward as inverse perplexity plus the sentiment classifier score.
- Inverse perplexity keeps the probability of the generated sentence high, so the output stays fluent.
- The sentiment classifier score pushes the generation toward negative sentiment.
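
To make the two terms concrete: the language-model loss is a mean negative log-likelihood, so perplexity is `exp(loss)` and its inverse is `exp(-loss)`, which lies in (0, 1]. Adding the negative-class probability, which lies in [0, 1], bounds the total reward by 2. The sketch below uses plain floats; the function name `combined_reward` and the sample numbers are illustrative assumptions, not values from the notebook.

```python
import math


def combined_reward(lm_loss, negative_prob):
    """Inverse perplexity plus negative-sentiment probability."""
    inverse_perplexity = 1.0 / math.exp(lm_loss)  # same as math.exp(-lm_loss)
    return inverse_perplexity + negative_prob


# Hypothetical inputs: a loss of 3.0 nats and a negative score of 0.93.
reward = combined_reward(3.0, 0.93)
print(reward)
```

Because the loss of a realistic sentence is usually well above zero, the fluency term tends to be small and the sentiment term is likely to dominate the total.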

In [8]:
class MyRLEnv(TextRLEnv):
    def get_reward(self, input_text, predicted_list, finish):
        # predicted_list is the list of tokens predicted so far
        reward = 0
        # Score only finished generations with 2 to 49 tokens
        if finish and 1 < len(predicted_list) < 50:
            predicted_text = tokenizer.convert_tokens_to_string(predicted_list)
            # Inverse perplexity of prompt + generation under the language model
            inputs = tokenizer(input_text + predicted_text, return_tensors='pt').to('cuda')
            with torch.no_grad():  # the reward needs no gradients
                loss = model(**inputs, labels=inputs["input_ids"]).loss
            reward += 1 / torch.exp(loss).item()
            # Sentiment classifier: probability of the negative class (LABEL_0)
            reward += sentiment(predicted_text)[0][0]['score']
        return reward

**Fit a single example**

In [9]:
observation_list = ['i think dogecoin is']

In [10]:
env = MyRLEnv(model, tokenizer, observation_input=observation_list)
actor = TextRLActor(env, model, tokenizer)
agent = actor.agent_ppo(update_interval=10, minibatch_size=2000, epochs=20)

In [11]:
actor.predict('i think dogecoin is')

' a great idea.'

In [16]:
pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=500,
    eval_n_steps=None,
    eval_n_episodes=1,
    train_max_episode_len=50,
    eval_interval=10,
    outdir='elon_musk_dogecoin',
)

outdir:elon_musk_dogecoin step:28 episode:0 R:0.11256813772978423
statistics:[('average_value', -1.1820644), ('average_entropy', 71312.86), ('average_value_loss', 1.5397733084180139), ('average_policy_loss', -0.0005778993137496537), ('n_updates', 33), ('explained_variance', -36.946850889108426)]
evaluation episode 0 length:15 R:0.6118701571341162
The best score is updated -3.4028235e+38 -> 0.6118701571341162
Saved the agent to elon_musk_dogecoin/best
outdir:elon_musk_dogecoin step:33 episode:1 R:0.04733136688125236
statistics:[('average_value', -1.1896302), ('average_entropy', 71312.86), ('average_value_loss', 1.5780798437840797), ('average_policy_loss', -0.0006748329104833401), ('n_updates', 34), ('explained_variance', -0.7300913744865904)]
evaluation episode 0 length:5 R:0.051060883339943464
outdir:elon_musk_dogecoin step:38 episode:2 R:0.051060883339943464
statistics:[('average_value', -1.1958332), ('average_entropy', 71312.86), ('average_value_loss', 1.5780798437840797), ('average_

(<pfrl.agents.ppo.PPO at 0x7f1438408f10>,
 [{'average_entropy': 71312.86,
   'average_policy_loss': -0.0005778993137496537,
   'average_value': -1.1820644,
   'average_value_loss': 1.5397733084180139,
   'eval_score': 0.6118701571341162,
   'explained_variance': -36.946850889108426,
   'n_updates': 33},
  {'average_entropy': 71312.86,
   'average_policy_loss': -0.0006748329104833401,
   'average_value': -1.1896302,
   'average_value_loss': 1.5780798437840797,
   'eval_score': 0.051060883339943464,
   'explained_variance': -0.7300913744865904,
   'n_updates': 34},
  {'average_entropy': 71312.94,
   'average_policy_loss': -0.0002805701943333798,
   'average_value': -1.2554632,
   'average_value_loss': 1.5440955738990734,
   'eval_score': 0.051060883339943464,
   'explained_variance': -390.4063933607259,
   'n_updates': 39},
  {'average_entropy': 71312.98,
   'average_policy_loss': -0.0002032094211821592,
   'average_value': -1.2457539,
   'average_value_loss': 1.5156753742840232,
   'eva
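
The call returns the agent together with a list of per-evaluation statistics, as the truncated output above shows. A minimal sketch of picking the best evaluation from such a history follows; the variable name `history` is an assumption, and only the first two entries of the output are reproduced, with most keys omitted.

```python
# Two entries copied from the output above (other keys omitted for brevity).
history = [
    {"eval_score": 0.6118701571341162, "n_updates": 33},
    {"eval_score": 0.051060883339943464, "n_updates": 34},
]

# Select the evaluation with the highest score.
best = max(history, key=lambda stats: stats["eval_score"])
print(best["n_updates"], best["eval_score"])
```

In the notebook, the same list could be captured by unpacking the return value, for example `agent, history = pfrl.experiments.train_agent_with_evaluation(...)`.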

Load the best checkpoint and predict again.

In [17]:
agent.load("./elon_musk_dogecoin/best")

In [18]:
actor.predict('i think dogecoin is')

' a great idea, but I think it is a little overused.'