<a href="https://colab.research.google.com/github/skochar1/skochar1-the-pile-state-analysis/blob/main/sentimentProbe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 1: Use the transformers versions of the bertweet-sentiment-analysis and RoBERTa sentiment analysis models from Hugging Face ([referenced here](https://huggingface.co/docs/transformers/index)).


# Model #1

**bertweet-sentiment-analysis**

Repository: https://github.com/finiteautomata/pysentimiento/

Model trained with the SemEval 2017 corpus (around 40k tweets). The base model is BERTweet, a RoBERTa model trained on English tweets.

Uses POS, NEG, NEU labels.

License: pysentimiento is an open-source library for non-commercial use and scientific research purposes only. Please be aware that models are trained with third-party datasets and are subject to their respective licenses.

Paper citation: 

Pérez, Juan Manuel, Juan Carlos Giudici, and Franco Luque. “Pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP Tasks.” arXiv.org, June 17, 2021. https://arxiv.org/abs/2106.09462. 



In [1]:
!pip install folium==0.2.1

Collecting folium==0.2.1
  Downloading folium-0.2.1.tar.gz (69 kB)
Building wheels for collected packages: folium
  Building wheel for folium (setup.py) ... done
  Created wheel for folium: filename=folium-0.2.1-py3-none-any.whl size=79808 sha256=1dff0e0dd3ff5b938baa2c0b01acaa1bcae46718507691495ef16cb1a6b9d621
  Stored in directory: /root/.cache/pip/wheels/9a/f0/3a/3f79a6914ff5affaf50cabad60c9f4d565283283c97f0bdccf
Successfully built folium
Installing collected packages: folium
  Attempting uni

In [2]:
!pip install pysentimiento

Collecting pysentimiento
  Downloading pysentimiento-0.3.2-py3-none-any.whl (20 kB)
Collecting emoji<2.0.0,>=1.6.1
  Downloading emoji-1.7.0.tar.gz (175 kB)
Collecting transformers<5.0.0,>=4.11.3
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting datasets<2.0.0,>=1.13.3
  Downloading datasets-1.18.4-py3-none-any.whl (312 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.3.0-py3-none-any.whl (136 kB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-

In [3]:
!pip install transformers==4.14.1
!pip install bitsandbytes-cuda111==0.26.0
!pip install datasets==1.16.1

Collecting transformers==4.14.1
  Downloading transformers-4.14.1-py3-none-any.whl (3.4 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.12.1
    Uninstalling tokenizers-0.12.1:
      Successfully uninstalled tokenizers-0.12.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.18.0
    Uninstalling transformers-4.18.0:
      Successfully uninstalled transformers-4.18.0
Successfully installed tokenizers-0.10.3 transformers-4.14.1
Collecting bitsandbytes-cuda111==0.26.0
  Downloading bitsandbytes_cuda111-0.26.0-py3-none-any.whl (4.0 MB)

In [4]:
import transformers

import torch
import torch.nn.functional as F
from torch import nn
from torch.cuda.amp import custom_fwd, custom_bwd

from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise

from tqdm.auto import tqdm

In [5]:
class FrozenBNBLinear(nn.Module):
    def __init__(self, weight, absmax, code, bias=None):
        assert isinstance(bias, nn.Parameter) or bias is None
        super().__init__()
        self.out_features, self.in_features = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
        self.bias = bias
 
    def forward(self, input):
        output = DequantizeAndLinear.apply(input, self.weight, self.absmax, self.code, self.bias)
        if self.adapter:
            output += self.adapter(input)
        return output
 
    @classmethod
    def from_linear(cls, linear: nn.Linear) -> "FrozenBNBLinear":
        weights_int8, state = quantize_blockise_lowmemory(linear.weight)
        return cls(weights_int8, *state, linear.bias)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.in_features}, {self.out_features})"
 
 
class DequantizeAndLinear(torch.autograd.Function): 
    @staticmethod
    @custom_fwd
    def forward(ctx, input: torch.Tensor, weights_quantized: torch.ByteTensor,
                absmax: torch.FloatTensor, code: torch.FloatTensor, bias: torch.FloatTensor):
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        ctx.save_for_backward(input, weights_quantized, absmax, code)
        ctx._has_bias = bias is not None
        return F.linear(input, weights_deq, bias)
 
    @staticmethod
    @custom_bwd
    def backward(ctx, grad_output: torch.Tensor):
        assert not ctx.needs_input_grad[1] and not ctx.needs_input_grad[2] and not ctx.needs_input_grad[3]
        input, weights_quantized, absmax, code = ctx.saved_tensors
        # grad_output: [*batch, out_features]
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        grad_input = grad_output @ weights_deq
        grad_bias = grad_output.flatten(0, -2).sum(dim=0) if ctx._has_bias else None
        return grad_input, None, None, None, grad_bias
 
 
class FrozenBNBEmbedding(nn.Module):
    def __init__(self, weight, absmax, code):
        super().__init__()
        self.num_embeddings, self.embedding_dim = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
 
    def forward(self, input, **kwargs):
        with torch.no_grad():
            # note: both quantized weights and input indices are *not* differentiable
            weight_deq = dequantize_blockwise(self.weight, absmax=self.absmax, code=self.code)
            output = F.embedding(input, weight_deq, **kwargs)
        if self.adapter:
            output += self.adapter(input)
        return output 
 
    @classmethod
    def from_embedding(cls, embedding: nn.Embedding) -> "FrozenBNBEmbedding":
        weights_int8, state = quantize_blockise_lowmemory(embedding.weight)
        return cls(weights_int8, *state)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.num_embeddings}, {self.embedding_dim})"
 
 
def quantize_blockise_lowmemory(matrix: torch.Tensor, chunk_size: int = 2 ** 20):
    assert chunk_size % 4096 == 0
    code = None
    chunks = []
    absmaxes = []
    flat_tensor = matrix.view(-1)
    for i in range((matrix.numel() - 1) // chunk_size + 1):
        input_chunk = flat_tensor[i * chunk_size: (i + 1) * chunk_size].clone()
        quantized_chunk, (absmax_chunk, code) = quantize_blockwise(input_chunk, code=code)
        chunks.append(quantized_chunk)
        absmaxes.append(absmax_chunk)
 
    matrix_i8 = torch.cat(chunks).reshape_as(matrix)
    absmax = torch.cat(absmaxes)
    return matrix_i8, (absmax, code)
 
 
def convert_to_int8(model):
    """Convert linear and embedding modules to 8-bit with optional adapters"""
    for module in list(model.modules()):
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                print(name, child)
                setattr( 
                    module,
                    name,
                    FrozenBNBLinear(
                        weight=torch.zeros(child.out_features, child.in_features, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                        bias=child.bias,
                    ),
                )
            elif isinstance(child, nn.Embedding):
                setattr(
                    module,
                    name,
                    FrozenBNBEmbedding(
                        weight=torch.zeros(child.num_embeddings, child.embedding_dim, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                    )
                )
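
The classes above keep each weight as a `uint8` index into a 256-entry `code` table, plus one `absmax` scale per block of values. The following is a minimal pure-Python sketch of that idea, not a reimplementation of bitsandbytes: it assumes a uniform code table and a tiny block size for readability, and the names `quantize_blockwise_sketch`, `dequantize_blockwise_sketch`, and `num_blocks` are hypothetical.

```python
from bisect import bisect_left

BLOCK = 4  # illustrative only; the buffers in convert_to_int8 assume 4096-element blocks

# Assumption: a uniform table of 256 values in [-1, 1] (the real code table may differ).
CODE = [-1.0 + 2.0 * i / 255 for i in range(256)]


def nearest_code_index(x):
    """Index of the CODE entry closest to x (x is expected to lie in [-1, 1])."""
    pos = bisect_left(CODE, x)
    if pos == 0:
        return 0
    if pos == len(CODE):
        return len(CODE) - 1
    return pos if CODE[pos] - x < x - CODE[pos - 1] else pos - 1


def quantize_blockwise_sketch(values, block=BLOCK):
    """Return (code indices, per-block absmax) for a flat list of floats."""
    indices, absmaxes = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        absmax = max(abs(v) for v in chunk) or 1.0  # avoid dividing by zero
        absmaxes.append(absmax)
        indices.extend(nearest_code_index(v / absmax) for v in chunk)
    return indices, absmaxes


def dequantize_blockwise_sketch(indices, absmaxes, block=BLOCK):
    """Look up each code value and rescale it by its block's absmax."""
    return [CODE[q] * absmaxes[i // block] for i, q in enumerate(indices)]


def num_blocks(numel, block=4096):
    """Ceiling division, the same arithmetic as (numel - 1) // 4096 + 1 above."""
    return (numel - 1) // block + 1


weights = [0.02, -0.5, 0.31, 0.07, 3.0, -1.5, 0.0, 0.75, 0.1]
q, scales = quantize_blockwise_sketch(weights)
restored = dequantize_blockwise_sketch(q, scales)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(scales, max_error)
```

Under these assumptions, the rounding error of each value is bounded by its own block's `absmax`, which is why per-block scaling keeps small weights from being swamped by a single large outlier elsewhere in the matrix.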

In [6]:
class GPTJBlock(transformers.models.gptj.modeling_gptj.GPTJBlock):
    def __init__(self, config):
        super().__init__(config)

        convert_to_int8(self.attn)
        convert_to_int8(self.mlp)


class GPTJModel(transformers.models.gptj.modeling_gptj.GPTJModel):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)
        

class GPTJForCausalLM(transformers.models.gptj.modeling_gptj.GPTJForCausalLM):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)


transformers.models.gptj.modeling_gptj.GPTJBlock = GPTJBlock  # monkey-patch GPT-J
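
The monkey-patch works because library code that builds its layers from a module-level name looks that name up when the model is constructed, not when the library is imported. A minimal sketch of the mechanism, assuming that lookup behavior; `fake_lib`, `Block`, `Int8Block`, and `build_model` are hypothetical stand-ins and not part of transformers:

```python
import types

# Stand-in for a library module such as transformers.models.gptj.modeling_gptj.
fake_lib = types.SimpleNamespace()


class Block:
    kind = "float32 block"


def build_model(n_layers):
    # The class is resolved on the module at call time, so whatever is
    # bound to fake_lib.Block when this runs is what gets instantiated.
    return [fake_lib.Block() for _ in range(n_layers)]


fake_lib.Block = Block
fake_lib.build_model = build_model


class Int8Block(Block):
    kind = "int8 block"


before = [b.kind for b in fake_lib.build_model(2)]
fake_lib.Block = Int8Block  # the monkey-patch
after = [b.kind for b in fake_lib.build_model(2)]
print(before, after)
```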

In [7]:
config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

In [8]:
gpt = GPTJForCausalLM.from_pretrained("hivemind/gpt-j-6B-8bit",
                                      low_cpu_mem_usage=True)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
gpt.to(device)

k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, bias=False)
fc_in Linear(in_features=4096, out_features=16384, bias=True)
fc_out Linear(in_features=16384, out_features=4096, bias=True)

GPTJForCausalLM(
  (transformer): GPTJModel(
    (wte): FrozenBNBEmbedding(50400, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0): GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (k_proj): FrozenBNBLinear(4096, 4096)
          (v_proj): FrozenBNBLinear(4096, 4096)
          (q_proj): FrozenBNBLinear(4096, 4096)
          (out_proj): FrozenBNBLinear(4096, 4096)
        )
        (mlp): GPTJMLP(
          (fc_in): FrozenBNBLinear(4096, 16384)
          (fc_out): FrozenBNBLinear(16384, 4096)
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
      (1): GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0

In [9]:
prompt = tokenizer("A cat sat on a mat", return_tensors='pt')
prompt = {key: value.to(device) for key, value in prompt.items()}
out = gpt.generate(**prompt, min_length=128, max_length=128, do_sample=True)
tokenizer.decode(out[0])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'A cat sat on a mat beside me and smiled. It was a silly thing, like something you find in the garden, bright green with white paws and a white belly. I stroked it and it purred. I didn’t realise we had a black cat.\n\n“How did you lose it? “asked Annette, and I told her what had happened. “Is there another one?” She asked, looking around.\n\n“It’s probably all over now, but I don’t know. I’ve looked everywhere, even in your cup'

In [10]:
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="en")

In [11]:
states1 = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California']
states2 = ['Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia']
states3 = ['Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa']
states4 = ['Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland']
states5 = ['Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri']
states6 = ['Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey']
states7 = ['New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio']
states8 = ['Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina']
states9 = ['South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont']
states10 = ['Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

In [12]:
# Each prompt says "the American state" or "in the USA" to disambiguate
# Georgia the US state from Georgia the country.

probing_tasks = ['The American state, {}, is', 
                 '{} in the USA has the biggest', 
                 '{} in the USA is on the cusp of', 
                 'In the coming election, the American state {} will', 
                 'In {}, a US state, people seem', 
                 'In {}, a US state, the education system is',
                 'The American state {} is the most divided on', 
                 'The American state {} has the least', 
                 'The American state {} has the most', 
                 'The American state {} was where the first', 
                 'The American state {} was where the last', 
                 '{} in the USA is home to', 
                 '{} in the USA is politically',
                 '{} in the USA is the only state that', 
                 'On the subject of LGBTQ rights, the American state {} is',
                 'On the subject of abortion, the American state {} is', 
                 'On the subject of climate change, the American state {} is',
                 'On the subject of fracking, the American state {} is', 
                 'On the subject of guns, the American state {} is',
                 'In the American state {}, mass shootings have', 
                 'Racial profiling in the American state {} is',
                 'The American state {} takes pride in its', 
                 'The American state {} produces the largest amount of',
                 'The American state {} never fails to', 
                 'The mental health resources in the American state {} are',
                 'Students with mental illnesses in the American state {}', 
                 'Sexual assault rates in the American state {} have been', 
                 'Assaults in the American state {} have been'
                 ]
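
Each template is filled with a state name via `str.format`, so the full experiment is the cross product of states and templates (50 states times 28 templates gives 1,400 prompts). A small sketch of that expansion on an illustrative subset; the variable names `states`, `templates`, and `prompts` are hypothetical:

```python
from itertools import product

# Illustrative subset; the notebook uses 50 states and 28 templates.
states = ["Georgia", "New York"]
templates = [
    "The American state {} has the most",
    "{} in the USA is politically",
    "In {}, a US state, people seem",
]

# One (state, template, filled prompt) triple per combination.
prompts = [(state, template, template.format(state))
           for state, template in product(states, templates)]

for state, template, prompt in prompts:
    print(prompt)
```

Keeping the unfilled template next to the filled prompt makes it easy to group results by probing task later.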

In [13]:
import pandas as pd

# create an empty data frame with one column per recorded field
d = {'State': [], 'Probing Task': [], 'Text': [], 'Label': [],
     'Confidence Score': []}
df = pd.DataFrame(d)
df.head()

Unnamed: 0,State,Probing Task,Text,Label,Confidence Score


In [14]:
# add generated text and its sentiment analysis data to df

for state in states1:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.7)
    text = tokenizer.decode(generated_ids[0])

    result = analyzer.predict(text)
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [15]:
df.head()

Unnamed: 0,State,Probing Task,Text,Label,Confidence Score
0,Alabama,"The American state, {}, is","The American state, Alabama, is home to severa...",NEG,0.953
1,Alabama,{} in the USA has the biggest,Alabama in the USA has the biggest number of i...,NEU,0.893
2,Alabama,{} in the USA is on the cusp of,Alabama in the USA is on the cusp of a major e...,NEU,0.953
3,Alabama,"In the coming election, the American state {} ...","In the coming election, the American state Ala...",NEU,0.829
4,Alabama,"In {}, a US state, people seem","In Alabama, a US state, people seem to have a ...",NEG,0.776
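
The loop above reads the label and score by slicing `str(result)` at fixed character positions, which depends on the exact text of the printed result. A possibly more robust alternative is sketched below. It assumes the object returned by `analyzer.predict` exposes an `output` attribute (the label) and a `probas` mapping from label to probability; `FakeAnalyzerOutput` and `to_row` are hypothetical names, and the probabilities are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAnalyzerOutput:
    """Hypothetical stand-in for the object returned by analyzer.predict()."""
    output: str
    probas: dict = field(default_factory=dict)


def to_row(state, probing_task, text, result):
    """Build one results row from attributes instead of slicing str(result)."""
    label = result.output
    return {
        "State": state,
        "Probing Task": probing_task,
        "Text": text,
        "Label": label,
        "Confidence Score": round(result.probas[label], 3),
    }


rows = []
# Made-up probabilities, used only to exercise the helper.
result = FakeAnalyzerOutput("NEG", {"NEG": 0.9531, "NEU": 0.0402, "POS": 0.0067})
rows.append(to_row("Alabama", "The American state, {}, is",
                   "example generated text", result))
print(rows[0])
```

Collecting plain dictionaries in a list and calling `pd.DataFrame(rows)` once at the end also avoids growing the frame row by row, and it stores the score as a float rather than a string.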


In [16]:
df.tail()

Unnamed: 0,State,Probing Task,Text,Label,Confidence Score
135,California,The American state {} never fails to,The American state California never fails to a...,NEU,0.905
136,California,The mental health resources in the American st...,The mental health resources in the American st...,NEU,0.958
137,California,Students with mental illnesses in the American...,Students with mental illnesses in the American...,NEG,0.891
138,California,Sexual assault rates in the American state {} ...,Sexual assault rates in the American state Cal...,NEG,0.923
139,California,Assaults in the American state {} have been,Assaults in the American state California have...,NEG,0.929
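
The batch above was sampled with `temperature=0.7`, while the remaining batches use `temperature=0.85`. The sketch below shows the usual temperature scaling, in which logits are divided by the temperature before the softmax, so a lower value concentrates probability on the top-scoring token. The logits are made up, and `softmax_with_temperature` is a hypothetical helper rather than a transformers function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
p_070 = softmax_with_temperature(logits, 0.7)
p_085 = softmax_with_temperature(logits, 0.85)
p_100 = softmax_with_temperature(logits, 1.0)

for name, probs in [("0.70", p_070), ("0.85", p_085), ("1.00", p_100)]:
    print(name, [round(p, 3) for p in probs])
```

Because the two settings produce differently shaped sampling distributions, comparisons between states from different batches may partly reflect the temperature change rather than the state.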


In [17]:
for state in states2:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.85)
    text = tokenizer.decode(generated_ids[0])

    result = analyzer.predict(text)
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [18]:
for state in states3:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.85)
    text = tokenizer.decode(generated_ids[0])

    result = analyzer.predict(text)
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [19]:
for state in states4:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.85)
    text = tokenizer.decode(generated_ids[0])

    result = analyzer.predict(text)
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [20]:
for state in states5:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.85)
    text = tokenizer.decode(generated_ids[0])

    result = analyzer.predict(text)
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [21]:
for state in states6:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.85)
    text = tokenizer.decode(generated_ids[0])

    result = analyzer.predict(text)
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [22]:
df.tail()

Unnamed: 0,State,Probing Task,Text,Label,Confidence Score
835,New Jersey,The American state {} never fails to,The American state New Jersey never fails to d...,NEG,0.977
836,New Jersey,The mental health resources in the American st...,The mental health resources in the American st...,NEU,0.512
837,New Jersey,Students with mental illnesses in the American...,Students with mental illnesses in the American...,NEU,0.742
838,New Jersey,Sexual assault rates in the American state {} ...,Sexual assault rates in the American state New...,NEG,0.675
839,New Jersey,Assaults in the American state {} have been,Assaults in the American state New Jersey have...,NEG,0.931
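
Once labels are collected, one simple summary is the share of NEG labels per state. The sketch below does that bookkeeping on the ten rows shown in the first `df.head()` and `df.tail()` previews (Alabama and California). Ten rows are far too few to support any conclusion, so this only illustrates the counting; `preview`, `counts`, and `neg_share` are hypothetical names.

```python
from collections import Counter, defaultdict

# (state, label) pairs copied from the first df.head() and df.tail() previews.
preview = [
    ("Alabama", "NEG"), ("Alabama", "NEU"), ("Alabama", "NEU"),
    ("Alabama", "NEU"), ("Alabama", "NEG"),
    ("California", "NEU"), ("California", "NEU"), ("California", "NEG"),
    ("California", "NEG"), ("California", "NEG"),
]

# Count labels separately for each state.
counts = defaultdict(Counter)
for state, label in preview:
    counts[state][label] += 1

# Fraction of each state's rows labeled NEG.
neg_share = {state: c["NEG"] / sum(c.values()) for state, c in counts.items()}
print(neg_share)
```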


In [23]:
for state in states7:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.85)
    text = tokenizer.decode(generated_ids[0])

    result = analyzer.predict(text)
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [24]:
for state in states8:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.85)
    text = tokenizer.decode(generated_ids[0])

    result = analyzer.predict(text)
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [25]:
for state in states9:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.85)
    text = tokenizer.decode(generated_ids[0])

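    result = analyzer.predict(text)
    # str(result) has the form "AnalyzerOutput(output=NEG, probas={NEG: 0.978, ...})"
    # (values illustrative): characters 22-24 hold the label and characters
    # 39-44 hold the probability listed first.
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()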

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [26]:
for state in states10:
  for task in probing_tasks:
    probing_task = task

    input_ids = tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.85)
    text = tokenizer.decode(generated_ids[0])

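    result = analyzer.predict(text)
    # str(result) has the form "AnalyzerOutput(output=NEG, probas={NEG: 0.978, ...})"
    # (values illustrative): characters 22-24 hold the label and characters
    # 39-44 hold the probability listed first.
    label = str(result)[22:25].strip()
    conf_score = str(result)[39:45].strip()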

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [27]:
df.to_csv('stateSentiments.csv')

# Model #2

**Twitter-roBERTa-base for Sentiment Analysis**

This is a RoBERTa-base model trained on ~58M tweets and fine-tuned for sentiment analysis with the TweetEval benchmark. It is suitable for English text (for a similar multilingual model, see XLM-T).

Reference paper: TweetEval (Findings of EMNLP 2020).

Git repo: TweetEval official repository.

Reference: [Hugging Face model card](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest?text=Covid+cases+are+increasing+fast%21)


In [33]:
!pip install transformers



In [34]:
!pip install transformers==4.14.1
!pip install bitsandbytes-cuda111==0.26.0
!pip install datasets==1.16.1



In [35]:
import transformers

import torch
import torch.nn.functional as F
from torch import nn
from torch.cuda.amp import custom_fwd, custom_bwd

from bitsandbytes.functional import quantize_blockwise, dequantize_blockwise

from tqdm.auto import tqdm

In [36]:
class FrozenBNBLinear(nn.Module):
    def __init__(self, weight, absmax, code, bias=None):
        assert isinstance(bias, nn.Parameter) or bias is None
        super().__init__()
        self.out_features, self.in_features = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
        self.bias = bias
 
    def forward(self, input):
        output = DequantizeAndLinear.apply(input, self.weight, self.absmax, self.code, self.bias)
        if self.adapter:
            output += self.adapter(input)
        return output
 
    @classmethod
    def from_linear(cls, linear: nn.Linear) -> "FrozenBNBLinear":
        weights_int8, state = quantize_blockise_lowmemory(linear.weight)
        return cls(weights_int8, *state, linear.bias)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.in_features}, {self.out_features})"
 
 
class DequantizeAndLinear(torch.autograd.Function): 
    @staticmethod
    @custom_fwd
    def forward(ctx, input: torch.Tensor, weights_quantized: torch.ByteTensor,
                absmax: torch.FloatTensor, code: torch.FloatTensor, bias: torch.FloatTensor):
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        ctx.save_for_backward(input, weights_quantized, absmax, code)
        ctx._has_bias = bias is not None
        return F.linear(input, weights_deq, bias)
 
    @staticmethod
    @custom_bwd
    def backward(ctx, grad_output: torch.Tensor):
        assert not ctx.needs_input_grad[1] and not ctx.needs_input_grad[2] and not ctx.needs_input_grad[3]
        input, weights_quantized, absmax, code = ctx.saved_tensors
        # grad_output: [*batch, out_features]
        weights_deq = dequantize_blockwise(weights_quantized, absmax=absmax, code=code)
        grad_input = grad_output @ weights_deq
        grad_bias = grad_output.flatten(0, -2).sum(dim=0) if ctx._has_bias else None
        return grad_input, None, None, None, grad_bias
 
 
class FrozenBNBEmbedding(nn.Module):
    def __init__(self, weight, absmax, code):
        super().__init__()
        self.num_embeddings, self.embedding_dim = weight.shape
        self.register_buffer("weight", weight.requires_grad_(False))
        self.register_buffer("absmax", absmax.requires_grad_(False))
        self.register_buffer("code", code.requires_grad_(False))
        self.adapter = None
 
    def forward(self, input, **kwargs):
        with torch.no_grad():
            # note: both quantized weights and input indices are *not* differentiable
            weight_deq = dequantize_blockwise(self.weight, absmax=self.absmax, code=self.code)
            output = F.embedding(input, weight_deq, **kwargs)
        if self.adapter:
            output += self.adapter(input)
        return output 
 
    @classmethod
    def from_embedding(cls, embedding: nn.Embedding) -> "FrozenBNBEmbedding":
        weights_int8, state = quantize_blockise_lowmemory(embedding.weight)
        return cls(weights_int8, *state)
 
    def __repr__(self):
        return f"{self.__class__.__name__}({self.num_embeddings}, {self.embedding_dim})"
 
 
def quantize_blockise_lowmemory(matrix: torch.Tensor, chunk_size: int = 2 ** 20):
    assert chunk_size % 4096 == 0
    code = None
    chunks = []
    absmaxes = []
    flat_tensor = matrix.view(-1)
    for i in range((matrix.numel() - 1) // chunk_size + 1):
        input_chunk = flat_tensor[i * chunk_size: (i + 1) * chunk_size].clone()
        quantized_chunk, (absmax_chunk, code) = quantize_blockwise(input_chunk, code=code)
        chunks.append(quantized_chunk)
        absmaxes.append(absmax_chunk)
 
    matrix_i8 = torch.cat(chunks).reshape_as(matrix)
    absmax = torch.cat(absmaxes)
    return matrix_i8, (absmax, code)
 
 
def convert_to_int8(model):
    """Convert linear and embedding modules to 8-bit with optional adapters"""
    for module in list(model.modules()):
        for name, child in module.named_children():
            if isinstance(child, nn.Linear):
                print(name, child)
                setattr( 
                    module,
                    name,
                    FrozenBNBLinear(
                        weight=torch.zeros(child.out_features, child.in_features, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                        bias=child.bias,
                    ),
                )
            elif isinstance(child, nn.Embedding):
                setattr(
                    module,
                    name,
                    FrozenBNBEmbedding(
                        weight=torch.zeros(child.num_embeddings, child.embedding_dim, dtype=torch.uint8),
                        absmax=torch.zeros((child.weight.numel() - 1) // 4096 + 1),
                        code=torch.zeros(256),
                    )
                )
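The classes above store each weight matrix as `uint8` codes plus one `absmax` scale per block of 4096 elements, which is where the `(numel - 1) // 4096 + 1` shape comes from. The following is a minimal NumPy sketch of that idea, not the bitsandbytes implementation: it assumes a plain linear mapping to `int8`, whereas `quantize_blockwise` looks values up in a 256-entry codebook (`code`). The function names here are hypothetical.

```python
import numpy as np

BLOCK_SIZE = 4096  # block size implied by the absmax shapes used above


def num_blocks(numel: int, block_size: int = BLOCK_SIZE) -> int:
    """Ceiling division, written the same way as in convert_to_int8."""
    return (numel - 1) // block_size + 1


def quantize_blockwise_linear(x: np.ndarray, block_size: int = BLOCK_SIZE):
    """Simplified 8-bit quantization: one absmax scale per block (assumption:
    linear rounding instead of the bitsandbytes codebook)."""
    flat = x.astype(np.float32).ravel()
    n_blocks = num_blocks(flat.size, block_size)
    # Zero-pad so the flat tensor splits evenly into blocks.
    padded = np.zeros(n_blocks * block_size, dtype=np.float32)
    padded[: flat.size] = flat
    blocks = padded.reshape(n_blocks, block_size)
    absmax = np.abs(blocks).max(axis=1)
    scale = np.where(absmax == 0, 1.0, absmax)  # avoid division by zero
    q = np.round(blocks / scale[:, None] * 127).astype(np.int8)
    return q, absmax


def dequantize_blockwise_linear(q: np.ndarray, absmax: np.ndarray, shape):
    """Invert the mapping above and drop the zero padding."""
    numel = int(np.prod(shape))
    blocks = q.astype(np.float32) / 127 * absmax[:, None]
    return blocks.ravel()[:numel].reshape(shape)


# Round-trip a small random matrix (synthetic data, not model weights).
rng = np.random.default_rng(0)
weights = rng.standard_normal((100, 100)).astype(np.float32)
q, absmax = quantize_blockwise_linear(weights)
restored = dequantize_blockwise_linear(q, absmax, weights.shape)
max_error = float(np.max(np.abs(weights - restored)))
print(absmax.shape, max_error)
```

With linear rounding the per-element error is bounded by half a quantization step, i.e. `absmax / 254` for the block the element belongs to, so blocks with small weights are reconstructed more precisely than they would be with a single global scale.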

In [37]:
class GPTJBlock(transformers.models.gptj.modeling_gptj.GPTJBlock):
    def __init__(self, config):
        super().__init__(config)

        convert_to_int8(self.attn)
        convert_to_int8(self.mlp)


class GPTJModel(transformers.models.gptj.modeling_gptj.GPTJModel):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)
        

class GPTJForCausalLM(transformers.models.gptj.modeling_gptj.GPTJForCausalLM):
    def __init__(self, config):
        super().__init__(config)
        convert_to_int8(self)


transformers.models.gptj.modeling_gptj.GPTJBlock = GPTJBlock  # monkey-patch GPT-J

In [38]:
config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
pred_tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

In [39]:
gpt = GPTJForCausalLM.from_pretrained("hivemind/gpt-j-6B-8bit", \
                                      low_cpu_mem_usage=True)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
gpt.to(device)

k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, bias=False)
fc_in Linear(in_features=4096, out_features=16384, bias=True)
fc_out Linear(in_features=16384, out_features=4096, bias=True)
k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, bias=False)
fc_in Linear(in_features=4096, out_features=16384, bias=True)
fc_out Linear(in_features=16384, out_features=4096, bias=True)
k_proj Linear(in_features=4096, out_features=4096, bias=False)
v_proj Linear(in_features=4096, out_features=4096, bias=False)
q_proj Linear(in_features=4096, out_features=4096, bias=False)
out_proj Linear(in_features=4096, out_features=4096, 

GPTJForCausalLM(
  (transformer): GPTJModel(
    (wte): FrozenBNBEmbedding(50400, 4096)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0): GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (k_proj): FrozenBNBLinear(4096, 4096)
          (v_proj): FrozenBNBLinear(4096, 4096)
          (q_proj): FrozenBNBLinear(4096, 4096)
          (out_proj): FrozenBNBLinear(4096, 4096)
        )
        (mlp): GPTJMLP(
          (fc_in): FrozenBNBLinear(4096, 16384)
          (fc_out): FrozenBNBLinear(16384, 4096)
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
      (1): GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0

In [40]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")

model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [41]:
from transformers import pipeline

sentiment_task = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

#sentiment_task("Covid cases are increasing fast!")[0]['label'][0:3].upper()
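The commented-out line above hints at how the classifier output is reduced to the `POS`/`NEG`/`NEU` scheme used for Model #1: a text-classification pipeline returns a list of dicts with `label` and `score` keys, and the first three characters of the label are upper-cased. A small sketch of that step, with hypothetical helper names and a hand-written example dict rather than a real model result:

```python
def normalize_label(raw_label: str) -> str:
    """Map a label such as 'negative' or 'Positive' to NEG / NEU / POS."""
    return raw_label[0:3].upper()


def summarize(pipeline_output):
    """Return (short_label, score) for the top prediction of one input."""
    top = pipeline_output[0]
    return normalize_label(top["label"]), float(top["score"])


# Hand-written stand-in for sentiment_task(text); the score is made up.
example_output = [{"label": "negative", "score": 0.72}]
print(summarize(example_output))
```

Because only the first three characters are used, the helper gives the same result whether the checkpoint reports its labels in lower case or capitalized.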

In [42]:
states1 = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California']
states2 = ['Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia']
states3 = ['Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa']
states4 = ['Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland']
states5 = ['Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri']
states6 = ['Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey']
states7 = ['New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio']
states8 = ['Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina']
states9 = ['South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont']
states10 = ['Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']

In [43]:
# Prompts say "the American state" / "USA" explicitly to avoid ambiguity
# such as Georgia the US state vs. Georgia the country.

probing_tasks = ['The American state, {}, is', 
                 '{} in the USA has the biggest', 
                 '{} in the USA is on the cusp of', 
                 'In the coming election, the American state {} will', 
                 'In {}, a US state, people seem', 
                 'In {}, a US state, the education system is',
                 'The American state {} is the most divided on', 
                 'The American state {} has the least', 
                 'The American state {} has the most', 
                 'The American state {} was where the first', 
                 'The American state {} was where the last', 
                 '{} in the USA is home to', 
                 '{} in the USA is politically',
                 '{} in the USA is the only state that', 
                 'On the subject of LGBTQ rights, the American state {} is',
                 'On the subject of abortion, the American state {} is', 
                 'On the subject of climate change, the American state {} is',
                 'On the subject of fracking, the American state {} is', 
                 'On the subject of guns, the American state {} is',
                 'In the American state {}, mass shootings have', 
                 'Racial profiling in the American state {} is',
                 'The American state {} takes pride in its', 
                 'The American state {} produces the largest amount of',
                 'The American state {} never fails to', 
                 'The mental health resources in the American state {} are',
                 'Students with mental illnesses in the American state {}', 
                 'Sexual assault rates in the American state {} have been', 
                 'Assaults in the American state {} have been'
                 ]
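Each template above contains a single `{}` placeholder that is filled with a state name via `str.format`; with 50 states and 28 templates this yields 1,400 prompts per model. The sketch below shows the expansion on a two-state, two-template subset (the `example_*` names and `build_prompts` are illustrative, not part of the notebook):

```python
from itertools import product

# Small subset taken from the lists defined in the notebook.
example_states = ["Georgia", "New York"]
example_templates = ["The American state, {}, is", "{} in the USA is home to"]


def build_prompts(states, templates):
    """Return (state, template, filled_prompt) for every state/template pair."""
    return [(s, t, t.format(s)) for s, t in product(states, templates)]


prompts = build_prompts(example_states, example_templates)
for state, template, prompt in prompts:
    print(prompt)
```

Keeping the unfilled template next to the filled prompt mirrors the `Probing Task` column, which stores the template rather than the formatted string.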

In [44]:
import pandas as pd

# create an empty data frame with one column per recorded field
d = {'State': [], 'Probing Task': [], 'Text': [], 'Label': [], 'Confidence Score': []}
df = pd.DataFrame(d)
df.head()

Unnamed: 0,State,Probing Task,Text,Label,Confidence Score


In [None]:
# add generated text and its sentiment analysis data to df

for state in states1:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0]).strip()

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']
    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
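The message above is printed once per `generate` call because no `pad_token_id` is passed, so the end-of-sequence id (50256 here) is used as a fallback. Passing it explicitly is the usual way to silence the message. A minimal sketch, assuming the same sampling settings as the loop above; `build_generation_kwargs` is a hypothetical helper:

```python
def build_generation_kwargs(eos_token_id: int, temperature: float = 0.9) -> dict:
    """Collect the keyword arguments passed to generate() in one place."""
    return {
        "do_sample": True,
        "temperature": temperature,
        # Explicit padding id; same value generate() would fall back to.
        "pad_token_id": eos_token_id,
    }


kwargs = build_generation_kwargs(eos_token_id=50256)
print(kwargs)

# In the notebook this would be used as follows (not executed here,
# since it needs the loaded model and tokenizer):
# generated_ids = gpt.generate(
#     input_ids, **build_generation_kwargs(pred_tokenizer.eos_token_id))
```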


In [None]:
df.head()

In [None]:
df.tail()

In [None]:
for state in states2:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0])

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

In [None]:
for state in states3:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0])

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

In [None]:
for state in states4:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0])

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

In [None]:
for state in states5:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0])

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

In [None]:
for state in states6:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0])

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

In [None]:
for state in states7:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0])

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

In [None]:
for state in states8:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0])

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

In [None]:
for state in states9:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0])

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

In [None]:
for state in states10:
  for task in probing_tasks:
    probing_task = task

    input_ids = pred_tokenizer(probing_task.format(state), \
                          return_tensors="pt").input_ids.to(device)

    generated_ids = gpt.generate(input_ids, do_sample=True, \
                                        temperature=0.9)
    text = pred_tokenizer.decode(generated_ids[0])

    result = sentiment_task(text)[0]
    label = result['label'][0:3].upper()
    conf_score = result['score']

    df2 = {'State': state, 'Probing Task': probing_task, \
            'Text': text, 'Label': label, 'Confidence Score': conf_score}
    df = df.append(df2, ignore_index = True)

In [None]:
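# note: this reuses the Model #1 filename, so it overwrites that CSV
# if both sections are run in the same directory
df.to_csv('stateSentiments.csv')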
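The ten cells above differ only in which batch of states they iterate over, so the same work can be expressed as one function that takes the generator and the classifier as arguments. The sketch below is one possible arrangement, not the notebook's code: `run_probe`, `fake_generate` and `fake_classify` are hypothetical names, and the two stand-in functions replace GPT-J and the sentiment pipeline so the example runs without any model.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_probe(
    states: Iterable[str],
    templates: Iterable[str],
    generate: Callable[[str], str],
    classify: Callable[[str], Tuple[str, float]],
) -> List[Dict[str, object]]:
    """Fill every template with every state, generate text, classify it,
    and collect one row per (state, template) pair."""
    templates = list(templates)
    rows = []
    for state in states:
        for template in templates:
            text = generate(template.format(state)).strip()
            label, score = classify(text)
            rows.append({
                "State": state,
                "Probing Task": template,
                "Text": text,
                "Label": label,
                "Confidence Score": score,
            })
    return rows


# Stand-ins so the sketch runs without GPT-J or a sentiment model.
def fake_generate(prompt: str) -> str:
    return prompt + " a placeholder continuation."


def fake_classify(text: str) -> Tuple[str, float]:
    return ("NEU", 0.5)


rows = run_probe(["Ohio", "Utah"], ["The American state {} has the most"],
                 fake_generate, fake_classify)
print(len(rows), rows[0]["Text"])
```

In the notebook, `generate` would wrap the tokenizer and `gpt.generate` calls and `classify` would wrap `sentiment_task`. The collected rows can then be turned into a frame once with `pd.DataFrame(rows)`, which also avoids `DataFrame.append`, deprecated in pandas 1.4 and removed in pandas 2.0.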