# Sami Abdelazim - JC Foster

We consider four groups of Twitter users: news organizations, think tanks, government officials, and oil companies.

We also consider the hourly price of oil from May 11 to May 13, 2022.

In this notebook, for each group, we load the Twitter data that we collected and then predict a sentiment score for each tweet with the model that we previously fine-tuned.

Afterwards, we compute a custom engagement metric for each tweet. We sum the metric for each group over each hour and build a multidimensional time series with one entry per hour and four data points per entry, one for each group's aggregate engagement metric.
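
The aggregation described above can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: the function name `hourly_series`, the tuple layout, and the sample scores are all made up for the example, and the per-tweet scores are assumed to be computed already. It follows the windowing used in the loop later in this notebook, where entry j collects the tweets that fall between hour key j-1 and hour key j.

```python
from bisect import bisect_right

GROUPS = ["news", "oil", "think", "gov"]

def hourly_series(hour_keys, scored_tweets):
    """Return one row per hour key with one summed score per group.

    hour_keys: sorted ints of the form YYYYMMDDHH.
    scored_tweets: iterable of (group, hour_key, score) tuples (hypothetical layout).
    Row j collects tweets with hour_keys[j-1] <= key < hour_keys[j];
    row 0 has no preceding window and stays at zero.
    """
    rows = [{group: 0.0 for group in GROUPS} for _ in hour_keys]
    for group, key, score in scored_tweets:
        j = bisect_right(hour_keys, key)  # index of the first hour key > key
        if 1 <= j < len(hour_keys):       # tweets outside every window are dropped
            rows[j][group] += score
    return rows

# Made-up sample data, for illustration only.
sample_keys = [2022051100, 2022051101, 2022051102]
sample_tweets = [
    ("news", 2022051100, -6.0),
    ("news", 2022051100, 1.5),
    ("gov", 2022051101, 1.0),
    ("oil", 2022051102, 3.0),  # at or after the last key, so it is dropped
]
series = hourly_series(sample_keys, sample_tweets)
```

Each row of `series` is one entry of the time series, with four values per entry.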


In [2]:
from google.colab import drive
drive.mount('/content/drive')

! pip install ftfy
! pip install transformers
import io
import os
import re
import torch
import pandas as pd
import nltk
import sklearn
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from tqdm.notebook import tqdm
from ftfy import fix_text
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from transformers import (set_seed,
                          TrainingArguments,
                          Trainer,
                          GPT2Config,
                          GPT2Tokenizer,
                          AdamW, 
                          get_linear_schedule_with_warmup,
                          GPT2ForSequenceClassification)

Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
[?25l[K     |██████▏                         | 10 kB 20.2 MB/s eta 0:00:01[K     |████████████▍                   | 20 kB 23.9 MB/s eta 0:00:01[K     |██████████████████▌             | 30 kB 28.5 MB/s eta 0:00:01[K     |████████████████████████▊       | 40 kB 16.7 MB/s eta 0:00:01[K     |██████████████████████████████▉ | 51 kB 18.8 MB/s eta 0:00:01[K     |████████████████████████████████| 53 kB 1.7 MB/s 
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1
Collecting transformers
  Downloading transformers-4.19.1-py3-none-any.whl (4.2 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_6

In [3]:
# PyTorch Dataset to pass unseen data to Sentiment Analysis model
class TwitterData(Dataset):
  def __init__(self, df):
    self.texts = df['text'].values
    self.labels = df['target'].values
    self.n_examples = len(self.labels)

  def __len__(self):
    return self.n_examples

  def __getitem__(self, item):
    return {'text':self.texts[item],
            'label':self.labels[item]}

In [4]:
## taken from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

class Gpt2ClassificationCollator(object):
    r"""
    Data collator used for GPT-2 in a classification task.

    It uses a given tokenizer to convert a batch of texts and labels into
    tensors that can go straight into a GPT-2 model.

    This class is built with reusability in mind: it can be used as is as long
    as the `dataloader` outputs a batch in dictionary format that can be passed
    straight into the model - `model(**batch)`.

    Arguments:

      use_tokenizer (:obj:`transformers.tokenization_?`):
          Transformer type tokenizer used to process raw text into numbers.

      max_sequence_len (:obj:`int`, `optional`)
          Value to indicate the maximum desired sequence to truncate or pad text
          sequences. If no value is passed it will use the maximum sequence size
          supported by the tokenizer and model.

    """

    def __init__(self, use_tokenizer, max_sequence_len=None):

        # Tokenizer to be used inside the class.
        self.use_tokenizer = use_tokenizer
        # Check max sequence length.
        self.max_sequence_len = use_tokenizer.model_max_length if max_sequence_len is None else max_sequence_len
        return

    def __call__(self, sequences):
        r"""
        This function allows the class object to be used as a function call.
        Since the PyTorch DataLoader needs a collator function, this class
        can be used as one.

        Arguments:

          sequences (:obj:`list`):
              List of dictionaries, each with a `text` and a `label` entry.

        Returns:
          :obj:`Dict[str, object]`: Dictionary of inputs that feed into the model.
          It supports the call `model(**returned_dictionary)`.
        """

        # Get all texts from sequences list.
        #print(sequences)
        texts = [sequence['text'] for sequence in sequences]
        # Get all labels from sequences list.
        labels = [sequence['label'] for sequence in sequences]
        # Call tokenizer on all texts to convert into tensors of numbers with 
        # appropriate padding.
        inputs = self.use_tokenizer(text=texts, return_tensors="pt", padding=True, truncation=True,  max_length=self.max_sequence_len)
        # Update the inputs with the associated encoded labels as tensor.
        inputs.update({'labels':torch.tensor(labels)})

        return inputs
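
To make the padding step concrete, here is a rough pure-Python sketch of what the collator's tokenizer call produces for a batch, given that this notebook sets `padding_side = "left"` and reuses the end-of-text token (id 50256) as the pad token. The helper `left_pad_batch` and the token ids are hypothetical, the real tokenizer returns tensors rather than lists, and the sketch assumes truncation drops tokens from the end of a sequence.

```python
def left_pad_batch(token_ids, pad_id, max_len):
    """Truncate each sequence to max_len, then left-pad to the longest one."""
    truncated = [ids[:max_len] for ids in token_ids]
    width = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = width - len(ids)
        # Padding goes on the left; the mask marks real tokens with 1.
        input_ids.append([pad_id] * n_pad + ids)
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}

PAD_ID = 50256  # GPT-2 end-of-text id, reused as the pad token in this notebook

# Made-up token ids, for illustration only.
batch = left_pad_batch([[11, 12, 13, 14, 15], [21, 22]], pad_id=PAD_ID, max_len=4)
```

Here the first sequence is cut to four tokens and the second receives two pad tokens on its left, with zeros in the matching mask positions.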

In [5]:
# Set seed for reproducibility.
set_seed(123)
batch_size = 32
device = 'cpu'
model_name_or_path = 'gpt2'

In [6]:
# from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

# Get model configuration.
print('Loading configuraiton...')
model_config = GPT2Config.from_pretrained(pretrained_model_name_or_path=model_name_or_path, num_labels=2)

# Get model's tokenizer.
print('Loading tokenizer...')
tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path=model_name_or_path)
# default to left padding
tokenizer.padding_side = "left"
# Define PAD Token = EOS Token = 50256
tokenizer.pad_token = tokenizer.eos_token

# Get the actual model.
print('Loading model...')
model = GPT2ForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_name_or_path, config=model_config)

# resize model embedding to match new tokenizer
model.resize_token_embeddings(len(tokenizer))

# fix model padding token id
model.config.pad_token_id = model.config.eos_token_id

# Load model to defined device.
model.to(device)
print('Model loaded to `%s`'%device)

optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # default is 1e-8.
                  )

Loading configuraiton...


Loading tokenizer...


Loading model...


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded to `cpu`




In [7]:
# from https://gmihaila.github.io/tutorial_notebooks/gpt2_finetune_classification/

def validation(dataloader, device_):
  global model

  # Tracking variables
  predictions_labels = []
  true_labels = []
  #total loss for this epoch.
  total_loss = 0

  # Put the model in evaluation mode--the dropout layers behave differently
  # during evaluation.
  model.eval()

  # Evaluate data for one epoch
  for batch in tqdm(dataloader, total=len(dataloader)):
    # add original labels
    true_labels += batch['labels'].numpy().flatten().tolist()

    # move batch to device
    batch = {k:v.type(torch.long).to(device_) for k,v in batch.items()}

    # Telling the model not to compute or store gradients, saving memory and
    # speeding up validation

    with torch.no_grad():
      # Forward pass. Because the batch includes a `labels` entry, the model
      # returns the loss first and the logits second.
      outputs = model(**batch)
      loss, logits = outputs[:2]
      logits = logits.detach().cpu().numpy()
      total_loss += loss.item()
      predict_content = logits.argmax(axis=-1).flatten().tolist()
      predictions_labels += predict_content

  # Calculate the average loss over all batches in the dataloader.
  avg_epoch_loss = total_loss / len(dataloader)
  # Return all true labels and predictions for future evaluations.
  return true_labels, predictions_labels, avg_epoch_loss
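
The prediction step reduces each row of two-class logits to a label with an argmax, and a later cell maps label 1 to +1 and anything else to -1. A small sketch of those two steps together, in plain Python with made-up logits (the helper name `logits_to_sentiment` is hypothetical):

```python
def logits_to_sentiment(logits):
    """Map each row of two-class logits to +1 (class 1) or -1 (class 0)."""
    signs = []
    for row in logits:
        # argmax over the row; ties resolve to the first index
        predicted = max(range(len(row)), key=row.__getitem__)
        signs.append(1 if predicted == 1 else -1)
    return signs

# Made-up logits, for illustration only.
example_logits = [[0.2, 1.7], [2.1, -0.3], [0.0, 0.0]]
signs = logits_to_sentiment(example_logits)
```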

In [8]:
output_model = 'drive/MyDrive/DS-301_PROJECT/twitter_SA_lw.pth'

# load sentiment analysis model that was previously finetuned
checkpoint = torch.load(output_model, map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

In [9]:
# we want a way to order the dates that we are interested in
# note that this is virtually hard coded to the dates we have chosen in our data
# this will not run on new data

dates = []
og = '202205'  # year and month prefix (May 2022)
for day in [11, 12, 13]:
  # hours 17-19 are skipped
  for hour in list(range(17)) + list(range(20, 24)):
    # on May 13 stop after hour 16
    if day == 13 and hour > 16:
      break
    # zero-pad the hour so every key has the form YYYYMMDDHH
    dates.append(int(f"{og}{day}{hour:02d}"))
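
As an alternative that is less tied to these specific days, the same keys could be generated with `datetime`. This is only a sketch: the function name `hour_keys` and its parameters are made up, and the skipped hours (17 to 19) are simply copied from the loop above.

```python
from datetime import datetime, timedelta

def hour_keys(start, end, skipped_hours=frozenset({17, 18, 19})):
    """Return integer keys YYYYMMDDHH for every hour from start to end inclusive,
    leaving out the hours listed in skipped_hours."""
    keys = []
    current = start
    while current <= end:
        if current.hour not in skipped_hours:
            keys.append(int(current.strftime("%Y%m%d%H")))
        current += timedelta(hours=1)
    return keys

keys = hour_keys(datetime(2022, 5, 11, 0), datetime(2022, 5, 13, 16))
```

With these arguments, `keys` should match the `dates` list built above, running from 2022051100 to 2022051316.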

In [10]:
# oil price data and tweet data
data_locations = ['news_tweets_clean.csv',
                  'oil_tweets_clean.csv',
                  'think_tank_tweets_clean.csv',
                  'gov_tweets_clean.csv']

final_data = pd.read_csv('drive/MyDrive/DS-301_PROJECT/TwitterData/prices.csv')

In [11]:
import random

# we want to iterate through all of the groups
# and aggregate the sentiment scores and create a metric based on other twitter data

# we consider hourly oil prices from May 11 to May 13
# we also consider 4 groups of data
# news -> tweets in that timeframe from news organizations
# oil -> tweets in that timeframe from oil organizations
# think -> tweets in that timeframe from think tanks
# gov -> tweets in that timeframe from the US government

# for every hour we aggregate the scores so that there is one score per group
# the group score is the sum of the scores of all of that group's tweets in that timeframe
# the score for each tweet is calculated as follows:
# (quotes + likes + replies + retweets) * (retweets / (likes + replies + 1)) * SA score
# where SA score is 1 or -1, indicating positive/negative sentiment predicted on the tweet
for location in data_locations:
  path = os.path.join('drive/MyDrive/DS-301_PROJECT/TwitterData/',location)
  data = pd.read_csv(path)
  data['text'] = data['tweet']

  # we create an arbitrary target column that is meaningless
  # this simply makes it easier to reuse code from finetuning
  data['target']=data['author id'].apply(lambda x : random.randint(0,1))

  gpt2_classification_collator = Gpt2ClassificationCollator(use_tokenizer=tokenizer)
  dataset = TwitterData(data)
  print('Created dataset with %d examples!'%len(dataset))
  dataloader = DataLoader(dataset,
                          batch_size=batch_size,
                          shuffle=False,
                          collate_fn=gpt2_classification_collator)
  
  valid_labels, valid_predict, val_loss = validation(dataloader, device)

  data['SA_score'] = valid_predict
  data['SA_score']=data['SA_score'].apply(lambda x : 1 if x==1 else -1)
  data['SA_score'] = (data['quote_count'] +
                      data['like_count'] +
                      data['reply_count'] +
                      data['retweet_count']) * (data['retweet_count']/
                                               (data['like_count']+
                                                data['reply_count']+1)) * data['SA_score']
  
  # reduce each timestamp to an integer hour key (YYYYMMDDHH)
  data['order'] = data['created_at'].apply(
    lambda x : int(x.split(":")[0].replace('-','').replace(' ','')))
  # aggregate metrics for each time period (hours):
  # entry j sums the tweets with dates[j-1] <= order < dates[j]; entry 0 has no window
  sums = [0]
  for i,date in enumerate(dates[1:]):
    sums.append(data[(data['order'] < date) & (data['order'] >= dates[i])]['SA_score'].sum())
  # store this group's hourly totals under its short name (news, oil, think, gov)
  final_data[location.split('_')[0]] = sums

Created dataset with 401 examples!
Created dataset with 22 examples!
Created dataset with 70 examples!
Created dataset with 190 examples!
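
A worked example of the per-tweet score may help. The function below restates the formula from the cell above in plain Python. The name `engagement_score` and the sample counts are made up for illustration.

```python
def engagement_score(quotes, likes, replies, retweets, sentiment):
    """Per-tweet score: total engagement, weighted by the retweet ratio,
    signed by the predicted sentiment (+1 or -1)."""
    total = quotes + likes + replies + retweets
    # The +1 in the denominator avoids division by zero for tweets
    # with no likes and no replies.
    ratio = retweets / (likes + replies + 1)
    return total * ratio * sentiment

# Made-up counts: total = 12, ratio = 4 / 8 = 0.5, sentiment = -1
score = engagement_score(quotes=1, likes=6, replies=1, retweets=4, sentiment=-1)
```

Note that a tweet with no retweets always scores zero under this formula, regardless of its likes, replies, or quotes.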

In [12]:
final_data

| Unnamed: 0 | date | price | news | oil | think | gov |
|---|---|---|---|---|---|---|
| 0 | 05-11-0 | 101.58 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 05-11-1 | 101.91 | -14.777126 | 0.0 | 0.0 | 1.0 |
| 2 | 05-11-2 | 101.94 | -274.500531 | 0.0 | 0.0 | 7.25 |
| 3 | 05-11-3 | 102.01 | -30.333333 | 0.0 | -219.878971 | 0.0 |
| 4 | 05-11-4 | 103.21 | -38.770103 | 0.0 | 0.0 | 0.0 |
| 5 | 05-11-5 | 102.99 | -71.097561 | 0.0 | 0.0 | 0.0 |
| 6 | 05-11-6 | 103.22 | -244.889092 | 0.0 | 0.0 | 0.0 |
| 7 | 05-11-7 | 103.69 | -139.451237 | 0.0 | 0.0 | 0.0 |
| 8 | 05-11-8 | 102.6 | -297.717113 | -17.647059 | 89.681818 | 0.0 |
| 9 | 05-11-9 | 104.69 | -75.592928 | 0.0 | 0.0 | 0.0 |


In [14]:
# save the combined price and engagement data
final_data.to_csv('drive/MyDrive/DS-301_PROJECT/TwitterData/final_data.csv',index=False)