## Connection to the Twitter API

To interact with Twitter, and in particular to download tweets, we need to connect to the Twitter API. The code below authenticates with the credentials of a Twitter Developer account.

In [None]:
!pip install tweepy



In [None]:
import tweepy
import pandas as pd

In [None]:
pd.set_option('display.max_colwidth', None) # show the entire column, this is for text

In [None]:
consumer_key = "your consumer_key"
consumer_secret  = "your consumer secret key"
access_token = "your access token key"
access_token_secret = "your access token secret key"

In [None]:
# tweepy connectors
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
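
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the four credentials are exported as environment variables; the variable names and the helper `load_credentials` are hypothetical and not part of Tweepy:

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the string literals.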

## Tweet Scraping

We are interested in tweets about Pfizer's vaccine against the Covid-19 virus. The following code collects a fixed number of such tweets.

In [None]:
text_query = 'Pfizer -filter:retweets -filter:replies' # excluding retweets and replies
count = 200 # number of tweets

tweets_obj = tweepy.Cursor(api.search, q = text_query, lang = 'en', 
                           result_type = 'mixed', tweet_mode = 'extended').items(count)
tweets_list_obj = [tweet for tweet in tweets_obj]

In [None]:
print('tweets downloaded:', len(tweets_list_obj))

tweets downloaded: 0
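
The query string combines a keyword with `-filter:` operators that exclude retweets and replies. A small sketch of how such a string could be composed; the helper `build_query` is hypothetical and only illustrates the string format used in `text_query`:

```python
def build_query(keyword, exclude_retweets=True, exclude_replies=True):
    """Compose a search query string with optional -filter: operators."""
    parts = [keyword]
    if exclude_retweets:
        parts.append("-filter:retweets")
    if exclude_replies:
        parts.append("-filter:replies")
    return " ".join(parts)

print(build_query("Pfizer"))  # Pfizer -filter:retweets -filter:replies
```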


## Extraction of Information from Tweets

In this section we gather useful information from each tweet object, for example: tweet ID, user, date and number of likes.
This information is collected into a dedicated dataframe, ordered by number of retweets and number of likes (the reply count is currently commented out in the code).

In [None]:
from time import sleep # used to wait when the rate limit is hit

def extract_replies_to(username, tweet_id):
  """Collect the text of direct replies to tweet_id among the latest tweets addressed to username."""

  if username is None or tweet_id is None:
    print('Error: username or tweet_id is missing')
    return []

  n_replies_per_tweet = 10
  replies = tweepy.Cursor(api.search, q='to:{}'.format(username),
                          since_id=tweet_id, tweet_mode='extended', lang='en').items(n_replies_per_tweet)
  comments_list = []

  while True:
      try:
          reply = replies.next()
          # keep only the tweets that reply directly to tweet_id
          if getattr(reply, 'in_reply_to_status_id', None) == tweet_id:
              comments_list.append(reply.full_text)
              #print('reply of tweet:', reply.id, '{}'.format(reply.full_text))

      except tweepy.RateLimitError as e:
          print("Twitter API rate limit reached: {}".format(e))
          sleep(60)
          continue

      except tweepy.TweepError as e:
          print("Tweepy error occurred: {}".format(e))
          break

      except StopIteration:
          break

      except Exception as e:
          print("Failed while fetching replies: {}".format(e))
          break

  return comments_list

# extract_replies_to('Teensthack', 1385161194940420105) # an example

In [None]:
tweets = []
counter = 0
for tweet in tweets_list_obj:
  date = pd.to_datetime(tweet.created_at).date() # creation date of the tweet
  tweet_id = tweet.id # id of the tweet
  username = tweet.user.screen_name # username of the author of the tweet
  user_id = tweet.user.id # id of the author
  user_verified = tweet.user.verified # whether the author is verified
  location = tweet.user.location # location of the user
  try:
    n_user_followers = tweet.user.followers_count # number of followers
  except AttributeError:
    continue # skip tweets without a follower count
  #geo = tweet.geo
  #coord = tweet.coordinates # tweet.geo['coordinates']['coordinates'] ?
  retweetcount = tweet.retweet_count # tweet.organic_metrics['retweet_count'] ?
  try:
    likes = tweet.retweeted_status.favorite_count # likes of the original tweet, if this is a retweet
  except AttributeError:
    likes = tweet.favorite_count # favorite_count = number of likes

  hashtags = [dict_tag['text'] for dict_tag in tweet.entities['hashtags']]

  try:
    text = tweet.full_text
  except AttributeError:
    continue # skip tweets without the extended text

  # replies = extract_replies_to(username, tweet_id)
  # n_replies = len(replies)

  #tweet_info = [date, tweet_id, username, user_id, user_verified, location, n_user_followers,
  #              retweetcount, likes, hashtags, text, replies, n_replies]

  tweet_info = [date, tweet_id, username, user_id, user_verified, location, n_user_followers,
                retweetcount, likes, hashtags, text]
  tweets.append(tweet_info)

  counter += 1
  if counter % 10 == 0:
    print(counter, '/', len(tweets_list_obj), '- (current) tweet_id:', tweet_id)
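
The like count and the hashtags are the two fields that need some care: likes fall back from the retweeted status to the tweet itself, and hashtags are nested inside the `entities` dictionary. A minimal sketch of that logic on stand-in objects; `get_likes`, `get_hashtags` and the `SimpleNamespace` objects are hypothetical and mimic only the attributes used here, not the full Tweepy status object:

```python
from types import SimpleNamespace

def get_likes(tweet):
    """Prefer the original tweet's like count when the tweet is a retweet."""
    original = getattr(tweet, "retweeted_status", None)
    source = original if original is not None else tweet
    return source.favorite_count

def get_hashtags(tweet):
    """Collect the hashtag strings from the tweet's entities dict."""
    return [tag["text"] for tag in tweet.entities.get("hashtags", [])]

# Stand-in objects that mimic only the attributes used above.
plain = SimpleNamespace(favorite_count=3,
                        entities={"hashtags": [{"text": "Pfizer"}]})
retweet = SimpleNamespace(favorite_count=0,
                          entities={"hashtags": []},
                          retweeted_status=SimpleNamespace(favorite_count=42))

print(get_likes(plain), get_likes(retweet), get_hashtags(plain))
```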

In [None]:
#tweet_columns = ['date', 'tweet_id', 'user_name', 'user_id', 'user_verified', 'user_location', 'n_followers',
#                 'n_retweet', 'n_likes', 'hashtags', 'text', 'replies', 'n_replies']

tweet_columns = ['date', 'tweet_id', 'user_name', 'user_id', 'user_verified', 'user_location', 'n_followers',
                 'n_retweet', 'n_likes', 'hashtags', 'text']
      
print(len(tweet_columns))

tweets_df = pd.DataFrame(tweets, columns = tweet_columns)
#tweets_df.sort_values(by = ['n_retweet', 'n_likes', 'n_replies'], axis = 0, ascending = False, inplace = True)
tweets_df.sort_values(by = ['n_retweet', 'n_likes'], axis = 0, ascending = False, inplace = True)

tweets_df.reset_index(inplace = True, drop = True)

11
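
Sorting on two keys means that ties on the first column are broken by the second. A small illustration on made-up data (the values below are invented for the example and do not come from the collected tweets):

```python
import pandas as pd

toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "n_retweet": [5, 9, 5, 0],
    "n_likes": [10, 1, 30, 7],
})

# Descending by retweets first; rows with equal retweets are ordered by likes.
ranked = (toy.sort_values(by=["n_retweet", "n_likes"], ascending=False)
             .reset_index(drop=True))
print(ranked["tweet_id"].tolist())  # [2, 3, 1, 4]
```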


In [None]:
tweets_df.head(10) # twitter.com/anyuser/status/

Unnamed: 0,date,tweet_id,user_name,user_id,user_verified,user_location,n_followers,n_retweet,n_likes,hashtags,text


In [None]:
tweets_df.shape

(0, 11)

## Assignment of Sentiment Labels

Now we assign a sentiment label to each tweet using the Hugging Face Transformers library.

Since we want to identify the sentiment of each tweet as written, heavy text pre-processing is not recommended.
Accordingly, we could apply only a light cleaning to the tweets, removing just HTML strings and quotes, since they are not needed.

A possible refinement is to split each tweet into sentences and then average the scores of the same 'type' across sentences, which should increase the accuracy of the results. However, since tweets follow no reliable punctuation rules that could be used to split them into sentences, this task could be quite complex and may be misleading for the purposes of this project, so it is not done here.
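
For reference, the sentence-level averaging described above could look roughly like the sketch below. It is not used in this project. The regular-expression split is deliberately naive, and `average_sentence_scores` and `toy_scorer` are hypothetical names; a real scorer would call the sentiment model instead.

```python
import re

def average_sentence_scores(text, score_fn):
    """Split text into rough sentences and average the per-label scores."""
    # Naive split after ., ! or ? followed by whitespace; tweets often break this rule.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return {}
    totals = {}
    for sentence in sentences:
        for label, score in score_fn(sentence).items():
            totals[label] = totals.get(label, 0.0) + score
    return {label: total / len(sentences) for label, total in totals.items()}

# Toy scorer used only for illustration; it is not a sentiment model.
def toy_scorer(sentence):
    pos = 0.8 if "good" in sentence.lower() else 0.2
    return {"neg": 1.0 - pos, "pos": pos}

print(average_sentence_scores("Good news today. Side effects were bad!", toy_scorer))
```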

In [None]:
! pip install transformers 

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request

# helpers used to add a sentiment field to the tweet dataframe:
# first apply the soft preprocessing to each tweet, then take
# the sentiment with the highest score
def preprocess(text):
  
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)


task='sentiment'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# download label mapping
labels=[]
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')

labels = [row[1] for row in csvreader if len(row) > 1]
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
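
To make the effect of `preprocess` concrete, the standalone copy below applies the same masking rules to a made-up tweet: user mentions become `@user` and links become `http`, while all other tokens are left untouched.

```python
def preprocess(text):
    """Mask user mentions and links, leaving every other token untouched."""
    tokens = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        tokens.append(token)
    return " ".join(tokens)

print(preprocess("@alice check https://t.co/x now"))  # @user check http now
```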

In [None]:
def get_higher_sentiment(sentiment_dict):
  # return the label ('neg', 'neu' or 'pos') with the highest score
  return max(sentiment_dict, key=sentiment_dict.get)

In [None]:
def hugging_face(text):

  text = preprocess(text)
  encoded_input = tokenizer(text, return_tensors='pt')
  output = model(**encoded_input)
  scores = output[0][0].detach().numpy()
  scores = softmax(scores)

  # ranking = np.argsort(scores)
  # ranking = ranking[::-1]

  # for i in range(scores.shape[0]):
      # l = labels[ranking[i]]
      # s = scores[ranking[i]]
      # print(f"{i+1}) {l} {np.round(float(s), 4)}")

  return {'neg' : scores[0], 'neu' : scores[1], 'pos' : scores[2]}

# get_higher_sentiment(hugging_face('Good night 😊')) # an example
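
The two steps performed on the model output, softmax and picking the top label, can be illustrated without loading the model. The sketch below uses a plain-Python softmax instead of `scipy.special.softmax`, and the logits are made up for the example; `top_label` is a hypothetical helper.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits, labels=("neg", "neu", "pos")):
    """Return the label with the highest probability and the full score dict."""
    scores = dict(zip(labels, softmax(logits)))
    return max(scores, key=scores.get), scores

label, scores = top_label([-1.2, 0.3, 2.1])  # made-up logits
print(label)  # pos
```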

In [None]:
# tweet_data is the full dataset loaded from tweets.json in the "Store Tweets Dataset" section
tweet_sentiment = [get_higher_sentiment(hugging_face(tweet)) for tweet in tweet_data.text]

In [None]:
import collections
collections.Counter(tweet_sentiment) # check how many for each sentiment class

Counter({'neg': 5787, 'neu': 8728, 'pos': 5485})
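
The raw counts are easier to compare as shares of the total. A short sketch that converts the counts reported above into percentages:

```python
from collections import Counter

counts = Counter({"neg": 5787, "neu": 8728, "pos": 5485})
total = sum(counts.values())  # 20000 tweets in total
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
# neg: 28.9%
# neu: 43.6%
# pos: 27.4%
```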

## Store Tweets Dataset

The tweets dataset is now ready to be stored in a JSON file that we will use in the analytics scripts.

In [None]:
from google.colab import drive
drive.mount('drive')

Mounted at drive


In [None]:
tweets_df.to_json('tweets.json', date_format='iso') # convert dataframe to JSON to maintain the data structure

In [None]:
!cp tweets.json "drive/My Drive/Colab Notebooks/DSSC_NLP/Project" # store JSON file

cp: cannot stat 'tweets.json': No such file or directory


In [None]:
tweet_data = pd.read_json('drive/My Drive/Colab Notebooks/DSSC_NLP/Project/tweets.json') # load JSON file (check integrity)

In [None]:
tweet_data = pd.read_json('tweets.json') # load JSON file (check integrity)

In [None]:
tweet_data.shape # check correct shape

(20000, 14)
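
The same kind of integrity check can be tried on a small scale without touching the disk. The sketch below round-trips a toy dataframe through a JSON string; the rows are invented for the example, and it assumes a pandas version where `read_json` accepts a file-like object such as `StringIO`.

```python
from io import StringIO

import pandas as pd

original = pd.DataFrame({
    "tweet_id": [101, 102],
    "n_likes": [7, 0],
    "hashtags": [["Pfizer", "vaccine"], []],
    "text": ["first toy tweet", "second toy tweet"],
})

payload = original.to_json(date_format="iso")  # serialize to a JSON string
restored = pd.read_json(StringIO(payload))     # parse it back into a dataframe

print(restored.shape == original.shape)
print(restored["hashtags"].tolist())
```

List-valued columns such as `hashtags` survive the round trip as lists, which is one reason to prefer JSON over CSV for this dataset.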

## References

- [Twitter: Beginner Guide for Developer](https://developer.twitter.com/en/docs/getting-started)

- [Twitter: API](https://developer.twitter.com/en/products/twitter-api)

- [Tweepy: API](https://docs.tweepy.org/en/latest/api.html)

- [Twitter: Tweet Object](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet)

- [NLTK: Documentation](https://www.nltk.org/index.html)

- [NLTK: Sentiment Analysis](https://www.nltk.org/api/nltk.sentiment.html)

- [NLTK: Vader Module](https://www.codeproject.com/Articles/5269445/Using-Pre-trained-VADER-Models-for-NLTK-Sentiment)

- [Hugging Face: Sentiment Analyzer](https://huggingface.co/)