<a href="https://colab.research.google.com/github/tbahng/IST718-FinalProject/blob/master/HydrateTweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hydrate Open Source Twitter Chatter Tweet IDs
* Of the roughly 80M tweet IDs extracted from the Open Source Covid-19 Twitter Chatter data set, a random sample is hydrated through the Twitter API for use in modeling.
* To 'hydrate' a tweet ID means to retrieve the full tweet details (text, creation time, and so on) for that ID.

In [18]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
# set working directory + view files in directory
import os
#os.listdir('drive/Shared drives/IST718-Summer2020-Team')
wd = 'drive/Shared drives/IST718-Summer2020-Team'
os.listdir(wd)

['Sample Colab Notebook.ipynb',
 '551982_1359228_bundle_archive.zip',
 'Kaggle',
 'Project Check in 1 Rubric.pdf',
 'Project Check In 1 ROUGH DRAFT.docx',
 'Group1-Section2-Week5ProjectCheckin-2020.docx',
 'IST718-Team-Contact-Info.gdoc',
 'Twitter',
 'Project Check In 2 ROUGH DRAFT.docx',
 'Copy of Final Project Workbook .ipynb']

## Get Tweet IDs and Down-Sample
* All IDs are for English-language tweets created in 2020.
* Down-sampling is random.
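
As an aside, a uniform random sample can also be drawn in a single pass over the ID file without holding every ID in memory, using reservoir sampling (Algorithm R). This is a different technique from the chunk-based approach used in this notebook, shown only as a minimal sketch. The helper name `reservoir_sample` and the demo data are hypothetical.

```python
import random

def reservoir_sample(lines, k, seed=11):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        item = line.rstrip('\n')
        if i < k:
            # fill the reservoir with the first k items
            reservoir.append(item)
        else:
            # item i replaces a reservoir slot with probability k / (i + 1)
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# small in-memory stand-in for the id file (an assumption for illustration)
demo_ids = [str(n) + '\n' for n in range(1000)]
demo_sample = reservoir_sample(demo_ids, 10)
print(len(demo_sample))
```

In practice the `lines` argument could be an open file handle, so only the `k` kept IDs stay in memory.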

In [20]:
%%time
# read extracted tweet ids, see 'Download Open Resource Covid-19 Twitter Chatter Dataset.ipynb'
fname = wd + '/Twitter/tweet_ids/english2020.txt'
with open(fname, 'r') as fin:
  tweet_id_list = [line.rstrip('\n') for line in fin]
print("There are {:d} tweet ids in english language and created in 2020.".format(len(tweet_id_list)))

There are 81083863 tweet ids in english language and created in 2020.
CPU times: user 22.2 s, sys: 9.81 s, total: 32 s
Wall time: 34.4 s


In [21]:
# function to break list into chunks
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
tweet_id_list_chunks = chunks(tweet_id_list, 1000000)
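
Note that `chunks` returns a generator, so the result can be iterated only once, and the last chunk may be shorter than `n`. A small sketch with made-up data illustrates both points:

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

demo = list(range(7))
gen = chunks(demo, 3)
first_pass = list(gen)   # [[0, 1, 2], [3, 4, 5], [6]]
second_pass = list(gen)  # [] because the generator is already exhausted
print(first_pass)
print(second_pass)
```

If the chunks are needed more than once, call `chunks` again or wrap the result in `list(...)`.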

In [22]:
%%time
# down-sampling
import random
random.seed(11)
sample_size = 3000 # desired sample size
num_chunks = int(len(tweet_id_list) / 1000000) # number of full chunks (int() truncates; a final partial chunk is not counted)
chunk_draws = int(sample_size / num_chunks) # draws per chunk, truncated; the loop stops adding once sample_size is reached
sample_list = [] # initialized list of sample ids
for chunk in tweet_id_list_chunks:
  random.shuffle(chunk)
  sample_list_size = len(sample_list) # size of sample list in iteration
  to_go = sample_size - sample_list_size
  if to_go >= chunk_draws:
    sample_list.extend(chunk[:chunk_draws])
  else:
    sample_list.extend(chunk[:to_go])

CPU times: user 1min 23s, sys: 38.7 ms, total: 1min 23s
Wall time: 1min 23s
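
The loop above shuffles each one-million-ID chunk in place in order to keep only a few dozen IDs from it. A possible alternative, sketched below under the assumption that the sample is small relative to each chunk, is to draw directly with `random.sample`, which leaves the chunk unordered and untouched. This would not reproduce the exact IDs selected above, because the random stream is consumed differently. The helper name `sample_per_chunk` is hypothetical.

```python
import math
import random

def sample_per_chunk(ids, sample_size, chunk_size, seed=11):
    """Draw roughly equal numbers of ids from each consecutive chunk."""
    if not ids:
        return []
    rng = random.Random(seed)
    n_chunks = math.ceil(len(ids) / chunk_size)  # counts the final partial chunk
    base, extra = divmod(sample_size, n_chunks)  # spread the remainder over the first chunks
    picked = []
    for c in range(n_chunks):
        chunk = ids[c * chunk_size:(c + 1) * chunk_size]
        quota = base + (1 if c < extra else 0)
        # never ask for more ids than the chunk holds
        picked.extend(rng.sample(chunk, min(quota, len(chunk))))
    return picked

# made-up ids for illustration
demo_ids = [str(n) for n in range(1050)]
demo_sample = sample_per_chunk(demo_ids, sample_size=30, chunk_size=100)
print(len(demo_sample))
```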


In [23]:
del tweet_id_list
print("Sample size of tweet ids for hydration: {:d}".format(len(sample_list)))

Sample size of tweet ids for hydration: 3000


## Twitter API Keys and Setup

In [24]:
# get twitter keys
# key dictionary has the following keys: CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_SECRET
fname = 'drive/My Drive/IST 718/secret_twitter_key.txt'
key_dict = {}
with open(fname, 'r') as fin:
  lines = [line.strip().split() for line in fin]
  for line in lines:
    key_dict[line[0]] = line[2].strip('\'')
    
# assign these keys to variables
CONSUMER_KEY = key_dict['CONSUMER_KEY']
CONSUMER_SECRET = key_dict['CONSUMER_SECRET']
OAUTH_TOKEN = key_dict['OAUTH_TOKEN']
OAUTH_TOKEN_SECRET = key_dict['OAUTH_SECRET']
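
The parsing above implies that each line of the key file has the form `NAME = 'value'`, which splits on whitespace into three tokens. The following is a minimal sketch, under that assumption, of a parser that also skips blank lines and tolerates missing spaces around `=`. The helper name `parse_key_lines` and the sample values are placeholders, not real credentials.

```python
def parse_key_lines(lines):
    """Parse lines of the assumed form NAME = 'value' into a dictionary."""
    keys = {}
    for line in lines:
        line = line.strip()
        # skip blank lines, comment lines, and anything without an '='
        if not line or line.startswith('#') or '=' not in line:
            continue
        name, _, value = line.partition('=')
        keys[name.strip()] = value.strip().strip('\'"')
    return keys

demo_lines = [
    "CONSUMER_KEY = 'abc123'\n",
    "\n",
    "OAUTH_SECRET='xyz789'\n",
]
demo_keys = parse_key_lines(demo_lines)
print(demo_keys)
```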

In [25]:
import tweepy
# written against Tweepy 3.x; Tweepy 4 dropped the wait_on_rate_limit_notify argument
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)

In [26]:
# TEST with one ID
id_of_tweet = sample_list[0] # single tweet id
tweet = api.get_status(id_of_tweet)
print(tweet.text)

As no one knows how long #coronavirus stays active on inert surfaces, from shoes, clothing to controls, all bets ar… https://t.co/QClfIl77Zw


## Extract Tweets

In [27]:
id_chunks = chunks(sample_list, 100) # break ids into chunks of 100 (lookup_tweets repeats this step internally)

In [28]:
# function to get tweet data per id
# rate limiting is handled by the api object (created with wait_on_rate_limit = True)
def lookup_tweets(tweet_IDs, api):
  id_chunks = chunks(tweet_IDs, 100) # break ids into chunks of 100 (the per-request maximum)
  full_tweets = [] # initialize list of results
  try:
    for chunk in id_chunks:
      full_tweets.extend(api.statuses_lookup(id_=chunk))
  except tweepy.TweepError as e:
    # report the error and how far the lookup got
    print(e)
    print('Full tweets collected: {:d}'.format(len(full_tweets)))
  return full_tweets # always return a list, even after an error
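
The batching logic can be exercised without network access by passing the lookup call in as a function. The sketch below assumes a stand-in lookup that silently drops some IDs, much as the real endpoint omits tweets it cannot return. The names `hydrate_in_batches` and `fake_lookup` are hypothetical; in this notebook the real call is `api.statuses_lookup`.

```python
def hydrate_in_batches(ids, lookup, batch_size=100):
    """Call lookup on consecutive batches and collect whatever comes back."""
    results = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        results.extend(lookup(batch))
    return results

# stand-in for the API: pretends every id ending in '7' is unavailable
def fake_lookup(batch):
    return [{'id': i} for i in batch if not i.endswith('7')]

demo_ids = [str(n) for n in range(250)]
demo_results = hydrate_in_batches(demo_ids, fake_lookup)
print(len(demo_results))
```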

In [29]:
%%time
# execute
results = lookup_tweets(sample_list, api)

CPU times: user 1.08 s, sys: 76 ms, total: 1.15 s
Wall time: 15.1 s


In [30]:
# function to extract dictionaries from list of results
def getDict(lst):
  out = [] # list of dictionaries
  for item in lst:    
    out.append({k: item._json[k] for k in ('id', 'created_at', 'text')})
  return out
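
The extraction step can be tried on a stand-in object before running it on real results. The sketch below assumes only what `getDict` relies on, namely that each status exposes its raw payload as a `_json` dictionary. The name `get_dict` and the fake record are made up for illustration.

```python
from types import SimpleNamespace

def get_dict(statuses, fields=('id', 'created_at', 'text')):
    """Keep only the selected fields from each status object's raw payload."""
    return [{k: s._json[k] for k in fields} for s in statuses]

# fake status object carrying an extra field that should be dropped
fake_status = SimpleNamespace(_json={
    'id': 1,
    'created_at': 'Wed Apr 01 12:00:00 +0000 2020',
    'text': 'hello',
    'lang': 'en',
})
demo_rows = get_dict([fake_status])
print(demo_rows)
```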

In [31]:
# extract relevant properties from tweets (i.e. 'created_at', 'id', 'text')
tweets = getDict(results)
print("There are {:d} tweets returned in the lookup.".format(len(tweets)))

There are 2711 tweets returned in the lookup.


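The lookup returned 2,711 of the 3,000 sampled IDs. The 289 missing tweets have most likely been deleted since the data set was compiled, or belong to accounts that are now protected or suspended; the lookup omits such tweets rather than raising an error.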
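
To check which sampled IDs did not come back, the requested IDs can be compared with the returned ones. The sketch below assumes the sampled IDs are strings (they were read from a text file) while the returned `id` values are integers, so both are compared as strings. The helper name `find_missing_ids` and the demo records are hypothetical.

```python
def find_missing_ids(requested_ids, returned_tweets):
    """Return the requested ids that have no matching returned tweet."""
    returned = {str(t['id']) for t in returned_tweets}
    return [i for i in requested_ids if str(i) not in returned]

# made-up data for illustration
demo_requested = ['101', '102', '103', '104']
demo_returned = [{'id': 101, 'text': 'a'}, {'id': 104, 'text': 'b'}]
demo_missing = find_missing_ids(demo_requested, demo_returned)
print(demo_missing)
```

In this notebook the equivalent call would be `find_missing_ids(sample_list, tweets)`.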

In [35]:
# save to files
import json
fname = wd + '/Twitter/english2020_sample3k.json'
with open(fname, 'w') as fout:
  json.dump(tweets, fout)
print("Results saved to {:s}".format(fname))

Results saved to drive/Shared drives/IST718-Summer2020-Team/Twitter/english2020_sample3k.json
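
A quick way to confirm that a saved file is readable is to load it back and compare it with the in-memory list. The following sketch uses a temporary directory and a made-up record rather than the shared drive path above.

```python
import json
import os
import tempfile

# made-up record with the same three fields kept in this notebook
demo_tweets = [{'id': 1, 'created_at': 'Wed Apr 01 12:00:00 +0000 2020', 'text': 'example'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'sample.json')
    with open(demo_path, 'w') as fout:
        json.dump(demo_tweets, fout)
    with open(demo_path, 'r') as fin:
        reloaded = json.load(fin)

print(reloaded == demo_tweets)
```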
