# Scraping Monthly Tweets

This notebook uses the GetOldTweets3 package to scrape monthly tweets from Twitter's search results page. It also contains some post-processing options in the lower cells.

## Import Packages

In [1]:
import os
import gc
from IPython.display import Audio, display
import time

from collections import Counter

import pandas as pd 

from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

import GetOldTweets3 as got

## Run Monthly Scrape

### Construct each group to be scraped

In [2]:
#Artificial Intelligence graph terms 
AI_graph = ['#AI', '#ML', '#NLP', 'Artificial Intelligence','Deep Learning',
            '"Machine Learning"', 'Natural Language Processing','Neural Network']
#Distributed Ledger graph terms
DL_graph = ['Bitcoin', 'Blockchain','Ethereum','distributed ledger','smart contract']

### Define Functions

In [3]:
def allDone():
    '''Play a short, funny audio clip when called.
    Typically this is used to signal a task completion.'''
    display(Audio(url='https://sound.peal.io/ps/audios/000/000/537/original/woo_vu_luvub_dub_dub.wav', autoplay=True))

def update_tweet_csv(path,DF,start,end,delta,Verbose=True):
    '''Save the results of one scrape to disk. Meant to be called inside a loop so that
    the data scraped in each iteration is appended to the CSV stored on disk. Typically
    the loop runs daily scrapes for the period of a month.'''
    #date being scraped, used in the progress messages
    since = start.strftime("%Y-%m-%d")

    #if the scrape was successful and the file doesn't exist, create it and save the DF as a csv
    if len(DF)>0 and not os.path.isfile(path):
        DF.to_csv(path, index=False)
        #start and end are returned unchanged since the scrape was successful
        #print date scraped, time of scrape, and number of daily tweets scraped
        if Verbose:
            print(since," // ",datetime.now()," / ", len(DF)," tweets/day")

    #if the scrape was successful and the file exists, append to it
    elif len(DF)>0:
        #open the csv of the month being scraped
        globe = pd.read_csv(path)
        #append the day scraped (DataFrame.append was removed in pandas 2.0)
        globe = pd.concat([globe, DF], ignore_index=True)
        #save the new DF to the csv
        globe.to_csv(path, index=False)
        if Verbose:
            print(since," // ",datetime.now()," // ", len(DF)," tweets/day ",len(globe))

    #if no Twitter data was returned due to an interruption/block, wait and then try that day again
    else:
        if Verbose:
            print(since," // ",datetime.now()," // ", len(DF)," tweets/day **")
        #step the start and end dates back so the calling loop retries this day
        start -= delta
        end   -= delta
        time.sleep(60)

    return start, end

def tweets_to_df(tweet):
    '''Collect the results of a Twitter scrape into lists, then build a DF from them.
    This is needed to extract info from the GetOldTweets3 generator object.
    Note: the search term is read from the global variable `word` set by the calling loop.'''
    #initialize lists
    text, date, hashtag, username, link, keyword, ID = [], [], [], [], [], [], []
    
    #add content to lists using GOT3 "tweet" generator object
    for tweets in tweet:
        text.append(str(tweets.text))
        date.append(str(tweets.date))
        hashtag.append(str(tweets.hashtags))
        username.append(str(tweets.username))
        link.append(tweets.permalink)
        keyword.append(word)
        ID.append(tweets.id)

    #compile content into a DF
    DF = pd.DataFrame({'tweet':text, 'date/time':date, 'hashtags':hashtag, 'user':username, 'links':link,
                       'search':keyword,'tweet_id':ID})
    return DF
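
Note that `update_tweet_csv` re-reads and rewrites the whole monthly CSV on every daily scrape. As a possible alternative (not what this notebook does), rows could be appended to the file directly. The sketch below uses only the standard `csv` module; the helper name `append_rows`, the `FIELDS` list, and the demo file are hypothetical and only mirror the column names used in `tweets_to_df`.

```python
import csv
import os
import tempfile

# Column names copied from tweets_to_df; the list itself is an assumption of this sketch.
FIELDS = ["tweet", "date/time", "hashtags", "user", "links", "search", "tweet_id"]

def append_rows(path, rows, fieldnames=FIELDS):
    """Append dict rows to a CSV, writing the header only when the file is new or empty."""
    is_new = not os.path.isfile(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)

# Demo on a temporary file: two separate "daily" appends end up in one CSV.
demo_path = os.path.join(tempfile.mkdtemp(), "globe_demo.csv")
blank_row = dict.fromkeys(FIELDS, "")
append_rows(demo_path, [dict(blank_row, tweet_id="1")])
append_rows(demo_path, [dict(blank_row, tweet_id="2")])

with open(demo_path, newline="", encoding="utf-8") as handle:
    saved = list(csv.DictReader(handle))
print([row["tweet_id"] for row in saved])
```

Appending this way keeps the cost of each save proportional to the day's tweets rather than to the size of the whole month's file.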


#### Why Twitter has limitations and why you should download in daily intervals

"The issue here is **Min_position** and **Has_more_items** flags. Twitter's legacy timeline caching system **Haplo** has its limitations. So when you start downloading millions of tweets, it runs out of memory and sometimes returns has_more_items as false. You can read about how twitter cache works in here

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html  "

source: https://github.com/Mottl/GetOldTweets3/issues/3

### Run Monthly Scrape

Steps to follow in the cell below:
- Set the start and end dates to be scraped.
- Scrapes are run in daily intervals because a day is the smallest interval Twitter allows. Results are scraped in descending chronological order, so if a week-long scrape gets interrupted by the issues described above, days' worth of data can be lost, whereas an interrupted single-day scrape loses only hours. This saves the user the hassle of rechecking for missing days and rescraping.
- To hide progress updates, pass Verbose=False to the update_tweet_csv function.

Background info:
- The typical speed of GOT3 is roughly 2.5 million tweets/day.
- Scraping a month's worth of data using the lists above (AI_graph and DL_graph) takes a full day.
- Using a different proxy for each request (20 tweets), through services like Crawlera, reduces scraping speed by a factor of 5.5.
- It is recommended to use a different IP address for each day of scraping; otherwise scraping gets blocked by Twitter repeatedly (10 cycles).
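
The daily-interval idea can be sketched on its own as a small generator that yields one (since, until) pair of date strings per day. This is only an illustration: the helper name `daily_windows` is hypothetical and is not part of this notebook or of GetOldTweets3. The date format is the same `%Y-%m-%d` format that the scraping cell passes as `since` and `until`.

```python
from datetime import datetime, timedelta

def daily_windows(first_day, stop_point):
    """Yield (since, until) strings for every day in [first_day, stop_point)."""
    one_day = timedelta(days=1)
    day = first_day
    while day < stop_point:
        yield day.strftime("%Y-%m-%d"), (day + one_day).strftime("%Y-%m-%d")
        day += one_day

# July 2019, with the stop point excluded, as in the scraping cell.
windows = list(daily_windows(datetime(2019, 7, 1), datetime(2019, 8, 1)))
print(len(windows))   # 31
print(windows[0])     # ('2019-07-01', '2019-07-02')
print(windows[-1])    # ('2019-07-31', '2019-08-01')
```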

In [None]:
for word in AI_graph+DL_graph:

    delta = timedelta(days = 1)            #set scrape range (e.g. number of days, weeks, months)
    start = datetime(2019,7,1) - delta     #set first day of scrape (start is advanced by delta before each scrape)
    stop_point = datetime(2019,8,1)        #set final day of scrape, this is not inclusive

    data_dir = os.getcwd() + '/twitter_data_2019/'
    os.makedirs(data_dir, exist_ok=True)   #make sure the output folder exists before saving
    file_name = 'globe_' + word + "_" + (start+delta).strftime('%Y-%m') + '.csv'
    path = data_dir + file_name
    print(file_name, '\nstart: ', datetime.now())

    #start holds the day before the next one to scrape, so a day that is being retried
    #(start stepped back by delta) is never skipped, including the last day of the range
    while start + delta < stop_point:
        start += delta
        end = start + delta
        since = (start).strftime("%Y-%m-%d")
        until = (end).strftime("%Y-%m-%d")
        try:
            #Get tweets by query search
            tweetCriteria = got.manager.TweetCriteria().setQuerySearch(word).setSince(since).setUntil(until)
            tweet = got.manager.TweetManager.getTweets(tweetCriteria)

            #store the data as a DF
            DF = tweets_to_df(tweet)

            #save the daily scrape to csv on disk and update start & end accordingly
            start, end = update_tweet_csv(path,DF,start,end,delta,Verbose=True)
            #minimize memory retention
            del DF, tweet, tweetCriteria
            gc.collect()

        #in case of an error occurring mid scrape cycle, wait then repeat the cycle
        #SystemExit is caught too in case the library aborts with sys.exit() on HTTP errors (assumption); Ctrl-C still interrupts
        except (Exception, SystemExit):
            print('error occured at ', since, datetime.now())
            #maintain the same date and don't save the data
            start -= delta
            end   -= delta
            #wait a while before trying again
            time.sleep(120)

    #audio signal when each phrase/month finishes
    allDone()


globe_Bitcoin_2019-03.csv 
start:  2019-07-23 22:53:29.471256
An error occured during an HTTP request: [Errno 104] Connection reset by peer
Try to open in browser: https://twitter.com/search?q=Bitcoin%20since%3A2019-03-21%20until%3A2019-03-22&src=typd
error occured at  2019-03-21
An error occured during an HTTP request: [Errno 104] Connection reset by peer
Try to open in browser: https://twitter.com/search?q=Bitcoin%20since%3A2019-03-21%20until%3A2019-03-22&src=typd
error occured at  2019-03-21
An error occured during an HTTP request: [Errno 104] Connection reset by peer
Try to open in browser: https://twitter.com/search?q=Bitcoin%20since%3A2019-03-21%20until%3A2019-03-22&src=typd
error occured at  2019-03-21
An error occured during an HTTP request: [Errno 104] Connection reset by peer
Try to open in browser: https://twitter.com/search?q=Bitcoin%20since%3A2019-03-21%20until%3A2019-03-22&src=typd
error occured at  2019-03-21
An error occured during an HTTP request: [Errno 104] Connectio

## Check Continuity of Data

As mentioned above, Twitter sometimes limits the results returned by a search because of the issues described earlier, among others. To detect the missing data, use the cell below to find the number of hours and the dates that are missing.

Info to be filled in:
- Change "filename" to the path of the desired csv.
- Change the range of dates to be searched for missing data by changing b (end date) and a (start date).
- Set the min_hrs parameter, which is used to show the days with more than a certain number of hours missing (e.g. with min_hrs=2 only dates with more than 2 hrs missing will be printed).

Results:
- Percent of hours missed (typically a DF will have <3% of missing data if the scrape is done in daily intervals, as is recommended).
- A list showing how many days have how many hours missing (# of hrs, num of days), e.g. [(1, 4), (2, 2),..., (23, 5), (24, 2)].
- The number of days with more than min_hrs missing.
- The date of each day with more than min_hrs missing (this list can later be used to rescrape dates with a significant number of hrs/day missing).
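
The idea behind the check can be sketched on toy data: build the set of every hour in the date range, subtract the hours that appear in the scraped timestamps, and count what is left per day. This is a simplified illustration, not the notebook's cell. The function name `missing_hours_per_day` and the sample timestamps are made up for this example, and the timestamps are assumed to start with `YYYY-MM-DD HH`, as in the 'date/time' column.

```python
from collections import Counter
from datetime import datetime, timedelta

def missing_hours_per_day(timestamps, start, end):
    """Count, per calendar day, the hours in [start, end) that contain no tweet."""
    # Truncate each timestamp string to the hour, as the check cell does with [:13].
    seen = {datetime.strptime(ts[:13], "%Y-%m-%d %H") for ts in timestamps}
    total_hours = int((end - start).total_seconds() // 3600)
    expected = {start + timedelta(hours=h) for h in range(total_hours)}
    return Counter(hour.date() for hour in expected - seen)

# Toy data: a single day in which only hours 00 and 01 contain tweets.
stamps = ["2019-07-01 00:15:00+00:00", "2019-07-01 01:40:00+00:00", "2019-07-01 01:59:59+00:00"]
missed = missing_hours_per_day(stamps, datetime(2019, 7, 1), datetime(2019, 7, 2))
percent_missed = sum(missed.values()) * 100 / 24
print(missed, percent_missed)
```

Here 22 of the 24 hours have no tweets, so the single day is reported with 22 missing hours.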

In [3]:
#### Check if data is continuous
filename = 'globe_Bitcoin_130101_170121_6.csv'

print(filename, datetime.now())

# get hours scraped, for days change 13 to 10
actual = {datetime.strptime(date_str[:13],"%Y-%m-%d %H") for date_str in pd.read_csv(filename)['date/time']}

# generate all possible hours in date range
b = datetime(2017,1,21)     #end date
a = datetime(2013,1,1)      #start date
numhrs = 24*((b.date()-a.date()).days)
dateList = []
for x in range(0, numhrs+2):
    dateList.append((a + timedelta(hours = x)))                 #step forward from the start date, one hour at a time

#the list of incomplete/missing dates
min_hrs = 1                                                     #the minimum number of missing hours needed to display a date
hours_missed = sorted(set(dateList) - actual)                   #all missing hours
counter = Counter([date.date() for date in hours_missed])       #count hours missed per day
sort = sorted(counter.items())                                  #sort in chronological order
dates_missed = [date[0] for date in sort if date[1]>min_hrs]    #keep dates with more than min_hrs hours missing

#calculate the total number of hours missed as a percentage
summary = Counter([date[1] for date in sort])                   
summary = sorted(summary.items())                               
total_missed_hours = sum([x[0]*x[1] for x in summary])
print('Missing: ', total_missed_hours*100/numhrs, "%","\n",summary)

# create since and until to search twitter for those missing date ranges
since_missing = dates_missed
until_missing = [dm + timedelta(days=1) for dm in dates_missed]

print(" # Days: ", len(dates_missed),"\n",
     "Ranges sizes: ", since_missing)

globe_Bitcoin_130101_170121_6.csv 2019-05-14 21:43:49.747945
Missing:  2.160702228224173 % 
 [(1, 4), (2, 2), (3, 2), (4, 2), (6, 1), (7, 2), (8, 2), (9, 3), (10, 4), (11, 4), (13, 1), (14, 3), (15, 2), (16, 2), (18, 2), (19, 5), (20, 4), (21, 2), (22, 3), (23, 5), (24, 2)]
 # Days:  53 
 Ranges sizes:  [datetime.date(2014, 5, 6), datetime.date(2014, 5, 9), datetime.date(2015, 10, 24), datetime.date(2016, 3, 6), datetime.date(2016, 7, 3), datetime.date(2016, 7, 9), datetime.date(2016, 7, 21), datetime.date(2016, 7, 29), datetime.date(2016, 7, 31), datetime.date(2016, 8, 3), datetime.date(2016, 8, 4), datetime.date(2016, 8, 6), datetime.date(2016, 8, 7), datetime.date(2016, 8, 8), datetime.date(2016, 8, 9), datetime.date(2016, 8, 10), datetime.date(2016, 8, 11), datetime.date(2016, 8, 12), datetime.date(2016, 8, 13), datetime.date(2016, 8, 14), datetime.date(2016, 8, 16), datetime.date(2016, 8, 17), datetime.date(2016, 8, 18), datetime.date(2016, 8, 19), datetime.date(2016, 8, 20), date

## Rescrape Missing Data

This cell is used to rescrape days with a significant number of missing hours. It is optional and can be avoided by simply scraping 1 day at a time, as recommended above. It was only created to amend scrapes initially done in larger intervals (1-week scrapes). However, since Twitter's smallest date interval is a day, setting the scrape to 1-day intervals achieves the highest accuracy from the start.
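
Because a rescraped day can return tweets that are already present in the original file, the two sets of rows usually need to be merged without duplicates. The following is a minimal sketch of that step, assuming each row is a dictionary that carries the `tweet_id` column used in this notebook. The helper name `merge_unique` and the sample rows are hypothetical.

```python
def merge_unique(original_rows, rescraped_rows, key="tweet_id"):
    """Combine two row lists, keeping the first occurrence of each tweet id."""
    seen = set()
    merged = []
    for row in list(original_rows) + list(rescraped_rows):
        if row[key] in seen:
            continue  # this tweet was already kept, so skip the duplicate
        seen.add(row[key])
        merged.append(row)
    return merged

# Made-up rows: tweet 11 appears in both the original and the rescraped data.
original = [{"tweet_id": "10", "tweet": "a"}, {"tweet_id": "11", "tweet": "b"}]
rescraped = [{"tweet_id": "11", "tweet": "b"}, {"tweet_id": "12", "tweet": "c"}]
merged = merge_unique(original, rescraped)
print([row["tweet_id"] for row in merged])  # ['10', '11', '12']
```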

In [9]:
# import urllib3
# urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# import warnings
# warnings.filterwarnings('ignore', category=ResourceWarning)

#Amended search for missing dates
filename = 'globe_' + AI_graph[0] + '_130101_170121_5_missing.csv'
print("start: ", filename, datetime.now())

word = AI_graph[0]   
for i in range(len(since_missing)):
    if (until_missing[i] - since_missing[i]).days <= 0:
        continue
        
    since = (since_missing[i]).strftime("%Y-%m-%d")
    until = (until_missing[i]).strftime("%Y-%m-%d")
    text, date, hashtag, username, link, keyword, ID = [], [], [], [], [], [], []

    try:
        #Get tweets by query search
        tweetCriteria = got.manager.TweetCriteria().setQuerySearch(word).setSince(since).setUntil(until)
        tweet = got.manager.TweetManager.getTweets(tweetCriteria)
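    except (Exception, SystemExit):   #SystemExit covers an abort via sys.exit() (assumption); Ctrl-C still interrupts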
        print('ERROR: ', since)
        time.sleep(15)
        continue

    #add content to lists
    for tweets in tweet:
        text.append(str(tweets.text))
        date.append(str(tweets.date))
        hashtag.append(str(tweets.hashtags))
        username.append(str(tweets.username))
        link.append(tweets.permalink)
        keyword.append(word)
        ID.append(tweets.id)

    #compile content into a DF
    DF = pd.DataFrame({'tweet':text, 'date/time':date, 'hashtags':hashtag, 'user':username, 'links':link,
                       'search':keyword,'tweet_id':ID})


    if len(DF)>0 and os.path.isfile(filename) == False:            
        DF.to_csv(filename, index=False) 
        print(since,"-->",until," // ",datetime.now()," / ", len(DF)/(until_missing[i] - since_missing[i]).days,"rows/days")
        del [DF, text, date, hashtag, username, link, keyword, ID, tweet, tweetCriteria ]
        gc.collect()
        continue
    elif len(DF)>0 and os.path.isfile(filename) == True:
        globe = pd.read_csv(filename)
        globe = pd.concat([globe, DF], ignore_index=True)   #DataFrame.append was removed in pandas 2.0
        globe.to_csv(filename, index=False)
        print(since,"-->",until," // ",datetime.now()," // ", (len(DF))/(until_missing[i] - since_missing[i]).days,"rows/days")
        del [globe, DF, text, date, hashtag, username, link, keyword, ID, tweet, tweetCriteria ]
        gc.collect()
        continue     
    else:
        print(since," // ",datetime.now()," // "," 0 rows")


start:  globe_Bitcoin_130101_170121_5_missing.csv 2019-05-13 23:03:51.866060
An error occured during an HTTP request: HTTP Error 503: Service Temporarily Unavailable
Try to open in browser: https://twitter.com/search?q=Bitcoin%20since%3A2016-08-09%20until%3A2016-08-10&src=typd
ERROR:  2016-08-09
2016-08-10 --> 2016-08-11  //  2019-05-13 23:13:44.042894  //  9338.0 rows/days
An error occured during an HTTP request: HTTP Error 503: Service Temporarily Unavailable
Try to open in browser: https://twitter.com/search?q=Bitcoin%20since%3A2016-08-11%20until%3A2016-08-12&src=typd
ERROR:  2016-08-11
2016-08-12 --> 2016-08-13  //  2019-05-13 23:20:46.029110  //  8622.0 rows/days
2016-08-13 --> 2016-08-14  //  2019-05-13 23:24:37.337142  //  8171.0 rows/days
2016-08-14 --> 2016-08-15  //  2019-05-13 23:25:26.047601  //  1609.0 rows/days
2016-08-16 --> 2016-08-17  //  2019-05-13 23:26:58.859637  //  3061.0 rows/days
2016-08-17 --> 2016-08-18  //  2019-05-13 23:27:53.489701  //  1816.0 rows/days
201