## NOTE: You do not need to run this notebook. It is included so you can see HOW the Twitter data was collected :)
## To actually USE the data collected here, see the <b>'twitter_unpackTop100_example.ipynb'</b> notebook!

## Load Modules
- ttools has helper functions

In [48]:

%load_ext autoreload
%autoreload 2
import sys, codecs, json
import ttools
from twython import TwythonStreamer, Twython
from datetime import datetime
from time import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Get top100 [from pre-made json file]
First, load the dictionary of the top 100 most-followed Twitter users and extract the user_ids for use with the API.

In [53]:
top100file = './top100_id_dictionary.json'
top100 = ttools.json_to_dict(top100file)  # format is {user_id:[username,name]} really we just care about the user ids for now
top100ids = [int(uid) for uid in top100.keys()]
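
`ttools.json_to_dict` is not shown in this notebook. Here is a minimal sketch of what such a helper might look like, assuming it simply wraps `json.load`. The function body, the sample entries, and the temporary file name are illustrative, not the actual ttools code:

```python
import json
import os
import tempfile

def json_to_dict(path):
    """Hypothetical stand-in for ttools.json_to_dict: read a JSON file into a dict."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

# Illustrative sample in the {user_id: [username, name]} format described above
sample = {'27260086': ['justinbieber', 'Justin Bieber'],
          '813286': ['BarackObama', 'Barack Obama']}
tmp_path = os.path.join(tempfile.mkdtemp(), 'sample_top100.json')
with open(tmp_path, 'w', encoding='utf-8') as f:
    json.dump(sample, f)

loaded = json_to_dict(tmp_path)
# JSON object keys are always strings, hence the int() conversion
sample_ids = [int(uid) for uid in loaded]
print(sample_ids)
```

Because JSON keys come back as strings, lookups into the dictionary need `str(user_id)`, while the API calls work with the integer IDs.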

## Set up Twitter API, get user metadata, and remove non-English accounts
Initialize the api connection

In [40]:
api = ttools.initAPI()
credentials = api.verify_credentials()  #KRC__verify the connection

Get all users' metadata directly from the users/lookup API [it can return up to 100 users in a single call...how convenient!]

In [41]:
# Pass the flat list of IDs. Wrapping it in another list (user_id=[top100ids]) appears to be
# why the first and last users went missing: the inner list gets stringified as '[id1, ..., idN]',
# so the brackets stick to the first and last IDs.
userdata = api.lookup_user(user_id=top100ids)
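
A single lookup call covers exactly the 100 IDs needed here. For a longer ID list the requests could be batched. The sketch below shows one way to do that; `chunked`, `lookup_all`, and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for `api.lookup_user` so the example runs without credentials:

```python
def chunked(ids, size=100):
    """Yield successive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def lookup_all(lookup, ids, size=100):
    """Call `lookup(user_id=batch)` once per batch and merge the results."""
    users = []
    for batch in chunked(ids, size):
        users.extend(lookup(user_id=batch))
    return users

# Stand-in for api.lookup_user so the sketch runs offline (assumption, not the real API)
def fake_lookup(user_id):
    return [{'id': uid} for uid in user_id]

demo_ids = list(range(1, 251))
demo_users = lookup_all(fake_lookup, demo_ids)
print(len(demo_users))  # 250
```

With a real connection, `lookup_all(api.lookup_user, top100ids)` would be the corresponding call, subject to the usual rate limits.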

Check and clean the data we collected

In [42]:
#verify we got all the users
usersGotten = []
for d in userdata:
    usersGotten.append(int(d['id']))
commonUsers = set(top100ids).intersection(set(usersGotten))
if len(commonUsers) != 100:
    print('api did not give all/correct user ids...need to investigate')

#remove the non-english accounts [actually, do this in-loop below]
# nonEnglish = []
# for d in userdata:
#     if d['lang'] != 'en':
#         nonEnglish.append(d['id'])
#         print('removing non-english account: %s'%(top100[str(d['id'])]))
#         top100.pop(str(d['id']))
# print('top100 is composed of %s english speakers'%(len(top100)))
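
The completeness check above only reports that something is off. A small sketch of how the missing IDs could be listed, using a set difference; `missing_ids` and the demo values are illustrative:

```python
def missing_ids(requested, returned_users):
    """Return the requested IDs that are absent from the API response."""
    got = {int(u['id']) for u in returned_users}
    return sorted(set(requested) - got)

demo_requested = [27260086, 813286, 79293791]
demo_returned = [{'id': 27260086}, {'id': '79293791'}]  # IDs may arrive as int or str
print(missing_ids(demo_requested, demo_returned))  # [813286]
```

Printing the missing IDs makes it easier to tell whether the same users are dropped on every run.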

## Collect Timeline Data and Save json
Now, let's gather the timeline data! Note that the user information we just collected is stored under the 'user_info' key of limitedUserDict [the dictionary that collects ALL of the data]. The data will be saved in JSON format.

In [43]:
%%time
numPasses = 1
currentUserID = 0
timeStart = time()
allCollectedUsers = []  #track users we successfully got timelines for

limitedUserDict = {}
try:
    for i,udata in enumerate(userdata):
        user_id = int(udata['id'])
        #skip the non-english accounts
        if udata['lang'] != 'en':
            print('skipping non-english user: %s'%(udata['screen_name']))
            #top100.pop(str(d['id']))  #remove from the top100 list...not really necessary
            continue
        currentUserID = user_id
        limitedUserDict[int(user_id)] = {'user_info':udata,'user_timeline':[]}  #hydrates the user info and preps the timeline list
        #limitedUserDict[int(user_id)] = activeusers[int(user_id)]  #copy the structure for the user
        print('%s__of__%s total users gathered'%(i,len(top100ids)))
        print('User ID: %s'%user_id)
        print('username: %s'%udata['screen_name'])
        kwargs = {'user_id':int(user_id),'count':200,'exclude_replies':'false','trim_user':'true','include_rts':'false'}
        timelineTweets = ttools.rateLimitWrapperTimeline(api,api.get_user_timeline,kwargs,willingToWait=True,maxExecTime=14400)
        limitedUserDict[user_id]['user_timeline'].extend(timelineTweets)  #extend the list
        allCollectedUsers.append(user_id)
        del timelineTweets
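except BaseException as err:  # like a bare except, this also catches KeyboardInterrupt, so a manual stop still saves the data
    print('error occurred (%r)...dumping data collected so far'%(err,))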
    jsonStr = json.dumps(limitedUserDict)
    with open('top100users_and_timelines.json','w') as f:
        f.write(jsonStr)
    del jsonStr
    with open('top100gotten.txt','w') as outF:
        outF.write('%s'%allCollectedUsers)
    print('last user_id attempted = %s'%currentUserID)
    print('total number of users collected: %s'%(len(allCollectedUsers)))
    print('finished!')
    print('Elapsed time: %s'%(time() - timeStart))
    sys.exit()
#print(len(r))

print('made it to the end without error')
jsonStr = json.dumps(limitedUserDict)
with open('top100users_and_timelines.json','w') as f:
    f.write(jsonStr)
del jsonStr
with open('top100gotten.txt','w') as outF:
    outF.write('%s'%allCollectedUsers)
print('last user_id attempted = %s'%currentUserID)
print('total number of users collected: %s'%(len(allCollectedUsers)))
print('finished!')
print('Elapsed time: %s'%(time() - timeStart))

0__of__100 total users gathered
User ID: 27260086
username: justinbieber
returning from rateLimitWrapper
1__of__100 total users gathered
User ID: 813286
username: BarackObama
returning from rateLimitWrapper
2__of__100 total users gathered
User ID: 79293791
username: rihanna
returning from rateLimitWrapper
3__of__100 total users gathered
User ID: 17919972
username: taylorswift13
returning from rateLimitWrapper
4__of__100 total users gathered
User ID: 14230524
username: ladygaga
returning from rateLimitWrapper
5__of__100 total users gathered
User ID: 15846407
username: TheEllenShow
returning from rateLimitWrapper
skipping non-english user: Cristiano
7__of__100 total users gathered
User ID: 10228272
username: YouTube
returning from rateLimitWrapper
8__of__100 total users gathered
User ID: 26565946
username: jtimberlake
returning from rateLimitWrapper
9__of__100 total users gathered
User ID: 25365536
username: KimKardashian
returning from rateLimitWrapper
10__of__100 total users gathered
U
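
How `ttools.rateLimitWrapperTimeline` works internally is not shown here. A wrapper of this kind presumably pages backwards through each timeline, since a single timeline request returns at most `count` tweets. The following is only a sketch of that paging idea, not the ttools code: `collect_timeline` and `fake_fetch` are hypothetical, and waiting on rate limits is left out entirely.

```python
def collect_timeline(fetch, max_pages=20, **kwargs):
    """Page backwards through a timeline using max_id until no tweets remain."""
    tweets = []
    for _ in range(max_pages):
        page = fetch(**kwargs)
        if not page:
            break
        tweets.extend(page)
        # max_id is inclusive, so step just below the oldest tweet seen so far
        kwargs['max_id'] = min(t['id'] for t in page) - 1
    return tweets

# Fake timeline of 450 tweets with IDs 450..1 (newest first), standing in for the API
FAKE_TWEETS = [{'id': i} for i in range(450, 0, -1)]

def fake_fetch(count=200, max_id=None, **_ignored):
    eligible = [t for t in FAKE_TWEETS if max_id is None or t['id'] <= max_id]
    return eligible[:count]

demo_timeline = collect_timeline(fake_fetch, user_id=1, count=200)
print(len(demo_timeline))  # 450
```

With a live connection, `fetch` would be `api.get_user_timeline` and the keyword arguments would be the same `kwargs` dictionary built in the loop above.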

Now we take the raw tweet data, extract our defined features into a dataframe, and then save that dataframe as a *.csv file!


In [49]:
with open('top100users_and_timelines.json','r') as f:
    readstr = f.read()
    alldata = json.loads(readstr)
    del readstr

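# DataFrame.append was removed in pandas 2.0, so collect the per-user frames and concatenate once
# (this assumes ttools.rawTimelineToTrainingInstances returns a DataFrame)
frames = [pd.DataFrame(columns=ttools.COLUMNS)]
uNum = 0
for uid,data in alldata.items():
    print('user number: %s'%uNum)
    if 'ErrorCaught' in data:
        print('Handled User ErrorCaught')
        continue
    frames.append(ttools.rawTimelineToTrainingInstances(uid,data))
    uNum += 1
finalFrame = pd.concat(frames)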
print(finalFrame.shape)
finalFrame.to_csv('top100users_and_timelines.csv', index=False)
del finalFrame
del limitedUserDict

user number: 0
user number: 1
user number: 2
user number: 3
user number: 4
user number: 5
user number: 6
user number: 7
user number: 8
user number: 9
user number: 10
user number: 11
user number: 12
user number: 13
user number: 14
user number: 15
user number: 16
user number: 17
user number: 18
user number: 19
user number: 20
user number: 21
user number: 22
user number: 23
user number: 24
user number: 25
user number: 26
user number: 27
Handled Tweet ErrorCaught
user number: 28
Handled Tweet ErrorCaught
user number: 29
Handled Tweet ErrorCaught
user number: 30
Handled Tweet ErrorCaught
user number: 31
Handled Tweet ErrorCaught
user number: 32
Handled Tweet ErrorCaught
user number: 33
user number: 34
user number: 35
user number: 36
user number: 37
user number: 38
user number: 39
user number: 40
user number: 41
user number: 42
user number: 43
user number: 44
user number: 45
user number: 46
user number: 47
user number: 48
user number: 49
user number: 50
user number: 51
user number: 52
user n
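
The feature extraction itself lives in `ttools.rawTimelineToTrainingInstances`, which is not shown. To illustrate the general shape of that step, here is a sketch that flattens per-user timelines into rows and writes them as CSV. It uses the standard `csv` module rather than pandas to stay dependency-free, and the column names, `timeline_to_rows`, and the demo data are all hypothetical:

```python
import csv
import io

DEMO_COLUMNS = ['user_id', 'tweet_id', 'text_length', 'retweet_count']  # hypothetical features

def timeline_to_rows(uid, data):
    """Turn one user's raw timeline into a list of flat feature rows."""
    rows = []
    for tweet in data['user_timeline']:
        rows.append({'user_id': int(uid),
                     'tweet_id': tweet['id'],
                     'text_length': len(tweet['text']),
                     'retweet_count': tweet['retweet_count']})
    return rows

demo_data = {
    '1': {'user_info': {}, 'user_timeline': [
        {'id': 10, 'text': 'hello', 'retweet_count': 3},
        {'id': 11, 'text': 'hi', 'retweet_count': 0}]},
    '2': {'ErrorCaught': True},  # users whose collection failed are skipped
}

all_rows = []
for uid, data in demo_data.items():
    if 'ErrorCaught' in data:
        continue
    all_rows.extend(timeline_to_rows(uid, data))

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=DEMO_COLUMNS)
writer.writeheader()
writer.writerows(all_rows)
print(buf.getvalue())
```

Collecting all rows first and writing once mirrors the idea of building the full table before saving it.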

And there you have it! The top 100 most-followed users on Twitter and their timelines are now in the file: <b>top100users_and_timelines.csv</b>

## See the 'twitter_unpackTop100_example.ipynb' notebook for an example using the actual data in 'top100users_and_timelines.csv'!