# Introduction to Tweepy

In [2]:
import sys
sys.version

## 1. Twitter API and retrieving tokens

`Tweepy` is a Python library for accessing the Twitter API. The diagram below shows how the Twitter API works.

![alt](https://cms-assets.tutsplus.com/uploads/users/317/posts/22192/image/streaming-intro-1_1.png)

To get data from Twitter, we need consumer and access tokens. The consumer key and consumer secret are issued when you register your application. The consumer key is how Twitter identifies your application when you send requests, and the consumer secret proves that the requests really come from it. To post a tweet, the application needs read and write access.

The access token is used to make API requests on behalf of your own account. The access token secret is used, together with the consumer secret, to sign those requests so that Twitter can verify them.

(More on OAuth: https://techrangers.cdl.ucf.edu/oauth-python-tutorial.php)
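
To make the role of the two secrets more concrete, here is a minimal sketch of how an OAuth 1.0a request signature can be computed with HMAC-SHA1. Tweepy does this for you behind the scenes. The helper names `percent_encode` and `sign_request`, and all key values, are made up for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    # OAuth 1.0a percent-encodes everything except unreserved characters
    return quote(str(value), safe="-._~")


def sign_request(method, url, params, consumer_secret, token_secret):
    # 1. Encode the parameters, sort them, and join them as key=value pairs
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = "&".join("{}={}".format(k, v) for k, v in encoded)
    # 2. Build the signature base string from method, URL and parameters
    base_string = "&".join([method.upper(), percent_encode(url), percent_encode(param_string)])
    # 3. The signing key combines the consumer secret and the access token secret
    signing_key = "{}&{}".format(percent_encode(consumer_secret), percent_encode(token_secret))
    digest = hmac.new(signing_key.encode("utf-8"), base_string.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical secrets, for illustration only
signature = sign_request(
    "GET",
    "https://api.twitter.com/1.1/search/tweets.json",
    {"q": "google"},
    "example-consumer-secret",
    "example-token-secret",
)
print(signature)
```

Neither secret is ever sent with the request. Only the resulting signature is, which is why both secrets must be kept private.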

To get the consumer and access tokens, follow these steps.

- Create a Twitter account or use your existing one.
- Go to https://apps.twitter.com/ and log in with your account.
- Click on Create your app and submit your phone number. A valid phone number is required for verification, and a mobile phone number can be used for one account only.
- Fill in the form, agree to the terms and conditions, and create your Twitter application.
- Go to the Keys and Access Tokens tab, save your API key and API secret, then click on Create my access token to obtain the access token and access token secret. These four values are required to establish a connection with the API.

In [9]:
url_rest = "https://api.twitter.com/1.1/search/tweets.json" 
url_streaming = "https://stream.twitter.com/1.1/statuses/sample.json" 

## 2. Tweepy library
- http://www.tweepy.org

## 2.1. OAuth

OAuth is an open standard for authorization that lets an application act on a user's behalf on a website without the user sharing their password.

In [12]:
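import os

# Read the credentials from environment variables so that secrets are not stored in the notebook.
# The variable names below are placeholders; use whatever names you have configured.
consumer_key        = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret     = os.environ['TWITTER_CONSUMER_SECRET']
access_token        = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']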

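`OAuthHandler` is a class that authorizes our app to access Twitter on our behalf, and `set_access_token()` stores the access token and access token secret on it. The handler is then passed to `tweepy.API`, which provides the methods for accessing Twitter data. The code in this notebook uses the Tweepy 3.x interface (for example `tweepy.TweepError` and the `wait_on_rate_limit_notify` argument), which was changed in Tweepy 4.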

In [14]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Error handling: report a failure to obtain a request token
try:
  redirect_url = auth.get_authorization_url()
except tweepy.TweepError:
  print('Error! Failed to get request token.')

## 2.2. API Examples

### 1. Timeline methods

Timeline methods give access to the timeline of the authenticated user or of a specified user.

`.home_timeline()` returns the most recent statuses, including retweets, on your Twitter timeline. By default it returns 20 tweets; this can be changed with the `count` parameter.

In [18]:
number_tweets=30
pagination=1
public_tweets = api.home_timeline(count=number_tweets, page=pagination)
print("number of tweets:", len(public_tweets))
print("class of the object:", type(public_tweets))
public_tweets[0]._json

Each one of the JSON attributes is turned into an individual class member in Tweepy’s tweet object. The following is the function that lists all the class members.

In [20]:
def PrintMembers(obj):
    for attribute in dir(obj):
        
        #We don't want to show built in methods of the class
        if not attribute.startswith('__'):
            print(attribute)

In [21]:
PrintMembers(public_tweets[0])

We can access each attribute with dot notation, by appending the attribute name to the object.

In [23]:
print("tweet:", public_tweets[0].text)
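
As a rough picture of what it means for JSON attributes to become class members, the sketch below copies the keys of a tweet-like dictionary onto a plain object. It is only a stand-in, not Tweepy's actual parsing code, and the sample payload is invented.

```python
from types import SimpleNamespace

# Invented payload shaped like a small part of a tweet's JSON
sample_json = {"id": 1, "text": "hello world", "retweet_count": 3}

# Each top-level key becomes an attribute; the raw dictionary stays available as _json
tweet = SimpleNamespace(_json=sample_json, **sample_json)

print(tweet.text)           # attribute access with dot notation
print(tweet._json["text"])  # the same value through the raw JSON

# The same filter as PrintMembers: hide the built-in double-underscore names
members = [name for name in dir(tweet) if not name.startswith("__")]
print(members)
```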

`.user_timeline()` returns the most recent statuses, including retweets, posted by the specified user. The default number of tweets can be changed with `count`, and pagination can be used.

In [25]:
username="hugobowne"
number_tweets=30
user_tweets=api.user_timeline(username, count=number_tweets)

for tweet in user_tweets:
  print(tweet.text)

### 2. User methods

User methods are designed to work with user and followers information.

#### Extracting followers list

`.followers()` returns information about the users who follow the specified user. Results are ordered with the most recent followers first and are returned 20 at a time. To retrieve more of the followers list, we need to use a `cursor` to navigate to the next page of results.

In [28]:
followers=api.followers(username)
len(followers)
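
Cursor-based paging can be sketched without touching the network. In Twitter's cursoring scheme the first request uses cursor `-1`, each response carries a `next_cursor`, and a value of `0` means there are no more pages. The function `fetch_page` and its data are hypothetical stand-ins for a real followers request.

```python
def fetch_page(cursor):
    """Hypothetical stand-in for one followers request: returns (users, next_cursor)."""
    pages = {
        -1: (["user_1", "user_2"], 10),
        10: (["user_3", "user_4"], 20),
        20: (["user_5"], 0),
    }
    return pages[cursor]


def fetch_all(fetch):
    users = []
    cursor = -1               # -1 asks for the first page
    while cursor != 0:        # 0 signals that the last page was reached
        page, cursor = fetch(cursor)
        users.extend(page)
    return users


all_users = fetch_all(fetch_page)
print(len(all_users))
```

`tweepy.Cursor` hides this loop and hands back the items one by one.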

The following function passes the next cursor to the API and extracts the followers list automatically. After a certain number of requests, the extraction hits the rate limit and sleeps until the limit resets.

In [30]:
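import json
import time

followers=[]
def process_or_store_followers(follower):
    followers.append(follower)
    
def limit_handled(cursor):
  while True:
    try:
      yield next(cursor)
    except tweepy.RateLimitError:
      # Rate limit reached: wait for the 15-minute window to reset, then retry
      time.sleep(15 * 60)
    except StopIteration:
      # Cursor exhausted: end the generator instead of letting StopIteration escape
      return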

In [31]:
followers_count=0
user_name="hugobowne"
for follower in limit_handled(tweepy.Cursor(api.followers, 
                              screen_name=user_name, 
                              count=100,  # users per page
                             ).items()):  
  process_or_store_followers(follower._json)
  followers_count+=1
  print("Downloaded {0} followers".format(followers_count))
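
The rate-limit pattern can be tried out without calling Twitter. The sketch below uses a stand-in cursor and a stand-in exception (`FakeCursor` and `FakeRateLimitError` are invented for this example) and takes the sleep function as a parameter so that no real waiting happens.

```python
import time


class FakeRateLimitError(Exception):
    """Stand-in for the library's rate-limit exception."""


class FakeCursor:
    """Iterator that raises FakeRateLimitError once, just before the item at index fail_at."""

    def __init__(self, items, fail_at):
        self._items = list(items)
        self._pos = 0
        self._fail_at = fail_at
        self._failed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos == self._fail_at and not self._failed:
            self._failed = True
            raise FakeRateLimitError()
        if self._pos >= len(self._items):
            raise StopIteration
        item = self._items[self._pos]
        self._pos += 1
        return item


def limit_handled_demo(cursor, wait_seconds, sleep=time.sleep):
    while True:
        try:
            yield next(cursor)
        except FakeRateLimitError:
            sleep(wait_seconds)  # back off, then retry the same position
        except StopIteration:
            return               # cursor exhausted


naps = []  # records the requested waits instead of actually sleeping
result = list(limit_handled_demo(FakeCursor(["a", "b", "c"], fail_at=1), 15 * 60, sleep=naps.append))
print(result, naps)
```

No item is lost when the limit is hit: the wrapper waits once and then continues from the same position.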

Display `followers` list.

In [33]:
followers

In [34]:
print(len(followers))
followers[1]

Save the extracted followers list as a `pickle` file.

In [36]:
import pickle

# A context manager closes the file once the data has been written
with open("/dbfs/FileStore/tmp/followers.pkl", "wb") as f:
    pickle.dump(followers, f)

In [37]:
%sh ls -lh /dbfs/FileStore/tmp/*.pkl

Loading saved pickle file as `hugo_followers`

In [39]:
import pickle

with open("/dbfs/FileStore/tmp/followers.pkl", "rb") as f:
    hugo_followers = pickle.load(f)

In [40]:
len(hugo_followers), type(hugo_followers), type(hugo_followers[0])
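
The save-and-load round trip can be checked on a small scale. This sketch writes to a temporary directory rather than to `/dbfs`, and the record is invented sample data.

```python
import os
import pickle
import tempfile

# Invented sample in the same shape as the followers list: a list of dictionaries
records = [{"screen_name": "example_user", "followers_count": 10}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "followers.pkl")
    # Write the list, then read it back while the temporary directory still exists
    with open(path, "wb") as f:
        pickle.dump(records, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == records)
```

The restored object is an equal copy of the original list, not the same object.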

In [41]:
hugo_followers[0]

In [42]:
hugo_followers[0].keys()

We can use the `pandas` library to create a dataframe from the list of dictionaries, selecting the fields to keep with the `columns` parameter.

In [44]:
import pandas as pd

col_names=['name', 'screen_name', 'followers_count', 'friends_count', 'favourites_count', 
           'statuses_count','created_at','lang','location']
hugo_follower_data = pd.DataFrame.from_records(hugo_followers, columns=col_names)

In [45]:
hugo_follower_data.head()
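
Because the column names have to match the keys of the user dictionaries exactly, it can help to check them before building the dataframe. The helper `missing_columns` and the sample records below are hypothetical.

```python
def missing_columns(records, col_names):
    """Return the requested column names that appear in none of the records."""
    seen = set()
    for record in records:
        seen.update(record.keys())
    return [name for name in col_names if name not in seen]


# Invented records using key names found in Twitter user objects
sample_followers = [
    {"name": "A", "screen_name": "a", "statuses_count": 5, "lang": "en"},
    {"name": "B", "screen_name": "b", "statuses_count": 7, "lang": None},
]

requested = ["name", "statuses_count", "status_count", "language"]
print(missing_columns(sample_followers, requested))
```

Any name this check reports is worth comparing against the output of `hugo_followers[0].keys()`.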

Saving `hugo_follower_data` dataframe as pickle file.

In [47]:
import pickle

with open("/dbfs/FileStore/tmp/hugo_follower_data.pkl", "wb") as f:
    pickle.dump(hugo_follower_data, f)

#### Extracting friends list

`.friends()` returns information about the users that the specified user is following. Results are ordered with the most recently followed users first and are returned 20 at a time. To retrieve more of the friends list, a `cursor` is used to navigate to the next page of results.

In [49]:
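import json
import time

following=[]
def process_or_store_following(friend):
    following.append(friend)
    
def limit_handled(cursor):
  while True:
    try:
      yield next(cursor)
    except tweepy.RateLimitError:
      # Rate limit reached: wait for the 15-minute window to reset, then retry
      time.sleep(15 * 60)
    except StopIteration:
      # Cursor exhausted: end the generator instead of letting StopIteration escape
      return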

In [50]:
friend_count=0
user_name="hugobowne"
for friend in limit_handled(tweepy.Cursor(api.friends, 
                                          screen_name=user_name, 
                                          count=100  # users per page
                                         ).items()):  
  process_or_store_following(friend._json)
  friend_count+=1
  print("Downloaded {0} friends".format(friend_count))

Save `following` as `pickle` file to get easy access to the data.

In [52]:
import pickle

with open("/dbfs/FileStore/tmp/following.pkl", "wb") as f:
    pickle.dump(following, f)

In [53]:
%sh ls -lh /dbfs/FileStore/tmp/*.pkl

Loading saved `pickle` file.

In [55]:
import pickle

with open("/dbfs/FileStore/tmp/following.pkl", "rb") as f:
    hugo_following = pickle.load(f)

In [56]:
len(hugo_following), type(hugo_following), type(hugo_following[0])

In [57]:
hugo_following[0]

We can use the `pandas` library to create a dataframe from the list of dictionaries, selecting the fields to keep with the `columns` parameter.

In [59]:
import pandas as pd

col_names=['name', 'screen_name','followers_count', 'friends_count', 'favourites_count', 
           'statuses_count','created_at','lang','location','id']
hugo_following_data = pd.DataFrame.from_records(hugo_following, columns=col_names)

In [60]:
hugo_following_data.shape

Then save the dataframe as a pickle file.

In [62]:
import pickle

with open("/dbfs/FileStore/tmp/hugo_following_data.pkl", "wb") as f:
    pickle.dump(hugo_following_data, f)

`.friends_ids()` returns only the user IDs of the accounts that a specific user follows; the user can be identified by user ID or screen name. It is faster than `.friends()` because it returns only IDs, up to 5,000 per request, so a cursor is needed only for users who follow more accounts than that.

`.get_user()` returns information about a specific user, identified by user ID or screen name.

![alt](https://i.imgur.com/2rQYdpU.jpg)

Here we extract the user IDs of the people followed by the users that Hugo follows, as illustrated in the schema above. The following function creates a list of tuples with two elements: the first is the screen name of a user followed by Hugo, and the second is a dictionary describing a user followed by that first user. We use `.friends_ids()` to get the user IDs of the second-level users, then `.get_user()` to extract their user information.

In [66]:
write_edgelist=[]
def second_friend(friends):
  count=0
  for friend in friends:
    friend_screen_name=friend['screen_name']
    # IDs of the accounts this friend follows
    id_list = api.friends_ids(screen_name=friend_screen_name)
    count+=1
    count_fol=0
    for second_id in id_list:
      second_user=api.get_user(id=second_id)
      # Store the raw JSON so that the second element is a plain dictionary
      write_edgelist.append((friend_screen_name, second_user._json))
      count_fol+=1
      print("Friend {}".format(count), "following {} downloaded".format(count_fol))

In [67]:
second_friend(hugo_following)

Since the second element is a dictionary, we can extract any information about the followed user. This comes in handy in network analysis, where we analyze followee/follower relationships: in our case the first element is the follower, the second element is the followee, and the extra information can be stored as node metadata.

In [69]:
write_edgelist[0]
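
As a sketch of how such an edge list could feed a network analysis, the snippet below groups invented edges into an adjacency mapping and keeps the extra fields as node metadata. It assumes that each second element is a dictionary with a `screen_name` key, as described above. All names and numbers are made up.

```python
from collections import defaultdict

# Invented edges in the same shape: (follower screen name, followee info dictionary)
sample_edges = [
    ("alice", {"screen_name": "carol", "followers_count": 120}),
    ("alice", {"screen_name": "dave", "followers_count": 15}),
    ("bob", {"screen_name": "carol", "followers_count": 120}),
]

adjacency = defaultdict(set)  # follower -> set of followee screen names
node_metadata = {}            # followee screen name -> remaining attributes

for follower, followee in sample_edges:
    name = followee["screen_name"]
    adjacency[follower].add(name)
    node_metadata[name] = {k: v for k, v in followee.items() if k != "screen_name"}

# Accounts followed by more than one of the first-level users
shared = sorted(
    name for name in node_metadata
    if sum(name in followees for followees in adjacency.values()) > 1
)
print(shared)
```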

We can save this as a pickle file.

In [71]:
import pickle

with open("/dbfs/FileStore/tmp/hugo_write_edgelist.pkl", "wb") as f:
    pickle.dump(write_edgelist, f)

### 3. Geo methods

`.geo_search()` helps us find the ID of a location.

In [74]:
places = api.geo_search(query="USA", granularity="country")
place_id = places[0].id
print('USA id is: ',place_id)

The USA's location ID is `96683cc9126741d1`, which allows us to extract tweets that originated in the USA.

The following functions store each tweet and handle the rate limit while the cursor moves through the results.

In [77]:
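import json
import time

geo_usa=[]
def process_or_store_geo(place):
    geo_usa.append(place)
    
def limit_handled(cursor):
  while True:
    try:
      yield next(cursor)
    except tweepy.RateLimitError:
      # Rate limit reached: wait for the 15-minute window to reset, then retry
      time.sleep(15 * 60)
    except StopIteration:
      # Cursor exhausted: end the generator instead of letting StopIteration escape
      return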

`.search()` returns tweets that match a specified query. We can use the place ID to collect tweets that originated in the USA by including `'place:96683cc9126741d1'` in the search query, and we can set a maximum number of tweets to extract.

In [79]:
searchquery='place:96683cc9126741d1'
max_tweets=10000
tweet_count=0

for tweet in limit_handled(tweepy.Cursor(api.search, q=searchquery).items(max_tweets)):
  if tweet.place is not None:
    process_or_store_geo(tweet._json)
    tweet_count += 1
    print("Downloaded {0} tweets".format(tweet_count))

Saving geo_usa as pickle file.

In [81]:
import pickle

with open("/dbfs/FileStore/tmp/geo_usa.pkl", "wb") as f:
    pickle.dump(geo_usa, f)

In [82]:
%sh ls -lh /dbfs/FileStore/tmp/*.pkl

Loading `geo_usa` pickle file as `geo_usa_location`.

In [84]:
import pickle

with open("/dbfs/FileStore/tmp/geo_usa.pkl", "rb") as f:
    geo_usa_location = pickle.load(f)

In [85]:
len(geo_usa_location)

We can use the coordinates to map the origins of the tweets.

In [87]:
geo_usa[0]['place']['bounding_box']['coordinates']
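
To place a tweet on a map, one simple option is to reduce the bounding box to its center point. The sketch below assumes the usual layout of a place's `bounding_box` coordinates: a list holding one ring of `[longitude, latitude]` pairs. The helper name and the sample box are invented.

```python
def bounding_box_center(coordinates):
    """Return (longitude, latitude) of the center of a bounding box ring."""
    ring = coordinates[0]  # outer ring: list of [longitude, latitude] pairs
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return ((min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2)


# Invented box, listed corner by corner as [longitude, latitude]
sample_box = [[[-74.0, 40.0], [-73.0, 40.0], [-73.0, 41.0], [-74.0, 41.0]]]
center = bounding_box_center(sample_box)
print(center)  # (-73.5, 40.5)
```

Note the order: longitude comes first, so the pair usually has to be swapped for mapping libraries that expect latitude first.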

## 2.3. Extracting old tweets

We can use `.search()` to extract older tweets. Note that the standard search endpoint only covers roughly the last seven days of tweets.

In [90]:
import json
import time

old_tweets=[]
def process_or_store(tweet):
    old_tweets.append(json.dumps(tweet))

For example, here we extract tweets written in English that contain the word `google`. We can also specify whether to include entities, which are hashtags, user mentions, links, stock tickers (symbols), Twitter polls, and attached media. We should set a maximum number of tweets to extract; otherwise the loop keeps running until the available results are exhausted.

In [92]:
tweet_count=0
max_count=10000
for tweet in tweepy.Cursor(api.search, 
                           q="google", 
                           result_type="recent", 
                           include_entities=True, 
                           lang="en"
                          ).items(max_count):
    tweet_count+=1
    process_or_store(tweet._json)
    print("Downloaded {0} tweets".format(tweet_count))

We can save it as a pickle file.

In [94]:
import pickle

with open("/dbfs/FileStore/tmp/google_tweets.pkl", "wb") as f:
    pickle.dump(old_tweets, f)

In [95]:
%sh ls -lh /dbfs/FileStore/tmp/*.pkl

In [96]:
import pickle
with open("/dbfs/FileStore/tmp/google_tweets.pkl", "rb") as f:
    google_tweets = pickle.load(f)

In [97]:
google_tweets[0]
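
Because `process_or_store` saves each tweet with `json.dumps`, every element of the list is a JSON string, not a dictionary. A minimal sketch of decoding the strings and counting hashtags from the entities follows; the sample tweets are invented and contain only the fields used here.

```python
import json
from collections import Counter

# Invented records in the same form as the stored list: one JSON string per tweet
sample_tweets = [
    json.dumps({"text": "a", "entities": {"hashtags": [{"text": "Google"}, {"text": "AI"}]}}),
    json.dumps({"text": "b", "entities": {"hashtags": [{"text": "google"}]}}),
    json.dumps({"text": "c", "entities": {"hashtags": []}}),
]

hashtag_counts = Counter()
for raw in sample_tweets:
    tweet = json.loads(raw)  # back to a dictionary
    for tag in tweet.get("entities", {}).get("hashtags", []):
        hashtag_counts[tag["text"].lower()] += 1

print(hashtag_counts.most_common())  # [('google', 2), ('ai', 1)]
```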

The End