# Collecting Twitter Data using Twitter API

- We will use `tweepy`, a Python wrapper for the Twitter API, to access Twitter data.
- We will also work with a new data format, JSON, through Python's built-in `json` module.

`tweepy`: https://www.tweepy.org/ is a Python library for accessing the Twitter API. [Read the Docs](https://docs.tweepy.org/en/latest/).
Before you can use tweepy, you need to install it: https://docs.tweepy.org/en/latest/install.html


`json`: 
JSON is a text-based data exchange format derived from JavaScript that is used in web services and other connected applications. 
JSON supports primitive types, like strings and numbers, as well as nested lists and objects. 
[Read the Docs](https://realpython.com/python-json/)
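
As a small illustration of the nesting described above, the sketch below round-trips a Python dictionary through JSON text using the standard `json` module. The field names and values are made up for the example and are not real Twitter data.

```python
import json

# A made-up record with nested lists and objects (not real Twitter data)
record = {
    "name": "example_user",
    "followers": 42,
    "tags": ["health", "news"],
    "profile": {"verified": False, "location": None},
}

# Python dict -> JSON text
text = json.dumps(record, indent=2)
print(text)

# JSON text -> Python dict; nested values come back as dicts and lists
parsed = json.loads(text)
print(type(parsed), parsed["tags"][0], parsed["profile"]["verified"])
```

Note that `json.dumps` and `json.loads` work with strings, while `json.dump` and `json.load` work with open files.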

In [3]:
# let's start by importing both
import tweepy  # install first with: pip install tweepy (see https://docs.tweepy.org/en/latest/install.html)
import json

To collect data through the Twitter API, you need Twitter developer access.
Check the "Apply for Twitter Developer Account" document that was shared via Announcements.

### Write a function that loads the keys from your keys.json file

In [4]:
# Function to read the key file and load keys in a dictionary
def loadKeys(key_file):
    with open(key_file) as f:
        key_dict = json.load(f)
    return key_dict['api_key'], key_dict['api_secret'], key_dict['token'], key_dict['token_secret']
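
`loadKeys` assumes that the key file is a JSON object with the four fields it reads: `api_key`, `api_secret`, `token`, and `token_secret`. The sketch below writes such a file with placeholder values to a temporary directory and reads it back the same way. The file name and the values are illustrative only.

```python
import json
import os
import tempfile

# Placeholder credentials: replace with the values from your developer account
sample_keys = {
    "api_key": "YOUR_API_KEY",
    "api_secret": "YOUR_API_SECRET",
    "token": "YOUR_ACCESS_TOKEN",
    "token_secret": "YOUR_ACCESS_TOKEN_SECRET",
}

# Write the sample file to a temporary directory (hypothetical file name)
key_path = os.path.join(tempfile.mkdtemp(), "twitterkeys_sample.json")
with open(key_path, "w") as f:
    json.dump(sample_keys, f, indent=2)

# Read it back the same way the loader above does
with open(key_path) as f:
    key_dict = json.load(f)

# Check that every field the loader expects is present
missing = [k for k in sample_keys if k not in key_dict]
print("missing fields:", missing)
```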

### Authorizing an application to access Twitter account data
At this stage, we will be using Twitter API v1.1. Twitter very recently rolled out API v2 (Jan 2021). 
At the time of writing, tweepy is not compatible with API v2. Later we will use other means of reaching the new endpoints provided by API v2.

For now, let's focus on tweepy and how it gives you easy access to Twitter API v1.1.

Twitter requires all requests to use OAuth for authentication.
For more details, see here:  https://docs.tweepy.org/en/latest/getting_started.html

In [5]:
KEY_FILE = 'twitterkeys.json'
api_key, api_secret, token, token_secret = loadKeys(KEY_FILE)
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(token, token_secret)
api = tweepy.API(auth)

At this stage, you are ready to fetch data.
The `API` class provides access to the methods of the Twitter REST API.

Let's try a few API methods.

### Your home timeline tweets

`API.home_timeline`: https://docs.tweepy.org/en/latest/api.html?highlight=home#API.home_timeline

By default, it returns the 20 most recent statuses.

In [None]:
my_tweets = api.home_timeline()

In [None]:
for tweet in my_tweets:
    print(tweet.text)

In [None]:
len(my_tweets)

In [None]:
type(tweet)

In [None]:
# The Status object is really hard to read. But we do see that it has a _json property. Let's try printing that.
tweet

In [None]:
# much better to read
tweet._json # Also check: type(tweet._json) 

## Extract user timelines
[API.user_timeline](https://docs.tweepy.org/en/latest/api.html?highlight=pi.user_timeline#API.user_timeline): Returns the 20 most recent statuses posted from the authenticating user or the user specified

In [6]:
user_tweets = api.user_timeline(id = "WHO")

In [7]:
# let's look at the returned tweets
for tweet in user_tweets:
    print(tweet.text)

RT @WHO: LIVE: @DrTedros' opening remarks at the #EB148. https://t.co/9HyDVXgyUT
RT @DrTedros: There will be enough #COVID19 vaccine for everyone. But right now, we must work together as one global family to prioritize t…
RT @DrTedros: Today, I called on:
-countries with bilateral #COVID19 vaccine contracts to be transparent on these contracts with COVAX
-pro…
RT @DrTedros: #COVID19 vaccines are the shot in the arm we all need literally &amp; figuratively. But the promise of equitable access is at ser…
RT @DrTedros: A year ago, #COVID19 emerged &amp; sparked a pandemic. Life-saving vaccines have been developed. What happens next is up to us. I…
It’s #CervicalCancer Awareness Month!

All adult women 👩👩🏾‍🦱 🧕👩🏻 should undergo periodic screening to detect pre-ca… https://t.co/KTMVb1aX0k
January is Cervical Cancer Awareness Month!
 
Get your #CervicalCancer questions answered by following… https://t.co/0VBIZvVMh8
#EB148 - Agenda item 14: #COVID19 response https://t.co/x7HgeopOxi
RT @DrTedro
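
The printed text above contains `&amp;` because the API returns tweet text with HTML entities escaped. A minimal sketch of decoding it with the standard library, using a shortened copy of one of the lines above:

```python
import html

# Shortened copy of one of the printed tweets
raw_text = "A year ago, #COVID19 emerged &amp; sparked a pandemic."

# html.unescape turns entities such as &amp; back into plain characters
clean_text = html.unescape(raw_text)
print(clean_text)
```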

In [8]:
# much better to read
tweet._json # Also check: type(tweet._json) 

{'created_at': 'Mon Jan 18 11:43:12 +0000 2021',
 'id': 1351133032388943879,
 'id_str': '1351133032388943879',
 'text': '"The draft budget reflects the need to forge ahead with the work we have to do, implementing the lessons we have al… https://t.co/a7XOFI77JR',
 'truncated': True,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [{'url': 'https://t.co/a7XOFI77JR',
    'expanded_url': 'https://twitter.com/i/web/status/1351133032388943879',
    'display_url': 'twitter.com/i/web/status/1…',
    'indices': [117, 140]}]},
 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>',
 'in_reply_to_status_id': 1351132211995684867,
 'in_reply_to_status_id_str': '1351132211995684867',
 'in_reply_to_user_id': 14499829,
 'in_reply_to_user_id_str': '14499829',
 'in_reply_to_screen_name': 'WHO',
 'user': {'id': 14499829,
  'id_str': '14499829',
  'name': 'World Health Organization (WHO)',
  'screen_name': 'WHO',
  'location': 'Geneva, Switzer
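
Because `_json` is an ordinary Python dictionary, nested fields can be reached with normal indexing. The sketch below uses a hand-copied subset of the output above (not a live API response) and also parses `created_at` into a `datetime`. The format string is an assumption based on the timestamp shown.

```python
from datetime import datetime

# Hand-copied subset of the dictionary printed above
tweet_json = {
    "created_at": "Mon Jan 18 11:43:12 +0000 2021",
    "id_str": "1351133032388943879",
    "truncated": True,
    "entities": {
        "hashtags": [],
        "urls": [{"url": "https://t.co/a7XOFI77JR", "indices": [117, 140]}],
    },
    "user": {"id": 14499829, "screen_name": "WHO"},
}

# Nested dictionaries and lists are accessed step by step
screen_name = tweet_json["user"]["screen_name"]
first_url = tweet_json["entities"]["urls"][0]["url"]

# Parse the timestamp into a timezone-aware datetime
# (format string assumed from the value shown above)
created = datetime.strptime(tweet_json["created_at"], "%a %b %d %H:%M:%S %z %Y")

print(screen_name, first_url, created.isoformat())
```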

<span class="mark">**TODO**</span>
For the tweets in the user timeline, you already saw the tweet text by printing `tweet.text`. 
Pick a few other properties and print those (pick at least two).

In [None]:
# Your code below



## Followers and friends

https://docs.tweepy.org/en/latest/api.html?highlight=followers#API.followers

Searching for followers and friends of a specified user.

[`API.get_user`](https://docs.tweepy.org/en/latest/api.html?highlight=get_user#API.get_user): Returns information about the specified user.

In [None]:
# Search for followers of your favorite celebrity.
# Fetch first 10 followers and friends of your favorite celebrity. You need to get their Twitter handle first.
    
USER_HANDLE = 'DrMLKJrWisdom' #Dr. Martin Luther King: https://twitter.com/DrMLKJrWisdom

In [None]:
user = api.get_user(USER_HANDLE)
print('# of followers = ', user.followers_count)
print('# of friends = ', user.friends_count)

In [None]:
type(user)

In [None]:
# Same idea as before: the _json property shows all the attributes available on a user object
user._json

<span class="mark">**TODO**:</span>

For your favorite celebrity, pick a few attributes, then print and check them.

In [None]:
# Your code below



## Trends data

https://docs.tweepy.org/en/latest/api.html?highlight=trends_place#API.trends_place

`API.trends_place`: returns the top 50 trending topics for a specific WOEID (Where On Earth ID)

In [None]:
# Where On Earth ID (WOEID) is a legacy identifier created by Yahoo.
# Twitter API v1.1 still uses the numeric value to identify town and country trend locations.
# WOEID for the entire world is 1 and WOEID for USA is 23424977
# See https://developer.twitter.com/en/docs/twitter-api/v1/trends/trends-for-location/api-reference/get-trends-place and
# https://blog.twitter.com/engineering/en_us/a/2010/woeids-in-twitters-trends.html

WORLD_WOE_ID = 1

world_trends = api.trends_place(WORLD_WOE_ID)
print(world_trends)

In [None]:
type(world_trends)

In [None]:
for trend in world_trends[0]['trends']:
    print(trend)

In [None]:
for trend in world_trends[0]['trends']:
    print(trend['name'])
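
Each entry in `world_trends[0]['trends']` is a dictionary. Assuming the v1.1 response shape, where every trend carries a `name` and a `tweet_volume` that may be `None`, the sketch below ranks trends by volume. The sample entries are invented placeholders, not real trend data.

```python
# Invented placeholder data shaped like the v1.1 trends response
sample_trends = [{
    "trends": [
        {"name": "#TopicA", "tweet_volume": 12000},
        {"name": "TopicB", "tweet_volume": None},   # volume is not always reported
        {"name": "#TopicC", "tweet_volume": 54000},
    ]
}]

trends = sample_trends[0]["trends"]

# Treat a missing volume as 0 so that sorting does not fail on None
ranked = sorted(trends, key=lambda t: t["tweet_volume"] or 0, reverse=True)
ranked_names = [t["name"] for t in ranked]
print(ranked_names)
```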

**TODO**
Let's find the trends for the USA and then determine which trends are common to the world and the USA.

USA_WOE_ID = 23424977

*Hint:* To find common trends, you can perform a set operation. 

In [None]:
# Your code below

USA_WOE_ID = 23424977


## Pagination (Cursor way)

Twitter API returns results in pages. 

`Cursor` handles all the pagination work for us behind the scenes, so our code can now focus entirely on processing the results. 

Check the link to see Old Way vs. Cursor Way https://docs.tweepy.org/en/latest/cursor_tutorial.html?highlight=Pagination
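
To make the idea concrete, here is a library-independent sketch of cursor-style pagination. `fetch_page` is a hypothetical stand-in for an API call that returns one page of results plus the cursor for the next page. It is not tweepy's implementation, only an illustration of the kind of loop that `Cursor` hides.

```python
# Pretend server-side data: 7 items served 3 per page
ALL_ITEMS = [f"tweet {i}" for i in range(7)]
PAGE_SIZE = 3

def fetch_page(cursor=0):
    """Hypothetical API call: return (items, next_cursor); next_cursor is None on the last page."""
    items = ALL_ITEMS[cursor:cursor + PAGE_SIZE]
    next_cursor = cursor + PAGE_SIZE
    if next_cursor >= len(ALL_ITEMS):
        next_cursor = None
    return items, next_cursor

def iterate_pages(max_pages=None):
    """Yield pages one at a time, following the cursor until it runs out."""
    cursor, served = 0, 0
    while cursor is not None and (max_pages is None or served < max_pages):
        items, cursor = fetch_page(cursor)
        served += 1
        yield items

pages = list(iterate_pages(max_pages=2))   # stop after two pages
all_pages = list(iterate_pages())          # follow the cursor to the end
print(len(pages), len(all_pages))
```

The caller only loops over pages and never touches the cursor value, which is the convenience that the Cursor way offers over the old way.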

In [None]:
# Tweepy cursors take away part of the complexity of working with paginated results. 
# Cursors are implemented as a Tweepy class named Cursor. To use a cursor, you select the API method to use
# to fetch items and specify the number of pages you want.

no_of_pages = 2
for page in tweepy.Cursor(api.user_timeline, id = "DrMLKJrWisdom").pages(no_of_pages):
    for status in page:
        print("\033[1mtweet :\033[0m " + status.text)

### Twitter's rate limit restriction

The Twitter API has rate limit restrictions, but sometimes you need more data than the rate limit allows in one go. For example, say you want to retrieve all followers of a celebrity, or all tweets from the President. How can you fetch all the friends and followers of a celebrity account?

Read more about the [API v1.1 rate limits](https://developer.twitter.com/en/docs/twitter-api/v1/rate-limits).



## Pagination and rate limit

https://docs.tweepy.org/en/latest/code_snippet.html?highlight=Pagination#handling-the-rate-limit-using-cursors


Fetching all the friends and followers of a celebrity account will usually exceed the rate limit, so the code has to pause and then continue.

#### A celebrity will have lots of followers and friends.
Every time you hit the rate limit, you need to wait (`time.sleep`) before issuing the next call.

In [None]:
# helper function
import time
SLEEP_TIME = 60 * 15
def getFollowers(api, root_user, no_of_followers):
    primary_followers = []
    try:
        for user in tweepy.Cursor(api.followers, screen_name = root_user).items(no_of_followers):
            primary_followers.append((user.screen_name,root_user))

    except tweepy.RateLimitError:
        print('rate limit hit. Will sleep for', SLEEP_TIME, 'seconds')
        time.sleep(SLEEP_TIME)  # wait for the rate-limit window to reset
        # retry from the start and return the result of the retry
        return getFollowers(api, root_user, no_of_followers)

    return primary_followers
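
The sleep-and-retry idea can also be written as a loop instead of recursion. The sketch below is library-independent: `FakeRateLimitError`, `flaky_call`, and `call_with_retry` are hypothetical names, and the sleep function is injectable so that the example runs instantly. With tweepy, one would catch `tweepy.RateLimitError` and sleep for the real 15-minute window.

```python
import time

class FakeRateLimitError(Exception):
    """Stand-in for a rate-limit exception raised by an API client."""

def call_with_retry(func, max_retries=3, wait_seconds=60 * 15, sleep=time.sleep):
    """Call func(); on a rate-limit error, wait and try again up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except FakeRateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(wait_seconds)

# Demo: a call that fails twice before succeeding
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise FakeRateLimitError()
    return ["follower_a", "follower_b"]

waits = []
result = call_with_retry(flaky_call, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

tweepy's `API` constructor also accepts `wait_on_rate_limit=True`, which asks the client to wait automatically when the limit is reached.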

In [None]:
NO_OF_FOLLOWERS = 10

In [None]:
primary_followers = getFollowers(api, USER_HANDLE, NO_OF_FOLLOWERS)

In [None]:
primary_followers

In [None]:
NO_OF_FOLLOWERS = 1000

In [None]:
primary_followers = getFollowers(api, USER_HANDLE, NO_OF_FOLLOWERS)

In [None]:
primary_followers

## Twitter search APIs

The search API allows you to search for tweets containing a particular keyword.

`API.search`: [Read the docs](https://docs.tweepy.org/en/latest/api.html?highlight=api.search#API.search)

When preparing this notebook, the trending topic on my Twitter feed was #TrumpsNewArmy.
Let's fetch tweets with this keyword (in this case, a hashtag):
https://twitter.com/search?q=%23TrumpsNewArmy&src=trend_click&vertical=trends

In [None]:
# Now we will search for tweets containing a particular keyword

search_term = "#TrumpsNewArmy"
no_of_pages = 1 #let's just fetch 1 page of data

for page in tweepy.Cursor(api.search, q = search_term, lang="en").pages(no_of_pages):
    for status in page:
        print("\033[1mtweet :\033[0m " + status.text)

### Remove retweets from the tweets returned

In [None]:
new_search = search_term + " -filter:retweets"
no_of_pages = 1

for page in tweepy.Cursor(api.search, q = new_search, lang="en").pages(no_of_pages):
    for status in page:
        print("\033[1mtweet :\033[0m " + status.text)
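
Search queries are plain strings, so they can be assembled programmatically. The helper below is a hypothetical convenience function, not part of tweepy. It joins terms with `OR`, groups them in parentheses (assuming the search syntax accepts grouping), and optionally appends `-filter:retweets`. The last lines show how a query is percent-encoded when placed in a URL, which is why `#` appears as `%23` in the search link above.

```python
from urllib.parse import quote

def build_query(terms, exclude_retweets=True):
    """Join search terms with OR; wrap multi-word phrases in double quotes."""
    parts = [f'"{t}"' if " " in t else t for t in terms]
    query = " OR ".join(parts)
    if len(parts) > 1:
        # Grouping with parentheses is assumed to be supported by the search syntax
        query = f"({query})"
    if exclude_retweets:
        query += " -filter:retweets"
    return query

# Example terms are arbitrary placeholders
query = build_query(["#python", "data science"])
print(query)

# Percent-encoding for use in a URL: '#' becomes %23
encoded_hashtag = quote("#TrumpsNewArmy")
print(encoded_hashtag)
```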

<span class="mark">**TODO**</span>

Search for tweets containing the exact phrase "COVID19" or "vaccine" or both, and exclude retweets.

For more details on how to form search query: https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/search-operators

In [None]:
# Your code here


#### Possible paths for later

- You can fetch data and save it in a .csv file, or use `json.dump` to write it to a .json file for data analysis later
- You can load the data into a pandas DataFrame to perform all sorts of analysis
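
A minimal sketch of the first option, using only the standard library. The rows are invented placeholders standing in for fields pulled from tweets, and the file names are hypothetical.

```python
import csv
import json
import os
import tempfile

# Invented placeholder rows standing in for fields extracted from tweets
rows = [
    {"id_str": "1", "screen_name": "user_a", "text": "first example"},
    {"id_str": "2", "screen_name": "user_b", "text": "second example"},
]

out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "tweets.json")
csv_path = os.path.join(out_dir, "tweets.csv")

# json.dump writes straight to a file; json.dumps would return a string instead
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)

# csv.DictWriter writes one row per dictionary
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id_str", "screen_name", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Read both files back to check the round trip
with open(json_path, encoding="utf-8") as f:
    rows_from_json = json.load(f)
with open(csv_path, newline="", encoding="utf-8") as f:
    rows_from_csv = [dict(r) for r in csv.DictReader(f)]

print(rows_from_json == rows, rows_from_csv == rows)
```

Keeping `id_str` as a string avoids losing precision on very large tweet IDs when the file is opened in other tools.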