
![Twitter-logo](https://unsplash.com/photos/k1xf2D7jWUs/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8NXx8dHdlZXR8fDB8fHx8MTYzODg2MjM4MA&force=true&w=640)

# 1. Introduction

Natural language processing (NLP) on high-frequency data from social media platforms like Twitter is growing in importance. Companies use it to engage with their customers for live feedback, gather intelligence about new product launches, monitor customer complaints, and analyse the effectiveness of marketing campaigns. It can also be used to gauge public sentiment and reactions to policy decisions taken by government bodies.

Twitter makes different APIs available for mining tweets. There are three levels of access to its APIs: Essential, Elevated and Academic. With Essential and Elevated access, one can retrieve tweets that are up to a week old. The Academic license, however, comes with two premium search APIs, `search_30_day` and `search_full_archive`, which allow searching tweets from the last 30 days and beyond. Academic access also allows the use of advanced search operators in queries. We will learn more about them in this blog.

For an academic researcher, there aren't enough examples to get started readily on a project. The first problem is that the structure of the JSON returned for a tweet differs based on the type of the tweet. The second problem is that the tweet text in search results is normally truncated if the tweet exceeds 140 characters. Third, Twitter doesn't return all the data in one go; rather, it spreads the results over multiple pages, with each page containing 100 tweets by default. Fetching each of these pages counts as a fresh request and is monitored by Twitter. The free academic access only provides 250 requests per month.

The Twitter developer resources [1] are exhaustive, but working through them is time-consuming. This blog is intended to give researchers a head start with a simple Python snippet that pulls the full text of historical tweets filtered by keywords and advanced search operators, saving the time and effort of browsing the documentation and experimenting with the Twitter APIs.

Before we proceed, I assume that the reader already has a Twitter developer account with academic access. In this demonstration, I will be using the `Tweepy` library, which is a wrapper around the Twitter API. It makes our work easier by reducing the lines of code we need to write. Let's begin.


# 2. Workflow

Twitter APIs are designed to reduce the load on its servers, i.e. the data transferred per request. In most of the standard APIs, the text is truncated by default, so one has to use `extended` mode to get the full text of the tweet. In the premium search APIs, however, the `full_text` attribute holds the non-truncated original tweet, which is often what a researcher wants, but it is nested deep inside the JSON dictionary of the tweet, whose structure is not consistent across all tweets. The second issue is that the search results are presented in pages, just like a Google web search. Each page, by default, gives 100 tweets. To access tweets from subsequent pages, we need to paginate. Third, Twitter allows search operators in the query to filter the results; however, the query format differs slightly between the original Twitter APIs and Tweepy. We will see how to write a query that uses logical (AND/OR) operators to fine-tune the search results. In summary, the workflow is as follows:

* How to construct the twitter search API to extract all the tweets 

* How to avoid truncation and get full text of the tweet

* How to do pagination

* How to specify logical AND/OR arguments in a query

# 3. How to construct the twitter search API to extract all the tweets

We will set up the API constructor and authorise it with the credentials received as part of academic access. The Twitter sandbox environment imposes rate limits of up to 30 requests per minute and 10 requests per second. The Tweepy API constructor accepts a `wait_on_rate_limit` parameter, which makes it wait before sending new requests when the Twitter rate limits are reached.

In [1]:
# Establishing connection and authorisation:
import os
import tweepy
import pandas as pd
import json
# Add your credentials here (read from environment variables)
twitter_keys = {
    'consumer_key': os.environ.get('API_KEY'),
    'consumer_secret': os.environ.get('API_KEY_SECRET'),
    'access_token_key': os.environ.get('ACCESS_TOKEN'),
    'access_token_secret': os.environ.get('ACCESS_TOKEN_SECRET')
}

# Set up access to the API
auth = tweepy.OAuthHandler(twitter_keys['consumer_key'], twitter_keys['consumer_secret'])
auth.set_access_token(twitter_keys['access_token_key'], twitter_keys['access_token_secret'])

api = tweepy.API(auth, wait_on_rate_limit=True)
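
Because `os.environ.get` returns `None` when a variable is not set, a missing credential would only surface later as an authentication error. A minimal sketch of an early check is given below, assuming the same four environment variable names used in the cell above. The helper `find_missing_credentials` is hypothetical and not part of Tweepy.

```python
import os

# Environment variable names assumed from the snippet above
REQUIRED_VARS = ("API_KEY", "API_KEY_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

def find_missing_credentials(environ=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in required if not environ.get(name)]

missing = find_missing_credentials()
if missing:
    print("Set these environment variables before authenticating:", ", ".join(missing))
```

Running this before creating the auth handler makes a configuration problem visible before any request is spent.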

The premium search APIs can now be invoked, but this first requires creating a sandbox environment in the Twitter developer account. I have labelled my sandbox environment `research`. The search function requires the environment label and the query argument. Optionally, the `fromDate` and `toDate` fields can be provided to filter search results by time. Note that the date fields use the format `YYYYMMDDHHMM`. The tweets are fetched in reverse chronological order, i.e. the latest tweet comes first. The query field is limited to 256 characters in sandbox environments. By default, the call fetches the first page of results, consisting of 100 tweets as discussed earlier; in a non-sandbox environment this can be increased with the `maxResults` parameter, up to a maximum of 500 tweets per page.

In [53]:
tweet_results = api.search_30_day(label='research', query="RBI", fromDate="202111100000", toDate="202111300000")
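
The `YYYYMMDDHHMM` strings need not be typed by hand. The sketch below builds them from `datetime` objects; `to_search_timestamp` is a hypothetical helper name, and treating the timestamps as UTC is an assumption on my part.

```python
from datetime import datetime, timedelta, timezone

def to_search_timestamp(moment):
    """Format a datetime as the YYYYMMDDHHMM string used by fromDate and toDate."""
    return moment.strftime("%Y%m%d%H%M")

# Example: a window covering the 20 days before a fixed end point (UTC assumed)
end = datetime(2021, 11, 30, 0, 0, tzinfo=timezone.utc)
start = end - timedelta(days=20)

from_date = to_search_timestamp(start)
to_date = to_search_timestamp(end)
print(from_date, to_date)  # 202111100000 202111300000
```

The two strings reproduce the values passed as `fromDate` and `toDate` in the call above.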

If you print `tweet_results`, it will look like gibberish at first sight. It is essentially a list of 100 status objects, with each tweet packaged as one status object. You can access an individual tweet by index, as with any list, and inspect its JSON structure. Try printing the first tweet object with the following code.

In [59]:
print(json.dumps(tweet_results[0]._json, indent=4, sort_keys=True))

{
    "contributors": null,
    "coordinates": null,
    "created_at": "Mon Nov 29 23:59:17 +0000 2021",
    "entities": {
        "hashtags": [],
        "symbols": [],
        "urls": [],
        "user_mentions": [
            {
                "id": 958513313041788928,
                "id_str": "958513313041788928",
                "indices": [
                    3,
                    14
                ],
                "name": "6-4-3",
                "screen_name": "643Network"
            }
        ]
    },
    "favorite_count": 0,
    "favorited": false,
    "filter_level": "low",
    "geo": null,
    "id": 1465470451094806536,
    "id_str": "1465470451094806536",
    "in_reply_to_screen_name": null,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "is_quote_status": false,
    "lang": "es",
    "matching_rules": [
        {
            "tag": null
        }
    ],
    "place":

# 4. How to avoid truncation and get full text of the tweet

The core structure of a tweet is described on the developer portal [2]. As indicated earlier, the payload returned by the search query differs depending on the type of the tweet. There are four types of tweet: Original Tweet, Retweet, Reply Tweet and Quote Tweet [3]. We will change our extraction logic based on the JSON structure of each type. First, we write a function to determine the type of the tweet.

In [2]:
# Code reproduced from Twitter developer API documentation [4]
def determine_tweet_type(tweet):
    # Check for reply indicator first
    if tweet["in_reply_to_status_id"] is not None:
        tweet_type = "Reply Tweet"
    # Check boolean quote status field but make sure it's not a Retweet (of a Quote Tweet) 
    elif tweet["is_quote_status"] is True and not tweet["text"].startswith("RT"):
        tweet_type = "Quote Tweet"
    # Check both indicators of a Retweet
    elif tweet["text"].startswith("RT") and tweet.get("retweeted_status") is not None:
        tweet_type = "Retweet"
    else:
        tweet_type = "Original Tweet"
    return tweet_type  

As a next step, we will write a parsing function that extracts the full text and other relevant attributes from each tweet's JSON and stores them in our own customised dictionary. The sample code is provided below.

In [5]:
parsedTweets = []

def parse_tweets(status):
    """Append one flattened dictionary per tweet in `status` to parsedTweets."""
    for tweet in status:
        data = tweet._json
        tweet_type = determine_tweet_type(data)  # computed once per tweet

        # Pick the object that carries the text we want to keep
        if tweet_type == 'Retweet':
            source = data['retweeted_status']
        elif tweet_type == 'Quote Tweet':
            # quoted_status may be absent (e.g. the quoted tweet was deleted),
            # in which case we fall back to the tweet itself
            source = data.get('quoted_status', data)
        else:
            source = data

        # Long tweets carry their complete text inside extended_tweet
        if 'extended_tweet' in source:
            full_text = source['extended_tweet']['full_text']
        else:
            full_text = source['text']

        mydict = {
            "tweet_id": data["id_str"],
            "date": data["created_at"],
            "full_text": full_text,
            "tweet_type": tweet_type,
            "reply_count": data["reply_count"],      # Number of times Tweet has been replied to
            "quote_count": data["quote_count"],      # Number of times Tweet has been quoted
            "likes_count": data["favorite_count"],   # Number of times Tweet has been liked
            "retweet_counts": data["retweet_count"], # Number of times Tweet has been retweeted
            "hyperlink": "https://twitter.com/twitter/status/" + data["id_str"],
        }
        parsedTweets.append(mydict)  # Add Tweet to parsedTweets list
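
To make the truncation logic easier to see in isolation, here is a simplified sketch that works on plain dictionaries instead of Tweepy status objects. The function name `extract_full_text` and the three payloads are made up for illustration (they are not real API output), and the sketch only covers the original-tweet and retweet cases.

```python
def extract_full_text(tweet):
    """Return the untruncated text of a tweet payload given as a dictionary."""
    # For a Retweet, read from the embedded original tweet
    source = tweet.get("retweeted_status") or tweet
    extended = source.get("extended_tweet")
    if extended:
        return extended["full_text"]
    return source["text"]

# Hand-written payloads for illustration only (not real API output)
short_tweet = {"text": "RBI keeps rates unchanged"}
long_tweet = {
    "text": "RBI keeps rates unchanged and ...",
    "truncated": True,
    "extended_tweet": {"full_text": "RBI keeps rates unchanged and retains its growth forecast"},
}
retweet = {"text": "RT @someone: RBI keeps rates ...", "retweeted_status": long_tweet}

for payload in (short_tweet, long_tweet, retweet):
    print(extract_full_text(payload))
```

The retweet and the long tweet print the same complete sentence, because the retweet's own `text` field is ignored in favour of the embedded original.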
            
            

# 5. How to do pagination with Tweepy 

Tweepy makes pagination very simple. All that is required is to wrap the API search call in Tweepy's `Cursor` object and call its `pages()` method, which breaks the results into pages [5]. The next step is to iterate through the pages and parse the tweets in each one. While testing your code, it is advisable to pass a small number to `pages()`, as it limits the number of requests to that number. When the full results are required, you can omit this argument.

In [6]:
parsedTweets = []

for page in tweepy.Cursor(api.search_30_day,label='research', 
                          query="\"Governor\" \"RBI\"",
                          fromDate='202111130000',
                          toDate='202111170000').pages(1):
    
    parse_tweets(page)
    
    
tweetdf = pd.DataFrame(parsedTweets)

tweetdf

| | tweet_id | date | full_text | tweet_type | reply_count | quote_count | likes_count | retweet_counts | hyperlink |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1460758236945010692 | Tue Nov 16 23:54:37 +0000 2021 | Manmohan ji jaise pm radhu ram jaise rbi gover... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607582369... |
| 1 | 1460756412506779652 | Tue Nov 16 23:47:22 +0000 2021 | Deeper Issues Are Involved With Cryptocurrenci... | Original Tweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607564125... |
| 2 | 1460746805352804352 | Tue Nov 16 23:09:12 +0000 2021 | Economic recovery catches up; Private investme... | Original Tweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607468053... |
| 3 | 1460744298693943297 | Tue Nov 16 22:59:14 +0000 2021 | RBI Governor bats for crypto ban, says it coul... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607442986... |
| 4 | 1460739977587875843 | Tue Nov 16 22:42:04 +0000 2021 | India’s economic recovery now ‘taking hold’, c... | Original Tweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607399775... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1460662137953550339 | Tue Nov 16 17:32:46 +0000 2021 | #CryptoBubble \| RBI Governor reiterates his re... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14606621379... |
| 96 | 1460662024426430467 | Tue Nov 16 17:32:19 +0000 2021 | RBI governor @DasShaktikanta said the country'... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14606620244... |
| 97 | 1460661451748569094 | Tue Nov 16 17:30:02 +0000 2021 | Manmohan ji jaise pm radhu ram jaise rbi gover... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14606614517... |
| 98 | 1460661442378616842 | Tue Nov 16 17:30:00 +0000 2021 | A gradual but uneven recovery is underway, say... | Original Tweet | 1 | 0 | 23 | 1 | https://twitter.com/twitter/status/14606614423... |
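
To see why the argument to `pages()` caps the number of requests, the toy example below imitates a paginated endpoint. Both `fake_search` and `iterate_pages` are hypothetical stand-ins that only mimic the idea of following a next-page token; this is not how Tweepy implements `Cursor` internally.

```python
def fake_search(next_token=None, page_size=100, total=250):
    """Pretend endpoint: returns one page of integers and the token for the next page."""
    start = next_token or 0
    end = min(start + page_size, total)
    page = list(range(start, end))
    new_token = end if end < total else None
    return page, new_token

def iterate_pages(search, limit=None):
    """Yield pages until the results run out or `limit` requests have been made."""
    token, requests_made = None, 0
    while limit is None or requests_made < limit:
        page, token = search(next_token=token)  # one call stands for one request
        requests_made += 1
        yield page
        if token is None:
            break

print([len(p) for p in iterate_pages(fake_search)])           # [100, 100, 50]
print([len(p) for p in iterate_pages(fake_search, limit=1)])  # [100]
```

With no limit, the loop keeps requesting pages until the token runs out; with `limit=1`, it stops after a single request, which is the behaviour you want while testing.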


Alternatively, the `Cursor` object's `items()` method returns an iterator over individual status objects, which can then be looped over to extract the elements of each tweet. While testing your code, it is advisable to pass a small number to `items()`, as it limits the number of tweets to that number. When the full results are required, you can omit this argument. Note that with both approaches the number of requests shown in the usage dashboard will be the same, as Twitter counts a page of up to 100 tweets as one request.

In [7]:
parsedTweets = []
# Exact-phrase syntax: https://twittercommunity.com/t/correct-syntax-for-an-exact-phrase-match-and-keyword-query/124617
# To match a phrase, wrap it in double quotes; they are escaped here because
# the Python string itself is delimited by double quotes.
status = tweepy.Cursor(api.search_30_day, label='research',
                       query="\"Governor\" \"RBI\"",
                       fromDate='202111130000',
                       toDate='202111170000').items(100)
parse_tweets(status)

tweetdf = pd.DataFrame(parsedTweets)

tweetdf

| | tweet_id | date | full_text | tweet_type | reply_count | quote_count | likes_count | retweet_counts | hyperlink |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1460758236945010692 | Tue Nov 16 23:54:37 +0000 2021 | Manmohan ji jaise pm radhu ram jaise rbi gover... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607582369... |
| 1 | 1460756412506779652 | Tue Nov 16 23:47:22 +0000 2021 | Deeper Issues Are Involved With Cryptocurrenci... | Original Tweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607564125... |
| 2 | 1460746805352804352 | Tue Nov 16 23:09:12 +0000 2021 | Economic recovery catches up; Private investme... | Original Tweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607468053... |
| 3 | 1460744298693943297 | Tue Nov 16 22:59:14 +0000 2021 | RBI Governor bats for crypto ban, says it coul... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607442986... |
| 4 | 1460739977587875843 | Tue Nov 16 22:42:04 +0000 2021 | India’s economic recovery now ‘taking hold’, c... | Original Tweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14607399775... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 1460662137953550339 | Tue Nov 16 17:32:46 +0000 2021 | #CryptoBubble \| RBI Governor reiterates his re... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14606621379... |
| 96 | 1460662024426430467 | Tue Nov 16 17:32:19 +0000 2021 | RBI governor @DasShaktikanta said the country'... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14606620244... |
| 97 | 1460661451748569094 | Tue Nov 16 17:30:02 +0000 2021 | Manmohan ji jaise pm radhu ram jaise rbi gover... | Retweet | 0 | 0 | 0 | 0 | https://twitter.com/twitter/status/14606614517... |
| 98 | 1460661442378616842 | Tue Nov 16 17:30:00 +0000 2021 | A gradual but uneven recovery is underway, say... | Original Tweet | 1 | 0 | 23 | 1 | https://twitter.com/twitter/status/14606614423... |
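
Since requests are the scarce resource, it helps to estimate how many a pull will consume before running it. The sketch below is a rough estimate under two assumptions: every page comes back full (100 tweets), and the monthly quota is the 250 requests quoted in the introduction. `requests_needed` is a hypothetical helper.

```python
import math

def requests_needed(n_tweets, page_size=100):
    """Number of requests needed to fetch n_tweets at page_size tweets per request."""
    return math.ceil(n_tweets / page_size)

MONTHLY_REQUESTS = 250  # monthly quota quoted in the introduction

for n in (100, 101, 5000):
    used = requests_needed(n)
    print(n, "tweets ->", used, "requests,", MONTHLY_REQUESTS - used, "left this month")
```

Note the jump from 100 to 101 tweets: a single extra tweet costs a whole additional request.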


# 6. How to specify logical AND/OR arguments in a query

Twitter search APIs also allow logical operators inside the query, which is simply a string. Within this string, we can combine keywords: a space between terms acts as AND, while OR is written explicitly. The only catch is that the Twitter API expects search rules as JSON values [6,7], so the double quotes around an exact phrase have to be escaped when they appear inside a double-quoted string. Some illustrative examples are given below:

1. `query = "\"RBI article\""` returns tweets that match the exact phrase `RBI article`.
2. `query = "\"RBI article\" OR \"RBI Bulletin\""` returns tweets that contain either of the exact phrases `RBI article` or `RBI Bulletin`.
3. `query = "\"RBI\" \"Governor\""` returns tweets that contain both the keywords `RBI` and `Governor`, irrespective of the order in which they appear.
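
Escaping quotes by hand gets error-prone once a query has several phrases. The sketch below assembles the same kinds of query strings from small helpers and checks them against the 256-character sandbox limit mentioned earlier. All four helper names (`exact`, `any_of`, `all_of`, `check_length`) are hypothetical and not part of Tweepy or the Twitter API.

```python
MAX_QUERY_LENGTH = 256  # sandbox limit mentioned earlier in this post

def exact(phrase):
    """Wrap a phrase in double quotes so it is matched exactly."""
    return '"' + phrase + '"'

def any_of(*terms):
    """Join terms with OR: a tweet needs to match at least one of them."""
    return " OR ".join(terms)

def all_of(*terms):
    """Join terms with a space, which the search API treats as AND."""
    return " ".join(terms)

def check_length(query, limit=MAX_QUERY_LENGTH):
    """Raise ValueError if the query exceeds the character limit."""
    if len(query) > limit:
        raise ValueError(f"Query has {len(query)} characters; the limit is {limit}")
    return query

query = check_length(any_of(exact("RBI article"), exact("RBI Bulletin")))
print(query)  # "RBI article" OR "RBI Bulletin"
```

The resulting string can be passed as the `query` argument of the search call; using single quotes around the Python literal inside `exact` avoids the backslash escapes altogether.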



# 7. Conclusion

In this blog, I discussed the difficulties researchers encounter when using the Twitter premium search APIs. The scarcity of worked examples turns these problems into a time-consuming ordeal, shifting the focus from the original research question to how to use the Twitter APIs to get the relevant tweets in the desired format. We addressed truncation, pagination and the use of logical operators in queries, and wrote code that researchers can adapt directly to their own use cases.

### Before we wind up,

I invite you to join me in this exciting data science odyssey. Follow my Medium page to explore more content about data science.

## Resources

1. https://developer.twitter.com/en/use-cases/do-research/academic-research/resources
2. https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet
3. https://developer.twitter.com/en/docs/tutorials/determining-tweet-types
4. https://github.com/twitterdev/enterprise-scripts-python/blob/main/Search-API/search_tweet_type.py
5. https://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html
6. https://developer.twitter.com/en/docs/twitter-api/premium/rules-and-filtering/using-premium-operators
7. https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query
