# ReThink Media Twitter API

This notebook is for the development and exploration of code for ReThink Media's Twitter API Python interface. The main goals of this notebook are:

- Search Tweets: query, date (optional)
  - Past seven days
  - Past 30 days
  - Full archive
  - Language = English
- Collect Tweets in .csv file
- Add data visualization
  - Top hashtags, keywords, influencers
  - Volume over time for queries/topics

In [None]:
# importing necessary modules
from dotenv import load_dotenv
import os
import json
import numpy as np
import pandas as pd
import tweepy

load_dotenv()

## Utility Functions

Functions for general use across the different analysis functions within the notebook.

In [33]:
# function to parse Twitter API v2 response into a DataFrame of Tweet data
def tweet_df(df, response, tweet_fields):
    
    # looping through each Tweet in response, parsing data
    # (response[0] is None when no Tweets are returned, so fall back to an empty list)
    for tweet in response[0] or []:
        tweet_data = {}
        for field in tweet_fields:
            if tweet[field]:
                tweet_data[field] = tweet[field]
            else:
                tweet_data[field] = None
        
        # extracting hashtags from "entities" field and adding them as their own column;
        # the column is always set so every row has the same keys
        entities = tweet_data.get('entities') or {}
        tweet_data['entities_hashtags'] = entities.get('hashtags')
        df.loc[tweet.id] = tweet_data
    
    return df

## Authentication

The variables below are what allow access to the Twitter API. I've defined them in a `.env` file, and I'm retrieving them with the code below. We then pass those variables into a tweepy client to instantiate a Twitter API instance.

In [None]:
# retrieving environment variables
consumer_key = os.getenv("API_KEY")
consumer_secret = os.getenv("API_KEY_SECRET")
bearer_token = os.getenv("BEARER_TOKEN")
access_token = os.getenv("ACCESS_TOKEN")
access_secret = os.getenv("ACCESS_SECRET")
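
Since `os.getenv` silently returns `None` for a variable that is missing from the `.env` file, a quick check before authenticating can save some debugging. The sketch below is only an illustration: `missing_credentials` is a hypothetical helper, and the variable names are simply the ones read in the cell above.

```python
import os

# Names taken from the os.getenv() calls in the cell above
REQUIRED_KEYS = ["API_KEY", "API_KEY_SECRET", "BEARER_TOKEN", "ACCESS_TOKEN", "ACCESS_SECRET"]

def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling `missing_credentials()` after `load_dotenv()` and raising an error if the list is non-empty would make a misconfigured `.env` file fail early.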

In [None]:
# Twitter API authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

In [2]:
# function to initialize Twitter API v1.1 instance (for 30-day and full archive search)
def init_api_1():
    
    # importing necessary modules and loading .env file
    from dotenv import load_dotenv
    import os
    import tweepy
    load_dotenv()
    
    # retrieving environment variables from .env file
    consumer_key = os.getenv("API_KEY")
    consumer_secret = os.getenv("API_KEY_SECRET")
    bearer_token = os.getenv("BEARER_TOKEN")
    access_token = os.getenv("ACCESS_TOKEN")
    access_secret = os.getenv("ACCESS_SECRET")
    
    # Twitter API authentication
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    
    # instantiating Twitter API v1.1 reference
    api_1 = tweepy.API(auth)
    
    return api_1

In [3]:
# function to initialize Twitter API v2 instance (for 7-day search)
def init_api_2():
    # importing necessary modules and loading .env file
    from dotenv import load_dotenv
    import os
    import tweepy
    load_dotenv()
    
    # retrieving environment variables from .env file
    consumer_key = os.getenv("API_KEY")
    consumer_secret = os.getenv("API_KEY_SECRET")
    bearer_token = os.getenv("BEARER_TOKEN")
    access_token = os.getenv("ACCESS_TOKEN")
    access_secret = os.getenv("ACCESS_SECRET")
    
    # instantiating Twitter API v2 reference
    api_2 = tweepy.Client(bearer_token=bearer_token,
                         consumer_key=consumer_key,
                         consumer_secret=consumer_secret,
                         access_token=access_token,
                         access_token_secret=access_secret)
    
    return api_2

## Recent Search

Without a premium API dev subscription, the search function available to us in the Standard API package is restricted to the past seven days. For searches further back in the archive, we need to subscribe to a premium API dev environment or upgrade to the Academic API package, which is given to researchers with a clear thesis or research paper goal in mind.

The query can be at most 512 characters long, and the user can specify a `start_time` and `end_time` (as `datetime` or `str` objects) within the past seven days. Hashtags can be searched as well. Whitespace between terms is treated as an "AND" join, e.g., `hello world` = `hello AND world`. More information about Twitter API queries can be found [in their documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query).
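
To stay inside the seven-day window, the start and end times can be computed relative to the current time. This is a rough sketch rather than part of the notebook's pipeline: `recent_window` is a hypothetical helper, and the 30-second `MARGIN` is an assumed safety buffer, not a documented value.

```python
from datetime import datetime, timedelta, timezone

# Assumed safety buffer so neither bound sits exactly on the edge of the window
MARGIN = timedelta(seconds=30)

def recent_window(days_back, now=None):
    """Return (start_time, end_time) covering roughly the last `days_back` days."""
    if not 0 < days_back <= 7:
        raise ValueError("recent search only covers the past seven days")
    now = now or datetime.now(timezone.utc)
    end_time = now - MARGIN
    # never reach further back than seven days (minus the buffer)
    earliest = now - timedelta(days=7) + MARGIN
    start_time = max(now - timedelta(days=days_back), earliest)
    return start_time, end_time
```

Because `search_7` parses its date arguments from strings, the results could be passed along as `start.isoformat()` and `end.isoformat()`.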

The 7-day search allows an unlimited number of requests and up to 500,000 Tweets per month.

The 7-day search has a query character limit of 512 characters.
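
A query can be assembled and checked against the 512-character limit before it is sent. The helper below is a minimal sketch under that assumption; `build_query` and its parameters are hypothetical names, not part of tweepy or the Twitter API.

```python
MAX_QUERY_LENGTH = 512  # recent search limit stated above

def build_query(terms, hashtags=(), lang="en", max_length=MAX_QUERY_LENGTH):
    """Join terms and hashtags with spaces (implicit AND) and enforce the length limit."""
    parts = list(terms)
    # accept hashtags with or without the leading '#'
    parts += [f"#{tag.lstrip('#')}" for tag in hashtags]
    if lang:
        parts.append(f"lang:{lang}")
    query = " ".join(parts)
    if len(query) > max_length:
        raise ValueError(f"query is {len(query)} characters; the limit is {max_length}")
    return query
```

For example, `build_query(["hello", "world"], ["python"])` gives `"hello world #python lang:en"`. Note that `search_7` appends `lang:en` itself, so when feeding it a prebuilt query it would make sense to pass `lang=None` here and leave a few characters of headroom.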

The `response` object is a tuple, and it consists of four items: `(data, includes, errors, meta)`.

The `data` object contains the Tweets that are retrieved, and `meta` is the metadata for those Tweets. In this response object, `includes` and `errors` are empty; `includes` is only populated when `expansions` are requested, which this notebook does not do.
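
The four-item layout can be illustrated without calling the API. The snippet below uses a stand-in `namedtuple` that mirrors the field names listed above; it is not tweepy's own class, and the sample values are made up for illustration.

```python
from collections import namedtuple

# Stand-in mirroring the four fields named above; not tweepy's own class
Response = namedtuple("Response", ("data", "includes", "errors", "meta"))

# Made-up sample values, shaped like a search that returned one Tweet
response = Response(
    data=[{"id": 1, "text": "hello world"}],
    includes={},
    errors=[],
    meta={"result_count": 1},
)

# The tuple can be unpacked positionally or accessed by field name
data, includes, errors, meta = response
first_tweet = response.data[0]
```

This is why `tweet_df` can read the Tweets with `response[0]`: index 0 and the `data` field refer to the same object.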

In [138]:
# function to retrieve Tweets from the past 7 days relevant to a query
def search_7(query, start_date=None, end_date=None, max_results=20, write_csv=False, filename="search_7.csv"):
    
    # initializing API v2 instance
    api_2 = init_api_2()
    
    # parsing dates passed into function
    from dateutil import parser
    if start_date:
        start_date = parser.parse(start_date)
    if end_date:
        end_date = parser.parse(end_date)
    
    # setting Tweet data to be included in response
    tweet_fields = ["text", "attachments", "author_id", "context_annotations", "conversation_id", "created_at",
                   "entities", "geo", "in_reply_to_user_id", "lang", "public_metrics", "referenced_tweets"]
    
    # initializing variables for API calls and DataFrame for Tweet data
    import pandas as pd
    next_token = None
    num_results = 0
    tweets = pd.DataFrame(columns=tweet_fields+['entities_hashtags'])
    tweets.index.name = "Tweet ID"
    
    # aggregating multiple pages of query results
    import tweepy
    paginator_results = tweepy.Paginator(api_2.search_recent_tweets,
                                 query=f"{query} lang:en",
                                 start_time=start_date,
                                 end_time=end_date,
                                 tweet_fields=tweet_fields
                                ).flatten(max_results)
    
    # collecting tweets in a format acceptable by tweet_df()
    response = [[tweet for tweet in paginator_results]]
        
    # adding Tweet data to DataFrame
    tweets = tweet_df(tweets, response, tweet_fields)
    num_results = len(tweets)
    
    # writing Tweet DataFrame to csv file
    if write_csv:
        tweets.to_csv(filename)
    
    return tweets

In [137]:
test = search_7("hello world", max_results=300, write_csv=True)
print(len(test))
test

300


Tweet ID,text,attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,in_reply_to_user_id,lang,public_metrics,referenced_tweets,entities_hashtags
1453106019228336141,"Hello, yes. I would like to not be sick anymor...",{'media_keys': ['16_1453106013075386369']},421533256,,1453106019228336141,2021-10-26 21:07:27+00:00,"{'urls': [{'start': 152, 'end': 175, 'url': 'h...",,,en,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,
1453106012467122186,RT @1_LoveLiberty: Hello World ! I'm 15 minute...,{'media_keys': ['3_1452584689408040963']},1241085773425840131,,1453106012467122186,2021-10-26 21:07:25+00:00,"{'urls': [{'start': 55, 'end': 78, 'url': 'htt...",,,en,"{'retweet_count': 1053, 'reply_count': 0, 'lik...","[(type, id)]",
1453105986294599682,RT @1_LoveLiberty: Hello World ! I'm 15 minute...,{'media_keys': ['3_1452584689408040963']},2941106676,,1453105986294599682,2021-10-26 21:07:19+00:00,"{'urls': [{'start': 55, 'end': 78, 'url': 'htt...",,,en,"{'retweet_count': 1053, 'reply_count': 0, 'lik...","[(type, id)]",
1453105978023546880,HELLO WORLD.,,1453101068645289993,,1453105978023546880,2021-10-26 21:07:17+00:00,,,,en,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,
1453105954724089856,RT @KanchanGupta: “So-called bigoted 'outrage'...,,1449210345076125700,"[{'domain': {'id': '6', 'name': 'Sports Event'...",1453105954724089856,2021-10-26 21:07:11+00:00,"{'mentions': [{'start': 3, 'end': 16, 'usernam...",,,en,"{'retweet_count': 581, 'reply_count': 0, 'like...","[(type, id)]",
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1453093853091803140,"RT @MazikeenofH: Hello @hbomax , please consid...",,1275508883667247107,"[{'domain': {'id': '47', 'name': 'Brand', 'des...",1453093853091803140,2021-10-26 20:19:06+00:00,"{'mentions': [{'start': 3, 'end': 15, 'usernam...",,,en,"{'retweet_count': 4, 'reply_count': 0, 'like_c...","[(type, id)]",
1453093821408059393,"Hello World! It is October 26th 2021, 8:18:58 ...",,991029304644591618,,1453093821408059393,2021-10-26 20:18:58+00:00,,,,en,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,
1453093819143135238,Hello world! \n\nhttps://t.co/Puxu3heyCp,,1283171787803688960,,1453093819143135238,2021-10-26 20:18:58+00:00,"{'urls': [{'start': 15, 'end': 38, 'url': 'htt...",,,en,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,
1453093815049355266,Hello World.\nWe meet again....,,424832768,,1453093815049355266,2021-10-26 20:18:57+00:00,,,,en,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,


## 30-Day/Full Archive Search

With a premium development environment, we can access 30-day and full archive searches through the Twitter API without an Academic API package. This requires interfacing with the API v1.1, as opposed to v2 in the Recent Search.

The 30-day search allows 250 requests and 25,000 Tweets per month, while the full archive search allows 50 requests and 5,000 Tweets per month.

Both 30-day and full archive searches have a query character limit of 256 characters.
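
The monthly quotas above work out to 100 Tweets per request for both endpoints, which gives a quick way to estimate how much of the request budget a search will use. This is a back-of-the-envelope sketch: `requests_needed` is a hypothetical helper, and it assumes every request returns a full page.

```python
import math

# Monthly quotas quoted above: (requests, Tweets)
QUOTAS = {"30day": (250, 25_000), "full": (50, 5_000)}

def requests_needed(max_results, label):
    """Estimate how many requests a search for `max_results` Tweets would use."""
    max_requests, max_tweets = QUOTAS[label]
    # 100 Tweets per request for both endpoints, assuming full pages
    per_request = max_tweets // max_requests
    return math.ceil(max_results / per_request)
```

For instance, a 150-Tweet search like the tests below would use an estimated two requests out of the monthly allowance.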

The `tweepy.models.Status` object contains a lot of data about the Tweet, such as its text, its author, and various aspects of metadata about the Tweet's creation and interactions.

In [116]:
# function to search Tweets within the past 30 days
# utilizes both API v1.1 and v2 to be consistent with 7-day search.
def search_30(query, start_date=None, end_date=None, max_results=20, write_csv=False, filename="search_30.csv"):
    # initializing API v1.1 instance
    api_1 = init_api_1()
    
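    # parsing dates passed into function and formatting them as "YYYYMMDDHHmm" strings,
    # the timestamp format the 30-day endpoint expects for fromDate and toDate
    from dateutil import parser
    if start_date:
        start_date = parser.parse(start_date).strftime("%Y%m%d%H%M")
    if end_date:
        end_date = parser.parse(end_date).strftime("%Y%m%d%H%M")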
    
    # retrieving Tweets from the past 30 days relevant to query using tweepy's pagination function
    import tweepy
    response_1 = tweepy.Cursor(api_1.search_30_day,
                               label="30day",
                               query=f"{query} lang:en",
                               fromDate=start_date,
                               toDate=end_date
                              ).items(max_results)
    
    # gathering Tweet ID's in a list
    tweet_ids = [tweet._json['id'] for tweet in response_1]
    
    # setting Tweet data to be included in response
    tweet_fields = ["text", "attachments", "author_id", "context_annotations", "conversation_id", "created_at",
                   "entities", "geo", "in_reply_to_user_id", "lang", "public_metrics", "referenced_tweets"]
    
    # initializing variables for API calls and DataFrame for Tweet data
    import pandas as pd
    num_results = 0
    tweets = pd.DataFrame(columns=tweet_fields+['entities_hashtags'])
    tweets.index.name = "Tweet ID"
    
    # loop to retrieve Tweets from ID's through API v2, 100 at a time
    api_2 = init_api_2()
    # stepping through the ID's actually returned, so the loop always ends
    # even when fewer than max_results Tweets match the query
    for start in range(0, len(tweet_ids), 100):
        # slicing tweet_ids since API v2 get_tweets only takes max 100 ID's per request
        slice_ids = tweet_ids[start:start+100]
        
        # retrieving Tweet data from API v2 and adding to DataFrame
        response_2 = api_2.get_tweets(slice_ids, tweet_fields=tweet_fields)
        tweets = tweet_df(tweets, response_2, tweet_fields)
    
    # writing Tweet DataFrame to csv file
    if write_csv:
        tweets.to_csv(filename)
    
    return tweets
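
The v2 lookup accepts at most 100 IDs per request, so the list of Tweet IDs has to be split into batches. The batching step on its own can be sketched as a small generator; `chunked` is a hypothetical helper name used only for illustration.

```python
def chunked(ids, size=100):
    """Yield successive batches of at most `size` IDs."""
    for start in range(0, len(ids), size):
        # list slicing never raises IndexError; the last batch is simply shorter
        yield ids[start:start + size]
```

With 250 IDs this yields batches of 100, 100, and 50, and an empty list yields nothing, so no request is made when the search returns no Tweets.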

In [117]:
test30 = search_30("hello world", max_results=150, write_csv=True)
test30

Tweet ID,text,attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,in_reply_to_user_id,lang,public_metrics,referenced_tweets,entities_hashtags
1453091911619825673,"Hello, world! The Bitcoin white paper is takin...",,24222556,"[{'domain': {'id': '45', 'name': 'Brand Vertic...",1453091911619825673,2021-10-26 20:11:23+00:00,"{'urls': [{'start': 83, 'end': 106, 'url': 'ht...",,,en,"{'retweet_count': 2, 'reply_count': 4, 'like_c...",,
1453091826072633345,RT @KanchanGupta: “So-called bigoted 'outrage'...,,1637616686,"[{'domain': {'id': '6', 'name': 'Sports Event'...",1453091826072633345,2021-10-26 20:11:03+00:00,"{'mentions': [{'start': 3, 'end': 16, 'usernam...",,,en,"{'retweet_count': 549, 'reply_count': 0, 'like...","[(type, id)]",
1453091634879639553,RT @SSGPrinceVegeta: Hello guys i hate to ask ...,,997125904047501312,,1453091634879639553,2021-10-26 20:10:17+00:00,"{'mentions': [{'start': 3, 'end': 19, 'usernam...",,,en,"{'retweet_count': 21, 'reply_count': 0, 'like_...","[(type, id)]",
1453091628952948738,“Hello world”,,1390983395711152132,,1453091628952948738,2021-10-26 20:10:16+00:00,,,,en,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,
1453091603594293251,RT @AxieWomensAlli: Hello world!\n\nOur missio...,,1327636179513176073,"[{'domain': {'id': '30', 'name': 'Entities [En...",1453091603594293251,2021-10-26 20:10:10+00:00,"{'hashtags': [{'start': 60, 'end': 66, 'tag': ...",,,en,"{'retweet_count': 44, 'reply_count': 0, 'like_...","[(type, id)]","[{'start': 60, 'end': 66, 'tag': 'women'}, {'s..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1453083680302411790,RT @1_LoveLiberty: Hello World ! I'm 15 minute...,{'media_keys': ['3_1452584689408040963']},1446965634651262978,,1453083680302411790,2021-10-26 19:38:41+00:00,"{'mentions': [{'start': 3, 'end': 17, 'usernam...",,,en,"{'retweet_count': 1006, 'reply_count': 0, 'lik...","[(type, id)]",
1453083580788260871,"RT @otterclam: Hello, world! \n\nWe are enhydr...",,1387434854942482441,,1453083580788260871,2021-10-26 19:38:17+00:00,"{'mentions': [{'start': 3, 'end': 13, 'usernam...",,,en,"{'retweet_count': 475, 'reply_count': 0, 'like...","[(type, id)]",
1453083576602439685,"RT @otterclam: Hello, world! \n\nWe are enhydr...",,1386073567796813824,,1453083576602439685,2021-10-26 19:38:16+00:00,"{'mentions': [{'start': 3, 'end': 13, 'usernam...",,,en,"{'retweet_count': 475, 'reply_count': 0, 'like...","[(type, id)]",
1453083547816845314,"RT @LiborSason: @Sylvia70485099 Hello Sylvia, ...",{'media_keys': ['3_1453072128379609096']},877756963416920064,,1453083547816845314,2021-10-26 19:38:09+00:00,"{'mentions': [{'start': 3, 'end': 14, 'usernam...",,,en,"{'retweet_count': 2, 'reply_count': 0, 'like_c...","[(type, id)]",
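
The `entities_hashtags` column in the output above holds either nothing or a list of dicts with `start`, `end`, and `tag` keys, which is enough to work toward the "top hashtags" goal. The function below is a minimal sketch of that count; `top_hashtags` is a hypothetical name, and lowercasing the tags is an assumption about how they should be grouped.

```python
from collections import Counter

def top_hashtags(hashtag_column, n=10):
    """Return the `n` most common hashtags as (tag, count) pairs."""
    counts = Counter()
    for hashtags in hashtag_column:
        # rows without hashtags hold None (or NaN), so skip anything that is not a list
        if not isinstance(hashtags, list):
            continue
        # assumption: tags are grouped case-insensitively
        counts.update(entry["tag"].lower() for entry in hashtags)
    return counts.most_common(n)
```

It could be called as `top_hashtags(test30["entities_hashtags"])` on a DataFrame still in memory. After a round trip through CSV the lists come back as strings, so they would need to be parsed first.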


In [118]:
# function to search Tweets within the full archive
# utilizes both API v1.1 and v2 to be consistent with 7-day search.
def search_full(query, start_date=None, end_date=None, max_results=20, write_csv=False, filename="search_full.csv"):
    # initializing API v1.1 instance
    api_1 = init_api_1()
    
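    # parsing dates passed into function and formatting them as "YYYYMMDDHHmm" strings,
    # the timestamp format the full archive endpoint expects for fromDate and toDate
    from dateutil import parser
    if start_date:
        start_date = parser.parse(start_date).strftime("%Y%m%d%H%M")
    if end_date:
        end_date = parser.parse(end_date).strftime("%Y%m%d%H%M")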
    
    # retrieving Tweets from the full archive relevant to query using tweepy's pagination function
    import tweepy
    response_1 = tweepy.Cursor(api_1.search_full_archive,
                               label="full",
                               query=f"{query} lang:en",
                               fromDate=start_date,
                               toDate=end_date
                              ).items(max_results)
    
    # gathering Tweet ID's in a list
    tweet_ids = [tweet._json['id'] for tweet in response_1]
    
    # setting Tweet data to be included in response
    tweet_fields = ["text", "attachments", "author_id", "context_annotations", "conversation_id", "created_at",
                   "entities", "geo", "in_reply_to_user_id", "lang", "public_metrics", "referenced_tweets"]
    
    # initializing variables for API calls and DataFrame for Tweet data
    import pandas as pd
    num_results = 0
    tweets = pd.DataFrame(columns=tweet_fields+['entities_hashtags'])
    tweets.index.name = "Tweet ID"
    
    # loop to retrieve Tweets from ID's through API v2, 100 at a time
    api_2 = init_api_2()
    # stepping through the ID's actually returned, so the loop always ends
    # even when fewer than max_results Tweets match the query
    for start in range(0, len(tweet_ids), 100):
        # slicing tweet_ids since API v2 get_tweets only takes max 100 ID's per request
        slice_ids = tweet_ids[start:start+100]
        
        # retrieving Tweet data from API v2 and adding to DataFrame
        response_2 = api_2.get_tweets(slice_ids, tweet_fields=tweet_fields)
        tweets = tweet_df(tweets, response_2, tweet_fields)
    
    # writing Tweets DataFrame to csv file
    if write_csv:
        tweets.to_csv(filename)
    
    return tweets

In [119]:
test_full = search_full("hello world", max_results=150, write_csv=True)
test_full

Tweet ID,text,attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,in_reply_to_user_id,lang,public_metrics,referenced_tweets,entities_hashtags
1453092553956503553,"RT @tyler: Hello, world! The Bitcoin white pap...",,1189263728883175426,"[{'domain': {'id': '45', 'name': 'Brand Vertic...",1453092553956503553,2021-10-26 20:13:56+00:00,"{'mentions': [{'start': 3, 'end': 9, 'username...",,,en,"{'retweet_count': 8, 'reply_count': 0, 'like_c...","[(type, id)]",
1453092549292404737,Hello World,,1374307487298555906,,1453092549292404737,2021-10-26 20:13:55+00:00,,,,en,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,
1453092461220450311,"RT @tyler: Hello, world! The Bitcoin white pap...",,1431633525556535301,"[{'domain': {'id': '45', 'name': 'Brand Vertic...",1453092461220450311,2021-10-26 20:13:34+00:00,"{'mentions': [{'start': 3, 'end': 9, 'username...",,,en,"{'retweet_count': 8, 'reply_count': 0, 'like_c...","[(type, id)]",
1453092368652005381,"RT @tyler: Hello, world! The Bitcoin white pap...",,72868454,"[{'domain': {'id': '45', 'name': 'Brand Vertic...",1453092368652005381,2021-10-26 20:13:12+00:00,"{'mentions': [{'start': 3, 'end': 9, 'username...",,,en,"{'retweet_count': 8, 'reply_count': 0, 'like_c...","[(type, id)]",
1453092253660942337,RT @KanchanGupta: “So-called bigoted 'outrage'...,,714568295446224897,"[{'domain': {'id': '6', 'name': 'Sports Event'...",1453092253660942337,2021-10-26 20:12:45+00:00,"{'mentions': [{'start': 3, 'end': 16, 'usernam...",,,en,"{'retweet_count': 551, 'reply_count': 0, 'like...","[(type, id)]",
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1453084241512648706,Hello World,,4848518997,,1453084241512648706,2021-10-26 19:40:54+00:00,,,,en,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,
1453084209241858051,Free iOS App - Gotchasaur: Hello World iPhone ...,,2909948695,"[{'domain': {'id': '47', 'name': 'Brand', 'des...",1453084209241858051,2021-10-26 19:40:47+00:00,"{'urls': [{'start': 176, 'end': 199, 'url': 'h...",,,en,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,
1453084185179066375,RT @tssremindbot: Hello. I just wanted to remi...,,1219268565842513920,,1453084185179066375,2021-10-26 19:40:41+00:00,"{'mentions': [{'start': 3, 'end': 16, 'usernam...",,,en,"{'retweet_count': 2, 'reply_count': 0, 'like_c...","[(type, id)]",
1453084159778426883,"RT @auddriy: hello distwitter, what is the bes...",{'poll_ids': ['1452826903631499273']},1117796124,,1453084159778426883,2021-10-26 19:40:35+00:00,"{'mentions': [{'start': 3, 'end': 11, 'usernam...",,,en,"{'retweet_count': 1, 'reply_count': 0, 'like_c...","[(type, id)]",
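
The `created_at` timestamps in these results are a starting point for the "volume over time" goal. As a rough sketch, Tweets can be bucketed by hour before any plotting is added; `tweet_volume` is a hypothetical helper, and the hourly bucket format is just one possible choice.

```python
from collections import Counter

def tweet_volume(created_at, bucket="%Y-%m-%d %H:00"):
    """Count Tweets per time bucket and return sorted (bucket, count) pairs."""
    # any objects with a strftime method work, e.g. datetime or pandas Timestamp
    counts = Counter(ts.strftime(bucket) for ts in created_at)
    return sorted(counts.items())
```

Passing `test_full["created_at"]` would give one row per hour, and changing `bucket` to `"%Y-%m-%d"` would give daily counts instead.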


## Stream

A Stream is an object that can filter and sample real-time Tweets. Since it delivers Tweets as they are posted rather than searching past ones, this is probably not what we're looking for in an analysis pipeline.

In [None]:
# instantiating Stream object
stream = tweepy.Stream(consumer_key, consumer_secret, access_token, access_secret)
stream

In [None]:
stream.sample(languages=["en"])