# CACTUS Week 1
## Import essential libraries

In [5]:
# Code thanks to https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a


# For sending GET requests to the API
import requests
# For saving access tokens and for file management when creating and adding to the dataset
import os
# For dealing with json responses we receive from the API
import json
# For displaying the data after
import pandas as pd
# For saving the response data in CSV format
import csv
# For parsing the dates received from twitter in readable formats
import datetime
import dateutil.parser
import unicodedata
#To add wait time between requests
import time

1. To be able to send your first request to the Twitter API, you need to have a developer account.
2. Next, create a project and connect an App through the developer portal.
3. Go to the developer portal dashboard
4. Sign in with your developer account
5. Create a new project, give it a name, a use-case based on the goal you want to achieve, and a description.
6. If everything is successful, you should see a page containing your keys and tokens. We will use one of these to access the API: look for the BEARER TOKEN. See https://miro.medium.com/max/2400/1*Y20zm9Vf1k5uRMRTMkHRkQ.png

7. The next step is to create an auth() function that will have the “Bearer Token” from the app we just created.
8. Since this Bearer Token is sensitive information, you should not share it with anyone. Even if you are working with a team, you don’t want everyone to have access to it.
9. So, we will save the token in an “environment variable”.
10. Finally, we will create our auth() function, which retrieves the token from the environment.

In [6]:
os.environ['TOKEN'] = ''  # Paste your bearer token between the quotes; remove it again before sharing the notebook
def auth():
    return os.getenv('TOKEN')
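
A slightly more defensive variant is sketched below. The name `auth_checked` and its `env_var` argument are assumptions made for illustration and are not part of the original guide. The only idea shown is failing early with a clear message when the token is missing, rather than sending a request that the API will reject.

```python
import os

def auth_checked(env_var="TOKEN"):
    """Return the bearer token stored in env_var, or raise if it is unset or empty."""
    token = os.getenv(env_var)
    if not token:
        # os.getenv returns None for a missing variable; '' means it was never filled in
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token
```

It can be used as a drop-in replacement for `auth()`, for example `bearer_token = auth_checked()`.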

## Create Headers
Next, we will define a function that will take our bearer token, pass it for authorization and return headers we will use to access the API.

In [7]:
def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers

# Create URL
Now that we can access the API, we will build the request for the endpoint we are going to use and the parameters we want to pass.

In [54]:
def create_url(keyword):
    
    search_url = "https://api.twitter.com/2/tweets/search/recent" #Change to the endpoint you want to collect data from
    # https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent
    #change params based on the endpoint you are using
    query_params = {'query': keyword,
                    'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
                    'user.fields': 'id,name,username,created_at,description,public_metrics,verified',
                    'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
                    'next_token': {}}
    return (search_url, query_params)

The defined function above contains two pieces:

## search_url:

This is the link to the "endpoint" we want to access. An endpoint is simply the operation we want the API to perform. E.g.: if we want a user's profile, the endpoint is "user lookup".

Twitter’s API has a lot of different endpoints. You can look them up here: https://miro.medium.com/max/700/1*1oJExGGK151WfQJ6LIikww.png

Right now, this code is written for the recent search endpoint, which returns tweets from the last seven days.

## query_params:

These are the parameters that the endpoint offers and that we can use to customize the request we want to send. E.g.: if we want a user's profile, the endpoint is "user lookup", and the query parameter is the username of the user.

1. Some parameters control the returned response, e.g., query, start time, end time, max results

``'query':        keyword, # e.g. the query can be "xbox lang:en" (lang:en specifies that we only want English tweets)
``


2. Some fields are optional, e.g., you can filter what subset of the full data you want. Only the user data, only the tweet data, or only the place data.

``'expansions':   'author_id,in_reply_to_user_id,geo.place_id',
'tweet.fields': 'id,text,author_id,in_reply_to_user_id,geo,conversation_id,created_at,lang,public_metrics,referenced_tweets,reply_settings,source',
'user.fields':  'id,name,username,created_at,description,public_metrics,verified',
'place.fields': 'full_name,id,country,country_code,geo,name,place_type',
``
3. One field lets you "turn the page" when there are hundreds or thousands of results, because the response returns results one page at a time (for the recent search endpoint, 10 per page by default and up to 100 with max_results). The "next_token" parameter lets you access the next page of results.

``'next_token': {}
``

For the recent search endpoint that we are using here, you can find the list of query parameters in its API Reference page (linked in the create_url code above) under the “Query parameters” section, and an example in this image: https://miro.medium.com/max/700/1*Ex1pG3yTXHc6b_dXDvnUoQ.png
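
As a rough sketch of how the three kinds of parameters fit together, the helper below builds a parameter dictionary and leaves out any optional value that was not supplied. The function name `build_query_params` and the shortened `tweet.fields` list are assumptions for illustration; `max_results`, `start_time`, `end_time`, and `next_token` are the parameter names used by the search endpoints.

```python
def build_query_params(keyword, max_results=None, start_time=None,
                       end_time=None, next_token=None):
    """Build search parameters, including optional ones only when they are given."""
    # Parameters that are always sent
    params = {
        "query": keyword,
        "tweet.fields": "id,text,author_id,created_at,lang,public_metrics",
    }
    # Optional parameters: drop the ones left as None
    optional = {
        "max_results": max_results,
        "start_time": start_time,
        "end_time": end_time,
        "next_token": next_token,
    }
    params.update({key: value for key, value in optional.items() if value is not None})
    return params

example_params = build_query_params("xbox lang:en", max_results=15)
```

Here `example_params` contains `query`, `tweet.fields`, and `max_results`, but no `next_token` key, since no page token was passed.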

# Connect to Endpoint
Now that we have the URL, headers, and parameters we want, we will create a function that will put all of this together and connect to the endpoint.
The function below will send the “GET” request and if everything is correct (response code 200), it will return the response in “JSON” format.
Note: next_token is set to “None” by default since we only care about it if it exists.

In [55]:
def connect_to_endpoint(url, headers, params, next_token = None):
    params['next_token'] = next_token   #params object received from create_url function
    response = requests.request("GET", url, headers = headers, params = params)
    print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
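
The API answers with status code 429 when a rate limit is hit, and `connect_to_endpoint` raises immediately in that case. One possible way to soften this is sketched below, under the assumption that waiting and retrying is acceptable for your use case. The name `get_with_retry` and its arguments are hypothetical; `send` is any zero-argument callable that performs the request and returns the response object.

```python
import time

def get_with_retry(send, max_retries=3, wait_seconds=1.0):
    """Call send() until it returns status 200, retrying only on status 429."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code == 200:
            return response.json()
        if response.status_code == 429 and attempt < max_retries:
            # Rate limited: wait, then try again
            time.sleep(wait_seconds)
            continue
        # Any other error, or retries exhausted: fail like connect_to_endpoint does
        raise Exception(response.status_code, response.text)
```

With the functions defined above, `send` could be `lambda: requests.request("GET", url, headers=headers, params=params)`.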

# Putting it all Together
Now that we have all the functions we need, let's test putting them all together to create our first request!

In the next cell, we will set up our inputs: the bearer_token, the headers, and the keyword.

We will look for tweets in English that contain the word “xbox”.

Because create_url does not set start_time, end_time, or max_results, the recent search endpoint falls back to its defaults: up to 10 tweets per request, taken from the last seven days.

In [56]:
#Inputs for the request
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = "xbox lang:en"


Now we will create the URL and get the response from the API.

The Twitter API returns the response in JavaScript Object Notation (JSON) format.

To be able to deal with it and break down the response we get, we will use the JSON encoder and decoder for Python, which we imported earlier. You can find more information about the library here: https://docs.python.org/3/library/json.html

If the returned response from the below code is 200, then the request was successful.

In [57]:
url = create_url(keyword)
json_response = connect_to_endpoint(url[0], headers, url[1])

Endpoint Response Code: 200


Let's print the response in a readable format using the json library:

In [59]:
print(json.dumps(json_response, indent=4, sort_keys=True))

{
    "data": [
        {
            "author_id": "1082608942847574017",
            "conversation_id": "1461648044848521217",
            "created_at": "2021-11-19T10:50:24.000Z",
            "id": "1461648044848521217",
            "lang": "en",
            "public_metrics": {
                "like_count": 0,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 76
            },
            "referenced_tweets": [
                {
                    "id": "1461439491395112960",
                    "type": "retweeted"
                }
            ],
            "reply_settings": "everyone",
            "source": "Twitter for iPhone",
            "text": "RT @LD50_II: Neat little hidden sign on the Halo xbox https://t.co/Q9M9w4axot"
        },
        {
            "author_id": "1440356865951428611",
            "conversation_id": "1461424151365922822",
            "created_at": "2021-11-19T10:50:24.000Z",
            "id": "1461648044

# Exploring the JSON response

Now let's break down the returned JSON response.
The response is read as a Python dictionary whose keys contain either data or more dictionaries. The top two keys are:

## Data
A list of dictionaries, where each dictionary represents the data for one tweet. Here is an example of how to retrieve the time at which the first tweet was created:

In [25]:
json_response['data'][0]['created_at']

'2021-11-19T09:59:33.000Z'
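
The timestamp comes back as a string. As a minimal sketch, assuming the format always matches the one shown in the output above, it can be turned into a timezone-aware datetime with the standard library alone. The helper name `parse_created_at` is hypothetical; `dateutil.parser.parse`, imported earlier, is a more lenient alternative.

```python
from datetime import datetime, timezone

def parse_created_at(value):
    """Parse a timestamp such as '2021-11-19T09:59:33.000Z' into an aware UTC datetime."""
    # The trailing 'Z' means UTC, so attach that timezone explicitly
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)

parsed = parse_created_at("2021-11-19T09:59:33.000Z")
```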

## Meta
A dictionary of attributes about the request we sent. We usually care about only two keys in this dictionary: next_token and result_count.

1. next_token is the unique ID field for the next page of results
2. result_count is the number of results returned from the request


In [61]:
# 1. next_token is the unique ID field for the next page of results
print(json_response['meta']['next_token'])
# This key is only present when there are more pages of results


# 2. result_count is the number of results returned from the request

print(json_response['meta']['result_count'])

b26v89c19zqg8o3fpdy5zod1ze0pto59mjulad4nt82v1
10
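
Since next_token is missing from the last page of results, indexing it directly raises a KeyError there. A small sketch of a safer lookup with `dict.get` follows; the `sample_response` dictionary is made-up illustration data, not real API output.

```python
# Made-up response shaped like a final page: 'meta' has no 'next_token' key
sample_response = {
    "data": [{"id": "1", "text": "hello"}],
    "meta": {"result_count": 1},
}

meta = sample_response.get("meta", {})
next_token = meta.get("next_token")         # None when there is no further page
result_count = meta.get("result_count", 0)  # 0 when the key is absent
```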


# Write to CSV file

In [27]:
df = pd.DataFrame(json_response['data'])
df.to_csv('data.csv')

## The custom approach:
First, we will create a CSV file with our desired column headers. We do this separately from our actual function so that it does not interfere with looping over requests later on.

In [28]:
# Create file ("w" starts from an empty file, so the header row is written exactly once)
csvFile = open("data.csv", "w", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)

#Create headers for the data you want to save; in this example, we only want to save these columns in our dataset
csvWriter.writerow(['author id', 'created_at', 'geo', 'id','lang', 'like_count', 'quote_count', 'reply_count','retweet_count','source','tweet'])
csvFile.close()

Then, we will create our append_to_csv function, which takes the response and the desired filename and appends all the data we collected to the CSV file.

In [29]:
def append_to_csv(json_response, fileName):

    #A counter variable
    counter = 0

    #Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)

    #Loop through each tweet
    for tweet in json_response['data']:
        
        # We will create a variable for each since some of the keys might not exist for some tweets
        # So we will account for that

        # 1. Author ID
        author_id = tweet['author_id']

        # 2. Time created
        created_at = dateutil.parser.parse(tweet['created_at'])

        # 3. Geolocation (the 'geo' object may lack 'place_id', e.g. when it only holds coordinates)
        if ('geo' in tweet):   
            geo = tweet['geo'].get('place_id', " ")
        else:
            geo = " "

        # 4. Tweet ID
        tweet_id = tweet['id']

        # 5. Language
        lang = tweet['lang']

        # 6. Tweet metrics
        retweet_count = tweet['public_metrics']['retweet_count']
        reply_count = tweet['public_metrics']['reply_count']
        like_count = tweet['public_metrics']['like_count']
        quote_count = tweet['public_metrics']['quote_count']

        # 7. source
        source = tweet['source']

        # 8. Tweet text
        text = tweet['text']
        
        # Assemble all data in a list
        res = [author_id, created_at, geo, tweet_id, lang, like_count, quote_count, reply_count, retweet_count, source, text]
        
        # Append the result to the CSV file
        csvWriter.writerow(res)
        counter += 1

    # When done, close the CSV file
    csvFile.close()

    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter) 
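
The function above guards only the geo key. If other keys can be missing as well, one option is to read every field with `dict.get`, as in the sketch below. The helper name `tweet_to_row` and the `missing` placeholder are assumptions for illustration; the column order matches the header row written earlier.

```python
def tweet_to_row(tweet, missing=" "):
    """Turn one tweet dictionary into a CSV row, tolerating absent keys."""
    metrics = tweet.get("public_metrics", {})
    return [
        tweet.get("author_id", missing),
        tweet.get("created_at", missing),
        tweet.get("geo", {}).get("place_id", missing),
        tweet.get("id", missing),
        tweet.get("lang", missing),
        metrics.get("like_count", 0),
        metrics.get("quote_count", 0),
        metrics.get("reply_count", 0),
        metrics.get("retweet_count", 0),
        tweet.get("source", missing),
        tweet.get("text", missing),
    ]
```

Inside the loop of `append_to_csv`, the row could then be written with `csvWriter.writerow(tweet_to_row(tweet))`.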

In [30]:
#Now if we run our append_to_csv() function on our last call, 
#we should have a file that contains up to 10 tweets (the default page size), or fewer depending on your query

append_to_csv(json_response, "data.csv")


# of Tweets added from this response:  10


# Looping Through Requests
Now, what if we want to save more responses, beyond the first page of results that Twitter gave us, or automate getting tweets over a specific period of time? For that, we will use loops and the next_token values we receive from Twitter.

We can set a cap on the number of tweets we want to collect, so that collection stops once the cap is reached. The same idea applies per time period: once the cap for one period is reached, we move on to the next one.

The code below does exactly that for a single time period, using a while-loop that keeps requesting pages until either max_count tweets have been collected or no next_token is returned.

To cover several time periods, this while-loop could be wrapped in a for-loop that goes over the months/weeks/days we want to cover (depending on how it is set).

Notice that a time.sleep() is added between calls to ensure you are not just spamming the API with requests.

In [67]:
#Inputs for tweets
bearer_token = auth()
headers = create_headers(bearer_token)
keyword = "xbox lang:en"

#Total number of tweets we collected from the loop
total_tweets = 0

# Create file ("w" starts from an empty file, so the header row is not repeated below earlier rows)
csvFile = open("data.csv", "w", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)

#Create headers for the data you want to save; in this example, we only want to save these columns in our dataset
csvWriter.writerow(['author id', 'created_at', 'geo', 'id','lang', 'like_count', 'quote_count', 'reply_count','retweet_count','source','tweet'])
csvFile.close()

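# Inputs
count = 0 # Tweets collected so far
max_results = 100 # Not used below: create_url does not pass max_results, so each page holds the default 10 tweets
max_count = 1000 # Stop once this many tweets have been collected
flag = True
next_token = None #json_response['meta']['next_token']

# Keep requesting pages until the last page has been processed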
while flag:
        # Check if max_count reached
        if count >= max_count:
            break
        print("-------------------")
        print("Token: ", next_token)
        url = create_url(keyword)
        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
        result_count = json_response['meta']['result_count']

        if 'next_token' in json_response['meta']:
            # Save the token to use for next call
            next_token = json_response['meta']['next_token']
            print("Next Token: ", next_token)
            if result_count is not None and result_count > 0 and next_token is not None:
                append_to_csv(json_response, "data.csv")
                count += result_count
                total_tweets += result_count
                print("Total # of Tweets added: ", total_tweets)
                print("-------------------")
                time.sleep(5)                
        # If no next token exists
        else:
            if result_count is not None and result_count > 0:
                print("-------------------")
                append_to_csv(json_response, "data.csv")
                count += result_count
                total_tweets += result_count
                print("Total # of Tweets added: ", total_tweets)
                print("-------------------")
                time.sleep(5)
            
            #Since this is the final request, turn flag to false to stop the loop.
            flag = False
            next_token = None
        time.sleep(5)
print("Total number of results: ", total_tweets)

-------------------
Token:  None
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fpdy5zod20xxgh1rnb1qjn626527wd
# of Tweets added from this response:  10
Total # of Tweets added:  10
-------------------
-------------------
Token:  b26v89c19zqg8o3fpdy5zod20xxgh1rnb1qjn626527wd
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fpdy5zod20xx9z9yz6nve35uutgb5p
# of Tweets added from this response:  10
Total # of Tweets added:  20
-------------------
-------------------
Token:  b26v89c19zqg8o3fpdy5zod20xx9z9yz6nve35uutgb5p
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fpdy5zod20xpooo1w874olkon3gzgd
# of Tweets added from this response:  10
Total # of Tweets added:  30
-------------------
-------------------
Token:  b26v89c19zqg8o3fpdy5zod20xpooo1w874olkon3gzgd
Endpoint Response Code: 200
Next Token:  b26v89c19zqg8o3fpdy5zod20xi06rtb7bdco52823eyl
# of Tweets added from this response:  10
Total # of Tweets added:  40
-------------------
-------------------
Token:  b26v8

KeyboardInterrupt: 
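
The pagination logic in the loop above can also be written more compactly. The sketch below is one possible restructuring, not the original guide's code. The name `collect_pages` is hypothetical, and `fetch_page` stands for any callable that takes a next_token (or None) and returns a response dictionary, so the loop can be tried out without calling the API.

```python
def collect_pages(fetch_page, max_count):
    """Request pages until max_count results are collected or no next_token is returned."""
    pages = []
    count = 0
    next_token = None
    while count < max_count:
        response = fetch_page(next_token)
        meta = response.get("meta", {})
        if meta.get("result_count", 0) > 0:
            pages.append(response)
            count += meta["result_count"]
        # The last page carries no next_token, which ends the loop
        next_token = meta.get("next_token")
        if next_token is None:
            break
    return pages
```

With the functions defined earlier, `fetch_page` could be `lambda token: connect_to_endpoint(url[0], headers, url[1], token)`, with a `time.sleep()` added between calls as before.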