# Extracting tweet information using the Twitter API 

<hr style="border:2px solid gray"> </hr>

My Twitter bot's performance is not the best. Each tweet receives few impressions, and favorites and retweets are rare. I want to better understand how to engage the Twitter users who value tweets about books and book quotes. The most straightforward way I can think of is to look at which book and book-quote tweets are best received. Enter the Twitter API's search resources.

## API search resources

The Twitter API lets a user search for tweets in as simple or as advanced a manner as needed. The API search resource URL is:

`https://api.twitter.com/1.1/search/tweets.json`

Twitter's API docs describe the required and optional parameters at the link below:

https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets

The API search takes a query, `q`, and returns the results in JSON format. The query must be URL encoded. Below is a handy resource for understanding how to encode special characters and text in general:

https://www.w3schools.com/tags/ref_urlencode.ASP
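
As a small illustration, Python's standard library can produce the encoded query directly. The sketch below assumes the raw search string `#books #bookquotes -filter:retweets` (my human-readable guess at the query used in this post) and relies on `urllib.parse.quote_plus`, which percent-encodes reserved characters and writes spaces as `+`:

```python
from urllib.parse import quote_plus, unquote_plus

# human-readable search string (assumed here for illustration)
raw_query = "#books #bookquotes -filter:retweets"

# quote_plus percent-encodes reserved characters and turns spaces into '+'
encoded_query = quote_plus(raw_query)
print(encoded_query)

# decoding recovers the original string
decoded_query = unquote_plus(encoded_query)
print(decoded_query)
```

Decoding an encoded query this way is also a quick sanity check when a search returns unexpected results.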

Further details on the API search:

https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/guides/standard-operators

And details on search operators:

https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/search-operators

## Searching for bookquote-related tweets

The Twitter API limits each search request to at most 100 tweets from within the past seven days. I would like to search for the most popular tweets in this time frame, but the `result_type` parameter's 'popular' option appears buggy. Instead, let's look at tweets from the past week that:
- contain the hashtags 'books' and 'bookquotes'
- were not retweets
- were written in the English language (not yet enforced by the query below; the search API's `lang` parameter would do this)

I will definitely expand these search parameters later, but for now they serve as a handy starting point and example. 

# Python search implementation

I'll be using my standard Python data analysis libraries, such as NumPy and pandas. The Twitter API's resource URLs return the requested data in JSON format, so I'll also need Python's built-in json library. I've never used this library before, so it should be fun to figure out.

In [1]:
import numpy as np
import pandas as pd
import json
from subprocess import Popen, PIPE
import datetime

Querying the API for search results is fairly straightforward once the query is URL encoded. Here I'm using 'twurl', but any other command-line tool that can send authenticated requests will do.

In [2]:
# define search parameters:
query = "%23books+%23bookquotes+-filter%3Aretweets" #URL encoding
count = 100 #max number available
result_type = "recent" 

# place search parameters into the appropriate URL:
url = "{api}?q={q}&result_type={result_type}&count={count}&tweet_mode=extended&include_entities=True".format(
    api="/1.1/search/tweets.json",
    q=query,
    result_type=result_type,
    count=count)

# create process:
cmd = ["twurl",url]
process = Popen(cmd,stdout=PIPE,stderr=PIPE)
stdout,stderr = process.communicate()
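
The same request could also be wrapped in two small helpers. This is only a sketch: `build_search_url` and `run_twurl` are hypothetical names of my own, the raw query string is assumed, and `run_twurl` presumes that 'twurl' is installed and already authorized. `urllib.parse.urlencode` takes care of the URL encoding, so the query can be written in plain text.

```python
import subprocess
from urllib.parse import urlencode

def build_search_url(raw_query, count=100, result_type="recent"):
    """Assemble the search path; urlencode handles the URL encoding."""
    params = {
        "q": raw_query,
        "result_type": result_type,
        "count": count,
        "tweet_mode": "extended",
        "include_entities": "true",
    }
    return "/1.1/search/tweets.json?" + urlencode(params)

def run_twurl(url):
    """Call twurl and return its stdout as bytes.

    Raises if the command exits with a non-zero status.
    """
    completed = subprocess.run(["twurl", url], capture_output=True)
    if completed.returncode != 0:
        raise RuntimeError(completed.stderr.decode(errors="replace"))
    return completed.stdout

# only the URL is built here; run_twurl(example_url) would send the request
example_url = build_search_url("#books #bookquotes -filter:retweets")
print(example_url)
```

Keeping the URL construction separate from the subprocess call makes the URL easy to inspect before any request is sent.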

## Understanding output from /1.1/search/tweets.json
The API tweet search returns a collection of tweets matching the query. The Python subprocess returns its standard output as a 'bytes' object, which is made more user-friendly with Python's json library: json.loads() deserializes the bytes instance into a Python object.

In [3]:
output = json.loads(stdout)
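
To see what json.loads() does without calling the API, here is a minimal sketch on a made-up payload. The `sample_stdout` bytes are my own stand-in and far smaller than a real response:

```python
import json

# a tiny stand-in for the bytes that the subprocess returns (assumed payload)
sample_stdout = (
    b'{"statuses": [{"id": 1, "full_text": "hello"}], '
    b'"search_metadata": {"count": 100}}'
)

# json.loads accepts bytes directly and returns nested Python objects
sample_output = json.loads(sample_stdout)
print(type(sample_output))
print(list(sample_output.keys()))
print(sample_output["statuses"][0]["full_text"])
```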

json.loads() returns a Python dictionary with two key-value pairs:
1. The 'statuses' key contains the list of tweets returned by the search
2. The 'search_metadata' key contains the search's metadata

'search_metadata' is a straightforward dictionary with basic information about the completed search. 

In [4]:
output["search_metadata"]

{'completed_in': 0.084,
 'max_id': 1345441486255632390,
 'max_id_str': '1345441486255632390',
 'next_results': '?max_id=1342562577361084417&q=%23books%20%23bookquotes%20-filter%3Aretweets&count=100&include_entities=1&result_type=recent',
 'query': '%23books+%23bookquotes+-filter%3Aretweets',
 'refresh_url': '?since_id=1345441486255632390&q=%23books%20%23bookquotes%20-filter%3Aretweets&result_type=recent&include_entities=1',
 'count': 100,
 'since_id': 0,
 'since_id_str': '0'}

'statuses' is a Python list of Python dictionaries, one per tweet. Each dictionary contains a mix of nested dictionaries, lists, strings, and other types. Let's take a closer look at the key-value pairs of the first tweet.

In [5]:
keys,items = [],[]
for key, item in output["statuses"][0].items():
    keys.append(key)
    items.append(type(item))

# output the key and value-type information as a pandas DataFrame for
# easy viewing:
pd.DataFrame(
    [keys,items],
    index=["key","item type"]
    ).transpose()

| | key | item type |
|---|---|---|
| 0 | created_at | `<class 'str'>` |
| 1 | id | `<class 'int'>` |
| 2 | id_str | `<class 'str'>` |
| 3 | full_text | `<class 'str'>` |
| 4 | truncated | `<class 'bool'>` |
| 5 | display_text_range | `<class 'list'>` |
| 6 | entities | `<class 'dict'>` |
| 7 | extended_entities | `<class 'dict'>` |
| 8 | metadata | `<class 'dict'>` |
| 9 | source | `<class 'str'>` |


In [6]:
output["statuses"][0]

{'created_at': 'Sat Jan 02 18:47:02 +0000 2021',
 'id': 1345441486255632390,
 'id_str': '1345441486255632390',
 'full_text': 'Let\'s play a game! Name that book! (hint: It\'s not one of mine!)\n\n"There is some good in this world ... and it\'s worth fighting for." \n\n#NewYear #trynewthings #novelinteractives #bookquotes #quote #books #reading #read #namethatbook https://t.co/XVt17WKodl',
 'truncated': False,
 'display_text_range': [0, 232],
 'entities': {'hashtags': [{'text': 'NewYear', 'indices': [136, 144]},
   {'text': 'trynewthings', 'indices': [145, 158]},
   {'text': 'novelinteractives', 'indices': [159, 177]},
   {'text': 'bookquotes', 'indices': [178, 189]},
   {'text': 'quote', 'indices': [190, 196]},
   {'text': 'books', 'indices': [197, 203]},
   {'text': 'reading', 'indices': [204, 212]},
   {'text': 'read', 'indices': [213, 218]},
   {'text': 'namethatbook', 'indices': [219, 232]}],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 13454414846953472

# Process search results
Let's judge a tweet's success by its number of retweets and its number of favorites. We are also interested in what was in the tweet text (number of hashtags, length, etc.), what time and day of the week the tweet was posted, and whether or not an image was attached. The output dictionary keys corresponding to most of these properties are:
- retweet_count
- favorite_count
- full_text
- created_at
- id
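
As a rough sketch of how a few of these properties could be derived from a single tweet dictionary, consider the helper below. `describe_tweet` is a hypothetical name, the sample tweet is a shortened copy of the first search result (only three of its hashtags are kept), and using 'display_text_range' as the text length is my own assumption:

```python
import datetime

# shortened copy of the first search result (hashtag list trimmed to three)
sample_tweet = {
    "created_at": "Sat Jan 02 18:47:02 +0000 2021",
    "display_text_range": [0, 232],
    "entities": {
        "hashtags": [{"text": "NewYear"}, {"text": "bookquotes"}, {"text": "books"}]
    },
}

def describe_tweet(tweet):
    """Return a few simple features of a tweet dictionary."""
    created = datetime.datetime.strptime(
        tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    start, end = tweet["display_text_range"]
    return {
        "num_hashtags": len(tweet["entities"]["hashtags"]),
        "text_length": end - start,       # assumed measure of tweet length
        "weekday": created.weekday(),     # Monday is 0, Sunday is 6
        "hour_utc": created.hour,
    }

features = describe_tweet(sample_tweet)
print(features)
```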

Information about attached media is accessed via the 'entities' key, which holds a dictionary of the tweet's hashtag and attachment information.

In [7]:
output["statuses"][0]["entities"]

{'hashtags': [{'text': 'NewYear', 'indices': [136, 144]},
  {'text': 'trynewthings', 'indices': [145, 158]},
  {'text': 'novelinteractives', 'indices': [159, 177]},
  {'text': 'bookquotes', 'indices': [178, 189]},
  {'text': 'quote', 'indices': [190, 196]},
  {'text': 'books', 'indices': [197, 203]},
  {'text': 'reading', 'indices': [204, 212]},
  {'text': 'read', 'indices': [213, 218]},
  {'text': 'namethatbook', 'indices': [219, 232]}],
 'symbols': [],
 'user_mentions': [],
 'urls': [],
 'media': [{'id': 1345441484695347200,
   'id_str': '1345441484695347200',
   'indices': [233, 256],
   'media_url': 'http://pbs.twimg.com/media/Eqv35uFXIAAeYbm.jpg',
   'media_url_https': 'https://pbs.twimg.com/media/Eqv35uFXIAAeYbm.jpg',
   'url': 'https://t.co/XVt17WKodl',
   'display_url': 'pic.twitter.com/XVt17WKodl',
   'expanded_url': 'https://twitter.com/mmadiganauthor/status/1345441486255632390/photo/1',
   'type': 'photo',
   'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
    'sm

The 'media' key within 'entities' is a list of the attached media, and the list's items are themselves dictionaries. Tweets without attachments have no 'media' key.

In [16]:
output["statuses"][0]["entities"]["media"][0]

{'id': 1345441484695347200,
 'id_str': '1345441484695347200',
 'indices': [233, 256],
 'media_url': 'http://pbs.twimg.com/media/Eqv35uFXIAAeYbm.jpg',
 'media_url_https': 'https://pbs.twimg.com/media/Eqv35uFXIAAeYbm.jpg',
 'url': 'https://t.co/XVt17WKodl',
 'display_url': 'pic.twitter.com/XVt17WKodl',
 'expanded_url': 'https://twitter.com/mmadiganauthor/status/1345441486255632390/photo/1',
 'type': 'photo',
 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
  'small': {'w': 680, 'h': 680, 'resize': 'fit'},
  'large': {'w': 1000, 'h': 1000, 'resize': 'fit'},
  'medium': {'w': 1000, 'h': 1000, 'resize': 'fit'}}}

The 'type' key returns the media type.

In [9]:
output["statuses"][0]["entities"]["media"][0]["type"]

'photo'
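
Since not every tweet carries media, a lookup that tolerates a missing 'media' key is handy. Here is a minimal sketch using `dict.get`; `media_types` is a hypothetical helper and both sample tweets are made up:

```python
def media_types(tweet):
    """Return the list of media types attached to a tweet (empty if none)."""
    media = tweet.get("entities", {}).get("media", [])
    return [item["type"] for item in media]

# made-up tweets: one with a photo, one without any attachment
with_photo = {"entities": {"hashtags": [], "media": [{"type": "photo"}]}}
without_media = {"entities": {"hashtags": []}}

print(media_types(with_photo))
print(media_types(without_media))
```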

Let's extract the relevant information and store it in a pandas DataFrame.

In [17]:
def extract_page_data(statuses):
    tweet_id = []
    tweet_datetime = []
    tweet_body = []
    tweet_media_type = []
    retweets = []
    favorites = []
    for tweet in statuses:

        # convert 'created_at' to datetime object:    
        aware_utc = datetime.datetime.strptime(
            tweet["created_at"],
            "%a %b %d %H:%M:%S %z %Y")
        naive_utc = aware_utc.replace(tzinfo=None)

        # store basic tweet info:
        tweet_id.append(tweet["id"])
        tweet_datetime.append(naive_utc)
        tweet_body.append(tweet["full_text"])

        # store tweet retweets and favorites:
        retweets.append(tweet["retweet_count"])
        favorites.append(tweet["favorite_count"])

        # store media type attachment ('media' is absent when the
        # tweet has no attachments):
        try:
            media = tweet["entities"]["media"]
            media_list = []
            for mm in media:
                media_list.append(mm["type"])
            tweet_media_type.append(";".join(media_list))

        except KeyError:
            tweet_media_type.append(np.nan)

    # return pandas DataFrame:
    data = [
        tweet_id,
        tweet_datetime,
        tweet_body,
        tweet_media_type,
        retweets,
        favorites,
        ]
    index = [
        "tweet_id",
        "tweet_datetime",
        "tweet_body",
        "tweet_media",
        "num_retweets",
        "num_favorites"
        ]
    return pd.DataFrame(data,index=index).transpose()

# display results:
results = extract_page_data(output["statuses"])
results.head()

| | tweet_id | tweet_datetime | tweet_body | tweet_media | num_retweets | num_favorites |
|---|---|---|---|---|---|---|
| 0 | 1345441486255632390 | 2021-01-02 18:47:02 | `Let's play a game! Name that book! (hint: It's...` | photo | 0 | 0 |
| 1 | 1345357236084690944 | 2021-01-02 13:12:15 | `Them - "Books are boring!"\nMe - *blocked*\n\n...` | photo | 0 | 3 |
| 2 | 1345344159029354497 | 2021-01-02 12:20:17 | `#nortonjuster #thephantomtollbooth #bookstagra...` | photo | 0 | 4 |
| 3 | 1345255530193965056 | 2021-01-02 06:28:06 | `#mythsandmusic #blackmagickseries #books #whit...` | photo | 0 | 0 |
| 4 | 1345203214329782276 | 2021-01-02 03:00:13 | `“Great writing, great action, great characters...` | photo | 1 | 0 |
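
One thing to watch for: a DataFrame assembled from rows of mixed types and then transposed, as in extract_page_data, should end up with generic 'object' columns. The sketch below rebuilds a tiny frame the same way with made-up values and converts the columns to specific types. Whether this matters depends on the later analysis, so treat it as an optional step.

```python
import datetime
import pandas as pd

# build a tiny frame the same way extract_page_data does (made-up rows)
data = [
    [1, 2],
    [datetime.datetime(2021, 1, 2, 18, 47, 2),
     datetime.datetime(2021, 1, 2, 13, 12, 15)],
    [0, 3],
]
index = ["tweet_id", "tweet_datetime", "num_favorites"]
frame = pd.DataFrame(data, index=index).transpose()
print(frame.dtypes)  # inspect the dtypes before conversion

# convert each column to a specific type
frame["tweet_id"] = frame["tweet_id"].astype("int64")
frame["num_favorites"] = frame["num_favorites"].astype("int64")
frame["tweet_datetime"] = pd.to_datetime(frame["tweet_datetime"])
print(frame.dtypes)
```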


# Iterate over search results pages

The search results will often exceed the 100-tweet limit imposed by the API. The 'next_results' entry in the result's metadata provides a way to query the API repeatedly until the search is complete.

Here is the search metadata object again: 

In [11]:
output["search_metadata"]

{'completed_in': 0.084,
 'max_id': 1345441486255632390,
 'max_id_str': '1345441486255632390',
 'next_results': '?max_id=1342562577361084417&q=%23books%20%23bookquotes%20-filter%3Aretweets&count=100&include_entities=1&result_type=recent',
 'query': '%23books+%23bookquotes+-filter%3Aretweets',
 'refresh_url': '?since_id=1345441486255632390&q=%23books%20%23bookquotes%20-filter%3Aretweets&result_type=recent&include_entities=1',
 'count': 100,
 'since_id': 0,
 'since_id_str': '0'}

'next_results' contains an API search query that will provide the next block of tweets from the search.

In [19]:
output["search_metadata"]["next_results"]

'?max_id=1342562577361084417&q=%23books%20%23bookquotes%20-filter%3Aretweets&count=100&include_entities=1&result_type=recent'
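
To see what this query string actually asks for, it can be split into its parameters with `urllib.parse.parse_qs`. A small sketch using the string shown above:

```python
from urllib.parse import parse_qs

next_results_example = (
    "?max_id=1342562577361084417"
    "&q=%23books%20%23bookquotes%20-filter%3Aretweets"
    "&count=100&include_entities=1&result_type=recent"
)

# drop the leading '?' and decode the parameters into a dictionary of lists
params = parse_qs(next_results_example.lstrip("?"))
print(params["max_id"])
print(params["q"])
print(sorted(params))
```

The parameter list contains no 'tweet_mode' entry, which is worth keeping in mind when requesting the next page.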

In this example, the search results did not exceed 100 tweets, so the 'next_results' query returns an empty 'statuses' list.

In [12]:
next_results = output["search_metadata"]["next_results"]
# 'next_results' omits tweet_mode, so add it back to keep 'full_text':
url = "/1.1/search/tweets.json%s&tweet_mode=extended" % next_results
cmd = ["twurl",url]
process = Popen(cmd,stdout=PIPE,stderr=PIPE)
stdout,stderr = process.communicate()
output2 = json.loads(stdout)
#results2 = extract_page_data(output2["statuses"])
#results2

In [14]:
output2

{'statuses': [],
 'search_metadata': {'completed_in': 0.008,
  'max_id': 1342562577361084417,
  'max_id_str': '1342562577361084417',
  'query': '%23books+%23bookquotes+-filter%3Aretweets',
  'refresh_url': '?since_id=1342562577361084417&q=%23books%20%23bookquotes%20-filter%3Aretweets&result_type=recent&include_entities=1',
  'count': 100,
  'since_id': 0,
  'since_id_str': '0'}}
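
Note that the second response's metadata has no 'next_results' key, which suggests a simple stopping rule for paging. The loop below is a sketch under that assumption: `collect_statuses` is a hypothetical helper, `fetch` stands for any callable that takes a URL path and returns the parsed response dictionary (for example, a wrapper around 'twurl' and json.loads()), and the canned pages are made up so the loop can run without the API.

```python
def collect_statuses(first_url, fetch, max_pages=10):
    """Follow 'next_results' until it is missing or a page comes back empty."""
    statuses = []
    url = first_url
    for _ in range(max_pages):
        page = fetch(url)
        if not page["statuses"]:
            break
        statuses.extend(page["statuses"])
        next_results = page["search_metadata"].get("next_results")
        if next_results is None:
            break
        # next_results omits tweet_mode, so it is appended here
        url = "/1.1/search/tweets.json" + next_results + "&tweet_mode=extended"
    return statuses

# stand-in for the API: two canned pages keyed by URL (made-up data)
fake_pages = {
    "/1.1/search/tweets.json?q=x": {
        "statuses": [{"id": 3}, {"id": 2}],
        "search_metadata": {"next_results": "?max_id=1&q=x"},
    },
    "/1.1/search/tweets.json?max_id=1&q=x&tweet_mode=extended": {
        "statuses": [{"id": 1}],
        "search_metadata": {},
    },
}

all_statuses = collect_statuses(
    "/1.1/search/tweets.json?q=x", fake_pages.__getitem__)
print([status["id"] for status in all_statuses])
```

The `max_pages` cap is there to keep a runaway loop from burning through the API's rate limit.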