# DEVELOPMENT DRAFT 
# Introduction to Facebook API for Researchers

This notebook was written by [John Simpson](mailto:john.simpson@computecanada.ca) and is meant to provide some simple, working examples for researchers who would like to collect information from Facebook.  This workbook uses a Python library called the [Facebook SDK](https://pypi.org/project/facebook-sdk-python/) to access the Facebook graph and collect information.  Much of this workbook is drawn from Study Tonight's [Working with Facebook Graph API](https://www.studytonight.com/network-programming-in-python/facebook-graph-api).  Official documentation for the Facebook SDK can be found [HERE](https://facebook-sdk.readthedocs.io/en/latest/index.html).  Official documentation for the Facebook API can be found [HERE](https://developers.facebook.com/docs/graph-api/reference).

This notebook assumes:

1. Basic familiarity with the Jupyter Notebook environment.
2. A functioning Python environment on the system where the notebook is run, and the authority to install software on it.
3. That you have a Facebook account [HERE](https://www.facebook.com/)
4. That you pay attention to the various notes and warnings around the cells.

I won't promise you any support but if you send me a note I'll help as I am able.

With all this said, let's get started by installing the Facebook SDK library.

In [1]:
!pip install facebook-sdk

Collecting facebook-sdk
  Downloading facebook_sdk-3.1.0-py2.py3-none-any.whl (7.5 kB)
Installing collected packages: facebook-sdk
Successfully installed facebook-sdk-3.1.0


With the Facebook SDK library installed on the system we should be able to open it for use throughout this workbook with the following command:

In [2]:
import facebook

[Note that almost every piece of code in the remainder of this workbook assumes that the two cells above have been run and run successfully.  If you open the workbook and immediately try to run a cell other than this one first then it is likely that you will receive an error.  Simply run the cells above and try to run the cell you want to run again.  If you are receiving errors then two likely possibilities are an incorrect installation of Python or no network connection.]

## Authentication

Use of this workbook (and of the Facebook API, or Application Programming Interface, in general) requires a Facebook account with the developer features turned on.  Once you have a regular Facebook account set up, the developer features can be quickly turned on by creating an app as follows:

1. Go to [https://developers.facebook.com/](https://developers.facebook.com/) and choose "Get Started" from the top right corner.  (If you do not see "Get Started" then you have likely already created an app.  Choose "My Apps" instead and then either choose the option to create a new app _or_ jump to step ________ if you would like to use the app you already have.)
2. You will be asked to share some information and to perform a test to prove you are not a bot.  Do this.
3. You will be asked for a name for your app.  The demo version used to produce this notebook was called "Test App for Training".  You can call it anything you would like.
4. You will be asked for an email to be contacted through if Facebook ever needs to raise a concern about your app.  Provide this.

When complete you will be put on a page for your app.  It will have a url in the address bar that looks like `https://developers.facebook.com/apps/321150513924422/add/` and has a bunch of products that can be added to the app listed as blocks in the middle of the page.  We are not interested in any of this.  At least not yet.

What we want is an access token for the Facebook Graph API.  You can get one of these by:

1. Going to [https://developers.facebook.com/tools/explorer/](https://developers.facebook.com/tools/explorer/).
2. Clicking on the blue `Generate Access Token` button on the right-hand side of the screen.
3. 
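
Once a token has been generated, a minimal sketch of using it with the Facebook SDK might look like the following.  It assumes the token is pasted into a variable called `FB_ACCESS_TOKEN` (a name made up for this sketch) and that the token is allowed to read the `id` and `name` fields of your own profile.  The helper names `connect` and `describe_me` are also illustrative and are not part of the SDK.

```python
def connect(access_token):
    """Return a Graph API client for the given access token."""
    import facebook  # the Facebook SDK installed at the top of this notebook
    return facebook.GraphAPI(access_token=access_token)


def describe_me(graph, fields=("id", "name")):
    """Ask the Graph API for a few fields of the token owner's profile.

    `graph` is anything with a `get_object` method, such as the client
    returned by `connect`.
    """
    return graph.get_object(id="me", fields=",".join(fields))


# Usage (needs a real token and a network connection):
# FB_ACCESS_TOKEN = "paste-your-token-here"  # hypothetical variable name
# print(describe_me(connect(FB_ACCESS_TOKEN)))
```

Tokens generated in the Graph API Explorer are typically short-lived, so expect to repeat the token step if requests start failing with authentication errors.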


We will use oAuth1 throughout this workbook even though we'll only be reading tweets since it can be used in more situations (in particular when we try to read from the streaming API).  If it is necessary to read Twitter API endpoints (other than the streaming endpoint) at a faster rate than this workbook initially provides then consider switching to oAuth2.

You will need oAuth1 to do any of the following:

* Post Tweets or other resources;
* Connect to Streaming endpoints;
* Search for users;
* Use any geo endpoint;
* Access Direct Messages or account credentials;
* Retrieve users' email addresses;

You can get away with oAuth2 (or application-only authentication) if you are only looking to perform the following:

* Pull user timelines;
* Access friends and followers of any account;
* Access lists resources;
* Search in Tweets;
* Retrieve any user information, excluding the user's email address;

Both authentication methods will require you to collect some information about keys and tokens and paste it into the appropriate section of the cell below.  This key and token information is generated when you create a profile for an app on the Twitter Developer site.  App profiles can be created at https://developer.twitter.com/en/apps.  That same page will hold a list of all the profiles that you have created and clicking on the "Details" button for each app will bring you to a summary page.  There will be a link/tab near the top of the page called "Keys and Tokens" and clicking this will bring you to the page with the key and token information.

Paste in the required key and token information from the Twitter Developer site into the cell below and then run it in order to use this workbook.  Remember that you'll need to run the cell below (which loads your credentials) and _one_ of the authorization methods below (default to oAuth1 unless you are sure you need oAuth2).

In [17]:
API_KEY = 'VZn9a35jQsIcNDUSH8GCFfk2c'
API_KEY_SECRET = 'jUyu69HeCHuxepY1j2LvFZmwWTVHSGXATlvoLQ34U2E6LQjNRn'
ACCESS_TOKEN = '557113581-S5yxV6FXDQUPm0Ih3AjbgvOBxLwvqJaghjtlrVRQ'
ACCESS_TOKEN_SECRET = 'eBghW3C9qglDBlFRpSNxfmQTEbfGkp8RbpKPhLkLyhuwm'

!!! IMPORTANT !!!

If the code in the cells below fails it is likely because you need to put your own authentication details in the cell above.  More specifically, you will need to copy-paste in the api key, api key secret, access token, and access token secret from the "keys and tokens" tab of the description of the app that you set up with your Twitter developer account.

!!! IMPORTANT !!!
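
Keys pasted directly into a notebook are easy to share by accident.  As one possible alternative, the sketch below reads the four values from environment variables instead.  The variable names in `REQUIRED` are this sketch's own convention, not something the Twitter Developer site or the TwitterAPI library defines.

```python
import os

# The four variable names below are this sketch's own convention.
REQUIRED = ("TWITTER_API_KEY", "TWITTER_API_KEY_SECRET",
            "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET")


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


# Usage, once the variables are set in the shell that launched Jupyter:
# creds = load_credentials()
# API_KEY = creds["TWITTER_API_KEY"]
```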

### oAuth1 (User Identification)

In [18]:
# The TwitterAPI class must be imported before it can be used.
from TwitterAPI import TwitterAPI

api = TwitterAPI(API_KEY, 
                 API_KEY_SECRET, 
                 ACCESS_TOKEN, 
                 ACCESS_TOKEN_SECRET)

api.auth

<requests_oauthlib.oauth1_auth.OAuth1 at 0x7f9c094f0210>

If successful the output of the cell above should look something like:

    <requests_oauthlib.oauth1_auth.OAuth1 at 0x107b8bba8>

### oAuth2 (App Identification)
!!!WARNING!!! 

Using oAuth2 will prevent you from using the streaming endpoint.  If you choose to try oAuth2 in the streaming example and receive the following error

    TwitterRequestError: Twitter request failed (401)

then simply run the oAuth1 section and then try the streaming portion of this workbook again.

!!!WARNING!!! 

In [5]:
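# Repeated here so that this cell also works if the oAuth1 cell was skipped.
from TwitterAPI import TwitterAPI

api = TwitterAPI(API_KEY,
                 API_KEY_SECRET,
                 auth_type='oAuth2')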

api.auth

<TwitterAPI.BearerAuth.BearerAuth at 0x7f9c09488f90>

If successful the output of the cell above should look something like:

    <TwitterAPI.BearerAuth.BearerAuth at 0x107b9acc0>

## What is a tweet, _really_ ?

Most people use "tweet" to refer to a snippet of text of 140 characters or less (now up to 280 characters), so they are generally surprised to discover that this is only the proverbial "tip of the iceberg" in terms of what a tweet really is.  In this section we'll see exactly what a tweet is, how to make the full content easier to read, and then how to grab the portions that we want (usually the "text").

To make this easy we'll only request a single tweet by its ID number.  Every tweet has its own unique ID and can be requested if that ID is known.  We request the tweet with ID# 1270422899195817986 and then print the response object.

In [29]:
r = api.request('statuses/show/:%d' % 1270422899195817986)
print(r)

<TwitterAPI.TwitterAPI.TwitterResponse object at 0x7f9c097c8d90>


The output of running the cell above will be something like `<TwitterAPI.TwitterAPI.TwitterResponse object at 0x107b9af28>`, which isn't quite what we are looking for.  This `TwitterResponse object` is a bundle of information related to the request including the status code returned (`r.status_code`), how much of your quota is left (`r.get_quota()`), the response headers (`r.headers`), etc.  What we want is the "text" portion of this response (`r.text`).

In [30]:
r.text

'{"created_at":"Tue Jun 09 18:29:57 +0000 2020","id":1270422899195817986,"id_str":"1270422899195817986","text":"From my hometown, Cheyenne translation of Black Lives Matter. \\nM\\u00f2\\u022fht\\u0227ev\\u00e9\'h\\u00f2e h\\u00e9v\\u00f2\\u0117stan\\u00e9hevesto\\u00e9va \\u00e9h\\u00e8\\u0227m\\u022femenestse https:\\/\\/t.co\\/T7mWYkzF29","truncated":false,"entities":{"hashtags":[],"symbols":[],"user_mentions":[],"urls":[],"media":[{"id":1270417788184731651,"id_str":"1270417788184731651","indices":[114,137],"media_url":"http:\\/\\/pbs.twimg.com\\/media\\/EaFuP1UWoAMQF2z.jpg","media_url_https":"https:\\/\\/pbs.twimg.com\\/media\\/EaFuP1UWoAMQF2z.jpg","url":"https:\\/\\/t.co\\/T7mWYkzF29","display_url":"pic.twitter.com\\/T7mWYkzF29","expanded_url":"https:\\/\\/twitter.com\\/CoreyWelch_STEM\\/status\\/1270422899195817986\\/photo\\/1","type":"photo","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"small":{"w":680,"h":680,"resize":"fit"},"medium":{"w":1200,"h":1200,"resize":"fit"},"la

That's a lot more than 140 characters!

Exactly what is there is hard to determine though given the formatting.  We can do better.

This content is in JavaScript Object Notation (JSON), which is really a nested list of properties.  Python doesn't know this is JSON though so we need to tell it.  We do this in the next cell by importing the `json` library, converting `r.text` to JSON using the load string method ( `.loads()` ), and then outputting that JSON with formatting using the output string method ( `.dumps()` ) with some options added for readability.

In [31]:
import json
parsed_r = json.loads(r.text)
print(json.dumps(parsed_r, indent=3, sort_keys=True))

{
   "contributors": null,
   "coordinates": null,
   "created_at": "Tue Jun 09 18:29:57 +0000 2020",
   "entities": {
      "hashtags": [],
      "media": [
         {
            "display_url": "pic.twitter.com/T7mWYkzF29",
            "expanded_url": "https://twitter.com/CoreyWelch_STEM/status/1270422899195817986/photo/1",
            "id": 1270417788184731651,
            "id_str": "1270417788184731651",
            "indices": [
               114,
               137
            ],
            "media_url": "http://pbs.twimg.com/media/EaFuP1UWoAMQF2z.jpg",
            "media_url_https": "https://pbs.twimg.com/media/EaFuP1UWoAMQF2z.jpg",
            "sizes": {
               "large": {
                  "h": 1440,
                  "resize": "fit",
                  "w": 1440
               },
               "medium": {
                  "h": 1200,
                  "resize": "fit",
                  "w": 1200
               },
               "small": {
                  "h": 680,
  

This is much nicer to read, especially since the various components have been alphabetized.  Having the response text as JSON also allows us to easily access each subcomponent.  We show this in the next cell by printing the text (sometimes called the "body" of the tweet), the ID# of the tweet, and the screen name of the user.

In [89]:
print("Tweet Body: ",parsed_r['text'])
print("Tweet ID: ",parsed_r['id'])
print("Screen Name: ",parsed_r['user']['screen_name'])
print("Declared User Location: ", parsed_r['user']['location'])

Tweet Body:  Where was Richard running today? https://t.co/i7QT7MLG75
Tweet ID:  1230671342866821121
Screen Name:  R_Sigurdson
Declared User Location:  Calgary, Alberta
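
Not every key is present in every tweet, and indexing a missing key raises a `KeyError`.  A small helper along the following lines can make nested lookups more forgiving.  The name `dig` and the cut-down `sample` dictionary are made up for illustration; `sample` stands in for `parsed_r`.

```python
def dig(data, *keys, default=None):
    """Follow a chain of keys into nested dicts, returning `default` if any is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


# A cut-down stand-in for `parsed_r`, for illustration only.
sample = {"id": 1, "user": {"screen_name": "example_user"}}
print(dig(sample, "user", "screen_name"))
print(dig(sample, "user", "location", default="(not given)"))
```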


Rather than parse the output to JSON every time we can combine the `TwitterResponse` object's `.get_iterator()` method with a for-loop to do this directly.  It's less work overall and is cleaner.

In [None]:
r = api.request('statuses/show/:%d' % 210462857140252672)
for item in r.get_iterator():
    print("Tweet Body: ",item['text'])
    print("Tweet ID: ",item['id'])
    print("Screen Name: ",item['user']['screen_name'])

### Response Codes
As we move forward, eventually you're likely to end up with an error.  When these are related to our interaction with Twitter rather than a more local mistake then Twitter helpfully provides a code to help diagnose the problem.  A list of all these codes is [HERE](https://developer.twitter.com/en/docs/basics/response-codes.html).
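
As a rough sketch of how the status code might be checked before parsing a response, the lookup below maps a handful of common codes to short notes.  The dictionary is deliberately incomplete and the wording of each note is mine rather than Twitter's; the linked page remains the authority.

```python
# A few common HTTP status codes; not a complete list.
KNOWN_CODES = {
    200: "OK",
    401: "Unauthorized: check your keys and tokens",
    404: "Not found: the resource or tweet does not exist",
    429: "Too many requests: a rate limit has been hit",
}


def explain_status(status_code):
    """Return a short, human-readable note for a response status code."""
    fallback = "Unlisted code %d: see the response codes page" % status_code
    return KNOWN_CODES.get(status_code, fallback)


# Usage with a TwitterAPI response object `r`:
# print(explain_status(r.status_code))
```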

## Accessing the entire API

With what you have in hand you now have the ability to request anything from the entire API.  If the API serves it then you can get it.  For a list of what is on the menu see [HERE](https://developer.twitter.com/en/docs/api-reference-index).  All that is missing is how to modify the request to grab any of that information and that's exactly what we'll cover in this section.

In the block below we grab the ids of the followers associated with the Twitter Handle passed to the variable "TwitterHandle".  This will return only the ids of the followers but it will return up to 5,000 at a time.

[To add: Example with likes and retweets.  "I suspect that what you want are the IDs of the retweeters for a specific tweet.  And/Or all those who liked a specific tweet.  You can get the first fairly directly by looking at the API Reference [HERE](https://developer.twitter.com/en/docs/api-reference-index) and searching for “retweet” to see what you can GET (See “Accessing the entire API” in the notebook for an example with IDs).  Favourites is tougher.  You can get all the favourite tweets of a user but not all the users who favourited a tweet (at least if you can I don’t know how to do it directly).  Look at the API reference and search “Favorite” to see how."

also: Media.  "How does the scraper deal with media?  The short answer: it doesn’t.  Why?  Media are just values/links in the JSON objects that are returned.  If you want the actual media then you’ll need to follow the links to see what is there and capture it another way (possibly with Python but outside the scraper).  For example try this (assuming the previous stuff in the notebook to the very first example that pulls a tweet by ID):

`import json`

`r = api.request('statuses/show/:%d' % 1270422899195817986)`

`parsed_r = json.loads(r.text)`

`print(json.dumps(parsed_r, indent=3, sort_keys=True))`

You’ll see some “media” keys that hold the links to what you want, such as “http://pbs.twimg.com/media/EaFuP1UWoAMQF2z.jpg”.]
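
The media note above could be turned into a small helper along these lines.  It only collects the links; actually downloading the files is a separate step.  The function name `media_urls` is made up, and `sample_tweet` is a cut-down copy of the parsed tweet shown earlier.

```python
def media_urls(tweet):
    """Collect the HTTPS media links from a parsed tweet, if it has any."""
    media = tweet.get("entities", {}).get("media", [])
    return [m["media_url_https"] for m in media if "media_url_https" in m]


# Cut-down version of the parsed tweet shown earlier in this notebook.
sample_tweet = {
    "id": 1270422899195817986,
    "entities": {
        "hashtags": [],
        "media": [{"media_url_https": "https://pbs.twimg.com/media/EaFuP1UWoAMQF2z.jpg",
                   "type": "photo"}],
    },
}
print(media_urls(sample_tweet))
```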

In [90]:
TwitterHandle = 'symulation'

r = api.request('followers/ids', {'screen_name': TwitterHandle})

r.text

'{"ids":[89597974,354740435,2809015377,293651076,743086431971778561,154793244,468571696,1155634482629791746,59819048,14726818,1468480220,2831073533,326077427,7465672,1142871446,15280441,108797530,3280014502,1281391231,20655405,85698409,1106125130529497093,201729872,634145696,317837804,2530913857,108366404,1009540472363077632,706961700000411649,3004785682,1275919560,2794614248,377648815,7737262,1061902105,65767723,273333980,935257516669272064,1875230906,2163496622,4490863694,975385941039812608,974688967193841669,958828880684462080,395545900,54840028,923351385810309120,88226196,21900557,113369924,229303281,4274106434,2364592844,365770640,66843277,775740202576781312,804234421331107841,915060319260639232,2902074361,915705393732562945,914617820914356225,335027958,280168233,91616458,236361035,1398709159,892836988612509696,730541257299251201,890993297727963136,44410209,1020311,489956375,880115301743960064,53893339,416157021,345642306,75694920,279097061,3133107355,46712597,319103163,1413925345
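
For accounts with more than 5,000 followers the response also carries a `next_cursor` value that can be sent back as the `cursor` parameter to get the next batch.  The sketch below shows one way to loop over those batches.  It takes a `fetch_page` function (a name invented here) so that the paging logic can be read, and tried out, separately from the network call.

```python
def iter_follower_ids(fetch_page):
    """Yield follower IDs across pages.

    `fetch_page(cursor)` must return the parsed JSON for one page, i.e. a dict
    with an "ids" list and a "next_cursor" value.
    """
    cursor = -1  # -1 asks for the first page
    while cursor != 0:  # a next_cursor of 0 means there are no more pages
        page = fetch_page(cursor)
        for follower_id in page.get("ids", []):
            yield follower_id
        cursor = page.get("next_cursor", 0)


# Usage with the `api` object and `json` import from earlier cells:
# def fetch_page(cursor):
#     r = api.request('followers/ids', {'screen_name': TwitterHandle, 'cursor': cursor})
#     return json.loads(r.text)
# all_ids = list(iter_follower_ids(fetch_page))
```

This endpoint is rate limited, so a very long follower list may need to be collected across several 15-minute windows.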

In the block below we grab the actual JSON entries on followers for the twitter handle passed.  This returns a lot more information than just the id but it only returns up to 20 at a time.  If you wanted more then you'd need to use paging, as described below.  

Note the use of the `get_iterator` method to help parse the JSON that is returned into user by user chunks.

In [91]:
TwitterHandle = 'symulation'

r = api.request('followers/list', {'screen_name': TwitterHandle})

for item in r.get_iterator():
    print(item,"\n")

{'id': 89597974, 'id_str': '89597974', 'name': '🅼🅴🅻🅾🅳🆈 🅶🆁🅴🅴🅽', 'screen_name': 'Melody_36582', 'location': '📍 Canada 🇨🇦', 'description': "Follow me, follow me, and don't forget the whiskey 💋", 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 3, 'friends_count': 291, 'listed_count': 1, 'created_at': 'Fri Nov 13 01:48:23 +0000 2009', 'favourites_count': 0, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': False, 'statuses_count': 30, 'lang': None, 'status': {'created_at': 'Sun May 03 23:01:54 +0000 2020', 'id': 1257082986497015809, 'id_str': '1257082986497015809', 'text': 'Follow Your Heart But Take Your Brain With You. 💝👅💖\nhttps://t.co/CLYWsgzZAT', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 1254935218739445761, 'id_str': '1254935218739445761', 'indices': [52, 75], 'media_url': 'http://pbs.twimg.com/media/EWps7rkXgAEehUe.jpg', 'media_url_https': 'http

## Streaming

There are two approaches to collecting information from Twitter: grabbing tweets as they are published and searching through the archive of past tweets.  The first approach is known as "streaming" and we'll look at how to use it now (The second we'll call "standard search" and we'll look at it next).  It is important to note up front that the results returned using this method are incomplete: you will _not_ necessarily capture every single tweet that you intend to this way.  Still, you can get a lot of tweets in a very short time and likely enough to get you started (assuming your search terms are not overly restrictive).

Streaming amounts to applying a set of filters to the stream of all tweets being published from now until we stop reading the stream and capturing what is matched by those filters.  For this first example we'll simply print the body of each tweet out on the screen.  There are lots of opinions about Donald Trump right now so we'll use 'trump' as the term we are tracking.

!!! IMPORTANT !!!

When streaming it is possible for your code to run indefinitely.  In the case of the cell below you'll want to stop it at some point—likely after only a few seconds!—so be prepared to click the stop button at the top of this workbook once you have some tweets in the output cell.  If you leave the search term as 'trump' then you'll have enough tweets to prove that it works after about two seconds!

It is also possible that you'll hit a rate limit with the search term "trump".  If you approach a 1% sample of all the tweets being published with your request then the API will cut you off.  This will look like a "key error" because the code is looking for the key "text" in the JSON that is returned but it's not finding it because the API didn't give any text as a response. 

Lastly, note that the API only returns tweets inside a 10ms window per second.  You will not be getting everything via streaming.

!!! IMPORTANT !!!

In [28]:
TRACK_TERM = 'covid daycare'

r = api.request('statuses/filter', {'track': TRACK_TERM})

for item in r.get_iterator():
    # Parenthesise the pair so the 'text' check guards both lookups.
    print((item['text'], item['id']) if 'text' in item else item)

KeyboardInterrupt: 

The value passed to TRACK_TERM can be modified as per the standard search operators (or the premium ones once we get there) defined by the Twitter API to filter what is returned.  You can see a list of all the standard operators with examples [HERE](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/standard-operators).
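
Rather than relying on the stop button, one option is to cap the number of items read from the stream.  The sketch below does this with `itertools.islice` from the standard library.  The `take` helper and the `fake_stream` generator are made up for illustration; `fake_stream` simply stands in for `r.get_iterator()`.

```python
from itertools import islice


def take(iterable, limit):
    """Return at most `limit` items from an iterable, then stop reading it."""
    return list(islice(iterable, limit))


# Stand-in for a never-ending stream, for illustration only.
def fake_stream():
    n = 0
    while True:
        n += 1
        yield {"text": "tweet %d" % n, "id": n}


for item in take(fake_stream(), 3):
    print(item["text"], item["id"])

# With a real stream: take(r.get_iterator(), 50)
```

Note that with a rarely used track term this will still wait until the requested number of items has arrived, so the stop button remains useful.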

## Write to File

As satisfying as it may be to have an endless stream of tweets scroll in front of you there isn't much value in it unless you are able to capture the tweets for analysis in the future. If you really only need a few and only need them once then you can just cut and copy from the output above.  In most cases though you'll want a lot of tweets or want to collect them multiple times.  In such cases the ideal thing to do is write the tweets to a database because then you will have the search features of the database at your disposal.  We'll get to interacting with a database later in this workbook but keep in mind that actually setting up that database is beyond the scope of this workbook (If you're not sure where to start, [MongoDB](https://www.mongodb.com/) is worth considering since its internal format is very similar to the JSON (JavaScript Object Notation) that tweets come in).

At this point we will simply write the text and ID number of each tweet to a file.  We do this using `with open...` because this method of opening a file will ensure that it will close properly if the program crashes/halts unexpectedly, something that is inevitable at this point because the only way we have to stop the code we are running right now is to interrupt it.  There is some casting of values to strings using the `str` function when writing to the output file because the ` .write() `  method requires a single string as an input.  Note as well the addition of the line break (`\n`) to ensure that each tweet body starts on a new line.

[Note that if you are a Windows user then you may have difficulty saving this file because of the character encoding used.  I'm working on finding a reliable fix for this.  Setting the encoding to utf-8, as is done below, seems to work in many but not all cases.]

In [None]:
TRACK_TERM = 'potato'

r = api.request('statuses/filter', {'track': TRACK_TERM})

with open("streamTweets.csv","a",encoding="utf-8") as outfile:
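    for item in r.get_iterator():
        if 'text' in item:
            # Join the tweet body and ID into one comma-separated line.
            line = item['text'] + ',' + str(item['id'])
            print(line)
            outfile.write(line + '\n')
        else:
            # Non-tweet messages are dicts with no 'text' key; cast before writing.
            print(item)
            outfile.write(str(item) + '\n')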

Looking at the output (there should be a file called "streamTweets.csv" in the same directory as this notebook) we can see that the body of each tweet is printed on its own line followed by a comma which is followed by the tweet ID... mostly.  Scrolling through the list will reveal that there are tweets that span multiple lines and blank spaces.  Why is this?  The body of some tweets includes line break characters (` \n `).

If we compare what was printed to the screen to what was written in the file we'll see the `\n`'s on the screen translated to blank lines in the file.

While there is an argument to be made that removing these characters makes no difference to the content of the tweet the counterargument is that line breaks are important punctuation and should be kept with the original tweet.  We will keep the line breaks.

A popular way to do this would be to use a regular expression and replace each `\n` that occurs with a `\\\\n` so that each `\` is appropriately escaped as it is passed from the variable to the file.  The problem with this method is that there are other characters that might appear as well (either in tweets or elsewhere) and while we could write a regular expression to do the substitution in each case Python offers a better way: the [representation function](https://docs.python.org/3.5/library/functions.html#repr).  To do this we pass the line variable to the function `repr` as we write it to the output file, as in the example below.

(Remember to stop the cell after a few seconds.)

In [None]:
TRACK_TERM = 'trump'

r = api.request('statuses/filter', {'track': TRACK_TERM})

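with open("sentimentTest.csv","a",encoding="utf-8") as outfile:
    for item in r.get_iterator():
        if 'text' in item:
            line = item['text'] + ',' + str(item['id'])
            print(line)
            # repr() escapes line breaks so each tweet stays on one line.
            outfile.write(repr(line) + '\n')
        else:
            # Non-tweet messages are dicts with no 'text' key.
            print(item)
            outfile.write(repr(item) + '\n')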

Checking sentimentTest.csv shows that this approach is working well.  While we won't write to an output file in every example in the rest of this workbook keep in mind that you can use the same methods in all the examples that follow.
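
One limitation of joining the text and ID with a comma is that tweet bodies can contain commas themselves, which makes the columns ambiguous when the file is read back.  As an alternative to the `repr` approach (not the method used in this workbook), Python's built-in `csv` module quotes both commas and line breaks.  The sketch below uses made-up tweets and an in-memory buffer in place of a real file.

```python
import csv
import io


def write_rows(outfile, tweets):
    """Write (id, text) rows; the csv module quotes commas and line breaks."""
    writer = csv.writer(outfile)
    for tweet in tweets:
        writer.writerow([tweet["id"], tweet["text"]])


# Made-up tweets for illustration only.
tweets = [{"id": 1, "text": "one line, with a comma"},
          {"id": 2, "text": "two\nlines"}]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_rows(buffer, tweets)

# Reading back recovers the original text, line break included.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

When writing to a real file with the `csv` module, open it with `newline=""` as the Python documentation recommends.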

## Standard Search

The [Standard Search API](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets) allows for searching in the past 7 days.  It is rate limited to 180 requests per 15 minutes using oAuth1 and 450 requests per 15 minutes using oAuth2.  It is also "not exhaustive", meaning that the full body of tweets matching search criteria within the window is unlikely to be returned (maybe if the body of tweets is very small).

We'll shift away from the politically hot topic of Donald Trump to the topic of pizza for these next examples.

Note the use of the `.get_quota()` method on the response object in order to see how much of our quota remains.  That's right, there are quotas on the free account we are using.  If you're only looking back 7 days then you only have to worry about being rate limited.  If you're looking back farther then you can make up to 250 requests per month to the 30-Day API and up to 50 requests per month to the Full Archive.

In [32]:
SEARCH_TERM = 'covid school outbreak -rt'

r = api.request('search/tweets', {'q': SEARCH_TERM})

for item in r.get_iterator():
    print(item['text'] if 'text' in item else item)

print('\nQUOTA: %s' % r.get_quota())

@ecaorg @HRDMinistry Online classes are important and need of the hour. Because in this crucial time we cannot send… https://t.co/WDCcsgWFkr
I just know when we go back to school no ones going to follow any sort of social distancing guidelines, will party… https://t.co/MeYT4wsbF2
one of my students just facetimed me to ask when we’re going back to school. he’s so worried about a covid outbreak… https://t.co/78zJoRBAWH
over half are worried that Asian children are going to be bullied when they return to school due to the COVID-19 outbreak
Boise State is stopping voluntary workouts due to eight positive or presumed positive COVID-19 tests on campus in t… https://t.co/zURKpN4X3J
Article on similar conditions in the Parc Ex district in Montreal, thanks for sharing @sashamd.

“We used to learn… https://t.co/FpKroaYDZn
The Little Village Fiesta Patria Festival and 26th Street Mexican Independence Parade have been cancelled: https://t.co/3pO6xOQZW4
More details on a COVID-19 outbreak among st

Seems to be working but there are not many tweets being returned.  We can increase this by specifying the `count` parameter.  We'll also add a simple counter just to see what we are actually getting.  

The set of all the parameters that can be invoked is available [HERE](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets).

In [None]:
SEARCH_TERM = '#pizza'
COUNT = 100 

r = api.request('search/tweets', {'q': SEARCH_TERM, 'count': COUNT})

a = 1
for item in r.get_iterator():
    print(a)
    print(item['text'] if 'text' in item else item)
    a=a+1

print('\nQUOTA: %s' % r.get_quota())

So, not 100 tweets but certainly more than before.  To go back further we'll need to look at paging.

## Paging 

Twitter returns results in chunks that are called "pages".  In the example above we are seeing just the first page of results and setting the `count` parameter to the maximum number of possible results returnable per page.  If you want to go back further then it becomes necessary to send multiple requests to the API in succession with each one asking for the next page.  While this can be implemented "by hand" TwitterAPI makes this much easier by providing a paging function called 'TwitterPager' (you can read more about it [HERE](https://geduldig.github.io/TwitterAPI/paging.html)) that does all of the heavy lifting for you by tracking what page to ask for, ensuring the request rate is not too high, and generally managing the connection.  It is invoked in an almost identical way to everything you have seen so far in this workbook.

Unlike the previous examples where we were printing out the body of each tweet here we will print out only the date the tweet was created and the ID.  This is done simply because it makes it easier to track what is happening.

It is important to note that as with the streaming API endpoint you will need to stop the code at some point.  While you will eventually hit the 7-day limit it is unlikely that you want to wait that long for this toy example. 

In [None]:
from TwitterAPI import TwitterPager

SEARCH_TERM = 'pizza'
COUNT = 100

pager = TwitterPager(api, 'search/tweets', {'q': SEARCH_TERM, 'count': COUNT})

for item in pager.get_iterator():
    #print(item['text'] if 'text' in item else item)
    print((item['created_at'], item['id']) if 'text' in item else item)    

You'll note that as the code runs the dates and the tweet IDs both roll backwards as the search tool moves from the present (tweets become accessible via the search APIs about 30 seconds after they are created) to the past.

The TwitterAPI documentation provides some [advanced ways to add fault tolerance](https://geduldig.github.io/TwitterAPI/faulttolerance.html).  These are fairly sophisticated and involve checking status codes and the like.  While valuable it is inevitable that you will end up halting your program for one reason or another and need it to restart scraping where it left off.  

The code below does this by first checking to see if there is an object called "item" that has a value keyed to 'id'.  If it does then it captures this ID and uses it as input into the TwitterPager function so that all new tweets collected will be earlier than it.  If the value does not exist then an empty string is assigned as the ID to start from which the TwitterAPI will ignore and start providing input from the present.

This code will work as long as the notebook stays open, no matter how often the cell is interrupted.  If you close the notebook and reopen it then you'll need to pass in the ID value from the last line of the output file to restart in the correct location.  If you need a more sophisticated method for handling faults then follow the link above to the TwitterAPI documentation.

In [None]:
from TwitterAPI import TwitterPager

SEARCH_TERM = 'pizza'
COUNT = 100

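# Resume just before the last tweet seen in this session, if there is one.
try:
    MAX_ID = item['id'] - 1
except (NameError, KeyError, TypeError):
    MAX_ID = ''

# max_id (rather than since_id) asks for tweets at or before the given ID, i.e. older ones.
pager = TwitterPager(api, 'search/tweets', {'q': SEARCH_TERM, 'count': COUNT, 'max_id': MAX_ID})

with open("restartTweetsTest.csv","a", encoding="utf-8") as outfile:
    for item in pager.get_iterator():
        if 'text' in item:
            line = item['text'] + ',' + str(item['id'])
            print(line)
            outfile.write(repr(line) + '\n')
        else:
            # Non-tweet messages are dicts with no 'text' key.
            print(item)
            outfile.write(repr(item) + '\n')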
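
If the notebook has been closed, the last ID has to be recovered from the output file instead.  The sketch below shows one way this might be done, assuming each tweet line was written with `repr` in the `text,id` layout used above.  The function name `last_tweet_id` is made up for this example.

```python
import ast


def last_tweet_id(path):
    """Return the ID on the last well-formed line of the output file, or None."""
    last_id = None
    with open(path, encoding="utf-8") as infile:
        for raw in infile:
            try:
                line = ast.literal_eval(raw.strip())  # undo the repr() used when writing
                last_id = int(line.rsplit(",", 1)[1])
            except (ValueError, SyntaxError, IndexError, AttributeError, TypeError):
                continue  # skip blank lines and anything that is not a tweet line
    return last_id


# Usage:
# print(last_tweet_id("restartTweetsTest.csv"))
```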

## Premium Access

Access beyond stream filtering and searching imperfectly through the past 7 days requires some extra steps beyond simply making an app.

1. Setting up a dev environment.  Within the Twitter developer site click on your name in the top right corner.  From the menu select "Dev environments".  Follow the interface to create the environments that you would like and associate an app with each.
2. Note the name of each dev environment because it will go into one of the variables called `LABEL`, below.  I named my 30-Day development environment "30DayTesting" and my full archive development environment "fullArchiveTesting".  So in the 30 Day example I set `LABEL` to "30DayTesting" and in the Full Archive example I set `LABEL` to "fullArchiveTesting".

To see exactly what is available in the premium sandbox have a look at the overview [HERE](https://developer.twitter.com/en/docs/tweets/search/overview/premium.html) and the search guide [HERE](https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators).
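
As far as I understand the TwitterAPI library's examples, the dev environment label becomes part of the resource string passed to `api.request`, and premium searches take a `query` parameter rather than `q`.  The sketch below builds that string; treat the pattern as an assumption to check against the library's documentation.  The helper name `premium_resource` is made up.

```python
def premium_resource(product, label):
    """Build the resource string for a premium search request."""
    if product not in ("30day", "fullarchive"):
        raise ValueError("product must be '30day' or 'fullarchive'")
    return "tweets/search/%s/:%s" % (product, label)


print(premium_resource("30day", "30DayTesting"))

# Usage with the `api` object from earlier (assumed pattern, uses up monthly quota):
# r = api.request(premium_resource("30day", "30DayTesting"), {'query': 'pizza'})
```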


## Rate Limits

The search of the 7-day archive that we have done and the searches of the 30-day archive and the full archive that we are about to do are all subject to rate limits.  These matter less for the 7-day archive because its quotas reset every 15 minutes, while the quotas attached to the premium APIs reset only once per month.  Failure to respect these will result in Twitter rejecting your search requests until your quota refreshes, typically with an error code of 429.

The limits for the Standard API can be seen [HERE](https://developer.twitter.com/en/docs/basics/rate-limits.html).  To see the current quota status for your account as it applies to the 7-day archive you can run the code block further below [Note that this code block also demonstrates the basics of using the Python requests library to access the Twitter API rather than the TwitterAPI library.].  There is also the `.get_quotas` method within the TwitterAPI that was used above but this shows substantially less information [and it is not currently clear to me why it is different...].

To see the subscription usage for your premium account you can log into your developer account on the Twitter website, click on your name in the top right corner, and then choose "Subscriptions" (the direct link should be [https://developer.twitter.com/en/account/subscriptions](https://developer.twitter.com/en/account/subscriptions)).  This will give you a dashboard with content that looks like the following:

![](twitterSubscriptionDashboard.png)

The limits you will be held to (unless you pay to upgrade) are as follows:

* fullarchive: 50 searches/month and a total of 5,000 tweets returned
* 30day: 250 searches/month and a total of 25,000 tweets returned
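
Because these quotas only reset monthly, it can help to work out what a planned collection will cost before running it.  The sketch below is a rough budget check using the two limits listed above.  The function name `remaining_after` is hypothetical, and the figure of 100 tweets per request is an assumption about the page size, as is the idea that each page fetched counts as one search.

```python
# Sandbox quotas as listed above (searches per month, tweets per month).
SANDBOX_LIMITS = {
    'fullarchive': {'requests': 50, 'tweets': 5000},
    '30day': {'requests': 250, 'tweets': 25000},
}

def remaining_after(product, requests_made, tweets_per_request):
    """Return (requests_left, tweets_left) for the month, never below zero."""
    limits = SANDBOX_LIMITS[product]
    requests_left = max(limits['requests'] - requests_made, 0)
    tweets_left = max(limits['tweets'] - requests_made * tweets_per_request, 0)
    return requests_left, tweets_left

# Ten full archive requests at an assumed 100 tweets each
print(remaining_after('fullarchive', 10, 100))  # → (40, 4000)
```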

There are also limits on the number of times you can hit various APIs, typically within a 15 minute window.  To see some of these you can run the following:

In [None]:
# How to do this came from https://stackoverflow.com/questions/33308634/how-to-perform-oauth-when-doing-twitter-scraping-with-python-requests

import requests
from requests_oauthlib import OAuth1
import json

url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1(API_KEY, API_KEY_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
requests.get(url, auth=auth)

# using 'rr' as the variable to ensure that any work done above when the variable was 'r'
# isn't overwritten.
rr = requests.get('https://api.twitter.com/1.1/application/rate_limit_status.json?resources=help,users,search,statuses', auth=auth)

print(json.dumps(json.loads(rr.text), indent=3, sort_keys=True))
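
The full JSON printed above is long, so it can be easier to list only the endpoints that are running low.  A minimal sketch, assuming the response has the usual nesting of `resources`, then a resource family, then an endpoint with `limit`, `remaining` and `reset` values.  The function name `low_quota_endpoints` and the sample data are made up for illustration.

```python
def low_quota_endpoints(status, threshold=0.5):
    """List (endpoint, remaining, limit) where remaining/limit is below threshold."""
    rows = []
    for family in status.get('resources', {}).values():
        for endpoint, info in family.items():
            if info['limit'] and info['remaining'] / info['limit'] < threshold:
                rows.append((endpoint, info['remaining'], info['limit']))
    return sorted(rows)

# Made-up sample in the assumed shape of a rate_limit_status response
sample = {'resources': {
    'search': {'/search/tweets': {'limit': 180, 'remaining': 12, 'reset': 1590601505}},
    'help': {'/help/configuration': {'limit': 15, 'remaining': 15, 'reset': 1590601505}},
}}
print(low_quota_endpoints(sample))
```

With the cell above already run, `low_quota_endpoints(json.loads(rr.text))` would apply the same check to the live response.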

If you are being limited in your search results and running the cell above doesn't illuminate why this is so and neither does the subscription dashboard then running one of the following code cells might.  The first is a direct probe of the 30 Day endpoint and the second is a probe of the full archive.  

Remember that you need to run the appropriate authorization code at the top of this workbook!

In [None]:
# probe of 30 Day endpoint
r = requests.get('https://api.twitter.com/1.1/tweets/search/30day/30DayTesting.json?query=pizza', auth=auth)
print(json.dumps(json.loads(r.text), indent=3, sort_keys=True))

In [None]:
# probe of Full Archive
r = requests.get('https://api.twitter.com/1.1/tweets/search/fullarchive/fullArchiveTesting.json?query=physicalactivityandpregnancy', auth=auth)
print(json.dumps(json.loads(r.text), indent=3, sort_keys=True))

## 30 Day

An example of how to access the 30 Day archive can be seen immediately below.  Keep in mind that the limits on the 30 Day archive sandbox are significantly higher than those associated with the Full Archive sandbox, so you'll want to catch tweets that you can't grab from the standard search before they move out of this rolling window.

In [None]:
from TwitterAPI import TwitterAPI

SEARCH_TERM = 'pizza'
PRODUCT = '30day'
LABEL = '30DayTesting'

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), 
                {'query':SEARCH_TERM})

for item in r:
    print(item['text'] if 'text' in item else item)

In [None]:
r.text #you can run this if you need to see what the text of the last item was for debugging

## Full Archive

This is where _almost_ all the tweets that have ever been published can be found.  What tweets are not included?  Those that have been deleted or corrected for whatever reason.  If you want deleted tweets then you'll need access to some archive that scraped them before they were deleted.  If they were edited (possibly with a Chrome extension called [Covfefe](https://chrome.google.com/webstore/detail/covfefe/ccdjnhaifeigaidilnnajickpbjhbfom)) you are likely in a similar situation.

In [None]:
from TwitterAPI import TwitterAPI

SEARCH_TERM = 'pizza'
PRODUCT = 'fullarchive'
LABEL = 'fullArchiveTesting'

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), 
                {'query':SEARCH_TERM})

for item in r:
    print(item['text'] if 'text' in item else item)

In [None]:
r.text #you can run this if you need to see what the text of the last item was for debugging

## Extended Tweets

If you have been tracking the text of the tweets you'll likely notice that many of them are incomplete, containing `...`.  This is a result of Twitter's new extended tweet mechanism that enables tweets of up to 280 characters instead of the original 140.  To get the full text of a tweet in every case you need to see if it is an extended tweet and, if it is, then look elsewhere in the tweet for the full text, as shown below.

In [None]:
#Full Archive Tweets with full text and tweet full information
from TwitterAPI import TwitterAPI, TwitterPager

SEARCH_TERM = 'POTUS'
PRODUCT = 'fullarchive'
LABEL = 'fullArchiveTesting'

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), {'query': SEARCH_TERM})

#with open("restartTweetsTest.csv","a", encoding="utf-8") as outfile:
for item in r:
    if 'extended_tweet' in item.keys():
        tweet_text = "EXTENDED: " + item['extended_tweet']['full_text']
    else:
        tweet_text = "ORIGINAL:" + item['text']
    print(tweet_text)

## Full Archive with Paging

Look at the premium search documentation and note that some of the features supported when searching the seven day archive are not supported on the 30 Day or Full Archive API.  For example, specifying a maximum number of results to return or which ID to search since is not available (see [HERE](https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search)).

Note the code below which amalgamates the full archive example just shown with the paging example from earlier in this workbook.

In [None]:
#Full Archive Tweets without full text
from TwitterAPI import TwitterAPI, TwitterPager

SEARCH_TERM = 'pizza'
PRODUCT = 'fullarchive'
LABEL = 'fullArchiveTesting'

pager = TwitterPager(api, 'tweets/search/%s/:%s' % (PRODUCT, LABEL), {'query': SEARCH_TERM})

#with open("restartTweetsTest.csv","a", encoding="utf-8") as outfile:
for item in pager.get_iterator(wait=30):
    if 'text' in item:
        # join the fields of interest with '|' as the separator
        line = '|'.join([item['text'], str(item['id']), str(item['created_at']),
                         str(item['user']['location']), str(item['user']['name']),
                         str(item['user']['screen_name'])])
        print(line)
        #outfile.write(repr(line) + '\n')
    else:
        # not a tweet (e.g. an error message), so show it as-is
        print(item)

Note that if you have a retweet of an extended tweet then the only way to get the full content is to grab the original tweet.  This can be done using the information in the retweet but this is not currently covered here.
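
As a starting point, the check for an extended tweet can be wrapped in a helper that also looks inside a retweet.  This is a minimal sketch, assuming that a retweet carries a copy of the original tweet under the key `retweeted_status` and that this copy may itself contain `extended_tweet`.  The function name `full_text` is hypothetical and the sample dictionaries are made up.

```python
def full_text(item):
    """Return the longest available text for a tweet dictionary."""
    # For a retweet, look at the embedded original tweet instead of the retweet itself.
    source = item.get('retweeted_status', item)
    if 'extended_tweet' in source:
        return source['extended_tweet']['full_text']
    return source.get('text', '')

# Made-up examples: a short tweet, an extended tweet, and a retweet of the extended tweet
plain = {'text': 'short'}
ext = {'text': 'long tw…', 'extended_tweet': {'full_text': 'long tweet in full'}}
rt = {'text': 'RT @a: long tw…', 'retweeted_status': ext}
print(full_text(rt))
```

Note that for a retweet this returns the text of the original tweet, so the leading `RT @user:` is not included.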

## Adding Date Ranges

Adding date ranges to searches of the Full Archive is an essential task given that within the sandbox you'll only be able to collect up to 5,000 tweets per month.  This is easily done by passing a `fromDate` value and a `toDate` value within the same curly braces in which the value for `query` is passed.  This is shown below in a code block that also incorporates extended tweets and paging through results.

Note that this is not a block of code to just let run since it will consume part of your full archive access for the month.

In [None]:
from TwitterAPI import TwitterAPI, TwitterPager

SEARCH_TERM = 'pizza'
PRODUCT = 'fullarchive'
LABEL = 'fullArchiveTesting'

# fromDate and toDate are strings in the form YYYYMMDDHHMM (UTC)
pager = TwitterPager(api, 'tweets/search/%s/:%s' % (PRODUCT, LABEL),
                     {'query': SEARCH_TERM, 'fromDate': '201809170000', 'toDate': '201809180000'})

for item in pager.get_iterator(wait=30):
    if 'text' not in item:
        # not a tweet (e.g. an error message), so show it and move on
        print(item)
        continue
    if 'extended_tweet' in item.keys():
        tweet_text = item['extended_tweet']['full_text']
    else:
        tweet_text = item['text']
    line = tweet_text + '|' + str(item['id']) + '|' + str(item['created_at']) + '|' + str(item['user']['location']) + '|' + str(item['user']['name']) + '|' + str(item['user']['screen_name'])
    print(line)
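
The premium search endpoints expect `fromDate` and `toDate` as 12-character strings in the form YYYYMMDDHHMM, which is easy to mistype by hand.  A minimal sketch of building them from Python `datetime` objects follows.  The function name `search_window` is hypothetical.

```python
from datetime import datetime, timedelta

def search_window(start, days=1):
    """Return (fromDate, toDate) strings in YYYYMMDDHHMM form."""
    end = start + timedelta(days=days)
    fmt = '%Y%m%d%H%M'
    return start.strftime(fmt), end.strftime(fmt)

from_date, to_date = search_window(datetime(2018, 9, 17))
print(from_date, to_date)  # → 201809170000 201809180000
```

The two strings can then be passed as the `fromDate` and `toDate` values alongside `query`.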

## Using a Database

If you're going to be collecting a lot of tweets then keeping them all in .csv files (as above) will likely become detrimental at some point unless your workflow is always going to involve processing each of these files in its entirety.  If you're ever going to want to search quickly for content within what you have collected, possibly across a range of features, then you should consider loading the tweets you collect into a database as you collect them.

There are many databases that you could choose for doing this but the one we will cover here is MongoDB.  Why?  The Community Edition (aka "The Free Version") is robust enough to handle many research tasks, it has decent performance, there is a handy tool that accompanies it that you can use to interact with the database outside of Python, and the way MongoDB stores the items it holds (called "documents", though the code in this notebook refers to them as "posts") is pretty much JSON, which is the format that the tweets are returned in.  So, a pretty nice fit.

This notebook is not going to cover how to install MongoDB Community Edition but the instructions you should be looking at are [HERE](https://docs.mongodb.com/manual/administration/install-community/).  Note that these instructions require that MongoDB is restarted as a service each time the computer is turned back on.  If you want to have it autostart then you'll need to do some searching for a solution that fits your operating system.

A free tool for interacting with MongoDB without some programming language is Robo 3T (formerly "Robomongo").  It can be downloaded [HERE](https://robomongo.org/download).  If you need more power then the paid version, Studio 3T, is available from the same place.

The rest of this section assumes that you have installed MongoDB _and that it is running_.  If you are using this workbook as part of a webinar then, regardless of whether you are following along in Google Colab or on your own machine, this is not likely to be the case and so you'll likely need to simply look at how this is done in principle.  If you are using Colab and would like to try MongoDB then a method for doing so is [HERE](https://colab.research.google.com/github/Giffy/MongoDB_PyMongo_Tutorial/blob/master/1_Run_MongoDB_in_colab.ipynb#scrollTo=8IxGGMVFnWgx) (note the caveat that your data will be deleted after 12 hours).

We'll start by installing the library that will let us connect to the instance of MongoDB that _you already have running in the background_.

In [92]:
#run this if you need to install pymongo, the library that lets python interact with MongoDB

!pip install pymongo



Once the pymongo library is installed we can import it into our current session.  Note the creation of the variable `client`.  By default MongoDB is listening on port 27017 for connections and this is where that connection information is passed.

In [93]:
from pymongo import MongoClient
client = MongoClient('localhost', 27017)

We'll create/connect to a database called `testDB` and assign the connection to that database to the variable `db`.

In [94]:
db = client.testDB

The database itself is not actually where the information is stored in MongoDB.  It has a substructure called a "collection" and so we'll create one called "testCollection" and assign it to the variable `posts` for collecting data.

In [95]:
posts = db.testCollection

Note that the variable `posts` basically expands to `MongoClient('localhost', 27017).testDB.testCollection`.  We use `db` and `posts` as variables because it saves writing out the entire string of connection information each time.

Now that we have the connection information sorted out we can carry out a search and add the content to the database, as below.  Note the structures related to TwitterAPI that are taken from the above sections.  The new step is the creation of a dictionary object (essentially a piece of JSON if you're not familiar with dictionaries) in `post = {"text": tweet_text ... }`.  Note that a design decision is being made about what to call the items being stored in the database.  `text` is the key that the Twitter API uses for the content of a tweet, and `tweet_text` is simply the local variable holding that content.  Keeping `text` as the field name for the entry in our database is a choice.  You can modify or keep the key/field names as works for you.  If you do not need to keep the entire content of a tweet then there is likely little point in replicating the format of the tweet exactly.

`posts.insert_one(post)` is the instruction that writes the content of the tweet in `post` to the collection in the database.  The pymongo method `insert_one` returns the id of the inserted content which we capture here with `post_id` and then print to the screen on the next line.

In [96]:
SEARCH_TERM = 'corona virus'
COUNT = 100
MODE = 'extended'

r = api.request('search/tweets', {'q': SEARCH_TERM, 'count': COUNT, 'mode': MODE})

for item in r.get_iterator():
    tweet_text = repr(item['text'])
    post = {"text": tweet_text,
        "id": str(item['id']),
        "created_at": str(item['created_at']),
        "user_location": str(item['user']['location']),
        "user_name": str(item['user']['screen_name']),
        "search_term": SEARCH_TERM}
    post_id = posts.insert_one(post).inserted_id
    print(post_id)

5ecea7244921c1eb2a74cf83
5ecea7244921c1eb2a74cf84
5ecea7244921c1eb2a74cf85
5ecea7244921c1eb2a74cf86
5ecea7244921c1eb2a74cf87
5ecea7244921c1eb2a74cf88
5ecea7244921c1eb2a74cf89
5ecea7244921c1eb2a74cf8a
5ecea7244921c1eb2a74cf8b
5ecea7244921c1eb2a74cf8c
5ecea7244921c1eb2a74cf8d
5ecea7254921c1eb2a74cf8e
5ecea7254921c1eb2a74cf8f
5ecea7254921c1eb2a74cf90
5ecea7254921c1eb2a74cf91
5ecea7254921c1eb2a74cf92
5ecea7254921c1eb2a74cf93
5ecea7254921c1eb2a74cf94
5ecea7254921c1eb2a74cf95
5ecea7254921c1eb2a74cf96
5ecea7254921c1eb2a74cf97
5ecea7254921c1eb2a74cf98
5ecea7254921c1eb2a74cf99
5ecea7254921c1eb2a74cf9a
5ecea7254921c1eb2a74cf9b
5ecea7254921c1eb2a74cf9c
5ecea7254921c1eb2a74cf9d
5ecea7254921c1eb2a74cf9e
5ecea7254921c1eb2a74cf9f
5ecea7264921c1eb2a74cfa0
5ecea7264921c1eb2a74cfa1
5ecea7264921c1eb2a74cfa2
5ecea7264921c1eb2a74cfa3
5ecea7264921c1eb2a74cfa4
5ecea7264921c1eb2a74cfa5
5ecea7264921c1eb2a74cfa6
5ecea7264921c1eb2a74cfa7
5ecea7264921c1eb2a74cfa8
5ecea7264921c1eb2a74cfa9
5ecea7264921c1eb2a74cfaa
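
The dictionary built inside the loop above can also be factored into a small function, which makes the choice of field names explicit and easy to check without a database.  This is a minimal sketch only.  `tweet_to_post` is a hypothetical helper, the sample tweet is made up, and unlike the cell above it stores the text as-is rather than its `repr()`.

```python
def tweet_to_post(item, search_term):
    """Build the document to store from a tweet dictionary."""
    user = item.get('user', {})
    return {
        "text": item.get('text', ''),
        "id": str(item['id']),
        "created_at": str(item.get('created_at', '')),
        "user_location": str(user.get('location', '')),
        "user_name": str(user.get('screen_name', '')),
        "search_term": search_term,
    }

# Made-up tweet with only the fields used above
sample_tweet = {'id': 1, 'text': 'hello', 'created_at': 'Wed May 27 17:45:05 +0000 2020',
                'user': {'location': '', 'screen_name': 'example_user'}}
print(tweet_to_post(sample_tweet, 'corona virus'))
```

Inside the loop this would be used as `posts.insert_one(tweet_to_post(item, SEARCH_TERM))`.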


Of course, if you really want (or don't mind) all the content of all the tweets collected then you can significantly reduce what is needed to input each tweet into the collection in the database.  Note that this works because `item` is a dictionary given both the Twitter API and the library that we are using to access it (TwitterAPI, no space).  If you use another library to scrape tweets then you may have to do some additional work in advance of this.

In [97]:
SEARCH_TERM = 'corona virus'
COUNT = 100
MODE = 'extended'

r = api.request('search/tweets', {'q': SEARCH_TERM, 'count': COUNT, 'mode': MODE})

for item in r.get_iterator():
    print(posts.insert_one(item).inserted_id)

5ecea7e34921c1eb2a74cfe7
5ecea7e34921c1eb2a74cfe8
5ecea7e34921c1eb2a74cfe9
5ecea7e34921c1eb2a74cfea
5ecea7e34921c1eb2a74cfeb
5ecea7e34921c1eb2a74cfec
5ecea7e34921c1eb2a74cfed
5ecea7e34921c1eb2a74cfee
5ecea7e34921c1eb2a74cfef
5ecea7e34921c1eb2a74cff0
5ecea7e34921c1eb2a74cff1
5ecea7e34921c1eb2a74cff2
5ecea7e34921c1eb2a74cff3
5ecea7e34921c1eb2a74cff4
5ecea7e34921c1eb2a74cff5
5ecea7e34921c1eb2a74cff6
5ecea7e34921c1eb2a74cff7
5ecea7e34921c1eb2a74cff8
5ecea7e34921c1eb2a74cff9
5ecea7e34921c1eb2a74cffa
5ecea7e34921c1eb2a74cffb
5ecea7e34921c1eb2a74cffc
5ecea7e44921c1eb2a74cffd
5ecea7e44921c1eb2a74cffe
5ecea7e44921c1eb2a74cfff
5ecea7e44921c1eb2a74d000
5ecea7e44921c1eb2a74d001
5ecea7e44921c1eb2a74d002
5ecea7e44921c1eb2a74d003
5ecea7e44921c1eb2a74d004
5ecea7e44921c1eb2a74d005
5ecea7e44921c1eb2a74d006
5ecea7e44921c1eb2a74d007
5ecea7e44921c1eb2a74d008
5ecea7e44921c1eb2a74d009
5ecea7e44921c1eb2a74d00a
5ecea7e44921c1eb2a74d00b
5ecea7e44921c1eb2a74d00c
5ecea7e44921c1eb2a74d00d
5ecea7e44921c1eb2a74d00e


### Getting Data Out: Find One Method

This allows you to find a Tweet matching a single criterion in a MongoDB collection.  If more than one document matches the search term then only the first match found is returned.  `user_name` is used in this example.  Note that this is searching for Tweets from the collection created by the above cells so they must be run in advance.  Also note that you will need an actual `user_name` from the data collected for this to work and so you will likely need to change the input value.

In [99]:
import pprint

pprint.pprint(posts.find_one({"user_name": "10thchait"}))

{'_id': ObjectId('5ecea7244921c1eb2a74cf83'),
 'created_at': 'Wed May 27 17:45:05 +0000 2020',
 'id': '1265700567596150784',
 'search_term': 'corona virus',
 'text': "'RT @haaveumetanish: Corona virus vaccine be like: "
         "https://t.co/9JdmXCBYyg'",
 'user_location': '',
 'user_name': '10thchait'}


### Getting Data Out: Find (All) Method

This method allows you to find all Tweets matching specified criteria in a MongoDB collection.  This example retrieves every document with `find()` and then displays the Tweets containing "virus" in the text.  Again, this will only search in one collection.
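
The cell that follows fetches every document and filters in Python.  The same match can instead be expressed as a query so that MongoDB does the filtering.  A minimal sketch, assuming the `text` field used earlier: `text_query` is a hypothetical helper, while `$regex` and `$options` are standard MongoDB query operators.

```python
import re

def text_query(term, ignore_case=True):
    """Build a MongoDB query matching documents whose text contains term."""
    query = {"text": {"$regex": re.escape(term)}}  # escape so the term is matched literally
    if ignore_case:
        query["text"]["$options"] = "i"
    return query

print(text_query("virus"))

# With a running database and the `posts` collection from above:
# for post in posts.find(text_query("virus")):
#     print(post['text'])
```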

In [101]:
#find() method
for post in posts.find():
    if "virus" in post['text']:
        #pprint.pprint(post)
        print(post['text'])
    

'RT @haaveumetanish: Corona virus vaccine be like: https://t.co/9JdmXCBYyg'
'RT @haaveumetanish: Corona virus vaccine be like: https://t.co/9JdmXCBYyg'
'RT @yelyahwilliams: i hate the news. if it’s not corona virus it’s other diseases, like blatant fucking racism.'
'RT @ahteshamBokhari: Hospital built in Islamabad in 40 days for corona virus patients. Imagine if this task would have been given to PPP ht…'
'RT @yelyahwilliams: i hate the news. if it’s not corona virus it’s other diseases, like blatant fucking racism.'
'RT @yelyahwilliams: i hate the news. if it’s not corona virus it’s other diseases, like blatant fucking racism.'
'RT @yelyahwilliams: i hate the news. if it’s not corona virus it’s other diseases, like blatant fucking racism.'
'RT @yelyahwilliams: i hate the news. if it’s not corona virus it’s other diseases, like blatant fucking racism.'
'10 new corona virus cases in #Tripura\nTotal cases 242\nTotal Cases in India 158052'
'RT @yelyahwilliams: i hate the news. if it’s not