# Introduction to Twitter Scraping for Researchers

This notebook was written by [John Simpson](mailto:john.simpson@computecanada.ca) and is meant to provide some simple, working examples for researchers who would like to collect information from Twitter.  While Twitter provides its own tools and libraries for this, they are a little too granular and possibly unfamiliar to many in the research community.  For this reason this workbook uses a Python library built by a third party that greatly streamlines the process of collecting tweets.  Special thanks are due to Victoria Meah and Miranda Kimber for helping test earlier versions of this workbook in support of their research into the impact of social media on pregnancy and fitness.

This notebook assumes:

1. Basic familiarity with the Jupyter Notebook environment.
2. A functioning Python environment on the system it is run on, and that you have the authority to install software on that system.
3. That you have a developer account with Twitter [HERE](https://developer.twitter.com/en/)
4. That you have an app created with Twitter [HERE](https://developer.twitter.com/en/apps)
5. That you have a MongoDB instance set up on your machine and appropriately configured.
6. That you pay attention to the various notes and warnings around the cells.

I won't promise you any support but if you send me a note I'll help as I am able.

The Python library used is called TwitterAPI (no space) and it can be found at https://github.com/geduldig/TwitterAPI.  Most of the code throughout this workbook is drawn directly from the examples on those pages.

With all this said, let's get started by installing the TwitterAPI library.

In [None]:
!pip install TwitterAPI

With the TwitterAPI library installed on the system we should be able to open it for use throughout this workbook with the following command:

In [None]:
from TwitterAPI import TwitterAPI

[Note that almost every piece of code in the remainder of this workbook assumes that the two cells above have been run and run successfully.  If you open the workbook and immediately try to run a cell other than this one first then it is likely that you will receive an error.  Simply run the cells above and try to run the cell you want to run again.  If you are receiving errors then two likely possibilities are an incorrect installation of Python or no network connection.]

## Authentication

Use of this workbook, and of the Twitter Developer API (Application Programming Interface) in general, requires a developer account from Twitter.  Unlike the early days of Twitter, when anyone with a regular Twitter account who requested a developer account would just be given one, Twitter now screens requests for developer accounts, a process that can delay getting started by many days.  If you had a developer account previously and created applications (apps) that used the Twitter APIs then you may still be able to use these apps to do some work, but it is possible that their ability to access the Twitter archive has been reduced and, if so, that you'll need to apply for a new developer account to correct this.  A developer account may be requested from https://developer.twitter.com/en/apply/user (assuming you already have a regular Twitter account).

As the [TwitterAPI Documentation](https://geduldig.github.io/TwitterAPI/authentication.html) points out: _Twitter supports both user and application authentication, called oAuth 1 and oAuth 2, respectively. User authentication gives you access to all API endpoints, basically read and write persmission. It is also required in order to using the Streaming API. Application authentication gives you access to just the read portion of the API – so, no creating or destroying tweets. Application authentication, however, has elevated rate limits._ 

We will use oAuth1 throughout this workbook even though we'll only be reading tweets since it can be used in more situations (in particular when we try to read from the streaming API).  If it is necessary to read Twitter API endpoints (other than the streaming endpoint) at a faster rate than this workbook initially provides then consider switching to oAuth2.

You will need oAuth1 to do any of the following:

* Post Tweets or other resources;
* Connect to Streaming endpoints;
* Search for users;
* Use any geo endpoint;
* Access Direct Messages or account credentials;
* Retrieve users' email addresses;

You can get away with oAuth2 (or application-only authentication) if you are only looking to perform the following:

* Pull user timelines;
* Access friends and followers of any account;
* Access lists resources;
* Search in Tweets;
* Retrieve any user information, excluding the user's email address;

Both authentication methods will require you to collect some information about keys and tokens and paste it into the appropriate section of the cell below.  This key and token information is generated when you create a profile for an app on the Twitter Developer site.  App profiles can be created at https://developer.twitter.com/en/apps.  That same page will hold a list of all the profiles that you have created and clicking on the "Details" button for each app will bring you to a summary page.  There will be a link/tab near the top of the page called "Keys and Tokens" and clicking this will bring you to the page with the key and token information.

Paste in the required key and token information from the Twitter Developer site into the cell below and then run it in order to use this workbook.  Remember that you'll need to run the cell below (which loads your credentials) and _one_ of the authorization methods below (default to oAuth1 unless you are sure you need oAuth2).

In [None]:
# Replace each placeholder with the value from the "Keys and Tokens" tab of your own app.
API_KEY = 'YOUR_API_KEY'
API_KEY_SECRET = 'YOUR_API_KEY_SECRET'
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'
ACCESS_TOKEN_SECRET = 'YOUR_ACCESS_TOKEN_SECRET'

!!! IMPORTANT !!!

If the code in the cells below fails it is likely because you need to put your own authentication details in the cell above.  More specifically, you will need to copy-paste in the api key, api key secret, access token, and access token secret from the "keys and tokens" tab of the description of the app that you set up with your Twitter developer account.

!!! IMPORTANT !!!
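
One way to avoid leaving real keys in a notebook that you might share is to read them from environment variables instead of pasting them in.  The sketch below is a suggestion rather than part of the original workflow, and the environment variable names (`TWITTER_API_KEY` and so on) are made up for illustration; use whatever names you set on your own system.

```python
import os

# Hypothetical environment variable names; set these before starting Jupyter,
# or change them to match your own setup.
CREDENTIAL_NAMES = {
    'API_KEY': 'TWITTER_API_KEY',
    'API_KEY_SECRET': 'TWITTER_API_KEY_SECRET',
    'ACCESS_TOKEN': 'TWITTER_ACCESS_TOKEN',
    'ACCESS_TOKEN_SECRET': 'TWITTER_ACCESS_TOKEN_SECRET',
}


def load_credentials(environ=os.environ):
    """Return a dict of the four credentials, raising KeyError if any is missing."""
    missing = [env_name for env_name in CREDENTIAL_NAMES.values()
               if not environ.get(env_name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {name: environ[env_name] for name, env_name in CREDENTIAL_NAMES.items()}


# Demonstration with a stand-in environment so that no real keys are needed here.
fake_environ = {
    'TWITTER_API_KEY': 'demo-key',
    'TWITTER_API_KEY_SECRET': 'demo-key-secret',
    'TWITTER_ACCESS_TOKEN': 'demo-token',
    'TWITTER_ACCESS_TOKEN_SECRET': 'demo-token-secret',
}
credentials = load_credentials(fake_environ)
print(sorted(credentials))
```

Calling `load_credentials()` with no argument reads the real environment, and the resulting values can be assigned to `API_KEY`, `API_KEY_SECRET`, `ACCESS_TOKEN`, and `ACCESS_TOKEN_SECRET` before running the authorization cells.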

### oAuth1 (User Identification)

In [None]:
api = TwitterAPI(API_KEY, 
                 API_KEY_SECRET, 
                 ACCESS_TOKEN, 
                 ACCESS_TOKEN_SECRET)

api.auth

If successful the output of the cell above should look something like:

    <requests_oauthlib.oauth1_auth.OAuth1 at 0x107b8bba8>

### oAuth2 (App Identification)
!!!WARNING!!! 

Using oAuth2 will prevent you from using the streaming endpoint.  If you choose to try oAuth2 in the streaming example and receive the following error

    TwitterRequestError: Twitter request failed (401)

then simply run the oAuth1 section and then try the streaming portion of this workbook again.

!!!WARNING!!! 

In [None]:
api = TwitterAPI(API_KEY,
                 API_KEY_SECRET,
                 auth_type='oAuth2')

api.auth

If successful the output of the cell above should look something like:

    <TwitterAPI.BearerAuth.BearerAuth at 0x107b9acc0>

## What is a tweet, _really_ ?

Given that "tweet" is generally used to refer to snippets of text that are usually 140 characters or less (but can now be up to 280 characters), most people are surprised to discover that this is only the proverbial "tip of the iceberg" in terms of what a tweet really is.  In this section we'll see exactly what a tweet is, how to make the full content easier to read, and then how to grab the portions that we want (usually the "text").

To make this easy we'll only request a single tweet by its ID number.  Every tweet has its own unique ID and can be requested if that ID is known.  We request the tweet with ID# 210462857140252672 and then print the response object.

In [None]:
r = api.request('statuses/show/:%d' % 210462857140252672)
print(r)

The output of running the cell above will be something like `<TwitterAPI.TwitterAPI.TwitterResponse object at 0x107b9af28>`, which isn't quite what we are looking for.  This `TwitterResponse object` is a bundle of information related to the request including the status code returned (`r.status_code`), how much of your quota is left (`r.get_quota()`), the response headers (`r.headers`), etc.  What we want is the "text" portion of this response (`r.text`).

In [None]:
r.text

That's a lot more than 140 characters!

Exactly what is there is hard to determine though given the formatting.  We can do better.

This content is in JavaScript Object Notation (JSON), which is really a nested list of properties.  Python doesn't know this is JSON though, so we need to tell it.  We do this in the next cell by importing the `json` library, converting `r.text` to JSON using the load string method ( `.loads()` ), and then outputting that JSON with formatting using the output string method ( `.dumps()` ) with some options added for readability.

In [None]:
import json
parsed_r = json.loads(r.text)
print(json.dumps(parsed_r, indent=3, sort_keys=True))

This is much nicer to read, especially since the various components have been alphabetized.  Having the response text as JSON also allows us to easily access each subcomponent.  We show this in the next cell by printing the text (sometimes called the "body" of the tweet), the ID# of the tweet, and the screen name of the user.

In [None]:
print("Tweet Body: ",parsed_r['text'])
print("Tweet ID: ",parsed_r['id'])
print("Screen Name: ",parsed_r['user']['screen_name'])
print("Declared User Location: ", parsed_r['user']['location'])
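
Not every tweet carries every field, and a chain of square brackets raises a `KeyError` as soon as one level is missing.  A small helper along the following lines is one way to look up nested values safely.  It is only a sketch: `get_nested` is a name invented here, and `sample_tweet` is a hand-made stand-in for `parsed_r`, not real Twitter output.

```python
def get_nested(tweet, *keys, default=None):
    """Walk down nested dictionaries, returning `default` if any key is absent."""
    current = tweet
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hand-made stand-in for a parsed tweet (not real data).
sample_tweet = {
    'id': 1,
    'text': 'Hello from the sample tweet',
    'user': {'screen_name': 'example_user'},
}

print("Screen Name: ", get_nested(sample_tweet, 'user', 'screen_name'))
print("Declared User Location: ",
      get_nested(sample_tweet, 'user', 'location', default='(not given)'))
```

With a real response the call would look like `get_nested(parsed_r, 'user', 'location')`.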

Rather than parse the output to JSON every time, we can combine the Twitter Response Object's `.get_iterator()` method with a for-loop to do this directly.  It's less work overall and is cleaner.

In [None]:
r = api.request('statuses/show/:%d' % 210462857140252672)
for item in r.get_iterator():
    print("Tweet Body: ",item['text'])
    print("Tweet ID: ",item['id'])
    print("Screen Name: ",item['user']['screen_name'])

### Response Codes
As we move forward, eventually you're likely to end up with an error.  When these are related to our interaction with Twitter rather than a more local mistake then Twitter helpfully provides a code to help diagnose the problem.  A list of all these codes is [HERE](https://developer.twitter.com/en/docs/basics/response-codes.html).
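
As a rough illustration, the status code on a response can be checked before trying to read any tweets from it.  The sketch below covers only the codes that come up in this workbook (200 for success, 401 for an authentication problem, 429 for hitting a rate limit); `explain_status` is a name invented here and the wording of the messages is mine, not Twitter's.

```python
# Only the codes mentioned in this workbook; the full list is on Twitter's site.
KNOWN_CODES = {
    200: 'OK: the request succeeded.',
    401: 'Unauthorized: check your keys and tokens, and which oAuth method you ran.',
    429: 'Too many requests: a rate limit or quota has been reached.',
}


def explain_status(status_code):
    """Return a short, human-readable note for an HTTP status code."""
    return KNOWN_CODES.get(status_code,
                           'Code %d: see the response code list.' % status_code)


for code in (200, 401, 429, 503):
    print(code, '->', explain_status(code))
```

With a real response object this would be called as `explain_status(r.status_code)`.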

## Streaming

There are two approaches to collecting information from Twitter: grabbing tweets as they are published and searching through the archive of past tweets.  The first approach is known as "streaming" and we'll look at how to use it now.  It is important to note up front that the results returned using this method are incomplete: you will _not_ necessarily capture every single tweet that you intend to this way.  Still, you can get a lot of tweets in a very short time and likely enough to get you started (assuming your search terms are not overly restrictive).

Streaming amounts to applying a set of filters to the stream of all tweets and capturing what is matched by those filters.  For this first example we'll simply print the body of each tweet out on the screen.  There are lots of opinions about Donald Trump right now so we'll use 'trump' as the term we are tracking.

!!! IMPORTANT !!!

The cell below will run indefinitely.  You'll want to stop it at some point—likely after only a few seconds!—so be prepared to click the stop button at the top of this workbook once you have some tweets in the output cell.  If you leave the search term as 'trump' then you'll have enough tweets to prove that it works after about two seconds!

It is also possible that you'll hit a rate limit with the search term "trump".  If your request approaches a 1% sample of all tweets then Twitter will cut you off.  This is roughly 60 tweets per second, but it will vary. 

Lastly, note that the API only returns tweets inside a 10ms window per second.  You will not be getting everything.

!!! IMPORTANT !!!

In [None]:
TRACK_TERM = 'trump'

r = api.request('statuses/filter', {'track': TRACK_TERM})

for item in r.get_iterator():
    if 'text' in item:
        print(item['text'], item['id'])
    else:
        print(item)  # not a tweet (e.g. a limit notice), so show it as-is
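
If you would rather not rely on the stop button, one option is to stop reading after a fixed number of tweets.  The sketch below uses `itertools.islice` for this.  `take_tweets` is a helper name invented here, and a plain list stands in for the live stream so that the cell can be run without a connection.

```python
from itertools import islice


def take_tweets(items, limit):
    """Return an iterator over at most `limit` items that have a 'text' key."""
    tweets = (item for item in items if 'text' in item)
    return islice(tweets, limit)


# A plain list standing in for r.get_iterator(); the contents are invented.
fake_stream = [
    {'text': 'first', 'id': 1},
    {'limit': {'track': 5}},        # a non-tweet message, skipped by take_tweets
    {'text': 'second', 'id': 2},
    {'text': 'third', 'id': 3},
]

collected = list(take_tweets(fake_stream, 2))
for item in collected:
    print(item['text'], item['id'])
```

Against the real stream the loop would read `for item in take_tweets(r.get_iterator(), 20):`, which stops asking for more items once twenty tweets have been seen.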

## Write to File

As satisfying as it may be to have an endless stream of tweets scroll in front of you, there isn't much value in it unless you are able to capture the tweets for analysis in the future. If you really only need a few and only need them once then you can just copy and paste from the output above.  In most cases though you'll want a lot of tweets or want to collect them multiple times.  In such cases the ideal thing to do is write the tweets to a database because then you will have the search features of the database at your disposal.  We'll get to interacting with a database later in this workbook, but keep in mind that actually setting up that database is beyond the scope of this workbook (If you're not sure where to start, [MongoDB](https://www.mongodb.com/) is worth considering since its internal format is very similar to the JSON (JavaScript Object Notation) that tweets come in).

At this point we will simply write the text and ID number of each tweet to a file.  We do this using `with open...` because this method of opening a file will ensure that it will close properly if the program crashes/halts unexpectedly, something that is inevitable at this point because the only way we have to stop the code we are running right now is to interrupt it.  There is some casting of values to strings using the `str` function when writing to the output file because the ` .write() `  method requires a single string as an input.  Note as well the addition of the line break (`\n`) to ensure that each tweet body starts on a new line.

[Note that if you are a Windows user then you may have difficulty saving this file because of the character encoding used.  I'm working on finding a reliable fix for this.  Setting the encoding to utf-8, as is done below, seems to work in many but not all cases.]

In [None]:
TRACK_TERM = 'trump'

r = api.request('statuses/filter', {'track': TRACK_TERM})

with open("streamTweets.csv","a",encoding="utf-8") as outfile:
    for item in r.get_iterator():
        if 'text' in item:
            line = item['text'] + ',' + str(item['id'])
            print(line)
            outfile.write(line + '\n')
        else:
            # Not a tweet (e.g. a limit notice): show it but keep it out of the file
            print(item)

Looking at the output (there should be a file called "streamTweets.csv" in the same directory as this notebook) we can see that the body of each tweet is printed on its own line followed by a comma which is followed by the tweet ID... mostly.  Scrolling through the list will reveal that there are tweets that span multiple lines and blank spaces.  Why is this?  The body of some tweets includes line break characters (` \n `).

If we compare what was printed to the screen to what was written in the file we'll see the `\n`'s on the screen translated to blank lines in the file.

While there is an argument to be made that removing these characters makes no difference to the content of the tweet the counterargument is that line breaks are important punctuation and should be kept with the original tweet.  We will keep the line breaks.

A popular way to do this would be to use a regular expression and replace each `\n` that occurs with a `\\\\n` so that each `\` is appropriately escaped as it is passed from the variable to the file.  The problem with this method is that there are other characters that might appear as well (either in tweets or elsewhere) and while we could write a regular expression to do the substitution in each case Python offers a better way: the [representation function](https://docs.python.org/3.5/library/functions.html#repr).  To do this we pass the line variable to the function `repr` as we write it to the output file, as in the example below.

(Remember to stop the cell after a few seconds.)

In [None]:
TRACK_TERM = 'trump'

r = api.request('statuses/filter', {'track': TRACK_TERM})

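with open("streamTweets.csv","a",encoding="utf-8") as outfile:
    for item in r.get_iterator():
        if 'text' in item:
            line = item['text'] + ',' + str(item['id'])
            print(line)
            # repr() escapes line breaks so each tweet stays on one line of the file
            outfile.write(repr(line) + '\n')
        else:
            # Not a tweet (e.g. a limit notice): show it but keep it out of the file
            print(item)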

Checking streamTweets.csv shows that this approach is working well.  While we won't write to an output file in every example in the rest of this workbook keep in mind that you can use the same methods in all the examples that follow.
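
To see what `repr` is doing, and how to get the original text back when reading the file later, here is a small round trip on an invented tweet.  Using `ast.literal_eval` to reverse `repr` is my suggestion rather than something taken from the TwitterAPI examples.

```python
import ast

# An invented tweet body containing a line break, plus an invented ID.
text = "Pizza night!\nWho is in?"
line = text + ',' + str(12345)

encoded = repr(line)            # what gets written to the file
print(encoded)                  # the line break appears as the two characters \n

decoded = ast.literal_eval(encoded)        # what you do when reading it back
body, tweet_id = decoded.rsplit(',', 1)    # split on the LAST comma only
print(body)
print(tweet_id)
```

Splitting on the last comma matters because the body of a tweet can itself contain commas, while the ID never does.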

## Standard Search

The [Standard Search API](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets) allows for searching in the past 7 days.  It is rate limited to 180 requests per 15 minutes using oAuth1 and 450 requests per 15 minutes using oAuth2.  It is also "not exhaustive", meaning that the full body of tweets matching search criteria within the window is unlikely to be returned (except perhaps when the body of matching tweets is very small).

We'll shift away from the politically hot topic of Donald Trump to the topic of pizza for these next examples.

Note the use of the `.get_quota()` method on the response object in order to see how much of our quota remains.  That's right, there are quotas on the free account we are using.  If you're only looking back 7 days then you only have to worry about being rate limited.  If you're looking back farther then you can make up to 250 requests per month to the 30-Day API and up to 50 requests per month to the Full Archive.

In [None]:
SEARCH_TERM = 'pizza'

r = api.request('search/tweets', {'q': SEARCH_TERM})

for item in r.get_iterator():
    print(item['text'] if 'text' in item else item)

print('\nQUOTA: %s' % r.get_quota())

Seems to be working but there are not many tweets being returned.  We can increase this by specifying the `count` parameter.  We'll also add a simple counter just to see what we are actually getting.  

The set of all the parameters that can be invoked is available [HERE](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets).

In [None]:
SEARCH_TERM = '#pizza'
COUNT = 100 

r = api.request('search/tweets', {'q': SEARCH_TERM, 'count': COUNT})

a = 1
for item in r.get_iterator():
    print(a)
    print(item['text'] if 'text' in item else item)
    a=a+1

print('\nQUOTA: %s' % r.get_quota())

So, not 100 tweets but certainly more than before.  To go back further we'll need to look at paging.

## Paging 

Twitter returns results in chunks that are called "pages".  In the example above we are seeing just the first page of results and setting the `count` parameter to the maximum number of possible results returnable per page.  If you want to go back further then it becomes necessary to send multiple requests to the API in succession with each one asking for the next page.  While this can be implemented "by hand", TwitterAPI makes this much easier by providing a paging class called `TwitterPager` (you can read more about it [HERE](https://geduldig.github.io/TwitterAPI/paging.html)) that does all of the heavy lifting for you by tracking what page to ask for, ensuring the request rate is not too high, and generally managing the connection.  It is invoked in an almost identical way to everything you have seen so far in this workbook.

Unlike the previous examples where we were printing out the body of each tweet here we will print out only the date the tweet was created and the ID.  This is done simply because it makes it easier to track what is happening.

It is important to note that as with the streaming API endpoint you will need to stop the code at some point.  While you will eventually hit the 7-day limit it is unlikely that you want to wait that long for this toy example. 

In [None]:
from TwitterAPI import TwitterPager

SEARCH_TERM = 'pizza'
COUNT = 100

pager = TwitterPager(api, 'search/tweets', {'q': SEARCH_TERM, 'count': COUNT})

for item in pager.get_iterator():
    #print(item['text'] if 'text' in item else item)
    print((item['created_at'], item['id']) if 'text' in item else item)    

You'll note that as the code runs the dates and the tweet IDs both roll backwards as the search tool moves from the present (tweets become accessible via the search APIs about 30 seconds after they are created) to the past.
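
To make the "rolling backwards" idea concrete, here is a toy version of paging done by hand.  Nothing in it talks to Twitter: `fetch_page` is an invented stand-in for a single search request, and the list of IDs is made up.  The point is only the pattern of asking for each page with a maximum ID one below the oldest ID seen so far, which is, as far as I understand it, what paging through the standard search amounts to.

```python
def fetch_page(all_ids, count, max_id=None):
    """Stand-in for one search request: newest first, at most `count` IDs."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return sorted(eligible, reverse=True)[:count]


def page_backwards(all_ids, count):
    """Yield every ID, newest to oldest, one page at a time."""
    max_id = None
    while True:
        page = fetch_page(all_ids, count, max_id)
        if not page:
            break                  # nothing older is left
        for tweet_id in page:
            yield tweet_id
        max_id = page[-1] - 1      # next page starts just below the oldest ID seen


fake_archive = [101, 105, 110, 120, 130, 142, 150]   # invented tweet IDs
collected = list(page_backwards(fake_archive, count=3))
print(collected)
```

Three "requests" of up to three IDs each are enough to walk the whole toy archive, with a fourth, empty page signalling the end.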

The TwitterAPI documentation provides some [advanced ways to add fault tolerance](https://geduldig.github.io/TwitterAPI/faulttolerance.html).  These are fairly sophisticated and involve checking status codes and the like.  While valuable it is inevitable that you will end up halting your program for one reason or another and need it to restart scraping where it left off.  

The code below does this by first checking whether there is an object called "item" that has a value keyed to 'id' (there will be if an earlier run of the cell was interrupted).  If there is, it passes that ID, minus one, to TwitterPager as `max_id` so that every new tweet collected will be earlier than the last one seen.  (The similar-sounding `since_id` parameter does the opposite: it returns only tweets newer than the given ID.)  If the value does not exist then no `max_id` is sent and collection starts from the present.

This code will work as long as the notebook stays open, no matter how often the cell is interrupted.  If you close the notebook and reopen it then you'll need to pass in the ID value from the last line of the output file to restart in the correct location.  If you need a more sophisticated method for handling faults then follow the link above to the TwitterAPI documentation.

In [None]:
from TwitterAPI import TwitterPager

SEARCH_TERM = 'pizza'
COUNT = 100

params = {'q': SEARCH_TERM, 'count': COUNT}

try:
    # Continue backwards from the last tweet seen in an earlier, interrupted run
    params['max_id'] = item['id'] - 1
except (NameError, KeyError, TypeError):
    pass  # no earlier run in this session, so start from the present

pager = TwitterPager(api, 'search/tweets', params)

with open("restartTweetsTest.csv","a", encoding="utf-8") as outfile:
    for item in pager.get_iterator():
        if 'text' in item:
            line = item['text'] + ',' + str(item['id'])
            print(line)
            outfile.write(repr(line) + '\n')
        else:
            print(item)  # not a tweet (e.g. an error message)
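
For the case where the notebook has been closed and reopened, the ID on the last line of the output file can be recovered with a few lines of Python.  This is a sketch under the assumption that the file was written as in the cell above, that is, one `repr`-encoded line per tweet with the ID after the final comma.  `last_saved_id` is a name invented here, and a temporary file with made-up contents is used for the demonstration.

```python
import ast
import os
import tempfile


def last_saved_id(path):
    """Return the tweet ID from the last non-empty line of `path`, or None."""
    last = None
    with open(path, encoding='utf-8') as infile:
        for raw in infile:
            if raw.strip():
                last = raw.strip()
    if last is None:
        return None
    line = ast.literal_eval(last)        # undo repr()
    return int(line.rsplit(',', 1)[1])   # the ID follows the final comma


# Demonstration on a temporary file with invented contents.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, 'demo.csv')
    with open(demo_path, 'w', encoding='utf-8') as outfile:
        outfile.write(repr('first tweet,1002') + '\n')
        outfile.write(repr('second\ntweet, with a comma,1001') + '\n')
    resume_id = last_saved_id(demo_path)

print(resume_id)
```

The value returned by `last_saved_id("restartTweetsTest.csv")` can then be used as the starting ID when the collection cell is run again.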

## Premium Access

Access beyond stream filtering and searching imperfectly through the past 7 days requires some extra steps beyond simply making an app.

1. Setting up a dev environment.  Within the Twitter developer site click on your name in the top right corner.  From the menu select "Dev environments".  Follow the interface to create the environments that you would like and associate an app with each.
2. Note the name of each dev environment because it will go into one of the variables called `LABEL`, below.  I named my 30-Day development environment "30DayTesting" and my full archive development environment "fullArchiveTesting".  So in the 30 Day example I set `LABEL` to "30DayTesting" and in the Full Archive example I set `LABEL` to "fullArchiveTesting".

To see exactly what is available in the premium sandbox have a look at the overview [HERE](https://developer.twitter.com/en/docs/tweets/search/overview/premium.html) and the search guide [HERE](https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators).


## Rate Limits

The search of the 7-day archive that we have done and the searches of the 30-day archive and the full archive that we are about to do are all subject to rate limits.  These matter less for the 7-day archive because its quotas reset every 15 minutes, while there are quotas attached to the premium APIs that reset only once per month.  Failure to respect these will result in Twitter rejecting your search requests until your quota refreshes, typically with an error code of 429.

The limits for the Standard API can be seen [HERE](https://developer.twitter.com/en/docs/basics/rate-limits.html).  To see the current quota status for your account as it applies to accessing the 7-day archive you can run the code block immediately below [Note that this code block also demonstrates the basics of using the Python requests library to access the Twitter API rather than the TwitterAPI library.].  There is also the `.get_quota()` method within TwitterAPI that was used above, but this shows substantially less information [and it is not currently clear to me why it is different...].

To see the subscription usage for your premium account you can log into your developer account on the Twitter website, click on your name in the top right corner, and then choose "Subscriptions" (the direct link should be [https://developer.twitter.com/en/account/subscriptions](https://developer.twitter.com/en/account/subscriptions)).  This will give you a dashboard with content that looks like the following:

![](twitterSubscriptionDashboard.png)

Note that in this example the quota for the full archive has been exceeded!  Again, we are only using the sandbox account and it can be quite easy to go over the limits here.

In sum, the limits you will be held to will be as follows (unless you pay to upgrade):

* fullarchive: 50 searches/month and 1 million tweets returned
* 30day: 250 searches/month and 1 million tweets returned

You won't be able to hit 1 million tweets per month within the sandbox account given that your searches will be limited to 100 tweets per request.  This sets the total threshold for the full archive at 5,000 tweets and for the 30-day archive at 25,000 tweets.
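
The arithmetic behind those two ceilings is simply the number of requests allowed per month multiplied by the number of tweets per request, as the following few lines show using the figures given above.

```python
TWEETS_PER_REQUEST = 100                       # sandbox maximum per request
SANDBOX_REQUESTS_PER_MONTH = {'fullarchive': 50, '30day': 250}

monthly_ceiling = {
    product: n_requests * TWEETS_PER_REQUEST
    for product, n_requests in SANDBOX_REQUESTS_PER_MONTH.items()
}

for product, ceiling in monthly_ceiling.items():
    print('%s: at most %d tweets per month' % (product, ceiling))
```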

In [None]:
# How to do this came from https://stackoverflow.com/questions/33308634/how-to-perform-oauth-when-doing-twitter-scraping-with-python-requests

import requests
from requests_oauthlib import OAuth1
import json

url = 'https://api.twitter.com/1.1/account/verify_credentials.json'
auth = OAuth1(API_KEY, API_KEY_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
requests.get(url, auth=auth)

# using 'rr' as the variable to ensure that any work done above when the variable was 'r'
# isn't overwritten.
rr = requests.get('https://api.twitter.com/1.1/application/rate_limit_status.json?resources=help,users,search,statuses', auth=auth)

print(json.dumps(json.loads(rr.text), indent=3, sort_keys=True))
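
The JSON printed by the cell above is long, so it can help to boil it down to one line per endpoint.  The sketch below assumes the response has the shape I have seen from this endpoint (a top-level `resources` key, then a family such as `search`, then one entry per endpoint with `limit`, `remaining`, and `reset`).  `summarize_limits` is a name invented here and the numbers in `sample_status` are made up.

```python
def summarize_limits(status):
    """Return sorted (endpoint, remaining, limit) tuples from a rate limit status dict."""
    rows = []
    for family, endpoints in status.get('resources', {}).items():
        for endpoint, info in endpoints.items():
            rows.append((endpoint, info['remaining'], info['limit']))
    return sorted(rows)


# Hand-made example in the assumed shape; the numbers are invented.
sample_status = {
    'resources': {
        'search': {
            '/search/tweets': {'limit': 180, 'remaining': 172, 'reset': 1539800000},
        },
        'statuses': {
            '/statuses/show/:id': {'limit': 900, 'remaining': 900, 'reset': 1539800000},
        },
    }
}

for endpoint, remaining, limit in summarize_limits(sample_status):
    print('%-22s %4d of %4d requests left' % (endpoint, remaining, limit))
```

With the real response the call would be `summarize_limits(json.loads(rr.text))`.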

If you are being limited in your search results and running the cell above doesn't illuminate why this is so and neither does the subscription dashboard then running one of the following code cells might.  The first is a direct probe of the 30 Day endpoint and the second is a probe of the full archive.  

Remember that you need to run the appropriate authorization code at the top of this workbook!

In [None]:
r = requests.get('https://api.twitter.com/1.1/tweets/search/30day/30DayTesting.json?query=pizza', auth=auth)
print(json.dumps(json.loads(r.text), indent=3, sort_keys=True))

In [None]:
r = requests.get('https://api.twitter.com/1.1/tweets/search/fullarchive/fullArchiveTesting.json?query=physicalactivityandpregnancy', auth=auth)
print(json.dumps(json.loads(r.text), indent=3, sort_keys=True))

## 30 Day

In [None]:
from TwitterAPI import TwitterAPI

SEARCH_TERM = 'pizza'
PRODUCT = '30day'
LABEL = '30DayTesting'

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), 
                {'query':SEARCH_TERM})

for item in r:
    print(item['text'] if 'text' in item else item)

In [None]:
r.text #you can run this if you need to see what the text of the last item was for debugging

## Full Archive

In [None]:
from TwitterAPI import TwitterAPI

SEARCH_TERM = 'pizza'
PRODUCT = 'fullarchive'
LABEL = 'fullArchiveTesting'

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), 
                {'query':SEARCH_TERM})

for item in r:
    print(item['text'] if 'text' in item else item)

In [None]:
r.text #you can run this if you need to see what the text of the last item was for debugging

## Full Archive with Paging

Look at the premium search documentation and note that some of the features supported when searching the seven-day archive are not supported on the 30 Day or Full Archive API.  For example, specifying a maximum number of results to return or the ID to search since is not available (see [HERE](https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search)).

Note the changes to the code below which amalgamate both the full archive example just shown with the paging example from earlier in this workbook.

In [None]:
#Full Archive Tweets without full text
from TwitterAPI import TwitterAPI, TwitterPager

SEARCH_TERM = 'pizza'
PRODUCT = 'fullarchive'
LABEL = 'fullArchiveTesting'

# fromDate and toDate are UTC timestamps in the form YYYYMMDDHHmm (12 digits)
pager = TwitterPager(api, 'tweets/search/%s/:%s' % (PRODUCT, LABEL), {'query': SEARCH_TERM, 'fromDate': '201809170000','toDate':'201810170000'})

pagerObject =[]

#with open("restartTweetsTest.csv","a", encoding="utf-8") as outfile:
for item in pager.get_iterator(wait=30):
    if 'text' not in item:
        print(item)  # not a tweet (e.g. an error message), so don't store it
        continue
    pagerObject.append(item)
    line = item['text'] + '|' + str(item['id']) + '|' + str(item['created_at']) + '|' + str(item['user']['location']) + '|' + str(item['user']['name']) + '|' + str(item['user']['screen_name'])
    print(line)
    #outfile.write(repr(line) + '\n')
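
The `fromDate` and `toDate` values in the cell above are easy to mistype.  As far as I can tell from the premium search documentation, they are UTC timestamps written as `YYYYMMDDHHmm`, so one way to avoid counting digits is to build them from `datetime` objects.  `premium_timestamp` is a helper name invented for this sketch.

```python
from datetime import datetime


def premium_timestamp(moment):
    """Format a datetime as YYYYMMDDHHmm, the form used for fromDate and toDate."""
    return moment.strftime('%Y%m%d%H%M')


from_date = premium_timestamp(datetime(2018, 9, 17))
to_date = premium_timestamp(datetime(2018, 10, 17))
print(from_date, to_date)
```

The two strings can then be passed in place of the hand-typed values, for example `{'query': SEARCH_TERM, 'fromDate': from_date, 'toDate': to_date}`.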

If you have been tracking the text of the tweets you'll likely notice that many of them are incomplete, containing `...`.  This is a result of Twitter's new extended tweet mechanism that enables tweets of up to 280 characters instead of the original 140.  To get the full text of a tweet in every case you need to see if it is an extended tweet and, if it is, then look elsewhere in the tweet for the full text, as shown below.

In [None]:
#Full Archive Tweets with full text and tweet full information
from TwitterAPI import TwitterAPI, TwitterPager

SEARCH_TERM = 'pizza'
PRODUCT = 'fullarchive'
LABEL = 'fullArchiveTesting'

# Reuse the tweets gathered into pagerObject by the previous cell; requesting them
# again would spend more of the monthly full archive quota.
for item in pagerObject:
    if 'extended_tweet' in item:
        tweet_text = item['extended_tweet']['full_text']
    else:
        tweet_text = item['text']
    line = tweet_text + '|' + str(item['id']) + '|' + str(item['created_at']) + '|' + str(item['user']['location']) + '|' + str(item['user']['name']) + '|' + str(item['user']['screen_name'])
    print(line)

Note that if you have a retweet of an extended tweet then the only way to get the full content is to grab the original tweet.  This can be done using the information in the retweet but this is not currently covered here.
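
For what it is worth, here is one possible way to handle the retweet case.  It rests on an assumption that has not been tested in this workbook: that a retweet carries the original tweet under a `retweeted_status` key, and that the original in turn carries `extended_tweet` when it is long.  `full_text` is a name invented here, and the sample dictionaries are hand-made rather than real Twitter output.

```python
def full_text(item):
    """Return the fullest text available for a tweet or a retweet."""
    # Assumption: a retweet embeds the original tweet under 'retweeted_status'.
    source = item.get('retweeted_status', item)
    if 'extended_tweet' in source:
        return source['extended_tweet']['full_text']
    return source['text']


# Hand-made examples (not real data).
short_tweet = {'text': 'Short and sweet'}
long_tweet = {'text': 'A long tweet that is cut...',
              'extended_tweet': {'full_text': 'A long tweet that is cut off no more'}}
retweet = {'text': 'RT @someone: A long tweet that is cut...',
           'retweeted_status': long_tweet}

for example in (short_tweet, long_tweet, retweet):
    print(full_text(example))
```

Note that for a retweet this returns the text of the original tweet only, so the leading "RT @user:" is dropped; keep `item['text']` as well if that prefix matters to your analysis.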

## Under Development: MongoDB, Pymongo and Robo3T

[This section is incomplete.  Most of the code should work but the explanations are still being assembled.  As with the rest of the notebook, use at your own risk and under your own discretion.]

[MongoDB](https://www.mongodb.com/) is one option to consider for a database that can be used to hold a collection of tweets.  It is ideal for this because its internal structure for the items it holds (usually called "posts") follows the same JSON principles that Twitter uses for tweets.

The rest of this section assumes that you have installed MongoDB and that it is running.  Regardless of whether you are following along in Google Colab or on your own machine this is not likely to be the case and so we'll simply look at how this is done in principle.

In any case, this section is still in an alpha stage of development and is missing most of the documentation.

In [None]:
#run this if you need to install pymongo, the library that lets python interact with MongoDB

!pip install pymongo

In [None]:
import pymongo  #run this to open the pymongo library

In [None]:
from pymongo import MongoClient
client = MongoClient('localhost', 27017)

In [None]:
db = client.notebooktest

In [None]:
collection = db.notebooktest

## Moving Tweets into MongoDB

The value given to `posts` determines the collection that Tweets will be stored in.
Note that the search methods below can only be performed on one collection at a time. Therefore, in order to perform these methods on ALL Tweets collected, it is best to move all Tweets into one large collection and organize them by values afterwards (date, search term, etc.).

In [None]:
from TwitterAPI import TwitterAPI, TwitterPager

#api is assumed to be the authenticated TwitterAPI object created in an earlier cell
SEARCH_TERM = 'pizza'
PRODUCT = 'fullarchive'
LABEL = 'fullArchiveTesting'

#fromDate and toDate are 12-digit strings in YYYYMMDDhhmm format (UTC)
pager = TwitterPager(api, 'tweets/search/%s/:%s' % (PRODUCT, LABEL),
                     {'query': SEARCH_TERM, 'fromDate': '201809170000', 'toDate': '201810170000'})

#the collection the tweets will be stored in
posts = db.notebooktest
for item in pager.get_iterator():
    #extended tweets keep their full text in a separate field
    if 'extended_tweet' in item:
        tweet_text = item['extended_tweet']['full_text']
    else:
        tweet_text = item['text']

    post = {"text": tweet_text,
            "id": str(item['id']),
            "created_at": str(item['created_at']),
            "user_location": str(item['user']['location']),
            "user_name": str(item['user']['name']),
            "user_screen_name": str(item['user']['screen_name']),
            "search_term": "pregnancy_fitness"}  #label used later to group tweets by topic
    post_id = posts.insert_one(post).inserted_id
    print(post_id)
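
The `fromDate` and `toDate` values in the cell above are easy to mistype, since the premium search endpoints expect 12-digit `YYYYMMDDhhmm` strings in UTC.  One way to avoid counting digits by hand is to build them from `datetime` objects.  This is only a sketch, and the helper name `to_search_date` is made up here:

```python
from datetime import datetime


def to_search_date(dt):
    """Format a datetime as a YYYYMMDDhhmm string (the value is taken to be UTC)."""
    return dt.strftime('%Y%m%d%H%M')


from_date = to_search_date(datetime(2018, 9, 17))
to_date = to_search_date(datetime(2018, 10, 17))
print(from_date, to_date)  # 201809170000 201810170000
```

The two strings can then be passed as the `'fromDate'` and `'toDate'` entries of the search parameters.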

## Find One Method

The `find_one` method returns the first Tweet in a MongoDB collection that matches the given criteria, or `None` if nothing matches. This example matches on the user screen name. Note that it only searches the collection assigned to `posts` in the cell above.

In [None]:
import pprint

In [None]:
#find_one method
pprint.pprint(posts.find_one({"user_screen_name": "DiariesBump"}))
#The matching document has the shape below (plus the '_id' field that MongoDB adds):
#{'text': tweet_text,
# 'id': str(item['id']),
# 'created_at': str(item['created_at']),
# 'user_location': str(item['user']['location']),
# 'user_name': str(item['user']['name']),
# 'user_screen_name': str(item['user']['screen_name']),
# 'search_term': "pregnancy_fitness"}
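
Because `find_one` hands back `None` when nothing matches, it can help to wrap the lookup so that a miss is reported clearly.  The function below is a sketch only: the name `show_first_tweet_by` is made up, and `collection` is expected to be a pymongo collection such as `posts`.

```python
import pprint


def show_first_tweet_by(collection, screen_name):
    """Print and return the first stored tweet by screen_name, or None."""
    # The projection hides MongoDB's internal '_id' field in the result.
    post = collection.find_one({'user_screen_name': screen_name},
                               projection={'_id': 0})
    if post is None:
        print('No tweet stored for @' + screen_name)
    else:
        pprint.pprint(post)
    return post
```

With a running database this would be called as `show_first_tweet_by(posts, 'DiariesBump')`.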

## Find Method

The `find` method returns all the Tweets in a MongoDB collection that match the specified criteria; called with no criteria, as it is here, it returns every Tweet. This example loops over the results and displays the text of each Tweet containing "autumn". Again, this will only search in one collection.

In [None]:
#find() method
for post in posts.find():
    if "autumn" in post['text']:
        #pprint.pprint(post)
        print(post['text'])
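
The cell above fetches every Tweet and filters them in Python.  As an alternative sketch, the filtering can be left to MongoDB by passing a `$regex` filter to `find`.  The helper name `text_contains` is made up, and unlike the loop above this version ignores upper and lower case:

```python
import re


def text_contains(word):
    """Build a MongoDB filter matching documents whose 'text' contains word."""
    # re.escape keeps characters such as '.' or '?' from being read as
    # regular expression syntax; the 'i' option ignores case.
    return {'text': {'$regex': re.escape(word), '$options': 'i'}}


query = text_contains('autumn')
print(query)  # {'text': {'$regex': 'autumn', '$options': 'i'}}
```

With a running database the filter would be used as `for post in posts.find(text_contains('autumn')): print(post['text'])`.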
    

# Moving Tweets into one collection, removing duplicates and converting dates to pymongo format

The following cells move all Tweets into one collection, with duplicates removed and dates converted to "pymongo format", that is, Python `datetime` objects, which pymongo stores as native MongoDB dates.

## List the collections in the database

In [None]:
db.list_collection_names()

## Import datetime

The `datetime` module is part of the Python standard library, so there is nothing to install with pip; it only needs to be imported.

In [None]:
from datetime import datetime

## Move all Tweets in one collection

In [None]:
#Collecting the tweet IDs into a set
tweetIDs = set()
for doc in posts.find():
    #print(doc)
    tweetIDs.add(doc['id'])

In [None]:
#Putting tweets into one collection with dates in pymongo format and duplicates removed
#created_at_num is parsed from each post's own created_at string
tweetIDs = set()
posts = db.alltweets
for col in db.list_collection_names():
    if col == 'alltweets':
        continue  #do not copy the destination collection into itself
    for post in db[col].find():
        if post['id'] not in tweetIDs:
            tweetIDs.add(post['id'])
            created_at_num = post['created_at']
            postb = {"text": post['text'],
                     "id": post['id'],
                     "created_at": post['created_at'],
                     "created_at_num": datetime.strptime(created_at_num, '%a %b %d %H:%M:%S +0000 %Y'),
                     "user_location": post['user_location'],
                     "user_name": post['user_name'],
                     "user_screen_name": post['user_screen_name'],
                     "search_term": post['search_term']}
            post_id = posts.insert_one(postb).inserted_id
            #print(post_id)
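
The de-duplication and date conversion can also be tried out without a database.  The sketch below applies the same idea to plain dictionaries; the function name `unique_posts` and the sample data are made up for illustration.

```python
from datetime import datetime

TWITTER_DATE_FORMAT = '%a %b %d %H:%M:%S +0000 %Y'


def unique_posts(posts_iter):
    """Yield each post once (by tweet 'id'), adding a parsed 'created_at_num'."""
    seen = set()
    for post in posts_iter:
        if post['id'] in seen:
            continue  # duplicate tweet, skip it
        seen.add(post['id'])
        merged = dict(post)
        merged.pop('_id', None)  # let MongoDB assign a fresh _id on insert
        merged['created_at_num'] = datetime.strptime(post['created_at'],
                                                     TWITTER_DATE_FORMAT)
        yield merged


# Hypothetical stored posts, one of them duplicated.
sample = [
    {'id': '1', 'created_at': 'Tue Oct 16 23:30:11 +0000 2018', 'text': 'a'},
    {'id': '1', 'created_at': 'Tue Oct 16 23:30:11 +0000 2018', 'text': 'a'},
    {'id': '2', 'created_at': 'Mon Sep 17 08:00:00 +0000 2018', 'text': 'b'},
]
for post in unique_posts(sample):
    print(post['id'], post['created_at_num'])
```

Each dictionary produced this way could then be passed to `insert_one` on the destination collection.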

In [None]:
#New number of Tweets with duplicates removed
posts.count_documents({})

## Find Tweets by search term

In [None]:
search_test = posts.find({"search_term": "pregnancy_fitness"})  #must match the search_term label stored with the Tweets

In [None]:
for doc in search_test:
    print(doc['id'])

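## Converting dates to pymongo format
These cells were used above to convert the dates to pymongo format. The `datetime` module is part of the Python standard library, so it only needs to be imported, not installed.

In [None]:
from datetime import datetime

In [None]:
date = '2018-09-17'  #example date string in YYYY-MM-DD format
datetime.strptime(date, "%Y-%m-%d")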

In [None]:
from datetime import datetime
created_at = 'Tue Oct 16 23:30:11 +0000 2018' # Get this string from the Twitter API
dt = datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')
print(dt)
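
The format string above matches `+0000` as literal text, so the result carries no time zone information.  As a small variation, `%z` parses the offset and gives a timezone-aware `datetime`.  As far as I know pymongo converts aware datetimes to UTC when saving them, but treat that as something to verify on your own setup.

```python
from datetime import datetime

created_at = 'Tue Oct 16 23:30:11 +0000 2018'
# %z parses the '+0000' offset instead of matching it as literal text,
# so the result is a timezone-aware datetime.
aware = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
print(aware)              # 2018-10-16 23:30:11+00:00
print(aware.isoformat())  # 2018-10-16T23:30:11+00:00
```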

# Attempts to organize by dates below

We did not use the notebook beyond this point. Because of time constraints on my summer project, and because the remainder of the project could be done manually, all the Tweets were moved into Excel.

In [None]:
from datetime import timedelta

In [None]:
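#Trying to get tweets filtered by date
#the cutoff is 31 days after 17 September 2018
pre_guideline_tweets = datetime(2018, 9, 17) + timedelta(days=31)
date_query = {'created_at_num': {'$gte': pre_guideline_tweets}}
tweets = posts.find(date_query)

In [None]:
#count_documents takes the same filter as find (cursors in current pymongo have no count method)
posts.count_documents(date_query)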

In [None]:
start = datetime.fromisoformat('2018-09-17')
end = datetime.fromisoformat('2018-10-17')

#Tweets created on or after start and before end
for post in db.alltweets.find({'created_at_num': {'$gte': start, '$lt': end}}):
    print(post)
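
For a date range query to return anything, the lower bound has to go with `$gte` and the upper bound with `$lt`, and the field name has to match the one stored with the Tweets (`created_at_num` in this notebook).  A small helper can guard against mixing these up.  This is a sketch, and the name `date_range_filter` is made up:

```python
from datetime import datetime


def date_range_filter(start, end, field='created_at_num'):
    """Build a MongoDB filter for start <= field < end."""
    if start >= end:
        raise ValueError('start must be earlier than end')
    return {field: {'$gte': start, '$lt': end}}


window = date_range_filter(datetime(2018, 9, 17), datetime(2018, 10, 17))
print(window)
```

With a running database the filter would be used as `db.alltweets.find(window).sort('created_at_num', 1)` to get the Tweets in the window, oldest first.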

In [None]:
print(post)

In [None]:
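from datetime import datetime, timedelta

#Tweets created later than 31 days before 2018-09-17 04:36 UTC
date_query = {'created_at_num': {'$gt': datetime(2018, 9, 17, 4, 36) - timedelta(days=31)}}
date_search = db.alltweets.find(date_query)

In [None]:
db.alltweets.count_documents(date_query)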