# Twitter Scraping Workshop
This notebook was written by Vicky Lin and is meant to be used in conjunction with the Twitter Scraping workshop on March 9, 2020. Social media data collection and analysis are still fairly new and do not have a streamlined process, which makes them difficult and time consuming. For this reason, this workbook uses a Python library built by a third party to help with the process. This notebook draws from John Simpson's [Introduction to Twitter Scraping for Researchers](https://github.com/ualberta-rcg/twitter_scraping). Special thanks to Lisa Strohschein, John Simpson, Victoria Romanik, Anthony Jehn, and the University of Alberta Department of Sociology.

On the [GitHub page](https://github.com/vlin-1/twitter-workshop) for this workshop, there is a Resources folder with extra resources and information on Python and Twitter. If you encounter further trouble after this workshop, this folder may be a good place to start.

The Python library used in this notebook is called [TwitterAPI](https://github.com/geduldig/TwitterAPI) (no spaces).

All the blocks of code in Google CoLab can be run by pressing the "play" button next to the block of code. If successful, inside the square brackets will be a number, indicating the order in which you have run each block of code in this notebook. This can be helpful if you are encountering errors and are not sure where you may have gone wrong.

## A Brief Introduction to Python
Python is a programming language, and it will be used for the Twitter scraping workshop. This notebook will take you through a brief introduction to Python, assuming that you have no prior knowledge of it.

Just like in R, I recommend you use `#` to leave comments in your code and make notes to yourself. When Python reads a line of code, the `#` symbol tells it, "Stop reading!" **Be aware that any code after a `#` will not be read.** The one exception is a `#` inside quotation marks, such as `'#coronavirus'`, which is treated as ordinary text.
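As a small illustration of the point above, the sketch below (the variable name is made up for this example) shows a full-line comment, an end-of-line comment, a line of code switched off by a leading `#`, and a `#` inside a string.

```python
# This whole line is a comment, so Python skips it.
tweet_count = 10  # everything after the # on this line is ignored too
# tweet_count = 99  <- this assignment never runs because it is commented out
hashtag = '#coronavirus'  # the # inside the quotes is part of the string

print(tweet_count)
print(hashtag)
```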

### Variables
**Variables** are assigned using the `=` symbol. Numbers, such as integers, can be written on their own. A **string** is a piece of text, and it needs either single quotes or double quotes around it.

In [None]:
age = 42
first_name = 'Ahmed'
print(first_name, 'is', age, 'years old')

If you aren't sure what type of data your variable is, use the `type()` function to find out.

In [None]:
print(type(age))
print(type(first_name))

As seen above, the `print()` function displays whatever value or variable you pass to it. Some cells of code won't have an explicit output, so `print()` can be helpful for checking that your code did what you wanted.

### Indenting
In Python, leading whitespace (spaces and tabs) is used to determine the grouping of statements. Lines of code that are flush left are executed one after another, each as its own statement. An indented line belongs to the statement above it (for example, the body of a loop or an `if`), which is how multi-line blocks are formed. Separately, line continuation is implied inside parentheses `()`, brackets `[]`, and braces `{}`, so a single statement can be split across several lines there. Indents make code easier to read, but they also cause errors when used incorrectly or when there is whitespace where there should not be.

This:

In [None]:
a = (1 + 2 + 3 + 4)
print(a)

is the same as:

In [None]:
a = (1 + 2 +
    3 + 4)
print(a)
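The two cells above show line continuation. Indentation used for grouping looks a little different, and a minimal sketch (with made-up numbers and variable names) might be:

```python
tweet_lengths = [42, 280, 137]  # made-up numbers for illustration
long_tweets = []

for length in tweet_lengths:
    # These indented lines belong to the for loop and run once per item.
    if length > 140:
        # Indented once more, so this belongs to the if statement.
        long_tweets.append(length)
        print(length, 'is a long tweet')
    else:
        print(length, 'is a short tweet')

# Flush left again: this runs once, after the loop has finished.
print('Checked', len(tweet_lengths), 'tweets')
```

Removing the indent from any of the lines inside the loop would either change what the code does or cause an `IndentationError`.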

### Libraries
There are a multitude of Python libraries that you may want to take advantage of. A **library** is a collection of modules that contain functions for use by other programs. There is a huge variety of libraries in Python to explore, but for this workshop we'll focus on the one that we'll be taking advantage of: TwitterAPI.

**You have to import a library module before using it.**
- Use `import` to load a library module into a program's memory.
- Then refer to things from the module as `module_name.thing_name`.
    - Python uses `.` to mean "part of."
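To see the `module_name.thing_name` pattern without installing anything, here is a short sketch that uses `math`, a module that ships with Python (it is not part of this workshop's toolkit and is used only for illustration):

```python
import math  # load the standard library's math module

# module_name.thing_name: sqrt and pi are both "part of" math
root = math.sqrt(16)
print(root)
print(math.pi)
```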

In [None]:
# install before importing
!pip install TwitterAPI
import TwitterAPI

Use `help` to learn about the content of a library module.

In [None]:
help(TwitterAPI)

## The Real Question... What Can You Do With Twitter Data?
For the next blocks of code, you will need to upload some data into Google CoLab. The data sets should have already been emailed to you. On the upper left hand side, there is a small folder icon. Click it, then drag both .csv files into the panel that opens. Alternatively, you can click the "upload" button and upload them that way!

Run the next blocks of code to see what you can achieve with Twitter data!

In [None]:
!pip install matplotlib plotly pandas numpy cufflinks # datetime ships with Python, so it does not need installing
import plotly.offline as py
import pandas as pd
import plotly.graph_objs as go
from datetime import datetime
py.init_notebook_mode(connected=False)
import sys
def enable_plotly_in_cell():
    import IPython
    from plotly.offline import init_notebook_mode
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
    '''))
    init_notebook_mode(connected=False)
if 'google.colab' in sys.modules:
    get_ipython().events.register('pre_run_cell', enable_plotly_in_cell)

In [None]:
df1 = pd.read_csv('tweets_clean.csv') # change dates
df1['created_at'] = pd.to_datetime(df1['created_at'])
df1 = df1.set_index('created_at')
df1['day'] = df1.index.date
dates1 = df1['day'].value_counts().keys().tolist()
counts1 = df1['day'].value_counts().sort_index().tolist()

df2 = pd.read_csv('fulltweets_clean.csv') # do the same with the second data set
df2['created_at'] = pd.to_datetime(df2['created_at'])
df2 = df2.set_index('created_at')
df2['day'] = df2.index.date
dates2 = df2['day'].value_counts().keys().tolist()
counts2 = df2['day'].value_counts().sort_index().tolist()
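The cell above boils each data set down to two lists: the days on which tweets were posted and the number of tweets on each day. As a rough sketch of the same idea, the block below does the counting in plain Python on a handful of made-up timestamps; it is not the pandas code from the cell, only an illustration of what that code computes.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps standing in for a 'created_at' column.
created_at = [
    '2019-12-18 09:15:00',
    '2019-12-18 17:40:00',
    '2019-12-19 08:05:00',
    '2019-12-21 12:30:00',
    '2019-12-21 12:45:00',
    '2019-12-21 23:59:00',
]

# Keep only the calendar day of each timestamp, then count the days.
days = [datetime.strptime(ts, '%Y-%m-%d %H:%M:%S').date() for ts in created_at]
per_day = Counter(days)

# Sort by date so the x values (dates) and y values (counts) line up.
dates = sorted(per_day)
counts = [per_day[d] for d in dates]

for d, c in zip(dates, counts):
    print(d, c)
```

Sorting the dates and reading the counts in that same order is what keeps each point on the line chart paired with the right day.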

In [None]:
# plot
trace = go.Scatter(
    x = sorted(dates1),
    y = counts1,
    mode = 'lines',
    name = 'With Retweets')
trace1 = go.Scatter(
    x = sorted(dates2),
    y = counts2,
    mode = 'lines',
    name = 'No Retweets')
data = [trace, trace1]
layout = dict(title = 'Use of \'#coronavirus\' on Twitter, Mid-December to Mid-January',
              title_x = 0.45,
              xaxis = dict(title = 'Date'),
              yaxis = dict(title = '# of Tweets'))
fig = dict(data=data, layout=layout)
py.iplot(fig)

### That Looks Complicated
And it is! The code above is the last step in the Twitter scraping process. While we won't get that far today, we're going to lay out the foundation for Twitter scraping, mainly what information can be parsed from a tweet and how to use TwitterAPI's search endpoints.

## Installation and Authentication
Let's start by installing the TwitterAPI library (if you haven't already).

In [None]:
!pip install TwitterAPI

Once the TwitterAPI library is installed, we should be able to open it to use throughout the workbook with the following command:

In [None]:
from TwitterAPI import TwitterAPI

#### A Note on Authentication
Taking advantage of Twitter Developer API and this notebook requires a Twitter Developer account. A Twitter Developer account has to be requested and the process may take a few days. Applying for a Twitter Developer account requires a regular Twitter account, and can be done here: https://developer.twitter.com/en/apply/user

There are two types of authentication that you may want to take advantage of. Here are the main differences:

You will need oAuth1 (User authentication) for the following:
- Post Tweets or other resources;
- Connect to Streaming endpoints;
- Search for users;
- Use any geo endpoints;
- Access Direct Messages or account credentials;
- Retrieve users' email addresses

You can use oAuth2 (Application-only authentication) for the following:
- Pull user timelines;
- Access friends and followers of any account;
- Access lists resources;
- Search in Tweets;
- Retrieve any user information, excluding the user's email address

You can use either of these in this notebook, but be aware of which one you will be using and what it can do. Both authentication methods require some information about keys and tokens pasted into the appropriate section of the cell below. This key and token information is generated when you create a profile for an app on the Twitter Developer site.

Paste the required key and token information from the Twitter Developer site into the cell below and run it. You'll need to run the cell below to load your credentials and only _one_ of the authorization methods that follow.

In [None]:
API_KEY = 'YOUR_API_KEY' # put your own keys here
API_KEY_SECRET = 'YOUR_API_KEY_SECRET'
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'
ACCESS_TOKEN_SECRET = 'YOUR_ACCESS_TOKEN_SECRET'
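Keys typed directly into a notebook are easy to share by accident. One common alternative, sketched here under the assumption that you have stored your keys in environment variables, is to read them with `os.environ`. The four variable names below are hypothetical; use whatever names you chose when you set them.

```python
import os

# Hypothetical environment variable names chosen for this example.
NAMES = ['TWITTER_API_KEY', 'TWITTER_API_KEY_SECRET',
         'TWITTER_ACCESS_TOKEN', 'TWITTER_ACCESS_TOKEN_SECRET']

# os.environ.get returns None when a variable has not been set.
credentials = {name: os.environ.get(name) for name in NAMES}

missing = [name for name, value in credentials.items() if not value]
if missing:
    print('Not set yet:', ', '.join(missing))
else:
    print('All four credentials were found.')
```

The values in `credentials` could then be passed to `TwitterAPI` in place of the pasted strings.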

**Only run ONE of the two blocks of code below**

### oAuth1 (User Identification)

In [None]:
api = TwitterAPI(API_KEY,
                API_KEY_SECRET,
                ACCESS_TOKEN,
                ACCESS_TOKEN_SECRET)
api.auth

If successful, the output should look something like this:
`<requests_oauthlib.oauth1_auth.OAuth1 at 0x107b8bba8>`

### oAuth2 (App Identification)

In [None]:
api = TwitterAPI(API_KEY,
                API_KEY_SECRET,
                auth_type='oAuth2')
api.auth

If successful, the output should look something like this:
`<TwitterAPI.BearerAuth.BearerAuth at 0x107b9acc0>`

The rest of this notebook assumes that the credentials cell and one of the two authentication cells above have been run successfully. Opening this workbook without running them will likely cause errors later on. If you are encountering errors, you may need to run them again.

## What is a tweet, and what is so great about them?
A "tweet" is so much more than just the text that someone has put out into the Twitterverse. All bits and pieces that you can and cannot see can be retrieved, but how do you find out what is there in the first place?

Let's find out the metadata attached to a tweet. For simplicity's sake, we'll only request a single tweet by its ID number. Each tweet has a unique ID number and any information relating to its metadata can be retrieved using its ID. Neat!

In [None]:
r = api.request('statuses/show/:%d' % 1207238508731150336)
print(r.text)

Our request (`r`) returns a bundle of information that isn't quite relevant to us. As you may have guessed, we only want the text portion of what gets returned from our request (`r.text`). But this is still so confusing to read!

The content that we want is in JavaScript Object Notation (JSON), which is really a nested list of properties. This is how we'll be able to read the tweet with some more clarity! Python doesn't know this, so we have to tell it by importing the `json` library. To convert this text to JSON, we use the load string method (`.loads()`), and then output it using the output string method (`.dumps()`) with some options for extra readability.

In [None]:
import json

parsed_r = json.loads(r.text)
print(json.dumps(parsed_r, indent=3, sort_keys=True))
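If you do not have credentials yet, the same two `json` calls can be tried on a made-up, heavily shortened stand-in for `r.text`. The field values here are invented for illustration.

```python
import json

# A made-up, shortened stand-in for the text of a real response.
raw_text = '{"id": 123, "text": "Hello, Twitterverse!", "user": {"screen_name": "example_user"}}'

parsed = json.loads(raw_text)  # string -> Python dictionary
print(type(parsed))
print(parsed['user']['screen_name'])  # nested values are reached one key at a time

# dictionary -> indented string, with keys in alphabetical order
pretty = json.dumps(parsed, indent=3, sort_keys=True)
print(pretty)
```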

Wow, look at all those items! Some of these may be of interest, some of them may not be. We can pick and choose which items we want using the Twitter Response Object's `.get_iterator()` method instead of parsing the output to JSON every time.

In [None]:
r = api.request('statuses/show/:%d' % 1207238508731150336)
for item in r.get_iterator():
    print("Tweet Body: ",item['text'])
    print("Tweet ID: ",item['id'])
    print("Screen Name: ",item['user']['screen_name'])
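Square-bracket lookups such as `item['text']` raise a `KeyError` if the key is not there. As a hedged sketch, assuming you would rather get a placeholder than an error, a dictionary's `.get()` method can be used instead. The sample tweet and the `describe` helper below are made up for this example.

```python
# A made-up tweet dictionary with only a few of the real fields.
sample_tweet = {
    'id': 123,
    'text': 'Hello, Twitterverse!',
    'user': {'screen_name': 'example_user'},
}

def describe(tweet):
    """Build a one-line summary, tolerating missing keys."""
    user = tweet.get('user', {})
    return '{} | {} | {}'.format(
        tweet.get('id', 'unknown id'),
        user.get('screen_name', 'unknown user'),
        tweet.get('text', ''),
    )

print(describe(sample_tweet))
print(describe({'id': 456}))  # no text or user, but no KeyError either
```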

## Standard Search
The Standard Search API allows for searching in the past 7 days. It is rate limited to 180 requests per 15 minutes using oAuth1 and 450 requests per 15 minutes using oAuth2. It is also "not exhaustive", meaning that the full body of tweets matching the search criteria within the window is unlikely to be returned (unless, perhaps, the body of tweets is very small).

We are using these search endpoints for free, so it is worthwhile to use the `.get_quota()` method on the response object in order to see how much of our quota remains. You don't really have to worry about this if you're only looking at the past 7 days. If you want to search past that, you'll have to use the premium endpoints with the 30 Day Archive and the Full Archive, which we are not going to cover today.

In [None]:
SEARCH_TERM = '#coronavirus'

r = api.request('search/tweets', {'q': SEARCH_TERM})

for item in r.get_iterator():
    print(item['text'])

print('\nQUOTA: %s' % r.get_quota())

It works, but we aren't getting too much back. You may have also noticed that the text gets truncated too, and we definitely do not want that. If you have ever used social media, you will know that people often write text spanning multiple lines and leaving white space. Let's fix all that by changing our parameters and wrapping our text.

In [None]:
SEARCH_TERM = '#coronavirus'
COUNT = 100 # increases the amount of tweets getting returned
MODE = 'extended' # untruncates the tweet text

r = api.request('search/tweets', {'q': SEARCH_TERM, 
                                  'count': COUNT, 
                                  'tweet_mode': MODE})

for item in r.get_iterator():
    tweet_text = repr(item['full_text']) # wraps the tweet text
    print(tweet_text + '|' + str(item['id']))

You might notice that in the last line, we are using `|` as a delimiter instead of a comma. This is purely because the text in a tweet will often contain commas and we want to reduce any possible confusion in our output.
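To see what `repr()` does to a multi-line tweet, here is a small sketch with a made-up piece of text and a made-up ID:

```python
tweet_text = 'First line\nSecond line,\twith a tab'

print(tweet_text)        # the newline and tab are printed as real whitespace
print(repr(tweet_text))  # one line, in quotes, with the newline and tab shown as escapes

# The same pattern as the cell above: text, delimiter, ID.
line = repr(tweet_text) + '|' + str(123)
print(line)
```

Because the newline is turned into visible characters, each tweet stays on a single line of output.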

Twitter's API offers several parameters that we can take advantage of; a full list of these parameters can be found [HERE](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets). Try adjusting them yourself!
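Putting the `count` parameter together with the rate limits quoted at the start of this section gives a rough ceiling on how many tweets can be requested. The arithmetic below is only a back-of-the-envelope sketch: it assumes every request returns a full 100 tweets, which real searches often will not.

```python
MAX_COUNT = 100      # the 'count' value used in this notebook
WINDOW_MINUTES = 15  # length of one rate-limit window

# Requests allowed per window, as quoted earlier in this section.
limits = {'oAuth1': 180, 'oAuth2': 450}

ceiling = {}
for auth_type, requests_per_window in limits.items():
    ceiling[auth_type] = requests_per_window * MAX_COUNT
    per_hour = ceiling[auth_type] * (60 // WINDOW_MINUTES)
    print(auth_type, 'at most', ceiling[auth_type], 'tweets per window,',
          per_hour, 'per hour')
```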

### A note on writing to file
What good is searching for tweets if you can't save them for analysis? If you really only need a few tweets, you can copy and paste the output, but it is much more likely that you'll want or have many more, or that you'll want to keep collecting multiple times. To do this, you'll want to interact with a database. Unfortunately, setting up a database is way beyond what we're going to cover in this workshop. For now, we will save the text and ID number of each tweet to a file.

In [None]:
SEARCH_TERM = '#coronavirus'
COUNT = 100
MODE = 'extended'

r = api.request('search/tweets', {'q': SEARCH_TERM, 
                                  'count': COUNT, 
                                  'tweet_mode': MODE}) # 'tweet_mode', as in the earlier cell

# "a" appends, so running this cell again adds to the existing file
with open("searchTweets.csv","a", encoding="utf-8") as outfile:
    for item in r.get_iterator():
        if 'full_text' in item: # skip anything that is not a tweet
            tweet_text = repr(item['full_text'])
            line = tweet_text + '|' + str(item['id'])
            #print(line)
            outfile.write(line + '\n')

You will be able to check the output file by downloading it from the same tab that you uploaded your data initially! I won't continue writing to file for the rest of this notebook, but you can use the same methods in the examples that follow.
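Joining text and ID with `|` by hand works until a tweet itself contains a `|`. As an alternative sketch (not the approach used in the cell above), Python's built-in `csv` module can write pipe-delimited rows and will add quotes around any field that needs them. The rows below are made up, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Made-up rows: (tweet text, tweet ID)
rows = [
    ('Stay safe, everyone | wash your hands', 123),
    ('Two lines\nof text, with a comma', 456),
]

buffer = io.StringIO(newline='')  # stands in for open(..., newline='')
writer = csv.writer(buffer, delimiter='|')
writer.writerows(rows)

# Read the same data back to confirm nothing was split in the wrong place.
buffer.seek(0)
read_back = list(csv.reader(buffer, delimiter='|'))
for text, tweet_id in read_back:
    print(repr(text), tweet_id)
```

Note that everything read back from a CSV file is a string, including the IDs.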

### Paging
Twitter returns results in chunks called "pages." In the previous examples, we only get the first page of results. TwitterAPI has a handy paging class called `TwitterPager` that sends multiple requests to the API in succession, with each one asking for the next page.

To make it easier to keep track of what is happening, we will print only the date the tweet was created and the tweet's ID.

With the high volume of tweets related to our search term, you will need to stop the code at some point (by pressing the 'stop' button). While you will eventually reach the 7 day limit, you won't want to wait that long for this example.

In [None]:
from TwitterAPI import TwitterPager

SEARCH_TERM = '#coronavirus'
COUNT = 100

pager = TwitterPager(api, 'search/tweets', {'q': SEARCH_TERM, 
                                            'count': COUNT})

for item in pager.get_iterator():
    print(item['created_at'], item['id'])

You'll notice that the search rolls backwards from the present to the past.
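The backwards walk can be pictured with a toy model. The sketch below does not call Twitter and is not how `TwitterPager` is necessarily written internally; `fetch_page` is a hypothetical stand-in for a single request, and the IDs are made up. Each "request" asks only for items older than the oldest one already seen.

```python
# Made-up tweet IDs, newest first, standing in for everything that matches a search.
ALL_IDS = list(range(1010, 1000, -1))  # 1010, 1009, ..., 1001

def fetch_page(max_id=None, count=4):
    """Hypothetical stand-in for one request: up to `count` IDs no newer than max_id."""
    matching = [i for i in ALL_IDS if max_id is None or i <= max_id]
    return matching[:count]

collected = []
pages = 0
max_id = None
while True:
    page = fetch_page(max_id)
    if not page:
        break                  # nothing older is left
    pages += 1
    collected.extend(page)
    max_id = page[-1] - 1      # next request: only items older than the oldest one seen

print(pages, 'pages:', collected)
```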

It is inevitable that your code will end up stopping for one reason or another and you'll need to pick up scraping where your program left off. To do this, the code below checks whether there is an object called `item` that has a value keyed to `'id'`. If there is, it captures this ID and passes it to `TwitterPager` as `since_id`, so that all new tweets collected will be after it (that is, more recent). If there is not, a hard-coded fallback ID is used as the starting point instead; replace it with the ID you want to start from.

This code will work as long as the notebook stays open, no matter how often the cell is interrupted. If you close the notebook and reopen it then you'll need to pass in the ID value from the last line of the output file to restart in the correct location.

In [None]:
from TwitterAPI import TwitterPager

SEARCH_TERM = '#coronavirus'
COUNT = 100

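try:
    SINCE_ID = item['id'] # reuse the last tweet ID seen in this session
except (NameError, KeyError, TypeError): # no usable item yet
    SINCE_ID = '1232834029134675968' # fallback ID; replace with your own starting point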

pager = TwitterPager(api, 'search/tweets', {'q': SEARCH_TERM, 'count': COUNT,'since_id':SINCE_ID})

for item in pager.get_iterator():
    tweet_text = repr(item['text'])
    print(item['created_at'], str(item['id']))
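If the notebook has been closed, the ID has to come from the output file instead. Here is a minimal sketch, assuming the file uses the `text|id` layout written earlier. The `last_saved_id` helper is made up for this example, and a throwaway file stands in for `searchTweets.csv`.

```python
import os
import tempfile

def last_saved_id(path):
    """Return the ID after the final '|' on the last non-empty line, or None."""
    last_line = None
    with open(path, encoding='utf-8') as infile:
        for line in infile:
            if line.strip():
                last_line = line.strip()
    if last_line is None:
        return None
    # Split from the right, because the tweet text itself may contain '|'.
    return last_line.rsplit('|', 1)[-1]

# Demonstration with a throwaway file in the same text|id layout.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as demo:
    demo.write("'first tweet'|1001\n'second | tweet'|1002\n")
demo_path = demo.name

recovered_id = last_saved_id(demo_path)
print(recovered_id)
os.remove(demo_path)  # clean up the throwaway file
```

The recovered value could then be used as the fallback `SINCE_ID` in the cell above.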

## Summary
Today, we looked at:
- Using Google CoLab and uploading your own data set
- A couple of Python basics, including assigning variables and indenting
- Installing and importing Python libraries
- Types of Twitter authentication
- What a tweet is
- Using json to parse relevant information from tweets
- Doing a standard search with Twitter's API
- Some parameters for standard search
- Paging with standard search

This is just the tip of the iceberg! There is so much to Twitter scraping that we did not have a chance to go over today, but feel free to check out the documentation on premium search endpoints [HERE](https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search)!
If you're eager to scrape a lot of tweets into a database, I would suggest looking into MongoDB. Its documentation can be found [HERE](https://docs.mongodb.com/manual/).