# Scraping data from Twitter

If you prefer to use Twitter's API to obtain the data, you'll find the steps to do so further down. However, the API's limitations don't allow for extracting much data.

Hence, I've created a script called `extractor.py` that uses Selenium and BeautifulSoup to extract the data, and it works without any API keys. You will, however, need to install the Selenium package for Python and download the driver you wish to use. My script runs on Firefox with [geckodriver](https://github.com/mozilla/geckodriver/releases).

With it, you can generate a better dataset than with the free version of Twitter's API. The tradeoff is the amount of time it takes to retrieve the data.

## Retrieving the data using the script

### 1. Install the Selenium package for Python

If you're using Conda, it'll be enough to run `conda install selenium`. Otherwise, you can install it with pip using `pip install selenium`.

### 2. Download a driver for a browser

As mentioned above, my version of the script works with Firefox using the geckodriver, which you can download from [here](https://github.com/mozilla/geckodriver/releases), depending on the version of the browser you have installed.

You can also run the browser in headless mode, meaning that the browser window stays invisible while the driver performs its tasks. The Selenium docs explain how to do that.

### 3. Add the driver to your `PATH`

On Linux-based systems, you can achieve this by running:

```bash
$ export PATH="/path/to/driver/dir:$PATH"
```

> **Note:** This only keeps the directory in your `PATH` for the current terminal session. You can look for ways to add it permanently if you so wish.
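
To confirm that the driver is discoverable before running the extractor, a quick check along these lines may help. It's a minimal sketch using only the standard library; the helper name `find_driver` is made up for illustration and isn't part of the script.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable if it's on PATH, else None."""
    return shutil.which(name)


driver_path = find_driver()
if driver_path is None:
    print("geckodriver was not found on PATH")
else:
    print("Using driver at", driver_path)
```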

### 4. Import the script and use its functions

> **Update:** After running this method a couple of times, Twitter seems to have blocked my IP for the specific queries I was running. Therefore, I've added an extra step: using a proxy server. This forced me to switch to Firefox, because Chrome wasn't loading websites through the proxy server. So, you should add a proxy server host before using this. You can easily find proxy servers by running a Google search for 'Fresh proxies' or something similar.

In [None]:
import twitter_data_extractor as td

# This will take a good while
duque_data = td.download_data_for_period('@IvanDuque', '2018-01-27', '2018-06-16', daily_limit=100)

In [None]:
import json

with open('data/duque_scraped_data.json', 'w') as f:
    json.dump(duque_data, f)

This will load about 14 thousand tweets mentioning @IvanDuque. You can tweak the search parameters by changing the interval to query or the number of tweets to extract per day. The extractor iterates over the dates in the interval, running a query on Twitter for each one and taking about `daily_limit` results from each day, and returns a list of tweets at the end. The number of results for a day won't necessarily be equal to `daily_limit`, but it will always be at least `daily_limit`, depending on the number of tweets Twitter loads each time a scroll is performed.
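
To get a feel for how many daily queries that interval implies, here's a small sketch of the date iteration. It is not the extractor's actual code: the helper name `iter_days` is hypothetical, and treating the end date as inclusive is an assumption.

```python
from datetime import date, timedelta


def iter_days(start, end):
    """Yield every date from start to end, both inclusive ('YYYY-MM-DD' strings)."""
    current = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while current <= last:
        yield current
        current += timedelta(days=1)


days = list(iter_days('2018-01-27', '2018-06-16'))
print(len(days), days[0], days[-1])  # prints: 141 2018-01-27 2018-06-16
```

That's 141 days, which at `daily_limit=100` lines up with the roughly 14 thousand tweets mentioned above.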

Then, you can load the data for the other candidate in the same fashion:

In [None]:
petro_data = td.download_data_for_period('@petrogustavo', '2018-01-27', '2018-06-16', daily_limit=100)

with open('data/petro_scraped_data.json', 'w') as f:
    json.dump(petro_data, f)

> **Note:** If you don't pass the `daily_limit` parameter, all tweets found for the period will be returned.

And finally, you can concatenate the results and store them in a single file:

In [None]:
whole_data = duque_data + petro_data

with open('data/scraped_data.json', 'w') as f:
    json.dump(whole_data, f)

And that's it! You've now got almost 30 thousand tweets to work with.

If you're just here to load the data, you don't need to continue reading, unless you prefer to load the data from Twitter's API.

Below are the necessary steps to retrieve data from Twitter's API if you prefer to use that. However, I would recommend using the extractor, if you've got the patience.

## Retrieving the data from Twitter's API

These were the steps taken to retrieve tweets mentioning the candidates Iván Duque and Gustavo Petro from Twitter.

This was done using Twitter's HTTP API with the `requests` package.

You can apply for a developer account [here](https://developer.twitter.com/en/apply-for-access).

In [10]:
import base64
import requests

consumer_key = "" # Set this to your consumer key
consumer_secret = "" # Set this to your consumer secret

## Using the Twitter API

There are Python packages that allow access to Twitter's APIs. However, for this project, they will be left aside, so that the data can be used exactly in the way it's needed.

Therefore, the `requests` package will be used to perform the requests.

### Authenticating requests

Twitter requires that the requests to its APIs are authenticated. The whole authentication process is described in detail [here](https://developer.twitter.com/en/docs/basics/authentication/overview/application-only), but it basically consists of 2 steps:

1. Base64-encoding the consumer key and secret
2. Obtaining an access token from Twitter's OAuth2 endpoint

So, here's a very simple way to do so:

In [None]:
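# b64encode takes and returns bytes in Python 3, so we encode the input and decode the result
encoded = base64.b64encode(("%s:%s" % (consumer_key, consumer_secret)).encode("utf-8")).decode("utf-8")

Here, we have successfully base64-encoded the consumer key and secret.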

Now, it's necessary to obtain an access token:

In [14]:
headers = {
    'Authorization': 'Basic ' + encoded, # An auth header needs to be sent, with the encoded string from before
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8' # This is the required content type
}

# Twitter's docs explain that this must be the body sent to the OAuth2 endpoint
req_body = 'grant_type=client_credentials'

# This is just the URL for the OAuth2 endpoint
auth_url = 'https://api.twitter.com/oauth2/token'

# We send a POST request to the endpoint, passing the body and headers
auth_token_res = requests.post(auth_url, data=req_body, headers=headers)

# We extract the response, which is a JSON object
auth_object = auth_token_res.json()

# We extract the access token from the response
access_token = auth_object['access_token']

And there it is!

We've successfully gotten an auth token that we'll use for requesting the data from now on.

Now, in order to start making requests to Twitter's API, and since we'll be using the same auth token and URL throughout, we'll set those next.

### Twitter's API URLs

Twitter has an interesting way of generating the query endpoints users need to access its API. It looks like this:

```
                                               And this label refers to the label you chose
 This looks the same for everybody              for your environment. Mine is 'Development'
|----------------------------------------|            |
https://api.twitter.com/1.1/tweets/search/:product/:label.json
                                              |
                        This refers to the "product". e.g. 30day or fullarchive
                                     in the case of search
```
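
The same pattern can be written as a tiny helper, which makes it easy to switch between products and environment labels. This is only a sketch: `build_search_url` and `SEARCH_URL_TEMPLATE` are names invented here, and the label must be whatever you chose for your own environment.

```python
SEARCH_URL_TEMPLATE = 'https://api.twitter.com/1.1/tweets/search/{product}/{label}.json'


def build_search_url(product, label):
    """Fill in the product (e.g. '30day' or 'fullarchive') and the environment label."""
    return SEARCH_URL_TEMPLATE.format(product=product, label=label)


print(build_search_url('30day', 'Development'))
```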

For the auth token, we'll use an `Authorization` header with a value of `"Bearer <token>"`, where `<token>` is, of course, the access token we obtained before.

In [None]:
base_api_url = 'https://api.twitter.com/1.1/tweets/search/30day/Development.json'

headers = {
    'Authorization': 'Bearer ' + access_token
}

### Requesting data from the API

Now, we can start issuing GET requests to the endpoint, adjusting the `'query'` param according to our needs.
For this project, only two things will be included in the query: a "mention" of the user we're targeting, and an exclusion of retweets.

The "mention" part is pretty straightforward: the username only needs to be included in the query with an '@' sign.

Twitter provides a couple of ways of excluding retweets. Namely, you can use `is:retweet` or `filter:retweets`, prefixed with a "`-`" sign to specify negation, as in `-is:retweet`.

However, those operators are not available on the Sandbox version, which is the free version of premium endpoints like the full archive search or the 30-day search. So, to work around this issue, we can negate the exact match "RT", because we know that all retweets include it right at the start. This might wrongly exclude a few tweets, in case a user had "RT" as part of their own tweet. But since the text of tweets is tokenized, we don't have to worry about tweets that include words like "ART" being omitted, because "RT" is searched as a whole "word".
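
To illustrate what matching "RT" as a whole word means, here's a local approximation using a regular expression with word boundaries. It does not reproduce Twitter's tokenizer; the case-sensitive match and the helper name `contains_rt_token` are assumptions made for this sketch.

```python
import re

# \b marks a word boundary, so "RT" only matches when it stands alone as a token
RT_TOKEN = re.compile(r'\bRT\b')


def contains_rt_token(text):
    """Return True if the text contains "RT" as a standalone word."""
    return RT_TOKEN.search(text) is not None


for sample in ['RT @IvanDuque: hola', 'Me gusta el ARTE', 'No es un RT, es mi opinión']:
    print(contains_rt_token(sample), '-', sample)
```

The first and third samples match, while "ARTE" does not. The third one is the kind of original tweet that the `-"RT"` workaround would wrongly exclude.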

Also, the `fromDate` and `toDate` params will be used when getting the actual data for the project, to specify the dates of the electoral campaigns.

With all that said, we can now build our query and start getting data.

In [62]:
params = {
    'query': '@IvanDuque -"RT"'
}

data = requests.get(base_api_url, headers=headers, params=params)

data = data.json()

results = data['results']

We've gotten quite a good amount of information. Each of these requests returns up to 100 tweets, and the response includes a `'next'` field, which contains a token we can use for retrieving the next 100 tweets for the query.
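
The paging logic can be sketched independently of the network call. In the snippet below, `fetch_all_pages` and `fake_fetch` are hypothetical names, and the fake responses only mimic the `'results'` and `'next'` fields described above. In practice, `fetch_page` would wrap the `requests.get` call and return the decoded JSON.

```python
def fetch_all_pages(fetch_page, params, max_requests):
    """Collect results page by page, following the 'next' token while it's present."""
    results = []
    params = dict(params)  # copy, so the caller's dict is left untouched
    for _ in range(max_requests):
        data = fetch_page(params)
        results.extend(data.get('results', []))
        next_token = data.get('next')
        if next_token is None:
            break  # no more pages for this query
        params['next'] = next_token
    return results


def fake_fetch(params):
    """Stand-in for the API: two pages, the second one without a 'next' token."""
    pages = {
        None: {'results': ['tweet 1', 'tweet 2'], 'next': 'token-a'},
        'token-a': {'results': ['tweet 3']},
    }
    return pages[params.get('next')]


print(fetch_all_pages(fake_fetch, {'query': '@IvanDuque -"RT"'}, max_requests=5))
```

Capping the loop with `max_requests` matters here, because every page counts against the request limit.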

## Data limitations

With the sandbox version of the API, we're limited to only 50 requests to the full archive endpoint, with 100 tweets each. Since the 30day endpoint allows more (250), we'll use it for testing.

For this project, we're particularly interested in the period of the 2018 electoral campaign in Colombia. The campaign ran from January 27, 2018, to May 27, 2018, for the first round of voting, and from May 27, 2018, to June 17, 2018, for the second one. So, in general, we're interested in the period between January 27, 2018, and June 17, 2018.

That leaves us with 5 months of data we're interested in. So, considering that we're limited to 50 requests, it would be fine to split those equally among the relevant months.

We would also like to focus on two candidates in particular: Iván Duque, the elected president of Colombia, and Gustavo Petro, his most noticeable opponent. So, it would be okay to leave half of the monthly requests to each one of the candidates.

To summarize, here's how we'll split our requests:

|Date / N° of requests  | Gustavo Petro | Iván Duque |
|-----------------------| ------------- | ---------- |
|2018-01-27 - 2018-02-27| 5             | 5          |
|2018-02-27 - 2018-03-27| 5             | 5          |
|2018-03-27 - 2018-04-27| 5             | 5          |
|2018-04-27 - 2018-05-27| 5             | 5          |
|2018-05-27 - 2018-06-17| 5             | 5          |
|total                  | 25            | 25         |

This will leave us with a total of 2500 tweets about each candidate.
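
The arithmetic behind that split can be checked in a few lines. The constant names are only illustrative, and the tweet count is an upper bound, since a request may return fewer than 100 tweets.

```python
TWEETS_PER_REQUEST = 100   # page size stated above
REQUEST_BUDGET = 50        # sandbox limit for the full archive endpoint
N_INTERVALS = 5            # rows in the table
N_CANDIDATES = 2           # columns in the table

requests_per_cell = REQUEST_BUDGET // (N_INTERVALS * N_CANDIDATES)
requests_per_candidate = requests_per_cell * N_INTERVALS
tweets_per_candidate = requests_per_candidate * TWEETS_PER_REQUEST

print(requests_per_cell, requests_per_candidate, tweets_per_candidate)  # prints: 5 25 2500
```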

However, since for now we're only testing, we will use a slightly different variation, working with the 30day index.

So, the idea here will be to split the last month into five intervals, and request data at least twice per interval for only one of the candidates. This way we can mimic the process we'll follow with the data we really want. So let's get to it:

In [None]:
# We'll import a couple of utilities to get the dates as necessary
from datetime import datetime, timedelta

# We set an initial date 30 days ago, and shift it 6 days at a time to get the intervals
current_date = datetime.now() - timedelta(days=30)
intervals = []

for i in range(0, 5):
    # Set a 'next' date, 6 days from the current one
    next_date = current_date + timedelta(days=6)

    interval = {
        'fromDate': current_date.strftime("%Y%m%d") + '0000',
        'toDate': next_date.strftime("%Y%m%d") + '0000' # format the dates in the way expected by Twitter
    }

    intervals.append(interval)
    current_date = next_date

Now that we've got the intervals, we can start requesting data:

In [97]:
def request_data(intervals, search_term, requests_per_interval=2):
    results = []

    for interval in intervals:
        # Params need to be reset for each interval
        params = {'query': search_term + ' -"RT"'}
        # Add the fromDate and toDate fields to the params
        params.update(interval)

        for _ in range(requests_per_interval):
            response = requests.get(base_api_url, headers=headers, params=params)
            # Fail early on HTTP errors (e.g. rate limiting) instead of hitting a KeyError below
            response.raise_for_status()
            data = response.json()
            results += data['results']

            # Stop paging if the response has no 'next' token (no more results for this interval)
            if 'next' not in data:
                break
            # We add the next param, to fetch the next "page" of results
            params['next'] = data['next']

    return results

And we fetch the data:

In [101]:
test_data = request_data(intervals, '@IvanDuque')

len(test_data)

1000

And that's it! We've now gotten 1000 tweets that we can store in a JSON file or another type of file, so that we can later process the data and adjust it to our needs.

## Exporting the tweets

We'll now save this data into a new JSON file, to use it in the other parts of the project.

In [103]:
import json

# the json package provides a simple dump function to encode data as json
with open('data/test_data.json', 'w') as f:
    json.dump(test_data, f) # here, the json dump is being sent directly to the file

Done. We now have a file that contains all of our downloaded data.

# Working with the actual data

Now we're more than ready to repeat the process with the data we're really going to use in our project.

We'll still use the headers from before, with the same auth token. So that doesn't need to be changed.

The endpoint for the full archive, however, is different:

In [104]:
# You'll notice that my environment here is called 'development' with a lowercase 'd'
# and that the endpoint for the full archive is just 'fullarchive'

base_api_url = 'https://api.twitter.com/1.1/tweets/search/fullarchive/development.json'

Also, for this endpoint, the approach we'll use with the dates will be a little bit different:

In [108]:
# Define the relevant dates in our time period of interest
dates = ["201801270000", "201802270000", "201803270000", "201804270000", "201805270000", "201806172359"]

# Produce intervals between one date and the next
intervals = [{'fromDate': dates[i], 'toDate': dates[i + 1]} for i in range(0, len(dates) - 1)]

And we can now request our data just like we did before:

In [None]:
# We'll first request data for the candidate Iván Duque
duque_data = request_data(intervals, '@IvanDuque', 5)

In [110]:
# And save it to its own file
with open('data/duque_data.json', 'w') as f:
    json.dump(duque_data, f)

In [None]:
# And now we'll request the data for the candidate Gustavo Petro
petro_data = request_data(intervals, '@petrogustavo', 5)

In [112]:
# And we save it to its own file as well
with open('data/petro_data.json', 'w') as f:
    json.dump(petro_data, f)

In [113]:
# And finally, we'll concatenate both lists and store the result in a single file
all_data = duque_data + petro_data

with open('data/data.json', 'w') as f:
    json.dump(all_data, f)

And that's it. Our data files are now ready to be used!