# Filename: get_data.ipynb

# Purpose:

This notebook downloads the tweets we need for our project and stores them in a pandas DataFrame. There are several ways to collect this data.

At present we use two methods: searching for specific keywords related to the 2020 Presidential Election, and scraping tweets by hashtags related to the election and its candidates.

## Install Libraries

In [2]:
#!pip install tweepy # Run this line only if you don't have tweepy installed

Collecting tweepy
  Downloading https://files.pythonhosted.org/packages/36/1b/2bd38043d22ade352fc3d3902cf30ce0e2f4bf285be3b304a2782a767aec/tweepy-3.8.0-py2.py3-none-any.whl
Collecting requests-oauthlib>=0.7.0 (from tweepy)
  Downloading https://files.pythonhosted.org/packages/c2/e2/9fd03d55ffb70fe51f587f20bcf407a6927eb121de86928b34d162f0b1ac/requests_oauthlib-1.2.0-py2.py3-none-any.whl
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->tweepy)
  Downloading https://files.pythonhosted.org/packages/05/57/ce2e7a8fa7c0afb54a0581b14a65b56e62b5759dbc98e80627142b8a3704/oauthlib-3.1.0-py2.py3-none-any.whl (147kB)
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.1.0 requests-oauthlib-1.2.0 tweepy-3.8.0


In [38]:
import pandas as pd

import tweepy 
from tweepy import OAuthHandler

### Fetch Tweets using the Twitter API with `tweepy`

The class below is adapted from [this code](https://www.kaggle.com/amar09/sentiment-analysis-on-scrapped-tweets?source=post_page-----1804db3478ac----------------------) by Kaggle user [Amardeep Chauhan](https://www.kaggle.com/amar09).

The class uses `tweepy` to access the Twitter API and fetch tweets matching a specified keyword. It is written against tweepy 3.8.0, the version installed above; `API.search`, `wait_on_rate_limit_notify`, and `tweepy.TweepError` were renamed or removed in tweepy 4.x.

The keywords we will use are:

- 2020 Presidential Election
- #2020Election
- #2020PresidentialElection
- #Election2020
- #KnowThe2020Candidates
- #POTUS2020
- #2020America
- TODO: Add names of all candidates

In [39]:
# Keys and tokens (fill in your own Twitter API credentials; do not commit real values)
consumer_key = ''
consumer_secret = ''

access_token = ''
access_token_secret = ''
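
One way to keep the credentials out of the notebook is to read them from environment variables. The following is a minimal sketch, not part of the original notebook: the helper `load_twitter_credentials` and the four variable names are hypothetical and should be renamed to match your own setup.

```python
import os


def load_twitter_credentials(env=os.environ):
    """Read the four Twitter credentials from a mapping of environment variables.

    The variable names below are assumptions made for this sketch.
    """
    names = {
        'consumer_key': 'TWITTER_CONSUMER_KEY',
        'consumer_secret': 'TWITTER_CONSUMER_SECRET',
        'access_token': 'TWITTER_ACCESS_TOKEN',
        'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
    }

    # Fail early with a clear message if any variable is unset or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))

    return {key: env[var] for key, var in names.items()}
```

The returned dictionary could then be unpacked into the four variables used in this cell, for example `consumer_key = creds['consumer_key']`.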

In [40]:
class TwitterClient(object):
    """
    Wrapper around a tweepy API object that is used to search for tweets.
    """
    def __init__(self):
        """
        Initialization method. Creates a tweepy API object in order to fetch tweets.
        """
        try:
            # Create OAuthHandler Object
            auth = OAuthHandler(consumer_key, consumer_secret)
            
            # Set access token and secret token
            auth.set_access_token(access_token, access_token_secret)
            
            # Create tweepy API object to fetch tweets
            self.api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
            
        except tweepy.TweepError as e:
            print(f'Error: Twitter Authentication Failed - \n{e}')
            
    def get_tweets(self, query, max_tweets = 1000):
        """
        Fetches tweets matching a specified query. Each tweet is parsed (only its text
        is kept) and appended to the result list; a tweet that has been retweeted is
        appended only if the same text is not already in the list.
        
        query: The query to search for.
        max_tweets: The maximum number of tweets to fetch in total. The default is 1000.
        
        Returns a pandas DataFrame of tweets matching the query. It may have fewer than
        max_tweets rows because repeated retweets are dropped.
        """
        tweets = []
        since_Id = None
        max_id = -1
        tweet_count = 0
        tweets_per_query = 100
        
        print('Fetching tweets for', query + '...')
        
        while tweet_count < max_tweets:
            try:
                # Build the search arguments. max_id stays at -1 until the first page has
                # been fetched; after that it restricts the search to older tweets.
                search_args = {'q': query, 'count': tweets_per_query}
                if max_id > 0:
                    search_args['max_id'] = str(max_id - 1)
                if since_Id: # Only return tweets more recent than since_Id
                    search_args['since_id'] = since_Id
                
                new_tweets = self.api.search(**search_args)
                
                if not new_tweets:
                    print('No more tweets found.')
                    break
                
                # Start parsing the list of tweets
                for tweet in new_tweets:
                    parsed_tweet = {}
                    parsed_tweet['tweets'] = tweet.text
                    
                    # Append parsed tweet to tweets list
                    if tweet.retweet_count > 0: 
                        if parsed_tweet not in tweets: # If the tweet has retweets, ensure that it's appended only once
                            tweets.append(parsed_tweet)
                    else:
                        tweets.append(parsed_tweet)
                
                tweet_count += len(new_tweets)
                print('\tDownloaded {0} tweets'.format(tweet_count))
                max_id = new_tweets[-1].id # Prepare to get tweets that are older than the returned tweets 
                
            except tweepy.TweepError as e: # Exit if there are any errors
                print('Tweepy Error: ' + str(e))
                break
                
        print('Finished!\n')
        
        # Return DataFrame of tweets
        return pd.DataFrame(tweets)
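
The paging logic in `get_tweets` can be tried without network access. The sketch below is an illustration under stated assumptions, not the notebook's code: `fake_search` is a hypothetical stand-in for the search endpoint that returns tweets newest first, and `collect_unique_texts` walks backwards through the results by lowering `max_id` after every page. Unlike `get_tweets`, it drops every repeated text (not only retweeted ones) and uses a set for the membership check.

```python
def fake_search(all_tweets, count, max_id=None):
    """Hypothetical stand-in for the search endpoint: newest first, ids <= max_id."""
    eligible = [t for t in all_tweets if max_id is None or t['id'] <= max_id]
    eligible.sort(key=lambda t: t['id'], reverse=True)
    return eligible[:count]


def collect_unique_texts(all_tweets, max_tweets, page_size):
    """Page through the results and keep each distinct text once."""
    texts = []
    seen = set()
    fetched = 0
    max_id = None  # No upper bound on the first request

    while fetched < max_tweets:
        page = fake_search(all_tweets, page_size, max_id)
        if not page:
            break  # Nothing older is left

        for tweet in page:
            if tweet['text'] not in seen:
                seen.add(tweet['text'])
                texts.append(tweet['text'])

        fetched += len(page)
        # The next request should only return tweets older than this page
        max_id = page[-1]['id'] - 1

    return texts, fetched
```

Because the counter tracks fetched tweets rather than unique ones, the number of texts returned can be smaller than the number fetched, which matches the difference between the download counts and the DataFrame shapes printed in the cells below.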
                    

In [41]:
twitter_client = TwitterClient()

In [42]:
# Keywords to search for; the cells below create one DataFrame per keyword
keywords = ['2020 Presidential Election', '#2020Election', '#2020PresidentialElection', 
            '#Election2020', '#KnowThe2020Candidates', '#POTUS2020', '#2020America']
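
The cells that follow call `get_tweets` once per keyword by hand. As an alternative, the same work could be done in a loop. The helper `fetch_all` below is a hypothetical sketch, not part of the original notebook; it takes the fetching function as an argument so that it can be exercised without a Twitter connection.

```python
def fetch_all(keywords, fetch, max_tweets=5000):
    """Call fetch(keyword, max_tweets=...) for each keyword.

    Returns a dict that maps each keyword to whatever fetch returned for it.
    """
    results = {}
    for keyword in keywords:
        results[keyword] = fetch(keyword, max_tweets=max_tweets)
    return results
```

With the client defined above, a call such as `frames = fetch_all(keywords, twitter_client.get_tweets)` would give one DataFrame per keyword, and `frames.values()` could be passed to `pd.concat`.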

In [43]:
tweets_df1 = twitter_client.get_tweets(keywords[0], max_tweets = 5000)
print(f'tweets_df1 Shape - {tweets_df1.shape}')

Fetching tweets for 2020 Presidential Election...
	Downloaded 100 tweets
	Downloaded 200 tweets
	Downloaded 300 tweets
	Downloaded 400 tweets
	Downloaded 500 tweets
	Downloaded 600 tweets
	Downloaded 700 tweets
	Downloaded 800 tweets
	Downloaded 900 tweets
	Downloaded 1000 tweets
	Downloaded 1096 tweets
	Downloaded 1196 tweets
	Downloaded 1296 tweets
	Downloaded 1396 tweets
	Downloaded 1496 tweets
	Downloaded 1596 tweets
	Downloaded 1696 tweets
	Downloaded 1796 tweets
	Downloaded 1896 tweets
	Downloaded 1996 tweets
	Downloaded 2096 tweets
	Downloaded 2196 tweets
	Downloaded 2296 tweets
	Downloaded 2396 tweets
	Downloaded 2496 tweets
	Downloaded 2596 tweets
	Downloaded 2696 tweets
	Downloaded 2796 tweets
	Downloaded 2896 tweets
	Downloaded 2996 tweets
	Downloaded 3096 tweets
	Downloaded 3196 tweets
	Downloaded 3296 tweets
	Downloaded 3396 tweets
	Downloaded 3496 tweets
	Downloaded 3596 tweets
	Downloaded 3696 tweets
	Downloaded 3796 tweets
	Downloaded 3896 tweets
	Downloaded 3996 tweets

In [44]:
tweets_df2 = twitter_client.get_tweets(keywords[1], max_tweets = 5000)
print(f'tweets_df2 Shape - {tweets_df2.shape}')

Fetching tweets for #2020Election...
	Downloaded 100 tweets
	Downloaded 200 tweets
	Downloaded 300 tweets
	Downloaded 400 tweets
	Downloaded 500 tweets
	Downloaded 600 tweets
	Downloaded 700 tweets
	Downloaded 800 tweets
	Downloaded 900 tweets
	Downloaded 1000 tweets
	Downloaded 1100 tweets
	Downloaded 1200 tweets
	Downloaded 1300 tweets
	Downloaded 1400 tweets
	Downloaded 1500 tweets
	Downloaded 1600 tweets
	Downloaded 1700 tweets
	Downloaded 1800 tweets
	Downloaded 1900 tweets
	Downloaded 2000 tweets
	Downloaded 2100 tweets
	Downloaded 2200 tweets
	Downloaded 2300 tweets
	Downloaded 2400 tweets
	Downloaded 2500 tweets
	Downloaded 2600 tweets
	Downloaded 2700 tweets
	Downloaded 2800 tweets
	Downloaded 2900 tweets
	Downloaded 3000 tweets
	Downloaded 3100 tweets
	Downloaded 3200 tweets
	Downloaded 3300 tweets
	Downloaded 3400 tweets
	Downloaded 3500 tweets
	Downloaded 3600 tweets
	Downloaded 3700 tweets
	Downloaded 3800 tweets
	Downloaded 3900 tweets
	Downloaded 4000 tweets
	Downloaded 

In [45]:
tweets_df3 = twitter_client.get_tweets(keywords[2], max_tweets = 5000)
print(f'tweets_df3 Shape - {tweets_df3.shape}')

Fetching tweets for #2020PresidentialElection...
	Downloaded 29 tweets
	Downloaded 129 tweets
	Downloaded 143 tweets
No more tweets found.
Finished!

tweets_df3 Shape - (108, 1)


In [46]:
tweets_df4 = twitter_client.get_tweets(keywords[3], max_tweets = 5000)
print(f'tweets_df4 Shape - {tweets_df4.shape}')

Fetching tweets for #Election2020...
	Downloaded 100 tweets
	Downloaded 200 tweets
	Downloaded 300 tweets
	Downloaded 400 tweets
	Downloaded 500 tweets
	Downloaded 600 tweets
	Downloaded 700 tweets
	Downloaded 800 tweets
	Downloaded 900 tweets
	Downloaded 1000 tweets
	Downloaded 1100 tweets
	Downloaded 1200 tweets
	Downloaded 1300 tweets
	Downloaded 1400 tweets
	Downloaded 1500 tweets
	Downloaded 1600 tweets
	Downloaded 1700 tweets
	Downloaded 1800 tweets
	Downloaded 1900 tweets
	Downloaded 2000 tweets
	Downloaded 2100 tweets
	Downloaded 2200 tweets
	Downloaded 2300 tweets
	Downloaded 2400 tweets
	Downloaded 2500 tweets
	Downloaded 2600 tweets
	Downloaded 2700 tweets
	Downloaded 2800 tweets
	Downloaded 2900 tweets
	Downloaded 3000 tweets
	Downloaded 3100 tweets
	Downloaded 3200 tweets
	Downloaded 3300 tweets
	Downloaded 3400 tweets
	Downloaded 3500 tweets
	Downloaded 3600 tweets
	Downloaded 3700 tweets
	Downloaded 3800 tweets
	Downloaded 3900 tweets
	Downloaded 4000 tweets
	Downloaded 

In [47]:
tweets_df5 = twitter_client.get_tweets(keywords[4], max_tweets = 5000)
print(f'tweets_df5 Shape - {tweets_df5.shape}')

Fetching tweets for #KnowThe2020Candidates...
No more tweets found.
Finished!

tweets_df5 Shape - (0, 0)


In [48]:
tweets_df6 = twitter_client.get_tweets(keywords[5], max_tweets = 5000)
print(f'tweets_df6 Shape - {tweets_df6.shape}')

Fetching tweets for #POTUS2020...
	Downloaded 100 tweets
	Downloaded 200 tweets
	Downloaded 300 tweets
	Downloaded 363 tweets
No more tweets found.
Finished!

tweets_df6 Shape - (149, 1)


In [49]:
tweets_df7 = twitter_client.get_tweets(keywords[6], max_tweets = 5000)
print(f'tweets_df7 Shape - {tweets_df7.shape}')

Fetching tweets for #2020America...
No more tweets found.
Finished!

tweets_df7 Shape - (0, 0)


In [61]:
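# Merge all of the DataFrames into one
# Note that tweets_df5 and tweets_df7 are both empty.
# ignore_index gives the merged DataFrame a fresh index instead of repeating each frame's own

final_df = pd.concat([tweets_df1, tweets_df2, tweets_df3, tweets_df4, tweets_df5, tweets_df6, tweets_df7], 
                     ignore_index = True)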
print(f'Final DataFrame Shape - {final_df.shape}')

Final DataFrame Shape - (5032, 1)
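
`get_tweets` only removes repeats within a single query, so the same tweet text may appear under more than one keyword after the merge. The sketch below shows one possible way to drop such repeats; `combine_unique` is a hypothetical helper that is not part of the original notebook, and it assumes the `tweets` column name produced by `get_tweets`.

```python
import pandas as pd


def combine_unique(frames, column='tweets'):
    """Concatenate DataFrames and keep the first occurrence of each text."""
    # Skip empty frames, such as those returned for keywords with no results
    non_empty = [df for df in frames if not df.empty]
    if not non_empty:
        return pd.DataFrame(columns=[column])

    combined = pd.concat(non_empty, ignore_index=True)
    return combined.drop_duplicates(subset=column).reset_index(drop=True)
```

Whether cross-keyword repeats should be removed depends on the analysis; keeping them preserves how often a tweet matched, while dropping them avoids counting the same text several times.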


In [62]:
# Save to a .csv file
final_df.to_csv('tweets2020.csv')