# Filename: get_data.ipynb

# Purpose:

This notebook downloads the tweets we need for our project and loads them into a pandas DataFrame.

We currently collect tweets in two ways: by searching for specific keywords related to the 2020 Presidential Election, and by scraping tweets with hashtags about the election and its candidates.

## Install Libraries

In [2]:
#!pip install tweepy # Run this line only if you don't have tweepy installed

Collecting tweepy
  Downloading https://files.pythonhosted.org/packages/36/1b/2bd38043d22ade352fc3d3902cf30ce0e2f4bf285be3b304a2782a767aec/tweepy-3.8.0-py2.py3-none-any.whl
Collecting requests-oauthlib>=0.7.0 (from tweepy)
  Downloading https://files.pythonhosted.org/packages/c2/e2/9fd03d55ffb70fe51f587f20bcf407a6927eb121de86928b34d162f0b1ac/requests_oauthlib-1.2.0-py2.py3-none-any.whl
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->tweepy)
  Downloading https://files.pythonhosted.org/packages/05/57/ce2e7a8fa7c0afb54a0581b14a65b56e62b5759dbc98e80627142b8a3704/oauthlib-3.1.0-py2.py3-none-any.whl (147kB)
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.1.0 requests-oauthlib-1.2.0 tweepy-3.8.0


In [5]:
import pandas as pd

import tweepy 
from tweepy import OAuthHandler

### Fetch Tweets Using the Twitter API with `tweepy`

The class below is adapted from [this code](https://www.kaggle.com/amar09/sentiment-analysis-on-scrapped-tweets?source=post_page-----1804db3478ac----------------------) by Kaggle user [Amardeep Chauhan](https://www.kaggle.com/amar09).

The class uses `tweepy` to access the Twitter API and fetch tweets relating to a specified keyword. 

The keywords we will use are:

- 2020 Presidential Election
- TODO: Come up with more
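
Since we plan to search by both keywords and hashtags, it may help to build the search string in one place. The sketch below assumes the Twitter standard search operators `OR` and double-quoted exact phrases; the helper name `build_query` and the example hashtags are hypothetical, not part of the project. Note that quoting a multi-word keyword matches it as an exact phrase, which is stricter than the unquoted query used later in this notebook.

```python
def build_query(keywords=(), hashtags=()):
    """Combine keywords and hashtags into one OR-separated search string."""
    terms = []
    for keyword in keywords:
        keyword = keyword.strip()
        if not keyword:
            continue
        # Quote multi-word keywords so they are matched as an exact phrase
        terms.append(f'"{keyword}"' if ' ' in keyword else keyword)
    for tag in hashtags:
        # Accept hashtags with or without the leading '#'
        tag = tag.strip().lstrip('#')
        if tag:
            terms.append(f'#{tag}')
    if not terms:
        raise ValueError('At least one keyword or hashtag is required.')
    return ' OR '.join(terms)


# Hypothetical example terms, not the project's final list
example_query = build_query(['2020 Presidential Election'], ['Election2020', '#Vote2020'])
print(example_query)
```

The resulting string could be passed as the `query` argument of `get_tweets` below.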

In [8]:
# Keys and tokens
consumer_key = ''
consumer_secret = ''

access_token = ''
access_token_secret = ''
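
The keys above are left blank so that no secrets end up in the notebook. One common alternative is to read them from environment variables, as in the sketch below. The variable names in `REQUIRED_VARS` and the helper `load_credentials` are assumptions made for illustration; any naming scheme would work.

```python
import os

# Hypothetical environment variable names; rename to match your own setup
REQUIRED_VARS = (
    'TWITTER_CONSUMER_KEY',
    'TWITTER_CONSUMER_SECRET',
    'TWITTER_ACCESS_TOKEN',
    'TWITTER_ACCESS_TOKEN_SECRET',
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

With the variables set, `creds = load_credentials()` followed by `consumer_key = creds['TWITTER_CONSUMER_KEY']` (and likewise for the other three) would fill in the values used by `TwitterClient`.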

In [9]:
class TwitterClient(object):
    """Thin wrapper around the tweepy API for fetching tweets."""

    def __init__(self):
        """Authenticate with the module-level keys and create the tweepy API object."""
        try:
            # Create OAuthHandler Object
            auth = OAuthHandler(consumer_key, consumer_secret)
            
            # Set access token and secret token
            auth.set_access_token(access_token, access_token_secret)
            
            # Create tweepy API object to fetch tweets
            self.api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
            
        except tweepy.TweepError as e:
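            print(f'Error: Twitter Authentication Failed - \n{e}')

    def get_tweets(self, query, max_tweets = 1000):
        """
        Fetch tweets matching `query` in pages of up to 100 and return their
        text as a one-column DataFrame (column 'tweets').

        Downloading stops once `max_tweets` tweets have been downloaded, when
        the search returns no more results, or on the first tweepy error.
        """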
        tweets = []
        since_Id = None
        max_id = -1
        tweet_count = 0
        tweets_per_query = 100
        
        while tweet_count < max_tweets:
            try:
                # The first request (max_id <= 0) sends no ID bounds. Later
                # requests pass max_id so the search pages backwards through
                # older tweets. since_Id is never set in this notebook, so
                # since_id is only sent if a lower bound is added later.
                search_kwargs = {'q': query, 'count': tweets_per_query}
                if max_id > 0:
                    search_kwargs['max_id'] = str(max_id - 1)
                if since_Id:
                    search_kwargs['since_id'] = since_Id
                new_tweets = self.api.search(**search_kwargs)
                
                if not new_tweets:
                    print('No more tweets found.')
                    break
                
                # Start parsing the list of tweets
                for tweet in new_tweets:
                    parsed_tweet = {}
                    parsed_tweet['tweets'] = tweet.text
                    
                    # Append parsed tweet to tweets list
                    if tweet.retweet_count > 0: 
                        if parsed_tweet not in tweets: # If the tweet has retweets, ensure that it is appended only once
                            tweets.append(parsed_tweet)
                    else:
                        tweets.append(parsed_tweet)
                
                tweet_count += len(new_tweets)
                print('Downloaded {0} tweets'.format(tweet_count))
                max_id = new_tweets[-1].id
                
            except tweepy.TweepError as e: # Exit if there are any errors
                print('Tweepy Error: ' + str(e))
                break
                
        
        # Return DataFrame of tweets
        return pd.DataFrame(tweets)
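
The `max_id` bookkeeping in `get_tweets` is easier to follow in isolation. The sketch below pages through a fake in-memory search function the same way: each request asks only for IDs lower than the oldest ID seen so far. `fake_search` and `fetch_all` are stand-ins written for this illustration and are not part of `tweepy`.

```python
def fake_search(all_ids, count, max_id=None):
    """Stand-in for a search endpoint: newest first, at most `count` IDs <= max_id."""
    candidates = sorted(
        (i for i in all_ids if max_id is None or i <= max_id),
        reverse=True,
    )
    return candidates[:count]


def fetch_all(all_ids, page_size, max_items):
    """Collect up to `max_items` IDs by walking backwards one page at a time."""
    collected = []
    max_id = None  # No upper bound on the first request
    while len(collected) < max_items:
        page = fake_search(all_ids, page_size, max_id)
        if not page:
            break  # Nothing older is left
        collected.extend(page)
        # Next request: only IDs strictly older than the oldest one seen
        max_id = page[-1] - 1
    return collected[:max_items]


print(fetch_all(all_ids=range(1, 11), page_size=4, max_items=10))
```

Subtracting one from the oldest ID mirrors `max_id = str(max_id - 1)` in the class above and keeps the last tweet of one page from reappearing as the first tweet of the next.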
                    

In [10]:
twitter_client = TwitterClient()

# Create dataframe of tweets
tweets_df = twitter_client.get_tweets('2020 Presidential Election', max_tweets = 1000)
print(f'tweets_df Shape - {tweets_df.shape}')

Downloaded 100 tweets
Downloaded 200 tweets
Downloaded 300 tweets
Downloaded 400 tweets
Downloaded 500 tweets
Downloaded 600 tweets
Downloaded 700 tweets
Downloaded 800 tweets
Downloaded 900 tweets
Downloaded 1000 tweets
tweets_df Shape - (76, 1)
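
The log reports 1000 downloaded tweets but the DataFrame has only 76 rows. This is because `tweet_count` counts every tweet returned by the API, while the list keeps a single copy of each repeated text among tweets that have retweets. A minimal sketch of the same idea follows, using a set for constant-time membership checks instead of scanning the list. The function name `unique_texts` and the sample strings are illustrative only, and unlike `get_tweets` this version removes every exact duplicate, not just those with retweets.

```python
def unique_texts(texts):
    """Return texts in first-seen order with exact duplicates removed."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# Made-up sample data, not real tweets
sample = ['RT @a: hello', 'RT @a: hello', 'original tweet', 'RT @a: hello']
print(unique_texts(sample))
```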


In [11]:
tweets_df.head()

|   | tweets |
|---|--------|
| 0 | RT @JoshRudes: .@washingtonpost has been "reli... |
| 1 | RT @PredictIt: Here is the 2020 Presidential w... |
| 2 | RT @ABC: At least three states are considering... |
| 3 | RT @20predict: Taylor Swift will be one of the... |
| 4 | @mdornic I would like to thank you for helpin... |


In [19]:
tweets_df.iloc[0].tweets

'RT @JoshRudes: .@washingtonpost has been "reliably told" that Trump is intentionally withholding a White House visit and US military aid "i…'
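
The text above ends in `…` because the default search results carry a truncated `text` field. To our understanding, `tweepy` 3.x accepts `tweet_mode='extended'` on `api.search`, in which case each status exposes `full_text` and a retweet carries the untruncated original under `retweeted_status`; this should be verified against the installed version. The helper below is a sketch under that assumption. `get_full_text` is a hypothetical name, and the objects in the example are stand-ins, not real API results.

```python
from types import SimpleNamespace


def get_full_text(status):
    """Return the longest available text for a status-like object."""
    # For retweets, prefer the original tweet's text (drops the 'RT @user:' prefix)
    original = getattr(status, 'retweeted_status', None)
    if original is not None:
        return get_full_text(original)
    # Extended mode exposes full_text; default mode only has text
    return getattr(status, 'full_text', None) or status.text


# Stand-in objects that mimic the attributes described above
retweet = SimpleNamespace(
    full_text='RT @someone: a long tweet that gets cut…',
    retweeted_status=SimpleNamespace(full_text='a long tweet that gets cut off in the retweet'),
)
print(get_full_text(retweet))
```

Inside `get_tweets`, this would replace `tweet.text` when building `parsed_tweet`, provided the search call is made in extended mode.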