# Importing Libraries

The following libraries are needed to import data from the Twitter API:

In [1]:
import tweepy
import pandas as pd
import numpy as np
import json
import re

Before extracting data from Twitter, you need to create a Twitter developer account and obtain the credentials required to access the API. The required credentials are:
- API_KEY
- API_KEY_SECRET
- BEARER_TOKEN
- ACCESS_TOKEN
- ACCESS_TOKEN_KEY

For this project, we requested Academic Research access from Twitter to query the full archive of tweets, since a regular Twitter developer account only allows extracting data from the past 30 days.
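The credentials listed above do not have to be typed into the notebook itself. The following is a minimal sketch, assuming the five values have been exported as environment variables under the same names; the helper `load_credentials` and the list `CREDENTIAL_NAMES` are hypothetical names introduced only for illustration.

```python
import os

# Names of the credentials listed above (assumed to match the environment variable names)
CREDENTIAL_NAMES = [
    "API_KEY",
    "API_KEY_SECRET",
    "BEARER_TOKEN",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_KEY",
]


def load_credentials(env=os.environ):
    """Return the credentials as a dict, failing early if any value is missing."""
    missing = [name for name in CREDENTIAL_NAMES if not env.get(name)]
    if missing:
        raise KeyError(f"Missing credentials: {', '.join(missing)}")
    return {name: env[name] for name in CREDENTIAL_NAMES}


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {name: f"dummy-{name.lower()}" for name in CREDENTIAL_NAMES}
credentials = load_credentials(demo_env)
print(sorted(credentials))
```

Calling `load_credentials()` without arguments reads from the real environment, so the secrets never appear in the saved notebook.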



# Extracting Tweet Data for the 42nd Federal Election

Using the credentials listed above and the **Tweepy** library, we can connect to the Twitter API and extract the data.

The list of all tweet IDs with the official hashtag for the 42nd Canadian federal election (#elxn42) was previously extracted and is freely available at the following link on the Canadian Dataverse Repository:

https://borealisdata.ca/dataset.xhtml?persistentId=hdl:10864/11310

Several files are available there; the only one we need is the list of tweet IDs, which is also provided at the following path:

**Data\42nd election\elxn42-tweet-ids.txt**

By using the OAuth handler from the Tweepy library with the following credentials, we can connect to the API and extract the data within the API's rate limits.

In [2]:
# THE CREDENTIALS FOR THIS PART WERE ISSUED EXCLUSIVELY TO THE AUTHOR AND ARE OMITTED FOR SECURITY REASONS.
# PLEASE USE YOUR OWN API CREDENTIALS IF YOU WOULD LIKE TO RUN THIS CODE.

API_KEY = 'YOUR API KEY'
API_KEY_SECRET = 'YOUR API KEY SECRET'
BEARER_TOKEN = 'YOUR BEARER TOKEN'
ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
ACCESS_TOKEN_KEY = 'YOUR ACCESS TOKEN KEY'


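# Authenticate with the consumer key pair and the user access token pair
auth = tweepy.OAuthHandler(API_KEY, API_KEY_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_KEY)
# wait_on_rate_limit makes Tweepy sleep until the rate limit resets instead of raising an error
api = tweepy.API(auth, wait_on_rate_limit=True)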


# Create an empty DataFrame to which the extracted data will be appended
elnx42 = pd.DataFrame(columns=['Tweet ID', 'Tweet Date', 'Full Text', 'Likes_count', 'Retweet_count', 'Author name', 'Author ID', 'Author Follower', 'Author Friends', 'Retweet_status'])


In [3]:
# Get the list of IDs to run for data extraction
ids_total = pd.read_csv('Data/42nd election/elxn42-tweet-ids.txt', names = ['Tweet ID'])
ids = ids_total.sample(frac=0.03, replace=False, random_state=18)


ids.reset_index(drop=True, inplace=True)

In [None]:
for i, tweet_id in enumerate(ids['Tweet ID']):
    elnx42.loc[i, 'Tweet ID'] = tweet_id

    try:
        # tweet_mode="extended" returns the full, untruncated tweet text
        status = api.get_status(tweet_id, tweet_mode="extended")
        elnx42.loc[i, 'Tweet Date'] = status.created_at
        elnx42.loc[i, 'Full Text'] = status.full_text
        elnx42.loc[i, 'Likes_count'] = status.favorite_count
        elnx42.loc[i, 'Retweet_count'] = status.retweet_count
        elnx42.loc[i, 'Author name'] = status.author.name
        elnx42.loc[i, 'Author ID'] = status.author.id
        elnx42.loc[i, 'Author Follower'] = status.author.followers_count
        elnx42.loc[i, 'Author Friends'] = status.author.friends_count
        # Note: `retweeted` reports whether the authenticating user retweeted this tweet
        elnx42.loc[i, 'Retweet_status'] = status.retweeted
    except Exception:
        # Tweets that cannot be fetched (e.g. deleted or protected) get 0 in every field except the ID
        elnx42.loc[i, elnx42.columns[1:]] = 0

    # Save a backup every 100 tweets so that progress survives an interruption
    if i % 100 == 0:
        elnx42.to_csv('Data/backup4.csv')
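Requesting one tweet per call is the slowest way to work through a long ID list under a rate limit. As a hedged alternative, the sketch below groups IDs into chunks of at most 100 and hands each chunk to a single bulk call. It assumes that the installed Tweepy version exposes a bulk lookup method (`API.lookup_statuses` in Tweepy 4.x, as far as we know; older releases name it differently), so the real call is passed in as a function and not hard-coded. The names `chunked`, `fetch_in_batches`, and `fake_fetch` are hypothetical and exist only for this illustration.

```python
from typing import Callable, Iterable, Iterator, List


def chunked(items: List[int], size: int = 100) -> Iterator[List[int]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(tweet_ids: Iterable[int],
                     fetch_batch: Callable[[List[int]], list],
                     size: int = 100) -> list:
    """Call `fetch_batch` once per chunk and collect everything it returns."""
    results = []
    for batch in chunked(list(tweet_ids), size):
        results.extend(fetch_batch(batch))
    return results


def fake_fetch(batch: List[int]) -> list:
    """Offline stand-in for the real bulk API call (assumption, not Tweepy code)."""
    return [{"id": tweet_id} for tweet_id in batch]


result = fetch_in_batches(range(250), fake_fetch)
print(len(result))
```

With a real client, `fetch_batch` could be something like `lambda batch: api.lookup_statuses(batch, tweet_mode="extended")`; this is an assumption to verify against the installed Tweepy version. A bulk lookup typically omits unavailable tweets from its response instead of raising an error, so missing IDs would need to be detected by comparing the returned IDs with the requested ones.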
    
    

Because of the limited capacity of the local machine, the data were extracted in four steps over the course of a week; each step took around 15 to 16 hours to extract 250K rows.
Using the following sample code, we made sure that the same IDs were not picked twice:

In [None]:
ids_total = pd.read_csv(r'D:\Brainstation\Capstone\42nd election\elxn42-tweet-ids.txt', names=['Tweet ID'])
# `df` is assumed to hold the tweets extracted in the previous steps
# backup1 = pd.read_csv('backup.csv')
# Left-join the extracted tweets onto the full ID list; IDs not yet extracted have no 'Full Text'
ids_remaining = ids_total.merge(df, how='left', on='Tweet ID')
ids_remaining = ids_remaining[ids_remaining['Full Text'].isna()]

# Keep only the ID column and draw the next sample from the remaining IDs
ids_remaining = ids_remaining[['Tweet ID']]
ids_remaining.head()
ids = ids_remaining.sample(frac=0.03, replace=False, random_state=18)
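The idea of excluding IDs that were already extracted can be checked on a toy example. The sketch below is an illustration with made-up IDs and text, not the project data; it uses the `indicator` option of `DataFrame.merge`, which marks rows found only in the left table, as one way to express the same anti-join. The frame name `already_extracted` is hypothetical.

```python
import pandas as pd

# Toy stand-ins: the full ID list and the tweets extracted so far
ids_total = pd.DataFrame({"Tweet ID": [101, 102, 103, 104, 105]})
already_extracted = pd.DataFrame({
    "Tweet ID": [102, 104],
    "Full Text": ["first tweet", "second tweet"],
})

# indicator=True adds a `_merge` column saying where each row was found
merged = ids_total.merge(
    already_extracted[["Tweet ID"]], on="Tweet ID", how="left", indicator=True
)

# Rows marked "left_only" exist in the full list but not in the extracted data
ids_remaining = merged.loc[merged["_merge"] == "left_only", ["Tweet ID"]]
ids_remaining = ids_remaining.reset_index(drop=True)

print(ids_remaining["Tweet ID"].tolist())  # [101, 103, 105]
```

Unlike a check on `Full Text` being missing, this variant does not depend on the contents of any data column, only on whether the ID appears in the extracted table.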

# Extracting Tweet Data for the 43rd Federal Election

The corresponding set of tweet IDs for the 43rd federal election was downloaded from the same website using the following link:

https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP2/QAMPPI

This dataset will only be used to test the whole pipeline and check whether it can correctly predict the winner.

In [None]:
elnx43 = pd.DataFrame(columns=['Tweet ID', 'Tweet Date', 'Full Text', 'Likes_count', 'Retweet_count', 'Author name', 'Author ID', 'Author Follower', 'Author Friends', 'Retweet_status'])

ids_total_43rd = pd.read_csv('Data/43rd election/elxn43-ids.txt', names = ['Tweet ID'])
ids_43rd = ids_total_43rd.sample(frac=0.0001, replace=False, random_state=18)


ids_43rd.reset_index(drop=True, inplace=True)

In [None]:
for i, tweet_id in enumerate(ids_43rd['Tweet ID']):
    elnx43.loc[i, 'Tweet ID'] = tweet_id

    try:
        # tweet_mode="extended" returns the full, untruncated tweet text
        status = api.get_status(tweet_id, tweet_mode="extended")
        elnx43.loc[i, 'Tweet Date'] = status.created_at
        elnx43.loc[i, 'Full Text'] = status.full_text
        elnx43.loc[i, 'Likes_count'] = status.favorite_count
        elnx43.loc[i, 'Retweet_count'] = status.retweet_count
        elnx43.loc[i, 'Author name'] = status.author.name
        elnx43.loc[i, 'Author ID'] = status.author.id
        elnx43.loc[i, 'Author Follower'] = status.author.followers_count
        elnx43.loc[i, 'Author Friends'] = status.author.friends_count
        # Note: `retweeted` reports whether the authenticating user retweeted this tweet
        elnx43.loc[i, 'Retweet_status'] = status.retweeted
    except Exception:
        # Tweets that cannot be fetched (e.g. deleted or protected) get 0 in every field except the ID
        elnx43.loc[i, elnx43.columns[1:]] = 0

# Save the test sample once the loop has finished
elnx43.to_csv('Data/test_elnx43.csv')
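The extraction loops for the two elections share the same field mapping, so it can be factored into one function. The following is a minimal sketch under stated assumptions: `status_to_row` and `COLUMNS` are hypothetical names, and `fake_status` is a made-up stand-in object that only mimics the attributes read in the loops above, so the example runs without API access. Rows are collected in a list and turned into a DataFrame once at the end, which avoids growing a DataFrame cell by cell.

```python
from types import SimpleNamespace

import pandas as pd

COLUMNS = ['Tweet ID', 'Tweet Date', 'Full Text', 'Likes_count', 'Retweet_count',
           'Author name', 'Author ID', 'Author Follower', 'Author Friends',
           'Retweet_status']


def status_to_row(tweet_id, status=None):
    """Map one status object to a row dict; use 0 for every field if it is missing."""
    if status is None:
        return {'Tweet ID': tweet_id, **{col: 0 for col in COLUMNS[1:]}}
    return {
        'Tweet ID': tweet_id,
        'Tweet Date': status.created_at,
        'Full Text': status.full_text,
        'Likes_count': status.favorite_count,
        'Retweet_count': status.retweet_count,
        'Author name': status.author.name,
        'Author ID': status.author.id,
        'Author Follower': status.author.followers_count,
        'Author Friends': status.author.friends_count,
        'Retweet_status': status.retweeted,
    }


# Made-up stand-in for an API response, used only to exercise the function
fake_status = SimpleNamespace(
    created_at="2015-10-19", full_text="Vote today! #elxn42",
    favorite_count=3, retweet_count=1, retweeted=False,
    author=SimpleNamespace(name="Example User", id=42,
                           followers_count=10, friends_count=5),
)

rows = [status_to_row(1, fake_status), status_to_row(2)]
frame = pd.DataFrame(rows, columns=COLUMNS)
print(frame.shape)  # (2, 10)
```

In a real run, the loop body would call the API inside `try`, pass the result to `status_to_row`, and call `status_to_row(tweet_id)` without a status in the `except` branch.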