# Data Collection

The purpose of this analysis is to examine how the amplification of COVID-19-related discussion on Twitter affects financial markets.

## COVID-19 Tweets & Sentiment Scores

**Why use Twitter?**

**About the dataset**

Rabindra Lamsal of the School of Computer and Systems Sciences, Jawaharlal Nehru University (New Delhi), provides a [dataset](https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset) that is an excellent fit for the purposes of this analysis. The data is provided as a set of CSV files that contain tweets related to the COVID-19 pandemic and their associated sentiment scores.

These tweets were collected as part of an [ongoing project](https://live.rlamsal.com.np). The project monitors Twitter in real time for COVID-19-related tweets by filtering on 54 different keywords that are commonly used when referencing the pandemic (insert citation).

**Twitter content redistribution policy**

The CSVs in this dataset do not actually contain tweet data, but rather tweet IDs. This was done to comply with Twitter's content redistribution policy. The tweet IDs in the dataset will need to be hydrated, that is, expanded back into full tweet objects, before analysis is possible.

### Web Scraping
For this analysis, we will require third-party data. While this requires some level of manual retrieval, we will automate the data collection process as much as possible to ensure the reproducibility of this analysis.

#### Analyzing site structure

Before writing any scraping code, we must analyze the site's structure to work out a method for retrieving the desired data.

Here is a quick scroll through of the [page](https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset) that we are interested in:

<img src="./images/iee-webpage.gif">

All of the data that we need to retrieve is located within the right sidebar; however, there are well over 100 links that need to be clicked to access the data. Furthermore, these files will need to be moved to the desired directory and subsequently concatenated. To make this process more efficient, we will automate it using web scraping techniques in both Selenium and BeautifulSoup.

#### Automating data collection

The dataset files are all located in a sidebar on the right of the page. We will take the following steps to programmatically retrieve all 132 of these files. The first step is selecting the appropriate page elements and passing them to BeautifulSoup.

**Logging in with Selenium**

In [12]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import secrets # create a secrets.py file in the root of the project directory and enter confidential info there

url = 'https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset'

ieee_username = secrets.ieee_username
ieee_password = secrets.ieee_password

# Launch a Chrome session and open the dataset page
driver = webdriver.Chrome()

driver.get(url)

# Open the login dialog
# Note: find_element(By.XPATH, ...) replaces find_element_by_xpath, which was removed in Selenium 4.3
login_link = driver.find_element(By.XPATH, '//*[@id="login"]/a[2]')
login_link.click()

# Fill in the username and password fields
username_input = driver.find_element(By.XPATH, '/html/body/div[3]/div/div/div[3]/div/div/form/div/div/div/div[2]/div[1]/input[1]')
username_input.send_keys(ieee_username)

password_input = driver.find_element(By.XPATH, '//*[@id="password"]')
password_input.send_keys(ieee_password)

# Submit the login form
submit_login = driver.find_element(By.XPATH, '//*[@id="modalWindowRegisterSignInBtn"]')
submit_login.click()
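
One caveat with the cell above: a project-level `secrets.py` shares its name with Python's standard-library `secrets` module, so which of the two gets imported depends on the module search path. A minimal sketch of one alternative, assuming the credentials are exported as the environment variables `IEEE_USERNAME` and `IEEE_PASSWORD` (hypothetical names chosen for this example, not part of the original project):

```python
import os


def load_credentials(env=None):
    """Return (username, password) read from environment variables.

    IEEE_USERNAME and IEEE_PASSWORD are hypothetical variable names used
    for this sketch; rename them to match your own setup.
    """
    env = os.environ if env is None else env
    required = ("IEEE_USERNAME", "IEEE_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["IEEE_USERNAME"], env["IEEE_PASSWORD"]
```

With this in place, `ieee_username, ieee_password = load_credentials()` could stand in for the two `secrets.` lookups, and nothing confidential needs to live in the project directory.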

<br/>

**Passing the webpage to BeautifulSoup**

In [6]:
from bs4 import BeautifulSoup as bs
import re
import os

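# Parse the logged-in page; naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
soup = bs(driver.page_source, 'html.parser')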

<br/>

**Retrieving the download links**

In [6]:
# Create empty list to store all of the links that we scrape from the webpage
data_files = []

# Loop through links that contain a string which matches the regex defined below
# Regex: match text that contains corona_tweets_, followed by one or more digits, a period and either csv or zip
for link in soup.find_all('a', string=re.compile(r'corona_tweets_\d+\.(csv|zip)')):
    
    # For each a tag (i.e. link), pull out the value of its href attribute.     
    download_link = link['href']
    # Append each link to the data_files list defined above
    data_files.append(download_link)
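
To see which link texts the pattern accepts, it can be tried on a few strings without touching the website. The sketch below uses made-up link texts (they are illustrative only, not taken from the page) and applies the pattern with `search`, which is how BeautifulSoup applies a compiled regular expression passed as `string`:

```python
import re

# Same pattern as the cell above, written as a raw string.
FILE_PATTERN = re.compile(r'corona_tweets_\d+\.(csv|zip)')

# Hypothetical link texts, made up to exercise the pattern.
sample_link_texts = [
    'corona_tweets_01.csv',
    'corona_tweets_132.zip',
    'corona_tweets_.csv',    # rejected: no digits after the underscore
    'corona_tweets_07.txt',  # rejected: extension is neither csv nor zip
    'README',                # rejected: unrelated text
]

# Keep only the texts that the pattern matches.
matched = [text for text in sample_link_texts if FILE_PATTERN.search(text)]
print(matched)  # ['corona_tweets_01.csv', 'corona_tweets_132.zip']
```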

<br/>

**Downloading csv and zip files**

In [7]:
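import urllib.request

# Make sure both save locations exist; urlretrieve fails if the target directory is missing
os.makedirs('./covid-tweets/raw_data/csv', exist_ok=True)
os.makedirs('./covid-tweets/raw_data/zip', exist_ok=True)

# Loop through the urls in our list
# Search each url using regex to look for the name of the desired file.
for url in data_files: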
    
    # If the url has a .csv extension...
    if re.search(r'corona_tweets_\d+\.csv', url):
        
        # create a string from the regex search that will be used as the file name...
        csv_file_name = re.search(r'corona_tweets_\d+\.csv', url).group(0)
        
        # create a variable to store our full path by joining our desired save location with the csv file name...
        # Note: downloaded file will be saved to /covid-tweets/raw_data/csv within our project directory
        csv_full_file_name = os.path.join('./covid-tweets/raw_data/csv', csv_file_name)
        
        # and finally use urllib to retrieve the file
        urllib.request.urlretrieve(url, csv_full_file_name)
        
    # When the file does not match the regex search above...
    else:
        
        # create a string from the regex search that will be used as the file name...
        zip_file_name = re.search(r'corona_tweets_\d+\.zip', url).group(0)
        
        # create a variable to store our full path by joining our desired save location with the zip file name...
        # Note: downloaded file will be saved to /covid-tweets/raw_data/zip within our project directory
        zip_full_file_name = os.path.join('./covid-tweets/raw_data/zip', zip_file_name)
        urllib.request.urlretrieve(url, zip_full_file_name)
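
The two branches above differ only in the file extension, so the path logic can be folded into one helper. The following is a sketch rather than the notebook's method: `destination_for` is a hypothetical name, and the URLs in the usage note are made up. It returns `None` for a URL that does not name a dataset file, so the caller can skip it rather than hit an error on a failed regex match.

```python
import re
from pathlib import Path

# Matches a dataset file name and captures its extension.
FILE_NAME = re.compile(r'corona_tweets_\d+\.(csv|zip)')


def destination_for(url, base_dir='./covid-tweets/raw_data'):
    """Return the local path a dataset URL should be saved to, or None.

    Files are grouped by extension: <base_dir>/csv or <base_dir>/zip.
    """
    match = FILE_NAME.search(url)
    if match is None:
        return None  # not a dataset file; the caller can skip it
    file_name, extension = match.group(0), match.group(1)
    return Path(base_dir) / extension / file_name
```

For example, `destination_for('https://example.com/files/corona_tweets_12.zip')` yields a path ending in `zip/corona_tweets_12.zip`, which could then be passed to `urllib.request.urlretrieve` as the save location.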

In [None]:
# TODO: add twarc to the project and use it to hydrate the tweet IDs

## Data Cleaning

Before analyzing our data, we must ensure its integrity by properly cleaning the data sets. This involves several steps, because data sets from multiple third-party sources must be used in the absence of satisfactory first-party data. To reach a state where data cleaning is feasible, we first need to simplify our files.

Creating a concise set of data that can be cleaned appropriately requires the following steps:
1. Extracting the zip files
2. Concatenating all raw data csv files
3. Hydrating the tweets within the csv files
4. Cleaning the resulting tweet data

### Extracting zip files
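This step can be handled with Python's built-in `zipfile` module. The sketch below assumes that each archive downloaded to `./covid-tweets/raw_data/zip` holds CSV files that belong alongside the other raw CSVs; the function name `extract_archives` is illustrative, not part of the original project.

```python
import zipfile
from pathlib import Path


def extract_archives(zip_dir, out_dir):
    """Extract every .zip file in zip_dir into out_dir.

    Returns the names of all extracted members, in archive order.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive_path in sorted(Path(zip_dir).glob('*.zip')):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(out_path)
            extracted.extend(archive.namelist())
    return extracted
```

A call such as `extract_archives('./covid-tweets/raw_data/zip', './covid-tweets/raw_data/csv')` would place the extracted files next to the directly downloaded CSVs, ready for concatenation.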

### Concatenating COVID-19 Raw Data

In [11]:
# import glob
# import pandas as pd

# Change working directory to raw_data, where all of our raw data is stored. Can comment out once directory has changed.
# os.chdir("./covid-tweets/raw_data")

# create a variable to hold our desired extension, csv in this case.
# extension = 'csv'

# store all filenames in current working directory that have a csv extension.
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# concatenate all of our files stored in all_filenames using pandas and save output to a new variable.
# combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# export our combined_csv as a csv file to a separate folder named raw_data_concat for ease of access.
# combined_csv.to_csv("../raw_data_concat/raw_data_concat.csv", index=False, encoding='utf-8-sig')

# Change back to root directory of project to allow for the continuous running of the script
# os.chdir("../../")
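
The cell above changes the working directory and then changes it back, which is easy to get wrong when cells are re-run out of order. A sketch of a variant that works from explicit paths follows. It uses only the standard library so that it runs without extra dependencies, and it appends rows in file order, which matches the pandas approach only when every file has the same columns in the same order. The function name `concat_csvs` and the `has_header` flag are assumptions made for this example.

```python
import csv
from pathlib import Path


def concat_csvs(src_dir, dest_file, has_header=True):
    """Write the rows of every CSV in src_dir to dest_file.

    When has_header is True, the header of the first file is kept and the
    header rows of later files are skipped. Returns the number of rows
    written, including the kept header row.
    """
    dest_path = Path(dest_file)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Collect the sources first so the output file is never read as an input.
    sources = sorted(
        path for path in Path(src_dir).glob('*.csv')
        if path.resolve() != dest_path.resolve()
    )
    rows_written = 0
    with dest_path.open('w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        for index, csv_path in enumerate(sources):
            with csv_path.open(newline='', encoding='utf-8') as src:
                reader = csv.reader(src)
                if has_header and index > 0:
                    next(reader, None)  # skip the repeated header row
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Because every path is given explicitly, the result does not depend on the current working directory.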

If you need to check the current working directory at any time to resolve issues related to save locations, use the following command:

In [None]:
os.getcwd()

<br/>

**Hydrating tweets**

In [None]:
# TODO: add twarc to the project and use it to hydrate the tweet IDs
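
Until the hydration step is implemented, the preparation for it can be sketched: reading the tweet IDs out of a CSV and grouping them into batches. The sketch assumes that the tweet ID sits in the first column and that the files have no header row; both should be checked against the downloaded files. The batch size of 100 reflects the per-request limit that Twitter's tweet lookup endpoints have used. The helper names are illustrative, and the hydration call itself is deliberately left out.

```python
import csv
from pathlib import Path


def read_tweet_ids(csv_path, id_column=0):
    """Return the tweet IDs (as strings) from one column of a header-less CSV."""
    with Path(csv_path).open(newline='', encoding='utf-8') as handle:
        return [row[id_column] for row in csv.reader(handle) if row]


def batched(items, size=100):
    """Yield consecutive slices of items holding at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

IDs are kept as strings so that they are never rounded or reformatted on the way through. Each batch produced by `batched(read_tweet_ids(path))` would then be handed to whichever hydration client the project adopts.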