# Relationship Between COVID-19 Twitter Mentions and Market Indices

This analysis examines the impact that the amplification of COVID-19-related discussion on Twitter has on financial markets.

## Data Collection

### Web Scraping
For this analysis, we require third-party data. Although some manual retrieval is unavoidable, we will automate the data collection process as much as possible to ensure the reproducibility of this analysis.

#### Logging in with Selenium

In [31]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import secrets  # local secrets.py with the IEEE DataPort credentials (shadows the standard-library module)
import time

url = 'https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset'

ieee_username = secrets.ieee_username
ieee_password = secrets.ieee_password

# Optional helper for pausing between steps if the page loads slowly
# def wait(x): 
#     time.sleep(x)

# Launch Chrome and open the dataset page
driver = webdriver.Chrome()

# wait(2)

driver.get(url)

# wait(2)

# find_element(By.XPATH, ...) replaces find_element_by_xpath, which was removed in Selenium 4.3
login_link = driver.find_element(By.XPATH, '//*[@id="login"]/a[2]')
login_link.click()

# wait(2)

# Fill in the username field of the sign-in form
username_input = driver.find_element(By.XPATH, '/html/body/div[3]/div/div/div[3]/div/div/form/div/div/div/div[2]/div[1]/input[1]')
username_input.send_keys(ieee_username)

# wait(2)

# Fill in the password field
password_input = driver.find_element(By.XPATH, '//*[@id="password"]')
password_input.send_keys(ieee_password)

# wait(2)

# Submit the sign-in form
submit_login = driver.find_element(By.XPATH, '//*[@id="modalWindowRegisterSignInBtn"]')
submit_login.click()

#### Pass the webpage to BeautifulSoup

In [32]:
from bs4 import BeautifulSoup as bs
import re

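# Name the parser explicitly so BeautifulSoup does not warn or pick a different one per machine
soup = bs(driver.page_source, 'html.parser')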

# Create empty list to store all of the links that we scrape from the webpage
data_files = []

# Loop through links whose text matches the regex defined below
# Regex: match text that contains corona_tweets_, followed by one or more digits, a period and either csv or zip
# (raw string so that \d and \. reach the regex engine unchanged)
for link in soup.find_all('a', string=re.compile(r'corona_tweets_\d+\.(csv|zip)')):

    # For each a tag (i.e. link), pull out the value of its href attribute.
    download_link = link['href']
    # Append each link to the data_files list defined above
    data_files.append(download_link)


print(data_files[:5])

['https://ieee-dataport.s3.amazonaws.com/open/14206/corona_tweets_01.csv?response-content-disposition=attachment%3B%20filename%3D%22corona_tweets_01.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20200801%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200801T211320Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=5d9f3a0126ee5a1ed4cce3702bab858abd8fb62c5c15952f2c8c3567f6fd0028', 'https://ieee-dataport.s3.amazonaws.com/open/14206/corona_tweets_02.csv?response-content-disposition=attachment%3B%20filename%3D%22corona_tweets_02.csv%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJOHYI4KJCE6Q7MIQ%2F20200801%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200801T211320Z&X-Amz-SignedHeaders=Host&X-Amz-Expires=86400&X-Amz-Signature=50163f772439c813da48c8fcceb2e25f57848e5a5a4b78a8b2cfc1ee41731a30', 'https://ieee-dataport.s3.amazonaws.com/open/14206/corona_tweets_03.csv?response-content-disposition=attachment%3B%20filename%3D%22corona_tweets_03.c

#### Download dataset files

In [36]:
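import os
import re
import urllib.request

# Folder that will hold the downloaded files; create it if it does not exist yet
raw_data_dir = './covid-tweets/raw_data'
os.makedirs(raw_data_dir, exist_ok=True)

for link in data_files:
    # Recover the file name (e.g. corona_tweets_01.csv) from the signed download URL
    file_name = re.search(r'corona_tweets_\d+\.(csv|zip)', link).group(0)
    full_file_name = os.path.join(raw_data_dir, file_name)
    urllib.request.urlretrieve(link, full_file_name)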
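
The file-name extraction used in the download loop can be isolated into a small helper, which makes it easy to check against a sample link before downloading anything. The sketch below is illustrative only: `local_path_for`, `FILE_PATTERN`, and the example URL are hypothetical names and values that do not appear in the original notebook. It searches only the path portion of the URL and returns `None` instead of raising when a link does not match.

```python
import os
import re
from urllib.parse import urlsplit

# Same file-name pattern that is used to filter the links on the page.
FILE_PATTERN = re.compile(r'corona_tweets_\d+\.(?:csv|zip)')


def local_path_for(url, dest_dir):
    """Return the local path for a dataset URL, or None if it is not a dataset file."""
    # Look only at the path component so the signed query string is ignored.
    match = FILE_PATTERN.search(urlsplit(url).path)
    if match is None:
        return None
    return os.path.join(dest_dir, match.group(0))


# Hypothetical URL shaped like the scraped links (host and query string are made up).
example = 'https://example.com/open/14206/corona_tweets_01.csv?X-Amz-Expires=86400'
print(local_path_for(example, 'raw_data'))
```

The links printed above carry `X-Amz-Expires=86400`, which suggests they stop working 24 hours after the page is scraped, so the scraping and download cells are best run in the same session.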

#### Analyzing site structure

Now that we have imported BeautifulSoup and created an object that contains the HTML of the target webpage, we must analyze the site's structure to work out how to scrape the data we need.

Here is a quick scroll-through of the page that we are interested in:

<img src="./images/iee-scroll-through.gif">

**Pulling Dataset Files**

The dataset files are all located in a sidebar on the right of the page. We will do the following to programmatically retrieve all 132 of these files. The first step is to select the appropriate page elements with BeautifulSoup.

In [42]:
# right_column = html.find('div', id ='right-column')

# print(right_column.prettify())

## Data Cleaning

Before analyzing our data, we must ensure its integrity by properly cleaning the data sets. This involves several steps, because several data sets from multiple third-party sources must be used in the absence of satisfactory first-party data. Before data cleaning is feasible, we need to simplify our files.

Creating a concise set of data that can be cleaned appropriately requires the following steps:
1. Extracting the zip files
2. Concatenating the csv files
3. Hydrating the tweets within the csv files

### Extracting zip files
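
The notebook does not show this step, so the following is only a minimal sketch of one way to do it with the standard-library `zipfile` module. It assumes the downloaded archives sit in `./covid-tweets/raw_data` (the folder used in the download step) and that each archive contains CSV files that can be unpacked next to it; `extract_zip_files` is a hypothetical helper name.

```python
import glob
import os
import zipfile


def extract_zip_files(raw_dir):
    """Extract every .zip archive found in raw_dir into raw_dir.

    Returns the sorted list of archive paths that were processed.
    """
    archives = sorted(glob.glob(os.path.join(raw_dir, '*.zip')))
    for archive in archives:
        # The context manager closes the archive even if extraction fails.
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(raw_dir)
    return archives


# Assumed location of the downloaded files; returns an empty list if nothing is there.
extracted = extract_zip_files('./covid-tweets/raw_data')
print(len(extracted), 'archive(s) extracted')
```

The archives are left in place after extraction, so the step can be re-run safely; files with the same name are simply overwritten.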

### Concatenating COVID-19 Raw Data

In [10]:
import os
import glob
import pandas as pd

# Change working directory to raw_data, where all of our raw data is stored. Can comment out once directory has changed.
# os.chdir("./covid-tweets/raw_data")

# create a variable to hold our desired extension, csv in this case.
# extension = 'csv'

# store all filenames in current working directory that have a csv extension.
# all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

# concatenate all of our files stored in all_filenames using pandas and save output to a new variable.
# combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# export our combined_csv as a csv file to a separate folder named raw_data_concat for ease of access.
# combined_csv.to_csv("../raw_data_concat/raw_data_concat.csv", index=False, encoding='utf-8-sig')

# Change back to root directory of project to allow for the continuous running of the script
# os.chdir("../../")


In [11]:
os.getcwd()

'/Users/xaviergill/Documents/coding-projects/covid_tweet_and_market_analysis'
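
The concatenation step can also be written without changing the working directory, which removes the need to switch back to the project root afterwards. The sketch below is one possible alternative, not the notebook's method: it swaps pandas for the standard-library `csv` module so that rows are streamed from file to file instead of being loaded into memory at once. `concat_csv_files` and its `has_header` parameter are hypothetical, and whether the raw files actually start with a header row is an assumption to verify before use.

```python
import csv
import glob
import os


def concat_csv_files(raw_dir, out_path, has_header=True):
    """Stream every CSV in raw_dir into out_path and return the number of data rows written."""
    # Build the source list first and skip the output file in case it lives in raw_dir.
    sources = [
        path for path in sorted(glob.glob(os.path.join(raw_dir, '*.csv')))
        if os.path.abspath(path) != os.path.abspath(out_path)
    ]
    rows_written = 0
    header_written = False
    with open(out_path, 'w', newline='', encoding='utf-8-sig') as out_file:
        writer = csv.writer(out_file)
        for path in sources:
            # utf-8-sig reads files both with and without a byte-order mark.
            with open(path, newline='', encoding='utf-8-sig') as in_file:
                reader = csv.reader(in_file)
                if has_header:
                    header = next(reader, None)
                    # Keep the header from the first file only.
                    if header is not None and not header_written:
                        writer.writerow(header)
                        header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Example call with the paths used in this notebook (the output folder must already exist):
# concat_csv_files('./covid-tweets/raw_data',
#                  './covid-tweets/raw_data_concat/raw_data_concat.csv')
```

Passing `has_header=False` copies every row of every file, which is the safer choice if the raw files turn out to contain data only.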