# Relationship Between COVID-19 Twitter Mentions and Market Indices

This analysis examines how the amplification of COVID-19-related discussion on Twitter affects financial markets.

## Data Collection

### Web Scraping
For this analysis, we will require third-party data. While this requires some manual retrieval, we will automate the data collection process as much as possible to ensure the reproducibility of this analysis.

#### Logging in with Selenium

In [24]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import re
import secrets  # local secrets.py holding the IEEE credentials (note: shadows the standard-library module of the same name)
import time

url = 'https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset'

ieee_username = secrets.ieee_username
ieee_password = secrets.ieee_password

def wait(seconds):
    """Pause so the page can finish loading before the next interaction."""
    time.sleep(seconds)

# Launch Chrome and open the dataset page
driver = webdriver.Chrome()

wait(2)

driver.get(url)

wait(2)

# Open the login dialog. find_element(By.XPATH, ...) replaces
# find_element_by_xpath, which was removed in Selenium 4.
login_link = driver.find_element(By.XPATH, '//*[@id="login"]/a[2]')
login_link.click()

wait(2)

# Fill in the username
username_input = driver.find_element(By.XPATH, '/html/body/div[3]/div/div/div[3]/div/div/form/div/div/div/div[2]/div[1]/input[1]')
username_input.send_keys(ieee_username)

wait(2)

# Fill in the password
password_input = driver.find_element(By.XPATH, '//*[@id="password"]')
password_input.send_keys(ieee_password)

wait(2)

# Submit the login form
submit_login = driver.find_element(By.XPATH, '//*[@id="modalWindowRegisterSignInBtn"]')
submit_login.click()

# Log out of the website so that we can re-run the code without any issues
# logout_link = driver.find_element(By.XPATH, '//*[@id="login"]/a[3]')
# logout_link.click()


#### Add beautifulsoup4

In [8]:
# from bs4 import BeautifulSoup as bs
# from requests import get

# # Create variable to store the URL from which we plan to pull COVID-19 tweet data
# url = 'https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset'

# # Use get to retrieve webpage HTML and store in response variable
# response = get(url)

# # Parse response.text with 'html.parser' and store it in a variable
# html = bs(response.text, 'html.parser')

# # Check that our object html contains the html content of the page that we will scrape
# print(html.prettify()[:500])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html dir="ltr" version="XHTML+RDFa 1.0" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:article="http://ogp.me/ns/article#" xmlns:book="http://ogp.me/ns/book#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:product="http://ogp.me/ns/product#" xmlns:profile="http


#### Analyzing site structure

Now that we have imported BeautifulSoup and created an object that contains the HTML of our desired webpage, we must analyze the site's structure to work out how to scrape the desired data.

Here is a quick scroll-through of the page that we are interested in:

<img src="./images/iee-scroll-through.gif">

**Pulling Dataset Files**

The dataset files are all located in a sidebar on the right of the page. We want to retrieve all 132 of these files programmatically. The first step is to select the appropriate page elements with BeautifulSoup.

In [42]:
# right_column = html.find('div', id ='right-column')

# print(right_column.prettify())
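
Once the sidebar has been selected, the file links inside it can be collected. The snippet below is only a rough sketch of that step: the sample markup, the file paths, the `.zip` filter, and the helper name `extract_file_links` are assumptions for illustration, and the real page's structure should be checked in the browser's inspector first.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = 'https://ieee-dataport.org/open-access/coronavirus-covid-19-tweets-dataset'

# Hypothetical stand-in for the page's sidebar; the real markup and paths may differ.
SAMPLE_HTML = """
<div id="right-column">
  <a href="/sites/default/files/sample_01.zip">sample_01.zip</a>
  <a href="/sites/default/files/sample_02.zip">sample_02.zip</a>
  <a href="/help">Help</a>
</div>
"""

def extract_file_links(html_text, base_url, suffix='.zip'):
    """Return absolute URLs of sidebar links whose href ends with `suffix`."""
    soup = BeautifulSoup(html_text, 'html.parser')
    sidebar = soup.find('div', id='right-column')
    if sidebar is None:
        # The sidebar was not found, so there is nothing to collect.
        return []
    return [
        urljoin(base_url, a['href'])  # resolve relative hrefs against the page URL
        for a in sidebar.find_all('a', href=True)
        if a['href'].lower().endswith(suffix)
    ]

file_links = extract_file_links(SAMPLE_HTML, PAGE_URL)
print(file_links)
```

Resolving each `href` with `urljoin` keeps the result usable even if the page mixes relative and absolute links.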

## Data Cleaning

Before analyzing our data, we must ensure its integrity by properly cleaning the data sets. This involves several steps because, in the absence of satisfactory first-party data, we rely on several data sets from multiple third-party sources. Before data cleaning is feasible, we first need to simplify our files.

Creating a concise set of data that can be cleaned appropriately requires the following steps:
1. Extracting the zip files
2. Concatenating the csv files
3. Hydrating the tweets within the csv files
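
Hydrating (step 3) means exchanging each tweet ID stored in the CSV files for the full tweet through the Twitter API. The sketch below covers only the batching part of that step, assuming the lookup endpoint accepts up to 100 IDs per request. `batch_ids` is a hypothetical helper and the IDs are placeholders; the API call itself is left out.

```python
def batch_ids(tweet_ids, batch_size=100):
    """Yield successive lists of at most `batch_size` tweet IDs."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(tweet_ids), batch_size):
        yield tweet_ids[start:start + batch_size]

# Placeholder IDs for illustration only; real IDs would come from the concatenated CSV.
example_ids = [str(n) for n in range(250)]
batches = list(batch_ids(example_ids))
print([len(batch) for batch in batches])
```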

### Extracting zip files
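
One way to perform this step is with the standard-library `zipfile` module. The following is a minimal sketch under stated assumptions: the function name `extract_all_zips`, the directory names, and the sample file contents are hypothetical and do not reflect the project's actual layout. So that it runs without the real downloads, the demo first builds a throwaway archive in a temporary directory.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_all_zips(source_dir, target_dir):
    """Extract every .zip archive in source_dir into target_dir.

    Returns the sorted names of the files present in target_dir afterwards.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    for archive_path in sorted(source_dir.glob('*.zip')):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target_dir)
    return sorted(p.name for p in target_dir.iterdir() if p.is_file())

# Demo on a throwaway archive with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / 'downloads'
    downloads.mkdir()
    with zipfile.ZipFile(downloads / 'sample.zip', 'w') as archive:
        archive.writestr('sample.csv', 'tweet_id,sentiment_score\n1,0.5\n')
    extracted = extract_all_zips(downloads, Path(tmp) / 'raw_data')
    print(extracted)
```

Extracting everything into a single folder matches the concatenation step, which globs for all CSV files in one working directory.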

### Concatenating COVID-19 Raw Data

In [18]:
import os
import glob
import pandas as pd

# Change working directory to raw_data, where all of our raw data is stored. Can comment out once the directory has changed.
# os.chdir("./covid-tweets/raw_data")

# create a variable to hold our desired extension, csv in this case.
extension = 'csv'

# store all filenames in current working directory that have a csv extension.
all_filenames = glob.glob('*.{}'.format(extension))

# concatenate all of our files stored in all_filenames using pandas and save output to a new variable.
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# export our combined_csv as a csv file to a separate folder named raw_data_concat for ease of access.
combined_csv.to_csv("../raw_data_concat/raw_data_concat.csv", index=False, encoding='utf-8-sig')

# Change back to root directory of project to allow for the continuous running of the script
# os.chdir("../../")
