# Web scraping prototype

The following code is a prototype that scrapes data from Casino City's Gaming Directory. The directory currently tracks the reopening of gaming properties in the United States following the Covid-19 pandemic; the listing also includes some Canadian properties, such as those in Quebec and New Brunswick.

https://www.gamingdirectory.com/covid-19/reopened/

Notes:
The site requires registration to see the entire directory listing; without it, you are limited to roughly the first 20 records. Because the full data sits behind a login, a script that fetches the live page gets only that partial listing. To work around this, register, log in, save the page locally, and run the code on the local file.
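
Before parsing, it can be worth checking that the saved file really holds the full listing rather than the truncated public view. The following is a minimal sketch using only the standard library. It assumes the raw saved page contains the same `<div style="clear:both">` separator that the scraping code splits on (the exact markup in your saved copy may differ), and `count_rows` and `looks_complete` are hypothetical helper names, not part of the prototype.

```python
from pathlib import Path

# Separator the scraping code splits on; assumed to appear once per record
ROW_SEPARATOR = '<div style="clear:both">'


def count_rows(html_text: str) -> int:
    """Return how many row separators appear in the page source."""
    return html_text.count(ROW_SEPARATOR)


def looks_complete(path: str, public_limit: int = 20) -> bool:
    """Guess whether a saved page holds more than the public preview."""
    html_text = Path(path).read_text(encoding="utf-8", errors="replace")
    return count_rows(html_text) > public_limit
```

For example, `looks_complete("Gaming 2020-06-25.html")` should be true for a page saved while logged in, provided the separator assumption holds.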

Scraping was challenging because the data is poorly structured. Specifically, the `<div>` tags have no unique identifiers and several columns share the same class, so fields have to be told apart by their position within a row.
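
To illustrate the problem, here is a small sketch of telling same-class cells apart by position. The `SAMPLE_ROW` markup is hypothetical (the real page's column order and nesting may differ), and `nth_text` is an illustrative helper. The class names and the position-to-field mapping are the ones used in the code below.

```python
from bs4 import BeautifulSoup

# Hypothetical sample row: the two date cells share one class,
# as do the property-type and status cells.
SAMPLE_ROW = """
<div class="dateCol data"> 25 Jun 2020 </div>
<div class="orgCol data"> Luxor Hotel and Casino </div>
<div class="stateCol data"> Nevada </div>
<div class="statusCol data"> Commercial Casino </div>
<div class="dateCol data"> 16 Mar 2020 </div>
<div class="statusCol data"> Open </div>
"""


def nth_text(row, css_class, index):
    """Return the stripped text of the index-th div with this class, or ''."""
    matches = row.find_all("div", class_=css_class)
    return matches[index].get_text(strip=True) if index < len(matches) else ""


row = BeautifulSoup(SAMPLE_ROW, "html.parser")
record = {
    "DateReopened": nth_text(row, "dateCol data", 0),
    "DateClosed": nth_text(row, "dateCol data", 1),
    "PropertyType": nth_text(row, "statusCol data", 0),
    "Status": nth_text(row, "statusCol data", 1),
}
print(record)
```

Returning an empty string for a missing cell, rather than raising `IndexError`, is a design choice for this sketch; the prototype itself indexes directly and would fail on a malformed row.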

Reference: https://www.dataquest.io/blog/web-scraping-tutorial-python/

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

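# path to the html file saved on your local drive
html_path = "Gaming 2020-06-25.html"

# soupify it; reading as UTF-8 (assumes the page was saved in that encoding)
# keeps accented names from being garbled, and the file is closed afterwards
with open(html_path, encoding="utf-8") as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')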

# If you were to grab data from the live site, this is the syntax, but you will only get the first 20 records
# page = requests.get("https://www.gamingdirectory.com/covid-19/reopened/")
# soup = BeautifulSoup(page.content,'html.parser')

# split the data by the div that separates each row
split_data = soup.prettify().split('<div style="clear:both">')

# here are the relevant column headers
DateReopened = []
Property = []
Jurisdiction = []
PropertyType = []
DateClosed = []
Status = []
Article = []
ArticleLink = []  # declared but not populated in this prototype

# loop through the results, skipping the first two chunks and the last one,
# which do not contain row data
for row in split_data[2:-1]:
    soupedrow = BeautifulSoup(row, 'html.parser')
    # the two date columns and the two status columns share a class,
    # so they are told apart by position
    dates = soupedrow.find_all('div', class_='dateCol data')
    statuses = soupedrow.find_all('div', class_='statusCol data')
    DateReopened.append(dates[0].text)
    DateClosed.append(dates[1].text)
    Property.append(soupedrow.find_all('div', class_='orgCol data')[0].text)
    PropertyType.append(statuses[0].text)
    Status.append(statuses[1].text)
    Jurisdiction.append(soupedrow.find_all('div', class_='stateCol data')[0].text)
    Article.append(soupedrow.find_all('div', class_='articleCol data')[0].text)
    
# get rid of \n (new lines)
DateReopened = [item.strip() for item in DateReopened]
DateClosed = [item.strip() for item in DateClosed]
Property = [item.strip() for item in Property]
PropertyType = [item.strip() for item in PropertyType]
Status = [item.strip() for item in Status]
Jurisdiction = [item.strip() for item in Jurisdiction]
Article = [item.strip() for item in Article]

# put the data into a dataframe
GamingExport = pd.DataFrame({
    "DateReopened": DateReopened,
    "DateClosed": DateClosed,
    "Property": Property,
    "PropertyType": PropertyType,
    "Status": Status,
    "Jurisdiction": Jurisdiction,
    "Article": Article
})

# let's check out the data here
GamingExport


Unnamed: 0,DateReopened,DateClosed,Property,PropertyType,Status,Jurisdiction,Article
0,25 Jun 2020,16 Mar 2020,Luxor Hotel and Casino,Commercial Casino,Open,Nevada,MGM Resorts announces opening dates of more La...
1,25 Jun 2020,16 Mar 2020,Seneca Buffalo Creek Casino,Indian Casino,Open,New York,Seneca Nation announces reopening dates for ga...
2,25 Jun 2020,17 Mar 2020,SouthWind Casino - Newkirk,Indian Casino,Open,Oklahoma,SouthWind Casino - Newkirk reopens
3,24 Jun 2020,17 Mar 2020,Emerald Downs,Horse Track,Racing without Spectators,Washington,Emerald Downs reopening update
4,24 Jun 2020,19 Mar 2020,Hippodrome Trois-Rivières,Horse Track Racino,Open,Quebec,Racing returns to Quebec
...,...,...,...,...,...,...,...
1348,,,Eagle River Bingo & Casino,Bingo Hall,Open,Alaska,
1349,,,Pembroke Hall Bingo,Bingo Hall,Open,Virginia,
1350,,20 Mar 2020,Ranchman's 23,Indian Casino,Open,North Dakota,
1351,,,Riverside Entertainment Center,Commercial Casino,Open,New Brunswick,
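
The date columns come out as strings such as `25 Jun 2020`, and some rows have no date at all. If you want to sort or do arithmetic on them, one option is to convert them with `pd.to_datetime`. The sketch below uses a small stand-in frame (`sample`) built from values shown in the output above; the `DaysClosed` column is an illustrative addition, not part of the prototype.

```python
import pandas as pd

# Small stand-in for GamingExport, using values shown in the output above
sample = pd.DataFrame({
    "DateReopened": ["25 Jun 2020", "24 Jun 2020", ""],
    "DateClosed": ["16 Mar 2020", "17 Mar 2020", "20 Mar 2020"],
    "Property": ["Luxor Hotel and Casino", "Emerald Downs", "Ranchman's 23"],
})

for column in ["DateReopened", "DateClosed"]:
    # errors="coerce" turns blank or malformed strings into NaT
    sample[column] = pd.to_datetime(
        sample[column], format="%d %b %Y", errors="coerce"
    )

# Days each property stayed closed; NaN where a date is missing
sample["DaysClosed"] = (sample["DateReopened"] - sample["DateClosed"]).dt.days
print(sample)
```

Note that `%b` matches abbreviated month names in the current locale, so this assumes an English locale.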


In [None]:
# Export it as a csv
GamingExport.to_csv('GamingExport.csv')
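
By default, `to_csv` writes the DataFrame index as an unnamed first column, so reading the file back with a plain `pd.read_csv` produces an `Unnamed: 0` column. A minimal sketch of two ways to handle that, using a throwaway frame and a temporary directory (both illustrative):

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({
    "Property": ["Emerald Downs"],
    "Status": ["Racing without Spectators"],
})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    # Default export keeps the index; restore it on read with index_col=0
    frame.to_csv(with_index)
    restored = pd.read_csv(with_index, index_col=0)

    # Alternatively, drop the index at export time
    frame.to_csv(without_index, index=False)
    plain = pd.read_csv(without_index)

print(list(restored.columns), list(plain.columns))
```

Which option fits depends on whether the row numbers carry any meaning downstream; for this dataset they are just positions in the scraped listing.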