# Webscraping prototype

The following code is a prototype that scrapes data from Casino City's Gaming Directory. The directory currently tracks the reopening of gaming properties in the United States after closures caused by the Covid-19 pandemic.

https://www.gamingdirectory.com/covid-19/reopened/

Notes:
The site requires registration to see the entire directory listing; otherwise you are limited to roughly the first 20 records. Because the full data sits behind a login, the live site cannot be scraped in full. As a workaround, register, log in, save the page locally, and run the code on the local file.

Scraping was challenging because the data is poorly structured. Specifically, the `<div>` tags have no unique identifiers, which makes it difficult to tell one field from another.
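
Because the `<div>` tags carry only shared class names, one workable approach is to parse each row on its own and tell repeated fields apart by position. The following is a minimal sketch under assumed markup: the `row` wrapper class, the sample HTML, and the `text_of` and `parse_rows` helpers are hypothetical, and only the column class names (`propertyCol`, `cityCol`, `dateCol`) come from the code in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the "row" wrapper class is an assumption made
# for illustration; only the *Col class names appear in the scraper itself.
SAMPLE_HTML = """
<div class="row">
  <div class="propertyCol"> 500 Club </div>
  <div class="cityCol"> Clovis </div>
  <div class="dateCol"> 19 Mar 2020 </div>
  <div class="dateCol"> </div>
</div>
<div class="row">
  <div class="propertyCol"> Agua Caliente Casino Palm Springs </div>
  <div class="cityCol"> Palm Springs </div>
  <div class="dateCol"> 17 Mar 2020 </div>
  <div class="dateCol"> 22 May 2020 </div>
</div>
"""


def text_of(row, class_name, position=0):
    """Return the stripped text of the nth div with this class, or ''."""
    matches = row.find_all("div", class_=class_name)
    if position < len(matches):
        return matches[position].get_text(strip=True)
    return ""


def parse_rows(html):
    """Turn each row wrapper into a dict of field name -> text."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("div", class_="row"):
        records.append({
            "Property": text_of(row, "propertyCol"),
            "City": text_of(row, "cityCol"),
            # Two divs share the dateCol class, so position is the only way
            # to tell them apart (first = closed, second = reopened here).
            "DateClosed": text_of(row, "dateCol", 0),
            "DateReopened": text_of(row, "dateCol", 1),
        })
    return records


records = parse_rows(SAMPLE_HTML)
```

Returning an empty string when a field is missing keeps one malformed row from stopping the whole run with an `IndexError`.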

Reference: https://www.dataquest.io/blog/web-scraping-tutorial-python/

## Format for State data

State-level data is structured differently from the national data list, so the code below has been modified accordingly.


In [45]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# path to the html file saved on your local drive
html_path = "property_ca.html"

# soupify it; the with-block closes the file once parsing is done
with open(html_path) as f:
    soup = BeautifulSoup(f,'html.parser')

# To grab data from the live site instead, use the two lines below, but you will only get roughly the first 20 records
# page = requests.get("https://www.gamingdirectory.com/covid-19/reopened/")
# soup = BeautifulSoup(page.content,'html.parser')

# split the data by the div that separates each row
split_data = soup.prettify().split('<div style="clear:both">')

# here are the relevant column headers
Property = []
City = []
PropertyType = []
Status = []
DateClosed = []
DateReopened = []
Article = []
ArticleLink = []


# loop through the results
counter = 0
for row in split_data:
    # Skip the first three chunks and the last one, which do not contain property rows
    if counter > 2 and counter < len(split_data)-1:
        soupedrow = BeautifulSoup(row,'html.parser')
        Property.append(soupedrow.find_all('div',class_='propertyCol')[0].text)
        City.append(soupedrow.find_all('div',class_='cityCol')[0].text)
        PropertyType.append(soupedrow.find_all('div',class_='typeCol')[0].text)
        Status.append(soupedrow.find_all('div',class_='statusCol')[0].text)
        # the first dateCol div holds the closing date, the second the reopening date
        DateClosed.append(soupedrow.find_all('div',class_='dateCol')[0].text)
        DateReopened.append(soupedrow.find_all('div',class_='dateCol')[1].text)
#         Article.append(soupedrow.find_all('div',class_='articleCol'))
    counter = counter + 1
    
# strip surrounding whitespace, including \n (new lines)
Property = [item.strip() for item in Property]
City = [item.strip() for item in City]
PropertyType = [item.strip() for item in PropertyType]
Status = [item.strip() for item in Status]
DateClosed = [item.strip() for item in DateClosed]
DateReopened = [item.strip() for item in DateReopened]
# Article = [item.strip() for item in Article]

# put the data into a dataframe
GamingExport = pd.DataFrame({
    "Property": Property,
    "City": City,
    "PropertyType": PropertyType,
    "Status": Status,
    "DateClosed": DateClosed,
    "DateReopened": DateReopened,
#     "Article": Article
})

# let's check out the data here
GamingExport

Unnamed: 0,Property,City,PropertyType,Status,DateClosed,DateReopened
0,500 Club,Clovis,Card Room,Closed,19 Mar 2020,
1,Agua Caliente Casino Palm Springs,Palm Springs,Indian Casino,Reopened,17 Mar 2020,22 May 2020
2,Agua Caliente Resort Casino Spa Rancho Mirage,Rancho Mirage,Indian Casino,Reopened,17 Mar 2020,22 May 2020
3,Alameda County Fairgrounds,Pleasanton,Horse Track,Racing without Spectators,17 Mar 2020,19 Jun 2020
4,Army Street Bingo,San Francisco,Bingo Hall,Closed,11 Mar 2020,
...,...,...,...,...,...,...
125,Viejas Casino & Resort,Alpine,Indian Casino,Reopened,20 Mar 2020,18 May 2020
126,Westlane Card Room,Stockton,Card Room,Closed,18 Mar 2020,
127,Winnedumah Winn's Casino,Independence,Indian Casino,Reopened,17 Mar 2020,7 Jun 2020
128,Win-River Resort & Casino,Redding,Indian Casino,Reopened,17 Mar 2020,15 May 2020


In [27]:
# Export it as a csv
GamingExport.to_csv('GamingExport_ca.csv')
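
By default `to_csv` also writes the DataFrame index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back with `pd.read_csv`. The following is a minimal sketch of an alternative export, assuming the dates keep the `19 Mar 2020` style seen in the scraped data: it parses the date strings with `pd.to_datetime` and passes `index=False` on export. The two sample rows and the in-memory buffer are for illustration only and stand in for `GamingExport` and a file on disk.

```python
import io

import pandas as pd

# Two illustrative rows in the same shape as GamingExport
sample = pd.DataFrame({
    "Property": ["500 Club", "Viejas Casino & Resort"],
    "Status": ["Closed", "Reopened"],
    "DateClosed": ["19 Mar 2020", "20 Mar 2020"],
    "DateReopened": ["", "18 May 2020"],
})

# Parse day-month-year strings; blanks become NaT instead of raising
for column in ["DateClosed", "DateReopened"]:
    sample[column] = pd.to_datetime(
        sample[column], format="%d %b %Y", errors="coerce"
    )

# index=False keeps the row index out of the exported data
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read the export back to confirm there is no extra index column
buffer.seek(0)
roundtrip = pd.read_csv(buffer, parse_dates=["DateClosed", "DateReopened"])
```

Storing real dates rather than strings makes later steps, such as sorting by reopening date or computing how long a property stayed closed, straightforward.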