# BFRO Site Scraper

This notebook scrapes Bigfoot sighting data from the Bigfoot Field Researchers Organization (BFRO) report database at http://www.bfro.net/gdb/. Several rounds of link scraping are needed to build a complete list of sighting report links: the main page links to state and province pages, US state pages link to county pages, and county and province pages link to the individual reports. We then loop through the report pages, pull the fields out of each one into a dictionary, and collect the dictionaries in a list. That list becomes a pandas DataFrame that we can clean and write to a CSV file.

In [332]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import random

In [198]:
# Grab html from the bfro geographic database page 
response = requests.get("http://www.bfro.net/gdb")
soup = BeautifulSoup(response.content, 'html.parser')

# Grab every anchor tag that has an href from the main page
links = soup.find_all('a', href=True)

# Grab state and province links for the US and Canada
# Store them separately because US states have county links before sighting report links
canada_links = []
us_state_links = []

base_url = "http://www.bfro.net"

for a in links:
    if "state=ca-" in a['href']:
        canada_links.append(base_url + a['href'])
    elif "state=int" in a['href']:
        pass
    elif "state" in a['href']:
        us_state_links.append(base_url + a['href'])

assert len(us_state_links) == 49
assert len(canada_links) == 9
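
The cell above builds absolute URLs by concatenating `base_url` with each href, which works as long as every href has the expected form. As a minimal sketch of an alternative, the standard library's `urllib.parse.urljoin` resolves root-relative, page-relative, and already-absolute hrefs against the page they came from. The `example_*.asp` hrefs below are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

# Page the links were scraped from (the database URL used above)
page_url = "http://www.bfro.net/gdb/"

# Hypothetical href values, chosen only to show the three common shapes
sample_hrefs = [
    "/gdb/example_state.asp?state=wa",                # root-relative
    "example_county.asp?state=wa&county=King",        # relative to the page
    "http://www.bfro.net/gdb/show_report.asp?id=1",   # already absolute
]

# urljoin resolves each href against the page URL
absolute_urls = [urljoin(page_url, href) for href in sample_hrefs]
for u in absolute_urls:
    print(u)
```

With this approach the same call could replace both `base_url + a['href']` and `"http://www.bfro.net/gdb/" + a['href']`, provided the page URL passed in is the one the href was found on.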

In [199]:
# Loop through US state links and grab all the county links
us_county_links = []

for url in us_state_links:
    
    response = requests.get(url)
    assert response.ok
    
    soup = BeautifulSoup(response.content, 'html.parser')
    anchor_tags = soup.find_all('a', href=lambda href: href and "county" in href)
    
    for a in anchor_tags:
        us_county_links.append("http://www.bfro.net/gdb/" + a['href'])

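The loop above stops at the first failed request, either through `assert response.ok` or through an uncaught network exception. A rough sketch of a retry wrapper that could soften this is shown below. The names `get_with_retry`, `get`, `backoff`, and `sleep` are hypothetical, and the demo uses a fake fetcher rather than a real request.

```python
import time

def get_with_retry(get, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call get(url), retrying on any exception with a growing pause.

    `get` is any callable that takes a URL and raises on failure.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return get(url)
        except Exception as err:  # deliberately broad in this sketch
            last_error = err
            if attempt < retries - 1:
                sleep(backoff * (2 ** attempt))  # pause 1s, 2s, 4s, ...
    raise last_error

# Demo with a fake fetcher that fails twice, then succeeds
calls = []

def flaky_get(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated failure")
    return "ok"

pauses = []  # record the pauses instead of actually sleeping
result = get_with_retry(flaky_get, "http://example.invalid/", sleep=pauses.append)
print(result, len(calls), pauses)

# Possible use with requests (not run here):
# def get(url):
#     response = requests.get(url, timeout=30)
#     response.raise_for_status()
#     return response
# response = get_with_retry(get, "http://www.bfro.net/gdb/")
```

Passing `sleep` as a parameter keeps the demo instant; in a real run the default `time.sleep` also spaces out repeated requests to the site.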
In [323]:
# Collect the sighting report links from each county or province page
def get_report_urls(urls):
    report_urls = []
    for url in urls:
        response = requests.get(url)
        assert response.ok
        soup = BeautifulSoup(response.content, 'html.parser')
        anchor_tags = soup.find_all('a', href=lambda href: href and 'show_report.asp?id' in href)
    
        for a in anchor_tags:
            report_urls.append("http://www.bfro.net/gdb/" + a['href'])
    return report_urls

report_urls = get_report_urls(us_county_links) + get_report_urls(canada_links)
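
Whether the same report can be linked from more than one listing page is not established here. If it can, `report_urls` would contain duplicates. The sketch below shows one way to drop duplicates while keeping the original order, and to read the numeric `id` out of a `show_report.asp?id=...` URL. The helper names and the sample ids are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

def unique_in_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    return list(dict.fromkeys(urls))

def report_id(url):
    """Return the `id` query parameter of a report URL as an int, or None."""
    values = parse_qs(urlparse(url).query).get("id")
    if not values or not values[0].isdigit():
        return None
    return int(values[0])

# Illustrative URLs only; the ids are not real report numbers
sample = [
    "http://www.bfro.net/gdb/show_report.asp?id=101",
    "http://www.bfro.net/gdb/show_report.asp?id=102",
    "http://www.bfro.net/gdb/show_report.asp?id=101",
]
deduped = unique_in_order(sample)
ids = [report_id(u) for u in deduped]
print(len(sample), len(deduped), ids)
```

Comparing `len(report_urls)` with `len(unique_in_order(report_urls))` would show whether any deduplication is needed at all.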


In [424]:
len(report_urls)

5345

In [422]:
def scrape_report_data(url):
    
    report_dict = {}
    
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException as e:  # abort the whole run on any network error
        raise SystemExit(e)
    
    soup = BeautifulSoup(response.content, 'html.parser')
 
    
    # Extract the header information stored in spans with these classes
    html_class = ['reportheader', 'reportclassification']
    for c in html_class:
        element = soup.find('span', {'class': c})
        if element:
            report_dict[c] = element.text.strip()
        else:
            report_dict[c] = "did not find"
        
    # Extract other details: find the <span class="field"> whose text is the
    # label, then take the parent's text with the label removed.
    # (The original special case compared against "LOCATION DETAILS" without
    # the colon, so it never ran; it is dropped here with no change in behavior.)
    def extract(text):
        element = soup.find('span', {'class': 'field'}, string=text)
        if element:
            return element.parent.text.replace(text, "").strip()
        return ""

    # Field labels as they appear on the report page; each becomes a
    # snake_case key, e.g. "NEAREST TOWN:" -> "nearest_town"
    field_labels = [
        "YEAR:", "SEASON:", "MONTH:", "STATE:", "COUNTY:", "NEAREST TOWN:",
        "OBSERVED:", "ALSO NOTICED:", "OTHER WITNESSES:", "OTHER STORIES:",
        "TIME AND CONDITIONS:", "ENVIRONMENT:", "COUNTRY:", "PROVINCE:",
        "LOCATION DETAILS:",
    ]
    for label in field_labels:
        key = label.rstrip(":").lower().replace(" ", "_")
        report_dict[key] = extract(label)
    
    return report_dict

In [425]:
# For each sighting report URL, fetch the page, extract its fields into a
# dictionary, and append that dictionary to the list of results.
# This takes a while: it makes one request per report.
report_data = []
for url in report_urls:
    report_data.append(scrape_report_data(url))
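
Because the loop makes thousands of requests, a single failure late in the run can throw away a lot of work. The following is a sketch of a loop that records failures and carries on instead. `scrape_all` and `fake_scrape` are hypothetical names, and the approach assumes the scraping function raises an ordinary exception. `scrape_report_data` as written raises `SystemExit`, which `except Exception` does not catch, so it would need to re-raise the request error for this to apply.

```python
def scrape_all(urls, scrape, report_every=500):
    """Apply scrape(url) to every URL, collecting results and failures."""
    results, failures = [], []
    for i, url in enumerate(urls, start=1):
        try:
            results.append(scrape(url))
        except Exception as err:  # SystemExit is NOT caught here
            failures.append((url, repr(err)))
        if i % report_every == 0:
            print(f"{i}/{len(urls)} done, {len(failures)} failed")
    return results, failures

# Demo with a stand-in for the real scraping function
def fake_scrape(url):
    if url.endswith("bad"):
        raise ValueError("simulated failure")
    return {"url": url}

results, failures = scrape_all(["a", "bad", "c"], fake_scrape)
print(len(results), failures)
```

The `failures` list keeps the URLs that can be retried afterwards, so a second pass only needs to visit those.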

In [427]:
len(report_data)

5345

In [429]:
df = pd.DataFrame(report_data)

In [431]:
df.sample(10)

|  | reportheader | reportclassification | year | season | month | state | county | nearest_town | observed | also_noticed | other_witnesses | other_stories | time_and_conditions | environment | country | province | location_details |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4150 | Report # 26179 | (Class A) | 2002 | Winter | February | Utah | Washington County | Ivins, Utah | My name is [omitted]. My ex-husband and I were... |  | Two. Myself & my ex-husband. We were driving... | No, but there alot of business named "Big Foot... | About 7:00 P.M. in the evening. | Desert but by the river it is lush with large ... |  |  | Near the Santa Clara River by the Shivwits Ind... |
| 4785 | Report # 47470 | (Class B) | 2010 | Fall | September | Washington | Spokane County | Mica, WA | It was fall. About 9 at night. Just getting d... |  | 1 witness. |  | Just before dark. At dusk. Roughly 8:30-9:00p... | Pine forest near somewhat recent clearcut. |  |  | On Mica Peak, WA |
| 2078 | Report # 23929 | (Class A) | 1970 | Fall | September | Michigan | Baraga County | Between Three Lakes & Tioga | The following is a letter written to The Minin... |  |  |  |  |  |  |  |  |
| 3073 | Report # 4950 | (Class B) | 1998 | Spring | May | Ohio | Harrison County | Dennison/Leesville | My wife was going to use the outhouse just bef... | Something hit the Winnebago on the side and ev... | My wife was only witness, all others were aslee |  | We have a camp out on Memorial weekend every y... | Typical Ohio woodlands. My land is a saddle b... |  |  | Its about 5 mi. from Tappan Lake and about the... |
| 3464 | Report # 676 | (Class B) | 1994 | Summer | June | Oregon | Jefferson County |  | 0300 6/15/94 a large animal walked through ou... |  | drinking brandy by a camp fire |  | 2:30am - 5:30am | second growth mixed forest of cedar and fir wi... |  |  | 3 miles w of camp monte on the metolius river ... |
| 1339 | Report # 48686 | (Class B) | 2015 | Spring | May | Iowa | Hardin County | Iowa Falls | I was driving home from work at nearly midnigh... | It ran over the broad in less than two seconds. | No | No, but when I am outside at night, occasional... | Night clear and warm. Slight breeze | Iowa greenbelt. Forest, hilly, and very dense ... |  |  | OO avenue and 155th street. |
| 1524 | Report # 11831 | (Class A) | 2005 | Summer | June | Illinois | Grundy County | Seneca | I was just out snake hunting the roads and I s... | A pungent musky old mop smell. | No |  | 6:30 - 7:00 p.m. starting to get dark. | Marsh and woodland. |  |  | This is on private land. |
| 8 | Report # 1258 | (Class A) | Late 1970's | Summer |  | Alaska | Fairbanks County |  | I was part of a group of about a dozen Army pe... |  | Just sitting around relaxing | I was telling a friend about this and he said ... |  | High rugged mountains |  |  | Black Rapids Glacier, Alaska Richardson Hiway ... |
| 40 | Report # 67423 | (Class A) | 2020 | Summer | June | Alabama | Cleburne County | Heflin | Me and my wife and two girls were going to see... | As I past where we seen it cross right down th... | 2 just riding listening to the radio | No | Early Afternoon, Hot , clear skies | Mostly woods and river bottoms |  |  | You get of at the first AL exit after you go a... |
| 3292 | Report # 1300 | (Class B) | 1999 | Fall | November | Oklahoma | Le Flore County | Talehina | I was deer hunting with a friend for whom I ha... | There was a decrease of wildlife in the area, ... |  |  |  | The terrain was very dense and was extremly th... |  |  |  |

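The sample shows that `reportheader` and `reportclassification` still carry their surrounding text (`Report # 26179`, `(Class A)`). As a first cleaning step, a minimal sketch with regular expressions could pull out the report number and the class letter. The function names are hypothetical, and the patterns are based only on the values visible in the sample.

```python
import re

def parse_report_number(header):
    """'Report # 26179' -> 26179; None if no number is present."""
    match = re.search(r"#\s*(\d+)", header)
    return int(match.group(1)) if match else None

def parse_class(classification):
    """'(Class A)' -> 'A'; None if the pattern is absent."""
    match = re.search(r"Class\s+([A-Z])", classification)
    return match.group(1) if match else None

# Values taken from the sample rows, plus the scraper's fallback string
number = parse_report_number("Report # 26179")
letter = parse_class("(Class B)")
missing = parse_report_number("did not find")
print(number, letter, missing)
```

On the DataFrame these could be applied with `df['reportheader'].map(parse_report_number)` and `df['reportclassification'].map(parse_class)`. The `year` column needs separate handling, since it is not always a plain number (one sampled row has `Late 1970's`).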

In [435]:
df.to_csv('data/bfro_raw.csv')