# Bigfoot Data - Collection and Cleaning
This Jupyter notebook scrapes report URLs from [The Bigfoot Field Researchers Organization (BFRO)](https://www.bfro.net) and converts the reports into a structured DataFrame. After standard cleaning procedures, the cleaned DataFrame is saved.
## Part I: Data Collection
We are collecting Bigfoot sighting data from the BFRO, a group that has been documenting reports since 1995. This extensive collection of entries will provide a robust dataset for exploration. The BFRO considers itself the only *scientific* Bigfoot research group and therefore holds itself to a high research standard, which is reflected in the quality and consistency of its reports. That consistent report structure is what makes a streamlined scraping process possible.

In [84]:
# Import Dependencies
import requests
from bs4 import BeautifulSoup
import logging
import pandas as pd
import json

In [70]:

bigfoot_url = 'https://www.bfro.net/'

# Configure logging
logging.basicConfig(
    filename='bigfoot_scraper.log',  # Log file name
    filemode='a',  # Append mode
    level=logging.DEBUG,  # Log everything from DEBUG and above
    format='%(asctime)s - %(levelname)s - %(message)s'
)

In [71]:
def soupify_website(url):
    """
    Fetches the HTML content of a given URL and parses it into a BeautifulSoup object.

    Parameters:
        url (str): The URL of the webpage to be scraped.

    Returns:
        BeautifulSoup: A BeautifulSoup object representing the parsed HTML of the webpage.
    
    Raises:
        ValueError: If the HTTP response status is not 200(OK)
    """
    
    # Fetch the page; the timeout keeps a stalled request from hanging the scraper
    response = requests.get(url, timeout=30)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup
    else:
        raise ValueError(f"Failed to fetch {url}: Status code {response.status_code}")

def get_links(soup_list):
    """
    Extracts all hyperlinks from a list of BeautifulSoup objects.

    Parameters:
        soup_list (list): A list of BeautifulSoup objects to search for links.

    Returns:
        list: A list of hyperlink strings (`href` values) extracted from the provided BeautifulSoup objects.
    
    Notes:
        - Only links with an `href` attribute will be included.
        - Duplicate links are not removed; the returned list may contain duplicates.
    """
    link_list = []
    for soup in soup_list:
        links = soup.find_all('a')
        for link in links:
            if link.get('href'):
                url = link.get('href')
                link_list.append(url)
    return link_list
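
The docstring for `get_links` notes that duplicate links are not removed. If duplicates ever become a problem (for instance, a page that links to the same report twice), one option is to deduplicate while preserving the original order. This is only a sketch: `dedupe_links` is a hypothetical helper that is not used elsewhere in this notebook, and the sample hrefs are made up for illustration.

```python
def dedupe_links(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(links))


# Made-up hrefs, for illustration only
sample_links = [
    'show_report.asp?id=1',
    'show_report.asp?id=2',
    'show_report.asp?id=1',
]
unique_links = dedupe_links(sample_links)
print(unique_links)
```

Using `dict.fromkeys` rather than `set` keeps the links in page order, which makes the scraper logs easier to follow.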



### Example Report: Our Content
To illustrate our scraping process, we will consider [Report #77968](https://www.bfro.net/GDB/show_report.asp?id=77968) from Douglas County, Nebraska. The sections detailing the encounter observations and follow-up research are intriguing and could be useful for future expansion, but for this notebook our interest lies in the first sections of the report.
<figure>
<img src='images/Report%2077968%20relevant%20information.png' width = 500>
<figcaption>
The relevant information from <a href="https://www.bfro.net/GDB/show_report.asp?id=77968"> Report #77968 </a> that we will be collecting.
</figcaption> 
</figure>

These sections provide key details such as location, date, classification, and reporter. Conveniently, each section is marked off by a `span` element, so it is straightforward to collect the desired information and store it in a dictionary for processing.

In [72]:
def create_sighting_dictionary(url):
    """
    Extracts information from an individual report page and returns it as a dictionary.

    Parameters:
        url (str): The URL of the report page.

    Returns:
        dict: A dictionary with the extracted information, or None if the page cannot be parsed.
    """
    logging.info(f"Fetching report from: {url}")
    report = soupify_website(url)

    if not report:
        logging.error(f"Failed to parse report page: {url}")
        return None

    try:
        report_dict = {}

        # Extract 'Report Number'
        report_header = report.find('span', class_='reportheader')
        report_dict['Report Number'] = report_header.text.strip() if report_header else 'N/A'

        # Extract 'Report Classification'
        report_class = report.find('span', class_='reportclassification')
        report_dict['Report Class'] = report_class.text.strip() if report_class else 'N/A'

        # Extract additional fields
        fields = report.find_all('span', class_='field')
        for field in fields:
            # Get the full text of the parent element
            text = field.parent.text.strip()

            # Only process fields in the format "Header: Value"
            if ':' in text:
                # Split into field name and value
                field_name, value = text.split(':', 1)

                # Clean up field name and value
                field_name = field_name.strip().lower()
                value = value.strip()

                # Validate input (ensure no line breaks in the value)
                if len(value.split('\n')) == 1:
                    # Store the field and value in the dictionary
                    report_dict[field_name] = value

        return report_dict

    except Exception as e:
        logging.error(f"Error processing report {url}: {e}")
        return None
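
The field-handling rules inside `create_sighting_dictionary` can be tried out without touching the network. The sketch below pulls the same checks (a colon must be present, the text is split on the first colon only, and multi-line values are skipped) into a hypothetical helper named `parse_field`. The sample strings are made up for illustration and may not match the site's exact wording.

```python
def parse_field(text):
    """
    Split "Header: Value" text into a (name, value) tuple.

    Returns None when there is no colon or the value spans several lines,
    mirroring the checks in create_sighting_dictionary.
    """
    text = text.strip()
    if ':' not in text:
        return None

    # Split on the first colon only, so values may contain colons themselves
    name, value = text.split(':', 1)
    name = name.strip().lower()
    value = value.strip()

    # Skip values that contain line breaks
    if '\n' in value:
        return None
    return name, value


print(parse_field('NEAREST TOWN: Omaha'))
# → ('nearest town', 'Omaha')
print(parse_field('TIME AND CONDITIONS: About 12:00 Midnight'))
# → ('time and conditions', 'About 12:00 Midnight')
print(parse_field('OBSERVED: first line\nsecond line'))
# → None
```

Splitting with `split(':', 1)` matters for values such as times, which would otherwise be cut at their own colon.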

### Getting the Reports
The path to each Bigfoot sighting listing follows a structured hierarchy: GDB/State/County/Listing.
- The GDB (Geographic Database of Bigfoot Sightings and Reports) serves as the starting point. It contains several tables, each linking to a state (or province, etc.) page.
- Each state page features similar tables that provide links to county pages.
- County pages only exist if there are recorded Bigfoot sightings in that county. These pages link to the county's sighting listings.
- Finally, county sightings pages list all recorded sightings for that county, with links and short descriptions for each report.

Having created the function to extract data from individual listings, we will now work through the above hierarchy in reverse, starting with county pages and ending with the GDB.
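
Each level of the hierarchy links to the next with `href` values that are not full URLs, so every link has to be combined with a base address before it can be fetched. The cells in this notebook do that with plain string concatenation. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves a link against the page it was found on. The two href forms below are assumptions about what the site returns, chosen to be consistent with the URLs used elsewhere in this notebook.

```python
from urllib.parse import urljoin

county_page_url = 'https://www.bfro.net/GDB/show_county_reports.asp?state=NE&county=Douglas'

# A link relative to the current page (assumed href form)
report_url = urljoin(county_page_url, 'show_report.asp?id=77968')
print(report_url)
# → https://www.bfro.net/GDB/show_report.asp?id=77968

# A root-relative link (assumed href form)
state_url = urljoin(county_page_url, '/GDB/state_listing.asp?state=NE')
print(state_url)
# → https://www.bfro.net/GDB/state_listing.asp?state=NE
```

Because `urljoin` handles both forms, the same call would work at every level of the hierarchy without a separate base URL for each.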

In [73]:
def scrape_county(url):
    """
    Scrapes every report linked from a county sightings page.

    Parameters:
        url (str): The URL of the county sightings page to scrape.

    Returns:
        list: A list of dictionaries containing data from individual Bigfoot sighting reports.
    """
    logging.info(f"Starting to scrape county page: {url}")
    report_dictionaries = []
    base_url = 'https://www.bfro.net/GDB/'

    # Fetch the county page
    county_soup = soupify_website(url)
    if not county_soup:
        logging.error(f"Failed to fetch or parse county page: {url}")
        return report_dictionaries

    # Find all report links on the county page
    county_reports = county_soup.find_all('span', class_='reportcaption')
    report_urls = get_links(county_reports)
    logging.info(f"Found {len(report_urls)} reports on county page: {url}")

    # Process each report URL
    for report_url in report_urls:
        full_url = base_url + report_url
        try:
            logging.debug(f"Scraping report: {full_url}")
            report_dict = create_sighting_dictionary(full_url)
            if report_dict:
                report_dictionaries.append(report_dict)
            else:
                logging.warning(f"No data extracted for report: {full_url}")
        except Exception as e:
            logging.error(f"Error scraping report {full_url}: {e}")

    logging.info(f"Finished scraping county page: {url}")
    return report_dictionaries


In [74]:
# Testing with Douglas County NE, should have 3 listings as of 12/3/2024
county_url = 'https://www.bfro.net/GDB/show_county_reports.asp?state=NE&county=Douglas'
sighting_url = 'https://www.bfro.net/GDB/'
my_reports = scrape_county(county_url)
print(f"Douglas County has {len(my_reports)} sightings")
print("Printing sighting dictionaries:")
for report in my_reports:
    print(report)


Douglas County has 3 sightings
Printing sighting dictionaries:
{'Report Number': 'Report # 77968', 'Report Class': '(Class A)', 'year': '2024', 'season': 'Fall', 'date': 'Nov 8th 2024', 'state': 'Nebraska', 'county': 'Douglas County', 'location details': '6th street and Martha, near Lauritzen Gardens[Investigator (MM) Notes:GPS coordinates for where the creature walked between structures:41.237966, -95.923810  ]', 'nearest town': 'Omaha', 'nearest road': '6th and Martha', 'observed': 'On the 8th of November 2024, I was in backyard in the driveway of my ex-wife\'s house.  I was standing beside my truck putting away some tools as we were talking.  I don\'t know what made me turn and look up the driveway, as I did I saw a tall (large) hairy figure walking behind a duplex behind the garage.  This space between the building is approximately 20-25 feet.  I only saw this figure for a few seconds.  I turned to my ex-wife and said, "Did I just see what I thought I saw?" She asked "What did

In [75]:
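def scrape_state(state_url):
    """
    Scrapes every report from every county listed on a state page.

    Parameters:
        state_url (str): The URL of the state listing page to scrape.

    Returns:
        list: A list of dictionaries, one per Bigfoot sighting report in the state.
    """
    base_url = 'https://www.bfro.net/GDB/'

    # County links live in tables with the 'countytbl' class
    state_soup = soupify_website(state_url)
    tables = state_soup.find_all('table', class_='countytbl')
    county_links = get_links(tables)

    # Scrape each county page and pool its reports
    state_reports = []
    for link in county_links:
        state_reports.extend(scrape_county(base_url + link))
    return state_reports


nebraska_reports = scrape_state('https://www.bfro.net/GDB/state_listing.asp?state=NE')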

In [None]:
gdb_soup = soupify_website('https://www.bfro.net/GDB')
gdb_tables = gdb_soup.find_all('table', class_='countytbl')
state_urls = [x for x in get_links(gdb_tables) if 'state=int' not in x]

In [77]:
all_reports = []

for url in state_urls:
    full_url = 'https://www.bfro.net' + url
    logging.info(f"Scraping state page: {full_url}")
    try:
        # Scrape the state page for reports
        state_dicts = scrape_state(full_url)

        # Append each report dictionary to the master list
        all_reports.extend(state_dicts)

    except Exception as e:
        logging.error(f"Error scraping state page {full_url}: {e}")


In [None]:
print(f"We found a total of {len(all_reports)} entries.")

# The full scrape takes about an hour, so save the raw results to avoid rerunning it
with open("data/raw_scraping_data.json", "w") as file:
    json.dump(all_reports, file)

We found a total of 5153 entries.
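
Since the raw results are saved to JSON, later sessions can read that file instead of scraping again. The sketch below shows one way to wrap that choice. `load_or_scrape` is a hypothetical helper, and the demo uses a temporary directory and a stand-in scrape function so that it runs without network access.

```python
import json
import os
import tempfile


def load_or_scrape(path, scrape_fn):
    """Load cached reports from path, or call scrape_fn and cache its result."""
    if os.path.exists(path):
        with open(path) as file:
            return json.load(file)

    reports = scrape_fn()
    # Create the target folder if needed before writing the cache
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'w') as file:
        json.dump(reports, file)
    return reports


# Demo with a stand-in for the real scraping loop
calls = []

def fake_scrape():
    calls.append(1)
    return [{'Report Number': 'Report # 1', 'Report Class': '(Class A)'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, 'raw_scraping_data.json')
    first = load_or_scrape(cache_path, fake_scrape)   # scrapes and writes the cache
    second = load_or_scrape(cache_path, fake_scrape)  # reads the cache

print(f"Scrape function ran {len(calls)} time(s)")
```

In this notebook the real call would pass the state-scraping loop as `scrape_fn` and `data/raw_scraping_data.json` as the path.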


In [89]:
bigfoot_df = pd.DataFrame(all_reports)

In [90]:
bigfoot_df.head()

Unnamed: 0,Report Number,Report Class,year,season,month,state,county,location details,nearest town,nearest road,observed,also noticed,other witnesses,other stories,time and conditions,environment,date,a & g references
0,Report # 13038,(Class A),2004,Winter,February,Alaska,Anchorage County,Up near powerline clearings east of Potter Mar...,Anchorage / Hillside,No real roads in the area,I and two of my friends were bored one night s...,"Some tracks in the snow, and a clearing in the...",My two friends were snowmachining behind me bu...,I have not heard of any other incidents in Anc...,Middle of the night. The only light was the he...,"In the middle of the woods, in a clearing cove...",,
1,Report # 8792,(Class B),2003,Winter,December,Alaska,Anchorage County,"Few houses on the way, a power relay station. ...",Anchorage,Dowling,"Me and a couple of friends had been bored, whe...","We smelled of colonge and after shave, and one...","4. Me, w-man, warren and sean. We were at my h...",no,"Started at 11, ended at about 3-3:30. Weather ...","A pine forest, with a bog or swamp on the righ...",Friday night,
2,Report # 1255,(Class B),1998,Fall,September,Alaska,Bethel County,"45 miles by air west of Lake Iliamna, Alaska i...",,,My hunting buddy and I were sitting on a ridge...,nothing unusual,Scouting for caribou with high quality binoculars,,,Call Iliamna Air taxi for lat & Long of Long L...,3,
3,Report # 11616,(Class B),2004,Summer,July,Alaska,Bristol Bay County,"Approximately 95 miles east of Egegik, Alaska....",Egegik,,"To whom it may concern, I am a commercial fish...",Just these foot prints and how obvious it was ...,"One other witness, and he was fishing prior to...","I've only heard of one other story, from an ol...","Approximately 12:30 pm, partially coudy/sunny.","Lake front,creek spit, gravel and sand, alder ...",20,
4,Report # 637,(Class A),2000,Summer,June,Alaska,Cordova-McCarthy County,"On the main trail toward the glacier, before t...","Kennikot, Alaska",not sure,My hiking partner and I arrived late to the Ke...,I did hear what appeared to be grunting in the...,"I was the only witness, there was one other in...",,About 12:00 Midnight / full moon / clear / dim...,This sighting was located at approximately 1 t...,16,


In [83]:
bigfoot_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5153 entries, 0 to 5152
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Report Number        5153 non-null   object
 1   Report Class         5153 non-null   object
 2   year                 5153 non-null   object
 3   season               5153 non-null   object
 4   month                4532 non-null   object
 5   state                5153 non-null   object
 6   county               5153 non-null   object
 7   location details     4390 non-null   object
 8   nearest town         4832 non-null   object
 9   nearest road         4457 non-null   object
 10  observed             5114 non-null   object
 11  also noticed         3451 non-null   object
 12  other witnesses      4681 non-null   object
 13  other stories        3707 non-null   object
 14  time and conditions  4670 non-null   object
 15  environment          4874 non-null   object
 16  date  

In [107]:
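# Keep only the numeric ID, e.g. 'Report # 77968' -> 77968
bigfoot_df['Report Number'] = bigfoot_df['Report Number'].apply(lambda x: pd.to_numeric(x.split('Report # ')[1]))
# Keep only the class letter, e.g. '(Class A)' -> 'A'
bigfoot_df['Report Class'] = bigfoot_df['Report Class'].apply(lambda x: x[7:-1])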
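
The slicing above relies on every value having exactly the expected shape. `create_sighting_dictionary` can also store the placeholder `'N/A'`, and on such a row the split would raise an `IndexError` while the slice would return an empty string. A more forgiving alternative is to pull the pieces out with regular expressions. The two helpers below are hypothetical and shown only as a sketch.

```python
import re


def parse_report_number(text):
    """Return the integer in a string such as 'Report # 77968', or None."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


def parse_report_class(text):
    """Return the class letter in a string such as '(Class A)', or None."""
    match = re.search(r'Class\s+(\w)', text)
    return match.group(1) if match else None


print(parse_report_number('Report # 77968'))  # → 77968
print(parse_report_class('(Class A)'))        # → A
print(parse_report_number('N/A'))             # → None
```

These could be applied with `bigfoot_df['Report Number'].apply(parse_report_number)` in place of the lambdas, at the cost of a column that may then contain missing values.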

In [109]:
print(bigfoot_df['year'].value_counts())

year
2012         194
2000         188
2004         173
2006         173
2005         168
            ... 
1978-1988      1
1967/1993      1
early90s       1
Sep 2014       1
1890           1
Name: count, Length: 435, dtype: int64


In [113]:
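# Keep rows whose year is exactly four digits; .copy() lets us add columns later without a SettingWithCopyWarning
filtered_years = bigfoot_df[bigfoot_df['year'].str.match(r'^\d{4}$', na=False)].copy()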

print(f'There are {len(filtered_years)} entries with a valid year field.')

There are 4777 entries with a valid year field.
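
The filter keeps 4777 of the 5153 rows, so 376 rows with a non-standard year are dropped. To see which kinds of values pass the pattern, the sketch below runs the same regular expression with the standard `re` module against a few values taken from the `value_counts()` output above.

```python
import re

YEAR_PATTERN = re.compile(r'^\d{4}$')

# Values copied from the value_counts() output above
samples = ['2012', '1890', '1978-1988', '1967/1993', 'early90s', 'Sep 2014']

kept = [value for value in samples if YEAR_PATTERN.match(value)]
dropped = [value for value in samples if not YEAR_PATTERN.match(value)]

print('kept:   ', kept)
print('dropped:', dropped)
```

Ranges and free-text years are discarded outright. Some of them, such as `Sep 2014`, contain a usable year, so recovering those would be a possible refinement.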


In [114]:
filtered_years.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4777 entries, 0 to 5152
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Report Number        4777 non-null   int64 
 1   Report Class         4777 non-null   object
 2   year                 4777 non-null   object
 3   season               4777 non-null   object
 4   month                4315 non-null   object
 5   state                4777 non-null   object
 6   county               4777 non-null   object
 7   location details     4067 non-null   object
 8   nearest town         4482 non-null   object
 9   nearest road         4141 non-null   object
 10  observed             4742 non-null   object
 11  also noticed         3191 non-null   object
 12  other witnesses      4337 non-null   object
 13  other stories        3434 non-null   object
 14  time and conditions  4327 non-null   object
 15  environment          4513 non-null   object
 16  date       

In [119]:
filtered_years['month'].value_counts()

month
October      581
August       576
July         561
September    466
June         430
November     426
May          279
April        247
December     210
January      209
March        181
February     149
Name: count, dtype: int64

In [126]:
# Lookup table from month name to month number
month_name_to_number = {
    "January": 1,
    "February": 2,
    "March": 3,
    "April": 4,
    "May": 5,
    "June": 6,
    "July": 7,
    "August": 8,
    "September": 9,
    "October": 10,
    "November": 11,
    "December": 12
}

# Map month names to numbers; missing or unrecognized months become NaN
filtered_years.loc[:, 'month numeric'] = filtered_years['month'].map(month_name_to_number)
filtered_years.head()

Unnamed: 0,Report Number,Report Class,year,season,month,state,county,location details,nearest town,nearest road,observed,also noticed,other witnesses,other stories,time and conditions,environment,date,a & g references,month numeric
0,13038,A,2004,Winter,February,Alaska,Anchorage County,Up near powerline clearings east of Potter Mar...,Anchorage / Hillside,No real roads in the area,I and two of my friends were bored one night s...,"Some tracks in the snow, and a clearing in the...",My two friends were snowmachining behind me bu...,I have not heard of any other incidents in Anc...,Middle of the night. The only light was the he...,"In the middle of the woods, in a clearing cove...",,,2.0
1,8792,B,2003,Winter,December,Alaska,Anchorage County,"Few houses on the way, a power relay station. ...",Anchorage,Dowling,"Me and a couple of friends had been bored, whe...","We smelled of colonge and after shave, and one...","4. Me, w-man, warren and sean. We were at my h...",no,"Started at 11, ended at about 3-3:30. Weather ...","A pine forest, with a bog or swamp on the righ...",Friday night,,12.0
2,1255,B,1998,Fall,September,Alaska,Bethel County,"45 miles by air west of Lake Iliamna, Alaska i...",,,My hunting buddy and I were sitting on a ridge...,nothing unusual,Scouting for caribou with high quality binoculars,,,Call Iliamna Air taxi for lat & Long of Long L...,3,,9.0
3,11616,B,2004,Summer,July,Alaska,Bristol Bay County,"Approximately 95 miles east of Egegik, Alaska....",Egegik,,"To whom it may concern, I am a commercial fish...",Just these foot prints and how obvious it was ...,"One other witness, and he was fishing prior to...","I've only heard of one other story, from an ol...","Approximately 12:30 pm, partially coudy/sunny.","Lake front,creek spit, gravel and sand, alder ...",20,,7.0
4,637,A,2000,Summer,June,Alaska,Cordova-McCarthy County,"On the main trail toward the glacier, before t...","Kennikot, Alaska",not sure,My hiking partner and I arrived late to the Ke...,I did hear what appeared to be grunting in the...,"I was the only witness, there was one other in...",,About 12:00 Midnight / full moon / clear / dim...,This sighting was located at approximately 1 t...,16,,6.0
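
The month lookup table does not have to be typed out by hand. As a sketch, the standard library's `calendar.month_name` can generate it. This assumes the default English locale, since the names in `calendar.month_name` follow the current locale setting. `month_lookup` is a hypothetical name used here to avoid overwriting the dictionary defined above.

```python
import calendar

# calendar.month_name[0] is an empty string, so skip it with the `if name` filter
month_lookup = {
    name: number
    for number, name in enumerate(calendar.month_name)
    if name
}

print(month_lookup['February'])       # → 2
print(month_lookup.get('Fall 2004'))  # → None
```

As with the hand-written dictionary, `.get` returns `None` for anything that is not a full month name, which is why rows without a month end up with a missing `month numeric` value.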


In [129]:
bare_bones_bigfoot = filtered_years[['Report Number', 'Report Class', 'year', 'season', 'month', 'date', 'state', 'county']]

filtered_years.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4777 entries, 0 to 5152
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Report Number        4777 non-null   int64  
 1   Report Class         4777 non-null   object 
 2   year                 4777 non-null   object 
 3   season               4777 non-null   object 
 4   month                4315 non-null   object 
 5   state                4777 non-null   object 
 6   county               4777 non-null   object 
 7   location details     4067 non-null   object 
 8   nearest town         4482 non-null   object 
 9   nearest road         4141 non-null   object 
 10  observed             4742 non-null   object 
 11  also noticed         3191 non-null   object 
 12  other witnesses      4337 non-null   object 
 13  other stories        3434 non-null   object 
 14  time and conditions  4327 non-null   object 
 15  environment          4513 non-null   object

In [130]:
# index=False keeps the leftover row index from being written as an unnamed column
bare_bones_bigfoot.to_csv('data/bare_bones_bigfoot.csv', index=False)