# Data Cleaning

## Introduction
This notebook focuses on cleaning the raw Bigfoot sightings dataset collected during the scraping process. The goal is to create a well-structured and standardized dataset that can be used for analysis and visualization.


In [None]:
# Import dependencies
import pandas as pd
from datetime import datetime

# Load the raw scraping data
bigfoot_df = pd.read_json('../data/raw_scraping_data.json')

## Exploring the Dataset
### Overview of Data
First, we use the `.info()` method to inspect the dataset and identify columns with missing values.

In [66]:
bigfoot_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5153 entries, 0 to 5152
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Report Number        5153 non-null   object
 1   Report Class         5153 non-null   object
 2   year                 5153 non-null   object
 3   season               5153 non-null   object
 4   month                4532 non-null   object
 5   state                5153 non-null   object
 6   county               5153 non-null   object
 7   location details     4390 non-null   object
 8   nearest town         4832 non-null   object
 9   nearest road         4457 non-null   object
 10  observed             5114 non-null   object
 11  also noticed         3451 non-null   object
 12  other witnesses      4681 non-null   object
 13  other stories        3707 non-null   object
 14  time and conditions  4670 non-null   object
 15  environment          4874 non-null   object
 16  date  
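
As a complement to `.info()`, missing values can also be counted directly. The snippet below is a minimal sketch on a hypothetical three-row sample (the values are illustrative, not taken from the dataset); `sample`, `missing_counts`, and `missing_share` are names introduced here for illustration only.

```python
import pandas as pd

# Hypothetical sample standing in for the raw data; values are illustrative.
sample = pd.DataFrame({
    "state": ["Alaska", "Alaska", "Ohio"],
    "month": ["February", None, "July"],
    "nearest road": [None, None, "Dowling"],
})

# Number of missing values per column, largest first.
missing_counts = sample.isna().sum().sort_values(ascending=False)

# Fraction of rows that are missing in each column (between 0 and 1).
missing_share = sample.isna().mean()

print(missing_counts)
```

On the real data the same two lines could be run against `bigfoot_df` to rank columns by how incomplete they are.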

### Preview of Data
Use the `.head()` method to see the first few rows of the DataFrame.

In [67]:
bigfoot_df.head(2)

Unnamed: 0,Report Number,Report Class,year,season,month,state,county,location details,nearest town,nearest road,observed,also noticed,other witnesses,other stories,time and conditions,environment,date,a & g references
0,Report # 13038,(Class A),2004,Winter,February,Alaska,Anchorage County,Up near powerline clearings east of Potter Mar...,Anchorage / Hillside,No real roads in the area,I and two of my friends were bored one night s...,"Some tracks in the snow, and a clearing in the...",My two friends were snowmachining behind me bu...,I have not heard of any other incidents in Anc...,Middle of the night. The only light was the he...,"In the middle of the woods, in a clearing cove...",,
1,Report # 8792,(Class B),2003,Winter,December,Alaska,Anchorage County,"Few houses on the way, a power relay station. ...",Anchorage,Dowling,"Me and a couple of friends had been bored, whe...","We smelled of colonge and after shave, and one...","4. Me, w-man, warren and sean. We were at my h...",no,"Started at 11, ended at about 3-3:30. Weather ...","A pine forest, with a bog or swamp on the righ...",Friday night,


## Cleaning the Data

### 1. Standardizing Columns

In [78]:
bigfoot_df.columns = bigfoot_df.columns.str.replace(' ','_').str.lower()
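
To make the effect of this one-liner concrete, here is a small sketch on an empty DataFrame that reuses three of the column names shown earlier. The second, stricter variant (collapsing any run of non-alphanumeric characters into a single underscore) is an optional alternative and an assumption on our part, not what the notebook does.

```python
import pandas as pd

# Empty frame that only carries column names from the raw data.
sample = pd.DataFrame(columns=["Report Number", "nearest town", "a & g references"])

# Same transformation as in the notebook: spaces to underscores, then lowercase.
sample.columns = sample.columns.str.replace(" ", "_").str.lower()
renamed = list(sample.columns)

# Optional stricter variant (assumption): also drop symbols such as "&".
raw_names = pd.Index(["Report Number", "nearest town", "a & g references"])
strict = list(
    raw_names.str.lower()
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    .str.strip("_")
)

print(renamed)
print(strict)
```

The notebook's version keeps the `&` in `a_&_g_references`, which is why that column has to be accessed with bracket notation rather than attribute access.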

### 2. Fixing Up Column Values
These quick fixes convert `report_number` to a numeric value, reduce `report_class` to its letter, parse `date` into a datetime column, and lowercase `environment`.

In [81]:
# Strip the "Report # " prefix and convert the remainder to a number
bigfoot_df['report_number'] = bigfoot_df['report_number'].apply(
    lambda x: pd.to_numeric(x.split('Report # ')[1])
    if isinstance(x, str) and 'Report # ' in x else x
)

# Reduce "(Class A)" to "A"
bigfoot_df['report_class'] = bigfoot_df['report_class'].apply(
    lambda x: x[6:-1].strip() if len(x) > 1 else x
)

# Standardize dates; values that are not in YYYY-MM-DD form become NaT
bigfoot_df['date_cleaned'] = pd.to_datetime(
    bigfoot_df['date'], errors='coerce', format='%Y-%m-%d'
)

# Lowercase the environment descriptions
bigfoot_df['environment'] = bigfoot_df['environment'].str.lower()
bigfoot_df.head(2)


Unnamed: 0,report_number,report_class,year,season,month,state,county,location_details,nearest_town,nearest_road,observed,also_noticed,other_witnesses,other_stories,time_and_conditions,environment,date,a_&_g_references,date_cleaned
0,13038,A,2004,Winter,February,Alaska,Anchorage County,Up near powerline clearings east of Potter Mar...,Anchorage / Hillside,No real roads in the area,I and two of my friends were bored one night s...,"Some tracks in the snow, and a clearing in the...",My two friends were snowmachining behind me bu...,I have not heard of any other incidents in Anc...,Middle of the night. The only light was the he...,"in the middle of the woods, in a clearing cove...",,,NaT
1,8792,B,2003,Winter,December,Alaska,Anchorage County,"Few houses on the way, a power relay station. ...",Anchorage,Dowling,"Me and a couple of friends had been bored, whe...","We smelled of colonge and after shave, and one...","4. Me, w-man, warren and sean. We were at my h...",no,"Started at 11, ended at about 3-3:30. Weather ...","a pine forest, with a bog or swamp on the righ...",Friday night,,NaT
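
The same fixes can also be written with vectorized string methods instead of `apply`. The sketch below is one possible alternative, not the notebook's own method; it runs on a two-row sample built from the values shown in the preview, and the regular expressions are assumptions about the raw format (`Report # <digits>` and `(Class <letter>)`). It also shows why `date_cleaned` is `NaT` in the output above: with `errors='coerce'`, any value that does not match `%Y-%m-%d`, such as `Friday night`, is turned into `NaT`.

```python
import pandas as pd

# Two rows modelled on the preview above.
sample = pd.DataFrame({
    "report_number": ["Report # 13038", "Report # 8792"],
    "report_class": ["(Class A)", "(Class B)"],
    "date": [None, "Friday night"],
})

# Capture the digits after "Report # " and convert them to numbers.
numbers = sample["report_number"].str.extract(r"Report # (\d+)", expand=False)
sample["report_number"] = pd.to_numeric(numbers)

# Capture the class label inside "(Class ...)".
sample["report_class"] = sample["report_class"].str.extract(
    r"\(Class (\w+)\)", expand=False
)

# Free-text dates do not match the format, so they are coerced to NaT.
sample["date_cleaned"] = pd.to_datetime(
    sample["date"], errors="coerce", format="%Y-%m-%d"
)

print(sample)
```

One design difference worth noting: `str.extract` returns a missing value for rows that do not match the pattern, whereas the `apply` version leaves such rows unchanged.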


### 3. Filter Valid Year Entries
The `year` column contains entries that are not plain four-digit years, such as `1995-96` and `1996 or 1997`.
We use a regular expression to keep only the rows whose year is exactly four digits.


In [70]:
bigfoot_df['year'].value_counts(ascending=True)

year
1890              1
1996 or 1997      1
1995-96           1
1970-71           1
1972-73           1
               ... 
2005            168
2006            173
2004            173
2000            188
2012            194
Name: count, Length: 435, dtype: int64

In [71]:
# Keep only rows whose year is exactly four digits
filtered_years = bigfoot_df[bigfoot_df['year'].str.match(r'^\d{4}$', na=False)]
filtered_years.head()

Unnamed: 0,Report Number,Report Class,year,season,month,state,county,location details,nearest town,nearest road,observed,also noticed,other witnesses,other stories,time and conditions,environment,date,a & g references,date_cleaned
0,13038,A,2004,Winter,February,Alaska,Anchorage County,Up near powerline clearings east of Potter Mar...,Anchorage / Hillside,No real roads in the area,I and two of my friends were bored one night s...,"Some tracks in the snow, and a clearing in the...",My two friends were snowmachining behind me bu...,I have not heard of any other incidents in Anc...,Middle of the night. The only light was the he...,"in the middle of the woods, in a clearing cove...",,,NaT
1,8792,B,2003,Winter,December,Alaska,Anchorage County,"Few houses on the way, a power relay station. ...",Anchorage,Dowling,"Me and a couple of friends had been bored, whe...","We smelled of colonge and after shave, and one...","4. Me, w-man, warren and sean. We were at my h...",no,"Started at 11, ended at about 3-3:30. Weather ...","a pine forest, with a bog or swamp on the righ...",Friday night,,NaT
2,1255,B,1998,Fall,September,Alaska,Bethel County,"45 miles by air west of Lake Iliamna, Alaska i...",,,My hunting buddy and I were sitting on a ridge...,nothing unusual,Scouting for caribou with high quality binoculars,,,call iliamna air taxi for lat & long of long l...,3,,NaT
3,11616,B,2004,Summer,July,Alaska,Bristol Bay County,"Approximately 95 miles east of Egegik, Alaska....",Egegik,,"To whom it may concern, I am a commercial fish...",Just these foot prints and how obvious it was ...,"One other witness, and he was fishing prior to...","I've only heard of one other story, from an ol...","Approximately 12:30 pm, partially coudy/sunny.","lake front,creek spit, gravel and sand, alder ...",20,,NaT
4,637,A,2000,Summer,June,Alaska,Cordova-McCarthy County,"On the main trail toward the glacier, before t...","Kennikot, Alaska",not sure,My hiking partner and I arrived late to the Ke...,I did hear what appeared to be grunting in the...,"I was the only witness, there was one other in...",,About 12:00 Midnight / full moon / clear / dim...,this sighting was located at approximately 1 t...,16,,NaT


In [72]:
filtered_years['year'].value_counts()

year
2012    194
2000    188
2004    173
2006    173
2005    168
       ... 
1945      1
1926      1
1940      1
1870      1
1890      1
Name: count, Length: 93, dtype: int64
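
The filter can be tried in isolation on a handful of values. The sketch below uses a hypothetical Series that mixes year formats seen in the `value_counts()` output with a missing value; `years`, `is_four_digit`, and `kept` are illustrative names.

```python
import pandas as pd

# Hypothetical sample of year strings, including a missing value.
years = pd.Series(["2004", "1995-96", "1996 or 1997", "1890", None])

# True only where the whole string is four digits; missing values count as False.
is_four_digit = years.str.match(r"^\d{4}$", na=False)
kept = years[is_four_digit]

# Once only clean values remain, the column can be converted to integers.
kept_as_int = kept.astype(int)

print(kept_as_int.tolist())
```

`Series.str.fullmatch(r"\d{4}", na=False)` is a slightly stricter alternative: in Python regular expressions `$` also matches just before a trailing newline, so `"2004\n"` passes `str.match` with `^\d{4}$` but not `str.fullmatch`.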

### 4. Handling Missing Values
We replace missing values in `nearest_town` and `nearest_road` with the placeholder `'Unknown'`.

In [82]:
bigfoot_df.fillna({'nearest_town': 'Unknown', 'nearest_road': 'Unknown'}, inplace=True)
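
One point to keep in mind when combining this step with the year filter: a DataFrame produced by boolean indexing is a separate copy, so filling the original afterwards does not change the filtered result. The sketch below illustrates this on a hypothetical three-row sample; `sample`, `filtered`, and `filled` are illustrative names, and whether this matters for the notebook depends on the order in which its cells are run.

```python
import pandas as pd

# Hypothetical sample with gaps in both location columns.
sample = pd.DataFrame({
    "year": ["2004", "1995-96", "1998"],
    "nearest_town": ["Anchorage", None, None],
    "nearest_road": [None, "Dowling", None],
})

# Filter first, as the notebook does for the year column.
filtered = sample[sample["year"].str.match(r"^\d{4}$", na=False)]

# Filling the original afterwards leaves the filtered copy untouched.
placeholders = {"nearest_town": "Unknown", "nearest_road": "Unknown"}
sample.fillna(placeholders, inplace=True)

# To fill the filtered data as well, apply the same mapping to it.
filled = filtered.fillna(placeholders)

print(filtered["nearest_road"].isna().sum(), filled.isna().sum().sum())
```

Applying `fillna` before filtering, or applying it to the filtered frame as in the last step, keeps the placeholders in whatever is exported.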

## Analyzing Cleaned Data
### Overview of Clean Data
We run `.info()` again, this time on `filtered_years`, to check the row count and non-null counts after filtering.

In [73]:
filtered_years.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4777 entries, 0 to 5152
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Report Number        4777 non-null   int64         
 1   Report Class         4777 non-null   object        
 2   year                 4777 non-null   object        
 3   season               4777 non-null   object        
 4   month                4315 non-null   object        
 5   state                4777 non-null   object        
 6   county               4777 non-null   object        
 7   location details     4067 non-null   object        
 8   nearest town         4482 non-null   object        
 9   nearest road         4141 non-null   object        
 10  observed             4742 non-null   object        
 11  also noticed         3191 non-null   object        
 12  other witnesses      4337 non-null   object        
 13  other stories        3434 non-null   o

### Aggregating by State and County
Group by `state` and `county` to count the number of sightings in each region.

In [None]:
filtered_years.groupby(['state','county']).count()['report_number']

state    county           
Alabama  Autauga County       1
         Baldwin County       2
         Barbour County       1
         Bibb County          1
         Blount County        4
                             ..
Wyoming  Sublette County      1
         Sweetwater County    1
         Teton County         2
         Uinta County         2
         Washakie County      1
Name: report_number, Length: 1502, dtype: int64
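
A detail of this pattern is that `.count()` tallies non-null values in the selected column, while `.size()` tallies rows. The two agree whenever `report_number` has no missing values, which the `.info()` output suggests is the case here. The sketch below shows the difference on a hypothetical sample in which one report number is deliberately missing.

```python
import pandas as pd

# Hypothetical sample; the last report number is missing on purpose.
sample = pd.DataFrame({
    "state": ["Alaska", "Alaska", "Alaska", "Ohio"],
    "county": ["Anchorage County", "Anchorage County", "Bethel County", "Adams County"],
    "report_number": [13038, 8792, 1255, None],
})

grouped = sample.groupby(["state", "county"])

# Non-null report numbers per group.
by_count = grouped["report_number"].count()

# Rows per group, regardless of missing values.
by_size = grouped.size()

print(by_count)
print(by_size)
```

Selecting the column before calling `.count()`, as above, also avoids counting every other column only to discard the results.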


### State-Level Sightings
Aggregate sightings by state to see the distribution.

In [None]:
filtered_years.groupby('state').count()['report_number']

state
Alabama            91
Alaska             18
Arizona            77
Arkansas           98
California        398
Colorado          121
Connecticut        23
Delaware            4
Florida           277
Georgia           122
Idaho              94
Illinois          226
Indiana            75
Iowa               57
Kansas             45
Kentucky          101
Louisiana          33
Maine              16
Maryland           28
Massachusetts      27
Michigan          191
Minnesota          71
Mississippi        22
Missouri          153
Montana            49
Nebraska           17
Nevada              7
New Hampshire      11
New Jersey         69
New Mexico         42
New York           99
North Carolina     84
North Dakota        4
Ohio              282
Oklahoma           97
Oregon            231
Pennsylvania      102
Rhode Island        5
South Carolina     42
South Dakota       13
Tennessee          88
Texas             222
Utah               56
Vermont             9
Virginia           82
Wash
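
The state-level result above is ordered alphabetically. If a ranking is more useful, `value_counts()` returns the same kind of tally sorted from most to fewest rows. This is a minimal sketch on hypothetical data, and the counts in it are illustrative only.

```python
import pandas as pd

# Hypothetical sample: three, two, and one sightings respectively.
sample = pd.DataFrame({
    "state": ["California", "Ohio", "California", "Alaska", "Ohio", "California"],
})

# Rows per state, sorted from most to fewest.
ranked = sample["state"].value_counts()

# The two states with the most rows.
top_two = ranked.head(2)

print(ranked)
```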

## Exporting Cleaned Data
Save the cleaned dataset as a JSON file for future use.

In [77]:
filtered_years.to_json('../data/bigfoot_coordinates_clean_cols.json', orient='records')
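
A quick way to check an export like this is to read it back and compare. The sketch below does the round trip in memory on a hypothetical two-row sample instead of touching the file path used above; `sample`, `json_text`, and `restored` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical two-row sample with one missing town.
sample = pd.DataFrame({
    "report_number": [13038, 8792],
    "report_class": ["A", "B"],
    "nearest_town": ["Anchorage / Hillside", None],
})

# With no path argument, to_json returns the JSON text as a string.
json_text = sample.to_json(orient="records")

# Read the text back; StringIO makes the string behave like a file.
restored = pd.read_json(io.StringIO(json_text), orient="records")

print(restored)
```

With `orient='records'` each row becomes one JSON object and the index is not written, so the restored frame gets a fresh `0..n-1` index. For `filtered_years`, whose index has gaps after filtering, this means the original row labels are not preserved in the file.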

## Conclusion

The cleaned dataset is now ready for analysis and visualization. Key steps included validating year entries, handling missing values, and standardizing columns. Further work will focus on analyzing geographic and temporal patterns in the data.

