# Sharks

This notebook explores and cleans a dataset of shark attacks from the Global Shark Attack File (GSAF).

## Imports

In [6]:
import pandas as pd

## Data Sourcing

In [54]:
sharks = pd.read_excel("https://www.sharkattackfile.net/spreadsheets/GSAF5.xls")

In [55]:
sharks.head()

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,...,Species,Source,pdf,href formula,href,Case Number,Case Number.1,original order,Unnamed: 21,Unnamed: 22
0,2025-04-19 00:00:00,2025.0,Unprovoked,Maldives,Kulhudhuffushi City,Haa Dhaalu atoll,snorkeling,Unknown Male,M,30,...,Not stated,Todd Smith: The edition: https://en.sun.mv/96031,,,,,,,,
1,2025-04-12 00:00:00,2025.0,Unprovoked,USA,Florida,Everglades National Park Flamingo Lodge Highwa...,Undisclosed,Unknown Male,M,?,...,Not stated,Kevin McMurray Trackingsharks.com: Florida New...,,,,,,,,
2,2025-03-26 00:00:00,2025.0,Unprovoked,Australia,WA,Sandtrax Port Beach North Fremantle Perth,Swimming,Unknown Male,M,30+,...,1.5m Tiger shark,Kevin McMurray Trackingsharks.com: www.surfer....,,,,,,,,
3,2025-03-10 00:00:00,2025.0,Unprovoked,Australia,WA,Duke of Orleans Bay,Surfing,Steven Jeffrey Payne,M,37,...,Great White Shark,Bob Myatt,,,,,,,,
4,2025-03-07 00:00:00,2025.0,Unprovoked,Australia,NSW,Gunyah beach Bundeena Port Hacking,Swimming,Mangyong Zhang,F,56,...,Bull shark,Bob Myatt,,,,,,,,


## Data Exploration

### Pandas has opinions

- You shouldn't scroll through large amounts of data
- You are a reckless fool and should only edit copies
- Empty values are a problem
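
The three points above can be sketched on a toy frame. The `toy` data and the variable names are made up for illustration and are not taken from the shark dataset.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the shark dataset.
toy = pd.DataFrame({"sex": ["M", "F ", None, "M"], "age": ["30", "?", "56", None]})

# 1. Peek instead of scrolling: sample() (or head()) returns a handful of rows.
peek = toy.sample(n=2, random_state=0)

# 2. Edit a copy so the original frame stays untouched.
cleaned = toy.copy()
cleaned["sex"] = cleaned["sex"].str.strip()

# 3. Count empty values per column before deciding what to do with them.
missing = toy.isna().sum()
```

Because `cleaned` is a copy, `toy` still holds the untrimmed `"F "` after the edit.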

In [72]:
sharks[["Activity", "Age", "Name"]].sample()

Unnamed: 0,Activity,Age,Name
5088,Fishing,MAKE LINE GREEN,14-foot boat Sintra


In [79]:
sharks["Sex"].unique()

array(['M', 'F ', 'F', 'M ', nan, ' M', 'm', 'lli', 'M x 2', 'N', '.'],
      dtype=object)

In [77]:
sharks["Sex"].value_counts()

Sex
M        5624
F         798
M           3
N           2
F           1
 M          1
m           1
lli         1
M x 2       1
.           1
Name: count, dtype: int64

## Data Cleaning

In [None]:
sharks.columns = sharks.columns.str.lower()

In [None]:
sharks["sex"] = sharks["sex"].replace("F ", "F")
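
The `replace` call above fixes only `"F "`. A broader sketch, run here on the values from the `unique()` output rather than on `sharks` itself, strips whitespace and upper-cases before deciding what counts as valid. Treating everything other than `M` or `F` as missing is an assumption; entries such as `M x 2` or `N` may deserve a manual look instead.

```python
import pandas as pd

# Values copied from the unique() output earlier in the notebook.
raw = pd.Series(["M", "F ", "F", "M ", None, " M", "m", "lli", "M x 2", "N", "."])

# Strip surrounding whitespace and upper-case so "M ", " M" and "m" all become "M".
normalized = raw.str.strip().str.upper()

# Assumption: anything other than "M" or "F" is treated as missing.
sex_clean = normalized.where(normalized.isin(["M", "F"]))
```

`where` keeps values where the condition holds and replaces the rest with a missing value, so the stray entries no longer show up in `value_counts()`.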

In [87]:
sharks = sharks.drop(columns="pdf")

In [88]:
sharks.head()

Unnamed: 0,date,year,type,country,state,location,activity,name,sex,age,...,time,species,source,href formula,href,case number,case number.1,original order,unnamed: 21,unnamed: 22
0,2025-04-19 00:00:00,2025.0,Unprovoked,Maldives,Kulhudhuffushi City,Haa Dhaalu atoll,snorkeling,Unknown Male,M,30,...,Not stated,Not stated,Todd Smith: The edition: https://en.sun.mv/96031,,,,,,,
1,2025-04-12 00:00:00,2025.0,Unprovoked,USA,Florida,Everglades National Park Flamingo Lodge Highwa...,Undisclosed,Unknown Male,M,?,...,1500hrs,Not stated,Kevin McMurray Trackingsharks.com: Florida New...,,,,,,,
2,2025-03-26 00:00:00,2025.0,Unprovoked,Australia,WA,Sandtrax Port Beach North Fremantle Perth,Swimming,Unknown Male,M,30+,...,1430hrs,1.5m Tiger shark,Kevin McMurray Trackingsharks.com: www.surfer....,,,,,,,
3,2025-03-10 00:00:00,2025.0,Unprovoked,Australia,WA,Duke of Orleans Bay,Surfing,Steven Jeffrey Payne,M,37,...,1210 hrs,Great White Shark,Bob Myatt,,,,,,,
4,2025-03-07 00:00:00,2025.0,Unprovoked,Australia,NSW,Gunyah beach Bundeena Port Hacking,Swimming,Mangyong Zhang,F,56,...,1340hrs,Bull shark,Bob Myatt,,,,,,,


In [93]:
sharks["year"].unique()

array([2025., 2024., 2026., 2023., 2022., 2021., 2020., 2019., 2018.,
       2017.,   nan, 2016., 2015., 2014., 2013., 2012., 2011., 2010.,
       2009., 2008., 2007., 2006., 2005., 2004., 2003., 2002., 2001.,
       2000., 1999., 1998., 1997., 1996., 1995., 1984., 1994., 1993.,
       1992., 1991., 1990., 1989., 1969., 1988., 1987., 1986., 1985.,
       1983., 1982., 1981., 1980., 1979., 1978., 1977., 1976., 1975.,
       1974., 1973., 1972., 1971., 1970., 1968., 1967., 1966., 1965.,
       1964., 1963., 1962., 1961., 1960., 1959., 1958., 1957., 1956.,
       1955., 1954., 1953., 1952., 1951., 1950., 1949., 1948., 1848.,
       1947., 1946., 1945., 1944., 1943., 1942., 1941., 1940., 1939.,
       1938., 1937., 1936., 1935., 1934., 1933., 1932., 1931., 1930.,
       1929., 1928., 1927., 1926., 1925., 1924., 1923., 1922., 1921.,
       1920., 1919., 1918., 1917., 1916., 1915., 1914., 1913., 1912.,
       1911., 1910., 1909., 1908., 1907., 1906., 1905., 1904., 1903.,
       1902., 1901.,
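
The years show up as floats because the column contains `nan`, and rows later in this notebook show `0.0` for undated cases. A minimal sketch of one way to tidy this, on a toy Series and assuming a year of `0` means "unknown":

```python
import pandas as pd

# A few values of the kind seen in the "year" column (floats, NaN, 0.0 for undated cases).
year = pd.Series([2025.0, 2024.0, float("nan"), 1848.0, 0.0])

# Assumption: a year of 0 means "unknown", so mask it before converting.
# The nullable "Int64" dtype stores whole numbers and still allows missing values.
year_clean = year.mask(year == 0).astype("Int64")
```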

In [None]:
pd.NA  # pandas' marker for a missing value: the concept of a problem

<NA>
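
A short sketch of why missing values need special handling: `pd.NA` does not behave like an ordinary value in comparisons, so it has to be detected with `pd.isna()`. The `ages` Series is toy data.

```python
import pandas as pd

# pd.NA propagates through comparisons instead of returning True or False.
comparison = pd.NA == 1

# Use pd.isna() to test for missing values.
is_missing = pd.isna(pd.NA)

# Reductions such as mean() skip missing values by default.
ages = pd.Series([30, pd.NA, 56], dtype="Int64")
mean_age = ages.mean()
```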

In [98]:
sharks.notna().sum()

date              7012
year              7010
type              6994
country           6962
state             6527
location          6446
activity          6427
name              6793
sex               6433
age               4018
injury            6977
fatal y/n         6451
time              3486
species           3881
source            6993
href formula      6794
href              6796
case number       6798
case number.1     6797
original order    6799
unnamed: 21          1
unnamed: 22          2
dtype: int64
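
Counts like these are often easier to read as shares, and columns that are almost entirely empty (such as `unnamed: 21` above) are candidates for dropping. A sketch on a toy frame; the data and the `min_present` threshold are hypothetical.

```python
import pandas as pd

# Toy frame standing in for `sharks`; the values are invented.
toy = pd.DataFrame({
    "country": ["USA", "Australia", None, "Maldives"],
    "age": ["30", None, None, "56"],
    "unnamed: 21": [None, None, None, "stray"],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical threshold: keep only columns with at least this many non-missing values.
min_present = 2
trimmed = toy.dropna(axis="columns", thresh=min_present)
```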

In [99]:
sharks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7012 entries, 0 to 7011
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   date            7012 non-null   object 
 1   year            7010 non-null   float64
 2   type            6994 non-null   object 
 3   country         6962 non-null   object 
 4   state           6527 non-null   object 
 5   location        6446 non-null   object 
 6   activity        6427 non-null   object 
 7   name            6793 non-null   object 
 8   sex             6433 non-null   object 
 9   age             4018 non-null   object 
 10  injury          6977 non-null   object 
 11  fatal y/n       6451 non-null   object 
 12  time            3486 non-null   object 
 13  species         3881 non-null   object 
 14  source          6993 non-null   object 
 15  href formula    6794 non-null   object 
 16  href            6796 non-null   object 
 17  case number     6798 non-null   o
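
The `date` column is `object` dtype, and rows elsewhere in this notebook show free text such as `Before 1906` next to real dates. A minimal sketch of parsing such strings, assuming the `25-Sep-2022` style; anything that does not match becomes `NaT` rather than raising.

```python
import pandas as pd

# Date strings of the kind seen in this notebook's outputs.
dates = pd.Series(["25-Sep-2022", "03-Sep-2022", "Before 1906", "1900-1905"])

# Assumption: day-month-year with an abbreviated month name.
# errors="coerce" turns unparseable entries into NaT.
parsed = pd.to_datetime(dates, format="%d-%b-%Y", errors="coerce")
```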

In [None]:
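# Illustrative only: "number_of_children" is a made-up column, not part of this dataset.
# fillna() needs a fill value; calling it with no arguments raises an error.
sharks["number_of_children"].fillna(0)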

Unnamed: 0,date,year,type,country,state,location,activity,name,sex,age,...,time,species,source,href formula,href,case number,case number.1,original order,unnamed: 21,unnamed: 22
202,25-Sep-2022,2022.0,Unprovoked,SOUTH AFRICA,Western Cape Province,"Central Beach, Plettenberg Bay",Swimming,Kimon Bisogno,F,39,...,07h53,"White shark, 13'","Mirror, 9/25/2022",http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2022.09.25,2022.09.25,6802.0,,
205,03-Sep-2022,2022.0,Unprovoked,USA,Hawaii,"Lower Paia Beach Park, Maui",Swimming or Snorkeling,female,F,51,...,16h10,,"Star Advertiser, 9/3/2022",http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2022.09.03,2022.09.03,6800.0,,
206,31-Aug-2022,2022.0,Unprovoked,AUSTRALIA,New South Wales,Avoca,Surfing,Sunni Pace,M,14,...,07h00,Bronze whaler,"Surfline, 9/2/2022",http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2022.08.31,2022.08.31,6799.0,,
207,17-Aug-2022,2022.0,Unprovoked,AUSTRALIA,New South Wales,Coffs Harbour,Kayaking,John Vincent,M,,...,,"White shark, 3 m",A Currie. GSAF,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2022.08.17,2022.08.17,6798.0,,
208,15-Aug-2022,2022.0,Unprovoked,USA,South Carolina,"Myrtle Beach, Horry County",Swimming,female,F,,...,11h17,,"C. Creswell, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2022.08.15.c,2022.08.15.c,6797.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7006,Before 1906,0.0,Unprovoked,AUSTRALIA,New South Wales,,Swimming,Arab boy,M,,...,,Said to involve a grey nurse shark that leapt ...,"L. Becke in New York Sun, 9/9/1906; L. Schultz...",http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0006,ND.0006,7.0,,
7007,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,...,,,"H. Taunton; N. Bartlett, p. 234",http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0005,ND.0005,6.0,,
7008,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,...,,,"H. Taunton; N. Bartlett, pp. 233-234",http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0004,ND.0004,5.0,,
7009,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,...,,,"F. Schwartz, p.23; C. Creswell, GSAF",http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,ND.0003,ND.0003,4.0,,
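
For a runnable `fillna()` example, here is a sketch on a toy Series. Using `"Not stated"` as the fill value mirrors a label that already appears in the species column, but choosing it here is an assumption.

```python
import pandas as pd

# Toy column invented for illustration.
species = pd.Series(["Bull shark", None, "White shark, 3 m", None])

# fillna() returns a new Series; the original keeps its missing values.
species_filled = species.fillna("Not stated")
```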


In [None]:
pd.