## Dataset cleaning

The purpose of this notebook is to **clean** the *global shark attacks* dataset as much as we can, so that the values that remain are **reliable** and **rigorous** enough to:

- Draw early-stage conclusions by exploring the data with pandas
- Apply some statistical concepts such as formulas and hypothesis tests
- Plot the statistics and results



First we import the **libraries** that will help us with the data cleaning:

In [1]:
import pandas as pd
import numpy as np
import regex as re
import datetime
pd.set_option('display.max_rows', None)

The second step is loading the dataset into Jupyter and **filling** the NaN values with "N/A" so we can extract the values of each column properly. We then **drop** the columns that we don't need and **rename** the ones that we are going to keep in the dataframe.


In [2]:
sharks= pd.read_csv("datasets/global-shark-attack.csv",sep=";")


In [3]:
sharks.fillna("N/A",inplace=True)
sharks.drop(["Case Number","Investigator or Source","pdf","href formula","Case Number.2","original order"],axis=1,inplace=True)
sharks.head()

Unnamed: 0,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,href,Case Number.1
0,2018-05-27,2018,Unprovoked,USA,Florida,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,male,M,52.0,Minor injury to foot. PROVOKED INCIDENT,N,,"Lemon shark, 3'",http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.27
1,2018-05-13,2018,Unprovoked,USA,South Carolina,"Hilton Head Island, Beaufort County",Swimming,Jei Turrell,M,10.0,Severe bite to right forearm,N,15h00,,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.13.b
2,2018-04-25,2018,Questionable,AUSTRALIA,New South Wales,Lennox Head,Surfing,Matthew Lee,M,,No injury,N,07h00,Questionable,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.04.25.b
3,2018-04-10,2018,Invalid,BRAZIL,Alagoas,"Praia de Sauaçuhy, Maceió",Fishing,Josias Paz,M,56.0,Injury to ankle from marine animal trapped in ...,N,,Shark involvement not confirmed,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.04.10.R
4,2018-04-09,2018,Unprovoked,NEW CALEDONIA,,"Magenta Beach, Noumea",Windsurfing,,,,"No injury, shark bit board",N,17h00,2 m shark,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.04.09


In [4]:
sharks.columns=["Month","Year","Type","Country","Area","Location","Activity","Name","Gender","Age","Injury","Severity","Hour","Shark specie","link source","Case number"]
sharks.head()

Unnamed: 0,Month,Year,Type,Country,Area,Location,Activity,Name,Gender,Age,Injury,Severity,Hour,Shark specie,link source,Case number
0,2018-05-27,2018,Unprovoked,USA,Florida,"Lighhouse Point Park, Ponce Inlet, Volusia County",Fishing,male,M,52.0,Minor injury to foot. PROVOKED INCIDENT,N,,"Lemon shark, 3'",http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.27
1,2018-05-13,2018,Unprovoked,USA,South Carolina,"Hilton Head Island, Beaufort County",Swimming,Jei Turrell,M,10.0,Severe bite to right forearm,N,15h00,,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.05.13.b
2,2018-04-25,2018,Questionable,AUSTRALIA,New South Wales,Lennox Head,Surfing,Matthew Lee,M,,No injury,N,07h00,Questionable,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.04.25.b
3,2018-04-10,2018,Invalid,BRAZIL,Alagoas,"Praia de Sauaçuhy, Maceió",Fishing,Josias Paz,M,56.0,Injury to ankle from marine animal trapped in ...,N,,Shark involvement not confirmed,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.04.10.R
4,2018-04-09,2018,Unprovoked,NEW CALEDONIA,,"Magenta Beach, Noumea",Windsurfing,,,,"No injury, shark bit board",N,17h00,2 m shark,http://sharkattackfile.net/spreadsheets/pdf_di...,2018.04.09
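
Assigning a list to `sharks.columns` depends on the columns arriving in exactly the expected order. As a hedged alternative, the same renaming could be sketched with an explicit old-to-new mapping. The small `raw` frame and the `new_names` dictionary below are made up for illustration and cover only a few of the renamed columns.

```python
import pandas as pd

# Toy frame standing in for the real dataset (illustrative values only)
raw = pd.DataFrame({
    "Date": ["2018-05-27"],
    "Sex": ["M"],
    "Fatal (Y/N)": ["N"],
    "Time": ["15h00"],
})

# Map each old name to its new name; columns not listed keep their name
new_names = {"Date": "Month", "Sex": "Gender", "Fatal (Y/N)": "Severity", "Time": "Hour"}
renamed = raw.rename(columns=new_names)

print(list(renamed.columns))
```

With a mapping, a reordered or missing column cannot silently receive the wrong name.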


The next step is the actual cleaning, going **column by column**: we identify all the values in each column and then **erase, change or merge** some of them if necessary.

**MONTH:** replacing the YYYY-MM-DD date format with month abbreviations (Jan, Feb, etc.)


In [5]:
sharks["Month"].replace(to_replace=r"^\d{4}-",value="",regex=True,inplace=True)
sharks["Month"].replace(to_replace=r"-\d{2}$",value="",regex=True,inplace=True)
sharks["Month"].replace(to_replace="^0",value="",regex=True,inplace=True)
sharks["Month"].replace({'1':'Jan','2':'Feb','3':'Mar','4':'Apr','5':'May','6':'Jun','7':'Jul','8':'Aug','9':'Sep','10':'Oct','11':'Nov','12':'Dec'},inplace=True)
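
The month extraction above works on the date strings directly. A minimal alternative sketch, assuming the dates follow the YYYY-MM-DD layout, parses them with `pd.to_datetime` and looks the month up in a fixed list. The `dates` series and the `MONTH_ABBR` list are illustrative names, not part of the notebook.

```python
import pandas as pd

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Illustrative values: three valid dates and one placeholder
dates = pd.Series(["2018-05-27", "2018-04-10", "N/A", "1935-12-01"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Month number -> abbreviation; missing values are skipped, then labelled "N/A"
months = parsed.dt.month.map(
    lambda m: MONTH_ABBR[int(m) - 1], na_action="ignore"
).fillna("N/A")

print(months.tolist())
```

One design difference to keep in mind: entries that are not valid dates end up as "N/A" here, whereas the regex approach leaves them as they are.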


**YEARS:** seems clean enough


In [6]:
years=pd.DataFrame(sharks["Year"].value_counts())

**TYPE:** mainly unifying values. We set "Questionable" to "Provoked" because there were only two such entries


In [7]:
sharks["Type"].replace("Boat","Boating",inplace=True)
sharks["Type"].replace("Boatomg","Boating",inplace=True)
sharks["Type"].replace("Questionable","Provoked",inplace=True)
sharks["Type"].replace("N/A","Unknown",inplace=True)
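
The replacements above modify a selected column with `inplace=True`. Recent pandas versions may warn about this pattern, and under copy-on-write the change may not reach the parent dataframe. A minimal sketch of the same Type clean-up that assigns the result back instead, using a made-up frame `df` and a hypothetical `type_fixes` dictionary:

```python
import pandas as pd

# Toy frame with the raw Type values handled in this step
df = pd.DataFrame({"Type": ["Boat", "Boatomg", "Questionable", "N/A", "Unprovoked"]})

# One dictionary holds all exact-match replacements
type_fixes = {
    "Boat": "Boating",
    "Boatomg": "Boating",
    "Questionable": "Provoked",
    "N/A": "Unknown",
}

# Assigning back makes the update explicit and independent of view/copy rules
df["Type"] = df["Type"].replace(type_fixes)

print(df["Type"].tolist())
```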


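**COUNTRY:** fixing country names that have typos, inconsistent capitalization or that refer to a region rather than a country. One `Area` value is also simplified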


In [8]:
sharks["Country"].replace("INDIAN OCEAN?","INDONESIA",inplace=True)
sharks["Country"].replace("THE BALKANS","SLOVENIA",inplace=True)
sharks["Country"].replace("GRAND CAYMAN","CAYMAN ISLANDS",inplace=True)
sharks["Country"].replace("CEYLON","SRI LANKA",inplace=True)
sharks["Country"].replace("PERSIAN GULF","IRAN",inplace=True)
sharks["Country"].replace("BURMA","MYANMAR",inplace=True)
sharks["Country"].replace("Fiji","FIJI",inplace=True)
sharks["Country"].replace("OKINAWA","JAPAN",inplace=True)
sharks["Country"].replace("DIEGO GARCIA","DIEGO GARCIA ISLAND",inplace=True)
sharks["Country"].replace("Sierra Leone","SIERRA LEONE",inplace=True)
sharks["Country"].replace("GULF OF ADEN","YEMEN",inplace=True)
sharks["Country"].replace("SAN DOMINGO","DOMINICAN REPUBLIC",inplace=True)
sharks["Country"].replace("SUDAN?","SUDAN",inplace=True)
sharks["Country"].replace("RED SEA?","EGYPT",inplace=True)
sharks["Country"].replace("CEYLON (SRI LANKA)","SRI LANKA",inplace=True)
sharks["Country"].replace("Between PORTUGAL & INDIA","PORTUGAL",inplace=True)
sharks["Country"].replace("ASIA?","ASIA",inplace=True)
sharks["Country"].replace("Seychelles","SEYCHELLES",inplace=True)
sharks["Country"].replace("IRAN/IRAQ","IRAN",inplace=True)
sharks["Area"].replace("KwaZulu-Natal between Port Edward and Port St Johns","KwaZulu-Natal",inplace=True)
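
Several of the fixes above only differ in capitalization ("Fiji", "Sierra Leone", "Seychelles"). As a sketch under that assumption, normalizing whitespace and case first leaves a much shorter list of genuine renames. The `countries` series and the `fixes` dictionary are illustrative.

```python
import pandas as pd

# Illustrative raw values, including case and whitespace variants
countries = pd.Series(["Fiji", "CEYLON", "SUDAN?", " usa ", "BURMA"])

# Exact-match renames that remain after normalization
fixes = {"CEYLON": "SRI LANKA", "SUDAN?": "SUDAN", "BURMA": "MYANMAR"}

# Strip spaces and upper-case everything, then apply the renames
cleaned = countries.str.strip().str.upper().replace(fixes)

print(cleaned.tolist())
```

Because `replace` is called without `regex=True`, the "?" in "SUDAN?" is treated as a literal character.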

**ACTIVITY:** there is a lot of cleaning to do. First we use **regex** and the pandas `replace` function to group the values into a few categories (mainly water sports, swimming, diving and bathing)


In [9]:

sharks["Activity"].replace(to_replace=["Jumped overboard","Thrown overboard","Jumped overboard to rescue companion","Fell overboard, hanging onto lifebuoy","Jumping","Jumped into the water"],value="Thrown or jumped from the boat",inplace=True)
sharks["Activity"].replace(to_replace=["Wading","Standing","Playing","Walking","Treading water"],value="Bathing/Floating/Walking near the shore",inplace=True)
sharks["Activity"].replace(to_replace=["Sailing","Canoeing","Boating","Boat","SUP"],value="Water sports",inplace=True)
sharks["Activity"].replace(to_replace=["Floating", "Bathing","Splashing"],value="Bathing/Floating/Walking near the shore",inplace=True)
sharks["Activity"].replace(to_replace=".*[Ff]ell.*",value="Thrown or jumped from the boat",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ss]urf.*",value="Water sports",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Bb]oard.*",value="Water sports",regex=True,inplace=True)
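# \b anchors "row" at a word start so the "Thrown or jumped from the boat" label (which contains "row") is not turned into "Water sports"
sharks["Activity"].replace(to_replace=r".*\b[Rr]ow.*",value="Water sports",regex=True,inplace=True)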
sharks["Activity"].replace(to_replace=".*[Ss]ki.*",value="Water sports",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Pp]ad.*",value="Water sports",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Kk]ay.*",value="Water sports",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ff]ish.*",value="Fishing",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Nn]et.*",value="Fishing",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ff]loat.*",value="Bathing/Floating/Walking near the shore",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Bb]ath.*",value="Bathing/Floating/Walking near the shore",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ww]alk.*",value="Bathing/Floating/Walking near the shore",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Pp]lay.*",value="Bathing/Floating/Walking near the shore",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ww]ash.*",value="Bathing/Floating/Walking near the shore",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ss]tand.*",value="Bathing/Floating/Walking near the shore",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ss]it.*",value="Bathing/Floating/Walking near the shore",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ss]wim.*",value="Swimming",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=["Snorkeling"],value="Diving",inplace=True)
sharks["Activity"].replace(to_replace=".*[Dd]iv.*",value="Diving",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Dd]isaster.*",value="Sea or air disaster",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Aa]ir.*",value="Sea or air disaster",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ww]reck.*",value="Sea or air disaster",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Cc]apsized.*",value="Sea or air disaster",regex=True,inplace=True)
# [aiu] covers "sank", "sink" and "sunk"; the previous range [a-i] did not match "sunk"
sharks["Activity"].replace(to_replace=".*[Ss][aiu]nk.*",value="Sea or air disaster",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ss]wamp.*",value="Sea or air disaster",regex=True,inplace=True)
sharks["Activity"].replace(to_replace=".*[Ss]hark.*",value="Direct interaction with a shark",regex=True,inplace=True)

activities=pd.DataFrame(sharks["Activity"].value_counts())
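
Because the replacements run one after another, their order matters: a label written by an earlier rule can be matched again by a later pattern (for example, "Thrown" contains "row"). A hedged sketch of an alternative that checks an ordered rule list once per value, so the first matching rule wins. The `RULES` list and the `categorise` helper are hypothetical, the patterns are a small subset of the ones above, and matching is fully case-insensitive here, which is an assumption.

```python
import re
import pandas as pd

# Ordered (pattern, label) pairs; earlier rules take priority
RULES = [
    (r"surf|board|\brow|ski|pad|kay", "Water sports"),
    (r"fish|net", "Fishing"),
    (r"swim", "Swimming"),
    (r"div|snorkel", "Diving"),
]

def categorise(activity):
    """Return the label of the first rule that matches, else the original text."""
    for pattern, label in RULES:
        if re.search(pattern, activity, flags=re.IGNORECASE):
            return label
    return activity

sample = pd.Series(["Body surfing", "Spearfishing", "Swimming near pier",
                    "Scuba diving", "Murder"])
grouped = sample.map(categorise)

print(grouped.tolist())
```

Each value is classified exactly once, so an output label is never re-examined by a later rule.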


As there were still many different activities that were hard to place in any of the main groups (most of them with no more than 3 attacks), we create a **dictionary** and **map** it onto the activity column, so that these activities fall into a new group called *other*


In [10]:
to_keep_map = dict(zip(list(activities[activities >= 4].dropna().index), list(activities[activities >= 4].dropna().index)))

In [11]:
# Build the mapping from the rare activities themselves instead of a hard-coded count of 238
to_map = dict.fromkeys(activities[activities < 4].dropna().index, "other")

In [12]:
final_map = {**to_keep_map, **to_map}

In [13]:
sharks["Activity"] = sharks["Activity"].map(final_map)
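
The same grouping of rare activities can be sketched without building two dictionaries, by keeping a value only when it is frequent enough. The `activity` series below is made up; the threshold of 4 mirrors the one used above.

```python
import pandas as pd

# Toy data: two frequent activities and two rare ones
activity = pd.Series(["Swimming"] * 5 + ["Fishing"] * 4 + ["Murder"] * 2 + ["Dangling feet"])

counts = activity.value_counts()

# Activities seen fewer than 4 times
rare = counts[counts < 4].index

# Keep frequent values, replace rare ones with "other"
grouped = activity.where(~activity.isin(rare), "other")

print(grouped.value_counts().to_dict())
```

This version does not depend on knowing in advance how many rare categories there are.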

**AGE:** using a lot of **regex** to strip the text from strings and categorical values, so that only **numerical** values (or "N/A") remain

In [14]:
sharks["Age"].replace(to_replace=".*[Tt]een.*",value="15",regex=True,inplace=True)
sharks["Age"].replace(to_replace=".*[Yy]oung.*",value="16",regex=True,inplace=True)
sharks["Age"].replace(to_replace="[s].*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=r"[\?].*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace='".*',value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=".*[>]",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=".*ul",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace="F.*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=r".*[\,]",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=".*leage",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace="[X].*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace="[Bt)'].*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace="Ele.*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace="Eld?e.*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=".*Ca.",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=".*MAK",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=".*mid-",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=".*mon",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=r".*A\.M\.",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace="E LINE GREEN",value="",inplace=True)
sharks["Age"].replace(to_replace="[&].*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace="to.*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace="or.*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace=".both",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace="½.*",value="",regex=True,inplace=True)
sharks["Age"].replace(to_replace='  ',value="N/A",inplace=True)

ages=pd.DataFrame(sharks["Age"].value_counts())
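
After these replacements the column still holds strings. A minimal sketch of one way to obtain a truly numeric column, assuming it is acceptable to keep the first number found in each entry and to treat everything else as missing. Note that this is a different technique from the one above: a word such as "Teen" becomes missing here instead of 15. The `ages` series is illustrative.

```python
import pandas as pd

# Illustrative raw values: clean numbers, decorated numbers, words and blanks
ages = pd.Series(["52", "20s", "Teen", "N/A", "18 or 20", ""])

# Take the first run of digits in each entry (NaN when there is none)
first_number = ages.str.extract(r"(\d+)", expand=False)

# Convert to floats; anything that is not a number becomes NaN
numeric = pd.to_numeric(first_number, errors="coerce")

print(numeric.tolist())
```

A float column with NaN for unknown ages can then be passed directly to `mean`, `describe` or a histogram.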




**GENDER:** renaming the values to more descriptive labels

In [15]:
sharks["Gender"].replace(to_replace=["lli","N","."],value="F",inplace=True)
sharks["Gender"].replace(to_replace="M",value="Male",inplace=True)
sharks["Gender"].replace(to_replace="F",value="Female",inplace=True)
sharks["Gender"].replace(to_replace="N/A",value="Unknown",inplace=True)

gender=pd.DataFrame(sharks["Gender"].value_counts())


**SEVERITY:** replacing Y and N with *Death* and *No death* and keeping an *Unknown* group of values

In [16]:
sharks["Severity"].replace(to_replace="y",value="Y",inplace=True)
sharks["Severity"].replace(to_replace="Y",value="Death",inplace=True)
sharks["Severity"].replace(to_replace="N",value="No death",inplace=True)
sharks["Severity"].replace(to_replace="UNKNOWN",value="Unknown",inplace=True)
fatality=pd.DataFrame(sharks["Severity"].value_counts())
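
A compact variant of this step, sketched under the assumption that every value other than Y or N should count as unknown (the cell above keeps such values unchanged). The `severity` series and the `labels` mapping are made up for illustration.

```python
import pandas as pd

# Illustrative raw values with case and whitespace variants
severity = pd.Series(["Y", "N", "y", " N", "UNKNOWN", "N/A", "2017"])

labels = {"Y": "Death", "N": "No death"}

# Normalize, map the two known codes, and label everything else "Unknown"
cleaned = severity.str.strip().str.upper().map(labels).fillna("Unknown")

print(cleaned.tolist())
```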



**SHARK SPECIES:** using **regex** to unify the shark species values and to set entries that still contain numbers (such as sizes) to "N/A"

In [17]:
sharks["Shark specie"].replace(to_replace=["Shark involvement not confirmed","Shark involvement prior to death not confirmed","Shark involvement prior to death suspected but not confirmed","Questionable Incident","Shark involvement suspected but not confirmed","Invalid","Shark involvement questionable","Shark involvement prior to death unconfirmed","Questionable incident","Questionable","Shark involvement prior to death was not confirmed","No shark involvement"],value="Questionable shark involvement",inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Ww]hite.*",value="White shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Tt]iger.*",value="Tiger shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Bb]ull.*",value="Bull shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Bb]lack.*",value="Blacktip shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Ll]emon.*",value="Lemon shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Bb]lue.*",value="Blue shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Nn]urse.*",value="Nurse shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Rr]eef.*",value="Reef shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Bb]ronze.*",value="Bronze whaler shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Ww]ob.*",value="Wobbegong shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Ss]and.*",value="Sand shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Mm]ak.*",value="Mako shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Rr]agg.*",value="Raggedtooth shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Hh]ammer.*",value="Hammerhead shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Zz]amb.*",value="Zambesi shark",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Uu]nident.*",value="N/A",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[Ss]mall.*",value="N/A",regex=True,inplace=True)
sharks["Shark specie"].replace(to_replace=".*[1-9].*",value="N/A",regex=True,inplace=True)

sharkies=pd.DataFrame(sharks["Shark specie"].value_counts())
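
The per-species replacements can also be sketched as a single extraction with one alternation pattern. The species keywords below are a small subset chosen for illustration, and the `species` series is made up. This simple form only works for names that follow the "<Keyword> shark" shape, so labels such as "Blacktip shark" or "Bronze whaler shark" would still need their own mapping.

```python
import re
import pandas as pd

# Illustrative raw values with sizes mixed into the text
species = pd.Series(["Lemon shark, 3'", "White shark, 4 m", "2 m shark",
                     "Tiger shark", "Bull shark, 6'"])

# One capture group listing the keywords to look for
pattern = r"(white|tiger|bull|lemon|nurse)"
found = species.str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Build a uniform label; entries without a known keyword become "N/A"
labels = found.map(lambda name: f"{name.capitalize()} shark", na_action="ignore").fillna("N/A")

print(labels.tolist())
```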



**HOUR:** there was not enough time to work on this column, but it would be interesting to clean it and analyze the peak hours of attacks

In [18]:
times=pd.DataFrame(sharks["Hour"].value_counts())
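
As a starting point for that future work, here is a rough sketch that extracts the hour from values written like "15h00", which is the format visible in the preview rows. It assumes that any other format (for example a word such as "Afternoon") is simply treated as missing. The `times_raw` series is illustrative.

```python
import pandas as pd

# Illustrative raw values: HHhMM strings plus entries in other formats
times_raw = pd.Series(["15h00", "07h00", "17h30", "N/A", "Afternoon"])

# Capture the one or two digits before the "h"; non-matching rows give NaN
hour_text = times_raw.str.extract(r"^(\d{1,2})h\d{2}$", expand=False)
hour = pd.to_numeric(hour_text, errors="coerce")

print(hour.tolist())
```

Counting the resulting hours with `value_counts` would then show the peak hours.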


#### After all the data is clean we **save** the final dataset **into a CSV** file that we can load later for future analysis ####

In [19]:
sharks.to_csv("sharks_final_dataset.csv")
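
Two details are worth checking when the file is loaded again. By default `to_csv` also writes the index, which comes back as an `Unnamed: 0` column, and `read_csv` treats the string "N/A" as a missing value. A small round-trip sketch on a made-up frame, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

df = pd.DataFrame({"Month": ["May", "Apr"], "Age": ["52", "N/A"]})

# index=False leaves the row index out of the CSV text
csv_text = df.to_csv(index=False)

# Default parsing turns "N/A" back into NaN
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps "N/A" as a plain string
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(list(kept.columns))
print(default["Age"].tolist())
print(kept["Age"].tolist())
```

Which of the two readings is preferable depends on whether the later analysis expects NaN or the "N/A" label.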
