# Cleaning the shark attacks data set

## Objective

This project has two main purposes. On the one hand, we want to know which shark species is recorded most often in the attacks. On the other hand, we want to determine the profile (nationality, activity, sex and age) of the people who have provoked the most incidents.
The Global Shark Attack File defines a provoked incident as one in which the shark was speared, hooked, captured or in which a human drew "first blood".

In [None]:
#Imports
import pandas as pd
import numpy as np
import re

# Functions
# Evaluate the NA's
def evaluar_NA(data):
    # Pandas series denoting features and the sum of their null values
    null_sum = data.isna().sum()
    # Total
    total = null_sum.sort_values(ascending=False)
    # Percentage
    percent = ( ((null_sum / len(data.index))*100).round(2) ).sort_values(ascending=False) 
    # concatenate along the columns to create the complete dataframe
    df_NA = pd.concat([total, percent], axis=1, keys=['Number of NA', 'Percent NA'])   
    return df_NA

# data
data = pd.read_csv('../data/attacks.csv', encoding='latin-1')

Quick preview of the data

In [16]:
print(data.shape)
print(data.columns)


(25723, 24)
Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')


Our first task is to subset the initial dataset to keep only the variables that we are going to use:
* Type
* Country
* Activity
* Sex
* Age
* Species

In [17]:
# Subset the dataframe with the needed columns
df_subset = data[["Case Number", "Type", "Country", "Activity", "Sex ", "Age", "Species "]]

# Rename the "Sex " and "Species " columns to remove the trailing space and avoid problems later
df_subset = df_subset.rename(columns={"Sex ": "Sex", "Species ": "Species"})

df_subset.head()

Unnamed: 0,Case Number,Type,Country,Activity,Sex,Age,Species
0,2018.06.25,Boating,USA,Paddling,F,57.0,White shark
1,2018.06.18,Unprovoked,USA,Standing,F,11.0,
2,2018.06.09,Invalid,USA,Surfing,M,48.0,
3,2018.06.08,Unprovoked,AUSTRALIA,Surfing,M,,2 m shark
4,2018.06.04,Provoked,MEXICO,Free diving,M,,"Tiger shark, 3m"


## The biggest human troublemaker

In [18]:
# Select only the needed columns
df_humans = df_subset[["Type", "Country", "Activity", "Sex", "Age"]]
# Select only the provoked incidents
df_humans = df_humans[df_humans["Type"] == "Provoked"]
#df_humans

First of all, we are going to evaluate the number of NAs in the dataset.

In [5]:
# Evaluate the number and percentage of NAs per column
evaluar_NA(df_humans)

Unnamed: 0,Number of NA,Percent NA
Age,294,51.22
Sex,57,9.93
Activity,35,6.1
Country,3,0.52
Type,0,0.0


Due to the high percentage of NA's in the **Age** column, we've decided to exclude it from the dataset.

In [7]:
# Drop the "Age" column
df_humans_cl = df_humans.drop(["Age"], axis=1)

# Drop NA's
df_humans_cl = df_humans_cl.dropna(axis=0)
df_humans_cl.shape

(487, 4)
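
The decision above can be sketched as a rule: drop any column whose share of missing values exceeds a cut-off, then drop the remaining incomplete rows. This is only an illustration on a toy frame; the 50% threshold and the names `toy`, `NA_THRESHOLD` and `too_sparse` are assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one mostly-empty column
toy = pd.DataFrame({
    "Sex": ["M", "F", "M", None],
    "Age": [57.0, np.nan, np.nan, np.nan],
})

# Percentage of missing values per column
na_percent = toy.isna().mean() * 100

# Hypothetical cut-off: drop columns with more than half of their values missing
NA_THRESHOLD = 50.0
too_sparse = na_percent[na_percent > NA_THRESHOLD].index

# Drop the sparse columns first, then any row that still contains a NA
toy_clean = toy.drop(columns=too_sparse).dropna(axis=0)
print(toy_clean.shape)
```

Dropping the sparse column before calling `dropna` matters: doing it the other way round would discard every row that lacks an age, even though that column is not kept.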

Now, using a regular expression, we are going to extract the activity. The regex captures a word ending in "ing", optionally preceded by one other word, so that the free-text activities collapse into a few categories that are easier to group.

In [19]:
df_humans_cl["regActivity"] = df_humans_cl["Activity"].str.findall(r"((?:[A-Za-z-]*\s){0,1}(?:[A-Za-z]*ing))")
#print(df_humans_cl["regActivity"].value_counts())
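
To see what the pattern captures, here is a small sketch that applies the same regex with the standard `re` module to a few activity strings taken from the data shown in this notebook. The names `ACTIVITY_PATTERN`, `samples` and `matches` are illustrative only.

```python
import re

# Same pattern as above: an optional preceding word, then a word ending in "ing"
ACTIVITY_PATTERN = re.compile(r"((?:[A-Za-z-]*\s){0,1}(?:[A-Za-z]*ing))")

samples = [
    "Free diving",
    "Feeding sharks",
    "Kayak fishing for sharks",
    "Fishing / Wading",
    "Shark fishing, knocked overboard",
]

# findall returns every non-overlapping match, in order of appearance
matches = {text: ACTIVITY_PATTERN.findall(text) for text in samples}
for text, found in matches.items():
    print(f"{text!r} -> {found}")
```

A single string can yield more than one match (for example "Fishing / Wading"), which is why only the first element of each list is kept in the next step.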


In [25]:
# Drop rows where the regex found no match (empty lists)
df_humans_cl = df_humans_cl[df_humans_cl["regActivity"].str.len() != 0].copy()

# Keep only the first match of each "regActivity" list
df_humans_cl["regActivity"] = df_humans_cl["regActivity"].apply(lambda x: x[0])

# The first match is already a string, so ''.join leaves it unchanged
df_humans_cl["regActivity"] = df_humans_cl["regActivity"].apply(''.join)
df_humans_cl

Unnamed: 0,Type,Country,Activity,Sex,regActivity
4,Provoked,MEXICO,Free diving,M,Free diving
10,Provoked,AUSTRALIA,Feeding sharks,M,Feeding
14,Provoked,AUSTRALIA,Feeding sharks,F,Feeding
41,Provoked,AUSTRALIA,Kayak fishing for sharks,M,Kayak fishing
55,Provoked,MALAYSIA,Fishing / Wading,M,Fishing
...,...,...,...,...,...
6224,Provoked,VANUATU,Attempting to drive shark from area,M,Attempting
6226,Provoked,USA,Skin diving. Grabbed shark's tail; shark turne...,M,Skin diving
6250,Provoked,BAHAMAS,Testing movie camera in full diving dress,M,Testing
6254,Provoked,CUBA,"Shark fishing, knocked overboard",M,Shark fishing


Create the final dataset

In [26]:
# Select the final columns and rename them
humans = df_humans_cl[["Country", "Sex", "regActivity"]].rename(columns = {"regActivity" : "Activity"})
# Export the data to its own file, so the sharks export below does not overwrite it
humans.to_csv("output/humans.csv")

Unnamed: 0,Country,Sex,Activity
4,MEXICO,M,Free diving
10,AUSTRALIA,M,Feeding
14,AUSTRALIA,F,Feeding
41,AUSTRALIA,M,Kayak fishing
55,MALAYSIA,M,Fishing
...,...,...,...
6224,VANUATU,M,Attempting
6226,USA,M,Skin diving
6250,BAHAMAS,M,Testing
6254,CUBA,M,Shark fishing


## Shark species involved in attacks by country

In [None]:
df_sharks = df_subset[["Country", "Species"]]

In [None]:
# Check NAs
evaluar_NA(df_sharks)

In [None]:
# Drop rows with null values in "Country" or "Species"
sharks_cl = df_sharks.dropna(axis=0).copy()
sharks_cl.shape


Extracting the shark species

In [None]:
# Test
# Extract one or two words before the word 'shark', 'catfish' or 'pointer'
sharks_cl['spp'] = sharks_cl["Species"].str.findall(r"((?:[A-Za-z-]*\s){1,2}(?:[Ss]hark|[Cc]atfish|[Pp]ointer))")
sharks_cl.iloc[50:100, [1,2]]
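
The behaviour of this pattern is easier to judge on a few concrete strings. The sketch below runs the same regex with the standard `re` module on three `Species` values from the preview at the top of the notebook; the names `SPECIES_PATTERN`, `NOT_A_SPECIES`, `extracted` and `species` are illustrative only.

```python
import re

# Same pattern as above: one or two words followed by shark, catfish or pointer
SPECIES_PATTERN = re.compile(
    r"((?:[A-Za-z-]*\s){1,2}(?:[Ss]hark|[Cc]atfish|[Pp]ointer))"
)

samples = ["White shark", "2 m shark", "Tiger shark, 3m"]

# Join all matches of each string into one value, as the notebook does later
extracted = {text: "".join(SPECIES_PATTERN.findall(text)) for text in samples}

# Matches that describe a size rather than a species are filtered out
NOT_A_SPECIES = {" m shark", " shark", "No shark", "Not a shark", "A small shark"}
species = [s.title() for s in extracted.values() if s not in NOT_A_SPECIES]
print(species)
```

Because digits are not part of the character class, "2 m shark" is reduced to " m shark" with a leading space, which is why that exact string has to be excluded afterwards.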

In [None]:
# Keep only the rows where the regex found at least one match
sharks_cl = sharks_cl[sharks_cl["spp"].str.len() != 0].copy()
sharks_cl.loc[114, "spp"]

In [None]:
# Convert to strings
sharks_cl["spp_str"] = sharks_cl["spp"].apply(''.join)
# Delete rows whose match is not a species: ' m shark', ' shark', 'No shark', 'Not a shark', 'A small shark'
sharks_cl2 = sharks_cl[~sharks_cl["spp_str"].isin([' m shark', ' shark', 'No shark', 'Not a shark', 'A small shark'])].copy()
sharks_cl2["spp_str"] = sharks_cl2["spp_str"].str.title()


In [None]:
# Final 
sharks = sharks_cl2[["Country","spp_str"]]
# Rename and reset index
sharks = sharks.rename(columns = {"spp_str" : "Species"}).reset_index()
# Export to csv
sharks.to_csv("output/sharks.csv")