# Cleaning the shark attacks data set

## Objective

The main purpose of this project is twofold. On the one hand, we want to know which shark species is recorded most often in the attacks. On the other hand, we want to determine the profile of the people who have provoked the most incidents (nationality, activity, sex and age).

In [41]:
#Imports
import pandas as pd
import numpy as np
import re

In [42]:
# Import data
data = pd.read_csv('../data/attacks.csv', encoding='latin-1')
print(data.shape)
data.columns


(25723, 24)


Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

Our first task is to subset the initial dataset, keeping only the variables that we are going to use (the code below also keeps `Case Number`):
* Type
* Country
* Activity
* Sex
* Age
* Species

In [43]:
# Subset the dataframe with the needed columns
df_subset = data[["Case Number", "Type", "Country", "Activity", "Sex ", "Age", "Species "]]

# Rename the Sex and Species columns to remove the trailing space and avoid problems later
df_subset = df_subset.rename(columns={"Sex ": "Sex", "Species ": "Species"})
df_subset.head()

| | Case Number | Type | Country | Activity | Sex | Age | Species |
|---|---|---|---|---|---|---|---|
| 0 | 2018.06.25 | Boating | USA | Paddling | F | 57.0 | White shark |
| 1 | 2018.06.18 | Unprovoked | USA | Standing | F | 11.0 | |
| 2 | 2018.06.09 | Invalid | USA | Surfing | M | 48.0 | |
| 3 | 2018.06.08 | Unprovoked | AUSTRALIA | Surfing | M | | 2 m shark |
| 4 | 2018.06.04 | Provoked | MEXICO | Free diving | M | | Tiger shark, 3m |
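
The trailing spaces in the `Sex ` and `Species ` column names can also be removed for every column at once instead of renaming them one by one. The following is a minimal sketch, assuming all column labels are strings; the frame `toy` and its values are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical toy frame mimicking the padded column names of the raw file
toy = pd.DataFrame({
    "Sex ": ["F", "M"],
    "Age": ["57", "11"],
    "Species ": ["White shark", None],
})

# Index.str.strip() removes leading and trailing whitespace from every label
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

This avoids having to list each affected column by hand, at the cost of touching every label in the frame.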


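In this part, we look at the null values in each column, in particular "Species". We are only interested in provoked attacks, so the intended first step is to subset by attack type; in the cell below that filter is commented out, so `df` still holds every attack type.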

In [49]:
# Filter for provoked attacks (currently disabled, so all attack types are kept):
# df = df_subset[df_subset["Type"] == "Provoked"]
df = df_subset
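
If the provoked-only filter is enabled, it can be written as a boolean mask. The following is a minimal sketch on made-up rows: the frame `toy` is hypothetical, and stripping whitespace before comparing is an assumption about how the `Type` values might be stored, not something verified on the real data.

```python
import pandas as pd

# Hypothetical rows; one "Provoked" value carries a trailing space on purpose
toy = pd.DataFrame({
    "Type": ["Provoked", "Unprovoked", "Provoked ", None, "Invalid"],
    "Species": ["Tiger shark, 3m", None, "Lemon shark", None, None],
})

# Strip whitespace, then compare; missing values compare as False
mask = toy["Type"].str.strip().eq("Provoked")

# .copy() gives an independent frame that can be modified safely later
provoked = toy[mask].copy()

print(provoked)
```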

In [50]:
def evaluar_NA(data):
    # Pandas series denoting features and the sum of their null values
    null_sum = data.isna().sum()
    # Total
    total = null_sum.sort_values(ascending=False)
    # Percentage
    percent = ( ((null_sum / len(data.index))*100).round(2) ).sort_values(ascending=False) 
    # concatenate along the columns to create the complete dataframe
    df_NA = pd.concat([total, percent], axis=1, keys=['Number of NA', 'Percent NA'])   
    return df_NA

In [51]:
evaluar_NA(df)

| | Number of NA | Percent NA |
|---|---|---|
| Species | 22259 | 86.53 |
| Age | 22252 | 86.51 |
| Sex | 19986 | 77.7 |
| Activity | 19965 | 77.62 |
| Country | 19471 | 75.69 |
| Type | 19425 | 75.52 |
| Case Number | 17021 | 66.17 |
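
The same kind of summary can be computed more compactly, because the mean of a boolean mask is the fraction of `True` values. The following is a minimal sketch on a made-up frame; the helper name `na_summary` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def na_summary(frame):
    """Return the count and percentage of missing values per column."""
    is_na = frame.isna()
    summary = pd.DataFrame({
        "Number of NA": is_na.sum(),
        # mean of True/False values is the share of missing entries
        "Percent NA": (is_na.mean() * 100).round(2),
    })
    return summary.sort_values("Number of NA", ascending=False)

# Hypothetical data: Age has 2 missing values, Sex has 1, Type has none
toy = pd.DataFrame({
    "Age": [57.0, np.nan, np.nan, 20.0],
    "Sex": ["F", "M", None, "M"],
    "Type": ["Boating", "Provoked", "Invalid", "Unprovoked"],
})

result = na_summary(toy)
print(result)
```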


Because of the large number of NA values in the shark species column, we are going to split the data into two parts: one to analyze the species and another to analyze the humans.
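
Part of the high NA percentage may come from rows that are empty in every column, which would inflate the denominator. This is an assumption suggested by the counts above, not something verified here. The following is a minimal sketch of dropping such rows on a made-up frame (`toy` is hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 3 are missing in every column
toy = pd.DataFrame({
    "Country": ["USA", np.nan, "MEXICO", np.nan],
    "Species": ["White shark", np.nan, np.nan, np.nan],
})

# how="all" drops a row only when every column in it is missing
non_empty = toy.dropna(how="all")

print(len(toy), len(non_empty))
```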

### Shark attacks provoked by humans, by country

In [52]:
d_sharks = df[["Country", "Species"]]

In [53]:
# Keep only the rows whose "Species" value is not null.
# .copy() returns an independent frame, which avoids a SettingWithCopyWarning
# when a new column is added to it later.
d_sharks_cl = d_sharks.dropna(subset=["Species"]).copy()
d_sharks_cl.shape

(3464, 2)

In [54]:
# Check NAs
evaluar_NA(d_sharks_cl)

| | Number of NA | Percent NA |
|---|---|---|
| Country | 12 | 0.35 |
| Species | 0 | 0.0 |


Extract the shark species

In [82]:
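# Test: capture one to three words followed by "shark"
# Earlier attempt:
#d_sharks_cl['spp'] = d_sharks_cl["Species"].str.findall(r"\bshark")
# Note: [shark]+ is a character class (any run of the letters s, h, a, r, k),
# not the literal word "shark", so some matches are not species names
d_sharks_cl['spp'] = d_sharks_cl["Species"].str.findall(r"(?:[A-Za-z]+\s){1,3}[shark]+")
print(d_sharks_cl['spp'])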

0                                           [White shark]
3                                               [m shark]
4                                           [Tiger shark]
6                                           [Tiger shark]
7                                           [Lemon shark]
                              ...                        
6276                                        [tiger shark]
6293                                                   []
6294                                                   []
6295                                                   []
6296    [Said to involve a, grey nurse shark, of the w...
Name: spp, Length: 3464, dtype: object
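
The difference between the character class `[shark]+` and the literal word `shark` explains fragments such as "Said to involve a" in the output above. The following is a minimal sketch with Python's `re` module that compares the two; the sample strings are made up, and the `literal` pattern is a suggested alternative, not the one used in the notebook.

```python
import re

# Pattern used in the notebook: [shark]+ matches any run of the letters s, h, a, r, k
char_class = re.compile(r"(?:[A-Za-z]+\s){1,3}[shark]+")

# Suggested alternative (an assumption): require the literal word "shark"
literal = re.compile(r"(?:[A-Za-z]+\s){1,3}shark")

# Hypothetical sample strings
samples = ["Tiger shark, 3m", "Said to involve a grey nurse shark"]

for text in samples:
    print(char_class.findall(text), literal.findall(text))
```

With the literal pattern, the single letter "a" no longer ends a match, so the second sample yields one match instead of two. Neither pattern matches a capitalized "Shark"; passing `flags=re.IGNORECASE` would be one way to handle that if it occurs in the data.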
