# The Shark Data Set Exploratory Analysis


Objectives: 

1. Define the question to answer
2. Clean up at least 3 columns from the Shark Data Set (https://www.kaggle.com/teajay/global-shark-attacks/version/1)
3. Answer the question based on the analysis

# Objective #1: Defining the question

Is your chance of dying in a shark attack higher if you provoke the shark?

# Objective #2: Clean up 3 columns from the Data Set

In [1]:
import pandas as pd
import numpy as np

## Dropping 'Case Number' columns

In [2]:
shark = pd.read_csv('GSAF5.csv',encoding='latin-1')

In [3]:
(~(shark['Case Number.1'] == shark['Case Number.2'])).sum()
(~(shark['Case Number.1'] == shark['Case Number'])).sum()
(~(shark['Case Number.2'] == shark['Case Number'])).sum()

2

In [4]:
# Which one to delete?
mask1 = (~(shark['Case Number.1'] == shark['Case Number.2']))
mask2 = (~(shark['Case Number.1'] == shark['Case Number']))
mask3 = (~(shark['Case Number.2'] == shark['Case Number']))

# The divergent rows are:
div_cases = shark[['Case Number','Case Number.1','Case Number.2','pdf','Investigator or Source']].loc[(mask1|mask2|mask3),:]
div_cases.index


# We decided to replace the conflicting (divergent) strings and keep only one "Case Number" column.
# The correct values are taken from the "pdf" column.

# We will use a variable to store temporary values of the DataFrame: shark_temp
shark_temp = shark.copy()
shark_temp.iloc[list(div_cases.index),0] = shark.iloc[list(div_cases.index),16].apply(lambda x: x.split('-')[0])

# Now we can drop the repeated columns "Case Number.1" and "Case Number.2"
keep_cols = ['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'original order', 'Unnamed: 22','Unnamed: 23']

shark_temp = shark_temp.loc[:,keep_cols]
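The divergence check and the repair from the `pdf` file name can be illustrated on a toy frame. This is a minimal sketch, not the notebook's code: the `toy` DataFrame and its values are made up for illustration, and `nunique(axis=1)` is swapped in for the three pairwise masks (unlike the `==` comparisons, it ignores NaN by default).

```python
import pandas as pd

# Hypothetical toy data: the second row has a conflicting case number.
toy = pd.DataFrame({
    'Case Number':   ['2016.06.07', '2015.01.24.b'],
    'Case Number.1': ['2016.06.07', '2015.01.24'],
    'Case Number.2': ['2016.06.07', '2015.01.24'],
    'pdf':           ['2016.06.07-Oneill.pdf', '2015.01.24-Smith.pdf'],
})

case_cols = ['Case Number', 'Case Number.1', 'Case Number.2']

# A row diverges when its three case-number columns hold more than one distinct value.
divergent = toy[case_cols].nunique(axis=1) > 1

# Take the case number from the part of the pdf file name before the first '-'.
toy.loc[divergent, 'Case Number'] = toy.loc[divergent, 'pdf'].str.split('-').str[0]

# Only one case-number column is needed after the repair.
toy = toy.drop(columns=['Case Number.1', 'Case Number.2'])
print(toy)
```

Selecting rows by a boolean mask with `.loc` avoids depending on the column position (`0` and `16`) used with `.iloc` above.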

## Dropping last 3 columns 

* Unnamed: 22
* Unnamed: 23
* original order

In [5]:
# The 'Unnamed: 22' and 'Unnamed: 23' columns are almost entirely NaN, so they carry no meaningful information and we discard them.

print(shark_temp.loc[:,'Unnamed: 22'].isna().sum())
print(shark_temp.loc[:,'Unnamed: 23'].isna().sum())

# Column "original order" holds unique numbers ranging from 1 to 5993; it works like an index that follows case time order, so we drop it as well.
shark_temp = shark_temp.loc[:,'Case Number':'href']

5991
5990


## Cleaning column names

In [6]:
#cleaning column names:
col_list = [i.lower().strip().replace(' ', '_') for i in shark_temp.columns]
shark_temp.columns = col_list
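As a quick illustration of what the list comprehension does, here is the same expression applied to a few of the original column names. The `raw_cols` and `clean_cols` names are only for this sketch.

```python
# A few of the raw column names from the data set, including trailing spaces.
raw_cols = ['Case Number', 'Sex ', 'Fatal (Y/N)', 'Species ', 'Investigator or Source']

# lower-case, strip surrounding whitespace, then turn inner spaces into underscores
clean_cols = [c.lower().strip().replace(' ', '_') for c in raw_cols]
print(clean_cols)
```

Calling `strip()` before `replace()` matters: it keeps the trailing space in `'Sex '` and `'Species '` from turning into a trailing underscore.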

## Cleaning the 'type' column

Relabel some 'Invalid' values as 'Unprovoked'.

The relabelling is based on keywords in the 'activity' column that imply peaceful activities, such as:
'swim', 'surf', 'div', 'board', 'bath'

In [7]:
# checking number of values in each category before 'clean up'
shark_temp.type.value_counts()

Unprovoked      4386
Provoked         557
Invalid          519
Sea Disaster     220
Boat             200
Boating          110
Name: type, dtype: int64

In [8]:
shark_temp.loc[shark_temp.type == 'Invalid',:].head(2)

Unnamed: 0,case_number,date,year,type,country,area,location,activity,name,sex,age,injury,fatal_(y/n),time,species,investigator_or_source,pdf,href_formula,href
50,2016.06.07,07-Jun-16,2016,Invalid,USA,South Carolina,"Folly Beach, Charleston County",Surfing,Jack O'Neill,M,27,"No injury, board damaged",N,11h30,Said to involve an 8' shark but more likely da...,"C. Creswell, GSAF",2016.06.07-Oneill.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...
73,2016.04.08,08-Apr-16,2016,Invalid,CAPE VERDE,Boa Vista Island,,,a British citizen,M,60,"""Serious""",N,,Shark involvement not confirmed,L.O.Guttke,2016.04.08-CapeVerde.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...


In [9]:
import re
shark_act = shark_temp.loc[shark_temp.type == 'Invalid',:].loc[:,'activity']

In [10]:
# Function to apply to the 'activity' values of the rows whose type is 'Invalid'

def func_re(x):
    '''
    Return 'Unprovoked' if the activity string x matches any pattern in the
    filters list, and 'Invalid' otherwise (including when x is not a string,
    e.g. NaN).

    example 1:
        Input: 'Surfing'
        Output: 'Unprovoked'

    example 2:
        Input: 'fishing'
        Output: 'Invalid'
    '''
    # Raw strings keep '\w' from being read as an invalid string escape.
    filters = [r'[Ss]urf\w+', r'[Ss]wim\w+', r'[Bb]ath\w+', r'[Ww]ad\w+',
               r'[Bb]oard\w+', r'[Ss]norkel\w+', r'[Kk]ayak\w+', r'[Dd]iv\w+']

    # Missing activities (NaN) are floats, not strings, and stay 'Invalid'.
    if not isinstance(x, str):
        return 'Invalid'

    for activity in filters:
        if re.search(activity, x):
            return 'Unprovoked'
    return 'Invalid'
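The same relabelling rule can also be expressed without a Python-level loop. The sketch below is an alternative, not the notebook's method: it combines the keywords into a single pattern and uses `Series.str.contains`. The `activities` sample is hypothetical.

```python
import numpy as np
import pandas as pd

# Same keywords as the filters list, combined into one non-capturing alternation.
pattern = r'(?:[Ss]urf|[Ss]wim|[Bb]ath|[Ww]ad|[Bb]oard|[Ss]norkel|[Kk]ayak|[Dd]iv)\w+'

# Hypothetical sample of 'activity' values, including a missing one.
activities = pd.Series(['Surfing', 'fishing', np.nan, 'Scuba diving', 'Standing'])

# na=False treats missing activities as non-matches, so they stay 'Invalid'.
is_peaceful = activities.str.contains(pattern, regex=True, na=False)

relabelled = pd.Series(np.where(is_peaceful, 'Unprovoked', 'Invalid'),
                       index=activities.index)
print(relabelled.tolist())
```

Because `\w+` requires at least one more word character after the keyword, a bare `'Surf'` would not match in either version.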

In [11]:
# Applying the function above to the shark_temp DataFrame

shark_temp.loc[shark_temp.type == 'Invalid', 'type'] = shark_act.apply(func_re)

In [12]:
# checking 'type' values after clean up
shark_temp.type.value_counts()

Unprovoked      4618
Provoked         557
Invalid          287
Sea Disaster     220
Boat             200
Boating          110
Name: type, dtype: int64

## Cleaning 'Fatal(Y/N)' column

* A clean version of the 'Fatal(Y/N)' column will be stored in a new column: 'fatal'

In [13]:
shark.loc[:,'Fatal (Y/N)'].unique()

array(['N', 'Y', nan, 'UNKNOWN', ' N', 'F', 'N ', '#VALUE!', 'n'],
      dtype=object)

In [14]:
# First: copy the "Fatal (Y/N)" column into 'fatal'
shark_temp['fatal'] = shark.loc[:,'Fatal (Y/N)']
shark_temp['fatal'].value_counts()

N          4315
Y          1552
UNKNOWN      94
 N            8
n             1
F             1
N             1
#VALUE!       1
Name: fatal, dtype: int64

In [15]:
# Not fatal [' N','N ','n','N'] -> 0
# Fatal ['Y','F'] -> 1 
# Other [Nan, #VALUE!, UNKNOWN] -> np.nan

shark_temp['fatal'] = shark_temp.loc[:,'fatal'].apply(lambda x: 0 if x in [' N','N ','n','N'] else (1 if x in ['Y','F'] else np.nan))

In [16]:
shark_temp['fatal'].value_counts()

0.0    4325
1.0    1553
Name: fatal, dtype: int64
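The mapping above can also be written by normalising the strings first and then mapping them with a dictionary. This is a minimal sketch on the unique values listed earlier, not the notebook's code. Note one assumption: upper-casing makes it slightly broader than the original lambda, since a lower-case 'y' or 'f' would also count as fatal.

```python
import numpy as np
import pandas as pd

# The unique raw values reported for the 'Fatal (Y/N)' column.
raw = pd.Series(['N', 'Y', np.nan, 'UNKNOWN', ' N', 'F', 'N ', '#VALUE!', 'n'])

# Strip whitespace and upper-case, then map; anything not in the dict becomes NaN.
fatal = raw.str.strip().str.upper().map({'N': 0, 'Y': 1, 'F': 1})
print(fatal.tolist())
```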

## Discarding rows that are not needed for the analysis

The 'type' column contains the following unique values:
* Invalid (leftovers after the previous cleaning step)
* Boat
* Boating
* Sea Disaster
* Provoked 
* Unprovoked

Only 'Provoked' and 'Unprovoked' are relevant to the question. Rows with any other value are discarded, since they don't add value to our final result.


In [17]:
# Drop every row whose type is not relevant to the question
shark_temp = shark_temp.loc[~shark_temp.type.isin(['Invalid', 'Boating', 'Boat', 'Sea Disaster']), :]

## Dropping NaN in 'country' column

In [18]:
shark_temp = shark_temp.loc[~(shark_temp.country.isna()),:]

## Cleaning 'sex' column

In [19]:
shark_temp.sex.unique()

array(['M', 'F', nan, 'M ', 'lli'], dtype=object)

In [20]:
# The row with sex 'lli' belongs to a male victim (Brian Kang); check with: shark.loc[shark['Sex '] == 'lli', :]
shark_temp.loc[shark_temp.sex == 'lli','sex'] = 'M'


In [21]:
# Strip stray whitespace ('M ' -> 'M') and label missing values as 'undefined'
shark_temp.sex = shark_temp.sex.apply(lambda x: x.strip() if isinstance(x, str) else 'undefined')

# Objective #3: Answering the defined question

What are your chances of dying in a shark attack if you provoked it first?  

In [25]:
shark_temp.groupby(['type'])['fatal'].describe().sort_values(by=['mean'], ascending=False).iloc[:,:2].reset_index()

| | type | count | mean |
|---|---|---|---|
| 0 | Unprovoked | 4521.0 | 0.267419 |
| 1 | Provoked | 551.0 | 0.030853 |


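**Final answer:**

In this data set, you are more likely to survive a provoked attack than an unprovoked one: about 3.1% of provoked attacks were fatal, compared with about 26.7% of unprovoked attacks.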
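To put the two means side by side, the short calculation below takes the counts and means from the table above and derives the ratio between the fatality rates, plus the approximate number of fatal cases in each group. The variable names are only for this sketch.

```python
# Counts and mean fatality rates copied from the groupby table.
unprovoked_count, unprovoked_mean = 4521, 0.267419
provoked_count, provoked_mean = 551, 0.030853

# How many times higher the fatality rate is for unprovoked attacks.
ratio = unprovoked_mean / provoked_mean

# Approximate number of fatal cases behind each mean.
unprovoked_fatal = round(unprovoked_count * unprovoked_mean)
provoked_fatal = round(provoked_count * provoked_mean)

print(f'fatality rate ratio: {ratio:.1f}x')
print(f'fatal cases: unprovoked={unprovoked_fatal}, provoked={provoked_fatal}')
```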


# Bonus Question: 

What are the odds of surviving if a woman provokes a shark? 

In [24]:
shark_temp.groupby(['type','sex'])['fatal'].describe().sort_values(by=['mean'], ascending=False).iloc[:,:2]
# .loc[top_10]

| type | sex | count | mean |
|---|---|---|---|
| Unprovoked | undefined | 160.0 | 0.3625 |
| Unprovoked | M | 3843.0 | 0.272964 |
| Unprovoked | F | 518.0 | 0.196911 |
| Provoked | F | 19.0 | 0.052632 |
| Provoked | M | 476.0 | 0.031513 |
| Provoked | undefined | 56.0 | 0.017857 |


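Answer: If she provokes the shark first, a woman is about 67% more likely to die than a man (5.3% vs. 3.2% fatality), although the female figure rests on only 19 recorded cases.

But if she does not provoke it, her chance of dying is about 28% lower than a man's (19.7% vs. 27.3%), which makes her chance of surviving about 10% higher.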
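The relative differences between women and men can be rechecked directly from the means in the table. This is a small arithmetic sketch. The variable names are only for illustration.

```python
# Mean fatality rates copied from the groupby table.
provoked_f, provoked_m = 0.052632, 0.031513
unprovoked_f, unprovoked_m = 0.196911, 0.272964

# Provoked: female fatality rate relative to male.
provoked_ratio = provoked_f / provoked_m

# Unprovoked: female fatality rate relative to male.
unprovoked_fatal_ratio = unprovoked_f / unprovoked_m

# Unprovoked: female survival rate relative to male.
unprovoked_survival_ratio = (1 - unprovoked_f) / (1 - unprovoked_m)

print(f'provoked, fatality F/M:   {provoked_ratio:.2f}')
print(f'unprovoked, fatality F/M: {unprovoked_fatal_ratio:.2f}')
print(f'unprovoked, survival F/M: {unprovoked_survival_ratio:.2f}')
```

A ratio above 1 means the female rate is higher than the male rate, and a ratio below 1 means it is lower.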
