In [1]:
import pandas as pd
import numpy as np
import random
import os

from functools import reduce


# TO-DO: 
- How to add the dataset to gitignore?
- How can I create a separate module that holds my functions?
- change the functions so that they use dictionaries and not 


![Shark attacks, a project by Roberto Henríquez Perozo. Data Analytics Bootcamp at IronHack](shark-attacks.png)



<br><br>


<center>
    <h1> PART I: Data cleaning and exploration</h1>
</center>


## 🎣️ Step 0 - Basic knowledge
Before starting this project, it helps to have a basic understanding of shark attacks.

As I did not know much about the topic when the project started, I turned to the Wikipedia article on shark attacks: https://en.wikipedia.org/wiki/Shark_attack

With this information in mind, below is the process of data exploration, cleaning, and wrangling.


## 🎣️ Step 1 - Defining the dataset path, and importing it to begin basic dataset exploration

In [2]:
# To follow along and access the DataSet, download it from KAGGLE using this link
# https://www.kaggle.com/teajay/global-shark-attacks

# Once you have downloaded the DataSet, change the following `dataset` variable to match the 
# path where you have saved the 'attacks.csv' file.
dataset = 'attacks.csv' 


df = pd.read_csv(dataset, encoding='latin-1')

Now we will check some basic information about the dataset, in order to formulate a better-informed hypothesis that the available data can actually test.

Here, I notice that the `df` without duplicates is much smaller than the full `df`: 6312 rows versus 25723.

## 🌊️  FUNCT
This comparison could be turned into its own function, as it will be executed quite often

In [3]:
# I'll delete the duplicated rows
print('before', df.shape)
df = df.drop_duplicates()
print('after', df.shape)

# And also take a look at the columns
df.columns

before (25723, 24)
after (6312, 24)


Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')
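The note above suggests turning the before/after comparison into a function. A minimal sketch of what that could look like follows; the helper name `drop_duplicates_verbose` and the toy data are assumptions made for illustration, not part of the original notebook. Note that `drop_duplicates` also treats rows that are entirely empty as duplicates of each other, which may explain part of the large drop in row count.

```python
import pandas as pd


def drop_duplicates_verbose(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop fully duplicated rows and print the shape before and after."""
    before = frame.shape
    deduped = frame.drop_duplicates()
    removed = before[0] - deduped.shape[0]
    print(f"before {before} -> after {deduped.shape} ({removed} rows removed)")
    return deduped


# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "Date": ["25-Jun-2018", "25-Jun-2018", None, None],
    "Country": ["USA", "USA", None, None],
})
clean = drop_duplicates_verbose(toy)
```

On the real data, the call would simply be `df = drop_duplicates_verbose(df)`.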

In [4]:
# I want to take a look at the time structures
df.Date

0        25-Jun-2018
1        18-Jun-2018
2        09-Jun-2018
3        08-Jun-2018
4        04-Jun-2018
            ...     
6307             NaN
6308             NaN
6309             NaN
8702             NaN
25722            NaN
Name: Date, Length: 6312, dtype: object

In [5]:
# I want to see what data is in the last few
# columns, which have unclear labels
df[['Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23']]

Unnamed: 0,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,2018.06.04,6299.0,,
...,...,...,...,...,...
6307,,,6309.0,,
6308,,,6310.0,,
6309,,,,,
8702,,,,,


In [6]:
# Too many null values on the last two columns... let's count them
print(df.shape)
df[['Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23']].isnull().sum()

(6312, 24)


Case Number.1       10
Case Number.2       10
original order       3
Unnamed: 22       6311
Unnamed: 23       6310
dtype: int64

In [7]:
# Since there is only 1 value in the 'Unnamed: 22' column, and 2 values in the
# 'Unnamed: 23' column, I'll not consider this data for my analysis.
df = df.drop(columns=['Unnamed: 22', 'Unnamed: 23'])
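Instead of naming the nearly empty columns by hand, the same decision can be driven by the share of nulls per column. This is only a sketch: the helper `drop_sparse_columns`, its threshold, and the toy data are assumptions chosen for illustration.

```python
import pandas as pd


def drop_sparse_columns(frame, max_null_fraction=0.99):
    """Drop columns whose share of null values exceeds max_null_fraction."""
    null_fraction = frame.isnull().mean()
    sparse = null_fraction[null_fraction > max_null_fraction].index
    return frame.drop(columns=sparse)


# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "Country": ["USA", "AUSTRALIA", "MEXICO", "USA"],
    "Unnamed: 22": [None, None, None, None],
})
trimmed = drop_sparse_columns(toy, max_null_fraction=0.9)
print(list(trimmed.columns))
```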

In [8]:
# The following columns seemed a little odd, so I do a value count to find out what they are about
df['Case Number.1'].value_counts().sort_values()

2007.10.13      1
1962.12.09      1
1993.02.18      1
1872.00.00      1
1920.00.00.a    1
               ..
1913.08.27.R    2
2009.12.18      2
2006.09.02      2
2013.10.05      2
1962.06.11.b    2
Name: Case Number.1, Length: 6285, dtype: int64

In [9]:
df['Case Number.2'].value_counts().sort_values()

1956.06.20      1
1962.12.09      1
1993.02.18      1
1872.00.00      1
1920.00.00.a    1
               ..
1990.05.10      2
1962.06.11.b    2
2005.04.06      2
2012.09.02.b    2
1920.00.00.b    2
Name: Case Number.2, Length: 6286, dtype: int64

# 🌊️ Creating a Function

After some thought and googling, these seem to be a notation used to categorize and order the shark attacks.

They also use a notation that includes dates.
## Is it the same as the Date on the Date column?

To find out, I wanted to pick random samples of the DataFrame, but Python's built-in `random` module kept giving me trouble when I tried to use it alongside pandas.
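As a side note, pandas has its own sampling method, `DataFrame.sample`, which avoids mixing `random` with DataFrame indexing. The sketch below uses hypothetical toy rows; unlike repeated calls to `random.choice`, `sample` draws without replacement by default, so the same row is not returned twice.

```python
import pandas as pd

# Toy rows standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "Date": ["25-Jun-2018", "Mar-1948", "1954"],
    "Case Number.1": ["2018.06.25", "1948.03.00", "1954.00.00.e"],
    "Case Number.2": ["2018.06.25", "1948.03.00", "1954.00.00.e"],
})

# Draw 2 random rows; random_state makes the draw repeatable
picked = toy[["Date", "Case Number.1"]].sample(n=2, random_state=0)
print(picked)
```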

In [10]:
# It was better to make it into a function
# VER 03
def sampler(df, column, sample_size):
    """Yield `sample_size` randomly chosen rows from the given column(s) of `df`."""

    # Defining how many samples to fetch from the df
    for _ in range(sample_size):

        # A random row position is drawn from the total number of rows...
        position = random.randrange(len(df))
        # ... and the values at that position are yielded:
        yield df.iloc[position][column]

        # For future versions, it would be good to return the data as a tuple
        # with just the values instead of a formatted string,
        # or even better, as a pandas DataFrame with the results


# Now let's try it out
sampler(df, ['Date', 'Case Number.1', 'Case Number.2'], 10)

<generator object sampler at 0x7f10b1181570>

In [11]:
# Since sampler returns a generator, list() must be used to see its contents
list(sampler(df, ['Date', 'Case Number.1', 'Case Number.2'], 10))

[Date             6-Aug-2005
 Case Number.1    2005.08.06
 Case Number.2    2005.08.06
 Name: 1562, dtype: object,
 Date               Mar-1948
 Case Number.1    1948.03.00
 Case Number.2    1948.03.00
 Name: 4551, dtype: object,
 Date             Reported 29-Nov-2005
 Case Number.1            2005.11.29.R
 Case Number.2            2005.11.29.R
 Name: 1520, dtype: object,
 Date             18-May-2004
 Case Number.1     2004.05.18
 Case Number.2     2004.05.18
 Name: 1673, dtype: object,
 Date             11-Sep-1992
 Case Number.1     1992.09.11
 Case Number.2     1992.09.11
 Name: 2534, dtype: object,
 Date                     1954
 Case Number.1    1954.00.00.e
 Case Number.2    1954.00.00.e
 Name: 4351, dtype: object,
 Date             31-Jul-2011
 Case Number.1     2011.07.31
 Case Number.2     2011.07.31
 Name: 890, dtype: object,
 Date             23-May-2012
 Case Number.1     2012.05.23
 Case Number.2     2012.05.23
 Name: 792, dtype: object,
 Date             Reported 07-Aug-

## 🦈️
From this output, we can see that the `Case Number` columns are replicating the info that we already have in the `Date` column. Therefore, we will drop both the `Case Number.1` and `Case Number.2` columns.
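Random samples give a good hint, but the two `Case Number` copies can also be compared across every row. The following is a sketch on hypothetical toy values, not a result computed on the real dataset.

```python
import pandas as pd

# Toy values standing in for the real columns (hypothetical)
toy = pd.DataFrame({
    "Case Number.1": ["2018.06.25", "1948.03.00", "1954.00.00.e", None],
    "Case Number.2": ["2018.06.25", "1948.03.00", "1954.00.00.f", None],
})

a, b = toy["Case Number.1"], toy["Case Number.2"]
# Count two missing values as a match explicitly
same = (a == b) | (a.isnull() & b.isnull())

print(f"{same.sum()} of {len(toy)} rows match")
# Show only the rows where the two columns disagree
print(toy[~same])
```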

In [12]:
df = df.drop(columns=['Case Number.1', 'Case Number.2'])
df

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,original order
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6303.0
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6302.0
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6301.0
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6300.0
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6299.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6307,0,,,,,,,,,,,,,,,,,,,6309.0
6308,0,,,,,,,,,,,,,,,,,,,6310.0
6309,0,,,,,,,,,,,,,,,,,,,
8702,,,,,,,,,,,,,,,,,,,,


Taking a closer look at the tail of the df, we notice that the last couple of rows are still holding many null values.


In [13]:
# Only the last 10 rows hold the nulls.
df.tail(15)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,original order
6297,ND.0005,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6.0
6298,ND.0004,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5.0
6299,ND.0003,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,4.0
6300,ND.0002,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,,FATAL,Y,,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,3.0
6301,ND.0001,1845-1853,0.0,Unprovoked,CEYLON (SRI LANKA),Eastern Province,"Below the English fort, Trincomalee",Swimming,male,M,15.0,"FATAL. ""Shark bit him in half, carrying away t...",Y,,,S.W. Baker,ND-0001-Ceylon.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2.0
6302,0,,,,,,,,,,,,,,,,,,,6304.0
6303,0,,,,,,,,,,,,,,,,,,,6305.0
6304,0,,,,,,,,,,,,,,,,,,,6306.0
6305,0,,,,,,,,,,,,,,,,,,,6307.0
6306,0,,,,,,,,,,,,,,,,,,,6308.0


In [14]:
# Since there are only 10 such rows, we can drop them manually using their index:
df = df.drop([6302, 6303,6304,6305,6306,6307,6308,6309,8702,25722])
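Dropping by hard-coded index works here, but it breaks if the row order ever changes. A possible alternative is `dropna` restricted to the descriptive columns. The toy data and the choice of `descriptive` columns below are assumptions for illustration.

```python
import pandas as pd

# Toy rows standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "Case Number": ["2018.06.25", "ND.0001", "0", None],
    "Date": ["25-Jun-2018", "1845-1853", None, None],
    "Country": ["USA", "CEYLON (SRI LANKA)", None, None],
    "original order": [6303.0, 2.0, 6304.0, None],
})

# Keep a row only if at least one descriptive column holds a value.
# 'Case Number' and 'original order' are left out of the check because
# the otherwise empty rows can still carry values there.
descriptive = ["Date", "Country"]
cleaned = toy.dropna(subset=descriptive, how="all")
print(cleaned.index.tolist())
```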

In [15]:
#results
df.tail(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,original order
6297,ND.0005,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6.0
6298,ND.0004,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5.0
6299,ND.0003,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,4.0
6300,ND.0002,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,,FATAL,Y,,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,3.0
6301,ND.0001,1845-1853,0.0,Unprovoked,CEYLON (SRI LANKA),Eastern Province,"Below the English fort, Trincomalee",Swimming,male,M,15.0,"FATAL. ""Shark bit him in half, carrying away t...",Y,,,S.W. Baker,ND-0001-Ceylon.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2.0


# 🦈️
With this cleaned DataFrame, we can look deeper into the actual data.

Notice that we still have additional columns that are not giving us any **'meaty information'**!

In [16]:
# These are just old indexes
# df['original order'].value_counts()

# And these are only the titles of the pdfs
# df['pdf'].value_counts()

# let's get rid of them
df = df.drop(columns = ['pdf', 'original order'])

# 🦈️
The `href` and `href formula` columns look very similar, but pandas truncates them when displaying the DataFrame, so we'll try our `sampler()` function again to compare the two.

In [17]:
list(sampler(df, ['href', 'href formula'],2))
# This, however, still returns truncated links that cannot be compared.
# Example:

[href            http://sharkattackfile.net/spreadsheets/pdf_di...
 href formula    http://sharkattackfile.net/spreadsheets/pdf_di...
 Name: 545, dtype: object,
 href            http://sharkattackfile.net/spreadsheets/pdf_di...
 href formula    http://sharkattackfile.net/spreadsheets/pdf_di...
 Name: 1433, dtype: object]

# 🦈️
Well.. Oops?
To actually see the contents, I've resorted to two separate methods, `random.sample` and a `for` loop.

These cells can be re-run a couple of times to make sure the data in the columns is homogeneous
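The truncation comes from a pandas display setting rather than from the data itself. A sketch of a third option, assuming pandas 1.0 or later (where `None` means "no limit" for `display.max_colwidth`), is to lift the limit temporarily:

```python
import pandas as pd

# One real-looking link taken from the outputs in this notebook
toy = pd.DataFrame({
    "href": ["http://sharkattackfile.net/spreadsheets/pdf_directory/2012.11.04.b-Riglos.pdf"],
})

# Lift the column-width limit only inside this block
with pd.option_context("display.max_colwidth", None):
    full_text = repr(toy)

print(full_text)
```

Outside the `with` block the default limit applies again, so other cells keep their compact display.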

In [18]:
# With Random sample
display(random.sample(list(df['href']), 5))
display(random.sample(list(df['href formula']), 5))

['http://sharkattackfile.net/spreadsheets/pdf_directory/1894.06.15.a-R-Rover.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1863.00.00.R2-Ceylon.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1984.11.11-Stetson.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/2010.02.06.a-RivieraBeach.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1858.01.09.R-Tonga.pdf']

['http://sharkattackfile.net/spreadsheets/pdf_directory/2014.10.18-Roberson.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/2005.03.05-SolomonIslands.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/2002.05.31.b-Fontan.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1986.11.04-Lund.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1959.08.12-Tomaz.pdf']

In [19]:
# With a FOR loop
for i in range(5):
    e = random.choice(range(1000))
    print(f"index: {e}, href:         {df.iloc[e]['href']}")
    print(f"index: {e}, href formula: {df.iloc[e]['href formula']}")

index: 724, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2012.11.04.b-Riglos.pdf
index: 724, href formula: http://sharkattackfile.net/spreadsheets/pdf_directory/2012.11.04.b-Riglos.pdf
index: 706, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2013.01.26-boat.pdf
index: 706, href formula: http://sharkattackfile.net/spreadsheets/pdf_directory/2013.01.26-boat.pdf
index: 111, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2017.07.24-Matsu.pdf
index: 111, href formula: http://sharkattackfile.net/spreadsheets/pdf_directory/2017.07.24-Matsu.pdf
index: 104, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2017.08.03-Massachusetts.pdf
index: 104, href formula: http://sharkattackfile.net/spreadsheets/pdf_directory/2017.08.03-Massachusetts.pdf
index: 287, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2016.04.13-Senkowicz.pdf
index: 287, href formula: http://sharkattackfile.net/spreadsheets/p

## 🦈️
The links in both columns seem to match, most of the time anyway.

In some cases, the `href` value contains a duplicated URL prefix, which corrupts the link and makes it inaccessible.

The `href formula` column, however, keeps the correct URL format.

In [20]:
# Examples of the corrupted link versus their working counterpart
print(df.iloc[332]['href']), print(df.iloc[332]['href formula'])
print()
print(df.iloc[324]['href']), print(df.iloc[324]['href formula'])
print()
print(df.iloc[588]['href']), print(df.iloc[588]['href formula'])
print()
print(df.iloc[569]['href']), print(df.iloc[569]['href formula'])

http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2015.11.15.a-Engelman.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2015.11.15.a-Engelman.pdf

http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2015.12.21.a-Brazil.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2015.12.21.a-Brazil.pdf

http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2014.00.00.b-OceanicWhitetip.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2014.00.00.b-OceanicWhitetip.pdf

http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2014.04.03-Armstrong.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2014.04.03-Armstrong.pdf


(None, None)
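The corrupted links all repeat the same directory prefix, so they can be detected (and, if needed, repaired) with plain string operations. The helpers `is_doubled` and `repair` below are hypothetical names introduced for this sketch.

```python
PREFIX = "http://sharkattackfile.net/spreadsheets/pdf_directory/"


def is_doubled(url: str) -> bool:
    """Return True when the directory prefix appears more than once."""
    return url.count(PREFIX) > 1


def repair(url: str) -> str:
    """Keep a single prefix followed by whatever comes after the last prefix."""
    if not is_doubled(url):
        return url
    return PREFIX + url.rsplit(PREFIX, 1)[-1]


broken = PREFIX + PREFIX + "2015.11.15.a-Engelman.pdf"
print(is_doubled(broken))
print(repair(broken))
```

Applied to a column, something like `df['href'].dropna().map(is_doubled).sum()` would count how many links are affected.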

In [21]:
# for the sake of simplicity, we will drop the `href` column, and replace it with the `href formula`
df = df.drop(columns='href')

## 🏊️ IDEAS:
- The PDFs behind the links seem quite structured
  - It could be possible to parse them later down the road and use a **REGEX** to find more data
  - For example, adding a column that lists the **'Moon Phase'** described in some of the PDFs
- I also ran the query a few times and noticed that all the PDFs have been uploaded to the same website and share the same naming structure
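As a first step towards the regex idea, the shared naming structure of the links can already be parsed without opening any PDF. The pattern below is an assumption inferred from the sample links shown in this notebook (a `YYYY.MM.DD` code with optional suffixes, a dash, a label, then `.pdf`); it does not cover every file name, for example the `ND-0005-...` ones.

```python
import re

# Assumed structure: .../<case code>-<label>.pdf, with a case code such as 2012.11.04.b
PDF_NAME = re.compile(
    r"/(?P<case>\d{4}\.\d{2}\.\d{2}(?:\.[A-Za-z0-9]+)*)-(?P<label>[^/]+)\.pdf$"
)


def parse_pdf_url(url):
    """Return the case code and label found in the URL, or None if it does not fit."""
    match = PDF_NAME.search(url)
    return match.groupdict() if match else None


base = "http://sharkattackfile.net/spreadsheets/pdf_directory/"
print(parse_pdf_url(base + "2012.11.04.b-Riglos.pdf"))
print(parse_pdf_url(base + "ND-0005-RoebuckBay.pdf"))
```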

## 🦈️

Some column names can be simplified, and some have unnecessary whitespace.

Let's fix that right away.

In [22]:
df.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'href formula'],
      dtype='object')

In [23]:
#The column with the name of the victims does not bring much relevant information to our study
df = df.drop(columns='Name')

In [24]:
# Rename the remaining columns with shorter names and no stray whitespace
df.columns = ['CaseNum', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Sex', 'Age', 'Injury', 'Fatal', 'Time',
       'Species', 'Source', 'href']
df.columns

Index(['CaseNum', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Sex', 'Age', 'Injury', 'Fatal', 'Time', 'Species',
       'Source', 'href'],
      dtype='object')
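Assigning a full list to `df.columns` depends on the columns being in exactly the expected order. A sketch of a name-based alternative, using an empty toy frame with a subset of the original column names:

```python
import pandas as pd

# Empty toy frame with some of the original column names
toy = pd.DataFrame(columns=["Case Number", "Sex ", "Fatal (Y/N)", "Species ",
                            "Investigator or Source", "href formula"])

# Strip stray whitespace first, then rename by name rather than by position
toy.columns = toy.columns.str.strip()
toy = toy.rename(columns={
    "Case Number": "CaseNum",
    "Fatal (Y/N)": "Fatal",
    "Investigator or Source": "Source",
    "href formula": "href",
})
print(list(toy.columns))
```

Columns missing from the mapping keep their (stripped) name, so a reordered or partially cleaned DataFrame is renamed the same way.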

## 🏊️ SPECIES IDEAS: 
- create a standardized column for species

## 🦈️
Many of the values from the Species column are `nulls`. We'll fill them with the same `Invalid` value that other cells already have

In [25]:
#BEFORE
df.Species.value_counts()

White shark                                           163
Shark involvement prior to death was not confirmed    105
Invalid                                               102
Shark involvement not confirmed                        88
Tiger shark                                            73
                                                     ... 
Oceanic whitetip sharks were in the vicinity            1
15'                                                     1
2.4 m shark                                             1
White shark, 2 m to 4 m [6'9" to 13']                   1
1.4 m [4'6"] blacktip shark                             1
Name: Species, Length: 1549, dtype: int64

In [26]:
df.Species = df.Species.fillna('Invalid')

#AFTER
df.Species.value_counts() 

Invalid                                               2940
White shark                                            163
Shark involvement prior to death was not confirmed     105
Shark involvement not confirmed                         88
Tiger shark                                             73
                                                      ... 
Oceanic whitetip sharks were in the vicinity             1
15'                                                      1
2.4 m shark                                              1
White shark, 2 m to 4 m [6'9" to 13']                    1
1.4 m [4'6"] blacktip shark                              1
Name: Species, Length: 1549, dtype: int64

In [27]:
# With the .describe() method, we can see that there are 1549 unique values in this column.
# It would be interesting to create a new column which narrows this down to fewer unique values.
df.Species.describe()

count        6302
unique       1549
top       Invalid
freq         2940
Name: Species, dtype: object

## 🦈️ 

[Can you guess the Pokémon?](https://i.ytimg.com/vi/mg1A94zBWBw/hqdefault.jpg)

There are more than 400 species of sharks, and while lurking around the `Species` column you can find all sorts of weird animals. Did you know there's even a species known as the *'Cookie Cutter shark'*?

I certainly had no clue.

Here I tried to sort the data a bit by creating a secondary column which *maps* the different species, while also removing possible sources of confusion.

- *note: the categorization is quite lax, as this project is more an exercise with data and less a research paper. The data, however, can be processed even further by forking this GitHub repo.*

Below is the process of creating this secondary column.

In [28]:
# Mapping the Species column to decrease the amount of unique values
df['Species2'] = df['Species']  # .map(lambda x: 'White shark' if 'White shark' in x else x)
df.Species2.value_counts() 

Invalid                                               2940
White shark                                            163
Shark involvement prior to death was not confirmed     105
Shark involvement not confirmed                         88
Tiger shark                                             73
                                                      ... 
Oceanic whitetip sharks were in the vicinity             1
15'                                                      1
2.4 m shark                                              1
White shark, 2 m to 4 m [6'9" to 13']                    1
1.4 m [4'6"] blacktip shark                              1
Name: Species2, Length: 1549, dtype: int64

# 🏊️ 

In [29]:
def shark_identifier2(x):
    
    #THERE ARE SO MANY ERRORS IN THIS COLUMN, let's filter them
    # NOT CONFIRMED
    if 'not confirmed' in x.lower():
        return "OTHER / NOT KNOWN"
    if 'unidentified' in x.lower():
        return 'OTHER / NOT KNOWN'
    if ' or ' in x.lower():
        return 'OTHER / NOT KNOWN'
    #INVALIDS
    if 'no shark involvement' in x.lower():
        return 'INVALID ENTRY'
    if 'invalid' in x.lower():
        return 'INVALID ENTRY'
    if 'questionable' in x.lower():
        return 'INVALID ENTRY'
    if 'doubtful' in x.lower():
        return 'INVALID ENTRY'
    #OTHER INVALIDS
    if 'hoax' in x.lower():
        return 'HOAX'
    if 'drown' in x.lower():
        return 'DROWNED'
    if 'stingray' in x.lower():
        return 'STINGRAY'
    
    
    #  --- WHO'S THAT POKEMON?---
    
    if 'white shark' in x.lower():
        return "White shark"
    if 'tiger shark' in x.lower():
        return "Tiger shark"
    if 'bull shark' in x.lower():
        return "Bull shark"
    if 'nurse shark' in x.lower():
        return 'Nurse shark'
    if 'brown shark' in x.lower():
        return 'Brown shark'
    if 'mako shark' in x.lower():
        return 'Mako Shark'
    if 'blue shark' in x.lower():
        return 'Blue shark'
    if 'bronze whaler shark' in x.lower():
        return 'Bronze whaler shark'
    if 'blacktip shark' in x.lower():
        return 'Blacktip shark'
    if 'whitetip shark' in x.lower():
        return 'Whitetip shark'
    if 'sandbar shark' in x.lower():
        return 'Sandbar shark'
    if 'lemon shark' in x.lower():
        return 'Lemon shark'
    if 'hammerhead shark' in x.lower():
        return 'Hammerhead shark'
    if 'raggedtooth shark' in x.lower():
        return 'Raggedtooth shark'
    if 'thresher shark' in x.lower():
        return 'Thresher shark'
    if 'dusky shark' in x.lower():
        return 'Dusky shark'
    if 'wobbegong shark' in x.lower():
        return 'Wobbegong shark'
    if 'spinner shark' in x.lower():
        return 'Spinner shark'
    if 'blue nose shark' in x.lower():
        return 'Blue nose shark'
    if 'leopard shark' in x.lower():
        return 'Leopard shark'
    if 'silvertip shark' in x.lower():
        return 'Silvertip shark'
    if 'gray shark' in x.lower():
        return 'Gray shark'
    if 'grey shark' in x.lower():
        return 'Gray shark'
    if 'reef shark' in x.lower():
        return 'Reef shark'
    if 'carpet shark' in x.lower():
        return 'Carpet shark'
    if 'whaler shark' in x.lower():
        return 'Whaler shark'
    if 'zambesi shark' in x.lower():
        return 'Zambesi shark'
    
    # -- trying to filter sizes --

    if 'small shark' in x.lower():
        return 'Small shark'
    
    else:
        return 'OTHER / NOT KNOWN'
    
df['Species2'] = df['Species'].map(shark_identifier2)

In [30]:
print(df.Species2.value_counts())

INVALID ENTRY          3052
OTHER / NOT KNOWN      1467
White shark             625
Tiger shark             275
Bull shark              171
Nurse shark              94
Reef shark               65
Bronze whaler shark      60
Blacktip shark           56
Small shark              55
Mako Shark               53
Wobbegong shark          46
Hammerhead shark         44
Raggedtooth shark        43
Blue shark               38
Lemon shark              34
Zambesi shark            29
Whitetip shark           23
Spinner shark            20
Dusky shark              12
Carpet shark              8
Whaler shark              6
Sandbar shark             5
Thresher shark            4
Brown shark               3
Gray shark                3
DROWNED                   2
HOAX                      2
Blue nose shark           2
Silvertip shark           2
Leopard shark             2
STINGRAY                  1
Name: Species2, dtype: int64
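The long chain of `if` statements can also be expressed as data, in line with the TO-DO at the top of the notebook. The sketch below repeats only a handful of the rules, and the names `SPECIES_RULES` and `identify_species` are hypothetical. An ordered list of pairs is used so that the rule order stays explicit: the first keyword found wins, which is why `bronze whaler shark` comes before `whaler shark`.

```python
# Ordered (keyword, label) rules; only a few of the notebook's rules are repeated here
SPECIES_RULES = [
    ("not confirmed", "OTHER / NOT KNOWN"),
    ("invalid", "INVALID ENTRY"),
    ("white shark", "White shark"),
    ("tiger shark", "Tiger shark"),
    ("bronze whaler shark", "Bronze whaler shark"),
    ("whaler shark", "Whaler shark"),
]


def identify_species(text, rules=SPECIES_RULES, default="OTHER / NOT KNOWN"):
    """Return the label of the first rule whose keyword appears in the text."""
    lowered = text.lower()
    for keyword, label in rules:
        if keyword in lowered:
            return label
    return default


print(identify_species("White shark, 2 m to 4 m"))
print(identify_species("Bronze whaler shark, 3m"))
print(identify_species("15'"))
```

It would be applied the same way as before, for example `df['Species'].map(identify_species)`.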


In [31]:
# See which species generate the most fatalities
# df2.groupby('Species2', 'Fatal').filter(lambda x : x > 2)
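One way the commented-out idea could be made to work is with `pd.crosstab`, which counts incidents per species and fatality flag. This is a sketch on hypothetical toy rows; it assumes the `Fatal` column has already been normalized to `Y`/`N`, and the threshold of 2 is arbitrary.

```python
import pandas as pd

# Toy rows standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "Species2": ["White shark", "White shark", "Tiger shark", "Bull shark", "White shark"],
    "Fatal": ["Y", "N", "Y", "N", "Y"],
})

# Rows: species, columns: fatal flag, cells: number of incidents
counts = pd.crosstab(toy["Species2"], toy["Fatal"])

# Keep species with at least 2 fatal incidents, most fatal first
deadliest = counts[counts["Y"] >= 2].sort_values("Y", ascending=False)
print(deadliest)
```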

## 🦈️ Injuries and types of attack
The Global Shark Attack File (GSAF) categorizes scavenging bites on humans as "questionable incidents."

## PROVOKED
Provoked attacks occur when a human touches, hooks, nets, or otherwise aggravates the animal. Incidents that occur outside of a shark's natural habitat, such as aquariums and research holding-pens, are considered provoked, as are all incidents involving captured sharks. Sometimes humans inadvertently provoke an attack, such as when a surfer accidentally hits a shark with a surf board.

## UNPROVOKED
- Hit-and-run attack
- Sneak Attack
- Bump-and-bite attack 

For more information on how to differentiate PROVOKED vs UNPROVOKED attacks:
https://en.wikipedia.org/wiki/Shark_attack#Types_of_attacks

## 🏊️ TYPE OF ATTACK
- In the `Type` column, don't count `Sea Disaster`, `Questionable`, and `Boatomg`
- Standardize
- Size of the shark according to the `Species` column

In [32]:
df.Type.isnull().sum()

4

In [33]:
df.Type = df.Type.fillna('Invalid')
df.Type.value_counts()

Unprovoked      4595
Provoked         574
Invalid          551
Sea Disaster     239
Boating          203
Boat             137
Questionable       2
Boatomg            1
Name: Type, dtype: int64

In [34]:
# Boat, Boating and Boatomg to 1 category
filt = lambda x: 'Boat' if 'Boat' in x else x
df.Type = df.Type.map(filt)

# Questionable to Invalid
filt = lambda x : 'Invalid' if 'Questionable' in x else x
df.Type = df.Type.map(filt)

df.Type.value_counts()

Unprovoked      4595
Provoked         574
Invalid          553
Boat             341
Sea Disaster     239
Name: Type, dtype: int64

# Cleaning Activities

In [35]:
df.Activity.value_counts()

Surfing                                              971
Swimming                                             869
Fishing                                              431
Spearfishing                                         333
Bathing                                              162
                                                    ... 
Floating face-down in knee-deep water                  1
Abalone diving using Hookah (near calving whales)      1
Standing, watching seine netters                       1
Boogie boarding / wading                               1
Swimming near breakwater                               1
Name: Activity, Length: 1532, dtype: int64

In [36]:
def filt(x):
    if type(x) is str:
        if 'floating' in x.lower():
            return 'Floating'
        if 'diving' in x.lower():
            return 'Diving'
        if 'dive' in x.lower():
            return 'Diving'
        if 'skiing' in x.lower():
            return 'Skiing'
        if 'ski ' in x.lower():
            return 'Skiing'
        if 'surf' in x.lower():
            return 'Surfing'
        if 'snorkel' in x.lower():
            return 'Snorkeling'
        if 'fishing' in x.lower():
            return 'Fishing'
        if 'drift' in x.lower():
            return 'Drifting'
        if 'swim' in x.lower():
            return 'Swimming'
        if 'bathing' in x.lower():
            return 'Swimming'
        if 'paddle' in x.lower():
            return 'Paddleboarding'
        
        if 'raft' in x.lower():
            return 'Rafting'
        if 'playing' in x.lower():
            return 'Playing'
        
        if 'wading' in x.lower():
            return 'Wading'
        if 'research' in x.lower():
            return 'Research'
        if 'rescue' in x.lower():
            return 'Rescuing someone / something'
        if 'rescuing' in x.lower():
            return 'Rescuing someone / something'
        if 'overboard' in x.lower():
            return 'Fell from boat into water'        
        if 'fell' in x.lower():
            return 'Fell to the water'
        
        if 'boat' in x.lower():
            return 'Boating'
        if 'yacht' in x.lower():
            return 'Boating'
        if 'disaster' in x.lower():
            return 'Sea Disaster'
        if 'wreck' in x.lower():
            return 'Shipwreck'
        if 'capsize' in x.lower():
            return 'Shipwreck'
        if 'sank' in x.lower():
            return 'Shipwreck'
        if 'torpedo' in x.lower():
            return 'War (Torpedo)'
        if 'warship' in x.lower():
            return 'War (Warship)'
        if 'plane' in x.lower() and 'crash' in x.lower():
            return 'Airplane crash'
        if 'airliner' in x.lower():
            return 'Airplane crash'
        
        
        if 'filming' in x.lower():
            return 'Filming / Photoshoot'
        if 'sailing' in x.lower():
            return 'Sailing'
        if 'net ' in x.lower():
            return 'Fishing'
        if ' net' in x.lower():
            return 'Fishing'
        if 'photo' in x.lower():
            return 'Filming / Photoshoot'
        if 'sinking' in x.lower():
            return 'Sea Disaster'
        
        if 'board' in x.lower():
            return 'Other types of Boarding sports'
        else:
            return x
df['Activity2'] = df.Activity.map(filt)
df.Activity2.value_counts()

Swimming                                                 1274
Surfing                                                  1203
Fishing                                                  1119
Diving                                                    600
Wading                                                    163
                                                         ... 
Hunting turtle                                              1
Crouching in the water                                      1
Jumping in swells                                           1
Dismantling cable buoys of the cable ship All America       1
Stuffing a shark into an automobile                         1
Name: Activity2, Length: 370, dtype: int64
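
The chain of `if` checks in `filt` could also be expressed as data, which is in line with the to-do about using dictionaries. The sketch below is only an illustration: `ACTIVITY_KEYWORDS` and `categorize_activity` are hypothetical names, and the table covers just a handful of the keywords used above. Order matters, because the first matching keyword wins.

```python
# Hypothetical, shortened keyword table; order matters (first match wins).
ACTIVITY_KEYWORDS = [
    ('wading', 'Wading'),
    ('research', 'Research'),
    ('overboard', 'Fell from boat into water'),
    ('fell', 'Fell to the water'),
    ('boat', 'Boating'),
    ('wreck', 'Shipwreck'),
    ('capsize', 'Shipwreck'),
    ('sailing', 'Sailing'),
]


def categorize_activity(x):
    """Return the label of the first keyword found in x, else x unchanged."""
    if not isinstance(x, str):
        return x  # leave NaN / non-string values untouched
    lowered = x.lower()
    for keyword, label in ACTIVITY_KEYWORDS:
        if keyword in lowered:
            return label
    return x


print(categorize_activity('Fell overboard'))
print(categorize_activity('Stuffing a shark into an automobile'))
```

It would be applied the same way as `filt`, for example with `df.Activity.map(categorize_activity)`.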

In [37]:
# Fill the empty values too
df.Activity.isnull().sum()

544

In [38]:
df.Activity2 = df.Activity2.fillna('NOT SPECIFIED')

In [39]:
df.Activity2.isnull().sum()

0

# Fatalities
Fill the missing values in the `Fatal` column, then map every entry to `Y`, `N`, or `UNKNOWN`.

In [40]:
df.Fatal.isnull().sum()

539

In [41]:
df.Fatal = df.Fatal.fillna('UNKNOWN')

In [42]:
def filt(x):
    if 'UNKNOWN' in x.upper():
        return x
    if 'N' in x.upper():
        return 'N'
    if 'Y' in x.upper():
        return 'Y'
    else:
        return 'UNKNOWN'

df['Fatal'] = df.Fatal.map(filt)
df.Fatal.value_counts()

N          4301
Y          1389
UNKNOWN     612
Name: Fatal, dtype: int64

In [43]:
df.Injury.value_counts()

FATAL                                                                                                       802
Survived                                                                                                     97
Foot bitten                                                                                                  87
No injury                                                                                                    82
Leg bitten                                                                                                   72
                                                                                                           ... 
Right elbow bitten                                                                                            1
Left inner thigh                                                                                              1
Minor cuts above his right eye                                                                          

In [44]:
print(df.count() < 5000) # Flag the columns that have fewer than 5000 non-null values

CaseNum      False
Date         False
Year         False
Type         False
Country      False
Area         False
Location     False
Activity     False
Sex          False
Age           True
Injury       False
Fatal        False
Time          True
Species      False
Source       False
href         False
Species2     False
Activity2    False
dtype: bool
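
A boolean threshold hides how many values are actually missing. As a minimal sketch, `isnull().sum()` gives the count per column. The `sample` frame below is invented for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows, only to show the shape of the result.
sample = pd.DataFrame({
    'Age': [23, np.nan, np.nan],
    'Time': ['Morning', None, '14h00'],
    'Country': ['USA', 'AUSTRALIA', 'SOUTH AFRICA'],
})

# Number of missing values per column, largest first.
missing = sample.isnull().sum().sort_values(ascending=False)
print(missing)
```

On the real data the equivalent call would be `df.isnull().sum()`.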


## 🦈️
The `Age` and `Time` columns have many null values and are not going to be used to test our hypothesis, so we will drop them.

In [45]:
df = df.drop(columns=['Age', 'Time'])
df.columns
display(df.dtypes)

CaseNum       object
Date          object
Year         float64
Type          object
Country       object
Area          object
Location      object
Activity      object
Sex           object
Injury        object
Fatal         object
Species       object
Source        object
href          object
Species2      object
Activity2     object
dtype: object

In [46]:
df.duplicated().sum()

0

In [47]:
# drop dupes and compare lengths
df_pdf_nodupes = df.drop_duplicates()

len(df) - len(df_pdf_nodupes), 'duped values'

(0, 'duped values')

<br><br><br><br><br>
<br><br><br>

<h1><center> 🏄️ Exporting the cleaned Dataset </center></h1>


In [55]:
# Phew... 
#
# That was a long haul right there.
# It might not be perfect, but we have
# some cleaner data with which we can 
# make a more educated guess about 
# this topic.
#
# Now, we can take this dataframe and
# export it. This will be one of the
# products of this project and 
# hopefully benefit the future of
# sharing the planet with sharks.
#
#
#
# Oh, I almost forgot... 
#
# Exporting as '.csv':
#
# It cannot get any simpler than this:
df.to_csv('exported.csv')

In [49]:
def distance_measurments(x):
    
    #THERE ARE SO MANY ERRORS IN THIS COLUMN, let's filter them
    # NOT CONFIRMED
    
    # -- trying to filter sizes --
    
    if """10'""" in x.lower():
        return """10' shark"""
    if """11'""" in x.lower():
        return """11' shark"""
    if """12'""" in x.lower():
        return """12' shark"""
    if """13'""" in x.lower():
        return """13' shark"""
    if """14'""" in x.lower():
        return """14' shark"""
    if """15'""" in x.lower():
        return """15' shark"""
    if """16'""" in x.lower():
        return """16' shark"""
    if """17'""" in x.lower():
        return """17' shark"""
    if """18'""" in x.lower():
        return """18' shark"""
    if """19'""" in x.lower():
        return """19' shark"""
    if """20'""" in x.lower():
        return """20' shark"""
    if """21'""" in x.lower():
        return """21' shark"""
    
    if """1'""" in x.lower():
        return """1' shark"""
    if """2'""" in x.lower():
        return """2' shark"""
    if """3'""" in x.lower():
        return """3' shark"""
    if """4'""" in x.lower():
        return """4' shark"""
    if """5'""" in x.lower():
        return """5' shark"""
    if """6'""" in x.lower():
        return """6' shark"""
    if """7'""" in x.lower():
        return """7' shark"""    
    if """8'""" in x.lower():
        return """8' shark"""
    if """9'""" in x.lower():
        return """9' shark"""

    if 'small shark' in x.lower():
        return 'Small shark'
    
    else:
        return x
    
# df['Species2'] = df['Species'].map(shark_identifier)
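
The repeated size checks could probably be collapsed into one regular expression. This is a sketch under assumptions, not a drop-in replacement: `SIZE_IN_FEET` and `shark_size` are hypothetical names, and unlike the function above it also accepts sizes outside the 1' to 21' range and leaves non-string values untouched.

```python
import re

# One or two digits immediately followed by an apostrophe, e.g. "12'".
# The lookbehind keeps the match from starting inside a longer number.
SIZE_IN_FEET = re.compile(r"(?<!\d)(\d{1,2})'")


def shark_size(x):
    """Return "<n>' shark" when a size in feet is found, else x unchanged."""
    if not isinstance(x, str):
        return x  # NaN and other non-strings pass through
    match = SIZE_IN_FEET.search(x)
    if match:
        return f"{int(match.group(1))}' shark"
    if 'small shark' in x.lower():
        return 'Small shark'
    return x


print(shark_size("Tiger shark, 12'"))
print(shark_size('A small shark'))
```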

In [50]:
# Since there is no column that states if the attack was provoked or not,
# I want to analyze the injury column to distinguish between the cases that were provoked
# and those that were unprovoked.

random.sample(list(df.Injury.value_counts().items()),20)

[('FATAL, disappeared, body not recovered', 1),
 ('Injured by sharks, but managed to swim ashore 6.5 hours later', 1),
 ('Minor cuts to dorsum & sole of left foot when he stepped on shark PROVOKED INCIDENT',
  1),
 ('Lacerations to face and right leg', 1),
 ('No injury, shark bit rudder', 1),
 ('No injury to occupant; shark leapt into boat', 1),
 ('No injury, kayak bitten', 11),
 ('Shark involvement prior to death unconfirmed', 5),
 ('Severe injury to forearm near elbow', 1),
 ('Cut to arm while roping shark PROVOKED INCIDENT', 1),
 ('Thigh bitten, shark teeth embedded in canoe', 1),
 ('No injury to occupant, netted shark rammed & bit boat PROVOKED INCIDENT',
  1),
 ('Dr. A.R. Hernandez of ship Cabo Hornos treated passenger whose leg was severed by a shark',
  1),
 ('FATAL, other human remains bitten by sharks, 13 people missing', 1),
 ('No injury to occupants, sharks bit chunks from boat', 1),
 ('Foot severely bitten, surgically amputated', 1),
 ('Arm severed, but survived. Note: Some

In [51]:
# Categorizing  Provoked and  Unprovoked attacks
# df_provoked = np.where(df_nodupes.Injury.isin(provoked), True, False) 

# Passing that categorization to a new PROVOKED COLUMN
def provoked_attacks(x):
    
    provoked = ['PROVOKED', 'hook', 'shot']
    
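    # Check every keyword before falling back to the original value;
    # returning inside an `else` would stop after the first keyword
    for e in provoked:
        if e in str(x):
            return 'PROVOKED'
    return x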
df['Provoked'] = df.Injury.map(provoked_attacks)
df['Provoked'].tail(50)

6252                                                FATAL
6253                                  FATAL, leg severed 
6254                                             PROVOKED
6255                                      Buttocks bitten
6256                                            No injury
6257                                             Survived
6258                                                FATAL
6259    FATAL, shark leapt into raft and bit the man w...
6260                                       Buttock bitten
6261                                          Head bitten
6262                    FATAL, foot lacerated & crushed  
6263    FATAL, femoral artery severed, died 12 days la...
6264    FATAL, fell into water when shark seized his r...
6265        FATAL, left leg bitten with severe blood loss
6266                                FATAL, died of sepsis
6267                                      Buttocks bitten
6268                 Foot lacerated, surgically amputated
6269          

In [52]:
#df_clean.loc[df_clean["trany"].str.startswith("M"),"trany"] = "Manual"

provoked = ['PROVOKED', 'hook', 'shot']
#map(lambda words, x : words in x, provoked, df_nodupes.loc[df_nodupes['Injury'].str])
df.loc[df['Injury'].str]

  values = list(values)
  if data and all(isinstance(e, tuple) for e in data):


TypeError: 'Series' objects are mutable, thus they cannot be hashed
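
The `TypeError` above comes from passing the `.str` accessor itself to `df.loc`, which expects a boolean mask or labels. One possible fix is `Series.str.contains` with the keywords joined into a single pattern. The sketch below runs on a small invented frame: the `sample` rows are made up, the `'UNPROVOKED'` label is an assumption, and only the keyword list is copied from the cell above.

```python
import numpy as np
import pandas as pd

provoked = ['PROVOKED', 'hook', 'shot']

# Made-up rows, including a missing value.
sample = pd.DataFrame({'Injury': [
    'Cut to arm while roping shark PROVOKED INCIDENT',
    'Leg bitten',
    'Bitten while removing hook from shark',
    np.nan,
]})

# 'PROVOKED|hook|shot' matches any keyword; na=False counts missing values as no match.
mask = sample['Injury'].str.contains('|'.join(provoked), na=False)

# 'UNPROVOKED' is a hypothetical label for everything that did not match.
sample['Provoked'] = np.where(mask, 'PROVOKED', 'UNPROVOKED')
print(sample)
```

The same mask can be used for selection, for example `sample.loc[mask]`.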