
![Shark attacks, a project by Roberto Henríquez Perozo. Data Analytics Bootcamp at IronHack](INPUT/shark-attacks.png)

<br><br>
<center>
    <h1> PART I: Data cleaning and exploration</h1>
</center>

## 🎣️ Step 0 - Basic knowledge
Before starting this project, it helps to have a basic understanding of `Shark Attacks`.

As I did not know much about this topic when the project started, I turned to the Wikipedia article on shark attacks: https://en.wikipedia.org/wiki/Shark_attack

With this information in mind, below is the process of data exploration, cleaning, and wrangling.

In [1]:
# Start by importing the modules
import pandas as pd
import numpy as np
import random
import os

from functools import reduce

#pd.compat.PY3 = True #🔥️  no longer needed ?

## 🎣️ Step 1 - Defining the dataset path, and importing it to begin basic dataset exploration

In [2]:
# To follow along, download the DataSet from KAGGLE using this link
# https://www.kaggle.com/teajay/global-shark-attacks

# Once you have downloaded the DataSet, change the following `dataset` variable to match the 
# path where you have saved the 'attacks.csv' file.
dataset = 'INPUT/attacks.csv' 


df = pd.read_csv(dataset, encoding='latin-1')

Now, we will check some basic information about the dataset, in order to formulate a more educated hypothesis which we could actually put to test with the data available.

Here, I notice that the shape of the `df` with no duplicates is very small when compared to the whole `df`.

### 🌊️  Declutter Function
This comparison could be turned into its own function, as it will be executed quite often

In [3]:
# I'll delete the duplicated rows
def declutter(df):
    print('Data Frame shape before declutter:', df.shape)
    df = df.drop_duplicates()
    print('Data Frame shape after declutter: ', df.shape)
    return df

#calling the function to drop dupes
df = declutter(df)

# And also take a look at the columns
df.columns

('Data Frame shape before declutter:', (25723, 24))
('Data Frame shape after declutter: ', (6312, 24))


Index([           u'Case Number',                   u'Date',
                         u'Year',                   u'Type',
                      u'Country',                   u'Area',
                     u'Location',               u'Activity',
                         u'Name',                   u'Sex ',
                          u'Age',                 u'Injury',
                  u'Fatal (Y/N)',                   u'Time',
                     u'Species ', u'Investigator or Source',
                          u'pdf',           u'href formula',
                         u'href',          u'Case Number.1',
                u'Case Number.2',         u'original order',
                  u'Unnamed: 22',            u'Unnamed: 23'],
      dtype='object')

In [4]:
# I want to take a look at the time structures
df.Date.head()

0    25-Jun-2018
1    18-Jun-2018
2    09-Jun-2018
3    08-Jun-2018
4    04-Jun-2018
Name: Date, dtype: object

In [5]:
# I want to see what data is in the last few
# columns, which have uninformative labels
df[['Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23']].head()

Unnamed: 0,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,2018.06.04,6299.0,,


In [6]:
# Too many null values on the last two columns... let's count them
print(df.shape)
df[['Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23']].isnull().sum()

(6312, 24)


Case Number.1       10
Case Number.2       10
original order       3
Unnamed: 22       6311
Unnamed: 23       6310
dtype: int64

## 🎣️ Step 2 - Dropping null values and unnecessary columns

In [7]:
# Since there is only 1 value in the 'Unnamed: 22' column, and 2 values in the
# 'Unnamed: 23' column, I'll not consider this data for my analysis.
df = df.drop(columns=['Unnamed: 22', 'Unnamed: 23'])

In [8]:
# The column with the name of the victims does not bring much relevant information to our study
df = df.drop(columns='Name')

The `Age` and `Time` columns have many null values and are not going to be used to test our hypothesis, so we will drop them both.

In [9]:
df = df.drop(columns=['Age', 'Time'])

In [10]:
# The following columns looked a bit odd, so I ran a value count to find out what they are about
df['Case Number.1'].value_counts().sort_values().head(10)

ND.0016         1
1902.08.08      1
2006.02.12.a    1
2010.10.02.b    1
1924.01.25      1
1928.01.00      1
1968.04.15      1
1845.11.04      1
2010.05.18.a    1
2010.05.18.b    1
Name: Case Number.1, dtype: int64

In [11]:
df['Case Number.2'].value_counts().sort_values().head(10)

ND.0015         1
1902.08.08      1
2006.02.12.a    1
2010.10.02.b    1
1924.01.25      1
1928.01.00      1
1968.04.15      1
1845.11.04      1
2010.05.18.a    1
2010.05.18.b    1
Name: Case Number.2, dtype: int64

### 🌊️ Creating a 'sampler' function

After some thought and googling, it seems that these are some sort of notation used to categorize and order the shark attacks. The notation also includes dates.

**Is it the same as the Date on the Date column?**

To find out, I wanted to pick random samples of the dataframe, but python's built-in `random` module kept giving me trouble when I tried to use it alongside pandas.

In [12]:
#VER 03
def sampler(df, column, sample_size):
    # This generator yields randomly chosen rows, restricted to the given
    # column (or list of columns), from a pandas dataframe
    
    # Defining how many samples to fetch from the df
    for _ in range(sample_size):
        
        # Now a random row position is generated out of the total length of the df...
        i = random.randrange(len(df))
        # ... to yield the data values at that position:
        yield df.iloc[i][column]
        
        # For future versions, it would be good to look at how I can return the
        # data as a tuple with just the data, and not have it return a formatted string
        # or even better, a pandas dataframe with the results

# Now let's try it out
sampler(df, ['Date', 'Case Number.1', 'Case Number.2'], 10)

<generator object sampler at 0x7f6800f03a50>

In [13]:
# Since sampler returns a generator, list() must be used to see its contents
list(sampler(df, ['Date', 'Case Number.1', 'Case Number.2'], 5))

[Date             11-Apr-2009
 Case Number.1     2009.04.11
 Case Number.2     2009.04.11
 Name: 1136, dtype: object, Date             11-Dec-2015
 Case Number.1     2015.12.11
 Case Number.2     2015.12.11
 Name: 327, dtype: object, Date                   1909
 Case Number.1    1909.00.00
 Case Number.2    1909.00.00
 Name: 5427, dtype: object, Date              31-Mar-2007
 Case Number.1    2007.03.31.b
 Case Number.2    2007.03.31.b
 Name: 1392, dtype: object, Date              11-Apr-2001
 Case Number.1    2001.04.11.d
 Case Number.2    2001.04.11.d
 Name: 1960, dtype: object]

### 🦈️ 
From this output, we can see that the `Case Number` columns are actually replicating the info that we already have in the `Date` column. Therefore, we will drop both `Case Number` columns.
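Random samples only spot-check a handful of rows. As a complement, here is a minimal sketch of a systematic comparison, assuming dates look like `25-Jun-2018` and case numbers like `2018.06.25` (optionally followed by a suffix such as `.b`). The names `MONTHS`, `date_to_case_prefix`, and `case_matches_date` are made up for this illustration.

```python
# Month abbreviations as they appear in the `Date` column (e.g. "25-Jun-2018").
MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
          'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
          'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}


def date_to_case_prefix(date):
    """Turn '25-Jun-2018' into '2018.06.25'; return None for any other format."""
    parts = str(date).strip().split('-')
    if len(parts) != 3 or parts[1] not in MONTHS:
        return None
    day, month, year = parts
    if not (day.isdigit() and year.isdigit()):
        return None
    return '{}.{}.{}'.format(year, MONTHS[month], day.zfill(2))


def case_matches_date(case_number, date):
    """True if the case number starts with the date, ignoring suffixes like '.b'."""
    prefix = date_to_case_prefix(date)
    return prefix is not None and str(case_number).startswith(prefix)
```

With pandas, something like `df.apply(lambda row: case_matches_date(row['Case Number.1'], row['Date']), axis=1).mean()` would then give the share of rows where both columns agree. Rows with loose dates such as `1909` or `1883-1889` are counted as non-matching by this sketch.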

In [14]:
df = df.drop(columns=['Case Number.1', 'Case Number.2'])

 ### 🦈️  Taking a closer look at the tail of the df, we notice that the last couple of rows are still holding many null values.


In [15]:
# it's only the last 10 rows which have the nulls.
df.tail(12)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Sex,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,href,original order
6300,ND.0002,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,M,FATAL,Y,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,3.0
6301,ND.0001,1845-1853,0.0,Unprovoked,CEYLON (SRI LANKA),Eastern Province,"Below the English fort, Trincomalee",Swimming,M,"FATAL. ""Shark bit him in half, carrying away t...",Y,,S.W. Baker,ND-0001-Ceylon.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2.0
6302,0,,,,,,,,,,,,,,,,6304.0
6303,0,,,,,,,,,,,,,,,,6305.0
6304,0,,,,,,,,,,,,,,,,6306.0
6305,0,,,,,,,,,,,,,,,,6307.0
6306,0,,,,,,,,,,,,,,,,6308.0
6307,0,,,,,,,,,,,,,,,,6309.0
6308,0,,,,,,,,,,,,,,,,6310.0
6309,0,,,,,,,,,,,,,,,,


In [16]:
# Since there are only 10 instances, we can drop them manually using their index:
df = df.drop([6302, 6303,6304,6305,6306,6307,6308,6309,8702,25722])

In [17]:
#results
df.tail()

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Sex,Injury,Fatal (Y/N),Species,Investigator or Source,pdf,href formula,href,original order
6297,ND.0005,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,M,FATAL,Y,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6.0
6298,ND.0004,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,M,FATAL,Y,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5.0
6299,ND.0003,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,M,FATAL,Y,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,4.0
6300,ND.0002,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,M,FATAL,Y,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,3.0
6301,ND.0001,1845-1853,0.0,Unprovoked,CEYLON (SRI LANKA),Eastern Province,"Below the English fort, Trincomalee",Swimming,M,"FATAL. ""Shark bit him in half, carrying away t...",Y,,S.W. Baker,ND-0001-Ceylon.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2.0


### 🦈️ We still have additional columns that are not giving us any *'MEATY' info* !

In [18]:
# These are just old indexes
# df['original order'].value_counts()

# And these are only the titles of the pdfs
# df['pdf'].value_counts()

# let's get rid of them
df = df.drop(columns = ['pdf', 'original order'])

## 🎣️ Step 3 - Digging deeper into the data, and dropping more unnecessary data

The `href` and `href formula` columns look very similar, but since pandas truncates them when displaying the DataFrame, we'll try to use our `sampler()` function again to compare them both

In [19]:
list(sampler(df, ['href', 'href formula'],5))
# This however, returns invalid links which are not accurately represented.
# example: 

[href            http://sharkattackfile.net/spreadsheets/pdf_di...
 href formula    http://sharkattackfile.net/spreadsheets/pdf_di...
 Name: 527, dtype: object,
 href            http://sharkattackfile.net/spreadsheets/pdf_di...
 href formula    http://sharkattackfile.net/spreadsheets/pdf_di...
 Name: 2167, dtype: object,
 href            http://sharkattackfile.net/spreadsheets/pdf_di...
 href formula    http://sharkattackfile.net/spreadsheets/pdf_di...
 Name: 5739, dtype: object,
 href            http://sharkattackfile.net/spreadsheets/pdf_di...
 href formula    http://sharkattackfile.net/spreadsheets/pdf_di...
 Name: 5201, dtype: object,
 href            http://sharkattackfile.net/spreadsheets/pdf_di...
 href formula    http://sharkattackfile.net/spreadsheets/pdf_di...
 Name: 6266, dtype: object]

### 🦈️ Well, Oops? That didn't work as planned...

The links are not clickable, and to actually see the contents, I've resorted to two separate methods, `random.sample` and a `for` loop.

These cells can be re-run a couple of times to make sure the data in the columns is homogeneous

In [20]:
# With Random sample
display(random.sample(list(df['href']), 5))
display(random.sample(list(df['href formula']), 5))

[u'http://sharkattackfile.net/spreadsheets/pdf_directory/2016.07.28.R-Franz.pdf',
 u'http://sharkattackfile.net/spreadsheets/pdf_directory/1961.10.09-Haen.pdf',
 u'http://sharkattackfile.net/spreadsheets/pdf_directory/2016.07.15.a-MyrtleBeach.pdf',
 u'http://sharkattackfile.net/spreadsheets/pdf_directory/1933.02.15.R-Tumia.pdf',
 u'http://sharkattackfile.net/spreadsheets/pdf_directory/1944.09.03-PhilipStanton.pdf']

[u'http://sharkattackfile.net/spreadsheets/pdf_directory/1920.03.08-Burgess.pdf',
 u'http://sharkattackfile.net/spreadsheets/pdf_directory/2016.05.21.a-Girl.pdf',
 u'http://sharkattackfile.net/spreadsheets/pdf_directory/1996.02.26-Good.pdf',
 u'http://sharkattackfile.net/spreadsheets/pdf_directory/1977.03.13.c-Harrison.pdf',
 u'http://sharkattackfile.net/spreadsheets/pdf_directory/1991.07.30-Ivana-Iacaccia.pdf']

In [21]:
# With a FOR loop, using the .format method
for i in range(5):
    e = random.choice(range(1000))
    print("index: {} href:         {}".format(e, df.iloc[e]['href']))
    print("index: {} href formula: {}".format(e, df.iloc[e]['href formula']))

### 🦈️ The links in both columns seem to match, most of the time anyway.

In some cases, the `href` seems to have a duplicated prefix in its links, which corrupted them and made them inaccessible.

However, the `href formula` actually saved the correct URL format.

In [22]:
# 🔥️ FIX SYNTAX ERRORS

# Examples of the corrupted link versus their working counterpart
print(df.iloc[332]['href'])
print(df.iloc[332]['href formula'])
print()
print(df.iloc[324]['href'])
print(df.iloc[324]['href formula'])

http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2015.11.15.a-Engelman.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2015.11.15.a-Engelman.pdf
()
http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2015.12.21.a-Brazil.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2015.12.21.a-Brazil.pdf
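The corrupted links above differ from the working ones only by a repeated leading prefix. As an alternative to discarding the column, a repair could be sketched as follows, assuming that this repeated prefix is the only kind of corruption (`PREFIX` and `repair_href` are hypothetical names for this illustration):

```python
# Common prefix of every pdf link shown in the output above.
PREFIX = 'http://sharkattackfile.net/spreadsheets/pdf_directory/'


def repair_href(url):
    """Collapse a repeated leading PREFIX into a single occurrence."""
    while url.startswith(PREFIX + PREFIX):
        url = url[len(PREFIX):]
    return url
```

Links that are already well formed pass through unchanged, so the function could be mapped over the whole column.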


### 🦈️  For the sake of simplicity, we will drop the `href` column, and replace it with the `href formula`. Also, some column names can be simplified, and some have unnecessary white spaces. Let's fix that right away:

In [23]:
df = df.drop(columns='href')
df.columns

Index([u'Case Number', u'Date', u'Year', u'Type', u'Country', u'Area',
       u'Location', u'Activity', u'Sex ', u'Injury', u'Fatal (Y/N)',
       u'Species ', u'Investigator or Source', u'href formula'],
      dtype='object')

In [24]:
df.columns = ['CaseNum', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Sex', 'Injury', 'Fatal', 'Species', 'Source', 'href']
df.columns

Index([u'CaseNum', u'Date', u'Year', u'Type', u'Country', u'Area', u'Location',
       u'Activity', u'Sex', u'Injury', u'Fatal', u'Species', u'Source',
       u'href'],
      dtype='object')

### 🦈️ Many of the values from the **Species** column are `nulls`. We'll fill them with the same `Invalid` value that other cells already have

In [25]:
#BEFORE
df.Species.value_counts().head()

White shark                                           163
Shark involvement prior to death was not confirmed    105
Invalid                                               102
Shark involvement not confirmed                        88
Tiger shark                                            73
Name: Species, dtype: int64

In [26]:
df.Species = df.Species.fillna('Invalid')

#AFTER
df.Species.value_counts().head()

Invalid                                               2940
White shark                                            163
Shark involvement prior to death was not confirmed     105
Shark involvement not confirmed                         88
Tiger shark                                             73
Name: Species, dtype: int64

In [27]:
# With the .describe() method, we can see that there are 1549 unique values in this column
# It would be interesting to create a new column which narrows this down to fewer unique values.
df.Species.describe()

count        6302
unique       1549
top       Invalid
freq         2940
Name: Species, dtype: object

## 🎣️ Step 4 - Categorizing and beginning hypothesis formulation

### 🦈️    [Who's that Pokémon?!](https://i.ytimg.com/vi/mg1A94zBWBw/hqdefault.jpg) 

There are more than 400 species of sharks, and while lurking around the `Species` column you can find all sorts of weird animals. Did you know there's even one species of shark known as the *'**Cookie Cutter shark**'* ?

![cookie-cutter-shark](INPUT/cookie-cutter-shark.jpeg)

I certainly had no clue. 

### 🦈️ Mapping shark species 👁️
Here I tried to sort the data by creating a secondary column which *mapped* the different species, while also classifying possible confusions as `"OTHER / NOT KNOWN" `. 

- *note: the scrutiny for this categorization is quite lax, as this project is more an exercise with data and less a research paper. The data, however, can be processed even further by forking this github repo.*

In [28]:
# Duplicating the Species column to decrease the amount of unique values
df['Species2'] = df['Species']

### 🌊️ Shark Identifier function👁️
This can be turned into key:value pairs for easier reading

In [29]:
def shark_identifier2(x):
    
    #THERE ARE SO MANY ERRORS IN THIS COLUMN, let's filter them
    
    # NOT CONFIRMED
    if 'not confirmed' in x.lower():
        return "OTHER / NOT KNOWN"
    if 'unidentified' in x.lower():
        return 'OTHER / NOT KNOWN'
    if ' or ' in x.lower():
      #  print(x) 
        return 'OTHER / NOT KNOWN'
    
    # INVALIDS
    if 'no shark involvement' in x.lower():
        return 'INVALID ENTRY'
    if 'invalid' in x.lower():
        return 'INVALID ENTRY'
    if 'questionable' in x.lower():
        return 'INVALID ENTRY'
    if 'doubtful' in x.lower():
        return 'INVALID ENTRY'
    
    # OTHER INVALIDS
    if 'hoax' in x.lower():
        return 'INVALID ENTRY'
    if 'drown' in x.lower():
        return 'DROWNED'
    if 'stingray' in x.lower():
        return 'STINGRAY'
    
    
    #  --- WHO'S THAT POKEMON?---
    
    if 'white shark' in x.lower():
        return "White shark"
    if 'tiger shark' in x.lower():
        return "Tiger shark"
    if 'bull shark' in x.lower():
        return "Bull shark"
    if 'nurse shark' in x.lower():
        return 'Nurse shark'
    if 'brown shark' in x.lower():
        return 'Brown shark'
    if 'mako shark' in x.lower():
        return 'Mako Shark'
    if 'blue shark' in x.lower():
        return 'Blue shark'
    if 'bronze whaler shark' in x.lower():
        return 'Bronze whaler shark'
    if 'blacktip shark' in x.lower():
        return 'Blacktip shark'
    if 'whitetip shark' in x.lower():
        return 'Whitetip shark'
    if 'sandbar shark' in x.lower():
        return 'Sandbar shark'
    if 'lemon shark' in x.lower():
        return 'Lemon shark'
    if 'hammerhead shark' in x.lower():
        return 'Hammerhead shark'
    if 'raggedtooth shark' in x.lower():
        return 'Raggedtooth shark'
    if 'thresher shark' in x.lower():
        return 'Thresher shark'
    if 'dusky shark' in x.lower():
        return 'Dusky shark'
    if 'wobbegong shark' in x.lower():
        return 'Wobbegong shark'
    if 'spinner shark' in x.lower():
        return 'Spinner shark'
    if 'blue nose shark' in x.lower():
        return 'Blue nose shark'
    if 'leopard shark' in x.lower():
        return 'Leopard shark'
    if 'silvertip shark' in x.lower():
        return 'Silvertip shark'
    if 'gray shark' in x.lower():
        return 'Gray shark'
    if 'grey shark' in x.lower():
        return 'Gray shark'
    if 'reef shark' in x.lower():
        return 'Reef shark'
    if 'carpet shark' in x.lower():
        return 'Carpet shark'
    if 'whaler shark' in x.lower():
        return 'Whaler shark'
    if 'zambesi shark' in x.lower():
        return 'Zambesi shark'
    
    # -- trying to filter sizes --

    if 'small shark' in x.lower():
        return 'Other small shark'
    
    else:
        return 'OTHER / NOT KNOWN'
    
df['Species2'] = df['Species'].map(shark_identifier2)

In [30]:
df.Species2.value_counts()

INVALID ENTRY          3054
OTHER / NOT KNOWN      1467
White shark             625
Tiger shark             275
Bull shark              171
Nurse shark              94
Reef shark               65
Bronze whaler shark      60
Blacktip shark           56
Other small shark        55
Mako Shark               53
Wobbegong shark          46
Hammerhead shark         44
Raggedtooth shark        43
Blue shark               38
Lemon shark              34
Zambesi shark            29
Whitetip shark           23
Spinner shark            20
Dusky shark              12
Carpet shark              8
Whaler shark              6
Sandbar shark             5
Thresher shark            4
Brown shark               3
Gray shark                3
Silvertip shark           2
Blue nose shark           2
DROWNED                   2
Leopard shark             2
STINGRAY                  1
Name: Species2, dtype: int64
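The long `if` chain can be expressed as ordered key:value pairs. The sketch below is only an illustration of that idea, with a handful of the rules and hypothetical names (`SPECIES_RULES`, `classify`). It uses a list of pairs rather than a plain dictionary to make the ordering explicit: the first keyword found wins, so `'bronze whaler shark'` must come before `'whaler shark'`.

```python
# Ordered (keyword, label) pairs; only a few of the rules are shown here.
SPECIES_RULES = [
    ('not confirmed', 'OTHER / NOT KNOWN'),
    ('unidentified', 'OTHER / NOT KNOWN'),
    (' or ', 'OTHER / NOT KNOWN'),
    ('invalid', 'INVALID ENTRY'),
    ('white shark', 'White shark'),
    ('tiger shark', 'Tiger shark'),
    ('bronze whaler shark', 'Bronze whaler shark'),
    ('whaler shark', 'Whaler shark'),
]


def classify(text, rules, default='OTHER / NOT KNOWN'):
    """Return the label of the first rule whose keyword occurs in the text."""
    lowered = text.lower()
    for keyword, label in rules:
        if keyword in lowered:
            return label
    return default
```

It could be applied with `df['Species'].map(lambda s: classify(s, SPECIES_RULES))`. Adding a species then means adding one pair to the list instead of two more lines of code.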

### 🦈️ Types of attack 👁️
The GSAF categorizes *scavenging bites* on humans as "questionable incidents."

### - PROVOKED
Provoked attacks occur when a human touches, hooks, nets, or otherwise aggravates the animal. Incidents that occur outside of a shark's natural habitat, such as aquariums and research holding-pens, are considered provoked, as are all incidents involving captured sharks. Sometimes humans inadvertently provoke an attack, such as when a surfer accidentally hits a shark with a surf board.

### - UNPROVOKED
 - Hit-and-run attack
 - Sneak Attack
 - Bump-and-bite attack 

For more information on how to differentiate PROVOKED vs UNPROVOKED attacks :
https://en.wikipedia.org/wiki/Shark_attack#Types_of_attacks

### 🦈️ TYPE OF ATTACK (Considering Injuries)
- I wanted to use the information in the `Injury` column to determine whether the human hooked, shot, or provoked the shark. However, after mapping the `Injury` column, I noticed that the number of provoked attacks obtained this way differed only slightly from the number of provoked attacks in the `Type` column. **A decision was made to disregard these two additional cases.**
- Size of the shark according to Species column


In [31]:
df.Type.isnull().sum()

4

In [32]:
df.Type = df.Type.fillna('Invalid')
df.Type.value_counts()

Unprovoked      4595
Provoked         574
Invalid          551
Sea Disaster     239
Boating          203
Boat             137
Questionable       2
Boatomg            1
Name: Type, dtype: int64

In [33]:
# Boat, Boating and Boatomg to 1 category
filt = lambda x: 'Boat' if 'Boat' in x else x
df.Type = df.Type.map(filt)

# Questionable to Invalid
filt = lambda x : 'Invalid' if 'Questionable' in x else x
df.Type = df.Type.map(filt)

df.Type.value_counts()

Unprovoked      4595
Provoked         574
Invalid          553
Boat             341
Sea Disaster     239
Name: Type, dtype: int64

In [34]:
# New columns must be created with bracket notation: attribute-style
# assignment (DFX.TypeX = ...) does not add a column to the DataFrame

# making a copy of the original DF
DFX = df.copy()

# removing those pesky floats (NaN values) that keep giving me trouble
float_to_unicode = lambda x : 'ERROR' if type(x) == float else x
DFX['TypeX'] = df.Injury.map(float_to_unicode).copy()

# Filter list and function to identify provoked attacks 👁️
provoked = ['provoked', 'hook', 'shot']

def filt_reloaded(string):
    for word in string.split():
        if word.lower() in provoked:
            return 'PROVOKED'
    return string 

DFX['TypeX'] = DFX.TypeX.map(filt_reloaded).copy()
DFX.TypeX.value_counts().head() 

  


FATAL          802
Survived        97
Foot bitten     87
No injury       82
Leg bitten      72
Name: Injury, dtype: int64
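The word-by-word comparison in `filt_reloaded` only matches tokens that are exactly equal to an entry of the `provoked` list, so variants such as "hooked" or "shot," (with trailing punctuation) slip through. One possible alternative is sketched below with a regular expression. `PROVOKED_PATTERN` and `flag_provoked` are hypothetical names, and the sketch assumes Python 3 strings.

```python
import re

# Stems taken from the `provoked` list; \w* lets "hook" also match "hooked".
# The leading \b keeps "unprovoked" from matching "provoked".
PROVOKED_PATTERN = re.compile(r'\b(provoked|hook\w*|shot)\b', re.IGNORECASE)


def flag_provoked(injury):
    """Return 'PROVOKED' if the text hints at provocation, else the text itself."""
    if not isinstance(injury, str):
        return 'ERROR'  # e.g. NaN values, which pandas stores as floats
    return 'PROVOKED' if PROVOKED_PATTERN.search(injury) else injury
```

Because this matches more variants than the exact comparison, the resulting count of provoked attacks may differ from the one discussed above.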

### 🦈️ Activities
Now we will clean the `Activity` column

In [35]:
df.Activity.value_counts().head(15)

Surfing          971
Swimming         869
Fishing          431
Spearfishing     333
Bathing          162
Wading           149
Diving           127
Standing          99
Snorkeling        89
Scuba diving      76
Body boarding     61
Body surfing      49
Swimming          47
Kayaking          33
Pearl diving      32
Name: Activity, dtype: int64

### 🌊️ Turn this filter into key:value pairs  👁️
This column could also be used to identify if the attacks were `PROVOKED` or `UNPROVOKED`

In [36]:
def filt(x):
    if type(x) is unicode:
        if 'floating' in x.lower():
            return 'Floating'
        if 'diving' in x.lower():
            return 'Diving'
        if 'dive' in x.lower():
            return 'Diving'
        if 'skiing' in x.lower():
            return 'Skiing'
        if 'ski ' in x.lower():
            return 'Skiing'
        if 'surf' in x.lower():
            return 'Surfing'
        if 'snorkel' in x.lower():
            return 'Snorkeling'
        if 'fishing' in x.lower():
            return 'Fishing'
        if 'drift' in x.lower():
            return 'Drifting'
        if 'swim' in x.lower():
            return 'Swimming'
        if 'bathing' in x.lower():
            return 'Swimming'
        if 'paddle' in x.lower():
            return 'Paddleboarding'
        
        if 'raft' in x.lower():
            return 'Rafting'
        if 'playing' in x.lower():
            return 'Playing'
        
        if 'wading' in x.lower():
            return 'Wading'
        if 'research' in x.lower():
            return 'Research'
        if 'rescue' in x.lower():
            return 'Rescuing someone / something'
        if 'rescuing' in x.lower():
            return 'Rescuing someone / something'
        if 'overboard' in x.lower():
            return 'Fell from boat into water'        
        if 'fell' in x.lower():
            return 'Fell to the water'
        
        if 'boat' in x.lower():
            return 'Boating'
        if 'yatch' in x.lower():
            return 'Boating'
        if 'disaster' in x.lower():
            return 'Sea Disaster'
        if 'wreck' in x.lower():
            return 'Shipwreck'
        if 'capsize' in x.lower():
            return 'Shipwreck'
        if 'sank' in x.lower():
            return 'Shipwreck'
        if 'torpedo' in x.lower():
            return 'War (Torpedo)'
        if 'warship' in x.lower():
            return 'War (Warship)'
        if 'plane' in x.lower() and 'crash' in x.lower():
            return 'Airplane crash'
        if 'airliner' in x.lower():
            return 'Airplane crash'
        
        
        if 'filming' in x.lower():
            return 'Filming / Photoshoot'
        if 'sailing' in x.lower():
            return 'Sailing'
        if 'net ' in x.lower():
            return 'Fishing'
        if ' net' in x.lower():
            
            return 'Fishing'
        if 'photo' in x.lower():
            return 'Filming / Photoshoot'
        if 'sinking' in x.lower():
            return 'Sea Disaster'
        
        if 'board' in x.lower():
            return 'Other types of Boarding sports'
        else:
            return x
    elif x == 'nan':
        return 'NOT SPECIFIED'
   #     print(type(x), x) #for debugging  🔥️
    elif type(x) == float:
        return 'INVALID - FLOAT'
    else:
     #   print(type(x)) #for debugging  🔥️
        return 'INVALID'

# Fill the empty values first
df.Activity = df.Activity.fillna('NOT SPECIFIED')

# Create the categorized column and count the values
df['Activity2'] = df.Activity.map(filt)
df.Activity2.value_counts()

Swimming                                                               1274
Surfing                                                                1203
Fishing                                                                1119
Diving                                                                  600
INVALID                                                                 544
Wading                                                                  163
Other types of Boarding sports                                          138
Snorkeling                                                              100
Standing                                                                 99
Boating                                                                  81
Fell from boat into water                                                80
Shipwreck                                                                68
Floating                                                                 52
Skiing      
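A note on conditions that involve several keywords: in Python, an expression such as `('plane' and 'crash') in text` evaluates the parentheses first, which yields `'crash'`, so only the second keyword is actually tested. A small helper (a hypothetical name, shown only as a sketch) makes the intent explicit:

```python
def contains_all(text, *keywords):
    """True only if every keyword occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return all(keyword in lowered for keyword in keywords)


# `and` between two non-empty strings returns the second one:
second_operand = ('plane' and 'crash')   # 'crash'
```

For example, `contains_all(x, 'plane', 'crash')` is true for "Plane crash into the sea" but false for "Car crashed off the pier".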

In [37]:
df.Activity2 = df.Activity2.fillna('NOT SPECIFIED')
df.Activity2.isnull().sum()

0

### 🦈️ Fatalities
Sort them out and map them

In [38]:
df.Fatal.isnull().sum()

539

In [39]:
df.Fatal = df.Fatal.fillna('UNKNOWN')

In [40]:
def filt(x):
    if 'UNKNOWN' in x.upper():
        return x
    if 'N' in x.upper():
        return 'N'
    if 'Y' in x.upper():
        return 'Y'
    else:
        return 'UNKNOWN'

df['Fatal'] = df.Fatal.map(filt)
df.Fatal.value_counts()

N          4301
Y          1389
UNKNOWN     612
Name: Fatal, dtype: int64
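In the filter above the order of the checks matters, because the string `'UNKNOWN'` itself contains an `'N'`. A stricter variant is sketched below, under the assumption that the raw values are mostly `Y` or `N` with stray whitespace or lowercase letters. `normalize_fatal` is a hypothetical name, and any value that is not exactly `Y` or `N` after cleaning becomes `UNKNOWN`.

```python
def normalize_fatal(value):
    """Map a raw 'Fatal' value to 'Y', 'N' or 'UNKNOWN' using an exact match."""
    cleaned = str(value).strip().upper()
    return cleaned if cleaned in ('Y', 'N') else 'UNKNOWN'
```

Unlike the substring test, this variant does not classify a longer string as `N` just because it happens to contain that letter, so the resulting counts may differ slightly.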

## 🎣️ Step 5 - Review of columns

In [41]:
print(df.count() < 5000) # To see which columns have fewer than 5000 non-null values

CaseNum      False
Date         False
Year         False
Type         False
Country      False
Area         False
Location     False
Activity     False
Sex          False
Injury       False
Fatal        False
Species      False
Source       False
href         False
Species2     False
Activity2    False
dtype: bool


In [42]:
df.duplicated().sum()

0

In [43]:
# drop dupes and compare lengths
df_pdf_nodupes = df.drop_duplicates()

len(df) - len(df_pdf_nodupes), 'duped values'

(0, 'duped values')

<br><br><br><br><br>
<br><br><br>

<h1><center> 🏄️ 
    <br> Step 6
    <br> Exporting the cleaned Dataset 
    </center></h1>

In [44]:
# Phew... 
#
# That was a long haul right there.
# It might not be perfect, but we have
# some cleaner data with which we can 
# make a more educated guess about 
# this topic.
#
# Now, we can take this dataframe and
# export it. This will be one of the
# products of this project and 
# hopefully benefit the future of
# sharing the planet with sharks.
#
#
#
# Oh, I almost forgot... 
#
# Exporting as '.csv':
#
# It cannot get any simpler than this:
df.to_csv('OUTPUT/exported.csv', encoding='latin-1')

![hammer head sharks swimming](INPUT/hammer-head-shark.jpg)

<center><h1> 🏁️🏁️🏁️🏁️🏁️🏁️🏁️🏁️🏁️🏁️🏁️🏁️🏁️🏁️🏁️ </h1></center>

In [45]:
# Have you ever tried to divide by zero ?
# I just don't want the code below to be run.

0/0
################################################################
################################################################
################################################################

ZeroDivisionError: integer division or modulo by zero

## 🏊️ IDEAS FOR FUTURE PROJECTS:
- The pdfs presented on the `href` seem quite structured
 
 - It could be possible to parse them later down the road and use a **REGEX** to find more data
 
 - Like, adding a column that lists the **'Moon Phase'** described on some of the pdfs

- I also ran a few queries and noticed that all pdfs have been uploaded to the same website and share the same naming structure
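As a first step towards that idea, the file names themselves could be parsed. The sketch below assumes the pattern `<case number>-<label>.pdf` seen in the links printed earlier (for example `2016.07.15.a-MyrtleBeach.pdf`). `PDF_NAME` and `parse_pdf_url` are hypothetical names.

```python
import re

# Assumed naming pattern: everything before the first hyphen is the case
# number, the rest (minus the extension) is a free-text label.
PDF_NAME = re.compile(r'^(?P<case>[^-]+)-(?P<label>.+)\.pdf$')


def parse_pdf_url(url):
    """Split the file name of a pdf link into case number and label."""
    filename = url.rsplit('/', 1)[-1]
    match = PDF_NAME.match(filename)
    return match.groupdict() if match else None
```

Undated files such as `ND-0002-JulesPatterson.pdf` use a hyphen inside the case number, so they would need a separate rule.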

In [None]:
# Some logic like the one described below could be used to get an idea of shark measurements
# Another idea would be to use REGEX

def distance_measurements(x):
    
    # -- trying to filter sizes --
    
    if """10'""" in x.lower():
        return """10' shark"""
    if """11'""" in x.lower():
        return """11' shark"""
    if """12'""" in x.lower():
        return """12' shark"""
    if """13'""" in x.lower():
        return """13' shark"""
    if """14'""" in x.lower():
        return """14' shark"""
    if """15'""" in x.lower():
        return """15' shark"""
    if """16'""" in x.lower():
        return """16' shark"""
    if """17'""" in x.lower():
        return """17' shark"""
    if """18'""" in x.lower():
        return """18' shark"""
    if """19'""" in x.lower():
        return """19' shark"""
    if """20'""" in x.lower():
        return """20' shark"""
    if """21'""" in x.lower():
        return """21' shark"""
    
    if """1'""" in x.lower():
        return """1' shark"""
    if """2'""" in x.lower():
        return """2' shark"""
    if """3'""" in x.lower():
        return """3' shark"""
    if """4'""" in x.lower():
        return """4' shark"""
    if """5'""" in x.lower():
        return """5' shark"""
    if """6'""" in x.lower():
        return """6' shark"""
    if """7'""" in x.lower():
        return """7' shark"""    
    if """8'""" in x.lower():
        return """8' shark"""
    if """9'""" in x.lower():
        return """9' shark"""

    if 'small shark' in x.lower():
        return 'Small shark'
    
    else:
        return x
    
# df['Species2'] = df['Species'].map(shark_identifier)
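The REGEX idea mentioned at the top of this cell could look roughly like the sketch below. It assumes that lengths in feet are written as a number followed by an apostrophe (`12'`, `4.5'`) and returns only the first such value. `FEET` and `shark_length_feet` are hypothetical names.

```python
import re

# A number (optionally with decimals) followed by an apostrophe, e.g. "12'".
FEET = re.compile(r"(\d+(?:\.\d+)?)\s*'")


def shark_length_feet(species):
    """Return the first length in feet found in the text, or None."""
    match = FEET.search(species)
    return float(match.group(1)) if match else None
```

Returning a number instead of a label like `12' shark` would also make it possible to group the sharks into size ranges later on.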

# OTHER PENDING TASKS TO-DO: 
- De-clutter ipynb and solve other errors >> Ctrl+F this emoji >> 🔥️
- Optimize variables 👁️
 - Fatalities
 - Species classification
 - Type of attack (Provoked/Unprovoked)
- Create a separate module that holds my functions
- Change the functions so that they use dictionaries instead of huge if/elif/else chains.
- Add pictures of sharks to `analysis.ipynb`