In [1]:
import pandas as pd
import numpy as np
import random
import os

from functools import reduce


# Questions: 
- how to add the dataset to gitignore?


![Shark attacks, a project by Roberto Henríquez Perozo. Data Analytics Bootcamp at IronHack](shark-attacks.png)



<br><br>


<center>
    <h1> PART I: Data cleaning and exploration</h1>
</center>


## 🎣️ Step 0 - Basic knowledge
To begin this project, it helps to have a basic understanding of `Shark Attacks`.

Since I did not know much about the topic when the project started, I turned to the Wikipedia article on shark attacks: https://en.wikipedia.org/wiki/Shark_attack

With this information in mind, below is the process of data exploration, cleaning, and wrangling.


## 🎣️ Step 1 - Defining the dataset path, and importing it to begin basic dataset exploration

In [2]:
# To follow along and access the DataSet, download it from KAGGLE using this link
# https://www.kaggle.com/teajay/global-shark-attacks

# Once you have downloaded the DataSet, change the following `dataset` variable to match the 
# path where you have saved the 'attacks.csv' file.

dataset = 'attacks.csv' 
df = pd.read_csv(dataset, encoding='latin-1')

Now, we will check some basic information about the dataset, in order to formulate a more educated hypothesis which we could actually put to test with the data available.

Here, I notice that the `df` without duplicates is much smaller than the original `df`.

## 🌊️  FUNCT
This comparison could be turned into its own function, as it will be executed quite often
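
A minimal sketch of what such a helper could look like. The name `report_shape_change` and the toy DataFrame are assumptions made for illustration; neither is part of the original notebook.

```python
import pandas as pd

def report_shape_change(before, after, label="step"):
    """Print the shapes before and after a cleaning step.

    Hypothetical helper: returns the number of rows removed.
    """
    removed = before.shape[0] - after.shape[0]
    print(f"{label}: before {before.shape}, after {after.shape}, rows removed: {removed}")
    return removed

# Toy data standing in for the real dataset
toy = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": ["x", "x", "y", "y", "z"]})
deduped = toy.drop_duplicates()
rows_removed = report_shape_change(toy, deduped, label="drop_duplicates")
```

Returning the row count makes it possible to log or assert on it, instead of only reading the printed shapes.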

In [3]:
# I'll delete the duplicated rows
print('before', df.shape)
df = df.drop_duplicates()
print('after', df.shape)

# And also take a look at the columns
df.columns

before (25723, 24)
after (6312, 24)


Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22',
       'Unnamed: 23'],
      dtype='object')

In [4]:
# I want to take a look at the time structures
df.Date

0        25-Jun-2018
1        18-Jun-2018
2        09-Jun-2018
3        08-Jun-2018
4        04-Jun-2018
            ...     
6307             NaN
6308             NaN
6309             NaN
8702             NaN
25722            NaN
Name: Date, Length: 6312, dtype: object
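
The `Date` column is stored as text (`dtype: object`). One possible way to see how many entries follow the day-month-year pattern is `pd.to_datetime` with `errors="coerce"`. This is only a sketch on a few toy values of the kind that appear in the column, not a run on the real data.

```python
import pandas as pd

# Toy values modelled on entries seen in the `Date` column
dates = pd.Series(["25-Jun-2018", "Reported 27-Sep-1906", "Before 1903", None])

# Entries that do not match day-month-year exactly become NaT instead of raising
parsed = pd.to_datetime(dates, format="%d-%b-%Y", errors="coerce")
print(parsed)
print("unparsed entries:", parsed.isna().sum())
```

Free-text entries such as "Reported ..." end up as `NaT`, so they would need separate handling if the dates are used later.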

In [5]:
# I want to see what data is in the last few
# columns, which have uninformative labels
df[['Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23']]

Unnamed: 0,Case Number.1,Case Number.2,original order,Unnamed: 22,Unnamed: 23
0,2018.06.25,2018.06.25,6303.0,,
1,2018.06.18,2018.06.18,6302.0,,
2,2018.06.09,2018.06.09,6301.0,,
3,2018.06.08,2018.06.08,6300.0,,
4,2018.06.04,2018.06.04,6299.0,,
...,...,...,...,...,...
6307,,,6309.0,,
6308,,,6310.0,,
6309,,,,,
8702,,,,,


In [6]:
# Too many null values on the last two columns... let's count them
print(df.shape)
df[['Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23']].isnull().sum()

(6312, 24)


Case Number.1       10
Case Number.2       10
original order       3
Unnamed: 22       6311
Unnamed: 23       6310
dtype: int64

In [7]:
# Since 'Unnamed: 22' holds only 1 value and 'Unnamed: 23' only 2,
# I'll not consider these columns for my analysis.
df = df.drop(columns=['Unnamed: 22', 'Unnamed: 23'])
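
Instead of naming the sparse columns by hand, they could also be selected by their share of null values. The sketch below uses a toy frame, and the 0.5 cutoff is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy frame: "mostly_empty" holds a single non-null value
toy = pd.DataFrame({
    "kept": [1, 2, 3, 4],
    "mostly_empty": [np.nan, np.nan, np.nan, "stray"],
})

# Fraction of nulls per column; the 0.5 cutoff is an assumption
null_fraction = toy.isnull().mean()
sparse_columns = null_fraction[null_fraction > 0.5].index.tolist()
trimmed = toy.drop(columns=sparse_columns)
print(sparse_columns)
```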

In [8]:
# Now we'll look at the columns again
df.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2', 'original order'],
      dtype='object')

In [9]:
# The following columns looked a bit odd, so I do a value count to find out what they are about
df['Case Number.1'].value_counts().sort_values()

1983.06.29      1
1960.04.24      1
2005.11.15      1
1935.01.21.R    1
1972.01.01.c    1
               ..
1990.05.10      2
1952.08.04      2
2006.09.02      2
2014.08.02      2
2009.12.18      2
Name: Case Number.1, Length: 6285, dtype: int64

In [10]:
df['Case Number.2'].value_counts().sort_values()

1949.07.28      1
1960.04.24      1
2005.11.15      1
1935.01.21.R    1
1972.01.01.c    1
               ..
1923.00.00.a    2
2009.12.18      2
1990.05.10      2
2005.04.06      2
1966.12.26      2
Name: Case Number.2, Length: 6286, dtype: int64

# 🌊️ Creating a Function

After some thought and googling, it seems like these are some sort of notation used to categorize and order the shark attacks.

They also use a notation that includes dates. 
## Is it the same as the Date on the Date column?

To find out, I wanted to pick random samples of the dataframe, but Python's built-in `random` module kept giving me trouble when I tried to use it alongside pandas.
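
For reference, pandas also ships `DataFrame.sample`, which draws random rows directly and returns a DataFrame. A small sketch on toy rows taken from the outputs above:

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["25-Jun-2018", "18-Jun-2018", "09-Jun-2018", "08-Jun-2018"],
    "Case Number.1": ["2018.06.25", "2018.06.18", "2018.06.09", "2018.06.08"],
})

# random_state is optional; it only makes the draw repeatable
picked = toy[["Date", "Case Number.1"]].sample(n=2, random_state=0)
print(picked)
```

Unlike drawing positions with `random.choice`, `sample` picks rows without replacement by default.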

In [11]:
# It's easier to make it into a function
""" 
OLDER VERSION OF SAMPLER FUNCTION:
def sampler(df, column, sample_size):
    # This function generates an iterator out of random rows from a pandas dataframe's specific column
    
    # Defining how many samples to fetch (this is the df index)
    for i in range(sample_size):
        
        # The 1000 value can be changed in future versions to match the size of the population
        e = random.choice(range(1000))
        yield f"index: {e}, sample: {df.iloc[e][column]}"
"""


In [12]:
# It's easier to make it into a function
"""
VER 02
def sampler(df, column, sample_size):
# This function generates an iterator out of random rows from a pandas dataframe's specific column
    
    # Defining how many samples to fetch from the df
    for i in range(sample_size):
        
        # Now a random index is generated out of the total length of the column...
        i = random.choice(range(len(df[column])))
        # ... to return the data values in that index as an iterator:
        yield f"index: {i}, sample: {df.iloc[i][column]}"
        
        # For future versions, it would be good to look at how I can return the
        # data as a tuple with just the data, and not have it return a formatted string
        # or even better, a pandas dataframe with the results

# Now let's try it out
sampler(df, ['Date', 'Case Number.1', 'Case Number.2'], 10)
"""


In [13]:
# It's easier to make it into a function

# VER 03
def sampler(df, column, sample_size):
    """Yield `sample_size` random rows of `df`, restricted to `column`.

    `column` can be a single label or a list of labels.
    """
    # Defining how many samples to fetch from the df
    for _ in range(sample_size):

        # A random row position is drawn from the total length of the df...
        i = random.choice(range(len(df)))
        # ... and the values at that position are yielded one at a time:
        yield df.iloc[i][column]

        # For future versions, it would be good to return the samples
        # as a single pandas dataframe instead of one Series per row


# Now let's try it out
sampler(df, ['Date', 'Case Number.1', 'Case Number.2'], 10)

<generator object sampler at 0x7f6bfc328b48>

In [14]:
###

# list(sampler(df, ['href', 'href formula'],10))

In [15]:
# Since sampler returns a generator, list() must be used to see its contents
list(sampler(df, ['Date', 'Case Number.1', 'Case Number.2'], 10))

[Date             11-Apr-2017
 Case Number.1     2017.04.11
 Case Number.2     2017.04.11
 Name: 160, dtype: object,
 Date             10-Mar-1992
 Case Number.1     1992.03.10
 Case Number.2     1992.03.10
 Name: 2558, dtype: object,
 Date             12-Apr-1955
 Case Number.1     1955.04.12
 Case Number.2     1955.04.12
 Name: 4297, dtype: object,
 Date             Late 1600s Reported 1728
 Case Number.1                1642.00.00.b
 Case Number.2                1642.00.00.b
 Name: 6164, dtype: object,
 Date             Reported 27-Sep-1906
 Case Number.1          1906.09.27.R.b
 Case Number.2          1906.09.27.R.b
 Name: 5469, dtype: object,
 Date             31-Dec-1957
 Case Number.1     1957.12.31
 Case Number.2     1957.12.31
 Name: 4181, dtype: object,
 Date             25-Jun-2007
 Case Number.1     2007.06.25
 Case Number.2     2007.06.25
 Name: 1369, dtype: object,
 Date              05-Jul-2003
 Case Number.1    2003.07.05.b
 Case Number.2    2003.07.05.b
 Name: 1757, dty

## 🦈️
From this output, we can see that the `Case Number.1` and `Case Number.2` columns replicate the information we already have in the `Date` column. Therefore, we will drop both of them.
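
A random sample only covers a handful of rows. The same comparison could be run over whole columns, as sketched below on toy rows modelled on the sample output. The regular expressions are assumptions about the notation, not a documented format.

```python
import pandas as pd

# Toy rows modelled on the sample output above
toy = pd.DataFrame({
    "Date": ["11-Apr-2017", "Reported 27-Sep-1906", "Late 1600s Reported 1728"],
    "Case Number.1": ["2017.04.11", "1906.09.27.R.b", "1642.00.00.b"],
    "Case Number.2": ["2017.04.11", "1906.09.27.R.b", "1642.00.00.b"],
})

# 1) Do the two case-number columns agree with each other?
same_case = toy["Case Number.1"] == toy["Case Number.2"]

# 2) Does the YYYY.MM.DD prefix of the case number match the Date column?
case_prefix = toy["Case Number.1"].str.extract(r"^(\d{4}\.\d{2}\.\d{2})")[0]
case_date = pd.to_datetime(case_prefix, format="%Y.%m.%d", errors="coerce")

# Pull a day-month-year token out of free text such as "Reported 27-Sep-1906"
date_token = toy["Date"].str.extract(r"(\d{2}-[A-Za-z]{3}-\d{4})")[0]
row_date = pd.to_datetime(date_token, format="%d-%b-%Y", errors="coerce")

matches = case_date.notna() & (case_date == row_date)
print(same_case.all(), matches.tolist())
```

Rows with placeholder dates such as `1642.00.00.b` cannot be parsed and are counted as non-matching, so they would still need a manual look.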

In [16]:
df = df.drop(columns=['Case Number.1', 'Case Number.2'])
df

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,original order
0,2018.06.25,25-Jun-2018,2018.0,Boating,USA,California,"Oceanside, San Diego County",Paddling,Julie Wolfe,F,57,"No injury to occupant, outrigger canoe and pad...",N,18h00,White shark,"R. Collier, GSAF",2018.06.25-Wolfe.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6303.0
1,2018.06.18,18-Jun-2018,2018.0,Unprovoked,USA,Georgia,"St. Simon Island, Glynn County",Standing,Adyson McNeely,F,11,Minor injury to left thigh,N,14h00 -15h00,,"K.McMurray, TrackingSharks.com",2018.06.18-McNeely.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6302.0
2,2018.06.09,09-Jun-2018,2018.0,Invalid,USA,Hawaii,"Habush, Oahu",Surfing,John Denges,M,48,Injury to left lower leg from surfboard skeg,N,07h45,,"K.McMurray, TrackingSharks.com",2018.06.09-Denges.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6301.0
3,2018.06.08,08-Jun-2018,2018.0,Unprovoked,AUSTRALIA,New South Wales,Arrawarra Headland,Surfing,male,M,,Minor injury to lower leg,N,,2 m shark,"B. Myatt, GSAF",2018.06.08-Arrawarra.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6300.0
4,2018.06.04,04-Jun-2018,2018.0,Provoked,MEXICO,Colima,La Ticla,Free diving,Gustavo Ramos,M,,Lacerations to leg & hand shark PROVOKED INCIDENT,N,,"Tiger shark, 3m",A .Kipper,2018.06.04-Ramos.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6299.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6307,0,,,,,,,,,,,,,,,,,,,6309.0
6308,0,,,,,,,,,,,,,,,,,,,6310.0
6309,0,,,,,,,,,,,,,,,,,,,
8702,,,,,,,,,,,,,,,,,,,,


From this output, we notice that the last couple of rows are still holding many null values.


In [17]:
# but if we look closely, it's only the last 10 rows which have the nulls.
df.tail(15)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,original order
6297,ND.0005,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6.0
6298,ND.0004,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5.0
6299,ND.0003,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,4.0
6300,ND.0002,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,,FATAL,Y,,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,3.0
6301,ND.0001,1845-1853,0.0,Unprovoked,CEYLON (SRI LANKA),Eastern Province,"Below the English fort, Trincomalee",Swimming,male,M,15.0,"FATAL. ""Shark bit him in half, carrying away t...",Y,,,S.W. Baker,ND-0001-Ceylon.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2.0
6302,0,,,,,,,,,,,,,,,,,,,6304.0
6303,0,,,,,,,,,,,,,,,,,,,6305.0
6304,0,,,,,,,,,,,,,,,,,,,6306.0
6305,0,,,,,,,,,,,,,,,,,,,6307.0
6306,0,,,,,,,,,,,,,,,,,,,6308.0


In [18]:
# Since there are only 10 such rows, we can drop them manually:
df = df.drop([6302, 6303,6304,6305,6306,6307,6308,6309,8702,25722])

#results
df.tail(5)

Unnamed: 0,Case Number,Date,Year,Type,Country,Area,Location,Activity,Name,Sex,Age,Injury,Fatal (Y/N),Time,Species,Investigator or Source,pdf,href formula,href,original order
6297,ND.0005,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,Roebuck Bay,Diving,male,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, p. 234",ND-0005-RoebuckBay.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,6.0
6298,ND.0004,Before 1903,0.0,Unprovoked,AUSTRALIA,Western Australia,,Pearl diving,Ahmun,M,,FATAL,Y,,,"H. Taunton; N. Bartlett, pp. 233-234",ND-0004-Ahmun.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,5.0
6299,ND.0003,1900-1905,0.0,Unprovoked,USA,North Carolina,Ocracoke Inlet,Swimming,Coast Guard personnel,M,,FATAL,Y,,,"F. Schwartz, p.23; C. Creswell, GSAF",ND-0003-Ocracoke_1900-1905.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,4.0
6300,ND.0002,1883-1889,0.0,Unprovoked,PANAMA,,"Panama Bay 8ºN, 79ºW",,Jules Patterson,M,,FATAL,Y,,,"The Sun, 10/20/1938",ND-0002-JulesPatterson.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,3.0
6301,ND.0001,1845-1853,0.0,Unprovoked,CEYLON (SRI LANKA),Eastern Province,"Below the English fort, Trincomalee",Swimming,male,M,15.0,"FATAL. ""Shark bit him in half, carrying away t...",Y,,,S.W. Baker,ND-0001-Ceylon.pdf,http://sharkattackfile.net/spreadsheets/pdf_di...,http://sharkattackfile.net/spreadsheets/pdf_di...,2.0
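
Dropping by index label works for ten rows. A rule-based alternative is `dropna` with a `subset`, sketched here on a toy frame. Which columns count as "descriptive" is an assumption made for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Case Number": ["ND.0002", "ND.0001", "0", np.nan],
    "Date": ["1883-1889", "1845-1853", np.nan, np.nan],
    "Country": ["PANAMA", "CEYLON (SRI LANKA)", np.nan, np.nan],
    "original order": [3.0, 2.0, 6304.0, np.nan],
})

# Drop rows where ALL of the descriptive columns are null;
# rows that only carry a placeholder case number or an old index go away
descriptive = ["Date", "Country"]
cleaned = toy.dropna(subset=descriptive, how="all")
print(cleaned)
```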


# 🦈️
With this cleaned dataframe, we can look deeper into the actual data.

Notice that we still have additional columns that are not giving us any **'meaty information'**!

In [19]:
# After checking, these are just old indexes
# df['original order'].value_counts()

# And these are only the titles of the pdfs
# df['pdf'].value_counts()

df = df.drop(columns = ['pdf', 'original order'])

# 🦈️
The `href` and `href formula` columns look very similar, but since pandas truncates them when displaying the DataFrame, we'll try to use our `sampler()` function again to compare them both

In [20]:
list(sampler(df, ['href', 'href formula'],2))
# However, the output still shows truncated links ("..."),
# so the two columns cannot be compared this way.

[href            http://sharkattackfile.net/spreadsheets/pdf_di...
 href formula    http://sharkattackfile.net/spreadsheets/pdf_di...
 Name: 6280, dtype: object,
 href            http://sharkattackfile.net/spreadsheets/pdf_di...
 href formula    http://sharkattackfile.net/spreadsheets/pdf_di...
 Name: 4762, dtype: object]

# 🦈️
To actually see the contents, I've resorted to two separate methods, `random.sample` and a `for` loop:

In [21]:
# With Random sample
display(random.sample(list(df['href']), 5))
display(random.sample(list(df['href formula']), 5))

['http://sharkattackfile.net/spreadsheets/pdf_directory/ND-0010-Puna Hawaii.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/2016.06.02.b-Matigan.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1964.06.29-Air-Collision.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/2013.01.26-boat.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1930.12.02-dinghy.pdf']

['http://sharkattackfile.net/spreadsheets/pdf_directory/1958.12.13-Weaver.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/2018.04.15.c-deMelo.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1910.11.28-Key.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1980.01.25-Richard.pdf',
 'http://sharkattackfile.net/spreadsheets/pdf_directory/1982.01.29-Phillips-Page.pdf']
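
Besides converting the column to a list, the truncation can be lifted on the pandas side with the `display.max_colwidth` option. A minimal sketch with one URL from the output above (`None` as "no limit" assumes pandas 1.0 or later):

```python
import pandas as pd

toy = pd.Series(
    ["http://sharkattackfile.net/spreadsheets/pdf_directory/2013.01.26-boat.pdf"],
    name="href",
)

# Temporarily lift the column-width limit; the setting is restored
# automatically when the `with` block ends
with pd.option_context("display.max_colwidth", None):
    full_repr = repr(toy)

print(full_repr)
```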

In [22]:
# With a FOR loop
for i in range(5):
    e = random.choice(range(1000))
    print(f"index: {e}, href:         {df.iloc[e]['href']}")
    print(f"index: {e}, href formula: {df.iloc[e]['href formula']}")

index: 749, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2012.08.31.R-Hamish.pdf
index: 749, href formula: http://sharkattackfile.net/spreadsheets/pdf_directory/2012.08.31.R-Hamish.pdf
index: 976, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2010.10.23-MacNichol.pdf
index: 976, href formula: http://sharkattackfile.net/spreadsheets/pdf_directory/2010.10.23-MacNichol.pdf
index: 856, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2011.10.05-Castellani.pdf
index: 856, href formula: http://sharkattackfile.net/spreadsheets/pdf_directory/2011.10.05-Castellani.pdf
index: 581, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2014.02.17.R-OneDLL
index: 581, href formula: http://sharkattackfile.net/spreadsheets/pdf_directory/2014.02.17.R-OneDLL
index: 88, href:         http://sharkattackfile.net/spreadsheets/pdf_directory/2017.09.10.b-Samson.pdf
index: 88, href formula: http://sharkattackfile.net/spreadsheets/pd

## 🦈️
The links in both columns seem to match most of the time.

In some cases, the `href` has its URL prefix duplicated, which corrupts the link and makes it inaccessible.

However, the `href formula` kept the correct URL format.
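
How often this happens could be counted rather than spotted by eye. The sketch below uses two toy rows built from URLs seen in this notebook; the variable names are illustrative.

```python
import re
import pandas as pd

prefix = "http://sharkattackfile.net/spreadsheets/pdf_directory/"

# Toy rows: the second `href` repeats the prefix, as in the corrupted links
toy = pd.DataFrame({
    "href": [
        prefix + "2012.08.31.R-Hamish.pdf",
        prefix + prefix + "2015.11.15.a-Engelman.pdf",
    ],
    "href formula": [
        prefix + "2012.08.31.R-Hamish.pdf",
        prefix + "2015.11.15.a-Engelman.pdf",
    ],
})

# Rows where the two columns disagree
mismatch = toy["href"] != toy["href formula"]

# Rows where the prefix occurs more than once inside `href`
# (re.escape keeps the dots in the URL from acting as regex wildcards)
doubled = toy["href"].str.count(re.escape(prefix)) > 1
print(mismatch.sum(), doubled.sum())
```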

In [23]:
print(df.iloc[332]['href']), print(df.iloc[332]['href formula'])
print()
print(df.iloc[324]['href']), print(df.iloc[324]['href formula'])
print()
print(df.iloc[588]['href']), print(df.iloc[588]['href formula'])
print()
print(df.iloc[569]['href']), print(df.iloc[569]['href formula'])

http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2015.11.15.a-Engelman.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2015.11.15.a-Engelman.pdf

http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2015.12.21.a-Brazil.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2015.12.21.a-Brazil.pdf

http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2014.00.00.b-OceanicWhitetip.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2014.00.00.b-OceanicWhitetip.pdf

http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/2014.04.03-Armstrong.pdf
http://sharkattackfile.net/spreadsheets/pdf_directory/2014.04.03-Armstrong.pdf


(None, None)

In [24]:
# For the sake of simplicity, we will drop the `href` column and keep `href formula` in its place
df = df.drop(columns='href')
df.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'href formula'],
      dtype='object')

## 🏊️ IDEAS:
 The pdfs presented on the links seem quite structured
 
 It could be possible to parse them later down the road and use a **REGEX** to find more data
 
 For example, adding a column that lists the **'Moon Phase'** described on some of the pdfs

 I have also run the query a few times and noticed that all pdfs have been uploaded to
 the same website and share the same naming structure
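
As a first step in that direction, the shared naming structure of the links could be parsed with a regular expression. This sketch only reads the file name in the URL, not the PDF contents, and the pattern is an assumption based on the few links printed above.

```python
import pandas as pd

base = "http://sharkattackfile.net/spreadsheets/pdf_directory/"
links = pd.Series([
    base + "2015.11.15.a-Engelman.pdf",
    base + "1964.06.29-Air-Collision.pdf",
    base + "ND-0010-Puna Hawaii.pdf",
])

pattern = (
    r"/(?P<case>\d{4}\.\d{2}\.\d{2}(?:\.[A-Za-z])*)"  # e.g. 2015.11.15.a
    r"-(?P<label>[^/]+?)"                              # e.g. Engelman
    r"(?:\.pdf)?$"                                     # optional extension
)

# Named groups become the column names of the result
parsed = links.str.extract(pattern)
print(parsed)
```

Links that do not follow the dated pattern, such as the `ND-...` ones, come back as `NaN` and would need their own rule.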

## 🦈️

Some column names can be simplified, and some have typing errors.
 
 Let's fix that first

In [27]:
# Rename the labels: drop trailing spaces and shorten the long names.
# The new names go in a flat list; a nested list would create a MultiIndex.

# Note: df_label is the same object as df, so df gets the new labels too
df_label = df
df_label.columns = ['CaseNum', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal', 'Time',
       'Species', 'Source', 'href']
df_label.columns

Index(['CaseNum', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal', 'Time', 'Species',
       'Source', 'href'],
      dtype='object')
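
Column labels can also be cleaned without retyping the whole list. Note that a list wrapped in an extra pair of brackets makes pandas build a `MultiIndex`, while `DataFrame.rename` keeps a regular `Index`. A sketch on an empty toy frame with a subset of the labels:

```python
import pandas as pd

toy = pd.DataFrame(columns=["Case Number", "Sex ", "Fatal (Y/N)", "Species ",
                            "Investigator or Source", "href formula"])

# Strip stray whitespace first, then rename only the labels that change
renamed = (
    toy.rename(columns=str.strip)
       .rename(columns={
           "Case Number": "CaseNum",
           "Fatal (Y/N)": "Fatal",
           "Investigator or Source": "Source",
           "href formula": "href",
       })
)
print(list(renamed.columns))
```

Because `rename` maps old names to new ones, it does not depend on the column order or on the number of columns.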

In [25]:
# Some column names can be simplified
df.columns

Index(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex ', 'Age', 'Injury', 'Fatal (Y/N)', 'Time',
       'Species ', 'Investigator or Source', 'href formula'],
      dtype='object')

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.Age.isnull().sum()

In [None]:
df['Type'].isnull().sum()

## It still looks like some of these pdfs are duplicates, even after dropping duplicates:

In [None]:
# how many times each pdf link appears in the dataframe
df_pdf = df["href"]
df_pdf.value_counts()

In [None]:
# drop dupes and compare lengths
df_pdf_nodupes = df_pdf.drop_duplicates()

len(df_pdf) - len(df_pdf_nodupes), 'duped values'

## Since the lengths are not the same, I will check if those duplicated entries are only in this column

In [None]:
# there are 20 duplicated values on the pdf columns
df_nodupes.duplicated('pdf').value_counts()

In [None]:
# But only 18 dupes if we take Location into account
df_nodupes.duplicated(['pdf','Location']).value_counts()
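
Counting duplicates does not show which rows are involved. With `keep=False`, `duplicated` marks every member of a duplicate group, so the rows can be inspected side by side. The toy values below are invented for illustration.

```python
import pandas as pd

# Invented toy rows: two pdf names repeat, one of them with a different Location
toy = pd.DataFrame({
    "pdf": ["a.pdf", "a.pdf", "b.pdf", "c.pdf", "c.pdf"],
    "Location": ["Bay 1", "Bay 1", "Bay 2", "Reef", "Harbour"],
})

# keep=False marks every member of a duplicate group, not only the repeats
same_pdf = toy[toy.duplicated("pdf", keep=False)]
same_pdf_and_location = toy[toy.duplicated(["pdf", "Location"], keep=False)]
print(same_pdf)
print(same_pdf_and_location)
```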

### I'll look at the rest of the data now.

In [None]:
print(df_nodupes.shape)
df_nodupes.duplicated().sum()

In [None]:
df_nodupes = df_nodupes.drop_duplicates()
df_nodupes.duplicated().sum()

In [None]:
print(df_nodupes.shape)
df_nodupes[["Date", "Location", "pdf"]]

In [None]:
df_nodupes.Country.value_counts()

In [None]:
# While checking, I found that the columns 'Species ' and 'Sex ' have unnecessary spaces at the end of their names.
# I'll remove these, and also take out the '(Y/N)' from the column 'Fatal (Y/N)'

In [None]:
df_label = df_nodupes
df_label.columns = ['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal', 'Time',
       'Species', 'Investigator or Source', 'pdf', 'href formula', 'href',
       'Case Number.1', 'Case Number.2']
df_label.columns

In [None]:
df_label['Fatal'].value_counts()
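
If the value counts show the same answer written in several ways, the pandas string methods can normalize them in one pass. The values below are invented for illustration; the real column may contain other variants.

```python
import pandas as pd

# Invented values: stray spaces and mixed case around Y / N
fatal = pd.Series(["Y", "N", " N", "N ", "y", None], name="Fatal")

# .str methods skip missing values, so None stays missing
cleaned = fatal.str.strip().str.upper()
print(cleaned.value_counts(dropna=False))
```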

In [None]:
remove_spaces = lambda x: x.replace(' ', '') if isinstance(x, str) else x
"""

df_label['Fatal'] = df_label['Fatal'].map(remove_spaces)
"""
    
df_label['Fatal'].value_counts()


# I want to see the indexes which have a duplicated pdf row
"""

dupes = []
for a,b in list(df_label['pdf'].duplicated().items()):
    if b:
        dupes.append(a)
dupes """

In [None]:
"""
df_label.loc[dupes]
"""


In [None]:
# dfx = df_nodupes["pdf"].value_counts() if 

# Transform this to sort the shark species


In [None]:
list(df_label['Species'].value_counts().items())

In [None]:
# @@ Use this to fill null values: 
# df_clean["drive"] = df_clean.drive.fillna("NoTransmision")

# Injuries and types of attack
The GSAF categorizes scavenging bites on humans as "questionable incidents."

## PROVOKED
Provoked attacks occur when a human touches, hooks, nets, or otherwise aggravates the animal. Incidents that occur outside of a shark's natural habitat, such as aquariums and research holding-pens, are considered provoked, as are all incidents involving captured sharks. Sometimes humans inadvertently provoke an attack, such as when a surfer accidentally hits a shark with a surf board.

## UNPROVOKED
- Hit-and-run attack
- Sneak attack
- Bump-and-bite attack 

For more information on how to differentiate PROVOKED vs UNPROVOKED attacks :
https://en.wikipedia.org/wiki/Shark_attack#Types_of_attacks

In [None]:
# The `Type` column already labels many attacks as Provoked or Unprovoked, but
# I also want to analyze the Injury column to distinguish between the cases that
# were provoked and those that were unprovoked.

random.sample(list(df_label.Injury.value_counts().items()),20)

In [None]:
# Categorizing  Provoked and  Unprovoked attacks
#df_clean.loc[df_clean["trany"].str.startswith("M"),"trany"] = "Manual"

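provoked = ['PROVOKED', 'hook', 'shot']
# Rows whose Injury text contains a word starting with any of the keywords
# (the leading \b stops 'PROVOKED' from matching inside 'UNPROVOKED')
provoked_pattern = r'\b(?:' + '|'.join(provoked) + ')'
df_nodupes.loc[df_nodupes['Injury'].str.contains(provoked_pattern, case=False, na=False)]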

In [None]:
# Flag every row whose Injury text mentions one of the `provoked` keywords
df_provoked = df_nodupes['Injury'].str.contains(
    r'\b(?:' + '|'.join(provoked) + ')', case=False, na=False)

# Passing that categorization to a new PROVOKED COLUMN
df_nodupes['Provoked'] = df_provoked
df_nodupes['Provoked'] 
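
A keyword flag like this can be sanity-checked against the existing `Type` column. The sketch below uses a few toy rows taken from the table shown earlier; the keyword list is the one from the cell above, and the leading `\b` is an assumption about how strict the match should be.

```python
import pandas as pd

toy = pd.DataFrame({
    "Type": ["Provoked", "Unprovoked", "Unprovoked", "Unprovoked"],
    "Injury": [
        "Lacerations to leg & hand shark PROVOKED INCIDENT",
        "Minor injury to left thigh",
        "Minor injury to lower leg",
        "FATAL",
    ],
})

keywords = ["PROVOKED", "hook", "shot"]
# Leading \b keeps "PROVOKED" from matching inside "UNPROVOKED"
pattern = r"\b(?:" + "|".join(keywords) + ")"
toy["Provoked"] = toy["Injury"].str.contains(pattern, case=False, na=False)

# Cross-tabulate the keyword flag against the recorded attack type
agreement = pd.crosstab(toy["Type"], toy["Provoked"])
print(agreement)
```

Cells where the two disagree point at rows worth reading by hand.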

In [None]:
display(df.columns) # To know which columns are in the DF
display(df.count()) # To know how much data we are missing in each column
display(df.dtypes)
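
The three displays above could also be combined into a single summary table. A sketch on a toy frame (the column names in `summary` are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": ["57", "11", np.nan, np.nan],
    "Type": ["Boating", "Unprovoked", "Invalid", "Unprovoked"],
})

# One row per column: type, non-null count, and missing share
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.count(),
    "missing": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
})
print(summary)
```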

In [None]:
df_label[['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Name', 'Sex', 'Age', 'Injury', 'Fatal', 'Time', 'Species',
       'Investigator or Source', 'pdf', 'href',
       'Case Number.1', 'Case Number.2']].head(50)