
# TO-DO: 
- How do I add the dataset to `.gitignore`?
- How can I create a separate module that holds my functions?
- Change the functions so that they use dictionaries instead of huge if/elif/else chains.

- Why do the column labels now begin with a `u`?
    ```
  Index([u'CaseNum', u'Date', u'Year', u'Type', u'Country', u'Area', u'Location',
       u'Activity', u'Sex', u'Age', u'Injury', u'Fatal', u'Time', u'Species',
       u'Source', u'href'],
      dtype='object')
    ```



![Shark attacks, a project by Roberto Henríquez Perozo. Data Analytics Bootcamp at IronHack](INPUT/shark-attacks.png)

<br><br>
<center>
    <h1> PART I: Data cleaning and exploration</h1>
</center>

## 🎣️ Step 0 - Basic knowledge
To begin this project, it helps to have a minimum understanding of shark attacks.

Since I did not know much about the topic when the project started, I turned to the Wikipedia article on shark attacks: https://en.wikipedia.org/wiki/Shark_attack

With this information in mind, below is the process of data exploration, cleaning, and wrangling.

In [1]:
# Start by importing the modules
import pandas as pd
import numpy as np
import random
import os

from functools import reduce

## 🎣️ Step 1 - Defining the dataset path, and importing it to begin basic dataset exploration

In [3]:
# To follow along and access the DataSet, download it from KAGGLE using this link
# https://www.kaggle.com/teajay/global-shark-attacks

# Once you have downloaded the DataSet, change the following `dataset` variable to match the 
# path where you have saved the 'attacks.csv' file.
dataset = 'attacks.csv' 


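# The file is not plain ASCII: reading it with encoding='ascii' raises
# "UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 7".
# 'latin-1' maps every byte to a character, so it can always decode the file.
df = pd.read_csv(dataset, encoding='latin-1')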
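
If the encoding of a file is not known in advance, one option is to try a few candidates and keep the first one that decodes the whole file. The sketch below is only an illustration: `first_working_encoding` and the list of candidates are my own assumptions, not something the dataset documents.

```python
from pathlib import Path


def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: the name and the default candidates are assumptions.
    """
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot decode the file, try the next one
            continue
        return encoding
    raise ValueError(f"None of {candidates} could decode {path}")


# Possible usage inside the notebook (assumes `dataset` points to the csv):
# df = pd.read_csv(dataset, encoding=first_working_encoding(dataset))
```

Note that `latin-1` accepts any byte, so placing it last makes it a catch-all: the text always decodes, but some characters may be wrong if the real encoding was a different one.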

Now, we will check some basic information about the dataset, in order to formulate a more educated hypothesis which we could actually put to test with the data available.

Here, I notice that the shape of the `df` with no duplicates is very small when compared to the whole `df`.

## 🌊️  FUNCT
This comparison could be turned into its own function, as it will be executed quite often

In [None]:
# I'll delete the duplicated rows
print('before', df.shape)
df = df.drop_duplicates()
print('after', df.shape)

# And also take a look at the columns
df.columns
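
Since the before/after comparison will be repeated often, it could be wrapped in a small helper. This is a minimal sketch: `drop_duplicates_report` is a hypothetical name and the tiny `demo` frame is made up just to show the call.

```python
import pandas as pd


def drop_duplicates_report(frame):
    """Drop duplicated rows and print the shape before and after.

    Hypothetical helper, not part of the original notebook.
    """
    deduped = frame.drop_duplicates()
    print("before", frame.shape)
    print("after ", deduped.shape)
    print("removed", len(frame) - len(deduped), "duplicated rows")
    return deduped


# Made-up data: the first two rows are identical
demo = pd.DataFrame({"Year": [2018, 2018, 2017], "Sex": ["M", "M", "F"]})
demo = drop_duplicates_report(demo)
```

With the real data the call would simply be `df = drop_duplicates_report(df)`.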

In [None]:
# I want to take a look at the time structures
df.Date.head()

In [None]:
# I want to see what data is in the last few
# columns, which have unclear labels
df[['Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23']].head()

In [None]:
# Too many null values on the last two columns... let's count them
print(df.shape)
df[['Case Number.1', 'Case Number.2', 'original order', 'Unnamed: 22', 'Unnamed: 23']].isnull().sum()

In [None]:
# There is only 1 value in the 'Unnamed: 22' column and 2 values in the
# 'Unnamed: 23' column, so I won't consider this data for my analysis.
df = df.drop(columns=['Unnamed: 22', 'Unnamed: 23'])

In [None]:
# The following columns looked a bit odd, so I do a value count to find out what they are about
df['Case Number.1'].value_counts().sort_values().head(10)

In [None]:
df['Case Number.2'].value_counts().sort_values().head(10)

# 🌊️ Creating a 'sampler' function

After some thought and googling, it seems that these are some sort of notation used to categorize and order the shark attacks. The notation also includes dates.

**Is it the same as the Date on the Date column?**

To find out, I wanted to pick random samples of the dataframe, but python's built-in `random` module kept giving me trouble when I tried to use it alongside pandas.

In [None]:
#VER 03
def sampler(df, column, sample_size):
    """Yield `sample_size` randomly chosen values from `column` of `df`.

    This is a generator: wrap the call in list() to see the values.
    """
    # Defining how many samples to fetch from the df
    for _ in range(sample_size):

        # A random row position is picked out of the total length of the df...
        position = random.randrange(len(df))
        # ... to yield the data values at that position:
        yield df.iloc[position][column]

        # For future versions, it would be good to look at how I can return the
        # data as a tuple with just the data, and not have it return a formatted string
        # or even better, a pandas dataframe with the results


# Now let's try it out
sampler(df, ['Date', 'Case Number.1', 'Case Number.2'], 10)

In [None]:
# Since sampler generates iterators, list() must be used to see its contents
list(sampler(df, ['Date', 'Case Number.1', 'Case Number.2'], 5))
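
As an alternative to mixing `random` with pandas, `DataFrame.sample` picks random rows directly and returns a dataframe. A minimal sketch, where the `demo` rows are made up and only stand in for the real data:

```python
import pandas as pd

# Made-up rows, only to illustrate the call
demo = pd.DataFrame({
    "Date": ["01-Jan-2000", "02-Jan-2000", "03-Jan-2000", "04-Jan-2000"],
    "Case Number.1": ["2000.01.01", "2000.01.02", "2000.01.03", "2000.01.04"],
    "Case Number.2": ["2000.01.01", "2000.01.02", "2000.01.03", "2000.01.04"],
})

# n random rows from the chosen columns; random_state makes the pick repeatable
picked = demo[["Date", "Case Number.1", "Case Number.2"]].sample(n=2, random_state=0)
print(picked)
```

Dropping `random_state` gives a different pick on every run, which is closer to what `sampler()` does.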

## 🦈️
From this output, we can see that the `Case Number.1` and `Case Number.2` columns replicate the information we already have in the `Date` column. Therefore, we will drop both of them.

In [None]:
df = df.drop(columns=['Case Number.1', 'Case Number.2'])

 ## 🦈️
 Taking a closer look at the tail of the df, we notice that the last couple of rows are still holding many null values.


In [None]:
# Only the last 10 rows hold the nulls.
df.tail(12)

In [None]:
# Since there are only 10 instances, we can drop them manually using their index:
df = df.drop([6302, 6303, 6304, 6305, 6306, 6307, 6308, 6309, 8702, 25722])

In [None]:
#results
df.tail()

# 🦈️
With this cleaned dataframe, we can look deeper into the actual data.

Notice that we still have additional columns that are not giving us any **'meaty information'** !

In [None]:
# These are just old indexes
# df['original order'].value_counts()

# And these are only the titles of the pdfs
# df['pdf'].value_counts()

# let's get rid of them
df = df.drop(columns = ['pdf', 'original order'])

# 🦈️
The `href` and `href formula` columns look very similar, but since they can't be read on the DataFrame that pandas provides, we'll try to use our `sampler()` function again to compare them both

In [None]:
list(sampler(df, ['href', 'href formula'],2))
# This however, returns invalid links which are not accurately represented.
# example: 

# 🦈️
Well.. Oops?
To actually see the contents, I've resorted to two separate methods, `random.sample` and a `for` loop.

These cells can be re-run a couple of times to make sure the data in the columns is homogeneous

In [None]:
# With Random sample
display(random.sample(list(df['href']), 5))
display(random.sample(list(df['href formula']), 5))

In [None]:
# With a FOR loop
for i in range(5):
    e = random.choice(range(1000))
    print(f"index: {e} href:         {df.iloc[e]['href']}")
    print(f"index: {e} href formula: {df.iloc[e]['href formula']}")

## 🦈️
The links in both columns seem to match most of the time.

In some cases, the `href` value has its link duplicated, which corrupts it and makes it inaccessible.

The `href formula` column, however, kept the correct URL format.

In [None]:
# Examples of the corrupted links versus their working counterparts
print(df.iloc[332]['href'])
print(df.iloc[332]['href formula'])
print()
print(df.iloc[324]['href'])
print(df.iloc[324]['href formula'])
print()
print(df.iloc[588]['href'])
print(df.iloc[588]['href formula'])
print()
print(df.iloc[569]['href'])
print(df.iloc[569]['href formula'])
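
To go beyond spot checks, the two columns can be compared row by row and the share of identical links counted. A minimal sketch with made-up links (the `example.com` URLs are placeholders, not values from the dataset):

```python
import pandas as pd

# Made-up links: the second href has its URL duplicated
demo = pd.DataFrame({
    "href": [
        "http://example.com/a.pdf",
        "http://example.com/b.pdfhttp://example.com/b.pdf",
        "http://example.com/c.pdf",
    ],
    "href formula": [
        "http://example.com/a.pdf",
        "http://example.com/b.pdf",
        "http://example.com/c.pdf",
    ],
})

# Boolean Series: True where both columns hold the same link
same = demo["href"] == demo["href formula"]
print(f"{same.mean():.0%} of the rows have identical links")

# Show only the rows where the links differ
print(demo.loc[~same, ["href", "href formula"]])
```

On the real dataframe the same two lines would work with `df` in place of `demo`.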

## 🦈️

For the sake of simplicity, we will drop the `href` column and let `href formula` take its place. Also, some column names can be simplified, and some have unnecessary white spaces. Let's fix that right away:

In [None]:
df = df.drop(columns='href')

In [None]:
df.columns

In [None]:
#The column with the name of the victims does not bring much relevant information to our study
df = df.drop(columns='Name')

In [None]:
df.columns = ['CaseNum', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location',
       'Activity', 'Sex', 'Age', 'Injury', 'Fatal', 'Time',
       'Species', 'Source', 'href']
df.columns

## 🦈️
Many of the values from the **Species** column are `nulls`. We'll fill them with the same `Invalid` value that other cells already have

In [None]:
#BEFORE
df.Species.value_counts()

In [None]:
df.Species = df.Species.fillna('Invalid')

#AFTER
df.Species.value_counts() 

In [None]:
# With the .describe() method, we can see that there are 1549 unique values in this column
# It would be interesting to create a new column which narrows this down to fewer unique values.
df.Species.describe()

## 🦈️    [Can you guess the Pokémon?](https://i.ytimg.com/vi/mg1A94zBWBw/hqdefault.jpg)

There are more than 400 species of sharks, and while lurking around the `Species` column you can find all sorts of weird animals. Did you know there's even one species of shark known as the *'Cookie Cutter shark'*?

I certainly had no clue. 

Here I tried to sort a bit of the data, by creating a secondary column which *mapped* the different species, while also taking out possible confusions. 

- *note: the scrutiny for this categorization is quite lax, as this project is more an exercise with data and less a research paper. The data, however, can be processed even further by forking this GitHub repo.*

Below is the process of creating such a secondary column.

In [None]:
# Mapping the Species column to decrease the amount of unique values
df['Species2'] = df['Species']
df.Species2.value_counts() 

# 🌊️ Shark Identifier function

This can be turned into key:value pairs for easier reading

In [None]:
def shark_identifier2(x):
    
    #THERE ARE SO MANY ERRORS IN THIS COLUMN, let's filter them
    
    # NOT CONFIRMED
    if 'not confirmed' in x.lower():
        return "OTHER / NOT KNOWN"
    if 'unidentified' in x.lower():
        return 'OTHER / NOT KNOWN'
    if ' or ' in x.lower():
        return 'OTHER / NOT KNOWN'
    
    # INVALIDS
    if 'no shark involvement' in x.lower():
        return 'INVALID ENTRY'
    if 'invalid' in x.lower():
        return 'INVALID ENTRY'
    if 'questionable' in x.lower():
        return 'INVALID ENTRY'
    if 'doubtful' in x.lower():
        return 'INVALID ENTRY'
    
    # OTHER INVALIDS
    if 'hoax' in x.lower():
        return 'HOAX'
    if 'drown' in x.lower():
        return 'DROWNED'
    if 'stingray' in x.lower():
        return 'STINGRAY'
    
    
    #  --- WHO'S THAT POKEMON?---
    
    if 'white shark' in x.lower():
        return "White shark"
    if 'tiger shark' in x.lower():
        return "Tiger shark"
    if 'bull shark' in x.lower():
        return "Bull shark"
    if 'nurse shark' in x.lower():
        return 'Nurse shark'
    if 'brown shark' in x.lower():
        return 'Brown shark'
    if 'mako shark' in x.lower():
        return 'Mako Shark'
    if 'blue shark' in x.lower():
        return 'Blue shark'
    if 'bronze whaler shark' in x.lower():
        return 'Bronze whaler shark'
    if 'blacktip shark' in x.lower():
        return 'Blacktip shark'
    if 'whitetip shark' in x.lower():
        return 'Whitetip shark'
    if 'sandbar shark' in x.lower():
        return 'Sandbar shark'
    if 'lemon shark' in x.lower():
        return 'Lemon shark'
    if 'hammerhead shark' in x.lower():
        return 'Hammerhead shark'
    if 'raggedtooth shark' in x.lower():
        return 'Raggedtooth shark'
    if 'thresher shark' in x.lower():
        return 'Thresher shark'
    if 'dusky shark' in x.lower():
        return 'Dusky shark'
    if 'wobbegong shark' in x.lower():
        return 'Wobbegong shark'
    if 'spinner shark' in x.lower():
        return 'Spinner shark'
    if 'blue nose shark' in x.lower():
        return 'Blue nose shark'
    if 'leopard shark' in x.lower():
        return 'Leopard shark'
    if 'silvertip shark' in x.lower():
        return 'Silvertip shark'
    if 'gray shark' in x.lower():
        return 'Gray shark'
    if 'grey shark' in x.lower():
        return 'Gray shark'
    if 'reef shark' in x.lower():
        return 'Reef shark'
    if 'carpet shark' in x.lower():
        return 'Carpet shark'
    if 'whaler shark' in x.lower():
        return 'Whaler shark'
    if 'zambesi shark' in x.lower():
        return 'Zambesi shark'
    
    # -- trying to filter sizes --

    if 'small shark' in x.lower():
        return 'Other small shark'
    
    else:
        return 'OTHER / NOT KNOWN'
    
df['Species2'] = df['Species'].map(shark_identifier2)

In [None]:
print(df.Species2.value_counts())
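
The long chain of `if` statements could be replaced by key:value pairs, as noted above. Below is a minimal sketch of that idea: `SPECIES_RULES` and `classify` are hypothetical names, and only a few of the rules are repeated to keep it short. It relies on dictionaries keeping their insertion order (Python 3.7+), because the first matching keyword wins.

```python
# keyword -> label; order matters, the first keyword found in the text wins
SPECIES_RULES = {
    "not confirmed": "OTHER / NOT KNOWN",
    "unidentified": "OTHER / NOT KNOWN",
    " or ": "OTHER / NOT KNOWN",
    "invalid": "INVALID ENTRY",
    "white shark": "White shark",
    "tiger shark": "Tiger shark",
    "bronze whaler shark": "Bronze whaler shark",
    "whaler shark": "Whaler shark",
    "small shark": "Other small shark",
}


def classify(text, rules, default="OTHER / NOT KNOWN"):
    """Return the label of the first rule whose keyword occurs in `text`."""
    lowered = str(text).lower()
    for keyword, label in rules.items():
        if keyword in lowered:
            return label
    return default


print(classify("4 m white shark", SPECIES_RULES))            # White shark
print(classify("Tiger shark or bull shark", SPECIES_RULES))  # OTHER / NOT KNOWN
print(classify("Bronze whaler shark, 2 m", SPECIES_RULES))   # Bronze whaler shark
```

In the notebook this could be applied with `df['Species'].map(lambda s: classify(s, SPECIES_RULES))`. The same function would also work for the activity filter by passing a different dictionary.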

# 🦈️ Injuries and types of attack
The GSAF categorizes scavenging bites on humans as "questionable incidents."

## PROVOKED
Provoked attacks occur when a human touches, hooks, nets, or otherwise aggravates the animal. Incidents that occur outside of a shark's natural habitat, such as aquariums and research holding-pens, are considered provoked, as are all incidents involving captured sharks. Sometimes humans inadvertently provoke an attack, such as when a surfer accidentally hits a shark with a surf board.

## UNPROVOKED
- Hit-and-run attack
- Sneak Attack
- Bump-and-bite attack 

For more information on how to differentiate PROVOKED vs UNPROVOKED attacks :
https://en.wikipedia.org/wiki/Shark_attack#Types_of_attacks

🔥️
## 🏊️ TYPE OF ATTACK / IDEAS
- On the Type column, don't count `sea disasters`, `questionable`, and `boatomg` values
- Standardize
- Size of the shark according to Species column

In [None]:
df.Type.isnull().sum()

In [None]:
df.Type = df.Type.fillna('Invalid')
df.Type.value_counts()

In [None]:
# Boat, Boating and Boatomg to 1 category
filt = lambda x: 'Boat' if 'Boat' in x else x
df.Type = df.Type.map(filt)

# Questionable to Invalid
filt = lambda x : 'Invalid' if 'Questionable' in x else x
df.Type = df.Type.map(filt)

df.Type.value_counts()

## 🦈️ Activities
Now we will clean the `Activity` column.

In [None]:
df.Activity.value_counts()

# 🌊️ Turn this filter into a key:value pairs 
🔥️

In [None]:
def filt(x):
    if type(x) is str:
        if 'floating' in x.lower():
            return 'Floating'
        if 'diving' in x.lower():
            return 'Diving'
        if 'dive' in x.lower():
            return 'Diving'
        if 'skiing' in x.lower():
            return 'Skiing'
        if 'ski ' in x.lower():
            return 'Skiing'
        if 'surf' in x.lower():
            return 'Surfing'
        if 'snorkel' in x.lower():
            return 'Snorkeling'
        if 'fishing' in x.lower():
            return 'Fishing'
        if 'drift' in x.lower():
            return 'Drifting'
        if 'swim' in x.lower():
            return 'Swimming'
        if 'bathing' in x.lower():
            return 'Swimming'
        if 'paddle' in x.lower():
            return 'Paddleboarding'
        
        if 'raft' in x.lower():
            return 'Rafting'
        if 'playing' in x.lower():
            return 'Playing'
        
        if 'wading' in x.lower():
            return 'Wading'
        if 'research' in x.lower():
            return 'Research'
        if 'rescue' in x.lower():
            return 'Rescuing someone / something'
        if 'rescuing' in x.lower():
            return 'Rescuing someone / something'
        if 'overboard' in x.lower():
            return 'Fell from boat into water'        
        if 'fell' in x.lower():
            return 'Fell to the water'
        
        if 'boat' in x.lower():
            return 'Boating'
        if 'yacht' in x.lower():
            return 'Boating'
        if 'disaster' in x.lower():
            return 'Sea Disaster'
        if 'wreck' in x.lower():
            return 'Shipwreck'
        if 'capsize' in x.lower():
            return 'Shipwreck'
        if 'sank' in x.lower():
            return 'Shipwreck'
        if 'torpedo' in x.lower():
            return 'War (Torpedo)'
        if 'warship' in x.lower():
            return 'War (Warship)'
        # Both words must be present; ('plane' and 'crash') only tested for 'crash'
        if 'plane' in x.lower() and 'crash' in x.lower():
            return 'Airplane crash'
        if 'airliner' in x.lower():
            return 'Airplane crash'
        
        
        if 'filming' in x.lower():
            return 'Filming / Photoshoot'
        if 'sailing' in x.lower():
            return 'Sailing'
        if 'net ' in x.lower():
            return 'Fishing'
        if ' net' in x.lower():
            return 'Fishing'
        if 'photo' in x.lower():
            return 'Filming / Photoshoot'
        if 'sinking' in x.lower():
            return 'Sea Disaster'
        
        if 'board' in x.lower():
            return 'Other types of Boarding sports'
        else:
            return x
    else:
        # Non-string values (nulls) are left untouched so they can be filled later
        return x
df['Activity2'] = df.Activity.map(filt)
df.Activity2.value_counts()

In [None]:
# Fill the empty values too
df.Activity.isnull().sum()

In [None]:
df.Activity2 = df.Activity2.fillna('NOT SPECIFIED')

In [None]:
df.Activity2.isnull().sum()

In [None]:
df.Activity2.value_counts()

# Fatalities
Sort them out and map them

In [None]:
df.Fatal.isnull().sum()

In [None]:
df.Fatal = df.Fatal.fillna('UNKNOWN')

In [None]:
def filt(x):
    if 'UNKNOWN' in x.upper():
        return x
    if 'N' in x.upper():
        return 'N'
    if 'Y' in x.upper():
        return 'Y'
    else:
        return 'UNKNOWN'

df['Fatal'] = df.Fatal.map(filt)
df.Fatal.value_counts()

In [None]:
df.Injury.value_counts()

In [None]:
print(df.count() < 5000)  # To see which columns hold fewer than 5000 non-null values

## 🦈️
The `Age` and `Time` columns have many null values and are not going to be used to test our hypothesis, so we will drop them.

In [None]:
df = df.drop(columns=['Age', 'Time'])
df.columns
display(df.dtypes)

In [None]:
df.duplicated().sum()

In [None]:
# drop dupes and compare lengths
df_pdf_nodupes = df.drop_duplicates()

len(df) - len(df_pdf_nodupes), 'duped values'

<br><br><br><br><br>
<br><br><br>

<h1><center> 🏄️ Exporting the cleaned Dataset </center></h1>


In [None]:
# Phew... 
#
# That was a long haul right there.
# It might not be perfect, but we have
# some cleaner data with which we can 
# make a more educated guess about 
# this topic.
#
# Now, we can take this dataframe and
# export it. This will be one of the
# products of this project and 
# hopefully benefit the future of
# sharing the planet with sharks.
#
#
#
# Oh, I almost forgot... 
#
# Exporting as '.csv':
#
# It cannot get any simpler than this:
df.to_csv('exported.csv')

In [None]:
def distance_measurments(x):
    
    #THERE ARE SO MANY ERRORS IN THIS COLUMN, let's filter them
    # NOT CONFIRMED
    
    # -- trying to filter sizes --
    
    if """10'""" in x.lower():
        return """10' shark"""
    if """11'""" in x.lower():
        return """11' shark"""
    if """12'""" in x.lower():
        return """12' shark"""
    if """13'""" in x.lower():
        return """13' shark"""
    if """14'""" in x.lower():
        return """14' shark"""
    if """15'""" in x.lower():
        return """15' shark"""
    if """16'""" in x.lower():
        return """16' shark"""
    if """17'""" in x.lower():
        return """17' shark"""
    if """18'""" in x.lower():
        return """18' shark"""
    if """19'""" in x.lower():
        return """19' shark"""
    if """20'""" in x.lower():
        return """20' shark"""
    if """21'""" in x.lower():
        return """21' shark"""
    
    if """1'""" in x.lower():
        return """1' shark"""
    if """2'""" in x.lower():
        return """2' shark"""
    if """3'""" in x.lower():
        return """3' shark"""
    if """4'""" in x.lower():
        return """4' shark"""
    if """5'""" in x.lower():
        return """5' shark"""
    if """6'""" in x.lower():
        return """6' shark"""
    if """7'""" in x.lower():
        return """7' shark"""    
    if """8'""" in x.lower():
        return """8' shark"""
    if """9'""" in x.lower():
        return """9' shark"""

    if 'small shark' in x.lower():
        return 'Small shark'
    
    else:
        return x
    
# df['Species2'] = df['Species'].map(shark_identifier)
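
The size checks above repeat the same test for every number, so a regular expression could cover them in one step. This is a minimal sketch, not a drop-in replacement: `FEET_PATTERN` and `shark_length_label` are hypothetical names, and it takes the first length it finds, while the function above tests the two-digit sizes first, so the results can differ when a text mentions several sizes.

```python
import re

# A number immediately followed by a foot mark, e.g. the "12'" in "12' shark"
FEET_PATTERN = re.compile(r"(\d+)'")


def shark_length_label(text):
    """Return "<n>' shark" for the first length in feet found in `text`."""
    as_text = str(text)
    match = FEET_PATTERN.search(as_text)
    if match:
        return f"{match.group(1)}' shark"
    if "small shark" in as_text.lower():
        return "Small shark"
    # Nothing recognised: hand the value back unchanged
    return text


print(shark_length_label("Tiger shark, 12'"))
print(shark_length_label("A small shark"))
```

Unlike the chain of checks, this version is not limited to lengths between 1 and 21 feet.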

In [None]:
# Since there is no column that states if the attack was provoked or not,
# I want to analyze the injury column to distinguish between the cases that were provoked
# and those that were unprovoked.

random.sample(list(df.Injury.value_counts().items()),20)

In [None]:
# Categorizing  Provoked and  Unprovoked attacks
# df_provoked = np.where(df_nodupes.Injury.isin(provoked), True, False) 

# Passing that categorization to a new PROVOKED COLUMN
def provoked_attacks(x):
    
    provoked = ['PROVOKED', 'hook', 'shot']
    
    # Check every keyword before falling back to the original value
    for e in provoked:
        if e in str(x):
            return 'PROVOKED'
    return x
df['Provoked'] = df.Injury.map(provoked_attacks)
df['Provoked'].tail(50)

In [None]:
#df_clean.loc[df_clean["trany"].str.startswith("M"),"trany"] = "Manual"

provoked = ['PROVOKED', 'hook', 'shot']
#map(lambda words, x : words in x, provoked, df_nodupes.loc[df_nodupes['Injury'].str])
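# Select the rows whose Injury text contains any of the keywords
df.loc[df['Injury'].str.contains('|'.join(provoked), na=False)]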

## 🏊️ IDEAS FOR FUTURE PROJECTS:
- The pdfs presented on the `href` seem quite structured
 
 - It could be possible to parse them later down the road and use a **REGEX** to find more data
 
 - Like, adding a column that lists the **'Moon Phase'** described on some of the pdfs

- I have also run the query a few times and noticed that all pdfs have been uploaded to the same website and share the same naming structure