# PROJECT 02 :
-----

### BRIEF : data analysis of global shark attacks for a business idea
We first examine the Shark Attack dataset to understand its structure and formulate one or more hypotheses about the data.
We hypothesize that shark attacks are more common in certain locations and peak during specific months.
We then define a Business Case, such as:

* “As a company that sells medical products, I want to identify destinations with high shark attack rates.”
* “As a company providing supply transportation services, I want to know when and where shark attacks peak to plan the safe transport of medical supplies to hospitals.”

 Throughout the project, we will use Python and the pandas library to apply at least five data cleaning techniques to handle missing values, duplicates, and formatting inconsistencies. After cleaning, we will perform basic exploratory data analysis to validate our hypotheses and extract insights. 

#### 📝 BUSINESS IDEA: 3 Bullet Points

* Problem to Solve: Coastal hospitals and emergency response teams are not always prepared with the right medical supplies during periods of high shark-attack frequency. This business solves the problem by predicting when and where attacks are most likely, so medical supplies can be stocked in advance.

* Business Concept: Use historical shark attack data to create global heatmaps and seasonal risk forecasts. Then provide pharmaceutical products (painkillers, antibiotics, blood bags, emergency kits) and transportation support to hospitals and ambulances near high-risk beaches.

* Data Used to Profit: The business will analyze Country, Date/Month, Gender, Age, Fatality, and Type of Injury to identify high-risk locations, peak attack months, and most common injury types. This allows optimized supply production, targeted sales, and efficient delivery to the areas that need it most.



## 🌀 COLUMNS TO CLEAN :
-----

**- COUNTRY (global comparison between countries to invest in more)** @Blanca

    * We check the COUNTRY column and make sure every country name is spelled correctly
    * We remove rows where COUNTRY is NULL
    * We make sure every COUNTRY name is capitalized and written the same way, for example:
        United States of America == USA == US
        must all be stored under one consistent name
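
The COUNTRY steps above could be sketched roughly as follows. The alias table `COUNTRY_ALIASES` and the helper `clean_country` are hypothetical names introduced for illustration; the real alias list would have to be built after inspecting `df["Country"].unique()`.

```python
import pandas as pd

# Hypothetical alias table: extend it after inspecting df["Country"].unique().
COUNTRY_ALIASES = {
    "UNITED STATES OF AMERICA": "USA",
    "UNITED STATES": "USA",
    "US": "USA",
}

def clean_country(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a country and normalise spelling and case."""
    out = df.dropna(subset=["Country"]).copy()
    # Trim whitespace, collapse repeated spaces, and upper-case every name.
    names = (
        out["Country"]
        .astype(str)
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.upper()
    )
    # Map known aliases to one canonical spelling; other names stay unchanged.
    out["Country"] = names.replace(COUNTRY_ALIASES)
    return out

# Made-up sample rows, for illustration only.
sample = pd.DataFrame(
    {"Country": ["usa", " United States of America ", "US", None, "Australia"]}
)
cleaned = clean_country(sample)
print(cleaned["Country"].tolist())
```

Upper-casing matches the way the raw file already stores countries (`AUSTRALIA`, `BAHAMAS`, `USA`), so only the aliases need a manual mapping.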

**- DATE ( MONTH + YEAR )** @Blanca

    * We remove rows where DATE is NULL or does not include both a month and a year
    * We split the DATE column into three columns: DAY + MONTH + YEAR
    * We verify that the new YEAR column matches the old YEAR column and keep only the rows that match
        * NEW YEAR COLUMN is the one split from the DATE column
        * OLD YEAR COLUMN is the one already existing in the original sheet
    * Rows where the new YEAR and the old YEAR do not match are removed, so we keep only rows with accurate years
    * We remove the DAY and OLD YEAR columns and keep the MONTH and the matching YEAR
    * We use the MONTH column to find the months in which shark attacks happen most often

**- GENDER ( F or M )** @Cecilia

    * We check the unique values and keep only two values, F or M; rows with any other value are deleted
    * We noticed that most victims are M
    * We compute the percentages of male and female victims
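
The DATE steps listed above could be sketched as follows. This is a minimal sketch, assuming the month appears as an English three-letter abbreviation (as in `25 Aug 2023` or `Reported 17-Jul-1848`) and the year as the first four-digit number; `split_date`, `Month` and `Year_from_date` are hypothetical names.

```python
import re
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
MONTH_NUMBER = {name: number for number, name in enumerate(MONTHS, start=1)}

def split_date(df: pd.DataFrame) -> pd.DataFrame:
    """Extract month and year from Date and keep rows whose year matches Year."""
    out = df.copy()
    text = out["Date"].astype(str)
    # Month: first three-letter month abbreviation that starts a word.
    month_name = text.str.extract(
        r"\b(" + "|".join(MONTHS) + ")", flags=re.IGNORECASE, expand=False
    )
    out["Month"] = month_name.str.lower().map(MONTH_NUMBER)
    # Year: first four-digit number found in the text.
    year = text.str.extract(r"(\d{4})", expand=False)
    out["Year_from_date"] = pd.to_numeric(year, errors="coerce")
    # Keep rows that have a month and whose extracted year equals the old Year.
    keep = out["Month"].notna() & (out["Year_from_date"] == out["Year"])
    out = out.loc[keep].drop(columns=["Year"])
    out = out.rename(columns={"Year_from_date": "Year"})
    return out.astype({"Month": int, "Year": int})

# Sample rows: the first five dates appear in the raw data, the last Year is altered on purpose.
sample = pd.DataFrame({
    "Date": ["25 Aug 2023", "21 Aug-2023", "Reported 17-Jul-1848",
             "1733", "Before 1824", "07-Jun-2023"],
    "Year": [2023.0, 2023.0, 1848.0, 1733.0, 0.0, 2022.0],
})
cleaned = split_date(sample)
print(cleaned)
print(cleaned["Month"].value_counts())  # months ranked by number of attacks
```

Rows such as `1733` (no month) or a date whose year disagrees with the `Year` column are dropped, which mirrors the rule of keeping only the matching years.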

**- AGE (victims age ranges)** @Samia

    * we split the ages into three categories (minors under 18 / adults 18-40 / elders 40+)
    * we compute the percentage of victims in each age range
    * the majority of victims survive
    * note that treatment complications depend on the victim's age
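
The three age groups could be built with `pandas.cut`, as in the sketch below. The notes leave the boundary at 40 ambiguous; this sketch assumes that 40 still counts as an adult, and the labels are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; NaN stands for a victim with no recorded age.
ages = pd.Series([6, 17, 18, 40, 41, 73, np.nan], name="Age")

# right=False makes each bin closed on the left: [0, 18), [18, 41), [41, inf).
age_group = pd.cut(
    ages,
    bins=[0, 18, 41, np.inf],
    right=False,
    labels=["Minor (<18)", "Adult (18-40)", "Elder (41+)"],
)

# Percentage of victims per group; missing ages are left out of the total.
share = age_group.value_counts(normalize=True).mul(100).round(1)
print(age_group.tolist())
print(share)
```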

**- FATALITY ( Y or N )** @Cecilia
    * depends on the type of injury and whether it includes death or high severity
    * We check the unique values and keep only two values, Y or N; rows with any other value are deleted
    * most victims survived
    * used for the pharmaceutical logistics and the transportation of injured people to the hospital
    * assumption: survivors are a high percentage of victims, and we use that to sell the idea of profiting from selling products to hospitals
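
Both GENDER and FATALITY follow the same rule: keep two allowed values and delete every other row. A small helper could cover both columns. `keep_two_values` and the alias mapping are hypothetical and shown only as a sketch.

```python
import pandas as pd

def keep_two_values(df, column, allowed, aliases=None):
    """Normalise a column and keep only the rows whose value is allowed."""
    aliases = aliases or {}

    def normalise(value):
        if pd.isna(value):
            return None
        text = str(value).strip().upper()
        return aliases.get(text, text)

    out = df.copy()
    out[column] = out[column].map(normalise)
    return out[out[column].isin(allowed)]

# Made-up sample values, for illustration only.
sample = pd.DataFrame({"Sex": ["M", "F", " m ", "male", "lli", None, "N"]})
cleaned = keep_two_values(
    sample, "Sex", allowed=["F", "M"], aliases={"MALE": "M", "FEMALE": "F"}
)
print(cleaned["Sex"].tolist())
print(cleaned["Sex"].value_counts(normalize=True).mul(100).round(1))
```

The same call with `column="Fatal Y/N"` and `allowed=["Y", "N"]` would apply once that column holds values.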

**- INJURY TYPE** @Samia

    * clean the injury type and separate it into severity levels
    * separate the injury type into different body parts
    * treatment depends on the type of injury, and so do the supplies needed
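
Separating injuries by body part could start from a keyword table like the one below. The groups and keywords in `BODY_PARTS` are assumptions made for this sketch, not a validated medical taxonomy, and would need tuning against the full `Injury` column.

```python
import re
import pandas as pd

# Hypothetical keyword table; each pattern is matched at the start of a word.
BODY_PARTS = {
    "Leg / foot": r"leg|thigh|calf|knee|ankle|foot|feet|lower limb",
    "Arm / hand": r"arm|forearm|hand|finger|wrist|shoulder",
    "Torso": r"torso|chest|abdomen|back|hip",
    "Head / neck": r"head|face|neck",
}

def body_parts(injury) -> str:
    """Return the body-part groups mentioned in an injury description."""
    if pd.isna(injury):
        return "Unspecified"
    text = str(injury).lower()
    found = [
        part
        for part, pattern in BODY_PARTS.items()
        if re.search(rf"\b(?:{pattern})", text)
    ]
    return ", ".join(found) if found else "Unspecified"

# Injury descriptions taken from the first rows of the raw data.
injuries = pd.Series([
    "Severe injuries to lower limbs",
    "Calf severely bitten",
    "Right forearm and left hand injured",
    "Death by misadventure",
])
print(injuries.map(body_parts).tolist())
```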



## RAW DATA
-------

In [2]:
import pandas as pd

df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")
print(df)


                         Date    Year          Type     Country  \
0                 25 Aug 2023  2023.0    Unprovoked   AUSTRALIA   
1                 21 Aug-2023  2023.0  Questionable     BAHAMAS   
2                 07-Jun-2023  2023.0    Unprovoked     BAHAMAS   
3                 02-Mar-2023  2023.0    Unprovoked  SEYCHELLES   
4                 18-Feb-2023  2023.0  Questionable   ARGENTINA   
..                        ...     ...           ...         ...   
556                      1733  1733.0       Invalid     ICELAND   
557                      1723  1723.0    Unprovoked      ROATAN   
558  Late 1600s Reported 1728  1642.0       Invalid      GUINEA   
559               Before 1824     0.0    Unprovoked   AUSTRALIA   
560      No date, Before 1963     0.0       Invalid         USA   

                      State                           Location  \
0           New South Wales  Lighthouse Beach, Port Macquarie    
1    New Providence   Isoad             Saunders Beach, Nassau 

In [3]:
import pandas as pd

df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")
df.head(5)

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,Injury,Fatal Y/N,Time,Species,Source
0,25 Aug 2023,2023.0,Unprovoked,AUSTRALIA,New South Wales,"Lighthouse Beach, Port Macquarie",Surfing,Toby Begg,M,44,Severe injuries to lower limbs,,10h00,"White shark, 3.8-4.2m","B. Myatt, & M. Michaelson, GSAF"
1,21 Aug-2023,2023.0,Questionable,BAHAMAS,New Providence Isoad,"Saunders Beach, Nassau",,male,M,20/30,Body found with shark bites. Possible drowning...,,Morning,,"The Tribune, 8/21/2023"
2,07-Jun-2023,2023.0,Unprovoked,BAHAMAS,Freeport,Shark Junction,Scuba diving,Heidi Ernst,F,73,Calf severely bitten,,13h00,Caribbean rreef shark,"J. Marchand, GSAF"
3,02-Mar-2023,2023.0,Unprovoked,SEYCHELLES,Praslin Island,,Snorkeling,Arthur …,M,6,Left foot bitten,,Afternoon,Lemon shark,"Midlibre, 3/18/2023"
4,18-Feb-2023,2023.0,Questionable,ARGENTINA,Patagonia,Chubut Province,,Diego Barría,M,32,Death by misadventure,,,,"El Pais, 2/27/2023"


In [None]:
import pandas as pd

# Load the CSV
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")

# Option 1: Display the entire DataFrame (all rows and columns)
pd.set_option('display.max_rows', None)      # Show all rows
pd.set_option('display.max_columns', None)   # Show all columns

df  # Jupyter will render the full table
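
The absolute Windows path used in the cells above only resolves on one machine. A relative path built with `pathlib` is one possible alternative. This sketch assumes the notebook runs from a folder directly below the repository root (the `sys.path.append(os.path.abspath(".."))` call in the cleaning section suggests that layout); `raw_csv_path` is a hypothetical helper.

```python
from pathlib import Path

def raw_csv_path(project_root: Path) -> Path:
    """Build the path to data/raw.csv below a given project root."""
    return project_root / "data" / "raw.csv"

# Assumption: the current working directory is one level below the repository root.
project_root = Path.cwd().parent
print(raw_csv_path(project_root))

# The DataFrame would then be loaded with:
# df = pd.read_csv(raw_csv_path(project_root))
```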

In [4]:
df.columns

Index(['Date', 'Year', 'Type', 'Country', 'State', 'Location', 'Activity',
       'Name', 'Sex', 'Age', 'Injury', 'Fatal Y/N', 'Time', 'Species ',
       'Source'],
      dtype='object')

In [5]:
df.columns.to_list()

['Date',
 'Year',
 'Type',
 'Country',
 'State',
 'Location',
 'Activity',
 'Name',
 'Sex',
 'Age',
 'Injury',
 'Fatal Y/N',
 'Time',
 'Species ',
 'Source']

In [6]:
df.shape

(561, 15)

In [8]:
df['Age'].describe().round(2)

count     238
unique     72
top        17
freq       14
Name: Age, dtype: object

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561 entries, 0 to 560
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       561 non-null    object 
 1   Year       560 non-null    float64
 2   Type       557 non-null    object 
 3   Country    556 non-null    object 
 4   State      508 non-null    object 
 5   Location   503 non-null    object 
 6   Activity   433 non-null    object 
 7   Name       502 non-null    object 
 8   Sex        480 non-null    object 
 9   Age        238 non-null    object 
 10  Injury     553 non-null    object 
 11  Fatal Y/N  0 non-null      float64
 12  Time       157 non-null    object 
 13  Species    526 non-null    object 
 14  Source     556 non-null    object 
dtypes: float64(2), object(13)
memory usage: 65.9+ KB


In [11]:
df['Age'].info()
#238 non-null object

<class 'pandas.core.series.Series'>
RangeIndex: 561 entries, 0 to 560
Series name: Age
Non-Null Count  Dtype 
--------------  ----- 
238 non-null    object
dtypes: object(1)
memory usage: 4.5+ KB


In [12]:
df['Age'].unique()

array(['44', '20/30', '73', '6', '32', '50', '11', '26', nan, '35', '62',
       '34', '31', '33', '25', '10', '8', '18', '68', '20', '19', '27',
       '60', 'Teen', '65', '9', '40', '39', '43', '23', '38', '63', '47',
       '48', '42', '24', '12', '16', '14', '7', '49', '17', '52', '53',
       '45', '36', '54', '51', '22', '28', '56', '8 or 10', '75',
       '23 & 20', '37', '29', '21', '15', '30', '16 to 18', '67', '77',
       'mid-20s', 'Ca. 33', '? & 19', '46', '37, 67, 35, 27,  ? & 27',
       '21, 34,24 & 35', '34 & 19', '2 to 3 months', '13', '5', '1'],
      dtype=object)

In [13]:
df['Age'].nunique()

72

In [14]:
df['Age'].describe()

count     238
unique     72
top        17
freq       14
Name: Age, dtype: object

## CLEAN DATA
-------

### IMPORT UTILS and INIT from SRC

In [15]:
import sys, os
sys.path.append(os.path.abspath(".."))
from src import utils
from src import init

In [23]:
import pandas as pd
import numpy as np
import re

# Load CSV
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")

def clean_age(value):
    if pd.isna(value):
        return np.nan
    
    value = str(value).strip().lower()

    # Remove common unwanted characters
    value = value.replace("years", "").replace("ca.", "").replace("about", "").strip()

    # Words that indicate no age
    if value in ["n/a", "na", "none", "?", "", "unknown"]:
        return np.nan

    # Teen keyword (assume average age 15)
    if "teen" in value:
        return 15

    # mid-20s -> 25
    if "mid" in value and "20" in value:
        return 25

    # Capture numbers if present
    nums = re.findall(r"\d+", value)

    if len(nums) == 1:
        return int(nums[0])
    
    # If multiple numbers like "20/30", "24 & 35", "16 to 18", take the average
    if len(nums) >= 2:
        nums = [int(n) for n in nums]
        return sum(nums) / len(nums)

    return np.nan

# Apply cleaning function
df['Age_clean'] = df['Age'].apply(clean_age)

# Convert to numeric properly
df['Age_clean'] = pd.to_numeric(df['Age_clean'], errors='coerce')

# Fill missing values with median
# (assign the result back; fillna(..., inplace=True) on a column selection raises a FutureWarning)
median_age = df['Age_clean'].median()
df['Age_clean'] = df['Age_clean'].fillna(median_age)

# Confirm results
print(df['Age_clean'].head(20))
print("Median age used:", median_age)

# (Optional) Save cleaned file
df.to_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\cleaned.csv", index=False)


0     44.0
1     25.0
2     73.0
3      6.0
4     32.0
5     50.0
6     11.0
7     26.0
8     26.0
9     26.0
10    26.0
11    26.0
12    35.0
13    26.0
14    26.0
15    50.0
16    26.0
17    62.0
18    34.0
19    31.0
Name: Age_clean, dtype: float64
Median age used: 26.0


-----
### AGE CLEANED

In [32]:
import pandas as pd
import numpy as np
import re

# Load CSV
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")
# clean_age: convert one raw Age entry into a number (NaN when no age is given)
def clean_age(value):
    if pd.isna(value):
        return np.nan
    
    value = str(value).strip().lower()

    # Remove useless text
    value = value.replace("years", "").replace("year", "").replace("ca.", "").replace("about", "").strip()

    # Words meaning no age
    if value in ["n/a", "na", "none", "?", "", "unknown"]:
        return np.nan

    # Teen keyword (approx.)
    if "teen" in value:
        return 15

    # mid-20s -> 25
    if "mid" in value and "20" in value:
        return 25

    # Extract numbers
    nums = re.findall(r"\d+", value)

    # Single number
    if len(nums) == 1:
        return int(nums[0])
    
    # Multiple numbers -> average them (20/30, 23 & 20, 16 to 18)
    if len(nums) >= 2:
        nums = [int(n) for n in nums]
        return sum(nums) / len(nums)

    return np.nan

# Apply cleaner
df['Age_clean'] = df['Age'].apply(clean_age)

# Convert to numeric
df['Age_clean'] = pd.to_numeric(df['Age_clean'], errors='coerce')

# Fill missing with median
# (assign the result back; fillna(..., inplace=True) on a column selection raises a FutureWarning)
median_age = df['Age_clean'].median()
df['Age_clean'] = df['Age_clean'].fillna(median_age)

# Convert to integer (no floats)
df['Age_clean'] = df['Age_clean'].round().astype(int)

# Show result
print(df[['Age', 'Age_clean']].head(20))
print("Median age used:", median_age)

# Save cleaned dataset
df.to_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\cleaned.csv", index=False)

# Check result
print("\n",df["Age_clean"].value_counts())

      Age  Age_clean
0      44         44
1   20/30         25
2      73         73
3       6          6
4      32         32
5      50         50
6      11         11
7      26         26
8     NaN         26
9     NaN         26
10    NaN         26
11    NaN         26
12     35         35
13    NaN         26
14    NaN         26
15     50         50
16    NaN         26
17     62         62
18     34         34
19     31         31
Median age used: 26.0

 Age_clean
26    328
17     15
20      9
43      8
25      8
     ... 
67      1
46      1
2       1
5       1
1       1
Name: count, Length: 61, dtype: int64
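
One entry in the raw `Age` values, `'2 to 3 months'`, is expressed in months, yet `clean_age` averages its numbers as if they were years (2.5, which rounds to 2). A possible adjustment is sketched below; `age_in_years` is a hypothetical variant of the number-handling step and not part of the notebook.

```python
import re
import numpy as np

def age_in_years(value):
    """Average the numbers in an age entry and convert month-based ages to years."""
    text = str(value).strip().lower()
    nums = [int(n) for n in re.findall(r"\d+", text)]
    if not nums:
        return np.nan
    mean = sum(nums) / len(nums)
    # Entries such as "2 to 3 months" describe an infant, so divide by 12.
    return mean / 12 if "month" in text else mean

print(age_in_years("2 to 3 months"))
print(age_in_years("16 to 18"))
```

Entries that list several victims, such as `'21, 34,24 & 35'`, are still averaged into one age here, which is a simplification worth keeping in mind when reading the age statistics.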


In [None]:
percentiles = df['Age_clean'].describe(percentiles=[0.25, 0.5, 0.75])

print("Age Percentile Statistics:")
print(f"Min:   {int(percentiles['min'])}")
print(f"25%:   {int(percentiles['25%'])}")
print(f"50% (Median): {int(percentiles['50%'])}")
print(f"75%:   {int(percentiles['75%'])}")
print(f"Max:   {int(percentiles['max'])}")

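# All three quartiles equal 26 because the 323 missing ages were filled with the median (26),
# so these percentiles mostly reflect the imputation rather than the recorded ages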

Age Percentile Statistics:
Min:   1
25%:   26
50% (Median): 26
75%:   26
Max:   77
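
The identical quartiles above follow from the median fill: once most rows carry the same imputed value, the 25%, 50% and 75% points collapse onto it. The sketch below reproduces the effect on made-up numbers and compares it with quartiles computed on the recorded ages only.

```python
import numpy as np
import pandas as pd

# Made-up data: 7 recorded ages and 13 missing ones.
recorded = [6, 17, 20, 26, 35, 44, 73]
ages = pd.Series(recorded + [np.nan] * 13, name="Age")

quartiles = [0.25, 0.5, 0.75]

# After a median fill, 14 of the 20 values equal 26.
imputed = ages.fillna(ages.median())
print("imputed :", imputed.quantile(quartiles).tolist())

# quantile() skips NaN, so this describes the recorded ages only.
print("recorded:", ages.quantile(quartiles).tolist())
```

Keeping the unfilled column (or a boolean flag marking imputed rows) alongside the filled one would let the age analysis use the recorded values only.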


In [46]:
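# Replace the original Age column with the cleaned integer values.
# The guard keeps the cell safe to re-run: once Age_clean has been dropped,
# a second run would otherwise raise KeyError: 'Age_clean'.
if 'Age_clean' in df.columns:
    df['Age'] = df['Age_clean']
    # Drop the temporary Age_clean column
    df.drop(columns=['Age_clean'], inplace=True)

# Check the result
print(df['Age'].head())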

In [50]:
df.head(30)

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,Injury,Fatal Y/N,Time,Species,Source
0,25 Aug 2023,2023.0,Unknown,AUSTRALIA,New South Wales,"Lighthouse Beach, Port Macquarie",Surfing,Toby Begg,M,44,Severe injuries to lower limbs,,10h00,"White shark, 3.8-4.2m","B. Myatt, & M. Michaelson, GSAF"
1,21 Aug-2023,2023.0,Unknown,BAHAMAS,New Providence Isoad,"Saunders Beach, Nassau",,male,M,25,Body found with shark bites. Possible drowning...,,Morning,,"The Tribune, 8/21/2023"
2,07-Jun-2023,2023.0,Unknown,BAHAMAS,Freeport,Shark Junction,Scuba diving,Heidi Ernst,F,73,Calf severely bitten,,13h00,Caribbean rreef shark,"J. Marchand, GSAF"
3,02-Mar-2023,2023.0,Unknown,SEYCHELLES,Praslin Island,,Snorkeling,Arthur …,M,6,Left foot bitten,,Afternoon,Lemon shark,"Midlibre, 3/18/2023"
4,18-Feb-2023,2023.0,Unknown,ARGENTINA,Patagonia,Chubut Province,,Diego Barría,M,32,Death by misadventure,,,,"El Pais, 2/27/2023"
5,08-Feb-2022,2022.0,Unknown,COSTA RICA,Guanacoste,Playa Del Coco,Diving,female,F,50,Right forearm and left hand injured,,,"Bull shark, 3m",Diario Extra Del Costa Rica 2/9/2022
6,15-Nov-2021,2021.0,Unknown,BRAZIL,São Paulo.,Boqueirão Beach,Playing,male,M,11,Minor cuts to left thigh,,12h00,dogfish,"K. McMurray, TrackingSharks.com"
7,16-Oct-2021,2021.0,Missing / Unknown,AUSTRALIA,Queensland,Sudbury Island,Spearfishing,Torrance Sambo,M,26,Disappeared,,,,"K. McMurray, TrackingSharks.com"
8,10-Sep-2021,2021.0,Unknown,EGYPT,,Sidi Abdel Rahmen,Swimming,Mohamed,M,26,Laceration to arm caused by metal object,,,No shark invovlement,Dr. M. Fouda & M. Salrm
9,29-Jul-2020,2020.0,No injury,AUSTRALIA,Tasmania,Tenth Island,Sightseeing,5.5 m runabout. Occupants: Sean & James Vinar,,26,"No injury to occupants, injury to shark attemp...",,09h08,"White shark, 4m","C. Black, GSAF"


-----
### INJURY TYPE & FATALITY DEPENDS ON THE INJURY TYPE

In [None]:
import pandas as pd
import numpy as np

def classify_injury(injury, fatal):
    if pd.isna(injury):
        injury = ""
    injury = injury.lower()
    fatal = str(fatal).strip().upper()

    # HOAX / FALSE
    if "hoax" in injury or "false" in injury:
        return "Hoax / False report"

    # NOT A SHARK
    if any(x in injury for x in ["not a shark", "stingray", "barracuda", "propeller", "fish"]):
        return "Not a shark"

    # NO INJURY
    if "no injury" in injury or "no injuries" in injury:
        return "No injury"

    # PROVOKED
    if "provoked" in injury:
        return "Provoked incident"

    # MISSING
    if any(x in injury for x in ["missing", "disappeared", "body not recovered"]):
        return "Missing / Unknown"

    # FATAL CASES
    if fatal == "Y":
        # drowned then bitten after death
        if "post-mortem" in injury or ("drown" in injury and "post" in injury):
            return "Fatal | Drowned, shark scavenged"
        if "unconfirmed" in injury or "probable" in injury:
            return "Fatal | Unconfirmed shark involvement"
        return "Fatal | Shark confirmed"

    # NON-FATAL CASES
    if fatal == "N":
        if any(x in injury for x in ["bitten", "lacerat", "puncture", "wound", "abrasion"]):
            return "Non-fatal | Confirmed shark bite"
        # Text but no bite
        return "Non-fatal | Other"

    # Unknown
    return "Unknown"

# Apply to dataframe
df["Type"] = df.apply(lambda row: classify_injury(row["Injury"], row["Fatal Y/N"]), axis=1)


# Check RESULTS--------------------------------------------------------------------------------------------------------------

# Count number of records in each injury type
counts = df["Type"].value_counts()

# Calculate percentages
percentages = (counts / len(df)) * 100

# Create summary DataFrame
summary = pd.DataFrame({
    "Cases": counts,
    "Percentage": percentages.round(2)
})

# Save summary to CSV
summary.to_csv("injury_type_summary.csv", index=True)

# Print SUMMARY
# print(summary)

print("Injury Type Summary:\n")
for category in counts.index:
    print(f"{category}: {counts[category]} cases  ({percentages[category]:.2f}%)")



Injury Type Summary:

Unknown: 399 cases  (71.12%)
No injury: 87 cases  (15.51%)
Missing / Unknown: 38 cases  (6.77%)
Not a shark: 27 cases  (4.81%)
Provoked incident: 5 cases  (0.89%)
Hoax / False report: 5 cases  (0.89%)
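
In the summary above no record lands in a Fatal or Non-fatal category, which is consistent with `df.info()` reporting 0 non-null values for `Fatal Y/N` at this stage: the `fatal == "Y"` and `fatal == "N"` branches cannot fire until that column is filled. Two further points may be worth checking: the substring test `"provoked" in injury` also matches "unprovoked", and writing the result to `Type` overwrites the source values (Unprovoked, Questionable, ...). The sketch below illustrates a word-boundary match and a separate output column; `is_provoked` and `Injury_Class` are hypothetical names.

```python
import re
import pandas as pd

def is_provoked(injury) -> bool:
    """True for the standalone word 'provoked', False for 'unprovoked'."""
    if pd.isna(injury):
        return False
    return re.search(r"\bprovoked\b", str(injury).lower()) is not None

# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "Type": ["Unprovoked", "Provoked"],
    "Injury": ["Unprovoked bite to the calf", "PROVOKED INCIDENT"],
})

# Injury_Class is a new column, so the source Type values are preserved.
sample["Injury_Class"] = [
    "Provoked incident" if is_provoked(text) else "Other"
    for text in sample["Injury"]
]
print(sample)
```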


In [52]:
df.head(5)

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,Injury,Fatal Y/N,Time,Species,Source
0,25 Aug 2023,2023.0,Unknown,AUSTRALIA,New South Wales,"Lighthouse Beach, Port Macquarie",Surfing,Toby Begg,M,44,Severe injuries to lower limbs,,10h00,"White shark, 3.8-4.2m","B. Myatt, & M. Michaelson, GSAF"
1,21 Aug-2023,2023.0,Unknown,BAHAMAS,New Providence Isoad,"Saunders Beach, Nassau",,male,M,25,Body found with shark bites. Possible drowning...,,Morning,,"The Tribune, 8/21/2023"
2,07-Jun-2023,2023.0,Unknown,BAHAMAS,Freeport,Shark Junction,Scuba diving,Heidi Ernst,F,73,Calf severely bitten,,13h00,Caribbean rreef shark,"J. Marchand, GSAF"
3,02-Mar-2023,2023.0,Unknown,SEYCHELLES,Praslin Island,,Snorkeling,Arthur …,M,6,Left foot bitten,,Afternoon,Lemon shark,"Midlibre, 3/18/2023"
4,18-Feb-2023,2023.0,Unknown,ARGENTINA,Patagonia,Chubut Province,,Diego Barría,M,32,Death by misadventure,,,,"El Pais, 2/27/2023"


In [58]:
import re

# Function to classify injury severity
def classify_severity(val):
    if pd.isna(val):
        val = ""
    val = val.lower()
    
    if re.search(r'fatal|death|died|deceased', val):
        return "Death"
    if re.search(r'amputation|severed|severely bitten', val):
        return "Severe injury"
    if re.search(r'bitten|bite|laceration|cuts|fracture', val):
        return "Minor injury"
    if re.search(r'no injury|not injured|unharmed', val):
        return "No injury"
    if re.search(r'possible drowning|unknown', val):
        return "Uncertain"
    
    return "Other/Unknown"

# Apply to create Injury_Severity column
df["Injury_Severity"] = df["Injury"].apply(classify_severity)

# Now derive Fatal Y/N based on Injury_Severity
def derive_fatal(severity):
    # Only the "Death" category counts as fatal; every other severity maps to "N"
    return "Y" if severity == "Death" else "N"

df["Fatal Y/N"] = df["Injury_Severity"].apply(derive_fatal)

# Check the first rows
print(df[["Injury", "Injury_Severity", "Fatal Y/N"]].head(20))


                                               Injury Injury_Severity  \
0                      Severe injuries to lower limbs   Other/Unknown   
1   Body found with shark bites. Possible drowning...           Death   
2                                Calf severely bitten   Severe injury   
3                                    Left foot bitten    Minor injury   
4                               Death by misadventure           Death   
5                 Right forearm and left hand injured   Other/Unknown   
6                            Minor cuts to left thigh    Minor injury   
7                                         Disappeared   Other/Unknown   
8            Laceration to arm caused by metal object    Minor injury   
9   No injury to occupants, injury to shark attemp...       No injury   
10                                                NaN   Other/Unknown   
11                                  PROVOKED INCIDENT   Other/Unknown   
12                  Severe injuries to arms and che

In [60]:
df.tail(20)

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,Injury,Fatal Y/N,Time,Species,Source,Injury_Severity
541,27-Jan-1849,1849.0,Unknown,AUSTRALIA,New South Wales,Sydney,boat capsized,William Henry Elliott,M,26,Torso bitten but may have been postmortem,N,,Shark involvement prior to death unconfirmed,"Maitland Mercury, 2/3/1849",Minor injury
542,Reported 17-Jul-1848,1848.0,Unknown,USA,Massachusetts,Cape Cod,,male,M,26,Human remains recovered from 4.9 m shark,N,,Shark involvement prior to death unconfirmed,"New Orleans Crescent, 7/17/1858",Other/Unknown
543,1847,1847.0,Missing / Unknown,USA,South Carolina,"Charleston Harbor, Charleston County",Swimming,a young sailor,M,26,"Disappeared, thought to have been taken by a s...",N,,Shark involvement prior to death unconfirmed,"W.H. Gregg, pp. 21-22",Other/Unknown
544,17-Nov-1839,1839.0,Unknown,AUSTRALIA,New South Wales,,,Mr.Johnson (male),M,26,"""Drowned, 2 days later his head was bitten off...",N,,,"H.D. Baldridge, p.146",Minor injury
545,Ca. 1837,1837.0,Unknown,USA,South Carolina,“Southern Wharf”,,"adult male, a sailor",M,26,7.6 m [25'] shark caught contained human remains,N,,Shark involvement prior to death unconfirmed,W. H. Gregg,Other/Unknown
546,1836.07.26.R,1836.0,Unknown,SPAIN,,,,,,26,"Shark caught, contained human remains",N,,Shark involvement prior to death unconfirmed,"C. Moore, GSAF",Other/Unknown
547,Reported 22- Jan-1831,1831.0,Unknown,AUSTRALIA,Tasmania,Hobart,"Boat capsized, clinging to line",Robert Dudlow,M,26,"Drowned, no shark involvement",N,,Invalid,"C. Black, GSAF; Sydney Gazette, 1/22/1831",Other/Unknown
548,Reported 15-Aug-1826,1826.0,Unknown,ENGLAND,Cumberland,Whitehaven,Bathing,child,,26,FATAL,Y,,,"The Times (London), 8/15/1826",Death
549,Reported 30-Dec-1823,1823.0,Unknown,JAMAICA,,,,male,,26,Human remains found in shark,N,,Shark involvement prior to death unconfirmed,,Other/Unknown
550,Reported 08-Jul-1819,1819.0,No injury,SPAIN,,Cadiz,,male,M,26,No injury / No attack,N,,Invalid,"C. Moore, GSAF",No injury


In [61]:
df['Fatal Y/N'].describe()

count     561
unique      2
top         N
freq      450
Name: Fatal Y/N, dtype: object
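
The `describe()` output above gives 561 rows with `N` occurring 450 times, which leaves 111 rows marked `Y`. The survival percentage used in the business case could be computed as sketched below; the series is rebuilt from those two counts purely for illustration, and on the real data the same call would run on `df['Fatal Y/N']`.

```python
import pandas as pd

# Rebuilt from the counts reported above: 450 "N" and 561 - 450 = 111 "Y".
fatal = pd.Series(["N"] * 450 + ["Y"] * 111, name="Fatal Y/N")

# Share of each outcome as a percentage of all records.
share = fatal.value_counts(normalize=True).mul(100).round(2)
print(share)
```

Because `Fatal Y/N` is derived from keyword matches on the `Injury` text, these percentages depend on how well the keywords capture fatal cases.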

In [66]:
df['Age'].describe().round().astype(int)

count    561
mean      27
std       10
min        1
25%       26
50%       26
75%       26
max       77
Name: Age, dtype: int64

In [68]:
df.columns

Index(['Date', 'Year', 'Type', 'Country', 'State', 'Location', 'Activity',
       'Name', 'Sex', 'Age', 'Injury', 'Fatal Y/N', 'Time', 'Species ',
       'Source', 'Injury_Severity'],
      dtype='object')

In [69]:
df['Type'].describe().round()

count         561
unique          6
top       Unknown
freq          399
Name: Type, dtype: object